A SYLLABLE BASED SEGMENT VOCODER
A THESIS
submitted by
SADHANA CHEVIREDDY
for the award of the degree
of
MASTER OF SCIENCE
(by Research)
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
INDIAN INSTITUTE OF TECHNOLOGY MADRAS
DECEMBER 2008
THESIS CERTIFICATE
This is to certify that the Thesis entitled A Syllable based Segment Vocoder
submitted by Sadhana Chevireddy to the Indian Institute of Technology Madras,
for the award of the degree of Master of Science (by research), is a bonafide record
of the research carried out by her under our supervision and guidance. The con-
tents of this thesis, in full or in parts, have not been submitted to any other
Institute or University for the award of any degree or diploma.
Dr. Hema A. Murthy Dr. C. Chandra Sekhar
Chennai-600036
Date:
To my late grandparents
Sri B. Pulla Reddy and Smt B. Jayamma
for their enduring love that still touches me
ACKNOWLEDGEMENTS
First and foremost, I would like to express my sincere gratitude towards my guides
Prof. Hema A. Murthy and Prof. C. Chandra Sekhar for their constant support
and guidance. Hema Ma’am has always been my role model and whatever progress
I have made, both personally and academically in these three years of stay, I owe it
to her. Working under her provides immense scope to learn, and a few minutes
of discussion with her open up new insights into the problem. Personally, she has
showered her motherly love whenever I was dejected. I cherish all the moments
I spent with her. Thank you Ma’am. I thank Chandra Sekhar sir for providing
his invaluable feedback during the correction of my papers and Thesis. I always
remember the two most important interactions with him - one which made me
come to the lab regularly, sharp at 8 A.M. The other made me write my
Thesis better. I am extremely grateful to you sir, for your remarks always made
me stronger and humbler.
I thank my GTC members, Prof. B. Ravindran and Prof. Andrew Thangaraj
for their valuable suggestions in my GTC and Synopsis meetings. I am thankful to
the Head of the Department, Prof. Timothy A. Gonsalves for his encouragement
and nice words in my M.S Seminar. I also thank him for providing excellent
facilities in the Department.
I thank Sarada Ma’am for her kind assistance whenever required, Natarajan
Sir and Radhai Ma’am for their co-operation in scheduling my GTC meetings.
I am indebted to Prema Ma’am for her love and for presenting me with one of her
beautiful paintings. Most importantly, I would like to thank every person in the
department who helped directly or indirectly in successfully completing my course
work.
I would like to thank my loving friends: Ramya, who has always been there
for me during my difficult times. More than a friend, she has been a wonderful
critic who helped me understand my work better. And Sravanti, with whom I
enjoy chatting, for she is the one who listens to me with utmost concentration and
makes me happy. I would like to thank someone who always says I am correct
(mostly to please me), Deivapalan. I always remember the light hearted moments
shared with him. And Pappu sir, who is right there to extend his help whenever
required.
I would like to thank my friend Abhijit for his valuable suggestions regard-
ing coding. The discussions with him made me understand my work better.
I thank all my friends Deepti, Venu, Ranbir sir, Chaitanya, Vinodh and Ra-
jesh for providing a friendly environment in the lab. I am extremely thankful to my
friends Syam, Deepthy, Bama, Ramya, Sahiti, Radhika, Neetha, Jyothi, Suneetha,
Samaja, Harini and Mrudula for making my stay in campus a memorable one.
On the personal front, I firstly thank Madhu uncle for nurturing the art of
thinking through never ending discussions on whatnot on earth. Ramesh uncle for
his all good words during my tough times. I thank my doctor uncle, Srinath mama
for ensuring that my health is fine during my stay here. And Sujatha atta for her
immense love towards me. I thank all my best friends Shallu, Suni, Summu, Siri
and Navya for their love and affection. Not once did they make me feel that I
live away from them.
I am extremely grateful to my dearest friend Akku who had been everything to
me during my stay on the campus. She is still the one who takes all my nonsense
with a lot of patience.
I thank my loving Chaakha for bothering the Gods on my behalf whenever I am
worried and Chinanna for his love and care towards me.
I am extremely fortunate to have wonderful parents, my strength. Daddy has
been my first guru from whom I learnt the value of education and life. Amma
has been my dearest friend, philosopher and guide who always shows me the best
path when I am confused. I express my sincere gratitude to them.
Finally, I thank the Almighty for showering His immense grace upon me by
placing me among these wonderful people and in the most beautiful place, IIT
Madras...
Sadhana
ABSTRACT
Keywords: Very low bit rate speech coding, Segment vocoders, Syllable segmen-
tation, Unsupervised HMM clustering, MELP residual modeling
In modern voice communication technology, the allocation of the available
bandwidth to an increasing number of users is a challenge. This has paved the way
for the development of very low bit rate speech coders. There has been progressive
research on reducing transmission bit rates without sacrificing
consumer satisfaction. Low bit rate coders aim to achieve good intel-
ligibility at bit rates less than 4.8 Kbps. Segment vocoders are one such class of
very low bit rate coders, which reduce the bit rates to below 2.4 Kbps
while still retaining intelligibility.
In the present work, a syllable based segment vocoder that operates at 1.4 Kbps
is developed. Implementation of a segment vocoder has many stages: segmenta-
tion of speech into predefined units, preparation of the segment codebook and
residual modeling. Many vocoders use phonemes or diphones as segmental units
and employ appropriate automatic segmentation techniques to obtain the units.
Most of the segmentation techniques used are either iterative or use parametric
models built on a large speech corpus. In the present work, a signal process-
ing technique called automatic group delay based segmentation is used to obtain
syllable like segments. These segments are used as the units for compression. Syl-
lable like segments, being larger units than phonemes, offer better compression.
A Hidden Markov Model (HMM) based incremental training algorithm is used
for clustering the segmental units. Unlike most of the clustering techniques, in
the proposed clustering technique, HMM training is unsupervised and does not
require any transcription for building the HMMs. The clusters are later used to
form the system codebook. To preserve the naturalness in the synthesized speech,
the residual is modeled using a Mixed Excitation Linear Prediction (MELP) codec op-
erating at 2.4 Kbps. When the residual is modeled using MELP, an average bit
rate of 1.4 Kbps is achieved for the proposed syllable based segment vocoder. The
synthesized speech quality is compared with that of the MELP codec using the
objective evaluation measure, Perceptual Evaluation of Speech Quality (PESQ).
TABLE OF CONTENTS
ACKNOWLEDGEMENTS
ABSTRACT
LIST OF TABLES
LIST OF FIGURES
ABBREVIATIONS

1 OVERVIEW OF THE THESIS
1.1 Introduction
1.2 Scope of the Thesis
1.3 Organization of the Thesis
1.4 Contributions of the Thesis

2 SPEECH CODERS
2.1 Introduction
2.2 Waveform Coders
2.3 Vocoders
2.4 Segment Vocoders
2.5 Issues in Implementing a Segment Vocoder
2.6 A Review of Segment Vocoders
2.7 Syllable based Segment Vocoder
2.8 Summary

3 SPEECH SEGMENTATION
3.1 Introduction
3.2 Segmentation in Segment Vocoders
3.2.1 Maximum Likelihood (ML) Segmentation
3.2.2 Spectral Transition Measure (STM) based Segmentation
3.3 Syllable Segmentation
3.4 Group Delay based Segmentation
3.5 Syllable Segmentation for the Syllable based Segment Vocoder
3.6 Summary

4 SYSTEM CODEBOOK PREPARATION
4.1 Introduction
4.2 Need for System Compression
4.3 Review of Clustering Algorithms
4.4 Motivation for HMM based syllable segment clustering
4.5 System Codebook Preparation
4.5.1 MFS and MFR Techniques
4.5.2 Clustering of syllable segments
4.5.3 Color palette analogy
4.5.4 Selection of representative syllable segments
4.5.5 Preparation of system codebook
4.6 Summary

5 SYLLABLE BASED SEGMENT VOCODER
5.1 Introduction
5.2 Operation of the Vocoder
5.3 Encoding
5.3.1 Segmentation
5.3.2 System Quantization
5.3.3 Source Encoding
5.4 Decoding
5.4.1 Extraction of system parameters
5.4.2 Source Modeling
5.5 Synthesis of Speech
5.6 Experiments and Results
5.7 Issues in the Implementation
5.7.1 Residual Modeling
5.7.2 Syllable Recognition
5.7.3 Duration Matching
5.8 Summary

6 CONCLUSIONS
6.1 Summary
6.2 Conclusions
6.3 Criticism of the work
6.4 Directions for Future Work

LIST OF PUBLICATIONS
LIST OF TABLES
3.1 Syllable boundaries with single and two-level segmentations
5.1 PESQ scores
5.2 Comparison of PESQ scores for the syllable based segment vocoder and 2.4 Kbps MELP codec for four sentences in the single speaker database
5.3 Comparison of PESQ scores for the syllable based segment vocoder and 2.4 Kbps MELP codec for six sentences in the multi speaker database
5.4 Comparison of PESQ scores for the syllable based segment vocoder and 2.4 Kbps MELP codec for six sentences of Tamil language in the multi speaker, multi lingual database
5.5 Comparison of PESQ scores for the syllable based segment vocoder and 2.4 Kbps MELP codec for six sentences of Hindi language in the multi speaker, multi lingual database
5.6 PESQ scores of an utterance after appropriate frame repetition
LIST OF FIGURES
1.1 Role of a speech codec in mobile phone technology
2.1 Block diagram of a waveform coder
2.2 LPC model for speech production
2.3 Block diagram of a vocoder showing the process in (a) Transmitter and (b) Receiver
2.4 Block diagram of a segment vocoder showing the process in (a) Transmitter and (b) Receiver
2.5 Performance of different coders
3.1 Group delay based segmentation of a speech signal
3.2 Illustration of the two-level group delay based segmentation (a) Waveform of the speech signal for a sentence (b) Result of the first level of segmentation and (c) Result of the second level of segmentation
4.1 Color palette analogy to explain the proposed HMM based technique for segment clustering
4.2 Illustration of selection of a representative syllable segment
4.3 Waveforms and Spectrograms of five syllable segments in a cluster formed using the proposed clustering algorithm. The cluster contains the syllable segments that sound like /va/
4.4 System codebook entries. Here Si denotes the representative syllable for the cluster Ci and ni is the duration of Si
5.1 Block diagram of the syllable based segment vocoder
5.2 Segmentation of a Tamil sentence into syllable like units
5.3 System encoding in the syllable based segment vocoder
5.4 Block diagram of the MELP encoder used for source encoding in the syllable based segment vocoder
5.5 Block diagram of the MELP decoder
5.6 Comparison of the MELP model residual with the original residual
5.7 Waveform of (a) The original speech signal and (b) The clipped signal of the synthesized waveform
5.8 Block diagram of the modified segment vocoder
5.9 Comparison of spectrograms: (a) Original signal and (b) Synthesized signal
5.10 Comparison of the residual for the original speech signal and the residual used in the vocoder: (a) Residual of the original speech signal and (b) Residual of the speech signal passed through the filter from the LP coefficient vectors of the representative syllable segments
5.11 Different combinations of syllables for the utterance /vanakkam/
5.12 PESQ improvement by selecting perceptually close representative syllable segments
5.13 Comparison of the residual for the (a) original speech signal, (b) first best results of the representative syllable segments and (c) perceptually close representative syllable segments from 3-best results
5.14 Comparison of the segment duration of (a) original syllable segment and (b) representative syllable segment
5.15 Analysis of spectrograms for (a) Input syllable segment, (b) Synthesized syllable segment obtained with middle frame repetition, (c) Synthesized syllable segment with appropriate frame repetition and (d) Synthesized syllable segment with appropriate frame repetition but different from the frame repetition used in (c)
5.16 (a) Synthesized speech obtained using the MELP codec and (b) Synthesized speech obtained using the syllable based segment vocoder
ABBREVIATIONS
ADPCM  Adaptive Differential Pulse Code Modulation
CELP   Code Excited Linear Prediction
DP     Dynamic Programming
DPCM   Differential Pulse Code Modulation
DTW    Dynamic Time Warping
GD     Group Delay
HMM    Hidden Markov Model
LP     Linear Prediction
LPC    Linear Predictive Coding
LPCC   Linear Prediction Cepstral Coefficient
LSF    Line Spectral Frequency
MBE    Multiband Excitation
MELP   Mixed Excitation Linear Prediction
MFCC   Mel Frequency Cepstral Coefficient
MFR    Multiple Frame Rate
MFS    Multiple Frame Size
ML     Maximum Likelihood
MLSA   Mel Log Spectrum Approximation
MOS    Mean Opinion Score
PCM    Pulse Code Modulation
PESQ   Perceptual Evaluation of Speech Quality
RELP   Residual Excited Linear Prediction
RMS    Root Mean Square
STE    Short Term Energy
STM    Spectral Transition Measure
TTS    Text-to-Speech
VLSI   Very Large Scale Integration
VQ     Vector Quantization
CHAPTER 1
OVERVIEW OF THE THESIS
1.1 Introduction
The digital revolution has had its impact on every field of science and technology
over the last two decades. The field of communications in general, and the voice
communication technology in particular, has witnessed drastic changes. Advances
in Very Large Scale Integration (VLSI) and sophisticated speech coding algorithms
have enabled the miniaturization of hand held devices that operate with very low
power and communicate at very low transmission bit rates.
Speech coding is a set of techniques used to compress the digital speech data.
Speech is represented in the form of a code and the code is stored or transmitted.
During the playback or reception, the stored code is converted back to the speech
signal. Speech coding algorithms aim at achieving very high compression rates of
speech without an objectionable loss of speech quality. The high compression in
speech reduces the storage space and the transmission rates required for communication.
The software or hardware that implements a speech coding algorithm is called
a speech coder. Speech communication between two points is complete when
a speech signal is compressed and coded at the transmitter, and decoded and
regenerated at the receiver. The coding and decoding operations require a speech
coder at the transmitter and a speech decoder at the receiver. The combination
of a speech coder and speech decoder forms a speech codec. Speech codecs find
applications in all forms of voice communications including cellular telephony,
voice over IP etc. The role of a speech codec in mobile phone technology is shown
in Figure 1.1.
Figure 1.1: Role of a speech codec in mobile phone technology.
The implementation of a speech codec essentially means the implementation
of a speech coder. The operation of the speech decoder depends on the method of
coding employed in the speech coder.
With the number of users increasing day by day, accommodating more and
more users within the available bandwidth is unarguably a big
challenge. Thus the primary goal of any speech coder is to represent the speech
data with the minimum possible number of bits. This decreases the transmission
bit rates and thereby improves bandwidth utilization without compromising
the speech quality at the receiver. Normally, as more compression is achieved,
the distortion in the regenerated speech increases, resulting in a decrease in the
quality of speech. Therefore, there is always a trade-off between speech quality
and bit rate.
Speech coders are broadly classified into two groups: waveform coders and
source coders [1]. The two classes differ in the methods used for speech repre-
sentation while coding. A waveform coder does not use the knowledge of signal
(speech) generation while coding the input signal. A waveform coder tries to
code the signal in such a way that the reproduced signal at the receiver is as
close as possible to the original. Therefore, waveform coders have very high
bit rates and yield high quality speech. Source coders, also called vocoders, use
the knowledge of speech signal generation while coding. Prior to coding, vocoders
represent the speech signal in the form of a model. They try to extract the param-
eters of the model which are then encoded and transmitted. At the receiver, the
vocoder decodes the parameters, rebuilds the model using the decoded parameters
and synthesizes the speech using the model. Generally, the Linear Predictive Coding
(LPC) model is used to realize the vocoders. Low bit rates of less than 8 Kbps are
achieved in these vocoders by compressing the system and source information to
the maximum extent. However, the quality of the synthesized speech signal is poor
when compared to that of the waveform coders. Segment vocoders are the class
of vocoders where the system compression is done at the segment level instead
of at the frame level. Compression of the source is carried out by modeling it with
the minimum possible number of bits. This high compression of system and source
enables the vocoders to encode speech at very low bit rates (less than 2.4 Kbps),
which in turn reduces the speech quality. The focus of research in segment vocoders
is on improving the speech quality at very low bit rates.
Many very low bit rate coders use a segment based approach to coding [2][3][4][5][6]
[7][8]. The first phase in building such a coder is the segmentation task. Here,
the utterances in large speech corpora are automatically segmented into spectrally
stable units. The second phase is the development of codebooks for the segments
obtained in the first phase. Assuming a source-system model for speech produc-
tion, each segment is deconvolved into its source and system components. In LPC
based analysis, the residual corresponds to the source and the LP coefficients cor-
respond to the system [9]. The system and source components of a segment are
separately coded. To prepare the codebook of system parameters, the sequences
of spectral parameter vectors like Mel Frequency Cepstral Coefficients (MFCCs),
Linear Prediction Cepstral Coefficients (LPCCs), and Line Spectral Frequencies
(LSFs) are clustered. The codebook for the source is prepared by quantizing the
parameters such as pitch, gain and voicing. The third and final phase in a segment
vocoder is the encoding of the speech input. The input speech utterance is seg-
mented into variable length segments. Each segment is quantized using the system
codebook. The codebook index that best matches the given segment is transmit-
ted. The residual is encoded using a technique such as LPC10 or MELP [10]. At
the receiver, the speech is synthesized using the sequence of vectors corresponding
to the codebook index received, and the decoded residual.
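To make this encoding phase concrete, the following is a minimal Python sketch of segment-level system quantization. The names (encode_utterance, segment_distance), the feature layout and the crude linear time alignment are illustrative assumptions only; they stand in for the HMM based segment matching actually developed in this thesis.

    import numpy as np

    def segment_distance(segment, template):
        # Crude linear time alignment of the template to the segment length,
        # followed by a mean Euclidean frame distance; a real coder would use
        # DTW or HMM likelihoods instead.
        idx = np.linspace(0, len(template) - 1, num=len(segment)).astype(int)
        return float(np.mean(np.linalg.norm(segment - template[idx], axis=1)))

    def encode_utterance(feature_frames, boundaries, codebook):
        # Quantize each variable-length segment against the system codebook;
        # transmit one (codebook index, duration in frames) pair per segment.
        encoded = []
        for start, end in zip(boundaries[:-1], boundaries[1:]):
            segment = feature_frames[start:end]
            dists = [segment_distance(segment, entry) for entry in codebook]
            encoded.append((int(np.argmin(dists)), end - start))
        return encoded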
1.2 Scope of the Thesis
The two components of speech, source and system, differ in their characteristics.
Hence, the extent of compression that can be achieved for the two components
is different. The source, which is fast varying, is generally quantized frame by
frame. Since the system characteristics vary slowly, quantization can be achieved
over a set of frames or a segment. This facilitates better compression. In segment
vocoders, the savings in bandwidth primarily comes from the choice of the segment;
the larger is the segment, the lower is the bit-rate. The segment should not be
so large that it cannot be modeled properly. Therefore, choosing an appropriate
segmental unit which is reasonably stable and is not too small becomes an issue.
Generally, segments which are acoustically stable within the segment duration
are chosen. For example, phonetic vocoders are built using phonemes [3] as basic
segmental units. Attempts are also made to use diphones and multi-gram [5] units
as basic segments. Automatic segmentation methods like Maximum Likelihood
(ML) segmentation and Spectral Transition Measure (STM) based segmentation
are used to automatically segment a speech utterance into phoneme like units [11].
The above methods assume piecewise stationarity of speech as an acoustic criterion
for determining the segments. The criterion fails when there is a significant ‘intra-
segmental distortion’. Further, the number of segment boundaries is fixed in the
ML segmentation. The phoneme rate can vary quite significantly from one speech
utterance to another. Segmentation is also done by building parametric models
for the segmental units. This method requires the models to be built using a
large database to effectively capture the variations in the segmental units. The
variations in the units increase with an increase in size of the segmental unit. In
the present work, a signal processing technique called Group Delay (GD) based
segmentation which has been successfully used to give syllable like units [12][13],
is applied for the segmentation task. The algorithm does not require any prior
information or prior processing to locate the boundaries. Hence, an attempt is
made in this work to use the syllable as the basic segmental unit for compression. The
unit is also large enough to enable better compression.
The system codebook is prepared by clustering the segmental units obtained
from a large speech corpus. Conventional methods include vector quantization
(VQ) and its modifications [2][14][15][16][17][18] to cluster the segmental units
in vector space. In VQ, the clusters are formed based on the spatial variation
among the features of the segments. Duration information is not modeled in
VQ. Hence, duration and mapping analysis have to be addressed in the clustering
process. When clustering using VQ, the sequence information is completely lost.
Syllables being larger units, the sequence information is crucial. Hidden Markov
Models (HMMs) are also used to model the segment characteristics acoustically
[6][4][5][7][8]. However, most of the methods are supervised and need transcriptions
and enough examples for training HMMs. Therefore, in this work, an acoustic
clustering technique called unsupervised Hidden Markov Model (HMM) based
clustering [19] is used. In this technique, acoustically similar
syllables are grouped into one cluster. Each cluster is identified by an HMM.
A codebook is built using the HMMs of clusters generated. The compression of
the system information is around 100 bps. The source (residual) is coded using the
residual modeling technique implemented in Mixed Excitation Linear Prediction
(MELP) codec [20][21][22]. Encoding the residual using MELP results in a bit rate
of 1.2 Kbps. When the duration information of the syllable segments is also encoded,
a bit rate of 1.4 Kbps is obtained. Residual modeling using the MELP ensures
naturalness in the synthesized speech.
1.3 Organization of the Thesis
Chapter 2 starts with a brief introduction to speech coding. The different speech
coding techniques are discussed and the significance of low bit rate speech coding
is established. Later, segment vocoders which are capable of reducing the bit rates
to less than 2 Kbps are discussed in detail. Different segment vocoders proposed
in the literature are briefly described and the motivation for developing a syllable
based segment vocoder is presented.
Chapter 3 describes the group delay based segmentation technique, which is the
first task in the implementation of the proposed vocoder. Different automatic seg-
mentation algorithms used in segment vocoders are discussed. Then the phoneme
segmentation techniques like ML segmentation and STM based segmentation are
presented. The importance of group delay based segmentation for getting syllable
like units is also discussed.
Chapter 4 explains the preparation of system codebook. An overview of ap-
proaches to system codebook preparation is presented. The proposed technique
for clustering the large corpus of segmental units, an automatic HMM based
unsupervised clustering algorithm, is described, along with the system codebook
preparation.
Chapter 5 presents the proposed syllable based segment vocoder. The process
of encoding the system and source is explained. Since MELP is used to
encode the residual, a brief overview of the MELP codec is also presented. Finally, the
decoding and synthesis of speech are explained. The issues in the implementation
of the proposed coder are also discussed in detail.
1.4 Contributions of the Thesis
The following are the major contributions of the thesis:
1. A novel approach to build a segment vocoder is proposed. In this approach,
syllables are considered as the segmental units for compression.
2. An unsupervised clustering algorithm, which does not require any transcrip-
tions or predefined examples to train HMMs, is used to prepare the system
codebook for the vocoder. A system compression of 100 bps is achieved
using the proposed clustering technique.
3. A syllable based segment vocoder capable of producing natural sounding
speech at 1.4 Kbps, is built using the proposed segmentation and clustering
techniques.
CHAPTER 2
SPEECH CODERS
2.1 Introduction
Speech coders are used to give an efficient digital representation of telephone
bandwidth speech, particularly in narrow band telephone communication. Often,
the speech signal is band-limited to 3.4 KHz and sampled at 8 KHz. The
objective of any speech coder is to represent the digital speech with as few bits as
possible, while producing reconstructed speech that sounds identical or almost
identical to the original speech. However, in practice, there is always a trade-off
between the bit rate and the speech quality. The four main issues to be addressed
in implementing a speech coder are as follows:
1. Speech quality: Speech quality depends on the method used for coding.
Depending on the application at hand, an estimate of the required quality
is decided and the method for coding is chosen accordingly. Generally, the
high bit rate coders provide high quality speech.
2. Bit rate: The coding efficiency of a speech coder is expressed in terms of
bits per second (bps). The bit rate depends on the method used for coding.
3. Communication delay: The time taken to process the input speech for
coding is called the communication delay. Delay is an important issue in
real time implementation of the speech coder.
4. Complexity: The complexity is expressed in terms of memory requirement
and processing capability. A high complexity may lead to a high power
consumption in the hardware. The complexity involved in realizing the
coder should be as low as possible so that it can be incorporated in hardware such as
a cell phone.
In this chapter, different types of speech coders are briefly reviewed. The per-
formance evaluation of coders in terms of speech quality and bit rates is presented.
As the focus of the present research work is not on the implementation of a real-
time coder, emphasis is not given to the delay and complexity factors. Sections 2.2
and 2.3 explain the waveform coders and the vocoders respectively. The need for
low bit rate coders is established in Sections 2.4 and 2.5, thereby paving the way to
the area of interest, the segment vocoders. Section 2.6 gives a brief review of some
of the segment vocoders proposed in the literature. The chapter concludes with
a discussion on the proposed syllable based segment vocoder and its objectives in
Section 2.7.
Speech coders are broadly classified into two groups: waveform coders and
vocoders. Waveform coders operate at high bit rates and yield high quality speech
at the receiver. Vocoders or source coders operate at very low bit rates, but the
quality of speech at the receiver is reduced. Hybrid coders form a sub-category
of coders that use techniques from both source coding and waveform coding to
produce good quality speech at intermediate bit rates. A detailed description of
the coders is presented in the following two sections.
2.2 Waveform Coders
Waveform coders do not use the knowledge of signal generation while coding the
input speech signal. In other words, a waveform coder tries to reproduce the signal
whose waveform (in time domain) or spectrum (in frequency domain) is as close
as possible to that of its original. In the time domain, a waveform coder quantizes
and encodes the speech signal sample by sample. The coder requires only the
information about the sample value at the time of encoding. The complexity of
coding is thus very low. Each sample is coded using at least two bits (in delta
modulation) and hence the bit rates are greater than 16 Kbps. The simplest form
of waveform coding, called Pulse Code Modulation (PCM), samples the speech
signal at 8 KHz and encodes each sample using 16 bits, giving a bit rate of 128 Kbps.
The output speech quality of such a coder is very high and indistinguishable from
the original speech. The only error present between the input speech signal and
the reconstructed speech signal is the quantization error.
Several modifications are made to the basic technique to reduce the bit rates
without affecting the speech quality. The reduction in bit rates is achieved by using
the redundancy in the speech signal. Redundancy in speech enables prediction
of the value of speech signal at a particular time instant based on the values
at preceding time instants. Differential Pulse Code Modulation (DPCM) and
Adaptive DPCM (ADPCM) are the modifications to PCM that use prediction for
coding the speech signal.
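As a toy illustration of this use of prediction (a sketch only; the fixed step size, the first-order predictor and the function names are assumptions, not the actual DPCM or ADPCM specification):

    def dpcm_encode(samples, step=64):
        # Transmit the quantized difference between each sample and the
        # previous reconstructed sample, instead of the sample itself.
        codes, prev = [], 0.0
        for s in samples:
            d = int(round((s - prev) / step))   # quantized prediction error
            codes.append(d)
            prev += d * step                    # reconstruction tracked by the decoder
        return codes

    def dpcm_decode(codes, step=64):
        out, prev = [], 0.0
        for d in codes:
            prev += d * step
            out.append(prev)
        return out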
In summary, waveform coders are low complexity coders that code the speech
signal sample by sample. The coders yield very high bit rates and produce speech
with a high quality. The block diagram of a PCM coder that shows the encoding
and decoding processes involved is shown in Figure 2.1. The input speech signal
is sampled using a sampling circuit. The vertical lines denote the speech samples.
Each sample of speech is fed to a quantizer. The quantized value of the sample
is binary encoded and transmitted as a bit stream. At the receiver, the decoder
extracts the sample information from the bit stream. Samples are regenerated
using the quantized values. A digital-to-analog converter is used to convert the
digital sample stream to an analog speech signal. If the sampling rate of the
sampler is S samples per second and each sample is encoded using m bits, the bit
rate of the coder is mS bits/s or mS bps.
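This relation can be checked against the PCM figures quoted earlier (8 KHz sampling, 16 bits per sample); the following one-line helper is purely illustrative:

    def waveform_coder_bit_rate(samples_per_sec, bits_per_sample):
        # Bit rate of a sample-by-sample waveform coder: m * S bps
        return samples_per_sec * bits_per_sample

    print(waveform_coder_bit_rate(8000, 16))   # 128000 bps = 128 Kbps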
Figure 2.1: Block diagram of a waveform coder
2.3 Vocoders
Vocoders, also called source coders, use the knowledge of the speech signal gener-
ation while coding. Prior to coding, vocoders represent the speech signal as the
output of a physical model. The parameters of the model are extracted, quan-
tized, encoded and transmitted. At the receiver, the decoder extracts the encoded
parameters, rebuilds the model using the decoded parameters and synthesizes the
speech using the model. The model used for coding is based on the human speech
production mechanism.
Human speech production involves a series of excitation pulses (puffs of air)
passing through a filter (vocal tract) to produce the sound. The type of sound
produced depends on the type of excitation as well as the vocal tract shape. The
human speech production system can be modeled using linear prediction as shown
in Figure 2.2. The model represents the digital speech signal as the output of a dig-
ital filter (called LPC filter) whose input (excitation) is either a train of impulses
or a white noise sequence. From the figure, it is seen that speech is synthesized
by sending either a series of impulses or a white noise sequence through a filter.
The type of input to the filter depends on the nature of the sound unit. A voiced
sound is synthesized with an impulse train as input, and an unvoiced sound is
synthesized with noise as input.

Figure 2.2: LPC model for speech production

The LPC model is also called the source-system
model of speech. In this model, the LPC filter represents the system and the
excitation (either train of impulses or white noise sequence) represents the source.
The process of synthesizing speech using the LPC filter and excitation is called
LP synthesis. In vocoders, the system and source information of the input speech
signal is encoded and transmitted. An LP synthesis filter is used to synthesize
the speech at the receiver. The process of extracting the system and source in-
formation from the speech is done using LP analysis. In LP analysis, the speech
input is given to an LPC filter to get residual as the output. The LPC filter coef-
ficients represent the system characteristics of the speech input and the residual,
also called prediction error, represents the source information. Since speech is a
quasi-stationary signal, the system and source characteristics change with time. For
all practical purposes, speech is assumed to be stationary within a duration of
approximately 25 milliseconds, called a frame. The system and source parameters
are extracted over a frame of speech. In vocoders, the system and source parame-
ters obtained from each frame of speech are quantized using predefined codebooks.
The codebooks are prepared separately for system and source. The system infor-
mation for a frame of speech is generally captured by some spectral parameter
vector. The codebook for system is generated using the spectral properties of the
speech signal. Vector Quantization (VQ) is popularly used for codebook genera-
tion. The codebook for system is prepared by clustering the spectral parameter
vectors computed for each frame of speech. Line Spectral Frequencies (LSFs),
LP Cepstral Coefficients (LPCCs) and Mel Frequency Cepstral Coefficients (MFCCs)
are the commonly used spectral parameters for clustering. The centroids of the
clusters formed are considered as codewords. The source information is defined
in terms of parameters like pitch, gain and voicing. Multiple codebooks are pre-
pared for the source, a separate codebook for every parameter.
Simple scalar quantization techniques are used to prepare the codebooks for these
parameters. The process of encoding and decoding the speech using a vocoder
is shown in Figure 2.3. The speech input is fed frame by frame through an LP
analysis filter to get the system and source information in the form of LP coef-
ficient vector and residual respectively. The LPC vector is quantized using the
system codebook. The index corresponding to the codebook entry which matches
with the input spectral parameter vector is encoded. Similarly, for encoding the
residual, parameters like pitch, gain and voicing are quantized using the respec-
tive codebooks. High compression is achieved in vocoders because the system and
source information of the input speech is represented by one of the entries in the
respective codebooks and the indices corresponding to the entries are encoded.
The bit rates of vocoders depend on the codebook sizes. The calculation of the bit
rate for a typical vocoder is shown below:
Number of frames per second = N
Size of the system codebook = Cs
Size of the source codebook = Ce
Number of bits needed to encode the system, ms = ⌈log2(Cs)⌉
Number of bits needed to encode the source, me = ⌈log2(Ce)⌉
Total bit rate = N × (ms + me) bps
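The same arithmetic in code, with purely illustrative codebook sizes:

    import math

    def vocoder_bit_rate(frames_per_sec, system_cb_size, source_cb_size):
        ms = math.ceil(math.log2(system_cb_size))   # bits per system index
        me = math.ceil(math.log2(source_cb_size))   # bits per source index
        return frames_per_sec * (ms + me)

    # e.g. 50 frames/s, a 1024-entry system codebook and a 256-entry source
    # codebook give 50 * (10 + 8) = 900 bps.
    print(vocoder_bit_rate(50, 1024, 256))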
Figure 2.3: Block diagram of a vocoder showing the process in (a) Transmitter and (b) Receiver
The decoding process at the receiver to synthesize speech follows the same series of
operations performed at the transmitter, but in reverse order as shown in Figure
2.3(b). The input bit stream is decoded to find the codebook indices. Using the
indices, the system and source parameters are obtained from the respective code-
books. The excitation signal is modeled using the quantized source parameters
obtained from the source codebook. An LP synthesis filter is formed using the
system information obtained from the system codebook. The excitation signal is
passed through the LP synthesis filter to give the synthesized speech output. The
quality of speech obtained from this type of coding is low. Since the LP synthesis
filter is formed from the quantized parameters, the speech synthesized using this
filter differs significantly from the original speech. The other reason for deteriora-
tion of speech quality in vocoders is the modeled excitation signal. The excitation
signal modeled from source parameters is very different from the residual obtained
from the LP analysis filter at the transmitter. Ideally, the excitation to the LP
synthesis filter should be identical to the residual, for high quality speech output.
The difficulty in modeling the excitation signal is due to the complex nature of the
residual signal. The closer the modeled excitation signal is to the residual, the
higher is the quality of the synthesized speech. The focus of research in vocoders
is to achieve a high quality speech at low bit rates. The emphasis is laid on better
clustering processes to compress the system and source information further, and
efficient modeling of the excitation signal.
There is a sub-class of coders referred to as ‘hybrid coders’ where an attractive
trade-off between waveform coding and vocoding is achieved, both in terms of
speech quality and transmission bit rate. They are also referred to as analysis-
by-synthesis coders. The Code Excited Linear Prediction (CELP) is an example
of a hybrid coder. Hybrid coders attempt to fill the gap between the waveform and
source coders. As described above, waveform coders are capable of providing good
quality speech at bit rates of 16 Kbps, but are of limited use at rates below this.
Vocoders, on the other hand, can provide intelligible speech at 2.4 Kbps and below,
but cannot provide natural sounding speech. Hybrid coders use the same linear
prediction filter model of the vocal tract as found in LPC vocoders. However,
instead of using the pitch and voiced/unvoiced decisions to model the excitation
input to the LP synthesis filter, the excitation signal is chosen by attempting to
match the reconstructed speech waveform as closely as possible to the original
speech waveform as done in waveform coders. Such a process ensures a better
speech quality when compared to the vocoders.
2.4 Segment Vocoders
In segment vocoders, the system compression is performed at the segment level instead
of at the frame level. Typically, a segment is much larger than a frame. The num-
ber of frames in a segment depends on the type of the segmental unit. However,
the source compression in segment vocoders is generally done at the frame level to
account for the rapid variations in the source characteristics. The implementation
of a segment vocoder is carried out in the following four stages.
1. Segmentation:
Segmentation is done on utterances of a huge corpus of speech data to yield
a large number of variable length segments.
2. System codebook and source codebook preparation:
A system codebook is prepared using the large corpus of variable length
segments. Each entry in the codebook corresponds to a sequence of vectors
of Linear Prediction Coefficients (LPCs). A source codebook is also prepared
for the residual parameters.
3. Encoding and transmission:
The input speech utterance is segmented into variable length segments. Sys-
tem quantization is done on each of these segments using a system codebook.
The best-match codebook index is binary coded and transmitted. Some bits
are also allocated for the duration information of the input segment. The
residual is encoded using the source codebook.
4. Speech synthesis at the receiver:
The LPCs corresponding to the transmitted system codebook indices and
the excitation modeled from the quantized source parameters are used for
synthesizing the speech using the LP synthesis filter.
The block diagram that shows the working of a segment vocoder is shown in
Figure 2.4. The input speech is divided into variable length speech units called
‘segments’. For each segment, the system characteristics are quantized using the
system codebook. The residual from the LP analysis filter is encoded using the
source codebook. Since the segments are of variable length, the duration infor-
mation of each segment is also encoded. At the receiver, the decoder decodes
the codebook indices, obtains the system information from the system codebook
to form a sequence of LP coefficient vectors and models the excitation signal us-
ing the decoded source parameters. The excitation signal is passed through the
LP synthesis filter formed using LP coefficient vector sequence from the system
codebook to produce the speech output.
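A rough sketch of this receiver-side synthesis is given below, under the assumption (hypothetical, for illustration) that each system codebook entry stores one LP coefficient vector a = [1, a1, ..., ap] per frame of the stored segment; filter memory across frame boundaries is ignored for brevity:

    import numpy as np
    from scipy.signal import lfilter

    def synthesize_segment(index, system_codebook, excitation, frame_len=200):
        # Pass the modeled excitation, frame by frame, through the all-pole
        # LP synthesis filter 1/A(z) built from the decoded codebook entry.
        frames = []
        for i, a in enumerate(system_codebook[index]):
            e = excitation[i * frame_len:(i + 1) * frame_len]
            frames.append(lfilter([1.0], a, e))
        return np.concatenate(frames)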
Figure 2.4: Block diagram of a segment vocoder showing the process in (a) Transmitter and (b) Receiver
The quality of synthesized speech is drastically reduced in the case of segment
vocoders because the unit of compression in segment vocoders is larger than a
frame. Phonemes and diphones are generally considered as the segmental units.
As speech is a quasi-stationary signal, the spectral parameters vary for every
frame. Units like phonemes have variations within the segment duration. The
method used for system codebook preparation for such units should be able to
capture the variations present within the segments. Hence, the focus of research
in segment vocoders is on the system compression aspect of coding. With a high
compression of system and source information, segment vocoders give synthetic
quality speech at average bit rates much less than 2.4 Kbps. The average bit rate
calculation for a segment vocoder is given below:
Average number of segments per second = M
Frames per second used for source coding = N
System codebook size = Cs
Source codebook size = Ce
Number of bits required to encode the system, ms = ⌈log2(Cs)⌉
Number of bits required to encode the source, me = ⌈log2(Ce)⌉
Average bit rate = M × ms + N × me bps
Since the number of segments in a second of speech varies, the bit rate of segment
vocoders is expressed as the average bit rate.
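In code, with illustrative numbers only (the segment rate, frame rate and codebook sizes below are assumptions, not the values used in this thesis):

    import math

    def segment_vocoder_avg_bit_rate(segments_per_sec, frames_per_sec,
                                     system_cb_size, source_cb_size):
        ms = math.ceil(math.log2(system_cb_size))   # bits per segment index
        me = math.ceil(math.log2(source_cb_size))   # bits per source frame
        return segments_per_sec * ms + frames_per_sec * me

    # e.g. 8 segments/s against a 4096-entry system codebook (12 bits each),
    # plus 50 source frames/s at 8 bits each: 8*12 + 50*8 = 496 bps.
    print(segment_vocoder_avg_bit_rate(8, 50, 4096, 256))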
2.5 Issues in Implementing a Segment Vocoder
The segment vocoders differ mainly in the method used for segmentation to obtain
the segment boundaries and the method used for the system compression and
the system codebook preparation. The residual (source) is coded using similar
techniques to those used in frame-based vocoders. The naturalness of the output
speech depends on the quality of the modeled residual (excitation) at the receiver.
The following are the issues in the implementation of a segment vocoder:
1. Choice of segmental unit:
(a) The segmental unit should be large enough to enable better compres-
sion.
(b) The segmental unit should be small enough to be modeled effectively.
(c) The segmental unit should be obtained automatically.
2. Segmentation
(a) Depending on the segmental unit chosen, segmentation should yield the
boundaries of segmental units in an utterance.
(b) The segmentation algorithm should not be computationally expensive.
3. System codebook
(a) Segmental units have varying spectral characteristics within the seg-
ment duration. The clustering procedure used for codebook prepara-
tion should effectively capture the variations in the segment.
(b) The clustering algorithm should also be able to handle the differences
in the duration of the segments.
(c) The clustering mechanism should allow a high compression to generate
a codebook of small size for bit rate reduction.
Figure 2.5: Performance of different coders (speech quality, from bad to excellent, versus bit rate from 1 to 64 Kbps, for segment vocoders, vocoders, hybrid coders and waveform coders)
Figure 2.5 shows the summary of the performance of different types of coders
[1]. The figure shows the performance of different speech coders in terms of the
speech quality obtained at respective bit rates of operation. It can be seen that the
speech quality increases with an increase in the bit rate. The vocoders operating at
very low bit rates have a poor speech quality. Further, it is also seen that segment
vocoders are poorer in quality. The waveforms generated using PCM, ADPCM,
MELP and the proposed segment vocoder are incorporated under the directory
named ‘Chapter-2’ in the CD-ROM attached with the Thesis. The next section
presents a review of some approaches to build the segment vocoders. The review
gives a brief overview of the methods used for the segment vocoder implementation.
The performance of the coders is discussed in terms of the output speech quality
and average bit rates of operation. The detailed review of segmentation and
clustering techniques used in the segment vocoders is presented later in the
respective chapters.
2.6 A Review of Segment Vocoders
Roucos, et al., [2] have proposed an early segment vocoder that operates at
150 bps. Spectrally steady state regions of speech are chosen as segmental units.
An automatic segmentation method based on a heuristic algorithm is used to
search for the segment boundaries. The system codebook is generated using a binary
clustering technique based on K-means clustering. For the single speaker mode, a
bit rate of about 150 bps is obtained. Of this, 110 bps are used for system represen-
tation and 44 bps are used for encoding source parameters like pitch and gain for
each segment. The output speech signal is evaluated as intelligible but synthetic.
It is also observed that there are intonation changes in the output speech.
The poor speech quality of the above mentioned segment vocoder is avoided
by using template waveforms to synthesize the speech at the receiver [15]. The
encoding part of the waveform segment vocoder is similar to the earlier segment
vocoder. A system compression of 170 bps has been achieved. The information
regarding pitch, gain and duration of the input segment is encoded using another
130 bps to obtain an overall bit rate of 300 bps. During synthesis, the waveform
segment corresponding to the spectral parameter vector, called template wave-
form is considered instead of the matched spectral parameter vector from the
codebook. The template waveform segment is modified to match the correspond-
ing input parameters like pitch, gain and duration. The synthesis is performed
using the Residual Excited Linear Prediction (RELP) filter whose inputs are the
LP coefficient vector sequence and the residual of the modified template waveform.
The vocoder when implemented for speaker independent case produced an output
speech quality better than the earlier vocoder with reduced buzziness. However,
the speech is chopped due to the discontinuities at the segment boundaries.
Shiraki, et al., [14] have developed a variable length segment vocoder based
on joint segmentation and quantization. The coding problem is to search for
the segment boundaries and the sequence of code segments so as to minimize
the spectral distortion measure for the given interval. The problem is solved
by a dynamic programming algorithm. An iterative algorithm for designing the
variable length segment quantizer is also proposed. When pitch information is also
coded, the coder resulted in bit rates of around 150 bps for a single male speaker
database. It has been shown that the coder offers sufficient intelligibility at such
low rates.
Bardenhagen, et al., [23] have used the fenone as the unit of segmentation. The quasi-
stationary regions in speech are termed as fenones. A system codebook is prepared
using 3-way split VQ technique over spectral envelopes. A harmonic speech model
is used to realize the coder. The vocal tract is represented by the spectral envelope.
In harmonic speech model, the spectral envelope is represented by the harmonic
magnitudes. The glottal excitation is modeled by a mixture of voiced and unvoiced
components. The voiced component is the impulse train with the period modified
based on the pitch lag. The unvoiced component is modeled using white noise.
For a speaker independent case, the coder achieved good intelligibility and quality
at bit rates around 1.8 Kbps.
Siva Kumar, et al., [24] have developed a segment based MBE speech coder
at 1000 bps. A Multiband Excitation (MBE) speech model that provides natural
sounding speech and robustness to acoustic background noise is considered for very
low bit rate coding based on speech segmentation. An ML segmentation algorithm
is used to obtain the segment boundaries. The system codebook is prepared by
clustering LSF vector sequences of the segments using split-vector quantization.
When pitch, gain and voicing are coded, a bit rate of 1000 bps is achieved. Unlike
other segment coders whose bit rate varies according to the segment rate, this
vocoder has a fixed bit rate. The average PESQ score for the vocoder is evaluated
to be 2.69 for the TIMIT database.
Apart from the vector quantization techniques, a new recognition/synthesis
paradigm has evolved in the implementation of segment vocoders where the system
codebook is prepared using the parametric models like Hidden Markov Models
(HMMs). The system codebook contains the parametric models built for the
segmental units.
Picone, et al., [3] have proposed a phonetic vocoder with phonemes as seg-
mental units. The system compression of less than 200 bps is achieved. The
codebook is built using the HMMs trained for phonemes. The models are trained
for each phoneme using contextually rich examples and transcriptions from the
TIMIT database. The trained models are also used to obtain the segment bound-
aries. Source information such as pitch and energy are also coded to retain the
naturalness in the output speech. However, the output speech is reported to have
synthetic quality.
Cernocky, et al., [5] developed a segment vocoder with units obtained from
an automatic segmentation algorithm. Segmentation uses an iterative algorithm
called temporal decomposition that gives the unigram, bigram and trigram units.
The units are modeled using the HMMs in the unsupervised mode. In the single
speaker mode, output speech is reported to have intelligibility at 211 bps. The
system compression is done at 120 bps. Source parameters are not modeled except
for energy which is varied according to the energy of the input segment. The
output quality is observed to be good for short words or digits. The intelligibility
is reduced as the length of the input segments is increased. Good intelligibility is
obtained for segments whose length is close to that of phonemes.
Felici, et al., [25] proposed a diphone based segment vocoder for Italian lan-
guage which operates at bit rates of 300 bps. The codebooks are prepared for
each speaker separately. Diphone segmentation is carried out using a neural network
based diphone recognizer. The codebook of diphone segments is prepared in a
supervised mode. For each diphone, the codebook contains examples featuring
energy and pitch close to the average values of that class. No particular clustering
mechanism is used for codebook preparation. The grouping of diphone examples
is done until the segment distortion measure is below a particular threshold. The
source information like pitch and energy are coded separately. The speech quality
is assessed using the Mean Opinion Score (MOS), which is around 2.
Ismail, et al., [7] proposed a segment vocoder where a continuous speech rec-
ognizer is used to transcribe the incoming speech as a sequence of subword units
termed as acoustic segments. Prosodic information is combined with the segment
identity to form a serial data stream suitable for transmission. A rule-based system
maps the segment identity and the prosodic information to parameters suitable for
driving a parallel formant speech synthesizer. Acoustic segment HMMs are used
to build a recognizer. A segment error rate of 3.8% was achieved in a speaker-
dependent and task-dependent configuration. An average data rate of 262 bps was
obtained.
Tokuda, et al., [6] proposed a vocoder using HMM based synthesis. The
basic segmental unit chosen is the phoneme. The system codebook is a phoneme
recognizer built using a large phoneme database. At the receiver, instead of
obtaining the spectral parameter vector from the codebook, it is generated using
the HMM corresponding to the codebook index. Synthetic quality speech is
obtained using a Mel Log Spectrum Approximation (MLSA) filter formed from
the generated spectral parameter vectors. The coder operates for a single speaker
at a bit rate of 150 bps. Hoshiya, et al., [8] improved the performance of the above
vocoder by transmitting extra information regarding formants and state duration.
This coder is again tested in a single speaker mode.
Low bit rate coders are also developed based on Text-to-Speech (TTS) synthe-
sis scheme. In a TTS system, a large database of waveform units is prepared based
on the text information. Synthesis is done by concatenating the waveform units
corresponding to the input text. Prosody modifications are later incorporated on
the synthesized speech to make the speech sound natural. In a TTS based coder,
this large database of waveform units can be seen as a codebook. The earliest
work in this direction is by Benbassat, et al., [26], where a text message and spo-
ken utterance are jointly used to provide a TTS input stream. A small number
of prototypes for pitch patterns and duration patterns are also included in the
codebook for prosody coding. The codebooks are prepared using a single male
speaker database. The bit rates achieved are around 150 bps. The output speech
quality is reported to be intelligible. A similar kind of work is reported by Vepyek,
et al., [27], where a TTS system is used to generate synthetic speech from text and
voice conversion is then used to transform the voice characteristics, including
the speaking style and emotion. The coder is reported to operate at 300 bps. The
two coders mentioned above require a transcription of the input speech sentence
for coding. The transcription is carried out at the phoneme level. To improve on
this, a speech coder based on automatic speech recognition and TTS synthesis is
developed. Chen, et al., [28] used the HMM based phoneme recognition and Pitch
Synchronous Overlap Addition (PSOLA) based TTS for the implementation of
speech coder at 750 bps. The individual segments recognized by the HMMs are
quantized using a phonetic inventory. The output speech quality is reported to
have a MOS score of 3.0. Later, Lee, et al., [4] developed a unit selection based
TTS coder operating at 1000 bps on a single speaker database. A large TTS labeled
database is used as the codebook. The codebook contains the identities of
phonemes, their durations and their pitch contours. The synthesis performance is
improved using an acoustic target cost function and a concatenation cost. The
coder is reported to have speech quality close to that of a conventional MELP
coder operating at 2.4 Kbps.
A modified version of the coder [29] is also proposed where the input speech signal
is segmented using a joint segmentation/classification scheme. The segments are
coded and synthesized using the TTS based coding technique described in [4]. In
single speaker tests, the coder produced intelligible and natural sounding speech at
an average bit rate of about 580 bps.
It may be noted that almost all the above segment vocoders are implemented
for the single speaker case. The segmentation techniques are designed to obtain
steady state regions in speech, which eventually turn out to be phone like units.
The automatic segmentation techniques use high complexity algorithms which are
mostly iterative in nature. Moreover, the segmentation methods are not capable
of generating segments which are larger than phonemes. For example, in [5],
multigram units that are larger than phonemes are considered, but it is observed
that intelligibility is preserved only when the units are close to phonemes.
Segmentation is also accomplished using parametric models trained over the
segmental units. But this requires the models to be prepared using a sufficient
number of examples. The models should also be able to capture the prosody
variations in the same segmental unit. If HMMs are used for obtaining segmental
units (as in continuous speech recognition), a large amount of data is needed to
train the models properly. Since the segmental units have prosody variations, the
HMMs should be trained over a wide range of prosody variations within a segment.
Only then will the models be able to recognize the segment properly in the input
speech utterance. In the coders which use joint segmentation and recognition
(while encoding), the models are built in a highly supervised manner. In [5],
unsupervised clustering is used for modeling HMMs, but prior to that, vector
quantization is performed on the automatically generated segmental units to group
similar units into one cluster. This ensures that each HMM is trained with a
sufficient number of examples of the same segmental unit. The models may fail
to recognize the boundaries when the prosody variations in the segments are not
captured by the HMMs during training. Moreover, it is observed that in all the
vocoders, more emphasis was laid on system compression than on the modeling
of source characteristics. Most of the system compression techniques involve
clustering under supervised conditions. In the coders where the source
characteristics are modeled well, the output speech preserves the naturalness.
Attempts are being made to make the synthesized speech as natural as possible
by coding the pitch, gain and fundamental frequency (F0) characteristics for each
segment. When source characteristics are modeled well, the bit rates of the
vocoders are comparatively higher than in the coders without sophisticated source
modeling. It is also observed that when the units of compression are larger, the
source characteristics have to be modeled well to preserve the quality of the
output speech.
2.7 Syllable based Segment Vocoder
Based on the observations made, the objective of the present work is to develop a
vocoder which
• provides a simple and flexible automatic segmentation algorithm that is ca-
pable of obtaining segmental units without any prior information.
• develops a system codebook without supervision and that does not require
any transcription or knowledge about the segmental units for clustering.
• works in a speaker independent environment and produces natural sounding
speech.
With these objectives, in the proposed segment vocoder, a much larger unit,
the syllable, is chosen as the segmental unit. A simple and flexible signal processing
based algorithm, the group delay based segmentation algorithm proposed earlier
for obtaining syllable boundaries [30], is used. Since the chosen segmental unit is
large, two similar segments have significant prosody variations. Therefore,
modeling the parameters of such a large segment is an issue. Vector quantization
techniques that are popularly used for clustering may not be suitable, because
they cannot capture the sequence information which is crucial in the case of larger
segments like syllables. Clustering is also difficult due to the large duration
mismatch among similar units. HMM based parametric modeling is one alternative
that can be used to model the syllable like units. Parametric modeling using
HMMs for syllable like segments has been successfully used in Automatic Speech
Recognition (ASR) and TTS based synthesis systems [30][31][32][33][34]. In the
present work, the emphasis is laid on the system compression aspect. The system
is compressed to less than 100 bps using an unsupervised HMM based clustering
algorithm [19]. For source compression, MELP is used, as it preserves the
necessary source characteristics for better quality of the synthesized speech. In
[35][36], the importance of good residual modeling for natural quality speech is
discussed. It is argued that for good quality speech, the pitch, gain, voicing and
jitter are to be modeled well. MELP generates a mixed excitation signal which
is close to the residual of the input signal. Also, in order to verify that the
proposed segmentation and clustering techniques work, a well modeled residual
is needed. Hence, in this work the residual is modeled using MELP. This
results in a bit rate of 1.4 Kbps.
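As a rough illustration of the system-side bit budget (the syllable rate assumed
below is illustrative and is not a figure quoted in this thesis), a codebook of C
entries needs ceil(log2 C) bits per transmitted segment index. For the largest
codebook built in Chapter 4 (C = 2980),

    bits per index = ceil(log2 2980) = 12
    at an assumed 4 to 6 syllables per second: 12 x 4 = 48 bps to 12 x 6 = 72 bps

which is consistent with a system bit rate below 100 bps; the remainder of the
1.4 Kbps budget is taken up by the MELP coded source parameters and the
duration information.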
2.8 Summary
In this chapter, an introduction to speech coding methods is given. Different
coding schemes are discussed in detail. High bit rate speech coders yield very
good speech quality and are less complex. Vocoders, which revolutionized present
day communication, operate at much lower bit rates while still preserving the
intelligibility of speech. Segment vocoders form a class of vocoders in which
compression is done at the segment level, thereby reducing the bit rates further.
Due to the high compression, segment vocoders produce synthetic quality speech.
The implementation of segment vocoders raises issues regarding the automatic
segmentation and clustering methods. The need for a segment vocoder that uses
a flexible segmentation algorithm and an unsupervised clustering algorithm for
system codebook preparation is instrumental in the formulation of the proposed
approach to build a segment vocoder.
CHAPTER 3
SPEECH SEGMENTATION
3.1 Introduction
Segmentation is the process of dividing a speech utterance into small predefined
units. Any segment vocoder operation starts with the segmentation task. Seg-
mentation is used twice in the implementation of a segment vocoder. In preparing
the system codebook, a large corpus of speech data is used to obtain the segments.
Once the codebook is prepared, it is fixed for the vocoder. During the encoding
operation of the vocoder, the speech utterance is segmented into predefined units.
In this chapter, the importance of automatic segmentation and an overview of
the earlier approaches to segmentation are presented in Section 3.2. The need for
choosing the syllable as the segmental unit and the proposed automatic
segmentation algorithm, called group delay based segmentation, are explained in
Sections 3.3 and 3.4 respectively. The chapter concludes with Section 3.5,
presenting the results obtained using the segmentation algorithm.
3.2 Segmentation in Segment Vocoders
In segment vocoders, the initial step towards encoding the speech sentence is to
divide it into segments. Hence, the development of an automatic segmentation
algorithm is necessary [11][37][38][39]. As discussed in Section 2.5, the segment
vocoders differ in the type of segmentation method used. The method of segmen-
tation depends on the type of segmental unit chosen. Prior to the development
of segment vocoders, compression of speech was done at frame level where each
frame is typically of 25 millisecond duration. Going beyond the frame level, a clear
definition of the segment is necessary to obtain segment boundaries. Phonemes
and diphones are the most common segments chosen for compression. Depend-
ing on the segmental unit chosen, several techniques for segmentation have been
proposed in the literature.
In [3][6][7][8], vocoders are built with the phone as the segmental unit. Here, the
TIMIT database, in which transcriptions are provided at the phone level, is used.
Using these transcriptions, HMMs are trained on the entire database and are then
used to recognize the phone sequence, as in continuous speech recognition.
Cernocky, et al., [5] found the segmental units automatically using an iterative
technique called temporal decomposition, together with vector quantization. The
iterative technique resulted in units which are essentially phonetic in nature. The
transcriptions are prepared using the boundary information obtained.
Felici, et al., [25] have proposed a neural network based segmentation algorithm
to get the diphone boundaries. A neural network is trained with 30 examples of
each diphone. The input speech is decoded against the neural network model
to track the boundaries. Here again, the neural network model is trained in a
supervised manner.
The above segment vocoders use parametric models to obtain the segment
boundary information; segmentation and segment quantization are performed
in a single step. Vocoders have also been developed wherein segmentation and
segment quantization are done in two different steps. In such vocoders, automatic
segmentation is performed on the input speech to get the desired segments, and
the obtained segments are later used for codebook design and segment
quantization.
Bardenhagen, et al., [23] introduced the fenone as a segmental unit, which is
defined as an acoustically homogeneous unit of speech. Quasi-stationary regions
of speech are represented by fenones. It is shown that fenones can capture fine
discriminatory differences in speech. An automatic segmentation algorithm uses
the frame-based correlation measure to group the frames based on their quasi-
stationary acoustic characteristics.
Roucos, et al., [2][16] proposed an early segment vocoder where the segmental
unit is a region between the middle points of two consecutive steady state regions.
The segmentation algorithm is a heuristic algorithm that uses a set of thresholds
on two spectral derivatives to determine the spectral steady state regions in the
input speech.
Ramasubramanian, et al., [11] discuss different automatic segmentation tech-
niques to derive phone and diphone boundaries. In their work, they used two
automatic segmentation techniques - Maximum Likelihood (ML) segmentation
and Spectral Transition Measure (STM) based segmentation. It is shown that the
phone like units from the ML segmentation outperformed the diphone like units
obtained using the STM based segmentation. These two methods are popularly
used to obtain the segment boundaries automatically. A brief overview of the two
techniques is presented below.
3.2.1 Maximum Likelihood (ML) Segmentation
The ML segmentation assumes piecewise stationarity of speech as the criterion
to obtain segments which are acoustically homogeneous within their boundaries.
Homogeneity here refers to the similarity in spectral properties of the speech signal.
Speech is a quasi-stationary signal. For all practical purposes, speech is assumed
to be stationary for a duration of about 25 milliseconds, which is often termed
a ‘frame’. In other words, it is assumed that the spectral properties of speech
do not vary significantly within a frame. Similarly, the ML segmentation checks
for homogeneity within the segment, i.e., the absence of significant variation in
the spectral properties over the segment duration. Since a segment is made up of
more than one
frame, there will be spectral differences from one frame to the other. The ML
segmentation looks for segments where the overall spectral distortion within the
segment is small. The spectral distortion is measured in terms of ‘intra segmental
distortion’, given by the sum of distances from the frames that span the segment,
to the centroid of the frames comprising the segment. The distance is computed
between the feature vectors that represent the spectral properties of each frame.
The ML segmentation algorithm [11] is as follows:
1. The speech utterance of T frames is represented by a sequence of T vectors
X = (x_1, x_2, ..., x_T), where x_n is a p-dimensional parameter vector for the
n-th frame.
2. The aim of the ML segmentation algorithm is to find m consecutive segments
in the observation sequence X. Here, m is predefined based on the desired
segment rate.
3. Initially, some m boundaries are marked arbitrarily. The boundary sequence
is denoted by B = (b_0, b_1, ..., b_m), where b_m refers to the boundary at the
T-th frame.
4. The distortion measure is defined in terms of a distance measure between a
frame in the segment and the centroid of the frames comprising the segment.
For the Euclidean distance measure, the centroid is the average of all the
frames in the segment. The total distortion measure is computed for all the
frames within the segment.
5. The solution is obtained by dynamic programming and the boundaries are
recovered by backtracking after the forward pass, as in Viterbi decoding. The
resulting sequence is termed the optimal boundary sequence. The intra segmental
distortion for a typical boundary sequence is given by
D(B) = \sum_{i=1}^{m} \sum_{n=b_{i-1}+1}^{b_i} d(x_n, \mu_i)                (3.1)
where D(B) is the total distortion of an m-segment segmentation of X, µ_i is
the centroid of the i-th segment, and d(x_n, µ_i) is the distortion measure
between the n-th frame x_n and the centroid µ_i of the i-th segment.
6. The optimal segment boundaries are obtained using a dynamic programming
algorithm given in [38].
The ML segmentation results in the phoneme boundaries when m is fixed to
the phone rate of speech. However, sometimes the resulting segments are as short
as one frame or as long as one word. This makes modeling the segments a difficult
task. In order to control the duration of segments formed, a duration constraint
is also added to the ML segmentation algorithm.
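The dynamic programming solution in step 5 can be sketched as follows. This is
a minimal illustration, not the implementation of [38]; it assumes a squared
Euclidean distance, for which the distortion-minimizing centroid is the segment
mean.

import numpy as np

def ml_segmentation(X, m):
    """Split frame sequence X (T x p) into m contiguous segments so that the
    total intra-segment distortion of Eq. (3.1) is minimised (a sketch using
    squared Euclidean distance, for which the centroid is the segment mean)."""
    T = len(X)
    csum = np.vstack([np.zeros(X.shape[1]), np.cumsum(X, axis=0)])
    csq = np.concatenate([[0.0], np.cumsum((X ** 2).sum(axis=1))])

    def seg_cost(a, b):          # distortion of a segment over frames a..b-1
        n = b - a
        mu = (csum[b] - csum[a]) / n
        # sum ||x - mu||^2 = sum ||x||^2 - n * ||mu||^2
        return (csq[b] - csq[a]) - n * float(mu @ mu)

    D = np.full((m + 1, T + 1), np.inf)   # D[i, t]: best cost of i segments over first t frames
    back = np.zeros((m + 1, T + 1), dtype=int)
    D[0, 0] = 0.0
    for i in range(1, m + 1):
        for t in range(i, T + 1):
            for b in range(i - 1, t):     # position of boundary b_{i-1}
                c = D[i - 1, b] + seg_cost(b, t)
                if c < D[i, t]:
                    D[i, t], back[i, t] = c, b
    bounds, t = [T], T                    # backtrack from b_m = T down to b_0 = 0
    for i in range(m, 0, -1):
        t = int(back[i, t])
        bounds.append(t)
    return bounds[::-1]                   # the boundary sequence (b_0, ..., b_m)

The O(m T^2) cost of this search, together with the need to fix m in advance,
illustrates why the ML approach is described as computationally intensive later
in this chapter.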
3.2.2 Spectral Transition Measure (STM) based Segmen-
tation
The STM based segmentation is the earliest method used for speech segmentation.
The segmentation is based on the principle of measuring the spectral deviation at
every frame instant. For phone like units, boundary is detected when there is a
large spectral deviation between two frames. The STM segmentation gives both
phone and diphone boundaries. Many algorithms for STM based segmentation
are available. One of these algorithms is described here [11].
1. The STM is defined as follows. Let x_n be the parameter vector for the n-th
frame. The STM at frame n is given by d_i(n) = ||x_n - x_{n-i}||^2.
2. d_1(n), as a function of n, is the distance between the spectral parameter
vector for frame n and that of its preceding frame n - 1.
3. d_3(n) gives a smoothed measure of the spectral derivative.
4. The plot of the spectral derivative has peaks at the places where the spectral
transitions are significant and valleys near the steady state regions.
5. Successive peaks of d_1(n) or d_3(n) locate the phone boundaries, and
successive valleys mark the diphone like units.
An extrema picking algorithm [39] is used to pick the peaks and valleys
effectively. However, the algorithm uses a threshold δ to detect the peaks, which
must be optimized for a desired segment rate.
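A compact sketch of the STM computation and boundary picking follows. The
smoothing span i = 3 and the use of scipy's generic peak picker in place of the
extrema picking algorithm of [39] are illustrative choices, not the method of the
cited work.

import numpy as np
from scipy.signal import find_peaks

def stm_boundaries(X, i=3, delta=None):
    """STM segmentation sketch: d_i(n) = ||x_n - x_{n-i}||^2.
    X is a (T x p) array of frame-level feature vectors; `delta` is the
    peak-height threshold that must be tuned for the desired segment rate."""
    d = np.sum((X[i:] - X[:-i]) ** 2, axis=1)  # d_i(n), defined for n >= i
    peaks, _ = find_peaks(d, height=delta)     # large transitions: phone boundaries
    valleys, _ = find_peaks(-d)                # steady regions: diphone-like units
    return peaks + i, valleys + i              # shift back to frame indices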
3.3 Syllable Segmentation
The review of segmentation techniques in the previous section suggests that the
techniques for segmentation can be broadly classified into two categories. In the
first category, parametric models are built for each segmental unit. This requires
that the models be built using a large training dataset. The second category uses
automatic segmentation algorithms that try to find the regions of interest in
speech based on some rules. The algorithms used are either iterative in nature,
as in ML segmentation, or use predefined thresholds, as in STM segmentation,
that have to be optimized to obtain correct boundaries. Further, some parameters
in the algorithms are assigned values based on the segment rate, and these values
need to be optimized for different segment rates. In two-pass ML segmentation,
the segment rate is fixed to get boundaries close to phoneme boundaries. But the
phoneme rate can vary quite significantly from one speech utterance to another.
Also, the techniques are computationally intensive. So, there is a need to develop
a simple and flexible algorithm capable of giving the desired segments.
In the proposed segment vocoder, the syllable is chosen as the segmental unit.
The primary reason for choosing the syllable is that a signal processing algorithm
called group delay based segmentation can be used to get the boundaries. The
syllable is a larger unit than the phoneme or diphone, and hence enables more
compression. Also, it has been suggested that the syllable has better
representational and durational stability relative to the phoneme [40]. Systems
for Automatic Speech Recognition (ASR) and Text to Speech (TTS) synthesis
that use the syllable as a basic unit have been developed successfully for Indian
languages, which are syllable timed in nature [30][31][32][33][34].
Before going into the working of the proposed segmentation algorithm for
obtaining syllable boundaries, it is necessary to study the nature of a syllable
segment. A syllable consists of three major parts - onset, nucleus and coda.
Generally, the nucleus is a vowel and the other two are consonants. Acoustically,
a syllable displays a high energy region corresponding to the vowel, surrounded
by low energy regions corresponding to the consonants. This property forms the
basis for the boundary demarcation of a syllable in a given speech utterance.
3.4 Group Delay based Segmentation
Since an automatic segmentation algorithm is important in the implementation of
a segment vocoder, a less complex algorithm is needed. The method for segment-
ing speech into syllable like units uses the short term energy (STE) function of the
speech signal. But the local fluctuations in the consonant regions results in spu-
rious boundaries. It is shown in [41] that if the signal is a minimum phase signal,
the group delay computed on the spectrum will resolve the peaks and valleys well.
The peaks and valleys in the group delay function of the signal will now correspond
to the peaks and valleys in the short term energy function. If we consider a speech
utterance as made up of syllables, the valleys in the energy spectrum correspond
36
to the syllable boundaries. If group delay is computed on such a signal, the valleys
of the group delay function correspond to the syllable boundaries. To resolve the
boundaries better, the inverted group delay is computed where the peaks mark
the syllable boundaries. The group delay based method for segmentation [30] is
as follows:
1. Let s(n) be the speech signal.
2. The short term energy (STE) function, E(n), is calculated over overlapping
windows of s(n).
3. This energy function is viewed as an arbitrary magnitude spectrum E(K).
4. Since the magnitude spectrum of any real signal has even symmetry, E(K) is
symmetrized along the Y-axis. The result can now be treated as the magnitude
spectrum E(K) of a real signal.
5. E(K) is inverted to get E_i(K). This step reduces the dynamic range and
thus prevents large excursions near the peaks.
6. The IDFT of the sequence E_i(K) is computed. The resultant signal e_i(n) is
called the root cepstrum. The causal portion of this signal, which has the
properties of a minimum phase signal, is used to compute the group delay
function.
7. The group delay is computed using a window of length N_c, called the
cepstral lifter window. Let the group delay function be E_gd(K).
8. The locations of the positive peaks of E_gd(K) give the approximate syllable
boundaries.
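The steps above can be sketched in code as follows. This is an illustrative
reconstruction, not the exact implementation of [30]; the window lengths, the
lifter length N_c and the peak picker are assumed values.

import numpy as np
from scipy.signal import find_peaks

def gd_syllable_boundaries(s, fs, win_ms=20, hop_ms=10, Nc=50):
    """Sketch of group delay based segmentation. Returns approximate syllable
    boundary times in seconds. win_ms, hop_ms and Nc are illustrative values."""
    N, H = int(fs * win_ms / 1000), int(fs * hop_ms / 1000)
    # Steps 1-2: short term energy over overlapping windows
    E = np.array([np.sum(s[i:i + N] ** 2) for i in range(0, len(s) - N, H)])
    T = len(E)
    # Steps 3-4: view E as a magnitude spectrum and enforce even symmetry
    Es = np.concatenate([E, E[::-1]])
    # Step 5: invert to limit the dynamic range near energy peaks
    Ei = 1.0 / (Es + 1e-12)
    # Step 6: IDFT gives the root cepstrum; keep the causal (minimum phase) part
    ei = np.real(np.fft.ifft(Ei))
    x = ei[:Nc]                               # cepstral lifter window N_c
    # Step 7: group delay via the standard identity
    # tau(K) = Re{X(K) conj(Y(K))} / |X(K)|^2, with y(n) = n x(n)
    nfft = len(Es)
    X = np.fft.fft(x, nfft)
    Y = np.fft.fft(np.arange(Nc) * x, nfft)
    Egd = np.real(X * np.conj(Y)) / (np.abs(X) ** 2 + 1e-12)
    # Step 8: positive peaks of E_gd(K) over the original frame axis
    peaks, _ = find_peaks(Egd[:T])
    return peaks * hop_ms / 1000.0            # frame index -> time in seconds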
The block diagram of the group delay based segmentation method is given
in Figure 3.1. Though the group delay segmentation on speech is successful in
locating many syllable boundaries, 30% of boundaries are missed [42]. It is also
Figure 3.1: Group delay based segmentation of a speech signal
shown that the group delay based segmentation [12][43] on shorter segments of
speech resulted in more accurate boundaries. Therefore, in order to retrieve the
boundaries missed in the single level segmentation, a duration criterion is employed
to find the segments of speech where the boundaries might be missing. After the
first level of segmentation, segments that are larger than the expected duration
for a syllable are segmented again [32]. It is found that the two-level segmentation
improves the segmentation performance by 3-5% [32]. For completeness, Table 3.1,
which shows the performance of single level and two-level group delay based
segmentation, is reproduced from [32]. The two-level group delay based
segmentation of a speech utterance is illustrated in Figure 3.2.
Table 3.1: Syllable boundaries with single and two-level segmentations

Speaker type        | Tamil Female     | Tamil Male       | Hindi Female
Segmentation type   | 1-level  2-level | 1-level  2-level | 1-level  2-level
Syllable count      | 14268    14787   | 6931     7585    | 7688     7865
Insertions          | 352      394     | 66       165     | 83       52
Deletions           | 1464     987     | 1390     835     | 1450     1275
Performance         | 88.19%   91.02%  | 82.36%   87.89%  | 83.99%   85.18%
Figure 3.2: Illustration of the two-level group delay based segmentation (a) Wave-
form of the speech signal for a sentence (b) Result of the first level of
segmentation and (c) Result of the second level of segmentation
Figure 3.2(a) shows a Tamil sentence with the corresponding syllable tran-
scription on top of it. Figure 3.2(b) shows the first level of segmentation where
the vertical lines represent the boundaries. Figure 3.2(c) shows the second level
of segmentation of three words /mudala/, /urAlyA/ and /tinnAr/ for which the
boundaries are missed in the first level of segmentation.
3.5 Syllable Segmentation for the Syllable based
Segment Vocoder
The two-level group delay based segmentation method is used to get the syllable
segments for the proposed segment vocoder. Initially, a large number of syllable
segments are obtained for the preparation of the system codebook. For this
purpose, three databases are created using Tamil and Hindi languages. The
databases are taken from the Database for Indian Languages (DBIL) [44]. The
DBIL consists of news bulletins in eight Indian languages. The first database
includes the data for a single female speaker, and the second database includes
the data for four male and four female Tamil speakers. The third database is a
multi speaker and multi language database, with 12 speakers considered across
the Tamil and Hindi languages. The two-level group delay based segmentation of
the three databases resulted in 1683, 9465 and 17739 syllable segments
respectively. An example Tamil sentence, the syllable like units obtained using
group delay based segmentation, and the duration file that shows the segment
boundaries are included in the directory named ‘Chapter-3’ on the CD-ROM
attached to the Thesis. The syllable segments are clustered for the preparation
of the system codebook. Segmentation is also used during the encoding operation
of the segment vocoder, where an input sentence is segmented into syllable like
segments.
3.6 Summary
In this chapter, the importance of segmentation in the implementation of a
segment vocoder is discussed. It is observed that the method of segmentation
depends on the choice of the segmental unit. Most segmentation methods use
either parametric modeling or automatic techniques based on spectral variation.
Parametric modeling requires models to be built on a large database. The models
should be able to capture the variations among the segmental units in order to
obtain the segment boundaries. As the segment becomes larger, the variations
among instances of the same segmental unit also increase. The second type of
techniques, which consider the spectral variation among adjacent frames to detect
the boundaries, use iterative algorithms in which some of the parameters are
tuned in advance to obtain the segment boundaries. It is seen that the phoneme
has been extensively used as the basic segmental unit of compression. Earlier
techniques for segmenting speech signals are statistical or iterative in nature. In
the present work, a simple signal processing based algorithm called group delay
based segmentation is used. The algorithm is shown to successfully locate the
syllable segment boundaries in a given speech signal. Since a larger unit allows
better compression, the syllable is chosen as the unit of compression.
CHAPTER 4
SYSTEM CODEBOOK PREPARATION
4.1 Introduction
Segment vocoders differ mostly in the type of segmental unit chosen, segmentation
method and system codebook preparation as discussed earlier. In the present
work, a syllable is chosen as the basic segmental unit for compression. The two-
level group delay based segmentation algorithm is used to obtain the syllable like
segments. In order to compress the system characteristics and prepare a system
codebook, an unsupervised HMM based clustering algorithm [19][45][46] is used
to form the clusters of syllable segments. Each cluster consists of acoustically
similar syllable segments. A ‘representative’ syllable segment is identified from
each cluster to build the system codebook.
The need for system compression is discussed in Section 4.2. An overview
of clustering techniques is given in Section 4.3. The motivation for the clustering
method used and the system codebook preparation are presented in Sections 4.4
and 4.5 respectively.
4.2 Need for System Compression
Vocoders are built based on the fact that the speech signal has redundancy.
Adjacent frames of a speech signal resemble each other in their spectral
characteristics (features). Vocoders exploit the similarities in these features to
compress the system. Generally, MFCCs, LPCCs or LSFs are used to represent
the spectral characteristics. Speech is a random process, and hence the spectral
features keep changing. Since the system codebook is prepared using the spectral
features, there is a need to design a sophisticated clustering algorithm that
captures the similarities and dissimilarities in the spectral features to the
maximum possible extent and, at the same time, allows good compression to
reduce the bit rates.
In the segment vocoders, segmental units are compressed instead of frames.
Compression at the segment level is possible because the same sounds occur at
different instants due to the finite structure of a language. But the same segments
may sound different depending on the context. Hence, the spectral properties of
instances of the same sound unit are not identical. Moreover, the durations of
the segments differ. Therefore, the clustering process should handle variable
length segments and still be able to cluster similar sound units.
4.3 Review of Clustering Algorithms
Many clustering techniques have been employed for system codebook preparation.
The majority of these methods are based on vector quantization (VQ) and its
variants, or on parametric modeling using HMMs or neural networks.
Roucos, et al., [2] used a two-pass vector quantization based clustering
algorithm. In the first pass, called binary clustering, the clusters are formed and
in the second pass templates are selected for each cluster. In the clustering phase,
the training examples are divided into two clusters using the K-means clustering
algorithm and each cluster is represented by its mean vector. Later, the cluster
which has the largest total quantization error is subdivided into two clusters. Eu-
clidean distance is used as the distance measure. The process is repeated until
the desired number of clusters are formed. For selecting the template of each
cluster, either the mean vector or the vector which is closest to the mean vector
is selected. The codebook is built using the templates of the clusters. The au-
thors also proposed a waveform segment vocoder [15] in which templates are the
waveform segments corresponding to the spectral vectors. The source characteris-
tics of the template waveform are changed according to the input segment source
characteristics during synthesis.
Shiraki, et al., [14] developed a segment vocoder using the variable length seg-
ment quantization. Here, the variable length code segments are obtained by time-
warping the fixed length code segments. A dynamic programming algorithm is used
to determine the best codebook segment with the desired duration. The speech
parameter sequence is generated by concatenating the variable length spectra in
such a way that the spectral distortion between the original and the synthesized
speech is minimized.
Bardenhagen, et al., [23] introduced fenones as the basic segmental units for
coding. Fenones are defined as acoustically homogeneous units of speech. The
system quantization is accomplished using a 3-way split vector quantization for
LSF parameters of the fenonic unit.
After the development of speech recognizers based on parametric modeling
techniques like HMMs and neural networks, the recognition/synthesis paradigm
has evolved for the design of segment vocoders. The units used for modeling in
speech recognizers, such as phones and diphones, serve as the units of compression
in the segment vocoders. The system codebooks are prepared using the trained
parametric models of the segmental units.
In [3], Picone, et al., introduced a phonetic vocoder with phonemes as
segmental units. To model the spectral characteristics of phonemes for codebook
preparation, a high performance HMM based recognition scheme is used. In the
segment clustering technique, the seed models for each phoneme are built using
the TIMIT transcription. For each phoneme, codebooks of size 2, 4 and 8 are
built using contextually rich phonemes. Once the initial phoneme models are
built, they are re-estimated using the entire training database in a supervised
mode.
In [25], Felici, et al., proposed a diphone recognizer as the codebook for en-
coding diphone segments. Here, each diphone cluster is referred to as a class. To
form a diphone class, contextually rich examples of diphones are clustered until
the segment distortion measure within the cluster is below a threshold value. For
a large diphone class, multiple templates are identified based on the duration and
prosody differences. For a small diphone class, a single template that represents
the diphone class is selected.
Cernocky, et al., [5] also used the HMM based training method to build the
codebook for units that are larger than phonemes. The models are obtained in
two phases - first generation and next generation. In the first generation, the
HMMs are built using the labeled training corpus of automatically derived
segments. In the next generation phase, the models formed in the first generation
are re-estimated using a larger number of examples. The HMMs formed after a
few iterations of re-estimation are used for codebook preparation. Each of these
final HMMs is associated with 8 examples. While matching the input segment,
the 4 examples (out of the 8) whose durations best match the duration of the
input segment are chosen. Out of these 4 examples, the example that has the
minimum Dynamic Time Warping (DTW) distance to the input segment is chosen
as the template segment for synthesis.
In [6][8][47], a phoneme recognizer is used as the codebook. The codebook is
generated from phonetically balanced sentences. No specific clustering procedure
is used for HMM training. Models are trained using the examples identified for
each phoneme. A total of 34 phonemes are identified for modeling. No template
vector or waveform is selected for a model, as in the earlier methods. Instead, an
HMM based synthesis technique is developed in which the spectral parameter
vector sequence for a given duration is generated from the model parameters.
The spectral vector sequence of the recognized phoneme, derived from the
corresponding HMM, is used for synthesis.
Another way to represent the system codebook is to use a large TTS labeled
database [26][27][28][4]. The identities of the segmental units (generally phonemes),
their durations, their pitch contours and all speech coding parameters are included
in the database. Different algorithms are developed to choose the appropriate unit
to represent the input speech segment. For example, Dynamic Programming (DP)
is applied to unit selection [4] in which two cost functions, an acoustic target cost
and a concatenation cost, are optimized while choosing the best representative
units for the input segmental units.
4.4 Motivation for HMM based syllable segment
clustering
From the overview given in the previous section, it is seen that the clustering
techniques use either VQ or parametric modeling. The reason for choosing
parametric modeling over VQ for syllable segment clustering has already been
discussed in Section 2.7: the sequence information, which is crucial in modeling
larger segments like syllables, is not captured by VQ. A modified K-means
algorithm can be used for clustering, in which the distance matrix is prepared
using the Dynamic Time Warping (DTW) distance between segments. Since DTW
is a preliminary step towards HMMs, HMM based parametric modeling is chosen
over other kinds of clustering, including VQ. However, it is seen that parametric
modeling is mostly done in a supervised mode. Supervised training consumes a
considerable amount of time and effort, as the syllable segments must be
segmented and labeled manually before clustering. Also, there is a need to collect
a sufficient number of examples before training the HMMs. The proposed method
of clustering for system codebook preparation is based on the fact that the
primary motive of any segment vocoder is to convey the information to the
receiver without an objectionable loss in intelligibility. Therefore, even if there
is some error in syllable
To justify the proposed method of clustering, an example of a speech utterance
in Tamil language is chosen. Let the input speech utterance be /vanakkam/. The
syllable segments that resulted from the segmentation task are /va/, /nak/ and
/kam/. The following two scenarios are considered:
Case I: Suppose the syllable /va/ is replaced by /ba/. The output speech will
be /banakkam/.
Case II: Suppose the syllable /kam/ is replaced by /gam/. The output speech
will be /vanakgam/.
In both scenarios, a native Tamil speaker is able to understand the utterance
as /vanakkam/, because, depending on the context, a listener can make out the
spoken word without any difficulty. This tendency is exploited in the present
clustering task, where an unsupervised training algorithm is used to group similar
sounding syllables. By the end of the clustering process, similar sounding syllable
segments fall into one cluster.
4.5 System Codebook Preparation
System codebook is prepared from the syllable segments obtained from the seg-
mentation process. The codebook is generated in the following three stages.
1. Clustering of syllable segments
2. Selection of representative syllable segment
3. Preparation of system codebook
Before discussing the clustering method, the Multiple Frame Size (MFS) and
Multiple Frame Rate (MFR) techniques for speech signal processing [46] that
enable the clustering to be an unsupervised method are briefly explained.
4.5.1 MFS and MFR Techniques
In speech signal processing, features are extracted by windowing the signal at the
frame level, typically with frames of 25 milliseconds, because speech is assumed
to be stationary over that interval. This method of feature extraction works in
most cases. However, it may face problems when the pitch frequency or speaking
rate of the test speaker is very different from that of the speakers whose data is
used in training [46]. Another problem is that such a technique for feature
extraction may not capture sudden changes in the spectral information. To avoid
these effects, the MFS and MFR techniques are proposed. In the present work,
the MFS and MFR techniques are used to generate multiple examples from a
single training example. Physically, this is like considering the same example at
different resolutions. These techniques were first used in a spoken language
identification task, where the models are initialized with a single training
example. In that work [19], it is shown that MFS feature extraction ensures a
reasonable variance for each Gaussian mixture in the models. Additionally, the
speaking rate varies across speakers. If the rate of speaking of the test speaker
is different from that of the trained speakers, models generated with a single
frame size may not be able to capture the changes. It is shown that, along with
MFS, the MFR technique should also be used to build the models.
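A minimal sketch of generating multiple training examples from one syllable
segment via MFS and MFR follows. The frame size/rate grids and the use of
librosa for feature extraction are assumptions for illustration, not the setup of
this thesis (which uses HTK-style 13 MFCC/LPCC + delta + acceleration
features).

import itertools
import numpy as np
import librosa  # assumed available; any MFCC/LPCC extractor would do

def mfs_mfr_examples(y, sr, sizes_ms=(16, 20, 24, 28), rates_ms=(8, 10, 12)):
    """Return one 39-dimensional feature sequence (13 static + 13 delta +
    13 acceleration) per (frame size, frame rate) combination; all of them
    are treated as training examples of the same syllable segment."""
    examples = []
    for size_ms, rate_ms in itertools.product(sizes_ms, rates_ms):
        n_fft = int(sr * size_ms / 1000)
        hop = int(sr * rate_ms / 1000)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                    n_fft=n_fft, hop_length=hop)
        d1 = librosa.feature.delta(mfcc)            # delta coefficients
        d2 = librosa.feature.delta(mfcc, order=2)   # acceleration coefficients
        # assumes the segment yields enough frames for delta computation
        examples.append(np.vstack([mfcc, d1, d2]).T)
    return examples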
4.5.2 Clustering of syllable segments
The syllable segments obtained from a large speech corpus are automatically clus-
tered using an unsupervised and incremental HMM based clustering algorithm
proposed in [19]. After clustering is performed, syllable segments which are
acoustically close to one another are grouped together into one cluster. A
separate HMM is built for each cluster.
The clustering is accomplished in two stages - initial cluster selection and
unsupervised incremental training. The description of these two stages is adapted
from [19][48].
Initial cluster selection
Initial cluster selection chooses a set of syllable clusters on which the entire clus-
tering process takes place. The steps in the procedure for initial cluster selection
are given below.
1. All the M syllable segments obtained from the training data are used for
clustering. The silence regions are added to each of the syllable segments
at the beginning and at the end. This is called silence normalization [31].
Segmentation might result in segments which have silence regions at the
beginning or at the end. So, when silence normalization is done, a sepa-
rate state is assigned to the silence portion at the syllable boundaries while
modeling.
2. An HMM is built for each syllable segment. To generate a sufficient number
of examples for model initialization, the multiple frame size (MFS) and multiple
frame rate (MFR) techniques are used. Features (13 MFCC/LPCC + 13 delta +
13 acceleration) are extracted from the same syllable segment with different
frame sizes and frame rates to generate multiple examples. Both MFCC and
LPCC are able to capture the spectral information, and hence either of them can
be used; there is no significant change in the number of clusters formed with
either feature.
3. In this manner, M HMMs are initialized for the M syllable segments. These
initial models depend solely on a single syllable segment each.
4. Each of the M syllable segments is recognized using the M HMMs. For any
segment S_i, the model HMM_i is expected to give the largest likelihood score,
because HMM_i is trained with the examples generated from S_i. In order to
initiate the clustering process, the model HMM_j, j ≠ i, that gives the second
largest score for S_i is also considered. This process is repeated for all the M
segments.
5. If HMM_j gives the second highest score for a segment S_i, i ≠ j, then the
model corresponding to S_j is not used further. This is done to reduce the number
of clusters in the subsequent iterations. Let M' be the number of clusters at the
end of this step.
6. For each of the M' segments identified in the previous step, a new HMM is
trained. The examples used for training the new model HMM_i are derived from
the segments S_i and S_j, where S_j is the segment of the model that gave the
second highest score for S_i in step 4. Multiple examples are derived from S_i
and S_j using the MFS and MFR techniques.
7. Steps 4 to 6 are repeated for m iterations. After m iterations, the number of
segments associated with each HMM is equal to 2^m. For small values of m
(m = 2), the number of syllable segments associated with each model is small and
the differences among the segments are expected to be insignificant.
At the end of this stage, C_I initial clusters are formed, where C_I < M. Initial
cluster selection is done to ensure fast convergence of the incremental training
algorithm used in the following stage.
Unsupervised incremental training
1. For each of the C_I clusters formed in the initial cluster selection, an HMM is
trained using the examples derived from the syllable segments in the cluster.
2. Each of the M segments is decoded or recognized using the C_I HMMs trained
in step 1. It is expected that similar syllable segments will be recognized by the
same HMM. All the segments for which a particular HMM gives the highest
likelihood score are grouped into a cluster.
3. The clusters with fewer than three syllable segments are removed. However,
the segments in these clusters are not excluded from the clustering process.
4. An HMM is re-trained for each of the clusters formed, using the examples
derived from the segments in the cluster.
5. Steps 2 to 4 are repeated until convergence is reached, that is, when there is
no migration of syllable segments from one cluster to another [49]. Since the
HMMs are re-estimated in every iteration, the models may change after each
iteration. When the models change, the syllables that match against them also
change. At a certain point, the models stop changing, and hence the syllables in
each cluster remain the same.
At the end of this stage, C clusters are formed, where C < C_I. Each of the C
clusters is represented by an HMM.
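The incremental training stage can be sketched as follows, using a generic HMM
toolkit (hmmlearn here, purely as a stand-in; the thesis itself uses HTK). The
number of states and the convergence test are illustrative choices.

import numpy as np
from hmmlearn.hmm import GaussianHMM  # stand-in for HTK, which the thesis uses

def incremental_training(segments, init_clusters, n_states=5, max_iter=20):
    """Unsupervised incremental training sketch. `segments` is a list of
    (T_i x d) feature arrays; `init_clusters` is a list of lists of segment
    indices coming from the initial cluster selection stage."""
    clusters = init_clusters
    for _ in range(max_iter):
        # Steps 1/4: (re-)train one HMM per cluster on its current members
        models = []
        for members in clusters:
            X = np.vstack([segments[i] for i in members])
            lengths = [len(segments[i]) for i in members]
            models.append(GaussianHMM(n_components=n_states).fit(X, lengths))
        # Step 2: re-assign every segment to its best scoring model
        assign = [[] for _ in models]
        for i, seg in enumerate(segments):
            best = max(range(len(models)), key=lambda k: models[k].score(seg))
            assign[best].append(i)
        # Step 3: drop clusters with fewer than three segments; their members
        # are re-assigned against the surviving models on the next pass
        new_clusters = [c for c in assign if len(c) >= 3]
        if new_clusters == clusters:   # Step 5: no segment migrated
            break
        clusters = new_clusters
    return clusters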
4.5.3 Color palette analogy
The proposed method for clustering syllable segments is explained using a color
palette analogy, with the help of the diagram shown in Figure 4.1.
Each of the M syllable segments is represented by a unique color from the
color palette. In the explanation that follows, a color is analogous to a syllable
segment, and variations in a syllable segment are analogous to variations in the
color. For example, two variations of a syllable segment /va/ are considered as the
syllable segments /va1/ and /va2/. The differences in the syllable segments are
due to the mismatch of their duration and spectral characteristics. Analogously
in the color palette, two different shades of a color X are considered for syllable
segments /va1/ and /va2/.
Initially, a color model is prepared for each color. The color model is analogous
to an HMM in the clustering algorithm. The color model is assumed to be prepared
by considering some of the color properties like the percentage of red, blue and
green colors present in it. It is also assumed that a color model cannot be prepared
from a single color, hence different examples of the same color are aggregated to
estimate its properties. A color model is built for each color. In the next step, each
color is compared against the color model. The color model that best matches the
input color is considered. Initially, the color models are prepared from a single
color; hence, the input color is matched to its own model. So, the second best
match is also considered. The color which most closely resembles the original
color will be the second best match, and these two similar colors are paired up.
Since the second color is already paired up with the first one, it is not considered
in the next iteration. This procedure is repeated for all the colors. After the first
iteration, the color and its closest match are grouped together. The properties of
these two colors are used to re-estimate the color model. In this process, the basic
properties of the color model are not changed but the parameters are varied to
account for the changes in the two colors. The colors which have similar properties
(for example the two shades of dark green), are clustered into one group and the
color model (of dark green) is re-estimated using both the shades. Once the colors
are clustered, they are not used again for clustering in that iteration. This results
in a reduced number of clusters in the next iteration. This process of combining
similar colors is carried out for a number of iterations until a set of distinct
colors is obtained. It is seen that color models of green and color models of red
can never mix with each other. If the modeling is good enough, even the color
models of light green and dark green will not mix. This ensures that the basic
properties
of syllables will not change. Syllables which are very different will never occur
in one cluster. But there is a chance of confusion. For example, a brown color
may be confused with dark red or dark brown: instead of falling into the brown
cluster, it may fall into the dark red cluster. This is the reason why the
recognition of syllable segments is not always accurate. The final color
models are shown in Figure 4.1. It can be seen that all shades of blue are modeled
by a single blue color model, all shades of green by a single green color model and
so on. Similarly, in the HMM based segment clustering process, similar syllable
segments are clustered into one group and modeled by a single HMM.
Figure 4.1: Color palette analogy to explain the proposed HMM based technique
for segment clustering.
4.5.4 Selection of representative syllable segments
Clustering enables compression of the system information. Now, in order to build
a codebook from the clusters, a representative of each cluster should be chosen.
Each cluster contains similar sounding segments. In order to represent the cluster,
Figure 4.2: Illustration of selection of a representative syllable segment
one syllable segment is to be chosen from each cluster. That syllable segment is
called the ‘representative syllable segment’. It represents the acoustic properties
of all the other syllable segments in the cluster. The other syllable segments are
variations of the representative syllable segment.
The selection of a representative syllable segment is illustrated in Figure 4.2.
A blue cluster resulting from the color palette analogy is considered. It is seen
that all the examples are shades of blue. The different sizes represent the differences
in the durations of the syllable segments. The different textures represent the
different spectral characteristics. Among these color elements, the element which
has properties close to all the remaining elements is chosen as the representative
color element. Similarly, to build a system codebook, a ‘representative’ syllable
segment is chosen for each cluster. The procedure for identifying the representative
syllable segments is given below:
1. Let C be the number of clusters obtained. Let C_i denote the i-th cluster and
N_i the number of syllable segments present in that cluster. Let S_ij denote
the j-th syllable segment in cluster C_i.
2. For each syllable segment S_ij, 10th order LP Cepstral Coefficient (LPCC)
vectors are computed using a frame size of 20 milliseconds and a frame shift
of 10 milliseconds.
3. The average Dynamic Time Warping (DTW) distance [9], D_ij, for S_ij is
computed as follows:

D_{ij} = \frac{1}{N_i - 1} \sum_{k=1, k \neq j}^{N_i} \mathrm{DTWdistance}(S_{ij}, S_{ik})

The DTW distance is used to compute the dissimilarity between two vector
sequences of different lengths.
4. The syllable segment which has the minimum average DTW distance is
chosen as the representative syllable segment for the cluster Ci.
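A sketch of steps 3 and 4 follows, with a plain quadratic-time DTW; any
standard DTW implementation would serve equally well.

import numpy as np

def dtw_distance(a, b):
    """DTW distance between two LPCC sequences of possibly different lengths."""
    D = np.full((len(a) + 1, len(b) + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[len(a), len(b)]

def representative_index(cluster):
    """Index of the segment with the minimum average DTW distance to all the
    other segments in the cluster (the representative syllable segment)."""
    avg = [np.mean([dtw_distance(s, t)
                    for j, t in enumerate(cluster) if j != i])
           for i, s in enumerate(cluster)]
    return int(np.argmin(avg))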
Figure 4.3 shows the waveforms and corresponding spectrograms of the syllable
segments in one of the clusters. There are five syllables in the cluster which are
variations of the syllable /va/. The duration mismatch among the waveforms is
clearly observed. On the other hand, spectrograms show the similarities in the
spectral properties of the syllable segments. It is observed that all the five syllable
segments have the same formant structure. The segment /va3/ is chosen as the
representative syllable segment of the cluster.
Figure 4.3: Waveforms and spectrograms of five syllable segments, /va1/ through
/va5/ (panels (a) to (e)), in a cluster formed using the proposed clustering
algorithm. The cluster contains the syllable segments that sound like /va/.
4.5.5 Preparation of system codebook
The codebook of size C is prepared using the C representative syllable segments.
The sequences of LP coefficient vectors of the representative segments are used
to synthesize speech at the receiver. Therefore, the sequence of LP coefficient
vectors corresponding to each representative syllable segment is stored in the
codebook. Since MELP is used for residual modeling, the frame size used in
MELP analysis, i.e., 22.5 milliseconds, is also used in the LP analysis. As the
duration differs from one syllable segment to another, the number of LP coefficient
vectors corresponding to each syllable segment is different. The information
regarding the duration of a representative syllable segment is also made available
in the system codebook. Figure 4.4 illustrates the system codebook. The widths
of the codebook entries are drawn differently to emphasize the difference in the
durations of the representative syllable segments.
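The resulting codebook can be pictured as the following simple structure. This
is an illustrative layout; the actual storage format is not specified at this level
of detail in the thesis.

from dataclasses import dataclass
import numpy as np

@dataclass
class CodebookEntry:
    """Entry for cluster C_i: the LP coefficient vector sequence of its
    representative syllable segment S_i and its duration n_i in 22.5 ms frames."""
    lpc_sequence: np.ndarray  # shape (n_i, lp_order)
    n_i: int                  # number of frames; differs from entry to entry

# codebook[i] corresponds to cluster C_i; only the index i (together with the
# input segment's duration) needs to be transmitted as system information.
codebook: list[CodebookEntry] = []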
As already mentioned in the previous chapter, three databases are created using
Tamil and Hindi news bulletins from the DBIL. The HTK toolkit [50] is used for
implementing the clustering procedure. For the single speaker database, the 1683
syllable segments form 295 clusters. For the multi speaker Tamil database, the
9465 syllable segments form 1583 clusters. The multi speaker, multi language
database with 17739 syllable segments resulted in 2980 clusters. Hence, the
codebook sizes for the three databases are 295, 1583 and 2980 respectively. The
codebooks are incorporated at both the transmitter and the receiver sections of
the vocoder.
Figure 4.4: System codebook entries. Here S_i denotes the representative syllable
for the cluster C_i and n_i is the duration of S_i.
4.6 Summary
This chapter discusses the preparation of the system codebook for the proposed
segment vocoder. The different schemes for codebook preparation proposed in the
literature use either vector quantization or HMM based modeling. It is observed
that syllable segments cannot be modeled properly by vector quantization, and
hence HMM based training is chosen for syllable segment clustering. Most of the
methods in the literature are supervised and need transcriptions for HMM
training. In the present chapter, an unsupervised clustering algorithm is used to
cluster the syllable segments into groups of acoustically similar segments.
Representative syllable segments, which capture the common properties of all the
segments in their respective clusters, are identified to build the system codebook
used in the segment vocoder.
CHAPTER 5
SYLLABLE BASED SEGMENT VOCODER
5.1 Introduction
The working of a segment vocoder involves encoding the speech signal at the
transmitter, decoding and regenerating the speech back at the receiver. Encoding
of speech involves segmentation of the speech signal, quantizing the system and
source characteristics of the segment using codebooks, and then transmitting the
binary encoded codebook indices. At the receiver, the transmitted bit stream is
decoded to retrieve the codebook indices of each segment. System and source
characteristics of each segment are modeled using parameters derived from the
respective codebooks. The modeled parameters are used to synthesize speech. In
this chapter, the operation of the proposed syllable based segment vocoder is discussed.
The operation of the proposed vocoder, its encoding and its decoding are presented
in Sections 5.2, 5.3 and 5.4, respectively. Later, a sanity check is performed
on the proposed methods by passing the residual as it is during synthesis. A mod-
ification of the vocoder along with the reasons for modification are discussed in
Section 5.5. The implementation of the proposed syllable based segment vocoder
and the performance evaluation are presented in Section 5.6. Finally, the chapter
concludes by discussing some of the implementation issues in Section 5.7.
5.2 Operation of the Vocoder
The structure of the syllable based segment vocoder is shown in Figure 5.1. The
input speech signal is segmented using group delay based segmentation. The seg-
ments are recognized against the HMMs that are trained on a large corpus
of syllable like segments with the unsupervised clustering algorithm discussed in
Chapter 4. The index of the HMM which decodes the input syllable segment is
encoded and transmitted. Additionally, duration information of each segment is
also encoded for transmission. Simultaneously, the input speech signal is passed
through an LP analysis filter to obtain the residual. The parameters of the residual
are encoded using MELP. At the receiver, the residual is modeled using the decoded
source parameters. System information in the form of LP coefficients is obtained
from the system codebook. The modeled residual is passed through the LP synthesis
filter formed from these LP coefficients to obtain the synthesized speech. A description
of the encoding and decoding processes is given in the following sections.
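As a rough illustration of this transmitter path, the sketch below strings the stages together in Python; boundaries, recognize and melp_encode are hypothetical stand-ins for the group delay segmentation, HMM recognition and MELP residual encoding described above, not functions defined in the thesis.

    import math

    FRAME_MS = 22.5  # analysis frame size used throughout the vocoder

    def transmit(speech, fs, boundaries, recognize, melp_encode):
        # boundaries: (start, end) sample indices from group delay segmentation
        # recognize(segment): index of the best-matching cluster HMM
        # melp_encode(segment): bit stream for the segment's residual
        samples_per_frame = fs * FRAME_MS / 1000.0
        packets = []
        for start, end in boundaries:
            segment = speech[start:end]
            n_frames = math.ceil((end - start) / samples_per_frame)
            packets.append({"index": recognize(segment),
                            "duration": n_frames,
                            "residual_bits": melp_encode(segment)})
        return packets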
Figure 5.1: Block diagram of the syllable based segment vocoder
5.3 Encoding
Encoding in a speech coder is the process of representing the input speech signal in
the form of a bit stream. The stages involved in encoding the input speech signal
in the proposed syllable based segment vocoder are explained in this section.
5.3.1 Segmentation
The input speech signal is segmented using the two-level group delay based seg-
mentation algorithm to obtain syllable like units. The duration of each syllable
is calculated in terms of the number of frames with the frame size being 22.5
milliseconds. If the number of frames is not a whole number, it is rounded off
to the next integer. The speech signal waveform for a sentence, the syllable like
segments obtained from the segmentation and the durations of the segments are
shown in Figure 5.2. Vertical lines in the figure represent syllable boundaries. The
transcription and duration of each syllable segment are also given.
Figure 5.2: Segmentation of a Tamil sentence into syllable like units, showing the transcription (/sa ma dU rat til vi la hi niR pa de^n Rum/) and the duration of each segment in frames
5.3.2 System Quantization
System quantization of each syllable segment is accomplished using the HMMs
formed from the clustering task. The system encoding of a syllable segment is
shown in Figure 5.3. Each syllable segment is recognized against the cluster mod-
els. The index of the HMM which decodes the syllable segment is used for en-
coding. At the same time, the duration of the syllable segment is also encoded.
Figure 5.3: System encoding in the syllable based segment vocoder
5.3.3 Source Encoding
The input speech signal is passed through an LP analysis filter to obtain the
residual. Pitch, gain and voicing are the main source parameters that are to be
encoded. The MELP residual coding is used for this purpose. A block diagram
of the MELP encoding process [20] is given in Figure 5.4. Residual encoding
in MELP is done at the frame level, with the duration of each frame being 22.5
milliseconds.
Figure 5.4: Block diagram of the MELP encoder used for source encoding in the syllable based segment vocoder
The input signal is first filtered to remove any low frequency noise components.
Initial pitch estimation is carried out on the filtered speech. The next step is to
perform bandpass voicing analysis. The speech signal is passed through five filters
with the following frequency bands: 0-500 Hz, 500-1000 Hz, 1000-2000 Hz,
2000-3000 Hz and 3000-4000 Hz. The voicing decisions (voiced/unvoiced) are
made for each band. The extent of voicing in a frequency band is determined by
the strength of periodicity in that frequency band. Later, the peakiness of each
frame of the residual is calculated. The peakiness refers to the presence of samples
having relatively high magnitudes (peaks) with respect to the average magnitude
of the group of samples. The peakiness measure helps in estimating the voicing
strengths as voiced frames generally have high peakiness values when compared to
unvoiced frames. Later, the final pitch is estimated over the low pass filtered residual
signal. A series of pitch processing techniques results in final pitch estimation
for a frame of speech. Once the pitch is determined, the gain is estimated. The
estimated gain is the Root Mean Square (RMS) value of the input signal in dB.
Before encoding all the parameters, Fourier magnitudes are calculated from the
residual. The Fourier magnitudes are the first ten pitch harmonics extracted
from the residual signal. They are identified by using the spectral peak picking
algorithm on the Fourier Transform of the residual signal. Fourier magnitudes are
obtained only if the speech frame is voiced. For unvoiced frames, the unoccupied
bit positions are used for forward error correction. All the source parameters are
encoded using the respective codebooks.
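The peakiness measure lends itself to a compact computation. A common formulation, used here only as a sketch (the exact definition in the MELP standard may differ in detail), is the ratio of the RMS value of a residual frame to its mean absolute value:

    import numpy as np

    def peakiness(frame):
        # High values indicate isolated large-magnitude samples (peaks),
        # typical of voiced excitation; unvoiced frames score lower.
        frame = np.asarray(frame, dtype=float)
        rms = np.sqrt(np.mean(frame ** 2))
        return rms / (np.mean(np.abs(frame)) + 1e-12)  # guard for silent frames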
5.4 Decoding
Decoding is the process of retrieving the useful information from the coded input.
The bit stream transmitted by the encoder contains the information regarding
the indices of system and source codebooks. The bit stream is decoded to get
the codebook indices. System information and duration are decoded segment
by segment using the system codebook and the duration codebook respectively.
Source information is decoded for every frame. Residual is modeled from the
decoded source parameters using MELP. The decoded system parameters and the
modeled residual are used for synthesizing speech using an LP synthesis filter. A
detailed explanation of system and source decoding is given in this section.
5.4.1 Extraction of system parameters
The system parameters are extracted from the system codebook. The system
codebook contains the LP coefficient vector sequences for the representative syl-
lable segment of each cluster. System codebook preparation is explained in
Section 4.5.5.
The bit stream is decoded to get system codebook indices. The LP coefficient
vector sequences corresponding to the decoded indices are used to form an LP
synthesis filter. The duration mapping between the input syllable segment and
the corresponding representative syllable segment is done by the duration match-
ing section. The duration matching technique used in this vocoder is discussed in
Section 5.5.
5.4.2 Source Modeling
The source is modeled using MELP. MELP was developed as an extension of the LPC
vocoder [20], with the objective of synthesizing speech that sounds natural and
close to human speech. The synthesized speech from an LPC vocoder sounds
synthetic because the excitation used is either a series of impulses or random
noise, depending on the nature of the speech (voiced/unvoiced). However, the excitation of
human speech cannot be represented by a perfect impulse train in voiced regions
and pure random noise in unvoiced regions. So, in order to mimic human
speech, MELP introduces ‘mixed excitation’, which is calculated as a combination
of both impulses and noise. The features which make the MELP synthesized
speech sound more natural are: (i) mixed pulse and noise excitation, (ii) periodic
and aperiodic pulses and (iii) adaptive spectral enhancement. The above three
features are accommodated in the decoder part of MELP, to produce a more
natural excitation signal that resembles the residual. Modeling the excitation
signal requires source parameters like pitch, gain, voicing and Fourier magnitudes
which are quantized, encoded and transmitted. The excitation modeling using
MELP is shown in Figure 5.5.
Figure 5.5: Block diagram of the MELP decoder
The decoder retrieves the source information in the form of indices correspond-
ing to the respective source codebooks. The decoded values of source parameters
are used to generate a mixed excitation signal. All the decoded parameters are
interpolated pitch synchronously. The mixed excitation is generated as the sum
of filtered pulse and noise excitations. Pulse excitation is generated using the
pitch information. The noise excitation is generated by a uniform random number
generator. These excitations are filtered and added together. Mixed excitation
removes the buzziness in the synthesized speech. One more major contribution in
the MELP residual modeling is the use of aperiodic pulses. This is because the
voiced regions of speech do not have perfect periodicity. There are slight variations
in the periodicity within a voiced region. In MELP, the periodicity in a voiced
region is modeled to mimic the erratic glottal pulses produced by humans. This
is accomplished by varying the pitch period length using the pulse position jitter.
The mixed excitation is then passed through an adaptive spectral enhancement
filter. The coefficients of the filter are calculated using the interpolated LP co-
efficient vector sequences. Adaptive spectral enhancement helps in matching the
synthesized speech with the original speech in the formant regions. The excitation
signal filtered through the adaptive spectral enhancement filter is used for synthesizing
speech.
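A toy sketch of mixed excitation generation in this spirit is given below; pulse_taps and noise_taps stand in for the shaping filters derived from the decoded bandpass voicing strengths, and the jitter default is illustrative, not a value taken from the MELP standard.

    import numpy as np
    from scipy.signal import lfilter

    def mixed_excitation(pitch_period, n_samples, pulse_taps, noise_taps,
                         jitter=0.1, seed=0):
        rng = np.random.default_rng(seed)
        # Jittered impulse train: each pitch period is perturbed by up to
        # +/- jitter to mimic the erratic glottal pulses of natural speech.
        pulses = np.zeros(n_samples)
        t = 0
        while t < n_samples:
            pulses[t] = 1.0
            t += max(1, int(round(pitch_period *
                                  (1.0 + jitter * rng.uniform(-1, 1)))))
        noise = rng.uniform(-1.0, 1.0, n_samples)
        # Shape the two excitations and sum them to form the mixed excitation.
        return (lfilter(pulse_taps, [1.0], pulses) +
                lfilter(noise_taps, [1.0], noise))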
The residual modeled using MELP increases the overall bit rate of the
coder, as the major portion of the bit rate is occupied by the residual bits. To
check the sanity of the proposed techniques, care is taken that the residual is well
modeled; hence, MELP is chosen for residual coding. Figure 5.6 shows the
residual modeled by MELP. It can be seen that the MELP modeled residual is
very close to the original residual.
Figure 5.6: Comparison of the MELP modeled residual with the original residual: (a) original residual and (b) MELP modeled residual
5.5 Synthesis of Speech
After decoding the system and source parameters, an LP synthesis filter is used
to synthesize the speech. Prior to this, a duration adjustment is to be made
to match the input segment duration to the duration of the representative syl-
lable segment. The duration of the representative syllable segment is obtained
from the system codebook. In order to validate the proposed algorithms, initially
the residual is passed as is at the synthesizer. Ideally, the synthesized waveform
should match the original signal. But, the duration of the representative sylla-
ble segment is very different from the actual segment duration. The duration
mismatch is compensated for by repeating or deleting the middle frames of the LP
coefficient vector sequence of the representative syllable segment, so that the number
of frames of the representative syllable segment matches the segment duration of
the input syllable segment. When speech is synthesized
using the duration matched LP coefficient vector sequences of the representative
syllable segments and the original residual, clipping is observed in some sentence
waveforms. Figure 5.7 shows the waveforms for the original speech signal and the
clipped signal of the synthesized speech. This clipping effect shows that there is
still a mismatch between the residual and LP coefficient vector sequence obtained
from representative syllables. In order to remove the mismatch, a modified version
of the segment vocoder shown in Figure 5.8 is implemented. In this vocoder, the
residual which is to be encoded and later modeled by the MELP, is obtained by
passing the input speech signal through the filter formed by LP coefficients of the
representative syllable segments. This ensures matching because the residual is
the prediction error obtained when the speech signal is passed through the LP
analysis filter. When such a residual signal is filtered through the inverse filter
formed by the same set of LP coefficients, the original signal is retrieved. To match
the durations, the frames in the middle of the LP coefficient vector sequence of
the representative syllable segment are repeated or deleted.
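A minimal sketch of this frame repetition and deletion (assuming a non-empty LP coefficient frame sequence):

    def match_duration(lpc_frames, target_len):
        # Repeat or delete frames in the middle of the representative
        # segment's LP coefficient vector sequence until its length equals
        # the number of frames of the input syllable segment.
        frames = list(lpc_frames)
        while len(frames) < target_len:
            mid = len(frames) // 2
            frames.insert(mid, frames[mid])   # repeat a middle frame
        while len(frames) > target_len:
            mid = len(frames) // 2
            frames.pop(mid)                   # delete a middle frame
        return frames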
Figure 5.7: Waveforms of (a) the original speech signal and (b) the clipped synthesized speech signal
Figure 5.8: Block diagram of the modified segment vocoder
5.6 Experiments and Results
When the input speech utterance to be coded is given, group delay based segmen-
tation is done to get the syllable segments. The segments are recognized using the
HMM cluster models. The indices corresponding to the recognized cluster models
are encoded. The speech signal is then passed through the filter formed from the
LP coefficient vector sequence of the representative syllable segments after the
duration adjustment, i.e., any duration mismatch present between the input syllable
segment and its representative syllable segment is removed by either repeating or
deleting frames in the middle sections of the LP coefficient vector sequence of
the representative syllable segment. The residual obtained from the filter is coded
using the MELP residual coding algorithm [51]. The residual is coded for every
22.5 milliseconds using 28 bits, which amounts to about 1.2 Kbps for residual coding alone.
A syllable rate of 8 syllables/s is assumed. If a maximum syllable duration of
16 frames is assumed where each frame is of 22.5 ms, 4 bits are needed to encode
the duration of each segment and the overall bit rate will be about 1.4 Kbps.
Frame size for source encoding = 22.5 ms
Number of bits required to encode source parameters = 28 bits/frame
Frame rate = 44.4 frames/s

Codebook size for the single-speaker database = 295
Number of bits required to encode system parameters = 9 bits/segment
Total bit rate for the single-speaker database = 44.4 × 28 + 7 × (9 + 4) ≈ 1400 bps

Codebook size for the multi-speaker database = 1583
Number of bits required to encode system parameters = 11 bits/segment
Total bit rate for the multi-speaker database = 44.4 × 28 + 7 × (11 + 4) ≈ 1400 bps

Codebook size for the multi-speaker, multi-lingual database = 2980
Number of bits required to encode system parameters = 12 bits/segment
Total bit rate for the multi-speaker, multi-lingual database = 44.4 × 28 + 7 × (12 + 4) ≈ 1400 bps
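The arithmetic above is easy to reproduce. The sketch below follows the formulas as written (which use 7 segments/s, although a rate of 8 syllables/s is quoted) and derives the system index width from the codebook size:

    import math

    FRAME_RATE = 1000.0 / 22.5   # ~44.4 frames per second
    RESIDUAL_BITS = 28           # MELP source bits per frame
    SEGMENT_RATE = 7             # syllable segments per second, as in the formulas
    DURATION_BITS = 4            # enough for a maximum duration of 16 frames

    def total_bit_rate(codebook_size):
        index_bits = math.ceil(math.log2(codebook_size))  # 9, 11 or 12 here
        return (FRAME_RATE * RESIDUAL_BITS +
                SEGMENT_RATE * (index_bits + DURATION_BITS))

    for size in (295, 1583, 2980):
        print(size, round(total_bit_rate(size)))  # all close to the quoted 1.4 Kbps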
It can be clearly observed that even if the codebook size is increased nearly
ten times, the bit rate does not change significantly. However, the computation time
increases with increase in the number of HMMs. This forms a major area of focus
when real time implementation of the coder is considered. The spectrograms of
the original speech and the synthesized speech for a part of a test sentence are
shown in Figure 5.9(a) and Figure 5.9(b) respectively. It is observed that the
formant structure is preserved. The output speech quality of the proposed seg-
mental vocoder is compared with that of the 2.4 Kbps standard MELP codec.
The Perceptual Evaluation of Speech Quality (PESQ) score [52] is used for the
comparison. PESQ, described in ITU-T Rec. P.862, is an objective measurement
tool that provides a rapid and repeatable prediction of the results of subjective
listening tests on telephony systems.
Figure 5.9: Comparison of spectrograms: (a) original signal and (b) synthesized signal
PESQ uses a sensory model to compare the original, unprocessed signal with the
degraded signal from the network or network element. The resulting quality score
is analogous to the subjective Mean Opinion Score (MOS) measured using panel
tests according to ITU-T P.800. The PESQ scores are calibrated using a large
database of subjective tests. PESQ takes into account coding distortions, errors,
packet loss, delay and variable delay, and filtering in analogue network components.
The range of values PESQ takes is the same as that of the MOS; Table 5.1 shows
this range.
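For reference, a PESQ score can nowadays be computed with the open-source pesq package; the following is a sketch assuming 8 kHz narrowband files with hypothetical names (this package is a later reimplementation of ITU-T P.862, not the tool used for the results reported here):

    import soundfile as sf
    from pesq import pesq

    ref, fs = sf.read("original.wav")        # reference (original) signal
    deg, _ = sf.read("synthesized.wav")      # degraded (synthesized) signal
    print(pesq(fs, ref, deg, "nb"))          # narrowband mode, MOS-like scale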
Table 5.1: PESQ scores
MOS   Quality     Impairment
5     Excellent   Imperceptible
4     Good        Perceptible but not annoying
3     Fair        Slightly annoying
2     Poor        Annoying
1     Bad         Very annoying
The results for the segment vocoder are shown separately for each database
in Table 5.2, Table 5.3, Table 5.4 and Table 5.5. The original and synthesized
speech waveforms are available in the directory named ‘Chapter-5’ in the CD-ROM
attached to the Thesis. It can be observed that the waveforms synthesized using
the proposed segment vocoder are natural and intelligible.
Table 5.2: Comparison of PESQ scores for the syllable based segment vocoder and the 2.4 Kbps MELP codec for four sentences in the single speaker database
Sentence Proposed method MELP
1 1.66 2.55
2 1.60 2.87
3 1.45 2.50
4 1.82 2.78
Table 5.3: Comparison of PESQ scores for the syllable based segment vocoder and the 2.4 Kbps MELP codec for six sentences in the multi speaker database
Sentence Proposed method MELP
1 1.70 2.44
2 1.37 2.95
3 2.05 2.90
4 1.42 2.73
5 1.80 2.58
6 1.61 2.34
Table 5.4: Comparison of PESQ scores for the syllable based segment vocoder and the 2.4 Kbps MELP codec for six sentences of the Tamil language in the multi speaker, multi lingual database
Sentence Proposed method MELP
1 1.72 2.63
2 1.45 2.28
3 1.71 2.44
4 1.85 2.50
5 1.52 2.32
6 1.33 2.14
Table 5.5: Comparison of PESQ scores for the syllable based segment vocoder and the 2.4 Kbps MELP codec for six sentences of the Hindi language in the multi speaker, multi lingual database
Sentence Proposed method MELP
1 1.82 2.53
2 1.70 2.61
3 1.75 2.38
4 1.42 2.04
5 1.42 2.59
6 1.38 2.49
5.7 Issues in the Implementation
The speech synthesized from the proposed syllable based segment vocoder is in-
telligible and preserves the naturalness. However, it has been observed that there
are several issues which lead to deterioration in the speech quality. The major
issues are:
1. Residual modeling
2. Syllable recognition
3. Duration matching
5.7.1 Residual Modeling
In linear prediction analysis, the residual is the error signal between the predicted and
original speech signals. Coefficients of the LP analysis filter are computed such
that the error (residual) obtained is minimized. During coding, MELP uses the
residual signal to extract source parameters, and quantize and encode them for
transmission. In the proposed segment vocoder, the LP analysis filter is formed
from the LP coefficient vectors of the representative syllable segments. This
results in an error signal which is not the minimum error that would be obtained
if LP modeling were performed on the original signal. When the representative
syllable segments are very different from the input syllable segments, as happens
when recognition errors occur, the error signal obtained is larger in amplitude,
almost resembling the speech signal. Since MELP compresses an 8000 Hz signal
to 2400 bits/s, it is difficult to quantize the residual when the error is large. The
residual almost resembles the speech signal in the areas where the error is large. This
is clearly seen in the following figures. Figure 5.10(a) shows the residual of a
speech signal. Figure 5.10(b) shows the residual obtained by passing the speech
signal through the LP analysis filter formed from the LP coefficient vectors of the
representative syllable segment.

Figure 5.10: Comparison of the residual of the original speech signal with the residual used in the vocoder: (a) residual of the original speech signal and (b) residual of the speech signal passed through the filter formed from the LP coefficient vectors of the representative syllable segments

This residual is used for MELP modeling in the
proposed segment vocoder. It is seen that the error is very high in some regions.
In such cases, the MELP residual modeling may not be able to model the source
parameters effectively. This may also lead to poor quality of the synthesized
speech.
5.7.2 Syllable Recognition
As discussed previously, unsupervised clustering of syllables gives a syllable recog-
nition accuracy of 48% [31]. The recognition accuracy of the clustering algorithm
is not given much importance in the present work. The reason is that the error
in the recognition is expected to get captured in the residual as we pass the input
speech signal through the inverse filter formed from the representative syllable
segment. If the residual is well modeled and used for synthesis, it is possible to get
back the original signal with minimum distortion. But the first best result of syllable
recognition is not always correct. Because of this, there can sometimes be a huge
mismatch between the recognized and original signals, which leads to poor quality
of the synthesized speech waveform. Sentence 2 in Table 5.3, whose PESQ score
is 1.37, is one such example. To improve the quality of the synthesized speech,
instead of selecting only the 1-best result cluster for an input syllable segment,
3-best results are selected. Representative syllable segments are chosen from the
3-best clusters. The index of the cluster whose representative syllable segment is
closest to the input syllable segment is encoded and transmitted. In other words,
the syllable combination that yields the best PESQ score is encoded and trans-
mitted. Figure 5.11 illustrates this method by encoding a part of a test utterance.
The syllables corresponding to the path −×− can be encoded and transmitted.
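A sketch of this selection rule is given below; loglik and similarity are assumed stand-ins for the HMM score and the perceptual judgment (PESQ or listening), respectively.

    def choose_cluster(segment, cluster_models, representatives,
                       loglik, similarity, n_best=3):
        # Rank all clusters by HMM log-likelihood and keep the n best.
        ranked = sorted(range(len(cluster_models)),
                        key=lambda i: loglik(cluster_models[i], segment),
                        reverse=True)[:n_best]
        # Among these, transmit the index whose representative syllable is
        # perceptually closest to the input segment.
        return max(ranked,
                   key=lambda i: similarity(segment, representatives[i]))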
The above suggested method is implemented for a speech utterance manually.
Figure 5.11: Different combinations of syllables for the utterance /vanakkam/, with the PESQ score for each combination
The 3-best results for every syllable segment are heard and the representative
syllable segment that is perceptually close to the input syllable segment is used
for encoding. This improved the performance from a PESQ score of 1.30 to a
PESQ score of 1.60. Figure 5.12 shows the selection of the best representative syllable
segment that is close to the input syllable segment. It can be seen that except
for the second and fourth syllable, all other syllables are close to their second
and third best results. It is also observed that the residual obtained using these
representative syllable segments is closer to the original residual than the residual
obtained from the set of first best representative syllable segments.

Figure 5.12: PESQ improvement by selecting perceptually close representative syllable segments

Figure 5.13(a) shows the original residual. Figure 5.13(b) contains the residual as
obtained by considering the first best results of recognition, while Figure 5.13(c)
shows the residual obtained from the representative syllables that are manually
selected from the 3-best results. It is clearly observed that the error in Figure
5.13(c) is much smaller than the error in Figure 5.13(b).
Figure 5.13: Comparison of the residuals for (a) the original speech signal, (b) the first best representative syllable segments and (c) the perceptually close representative syllable segments from the 3-best results
The analysis establishes the need to develop an automatic technique to select the
representative syllable that is closest to the input syllable segment.
5.7.3 Duration Matching
The frame repetition and deletion technique used to match the durations of the input
syllable segment and the corresponding representative syllable segment may not
yield good results in all the cases. This technique is suitable when the duration
mismatch is small. When the duration mismatch is large, extensive repetition or
deletion in the middle portion may totally change the spectral characteristics of
the syllable segment. In order to demonstrate the mismatch between the actual
syllable segment and the representative syllable segment, an example of the syllable
/va/ is considered in Figure 5.14(a) and Figure 5.14(b). It is seen that the duration
difference is as large as half the duration of the original speech signal, so nearly
half of the frames need to be repeated.
Figure 5.14: Comparison of the segment durations of (a) the original syllable segment and (b) the representative syllable segment
The synthesized syllable segment obtained using such a feature vector sequence
has very different characteristics from the original input syllable. This is shown
in Figure 5.15(a) and Figure 5.15(b). It is observed that the spectrogram
of the synthesized syllable in Figure 5.15(b) is significantly different from the
spectrogram of the input syllable shown in Figure 5.15(a). Alternatively, instead
of repeating the middle frame, an attempt is made to repeat the frames at the
places where the time domain waveforms of the two syllables are similar. The
syllable synthesized using such a feature vector sequence showed better similarity
to the spectrogram of the input syllable. Figure 5.15(c) and Figure 5.15(d) are the
spectrograms of the synthesized syllables obtained by using the feature vectors
whose frames are repeated in accordance with the similarities in the time domain
waveforms of the input and representative syllables. It can be clearly observed that
the topmost formant, lost in Figure 5.15(b), is restored in Figure 5.15(c) and Figure 5.15(d).
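One way to realize 'repeat where the waveforms are similar' is sketched below: a hypothetical helper that scores time-aligned frames with normalized cross-correlation and returns the position best suited for repeating a frame. The thesis does not prescribe this particular measure; it is only one plausible choice.

    import numpy as np

    def best_repeat_position(input_frames, rep_frames):
        # Compare time-aligned frames of the input and representative
        # segments and return the index where they are most similar.
        def ncc(a, b):
            a = np.asarray(a, float) - np.mean(a)
            b = np.asarray(b, float) - np.mean(b)
            return float(np.dot(a, b) /
                         (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
        n = min(len(input_frames), len(rep_frames))
        scores = [ncc(input_frames[i], rep_frames[i]) for i in range(n)]
        return int(np.argmax(scores))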
Figure 5.15: Analysis of spectrograms for (a) the input syllable segment, (b) the synthesized syllable segment obtained with middle frame repetition, (c) the synthesized syllable segment with appropriate frame repetition and (d) the synthesized syllable segment with appropriate frame repetition at positions different from those in (c)
The analysis is also verified by synthesizing a part of an utterance. The utterance
chosen here is again “vanakkam”. The utterance consists of three syllable units
- /va/, /nak/ and /kam/. The duration mismatch between the input syllable
segments and the corresponding representative syllable segments is such that, for
/va/ frames are to be repeated, for /nak/ frames are to be deleted and for /kam/
no repetition or deletion is required. The utterance is synthesized in three different
forms:
1. Form 1: The LP coefficient vector sequence of the representative syllable
segments of /va/ and /nak/ are formed by repeating and deleting frames in
the middle.
2. Form 2: The LP coefficient vector sequence of the representative syllable
segments of /va/ and /nak/ are formed by repeating and deleting appropri-
ate frames.
3. Form 3: The LP coefficient vector sequence of the representative syllable
segments of /va/ and /nak/ are formed by repeating and deleting appropri-
ate frames, but the positions of repetition and deletion are slightly altered
from Form 2.
It is seen from Table 5.6 that the PESQ scores of Form 2 and Form 3 are higher
than the score for Form 1.
Table 5.6: PESQ scores of an utterance after appropriate frame repetition
Synthesis type PESQ score
Form 1 1.886
Form 2 2.224
Form 3 2.295
Moreover, while encoding the duration of the syllable segment, the number of
frames is rounded off to the next integer value. This results in a frame mismatch at
the boundary, where the last frame of the present syllable segment gets convolved
with the residual of the next syllable segment during synthesis. The mismatch
manifests itself as clipping in the synthesized waveform. Large
duration mismatch at the boundaries may lead to chopping of the synthesized sig-
nal, which in turn reduces the PESQ score. Figure 5.16(a) and Figure 5.16(b) show
the waveforms synthesized from MELP and the proposed vocoder respectively. It
can be clearly observed that the synthesized waveforms are similar except for the
clipping. This sentence is the same as the 4th sentence in Table 5.3, whose PESQ score
is 1.42.
Figure 5.16: (a) Synthesized speech obtained using the MELP codec and (b) synthesized speech obtained using the syllable based segment vocoder, with clipping visible in (b)
5.8 Summary
The operation of a syllable based segment vocoder is presented. The encoding
operation includes segmentation, system and source quantization, and duration
quantization. The decoding operation involves representing the system character-
istics in the form of LP coefficient vector sequences of the representative syllable
segments, modeling the excitation signal using the MELP, duration matching and
finally synthesizing the speech using an LP synthesis filter. The quality of the
output speech is compared with that of the 2.4 Kbps MELP standard codec. The
output speech is intelligible and natural. It has been observed that the choice of
the representative syllable segment is not always the best. Analysis indicates that
the performance is poor when a large number of syllables are wrongly recognized.
It is also observed that the duration mismatch near the segment boundaries
deteriorates the quality.
CHAPTER 6
CONCLUSIONS
6.1 Summary
The excessive demand for bandwidth in voice communications has triggered the
evolution of low bit rate speech coders. The coders are developed to preserve the
quality of the output speech at very low transmission bit rates. However, there
is always a trade-off between the bit rate and the speech quality. As the bit rate
decreases, the quality of the output speech also decreases. Research in low
bit rate speech coding aims at producing good quality speech at very low bit
rates. Segment vocoders are low bit rate coders that are capable of achieving
bit rates less than 2.4 Kbps. However, the quality of the speech is often low and
synthetic. This is because of the high compression at the segment level.
Most of the segment vocoders use phonemes or diphones as the segmental
units for compression. The segmental units are obtained using various segmenta-
tion techniques. The techniques are broadly classified into two categories. One
category uses algorithms which dynamically search for the regions of interest to
locate the boundaries. The algorithms are iterative and complex in nature. The
other category uses parametric modeling, where the models are built from a large
amount of segmental data. The models are then used to locate the boundaries
in the incoming speech signal. Parametric modeling requires the models to be
built on a large amount of data so that the variations of the segmental units are
captured effectively. Hence, there is a need to develop a simple and flexible seg-
mentation algorithm which does not require any prior information to locate the
boundaries. A signal processing technique called group delay based segmentation
is one such technique which is capable of locating the syllable boundaries. Since
the syllable is a larger unit than a phoneme or a diphone, selecting the syllable
as a segmental unit ensures better compression. Hence, in this work, syllable like
units are chosen for compression.
The compression is achieved by quantizing the system and source parameters
of the segmental units using codebooks prepared in advance. Generally, the system char-
acteristics are quantized at the segmental level, whereas the source parameters
are quantized at the frame level. For system codebook preparation, Vector Quan-
tization (VQ) and parametric modeling techniques are widely employed. As the
syllable is chosen as the segmental unit, VQ may not be feasible for clustering as it
does not capture the sequence information in syllables. Hence, parametric model-
ing using HMMs is employed. Unlike most of the parametric modeling techniques
that operate in the supervised mode, an unsupervised clustering algorithm is used.
The algorithm results in clusters that have similar sounding segments. For each
cluster, a representative syllable segment that represents the characteristics of all
the other syllable segments in the cluster is selected.
The system codebook contains the LP coefficient vector sequences of the repre-
sentative syllable segments. For source coding, parameters like pitch, voicing and
jitter are quantized using the respective codebooks. In the present work, the source
is encoded and modeled using MELP. Source modeling using MELP preserves
naturalness in the synthesized speech. With a system compression of 100 bps and
a source compression of 1.2 Kbps, the proposed vocoder operates at 1.4 Kbps.
Speech is synthesized at the receiver using an LP synthesis filter. The filter
is formed from the LP coefficient vector sequences of the representative syllable
segments stored in the system codebook. The durations of the input syllable
segment and the corresponding representative syllable segment are matched using
the frame repetition and deletion technique. The excitation signal is modeled
using MELP residual modeling. The modeled residual is passed through the LP
synthesis filter to produce the output speech which is intelligible and natural.
6.2 Conclusions
The following are the main conclusions drawn from the present work:
1. Syllable like units that are much larger than phoneme and diphone units can
be considered as units for compression in the segment vocoders.
2. The complexity of segmentation can be reduced by using a signal processing
based technique such as group delay based segmentation.
3. Syllable recognition and duration matching have a significant effect on the
quality of the output speech.
6.3 Criticism of the work
A novel syllable based segment vocoder is proposed in the present work. The
vocoder is capable of producing natural sounding speech with good intelligibility.
The following drawbacks are observed in the implementation of the vocoder.
1. The issue regarding the selection of the best representative unit is not fully
addressed. An automatic technique has to be developed to obtain a better
recognition unit for synthesis. The method used to find the representative
syllable segment may not be correct.
2. Long silences in the input speech signal have to be modeled effectively.
Presently, some manual correction of duration is done near the long silence
regions. Since the database used in the present work is news data, there
are no significantly long silences. But if the coder has to work on casual
speech, long silence regions have to be modeled separately.
3. The frame repetition and deletion technique is not an effective way to match
the durations. The quality of speech drastically reduces when there is a
significant mismatch near the segment boundaries.
6.4 Directions for Future Work
The proposed vocoder can be enhanced by extending the work in the following areas:
1. Speech quality can be improved by addressing the issues like syllable recog-
nition and duration matching. A two level encoding and varying the number
of states according to the duration of the syllable segment can be considered
to improve the recognition performance.
2. Instead of using frame repetition, techniques like DTW can be used to ef-
fectively match the durations of the input syllable segment and the corre-
sponding representative syllable segment.
3. Silence regions in speech are to be modeled separately for better quality of
the synthesized speech.
4. The vocoder has to be implemented for multiple languages.
5. Bit rates can be reduced by modeling the source information with a smaller
number of bits.
REFERENCES
[1] L. Hanzo, F. C. Somerville, and J. Woodard, Voice and Audio Compression for Wireless Communications. Wiley, 2007.
[2] S. Roucos, R. M. Schwartz, and J. Makhoul, “A segment vocoder at 150 bits/s,” Proceedings of International Conference on Acoustics, Speech and Signal Processing, vol. 8, pp. 61–64, April 1983.
[3] J. Picone and G. Doddington, “A phonetic vocoder,” Proceedings of International Conference on Acoustics, Speech and Signal Processing, vol. 1, pp. 580–583, May 1989.
[4] K. S. Lee and R. V. Cox, “A very low bit rate speech coder based on recognition/synthesis paradigm,” IEEE Transactions on Speech and Audio Processing, vol. 9, pp. 482–491, July 2001.
[5] J. Cernocky, G. B. Baudoin, and G. Chollet, “Segmental vocoder - going beyond the phonetic approach,” Proceedings of International Conference on Acoustics, Speech and Signal Processing, vol. 2, pp. 605–608, May 1998.
[6] K. Tokuda, T. Masuko, J. Hiroi, T. Kobayashi, and T. Kitamura, “A very low bitrate speech coder using HMM-based speech recognition/synthesis techniques,” Proceedings of International Conference on Acoustics, Speech and Signal Processing, vol. 2, pp. 609–612, May 1998.
[7] M. Ismail and K. Ponting, “Between recognition and synthesis - 300 b/s speech coding,” Proceedings of EUROSPEECH, vol. 1, pp. 441–444, 1997.
[8] T. Hoshiya, S. Sako, H. Zen, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura, “Improving performance of HMM-based very low bitrate speech coding,” Proceedings of International Conference on Acoustics, Speech and Signal Processing, vol. 1, pp. 800–803, April 2003.
[9] L. R. Rabiner and B. H. Juang, Fundamentals of Speech Recognition. Pearson Education Pvt. Ltd., 2003.
[10] L. Rabiner and R. Schafer, Digital Processing of Speech Signals. Prentice-Hall, Inc., Englewood Cliffs, New Jersey, 1978.
[11] V. Ramasubramanian and T. V. Sreenivas, “Automatically derived units for segment vocoders,” Proceedings of International Conference on Acoustics, Speech and Signal Processing, vol. 1, pp. 473–476, May 2004.
[12] T. Nagarajan and H. A. Murthy, “Group delay based segmentation of spontaneous speech into syllable-like units,” EURASIP Journal of Applied Signal Processing, vol. 17, pp. 2614–2625, 2004.
[13] A. Lakshmi and H. A. Murthy, “A syllable based continuous speech recognizer for Tamil,” Proceedings of INTERSPEECH, pp. 1878–1881, September 2006.
[14] Y. Shiraki and M. Honda, “LPC speech coding based on variable-length segment quantization,” IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 36, pp. 1437–1444, 1988.
[15] S. Roucos and A. M. Wilgus, “A waveform segment vocoder: A new approach for very low rate speech coding,” Proceedings of International Conference on Acoustics, Speech and Signal Processing, vol. 10, pp. 236–239, April 1985.
[16] S. Roucos, R. M. Schwartz, and J. Makhoul, “Vector quantization for very-low-rate coding of speech,” Proceedings of IEEE Globecom’82, pp. 1074–1078, 1982.
[17] D. Wong, B. H. Juang, and A. Gray, “An 800 bit/s vector quantization LPC vocoder,” IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 30, no. 5, pp. 770–780, October 1982.
[18] C. Tsao and R. M. Gray, “Matrix quantizer design for LPC speech using the generalized Lloyd algorithm,” IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 33, no. 3, pp. 537–545, June 1985.
[19] T. Nagarajan and H. A. Murthy, “Language identification using acoustic log-likelihoods of syllable like units,” Speech Communication, vol. 48, pp. 913–926, 2006.
[20] A. V. McCree and T. P. Barnwell, “A mixed excitation LPC vocoder model for low bit rate speech coding,” IEEE Transactions on Speech and Audio Processing, vol. 3, no. 4, pp. 242–250, July 1995.
[21] L. M. Supplee, R. P. Cohn, J. S. Collura, and A. V. McCree, “MELP: the new federal standard at 2400 bps,” Proceedings of International Conference on Acoustics, Speech and Signal Processing, vol. 2, pp. 1591–1594, April 1997.
[22] W. C. Chu, Speech Coding Algorithms. Wiley-IEEE, 2003.
[23] S. T. Bardenhagen, K. L. Brown, and R. D. Braun, “Low bit rate speech compression using hidden Markov models,” Proceedings of MILCOM, vol. 1, pp. 507–511, 1997.
[24] R. S. Kumar, N. Tamrakar, and P. Rao, “Segment based MBE speech coding at 1000 bps,” Proceedings of National Conference on Communication, February 2008.
[25] M. Felici, M. Borgatti, and R. Guerrieri, “Very low bit rate speech coding using a diphone-based recognition and synthesis approach,” IEE Electronic Letters, vol. 34, no. 9, pp. 859–860, 1998.
[26] G. Benbassat and X. Delon, “Low bit rate speech coding by concatenation of sound units and prosody coding,” Proceedings of International Conference on Acoustics, Speech and Signal Processing, vol. 9, pp. 5–8, March 1984.
[27] P. Vepyak and A. B. Bradley, “Consideration of processing strategies for very low rate compression of wide band speech signal with known text transcription,” Proceedings of EUROSPEECH, vol. 3, pp. 1279–1282, 1997.
[28] H. C. Chen, C. Y. Chen, K. M. Tsou, and O. T. Chen, “A 0.75 kbps speech codec using recognition and synthesis schemes,” Proceedings of IEEE Workshop on Speech Coding in Telecommunications, vol. 3, pp. 27–29, 1997.
[29] K. S. Lee and R. V. Cox, “A segmental speech coder based on a concatenative TTS,” Speech Communication, vol. 38, no. 1, pp. 89–100, September 2002.
[30] V. Kamakshi Prasad, “Segmentation and Recognition of Continuous Speech,” Ph.D. dissertation, Department of Computer Science and Engineering, Indian Institute of Technology Madras, India, January 2002.
[31] G. Lakshmi Sarada, “Automatic Transcription of Continuous Speech for Indian Languages,” Master’s thesis, Department of Computer Science and Engineering, Indian Institute of Technology Madras, India, December 2005.
[32] A. Lakshmi, “A Syllable based Continuous Speech Recognizer for Indian Languages,” Master’s thesis, Department of Computer Science and Engineering, Indian Institute of Technology Madras, India, April 2007.
[33] Samuel Thomas, “Natural Sounding Text-to-Speech Synthesis based on Syllable like Units,” Master’s thesis, Department of Computer Science and Engineering, Indian Institute of Technology Madras, India, February 2007.
[34] N. Sridhar Krishna, “Text-to-Speech Synthesis system for Indian Languages within Festival Framework,” Master’s thesis, Department of Computer Science and Engineering, Indian Institute of Technology Madras, India, January 2004.
[35] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura, “Mixed excitation for HMM-based speech synthesis,” Proceedings of EUROSPEECH, pp. 600–604, 2001.
[36] R. Maia, “A novel excitation approach for HMM-based speech synthesis - report IV,” A report by HTS group - Nagoya Institute of Technology, May 2007.
[37] Y. J. Kim and A. Conkie, “Automatic segmentation combining an HMM based approach and spectral boundary correction,” Proceedings of International Conference on Spoken Language Processing, pp. 145–148, September 2002.
[38] T. Svendsen and F. K. Soong, “On the automatic segmentation of speech signals,” Proceedings of International Conference on Acoustics, Speech and Signal Processing, vol. 12, pp. 77–80, April 1987.
[39] J. P. Martens and L. Depuydt, “Broad phonetic classification and segmentation of continuous speech by means of Neural Networks and Dynamic Programming,” Speech Communication, vol. 10, pp. 81–90, February 1991.
[40] S. Wu, E. D. Kingsbury, N. Morgan, and S. Greenberg, “Incorporating information from syllable length time scales into automatic speech recognition,” Proceedings of International Conference on Acoustics, Speech and Signal Processing, vol. 2, pp. 987–990, April 1997.
[41] H. A. Murthy and B. Yegnanarayana, “Formant extraction from minimum phase group delay function,” Speech Communication, vol. 10, pp. 209–221, August 1991.
[42] V. K. Prasad, T. Nagarajan, and H. A. Murthy, “Automatic segmentation of continuous speech using minimum phase group delay functions,” Speech Communication, vol. 42, pp. 429–446, 2004.
[43] T. Nagarajan, R. M. Hegde, and H. A. Murthy, “Segmentation of speech into syllable-like units,” Proceedings of EUROSPEECH, pp. 2893–2896, September 2003.
[44] “Database for Indian Languages,” Speech and Vision Lab, Indian Institute of Technology Madras, India, 2001.
[45] T. Nagarajan and H. A. Murthy, “Language identification using parallel syllable-like unit recognition,” Proceedings of International Conference on Acoustics, Speech and Signal Processing, vol. 1, pp. 401–404, May 2004.
[46] G. L. Sarada, T. Nagarajan, and H. A. Murthy, “Multiple frame size and multiple frame rate feature extraction for speech recognition,” Proceedings of SPCOM, pp. 592–595, December 2004.
[47] T. Masuko, K. Tokuda, and T. Kobayashi, “A very low bit rate speech coder using HMM with speaker adaptation,” Proceedings of 5th International Conference on Spoken Language Processing, vol. 2, pp. 507–510, December 1998.
[48] G. L. Sarada, N. Hemalatha, T. Nagarajan, and H. A. Murthy, “Automatic transcription of continuous speech using unsupervised and incremental training,” Proceedings of INTERSPEECH, pp. 405–408, October 2004.
[49] T. Nagarajan and H. A. Murthy, “An approach to segmentation and labeling of continuous speech without bootstrapping,” Proceedings of National Conference on Communication, pp. 508–512, January 2004.
[50] “HTK Speech Recognition Toolkit,” http://htk.eng.cam.ac.uk.
[51] “2.4 Kbps MELP: Federal standard speech coder, version 1.2,” Texas Instruments, 1996.
[52] “Perceptual Evaluation of Speech Quality (PESQ), ITU-T P.862 (02/2001),” www.pesq.org.
LIST OF PUBLICATIONS
1. Sadhana Chevireddy, Hema A. Murthy and C. Chandra Sekhar “Signal pro-
cessing based segmentation and HMM based acoustic clustering of syllable
segments for low bitrate segment vocoder at 1.4 Kbps,” In the proceedings
of 16th European Signal Processing Conference (EUSIPCO-2008), August 25
- 29, Lausanne, Switzerland.
2. Sadhana Chevireddy, Hema A. Murthy and C. Chandra Sekhar “A syllable
based segment vocoder,” in Proceedings of National Conference on Commu-
nication (NCC-2008), pp. 442-445, February 2 - 4, IIT Bombay.
CURRICULUM VITAE
1. Name: Sadhana Chevireddy
2. Date of Birth: 13th August, 1984
3. Educational Qualifications:
(a) 2008 - Master of Science (M.S)
(b) 2005 - Bachelor of Technology (B.Tech)
4. Permanent address:
8-186, New Balaji Colony
M R Palle Road
Tirupati - 517502
Andhra Pradesh
Ph No: 0877 - 2243678
email: [email protected]