Festival training



Description: To develop a TTS system using Festival software.


  • Festival TTS Training Material

TTS Group, Indian Institute of Technology Madras

Chennai - 600036, India

    June 5, 2012


  • Contents

    1 Introduction
      1.1 Nature of scripts of Indian languages
      1.2 Convergence and divergence

    2 What is Text to Speech Synthesis?
      2.1 Components of a text-to-speech system
      2.2 Normalization of non-standard words
      2.3 Grapheme-to-phoneme conversion
      2.4 Prosodic analysis
      2.5 Methods of speech generation
        2.5.1 Parametric synthesis
        2.5.2 Concatenative synthesis
      2.6 Primary components of the TTS framework
      2.7 Screen readers for the visually challenged

    3 Overall Picture

    4 Labeling Tool
      4.1 How to Install LabelingTool
      4.2 Troubleshooting of LabelingTool

    5 Labeling Tool User Manual
      5.1 How To Use Labeling Tool
      5.2 How to do label correction using Labeling tool
      5.3 Viewing the labelled file
      5.4 Control file
      5.5 Performance results for 6 Indian Languages
      5.6 Limitations of the tool

    6 Unit Selection Synthesis Using Festival
      6.1 Cluster unit selection
      6.2 Choosing the right unit type
      6.3 Collecting databases for unit selection
      6.4 Preliminaries
      6.5 Building utterance structures for unit selection
      6.6 Making cepstrum parameter files
      6.7 Building the clusters

    7 Building Festival Voice

    8 Customizing festival for Indian Languages
      8.1 Some of the parameters that were customized to deal with Indian languages in the festival framework
      8.2 Modifications in source code

    9 Trouble Shooting in festival
      9.1 Troubleshooting (Issues related with festival)
      9.2 Troubleshooting (Issues that might occur while synthesizing)

    10 ORCA Screen Reader

    11 NVDA Windows Screen Reader
      11.1 Compiling Festival in Windows

    12 SAPI compatibility for festival voice

    13 Sphere Converter Tool
      13.1 Extraction of details from header of the input file
        13.1.1 Calculate sample minimum and maximum values
        13.1.2 RAW Files
        13.1.3 MULAW Files
        13.1.4 Output in encoded format
      13.2 Config file

    14 Sphere Converter User Manual
      14.1 How to Install the Sphere converter tool
      14.2 How to use the tool
      14.3 Fields in Properties
      14.4 Screenshot
      14.5 Example of data in the Config file (default properties)
      14.6 Limitations to the tool

  • 1 Introduction

This training is conducted for new members who have joined the TTS consortium. The main aim of the TTS consortium is to develop text-to-speech (TTS) systems in all 22 official languages of India in order to build screen readers, which are spoken interfaces for information access that help visually challenged people use a computer with ease and make computing ubiquitous and inclusive.

    1.1 Nature of scripts of Indian languages

The scripts in Indian languages have originated from the ancient Brahmi script. The basic units of the writing system are referred to as Aksharas. The properties of Aksharas are as follows:

    1. An Akshara is an orthographic representation of a speech sound in an Indian language

    2. Aksharas are syllabic in nature

3. The typical forms of an Akshara are V, CV, CCV and CCCV, thus having a generalized form of C*V, where C denotes a consonant and V denotes a vowel (a small illustration follows this list)
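As a quick illustration of the C*V structure, here is an informal sketch; the consonant/vowel classification of the romanized phones below is hypothetical and only for demonstration:

    import re

    # Hypothetical classification of a few romanized phones as consonant (C) or vowel (V).
    PHONE_CLASS = {"k": "C", "r": "C", "s": "C", "a": "V", "i": "V", "u": "V"}

    def is_valid_akshara(phones):
        """Check that a phone sequence has the generalized Akshara form C*V."""
        pattern = "".join(PHONE_CLASS[p] for p in phones)
        return re.fullmatch(r"C*V", pattern) is not None

    print(is_valid_akshara(["k", "a"]))        # CV   -> True
    print(is_valid_akshara(["s", "r", "i"]))   # CCV  -> True
    print(is_valid_akshara(["a"]))             # V    -> True
    print(is_valid_akshara(["k", "r"]))        # CC   -> False (no vowel)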

As Indian languages are Akshara based, and an Akshara is a subset of a syllable, a syllable-based unit selection synthesis system has been built for Indian languages. Further, a syllable corresponds to a basic unit of production, as opposed to the diphone or the phone. Earlier efforts by the consortium members, in particular IIIT Hyderabad and IIT Madras, indicate that natural-sounding synthesisers for Indian languages can be built using the syllable as the basic unit.

    1.2 Convergence and divergence

The official languages of India, except English and Urdu, share a common phonetic base, i.e., they share a common set of speech sounds. This common phonetic base consists of around 50 phones, including 15 vowels and 35 consonants. While all of these languages share a common phonetic base, some of the languages such as Hindi, Marathi and Nepali also share a common script known as Devanagari. But languages such as Telugu, Kannada and Tamil have their own scripts.

The property that makes these languages unique can be attributed to the phonotactics of each of these languages rather than to the scripts and speech sounds. Phonotactics refers to the permissible combinations of phones that can co-occur in a language. This implies that the distribution of syllables encountered in each language is different. Another dimension in which the Indian languages significantly differ is prosody, which includes the duration, intonation and prominence associated with each syllable in a word or a sentence.


  • 2 What is Text to Speech Synthesis?

A Text to Speech Synthesis system converts text input to speech output. The conversion of text into spoken form is deceptively nontrivial. A naive approach is to consider storing and concatenating the basic sounds (also referred to as phones) of a language to produce a speech waveform. But natural speech exhibits co-articulation, i.e., the effect of coupling two sounds together, and prosody at the syllable, word, sentence and discourse level, which cannot be synthesised by simple concatenation of phones. Another method often employed is to store a huge dictionary of the most common words. However, such a method cannot synthesise the millions of names and acronyms which are not in the dictionary. It also cannot generate appropriate intonation and duration for words in different contexts. Thus a text-to-speech approach using phones provides flexibility but cannot produce intelligible and natural speech, while word level concatenation produces intelligible and natural speech but is not flexible. In order to balance flexibility and intelligibility/naturalness, sub-word units such as diphones, which capture the essential coarticulation between adjacent phones, are used as the units in a text-to-speech system.

    2.1 Components of a text-to-speech system

A typical architecture of a Text-to-Speech (TTS) system is shown in the figure below. The components of a text-to-speech system can be broadly categorized into text processing and methods of speech generation.

Text processing: In the real world, the typical input to a text-to-speech system is text as available in electronic documents, newspapers, blogs, emails etc. Real-world text is anything but a sequence of words available in a standard dictionary. The text contains several non-standard words such as numbers, abbreviations, homographs and symbols built using punctuation characters, such as the exclamation mark (!), smileys (:-)) etc. The goal of the text processing module is to process the input text, normalize the non-standard words, predict the prosodic pauses and generate the appropriate phone sequences for each of the words.

    2.2 Normalization of non-standard words

Real-world text consists of words whose pronunciation is typically not found in dictionaries or lexicons, such as IBM, CMU, MSN etc. Such words are referred to as non-standard words (NSW). The various categories of NSW are listed below (a minimal normalization sketch follows the list):

1. Numbers, whose pronunciation changes depending on whether they refer to currency, time, telephone numbers, zip codes etc.

    2. Abbreviations, contractions, acronyms such as ABC, US, approx., Ctrl-C, lb.,

    3. Punctuations 3-4, +/-, and/or,

    4. Dates, time, units and URLs.
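The idea behind NSW normalization can be sketched as follows. The expansion tables and the four-digit-year heuristic below are illustrative assumptions for this sketch, not the consortium's actual rules:

    # Illustrative expansion tables (assumptions for this sketch only).
    ABBREVIATIONS = {"approx.": "approximately", "lb.": "pound"}
    ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine"]
    TEENS = ["ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
             "sixteen", "seventeen", "eighteen", "nineteen"]
    TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"]

    def two_digit(n):
        """Read a number 0-99 as words."""
        if n < 10:
            return ONES[n]
        if n < 20:
            return TEENS[n - 10]
        return TENS[n // 10] + ("" if n % 10 == 0 else " " + ONES[n % 10])

    def expand_token(token):
        """Expand one non-standard word into speakable words."""
        if token in ABBREVIATIONS:
            return ABBREVIATIONS[token]
        if token.isdigit() and len(token) == 4:
            # Heuristic: read a 4-digit number as a year, e.g. "1985" -> "nineteen eighty five".
            return two_digit(int(token[:2])) + " " + two_digit(int(token[2:]))
        if token.isdigit():
            # Long digit strings (telephone numbers, zip codes) are read digit by digit.
            return " ".join(ONES[int(d)] for d in token)
        return token

    print(expand_token("1985"))        # nineteen eighty five
    print(expand_token("600036"))      # six zero zero zero three six
    print(expand_token("approx."))     # approximately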

    2.3 Grapheme-to-phoneme conversion

Given the sequence of words, the next step is to generate a sequence of phones. For languages such as Spanish, Telugu and Kannada, where there is a good correspondence between what is written and what is spoken, a set of simple rules may often suffice. For languages such as English, where the relationship between the orthography and pronunciation is complex, a standard pronunciation dictionary such as CMU-DICT is used. To handle unseen words, a grapheme-to-phoneme generator is built using machine learning techniques.
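For a language with a near one-to-one letter-to-sound correspondence, a rule-based mapping can be as simple as the sketch below. The letter-to-phone table, the phone symbols and the example word are purely illustrative assumptions:

    # Illustrative letter-to-phone rules; longest match first so digraphs win over single letters.
    RULES = {"ch": "CH", "sh": "SH", "a": "AA", "e": "EH", "i": "IY",
             "o": "OW", "u": "UW", "k": "K", "m": "M", "n": "N", "t": "T"}

    def grapheme_to_phoneme(word):
        """Greedy longest-match conversion of a written word to a phone sequence."""
        phones, i = [], 0
        while i < len(word):
            for length in (2, 1):                      # try digraphs before single letters
                chunk = word[i:i + length]
                if chunk in RULES:
                    phones.append(RULES[chunk])
                    i += length
                    break
            else:
                i += 1                                 # unknown letter: skip it in this sketch
        return phones

    print(grapheme_to_phoneme("chennai"))   # ['CH', 'EH', 'N', 'N', 'AA', 'IY']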


  • 2.4 Prosodic analysis

Prosodic analysis deals with modeling and generating appropriate duration and intonation contours for the given text. This is inherently difficult since prosody is absent in text. For example, the sentences "where are you going?", "where are you GOING?" and "where are YOU going?" have the same text content but can be uttered with different intonation and duration to convey different meanings. To predict appropriate duration and intonation, the input text needs to be analyzed. This can be performed by a variety of algorithms including simple rules, example-based techniques and machine learning algorithms. The generated duration and intonation contour can be used to manipulate the context-insensitive diphones in diphone based synthesis or to select an appropriate unit in unit selection voices.
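As one deliberately oversimplified illustration of the rule-based approach, the sketch below assigns a duration from a base value plus two lengthening rules. The base durations and stretch factors are invented for the example and are not from any real system:

    # Hypothetical base durations in milliseconds per phone class (illustrative values only).
    BASE_MS = {"vowel": 80, "consonant": 60, "pause": 200}

    def predict_duration(phone_class, phrase_final=False, stressed=False):
        """Toy rule-based duration model: lengthen phrase-final and stressed units."""
        ms = BASE_MS[phone_class]
        if phrase_final:
            ms *= 1.4          # phrase-final lengthening
        if stressed:
            ms *= 1.2          # prominence lengthening
        return round(ms)

    print(predict_duration("vowel"))                                     # 80
    print(predict_duration("vowel", phrase_final=True))                  # 112
    print(predict_duration("vowel", phrase_final=True, stressed=True))   # 134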

    2.5 Methods of speech generation

The methods of conversion of a phone sequence to a speech waveform can be categorized into parametric, concatenative and statistical parametric synthesis.


  • 2.5.1 Parametric synthesis

Parameters such as formants and linear prediction coefficients are extracted from the speech signal of each phone unit. These parameters are modified at synthesis time to incorporate the co-articulation and prosody of a natural speech signal. The required modifications are specified in terms of rules which are derived manually from observations of speech data. These rules cover duration, intonation, co-articulation and the excitation function. Examples of early parametric synthesis systems are Klatt's formant synthesis and MITalk.

    2.5.2 Concatenative synthesis

Derivation of rules in parametric synthesis is a laborious task. Also, the quality of speech synthesized using traditional parametric synthesis is found to be robotic. This has led to the development of concatenative synthesis, where examples of speech units are stored and used during synthesis.

Concatenative synthesis is based on the concatenation (or stringing together) of segments of recorded speech. Generally, concatenative synthesis produces the most natural-sounding synthesized speech. However, differences between natural variations in speech and the nature of the automated techniques for segmenting the waveforms sometimes result in audible glitches in the output. There are three main sub-types of concatenative synthesis.

1. Unit selection synthesis - Unit selection synthesis uses large databases of recorded speech. During database creation, each recorded utterance is segmented into some or all of the following: individual phones, diphones, half-phones, syllables, morphemes, words, phrases, and sentences. Typically, the division into segments is done using a specially modified speech recognizer set to a forced alignment mode with some manual correction afterward, using visual representations such as the waveform and spectrogram. An index of the units in the speech database is then created based on the segmentation and acoustic parameters like the fundamental frequency (pitch), duration, position in the syllable, and neighboring phones. At run time, the desired target utterance is created by determining the best chain of candidate units from the database (unit selection). This process is typically achieved using a specially weighted decision tree.

Unit selection provides the greatest naturalness, because it applies only a small amount of digital signal processing (DSP) to the recorded speech. DSP often makes recorded speech sound less natural, although some systems use a small amount of signal processing at the point of concatenation to smooth the waveform. The output from the best unit-selection systems is often indistinguishable from real human voices, especially in contexts for which the TTS system has been tuned. However, maximum naturalness typically requires unit-selection speech databases to be very large, in some systems ranging into the gigabytes of recorded data, representing dozens of hours of speech. Also, unit selection algorithms have been known to select segments from a place that results in less than ideal synthesis (e.g. minor words become unclear) even when a better choice exists in the database. Recently, researchers have proposed various automated methods to detect unnatural segments in unit-selection speech synthesis systems. (A simplified sketch of the selection search is given after this list.)

2. Diphone synthesis - Diphone synthesis uses a minimal speech database containing all the diphones (sound-to-sound transitions) occurring in a language. The number of diphones depends on the phonotactics of the language: for example, Spanish has about 800 diphones, and German about 2500. In diphone synthesis, only one example of each diphone is contained in the speech database. At runtime, the target prosody of a sentence is superimposed on these minimal units by means of digital signal processing techniques such as linear predictive coding, PSOLA or MBROLA. Diphone synthesis suffers from the sonic glitches of concatenative synthesis and the robotic-sounding nature of formant synthesis, and has few of the advantages of either approach other than small size. As such, its use in commercial applications is declining, although it continues to be used in research because there are a number of freely available software implementations.

3. Domain-specific synthesis - Domain-specific synthesis concatenates prerecorded words and phrases to create complete utterances. It is used in applications where the variety of texts the system will output is limited to a particular domain, like transit schedule announcements or weather reports. The technology is very simple to implement, and has been in commercial use for a long time, in devices like talking clocks and calculators. The level of naturalness of these systems can be very high because the variety of sentence types is limited, and they closely match the prosody and intonation of the original recordings. Because these systems are limited by the words and phrases in their databases, they are not general-purpose and can only synthesize the combinations of words and phrases with which they have been preprogrammed. The blending of words within naturally spoken language, however, can still cause problems unless the many variations are taken into account.

For example, in non-rhotic dialects of English the "r" in words like "clear" is usually only pronounced when the following word begins with a vowel (e.g. "clear out"). Likewise in French, many final consonants are no longer silent if followed by a word that begins with a vowel, an effect called liaison. This alternation cannot be reproduced by a simple word-concatenation system, which would require additional complexity to be context-sensitive.
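To make the selection step in sub-type 1 concrete, here is a heavily simplified sketch of a target-cost/join-cost search over candidate units. The cost functions, features and weights are invented for illustration; a real system such as Festival's clunits module uses cluster trees and acoustic distances instead:

    def target_cost(target, unit):
        """How well a candidate unit's prosodic features match the target specification."""
        return abs(target["f0"] - unit["f0"]) / 50.0 + abs(target["dur"] - unit["dur"]) / 0.05

    def join_cost(prev_unit, unit):
        """How smoothly two units concatenate (here: just F0 mismatch at the join)."""
        return abs(prev_unit["f0"] - unit["f0"]) / 50.0

    def select_units(targets, candidates):
        """Viterbi search for the cheapest chain of candidate units, one per target."""
        best = [{}]  # best[i][unit_index] = (total_cost, backpointer)
        for j, unit in enumerate(candidates[0]):
            best[0][j] = (target_cost(targets[0], unit), None)
        for i in range(1, len(targets)):
            best.append({})
            for j, unit in enumerate(candidates[i]):
                tc = target_cost(targets[i], unit)
                prev = min(best[i - 1].items(),
                           key=lambda kv: kv[1][0] + join_cost(candidates[i - 1][kv[0]], unit))
                best[i][j] = (prev[1][0] + join_cost(candidates[i - 1][prev[0]], unit) + tc, prev[0])
        # Backtrack from the cheapest final unit.
        j = min(best[-1], key=lambda k: best[-1][k][0])
        path = [j]
        for i in range(len(targets) - 1, 0, -1):
            j = best[i][j][1]
            path.append(j)
        return list(reversed(path))

    targets = [{"f0": 120, "dur": 0.10}, {"f0": 125, "dur": 0.08}]
    candidates = [[{"f0": 118, "dur": 0.11}, {"f0": 150, "dur": 0.10}],
                  [{"f0": 126, "dur": 0.09}, {"f0": 100, "dur": 0.08}]]
    print(select_units(targets, candidates))   # [0, 0]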

The speech units used in concatenative synthesis are typically at the diphone level so that the natural co-articulation is retained. Duration and intonation are derived either manually or automatically from the data and are incorporated at synthesis time. Examples of diphone synthesizers are Festival diphone synthesis and MBROLA. The possibility of storing more than one example of a diphone unit, due to the increase in storage and computation capabilities, has led to the development of unit selection synthesis. Multiple examples of a unit, along with the relevant linguistic and phonetic context, are stored and used in unit selection synthesis. The quality of unit selection synthesis is found to be more natural than diphone and parametric synthesis. However, unit selection synthesis lacks consistency, i.e., the quality varies across utterances.

    2.6 Primary components of the TTS framework

1. Speech Engine - One of the most widely used speech engines is eSpeak. eSpeak uses the formant synthesis method, which allows many languages to be provided with a small footprint. The speech synthesized is intelligible and provides quick responses, but lacks naturalness. The demand is for a high quality, natural sounding TTS system. We have used the Festival speech synthesis system developed at The Centre for Speech Technology Research, University of Edinburgh, which provides a framework for building speech synthesis systems and offers full text to speech support through a number of APIs. A large corpus based unit selection paradigm has been employed. This paradigm is known to produce intelligible, natural sounding speech output, but has a larger footprint.

2. Screen Readers - The role of a screen reader is to identify and interpret what is being displayed on the screen and transfer it to the speech engine for synthesis. JAWS is the most popular screen reader used worldwide for Microsoft Windows based systems. But the main drawback of this software is its high cost, approximately 1300 USD, whereas the average per capita income in India is 1045 USD. Different open source screen readers are freely available. We chose ORCA for Linux based systems and NVDA for Windows based systems. ORCA is a flexible screen reader that provides access to the graphical desktop via user-customizable combinations of speech, braille and magnification. ORCA supports the Festival GNOME speech synthesizer and comes bundled with popular Linux distributions like Ubuntu and Fedora. NVDA is a free screen reader which enables vision impaired people to access computers running Windows. NVDA is popular among the members of the AccessIndia community. AccessIndia is a mailing list which provides an opportunity for visually impaired computer users in India to exchange information as well as conduct discussions related to assistive technology and other accessibility issues. NVDA has already been integrated with the Festival speech engine by Olga Yakovleva.

3. Typing tool for Indian Languages - The typing tools map the qwerty keyboard to Indian language characters. Widely used tools to input data in Indian languages are the Smart Common Input Method (SCIM) and the inbuilt InScript keyboard, for Linux and Windows systems respectively. The same have been used for our TTS systems as well.

    2.7 Screen readers for the visually challenged

India is home to the world's largest visually challenged (VC) population. In today's digital world, disability is equated to inability. Little attention is paid to people with disabilities, and social inclusion and acceptance are always a challenge. The perceived inability of people with disabilities, the perceived cost of special education and attitudes towards inclusive education are major constraints for effective delivery of education. Education is THE means of developing the capabilities of people with disabilities, to enable them to develop their potential, become self sufficient, escape poverty and provide a means of entry to fields previously denied to them. The aim of this project is to make a difference in the lives of VC persons. VC persons need to depend on others to access common information that others take for granted, such as newspapers, bank statements, and scholastic transcripts. Assistive technologies (AT) are necessary to enable physically challenged persons to become part of the mainstream of society. A screen reader is an assistive technology potentially useful to people who are visually challenged, visually impaired, illiterate or learning disabled, to use/access standard computer software, such as word processors, spreadsheets, email and the Internet.

Before the start of this project, the Indian Institute of Technology, Madras (IIT Madras) had been conducting a training programme for visually challenged people, to enable them to use the computer with the screen reader JAWS, with English as the language. Although the VC persons have benefited from this programme, most of them felt that:

The English accent was difficult to understand.
Most students would have preferred a reader in their native language.
They would prefer English spoken in an Indian accent.
The price for an individual purchase of JAWS was very high.

Against this backdrop, it was felt imperative to build assistive technologies in the vernacular.

An initiative was taken by DIT, Ministry of Information Technology, to sponsor the development of:

    1. Natural sounding Text-to-speech synthesis systems in different Indian languages

2. Integration of these TTSes with open source screen readers.


  • 3 Overall Picture

    1. Data Collection - Text crawled from a news site and a site for stories for children.

2. Cleaning up of Data - From the crawled data, sentences were picked to maximize syllable coverage.

3. Recording - The sentences that were picked were then recorded in a studio, which was a completely noise-free environment.

4. Labeling - The wave files were then manually labeled using the semi-automatic labeling tool to get accurate syllable boundaries.

5. Training - Using the wave files and their transcriptions, the Indian language unit selection voice was built.


6. Testing - Using the voice built, a MOS test was conducted with visually challenged end users as the evaluators.


  • 4 Labeling Tool

It is a widely accepted fact that the accuracy of labeling of speech files has a great bearing on the quality of unit selection synthesis. The process of manual labeling is a time consuming and daunting task. It is also not trivial to label waveforms manually at the syllable level. The DONLabel labeling tool provides an automatic way of performing labeling, given an input waveform and the corresponding text in utf8 format. The tool makes use of group delay based segmentation to provide the segment boundaries. The size of the segment labels generated can vary from monosyllables to polysyllables, as the Window Scale Factor (WSF) parameter is varied from small to large values. Our labeling process makes use of:

The Ergodic HMM (EHMM) labeling procedure provided by Festival
The group delay based algorithm (GD)
The Vowel Onset Point (VOP) detection algorithm

The Labeling tool displays a panel which shows the segment boundaries estimated by the group delay algorithm, another panel which shows the segment boundaries as estimated by the EHMM process, and a panel for VOP, which shows how many vowel onset points are present between each pair of segment boundaries provided by the group delay algorithm. This helps greatly in adjusting the labels provided by the group delay algorithm, if necessary, by comparing the labeling outputs of both the EHMM process and the VOP algorithm. By using VOP as an additional cue, manual intervention during the labeling process can be eliminated. It would also improve the accuracy of the labels generated by the labeling tool.
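The internals of the group delay computation are beyond the scope of this manual, but the role of the WSF can be pictured with the toy sketch below: boundaries are taken as peaks of a smoothed boundary-evidence curve, and a larger WSF widens the smoothing window, which tends to merge nearby peaks and hence yields fewer boundaries. The evidence values and the peak-picking rule are invented for illustration and are not the tool's actual algorithm:

    import numpy as np

    def pick_boundaries(evidence, wsf, frame_shift_s=0.01):
        """Smooth a boundary-evidence curve with a window proportional to WSF and
        return the times (in seconds) of local maxima that rise above the mean."""
        win = 2 * wsf + 1                              # larger WSF -> wider smoothing window
        smooth = np.convolve(evidence, np.ones(win) / win, mode="same")
        peaks = [i for i in range(1, len(smooth) - 1)
                 if smooth[i - 1] < smooth[i] >= smooth[i + 1] and smooth[i] > smooth.mean()]
        return [round(i * frame_shift_s, 3) for i in peaks]

    # Synthetic evidence curve (illustrative only).
    rng = np.random.default_rng(0)
    evidence = np.abs(rng.standard_normal(300))
    for wsf in (3, 5, 12):
        print("WSF", wsf, "->", len(pick_boundaries(evidence, wsf)), "boundaries")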

The tool works for 6 different Indian languages, namely:

Hindi, Tamil, Malayalam, Marathi, Telugu and Bengali

    The tool also displays the text (utf8) in segmented format along with the speech file.

    4.1 How to Install LabelingTool

1. Copy the html folder to the /var/www folder. If the www folder is not there in /var, create a folder named www and extract the html folder into it. So we have the labelingTool code in /var/www/html/labelingTool/

2. Install the Java compiler using the following commands:

       sudo apt-get install sun-java6-jdk

   The following error may come:

       Reading package lists... Done
       Building dependency tree
       Reading state information... Done
       Package sun-java6-jdk is not available, but is referred to by another package.
       This may mean that the package is missing, has been obsoleted, or is only available from another source
       E: Package sun-java6-jdk has no installation candidate

       sudo apt-get install sun-java6-jre

   The following error may come:

       Reading package lists... Done
       Building dependency tree
       Reading state information... Done
       Package sun-java6-jre is not available, but is referred to by another package.
       This may mean that the package is missing, has been obsoleted, or is only available from another source
       E: Package sun-java6-jre has no installation candidate

One solution is:

       sudo add-apt-repository "deb http://archive.canonical.com/ lucid partner"
       sudo add-apt-repository "deb http://ftp.debian.org/debian squeeze main contrib non-free"
       sudo add-apt-repository "deb http://ppa.launchpad.net/chromium-daily/ppa/ubuntu/ lucid main"
       sudo add-apt-repository "deb http://ppa.launchpad.net/flexiondotorg/java/ubuntu/ lucid main"
       sudo apt-get update

The other solution is: For Ubuntu 10.04 LTS, the sun-java6 packages have been dropped from the Multiverse section of the Ubuntu archive. It is recommended that you use openjdk-6 instead.

If you cannot switch from the proprietary Sun JDK/JRE to OpenJDK, you can install the sun-java6 packages from the Canonical Partner Repository. You can configure your system to use this repository via the command line:

sudo add-apt-repository "deb http://archive.canonical.com/ lucid partner"
sudo apt-get update
sudo apt-get install sun-java6-jre sun-java6-plugin
sudo apt-get install sun-java6-jdk
sudo update-alternatives --config java

For Ubuntu 10.10, the sun-java6 packages have been dropped from the Multiverse section of the Ubuntu archive. It is recommended that you use openjdk-6 instead.

If you cannot switch from the proprietary Sun JDK/JRE to OpenJDK, you can install the sun-java6 packages from the Canonical Partner Repository. You can configure your system to use this repository via the command line:

sudo add-apt-repository "deb http://archive.canonical.com/ maverick partner"
sudo apt-get update
sudo apt-get install sun-java6-jre sun-java6-plugin
sudo apt-get install sun-java6-jdk
sudo update-alternatives --config java

If the above does not work (for other versions of Ubuntu), then you can create a local repository as follows:

cd ~/
wget https://github.com/flexiondotorg/oab-java6/raw/0.2.1/oab-java6.sh -O oab-java6.sh
chmod +x oab-java6.sh
sudo ./oab-java6.sh

and then run:

    sudo apt-get install sun-java6-jdk
    sudo apt-get install sun-java6-jre

Source: https://github.com/flexiondotorg/oab-java6/blob/a04949f242777eb040150e53f4dbcd4a3ccb7568/README.rst

3. Install PHP using the following command:

       sudo apt-get install php5

4. Install apache2 using the following command:

       sudo apt-get install apache2

   Update the paths in the following file: /etc/apache2/sites-available/default
   Set all paths of cgi-bin to /var/www/html/cgi-bin. A sample default file is attached.

5. Install speech tools using the following command:

       sudo apt-get install speech-tools

6. Install tcsh using the following command:

       sudo apt-get install tcsh

7. Enable JavaScript in the properties of the browser used. Use Google Chrome or Mozilla Firefox.

8. Install the Java plugin for the browser:

       sudo apt-get install sun-java6-plugin

   Create a symbolic link to the Java plugin libnpjp2.so using the following commands:

sudo ln -s /usr/lib/jvm/java-6-sun/jre/plugin/i386/libnpjp2.so /etc/alternatives/mozilla-javaplugin.so

sudo ln -s /etc/alternatives/mozilla-javaplugin.so /usr/lib/mozilla/plugins/libnpjp2.so

9. Give full permissions to the html folder:

       sudo chmod -R 777 html/

10. Add the following code to /etc/java-6-sun/security/java.policy:

grant {
    permission java.security.AllPermission;
};

11. In the /var/www/html/labelingTool/jsrc/install file, make sure that the correct path of javac is provided as per your installation. For example: /usr/lib/jvm/java-6-sun-1.6.0.26/bin/javac. The version of Java is 1.6.0.26 here; it might be different in your installation. Check the path and give the correct value.

12. Install the tool: go to /var/www/html/labelingTool/jsrc and run the command below:

       sudo ./install

It might give the following output, which is not an error:

       Note: LabelingTool.java uses or overrides a deprecated API.
       Note: Recompile with -Xlint:deprecation for details.

13. Restart apache using the following command:

       sudo /etc/init.d/apache2 restart

14. Check if the Java applet is enabled in the browser by using the following link: http://javatester.org/enabled.html. In that webpage, in the LIVE box, it should display "This web browser can indeed run Java applets" (wait for some time for the display to come). In case it displays "This web browser can NOT run Java applets", there is some issue with the Java applets. Please look up how to enable Java in your version of the browser and fix the issue.

15. Replace the Pronunciation Rules.pl in the /var/www/html/labelingTool folder with your language-specific code (the name should remain the same: Pronunciation Rules.pl).

16. Open the browser and go to the following link: http://localhost/main.php

NOTE: The VOP algorithm is not used in the current version of the labelingTool, so please ignore anything related to VOP in the sections below.

    4.2 Troubleshooting of LabelingTool

1. When the Labeling tool is working fine, the following files will be generated in the labelingTool/results folder:

       boundary segments spec low vop wav sig
       gd spectrum low segments indicator tmp.seg vop segments


2. When the boundaries are manually updated (deleted, added or moved) and saved, 2 more files get created in the results folder:

ind upd
segments updated

3. After manually updating and saving, if the vopUpdate button is clicked, another new file gets created in the results folder:

    vop updated

4. If a file named vop is not getting generated in the labelingTool/results folder and the labelit.php page is getting stuck, you need to compile the vop module. Follow the steps below.

(a) cd /var/www/html/labelingTool/VopLab
(b) make -f MakeEse clean
(c) make -f MakeEse
(d) cd bin
(e) cp Multiple_Vopd ../../bin/

5. If the above files are not getting created, we can try running the programs through the command line as follows. Execute them from the /var/www/html/labelingTool/bin folder.

The command line usage of the WordsWithSilenceRemoval program is as follows:

       WordsWithSilenceRemoval ctrlFile waveFile sigFile spectraFile boundaryFile thresSilence(ms) thresVoiced(ms)

Example:

       ./WordsWithSilenceRemoval fewords2.base /home/text_0001.wav ../results/spec ../results/boun 100 100

Two files named spec and boun have to be generated in the results folder.

If they are not created, try recompiling:

       cd /var/www/html/labelingTool/Segmentation
       make -f MakeWordsWithSilenceRemoval clean
       make -f MakeWordsWithSilenceRemoval
       cp bin/WordsWithSilenceRemoval /var/www/html/labelingTool/bin/

The command line usage of the Multiple_Vopd program is as follows:

       Multiple_Vopd ctrlFile waveFile(in sphere format) segmentsFile vopFile

Example:

       ./Multiple_Vopd fe-ctrl.ese ../results/wav ../results/segments ../results/vop

The file wav in the results folder is already the sphere-format version of your input wave file.

On running the Multiple_Vopd binary, a file named vop should be generated in the results folder.

6. If the file wav is not produced in the results folder, speech tools are not installed. How to check if speech tools are installed:

After installing speech tools, check the following:

       ch_wave -info <wav file>

   This command should give information about that wave file. If speech tools were installed along with festival and there is no link to it in /usr/bin, please make a link pointing to the ch_wave binary in the /usr/bin folder.

7. How to check if tcsh is installed: type the command tcsh and a new prompt will come.

8. Provide full permissions to the labelingTool folder and its sub-folders so that new files can be created and updated without any permission issues. (If required, the following commands can be used in the labelingTool folder:

       chmod -R 777 *
       chown -R root:root * )

9. The java.policy file should be updated as specified in the installation steps, otherwise it may result in the error "Error writing Lab File".

10. When the lab file is viewed in the browser, if utf8 is not displayed, set the character encoding to utf8 for the browser: Tools -> Options -> Content -> Fonts and Colors (Advanced menu) -> Default character encoding (utf8). Restart the browser.


  • 5 Labeling Tool User Manual

    5.1 How To Use Labeling Tool

The front page of the tool can be accessed using the URL http://localhost/main.php. A screenshot of the front page is shown below.

    The front page has the following fields

The speech file in wav format should be provided. It can be browsed using the browse button.

The corresponding utf8 text has to be provided in the text file. It can be browsed using the browse button. The text file that is uploaded should not have any special characters.

The ehmm lab file generated by festival while building the voice can be provided as input. This is an optional field.

The gd lab file generated by the labeling tool in a previous attempt to label the same file can also be provided. This is an optional field. If the user had once labelled a file half way and saved the lab file, it can be provided as input here so as to label the rest of it or to correct the labels.

The threshold for voiced segments has to be provided in the text box. It varies for each wav file. The value is in milliseconds (e.g. 100, 200, 50).

The threshold for unvoiced segments has to be provided in the text box. It varies for each wav file. The value is in milliseconds (e.g. 100, 200, 50). If the speech file has very long silences, a high value can be provided as the threshold value.

WSF (window scale factor) can be selected from the drop-down list. The default value given is 5. Depending on the output, the user will be required to change WSF values and find the most appropriate value that provides the best segmentation possible for the speech file.

    The corresponding language can be selected using the radio button

Submit the details to the tool using the submit button. A screenshot of the filled-up front page is given below.

Loading Page: On clicking the submit button on the front page, the following page will be displayed.

    Validation for data entered

If the loading of all files was successful and proper values were given for the thresholds in the front page, the message "Click View to see the results..." will be displayed as shown above.

If the wave file was not provided in the front page, the following error will appear in the loading page: "Error uploading wav file. Wav file must be entered"

If the text file was not provided in the front page, the following error will appear in the loading page: "Error uploading text file. Text file must be entered"

If the threshold for voiced segments was not provided in the front page, the following error will appear in the loading page: "Threshold for voiced segments must be entered"

If the threshold for unvoiced segments was not provided in the front page, the following error will appear in the loading page: "Threshold for unvoiced segments must be entered"

If a numeric value is not entered for the thresholds of unvoiced or voiced segments in the front page, the following error will appear in the loading page: "Numeric value must be entered for thresholds"

The wav file loaded will be copied to the /var/www/html/UploadDir folder as text.wav.

The lab file (ehmm label file) will be copied to the /var/www/html/labelingTool/lab folder as temp.lab. If an error occurs while moving it to the lab folder, the following error will be displayed: "Error moving lab file."

The gd lab file (group delay label file) will be copied to the /var/www/html/labelingTool/results folder with the name gd lab. If an error occurs while moving it, the following error will be displayed: "Error moving gdlab file."

    The Labelit Page

On clicking the view button on the loading page, the labelit page will be loaded. A screenshot of this page along with the markings for each panel is given below.

Note: If the error message "Error reading file http://localhost/labelingTool/tmp/temp.wav" appears, it means that in place of a wav file some other file (e.g. a text file) was uploaded.

Panels on the Labelit Page: It has 6 main panels.

EHMM Panel - displays the lab files generated by festival using the EHMM algorithm while building voices.

Slider Panel - using this panel we can slide, delete or add segments/labels.

Wave Panel - displays the speech waveform in segmented format. (Note: the speech wave file does not appear exactly as in WaveSurfer. This is because of limitations in Java.)

Text Panel - displays the segmented text (in utf8 format) with syllables as the basic units.

GD Panel - draws the group delay curve. This is the result of the group delay algorithm. Wherever a peak appears is considered to be a segment boundary.

VOP Panel - shows the number of vowel onset points found between each pair of segment boundaries provided by group delay. Green corresponds to one vowel onset point: the segment boundary found by the group delay algorithm is correct. Red corresponds to zero vowel onset points: the segment boundary found by the group delay algorithm is wrong and needs to be deleted. Yellow corresponds to more than one vowel onset point: between the 2 boundaries found by the group delay algorithm there will be one or more additional boundaries.

Resegment: The WSF selected for this example is 5. A different WSF will provide a different set of boundaries. The smaller the WSF, the greater the number of boundaries, and vice versa. To experiment with different WSF values, select the WSF from the drop-down list and click RESEGMENT. A screenshot for the same text (as in the above figure) with a greater WSF selected is shown below.

The above figure shows the segmentation using WSF = 12; it gives fewer boundaries. The figure below shows the same waveform with a smaller WSF (WSF = 3); it gives more boundaries.

So the ideal WSF for the waveform has to be found. An easy way is to check that the text segments reach approximately to the end of the waveform (not missing any text segments nor having many segments without text).

Menu Bar: The menu bar is just above the EHMM Panel, with the heading "Waveform". The menu bar contains the following buttons, in order from left to right:

Save button - The lab file can be saved using the save button. After making any changes to the segments (deletion, addition or dragging), the save button has to be clicked if the changes are to be kept.

Play the waveform - The entire wave file will be played on pressing this button.

Play the selection - Select some portion of the waveform (say a segment) and play just that part using this button. This button can be used to verify each segment.

Play from selection - Plays the waveform starting from the current selection to the end. Click the mouse on the waveform and a yellow line will appear to show the selection. On clicking this button, the waveform from that selected point to the end of the file will be played.

Play to selection - Plays the waveform from the beginning to the end of the current selection.

Stop the playback - Stops the playing of the wave file.

Zoom to fit - Displays the selected portion of the wave zoomed in.

Zoom 1 - Displays the entire wave.

Zoom in - Zooms in on the wave.

Zoom out - Zooms out on the wave.

Update VOP Panel - After changing the segments (dragging, adding or deleting), the VOP algorithm is recalculated on the new set of segments on clicking this button. After making the changes, the save button must be pressed before updating the VOP panel.

Some screenshots are given below to demonstrate the use of the menu bar. The figure below shows how to select a portion of the waveform (drag using the mouse on the wave panel) and play that part. The selected portion appears shaded in yellow as shown.

The figure below shows how to select a point (click using the mouse on the wave panel) and play from the selection to the end of the file. The selected point appears as a yellow line.


The next figure shows how to select a portion of the wave and zoom to fit.

The next figure shows how the portion of the wave file selected in the above figure is zoomed.



  • 5.2 How to do label correction using Labeling tool

Each segment given by the group delay algorithm can be listened to in order to decide whether the segment is correct, i.e., whether it matches the text segmentation, with the help of the VOP and EHMM panels.

Deletion of a Segment: All the segments appear as red lines in the labeling tool output. A segment can be deleted by right-clicking on that particular segment on the slider panel. The figure below shows the original output of the labelling tool for the Hindi wave file.

The third and fourth segments are very close to each other and one has to be deleted; ideally we delete the fourth one. The VOP has given a red colour (an indication to delete one) for that segment. The user can decide whether to delete the segment to the right or to the left of the red column after listening. On deletion (right-click on the slider panel on that segment head) of the fourth segment, the text segments get shifted and fit after the silence segment, as shown in the figure below.

On listening to each segment, it is seen that the segment between two of the syllables is wrong. It has to be deleted. The VOP gives a red colour for the segment, and the corresponding peak in the group delay panel is below the threshold. Peaks below the threshold in the group delay curve usually will not be a segment boundary, but sometimes the algorithm computes one as a boundary. The threshold value in the GD panel is the middle line in magenta colour.

There are 2 more red columns in the VOP panel. The last one is correct and we have to delete a segment. The second-last red column in the VOP panel is incorrect and GD gives the correct segment; hence it need not be deleted. The VOP is only used as a reference for the GD algorithm; it can be wrong in some cases. The yellow colour on the VOP panel usually says to add a new segment, but here the yellow colour is appearing in the silence region and we ignore it.

    The figure below shows the corrected segments (after deletion)

On completion of correcting the labels, the save button has to be pressed. On clicking the Save button, a dialog box appears with the message "Lab File Saved. Click Next to Continue". A silence segment gets deleted on clicking the right boundary of the silence segment.

Update VOP Panel: After saving the changes made to the labels, the VOP update button has to be clicked to recalculate the VOP algorithm on the new segments. The updated output is shown in the figure below.

Adding a Segment: A segment can be added by right-clicking with the mouse on the slider panel at the point where a segment needs to be added. The figure below shows a case in which a segment needs to be added.

The VOP shows three yellow columns here, of which the second yellow column is true. The GD plot shows a small peak in that segment and we can be sure that the segment has to be added at the peak only. In the above figure it can be seen that the mouse is placed on the slider panel at the location where the new segment is to be added. The figure below shows the corresponding corrected wave file after the VOP update is done.

Sliding a Segment: A segment can be moved to the left or right by clicking on the head of the segment boundary on the slider panel and dragging left or right. Sliding can be used if required while correcting the labels.

Modification of a lab file: If a half-corrected lab file is already present (a gd lab file), upload it from the ./labelingTool/labfiles directory in the gd lab file option on the main page. Irrespective of the WSF value, the earlier lab file will be loaded. But if resegmentation is used, the already present labels will be gone and they will be regenerated based on the new WSF value. After modification, when the Save button is pressed, the same lab file is updated, but before updating, a backup copy of the lab file is created. Note: If the system creates a lab file with the same name as one that already exists in the labfiles directory, the system creates a backup copy of that file. The backup copy is hidden by default; to view it just press CTRL + h.

Logfiles: The tool generates a separate log file for each lab file (e.g. text0001.log) in the ./labelingTool/logfiles directory. Please keep cleaning this directory at regular intervals.

    5.3 Viewing the labelled file

Once the corrections are made and the save button is clicked, the lab file is generated in the /var/www/html/labelingTool/labfiles directory and it can be viewed by clicking on the next link. The following message comes on clicking next: "Download the labfile: labfile". Click on the link labfile. The lab file will appear in the browser window as below.

    5.4 Control file

A control file is placed at the location /var/www/html/labelingTool/bin/fewords.base. The parameters in the control file are given below. These parameters can be adjusted by the user to get better segmentation results.

windowSize - size of the frame for energy computation
waveType - type of the waveform:
  0 - Sphere PCM
  1 - Sphere Ulaw
  2 - plain sample, one short integer per line
  3 - RAW, sequence of bytes, each sample 8-bit
  4 - RAW16, two bytes/sample, Big Endian
  5 - RAW16, two bytes/sample, Little Endian
  6 - Microsoft RIFF standard wav format
winScaleFactor - should be chosen based on syllable rate; choose by trial and error
gamma - reduces the dynamic range of energy
fftOrder and fftSize - MUST be set to ZERO!!
frameAdvanceSamples - frameshift for energy computation
medianOrder - order of median smoothing for the group delay function; 1 ==> no smoothing

thresEnergy, thresZero, thresSpectralFlatness - thresholds used for voiced/unvoiced detection. When a parameter is set to zero, it is NOT used. Examples were tested with ENERGY only.

Sampling rate of the signal - required for giving the boundary information in seconds.
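Since boundaries are computed on energy frames, the last two parameters are what relate a frame index to a time in seconds. A small sketch of that conversion (the parameter values used in the example are illustrative, not taken from the control file):

    def frame_to_seconds(frame_index, frame_advance_samples, sampling_rate):
        """Convert an energy-frame index into a time in seconds, using the
        frameAdvanceSamples and sampling rate values from the control file."""
        return frame_index * frame_advance_samples / sampling_rate

    # Example: frame 250 with a 160-sample frame shift at 16 kHz lies at 2.5 s.
    print(frame_to_seconds(250, 160, 16000))   # 2.5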

    5.5 Performance results for 6 Indian Languages

Testing was conducted on a set of test sentences for all 6 Indian languages, and the percentage of correctness was calculated based on the following formula. The calculations were done after the segmentation was performed using the tool with the best WSF and threshold values.

Correctness (%) = [1 - (No. of insertions + No. of deletions) / Total no. of segments] x 100
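For example, if a test file has 100 syllable segments and the tool output shows 5 insertions and 8 deletions relative to the reference, the correctness is [1 - (5 + 8)/100] x 100 = 87% (illustrative numbers only).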

Language     Percentage of Correctness
Hindi        86.83%
Malayalam    78.68%
Telugu       85.40%
Marathi      80.24%
Bengali      77.84%
Tamil        77.38%

    5.6 Limitations of the tool

Zooming is not enabled for the VOP and EHMM panels.
The waveform is not displayed exactly as in WaveSurfer.


  • 6 Unit Selection Synthesis Using Festival

This chapter discusses some of the options for building waveform synthesizers using unit selection techniques in Festival.

By unit selection we actually mean the selection of some unit of speech, which may be anything from a whole phrase down to a diphone (or even smaller). Technically, diphone selection is a simple case of this. However, typically what we mean is that, unlike diphone selection, in unit selection there is more than one example of the unit and some mechanism is used to select between them at run-time.

The theory is obvious, but the design of such systems, and finding the appropriate selection criteria and weighting the costs of the relative candidates, is a non-trivial problem. Techniques like this often produce very high quality, very natural sounding synthesis. However, they can also produce some very bad synthesis, when the database has unexpected holes and/or the selection costs fail.

    6.1 Cluster unit selection

The idea is to take a database of general speech and try to cluster each phone type into groups of acoustically similar units based on the (non-acoustic) information available at synthesis time, such as phonetic context, prosodic features (F0 and duration) and higher level features such as stressing, word position, and accents. The actual features used may easily be changed and experimented with, as can the definition of acoustic distance between the units in a cluster.

The basic processes involved in building a waveform synthesizer for the clustering algorithm are as follows. A high level walkthrough of the scripts to run is given after these lower level details.

    1. Collect the database of general speech.

2. Building the utterance structures.

3. Building coefficients for acoustic distances, typically some form of cepstrum plus F0, or some pitch synchronous analysis (e.g. LPC).

4. Build distance tables, precalculating the acoustic distance between each unit of the same phone type (a minimal sketch of such a distance is given after this list).

5. Dump selection features (phone context, prosodic, positional and whatever) for each unit type.

6. Build cluster trees using wagon with the features and acoustic distances dumped by the previous two stages.

    7. Building the voice description itself
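Step 4 above precomputes pairwise acoustic distances between candidate units of the same phone type. A minimal sketch of such a distance, assuming each unit is summarized by a mean cepstral vector plus F0; the weighting and the unit representation are illustrative assumptions, not Festival's actual measure:

    import numpy as np

    def unit_distance(unit_a, unit_b, f0_weight=0.3):
        """Weighted acoustic distance between two units of the same phone type:
        Euclidean distance of mean cepstra plus a weighted F0 difference."""
        cep = np.linalg.norm(unit_a["cepstrum"] - unit_b["cepstrum"])
        f0 = abs(unit_a["f0"] - unit_b["f0"]) / 100.0
        return cep + f0_weight * f0

    units = [{"cepstrum": np.array([1.0, 0.2, -0.3]), "f0": 110.0},
             {"cepstrum": np.array([0.8, 0.1, -0.2]), "f0": 130.0},
             {"cepstrum": np.array([1.1, 0.3, -0.4]), "f0": 112.0}]

    # Precompute the distance table for all pairs of units of this phone type.
    table = [[unit_distance(a, b) for b in units] for a in units]
    for row in table:
        print([round(d, 3) for d in row])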

    6.2 Choosing the right unit type

Before you start you must make a decision about what unit type you are going to use. Note there are two dimensions here. The first is size, such as phone, diphone or demi-syllable. The second is the type itself, which may be simple phone, phone plus stress, phone plus word, etc. The code here and the related files basically assume the unit size is the phone. However, because you may also include a percentage of the previous unit in the acoustic distance measure, this unit size is effectively phone plus previous phone, and thus it is somewhat diphone-like. The cluster method has no actual restrictions on the unit size; it simply clusters the given acoustic units with the given features, but the basic synthesis code is currently assuming phone sized units.

The second dimension, type, is very open and we expect that controlling this will be a good method to attain high quality general unit selection synthesis. The parameter clunit_name_feat may be used to define the unit type. The simplest conceptual example is the one used in limited domain synthesis. There we distinguish each phone with the word it comes from, thus a "d" from the word "limited" is distinct from the "d" in the word "domain". Such distinctions can hard partition the space of phones into types that can be more manageable.

The decision of how to carve up that space depends largely on the intended use of the database. The more distinctions you make, the less you depend on the clustering acoustic distance, but the more you depend on your labels (and the speech) being (absolutely) correct. The mechanism to define the unit type is through a (typically) user defined feature function. In the given setup scripts this feature function will be called lisp_INST_LANG_NAME::clunit_name. Thus the voice simply defines the function INST_LANG_NAME::clunit_name to return the unit type for the given segment. If you wanted to make a diphone unit selection voice this function could simply be:

(define (INST_LANG_NAME::clunit_name i)
  (string-append
   (item.name i)
   "_"
   (item.feat i "p.name")))

Thus the unit type would be the phone plus its previous phone. Note that the first part of a unit name is assumed to be the phone name in various parts of the code; thus, although you may think it would be neater to return previousphone_phone, that would mess up some other parts of the code. In the limited domain case the word is attached to the phone. You can also consider some demisyllable information or more to differentiate between different instances of the same phone.

The important thing to remember is that at synthesis time the same function is called to identify the unit type, which is used to select the appropriate cluster tree to select from. Thus you need to ensure that if you use, say, diphones, your database really does have all diphones in it.

    6.3 Collecting databases for unit selection

Unlike diphone databases, which are carefully constructed to ensure specific coverage, one of the advantages of unit selection is that a much more general database is desired. However, although voices may be built from existing data not specifically gathered for synthesis, there are still factors about the data that will help make better synthesis.

Like diphone databases, the more cleanly and carefully the speech is recorded, the better the synthesized voice will be. As we are going to be selecting units from different parts of the database, the more similar the recordings are, the less likely bad joins will occur. However, unlike diphone databases, prosodic variation is probably a good thing, as it is those variations that can make synthesis from unit selection sound more natural. Good phonetic coverage is also useful, at least phone coverage if not complete diphone coverage. Also, synthesis using these techniques seems to retain aspects of the original database. If the database is broadcast news stories, the synthesis from it will typically sound like read news stories (or, more importantly, will sound best when it is reading news stories).

Again, the notes about recording the database apply, though it will sometimes be the case that the database is already recorded and beyond your control; in that case you will always have something legitimate to blame for poor quality synthesis.

    6.4 Preliminaries

Throughout our discussion we will assume the following database layout. It is highly recommended that you follow this format, otherwise the scripts and examples will fail. There are many ways to organize databases and many of such choices are arbitrary; here is our arbitrary layout.

The basic database directory should contain the following directories:

bin/ Any database specific scripts for processing. Typically this first contains a copy of standard scripts that are then customized when necessary for the particular database.

wav/ The waveform files. These should be headered, one utterance per file, with a standard name convention. They should have the extension .wav and the fileid consistent with all other files throughout the database (labels, utterances, pitch marks etc).

lab/ The segmental labels. These are usually the master label files; they may contain more information than the labels used by festival, which will be in festival/relations/Segment/.

lar/ The EGG files (laryngograph files), if collected.

    pm/ Pitchmark files as generated from the lar files or from the signal directly.

    festival/ Festival specific label files.

festival/relations/ The processed label files for building Festival utterances, held in directories whose names reflect the relation they represent: Segment/, Word/, Syllable/ etc.

    festival/utts/ The utterances files as generated from the festival/relations/ label files.

    Other directories will be created for various processing reasons.

    6.5 Building utterance structures for unit selection

    In order to make access well defined you need to construct Festival utterance structures for each of the utterances in your database. This (in its basic form) requires labels for segments, syllables, words, phrases, F0 targets, and intonation events. Ideally these should all be carefully hand labeled, but in most cases that is impractical. There are ways to automatically obtain most of these labels, but you should be aware of the inherent errors in the labeling system you use (including labeling systems that involve human labelers). Note that when a unit selection method that fundamentally uses segment boundaries is to be used, its quality is going to be ultimately determined by the quality of the segmental labels in the database.
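    A quick sanity check on the resulting utterance structures is to load one into Festival and inspect its relations; the fileid below is only an example and should be replaced with one of your own:

    festival> (set! u (utt.load nil "festival/utts/text_0001.utt"))
    festival> (utt.relationnames u)   ;; expect Segment, Syllable, Word, ... among others
    festival> (utt.relation.print u 'Segment)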

    For the unit selection algorithm described below, the segmental labels should use the same phoneset as the actual synthesis voice. However, a more detailed phonetic labeling may be more useful (e.g. marking closures in stops), mapping that information back to the phone labels before actual use. Autoaligned databases typically aren't accurate enough for use in unit selection. Most autoaligners are built using speech recognition technology, where actual phone boundaries are not the primary measure of success. General speech recognition systems primarily measure words correct (or, more usefully, semantic correctness) and do not require phone boundaries to be accurate. If the database is to be used for unit selection, it is very important that the phone boundaries are accurate. Having said this, though, we have successfully used the aligner described in the diphone chapter above to label general utterances where we knew which phone string we were looking for; using such an aligner may be a useful first pass, but the result should always be checked by hand.

    It has been suggested that alignment techniques and unit selection training techniques can be used to judge the accuracy of the labels, and basically exclude any segments that appear to fall outside the typical range for the segment type. Thus, it is believed that unit selection algorithms should be able to deal with a certain amount of noise in the labeling. This is the hope of researchers in the field, but we are some way from it, and at present the easiest way to improve the quality of unit selection is to ensure that the segmental labeling is as accurate as possible. Once we have a better handle on the selection techniques themselves it will then be possible to start experimenting with noisy labeling.

    However, it should be added that this unit selection technique (and many others) supports what is termed optimal coupling, where the acoustically most appropriate join point is found automatically at run time when two units are selected for concatenation. This technique is inherently robust to boundary labeling errors of at least a few tens of milliseconds.

    For the cluster method defined here it is best to construct more than simply segments, durations and an F0 target. A whole syllabic structure plus word boundaries, intonation events and phrasing allow a much richer set of features to be used for clusters. See the section called Utterance building in the chapter called A Practical Speech Synthesis System for a more general discussion of how to build utterance structures for a database.

    6.6 Making cepstrum parameter files

    In order to cluster similar units in a database we build an acoustic representation of them. This is also still a research issue, but in the example here we will use Mel cepstrum. Interestingly, we do not generate these at fixed intervals, but at pitch marks. Thus we have a parametric spectral representation of each pitch period. We have found this a better method, though it does require that pitchmarks are reasonably well identified.

    Here is an example script which will generate these parameters for a database; it is included in festvox/src/unitsel/make_mcep.

    for i in $*
    do
       fname=`basename $i .wav`
       echo $fname MCEP
       $SIG2FV $SIG2FVPARAMS -otype est_binary $i -o mcep/$fname.mcep -pm pm/$fname.pm -window_type hamming
    done

    The above builds coefficients at fixed frames. We have also experimented with building parameters pitch synchronously and have found a slight improvement in the usefulness of the measure based on this. We do not pretend that this part is particularly neat in the system but it does work. When pitch synchronous parameters are built the clunits module will automatically put the local F0 value in coefficient 0 at load time. This happens to be appropriate for LPC coefficients. The script in festvox/src/general/make_lpc can be used to generate the parameters, assuming you have already generated pitch marks.

    Note that a secondary advantage of using LPC coefficients is that they are required anyway for LPC resynthesis, thus less information about the database is required at run time. We have not yet tried pitch synchronous Mel frequency cepstrum coefficients, but that should be tried. Also, a more general duration/number-of-pitch-periods match algorithm is worth defining.

    6.7 Building the clusters

    Cluster building is mostly automatic. Of course you need the clunits module compiled into your version of Festival. Version 1.3.1 or later is required; the version of clunits in 1.3.0 is buggy and incomplete and will not work. To compile in clunits, add

    ALSO_INCLUDE += clunits

    to the end of your festival/config/config file, and recompile. To check whether an installation already has support for clunits, check the value of the variable *modules*.
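    For example, from the Festival interpreter you can print the module list and look for clunits (the exact set of modules listed depends on how Festival was compiled):

    festival> (print *modules*)
    ;; clunits should appear somewhere in the printed list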

    The file festvox/src/unitsel/build_clunits.scm contains the basic parameters to build a cluster model for a database that has utterance structures and acoustic parameters. The function build_clunits will build the distance tables, dump the features and build the cluster trees. There are many parameters, set for the particular database (and instance of cluster building) through the Lisp variable clunits_params. A reasonable set of defaults is given in that file, and reasonable runtime parameters will be copied into festvox/INST_LANG_VOX_clunits.scm when a new voice is set up.

    The function build_clunits runs through all the steps, but in order to better explain what is going on we will go through each step and at that time explain which parameters affect the substep.

    The first stage is to load in all the utterances in the database, sort them into segment types and name them with individual names (as TYPE_NUM). This first stage is required for all other stages, so even if you are not running build_clunits you still need to run this stage first. This is done by the calls

    (format t "Loading utterances and sorting types\n")
    (set! utterances (acost:db_utts_load dt_params))
    (set! unittypes (acost:find_same_types utterances))
    (acost:name_units unittypes)

    Alternatively, the function build_clunits_init will do the same thing. This stage uses the following parameters:

    name STRING
    A name for this database.

    db_dir FILENAME
    The pathname of the database, typically "." as in the current directory.

    utts_dir FILENAME
    The directory containing the utterances.

    utts_ext FILENAME
    The file extension for the utterance files.

    files
    The list of file ids in the database.

    For example, for the KED example these parameters are:

    (name 'ked_timit)
    (db_dir "/usr/awb/data/timit/ked/")
    (utts_dir "festival/utts/")
    (utts_ext ".utt")
    (files (kdt_001 kdt_002 kdt_003 ... ))

    In the examples below the list of fileids is extracted from the given prompt file at call time. The next stage is to load the acoustic parameters and build the distance tables. The acoustic distance between each segment of the same type is calculated and saved in the distance table. Precalculating this saves a lot of time, as the clustering will require these distances many times.

    This is done by the following two function calls

    (format t "Loading coefficients\n")
    (acost:utts_load_coeffs utterances)
    (format t "Building distance tables\n")
    (acost:build_disttabs unittypes clunits_params)

    The following parameters influence the behaviour.

    coeffs_dir FILENAME
    The directory (from db_dir) that contains the acoustic coefficients as generated by the script make_mcep.

    coeffs_ext FILENAME
    The file extension for the coefficient files.

    get_stds_per_unit
    Takes the value t or nil. If t, the parameters for each segment type are normalized using the means and standard deviations for that class, so a Mahalanobis-style distance is used between units rather than a simple Euclidean distance. The recommended value is t. (A sketch of this kind of standardized, weighted distance is given after the parameter example below.)

    ac_left_context FLOAT
    The amount of the previous unit to be included in the distance. 1.0 means all, 0.0 means none. This parameter may be used to make the acoustic distance sensitive to the previous acoustic context. The recommended value is 0.8.

    dur_pen_weight FLOAT
    The penalty factor for duration mismatch between units.

    F0_pen_weight FLOAT
    The penalty factor for F0 mismatch between units.

    ac_weights (FLOAT FLOAT ...)
    The weights for each parameter in the coefficient files used while finding the acoustic distance between segments. There must be the same number of weights as there are parameters in the coefficient files. The first parameter is (in normal operations) F0. It is common to give proportionally more weight to F0 than to each individual other parameter. The remaining parameters are typically MFCCs (and possibly delta MFCCs). Finding the right parameters and weightings is one of the key goals in unit selection synthesis, so it is not easy to give concrete recommendations. The following aren't bad, but there may be better ones too, though we suspect that real human listening tests are probably the best way to find better values.

    An example is:

    (coeffs_dir "mcep/")
    (coeffs_ext ".mcep")
    (dur_pen_weight 0.1)
    (get_stds_per_unit t)
    (ac_left_context 0.8)
    (ac_weights (0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5))
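    To make the roles of get_stds_per_unit and ac_weights concrete, here is a rough sketch (our illustration, not the actual clunits code) of a weighted frame distance in which each coefficient is first standardized by its class mean and standard deviation; the function name and argument layout are ours:

    (define (frame_distance a b means stds weights)
      "(frame_distance a b means stds weights)
    Sketch only: weighted squared distance between two coefficient frames
    a and b, with per-class standardization; all arguments are equal-length
    lists of floats."
      (cond
       ((null a) 0)
       (t
        (let ((za (/ (- (car a) (car means)) (car stds)))
              (zb (/ (- (car b) (car means)) (car stds))))
          (+ (* (car weights) (* (- za zb) (- za zb)))
             (frame_distance (cdr a) (cdr b)
                             (cdr means) (cdr stds) (cdr weights)))))))

    The real acoustic distance additionally runs over the frames of the two units (including the ac_left_context portion of the previous unit) and adds the duration and F0 penalties described above.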

    The next stage is to dump the features that will be used to index the clusters. Remember, the clusters are defined with respect to the acoustic distance between each unit in the cluster, but they are indexed by these features. These features are those which will be available at text-to-speech time, when no acoustic information is available. Thus they include things like phonetic and prosodic context rather than spectral information. The named features may (and probably should) be over-general, allowing the decision tree building program wagon to decide which of these features actually does make an acoustic distinction between the units.

    The function to dump the features is:

    (format t "Dumping features for clustering\n")
    (acost:dump_features unittypes utterances clunits_params)

    The parameters which affect this function are:

    feats_dir FILENAME
    The directory where the features will be saved (by segment type).

    feats LIST
    The list of features to be dumped. These are standard Festival feature names, with respect to the Segment relation.

    For our KED example these values are:

    (feats_dir "festival/feats/")
    (feats
      (occurid
       p.name p.ph_vc p.ph_ctype
       p.ph_vheight p.ph_vlng
       p.ph_vfront p.ph_vrnd
       p.ph_cplace p.ph_cvox
       n.name n.ph_vc n.ph_ctype
       n.ph_vheight n.ph_vlng
       n.ph_vfront n.ph_vrnd
       n.ph_cplace n.ph_cvox
       segment_duration
       seg_pitch p.seg_pitch n.seg_pitch
       R:SylStructure.parent.stress
       seg_onsetcoda n.seg_onsetcoda p.seg_onsetcoda
       R:SylStructure.parent.accented
       pos_in_syl
       syl_initial
       syl_final
       R:SylStructure.parent.syl_break
       R:SylStructure.parent.R:Syllable.p.syl_break
       pp.name pp.ph_vc pp.ph_ctype
       pp.ph_vheight pp.ph_vlng
       pp.ph_vfront pp.ph_vrnd
       pp.ph_cplace pp.ph_cvox
       ))

    Now that we have the acoustic distances and the feature descriptions of each unit, the next stage is to find a relationship between those features and the acoustic distances. This we do using the CART tree builder wagon. It will find questions about the features that best minimize the acoustic distance between the units in each class. wagon has many options, many of which are appropriate to this task, though it is interesting that this learning task is closed: we are trying to classify all the units in the database, so there is no test set as such. However, in synthesis there will be desired units whose feature vectors didn't exist in the training set.

    The clusters are built by the following function

    (format t "Building cluster trees\n")
    (acost:find_clusters (mapcar car unittypes) clunits_params)

    The parameters that affect the tree building process are

    trees_dir FILENAME
    The directory where the decision tree for each segment type will be saved.

    wagon_field_desc LIST
    A filename of a wagon field descriptor file. This is a standard field description (field name plus field type) that is required by wagon. An example is given in festival/clunits/all.desc, which should be sufficient for the default feature list, though if you change the feature list (or the values those features can take) you may need to change this file.

    wagon_progname FILENAME
    The pathname of the wagon CART building program. This is a string and may also include any extra parameters you wish to give to wagon.


    wagon_cluster_size INT
    The minimum cluster size (the wagon stop value).

    prune_reduce INT
    The number of elements to remove from each cluster in pruning. This removes the units in the cluster that are furthest from the center. This is done within the wagon training.

    cluster_prune_limit INT
    This is a post-wagon build operation on the generated trees (and perhaps a more reliable method of pruning). It defines the maximum number of units that will be in a cluster at a tree leaf (wagon_cluster_size is the minimum size). This is useful when there are large numbers of some particular unit type which cannot be differentiated; a typical example is silence segments whose context is nothing but other silences. Another use is to cause only the center example units to be kept. We have used this in building diphone databases from general databases, by making the selection features include only phonetic context features and then restricting the number of diphones we take by making this number 5 or so.

    unittype_prune_threshold INT
    When making complex unit types, this defines the minimal number of units of a type required before a tree is built for it. When building cascaded unit selection synthesizers it is often not worth excluding large stages if there is, say, only one example of a particular demisyllable.

    Note that as the distance tables can be large, there is an alternative function that does both the distance table building and the clustering in one step, deleting each distance table immediately after use; thus you only need enough disk space for the largest number of phones in any type. To do this, call

    (acost:disttabs_and_clusters unittypes clunits_params)

    and remove the calls to acost:build_disttabs and acost:find_clusters. In our KED example these parameters have the values:

    (trees_dir "festival/trees/")
    (wagon_field_desc "festival/clunits/all.desc")
    (wagon_progname "/usr/awb/projects/speech_tools/bin/wagon")
    (wagon_cluster_size 10)
    (prune_reduce 0)

    The final stage in building a cluster model is to collect the generated trees into a single file and dump the unit catalogue, i.e. the list of unit names and their files and positions in them. This is done by the Lisp functions

    (acost:collect_trees (mapcar car unittypes) clunits_params)
    (format t "Saving unit catalogue\n")
    (acost:save_catalogue utterances clunits_params)

    The only parameter that affects this is:

    catalogue_dir FILENAME
    The directory where the catalogue will be saved (the name parameter is used to name the file).

    By default this is:

    (catalogue_dir "festival/clunits/")

    There are a number of parameters that are specified with a cluster voice. These are related to the run-time aspects of the cluster model. They are:

    join_weights FLOATLIST
    This is a set of weights, in the same format as ac_weights, that are used in optimal coupling to find the best join point between two candidate units. It is separate from ac_weights because different values are likely to be desired, particularly an increased weight for the F0 value (column 0).

    continuity_weight FLOAT
    The factor by which the join cost is multiplied relative to the target cost. This is probably not very relevant given that the target cost is merely the position from the cluster center.

    log_scores 1
    If specified, the join scores are converted to logs. For databases that have a tendency to contain non-optimal joins (probably any non-limited-domain database), this may be useful to stop failed synthesis of longer sentences. The problem is that the sum of very large numbers can lead to overflow; this helps reduce that. You could alternatively change the continuity_weight to a number less than 1, which would also partially help. However, such overflows are often a pointer to some other problem (poor distribution of phones in the db), so this is probably just a hack.

    optimal_coupling INT
    If 1, this uses optimal coupling and searches the cepstrum vectors at each join point to find the best possible join point. This is computationally expensive (as well as requiring lots of cepstrum files to be loaded), but does give better results. If the value is 2, only the coupling distance at the given boundary is checked (the boundary is not moved); this is often adequate in good databases (e.g. limited domain), and is certainly faster.

    extend_selections INT
    If 1, then the selected cluster will be extended to include any unit from the cluster of the previous segment's candidate units that has the correct phone type (and isn't already included in the current cluster). This is experimental but has shown its worth and hence is recommended. It means that instead of selecting just units, selection is effectively selecting the beginnings of multiple-segment units. This option encourages far longer units.

    pm_coeffs_dir FILENAME
    The directory (from db_dir) where the pitchmarks are.

    pm_coeffs_ext FILENAME
    The file extension for the pitchmark files.

    sig_dir FILENAME
    The directory containing the waveforms of the units (or residuals if residual LPC is being used, PCM waveforms if PSOLA is being used).

    sig_ext FILENAME
    The file extension for the waveforms/residuals.

    join_method METHOD
    Specifies the method used for joining the selected units. Currently it supports simple, a very naive joining mechanism, and windowed, where the ends of the units are windowed using a hamming window and then overlapped (no prosodic modification takes place though). The other two possible values for this feature are none, which does nothing, and modified_lpc, which uses the standard UniSyn module to modify the selected units to match the targets.

    clunits_debug 1/2
    With a value of 1, some debugging information is printed during synthesis, particularly how many candidate phones are available at each stage (and any extended ones), and where each phone is coming from.

    With a value of 2, more debugging information is given, including the above plus the joining costs (which are very readable by humans).
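    For orientation, these run-time entries sit in the same clunits_params list as the build-time ones; a hedged illustration (the values here are chosen only for the example, not as recommendations) might look like:

    (join_weights
     (0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5))
    (continuity_weight 5)
    (optimal_coupling 1)
    (extend_selections 1)
    (pm_coeffs_dir "pm/")
    (pm_coeffs_ext ".pm")
    (sig_dir "wav/")
    (sig_ext ".wav")
    (join_method windowed)
    (clunits_debug 1)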


    7 Building Festival Voice

    In the context of Indian languages, syllable units are found to be a much better choice than units like phone, diphone, and half-phone. Unlike most other foreign languages, in which the basic unit of the writing system is an alphabet, Indian language scripts use the syllable as the basic linguistic unit. The syllabic writing in Indic scripts is based on the phonetics of linguistic sounds, and the syllabic model is generic to all Indian languages. A syllable is typically of the following form: V, CV, VC, CCV, CCCV, and CCVC, where C is a consonant and V is a vowel. A syllable can be represented as C*VC*, containing at least one vowel and zero, one or more consonants. The following steps explain how to build a syllable-based synthesis using FestVox.
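    As a small illustration of the C*VC* representation (this helper and its vowel? predicate are our own sketch, not part of FestVox), the pattern of a syllable can be derived from its phone list:

    (define (syllable_pattern phones vowel?)
      "(syllable_pattern phones vowel?)
    Sketch only: map a list of phone names to a C/V pattern string, given a
    predicate that decides whether a phone name is a vowel."
      (apply string-append
             (mapcar (lambda (p) (if (vowel? p) "V" "C")) phones)))

    ;; e.g. a syllable whose phones are (k aa) would give "CV",
    ;; and one whose phones are (k a n) would give "CVC".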

    1. Create a directory and enter it.
    $ mkdir iiit_tel_syllable
    $ cd iiit_tel_syllable

    2. Create the voice setup.
    $ $FESTVOXDIR/src/unitsel/setup_clunits iiit_tel_syllable
    $ $FESTVOXDIR/src/prosody/setup_prosody
    Before running build_prompts, do the following steps.

    (a) Modify your phoneset file so that syllables are treated as the phonemes.

    (b) Modify the phoneme label files into syllable label files.

    (c) Remove special symbols from the tokenizer.

    (d) Call your pronunciation dictionary module from festvox/iiit_tel_syllable_lexicon.scm.

    (e) The last modification is to change the default phoneme set to your language's unique syllables in the festival/clunits/all.desc file, under the p.name field.

    3. Generate prompts:
    festival -b festvox/build_clunits.scm '(build_prompts "etc/txt.done.data")'

    4. Record the prompts:
    ./bin/prompt_them etc/time.data

    5. Label automatically:
    $FESTVOXDIR/src/ehmm/bin/do_ehmm help
    Run the following steps individually: setup, phseq, feats, bw, align.

    6. Generate pitch marks:
    bin/make_pm_wave wav/*.wav

    7. Correct the pitch marks:
    bin/make_pm_fix pm/*.pm

    Tuning the pitch marks:

    (a) Convert the pitch marks into label format:
    ./bin/make_pmlab_pm pm/*.pm

    (b) After modifying the pitch marks, convert the label format back into pitch marks:
    ./bin/make_pm_pmlab pm_lab/*.lab

    8. Generate Mel cepstral coefficients:
    bin/make_mcep wav/*.wav

    9. Generate the utterance structures:
    festival -b festvox/build_clunits.scm '(build_utts "etc/txt.done.data")'

    10. Open festival/clunits/all.desc and add all the syllables in the p.name field. Then cluster the units:
    festival -b festvox/build_clunits.scm '(build_clunits "etc/txt.done.data")'

    11. Open bin/make_dur_model and remove "stepwise", then run:
    ./bin/do_build do_dur

    12. Test the voice:
    festival festvox/iiit_tel_syllable_clunits.scm '(voice_iiit_tel_syllable_clunits)'

    To synthesize a sentence:

    If you are building the voice on a local machine:
    (SayText "your text")

    If you are running the voice on a remote machine:
    (utt.save.wave (utt.synth (Utterance Text "your text")) "test.wav")

    If you want to see the selected units, run the following commands:
    (set! utt (SayText "your text"))
    (clunits::units_selected utt "filename")
    (utt.save.wave utt "filename" 'wav)


    8 Customizing festival for Indian Languages

    8.1 Some of the parameters that were customized to deal with Indian languages in the festival framework are:

    Cluster Size This is one of the parameters to be adjusted while building a tree. If the number of nodes for each branch of a tree is very large, it takes more time to synthesize speech, as the time required to search for the appropriate unit increases. We therefore limit the size of each branch of the tree by specifying the maximum number of nodes, which is denoted by cluster size. When the tree is built, the cluster size is limited by putting the clustered set of units through a larger set of questions, to limit the number of units being clustered as one type.

    Duration Penalty Weight (dur_pen_weight) While synthesizing speech, the duration of each unit being picked is also important, as units of very different durations clustered together would make for very unpleasant listening. The dur_pen_weight parameter specifies how much importance should be given to the duration of a unit when the synthesizer is trying to pick units for synthesis. A high value of dur_pen_weight means a unit very similar in duration to the required unit is picked. Otherwise, not much importance is given to duration and more weight is given to other features of the unit.

    Fundamental frequency penalty weight (F0_pen_weight) While listening to synthesized speech, an abrupt change in pitch between units is not very pleasing to the ear. The F0_pen_weight parameter specifies how much importance is given to F0 while selecting a unit for synthesis. The F0 is calculated at the center of the unit, which is approximately where the vowel lies and which plays a major role in the F0 contour of the unit. We therefore try to select units which have similar values of F0, to avoid fluctuations in the F0 contour of the synthesized speech.

    ac_left_context In speech, the way a particular unit is spoken depends a lot on the preceding and succeeding units, i.e. the context in which the unit is spoken. Usually a unit is picked based on what the succeeding unit is; ac_left_context specifies the importance given to picking a unit based on what the preceding unit was. (A compact sketch of where these numeric parameters sit is given after this list.)

    Phrase Markers It is very hard to make sense of something that is said without a pause. It is therefore important to have pauses at the ends of phrases to make what is spoken intelligible. Hindi has certain units called phrase markers which usually mark the end of a phrase. For the purpose of inserting silences at the ends of phrases, these phrase markers were identified and a silence was inserted each time one of them was encountered.

    Morpheme tags There are no phrase markers in Tamil, but there are units called morpheme tags which are found at the ends of words and which can be used to predict silences. The voice was built using these tags to predict phrase-end silences while synthesizing speech.

    Handling silences Since there are a large number of silences in the database, a silence of the wrong duration in the wrong place is a common problem. There is a chance that a long silence is inserted at the end of a phrase, or an extremely short silence is inserted at the end of a sentence, which sounds very inappropriate. The silence units were therefore split into 2 types: SSIL, the silence at the end of a phrase, and LSIL, the silence at the end of a sentence. The silence at the end of a phrase will be of a short duration, while the silence at the end of a sentence will be of a long duration.

    Inserting commas Just picking phrase markers was not sufficient to make the speech prosodically rich. Commas were inserted in the text wherever a pause might naturally occur, and the tree was built using these commas so that their locations could be predicted as pauses while synthesizing speech.

    Duration Modeling This was done so as to include the duration of the unit as a feature while building the tree, and also as a feature to narrow down the number of units selected while picking units for synthesis.

    Prosody Modeling This was achieved via the phrase markers and by inserting commas in the text. Prosody modeling was done to make the synthesized speech more expressive, so that it will be more usable for visually challenged persons.

    Geminates In Indian languages it is very important to preserve the intra-word pause while speaking, as a word spoken without the intra-word pause can have a completely different meaning. These intra-word pauses are called geminates, and care has been taken to preserve them during synthesis.
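    As a compact reference, the numeric knobs discussed in the first four items above correspond to entries in the clunits_params list; the fragment below is only illustrative (the dur_pen_weight and F0_pen_weight values are placeholders), while the remaining values are those given in Section 8.2:

    (dur_pen_weight 0.1)      ;; placeholder value
    (F0_pen_weight 0.1)       ;; placeholder value
    (ac_left_context 0.1)     ;; value used in Section 8.2
    (wagon_cluster_size 7)    ;; value used in Section 8.2
    (cluster_prune_limit 10)  ;; value used in Section 8.2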

    8.2 Modifications in source code

    1. Add the 3 lines below to txt.done.data and also add the corresponding wav and lab files in the respective folders:
    ( text_0998 "LSIL" )
    ( text_0999 "SSIL" )
    ( text_0000 "mono" )

    2. Inside the bin folder, make the following modification in the make_pm_wave file:

    PM_ARGS='-min 0.0057 -max 0.012 -def 0.01 -wave_end -lx_lf 140 -lx_lo 111 -lx_hf 80 -lx_ho 51 -med_o 0'

    Comment the above line and add the following line in the file:

    PM_ARGS='-min 0.003 -max 0.7 -def 0.01 -wave_end -lx_lf 340 -lx_lo 91 -lx_hf 140 -lx_ho 51 -med_o 0'

    3. Open the festvox/build_clunits.scm file:

    => Go to line No:69, i.e. (ac_left_context 0.8), and change the value 0.8 to 0.1
    => Go to line No:87, i.e. (wagon_cluster_size 20), and change the value 20 to 7
    => Go to line No:89, i.e. (cluster_prune_limit 40), and change the value 40 to 10

    4. Open the festvox/voicefoldername_clunits.scm file:

    => Go to line No:136, i.e. (optimal_coupling 1), and change the value 1 to 2

    5. Handling SIL: for a small system this issue need not be handled, but in a system with a large database multiple occurrences of SIL create problems. To solve the issue do the following step:

    => Go to line No:161, the line starting with (define (VOICE_FOLDER_NAME::clunit_name i), and replace the entire function with the following code.


    (define (VOICE_FOLDER_NAME::clunit_name i)
      "(VOICE_FOLDER_NAME::clunit_name i)
    Defines the unit name for unit selection for tam. This can be modified.
    It changes the basic classification of unit for clustering. By default we
    just use the phone name, but we may want to make this the present phone
    plus the previous phone (or something else)."
      (let ((name (item.name i)))
        (cond
         ((and (not iitm_tam_aarthi::clunits_loaded)
               (or (string-equal "h#" name)
                   (string-equal "1" (item.feat i "ignore"))
                   (and (string-equal "pau" name)
                        (or (string-equal "pau" (item.feat i "p.name"))
                            (string-equal "h#" (item.feat i "p.name")))
                        (string-equal "pau" (item.feat i "n.name")))))
          "ignore")
         ((string-equal name "SIL")
          ;; (set! pau_count (+ pau_count 1))
          (string-append
           name (item.feat i "p.name") (item.feat i "p.p.name")))
         ;; Comment out this if you want a more interesting unit name
         ((null nil)
          name)
         ;; Comment out the above if you want to use these rules
         ;((string-equal "+" (item.feat i "ph_vc"))
         ; (string-append
         ;  name
         ;  "_"
         ;  (item.feat i "R:SylStructure.parent.stress")
         ;  "_"
         ;  (iiit_tel_lenina::nextvoicing i)))
         ;((string-equal name "SIL")
         ; (string-append
         ;  name
         ;  "_"
         ;  (VOICE_FOLDER_NAME::nextvoicing i)))
         ;(t
         ; (string-append
         ;  name
         ;  "_"
         ;  ;; (item.feat i "seg_onsetcoda")
         ;  (iiit_tel_lenina::nextvoicing i)))
         )))

    6. Then go to line number 309 and add the following code:

    (define (phrase_number word)
      "(phrase_number word)
    Phrase number of a given word in a sentence."
      (cond
       ((null word) 0)  ;; beginning of utterance
       ((string-equal ";" (item.feat word "p.R:Token.parent.punc")) 0)  ;; end of a sentence
       ((string-equal "," (item.feat word "p.R:Token.parent.punc"))
        (+ 1 (phrase_number (item.prev word))))  ;; end of a phrase
       (t
        (+ 0 (phrase_number (item.prev word))))))

    7. Go to the festival/clunits/ folder and replace the all.desc file: copy the syllables and phones into both the p.name and n.name fields.

    8. Generate the phoneset units along with their features, to include in the phoneset.scm file, by running create_phoneset_languageName.pl. The phoneset.scm file contains a list of all units along with their phonetic features. The create_phoneset.pl script first takes every syllable, breaks it down into smaller units and dumps their phonetic features into the phoneset.scm file. For every syllable the script first checks whether the vowel present in the syllable is a short vowel or a long vowel, and depending on this a particular value is assigned to that field. After that, the starting and ending consonants of the syllable are checked, and depending on the place of articulation of the consonants a particular value is assigned to that field. Depending on the type of vowel and the type of beginning and end consonants we can now assign a value to the type-of-vowel field as well. The fields for manner of articulation are kept as zero.

    9. In the VoiceFolderName_phoneset.scm file:
    Uncomment the following line during TRAINING: (PhoneSet.silences '(SIL))
    Uncomment the following line during TESTING: (PhoneSet.silences '(SSIL))

    10. In the VoiceFolderName_phoneset.scm file, we have to change the phoneset definitions. Replace the feature definitions in the defPhoneSet function with the following code.

    (
     ;; vowel or consonant
     (vlng 1 0)
     ;; full vowel
     (fv 1 0)
     ;; syllable type: v vc/vcc cv/ccv cvc/cvcc
     (syll_type 1 2 3 4 0)
     ;; place of articulation of c1
     (poa_c1 1 2 3 4 5 6 7 0)
     ;; manner of articulation of c1
     (moa_c1 + 0)
     ;; place of articulation of c2: labial alveolar palatal labiodental dental velar
     (poa_c2 1 2 3 4 5 6 7 0)
     ;; manner of articulation of c2
     (moa_c2 + 0)
    )

    11. When running clunits, i.e. the final step, remove ( text_0000 "mono" ) and ( text_0000-2 "phone" ) from txt.done.data (if they exist).

    12. Go to the VoiceFolderName_lexicon.scm file (calling the parser in the lexicon file). Go to line number 137 and add the following code in Hand