6. Text-to-Speech Synthesis - Xavier Anguera

6. Text-to-Speech Synthesis (Most of these slides come from Dan Jurafsky’s course at Stanford)

Transcript of 6. Text-to-Speech Synthesis - Xavier Anguera

Page 1:

6. Text-to-Speech Synthesis

(Most of these slides come from Dan Jurafsky’s course at Stanford)

Page 2:

Tractament Digital de la Parla

History of Speech Synthesis
•  In 1779, the Danish scientist Christian Kratzenstein builds models of the human vocal tract that can produce the five long vowel sounds.
•  In 1791, Wolfgang von Kempelen (the creator of the Turk, a chess-playing automaton) devises the bellows-operated “automatic-mechanical speech machine”. It added models of the tongue and lips that allowed it to produce vowels and consonants.
•  In 1837, Charles Wheatstone produces a "speaking machine" based on von Kempelen's design.
•  In the 1930s, Bell Labs developed the VOCODER, an electronic speech analyzer and synthesizer that was said to be clearly intelligible. It was later refined into the keyboard-operated VODER, which was exhibited at the 1939 New York World's Fair.

Page 3:

Von Kempelen:
•  Small whistles controlled consonants
•  Rubber mouth and nose; nose had to be covered with two fingers for non-nasals
•  Unvoiced sounds: mouth covered, auxiliary bellows driven by string provides puff of air

From Traunmüller’s web site

Page 4:

Von Kempelen’s speaking machine

Page 5:

Bell Labs VOCODER machine

Page 6:

Homer Dudley 1939 VODER
•  Synthesizing speech by electrical means
•  1939 World’s Fair

Page 7:

Homer Dudley’s VODER

•  Manually controlled through complex keyboard
•  Operator training was a problem

Page 8:

One of the first “talking” computers

Page 9:

Closer to a natural vocal tract: Riesz 1937

Page 10:

The UK Speaking Clock

•  July 24, 1936
•  Photographic storage on 4 glass disks
•  2 disks for minutes, 1 for hours, 1 for seconds.
•  Other words in sentence distributed across 4 disks, so all 4 used at once.
•  Voice of “Miss J. Cain”

Page 11:

The 1936 UK Speaking Clock

From http://web.ukonline.co.uk/freshwater/clocks/spkgclock.htm

Page 12:

The UK speaking clock

Page 13:

Gunnar Fant’s OVE synthesizer
•  Of the Royal Institute of Technology, Stockholm
•  Formant synthesizer for vowels
•  F1 and F2 could be controlled

From Traunmüller’s web site

Page 14:

Cooper’s Pattern Playback

•  Haskins Labs for investigating speech perception
•  Works like an inverse of a spectrograph
•  Light from a lamp goes through a rotating disk, then through the spectrogram into photovoltaic cells
•  Thus the amount of light that gets transmitted at each frequency band corresponds to the amount of acoustic energy at that band

Page 15:

Cooper’s Pattern Playback

Page 16:

Modern TTS systems
•  1960’s: first full TTS: Umeda et al (1968)
•  1970’s
   –  Joe Olive 1977: concatenation of linear-prediction diphones
   –  Texas Instruments Speak and Spell
      •  June 1978
      •  Paul Breedlove
•  1980’s
   –  1979 MIT MITalk (Allen, Hunnicutt, Klatt)
•  1990’s-present
   –  Diphone synthesis
   –  Unit selection synthesis
   –  HMM synthesis

Page 17:

Speak and Spell demo

Page 18:

TTS Demos (Unit-Selection)
•  Cereproc
   –  Catalan
   –  Spanish
•  Festival (open source)
   –  English
•  ATT
   –  English
   –  South American

Page 19:

Types of Waveform Synthesis

•  Articulatory Synthesis:
   –  Model movements of articulators and acoustics of vocal tract
•  Formant Synthesis:
   –  Start with acoustics, create rules/filters to create each formant
•  Concatenative Synthesis:
   –  Use databases of stored speech to assemble new utterances.
      •  Diphone
      •  Unit Selection
•  Statistical (HMM) Synthesis
   –  Trains parameters on databases of speech

Text modified from Richard Sproat slides

Page 20:

1st gen. synthesis: Formant Synthesis

•  Formant synthesizers were the most common commercial systems when computers were slow and had little memory.
•  1979 MIT MITalk (Allen, Hunnicutt, Klatt)
•  1983 DECtalk system
   –  “Perfect Paul” (The voice of Stephen Hawking)
   –  “Beautiful Betty”

Page 21:

2nd Generation Synthesis

•  Diphone Synthesis
   –  Units are diphones; middle of one phone to middle of next.
   –  Why? Middle of phone is steady state.
   –  Record 1 speaker saying each diphone
   –  ~1400 recordings
   –  Paste them together and modify prosody.

Page 22:

3rd Generation Synthesis
•  All current commercial systems.
•  Unit Selection Synthesis
   –  Larger units of variable length
   –  Record one speaker speaking 10 hours or more
      •  Have multiple copies of each unit
   –  Use search to find best sequence of units
•  Hidden Markov Model Synthesis
   –  Train a statistical model on large amounts of data.

Page 23:

General TTS architecture

Raw text in

Text analysis:
•  Text Normalization
•  Part-of-speech tagging
•  Homograph disambiguation

Phonetic analysis
•  Dictionary lookup
•  Grapheme-to-phoneme (LTS)

Prosodic analysis
•  Boundary placement
•  Pitch accent assignment
•  Duration computation

Waveform synthesis

Speech out

Page 24:

1. Text Normalization
•  Analysis of raw text into pronounceable words
•  Sample problems:
   –  He stole $100 million from the bank
   –  It's 13 St. Andrews St.
   –  The home page is http://www.stanford.edu
   –  yes, see you the following tues, that's 11/12/01
•  Steps:
   –  Identify tokens in text
   –  Chunk tokens into reasonably sized sections
   –  Map tokens to words
   –  Identify types for words

Page 25:

Identify tokens and chunk them

•  Whitespace can be viewed as separators
•  Punctuation can be separated from the raw tokens
•  Festival converts text into
   –  ordered list of tokens
   –  each with features:
      •  its own preceding whitespace
      •  its own succeeding punctuation
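The token representation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Token` class, the `tokenize` function and the `TRAILING_PUNCT` set are hypothetical names chosen here, not Festival's actual API.

```python
import re
from dataclasses import dataclass

# Assumption: the punctuation characters that may trail a token.
TRAILING_PUNCT = ".,;:!?\"')"

@dataclass
class Token:
    name: str          # raw token with trailing punctuation stripped
    whitespace: str    # whitespace that preceded the token
    punctuation: str   # punctuation that followed the token

def tokenize(text):
    """Split text into an ordered list of tokens with their features."""
    tokens = []
    # Each match is (preceding whitespace, run of non-whitespace characters).
    for ws, raw in re.findall(r"(\s*)(\S+)", text):
        core = raw.rstrip(TRAILING_PUNCT)
        if not core:               # token made only of punctuation: keep as is
            core = raw
        punct = raw[len(core):]
        tokens.append(Token(core, ws, punct))
    return tokens

for tok in tokenize("It's 13 St. Andrews St."):
    print(tok)
```

Keeping the whitespace and punctuation as features, rather than discarding them, lets later stages (end-of-utterance detection, abbreviation expansion) still see the period after "St".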

Page 26:

End-of-utterance detection

•  Relatively simple if utterance ends in ? or !
•  But what about the ambiguity of “.”?
•  Ambiguous between end-of-utterance and end-of-abbreviation
   –  My place on Winfield St. is around the corner.
   –  I live at 151 Winfield St.
   –  (Not “I live at 151 Winfield St..”)
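One very simple way to handle the ambiguity is a hand-written heuristic: a period after a known abbreviation does not end the utterance if the next word starts in lowercase. The sketch below assumes a tiny, hypothetical abbreviation list (`ABBREVIATIONS`) and is not how any particular TTS system does it; practical systems tend to use trained classifiers.

```python
# Assumption: a tiny hand-made abbreviation list, for illustration only.
ABBREVIATIONS = {"st", "dr", "mr", "mrs", "ave"}

def split_utterances(text):
    """Split text into utterances using a simple period heuristic."""
    words = text.split()
    utterances, current = [], []
    for i, word in enumerate(words):
        current.append(word)
        nxt = words[i + 1] if i + 1 < len(words) else None
        if word.endswith(("?", "!")):
            is_end = True
        elif word.endswith("."):
            is_abbrev = word[:-1].lower() in ABBREVIATIONS
            # An abbreviation followed by a lowercase word is taken as
            # sentence-internal; anything else ends the utterance.
            is_end = not (is_abbrev and nxt is not None and nxt[0].islower())
        else:
            is_end = nxt is None
        if is_end:
            utterances.append(" ".join(current))
            current = []
    return utterances

text = "My place on Winfield St. is around the corner. I live at 151 Winfield St."
for utterance in split_utterances(text):
    print(utterance)
```

The heuristic fails on cases such as "13 St. Andrews St.", where the abbreviation is followed by a capitalized word, which is one reason the problem is usually treated as a classification task.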

Page 27:

Identify Types of Tokens, and Convert Tokens to Words

•  Pronunciation of numbers often depends on type. 3 ways to pronounce 1776:
   –  1776 date: seventeen seventy six.
   –  1776 phone number: one seven seven six
   –  1776 quantifier: one thousand seven hundred (and) seventy six
   –  Also:
      •  25 day: twenty-fifth
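The three readings of 1776 can be reproduced with a small expansion routine per token type. This is a minimal sketch: the function names are illustrative, the year reading is simplified (it does not handle years such as 1905 or 2000), and the optional "and" is omitted.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def two_digits(n):
    """Spell out 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("" if ones == 0 else " " + ONES[ones])

def cardinal(n):
    """Spell out 0..9999 as a quantity."""
    parts = []
    thousands, rest = divmod(n, 1000)
    hundreds, rest = divmod(rest, 100)
    if thousands:
        parts.append(ONES[thousands] + " thousand")
    if hundreds:
        parts.append(ONES[hundreds] + " hundred")
    if rest or not parts:
        parts.append(two_digits(rest))
    return " ".join(parts)

def as_digits(token):
    """Read a digit string one digit at a time (phone numbers, room numbers)."""
    return " ".join(ONES[int(d)] for d in token)

def as_year(n):
    """Read a four-digit year as two pairs of digits (simplified)."""
    high, low = divmod(n, 100)
    return two_digits(high) + " " + two_digits(low)

print(as_year(1776))      # seventeen seventy six
print(as_digits("1776"))  # one seven seven six
print(cardinal(1776))     # one thousand seven hundred seventy six
```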

Page 28:

Step 2: Classify token into 1 of 20 types

•  EXPN: abbrev, contractions (adv, N.Y., mph, gov’t)
•  LSEQ: letter sequence (CIA, D.C., CDs)
•  ASWD: read as word, e.g. CAT, proper names
•  MSPL: misspelling
•  NUM: number (cardinal) (12, 45, 1/2, 0.6)
•  NORD: number (ordinal) e.g. May 7, 3rd, Bill Gates II
•  NTEL: telephone (or part) e.g. 212-555-4523
•  NDIG: number as digits e.g. Room 101
•  NIDE: identifier, e.g. 747, 386, I5, PC110
•  NADDR: number as street address, e.g. 5000 Pennsylvania
•  NZIP, NTIME, NDATE, NYER, MONEY, BMONY, PRCT, URL, etc.
•  SLNT: not spoken (KENT*REALTY)

Page 29:

POS Tagging: Definition

•  The process of assigning a part-of-speech or lexical class marker to each word in a corpus:

WORDS: the koala put the keys on the table

TAGS: N, V, P, DET

Page 30:

Part of speech tagging

•  8 (ish) traditional parts of speech
   –  Noun, verb, adjective, preposition, adverb, article, interjection, pronoun, conjunction, etc.
   –  This idea has been around for over 2000 years (Dionysius Thrax of Alexandria, c. 100 B.C.)
   –  Called: parts-of-speech, lexical category, word classes, morphological classes, lexical tags, POS
   –  We’ll use POS most frequently

Page 31:

POS examples

•  N     noun          chair, bandwidth, pacing
•  V     verb          study, debate, munch
•  ADJ   adjective     purple, tall, ridiculous
•  ADV   adverb        unfortunately, slowly
•  P     preposition   of, by, to
•  PRO   pronoun       I, me, mine
•  DET   determiner    the, a, that, those

Page 32:

POS Tagging example

WORD    tag
the     DET
koala   N
put     V
the     DET
keys    N
on      P
the     DET
table   N

Page 33:

POS tagging: Choosing a tagset

•  There are so many parts of speech and potential distinctions we can draw
•  To do POS tagging, we need to choose a standard set of tags to work with
•  Could pick very coarse tagsets
   –  N, V, Adj, Adv.
•  More commonly used set is finer grained, the “UPenn TreeBank tagset”, 45 tags
   –  PRP$, WRB, WP$, VBG
•  Even more fine-grained tagsets exist

Page 34:

Penn TreeBank POS Tag set

Page 35:

POS importance

•  Part of speech tagging plays an important role in TTS
•  Most algorithms get 96-97% tag accuracy
•  Not a lot of studies on whether the remaining error tends to cause problems in TTS

Page 36:

Phonetic analysis

•  Deals with the pronunciation of each word
•  Finds, for each word, the way to pronounce it
•  A straightforward method would be a dictionary lookup, but:
   –  Unknown words and proper names will be missing
   –  Words in other languages

Page 37:

Phoneme sets

Page 38:

Converting from words to phones
•  Two methods:
   –  Dictionary-based
   –  Rule-based (through letter-to-sound, LTS, rules)
•  All early systems used LTS. The first to use a dictionary was MITalk (10K words)
•  A good dictionary example is the CMU dictionary with 127K words

Page 39:

When dictionaries aren’t sufficient

•  Unknown words
   –  Their number seems to grow linearly with the number of words in unseen text
   –  Mostly person, company, product names
   –  But also foreign words, etc.
   –  From a Black et al analysis
      •  Of 39K tokens in part of the Wall Street Journal
      •  1775 (4.6%) were not in the OALD dictionary
•  So commercial systems have a 3-part system:
   –  Big dictionary
   –  Special code for handling names
   –  Machine learned LTS system for other unknown words
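The three-part arrangement amounts to a chain of fallbacks. The sketch below uses toy data: `LEXICON`, `NAME_LEXICON` and their phone strings are made-up placeholders, and `naive_lts` merely stands in for a machine-learned LTS model by emitting one symbol per letter.

```python
# Assumption: toy data standing in for a real lexicon and a name handler.
LEXICON = {"speech": ["s", "p", "iy", "ch"], "the": ["dh", "ax"]}
NAME_LEXICON = {"anguera": ["aa", "n", "g", "eh", "r", "aa"]}

def naive_lts(word):
    """Placeholder for a machine-learned LTS model: one symbol per letter."""
    return list(word)

def pronounce(word):
    """Return (phones, source) using the three-part fallback."""
    key = word.lower()
    if key in LEXICON:                 # 1. big dictionary
        return LEXICON[key], "dictionary"
    if key in NAME_LEXICON:            # 2. special handling of names
        return NAME_LEXICON[key], "names"
    return naive_lts(key), "lts"       # 3. LTS for everything else

for w in ["Speech", "Anguera", "blorf"]:
    print(w, pronounce(w))
```

Returning the source alongside the phones makes it easy to measure how often each stage is actually used on a given text.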


Page 40:

Learning LTS rules
•  Rules for Spanish are straightforward
   –  The grapheme (written form) sequence is very similar to the phoneme sequence
•  In the past most big systems employed many linguists (or PhD students) to manually write LTS rules
•  Now they are mostly induced automatically from a dictionary of the language
   –  An important step is the phoneme-grapheme alignment

Page 41:

Homograph disambiguation
•  A homograph is a word (or group of words) that shares its written form with another but has a different meaning.
•  Homographs might have the same pronunciation (homonyms) or different pronunciations (heteronyms).
•  It is very important to detect and disambiguate them for correct pronunciation
•  Examples:
   –  English: wood (material) / wood (forest), bear (to endure) / bear (the animal), read (present tense) / read (past tense)
   –  Spanish: all homographs are homonyms
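One common way to pick between heteronym pronunciations is to key the lookup on the part-of-speech tag assigned earlier in text analysis. The sketch below is an assumption-laden illustration: the `HOMOGRAPHS` table is hand-written, the phone symbols are ARPAbet-like, and the tags follow the Penn Treebank convention (VB base form, VBD past tense, JJ adjective).

```python
# Assumption: a hand-written table keyed by (word, POS tag).
HOMOGRAPHS = {
    ("read", "VB"): ["r", "iy", "d"],    # "I will read it"
    ("read", "VBD"): ["r", "eh", "d"],   # "I read it yesterday"
    ("live", "VB"): ["l", "ih", "v"],    # "I live here"
    ("live", "JJ"): ["l", "ay", "v"],    # "a live concert"
}

def pronounce_homograph(word, pos_tag, default=None):
    """Look up the pronunciation of a heteronym given its POS tag."""
    return HOMOGRAPHS.get((word.lower(), pos_tag), default)

print(pronounce_homograph("read", "VBD"))  # ['r', 'eh', 'd']
print(pronounce_homograph("live", "JJ"))   # ['l', 'ay', 'v']
```

POS alone does not separate every pair (both senses of "bass" are nouns, for example), so such a table would only be a first step.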

Page 42:

Next steps after text analysis

•  After Text Analysis we now have a string of phones. We also have some semantic information to do Prosodic Analysis, so the next steps are:
•  Prosody
   –  Desired F0 for entire utterance
   –  Duration for each phone
   –  Stress value for each phone, possibly accent value
•  Generate Waveforms

Page 43:

Prosody

Prosody is in charge of converting words+phones into boundaries, accent, F0 and duration information
•  Prosodic phrasing
   –  Need to break utterances into phrases
   –  Punctuation is useful, not sufficient
•  Accents:
   –  Prediction of accents: which syllables should be accented
   –  Realization of F0 contour: given accents/tones, generate F0 contour
•  Duration:
   –  Predicting duration of each phone

Page 44:

Graphic representation of F0

[Figure: F0 contour (in Hertz, axis from 50 to 400) over time for “legumes are a good source of VITAMINS”]

Slide from Jennifer Venditti

Page 45:

Alternative accents

•  We might want to give prominence to different parts of a sentence depending on the context.
•  For example, imagine answering a question on the previous example

Page 46:

Q1: What types of foods are a good source of vitamins?

[Figure: F0 contour (axis from 50 to 400) for “LEGUMES are a good source of vitamins”]

Slide from Jennifer Venditti

Page 47:

Q2: Are legumes a source of vitamins?

[Figure: F0 contour (axis from 50 to 400) for “Legumes are a GOOD source of vitamins”]

Slide from Jennifer Venditti

Page 48:

Q3: I’ve heard that legumes are healthy, but what are they a good source of?

[Figure: F0 contour (axis from 50 to 400) for “legumes are a good source of VITAMINS”]

Slide from Jennifer Venditti

Page 49:

Duration
•  Simplest: fixed size for all phones (100 ms)
•  Next simplest: average duration for that phone (from training data). Samples from SWBD in ms:

   phone  ms     phone  ms
   aa     118    b      68
   ax     59     d      68
   ay     138    dh     44
   eh     87     f      90
   ih     77     g      66

•  Next next simplest: add in phrase-final and initial lengthening plus stress
•  Lots of fancy models of duration prediction:
   –  Using Z-scores and other clever normalizations
   –  Sum-of-products model
   –  New features like word predictability
      •  Words with higher bigram probability are shorter
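The first three levels of this progression fit in one small function. The mean durations below are the Switchboard samples quoted on this slide; the two lengthening factors are purely illustrative assumptions, not measured values.

```python
# Average phone durations in ms (the Switchboard samples quoted on this slide).
MEAN_DURATION_MS = {"aa": 118, "ax": 59, "ay": 138, "eh": 87, "ih": 77,
                    "b": 68, "d": 68, "dh": 44, "f": 90, "g": 66}

DEFAULT_MS = 100            # the "fixed size for all phones" fallback
# Assumption: illustrative scaling factors, not taken from any study.
STRESS_FACTOR = 1.2
PHRASE_FINAL_FACTOR = 1.4

def phone_duration(phone, stressed=False, phrase_final=False):
    """Predict a phone duration in ms from its mean plus simple lengthening."""
    duration = MEAN_DURATION_MS.get(phone, DEFAULT_MS)
    if stressed:
        duration *= STRESS_FACTOR
    if phrase_final:
        duration *= PHRASE_FINAL_FACTOR
    return duration

print(phone_duration("aa"))                      # mean duration only
print(phone_duration("eh", stressed=True))       # lengthened by stress
print(phone_duration("zh", phrase_final=True))   # unknown phone: fixed fallback
```

Multiplicative factors are one simple choice; the sum-of-products models mentioned above generalize this idea by combining several such factor terms.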

Page 50:

Waveform synthesis generation

•  Given:
   –  String of phones
   –  Prosody
      •  Desired F0 for entire utterance
      •  Duration for each phone
      •  Stress value for each phone, possibly accent value
•  Generate:
   –  Waveforms

Page 51:

The hourglass architecture

Page 52:

Internal Representation: Input to Waveform Synthesis

Page 53:

Waveform synthesis

•  Concatenative synthesis systems
   –  Diphone synthesis: join diphones to form the waveform.
   –  Unit selection synthesis: join the longest unit possible.
•  HMM-based synthesis

Page 54:

Diphones
•  Mid-phone is more stable than edge
•  Need O(phone^2) number of units
   –  Some combinations don’t exist (hopefully)
   –  ATT (Olive et al. 1998) system had 43 phones
      •  1849 possible diphones
      •  Phonotactics ([h] only occurs before vowels), don’t need to keep diphones across silence
      •  Only 1172 actual diphones
   –  May include stress, consonant clusters
      •  So could have more
   –  Lots of phonetic knowledge in design
•  Database relatively small (by today’s standards)
   –  Around 8 megabytes for English (16 kHz, 16 bit)

Slide from Richard Sproat
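The figures on this slide can be checked with a few lines of arithmetic. The sketch assumes mono, uncompressed audio and 1 megabyte = 10^6 bytes, neither of which is stated on the slide.

```python
phones = 43
possible_diphones = phones ** 2          # every ordered pair of phones
actual_diphones = 1172                   # figure quoted for the ATT system

sample_rate_hz = 16_000
bytes_per_sample = 2                     # 16-bit samples
database_bytes = 8 * 10**6               # assumption: 1 MB = 10^6 bytes, mono, uncompressed

seconds_of_audio = database_bytes / (sample_rate_hz * bytes_per_sample)
ms_per_diphone = 1000 * seconds_of_audio / actual_diphones

print(possible_diphones)                 # 1849
print(seconds_of_audio)                  # 250.0
print(round(ms_per_diphone))             # 213
```

Under these assumptions the whole database holds only about four minutes of speech, roughly a fifth of a second per stored diphone, which is tiny compared with the hours of recordings used for unit selection.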

Page 55:

Diphones
•  Mid-phone is more stable than edge:

Page 56:

Diphone synthesis
•  Training:
   –  Choose units (kinds of diphones)
   –  Record 1 speaker saying 1 example of each diphone
   –  Mark the boundaries of each diphone
      •  Cut each diphone out and create a diphone database
•  Synthesizing an utterance:
   –  Grab relevant sequence of diphones from database
   –  Concatenate the diphones, doing slight signal processing at boundaries
   –  Use signal processing to change the prosody (F0, energy, duration) of selected sequence of diphones
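The "grab relevant sequence of diphones" step starts by turning the phone string into diphone names. A minimal sketch follows; the `pau` silence label and the `a-b` naming scheme are assumptions made for this example.

```python
def to_diphones(phones, silence="pau"):
    """Turn a phone sequence into the diphone names needed to synthesize it."""
    # Pad with silence so the utterance starts and ends on a silence diphone.
    padded = [silence] + list(phones) + [silence]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]

print(to_diphones(["k", "ae", "t"]))
# ['pau-k', 'k-ae', 'ae-t', 't-pau']
```

An utterance of N phones therefore needs N + 1 diphones, each of which is then looked up in the diphone database and concatenated.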

Page 57:

Diphone labelling
•  Recorded sentences need to be labelled with phone start, end and middle (stable part). This can be done automatically with:
   –  ASR forced alignment
   –  Audio-to-synthetic audio auto-alignment

Page 58:

Diphone auto-alignment

•  Given
   –  Synthesized prompts
   –  Human speech of same prompts
•  Do a dynamic time warping alignment of the two
   –  Using Euclidean distance
•  Works very well 95%+
   –  Errors are typically large (easy to fix)
   –  Maybe even automatically detected

Slide from Richard Sproat
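A dynamic time warping alignment with Euclidean frame distance can be written compactly. The sketch below is a textbook version with the basic three-way step pattern; it assumes each input is a list of equal-length feature vectors (for instance one vector per analysis frame) and says nothing about which features a real system would use.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw(seq_a, seq_b):
    """Return (total cost, alignment path) between two feature sequences."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = euclidean(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # advance in seq_a only
                                 cost[i][j - 1],      # advance in seq_b only
                                 cost[i - 1][j - 1])  # advance in both
    # Backtrack from the end to recover which frames were aligned.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [(cost[i - 1][j - 1], i - 1, j - 1),
                 (cost[i - 1][j], i - 1, j),
                 (cost[i][j - 1], i, j - 1)]
        _, i, j = min(steps)
    path.reverse()
    return cost[n][m], path

synthetic = [[0.0], [1.0], [2.0]]
human = [[0.0], [0.0], [1.0], [2.0]]   # same content, first frame held longer
total, path = dtw(synthetic, human)
print(total, path)
```

Since the phone boundaries of the synthesized prompt are known, the alignment path transfers them to the human recording: each boundary frame in the synthetic signal maps to the human frame it was paired with.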

Page 59:

Dynamic Time Warping

Slide from Richard Sproat

Page 60:

Putting it all together

Once we have the diphones we want to join, we need to:
•  Join them together to form the words and sentences
•  Modify the speaking speed to reflect the target duration
•  Modify the pitch to reflect the target intonation

Page 61:

Joining diphones

Given any group of diphones to join together, we can follow a set of techniques:
•  Dumb:
   –  Just join -> we will get many artifacts
   –  Join at zero crossings
•  Algorithms that allow joining and modifying the duration at the same time
   –  SOLA
   –  TD-PSOLA (Time-domain pitch-synchronous overlap-and-add)
   –  Necessary prerequisite in voiced signals: epoch labelling

Page 62:

Epoch-labeling

•  An example of epoch-labeling using “SHOW PULSES” in Praat:

It can be easily done using an EGG, or (less accurately) via signal processing

Page 63:

Epoch-labeling: Electroglottograph (EGG)

•  Also called laryngograph or Lx
   –  Device that straps on speaker’s neck near the larynx
   –  Sends small high frequency current through Adam’s apple
   –  Human tissue conducts well; air not as well
   –  Transducer detects how open the glottis is (i.e. amount of air between folds) by measuring impedance.

Picture from UCLA Phonetics Lab

Page 64:

Duration/pitch modification

If we play back a sound faster than usual (for example by resampling the signal), both pitch and length are affected. To alter only one of them:
•  Time stretching: algorithms used to change the length/speed of a signal without altering its pitch
•  Pitch scaling/shifting: algorithms used to change the pitch without affecting the length/speed.
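The coupling between pitch and length under naive resampling is easy to verify numerically. The sketch below builds a 100 Hz sine, speeds it up by keeping every other sample, and estimates the pitch by counting upward zero crossings; the helper names and the small phase offset (used only to keep samples away from exact zeros) are choices made for this example.

```python
import math

def sine(freq_hz, sample_rate, seconds, phase=-0.1):
    """Generate a sine wave as a list of samples."""
    n = int(sample_rate * seconds)
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate + phase)
            for i in range(n)]

def estimate_pitch(signal, sample_rate):
    """Estimate frequency by counting upward zero crossings per second."""
    crossings = sum(1 for a, b in zip(signal, signal[1:]) if a < 0 <= b)
    return crossings / (len(signal) / sample_rate)

sample_rate = 8000
original = sine(100, sample_rate, 1.0)
faster = original[::2]   # naive 2x speed-up: keep every other sample

# Both signals are played back at the same sample rate.
print(len(original) / sample_rate, estimate_pitch(original, sample_rate))
print(len(faster) / sample_rate, estimate_pitch(faster, sample_rate))
```

The decimated signal lasts half as long and its estimated pitch doubles, which is exactly the side effect that time-stretching and pitch-shifting algorithms are designed to avoid.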

Page 65:

Speech as Short Term signals

Alan Black

Page 66:

Duration modification
•  Duplicate/remove short term signals

Page 67:

Pitch Modification
•  Move short-term signals closer together/further apart

Slide from Richard Sproat

Page 68:

Windowing
•  To avoid artifacts at the edges, we first apply a window to the short-term signals
•  y[n] = w[n]s[n]

Page 69:

Windowing

•  Multiply value of signal at sample number n by the value of a windowing function

•  y[n] = w[n]s[n]
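The product y[n] = w[n]s[n] is a sample-by-sample multiplication. The slides do not say which window is used, so the sketch below assumes a Hann window, a common choice that tapers smoothly to zero at both edges; frames are assumed to have at least two samples.

```python
import math

def hann(length):
    """Hann window: w[n] = 0.5 - 0.5 * cos(2*pi*n / (length - 1))."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / (length - 1))
            for n in range(length)]

def apply_window(frame):
    """Return y[n] = w[n] * s[n] for one short-term frame."""
    window = hann(len(frame))
    return [w * s for w, s in zip(window, frame)]

frame = [1.0, 1.0, 1.0, 1.0, 1.0]
print(apply_window(frame))
```

For the constant frame above the output simply traces the window shape: zero at the edges and one in the middle, so the frame fades in and out instead of starting and stopping abruptly.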

Page 70:

Pitch/duration modification algorithms
•  OLA (Overlap-add synthesis): applies a certain overlap to each short-term signal (if we desire a pitch modification) and adds them. If we want to lengthen the signal we just repeat some of the periods. It can create glitches when the short-term boundaries are not well defined.
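A plain overlap-add time stretch can be sketched as follows. This is a minimal illustration under stated assumptions, not the slide author's algorithm: it uses fixed-length Hann-windowed frames rather than pitch-synchronous ones, and the parameter names (`rate`, `frame_len`, `synthesis_hop`) are chosen here. Frames are read every `analysis_hop` samples and written every `synthesis_hop` samples, so reading more slowly than writing lengthens the signal.

```python
import math

def ola_stretch(signal, rate, frame_len=400, synthesis_hop=200):
    """Time-stretch `signal` by a factor of about 1/rate with overlap-add.

    rate > 1 shortens the signal, rate < 1 lengthens it.
    """
    analysis_hop = max(1, int(round(synthesis_hop * rate)))
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]   # periodic Hann window
    n_frames = max(0, (len(signal) - frame_len) // analysis_hop + 1)
    out_len = (n_frames - 1) * synthesis_hop + frame_len if n_frames else 0
    out = [0.0] * out_len
    norm = [0.0] * out_len
    for k in range(n_frames):
        src = k * analysis_hop      # where the frame is read from
        dst = k * synthesis_hop     # where the frame is written to
        for n in range(frame_len):
            out[dst + n] += window[n] * signal[src + n]
            norm[dst + n] += window[n]
    # Divide by the summed windows so overlapping regions keep their level.
    return [o / w if w > 1e-8 else 0.0 for o, w in zip(out, norm)]

signal = [1.0] * 4000
longer = ola_stretch(signal, rate=0.5)
shorter = ola_stretch(signal, rate=2.0)
print(len(signal), len(longer), len(shorter))
```

Because the frames are not aligned with pitch periods, overlapping copies of a voiced signal can be out of phase, which produces the glitches mentioned above; pitch-synchronous methods such as TD-PSOLA address this by centering frames on the epochs.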

! !!!!!!!!!!!!! ! "!#!$!$ $$""!!%"&$ '!!"#$"

"("$("$" '!"#%#&#$"

)*+!%,-./+%!%0"&$!1&/2!3044+5! 451-!6+51!1&!,&! 0&7+58,/!3+.+&3+&7!1&!5+918+50&:!4,9715!;<=!3+40&+3!,%!,!5,701!14!!%06+!>!14!7*+!,&,/2%0%!?0&31?!?"&$!@2!7*+!.079*!.+5013(!!

! 'A!'() ! "!B!$!!

C&! .5,9709+=! ?+! 9*11%+! ;<! ! B=! ?*+&! 7*+! %.+975D-! 14! ! %0"&$! %0:&,/! ,..51E0-,7+%! 7*+!%.+975D-!14!%"&$!%0:&,/(!)*+&! 7*+!91&9,7+&,701&!.519+%%!9*,&:+%! 7*+!.079*!?07*1D7!,44+970&:!7*+!415-,&7%F!45+GD+&90+%(!)*+!D%+!14!3044+5+&7!5+918+50&:!4,9715!9,D%+%!%751&:!3+:5,3,701&!14!%2&7*+%06+3!%.++9*=!+(:(!@D660&:!15!+44+97!14!-+7,/!8109+(!!

!! "#$%&'(&$)"%'*%+,("#$+-.$/%01+$2%'(%"#$%$34%21"101+$%

)*+!91&9+.7D,/!%9*+-+!14!7*+!30.*1&+!%2&7*+%06+5!0%!%*1?&!1&!7*+!40:D5+!#(!)*+5+!,5+!7*5++!@,%09!.*,%+%!14!%2&7*+%0%(!C&!7*+!405%7!.*,%+=!07!0%!&+9+%%,52!71!95+,7+!,!3,7,@,%+!71!%715+!%,-./+%! 14! *D-,&! /,&:D,:+(! )*+&=! 7*+! .,5709D/,5! %+:-+&7%! 14! 915.D%=! %D9*! ,%! .*1&+-+%=!30.*1&+%=! .+5013%=! +79(! ,5+! ,D71-,709,//2! 5+91:&06+3(! )*0%! %7+.! 0%! 31&+! @2! 7*+! %.+90,/!5+91:&060&:! %147?,5+=! 7*+! .,57! 14! 1D5! %2%7+-(! )0-+! -,5H%! 415! 7*+! .,5709D/,5! %+:-+&7%! 14!915.D%=!7*+!1D7.D7!14!7*+!%147?,5+=!,5+!%715+3!0&!7*+!3,7,@,%+(!!

)*+!%+91&3!.*,%+!0%!7*+!%.++9*!%2&7*+%0%(!I%+5!?507+%!7+E7=!*+!?,&7%!71!%2&7*+%06+!?07*!3+%05+3!45+GD+&92(!;05%7/2=!7*+!7+E7!0%!,&,/2%+3(!)*+&!7*+!3,7,@,%+!0%!%+,59*+3!,&3!,991530&:!71!7*+! 5D/+%! 14! 7*+! /,&:D,:+=! ,..51.50,7+! 30.*1&+%! ,5+! -,5H+3(! ;0&,//2=! 7*+2! ,5+! %2&7*+%06+3!,991530&:! 71! 7*+! )J! KLM>N! ,/:1507*-(! L2&7*+%06+3! %.++9*! 0%! %715+3! 0&! 7*+! 3,7,@,%+! 415!4D57*+5!.519+%%0&:!"7*+!7*053!.*,%+$=!15!07!9,&!@+!./,2+3!18+5!,&!1D7.D7!3+809+(!!

[Figure 1. Conceptual scheme of the diphone synthesizer. Blocks: samples of speech; EMU database system; automatic recognition of phonemes, diphones and periods; user command; command analysis; searching for segments; speech synthesis based on TD-PSOLA; acoustical output.]
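The lookup step of a diphone synthesizer can be sketched as follows: a phoneme sequence is turned into diphone names, and each name is looked up in a table of stored time marks. The naming scheme (`left-right`, with `_` for silence), the helper names and the database contents are hypothetical, chosen only to illustrate the idea.

```python
def to_diphones(phonemes, silence="_"):
    """Turn a phoneme sequence into diphone names, padded with silence."""
    padded = [silence] + list(phonemes) + [silence]
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]

def lookup_segments(diphones, database):
    """Fetch the stored time marks of each diphone; fail loudly if one is missing."""
    missing = [name for name in diphones if name not in database]
    if missing:
        raise KeyError(f"diphones missing from database: {missing}")
    return [database[name] for name in diphones]

# Hypothetical database: diphone name -> (start, end) time marks in seconds.
database = {
    "_-k": (0.00, 0.08),
    "k-a": (0.31, 0.42),
    "a-t": (1.10, 1.21),
    "t-_": (2.05, 2.14),
}

diphones = to_diphones(["k", "a", "t"])
print(diphones)                              # ['_-k', 'k-a', 'a-t', 't-_']
print(lookup_segments(diphones, database))
```

The segments returned by the lookup would then be cut from the recordings and joined by the signal-processing stage.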


[Figure labels: original period, desired period]

Page 71

Pitch/duration modification algorithms

•  PSOLA (Pitch-synchronous overlap-add synthesis): Using a good pitch/F0 detector, it defines the short-term signals around amplitude peaks, each containing 2-3 pitch periods. It then uses OLA to join them.

•  TD-PSOLA (Time-domain PSOLA): Patented by France Telecom, it is an efficient time-domain version of PSOLA (no FFT required, suitable for real-time TTS). It can modify the pitch by a factor of up to 2 or down to 0.5.

•  MBR-PSOLA (Multi-band re-synthesis PSOLA): Tries to solve the problems of TD-PSOLA at the boundaries between diphones, caused by phase, pitch or spectral envelope mismatches, by normalizing the whole database a priori. It later turned into the open project MBROLA, which focuses on minimizing database storage.

•  LP-PSOLA (Linear predictors PSOLA): A modification of PSOLA in which the LP coefficients are stored instead of the signal itself.
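A rough sketch of time-domain pitch modification in the PSOLA style, assuming the pitch marks are already known. Each grain is a two-period Hann-windowed segment centered on an analysis pitch mark; grains are re-spaced at the new period and overlap-added. The function name, the nearest-mark selection rule and the normalization by the summed window weights are simplifying assumptions, not the patented algorithm.

```python
import math

def td_psola(signal, pitch_marks, pitch_factor):
    """Change pitch by `pitch_factor` (about 0.5 to 2.0) while keeping duration."""
    out = [0.0] * len(signal)
    norm = [0.0] * len(signal)          # summed window weights, for normalization
    # Local pitch period (in samples) at every analysis pitch mark.
    periods = [b - a for a, b in zip(pitch_marks, pitch_marks[1:])]
    periods.append(periods[-1])
    t = float(pitch_marks[0])           # position of the next synthesis mark
    while t <= pitch_marks[-1]:
        # Reuse the analysis grain whose pitch mark is closest in time.
        j = min(range(len(pitch_marks)), key=lambda i: abs(pitch_marks[i] - t))
        period, center, target = periods[j], pitch_marks[j], round(t)
        for k in range(-period, period + 1):
            src, dst = center + k, target + k
            if 0 <= src < len(signal) and 0 <= dst < len(out):
                w = 0.5 + 0.5 * math.cos(math.pi * k / period)   # two-period Hann
                out[dst] += signal[src] * w
                norm[dst] += w
        t += period / pitch_factor      # new spacing between pitch marks
    return [o / w if w > 1e-9 else 0.0 for o, w in zip(out, norm)]

# Assumed test signal: fundamental plus second harmonic, period of 50 samples.
period = 50
signal = [math.sin(2 * math.pi * n / period) + 0.5 * math.sin(4 * math.pi * n / period)
          for n in range(500)]
marks = list(range(50, 500, 50))        # one pitch mark per period (assumed known)

same = td_psola(signal, marks, 1.0)     # unchanged between the first and last mark
higher = td_psola(signal, marks, 2.0)   # output repeats every 25 samples
```

With a factor of 1.0 the grains overlap-add back to the input. With a factor of 2.0 the same grains are placed twice as densely, so the output period halves while the duration stays the same.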

Page 72

[Figure: a) summed signals without overlap; b, c, d) signals with three different amounts of overlap (the values are not recoverable from the source).]

Conclusions and Perspectives

Although the speech synthesized by the TD-PSOLA algorithm is quite understandable, the human ear perceives it as synthetic. Furthermore, a large elongation of the synthesized speech signal by adding periods causes an echo effect. Contraction of the signal does not cause an unfavourable auditory impression. However, omitting more periods from a diphone causes a loss of speech information content.

The TD-PSOLA algorithm was implemented in the TCL scripting language. As an interpreted language, TCL is relatively slow. Even though the program was optimized, the synthesis of a sentence takes several seconds. I suppose that the use of a compiled language, e.g. C++, combined with future growth in hardware performance, could shorten this time to hundreds of milliseconds. The undeniable advantage of TCL remains the possibility of porting TCL scripts between operating systems.

The fundamental problem of the EMU TCL console is the large number of errors in its source code. Many functions do not work properly, and some of them, in special cases, do not work at all. Some of them therefore had to be reimplemented.

Although speech created by concatenating diphones is superior to speech created by concatenating phonemes, this approach does not enable the synthesis of high-quality human speech, because not even a huge diphone database is able to cover the great variety of human speech. Applying Multi-Band Excitation resynthesis to the database might slightly improve the quality of the speech, but articulatory synthesizers remain the vision for the future.


Page 73

Problems with diphone synthesis

•  Signal processing leaves artifacts, making the speech sound unnatural

•  Diphone synthesis only captures local effects
    –  But there are many more global effects (syllable structure, stress pattern, word-level effects)

Page 74

Unit Selection Synthesis

•  Generalization of the diphone intuition
    –  Larger units
        •  From diphones to sentences
    –  Many, many copies of each unit
        •  10 hours of speech instead of 1500 diphones (a few minutes of speech)
    –  Little or no signal processing applied to each unit
        •  Unlike diphones

Page 75

Why Unit Selection Synthesis

•  Natural data solves problems with diphones
    –  Diphone databases are carefully designed but:
        •  Speaker makes errors
        •  Speaker doesn’t speak intended dialect
        •  Require database design to be right
    –  If it’s automatic
        •  Labeled with what the speaker actually said
        •  Coarticulation, schwas, flaps are natural
•  “There’s no data like more data”
    –  Lots of copies of each unit mean you can choose just the right one for the context
    –  Larger units mean you can capture wider effects

Page 76

Unit Selection Intuition

•  Given a big database
•  For each segment (diphone) that we want to synthesize
    –  Find the unit in the database that is the best to synthesize this target segment
•  What does “best” mean?
    –  “Target cost”: Closest match to the target description, in terms of
        •  Phonetic context
        •  F0, stress, phrase position
    –  “Join cost”: Best join with neighboring units
        •  Matching formants + other spectral characteristics
        •  Matching energy
        •  Matching F0

$$C(t_1^n, u_1^n) = \sum_{i=1}^{n} C_{\text{target}}(t_i, u_i) + \sum_{i=2}^{n} C_{\text{join}}(u_{i-1}, u_i)$$

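A minimal sketch of how the total cost of one candidate unit sequence could be evaluated, as the sum of the target costs of all units plus the join costs of all adjacent pairs. The concrete cost functions (absolute F0 mismatch, plus an energy mismatch at the joins) and the field names are toy assumptions; real systems combine several weighted sub-costs.

```python
def total_cost(targets, units, target_cost, join_cost):
    """Sum of per-unit target costs plus the join costs of adjacent units."""
    cost = sum(target_cost(t, u) for t, u in zip(targets, units))
    cost += sum(join_cost(prev, cur) for prev, cur in zip(units, units[1:]))
    return cost

# Toy target cost (assumption): absolute F0 mismatch in Hz.
def f0_target_cost(target, unit):
    return abs(target["f0"] - unit["f0"])

# Toy join cost (assumption): F0 mismatch plus energy mismatch in dB.
def f0_energy_join_cost(prev, cur):
    return abs(prev["f0"] - cur["f0"]) + abs(prev["energy_db"] - cur["energy_db"])

targets = [{"f0": 100}, {"f0": 110}, {"f0": 120}]
units = [
    {"f0": 98, "energy_db": 60},
    {"f0": 115, "energy_db": 62},
    {"f0": 118, "energy_db": 59},
]
print(total_cost(targets, units, f0_target_cost, f0_energy_join_cost))   # 34
```

Here the target costs add up to 9 and the two join costs to 25.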
Page 77

Search for the best units using the Viterbi algorithm

$$\hat{u}_1^n = \operatorname*{argmin}_{u_1, \ldots, u_n} C(t_1^n, u_1^n)$$
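The search can be sketched with dynamic programming: for every candidate unit at position i, keep the cheapest path that ends in it, then backtrack from the cheapest final unit. The function name, the representation of units as plain F0 values and the use of absolute F0 difference for both costs are assumptions made to keep the example small.

```python
def select_units(targets, candidates, target_cost, join_cost):
    """Viterbi search; candidates[i] lists the database units usable for targets[i]."""
    # best[i][k] = (cost of the cheapest path ending in candidates[i][k], back-pointer)
    best = [[(target_cost(targets[0], u), None) for u in candidates[0]]]
    for i in range(1, len(targets)):
        column = []
        for u in candidates[i]:
            # Cheapest predecessor for this unit, including the join cost.
            prev_k = min(
                range(len(candidates[i - 1])),
                key=lambda k: best[i - 1][k][0] + join_cost(candidates[i - 1][k], u),
            )
            cost = (best[i - 1][prev_k][0]
                    + join_cost(candidates[i - 1][prev_k], u)
                    + target_cost(targets[i], u))
            column.append((cost, prev_k))
        best.append(column)
    # Backtrack from the cheapest unit in the last column.
    k = min(range(len(best[-1])), key=lambda j: best[-1][j][0])
    total = best[-1][k][0]
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][k])
        k = best[i][k][1]
    return path[::-1], total

# Toy data (assumption): targets and units are F0 values in Hz.
def mismatch(a, b):
    return abs(a - b)

targets = [100, 120, 110]
candidates = [[90, 105], [100, 125], [108, 130]]
path, total = select_units(targets, candidates, mismatch, mismatch)
print(path, total)   # [105, 100, 108] 40
```

The work grows with the number of positions times the square of the number of candidates per position, instead of with the number of all possible unit sequences.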

Page 78

HMM synthesis

•  It is a totally different approach from concatenative synthesis
    –  It is much closer to ASR
•  Hidden Markov Models (HMMs) are trained on labeled data to learn how each phone is pronounced in each condition
    –  They also learn its prosody
•  Then, given a desired phoneme sequence and prosody pattern, the model outputs the most probable audio sequence.

Samples ~2007; Sample 2010 NITech
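A deliberately simplified sketch of the generation idea: if each HMM state is a Gaussian, its single most probable output is its mean, so a parameter trajectory can be produced by emitting each state's mean for that state's duration. The model table, the choice of F0 as the only parameter and the fixed durations are all hypothetical. Real systems model spectral and excitation parameters together, use dynamic features to smooth the trajectory, and pass the result through a vocoder.

```python
def generate_trajectory(phones, models):
    """Emit, frame by frame, the mean of each HMM state for its duration."""
    frames = []
    for phone in phones:
        for mean_f0_hz, duration_frames in models[phone]:
            frames.extend([mean_f0_hz] * duration_frames)
    return frames

# Hypothetical trained models: phone -> list of (state mean F0 in Hz, duration in frames).
models = {
    "a": [(118.0, 2), (121.0, 3), (119.0, 2)],
    "m": [(110.0, 2), (108.0, 2)],
}

trajectory = generate_trajectory(["m", "a"], models)
print(trajectory)
```

Unlike concatenative synthesis, no recorded waveform is reused here: the output is generated from model statistics.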