6.#Textto(Speech#Synthesis# - Xavier Anguera · that allowed it to produce voewls and consonants....

6. Text-‐to-‐Speech Synthesis

(Most Of these slides come from Dan Jura>y’s course at Stanford)

Tractament Digital de la Parla 2

History of Speech Synthesis •  In 1779, the Danish scientist Christian Kratzenstein builds models of

the human vocal tract that can produce the five long vowel sounds. •  In 1791 Wolfgang von Kempelen (the creator of the Turk chess playing

game) devises the bellows(fuelle, mancha)-operated “automatic-mechanical speech machine”. It added models of the tongue and lips that allowed it to produce voewls and consonants.

•  In 1837 Charles Wheatstone produces a "speaking machine" based on von Kempelen's design.

•  In the 1930s, Bell Labs developed the VOCODER, a keyboard-operated electronic speech analyzer and synthesizer that was said to be clearly intelligible. It was later refined into the VODER, which was exhibited at the 1939 New York World's Fair.

Von Kempelen: •  Small whistles

controlled consonants •  Rubber mouth and

nose; nose had to be covered with two fingers for non-nasals

•  Unvoiced sounds: mouth covered, auxiliary bellows driven by string provides puff of air

From Traunmüller’s web site

Von Kempelen’s speaking machine

Bell labs VOCODER machine

Homer Dudley 1939 VODER •  Synthesizing speech by electrical means •  1939 World’s Fair

Homer Dudley’s VODER

• Manually controlled through complex keyboard • Operator training was a problem

One of the first “talking” computers

Closer to a natural vocal tract: Riesz 1937

The UK Speaking Clock

•  July 24, 1936 •  Photographic storage on 4 glass disks •  2 disks for minutes, 1 for hour, one for

seconds. •  Other words in sentence distributed across

4 disks, so all 4 used at once. •  Voice of “Miss J. Cain”

The 1936 UK Speaking Clock

From hQp://web.ukonline.co.uk/freshwater/clocks/spkgclock.htm

The UK speaking clock

Gunnar Fant’s OVE synthesizer •  Of the Royal Institute of

Technology, Stockholm •  Formant

Synthesizer for vowels

•  F1 and F2 could be controlled

From Traunmüller’s web site

Cooper’s Pattern Playback

•  Haskins Labs for investigating speech perception

•  Works like an inverse of a spectrograph •  Light from a lamp goes through a rotating

disk then through spectrogram into photovoltaic cells

•  Thus amount of light that gets transmitted at each frequency band corresponds to amount of acoustic energy at that band

Cooper’s Pattern Playback

Modern TTS systems •  1960’s first full TTS: Umeda et al (1968) •  1970’s

–  Joe Olive 1977 concatenation of linear-prediction diphones –  Texas Instruments Speak and Spell,

•  June 1978 •  Paul Breedlove

•  1980’s –  1979 MIT MITalk (Allen, Hunnicut, Klatt)

•  1990’s-present –  Diphone synthesis –  Unit selection synthesis –  HMM synthesis

Speak and spell demo

TTS Demos (Unit-Selection) •  Cereproc

– Catalan – Spanish

•  Festival (open source) – English

•  ATT – English – South American

Types of Waveform Synthesis

•  Articulatory Synthesis: –  Model movements of articulators and acoustics of vocal

tract •  Formant Synthesis:

–  Start with acoustics, create rules/filters to create each formant

•  Concatenative Synthesis: –  Use databases of stored speech to assemble new

utterances. •  Diphone •  Unit Selection

•  Statistical (HMM) Synthesis –  Trains parameters on databases of speech

Text modified from Richard Sproat slides

1st gen. synthesis: Formant Synthesis

•  Were the most common commercial systems when computers were slow and had little memory.

•  1979 MIT MITalk (Allen, Hunnicut, Klatt) •  1983 DECtalk system

– “Perfect Paul” (The voice of Stephen Hawking)

– “Beautiful Betty”

2nd Generation Synthesis

•  Diphone Synthesis – Units are diphones; middle of one phone to

middle of next. – Why? Middle of phone is steady state. – Record 1 speaker saying each diphone – ~1400 recordings – Paste them together and modify prosody.

3rd GenerationSynthesis •  All current commercial systems.

•  Unit Selection Synthesis –  Larger units of variable length –  Record one speaker speaking 10 hours or more,

•  Have multiple copies of each unit

–  Use search to find best sequence of units

•  Hidden Markov Model Synthesis –  Train a statistical model on large amounts of data.

General TTS architecture

Raw text in

Text analysis: •  Text NormalizaYon •  Part-‐of-‐speech tagging •  Homograph disambiguaYon

PhoneYc analysis •  DicYonary lookup •  Grapheme-‐to-‐phoneme(LTS)

Prosodic analysis •  Boundary placement •  Pitch accent assignment •  DuraYon computaYon

Waveform synthesis

Speech out

1. Text Normalization •  Analysis of raw text into pronounceable words •  Sample problems:

–  He stole $100 million from the bank –  It's 13 St. Andrews St. –  The home page is h?p://www.stanford.edu –  yes, see you the following tues, that's 11/12/01

•  Steps: –  Identify tokens in text –  Chunk tokens into reasonably sized sections –  Map tokens to words –  Identify types for words

Identify tokens and chunk them

•  Whitespace can be viewed as separators •  Punctuation can be separated from the

raw tokens •  Festival converts text into

– ordered list of tokens – each with features:

•  its own preceding whitespace •  its own succeeding punctuation

End-of-utterance detection

•  Relatively simple if utterance ends in ?! •  But what about ambiguity of “.” •  Ambiguous between end-of-utterance and

end-of-abbreviation – My place on Winfield St. is around the corner. – I live at 151 Winfield St. – (Not “I live at 151 Winfield St..”)

Identify Types of Tokens, and Convert Tokens to Words

•  Pronunciation of numbers often depends on type. 3 ways to pronounce 1776: – 1776 date: seventeen seventy six. – 1776 phone number: one seven seven six – 1776 quantifier: one thousand seven hundred

(and) seventy six – Also:

•  25 day: twenty-fifth

Step 2: Classify token into 1 of 20 types

•  EXPN: abbrev, contractions (adv, N.Y., mph, gov’t) •  LSEQ: letter sequence (CIA, D.C., CDs) •  ASWD: read as word, e.g. CAT, proper names •  MSPL: misspelling •  NUM: number (cardinal) (12,45,1/2, 0.6) •  NORD: number (ordinal) e.g. May 7, 3rd, Bill Gates II •  NTEL: telephone (or part) e.g. 212-555-4523 •  NDIG: number as digits e.g. Room 101 •  NIDE: identifier, e.g. 747, 386, I5, PC110 •  NADDR: number as stresst address, e.g. 5000 Pennsylvania •  NZIP, NTIME, NDATE, NYER, MONEY, BMONY, PRCT,URL,etc •  SLNT: not spoken (KENT*REALTY)

POS Tagging: Definition

•  The process of assigning a part-of-speech or lexical class marker to each word in a corpus:

the koala put the keys on the table

WORDS TAGS

N V P DET

Part of speech tagging

•  8 (ish) traditional parts of speech – Noun, verb, adjective, preposition, adverb,

article, interjection, pronoun, conjunction, etc – This idea has been around for over 2000

years (Dionysius Thrax of Alexandria, c. 100 B.C.)

– Called: parts-of-speech, lexical category, word classes, morphological classes, lexical tags, POS

– We’ll use POS most frequently

POS examples

•  N noun chair, bandwidth, pacing •  V verb study, debate, munch •  ADJ adj purple, tall, ridiculous •  ADV adverb unfortunately, slowly, •  P preposition of, by, to •  PRO pronoun I, me, mine •  DET determiner the, a, that, those

POS Tagging example WORD tag

the DET koala N put V the DET keys N on P the DET table N

POS tagging: Choosing a tagset

•  There are so many parts of speech, potential distinctions we can draw

•  To do POS tagging, need to choose a standard set of tags to work with

•  Could pick very coarse tagets – N, V, Adj, Adv.

•  More commonly used set is finer grained, the “UPenn TreeBank tagset”, 45 tags –  PRP$, WRB, WP$, VBG

•  Even more fine-grained tagsets exist

Penn TreeBank POS Tag set

POS importance

•  Part of speech tagging plays important role in TTS

•  Most algorithms get 96-97% tag accuracy •  Not a lot of studies on whether remaining

error tends to cause problems in TTS

1/5/07

PhoneYc analysis

•  Deals with the pronunciaYon of each word •  Finds, for each word, the way to pronounce it •  Straigh^orward method would be a dicYonary lookup, but: – Unknown words and proper names will be missing – Words in other languages

Phoneme sets

ConverYng from words to phones •  Two methods:

– DicYonary-‐based – Rule-‐based (through leQer-‐to-‐sound -‐LTS-‐ rules)

•  All early systems used LTS. The first to use a dicYonary was MITTalk (10K words)

•  A good dicYonary example is CMU dicYonary with 127K words

When dictionaries aren’t sufficient

•  Unknown words –  Seem to be linear with number of words in unseen text –  Mostly person, company, product names –  But also foreign words, etc. –  From a Black et al analysis

•  Of 39K tokens in part of the Wall Street Journal •  1775 (4.6%) were not in the OALD dictionary:

•  So commercial systems have 3-part system: –  Big dictionary –  Special code for handling names –  Machine learned LTS system for other unknown words

1/5/07

Learning LTS rules •  Rules for Spanish are straigh^orward

– Grapheme(wriQen form) is very similar to the phoneme sequence

•  In the past most big systems employed many linguists (of PhD students) to manually write LTS rules

•  Now they are mostly induced automaYcally from a dicYonary of the language – An important step is the phoneme-‐grapheme alignment

Homograph disambiguaYon •  An homograph is a word (or group of words) that share the same wriQen form but have different meanings.

•  In pronunciaYon, they might have the same pronunciaYon (homonyms) or different (heteronyms).

•  It is very important to detect and disambiguate them for correct pronunciaYon

•  Examples: –  English: wood(madera)/wood(bosque), bear(aguantar)/bear(barba), read(presente leer)/read(pasado leer)

–  Spanish: all words are homonyms

Next steps afer text analysis

•  Afer Text Analysis we now have a string of Phones. We also have some semanYc informaYon to do Prosodic Analysis, so next steps are:

•  Prosody – Desired F0 for enYre uQerance – DuraYon for each phone – Stress value for each phone, possibly accent value

•  Generate Waveforms

Prosody Prosody is in charge of converYng words+phones into boundaries, accent, F0 and duraYon informaYon •  Prosodic phrasing

–  Need to break uQerances into phrases –  PunctuaYon is useful, not sufficient

•  Accents: –  PredicYons of accents: which syllables should be accented –  RealizaYon of F0 contour: given accents/tones, –  generate F0 contour

•  DuraYon: –  PredicYng duraYon of each phone

Graphic representation of F0

legumes are a good source of VITAMINS 50

100

150

200

250

300

350

400

time

F0 (i

n H

ertz

)

Slide from Jennifer Venditti

AlternaYve accents

•  We might want to give prominence to different parts of a sentence depending on the context.

•  For example, imagine answering a quesYon on the previous example

Q1: What types of foods are a good source of vitamins?

50

100

150

200

250

300

350

400

LEGUMES are a good source of vitamins


Q2: Are legumes a source of vitamins?

50

100

150

200

250

300

350

400

Legumes are a GOOD source of vitamins


Q3: I’ve heard that legumes are healthy, but what are they a good source of ?

legumes are a good source of VITAMINS 50

100

150

200

250

300

350

400


Duration •  Simplest: fixed size for all phones (100 ms) •  Next simplest: average duration for that phone (from training data).

Samples from SWBD in ms: –  aa 118 b 68 –  ax 59 d 68 –  ay 138 dh 44 –  eh 87 f 90 –  ih 77 g 66

•  Next Next Simplest: add in phrase-final and initial lengthening plus stress

•  Lots of fancy models of duration prediction: –  Using Z-scores and other clever normalizations –  Sum-of-products model –  New features like word predictability

•  Words with higher bigram probability are shorter

Waveform synthesis generation

•  Given: – String of phones – Prosody

•  Desired F0 for entire utterance •  Duration for each phone •  Stress value for each phone, possibly accent value

•  Generate: – Waveforms

The hourglass architecture

Internal Representation: Input to Waveform Synthesis

Waveform synthesis

•  concatenaYve synthesis systems – Diphone synthesis: join diphones to form the waveform.

– Unit selecYon synthesis: join the longest unit possible.

•  HMM-‐based sytnthesis

Diphones •  mid-‐phone is more stable than edge •  Need O(phone2) number of units

–  Some combinaYons don’t exist (hopefully) – ATT (Olive et al. 1998) system had 43 phones

•  1849 possible diphones •  PhonotacYcs ([h] only occurs before vowels), don’t need to keep diphones across silence

•  Only 1172 actual diphones – May include stress, consonant clusters

•  So could have more –  Lots of phoneYc knowledge in design

•  Database relaYvely small (by today’s standards) – Around 8 megabytes for English (16 KHz 16 bit)

Slide from Richard Sproat

Diphones •  Mid-phone is more stable than edge:

Diphone synthesis •  Training:

–  Choose units (kinds of diphones) –  Record 1 speaker saying 1 example of each diphone –  Mark the boundaries of each diphones,

•  cut each diphone out and create a diphone database

•  Synthesizing an utterance, –  grab relevant sequence of diphones from database –  Concatenate the diphones, doing slight signal

processing at boundaries –  use signal processing to change the prosody (F0,

energy, duration) of selected sequence of diphones

Diphone labelling •  Recorded sentences need to be labelled with phone start-‐end and middle(stable parts). This can be done automaYcally with: – ASR forced alignment – Audio-‐to-‐syntheYc audio auto-‐alignment

Diphone auto-alignment

•  Given – synthesized prompts – Human speech of same prompts

•  Do a dynamic time warping alignment of the two – Using Euclidean distance

•  Works very well 95%+ – Errors are typically large (easy to fix) – Maybe even automatically detected


Dynamic Time Warping


Pupng it all together

Once we have the diphones we want to join, we need to: •  Join them together to form the words and sentences

•  Modify the speaking speed to reflect the target duraYon

•  Modify the pitch to reflect the target intonaYon

Joining diphones Given any group of diphones to join together, we can follow a set of techniques: •  Dumb:

–  just join -‐> we will get many arYfacts –  at zero crossings

•  Algorithms that allow joining and modifying the duraYon at the same Yme –  SOLA –  TD-‐PSOLA(Time-‐domain pitch-‐synchronous overlap-‐and-‐add)

– Necessary prerequisite in voiced signals: epoch labelling

Epoch-labeling

•  An example of epoch-labeling useing “SHOW PULSES” in Praat:

It can be easily done using an EGG, or (less accurately) via signal processing)

Epoch-labeling: Electroglottograph (EGG)

•  Also called laryngograph or Lx –  Device that straps on

speaker’s neck near the larynx

–  Sends small high frequency current through adam’s apple

–  Human tissue conducts well; air not as well

–  Transducer detects how open the glottis is (I.e. amount of air between folds) by measuring impedence.

Picture from UCLA PhoneYcs Lab

DuraYon/pitch modificaYon

If we playback a sound faster than usual (for example by resampling the signal) both pitch and length gets affected. To only alter one of them: •  Time stretching: algorithms used to change the length/speed of a signal without altering its pitch

•  Pitch scaling/shifing: algorithms used to change the pitch without affecYng the length/speed.

Speech as Short Term signals

Alan Black

Duration modification •  Duplicate/remove short term signals

Pitch Modification •  Move short-term signals closer together/further apart


Windowing •  To avoid artifacts at the edges we first apply a

windowing to short-term signals •  y[n] = w[n]s[n]

Windowing

•  Multiply value of signal at sample number n by the value of a windowing function

•  y[n] = w[n]s[n]

Pitch/duraYon modificaYon algorithms •  OLA (Overlap-‐add synthesis): Applies a certain overlap to each short

term signal (if we desire a pitch modificaYon) and add them. If we want to lengthen the signal we just repeat some of the periods. It can create glitches when the short term boundaries are not well defined.

! !!!!!!!!!!!!! ! "!#!$!$ $$""!!%"&$ '!!"#$"

"("$("$" '!"#%#&#$"

)*+!%,-./+%!%0"&$!1&/2!3044+5! 451-!6+51!1&!,&! 0&7+58,/!3+.+&3+&7!1&!5+918+50&:!4,9715!;<=!3+40&+3!,%!,!5,701!14!!%06+!>!14!7*+!,&,/2%0%!?0&31?!?"&$!@2!7*+!.079*!.+5013(!!

! 'A!'() ! "!B!$!!

C&! .5,9709+=! ?+! 9*11%+! ;<! ! B=! ?*+&! 7*+! %.+975D-! 14! ! %0"&$! %0:&,/! ,..51E0-,7+%! 7*+!%.+975D-!14!%"&$!%0:&,/(!)*+&! 7*+!91&9,7+&,701&!.519+%%!9*,&:+%! 7*+!.079*!?07*1D7!,44+970&:!7*+!415-,&7%F!45+GD+&90+%(!)*+!D%+!14!3044+5+&7!5+918+50&:!4,9715!9,D%+%!%751&:!3+:5,3,701&!14!%2&7*+%06+3!%.++9*=!+(:(!@D660&:!15!+44+97!14!-+7,/!8109+(!!

!! "#$%&'(&$)"%'*%+,("#$+-.$/%01+$2%'(%"#$%$34%21"101+$%

)*+!91&9+.7D,/!%9*+-+!14!7*+!30.*1&+!%2&7*+%06+5!0%!%*1?&!1&!7*+!40:D5+!#(!)*+5+!,5+!7*5++!@,%09!.*,%+%!14!%2&7*+%0%(!C&!7*+!405%7!.*,%+=!07!0%!&+9+%%,52!71!95+,7+!,!3,7,@,%+!71!%715+!%,-./+%! 14! *D-,&! /,&:D,:+(! )*+&=! 7*+! .,5709D/,5! %+:-+&7%! 14! 915.D%=! %D9*! ,%! .*1&+-+%=!30.*1&+%=! .+5013%=! +79(! ,5+! ,D71-,709,//2! 5+91:&06+3(! )*0%! %7+.! 0%! 31&+! @2! 7*+! %.+90,/!5+91:&060&:! %147?,5+=! 7*+! .,57! 14! 1D5! %2%7+-(! )0-+! -,5H%! 415! 7*+! .,5709D/,5! %+:-+&7%! 14!915.D%=!7*+!1D7.D7!14!7*+!%147?,5+=!,5+!%715+3!0&!7*+!3,7,@,%+(!!

)*+!%+91&3!.*,%+!0%!7*+!%.++9*!%2&7*+%0%(!I%+5!?507+%!7+E7=!*+!?,&7%!71!%2&7*+%06+!?07*!3+%05+3!45+GD+&92(!;05%7/2=!7*+!7+E7!0%!,&,/2%+3(!)*+&!7*+!3,7,@,%+!0%!%+,59*+3!,&3!,991530&:!71!7*+! 5D/+%! 14! 7*+! /,&:D,:+=! ,..51.50,7+! 30.*1&+%! ,5+! -,5H+3(! ;0&,//2=! 7*+2! ,5+! %2&7*+%06+3!,991530&:! 71! 7*+! )J! KLM>N! ,/:1507*-(! L2&7*+%06+3! %.++9*! 0%! %715+3! 0&! 7*+! 3,7,@,%+! 415!4D57*+5!.519+%%0&:!"7*+!7*053!.*,%+$=!15!07!9,&!@+!./,2+3!18+5!,&!1D7.D7!3+809+(!!

!

L,-./+%!14!%.++9*!

OPI!3,7,@,%+!%2%7+-!

!

ND71-,709!5+91:&0701&!14!.*1&+-+%=!30.*1&+%!,&3!.+5013%

Q1--,&3!,&,/2%0%=!

L+,59*0&:!415!%+:-+&7%!

L.++9*!%2&7*+%0%!@,%+3!1&!!)J!KLM>N!

N91D%709,/!1D7.D7!

I%+5!91--,&3!

-%

-

--

--

--

--

---%

-----%

!!

*567%89! *+#,-./0123$,4-5-3+637".4+#-3$8#/4-$"9-:3

! !

! !!!!!!!!!!!!! ! "!#!$!$ $$""!!%"&$ '!!"#$"

"("$("$" '!"#%#&#$"

)*+!%,-./+%!%0"&$!1&/2!3044+5! 451-!6+51!1&!,&! 0&7+58,/!3+.+&3+&7!1&!5+918+50&:!4,9715!;<=!3+40&+3!,%!,!5,701!14!!%06+!>!14!7*+!,&,/2%0%!?0&31?!?"&$!@2!7*+!.079*!.+5013(!!

! 'A!'() ! "!B!$!!

C&! .5,9709+=! ?+! 9*11%+! ;<! ! B=! ?*+&! 7*+! %.+975D-! 14! ! %0"&$! %0:&,/! ,..51E0-,7+%! 7*+!%.+975D-!14!%"&$!%0:&,/(!)*+&! 7*+!91&9,7+&,701&!.519+%%!9*,&:+%! 7*+!.079*!?07*1D7!,44+970&:!7*+!415-,&7%F!45+GD+&90+%(!)*+!D%+!14!3044+5+&7!5+918+50&:!4,9715!9,D%+%!%751&:!3+:5,3,701&!14!%2&7*+%06+3!%.++9*=!+(:(!@D660&:!15!+44+97!14!-+7,/!8109+(!!

!! "#$%&'(&$)"%'*%+,("#$+-.$/%01+$2%'(%"#$%$34%21"101+$%

)*+!91&9+.7D,/!%9*+-+!14!7*+!30.*1&+!%2&7*+%06+5!0%!%*1?&!1&!7*+!40:D5+!#(!)*+5+!,5+!7*5++!@,%09!.*,%+%!14!%2&7*+%0%(!C&!7*+!405%7!.*,%+=!07!0%!&+9+%%,52!71!95+,7+!,!3,7,@,%+!71!%715+!%,-./+%! 14! *D-,&! /,&:D,:+(! )*+&=! 7*+! .,5709D/,5! %+:-+&7%! 14! 915.D%=! %D9*! ,%! .*1&+-+%=!30.*1&+%=! .+5013%=! +79(! ,5+! ,D71-,709,//2! 5+91:&06+3(! )*0%! %7+.! 0%! 31&+! @2! 7*+! %.+90,/!5+91:&060&:! %147?,5+=! 7*+! .,57! 14! 1D5! %2%7+-(! )0-+! -,5H%! 415! 7*+! .,5709D/,5! %+:-+&7%! 14!915.D%=!7*+!1D7.D7!14!7*+!%147?,5+=!,5+!%715+3!0&!7*+!3,7,@,%+(!!

)*+!%+91&3!.*,%+!0%!7*+!%.++9*!%2&7*+%0%(!I%+5!?507+%!7+E7=!*+!?,&7%!71!%2&7*+%06+!?07*!3+%05+3!45+GD+&92(!;05%7/2=!7*+!7+E7!0%!,&,/2%+3(!)*+&!7*+!3,7,@,%+!0%!%+,59*+3!,&3!,991530&:!71!7*+! 5D/+%! 14! 7*+! /,&:D,:+=! ,..51.50,7+! 30.*1&+%! ,5+! -,5H+3(! ;0&,//2=! 7*+2! ,5+! %2&7*+%06+3!,991530&:! 71! 7*+! )J! KLM>N! ,/:1507*-(! L2&7*+%06+3! %.++9*! 0%! %715+3! 0&! 7*+! 3,7,@,%+! 415!4D57*+5!.519+%%0&:!"7*+!7*053!.*,%+$=!15!07!9,&!@+!./,2+3!18+5!,&!1D7.D7!3+809+(!!

!

L,-./+%!14!%.++9*!

OPI!3,7,@,%+!%2%7+-!

!

ND71-,709!5+91:&0701&!14!.*1&+-+%=!30.*1&+%!,&3!.+5013%

Q1--,&3!,&,/2%0%=!

L+,59*0&:!415!%+:-+&7%!

L.++9*!%2&7*+%0%!@,%+3!1&!!)J!KLM>N!

N91D%709,/!1D7.D7!

I%+5!91--,&3!

-%

-

--

--

--

--

---%

-----%

!!

*567%89! *+#,-./0123$,4-5-3+637".4+#-3$8#/4-$"9-:3

! !

Original period Desired period

Pitch/duraYon modificaYon algorithms •  PSOLA (Pitch synchronous overlap-‐add synthesis): Through a good

pitch/F0 detecYon it defines the boundaries of the short term signals to be amplitude peaks containing 2-‐3 pitch periods. Then it uses OLA to join them.

•  TD-‐PSOLA (Time-‐domain PSOLA): Patented by France Telecom, is an efficient Yme-‐domain version of PSOLA (no FFT required, suitable for real-‐Yme TTS). Can modify pitch up to 2X or 0.5X

•  MBR-‐PSOLA (MulY-‐band re-‐synthesis psola): tries to solve problems of TD-‐PSOLA in boundaries between diphones due to phase, pitch or spectral envelope mismatches y normalizing all the database a priory. It later turned into the open project called MBROLA, which focuses on minimizing the database storage.

•  LP-‐PSOLA (Linear predictors PSOLA): modificaYon of PSOLA where the LP coefficients are stored instead of the signal itself.

! ! ! !!

!"#$%&'! !"#$%&&'(#)*+,!-)#.*/01%/#12'3-!45#6787("#$*+,!-)#.*/0#12'3-!4#931&#:7;<=7#:7><=7#:7?<#=#

(! )*+),-./*+.%0+1%23423)5/63.%

"#$%&%'(! )*%! +,%%-*(! +./)*%+01%2! 3.! 45! 6789:! ;<=#'0)*>! 0+! ?@0)%! @/2%'+);/2;3<%(!*@>;/! %;'! ,%'-%0&%+! 0)! ;+! +./)*%)0-A! B@')*%'>#'%(! ;!='%;)! %<#/=;)0#/! #C! +./)*%+01%2! +,%%-*!+0=/;<! 3.! ;220/=! ,%'0#2+(! -;@+%+! )*%! %-*#! %CC%-)A!D#/)';-)0#/! #C! ! )*%! +0=/;<! 2#%+! /#)! -;@+%!@/C;&#@';3<%! ;@20)0&%! ,%'-%,)0#/+A! "#$%&%'(! #>0))0/=!>#'%! ,%'0#2+! 0/! 20,*#/%(! -;@+%+! )*%!<#++!#C!+,%%-*!0/C#'>;)0#/!-#/)%/)A!!

4*%!45!6789:!;<=#'0)*>!$;+!0>,<%>%/)%2!0/!4D9!+-'0,)0/=!<;/=@;=%A!:+!;!'%+@<)!#C!3%<#/=0/=! )#! )*%! ='#@,! #C! 0/)%','%)%2! <;/=@;=%+(! 4D9! 0+! '%<;)0&%<.! +<#$A! E&%/! )*#@=*! )*%!,'#=';>>%!$;+!#,)0>01%2(! )*%! +./)*%+0+!#C!;!+%/)%/-%! <;+)+! +%&%';<! +%-#/2+A! F!+@,,#+%(! )*;)!)*%!@+%!#C!-#>,0<%2!<;/=@;=%!%A=A!DGG(!-#>30/%2!$0)*!*;'2$;'%!,%'C#'>;/-%!='#$)*!0/!)*%!C@)@'%(!-#@<2!+*#')%/!)*0+!)0>%!)#!*@/2'%2+!#C!!>0<0+%-#/2+A!4*%!@/2%/0;3<%!;2&;/);=%!#C!4D9!<;/=@;=%!'%>;0/+!)*%!,#++030<0).!#C!,#')0/=!4D9!+-'0,)+!;>#/=!#,%';)0/=!+.+)%>+A!!

4*%!C@/2;>%/);<!,'#3<%>!#C!)*%!EHI!4D9!-#/+#<%!0+!;!,<%/).!#C!%''#'+!0/!+#@'-%!-#2%A!H;/.!C@/-)0#/+!2#%+!/#)!$#'J!,'#,%'<.(!;/2!+#>%!#C!)*%>(!0/!+,%-0;<!-;+%+(!2#%+!/#)!$#'J!;)!;<<A!4*%'%C#'%!+#>%!#C!)*%>!*;2!)#!3%!/%$<.!0>,<%>%/)%2A!!

5%+,0)%! )*%! C;-)! )*;)! )*%! +,%%-*! -'%;)%2! 3.! -#/-;)%/;)0#/! #C! ! 20,*#/%+! 0+! +@,%'0#'! )#!-#/-;)%/;)0#/!#C! ,*#/%>%+(! )*0+! ;,,'#;-*!2#%+! /#)! %/;3<%! +./)*%+0+! #C! *0=*!?@;<0).!*@>;/!+,%%-*A!F)!0+!-;@+%2!3.!)*%!C;-)(!)*;)!/%0)*%'!;!*@=%!20,*#/%!2;);3;+%!0+!;3<%!)#!-#&%'!;!='%;)!&;'0%).!#C! !*@>;/!+,%%-*A!4*%!@+%!#C!H@<)0K;/2!EL-0);)0#/!M%+./)*%+0+!#/!2;);3;+%(!>0=*)!+<0=*)<.! 0>,'#&%! )*%!?@;<0).!#C! )*%!+,%%-*(!3@)! !;')0-@<;)%!+./)*%+01%'+! '%>;0/! )*%!&0+0#/!C#'!)*%!C@)@'%A!

43!343+)3.%NOP!5@)#0)(!4AQ!:/!F/)'#2@-)0#/!)#!4%L)R)#R+,%%-*!7./)*%+0+(!*)),QSS)-)+AC,>+A;-A3%S+./)*%+0+!NTP! D;++02.(! 7A(! ";''0/=)#/! UAQ! H@<)0R<%&%<! ://#);)0#/! 0/! )*%! EHI! 7,%%-*! 5;);3;+%!

H;/;=%>%/)!7.+)%>(!7,%%-*!D#>>@/0-;)0#/!VV(!TWWT!

NVP! 5@)#0)(!4A(!9%0-*(!"AQ!HKM!6789:!447!7./)*%+0+!K;+%2!#/!;/!HKE!M%R7./)*%+0+!#C!)*%!7%=>%/)+!5;);3;+%(!*)),QSS)-)+AC,>+A;-A3%!

NXP!7.'2;<(!:A!;/2!D#<AQ!45!6789:!Y%'+@+!";'>#/0-!Z#0+%!H#2%<!F/!50,*#/%!K;+%2!7,%%-*!7./)*%+0+(!*)),QSS$$$A10,,.A*#A;))A-#>!

! !

Problems with diphone synthesis

•  Signal processing leave artifacts, making the speech sound unnatural

•  Diphone synthesis only captures local effects – But there are many more global effects

(syllable structure, stress pattern, word-level effects)

Unit Selection Synthesis

•  Generalization of the diphone intuition – Larger units

•  From diphones to sentences

– Many many copies of each unit •  10 hours of speech instead of 1500 diphones (a

few minutes of speech)

– Little or no signal processing applied to each unit

•  Unlike diphones

Why Unit Selection Synthesis

•  Natural data solves problems with diphones –  Diphone databases are carefully designed but:

•  Speaker makes errors •  Speaker doesn’t speak intended dialect •  Require database design to be right

–  If it’s automatic •  Labeled with what the speaker actually said •  Coarticulation, schwas, flaps are natural

•  “There’s no data like more data” –  Lots of copies of each unit mean you can choose just

the right one for the context –  Larger units mean you can capture wider effects

Unit Selection Intuition •  Given a big database •  For each segment (diphone) that we want to synthesize

–  Find the unit in the database that is the best to synthesize this target segment

•  What does “best” mean? –  “Target cost”: Closest match to the target description, in terms of

•  Phonetic context •  F0, stress, phrase position

–  “Join cost”: Best join with neighboring units •  Matching formants + other spectral characteristics •  Matching energy •  Matching F0

!

C(t1n,u1

n ) = Ctarget (i=1

n

" ti,ui) + C join (i= 2

n

" ui#1,ui)

Search of best units using Viterbi algorithm

!

ˆ u 1n = argmin

u1 ,...,un

C(t1n,u1

n )

HMM synthesis •  It is a totally different approach to concatenaYve synthesis –  It is much closer to ASR

•  Hidden Markov Models (HMM) are trained from labeled data to learn how each phone is pronounced in each condiYon –  It also learns its prosody

•  Then, given a desired phoneme sequence and prosody paQern, it outputs the most probable audio sequence.

Samples ~2007 Sample 2010 NITech

6.#Textto(Speech#Synthesis# - Xavier Anguera · that allowed it to produce voewls and consonants....

Documents

Transcript of 6.#Textto(Speech#Synthesis# - Xavier Anguera · that allowed it to produce voewls and consonants....