Speech Synthesis as A Machine Learning Problem · tokuda/tokuda_o-cocosda2010.pdf · 2010. 12. 1.
Speech Synthesis as A Machine Learning Problem
Keiichi Tokuda
Nagoya Institute of Technology
Introduction
Rule-based, formant synthesis (~’90s)
– Hand-crafting each phonetic unit by rules
Corpus-based, concatenative synthesis (’90s~)
– Concatenate speech units (waveforms) from a database
• Single inventory: diphone synthesis
• Multiple inventory: unit selection synthesis
Corpus-based, statistical parametric synthesis
– Source-filter model + statistical acoustic model
• Hidden Markov model: HMM-based synthesis
How can we formulate and understand the whole corpus-based speech synthesis process in a unified statistical framework?
Problem of speech synthesis (1/2)
We have a speech database, i.e., a set of texts and corresponding speech waveforms. Given a text to be synthesized, what is the speech waveform corresponding to the text?
Database: $W$ (texts), $X$ (speech waveforms)
Given: $w$ (text to be synthesized)
Unknown: $x$ (speech waveform)
Problem of speech synthesis (2/2)
Problem of statistical parametric speech synthesis:
$$\hat{o} = \arg\max_{o} P(o \mid l, O, L) = \arg\max_{o} \int P(o \mid l, \lambda)\, P(\lambda \mid O, L)\, d\lambda$$
$o$: Synthesis data (speech parameter sequence)
$O$: Training data (speech parameter sequence)
$L$: Label sequence for training data (pronunciation, stress, POS, pause position, etc.)
$l$: Label sequence for synthesis data
$\lambda$: Model parameters (HMM parameters)
Approximation
Maximum a posteriori (MAP) & ML estimation
MAP estimation:
$$\hat{\lambda}_{\mathrm{MAP}} = \arg\max_{\lambda} P(\lambda \mid O, L) = \arg\max_{\lambda} P(O \mid L, \lambda)\, P(\lambda)$$
ML estimation:
$$\hat{\lambda}_{\mathrm{ML}} = \arg\max_{\lambda} P(O \mid L, \lambda)$$
Speech parameter generation using estimated parameters:
$$\hat{o} = \arg\max_{o} P(o \mid l, \hat{\lambda})$$
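The difference between the ML and MAP criteria can be illustrated on a toy problem that is much simpler than HMM training: estimating the mean of a one-dimensional Gaussian with known variance. This is only a sketch; the function names, the Gaussian prior and the numbers are assumptions made for illustration and are not taken from the slides.

```python
def ml_mean(samples):
    """ML estimate of a Gaussian mean: the sample average."""
    return sum(samples) / len(samples)


def map_mean(samples, noise_var, prior_mean, prior_var):
    """MAP estimate of a Gaussian mean under a Gaussian prior N(prior_mean, prior_var).

    The known observation variance noise_var and the prior are illustrative assumptions.
    """
    data_precision = len(samples) / noise_var
    prior_precision = 1.0 / prior_var
    weighted = data_precision * ml_mean(samples) + prior_precision * prior_mean
    return weighted / (data_precision + prior_precision)


samples = [1.0, 2.0, 3.0]
ml_estimate = ml_mean(samples)
map_estimate = map_mean(samples, noise_var=1.0, prior_mean=0.0, prior_var=1.0)
```

With a very broad prior (large `prior_var`) the MAP estimate approaches the ML estimate, which mirrors dropping $P(\lambda)$ from the MAP criterion.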
HMM-based speech synthesis system
[Figure: system block diagram]
Training part: SPEECH DATABASE → speech signal → excitation parameter extraction, spectral parameter extraction → excitation parameters, spectral parameters; text analysis → labels; these feed training HMMs → context-dependent HMMs & state duration models
Synthesis part: TEXT → text analysis → labels → parameter generation from HMMs → excitation parameters, spectral parameters; excitation parameters → excitation generation → excitation → synthesis filter (controlled by the spectral parameters) → SYNTHESIZED SPEECH
HMM-based speech synthesis system
[Figure: system block diagram repeated, with the training part (training HMMs) annotated]
$$\hat{\lambda} = \arg\max_{\lambda} P(O \mid L, \lambda)$$
Hidden Markov model (HMM)
[Figure: 3-state left-to-right HMM with self-transitions $a_{11}, a_{22}, a_{33}$, forward transitions $a_{12}, a_{23}$, and output probabilities $b_1(o_t), b_2(o_t), b_3(o_t)$]
Observation sequence $o$: $o_1, o_2, o_3, o_4, o_5, \ldots, o_T$
State sequence $q$: 1, 1, 1, 1, 2, 2, 3, 3
$a_{ij}$: state transition probability
$b_q(o_t)$: output probability
Output probability of HMM
$$P(o \mid \lambda) = \sum_{q} P(o, q \mid \lambda) = \sum_{q} \prod_{t=1}^{T} a_{q_{t-1} q_t}\, b_{q_t}(o_t)$$
[Figure: trellis of states $i = 1, 2, 3$ against frames $t = 1, \ldots, 8$, with example terms $\pi_1$, $a_{23}$, $a_{33}$ and $b_3(o_7)$ marked along a path]
Model parameters can be estimated by an EM algorithm
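The sum over all state sequences in $P(o \mid \lambda) = \sum_q \prod_t a_{q_{t-1} q_t} b_{q_t}(o_t)$ does not have to be enumerated path by path; the forward recursion computes it frame by frame. The following is a minimal sketch, assuming one-dimensional observations, single-Gaussian output probabilities, and an initial state distribution `pi` that plays the role of $a_{q_0 q_1}$. All names and numbers are illustrative.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian N(x; mean, var)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def forward_likelihood(obs, pi, A, means, variances):
    """Return P(o | lambda) summed over all state sequences (forward algorithm)."""
    n = len(pi)
    # alpha[i] = P(o_1..o_t, q_t = i | lambda) for the current frame t
    alpha = [pi[i] * gaussian_pdf(obs[0], means[i], variances[i]) for i in range(n)]
    for o_t in obs[1:]:
        alpha = [
            sum(alpha[i] * A[i][j] for i in range(n))
            * gaussian_pdf(o_t, means[j], variances[j])
            for j in range(n)
        ]
    return sum(alpha)


# Toy 3-state left-to-right HMM (illustrative parameters).
pi = [1.0, 0.0, 0.0]
A = [[0.6, 0.4, 0.0],
     [0.0, 0.7, 0.3],
     [0.0, 0.0, 1.0]]
means = [0.0, 1.0, 2.0]
variances = [1.0, 1.0, 1.0]
obs = [0.1, 0.4, 1.2, 1.9]
likelihood = forward_likelihood(obs, pi, A, means, variances)
```

For long sequences a practical implementation would work in the log domain or rescale `alpha` at every frame to avoid underflow; that is omitted here to keep the recursion visible.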
HMM-based speech synthesis system
[Figure: system block diagram repeated, with the synthesis part (parameter generation from HMMs) annotated]
$$\hat{o} = \arg\max_{o} P(o \mid l, \hat{\lambda})$$
Speech parameter generation algorithm
For a given HMM $\lambda$, determine the speech parameter vector sequence $o = [o_1^\top, o_2^\top, \ldots, o_T^\top]^\top$ which maximizes
$$P(o \mid \lambda) = \sum_{q} P(o \mid q, \lambda)\, P(q \mid \lambda) \approx \max_{q} P(o \mid q, \lambda)\, P(q \mid \lambda)$$
$$\Rightarrow \quad \hat{q} = \arg\max_{q} P(q \mid l, \lambda), \qquad \hat{o} = \arg\max_{o} P(o \mid \hat{q}, \lambda)$$
Determination of state sequence
[Figure: 3-state left-to-right HMM with transitions $a_{11}, a_{22}, a_{33}, a_{12}, a_{23}$ and output probabilities $b_1(o_t), b_2(o_t), b_3(o_t)$]
Observation sequence $o$: $o_1, o_2, o_3, o_4, o_5, \ldots, o_T$
State sequence $q$: 1, 1, 1, 1, 2, 2, 3, 3
State duration $d$: 4, 10, 5
$a_{ij}$: state transition probability
$b_q(o_t)$: output probability
The state sequence can be determined by state durations
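Because the model is left-to-right, fixing the number of frames spent in each state fixes the whole state sequence. A minimal sketch of that expansion, using the durations 4, 10, 5 shown on the slide (the function name is an assumption for illustration):

```python
def state_sequence_from_durations(durations):
    """Expand per-state durations (in frames) into a left-to-right state sequence."""
    sequence = []
    for state, duration in enumerate(durations, start=1):
        sequence.extend([state] * duration)
    return sequence


q = state_sequence_from_durations([4, 10, 5])
```

The resulting sequence has one entry per frame, so its length equals the total number of frames $T$.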
Hidden Semi Markov Model
HMM (Hidden Markov Model)
– State duration prob. depends only on transition prob.
– State duration probability exponentially decreases
HSMM (Hidden Semi Markov Model)
– HMM + explicit duration model ⇒ HSMM
Introduce state duration probability $P(d \mid q)$ (Gaussian)
State durations are given by means of Gaussians.
[Figure: observations $O_1, \ldots, O_9$ grouped under states $q_1, q_2, q_3$ with durations $d_1, d_2, d_3$]
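The contrast between the two duration models can be made concrete. In a plain HMM, staying $d$ frames in a state with self-transition probability $a$ has probability $a^{d-1}(1-a)$, which is largest at $d = 1$ and decays geometrically. An explicit Gaussian duration model can instead peak at a typical duration. The sketch below compares both for durations 1 to 10; the values 0.8, 5.0 and 2.0 are illustrative assumptions, chosen so that both models have a mean duration of about 5 frames.

```python
import math


def hmm_duration_prob(d, a_self):
    """Duration probability implied by a self-transition: a^(d-1) * (1 - a)."""
    return a_self ** (d - 1) * (1.0 - a_self)


def gaussian_duration_density(d, mean, var):
    """Explicit Gaussian duration density, as used in an HSMM-style model."""
    return math.exp(-0.5 * (d - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


durations = list(range(1, 11))
implicit = [hmm_duration_prob(d, a_self=0.8) for d in durations]
explicit = [gaussian_duration_density(d, mean=5.0, var=2.0) for d in durations]

for d, p_hmm, p_hsmm in zip(durations, implicit, explicit):
    print(f"d={d:2d}  HMM={p_hmm:.3f}  Gaussian={p_hsmm:.3f}")
```

The implicit model always prefers the shortest duration, while the Gaussian model prefers durations near its mean, which is the motivation stated on the slide for adding an explicit duration model.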
Generated feature sequence
[Figure: mean and variance of each state plotted against frame for the phoneme sequence "a i"]
$o$ becomes a sequence of mean vectors ⇒ discontinuous outputs between states
Dynamic features
$$\Delta c_t = \frac{\partial c_t}{\partial t} \approx 0.5\,(c_{t+1} - c_{t-1})$$
$$\Delta^2 c_t = \frac{\partial^2 c_t}{\partial t^2} \approx c_{t+1} - 2 c_t + c_{t-1}$$
[Figure: $\Delta c_t$ computed from $c_{t-1}, c_{t+1}$ with weights $-0.5, 0.5$; $\Delta^2 c_t$ computed from $c_{t-1}, c_t, c_{t+1}$ with weights $1, -2, 1$]
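The two window formulas, $\Delta c_t \approx 0.5\,(c_{t+1} - c_{t-1})$ and $\Delta^2 c_t \approx c_{t+1} - 2c_t + c_{t-1}$, can be applied directly to a sequence of static coefficients. The sketch below does this for a one-dimensional sequence. How the first and last frames are treated is not stated on the slide; replicating the edge frames is an assumption made here.

```python
def delta_features(c):
    """Return (delta, delta2) for a 1-D static sequence c.

    Edge frames are replicated so that every frame has two neighbours
    (an assumption; the slides do not specify the boundary handling).
    """
    padded = [c[0]] + list(c) + [c[-1]]
    # padded[t] is c_{t-1}, padded[t + 1] is c_t, padded[t + 2] is c_{t+1}
    delta = [0.5 * (padded[t + 2] - padded[t]) for t in range(len(c))]
    delta2 = [padded[t + 2] - 2.0 * padded[t + 1] + padded[t] for t in range(len(c))]
    return delta, delta2


static = [0.0, 1.0, 4.0, 9.0, 16.0]
delta, delta2 = delta_features(static)
```

For this quadratic input the interior second-order values are constant, as expected from a second derivative; only the last frame is affected by the boundary assumption.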
Integration of dynamic features
Relationship between speech parameter vectors & static features:
$$o = W c$$
[Figure: the window matrix $W$ maps the static feature sequence $c$ to the sequence $o$ of static and dynamic features]
Solution for the problem (1/2)
By setting
$$\frac{\partial \log P(W c \mid \hat{q}, \lambda)}{\partial c} = 0$$
we obtain
$$W^\top \Sigma_{\hat{q}}^{-1} W c = W^\top \Sigma_{\hat{q}}^{-1} \mu_{\hat{q}}$$
where
$$c = [c_1^\top, c_2^\top, \ldots, c_T^\top]^\top$$
$$\mu_{\hat{q}} = [\mu_{\hat{q}_1}^\top, \mu_{\hat{q}_2}^\top, \ldots, \mu_{\hat{q}_T}^\top]^\top$$
$$\Sigma_{\hat{q}} = \mathrm{diag}[\Sigma_{\hat{q}_1}, \Sigma_{\hat{q}_2}, \ldots, \Sigma_{\hat{q}_T}]$$
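The linear system $W^\top \Sigma_{\hat{q}}^{-1} W c = W^\top \Sigma_{\hat{q}}^{-1} \mu_{\hat{q}}$ can be solved directly for small problems. The sketch below is a toy version under several assumptions that are not in the slides: a one-dimensional static feature, only the first-order dynamic feature with window $(-0.5, 0, 0.5)$, replicated edge frames, diagonal covariances, and a dense Gaussian elimination instead of the banded solvers used in practice. All function names are illustrative.

```python
def build_window_matrix(T):
    """Rows alternate static and delta so that o = W c with o_t = [c_t, delta c_t]."""
    W = []
    for t in range(T):
        static_row = [0.0] * T
        static_row[t] = 1.0
        delta_row = [0.0] * T
        delta_row[min(t + 1, T - 1)] += 0.5   # replicate the last frame at the edge
        delta_row[max(t - 1, 0)] -= 0.5       # replicate the first frame at the edge
        W.extend([static_row, delta_row])
    return W


def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][k] * x[k] for k in range(i + 1, n))) / M[i][i]
    return x


def generate_trajectory(means, variances):
    """Solve W^T S^-1 W c = W^T S^-1 mu, with means/variances interleaved as [static, delta, ...]."""
    T = len(means) // 2
    W = build_window_matrix(T)
    precisions = [1.0 / v for v in variances]
    lhs = [[sum(W[r][i] * precisions[r] * W[r][j] for r in range(2 * T))
            for j in range(T)] for i in range(T)]
    rhs = [sum(W[r][i] * precisions[r] * means[r] for r in range(2 * T))
           for i in range(T)]
    return solve_linear_system(lhs, rhs)


# Two "states" with static means 0 and 1, three frames each; delta means are 0.
static_means = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
means = [value for m in static_means for value in (m, 0.0)]
variances = [1.0, 0.1] * len(static_means)
c = generate_trajectory(means, variances)
```

In this toy case the static means jump from 0 to 1 between the third and fourth frames, while the generated `c` changes gradually across that boundary because the delta constraints penalize abrupt changes.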
Generated speech parameter trajectory
[Figure: mean, variance and generated trajectory plotted against frame for the static feature $c$ and the dynamic feature $\Delta c$, phoneme sequence "a i"]
Generated spectra
[Figure: spectra over 0 to 5 kHz for the sequence "sil a i sil", generated without dynamic features (w/o dyn) and with dynamic features (w dyn)]
Spectra changing smoothly between phonemes
Trajectory HMM
Conventional HMM
– Training: $P(O \mid L, \lambda)$
– Synthesis: $P(o \mid l, \hat{\lambda})$ subject to $o = Wc$
Trajectory HMM
– Training: $P(C \mid L, \lambda)$
– Synthesis: $P(c \mid l, \hat{\lambda})$
$P(o \mid l, \lambda) = P(Wc \mid l, \lambda)$ is not a proper distribution of $c$
Solve inconsistency between training & synthesis ⇒ improving the model accuracy
HMM-based speech synthesis system
[Figure: system block diagram repeated]
ML estimation of spectral parameter
Mel-cepstral representation of speech spectra
$$H(z) = \exp \sum_{m=0}^{M} c(m)\, \tilde{z}^{-m}$$
$$\tilde{z}^{-1} = \frac{z^{-1} - \alpha}{1 - \alpha z^{-1}} = e^{-j\tilde{\omega}}$$
[Figure: warped frequency $\tilde{\omega}$ (rad) against frequency $\omega$ (rad), both from $0$ to $\pi$; the warped frequency curve is compared with the mel-scale frequency]
ML-estimation of mel-cepstrum
$$\hat{c} = \arg\max_{c} p(x \mid c)$$
$x$: speech waveform (Gaussian process)
$c$: mel-cepstrum
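The frequency warping produced by the first-order all-pass function $\tilde{z}^{-1} = (z^{-1} - \alpha)/(1 - \alpha z^{-1})$ can be evaluated in closed form: on the unit circle its phase gives $\tilde{\omega} = \omega + 2\arctan\!\big(\alpha \sin\omega / (1 - \alpha\cos\omega)\big)$. The sketch below evaluates this curve. The value $\alpha = 0.42$ is an assumption here; it is a value often quoted as approximating the mel scale at 16 kHz sampling, and the slides do not state which $\alpha$ was used.

```python
import math


def warped_frequency(omega, alpha):
    """Warped frequency (rad) of the all-pass (z^-1 - alpha) / (1 - alpha z^-1)."""
    return omega + 2.0 * math.atan2(alpha * math.sin(omega),
                                    1.0 - alpha * math.cos(omega))


alpha = 0.42  # assumed value, see the note above
grid = [k * math.pi / 8.0 for k in range(9)]
warped = [warped_frequency(w, alpha) for w in grid]
```

For positive $\alpha$ the curve lies above the diagonal, so low frequencies are stretched and high frequencies compressed, while the end points $0$ and $\pi$ stay fixed.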
Synthesis filter
Overview of speech vocoding
[Figure: original speech is analyzed into mel-cepstrum, unvoiced/voiced decision and F0; the excitation $e(n)$ (pulse train or white noise) drives the synthesis filter $H(z)$ to produce the speech $x(n)$]
These speech parameters are modeled by HMMs
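The excitation side of the vocoder can be sketched in a few lines: a pulse train at the pitch period for voiced frames and white noise for unvoiced frames. This is a simplified illustration, not the generator used in the talk. The convention that F0 equal to 0 marks an unvoiced frame, the pulse amplitude of $\sqrt{\text{period}}$ (chosen so that voiced and unvoiced frames have roughly unit power), and all names are assumptions.

```python
import random


def generate_excitation(f0_per_frame, frame_length, sample_rate, seed=0):
    """Pulse train for voiced frames (f0 > 0), white Gaussian noise for unvoiced frames."""
    rng = random.Random(seed)
    excitation = []
    next_pulse = 0.0  # absolute sample position of the next pitch pulse
    for f0 in f0_per_frame:
        start = len(excitation)
        if f0 > 0.0:
            period = sample_rate / f0
            frame = [0.0] * frame_length
            next_pulse = max(next_pulse, float(start))
            while next_pulse < start + frame_length:
                frame[int(next_pulse) - start] = period ** 0.5
                next_pulse += period
            excitation.extend(frame)
        else:
            excitation.extend(rng.gauss(0.0, 1.0) for _ in range(frame_length))
    return excitation


# Two voiced frames at 100 Hz followed by one unvoiced frame, 8 kHz sampling.
e = generate_excitation([100.0, 100.0, 0.0], frame_length=80, sample_rate=8000)
```

Passing `e` through a filter built from the mel-cepstrum would complete the vocoder; that filter is beyond the scope of this sketch.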
HMM-based speech synthesis system
[Figure: system block diagram repeated]
Observation of F0
[Figure: log frequency against time; voiced frames take values in the one-dimensional space $\Omega_1 = R^1$, unvoiced frames belong to the zero-dimensional space $\Omega_2 = R^0$]
Unable to model by continuous or discrete distributions ⇒ Multi-space distribution HMM (MSD-HMM)
Structure of MSD-HMM
[Figure: 3-state HMM in which each state $i = 1, 2, 3$ has, for each space $\Omega_g = R^{n_g}$ ($g = 1, 2, 3$), a space weight $w$ and a density $N_{ig}(x)$]
MSD-HMM for F0 modeling
[Figure: 3-state HMM for F0; each state has a pair of voiced/unvoiced weights, with the unvoiced space being zero-dimensional ($R^0$) and the voiced space one-dimensional ($R^1$)]
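For F0, the multi-space idea reduces to two cases per state: an unvoiced frame contributes only the unvoiced weight (the zero-dimensional space has no density), and a voiced frame contributes the voiced weight times a one-dimensional density over log F0. The sketch below is an illustration under assumptions: a single Gaussian for the voiced space, `None` as the marker for an unvoiced frame, and made-up parameter values.

```python
import math


def msd_output_prob(obs, w_voiced, mean, var):
    """MSD output probability of one state for an F0 observation.

    obs is a log F0 value for a voiced frame, or None for an unvoiced frame
    (the None convention is an assumption of this sketch).
    """
    if obs is None:
        return 1.0 - w_voiced  # zero-dimensional space: the weight alone
    density = math.exp(-0.5 * (obs - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)
    return w_voiced * density


# A state that is mostly voiced, with log F0 centred on 5.0 (illustrative numbers).
p_unvoiced = msd_output_prob(None, w_voiced=0.9, mean=5.0, var=0.04)
p_voiced = msd_output_prob(5.0, w_voiced=0.9, mean=5.0, var=0.04)
```

Because both cases are handled by one output function, voiced and unvoiced frames can be modeled by the same HMM without interpolating F0 through unvoiced regions.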
Generated F0
Speech samples
[Audio samples: mel-cepstrum generated w/ dyn. and w/o dyn., combined with log F0 generated w/ dyn. and w/o dyn.]
Inclusion of speech analysis & waveform reconstruction
Problem of statistical parametric speech synthesis
$$\hat{x} = \arg\max_{x} P(x \mid l, X, L) = \arg\max_{x} \sum_{c} \sum_{C} \int P(x \mid c)\, P(c \mid l, \lambda)\, P(\lambda \mid C, L)\, P(C \mid X)\, d\lambda$$
$P(x \mid c)$: Waveform reconstruction
$P(c \mid l, \lambda)$: Speech parameter generation
$P(\lambda \mid C, L)$: HMM parameter estimation
$P(C \mid X)$: Speech analysis
$c$ and $C$ consist of mel-cepstrum & F0 parameters
Context clustering factors
• Preceding, succeeding two phonemes
• Current phoneme
• Position of current phoneme in current syllable
• # of phonemes at preceding, current, succeeding syllable
• Accent, stress of preceding, current, succeeding syllable
• Position of current syllable in current word
• # of preceding, succeeding accented, stressed syllable in current phrase
• # of syllables from previous, to next accented, stressed syllable
• Vowel within current syllable
• Guess at part of speech of preceding, current, succeeding word
• # of syllables in preceding, current, succeeding word
• Position of current word in current phrase
• # of preceding, succeeding content words in current phrase
• # of words from previous, to next content word
• # of syllables in preceding, current, succeeding phrase
• …
Vast # of combinations ⇒ Difficult to have all possible models
Decision tree-based state clustering
[Figure: binary decision tree with yes/no questions such as R=silence?, L=“gy”?, L=voice?, L=“w”?; context-dependent states such as k-a+b, t-a+h, w-a+t, w-a+sil, w-a+sh, gy-a+sil, gy-a+pau, g-a+pau are assigned to leaf nodes, which give the synthesized states]
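The benefit of tree-based clustering is that any context label, seen in training or not, reaches some leaf by answering yes/no questions. The sketch below walks a small tree built from the questions named on the slide. The tree shape, the leaf names and the set of phonemes treated as voiced are hypothetical, since the actual tree in the figure is not recoverable from the transcript.

```python
SILENCE = {"sil", "pau"}
# Hypothetical set of voiced phonemes, used only for this illustration.
VOICED = {"a", "i", "u", "e", "o", "b", "d", "g", "gy", "m", "n", "w", "y", "r", "z"}


def parse_context(label):
    """Split a triphone label 'L-C+R' into its left, centre and right phonemes."""
    left, rest = label.split("-", 1)
    center, right = rest.split("+", 1)
    return {"L": left, "C": center, "R": right}


# Internal node: (question, yes_subtree, no_subtree). Leaf: name of a tied state.
TREE = (
    lambda ctx: ctx["R"] in SILENCE,                        # R=silence?
    (lambda ctx: ctx["L"] == "gy", "leaf_1", "leaf_2"),     # L="gy"?
    (
        lambda ctx: ctx["L"] in VOICED,                     # L=voice?
        (lambda ctx: ctx["L"] == "w", "leaf_3", "leaf_4"),  # L="w"?
        "leaf_5",
    ),
)


def find_leaf(label, tree=TREE):
    """Answer the questions from the root down and return the leaf that is reached."""
    ctx = parse_context(label)
    node = tree
    while not isinstance(node, str):
        question, yes_branch, no_branch = node
        node = yes_branch if question(ctx) else no_branch
    return node


leaves = {label: find_leaf(label) for label in
          ["k-a+b", "t-a+h", "w-a+t", "w-a+sil", "w-a+sh", "gy-a+sil", "gy-a+pau", "g-a+pau"]}
```

A context that never occurred in the training data, for example `z-a+pau`, still maps to a leaf, so a model is available for it at synthesis time.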
Stream-dependent tree-based clustering
[Figure: HMM with separate decision trees for mel-cepstrum and decision trees for F0; state duration model with its own decision tree for state duration models]
Inclusion of model structure parameter
Problem of statistical parametric speech synthesis
$$\hat{o} = \arg\max_{o} P(o \mid l, O, L) = \arg\max_{o} \sum_{m} \int P(o \mid l, \lambda, m)\, P(\lambda \mid O, L, m)\, P(m \mid O, L)\, d\lambda$$
$P(o \mid l, \lambda, m)$: Generate coefficient
$P(\lambda \mid O, L, m)$: Posterior of model parameters
$P(m \mid O, L)$: Posterior of model structure
Usually a fixed $m$ is used
HMM-based speech synthesis system
[Figure: system block diagram repeated, with the two text analysis blocks annotated]
Text analysis in the training part:
$$\hat{L} = \arg\max_{L} P(L \mid W, \Lambda)$$
Text analysis in the synthesis part:
$$\hat{l} = \arg\max_{l} P(l \mid w, \Lambda)$$
Inclusion of text analysis
Problem of statistical parametric speech synthesis
$$\hat{o} = \arg\max_{o} P(o \mid w, O, W) = \arg\max_{o} \sum_{l} \sum_{L} \iint P(o \mid l, \lambda)\, P(l \mid w, \Lambda)\, P(\lambda \mid O, L)\, P(L \mid W, \Lambda)\, P(\Lambda)\, d\lambda\, d\Lambda$$
$P(o \mid l, \lambda)$: Speech parameter generation
$P(l \mid w, \Lambda)$: Text processing
$P(\lambda \mid O, L)$: HMM parameter estimation
$P(L \mid W, \Lambda)$: Text processing
$P(\Lambda)$: Prior
HMM-based speech synthesis system
[Figure: system block diagram repeated]
Inclusion of all components
Problem of statistical parametric speech synthesis
$$\hat{x} = \arg\max_{x} P(x \mid w, X, W)$$
$$= \arg\max_{x} \sum_{c, C} \sum_{l, L} \sum_{m} \iint P(x \mid c)\, P(c \mid l, \lambda, m)\, P(l \mid w, \Lambda)\, P(\lambda \mid C, L, m)\, P(m \mid C, L)\, P(L \mid W, \Lambda)\, P(C \mid X)\, P(\Lambda)\, d\lambda\, d\Lambda$$
$P(x \mid c)$: Waveform generation
$P(c \mid l, \lambda, m)$: Parameter generation
$P(l \mid w, \Lambda)$: Text processing
$P(\lambda \mid C, L, m)$: Posterior of model parameter
$P(m \mid C, L)$: Posterior of model structure
$P(L \mid W, \Lambda)$: Text processing
$P(C \mid X)$: Speech analysis
$P(\Lambda)$: Prior
Emotional Speech Synthesis
[Audio samples: each text synthesized in a neutral and an angry style; trained with 200 utterances]
「授業中に携帯いじってんじゃねえよ! 電源切っとけ!」
“Don’t touch your cell phone during class! Turn it off!”
「ミーティングには毎週参加しなさい!」
“You must attend the weekly meeting!”
Speaker Adaptation (mimicking voices)
[Figure: a speaker-independent model is adapted with adaptation data by MLLR-based adaptation to give the adapted model]
Speech samples:
• w/o adaptation (initial model)
• Adapted with 4 utterances
• Adapted with 50 utterances
• Speaker-dependent model
Speaker Interpolation (mixing voices)
Linear combination of two speaker-dependent models
[Figure: Model A, interpolated model, Model B]
Interpolation weights for the speech samples:
A: 1.00 0.75 0.50 0.25 0.00
B: 0.00 0.25 0.50 0.75 1.00
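A linear combination of two speaker-dependent models can be sketched at the level of a single Gaussian output distribution. The slide does not say which interpolation rule is used, so the rule below (linear interpolation of both means and variances with weights that sum to one) is only one simple assumption, and the model parameters are made up for illustration.

```python
def interpolate_gaussian(mean_a, var_a, mean_b, var_b, weight_a):
    """Interpolate two diagonal Gaussians; weight_b is 1 - weight_a.

    Linear interpolation of means and variances is an assumed rule,
    not necessarily the one used for the samples in the talk.
    """
    weight_b = 1.0 - weight_a
    mean = [weight_a * ma + weight_b * mb for ma, mb in zip(mean_a, mean_b)]
    var = [weight_a * va + weight_b * vb for va, vb in zip(var_a, var_b)]
    return mean, var


# Illustrative state output distributions of two speakers.
mean_a, var_a = [0.0, 2.0], [1.0, 1.0]
mean_b, var_b = [4.0, 6.0], [3.0, 1.0]

# The five weight settings listed on the slide.
interpolated = {w: interpolate_gaussian(mean_a, var_a, mean_b, var_b, w)
                for w in [1.00, 0.75, 0.50, 0.25, 0.00]}
```

At weight 1.00 the result is model A, at 0.00 it is model B, and intermediate weights move the parameters gradually from one speaker to the other.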
Voice Morphing
Two voices: A, B
Four voices: A, B
Interpolation of Speaking Styles
[Figure: base model A and base model B (neutral and high-tension speaking styles), with interpolation between them and extrapolation beyond them]
Multilingual Speech Synthesis
• Japanese
• American English
• Chinese (Mandarin) (by ATR)
• Brazilian Portuguese (by Nitech, and UFRJ)
• European Portuguese (by Nitech, Univ of Porto, and UFRJ)
• Slovenian (by Bostjan Vesnicer, University of Ljubljana, Slovenia)
• Swedish (by Anders Lundgren, KTH, Sweden)
• German (by University of Bonn, and Nitech)
• Korean (by Sang-Jin Kim, ETRI, Korea)
• Hungarian (by Budapest University of Technology and Economics)
• Finnish (by TKK, Finland)
• Baby English (by Univ of Edinburgh, UK)
• Polish, Slovak, Finnish, Arabic, Farsi, Polyglot, etc.
Singing Voice Synthesis
[Figure: singing voice database → trained HMMs; musical score + trained HMMs → singing voice for any piece of music]
Speech samples: male, female
Conclusion
Statistical approach to speech synthesis
• Whole speech synthesis process is described as a statistical framework
• It gives a unified view and reveals what is correct and what is wrong
• Importance of the database
Future work
• Still we have many problems which should be solved:
• Speech waveform modeling
• Combination with text processing part