Modeling and Generation of Accentual Phrase F0 Contours Based on Discrete HMMs Synchronized at Mora-Unit Transitions
Atsuhiro Sakurai (Texas Instruments Japan, Tsukuba R&D Center)
Koji Iwano (currently with Tokyo Institute of Technology, Japan)
Keikichi Hirose (Dept. of Frontier Engineering, The University of Tokyo, Japan)
Introduction to Corpus-Based Intonation Modeling
• Traditional approach: rules derived from linguistic expertise. Human-dependent (too complicated and not satisfactory, because the phenomena involved are not completely understood)
• Corpus-based approach: modeling derived from statistical analysis of speech corpora. Automatic (potential to improve as better speech corpora become available)
Background
• HMMs are widely used in speech recognition, and fast learning algorithms exist
• Macroscopic discrete HMMs associated with accentual phrases can store information such as accent type and prosodic structure
• Morae are extremely important for describing Japanese intonation: sequences of high and low morae can characterize accent types
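As a concrete illustration (not in the original slides), the standard Tokyo-Japanese accent types can be written out as high/low (H/L) mora sequences; `hl_pattern` is a hypothetical helper name:

```python
def hl_pattern(accent_type: int, n_morae: int) -> str:
    """Return the high/low (H/L) mora pattern of a Tokyo-Japanese
    accentual phrase: type 0 rises after the first mora and stays high;
    type 1 starts high and falls; type n (n >= 2) falls after mora n."""
    if accent_type == 0:
        return "L" + "H" * (n_morae - 1)
    if accent_type == 1:
        return "H" + "L" * (n_morae - 1)
    # type n >= 2: low onset, high up to the accented mora, then low
    return "L" + "H" * (accent_type - 1) + "L" * (n_morae - accent_type)

print(hl_pattern(0, 4))  # LHHH
print(hl_pattern(1, 4))  # HLLL
print(hl_pattern(2, 4))  # LHLL
```

Such H/L sequences are exactly the kind of mora-level regularity the discrete HMMs are meant to capture.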
Overview of the Method
• Definition of HMM and alphabet:
– Accent types modeled by discrete HMMs
– 2-code mora F0 contour alphabet used as output symbols
– State transitions synchronized with mora transitions
• Classification of HMMs and training:
– HMMs classified according to linguistic attributes
– Training by the usual forward-backward (FB) algorithm
• Generation of F0 contours:
– Best sequence of symbols generated by a modified Viterbi algorithm
The Mora-F0 Alphabet
• Two codes: stylized mora F0 contours and mora-to-mora F0: 34 symbols each
• Obtained by LBG clustering from a 500-sentence database (ATR continuous speech database, speaker MHT)
• All the database is labeled using the 2-code symbols.
[Diagram: HMM state transitions synchronized with mora transitions across an accentual phrase]
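The codebook construction can be sketched as follows. This is a minimal LBG (Linde-Buzo-Gray) illustration: the 3-dimensional toy contour representation, the additive split perturbation, and the truncation of the doubled codebook to the non-power-of-two size 34 are all assumptions, since the slides do not give these details:

```python
import numpy as np

def lbg_codebook(data: np.ndarray, size: int, eps: float = 0.05,
                 iters: int = 20) -> np.ndarray:
    """LBG vector quantization: start from the global centroid, then
    repeatedly split every codeword (+/- eps perturbation) and refine
    with k-means until the codebook reaches the requested size."""
    codebook = data.mean(axis=0, keepdims=True)
    while len(codebook) < size:
        codebook = np.vstack([codebook + eps, codebook - eps])
        for _ in range(iters):  # k-means refinement after each split
            d = np.linalg.norm(data[:, None, :] - codebook[None], axis=2)
            labels = d.argmin(axis=1)
            for k in range(len(codebook)):
                if np.any(labels == k):
                    codebook[k] = data[labels == k].mean(axis=0)
    return codebook[:size]  # 34 is not a power of two, so we truncate

# e.g. quantize stylized mora F0 contours into 34 symbols
rng = np.random.default_rng(0)
contours = rng.normal(size=(500, 3))   # toy stand-in for the real contours
symbols = lbg_codebook(contours, 34)
print(symbols.shape)                    # (34, 3)
```

Each mora contour is then labeled with the index of its nearest codeword, giving the discrete output alphabet for the HMMs.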
The Accentual Phrase HMM
• Accentual phrases are classified according to:– Accent type
– Position of accentual phrase in the sentence
– (Optional: number of morae, part-of-speech, syntactic structure)
Example: "Karewa Tookyookara kuru." (He comes from Tokyo.)

Accentual phrase  Accent type  Position  Label sequence
Karewa            1            1         3 mora labels
Tookyookara      0            2         6 mora labels
kuru              1            3         2 mora labels

Each mora Mi receives a 2-code label [shape_i, F0_i]: its stylized F0 contour shape and its mora F0 (M1: [shape1, F01], M2: [shape2, F02], ...).
HMM Topologies
(a) Accent types 0 and 1
(b) Other accent types
Training Database
• ATR Continuous Speech Database (500 sentences, speaker MHT)
• Segmented into morae and accentual phrases
• Mora labels using the mora-F0 alphabet: shape (stylized F0 contour) and mora F0
• Accentual phrase labels: number of morae, position in the sentence
Output Code Generation
How to use the HMM for synthesis?
A) Recognition: given one output sequence, the search yields its likelihood and the best path
B) Synthesis: the search yields the best output sequence and the best path
Intonation Modeling Using HMM
Viterbi search for the recognition problem:

for t = 2, 3, ..., T
  for i_t = 1, 2, ..., S
    Dmin(t, i_t) = min_{i_{t-1}} { Dmin(t-1, i_{t-1}) + [-log a(i_t | i_{t-1})] + [-log b(y(t) | i_t)] }
    ψ(t, i_t) = argmin_{i_{t-1}} { Dmin(t-1, i_{t-1}) + [-log a(i_t | i_{t-1})] + [-log b(y(t) | i_t)] }
  next i_t
next t
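The recursion above can be sketched in Python; the (implicitly uniform) start at t = 1 and the backtracking step are details the slide leaves implicit:

```python
import numpy as np

def viterbi(neg_log_a, neg_log_b, y):
    """Viterbi search in the -log domain, matching the recursion above:
    Dmin(t, i) = min_j { Dmin(t-1, j) + [-log a(i|j)] + [-log b(y[t]|i)] }.
    neg_log_a[j, i] = -log a(i|j); neg_log_b[i, k] = -log b(k|i)."""
    S = neg_log_a.shape[0]
    T = len(y)
    D = np.full((T, S), np.inf)
    psi = np.zeros((T, S), dtype=int)
    D[0] = neg_log_b[:, y[0]]               # uniform start, for simplicity
    for t in range(1, T):
        for i in range(S):
            costs = D[t - 1] + neg_log_a[:, i] + neg_log_b[i, y[t]]
            psi[t, i] = costs.argmin()
            D[t, i] = costs[psi[t, i]]
    # backtrack the best path from the best final state
    path = [int(D[-1].argmin())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return D[-1].min(), path[::-1]

# toy 2-state example: each state prefers one symbol
nla = -np.log(np.array([[0.9, 0.1], [0.1, 0.9]]))   # transitions
nlb = -np.log(np.array([[0.9, 0.1], [0.1, 0.9]]))   # emissions
cost, path = viterbi(nla, nlb, [0, 0, 1, 1])        # path -> [0, 0, 1, 1]
```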
Intonation Modeling Using HMM
Modified Viterbi search for the synthesis problem (the observed symbol y(t) is replaced by ymax(t) = argmax_y b(y | i_t), the most probable symbol of the current state):

for t = 2, 3, ..., T
  for i_t = 1, 2, ..., S
    Dmin(t, i_t) = min_{i_{t-1}} { Dmin(t-1, i_{t-1}) + [-log a(i_t | i_{t-1})] + [-log b(ymax(t) | i_t)] }
    ψ(t, i_t) = argmin_{i_{t-1}} { Dmin(t-1, i_{t-1}) + [-log a(i_t | i_{t-1})] + [-log b(ymax(t) | i_t)] }
  next i_t
next t
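A sketch of the synthesis variant: since no observation is given, each state contributes its most probable symbol ymax, and the search returns the symbols emitted along the best path. Function and variable names are illustrative:

```python
import numpy as np

def viterbi_synthesis(neg_log_a, neg_log_b, T):
    """Modified Viterbi for synthesis: each state i contributes
    ymax(i) = argmin_k [-log b(k|i)], its most probable symbol;
    returns (best output sequence, best state path) over T steps."""
    S = neg_log_a.shape[0]
    ymax = neg_log_b.argmin(axis=1)        # best symbol per state
    bmin = neg_log_b.min(axis=1)           # its -log probability
    D = np.full((T, S), np.inf)
    psi = np.zeros((T, S), dtype=int)
    D[0] = bmin                            # uniform start, for simplicity
    for t in range(1, T):
        for i in range(S):
            costs = D[t - 1] + neg_log_a[:, i] + bmin[i]
            psi[t, i] = costs.argmin()
            D[t, i] = costs[psi[t, i]]
    path = [int(D[-1].argmin())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    path = path[::-1]
    return [int(ymax[i]) for i in path], path

# toy example: state 0 prefers symbol 0, state 1 prefers symbol 1
nla = -np.log(np.array([[0.9, 0.1], [0.1, 0.9]]))
nlb = -np.log(np.array([[0.9, 0.1], [0.1, 0.9]]))
syms, spath = viterbi_synthesis(nla, nlb, 3)
```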
Use of Bigram Probabilities
for t = 2, 3, ..., T
  for i_t = 1, 2, ..., S
    Dmin(t, i_t) = min_{i_{t-1}} { Dmin(t-1, i_{t-1}) + [-log a(i_t | i_{t-1})] + [-log max_k b(y_k(t) | y(t-1), i_t)] }
    ψ(t, i_t) = argmin_{i_{t-1}} { Dmin(t-1, i_{t-1}) + [-log a(i_t | i_{t-1})] + [-log max_k b(y_k(t) | y(t-1), i_t)] }
  next i_t
next t

where k = 1, ..., K (dimension of y): the emission probability is now conditioned on the previous symbol y(t-1), i.e. a bigram over output symbols.
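A hedged sketch of the bigram-extended search. Following the recursion above, the symbol attached to each (t, state) cell is chosen given the symbol stored at the predecessor cell, which is a greedy approximation rather than an exact search over symbol histories; the separate start distribution `neg_log_b0` at t = 1 is an assumption:

```python
import numpy as np

def viterbi_bigram(neg_log_a, neg_log_b0, neg_log_b2, T):
    """neg_log_b0[i, k]   : -log b(k|i), used at t = 0 (assumed start)
       neg_log_b2[p, i, k]: -log b(k | previous symbol p, state i)"""
    S = neg_log_a.shape[0]
    D = np.full((T, S), np.inf)
    psi = np.zeros((T, S), dtype=int)
    sym = np.zeros((T, S), dtype=int)       # best symbol stored per cell
    sym[0] = neg_log_b0.argmin(axis=1)
    D[0] = neg_log_b0.min(axis=1)
    for t in range(1, T):
        for i in range(S):
            # emission cost of predecessor j depends on its stored symbol
            emis = neg_log_b2[sym[t - 1], i, :].min(axis=1)   # shape (S,)
            costs = D[t - 1] + neg_log_a[:, i] + emis
            j = int(costs.argmin())
            psi[t, i] = j
            sym[t, i] = int(neg_log_b2[sym[t - 1, j], i].argmin())
            D[t, i] = costs[j]
    path = [int(D[-1].argmin())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    path = path[::-1]
    return [int(sym[t, s]) for t, s in enumerate(path)], path

# toy 1-state model whose bigram strongly prefers alternating symbols
nla = -np.log(np.array([[1.0]]))
nlb0 = -np.log(np.array([[0.9, 0.1]]))
nlb2 = -np.log(np.array([[[0.1, 0.9]], [[0.9, 0.1]]]))  # (K_prev, S, K)
syms, path = viterbi_bigram(nla, nlb0, nlb2, 4)          # syms alternate
```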
Accent Type Modeling Using HMM
[Figure: average generated F0 contours, log(Hz) (≈3.65–4.15) vs. mora number, for accent types "Type0"–"Type3"]
Phrase Boundary Level Modeling Using HMM
[Figure: average generated F0 contours, log(Hz) (≈3.9–4.08) vs. mora number, for phrase boundary levels "level1"–"level3"]

Correspondence with J-ToBI labels:

Boundary level  J-ToBI break index  Pause
1               3                   Y
2               3                   N
3               2                   N
[Figure: six panels of log F0 [Hz] (±0.4) vs. t [msec] (0–500) for accentual phrases PH1_0, PH1_1, and PH1_2, each comparing the original contour (*.original) with the bigram-based output (*.bigram)]
The Effect of Bigrams
Comments
• We presented a novel approach to intonation modeling for TTS synthesis based on discrete mora-synchronous HMMs.
• From now on, more features should be included in the HMM modeling (phonetic context, part-of-speech, etc.), and the approach should be compared to rule-based methods.
• Training data scarcity is a major problem to overcome (by feature clustering, an F0 contour generation model, etc.).
Hidden Markov Models (HMM)
A Hidden Markov Model (HMM) is a finite state automaton in which both state transitions and outputs are stochastic. At each time period it moves to a new state, generating a new output symbol according to the output distribution of that state.
[Diagram: 4-state left-to-right HMM over symbols 1, 2, ..., K, with self-transitions a11–a44, forward transitions a12, a23, a34, a skip transition a13, and output distributions b(1|i)–b(K|i) in each state i]
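To make the definition concrete, here is a generative sketch of a discrete HMM; the emit-then-transition ordering and the function name are assumptions for illustration:

```python
import random

def sample_hmm(a, b, start, T, seed=0):
    """Generate a state/symbol sequence from a discrete HMM:
    at each step, emit a symbol from the current state's output
    distribution b[state], then move according to the transition
    row a[state]."""
    rng = random.Random(seed)
    states, symbols = [], []
    state = rng.choices(range(len(a)), weights=start)[0]
    for _ in range(T):
        states.append(state)
        symbols.append(rng.choices(range(len(b[state])), weights=b[state])[0])
        state = rng.choices(range(len(a)), weights=a[state])[0]
    return states, symbols

# deterministic toy model: states alternate, state i always emits symbol i
a = [[0.0, 1.0], [1.0, 0.0]]
b = [[1.0, 0.0], [0.0, 1.0]]
states, symbols = sample_hmm(a, b, start=[1.0, 0.0], T=4)
print(states, symbols)   # [0, 1, 0, 1] [0, 1, 0, 1]
```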
Step 1: Database Creation
• Use the ATR continuous speech database (500 sentences, speaker MHT)
• Segment into mora units
• Assign mora labels
• Extract F0 patterns
• Cluster with the LBG algorithm
• Assign cluster classes to the whole database
Discussion and Future Work
• Training data is scarce
• Integration into a TTS system requires further work:
– Take other linguistic information into account (phonemes, number of morae, part of speech, etc.)
– Devise ways to overcome the data shortage (clustering, etc.)
– Study how to connect the models