Appendix
Fundamental Frequency Modeling for Neural-Network-based Statistical Parametric Speech Synthesis
ID: 20151706, 2017-12-07
Contact: [email protected] for suggestions and discussion
8/2/18
Xin WANG
INTRODUCTION
Linguistic features
[Figure: text front-end. The input text, e.g. 次は新金岡、新金岡です。 ("Next is Shin-Kanaoka, Shin-Kanaoka."), is processed by a text analyzer (parser [1] with a dictionary), which gives word readings and parts of speech (noun, particle, proper noun, auxiliary verb), and by a "prosody" analyzer [2, 3]. A duration model [4] and an interface then expand the result into frame-level linguistic features x_{1:T} in T frames.]
[1] T. Kudo. MeCab: Yet Another Part-of-Speech and Morphological Analyzer.
[2] Sagisaka and Sato. IEICE Transactions (Japanese edition), Vol. J66-D, No. 7, pp. 849–856, 1983.
[3] M. Suzuki, et al. "Automatic estimation of accent sandhi in Tokyo-dialect Japanese using CRFs" (in Japanese). (2012): 2-2.
[4] T. Yoshimura, et al. Duration modeling for HMM-based speech synthesis. In ICSLP, volume 98, pages 29–32, 1998.
Linguistic features
• Japanese data (generated by Open JTalk [5]):
  Phoneme
  § Previous-previous / previous / current / next / next-next phoneme
  Mora
  § Distance from the current mora to the accent nucleus
  § Position of the current mora in the accent phrase
  Word
  § Part-of-speech of the previous/current/next word
  § Inflected forms of the previous/current/next word
  § Conjugation type of the previous/current/next word
  Accent phrase
  § Number of moras of the previous/current/next accent phrase
  § Accent type of the previous/current/next accent phrase
  § Whether the previous/current/next accent phrase is interrogative
  § Position of the current accent phrase in the breath group
  § Is there a pause after the previous or before the next accent phrase
[5] The HTS Working Group. The Japanese TTS System 'Open JTalk', 2016. http://open-jtalk.sourceforge.net
Linguistic features
• Japanese data (generated by Open JTalk [5]):
  Breath group
  § Number of moras of the previous/current/next breath group
  § Number of accent phrases of the previous/current/next breath group
  § Position of the current breath group in the utterance
• English data (generated by Flite+HTS_engine [6]):
  § Similar features over phoneme/syllable/phrase
  § Pitch accent -> accented or not
  § Part of the ToBI boundary tone (LL, LH)
[6] HTS Working Group. The English TTS system Flite+HTS engine, 2014. http://hts-engine.sourceforge.net
Acoustic features
[Figure: acoustic feature extraction by a speech vocoder. The waveform is framed and windowed; each frame goes through FFT + cepstral analysis and F0 tracking (e.g. unvoiced, unvoiced, 200 Hz), giving the acoustic feature sequence o_{1:T} in T frames.]
• Each slice is called a speech frame. Length: 20 ms; overlap: 15 ms
• The spectrum amplitude may be used; the phase may be ignored
• Depending on the task, o_{1:T} may contain only F0 or only the spectrum
• Source-filter model: HTS Slides ver. 2.3, released by the HTS Working Group, http://hts.sp.nitech.ac.jp/ (Nagoya Institute of Technology, Department of Computer Science)
[1] H. Kawahara et al. Speech Communication, 27:187–207, 1999.
[2] M. Morise, et al. IEICE Trans. on Information and Systems, 99(7):1877–1884, 2016.
[3] K. Tokuda, et al. Mel-generalized cepstral analysis: a unified approach. In Proc. ICSLP, pages 1043–1046, 1994.
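The framing step on this slide (20 ms frames with 15 ms overlap, i.e. a 5 ms frame shift) can be sketched as below. This is a minimal illustration, not the vocoder's actual code: the function name `frame_signal` and the 16 kHz sampling rate are assumptions, and windowing and spectral analysis are left out.

```python
def frame_signal(samples, sample_rate=16000, frame_ms=20.0, overlap_ms=15.0):
    """Cut a waveform (a sequence of samples) into overlapping frames.

    sample_rate is an assumed value; frame_ms and overlap_ms follow the slide.
    """
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    # The frame shift is the frame length minus the overlap (5 ms here).
    shift = int(round(sample_rate * (frame_ms - overlap_ms) / 1000.0))
    if frame_len <= 0 or shift <= 0:
        raise ValueError("frame length and frame shift must be positive")
    frames = []
    start = 0
    # Keep only complete frames; a trailing partial frame is dropped.
    while start + frame_len <= len(samples):
        frames.append(list(samples[start:start + frame_len]))
        start += shift
    return frames


one_second = [0.0] * 16000
frames = frame_signal(one_second)
print(len(frames), "frames of", len(frames[0]), "samples")
```

With these settings each frame advances by 80 samples, so the frame index T grows by one for every 5 ms of speech.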
Task definition
• Equal-length sequence-to-sequence conversion
  Linguistic features $x_{1:T} = \{x_1, \cdots, x_T\}$ -> statistical F0 model -> F0 contour $o_{1:T} = \{o_1, \cdots, o_T\}$
• Model training:
  $\Theta^{*} = \arg\max_{\Theta} \prod_{k=1}^{|D|} p\big(o^{(k)}_{1:T_k} \mid x^{(k)}_{1:T_k}; \Theta\big)$
• F0 generation:
  $\hat{o}_{1:T} = \arg\max_{o_{1:T}} p\big(o_{1:T} \mid x_{1:T}; \Theta^{*}\big)$
OVERVIEW OF PHD RESEARCH
Corpora and features

Name | Size | Note
Blizzard Challenge 2011 corpus [12], Nancy voice | ~12,000 utterances, 16 hours | English, neutral style, reading speech
ATR Ximera corpora [13], F009 voice | ~30,000 utterances, 48 hours | Japanese, neutral style, reading speech

Feature | Dimension
Linguistic features (phone sequence, prosodic features ...) | ~390
Acoustic features: mel-generalized cepstral coefficients [14] | 60
Acoustic features: F0 (with unvoiced/voiced) | 1+1
Acoustic features: band aperiodicity | 25

[12] King, S. and Karaiskos, V. (2011). The Blizzard Challenge 2011. In Proc. Blizzard Challenge Workshop, pages 1–10.
[13] Kawai, H., Toda, T., Ni, J., Tsuzaki, M., and Tokuda, K. (2004). Ximera: A new TTS from ATR based on corpus-based technologies. In Proc. SSW5, pages 179–184.
[14] Tokuda, K., Kobayashi, T., Masuko, T., and Imai, S. (1994). Mel-generalized cepstral analysis: a unified approach. In Proc. ICSLP, pages 1043–1046.
[15] Kawahara, H., Masuda-Katsuse, I., and Cheveigne, A. d. (1999). Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds. Speech Communication, 27:187–207.
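The "1+1" dimension for F0 can be read as one continuous F0 value plus one voiced/unvoiced flag per frame. The sketch below shows one common way to build such a pair; it assumes that unvoiced frames are marked with 0 in the raw track and that gaps are filled by linear interpolation, neither of which is stated on the slide. The function name `f0_to_two_dims` is hypothetical.

```python
def f0_to_two_dims(f0):
    """Split a frame-level F0 track into (continuous F0, voiced/unvoiced flag).

    Assumption: a raw value of 0 marks an unvoiced frame.
    """
    vuv = [1 if value > 0 else 0 for value in f0]
    voiced_idx = [i for i, flag in enumerate(vuv) if flag]
    if not voiced_idx:
        return [0.0] * len(f0), vuv
    cont = [float(value) for value in f0]
    first, last = voiced_idx[0], voiced_idx[-1]
    # Leading and trailing unvoiced frames copy the nearest voiced value.
    for i in range(first):
        cont[i] = cont[first]
    for i in range(last + 1, len(f0)):
        cont[i] = cont[last]
    # Unvoiced gaps between voiced frames are filled by linear interpolation.
    for a, b in zip(voiced_idx, voiced_idx[1:]):
        for i in range(a + 1, b):
            w = (i - a) / (b - a)
            cont[i] = (1.0 - w) * cont[a] + w * cont[b]
    return cont, vuv


cont_f0, vuv = f0_to_two_dims([0, 200, 0, 0, 230, 0])
print(cont_f0, vuv)
```

Keeping the flag as a separate dimension lets a regression model predict a smooth F0 curve while the voicing decision is recovered from the second output.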
Motivation
• Why F0?
  • More than surface word meaning
  • More than imagined ...

Speaker A: Who made the marmalade?
Speaker B: Marianna made the marmalade.
Speaker A: Bob made the marmalade.
Speaker B: (No,) Marianna made the marmalade.
Speaker B: Marianna made the marmalade.
Speaker B: Marianna made the marmalade.
Speaker B: Marianna made the marmalade.
[10] Nanette Veilleux, et al. 6.911 Transcribing Prosodic Structure of Spoken Utterances with ToBI. January IAP 2006. https://ocw.mit.edu. License: Creative Commons BY-NC-SA.
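Word Embeddings

WORD VECTORS
TTS with word vectors
• Replace prosodic tags with word vectors
[Figure: conventional TTS pipeline for the input text "this is a test". Text analysis (grapheme-to-phoneme conversion, syntactic analysis, prosody prediction) yields the phonemes D I s | I z | eI | t e s t, the parse (S (NP this) (VP is (NP a test))), and prosodic tags (this: H; test: H* L-L%). An interface encodes them as an input vector (0101...00|0010.2.4.) for acoustic modeling; the acoustic model outputs the speech waveform.]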
TTS with word vectors
• Replace prosodic tags with word vectors [4]
  § similar to the first work by another Wang [5]
  § why word vectors [6]: unsupervised learning, linguistic regularity ...
[Figure: the same pipeline for "this is a test", with word vectors in place of prosody prediction. Text analysis yields the phonemes D I s | I z | eI | t e s t, the parse (S (NP this) (VP is (NP a test))), and word vectors (this [0.12, 0.34, ..], is [1.2, -23, ..], test ...). The interface encodes them as an input vector (0101...00|0.120.34...) for the acoustic model, which outputs the speech waveform.]
[4] Wang, X., Takaki, S., & Yamagishi, J. (2016). Investigation of Using Continuous Representation of Various Linguistic Units in Neural Network based TTS. IEICE, Vol. E99-D, No. 10.
[5] Wang, P., Qian, Y., Soong, F. K., He, L., & Zhao, H. (2015). Word embedding for recurrent neural network based TTS synthesis. In ICASSP (pp. 4879-4883).
[6] Mikolov, T., Yih, W., & Zweig, G. (2013). Linguistic regularities in continuous space word representations. In HLT-NAACL (pp. 746–751).
TTS with word vectors
• Results of previous work [4]
  § MUSHRA test with 20 paid native speakers at CSTR

Tab. 1. Systems
ID | Input to the acoustic model (a recurrent neural network)
R_p | phonemes + predicted prosodic tags
R_N | phonemes
R_w | phonemes + word vector

[4] Wang, X., Takaki, S., & Yamagishi, J. (2016). Investigation of Using Continuous Representation of Various Linguistic Units in Neural Network based Text-to-Speech Synthesis. IEICE, Vol. E99-D, No. 10.
Enhance the word vector with prosodic information
• Summary
  § secondary corpus: small, with expert prosodic annotation
  § primary corpus: huge, without expert prosodic annotation
[Figure: three stages, where M is the set of word vectors.
  1. Post-filter training: the prosodic annotation of the secondary speech corpus (ToBI tags, prosodic features) is used to train a vector post-filter on the word vectors.
  2. Vector enhancing: the trained vector post-filter maps each raw word vector to an enhanced word vector.
  3. Acoustic model training: on the primary speech corpus, the text goes through grapheme-to-phoneme conversion and, together with the enhanced word vectors, is fed to the acoustic model, which predicts the acoustic features.]
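Enhance the word vector with prosodic information
• How? Train a post-filter $F(\cdot)$ with the triplet ranking loss criterion [12]:
  $E = \max\big[0,\; 1 - \mathrm{Sim}(p_w, F(m_w)) + \mathrm{Sim}(p_w, F(m_{w^-}))\big]$
  $\mathrm{Sim}(x, y) = \dfrac{x \cdot y}{\|x\| \cdot \|y\|}$
[Figure: from the speech (and text) of the secondary speech corpus, prosodic tags and feature extraction feed an NN-based classifier that gives the prosodic vector $p_w$ of word $w$. The vector post-filter $F(\cdot)$ is applied to the word vector $m_w$ of word $w$ and to the word vector $m_{w^-}$ of another word $w^-$.]
[12] Bengio, S., & Heigold, G. (2014). Word embeddings for speech recognition. In INTERSPEECH-2014 (pp. 1053–1057).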
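The triplet ranking loss above can be written out in a few lines. The sketch below is only an illustration of the formula: the function names are hypothetical, the vectors are plain Python lists, and the post-filter F(.) is assumed to have been applied already (its outputs are passed in directly).

```python
import math


def cosine_sim(x, y):
    """Sim(x, y) = (x . y) / (||x|| * ||y||)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def triplet_ranking_loss(p_w, f_m_w, f_m_w_neg, margin=1.0):
    """E = max[0, margin - Sim(p_w, F(m_w)) + Sim(p_w, F(m_w_neg))].

    p_w       : prosodic vector of word w
    f_m_w     : post-filtered word vector of the same word w
    f_m_w_neg : post-filtered word vector of another word
    The margin of 1 follows the formula on the slide.
    """
    return max(0.0, margin - cosine_sim(p_w, f_m_w) + cosine_sim(p_w, f_m_w_neg))


# The loss vanishes once the matching word is closer to p_w than the
# other word by at least the margin.
print(triplet_ranking_loss([1.0, 0.0], [1.0, 0.0], [-1.0, 0.0]))
print(triplet_ranking_loss([1.0, 0.0], [0.0, 1.0], [1.0, 0.0]))
```

Minimizing E pushes the enhanced vector of a word towards its own prosodic vector and away from the prosodic vectors of other words.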
Experiments
• Systems
  § all systems use another acoustic model to predict the spectral features

System ID | Input to the acoustic model (F0 trajectory model)
R_N | phonemes
R_p | phonemes + conventional prosodic context (automatically predicted)
R_wr | phonemes + raw word vector
R_we | phonemes + enhanced word vector
R_wrbp | phonemes + raw word vector tuned by back-propagation in TTS
R_webp | phonemes + enhanced word vector tuned by back-propagation in TTS
Results
• Objective test
[Figure: two bar charts over the systems R_N, R_p, R_wr, R_we, R_wrbp, R_webp. Left: F0 correlation in mel scale (axis 0.77 to 0.79). Right: F0 RMSE in mel scale (axis 38.5 to 40).]
Legend: R_N only phoneme; R_p + prosodic context; R_wr + raw word vector; R_we + enhanced word vector; R_wrbp + raw word vector (fine-tuned); R_webp + enhanced word vector (fine-tuned)
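The two objective measures in the figure, F0 RMSE and F0 correlation in mel scale, could be computed roughly as follows. This is a sketch under stated assumptions: the Hz-to-mel conversion uses the common formula 1127 ln(1 + f/700), only frames that are voiced (F0 > 0) in both contours are scored, and the function names are hypothetical. The slides do not specify these details.

```python
import math


def hz_to_mel(f_hz):
    """A common Hz-to-mel conversion (assumed; not stated on the slide)."""
    return 1127.0 * math.log(1.0 + f_hz / 700.0)


def f0_rmse_and_corr(ref_hz, gen_hz):
    """Return (RMSE, Pearson correlation) in mel scale over frames that
    are voiced (F0 > 0) in both the reference and the generated contour."""
    pairs = [(hz_to_mel(r), hz_to_mel(g))
             for r, g in zip(ref_hz, gen_hz) if r > 0 and g > 0]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two commonly voiced frames")
    rmse = math.sqrt(sum((r - g) ** 2 for r, g in pairs) / n)
    mean_r = sum(r for r, _ in pairs) / n
    mean_g = sum(g for _, g in pairs) / n
    cov = sum((r - mean_r) * (g - mean_g) for r, g in pairs)
    var_r = sum((r - mean_r) ** 2 for r, _ in pairs)
    var_g = sum((g - mean_g) ** 2 for _, g in pairs)
    if var_r == 0.0 or var_g == 0.0:
        raise ValueError("correlation is undefined for a constant contour")
    return rmse, cov / math.sqrt(var_r * var_g)


natural = [0, 100, 150, 200, 0]
generated = [0, 110, 160, 190, 120]
print(f0_rmse_and_corr(natural, generated))
```

Lower RMSE and higher correlation both indicate a generated contour closer to the natural one.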
Results
• Sample
[Figure: F0 (Hz, 0 to 500) against frame index (50 to 250) for NATURAL, R_N, R_p, R_wr, and R_we on the phrase "if the move would require".]
Results
• Subjective test
  § conducted at CSTR, by 20 paid native speakers
  § some evaluators strongly favor R_we while others strongly favor R_wr; is this because the context of the sentence is missing?
[Figure: preference test (0% to 100%) over three system pairs drawn from R_p, R_wr, R_we, and R_webp. The preference splits are 49.87% / 50.13%, 46.79% / 53.21%, and 47.00% / 53.00%.]
Highway

ON NETWORK'S DEPTH
Motivation
• Very deep network for SPSS?
  • Image classification: > 100 hidden layers [14]
  • Speech recognition: > 10 hidden layers [15]
• Just more hidden layers?
  • Image classification and speech recognition are classification tasks
  • SPSS?
    1. a regression task
    2. heterogeneous targets: F0, MGC ...
[14] He, K., Zhang, X., Ren, S., & Sun, J. (2015). Deep Residual Learning for Image Recognition. CoRR, abs/1512.03385. Retrieved from http://arxiv.org/abs/1512.03385
[15] Liang, L., & Steve, R. (2016). Small-footprint Deep Neural Networks with Highway Connections for Speech Recognition. In Proc. INTERSPEECH (pp. 12–16).
Highway networks [16] for SPSS
• Why highway network?
  § easier training of very deep networks
  § easy investigation of the network's behavior
[Figure: a feedforward network (linguistic features -> stacked feedforward layers -> MGC, F0, BAP) next to a highway network in which the upper feedforward layers are replaced by highway blocks. Inside a highway block, a gate T(x) mixes the output H(x) of the feedforward layers with the block input x.]
  $y = T(x) \odot H(x) + [1 - T(x)] \odot x$
  $T(x) = \mathrm{sigmoid}(Wx + b)$
[16] Srivastava, R. K., Greff, K., & Schmidhuber, J. (2015). Highway Networks. CoRR, abs/1505.00387. Retrieved from http://arxiv.org/abs/1505.00387
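The highway block equations above can be sketched in plain Python. This is an illustration rather than the implementation behind the slides: H(x) is reduced to a single tanh layer (the experiments use two tanh layers per block), and all function and parameter names are assumptions.

```python
import math


def affine(W, b, x):
    """Compute W x + b for a weight matrix given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + bias
            for row, bias in zip(W, b)]


def highway_block(x, W_h, b_h, W_t, b_t):
    """y = T(x) * H(x) + (1 - T(x)) * x, element-wise.

    H(x) = tanh(W_h x + b_h)     (one tanh layer, a simplification)
    T(x) = sigmoid(W_t x + b_t)  (the highway gate)
    Input and output must have the same dimension.
    """
    h = [math.tanh(v) for v in affine(W_h, b_h, x)]
    t = [1.0 / (1.0 + math.exp(-v)) for v in affine(W_t, b_t, x)]
    return [ti * hi + (1.0 - ti) * xi for ti, hi, xi in zip(t, h, x)]


x = [0.5, -1.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
# A strongly negative gate bias closes the gate: the block copies its input.
print(highway_block(x, identity, [0.0, 0.0], zeros, [-50.0, -50.0]))
# A strongly positive gate bias opens the gate: the block outputs H(x).
print(highway_block(x, identity, [0.0, 0.0], zeros, [50.0, 50.0]))
```

The two extreme cases show why the gate is a convenient probe of what a trained block does: a block whose gate stays near 0 simply passes its input through.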
Highway networks for SPSS
• Why multi-stream?
  § reduce interaction between MGC and F0 modeling
[Figure: three architectures on the same linguistic-feature input. Single-stream feedforward network: stacked feedforward layers that predict MGC, F0, and BAP together. Single-stream highway network: feedforward layers followed by highway blocks that predict MGC, F0, and BAP together. Multi-stream highway network: a shared feedforward layer followed by three sub-networks (a feedforward layer plus highway blocks each), one for F0, one for MGC, and one for BAP.]
Experiments
• Networks

Notation | System | Configuration
DS | deep feedforward single-stream network | layer size: 382
HS | highway single-stream network | layer size: 382; 2 tanh layers for each highway block
HM | highway multi-stream network | layer size: 256 (each sub-network); 2 tanh layers for each highway block

• Corpus: BC2011 Nancy voice
Results
• Single-stream: sufficient depth is necessary
• Multi-stream: more (feedforward) layers for F0?
  (Depth: total number of hidden tanh-based feedforward layers)
[Figure: F0 correlation, F0 RMSE (Hz), and MGC RMSE against network depth (2, 4, 8, 14, 20, 40) for DS, HS, and HM.]
[Figure: the same three measures against the number of model parameters (3.2e+05 to 3.3e+06) for DS, HS, and HM, with each point labeled by network depth (2, 4, 8, 14, 20, 40).]
Results
• Similar results given fixed depth, varied layer size
[Figure: MGC RMSE, F0 RMSE (Hz), and F0 correlation against the number of model parameters (1.5e+06 to 3.6e+07) for DS, HS, and HM. DS and HS points are labeled with layer sizes between 382 and 1024; HM points are labeled HM1 to HM4.]
Analysis of network behavior
• Investigation tool: histogram of $T(x)$
  $y = T(x) \odot H(x) + [1 - T(x)] \odot x$
  • $T(x)$ indicates the network's behavior
    § As $T(x) \approx 0$, $y \approx x$
    § As $T(x) \approx 1$, $y \approx H(x)$
[Figure: multi-stream highway network (linguistic features -> feedforward layer -> F0, MGC, and BAP sub-networks of stacked highway blocks) with a histogram of T(x) over the range 0 to 1.]
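The gate histogram used as the investigation tool can be sketched as follows. The code assumes that the gate activations T(x) of one highway block have already been collected over many frames into a flat list; the bin count, the 0.1 threshold, and the function names are illustrative choices, not values from the slides.

```python
def gate_histogram(gate_values, num_bins=10):
    """Count gate activations T(x) in equal-width bins over [0, 1]."""
    counts = [0] * num_bins
    for t in gate_values:
        if not 0.0 <= t <= 1.0:
            raise ValueError("a sigmoid gate value must lie in [0, 1]")
        # T(x) = 1.0 falls into the last bin.
        counts[min(int(t * num_bins), num_bins - 1)] += 1
    return counts


def fraction_near_identity(gate_values, threshold=0.1):
    """Share of activations with T(x) below the threshold, i.e. y close to x."""
    return sum(1 for t in gate_values if t < threshold) / len(gate_values)


gates = [0.0, 0.05, 0.5, 0.95, 1.0]
print(gate_histogram(gates))
print(fraction_near_identity(gates))
```

A block whose histogram piles up near 0 mostly copies its input, which is the sense in which a sub-network can be called partially inactive.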
Analysis of network behavior
• Histogram of T(x), HM14
  • for MGC
  • for F0
  • The F0 sub-network is partially inactive
[Figure: histograms of T(x) over the range 0 to 1 for highway blocks b.1 to b.7 of the MGC sub-network and of the F0 sub-network in HM14.]
Analysis of network behavior
• Histogram of T(x), HS14
  • The single-stream network is fully active
  • MGC dominates the network
[Figure: histograms of T(x) over the range 0 to 1 for highway blocks b.1 to b.7 of the single-stream highway network HS14, which predicts F0, MGC, and BAP from the linguistic features.]
Results (HM only)
• Objective measures for systems with different depth
[Figure: MGC RMSE, F0 RMSE (Hz), and F0 correlation against the number of model parameters (log scale, 1.4e+06 to 2.4e+07) for HM2, HM4, HM8, HM14, HM20, HM40, HM60, and HM80.]
Results (HM14, HM20, HM40, HM60, HM80)
[Figures: for each system, histograms of T(x) over the range 0 to 1 for the highway blocks of the MGC sub-network and of the F0 sub-network. Block 1 is near the input end of the network; the highest-numbered block is near the output end. Blocks shown: HM14, blocks 1 to 7; HM20, blocks 1 to 10; HM40, blocks 1 to 20; HM60, selected blocks from 1 to 30; HM80, odd-numbered blocks from 1 to 39.]
• Overfitting of HM80?
Analysis
• Contribution of the linguistic features
  • Most useful features

    MGC sub-network | LF0 sub-network
    Phoneme identity | Position of phoneme in syllable
    Position of phoneme in syllable | Accent type of next syllable
    Position of phoneme in syllable (backward) | Accent type of previous syllable
    Number of previous stressed syllables in phrase | Position of syllable in the word
    Number of stressed syllables remaining in phrase | Phoneme identity

  • Least useful features

    MGC sub-network | LF0 sub-network
    Position of phrase in utterance | Number of words in previous phrase
    Number of words in phrase | Number of words in next phrase
    Number of phrases in utterance | ToBI boundary tone
    Number of syllables in previous phrase | Number of syllables in next phrase
    ToBI boundary tone | Number of syllables in previous phrase
ISSUE 1: JOINT LEARNING FOR F0?
Experiments
• Other results
  • Sensitivity analysis in the multi-stream highway network: MGC and F0 use different linguistic features
    MGC:
    § Current phoneme identity
    § Position of current phoneme in syllable (forward)
    § Position of current phoneme in syllable (backward)
    § Number of preceding lexically stressed syllables in phrase
    § Number of following lexically stressed syllables in phrase
    F0:
    § Position of phoneme in syllable (forward)
    § Is the next syllable bearing an English pitch accent
    § Is the previous syllable bearing an English pitch accent
    § Position of current syllable in the word
    § Current phoneme identity
  • Similar results on Japanese data

Samples
[Audio samples: DS, HS, and HM with 2, 4, 8, 14, 20, 40, 60, and 80 layers.]
Experiments
• Networks

Notation | System | Configuration
HMn | multi-stream highway network | layer size 256 for each sub-network; 2 tanh layers in one highway block; sigmoid for the highway gate
RNN [1] | recurrent neural network (single-stream) | 2 feedforward layers, 512 each layer; 2 bi-directional LSTM layers, 256 each layer; 1 linear projection output layer
DNN | deep feedforward neural network (single-stream) | 1 feedforward layer, 1024 each layer; 3 feedforward layers, 512 each layer; 1 linear projection output layer

• Corpus: ATR F009
[1] Wang, X., Takaki, S., & Yamagishi, J. (2016). A Comparative Study of the Performance of HMM, DNN, and RNN based Speech Synthesis Systems Trained on Very Large Speaker-Dependent Corpora. In Proc. SSW9 (pp. 125–128).
[Figure: MGC RMSE, F0 RMSE (Hz), and F0 correlation (0 to 1) against the number of network weights (5.0e5 to 8.0e6) for the single-stream feedforward, single-stream highway, and multi-stream highway networks. Points carry numeric labels: 4, 6, 8 on one curve and 2, 3, 4, 8, 16, 32 on the other two.]
Results (Japanese)
• HM with 2, 7, 10 highway blocks
[Figure: histograms of T(x) over the range 0 to 1 for each highway block of the MGC stream and of the F0 stream, for the systems with 2 blocks (block 1, block 2), 7 blocks (b.1 to b.7), and 10 blocks (b.1 to b.10).]
Results (Japanese)
• Sensitivity analysis, 47 classes of contextual features
  • Least useful features, MGC stream
  • Least useful features, F0 stream

Feature | minEs | Description | Dim | Rank
'RR-Phone' | 0.993 | the phoneme after the next phoneme identity | 44 | 40
'C-Br_Len_Mora' | 0.993 | the number of moras in the current breath group | 100 | 41
'C-Br_Bw-Pos-in_Utt_Mora' | 0.994 | position of the current breath group identity by mora (backward) | 201 | 42
'C-Br_Fw-Pos-in_Utt_Mora' | 0.995 | position of the current breath group identity by mora (forward) | 201 | 43
'C-Acc_Fw-Pos-in_Br_Mora' | 0.996 | position of the current accent phrase identity in the current breath group by the mora (forward) | 121 | 44
'Utt_Len_Acc' | 0.997 | the number of accent phrases in this utterance | 60 | 45
'Utt_Len_Br' | 0.999 | the number of breath groups in this utterance | 30 | 46
'Utt_Len_Mora' | 0.999 | the number of moras in this utterance | 200 | 47

Feature | minEs | Description | Dim | Rank
'L-Acc_Len_Mora' | 0.994 | the number of moras in the previous accent phrase | 61 | 40
'C-Br_Len_Mora' | 0.995 | the number of moras in the current breath group | 100 | 41
'C-Acc_Fw-Pos-in_Br_Mora' | 0.995 | position of the current accent phrase identity in the current breath group by the mora (forward) | 121 | 42
'C-Br_Bw-Pos-in_Utt_Mora' | 0.996 | position of the current breath group identity by mora (backward) | 201 | 43
'C-Br_Fw-Pos-in_Utt_Mora' | 0.997 | position of the current breath group identity by mora (forward) | 201 | 44
'Utt_Len_Acc' | 0.997 | the number of accent phrases in this utterance | 60 | 45
'Utt_Len_Br' | 0.999 | the number of breath groups in this utterance | 30 | 46
'Utt_Len_Mora' | 1.000 | the number of moras in this utterance | 200 | 47
Summary
• Findings:
  • MGC benefits from deeper networks
  • The F0 sub-network can be shallow
  • Single-stream networks focus more on MGC than on F0
  • Investigation of the linguistic features' usefulness
    § (automatically inferred) F0-related tags are noisy for English
  • Experiments on the Japanese corpus
    § similar: the multi-stream network improves F0 modeling
    § different: F0-related tags are useful
• Any reason?
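HIGHWAY ARCHITECTURE
[Figure: three output-layer designs on top of a bottom network with hidden output h_t: a single linear layer that outputs MGC, F0, and BAP together; separate linear layers for MGC, F0, and BAP; and separate linear layers followed by an additional linear layer.]

Full $W_o$:
$\hat{o}_t = \begin{bmatrix} \hat{o}^{(MGC)}_t \\ \hat{o}^{(F0)}_t \\ \hat{o}^{(BAP)}_t \end{bmatrix} = \begin{bmatrix} W_{o,11} & W_{o,12} & W_{o,13} \\ W_{o,21} & W_{o,22} & W_{o,23} \\ W_{o,31} & W_{o,32} & W_{o,33} \end{bmatrix} \begin{bmatrix} h_{t,1} \\ h_{t,2} \\ h_{t,3} \end{bmatrix} = W_o h_t$

Block-diagonal $W_o$:
$\hat{o}_t = \begin{bmatrix} \hat{o}^{(MGC)}_t \\ \hat{o}^{(F0)}_t \\ \hat{o}^{(BAP)}_t \end{bmatrix} = \begin{bmatrix} W_{o,11} & 0 & 0 \\ 0 & W_{o,22} & 0 \\ 0 & 0 & W_{o,33} \end{bmatrix} \begin{bmatrix} h_{t,1} \\ h_{t,2} \\ h_{t,3} \end{bmatrix} = W_o h_t$

With an additional linear layer $W_s$:
$\hat{o}_t = \begin{bmatrix} \hat{\hat{o}}^{(MGC)}_t \\ \hat{o}^{(F0)}_t \\ \hat{o}^{(BAP)}_t \end{bmatrix} = \begin{bmatrix} W_{s,11} & W_{s,12} & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} \hat{o}^{(MGC)}_t \\ \hat{o}^{(F0)}_t \\ \hat{o}^{(BAP)}_t \end{bmatrix} = W_s W_o h_t$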
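[Figure: bottom network with separate linear layers for MGC, F0, and BAP, plus an additional linear layer that takes the MGC and F0 outputs and produces the final MGC output.]
$\hat{o}_t = \begin{bmatrix} \hat{\hat{o}}^{(MGC)}_t \\ \hat{o}^{(F0)}_t \\ \hat{o}^{(BAP)}_t \end{bmatrix} = \begin{bmatrix} W_{s,11} & W_{s,12} & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} \hat{o}^{(MGC)}_t \\ \hat{o}^{(F0)}_t \\ \hat{o}^{(BAP)}_t \end{bmatrix} = W_s W_o h_t$
$\hat{\hat{o}}^{(MGC)}_t = W_{s,11}\, \hat{o}^{(MGC)}_t + W_{s,12}\, \hat{o}^{(F0)}_t$
• Same argument as SAR vs RMDN
  • Dependency between means, not random variables
    $p(o^{(MGC)}, o^{(F0)}, o^{(BAP)}) = p(o^{(MGC)})\, p(o^{(F0)})\, p(o^{(BAP)})$
    The mean of $p(o^{(MGC)})$ is affected by the mean of $p(o^{(F0)})$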
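The additional linear layer can be read as a block-structured matrix that changes only the MGC output. The sketch below spells this out with tiny, made-up dimensions (2-dimensional MGC, 1-dimensional F0 and BAP); the function names and the numbers are illustrative assumptions.

```python
def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def apply_extra_linear_layer(o_mgc, o_f0, o_bap, W_s11, W_s12):
    """Apply W_s = [[W_s11, W_s12, 0], [0, 1, 0], [0, 0, 1]] block-wise.

    Only the MGC output is changed:
        o_mgc_new = W_s11 o_mgc + W_s12 o_f0
    F0 and BAP pass through unchanged (identity blocks).
    """
    from_mgc = matvec(W_s11, o_mgc)
    from_f0 = matvec(W_s12, o_f0)
    o_mgc_new = [a + b for a, b in zip(from_mgc, from_f0)]
    return o_mgc_new, list(o_f0), list(o_bap)


o_mgc, o_f0, o_bap = [1.0, 2.0], [3.0], [4.0]
W_s11 = [[1.0, 0.0], [0.0, 1.0]]  # identity on the MGC output
W_s12 = [[0.5], [0.0]]            # F0 shifts the first MGC dimension
print(apply_extra_linear_layer(o_mgc, o_f0, o_bap, W_s11, W_s12))
```

Because the layer acts on predicted values, it couples the predicted mean of MGC to the predicted mean of F0 while the output distribution itself stays factorized.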
• Training the dependency model?
[Figure: the network with the additional linear layer, where observed (natural) data is fed into the dependency during training.]
$p(X, Y; \Theta) = p(X \mid Y; \Theta_1)\, p(Y; \Theta_2)$
$\Theta^{*} = \arg\max_{\Theta} \prod_{\{x_n, y_n\} \in D} p(X = x_n, Y = y_n)$
$\phantom{\Theta^{*}} = \arg\max_{\Theta} \prod_{\{x_n, y_n\} \in D} p(X = x_n \mid Y = y_n)\, p(Y = y_n)$
• Generate from the dependency model?
Ideal solution:
$p(X, Y; \Theta^{*}) = p(X \mid Y; \Theta^{*}_1)\, p(Y; \Theta^{*}_2)$
$\{x^{*}, y^{*}\} = \arg\max_{\{\hat{x}, \hat{y}\}} p(X = \hat{x}, Y = \hat{y})$
$\phantom{\{x^{*}, y^{*}\}} = \arg\max_{\{\hat{x}, \hat{y}\}} p(X = \hat{x} \mid Y = \hat{y})\, p(Y = \hat{y})$
Approximation:
$y^{*} = \arg\max_{\hat{y}} p(Y = \hat{y})$
$x^{*} = \arg\max_{\hat{x}} p(X = \hat{x} \mid Y = y^{*})$
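The gap between the ideal solution and the approximation can be seen on a toy discrete example. The probability tables below are invented for illustration, and the function names are hypothetical; the point is only that maximizing p(Y) first and p(X | Y = y*) second need not reach the joint maximum.

```python
def joint_argmax(p_y, p_x_given_y):
    """Ideal solution: search all (x, y) pairs for the largest p(x | y) p(y)."""
    best_score, best_x, best_y = -1.0, None, None
    for y, prob_y in p_y.items():
        for x, prob_x in p_x_given_y[y].items():
            score = prob_x * prob_y
            if score > best_score:
                best_score, best_x, best_y = score, x, y
    return best_x, best_y


def sequential_argmax(p_y, p_x_given_y):
    """Approximation: y* = argmax p(y), then x* = argmax p(x | y*)."""
    y_star = max(p_y, key=p_y.get)
    conditional = p_x_given_y[y_star]
    x_star = max(conditional, key=conditional.get)
    return x_star, y_star


# Invented toy distribution: Y in {a, b}, X in {u, v}.
p_y = {"a": 0.6, "b": 0.4}
p_x_given_y = {
    "a": {"u": 0.55, "v": 0.45},
    "b": {"u": 0.90, "v": 0.10},
}
print(joint_argmax(p_y, p_x_given_y))
print(sequential_argmax(p_y, p_x_given_y))
```

Here the joint search prefers y = b (0.9 x 0.4 = 0.36) while the two-step approximation commits to y = a first (0.55 x 0.6 = 0.33), so the two answers differ.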
AM & Vocoder

COMPARE NEW MODELS/VOCODERS
Overview
• Still the SPSS framework
• Test on both "vocoder" and acoustic models
[Figure: linguistic features -> acoustic models (SAR, DAR, RNN, GAN) -> F0 and MGC -> waveform generators (labels: WaveNet, PML, phase recovery, minimum phase, WORLD). Systems: SAR-Wa, SAR-Pm, SAR-Pr, SAR-Wo, SGA-Wo, RGA-Wo, RNN-Wo.]
• Abs*: copy-synthesis
• Without MLPG
• Without formant enhancement
WAVENET
• 40 blocks, dilation [2, 4, 8, 16, 32, 64, 128, 256, 512, 2, 4, …]
[Figure: WaveNet structure. The time-shifted (fed-back) waveform passes through a linear layer / 1-D CNN and a stack of blocks (Block 1, Block 2, …, Block 40). Each block contains a dilated 1-D CNN, a tanh/sigmoid gated activation (element-wise product), and 1-D CNNs with additive connections. The output goes through 1-D CNNs and a softmax to predict the waveform.]
• Condition: linguistic features / MGC, and F0, passed through a sub-network (feedforward and Bi-LSTM layers)
• Up-sampling: from the frame-level time resolution, 1/(5 ms) = 200 Hz, to the waveform time resolution, 16 kHz
[Audio samples: no sub-network vs Bi-LSTM sub-network; natural, trained on natural MGC/F0, generated on synthetic MGC & F0.]
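The dilation pattern of the 40 blocks can be generated by cycling through the nine values listed on the slide. The sketch below also estimates the receptive field; that part assumes a kernel size of 2 for every dilated convolution, which the slide does not state, and the function names are hypothetical.

```python
def dilation_schedule(num_blocks=40,
                      cycle=(2, 4, 8, 16, 32, 64, 128, 256, 512)):
    """Repeat the dilation cycle from the slide until num_blocks is reached."""
    return [cycle[i % len(cycle)] for i in range(num_blocks)]


def receptive_field(dilations, kernel_size=2):
    """Receptive field, in samples, of stacked dilated causal convolutions.

    kernel_size=2 is an assumption, not a value from the slides.
    """
    return 1 + (kernel_size - 1) * sum(dilations)


schedule = dilation_schedule()
print(schedule[:11])
samples = receptive_field(schedule)
print(samples, "samples, about", round(1000.0 * samples / 16000), "ms at 16 kHz")
```

Under these assumptions the stack sees a few thousand past samples, i.e. a context on the order of a quarter of a second at 16 kHz.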
WAVENET
Generation method
• Random sampling
  $\hat{o}_t \sim P(o_t \mid \hat{o}_{t-R:t-1}, \hat{a}_t)$
• One-best (in voiced regions)
  $\hat{o}_t = \arg\max_{o_t} p(o_t \mid \hat{o}_{t-R:t-1}, \hat{a}_t)$
[Audio samples: random sampling, one-best, and natural, for WaveNet conditioned on generated MGC & F0 and for WaveNet conditioned on linguistic features & F0.]
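The two generation methods differ only in how one waveform level is picked from the softmax output at each time step. The sketch below assumes that the softmax probabilities for one step are given as a list and that a per-step voiced flag is available; reading "one-best (in voiced regions)" as one-best for voiced steps and random sampling elsewhere is an interpretation, and the function names are hypothetical.

```python
import random


def random_sample(probs, rng):
    """Draw one level index from the categorical distribution probs."""
    r = rng.random()
    cumulative = 0.0
    for index, p in enumerate(probs):
        cumulative += p
        if r < cumulative:
            return index
    return len(probs) - 1  # guard against rounding in the cumulative sum


def one_best(probs):
    """Pick the most probable level index."""
    return max(range(len(probs)), key=probs.__getitem__)


def generate_level(probs, voiced, rng):
    """One-best in voiced regions, random sampling elsewhere (assumed reading)."""
    return one_best(probs) if voiced else random_sample(probs, rng)


rng = random.Random(0)
probs = [0.1, 0.7, 0.2]
print(generate_level(probs, voiced=True, rng=rng))
print(generate_level(probs, voiced=False, rng=rng))
```

In a full generator the chosen level is fed back as part of the history for the next step, so the choice of method affects the whole remaining waveform.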
WAVENET
Analysis
• Is the WaveNet vocoder strictly speaker/language dependent?
  • But it may not work for male speakers
[Figure: the WaveNet vocoder in training and in generation, with the F009 (Japanese data) and Nancy (English data) voices, conditioned on natural MGC & F0.]
Analysis
• Embedding of waveform (conditioned on MGC/F0)
• Embedding of waveform (conditioned on linguistic features/F0)
[Figure: in WaveNet (Block 1, Block 2, …, Block 40, then 1-D CNN and softmax), the fed-back waveform enters as a waveform-level ID (10 bits) through a linear layer; the weights of this linear layer are plotted as a 2-D embedding.]
Analysis
• Variance of the output from each block (conditioned on MGC/F0)
[Figure: range of the output value from each block. Feature value (-4 to 4) against block ID (0 to 40), showing the 98th percentile (largest value), the 2nd percentile (smallest value), the median, and the standard deviation.]