Fundamental Frequency Modeling for Neural-Network-based
Statistical Parametric Speech Synthesis
Xin Wang, ID: 20151706, 2018-07-13
1
Contact: wangxin@nii.ac.jp. We welcome critical comments, suggestions, and discussion.
CONTENTS
2
q Introduction
  • Background
  • Topic
  • Thesis outline
q Issues and methods
q Summary
New results / updated explanation
(c.f. the HTS slides, by the HTS Working Group)
Text-to-speech (TTS)
q TTS pipeline [1,2]
q Synthesizer
INTRODUCTION
3
[Figure: TTS pipeline. Text -> text analyzer -> linguistic features -> speech synthesizer -> speech. The statistical parametric speech synthesizer (SPSS) [3,4] consists of acoustic models, which predict the acoustic features (fundamental frequency (F0) and spectral features), and a vocoder.]
[1] Taylor, P. (2009). Text-to-Speech Synthesis.
[2] Dutoit, T. (1997). An Introduction to Text-to-Speech Synthesis.
[3] Tokuda, K., et al. (2013). Speech synthesis based on hidden Markov models. Proceedings of the IEEE, 101(5), 1234-1252.
[4] Zen, H., et al. (2009). Statistical parametric speech synthesis. Speech Communication, 51, 1039-1064.
Text-to-speech (TTS)
q Neural-network-based acoustic models [5,6,7]
INTRODUCTION
4
[Figure: neural networks map linguistic features to acoustic features (spectral features and F0). The linguistic features of the example Japanese utterance 次は新金岡、新金岡です。 ("Next is Shin-Kanaoka, Shin-Kanaoka.") are structured on four tiers: phrase, mora (e.g., ツ, ギ, ワ), phone (e.g., ts, u, g, i, w), and frame.]
[5] H. Zen, A. Senior, and M. Schuster. Statistical parametric speech synthesis using deep neural networks. In Proc. ICASSP, pages 7962-7966, 2013.
[6] Z. H. Ling, et al. Deep learning for acoustic modeling in parametric speech generation: A systematic review of existing techniques and future trends. IEEE Signal Processing Magazine, 32(3):35-52, 2015.
[7] Y. Fan, Y. Qian, F. Xie, and F. K. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, pages 1964-1968, 2014.
Topic
q Neural F0 modeling for TTS
q Why F0?
INTRODUCTION
5
[Figure: as on the previous slide, neural networks map the tier-structured linguistic features (phrase, mora, phone, frame) of the utterance 次は新金岡、新金岡です。 to spectral features and F0.]
F0 realizes prosodic prominence [8]: the same sentence answers different questions depending on which word carries the pitch accent, e.g.:
Speaker A: Who made the marmalade?
Speaker B: Marianna made the marmalade.
Speaker A: Bob made the marmalade.
Speaker B: (No,) Marianna made the marmalade.
[8] Nanette Veilleux, et al. 6.911 Transcribing Prosodic Structure of Spoken Utterances with ToBI. January IAP 2006. https://ocw.mit.edu. License: Creative Commons BY-NC-SA.
Topic
q Issues to be addressed
INTRODUCTION
6
[Figure: a neural network maps linguistic features x_1, ..., x_T to F0 \hat{o}_1, ..., \hat{o}_T and spectral features \hat{s}_1, ..., \hat{s}_T frame by frame.]

p(o_{1:T}, s_{1:T} | x_{1:T}; \Theta) = \prod_{t=1}^{T} \mathcal{N}([o_t, s_t]; \mathrm{Network}_\Theta(x_{1:T}, t), \sigma I)

F0 features [9]; linguistic features [10]
Issue 1: joint modeling?
Issue 2: temporal dependency?
Issue 3: efficient enough?
[9] M. S. Ribeiro. Suprasegmental representations for the modeling of fundamental frequency in statistical parametric speech synthesis. PhD thesis, The University of Edinburgh, 2018.
[10] J. Hirschberg. Pitch accent in context: predicting intonational prominence from text. Artificial Intelligence, 63(1):305-340, 1993.
Thesis outline
q Conventional approaches [7] (Table 3.1 in the thesis)
INTRODUCTION
7
[Figure: a recurrent neural network (RNN) with recurrent layers and a feedforward layer maps the frame-level linguistic features x_1, ..., x_T (T frames, i.e., time steps) to the F0 contour and spectral features.]
[7] Y. Fan, Y. Qian, F. Xie, and F. K. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, pages 1964-1968, 2014.
Thesis outline
q Three issues
INTRODUCTION
8
Issue 1: Joint modeling?
Issue 2: Temporal dependency?
Issue 3: Is frame-by-frame processing efficient?
[Figure: the RNN baseline over T frames, annotated with the three issues: joint prediction of F0 and spectral features (issue 1), temporal dependency across the frames (issue 2), and frame-by-frame processing of the tier-structured inputs (issue 3).]
Thesis outline (Chapter 4)
q On issue 1: Joint modeling of F0 and spectral features?
• Investigation using highway networks (novel analysis)
  o Spectral features are prioritized
  o Different input/hidden features for F0 and spectral features
✕ Sub-optimal for F0 modeling
ü Only F0 as target
INTRODUCTION
9
[Figure: a network jointly predicting the F0 contour and spectral features from x_1, ..., x_T.]
Thesis outline (Chapter 5-6)
q On issue 2: Temporal dependency?
• Evidence from random sampling
✕ RNN ignores temporal dependency
INTRODUCTION
10
[Figure: RNN mapping x_1, ..., x_T to \hat{o}_1, ..., \hat{o}_T.]
Thesis outline (Chapter 5-6)
q On issue 2: Temporal dependency?
ü Shallow autoregressive model (SAR): short-term dependency
ü Deep autoregressive model (DAR): longer dependency, best results, and random sampling
INTRODUCTION
11
[Figure: DAR with autoregressive links among the outputs \hat{o}_1, ..., \hat{o}_T; novel models & interpretations.]
Thesis outline (Chapter 7)
q On issue 3: Frame-by-frame processing?
✕ Inefficient
INTRODUCTION
12
[Figure: the frame-by-frame model over the tier-structured linguistic features (mora, phone, frame).]
Thesis outline (Chapter 7)
q On issue 3: Frame-by-frame processing?
ü Two-stage model: efficient, interpretable, and multi-level (novel model)
Stage 1: F0 contour modeling (unsupervised learning)
Stage 2: Linguistic linking (fast supervised learning)
INTRODUCTION
13
[Figure: stage 2 maps unit-level linguistic features x_{1p}, x_{2p} (one per unit, e.g., unit 1, unit 2) to codes, and stage 1 decodes the codes into the frame-level contour \hat{o}_1, ..., \hat{o}_T.]
Thesis outline
INTRODUCTION
14
Chapter 1-3: Preliminary
Chapter 4, Issue 1 (Joint modeling for F0?): Highway networks, tools for analysis
Chapter 5, Issue 2 (Temporal dependency?): SAR & extended SAR
Chapter 6, Issue 2: DAR & techniques
Chapter 7, Issue 3 (Frame-by-frame?): Two-stage F0 model (stage 1: frame-by-frame; stage 2: unit-by-unit)
Chapter 8: Summary
Thesis outline
q Updated results in this presentation
INTRODUCTION
15
Chapter 1-3: Preliminary
Chapter 4, Issue 1 (Joint modeling for F0?)
Chapter 5, Issue 2 (Temporal dependency?): SAR + log area ratio; additional listening tests
Chapter 6, Issue 2: listening test
Chapter 7, Issue 3 (Frame-by-frame?)
Chapter 8: Summary
CONTENTS
16
q Introduction
q Issue 1: joint modeling of F0 and spectral features
q Issues and methods
q Summary
Motivation
q Common approach [5,7,11]
• Joint (multi-task) learning
  ? Beneficial for both targets
  ? Sharing hidden features
q Empirical results against joint learning [5,12]
True or not? More evidence?
ISSUE 1: JOINT LEARNING FOR F0?
17
[Figure: a single neural network maps linguistic features to both spectral features and F0.]
[5] H. Zen, A. Senior, and M. Schuster. Statistical parametric speech synthesis using deep neural networks. In Proc. ICASSP, pages 7962-7966, 2013.
[7] Y. Fan, Y. Qian, F. Xie, and F. K. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, pages 1964-1968, 2014.
[11] H. Zen and A. Senior. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis. In Proc. ICASSP, pages 3844-3848, 2014.
[12] S. Kang and H. Meng. Statistical parametric speech synthesis using weighted multi-distribution deep belief network. In Proc. Interspeech, pages 1959-1963, 2014.
Method
• Joint (multi-task) learning
  ? Beneficial for both targets
  ? Sharing hidden features
• Model and tools:
  o Highway network [13]
  o Histogram & sensitivity tools
ISSUE 1: JOINT LEARNING FOR F0?
18
[Figure: highway networks [13] map linguistic features to spectral features and F0.]
[13] R. K. Srivastava, K. Greff, and J. Schmidhuber. Highway networks. In Proc. Deep Learning Workshop, 2015.
Method
q Definition of highway network
ISSUE 1: JOINT LEARNING FOR F0?
19
Feedforward network (parameters W^i, b^i, W^o, b^o):
  h_t = \phi(W^i x_t + b^i)
  \hat{o}_t = W^o h_t + b^o
Highway network, with highway gate vector g_t:
  h_t = \phi(W^i x_t + b^i)
  g_t = \mathrm{sigmoid}(W^g x_t + b^g)
  \hat{o}_t = g_t \odot h_t + (1 - g_t) \odot x_t
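To make the definition concrete, here is a minimal PyTorch sketch of one highway block following the equations above; the tanh transformation, layer width, and stack depth are illustrative choices, not the thesis configuration.

```python
import torch
import torch.nn as nn

class HighwayLayer(nn.Module):
    """One highway block: o = g * h + (1 - g) * x (element-wise)."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(dim, dim)       # W^g, b^g
        self.transform = nn.Linear(dim, dim)  # W^i, b^i

    def forward(self, x):
        g = torch.sigmoid(self.gate(x))       # highway gate vector g_t
        h = torch.tanh(self.transform(x))     # non-linear transformation h_t
        return g * h + (1.0 - g) * x          # mix transformed input and skipped input

# usage: a stack of highway blocks over frame-level linguistic features
x = torch.randn(100, 256)                     # 100 frames, 256-dim features (toy data)
net = nn.Sequential(*[HighwayLayer(256) for _ in range(7)])
o = net(x)
```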
Method
q Highway network for acoustic modeling
ISSUE 1: JOINT LEARNING FOR F0?
20
Multi-stream: separate stacks of highway blocks, each topped by a linear layer, for MGC, BAP, and F0, all fed by the linguistic features.
Single-stream: one shared stack of highway blocks with a linear output layer that predicts MGC, BAP, and F0 jointly.
MGC: Mel-generalized cepstral coefficients [14]
BAP: band aperiodicity coefficients [15,16]
[14] K. Tokuda, T. Kobayashi, T. Masuko, and S. Imai. Mel-generalized cepstral analysis: a unified approach. In Proc. ICSLP, pages 1043-1046, 1994.
[15] H. Kawahara, J. Estill, and O. Fujimura. Aperiodicity extraction and control using mixed mode excitation and group delay manipulation for a high quality speech analysis, modification and synthesis system STRAIGHT. In Second International Workshop on Models and Analysis of Vocal Emissions for Biomedical Applications, 2001.
[16] H. Zen and T. Toda. An overview of Nitech HMM-based speech synthesis system for Blizzard Challenge 2005. In Proc. Interspeech, pages 93-96, 2005.
Method
q Analysis tools
1. Histogram of the gate vectors {g_1, ..., g_T}
2. Sensitivity of g to different linguistic features (Sec. 4.3.2)
ISSUE 1: JOINT LEARNING FOR F0?
21
In each highway block, \hat{o} = g \odot h + (1 - g) \odot x, so
  g \approx 1: non-linear transformation
  g \approx 0: no transformation (the input is passed through)
[Figure: histograms of the gate-vector values, pooled over all frames and dimensions, for the highway blocks of a trained network.]
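As a rough sketch of analysis tool 1 (assuming the HighwayLayer stack from the earlier sketch), the gate activations can be pooled over a dataset and histogrammed on [0, 1]:

```python
import numpy as np
import torch

def gate_histogram(model, inputs, n_bins=10):
    """Pool highway-gate activations over all frames and dimensions and
    histogram them on [0, 1]; mass near 0 means 'no transformation',
    mass near 1 means 'non-linear transformation'."""
    gates = []
    with torch.no_grad():
        for x in inputs:                       # x: (frames, dim), one utterance
            for layer in model:                # the HighwayLayer stack
                g = torch.sigmoid(layer.gate(x))   # recompute this block's gates
                gates.append(g.flatten().numpy())
                x = layer(x)                   # propagate to the next block
    counts, edges = np.histogram(np.concatenate(gates),
                                 bins=n_bins, range=(0.0, 1.0))
    return counts, edges
```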
Experiments
q Configuration
• Data: English, 16 hours
• Features: MGC, BAP, F0 (interpolated F0 + voicing (U/V) flag)
• Metrics: root mean square error (RMSE), correlation coefficient (CORR)
q Three models:
• Single-stream feedforward network
• Single-stream highway network
• Multi-stream highway network
q Two tests:
1. Fixed layer width, varying network depth
2. Fixed network depth, varying layer width
ISSUE 1: JOINT LEARNING FOR F0?
22
Experiments
q Objective results: increasing network depth
v Network depth: number of tanh-based transformation layers
• Single-stream network prioritizes MGC?
ISSUE 1: JOINT LEARNING FOR F0?
23
[Figure: MGC RMSE (1.02-1.14), F0 RMSE (43-47 Hz), and F0 correlation (0.66-0.74) versus network depth (2, 4, 8, 14, 20, 40) for the single-stream feedforward, single-stream highway, and multi-stream highway networks.]
Experiments
q Objective results: increasing width (depth = 14)
• Single-stream network prioritizes MGC?
v MS1: [MGC 256] - [F0 256]
v MS2: [MGC 382] - [F0 256]
v MS3: [MGC 512] - [F0 382]
v MS4: [MGC 768] - [F0 512]
ISSUE 1: JOINT LEARNING FOR F0?
24
[Figure: MGC RMSE, F0 RMSE (Hz), and F0 correlation versus number of network weights (3.0e6 to 2.5e7), for single-stream feedforward and single-stream highway networks with layer widths 382-1024 and for the multi-stream configurations MS1-MS4.]
Experiments
q Histogram of g
• Multi-stream highway network (depth 14)
ISSUE 1: JOINT LEARNING FOR F0?
25
g \approx 1: non-linear transformation
g \approx 0: no transformation
[Figure: histograms of the gate values in highway blocks b.1-b.7 of each stream (MGC, BAP, F0), computed on the test set.]
Experiments
q Histogram of g
• Multi-stream highway network (depth 14, 7 blocks)
• Different hidden features for MGC and F0
ISSUE 1: JOINT LEARNING FOR F0?
26
g \approx 1: non-linear transformation
g \approx 0: no transformation
[Figure: histograms of the gate values in highway blocks b.1-b.7 of the MGC, BAP, and F0 streams, computed on the test set; the MGC and F0 streams show clearly different gate behavior.]
Experiments
q Histogram of g
• Single-stream highway network (depth 14, 7 blocks)
ISSUE 1: JOINT LEARNING FOR F0?
27
[Figure: histograms of the gate values in highway blocks b.1-b.7 of the single-stream network, which predicts MGC, BAP, and F0 jointly; test set.]
• Similar to the MGC sub-net of the multi-stream highway network
• Single-stream network prioritizes MGC?
Summary
q Answer to issue 1
ISSUE 1: JOINT LEARNING FOR F0?
28
Joint (multi-task) learning?
? Beneficial for both F0 and spectral features
? They share hidden features
Negative evidence:
• Joint learning (single-stream network) prioritizes spectral features
• They use different hidden features
• They use different input features (Sec. 4.4.3)
• Results hold on English and Japanese corpora (Sec. 4.5)
Joint learning is NOT for the sake of F0 modeling!
q Only F0 modeling in the following chapters
q Is F0 useful for MGC modeling? How to do it? (slides appendix)
CONTENTS
29
q Introduction
q Issue 1: joint modeling of F0 and spectral features
q Issue 2: temporal dependency modeling of F0 contours
q Issues and methods
q Summary
Motivation
q Baseline RNN model [17]
v T: number of frames (time steps)
ISSUE 2: TEMPORAL DEPENDENCY?
30
[Figure: a recurrent neural network (RNN) maps the linguistic features x_{1:T} = {x_1, ..., x_T} to the F0 contour \hat{o}_{1:T} = {\hat{o}_1, ..., \hat{o}_T}.]
[17] R. Fernandez, et al. Prosody contour prediction with long short-term memory, bi-directional, deep recurrent neural networks. In Proc. Interspeech, pages 2268-2272, 2014.
Motivation
q Baseline RNN model
• Learns the mapping between x_{1:T} = {x_1, ..., x_T} and \hat{o}_{1:T} = {\hat{o}_1, ..., \hat{o}_T}: \hat{o}_t = H^{(RNN)}_\Theta(x_{1:T}, t)
• But what about the correlation between o_{t_1} and o_{t_2} for t_2 \neq t_1?
ISSUE 2: TEMPORAL DEPENDENCY?
31
[Figure: RNN mapping x_1, ..., x_5 to \hat{o}_1, ..., \hat{o}_5.]
Motivation
q Baseline RNN model: recurrent mixture density network (RMDN) [18,19,20]
ISSUE 2: TEMPORAL DEPENDENCY?
32
Computational part: \mathcal{M}_t = \{\mu_t\}, where \mu_t = H^{(RNN)}_\Theta(x_{1:T}, t)
Probabilistic part: p(o_{1:T} | x_{1:T}; \Theta) = \prod_{t=1}^{T} p(o_t | x_{1:T}; \Theta) = \prod_{t=1}^{T} \mathcal{N}(o_t; \mu_t, \sigma I)
Mean-based generation: \hat{o}_t = \arg\max_{o_t} p(o_t | x_{1:T}; \Theta^*) = \mu_t
[Figure: the RNN computes \mathcal{M}_1, ..., \mathcal{M}_5 from x_1, ..., x_5; each \mathcal{M}_t parameterizes the distribution of o_t.]
[18] C. M. Bishop. Mixture Density Networks. Technical report, Aston University, 2004.
[19] C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995.
[20] M. Schuster. Better generative models for sequential data problems: Bidirectional recurrent mixture density networks. In Proc. NIPS, pages 589-595, 1999.
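To make the two generation modes concrete, here is a small numpy sketch for the single-Gaussian case; all arrays and values below are toy assumptions, not model outputs.

```python
import numpy as np

def generate_f0(mu, sigma, mode="mean", rng=None):
    """Generate an F0 contour from a single-Gaussian RMDN output.
    mu, sigma: per-frame mean and std predicted by the network, shape (T,).
    'mean' reproduces mean-based generation (o_t = mu_t); 'sample' draws
    each frame independently, which ignores temporal dependency."""
    if mode == "mean":
        return mu.copy()
    rng = rng or np.random.default_rng()
    return mu + sigma * rng.standard_normal(mu.shape)

# toy usage with a made-up contour
mu = np.linspace(200.0, 150.0, 300)              # hypothetical predicted means (Hz)
contour_mean = generate_f0(mu, 10.0, "mean")     # smooth trajectory of means
contour_samp = generate_f0(mu, 10.0, "sample")   # jagged: frames are independent
```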
Motivation
q Initial answer
• Evidence from random sampling
Temporal dependency is ignored by RNN/RMDN
ISSUE 2: TEMPORAL DEPENDENCY?
33
p(o_{1:T} | x_{1:T}; \Theta) = \prod_{t=1}^{T} p(o_t | x_{1:T}; \Theta) = \prod_{t=1}^{T} \mathcal{N}(o_t; \mu_t, \sigma I)
[Figure: F0 (Hz, 160-560) over frames 100-600 of utterance ATR Ximera F009 AOZORAR 03372 T01: the natural F0, the RMDN mean-based output, and an RMDN random sample; the sample fluctuates from frame to frame because the frames are modeled as independent.]
Motivation
q Initial answer: temporal dependency is ignored by RNN/RMDN
q Better models? Autoregressive (AR) models [23]
• Shallow AR models (SAR) & extended SAR
• Deep AR models (DAR)
ISSUE 2: TEMPORAL DEPENDENCY?
34
RNN/RMDN: p(o_{1:T}) = \prod_{t=1}^{T} p(o_t)
AR models: p(o_{1:T}) = \prod_{t=1}^{T} p(o_t | o_{1:t-1})
[23] B. Frey. Graphical Models for Machine Learning and Digital Communication. A Bradford Book, 1998.
CONTENTS
35
q Introduction
q Issue 1: joint modeling of F0 and spectral features
q Issue 2: temporal dependency modeling of F0 contours
  § SAR and extension
  § DAR
q Issues and methods
q Summary
SAR
q Definition
ISSUE 2: TEMPORAL DEPENDENCY?
36
p(o_{1:T} | x_{1:T}) = \prod_{t=1}^{T} p(o_t | o_{t-K:t-1}, x_{1:T})
[Figure: the network computes \mathcal{M}_1, ..., \mathcal{M}_5 from x_1, ..., x_5; each o_t additionally depends on the previous K outputs.]
SAR
q Definition
• Time-invariant: the same filter coefficients at every frame
• K: hyper-parameter (K = 2 in the figure)
ISSUE 2: TEMPORAL DEPENDENCY?
37
p(o_{1:T} | x_{1:T}; \Theta, \Phi) = \prod_{t=1}^{T} p(o_t | o_{t-K:t-1}, x_{1:T}; \Theta, \Phi) = \prod_{t=1}^{T} \mathcal{N}(o_t; \mu_t + f_\Phi(o_{t-K:t-1}), \Sigma_t)
f_\Phi(o_{t-K:t-1}) = \sum_{k=1}^{K} a_k \odot o_{t-k} + b, with trainable parameters \Phi = \{a_1, ..., a_K, b\}
[Figure: AR links with coefficients a_1, a_2 among o_1, ..., o_5 for K = 2.]
SAR
q Interpretation 1 (Sec. 5.3)
p(o_{1:T} | x_{1:T}; \Theta, \Phi) = \prod_{t=1}^{T} p(o_t | o_{t-K:t-1}, x_{1:T}; \Theta, \Phi) = \prod_{t=1}^{T} \mathcal{N}(o_t; \mu_t + f_\Phi(o_{t-K:t-1}), \Sigma_t)
q Interpretation 2 (Sec. 5.3)
p(o_{1:T} | x_{1:T}; \Theta, \Phi) = \prod_{t=1}^{T} p_c(c_t | x_{1:T}; \Theta), where c_t = o_t - \sum_{k=1}^{K} a_k \odot o_{t-k}
ISSUE 2: TEMPORAL DEPENDENCY?
38
[Figure: a filter maps o_1, ..., o_5 to the residual signal c_1, ..., c_5.]
SAR
q Interpretation (Sec. 5.3)
• SAR = linear transformation + RMDN (see the sketch below)
ISSUE 2: TEMPORAL DEPENDENCY?
39
Training: filter the target, c_t = o_t - \sum_{k=1}^{K} a_k \odot o_{t-k}, then train the RMDN on \prod_{t=1}^{T} p(c_t | x_{1:T}; \Theta)
Generation: obtain \hat{c}_{1:T} from the RMDN given x_{1:T}, then invert the filter: \hat{o}_t = \hat{c}_t + \sum_{k=1}^{K} a_k \odot \hat{o}_{t-k}
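A numpy sketch of this interpretation with illustrative (untrained) K = 2 coefficients: the analysis filter produces the residual c for RMDN training, and the synthesis filter inverts it exactly at generation time.

```python
import numpy as np

def sar_analysis(o, a):
    """Training side: c_t = o_t - sum_k a_k * o_{t-k}, with K = len(a).
    The residual c is then modeled by the RMDN."""
    K, T = len(a), len(o)
    c = o.copy()
    for t in range(T):
        for k in range(1, K + 1):
            if t - k >= 0:
                c[t] -= a[k - 1] * o[t - k]
    return c

def sar_synthesis(c_hat, a):
    """Generation side: o_t = c_t + sum_k a_k * o_{t-k} (inverse filter)."""
    K, T = len(a), len(c_hat)
    o = np.zeros(T)
    for t in range(T):
        o[t] = c_hat[t]
        for k in range(1, K + 1):
            if t - k >= 0:
                o[t] += a[k - 1] * o[t - k]
    return o

# round trip on toy data: synthesis exactly inverts analysis
a = np.array([0.9, -0.2])                  # illustrative K = 2 coefficients
o = np.sin(np.linspace(0.0, 6.0, 50)) + 2.0
assert np.allclose(sar_synthesis(sar_analysis(o, a), a), o)
```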
Extended SAR (eSAR)
q Motivation
• SAR -> non-linear transformation + RMDN?
• Yes, if f_\Phi is invertible and 'simple'
ISSUE 2: TEMPORAL DEPENDENCY?
40
Training: c_{1:T} = f_\Phi(o_{1:T}), then train the RMDN on \prod_{t=1}^{T} p(c_t | x_{1:T}; \Theta)
Generation: \hat{o}_{1:T} = f_\Phi^{-1}(\hat{c}_{1:T})
Change of variables: p_o(o_{1:T} | x_{1:T}; \Theta, \Phi) = p_c(c_{1:T} = f_\Phi(o_{1:T}) | x_{1:T}; \Theta) \left| \det \frac{\partial f_\Phi(o_{1:T})}{\partial o_{1:T}} \right|
Extended SAR (eSAR)
q Definition
• Volume-preserving transformation [24]
ISSUE 2: TEMPORAL DEPENDENCY?
41
p_o(o_{1:T} | x_{1:T}; \Theta, \Phi) = p_c(c_{1:T} = f_\Phi(o_{1:T}) | x_{1:T}; \Theta) \left| \det \frac{\partial f_\Phi(o_{1:T})}{\partial o_{1:T}} \right|
SAR:  c_t = o_t - \sum_{k=1}^{K} a_k \odot o_{t-k};  \hat{o}_t = \hat{c}_t + \sum_{k=1}^{K} a_k \odot \hat{o}_{t-k};  \det \frac{\partial f_\Phi(o_{1:T})}{\partial o_{1:T}} = 1
eSAR: c_t = o_t - \mathrm{RNN}(o_{1:t-1}, t);  \hat{o}_t = \hat{c}_t + \mathrm{RNN}(\hat{o}_{1:t-1}, t);  \det \frac{\partial f_\Phi(o_{1:T})}{\partial o_{1:T}} = 1
[24] J. M. Tomczak and M. Welling. Improving variational auto-encoders using convex combination linear inverse autoregressive flow. In Proc. Benelearn, pages 162-194, 2017.
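The same analysis/synthesis pattern, with a causal non-linear shift standing in for the eSAR RNN (a toy stand-in, not the thesis network). Because each c_t subtracts only a function of past samples, the Jacobian of f is triangular with a unit diagonal, so its determinant is 1 (volume-preserving).

```python
import numpy as np

def esar_analysis(o, shift_fn):
    """eSAR training side: c_t = o_t - shift_fn(o_{1:t-1}).
    shift_fn stands in for the causal RNN; any causal function works."""
    T = len(o)
    c = np.empty(T)
    for t in range(T):
        c[t] = o[t] - shift_fn(o[:t])
    return c

def esar_synthesis(c_hat, shift_fn):
    """Generation side: o_t = c_t + shift_fn(o_{1:t-1}), computed recursively."""
    T = len(c_hat)
    o = np.empty(T)
    for t in range(T):
        o[t] = c_hat[t] + shift_fn(o[:t])
    return o

# toy causal 'RNN': mean of the last two past samples (0 at the first frame)
shift = lambda past: past[-2:].mean() if len(past) else 0.0
o = np.random.default_rng(0).normal(size=40)
assert np.allclose(esar_synthesis(esar_analysis(o, shift), shift), o)
```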
Extended SAR (eSAR)
q Implementation (training)
ISSUE 2: TEMPORAL DEPENDENCY?
42
Transformation (normalizing flow): c_t = o_t - \mu_t, with \mu_t = \mathrm{RNN}(o_{1:t-1}, t)
Modeling: p_c(c_{1:T} | x_{1:T}; \Theta), so that p_o(o_{1:T} | x_{1:T}; \Theta, \Phi) = p_c(c_{1:T} = f_\Phi(o_{1:T}) | x_{1:T}; \Theta)
[Figure: the flow RNN predicts \mu_2, ..., \mu_5 from past o (\mu_1 = 0); the residuals c_1, ..., c_5 are modeled by the RMDN given x_1, ..., x_5.]
Extended SAR (eSAR)
q Implementation (generation)
ISSUE 2: TEMPORAL DEPENDENCY?
43
Generation: draw \hat{c}_{1:T} \sim p_c(c_{1:T} | x_{1:T}; \Theta)
De-transformation: \hat{o}_t = \hat{c}_t + \hat{\mu}_t, with \hat{\mu}_t = \mathrm{RNN}(\hat{o}_{1:t-1}, t)
[Figure: \hat{c}_1, ..., \hat{c}_5 from the RMDN are converted into \hat{o}_1, ..., \hat{o}_5 through the flow RNN (\hat{\mu}_2, ..., \hat{\mu}_5).]
Summary of SAR
q Theoretically appealing: links signal processing and machine learning
• Signal processing: real-valued poles, log area ratio (Sec. 5.3.2), segment-variant filters (slides)
• Machine learning: volume-preserving and non-volume-preserving flows (slides), ...
q Performance for F0 modeling (model details later)
• Preference test: eSAR 55.70% vs. SAR 44.30%
q Performance for MGC modeling (SAR, eSAR, SAR + stable filters)
• Better than RNN/RMDN [25,26]
ISSUE 2: TEMPORAL DEPENDENCY?
44
[25] X. Wang, J. Lorenzo-Trueba, S. Takaki, L. Juvela, and J. Yamagishi. A comparison of recent waveform generation and acoustic modeling methods for neural-network-based speech synthesis. In Proc. ICASSP, pages 4804-4808, 2018.
[26] X. Wang, S. Takaki, and J. Yamagishi. An autoregressive recurrent mixture density network for parametric speech synthesis. In Proc. ICASSP, pages 4895-4899, 2017.
CONTENTS
45
q Introduction
q Issue 1: joint modeling of F0 and spectral features
q Issue 2: temporal dependency modeling of F0 contours
  § SAR and extension
  § DAR
q Issues and methods
q Summary
DAR
q Motivation
• Are SAR and eSAR sufficiently good?
ISSUE 2: TEMPORAL DEPENDENCY?
46
[Figure: F0 (Hz, 160-560) over frames 100-600 of utterance ATR Ximera F009 AOZORAR 03372 T01: the natural F0 versus one random sample from SAR and one random sample from eSAR.]
DAR
q Motivation
• Random sampling from SAR and eSAR is still unsatisfactory. Why?
• SAR: the AR transformation is linear (Sec. 6.1)
• eSAR: non-linear, but of a special (invertible) form
q What about a non-linear and non-invertible AR transformation?
ISSUE 2: TEMPORAL DEPENDENCY?
47
[Diagram: SAR/eSAR first transform o_{1:T} into c_{1:T} = f_\Phi(o_{1:T}) and model \prod_{t=1}^{T} p(c_t | x_{1:T}; \Theta); the alternative is a network with AR dependency directly on o_{1:T} and x_{1:T}.]
DAR
q Definition
ISSUE 2: TEMPORAL DEPENDENCY?
48
[Figure: the network computes \mathcal{M}_1, ..., \mathcal{M}_5 from x_1, ..., x_5, now with AR feedback from the previous outputs o_{1:t-1}.]
DAR
q Definition
• DAR is more general than SAR (Sec. 6.2.2 & slides, toy examples):
  1. Longer-time dependency
  2. Non-linear dependency
ISSUE 2: TEMPORAL DEPENDENCY?
49
Continuous case: p(o_{1:T} | x_{1:T}; \Theta) = \prod_{t=1}^{T} p(o_t | o_{1:t-1}, x_{1:T}; \Theta)
Discrete case: P(o_{1:T} | x_{1:T}; \Theta) = \prod_{t=1}^{T} P(o_t | o_{1:t-1}, x_{1:T}; \Theta)
[Figure: network with AR feedback over o_1, ..., o_5.]
DAR
q Implementation (Sec. 6.3)
• Quantized F0
• Hierarchical softmax
• Data dropout
(See the sketch below.)
ISSUE 2: TEMPORAL DEPENDENCY?
50
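A minimal sketch of these ingredients; a plain softmax stands in for the hierarchical softmax, and the sizes, dropout rate, and first-frame handling are illustrative assumptions rather than the thesis configuration.

```python
import torch
import torch.nn as nn

class DARSketch(nn.Module):
    """Toy deep-AR model: a uni-directional LSTM that feeds back the
    previous quantized F0 symbol (plain softmax instead of hierarchical)."""
    def __init__(self, ling_dim, n_bins=256, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(n_bins, 32)            # previous F0 symbol
        self.lstm = nn.LSTM(ling_dim + 32, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_bins)

    def forward(self, ling, prev_f0, drop_rate=0.25):
        # data dropout: randomly zero the fed-back F0 during training so the
        # model cannot rely on the AR path alone
        fb = self.embed(prev_f0)
        if self.training:
            mask = (torch.rand(fb.shape[:2]) > drop_rate).unsqueeze(-1)
            fb = fb * mask
        h, _ = self.lstm(torch.cat([ling, fb], dim=-1))
        return self.out(h)                               # logits over F0 bins

# teacher-forced training step on toy data
model = DARSketch(ling_dim=10)
ling = torch.randn(2, 30, 10)                            # (batch, frames, dim)
f0 = torch.randint(0, 256, (2, 30))                      # quantized F0 targets
prev = torch.roll(f0, 1, dims=1)                         # previous frame (wrap-around at t=0 ignored here)
logits = model(ling, prev)
loss = nn.functional.cross_entropy(logits.reshape(-1, 256), f0.reshape(-1))
```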
Experiment
q Data and features
• Data: Japanese, 48 hours
• Features: F0 (interpolated F0 value, 1 dim + U/V, 1 dim)
q Models
ISSUE 2: TEMPORAL DEPENDENCY?
51
RNN, RMDN, SAR, and eSAR share the same trunk over the linguistic features: FF layers and two bi-LSTM layers (layer sizes 512, 512, 256, 128) with a linear output (size 2); RMDN, SAR, and eSAR model F0 with a GMM (2 mixtures), and eSAR adds a normalizing-flow module (uni-LSTM, size 64, + linear).
v FF: feedforward layer with tanh activation
Experiment
q Data and features
• Data: Japanese, 48 hours
• Features: F0 (interpolated F0 value, 1 dim + U/V, 1 dim)
• Features: quantized F0 (256 quantization bins)
q Models
ISSUE 2: TEMPORAL DEPENDENCY?
52
DAR: linguistic features -> FF, FF -> bi-LSTM -> uni-LSTM -> linear -> hierarchical softmax over quantized F0.
WaveNet-F0: linguistic features -> FF, FF -> bi-LSTM, bi-LSTM trunk conditioning a WaveNet-like AR structure over quantized F0, with a linear layer and hierarchical softmax.
(Layer sizes: 512, 512, 256, 128, 256.)
v FF: feedforward layer with tanh activation
Experiment
q Random sampling
ISSUE 2: TEMPORAL DEPENDENCY?
53
[Figure: F0 (Hz) over frames 100-600 of utterance ATR Ximera F009 AOZORAR 03372 T01. Random samples from SAR and eSAR still deviate noisily from the natural F0, while three different random samples from DAR (samples 1-3) each follow plausible F0 contours.]
Experiment
q Mean-based generation
ISSUE 2: TEMPORAL DEPENDENCY?
54
[Figure: F0 (Hz) over frames 100-600 of utterance ATR Ximera F009 AOZORAR 03372 T01: mean-based outputs of RNN, RMDN, SAR, eSAR, WaveNet-F0, and DAR, each compared against the natural F0.]
Experiment
q Mean-based generation
• 500 test utterances, >1000 sets of scores
ISSUE 2: TEMPORAL DEPENDENCY?
55
MOS test 1: Natural 4.243, DAR 3.856, SAR 3.564, RMDN 3.628, RNN 3.565
Pairwise p-values (MOS test 1):
            Natural   DAR      SAR      RMDN     RNN
Natural     -         <1e-10   <1e-10   <1e-10   <1e-10
DAR         <1e-10    -        <1e-10   <1e-10   <1e-10
SAR         <1e-10    <1e-10   -        0.01785  0.7429
RMDN        <1e-10    <1e-10   0.01785  -        0.00426
RNN         <1e-10    <1e-10   0.7429   0.00426  -
(p-value > 0.05 for SAR vs. RNN)
MOS test 2: Natural 4.199, DAR 3.971, SAR 3.732, eSAR 3.813, WaveNet-F0 3.803
Pairwise p-values (MOS test 2):
            Natural   DAR        SAR       eSAR       WaveNet-F0
Natural     -         <1e-10     <1e-10    <1e-10     <1e-10
DAR         <1e-10    -          <1e-10    9.062e-08  0.000102
SAR         <1e-10    <1e-10     -         0.007186   0.000176
eSAR        <1e-10    9.062e-08  0.007186  -          0.219733
WaveNet-F0  <1e-10    0.000102   0.000176  0.219733   -
(p-value > 0.05 for eSAR vs. WaveNet-F0)
Summary
q Full answer to issue 2
ISSUE 2: TEMPORAL DEPENDENCY?
56
Temporal dependency is ignored by RNN/RMDN, but better models can be defined:

               RMDN   SAR           eSAR                        DAR
AR linear?     -      Linear        Non-linear (constrained)    Non-linear
AR time span   -      t-K : t-1     1 : t-1                     1 : t-1
Tractable?     -      Yes           Somewhat                    No
Sampling?      No     No            No                          Yes
CONTENTS
57
q Introduction
q Issue 1: joint modeling of F0 and spectral features
q Issue 2: temporal dependency modeling of F0 contours
q Issue 3: frame-by-frame processing
q Summary
Motivation
q Inefficient processing
ISSUE 3: FRAME-BY-FRAME PROCESSING?
58
[Figure: frame-by-frame processing. The linguistic features are unrolled to the frame tier (e.g., phone 'ts' and mora ツ repeated for every frame they span: ts ts ts u u u u u g g g on the phone tier, ツ ツ ... ギ on the mora tier), and the network maps x_1, ..., x_T to o_1, ..., o_T one frame at a time, even though the phrase, mora, and phone tiers change much more slowly.]
Motivation
q Inefficient processing
ISSUE 3: FRAME-BY-FRAME PROCESSING?
59
[Figure: the same frame-by-frame unrolling, annotated with the phone boundaries (phone 1, phone 2, phone 3) on the output side.]
Motivation
q More efficient processing?
ISSUE 3: FRAME-BY-FRAME PROCESSING?
60
• Given the duration of each unit, can the model read one input per phone (x_{1p}, x_{2p}, x_{3p}, e.g., one per ts/ツ, u, g/ギ) and still produce the frame-level contour o_1, ..., o_T? How?
Method
q Two-stage F0 modeling
ISSUE 3: FRAME-BY-FRAME PROCESSING?
61
Stage 1: F0 contour modeling
Stage 2: Linguistic linking
[Figure: stage 2 maps the phone-level inputs x_{1p}, x_{2p}, x_{3p} to codes; stage 1 decodes the codes into the frame-level contour o_1, ..., o_T over phones 1-3.]
Method
q Two-stage F0 modeling
• Revisits linguistic approaches [28,29]
ISSUE 3: FRAME-BY-FRAME PROCESSING?
62
Stage 1: a vector-quantization variational auto-encoder (VQ-VAE [27]) learns a compact code space for F0 contours.
Stage 2: a sequential classifier links the linguistic features to the codes.
[27] A. van den Oord, O. Vinyals, and K. Kavukcuoglu. Neural discrete representation learning. In Proc. NIPS, 2017.
[28] K. E. Dusterhoff, A. W. Black, and P. A. Taylor. Using decision trees within the Tilt intonation model to predict F0 contours. In Proc. Eurospeech, pages 1627-1630, 1999.
[29] K. Hirose, K. Sato, Y. Asano, and N. Minematsu. Synthesis of F0 contours using generation process model parameters predicted from unlabeled corpora: Application to emotional speech synthesis. Speech Communication, 46(3):385-404, 2005.
Stage 1: F0 contour modeling
q VQ-VAE goals
• Unsupervised learning
• One code for a varied-length unit
• Multiple linguistic levels
ISSUE 3: FRAME-BY-FRAME PROCESSING?
63
[Figure: a compact code space represents the F0 contour of each unit (phone 1, phone 2, phone 3).]
Stage 1: F0 contour modeling
q VQ-VAE
• One code per phone (see the sketch below)
ISSUE 3: FRAME-BY-FRAME PROCESSING?
64
[Figure: the encoder summarizes the frames of each phone into latent vectors z_{1p}, z_{2p}, z_{3p}; each is quantized to its nearest codebook entry e_{1p}, e_{2p}, e_{3p}; the decoder reconstructs the frame-level F0 contour o_1, ..., o_T from the quantized codes.]
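A sketch of the quantization step under these assumptions; the codebook size and dimension are illustrative, and the VQ-VAE codebook and commitment losses are omitted for brevity.

```python
import torch

def vector_quantize(z_e, codebook):
    """Nearest-neighbour lookup of VQ-VAE: each encoder output z_e (one per
    phone here) is replaced by its closest codebook vector; the
    straight-through trick lets gradients flow back to the encoder."""
    # z_e: (units, dim); codebook: (codes, dim)
    dist = torch.cdist(z_e, codebook)           # (units, codes) distances
    idx = dist.argmin(dim=1)                    # discrete code indices l
    z_q = codebook[idx]                         # quantized vectors e
    z_q = z_e + (z_q - z_e).detach()            # straight-through estimator
    return z_q, idx

# toy usage: 5 phone-level encodings against a 64-entry, 16-dim codebook
codebook = torch.randn(64, 16, requires_grad=True)
z_e = torch.randn(5, 16)
z_q, idx = vector_quantize(z_e, codebook)       # z_q feeds the decoder
```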
Stage 1: F0 contour modeling
q VQ-VAE with multiple linguistic tiers
• One code per phone & one code per mora
ISSUE 3: FRAME-BY-FRAME PROCESSING?
65
[Figure: a phone-level encoder with a phone-level codebook (z_{1p}, z_{2p}, z_{3p} -> e_{1p}, e_{2p}, e_{3p}) and a mora-level encoder with a mora-level codebook (z_{1m}, z_{2m} -> e_{1m}, e_{2m}); the decoder reconstructs the F0 contour o_1, ..., o_T of the utterance (mora 1, mora 2; phones 1-3) from both tiers.]
Stage 1: F0 contour modeling
q VQ-VAE with multiple linguistic tiers
ISSUE 3: FRAME-BY-FRAME PROCESSING?
66
[Figure: after training, the VQ-VAE encoder(s) turn an F0 contour into phone code indices {l_{1p}, l_{2p}, l_{3p}, ..., l_{Np}} and mora code indices {l_{1m}, l_{2m}, ..., l_{Nm}}; given the codebooks and the unit durations, the VQ-VAE decoder converts the indices back into the contour.]
Stage 2: Linguistic linking
ISSUE 3: FRAME-BY-FRAME PROCESSING?
67
[Figure: a linguistic linker maps the linguistic features to the phone code indices {l_{1p}, ..., l_{Np}} and mora code indices {l_{1m}, ..., l_{Nm}}; the VQ-VAE decoder, given the codebooks and durations, produces the F0 contour.]
Stage 2: Linguistic linking
ISSUE 3: FRAME-BY-FRAME PROCESSING?
68
An RNN-based sequential classifier maps the linguistic features to the phone code indices {l_{1p}, l_{2p}, l_{3p}, ..., l_{Np}} and mora code indices {l_{1m}, l_{2m}, ..., l_{Nm}}, using (see the sketch below):
• Clockwork RNN [30]
• Highway connections [31]
• AR & feedback links
• Dropout [32] (Sec. 7.2.2)
[30] J. Koutnik, K. Greff, F. Gomez, and J. Schmidhuber. A Clockwork RNN. In Proc. ICML, pages 1863-1871, 2014.
[31] K. Greff, R. K. Srivastava, and J. Schmidhuber. Highway and residual networks learn unrolled iterative estimation. In Proc. ICLR, 2017.
[32] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929-1958, 2014.
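A bare-bones stand-in for the linker, just to show the unit-by-unit operation: a plain LSTM replaces the clockwork/highway/AR design above, and all sizes are illustrative.

```python
import torch
import torch.nn as nn

class LinkerSketch(nn.Module):
    """Toy stage-2 linker: an RNN classifier from unit-level
    (phone-by-phone) linguistic features to phone code indices."""
    def __init__(self, ling_dim, n_codes):
        super().__init__()
        self.rnn = nn.LSTM(ling_dim, 64, batch_first=True)
        self.out = nn.Linear(64, n_codes)

    def forward(self, ling_units):              # (batch, units, dim)
        h, _ = self.rnn(ling_units)
        return self.out(h)                       # logits over code indices

# one training step runs unit-by-unit: 12 phones, not hundreds of frames
linker = LinkerSketch(ling_dim=20, n_codes=64)
ling = torch.randn(1, 12, 20)                    # 12 phones in the utterance
target = torch.randint(0, 64, (1, 12))           # indices from the VQ codebook
logits = linker(ling)
loss = nn.functional.cross_entropy(logits.reshape(-1, 64), target.reshape(-1))
```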
F0 modeling for TTS
ISSUE 3: FRAME-BY-FRAME PROCESSING?
69
Training set: stage 1 trains the VQ-VAE (encoders, codebooks, decoder) on F0; stage 2 trains the linker to predict the code indices from the linguistic features.
Test set: the linker predicts code indices from the linguistic features; the decoder and codebooks convert them into F0.
F0 modeling for TTS
q Experiments on stage 1 (Sec. 7.3.2)
q Experiments on the whole model
• Given natural duration
q Models
ISSUE 3: FRAME-BY-FRAME PROCESSING?
70
Model 1: frame-by-frame linker + VQ-VAE decoder (phone level)
Model 2: phone-by-phone linker + VQ-VAE decoder (phone level)
Model 3: phone-by-phone linker + mora clock + VQ-VAE decoder (phone & mora level)
Experiments
q Objective results
ISSUE 3: FRAME-BY-FRAME PROCESSING?
71
          RMSE   CORR    U/V     Time cost (s/epoch)
Model 1   34.3   0.839   7.96%   1300
Model 2   27.1   0.906   6.36%   54
Model 3   25.5   0.916   4.87%   65
VAE vs DAR
q Objective results
• The VQ-VAE encoder needs time and memory
ISSUE 3: FRAME-BY-FRAME PROCESSING?
72
                RMSE   CORR    U/V     Number of parameters (m)   Time cost (s/epoch)
DAR             28.3   0.903   3.46%   1.48 in total              ~2000 in total
VAE (model 3)   25.5   0.916   4.87%   0.44 + 0.67 = 1.11         1500 (stage 1) + 65 (stage 2) = 1565
VAE vs DAR
q Subjective test
• 500 test utterances, >1000 sets of scores
ISSUE 3: FRAME-BY-FRAME PROCESSING?
73
MOS test: Natural 4.199, DAR 3.971, SAR 3.732, eSAR 3.813, WaveNet-F0 3.803, VAE (model 3) 3.993
Pairwise p-values:
               Natural  DAR        SAR       eSAR       WaveNet-F0  VAE (model 3)
Natural        -        <1e-10     <1e-10    <1e-10     <1e-10      <1e-10
DAR            <1e-10   -          <1e-10    9.062e-08  0.000102    0.392125
SAR            <1e-10   <1e-10     -         0.007186   0.000176    <1e-10
eSAR           <1e-10   9.062e-08  0.007186  -          0.219733    5.946e-10
WaveNet-F0     <1e-10   0.000102   0.000176  0.219733   -           2.635e-06
VAE (model 3)  <1e-10   0.392125   <1e-10    5.946e-10  2.635e-06   -
(p-value > 0.05 for DAR vs. VAE (model 3) and for eSAR vs. WaveNet-F0)
Summary
q Answer to issue 3: frame-by-frame processing could be more efficient
q How: linguistic features -> linker (unit-by-unit) -> codebooks -> VQ-VAE decoder (frame-by-frame) -> F0
q Results
• Multiple linguistic tiers
• More efficient than DAR: smaller + faster + F0 CORR > 0.91
• Interpretable latent code spaces (Sec. 7.3.2)
• Random sampling OK (slides)
ISSUE 3: FRAME-BY-FRAME PROCESSING?
74
CONTENTS
75
q Introduction
q Issue 1: joint modeling of F0 and spectral features
q Issue 2: temporal dependency modeling of F0 contours
q Issue 3: frame-by-frame processing
q Conclusion
Summary
CONCLUSION
76
Chapter 1-3: Preliminary
Chapter 4, Issue 1: Joint modeling for F0?
? Joint modeling of F0 and spectral features?
✕ Sub-optimal for F0 modeling
• Methods:
  o Highway networks
  o Histogram + sensitivity analysis
• Results:
  o Spectral features are prioritized
  o Different input/hidden features for F0 and spectral features
Summary
CONCLUSION
77
Chapter 1-3: Preliminary; Chapter 4, Issue 1: Joint modeling for F0?
Chapter 5-6, Issue 2: Temporal dependency?
? Temporal dependency in RNN/RMDN?
✕ Ignored by RNN/RMDN
• Models:
  o SAR: tractable dependency
  o DAR: non-linear + longer dependency
• Results:
  o DAR: F0 CORR > 0.90, MOS score ⬆
  o DAR supports random sampling!
Summary
CONCLUSION
78
Chapter 1-3: Preliminary; Chapter 4, Issue 1; Chapter 5-6, Issue 2
Chapter 7, Issue 3: Frame-by-frame?
? Frame-by-frame processing
✕ Inefficient
• Two-stage F0 model:
  o F0 contour coding: VQ-VAE + DAR
  o Linguistic linking: unit-by-unit classifier
• Results:
  o F0 CORR > 0.91 ⬆
  o More efficient (smaller, faster)
Chapter 8: Summary
New topics?
q Towards complete prosody modeling
• Not only F0 but also duration: the linker can predict a latent code per phone together with the duration of the phone
• Easy for joint modeling
q Waveform modeling
• F0 and waveform are both one-dimensional signals
• Signal processing methods are available: SAR + log area ratio + segment-variant filters
CONCLUSION
79
Thank you for your attention
Q&A
80
Code, scripts, slides: tonywangx.github.io
THESIS OUTLINE
81
q Neural F0 modeling
ü Meets the basic publication requirement: 3 journal papers + 6 conference papers
Chapter 1-3: Introduction, TTS, neural networks
Chapter 4: Investigation using highway networks (Speech Synthesis Workshop 2016; Speech Communication, Vol. 96, Feb. 2018, pp. 1-9)
Chapter 5: Shallow autoregressive F0 model (SAR) (ICASSP 2017); extended SAR (journal paper in preparation)
Chapter 6: Deep autoregressive F0 model (DAR) (Interspeech 2017; IEEE Trans. ASLP, accepted, 2018)
Chapter 7: Variational auto-encoder (VAE) + DAR (ICASSP 2018)
Chapter 8: Summary and application of the F0 model
WORK DURING PH.D.
82
q Work not included in the Ph.D. thesis
• Word embedding as input features (IEICE 2018, Interspeech 2016)
• Impact of training data size (SSW 2016)
q Collaborative work
• DAR using manually-annotated/corrupted data (Interspeech 2018): provided DAR, SAR, and the WaveNet vocoder
• Voice cloning using found data (Odyssey 2018): provided the WaveNet vocoder and acoustic models
• Cyborg speech: multilingual speech synthesis (ICASSP 2018): provided acoustic models
• Speech synthesis from MFCC (ICASSP 2018): provided DAR on MFCC
• Controllable speech synthesis (Interspeech 2017): provided acoustic models