Fundamental Frequency Modeling for Neural-Network-based ...tonywangx.github.io/pdfs/E4.pdf ·...

FundamentalFrequencyModelingforNeural-Network-based

StatisticalParametricSpeechSynthesis

ID:201517062018-07-13

1contact:wangxin@nii.ac.jpwewelcomecriticalcomments,suggestions,anddiscussion

XinWang

q Introduction• Background• Topic• Thesisoutline

q Issuesandmethods

q Summary

CONTENTS

Newresults/updatedexplanation

(c.f. HTSSlides,byHTSWorkingGroup)

Text-to-speech(TTS)q TTSpipeline[1,2]

q Synthesizer

INTRODUCTION

[1] Taylor,P.(2009).Text-to-SpeechSynthesis.[2] Dutoit,T.(1997).AnIntroductiontoText-to-speech Synthesis.[3] Tokuda,K.,etal.,(2013).SpeechSynthesisBasedonHiddenMarkovModels.ProceedingsoftheIEEE,101(5),1234–1252.[4] Zen,H.,etal.(2009).Statisticalparametricspeechsynthesis.SpeechCommunication,51,1039–1064.

AcousticfeaturesFundamentalfrequency(F0)

SpectralfeaturesAcousticmodels

Vocoder

Statisticalparametricspeechsynthesizer(SPSS)[3,4]

TextText

analyzerSpeech

synthesizerLinguisticfeatures Speech

Text-to-speech(TTS)q Neural-network-basedacousticmodels[5,6,7]

INTRODUCTION

[5] H.Zen,A.Senior,andM.Schuster.Statisticalparametricspeechsynthesisusingdeepneuralnetworks.InProc.ICASSP,pages7962–7966,2013.[6] Z.H.Ling,etal.Deeplearningforacousticmodelinginparametricspeechgeneration:Asystematicreviewofexistingtechniquesandfuturetrends.IEEESignalProcessingMagazine,

32(3):35–52, 2015.[7] Y.Fan,Y.Qian,F.Xie,andF.K.Soong.TTSsynthesiswithbidirectionalLSTMbasedrecurrentneuralnetworks.InProc.Interspeech,pages1964–1968,2014.

AcousticfeaturesLinguisticfeatures Acousticmodels

次は新金岡、新金岡です。

Frametier

Neuralnetworks

... 1 1 ... 1 1 1 ... 1 2 ...

... ... * ... * ...

... ツツ ... ツツギ ... ギワ ...

... ts ts ... u u g ... i w ...

… 162163 … 194195196 … 227228 ...

0.1 …1.2 …4.5 …

0 0 43 56 32 …Phone tier

Moratier

Phrasetier Spectralfeatures

Fundamentalfrequency(F0)

Spectralfeatures

Topicq NeuralF0modelingforTTS

qWhyF0?

INTRODUCTION

Frametier

Neuralnetworks

0.1 …1.2 …4.5 …

0 0 43 56 32 …Phone tier

Moratier

Phrasetier Spectralfeatures

... 1 1 ... 1 1 1 ... 1 2 ...

... ... * ... * ...

... ツツ ... ツツギ ... ギワ ...

... ts ts ... u u g ... i w ...

… 162163 … 194195196 … 227228 ...

[8]NanetteVeilleux,etal.6.911TranscribingProsodicStructureofSpokenUtteranceswithToBI.JanuaryIAP2006.https://ocw.mit.edu.License:CreativeCommonsBY-NC-SA.

次は新金岡、新金岡です。

Speaker A: Who made the marmalade.

Speaker B:Marianna made the marmalade.

Speaker A: Bob made the marmalade.

Speaker B: (No,) Marianna made the marmalade.

Speaker B:Marianna made the marmalade.

Speaker B: Marianna made the marmalade.

Speaker B: Mariannamade the marmalade.

Topicq Issuestobeaddressed

INTRODUCTION

Neuralnetworks

Spectralfeatures

[9]M.S.Ribeiro.Suprasegmental representationsforthemodelingoffundamentalfrequencyinstatisticalparametricspeechsynthesis.PhDthesis,TheUniversityofEdinburgh,2018.

[10]J.Hirschberg.Pitchaccentincontextpredictingintonational prominencefromtext.ArtificialIntelligence,63(1):305–340,1993.

... … … … xTx1 x2 x3

... … … …

Linguisticfeatures bsT

boTbo3

bs3bs2bs1

bo1 bo2

p(o1:T , s1:T |x1:T ;⇥)=TY

N ([ot, st]; Network⇥(x1:T , t),�I)

F0features[9]Linguisticfeatures[10] Issue1:jointmodeling?

Issue2:temporaldependency?

Issue3:efficientenough?

Thesisoutlineq Conventionalapproaches[7](Table3.1inthesis)

INTRODUCTION

Recurrentlayers

Feedforwardlayer

Linguisticfeatures

F0contour

…Spectralfeatures

xTx1 x2 x3

bsTboTbo3

bs3bs2bs1bo1 bo2

Recurrentneuralnetwork(RNN)

... 1 1 ... 1 1 1 ... 1 2 ...

... ... * ... * ...... ツツ ... ツツギ ... ギワ ...... ts ts ... u u g ... i w ...… 162 163 … 194 195 196 … 227 228 ...

T frames(timesteps)[7] Y.Fan,Y.Qian,F.Xie,andF.K.Soong.TTSsynthesiswithbidirectionalLSTMbasedrecurrentneuralnetworks.InProc.Interspeech,pages1964–1968,2014.

Thesisoutlineq Threeissues

INTRODUCTION

Issue1:Jointmodeling?

xTx1 x2 x3

bsTboTbo3

bs3bs2bs1bo1 bo2

T frames

... 1 1 ... 1 1 1 ... 1 2 ...

... ... * ... * ...... ツツ ... ツツギ ... ギワ ...... ts ts ... u u g ... i w ...… 162 163 … 194 195 196 … 227 228 ...

Issue2:Temporaldependency?

Issue3:Frame-by-frame

processing isefficient?

Thesisoutline(Chapter4)q Onissue1:JointmodelingofF0andspectralfeatures?

• Investigationusinghighwaynetworkso Spectralfeaturesareprioritizedo Differentinput/hiddenfeaturesforF0andspectral

✕ Sub-optimalforF0modelingü OnlyF0astarget

INTRODUCTION

… xTx1 x2 x3

boTbo3bo1 bo2

… bsTbs3bs2bs1

Novelanalysis

F0contour

Spectralfeatures

Thesisoutline(Chapter5-6)q Onissue2:Temporaldependency?

• Evidencefromrandomsampling✕ RNNignorestemporaldependency

INTRODUCTION

… xTx1 x2 x3

boTbo3bo1 bo2

Thesisoutline(Chapter5-6)q Onissue2:Temporaldependency?

ü Shallowautoregressivemodel(SAR)• Short-termdependency

ü Deepautoregressivemodel(DAR)• Longerdependency&bestresults&randomsampling

INTRODUCTION

… xTx1 x2 x3

boTbo3bo1 bo2

Novelmodels&interpretations

Thesisoutline(Chapter7)q Onissue3:Frame-by-frameprocessing?

✕ Inefficient

INTRODUCTION

… xTx1 x2 x3

boTbo3bo1 bo2

... 1 1 ... 1 1 1 ... 1

... ... * ... *... ツツ ... ツツギ ... ギ

... ts ts ... u u g ... i… 162 163 … 194 195 196 … 227

Thesisoutline(Chapter7)q Onissue3:Frame-by-frameprocessing?

ü Two-stagemodel:efficient&interpretable&multi-level

INTRODUCTION

… boTbo3bo1 bo2

x1p x2p

Unit 1 Unit 2

1 1 1 1* *

ツツギギ

ts u g i

Stage1:F0contourmodeling

Stage2:Linguisticlinking

Novelmodel

Unsupervisedlearning

Fastsupervisedlearning

Thesisoutline

INTRODUCTION

Preliminary

Summary

Issue1Jointmodeling forF0?

Issue2Temporaldependency?

Issue3Frame-by-frame?

… xTx1 x2 x3

boTbo3bo1 bo2

… bsTbs3bs2bs1

… xTx1 x2 x3

boTbo3bo1 bo2

… boTbo3bo1 bo2

x1p x2p

Chapter1-3

Chapter5Chapter6

Chapter7

Chapter8

Chapter4 HighwaynetworksToolsforanalysis

SAR&extendedSARDAR &techniques

Two-stageF0modelstage1:frame-by-framestage2:unit-by-unit

Updatedresults

INTRODUCTION

Preliminary

Summary

Issue1Jointmodeling forF0?

Issue2Temporaldependency?

Issue3Frame-by-frame?

Chapter1-3

Chapter5Chapter6

Chapter7

Chapter8

Chapter4

SAR+logarearatioAdditional listeningtests

Listeningtest

CONTENTS

q Introduction

q Issue1:jointmodelingofF0andspectralfeatures

q Issuesandmethods

q Summary

Motivationq Commonapproach [5,7,11]

• Joint(multi-task)learning? Beneficialforbothtargets? Sharinghiddenfeatures

q Empiricalresultsagainstjointlearning[5,12]

ISSUE 1:JOINT LEARNING FOR F0?

Spectralfeatures

Neuralnetwork

Linguisticfeatures

[5] H.Zen,A.Senior,andM.Schuster.Statisticalparametricspeechsynthesisusingdeepneuralnetworks.InProc.ICASSP,pages7962–7966,2013.[7] Y.Fan,Y.Qian,F.Xie,andF.K.Soong.TTSsynthesiswithbidirectionalLSTMbasedrecurrentneuralnetworks.InProc.Interspeech,pages1964–1968,2014.[11] H.ZenandA.Senior.Deepmixturedensitynetworksforacousticmodelinginstatisticalparametricspeechsynthesis.In Proc.ICASSP,pages3844–3848,2014.[12] S.KangandH.Meng.Statisticalparametricspeechsynthesisusingweightedmulti-distributiondeepbeliefnetwork.InProc.Interspeech,pages1959–1963,2014.

Trueornot?Moreevidence?

Method

• Joint(multi-task)learning? Beneficialforbothtargets? Sharinghiddenfeatures

Spectralfeatures

Highwaynetworks [13]

Linguisticfeatures

[13] R.K.Srivastava,K.Greff,andJ.Schmidhuber.Highwaynetworks.InProc.DeepLearningWorkshop,2015.

• Modelandtools:o Highwaynetwork[13]

o Histogram&sensitivitytools

Methodq Definitionofhighwaynetwork

h1 …

bot = gt � ht + (1� gt)� xt

Highwaynetwork

gt = sigmoid(W gxt + bg)

ht = �(W ixt + bi)

Highwaygatevector

…bo1 bo2

x1 x2 xT

boT bot = W oht + bo.

ht = �(W ixt + bi)Feedforward

networkW i, bi

W o, boh1 hTh2

Methodq Highwaynetworkforacousticmodeling

• Spectralfeatures

Linguisticfeatures

Linear

Highwayblock

Highwayblock…

LinearBAP

Highwayblock

Highwayblock…

LinearF0

Highwayblock

Highwayblock…

Linear

MGC:Mel-generalizedcepstral(MGC)coefficients[14]

BAP:bandaperiodicitycoefficients[15,16]

Linear

MGCBAPF0

Highwayblock

Highwayblock…

Linear

Linguisticfeatures

[14] K.Tokuda,T.Kobayashi,T.Masuko,andS.Imai.Mel-generalizedcepstralanalysisaunifiedapproach.InProc.ICSLP,pages1043–1046,1994.[15] H.Kawahara,J.Estill,andO.Fujimura.Aperiodicityextractionandcontrolusingmixedmodeexcitationandgroupdelay manipulationforahighqualityspeechanalysis,modification

andsynthesissystemstraight.InSecondInternationalWorkshoponModelsandAnalysisofVocalEmissionsforBiomedicalApplications,2001.[16] H.ZenandT.Toda.AnoverviewofNitech HMM-basedspeechsynthesissystemforBlizzardChallenge2005.InProc.Interspeech,pages93–96,2005.

Single-stream Multi-stream

Methodq Analysistools

1. Histogramofgatevectors

2. Sensitivityoftodifferentlinguisticfeatures(Sec.4.3.2)

bo = g � h+ (1� g)� x

{g1, · · · , gT } Histogram

0 0.5 10

2.5 #105

0 0.5 10

8 #104

0 0.5 10

3 #104

0 0.5 10

2.5 #104

0 0.5 10

3 #104

0 0.5 10

4 #104

0 0.5 10

8 #104

g ⇡ 1g ⇡ 0

Non-lineartransformationNotransformation

Linear

MGCBAPF0

Highwayblock

Highwayblock…

Linear

Linguisticfeatures

Experimentsq Configuration

• Data:English,16hours• Feature:MGC,BAP,F0(InterpolatedF0+voicing(U/V))

• Metric:

q Threemodels:• Single-streamfeedforwardnetwork• Single-streamhighwaynetwork• Multi-streamhighwaynetwork

q Twotests:1. Fixedlayerwidth,varyingnetworkdepth2. Fixednetworkdepth,varyinglayerwidth

Rootmeansquareerror(RMSE)Correlationcoefficients(CORR)

Highwayblock

Experimentsq Objectiveresults:increasingnetworkdepth

v Networkdepth:Numberoftanh-basedtransformationlayers• Single-streamnetworkprioritizesMGC?

Linguisticfeatures

Linear

Highwayblock

LinearBAP

Highwayblock

LinearF0

Highwayblock

LinearF0

Highwayblock

LinearMGC BAP

Linear

Linguisticfeatures

feedforward

LinearMGC BAP

Linear

Linguisticfeatures

feedforward

MGCRMSE F0RMSE F0CORR

2 4 8 14 20 40Network depth

Single-stream feedforwardSingle-stream highwayMulti-stream highway

Experimentsq Objectiveresults:increasingwidth(depth=14)

v MS1 :[MGC256]– [F0256]v MS2 :[MGC382]– [F0256]v MS3 :[MGC512]– [F0382]v MS4 :[MGC768]– [F0512]

• Single-streamnetworkprioritizesMGC?

3.0e6 9.0e6 2.5e7Number of Network weights

782882

1024382 482582 782 1024

MS2MS3

782 882

1024382

582 7821024

MS1 MS2MS3

782882 1024

382 482 582

782 1024

MS2MS3 MS4

Single-stream feedforwardSingle-stream highwayMulti-stream highway

MGCRMSE F0RMSE F0CORR

5 b.1 b.2 b.3 b.4 b.5 b.6 b.7

Experimentsq Histogramof

• Multi-streamhighway(depth14)

linear

highwayblock

linguistic features

linear

highwayblock

linear

highwayblock

Test set

g ⇡ 1g ⇡ 0

Non-lineartransformationNotransformation linear

5 b.1 b.2 b.3 b.4 b.5 b.6 b.7

• Multi-streamhighway(depth14,7blocks)

linear

highwayblock

linguistic features

linear

highwayblock

linear

highwayblock

Test set

5 b.1 b.2 b.3 b.4 b.5 b.6 b.7

g ⇡ 1g ⇡ 0

Non-lineartransformationNotransformation

• DifferenthiddenfeaturesforMGCandF0

• Single-streamhighway(depth14,7blocks)

linear

highwayblock

linguistic features

BAPMGC

highwayblock

Test set

5 b.1 b.2 b.3 b.4 b.5 b.6 b.7

• SimilartoMGCsub-netinmulti-streamhighway• Single-streamnetworkprioritizesMGC?

Summaryq Answertoissue1

q OnlyF0modelinginthefollowingchapters

q F0isusefulforMGCmodeling?Howtodo?(slidesappendix)

NOTforthesakeofF0modeling!

Joint(multi-task)learning? BeneficialforbothF0andspectralfeatures

? Theysharehiddenfeatures

Negativeevidence• Jointlearning(single-streamnetwork)

prioritizesspectralfeatures• Theyusedifferenthiddenfeatures• Theyusedifferentinputfeatures(Sec.4.4.3)• ResultsonEnglishandJapanesecorpora

(Sec.4.5)

q Introduction

q Issue2:temporaldependencymodelingofF0contours

q Issuesandmethods

q Summary

CONTENTS

Motivationq BaselineRNNmodel[17]

v T:numberofframes(timesteps)

ISSUE 2:TEMPORAL DEPENDENCY?

bo1:T = {bo1, · · · , boT }

Recurrent neural network (RNN)

F0 contour

Linguisticfeatures

[17]R.Fernandez,et.al.Prosodycontourpredictionwithlongshort- termmemory,bi-directional,deeprecurrentneuralnetworks.InProc.Interspeech,pages2268–2272,2014.

x1:T = {x1, · · · ,xT }

Motivationq BaselineRNNmodel

• Learnthecorrelationbetweenand,

x1 x2 x3 x4 x5

bo1 bo2 bo3 bo4 bo5

x1:T = {x1, · · · ,xT }

bo1:T = {bo1, · · · , boT }

H(RNN)⇥ (·) bot = H

(RNN)⇥ (x1:T , t)

t2 6= t1ot1 ot2

Motivationq BaselineRNNmodel

Mt = {µt}, where µt = H(RNN)⇥ (x1:T , t)

x1 x2 x3 x4 x5

M3 M4M2M1 M5

o1 o2 o3 o4 o5

[18]C.M.Bishop.MixtureDensityNetworks.Technicalreport,AstonUniversity,2004.[19]C.M.Bishop.Neuralnetworksforpatternrecognition.Oxforduniversitypress,1995.[20]M.Schuster.Bettergenerativemodelsforsequentialdataproblems:Bidirectionalrecurrentmixturedensitynetworks.In Proc.NIPS,pages589–595,1999.

H(RNN)⇥ (·)

Probabilistic part

Computational part

p(o1:T |x1:T ;⇥) =TY

p(ot|x1:T ;⇥) =TY

N (ot;µt,�I)

Recurrentmixturedensitynetwork

(RMDN)

bot = argmaxot

p(ot|x1:T ;⇥⇤) = µt Mean-basedgeneration

100 200 300 400 500 600Frame index (utterance ATR Ximera F009 AOZORAR 03372 T01)

Natural F0

Motivationq Initialanswer

• Evidencefromrandomsampling

TemporaldependencyisignoredbyRNN/RMDN

Natural F0RMDN mean-based output

Natural F0RMDN mean-based outputRMDN sample

p(o1:T |x1:T ;⇥) =TY

p(ot|x1:T ;⇥) =TY

N (ot;µt,�I)

Autoregressive(AR) [23] models

• ShallowARmodels(SAR)&extendedSAR• DeepARmodels(DAR)

RNN/RMDNp(o1:T ) =

p(ot|o1:t�1)

p(o1:T ) =TY

[23] B.Frey.GraphicalModelsforMachineLearningandDigitalCommunication.ABradfordbook.Bradfordbook,1998.

Motivationq Initialanswer

q Bettermodels?

TemporaldependencyisignoredbyRNN/RMDN

q Introduction

§ SARandextension§ DAR

q Issuesandmethods

q Summary

CONTENTS

x1 x2 x3 x4 x5

M3 M4M2M1 M5

o1 o2 o3 o4 o5

SARq Definition

p(o1:T |x1:T ) =TY

p(ot|ot�K:t�1,x1:T )

SARq Definition

• Timeinvariant• K:hyper-parameter

a2 a2 a2

a1a1a1a1

Trainableparameters

o1 o2 o3 o4 o5 K=2

f (ot�K:t�1) =KX

ak � ot�k + b

= {a1, · · · ,aK , b}

p(o1:T |x1:T ;⇥, ) =TY

p(ot|ot�K:t�1,x1:T ;⇥, )

N (ot;µt + f (ot�K:t�1),⌃t)

SARq Interpretation1(Sec.5.3)

q Interpretation2(Sec.5.3)

o1 o2 o3 o4 o5

p(o1:T |x1:T ;⇥, ) =TY

p(ot|ot�K:t�1,x1:T ;⇥, )

N (ot;µt + f (ot�K:t�1),⌃t)

pc(ct|x1:T ;⇥)

c1 c2 c3 c4 c5

ct = ot �KX

ak � ot�k

o1 o2 o3 o4 o5 c1 c2 c3 c4 c5Filters

SARq Interpretation(Sec.5.3)

• SAR=Lineartransformation+RMDN

Training

Generationx1:T

RMDNbo1:T

p(ct|x1:T ;⇥)

ct = ot �KX

ak � ot�k

bot = bct +KX

ak � bot�k

ExtendedSAR(eSAR)qMotivation

• SAR->Non-lineartransformation+RMDN?

• Yes,ifisinvertibleand̀ simple’

Training

Generationx1:T

RMDNbo1:T

p(ct|x1:T ;⇥)

c1:T = f (o1:T )

bo1:T = f�1 (bc1:T )

f (o1:T )

po(o1:T |x1:T ;⇥, ) = pc(c1:T = f (o1:T )|x1:T ;⇥)

�� det@f (o1:T )

��

ExtendedSAR(eSAR)q Definition

• Volume-preserving[24]

SAR eSAR

c1:T = f (o1:T )

bo1:T = f�1 (bc1:T )

ct = ot � RNN (o1:t�1, t)

bot = bct +RNN (bo1:t�1, t)

det@f (o1:T )

@o1:Tdet

@f (o1:T )

@o1:T= 1

ct = ot �KX

ak � ot�k

det@f (o1:T )

@o1:T= 1

bot = bct +KX

ak � bot�k

�� det@f (o1:T )

��

[24] J.M.Tomczak andM.Welling.Improvingvariationalauto-encodersusingconvexcombination linearinverseautoregressiveflow.InProc.Benelearn,pages162–194,2017.

ExtendedSAR(eSAR)q Implementation

o1 o2 o3 o4 o5

x1 x2 x3 x4 x5

M3 M4M2M1 M5

c1 c2 c3 c4 c5

µ5µ4µ3µ20

ct = ot � µt

pc(c1:T |x1:T ;⇥)

Transformation

Modeling

µt = RNN (o1:t�1, t)Normflow

x1 x2 x3 x4 x5

M3 M4M2M1 M5

bc1 bc2 bc3 bc4 bc5

bo5bo4bo3bo2bo1

De-transformation

Generationbc1:T ⇠ pc(c1:T |x1:T ;⇥)

ExtendedSAR(eSAR)q Implementation

bµ2 bµ3 bµ4 bµ5

bµt = RNN (bo1:t�1, t)

bot = bct + bµt

SummaryofSARq Theoreticallyappealing

q PerformanceforF0modeling(modeldetailslater)

q PerformanceforMGCmodeling• BetterthanRNN/RMDN[25,26]

44[25] X.Wang,J.Lorenzo-Trueba,S.Takaki,L.Juvela,andJ.Yamagishi.Acomparisonofrecentwaveformgenerationandacousticmodelingmethodsforneural-network-basedspeechsynthesis.InProc.ICASSP,pages4804–4808,2018.

[26] X.Wang,S.Takaki,andJ.Yamagishi.Anautoregressiverecurrentmixturedensitynetworkforpara- metricspeechsynthesis.InProc.ICASSP,pages4895–4899,2017.

SAR+stablefilters

0% 25% 50% 75% 100%

eSAR 55.70% SAR 44.30%

Signalprocessing

Machinelearning

Real-valuedpolesLogarearatio(Sec.5.3.2)Segment-variantfilters(slides)

Volume-preservingNone-volume-preserving (slides)…

q Introduction

§ SARandextension§ DAR

q Issuesandmethods

q Summary

CONTENTS

DARqMotivation

• AreSARandeSAR sufficientlygood?

Natural F0SAR sampled output

RandomfromSAR

Natural F0eSAR sampled output

RandomfromeSAR

DARqMotivation

• RandomsamplingonSARandeSAR?

• SAR:linear(Sec.6.1)

• eSAR:non-linearbutaspecialform

q Non-linearandnon-invertibleARtransformation?

o1:Tc1:T x1:Tc1:T = f�(o1:T )

NetworkwithARdependencyo1:T x1:T

p(ct|x1:T ;⇥)

f�(·)

x1 x2 x3 x4 x5

M3 M4M2M1 M5

o1 o2 o3 o4 o5

DARq Definition

• DARismoregeneralthanSAR(Sec6.2.2&slides,toyexamples)1. Longer-timedependency2. Non-lineardependency

x1 x2 x3 x4 x5

M3 M4M2M1 M5

o1 o2 o3 o4 o5

p(o1:T |x1:T ;⇥) =TY

p(ot|o1:t�1,x1:T ;⇥)

P (o1:T |x1:T ;⇥) =TY

P (ot|o1:t�1,x1:T ;⇥)

DARq Implementation(Sec.6.3)

x1 x2 x3 x4 x5

M3 M4M2M1 M5

o1 o2 o3 o4 o5

QuantizedF0Hierarchicalsoftmax

Datadropout

Experimentq Dataandfeatures

• Data:Japanese,48hours• Feature:F0(interpolatedF0value1dim+U/V1dim)

qModels

Layer size

linear

bi-LSTM

GMM 2mix

RNN RMDN SAR

GMM 2mix

linear

bi-LSTM

linear

bi-LSTM

Linguistic features

GMM 2mix

linear

bi-LSTM

normflow

uni-LSTM 64

linear 2

v FF:feedforwardwithtanh-activationfunction

WaveNet

Experimentq Dataandfeatures

• Data:Japanese,48hours• Feature:F0(interpolatedF0value1dim+U/V1dim)• Feature:quantizedF0(256quantizationbins)

qModels

Linguisticfeatures

QuantizedF0

hierarchicalsoftmaxlinear

uni-LSTM

bi-LSTM

v FF:feedforwardwithtanh-activationfunction

Linguisticfeatures

QuantizedF0

hierarchicalsoftmax

bi-LSTMbi-LSTM

WaveNet-F0

linear

Layer size

Natural F0SAR sampled output

RandomfromSAR

Natural F0eSAR sampled output

RandomfromeSAR

Experimentq Randomsampling

Natural DAR sample 1

100 200 300 400 500 600Frame index (ATR Ximera F009 AOZORAR 03372 T01)

DARsample1

DARsample2

DARsample3

Natural RNN

Natural RMDN

Natural SAR

Natural eSAR

Natural WaveNet-F0

100 200 300 400 500 600Frame index (ATR Ximera F009 AOZORAR 03372 T01)

Natural DAR

ExperimentqMean-basedgeneration

• 500testutterances, >1000setsofscores

Natural DAR SAR RMDN RNNNatural <1e-10 <1e-10 <1e-10 <1e-10DAR <1e-10 <1e-10 <1e-10 <1e-10SAR <1e-10 <1e-10 0.01785 0.7429RMDN <1e-10 <1e-10 0.01785 0.00426RNN <1e-10 <1e-10 0.7429 0.00426

3.5643.628 3.565

Natural DAR SAR eSAR WaveNet-F0Natural <1e-10 <1e-10 <1e-10 <1e-10DAR <1e-10 <1e-10 9.062e-08 0.000102SAR <1e-10 <1e-10 0.007186 0.000176eSAR <1e-10 9.062e-08 0.007186 0.219733WaveNet-F0 <1e-10 0.000102 0.000176 0.219733

3.7323.813 3.803

MOStest2

p-value>0.05

MOStest1

Natural DAR SAR RMDN RNN

Natural DAR SAR eSAR WaveNet-F0

Natural DAR SAR RMDN RNN

Summaryq Fullanswertoissue2

TemporaldependencyisignoredbyRNN/RMDN!Butbettermodelcanbedefined

RMDN SAR eSAR DAR

ARlinear? - Linear Non-linear(constrained) Non-linear

ARtime span -

Tractable? - Yes Somewhat No

Sampling? No No No Yes

µ1 µ2

t�K : t� 1 1 : t� 1 1 : t� 1

q Introduction

q Issue3:frame-by-frameprocessing

q Summary

CONTENTS

Motivationq Inefficientprocessing

ISSUE 3:FRAME-BY-FRAME PROCESSING?

* ** *** ***** ******* ******** ********** ************ ************** **********************************

…xtn xt(n+1)

otn ot(n+1)…

Phone 1 Phone 2 Phone 3

Phrasetier

Moratier

Phone tierFrametier

ツツツツツツツツギギギ

ts ts ts u u u u u g g g

1 1 1 1 1 1 1 1 1 1 1

Motivationq Inefficientprocessing

…… …… …

* ** *** ***** ******* ******** ********** ************ ************** **********************************

…xtn xt(n+1)

otn ot(n+1)…

Phrasetier

Moratier

Phone tierFrametier

ツツツツツツツツギギギ

ts ts ts u u u u u g g g

1 1 1 1 1 1 1 1 1 1 1

MotivationqMoreefficientprocessing?

… oTo1 o3o2 … …otn ot(n+1)… …

• Givendurationofeachunit

x3px2px1p

Howtodo?ツツギ

ts u g

Methodq Two-stageF0modeling

… oTo1 o3o2 … …otn ot(n+1)… …

Stage1:F0contourmodeling

Stage2:Linguisticlinking

x3px2px1p

Methodq Two-stageF0modeling

• Revisitlinguisticapproaches[28,29]

… oTo1 o3o2 … …otn ot(n+1)… …

Vector-quantizationvariational-auto-encoder(VQ-VAE[27])

Sequential classifier

Compactcodespace

62[27] A.vandenOord,O.Vinyals,andK.Kavukcuoglu.Neuraldiscreterepresentationlearning.InProc.NIPS,pagetoappear,2017.[28] K.E.Dusterhoff,A.W.Black,andP.A.Taylor.UsingdecisiontreeswithintheTiltintonationmodeltopredictF0contours.Proc.Eurospeech,pages1627–1630,1999.[29] K.Hirose,K.Sato,Y.Asano,andN.Minematsu.SynthesisofF0contoursusinggenerationprocessmodelparameterspredictedfromunlabeledcorpora:Applicationto

emotionalspeech synthesis.Speechcommunication, 46(3):385–404,2005.

x3px2px1p

Stage1:F0contourmodelingq VQ-VAE

• Unsupervisedlearning• Onecodeforvaried-lengthunit• Multiplelinguisticlevels

… oTo1 o3o2 … …otn ot(n+1)… …

Compactcodespace

Stage1:F0contourmodelingq VQ-VAE

• Onecodeperphone

… oTo1 o3o2 … …otn… …

ot(n+1)

e1p e2p

… oTo1 o3o2 … …otn… … ot(n+1)

z1p z2p z3p

Codebook

Decoder

Encoder

Stage1:F0contourmodelingq VQ-VAEmultiplelinguistictiers

• Onecodeperphone&onecodepermora

Mora-levelencoder

Mora-level codebookz1m

e1m e2m

… oTo1 o3o2 … …otn… …

ot(n+1)

… oTo1 o3o2 … …otn… … ot(n+1)

Mora 1 Mora 2

Decoder

Phone-levelEncoder

e1p e2p

z1p z2p z3p

Phone-level codebook

Stage1:F0contourmodelingq VQ-VAEmultiplelinguistictiers

… oTo1 o3o2 … …otn… …

ot(n+1)

… oTo1 o3o2 … …otn… … ot(n+1)

Mora 1 Mora 2

VQ-VAEencoder(s)

VQ-VAEdecoder

PhonecodeindicesMoracodeindices

{l1p , l2p , l3p , · · · , lNp}

{l1m , l2m , · · · , lNm}Codebooks +Duration

Stage2:LinguisticlinkingISSUE 3:FRAME-BY-FRAME PROCESSING?

… oTo1 o3o2 … …otn… …

ot(n+1)

Mora 1 Mora 2

VQ-VAEdecoder

{l1p , l2p , l3p , · · · , lNp}

{l1m , l2m , · · · , lNm}Codebooks +Duration

Linguisticlinker

Linguisticfeatures

Stage2:LinguisticlinkingISSUE 3:FRAME-BY-FRAME PROCESSING?

{l1p , l2p , l3p , · · · , lNp}

{l1m , l2m , · · · , lNm}

RNN-basedsequentialclassifier

• ClockworkRNN[30]• Highway[31]

• AR&feedbacklinks• Dropout[32](Sec.7.2.2)

[30] J.Koutnik,K.Greff,F.Gomez,andJ.Schmidhuber.AClockworkRNN.InProc.ICML,pages1863– 1871,2014.[31] K.Greff,R.K.Srivastava,andJ.Schmidhuber.Highwayandresidualnetworkslearnunrollediterativeestimation.InProc.ICLR,2017.[32] N.Srivastava,G.Hinton,A.Krizhevsky,I.Sutskever,andR.Salakhutdinov.Dropout:Asimplewaytopreventneuralnetworksfromoverfitting.TheJournalofMachine

LearningResearch,15(1):1929–1958, 2014.

Linguisticfeatures

F0modelingforTTSISSUE 3:FRAME-BY-FRAME PROCESSING?

Encoders&codebooks

Decoder

Trainingset

Codeindices

Linker

Linguisticfeatures

Stage1:VQ-VAE Stage2:Linker

Testset

Linker

Linguisticfeatures

Decoder

Codebooks

F0modelingforTTS

Experimentsonstage1(sec.7.3.2)ExperimentsonwholemodelqModels

• Givennaturalduration

Model1 Model3

Frame-by-frameLinker

VQ-VAEdecoder(phone-level)

Phone-by-phoneLinker

VQ-VAEdecoder(phone-mora-level)

Phone-by-phoneLinker+Moralock

… oTo1 o3o2 … …otn ot(n+1)… … … oTo1 o3o2 … …otn ot(n+1)… … … oTo1 o3o2 … …otn ot(n+1)… …Model2

Experimentsq Objectiveresults

RMSE CORR U/V Timecost(s/epoch)

Model1 34.3 0.839 7.96% 1300

Model2 27.1 0.906 6.36% 54

Model3 25.5 0.916 4.87% 65

Model1 Model2 Model3… oTo1 o3o2 … …otn ot(n+1)… … … oTo1 o3o2 … …otn

ot(n+1)… … … oTo1 o3o2 … …otn ot(n+1)… …

VAEvsDARq Objectiveresults

• VQ-VAEencoderneedstimeandmemory

RMSE CORR U/VNumberofparameters(m) Timecost (s/epoch)

Stage 1 Stage2 Sum Stage 1 Stage2 Sum

DAR 28.3 0.903 3.46%

VAEmodel3 25.5 0.916 4.87%

DAR VAE(model3)… oTo1 o3o2 … …otn ot(n+1)… … … oTo1 o3o2 … …otn ot(n+1)… …

Stage1

Stage2

… oTo1 o3o2 … …otn… … ot(n+1)

z1p z2p z3p

ISSUE 3:FRAME-BY-FRAME PROCESSING?VAEvsDARq Subjectivetest

• 500testutterances, >1000setsofscores

Natural DAR SAR eSAR WaveNet-F0 VAE (model 3)Natural <1e-10 <1e-10 <1e-10 <1e-10 <1e-10DAR <1e-10 <1e-10 9.062e-08 0.000102 0.392125SAR <1e-10 <1e-10 0.007186 0.000176 <1e-10eSAR <1e-10 9.062e-08 0.007186 0.219733 5.946e-10WaveNet-F0 <1e-10 0.000102 0.000176 0.219733 2.635e-06VAE (model 3) <1e-10 0.392125 <1e-10 5.946e-10 2.635e-06

3.7323.813 3.803

p-value>0.05

MOStest

p-value>0.05

Natural DAR SAR eSAR WaveNet-F0 VAE(model3)

ISSUE 3:FRAME-BY-FRAME PROCESSING?Summaryq Answertoissue3

q Results• Multiplelinguistictiers• MoreefficientthanDAR:smaller+faster+F0CORR>0.91• Interpretablelatentcodespaces(Sec.7.3.2)• RandomsamplingOK(slides)

Itcouldbemoreefficient

LinkerLinguisticfeatures

VQVAEDecoderCodebooks F0

unit-by-unit frame-by-frame

q Introduction

q Issue3:frame-by-frameprocessing

q Conclusion

CONTENTS

Summary

CONCLUSION

xTx1 x2 x3

bsTboTbo3

bs3bs2bs1bo1 bo2

PreliminaryIssue1:

JointmodelingforF0?

Chapter1-3 Chapter4

? JointmodelingofF0andspectral?✕ Sub-optimalforF0modeling• Methods:

o Highwaynetworkso Histogram+sensitivityanalysis

• Results:o Spectralfeaturesareprioritizedo Differentinput/hiddenfeaturesforF0

andspectralfeatures

Summary

CONCLUSION

… xTx1 x2 x3

boTbo3bo1 bo2

PreliminaryIssue1:

JointmodelingforF0?

Chapter1-3 Chapter4

? TemporaldependencyinRNN/RMDN?✕ IgnoredbyRNN/RMDN• Models:

o SAR:tractabledependencyo DAR:non-linear+longerdependency

• Results:o DAR:F0CORR>0.90,MOSscore⬆o DARsupportsrandomsampling!

Issue2:Temporal

dependency?

Chapter5-6

Summary

CONCLUSION

… boTbo3bo1 bo2

PreliminaryIssue1:

JointmodelingforF0?

Chapter1-3 Chapter4

? Frame-by-frameprocessing✕ Inefficient• Two-stageF0model:

o F0contourcoding:VQ-VAE+DARo Linguisticlinking:unit-by-unitclassifier

• Results:o F0CORR>0.91⬆o Moreefficient(smaller,faster)

… xTx1 x2 x3

Issue2:Temporal

dependency?

Chapter5-6

Issue3:Frame-by-frame?

Chapter7

x1p x2p

Phone 1 Phone 2

Summary

Chapter8

Newtopics?q Towardscompleteprosodymodeling

• NotonlyF0butalsoduration• Easyforjointmodeling

qWaveformmodeling• F0andwaveformare1dimensionalsignals• Signalprocessingmethodsavailable

SAR+logarearatio+segment-variantfilters

CONCLUSION

LinkerLinguisticfeatures

Latentcodeperphone

Durationofphone

Thankyouforyourattention

Codes,scripts,slides:tonywangx.github.io

q NeuralF0modeling

ü Meetbasicrequirementonpublication:3journals+6conferences

THESIS OUTLINE

Chapter1-3Introduction, TTS,neuralnetworks

Chapter 8Summary and application of F0 model

Chapter4Investigationusinghighwaynetwork

Chapter5ShallowautoregressiveF0model(SAR)

ExtendedSAR

Chapter7Variationalauto-encoder(VAE)+DAR

ICASSP2018

Speechsynthesisworkshop2016,

SpeechCommunication, Vol96,Feb.2018,pp1-9

ICASSP2017

Interspeech2017

IEEETrans.ASLP, accepted,2018Chapter6DeepautoregressiveF0model (DAR)

Prepareajournalpaper

qWorknotincludedinPh.D.thesis• Wordembeddingasinputfeatures(IEICE2018,Interpseech2016)• Impactoftrainingdatasize(SSW2016)

q Collaboratedwork• DARusingmanually-annotated/corrupteddata(Interspeech 2018)

ProvideDAR,SAR,andWaveNet-vocoder

• Voicecloningusingfounddata(Odyssey2018)ProvideWaveNet-vocoder,andacousticmodels

• Cyborgspeech:multilingualspeechsynthesis(ICASSP2018)Provideacousticmodels

• SpeechsynthesisfromMFCC(ICASSP2018)ProvideDARonMFCC

• Controllablespeechsynthesis(Interspeech 2017)Provideacousticmodels

WORK DURING PH.D.

Fundamental Frequency Modeling for Neural-Network-based ...tonywangx.github.io/pdfs/E4.pdf ·...

Documents

Transcript of Fundamental Frequency Modeling for Neural-Network-based ...tonywangx.github.io/pdfs/E4.pdf ·...

UNIT-IV HARMONICS Fundamental Frequency and Harmonics

Fundamental steps inartificial neural networks-based medical diagnosis – Pubrica

A Neural Network based Integer Frequency Offset Estimation ...

The vocal fundamental frequency in psychophysiological ...ubt.opus.hbz-nrw.de/volltexte/2015/909/pdf/Dissertation_PleinDe.pdf · The vocal fundamental frequency in psychophysiological

Neural predictive controller of a two-area load frequency ... predictive controller of... · ELECTRICAL ENGINEERING Neural predictive controller of a two-area load frequency control

Fundamental Concepts of Time and Frequency Metrology

Multitask Learning for Fundamental Frequency Estimation in ...

Neural Network Based Load Frequency Control for Restructuring

Time-frequency analysis using Bayesian regularized neural ...cdn.intechweb.org/pdfs/11962.pdf · Time-frequency analysis using Bayesian regularized neural network model 407 possible

Fundamental Frequency 2003 06

Predicting High-Frequency Stock Market by Neural Networks

Appendix Fundamental Frequency Modeling for Neural ...tonywangx.github.io/pdfs/appendix_highway.pdf · Appendix Fundamental Frequency Modeling for Neural-Network-based Statistical

Generating Music Based on Fundamental Frequency (Matlab)

Vocal Fundamental Frequency and Sound Pressure Level in ...

Entrained neural oscillations in multiple frequency bands ...

ISSN Online: Impact Factor: Design and development of ... · PDF fileKeywords: Design, development of ... Proe Solidworks ... Fundamental Natural Frequency The fundamental frequency,

Fundamental intelligencebayanbox.ir/view/3784932824693424466/neural-YazdEng.Ir1.pdf · Fundamental intelligence Machine Learning Computational neural Computational fuzzy Computational

THE FUNDAMENTAL FREQUENCY VARIATION SPECTRUM

Neural reuse: A fundamental organizational principle of ... · Neural reuse: A fundamental ... artiﬁcial intelligence, and philosophy of mind. ... which the prospect of a clear-cut

Comparison of speaking fundamental frequency in English and ...