Fundamental Frequency Modeling for Neural-Network-based Statistical Parametric Speech Synthesis (slides: tonywangx.github.io/pdfs/E4.pdf)

Fundamental Frequency Modeling for Neural-Network-based Statistical Parametric Speech Synthesis

Xin Wang

ID: 20151706, 2018-07-13

Contact: wangxin@nii.ac.jp. We welcome critical comments, suggestions, and discussion.

CONTENTS

• Introduction
  – Background
  – Topic
  – Thesis outline
• Issues and methods
• Summary

(New results / updated explanation; cf. HTS slides, by the HTS Working Group)

INTRODUCTION

Text-to-speech (TTS)
• TTS pipeline [1, 2]
• Synthesizer

Statistical parametric speech synthesizer (SPSS) [3, 4]:
Text → text analyzer → linguistic features → speech synthesizer (acoustic models + vocoder) → speech.
Acoustic features: fundamental frequency (F0) and spectral features.

[1] Taylor, P. (2009). Text-to-Speech Synthesis.
[2] Dutoit, T. (1997). An Introduction to Text-to-Speech Synthesis.
[3] Tokuda, K., et al. (2013). Speech synthesis based on hidden Markov models. Proceedings of the IEEE, 101(5), 1234–1252.
[4] Zen, H., et al. (2009). Statistical parametric speech synthesis. Speech Communication, 51, 1039–1064.

INTRODUCTION

Text-to-speech (TTS)
• Neural-network-based acoustic models [5, 6, 7]

[Figure: neural networks map frame-level linguistic features, organized in phrase, mora, and phone tiers, to acoustic features (spectral features and F0); example utterance 「次は新金岡、新金岡です。」 ("The next stop is Shin-Kanaoka, Shin-Kanaoka.").]

[5] H. Zen, A. Senior, and M. Schuster. Statistical parametric speech synthesis using deep neural networks. In Proc. ICASSP, pages 7962–7966, 2013.
[6] Z. H. Ling, et al. Deep learning for acoustic modeling in parametric speech generation: A systematic review of existing techniques and future trends. IEEE Signal Processing Magazine, 32(3):35–52, 2015.
[7] Y. Fan, Y. Qian, F. Xie, and F. K. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, pages 1964–1968, 2014.

INTRODUCTION

Topic
• Neural F0 modeling for TTS
• Why F0?

[Figure: frame, phone, mora, and phrase tiers of linguistic features are mapped by neural networks to spectral features and F0; example utterance 「次は新金岡、新金岡です。」 ("The next stop is Shin-Kanaoka, Shin-Kanaoka.").]

F0 carries prominence and meaning; a classic ToBI example of pitch-accent placement [8]:

Speaker A: Who made the marmalade?
Speaker B: Marianna made the marmalade.
Speaker A: Bob made the marmalade.
Speaker B: (No,) Marianna made the marmalade.

[8] Nanette Veilleux, et al. 6.911 Transcribing Prosodic Structure of Spoken Utterances with ToBI. January IAP 2006. https://ocw.mit.edu. License: Creative Commons BY-NC-SA.

INTRODUCTION

Topic
• Issues to be addressed

A joint model of F0 features [9] and spectral features given linguistic features [10]:

p(o_{1:T}, s_{1:T} \mid x_{1:T}; \Theta) = \prod_{t=1}^{T} \mathcal{N}([o_t, s_t];\, \mathrm{Network}_{\Theta}(x_{1:T}, t),\, \sigma I)

• Issue 1: joint modeling?
• Issue 2: temporal dependency?
• Issue 3: efficient enough?

[9] M. S. Ribeiro. Suprasegmental representations for the modeling of fundamental frequency in statistical parametric speech synthesis. PhD thesis, The University of Edinburgh, 2018.
[10] J. Hirschberg. Pitch accent in context: predicting intonational prominence from text. Artificial Intelligence, 63(1):305–340, 1993.

INTRODUCTION

Thesis outline
• Conventional approaches [7] (Table 3.1 in thesis)

A recurrent neural network (RNN) with recurrent layers and a feedforward layer maps frame-level linguistic features x_1, ..., x_T (T frames / time steps) to the F0 contour and spectral features.

[7] Y. Fan, Y. Qian, F. Xie, and F. K. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, pages 1964–1968, 2014.

INTRODUCTION

Thesis outline
• Three issues:
  1. Joint modeling?
  2. Temporal dependency?
  3. Is frame-by-frame processing efficient?

INTRODUCTION

Thesis outline (Chapter 4)
• On issue 1: joint modeling of F0 and spectral features?
  – Investigation using highway networks (novel analysis):
    o Spectral features are prioritized
    o Different input/hidden features for F0 and spectral features
  ✕ Sub-optimal for F0 modeling
  ✓ Only F0 as target in the following chapters

INTRODUCTION

Thesis outline (Chapters 5-6)
• On issue 2: temporal dependency?
  – Evidence from random sampling
  ✕ RNN ignores temporal dependency

INTRODUCTION

Thesis outline (Chapters 5-6)
• On issue 2: temporal dependency? (novel models & interpretations)
  ✓ Shallow autoregressive model (SAR): short-term dependency
  ✓ Deep autoregressive model (DAR): longer dependency, best results, and random sampling

INTRODUCTION

Thesis outline (Chapter 7)
• On issue 3: frame-by-frame processing?
  ✕ Inefficient

INTRODUCTION

Thesis outline (Chapter 7)
• On issue 3: frame-by-frame processing?
  ✓ Two-stage model (novel): efficient, interpretable, multi-level
    – Stage 1: F0 contour modeling (unsupervised learning)
    – Stage 2: linguistic linking (fast supervised learning)

INTRODUCTION

Thesis outline:
• Preliminary: Chapters 1-3
• Issue 1, joint modeling for F0? Chapter 4 (highway networks; tools for analysis)
• Issue 2, temporal dependency? Chapter 5 (SAR & extended SAR) and Chapter 6 (DAR & techniques)
• Issue 3, frame-by-frame? Chapter 7 (two-stage F0 model: stage 1 frame-by-frame, stage 2 unit-by-unit)
• Summary: Chapter 8

INTRODUCTION

Updated results in this presentation:
• Chapters 5-6: SAR + log area ratio; additional listening tests
• Chapter 7: listening test

CONTENTS

• Introduction
• Issue 1: joint modeling of F0 and spectral features
• Issues and methods
• Summary

ISSUE 1: JOINT LEARNING FOR F0?

Motivation
• Common approach [5, 7, 11]: joint (multi-task) learning
  ? Beneficial for both targets
  ? Sharing hidden features
• Empirical results against joint learning [5, 12]

True or not? More evidence?

[Figure: one neural network maps linguistic features jointly to spectral features and F0.]

[5] H. Zen, A. Senior, and M. Schuster. Statistical parametric speech synthesis using deep neural networks. In Proc. ICASSP, pages 7962–7966, 2013.
[7] Y. Fan, Y. Qian, F. Xie, and F. K. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, pages 1964–1968, 2014.
[11] H. Zen and A. Senior. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis. In Proc. ICASSP, pages 3844–3848, 2014.
[12] S. Kang and H. Meng. Statistical parametric speech synthesis using weighted multi-distribution deep belief network. In Proc. Interspeech, pages 1959–1963, 2014.

ISSUE 1: JOINT LEARNING FOR F0?

Method
• Joint (multi-task) learning
  ? Beneficial for both targets
  ? Sharing hidden features
• Model and tools:
  o Highway networks [13]
  o Histogram & sensitivity tools

[13] R. K. Srivastava, K. Greff, and J. Schmidhuber. Highway networks. In Proc. Deep Learning Workshop, 2015.

ISSUE 1: JOINT LEARNING FOR F0?

Method
• Definition of the highway network

Feedforward network (weights W^i, b^i and W^o, b^o; \phi is a non-linear activation, tanh in the experiments):
  h_t = \phi(W^i x_t + b^i)
  \hat{o}_t = W^o h_t + b^o

Highway network: a highway gate vector g_t mixes the transformed and untransformed input:
  g_t = \mathrm{sigmoid}(W^g x_t + b^g)
  h_t = \phi(W^i x_t + b^i)
  \hat{o}_t = g_t \odot h_t + (1 - g_t) \odot x_t
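The highway block defined above can be sketched in a few lines of NumPy. This is a minimal illustration, not the thesis implementation: the weight values are random stand-ins, and the tanh activation is assumed from the feedforward layers used elsewhere in the experiments.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def highway_block(x, W_i, b_i, W_g, b_g):
    """One highway block: o = g * h + (1 - g) * x, element-wise."""
    h = np.tanh(W_i @ x + b_i)      # candidate non-linear transform
    g = sigmoid(W_g @ x + b_g)      # highway gate in (0, 1)
    return g * h + (1.0 - g) * x    # gate mixes transform and identity path

rng = np.random.default_rng(0)
dim = 8
x = rng.standard_normal(dim)
W_i, W_g = rng.standard_normal((dim, dim)), rng.standard_normal((dim, dim))
b_i, b_g = np.zeros(dim), np.zeros(dim)
o = highway_block(x, W_i, b_i, W_g, b_g)
print(o.shape)  # (8,)
```

Driving the gate bias strongly negative makes g ≈ 0, so the block passes its input through unchanged; this is exactly the behavior the gate histograms below are designed to detect.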

ISSUE 1: JOINT LEARNING FOR F0?

Method
• Highway network for acoustic modeling
• Acoustic feature streams:
  – MGC: mel-generalized cepstral coefficients [14]
  – BAP: band aperiodicity coefficients [15, 16]
  – F0

Single-stream: linguistic features → one stack of highway blocks → one linear output layer producing MGC, BAP, and F0 together.
Multi-stream: linguistic features → a separate stack of highway blocks and linear output layer per stream (MGC, BAP, F0).

[14] K. Tokuda, T. Kobayashi, T. Masuko, and S. Imai. Mel-generalized cepstral analysis: a unified approach. In Proc. ICSLP, pages 1043–1046, 1994.
[15] H. Kawahara, J. Estill, and O. Fujimura. Aperiodicity extraction and control using mixed mode excitation and group delay manipulation for a high quality speech analysis, modification and synthesis system STRAIGHT. In Second International Workshop on Models and Analysis of Vocal Emissions for Biomedical Applications, 2001.
[16] H. Zen and T. Toda. An overview of Nitech HMM-based speech synthesis system for Blizzard Challenge 2005. In Proc. Interspeech, pages 93–96, 2005.

ISSUE 1: JOINT LEARNING FOR F0?

Method
• Analysis tools:
  1. Histogram of gate vectors: collect {g_1, ..., g_T} over the test set and plot their histogram. For \hat{o} = g \odot h + (1 - g) \odot x, g ≈ 1 means the non-linear transformation is used, and g ≈ 0 means no transformation (the identity path).
  2. Sensitivity of g to different linguistic features (Sec. 4.3.2)

[Figure: example histograms of gate values over [0, 1] for each highway block.]
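Analysis tool 1 amounts to flattening all gate activations and binning them over [0, 1]. A minimal sketch (the gate values here are random stand-ins for the activations a trained network would produce over a test set):

```python
import numpy as np

# Stand-in for the collected gate vectors {g_1, ..., g_T}: sigmoid of
# random pre-activations, shape (frames, gate dimension).
rng = np.random.default_rng(1)
gates = 1.0 / (1.0 + np.exp(-rng.standard_normal((1000, 256))))

# Histogram over [0, 1]: mass near 0 = identity path, near 1 = transform.
counts, edges = np.histogram(gates.ravel(), bins=10, range=(0.0, 1.0))
for lo, hi, c in zip(edges[:-1], edges[1:], counts):
    print(f"[{lo:.1f}, {hi:.1f}): {c}")
```

In the thesis analysis, one such histogram is drawn per highway block (b.1-b.7), which is how the MGC, BAP, and F0 sub-networks are compared below.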

ISSUE 1: JOINT LEARNING FOR F0?

Experiments
• Configuration:
  – Data: English, 16 hours
  – Features: MGC, BAP, F0 (interpolated F0 + voicing (U/V) flag)
  – Metrics: root mean square error (RMSE), correlation coefficient (CORR)
• Three models:
  – Single-stream feedforward network
  – Single-stream highway network
  – Multi-stream highway network
• Two tests:
  1. Fixed layer width, varying network depth
  2. Fixed network depth, varying layer width

ISSUE 1: JOINT LEARNING FOR F0?

Experiments
• Objective results: increasing network depth
  – Network depth: number of tanh-based transformation layers
• Single-stream network prioritizes MGC?

[Figure: MGC RMSE, F0 RMSE (Hz), and F0 correlation (0-1) versus network depth (2, 4, 8, 14, 20, 40) for the single-stream feedforward, single-stream highway, and multi-stream highway networks.]

ISSUE 1: JOINT LEARNING FOR F0?

Experiments
• Objective results: increasing width (depth = 14)
  – MS1: [MGC 256] + [F0 256]
  – MS2: [MGC 382] + [F0 256]
  – MS3: [MGC 512] + [F0 382]
  – MS4: [MGC 768] + [F0 512]
• Single-stream network prioritizes MGC?

[Figure: MGC RMSE, F0 RMSE (Hz), and F0 correlation (0-1) versus number of network weights (3.0e6 to 2.5e7) for single-stream feedforward and highway networks of widths 382-1024 and the multi-stream highway configurations MS1-MS4.]

ISSUE 1: JOINT LEARNING FOR F0?

Experiments
• Histogram of g: multi-stream highway network (depth 14)

[Figure: per-block histograms (b.1-b.7) of test-set gate values for the MGC, BAP, and F0 sub-networks; g ≈ 1 means non-linear transformation, g ≈ 0 means no transformation.]

ISSUE 1: JOINT LEARNING FOR F0?

Experiments
• Histogram of g: multi-stream highway network (depth 14, 7 blocks)
• Different hidden features for MGC and F0

[Figure: per-block histograms (b.1-b.7) of test-set gate values for the MGC, BAP, and F0 sub-networks; g ≈ 1 means non-linear transformation, g ≈ 0 means no transformation.]

ISSUE 1: JOINT LEARNING FOR F0?

Experiments
• Histogram of g: single-stream highway network (depth 14, 7 blocks)
• Similar to the MGC sub-net in the multi-stream highway network
• Single-stream network prioritizes MGC?

[Figure: per-block histograms (b.1-b.7) of test-set gate values for the single-stream network.]

ISSUE 1: JOINT LEARNING FOR F0?

Summary
• Answer to issue 1: joint (multi-task) learning is NOT for the sake of F0 modeling!
  ? Beneficial for both F0 and spectral features
  ? They share hidden features
  Negative evidence:
  – Joint learning (single-stream network) prioritizes spectral features
  – They use different hidden features
  – They use different input features (Sec. 4.4.3)
  – Results hold on English and Japanese corpora (Sec. 4.5)
• Only F0 modeling in the following chapters
• Is F0 useful for MGC modeling, and how? (slides appendix)

CONTENTS

• Introduction
• Issue 1: joint modeling of F0 and spectral features
• Issue 2: temporal dependency modeling of F0 contours
• Issues and methods
• Summary

ISSUE 2: TEMPORAL DEPENDENCY?

Motivation
• Baseline RNN model [17]: a recurrent neural network (RNN) maps linguistic features x_{1:T} = {x_1, ..., x_T} to the F0 contour \hat{o}_{1:T} = {\hat{o}_1, ..., \hat{o}_T}, where T is the number of frames (time steps).

[17] R. Fernandez, et al. Prosody contour prediction with long short-term memory, bi-directional, deep recurrent neural networks. In Proc. Interspeech, pages 2268–2272, 2014.

ISSUE 2: TEMPORAL DEPENDENCY?

Motivation
• Baseline RNN model: \hat{o}_t = H^{(RNN)}_{\Theta}(x_{1:T}, t)
• Does it learn the correlation between o_{t_1} and o_{t_2}, for t_2 ≠ t_1?

ISSUE 2: TEMPORAL DEPENDENCY?

Motivation
• Baseline RNN model as a recurrent mixture density network (RMDN) [18, 19, 20]

Computational part: M_t = {\mu_t}, where \mu_t = H^{(RNN)}_{\Theta}(x_{1:T}, t)

Probabilistic part:
p(o_{1:T} \mid x_{1:T}; \Theta) = \prod_{t=1}^{T} p(o_t \mid x_{1:T}; \Theta) = \prod_{t=1}^{T} \mathcal{N}(o_t; \mu_t, \sigma I)

Mean-based generation: \hat{o}_t = \mathrm{argmax}_{o_t}\, p(o_t \mid x_{1:T}; \Theta^*) = \mu_t

[18] C. M. Bishop. Mixture Density Networks. Technical report, Aston University, 2004.
[19] C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995.
[20] M. Schuster. Better generative models for sequential data problems: Bidirectional recurrent mixture density networks. In Proc. NIPS, pages 589–595, 1999.
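Because the RMDN factorizes over frames, mean-based generation and random sampling behave very differently. A toy sketch (the per-frame means and variance below are invented stand-ins, not values from the thesis model):

```python
import numpy as np

# The "network" output: one Gaussian mean mu_t per frame (toy F0-like values).
rng = np.random.default_rng(2)
T = 200
mu = 200.0 + 50.0 * np.sin(np.linspace(0, 4 * np.pi, T))  # stand-in means (Hz)
sigma = 20.0

mean_based = mu                                # argmax of each per-frame Gaussian
sampled = mu + sigma * rng.standard_normal(T)  # o_t ~ N(mu_t, sigma^2), per frame

# With frame-independent sampling, adjacent frames are uncorrelated, so the
# sampled contour jumps around far more than the smooth mean trajectory.
diff_mean = np.abs(np.diff(mean_based)).mean()
diff_samp = np.abs(np.diff(sampled)).mean()
print(diff_mean < diff_samp)  # True
```

This frame-to-frame jitter is exactly the random-sampling evidence used below to argue that the RNN/RMDN ignores temporal dependency.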

ISSUE 2: TEMPORAL DEPENDENCY?

Motivation
• Initial answer: temporal dependency is ignored by the RNN/RMDN
  – Evidence from random sampling under

p(o_{1:T} \mid x_{1:T}; \Theta) = \prod_{t=1}^{T} p(o_t \mid x_{1:T}; \Theta) = \prod_{t=1}^{T} \mathcal{N}(o_t; \mu_t, \sigma I)

[Figure: F0 (Hz) versus frame index, 100-600, for utterance ATR Ximera F009 AOZORAR 03372 T01: natural F0, RMDN mean-based output, and an RMDN random sample.]

ISSUE 2: TEMPORAL DEPENDENCY?

Motivation
• Initial answer: temporal dependency is ignored by the RNN/RMDN
• Better models? Autoregressive (AR) models [23]:
  – Shallow AR models (SAR) & extended SAR
  – Deep AR models (DAR)

RNN/RMDN: p(o_{1:T}) = \prod_{t=1}^{T} p(o_t)
AR: p(o_{1:T}) = \prod_{t=1}^{T} p(o_t \mid o_{1:t-1})

[23] B. Frey. Graphical Models for Machine Learning and Digital Communication. A Bradford Book, 1998.

CONTENTS

• Introduction
• Issue 1: joint modeling of F0 and spectral features
• Issue 2: temporal dependency modeling of F0 contours
  – SAR and extension
  – DAR
• Issues and methods
• Summary

ISSUE 2: TEMPORAL DEPENDENCY?

SAR
• Definition:

p(o_{1:T} \mid x_{1:T}) = \prod_{t=1}^{T} p(o_t \mid o_{t-K:t-1}, x_{1:T})

ISSUE 2: TEMPORAL DEPENDENCY?

SAR
• Definition:

p(o_{1:T} \mid x_{1:T}; \Theta, \Phi) = \prod_{t=1}^{T} p(o_t \mid o_{t-K:t-1}, x_{1:T}; \Theta, \Phi)
                                     = \prod_{t=1}^{T} \mathcal{N}(o_t;\, \mu_t + f_{\Phi}(o_{t-K:t-1}),\, \Sigma_t)

f_{\Phi}(o_{t-K:t-1}) = \sum_{k=1}^{K} a_k \odot o_{t-k} + b,   \Phi = {a_1, ..., a_K, b}

• Trainable parameters \Phi, time-invariant
• K: hyper-parameter (e.g. K = 2)
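Mean-based generation with the SAR definition above is just a linear AR filter added to the per-frame means. A minimal sketch with scalar F0 values (the means, coefficients a_k, and bias b are toy values, not trained parameters; K = 2 as in the slide's example):

```python
import numpy as np

def sar_generate(mu, a, b):
    """Mean-based SAR generation: o_t = mu_t + sum_k a_k * o_{t-k} + b."""
    K = len(a)
    o = np.zeros_like(mu)
    for t in range(len(mu)):
        # Sum over the K previous outputs that actually exist (t-k >= 0).
        ar = sum(a[k] * o[t - 1 - k] for k in range(K) if t - 1 - k >= 0)
        o[t] = mu[t] + ar + b
    return o

mu = np.array([1.0, 0.5, 0.2, 0.1, 0.0])   # stand-in network means
o = sar_generate(mu, a=[0.5, 0.2], b=0.0)  # K = 2 AR filter
print(o)
```

Each output feeds back into the next step, which is what gives SAR its short-term temporal dependency on top of the RMDN means.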

ISSUE 2: TEMPORAL DEPENDENCY?

SAR
• Interpretation 1 (Sec. 5.3): the AR links add direct dependencies among o_1, ..., o_T.
• Interpretation 2 (Sec. 5.3): the AR part acts as a filter; SAR modeling of o_{1:T} equals modeling the filtered signal c_{1:T}:

p(o_{1:T} \mid x_{1:T}; \Theta, \Phi) = \prod_{t=1}^{T} p(o_t \mid o_{t-K:t-1}, x_{1:T}; \Theta, \Phi)
                                     = \prod_{t=1}^{T} \mathcal{N}(o_t;\, \mu_t + f_{\Phi}(o_{t-K:t-1}),\, \Sigma_t)
                                     = \prod_{t=1}^{T} p_c(c_t \mid x_{1:T}; \Theta),
where c_t = o_t - \sum_{k=1}^{K} a_k \odot o_{t-k}.

ISSUE 2: TEMPORAL DEPENDENCY?

SAR
• Interpretation (Sec. 5.3): SAR = linear transformation + RMDN
  – Training: transform o_{1:T} to c_t = o_t - \sum_{k=1}^{K} a_k \odot o_{t-k}, then model \prod_{t=1}^{T} p(c_t \mid x_{1:T}; \Theta) with an RMDN.
  – Generation: produce \hat{c}_{1:T} from the RMDN, then de-transform \hat{o}_t = \hat{c}_t + \sum_{k=1}^{K} a_k \odot \hat{o}_{t-k}.

ISSUE 2: TEMPORAL DEPENDENCY?

Extended SAR (eSAR)
• Motivation: SAR → non-linear transformation + RMDN?
• Yes, if f_{\Phi} is invertible and `simple':
  – Training: c_{1:T} = f_{\Phi}(o_{1:T}), then model \prod_{t=1}^{T} p(c_t \mid x_{1:T}; \Theta) with an RMDN
  – Generation: \hat{o}_{1:T} = f_{\Phi}^{-1}(\hat{c}_{1:T})

p_o(o_{1:T} \mid x_{1:T}; \Theta, \Phi) = p_c(c_{1:T} = f_{\Phi}(o_{1:T}) \mid x_{1:T}; \Theta)\, \left| \det \frac{\partial f_{\Phi}(o_{1:T})}{\partial o_{1:T}} \right|

ISSUE 2: TEMPORAL DEPENDENCY?

Extended SAR (eSAR)
• Definition: a volume-preserving transformation [24], i.e. \det \frac{\partial f_{\Phi}(o_{1:T})}{\partial o_{1:T}} = 1, so

p_o(o_{1:T} \mid x_{1:T}; \Theta, \Phi) = p_c(c_{1:T} = f_{\Phi}(o_{1:T}) \mid x_{1:T}; \Theta)

SAR:  c_t = o_t - \sum_{k=1}^{K} a_k \odot o_{t-k};   \hat{o}_t = \hat{c}_t + \sum_{k=1}^{K} a_k \odot \hat{o}_{t-k}
eSAR: c_t = o_t - RNN_{\Phi}(o_{1:t-1}, t);   \hat{o}_t = \hat{c}_t + RNN_{\Phi}(\hat{o}_{1:t-1}, t)

[24] J. M. Tomczak and M. Welling. Improving variational auto-encoders using convex combination linear inverse autoregressive flow. In Proc. Benelearn, pages 162–194, 2017.

ISSUE 2: TEMPORAL DEPENDENCY?

Extended SAR (eSAR)
• Implementation (a normalizing flow):
  – Transformation: c_t = o_t - \mu_t, where \mu_t = RNN_{\Phi}(o_{1:t-1}, t)
  – Modeling: p_c(c_{1:T} \mid x_{1:T}; \Theta)
  – Generation: draw \hat{c}_{1:T} ~ p_c(c_{1:T} \mid x_{1:T}; \Theta)
  – De-transformation: \hat{o}_t = \hat{c}_t + \hat{\mu}_t, where \hat{\mu}_t = RNN_{\Phi}(\hat{o}_{1:t-1}, t)

p_o(o_{1:T} \mid x_{1:T}; \Theta, \Phi) = p_c(c_{1:T} = f_{\Phi}(o_{1:T}) \mid x_{1:T}; \Theta)
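The eSAR transform is volume-preserving and exactly invertible because the shift at step t depends only on outputs before t. A toy sketch (the "RNN" predicting the shift is replaced here by a fixed hand-made function, so this only illustrates the flow structure, not a trained model):

```python
import numpy as np

def predictor(past):
    """Stand-in for RNN(o_{1:t-1}): predicts the shift mu_t from the past."""
    return 0.5 * past[-1] if len(past) else 0.0

def forward(o):
    """c_t = o_t - mu_t; triangular dependency, so the Jacobian det is 1."""
    return np.array([o[t] - predictor(o[:t]) for t in range(len(o))])

def inverse(c):
    """o_t = c_t + mu_t, computed from the already-reconstructed o_{1:t-1}."""
    o = []
    for t in range(len(c)):
        o.append(c[t] + predictor(o))
    return np.array(o)

o = np.array([1.0, 2.0, 1.5, 1.0])
c = forward(o)
print(np.allclose(inverse(c), o))  # True: the transform is exactly invertible
```

Because each step only subtracts a quantity computed from earlier frames, the Jacobian is lower-triangular with unit diagonal, which is what makes the determinant exactly 1.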

ISSUE 2: TEMPORAL DEPENDENCY?

Summary of SAR
• Theoretically appealing: it connects signal processing (real-valued poles, log area ratio (Sec. 5.3.2), segment-variant filters (slides)) and machine learning (volume-preserving and non-volume-preserving flows (slides), ...)
• Performance for F0 modeling (model details later): listening-test preference eSAR 55.70% vs SAR 44.30%; SAR benefits from stable filters
• Performance for MGC modeling: better than RNN/RMDN [25, 26]

[25] X. Wang, J. Lorenzo-Trueba, S. Takaki, L. Juvela, and J. Yamagishi. A comparison of recent waveform generation and acoustic modeling methods for neural-network-based speech synthesis. In Proc. ICASSP, pages 4804–4808, 2018.
[26] X. Wang, S. Takaki, and J. Yamagishi. An autoregressive recurrent mixture density network for parametric speech synthesis. In Proc. ICASSP, pages 4895–4899, 2017.

CONTENTS

• Introduction
• Issue 1: joint modeling of F0 and spectral features
• Issue 2: temporal dependency modeling of F0 contours
  – SAR and extension
  – DAR
• Issues and methods
• Summary

ISSUE 2: TEMPORAL DEPENDENCY?

DAR
• Motivation: are SAR and eSAR sufficiently good?

[Figure: F0 (Hz) versus frame index, 100-600, for utterance ATR Ximera F009 AOZORAR 03372 T01: natural F0 versus a random sample from SAR, and natural F0 versus a random sample from eSAR.]

ISSUE 2: TEMPORAL DEPENDENCY?

DAR
• Motivation: why is random sampling from SAR and eSAR still limited?
  – SAR: the transformation c_{1:T} = f_{\Phi}(o_{1:T}) is linear (Sec. 6.1)
  – eSAR: non-linear, but of a special (invertible) form
• Why not a non-linear and non-invertible AR transformation, i.e. a network with AR dependency mapping o_{1:T} and x_{1:T} directly?

ISSUE 2: TEMPORAL DEPENDENCY?

DAR
• Definition:

P(o_{1:T} \mid x_{1:T}; \Theta) = \prod_{t=1}^{T} P(o_t \mid o_{1:t-1}, x_{1:T}; \Theta)

• DAR is more general than SAR (Sec. 6.2.2 & slides, toy examples):
  1. Longer-time dependency
  2. Non-linear dependency

ISSUE 2: TEMPORAL DEPENDENCY?

DAR
• Implementation (Sec. 6.3):
  – Quantized F0 + hierarchical softmax
  – Data dropout
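Generation with a DAR over quantized F0 is categorical sampling, one frame at a time, conditioned on the previous output. A toy sketch of that loop (the scoring function standing in for the trained network, the 8 quantization bins, and the per-frame targets are all invented for illustration; the thesis uses 256 bins and a hierarchical softmax):

```python
import numpy as np

rng = np.random.default_rng(3)
n_bins = 8  # toy stand-in for the 256 F0 quantization bins

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def scores(prev_bin, x_t):
    # Toy scorer: favor bins near the previous output (AR dependency)
    # and near the linguistic-context target x_t.
    bins = np.arange(n_bins)
    return -0.5 * (bins - prev_bin) ** 2 - 0.2 * (bins - x_t) ** 2

x = [2, 3, 5, 6, 6, 4]  # stand-in per-frame linguistic targets
prev, out = 2, []
for x_t in x:
    p = softmax(scores(prev, x_t))
    prev = rng.choice(n_bins, p=p)  # sample o_t ~ P(o_t | o_{1:t-1}, x_{1:T})
    out.append(int(prev))
print(out)
```

Feeding each sampled bin back into the next step is what lets the sampled contour stay locally smooth, unlike the frame-independent RMDN samples shown earlier.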

ISSUE 2: TEMPORAL DEPENDENCY?

Experiment
• Data and features:
  – Data: Japanese, 48 hours
  – Features: F0 (interpolated F0 value, 1 dim + U/V, 1 dim)
• Models (layer sizes 512, 512, 256, 128; FF: feedforward with tanh activation):
  – RNN: linguistic features → FF layers → bi-LSTM layers → linear output → F0
  – RMDN: the same trunk with a 2-mixture GMM output
  – SAR: the RMDN plus the shallow AR filter
  – eSAR: the RMDN plus normalizing-flow steps (each with a uni-LSTM of size 64 and a linear layer of size 2)

ISSUE 2: TEMPORAL DEPENDENCY?

Experiment
• Data and features:
  – Data: Japanese, 48 hours
  – Features: quantized F0 (256 quantization bins)
• Models (layer sizes 512, 512, 256, 128, 256; FF: feedforward with tanh activation):
  – DAR: linguistic features → FF layers → bi-LSTM → uni-LSTM → linear → hierarchical softmax → quantized F0
  – WaveNet-F0: a WaveNet-style model over quantized F0, with FF and bi-LSTM layers processing the linguistic features and a hierarchical softmax output

ISSUE 2: TEMPORAL DEPENDENCY?

Experiment
• Random sampling

[Figure: F0 (Hz) versus frame index, 100-600, for utterance ATR Ximera F009 AOZORAR 03372 T01: natural F0 versus sampled outputs from SAR and eSAR, and versus DAR samples 1, 2, and 3.]

ISSUE 2: TEMPORAL DEPENDENCY?

Experiment
• Mean-based generation

[Figure: F0 (Hz) versus frame index, 100-600, for utterance ATR Ximera F009 AOZORAR 03372 T01: natural F0 versus mean-based outputs from RNN, RMDN, SAR, eSAR, WaveNet-F0, and DAR.]

ISSUE 2: TEMPORAL DEPENDENCY?

Experiment
• Mean-based generation: 500 test utterances, >1000 sets of scores

MOS test 1:
  Natural 4.243 | DAR 3.856 | SAR 3.564 | RMDN 3.628 | RNN 3.565

Pairwise p-values (MOS test 1; p > 0.05 means no significant difference):
           Natural  DAR      SAR      RMDN     RNN
  Natural  -        <1e-10   <1e-10   <1e-10   <1e-10
  DAR      <1e-10   -        <1e-10   <1e-10   <1e-10
  SAR      <1e-10   <1e-10   -        0.01785  0.7429
  RMDN     <1e-10   <1e-10   0.01785  -        0.00426
  RNN      <1e-10   <1e-10   0.7429   0.00426  -

MOS test 2:
  Natural 4.199 | DAR 3.971 | SAR 3.732 | eSAR 3.813 | WaveNet-F0 3.803

Pairwise p-values (MOS test 2):
              Natural  DAR        SAR       eSAR       WaveNet-F0
  Natural     -        <1e-10     <1e-10    <1e-10     <1e-10
  DAR         <1e-10   -          <1e-10    9.062e-08  0.000102
  SAR         <1e-10   <1e-10     -         0.007186   0.000176
  eSAR        <1e-10   9.062e-08  0.007186  -          0.219733
  WaveNet-F0  <1e-10   0.000102   0.000176  0.219733   -

ISSUE 2: TEMPORAL DEPENDENCY?

Summary
• Full answer to issue 2: temporal dependency is ignored by the RNN/RMDN, but better models can be defined.

                 RMDN   SAR            eSAR                       DAR
  AR linear?     -      Linear         Non-linear (constrained)   Non-linear
  AR time span   -      t-K : t-1      1 : t-1                    1 : t-1
  Tractable?     -      Yes            Somewhat                   No
  Sampling?      No     No             No                         Yes

CONTENTS

• Introduction
• Issue 1: joint modeling of F0 and spectral features
• Issue 2: temporal dependency modeling of F0 contours
• Issue 3: frame-by-frame processing
• Summary

ISSUE 3: FRAME-BY-FRAME PROCESSING?

Motivation
• Inefficient processing: the network runs once per frame (o_1, ..., o_T from x_1, ..., x_T), even though the linguistic features are constant within a phone, mora, or phrase.

[Figure: frame, phone, mora, and phrase tiers; each phone (e.g. /ts/, /u/, /g/) spans many identical frame-level feature vectors.]


ISSUE 3: FRAME-BY-FRAME PROCESSING?

Motivation
• More efficient processing? Given the duration of each unit, process one input x^p_n per phone instead of one per frame. How?

ISSUE 3: FRAME-BY-FRAME PROCESSING?

Method
• Two-stage F0 modeling:
  – Stage 1: F0 contour modeling
  – Stage 2: linguistic linking

ISSUE 3: FRAME-BY-FRAME PROCESSING?

Method
• Two-stage F0 modeling revisits linguistic approaches [28, 29]:
  – Stage 1: a vector-quantization variational auto-encoder (VQ-VAE [27]) learns a compact code space for F0 contours
  – Stage 2: a sequential classifier links linguistic features to the codes

[27] A. van den Oord, O. Vinyals, and K. Kavukcuoglu. Neural discrete representation learning. In Proc. NIPS, 2017.
[28] K. E. Dusterhoff, A. W. Black, and P. A. Taylor. Using decision trees within the Tilt intonation model to predict F0 contours. In Proc. Eurospeech, pages 1627–1630, 1999.
[29] K. Hirose, K. Sato, Y. Asano, and N. Minematsu. Synthesis of F0 contours using generation process model parameters predicted from unlabeled corpora: Application to emotional speech synthesis. Speech Communication, 46(3):385–404, 2005.

ISSUE 3: FRAME-BY-FRAME PROCESSING?

Stage 1: F0 contour modeling
• VQ-VAE goals:
  – Unsupervised learning
  – One code for a varied-length unit
  – Multiple linguistic levels
  – A compact code space

ISSUE 3: FRAME-BY-FRAME PROCESSING?

Stage 1: F0 contour modeling
• VQ-VAE, one code per phone: the encoder maps each phone's frames of o_{1:T} to a latent z^p_n, which is quantized to the nearest codebook entry e^p_n; the decoder reconstructs the F0 frames from the quantized codes.
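The quantization step that turns each phone's latent into a discrete code index can be sketched as a nearest-neighbor lookup (toy values: the encoder outputs and codebook here are random stand-ins, with 16 codes of dimension 4 instead of the thesis configuration):

```python
import numpy as np

rng = np.random.default_rng(4)
codebook = rng.standard_normal((16, 4))  # 16 codes, 4-dim latent space
z = rng.standard_normal((3, 4))          # one encoder latent per phone (3 phones)

# Quantize: replace each latent z_n with its nearest codebook entry e_n.
dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
indices = dists.argmin(axis=1)           # discrete code index l_n per phone
quantized = codebook[indices]            # e_n, fed to the decoder

print(indices.shape, quantized.shape)    # (3,) (3, 4)
```

The discrete indices are what stage 2 later predicts from linguistic features, while the decoder only ever sees the corresponding codebook vectors.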

ISSUE 3: FRAME-BY-FRAME PROCESSING?

Stage 1: F0 contour modeling
• VQ-VAE over multiple linguistic tiers: one code per phone and one code per mora
  – A phone-level encoder and codebook produce z^p_n / e^p_n per phone
  – A mora-level encoder and codebook produce z^m_n / e^m_n per mora
  – A single decoder reconstructs the F0 frames from both tiers of codes

ISSUE 3: FRAME-BY-FRAME PROCESSING?

Stage 1: F0 contour modeling
• After training, the VQ-VAE encoder(s) turn an F0 contour into phone code indices {l_{p,1}, l_{p,2}, ..., l_{p,N_p}} and mora code indices {l_{m,1}, ..., l_{m,N_m}}; given the codebooks and the unit durations, the VQ-VAE decoder maps the indices back to F0.

ISSUE 3: FRAME-BY-FRAME PROCESSING?

Stage 2: Linguistic linking
• A linguistic linker predicts the phone and mora code indices from linguistic features; the VQ-VAE decoder (with the codebooks and durations) then generates the F0 contour.

ISSUE 3: FRAME-BY-FRAME PROCESSING?

Stage 2: Linguistic linking
• RNN-based sequential classifier over the phone and mora code indices:
  – Clockwork RNN [30]
  – Highway connections [31]
  – AR & feedback links
  – Dropout [32] (Sec. 7.2.2)

[30] J. Koutnik, K. Greff, F. Gomez, and J. Schmidhuber. A Clockwork RNN. In Proc. ICML, pages 1863–1871, 2014.
[31] K. Greff, R. K. Srivastava, and J. Schmidhuber. Highway and residual networks learn unrolled iterative estimation. In Proc. ICLR, 2017.
[32] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.

ISSUE 3: FRAME-BY-FRAME PROCESSING?

F0 modeling for TTS
• Training: stage 1 (VQ-VAE) learns the encoders, codebooks, and decoder from training-set F0; stage 2 (linker) learns to predict the code indices from linguistic features.
• Test: linguistic features → linker → code indices → decoder (with codebooks) → F0.

Experimentsonstage1(sec.7.3.2)ExperimentsonwholemodelqModels

• Givennaturalduration

ISSUE 3:FRAME-BY-FRAME PROCESSING?

70

Model1 Model3

Frame-by-frameLinker

VQ-VAEdecoder(phone-level)

VQ-VAEdecoder(phone-level)

Phone-by-phoneLinker

VQ-VAEdecoder(phone-mora-level)

Phone-by-phoneLinker+Moralock

… oTo1 o3o2 … …otn ot(n+1)… … … oTo1 o3o2 … …otn ot(n+1)… … … oTo1 o3o2 … …otn ot(n+1)… …Model2

ISSUE 3: FRAME-BY-FRAME PROCESSING?

Experiments
• Objective results:

           RMSE   CORR    U/V error   Time cost (s/epoch)
  Model 1  34.3   0.839   7.96%       1300
  Model 2  27.1   0.906   6.36%       54
  Model 3  25.5   0.916   4.87%       65

ISSUE 3: FRAME-BY-FRAME PROCESSING?

VAE vs DAR
• Objective results (note: the VQ-VAE encoder itself needs time and memory):

                 RMSE   CORR    U/V
  DAR            28.3   0.903   3.46%
  VAE (model 3)  25.5   0.916   4.87%

Time cost: DAR ~1300 s/epoch overall, versus 65 s/epoch for the VAE linker (stage 2); stage 1 (the VQ-VAE) accounts for most of the VAE model's training cost. In number of parameters, the two-stage VAE model (stage 1 + stage 2) is smaller than DAR.

ISSUE 3: FRAME-BY-FRAME PROCESSING?

VAE vs DAR
• Subjective test: 500 test utterances, >1000 sets of scores

MOS test:
  Natural 4.199 | DAR 3.971 | SAR 3.732 | eSAR 3.813 | WaveNet-F0 3.803 | VAE (model 3) 3.993

Pairwise p-values (p > 0.05 means no significant difference):
                 Natural  DAR        SAR       eSAR       WaveNet-F0  VAE (model 3)
  Natural        -        <1e-10     <1e-10    <1e-10     <1e-10      <1e-10
  DAR            <1e-10   -          <1e-10    9.062e-08  0.000102    0.392125
  SAR            <1e-10   <1e-10     -         0.007186   0.000176    <1e-10
  eSAR           <1e-10   9.062e-08  0.007186  -          0.219733    5.946e-10
  WaveNet-F0     <1e-10   0.000102   0.000176  0.219733   -           2.635e-06
  VAE (model 3)  <1e-10   0.392125   <1e-10    5.946e-10  2.635e-06   -

ISSUE 3: FRAME-BY-FRAME PROCESSING?

Summary
• Answer to issue 3: frame-by-frame processing could be more efficient.
• How: linguistic features → linker (unit-by-unit) → VQ-VAE decoder with codebooks (frame-by-frame) → F0
• Results:
  – Multiple linguistic tiers
  – More efficient than DAR: smaller + faster + F0 CORR > 0.91
  – Interpretable latent code spaces (Sec. 7.3.2)
  – Random sampling OK (slides)

CONTENTS

• Introduction
• Issue 1: joint modeling of F0 and spectral features
• Issue 2: temporal dependency modeling of F0 contours
• Issue 3: frame-by-frame processing
• Conclusion

CONCLUSION

Summary
• Preliminary: Chapters 1-3
• Issue 1 (Chapter 4): joint modeling of F0 and spectral features?
  ✕ Sub-optimal for F0 modeling
  – Methods: highway networks; histogram + sensitivity analysis
  – Results: spectral features are prioritized; different input/hidden features for F0 and spectral features

CONCLUSION

Summary
• Issue 2 (Chapters 5-6): temporal dependency in RNN/RMDN?
  ✕ Ignored by RNN/RMDN
  – Models: SAR (tractable dependency); DAR (non-linear + longer dependency)
  – Results: DAR reaches F0 CORR > 0.90 with higher MOS scores; DAR supports random sampling!

CONCLUSION

Summary
• Issue 3 (Chapter 7): frame-by-frame processing
  ✕ Inefficient
  – Two-stage F0 model:
    o F0 contour coding: VQ-VAE + DAR
    o Linguistic linking: unit-by-unit classifier
  – Results: F0 CORR > 0.91; more efficient (smaller, faster)
• Summary: Chapter 8

CONCLUSION

New topics?
• Towards complete prosody modeling:
  – Not only F0 but also duration (a latent code per phone plus the duration of the phone, linked from linguistic features)
  – Easy for joint modeling
• Waveform modeling:
  – F0 and waveforms are both 1-dimensional signals
  – Signal processing methods are available: SAR + log area ratio + segment-variant filters

Thank you for your attention

Q & A

Codes, scripts, slides: tonywangx.github.io

THESIS OUTLINE

• Neural F0 modeling
✓ Meets the basic publication requirement: 3 journal papers + 6 conference papers

• Chapters 1-3: Introduction, TTS, neural networks
• Chapter 4: Investigation using highway networks (Speech Synthesis Workshop 2016; Speech Communication, Vol. 96, Feb. 2018, pp. 1-9)
• Chapter 5: Shallow autoregressive F0 model (SAR) (ICASSP 2017); extended SAR (ICASSP 2018)
• Chapter 6: Deep autoregressive F0 model (DAR) (Interspeech 2017; IEEE Trans. ASLP, accepted, 2018)
• Chapter 7: Variational auto-encoder (VAE) + DAR (journal paper in preparation)
• Chapter 8: Summary and application of the F0 model

WORK DURING PH.D.

• Work not included in the Ph.D. thesis:
  – Word embeddings as input features (IEICE 2018, Interspeech 2016)
  – Impact of training data size (SSW 2016)
• Collaborated work:
  – DAR using manually-annotated/corrupted data (Interspeech 2018): provided DAR, SAR, and the WaveNet vocoder
  – Voice cloning using found data (Odyssey 2018): provided the WaveNet vocoder and acoustic models
  – Cyborg speech: multilingual speech synthesis (ICASSP 2018): provided acoustic models
  – Speech synthesis from MFCC (ICASSP 2018): provided DAR on MFCC
  – Controllable speech synthesis (Interspeech 2017): provided acoustic models