Voice Conversion
Dr. Elizabeth Godoy
Speech Processing Guest Lecture
December 11, 2012
E. Godoy: Biography
I. United States (Rhode Island): native hometown Middletown, RI; undergrad & Masters at MIT (Boston area)
II. France (Lannion): 2007-2011, PhD work at Orange Labs
III. Iraklio: current work at FORTH with Prof. Stylianou
Professional Background
I. B.S. & M.Eng. from MIT, Electrical Engineering, specialty in Signal Processing
   - Underwater acoustics: target physics, environmental modeling, torpedo homing
   - Antenna beamforming (Masters): wireless networks
II. PhD in Signal Processing at Orange Labs
   - Speech Processing: Voice Conversion
   - Speech Synthesis Team (Text-to-Speech); focus on spectral envelope transformation
III. Post-Doctoral Research at FORTH
   - LISTA: Speech in Noise & Intelligibility
   - Analyses of human speaking styles (e.g. Lombard, clear); speech modifications to improve intelligibility
Today’s Lecture: Voice Conversion
I. Introduction to Voice Conversion
   - Speech Synthesis Context (TTS)
   - Overview of Voice Conversion
II. Spectral Envelope Transformation in VC
   - Standard: Gaussian Mixture Model
   - Proposed: Dynamic Frequency Warping + Amplitude Scaling
III. Conversion Results
   - Objective Metrics & Subjective Evaluations
   - Sound Samples
IV. Summary & Conclusions
Today’s Lecture: Voice Conversion
I. Introduction to Voice Conversion
   - Speech Synthesis Context (TTS)
   - Overview of Voice Conversion
Voice Conversion (VC)
Transform the speech of a (source) speaker so that it sounds like the speech of a different (target) speaker.
[Cartoon: the source says "This is awesome!"; after Voice Conversion the same sentence sounds like the target: "He sounds like me!"]
Context: Speech Synthesis
Increase in applications using speech technologies: cell phones, GPS, video gaming, customer service apps…
Information is communicated through speech!
Text-to-Speech (TTS) Synthesis: generate speech from a given text.
[Cartoon examples: "Turn left!", "Insert your card.", "Next Stop: Lannion", "This is Abraham Lincoln speaking… Ha ha!"]
Text-to-Speech (TTS) Systems
TTS Approaches
1. Concatenative: speech synthesized from recorded segments
   - Unit-Selection: parts of speech chosen from corpora & strung together
   - High-quality synthesis, but need to record & process corpora
2. Parametric: speech generated from model parameters
   - HMM-based: speaker models built from speech using linguistic info
   - Limited quality due to simplified speech modeling & statistical averaging
Concatenative or Parametric?
Text-to-Speech (TTS) Example
[Audio example of a synthesized voice]
Voice Conversion: TTS Motivation
Concatenative speech synthesis: high-quality speech, but a large corpus must be recorded & processed for each voice.
Voice Conversion: create different voices by speech-to-speech transformation; focus on the acoustics of voices.
What gives a voice an identity?
"Voice" notion of identity (voice rather than speech).
Characterize speech on different levels:
1. Segmental
   - Pitch: fundamental frequency
   - Timbre: distinguishes between different types of sounds
2. Supra-Segmental
   - Prosody: intonation & rhythm of speech
Goals of Voice Conversion
1. Synthesize high-quality speech: maintain the quality of the source speech (limit degradations).
2. Capture the target speaker identity: requires learning mappings between source & target features.
A difficult task! Significant modifications of the source speech are needed, which risk severely degrading speech quality…
Stages of Voice Conversion
Key parameter: the spectral envelope (relation to timbre).
[Block diagram: source & target speech corpora → parameter extraction → learning → transformation → synthesis of converted speech]
Three stages: 1) Analysis, 2) Learning, 3) Transformation
LEARNING: generate acoustic feature spaces; establish mappings between source & target parameters.
TRANSFORMATION: classify source parameters in the feature space; apply the transformation function.
Today’s Lecture: Voice Conversion
I. Introduction to Voice Conversion
II. Spectral Envelope Transformation in VC
   - Standard: Gaussian Mixture Model
   - Proposed: Dynamic Frequency Warping + Amplitude Scaling
The Spectral Envelope
Spectral envelope: a curve approximating the DFT magnitude spectrum.
Related to voice timbre; plays a key role in many speech applications: coding, recognition, synthesis, voice transformation/conversion.
Voice Conversion: important for both speech quality and voice identity.
[Figure: spectral envelope overlaid on a DFT magnitude spectrum, 0-8000 Hz, dB scale]
Spectral Envelope Parameterization
Two common methods:
1) Cepstrum
   - Discrete cepstral coefficients
   - Mel-Frequency Cepstral Coefficients (MFCC): change the frequency scale to reflect bands of human hearing
2) Linear Prediction (LP)
   - Line Spectral Frequencies (LSF)
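As a quick illustration of the cepstrum route above, here is a minimal sketch of envelope estimation by low-quefrency liftering of the real cepstrum (the lecture's discrete cepstral coefficients are instead fit to harmonic peaks; the frame, FFT size, and order below are made-up example values):

```python
import numpy as np

def cepstral_envelope(frame, order=40, nfft=1024):
    """Estimate a log-magnitude spectral envelope by keeping only the
    low-quefrency part of the real cepstrum (a sketch, not the lecture's
    exact discrete-cepstrum fit)."""
    spectrum = np.abs(np.fft.rfft(frame, nfft))
    log_mag = np.log(np.maximum(spectrum, 1e-12))
    # Real cepstrum: inverse DFT of the log-magnitude spectrum
    cepstrum = np.fft.irfft(log_mag, nfft)
    # Zero out high quefrencies: fine harmonic structure is smoothed away,
    # leaving the slowly varying envelope
    lifter = np.zeros(nfft)
    lifter[:order + 1] = 1.0
    lifter[-order:] = 1.0
    envelope = np.fft.rfft(cepstrum * lifter).real  # back to log-magnitude
    return envelope

frame = np.random.default_rng(0).standard_normal(400)  # stand-in speech frame
env = cepstral_envelope(frame)
print(env.shape)  # one log-magnitude value per rfft bin
```

Order 40 matches the envelope order used in the evaluations later in the lecture.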
Standard Voice Conversion
Parallel corpora: source & target utter the same sentences.
Parameters are spectral features (e.g. vectors of cepstral coefficients).
Speech frames are aligned in time.
Standard: Gaussian Mixture Model.
[Block diagram: parallel corpora → extract source & target parameters → align source & target frames → LEARNING → TRANSFORMATION → synthesis of converted speech]
Focus: learning & transforming the spectral envelope.
Today’s Lecture: Voice Conversion
II. Spectral Envelope Transformation in VC
Standard: Gaussian Mixture Model
   - Formulation
   - Limitations: 1. acoustic mappings between source & target parameters; 2. over-smoothing of the spectral envelope
Proposed: Dynamic Frequency Warping + Amplitude Scaling
Gaussian Mixture Model for VC
Origins: evolved from "fuzzy" Vector Quantization (i.e. VQ with "soft" classification); originally proposed by [Stylianou et al.; 98]; joint learning of the GMM (most common) by [Kain et al.; 98].
Underlying principle: exploit the joint statistics exhibited by aligned source & target frames.
Methodology: represent the distribution of spectral feature vectors as a mixture of Q Gaussians; the transformation function is then based on the MMSE criterion.
GMM-based Spectral Transformation
1) Align N spectral feature vectors in time (discrete cepstral coefficients):
   source: X = [x_1, …, x_N], target: Y = [y_1, …, y_N], joint: Z = [X; Y]
2) Represent the PDF of the vectors as a mixture of Q multivariate Gaussians:
   p(z) = Σ_{q=1..Q} α_q N(z; μ_q, Σ_q),  with Σ_{q=1..Q} α_q = 1 and α_q ≥ 0
   The parameters {α_q, μ_q, Σ_q}, q = 1..Q, are learned by Expectation Maximization (EM) on Z.
3) Transform source vectors using a weighted mixture of the Maximum Likelihood (ML) estimator for each component:
   ŷ_n = Σ_{q=1..Q} w_q(x_n) [ μ_q^y + Σ_q^{yx} (Σ_q^{xx})^{-1} (x_n − μ_q^x) ]
   w_q(x_n): probability that source frame x_n belongs to the acoustic class described by component q (calculated in decoding).
GMM-Transformation Steps
1) For source frame x_n, we want to estimate the target vector ŷ_n.
2) Classify: calculate w_q(x_n), the probability that source frame x_n belongs to the acoustic class described by component q (decoding step).
3) Apply the transformation function, a weighted sum of the ML estimator for each class:
   ŷ_n = Σ_{q=1..Q} w_q(x_n) [ μ_q^y + Σ_q^{yx} (Σ_q^{xx})^{-1} (x_n − μ_q^x) ]
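The three steps above can be sketched in one dimension; the component parameters below are invented for the example (in a real system they come from EM on the aligned joint vectors z = [x; y]):

```python
import numpy as np

def gaussian(x, mu, var):
    # Scalar Gaussian density
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def convert(x, alphas, mu_x, mu_y, var_xx, cov_yx):
    # Decoding: posterior probability w_q(x) of each acoustic class
    likes = alphas * gaussian(x, mu_x, var_xx)
    w = likes / likes.sum()
    # Per-class ML estimate of y, then posterior-weighted sum
    y_q = mu_y + cov_yx / var_xx * (x - mu_x)
    return np.sum(w * y_q)

# Two toy classes: one near x = 0, one near x = 4
alphas = np.array([0.5, 0.5])
mu_x = np.array([0.0, 4.0])
mu_y = np.array([1.0, 6.0])
var_xx = np.array([1.0, 1.0])
cov_yx = np.array([0.8, 0.8])

print(convert(0.0, alphas, mu_x, mu_y, var_xx, cov_yx))  # close to 1.0: class 1 dominates at x = 0
```

Note that setting cov_yx to zero removes the correction term, leaving just the weighted class means; this is exactly the VQ-type transformation discussed later.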
Acoustically Aligning Source & Target Speech
First question: are the acoustic events from the source & target speech appropriately associated? Time alignment is not the same as acoustic alignment!
One-to-Many Problem: occurs when a single acoustic event of the source is aligned to multiple acoustic events of the target [Mouchtaris et al.; 07]. The problem is that the distinct target events cannot be distinguished given only source information.
   - Source: one single event A_s; target: two distinct events B_t, C_t
   - Joint frames in the acoustic space: [A_s; B_t] and [A_s; C_t]
[Figure: joint-frame distribution in the acoustic space, illustrating the one-to-many mapping of A_s to both B_t and C_t]
Cluster by Phoneme (“Phonetic GMM”)
Motivation: eliminate mixtures to alleviate the one-to-many problem!
Introduce contextual information: the phoneme (unit of speech describing a linguistic sound): /a/, /e/, /n/, etc.
Formulation: cluster frames according to phoneme label. Each Gaussian class q then corresponds to a phoneme, with statistics over the N_q frames for that phoneme:
   μ_q = (1/N_q) Σ_{l=1..N_q} z_l,   Σ_q = (1/N_q) Σ_{l=1..N_q} (z_l − μ_q)(z_l − μ_q)^T
Outcome: the error indeed decreases using classification by phoneme. Specifically, errors from one-to-many mappings are reduced!
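The per-phoneme class statistics above reduce to grouping frames by label and computing a mean and covariance per group; a minimal sketch (the frames and phoneme labels below are invented):

```python
import numpy as np
from collections import defaultdict

def phoneme_classes(frames, labels):
    """Group joint frames by phoneme label and return per-class
    (mean, covariance) statistics, maximum-likelihood form (1/N_q)."""
    groups = defaultdict(list)
    for z, ph in zip(frames, labels):
        groups[ph].append(z)
    stats = {}
    for ph, zs in groups.items():
        Z = np.array(zs)
        mu = Z.mean(axis=0)
        sigma = (Z - mu).T @ (Z - mu) / len(Z)  # covariance about the class mean
        stats[ph] = (mu, sigma)
    return stats

frames = [np.array([0.0, 1.0]), np.array([2.0, 3.0]), np.array([10.0, 10.0])]
labels = ["a", "a", "e"]
stats = phoneme_classes(frames, labels)
print(stats["a"][0])  # mean of the /a/ frames: [1. 2.]
```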
Still GMM Limitations for VC
Unfortunately, converted speech quality is poor (original vs. converted).
What about the joint statistics in the GMM? What is the difference between the GMM and a VQ-type transformation (no joint statistics)?
   GMM:     ŷ_n = Σ_{q=1..Q} w_q(x_n) [ μ_q^y + Σ_q^{yx} (Σ_q^{xx})^{-1} (x_n − μ_q^x) ]
   VQ type: ŷ_n = Σ_{q=1..Q} w_q(x_n) μ_q^y
No significant difference! (Originally shown in [Chen et al.; 03].) The joint statistics (the cross-covariance term) contribute little.
"Over-Smoothing" in GMM-based VC
The over-smoothing problem (explains the poor quality!): transformed spectral envelopes using the GMM are "overly smooth"; the loss of spectral details makes the converted speech sound "muffled", with a "loss of presence".
Cause (goes back to those joint statistics…): low inter-speaker parameter correlation, i.e. a weak statistical link exhibited by aligned source & target frames, so the transformation collapses toward the weighted class means:
   ŷ_n = Σ_{q=1..Q} w_q(x_n) [ μ_q^y + Σ_q^{yx} (Σ_q^{xx})^{-1} (x_n − μ_q^x) ] ≈ Σ_{q=1..Q} w_q(x_n) μ_q^y
Frame-level alignment of source & target parameters is not effective!
“Spectrogram” Example (GMM over-smoothing)
[Figure: spectral envelopes across a sentence, plotted over 0-6000 Hz, for source X, Phonetic GMM, DFWA, and target Y]
Today’s Lecture: Voice Conversion
I. Introduction to Voice Conversion
II. Spectral Envelope Transformation in VC
   - Standard: Gaussian Mixture Model
   - Proposed: Dynamic Frequency Warping + Amplitude Scaling
      - Related work
      - DFWA description
Dynamic Frequency Warping (DFW)
Dynamic Frequency Warping [Valbret; 92]: warp the source spectral envelope in frequency to resemble that of the target.
An alternative to the GMM: no transformation with joint statistics! Spectral details are maintained → higher-quality speech.
But the spectral amplitude is not adjusted explicitly → poor identity transformation.
Spectral Envelopes of a Frame: DFW
[Figure: source (X) and target (Y) spectral envelopes of one frame, 0-8000 Hz in dB, before and after DFW of the source]
Hybrid DFW-GMM Approaches
Goal: adjust the spectral amplitude after DFW.
These approaches combine DFW- and GMM-transformed envelopes and rely on arbitrary smoothing factors [Toda; 01], [Erro; 10].
They impose a compromise between 1. maintaining spectral details (via DFW) and 2. respecting average trends (via GMM).
Proposed Approach: DFWA
Observation: the GMM is not needed to adjust amplitude! Alternatively, a simple amplitude correction term can be used.
Dynamic Frequency Warping with Amplitude Scaling (DFWA):
   S_{n,dfwa}(f) = A_q(f) · S_n^x(W_q^{-1}(f))
   S_n^x(W_q^{-1}(f)): the warped source envelope, contains the spectral details; A_q(f): the amplitude scaling, respects average trends.
Levels of Source-Target Parameter Mappings
(1) Frame-level: Standard (GMM)
(3) Class-level: Global (DFWA)
Recall Standard Voice Conversion
Parallel corpora: source & target utter the same sentences; individual source & target frames are aligned in time; mappings between acoustic spaces are determined at the frame level.
[Block diagram: parallel corpora → extract source & target parameters → align source & target frames → LEARNING → TRANSFORMATION → synthesis of converted speech]
DFW with Amplitude Scaling (DFWA)
Associate source & target parameters based on class statistics. Unlike typical VC, no frame alignment is required in DFWA!
1. Define acoustic classes, 2. DFW, 3. Amplitude scaling
[Block diagram: corpora → separate clustering of source & target parameters → DFW → amplitude scaling → synthesis of converted speech]
Defining Acoustic Classes
Acoustic classes are built using clustering, which serves two purposes:
1. Classify individual source or target frames in the acoustic space
2. Associate source & target classes
The choice of clustering approach depends on the available information:
   - Acoustic information: with aligned frames, joint clustering (e.g. joint GMM or VQ); without aligned frames, independent clustering & association of classes (e.g. with closest means)
   - Contextual information: use symbolic information (e.g. phoneme labels)
Outcome: q = 1, 2, …, Q acoustic classes in a one-to-one correspondence.
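The "closest means" association for independently clustered classes can be sketched as a greedy pairing; this is one simple rule (the class means below are invented example values):

```python
import numpy as np

def associate_classes(src_means, tgt_means):
    """Put independently clustered source and target classes in one-to-one
    correspondence by greedily pairing each source mean with the nearest
    unused target mean (a sketch; other association rules are possible)."""
    pairs, used = [], set()
    for i, sm in enumerate(src_means):
        d = [np.inf if j in used else np.linalg.norm(sm - tm)
             for j, tm in enumerate(tgt_means)]
        j = int(np.argmin(d))
        used.add(j)
        pairs.append((i, j))
    return pairs

src_means = np.array([[0.0, 0.0], [5.0, 5.0]])
tgt_means = np.array([[5.2, 4.9], [0.1, -0.1]])
print(associate_classes(src_means, tgt_means))  # [(0, 1), (1, 0)]
```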
DFW Estimation
Compares distributions of observed source & target spectral envelope peak frequencies (e.g. formants):
   - A global vision of spectral peak behavior (for each class)
   - Only peak locations (not amplitudes) are considered
DFW function: piecewise linear, with intervals defined by aligning maxima in the spectral peak frequency distributions. A Dijkstra algorithm minimizes the sum of absolute differences between the target & warped source distributions, yielding aligned peak pairs (f_{q,i_m}^x, f_{q,j_m}^y), m = 1..M_q. On each interval f ∈ [f_{q,i_m}^x, f_{q,i_{m+1}}^x]:
   W_q(f) = B_{q,m} f + C_{q,m},  with  B_{q,m} = (f_{q,j_{m+1}}^y − f_{q,j_m}^y) / (f_{q,i_{m+1}}^x − f_{q,i_m}^x)  and  C_{q,m} = f_{q,j_m}^y − B_{q,m} f_{q,i_m}^x
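Given the aligned peak pairs, the piecewise-linear warp can be realized directly with linear interpolation (equivalent to the slope/intercept form above); the envelope, grid, and anchor frequencies below are invented for the example:

```python
import numpy as np

def warp_envelope(env, freqs, src_peaks, tgt_peaks):
    """Resample a source envelope on the output grid at W_q^{-1}(f).
    W_q maps source frequency -> target frequency, so the inverse warp is
    obtained by swapping the roles of the anchor frequencies."""
    inv_warped_freqs = np.interp(freqs, tgt_peaks, src_peaks)  # W_q^{-1}(f)
    return np.interp(inv_warped_freqs, freqs, env)

freqs = np.linspace(0, 8000, 801)                  # 10 Hz grid
env = np.exp(-0.5 * ((freqs - 1000) / 100) ** 2)   # single peak at 1000 Hz
warped = warp_envelope(env, freqs,
                       src_peaks=[0, 1000, 8000],  # aligned peak pairs,
                       tgt_peaks=[0, 1500, 8000])  # endpoints pinned
print(freqs[np.argmax(warped)])  # 1500.0: the peak moved to the target location
```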
DFW Estimation
   - Clear global trends (peak locations); avoids sporadic frame-to-frame comparisons
   - The Dijkstra algorithm selects the pairs
   - DFW statistically aligns the most probable spectral events (peak locations)
   - Amplitude scaling then adjusts the differences between the target & warped source spectral envelopes
[Figure: peak occurrence distributions over frequency (0-8000 Hz) for source, target, and warped source; warping shifts the source peaks toward the target's]
Spectral Envelopes of a Frame
[Figure: source (X) and target (Y) spectral envelopes of one frame, 0-8000 Hz in dB, before and after DFW]
Amplitude Scaling
Goal of amplitude scaling: adapt the frequency-warped source envelopes to better resemble those of the target, estimated using statistics of the target and warped source data.
For class q, the amplitude scaling function is defined by:
   log A_q(f) = log S̄_q^y(f) − log S̄_q^x(W_q^{-1}(f))
   - Comparison between source & target data strictly at the acoustic class level
   - A difference in average log spectra
   - No arbitrary weighting or smoothing!
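The class-level amplitude scaling is just a difference of average log envelopes, applied additively in the log domain; the "envelope" data below are random stand-ins:

```python
import numpy as np

def amplitude_scaling(warped_src_log_envs, tgt_log_envs):
    """log A_q(f): average target log envelope minus average warped-source
    log envelope for class q. Averages are taken over the whole class;
    no frame pairing is needed."""
    return tgt_log_envs.mean(axis=0) - warped_src_log_envs.mean(axis=0)

def apply_dfwa(warped_src_log_env, log_A):
    # log S_dfwa(f) = log A_q(f) + log S_x(W_q^{-1}(f))
    return log_A + warped_src_log_env

rng = np.random.default_rng(1)
src = rng.standard_normal((50, 128)) - 3.0   # warped-source log envelopes (class q)
tgt = rng.standard_normal((40, 128)) + 2.0   # target log envelopes (class q)
log_A = amplitude_scaling(src, tgt)
out = apply_dfwa(src[0], log_A)
print(out.shape)
```

By construction the converted class average matches the target class average exactly, which is the "unbiased estimator" property noted on the next slide.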
DFWA Transformation
Transformation (for source frame n in class q):
   S_{n,dfwa}(f) = A_q(f) · S_n^x(W_q^{-1}(f))
with A_q(f) respecting average trends and S_n^x(W_q^{-1}(f)) containing the spectral details, where
   log A_q(f) = log S̄_q^y(f) − log S̄_q^x(W_q^{-1}(f))
In the discrete cepstral domain (log spectrum parameters), with c̃_n the cepstral coefficients of the warped source envelope S_n^x(W_q^{-1}(f)) and μ̃_q^x their class mean:
   ŷ_n = μ_q^y + (c̃_n − μ̃_q^x)
   - The average target envelope is respected ("unbiased estimator"): captures timbre!
   - Spectral details are maintained (the difference between the frame realization & the class average): ensures quality!
Spectral Envelopes of a Frame
[Figure: spectral envelopes of one frame (0-8000 Hz, dB) for source X, target Y, DFW, DFWA, and Phonetic GMM]
Spectral Envelopes across a Sentence
[Figure: spectral envelopes across a sentence (0-6000 Hz) for source X, Phonetic GMM, DFWA, and target Y, with a zoomed region]
Spectral Envelopes across a Sound
[Figure: zoom over frames 265-285 (0-3500 Hz) for source X, Phonetic GMM, DFWA, and target Y]
   - DFWA looks like natural speech: the warping appropriately shifts the source events
   - Overall, DFWA more closely resembles the target
Important to note: examining the spectral envelope's evolution in time is very informative! The poor quality of the GMM is visible right away (it is less evident within a single frame).
Today’s Lecture: Voice Conversion
I. Introduction to Voice Conversion
II. Spectral Envelope Transformation in VC
III. Conversion Results
   - Objective Metrics & Subjective Evaluations
   - Sound Samples
Formal Evaluations
Speech corpora & analysis:
   - CMU ARCTIC, US English speakers (2 male, 2 female)
   - Parallel annotated corpora, speech sampled at 16 kHz
   - 200 sentences for learning, 100 sentences for testing
   - Speech analysis & synthesis: harmonic model
   - All spectral envelopes: discrete cepstral coefficients (order 40)
Spectral envelope transformation methods:
   - Phonetic & traditional GMMs (diagonal covariance matrices)
   - DFWA (classification by phoneme)
   - DFWE (hybrid DFW-GMM, E = energy correction)
   - Source (no transformation)
Objective Metrics for Evaluation
Mean Squared Error (standard) alone is not sufficient! It does not adequately indicate converted speech quality, and it does not indicate the variance in the transformed data. Is the transformed data varying from the mean?
Variance Ratio (VR): the average ratio of the transformed-to-target data variance; a more global indication of behavior. Does the transformed data behave like the target data?
Objective Results
   - GMM-based methods yield the lowest MSE but no variance: confirms over-smoothing
   - DFWA yields a higher MSE but more natural variations in speech; the VR of DFWA comes directly from the variations in the warped source envelopes (not from model adaptations!)
   - The hybrid (DFWE) falls between DFWA & GMM (as expected); compared to DFWA, its VR is notably decreased after the energy correction filter
   - MSE & VR give different indications: both are important, and complementary
DFWA for VC with Nonparallel Corpora
Parallel corpora are a big constraint! They limit the data that can be used in VC.
DFWA for nonparallel corpora: nonparallel learning data of 200 disjoint sentences.
DFWA is equivalent for parallel or nonparallel corpora!
               MSE (dB)   VR
Parallel       -7.74      0.930
Nonparallel    -7.71      0.936
Implementation in a VC System
Converted speech synthesis:
   - Pitch modification using TD-PSOLA (Time-Domain Pitch-Synchronous Overlap-Add); the pitch modification factor is determined by the classic linear transformation
   - Frame synthesis, voiced frames: harmonic amplitudes from the transformed spectral envelopes; harmonic phases from nearest-neighbor source sampling
   - Unvoiced frames: white noise through an AR filter
[Block diagram: source speech s_x(t) → speech analysis (harmonic) → spectral envelope transformation → speech frame synthesis → TD-PSOLA (pitch modification) → converted speech ŝ_y(t)]
   f̂_0^y = μ_0^y + (σ_0^y / σ_0^x)(f_0^x − μ_0^x)
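The classic linear f0 transformation maps the source pitch onto the target's mean and spread; a one-line sketch (the Hz statistics below are invented example values, not from the lecture's corpora):

```python
def convert_f0(f0, mu_x, sigma_x, mu_y, sigma_y):
    """Linear pitch transformation: shift and rescale the source f0 so its
    statistics match the target's mean and standard deviation."""
    return mu_y + (sigma_y / sigma_x) * (f0 - mu_x)

# Male-ish source (120 +/- 20 Hz) to female-ish target (220 +/- 30 Hz)
print(convert_f0(140.0, 120.0, 20.0, 220.0, 30.0))  # 250.0
```

The TD-PSOLA pitch modification factor for a frame is then f̂_0^y / f_0^x.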
Subjective Testing
Listeners evaluate:
1. Quality of the speech
2. Similarity of the voice to the target speaker
Mean Opinion Score (MOS): scale from 1 to 5.
Subjective Test Results
   - No observable differences between acoustic decoding & phonetic classification
   - Converted speech quality is consistently higher for DFWA than for the GMM-based methods
   - Equal (if not slightly more) success for DFWA in capturing identity
   - No compromise between quality & capturing identity for DFWA! DFWA yields better quality than the other methods, with roughly equal success for similarity
[Sound samples: Source / Target / GMM / DFWA / DFWE for the pairs slt → clb (FF) and bdl → clb (MF)]
Conversion Examples
   - DFWA is consistently the highest quality; the GMM-based methods suffer a "loss of presence"
   - Key observation: DFWA can deliver high-quality VC! (Demonstrated with target analysis-synthesis using converted spectral envelopes.)
VC Perspectives
   - Conversion within gender gives better results (e.g. FF): the speakers share similar voice characteristics
   - For inter-gender conversion, the spectral envelope is not enough: large pitch modification factors degrade the speech, and prosody is important
   - The phase spectrum & glottal source need to be addressed too…
Today’s Lecture: Voice Conversion
I. Introduction to Voice Conversion
II. Spectral Envelope Transformation in VC
   - Typical: Gaussian Mixture Model
   - Proposed: Dynamic Frequency Warping + Amplitude Scaling
III. Conversion Results
   - Objective Metrics & Subjective Evaluations
   - Sound Samples
IV. Summary & Conclusions
Summary
Text-to-Speech (TTS): 1. concatenative (unit-selection), 2. parametric (HMM-based speaker models)
Acoustic parameters in speech: 1. segmental (pitch, spectral envelope), 2. supra-segmental (prosody, intonation)
Voice Conversion: modify speech to transform the speaker identity; can create new voices! Focus: the spectral envelope, related to timbre.
Summary (continued)
Spectral envelope transformation involves three choices:
1. Spectral envelope parameterization (cepstral, LPC, …)
2. Level of source-target acoustic mappings
3. Method for learning & transformation
Two approaches:
1. Standard VC approach: parallel corpora & source-target frame alignment; GMM-based transformation
2. Proposed VC approach: source-target acoustic mapping at the class level; Dynamic Frequency Warping + Amplitude Scaling (DFWA)
Important Considerations
Spectral envelope transformation:
1. Speaker identity: need to capture the average speaker timbre
2. Speech quality: need to maintain spectral details
Source-target acoustic mappings:
1. Contextual information (phoneme) is helpful
2. Mapping features at a more global class level is effective: better conversion quality & more flexible (no parallel corpora); frame-level alignments/mappings are too narrow & restrictive
Speaker characteristics (prosody, voice quality): the success of Voice Conversion can depend on the speakers; many features of speech remain to consider & treat…
Thank You!
Questions?