CENTER FOR SPOKEN LANGUAGE UNDERSTANDING 1
PREDICTION AND SYNTHESIS OF PROSODIC EFFECTS ON SPECTRAL
BALANCE OF VOWELS
Jan P.H. van Santen and Xiaochuan Niu
Center for Spoken Language Understanding
OGI School of Science & Technology at OHSU
OVERVIEW
1. IMPORTANCE OF SPECTRAL BALANCE
2. MEASUREMENT OF SPECTRAL BALANCE
3. ANALYSIS METHODS
4. RESULTS
5. SYNTHESIS
6. CONCLUSIONS
1. IMPORTANCE OF SPECTRAL BALANCE
• Linguistic Control Factors
  – Stress-like factors
  – Positional factors
  – Phonemic factors
• Acoustic Correlates
  – Traditionally TTS-controlled:
    • Pitch, timing, amplitude
  – Demonstrated in natural speech, but usually not TTS-controlled:
    • Spectral tilt, balance
    • Formant dynamics
    • …
2. MEASUREMENT OF SPECTRAL BALANCE
• Data:
  – 472 greedily selected sentences
    • Genre: newspaper
    • Greedy features: linguistic control factors
  – One female speaker
  – Manual segmentation
  – Accent: independent rating by 3 judges
    • 0-3 score
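The slides do not say how the greedy selection was run. One common reading is greedy coverage: repeatedly pick the sentence that adds the most not-yet-covered combinations of linguistic control factors. The sketch below assumes that reading; the function name, the sentence ids, and the feature tuples are all hypothetical.

```python
def greedy_select(sentences, max_sentences):
    """Greedy coverage sketch.

    sentences: dict mapping a sentence id to the set of factor
    combinations (hashable tuples) that occur in that sentence.
    Returns the chosen ids in selection order and the covered set.
    """
    covered = set()
    chosen = []
    remaining = dict(sentences)
    while remaining and len(chosen) < max_sentences:
        # Ties are broken by sorted id so the result is deterministic.
        best_id = max(sorted(remaining),
                      key=lambda sid: len(remaining[sid] - covered))
        gain = remaining[best_id] - covered
        if not gain:
            break  # nothing new can be covered
        chosen.append(best_id)
        covered |= gain
        del remaining[best_id]
    return chosen, covered


# Hypothetical toy corpus: (stress, position) combinations per sentence.
A = ("stressed", "final")
B = ("stressed", "medial")
C = ("unstressed", "final")
D = ("unstressed", "medial")
E = ("stressed", "initial")
corpus = {"s1": {A, B}, "s2": {B, C, D}, "s3": {A}, "s4": {D, E}}
chosen, covered = greedy_select(corpus, max_sentences=10)
```

In this toy corpus the sentence "s3" is never picked, because everything it contains is already covered by the time it could be chosen.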
2. MEASUREMENT OF SPECTRAL BALANCE
• Energy in 5 formant-range frequency bands
  – B0: 100-300 Hz [~F0]
  – B1: 300-800 Hz [~F1]
  – B2: 800-2500 Hz [~F2]
  – B3: 2500-3500 Hz [~F3]
  – B4: 3500-max Hz [~fricative noise]
• In other words, a multidimensional measure
• Filter bank → Square → Average [1 ms rect.] → 20 log10(B_i)
• Subtract estimated per-utterance means
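A minimal sketch of the band measurement follows, under stated assumptions. Instead of a time-domain filter bank with a 1 ms rectangular average, it sums DFT power over each band for one short frame, which by Parseval's relation gives the mean-square output of an ideal band-pass filter over that frame. The upper edge of B4 is taken to be the Nyquist frequency, and the function names are hypothetical. The 20 log10 scaling follows the slide.

```python
import cmath
import math

# Band edges from the slide; None stands for "max", assumed to be Nyquist.
BAND_EDGES_HZ = [(100, 300), (300, 800), (800, 2500), (2500, 3500), (3500, None)]


def band_levels_db(frame, sample_rate, floor=1e-12):
    """Return one level per band: 20*log10 of the mean-square band energy."""
    n = len(frame)
    nyquist = sample_rate / 2
    # One-sided power spectrum via a direct DFT (adequate for short frames).
    power = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame))
        power.append(abs(coeff) ** 2)
    levels = []
    for lo, hi in BAND_EDGES_HZ:
        hi = nyquist if hi is None else hi
        in_band = sum(p for k, p in enumerate(power)
                      if lo <= k * sample_rate / n < hi)
        # Factor 2 accounts for the mirrored negative-frequency bins.
        mean_square = 2.0 * in_band / n ** 2
        levels.append(20.0 * math.log10(max(mean_square, floor)))
    return levels


def subtract_utterance_means(frame_levels):
    """Center each band on its mean over all frames of one utterance."""
    n_bands = len(frame_levels[0])
    means = [sum(f[b] for f in frame_levels) / len(frame_levels)
             for b in range(n_bands)]
    return [[f[b] - means[b] for b in range(n_bands)] for f in frame_levels]


# A 500 Hz sine sampled at 16 kHz should fall in band B1 (300-800 Hz).
sample_rate = 16000
frame = [math.sin(2 * math.pi * 500 * t / sample_rate) for t in range(320)]
levels = band_levels_db(frame, sample_rate)
```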
2. MEASUREMENT OF SPECTRAL BALANCE
• Details:
  – Confounding with F0
    • Measure pitch-corrected and raw
      – For certain wave shapes, pitch directly related to fixed-frame energy
      – Why do both: wave shapes may change in unknown ways
    • F0 not confined to B0 [female speech]
  – Vowel formants not quite confined to bands [e.g., F1 for /EE/ and F3 for /ER/]
2. MEASUREMENT OF SPECTRAL BALANCE
• Why not more or different bands?
  – Multiple interacting Linguistic Control Factors
    • Need measurements that minimize interactions
  – With 5 bands, different vowels “behave similarly”
    • Can model vowels as a class
• Why not simply spectral tilt?
  – 5 bands carry more information than a single measure
  – Supply more information for synthesis
3. ANALYSIS METHODS
• Measures likely to behave like segmental duration:
  – Multiple interacting, confounded factors:
    • Interaction: Magnitude of effects of one factor may depend on other factors
    • Confounding: Unequal frequencies of control factor combinations
  – “Directional Invariance”
    • Direction of effects of one factor is independent of other factors
3. ANALYSIS METHODS
• Need method that
  – can handle multiple interacting, confounded factors and
  – takes advantage of Directional Invariance
• Used: Sums of Products Model:
  $B_i(c_0, \ldots, c_n) = \sum_{k \in K} \prod_{j \in I_k} S_{k,j}(c_j)$
3. ANALYSIS METHODS
• Special cases:
  – Multiplicative model: K = {1}, I_1 = {0,…,n}
    $B_i(c_0, \ldots, c_n) = S_{1,0}(c_0) \times \cdots \times S_{1,n}(c_n)$
  – Additive model: K = {0,…,n}, I_k = {k}
    $B_i(c_0, \ldots, c_n) = S_{0,0}(c_0) + \cdots + S_{n,n}(c_n)$
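A sums-of-products model adds up one product term per element of K, and each term multiplies the factor-specific parameters for the factors listed in its index set. The sketch below evaluates such a model and shows the additive and multiplicative special cases on two hypothetical factors (stress and position). All parameter values are invented for illustration and are not results from the study.

```python
import math


def sums_of_products(levels, K, I, S):
    """Evaluate sum over k in K of prod over j in I[k] of S[(k, j)][levels[j]].

    levels: tuple of factor levels (c_0, ..., c_n)
    K:      iterable of term indices
    I:      dict mapping term index k to the factor indices in that term
    S:      dict mapping (k, j) to a {level: parameter} table
    """
    return sum(math.prod(S[(k, j)][levels[j]] for j in I[k]) for k in K)


# Additive special case: K = {0, 1}, I_k = {k}.
additive_S = {
    (0, 0): {"unstressed": -1.0, "stressed": 2.0},
    (1, 1): {"left": 0.5, "right": -0.5},
}
additive = sums_of_products(("stressed", "right"),
                            K=[0, 1], I={0: [0], 1: [1]}, S=additive_S)

# Multiplicative special case: K = {1}, I_1 = {0, 1}.
multiplicative_S = {
    (1, 0): {"unstressed": 1.0, "stressed": 2.0},
    (1, 1): {"left": 0.5, "right": -0.5},
}
multiplicative = sums_of_products(("stressed", "right"),
                                  K=[1], I={1: [0, 1]}, S=multiplicative_S)
```

Here the additive case gives 2.0 + (-0.5) and the multiplicative case gives 2.0 × (-0.5), so the same evaluator covers both models by changing only K and I.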
3. ANALYSIS METHODS
• Used additive model
• Note: Parameter estimates are:
  – Estimates of marginal means …
  – … in balanced design:
    $S_{j,j}(c_j) = \operatorname{Mean}_{c_m \in C_m,\; m \neq j} \, B_i(c_0, \ldots, c_j, \ldots, c_n)$
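The link between additive-model parameters and marginal means can be checked on a small balanced design. The sketch below builds every combination of two hypothetical factors from invented additive effects, then averages over the other factor. Because each factor's effects are chosen to average to zero here, the marginal means recover the effects exactly. In an unbalanced (confounded) design this shortcut would not hold.

```python
from collections import defaultdict


def marginal_means(cells, factor):
    """Average cell values over all other factors.

    cells:  dict mapping a tuple of factor levels to a measured value
    factor: index of the factor whose marginal means are wanted
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for levels, value in cells.items():
        sums[levels[factor]] += value
        counts[levels[factor]] += 1
    return {level: sums[level] / counts[level] for level in sums}


# Hypothetical zero-mean additive effects (not data from the study).
stress_effect = {"unstressed": -1.0, "stressed": 1.0}
position_effect = {"left": 0.5, "right": -0.5}

# Balanced design: every combination occurs exactly once.
cells = {(s, p): stress_effect[s] + position_effect[p]
         for s in stress_effect for p in position_effect}

stress_means = marginal_means(cells, factor=0)
position_means = marginal_means(cells, factor=1)
```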
3. ANALYSIS METHODS
• Pitch correction:
  $B_i^{[c]} = 20 \log_{10}(B_i) - 20 \log_{10}(f_0 \, t_w)$
• Confounding with F0: Show both
  <B0, B1, B2, B3, B4> and:
  <B0 + B1, B2, B3, B4>
4. RESULTS: (A) POSITIONAL EFFECTS
5 Bands, not pitch-corrected
Solid: right position, dashed: left position. Y-axis: corrected mean
4. RESULTS: (A) POSITIONAL EFFECTS
5 Bands, pitch-corrected
4. RESULTS: (A) POSITIONAL EFFECTS
4 Bands, not pitch-corrected
4. RESULTS: (A) POSITIONAL EFFECTS
4 Bands, pitch-corrected
4. RESULTS: (B) STRESS/ACCENT EFFECTS
5 Bands, not pitch-corrected
Solid: stressed syllable, dashed: unstressed. Y-axis: corrected mean
4. RESULTS: (B) STRESS/ACCENT EFFECTS
5 Bands, pitch-corrected
4. RESULTS: (B) STRESS/ACCENT EFFECTS
4 Bands, not pitch-corrected
4. RESULTS: (B) STRESS/ACCENT EFFECTS
4 Bands, pitch-corrected
4. RESULTS: (C) TILT EFFECTS
$\mathrm{Tilt} = (-2, -1, 0, 1, 2) \cdot (B_0, B_1, B_2, B_3, B_4)^{\top}$
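Reading the tilt measure as a linear contrast over the five band levels, a short sketch is given here. The weights (-2, -1, 0, 1, 2) are an assumption: the slide shows the magnitudes 2, 1, 0, 1, 2, and the signs are inferred from the usual form of a linear trend contrast. With these signs, a spectrum that falls off toward high frequencies gives a negative tilt. The function name and band values are hypothetical.

```python
# Assumed contrast weights for bands B0..B4 (signs are an inference).
TILT_WEIGHTS = (-2, -1, 0, 1, 2)


def tilt(band_levels):
    """Linear-contrast tilt over five band levels (B0, ..., B4)."""
    if len(band_levels) != len(TILT_WEIGHTS):
        raise ValueError("expected five band levels")
    return sum(w * b for w, b in zip(TILT_WEIGHTS, band_levels))


flat = tilt([0.0, 0.0, 0.0, 0.0, 0.0])
falling = tilt([10.0, 8.0, 6.0, 4.0, 2.0])  # energy drops with frequency
```

Because the weights sum to zero, adding the same constant to all five bands leaves the tilt unchanged, so the measure reflects spectral shape and not overall level.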
5. SYNTHESIS
• Use ABS/OLA sinusoidal model:
  s[n] = sum of overlapped short-time signal frames s_k[n]
  s_k[n] = sum of quasi-harmonic sinusoidal components:
  $s_k[n] = \sum_l A_{k,l} \cos(\omega_{k,l} n + \phi_{k,l})$
• Each frame of a unit is represented by a set of quasi-harmonic sinusoidal parameters;
• Given the desired F0 contour, a pitch shift is applied to the sinusoidal parameters of the unit to obtain the target parameters A_{k,l};
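The frame model above, a sum of cosines with per-component amplitude, frequency, and phase, can be sketched as follows. This is a simplified illustration and not the ABS/OLA system itself: the synthesis window that a real overlap-add scheme would apply to each frame is omitted, frames are assumed to have equal length, and the function names and parameter values are hypothetical.

```python
import math


def synthesize_frame(amplitudes, omegas, phases, length):
    """One frame: s_k[n] = sum_l A_l * cos(omega_l * n + phi_l)."""
    return [sum(a * math.cos(w * n + p)
                for a, w, p in zip(amplitudes, omegas, phases))
            for n in range(length)]


def overlap_add(frames, hop):
    """Sum equal-length frames placed `hop` samples apart (no windowing)."""
    total = hop * (len(frames) - 1) + len(frames[0])
    out = [0.0] * total
    for k, frame in enumerate(frames):
        for n, value in enumerate(frame):
            out[k * hop + n] += value
    return out


# Two quasi-harmonic components at hypothetical radian frequencies.
frame = synthesize_frame([1.0, 0.5], [0.1, 0.2], [0.0, 0.0], length=8)
signal = overlap_add([frame, frame], hop=4)
```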
5. SYNTHESIS
• Considering the differences in prosodic factors between the original and target unit, compute band differences:
  $\Delta_i = \hat{B}_i - B_i$
• Transform the band differences into weights applied to the sinusoidal parameters:
  $w_i = 10^{\Delta_i / 20}$
• $A_{k,j} \leftarrow w_i \, A_{k,j}$, when the j'th harmonic is located in the i'th band;
• Spectral smoothing across unit boundaries.
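The band-weighting step can be sketched as below: each band difference in dB becomes a linear gain 10^(Δ/20), and every harmonic amplitude is scaled by the gain of the band its frequency falls in. The band edges are the ones listed in the measurement section. Leaving harmonics below 100 Hz unchanged is an assumption, the function names are hypothetical, and the smoothing across unit boundaries is not shown.

```python
import bisect
import math

# Lower edges of bands B0..B4 in Hz; B4 extends to the top of the spectrum.
BAND_LOWER_EDGES_HZ = [100, 300, 800, 2500, 3500]


def band_index(freq_hz):
    """Index of the band containing freq_hz, or None below the first edge."""
    idx = bisect.bisect_right(BAND_LOWER_EDGES_HZ, freq_hz) - 1
    return idx if idx >= 0 else None


def apply_band_weights(freqs_hz, amplitudes, deltas_db):
    """Scale each harmonic amplitude by 10**(delta/20) for its band."""
    weights = [10.0 ** (d / 20.0) for d in deltas_db]
    scaled = []
    for f, a in zip(freqs_hz, amplitudes):
        i = band_index(f)
        scaled.append(a if i is None else a * weights[i])
    return scaled


# Hypothetical band differences: boost B1 by about 6 dB, cut B3 by 20 dB.
deltas = [0.0, 20.0 * math.log10(2.0), 0.0, -20.0, 0.0]
new_amps = apply_band_weights([50.0, 500.0, 3000.0], [1.0, 1.0, 1.0], deltas)
```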
5. SYNTHESIS
5 Bands modification example [i:]
CONCLUSIONS
• Described simple methods for predicting and synthesizing spectral balance
• But: Spectral balance is only one “non-standard acoustic correlate”
• Others that remain to be addressed:
  – Spectral dynamics
  – Phase