HARMONIC MODEL FOR FEMALE VOICE EMOTIONAL SYNTHESIS Anna PŘIBILOVÁ
HARMONIC MODEL FOR FEMALE VOICE EMOTIONAL SYNTHESIS
Anna PŘIBILOVÁ
Department of Radioelectronics, Slovak University of Technology
Ilkovičova 3, SK-812 19 Bratislava, Slovakia, E-mail: [email protected]
Jiří PŘIBIL
Institute of Photonics and Electronics, Academy of Sciences of the Czech Republic
Chaberská 57, CZ-182 51 Praha 8, Czech Republic, E-mail: [email protected]
• Introduction
• Harmonic speech model with AR parameterization
• Spectral modifications for emotional synthesis
• Prosodic modifications for emotional synthesis
• Listening tests results
• Conclusion
Harmonic speech model with AR parameterization
[Block diagram. Inputs: F0, G, {a_n}, S_F. Blocks: pitch harmonics, A(e^{jω}), ln, Hilbert transform, f_v-uv(S_F), randomization ⟨−π, π⟩, sum of sine waves + overlap and add. Intermediate parameters: {φ_m}, {A_m}, {f_m}. Output: synthetic speech.]
[Equation: voicing transition frequency f_v-uv(S_F), an exponential function of the spectral flatness S_F; recoverable terms: F_s, 80, exp, 2]
$$s(l) = \sum_{m=1}^{M} A_m \cos\left(2\pi f_m l + \varphi_m\right)$$
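The sum-of-sine-waves step can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: frequencies are given in Hz and divided by the sampling frequency, the amplitudes and the value of the voicing transition frequency are made up, and overlap-and-add between frames is left out. The function names `synthesize_frame` and `harmonic_phases` are hypothetical.

```python
import numpy as np

def synthesize_frame(amps, freqs_hz, phases, n_samples, fs):
    """Sum of sine waves: s(l) = sum_m A_m * cos(2*pi*f_m*l/fs + phi_m)."""
    l = np.arange(n_samples)                          # sample index within the frame
    amps = np.asarray(amps, dtype=float)[:, None]
    freqs = np.asarray(freqs_hz, dtype=float)[:, None]
    phases = np.asarray(phases, dtype=float)[:, None]
    return np.sum(amps * np.cos(2 * np.pi * freqs * l / fs + phases), axis=0)

def harmonic_phases(base_phases, freqs_hz, f_transition, rng):
    """Keep phases below the voicing transition frequency, randomize them above it."""
    base_phases = np.asarray(base_phases, dtype=float)
    random_phases = rng.uniform(-np.pi, np.pi, size=base_phases.shape)
    unvoiced = np.asarray(freqs_hz) > f_transition
    return np.where(unvoiced, random_phases, base_phases)

fs = 16000                                            # assumed sampling frequency [Hz]
f0 = 200.0                                            # assumed pitch [Hz]
freqs = f0 * np.arange(1, int((fs / 2) // f0))        # pitch harmonics below fs/2
amps = 1.0 / np.arange(1, len(freqs) + 1)             # made-up amplitudes A_m
rng = np.random.default_rng(0)
phases = harmonic_phases(np.zeros(len(freqs)), freqs, 4000.0, rng)
frame = synthesize_frame(amps, freqs, phases, 320, fs)
```

Harmonics above the (assumed) 4 kHz transition frequency receive random phases, which makes that part of the spectrum noise-like.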
Voicing transition frequency
[Figure: spectrum [dB] (−140 to −20) versus frequency [kHz] (0 to 8); the voicing transition frequency marks the boundary between the voiced part and the unvoiced part]
Determination of model parameters
[Block diagram. Input: speech. Blocks: Hamming window, abs(FFT), ln, real(IFFT), spectral envelope estimation, exp, autocorrelation method of LPC analysis. Outputs: G, {a_n}, S_F.]
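A minimal sketch of the autocorrelation method of LPC analysis, solved with the Levinson-Durbin recursion, is given below. It assumes an already windowed frame and the convention A(z) = 1 + a_1 z^-1 + ... + a_p z^-p; the function name `lpc_autocorrelation`, the model order 16 and the definition of the gain as the square root of the residual energy are assumptions made for illustration, not values taken from the slides.

```python
import numpy as np

def lpc_autocorrelation(frame, order):
    """Return (a, gain) with a[0] = 1, using the Levinson-Durbin recursion."""
    x = np.asarray(frame, dtype=float)
    # Autocorrelation for lags 0..order
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]                                   # prediction error energy
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                           # reflection coefficient
        a_prev = a.copy()
        a[1:i] = a_prev[1:i] + k * a_prev[i - 1:0:-1]
        a[i] = k
        err *= 1.0 - k * k
    return a, np.sqrt(err)

rng = np.random.default_rng(0)
frame = rng.standard_normal(400) * np.hamming(400)   # synthetic windowed frame
a, gain = lpc_autocorrelation(frame, order=16)
```

The recursion avoids an explicit matrix inversion and yields a stable all-pole model when the autocorrelation sequence is positive definite.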
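Spectral flatness measure:
$$S_F = \frac{\left[\prod_{k=1}^{N_F/2} |S_k|^2\right]^{2/N_F}}{\frac{2}{N_F}\sum_{k=1}^{N_F/2} |S_k|^2} = \frac{\exp\left(\frac{2}{N_F}\sum_{k=1}^{N_F/2} \ln |S_k|^2\right)}{\frac{2}{N_F}\sum_{k=1}^{N_F/2} |S_k|^2}$$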
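The spectral flatness measure is the ratio of the geometric mean to the arithmetic mean of the power spectrum, and can be sketched as below. The FFT length of 1024, the Hamming window applied inside the function and the small floor that guards the logarithm are assumptions for illustration; `spectral_flatness` is a hypothetical name.

```python
import numpy as np

def spectral_flatness(frame, n_fft=1024):
    """Geometric mean over arithmetic mean of |S_k|^2 for k = 1 .. N_F/2."""
    x = np.asarray(frame, dtype=float) * np.hamming(len(frame))
    power = np.abs(np.fft.rfft(x, n_fft)[1:n_fft // 2 + 1]) ** 2
    power = np.maximum(power, 1e-20)             # guard against ln(0)
    geometric_mean = np.exp(np.mean(np.log(power)))
    arithmetic_mean = np.mean(power)
    return geometric_mean / arithmetic_mean

fs = 16000                                       # assumed sampling frequency [Hz]
n = np.arange(512)
rng = np.random.default_rng(0)
sf_noise = spectral_flatness(rng.standard_normal(512))              # noise-like frame
sf_tone = spectral_flatness(np.sin(2 * np.pi * 1000.0 * n / fs))    # strongly periodic frame
```

Noise-like frames give values much closer to 1 than strongly periodic frames, which is why the measure is usable as an indicator of voicing.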
Female formant areas (+20 %)
F1   300 Hz to 840 Hz
F2   840 Hz to 2400 Hz
F3   2400 Hz to 3840 Hz
F4   3840 Hz to 4800 Hz
Emotional influence on speech formants
pleasant emotions – faucal and pharyngeal expansion, relaxation of tract walls, mouth corners retracted upward (F1 falling, resonances raised)
unpleasant emotions – faucal and pharyngeal constriction, tensing of vocal tract walls, mouth corners retracted downward (F1 rising, F2 and F3 falling)
Scherer, K. R.: Vocal Communication of Emotion: A Review of Research Paradigms. Speech Communication, Vol. 40 (2003) 227-256
Male formant areas
F1   250 Hz to 700 Hz
F2   700 Hz to 2000 Hz
F3   2000 Hz to 3200 Hz
F4   3200 Hz to 4000 Hz
Fant, G.: Speech Acoustics and Phonetics. Kluwer Academic Publishers, Dordrecht (2004)
[Figure: boundary between the F1 and F2 areas, 700 Hz (male) and 840 Hz (female)]
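The relation between the two sets of formant areas can be checked with a few lines of arithmetic: raising every male boundary by 20 % reproduces the female boundaries. The names below are hypothetical, and rounding to whole hertz is an assumption.

```python
# Male formant area boundaries in Hz, as listed on the slide (after Fant).
MALE_FORMANT_AREAS_HZ = {
    "F1": (250, 700),
    "F2": (700, 2000),
    "F3": (2000, 3200),
    "F4": (3200, 4000),
}

def scale_formant_areas(areas, factor):
    """Multiply both boundaries of every formant area by the same factor."""
    return {name: (round(lo * factor), round(hi * factor))
            for name, (lo, hi) in areas.items()}

female_areas = scale_formant_areas(MALE_FORMANT_AREAS_HZ, 1.2)   # +20 %
```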
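Spectral modifications for emotional synthesis
[Block diagram, spectral envelope. Inputs: G, {a_n}. Blocks: A(e^{jω}), frequency scale transformation, exp(real(IFFT)), autocorrelation method of LPC analysis. Outputs: G', {a_n'}.]
[Block diagram, spectral flatness. S_F is converted to S_F'; the values 1.24, 1.11, 2.02 are shown with the labels joy, anger, sadness.]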
Frequency scale transformation
F1 (< F1,2): increased (decreased)
F2, F3, F4 (> F1,2): decreased (increased)
[Figure: three panels over f [kHz] (0 to 8). Frequency mapping (f [kHz], 0 to 8) through the points [0, 0], [F1,2, fs/4], [fs/2, fs/2]. Two ratio curves ([-], 0.8 to 1.5) through the points [0, 1], [fs/8, ratio 1], [fs/4, 1], [3fs/8, ratio 2], [fs/2, 1].]
Formant ratio between emotional and neutral speech
chosen formant ratio (for frequency after transformation)
                                          1 (214.3 Hz)      2 (2666.7 Hz)
joyous-to-neutral formant ratio (shift)   0.70 (−30 %)      1.05 (+5 %)
angry-to-neutral formant ratio (shift)    1.35 (+35 %)      0.85 (−15 %)
sad-to-neutral formant ratio (shift)      1.10 (+10 %)      0.90 (−10 %)
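One way to turn the two chosen ratios into a ratio curve over the whole band is sketched below. It assumes that, on the transformed frequency scale, the curve equals 1 at 0, fs/4 and fs/2 and takes the chosen ratios at fs/8 and 3fs/8. The linear interpolation between these points, the sampling frequency of 16 kHz (inferred from the 0 to 8 kHz axes) and the name `formant_ratio_curve` are assumptions; the slides may use a smoother curve.

```python
import numpy as np

# Chosen emotional-to-neutral formant ratios (first and second control point).
CHOSEN_RATIOS = {"joy": (0.70, 1.05), "anger": (1.35, 0.85), "sadness": (1.10, 0.90)}

def formant_ratio_curve(emotion, fs=16000, n_points=257):
    """Piecewise-linear ratio curve on the transformed frequency scale."""
    r1, r2 = CHOSEN_RATIOS[emotion]
    knots_f = np.array([0.0, fs / 8, fs / 4, 3 * fs / 8, fs / 2])
    knots_r = np.array([1.0, r1, 1.0, r2, 1.0])
    f = np.linspace(0.0, fs / 2, n_points)
    return f, np.interp(f, knots_f, knots_r)

f, ratio = formant_ratio_curve("anger")
```

For anger the curve rises above 1 in the lower half of the transformed band and falls below 1 in the upper half, matching the direction of the shifts in the table.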
mean formant ratio in formant areas
                                          F1 300–840 Hz       F2 840–2400 Hz      F3 2400–3840 Hz     F4 3840–4800 Hz
joyous-to-neutral formant ratio (shift)   0.8982 (−10.18 %)   1.0589 (+5.89 %)    1.0334 (+3.34 %)    0.9964 (−0.36 %)
angry-to-neutral formant ratio (shift)    1.1289 (+12.89 %)   0.8849 (−11.51 %)   0.8623 (−13.77 %)   0.9012 (−9.88 %)
sad-to-neutral formant ratio (shift)      1.0432 (+4.32 %)    0.9383 (−6.17 %)    0.8991 (−10.09 %)   0.9076 (−9.24 %)
[Figure: formant shifts for joyous, angry and sad speech, showing the chosen shifts (−30 %, +5 %; +35 %, −15 %; +10 %, −10 %) and the mean shifts in the formant areas F1 to F4 listed above]
Prosody of emotional speech
Scherer, K. R.: Vocal Communication of Emotion: A Review of Research Paradigms. Speech Communication, Vol. 40 (2003) 227-256
EMOTION    F0 mean   F0 range   energy   duration
JOY        higher    higher     higher   shorter
ANGER      higher    higher     higher   shorter
SADNESS    lower     lower      lower    longer

OUR CHOICE OF EMOTIONAL-TO-NEUTRAL RATIOS
EMOTION    F0 mean   F0 range   energy   duration
JOY        1.18      1.30       1.30     0.81
ANGER      1.16      1.30       1.70     0.84
SADNESS    0.81      0.62       0.95     1.16
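How the F0 mean and F0 range ratios could act on a neutral F0 contour is sketched below. It assumes that the range is scaled around the mean and that the contour contains voiced frames only; the slides do not state how the two ratios are combined, and `modify_f0` is a hypothetical name. Energy and duration ratios are listed for completeness but not applied here.

```python
import numpy as np

# Emotional-to-neutral ratios from the table.
PROSODY_RATIOS = {
    "joy":     {"f0_mean": 1.18, "f0_range": 1.30, "energy": 1.30, "duration": 0.81},
    "anger":   {"f0_mean": 1.16, "f0_range": 1.30, "energy": 1.70, "duration": 0.84},
    "sadness": {"f0_mean": 0.81, "f0_range": 0.62, "energy": 0.95, "duration": 1.16},
}

def modify_f0(f0_hz, emotion):
    """Scale the mean and the range of a voiced F0 contour [Hz]."""
    r = PROSODY_RATIOS[emotion]
    f0 = np.asarray(f0_hz, dtype=float)
    mean = f0.mean()
    # New mean = ratio * old mean; deviations from the mean are scaled separately.
    return r["f0_mean"] * mean + r["f0_range"] * (f0 - mean)

neutral = np.array([180.0, 200.0, 220.0])
joyous = modify_f0(neutral, "joy")
```

For this contour the mean moves from 200 Hz to 236 Hz and the range from 40 Hz to 52 Hz.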
Linear trend of F0 at the end of sentences
EMOTION    linear trend type   linear trend start
JOY        rising              55 % from the end
ANGER      falling             35 % from the end
[Figure: two panels, JOY and ANGER, showing relative F0 (F0 rel [-], 0.4 to 1.6) versus N [frames] (0 to 200); curves VF0source and VF0LT with the linear trend LT, starting at frame 122 in one panel and at frame 61 in the other]
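Applying a linear trend to the final part of an F0 contour can be sketched as follows. The trend types and start positions come from the table; the size of the change at the last frame (`end_change`, here 20 %) is a made-up parameter, since the slides do not give the slope, and `apply_linear_trend` is a hypothetical name.

```python
import numpy as np

# Trend type and start position (fraction of the sentence counted from the end).
LINEAR_TREND = {"joy": ("rising", 0.55), "anger": ("falling", 0.35)}

def apply_linear_trend(f0, emotion, end_change=0.2):
    """Multiply the tail of the contour by a linearly rising or falling gain."""
    kind, fraction = LINEAR_TREND[emotion]
    f0 = np.asarray(f0, dtype=float)
    n = len(f0)
    start = int(round(n * (1.0 - fraction)))         # first frame of the trend
    sign = 1.0 if kind == "rising" else -1.0
    gain = np.ones(n)
    gain[start:] = 1.0 + sign * end_change * np.linspace(0.0, 1.0, n - start)
    return f0 * gain, start

flat = np.full(200, 200.0)                           # flat 200 Hz contour, 200 frames
joy_f0, joy_start = apply_linear_trend(flat, "joy")
anger_f0, anger_start = apply_linear_trend(flat, "anger")
```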
Listening tests
“Determination of emotion type”
– 10 evaluation sets selected randomly from the testing corpus
– 60 short sentences (1 s to 3.5 s)
– taken from Czech stories
– female professional actors
– 4 possibilities: “joy”, “anger”, “sadness”, “other”
20 listeners (16 Czechs and 4 Slovaks, 6 women and 14 men)
http://www.lef.um.savba.sk/Scripts/itstposl2.dll
MS ISAPI/NSAPI DLL script
– runs on server PC
– communicates with user via HTTP protocol
Listening tests results
Confusion matrix
EMOTION    JOY      ANGER    SADNESS   OTHER
JOY        59.0 %   0.5 %    16.0 %    24.5 %
ANGER      2.5 %    73.5 %   2.0 %     22.0 %
SADNESS    0.5 %    0.5 %    90.0 %    9.0 %

Successful determination of emotions (summed for all emotions)
                              correct   not classified   exchanged
best evaluated sentence *     88.1 %    11.9 %           0 %
worst evaluated sentence **   57.6 %    30.3 %           12.1 %
* “Vše co potřeboval.” (“All he needed.”)
** “Máš ho mít.” (“You ought to have it.”)
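The confusion matrix can be read programmatically as in the sketch below. It assumes that rows are the synthesized emotions and columns the listeners' choices, which the slide does not state explicitly; "exchanged" is taken here as the share assigned to one of the other two emotions and "not classified" as the share of "other" answers.

```python
import numpy as np

EMOTIONS = ["joy", "anger", "sadness"]
# Rows: assumed synthesized emotion; columns: joy, anger, sadness, other [%].
CONFUSION = np.array([
    [59.0, 0.5, 16.0, 24.5],
    [2.5, 73.5, 2.0, 22.0],
    [0.5, 0.5, 90.0, 9.0],
])

correct = {e: CONFUSION[i, i] for i, e in enumerate(EMOTIONS)}
not_classified = {e: CONFUSION[i, 3] for i, e in enumerate(EMOTIONS)}
exchanged = {e: CONFUSION[i, :3].sum() - CONFUSION[i, i] for i, e in enumerate(EMOTIONS)}

best = max(correct, key=correct.get)      # emotion recognized most often
worst = min(correct, key=correct.get)     # emotion recognized least often
```

Each row sums to 100 %, and the diagonal entries identify the best and worst recognized emotions.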
Conclusion
Female voice emotional conversion:
– harmonic speech model with AR parameterization
Spectral modifications:
– spectral envelope: formant shift
– spectral flatness => voicing transition frequency
Prosodic modifications:
– energy, duration, F0 mean, F0 range, linear trend at the end of sentences
Listening tests:
– best synthesized: sadness
– worst synthesized: joy
Next research:
– inclusion of microprosodic features in emotional voice conversion
– modifications of F0 linear trend at the beginning of sentences