HARMONIC MODEL FOR FEMALE VOICE EMOTIONAL SYNTHESIS

Anna PŘIBILOVÁ
Department of Radioelectronics, Slovak University of Technology
Ilkovičova 3, SK-812 19 Bratislava, Slovakia, E-mail: [email protected]

Jiří PŘIBIL
Institute of Photonics and Electronics, Academy of Sciences of the Czech Republic
Chaberská 57, CZ-182 51 Praha 8, Czech Republic, E-mail: [email protected]

• Introduction

• Harmonic speech model with AR parameterization

• Spectral modifications for emotional synthesis

• Prosodic modifications for emotional synthesis

• Listening tests results

• Conclusion

Harmonic speech model with AR parameterization

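[Block diagram: speech synthesis with the harmonic model. Inputs: F0, gain G, AR coefficients {a_n}, spectral flatness S_F. Block labels: pitch harmonics; G/|A(e^{jω})|; ln; Hilbert transform; voicing transition frequency f_v-uv(S_F); randomization in <-π, π>; sum of sine waves + overlap and add. Intermediate parameters: {φ_m}, {A_m}, {f_m}. Output: synthetic speech.]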

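Voicing transition frequency:

[Equation: f_v-uv(S_F), an exponential function of the spectral flatness S_F involving the sampling frequency f_s and the constants 80 and 2]

Synthetic speech as a sum of sine waves:

s(l) = \sum_{m=1}^{M} A_m \cos(2\pi f_m l + \varphi_m)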
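The "sum of sine waves" step can be illustrated with a minimal sketch. It assumes the sinusoids sit at integer multiples of F0, takes frequencies in Hz and divides by an assumed sampling frequency `fs`, and assumes that phases above the voicing transition frequency are drawn at random from <-π, π>. The function name `synthesize_frame` and all parameter names are hypothetical, and the overlap-and-add of successive frames is left out.

```python
import numpy as np

def synthesize_frame(f0, amps, phases, fs, n_samples, f_vuv=None, rng=None):
    """Sum of sine waves for one frame (hypothetical helper, not the authors' code)."""
    amps = np.asarray(amps, dtype=float)
    phases = np.array(phases, dtype=float)      # copy, so the caller's data is untouched
    freqs = f0 * np.arange(1, len(amps) + 1)    # assumed: f_m = m * F0, in Hz
    if f_vuv is not None:
        # Assumption: harmonics above the voicing transition frequency
        # get random phases in <-pi, pi>.
        rng = np.random.default_rng() if rng is None else rng
        unvoiced = freqs > f_vuv
        phases[unvoiced] = rng.uniform(-np.pi, np.pi, size=int(unvoiced.sum()))
    l = np.arange(n_samples)                    # sample index within the frame
    arg = 2.0 * np.pi * freqs[:, None] * l[None, :] / fs + phases[:, None]
    return (amps[:, None] * np.cos(arg)).sum(axis=0)

# One harmonic at 200 Hz, 80 samples at 16 kHz (exactly one period).
frame = synthesize_frame(200.0, [1.0], [0.0], 16000.0, 80)
```

Passing `f_vuv` switches on the phase randomization; without it the frame is fully deterministic, which makes the sketch easy to check.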

Voicing transition frequency

[Figure: speech spectrum, Spectrum [dB] (-140 to -20 dB) versus Frequency [kHz] (0 to 8 kHz); the voicing transition frequency separates the voiced part from the unvoiced part.]

Determination of model parameters

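[Block diagram: determination of model parameters from speech. Block labels: Hamming window; abs(FFT); spectral envelope estimation; ln; real(IFFT); exp; autocorrelation method of LPC analysis. Outputs: gain G, AR coefficients {a_n}, spectral flatness S_F.]

Spectral flatness measure:

S_F = \frac{\left[\prod_{k=1}^{N_F/2} |S_k|^2\right]^{2/N_F}}{\frac{2}{N_F}\sum_{k=1}^{N_F/2} |S_k|^2} = \frac{\exp\left[\frac{2}{N_F}\sum_{k=1}^{N_F/2} \ln |S_k|^2\right]}{\frac{2}{N_F}\sum_{k=1}^{N_F/2} |S_k|^2}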
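A spectral flatness measure is commonly computed as the ratio of the geometric mean to the arithmetic mean of the power spectrum. The sketch below follows that common definition for a single Hamming-windowed frame; the function name `spectral_flatness`, the small `eps` guard against log(0), and the choice to skip the DC bin are assumptions made for illustration.

```python
import numpy as np

def spectral_flatness(frame, eps=1e-12):
    """Geometric mean / arithmetic mean of the power spectrum (illustrative sketch)."""
    frame = np.asarray(frame, dtype=float)
    windowed = frame * np.hamming(len(frame))
    power = np.abs(np.fft.rfft(windowed)) ** 2
    power = power[1:] + eps                     # bins k = 1 .. N/2; eps avoids log(0)
    geometric_mean = np.exp(np.mean(np.log(power)))
    arithmetic_mean = np.mean(power)
    return geometric_mean / arithmetic_mean

rng = np.random.default_rng(0)
n = np.arange(512)
sf_noise = spectral_flatness(rng.standard_normal(512))    # noise-like frame
sf_sine = spectral_flatness(np.sin(2 * np.pi * 0.05 * n)) # strongly harmonic frame
```

A noise-like frame gives a value much closer to 1 than a purely sinusoidal frame, which is what makes the measure usable as a voicing indicator.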

Female formant areas (+20 %)

Formant   Lower bound   Upper bound
F1        300 Hz        840 Hz
F2        840 Hz        2400 Hz
F3        2400 Hz       3840 Hz
F4        3840 Hz       4800 Hz

Emotional influence on speech formants

pleasant emotions – faucal and pharyngeal expansion, relaxation of tract walls, mouth corners retracted upward (F1 falling, resonances raised)

unpleasant emotions – faucal and pharyngeal constriction, tensing of vocal tract walls, mouth corners retracted downward (F1 rising, F2 and F3 falling)

Scherer, K., R.: Vocal Communication of Emotion: A Review of Research Paradigms. Speech Communication, Vol. 40 (2003) 227-256

Male formant areas

Formant   Lower bound   Upper bound
F1        250 Hz        700 Hz
F2        700 Hz        2000 Hz
F3        2000 Hz       3200 Hz
F4        3200 Hz       4000 Hz

Fant, G.: Speech Acoustics and Phonetics. Kluwer Academic Publishers, Dordrecht (2004)
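The female formant areas listed on the slides match the male areas scaled by 1.2 (+20 %). A short sketch of that arithmetic follows; the names `MALE_AREAS_HZ` and `scale_areas` are hypothetical.

```python
# Male formant areas in Hz, as listed on the slide.
MALE_AREAS_HZ = {
    "F1": (250, 700),
    "F2": (700, 2000),
    "F3": (2000, 3200),
    "F4": (3200, 4000),
}

def scale_areas(areas, factor):
    """Scale both bounds of every formant area and round to whole Hz."""
    return {name: (round(lo * factor), round(hi * factor))
            for name, (lo, hi) in areas.items()}

FEMALE_AREAS_HZ = scale_areas(MALE_AREAS_HZ, 1.2)
```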

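Spectral modifications for emotional synthesis

[Block diagram: spectral envelope modification with inputs G, {a_n} and outputs G', {a_n'}. Block labels: |A(e^{jω})|; frequency scale transformation; exp(real(IFFT)); autocorrelation method of LPC analysis.]

[Block diagram: spectral flatness modification S_F to S_F'. Values shown: 1.24, 1.11, 2.02; labels shown: joy, anger, sadness.]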

Frequency scale transformation

[Figure: three plots over f [kHz] (0 to 8 kHz): the frequency scale transformation (f [kHz], 0 to 8 kHz) and two formant ratio curves ([-], 0.8 to 1.5).]

F1,2: boundary frequency between the F1 area and the higher formant areas

F1 (< F1,2): increased (decreased)
F2, F3, F4 (> F1,2): decreased (increased)

Points marked on the frequency scale transformation: (0, 0), (F1,2, fs/4), (fs/2, fs/2)
Points marked on the formant ratio curve: (0, 1), (fs/8, ratio 1), (fs/4, 1), (3fs/8, ratio 2), (fs/2, 1)

Formant ratio between emotional and neutral speech

Chosen formant ratio (for frequency after transformation)

                                          1 (214.3 Hz)    2 (2666.7 Hz)
joyous-to-neutral formant ratio (shift)   0.7 (-30 %)     1.05 (+5 %)
angry-to-neutral formant ratio (shift)    1.35 (+35 %)    0.85 (-15 %)
sad-to-neutral formant ratio (shift)      1.1 (+10 %)     0.9 (-10 %)
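To show how the two chosen ratios could define a ratio curve over the transformed frequency axis, here is a sketch built on the points marked in the plots: (0, 1), (fs/8, ratio 1), (fs/4, 1), (3fs/8, ratio 2), (fs/2, 1). Two things are assumptions rather than statements from the slides: the linear interpolation between the points, and the sampling frequency of 16 kHz (inferred from the 0 to 8 kHz axes). The names `CHOSEN` and `formant_ratio_curve` are hypothetical.

```python
import numpy as np

FS = 16000.0  # assumed sampling frequency in Hz (plots span 0 to 8 kHz)

# Chosen (ratio 1, ratio 2) pairs from the table of chosen formant ratios.
CHOSEN = {
    "joy": (0.7, 1.05),
    "anger": (1.35, 0.85),
    "sadness": (1.1, 0.9),
}

def formant_ratio_curve(freqs_hz, ratio1, ratio2, fs=FS):
    """Ratio as a function of transformed frequency (piecewise-linear assumption)."""
    control_freqs = [0.0, fs / 8, fs / 4, 3 * fs / 8, fs / 2]
    control_ratios = [1.0, ratio1, 1.0, ratio2, 1.0]
    return np.interp(freqs_hz, control_freqs, control_ratios)

grid = np.array([0.0, 2000.0, 4000.0, 6000.0, 8000.0])
anger_curve = formant_ratio_curve(grid, *CHOSEN["anger"])
```

The curve equals 1 at 0, fs/4 and fs/2, so only the regions on either side of fs/4 are shifted, in opposite directions.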

Mean formant ratio in formant areas

                                          F1 300-840 Hz       F2 840-2400 Hz      F3 2400-3840 Hz     F4 3840-4800 Hz
joyous-to-neutral formant ratio (shift)   0.8982 (-10.18 %)   1.0589 (+5.89 %)    1.0334 (+3.34 %)    0.9964 (-0.36 %)
angry-to-neutral formant ratio (shift)    1.1289 (+12.89 %)   0.8849 (-11.51 %)   0.8623 (-13.77 %)   0.9012 (-9.88 %)
sad-to-neutral formant ratio (shift)      1.0432 (+4.32 %)    0.9383 (-6.17 %)    0.8991 (-10.09 %)   0.9076 (-9.24 %)
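The shift percentages quoted next to each ratio follow from shift = (ratio - 1) × 100 %. A small sketch that recomputes them from the mean ratios; the names `MEAN_RATIOS` and `shift_percent` are hypothetical.

```python
# Mean emotional-to-neutral formant ratios per formant area (F1, F2, F3, F4).
MEAN_RATIOS = {
    "joyous": [0.8982, 1.0589, 1.0334, 0.9964],
    "angry": [1.1289, 0.8849, 0.8623, 0.9012],
    "sad": [1.0432, 0.9383, 0.8991, 0.9076],
}

def shift_percent(ratio):
    """Formant shift in percent relative to neutral speech."""
    return round((ratio - 1.0) * 100.0, 2)

SHIFTS = {emotion: [shift_percent(r) for r in ratios]
          for emotion, ratios in MEAN_RATIOS.items()}
```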

[Figure: formant shifts for joyous, angry and sad speech, annotated with the chosen shifts (-30 %, +5 %, +35 %, -15 %, +10 %, -10 %) and the mean shifts per formant area.]

Prosody of emotional speech

Scherer, K., R.: Vocal Communication of Emotion: A Review of Research Paradigms. Speech Communication, Vol. 40 (2003) 227-256

EMOTION    F0 mean   F0 range   energy   duration
JOY        higher    higher     higher   shorter
ANGER      higher    higher     higher   shorter
SADNESS    lower     lower      lower    longer

OUR CHOICE OF EMOTIONAL-TO-NEUTRAL RATIOS

EMOTION    F0 mean   F0 range   energy   duration
JOY        1.18      1.30       1.30     0.81
ANGER      1.16      1.30       1.70     0.84
SADNESS    0.81      0.62       0.95     1.16
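As an illustration of how the F0 mean and F0 range ratios might be applied to a neutral F0 contour, the sketch below scales the mean by one ratio and the deviations from the mean by the other. That particular decomposition is an assumption, as are the names `PROSODY_RATIOS` and `modify_f0`; energy and duration changes are not shown, and the contour is assumed to contain voiced frames only.

```python
import numpy as np

# Emotional-to-neutral ratios from the table: F0 mean, F0 range, energy, duration.
PROSODY_RATIOS = {
    "joy": {"f0_mean": 1.18, "f0_range": 1.30, "energy": 1.30, "duration": 0.81},
    "anger": {"f0_mean": 1.16, "f0_range": 1.30, "energy": 1.70, "duration": 0.84},
    "sadness": {"f0_mean": 0.81, "f0_range": 0.62, "energy": 0.95, "duration": 1.16},
}

def modify_f0(f0_hz, mean_ratio, range_ratio):
    """Scale the contour mean and its excursions around the mean (assumed scheme)."""
    f0_hz = np.asarray(f0_hz, dtype=float)
    mean = f0_hz.mean()
    return mean * mean_ratio + (f0_hz - mean) * range_ratio

neutral = np.array([180.0, 200.0, 220.0, 240.0, 210.0])   # made-up contour in Hz
joy = PROSODY_RATIOS["joy"]
joyful = modify_f0(neutral, joy["f0_mean"], joy["f0_range"])
```

With this scheme the new mean is exactly 1.18 times the neutral mean and the new range is exactly 1.30 times the neutral range, so both table entries are met independently.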

Linear trend of F0 at the end of sentences

[Figure: two panels (JOY, ANGER) showing relative F0 (F0 rel [-], 0.4 to 1.6) over N [frames] (0 to 200) for the source contour (VF0source) and the contour with linear trend (VF0LT); the linear trend (LT) starts at frame 122 in one panel and at frame 61 in the other.]

EMOTION    linear trend type   linear trend start
JOY        rising              55 % from the end
ANGER      falling             35 % from the end
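The "linear trend start" column can be read as a frame index counted back from the end of the sentence. The sketch below applies a multiplicative linear ramp from that frame to the last frame. The slides do not give the slope, so the end value `end_gain` is a hypothetical parameter, as are the function name `apply_linear_trend` and the multiplicative form of the trend.

```python
import numpy as np

def apply_linear_trend(f0_rel, fraction_from_end, end_gain):
    """Apply a linear ramp to the final part of a relative F0 contour (sketch)."""
    f0_rel = np.asarray(f0_rel, dtype=float)
    n = len(f0_rel)
    start = n - int(round(n * fraction_from_end))   # first frame of the trend
    out = f0_rel.copy()
    # Ramp goes from 1.0 at the trend start to end_gain at the last frame.
    out[start:] *= np.linspace(1.0, end_gain, n - start)
    return out, start

flat = np.ones(200)                                  # made-up flat contour
rising, rising_start = apply_linear_trend(flat, 0.55, 1.2)    # JOY: rising
falling, falling_start = apply_linear_trend(flat, 0.35, 0.8)  # ANGER: falling
```

An `end_gain` above 1 gives a rising trend and a value below 1 gives a falling trend; frames before the start index are left unchanged.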

Listening tests

“Determination of emotion type”

– 10 evaluation sets selected randomly from the testing corpus

– 60 short sentences (1 s to 3.5 s)

– from Czech stories

– female professional actors

– 4 possibilities: “joy”, “anger”, “sadness”, “other”

20 listeners (16 Czechs and 4 Slovaks, 6 women and 14 men)

http://www.lef.um.savba.sk/Scripts/itstposl2.dll

MS ISAPI/NSAPI DLL script
- runs on server PC
- communicates with user via HTTP protocol



Listening tests results

Confusion matrix

EMOTION    JOY      ANGER    SADNESS   OTHER
JOY        59.0 %   0.5 %    16.0 %    24.5 %
ANGER      2.5 %    73.5 %   2.0 %     22.0 %
SADNESS    0.5 %    0.5 %    90.0 %    9.0 %

Successful determination of emotions (summed for all emotions)

                               correct   not classified   exchanged
best evaluated sentence *      88.1 %    11.9 %           0 %
worst evaluated sentence **    57.6 %    30.3 %           12.1 %

* “Vše co potřeboval.” (“All he needed.”)
** “Máš ho mít.” (“You ought to have it.”)
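The confusion matrix can be summarized per emotion in the same three categories used for individual sentences. The sketch assumes that "not classified" corresponds to the "other" answer and "exchanged" to any other named emotion; the names `CONFUSION` and `summarize` are hypothetical.

```python
LABELS = ["joy", "anger", "sadness", "other"]

# Rows: synthesized emotion; columns: listeners' answers in percent.
CONFUSION = {
    "joy": [59.0, 0.5, 16.0, 24.5],
    "anger": [2.5, 73.5, 2.0, 22.0],
    "sadness": [0.5, 0.5, 90.0, 9.0],
}

def summarize(emotion):
    """Return (correct, not classified, exchanged) percentages for one row."""
    row = dict(zip(LABELS, CONFUSION[emotion]))
    correct = row[emotion]
    not_classified = row["other"]        # assumption: "other" means not classified
    exchanged = sum(value for label, value in row.items()
                    if label not in (emotion, "other"))
    return correct, not_classified, exchanged

SUMMARY = {emotion: summarize(emotion) for emotion in CONFUSION}
```

Each row of the matrix sums to 100 %, so the three categories cover all answers for an emotion.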

Conclusion

Female voice emotional conversion:
– harmonic speech model with AR parameterization

Spectral modifications:
– spectral envelope: formant shift
– spectral flatness => voicing transition frequency

Prosodic modifications:
– energy, duration, F0 mean, range, linear trend at the end of sentences

Listening tests:
– best synthesized: sadness
– worst synthesized: joy

Next research:

– inclusion of microprosodic features in emotional voice conversion

– modifications of F0 linear trend at the beginning of sentences
