Nick Campbell NiCT / ATR-SLC National Institute of Information and Communications Technology &


Page 1: Nick Campbell NiCT / ATR-SLC National Institute of Information and  Communications Technology &

Towards Conversational Speech Synthesis;

“Lessons Learned from the Expressive Speech Processing Project”

Nick Campbell

NiCT / ATR-SLC

National Institute of Information and Communications Technology & ATR Spoken Language Communication Research Labs

Keihanna Science City, Kyoto 619-0288, Japan

[email protected], [email protected]

Page 2:

The JST/CREST ‘ESP’ corpus

• The ATR “Expressive Speech Processing” (ESP) project (JST/CREST) ran from April 2000 to March 2005 and resulted in a corpus of 1,500 hours of natural conversational speech

• All recordings were transcribed, and about 10% are annotated for speaking-style, etc.

• The corpus is divided into three sections: (i) esp_f, (ii) esp_c, and (iii) esp_m

Page 3:

Transcription example

匿名One “utterance” per line

Page 4:

Sections of the ESP corpus

• esp_f
– one female speaker, head-mounted mic, 600 hours of daily spoken interactions, emotion/speech-act/etc …

• esp_c
– 10 adult speakers (5 male, 5 female; 2 Chinese, 2 English)
– 30-minute telephone conversations x 10 weeks
– all conversations in Japanese, free content

• esp_m
– multi-speaker, head-mounted microphones, variety of interaction settings (like esp_f but many more voices)

Page 5:
Page 6:

Finding #1: The Function of Conversational Speech

• To establish a rapport with the listener
• To show interest and attention
• To convey propositional content

• Contrast “broadcast mode” (one-way) with “interactive mode” (two-way) speech

• Speech synthesis can do broadcast mode, but conversational speech is two-way!

Page 7:
Page 8:

Backchannels & affect bursts

the hundred most common utterances
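A frequency list like this can be derived directly from a one-utterance-per-line transcription. The following is only a minimal sketch: the function name and the toy sample are illustrative assumptions, not the project's actual tooling or data.

```python
from collections import Counter


def most_common_utterances(lines, n=100):
    """Return the n most frequent utterances with their counts.

    Assumes one utterance per line; blank lines are ignored.
    """
    counts = Counter(line.strip() for line in lines if line.strip())
    return counts.most_common(n)


# Toy sample (invented for illustration, not taken from the ESP corpus).
sample = ["うん", "はい", "うん", "ほんま", "うん", "はい"]
top = most_common_utterances(sample, n=2)
```

On a real transcript the same call with the default n=100 would yield the hundred most common utterances.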

Page 9:

Non-Verbal Speech Sounds

• Short, simple, repetitive noises
• How they are spoken is usually more important than what is being said

• Some examples:
– The word ほんま
– Means “really”
– Used a lot in Osaka conversations …

Page 10:
Page 11:

Synthesis of Non-Verbal Speech Sounds

• The challenge now is how to synthesise these non-lexical speech sounds

• The same speaker says the same word in many consistently different ways …

• How should they best be (a) described and (b) realised?

Page 12:

Tap-to-talk

http:feast.atr.jp/imode

Page 13:

Characteristics of Non-Verbal Utterances

• Better described by icons?

• Short, expressive sounds
• Phonetically ambiguous
• Prosodically marked

• Not well specified by text input!
– But frequent and textually ‘transparent’

Page 14:

‘Wrappers’ and ‘Fillings’ - Interaction Devices

• Often used as “edge-markers”
– At beginning and end of utterance chunks

• Add expressivity to propositional content
– Not just “fillers” – they ‘wrap’ the utterance

e.g., “erm, it’s very simple, you know”
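The edge-marker idea can be illustrated by peeling wrapper chunks off both ends of an utterance. This is only a sketch under strong assumptions: the tiny wrapper inventory and the comma-based chunking are hypothetical simplifications, not the annotation scheme used in the project.

```python
# Hypothetical wrapper inventory, for illustration only.
WRAPPERS = {"erm", "you know"}


def split_wrappers(utterance, wrappers=WRAPPERS):
    """Split an utterance into (leading wrappers, filling, trailing wrappers).

    Chunks are assumed to be comma-separated; a chunk counts as a wrapper
    only when it sits at an edge and appears in the wrapper inventory.
    """
    chunks = [c.strip() for c in utterance.split(",") if c.strip()]
    start = 0
    while start < len(chunks) and chunks[start].lower() in wrappers:
        start += 1
    end = len(chunks)
    while end > start and chunks[end - 1].lower() in wrappers:
        end -= 1
    return chunks[:start], chunks[start:end], chunks[end:]


leading, filling, trailing = split_wrappers("erm, it's very simple, you know")
```

Here the propositional content ("it's very simple") ends up in the filling, with one wrapper on each side.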

Page 15:

The Acoustic features of Wrappers (and Fillers)

• Prosodically very variable in more than just pitch & duration …

• PCA dimension reduction shows
– 3 components account for more than 50% of the variance
– 7 components account for more than 80% of the variance
– Voice quality comes up in the 1st component!
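Figures of this kind come from the cumulative explained variance of a PCA over per-utterance acoustic measures. The sketch below assumes a feature matrix with one row per utterance and one column per measure; the function names are illustrative, and the random data only exercises the code and does not reproduce the reported percentages.

```python
import numpy as np


def cumulative_explained_variance(features):
    """Cumulative fraction of variance explained by successive components.

    Each column is standardised first (an assumption), then PCA is
    computed via the singular values of the standardised matrix.
    """
    x = np.asarray(features, dtype=float)
    x = x - x.mean(axis=0)
    std = x.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns unscaled
    x = x / std
    singular_values = np.linalg.svd(x, compute_uv=False)
    variance = singular_values ** 2
    return np.cumsum(variance) / variance.sum()


def components_needed(cumulative, threshold):
    """Smallest number of components whose cumulative share reaches threshold."""
    return int(np.searchsorted(cumulative, threshold) + 1)


# Synthetic stand-in data: 200 "utterances" x 12 "acoustic measures".
rng = np.random.default_rng(0)
demo = cumulative_explained_variance(rng.normal(size=(200, 12)))
k50 = components_needed(demo, 0.5)
k80 = components_needed(demo, 0.8)
```

With real acoustic features, k50 and k80 would be the counts to compare against the 3 and 7 components quoted above.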

Page 16:
Page 17:

Voice Quality in Synthesis

• Chakai – affect-based unit selection

• using “whole-phrase” units
• that vary according to expressivity
• selected by their acoustics (principal components)

• They show affective relationships
• and serve a pragmatic (phatic) function
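Selecting a whole-phrase unit by its acoustics can be pictured as a nearest-neighbour lookup in principal-component space. The slides do not specify Chakai's actual selection criterion, so the Euclidean distance and the function name below are assumptions made purely for illustration.

```python
import numpy as np


def select_unit(unit_scores, target):
    """Return the index of the stored phrase unit closest to the target.

    unit_scores: one row of principal-component scores per stored unit.
    target: desired position in the same component space.
    Euclidean distance is an assumed criterion, not Chakai's documented one.
    """
    scores = np.asarray(unit_scores, dtype=float)
    distances = np.linalg.norm(scores - np.asarray(target, dtype=float), axis=1)
    return int(np.argmin(distances))


# Invented scores for three recordings of the same phrase.
units = [[0.0, 0.0], [1.0, 1.0], [-2.0, 0.5]]
chosen = select_unit(units, target=[0.9, 1.2])
```

Because the units are whole phrases, the chosen index maps straight to a recording that can be played back without further concatenation.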

Page 18:

Chakai

Page 19:

KeyTalk

Page 20:

Touch-sensitive Selection

• One big advantage of using a MIDI keyboard is touch-sensitivity: controlled sustain & attack (i.e., perfect for the natural input of prosody)
– with pitch-bend as well …

• Another is that keys can be intuitively grouped into related sets of utterances

Page 21:

Greetings, replies, opinions, calling, etc. …

Octave or sub-octave clusters … 5 black & 7 white keys
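One way to picture the keyboard layout is a mapping from a MIDI note and its velocity to an utterance group, a variant within the group, and an intensity. The sketch below is hypothetical throughout: the assignment of octaves to categories, the starting note, and the use of velocity as an intensity value are assumptions, not the KeyTalk design.

```python
# Pitch classes of the 5 black keys in an octave (C = 0).
BLACK_PITCH_CLASSES = {1, 3, 6, 8, 10}

# Hypothetical assignment of one octave cluster per utterance group.
CATEGORIES = ["greetings", "replies", "opinion", "calling"]


def key_to_selection(note, velocity, lowest_note=48):
    """Map a MIDI note-on (note, velocity) to an utterance selection.

    lowest_note is the assumed C key where the first cluster starts.
    Velocity (0-127) is scaled to an intensity between 0 and 1.
    """
    if not 0 <= note <= 127 or not 0 <= velocity <= 127:
        raise ValueError("note and velocity must be in 0..127")
    cluster = (note - lowest_note) // 12
    if not 0 <= cluster < len(CATEGORIES):
        raise ValueError("note is outside the mapped octaves")
    pitch_class = note % 12
    return {
        "category": CATEGORIES[cluster],
        "variant": pitch_class,
        "black_key": pitch_class in BLACK_PITCH_CLASSES,
        "intensity": velocity / 127,
    }


selection = key_to_selection(note=60, velocity=127)
```

The black/white distinction gives two natural sub-groups (5 and 7 keys) inside each octave cluster, which could separate, for example, marked from neutral variants.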

Page 22:

Grouping Related Utterances

• It remains as future work to group related utterance types and plan a full keyboard for non-verbal speech sound synthesis

• Demo software is provided on the CD-ROM proceedings. Please let me know if you have any helpful ideas or suggestions

:-)

Page 23:

Summary

• Non-verbal sounds offer a challenge for the synthesis of interactive speech

• They are frequent and carry important affective and discourse-flow information

• Segments can be selected and reused from a conversational speech corpus

Page 24:

Conclusion

• This paper has presented some examples of non-linguistic uses of speech prosody

• Synthesis of expressive sounds is easy!
– ‘Units’ can be whole phrases

• But unit selection is difficult!
– They carry subtle differences of meaning
– That can be very hard to specify in text

Page 25:

Listen:

• Some examples of conversational speech

• (a) taken from the corpus (natural)
• (b) synthesised using current technology
• (c) concatenated from a very large corpus

• Listen to the non-linguistic prosody!

Page 26:

NATR: next-generation advanced text rendering

– The original dialogue
– ditto, synthesised
– CHATR (& original)
– NATR – large-corpus
– NATR – more lively

Morning もしもし
Morning もしもし
Hello こんにちは
hi_there_ まいど
Haha ハハハ
been_a_long_time 久しぶりですねー
came_staight_to_the_eighth_floor もう直接、八階の方に、はい
Really あ、そうなん
Really あ、ほんま
seventh_floor_today 七階すか、今日
yeah_yeah うーん、そうそう
Hahaha ワーハハハー
what_time_did_you_come 何時頃来たんすか
just_now さっき
about_now さっきぐらいウアハハハハハハ、まじで
bit_late ちょっと遅なっ<てんやん、アハ
just_in_time ぎりぎりー
not_really いや、そういうわけじゃないねんけど
yeah_yeah_yeah はーいあいはいはい
Umm うん
Yeah そっかそっか
So そう
came_by_bike あたし自転車やから
from_Kyooubashi 京橋のほうじゃなかったですっけ
Really そうや
Yeah でしょう
Umm うん
from_Kyoubashi 京橋から
by_bike チャリンコですぐですか、あーそんなもんで来れるんやー
Yeah そう
Yeah うん、だいたい
Really あそうなーんすか
Yeah うん

Page 27:

7. Acknowledgements

• This work is supported by the National Institute of Information and Communications Technology (NiCT), and includes contributions from the Japan Science & Technology Corporation (JST), and the Ministry of Public Management, Home Affairs, Posts and Telecommunications, Japan (SCOPE).

• The author is especially grateful to the management of ATR Spoken Language Communication Research Labs for their continuing encouragement and support.

Page 28:

Thank you

coming next:

Page 29:

Thank you

Page 30: