Towards Conversational Speech Synthesis;
“Lessons Learned from the Expressive Speech Processing Project”
NiCT / ATR-SLC
National Institute of Information and Communications Technology
ATR Spoken Language Communication Research Labs
Keihanna Science City, Kyoto 619-0288, Japan
The JST/CREST ‘ESP’ corpus
• The ATR “Expressive Speech Processing” project (JST/CREST) lasted from 4/’00 to 3/’05 and resulted in a corpus of 1,500 hours of natural conversational speech
• All recordings were transcribed, and about 10% are annotated for speaking-style, etc.
• The corpus is divided into 3 sections: (i) esp_f, (ii) esp_c, and (iii) esp_m
Anonymised: one “utterance” per line
Sections of the ESP corpus
• esp_f
– one female speaker, head-mounted mic, 600 hours of daily spoken interactions, emotion/speech-act/etc …
• esp_c
– 10 adult speakers, 5m 5f, 2 Chinese, 2 English
– 30-minute telephone conversations x 10 weeks
– all conversations in Japanese, free content
• esp_m
– multi-speaker, head-mounted microphones, variety of interaction settings (like esp_f but many more voices)
Finding #1: the Function of Conversational Speech
• To establish a rapport with the listener
• To show interest and attention
• To convey propositional content
• Contrast “broadcast mode” (one-way) with “interactive mode” (two-way) speech
• Speech Synthesis can do broadcast mode but Conversational Speech is two-way!
Backchannels & affect bursts
the hundred most common utterances
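A ranking like this can be computed directly from transcripts stored with one utterance per line. The sketch below is only an illustration under that assumption: the function name `most_common_utterances` and the sample lines are hypothetical and are not taken from the ESP corpus or its tools.

```python
from collections import Counter

def most_common_utterances(lines, n=100):
    """Return the n most frequent utterances, given one utterance per line."""
    # Strip surrounding whitespace and ignore empty lines.
    cleaned = (line.strip() for line in lines)
    counts = Counter(u for u in cleaned if u)
    return counts.most_common(n)

# Made-up sample lines for illustration, not corpus data.
sample = ["うん", "はい", "うん", "あ、ほんま", "はい", "うん", ""]
for utterance, count in most_common_utterances(sample, n=2):
    print(utterance, count)
```

Counting exact strings treats every spelling variant as a separate utterance, so any normalisation of the transcription would have to happen before this step.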
Non-Verbal Speech Sounds
• Short, simple, repetitive noises
• How they are spoken is usually more important than what is being said
• Some examples:
– The word ほんま (“honma”)
– Means “really”
– Used a lot in Osaka conversations …
Synthesis of Non-Verbal Speech Sounds
• The challenge now is how to synthesise these non-lexical speech sounds
• The same speaker says the same word in many consistently different ways …
• How should they best be (a) described
Characteristics of Non-Verbal Utterances
• Better described by icons?
• Short, expressive sounds
• Phonetically ambiguous
• Prosodically marked
• Not well specified by text input!
– But frequent and textually ‘transpar
‘Wrappers’ and ‘Fillings’ - Interaction Devices
• Often used as “edge-markers”
– At beginning and end of utterance chunks
• Add expressivity to propositional content
– Not just “fillers” – they ‘wrap’ the utterance
e.g., “erm, it’s very simple, you know”
The Acoustic features of Wrappers (and Fillers)
• Prosodically very variable in more than just pitch & duration …
• PCA dimension reduction shows
– 3 components account for more than 50% of the variance
– 7 components account for more than 80% of the variance
– Voice-quality comes up in the 1st component!
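Statements of the form “k components account for more than X%” come from the cumulative share of variance carried by the leading principal components. A minimal sketch of that calculation follows; the function name `components_needed` and the eigenvalues are made up for illustration and are not the values obtained from the ESP data.

```python
def components_needed(eigenvalues, threshold):
    """Smallest number of leading components whose cumulative share of
    the total variance exceeds `threshold` (a fraction between 0 and 1)."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for k, value in enumerate(ordered, start=1):
        running += value
        if running / total > threshold:
            return k
    return len(ordered)

# Illustrative eigenvalues only (hypothetical, not from the corpus).
eigenvalues = [4.0, 2.0, 1.5, 1.0, 0.5, 0.5, 0.25, 0.25]
print(components_needed(eigenvalues, 0.5))
print(components_needed(eigenvalues, 0.8))
```

Which acoustic features load on each component is a separate question, answered by inspecting the component vectors rather than the eigenvalues.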
Voice Quality in Synthesis
• Chakai – affect-based unit selection
• using “whole-phrase” units
• that vary according to expressivity
• selected by their acoustics (principal components)
• They show affective relationships
• and serve a pragmatic (phatic) function
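One way to picture selection “by acoustics” is as a nearest-neighbour search: each whole-phrase unit is stored with its coordinates in principal-component space, and the unit closest to a requested target is chosen. The sketch below assumes exactly that and nothing more; it is not a description of how Chakai is implemented, and the unit names and coordinates are hypothetical.

```python
import math

def select_unit(target, candidates):
    """Return the name of the candidate whose principal-component
    coordinates lie closest (Euclidean distance) to `target`."""
    return min(candidates, key=lambda name: math.dist(target, candidates[name]))

# Hypothetical whole-phrase units: several renditions of one word,
# each with made-up coordinates on three principal components.
units = {
    "honma_01": (0.8, -0.2, 0.1),
    "honma_02": (-0.5, 0.4, 0.0),
    "honma_03": (0.1, 0.9, -0.3),
}

print(select_unit((0.7, 0.0, 0.0), units))
```

Because each unit is a whole phrase, no concatenation cost is involved here; the difficulty lies in specifying the target point in the first place.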
• One big advantage of using a MIDI keyboard is touch-sensitivity
– controlled sustain & attack, i.e., perfect for the natural input of prosody
– with pitch-bend as well …
• Another is that keys can be intuitively grouped into related sets of utterances
– Greetings, replies, opinions, calling, etc. …
– Octave or sub-octave clusters … 5 & 7 black & white keys
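The grouping idea can be sketched as a lookup from a MIDI note-on event to an utterance request: the octave picks the utterance category, the key within the octave picks a variant, and key velocity stands in for expressive intensity. The category layout, the field names, and the function `key_to_request` are assumptions made for illustration (the keyboard plan itself is not fixed in these slides), and octave numbers follow the common convention in which MIDI note 60 is C4.

```python
BLACK_KEYS = {1, 3, 6, 8, 10}  # pitch classes of the 5 black keys per octave

# Hypothetical layout: one utterance category per octave.
CATEGORY_BY_OCTAVE = {3: "greetings", 4: "replies", 5: "opinions", 6: "calling"}

def key_to_request(note, velocity):
    """Map a MIDI note number (0-127) and velocity (0-127) to a request
    for an utterance category, a variant, and an intensity in [0, 1]."""
    octave = note // 12 - 1          # note 60 -> octave 4
    pitch_class = note % 12          # position of the key within its octave
    return {
        "category": CATEGORY_BY_OCTAVE.get(octave, "unassigned"),
        "variant": pitch_class,
        "black_key": pitch_class in BLACK_KEYS,
        "intensity": velocity / 127,
    }

print(key_to_request(60, 127))
print(key_to_request(61, 64))
```

The 5 black and 7 white keys in each octave give two natural sub-clusters, which is one possible reading of the “sub-octave” grouping mentioned above.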
Grouping Related Utterances
• It remains as future work to group related utterance types and plan a full keyboard for non-verbal speech sound synthesis
• Demo software is provided on the CD-ROM proceedings – please let me know if you have any helpful ideas or suggestions
• Non-verbal sounds offer a challenge for the synthesis of interactive speech
• They are frequent and carry important affective and discourse-flow information
• Segments can be selected and reused from a conversational speech corpus
• This paper has presented some examples of non-linguistic uses of speech prosody
• Synthesis of expressive sounds is easy!
– ‘Units’ can be whole phrases
• But unit selection is difficult!
– They carry subtle differences of meaning
– That can be very hard to specify in text
• Some examples of conversational speech
• (a) taken from the corpus (natural)
• (b) synthesised using current technology
• (c) concatenated from a very-large corpus
• Listen to the non-linguistic prosody!
advanced text rendering
– The original dialogue
– ditto, synthesised
– CHATR (& original)
– NATR – large-corpus
– NATR – more lively
Morning もしもし
Morning もしもし
Hello こんにちは
hi_there_ まいど
Haha ハハハ
been_a_long_time 久しぶりですねー
came_staight_to_the_eighth_floor もう直接、八階の方に、はい
Really あ、そうなん
Really あ、ほんま
seventh_floor_today 七階すか、今日
yeah_yeah うーん、そうそう
Hahaha ワーハハハー
what_time_did_you_come 何時頃来たんすか
just_now さっき
about_now さっきぐらいウアハハハハハハ、まじで
bit_late ちょっと遅なっ＜てんやん、アハ
just_in_time ぎりぎりー
not_really いや、そういうわけじゃないねんけど
yeah_yeah_yeah はーいあいはいはい
Umm うん
Yeah そっかそっか
So そう
came_by_bike あたし自転車やから
from_Kyooubashi 京橋のほうじゃなかったですっけ
Really そうや
Yeah でしょう
Umm うん
from_Kyoubashi 京橋から
by_bike チャリンコですぐですか、あーそんなもんで来れるんやー
Yeah そう
Yeah うん、だいたい
Really あそうなーんすか
Yeah うん
7. Acknowledgements
• This work is supported by the National Institute of Information and Communications Technology (NiCT), and includes contributions from the Japan Science & Technology Corporation (JST), and the Ministry of Public Management, Home Affairs, Posts and Telecommunications, Japan (SCOPE).
• The author is especially grateful to the management of ATR Spoken Language Communication Research Labs for their continuing encouragement and support.