1/42
Introduction Discrete Modelling Continuous Modelling Speaking Style Conclusion
Analysis and Modelling of Speech Prosody and Speaking Style
Nicolas Obin
Sound Analysis and Synthesis Department
IRCAM - CNRS - UMR 9912 - STMS
23 June 2011
Nicolas Obin Sound Analysis and Synthesis Department IRCAM - CNRS - UMR 9912 - STMS
MeLos
2/42
Introduction
Discrete Modelling
  Use of Rich Syntactic Description to Assign Prosodic Events
  Combination of Syntactic and Metric Constraints to Assign Pauses
Continuous Modelling
  Stylization and Trajectory Modelling of F0 Contours
Speaking Style
  Ability of Human Listeners to Identify a Speaking Style
  Discrete/Continuous Modelling of Speaking Style
Conclusion & Further Directions
3/42
Text-To-Speech Synthesis
Current Methods
• Unit selection [Hunt and Black, 1996]
• HMM-based [Yoshimura et al., 1999]: models speech characteristics using parametric statistical methods
Current State
• intelligible
Limitations
• naturalness
• variety
Improvement
• modelling speech prosody
4/42
Speech Prosody
Definition
• Suprasegmental variations of speech, "the music of speech" (e.g., intonation, accent)
• Conveys meaning, emotions, intentions, and much information about the background of a speaker
• Vocal signature of a speaker, which contributes to the speaker's identity
5/42
Speech Prosody
Continuous Characteristics
• fundamental frequency
• duration
• intensity
• voice quality
• articulation degree
Discrete Characteristics
• Description of relevant prosodic events
  1. accent
  2. boundaries (e.g., pause)
[Figure: F0 contour of an utterance, f0 [Hz] against time [s], with syllable labels: lo ta Z@ m@ sHi ku Se d2 bOn9R]
Table 3.4: Description of the IVTS transcription system with the tone inventory proposed for French.
description level            symbol    description
prominence                   P         prosodic prominence
local phonetic variations    l/L       low pitch
                             m/M       middle pitch
                             h/H       high pitch
global phonetic variations   R         reset
                             D         downstep
phonological: tone           L*/L      low pitch accent
                             H*/H      high pitch accent
                             +
phonological: modifier       ∧         upstep
                             !         downstep
                             >         propagation
phonological: frontier       %L/L%     low pitch boundary tone
                             %H/H%     high pitch boundary tone
                             %/%

Table 3.5: Illustration of the text-to-prosodic-structure conversion.
phonological
global phonetic    * *
local phonetic     * * *
prominence         % * P *
syllable           Long- temps ## je me suis cou- ché de bonne heure ##
sentence           Longtemps , je me suis couché de bonne heure .

Tone labels shown in the figure: LH, H, mL, lL, D, %H, H%, L*, L%
6/42
Levels of Speech Communication
HEY PATRICK!
phonetics
phonology
morphology
syntax
semantics
pragmatics
acoustic signal
SPEAKER
LISTENER
SPEECH: PROSODY
TEXT: ABSTRACT LINGUISTIC LEVELS
7/42
Major Contributions of this Work
Statistical Modelling of Speech Prosody
1. Statistical modelling of discrete (position of prosodic events) and continuous (F0 variations, duration) characteristics of speech prosody
2. Combination of syntactic and metric constraints to assign pauses
3. Stylization and trajectory modelling of F0 contours
Integration of a Rich Linguistic Description
4. Use of deep syntactic parsing to model speech prosody characteristics
Application to Speaking Style Modelling
5. Study on the ability of listeners to identify a speaking style
6. Reference for the evaluation of speaking style modelling
7. Ability of discrete/continuous HMMs to model the characteristics of a speaking style
8/42
Statistical Framework Used in this Work
Hidden Markov Models
• commonly used in speech recognition/synthesis
Used in Speech Synthesis
• discrete/continuous HMMs [Black and Taylor, 1994, Yoshimura et al., 1999]
Paradigms
• modelling speech characteristics
• modelling variability due to the context (e.g. phonemic, lexical, syntactic, or even semantic)
• context clustering: modelling contexts that are acoustically relevant
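As a rough illustration of how a discrete HMM can assign a sequence of labels, the sketch below decodes the most likely sequence of prosodic labels with the Viterbi algorithm. The states, the observation symbols, and every probability are invented for illustration; they are assumptions, not the models trained in this work.

```python
import math

# Hypothetical toy model: hidden states are prosodic labels, observations are
# coarse word classes. All probabilities below are made-up values.
STATES = ["none", "P", "FM"]
START = {"none": 0.8, "P": 0.15, "FM": 0.05}
TRANS = {
    "none": {"none": 0.6, "P": 0.3, "FM": 0.1},
    "P":    {"none": 0.7, "P": 0.1, "FM": 0.2},
    "FM":   {"none": 0.8, "P": 0.15, "FM": 0.05},
}
EMIT = {
    "none": {"function": 0.7, "content": 0.25, "punct": 0.05},
    "P":    {"function": 0.1, "content": 0.8, "punct": 0.1},
    "FM":   {"function": 0.05, "content": 0.35, "punct": 0.6},
}


def viterbi(observations):
    """Return the most likely state sequence for a list of observation symbols."""
    # delta[s] = best log-probability of any path ending in state s
    delta = {s: math.log(START[s]) + math.log(EMIT[s][observations[0]])
             for s in STATES}
    backpointers = []
    for obs in observations[1:]:
        new_delta, pointers = {}, {}
        for s in STATES:
            best_prev = max(STATES, key=lambda p: delta[p] + math.log(TRANS[p][s]))
            new_delta[s] = (delta[best_prev] + math.log(TRANS[best_prev][s])
                            + math.log(EMIT[s][obs]))
            pointers[s] = best_prev
        delta = new_delta
        backpointers.append(pointers)
    # Backtrack from the best final state
    state = max(STATES, key=lambda s: delta[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


labels = viterbi(["function", "content", "punct"])
```

With these toy values, the decoder places a prominence on the content word and a major boundary at the punctuation mark.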
10/42
Introduction
Objective
Determine the position of prosodic events (accent, pause) from a text
Linguistic Studies
have pointed out that speech prosody is partially constrained by:
• syntactic constraint [Selkirk, 1984]
• metric constraint [Liberman and Prince, 1977]
Issues
• Modelling syntactic constraint
• Modelling metric constraint
• Combining syntactic & metric constraints
Contributions of my Work
• Integration of a deep syntactic description
• Use of Segmental HMMs + Dempster-Shafer Fusion
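The slide does not detail how the fusion is carried out. As a generic illustration of Dempster's rule of combination, the sketch below fuses two belief assignments about whether a pause occurs at a given position. The source names, the hypothesis labels, and the mass values are assumptions chosen for the example.

```python
from itertools import product


def dempster_combine(m1, m2):
    """Combine two mass functions with Dempster's rule.

    Each mass function maps a frozenset of hypotheses to its mass.
    """
    combined = {}
    conflict = 0.0
    for (a, mass_a), (b, mass_b) in product(m1.items(), m2.items()):
        intersection = a & b
        if intersection:
            combined[intersection] = combined.get(intersection, 0.0) + mass_a * mass_b
        else:
            conflict += mass_a * mass_b  # mass given to incompatible hypotheses
    if conflict >= 1.0:
        raise ValueError("total conflict: the sources cannot be combined")
    # Renormalise by the non-conflicting mass
    return {h: m / (1.0 - conflict) for h, m in combined.items()}


PAUSE, NO_PAUSE = frozenset({"pause"}), frozenset({"no_pause"})
EITHER = PAUSE | NO_PAUSE  # mass left on the whole frame expresses ignorance

# Hypothetical beliefs from a syntactic source and a metric source
syntactic = {PAUSE: 0.6, NO_PAUSE: 0.1, EITHER: 0.3}
metric = {PAUSE: 0.5, NO_PAUSE: 0.2, EITHER: 0.3}
fused = dempster_combine(syntactic, metric)
```

Here the two sources agree more than they conflict, so the fused mass on "pause" (0.63 / 0.83) is higher than either source assigns alone.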
11/42
Transcription of Speech Prosody Used in this Work¹
Specificities of French prosody
• French: syllable-based language (vs. English: stress-based language)
• prosodic prominences are primary cues used to segment speech into syntactico-semantic units
• prosodic boundaries have a central role in French prosody
My contribution to the transcription of French prosody
Proposed alternative to the ToBI standard [Silverman et al., 1992]:
• major prosodic boundary (FM)
• minor prosodic boundary (Fm)
• accent (P)
3.3.4 Rhapsodie
Rhapsodie is a transcription system that has been developed in the Rhapsodie Project (Rhapsodie: Reference Prosody Corpus of Spoken French) [Lacheret et al., 2010].
The Rhapsodie transcription system aims to provide a simple and unified transcription ground that can be shared among the existing phonological theories and description systems. The description of the prosodic variations is based on the perception of prosodic objects that are implicitly shared among the phonological theories, such as prosodic prominence and prosodic packaging. Prosodic prominence is defined as an acoustic saliency, and covers prosodic events that are marked by intonation or by any other acoustic cue. The perceptual description of prosodic variations presents several advantages over more sophisticated systems. First, a perceptual description does not require expert knowledge, and can be carried out by moderately trained individuals. Second, the transcription can be easily integrated into most of the existing models for further phonetic and phonological descriptions. In particular, the perceptual level provides a minimal description of the prosodic events that can be used to specify and describe the acoustic dimensions that may be phonetically and phonologically relevant.
The minimal prosodic unit used for the description is the syllable, and the maximal prosodic unit is the prosodic group. The transcription of prosodic events is based on the perception of prosodic prominences (P) and prosodic packages (minor and major prosodic frontiers). The transcription is processed recursively to account for the hierarchical organization of prosodic events. For this purpose, a variable temporal resolution is used to manage the relativity in the perception of prosodic prominences and to gradually refine the prosodic description. First, a segmentation into prosodic groups (PGs) is achieved within a large integration domain (typically 5-10 s, depending on the speaking style). The segmentation is based on the perception of a major prosodic prominence that is associated with the end of a prosodic package. Second, a segmentation into internal prosodic groups (IPGs) is achieved within each prosodic group. The segmentation is based on the perception of a minor prosodic prominence that is associated with the end of a prosodic package. Finally, residual prosodic prominences (P) are identified as the remaining perceived prosodic prominences that occur within the internal prosodic group. Additional symbols are used to manage uncertainty and underspecification on the presence and nature of a prosodic frontier, and on the presence of a prosodic prominence. Speech disfluencies are transcribed in parallel to the prosodic transcription
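The hierarchy described above (prosodic groups closed by a major frontier, internal prosodic groups closed by a minor frontier, residual prominences inside them) can be sketched as a simple grouping of labelled syllables. The function name, the data layout, and the labels attached to the example sentence are assumptions made for illustration; they are not an actual Rhapsodie annotation.

```python
def segment(syllables):
    """Group (syllable, label) pairs into prosodic groups (PGs) of internal groups (IPGs).

    Labels: "FM" major frontier, "Fm" minor frontier, "P" prominence, None otherwise.
    """
    groups, current_pg, current_ipg = [], [], []
    for syllable, label in syllables:
        current_ipg.append(syllable)
        if label in ("Fm", "FM"):  # any frontier closes the internal group
            current_pg.append(current_ipg)
            current_ipg = []
        if label == "FM":          # a major frontier also closes the prosodic group
            groups.append(current_pg)
            current_pg = []
    # Keep trailing material that is not closed by a frontier
    if current_ipg:
        current_pg.append(current_ipg)
    if current_pg:
        groups.append(current_pg)
    return groups


# Hypothetical labelling of the example sentence (not a reference annotation)
annotated = [
    ("Long", None), ("temps", "FM"),
    ("je", None), ("me", None), ("suis", None), ("cou", None), ("ché", "Fm"),
    ("de", None), ("bonne", "P"), ("heure", "FM"),
]
prosodic_groups = segment(annotated)
```

A residual prominence (P) does not close any group, so "bonne" stays inside the last internal group.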
¹ Rhapsodie: reference prosody corpus of spoken French
13/42
Rich Syntactic Description
Rich Syntactic Description [Obin et al., 2011a]
Objective
Modelling the linguistic constraint to assign prosodic events
State of the Art
• Surface description of syntactic characteristics [Black and Taylor, 1994]
• Some attempts to integrate a rich description of syntactic characteristics [Ingulfsen et al., 2005]
Issue
• Integration of a rich description of syntactic characteristics
Contribution
• Use of deep syntactic parsing to model speech prosody
14/42
Rich Syntactic Description
Linguistic Processing Used in this Work: ALPAGE
ALPAGE [Villemonte de La Clergerie, 2005]
• Linguistic processing chain developed for French
• Three modules: text pre-processing, surface parsing, and deep parsing
Text Pre-Processing
• segmentation of a text into words and sentences
Surface Parsing
• morpho-syntactic parsing (POS)
15/42
Rich Syntactic Description
Linguistic Processing Used in this Work: ALPAGE
Deep Parsing
Deep parsing is used to describe the deep syntactic structure of a sentence
• Formalism: Tree Adjoining Grammar (TAG) [Joshi et al., 1975]
The syntactic structure can be described in terms of constituency or dependency.
Constituency
• describes the hierarchical structure of a sentence
Dependency
• describes the local dependency that relates words of a sentence
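To make the two views concrete, the sketch below encodes one short French sentence both as a constituency tree and as a list of dependencies. The sentence, the node labels, and the relation names are simplified assumptions for illustration and do not reproduce the output of the ALPAGE chain.

```python
def leaves(node):
    """Return the words spanned by a constituency node, left to right."""
    if isinstance(node, str):
        return [node]
    _label, children = node
    return [word for child in children for word in leaves(child)]


def depth(node):
    """Return the number of constituent levels above the deepest word."""
    if isinstance(node, str):
        return 0
    _label, children = node
    return 1 + max(depth(child) for child in children)


# Constituency: nested (label, children) pairs; the leaves are the words.
constituency = ("S", [
    ("N2", ["il"]),
    ("V", ["arrive"]),
    ("PP", [("prep", ["de"]),
            ("N2", [("adj", ["bonne"]), ("nc", ["heure"])])]),
])

# Dependency: one (dependent, head, relation) triple per word; the root has no head.
dependency = [
    ("il", "arrive", "subject"),
    ("arrive", None, "root"),
    ("de", "arrive", "vmod"),
    ("bonne", "heure", "modifier"),
    ("heure", "de", "object"),
]

words = leaves(constituency)
tree_depth = depth(constituency)
heads = {dependent: head for dependent, head, _relation in dependency}
```

The constituency tree exposes hierarchical features such as the depth of embedding, while the dependency list exposes word-to-word relations; both kinds of features can be derived from a deep parse.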
16/42
Rich Syntactic Description
Linguistic Processing Used in this Work: ALPAGE
Constituency [Villemonte de La Clergerie, 2005]
[Figure: TAG derivation of the example sentence “Longtemps, je me suis couché de bonne heure.”, shown together with the elementary trees it uses.]
Elementary trees:
• Tree 4: prepositional group
• Tree 23: final punctuation
• Tree 59: noun group
• Tree 67: auxiliary
• Tree 113: preposed adjective
• Tree 166: prepositional attachment on a verb
• Tree 171: adverbs
• Tree 198: verb in canonical construction
Tokens: E1F1 Longtemps | E1F2 , | E1F3 je | E1F4 me | E1F5 suis | E1F6 couché | E1F7 de | E1F8 bonne | E1F9 heure | E1F10 .
Derivation nodes (lemma:category:tree): longtemps:adv:171; ,:_:lexical; cln:cln:lexical; clr:clr:lexical; être:aux:67; coucher:v:198; de:prep:4; bon:adj:113; heure:nc:59; .:_:lexical; _:incise:incise; _:VMod:166; _:S:23
Derivation edge labels: void, S, vmod, subject, N2, N, clr, incise, PP, Infl
Conversion into Dependency [Villemonte de La Clergerie, 2010]
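[Figure: dependency graph of “Longtemps, je me suis couché de bonne heure.”]
Token | Form | Lemma | Category
E1F1 | Longtemps | longtemps | adverb
E1F2 | , | , | _
E1F3 | je | je | nominative clitic
E1F4 | me | me | reflexive clitic
E1F5 | suis | être | auxiliary verb
E1F6 | couché | coucher | verb
E1F7 | de | de | preposition
E1F8 | bonne | bon | adjective
E1F9 | heure | heure | common noun
E1F10 | . | . | _
Non-lexical nodes: incise, sentence, verb modifier
Edge labels: incise, void, reflexive clitic, subject, Infl., S, V-mod, PP, N2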
Summary
Syntactic characteristics used to model speech prosody
Conventional characteristics (surface parsing)
(1) morpho-syntactic (M)
Proposed characteristics (deep parsing)
(2) dependency (D)
(3) constituency (C)
(4) adjunction (A): specific TAG operation
• describes a large variety of syntactic constructions (e.g., clause, incise)
• shown to be relevant to model speech prosody
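As an illustration only, the four families of characteristics (M, D, C, A) can be combined into feature sets before training a model. The sketch below enumerates every non-empty combination; the linguistic-context labels on the pause-evaluation plot later in the talk appear to list the same 15 combinations. The names `FAMILIES` and `feature_sets` are hypothetical and are not taken from the talk.

```python
from itertools import combinations

# One-letter codes from the summary above:
# M = morpho-syntactic, D = dependency, C = constituency, A = adjunction.
FAMILIES = ("M", "D", "C", "A")


def feature_sets(families=FAMILIES):
    """Return every non-empty combination of feature families as a string."""
    sets = []
    for size in range(1, len(families) + 1):
        for combo in combinations(families, size):
            sets.append("".join(combo))
    return sets


ALL_SETS = feature_sets()
```

Each string (for example `"MD"`) names one candidate configuration that could be evaluated separately.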
Evaluation of the Automatic Assignment of Prosodic Events
Scheme
• comparison of different sets of syntactic characteristics to model speech prosody
• speech database: neutral read speech (9 hrs)
• procedure: 10-fold cross-validation
• metric: F-measure
Results
• dramatic improvement for prosodic boundaries
  • FM = 20%
  • Fm = 10%
• no improvement for prosodic prominence: probably related to semantic constraints, not syntax
[Figure: Performance of the automatic assignment of prosodic events. F-measure (0 to 1) for FM, Fm and P, with one bar per feature set (MORPHO, DEPENDENCY, CHUNK, ADJUNCTION). Bar labels as extracted, each triple read as FM, Fm, P: 0.95, 0.54, 0.34; 0.85, 0.50, 0.34; 0.76, 0.44, 0.33; 0.74, 0.44, 0.33.]
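The evaluation scheme (10-fold cross-validation scored with the F-measure) can be sketched as follows. This is a minimal illustration, not the code used for the reported results: the function names, the label strings and the interleaved way folds are built are all assumptions.

```python
from typing import List, Sequence, Tuple


def f_measure(reference: Sequence[str], predicted: Sequence[str], label: str) -> float:
    """F-measure (harmonic mean of precision and recall) for one label."""
    pairs = list(zip(reference, predicted))
    tp = sum(1 for r, p in pairs if r == label and p == label)
    fp = sum(1 for r, p in pairs if r != label and p == label)
    fn = sum(1 for r, p in pairs if r == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def k_fold_splits(n_items: int, k: int = 10) -> List[Tuple[List[int], List[int]]]:
    """Return k (train_indices, test_indices) pairs; folds are interleaved."""
    folds = [list(range(i, n_items, k)) for i in range(k)]
    splits = []
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        splits.append((train, folds[i]))
    return splits
```

In use, a model would be trained on each `train` index list, applied to the matching `test` list, and the per-label F-measures averaged over the ten folds.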
Combination of Syntactic and Metric Constraints to Assign Pauses
Syntactic and Metric Constraints [Obin et al., 2011d]
Objective
• Modelling syntactic and metric constraints to assign pauses
State of the Art
• Explicit integration of metric constraint in statistical modelling [Schmid and Atterer, 2004]
Issue
• Determine the adequate combination of syntactic & metric constraints
Contributions
• Use of Segmental HMMs + Dempster-Shafer Fusion
Segmental HMM
Segmental HMM [Ostendorf et al., 1996] addresses several limitations of conventional HMM:
• in particular, segment duration is explicitly modelled (metric constraint)
d = sequence that describes the distance between consecutive pauses
o = sequence that describes observed syntactic characteristics
\[ p(d \mid o) \propto \underbrace{p(o \mid d)}_{\text{linguistic contribution}} \times \underbrace{p(d)}_{\text{metric contribution}} \tag{1} \]
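One way to read equation (1) is as a search over pause placements that maximizes the sum of a linguistic log-likelihood and a log-prior on the distance between pauses. The sketch below does this with dynamic programming. It is an assumption-laden illustration rather than the model of [Obin et al., 2011d]: the function name `assign_pauses`, the per-juncture log-likelihood inputs, the `log_duration` callable (assumed to return finite values) and the convention that the last juncture always carries a pause are all hypothetical.

```python
import math
from typing import Callable, List, Sequence


def assign_pauses(
    log_obs_pause: Sequence[float],    # log p(o_i | pause at juncture i)
    log_obs_nopause: Sequence[float],  # log p(o_i | no pause at juncture i)
    log_duration: Callable[[int], float],  # log p(d): prior on segment length d
    max_len: int,                      # longest segment considered
) -> List[int]:
    """Return the juncture indices that carry a pause in the best segmentation."""
    n = len(log_obs_pause)
    # Prefix sums of the no-pause scores, so a segment's interior is O(1).
    cum = [0.0]
    for value in log_obs_nopause:
        cum.append(cum[-1] + value)
    best = [-math.inf] * (n + 1)  # best[j]: best score with a pause at j - 1
    back = [0] * (n + 1)
    best[0] = 0.0
    for j in range(1, n + 1):
        for i in range(max(0, j - max_len), j):
            score = (
                best[i]
                + log_duration(j - i)          # metric contribution
                + (cum[j - 1] - cum[i])        # junctures i .. j-2: no pause
                + log_obs_pause[j - 1]         # juncture j-1: pause
            )
            if score > best[j]:
                best[j], back[j] = score, i
    pauses = []
    j = n
    while j > 0:
        pauses.append(j - 1)
        j = back[j]
    return sorted(pauses)
```

The segment-length prior plays the role of p(d) and the per-juncture scores the role of p(o|d); weighting the two terms differently would be one way to trade off the two contributions.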
Dempster-Shafer Fusion
• the reliability that can be conferred to different sources of information is explicitly formulated
• used to balance the syntactic and metric contributions to assign pauses
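The general idea of weighting sources by reliability can be illustrated with the textbook form of Dempster-Shafer theory: each source's mass function is discounted by a reliability factor, then the two are merged with Dempster's rule of combination. The sketch below uses a two-hypothesis frame (pause, no pause). How the talk's weights α, β map onto reliabilities is not stated on the slide, so the reliabilities and mass values here are made-up examples, and the names `discount` and `combine` are hypothetical.

```python
from typing import Dict

# A mass function over the frame {pause, no_pause}; "either" is the mass
# left on the whole frame (ignorance).
Mass = Dict[str, float]


def discount(mass: Mass, reliability: float) -> Mass:
    """Shafer discounting: move (1 - reliability) of the mass to ignorance."""
    return {
        "pause": reliability * mass["pause"],
        "no_pause": reliability * mass["no_pause"],
        "either": 1.0 - reliability + reliability * mass["either"],
    }


def combine(m1: Mass, m2: Mass) -> Mass:
    """Dempster's rule of combination on the two-hypothesis frame."""
    conflict = m1["pause"] * m2["no_pause"] + m1["no_pause"] * m2["pause"]
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict")
    norm = 1.0 - conflict
    fused = {}
    for h in ("pause", "no_pause"):
        fused[h] = (
            m1[h] * m2[h] + m1[h] * m2["either"] + m1["either"] * m2[h]
        ) / norm
    fused["either"] = m1["either"] * m2["either"] / norm
    return fused


# Illustrative values only: a syntactic source that favours a pause and a
# metric source that does not, with the syntactic source trusted more.
syntactic = {"pause": 0.8, "no_pause": 0.2, "either": 0.0}
metric = {"pause": 0.3, "no_pause": 0.7, "either": 0.0}
fused = combine(discount(syntactic, 0.9), discount(metric, 0.5))
```

With a reliability of 0 a source becomes vacuous and the fused result equals the other (discounted) source, which makes the balance between the two contributions explicit.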
Evaluation of the Automatic Assignment of Pauses
Scheme
• same database, procedure, metric
• only for FM (can also be used for Fm)
Results
• optimal configuration of syntactic/metric constraints significantly outperforms conventional methods
  • optimal: 96.3%
  • linguistic: 95.0%
  • linguistic/metric: 92.1%
• syntactic constraint seems to have a greater influence than metric constraint
• example of speech synthesis with automatic assignment of prosodic events
[Figure: Performance of the automatic assignment of pauses. F1 (0.65 to 1) as a function of the weights α, β (−1 to 1) and of the linguistic context (D, M, MD, C, MC, MDC, DC, A, MA, MDA, DA, DCA, MCA, CA, MDCA). Marked configurations: metric, metric/linguistic, linguistic, optimal.]
Stylization and Trajectory Modelling of F0 contours
Objective
• Modelling and synthesizing F0 variations from a text
State of the Art
• short-term modelling + local trajectory constraint (conventional HMM) [Tokuda et al., 2003]
• stylization + no trajectory constraint [Mishra et al., 2006]
• short-term modelling + long-term trajectory constraint [Latorre and Akamine, 2008, Qian et al., 2009]
Issues
• combining stylization with trajectory modelling of F0 variations
Contributions
• trajectory model fully based on the stylization of F0 contours over various temporal domains
Stylization of Speech Prosody [Obin et al., 2011b]
Stylization of F0 contours
• Modelling relevant melodic variations
[Figure: stylization of an F0 contour; f0 [Hz] against time [s].]
• Modelling contours over various temporal domains
[Figure: F0 contours over two temporal domains, syllable and phrase; f0 [Hz] against time [s].]
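The slide does not say which stylization is used. As a purely illustrative sketch, the code below assumes a low-order discrete cosine transform (DCT): an F0 contour over one temporal domain (a syllable or a phrase) is summarized by its first few DCT coefficients and can be resynthesized from them. The DCT choice and the function names are assumptions, not the talk's stated method.

```python
import math
from typing import List, Sequence


def dct_coefficients(contour: Sequence[float], order: int) -> List[float]:
    """First `order` orthonormal DCT-II coefficients (order <= len(contour))."""
    n = len(contour)
    coeffs = []
    for k in range(order):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        coeffs.append(
            scale
            * sum(x * math.cos(math.pi * (i + 0.5) * k / n) for i, x in enumerate(contour))
        )
    return coeffs


def reconstruct(coeffs: Sequence[float], n: int) -> List[float]:
    """Rebuild an n-point contour from (possibly truncated) DCT coefficients."""
    contour = []
    for i in range(n):
        value = 0.0
        for k, c in enumerate(coeffs):
            scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
            value += scale * c * math.cos(math.pi * (i + 0.5) * k / n)
        contour.append(value)
    return contour
```

Keeping only the first few coefficients retains the overall level, slope and curvature of the contour and discards fine detail; the same functions can be applied to a syllable-sized or a phrase-sized window.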
Trajectory Modelling
Stylization + Conventional Modelling
• During synthesis, the sequence of contours is determined as the sequence of mean contours (assumption of conditional independence)
• Example:
[Figure: synthesized F0 contours for the syllables lo, ta, Z@, m@, sHi, ku, Se, d2, bO, n9R; f0 [Hz] against time [s].]
• This may result in inadequate phrasing
Stylization + Trajectory Modelling
• During synthesis, the sequence of contours is determined under the constraint of long-term trajectories
• Example:
[Figure: synthesized F0 contours for the syllables lo, ta, Z@, m@, sHi, ku, Se, d2, bO, n9R; f0 [Hz] against time [s].]
• The long-term trajectory constraint yields adequate phrasing
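To make the contrast with the independent "sequence of means" concrete, the sketch below generates a sequence under both static targets and first-order difference targets, in the spirit of standard HMM parameter generation with dynamic features. It is not the trajectory model of the talk: it uses a single scalar per syllable (for example a mean log-F0) instead of a stylized contour, and the function name and all inputs are hypothetical.

```python
from typing import List, Sequence


def generate_trajectory(
    mean: Sequence[float],        # static target per syllable
    var: Sequence[float],         # variance of each static target
    delta_mean: Sequence[float],  # target for x[i+1] - x[i], length len(mean) - 1
    delta_var: Sequence[float],   # variance of each difference target
) -> List[float]:
    """Minimize the sum of squared, variance-weighted static and delta errors."""
    t = len(mean)
    # Build the symmetric tridiagonal normal equations A x = b.
    diag = [1.0 / var[i] for i in range(t)]
    off = [0.0] * (t - 1)  # off[i] couples x[i] and x[i + 1]
    rhs = [mean[i] / var[i] for i in range(t)]
    for i in range(t - 1):
        w = 1.0 / delta_var[i]
        diag[i] += w
        diag[i + 1] += w
        off[i] = -w
        rhs[i] -= w * delta_mean[i]
        rhs[i + 1] += w * delta_mean[i]
    # Solve with the Thomas algorithm (forward elimination, back substitution).
    for i in range(1, t):
        m = off[i - 1] / diag[i - 1]
        diag[i] -= m * off[i - 1]
        rhs[i] -= m * rhs[i - 1]
    x = [0.0] * t
    x[-1] = rhs[-1] / diag[-1]
    for i in range(t - 2, -1, -1):
        x[i] = (rhs[i] - off[i] * x[i + 1]) / diag[i]
    return x
```

When the difference variances are very large the result reduces to the static means (the conditional-independence case); when they are small, neighbouring values are tied together and the sequence follows the imposed trajectory.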
Subjective Evaluation of Stylization/Trajectory Model
Comparison of F0 models in Speech Synthesis
• conventional HMM (HTS) [Yoshimura et al., 1999]
• stylization + trajectory
  1. syllable + 1-order adjacent syllables (1-ORDER)
  2. syllable + minor prosodic phrase (AG)
  3. syllable + major prosodic phrase (PG)
Procedure
• CMOS preference experiment
• 8x4 synthesized speech utterances
• 20 native French speakers
• 1-order trajectory model is significantly preferred
• long-term trajectories (AG/PG) partially successful (due to increase in complexity)
[Figure: CMOS scores for HTS, 1ORDER, AG and PG. Bar labels as extracted: −0.38, −0.18, 0.53, −0.34.]
Style: a matter of Situation
Individual and Shared Identities
Each communication act instantiates a style which is composed of:
• an individual speaking style that depends on the speaker identity
• a conventional speaking style conditioned by a specific situation
[Figure: SPEAKER and LISTENER.]
Objective
Modelling speaking style in speech synthesis
State of the Art
• modelling continuous characteristics of a speaking style (emotions) [Yamagishi et al., 2004]
• modelling discrete characteristics of a speaking style [Bell et al., 2006]
Issues
• modelling the discrete and continuous characteristics of a speaking style
Contributions
• study of the ability of human listeners to identify a speaking style
  • reference for the evaluation of speaking style modelling
• study of the capacity of discrete/continuous HMMs to model characteristics of a speaking style
Ability of Human Listeners to Identify a Speaking Style
Identification Ability of Speaking Style [Obin et al., 2010]
Objective
• Study of the ability of human listeners to identify the speaking style of natural speech
Experiment
• 40 speech utterances with 4 speaking styles (church office, political speech, journalistic speech, sport commentary)
• delexicalized to remove lexical access
• 72 participants (various language backgrounds)
• multiple choice identification of speaking style by human listeners
[Figure: Confusion of natural speaking styles by human listeners. Three-dimensional MDS plot (MDS #1, MDS #2, MDS #3) of the utterances, labelled by speaking style (M, P, J, S).]
Discrete/Continuous Modelling of Speaking Style
Modelling Speaking Style [Obin et al., 2011c]
Objectives
• application of discrete/continuous model to speaking style
• capacity of HMM-based speech synthesis to model speaking style
Principle
• average modelling of discrete and continuous characteristics of a speaking style (multiple speakers)
[Figure: Description of speaker characteristics. log f0 against log syllable duration for speakers M1 to M7, P1 to P5, J1 to J5 and S1 to S6.]
DISCRETE/CONTINUOUS MODELLING OF SPEAKING STYLE IN HMM-BASED SPEECH SYNTHESIS: DESIGN AND EVALUATION
Nicolas Obin 1,2, Pierre Lanchantin 1, Anne Lacheret 2, Xavier Rodet 1
1 Analysis-Synthesis Team, IRCAM, Paris, France
2 Modyco Lab., University of Paris Ouest - La Defense, Nanterre, France
nobin@ircam.fr, lanchantin@ircam.fr, anne.lacheret@u-paris10.fr, rodet@ircam.fr
ABSTRACT
This paper assesses the ability of an HMM-based speech synthesis system to model the speech characteristics of various speaking styles¹. A discrete/continuous HMM is presented to model the symbolic and acoustic speech characteristics of a speaking style. The proposed model is used to model the average characteristics of a speaking style that is shared among various speakers, depending on specific situations of speech communication. The evaluation consists of an identification experiment of 4 speaking styles based on delexicalized speech, which is compared to a similar experiment on natural speech. The comparison is discussed and reveals that the discrete/continuous HMM consistently models the speech characteristics of a speaking style.
Index Terms: speaking style, speech synthesis, speech prosody, average modelling.
1. INTRODUCTION
Each speaker has his own speaking style which constitutes his vocal signature, and a part of his identity. Nevertheless, a speaker continuously adapts his speaking style according to specific communication situations, and to his emotional state. In particular, each situational context determines a specific mode of production associated with it - a genre - which is defined by a set of conventions of form and content that are shared among all of its productions [1]. In particular, a specific discourse genre (DG) relates to a specific speaking style. Consequently, a speaker adapts his speaking style to each specific situation depending on the formal conventions that are associated with the situation, his a-priori knowledge about these conventions, and his competence to adapt his speaking style. Thus, each communication act instantiates a style which is composed of a style that depends on the speaker identity, and a conventional speaking style that is conditioned by a specific situation.
In speech synthesis, methods have been proposed to model and adapt the symbolic [3, 4] and acoustic speech characteristics of a speaking style, with application to emotional speech synthesis [2]. However, no study exists on the joint modelling of the symbolic and acoustic characteristics of speaking style, and acoustic modelling of speaking style is generally limited to the modelling of emotion, with rare extensions to other sources of speaking style variation [5].
This paper presents an average discrete/continuous HMM which is applied to the speaking style modelling of various discourse genres in speech synthesis, and assesses whether the model adequately captures the speech prosody characteristics of a speaking style. Incidentally, the robustness of the HMM-based speech synthesis is evaluated in the conditions of real-world applications. The paper is organized as follows: the speaking style corpus design is described in section 2; the average discrete/continuous HMM model is presented in section 3; the evaluation is presented and discussed in sections 4 and 5.
¹ This study was supported by ANR Rhapsodie 07 Corp-030-01; reference prosody corpus of spoken French; French National Agency of research; 2008-2012.
2. SPEECH & TEXT MATERIAL
2.1. Corpus Design
For the purpose of speaking style speech synthesis, a 4-hour multi-speaker speech database was designed. The speech database consists of four different DGs: catholic mass ceremony, political, journalistic, and sport commentary. In order to reduce the DG intra-variability, the different DGs were restricted to specific situational contexts (see list below) and to male speakers only.
Fig. 1. Prosodic description of the speaking styles depending on the speaker. Mean and variance of f0 and speech rate. [Axes: log f0 against log(1/speech rate), labelled log syllable duration; speakers M1 to M7, P1 to P5, J1 to J5 and S1 to S6.]
The following is a description of the four selected DGs:
Evaluation of the Ability of Human Listeners to Identify a Synthetic Speaking Style
Experiment
• 40 synthesized speech utterances + delexicalized (same as previously)
• 50 participants (various language backgrounds)
• multiple choice identification of speaking style by human listeners
Results
• discrete/continuous HMMs consistently model the characteristics of a speaking style (with exception of church office)
Question
• what is the contribution of discrete/continuous characteristics?
[Figure: Comparison of identification obtained for natural and synthetic speech. Cohen's kappa (0 to 0.8) for MASS, POLITICAL, JOURNAL and SPORT, with one bar each for NATURAL SPEECH and SYNTHESIZED SPEECH. Bar labels as extracted: 0.12, 0.28, 0.51, 0.68; 0.38, 0.34, 0.54, 0.70.]
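The figure reports Cohen's kappa per speaking style. A minimal sketch of how such a score can be computed from an identification confusion matrix is given below. It assumes that rows are the intended styles and columns the listeners' answers, and that the per-style score is obtained by collapsing the matrix to "this style versus the rest"; both are assumptions, since the slide does not describe the computation, and the function names are hypothetical.

```python
from typing import List, Sequence


def cohen_kappa(confusion: Sequence[Sequence[float]]) -> float:
    """Cohen's kappa for a square confusion matrix of counts."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    observed = sum(confusion[i][i] for i in range(n)) / total
    expected = sum(
        sum(confusion[i]) * sum(row[i] for row in confusion) for i in range(n)
    ) / (total * total)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1.0 - expected)


def one_vs_rest(confusion: Sequence[Sequence[float]], k: int) -> List[List[float]]:
    """Collapse a multi-class confusion matrix to class k against all others."""
    total = sum(sum(row) for row in confusion)
    hit = confusion[k][k]
    row_k = sum(confusion[k])
    col_k = sum(row[k] for row in confusion)
    return [[hit, row_k - hit], [col_k - hit, total - row_k - col_k + hit]]
```

For example, `cohen_kappa(one_vs_rest(matrix, 0))` gives a chance-corrected identification score for the first style alone, which is the kind of per-style value plotted in the figure.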
Synthesis of Speaking Styles
[Diagram: TEXT (2) and SPEAKING STYLE.]
(2) Neutral, Fairy Tale, “Little Tom Thumb”
Conclusion
Contributions to the Statistical Modelling of Speech Prosody
1. Statistical modelling of discrete and continuous characteristics of speech prosody
2. Combination of linguistic and metric constraints to assign pauses
3. Stylization and trajectory modelling of F0 contours
Contribution to the Integration of a Rich Linguistic Description
4. Use of deep syntactic parsing to model speech prosody characteristics
Contributions to the Modelling of Speaking Style
5. Study of the ability of listeners to identify a speaking style
6. Reference identification performance for the evaluation of speaking style modelling
7. Ability of discrete/continuous HMMs to model the characteristics of a speaking style
Further Directions
Modelling the Variety of Speech Prosody
• a speaker has various alternatives to pronounce the same utterance (intra-speaker variability)
• varying speech prosody is expected to improve the naturalness of synthetic speech
• examples of various speech prosody determined for the same sentence
Unified Modelling
• joint modelling of discrete/continuous characteristics
• that also accounts for alternatives
• objective: improving the coherence and variety of speech prosody
[Figure: SPEAKER producing three prosodic alternatives of the same utterance: “HEY PATRICK!”, “HEY PA-TRICK!”, “HEY ## PA-TRICK!”]