Personalized Singing Synthesis - NEDOConverted singing voice time (s)) 0 0.5 1 1.5 2 0 2000 4000...
Transcript of Personalized Singing Synthesis - NEDOConverted singing voice time (s)) 0 0.5 1 1.5 2 0 2000 4000...
Acknowledgement:
Yvonne Lee, Paul Chan, Dong Minghui
International Symposium on Next-Generation Artificial Intelligence
3 March 2016, Tokyo
Personalized Singing Synthesis - Science Meeting Arts
Haizhou Li
Institute for Infocomm Research, A*STAR
Singapore
1. Speech
2. Singing
3. Music
4. Speech to Singing
Agenda: Personalized Singing Synthesis
• Speech is the most natural way of human communication.
• Singing is to augment regular speech by the use of both tonality and rhythm.
• Music is an artistic way of expression, with instrumental sounds and vocals.
Speech, Singing and Music
Speech
Singing
Music
Speech
4
• Who you are
Elements in Human Voice
Human Language Technology
• what you want to say
Speech
Prosody
Timbre Content
• Expression of affective
state/emotion
6
• Speech and singing can be modelled as two processes: source generation and filtering
• Air flow passes through the vocal folds as the sound source (periodic pulse & noise)
• It is then filtered by our vocal tract to produce the sounds
Vocal Tract (Filter)
Air Flow (Source)
Speech
Speech Theory
7
• Kratzenstein’s acoustic resonators – Apparatus created in St. Petersburg (1779)
– Figure shown from Schroeter (1993)
• Vowel tubes (SF Science Museum)
First Set of Vowel Tubes
8
Homer Dudley (1896-1987) , 1939 World Fair in New
York City – Bell Labs VODER
Electronic Synthesizer: VODER
Source: 120 Years of Electronic Music: The history of electronic music from 1800 to
2015, http://120years.net/the-voder-vocoderhomer-dudleyusa1940/
9
• Haskins, 1959
• KTH – Stockholm, 1962
• Bell Labs, 1973
• MIT, 1976
• MIT-talk, 1979
• Speak ‘N Spell, 1980
• BELL Labs, 1985
• DECtalk (voice morphing), 1987
• I2R Abacus Engine 2013
Continuing Evolution (1959-2013)
Elements in Human Voice
Human Language Technology
speech
Prosody
Timbre Content
11
Analysis and Synthesis
Content
Prosody
Timbre
Analysis
Synthesis
speech speech
12
Singing
• Singing is emotional musical vocalization. It is an emotional expression of feelings which has the power to alter the mood of both the singer and the listener.
• Singing requires expert knowledge and operations. Vocalists acquire singing skills by extensive training.
• By singing, we – Reduce stress and improve mood
– Lower blood pressure
– Help improve sleep
– Reduce perceived pain
– Motivate and empower …
13
Singing – a part of our life
14
• Almost all singers develop vibrato during voice training. Vibrato is a periodic modulation of the pitch frequency. Vibrato reduces the demand of accuracy in pitch frequency and the singer can use vibrato artistically for expressive purposes [Sundberg87]
[Sundberg87] J. Sundberg, The Science of the Singing Voice, Northern Illinois University Press, 1987.
Singing Theory – Vibrato
15
• By varying the vocal tract to achieve different voice spectra, singers may produce different voice timbres, styles, and achieve efficient sound transmission
[Wolfe09] J. Wolfe, M. Garnier, and J. Smith, Voice Acoustics: An Introduction to the Science of Speech and Singing, 2009. [Sundberg77] J. Sundberg, The Acoustics of the Singing Voice, Scientific American, 1977.
• Singing formant
By adjusting the larynx and the vocal tract near glottis, singing formant is made and the singing can be projected further far away [Wolfe09]
Singing Theory – Singing Formant (Tenor)
La Lore Loo Ler Lee
low pitch
high pitch
high pitch, order changed
16
[Joliveau04] E. Joliveau, J. Smith, and J. Wolfe, “Tuning of vocal tract resonance by sopranos”, Nature, vol. 427, pp. 116, Jan. 2004. [Wolfe09] J. Wolfe, M. Garnier, and J. Smith, Voice Acoustics: An Introduction to the Science of Speech and Singing, 2009. [Wolfe12] J. Wolfe, Sopranos: Resonance Tuning and Vowel Changes, 2012.
[Wolfe12] [Joliveau04]
[Wolfe09]
Singing Theory – Resonance Tuning (Soprano)
17
Music
18
• A piece of music is described by a sequence of notes, showing the pitch and the relative durations
• In Western music, 12 notes of fixed frequencies are used. They are different from each other by a semitone (a ratio of ). An octave span over 12 semitones.
12 2
name whole half quarter eighth sixteenth thirty-second
sixty-fourth
pitched note
rest note
name C4 C4# D4 D4# E4 F4 F4# G4 G4# A4 A4# B4
freq. (Hz)
262 277 294 311 330 349 370 392 415 440 466 494
Music Theory
19
Speech to Singing Synthesis
Are these possible? – Projecting vocal techniques in singing
– Personalizing a voice [Kenmochi10]
– Automating the accompaniment
[Kenmochi10] H. Kenmochi, “VOCALOID and Hatsune Miku phenomenon in Japan,” in Proc. Intersinging, pp. 1-4, 2010.
Singing Synthesis
Singing
Prosody
Timbre Content
Synchronization
21
Analysis and Synthesis
Content
Prosody
(F0)
Timbre
Synthesized
Singing
Music
Accompaniment
Speech Singing
Analysis
Synthesis
• Input: speech or singing
• Output: time-aligned vocal and music
• Changing prosody and timbre from speech to singing vocal
Siu Wa Lee, Ling Cen, Haizhou Li, Yaozhu Paul Chan, Minghui Dong, Method and system for template-based
personalized singing synthesis, US Patent: 20150025892 A1
Personalized Singing Synthesis
(shift to a higher pitch)
pitch
time
pitch
time
pitch
time
pitch
time
Singing
Syncopated Singing
(horizontal nudge
time warping)
Transposed Singing
(shift to a lower pitch)
pitch
time
pitch
time
pitch
time
Example of Speech to Singing Synthesis
Harmonized Singing
(Sum of Melody
& Accompaniment
Vocals)
Speech
pitch
time
24
Template singing voice
time (s)
frequ
ency
(Hz)
0 0.5 1 1.5 20
2000
4000
6000
8000
Speaking voice under conversion
time (s)
frequ
ency
(Hz)
0 0.5 10
2000
4000
6000
8000
Converted singing voice
time (s)
frequ
ency
(Hz)
0 0.5 1 1.5 20
2000
4000
6000
8000
Personalized Singing Synthesis – voice alignment
pitch
time
Singing
pitch
time
Speech
Siu Wa Lee, Ling Cen, Haizhou Li, Yaozhu Paul Chan, Minghui Dong, Method and system for template-based
personalized singing synthesis, US Patent: 20150025892 A1
25
• Generalized Pitch Modeling
– To learn and generate these fluctuations note-by-note [Lee12]
• Various types of fluctuation are implicitly modeled using the same representation
[Lee12] S. W. Lee, S. T. Ang, M. Dong, and H. Li, “Generalized F0 modeling with absolute and relative pitch features for signing voice synthesis,” in Proc. ICASSP, pp. 429-432, 2012.
Generalized Pitch Modeling
Content Independent Method
Pitch Modeling
• Context Independent Method – A fixed formula to assign pitch
fluctuation patterns
• However, pitch fluctuations are context dependent, e.g.
overshoot1 in high frequency notes is different from that in low frequency notes. Vibrato is present more often in long notes, etc.
1. Overshoot is a deflection exceeding the target note frequency after a note change 2. LEE S.W.Y, , DONG M.H., "Singing voice synthesis: Singer-dependent vibrato modeling and coherent processing of spectral envelope", INTERSPEECH 2011
Singing with Vibrato
Personalized Singing Voice Synthesis
(NDP 2013 Mobile App)
• Thank you!
28