AN ATTEMPT TO DEVELOP A SINGING SYNTHESIZER BY COLLABORATIVE CREATION

Masanori Morise
Faculty of Engineering, University of Yamanashi, Japan

[email protected]

ABSTRACT

This paper presents singing synthesizers collaboratively designed by several developers. On the video-sharing Web site Nico Nico Douga, many creators jointly create songs with a singing synthesis system called Vocaloid. To synthesize various singing styles, another singing synthesis system, UTAU, which is free software, has been developed and is used by many creators. However, its sound quality has not yet matched that of Vocaloid. The purpose of this study is to develop a singing synthesizer for UTAU by collaborative creation. Developers were encouraged to design a singing synthesizer by using a high-quality speech synthesis system named WORLD that can synthesize a singing voice that sounds as natural as a human voice. We released WORLD and a singing synthesizer for UTAU as free software with C language source code and attempted to encourage collaborative creation. As a result of our attempt, six singing synthesizers for UTAU and two original singing synthesis systems were developed and released. These were used to create many songs that audiences on the video-sharing Web site Nico Nico Douga evaluated as high-quality singing.

1. INTRODUCTION

Singing synthesis is a major research target in the field of sound synthesis, and several commercial applications such as Melodyne and Auto-Tune are already used to tune singing voices. Text-To-Speech synthesis systems for singing have also been released now that computers are sufficiently powerful. However, the sales of these applications have been poor.

After the release of the Vocaloid 2 product Hatsune Miku [1], singing synthesis systems have played an important role in the entertainment culture of the video-sharing Web site Nico Nico Douga, and many amateur creators have been uploading songs to the site. Several studies on Vocaloid have been carried out to synthesize natural singing voices [2, 3]. As a result, “Vocaloid music” is now a category of Japanese pop culture, in what has been termed the “Hatsune Miku Phenomenon” [1].

“Social Creativity” [4], a form of collaborative creation [5] by multiple creators, has been gaining popularity as a new style of creation for improving the quality of content on video-sharing Web sites. Today, many creators jointly create content including songs, promotional videos, and comments.¹

Copyright: © 2013 First author et al. This is an open-access article distributed under the terms of the Creative Commons Attribution 3.0 Unported License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

The purpose of this study is to develop singing synthesizers by collaborative creation. We implemented a base technology named WORLD and a singing synthesizer for UTAU and released them on the Web site to encourage collaboration. This attempt functions as indirect support for creating songs with the developed synthesizers.

The rest of this article is organized as follows. Section 2 describes conventional singing synthesis systems and outlines the requirements for developing one. In Section 3, we explain the principle behind and effectiveness of the base technology, WORLD. Section 4 reveals whether the synthesizers could be developed by other developers and discusses the result of our attempt. We conclude in Section 5 with a brief summary of our research and a mention of future work.

2. SINGING SYNTHESIS SYSTEMS: VOCALOID AND UTAU

Vocaloid is a Text-To-Speech synthesis system for singing with which creators can synthesize singing from lyrics and scores. Demand for synthesizing various kinds of singing has been growing due to the rapid growth of Vocaloid, but it has been virtually impossible to meet this demand with Vocaloid alone. To solve this problem, UTAU² was developed as a singing synthesis system.

2.1 UTAU

UTAU is a Japanese singing synthesis system similar to Vocaloid. As shown in Fig. 1, the framework consists of an editor to manipulate parameters, a synthesizer, and a voice library associated with the singer. UTAU can switch voice libraries to synthesize various singing styles and switch synthesizers to improve the sound quality. Although Vocaloid has few voice libraries (around 20), UTAU has far more (over 5,000) because creating a voice library of an amateur singer is easy.

The following sections describe the voice library for UTAU and the requirements for the singing synthesizer. To develop a synthesizer for UTAU, it is necessary to adjust the format for the voice and labeling data.

¹ On Nico Nico Douga, audiences can effectively overlay text comments onto video content. This feature provides the audience with the sense of sharing the viewing experience [6].

² http://en.wikipedia.org/wiki/Utau


Figure 1. Overview of Vocaloid and UTAU. They consist of an editor, a voice library, and a synthesis method. Switching voice libraries (created by singers) and synthesizers is possible in UTAU; the synthesizer is the target of this study.

Figure 2. Labeling data of a singing voice /shi/ (amplitude versus time, with intervals (a), (b), and (c)). UTAU requires determining these positions.

2.1.1 Voice library

UTAU supports both CV and VCV synthesis for Japanese singing³, and recording all phonemes is required for the voice library. The recorded voices are then labeled to fulfill the requirements of UTAU. This is done automatically by using free software developed by another developer.

2.1.2 Labeling data

Figure 2 shows an example of the labeling data for a CV voice /shi/. Three intervals are used for synthesis: interval (a) represents the interval for smoothly mixing two voices, interval (b) represents the consonant interval, with its endpoint used as the origin of a CV voice, and interval (c) represents the voiced speech and is used to stretch the duration of the voice. A phoneme boundary between V and C is added for VCV voices.
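As a rough illustration only (a hypothetical structure for this discussion, not UTAU's actual voice-library format), the three labeled intervals could be held as:

/* Hypothetical container for the labeling data of Fig. 2; the field
   names are illustrative and not UTAU's actual file format. */
typedef struct {
  double overlap_ms;    /* interval (a): region for smoothly mixing two voices */
  double consonant_ms;  /* interval (b): consonant; its endpoint is the CV origin */
  double stretch_ms;    /* interval (c): voiced part whose duration is stretched */
} LabelData;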

³ Singing in other languages is available, but it is difficult because the editor cannot support other languages.

Figure 3. Overview of WORLD: the input is analyzed into an F0 contour, a spectral envelope, and an excitation signal; after modifications (time stretching, F0 modification, timbre modification), these parameters are synthesized into the output.

To develop a singing synthesizer for UTAU, developers must implement at least the following functions:

• Time-stretching function

• F0-modification function

• Timbre-modification functions (at least formant shift)

These functions are used to adjust the voices to the desired musical note, and developers can implement them with their own algorithms. However, the F0-modification function must be implemented with two processes: normalizing the F0 contour to the musical scale and adding the detailed contour given by the editor.
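As a minimal sketch of these two processes (the function and parameter names below are hypothetical, not taken from any released source):

#include <math.h>

/* Sketch of the two-step F0 modification described above:
   1. normalize the analyzed F0 contour to the target musical note;
   2. add the detailed pitch contour given by the editor (in cents).
   All names here are illustrative. */
void modify_f0(const double *analyzed_f0, const double *editor_cents,
               double target_note_hz, int length, double *output_f0) {
  /* Geometric mean over voiced frames gives the contour's reference pitch. */
  double log_sum = 0.0;
  int voiced = 0;
  for (int i = 0; i < length; ++i)
    if (analyzed_f0[i] > 0.0) { log_sum += log(analyzed_f0[i]); ++voiced; }
  double reference_hz = voiced > 0 ? exp(log_sum / voiced) : target_note_hz;

  for (int i = 0; i < length; ++i) {
    if (analyzed_f0[i] <= 0.0) { output_f0[i] = 0.0; continue; }  /* unvoiced */
    double normalized = analyzed_f0[i] * (target_note_hz / reference_hz);
    output_f0[i] = normalized * pow(2.0, editor_cents[i] / 1200.0);
  }
}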

2.2 Problem of UTAU

Three synthesizers have been officially released by the UTAU developer to manipulate subtle vocal nuances, and another three have been released by other developers because differences between synthesizers affect the quality of the content. Since the sound quality often deteriorates due to incompatibility between the voice library and the synthesizer, various types of synthesizers should be developed.

In this study, we developed a singing synthesizer for UTAU based on a high-quality speech synthesis system and attempted to induce collaborative creation by releasing its C language source code.

3. DEVELOPMENT OF A SINGING SYNTHESIZER BASED ON WORLD

The base technology WORLD is a vocoder-based system [7] that decomposes the input speech into an F0 contour, a spectral envelope, and an excitation signal, and it can synthesize speech that sounds as natural as human speech. Three parameters can be modified to fulfill the requirements of UTAU: time stretching, F0 modification, and timbre modification (as shown in Fig. 3).

WORLD first estimates the F0 and then uses the F0 information to estimate the spectral envelope. The excitation signal is then extracted using the F0 and spectral envelope information.

3.1 DIO: F0 estimation method

The F0 of a voiced sound is defined as the inverse of the shortest period of glottal vibration. It is one of the most important parameters for speech modification.


Figure 4. Four intervals used for determining the F0 (amplitude versus time, with successive events at t1, t2, ...). The inverse of the average of the intervals is an F0 candidate, and the inverse of their standard deviation is used as the index to determine the best of the candidates.

Many F0 estimation methods (such as the Cepstrum method [8] and the autocorrelation-based method [9]) have therefore been proposed for accurate estimation. Although these methods can estimate the F0 accurately, they require extensive calculation such as FFTs.

DIO [10] is a rapid F0 estimation method for high-SNR speech that is based on fundamental component extraction. The fundamental component is extracted by low-pass filters, and the F0 is calculated as its frequency. Since the cutoff frequency that extracts only the fundamental component is unknown, DIO applies many low-pass filters with different cutoff frequencies and uses a periodicity score to determine the final F0 among all candidates.

DIO consists of three steps to calculate F0 candidates and periodicity scores:

• Step 1: Filtering by many low-pass filters with different cutoff frequencies, from low to high.

• Step 2: Calculation of F0 candidates and periodicity scores.

• Step 3: Determination of the final F0 based on the periodicity scores.

In the first step, the input waveform is filtered by many low-pass filters. DIO uses a Nuttall window [11], whose sidelobes are around −90 dB, as the low-pass filter. The filtered signal is a sine wave at F0 Hz, provided that the filter is designed so that only the fundamental component is extracted.
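For reference, a four-term Nuttall window can be computed as below (these coefficients are those of the minimum four-term window from [11]; whether the released DIO uses this exact variant is an assumption):

#include <math.h>

/* A four-term Nuttall window [11]; with a suitable length it acts as a
   low-pass filter that keeps only components near and below the
   corresponding cutoff frequency. Coefficients are those of the
   minimum four-term window; the released DIO may use another variant. */
void nuttall_window(int length, double *window) {
  const double pi = 3.14159265358979323846;
  for (int i = 0; i < length; ++i) {
    double x = 2.0 * pi * i / (length - 1);
    window[i] = 0.355768 - 0.487396 * cos(x)
              + 0.144232 * cos(2.0 * x) - 0.012604 * cos(3.0 * x);
  }
}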

In the second step, the four intervals shown in Fig. 4 are calculated at every temporal position. The four intervals are defined as the negative- and positive-going zero-crossing intervals and the intervals between successive peaks and between successive dips. When the filtered signal is a sine wave, the four intervals take the same value, the inverse of their average indicates the F0, and the inverse of their standard deviation can be used as the periodicity score.

In the final step, the F0 with the highest periodicity score is selected as the final F0. DIO calculates the F0 much faster than conventional methods because it does not use frame-by-frame FFT processing. The F0 estimation performance of DIO is the same as that of conventional techniques [12–14], while its elapsed time is at least 28 times shorter than that of the other methods [15]. A simplified sketch of the candidate calculation is given below.
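The following C sketch illustrates the interval-based candidate calculation (computed globally over a filtered segment rather than at each temporal position, and without the event-position interpolation of the released implementation; these simplifications are assumptions):

#include <math.h>

/* Simplified sketch of DIO's four-interval F0 candidate (Fig. 4) for a
   filtered, nearly sinusoidal signal x of n samples at rate fs. */
static void accumulate(double t, double *prev,
                       double *sum, double *sq, int *cnt) {
  if (*prev >= 0.0) {                /* a previous event of this type exists */
    double interval = t - *prev;
    *sum += interval; *sq += interval * interval; ++*cnt;
  }
  *prev = t;
}

/* Returns the F0 candidate; *score receives the periodicity score
   (inverse of the standard deviation of the pooled intervals). */
double dio_candidate(const double *x, int n, int fs, double *score) {
  double prev[4] = {-1.0, -1.0, -1.0, -1.0};  /* zc+, zc-, peak, dip */
  double sum = 0.0, sq = 0.0;
  int cnt = 0;
  for (int i = 1; i < n - 1; ++i) {
    double t = (double)i / fs;
    if (x[i - 1] < 0.0 && x[i] >= 0.0)        /* positive-going zero crossing */
      accumulate(t, &prev[0], &sum, &sq, &cnt);
    if (x[i - 1] >= 0.0 && x[i] < 0.0)        /* negative-going zero crossing */
      accumulate(t, &prev[1], &sum, &sq, &cnt);
    if (x[i] > x[i - 1] && x[i] > x[i + 1])   /* peak */
      accumulate(t, &prev[2], &sum, &sq, &cnt);
    if (x[i] < x[i - 1] && x[i] < x[i + 1])   /* dip */
      accumulate(t, &prev[3], &sum, &sq, &cnt);
  }
  if (cnt < 4) { *score = 0.0; return 0.0; }
  double mean = sum / cnt;
  double var = sq / cnt - mean * mean;
  *score = var > 0.0 ? 1.0 / sqrt(var) : HUGE_VAL;  /* more periodic, higher */
  return 1.0 / mean;                 /* inverse of the average interval */
}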

3.2 STAR: Spectral envelope estimation method

Since voiced speech has an F0, the speech waveform includes not only the spectral envelope but also the F0 information. Many estimation methods based on linear predictive coding (LPC) [16] and the Cepstrum [17] have been proposed. Among them, STRAIGHT [18] can accurately estimate the spectral envelope and can synthesize high-quality speech. TANDEM-STRAIGHT [19] produces the same results as STRAIGHT at a lower computational cost, and STAR [20] reduces the computational cost even further. To calculate the spectral envelope, TANDEM-STRAIGHT uses two power spectra windowed by two window functions, whereas STAR produces the same result using only one power spectrum.

In STAR, the spectral envelope $|H(\omega, \tau)|$ is given by

$$
|H(\omega, \tau)|^2 = \exp\!\left( \frac{2}{\omega_0} \int_{-\omega_0/2}^{\omega_0/2} \log |S(\omega + \lambda, \tau)| \, d\lambda \right), \qquad (1)
$$

where $S(\omega, \tau)$ represents the spectrum of the windowed waveform and $\tau$ represents the temporal position for windowing. A Hanning window, which is used as the window function, has a length of $3T_0$ and is based on pitch-synchronous analysis [21]. $\omega_0$ represents the fundamental angular frequency ($2\pi f_0$). By windowing with this window function and smoothing with Eq. (1), $|H(\omega, \tau)|^2$ is temporally stable.
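A minimal C sketch of the Eq. (1) smoothing on a discrete power spectrum follows (a strictly positive |S|² array of fft_size/2 + 1 bins, edge clamping, and the omission of pitch-synchronous windowing are assumptions, not the released implementation):

#include <math.h>

/* Sketch of the rectangular smoothing of Eq. (1): the squared envelope
   at each bin is exp of twice the average log-magnitude over a band of
   width f0 centered on that bin. power_spectrum holds |S|^2, assumed
   strictly positive; indices are clamped at the spectrum edges. */
void star_envelope(const double *power_spectrum, int fft_size, int fs,
                   double f0, double *envelope) {
  int half_width = (int)(f0 / 2.0 * fft_size / fs + 0.5);  /* bins in f0/2 */
  int bins = fft_size / 2 + 1;
  for (int k = 0; k < bins; ++k) {
    double log_sum = 0.0;
    for (int j = k - half_width; j <= k + half_width; ++j) {
      int idx = j < 0 ? 0 : (j >= bins ? bins - 1 : j);    /* clamp */
      log_sum += 0.5 * log(power_spectrum[idx]);           /* log |S| */
    }
    envelope[k] = exp(2.0 * log_sum / (2 * half_width + 1));  /* |H|^2 */
  }
}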

Figure 5 shows an example of the estimation result. The target spectral envelope consists of a pole and a dip; LPC could not accurately estimate the envelope around the dip, whereas TANDEM-STRAIGHT and STAR could, with STAR completing the estimation in half the time of TANDEM-STRAIGHT.

3.3 PLATINUM: Excitation signal extraction method

PLATINUM is a method for extracting the excitation signal from the windowed waveform, the spectral envelope, and the F0 information [22]. In typical vocoder-based systems, a pulse is used as the excitation signal of voiced speech, the signal calculated from the spectral envelope with minimum phase is used as its impulse response, and white noise is used as the excitation signal to synthesize consonants. PLATINUM instead calculates the phase information of the windowed waveform and uses it when synthesizing.

The observed spectrum Y(ω) is defined as the product of the spectral envelope H(ω) and the target spectrum X(ω) for reconstructing the waveform. When the phase of the spectral envelope H(ω) is minimum, an inverse filter is given by simply calculating the inverse of H(ω).


Figure 5. Spectral envelope estimated by STAR (level in dB versus frequency in Hz; curves: Original, LPC, TANDEM-STRAIGHT, STAR). The target spectrum consists of a pole and a dip. Linear predictive coding (LPC) could not estimate the spectral envelope, whereas TANDEM-STRAIGHT and STAR could.

Figure 6. Determination of the origin in voiced speech (amplitude versus time). The index with the maximum amplitude around the center of the speech is selected as the origin; the other positions are automatically calculated from the origin and the F0 contour.

Since minimum phase is used as the phase information of H(ω) in vocoder-based systems, the target spectrum X(ω) is given by

$$
X(\omega) = \frac{Y(\omega)}{H(\omega)}. \qquad (2)
$$

As shown in Eq. (1), the spectral envelope H(ω) estimated by STAR is smoothed by a rectangular window, so the inverse of H(ω) can be calculated without extremely high amplitudes.
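As a minimal sketch (C99 complex arithmetic; the array layout is an assumption), the inverse filtering of Eq. (2) is a bin-by-bin complex division:

#include <complex.h>

/* Sketch of the inverse filtering in Eq. (2): the target spectrum X(w)
   is the observed spectrum Y(w) divided, bin by bin, by the minimum-
   phase spectral envelope H(w). Each array holds one FFT frame of
   fft_size/2 + 1 bins. The smoothing of Eq. (1) keeps |H| away from
   zero, so the division does not blow up. */
void inverse_filter(const double complex *observed,    /* Y(w) */
                    const double complex *envelope,    /* H(w) */
                    int bins,
                    double complex *target) {          /* X(w) */
  for (int k = 0; k < bins; ++k)
    target[k] = observed[k] / envelope[k];
}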

Pitch marking, as required for TD-PSOLA [23], is crucial because PLATINUM uses the windowed waveform as the glottal vibration for synthesis. To calculate the temporal positions for computing the spectrum Y(ω), PLATINUM uses an origin in the voiced speech and the F0 contour. The origin of each voiced segment is determined in the manner shown in Fig. 6: the center interval of the voiced speech is selected, and the time with the maximum amplitude is extracted as the origin for windowing. The other positions are automatically calculated from the F0 contour, as sketched below.
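A sketch of the forward half of that position calculation (the function name, frame indexing, and unvoiced fallback are assumptions; the released method also proceeds backward from the origin):

/* Sketch: starting from the origin, each subsequent windowing position
   is one fundamental period (1 / F0 at the current position) later.
   f0 is the frame-wise F0 contour with frame_period_sec spacing.
   Returns the number of positions written. */
int pulse_positions(double origin_sec, const double *f0, int f0_length,
                    double frame_period_sec, double end_sec,
                    double *positions, int max_positions) {
  int count = 0;
  double t = origin_sec;
  while (t < end_sec && count < max_positions) {
    positions[count++] = t;
    int frame = (int)(t / frame_period_sec);        /* frame holding t */
    if (frame >= f0_length) frame = f0_length - 1;
    double f = f0[frame] > 0.0 ? f0[frame] : 100.0; /* fallback if unvoiced */
    t += 1.0 / f;                                   /* advance one period */
  }
  return count;
}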

Figure 7. Waveforms of input speech (upper) and synthesized speech (lower), amplitude versus time. Since PLATINUM can synthesize a windowed waveform, the output speech is almost entirely the same except for the temporal position of each excitation.

3.4 Sound quality of the synthesized speech

Figure 7 shows the waveforms of the input and synthesized speech. The waveform synthesized with WORLD is almost completely the same as the input waveform because PLATINUM can compensate for the windowed waveform with the minimum and maximum phase. The temporal positions of each glottal vibration are shifted because the F0 contour does not include the origins of the glottal vibrations.

In reference [22], a MUSHRA-based evaluation [24] was carried out. WORLD was compared with STRAIGHT [18] and TANDEM-STRAIGHT [19] as modern techniques, and with the Cepstrum [17] as a conventional one. Not only synthesized speech but also F0-scaled speech (F0 ±25%) and formant-shifted speech (±15%) were tested to determine the robustness of the modifications. The speech used for the evaluation was of three males and three females, sampled at 44,100 Hz/16 bit, and a 32-dB (A-weighted) room was used. Five subjects with normal hearing ability participated. This article shows only the results for WORLD, STRAIGHT, and TANDEM-STRAIGHT because the sound quality of the Cepstrum is clearly low compared with these three. The results are shown in Table 1. Under almost all conditions, WORLD synthesized the best speech.

4. EVALUATION

WORLD and a singing synthesizer that fulfills the requirements of UTAU were developed and released via a Web site⁴. Both the executable file and the C language source code were released to encourage collaborative creation by developers. Developers could use WORLD and release their synthesizers without any permission from us (both were released under the modified BSD license). An evaluation was performed to determine whether other singing synthesizers were developed and released. The number of contents uploaded to the video-sharing Web site was also counted to collect comments on the sound quality of the synthesizers.

⁴ http://ml.cs.yamanashi.ac.jp/world/


                               STRAIGHT   TANDEM-STRAIGHT   WORLD
Synthesized speech               88.2          83.2          97.3
F0-scaled speech (+25%)          77.4          72.1          88.4
F0-scaled speech (−25%)          70.1          67.9          79.3
Formant-shifted speech (+15%)    71.4          71.4          73.2
Formant-shifted speech (−15%)    70.1          67.9          68.1

Table 1. The sound quality of speech synthesized with each method (MUSHRA-based scores).


4.1 Created singing synthesizers

As of this writing (April 2013), six synthesizers created by four developers have been released, and two original singing synthesis applications have been created by one developer. More than 70 contents have been uploaded by several creators to Nico Nico Douga.

The comments collected from Nico Nico Douga were analyzed to determine the effectiveness of the synthesizers. Almost all comments on the sound quality of the developed synthesizers were positive. It was also suggested that the subtle vocal nuances depended on the synthesizer even when the voices were synthesized by WORLD-based synthesizers. On the other hand, there were some remarks about the compatibility between the synthesizer and the voice library.

4.2 Discussion

Six synthesizers based on WORLD were developed by four developers, and many contents were created and uploaded to the video-sharing Web site. In this section, we discuss our evaluation of the synthesizers.

4.2.1 Synthesizers as content generation software

Vocaloid and UTAU are singing synthesis systems used to support creative activities. Although the simplest evaluation of a singing synthesizer is a MOS evaluation of the synthesized singing voice, the content consists of not only the singing but also the music. Post-processing such as adding reverb affects the quality of the music, and compatibility between the synthesizer and the library (including the labeling data) also affects the quality. As shown in Fig. 8, various factors affect the performance of the synthesizer as content generation software.

4.2.2 Effectiveness of the collaborative creation

The purpose of this study was to support collaborative creation by developers. We consider our attempt a success because six synthesizers were developed and subsequently used to create music. In the past, three synthesizers were released by the developer of UTAU, and three synthesizers that do not use WORLD have been released by other developers. In our case, six synthesizers with WORLD were released. The performance of these synthesizers was verified by other people, which constitutes the collaborative element of the verification.

Figure 8. Music creation process: a score and lyrics are synthesized into singing (affected by the synthesis method, the library, their compatibility, and parameter tuning), which is then post-processed and mixed down into the contents. It is difficult to evaluate the synthesized singing voice because the quality of the music does not depend solely on the singing voice: not only the synthesizer but also the post-processing and mixdown can change the subtle nuances of the singing voice.

The next step of this attempt is to develop other synthesizers that do not depend on UTAU. Although two such systems have already been developed, they rely on the labeling data and functions of UTAU. Since UTAU requires adjusting the format to synthesize the singing voice, WORLD does not achieve its full potential there. Singing voice morphing [25] has potential for use in the field of singing synthesis. More flexible modification will be the primary focus of our future work.

5. CONCLUSIONS

In this article, we described the development of singing synthesizers for UTAU by collaborative creation among many developers. The synthesizers were based on WORLD, a high-quality speech synthesis system, which was released via a Web site with C language source code. In total, six synthesizers were developed, released, and used to create music.

We also discussed our evaluation of the singing synthesizers. Although WORLD can synthesize speech that sounds as natural as the input speech, it is difficult to evaluate each synthesizer because there are so many factors in the music creation process.

We consider the proposed attempt to be a success because six synthesizers, half of all the synthesizers for UTAU, were developed, many creators used them, and their contents were evaluated as good. A discussion of how to evaluate the effectiveness of a singing synthesizer will be the key focus of our future work. We will also attempt to develop another singing synthesis system that does not depend on UTAU by collaborative creation.


Acknowledgments

This work was supported by JSPS KAKENHI Grant Numbers 23700221, 24300073, and 24650085.

6. REFERENCES

[1] H. Kenmochi, “Vocaloid and Hatsune Miku phenomenon in Japan,” in Proc. INTERSINGING 2010, 2010, pp. 1–4.

[2] T. Nakano and M. Goto, “Vocalistener: A singing-to-singing synthesis system based on iterative parameter estimation,” in Proc. SMC 2009, 2009, pp. 343–348.

[3] ——, “Vocalistener2: A singing synthesis system able to mimic a user’s singing in terms of voice timbre changes as well as pitch and dynamics,” in Proc. ICASSP 2011, 2011, pp. 453–456.

[4] G. Fischer, “Symmetry of ignorance, social creativity, and meta-design,” Knowledge-Based Systems Journal, vol. 13, no. 7–8, pp. 527–537, 2000.

[5] M. Hamasaki, H. Takeda, and T. Nishimura, “Network analysis of massively collaborative creation of multimedia contents — case study of Hatsune Miku videos on Nico Nico Douga —,” in Proc. uxTV 2008, 2008, pp. 165–168.

[6] K. Yoshii and M. Goto, “MusicCommentator: Generating comments synchronized with musical audio signals by a joint probabilistic model of acoustic and textual features,” Lecture Notes in Computer Science, LNCS 5709, pp. 85–97, 2009.

[7] H. Dudley, “Remaking speech,” J. Acoust. Soc. Am., vol. 11, no. 2, pp. 169–177, 1939.

[8] A. M. Noll, “Cepstrum pitch determination,” J. Acoust. Soc. Am., vol. 41, no. 2, pp. 293–309, 1967.

[9] L. R. Rabiner, “On the use of autocorrelation analysis for pitch detection,” IEEE Trans. Acoust., Speech, and Signal Process., vol. 25, no. 1, pp. 24–33, 1977.

[10] M. Morise, H. Kawahara, and H. Katayose, “Fast and reliable F0 estimation method based on the period extraction of vocal fold vibration of singing voice and speech,” in Proc. AES 35th International Conference, 2009, CD-ROM.

[11] A. H. Nuttall, “Some windows with very good sidelobe behavior,” IEEE Trans. Acoust., Speech, and Signal Process., vol. 29, no. 1, pp. 84–91, 1981.

[12] H. Kawahara, A. de Cheveigné, H. Banno, T. Takahashi, and T. Irino, “Nearly defect-free F0 trajectory extraction for expressive speech modifications based on STRAIGHT,” in Proc. ICSLP 2005, 2005, pp. 537–540.

[13] A. de Cheveigné and H. Kawahara, “YIN: A fundamental frequency estimator for speech and music,” J. Acoust. Soc. Am., vol. 111, no. 4, pp. 1917–1930, 2002.

[14] A. Camacho and J. Harris, “A sawtooth waveform inspired pitch estimator for speech and music,” J. Acoust. Soc. Am., vol. 124, no. 3, pp. 1638–1652, 2008.

[15] M. Morise, H. Kawahara, and T. Nishiura, “Rapid F0 estimation for high-SNR speech based on fundamental component extraction,” IEICE Trans. on Information and Systems, vol. J93-D, no. 2, pp. 109–117, 2010 (in Japanese).

[16] B. S. Atal and S. L. Hanauer, “Speech analysis and synthesis by linear prediction of the speech wave,” J. Acoust. Soc. Am., vol. 50, no. 2B, pp. 637–655, 1971.

[17] A. M. Noll, “Short-time spectrum and ‘cepstrum’ techniques for vocal pitch detection,” J. Acoust. Soc. Am., vol. 36, no. 2, pp. 269–302, 1964.

[18] H. Kawahara, I. Masuda-Katsuse, and A. de Cheveigné, “Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction,” Speech Communication, vol. 27, no. 3–4, pp. 187–207, 1999.

[19] H. Kawahara, M. Morise, T. Takahashi, R. Nisimura, T. Irino, and H. Banno, “TANDEM-STRAIGHT: A temporally stable power spectral representation for periodic signals and applications to interference-free spectrum, F0, and aperiodicity estimation,” in Proc. ICASSP 2008, 2008, pp. 3933–3936.

[20] M. Morise, T. Matsubara, K. Nakano, and T. Nishiura, “A rapid spectrum envelope estimation technique of vowel for high-quality speech synthesis,” IEICE Trans. on Information and Systems, vol. J94-D, no. 7, pp. 1079–1087, 2011 (in Japanese).

[21] M. V. Mathews, J. E. Miller, and E. E. David, “Pitch synchronous analysis of voiced sounds,” J. Acoust. Soc. Am., vol. 33, no. 2, pp. 179–185, 1961.

[22] M. Morise, “PLATINUM: A method to extract excitation signals for voice synthesis systems,” Acoust. Sci. & Tech., vol. 33, no. 2, pp. 123–125, 2012.

[23] C. Hamon, E. Moulines, and F. Charpentier, “A diphone synthesis system based on time-domain prosodic modifications of speech,” in Proc. ICASSP 1989, 1989, pp. 238–241.

[24] Method for the subjective assessment of intermediate quality level of coding systems, ITU-R Recommendation BS.1534-1, 2003.

[25] M. Morise, M. Onishi, H. Kawahara, and H. Katayose, “v.morish’09: A morphing-based singing design interface for vocal melodies,” Lecture Notes in Computer Science, LNCS 5709, pp. 185–190, 2009 (in Japanese).
