The Blizzard Challenge 2009
Simon King (a) and Vasilis Karaiskos (b)
(a) Centre for Speech Technology Research, (b) School of Informatics, University of Edinburgh
[email protected]
Abstract
The Blizzard Challenge 2009 was the fifth annual Blizzard Challenge. As in 2008, UK English and Mandarin Chinese were the chosen languages for the 2009 Challenge. The English corpus was the same one used in 2008. The Mandarin corpus was provided by iFLYTEK. As usual, participants with limited resources or limited experience in these languages had the option of using unaligned labels that were provided for both corpora and for the test sentences. An accent-specific pronunciation dictionary was also available for the English speaker. This year, the tasks were organised in the form of ‘hubs’ and ‘spokes’, where each hub task involved building a general-purpose voice and each spoke task involved building a voice for a specific application.
A set of test sentences was released to participants, who were given a limited time in which to synthesise them and submit the synthetic speech. An online listening test was conducted to evaluate naturalness, intelligibility, degree of similarity to the original speaker and, for one of the spoke tasks, “appropriateness”.

Index Terms: Blizzard Challenge, speech synthesis, evaluation, listening test
1. Introduction

The Blizzard Challenge, conceived by Black and Tokuda [1], is the international evaluation of corpus-based speech synthesisers, open to any participant. Blizzard Challenges are scientific research exercises, not competitions, in which participants use a common corpus to build speech synthesisers. A common test set is then synthesised and a large listening test is used to obtain listeners’ judgements regarding the overall naturalness of the speech, its intelligibility and how similar it sounds to the original speaker. In this, the 2009 Challenge, we used the same general setup as in recent challenges, but with the tasks organised into a hub and spoke structure, as explained in this paper.
The first two Blizzard Challenges, in 2005 and 2006, were organised by Carnegie Mellon University, USA, with the 2007, 2008 and 2009 Challenges being organised by the Centre for Speech Technology Research (CSTR) at the University of Edinburgh, UK. For general details of Blizzard, the rules of participation, a timeline, and information on previous and future Blizzard Challenges, see the website [2].
2. Participants

The Blizzard Challenge 2005 [1, 3] had 6 participants, Blizzard 2006 had 14 [4], Blizzard 2007 had 16 [5] and Blizzard 2008 had 19 (18 for English, 11 for Mandarin) [6]. This year, there were again 19 participants, listed in Table 1. One participant requested to withdraw from the Challenge after the listening test was completed. The results for this system (‘ANON’ in Table 1) have been retained in the tables and plots presented here and in the complete set of results distributed to participants. This is important, because listener scores obtained using 5-point scales are effectively internally normalised by listeners with respect to the range of stimuli they are presented with. In other words, the similarity and naturalness ratings of any individual system are relative to the scores of all the other systems present in the listening test. The upper end of the 5-point scale can be fixed by the inclusion of natural speech, but the remainder of the scale is calibrated only by the other systems present in the test. Proper interpretation of the results therefore requires presentation of the scores from all systems together. In future Blizzard Challenges, we may explicitly disallow withdrawal after distribution of results.
Three systems from previous challenges were used as benchmarks, in an attempt to facilitate comparisons between the results from one year to another: a Festival-based system from CSTR configured very similarly to the Festival/CSTR entry to Blizzard 2006 [7], an HTS speaker-dependent system configured the same as the HTS entry to Blizzard 2005 [8] and the HTS speaker-adaptive system from Blizzard 2007 [9]. Whilst precise calibration of Mean Opinion Score (MOS) ratings across different listening tests (with different participating systems and different listeners) is almost certainly not possible, the ranking of a system relative to these benchmarks may possibly be meaningfully compared from one year to another. Comparisons of the absolute scores across different years should be avoided, noting both the point made above about the relative nature of such scores and also that each year different sentences and a different pool of listeners are used.
The tasks completed by each participant are shown in Table 2. As in previous years, a number of additional groups (not listed here) registered for the Challenge and obtained the corpora, but did not submit samples for evaluation. When reporting anonymised results, the systems are identified using letters, with A denoting natural speech, B to D denoting the three benchmark systems and E to W denoting the nineteen systems submitted by participants in the Challenge.
3. Voices to be built

3.1. Speech databases
The English data for voice building was provided by the Centre for Speech Technology Research, University of Edinburgh, UK. Participants who had signed a user agreement were able to download about 15 hours of recordings of a UK English male speaker with a fairly standard RP accent. An accent-specific pronunciation dictionary, and Festival utterance files created using this dictionary, were also available for the English speaker, under a separate licence. This is exactly the same data used for the 2008 Challenge.
For Mandarin, the ANHUI USTC iFLYTEK Company, Ltd. (iFLYTEK) released a 10 hour / 6000 utterance Mandarin Chinese database of a young female professional radio broadcaster with a standard Beijing accent, reading news sentences. The first 1000 sentences were manually phonetically segmented and prosodically labelled, with the remainder being segmented and labelled automatically. Because it was not possible to make additional recordings of this speaker, no natural semantically unpredictable sentences were available this year. However, we took the view that this was a reasonable price to pay, given the opportunity to use a commercially-produced corpus.
3.2. Tasks
Participants were asked to build several synthetic voices from the databases, in accordance with the rules of the challenge [10]. A hub and spoke design was adopted this year. Hub tasks contain ‘H’ in the task name, spoke tasks contain ‘S’, and each is described in the following sections.
3.2.1. English tasks
• EH1: English full voice from the full dataset (about 15 hours)
• EH2: English ARCTIC voice from the ARCTIC [11] subset (about 1 hour)
• ES1: build voices from the specified ‘E SMALL10’, ‘E SMALL50’ and ‘E SMALL100’ datasets, which consist of the first 10, 50 and 100 sentences respectively of the ‘ARCTIC’ subset. Participants could use voice conversion, speaker adaptation techniques or any other technique.

System short name    Details
NATURAL      Natural speech from the same speaker as the corpus
FESTIVAL     The Festival unit-selection benchmark system [7]
HTS2005      A speaker-dependent HMM-based benchmark system [8]
HTS2007      A speaker-adaptive HMM-based benchmark system [9]
AHOLAB       Aholab, University of the Basque Country, Spain
ANON         Identity withheld
CASIA        National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, China
CEREPROC     CereProc Ltd, UK
CMU          Carnegie Mellon University, USA
CSTR         The Centre for Speech Technology Research, University of Edinburgh, UK
DFKI         DFKI GmbH, Germany
EMIME        The EMIME project consortium
I2R          Institute for Infocomm Research (I2R), Singapore
ITRI         Industrial Technology Research Institute, Taiwan
IVO          IVO Software Sp. z o. o.
MXAC         µXac, Australia
NICT         National Institute of Information and Communications Technology, Japan
NIT          Nagoya Institute of Technology, Japan
NTUT         National Taipei University of Technology, Taiwan
SHRC         Speech and Hearing Research Center, Peking University, China
TOSHIBA      Research and Development Center, Toshiba (China)
USTC         iFlytek Speech Lab, University of Science and Technology of China
VUB          Vrije Universiteit, Belgium

Table 1: The participating systems and their short names. The first four rows are the benchmark systems and correspond to the system identifiers A to D in that order. The remaining rows are in alphabetical order of the system’s short name and not the order E to W.

System      EH1 EH2 ES1 ES2 ES3 MH MS1 MS2
NATURAL     X X X X X X X X
FESTIVAL    X X X
HTS2005     X X X X X
HTS2007     X X X X X X X
AHOLAB      X X X
ANON        X X
CASIA       X X X X
CEREPROC    X X
CMU         X X X X
CSTR        X X
DFKI        X X X X
EMIME       X X X X X X X
HTS         X X X X X X X
I2R         X X X X X X
ITRI        X
IVO         X X X
MXAC        X X X
NICT        X X X
NTUT        X X X
SHRC        X X
TOSHIBA     X X
USTC        X X X X X
VUB         X X X

Table 2: The tasks completed by each participating system. The first four rows are the benchmark systems and correspond to the system identifiers A to D in that order. The remaining rows are in alphabetical order of the system’s short name and not the order E to W.
• ES2: build a voice from the full UK English database suitable for synthesising speech to be transmitted via a telephone channel. The telephone channel simulation tool described in Section 3.3 was made available to assist participants in system development. It was permissible to enter the same voice as task EH1 or EH2, but specially-designed voices were strongly encouraged.
• ES3: build a voice from the full UK English database suitable for synthesising the computer role in a human-computer dialogue. A set of development dialogues was provided, from the same domain as the test dialogues. Participants could enter the same voice as task EH1 or EH2, but again specially-designed voices were strongly encouraged. Participants were allowed to add simple markup to the text, either automatically or manually, if they wished. The markup had to be of a type that could conceivably be provided by a text-generation system (e.g. emphasis tags were acceptable, but a handcrafted F0 contour was not).
3.2.2. Mandarin tasks
• MH: Mandarin voice from the full dataset (about 10 hours / 6000 utterances / 130000 Chinese characters)
• MS1: build voices from each of the specified ‘M SMALL10’, ‘M SMALL50’ and ‘M SMALL100’ datasets, which consist of the first 10, 50 and 100 sentences respectively of the full Mandarin database. Same rules as ES1.
• MS2: build a voice from the full Mandarin database suitable for synthesising speech to be transmitted via a telephone channel. Same rules as ES2.
# Set active speech level of source signal to -26 dBov
sv56demo -q -lev -26 -sf 8000 in.pcm tmp1.pcm

# The level-normalized source speech signal is then filtered
# according to the "telephone bandpass" defined in ITU-T Rec. G.712
c712demo tmp1.pcm tmp2.pcm

# The G.712-filtered version is successively G.711-encoded,
# encoded and decoded according to G.726 at 16 kbit/s,
# and decoded by G.711 (A-law)
g711demo A lilo tmp2.pcm tmp3a.pcm 160
g726demo A lolo 16 tmp3a.pcm tmp3b.pcm 160
g711demo A loli tmp3b.pcm tmp3.pcm 160

# The decoded signal is filtered according to the (modified)
# Intermediate Reference System in receive direction,
# as defined in ITU-T Rec. P.830
filter -q RXIRS8 tmp3.pcm tmp4.pcm 160

# Set active speech level of output signal to -26 dBov
sv56demo -q -lev -26 -sf 8000 tmp4.pcm out.pcm
Figure 1: The pipeline of processes used to simulate the telephone channel. Input and output are headerless PCM files at 8 kHz sampling rate and 16 bit sample depth.
3.3. Telephone channel simulation for tasks ES2 and MS2
In order to investigate the effects of telephone channels on the intelligibility of the submitted synthetic speech, a simulated telephone channel was used. Although it would have added more realism to present listeners with the stimuli monaurally using a telephone handset, this was not practical for the large numbers of listeners required by the Blizzard Challenge.
The simulated channel was implemented using the “G.191: Software tools for speech and audio coding standardization” software freely available from the ITU1, with a pipeline of processes kindly suggested by Telekom Laboratories & The Quality and Usability Lab at TU Berlin, shown in Figure 1. We elected to implement a relatively low quality channel with a 16 kbit/s transmission rate.
Participants were provided with this pipeline, should they wish to use it during development of their ES2 and MS2 voices. They were encouraged to build special voices for these tasks, but were allowed to enter their EH1, EH2 or MH voices instead.
3.4. Appropriateness (task ES3)
At previous Blizzard workshops there was a clear desire to evaluate more than just naturalness and intelligibility; specifically, participants wished to evaluate synthetic speech in a particular usage context. Therefore, we conceived task ES3, in which the synthetic speech was evaluated in a simulated human-computer dialogue. A real-time dialogue system, which dynamically generates the computer response, would require participants to submit run-time synthesisers. It was decided that this would be unattractive for some participants, and impractical for the organisers. Therefore, we used pairs of dialogue utterances comprising one user’s query to the system followed by the system’s response. These were kindly provided by the CLASSIC project2 and were in a restaurant recommendation domain. The sentences were manually adjusted by the Blizzard organisers in order to remove difficult-to-pronounce
1 http://www.itu.int/rec/T-REC-G.191-200509-I/en
2 www.classic-project.org
restaurant names (e.g. French words). Since these dialogue pairs were static, participants could pre-synthesise all the system utterances and submit them for evaluation. For the test sentences, the texts of both the user query and the corresponding system response were provided to participants.
4. Listening test design

4.1. Interface
The listening evaluation was conducted online, using the design developed for Blizzard 2007 [5] and refined in Blizzard 2008 [6], which was itself developed from designs used in previous challenges [1, 3, 4]. The registration page for each listener type presented an overview of the listening test and the tasks to be completed. It was possible for a listener to register for both the English and Mandarin listening tests separately, if they wished. Please refer to [5] for a complete description of the listening test interface.
4.2. Materials
The participants were asked to synthesise several hundred test sentences (including the complete Blizzard Challenge 2007 and 2008 test sets, to be retained as a resource for future experimentation), of which a subset were used in the listening test. The selection of which sentences to use in the listening tests was made as in 2008 – please see [6] for details. Permission has been obtained from almost all participants to distribute parts of this dataset along with the listener scores; we hope to find the resources to do this shortly. For English, participants synthesised sentences that had been held out from the corpus (so that natural speech samples were available for them) plus Semantically Unpredictable Sentences (SUS) [12] generated using a tool provided by Tim Bunnell of the University of Delaware and recorded by us specially for the Challenge with the same speaker as the distributed corpus. These SUS conform more closely to the original specification [12] and use simpler words than the SUS used in previous Blizzard Challenges. In order to mitigate this and avoid ceiling effects, listeners were only permitted to play each such sentence once. For Mandarin, held-out sentences were also used. The SUS for Mandarin were generated using the same tool as in 2008. Natural SUS were not available for Mandarin, since the original speaker was not available.
4.3. Listener types
Various listener types were employed in the test; the letters in parentheses below are the identifiers used for each type in the results distributed to participants. For English, the following listener types were used:
• Volunteers recruited via participants, mailing lists, blogs, etc. (ER).
• Speech experts, recruited via participants and mailing lists(ES).
• Paid UK undergraduates, native speakers of UK English, aged about 18-25. These were recruited in Edinburgh and carried out the test in purpose-built soundproof listening booths using good quality audio interfaces and headphones (EU).
For Mandarin, the following listener types were used:

• Paid native speakers of Mandarin, aged 18-25, recruited in China using a commercial testing organisation, who carried out the test in a quiet supervised lab using headphones (MC).

• Paid undergraduate native speakers of Mandarin, aged about 20-25. These were recruited in Edinburgh and carried out the test in purpose-built soundproof listening booths using good quality audio interfaces and headphones (ME).
Section number   Tasks being evaluated   Type (see Section 4.4.1)

Test name: EH1 + ES3
1   EH1       SIM
2   EH1       Multidimensional scaling (MDS)
3   EH1       MOSnews
4   EH1,ES3   MOSconv
5   EH1       SUS
6   ES3       MOSapp

Test name: EH2 + ES3
1   EH2       SIM
2   EH2       MDS
3   EH2       MOSnews
4   EH1,ES3   MOSconv
5   EH2       SUS
6   ES3       MOSapp

Test name: ES1 + ES2
1   ES1       SIM
2   ES1       SIM
3   ES1       MOSnews
4   ES1       MOSconv
5   ES1       SUS
6   ES2       SIM
7   ES2       MOSnews
8   ES2       MOSconv
9   ES2       SUS
10  ES2       SUS

Table 3: The three listening tests conducted for English.
• Volunteers, recruited via participants, mailing lists, etc. (MR).

• Speech experts, recruited via participants and mailing lists (MS).
Tables 29 to 35, summarised in Table 5, show the number of listeners of each type obtained for each of the listening tests listed in Tables 3 and 4.
4.4. Listening tests
Since the tests for tasks ES1, ES2, MS1 and MS2 were relatively short, they were combined into pairs in order to make the best use of available listeners. Only two participants entered ES3 voices, so the listening test for this task was handled differently. Rather than simply performing a comparison between these two systems, they were included in two sections of the main EH1 and EH2 listening tests, as described in Section 4.4.1. Tables 3 and 4 show the five independent listening tests that were run in parallel for this year’s Blizzard Challenge. Each listener performed one of the three English tests or one of the two Mandarin tests (or, possibly, one English test and one Mandarin test). Each test followed the same general design, although the number and type of sections varied, as described in the tables. Within each numbered section of a listening test, the listener generally heard one example from each system, with the exception of the MDS sections (which involved pairwise comparisons) and the MOSconv/MOSapp sections in tests EH1+ES3 and EH2+ES3. Note that the number of systems involved in each task varies; where there were more systems, and therefore larger Latin Squares, fewer sections could be included in the corresponding listening test. Samples of the original speaker were included in all sections, except for Mandarin SUS.
Section number   Tasks being evaluated   Type (see Section 4.4.1)

Test name: MH
1   MH    SIM
2   MH    MDS
3   MH    MOSnews
4   MH    MOSnews
5   MH    SUS
6   MH    SUS
7   MH    SUS

Test name: MS1 + MS2
1   MS1   SIM
2   MS1   SIM
3   MS1   MDS
4   MS1   MOSnews
5   MS1   MOSnews
6   MS2   SUS
7   MS2   SIM
8   MS2   MDS
9   MS2   MOSnews
10  MS2   MOSnews
11  MS2   SUS
12  MS2   SUS

Table 4: The two listening tests conducted for Mandarin.
4.4.1. Description of each type of section in the listening test
SIM: In each part, listeners could play 4 reference samples of the original speaker and one synthetic sample. They chose a response that represented how similar the synthetic voice sounded to the voice in the reference samples, on a scale from 1 [Sounds like a totally different person] to 5 [Sounds like exactly the same person].

MDS: In each part, listeners heard one sample from each of two of the participating systems (or, in the case of one system ordering for each dataset, two samples from the same system). Listeners were asked to ignore the meanings of the sentences and instead concentrate on how natural or unnatural each one sounded. They then chose whether, in their opinion, the two sentences were similar or different in terms of their overall naturalness. The results of this section are intended for analysis using Multidimensional Scaling (not presented here).

MOSnews: Mean Opinion Score (MOS - naturalness), news domain. In each part, listeners listened to one sample and chose a score which represented how natural or unnatural the sentence sounded, on a scale of 1 [Completely Unnatural] to 5 [Completely Natural].

MOSconv: Mean Opinion Score (MOS - naturalness), conversational domain. In each part, listeners listened to one sample and chose a score which represented how natural or unnatural the sentence sounded, on a scale of 1 [Completely Unnatural] to 5 [Completely Natural].
There were only two entries to the ES3 task, so we devised a listening test design in which the listening tests EH1+ES3 and EH2+ES3 included sections containing samples from the two submitted ES3 systems plus samples from all systems for voice EH1 or EH2. However, due to a small error in the listening test scripts, these two sections actually contained samples from all EH1 systems except the EH1 samples from the two teams that submitted an ES3 voice, but including samples from the ES3 voice of those two teams. The consequence of this is that the EH1 samples from those two teams were evaluated by fewer listeners than intended. We used the results for all EH1 samples from both MOSconv sections (the one from test EH1+ES3 and the one from test EH2+ES3) to compute the MOS scores for voice EH1. The results from the ES3 samples are presented separately.

SUS: Semantically Unpredictable Sentences (SUS) designed to test the intelligibility of the synthetic speech. Listeners heard one utterance in each part and typed in what they heard. The error rates were computed as in previous years [5, 6].

MOSapp: Mean opinion scores (MOS - appropriateness), conversational domain. In each part, listeners saw a question (provided in text form only) of the type that a human user might ask a restaurant enquiry service, and then listened to one spoken sample that represented the response to that question. Listeners chose a score which represented how appropriate or not the response sounded in that dialogue context, on a scale of 1 [Completely Inappropriate] to 5 [Completely Appropriate]. For this section we used the samples from the two teams that submitted a separate voice for ES3; we decided to also add EH1 samples from all the other teams. The results are presented together.
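The SUS error rates are word error rates obtained by aligning each typed response against the reference sentence. As an illustration only (the function below is ours, not the organisers' scoring script), a word error rate can be computed with a standard dynamic-programming alignment:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference length."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # Levenshtein edit distance over words, computed by dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)
```

Note that, unlike a proportion of correct words, a WER defined this way can exceed 100% when the response contains many insertions.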
4.4.2. Number of listeners
The listener responses used for the distributed results were extracted from the database on 26th June 2009, after the online evaluation had been running for approximately six weeks. The number of listeners obtained is shown in Table 5.
                         English   Mandarin
Total registered           482       334
of which:
Completed all sections     365       311
Partially completed         59        14
No response at all          58         9

Table 5: Number of listeners obtained.
See Table 28 for a detailed breakdown of evaluation completion rates for each listener type. As in last year’s challenge, the higher completion rate for Mandarin listeners is a consequence of the higher proportion of paid listeners.
5. Analysis methodology

As in previous years, we pooled ‘completed all sections’ and ‘partially completed’ listeners together in all analyses. Here, we present only results for all listener types combined. Analysis by listener type was provided to participants. Please refer to [13] for a complete description of the statistical analysis techniques used and justification of the statistical significance techniques employed. As usual, system names are anonymised in all distributed results. See Section 7.3 and Tables 23 to 63 for a summary of the responses to the questionnaire that listeners were asked to optionally complete at the end of the listening test.
6. Results

Standard boxplots are presented for the ordinal data, where the median is represented by a solid bar across a box showing the quartiles; whiskers extend to 1.5 times the inter-quartile range and outliers beyond this are represented as circles. Bar charts are presented for the word error rate type interval data. A single ordering of the systems is employed in all plots for a particular language. This ordering is in descending order of the mean MOS (combining MOSnews and MOSconv) for the main task (EH1 or MH) – see Tables 6 and 8. Note that this ordering is intended only to make the plots more readable and cannot be interpreted as a ranking. In other words, the ordering does not tell us anything about which systems are significantly better than other systems.
System  median  MAD  mean  sd    n    na
A       5       0.0  4.9   0.38  463  43
B       3       1.5  2.9   1.06  457  49
C       3       1.5  2.7   1.07  463  43
D       2       1.5  2.5   1.02  456  50
E       2       1.5  2.1   1.01  462  44
H       3       1.5  2.8   1.01  463  43
I       3       1.5  3.1   1.02  462  44
J       2       1.5  2.4   0.98  463  43
K       4       1.5  3.8   0.88  457  49
L       3       1.5  2.8   0.97  457  49
M       2       1.5  1.9   0.92  462  44
O       3       1.5  2.6   0.98  463  43
P       2       1.5  2.0   0.98  457  49
Q       2       1.5  2.1   0.93  463  43
R       2       1.5  2.1   0.97  463  43
S       4       1.5  4.2   0.71  163  343
T       2       1.5  2.0   0.97  463  43
W       2       1.5  2.1   0.94  456  50
Table 6: Mean opinion scores for task EH1 (full data set) on the combined results from sections 3 and 4 of the EH1+ES3 listening test, excluding the ES3 samples. The table shows median, median absolute deviation (MAD), mean, standard deviation (sd), n and na (data points excluded). Note the high value of na for system S – this is due to the error in the setup of section 4 of this listening test.
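The median absolute deviation (MAD) reported in these tables is a robust spread measure: the median of the absolute deviations from the median. A sketch of the unscaled definition, on hypothetical ratings rather than the published analysis code (note that some packages, such as R's mad(), multiply this value by a consistency constant):

```python
import statistics

def mad(values):
    """Unscaled median absolute deviation: median of |x - median(x)|."""
    m = statistics.median(values)
    return statistics.median(abs(v - m) for v in values)

ratings = [2, 3, 3, 4, 2, 3, 5, 3, 2, 4]  # hypothetical 5-point scores
print(statistics.median(ratings), mad(ratings))              # median, MAD
print(round(statistics.mean(ratings), 2),
      round(statistics.stdev(ratings), 2))                   # mean, sample sd
```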
6.1. Task EH1
Table 6 presents descriptive statistics for the mean opinion scores for English task EH1. Figure 2 displays the results of the tests graphically. As expected, we see that natural speech (system A) has a MOS naturalness of 5. Inspecting the Bonferroni-corrected pairwise Wilcoxon signed rank significance tests (α = 0.01) for naturalness presented in Table 11 reveals that system A is significantly different from all other systems. We can therefore say that no synthesiser is as natural as the natural speech. Systems S and K, whilst not as natural as the natural speech, are both significantly more natural than all other systems.
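Pairwise tests of this kind can be reproduced with standard statistical libraries. A minimal sketch, using hypothetical ratings and assuming scipy is available (the official analysis [13] should be consulted for the exact procedure):

```python
# Sketch: Bonferroni-corrected pairwise Wilcoxon signed-rank tests at
# alpha = 0.01. Ratings below are hypothetical, not Blizzard data.
from itertools import combinations
from scipy.stats import wilcoxon

scores = {  # hypothetical matched per-listener MOS ratings
    "A": [5, 5, 4, 5, 5, 4, 5, 5, 4, 5, 5, 5],
    "K": [4, 3, 4, 4, 3, 4, 4, 4, 3, 4, 4, 3],
    "S": [4, 4, 5, 4, 4, 4, 5, 4, 4, 4, 5, 4],
}
alpha = 0.01
pairs = list(combinations(scores, 2))
corrected_alpha = alpha / len(pairs)  # Bonferroni correction

pvals = []
for x, y in pairs:
    # Paired test: each listener rated both systems on matched material.
    stat, p = wilcoxon(scores[x], scores[y])
    pvals.append(p)
    verdict = "significant" if p < corrected_alpha else "not significant"
    print(f"{x} vs {y}: p = {p:.4f} ({verdict})")
```

The Bonferroni correction divides the significance level by the number of pairwise comparisons, so that the family-wise error rate stays at the nominal alpha.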
From the plot of similarity scores, and by referring to Table 10, we can also say that, although systems K and S are significantly less similar to the original speaker than natural speech, they are both significantly more similar to the original speaker than all other systems, for English task EH1. Likewise, from Table 11, systems S and K are equally natural and significantly more natural than all other systems, although significantly less natural than natural speech.
System S is as intelligible as natural speech (Table 12). However, there is no significant difference between system S and a number of other systems (B, C, K, L, O, P), so we cannot state that system S is more intelligible than these systems.
6.2. Task EH2
For English task EH2, results are given in Table 7 and Figure 3, with statistical significances shown in Table 13 for similarity, Table 14 for naturalness and Table 15 for intelligibility. Again, no system is as natural as the natural speech, or as similar to the original speaker. There is no system that is clearly more natural than the rest. Although it was as intelligible as natural speech on task EH1, system S is no longer as intelligible as natural speech on task EH2.
6.3. Task ES1
In both English and Mandarin, we chose to evaluate just one of the three voices built for this task. For task ES1, we selected the E SMALL100 voice, based on preferences expressed by participants who submitted entries for task ES1. For English task ES1 (building a voice from very small amounts of speech), results are given in Figure 4 and significance tests are shown in Table 16.

System  median  MAD  mean  sd    n    na
A       5       0.0  4.8   0.47  139  13
B       3       1.5  2.9   1.08  140  12
C       3       1.5  3.2   0.94  140  12
D       3       1.5  2.6   0.97  141  11
E       2       1.5  1.7   0.83  140  12
H       3       1.5  2.6   0.94  140  12
I       3       1.5  3.3   0.98  141  11
J       3       1.5  2.6   1.03  140  12
K       4       0.0  3.6   0.89  140  12
L       3       1.5  3.3   1.01  139  13
M       2       1.5  1.8   0.89  140  12
O       2       1.5  2.5   0.92  141  11
P       2       1.5  2.4   0.99  141  11
Q       3       1.5  2.5   1.03  140  12
R       2       1.5  2.3   0.88  140  12
S       4       1.5  3.7   0.92  141  11
T       2       1.5  2.2   1.01  140  12
U       2       1.5  1.7   0.88  141  11
W       2       1.5  2.3   0.99  140  12

Table 7: Mean opinion scores for task EH2 (ARCTIC data set) on the combined results from sections 3 and 4 of the EH2+ES3 listening test, excluding the ES3 samples. The table shows median, median absolute deviation (MAD), mean, standard deviation (sd), n and na (data points excluded).
All systems are rated as unnatural and not similar to the original speaker. Systems J and P are significantly less similar to the original speaker than the other systems. Systems W and D are somewhat more natural than the other systems, although this is not significant in all cases.
The systems fall neatly into three groups for intelligibility: natural speech is significantly more intelligible than all synthesisers; systems P, S, D, W and L are equally intelligible; followed by systems H and J.
6.4. Task ES2
For English task ES2 (building a voice for use over the telephone), results are given in Figure 5. Significance tests are shown for naturalness and intelligibility in Table 17. Now there is no system that is as intelligible as natural speech – it appears that synthetic speech may be generally more degraded by the telephone channel than natural speech, in terms of intelligibility.
6.5. Task ES3
For English task ES3 (building a voice for a dialogue system), results are given in Figure 6. System S is rated as significantly more appropriate than system U (using the same type of pairwise Wilcoxon signed rank tests as in other tasks), although this may be simply because system U is significantly less natural than system S.
6.6. Task MH
Table 8 and Figure 7 present the results for the Mandarin hub task MH. The significance tests illustrated in Table 19 show that, again, as for English, no system is as natural as the natural speech. The most natural synthesiser is system L which, although less natural than natural speech, is significantly more natural than all other systems.
Since natural SUS were not available for Mandarin this year, we are unable to test whether any system was as intelligible as natural speech. We can say, from the significance tests illustrated in Table 20, that systems L, F and C are equally intelligible, although only system L is significantly more intelligible than the remaining systems.
With regard to similarity to the original speaker, Table 18 shows that no system was regarded as being as similar to the original speaker as the natural speech. Systems L, F, C and R form a group of systems that appear to be most similar to the original speaker, although only systems L and F are significantly more similar than the remaining systems. Note that system F is actually significantly different from system R within this approximate grouping.
System  median  MAD  mean  sd    n    na
A       5       0.0  4.6   0.79  370  26
C       4       1.5  3.6   0.96  370  26
D       3       1.5  2.9   1.06  370  26
F       4       1.5  3.8   1.07  370  26
G       3       1.5  2.8   1.13  371  25
I       3       1.5  3.3   1.24  370  26
L       4       1.5  4.1   0.93  370  26
M       3       1.5  3.1   1.14  370  26
N       3       1.5  2.8   1.25  370  26
R       4       1.5  3.5   1.04  371  25
V       3       1.5  3.0   1.18  370  26
W       3       1.5  3.1   1.04  370  26

Table 8: Mean opinion scores for task MH. The table shows median, median absolute deviation (MAD), mean, standard deviation (sd), n and na (data points excluded due to missing data).
6.7. Task MS1
For Mandarin task MS1 (building a voice from very small amounts of speech), results are given in Figure 8 and significance tests are shown in Table 21. We selected the M SMALL100 voice for evaluation, based on preferences expressed by participants who submitted entries for this task. No system was found to be as natural, or as similar to the original speaker, as the natural speech. Systems L, R, W and D form a group of systems which sound most similar to the original speaker (although there is a significant difference between system R and system D). System L is significantly more natural than all other systems except W. There are few significant differences in intelligibility between most systems, in terms of PTER.
6.8. Task MS2
For Mandarin task MS2 (building a voice for use over the telephone), results are given in Figure 9. Significance tests are shown for naturalness and intelligibility in Table 22. The natural speech (system A) is no longer rated by listeners as being very similar to the original speaker, although it is still found to be highly natural and significantly more so than any other system. System L is significantly more natural than all other systems, except natural speech. There are relatively few significant differences in intelligibility, with systems C, F, L, V and W forming a group of roughly equally intelligible systems (although there are some significant differences between systems within this group, and also some insignificant differences between some members of this group and the remaining systems).
7. Discussion

There is continued interest in the Blizzard Challenge, with 19 teams participating this year. We therefore propose to organise another Challenge in 2010. In 2009, we made several additions
              2007        2008        2009
System      MOS  WER    MOS  WER    MOS  WER
Natural     4.7  -      4.8  22     4.9  14
Festival    3.0  25     3.3  35     2.9  25
HTS 2005    -    -      2.9  33     2.7  23

Table 9: Comparing the results of the benchmark systems for English (main voice, large database) across three years of the Blizzard Challenge. MOS means mean naturalness score and WER means word error rate in percent using semantically unpredictable sentences (SUS). Note that the SUS in 2009 were simpler than those in 2007 and 2008
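The WER figures above follow the usual definition: the word-level Levenshtein distance (substitutions, insertions and deletions) between the listener's transcript and the reference SUS, divided by the number of reference words. A minimal sketch of that computation; the example sentence is invented, and the official scoring used dedicated programmes:

```python
def word_error_rate(reference, hypothesis):
    """WER in percent: word-level edit distance divided by the
    number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)

# Invented SUS-style reference and a listener's transcript:
# one substitution ("the" -> "a") and one deleted word ("tree").
print(word_error_rate("the sharp crow fled the bold tree",
                      "the sharp crow fled a bold"))
```

Because SUS are semantically unpredictable, listeners cannot guess missing words from context, which is what makes WER over SUS a usable intelligibility measure.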
to the challenge, with varying degrees of success. Both the ‘very small amounts of data’ and ‘speech for transmission by telephone’ tasks seemed popular with participants. The dialogue speech task was not popular, with only two entries, even though our discussions with past participants suggest that this type of application for TTS is widely thought to be important and interesting. Entries to this task probably required considerably more effort, and perhaps needed more expert knowledge of the language (English). We would welcome suggestions for ways to evaluate ‘appropriateness’ or any other measure of how good synthetic speech is in particular usage situations or applications. Task-based scenarios are attractive, since they allow objective measures of task success (e.g. completion rate or time taken). However, they also tend to be lengthy and may require on-line generation of synthetic speech; neither of these is practical for the Blizzard Challenge.
7.1. Benchmark systems
The inclusion of the benchmark systems is intended to provide reference points for comparison between different years of the Challenge. If this is to be possible, then the relative ranking of the benchmark systems should be constant from year to year. Table 9 presents the key results for the English benchmark systems for 2007, 2008 and 2009. These results do seem to be consistent year-on-year. WER decreases uniformly by about one third for all systems from 2008 to 2009, due to the simpler SUS used this year. The relative MOS and WER of the three systems are consistent: for MOS, the ranking is Natural–Festival–HTS 2005; for WER, the ranking is Natural–HTS 2005–Festival.
7.2. Limitations of the listening test design
The current listening test design has many advantages, including the ability to perform evaluations for quite a large number of systems (perhaps up to 25) with a fully balanced design which controls for possible effects of sentence and order of presentation by using a Latin Square design.
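A Latin Square ensures that, across the listener pool, every system is heard equally often in every presentation slot. A minimal cyclic construction illustrates the idea (a sketch only; the actual Blizzard design adds further balancing over sentences and listener groups):

```python
def latin_square(n):
    """n x n cyclic Latin square: entry (row, col) = (row + col) mod n,
    so each of the n system indices appears exactly once in every row
    (listener) and exactly once in every column (presentation slot)."""
    return [[(row + col) % n for col in range(n)] for row in range(n)]

# Each row is one listener's ordering of four systems 0..3.
for row in latin_square(4):
    print(row)
```

With one listener per row (or a whole listener group per row), no system is systematically advantaged by always appearing first or last.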
We consider this year’s hub and spoke design a success, because it allowed participants to enter whichever tasks they desired. The disappointing number of entries to task ES3 necessitated special treatment in the listening test, which created considerable additional complexity in the design and in turn led to a small error being made in this part of the test.
However, there are two significant weaknesses which shouldbe considered when designing future listening tests for the Bliz-zard Challenge:
• The listening tests for each hub and spoke task are conducted independently, making cross-task comparisons impossible. In particular, this year’s test does not allow direct calculation of the difference in intelligibility for a single system between a hub task and the telephone channel spoke task.
• Each new task added increases the number of listeners required. This year, we were able to use the same listener pool for some pairs of tasks, but this necessitated the use of different sentences in each test (particularly important for SUS), which only increases the difficulty of comparing results across tasks for a single system.
7.3. Listener feedback
On completing the evaluation, listeners were given the opportunity to tell us what they thought through an online feedback form, the same form as in Blizzard 2007 and 2008. All responses were optional. Feedback forms were submitted by all the listeners who completed the evaluation and included many detailed comments and suggestions from all listener types. Listener information and feedback is summarised in Tables 23 to 63.
8. Acknowledgements

Rob Clark designed and implemented the statistical analysis; Dong Wang wrote the WER and CER/PTER/PER programmes; Volker Strom and Junichi Yamagishi provided the benchmark systems. Roger Burroughes is ‘roger’, the English voice; Tim Bunnell of the University of Delaware generated the 2009 SUS sentences; iFLYTEK provided the Mandarin data; the listening test scripts are based on earlier versions provided by previous organisers of the Blizzard Challenge. Thanks to all participants and listeners.
9. References

[1] Alan W. Black and Keiichi Tokuda, “The Blizzard Challenge - 2005: Evaluating corpus-based speech synthesis on common datasets,” in Proc. Interspeech 2005, Lisbon, 2005.
[2] “Blizzard Challenge 2009 website,” http://www.synsig.org/index.php/Blizzard_Challenge_2009.
[3] C. L. Bennett, “Large scale evaluation of corpus-based synthesizers: Results and lessons from the Blizzard Challenge 2005,” in Proc. Interspeech 2005, 2005.
[4] C. L. Bennett and A. W. Black, “The Blizzard Challenge 2006,” in Blizzard Challenge Workshop, Interspeech 2006 - ICSLP satellite event, 2006.
[5] Mark Fraser and Simon King, “The Blizzard Challenge 2007,” in Proc. Blizzard Workshop (in Proc. SSW6), 2007.
[6] V. Karaiskos, S. King, R. A. J. Clark, and C. Mayo, “The Blizzard Challenge 2008,” in Proc. Blizzard Workshop (in Proc. SSW7), 2008.
[7] R. Clark, K. Richmond, V. Strom, and S. King, “Multisyn voices for the Blizzard Challenge 2006,” in Proc. Blizzard Challenge Workshop (Interspeech Satellite), Pittsburgh, USA, Sept. 2006.
[8] Heiga Zen and Tomoki Toda, “An overview of Nitech HMM-based speech synthesis system for Blizzard Challenge 2005,” in Proc. Blizzard Workshop, 2005.
[9] Junichi Yamagishi, Heiga Zen, Tomoki Toda, and Keiichi Tokuda, “Speaker-independent HMM-based speech synthesis system - HTS-2007 system for the Blizzard Challenge 2007,” in Proc. Blizzard Workshop, 2007.
[10] “Blizzard Challenge 2009 rules,” http://www.synsig.org/index.php/Blizzard_Challenge_2009_Rules.
[11] J. Kominek, NewAuthor1, and A. W. Black, “The CMU Arctic speech databases,” in SSW5-2004, 2004, pp. 223-224.
[12] C. Benoit and M. Grice, “The SUS test: a method for the assessment of text-to-speech intelligibility using semantically unpredictable sentences,” Speech Communication, vol. 18, pp. 381-392, 1996.
[13] R. A. J. Clark, M. Podsiadło, M. Fraser, C. Mayo, and S. King, “Statistical analysis of the Blizzard Challenge 2007 listening test results,” in Proc. Blizzard Workshop (in Proc. SSW6), August 2007.
[Figure 2 comprises boxplots, not reproduced here, of: similarity scores comparing to original speaker; mean opinion scores (naturalness), overall and for the news domain only; and word error rate (%); all for task EH1, all listeners.]

Figure 2: Results for task EH1.
[Pairwise significance matrices over systems A B C D E H I J K L M O P Q R S T W not reproduced here.]

Table 10: Significant differences in similarity to the original speaker for task EH1: results of pairwise Wilcoxon signed rank tests between systems’ mean opinion scores. A marked cell indicates a significant difference between a pair of systems.

Table 11: Significant differences in naturalness for task EH1: results of pairwise Wilcoxon signed rank tests between systems’ mean opinion scores. A marked cell indicates a significant difference between a pair of systems.

Table 12: Significant differences in intelligibility for task EH1: results of pairwise Wilcoxon signed rank tests between systems’ word error rates. A marked cell indicates a significant difference between a pair of systems.
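The significance matrices are built from pairwise Wilcoxon signed rank tests on paired listener scores. A self-contained sketch of one such test, using the normal approximation and hypothetical scores (the published analysis applied more careful procedures; see [13]):

```python
import math

def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed rank test for paired scores,
    using the normal approximation. Zero differences are dropped
    and tied absolute differences receive average ranks."""
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    ranked = sorted(diffs, key=abs)
    ranks = [0.0] * n
    i = 0
    while i < n:                       # assign average ranks to ties
        j = i
        while j + 1 < n and abs(ranked[j + 1]) == abs(ranked[i]):
            j += 1
        for k in range(i, j + 1):
            ranks[k] = (i + j) / 2 + 1
        i = j + 1
    w_plus = sum(r for d, r in zip(ranked, ranks) if d > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    # Two-sided p-value from the standard normal distribution.
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p

# Hypothetical paired naturalness scores from ten listeners.
z, p = wilcoxon_signed_rank([5, 5, 4, 5, 4, 5, 5, 4, 5, 4],
                            [2, 3, 2, 3, 2, 2, 3, 2, 3, 2])
print(f"z = {z:.2f}, p = {p:.4f}")
```

The signed rank test is used rather than a t-test because opinion scores are ordinal and bounded; when many pairs are tested, a multiple-comparison correction is also needed before declaring a cell significant.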
[Figure 3 comprises boxplots, not reproduced here, of: similarity scores comparing to original speaker; mean opinion scores (naturalness); and word error rate (%); all for task EH2, all listeners.]

Figure 3: Results for task EH2.
[Pairwise significance matrices over systems A B C D E H I J K L M O P Q R S T U W not reproduced here.]

Table 13: Significant differences in similarity to the original speaker for task EH2: results of pairwise Wilcoxon signed rank tests between systems’ mean opinion scores. A marked cell indicates a significant difference between a pair of systems.

Table 14: Significant differences in naturalness for task EH2: results of pairwise Wilcoxon signed rank tests between systems’ mean opinion scores. A marked cell indicates a significant difference between a pair of systems.

Table 15: Significant differences in intelligibility for task EH2: results of pairwise Wilcoxon signed rank tests between systems’ word error rates. A marked cell indicates a significant difference between a pair of systems.
[Figure 4 comprises boxplots, not reproduced here, of: similarity scores comparing to original speaker; mean opinion scores (naturalness); and word error rate (%); all for task ES1, all listeners.]

Figure 4: Results for task ES1.
[Three pairwise significance matrices over systems A D H J L P S W not reproduced here.]

Table 16: Significant differences in similarity to the original speaker (left table), naturalness (middle table) and intelligibility (right table) for task ES1: results of pairwise Wilcoxon signed rank tests. A marked cell indicates a significant difference between a pair of systems.
[Figure 5 comprises boxplots, not reproduced here, of: similarity scores comparing to original speaker; mean opinion scores (naturalness); and word error rate (%); all for task ES2, all listeners.]

Figure 5: Results for task ES2.
[Two pairwise significance matrices over systems A B C D I K L O P Q R S W not reproduced here.]

Table 17: Significant differences in naturalness (left table) and intelligibility (right table) for task ES2: results of pairwise Wilcoxon signed rank tests. A marked cell indicates a significant difference between a pair of systems.
[Figure 6 comprises boxplots, not reproduced here, of: mean opinion scores (appropriateness) for all systems, and mean opinion scores (naturalness) for systems S.ES3 and U.ES3; all for task ES3, all listeners.]

Figure 6: Results for task ES3.
[Figure 7 comprises boxplots, not reproduced here, of: similarity scores comparing to original speaker; mean opinion scores (naturalness); pinyin (without tone) error rate (PER); and pinyin (with tone) error rate (PTER); all for task MH, all listeners.]

Figure 7: Results for task MH
[Pairwise significance matrices over systems A C D F G I L M N R V W (A excluded for intelligibility) not reproduced here.]

Table 18: Significant differences in similarity to the original speaker for task MH: results of pairwise Wilcoxon signed rank tests between systems’ mean opinion scores. A marked cell indicates a significant difference between a pair of systems.

Table 19: Significant differences in naturalness for task MH: results of pairwise Wilcoxon signed rank tests between systems’ mean opinion scores. A marked cell indicates a significant difference between a pair of systems.

Table 20: Significant differences in intelligibility for task MH: results of pairwise Wilcoxon signed rank tests between systems’ pinyin+tone error rate (PTER). A marked cell indicates a significant difference between a pair of systems.
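PER and PTER differ only in whether tone is scored: both are syllable-level edit-distance error rates over pinyin transcripts, but PER strips the tone marker first, so tone-only errors count towards PTER and not PER. A sketch assuming the common digit-suffix pinyin notation (the example strings are invented; the official scoring used dedicated programmes):

```python
import re

def edit_distance(a, b):
    """Levenshtein distance between two token sequences,
    computed with a rolling row of the DP table."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,          # insertion
                           prev[j - 1] + (x != y))) # substitution
        prev = cur
    return prev[-1]

def pinyin_error_rates(reference, hypothesis):
    """Return (PER, PTER) in percent. PTER scores tone-marked
    syllables as-is; PER strips the tone digit first."""
    ref, hyp = reference.split(), hypothesis.split()
    pter = 100.0 * edit_distance(ref, hyp) / len(ref)
    strip = lambda syls: [re.sub(r"\d", "", s) for s in syls]
    per = 100.0 * edit_distance(strip(ref), strip(hyp)) / len(ref)
    return per, pter

# Invented transcripts: 'hao3' vs 'hao4' is a tone-only error,
# so it counts for PTER but not for PER.
per, pter = pinyin_error_rates("ni3 hao3 shi4 jie4", "ni3 hao4 shi4 jie4")
print(per, pter)
```

Comparing the PER and PTER panels of Figure 7 therefore indicates how much of a system's intelligibility loss is attributable to tone errors alone.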
[Figure 8 comprises boxplots, not reproduced here, of: similarity scores comparing to original speaker; mean opinion scores (naturalness); pinyin (without tone) error rate (PER); and pinyin (with tone) error rate (PTER); all for task MS1, all listeners.]

Figure 8: Results for task MS1
[Three pairwise significance matrices over systems A D L M R V W (A excluded for intelligibility) not reproduced here.]

Table 21: Significant differences in similarity to the original speaker (left table), naturalness (middle table) and intelligibility in terms of PTER (right table) for task MS1: results of pairwise Wilcoxon signed rank tests. A marked cell indicates a significant difference between a pair of systems.
[Figure 9 comprises boxplots, not reproduced here, of: similarity scores comparing to original speaker; mean opinion scores (naturalness); pinyin (without tone) error rate (PER); and pinyin (with tone) error rate (PTER); all for task MS2, all listeners.]

Figure 9: Results for task MS2
[Two pairwise significance matrices over systems A C D F L N R V W (A excluded for intelligibility) not reproduced here.]

Table 22: Significant differences in naturalness (left table) and intelligibility in terms of PTER (right table) for task MS2: results of pairwise Wilcoxon signed rank tests. A marked cell indicates a significant difference between a pair of systems.
Language    English total  Mandarin total
Amharic      1              0
Basque       1              0
Cantonese    1              0
Chinese     14              1
Czech        1              0
Danish       1              0
Dutch        3              0
Estonian     1              0
Finnish      4              0
French       5              0
German       9              0
Hebrew       2              0
Hindi        2              0
Hungarian    2              0
Japanese    35              0
Kannada      1              0
Korean       3              0
Mandarin     6              0
Norwegian    1              0
Polish       6              0
Portuguese   2              0
Russian      2              0
Slovak       1              0
Spanish     12              0
Swedish      2              0
Telugu       1              0
Turkish      1              0
Uighur       0              1
N/A          8              0

Table 23: First language of non-native speakers for English and Mandarin versions of Blizzard
Gender          Male  Female
English total   192   176
Mandarin total  160   144

Table 24: Gender
Age             under 20  20-29  30-39  40-49  50-59  60-69  70-79  over 80
English total   39        273    79     23     7      7      0      0
Mandarin total  64        226    27     7      1      0      1      0

Table 25: Age of listeners whose results were used (completed the evaluation fully or partially)
Native speaker  Yes  No
English         239  128
Mandarin        299    5

Table 26: Native speakers for English and Mandarin versions of Blizzard
      EH1  EH2  ES1  ES2   MH  MS1  MS2
ER     39   27   15   18    0    0    0
ES     58   41   21   22    0    0    0
EU     80   84   51   51    0    0    0
MC      0    0    0    0  117   86   86
ME      0    0    0    0   36   20   20
MR      0    0    0    0   15   12    8
MS      0    0    0    0   22   18   16
ALL   177  152   87   91  190   50   44

Table 27: Listener types per voice, showing the number of listeners whose responses were used in the results. Tasks ES1/ES2 and MS1/MS2 were bundled together, so most, but not all, of their respective listeners overlap.
              Registered  No response at all  Partial evaluation  Completed evaluation
ER                   125                  39                  38                    48
ES                   142                  19                  21                   102
EU                   215                   0                   0                   215
ALL ENGLISH          482                  58                  59                   365
MC                   204                   1                   0                   203
ME                    56                   0                   0                    56
MR                    31                   4                   7                    20
MS                    43                   4                   7                    32
ALL MANDARIN         334                   9                  14                   311

Table 28: Listener registration and evaluation completion rates. For listeners assigned to do the ES1/ES2 and MS1/MS2 tests, finishing one but not both of the tests was counted as partial completion.
Group  01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20
ER      3  3  3  3  3  3  3  1  3  1  1  1  2  3  2  2  1  0  1  0
ES      2  3  3  3  3  3  3  3  3  3  3  3  3  3  3  2  3  3  3  3
EU      4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4
ALL     9 10 10 10 10 10 10  8 10  8  8  8  9 10 10  8  8  7  8  7

Table 29: Listener groups EH1 01 to EH1 20 - Voice EH1 (English), showing the number of listeners whose responses were used in the results, i.e. those with partial or completed evaluations
Group  01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21
ER      1  1  1  0  1  1  1  1  1  1  2  1  2  1  2  1  2  2  2  2  1
ES      2  3  3  2  2  2  2  1  3  3  2  2  2  2  1  2  1  2  2  1  1
EU      4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4
ALL     7  8  8  6  7  7  7  6  8  8  8  7  8  7  7  7  7  8  8  7  6

Table 30: Listener groups EH2 01 to EH2 21 - Voice EH2 (English), showing the number of listeners whose responses were used in the results
Group  01 02 03 04 05 06 07 08
ER      2  3  0  2  2  2  2  2
ES      4  3  3  2  2  2  3  2
EU      7  7  7  6  6  6  6  6
ALL    13 13 10 10 10 10 11 10

Table 31: Listener groups ES1 01 to ES1 08 - Voice ES1 (English), showing the number of listeners whose responses were used in the results
Group  01 02 03 04 05 06 07 08 09 10 11 12 13
ER      1  2  2  0  1  2  1  1  2  2  0  2  2
ES      2  2  1  2  2  1  2  2  1  2  2  1  2
EU      4  4  4  4  4  4  4  4  4  4  4  4  3
ALL     7  8  7  6  7  7  7  7  7  8  6  7  7

Table 32: Listener groups ES2 01 to ES2 13 - Voice ES2 (English), showing the number of listeners whose responses were used in the results
Group  01 02 03 04 05 06 07 08 09 10 11 12
MC     10 10  9 10 10 10  9 10 10 10  9 10
ME      3  3  3  3  3  3  3  3  3  3  3  3
MR      2  2  1  2  2  1  1  0  1  1  1  1
MS      3  2  2  1  1  2  1  2  2  2  1  2
ALL    18 17 15 16 16 16 14 15 16 16 14 16

Table 33: Listener groups MH 01 to MH 12 - Voice MH (Mandarin), showing the number of listeners whose responses were used in the results

Group  01 02 03 04 05 06 07
MC     11 13 13 13 12 12 12
ME      3  3  3  3  3  3  2
MR      2  2  2  2  2  1  1
MS      3  3  3  3  2  2  2
ALL    19 21 21 21 19 18 17

Table 34: Listener groups MS1 01 to MS1 07 - Voice MS1 (Mandarin), showing the number of listeners whose responses were used in the results

Group  01 02 03 04 05 06 07 08 09
MC      9 10 10 10  9 10 10  9  9
ME      3  3  2  2  2  2  2  2  2
MR      1  1  1  1  0  1  1  1  1
MS      2  2  2  2  1  2  2  1  1
ALL    15 16 15 15 12 15 15 13 13

Table 35: Listener groups MS2 01 to MS2 09 - Voice MS2 (Mandarin), showing the number of listeners whose responses were used in the results
Listener type  ER  ES   EU   ALL ENGLISH
Total          51  102  215  368

Table 36: Listener type totals for submitted feedback (English)

Listener type  MC   ME  MR  MS  ALL MANDARIN
Total          201  44  18  33  296

Table 37: Listener type totals for submitted feedback (Mandarin)

Level           High School  Some College  Bachelor's Degree  Master's Degree  Doctorate
English total   48           65            94                 104              50
Mandarin total  6            6             204                64               32

Table 38: Highest level of education completed

CS/Engineering person?  Yes  No
English total           149  215
Mandarin total          89   214

Table 39: Computer science / engineering person

Work in speech technology?  Yes  No
English total               131  234
Mandarin total              61   240

Table 40: Work in the field of speech technology

Frequency       Daily  Weekly  Monthly  Yearly  Rarely  Never  Unsure
English total   58     54      44       74      81      26     30
Mandarin total  20     19      14       36      82      83     50

Table 41: How often listeners normally listened to speech synthesis before doing the evaluation
Dialect of English  Australian  Indian  UK   US  Other  N/A
Total               1           1       169  33  13     22

Table 42: Dialect of English of native speakers

Dialect of Mandarin  Beijing  Shanghai  Guangdong  Sichuan  Northeast  Other  N/A
Total                47       7         8          17       11         156    53

Table 43: Dialect of Mandarin of native speakers

Level           Elementary  Intermediate  Advanced  Bilingual  N/A
English total   15          49            52        11         1
Mandarin total  0           1             0         4          0

Table 44: Level of English/Mandarin of non-native speakers

Speaker type    Headphones  Computer speakers  Laptop speakers  Other
English total   346         11                 6                0
Mandarin total  263         36                 5                0

Table 45: Speaker type used to listen to the speech samples
Same environment?  Yes  No
English total      359  4
Mandarin total     294  7

Table 46: Same environment for all samples?

Environment     Quiet all the time  Quiet most of the time  Equally quiet and noisy  Noisy most of the time  Noisy all the time
English total   281                 71                      13                       0                       0
Mandarin total  141                 111                     43                      7                       1

Table 47: Kind of environment when listening to the speech samples

Number of sessions  1    2-3  4 or more
English total       267  71   0
Mandarin total      208  75   0

Table 48: Number of separate listening sessions to complete all the sections

Browser         Firefox  IE  Mozilla  Netscape  Opera  Safari  Other
English total   61       78  1        5         0      207     0
Mandarin total  233      15  0        1         0      40      0

Table 49: Web browser used
Similarity with reference samples  Easy  Difficult
English total                      266   100
Mandarin total                     223   74

Table 50: Listeners' impression of their task in section(s) about similarity with original voice.

Problem: (a) scale too big, too small, or confusing; (b) bad speakers, playing files disturbed others, connection too slow, etc.; (c) other

                (a)  (b)  (c)
English total   43   4    49
Mandarin total  53   11   12

Table 51: Listeners' problems in section(s) about similarity with original voice.

Number of times  1-2  3-5  6 or more
English total    327  37   2
Mandarin total   177  116  1

Table 52: Number of times listened to each example in section(s) about similarity with original voice.

MDS section     Easy  Difficult
English total   269   91
Mandarin total  237   54

Table 53: Listeners' impression of their task in the section about similarity of voice between two samples.

Problem: (a) unfamiliar task; (b) instructions not clear; (c) bad speakers, playing files disturbed others, connection too slow, etc.; (d) other

                (a)  (b)  (c)  (d)
English total   33   8    1    41
Mandarin total  26   16   6    3

Table 54: Listeners' problems in the section about similarity of voice between two samples.

Number of times  1-2  3-5  6 or more
English total    323  32   1
Mandarin total   193  95   0

Table 55: How many times listened to each example in the section about similarity of voice between two samples.
MOS naturalness sections  Easy  Difficult
English total             341   142
Mandarin total            275   92

Table 56: Listeners' impression of their task in MOS naturalness sections

Problem: (a) all sounded the same and/or too hard to understand; (b) 1 to 5 scale too big, too small, or confusing; (c) bad speakers, playing files disturbed others, connection too slow, etc.; (d) other

                (a)  (b)  (c)  (d)
English total   12   66   4    70
Mandarin total  22   63   12   13

Table 57: Listeners' problems in MOS naturalness sections

Number of times  1-2  3-5  6 or more
English total    355  53   3
Mandarin total   234  145  4

Table 58: How many times listened to each example in MOS naturalness sections?

SUS section(s): (a) usually understood all the words; (b) usually understood most of the words; (c) very hard to understand the words; (d) typing problems: words too hard to spell, or too fast to type

                (a)  (b)   (c)   (d)
English total   30   203   112   19
Mandarin total  31   196   57    10

Table 59: Listeners' impressions of the task in SUS section(s)

Number of times  1-2  3-5  6 or more
English total    357  4    1
Mandarin total   81   202  11

Table 60: How many times listened to each example in SUS section(s)
MOS appropriateness sections  Easy  Difficult
English total                 149   161

Table 61: Listeners' impression of their task in MOS appropriateness sections

Problem: (a) all sounded the same and/or too hard to understand; (b) 1 to 5 scale too big, too small, or confusing; (c) bad speakers, playing files disturbed others, connection too slow, etc.; (d) other

                (a)  (b)  (c)  (d)
English total   24   54   1    80

Table 62: Listeners' problems in MOS appropriateness sections

Number of times  1-2  3-5  6 or more
English total    292  19   1

Table 63: How many times listened to each example in MOS appropriateness sections?