
IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 20, NO. 4, MAY 2012 1109

A Generative Data Augmentation Model for Enhancing Chinese Dialect Pronunciation Prediction

Chu-Cheng Lin and Richard Tzong-Han Tsai

Abstract—Most spoken Chinese dialects lack comprehensive digital pronunciation databases, which are crucial for speech processing tasks. Given complete pronunciation databases for related dialects, one can use supervised learning techniques to predict a Chinese character's pronunciation in a target dialect based on the character's features and its pronunciation in other related dialects. Unfortunately, Chinese dialect pronunciation databases are far from complete. We propose a novel generative model that uses both existing dialect pronunciation data and medieval rime books to discover patterns that exist across multiple dialects. The proposed model can augment missing dialectal pronunciations based on existing dialect pronunciation tables (even if incomplete) and the pronunciation data in rime books. The augmented pronunciation database can then be used in supervised learning settings. We evaluate prediction accuracy in terms of phonological features such as tone, initial phoneme, and final phoneme. For each character, the features are also evaluated as a whole, termed overall pronunciation feature accuracy (OPFA). Our first experiment shows that adding features from dialectal pronunciation data to our baseline rime-book model dramatically improves OPFA under the support vector machine (SVM) model. In the second experiment, we compare the performance of the SVM model using phonological features from closely related dialects with that of the model using phonological features from non-closely related dialects; the results show that using features from closely related dialects yields higher accuracy. In the third experiment, we show that using our proposed data augmentation model to fill in missing data can increase the SVM model's OPFA by up to 7.6%.

Index Terms—Chinese dialects, data augmentation, generative model, pronunciation database.

I. INTRODUCTION

CHARACTER pronunciation databases are key resources in speech processing tasks such as speech recognition and synthesis. For official written languages, such databases are rich. For example, English has the CMU pronouncing dictionary [1], while Mandarin has the Unihan database [2]. For spoken languages, however, digitized pronunciation resources are not so plentiful. In China this is particularly relevant: a 2004 survey of Chinese dialects revealed that more than 86% of the Chinese population can converse in a non-Mandarin dialect, while only 53% can converse in Mandarin [3]. However, there is a serious lack of such databases for non-Mandarin dialects. This situation impedes the development of speech processing technologies and applications for resource-poor dialects. Since compiling such resources is labor-intensive, our goal is to develop a tool to help automate the prediction of character pronunciations for different Chinese dialects.

Currently, most dialect pronunciation databases/dictionaries have been constructed by individual researchers and vary greatly in terms of completeness. If we had complete pronunciation databases for related dialects, we could use standard supervised learning techniques to predict a character's pronunciation in a target dialect. As mentioned above, however, pronunciation databases for most Chinese dialects are far from complete. Therefore, we propose a novel generative model that uses both existing dialect pronunciation data and medieval rime books to discover patterns that exist across multiple dialects. Unlike previous work, this model does not assume that language evolves like a branching tree, but only that character pronunciations across related dialects show regular patterns. The proposed model can augment character pronunciations for a dialect based on existing dialect pronunciation tables (even if incomplete) and the pronunciation data in medieval rime books. After augmentation, a standard classifier-based pronunciation prediction system can be constructed.

Manuscript received October 31, 2010; revised March 14, 2011; accepted July 11, 2011. Date of publication October 17, 2011; date of current version February 10, 2012. This work was supported in part by the National Science Council under Grants NSC 98-2221-E-155-060-MY3 and NSC 99-2628-E-155-004. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Gokhan Tur. C.-C. Lin is with the Department of Computer Science and Information Engineering, National Taiwan University, Taipei 10617, Taiwan (e-mail: [email protected]). R. T.-H. Tsai is with the Department of Computer Science and Engineering, Yuan Ze University, Zhongli 320, Taiwan (e-mail: [email protected]). Digital Object Identifier 10.1109/TASL.2011.2172424

II. BACKGROUND OF CHINESE DIALECTS

A. Mutual Intelligibility

It is widely recognized that Chinese dialects are to a great extent mutually unintelligible. All the southern Chinese dialects have mean sentence intelligibility lower than 30% for nonnative speakers [4]. In comparison, Portuguese and Spanish have mutual intelligibility of roughly 60% [5].

Although the mutual intelligibility among Chinese dialects is very low, character pronunciations across dialects show regular correspondence. For example, the pronunciations of "肝" (gan/liver) and "寒" (han/frigid) sound utterly different in Southern Min and Mandarin; but within the dialects themselves, the rhyming is consistent.

B. Rime Books

Other than areal influence, the striking correspondence is largely attributed to historical reasons [6], which can be seen in medieval rime books. Earlier rime books, such as "切韻 (Qieyun)" (601 AD), record contemporary character pronunciations with fanqie "反切" analyses. Fanqie represents a character's pronunciation with two other characters, combining

1558-7916/$31.00 © 2011 IEEE


TABLE I
SYMBOLS USED IN SECTION IV

the former's onset and the latter's rhyme and tone. An English equivalent would be to combine the onset of "peek" /piːk/ and the rhyme of "cat" /kæt/ to get "pat" /pæt/.

Obviously, there may be multiple combinations of characters that represent a single pronunciation in the fanqie system. In contrast, later rime books such as "韻鏡 (Yunjing)" (900–950 AD) did finer phonological analysis, using fixed sets of characters to represent phonological qualities of contemporary pronunciation [6]. A character pronunciation under the new system has six features, each taking its value from a fixed set of Chinese characters. The six features are 聲母 (initials), 韻 (rhymes/finals), 攝 (rhyme groups), 聲調 (tones), 呼 (openness), and 等 (grades). For example, the character 含 has 匣 (xia) as its 聲母, 咸 as its 攝, and so on. These features cannot be directly employed to reconstruct Middle Chinese pronunciations, as the meanings of some features are still disputed. Nevertheless, modern dialects still bear the correspondence, and thus rime book features can be used to infer phonological correspondence in modern dialects between characters sharing the same rime book feature. For example, the two characters "含" (han) and "站" (zhan) are described with the same rhyme group character "咸" (xian), and they still rhyme in Mandarin, Cantonese, and Amoy, although the pronunciations do not rhyme across dialects. Thus, rime books are very valuable resources for determining a character's pronunciation.
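Mechanically, a fanqie spelling is just onset-plus-rhyme concatenation; a minimal sketch, reusing the "peek"/"cat" example above (the syllable splits and the toy lexicon are illustrative, not from the paper):

```python
def fanqie(onset_char, rhyme_char, syllables):
    """Combine the onset of the first character with the rhyme
    (and tone) of the second, as in a fanqie spelling."""
    onset, _ = syllables[onset_char]   # keep only the onset
    _, rhyme = syllables[rhyme_char]   # keep the rhyme (and tone)
    return onset + rhyme

# Toy lexicon mapping a word to its (onset, rhyme) split.
# English stand-ins mirror the "peek" + "cat" -> "pat" example.
syllables = {
    "peek": ("p", "iːk"),
    "cat":  ("k", "æt"),
}

print(fanqie("peek", "cat", syllables))  # -> pæt
```

Because many characters share an onset or a rhyme, many distinct character pairs spell the same syllable, which is the ambiguity the later, fixed-category rime books removed.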

III. RELATED WORK

There are many modern dictionaries that use phonetic alphabets to denote pronunciation for specific dialects, such as 粤音韻彙 (A Chinese Syllabary Pronounced According to the Dialect of Canton). It was not until 1962 that the first comprehensive cross-dialectal lexicon, 漢語方音字彙 (Hanyu Fangyin Zihui, hereafter Zihui), was published. The original Zihui consists of approximately 2500 character readings with IPA notation from 17 modern Chinese dialects. In addition, the categorical descriptive features from the Middle Chinese rime book 韻鏡 (Yunjing) are also provided. Soon after its publication, Zihui was digitized under Project DOC (Dictionary on Computer) [7]. The Zihui lexicon is invaluable to the study of diachronic phonology. However, many dialects are still unrecorded. Another problem is that Zihui contains only about 2500 characters, far from the total number of Chinese characters (more than 50,000). These two flaws render the Zihui lexicon unsatisfactory as a dialect dictionary. Our work therefore proposes to augment readings for unseen characters and dialects based on the dialects and character readings recorded in the Zihui lexicon.

Augmenting missing data with known information is not a new idea, as practiced by [8] and [9]. Data augmentation is generally done by introducing latent variables to model the training data [10]. In our problem, we need to model dialectal pronunciation data. A model of pronunciations has been proposed for the Romance languages by [11], which allows generation of word forms of both reconstructed and modern languages. A phylogenetic tree of Classical Latin, Vulgar Latin, Spanish, and Italian was built to model the evolutionary relationship among these languages. In this tree, Classical Latin is the root, Vulgar Latin is its child, and Spanish and Italian are Vulgar Latin's descendants. In their approach, the pronunciation in the root language must be given.


However, for Chinese dialects, the applicability of the tree model is disputed. [12] suggested that it may be more appropriate to model the development of Chinese dialects with a network. Even if we place the Chinese dialects into a tree structure following Bouchard-Côté et al.'s model and set Middle Chinese, which influenced the largest number of Chinese dialects, as the root language, we still encounter the following problem. Classical Latin's phonology has been well established [13]; therefore, its actual pronunciation can be easily deduced from the spelling. Unlike Classical Latin, the phonology and character pronunciations of Middle Chinese are still not wholly clear. For example, we know virtually nothing about the actual tones. Current reconstructions depend heavily upon medieval rime books, which are known to be a combination of at least two Middle Chinese dialects [14]. To derive a proper phylogenetic tree, one must first distinguish between the Middle Chinese dialects (at least two, according to Ting) and then correctly assign their respective offspring languages. However, current studies show that certain Wu dialects have at least two substrata, one from northern Middle Chinese and the other from the southern one [15]. This directly violates the tree assumption. For a language L, without the actual pronunciations in L's ancestral language, we cannot use Bouchard-Côté et al.'s model to predict a character's pronunciation in L.

Some research tries to use the resources of other languages to deal with resource-poor languages. [16] shows that adding unannotated text in more languages can improve unsupervised POS tagging performance. [17] uses multilingual acoustic data to improve recognition performance on a newly seen language, sharing articulatory feature data among languages. These works assume that the linguistic data used during training has patterns which carry over to the newly seen language, whereas our work only assumes that Chinese dialects have consistent phonological correspondence with Middle Chinese and among themselves.

IV. METHODOLOGY

A. Problem Definition

Our task is to augment the pronunciation database of Chinese dialects. For each record, the given pronunciation database lists all existing pronunciations in the 21 dialects from all major dialect groups; that is, some records may be incomplete. Our augmentation model utilizes not only the existing pronunciations, represented by phonemes (which we will refer to as phonological features), but also rime book features.

More formally, let $c_i$ be the character in the $i$-th record. Let its categorical rime book features be $r_i$. For example, the rime book features of the character 含 (han) can be encoded as [匣 (xia), 覃 (tan), 咸 (xian), 平 (ping), 開 (kai), 一 (yi)]. The multi-class vector $r_i$ is then converted to a binary vector $b_i$ by concatenating each "flattened" component of $r_i$. For example, a component with three possible values is "flattened" to a binary vector of dimension 3. Since there are six components in $r_i$ for the rime book features, $b_i$ is a binary vector whose length is the total number of possible rime book feature values. Let there be $L$ modern dialects $d_1, \ldots, d_L$. Each dialect has a fixed number of phonological features.


TABLE II
ENCODED PHONOLOGICAL FEATURES OF THE DOC DATASET

Fig. 1. Scheme of the input data. There are $M$ characters, each of which has its binary rime book feature vector known. Some of the phonological features may be missing. Our goal is to fill in the missing values; the output is a complete table.

Take the character 含 as an example: its rime book feature vector is [匣 (xia), 覃 (tan), 咸 (xian), 平 (ping), 開 (kai), 一 (yi)]. Its phonological features (see Table II) would be ["12", "43", /h/, –, /a/, –, false, /m/] for the Xiamen dialect. The problem can be stated as follows: suppose there are $F$ total phonological features over all dialects. Given $M$ binary rime book feature vectors $b_1, \ldots, b_M$ and a partially filled phonological feature table of dimension $M \times F$ for characters $c_1, \ldots, c_M$, our goal is equivalent to filling that table out. Fig. 1 depicts the scheme of the input under the problem definition. Definitions of the symbols introduced in this section can be found in Table I.
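The record layout and the "flattening" of one categorical component can be sketched as follows (the characters, feature names, and values are illustrative, not the actual DOC schema):

```python
# A toy slice of the problem input: each record pairs a character's
# rime book feature vector with a partially filled row of dialectal
# phonological features (None marks a missing field to be augmented).
record = {
    "character": "含",
    # six categorical rime book features, flattened later to a 0/1 vector
    "rime_book": ["匣", "覃", "咸", "平", "開", "一"],
    # per-dialect phonological features; None = missing
    "phonological": {
        "Xiamen":   {"initial": "h", "nucleus": "a", "coda": "m"},
        "Chaozhou": {"initial": None, "nucleus": None, "coda": None},
    },
}

def flatten(value, categories):
    """One-hot encode one categorical component ('flattening')."""
    return [1 if c == value else 0 for c in categories]

# e.g. the 呼 (openness) component with two possible values:
print(flatten("開", ["開", "合"]))  # -> [1, 0]
```

Concatenating the flattened vectors of all six components yields the binary vector $b_i$ used throughout Section IV.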

B. Model Considerations

As described in Section II, nearly every Chinese dialect's phonology is highly correlated with both the categorical features described in 廣韻 (Guangyun) and 韻鏡 (Yunjing) and with other Chinese dialects' phonological features. For example, there is a clear correspondence among the rime book feature 深攝 (shen-she), the Cantonese rhyme /am/, and the Xiamen rhyme /im/. While the rime book alone offers much insight into many dialects' phonology, some characters listed under different rime-book rhymes nevertheless show clear correspondence among dialects. To augment missing phonological features, all of the above phenomena should be taken into consideration.

We propose a model that simultaneously captures phonological similarities across dialects and rime book features, using latent variables which we call superlingual rhymes (SLRs). Our model splits each character's record into two parts: the first part contains its rime book features, while the second consists of its phonological features. Our task is to augment missing values in the second part. We know that rime book features are highly correlated with phonological features in every Chinese dialect.


Fig. 2. Plate diagram of our proposed generative model. Shaded nodes are observed data.

Therefore, we employ rime book features to estimate missing phonological features. In addition, our model also employs the other dialects' phonological features. Our basic idea is to introduce superlingual rhymes as an intermediate layer between rime book features and all dialects' phonological features. The pronunciation of each character can then be represented as a mixture of all superlingual rhymes; that is, for each superlingual rhyme, the character has a proportional value. Since the phonological features are all categorical data, they are naturally modeled with multinomial distributions. As in every Bayesian model, we impose priors on these multinomials. Following many previous works such as [18] and [19], we chose the Dirichlet distribution, which allows analytic expression of the posterior probability. The proportional values of a character follow a Dirichlet distribution whose parameters are determined by log-linear functions of the character's rime book features; this approach is also known as logistic regression. Because of the conjugacy between the Dirichlet and multinomial distributions, we can easily obtain the posterior distribution of a character over SLRs [20], [21]. Mixing a generative model with logistic regression is akin to the paradigm advocated by [22]. Similarly, using the multinomial-Dirichlet conjugacy, we can estimate the distribution of a superlingual rhyme over phonological features. Then, because each character's proportion of each superlingual rhyme and each superlingual rhyme's proportion of each phonological feature are known, missing phonological features can be augmented.

C. Model Description

A plate diagram for our proposed model is shown in Fig. 2. Let an observation $o$ be a tuple of two components: $o = (f, d)$, where $f$ is an observed phonological feature value and $d$ is the dialect of $f$. For every observation $o_{ij}$ (the $j$-th observation of character $c_i$), there is a latent SLR $z_{ij}$; and the character is a mixture of SLRs. To simplify the explanation, we assume every dialect has only one phonological feature, namely $f$. In the real model, each observation has multiple phonological features for dialect $d$, but the model's structure is roughly the same.

We describe the model as follows. Let there be $K$ SLRs: $s_1, \ldots, s_K$. Each $s_k$ has multinomial distributions $\phi_{k,d}$ over phonological feature values, and a multinomial distribution $\psi_k$ over the dialects $d_1, \ldots, d_L$. The $\phi$'s and $\psi$'s are given uniform Dirichlet priors Dirichlet($\beta$) and Dirichlet($\alpha$). In our experiments, each component of both $\alpha$ and $\beta$ is set to 0.001, making the priors rather sparse.

Recall that the binary rime book feature vector of character $c_i$ is $b_i$. Let there be $K$ rime book feature weight vectors $\lambda_1, \ldots, \lambda_K$, each of which has the same dimension as $b_i$. We then define the prior over all SLRs in character $c_i$ to be a multinomial distribution $\theta_i$ with prior Dirichlet($\gamma_{i1}, \ldots, \gamma_{iK}$), where $\gamma_{ik} = \exp(b_i \cdot \lambda_k)$. In other words, the prior probability of SLR $s_k$ is proportional to $\exp(b_i \cdot \lambda_k)$, a log-linear function. $\lambda$ is treated as a given value in the generating part; indeed it is given a Normal prior, but we do not change its value through MCMC steps. Rather, its value is obtained by maximizing the likelihood of the generative model. We go into details in Section IV-D.

We now describe the generating process, depicted in the plate diagram of Fig. 2.

1) For each SLR $s_k$:
   a) draw $\lambda_k \sim \mathrm{Normal}(0, \sigma^2 I)$;
   b) draw $\psi_k \sim \mathrm{Dirichlet}(\alpha)$;
   c) for each dialect $d$, draw $\phi_{k,d} \sim \mathrm{Dirichlet}(\beta)$.

2) For each character $c_i$ and its binary rime book feature vector $b_i$:
   a) for each SLR $s_k$, compute $\gamma_{ik} = \exp(b_i \cdot \lambda_k)$;
   b) draw $\theta_i \sim \mathrm{Dirichlet}(\gamma_{i1}, \ldots, \gamma_{iK})$;
   c) for each observation $o_{ij} = (f_{ij}, d_{ij})$:
      • draw $z_{ij} \sim \mathrm{Multinomial}(\theta_i)$;
      • draw $d_{ij} \sim \mathrm{Multinomial}(\psi_{z_{ij}})$;
      • draw $f_{ij} \sim \mathrm{Multinomial}(\phi_{z_{ij}, d_{ij}})$.
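The generative story can be sketched as a toy sampler (all dimensions and names are illustrative, not the authors' implementation; α and β are set larger than the paper's 0.001 here purely to keep the toy draws numerically well-behaved):

```python
import numpy as np

rng = np.random.default_rng(0)

K, L, V, B = 4, 3, 5, 6   # SLRs, dialects, feature values, rime-book dims
alpha, beta = 0.1, 0.1    # symmetric Dirichlet hyperparameters (toy values)

# 1) Per-SLR parameters.
lam = rng.normal(0.0, 1.0, size=(K, B))             # log-linear weights
psi = rng.dirichlet(np.full(L, alpha), size=K)      # SLR -> dialect
phi = rng.dirichlet(np.full(V, beta), size=(K, L))  # SLR, dialect -> value

def generate_character(b, n_obs):
    """Generate n_obs (feature value, dialect) observations for one
    character with binary rime-book vector b."""
    gamma = np.exp(lam @ b)        # Dirichlet parameters, one per SLR
    theta = rng.dirichlet(gamma)   # the character's SLR mixture
    obs = []
    for _ in range(n_obs):
        z = rng.choice(K, p=theta)      # latent SLR
        d = rng.choice(L, p=psi[z])     # dialect
        f = rng.choice(V, p=phi[z, d])  # phonological feature value
        obs.append((f, d))
    return obs

b = rng.integers(0, 2, size=B).astype(float)
print(generate_character(b, 5))
```

The key design choice visible here is that the character-level mixture `theta` is not free: its Dirichlet parameters are a log-linear function of the rime book vector, so characters with the same rime book description share a prior over SLRs.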

D. Inference

The full joint distribution, expressed as a product of distributions, is

$p(z, f, d, \theta, \phi, \psi \mid \lambda, \alpha, \beta) = \prod_{k=1}^{K} p(\psi_k \mid \alpha) \prod_{k=1}^{K}\prod_{d=1}^{L} p(\phi_{k,d} \mid \beta) \prod_{i=1}^{M} p(\theta_i \mid \gamma_i) \prod_{i,j} p(z_{ij} \mid \theta_i)\, p(d_{ij} \mid \psi_{z_{ij}})\, p(f_{ij} \mid \phi_{z_{ij}, d_{ij}})$  (1)

and the marginal probability of the observations and SLR assignments is

$p(z, f, d \mid \lambda, \alpha, \beta) = \int_{\theta} \int_{\phi} \int_{\psi} p(z, f, d, \theta, \phi, \psi \mid \lambda, \alpha, \beta)\, d\psi\, d\phi\, d\theta.$  (2)


This equation can be rearranged and simplified by grouping like factors. The variables $\theta$, $\phi$, and $\psi$ can be integrated out using the identity

$\int \prod_{k} x_k^{n_k + \gamma_k - 1}\, dx = \frac{\prod_k \Gamma(n_k + \gamma_k)}{\Gamma\!\left(\sum_k (n_k + \gamma_k)\right)},$

where $x$ ranges over the probability simplex. More details are available in Appendix A. We then have

$p(z, f, d \mid \lambda, \alpha, \beta) = \prod_{i=1}^{M} \frac{\Gamma\!\left(\sum_k \gamma_{ik}\right)}{\Gamma\!\left(\sum_k \gamma_{ik} + n_{i}\right)} \prod_{k=1}^{K} \frac{\Gamma(\gamma_{ik} + n_{ik})}{\Gamma(\gamma_{ik})} \cdot \prod_{k=1}^{K} \frac{\Gamma\!\left(\sum_d \alpha_d\right)}{\Gamma\!\left(\sum_d \alpha_d + n_{k}\right)} \prod_{d} \frac{\Gamma(\alpha_d + n_{kd})}{\Gamma(\alpha_d)} \cdot \prod_{k,d} \frac{\Gamma\!\left(\sum_v \beta_v\right)}{\Gamma\!\left(\sum_v \beta_v + n_{kd}\right)} \prod_{v} \frac{\Gamma(\beta_v + n_{kdv})}{\Gamma(\beta_v)},$  (3)

where $n_{kd}$ is the number of observations that have dialect $d$ with SLR $s_k$, $n_{kdv}$ is the number of observations that have phonological feature value $v$ with SLR $s_k$ and dialect $d$, and $n_{ik}$ is the number of observations with SLR $s_k$ in character $c_i$; $n_i$ and $n_k$ denote the corresponding totals over a character and over an SLR, respectively.

In (3) we have four groups of variables, $z$, $f$, $d$, and $\lambda$, and we cannot sample from $p(z, \lambda \mid f, d)$ directly. However, it can be shown that there exists an efficient Gibbs sampler to infer $z$; we subsequently use optimization methods to compute $\lambda$.

1) The Gibbs Sampler: Gibbs sampling is an MCMC technique for sampling from a complex, multivariate distribution. It can be applied when, given variables $x_1, \ldots, x_n$, sampling from $p(x_1, \ldots, x_n)$ directly is impossible, but sampling from the conditional distributions $p(x_m \mid x_{-m})$ is feasible. Below is the generic Gibbs sampler:

1) randomly assign values to $x_1, \ldots, x_n$;
2) for $t = 1$ to an arbitrarily assigned $T$:
   • for $m = 1$ to $n$:
     a) re-sample a new value of $x_m \sim p(x_m \mid x_{-m})$.

The variable $x_{-m}$ denotes all of $x_1, \ldots, x_n$ except $x_m$. If $T$ is sufficiently large, the resultant values can be regarded as a sample from $p(x_1, \ldots, x_n)$.

Since the training data already provides us with $f$ and $d$, we do not resample them. Neither do we resample $\lambda$; instead we use optimization methods to find the most probable $\lambda$. Now we only need to collect samples from $p(z \mid f, d, \lambda)$.

Since $z$ is a vector of variables consisting of all observed values' (unobserved) SLRs, $p(z \mid f, d, \lambda)$ is actually multivariate; we use the Gibbs sampling technique here, and obtain samples of $z$ by alternately sampling from $p(z_{ij} \mid z_{-ij}, f, d, \lambda)$, which can be expressed as

$p(z_{ij} = k \mid z_{-ij}, f, d, \lambda) \propto \frac{p(z, f, d \mid \lambda, \alpha, \beta)}{p(z_{-ij}, f_{-ij}, d_{-ij} \mid \lambda, \alpha, \beta)},$  (4)

where $z_{-ij}$ denotes all SLR assignments except $z_{ij}$. After reorganization (the details are in the appendix), we have

$p(z_{ij} = k \mid z_{-ij}, f, d, \lambda) \propto \left(\gamma_{ik} + n_{ik}^{-ij}\right) \cdot \frac{\alpha_{d_{ij}} + n_{k d_{ij}}^{-ij}}{\sum_d \alpha_d + n_{k}^{-ij}} \cdot \frac{\beta_{f_{ij}} + n_{k d_{ij} f_{ij}}^{-ij}}{\sum_v \beta_v + n_{k d_{ij}}^{-ij}},$  (5)

where the superscript $-ij$ denotes counts computed with observation $o_{ij}$ excluded: a count $n^{-ij} = n - 1$ if the current assignment of $z_{ij}$ contributes to it, and $n^{-ij} = n$ otherwise.

Now we describe the Gibbs sampler for $z$:

1) randomly assign values to $z$;
2) for $t = 1$ to an arbitrarily assigned $T$:
   • for each observation $o_{ij}$:
     a) re-sample a new value of $z_{ij}$ using (5).

2) Computing $\lambda$: Unlike $z$, we do not use MCMC techniques to find $\lambda$, because it is difficult to derive a Gibbs sampler for $\lambda$. On the other hand, for our purpose a MAP estimate of $\lambda$ suffices. We use L-BFGS to solve this numeric optimization problem.

L-BFGS requires the loss function $\mathcal{L}(\lambda)$ and its gradient $\nabla\mathcal{L}(\lambda)$ for minimization [23]. First, from (3) we can derive the loss function, which is the negative log-likelihood as a function of $\lambda$:

$\mathcal{L}(\lambda) = -\sum_{i=1}^{M} \left[ \log\Gamma\!\left(\sum_{k} \gamma_{ik}\right) - \log\Gamma\!\left(\sum_{k} \gamma_{ik} + n_i\right) + \sum_{k=1}^{K} \left( \log\Gamma(\gamma_{ik} + n_{ik}) - \log\Gamma(\gamma_{ik}) \right) \right] + C,$  (6)

where $C$ is a constant; and recall that for character $c_i$, $\gamma_{ik} = \exp(b_i \cdot \lambda_k)$.

Likewise, we derive the gradient with respect to each weight vector $\lambda_k$:

$\frac{\partial \mathcal{L}}{\partial \lambda_k} = -\sum_{i=1}^{M} b_i\, \gamma_{ik} \left[ \Psi\!\left(\sum_{k'} \gamma_{ik'}\right) - \Psi\!\left(\sum_{k'} \gamma_{ik'} + n_i\right) + \Psi(\gamma_{ik} + n_{ik}) - \Psi(\gamma_{ik}) \right],$

where $\Psi$ is the digamma function. As previously stated, we can minimize $\mathcal{L}(\lambda)$ if we can compute both $\mathcal{L}$ and $\nabla\mathcal{L}$; and minimizing $\mathcal{L}(\lambda)$ in turn maximizes the likelihood $p(z, f, d \mid \lambda, \alpha, \beta)$.
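One collapsed Gibbs update of a single observation's SLR assignment, written in the standard form for Dirichlet-multinomial models of this shape, can be sketched as follows (the array names and count layout are assumptions, not the authors' code; symmetric α and β are assumed):

```python
import numpy as np

def gibbs_update(i, j, z, f, d, gamma, n_ik, n_kd, n_kdf, alpha, beta, rng):
    """Re-sample the SLR assignment z[i][j] of one observation from its
    collapsed conditional, as in (5). n_ik, n_kd, n_kdf are the count
    arrays defined in the text (character x SLR, SLR x dialect,
    SLR x dialect x feature value)."""
    K = gamma.shape[1]
    L = n_kd.shape[1]
    V = n_kdf.shape[2]
    old = z[i][j]
    fv, dv = f[i][j], d[i][j]

    # Remove the observation's current assignment from all counts.
    n_ik[i, old] -= 1
    n_kd[old, dv] -= 1
    n_kdf[old, dv, fv] -= 1

    # Unnormalized conditional probability for each candidate SLR k.
    p = (gamma[i] + n_ik[i]) \
        * (alpha + n_kd[:, dv]) / (L * alpha + n_kd.sum(axis=1)) \
        * (beta + n_kdf[:, dv, fv]) / (V * beta + n_kd[:, dv])
    new = rng.choice(K, p=p / p.sum())

    # Add the new assignment back into the counts.
    n_ik[i, new] += 1
    n_kd[new, dv] += 1
    n_kdf[new, dv, fv] += 1
    z[i][j] = new
    return new
```

Because the counts are updated in place, sweeping this function over all observations realizes step 2 of the sampler above.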


E. Inference Procedure

In Section IV-D we described a Gibbs sampler that samples from the posterior $p(z \mid f, d, \lambda)$, and in Section IV-D2 we derived $\mathcal{L}(\lambda)$ and $\nabla\mathcal{L}(\lambda)$, which enable us to maximize the likelihood with respect to $\lambda$. We use an EM-like algorithm for inference [24]: in alternating steps, we sample $z$ and maximize over $\lambda$, repeatedly. The posterior feature-value distribution can be sampled once the latent SLRs are fixed. To augment a missing phonological feature, we output the mode of its samples over several iterations.

V. DATA AND EVALUATION METRICS

A. Data

The experiments are conducted on the DOC dataset described in Section III. In this dataset, each record corresponds to one pronunciation of a Chinese character. For example, the polyphone "正", with two Mandarin pronunciations (zheng1 and zheng4), has two corresponding records. The number of pronunciations for a character is determined by Guangyun. For each record, the DOC dataset lists all existing pronunciations in 21 dialects from all major dialect groups. In the original DOC, pronunciations are transcribed in IPA notation. [25] represented these IPA transcriptions with eight phonological features, listed in Table II. Given that there are 21 dialects and eight features, each record contains a total of 168 phonological features. Some records are incomplete because certain phonological features do not exist in some dialects. After disambiguation of polyphone characters, we have 5403 records.

B. Evaluation Metrics

Individual pronunciation feature accuracy (IPFA) is measured as the number of correctly predicted phonological features over the total number of phonological features in the test set. Overall pronunciation feature accuracy (OPFA) is measured as the number of correctly predicted records (records in which every phonological feature is predicted correctly) over the number of records in the test set.
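The two metrics can be computed directly from per-record predictions; a sketch (the record and feature layout is illustrative):

```python
def ipfa(predictions, gold):
    """Fraction of individual phonological features predicted correctly."""
    correct = sum(p == g for pred, true in zip(predictions, gold)
                  for p, g in zip(pred, true))
    total = sum(len(true) for true in gold)
    return correct / total

def opfa(predictions, gold):
    """Fraction of records whose features are *all* predicted correctly."""
    return sum(pred == true for pred, true in zip(predictions, gold)) / len(gold)

gold = [["h", "a", "m"], ["ts", "a", "m"]]
pred = [["h", "a", "m"], ["ts", "a", "n"]]
print(ipfa(pred, gold))  # -> 0.8333333333333334 (5 of 6 features)
print(opfa(pred, gold))  # -> 0.5 (1 of 2 records fully correct)
```

OPFA is the stricter of the two: a record with a single wrong feature counts as fully wrong, which is why it is the headline metric for whole-pronunciation prediction.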

C. Evaluation Scheme

To evaluate prediction accuracy in a given dialect $t$, all phonological features of dialect $t$ are regarded as ground truth labels. Some phonological features of dialects other than $t$ may be missing, and they are all filled in using either our proposed model or a baseline classifier, depending on which augmentation method is used in that configuration.

Since one of our foci is augmentation (see Section VI-C), in the augmentation experiments we randomly remove phonological features from all dialects except $t$. The detailed procedure is as follows: first we create two subsets of the main dataset with 10% or 20% of fields (phonological features) missing, respectively. The missing fields are then augmented as previously described. Note that phonological features of dialect $t$ are not used for prediction of other phonological features. After the missing pronunciations are augmented, no records have empty fields.

To conduct the statistical significance test, we perform the following procedure 30 times. We randomly split the records 2:1 into training (67%) and test (33%) data. Since each record is associated with multiple labels, we employ multiclass SVMs

TABLE III
PREDICTION ACCURACY WITH/WITHOUT DIALECTAL DATA

to learn the labels independently. The features fed to SVMclassifiers are the binary rime book feature vectors ( s) andphonological features of all dialects except dialect . The cor-responding labels are phonological features of dialect . Andthe output from these classifiers are predicted labels, which arephonological features of dialect .

D. t-Test

We apply two-sample t-tests to examine whether one configuration is significantly better than the other. Two-sample t-tests are applicable since we assume the samples are independent. As the number of samples is large and the samples' standard deviations are known, the following two-sample t-statistic is appropriate in this case:

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_1^2/n + s_2^2/n}}$$

where $\bar{x}$ is mean accuracy, $s^2$ is the variance of accuracy, and $n$ is the sample number (in our experiments, $n = 30$). If the resulting t score is less than or equal to 1.67, with 29 degrees of freedom and a statistical significance level of 95%, the null hypothesis is accepted; otherwise it is rejected.
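A minimal sketch of this test, assuming the standard two-sample statistic t = (x̄₁ − x̄₂)/√(s₁²/n + s₂²/n) with equal sample sizes; the helper names are illustrative, and the one-sided critical value 1.67 is the one given in the text.

```python
# Sketch of the two-sample significance test used in the experiments
# (n = 30 accuracy samples per configuration).
from math import sqrt

def two_sample_t(mean1, var1, mean2, var2, n=30):
    """t = (m1 - m2) / sqrt(s1^2/n + s2^2/n)."""
    return (mean1 - mean2) / sqrt(var1 / n + var2 / n)

def significantly_better(mean1, var1, mean2, var2, n=30, critical=1.67):
    """Reject the null hypothesis when the statistic exceeds the
    critical value (1.67, per the text); otherwise accept it."""
    return two_sample_t(mean1, var1, mean2, var2, n) > critical
```

For example, a 0.1 OPFA gap with variance 0.01 on both sides gives a t score well above 1.67, so the gap is judged significant.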

VI. EXPERIMENTS

We designed three experiments on character pronunciations of the Chaozhou dialect, which is a Min dialect spoken in eastern Guangdong, to evaluate the effects of the following factors:

A. Effect of Dialectal Data on Standard Classifiers

The conventional approach employed by philologists for Chinese dialect pronunciation prediction is to find correspondences between rime book categories and modern pronunciations, often through laborious human inspection. However, a clear correspondence between the two does not always exist. In the Wu dialect, for example, the rime book categories 夬 (guai) and 佳 (jia) are not clearly distinguished, sometimes being rendered as -ua and sometimes as -uo. Introducing dialectal data (other dialects' phonological features) may help distinguish pronunciations in some dialects.

We train the SVM classifier to predict character pronunciations in Chaozhou. As previously described, we conducted two runs:

1) Rime Book Only (R): In this run only the rime book features, namely 聲母 (initials), 韻 (rhymes/finals), 攝 (rhyme groups), 聲調 (tones), 呼 (openness), and 等 (grades), are included.

2) Rime Book + Full Dialectal Data (R+F): In addition to the rime book features, all dialectal data are used. In cases where there are missing pronunciations, a random guess is supplied for each phonological feature for the SVM classifier.



TABLE IV
PREDICTION ACCURACY WITH DIFFERENT DIALECT GROUPS

The results are listed in Table III. Including dialectal data yields a significant performance gain.

B. Impacts of Proximate Dialects

Snyder et al. [26] reported that POS tagging performance can be improved by including more languages, especially closely related languages. We carried out experiments to see whether using rime book features (R) with closely related dialects (+C) is more effective than with distantly related dialects (+D).

We compared the OPFA of the Xi'an and Chaozhou dialects, which belong to the Mandarin and Min dialect groups, respectively. The Mandarin dialects we use in the experiments are Jinan, Taiyuan, and Beijing; for the Min dialects we use Xiamen, Fuzhou, and Jian'ou. For each dialect we conduct two runs, the first using dialects from the same dialect group, and the second using dialects from the other dialect group. To make the comparison meaningful, we keep the ratio of missing entries the same in every run by randomly removing entries, and the missing entries are filled randomly without sophisticated augmentation. Thus, each run has 10% of pronunciations removed and augmented with random guesses. The average OPFA over 30 runs is listed in Table IV. The results show that R+C outperforms R+D for both the Xi'an and Chaozhou dialects by a statistically significant margin.

C. Effect of Data Augmentation

As described in Section I, the data for many Chinese dialects are scarce. Our data augmentation model is designed to fill in missing pronunciation information. If our augmentation model is effective, one application would be to use multiple resource-poor dialects to augment missing data in another dialect's pronunciation database. For data augmentation, we use the procedure described in Section V-C to fill in the missing pronunciations in the Chaozhou dialect.

For comparison, we employ three different methods to augment the missing data as baselines:

1) Logistic Regression (-L): Using the rime book features, a discriminative model is trained to predict missing phonological values.

2) Naive Bayes (-N): Similar to the logistic regression model, a generative model is trained on the rime book features to predict missing phonological values.

3) Random (-R): The missing phonological values are guessed randomly.

In this experiment we test two different amounts of removal, 10% and 20%. All the following SVM classifiers use the RBF kernel with fixed kernel parameters. The number of SLRs in our augmentation model is set to 200.
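The three baselines can be sketched as follows, with scikit-learn's `LogisticRegression` and `BernoulliNB` standing in for the paper's unspecified implementations; the data layout and function name are assumptions for illustration.

```python
# Illustrative sketch of the baseline augmentation methods (-L, -N, -R).
# `X` holds binary rime-book feature vectors; `y` holds one phonological
# feature column, with None marking missing entries to be filled in.
import random
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB

def augment(X, y, method="logistic", seed=0):
    """Fill the None entries of y with the chosen baseline; returns a copy."""
    known = [i for i, v in enumerate(y) if v is not None]
    missing = [i for i, v in enumerate(y) if v is None]
    filled = list(y)
    if method == "random":                       # baseline -R
        rng = random.Random(seed)
        values = sorted(set(y[i] for i in known))
        for i in missing:
            filled[i] = rng.choice(values)
        return filled
    clf = (LogisticRegression(max_iter=1000)     # baseline -L
           if method == "logistic"
           else BernoulliNB())                   # baseline -N
    clf.fit([X[i] for i in known], [y[i] for i in known])
    for i, pred in zip(missing, clf.predict([X[i] for i in missing])):
        filled[i] = pred
    return filled
```

Each phonological feature column would be augmented independently this way before the SVM experiments are run on the completed table.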

TABLE V
EFFECTS OF DATA AUGMENTATION WITH CLOSELY (R+C) AND DISTANTLY (R+D) RELATED DIALECT DATA

TABLE VI
IPFA TABLE

The results and the corresponding p-values are listed in Table V. Using our data augmentation model consistently improves OPFA. Interestingly, the margin of improvement appears to be greater when using closely related dialect data than when using distantly related dialect data, on both the 10%-missing and the 20%-missing datasets.

VII. ANALYSIS AND DISCUSSION

We are interested in how the choice of training dialects affects individual feature predictions. Table VI shows the percentage IPFA improvement over the baseline random augmentation. The R+C run benefits from the augmentation in all features except nasalization, the reason for which is unclear.

In the R+D run, the tone, initial, and final features show worse IPFA after augmentation. This can be explained by considering the assumptions of our proposed model. We assume that the dialects exhibit correspondence among phonological features



across dialects. That is, corresponding phonological features across dialects should be put under the same SLR. Therefore, if the dialects lack such correspondence, the augmented features may be inaccurate. It is evident that phonological features such as tones, initials, and finals do not have good correspondence across different dialect families [27]. Recent research suggests tones in Min dialects may be related to an innovation of the Wu-Min proto-dialect [27], which Mandarin did not share. As for initials, there is a striking difference between the "heavy" and "light" initial distinction in Mandarin and Min dialects [28]. Finals also lack good correspondence: the Min dialects have preserved most final stops from Middle Chinese, while Mandarin dialects have lost many. Thus, it is difficult to predict final consonants in Min dialects using Mandarin dialects, and vice versa.

The IPFA metric seems to reflect the level of correspondence between the target dialect and other dialects, both closely and distantly related. The possibility of determining dialectal relationships between individual dialects by comparing their respective IPFA improvement scores may lead to interesting discoveries.

VIII. CONCLUSION

We propose a novel generative model that makes use of both existing dialect pronunciation data and medieval rime books to discover phonological patterns that exist in multiple dialects, which are referred to as superlingual rhymes (SLRs) in our proposed model. The proposed model can predict character pronunciations for a dialect based on existing dialect pronunciation tables (even if incomplete) and the pronunciation data in rime books. We evaluate prediction accuracy in terms of phonological features, such as tone, initial phoneme, etc. For each character, the phonological features are also evaluated as a whole, as overall pronunciation feature accuracy (OPFA). Our first experimental results show that adding features from dialectal pronunciation data to our baseline rime-book model dramatically improves OPFA using the support vector machine (SVM) model. In the second experiment, we compare the performance of the SVM model using phonological features from closely related dialects with that of the model using phonological features from non-closely related dialects. The experimental results show that using features from closely related dialects results in higher accuracy. In the third experiment, we show that using our proposed data augmentation model to fill in missing data can increase the SVM model's OPFA by up to 7.6%. We also note that this improvement is greater when using closely related dialect data.

APPENDIX A
INTEGRATION OF THE MULTINOMIAL PARAMETERS

Since the three multinomial parameter sets have Dirichlet priors, their posterior distributions are Dirichlet-multinomial distributions, as introduced in [29]. We clarify how we integrate out these variables, taking one of them as an example. For convenience, (1) is restated here:

Fixing the remaining variables, we collect the terms of (1) that involve the parameter of interest. These terms can be refactored into a product over counts, where each count is the number of observations with a given phonological feature value, SLR, and dialect. Thus, it can be rewritten as

Using the identity of (3), we have

(7)

The remaining two variable sets can be integrated out in the same fashion.
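For reference, the Dirichlet-multinomial identity underlying this integration (see [29]) has the following generic form, written with generic stand-in symbols: a parameter vector $\phi$ over $K$ values, a symmetric hyperparameter $\beta$, and counts $n_k$.

```latex
% Generic Dirichlet-multinomial integral (cf. Minka [29]); the symbols
% are generic stand-ins, not the paper's own notation.
\int \prod_{k=1}^{K} \phi_k^{\,n_k}\,
  \mathrm{Dir}(\phi \mid \beta)\, d\phi
  \;=\;
  \frac{\Gamma(K\beta)}{\Gamma(\beta)^{K}}\,
  \frac{\prod_{k=1}^{K} \Gamma(n_k + \beta)}
       {\Gamma\!\left(\sum_{k=1}^{K} n_k + K\beta\right)}
```

Applying this identity once per parameter set replaces each multinomial parameter with ratios of Gamma functions over the corresponding counts, which is what makes the collapsed Gibbs sampler of Appendix B possible.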

APPENDIX B
DERIVATION OF THE GIBBS SAMPLER

(8)



and, again using the same identity, we have

(9)

Using a counting identity, where the count in question is the number of observations with a given SLR, and the fact that a character has a fixed number of observations, we can further simplify (9) into

(10)

where the indicator term equals one if it matches the current assignment, and zero otherwise.
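For intuition, a collapsed Gibbs conditional of this kind takes the following familiar LDA-style form (cf. [18], [20]), written here with generic stand-in symbols ($z_i$ for the latent assignment, $n$ for counts excluding position $i$, $V$ for the number of feature values, and hyperparameters $\alpha$, $\beta$); this is illustrative, not the paper's exact sampler.

```latex
% Generic LDA-style collapsed Gibbs conditional (cf. Heinrich [20]);
% symbols are generic stand-ins for the paper's lost notation.
P(z_i = k \mid \mathbf{z}_{-i}, \mathbf{w})
  \;\propto\;
  \frac{n^{(w_i)}_{k,-i} + \beta}{n^{(\cdot)}_{k,-i} + V\beta}
  \left( n^{(k)}_{m,-i} + \alpha \right)
```

In the proposed model, the role of topics is played by SLRs and the role of words by observed phonological feature values, so each ratio of counts corresponds to one of the Gamma-function ratios produced by the integration in Appendix A.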

ACKNOWLEDGMENT

The authors would like to thank Prof. C.-C. Cheng for providing them with the DOC dataset, and the TASLP reviewers for their valuable comments, which helped improve the quality of the paper.

REFERENCES

[1] "CMUDICT, CMU Pronouncing Dictionary," 1998 [Online]. Available: http://www.speech.cs.cmu.edu/cgi-bin/cmudict

[2] J. H. Jenkins and R. Cook, "Unicode Han Database," The Unicode Consortium, Tech. Rep., 2009.

[3] L.-Q. Tong, "Survey on the usage of Chinese languages and script" (in Chinese), Language and Literature Press, Beijing, China, 2006 [Online]. Available: http://www.china-language.gov.cn/LSF/LS-Frame.aspx

[4] C. Tang and V. J. van Heuven, "Mutual intelligibility of Chinese dialects experimentally tested," Lingua, vol. 119, no. 5, pp. 709–732, 2009.

[5] J. B. Jensen, "On the mutual intelligibility of Spanish and Portuguese," Hispania, vol. 72, no. 4, pp. 848–852, 1989.

[6] E. G. Pulleyblank, "Qieyun and Yunjing: The essential foundation for Chinese historical linguistics," J. Amer. Oriental Soc., vol. 118, no. 2, pp. 200–216, 1998.

[7] M. Streeter, "DOC, 1971: A Chinese dialect dictionary on computer," Comput. Humanities, vol. 6, no. 5, pp. 259–270, 1972.

[8] K. Nigam, A. K. McCallum, S. Thrun, and T. Mitchell, "Text classification from labeled and unlabeled documents using EM," Mach. Learn., vol. 39, no. 2–3, pp. 103–134, 2000.

[9] X. Lu, B. Zheng, A. Velivelli, and C. Zhai, "Enhancing text categorization with semantic-enriched representation and training data augmentation," J. Amer. Med. Inform. Assoc., vol. 13, no. 5, pp. 526–535, 2006.

[10] D. van Dyk and X. Meng, "The art of data augmentation," J. Comput. Graph. Statist., vol. 10, no. 1, pp. 1–50, 2001.

[11] A. Bouchard-Côté, P. Liang, T. Griffiths, and D. Klein, "A probabilistic approach to diachronic phonology," in Proc. Empirical Methods in Natural Lang. Process. and Comput. Natural Lang. Learn. (EMNLP/CoNLL), 2007.

[12] M. Ben Hamed and F. Wang, "Stuck in the forest: Trees, networks and Chinese dialects," Diachronica, vol. 23, no. 1, pp. 29–60, 2006.

[13] W. S. Allen, Vox Latina: A Guide to the Pronunciation of Classical Latin. Cambridge, U.K.: Cambridge Univ. Press, 1978.

[14] P.-H. Ting, "Some thoughts on the reconstruction of Middle Chinese," J. Chinese Linguist., vol. 249, no. 6, p. 414, 1995.

[15] T.-L. Mei, "The survival of two pairs of Qieyun distinctions in Southern Wu dialects," J. Chinese Linguist., vol. 280, no. 1, pp. 1–15, 2001.

[16] B. Snyder, T. Naseem, J. Eisenstein, and R. Barzilay, "Adding more languages improves unsupervised multilingual part-of-speech tagging: A Bayesian non-parametric approach," in Proc. NAACL-HLT, Morristown, NJ, 2009, pp. 83–91.

[17] S. Stüker, F. Metze, T. Schultz, and A. Waibel, "Integrating multilingual articulatory features into speech recognition," in Proc. 8th Eur. Conf. Speech Commun. Technol., 2003.

[18] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet allocation," J. Mach. Learn. Res., vol. 3, pp. 993–1022, 2003.

[19] S. Goldwater and T. Griffiths, "A fully Bayesian approach to unsupervised part-of-speech tagging," in Proc. 45th Annu. Meeting Assoc. Comput. Linguist. (ACL), Prague, Czech Republic, Jun. 2007, pp. 744–751.

[20] G. Heinrich, "Parameter estimation for text analysis," Univ. of Leipzig, Leipzig, Germany, Tech. Rep., 2008 [Online]. Available: http://www.arbylon.net/publications/text-est.pdf

[21] P. Resnik and E. Hardisty, "Gibbs sampling for the uninitiated," Univ. of Maryland, Tech. Rep. CS-TR-4956, UMIACS-TR-2010-04, LAMP-153, 2010.

[22] T. Berg-Kirkpatrick, A. Bouchard-Côté, J. DeNero, and D. Klein, "Painless unsupervised learning with features," in Proc. NAACL-HLT, Los Angeles, CA, Jun. 2010, pp. 582–590.

[23] D. C. Liu and J. Nocedal, "On the limited memory BFGS method for large scale optimization," Math. Program., vol. 45, no. 3, pp. 503–528, 1989.

[24] A. Dempster et al., "Maximum likelihood from incomplete data via the EM algorithm," J. R. Statist. Soc. Ser. B (Methodological), vol. 39, no. 1, pp. 1–38, 1977.

[25] C.-C. Cheng, "Measuring relationship among dialects: DOC and related resources," Comput. Linguist., vol. 2, no. 1, pp. 41–72, 1997.

[26] B. Snyder, T. Naseem, J. Eisenstein, and R. Barzilay, "Unsupervised multilingual learning for POS tagging," in Proc. Conf. Empirical Methods Natural Lang. Process. (EMNLP), Morristown, NJ, 2008, pp. 1041–1050.

[27] R.-W. Wu, "A comparative study on the phonologies of Min and Wu dialects," Ph.D. dissertation, Dept. of Chinese Literature, National Chengchi Univ., Taipei, Taiwan, 2005.

[28] U.-J. Ang, "On the motivation and typology of aspiration and nasalization in Sinitic languages," in Proc. 6th Int. and 17th Nat. Conf. Chinese Phonol., Taipei, Taiwan, May 1999.

[29] T. Minka, "Estimating a Dirichlet distribution," Mass. Inst. of Technol., Cambridge, MA, Tech. Rep., 2000.

Chu-Cheng Lin received the B.S. and M.S. degrees in computer science and information engineering from National Taiwan University, Taipei, in 2008 and 2010, respectively.

His current research interests are information retrieval, natural language processing, and computational phonology.

Richard Tzong-Han Tsai received the B.S., M.S., and Ph.D. degrees in computer science and information engineering from National Taiwan University, Taipei, Taiwan, in 1997, 1999, and 2006, respectively.

He was a Postdoctoral Fellow at Academia Sinica from 2006 to 2007. He is now an Assistant Professor in the Department of Computer Science and Engineering, Yuan Ze University, Zhongli, Taiwan. His research areas are natural language processing, cross-language information retrieval, biomedical literature mining, and information services on mobile devices.