arXiv:2012.06262v1 [cs.CL] 11 Dec 2020

Morphology Matters: A Multilingual Language Modeling Analysis

Hyunji Hayley Park, University of Illinois, [email protected]
Katherine J. Zhang, Carnegie Mellon University, [email protected]
Coleman Haley, Johns Hopkins University, [email protected]
Kenneth Steimel, Indiana University, [email protected]
Han Liu, University of Chicago∗, [email protected]
Lane Schwartz, University of Illinois, [email protected]

    Abstract

Prior studies in multilingual language modeling (e.g., Cotterell et al., 2018; Mielke et al., 2019) disagree on whether or not inflectional morphology makes languages harder to model. We attempt to resolve the disagreement and extend those studies. We compile a larger corpus of 145 Bible translations in 92 languages and a larger number of typological features.1 We fill in missing typological data for several languages and consider corpus-based measures of morphological complexity in addition to expert-produced typological features. We find that several morphological measures are significantly associated with higher surprisal when LSTM models are trained with BPE-segmented data. We also investigate linguistically-motivated subword segmentation strategies like Morfessor and Finite-State Transducers (FSTs) and find that these segmentation strategies yield better performance and reduce the impact of a language's morphology on language modeling.

    1 Introduction

With most research in Natural Language Processing (NLP) directed at a small subset of the world's languages, whether the techniques developed are truly language-agnostic is often not known. Because the vast majority of research focuses on English with Chinese a distant second (Mielke, 2016), neither of which is morphologically rich, the impact of morphology on NLP tasks for various languages is not entirely understood.

Several studies have investigated this issue in the context of language modeling by comparing a number of languages, but found conflicting results. Gerz et al. (2018) and Cotterell et al. (2018)

∗ Work done while at University of Colorado Boulder.
1 https://github.com/hayleypark/MorphologyMatters

find that morphological complexity is predictive of language modeling difficulty, while Mielke et al. (2019) conclude that simple statistics of a text, like the number of types, explain differences in modeling difficulty rather than morphological measures.

This paper revisits this issue by increasing the number of languages considered and augmenting the kind and number of morphological features used. We train language models for 92 languages from a corpus of Bibles fully aligned at the verse level and measure language modeling performance using surprisal (the negative log-likelihood) per verse (see §4.5). We investigate how this measure is correlated with 12 linguist-generated morphological features and four corpus-based measures of morphological complexity.

Additionally, we contend that the relation between segmentation method, morphology, and language modeling performance needs further investigation. Byte-Pair Encoding (BPE; Shibata et al., 1999) is widely used in NLP tasks including machine translation (Sennrich et al., 2016) as an unsupervised information-theoretic method for segmenting text data into subword units. Variants of BPE or closely related methods such as WordPiece (Kudo, 2018) are frequently employed by state-of-the-art pretrained language models (Liu et al., 2019; Radford et al., 2019; Devlin et al., 2019; Yang et al., 2019). However, BPE and other segmentation methods may vary in how closely they capture morphological segments for a given language, which may affect language modeling performance.

Therefore, this paper focuses on the following two research questions:

1. Does a language's morphology influence language modeling difficulty?

2. If so, how do different segmentation methods interact with morphology?

    In order to answer the first question, we train


models using data sets segmented by characters and BPE units. Our results show that BPE language modeling surprisal is significantly correlated with measures of morphological typology and complexity. This suggests that BPE segments are ineffective in mitigating the effect of morphology in language modeling.

As for the second question, we consider more linguistically-motivated segmentation methods to compare with BPE: Morfessor (Creutz and Lagus, 2007) and Finite-State Transducers (FSTs) (see §4.3). Our comparison of the models using the different segmentation methods shows that Morfessor reduces the impact of morphology for more languages than BPE. FST-based segmentation methods outperform the other segmentation methods when available. These results suggest that morphologically motivated segmentations improve cross-linguistic language modeling.

    2 Modeling difficulty across languages

Studies have demonstrated that different languages may be unequally difficult to model and have tested the relations between such modeling difficulty and morphological properties of languages, using different segmentation methods.

Vania and Lopez (2017) compared the effectiveness of word representations based on different segmentation methods in modeling 10 languages with various morphological typologies. They trained word-level language models, but utilized segmentation methods to create word embeddings that include segment-level information. Comparing character, BPE, and Morfessor segmentations, they concluded that character-based representations were most effective across languages, with BPE always outperforming Morfessor. However, models based on hand-crafted morphological analyses outperformed all other segmentation methods by a wide margin.

Gerz et al. (2018) trained n-gram and neural language models over 50 languages and argued that the type of morphological system is predictive of model performance. Their results show that languages differ with regard to modeling difficulty. They attributed the differences among languages to four types of morphological systems: isolating, fusional, introflexive, and agglutinative. While they found a significant association between the morphological type and modeling difficulty, Type-Token Ratio (TTR) was the most predictive of language modeling performance.

Cotterell et al. (2018) arrived at a similar conclusion modeling 21 languages using the Europarl corpus (Koehn, 2005). When trained with n-gram and character-based Long Short-Term Memory (LSTM) models, the languages showed different modeling difficulties, which were correlated with a measure of morphology, Morphological Counting Complexity (MCC), or the number of inflectional categories (Sagot, 2013).

However, Mielke et al. (2019) failed to reproduce the correlation with MCC when they increased the scope to 69 languages, utilizing a Bible corpus (Mayer and Cysouw, 2014). They also reported no correlation with measures of morphosyntactic complexity such as head-POS entropy (Dehouck and Denis, 2018) and other linguist-generated features (Dryer and Haspelmath, 2013). Rather, they found that simpler statistics, namely the number of types and the number of characters per word, correlate with language model surprisal using BPE and character segmentation, respectively.

    3 Morphological measures

Several different measures are used to represent a language's morphology.

    3.1 Linguist-generated measures

The most linguistically-informed measures of morphology involve expert descriptions of languages. The World Atlas of Language Structures (WALS; Dryer and Haspelmath, 2013) has been used frequently in the literature to provide typological information. WALS is a large database of linguistic features gathered from descriptive materials, such as reference grammars. It contains 144 chapters in 11 areas including phonology, morphology, and word order. Each chapter describes a feature with categorical values and lists the languages that have each value. However, not all languages in the database have data for all the features, and for some languages there is no data at all.

The studies reviewed in §2 all relied on this expert-description approach to quantify morphological properties. Gerz et al. (2018) focused on WALS descriptions of inflectional synthesis of verbs, fusion, exponence, and flexivity, while Mielke et al. (2019) looked at two WALS features, 26A "Prefixing vs. Suffixing in Inflectional Morphology" and 81A "Order of Subject, Object and Verb." Cotterell et al. (2018) used UniMorph (Kirov et al., 2018), instead of WALS, to calculate MCC. Vania and Lopez (2017) did not cite any databases but provided descriptions of four morphological types (fusional, agglutinative, root-and-pattern, and reduplication) and categorized 10 languages into these types.

A major issue with this approach to representing morphology is that there is not enough expert data available to enable comparisons across many different languages. In fact, Mielke et al. (2019) chose their two WALS features because data for these features existed for most of their languages. Moreover, Bentz et al. (2016) showed that their WALS-based measure had lower correlations with other measures of morphological complexity due to this issue of missing data.

    3.2 Corpus-based measures

In contrast, corpus-based measures of morphology can be easily calculated on a given data set. These measures include the number of types, Type-Token Ratio (TTR), Moving-Average TTR (MATTR; Covington and McFall, 2010), and Mean Length of Words (MLW). The exact definition of the measures may vary depending on the study, but we define them as in Table 1, where a word token is a string separated by spaces in the training set after tokenization but before segmentation.

While some studies (e.g., Mielke et al., 2019) consider these measures simple statistics of a corpus, other studies have found that they can be used as approximate measures of morphological complexity. Kettunen (2014) showed that TTR, MATTR, and MLW can capture the overall ranking of morphological complexity generated by information-theoretic and expert-generated measures of morphological complexity. Bentz et al. (2016) compared different measures of morphological complexity for 519 languages across 101 families and showed a strong correlation between all measures, which were based on corpus statistics, linguistic expertise, information theory, and translation alignment. They argued that corpus-based measures, including TTR, and other measures of morphological complexity can be used interchangeably. In addition, Gerz et al. (2018) showed that TTR is influenced by the morphological typology of a language. According to them, isolating languages tend to have small TTR values and are often easier to model, while the opposite is

Measure   Definition
Types     Number of unique word tokens
TTR       Number of unique word tokens divided by total number of word tokens
MATTR     Average TTR calculated over a moving window of 500 word tokens
MLW       Average number of characters per word token

Table 1: Corpus-based measures of morphology defined for this study. These measures are calculated on tokenized data sets before applying any segmentation method.
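The measures in Table 1 can be computed directly from a list of word tokens. The following is a minimal sketch (the study uses an adapted script from the LexicalRichness module, so this illustrates the definitions rather than reproducing that code), with the 500-token MATTR window as the default:

```python
def corpus_measures(tokens, window=500):
    """Compute the four corpus-based morphology measures of Table 1
    over a list of word tokens (tokenized, unsegmented text)."""
    types = len(set(tokens))
    ttr = types / len(tokens)
    # MATTR: average TTR over a moving window of `window` tokens.
    # Falling back to plain TTR for short texts is a simplifying assumption.
    if len(tokens) < window:
        mattr = ttr
    else:
        ratios = [len(set(tokens[i:i + window])) / window
                  for i in range(len(tokens) - window + 1)]
        mattr = sum(ratios) / len(ratios)
    mlw = sum(len(t) for t in tokens) / len(tokens)
    return {"Types": types, "TTR": ttr, "MATTR": mattr, "MLW": mlw}
```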

true for agglutinative languages.

Given the previous literature, we utilize these corpus-based measures, as well as expert-generated WALS features, as a proxy for morphological differences among languages in our study.

    4 Methods

We design our experiments to test if a language's morphology is correlated with language model performance, depending on the segmentation method. We represent a language's morphology using WALS features and corpus statistics. We train language models for Bible translations in 92 languages based on five different segmentation methods: character, BPE, Morfessor, and FST with BPE or Morfessor back-off strategies (FST+BPE & FST+Morfessor). We use surprisal per verse (Mielke et al., 2019) as the evaluation metric to compare language modeling performance across different languages and different segmentation methods. Additionally, we quantify the difference in surprisal per verse between segmentation methods to compare the relative strength of each segmentation method with regard to morphological complexity.

4.1 Data

Our data consist of 145 Bible translations in 92 languages covering 22 language families,2 fully aligned at the verse level. The majority of

2 For each language, we report the family assigned by WALS (Dryer and Haspelmath, 2013): 6 Afro-Asiatic, 1 Algic, 1 Altaic, 2 Austro-Asiatic, 6 Austronesian, 1 Aymaran, 3 Dravidian, 4 Eskimo-Aleut, 1 Guaicuruan, 33 Indo-European, 1 Japanese, 1 Korean, 1 Mande, 6 Mayan, 6 Niger-Congo, 4 Quechuan, 5 Sino-Tibetan, 1 Songhay, 1 Tai-Kadai, 2 Tupian, 2 Uralic, 2 Uto-Aztecan, 2 creoles.

the data came verse-aligned from Mielke et al. (2019) (original data from Mayer and Cysouw, 2014). We added more Bibles from another corpus (Christodoulopoulos and Steedman, 2014) and from online Bible resources (see Appendix A for more information). We refer to each language by its ISO 639-3 code when applicable.

We followed Mielke et al. (2019)'s method to split the data into training, development, and test sets: the verse-aligned data were divided into blocks of 30 verses, with the first five verses assigned to the development set, the next five to the test set, and the rest to the training set. The resulting training set had 16,926 verses, while the development and test sets had 4,225 verses each.
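The block-wise split described above can be sketched as follows (an illustration of the scheme, not the original scripts; `split_verses` is a hypothetical helper name):

```python
def split_verses(verses):
    """Split verse-aligned data using 30-verse blocks: within each block,
    verses 1-5 go to dev, verses 6-10 to test, and the remaining 20 to train."""
    train, dev, test = [], [], []
    for i, verse in enumerate(verses):
        pos = i % 30  # position of this verse within its 30-verse block
        if pos < 5:
            dev.append(verse)
        elif pos < 10:
            test.append(verse)
        else:
            train.append(verse)
    return train, dev, test
```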

It should be noted that both Mielke et al. (2019) and Christodoulopoulos and Steedman (2014) provided tokenized data. We tokenized the newly added Bibles using Mielke and Eisner (2019)'s tokenizer, following Mielke et al. (2019). When both tokenized and untokenized versions were available, we included only the tokenized versions.

We chose to replace characters that occurred only once with a special UNK symbol. Mielke et al. (2019) applied this procedure to characters that appear fewer than 25 times in the training set, except for Chinese, where only singleton characters were replaced. Because we added several languages where the original strategy would have resulted in removing too much data, we preprocessed singleton characters across the board.
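Singleton-character replacement amounts to counting characters in the training text and substituting the ones seen exactly once; a minimal sketch (the particular UNK symbol here is an illustrative assumption, not the one used in the experiments):

```python
from collections import Counter

def replace_singletons(train_text, texts, unk="\u2047"):
    """Replace every character occurring exactly once in the training text
    with an UNK symbol, applied consistently across all splits."""
    counts = Counter(train_text)
    singletons = {c for c, n in counts.items() if n == 1}
    return ["".join(unk if c in singletons else c for c in t) for t in texts]
```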

We also corrected several errors present in the data. For example, the Bible translations in Shona (sna) and Telugu (tel) were mis-coded as Shan (shn) and Tecpatlán Totonac (tcw), respectively.

    4.2 Morphological measures selected

In this paper, we adopt two approaches to representing a language's morphology. First, we rely on expert descriptions of languages in WALS, manually augmenting the database to rectify the issue of missing data. Second, we utilize corpus-based measures like TTR to represent the morphological complexity of a given language.

WALS features  While some previous studies (e.g., Gerz et al., 2018; Vania and Lopez, 2017) categorized relatively well-known languages into a small number of morphological types, such categorization is not always clear. Other studies (e.g., Cotterell et al., 2018; Mielke et al., 2019) selected a small number of available typological

ID   Name
20A  Fusion of Selected Inflectional Formatives
21A  Exponence of Selected Inflectional Formatives
21B  Exponence of Tense-Aspect-Mood Inflection
22A  Inflectional Synthesis of the Verb
23A  Locus of Marking in the Clause
24A  Locus of Marking in Possessive Noun Phrases
25A  Locus of Marking: Whole-language Typology
25B  Zero Marking of A and P Arguments
26A  Prefixing vs. Suffixing in Inflectional Morphology
27A  Reduplication
28A  Case Syncretism
29A  Syncretism in Verbal Person/Number Marking

Table 2: The 12 morphological features in WALS.

features to compare, but their conclusions were at odds, possibly calling for exploration of other measures. Therefore, we consider all available morphological features described by WALS to explore which features affect language modeling and how. Instead of making theoretical claims about morphological typology, we explore which typological features make a language's morphology more complex for LSTM language models.

To that end, we augmented the existing WALS database by consulting reference grammars for each language. Of the 92 languages in our corpus, six were not in the WALS database.3 In addition, many of the languages in the database had missing data for some features. For example, we had no data for any of the morphological features of Afrikaans (afr). We manually assigned missing features where possible, following the descriptions in the relevant WALS chapters regarding the procedures used to assign feature values to languages.

Of the almost 200 features in WALS, the editors of the database labeled 12 as morphological features. We therefore considered these 12 features, listed in Table 2 and described below,4 to test the hypothesis that morphological complexity

3 ikt, lat, nch, tbz, wbm, zom
4 See https://wals.info/chapter for more details and examples of these features.

correlates with modeling difficulty.

Feature 20A describes how closely grammatical markers (inflectional formatives) are phonologically connected to a host word or stem. The markers can be isolating, concatenative, or even nonlinear (i.e., ablaut and tone).

Features 21A and 21B measure the exponence of selected grammatical markers. Exponence refers to the number of categories that a single morpheme expresses. For 21A, the selected grammatical markers were case markers. For 21B, they were tense-aspect-mood (TAM) markers.

Feature 22A measures how many grammatical categories may appear on verbs in a language. These categories include tense-aspect-mood, negation, voice, and agreement.

Features 23A through 25B describe the existence and locus of marking in different kinds of phrases. A phrase may have marking on either its head, its dependent(s), both, or neither. In full clauses, the verb is the head, and the subject and object arguments are dependents. In possessive noun phrases, the possessed noun is the head while the possessor is the dependent.

Feature 26A measures the degree to which languages use prefixes versus suffixes in their inflectional morphology. Feature 27A describes which languages use reduplication productively and whether or not both full and partial reduplication are used.

Both Features 28A and 29A measure syncretism. Syncretism occurs when a single inflected form corresponds to more than one function. 28A measures case syncretism specifically, while 29A measures syncretism in the subject agreement marking of verbs.

Types, TTR, MATTR, and MLW  We calculated the number of types, TTR, Moving-Average TTR, and Mean Length of Words using an adapted script from the Python module LexicalRichness.5

We used a window size of 500 for Moving-Average TTR, following previous studies (e.g., Kettunen, 2014). The definitions of the measures are found in Table 1. All measures were calculated based on the word tokens in the training set before applying any segmentation method.

5 https://github.com/LSYS/LexicalRichness

    4.3 Segmentation methods

We chose to train only open-vocabulary language models for fair comparison. Word-level models predict UNK for out-of-vocabulary word tokens and as a result cannot be fairly compared with character- and subword-level models. Specifically, we trained language models using five segmentation methods: character, BPE, Morfessor, FST+BPE, and FST+Morfessor. These segmentation methods provide a way to segment any given text into smaller pieces, some of which approximate morphemes.

A morpheme is the smallest meaning-bearing morphological unit, while a morph is the surface representation of one or more morphemes. Linguistically-motivated methods like Morfessor and FSTs are designed with the goal of producing subword segments that are closely aligned to the true morphs comprising a word. While BPE was not designed with morpheme segmentation in mind, its resulting subwords are commonly believed to align with morphs to some degree because morph subsequences are frequent in the data.

Segmenting words into morphs may reduce the impact of rich morphology, as highly inflected words can be broken into smaller pieces that are likely to contribute similar meanings across contexts in the corpus. Table 3 provides examples of the segmentation methods we used to train language models. The original verse is provided for reference only and was not used to train any models.

Character  We trained character-based language models, following previous studies (Mielke et al., 2019; Gerz et al., 2018; Cotterell et al., 2018). Character language models are trained to predict the next character given the preceding context, and the vocabulary includes an underscore 〈_〉 to denote word boundaries.
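Character segmentation of a tokenized line is a one-liner; the following sketch (a hypothetical helper, not the original preprocessing script) spaces out every character and marks word boundaries with an underscore:

```python
def char_segment(line):
    """Character segmentation: each character becomes a unit,
    with '_' marking the boundary between word tokens."""
    # Join words with '_', then put a space between every character.
    return " ".join("_".join(line.split()))
```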

BPE  We trained BPE-based language models, following Mielke et al. (2019). Starting with character segmentation, BPE operations combine characters into larger chunks based on their frequencies to create units somewhere between characters and words, with the number of merge operations as the hyperparameter (Sennrich et al., 2016). We used 0.4 × the number of types as the number of merges, as Mielke et al. (2019) reported that to be most effective with their corpus.6 BPE language models

6 Additional static numbers of merge operations were also tested, with nearly identical results.


Segmentation    Example
Tokenized       Yuhannanın kardeşi Yakubu kılıçla öldürdü .
Character       Y u h a n n a n ı n _ k a r d e ş i _ Y a k u b u _ k ı l ı ç l a _ ö l d ü r d ü .
BPE             Yuhan@@ nanın kardeşi Yakubu kılıçla öldürdü .
Morfessor       Yuhanna@@ nın kardeş@@ i Yakub@@ u kılıç@@ la öldürdü .
FST+BPE         Yuhan@@ nanın kardeş@@ i Yakub@@ u kılıç@@ la öl@@ dür@@ dü .
FST+Morfessor   Yuhanna@@ nın kardeş@@ i Yakub@@ u kılıç@@ la öl@@ dür@@ dü .

Table 3: Turkish examples for different segmentation methods. An English translation is "And he killed James the brother of John with the sword" (Acts 12:2). The FST does not produce analyses for Yuhannanın ("John's"), for which BPE or Morfessor back-off was used. The segmentation created by human experts was the same as FST+Morfessor. 〈@@〉 denotes subword segmentation, while 〈_〉 encodes a space between word tokens for character segmentation.

are trained to predict the next BPE unit. The double at sign 〈@@〉 is used to indicate segments that are not word-final.
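The BPE merge-learning procedure can be illustrated with a from-scratch sketch of the Sennrich et al. (2016) algorithm (an illustration only, not the implementation used in our experiments): starting from character sequences, the most frequent adjacent pair is repeatedly merged into a new symbol.

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn BPE merges from a {word: frequency} dict.
    Each word starts as a tuple of characters; each iteration merges
    the most frequent adjacent symbol pair across the vocabulary."""
    vocab = {tuple(w): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word with the chosen pair merged into one symbol.
        new_vocab = {}
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges
```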

Morfessor  Morfessor (Creutz and Lagus, 2007) is a word segmentation method explicitly designed for morphological segmentation. The default implementation utilizes a unigram language model to find morph-like constructs. While this approach is information-theoretic like BPE, it selects segments top-down and includes a prior term for the length of segments, regularizing segments to be more plausible morphemes.

Using the default settings with Morfessor 2.0 (Virpioja et al., 2013), we trained Morfessor on the training set and applied the segmentation to all data sets. Just like BPE, the language models are trained to predict the next morph unit.

FST  While segmentation based on BPE and Morfessor may or may not resemble actual morphemes, morpheme segmentation from Finite-State Transducers (FSTs) provides a knowledge-based method to segment a text into morphemes. Finite-state morphological analyzers are rule-based systems that take a surface string as input and produce all possible morphological analyses as output. To use FSTs for segmentation, we changed existing morphological analyzers into segmenters and developed a heuristic to select one analysis for a given word token. FSTs for Plains Cree (Arppe et al., 2014–2019), German (Schmid et al., 2004), English (Axelson et al., 2015), Finnish (Pirinen, 2015), Indonesian (Larasati et al., 2011), Cuzco Quechua (Vilca et al., 2012), and Turkish (Çağrı Çöltekin, 2014, 2010) were used as morphological segmenters.

Most FSTs are designed to provide analyses for surface forms, not morphological segmentations. Fortunately, morpheme boundaries are frequently part of FSTs due to their relevance for lexico-phonological phenomena. By modifying the FST before the cleanup rules that remove morpheme boundaries can apply, we create a morphological segmenter that takes in a surface form and returns the surface form with morpheme boundary markers. If the analyzer already provides segmentations, the transducer is used as-is.

For example, the Turkish FST produces a morphological analysis (the lemma kılıç plus inflectional tags) for the surface form kılıçla ("with the sword") in the example in Table 3. Instead of producing such an analysis for the given word, the segmenter produces the segmented surface form kılıç@@ la, which is used in the FST segmentation methods.

Because an FST may return multiple analyses or segmentations for a single word, a heuristic method was used to determine which segmentation to select. In general, we chose the segmentation with the fewest segments. However, the English segmenter based on Axelson et al. (2015) always returns the input string itself as a possible segmentation if it is covered by the analyzer. For example, walks produces two segmentations in the English segmenter: walks and walk@@ s. For this segmenter, we selected the segmentation with the fewest segments excluding the input string itself (e.g., choosing walk@@ s over walks).
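The selection heuristic can be sketched as follows (a hypothetical helper illustrating the rule, with candidates represented as lists of segments):

```python
def pick_segmentation(candidates, original=None):
    """Select one FST segmentation for a word: fewest segments, optionally
    excluding the unsegmented input itself (as for the English analyzer,
    which always returns the input string as one of its outputs)."""
    options = [c for c in candidates if c != [original]]
    if not options:  # only the unsegmented form was returned
        options = list(candidates)
    return min(options, key=len)
```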

When an FST produces no analyses for a given word, as in the case of Yuhannanın ("John's") in Table 3, we adopt the FST-augmented BPE segmentation (FST+BPE) and FST-augmented Morfessor segmentation (FST+Morfessor), where we fall back to BPE or Morfessor segmentation whenever FST segmentation is unavailable. As shown in the table, FST+BPE and FST+Morfessor differ only in the segmentation of the unanalyzed word. For this particular verse, the human segmentation agrees with the FST+Morfessor segmentation. FST+BPE and FST+Morfessor models are trained just like BPE or Morfessor models to predict the next subword unit.

    4.4 Models

Following Mielke et al. (2019), we trained the Long Short-Term Memory (LSTM) models introduced by Merity et al. (2018) for each of the segmentation methods. Three LSTM models using character, BPE, and Morfessor segmentation were trained for all languages. For a select group of languages, we also trained models using FST+BPE and FST+Morfessor units. The neural architecture consisted of an initial embedding layer, multiple LSTM layers, and a linear decoder layer. For our particular experiments, we adopted the hyperparameters from Mielke et al. (2019) (see Merity et al., 2018, for their character PTB settings). The batch size used for character models was 128, with 500 epochs of training. All other models used a batch size of 40 and were trained for 200 epochs.

    4.5 Metrics

Surprisal per verse  One major evaluation metric for language models is the negative log-likelihood on a test set. The negative log-likelihood, or surprisal, is the amount of information a language model needs to generate the next unit. Following Mielke et al. (2019), we define the surprisal at the verse level, where NLL(v_ij) = −log₂ p(v_ij) for verse v_ij (the i-th verse in language j). Since each verse is intended to express the same meaning across languages, differences in per-verse surprisal across languages primarily indicate differences in cross-linguistic language model quality (rather than differences in meaning content).

    For each language j, we average the negative log-likelihood across the 4,225 verses in the test set, giving L_j = (1/4225) Σ_{i=1}^{4225} NLL(v_ij).

    Surprisal difference Additionally, we quantify the difference between segmentation methods in language modeling performance as shown in Equation 1. This quantity compares the relative strength of one segmentation method to another.

    ∆^j_{S1,S2} = (L^j_{S1} − L^j_{S2}) / (½ (L^j_{S1} + L^j_{S2}))    (1)

    S1 and S2 are two segmentation methods to compare, and L^j_{S1} and L^j_{S2} represent the surprisal per verse for the language models based on the two segmentation methods. If ∆^j_{S1,S2} is positive, S1 resulted in a higher surprisal than S2, and S2 was more effective in modeling language j.
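Equation 1 is a symmetrized relative difference; a direct transcription in Python, with made-up surprisal values for illustration:

```python
def surprisal_delta(L1, L2):
    """Relative surprisal difference between segmentations S1 and S2
    for one language (Equation 1): the raw difference normalized by
    the mean of the two surprisals. Positive means S2 did better."""
    return (L1 - L2) / (0.5 * (L1 + L2))

# If BPE yields 300 bits/verse and Morfessor 260 (hypothetical values),
# the positive delta says Morfessor was the more effective segmentation:
print(surprisal_delta(300, 260))  # ~0.143
```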

    5 Results

    We now present results from our experiments. We report the strong association between several morphological features and surprisal per verse for BPE language models, compared to language models based on other segmentation methods. Then, we show the trade-offs between different segmentation methods and how they interact with morphological complexity. Our assumption is that, if a segmentation method reduces the impact of morphology, the surprisal values of language models based on that segmentation will have weaker correlations with measures of morphology.

    5.1 Correlation studies with character and BPE models

    We investigated correlations between surprisal per verse and various measures of morphology (i.e., WALS features, number of types, TTR, Moving-Average TTR, and Mean Length of Word). Benjamini and Hochberg (1995)'s procedure was used to control the false discovery rate, so only p ≤ (8/15) · 0.05 (≈ 0.027) is considered significant.
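The Benjamini–Hochberg procedure finds the largest ranked p-value p_(k) with p_(k) ≤ (k/m)·α and declares everything at or below it significant. A sketch with made-up p-values chosen so that, as in the text, the 8th of 15 ranked p-values sets the cutoff near (8/15)·0.05:

```python
def bh_threshold(pvalues, alpha=0.05):
    """Benjamini-Hochberg (1995) step-up procedure: return the largest
    p_(k) satisfying p_(k) <= (k/m) * alpha; tests with p-values at or
    below this threshold are considered significant."""
    m = len(pvalues)
    thresh = 0.0
    for k, p in enumerate(sorted(pvalues), start=1):
        if p <= k / m * alpha:
            thresh = p  # keep the largest qualifying p seen so far
    return thresh

# 15 hypothetical p-values; the 8th-smallest (0.026) satisfies
# 0.026 <= (8/15) * 0.05 ~ 0.027, so it becomes the cutoff:
pvals = [0.001, 0.002, 0.003, 0.004, 0.010, 0.015, 0.020, 0.026,
         0.200, 0.300, 0.400, 0.500, 0.600, 0.700, 0.800]
print(bh_threshold(pvals))  # 0.026
```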

    WALS features We tested for association between surprisal and each selected WALS feature with the Kruskal–Wallis test, or one-way ANOVA on ranks. This non-parametric test was chosen because the distribution of surprisal values did not meet the assumption of normality. A significant test result in this context means that there are significant differences in the median surprisal values between categories for a given feature. In order for the test to be effective, only feature values with a sample size ≥ 5 were tested.
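The Kruskal–Wallis statistic ranks the pooled surprisal values and compares rank sums across feature categories. A minimal sketch of the H statistic using midranks for ties but omitting the tie-correction divisor that full implementations (e.g., scipy.stats.kruskal) apply:

```python
def kruskal_wallis_h(groups):
    """Kruskal-Wallis H (one-way ANOVA on ranks) for a list of groups,
    e.g. surprisal values grouped by WALS feature category.
    Ties get midranks; no tie-correction factor, for brevity."""
    pooled = sorted(x for g in groups for x in g)
    n = len(pooled)
    rank = {}
    i = 0
    while i < n:                       # assign midranks to tied runs
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        for idx in range(i, j):
            rank[pooled[idx]] = (i + 1 + j) / 2  # mean of ranks i+1..j
        i = j
    h = sum(sum(rank[x] for x in g) ** 2 / len(g) for g in groups)
    return 12 / (n * (n + 1)) * h - 3 * (n + 1)

# Two fully separated categories give the maximal H for this design:
print(kruskal_wallis_h([[1, 2, 3], [4, 5, 6]]))  # ~3.857
```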

    For the character models, no features showed significant association with surprisal. However, for the BPE models, half of the morphological features had significant association with surprisal. These features were 21A "Exponence of Selected Inflectional Formatives," 23A "Locus of Marking

    Segmentation   ID    p-value   η²

    BPE
                   21A   1.3e-05   0.28
                   23A   6.7e-06   0.28
                   24A   2.2e-04   0.228
                   25A   6.5e-05   0.253
                   25B   0.014     0.06
                   29A   2.0e-04   0.198

    Morfessor
                   21A   0.009     0.109
                   23A   0.002     0.135
                   26A   0.022     0.064
                   29A   0.024     0.072

    Table 4: p-values and effect sizes of WALS features that showed a significant effect on surprisal per verse. Large effect sizes (≥ 0.14) are in bold.

    in the Clause," 24A "Locus of Marking in Possessive Noun Phrases," 25A "Locus of Marking: Whole-language Typology," 25B "Zero Marking of A and P Arguments," and 29A "Syncretism in Verbal Person/Number Marking."

    For the features shown to have an effect on the BPE surprisal, we calculated the effect sizes and performed post-hoc comparisons to determine which categories were significantly different. In this context, effect size (η²) indicates the proportion of variance in surprisal per verse explained by each WALS feature, and η² ≥ 0.14 is considered a large effect (Tomczak and Tomczak, 2014). The p-values and effect sizes are summarized in Table 4. The effect size was large for all of the significant features except for 25B.
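Assuming the standard eta-squared formula for a Kruskal–Wallis H statistic given by Tomczak and Tomczak (2014), η² = (H − k + 1)/(n − k) for k categories and n observations:

```python
def kw_eta_squared(h, k, n):
    """Eta-squared effect size for a Kruskal-Wallis H statistic
    (Tomczak and Tomczak, 2014): the share of rank variance in
    surprisal explained by a k-category feature over n languages."""
    return (h - k + 1) / (n - k)

# e.g. a hypothetical H = 25.0 over k = 4 categories and n = 92 languages
# would count as a large effect (>= 0.14):
print(kw_eta_squared(25.0, 4, 92))  # 0.25
```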

    For Feature 21A, the median surprisal value for languages with no case was significantly lower than the median value for other types. Similarly, for 23A, the median surprisal value for languages with no marking was significantly lower than the value for other types. In the cases of both 24A and 25A, languages with double marking had higher surprisal values than those with single or no marking. For 25B, languages with non-zero marking had slightly higher surprisal values than those with zero marking. Lastly, for 29A, languages without syncretism had higher surprisal values than those with syncretism or with no marking.

    In general, less inflectional morphology was associated with lower surprisal, while more inflectional morphology was associated with higher surprisal.

    Segmentation   Measure   Spearman's ρ

    Character
                   Types     0.19*
                   TTR       0.15
                   MATTR     0.17*
                   MLW       0.06

    BPE
                   Types     0.80***
                   TTR       0.76***
                   MATTR     0.68***
                   MLW       0.61***

    Morfessor
                   Types     0.50***
                   TTR       0.44***
                   MATTR     0.39***
                   MLW       0.30***

    Table 5: Correlation between surprisal per verse per segmentation method and morphological complexity measures. *p < 0.027, ***p < 0.0005.

    Corpus-based measures A similar trend emerged for corpus-based measures of morphological complexity. The surprisal per verse of BPE models was highly correlated with type count, TTR, Moving-Average TTR (MATTR), and Mean Length of Word (MLW). Yet with character models, the strength of the correlation was weak and often insignificant. These results suggest that BPE segmentation was ineffective in reducing the impact of morphological complexity.
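These corpus-based measures are simple functions of a tokenized corpus. A minimal pure-Python sketch; the MATTR window size here is illustrative, not necessarily the one used in the paper:

```python
def ttr(tokens):
    """Type-token ratio: distinct word forms over total tokens."""
    return len(set(tokens)) / len(tokens)

def mattr(tokens, window=500):
    """Moving-Average TTR (Covington and McFall, 2010): mean TTR over
    all fixed-size windows, reducing TTR's dependence on text length."""
    if len(tokens) <= window:
        return ttr(tokens)
    vals = [ttr(tokens[i:i + window])
            for i in range(len(tokens) - window + 1)]
    return sum(vals) / len(vals)

def mlw(tokens):
    """Mean length of word, in characters."""
    return sum(map(len, tokens)) / len(tokens)

words = "the cat saw the dog and the dog saw the cat".split()
print(round(ttr(words), 3))  # 5 types / 11 tokens = 0.455
```

A morphologically rich language packs more distinctions into single word forms, inflating types (and hence TTR and MLW) relative to a parallel text in an analytic language.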

    Table 5 summarizes the correlation coefficients and corresponding p-values. For the character-based models, only the number of types and MATTR showed a significant correlation in Spearman's rank-order correlation, and those correlations were rather weak. In contrast, the BPE models presented strong correlations with all of the corpus-based measures at any reasonable alpha value (p < 10^-16). The number of types showed the strongest correlation, followed by TTR, MATTR, and MLW in that order.
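Spearman's rank-order correlation, used throughout Tables 5 and 6, is the Pearson correlation of the ranks (midranks for ties). A self-contained sketch:

```python
def spearman_rho(xs, ys):
    """Spearman rank-order correlation: Pearson correlation of the
    midranks of the two samples."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        i = 0
        while i < len(v):              # assign midranks to tied runs
            j = i
            while j < len(v) and v[order[j]] == v[order[i]]:
                j += 1
            for k in range(i, j):
                r[order[k]] = (i + j + 1) / 2  # mean of 1-based ranks
            i = j
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

# A perfectly monotone relation (e.g. type count vs. surprisal,
# hypothetical values) yields rho = 1:
print(spearman_rho([1, 2, 3, 4], [10, 20, 30, 40]))  # 1.0
```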

    5.2 Comparison with Morfessor and Finite-State Transducer models

    We trained language models using three additional segmentation methods: Morfessor, FST+BPE, and FST+Morfessor. Because Morfessor is an unsupervised method, we were able to utilize it to segment all languages, but we were able to generate FST segmentation for only a few languages. As such, we compare the character, BPE, and Morfessor models for all languages before looking into a subset of them where the FST methods were available.

    Figure 1: Pairwise comparisons of surprisal per verse values for character, BPE, and Morfessor models. For the majority of the languages, Morfessor segmentation resulted in lower surprisal per verse than character or BPE segmentation.

    Morfessor models Morfessor segmentation performed better than both character and BPE segmentation for the majority of languages. Figure 1 shows the pairwise comparisons of the surprisal per verse values of a given language on different segmentation strategies. As shown in the plot on the left, the relative strength between BPE and character segmentation methods is not clear. BPE segmentation produced slightly better results for 49 of the 92 languages, but character segmentation produced much lower surprisal values for the rest of the languages. In contrast, Morfessor clearly outperformed character and BPE for most of the languages, as shown in the plots in the middle and right. Only 12 out of the 92 languages had higher surprisal values for Morfessor segmentation than character, while a total of 66 languages performed better with Morfessor segmentation than with BPE.

    In addition, Morfessor models' surprisal per verse showed weaker correlations with measures of morphology. Only four WALS features showed significant association with the Morfessor models: 21A "Exponence of Selected Inflectional Formatives," 23A "Locus of Marking in the Clause," 26A "Prefixing vs. Suffixing in Inflectional Morphology," and 29A "Syncretism in Verbal Person/Number Marking." The effect sizes were also much smaller than those for the BPE models, as shown in Table 4.

    Just as with the BPE models, the median surprisal for languages with no marking was much lower than the surprisal for other types for Features 21A, 23A, and 29A. For 26A, there was only a significant difference between weakly suffixing languages and strongly prefixing languages, with strongly prefixing languages having a lower median surprisal per verse.

    As shown in Table 5, corpus-based statistics still showed significant correlations with the surprisal per verse of Morfessor models, but the correlations were moderate compared to those of the BPE models.

    FST models When available, a FST segmentation method resulted in the best performance. The graph in Figure 2 displays the surprisal of FST+BPE and FST+Morfessor models in comparison to the segmentation methods discussed above. For all seven languages, either FST+BPE or FST+Morfessor segmentation (or both) shows a clear decrease in the surprisal per verse compared to the BPE and Morfessor segmentations.

    5.3 Surprisal difference and morphological complexity

    In order to look into the effect of morphological complexity on the relative strength of a given segmentation method, we conducted correlation studies with the difference between the surprisal per verse for pairs of segmentation methods (the ∆ values as defined in §4.5). We considered only the measures of morphological complexity that were continuous variables (i.e., number of types, TTR, Moving-Average TTR, and Mean Length of Word).

    Figure 2: Surprisal per verse per segmentation method, including FST segmentation methods. FST+BPE or FST+Morfessor models outperform all other models.

    As shown in Table 6, all of the corpus-based statistics were highly correlated to the ∆ values. The correlations range from moderate to high using Spearman's ρ (0.50 < ρ < 0.95). Even though the strength of correlations varied slightly, number of types, TTR, MATTR, and MLW all showed a similar correlation with the difference statistics. They all had a positive correlation with ∆_{BPE, char}. This indicates that the more morphologically complex a language is, the better it is modeled with character segmentation compared to BPE segmentation. Similarly, there were positive correlations between the morphological measures and ∆_{Morfessor, char}, suggesting that character segmentation works better than Morfessor in modeling morphologically complex languages. ∆_{BPE, Morfessor} also had positive correlations with complexity measures. This means that languages with higher morphological complexity tend to record lower surprisal values with Morfessor segmentation than BPE. While BPE and Morfessor models outperformed character models on average as shown in §5.2, the positive correlations with ∆_{Morfessor, char} and ∆_{BPE, char} suggest that character segmentation outperformed BPE and Morfessor segmentation for languages with very rich morphology.

    These results are supported by Figure 3, where the surprisal per verse for different segmentation models is plotted against MATTR.7 For languages with lower MATTR, BPE and Morfessor perform better than character segmentation. However, for languages with higher MATTR, character and Morfessor models outperform BPE.

    7 The same trend was captured when we plotted with the other corpus-based measures.

    Difference            Measure   Spearman's ρ

    ∆_{BPE, char}
                          Types     0.95***
                          TTR       0.92***
                          MATTR     0.77***
                          MLW       0.74***

    ∆_{Morfessor, char}
                          Types     0.71***
                          TTR       0.66***
                          MATTR     0.50***
                          MLW       0.53***

    ∆_{BPE, Morfessor}
                          Types     0.86***
                          TTR       0.86***
                          MATTR     0.80***
                          MLW       0.75***

    Table 6: Correlation between surprisal differences and morphological complexity measures for character, BPE, and Morfessor models. All p-values < 10^-11.

    6 Discussion

    Our results show that BPE models' surprisal per verse is highly correlated with a language's morphology, represented by several WALS features and corpus-based measures. Morfessor shows weaker correlations with such measures and records better performance for most of the languages. FST-based models outperform others when available. In this section, we discuss the implications of these findings in the context of previous work and future research.

    6.1 Morphology and surprisal

    In accordance with the prior work discussed in §2, we found differences in modeling difficulty between languages. The correlation studies in §5 provide evidence that morphology is a substantial contributing factor to these differences. Six WALS (Dryer and Haspelmath, 2013) morphology features showed association with the surprisal per verse of BPE language models. Corpus-based statistics like number of types and MATTR showed strong correlations with BPE surprisal, supporting the relationship between modeling difficulty and morphological complexity.

    Figure 3: Surprisal per verse plotted against Moving-Average TTR for character, BPE, and Morfessor segmentation methods. Lines indicate the regression estimate with 95% confidence intervals.

    Our conclusion that a language's morphology impacts language modeling difficulty agrees with Cotterell et al. (2018) and Gerz et al. (2018), but is at odds with Mielke et al. (2019). We included languages known for their rich morphology, such as Western Canadian Inuktitut (ikt) and Central Alaskan Yup'ik (esu), which may have increased the variation in morphological complexity in the corpus. We also augmented the WALS data by consulting reference grammars, so we were able to consider 11 more morphological WALS features than Mielke et al. (2019). We found that the morphological feature Mielke et al. (2019) considered, 26A "Prefixing vs. Suffixing in Inflectional Morphology," indeed showed no correlation with BPE surprisal. However, our results show that there are aspects of morphology that affect surprisal that were not considered before.

    Previous work, such as Gerz et al. (2018), focused only on aspects of morphology that they believed a priori would predict language model performance. In contrast, our study tested all of the morphological features listed in WALS and also tested each of them individually. We found that two of the four features in Gerz et al. (2018), 20A "Fusion of Selected Inflectional Formatives" and 22A "Inflectional Synthesis of the Verb," showed no association with language model performance. Additionally, we found several features that affected language modeling performance, specifically locus of marking and syncretism, which were not mentioned in the literature. These results show that the features tied to morphological complexity in previous work are not necessarily the same features that affect language modeling.

    In addition to differences in results, our interpretation of corpus-based statistics like TTR also diverges from previous work. While Mielke et al. (2019) reported high correlations between language model performance and such statistics, they considered them only as simple statistics of the data. In fact, our results replicate Mielke et al. (2019) in that the number of types was the most predictive of BPE language model surprisal among all the variables considered. However, we argue that corpus-based statistics can be used as an approximate measure of morphological complexity based on previous studies. These corpus-based measures of morphology are reported to capture the overall ranking of morphological complexity (Kettunen, 2014; Bentz et al., 2016) and can be interpreted in relation to morphological typology (Gerz et al., 2018). We also believe our results indicate that TTR and the WALS features capture similar information. For example, the positive correlation of ∆_{BPE, Morfessor} with corpus-based measures corresponds to the smaller effect sizes of WALS features found for Morfessor compared to BPE. This indicates a lesser effect of rich morphology on Morfessor models compared to BPE.

    6.2 Segmentation methods

    While the primary goal of this work is to analyze the relation of a language's morphology to language modeling performance, we found this to be entangled with the level and method of segmentation. Our results show that there is significant variation in the effectiveness of segmentation methods cross-linguistically, and suggest challenges to the status quo methods of subword segmentation in particular. While the subword segmentation methods we used generally outperformed character-level segmentation, the higher the TTR, the smaller the difference in surprisal for both BPE and Morfessor, suggesting that these methods are less effective at segmenting languages with highly complex morphology. Of pre-existing methods, we found Morfessor to have the lowest surprisal per verse for most of the languages considered. Morfessor's weaker correlations with WALS features and other measures like TTR suggest that its better performance may be due to a better ability to model languages with a wider range of morphological attributes. This is in line with Bostrom and Durrett (2020), who showed that Unigram LM (Kudo, 2018), a segmentation algorithm similar to Morfessor, often outperforms BPE and produces more morph-like segmentation in the context of language model pretraining in English and Japanese.

    However, Morfessor was significantly outperformed by character segmentation for a small subset of languages.8 Many of these languages have been classified as polysynthetic, suggesting that perhaps Morfessor is ill-suited for such languages (see Klavans, 2018; Tyers and Mishchenkova, 2020; Mager et al., 2018, for discussions of the challenges polysynthetic languages pose for NLP tasks).

    Additionally, for a typologically diverse subset of languages for which we could obtain FST morphological segmenters, we considered novel segmentation methods: FST+BPE and FST+Morfessor. We found that this simple extension of BPE and Morfessor with morphological information achieved the lowest surprisal per verse in all available languages. The overall success of combining statistical segmentations with FSTs further confirms the impact of morphology on language modeling and yields significant promise for the use of segmentation based on linguistic morphological information.

    7 Conclusion

    A language's morphology is strongly associated with language modeling surprisal for BPE-segmented language models. BPE model surprisal is associated with 6 out of the 12 studied WALS morphology features, indicating that there are aspects of some languages' morphology that BPE does not help mitigate. Strong correlations with corpus-based measures of morphology such as TTR further suggest that the more types available in a language (often by means of rich morphology), the harder it is to model based on BPE units. Morfessor, which was designed with morpheme induction in mind, performs better for most languages and shows less association with morphological features. When available, the linguistically-informed method of FST-augmented BPE or Morfessor segmentation performs best, indicating a further promise for using linguistic knowledge to combat the effects of morphology on language model surprisal.

    8 amh, arz, ayr, cmn, esu, heb, ike, ikt, kal, quh, tel, xho. BPE outperformed Morfessor for cmn and heb.

    These conclusions were only possible through manual augmentation of typological databases and expansion of studied languages. Future efforts could adopt our approach for other areas of language. Using linguistically-informed resources across many languages is an avenue for improving neural models in NLP in both design and analysis.

    Acknowledgments

    This paper builds on our prior work for the 2019 Sixth Frederick Jelinek Memorial Summer Workshop on Speech and Language Technology (JSALT 2019) (Schwartz et al., 2020). We thank the organizers of the workshop and the members of our workshop team on Neural Polysynthetic Language Modeling for inspiring us to pursue this research direction. Our special thanks to Rebecca Knowles, Christo Kirov, Lori Levin, Chi-kiu (Jackie) Lo, and the TACL reviewers and editors for their feedback on our manuscript. We thank Ata Tuncer for his assistance with Turkish segmentation. This work utilizes resources supported by the National Science Foundation's Major Research Instrumentation program, grant #1725729, as well as the University of Illinois at Urbana-Champaign.

    References

    Antti Arppe, Atticus Harrigan, Katherine Schmirler, Lene Antonsen, Trond Trosterud, Sjur Nørstebø Moshagen, Miikka Silfverberg, Arok Wolvengrey, Conor Snoek, Jordan Lachler, Eddie Antonio Santos, Jean Okimāsis, and Dorothy Thunder. 2014–2019. Finite-state transducer-based computational model of Plains Cree morphology. https://giellalt.uit.no/lang/crk/PlainsCreeDocumentation.html

    Eric Axelson, Sam Hardwick, Krister Lindén, Kimmo Koskenniemi, Flammie Pirinen, Miikka Silfverberg, and Senka Drobac. 2015. Helsinki finite-state technology resources. https://hfst.github.io/

    Yoav Benjamini and Yosef Hochberg. 1995. Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 57(1):289–300.

    Christian Bentz, Tatyana Ruzsics, Alexander Koplenig, and Tanja Samardžić. 2016. A comparison between morphological complexity measures: Typological data vs. language corpora. In Proceedings of the Workshop on Computational Linguistics for Linguistic Complexity (CL4LC), pages 142–153, Osaka, Japan. The COLING 2016 Organizing Committee.

    Kaj Bostrom and Greg Durrett. 2020. Byte pair encoding is suboptimal for language model pretraining. CoRR, cs.CL/2004.03720v1.

    Çağrı Çöltekin. 2010. A freely available morphological analyzer for Turkish. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10), Valletta, Malta. European Language Resources Association (ELRA).

    Çağrı Çöltekin. 2014. A set of open source tools for Turkish natural language processing. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), Reykjavik, Iceland. European Language Resources Association (ELRA).

    Christos Christodoulopoulos and Mark Steedman. 2014. A massively parallel corpus: The Bible in 100 languages. Language Resources and Evaluation, 49:1–21.

    Ryan Cotterell, Sabrina J. Mielke, Jason Eisner, and Brian Roark. 2018. Are all languages equally hard to language-model? In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 536–541, New Orleans, Louisiana. Association for Computational Linguistics.

    Michael A. Covington and Joe D. McFall. 2010. Cutting the Gordian knot: The moving-average type–token ratio (MATTR). Journal of Quantitative Linguistics, 17(2):94–100.

    Mathias Creutz and Krista Lagus. 2007. Unsupervised models for morpheme segmentation and morphology learning. ACM Transactions on Speech and Language Processing, 4(1):3:1–3:34.

    Mathieu Dehouck and Pascal Denis. 2018. A framework for understanding the role of morphology in universal dependency parsing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2864–2870, Brussels, Belgium. Association for Computational Linguistics.

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

    Matthew S. Dryer and Martin Haspelmath, editors. 2013. WALS Online. Max Planck Institute for Evolutionary Anthropology, Leipzig.

    Daniela Gerz, Ivan Vulić, Edoardo Maria Ponti, Roi Reichart, and Anna Korhonen. 2018. On the relation between linguistic typology and (limitations of) multilingual language modeling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 316–327, Brussels, Belgium. Association for Computational Linguistics.

    Kimmo Kettunen. 2014. Can type-token ratio be used to show morphological complexity of languages? Journal of Quantitative Linguistics, 21(3):223–245.

    Christo Kirov, Ryan Cotterell, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Patrick Xia, Manaal Faruqui, Sebastian Mielke, Arya McCarthy, Sandra Kübler, David Yarowsky, Jason Eisner, and Mans Hulden. 2018. UniMorph 2.0: Universal morphology. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).

    Judith L. Klavans. 2018. Computational challenges for polysynthetic languages. In Proceedings of the Workshop on Computational Modeling of Polysynthetic Languages, pages 1–11, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

    Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Proceedings of the Tenth Machine Translation Summit, pages 79–86, Phuket, Thailand. AAMT.

    Taku Kudo. 2018. Subword regularization: Improving neural network translation models with multiple subword candidates. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 66–75, Melbourne, Australia. Association for Computational Linguistics.

    Septina Dian Larasati, Vladislav Kuboň, and Daniel Zeman. 2011. Indonesian morphology tool (MorphInd): Towards an Indonesian corpus. In Cerstin Mahlow and Michael Piotrowski, editors, Systems and Frameworks for Computational Morphology, pages 119–129. Springer Berlin Heidelberg, Berlin, Heidelberg.

    Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. CoRR, cs.CL/1907.11692v1.

    Manuel Mager, Elisabeth Mager, Alfonso Medina-Urrea, Ivan Vladimir Meza Ruiz, and Katharina Kann. 2018. Lost in translation: Analysis of information loss during machine translation between polysynthetic and fusional languages. In Proceedings of the Workshop on Computational Modeling of Polysynthetic Languages, pages 73–83, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

    Thomas Mayer and Michael Cysouw. 2014. Creating a massively parallel Bible corpus. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 3158–3163, Reykjavik, Iceland. European Language Resources Association (ELRA).

    Stephen Merity, Nitish Shirish Keskar, and Richard Socher. 2018. An analysis of neural language modeling at multiple scales. CoRR, cs.CL/1803.08240v1.

    Sabrina J. Mielke. 2016. Language diversity in ACL 2004–2016. https://sjmielke.com/acl-language-diversity.htm

    Sabrina J. Mielke, Ryan Cotterell, Kyle Gorman, Brian Roark, and Jason Eisner. 2019. What kind of language is hard to language-model? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4975–4989, Florence, Italy. Association for Computational Linguistics.

    Sabrina J. Mielke and Jason Eisner. 2019. Spell once, summon anywhere: A two-level open-vocabulary language model. Proceedings of the AAAI Conference on Artificial Intelligence, 33:6843–6850.

    Tommi A. Pirinen. 2015. Omorfi — free and open source morphological lexical database for Finnish. In Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015), pages 313–315, Vilnius, Lithuania. Linköping University Electronic Press, Sweden.

    Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.

    Benoît Sagot. 2013. Comparing complexity measures. In Computational Approaches to Morphological Complexity, Paris, France. Surrey Morphology Group.

    Helmut Schmid, Arne Fitschen, and Ulrich Heid. 2004. SMOR: A German computational morphology covering derivation, composition and inflection. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC'04), pages 1–263, Lisbon, Portugal. European Language Resources Association (ELRA).

    Lane Schwartz, Francis Tyers, Lori Levin, Christo Kirov, Patrick Littell, Chi-kiu Lo, Emily Prud'hommeaux, Hyunji Hayley Park, Kenneth Steimel, Rebecca Knowles, Jeffrey Micher, Lonny Strunk, Han Liu, Coleman Haley, Katherine J. Zhang, Robbie Jimerson, Vasilisa Andriyanets, Aldrian Obaja Muis, Naoki Otani, Jong Hyuk Park, and Zhisong Zhang. 2020. Neural polysynthetic language modelling. CoRR, cs.CL/2005.05477v2.

    Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.

    Yusuxke Shibata, Takuya Kida, Shuichi Fukamachi, Masayuki Takeda, Ayumi Shinohara, Takeshi Shinohara, and Setsuo Arikawa. 1999. Byte pair encoding: A text compression scheme that accelerates pattern matching. Technical report, Department of Informatics, Kyushu University.

    Maciej Tomczak and Ewa Tomczak. 2014. The need to report effect size estimates revisited. An overview of some recommended measures of effect size. Trends in Sport Sciences, 1(21):19–25.

    Francis Tyers and Karina Mishchenkova. 2020. Dependency annotation of noun incorporation in polysynthetic languages. In Proceedings of the Fourth Workshop on Universal Dependencies (UDW 2020), pages 195–204, Barcelona, Spain (Online). Association for Computational Linguistics.

    Clara Vania and Adam Lopez. 2017. From characters to words to in between: Do we capture morphology? In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2016–2027, Vancouver, Canada. Association for Computational Linguistics.

    Hugo David Calderon Vilca, Flor Cagniy Cárdenas Mariñó, and Edwin Fredy Mamani Calderon. 2012. Analizador morfológico de la lengua Quechua basado en software libre Helsinki finite-state transducer (HFST).

    Sami Virpioja, Peter Smit, Stig-Arne Grönroos, and Mikko Kurimo. 2013. Morfessor 2.0: Python implementation and extensions for Morfessor Baseline. Technical report, Aalto University.

    Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Hanna Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d’Alché-Buc, Emily Fox, and Roman Garnett, editors, Advances in Neural Information Processing Systems 32, pages 5753–5763. Curran Associates, Inc.

    A Data

    We began with the data used in Mielke et al. (2019). This was originally a subset of a Bible corpus (Mayer and Cysouw, 2014), which is no longer publicly available. We excluded constructed languages (epo, tlh) from the data, keeping a total of 104 verse-aligned Bibles in 60 languages⁹ in 12 language families. To increase the number of languages and language families represented, we added 41 Bibles in 32 languages to the data. 13 Bible translations in 13 languages¹⁰ were sourced from Christodoulopoulos and Steedman (2014). In addition, we included 28 Bible translations in 21 languages scraped from various online sources. Two of the Bibles scraped were in Spanish (spa) and Telugu (tel), languages which were already included in the Bible corpora (Mayer and Cysouw, 2014; Christodoulopoulos and Steedman, 2014). These translations were included because the new Spanish Bible was a parallel source for the Paraguayan Guaraní (gug) translation, and the Telugu Bible obtained from Mielke et al. (2019) was originally mislabeled as Tecpatlán Totonac (tcw). The Central Alaskan Yup’ik (esu) Bible was from https://bibles.org. 26 Bibles in 19 languages¹¹ were from http://bible.com. The Greenlandic (kal) Bible was obtained from http://old.bibelselskabet.dk.

    ⁹ afr, aln, arb, arz, ayr, bba, ben, bqc, bul, cac, cak, ceb, ces, cmn, cnh, cym, dan, deu, ell, eng, fin, fra, guj, gur, hat, hrv, hun, ind, ita, kek, kjb, lat, lit, mah, mam, mri, mya, nld, nor, plt, poh, por, qub, quh, quy, quz, ron, rus, som, tbz, tel, tgl, tpi, tpm, ukr, vie, wal, wbm, xho, zom

    ¹⁰ als, amh, dje, heb, isl, jpn, kor, pck, slk, slv, spa, swe, tha

    ¹¹ crk, gug, gui, hin, ike, ikt, kan, mal, mar, nch, nep, nhe, pes, pol, sna, spa, tel, tob, tur
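    The language counts above can be sanity-checked with a short script. This is a sketch using the ISO 639-3 code lists from footnotes 9–11 (with esu and kal added to the scraped set, per the text); the set names are ours, not from the released corpus:

    ```python
    # Footnote 9: the 60 languages retained from Mielke et al. (2019).
    mielke = {"afr","aln","arb","arz","ayr","bba","ben","bqc","bul","cac","cak",
              "ceb","ces","cmn","cnh","cym","dan","deu","ell","eng","fin","fra",
              "guj","gur","hat","hrv","hun","ind","ita","kek","kjb","lat","lit",
              "mah","mam","mri","mya","nld","nor","plt","poh","por","qub","quh",
              "quy","quz","ron","rus","som","tbz","tel","tgl","tpi","tpm","ukr",
              "vie","wal","wbm","xho","zom"}

    # Footnote 10: the 13 languages from Christodoulopoulos and Steedman (2014).
    cs2014 = {"als","amh","dje","heb","isl","jpn","kor","pck","slk","slv",
              "spa","swe","tha"}

    # Footnote 11 plus esu and kal: the 21 languages scraped from online sources.
    scraped = {"crk","gug","gui","hin","ike","ikt","kan","mal","mar","nch",
               "nep","nhe","pes","pol","sna","spa","tel","tob","tur",
               "esu","kal"}

    all_langs = mielke | cs2014 | scraped
    # spa and tel were already present, so the union drops 2 duplicates:
    # 60 + 13 + 21 - 2 = 92 languages in total.
    print(len(mielke), len(cs2014), len(scraped), len(all_langs))  # 60 13 21 92
    ```

    The two duplicated codes (spa, tel) account for the difference between the 34 newly listed languages and the 32 languages the appendix reports as added.
    
    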
