arXiv:2110.06733v1 [cs.CL] 13 Oct 2021




Systematic Inequalities in Language Technology Performance across the World's Languages

Damián Blasi
Harvard University

[email protected]

Antonios Anastasopoulos
George Mason University

[email protected]

Graham Neubig
Carnegie Mellon University

[email protected]

Abstract

Natural language processing (NLP) systems have become a central technology in communication, education, medicine, artificial intelligence, and many other domains of research and development. While the performance of NLP methods has grown enormously over the last decade, this progress has been restricted to a minuscule subset of the world's 6,500 languages. We introduce a framework for estimating the global utility of language technologies as revealed in a comprehensive snapshot of recent publications in NLP. Our analyses involve the field at large, but also more in-depth studies on both user-facing technologies (machine translation, language understanding, question answering, text-to-speech synthesis) as well as more linguistic NLP tasks (dependency parsing, morphological inflection). In the process, we (1) quantify disparities in the current state of NLP research, (2) explore some of its associated societal and academic factors, and (3) produce tailored recommendations for evidence-based policy making aimed at promoting more global and equitable language technologies.1

1 Introduction

The past decade has seen a rapid advance in natural language processing (NLP), the technology that allows computers to process human language. NLP has grown from a relatively technical niche to a fundamental tool in virtually all domains that involve language data in any shape or form. NLP is now instrumental for a vast array of tasks, from the early detection of neurodegenerative diseases (Orimaye et al., 2017), to exposing widespread gender and ethnic biases in societies (Caliskan et al., 2017), and predicting large-scale trends in collective consumer behavior (Kallus, 2014). More ostensibly, NLP has also become a staple technology for everyday frequent tasks in most contemporary societies of the world. For instance, an English speaker with a smartphone can now easily get accurate information on many topics through a quick query to a virtual assistant, they can consult an online translation service to translate a foreign language web page with a click, and they can interact with many different machines and computers through simple speech commands.

1 All authors contributed equally. Data and code to reproduce the findings discussed in this paper are available on GitHub (https://github.com/neubig/globalutility).

These technological capabilities can be attributed to several developments over the last few decades: 1. the advent of sophisticated machine learning methods, which allow for more effective creation of NLP systems from existing data (Goldberg, 2017), 2. the existence of standardized benchmark datasets and evaluation metrics, 3. the prestige afforded by the research community to researchers who improve upon these benchmarks, and 4. the resulting large number of resources, be they computation, data, or ingenuity, that are poured into optimizing performance thereon. As both a theoretical and technical endeavor, NLP is experiencing an explosive increase: the annual conference of the Association for Computational Linguistics (ACL, the flagship event in NLP) received fewer than 300 papers in 2000, growing to slightly less than 1,000 in 2010, and to more than 3,500 submissions in its 2020 edition. Largely as a result of this expansion of research effort, state-of-the-art systems have also achieved evaluation benchmark scores on par with human performance on a variety of NLP tasks, such as question answering on English (He et al., 2021), or on automatic translation of news from German, Russian, and Chinese to English (Barrault et al., 2020).2

These upward slanting curves on standard benchmarks fail to show how uneven this development has been for all potential NLP users. Extensive research across NLP tasks has found systematic performance drops according to dimensions such as gender, racial identity, and language varieties, among others. The reasons for these biases are multifactorial and can be traced to virtually all stages in the process of NLP development, from the data used to train systems (Caliskan et al., 2017; Sap et al., 2019; De-Arteaga et al., 2019; Tatman, 2017; Tatman and Kasten, 2017; Buolamwini and Gebru, 2018; Raji and Buolamwini, 2019) to the very algorithms involved (Speicher et al., 2018; Bellamy et al., 2018; Adebayo et al., 2016). The growing awareness of these biases in NLP technologies brought by these studies, along with the development of novel metrics and tests to evaluate these disparities, has resulted in progressively more efficient and principled strategies to understand and mitigate them.

2 Although the significance of these parity claims has been disputed (Läubli et al., 2018).

However, similarly systematic approaches are still lacking in one fundamental dimension of variation across individuals: their languages. Out of the over 6,500 languages spoken or signed in the world today (Hammarström, 2015), only a handful are systematically represented in academia and industry (Joshi et al., 2020). In spite of the aforementioned near-human results on translation or understanding of languages from the world's economic and political superpowers, the experience of any NLP practitioner is that, for the vast majority of languages, they fall far below such standards. Critically, the languages of the world showcase substantial amounts of variation in most domains of description, and in fact, the performance of language technologies has been shown to be sensitive to diverse aspects of the language under study, including morphology, word order, or phonological repertoire, as well as more mundane aspects like data availability (Tsarfaty et al., 2020; Xia et al., 2020; Arivazhagan et al., 2019). Hence, the transfer of NLP developments from one language to another is far from trivial, and building highly functional language technologies for any particular language remains a non-automatic, costly, and technically challenging task.

Taking all these considerations together, and given that even the consequences brought by unequal NLP technologies across (racial, gender, socioeconomic) groups within the same nominal language are already substantial, there is a pressing need for measuring and understanding NLP performance inequalities across the world's languages. Here we develop novel estimates of how the utility afforded by NLP systems is distributed across individuals, languages, and tasks at an unprecedented global scale. These estimates allow us to identify which languages are systematically under-served by language technologies and where focused technology development could benefit the most individuals. We finally trace these inequalities to the societal, economic, and academic correlates of NLP systems' performance, shedding light on their latent causes, and indicate how our results favor specific evidence-based policies in research and development.

2 Methodology

2.1 Quantifying utility and demand

Our fundamental goal is evaluating the distribution of diverse representative language technologies (and their qualities) across the world's languages and their populations. Minimally, we would attempt to account for the patterns of association between the demand for language technologies and the utility they confer to users across languages. Thus, the first component of our analysis pertains to quantifying the utility that users of a given language l receive from a language technology. Ideally, such a measure would capture to what extent a given NLP system solves the specific problems an individual can pose to it; for instance, how successful an automatic translation system is in translating a webpage, or how faithful a speech recognition system is in executing a series of verbal commands. Intuitively, utility is associated with the nominal performance of the technology; in NLP systems more specifically, performance is typically measured by contrasting the solution offered by the machine against the one a (knowledgeable) human would provide. How this comparison is instantiated and measured depends on the task (see Table 1); however, since our purpose is to allow for comparisons, we define the utility of a task and language, ul, as the corresponding performance normalized by the best possible performance afforded by such task, i.e.

$$u_l = \frac{\text{performance}_l}{\text{theoretical max performance}}$$

In cases where the best possible performance is undefined or technically unattainable, we take the empirical maximum as an estimate of the theoretical one and normalize by the best-performing language across all languages L, i.e. we replace the denominator in the above definition by $\max_{l' \in L}(\text{performance}_{l'})$.
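As a minimal illustration (this is a sketch, not the authors' released code), the normalization above can be written as follows; all scores and the theoretical maximum are invented placeholders:

```python
# Sketch of the per-language utility u_l: raw task performance normalized
# by the theoretical maximum, falling back to the empirical best-performing
# language when no theoretical maximum is attainable.

def utility(performance, theoretical_max=None):
    """performance: dict mapping language code -> raw task score.

    theoretical_max: best possible score for the task; if None,
    normalize by the empirical maximum across languages instead.
    """
    denom = theoretical_max
    if denom is None:
        denom = max(performance.values())  # empirical maximum across L
    return {lang: score / denom for lang, score in performance.items()}

# Accuracy-style metric with a known maximum of 1.0 (illustrative scores):
u_acc = utility({"eng": 0.92, "swa": 0.55}, theoretical_max=1.0)

# BLEU-style metric with no attainable maximum: the best language gets 1.0.
u_mt = utility({"eng": 38.0, "ben": 9.5})
```

Under the empirical-maximum fallback, the best-served language is assigned utility 1 by construction, so cross-task comparisons are relative rather than absolute.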

Page 3: arXiv:2110.06733v1 [cs.CL] 13 Oct 2021

Task | Description | Metric
Syntactic Analysis (DEP) | Infer syntactic dependencies between words in text | Labeled Attachment Score
Morphological Inflection (ING) | Produce an inflection given a lemma and morphological tags | Accuracy
Machine Translation (MT) | Translate text from a language into another | BLEU score
Speech Synthesis (TTS) | Produce speech on the basis of textual input | 1 - mel-cepstral distortion
Natural Language Inference (NLI) | Recognize entailment or contradiction between two sentences | Accuracy
Question Answering (QA) | Produce an answer for a textual query | F-score

Table 1: NLP tasks evaluated in the present study, along with their corresponding performance metric.

Defining utility in this manner allows us to explore and contrast language technologies at the broadest scale, which is possible thanks to some necessary simplifying assumptions. As we pointed out before, not all users of the same language technology might benefit in the same manner given a fixed utility, and the relation between nominal performance and "true" utility might be complex and non-linear.

With these caveats in mind, we further quantify the second component of our analysis, the demand for a language technology in each language l, dl. We characterize dl by taking into consideration demographic and linguistic perspectives. Under the first perspective, the demand for a given technology in a language is estimated to be proportional to the number of speakers of the language itself, nl (dl ∝ nl). Under the second perspective, the demand across the approximately 6,500 languages of the world is identical (dl ∝ 1). These two alternatives, as well as any intermediate combination of them, can be simply parameterized through a single exponent τ,

$$d_l^{(\tau)} = \frac{n_l^{\tau}}{\sum_{l' \in L} n_{l'}^{\tau}}$$

where τ = 1 corresponds to a demographic notion of demand, τ = 0 to a linguistic one, and 0 < τ < 1 is in between.
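The parameterized demand can be sketched directly from the formula (a sketch with invented speaker counts, not the paper's code):

```python
# Sketch of the parameterized demand d_l^(tau), interpolating between a
# demographic notion of demand (tau = 1, proportional to speaker counts)
# and a linguistic one (tau = 0, uniform across languages).

def demand(speakers, tau):
    """speakers: dict mapping language code -> number of users n_l."""
    weights = {lang: n ** tau for lang, n in speakers.items()}
    total = sum(weights.values())
    return {lang: w / total for lang, w in weights.items()}

# Illustrative (not exact) speaker counts:
n = {"eng": 1_452_000_000, "ben": 272_000_000, "wbp": 3_000}

d_demo = demand(n, tau=1.0)  # demographic: proportional to n_l
d_ling = demand(n, tau=0.0)  # linguistic: every language weighted equally
```

By construction the weights sum to 1 for any τ, so d can be read as a distribution of demand over languages.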

Equipped with these notions, we construct a simple family of global metrics (Mτ) revealing to what degree the global demand for language technologies is actually met:

$$M_\tau = \sum_{l \in L} d_l^{(\tau)} \cdot u_l$$

Mτ has a number of intuitive properties we would like such a metric to have. Mτ is bounded between 0 and 1; 0 corresponds to a case where no one benefits from a given language technology, whereas 1 would correspond to a situation where all languages enjoy perfect technology. Increasing the utility of a given language leads to an increase in Mτ, and the magnitude of this increase is influenced by both the size of the improvement and the demand in that language.
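Putting the two quantities together, the global metric Mτ is a demand-weighted average of utilities. The following sketch (with invented utilities and speaker counts) shows how the choice of τ changes what an improvement is worth:

```python
# Sketch of the global metric M_tau = sum_l d_l^(tau) * u_l: a
# demand-weighted average of per-language utilities. All numbers invented.

def global_metric(utilities, speakers, tau):
    """utilities: dict language -> u_l in [0, 1]; speakers: dict -> n_l."""
    weights = {lang: speakers[lang] ** tau for lang in utilities}
    total = sum(weights.values())
    return sum((weights[lang] / total) * u for lang, u in utilities.items())

u = {"eng": 1.0, "spa": 0.9, "amh": 0.1}
n = {"eng": 1_452_000_000, "spa": 548_000_000, "amh": 57_000_000}

m_demographic = global_metric(u, n, tau=1.0)  # dominated by populous languages
m_linguistic = global_metric(u, n, tau=0.0)   # every language counts equally
# Raising the Amharic utility increases both metrics, but moves the
# linguistic one far more, since each language carries weight 1/|L| there.
```

This makes the paper's point concrete: under τ = 1 the metric is already high because well-served languages are populous, while under τ = 0 the under-served languages drag it down.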

2.2 NLP tasks

We apply our measures of utility and demand to a set of diverse and major representative NLP tasks, which are described below and summarized in Table 1.

The first three are tasks that technology users interact with directly in their everyday life, so that their output is already in a shape and form that is usable for most individuals. Question answering (QA) consists of crafting a relevant answer to a question formulated in natural language, such as "what is the capital city of the Philippines?" or "why do dogs like bones?". This task is ubiquitous in online search or virtual assistants. Machine translation (MT) is the task of translating from one language to another (e.g. from Tagalog to Estonian or from Japanese to Basque), and is typically used to facilitate inter-personal communication, information gathering, and e-commerce. Text-to-speech (TTS) is the task of rendering speech from textual input, which is used widely in spoken virtual assistants, car navigation systems, and in general is becoming the standard gateway for the internet of things.

Beyond these three user-facing tasks, we also consider three more technical and linguistically-focused tasks, which often form part of the pipelines of the user-facing tasks but which are rarely if ever encountered "in the wild" by language technology users. Morphological Inflection (Inflection) is the task of generating an inflected word form given a lemma and a morphological specification, e.g. producing the third person singular form for "run": run+3;SG→runs. Syntactic Parsing under the dependency formalism (DEP) is the task of producing a syntactic parse of an input sentence, e.g. given the sentence "dogs like bones", specifying that "dogs" and "bones" are the subject and object of "like" respectively. Natural Language Inference (NLI) is a central task in AI and involves the evaluation of information presented in propositional format. More specifically, given a sentence called the "premise" (e.g. "the dog chewed a big bone"), NLI systems decide whether a separate sentence called the "hypothesis" is entailed by the premise (e.g. "the dog gnawed at a bone"), negated by it (e.g. "the dog was sleeping"), or neither (e.g. "the dog likes bones").

2.3 Correlates of NLP utility

Beyond the performance of individual tasks, we take a bird's-eye view of the field of language technologies in general, as we analyze some of the correlates of the scientific production in NLP. In particular, we follow two broad guiding questions: (1) does the system of academic incentives promote the development of a more linguistically diverse NLP? and (2) is economic centrality or sheer demographic demand the best predictor of NLP technologies in any given language?

While a full understanding of the complex causal mechanisms binding society and NLP in general is outside of the scope of the present article, we set out to provide a first large-scale exploration of these matters by considering scientific publications appearing in major international NLP conferences as the basic units of science production. This simplification is not without challenges: for instance, some widely used language technologies are developed outside of the traditional scientific circuit based on proprietary technology, or they are published in local conferences, possibly in languages other than English.3 In spite of this, studying scientific publications (and their correlates) allows us to evaluate transparent questions on the basis of publicly available data at a scale that is unfeasible for in-depth analyses.

Therefore, we study the first question by determining whether the cumulative number of citations a paper receives is correlated with the number of languages it is associated with. We investigate our second question by finding the best predictive model of the number of NLP papers in any given language, contrasting two predictors: the estimated number of users worldwide and the approximate GDP associated with its users. We model these regression problems in a Bayesian generalized mixed effects framework (see Appendix B).

3 E.g., the Japanese NLP society's 2020 conference published 396 papers: https://www.anlp.jp/proceedings/annual_meeting/2020/
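The paper's analysis uses a Bayesian generalized mixed effects framework (Appendix B); as a much simpler stand-in, the predictor comparison can be sketched with two single-predictor log-log least-squares fits and their residual errors. All data points below are invented for illustration:

```python
# Simplified stand-in (NOT the paper's Bayesian mixed-effects model):
# compare how well two predictors (approximate GDP vs. number of users)
# explain the number of NLP papers per language, via log-log OLS fits.
import numpy as np

def prediction_error(x, y):
    """Mean squared residual of a linear fit of log(y) on log(x)."""
    X = np.column_stack([np.ones_like(x), np.log(x)])
    coef, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)
    residuals = np.log(y) - X @ coef
    return float(np.mean(residuals ** 2))

# Invented per-language figures, one row per language:
papers = np.array([5000.0, 900.0, 150.0, 20.0, 5.0])
speakers = np.array([1.4e9, 5.5e8, 2.7e8, 8.0e7, 3.0e6])
gdp = np.array([2.5e13, 6.0e12, 8.0e11, 3.0e11, 1.0e10])

err_gdp = prediction_error(gdp, papers)
err_speakers = prediction_error(speakers, papers)
# The predictor with the smaller residual error better explains paper counts;
# the paper reports that approximate GDP wins this comparison on real data.
```

This sketch ignores random effects and collinearity, which the actual model handles; it only illustrates the shape of the comparison.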

2.4 Data

We manually collect information on task performance for a number of diverse representative NLP technologies, as summarized in Table 1 (see Materials & Methods in Appendix A). These range from user-facing applications like machine translation (i.e. the automatic translation of text in one language into another) to more linguistic NLP tasks such as dependency parsing (i.e. the analysis of syntactic or semantic relationships between words). The data is taken from a combination of multilingual benchmarks, shared tasks, and published results in NLP conferences. Demographic and linguistic information necessary for the estimation of demands was obtained from a variety of sources, including Ethnologue, Glottolog, and the World Trade Organisation.

3 Results and Analysis

3.1 General observations

Figure 1 presents an overview of our main findings. Unsurprisingly, most NLP tasks we focus on fare substantially better when utility is measured demographically rather than linguistically.

Text-to-speech synthesis is the task with the most linguistic coverage: the published results (due to a single study (Black, 2019)) cover more than 630 languages (or about 10% of the world's languages). However, for the vast majority of these languages the measured quality of the generated speech is about half as good as the exceptionally good English system (Ren et al., 2021). The next most linguistically diverse tasks are those regarding morphosyntactic analysis, i.e. morphological inflection and dependency parsing, which have been evaluated over 140 and 90 languages respectively. For these more esoteric tasks, which do not necessarily convey direct utility to a downstream user, the majority of the systems are in general very good.

Natural language inference (NLI; a representative natural language understanding task) and question answering (QA) lie on the opposite side of the spectrum: the established benchmarks have only focused on up to 15 and 17 languages respectively, leading to very low scores on the linguistic axis.

Figure 1: Left panel: linguistic and demographic global utility metrics for a number of language technology tasks. The red curve corresponds to the sequence where first the language with the largest number of users is set to utility 1, then the second, and so on. Right panel: recent historical progression of two language technology tasks: Inflection and Machine Translation from English.

In Figure 1 (right panel) we observe the progress of the utility metrics in tasks for which we had access to comparable data across a span of the last 7 years. The extensive efforts of the UniMorph project (Kirov et al., 2018) to cover as many languages as possible are visible in the "Inflection" plot, with significant improvements over time. On the other hand, the machine translation field is still in the process of ramping up following demographics and/or socioeconomic priorities, with improved linguistic coverage over the years.

The granularity of these findings can be increased on the basis of available data. Figure 2 additionally presents demographic utility across language populations for all tasks. The visualization allows for identification of ostensive gaps in received utility. The two bottom plots of Figure 2 display our metrics over speakers of a single language, based on question answering results for different spoken Arabic and Swahili lectal varieties (Faisal et al., 2021). This analysis shows that utility differences are small between Arabic vernaculars, although these systems still lag behind the systems for Modern Standard Arabic, while the utility level of Coastal Swahili speakers in Tanzania is about 10% lower than that for speakers in Kenya.

3.2 Priorities in NLP development

Given the current snapshot of NLP systems, we could ask which languages will lead to the largest global utility improvement. The relative importance of linguistic vs. demographic demands determines the priority ranking, as can be observed in Figure 3 for a sample of five tasks. Improving on the demographic-focused utility entails a greater emphasis on Mandarin Chinese, Hindi, Spanish, and other populous languages that are generally well-served by current technologies. Balancing linguistic and demographic considerations leads to prioritizing a more diverse set of languages, mostly Asian and African languages like Amharic, Bambara, Bengali, Thai, or Yoruba, which are both populous and under-served, along with other large but severely under-served languages like Kurdish, Urdu, and Oromo. Further emphasis on linguistic utility would lead to prioritization of indigenous and potentially endangered languages of small communities like Aimele, Itelmen, North Sami, or Warlpiri, which are currently largely ignored by NLP research (Bird, 2020).

3.3 The role of society, economy, and academia

Now we turn to our large-scale analysis of NLP publications. First, this reveals that a substantial proportion of publications do not even describe in a clear and unequivocal manner the language (or languages) they are dealing with (Bender, 2011). Given the current prevalence of English as a language of study in NLP, in most cases the lack of an explicit reference to a particular language entails that the system deals with English exclusively.


[Figure 2 consists of per-task plots of Relative Quality against Number of Speakers, with panels: Dependency Parsing (M1 = 0.63), Morphological Inflection (M1 = 0.64), Natural Language Inference (M1 = 0.42), Question Answering (M1 = 0.36), Speech Synthesis (M1 = 0.32), Machine Translation X→English (M1 = 0.49), Machine Translation X→Spanish (M1 = 0.36), Machine Translation X→Bengali (M1 = 0.10), QA on Arabic Vernaculars (M1 = 0.58, reference: (Written) Modern Standard Arabic), and QA on Swahili Vernaculars (M1 = 0.23, reference: (Written) Coastal Swahili).]

acw: Hijazi Arabic, aeb: Tunisian Arabic, ajp: South Levantine Arabic, aka: Aka, amh: Amharic, arq: Algerian Arabic, ary: Moroccan Arabic, arz: Egyptian Arabic, ben: Bengali, ces: Czech, cmn: Mandarin Chinese, deu: High German, ell: Greek, eng: English, fin: Finnish, hin: Hindi, kor: Korean, lin: Lingala, mal: Malayalam, por: Portuguese, spa: Spanish, swa: Swahili, tam: Tamil, tgl: Tagalog.

Figure 2: Illustration of our metric on demographic-focused utility (τ = 1) on various NLP tasks.


[Figure 3 shows, for five tasks (MT to English, MT from English, Syntactic Analysis (Dep. Parsing), Morphological Inflection, and Speech Synthesis), the top-3 priority languages at values of τ ranging from 1 down to 0.01.]

Figure 3: The priority languages (top-3 shown) change with different balancing of demographic and linguistic utility, with focus shifting from populous languages e.g. Mandarin (cmn) and Hindi (hin) to more under-served languages.

This perhaps reflects a more deep-seated issue at play, reflected in the citation of papers over time. Independently of publication venue, year, or subfield of NLP research, the number of languages a publication deals with is not predictive of how many citations it will accrue over time (see Figure 4, top right panel). In other words, if citations can be regarded as a proxy for academic incentives, scientists and developers are presented with little to no additional academic reward when tackling data, problems, or tasks involving more than one language.

This naturally leads to the question of what explains the production of language technologies across languages to start with, which will necessarily involve agents, mechanisms, and data outside of the scope of NLP publications themselves. Nevertheless, in order to contribute to this investigation, we determined whether approximate measures of economic centrality or number of language users were better predictors of the sheer number of papers published for any given language (see Appendix C). While both variables are substantially collinear, we find that approximate GDP (rather than number of users) leads to a substantially smaller prediction error of the number of published papers.

4 Discussion

Our study, covering diverse NLP tasks and types of evidence, makes apparent the immense inequality in the development of language technologies across the world's languages. After English, a handful of Western European languages dominate the field (in particular German, French, and Spanish), as well as even fewer non-Indo-European languages, primarily Chinese, Japanese, and Arabic. Our preliminary investigation suggests that it is the economic prowess of the users of a language (rather than the sheer demographic demand) that drives the development of language technologies.

In spite of this, for some tasks (such as Inflection) there is an encouraging trend of both demographic- and linguistic-utility improving year-over-year. This is due to the nature of the task; reasonably accurate solutions can be achieved through small but highly-curated data. Since linguistic expertise on the languages of the world is, naturally, globally distributed, the main hurdle these tasks face is to pool such expertise under the premise of a common technical goal. In this respect, relatively low-cost and bottom-up actions that gather experts to work on specific NLP tasks (such as Universal Dependencies and UniMorph) have succeeded in accelerating the cross-linguistic development of language technologies. These prosper mainly on the basis of academic incentives, as those individuals or groups who contribute data and/or expertise are rewarded with individual publications or co-authorship in collective publications. Many of these contributions, which do not necessarily involve hefty resource investments but instead linguistic expertise, are markedly different from the typical publications in language technologies.

However, these more esoteric tasks are tenuously associated with those that users are more likely to interact with, such as Machine Translation or Speech Synthesis. User-facing tasks all have in common a tight dependency on computational resources and large data, which in turn hinge on substantial financial means. In a context of pressing user needs across multiple populations and languages, we submit that future developments on policies aimed at furthering cross-linguistic technologies would benefit from clear (and possibly standardized) metrics that assist in streamlining complex decisions regarding resource allocation. Our measures of global coverage fulfill that role, and help identify large but currently under-served languages. While we do not attempt to supplant the necessary in-depth evaluation of the needs of each individual group and language, they provide a common ground for coordinating global efforts across heterogeneous actors.

Figure 4: Left panel: treemap of the number of NLP publications per language (with area proportional to the number). eng: English, zho: Chinese, deu: German, fra: French, spa: Spanish, jpn: Japanese, rus: Russian, nld: Dutch, ces: Czech, por: Portuguese, tur: Turkish, swe: Swedish, ita: Italian, fin: Finnish, ell: Greek, lat: Latin, hun: Hungarian, ara: Arabic, kor: Korean, hin: Hindi, pol: Polish, dan: Danish. Right top panel: relative citation rate vs. number of languages in the publication. Right bottom panel: number of publications according to number of language users and approximate GDP. Point size and transparency scale with number of publications.

Acknowledgements

This work was supported by NSF Award 2040926.

References

Julius A Adebayo et al. 2016. FairML: ToolBox for diagnosing bias in predictive modeling. Ph.D. thesis, Massachusetts Institute of Technology.

Waleed Ammar, Dirk Groeneveld, Chandra Bhagavatula, Iz Beltagy, Miles Crawford, Doug Downey, Jason Dunkelberger, Ahmed Elgohary, Sergey Feldman, Vu Ha, Rodney Kinney, Sebastian Kohlmeier, Kyle Lo, Tyler Murray, Hsu-Han Ooi, Matthew Peters, Joanna Power, Sam Skjonsberg, Lucy Wang, Chris Wilhelm, Zheng Yuan, Madeleine van Zuylen, and Oren Etzioni. 2018. Construction of the literature graph in Semantic Scholar. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers), pages 84–91, New Orleans, Louisiana. Association for Computational Linguistics.

Gopala Krishna Anumanchipalli, Kishore Prahallad, and Alan W Black. 2011. Festvox: Tools for creation and analyses of large speech corpora. In Workshop on Very Large Scale Phonetics Research, UPenn, Philadelphia, page 70.

Mihael Arcan, Maja Popović, Paul Buitelaar, et al. 2016. Asistent – a machine translation system for Slovene, Serbian and Croatian. In Proceedings of the 10th Conference on Language Technologies and Digital Humanities, Ljubljana, Slovenia.

Naveen Arivazhagan, Ankur Bapna, Orhan Firat, Dmitry Lepikhin, Melvin Johnson, Maxim Krikun, Mia Xu Chen, Yuan Cao, George Foster, Colin Cherry, et al. 2019. Massively multilingual neural machine translation in the wild: Findings and challenges. arXiv:1907.05019.

Loïc Barrault, Magdalena Biesialska, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Yvette Graham, Roman Grundkiewicz, Barry Haddow, Matthias Huck, Eric Joanis, Tom Kocmi, Philipp Koehn, Chi-kiu Lo, Nikola Ljubešić, Christof Monz, Makoto Morishita, Masaaki Nagata, Toshiaki Nakazawa, Santanu Pal, Matt Post, and Marcos Zampieri. 2020. Findings of the 2020 conference on machine translation (WMT20). In Proceedings of the Fifth Conference on Machine Translation, pages 1–55, Online. Association for Computational Linguistics.

Regina Barzilay and Min-Yen Kan, editors. 2017. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Vancouver, Canada.

Rachel KE Bellamy, Kuntal Dey, Michael Hind, Samuel C Hoffman, Stephanie Houde, Kalapriya Kannan, Pranay Lohia, Jacquelyn Martino, Sameep Mehta, Aleksandra Mojsilovic, et al. 2018. AI Fairness 360: An extensible toolkit for detecting, understanding, and mitigating unwanted algorithmic bias. arXiv:1810.01943.

Emily M Bender. 2011. On achieving and evaluating language-independence in NLP. Linguistic Issues in Language Technology, 6(3):1–26.

Steven Bird. 2020. Decolonising speech and language technology. In Proceedings of the 28th International Conference on Computational Linguistics, pages 3504–3519, Barcelona, Spain (Online). International Committee on Computational Linguistics.

Alan W Black. 2019. CMU Wilderness multilingual speech dataset. In ICASSP 2019 – 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5971–5975. IEEE.

Joy Buolamwini and Timnit Gebru. 2018. Gender shades: Intersectional accuracy disparities in commercial gender classification. In Conference on Fairness, Accountability and Transparency, pages 77–91.

Paul-Christian Bürkner. 2017. brms: An R package for Bayesian multilevel models using Stan. Journal of Statistical Software, 80(1):1–28.

Aylin Caliskan, Joanna J Bryson, and Arvind Narayanan. 2017. Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334):183–186.

Bob Carpenter, Andrew Gelman, Matthew D Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Marcus A Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. 2017. Stan: A probabilistic programming language. Grantee Submission, 76(1):1–32.

Jonathan H Clark, Eunsol Choi, Michael Collins, Dan Garrette, Tom Kwiatkowski, Vitaly Nikolaev, and Jennimaria Palomaki. 2020. TyDi QA: A benchmark for information-seeking question answering in typologically diverse languages. Transactions of the Association for Computational Linguistics, 8:454–470.

Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel Bowman, Holger Schwenk, and Veselin Stoyanov. 2018. XNLI: Evaluating cross-lingual sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2475–2485.

Ryan Cotterell, Christo Kirov, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Arya D. McCarthy, Katharina Kann, Sabrina J. Mielke, Garrett Nicolai, Miikka Silfverberg, David Yarowsky, Jason Eisner, and Mans Hulden. 2018. The CoNLL–SIGMORPHON 2018 shared task: Universal morphological reinflection. In Proceedings of the CoNLL–SIGMORPHON 2018 Shared Task: Universal Morphological Reinflection, pages 1–27, Brussels. Association for Computational Linguistics.

Ryan Cotterell, Christo Kirov, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Patrick Xia, Manaal Faruqui, Sandra Kübler, David Yarowsky, Jason Eisner, and Mans Hulden. 2017. CoNLL-SIGMORPHON 2017 shared task: Universal morphological reinflection in 52 languages. In Proceedings of the CoNLL SIGMORPHON 2017 Shared Task: Universal Morphological Reinflection, pages 1–30, Vancouver. Association for Computational Linguistics.

Ryan Cotterell, Christo Kirov, John Sylak-Glassman, David Yarowsky, Jason Eisner, and Mans Hulden. 2016. The SIGMORPHON 2016 shared task—Morphological reinflection. In Proceedings of the 14th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 10–22, Berlin, Germany. Association for Computational Linguistics.

Maria De-Arteaga, Alexey Romanov, Hanna Wallach, Jennifer Chayes, Christian Borgs, Alexandra Chouldechova, Sahin Geyik, Krishnaram Kenthapadi, and Adam Tauman Kalai. 2019. Bias in bios: A case study of semantic representation bias in a high-stakes setting. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pages 120–128.

David M. Eberhard, Gary F. Simons, and Charles D. Fennig. 2018. Ethnologue: Languages of the World, twenty-second edition. SIL International.

Fahim Faisal, Sharlina Keshava, Md Mahfuz ibn Alam, and Antonios Anastasopoulos. 2021. SD-QA: Spoken Dialectal Question Answering for the Real World. Preprint.

Yoav Goldberg. 2017. Neural network methods for natural language processing. Synthesis Lectures on Human Language Technologies, 10(1):1–309.

Iryna Gurevych and Yusuke Miyao, editors. 2018. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Melbourne, Australia.

Harald Hammarström. 2015. "Ethnologue" 16/17/18th editions: A comprehensive review. Language, 91(3):723–737.

Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. 2021. DeBERTa: Decoding-enhanced BERT with disentangled attention. In Proceedings of the International Conference on Learning Representations.


Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan, editors. 2019. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong, China.

Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika Bali, and Monojit Choudhury. 2020. The state and fate of linguistic diversity and inclusion in the NLP world. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6282–6293, Online. Association for Computational Linguistics.

Nathan Kallus. 2014. Predicting crowd behavior with big public data. In Proceedings of the 23rd International Conference on World Wide Web, pages 625–630.

Christo Kirov, Ryan Cotterell, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Patrick Xia, Manaal Faruqui, Sabrina J Mielke, Arya D McCarthy, Sandra Kübler, et al. 2018. UniMorph 2.0: Universal morphology. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).

Kevin Knight, Ani Nenkova, and Owen Rambow, editors. 2016. Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, San Diego, California.

Dan Kondratyuk and Milan Straka. 2019. 75 languages, 1 model: Parsing Universal Dependencies universally. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2779–2795.

Anna Korhonen, David Traum, and Lluís Màrquez, editors. 2019. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Florence, Italy.

R Kubichek. 1993. Mel-cepstral distance measure for objective speech quality assessment. In Proceedings of IEEE Pacific Rim Conference on Communications, Computers and Signal Processing, volume 1, pages 125–128. IEEE.

Samuel Läubli, Rico Sennrich, and Martin Volk. 2018. Has machine translation achieved human parity? A case for document-level evaluation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4791–4796.

M Paul Lewis, Gary F Simons, Charles D Fennig, et al. 2009. Ethnologue: Languages of the World, volume 16. SIL International.

Arya D. McCarthy, Ekaterina Vylomova, Shijie Wu, Chaitanya Malaviya, Lawrence Wolf-Sonkin, Garrett Nicolai, Christo Kirov, Miikka Silfverberg, Sabrina J. Mielke, Jeffrey Heinz, Ryan Cotterell, and Mans Hulden. 2019. The SIGMORPHON 2019 shared task: Morphological analysis in context and cross-lingual transfer for inflection. In Proceedings of the 16th Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 229–244, Florence, Italy. Association for Computational Linguistics.

Sylvester O Orimaye, Jojo SM Wong, Karen J Golden, Chee P Wong, and Ireneous N Soyiri. 2017. Predicting probable Alzheimer's disease using linguistic deficits and biomarkers. BMC Bioinformatics, 18(1):1–13.

Martha Palmer, Rebecca Hwa, and Sebastian Riedel, editors. 2017. Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Copenhagen, Denmark.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proc. ACL, pages 311–318.

Inioluwa Deborah Raji and Joy Buolamwini. 2019. Actionable auditing: Investigating the impact of publicly naming biased performance results of commercial AI products. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, pages 429–435.

Yi Ren, Chenxu Hu, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. 2021. FastSpeech 2: Fast and high-quality end-to-end text-to-speech. In Proceedings of the International Conference on Learning Representations (ICLR).

Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun'ichi Tsujii, editors. 2018. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Brussels, Belgium.

Maarten Sap, Dallas Card, Saadia Gabriel, Yejin Choi, and Noah A Smith. 2019. The risk of racial bias in hate speech detection. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1668–1678.

Till Speicher, Hoda Heidari, Nina Grgic-Hlaca, Krishna P Gummadi, Adish Singla, Adrian Weller, and Muhammad Bilal Zafar. 2018. A unified approach to quantifying algorithmic unfairness: Measuring individual & group unfairness via inequality indices. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2239–2248.

Milan Straka. 2018. UDPipe 2.0 prototype at CoNLL 2018 UD shared task. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 197–207.

Rachael Tatman. 2017. Gender and dialect bias in YouTube's automatic captions. In Proceedings of the First ACL Workshop on Ethics in Natural Language Processing, pages 53–59, Valencia, Spain. Association for Computational Linguistics.

Rachael Tatman and Conner Kasten. 2017. Effects of talker dialect, gender & race on accuracy of Bing Speech and YouTube automatic captions. In INTERSPEECH, pages 934–938.

Reut Tsarfaty, Dan Bareket, Stav Klein, and Amit Seker. 2020. From SPMRL to NMRL: What did we learn (and unlearn) in a decade of parsing morphologically-rich languages (MRLs)? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7396–7408, Online. Association for Computational Linguistics.

Aki Vehtari, Andrew Gelman, and Jonah Gabry. 2017. Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Statistics and Computing, 27(5):1413–1432.

Ekaterina Vylomova, Jennifer White, Elizabeth Salesky, Sabrina J. Mielke, Shijie Wu, Edoardo Maria Ponti, Rowan Hall Maudslay, Ran Zmigrod, Josef Valvoda, Svetlana Toldova, Francis Tyers, Elena Klyachko, Ilya Yegorov, Natalia Krizhanovsky, Paula Czarnowska, Irene Nikkarinen, Andrew Krizhanovsky, Tiago Pimentel, Lucas Torroba Hennigen, Christo Kirov, Garrett Nicolai, Adina Williams, Antonios Anastasopoulos, Hilaria Cruz, Eleanor Chodroff, Ryan Cotterell, Miikka Silfverberg, and Mans Hulden. 2020. SIGMORPHON 2020 shared task 0: Typologically diverse morphological inflection. In Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 1–39, Online. Association for Computational Linguistics.

Marilyn Walker, Heng Ji, and Amanda Stent, editors. 2018. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). Association for Computational Linguistics, New Orleans, Louisiana.

Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu, editors. 2020. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Online.

Mengzhou Xia, Antonios Anastasopoulos, Ruochen Xu, Yiming Yang, and Graham Neubig. 2020. Predicting performance for natural language processing tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8625–8646, Online. Association for Computational Linguistics.

A Materials

Publication data We rely on papers available through the Anthology of the Association for Computational Linguistics,4 which hosts more than 60 thousand papers from all major NLP conferences. We rely on Semantic Scholar (Ammar et al., 2018) for citation information.

We make the working assumption that a mention of a language in a research paper likely entails that the underlying research involves this language. We follow an automatic pipeline for finding language mentions in a paper, which starts by converting the paper PDF to a machine-readable format. We then search within the paper for any mention of a language's English name(s), its endonym, as well as its ISO or Glottolog code. We then apply a post-processing step to ensure the precision of this pipeline, as our simple text-based search is prone to false positives for languages whose names match common English words (e.g. She, Male, Label, Even, The, Are), common placenames (e.g. Colorado, Nara, Sydney), parts of author names (e.g. Su, Kim, Dan, Ali, Rama), or mathematical notation (e.g. Dji, Dii).
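The mention search and its disambiguation filter can be sketched as below. The tiny language inventory and the "followed by the word language" heuristic are invented simplifications for illustration, not the paper's actual filter:

```python
import re

# Hypothetical mini-inventory; the real pipeline uses full ISO/Glottolog name lists.
LANGUAGES = {
    "swa": ["Swahili", "Kiswahili"],
    "she": ["She"],  # language name collides with the English pronoun
}
# Names matching common English words are only counted when followed
# by a disambiguating cue (here, simply the word "language").
AMBIGUOUS = {"She", "Male", "Label", "Even", "The", "Are"}

def languages_mentioned(text):
    """Return the set of language codes whose names appear in the text."""
    found = set()
    for code, names in LANGUAGES.items():
        for name in names:
            pattern = rf"\b{re.escape(name)}\b"
            if name in AMBIGUOUS:
                pattern += r"(?=\s+language)"  # require e.g. "She language"
            if re.search(pattern, text):
                found.add(code)
    return found
```

With this filter, "She said the Swahili corpus grew" matches only Swahili, while "a grammar of the She language" matches the She language.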

In addition, we enrich each publication by imputing its research area. There were 16 research areas identified, based on the ones represented at recent major NLP conferences (specifically starting with the 2019 version of EMNLP, and removing some of the areas that were unique to that conference). For each area, we identified 1-6 publication venues from the ACL Anthology, where more venues were chosen when each venue had relatively few publications. Based on the abstracts of papers from each of these venues, we trained a bag-of-words classifier using the linear support vector machine implementation in scikit-learn,5 and applied this classifier to the abstracts of the papers we wanted to classify. Necessary data and code to reproduce these results are released in the supplementary material.
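A minimal sketch of such an area classifier, assuming toy abstracts and area labels (the actual training data comes from the selected ACL Anthology venues):

```python
# Bag-of-words linear SVM over paper abstracts, as described above.
# The abstracts and the two area labels here are invented placeholders.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train_abstracts = [
    "we train a neural machine translation model on parallel corpora",
    "a transformer decoder improves translation quality for low-resource pairs",
    "we parse sentences into dependency trees with a biaffine parser",
    "universal dependencies treebanks enable cross-lingual parsing",
]
train_areas = ["MT", "MT", "Parsing", "Parsing"]

clf = make_pipeline(CountVectorizer(), LinearSVC())
clf.fit(train_abstracts, train_areas)
print(clf.predict(["a new attention model for machine translation"]))
```

The pipeline object bundles vectorization and classification, so new abstracts can be classified directly from raw text.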

Data Sources and Metrics for Utility The majority of NLP research relies on automatic evaluation metrics over datasets annotated with gold-standard outputs. The advantage of this approach is that it allows consistent comparisons between systems and a seamless evaluation of progress on a specific evaluation set. On the other hand, there is no guarantee that even a statistically significant improvement on an automatic metric translates to improvements in user-perceived utility. Nevertheless, the reality is that virtually all published NLP research reports automatic evaluation metrics, with only a tiny fraction diverging from the norm by, e.g., using human evaluations.

4: https://www.aclweb.org/anthology/
5: https://scikit-learn.org/stable/

Our analysis assumes that all named languages have standard versions that are comprehensible and acceptable to all members of the population identified as "speakers" in our sources. However, we have the demographic information necessary for more fine-grained analysis in only a handful of languages. While this assumption is certainly an oversimplification, we nevertheless believe it does not detract from our paper's arguments.

For a completely fair comparison across languages, one would ideally compute automatic metrics over the same or an equally representative evaluation set. For our language understanding case study this requirement is satisfied, as the XNLI 15-language test sets are translations of the same evaluation set. Utility in this case, where the evaluation metric m is accuracy, will be equal to the accuracy on each language l's test set: utility(l, m) = m_l.

Natural language understanding results are sourced from the XNLI leaderboard (Conneau et al., 2018), which contains test datasets with premise-hypothesis pairs in 15 languages.

For question answering (QA) we aggregate results from two established multilingual benchmarks, namely TyDi-QA (Clark et al., 2020) and MLQA (Lewis et al., 2009). Both benchmarks focus on extractive question answering, i.e. finding the text span of a given document that answers, if possible, a given question. The two benchmarks jointly cover 17 languages. We keep the highest results for languages that are shared between the two datasets (English and Arabic). For this task we equate utility with test set F-score, a measure that meaningfully combines precision and recall of the retrieved answer span.

For machine translation, we collected more than 500 published MT results from all WMT and IWSLT evaluation campaigns, as well as more than 50 MT studies from the last three years' ACL, EMNLP, and NAACL conferences (Barzilay and Kan, 2017; Gurevych and Miyao, 2018; Palmer et al., 2017; Riloff et al., 2018; Knight et al., 2016; Walker et al., 2018; Korhonen et al., 2019; Inui et al., 2019; Webber et al., 2020). In the machine translation field the most popular evaluation metric is BLEU (Papineni et al., 2002). In our MT case studies we estimate utility based on a normalized version of BLEU, such that for translation from s to t with BLEU(s, t) over an established test set, we have utility(s, t, BLEU) ≈ BLEU(s, t) / Z. The normalizing factor Z = max_{L×L} BLEU is equivalent to the largest reported BLEU, which we equate to the largest attainable utility at the snapshot of interest. In all our MT case studies we use Z = 70, which is the BLEU score reported for translation between Serbian and Croatian (Arcan et al., 2016).

For text-to-speech synthesis, we relied on results from the CMU Wilderness project (Black, 2019), which builds TTS voices with FestVox (Anumanchipalli et al., 2011), and compared them to the English system of Ren et al. (2021). The quality of the synthesized audio is evaluated using mel-cepstral distortion (Kubichek, 1993, MCD), a distortion measure that compares synthesized examples with originals (lower is better). Each MCD score x_l for a language l was converted to a relative utility score by applying the transformation (x_max − x_l) / (x_max − x_min), where x_max and x_min correspond to the highest (worst) and lowest (best) observed MCD scores across all languages.
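The min-max transformation above can be sketched as follows; the per-language MCD scores are invented for illustration:

```python
def mcd_utilities(mcd_by_lang):
    """Convert per-language MCD scores (lower is better) into relative
    utilities via (x_max - x_l) / (x_max - x_min)."""
    xs = mcd_by_lang.values()
    x_max, x_min = max(xs), min(xs)
    return {l: (x_max - x) / (x_max - x_min) for l, x in mcd_by_lang.items()}

scores = {"eng": 4.0, "swa": 7.0, "hau": 10.0}  # hypothetical MCD values
print(mcd_utilities(scores))  # best MCD maps to utility 1.0, worst to 0.0
```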

For syntactic analysis through dependency parsing, we relied on results from two state-of-the-art systems, UDPipe (Straka, 2018) and UDify (Kondratyuk and Straka, 2019). The systems are typically evaluated using two measures, Unlabeled and Labeled Attachment Score (UAS and LAS), which measure the overlap between human-created and automatically-produced syntactic trees, excluding punctuation. For our metrics we use LAS, which considers the semantic relation (e.g. Subj) used to label the attachment between two words.

The results on morphological inflection were taken from the findings of the corresponding shared tasks that have been taking place as part of the SIGMORPHON workshop for the past 5 years (Cotterell et al., 2016, 2017, 2018; McCarthy et al., 2019; Vylomova et al., 2020). The systems are evaluated using exact-match accuracy over a predefined test set in each language, simply comparing the correct inflected form with the system's output.

Population Demand We compile population statistics from various sources. We rely on Ethnologue (Eberhard et al., 2018) for language population statistics. We take special care when computing population statistics over macro-languages (e.g. Arabic, Chinese) and languages commonly spoken by L2 speakers (e.g. English) or across multiple dialects (e.g. for Spanish or Portuguese), aggregating populations across all variants.

Economic Indicators for Demand We aggregate economic information on international trade, as provided by the World Trade Organisation (WTO) through the World Integrated Trade Solution.6 Since each language community can be geographically associated with a member nation of the WTO, we can then estimate economic indicators for and between language communities.7

In a monolingual setting, we rely on the most recent GDP estimates associated with each language community. For example, the 1.7 million Nahuatl speakers represent about 1.3% of Mexico's population, and thus the final GDP associated with the Nahuatl language will be 1.3% of Mexico's GDP.

Modeling demand in a bilingual setting (across two languages) is also feasible using economic indicators. For instance, the amount of trade between two language communities could be used to approximate the need for translation between the two. Specifically, if we use the normalized import volume per language community, then we can estimate demand for an s→t translation system as demand(s, t) ∝ v^{import}_{s→t}, such that Σ_{s∈L} v^{import}_{s→t} = 1. Take the Azerbaijani language as an example: Azerbaijan's imports mainly come from the Russian Federation (16.8%), Turkey (14.7%), China (11.2%), the US (8.5%), Ukraine (5.5%), and Germany (5.5%).8 Hence, we can assign a proportional weight to model demand for translation from Russian, Turkish, Chinese, English, Ukrainian, and German into Azerbaijani, respectively. One could equivalently use the normalized volume of exports instead.
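The normalization can be sketched directly from the reported import shares (the percentages are those quoted for Azerbaijan; the language-code mapping is ours):

```python
# Import partner shares into Azerbaijan (2017, WITS), keyed by the
# main language of each partner country.
import_share_into_aze = {
    "rus": 16.8, "tur": 14.7, "zho": 11.2,
    "eng": 8.5, "ukr": 5.5, "deu": 5.5,
}

def demand_weights(shares):
    """Normalize partner import shares so that the demand weights for
    translation into the target language sum to 1 over the listed sources."""
    total = sum(shares.values())
    return {src: share / total for src, share in shares.items()}

weights = demand_weights(import_share_into_aze)
# Russian->Azerbaijani receives the largest demand weight.
```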

This is only straightforward to compute in cases where a language is easy to map to a specific country. In cases of languages that are commonly used across many countries, e.g. German (which is the main language in both Germany and Austria), or macro-languages spoken in larger regions of the world, we combine the weights accordingly in order to jointly model the demand for the whole language community.

Table 3 presents the top-15 translation pairs based on demand estimated from economic indicators, namely the import (and export) partner share of the target (source) language. We note that this ranking does not take underlying populations into account, using only the percentage of demand for each language community. Several entries in Table 3 are language pairs that are rarely, if ever, studied in MT case studies, like Belarusian-Russian, Mongolian-Mandarin Chinese, Albanian-Italian, or Russian-Armenian.

6: https://wits.worldbank.org/
7: Our conclusions and analyses based on WITS data are the responsibility of the authors and do not represent the opinion of the WTO.
8: Source: https://wits.worldbank.org/CountryProfile/en/Country/AZE/Year/2017/TradeFlow/Import

B Methods

Predicting Utility on Unseen Languages/Pairs One of the main disadvantages of using solely published results for estimating quality and, hence, utility, is the lack of evaluations on all languages or language pairs. Furthermore, not all languages or pairs are consistently evaluated on newly developed models. To counter this issue, we propose a more comprehensive approach which attempts to predict the expected quality/utility for languages or language pairs unseen in the collected literature.

A naive approach is to make the approximation that utility on any unseen language is 0. However crude, this could be a valid assumption in many cases: consider the example of a language understanding system trained on all languages that appear in Wikipedia. Such a system, without proper modifications, would not be able to handle input in Yupik or Dhivehi (Maldivian), since these languages are not represented in Wikipedia and they use different writing systems than any other language. Note that, in such a case, for a system evaluated over a classification task as in a language understanding setting, the expected utility is not 0, but is rather the expected quality of random outputs (33% in the case of three-way classification).

Estimating MT quality with pivoting In the case of machine translation, pivoting is a viable approach for producing translations between any arbitrary language pair, as long as the intermediate systems exist. Even if no published results exist on translation from German to Chinese, it is unreasonable to assign an expected utility of 0 to such an MT system, since there exist high-quality German-English and English-Chinese systems.

In the case of cascaded systems, though, estimating utility requires a careful approach, due to error propagation. Consider a system A with accuracy 80% and a system B with accuracy also 80%. A cascaded system where the output of system A is provided as input to system B will have an expected accuracy of 64%, not 80%.

An important point is that there is no reason to pivot through a single language. Consider the example of Catalan to Chinese translation. A path from Catalan to Spanish, to English, to Chinese might yield a higher estimated utility than a single-language pivoting path, since its components are of higher quality.

We devise a method that allows us to generalize this notion in order to find the highest estimated utility for every language pair. We construct a weighted directed graph G = (V, E) with each node v ∈ V representing a language. The weighted directed edge e_{s→t} between nodes s and t has a weight equal to the highest reported normalized BLEU score on translation from s to t. If no results have been published on this language pair, we set the weight of that edge to 0.

With graph G in hand, as long as a path from node s to node t exists, we can estimate the expected normalized BLEU of s–t translation as the maximum cumulative (multiplicative) weight over any path from s to t. If no path exists, the estimate is 0. This is possible even in cases where a language is reported as only a source or only a target in the literature; for example, Greek (ell) only appears as a source in a single study (reporting Greek–English translation results), which allows us to estimate Greek–X utility by pivoting through English, but we cannot produce estimates for X–Greek. Table 4 presents translation pairs where our estimated utility (normalized BLEU score) is higher than the published results.
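Since all edge weights lie in [0, 1], the maximum-product path can be found with a Dijkstra-style search that always expands the node with the largest accumulated product. The following is a sketch of that idea with invented edge weights, not the authors' implementation:

```python
import heapq

def best_pivot_utility(edges, src, dst):
    """Estimate utility for src->dst as the maximum product of
    normalized-BLEU edge weights over any path in the graph.

    edges: dict mapping (s, t) -> normalized BLEU in [0, 1].
    Returns 0.0 if no path exists.
    """
    graph = {}
    for (s, t), w in edges.items():
        graph.setdefault(s, []).append((t, w))
    # Max-product variant of Dijkstra: products only shrink along a path
    # (weights <= 1), so expanding the largest product first is safe.
    best = {src: 1.0}
    heap = [(-1.0, src)]  # negate so heapq pops the largest product first
    while heap:
        neg_p, node = heapq.heappop(heap)
        p = -neg_p
        if node == dst:
            return p
        if p < best.get(node, 0.0):
            continue  # stale heap entry
        for nxt, w in graph.get(node, []):
            q = p * w
            if q > best.get(nxt, 0.0):
                best[nxt] = q
                heapq.heappush(heap, (-q, nxt))
    return 0.0

# Hypothetical normalized-BLEU edges for the Catalan-to-Chinese example:
edges = {("cat", "spa"): 0.8, ("spa", "eng"): 0.9,
         ("eng", "zho"): 0.5, ("cat", "eng"): 0.6}
# Best path is cat -> spa -> eng -> zho, product 0.8 * 0.9 * 0.5 ≈ 0.36,
# beating the single-pivot path cat -> eng -> zho (0.6 * 0.5 = 0.30).
print(best_pivot_utility(edges, "cat", "zho"))
```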

C Bibliometric Analysis

Analysis of Citations To each publication we associate its citation percentile relative to its year and event. We analyze normalized citations (C) through Bayesian generalized additive mixed effects models implemented in R with brms and Stan (Bürkner, 2017; Carpenter et al., 2017). We utilize default weakly informative priors for all parameters, and we run four MCMC chains for each model, which in all cases achieved convergence. The distribution of C is described through a beta distribution, whose expected value is given by

E[C] = logit(f(L) + α_A + β_A · L)    (1)

where f(L) is a smooth function (on the basis of thin plate splines) depending on the number of languages dealt with in the paper (L), and α_A and β_A are random intercepts and slopes according to each area, respectively. In order to evaluate the support in favor of f(L), we compared the leave-one-out (LOO) performance of this model against a counterpart without this term,

E[C] = logit(α_A + β_A · L)    (2)

The difference in expected log pointwise predictive density (which serves to inform model selection; Vehtari et al., 2017) between the two models is -0.9 (SE = 0.6), which implies there is no major performance difference between the two.

Analysis of Number of Publications We determine the total estimated number of papers in which each language l was involved (P_l). The resulting distribution has a large concentration of zero values, so we opt to model this through a zero-inflated negative binomial distribution. We focus on two parameters: the expected value of the number of publications (E[P]) and the mixture probability (π). In both cases, we fit models considering three possibilities: (1) a smooth (thin plate spline) function of the log-GDP, (2) a smooth (thin plate spline) function of the log-number of speakers, and (3) a fixed parameter. This leads to evaluating 9 models through a LOO criterion. The model that involves (1) for both parameters displays the best overall performance (see SI).

D Machine Translation Case Studies

We use this section to expand on the discussion of the MT case studies.

Translation involving English Since translation involves two languages and language communities, there are two natural ways for a speaker to receive utility from an MT system: either by being the source (with their language being translated into another) or by having another language translated into theirs (target). We disentangle the two by only using each one at a time for our utility calculations.

Utilities based on demographics are similar for both settings, with M1 = 0.25 (from English) and M1 = 0.27 (to English). Since published results only cover 101 languages, the linguistic diversity scores are much lower, with M0 around 0.005.
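The M1 and M0 scores are τ-weighted global averages of per-language utility: a minimal sketch, assuming weights proportional to (number of speakers)^τ, so that τ = 1 weights by population and τ = 0 counts every language equally. The language names and all numbers below are hypothetical:

```python
def global_utility(speakers, quality, tau):
    """tau-weighted average utility over languages.

    tau=1: languages weighted by speaker population (demographic, M1).
    tau=0: every language weighted equally (linguistic diversity, M0).
    Languages without any system contribute quality 0.
    """
    weights = {l: n ** tau for l, n in speakers.items()}
    total = sum(weights.values())
    return sum(weights[l] * quality.get(l, 0.0) for l in speakers) / total

# Hypothetical world: two large languages served well, 98 small ones not
speakers = {"big1": 500e6, "big2": 300e6,
            **{f"small{i}": 1e5 for i in range(98)}}
quality = {"big1": 0.9, "big2": 0.8}
m1 = global_utility(speakers, quality, tau=1)  # dominated by big languages
m0 = global_utility(speakers, quality, tau=0)  # dragged down by the unserved
```

This is why coverage of only 101 languages leaves the demographic score high while the diversity score collapses: most of the world's languages enter the τ = 0 average with utility zero.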


Translation among all languages  We extend our study to translation among all languages (still maintaining the distinction between a language used as source or target). We base our estimates of utility on any reported results, as well as on accuracy estimates based on a pivoting approach. Briefly outlined, our pivoting estimation approach finds the best-performing translation path for language pairs without reported results; e.g., since no studies report translation accuracy when translating from Greek to Chinese, we find that among all possible translation paths, translating from Greek to English and from English to Chinese yields the highest expected accuracy. We outline the process in the Materials and Methods section.
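The pivoting search described above can be sketched as follows. This is an illustrative re-implementation, not the paper's code: it assumes a one-pivot path and a multiplicative combination of the two legs' relative-quality scores, and all scores in the example are invented:

```python
def best_pivot(reported, src, tgt):
    """Estimate quality for an unreported pair via the best one-pivot path.

    reported: dict mapping (source, target) -> relative quality in [0, 1].
    The two legs are combined multiplicatively here (an assumption);
    returns (estimated quality, pivot language or None for direct pairs).
    """
    if (src, tgt) in reported:
        return reported[(src, tgt)], None
    # Candidate pivots appear both as a target and as a source somewhere
    pivots = {t for (_, t) in reported} & {s for (s, _) in reported}
    best_q, best_p = 0.0, None
    for p in pivots:
        if p in (src, tgt):
            continue
        if (src, p) in reported and (p, tgt) in reported:
            q = reported[(src, p)] * reported[(p, tgt)]
            if q > best_q:
                best_q, best_p = q, p
    return best_q, best_p

# Hypothetical scores: no direct ell-cmn result, so a pivot is needed
scores = {("ell", "eng"): 0.8, ("eng", "cmn"): 0.7,
          ("ell", "fra"): 0.75, ("fra", "cmn"): 0.5}
q, pivot = best_pivot(scores, "ell", "cmn")
```

In this toy setting the English path (0.8 × 0.7) beats the French one (0.75 × 0.5), mirroring the paper's finding that English is the best, and often only, available pivot.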

Perhaps unexpectedly, the best (and often only) pivot is English in almost all cases. As a result, the final utility for a language X depends heavily on the utility of the X–Eng (or Eng–X) systems. This is reflected in our demographic- and language-averaged scores being very similar to the ones obtained when we only focused on English. Nevertheless, the differences between scores for different languages are stark: the demographic-averaged utility for populous, well-studied languages like German (M1 = 0.356), Chinese (M1 = 0.232), or French (M1 = 0.309) is almost double that of underserved ones like Bengali (M1 = 0.148), isiXhosa (M1 = 0.156), Amharic (M1 = 0.148), or Burmese (M1 = 0.092). Figure 5 visualizes the different scores for translation from 24 languages under the demographic focus (τ = 1).

Rank  Lang.  Pop. (M)  Studies (X–eng / eng–X)

 1  cmn  908.8  16/4
 2  spa  358.8   5/6
 3  hin  299.5   3/1
 4  ben  232.8   2/0
 5  por  207.7   3/3
 6  ara  205.4   9/6
 7  rus  145.6   9/6
 8  jpn  128.0   7/4
 9  swa   89.2   1/1
10  msa   80.3   2/0
11  kor   77.3   4/0
12  vie   76.0   4/6
13  mar   73.0   2/0
14  tam   72.0   2/0
15  tur   65.9   9/4
16  guj   48.3   1/1
17  fra   47.1  12/17
18  ind   43.4   2/0
19  ita   42.8   8/6
20  urd   35.0   2/0
21  mya   31.4   2/0
22  mal   30.7   0/0
23  deu   30.4  25/33
24  orm   28.0   1/0
25  uzb   27.9   0/0
26  ukr   27.3   3/1
27  pol   25.0   2/0
28  aze   19.5   5/2
29  sin   17.6   1/1
30  ron   16.8  13/11

Table 2: Machine Translation research interest in translation to and from English does not match our population-based demand model.


[Figure 5 residue: 24 panels, one per source language (ara→X, aze→X, ben→X, cat→X, cmn→X, deu→X, ell→X, eng→X, fin→X, fra→X, glg→X, hau→X, ita→X, kin→X, kor→X, por→X, rus→X, spa→X, swa→X, tam→X, tur→X, uig→X, vie→X, zul→X), each plotting Relative Quality (0.0–1.0) against Number of Speakers for the target languages eng, deu, cmn, spa, por, hin, ell, kor, tam, ben, and other.]


Figure 5: Visualization of our measure on translation from 24 diverse languages.


Figure 6: Cumulative citations vs. number of languages in publications, by topic.


Rank  Based on Imports  Based on Exports

 1  rus–bel  bel–rus
 2  rus–kaz  mon–cmn
 3  rus–hye  sqi–ita
 4  rus–mon  hye–rus
 5  rus–cmn  tgl–jpn
 6  spa–som  nep–hin
 7  hin–nep  aze–ita
 8  ita–sqi  srp–bos
 9  lit–lav  lav–lit
10  rus–aze  msa–jpn
11  cmn–mya  lit–rus
12  rus–fin  mya–cmn
13  rus–ukr  est–fin
14  cmn–tha  bos–hrv
15  jpn–tgl  kat–rus

Table 3: Top-15 translation pairs based on demand estimated from economic indicators (import (export) partner share of the target (source) language).

Language Pair  Estimated BLEU  Published BLEU  Pivot

slv–srp  37.09  25.45  eng–hrv
eng–nep  10.56   6.8   guj–hin
eng–hrv  60.80  42.15  srp
eng–hin  13.78  12.5   guj
hrv–eng  50.42  48.07  srp
ron–deu  29.36  18.4   eng
ron–fra  33.98  26.53  eng
ces–rus  17.56  16.2   eng
ces–deu  23.36  19.3   eng
ces–fra  27.04  18.1   eng
ita–deu  26.08  19.85  eng
rus–ces  18.19  14.4   eng
pol–ces   9.90   7.2   eng
nld–deu  25.0   21.06  eng
heb–fra  27.41  23.25  eng
srp–slv  52.09  35.39  hrv
deu–ron  27.25  16.27  eng
deu–ces  25.19  20.1   eng
deu–ita  28.42  18.56  eng
deu–nld  26.48  20.31  eng
deu–fra  44.27  37.3   eng
fra–ron  23.52  19.3   eng
fra–ces  21.73  13.7   eng
fra–heb  18.88  13.54  eng
spa–ces  17.83  15.2   por–eng
ara–fra  26.83  25.07  eng
slv–hrv  55.64  40.44  eng–srp

Table 4: Translation pairs with a pivoting-estimated utility (BLEU score) higher than the published result.

Negbinomial Parameter  Zero-inflated Parameter  ELPD difference  SE

log-GDP    log-GDP        0     0
log-GDP    log-Users   -20.2   6.3
log-Users  log-GDP     -31.9   9.8
log-Users  log-Users   -69.8  13.2
log-GDP    Fixed       -87.9  15.1
log-Users  Fixed      -125.2  17.4
Fixed      log-GDP    -263.3  40.7
Fixed      log-Users  -307.9  41.9
Fixed      Fixed      -437.1  46.9

Table 5: ELPD model selection for the GDP and number-of-users analysis, ordered from top (best) to bottom (worst).