arXiv:2104.05753v1 [cs.CL] 12 Apr 2021

5
Towards a parallel corpus of Portuguese and the Bantu language Emakhuwa of Mozambique Felermino D. M. A. Ali 1 , Andrew Caines 2 , Jaimito L. A. Malavi 1 1 Department of Computer Engineering, Lurio University, Mozambique 2 Computer Laboratory, University of Cambridge, United Kingdom {felermino.ali,jmalave}@unilurio.ac.mz, [email protected] Abstract Major advancement in the performance of ma- chine translation models has been made possi- ble in part thanks to the availability of large- scale parallel corpora. But for most languages in the world, the existence of such corpora is rare. Emakhuwa, a language spoken in Mozambique, is like most African languages low-resource in NLP terms. It lacks both com- putational and linguistic resources and, to the best of our knowledge, few parallel corpora in- cluding Emakhuwa already exist. In this pa- per we describe the creation of the Emakhuwa- Portuguese parallel corpus, which is a collec- tion of texts from the Jehovah’s Witness web- site and a variety of other sources including the African Story Book website, the Univer- sal Declaration of Human Rights and Mozam- bican legal documents. The dataset contains 47,415 sentence pairs, amounting to 699,976 word tokens of Emakhuwa and 877,595 word tokens in Portuguese. After normalization pro- cesses which remain to be completed, the cor- pus will be made freely available for research use. 1 Introduction Machine translation (MT) is the process by which computational models are trained to transform a source language text into a target language text, and is a technology which in recent years has seen great improvements in performance thanks to the development of MT with neural networks (NMT) (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Bahdanau et al., 2015). NMT typically de- pends on supervised machine learning from large parallel corpora of texts in the source and target language pair. The creation of such corpora usually depends on extrinsic motivation for the existence of abundant parallel text: for instance, government activity in multilingual settings such as the Euro- pean and Canadian Parliaments. In the context of NMT, this fact has led to a dichotomous situation of high and low-resource language pairings. For the academic community, English-German or English- French are prototypical high-resource translation pairs whereas most other language pairs are low- resource in comparison. We present a corpus for a low-resource language pair: Portuguese and Emakhuwa (alternatively, ‘Makhuwa’ or ‘Makua’), the official and the most widely spoken languages of Mozambique, respec- tively. Emakhuwa is spoken in all the provinces of northern Mozambique, namely Niassa, Cabo Delgado and Nampula and also in Zambezia in central northern Mozambique. It is estimated that approximately 25% of the country’s population of 30 million people make use of the language on a daily basis as an alternative to Portuguese (de Paula and Duarte, 2016). Emakhuwa is also spoken in some of the neighbouring countries to the north of Mozambique – namely, Tanzania and Malawi (Ngunga, 2012) but speaker populations in these countries are relatively small. There are 8 variants of Emakhuwa (Ngunga and Faquir, 2012): Elomwe (ISO-639 code: ngl), Esankaci, Esaaka (ISO-639 code: xsq), Echirima (ISO-639 code: mhm), Emarevoni (ISO-639 code: xmc), Emeeto (ISO-639 code: mgh), Enahara and Central (ISO-639 code: vmw). The variants are distributed across different districts of northern and central Mozambique (see Figure 1) and dif- fer slightly in terms of accents and lexicon. Like many languages spoken on the African con- tinent, Emakhuwa has limited resources for compu- tational linguistics compared with English, French, Portuguese, and so on. Thus we are collecting a dataset for MT in which Portuguese texts are paired with Emakhuwa translations. The 47,415 sentence pairs we have collected contain 699,976 word to- kens of Emakhuwa and 877,595 word tokens in Portuguese. The corpus will be made available for arXiv:2104.05753v1 [cs.CL] 12 Apr 2021

Transcript of arXiv:2104.05753v1 [cs.CL] 12 Apr 2021

Page 1: arXiv:2104.05753v1 [cs.CL] 12 Apr 2021

Towards a parallel corpus of Portuguese and the Bantu languageEmakhuwa of Mozambique

Felermino D. M. A. Ali1, Andrew Caines2, Jaimito L. A. Malavi1

1Department of Computer Engineering, Lurio University, Mozambique2 Computer Laboratory, University of Cambridge, United Kingdom

{felermino.ali,jmalave}@unilurio.ac.mz,[email protected]

Abstract

Major advancement in the performance of ma-chine translation models has been made possi-ble in part thanks to the availability of large-scale parallel corpora. But for most languagesin the world, the existence of such corporais rare. Emakhuwa, a language spoken inMozambique, is like most African languageslow-resource in NLP terms. It lacks both com-putational and linguistic resources and, to thebest of our knowledge, few parallel corpora in-cluding Emakhuwa already exist. In this pa-per we describe the creation of the Emakhuwa-Portuguese parallel corpus, which is a collec-tion of texts from the Jehovah’s Witness web-site and a variety of other sources includingthe African Story Book website, the Univer-sal Declaration of Human Rights and Mozam-bican legal documents. The dataset contains47,415 sentence pairs, amounting to 699,976word tokens of Emakhuwa and 877,595 wordtokens in Portuguese. After normalization pro-cesses which remain to be completed, the cor-pus will be made freely available for researchuse.

1 Introduction

Machine translation (MT) is the process by whichcomputational models are trained to transform asource language text into a target language text,and is a technology which in recent years has seengreat improvements in performance thanks to thedevelopment of MT with neural networks (NMT)(Kalchbrenner and Blunsom, 2013; Sutskever et al.,2014; Bahdanau et al., 2015). NMT typically de-pends on supervised machine learning from largeparallel corpora of texts in the source and targetlanguage pair. The creation of such corpora usuallydepends on extrinsic motivation for the existenceof abundant parallel text: for instance, governmentactivity in multilingual settings such as the Euro-pean and Canadian Parliaments. In the context of

NMT, this fact has led to a dichotomous situation ofhigh and low-resource language pairings. For theacademic community, English-German or English-French are prototypical high-resource translationpairs whereas most other language pairs are low-resource in comparison.

We present a corpus for a low-resource languagepair: Portuguese and Emakhuwa (alternatively,‘Makhuwa’ or ‘Makua’), the official and the mostwidely spoken languages of Mozambique, respec-tively. Emakhuwa is spoken in all the provincesof northern Mozambique, namely Niassa, CaboDelgado and Nampula and also in Zambezia incentral northern Mozambique. It is estimated thatapproximately 25% of the country’s population of30 million people make use of the language on adaily basis as an alternative to Portuguese (de Paulaand Duarte, 2016). Emakhuwa is also spoken insome of the neighbouring countries to the northof Mozambique – namely, Tanzania and Malawi(Ngunga, 2012) but speaker populations in thesecountries are relatively small.

There are 8 variants of Emakhuwa (Ngungaand Faquir, 2012): Elomwe (ISO-639 code: ngl),Esankaci, Esaaka (ISO-639 code: xsq), Echirima(ISO-639 code: mhm), Emarevoni (ISO-639 code:xmc), Emeeto (ISO-639 code: mgh), Enahara andCentral (ISO-639 code: vmw). The variants aredistributed across different districts of northernand central Mozambique (see Figure 1) and dif-fer slightly in terms of accents and lexicon.

Like many languages spoken on the African con-tinent, Emakhuwa has limited resources for compu-tational linguistics compared with English, French,Portuguese, and so on. Thus we are collecting adataset for MT in which Portuguese texts are pairedwith Emakhuwa translations. The 47,415 sentencepairs we have collected contain 699,976 word to-kens of Emakhuwa and 877,595 word tokens inPortuguese. The corpus will be made available for

arX

iv:2

104.

0575

3v1

[cs

.CL

] 1

2 A

pr 2

021

Page 2: arXiv:2104.05753v1 [cs.CL] 12 Apr 2021

Figure 1: Geographic distribution of Emakhuwa variants in Mozambique (Ngunga and Faquir, 2012).

research use when final normalization and data col-lection has been completed1. In addition we willseek out new data sources and continue to expandthe corpus.

2 Related Work

In general, African languages have been relativelylittle explored by the NLP research community.However, lately this scenario seems to be improv-ing as more data are being made available forresearch use, including translation work. OPUS(Tiedemann, 2012), the 1000Langs corpus crawledfrom Bible corpora (Asgari and Schutze, 2017),and the JW300 project (Agic and Vulic, 2019)are examples of this trend, as they made availableparallel corpora of over 300 languages includingEmakhuwa-Portuguese. The Crubadan Project2

also used Jehovah’s Witness resources to createa corpus of Emakhuwa. However, the corpusis monolingual and relatively small with around44,071 tokens extracted from 4 documents.

The Masakhane project works mainly on low-resource MT for African languages, and is an ex-cellent source of African language parallel corpora(Nekoto et al., 2020). Currently, they have a grow-ing online community of researchers who collab-orate, contribute and share advancements in NLP

1placeholder.com2http://crubadan.org/languages/vmw

for African languages. They have an open reposi-tory 3 for sharing data as well as code and tools forbuilding and facilitating NLP research in Africanlanguages. Nevertheless, the Emakhuwa languagehas not been explored yet by the community.

3 Data

The dataset contains Portuguese-Emakhuwa par-allel sentences, collected from crawling the Je-hovah’s Witness (JW) website4 and the AfricanStory Book website5. The African Story Book con-tains 17 short stories in Emakhuwa with Portuguesetranslations. The Jehovah’s Witness website con-tains The Watchtower—Study Edition magazine aswell as The Watchtower and Awaque! from 2016to 2021. Also, contains 27 books of the New Testa-ment Bible.

The corpus also contains text from various in-stitutional sources. These were digitized by per-forming optical character recognition (OCR) on aset of PDF files from Centro Catequetico Paulo VI(Paul VI Catechetical Centre), the Universal Dec-laration of Human Rights, and Mozambican LandLaw. The majority of texts come from the CentroCatequetico Paulo VI which has published a series

3https://github.com/masakhane-io/masakhane-mt

4https://www.jw.org5https://www.africanstorybook.org

Page 3: arXiv:2104.05753v1 [cs.CL] 12 Apr 2021

Source Portuguese Emakhuwa EnglishJW depois, eu percebi que

essa tinha sido a decisaocerta.

nuuvira okathi kihoonawira vaari voolokaothaama.

After that, I realized thatit was the right choice

CentroCatequeticoPaulo VI

vivo aquilo em queacredito ?

vekekhai kinnettela seiyosina minaka ?

Do I live on what Ibelieve?

AfricanStory Book

Certo homem tinha doisfilhos

Mulopwana mmosaaahikhalana an’awe anli.

A man had two sons

Declarationof HumanRights

todos os seres humanosnascem livres e iguais emdignidade e em direitos .dotados de razao e deconsciencia , devem agiruns para com os outrosem espırito defraternidade .

atthu othene aniyariaoolikana ni owilamulamoota ontthunaya okhala ,variyari v edignidade niedireito . akhalanne esariani otthokelela , ahaanaakhalasaka othene sayavamurettele .

the law needs to be knownby the whole community,because from theknowledge of the law wecan defend and safeguardour rights.

Land Law ofMozam-bique

nas areas rurais , ascomunidades locaisparticipam :

mulaponi mwa imatta ,ikomunidade sa elapoyeyo sokhalana mpantta :

in rural areas, localcommunities participate:

Table 1: Examples from the Portuguese-Emakhuwa parallel corpus, with English translation.

of three catechism books: Yesu Mopoli Ahu – Je-sus Nosso Salvador (‘Jesus Our Saviour’). Thesebooks, besides being religious, also touch upon awide range of topics.

The dataset contains texts written in the centralEmakhuwa variant only, because this variant hasrelatively more resources than others. The sen-tences come from 125 pair documents in total, andamount to 699,976 word tokens of Emakhuwa. Ta-ble 1 shows examples from the dataset, with Por-tuguese and Emakhuwa parallel sentences and En-glish glosses.

Processing of the OCR obtained from these doc-uments involved the removal of some inconsisten-cies. For instance, the letter “ı” was wrongly recog-nized as “T”, and the letter “a” was recognized as“6”, to give just two examples of OCR errors. Pro-cessing also involved dividing longer sentences intoshorter and concise sentences as well as discard-ing sentences wrongly recognized. Then followedthe alignment of Portuguese and Emakhuwa sen-tences which has been manually carried out. Moreprocessing needs to be done, as it transpires thatspelling standards vary between the data sources.As an example, there are some lexical conflicts inthe corpus, particularly regarding the use of the let-ters “j” and “c” along with the suffix or infix “erya”.JW texts tend to use the letter “cerya”, while others

Source Lang Sentences TokensJW PT 42,840 798,371

VMW 638,365CCA PT 4067 73,221

VMW 56,724ASB PT 294 2746

VMW 1945LL PT 128 1656

VMW 1832HR PT 86 1601

VMW 1110Total PT 47,415 877,595

VMW 699,976

Table 2: Dataset counts where: JW = Jehovah’s Wit-ness, CCA = Centro Catequetico Paulo VI, ASB =African Book Story, LL = Land Law of Mozambique,HR = Universal Declaration of Human Rights; PT =Portuguese, VMW = Emakhuwa.

use “jerya”. So for example the English word “be-gin” would be written as “opacerya” in JW texts,whereas in others resources it would be written as“opajerya”. Another example is the English word“kiss”: its translation to Emakhuwa in the AfricanStory Book is “epeeco”, whereas in other resourcesis translated as “epexo”.

Word borrowing is another issue to take into

Page 4: arXiv:2104.05753v1 [cs.CL] 12 Apr 2021

TargetSource Portuguese EmakhuwaPortuguese - 12.42Emakhuwa 13.09 -

Table 3: BLEU scores for Emakhuwa-Portuguese andPortuguese-Emakhuwa NMT models.

account, since Emakhuwa takes many words fromPortuguese, but without a consensus on spellingconventions. As an example, take the word “bible”which in Portuguese is ”bıblia”. In Emakhuwait is sometimes written as “biblia” and at othertimes written as “biibiliya”. This inconsistency alsoexists within Emakhuwa language resources itself,as the alphabet and spelling standards went throughseveral revisions – the latest in 2012 (Ngunga andFaquir, 2012). Therefore, some future processingis necessary to normalize the word standards in thedataset. The corpus will be available for researchuse, and updates may be obtained from our projectwebsite6. Table 2 summarizes the corpus in termsof sentences and the number of tokens coming fromeach data source.

4 Baseline translation model

We train an initial NMT model with the defaultOpenNMT configuration7, which consists of a 2-layer LSTM with 500 hidden units on both theencoder and decoder (Klein et al., 2017). Sentencesfrom the corpus are randomly assigned to trainingand test sets at a ratio of 9:1. These data splitswill be released with the corpus so that others maycompare the performance of their models againstthis one. We evaluate with BLEU (Papineni et al.,2002) as shown in Table 3. This presents a baselinelevel of performance to improve upon in futurework.

5 Conclusion

In this paper we describe the creation of a new par-allel corpus of Emakhuwa and Portuguese. Thepreparation of the corpus is on-going as the textsrequire more processing and normalization, butit will be made freely available for research use,most probably for machine translation. Currently,the dataset is made up of mostly religious and legalcontent but in future our objective is to diversify the

6placeholder.com7https://github.com/OpenNMT/OpenNMT-

py/blob/master/docs/source/quickstart.md

range of sources and topics covered. This is impor-tant in order that Portuguese-Emakhuwa translationmodels work well across various domains.

Acknowledgements

The authors wish to acknowledge the CentroCatequetico Paulo VI for allowing to use their mate-rials for research purpose. The second author is sup-ported by Research England via the University ofCambridge Global Challenges Research Fund. Hewishes to thank Paula Buttery, Tanya Hall, CarolNightingale & Sara Serradas Duarte for their helpand guidance.

ReferencesZeljko Agic and Ivan Vulic. 2019. JW300: A wide-

coverage parallel corpus for low-resource languages.In Proceedings of the 57th Annual Meeting of theAssociation for Computational Linguistics, pages3204–3210, Florence, Italy. Association for Compu-tational Linguistics.

Ehsaneddin Asgari and Hinrich Schutze. 2017. Past,present, future: A computational investigation of thetypology of tense in 1000 languages. In Proceed-ings of the 2017 Conference on Empirical Methodsin Natural Language Processing.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Ben-gio. 2015. Neural machine translation by jointlylearning to align and translate. In Proceedings ofthe International Conference on Learning Represen-tations (ICLR).

Nal Kalchbrenner and Phil Blunsom. 2013. Recurrentcontinuous translation models. In Proceedings ofthe 2013 Conference on Empirical Methods in Natu-ral Language Processing (EMNLP).

Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senel-lart, and Alexander Rush. 2017. OpenNMT: Open-source toolkit for neural machine translation. In Pro-ceedings of ACL 2017, System Demonstrations.

Wilhelmina Nekoto, Vukosi Marivate, TshinondiwaMatsila, Timi Fasubaa, Taiwo Fagbohungbe,Solomon Oluwole Akinola, Shamsuddeen Muham-mad, Salomon Kabongo Kabenamualu, SalomeyOsei, Freshia Sackey, Rubungo Andre Niyongabo,Ricky Macharm, Perez Ogayo, Orevaoghene Ahia,Musie Meressa Berhe, Mofetoluwa Adeyemi,Masabata Mokgesi-Selinga, Lawrence Okegbemi,Laura Martinus, Kolawole Tajudeen, Kevin Degila,Kelechi Ogueji, Kathleen Siminyu, Julia Kreutzer,Jason Webster, Jamiil Toure Ali, Jade Abbott,Iroro Orife, Ignatius Ezeani, Idris AbdulkadirDangana, Herman Kamper, Hady Elsahar, Good-ness Duru, Ghollah Kioko, Murhabazi Espoir,Elan van Biljon, Daniel Whitenack, Christopher

Page 5: arXiv:2104.05753v1 [cs.CL] 12 Apr 2021

Onyefuluchi, Chris Chinenye Emezue, BonaventureF. P. Dossou, Blessing Sibanda, Blessing Bassey,Ayodele Olabiyi, Arshath Ramkilowan, Alp Oktem,Adewale Akinfaderin, and Abdallah Bashir. 2020.Participatory research for low-resourced machinetranslation: A case study in African languages.In Findings of the Association for ComputationalLinguistics: EMNLP 2020.

Armindo Ngunga. 2012. Interferencias delınguas mocambicanas em portugues faladoem mocambique. Revista Cientıfica Da Universi-dade Eduardo Mondlane, Serie: Letras e CienciasSociais, 1:7–20.

Armindo Ngunga and Osvaldo Faquir. 2012.Padronizacao da Ortografia de LınguasMocambicanas: Relatorio do III Seminario.Coleccao As Nossas Lınguas III. Maputo: CEA.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic eval-uation of machine translation. In Proceedings of the40th Annual Meeting of the Association for Compu-tational Linguistics.

Ronaldo Rodrigues de Paula and Fabio Bonfim Duarte.2016. Diversidade linguıstica em mocambique. InKadila: culturas e ambientes – Dialogos Brasil-Angola. Sao Paulo: Bluche.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014.Sequence to sequence learning with neural networks.In Proceedings of the 27th International Confer-ence on Neural Information Processing Systems(NeurIPS).

Jorg Tiedemann. 2012. Parallel data, tools and inter-faces in opus. In Proceedings of the Eight Interna-tional Conference on Language Resources and Eval-uation (LREC’12).