1 Survey on Publicly Available Sinhala Natural Language ...1 Survey on Publicly Available Sinhala...

18
1 Survey on Publicly Available Sinhala Natural Language Processing Tools and Research Nisansa de Silva AbstractSinhala is the native language of the Sinhalese people who make up the largest ethnic group of Sri Lanka. The language belongs to the globe-spanning language tree, Indo-European. However, due to poverty in both linguistic and economic capital, Sinhala, in the perspective of Natural Language Processing tools and research, remains a resource-poor language which has neither the economic drive its cousin English has nor the sheer push of the law of numbers a language such as Chinese has. A number of research groups from Sri Lanka have noticed this dearth and the resultant dire need for proper tools and research for Sinhala natural language processing. However, due to various reasons, these attempts seem to lack coordination and awareness of each other. The objective of this paper is to fill that gap of a comprehensive literature survey of the publicly available Sinhala natural language tools and research so that the researchers working in this field can better utilize contributions of their peers. As such, we shall be uploading this paper to arXiv and perpetually update it periodically to reflect the advances made in the field. Index Terms—Sinhala, Natural Language Processing, Resource Poor Language 1 I NTRODUCTION Sinhala 1 language, being the native language of the Sin- halese people [24], who make up the largest ethnic group of the island country of Sri Lanka, enjoys being reported as the mother tongue of Approximately 16 million people [5, 6]. To give a brief linguistic background for the purpose of aligning the Sinhala language with the baseline of English, primarily it should be noted that Sinhala language belongs same the Indo-European language tree [7, 8]. However, un- like English, which is part of the Germanic branch, Sinhala belongs to the Indo-Aryan branch. Further, Sinhala, unlike English, which borrowed the Latin alphabet, has its own writing system, which is a descendant of the Indian Brahmi script [915]. By extension, this makes Sinhala Script a member of the Aramaic family of scripts [16, 17]. Thus by in- heritance, Sinhala writing system is abugida (alphasyllabary), which to say that consonant-vowel sequences are written as single units [18]. It should be noted that the modern Sinhala language have loanwords from languages such as Tamil, English, Portuguese, and Dutch due to various historical reasons [19]. Regardless of the rich historical array of litera- ture spanning several millennia (starting between 3 rd to 2 nd century BCE [20, 21]), modern natural language processing tools for the Sinhala language are scarce [22]. Natural Language Processing (NLP) is a broad area cov- ering all computational processing and analysis of human languages. To achieve this end, NLP systems operate at different levels [2325]. A graphical representation of NLP layers and application domains are shown in Figure 1. On Nisansa de Silva is with the Department of Computer Science & Engi- neering, University of Moratuwa. E-mail: [email protected] Manuscript revised: January 14, 2020. 1. Englebretson and Genetti [1] observe that in some contexts the Sinhala language is also referred as Sinhalese, Singhala, and Singhalese one hand, according to Liddy [24], these systems can be categorized into the following layers; phonological, morpho- logical, lexical, syntactic, semantic, discourse, and pragmatic. The phonological layer deals with the interpretation of lan- guage sounds. As such, it consists of mainly speech-to-text and text-to-speech systems. In cases where one is working with written text of the language rather than speech, it is possible to replace this layer with tools which handle Op- tical Character Recognition (OCR) and language rendering standards (such as Unicode [26]). The morphological layer analyses words at their smallest units of meaning. As such, analysis on word lemmas and prefix-suffix-based inflection are handled in this layer. Lexical layer handles individual words. Therefore tasks such as Part of Speech (PoS) tagging happens here. The next layer, syntactic, takes place at the phrase and sentence level where grammatical structures are utilized to obtain meaning. Semantic layer attempts to derive the meanings from the word level to the sentence level. Starting with Named Entity Recognition (NER) at the word level and working its way up by identifying the contexts they are set in until arriving at overall meaning. The discourse layer handles meaning in textual units larger than a sentence. In this, the function of a particular sentence maybe contextualized within the document it is set in. Finally, the pragmatic layer handles contexts read into contents without having to be explicitly mentioned [23, 24]. Some forms of anaphora (coreference) resolution [2731] fall into this application. On the other hand, Wimalasuriya and Dou [25] catego- rize NLP tools and research by utility. They introduce three categories with increasing complexity; Information Retrieval (IR), Information Extraction (IE), and Natural Language Un- derstanding (NLU). Information Retrieval covers applications, which search and retrieve information which are relevant to a given query. For pure IR, tools and methods up-to and including the syntactic layer in the above analysis are arXiv:1906.02358v5 [cs.CL] 13 Jan 2020

Transcript of 1 Survey on Publicly Available Sinhala Natural Language ...1 Survey on Publicly Available Sinhala...

Page 1: 1 Survey on Publicly Available Sinhala Natural Language ...1 Survey on Publicly Available Sinhala Natural Language Processing Tools and Research Nisansa de Silva Abstract— Sinhala

1

Survey on Publicly Available Sinhala NaturalLanguage Processing Tools and Research

Nisansa de Silva

Abstract—

Sinhala is the native language of the Sinhalese people who make up the largest ethnic group of Sri Lanka. The language belongsto the globe-spanning language tree, Indo-European. However, due to poverty in both linguistic and economic capital, Sinhala, in theperspective of Natural Language Processing tools and research, remains a resource-poor language which has neither the economicdrive its cousin English has nor the sheer push of the law of numbers a language such as Chinese has. A number of research groups fromSri Lanka have noticed this dearth and the resultant dire need for proper tools and research for Sinhala natural language processing.However, due to various reasons, these attempts seem to lack coordination and awareness of each other. The objective of this paperis to fill that gap of a comprehensive literature survey of the publicly available Sinhala natural language tools and research so that theresearchers working in this field can better utilize contributions of their peers. As such, we shall be uploading this paper to arXiv andperpetually update it periodically to reflect the advances made in the field.

Index Terms—Sinhala, Natural Language Processing, Resource Poor Language

F

1 INTRODUCTION

Sinhala1 language, being the native language of the Sin-halese people [2–4], who make up the largest ethnic group ofthe island country of Sri Lanka, enjoys being reported as themother tongue of Approximately 16 million people [5, 6].To give a brief linguistic background for the purpose ofaligning the Sinhala language with the baseline of English,primarily it should be noted that Sinhala language belongssame the Indo-European language tree [7, 8]. However, un-like English, which is part of the Germanic branch, Sinhalabelongs to the Indo-Aryan branch. Further, Sinhala, unlikeEnglish, which borrowed the Latin alphabet, has its ownwriting system, which is a descendant of the Indian Brahmiscript [9–15]. By extension, this makes Sinhala Script amember of the Aramaic family of scripts [16, 17]. Thus by in-heritance, Sinhala writing system is abugida (alphasyllabary),which to say that consonant-vowel sequences are written assingle units [18]. It should be noted that the modern Sinhalalanguage have loanwords from languages such as Tamil,English, Portuguese, and Dutch due to various historicalreasons [19]. Regardless of the rich historical array of litera-ture spanning several millennia (starting between 3rd to 2nd

century BCE [20, 21]), modern natural language processingtools for the Sinhala language are scarce [22].

Natural Language Processing (NLP) is a broad area cov-ering all computational processing and analysis of humanlanguages. To achieve this end, NLP systems operate atdifferent levels [23–25]. A graphical representation of NLPlayers and application domains are shown in Figure 1. On

• Nisansa de Silva is with the Department of Computer Science & Engi-neering, University of Moratuwa.E-mail: [email protected]

Manuscript revised: January 14, 2020.1. Englebretson and Genetti [1] observe that in some contexts the

Sinhala language is also referred as Sinhalese, Singhala, and Singhalese

one hand, according to Liddy [24], these systems can becategorized into the following layers; phonological, morpho-logical, lexical, syntactic, semantic, discourse, and pragmatic.The phonological layer deals with the interpretation of lan-guage sounds. As such, it consists of mainly speech-to-textand text-to-speech systems. In cases where one is workingwith written text of the language rather than speech, it ispossible to replace this layer with tools which handle Op-tical Character Recognition (OCR) and language renderingstandards (such as Unicode [26]). The morphological layeranalyses words at their smallest units of meaning. As such,analysis on word lemmas and prefix-suffix-based inflectionare handled in this layer. Lexical layer handles individualwords. Therefore tasks such as Part of Speech (PoS) tagginghappens here. The next layer, syntactic, takes place at thephrase and sentence level where grammatical structuresare utilized to obtain meaning. Semantic layer attempts toderive the meanings from the word level to the sentencelevel. Starting with Named Entity Recognition (NER) atthe word level and working its way up by identifying thecontexts they are set in until arriving at overall meaning. Thediscourse layer handles meaning in textual units larger than asentence. In this, the function of a particular sentence maybecontextualized within the document it is set in. Finally, thepragmatic layer handles contexts read into contents withouthaving to be explicitly mentioned [23, 24]. Some formsof anaphora (coreference) resolution [27–31] fall into thisapplication.

On the other hand, Wimalasuriya and Dou [25] catego-rize NLP tools and research by utility. They introduce threecategories with increasing complexity; Information Retrieval(IR), Information Extraction (IE), and Natural Language Un-derstanding (NLU). Information Retrieval covers applications,which search and retrieve information which are relevantto a given query. For pure IR, tools and methods up-toand including the syntactic layer in the above analysis are

arX

iv:1

906.

0235

8v5

[cs

.CL

] 1

3 Ja

n 20

20

Page 2: 1 Survey on Publicly Available Sinhala Natural Language ...1 Survey on Publicly Available Sinhala Natural Language Processing Tools and Research Nisansa de Silva Abstract— Sinhala

2

Phonological Layer

Morphological Layer

Lexical Layer

Syntactic Layer

Semantic Layer

Discourse Layer

Pragmatic Layer

Pronunciation model

Morphological Rules

Lexicon

Discourse context

Domainknowledge

Speech Analysis 

Morphological Analysis

PoS tagging

GrammarParsing

Semantic modelNER and Semantic

tagging

ContextualReasoning

Application reasoning andexecution

Utteranceplanning

Semantic realisation

Syntactic realisation

Lexical realisation

Morphological realisation

Speech synthesis 

Input text Output text

Information Retrieval

Natural Language Understanding

Information Extraction

Fig. 1: NLP layers and tasks [23]

used. Information Extraction, on the other hand, extractsstructured information. The difference between IR and IEis the fact that IR does not change the structure of thedocuments in question. Be them structured, semi-structured,or unstructured, all IR does is fetching them as they are. Incomparison, IE, takes semi-structured or unstructured textand puts them in a machine readable structure. For this, IEutilizes all the layers used by IR and the semantic layer. Nat-ural Language Understanding is purely the idea of cognition.Most NLU tasks fall under AI-hard category and remainunsolved [23]. However, with varying accuracy, some NLUtasks such as machine translation2 are being attempted.The pragmatic layer of the above analysis belongs to theNLU tasks while the discourse layer straddles informationextraction and natural language understanding [23].

The objective of this paper is to serve as a comprehensivesurvey on the state of natural language processing resourcesfor the Sinhala language. The initial structure and contentof this survey are heavily influenced by the preliminarysurveys carried out by de Silva [22] and Wijeratne et al.[23]. However, our hope is to host this survey at arXivas a perpetually evolving work which continuously getsupdated as new research and tools for Sinhala languageare created and made publicly available. Hence, it is ourhope that this work will help future researchers who areengaged in Sinhala NLP research to conduct their literaturesurveys efficiently and comprehensively. For the success of

2. This is, however, not without the criticism of being nothing morethan a Chinese room [32] rather than true NLU.

this survey, we shall also consider the Sri Lankan NLP toolsrepository, lknlp3. This manuscript is at version 3.2. Thelatest version of the manuscript can be obtained from arXiv4

or ResearchGate5.The remainder of this survey is organized as follows;

Section 2 introduces some important properties and con-ventions of the Sinhala language which are important forthe development and understanding of Sinhala NLP. Sec-tion 3 discusses the various tools and research available forSinhala NLP. In this section we would discuss both pureSinhala NLP tools and research as well as hybrid Sinhala-English work. We will also discuss research and tools whichcontributes to Sinhala NLP either along with or by the helpof Tamil, the other official language of Sri Lanka. Section 4gives a brief introduction to the primary language sourcesused by the studies discussed in this work. Finally, Section 5,concludes the survey.

2 PROPERTIES OF THE SINHALA LANGUAGE

Before moving on to discussing Sinhala NLP resources, weshall give a brief introduction to some of the importantproperties of Sinhala language, which impact the develop-ment of Sinhala NLP resources. Sinhala grammar has twoforms: written (literary) and spoken. These forms differ fromeach-other in their core grammatical structures [1, 33, 34].The written form strictly adheres to the SOV (Subject,

3. https://github.com/lknlp/lknlp.github.io4. https://arxiv.org/abs/1906.023585. http://bit.ly/31AhvvR

Page 3: 1 Survey on Publicly Available Sinhala Natural Language ...1 Survey on Publicly Available Sinhala Natural Language Processing Tools and Research Nisansa de Silva Abstract— Sinhala

3

TABLE 1: Examples for grammatical cases and inflection of animate common nouns

Form CaseSingular Plural

Masculine Feminine Masculine FeminineDefinite Indefinite Definite Indefinite

1 Nominative සා ෙස ගැහැ යගැහැ යගැහැ ෙය

ස ගැහැ

2 Accusativeසා ෙස ගැහැ ය ගැහැ යක ගැහැ3 Auxiliary

4 Dative සාට ෙස ට ගැහැ යටගැහැ ය ටගැහැ යකට

ට ගැහැ ට

5 Genitiveසාෙ ෙස ෙ ගැහැ යෙ ගැහැ යකෙ ෙ ගැහැ ෙ6 Locative

7 Instrumentalසාෙග ෙස ෙග ගැහැ යෙග ගැහැ යකෙග ෙග ගැහැ ෙග8 Ablative

9 Vocative ස ස ගැහැ ය ගැහැ ය ගැහැ

TABLE 2: Examples for grammatical cases and inflectionof inanimate common nounsNote that the grammatical cases of Auxiliary and Vocative do not existfor inanimate nouns in Sinhala.

Form CaseSingular PluralDefinite Indefinite

1 Nominativeෙප ත ෙප ත ෙප2 Accusative

4 Dative ෙප තට ෙප තකට ෙප වලට5 Genitive ෙප ෙ

ෙප ෙතෙප තක ෙප වල6 Locative

7 Instrumental ෙප ෙතෙප

ෙප ත ෙප ව8 Ablative

Object, and Verb) configuration [35, 36]. Further, in thewritten form, subject-verb agreement is enforced [37] suchthat, in order to be grammatically correct, the subject and theverb must agree in terms of: gender (male/female), num-ber (singular/plural) and person (1st/2nd/3rd). However,in spoken Sinhala, the SOV order can be neglected [38]and male singular 3rd person verb can be used for allnouns [37]. Sinhala is also a head-final language, wherethe complements and modifiers would appear before theirheads [39] this is similar to that of English and dissimilarto that of French. In total, according to Abhayasinghe [40],there are 25 types of simple sentence structures in Sinhala.Similar to many Indo Aryan languages, animacy plays amajor role in Sinhala grammar in syntactic and semanticroles [41–43]. Comparative studies done by Noguchi [44]and by Miyagishi [34, 45] have found that animacy extendsits influence from phrase level to sentence level in Sinhala(e.g., Usage of post-positions [35, 46]). On this matter, Ta-ble 1 explains grammatical cases and inflections of animatecommon nouns while Table 2 explains grammatical casesand inflections of inanimate common nouns. We providea comparative analysis of parsing the very simple Englishsentence “I eat a red apple” and its Sinhala, Hindi, and Frenchtranslations in Fig 2. English and French parsing was doneusing the Stanford Parser6. Hindi parsing was done using the

6. http://nlp.stanford.edu:8080/parser/

IIIT-Hyderabad Parser7 and the study by Singh et al. [47].Herath et al. [48, 49] argue that pure Sinhala words did

not have suffixes and that adding suffixes was incorporatedto Sinhala after 12th century BC with the influx of Sanskritwords. With this, they declare Sinhala to have to followingtypes of words:

1) Suffixes2) Nouns3) Cases4) Verbs5) Conjunctions and articles6) Adjectives7) Demonstratives, Interrogatives, and negatives8) Particles and prefixes

They further divide nouns into five groups: material, agen-tive, common, abstract, and proper. In addition to these,they also introduce compound nouns.We show the nouncategorization proposed by Herath et al. [48] in Table 3.

TABLE 3: Noun categorization by Herath et al. [48]

Type ExamplesMaterial , ෙගයAgentive ව නා, මCommon ෙග යා, සාAbstract , උසProper ෙක ළඹ, අමරCompound බ ( +බ ), ම ( +ම )

Herath et al. [49] categorize Sinhala suffixes along theattributes of: gender, number, definiteness, case, and con-junctive. They further claim that there are 3 types of suffixes:Suf1 adds gender, number, and definiteness; Suf2 adds case;and Suf3 adds conjunctive. Conjunctive is claimed to beequivalent to too and and in English. We show an extensionof the suffix structure proposed by Herath et al. [49] inTable 4.

7. http://ltrc.iiit.ac.in/analyzer/

Page 4: 1 Survey on Publicly Available Sinhala Natural Language ...1 Survey on Publicly Available Sinhala Natural Language Processing Tools and Research Nisansa de Silva Abstract— Sinhala

4

I

PRP

NP

S

eat a red apple

DT JJ NN

NPVBP

VP

(a) English

මම

PRP

NP

S

ර� ඇප� ෙග�ය� ක�

NN

VP

JJ

NPJJ

NP V

(b) Sinhala

म�

PRP

NP

S

लाल सब खाता

VM

VGF

NN

VP

JJ

NP

VP

(c) Hindi

Je

CLS

VN

mange une pomme rouge

N ADJ

MWNDET

NP

V

S

(d) French

Fig. 2: Parse trees for the sentence “I eat a red apple” in four languages.

3 SINHALA NLP RESOURCES

In this section we generally follow the structure shown inFigure 1 for sectioning. However, in addition to that, wealso discuss topics such as available corpora, other datasets, dictionaries, and WordNets. We focus on NLP toolsand research rather than the mechanics of language scripthandling [50–55]. One of the earliest attempts on SinhalaNLP was done by Herath et al. [56]. However, progresson that project has been minimal due to the limitations oftheir time. The later work by Nandasara [57] has not caughtmuch of the advances done up to the time of its publication.Given that it was a decade old by the time the first editionof this survey was compiled, we observe the existence of

many new discoveries in Sinhala NLP which have not beentaken into account by it. A review on some challenges andopportunities of using Sinhala in computer science wasdone by Nandasara and Mikami [58]. At this point, it isworthy to note that the largest number of studies in SinhalaNLP has been on optical character recognition (OCR) ratherthan on higher levels of the hierarchy shown in Figure 1. Onthe other hand, the most prolific single project of SinhalaNLP we have observed so far is an attempt to create anend-to-end Sinhala-to-English translator [18, 59–75]. Tamil,the other official language of Sri Lanka is also a resourcepoor language. However, due to the existence of largerpopulations of Tamil speakers worldwide, including but notlimited to economic powerhouses such as India, there are

Page 5: 1 Survey on Publicly Available Sinhala Natural Language ...1 Survey on Publicly Available Sinhala Natural Language Processing Tools and Research Nisansa de Silva Abstract— Sinhala

5

TABLE 4: Extension of the suffix structure proposed by Herath et al. [49]Number = Singular (S) / Plural (P)Definite = Definite (D) / Indefinite (I) / Undecided (U)Case = Nominative (N) / Accusative (A) / Dative (Da) / Genitive (G) / Instrumentive (In) / Auxiliary (Au) / Locative (L) / Ablative (Ab)Conjun = with suffix ”th” (Y) / without suffix ”th” (N)

Suf1 Suf2 Suf3 Noun -tree- AttributesNumber Definite Case Conjun

- - - ගස (-s) P U N, A, V Nඅ - - ගස (the-) S D N, A, V Nඅ - - ගස (a-) S I N, A Nඅක - - ගසක (of a-) S I G Nඅ ට - ගසට (to the-) S D Da N- ඒ - ගෙස (in the-) S D Au N- වල - ගසවල (on -s) P U L N- වලට - ගසවලට (to -s) P U Da N- එ - ගෙස (of the-) S D G Nඅ ඉ - ගස (from a-) S I Au N- - උ ග (-s and) P U N, A Yඅ - ගස (-and) S D N, A Yඅ - උ ගස (a- too) S I N, A Yඅක - ගසක (of a -, too) S U N, A Yඅ ට ගසට (to the- too) S D Da Y- ඒ ගෙස (in -s too) S D Au Y- වල ගසවල (on -s too) P U L Y- වලට ගසවලට (to -s too) P U Da Y- එ ගෙස (of the- too) S D G Yඅ ඉ උ ගස (from a- too) S I Au Y

more research and tools available for Tamil NLP tasks [23].Therefore, it is rational to notice that Sinhala and Tamil NLPendeavours can help each other. Especially, given the abovefact, that these are official languages of Sri Lanka, resultsin the generation of parallel data sets in the form of officialgovernment documents and local news items. A numberof researchers make use of this opportunity. We shall bediscussing those applications in this paper as well. Further,there have been some fringe implementations, which bridgeSinhala with other languages such as Japanese [8, 21, 76–78].

3.1 CorporaFor any language, the key for NLP applications and im-plementations is the existence of adequate corpora. On thismatter, a relatively substantial Sinhala text corpus8 wascreated by Upeksha et al. [79, 80] by web crawling. Later asmaller Sinhala newes corpus9 was created by de Silva [22].Both of the above corpora are publicly available. However,none of these come close to the massive capacity and rangeof the existing English corpora. A word corpus of ap-proximately 35,000 entries was developed by Weerasingheet al. [81]. But it does not seem to be online anymore. A

8. https://osf.io/a5quv/9. https://osf.io/tdb84/

number of Sinhala-English parallel corpora were introducedby Guzman et al. [82]. This includes a 600k+ Sinhala-English subtitle pairs10 initially collected by [83], 45k+Sinhala-English sentence pairs from GNOME11, KDE12, andUbuntu13. Guzman et al. [82] further provided two mono-lingual corpora for Sinhala. Those were a 155k+ sentences offiltered Sinhala Wikipedia14 and 5178k+ sentences of Sinhalacommon crawl15.

As for Sinhala-Tamil corpora, Hameed et al. [84] claimto have built a sentence aligned Sinhala-Tamil parallel cor-pus and Mohamed et al. [85] claim to have built a wordaligned Sinhala-Tamil parallel corpus. However, at the timeof writing this paper, neither of them was publicly available.A very small Sinhala-Tamil aligned parallel corpus createdby Farhath et al. [86] using order papers of government ofSri Lanka is available to download16.

10. http://bit.ly/2KsFQxm11. http://bit.ly/2Z8q0fo12. http://bit.ly/2WLY6bI13. http://bit.ly/2wLVZGt14. http://bit.ly/2EQZ7oM15. http://bit.ly/2ZaQFZo16. http://bit.ly/2HTMEme

Page 6: 1 Survey on Publicly Available Sinhala Natural Language ...1 Survey on Publicly Available Sinhala Natural Language Processing Tools and Research Nisansa de Silva Abstract— Sinhala

6

3.2 Data SetsSpecific data sets for Sinhala, as expected, is scarce. How-ever, a Sinhala PoS tagged data set [87–89] is available todownload from github17. Further, a Sinhala NER data setcreated by Manamini et al. [90] is also available to downloadfrom github18.

Facebook has released FastText [91–93] models for theSinhala language trained using the Wikipedia corpus. Theyare available as both text models19 and binary files20. Usingthe above models by Facebook, a group at University ofMoratuwa has created an extended FastText model trainedon Wikipedia, News, and official government documents.The binary file21 of the trained model is available to bedownloaded. Herath et al. [94] has compiled a report onthe Sinhala lexicon for the purpose of establishing a basisfor NLP applications.

3.3 DictionariesA necessary component for the purpose of bridging Sinhalaand English resources are English-Sinhala dictionaries. Theearliest and most extensive Sinhala-English dictionary avail-able for consumption was by Malalasekera [95]. However,this dictionary is locked behind copyright laws and is notavailable for public research and development. This copy-right issue is shared with other printed dictionaries [96–101] as well. The dictionary by Kulatunga [102] is publiclyavailable for usage through an online web interface but doesnot provide API access or means to directly access the dataset. The largest publicly available English-Sinhala dictionarydata set is from a discontinued FireFox plug-in EnSiTip [103]which bears a more than passing resemblance to the abovedictionary by Kulatunga [102]. Hettige and Karunananda[62] claim to to have created a lexicon to help in theirattempt to create a system capable of English-to-Sinhalamachine translation. A review on the requirements forEnglish-Sinhala smart bilingual dictionary was conductedby Samarawickrama and Hettige [104].

There exists the government sponsored trilingual dic-tionary [105], which matches Sinhala, English, and Tamil.However, other than a crude web interface on the ministrywebsite, there is no efficient API or any other way for aresearcher to access the data on this dictionary. Weerasingheand Dias [106] have created a multilingual place namedatabase for Sri Lanka which may function both as a dic-tionary and a resource for certain NER tasks.

3.4 WordNetsWordNets [107] are extremely powerful and act as a versatilecomponent of many NLP applications. They encompass anumber of linguistic properties which exist between thewords in the lexicon of the language including but notlimited to: hyponymy, hypernymy, synonymy, and meronymy.Their uses range from simple gazetteer listing applica-tions [25] to information extraction based on semantic simi-larity [108, 109] or semantic oppositeness [110]. An attempt

17. http://bit.ly/2Krhrbv18. http://bit.ly/2XrwCoK19. http://bit.ly/2JXAyL820. http://bit.ly/2JY5J9c21. http://bit.ly/2WowH0h

has been made to build a Sinhala Wordnet by Wijesiri et al.[111]. For a time it was hosted on [112] but it too is nowdefunct and all the data and applications are lost. However,even at its peak, due to the lack of volunteers for the crowd-soured methodology of populating the WordNet, it was atbest an incomplete product. Another effort to build a SinhalaWordnet was initiated by Welgama et al. [113] indepen-dently from above; but it too have stopped progression evenbefore achieving the completion level of Wijesiri et al. [111].

3.5 Morphological AnalyzersAs shown in Fig 1, morphological analysis is a groundlevel necessary component for natural language processing.Given that Sinhala is a highly inflected language [22, 37, 38],a proper morphological analysis process is vital. The ear-liest attempt on Sinhala morphological analysis we haveobserved are the studies by Herath et al. [48, 49]. Theyare more of an analysis of Sinhala morphology rather thana working tool. It is also worth to note that these workspredates the introduction of Sinhala unicode and thus use atransliteration of Sinhala in the Latin alphabet.

The next attempt by Herath et al. [114] creates a modularunit structure for morphological analysis of Sinhala. Muchlater, as a step on their efforts to create a system with theability to do English-to-Sinhala machine translation, Hettigeand Karunananda [59] claim to have created a morphologi-cal analyzer (void of any public data or code), which linksto their studies of a Sinhala parser [60] and computationalgrammar [18]. Hettige et al. [71] further propose a multi-agent System for morphological analysis. Welgama et al.[115] attempted to evaluate machine learning approachesfor Sinhala morphological analysis. Yet another indepen-dent attempt to create a morphological parser for Sinhalaverbs was carried out by Fernando and Weerasinghe [116].Later, another study, which was restricted to morphologicalanalysis of Sinhala verbs was conducted by Dilshani andDias [117]. There was no indication on whether this workwas continued to cover other types of words. Further, otherthan this singular publication, no data or tools were madepublicly accessible. Nandathilaka et al. [118] proposed arule based approach for Sinhala lemmatizing. An extremelysimple plagiarism detection tool which only uses n-gramsof simply tokenized text was proposed by Basnayake et al.[119]. The work by Welgama et al. [120] claim to have seta set of gold standard definitions for the morphology ofSinhala Words; but given that their results are not publiclyavailable, further usage or confirmation of these claimscannot not be done.

3.6 Part of Speech TaggersThe next step after morphological analysis is Part of Speech(PoS) tagging. The PoS tags differ in number and function-ality from language to language. Therefore, the first step increating an effective PoS tagger is to identifying the PoStag set for the language. This work has been accomplishedby Fernando et al. [87] and Dilshani et al. [88]. Expandingon that, Fernando et al. [87] has introduced an SVM BasedPoS Tagger for Sinhala and then Fernando and Ranathunga[89] give an evaluation of different classifiers for the taskof Sinhala PoS tagging. While here it is obvious that there

Page 7: 1 Survey on Publicly Available Sinhala Natural Language ...1 Survey on Publicly Available Sinhala Natural Language Processing Tools and Research Nisansa de Silva Abstract— Sinhala

7

TABLE 5: Morphological Analyzers comparisonBase: Rule-based (RB) / Machine Learning (ML)Able to Handle Part of Speech (Handles): Yes (Y) / No (N)Outputs: Yes (Y) / No (N) / No Information (O)Abbreviations: Nouns (Nu), Verbs (Ve), Adjectives (Aj), Adverbs (Av), Function Words (Fn), Root (R), Person (P), Number (Nb), Gender (G),Article (A), Case (C)

Base Modus Operandi Handles OutputNu Ve Aj Av Fn R P Nb G A C

Hettige and Karunananda [59] RB Finite State Automata Y Y N N N Y Y Y Y Y YHettige et al. [71] RB Agent-based Y Y Y Y Y N N N N N NNandathilaka et al. [118] RB N/A Y N N N N Y N N N N NWelgama et al. [115] ML Morfessor algorithm Y Y Y Y Y Y N N N N NFernando and Weerasinghe [116] RB Finite State Transducer N Y N N N Y Y Y Y N YDilshani and Dias [117] RB N/A N Y N N N O O O O O O

has been some follow up work after the initial foundation,it seems, all of that has been internal to one researchgroup at one institution as neither the data nor the toolsof any of these findings have been made available for theuse of external researchers. Several attempts to create astochastic PoS tagger for Sinhala has been done with thestudies by Herath and Weerasinghe [121], Jayaweera andDias [122], and Jayasuriya and Weerasinghe [123] being themost notable. Within a single group which did one of theabove stochastic studies [122], yet another set of studieswas carried out to create a Sinhala PoS tagger starting withthe foundation of Jayaweera and Dias [124] which thenextended to a Hidden Markov Model (HMM) based ap-proach [125] and an analysis of unknown words [126, 127].Further, this group presented a comparison of few SinhalaPoS taggers that are available to them [128]. A RESTFul PoStagging web service created by Jayaweera and Dias [129]using the above research can still be accessed22 via POSTand GET. A hybrid PoS tagger for Sinhala language wasproposed by Gunasekara et al. [130].

3.7 ParsersThe PoS tagged data then needs to be handed over toa parser. This is an area which is not completely solvedeven in English due to various inherent ambiguities innatural languages. However, in the case of English, thereare systems which provide adequate results [131] even if notperfect yet. The Sinhala state of affairs, is that, the first parserfor the Sinhala language was proposed by Hettige andKarunananda [60] with a model for grammar [18]. The studyby Liyanage et al. [38] is concentrated on the same given thatthey have worked on formalizing a computational grammarfor Sinhala. While they do report reasonable results, yetagain, do not provide any means for the public to access thedata or the tools that they have developed. Kanduboda [37]have worked on Sinhala differential object markers relevantfor parsing.

The first attempt at a Sinhala parser, as mentioned above,was by Hettige and Karunananda [60] where they createdprototype Sinhala morphological analyzer and a parser aspart of their larger project to build an end-to-end transla-tor system. The function of the parser is based on threedictionaries: Base Dictionary, Rule Dictionary, and ConceptDictionary. They are built as follows:

22. http://bit.ly/2F0jKid

• The Base Dictionary: prakurthi (base words), nipatha(prepositions), upasarga (prefixes), and vibakthi (Irreg-ular Verbs).

• The Rule Dictionary: inflection rules used to gener-ate various forms of verbs and nouns from the basewords.

• The Concept Dictionary: synonyms and antonymsfor the words found in the base dictionary.

Parsers are, in essence, a computational representationof the grammar of a natural language. As such, in buildingSinhala parsers, it is crucial to create a computational modelfor Sinhala grammar. The first such attempt was takenby Hettige and Karunananda [18] with special considerationgiven to Morphology and the Syntax of the Sinhala languageas an extension to their earlier work [60]. Here, it is worthyto note that, unlike in their earlier attempt [60], where theyexplicitly mentioned that they are building a parser, in thisstudy [18], they use the much conservative claim of buildinga computational grammar. Under Morphology, they againhandled Sinhala inflection. Their system is based on a FiniteState Transducer (FST) and Context-Free Grammar (CFG)where they they modeled 85 rules for nouns and 18 rulesfor verbs. The specific implementation is more partial to arule-based composer rather than parser. It is also worthy tonote that this system could only handle simple sentenceswhich only contained the following 8 constituents: Attribu-tive Adjunct of Subject, Subject, Attributive Adjunct of Object,Object, Attributive Adjunct of Predicate, Attributive Adjunctof the Complement of Predicate, Complement of Predicate, andPredicate. With these, they propose the following grammarrules for Sinhala:

S = Subject AkkyanayaSubject = SimpleSubject | ComplexSubjectComplexSubject = SimpleSubject ConSubSimpleSubject = Noun | Adjective NounConSub = Conjunction SimpleSubjectAkkyanaya = VerbP | Object VerbPObject = SimpleObject | ComplexObjectComplexObject = Conjunction SimpleObjectSimpleObject = Noun | Adjective NounVerbP = Verb | Adverb Verb

The later work by Liyanage et al. [38] also involvesformalizing a computational grammar for Sinhala. Theyclaim that Sinhala can have any order of words in practice.However, they do not note that this is happening because

Page 8: 1 Survey on Publicly Available Sinhala Natural Language ...1 Survey on Publicly Available Sinhala Natural Language Processing Tools and Research Nisansa de Silva Abstract— Sinhala

8

practices of the spoken language, which does not sharethe strong SOV conventions of the written language, areslowly seeping into written text. However, they do makenote of how Sinhala grammar is modeled as a head-finallanguage [39]. They propose the Sinhala Noun Phrase (NP )to be defined as shown in equation 1 where NN is a nounwhich can be of types: common noun (N ), pronoun (PrN )or proper noun (PropN ). The adjectival phrase (ADJP )is then defined as as shown in equation 2 where: Det isa Determiner, Adj is the adjective, and Deg is an optionaloperator Degrees which can be used to intensify the meaningof the adjective in cases where the adjective is qualitative.While they note that according to Gunasekara [132], therehas to three classes of adjectives (qualitative, quantitative,and demonstrative), they do not implement this distinctionin their system. Similarly, they propose Sinhala Verb Phrase(V P ) to be defined as shown in equation 3 where V is asingle verb. They here note that they are ignoring compoundverbs and auxiliary verbs in their grammar. The adverbialphrases (ADV P ) are then recursively defined as as shownin equation 4.

NP = [ADJP ][NN ] (1)

ADJP =

[[Det]

[[Deg][Adj]

]](2)

V P = [ADV P ][V ] (3)

ADV P =

[[NP ]

[[ADV P ]

[[Deg][ADV ]

]]](4)

Similar to Hettige and Karunananda [18], the workby Liyanage et al. [38] also builds a CFG for Sinhala covering10 out of the 25 types of simple sentence structures in Sin-hala reported by Abhayasinghe [40]. This parser is unableto parse sentences where inanimate subjects do not considerthe number. Further, sentences which contain, compoundverbs, auxiliary verbs, present participles, or past participlescannot be handled by this parser. If the verbs have imperativemood or negation those too cannot be handled by this. Non-verbal sentences which end with adjectives, oblique nominals,locative predicates, adverbials, or any other language entitywhich is not a verb cannot be handled by this parser.

The study by Kanduboda [37] covers not the whole ofSinhala parsing but analyzes a very specific property ofSinhala observed by Aissen [133] which states that it is pos-sible to notice Differential Object Marking (DOM) in Sinhalaactive sentences. Kanduboda [37] define this as the choiceof /wa/ and /ta/ object markers. They further observe threeunique aspects of DOM in Sinhala: (a) it is only observedin active sentences which contain transitive verbs, (b) it canoccur with accusative marked nouns but not with any othercases, (c) it exists only if the sentence has placed an animatenoun in the accusative position. They do a statistical analysisand provide a number of short gazetteer lists as appendixes.However, they observe that further work has to be donefor this particular language rule in Sinhala given that theyfound some examples which proved to be exceptions to thegeneral model which they proposed.

3.8 Named Entity Recognition Tools

As shown in Fig 1, once the text is properly parsed, ithas to be processed using a Named Entity Recognition(NER) system. The first attempt of Sinahla NER was doneby Dahanayaka and Weerasinghe [134]. Given that theywere conducting the first study for Sinhala NER, they basedtheir approach on NER research done for other languages. Inthis, they gave prominent notice to that of Indic languages.On that matter, they were the first to make the interestingobservation that NER for Indic languages (including, butnot limited to Sinhala) is more difficult than that of Englishby the virtue of the absence of a capitalization mechanic.Following prior work done on other languages, they usedConditional Random Fields (CRF) as their main model andcompared it against a baseline of a Maximum Entropy (ME)model. However, they only use the candidate word, ContextWords around the candidate word, and a simple analysis ofSinhala suffixes as their features.

The follow up work by Senevirathne et al. [135] kept theCRF model with all the previous features but did not reportcomparative analysis with an ME model. The innovation in-troduced by this work is a richer set of features. In additionto the features used by Dahanayaka and Weerasinghe [134],they introduced, Length of the Word as a threshold feature.They also introduced First Word feature after observingcertain rigid grammatical rules of Sinhala. A feature of clueWords in the form of a subset of Context Words featurewas first proposed by this work. Finally, they introduceda feature for Previous Map which is essentially the NE valueof the preceding word. Some of these feature extractionsare done with the help of a rule-based post-processor whichutilizes context-based word lists.

The third attempt at Sinhala NER was by Manamini et al.[90] who dubbed their system Ananya. They inherit the CRFmodel and ME baseline from the work of Dahanayaka andWeerasinghe [134]. In addition to that, they take the en-hanced feature list of Senevirathne et al. [135] and enrich itfurther more. They introduce a Frequency of the Word featurebased on the assumption that most commonly occurringwords are not NEs. Thus, they model this as a Booleanvalue with a threshold applied on the word frequency. Theyextend the First Word feature proposed by Senevirathneet al. [135] to a First Word/ Last Word of a Sentence featurenoting that Sinhala grammar is of SOV configuration. Theyintroduce a (PoS) Tag feature and a gazetteer lists basedfeature keeping in line with research done on NER inother languages. They formally introduce clue Words, whichwas initially proposed as a sub-feature by Dahanayakaand Weerasinghe [134], as an independent feature. Utilizingthe fact that they have the ME model unlike Dahanayakaand Weerasinghe [134], they introduce a complementaryfeature to Previous Map named Outcome Prior, which uses theunderlying distribution of the outcomes of the ME model.Finally, they introduce a Cutoff Value feature to handle theover-fitting problem.

The table 6 provides comparative summery of the dis-cussion above. It should be noted that all three of these mod-els only tag NEs of types: person names, location names andorganization names. The Ananya system by Manamini et al.

Page 9: 1 Survey on Publicly Available Sinhala Natural Language ...1 Survey on Publicly Available Sinhala Natural Language Processing Tools and Research Nisansa de Silva Abstract— Sinhala

9

TABLE 6: NER system comparison* Denotes a baseline.F1 to F11 denotes Context Words, Word Prefixes and Suffixes, Length of the Word, Frequency of the Word, First Word/ Last Word of aSentence, (POS) Tags, Gazetteer Lists, Clue Words, Outcome Prior, Previous Map, and Cutoff Value

CRF ME FeaturesF1 F2 F3 F4 F5 F6 F7 F8 F9 F10 F11

Dahanayaka and Weerasinghe [134] Yes Yes* Yes Yes No No No No No No No No NoSenevirathne et al. [135] Yes No Yes Yes Yes No Yes No No Yes No Yes NoManamini et al. [90] Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes

[90] is available to download at GitHub 23. The data andcode for the approaches by Dahanayaka and Weerasinghe[134] and by Senevirathne et al. [135] are not accessible tothe public.

3.9 Semantic ToolsApplications of the semantic layer are more advanced thanthe ones below it in Figure 1. But even with the obviouslack of resources and tools, a number of attempts havebeen made on semantic level applications for the SinhalaLanguage. The earliest attempt on semantic analysis wasdone by Herath et al. [136] using their earlier work whichdealt with Sinhala morphological analysis [48]. A Sinhalasemantic similarity measure has been developed for shortsentences by Kadupitiya et al. [137]. This work has beenthen extended by Kadupitiya et al. [138] for the applicationuse case of short answer grading. Data and tools for theseprojects are not publicly available. A deterministic processflow for automatic Sinhala text summarizing was proposedby Welgama [139].

There have been multiple attempts to do word sensedisambiguation (WSD) [140–144] for Sinhala. For this, Aruk-goda et al. [145] have proposed a system based on the LeskAlgorithm[146] while Marasinghe et al. [147] have proposeda system based on probabilistic modeling. A dialogue actrecognition system which utilizes simple classification algo-rithms has been proposed by Palihakkara et al. [148].

Text classification is a popular application on the seman-tic layer of the NLP stack. A very basic Sinhala text classi-fication using Naıve Bayes Classifier, Zipf’s Law Behavior,and SVMs was attempted by Gallege [149]. A smaller imple-mentation of Sinhala news classification has been attemptedby de Silva [22]. As mentioned in Section 3.2, their newscorpus is publicly available24. A word2vec based tool25 forsentiment analysis of Sinhala news comments is available.Another attempt on Sinhala text classification using sixpopular rule based algorithms was done by Lakmali andHaddela [150]. Even-though they talk about building a cor-pus named SinNG5, they do not indicate of means for othersto obtain the said corpus. Another study by Kumari andHaddela [151] utilizes the SinNG5 corpus as the data set fortheir attempt to use LIME [152] for human interpretability ofSinhala document classification. However, they too do notprovide access corpus. A methodology for constructing asentiment lexicon for Sinhala Language in a semi-automatedmanner based on a given corpus was proposed by Chathu-ranga et al. [153]. Nanayakkara and Ranathunga [154] have

23. http://bit.ly/2XrwCoK24. https://osf.io/tdb84/25. http://bit.ly/2QKI9Np

implemented a system which uses corpus-based similaritymeasures for Sinhala text classification. Gunasekara andHaddela [155] claim to have created a context aware stopword extraction method for Sinhala text classification basedon simple TF-IDF. An LSTM based textual entailment sys-tem for Sinhala was proposed by Jayasinghe and Sirts [156].

3.10 Phonological ToolsOn the case of phonological layer, a report on Sinhala pho-netics and phonology was published by Wasala and Gamage[157]. Wickramasinghe et al. [158] discussed the practicalissues in developing Sinhala Text-to-Speech and SpeechRecognition systems. Based on the earlier work by Weeras-inghe et al. [159], Wasala et al. [160] have developed meth-ods for Sinhala grapheme-to-phoneme conversion alongwith a set of rules for schwa epenthesis. This work was thenextended by Nadungodage et al. [161]. Weerasinghe et al.[162] developed a Sinhala text-to-speech system. However,it is not publicly accessible. They internally extended it tocreate a system capable of helping a mute person achievesynthesized real-time interactive voice communication inSinhala [163]. A rule based approach for automatic segmen-tation of a small set of Sinhala text into syllables was pro-posed by Kumara et al. [164]. An ew prosodic phrasing methodto help with Sinhala Text-to-Speech process was proposedby Bandara et al. [165, 166, 167]. Sodimana et al. [168]proposed a text normalization methodology for Sinhala text-to-speech systems. Further, Sodimana et al. [169] formalizeda step-by-step process for building text-to-speech voices forSinhala. A separate group has done work on Sinhala text-to-speech systems independent to above [170].

On the converse, Nadungodage et al. [171] have done aseries of work on Sinhala speech recognition with specialnotice given to Sinhala being a resource poor language. Thisproject divides its focus on: continuity [172], active learn-ing [173], and speaker adaptation [174]. A Sinhala speechrecognition for voice dialing which is speaker independentwas proposed by Amarasingha and Gamini [175] and onthe other end, a Sinhala speech recognition methodologyfor interactive voice response systems, which are accessedthrough mobile phones was proposed by Manamperi et al.[176]. Priyadarshani [177] proposes a method for speakerdependant speech recognition based on their previous workon: dynamic time warping for recognizing isolated Sin-hala words [178], genetic algorithms [179], and syllablesegmentation method utilizing acoustic envelopes [180].The method proposed by Gunasekara and Meegama [181]utilizes an HMM model for Sinhala speech-to-text. A Sin-hala speech recognizer supporting bi-directional conver-sion between Unicode Sinhala and phonetic English was

Page 10: 1 Survey on Publicly Available Sinhala Natural Language ...1 Survey on Publicly Available Sinhala Natural Language Processing Tools and Research Nisansa de Silva Abstract— Sinhala

10

proposed by Punchimudiyanse and Meegama [182]. Thework by Karunanayake et al. [183] transfer learns CNNs fortranscribing free-form Sinhala and Tamil speech data setsfor the purpose of classification.

The Sinhala speech classification system proposedby Buddhika et al. [184] does so without converting thespeech-to-text. However, they report that this approachonly works for specific domains with well-defined limitedvocabularies.

3.11 Optical Character Recognition Applications

While it is not necessarily a component of the NLP stackshown in Fig 1, which follows the definition by Liddy [24],it is possible to swap out the bottom most phonological layerof the stack in favour of an Optical Character Recognition(OCR) and text rendering layer.

An attempt for Sinhala OCR system has been takenby Rajapakse et al. [185] before any other work has beendone on the topic. Much later, a linear symmetry basedapproach was proposed by Premaratne and Bigun [186, 187].They then used hidden Markov model-based optimizationon the recognized Sinhala script [188]. Similarly, Hewavitha-rana et al. [189] used hidden Markov models for off-lineSinhala character recognition. Statistical approaches withhistogram projections for Sinhala character recognition isproposed by Hewavitharana and Kodikara [190], by Ajwardet al. [191], and by Madushanka et al. [192]. Karunanayakaet al. [193] also did off-line Sinhala character recognitionwith a use case for postal city name recognition. A separategroup had attempted Sinhala OCR [194] mainly involvingthe nearest-neighbor method [195]. A study by Ediriweera[196] uses dictionaries to correct errors in Sinhala OCR.An early attempt for Sinhala OCR by Dias et al. [197] hasbeen extended to be online and made available to use viadesktops [198] and hand-held devices [199] with the abilityto recognize handwriting. A simple neural network basedapproach for Sinhala OCR was utilized by Rimas et al. [200].A fuzzy-based model for identifying printed Sinhala charac-ters was proposed by Gunarathna et al. [201]. Premachandraet al. [202] proposes a simple back-propagation artificialneural network with hand crafted features for Sinhala char-acter recognition. Another neural network with specializedfeature extraction for Sinhala character recognition wasproposed by Jayamaha and Naleer [203]. On the matter ofneural networks and feature extraction, a feature selectionprocess for Sinhala OCR was proposed by Kumara andRagel [204]. Jayawickrama et al. [205] worked on Sinhalaprinted characters with special focus on handling diacriticvowels. However, they opted to refer to diacritic vowelsas modifiers in their work. Gunawardhana and Ranathunga[206] proposed a limited approach to recognize Sinhalaletters on Facebook images.

Fernando et al. [207] claim to have created a databasefor handwriting recognition research in Sinhala languageand further claims that the data set is available at NationalScience Foundation (NSF) of Sri Lanka. However, the paperprovides no URLs and we were not able to find the dataset on the NSF website either. The work by Karunanayakaet al. [208] is focused on noise reduction and skew correctionof Sinhala handwritten words. A genetic algorithm-based

approach for non-cursive Sinhala handwritten script recog-nition was proposed by Jayasekara and Udawatta [209]. Ni-laweera et al. [210] compare projection and wavelet-basedtechniques for recognizing handwritten Sinhala script. Silvaand Kariyawasam [211] worked on segmenting Sinhalahandwritten characters with special focus on handling dia-critic vowels. A comparative study of few available Sinhalahandwriting recognition methods was done by Silva et al.[212]. Silva et al. [213] uses contour tracing for isolated char-acters in handwritten Sinhala text. A Sinhala handwritingOCR system which utilizes zone-based feature extractionhas been proposed by Dharmapala et al. [214]. The studyby Walawage and Ranathunga [215] specifically focuses onsegmentation of overlapping and touching Sinhala hand-written characters.

Summarizing on optically recognized old Sinhala textfor the purpose of archival search and preservation wasexplored by Rathnasena et al. [216]. The work of Peiris[217] also focused on OCR for ancient Sinhala inscriptions.A neural network based method for recognizing ancientSinhala inscriptions was proposed by Karunarathne et al.[218]. Chanda et al. [219] proposed a Gaussian kernel SVMbased method for word-wise Sinhala, Tamil, and Englishscript identification.

3.12 Translators

A series of work has been done by a group towards Englishto Sinhala translation as mentioned in some of the abovesubsections. This work includes; building a morphologi-cal analyzer [59], lexicon databases [62], a transliterationsystem [63], an evaluation model [68], a computationalmodel of grammar [18], and a multi-agent solution [74].After working on human-assisted machine translation [64],Hettige and Karunananda [67, 69] have attempted to es-tablish a theoretical basics for English to Sinhala machinetranslation. A very simplistic web based translator wasproposed [65, 66]. The same group have worked on aSinhala ontology generator for the purpose of machinetranslation [73] and a phrase level translator [75] basedon the previous work on a multi-agent system for trans-lation [72]. Further, an application of the English to Sinhalatranslator on the use case of selected text for reading wasimplemented [70].

Another group independently attempted English-to-Sinhala machine translation [220] with a statistical ap-proach [221]. Wijerathna et al. [222] and De Silva et al.[223] have proposed simple rule based translators. Anexample-based method applied on the English-Sinhala sen-tence aligned government domain corpus was proposedby Silva and Weerasinghe [224]. A translator based on alook-up system was proposed by Vidanaralage et al. [225].In a preprint, Joseph et al. [226] proposes an evolutionaryalgorithm for Sinhala to English translation with a basis ofPoint-wise Mutual Information (PMI) and claims that thecode will be shared once the paper is accepted. However,they do not report any quantitative results to be comparedand the reported qualitative results are also superficial.

Most of the cross Sinhala and Tamil work has been donein the domain of machine translation. A neural machinetranslation for Sinhala and Tamil languages was initiated

Page 11: 1 Survey on Publicly Available Sinhala Natural Language ...1 Survey on Publicly Available Sinhala Natural Language Processing Tools and Research Nisansa de Silva Abstract— Sinhala

11

by Tennage et al. [227, 228]. Then they further enhancedit with transliteration and byte pair encoding [229] andused synthetic training data to handle the rare word prob-lem [230]. This project produced Si-Ta [231] a machinetranslation system of Sinhala and Tamil official documents.In the statistical machine translation front, Farhath et al.[232] worked on integrating bilingual lists. The attemptsby Weerasinghe [233] and Sripirakas et al. [234] were also fo-cused on statistical machine translation while Jeyakaran andWeerasinghe [235] attempted a kernel regression method.A yet another attempt was made by Pushpananda et al.[236] which they later extended with some quality im-provements [237]. An attempt on real-time direct translationbetween Sinhala and Tamil was done by Rajpirathap et al.[238]. Dilshani et al. [239] have done a study on the linguisticdivergence of Sinhala and Tamil languages in respect tomachine translation. Mokanarangan [240] claims to havebuilt a named entity translator between Sinhala and Tamilfor official government documents. But this work is lockedbehind an institutional repository wall and thus is not ac-cessible by other researchers. Arukgoda et al. [241] studiedthe possibility of using deep learning techniques to improveSinhala-Tamil translation. While not related to Tamil, therehave been attempts to link Sinhala NLP with Japanese byHerath et al. [21, 76, 77], Thelijjagoda [78], and Kanduboda[8]. There has been an attempt to use dictionary-basedmachine translation [242] between Sinhala and the liturgicallanguage of Buddhism, Pali [243–245].

3.13 Miscellaneous Applications

In this section, we discuss NLP tools and research whichare either hard to categorize under above sections or arereasonably involving multiples of them. The first miscel-laneous application of Sinhala NLP is spell checking. Theopen-source data driven approach proposed by Wasala et al.[246, 247] claims to be able to check and correct spellingerrors in Sinhala. The approach by Jayalatharachchi et al.[248] attempts to obtain synergy between two algorithmsfor the same purpose. These efforts [246, 248] were thenextended by Subhagya et al. [249].

On the matter of Sinhala sign language, strides havebeen made in the domains of computer interpreting for writ-ten Sinhala [250] and animation of finger-spelled words andnumber signs [251]. A simple Sinhala chat bot which utilizesa small knowledge base has been proposed by Hettige andKarunananda [61]. Fernando [252] proposed a method forinexact matching of Sinhala proper names. A study on deter-mining canonical word order of colloquial Sinhala sentencesusing priority information was conducted by Kandubodaand Tamaoka [253] which they later extended [254–256].Jayakody et al. [257] uses simple KNN and SVM methods ona PoS tagged Sinhala corpus to create a question-answeringsystem which they name Mahoshadha.

4 PRIMARY SOURCES

Even though the main objective of this survey is to coverNLP tools and research, we noticed that much of these NLPtools and research depend on primary sources of Sinhalalanguage such as printed books in the role of knowledge

sources and ground truth. Therefore, for the benefit of otherresearchers who venture into Sinhala NLP, we decided toadd a short introduction to the available primary sources ofSinhala language used by their peers. We note that the bodyof work by a single scholar, Disanayake [2, 35, 258–268], isquite prominent in the case of being used for NLP appli-cations. For formal introduction of the language, the booksby Disanayake [2] and Perera [3] are commonly used. Incases which deal with the Sinhala alphabet, the introductionby Indrasena [269] and by Disanayake [258, 259] have beenused. An analysis of modern Sinhala linguistics has beendone by Jayathilake [270] and by Pallatthara and Weihene[36]. The early study by Henadeerage [271] covers a numberof topics on the Sinhala language such as grammaticalrelations, argument structure, phrase structure and focusconstructions.

As we discussed in Section 3.3, a number of printedSinhala dictionaries exist, Malalasekera [95] being the mostprominent English-Sinahala dictionary among them. In ad-dition to that seminal work, previous researchers of SinhalaNLP have utilized a number of other dictionaries of vari-ous configurations such as: English-Sinhala [97, 100, 101],Sinhala-Sinhala [96, 99], and English-Sinhala-Tamil [98].

A number of NLP applications have utilized first sourcesintended to teach children [272–276] or foreigners [132, 277,278]26. This makes sense given that an introduction writtenfor children would start with basic principles and thus beideal for crafting rule based NLP systems and an introduc-tion written for foreigners would have Sinhala languagedescribed in terms of English, making easy the process ofrule based translation of English NLP tools to Sinhala.

For applications where a rule based approach for Sinhalaspelling correction is utilized, the books by Disanayake[265, 266], by Koparahewa [279], and by Gair and Karunatil-lake [280] are used to provide a basis. A number of NLPapplications which handle spoken Sinhala in the capacity ofphonological layer (Section 3.10) applications or otherwise,make note of the fact that spoken Sinhala is considerablydistinguishable from written Sinhala, as such, they referprimary sources which explicitly deal with spoken Sin-hala [35, 264, 268, 274, 277, 281, 282].

Primary sources used in NLP application for Sinhalagrammar are varied. A number of them provide overviewsof the entirety of Sinhala grammar [19, 36, 39, 267, 283–292]. There are specific primary sources focusing onverbs [263, 275, 293], nouns [262, 276], prepositions [274],compounds [261], derivation [260], case system [294], andsentence structure [40] of the Sinhala language. The bookby Rajapaksha [295] is commonly used in NLP applica-tions as a guide for word tagging and punctuation markhandling. NLP studies that tackle the hard problem ofhandling questions expressed in Sinhala often refer to thebook by Kariyakarawana [296]. Kekulawala [297] has aptlydiscussed the much controversial topic of the situation offuture tense of Sinhala.

5 CONCLUSION

At this point, a reader might think, there seems to be asignificant number of implementations of NLP for Sinhala.

26. Note that [278] is an extension of [132].

Page 12: 1 Survey on Publicly Available Sinhala Natural Language ...1 Survey on Publicly Available Sinhala Natural Language Processing Tools and Research Nisansa de Silva Abstract— Sinhala

12

Therefore, how can one justify listing Sinhala as a resourcepoor language? The important point which is missing inthat assumption is that in the cases of almost all of theabove listed implementations and findings, the only thingthat is publicly available for a researcher is a set of researchpapers. The corpora, tools, algorithm, and anything else thatwere discovered through these research are either lockedaway as properties of individual research groups or worselost to the time with crashed ancient servers, lost harddrives, and expired web hosts. This reason and probablyacademic/research rivalry have caused these separate re-search groups not to cite or build upon the works of each-other. In many cases where similar work is done, it is are-hashing on the same ideas adopted from resource richlanguages because of, the unavailability of (or the reluctanceto), referring and building on work done by another group.This has resulted in multiple groups building multiplefoundations behind closed doors but no one ending upwith a completed end-to-end NLP work-flow. In conclusion,what can be said is that, even though there are islands ofimplementations done for Sinhala NLP, they are of verysmall scale and/or are usually not readily accessible forfurther use and research by other researchers. Thus, so far,sadly, Sinhala stays a resource poor language.

ACKNOWLEDGMENTS

The authors would like to thank Romain Egele for checkingthe examples we have provided in French for their accuracy.Similarly, the authors would also like to thank Shravan Kalefor checking the examples we have provided in Hindi fortheir accuracy.

REFERENCES[1] R. Englebretson and C. Genetti, “Santa barbara papers in linguis-

tics: Proceeding from the workshop on sinhala linguistics,” SantaBarbara, CA: Department of Linguistics at the University of California,Santa Barbara, 2005.

[2] J. B. Disanayake, National Languages of Sri-Lanka: Sinhala. De-partment of cultural affairs, 1976.

[3] T. G. Perera, Sinhala Bhashawa, 1985.[4] L. Bauer, Linguistics Student’s Handbook. Edinburgh University

Press, 2007.[5] Department of Census and Statistics Sri Lanka. Percentage of

population aged 10 years and over in major ethnic groups bydistrict and ability to speak sinhala, tamil and english languages.[Online]. Available: https://goo.gl/nnVZSd

[6] J. W. Gair and W. Karunatilaka, Literary Sinhala. ERIC, CornellUniversity. New York, 1974.

[7] H. Young. A language family tree - in pictures— education — the guardian. [Online]. Avail-able: https://www.theguardian.com/education/gallery/2015/jan/23/a-language-family-tree-in-pictures

[8] A. B. Kanduboda, “The role of animacy in determining nounphrase cases in the sinhalese and japanese languages,” Science ofwords, vol. 24, pp. 5–20, 2011.

[9] P. Fernando, “Palaeographical development of the brahmi scriptin ceylon from 3rd century bc to 7th century ad,” 1949.

[10] D. Bandara, N. Warnajith, A. Minato, and S. Ozawa, “Creation ofprecise alphabet fonts of early brahmi script from photographicdata of ancient sri lankan inscriptions,” Canadian Journal on Arti-ficial Intelligence, Machine Learning and Pattern Recognition, vol. 3,no. 3, pp. 33–39, 2012.

[11] P. T. Daniels and W. Bright, The world’s writing systems. OxfordUniversity Press on Demand, 1996.

[12] M. H. Sirisoma, “Brahmi inscriptions of sri lanka from 3rdcentury bc to 65 ad,” pp. 3–54, 1990.

[13] M. Dias, “Lakdiwa sellipiwalin heliwana sinhala bhashaweprathyartha namayange vikashanaya,” Department of Archaeology,Colombo Sri Lanka, p. 1, 1996.

[14] A. S. Hettiarachchi, “Investigation of 2nd, 3rd and 4th centuryinscriptions,” Inscriptions: Volume Two, Archaeological DepartmentCentenary (1890–1990), Commemorative Series. Colombo: Departmentof Archaeology, pp. 57–104, 1990.

[15] S. Paranavitana and S. L. P. Departamentuva, Inscriptions ofCeylon. Dept. of Archaeology, 1970.

[16] R. Salomon, Indian epigraphy: a guide to the study of inscriptionsin Sanskrit, Prakrit, and the other Indo-Aryan languages. OxfordUniversity Press, 1998.

[17] H. Falk, Schrift im alten Indien: ein Forschungsbericht mit Anmerkun-gen. Gunter Narr Verlag, 1993, vol. 56.

[18] B. Hettige and A. S. Karunananda, “Computational model ofgrammar for english to sinhala machine translation,” in Advancesin ICT for Emerging Regions (ICTer), 2011 International Conferenceon. IEEE, 2011, pp. 26–31.

[19] A. M. Gunasekara, A Comprehensive Grammar of the SinhaleseLanguage. Asian Educational Services, New Delhi, Madras,India, 1986.

[20] H. P. Ray, The archaeology of seafaring in ancient South Asia. Cam-bridge University Press, 2003.

[21] A. Herath, Y. Hyodo, Y. Kawada, T. Ikeda, and S. Herath, “Apractical machine translation system from japanese to modernsinhalese,” Gifu University, pp. 153–162, 1994.

[22] N. de Silva, “Sinhala Text Classification: Observations from thePerspective of a Resource Poor Language,” 2015.

[23] Y. Wijeratne, N. de Silva, and Y. Shanmugarajah, “Natural Lan-guage Processing for Government: Problems and Potential,”LIRNEasia, 2019.

[24] E. D. Liddy, “Natural language processing,” 2001.[25] D. C. Wimalasuriya and D. Dou, “Ontology-based information

extraction: An introduction and a survey of current approaches,”Journal of Information Science, vol. 36, no. 3, pp. 306–323, 2010.

[26] U. Consortium et al., “The unicode standard: A tech-nical introduction,” online document, http://www. unicode.org/unicode/standards/principles. html, 1996.

[27] R. A. Van der Sandt, “Presupposition projection as anaphoraresolution,” Journal of semantics, vol. 9, no. 4, pp. 333–377, 1992.

[28] S. Lappin and H. J. Leass, “An algorithm for pronominalanaphora resolution,” Computational linguistics, vol. 20, no. 4, pp.535–561, 1994.

[29] W. M. Soon, H. T. Ng, and D. C. Y. Lim, “A machine learning ap-proach to coreference resolution of noun phrases,” Computationallinguistics, vol. 27, no. 4, pp. 521–544, 2001.

[30] V. Ng and C. Cardie, “Improving machine learning approachesto coreference resolution,” in Proceedings of the 40th annual meetingon association for computational linguistics. Association for Com-putational Linguistics, 2002, pp. 104–111.

[31] R. Mitkov, Anaphora resolution. Routledge, 2014.[32] J. Preston and M. J. M. Bishop, Views into the Chinese room: New

essays on Searle and artificial intelligence. OUP, 2002.[33] G. H. Fairbanks, J. W. Gair, and M. W. S. D. Silva, Colloquial

Sinhalese. ERIC, Cornell University. New York, 1968.[34] T. Miyagishi, “Accusative subject of subordinate clause in literary

sinhala,” Journal of Yasuda Women’s University, vol. 33, 2005.[35] J. B. Disanayake, Say it in Sinhala. Lake House Investments

Limited, 1985.[36] S. Pallatthara and P. Weihene, Sinhala Grammar in Linguistic

Perspective, 1966.[37] A. P. B. Kanduboda, “On the usage of sinhalese differential object

markers object marker /wa/ vs. object marker /ta/,” Theory andPractice in Language Studies, vol. 3, no. 7, p. 1081, 2013.

[38] C. Liyanage, R. Pushpananda, D. L. Herath, and R. Weerasinghe,“A computational grammar of Sinhala,” in International Confer-ence on Intelligent Text Processing and Computational Linguistics.Springer, 2012, pp. 188–200.

[39] W. S. Karunatilaka, Sinhala Bhasa Vyakaranaya. M D GunasenaPublishers, Colombo, 1997.

[40] A. A. Abhayasinghe, Sinhala bhashave sarala vakya vibagaya, 1998.[41] C. Jany, “The relationship between case marking and s, a, and o

in spoken sinhala,” Santa Barbara Papers in Linguistics, no. 17, pp.68–84, 2006.

[42] J. Garland, “Morphological typology and the complexity of nom-inal morphology in sinhala,” Santa Barbara Papers in Linguistics,no. 17, pp. 1–19, 2005.

Page 13: 1 Survey on Publicly Available Sinhala Natural Language ...1 Survey on Publicly Available Sinhala Natural Language Processing Tools and Research Nisansa de Silva Abstract— Sinhala

13

[43] M. Henderson, “Between lexical and lexico-grammatical classifi-cation: nominal classification in sinhala,” Santa Barbara Papers inLinguistics, p. 29, 2005.

[44] T. Noguchi, “Shinharago nyuumon [introductory to the sinhaleselanguage],” Tokyo: Daigaku Shorin, 1984.

[45] T. Miyagishi, “A comparison of word order between japaneseand sinhalese,” Bulletin of Japanese Language and Literature, pp.101–107, 2003.

[46] D. Chandralal, Sinhala. John Benjamins Publishing, 2010, vol. 15.[47] S. P. Singh, A. Kumar, P. Sahu, and P. Verma, “Syntax based

machine translation using blended methodology,” in 2016 2ndInternational Conference on Next Generation Computing Technologies(NGCT). IEEE, 2016, pp. 242–247.

[48] S. Herath, T. Ikeda, S. Yokoyama, H. Isahara, and S. Ishizaki,“Sinhalese morphological analysis: a step towards machine pro-cessing of sinhalese,” in [Proceedings 1989] IEEE InternationalWorkshop on Tools for Artificial Intelligence. IEEE, 1989, pp. 100–107.

[49] S. Herath, T. Ikeda, S. Ishizaki, and Y. Anzai, “Formalization ofsinhalese morphology,” in Proc. of the 40th National Congress ofIPSJ. Jyouhou Syori Gakkai, vol. 1, 1990, pp. 327–328.

[50] V. K. Samaranayake, J. B. Disanayaka, and S. T. Nandasara, “Astandard code for sinhala characters,” Proceedings, 9th AnnualSessions of the Computer Society of Sri Lanka, Colombo, 1989.

[51] V. K. Samaranayake, S. T. Nandasara, J. B. Disanayaka, A. R.Weerasinghe, and H. Wijayawardhana, “An introduction to uni-code for sinhala characters,” University Of Colombo School ofComputing, 2003.

[52] G. Dias and A. Goonetilleke, “Development of standards forSinhala computing,” in 1st Regional Conference on ICT and E-Paradigms, 2004.

[53] G. V. Dias, “Challenges of enabling it in the sinhala language,” in27th Internationalization and Unicode Conference, 2005.

[54] A. R. Weerasinghe, D. L. Herath, and K. Gamage, “The sinhalacollation sequence and its representation in unicode,” LocalizationFocus, 2006.

[55] G. Sandeva, “Design and evaluation of user-friendly yet efficientsinhala input methods,” 2009.

[56] S. Herath, S. Ishizaki, T. Ikeda, Y. Anzai, and H. Aiso, “Machineprocessing of sinhala natural language: a step toward intelligentsystems,” Cybernetics and systems, vol. 22, no. 3, pp. 331–348, 1991.

[57] S. T. Nandasara, “From the past to the present: Evolution ofcomputing in the sinhala language,” IEEE Annals of the Historyof Computing, vol. 31, no. 1, pp. 32–45, 2009.

[58] S. T. Nandasara and Y. Mikami, “Bridging the digital divide insri lanka: some challenges and opportunities in using sinhala inict,” International Journal on Advances in ICT for Emerging Regions(ICTer), vol. 8, no. 1, 2016.

[59] B. Hettige and A. S. Karunananda, “A morphological analyzer toenable english to sinhala machine translation,” in Information andAutomation, 2006. ICIA 2006. International Conference on. IEEE,2006, pp. 21–26.

[60] ——, “A parser for sinhala language-first step towards englishto sinhala machine translation,” in Industrial and InformationSystems, First International Conference on. IEEE, 2006, pp. 583–587.

[61] ——, “First sinhala chatbot in action,” Proceedings of the 3rdAnnual Sessions of Sri Lanka Association for Artificial Intelligence(SLAAI), University of Moratuwa, 2006.

[62] ——, “Developing lexicon databases for english to sinhala ma-chine translation,” in Industrial and Information Systems, 2007.ICIIS 2007. International Conference on. IEEE, 2007, pp. 215–220.

[63] ——, “Transliteration system for english to sinhala machinetranslation,” in Industrial and Information Systems, 2007. ICIIS2007. International Conference on. IEEE, 2007, pp. 209–214.

[64] ——, “Using human-assisted machine translation to overcomelanguage barrier in sri lanka,” Proceedings of 4th Annual session ofSri Lanka Association for Artificial Intelligence, p. 10, 2007.

[65] ——, “Web-based english-sinhala translator in action,” in 20084th International Conference on Information and Automation for Sus-tainability. IEEE, 2008, pp. 80–85.

[66] ——, “Web-based english to sinhala selected texts translationsystem,” Sri Lanka Association for Artificial Intelligence, p. 26, 2008.

[67] ——, “Theoretical based approach to english to sinhala machinetranslation,” in 2009 International Conference on Industrial andInformation Systems (ICIIS). IEEE, 2009, pp. 380–385.

[68] ——, “An evaluation methodology for english to sinhala machine

translation,” in Information and Automation for Sustainability (ICI-AFs), 2010 5th International Conference on. IEEE, 2010, pp. 31–36.

[69] ——, “Varanageema: A theoretical basics for english to sinhalamachine translation,” in Sri Lanka Association for Artificial Intelli-gence (SLAAI), 2010.

[70] B. Hettige, G. Rzevski, and A. S. Karunananda, “Selected textmachine translator for english to sinhala,” 2013.

[71] B. Hettige, A. S. Karunananda, and G. Rzevski, “Multi-agentsystem technology for morphological analysis,” Proceedings of the9th Annual Sessions of Sri Lanka Association for Artificial Intelligence(SLAAI), Colombo, 2012.

[72] ——, “Masmt: A multi-agent system development framework forenglish-sinhala machine translation,” International Journal of Com-putational Linguistics and Natural Language Processing (IJCLNLP),vol. 2, no. 7, pp. 411–416, 2013.

[73] ——, “Sinhala ontology generator for english to sinhala machinetranslation,” in Proc. of KDU International Research Conference,2014.

[74] ——, “A multi-agent solution for managing complexity in englishto sinhala machine translation,” Complex Systems: Fundamentals &Applications, vol. 90, p. 251, 2016.

[75] ——, “Phrase-level english to sinhala machine translation withmulti-agent approach,” in 2017 IEEE International Conference onIndustrial and Information Systems (ICIIS). IEEE, 2017, pp. 1–6.

[76] A. Herath, Y. Hyodo, T. Ikeda, and S. Herath, “Generation ofsinhalese units from japanese bunsetsu structure,” 1993.

[77] A. Herath, Y. Hyodo, Y. Kunieda, T. Ikeda, and S. Herath,“Bunsetsu-based japanese-sinhalese translation system,” Informa-tion sciences, vol. 90, no. 1-4, pp. 303–319, 1996.

[78] S. Thelijjagoda, “Japanese-sinhalese mt system (jaw/sinhalese),”in Proceedings of Asian Symposium on Natural Language Processingto Overcome Language Barriers, IJCNLP-04 Satellite Symposium,2004.

[79] D. Upeksha, C. Wijayarathna, M. Siriwardena, L. Lasandun,C. Wimalasuriya, N. H. N. D. de Silva, and G. Dias, “Implement-ing a Corpus for Sinhala Language,” in Symposium on LanguageTechnology for South Asia 2015, 2015.

[80] ——, “Comparison between performance of various databasesystems for implementing a language corpus,” in Interna-tional Conference: Beyond Databases, Architectures and Structures.Springer, May 2015, pp. 82–91.

[81] R. Weerasinghe, D. Herath, and V. Welgama, “Corpus-based sin-hala lexicon,” in Proceedings of the 7th Workshop on Asian LanguageResources. Association for Computational Linguistics, 2009, pp.17–23.

[82] F. Guzman, P.-J. Chen, M. Ott, J. Pino, G. Lample, P. Koehn,V. Chaudhary, and M. Ranzato, “Two new evaluation datasets forlow-resource machine translation: Nepali-english and sinhala-english,” arXiv preprint arXiv:1902.01382, 2019.

[83] P. Lison and J. Tiedemann, “Opensubtitles2016: Extracting largeparallel corpora from movie and tv subtitles,” 2016.

[84] R. A. Hameed, N. Pathirennehelage, A. Ihalapathirana, M. Z.Mohamed, S. Ranathunga, S. Jayasena, G. Dias, and S. Fernando,“Automatic creation of a sentence aligned sinhala-tamil parallelcorpus,” in Proceedings of the 6th Workshop on South and SoutheastAsian Natural Language Processing (WSSANLP2016), 2016, pp. 124–132.

[85] M. Z. Mohamed, A. Ihalapathirana, R. A. Hameed, N. Pathiren-nehelage, S. Ranathunga, S. Jayasena, and G. Dias, “Automaticcreation of a word aligned sinhala-tamil parallel corpus,” inEngineering Research Conference (MERCon), 2017 Moratuwa. IEEE,2017, pp. 425–430.

[86] F. Farhath, P. Theivendiram, S. Ranathunga, S. Jayasena, andG. Dias, “Improving domain-specific smt for low-resourced lan-guages using data from different domains,” in Proceedings ofthe Eleventh International Conference on Language Resources andEvaluation (LREC-2018), 2018.

[87] S. Fernando, S. Ranathunga, S. Jayasena, and G. Dias, “Com-prehensive part-of-speech tag set and svm based pos tagger forsinhala,” in Proceedings of the 6th Workshop on South and SoutheastAsian Natural Language Processing (WSSANLP2016), 2016, pp. 173–182.

[88] N. Dilshani, S. Fernando, S. Ranathunga, S. Jayasena, and G. Dias,“A comprehensive part of speech (pos) tag set for sinhala lan-guage.” The Third International Conference on Linguistics inSri Lanka, ICLSL 2017 . . . , 2017.

[89] S. Fernando and S. Ranathunga, “Evaluation of different clas-

Page 14: 1 Survey on Publicly Available Sinhala Natural Language ...1 Survey on Publicly Available Sinhala Natural Language Processing Tools and Research Nisansa de Silva Abstract— Sinhala

14

sifiers for sinhala pos tagging,” in 2018 Moratuwa EngineeringResearch Conference (MERCon). IEEE, 2018, pp. 96–101.

[90] S. A. P. M. Manamini, A. F. Ahamed, R. A. E. C. Rajapakshe,G. H. A. Reemal, S. Jayasena, G. V. Dias, and S. Ranathunga,“Ananya-a named-entity-recognition (ner) system for sinhalalanguage,” in Moratuwa Engineering Research Conference (MER-Con), 2016. IEEE, 2016, pp. 30–35.

[91] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jegou, andT. Mikolov, “Fasttext. zip: Compressing text classification mod-els,” arXiv preprint arXiv:1612.03651, 2016.

[92] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, “Enrichingword vectors with subword information,” Transactions of theAssociation for Computational Linguistics, vol. 5, pp. 135–146, 2017.

[93] A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov, “Bag of tricksfor efficient text classification,” in Proceedings of the 15th Confer-ence of the European Chapter of the Association for ComputationalLinguistics: Volume 2, Short Papers, 2017, pp. 427–431.

[94] D. Herath, K. Gamage, and A. Malalasekara, “Research report onsinhala lexicon,” Langugae Technology Research Laboratory, UCSC.

[95] G. P. Malalasekera, “English-sinhalese dictionary.” 1967.[96] D. B. Jayathilake, Sinhala Shabdakoshaya (Sinhala Dictionary),

Prathama Bhagaya (Vol 1). Sri Lankan branch of Royal AsianSociety, 1937.

[97] S. Maitipe, Gunasena English-Sinhalese Concise Dictionary. MDGunasena & Co., 1988.

[98] A. Weerasinghe and C. P. Weerasinghe, “Godage english-sinhala-tamil dictionary,” Sri Lanka: S. Godage and brothers, Godage bookshop, vol. 661, 1999.

[99] H. Wijayathunga, Maha Sinhala Sabdakoshaya. M. D. Gunasena &Co. Ltd., Colombo, 2003.

[100] S. Ranaweera, Wasana English-Sinhala Dictionary. WasanaPrakashakayo, Dankotuwa, Sri Lanka, 2004.

[101] K. Gunaratne, Ratna English-Sinhalese Dictionary. Ratna PothPrakasakayo, 513, Maradana road, Colombo 10, Sri Lanka, 2006.

[102] M. Kulatunga. Madura english-sinhala dictionary - onlinelanguage translator. [Online]. Available: https://maduraonline.com/

[103] A. Wasala and R. Weerasinghe, “Ensitip: a tool to unlock theenglish web,” in 11th international conference on humans and com-puters, Nagaoka University of Technology, Japan, 2008, pp. 20–23.

[104] L. Samarawickrama and B. Hettige, “Requirements for anenglish-sinhala smart bilingual dictionary: A review.”

[105] Department of Official Languages, Sri Lanka. Tri-lingual dictio-nary. [Online]. Available: https://www.trilingualdictionary.lk/

[106] A. Weerasinghe and G. Dias, “Construction of a multilingualplace name database for sri lanka,” 2013.

[107] G. A. Miller, “Wordnet: a lexical database for english,” Communi-cations of the ACM, vol. 38, no. 11, pp. 39–41, 1995.

[108] Z. Wu and M. Palmer, “Verbs semantics and lexical selection,”in Proceedings of the 32nd annual meeting on Association for Compu-tational Linguistics. Association for Computational Linguistics,1994, pp. 133–138.

[109] J. J. Jiang and D. W. Conrath, “Semantic similarity based on cor-pus statistics and lexical taxonomy,” in Proc of 10th InternationalConference on Research in Computational Linguistics, ROCLING’97.Citeseer, 1997.

[110] N. de Silva, D. Dou, and J. Huang, “Discovering inconsisten-cies in pubmed abstracts through ontology-based informationextraction,” in Proceedings of the 8th ACM International Conferenceon Bioinformatics, Computational Biology, and Health Informatics.ACM, 2017, pp. 362–371.

[111] I. Wijesiri, M. Gallage, B. Gunathilaka, M. Lakjeewa, D. Wimala-suriya, G. Dias, R. Paranavithana, and N. de Silva, “Building awordnet for Sinhala,” in Proceedings of the Seventh Global WordNetConference, 2014, pp. 100–108.

[112] Sinhala wordnet. [Online]. Available: http://www.wordnet.lk/[113] V. Welgama, D. L. Herath, C. Liyanage, N. Udalamatta,

R. Weerasinghe, and T. Jayawardana, “Towards a sinhala word-net,” in Proceedings of the Conference on Human Language Technologyfor Development, 2011.

[114] S. Herath, T. Ikeda, S. Ishizaki, Y. Anzai, and H. Aiso, “Analysissystem for sinhalese unit structure,” Journal of Experimental &Theoretical Artificial Intelligence, vol. 4, no. 1, pp. 29–48, 1992.

[115] V. Welgama, R. Weerasinghe, and M. Niranjan, “Evaluating a ma-chine learning approach to sinhala morphological analysis,” inProceedings of the 10th International Conference on Natural LanguageProcessing, Noida, India, 2013.

[116] N. Fernando and R. Weerasinghe, “A morphological parser forsinhala verbs,” in Proceedings of the International Conference onAdvances in ICT for Emerging Regions, 2013.

[117] W. S. N. Dilshani and G. Dias, “A corpus-based morphologicalanalysis of sinhala verbs.” The Third International Conferenceon Linguistics in Sri Lanka, ICLSL 2017 . . . , 2017.

[118] M. Nandathilaka, S. Ahangama, and G. T. Weerasuriya, “A rule-based lemmatizing approach for sinhala language,” in 2018 3rdInternational Conference on Information Technology Research (ICITR).IEEE, 2018, pp. 1–5.

[119] S. Basnayake, H. Wijekoon, and T. K. Wijayasiriwardhane, “Pla-giarism detection in sinhala language: A software approach.”

[120] V. Welgama, R. Weerasinghe, and N. Mahesan, “Defining the goldstandard definitions for the morphology of sinhala words.”

[121] D. L. Herath and A. R. Weerasinghe, “A stochastic part ofspeech tagger for sinhala,” in Proceedings of the 06th InternationalInformation Technology Conference, 2004, pp. 27–28.

[122] A. J. P. M. P. Jayaweera and N. G. J. Dias, “Evaluation of stochasticbased tagging approach for sinhala language,” 2012.

[123] M. Jayasuriya and A. R. Weerasinghe, “Learning a stochastic partof speech tagger for sinhala,” in Advances in ICT for EmergingRegions (ICTer), 2013 International Conference on. IEEE, 2013, pp.137–143.

[124] A. J. P. M. P. Jayaweera and N. G. J. Dias, “Part of speech (pos)tagger for sinhala language,” 2011.

[125] ——, “Hidden markov model based part of speech tagger forsinhala language,” arXiv preprint arXiv:1407.2989, 2014.

[126] ——, “Unknown words analysis in pos tagging of sinhala lan-guage,” in Advances in ICT for Emerging Regions (ICTer), 2014International Conference on. IEEE, 2014, pp. 270–270.

[127] ——, “Handling issues with unknown words in pos tagging.”Book of Abstracts, Annual Research Symposium 2014, 2014.

[128] M. Jayaweera and N. G. J. Dias, “Comparison of part of speechtaggers for sinhala language,” 2016.

[129] A. J. P. M. P. Jayaweera and N. G. J. Dias, “Restful pos taggingweb service for sinhala language,” in 2015 Fifteenth InternationalConference on Advances in ICT for Emerging Regions (ICTer). IEEE,2015, pp. 50–57.

[130] D. Gunasekara, W. V. Welgama, and A. R. Weerasinghe, “Hybridpart of speech tagger for sinhala language,” in Advances in ICTfor Emerging Regions (ICTer), 2016 Sixteenth International Conferenceon. IEEE, 2016, pp. 41–48.

[131] C. D. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. J. Bethard,and D. McClosky, “The Stanford CoreNLP natural languageprocessing toolkit,” in Association for Computational Linguistics(ACL) System Demonstrations, 2014, pp. 55–60. [Online]. Available:http://www.aclweb.org/anthology/P/P14/P14-5010

[132] A. M. Gunasekara, A comprehensive grammar of the Sinhalese lan-guage: adapted for the use of English readers and prescribed for the CivilService examinations. GJA Skeen, 1891.

[133] J. Aissen, “Differential object marking: Iconicity vs. economy,”Natural Language & Linguistic Theory, vol. 21, no. 3, pp. 435–483,2003.

[134] J. K. Dahanayaka and A. R. Weerasinghe, “Named entity recogni-tion for sinhala language,” in Advances in ICT for Emerging Regions(ICTer), 2014 International Conference on. IEEE, 2014, pp. 215–220.

[135] K. U. Senevirathne, N. S. Attanayake, A. W. M. H. Dhananjanie,W. A. S. U. Weragoda, A. Nugaliyadde, and S. Thelijjagoda,“Conditional random fields based named entity recognition forsinhala,” in 2015 IEEE 10th International Conference on Industrialand Information Systems (ICIIS). IEEE, 2015, pp. 302–307.

[136] S. Herath, S. Ishizaki, T. Ikeda, Y. Anzai, and H. Aiso, “Syntacticand semantic analysis of sinhala: a step towards intelligence com-puting systems,” in Proceedings. 5th IEEE International Symposiumon Intelligent Control 1990. IEEE, 1990, pp. 316–324.

[137] J. C. S. Kadupitiya, S. Ranathunga, and G. Dias, “Sinhalashort sentence similarity calculation using corpus-based andknowledge-based similarity measures,” in Proceedings of the 6thWorkshop on South and Southeast Asian Natural Language Processing(WSSANLP2016), 2016, pp. 44–53.

[138] ——, “Sinhala short sentence similarity measures using corpus-based simi-larity for short answer grading,” in 6th Workshop onSouth and Southeast Asian Natural Language Processing, 2017, pp.44–53.

[139] W. V. Welgama, “Automatic text summarization for sinhala,”2012.

[140] D. Yarowsky, “Word-sense disambiguation using statistical mod-

Page 15: 1 Survey on Publicly Available Sinhala Natural Language ...1 Survey on Publicly Available Sinhala Natural Language Processing Tools and Research Nisansa de Silva Abstract— Sinhala

15

els of roget’s categories trained on large corpora,” in Proceedingsof the 14th conference on Computational linguistics-Volume 2. Asso-ciation for Computational Linguistics, 1992, pp. 454–460.

[141] N. Ide and J. Veronis, “Introduction to the special issue onword sense disambiguation: the state of the art,” Computationallinguistics, vol. 24, no. 1, pp. 2–40, 1998.

[142] D. Yarowsky, “Unsupervised word sense disambiguation rivalingsupervised methods,” in 33rd annual meeting of the association forcomputational linguistics, 1995, pp. 189–196.

[143] S. Banerjee and T. Pedersen, “An adapted lesk algorithm for wordsense disambiguation using wordnet,” in International conferenceon intelligent text processing and computational linguistics. Springer,2002, pp. 136–145.

[144] R. Navigli, “Word sense disambiguation: A survey,” ACM com-puting surveys (CSUR), vol. 41, no. 2, p. 10, 2009.

[145] J. Arukgoda, V. Bandara, S. Bashani, V. Gamage, and D. Wimala-suriya, “A word sense disambiguation technique for sinhala,”in 2014 4th International Conference on Artificial Intelligence withApplications in Engineering and Technology. IEEE, 2014, pp. 207–211.

[146] M. Lesk, “Automatic sense disambiguation using machine read-able dictionaries: how to tell a pine cone from an ice cream cone,”in Proceedings of the 5th annual international conference on Systemsdocumentation. Citeseer, 1986, pp. 24–26.

[147] C. Marasinghe, S. Herath, and A. Herath, “Word sense disam-biguation of sinhala language with unsupervised learning,” inProc. International Conference on Information Technology and Appli-cations, 2002, pp. 25–29.

[148] S. Palihakkara, D. Sahabandu, A. Shamsudeen, C. Bandara, andS. Ranathunga, “Dialogue act recognition for text-based sinhala,”in Proceedings of the 12th International Conference on Natural Lan-guage Processing, 2015, pp. 367–375.

[149] S. Gallege, “Analysis of sinhala using natural language process-ing techniques,” 2010.

[150] K. B. N. Lakmali and P. S. Haddela, “Effectiveness of rule-based classifiers in sinhala text categorization,” in 2017 NationalInformation Technology Conference (NITC). IEEE, 2017, pp. 153–158.

[151] P. K. S. Kumari and P. S. Haddela, “Use of lime for humaninterpretability in sinhala document classification,” in 2019 In-ternational Research Conference on Smart Computing and SystemsEngineering (SCSE). IEEE, 2019, pp. 97–102.

[152] M. T. Ribeiro, S. Singh, and C. Guestrin, “Why should i trust you?:Explaining the predictions of any classifier,” in Proceedings of the22nd ACM SIGKDD international conference on knowledge discoveryand data mining. ACM, 2016, pp. 1135–1144.

[153] P. D. T. Chathuranga, S. A. S. Lorensuhewa, and M. A. L. Kalyani,“Sinhala sentiment analysis using corpus based sentiment lexi-con,” in International Conference on Advances in ICT for EmergingRegions (ICTer), vol. 1, 2019, p. 7.

[154] P. Nanayakkara and S. Ranathunga, “Clustering sinhala news ar-ticles using corpus-based similarity measures,” in 2018 MoratuwaEngineering Research Conference (MERCon). IEEE, 2018, pp. 437–442.

[155] S. V. S. Gunasekara and P. S. Haddela, “Context aware stopwordsfor sinhala text classification,” in 2018 National Information Tech-nology Conference (NITC). IEEE, 2018, pp. 1–6.

[156] S. H. Jayasinghe and K. Sirts, “Deep learning textual entailmentsystem for sinhala language,” 2019.

[157] A. Wasala and K. Gamage, “Research report on phonetics andphonology of sinhala,” Language Technology Research Laboratory,University of Colombo School of Computing, vol. 35, 2005.

[158] R. I. P. Wickramasinghe, K. H. Kumara, and N. G. J. Dias,“Practical issues in the development of tts and sr for the sinhalalanguage,” 2007.

[159] R. Weerasinghe, A. Wasala, and K. Gamage, “A rule basedsyllabification algorithm for sinhala,” in International Conferenceon Natural Language Processing. Springer, 2005, pp. 438–449.

[160] A. Wasala, R. Weerasinghe, and K. Gamage, “Sinhala grapheme-to-phoneme conversion and rules for schwa epenthesis,” in Pro-ceedings of the COLING/ACL on Main conference poster sessions.Association for Computational Linguistics, 2006, pp. 890–897.

[161] T. Nadungodage, C. Liyanage, A. Prerera, R. Pushpananda, andR. Weerasinghe, “Sinhala g2p conversion for speech processing,”in Proc. The 6th Intl. Workshop on Spoken Language Technologies forUnder-Resourced Languages, pp. 112–116.

[162] R. Weerasinghe, A. Wasala, V. Welgama, and K. Gamage,

“Festival-si: A sinhala text-to-speech system,” in InternationalConference on Text, Speech and Dialogue. Springer, 2007, pp. 472–479.

[163] M. S. Amarasekara, K. M. N. S. Bandara, B. V. A. I. Vithana,D. H. De Silva, and A. Jayakody, “Real-time interactive voicecommunication-for a mute person in sinhala (rtivc),” in 2013 8thInternational Conference on Computer Science & Education. IEEE,2013, pp. 671–675.

[164] K. H. Kumara, N. G. J. Dias, and H. Sirisena, “Automatic seg-mentation of given set of sinhala text into syllables for speechsynthesis,” pp. 53–62, 2007.

[165] W. M. C. Bandara, W. M. S. Lakmal, T. D. Liyanagama, S. V. Bu-lathsinghala, G. Dias, and S. Jayasena, “A ew prosodic phrasingmethod for sinhala language,” 2017.

[166] W. M. C. Bandara, S. V. Bulathsinghala, W. M. S.Lakmal, T. D.Liyanagama, G. Dias, and S. Jayasena, “Sinhala text to speechsystem,” 2009.

[167] W. M. C. Bandara, V. M. S. Lakmal, T. D. Liyanagama, S. V. Bu-lathsinghala, G. Dias, and S. Jayasena, “A new prosodic phrasingmodel for sinhala language,” 2013.

[168] K. Sodimana, P. De Silva, R. Sproat, A. Theeraphol, C. F. Li,A. Gutkin, S. Sarin, and K. Pipatsrisawat, “Text normalizationfor bangla, khmer, nepali, javanese, sinhala, and sundanese text-to-speech systems,” 2018.

[169] K. Sodimana, K. Pipatsrisawat, L. Ha, M. Jansche, O. Kjartansson,P. De Silva, and S. Sarin, “A step-by-step process for buildingtts voices using open source data and framework for bangla,javanese, khmer, nepali, sinhala, and sundanese,” 2018.

[170] L. Nanayakkara, C. Liyanage, P.-T. Viswakula, T. Nagungodage,R. Pushpananda, and R. Weerasinghe, “A human quality textto speech system for sinhala,” in Proc. The 6th Intl. Workshop onSpoken Language Technologies for Under-Resourced Languages, pp.157–161.

[171] T. Nadungodage, R. Weerasinghe, and M. Niranjan, “Speechrecognition for low resourced languages: Efficient use of trainingdata for sinhala speech recognition by active learning.”

[172] T. Nadungodage and R. Weerasinghe, “Continuous sinhalaspeech recognizer,” in Conference on Human Language Technologyfor Development, Alexandria, Egypt, 2011, pp. 2–5.

[173] T. Nadungodage, R. Weerasinghe, and M. Niranjan, “Efficientuse of training data for Sinhala speech recognition using activelearning,” in Advances in ICT for Emerging Regions (ICTer), 2013International Conference on. IEEE, 2013, pp. 149–153.

[174] ——, “Speaker Adaptation Applied to Sinhala Speech Recogni-tion.” Int. J. Comput. Linguistics Appl., vol. 6, no. 1, pp. 117–129,2015.

[175] W. G. T. N. Amarasingha and D. D. A. Gamini, “Speakerindependent sinhala speech recognition for voice dialling,” inInternational Conference on Advances in ICT for Emerging Regions(ICTer2012). IEEE, 2012, pp. 3–6.

[176] W. Manamperi, D. Karunathilake, T. Madhushani, N. Galagedara,and D. Dias, “Sinhala speech recognition for interactive voiceresponse systems accessed through mobile phones,” in 2018Moratuwa Engineering Research Conference (MERCon). IEEE, 2018,pp. 241–246.

[177] P. G. N. Priyadarshani, “Speaker dependent speech recognitionon a selected set of sinhala words,” 2012.

[178] P. G. N. Priyadarshani, N. G. J. Dias, and A. Punchihewa, “Dy-namic time warping based speech recognition for isolated sinhalawords,” in 2012 IEEE 55th International Midwest Symposium onCircuits and Systems (MWSCAS). IEEE, 2012, pp. 892–895.

[179] ——, “Genetic algorithm approach for sinhala speech recog-nition,” in 2012 IEEE 55th International Midwest Symposium onCircuits and Systems (MWSCAS). IEEE, 2012, pp. 896–899.

[180] P. G. N. Priyadarshani and N. G. J. Dias, “Automatic segmen-tation of separately pronounced sinhala words into syllables,”2011.

[181] M. K. H. Gunasekara and R. G. N. Meegama, “Real-time transla-tion of discrete sinhala speech to unicode text,” in 2015 FifteenthInternational Conference on Advances in ICT for Emerging Regions(ICTer). IEEE, 2015, pp. 140–145.

[182] M. Punchimudiyanse and R. G. N. Meegama, “Unicode sinhalaand phonetic english bi-directional conversion for sinhala speechrecognizer,” in 2015 IEEE 10th International Conference on Industrialand Information Systems (ICIIS). IEEE, 2015, pp. 296–301.

[183] Y. Karunanayake, U. Thayasivam, and S. Ranathunga, “Transferlearning based free-form speech command classification for low-

Page 16: 1 Survey on Publicly Available Sinhala Natural Language ...1 Survey on Publicly Available Sinhala Natural Language Processing Tools and Research Nisansa de Silva Abstract— Sinhala

16

resource languages,” in Proceedings of the 57th Conference of the As-sociation for Computational Linguistics: Student Research Workshop,2019, pp. 288–294.

[184] D. Buddhika, R. Liyadipita, S. Nadeeshan, H. Witharana,S. Javasena, and U. Thayasivam, “Domain specific intent clas-sification of sinhala speech data,” in 2018 International Conferenceon Asian Language Processing (IALP). IEEE, 2018, pp. 197–202.

[185] R. K. Rajapakse, A. R. Weerasinghe, and E. K. Seneviratne, “Aneural network based character recognition system for sinhalascript,” Department of Statistics and Computer Science, University ofColombo, 1995.

[186] H. L. Premaratne and J. Bigun, “Recognition of printed sinhalacharacters using linear symmetry,” in The 5th Asian Conference onComputer Vision, 2002, pp. 23–25.

[187] ——, “A segmentation-free approach to recognise printed sinhalascript using linear symmetry,” Pattern recognition, vol. 37, no. 10,pp. 2081–2089, 2004.

[188] H. L. Premaratne, E. Jarpe, and J. Bigun, “Lexicon and hid-den markov model-based optimisation of the recognised sinhalascript,” Pattern recognition letters, vol. 27, no. 6, pp. 696–705, 2006.

[189] S. Hewavitharana, H. C. Fernando, and N. D. Kodikara, “Off-linesinhala handwriting recognition using hidden markov models.”in ICVGIP, 2002.

[190] S. Hewavitharana and N. D. Kodikara, “A statistical approachto sinhala handwriting recognition,” in Proc. of the InternationalInformation Technology Conference (IITC), Colombo, Sri Lanka, 2002.

[191] S. Ajward, N. Jayasundara, S. Madushika, and R. Ragel, “Con-verting printed sinhala documents to formatted editable text,” in2010 Fifth International Conference on Information and Automationfor Sustainability. IEEE, 2010, pp. 138–143.

[192] P. T. C. Madushanka, R. Bandara, and L. Ranathunga, “Sinhalahandwritten character recognition by using enhanced thinningand curvature histogram based method,” in 2017 IEEE 2nd Inter-national Conference on Signal and Image Processing (ICSIP). IEEE,2017, pp. 46–50.

[193] M. L. M. Karunanayaka, N. D. Kodikara, and G. D. S. P.Wimalaratne, “Off line sinhala handwriting recognition with anapplication for postal city name recognition,” Il’I’C 2004, 2004.

[194] R. Weerasinghe, A. Wasala, D. Herath, and V. Welgama, “Nlpapplications of sinhala: Tts & ocr,” in Proceedings of the Third Inter-national Joint Conference on Natural Language Processing: Volume-II,2008.

[195] A. R. Weerasinghe, D. L. Herath, and N. P. K. Medagoda, “Anearest-neighbor based algorithm for printed sinhala characterrecognition,” Innovations for a Knowledge Economy, p. 11, 2006.

[196] D. N. Ediriweera, “Improviing the accuracy of the output ofsinhala ocr by using a dictionary,” Ph.D. dissertation, Universityof Moratuwa Sri Lanka, 2012.

[197] G. Dias, T. N. P. Patikirikorala, C. I. Arambewela, R. P. M.Darshana, and N. D. Alahendra, “Sinhala optical character recog-nition for desktops,” 2013.

[198] G. Dias, T. N. P. Patikirikorala, C. I. Arambewela, R. P. M.Darshani, and N. D. Alahendra, “Online sinhala handwrittencharacter recognition for desktops,” 2013.

[199] M. H. P. Ranmuthugala, G. D. N. C. Pathiragoda, S. H. C.Jayasundara, G. Dias, and A. S. Karunananda, “Online sinhalahandwritten character recognition on handheld devices,” Innova-tions for a Knowledge Economy, p. 1, 2006.

[200] M. Rimas, R. P. Thilakumara, and P. Koswatta, “Optical characterrecognition for sinhala language,” in 2013 IEEE Global Humanitar-ian Technology Conference: South Asia Satellite (GHTC-SAS). IEEE,2013, pp. 149–153.

[201] G. I. Gunarathna, M. A. P. Chamikara, and R. G. Ragel, “A fuzzybased model to identify printed sinhala characters,” in 7th Inter-national Conference on Information and Automation for Sustainability.IEEE, 2014, pp. 1–6.

[202] H. W. H. Premachandra, C. Premachandra, T. Kimura, andH. Kawanaka, “Artificial neural network based sinhala characterrecognition,” in International Conference on Computer Vision andGraphics. Springer, 2016, pp. 594–603.

[203] J. M. H. M. Jayamaha and H. M. M. Naleer, “Feature extractiontechnique based character recognition using artificial neural net-work for sinhala characters,” 2016.

[204] T. N. Kumara and R. Ragel, “A systematic feature selectionprocess for a sinhala character recognition system,” in 2016IEEE International Conference on Information and Automation forSustainability (ICIAfS). IEEE, 2016, pp. 1–6.

[205] B. R. Jayawickrama, L. Ranathunga, K. L. Mahaliyanaarachchi,L. G. B. Subhagya, and W. H. A. Nimasha, “Letter segmentationand modifier detection in printed sinhala signage,” in 2018 18thInternational Conference on Advances in ICT for Emerging Regions(ICTer). IEEE, 2018, pp. 203–208.

[206] S. Gunawardhana and L. Ranathunga, “Segmentation and iden-tification of presence of sinhala characters in facebook images,”in 2018 IEEE 13th International Conference on Industrial and Infor-mation Systems (ICIIS). IEEE, 2018, pp. 77–82.

[207] H. C. Fernando, N. D. Kodikara, and S. Hewavitharana, “Adatabase for handwriting recognition research in sinhala lan-guage.” in ICDAR, 2003, pp. 1262–1264.

[208] M. L. M. Karunanayaka, C. A. Marasinghe, and N. D. Kodikara,“Thresholding, noise reduction and skew correction of sinhalahandwritten words.” in MVA, 2005, pp. 355–358.

[209] B. Jayasekara and L. Udawatta, “Non-cursive sinhala hand-written script recognition: A genetic algorithm based alphabettraining approach,” in Proceedings of the International Conferenceon Information and Automation, 2005.

[210] N. P. T. I. Nilaweera, H. L. Premeratne, and D. U. J. Sonnadara,“Comparison of projection and wavelet based techniques inrecognition of sinhala handwritten scripts,” in Proceedings of the25th National IT Conference, 2007.

[211] C. Silva and C. Kariyawasam, “Segmenting sinhala handwrittencharacters,” International Journal of Conceptions on Computing andInformation Technology, vol. 2, no. 4, pp. 22–26, 2014.

[212] C. M. Silva, N. D. Jayasundere, and C. Kariyawasam, “State ofhandwriting recognition of modern sinhala script,” 2014.

[213] ——, “Contour tracing for isolated sinhala handwritten characterrecognition,” in 2015 Fifteenth International Conference on Advancesin ICT for Emerging Regions (ICTer). IEEE, 2015, pp. 25–31.

[214] K. A. K. N. D. Dharmapala, W. P. M. V. Wijesooriya, C. P.Chandrasekara, U. K. A. U. Rathnapriya, and L. Ranathunga,“Sinhala handwriting recognition mechanism using zone basedfeature extraction,” 2017.

[215] K. S. A. Walawage and L. Ranathunga, “Segmentation of overlap-ping and touching sinhala handwritten characters,” in 2018 3rdInternational Conference on Information Technology Research (ICITR).IEEE, 2018, pp. 1–6.

[216] K. A. M. P. Rathnasena, K. M. S. J. Kumarasinghe, D. T. P.Paranavitharana, D. V. A. U. Dayarathne, and L. Ranathunga,“Summarization based approach for old sinhala text archivalsearch and preservation,” in 2018 18th International Conference onAdvances in ICT for Emerging Regions (ICTer). IEEE, 2018, pp.182–188.

[217] T. M. T. H. Peiris, “Recognition of inscriptions in ancient srilanka,” 2012.

[218] K. G. N. D. Karunarathne, K. V. Liyanage, D. A. S. Ruwanmini,K. Dias, and S. Nandasara, “Recognizing ancient sinhala inscrip-tion characters using neural network technologies,” InternationaJournal of Scientific Emgineering and Applied Sciences, vol. 3, no. 1,2017.

[219] S. Chanda, S. Pal, and U. Pal, “Word-wise sinhala tamil andenglish script identification using gaussian kernel svm,” in 200819th International Conference on Pattern Recognition. IEEE, 2008,pp. 1–4.

[220] J. Liyanapathirana and R. Weerasinghe, “English to sinhalamachine translation: Towards better information access for srilankans,” in Conference on Human Language Technology for Devel-opment, 2011, pp. 182–186.

[221] J. U. Liyanapathirana, “A statistical approach to english andsinhala translation,” 2013.

[222] L. Wijerathna, W. L. S. L. Somaweera, S. L. Kaduruwana, Y. V.Wijesinghe, D. I. De Silva, K. Pulasinghe, and S. Thellijjagoda, “Atranslator from sinhala to english and english to sinhala (sees),”in International Conference on Advances in ICT for Emerging Regions(ICTer2012). IEEE, 2012, pp. 14–18.

[223] D. De Silva, A. Alahakoon, I. Udayangani, V. Kumara, D. Kolon-nage, H. Perera, and S. Thelijjagoda, “Sinhala to english languagetranslator,” in 2008 4th International Conference on Information andAutomation for Sustainability. IEEE, 2008, pp. 419–424.

[224] A. M. Silva and R. Weerasinghe, “Example based machine trans-lation for english-sinhala translations,” in Proceedings of the 09thInternational IT Conference, 2008, pp. 27–28.

[225] A. J. Vidanaralage, A. U. Illangakoon, S. Y. Sumanaweera,C. Pavithra, and S. Thelijjagoda, “Sinhala language decoder,” in2018 National Information Technology Conference (NITC). IEEE,

Page 17: 1 Survey on Publicly Available Sinhala Natural Language ...1 Survey on Publicly Available Sinhala Natural Language Processing Tools and Research Nisansa de Silva Abstract— Sinhala

17

2018, pp. 1–5.[226] J. Joseph, W. Chathurika, A. Nugaliyadde, and

Y. Mallawarachchi, “Evolutionary algorithm for sinhala toenglish translation,” arXiv preprint arXiv:1907.03202, 2019.

[227] P. Tennage, P. Sandaruwan, M. Thilakarathne, A. Herath,S. Ranathunga, S. Jayasena, and G. Dias, “Neural machinetranslation for sinhala and tamil languages,” in Asian LanguageProcessing (IALP), 2017 International Conference on. IEEE, 2017,pp. 189–192.

[228] P. N. Tennage, M. W. D. P. Sandaruwan, J. K. M. M. Thilakarathne,A. N. Herath, S. Ranathunga, S. Jayasena, and G. Dias, “Neuralmachine translation for sinhala-tamil,” 2017.

[229] P. Tennage, A. Herath, M. Thilakarathne, P. Sandaruwan, andS. Ranathunga, “Transliteration and byte pair encoding to im-prove tamil to sinhala neural machine translation,” in 2018Moratuwa Engineering Research Conference (MERCon). IEEE, 2018,pp. 390–395.

[230] P. Tennage, P. Sandaruwan, M. Thilakarathne, A. Herath, andS. Ranathunga, “Handling rare word problem using synthetictraining data for sinhala and tamil neural machine translation,”in Proceedings of the Eleventh International Conference on LanguageResources and Evaluation (LREC-2018), 2018.

[231] S. Ranathunga, F. Farhath, U. Thayasivam, S. Jayasena, andG. Dias, “Si-ta: Machine translation of sinhala and tamil officialdocuments,” in 2018 National Information Technology Conference(NITC). IEEE, 2018, pp. 1–6.

[232] F. Farhath, S. Ranathunga, S. Jayasena, and G. Dias, “Integrationof bilingual lists for domain-specific statistical machine trans-lation for sinhala-tamil,” in 2018 Moratuwa Engineering ResearchConference (MERCon). IEEE, 2018, pp. 538–543.

[233] R. Weerasinghe, “A statistical machine translation approach tosinhala-tamil language translation,” Towards an ICT enabled Soci-ety, p. 136, 2003.

[234] S. Sripirakas, A. R. Weerasinghe, and D. L. Herath, “Statisticalmachine translation of systems for sinhala-tamil,” in Advances inICT for Emerging Regions (ICTer), 2010 International Conference on.IEEE, 2010, pp. 62–68.

[235] M. Jeyakaran and R. Weerasinghe, “A novel kernel regressionbased machine translation system for sinhala-tamil translation,”in Proceedings of 4th Annual UCSC Research Symposium, 2013.

[236] R. Pushpananda, R. Weerasinghe, and M. Niranjan, “Towardssinhala tamil machine translation,” in Advances in ICT for Emerg-ing Regions (ICTer), 2013 International Conference on. IEEE, 2013,pp. 288–288.

[237] ——, “Sinhala-tamil machine translation: Towards better transla-tion quality,” in Proceedings of the Australasian Language TechnologyAssociation Workshop 2014, 2014, pp. 129–133.

[238] S. Rajpirathap, S. Sheeyam, K. Umasuthan, and A. Chelvara-jah, “Real-time direct translation system for sinhala and tamillanguages,” in 2015 Federated Conference on Computer Science andInformation Systems (FedCSIS). IEEE, 2015, pp. 1437–1443.

[239] W. S. N. Dilshani, S. Yashothara, R. T. Uthayasanker, andS. Jayasena, “Linguistic divergence of sinhala and tamil lan-guages in machine translation,” in 2018 International Conferenceon Asian Language Processing (IALP). IEEE, 2018, pp. 13–18.

[240] T. Mokanarangan, “Translation of named entities between sin-hala and tamil for official government documents,” 2019.

[241] A. Arukgoda, A. R. Weerasinghe, and R. Pushpananda, “Improv-ing sinhala-tamil translation through deep learning techniques,”2019.

[242] R. M. M. Shalini and B. Hettige, “Dictionary based machinetranslation system for pali to sinhala,” in SLAAI-InternationalConference on Artificial Intelligence, 2017, p. 23.

[243] R. C. Childers, A dictionary of the Pali language. Trubner, 1875.[244] S. Salaville, An introduction to the study of eastern liturgies. Sands

& Company, 1938.[245] A. J. Liddicoat, “Choosing a liturgical language,” Australian Re-

view of Applied Linguistics, vol. 16, no. 2, pp. 123–141, 1993.[246] A. Wasala, R. Weerasinghe, R. Pushpananda, C. Liyanage, and

E. Jayalatharachchi, “A data-driven approach to checking andcorrecting spelling errors in sinhala,” Int. J. Adv. ICT Emerg. Reg,vol. 3, no. 01, 2010.

[247] R. A. Wasala, R. Weerasinghe, R. Pushpananda, C. Liyanage, andE. Jayalatharachchi, “An open-source data driven spell checkerfor sinhala,” ICTer, vol. 3, no. 1, 2011.

[248] E. Jayalatharachchi, A. Wasala, and R. Weerasinghe, “Data-drivenspell checking: the synergy of two algorithms for spelling error

detection and correction,” in International Conference on Advancesin ICT for Emerging Regions (ICTer2012). IEEE, 2012, pp. 7–13.

[249] L. G. B. Subhagya, L. Ranathunga, W. H. A. Nimasha, B. R.Jayawickrama, and K. L. Mahaliyanaarchchi, “Data driven ap-proach to sinhala spellchecker and correction,” in 2018 18thInternational Conference on Advances in ICT for Emerging Regions(ICTer). IEEE, 2018, pp. 01–06.

[250] M. Punchimudiyanse and R. G. N. Meegama, “Computer in-terpreter for translating written sinhala to sinhala sign,” OUSLJournal, vol. 12, no. 1, pp. 70–90, 2017.

[251] ——, “Animation of fingerspelled words and number signs ofthe sinhala sign language,” ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP), vol. 16, no. 4,p. 24, 2017.

[252] S. C. Fernando, “Inexact matching of proper names in sinhala,”2011.

[253] A. B. P. Kanduboda and K. Tamaoka, “Priority information indetermining canonical word order of colloquial sinhalese sen-tences,” in Proceedings of the 139th Conference of the LinguisticSociety of Japan, vol. 1, 2009, pp. 32–37.

[254] ——, “Priority information for canonical word order of writtensinhala sentences,” in Proceedings of the 140th Conference of theLinguistic Society of Japan, 2010, pp. 358–363.

[255] K. Tamaoka, P. B. A. Kanduboda, and H. Sakai, “Effects of wordorder alternation on the sentence processing of sinhalese writtenand spoken forms,” Open Journal of Modern Linguistics, vol. 1,no. 02, pp. 24–32, 2011.

[256] A. B. P. Kanduboda and K. Tamaoka, “Priority information deter-mining the canonical word order of written sinhalese sentences,”Open Journal of Modern Linguistics, vol. 2, no. 01, p. 26, 2012.

[257] J. Jayakody, T. Gamlath, W. Lasantha, K. Premachandra, A. Nu-galiyadde, and Y. Mallawarachchi, ““mahoshadha”, the sinhalatagged corpus based question answering system,” in Proceedingsof First International Conference on Information and CommunicationTechnology for Intelligent Systems: Volume 1. Springer, 2016, pp.313–322.

[258] J. B. Disanayake, Basaka Mahima: 2 - Akuru ha Pili. Godage &brothers, 2000.

[259] ——, Basaka Mahima: 6 - Prakurthi. Godage & brothers, 2004.[260] ——, Sinhala Reethiya: 7 - Pada Nirmanaya. Sumitha Books, 2014.[261] ——, Basaka Mahima: 8 - Tadditha. Godage & brothers, 2000.[262] ——, Basaka Mahima: 10 - Nama Padaya. Godage & brothers, 2008.[263] ——, Basaka Mahima: 11 - Kriya Padaya. Godage & brothers, 2001.[264] ——, The structure of spoken Sinhala: Sounds and their patterns.

National Institute of Education, 1991.[265] ——, Sinhala Akshara Vicharaya (Sinhala Graphology). Sumitha

Publishers, 2006.[266] ——, The Usage of Dental and Cerebral Nasals. Sumitha Publishers,

2007.[267] ——, Bashavaka rata samudaya. Lake house investment Co. Ltd.

Colombo 2, 1969.[268] ——, Grammar of Contemporary Literary Sinhala-Introduction to

Grammar, Structure of Spoken Sinhala. Godage & Bros, 1995, vol.661.

[269] D. A. Indrasena, Sinhala Akshara Malava, 2001.[270] K. Jayathilake, Modern Sinhalese lingustics. Pradeepa Publica-

tions, 1991.[271] D. K. Henadeerage, “Topics in sinhala syntax,” Ph.D. disserta-

tion, The Australian National University, 2002.[272] A. E. S. Dasanayaka, kumara rachanaya; Grade 4. M D Gunasena

Publishers, Colombo, 1990.[273] ——, kumara rachanaya; Grade 5. M D Gunasena Publishers,

Colombo, 2005.[274] S. O. Fernando, Wara nonamena Nipatha. Sammana, January 1994.[275] ——, Kriya pada igena ganimu. Sammana, January 1994.[276] ——, Sinhala Nouns year 11. Sammana, March 1994.[277] E. Ranawake, Spoken Sinhalese for Foreigners. MD Gunasena &

Co. Ltd., 1986.[278] A. M. Gunasekara, A comprehensive grammar of the Sinhalese lan-

guage. Asian Educational Services, 1999.[279] S. Koparahewa, Dictionary of Sinhala Spelling. S. Godage and

Brothers, Colombo, 2006.[280] J. W. Gair and W. S. Karunatillake, The Sinhala Writing System, A

Guide to Transliteration. Sinhamedia, PO Box 1027, Trumansburg,NY 14886, 2006.

[281] W. S. Karunatillake, An Introduction to Spoken Sinhala. M DGunasena & Company, 1990.

Page 18: 1 Survey on Publicly Available Sinhala Natural Language ...1 Survey on Publicly Available Sinhala Natural Language Processing Tools and Research Nisansa de Silva Abstract— Sinhala

18

[282] M. Inman, Duration and Stress in Sinhala. Stanford University,1986.

[283] K. Munidasa, Vyakarana Vivaranaya. MD Gunasena Publishers,Colombo, 1938.

[284] National Institute of Education, Sinhala Laekhana Reethiya, 1989.[285] K. Jayathilake, Nuthana Sinhala Vyakaranaye Mul Potha. Pradeepa

Publications, 1991.[286] U. S. Sannasgala and A. Perera, Viyakarana Vimansawa, 1995.[287] V. G. Balagalle, Basha Adauanayasaha Sinhala Vivaharaya, 1995.[288] W. S. Karunatilaka, Sinhala Basha Viharanaya. M D Gunasena

Publishers, Colombo, 2004.[289] S. Karunarathna, Sinahala Viharanaya. Washana prakasakayo,

Dankotuwa, Sri Lanka, 2004.[290] P. Alwis, Niwaeradi Wahara. Suriya Prakashakayo, 2006, vol. 109.[291] ——, Niwaeradi Wahara -2. Suriya Prakashakayo, 2007.[292] K. C. Perera, Prayogika Sinhla Viyakaranaya.[293] K. Munidasa, Kriya Vivaranaya. MD Gunasena Publishers,

Colombo, 1993.[294] T. Jayawardana, The surface case system in Sinhala. University of

Kelaniya, 1989.[295] D. Rajapaksha, Sinhala bhashave pada bedima saha virama lakshana

bhavithaya, 2008.[296] S. M. Kariyakarawana, The syntax of focus and wh-questions in

Sinhala. Karunaratne & Sons Limited, 1998.[297] S. L. Kekulawala, The future tense in Sinhalese – an ‘unorthodox’

point of view. Vidyalankara University of Ceylon, 1972.