A COMPUTATIONAL GRAMMAR OF SINHALA FOR ENGLISH-SINHALA MACHINE TRANSLATION B. Hettige (08/8021) Degree of Master of Philosophy Department of Information Technology University of Moratuwa Sri Lanka December 2010


  • A COMPUTATIONAL GRAMMAR OF SINHALA FOR

    ENGLISH-SINHALA MACHINE TRANSLATION

    B. Hettige

    (08/8021)

    Degree of Master of Philosophy

    Department of Information Technology

    University of Moratuwa

    Sri Lanka

    December 2010

  • A COMPUTATIONAL GRAMMAR OF SINHALA FOR

    ENGLISH-SINHALA MACHINE TRANSLATION

    Budditha Hettige

    (08/8021)

    Thesis submitted in partial fulfillment of the requirements for the degree

    Master of Philosophy

    Department of Information Technology

    University of Moratuwa

    Sri Lanka

    December 2010

  • Declaration of the Candidate and the Supervisor

    I declare that this is my own work and this thesis does not incorporate any material

    previously submitted for a Degree or Diploma in any other University or institute of

    higher learning, without acknowledgement. It does not contain any material

    previously published or written by another person except where the

    acknowledgement is made in the text to the best of my knowledge and belief.

    Also, I hereby grant to University of Moratuwa the non-exclusive right to reproduce

    and distribute my thesis, in whole or in part in print, electronic or other medium. I

    retain the right to use this content in whole or part in future works (such as articles or

    books).

    Signed

    ..

    Budditha Hettige Date

    Candidate

    The above candidate has carried out research for the M. Phil. dissertation under my

    supervision.

    .. ..

    Prof. Asoka S. Karunananda Date

    . ..

    Dr.


  • Abstract

    Communication is fundamental to the evolution and development of all living beings. Languages are arguably the most remarkable artifacts ever developed by mankind to enable communication. The computer has also become a unique machine because of its capacity to communicate with humans through languages. The languages understood by computers and by humans are quite different, yet people can communicate with computers. This is possible because the computer is fundamentally an artifact that translates one language into another. Computers should therefore be able to perform language translation better than any other computing task. Today, computing is evolving to enable machine-to-machine communication with little or no human intervention, yet humans continue to face the so-called language barrier. In particular, a vast body of world knowledge written in English remains inaccessible to communities who cannot communicate in English, and such communities are in turn unable to contribute to the development of world knowledge. As a result, many researchers have taken up computer-aided natural language translation, an area commonly known as Machine Translation. Apertium, Babel Fish, Google Translate, SYSTRAN, EDR, Anusaaraka, AnglaHindi, AnglaBharati and Mantra are examples of popular machine translation systems. These systems use various approaches, including Human-assisted, Rule-based, Corpus-based, Knowledge-based, Hybrid and Agent-based, to translate from one language to another. However, owing to the inherent diversity of natural languages, a generic machine translation approach is far from reality.

    This thesis presents a computational grammar for the Sinhala language, with an underlying theoretical basis, for developing an English to Sinhala machine translation system. The system is known as BEES, an acronym for Bilingual Expert for English to Sinhala machine translation. The concept of Varanegeema (conjugation) in the Sinhala language is the philosophical basis of the approach taken in developing BEES. Varanegeema handles a large number of language primitives associated with nouns and verbs, such as person, gender, tense, number, preposition and subjectivity/objectivity. More importantly, Varanegeema allows all associated word forms to be derived from a given base word, which drastically reduces the size of the Sinhala dictionary. Since Varanegeema can be expressed as a set of rules, it fits well with a rule-based implementation of machine translation. BEES implements 85 grammar rules for Sinhala nouns and 18 rules for Sinhala verbs.

    BEES comprises seven modules: the English Morphological Analyzer, English Parser, English to Sinhala Base Word Translator, Sinhala Morphological Generator, Sinhala Parser, Transliteration module and Intermediate Editor. In addition to these main modules, the system includes four dictionaries: the English dictionary, the Sinhala dictionary, the English-Sinhala Bilingual dictionary and the Concept dictionary. BEES primarily shares features with the Rule-based, Context-based and Human-assisted approaches to machine translation. It has been implemented in Java and SWI-Prolog and runs on both Linux and Windows.

    BEES has been evaluated to test the hypothesis that the concepts of Varanegeema can be used to drive English to Sinhala machine translation. The evaluation was carried out in three steps. As the first step, all the language processing modules, namely the morphological analyzers, parsers, translator and transliteration module, were tested using a white-box testing approach. To test each module, several online testing tools, including an English morphological analyzer, an English parser and a Sinhala word generator, were implemented. Using these tools, each module was tested thoroughly against a carefully prepared test plan. In addition, an online evaluation test bed was implemented to continuously capture feedback from online users. This test bed lets users build different types of sentences from a given set of words. The Word Error Rate and the Sentence Error Rate were calculated from these evaluation results. Finally, intelligibility and accuracy tests were conducted with human support. Two hundred sample sentences were collected and grouped into 20 sets of 10 sentences each. Each sentence was translated using BEES, and each set was then given to human translators and scored. Intelligibility and accuracy were calculated from these scores.

    The experimental results show that the English morphological analyzer, English parser, English to Sinhala base word translator, Sinhala morphological generator and Sinhala sentence generator each work with more than 90% accuracy. The overall evaluation shows 89% accuracy, with a word error rate of 7.2% and a sentence error rate of 5.4%. BEES successfully translates English sentences with simple or complex subjects and objects, and handles the most commonly used tense patterns, including active and passive voice forms.


  • Acknowledgements

    This thesis is the result of four years of devoted work, during which I have been accompanied and supported by many people. I am pleased to now have the opportunity to express my gratitude to all of them.

    I am grateful to the University of Moratuwa, especially to the Faculty of Information Technology, for providing me with the opportunity to carry out this research study.

    The first person I would like to thank is my supervisor, Prof. Asoka Karunananda, for whom a few lines are too short to give a complete account of my deep appreciation. This study would not have been such a success without his commonsense knowledge and perceptiveness. I owe him much gratitude for showing me this way of research. Apart from being an excellent supervisor, Prof. Karunananda has been an understanding teacher, and he has supported me in every aspect of this research.

    I am also grateful to Dr. Sarath Bannayake, Head, Department of Statistics and Computer Science, University of Sri Jayawardenepura, for the assistance he gave me during the research work.

    With great pleasure and a deep sense of gratitude, I acknowledge Mr. P. Dias, former Head and Senior Lecturer, Department of Statistics and Computer Science, University of Sri Jayawardenepura, for his great help in devising a method for evaluation.

    I would also like to thank Mr. Niranjan Bandara, Lecturer, Department of Sinhala and Mass Communication, University of Sri Jayawardenepura, for his valuable support in correcting some Sinhala language issues.

    With great pleasure and a deep sense of gratitude, I thank Venerable Kirioruwe Dhamananada thera, Venerable Kukulpane Sudassi thera and Venerable Mattumagala Chandanada thera for their valuable support in solving Sinhala and English language problems by sharing their knowledge of Sinhala, Pali and Sanskrit language structures.

    I am deeply indebted to Mr. Duminda de Silva, Head, Department of Mathematics and Computer Science, The Open University of Sri Lanka, for the encouragement extended to me throughout this study.

    I wish to extend my sincere gratitude to Ms. G. S. Makalanda, Dr. T.G.I. Fernando and Dr. E. A. T. A. Edirisooriya for their great support and encouragement throughout this study.

    My deepest gratitude goes to my mother and my wife for their unconditional support, without which this work would have been impossible. Again, I must give a big thank-you to my wife Lakshimi for tolerating my busy schedules due to the research work. Last but not least, I thank all who supported me in making this work a success.

    January 3, 2011 Budditha Hettige


  • Table of Contents

    Declaration of the Candidate and the Supervisor i

    Abstract ii

    Acknowledgements iv

    Table of Contents vi

    List of Figures xi

    List of Tables xii

    Chapter 1 Introduction 1

    1.1 Preamble 1

    1.2 English to Sinhala Machine Translation 2

    1.3 What are Machine Translation Systems? 2

    1.4 Aim of the Research 4

    1.5 Objectives of the Research 4

    1.6 Scope of the Project 5

    1.7 Hypothesis 5

    1.8 Structure of the Thesis 5

    1.9 Summary 7

    Chapter 2 State of the Art of Machine Translations 8

    2.1 Introduction 8

    2.2 Fundamentals of the Natural Language Processing 8

    2.3 Machine Translation Systems 9

    2.4 Current Approaches to Machine Translation 10

    2.4.1 Human-assisted Machine Translation 10

    2.4.2 Rule-based Machine Translation 12

    2.4.2.1 Transfer-based Machine Translation 14

    2.4.2.2 Interlingua Machine Translation 15

    2.4.2.3 Dictionary based Machine Translation 16

    2.4.3 Statistical Machine Translation 17

    2.4.4 Example-based Machine Translation 18

    2.4.5 Knowledge-based Machine Translation 19

    2.4.6 Hybrid Machine Translation 20

    2.4.7 Agent-based Machine Translation 20

    2.5 Existing English to Sinhala Machine Translation Systems 21

    2.6 Concepts and Techniques for Machine Translation 22


  • 2.6.1 Morphological Analysis 23

    2.6.2 Syntax Analysis 24

    2.7 Problem Definition 25

    2.8 Summary 26

    Chapter 3 Overview of the English and Sinhala Languages 28

    3.1 Introduction 28

    3.2 The English Language 28

    3.3 The English Language Morphology 28

    3.3.1 English Noun Morphology 29

    3.3.2 English Verb Morphology 30

    3.3.3 English Adjective Morphology 31

    3.4 Syntax of the English Language 32

    3.4.1 The English Sentence Subject 33

    3.4.2 The English Predicate 33

    3.4.3 Verb Tense 33

    3.4.4 The Complement 34

    3.5 Semantics of English Language 35

    3.5.1 Word Level Semantics 35

    3.5.2 Sentence Level Semantics 35

    3.5.3 The paragraphs Level Semantics 35

    3.6 The Sinhala Language 35

    3.6.1 Sinhala Alphabet 36

    3.7 Sinhala Language Morphology 38

    3.7.1 Sinhala Noun Morphology 38

    3.7.2 Sinhala Verb Morphology 41

    3.8 Syntax of the Sinhala Language 43

    3.9 Semantics of the Sinhala Language 44

    3.10 Comparison Between English and Sinhala 44

    3.10.1 Fundamental Differences 45

    3.10.2 Morphological Differences 45

    3.10.3 Syntax in the two Languages 46

    3.11 Language Issues 46

    3.11.1 Grammatical Issues 47

    3.11.2 Text Manipulation Issues 47


  • 3.12 Challenges in English to Sinhala Machine Translation 48

    3.12.1 Word and Sentence Segmentation 49

    3.12.2 Lexical Selection 49

    3.12.3 Conjugation 49

    3.12.4 Tense Detection 50

    3.12.5 Article Insertion 50

    3.12.6 Sentence boundaries 50

    3.12.7 Word Order 50

    3.13 Summary 51

    Chapter 4 Novel Approach to Machine Translation 52

    4.1 Introduction 52

    4.2 A Theoretical-based Approach to Machine Translation 52

    4.3 Computational Model of Grammar for Sinhala 53

    4.3.1 Computational Model for Sinhala Morphology 53

    4.3.2 Context-Free Grammar for Sinhala language 53

    4.4 Hypothesis 57

    4.5 Approach in a Nutshell 57

    4.6 Features of BEES 57

    4.7 Input for BEES 58

    4.8 Output of BEES 58

    4.9 Process of BEES 58

    4.10 Summary 59

    Chapter 5 Design of BEES 60

    5.1 Introduction 60

    5.2 Design of BEES 60

    5.2.1 English Morphological Analyzer 60

    5.2.2 English Parser 62

    5.2.3 English to Sinhala Base Word Translator 62

    5.2.4 Sinhala Morphological Generator 63

    5.2.5 Sinhala Parser 63

    5.2.6 Transliteration module 64

    5.2.7 Intermediate Editor 64

    5.2.8 Lexical Resources 65

    5.3 Supporting modules 66


  • 5.3.1 Dictionary Updater 66

    5.3.2 Sinhala Word Generator 67

    5.3.3 Online Search module 67

    5.4 Summary 68

    Chapter 6 Implementation 69

    6.1 Introduction 69

    6.2 Development Stages 69

    6.3 Implementation of the BEES 70

    6.3.1 English Morphological Analyzer 70

    6.3.2 English Parser 74

    6.3.3 English to Sinhala Bilingual Translator 77

    6.3.4 Sinhala Morphological Generator 78

    6.3.5 Sinhala Sentence Composer 81

    6.3.6 Transliteration Module 82

    6.3.7 Intermediate Editor 83

    6.3.8 Lexical Resources 84

    6.3.8.1 English Dictionary 84

    6.3.8.2 Sinhala dictionary 86

    6.3.8.3 English-Sinhala Bilingual dictionary 89

    6.3.8.4 Concept Dictionary 90

    6.4 Supporting modules 91

    6.4.1 Online Updater 91

    6.4.2 Sinhala Word Generator 92

    6.4.3 Online Search module 93

    6.5 Summary 94

    Chapter 7 BEES in Action 95

    7.1 Introduction 95

    7.2 BEES as an Online Translator 95

    7.3 BEES as a Web Page Translator 97

    7.4 BEES as a Selected Sentence Translator 100

    7.5 BEES as a Desktop Application 102

    7.6 Summary 106

    Chapter 8 Evaluation 107

    8.1 Introduction 107


    8.2 Evaluation of MT systems 107

    8.3 BEES Evaluation 109

    8.4 Stage1: Module Testing 110

    8.4.1 English Morphological Analyzer 110

    8.4.2 English Parser 111

    8.4.3 English to Sinhala Base Word Translator 112

    8.4.4 Sinhala Morphological Generator 113

    8.4.5 Sinhala Sentence Composer 114

    8.4.6 Transliteration Module 115

    8.5 Stage 2: Performance Testing 115

    8.6 Stage 3: Accuracy Testing 117

    8.7 Result of the Experiments 118

    8.8 Summary 121

    Chapter 9 Conclusion and Further Work 122

    9.1 Introduction 122

    9.2 Revisited Objectives 122

    9.3 Limitations 124

    9.4 Further Works 124

    9.5 Summary 125

    References 126

    Appendix A: English Morphological analyzer- Test plan 135

    Appendix B: Conjugation Table for Sinhala Language 137

    Appendix C: Context-Free Grammar for Sinhala Language 143

    Appendix D: Finite State Transducer for Sinhala Transliteration 145

    Appendix E: Sample Evaluation form 147

    Appendix F: Sample of evaluators Comments 148

  • List of Figures

    Figure 2.1: Architecture for a rule-based machine translation system 13

    Figure 4.1: Finite State Automata for Kaputu Ganaya 54

    Figure 4.2: Parser tree for the sample sentence 56

    Figure 5.1: Design of the BEES 61

    Figure 5.2: FST for Vowels in model 1 transliteration 64

    Figure 5.3: Design of the three supporting module 67

    Figure 6.1: The Intermediate Editor 83

    Figure 7.1: Web based architecture for the BEES 95

    Figure 7.2: User interface of the Online BEES 96

    Figure 7.3: A web page translator 97

    Figure 7.4: BEES as a web page translator 100

    Figure 7.5: Selected sentence translator 101

    Figure 7.6: Desktop screen for selected sentence translation 101

    Figure 7.7: User interface of the BEES 103

    Figure 8.1: English Morphological analyzer with test results 111

    Figure 8.2: Sinhala word conjugator 114

    Figure 8.3: User interface of the evaluation test bed 116

    Figure 8.4: Online evaluation form 117

    Figure 8.5: Translation accuracy 121


  • List of Tables

    Table 2.1: Existing Machine translation systems 26

    Table 3.1: Regular and irregular forms of the English Noun 29

    Table 3.2: English Noun Morphological rules 30

    Table 3.3: English verb Morphology 31

    Table 3.4: Morphological rules for English Verbs 32

    Table 3.5: Tense patterns (Active voice) 33

    Table 3.6: The Sinhala Alphabet 36

    Table 3.7: Vocalic Strokes and their position 37

    Table 3.8: The consonant l with vocalic strokes 37

    Table 3.9: Sample case markers in Sinhala 40

    Table 3.10: conjugation table for we;a ganaya 41

    Table 3.11: Inflection form of the Sinhala verbs (Active) 42

    Table 3.12: Inflection form of the Sinhala verbs (Passive) 43

    Table 4.1: Paradigm table for Kaputu Ganaya 54

    Table 6.1: Grammatical notations for the English Dictionary 84

    Table 8.1: Sample test plan for English Morphological analyzer 110

    Table 8.2: Sample test plan for English parser 112

    Table 8.3: Sample Sinhala Morphological rules 113

    Table 8.4: Results for module testing 119

    Table 8.5: Human evaluation results 120

    Table 8.6: Accuracy results 120

    Table 8.7: Final evaluation results 121


    Chapter 1

    INTRODUCTION

    1.1 Preamble

    Natural language is one of the most marvelous artifacts ever invented by mankind. It is the cornerstone of all kinds of communication. Each natural language serves to describe the thoughts of humans in a particular environment. As such, a natural language has a strong bearing on the culture and the environment within which a community lives. This is why we find a large number of different natural languages worldwide. Despite the differences between languages, people still want to communicate with those who use other languages, and these differences have become a barrier to cross-cultural communication. In particular, many nations cannot access the huge reservoir of world knowledge written in English unless they have a sound knowledge of English. Conversely, people who do not know English are unable to contribute to that world knowledge. The importance of the mother tongue for the discovery and creation of new systems of knowledge is indisputable. Consequently, this has resulted in what is called the language barrier to communication. In fact, this issue arises not only between English and other languages, but between any two languages.

    Of course, people have long practiced a solution to this issue: translation between two languages by someone who knows both. However, can we really expect everyone to know every language? Undoubtedly, this is impractical.

    The emergence of digital computer technology in the early 1950s gave rise to the concept of machine translation, in which computers are used to address the long-felt language needs of humans. Since then, hundreds of research projects have been conducted on translation between natural languages. Machine translation is a branch of Natural Language Processing, which comes under the broad area of Artificial Intelligence. It is commonly said that machine translation has been one of the least successful areas of Artificial Intelligence over the last sixty years. As such, a generic approach to machine translation has remained an unrealized dream of researchers, and machine translation approaches have become highly language specific.

    1.2 English to Sinhala Machine Translation

    This thesis presents research conducted to develop an English to Sinhala machine translation system. Sinhala belongs to the Indo-Aryan language family and is the spoken language of 74% of the people in Sri Lanka. Sinhala is also one of the constitutionally recognized official languages of Sri Lanka [53]. Several statistical results show that more than 80% of the Sinhala-speaking community is unable to read and write in English [46][126]. While the learning of English should be encouraged, the importance of the mother tongue for the discovery of knowledge for the betterment of mankind cannot be devalued.

    In the Asian region, many countries, including India, Thailand, Malaysia and Japan, have conducted a considerable amount of research in machine translation. Although Sri Lanka has been working on various machine translation projects, it still lags a little behind similar research conducted elsewhere in the region. Weerasinghe [154] pioneered machine translation research in Sri Lanka. This project will therefore help to extend machine translation initiatives in Sri Lanka. The project presents a theoretical-based translation approach, which would also benefit machine translation projects that handle languages close to Sinhala.

    Before presenting the aim and objectives of the project, a brief introduction to the field of machine translation is given in section 1.3.

    1.3 What are Machine Translation Systems?

    A Machine Translation system is computer software that translates text or voice from one natural language into another, with or without human assistance [73][154]. By design, machine translation systems can be broadly categorized into two groups: direct translation systems and indirect translation systems. A direct translation system translates the source language into the target language using word-to-word or phrase-to-phrase mapping. In contrast, indirect translation systems use an interlingua or some kind of transfer method. This approach starts with an analysis of the source text and then performs a synthesis to generate the corresponding text in the target language. Figure 1.1 gives the classic pyramid showing the relationship between these two approaches to machine translation.

    Figure 1.1: Relationship between direct and indirect translations
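
    To make the idea of direct translation concrete, the following is a minimal Python sketch of word-to-word mapping. It is an illustration only and not the method used in this thesis. The lexicon, the function name direct_translate and the target-side tokens are all hypothetical; the tokens in angle brackets are placeholders standing in for target-language words, not real Sinhala.

```python
# Toy bilingual lexicon (hypothetical). Target-side tokens are placeholders,
# not real Sinhala words; a real system would store target base forms here.
TOY_LEXICON = {
    "the": "",        # assumption: the article has no separate target word
    "a": "",
    "boy": "<boy>",
    "reads": "<read>",
    "book": "<book>",
}


def direct_translate(sentence: str, lexicon: dict[str, str]) -> list[str]:
    """Translate word by word, keeping the source word order unchanged."""
    output = []
    for word in sentence.lower().split():
        target = lexicon.get(word, f"?{word}")  # flag words missing from the lexicon
        if target:                               # drop words mapped to the empty string
            output.append(target)
    return output


print(direct_translate("The boy reads a book", TOY_LEXICON))
```

    Because the mapping works one word at a time, the output keeps the word order of the source sentence and carries no grammatical information. Indirect approaches add the analysis and synthesis steps precisely to handle such differences between the two languages.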

    Within these two broad groups, several approaches have been used to develop hundreds of machine translation systems all over the world. Among them, the Human-assisted, Rule-based, Statistical, Example-based, Knowledge-based, Hybrid and Agent-based approaches are commonly cited as the most successful approaches to machine translation.

    A comparison of existing machine translation systems and their approaches shows that many of them use a sequential architecture for Natural Language Processing and machine translation [59]. This sequence comprises steps such as preprocessing, morphological analysis, syntax analysis, semantic analysis, pragmatic analysis and post processing.
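
    The sequential architecture described above can be sketched as a chain of stages, each taking the output of the previous one. The Python sketch below is a hedged illustration and not the design of any particular system: the names State, Stage, placeholder and run_pipeline are hypothetical, and every stage except a trivial preprocessing step is a stand-in that only records that it ran.

```python
from typing import Callable

State = dict                      # shared working data passed between stages
Stage = Callable[[State], State]  # every stage maps a state to a state


def preprocessing(state: State) -> State:
    # Minimal tokenization: trim, drop a final period, lowercase, split on spaces.
    state["tokens"] = state["text"].strip().rstrip(".").lower().split()
    return state


def placeholder(name: str) -> Stage:
    """Build a stand-in stage that only notes that it was called."""
    def stage(state: State) -> State:
        state.setdefault("notes", []).append(f"{name}: not implemented in this sketch")
        return state
    stage.__name__ = name
    return stage


# Stage order follows the sequence named in the text.
PIPELINE: list[Stage] = [
    preprocessing,
    placeholder("morphological_analysis"),
    placeholder("syntax_analysis"),
    placeholder("semantic_analysis"),
    placeholder("pragmatic_analysis"),
    placeholder("post_processing"),
]


def run_pipeline(text: str, stages: list[Stage] = PIPELINE) -> State:
    """Run the stages strictly in order, recording the order of execution."""
    state: State = {"text": text, "trace": []}
    for stage in stages:
        state = stage(state)
        state["trace"].append(stage.__name__)
    return state


result = run_pipeline("The boy reads a book.")
print(result["tokens"])
print(result["trace"])
```

    Keeping each stage behind the same interface means that a stage can be replaced or tested on its own without changing the rest of the chain.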

    Although many attempts have been made to develop machine translation systems, the area has so far achieved very little. In fact, because of the long-felt need for machine translation, some people have rushed to develop such systems without a proper conceptual or theoretical basis for their approaches. This has resulted in many machine translation systems that go through ad hoc processes to translate between languages, which in turn constrains development in the field of machine translation.

    1.4 Aim of the Research

    This thesis proposes to design and develop an English to Sinhala machine translation system with a theoretical basis.

    1.5 Objectives of the Research

    To reach the above aim, the following key objectives have been identified. They range from a critical review of existing approaches to machine translation to an evaluation of the proposed theoretical-based approach.

    Objective 1

    Critically review the existing systems, concepts and tools for machine translation.

    Objective 2

    Develop a computational grammar for the Sinhala language.

    Objective 3

    Design and develop an English to Sinhala machine translation system.

    Objective 4

    Evaluate the system.

    1.6 Scope of the Project

    The scope of the project is limited to developing a computational grammar for the Sinhala language, based on the concept of Varanegeema, that handles the 27 most commonly used noun forms and 36 verb forms.

    1.7 Hypothesis

    In order to achieve the above aim and objectives, the hypothesis employed in the thesis can be stated as follows: the concepts of Varanegeema (conjugation) in the Sinhala language can be used to drive English to Sinhala machine translation.
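
    The idea that conjugation can be written as a set of rules, so that word forms are derived from a base word rather than stored individually, can be sketched in Python as follows. This is a hedged illustration only. The class name, the base word BASE1, the feature labels and the suffix labels such as +SG.NOM are hypothetical placeholders, not real Sinhala morphology and not the rules implemented in BEES.

```python
# Hypothetical paradigm table: word class -> {(number, case): suffix label}.
# The suffix labels are placeholders, not real Sinhala suffixes.
PARADIGMS = {
    "class_A": {
        ("singular", "nominative"): "+SG.NOM",
        ("plural", "nominative"): "+PL.NOM",
        ("singular", "dative"): "+SG.DAT",
    },
}

# The dictionary stores only each base word and the class it belongs to.
LEXICON = {"BASE1": "class_A"}


def generate_forms(base: str) -> dict[tuple[str, str], str]:
    """Derive every form of a base word from the rules of its class."""
    rules = PARADIGMS[LEXICON[base]]
    return {features: base + suffix for features, suffix in rules.items()}


def generate(base: str, number: str, case: str) -> str:
    """Look up one inflected form by its grammatical features."""
    return generate_forms(base)[(number, case)]


print(generate("BASE1", "plural", "nominative"))
```

    In this sketch, adding a new word to the dictionary costs one entry (the base word and its class), while the rule table for the class is shared by every word in it.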

    1.8 Structure of the Thesis

    The thesis is structured in nine chapters. The structure is outlined below, with a brief explanation of the contents of each chapter.

    Chapter 1 has provided an overall introduction to the research project. It briefly explained the research problem addressed in the thesis, gave an overview of machine translation, and stated the aim, objectives and hypothesis of the thesis.

    Chapter 2 reports on the literature survey on machine translation, with a detailed description that leads to the problem addressed in the thesis. This chapter also provides a detailed study of the state of the art in Natural Language Processing by describing the different approaches adopted.

    Chapter 3 gives an overview of the English and Sinhala languages in terms of the morphology, syntax and semantics of both languages. This chapter also gives a comparison between English and Sinhala, showing the issues related to machine translation.

    Chapter 4 discusses the novel approach taken to develop the English to Sinhala machine translation system. It first presents the hypothesis of the project. The chapter then explains the mechanism of the translation process, the nature of the input and output, and the key features of the system.

    Chapter 5 is about the design of the proposed English to Sinhala machine translation system. Each module of the design model is explained separately by describing its functionality and its relation to the other modules.

    Chapter 6 presents the implementation of the English to Sinhala machine translation system. This chapter gives implementation details of the Prolog-based modules, the Java-based user interface, the Intermediate Editor and the ontology of the lexical databases.

    Chapter 7 presents how BEES works in practice when translating a given English text. This chapter also explains the applications of BEES as a standalone translator, an on-demand translator, a web page translator and a selected text translator.

    Chapter 8 reports the evaluation of the English to Sinhala machine translation system. The evaluation methodology, evaluation steps, participants and results of the evaluation are given in this chapter.

    Chapter 9 concludes the thesis by referring to the achievement of each objective. The chapter also presents the limitations of the research and further work.


    1.9 Summary

    This chapter provided an overview of the entire project by describing the problem to be addressed and the aim, objectives and hypothesis of the thesis. It briefly explained the proposed English to Sinhala machine translation system. The structure of the rest of the thesis has also been presented.

    The next chapter reports on a critical review of the existing approaches to machine translation, together with the major machine translation systems that are based on these approaches.


    Chapter 2

    STATE OF THE ART OF MACHINE TRANSLATIONS

    2.1 Introduction

    The previous chapter presented an overview of the thesis. This chapter gives the state of the art of Natural Language Processing, with special attention to machine translation. Some related fundamental aspects of machine translation are also discussed in this chapter.

    2.2 Fundamentals of the Natural Language Processing

    Natural Language Processing (NLP) is a field of computer science and linguistics concerned with the interactions between computers and human (natural) languages [107]. It is also a subfield of Artificial Intelligence (AI) within Computer Science [128]. According to many electronic resources, the history of NLP began with Turing's article "Computing Machinery and Intelligence" [151], which proposed what is now known as the Turing test as a criterion of intelligence. Later, in 1957, Noam Chomsky, regarded in the academic and scientific community as one of the fathers of modern linguistics, introduced Syntactic Structures for grammar [31]. It is recognized as one of the most important texts in the field of linguistics. It subsequently became a fundamental theory for NLP, and many machine translation systems use this syntactic structure [31][33].

    NLP is used for several tasks, including machine translation, automatic summarization, information retrieval, optical character recognition, speech recognition and text-to-speech [107][128][147].

    Depending on the task, NLP systems have to resolve several issues, such as natural language understanding, natural language generation, speech and text segmentation, part-of-speech tagging and word sense disambiguation [84].


    2.3 Machine Translation Systems

    A Machine Translation system is computer software that translates text or speech from

    one natural language to another [161][162]. Machine Translation is a sub-area of

    Natural Language Processing that was identified during the early days of Artificial

    Intelligence (AI). Owing to the complexity of natural languages, Machine Translation

    has, for more than sixty years, been regarded as one of the least

    successful areas in computing [74]. The underlying issues range from the morphology to

    the semantics of the source and target languages.

    The history of Machine Translation dates back to the late 1940s. A look-up dictionary

    built at Birkbeck College in London in 1948 has been cited as an early work in machine

    translation. From 1950 to 1960 many researchers attempted to develop

    Machine Translation systems using a trial-and-error approach [75], especially for

    Russian to English translation. In 1950 the first machine translation system was developed

    to translate Russian sentences into English.

    In 1958 the first practical machine translation system was implemented by the IBM

    Corporation for the US Air Force under the direction of Gilbert King [76]. This system

    translated Russian text into English and remained in successful use until 1970. In the

    meantime the RAND Corporation disseminated current linguistic theory and emphasized

    statistical analysis. Its researchers prepared bilingual glossaries with grammatical

    information, together with grammar rules and the first parser based on dependency

    grammar.

    In 1970, SYSTRAN [144] implemented a new Russian-English machine

    translation system, which replaced the previous system at the US Air

    Force. This system translated more than 100,000 pages per year. In the meantime,

    many researchers were attempting to develop machine translation systems. Among

    others, a syntactic transfer system for English-French was one of the strong research

    efforts in the field. Further, the principal experimental effort focused on Interlingua

    approaches, with more attention paid to syntactic aspects [75].


    In 1980, many computer companies attempted to develop computer-aided

    translation systems, especially for Japanese-English. These were low-level direct

    translation systems confined to morphological and syntactic analysis. After

    1980, machine translation research developed in many directions. The corpus-based

    machine translation approach has been the most popular approach up to now.

    However, due to the complexity of natural languages, the development of

    machine translation systems remains a research challenge. In addition, many

    researchers have noted that operational syntax, idioms and universal syntactic

    categories are some of the completely unsolved linguistic problems in machine

    translation [171].

    2.4 Current Approaches to Machine Translation

    Considering the translation approach, machine translation systems can be

    classified into seven categories, namely Human-assisted, Rule-based, Statistical,

    Example-based, Knowledge-based, Hybrid and Agent-based. The Statistical,

    Example-based, Knowledge-based and Hybrid approaches use corpora for machine

    translation and are therefore referred to as corpus-based approaches. Each of

    these machine translation approaches has its own strengths and weaknesses,

    and the success rate of a translation depends on the approach used. Each

    approach is discussed below.

    2.4.1 Human-assisted Machine Translation

    Human-assisted machine translation is an approach used

    particularly in the Indian family of machine translation systems. It

    uses human interaction at the pre-editing, post-editing and/or intermediate

    editing stages [85], and relies on human support for semantic handling in

    machine translation. A number of machine

    translation systems have been developed using this approach.


    In the Indian region a number of machine translation systems have used this

    approach, including Anusaaraka, Mantra, MaTra and AnglaBharti [133][38][146].

    Anusaaraka [4][7] is a popular human-assisted translation system for Indian

    languages that makes text in one Indian language accessible in another Indian

    language. This system uses the Paninian Grammar model [6] for its language analysis.

    The Anusaaraka project [16] has been developed to translate Punjabi, Bengali,

    Telugu, Kannada and Marathi into Hindi. English-Hindi Anusaaraka

    translates English text into Hindi. The approach and lexicon are general, but the

    system has mainly been applied to children's stories [95].

    MaTra is a human-assisted transfer-based translation system for English to Hindi

    [11]. This system uses general-purpose lexicons and has been applied mainly in the

    domain of news. MaTra follows a structural and lexical transfer approach and

    aims to produce understandable output with wide coverage,

    rather than perfect output for a limited range of sentences.

    Mantra [106] is a machine-assisted translation tool that translates English text into

    Hindi in several domains. Mantra is based on Tree Adjoining Grammar (TAG).

    The Mantra system started with the translation of administrative documents, such

    as appointment letters, notifications and circulars issued by the central government,

    from English to Hindi.

    AnglaBharti [103] is also a human-assisted machine translation system used in

    India. Since India has many languages, there is a variety of machine translation

    systems. For example, AnglaHindi [133] translates English to Hindi using a

    machine-aided translation methodology. Human-aided machine translation is a

    common feature of most Indian machine translation systems. In addition, these

    systems use both pre-editing and post-editing as the means of

    human intervention in the machine translation system.

    Chandrashekhar Research Centre [20] has developed a machine-aided translation

    system for Tamil to Hindi. The translator is based on the Anusaaraka

    Machine Translation System; it takes Tamil text as input and produces

    Hindi text as output. Stand-alone, API and web-based on-line versions have been developed.

    A Tamil morphological analyzer and a Tamil-Hindi bilingual dictionary are

    by-products of this system [133].

    In addition to the above, KSHALT is a human-assisted Machine Translation

    system that translates English into Korean [85]. This translation system

    contains four phases, namely the English Parser, the English Analyzer, English to Korean

    transfer and Korean generation.

    2.4.2 Rule-based Machine Translation

    The rule-based approach is another approach to machine translation. It

    produces grammatically correct translations by applying a set of rules. Basically, a

    rule-based machine translation system contains a source language morphological

    analyzer, a source language parser, a translator, a target language morphological

    analyzer, a target language parser and several lexicon dictionaries. The source language

    morphological analyzer analyzes a source language word and provides its

    morphological information. The source language parser is a syntax analyzer that analyzes

    source language sentences. The translator translates a source language word

    into the target language. The target language morphological analyzer works as a generator:

    it generates appropriate target language words for the given grammatical

    information. Similarly, the target language parser works as a composer and composes a

    suitable target language sentence. Furthermore, this type of machine translation

    system needs a minimum of three dictionaries, namely the source language dictionary,

    the bilingual dictionary and the target language dictionary. The source language

    morphological analyzer needs the source language dictionary for morphological

    analysis, the translator uses the bilingual dictionary to translate the source language

    into the target language, and the target language morphological generator uses the target

    language dictionary to generate target language words. Figure 2.1 presents the general

    architecture of a rule-based machine translation system.
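
    The component chain described above can be sketched in a few lines of Python. This is

    only a minimal illustration under stated assumptions, not the architecture of any system

    cited here: the function names, the three toy dictionaries and the romanized target

    words are all hypothetical, and the parser handles only subject-verb-object sentences.

```python
# Toy rule-based pipeline: analyze -> parse -> transfer -> generate -> compose.
# Every dictionary entry below is a hypothetical, romanized illustration.

# Source language dictionary: surface form -> (root, category, features)
SOURCE_DICT = {
    "the": ("the", "DET", {}),
    "boy": ("boy", "N", {"number": "sg"}),
    "rice": ("rice", "N", {"number": "sg"}),
    "eats": ("eat", "V", {"tense": "present"}),
}

# Bilingual dictionary: source root -> target root
BILINGUAL_DICT = {"boy": "lamaya", "rice": "bath", "eat": "ka"}

# Target language dictionary: (target root, feature value) -> surface form
TARGET_DICT = {
    ("lamaya", "sg"): "lamaya",
    ("bath", "sg"): "bath",
    ("ka", "present"): "kanawa",
}


def analyze(sentence):
    """Source morphological analysis: look up each word (KeyError if unknown)."""
    return [SOURCE_DICT[word] for word in sentence.lower().split()]


def parse(analysed):
    """Tiny source parser: accepts only noun-verb-noun after dropping determiners."""
    content = [item for item in analysed if item[1] != "DET"]
    if [cat for _, cat, _ in content] != ["N", "V", "N"]:
        raise ValueError("only simple subject-verb-object sentences are supported")
    subject, verb, obj = content
    return {"subject": subject, "verb": verb, "object": obj}


def transfer(tree):
    """Translator: replace each source root with its target root."""
    return {role: (BILINGUAL_DICT[root], cat, feats)
            for role, (root, cat, feats) in tree.items()}


def generate(node):
    """Target morphological generation from root plus grammatical information."""
    root, cat, feats = node
    key = feats["tense"] if cat == "V" else feats["number"]
    return TARGET_DICT[(root, key)]


def compose(tree):
    """Target sentence composition: emit subject-object-verb order."""
    return " ".join(generate(tree[role]) for role in ("subject", "object", "verb"))


def translate(sentence):
    return compose(transfer(parse(analyze(sentence))))


print(translate("The boy eats rice"))  # prints "lamaya bath kanawa"
```

    Each stage consults exactly one of the three dictionaries, which mirrors the division of

    labour described in the paragraph above.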


    A number of machine translation systems have been designed using the rule-based

    approach. Among others, Apertium [18] is a rule-based Machine Translation

    system that translates between related languages. It is an open-source system that can

    be used to translate between any two related languages. The Apertium engine follows a

    shallow-transfer approach and consists of eight pipelined modules: a de-formatter,

    a morphological analyzer, a part-of-speech (PoS) tagger, a lexical

    transfer module, a structural transfer module, a morphological generator, a

    post-generator and a re-formatter.

    [Figure 2.1 block diagram: Source language morphological analyzer, Source language

    parser, Bilingual translator, Target language morphological generator and Target

    language sentence generator, supported by the Source language, Bilingual and Target

    language dictionaries]

    Figure 2.1: Architecture for a rule-based machine translation system

    Toshiba [145] is another rule-based machine translation system, for English to

    Japanese and vice versa. To translate a given source text, the system uses morphological

    analysis, syntax analysis, translation word selection and structural transformation,

    syntax transformation and morphological generation steps. It can translate

    open-domain written texts using the rule-based approach. The system uses three

    dictionaries, namely a common word dictionary, a technical-term dictionary and a

    user-defined dictionary. The common word dictionary covers both English-Japanese and

    Japanese-English translation. The technical-term dictionary contains domain-specific

    technical terms. The user-defined dictionary stores user-provided

    information, such as information on unknown words.

    Further, rule-based machine translation approaches can be categorized into three

    groups, namely transfer-based, Interlingua and dictionary-based. The transfer-based

    and Interlingua approaches share the same idea: both use an

    intermediate representation that captures the "meaning" of the original sentence

    [10][84][56]. The difference is that an interlingua-based

    system uses a language-independent intermediate representation, whereas a transfer-based

    system uses a language-dependent one. Most of these machine

    translation systems include morphological analysis, lexical categorization, lexical

    transfer, structural transfer and morphological generation. A dictionary-based

    machine translation system uses a dictionary for translation, with or

    without morphological or syntax analysis. This type of Machine Translation

    system is ideally suited to translating long lists of phrases. A number of machine

    translation systems have been developed under the above three broad headings.

    2.4.2.1 Transfer-based Machine Translation

    Lavie and others [96] have applied the transfer-based approach to a Hindi-to-English

    translation system named Xfer, which was trained under an extremely limited data

    scenario. The Xfer system uses IIITMorpher, a morphological analyzer [79], to

    analyze Hindi words into the root and other features such as gender, number and

    tense. The Xfer system uses 70 transfer rules, including a rather large verb paradigm

    with 58 verb sequence rules, ten recursive noun phrase rules and two prepositional

    phrase rules. They have noted that this approach is particularly suitable for

    languages with very limited data resources.

    An Arabic to English machine translation system named Npae-Rbmt has been developed

    using the transfer-based approach [120]. Npae-Rbmt

    uses an intermediate representation that captures the meaning of the

    original sentence in order to generate the correct translation. The system was

    evaluated on 88 thesis titles and journals from the computer science domain,

    and the reported accuracy was 94.6%.

    The Apertium platform follows a transfer-based machine translation model [18]. Using

    this shallow-transfer approach, a Swedish to Danish machine translation system has

    been developed [125]. It uses two

    morphological dictionaries for analysis and generation, and is the first free software

    translator from Swedish to Danish.

    Using an affix-transfer-based approach, a unidirectional Tagalog-to-Cebuano

    machine translation system has been developed [170]. Its morphological analysis is based

    on TagSA (Tagalog Stemming Algorithm) and is focused on an affix

    correspondence-based POS (part-of-speech) tagger.

    Opentrad is an open-source transfer-based machine translation system intended for

    both related and not-so-similar language pairs [3][48]. Opentrad uses different

    translation methods for each language pair: for related languages it uses

    shallow transfer, whereas for unrelated pairs it uses deep transfer [49].

    Opentrad also uses the open-source machine translation engine Matxin [101] as a

    translation engine.

    OpenLogos is the Open Source version of the Logos Machine Translation System

    [122]. It is one of the earliest and longest running commercial machine translation

    products in the world. This system accepts documents in various formats and

    produces high quality translations [136]. OpenLogos translates from English and

    German to the major European languages, including Spanish, Italian, French and

    Portuguese.

    2.4.2.2 Interlingua Machine Translation

    The Interlingua approach uses a language-independent meaning representation for

    source language to target language translation. An Interlingua provides one single

    meaning representation for all languages, which has been recognized as an extremely

    difficult task in practice [135]. However, the Interlingua approach has several

    advantages. Among others, it makes adding a new

    language easier than any other method. It also has several disadvantages. The meaning

    representation is the critical element of an Interlingua. If the representation is too

    simple, meaning will be lost in translation. If, on the other hand, it is too complex,

    analysis and generation will be too difficult.

    A number of machine translation systems have been developed using the Interlingua

    approach. Abdelhadi and others have developed an English to Arabic machine

    translation system based on the Interlingua approach [1]. They use a mapping

    system between Arabic and the intermediate representation. This mapping system

    contains three steps, namely selecting lexical items for each Interlingua concept,

    mapping the semantic roles, and mapping the semantic features of each Interlingua

    concept to the appropriate syntactic feature in the feature structure.

    Among others, ICENT is an interlingua-based Chinese-English natural language

    translation system [167]. Its description introduces the realization mechanism of

    Chinese language analysis, which comprises syntactic parsing and semantic analysis,

    and gives the design of the Interlingua in detail.

    The Thai to English machine translation system is another successful machine

    translation system [29]. It translates Thai sentences

    into an Interlingua of a Thai LFG tree using an LFG grammar and a bottom-up parser.

    2.4.2.3 Dictionary based Machine Translation

    Dictionary-based machine translation systems are commonly used in cross-language

    retrieval systems [77]. The approach uses a dictionary

    to generate the equivalent target language query for a given source language

    query.
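
    As a rough illustration of dictionary-based query translation, the sketch below maps each

    query term to all of its dictionary equivalents. The Spanish-English entries, the stopword

    list and the function name translate_query are assumptions made for this example only and

    are not taken from the systems cited in this section.

```python
# Minimal dictionary-based query translation for cross-language retrieval.
# The bilingual entries and stopwords are illustrative assumptions.

BILINGUAL = {
    "precio": ["price", "cost"],
    "arroz": ["rice"],
    "cultivo": ["crop", "cultivation"],
}
STOPWORDS = {"de", "del", "el", "la"}


def translate_query(query):
    """Return the target-language query terms for a source-language query."""
    target_terms = []
    for term in query.lower().split():
        if term in STOPWORDS:
            continue  # function words carry little retrieval value
        # Untranslatable terms (for example proper names) are kept unchanged.
        target_terms.extend(BILINGUAL.get(term, [term]))
    return target_terms


print(translate_query("precio del arroz"))  # prints ['price', 'cost', 'rice']
```

    Keeping every dictionary sense widens recall at the cost of precision; a real system would

    normally add morphological analysis of the query terms before the lookup.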

    Mandal and others [105] have developed a cross-language retrieval system

    for the retrieval of English documents in response to queries in Bengali and Hindi.

    It uses dictionary-based machine translation to generate the equivalent

    English query from Indian language topics.

    Thenmozhi and Aravindan have developed a Tamil-English Cross Lingual

    Information Retrieval System for the agriculture society [149]. This system was

    developed for the farmers of Tamil Nadu; it helps them to specify their information

    needs in Tamil and to retrieve documents in English. It uses a morphological analyzer to

    obtain the root terms of the source query. This approach retrieves

    pages with a mean average precision of 95%.

    2.4.3 Statistical Machine Translation

    The statistical machine translation approach is by far the most widely studied machine

    translation method in the field of natural language processing. This approach tries to

    generate translations using statistical methods based on bilingual text corpora [84].

    A large number of machine translation systems have

    been developed using this approach.

    Moses is a statistical machine translation system that allows translation models to be

    trained automatically for any language pair [108]. The Moses system has several

    features. It offers two types of translation models, namely phrase-based and tree-based,

    and it supports factored translation models, which enable the integration of

    linguistic and other information at the word level.

    Babel Fish [168] is a web-based application developed by AltaVista which

    translates text or web pages from one language into another. The translation

    technology for Babel Fish is provided by SYSTRAN [144], whose technology also

    powers the translator at Google and a number of other sites. It can translate among

    English, Simplified Chinese, Traditional Chinese, Dutch, French, German, Greek,

    Italian, Japanese, Korean, Portuguese, Russian, and Spanish. A number of sites have

    sprung up that used the Babel Fish service to translate back and forth between one or

    more languages.


    Bing Translator [112] is a service provided by Microsoft as part of its Bing

    services, which allows users to translate texts or entire web pages into different

    languages. All translation pairs are powered by Microsoft Translation, developed by

    Microsoft Research; it uses Microsoft's own syntax-based statistical machine

    translation technology.

    Google Translator [51] translates a section of text, or a webpage, into another

    language. It does not always deliver accurate translations and does not apply

    grammatical rules, since its algorithms are based on statistical analysis rather than

    traditional rule-based analysis.

    In the Indian region, Udupa and Faruquie have developed an English-Hindi

    Statistical Machine Translation System [152]. This machine translation system is

    based on IBM Models 1, 2, and 3. The system has been tested on an English-Hindi

    parallel corpus consisting of 150,000 sentence pairs.
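
    To give a feel for how a statistical system learns from a parallel corpus, the following

    sketch estimates word translation probabilities with the expectation-maximization

    procedure of IBM Model 1. It is a simplified illustration, not the cited system: the

    three-sentence English-German corpus, the function names and the omission of the NULL

    word are assumptions made here.

```python
from collections import defaultdict

# Toy parallel corpus (source words, target words); purely illustrative.
CORPUS = [
    ("the house".split(), "das Haus".split()),
    ("the book".split(), "das Buch".split()),
    ("a book".split(), "ein Buch".split()),
]


def train_ibm_model1(corpus, iterations=10):
    """Estimate t(f | e) with EM, following IBM Model 1 without a NULL word."""
    src_vocab = {e for es, _ in corpus for e in es}
    tgt_vocab = {f for _, fs in corpus for f in fs}
    # Start from a uniform distribution over target words.
    t = {(f, e): 1.0 / len(tgt_vocab) for f in tgt_vocab for e in src_vocab}
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # E-step: collect expected alignment counts.
        for es, fs in corpus:
            for f in fs:
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    delta = t[(f, e)] / norm
                    count[(f, e)] += delta
                    total[e] += delta
        # M-step: renormalize the counts into probabilities.
        for f, e in t:
            t[(f, e)] = count[(f, e)] / total[e]
    return t


def best_translation(t, source_word):
    """Return the target word with the highest probability for source_word."""
    candidates = {f: p for (f, e), p in t.items() if e == source_word}
    return max(candidates, key=candidates.get)


table = train_ibm_model1(CORPUS)
for word in ("the", "house", "book", "a"):
    print(word, "->", best_translation(table, word))
```

    On this corpus the procedure pairs the with das and book with Buch, because those words

    co-occur across more than one sentence pair.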

    Singh and Bandyopadhyay have developed a Manipuri-English bidirectional

    statistical machine translation system [133]. The system uses four translation

    factors, namely case markers and POS tag information on the source side, and suffixes

    and dependency relations on the target side. The system has been

    evaluated using the BLEU score.

    2.4.4 Example-based Machine Translation

    An example-based machine translation system uses a bilingual corpus of parallel

    text for machine translation. These systems are trained on bilingual

    parallel corpora, which contain sentence pairs. The example-based approach is

    particularly useful for detecting context from the text, and it also makes use of

    translation memories [13]. A number of machine translation systems have

    been developed all over the world using this approach.
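
    The retrieval step behind a translation memory can be sketched as a fuzzy lookup over

    stored sentence pairs. The example below uses the SequenceMatcher class from the Python

    standard library to find the closest stored source sentence. The three English-Spanish

    pairs, the 0.6 threshold and the function names are assumptions made for illustration.

```python
from difflib import SequenceMatcher

# Toy translation memory of (source, target) sentence pairs; illustrative only.
MEMORY = [
    ("thank you very much", "muchas gracias"),
    ("good night", "buenas noches"),
    ("see you later", "hasta luego"),
]


def similarity(a, b):
    """Word-level similarity in [0, 1] between two sentences."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()


def lookup(sentence, threshold=0.6):
    """Return (translation, score) of the closest example, or None if too distant."""
    sentence = sentence.lower()
    best_src, best_tgt = max(MEMORY, key=lambda pair: similarity(sentence, pair[0]))
    score = similarity(sentence, best_src)
    return (best_tgt, score) if score >= threshold else None


print(lookup("Thank you so much"))     # prints ('muchas gracias', 0.75)
print(lookup("where is the station"))  # prints None
```

    A full example-based system would go further and recombine fragments of several retrieved

    examples; this sketch only returns the single closest match.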


    Among others, OpenMaTrEx is an open-source example-based machine

    translation system that is freely available on the OpenMaTrEx web site [121].

    OpenMaTrEx is built on the marker hypothesis and

    comprises a marker-driven chunker, a collection of chunk aligners and two

    engines.

    Kyoto-U is a successful example-based machine translation system for

    English-Japanese translation [119]. It uses a morphological analyzer and a dependency

    analyzer to detect Japanese sentence structures and convert them into dependency

    structures. In addition, Japanese and English parsers and a bilingual dictionary are

    used as external resources.

    At present many researchers are working on example-based machine

    translation systems that use the World Wide Web as a parallel corpus [55]. wEBMT

    is an example-based machine translation (EBMT) system that uses the World Wide

    Web as its parallel corpus [13].

    2.4.5 Knowledge-based Machine Translation

    The knowledge-based machine translation approach uses knowledge for machine

    translation. It extends the idea of example-based machine translation and

    uses linguistic and computational instructions supplied by a

    human. A number of commercial-quality Machine Translation systems have used the

    knowledge-based approach. Among others, EDR [150] and KANT [86] are the major

    knowledge-based machine translation systems.

    EDR (Electronic Dictionary Research) [114], developed in Japan, is the most successful

    machine translation system. It takes a knowledge-based approach in

    which the translation process is supported by several dictionaries and a huge corpus

    [115]. While using the knowledge-based approach, EDR is governed by a process of

    statistical machine translation. Compared with other machine translation systems,

    EDR is more than a mere translation system, as it provides a great deal of related

    information.


    KANT (Knowledge-based Accurate Natural-language Translation) is a knowledge-based

    machine translation system for specific domains [86]. A prototype of the KANT

    architecture translates into French, German and Japanese successfully. KANT is

    currently being extended into a large-scale commercial application [118]. The KANT

    prototype has been implemented in the domain of technical electronics manuals, and

    translates from English to Japanese, French and German.

    2.4.6 Hybrid Machine Translation

    A hybrid machine translation system combines the rule-based and

    statistical machine translation approaches. This hybrid approach has several

    advantages.

    Among others, SYSTRAN is the market-leading provider of language translation

    software products and solutions for the desktop, enterprise and Internet that facilitate

    communication in 52 language combinations and in 20 vertical domains [124].

    By combining self-learning and linguistic technologies, SYSTRAN has

    developed a hybrid machine translation system [144] named SYSTRAN

    Enterprise Server 7.

    An English to Arabic machine translation system has also been developed using

    a hybrid approach that combines the rule-based and example-based

    approaches [133].

    2.4.7 Agent-based Machine Translation

    Agent technology, more specifically multi-agent systems, has also been used to

    handle machine translation. A multi-agent system provides tools for building

    artificial Complex Adaptive Systems [131]. In general, any multi-agent system

    contains four key components, namely the Multi-Agent Engine, the Virtual World, the

    Ontology and Interfaces [130][131]. The multi-agent engine provides run-time support for

    agents and is started as the first step of the system. The Virtual World is the

    environment of the multi-agent system, in which agents

    cooperate and compete with each other as they construct and modify the current

    scene. The Ontology contains the conceptual problem domain knowledge of each agent.

    A number of NLP systems have been developed using multi-agent

    system technology [175][129][130][113][36]. Most of these systems use agents to

    handle semantics in translation.

    Minakow and others [113] have developed a multi-agent-based text understanding

    system for the car insurance domain. The system

    understands a given text in four steps,

    namely morphological analysis, syntax analysis, semantic analysis and pragmatic

    analysis. For analysis, the whole text is divided into sentences, and the first three

    stages are applied to each sentence. After each paragraph has been analyzed, the text

    is passed to pragmatic analysis.

    Stefanini and others have developed a multi-agent-based general natural language

    processing system named TALISMAN [141]. TALISMAN agents can communicate with

    each other without central control and are able to exchange information directly

    using an interaction language. Linguistic agents are governed by a set of

    local rules. TALISMAN deals with ambiguities and provides a distributed

    algorithm for resolving conflicts arising from uncertain information.

    2.5 Existing English to Sinhala Machine Translation Systems

    During the past few years many Sri Lankan researchers have contributed to the

    development of Machine Translation systems for local languages. Among others, the

    University of Colombo has carried out significant research to develop English to Sinhala

    and Sinhala-Tamil machine translation systems, together with several local language

    resources such as a Sinhala corpus [99][159], a Sinhala text-to-speech system [160], a

    part-of-speech tagger [45] and an OCR system for the Sinhala language [158]. As a first

    attempt, Weerasinghe and others have been researching a Sinhala to Tamil machine

    translation system using the corpus-based approach [157]. This translation system

    was evaluated using the BLEU score metric [123], and reasonable results were achieved.

    At present they are researching an English to Sinhala machine translation

    system based on translation memories [156]. They have designed a translation tool

    named OpenTM, which is based on translation memories, and state

    that OpenTM is suitable for any language pair in which at least

    one language requires complex script support.

    Further, many other local researchers have developed prototype English to

    Sinhala machine translation systems using several approaches. In 2003, Vithanage

    and others developed an English to Sinhala machine translation system for the

    weather forecasting domain [153]. Vithanage's translation system can translate

    simple sentences and works on a limited set of words and limited sentence

    patterns. It is fundamentally rule-based and uses

    paragraph and sentence tokenization, simple parsers (English and Sinhala),

    translators and Sinhala sentence generators for English to Sinhala translation.

    In 2008, Fernando and others developed an English to Sinhala machine translation

    system using Artificial Neural Networks [47]. A Probabilistic Neural Network, which is

    based on Bayesian classifiers, is used to identify the English grammar. The

    system achieved 50% accuracy in grammatical translation. It was

    tested on 84 test cases covering 12 tenses and is capable of translating only

    simple sentences.

    In addition to the above, researchers elsewhere in the world have attempted to develop

    machine translation systems for Sinhala. Among others, Herath and others have

    attempted to develop a translation system from Japanese to modern Sinhalese [57]. The

    system has a limited vocabulary and handles translations only within its domain.

    2.6 Concepts and Techniques for Machine Translation

    In the previous sections the author has discussed several existing approaches to

    Machine Translation. Many of these machine translation systems use

    morphological analysis and syntax analysis to analyze the source language.

    Morphological analysis and syntax analysis are performed by morphological analyzers and

    parsers, which play a major role in any machine

    translation system. Therefore the following subsections give a brief description of

    morphological analysis and syntax analysis.

    2.6.1 Morphological Analysis

    Morphological analysis is the identification and analysis of the structure of

    morphemes and other units of meaning in a language, such as words, affixes and parts of

    speech [84][162][176]. Historically, the first attempt at morphological

    analysis was made by the ancient Indian linguist Panini, who formulated the 3,959

    rules of Sanskrit morphology (Vyakarana). This Panini grammar [24] is the basis for

    all languages of the Indian family, including Hindi, Sinhala, Pali and Sanskrit. Using

    the Panini grammar model, many researchers have developed

    morphological analyzers for their language analysis [5][6].

    Morphological analyzers for the English language have been developed by many

    researchers. Koskenniemi's two-level morphology was the first practical and most

    general model in the history of computational linguistics for the analysis of

    morphologically complex languages [92][93]. Koskenniemi's Pascal implementation

    of morphological analysis was quickly followed by others. The most influential of

    them was the KIMMO system by Lauri Karttunen and his students at the University

    of Texas. PC-KIMMO is another morphological analysis tool, which is based

    on Koskenniemi's work and implemented in C [87]. PC-KIMMO is

    reported to be the only freely available English morphological analyzer with wide

    coverage [34]. The lexicon used in PC-KIMMO covers verbs, pronouns, nouns,

    prepositions, adverbs and adjectives. The current version of PC-KIMMO

    can be run on a PC [93]. PC-KIMMO accepts an input

    word from a user and provides all possible morphological details of the word. In

    addition, many European and Scandinavian countries have developed morphological

    analyzers for their languages. These countries have exploited the real power of

    computer technology for machine translation.


    Asian countries including India, Japan and Thailand have also developed

    morphological analyzers for computer-based natural language processing [5][6]. For

    example, the Anusaaraka system includes morphological analyzers for six Indian

    languages [16]. Anusaaraka has been designed to translate among major Indian

    languages, and its morphological analysis is based on paradigms. A paradigm is

    used both for word analysis and for word generation. Akshar Bharati and

    others have also developed a Generic Morphological Analysis Shell that can be used to

    develop morphological analyzers for different minority languages [5]. This Shell

    uses finite state transducers (FSTs) with features to give the analysis of a given word.

    Further, it integrates paradigms with augmented FSTs. The current model has been

    developed for sample data of Hindi, Telugu, Tamil and Russian. The generic

    Morphological Analysis Shell uses dictionaries, a paradigm table and paradigm

    classes.
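
    The idea that one paradigm serves both analysis and generation can be illustrated with a minimal Python sketch. It is not the Anusaaraka or Shell implementation: the paradigm names, the toy English dictionary and the functions generate and analyze are all assumptions made for illustration.

```python
# Hypothetical paradigm table: each paradigm maps a feature to a
# (suffix to strip from the root, suffix to add) pair.
PARADIGMS = {
    "boy":  {"singular": ("", ""), "plural": ("", "s")},
    "baby": {"singular": ("", ""), "plural": ("y", "ies")},
}

# Hypothetical dictionary: each root points to the paradigm it follows.
DICTIONARY = {"boy": "boy", "toy": "boy", "baby": "baby", "lady": "baby"}


def generate(root, feature):
    """Word generation: apply the root's paradigm entry for the feature."""
    strip, add = PARADIGMS[DICTIONARY[root]][feature]
    stem = root[: len(root) - len(strip)] if strip else root
    return stem + add


def analyze(word):
    """Word analysis: find every (root, feature) whose generated form is the word."""
    results = []
    for root, paradigm in DICTIONARY.items():
        for feature in PARADIGMS[paradigm]:
            if generate(root, feature) == word:
                results.append((root, feature))
    return results
```

    In this sketch, analysis simply reuses generation, so adding a new root only requires assigning it to an existing paradigm class.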

    2.6.2 Syntax Analysis

    Syntax analysis is used to analyze the structure of a text and to determine

    whether or not the text conforms to an expected format [84][91]. From the Machine

    Translation point of view, syntax analysis is done by the parser, which

    analyzes the given text (sentences). To do so, parsers use several

    techniques that fall under top-down and bottom-up parsing.

    Top-down parsers analyze the input from left to right and search for

    parse trees using a top-down expansion [162]. Several types of parsers have been

    developed using this top-down approach, including the recursive descent

    parser, the LL parser, the Earley parser and the X-SAIGA parser. Each of these parsers has

    its own properties in addition to the general top-down parsing features.

    The recursive descent parser is the most straightforward form of top-down parsing [97].

    The LL parser also uses top-down parsing; it parses the input from left to right

    and constructs a leftmost derivation of the sentence. ANTLR [148] is a

    popular LL parser generator, used especially for compilers. The LL(k) parser uses the above

    techniques to parse sentences without backtracking. Earley parsers are

    especially suitable for ambiguous grammars and are used for parsing in computational

    linguistics. Many of these parsers have already been implemented in C, Java, Perl

    and Python. The X-SAIGA parsers are developed under the X-SAIGA project, which aims

    to create algorithms and implementations that enable the construction of language

    processors such as recognizers, parsers, interpreters and translators. The project has

    implemented several algorithms at various stages of developing X-SAIGA [166].
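
    A recursive descent parser can be sketched in a few lines of Python. The toy grammar (S -> NP VP, NP -> Det N, VP -> V NP | V), the lexicon and all names below are assumptions chosen for illustration; they are not taken from any of the parsers cited above.

```python
# Hypothetical lexicon mapping each word to a single category.
LEXICON = {
    "the": "Det", "a": "Det",
    "boy": "N", "song": "N", "book": "N",
    "sings": "V", "writes": "V",
}


class Parser:
    """One method per nonterminal; each method expands its rule top-down."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        """Category of the next token, or None at the end of the input."""
        if self.pos < len(self.tokens):
            return LEXICON.get(self.tokens[self.pos])
        return None

    def expect(self, category):
        """Consume one token of the given category or fail."""
        if self.peek() != category:
            raise SyntaxError(f"expected {category} at position {self.pos}")
        word = self.tokens[self.pos]
        self.pos += 1
        return (category, word)

    def parse_s(self):                      # S -> NP VP
        np = self.parse_np()
        vp = self.parse_vp()
        if self.pos != len(self.tokens):
            raise SyntaxError("unexpected trailing input")
        return ("S", np, vp)

    def parse_np(self):                     # NP -> Det N
        det = self.expect("Det")
        noun = self.expect("N")
        return ("NP", det, noun)

    def parse_vp(self):                     # VP -> V NP | V
        verb = self.expect("V")
        if self.peek() == "Det":            # one token of lookahead, no backtracking
            return ("VP", verb, self.parse_np())
        return ("VP", verb)


def parse(sentence):
    return Parser(sentence.lower().split()).parse_s()
```

    Because the choice between the two VP alternatives is made with a single token of lookahead, this sketch behaves like an LL(1) parser for the toy grammar.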

    A bottom-up parser attempts to identify the most fundamental units first. It then

    attempts to build the tree upwards, towards the start symbol. These parsers are used to analyze

    both natural languages and computer languages. Several types of parsers have been

    developed using this bottom-up approach, including operator precedence

    parsers, LR parsers and CYK parsers.

    The operator precedence parser is a bottom-up parser that interprets an operator-

    precedence grammar [162]. The LR parser [132] also uses bottom-up parsing; it

    parses the input from left to right and constructs a rightmost derivation of the

    sentence in reverse. CYK parsers use the Cocke-Younger-Kasami algorithm, whose parsing

    technique is also bottom-up. CYK parsers operate on context-

    free grammars given in Chomsky normal form (CNF) [31][32].
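
    The bottom-up, table-filling nature of the CYK algorithm can be seen in the following Python sketch of a recognizer. The small CNF grammar and the names BINARY_RULES, LEXICAL_RULES and cyk_recognize are hypothetical and serve only as an illustration.

```python
from itertools import product

# Hypothetical grammar in Chomsky normal form.
BINARY_RULES = {            # (B, C) -> set of A such that A -> B C
    ("NP", "VP"): {"S"},
    ("Det", "N"): {"NP"},
    ("V", "NP"): {"VP"},
}
LEXICAL_RULES = {           # word -> set of A such that A -> word
    "the": {"Det"}, "a": {"Det"},
    "boy": {"N"}, "song": {"N"},
    "sings": {"V"},
}


def cyk_recognize(tokens, start="S"):
    """Return True if the start symbol derives the whole token list."""
    n = len(tokens)
    if n == 0:
        return False
    # table[i][j] holds the nonterminals that derive tokens[i..j] inclusive.
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, token in enumerate(tokens):
        table[i][i] = set(LEXICAL_RULES.get(token, ()))
    # Build longer spans from shorter ones (bottom-up).
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):           # split point between the two halves
                for left, right in product(table[i][k], table[k + 1][j]):
                    table[i][j] |= BINARY_RULES.get((left, right), set())
    return start in table[0][n - 1]
```

    The three nested loops over span length, start position and split point give the algorithm its cubic running time in the sentence length.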

    In addition to the above, parsers have been developed using several computer

    languages, especially Prolog [25], and a number of tools are used to develop parsers,

    including ANTLR, Yacc and JavaCC. Using these programming languages and

    development tools, a number of parsers have been developed for

    several natural languages as well as computer programming languages.

    2.7 Problem Definition

    The existing Machine Translation systems that use the stated approaches are not

    directly able to translate English text into Sinhala. Since each natural language is

    built on its own building blocks and structures, two languages cannot necessarily be

    handled in the same manner. Although some Indian languages may have common

    features with Sinhala, they are not identical. Moreover, such systems do not

    provide an underlying theory to generalize machine translation. As such, it is

    difficult to determine exactly which building block or structure should be

    customized to create an English to Sinhala machine translation system. Therefore, the lack

    of a theoretically based approach to machine translation has led to the development of ad hoc

    translation systems.

    2.8 Summary

    This chapter gave a detailed discussion of Machine Translation systems and the

    approaches they use. Table 2.1 shows selected successful machine translation

    systems with their language pair, approach and system type.

    Table 2.1: Existing Machine translation systems

    System            Language pair                 Approach & Type

    Anusaaraka        Among Indian languages        Human-assisted, Application

    AnglaBharti       English to Indian languages   Human-assisted, Rule-based, Application

    AnglaHindi        English to Hindi              Machine-aided, Rule-based/Example-based, Web-based

    ManTra            English to Hindi              Human-aided, Web-based

    English to Urdu   English to Urdu               Example-based, Application
    MT

    MaTra             English to Hindi              Human-aided, Transfer-based, Application

    Google TR         Several languages             Statistical, Web-based

    Babel Fish        Several languages             Systran technology, Web-based

    Yahoo TR          Several languages             Statistical, Web-based

    Apertium          Related languages             Rule-based, Application

    EDR               English/Japanese              Knowledge-based, Application

    According to the literature survey, the author has identified that human-assisted and

    rule-based approaches are more suitable for unrelated language pairs such as

    English and Sinhala. The next chapter reviews features of the English and Sinhala languages

    with a view to identifying issues related to machine translation from English to Sinhala.

    Chapter 3

    OVERVIEW OF THE ENGLISH AND SINHALA LANGUAGES

    3.1 Introduction

    The previous chapter discussed Machine Translation systems in detail. The

    author pointed out issues in adapting an existing translation system to

    construct an English to Sinhala machine translation system. The literature review

    also revealed that the development of a Machine Translation system depends

    heavily on the structure of the source and target languages. Therefore, this

    chapter studies the language primitives and structures of the English and Sinhala

    languages. This study provides insight into how translation

    from English to Sinhala can be done.

    3.2 The English Language

    English is the language of international communication, and more than 53 countries

    use it as an official language. It is a West Germanic language that originated

    from the Anglo-Frisian and Old Saxon dialects brought to Britain [162]. The English

    alphabet contains 26 letters, of which 5 are vowels [116]. The English language has eight

    parts of speech, namely Noun, Adjective, Pronoun, Verb, Adverb, Preposition,

    Conjunction and Interjection [8][165]. The rest of this section describes the Morphology,

    Syntax and Semantics of the English language.

    3.3 The English Language Morphology

    Morphology is the study of the way words are built up from smaller meaning-bearing

    units called morphemes; a morpheme is often defined as the minimal meaning-bearing unit in a

    language [84]. For example, the word "boy" consists of a single morpheme and the word

    "boys" consists of two morphemes, namely "boy" and "-s". Further, from the morphological

    viewpoint there are two types of morphemes: stems and affixes. In the

    previous example the morpheme "boy" is a stem and "-s" is an affix. Stems and

    affixes participate in both the inflection and the derivation of words, which together are called

    word formation [109]. Inflection provides the various forms of a single word, such

    as singular and plural (e.g. singular "man", plural "men" in English). Derivation creates

    new words from old ones (e.g. the creation of "dogcatcher" from "dog", "catch" and

    "-er" is a derivational process) [117][84]. Compared with other Indo-European

    languages, English grammar has minimal inflection. Therefore, English

    morphology is simpler than that of the other Indo-European languages. With the exception

    of pronouns, English words have relatively few forms.

    3.3.1 English Noun Morphology

    The English noun has two types of inflection: number and possessive case.

    Nouns generally have only two forms for number inflection, singular and

    plural. In the possessive case, the words usually end in ('s) or ('), for example

    "boy's" and "boys'".

    The English noun takes both regular and irregular inflections. Regular inflection

    gives the general forms of the singular, plural and possessive cases. Table 3.1 shows

    a regular and an irregular noun with their inflected forms.

    Table 3.1: Regular and irregular forms of the English Noun

    Grammar rule          Regular   Irregular

    Singular              boy       man

    Plural                boys      men

    Singular Possessive   boy's     man's

    Plural Possessive     boys'     men's

    The English noun has a very limited number of

    rules for inflection. Table 3.2 shows some morphological rules for the

    English noun. Basically, the plural noun is formed by adding a suffix such as

    -s, -es, -ies or -ves to the singular noun. The possessive case is formed by adding 's

    or '.

    Table 3.2: English Noun Morphological rules

    English Noun Morphology

    No   Morphological structure          Base word   Example

    1    Singular noun                    Boy         Boy

    2    Plural: Base + s                 Boy         Boys

    3    Plural: Base + es                Class       Classes

    4    Plural: Base - y + ies           Baby        Babies

    5    Plural: Base - f(e) + ves        Knife       Knives

    6    Singular Possessive: Base + 's   School      School's

    7    Plural Possessive: Plural + '    Boy         Boys'
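
    The rules in Table 3.2 can be expressed as a small rule cascade. The following Python sketch is only an illustration: the function names, the spelling conditions used to choose between rules and the one-entry irregular list are assumptions, and the -f to -ves rule overgeneralizes (for example, it would wrongly change "roof").

```python
# Hypothetical exception list for irregular plurals (see Table 3.1).
IRREGULAR_PLURALS = {"man": "men"}


def pluralize(noun):
    """Apply the plural rules of Table 3.2 in order of specificity."""
    if noun in IRREGULAR_PLURALS:
        return IRREGULAR_PLURALS[noun]
    if noun.endswith(("s", "x", "z", "ch", "sh")):
        return noun + "es"                       # class -> classes
    if noun.endswith("y") and len(noun) > 1 and noun[-2] not in "aeiou":
        return noun[:-1] + "ies"                 # baby -> babies
    if noun.endswith("fe"):
        return noun[:-2] + "ves"                 # knife -> knives
    if noun.endswith("f"):
        return noun[:-1] + "ves"                 # leaf -> leaves
    return noun + "s"                            # boy -> boys


def singular_possessive(noun):
    return noun + "'s"                           # school -> school's


def plural_possessive(noun):
    plural = pluralize(noun)
    # Plurals already ending in -s take only an apostrophe.
    return plural + "'" if plural.endswith("s") else plural + "'s"
```

    Checking the irregular list before the regular rules mirrors the usual design of rule-based generators, where exceptions take priority over general patterns.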

    3.3.2 English Verb Morphology

    The English verb has five inflected forms, namely infinitive, simple present, past

    tense, past participle and present participle. In regular verbs, the 3rd person singular ends

    with "-s", the past tense and past participle end with "-ed" and the present participle ends

    with "-ing". Note that English has a large number of irregular verbs, and these verbs do

    not fit this pattern. The personal pronoun has different forms depending on

    number (singular and plural), case (subject, object, possessive, etc.) and person (1st,

    2nd and 3rd person). In the 3rd person singular, there is gender too. Table 3.3

    shows all the verb forms of the English verbs "play" (regular) and "eat"

    (irregular).

    From the morphological point of view, English regular verbs follow a small set of

    morphological rules. Table 3.4 shows the morphological rules for the English verb.

    Most English regular verbs follow simple inflection rules. Irregular

    verbs, however, use different patterns than the regular verbs, except in the simple present (adding "-s")

    and the present participle (adding "-ing") forms.

    3.3.3 English Adjective Morphology

    Adjectives have comparative and superlative forms: comparative adjectives

    end with "-er" and superlative adjectives end with "-est". For example, "higher"

    and "highest" are the comparative and superlative forms of the adjective "high". The other

    parts of speech (adverb, preposition, conjunction and interjection) do not show

    inflection.

    Table 3.3: English verb Morphology

    English Verb Morphology

    Morphological structure   Regular verb   Irregular verb

    Infinitive                play           eat

    Past                      played         ate

    Present Participle        playing        eating

    Past Participle           played         eaten

    Present:

      I                       play           eat

      You                     play           eat

      He, She, It             plays          eats

      We                      play           eat

      You                     play           eat

      They                    play           eat

    Table 3.4: Morphological rules for English Verbs

    English Verb Morphology

    No   Morphological structure           Regular verb   Irregular verb

    1    Infinitive verb (base verb)       play           eat

    2    Simple present (base + s)         plays          eats

    3    Past (base + ed)                  played         ate

    4    Present Participle (base + ing)   playing        eating

    5    Past Participle (base + ed)       played         eaten
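
    The rules of Table 3.4 amount to a suffix table plus an exception list for irregular verbs. The Python sketch below is an illustration under that assumption; the names REGULAR_SUFFIXES, IRREGULAR_FORMS and inflect_verb are hypothetical, and spelling adjustments (such as dropping a final "e" before "-ing") are deliberately not handled.

```python
# Regular suffixes, one per morphological structure in Table 3.4.
REGULAR_SUFFIXES = {
    "infinitive": "",
    "simple_present": "s",
    "past": "ed",
    "present_participle": "ing",
    "past_participle": "ed",
}

# Hypothetical exception list: only the forms that deviate are stored.
IRREGULAR_FORMS = {
    "eat": {"past": "ate", "past_participle": "eaten"},
}


def inflect_verb(base, form):
    """Return the requested form, preferring a stored irregular form."""
    if form not in REGULAR_SUFFIXES:
        raise ValueError(f"unknown form: {form}")
    irregular = IRREGULAR_FORMS.get(base, {})
    if form in irregular:
        return irregular[form]
    return base + REGULAR_SUFFIXES[form]
```

    Storing only the deviating forms reflects the observation above that irregular verbs still follow the regular rules for the simple present and the present participle.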

    3.4 Syntax of the English Language

    Syntax is the study of the rules that govern the structure of sentences [162].

    The English language has its own syntax, which differs from that of the Sinhala language.

    This section gives a brief description of English sentence syntax,

    based on the Scientific Psychic web site [172][174]. The English language has four

    main sentence types, namely declarative, interrogative, imperative and conditional.

    An English sentence may be simple or compound. Compound sentences consist

    of two or more simple sentences joined by conjunctions.

    A declarative sentence consists of a subject and a predicate. The subject may be

    a simple subject or a compound subject. A simple subject consists of a noun phrase

    or a nominative personal pronoun. Compound subjects are formed by combining

    several simple subjects with conjunctions. All the sentences in this paragraph are

    declarative sentences.

    Interrogative sentences are used to form questions. One form of interrogative

    sentence is a declarative sentence followed by a question mark. Other forms of

    interrogative sentence start with a question word such as "what", "who" or "which".

    Imperative sentences are commands; they consist of predicates that contain only

    verbs in the infinitive form. Generally, imperative sentences are terminated with an

    exclamation mark instead of a period.

    Conditional sentences are used to describe the consequences of a specific

    action, or the dependency between events or conditions. Conditional sentences

    consist of an independent clause and a dependent clause.

    In addition to the above, developing a machine translation system requires a deeper

    structural analysis of the English source sentence, especially of its subject, object, predicate

    and sentence pattern. This information is very useful for identifying English phrases.

    3.4.1 The English Sentence Subject

    The subject is the part of the sentence that performs an action or is

    associated with the action. The subject may be simple or compound. A simple

    subject may be a noun phrase or a nominative personal pronoun. (The nominative

    personal pronouns are I, you, he, she, it, we and they.)

    3.4.2 The English Predicate

    The predicate is the part of the sentence that contains a verb or verb phrase and its

    complements. English has three main kinds of verbs: auxiliary verbs, linking verbs,

    and action verbs.

    3.4.3 Verb Tense

    Verb tenses are inflectional forms of verbs or verb phrases that are used to express

    time distinctions [8]. Table 3.5 shows the structure of some common tenses.

    Table 3.5: Tense patterns (Active voice)

    Tense                        Example

    Simple present               I write a book
                                 The boy sings a new song

    Present continuous           I am writing a book
                                 The boy is singing a new song

    Present perfect              I have written a book
                                 The boy has sung a new song

    Present perfect continuous   I have been writing a book
                                 The boy has been singing a new song

    Past tense                   I wrote a book
                                 The boy sang a new song

    Past continuous              I was writing a book
                                 The boy was singing a new song

    Past perfect                 I had written a book
                                 The boy had sung a new song

    Past perfect continuous      I had been writing a book
                                 The boy had been singing a new song

    Future tense                 I will write a book
                                 The boy will sing a new song

    Future continuous            I shall be writing a book
                                 The boy will be singing a new song

    Future perfect               I shall have written a book
                                 The boy will have sung a new song

    Future perfect continuous    I shall have been writing a book
                                 The boy will have been singing a new song
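
    Each pattern in Table 3.5 combines an auxiliary sequence with one of the principal parts of the verb. The following Python sketch illustrates this under some simplifying assumptions: the Verb class, the agreement labels ("1sg", "3sg", "other") and the function verb_phrase are hypothetical, and "will" is used for every future form although the table also shows "shall" with "I".

```python
from dataclasses import dataclass


@dataclass
class Verb:
    """Principal parts of a verb, supplied by hand in this sketch."""
    base: str
    s_form: str
    past: str
    past_participle: str
    present_participle: str


# Auxiliary forms selected by subject agreement.
BE_PRESENT = {"1sg": "am", "3sg": "is", "other": "are"}
BE_PAST = {"1sg": "was", "3sg": "was", "other": "were"}
HAVE_PRESENT = {"1sg": "have", "3sg": "has", "other": "have"}


def verb_phrase(v, tense, agreement):
    """Build the active-voice verb phrase for a tense of Table 3.5."""
    patterns = {
        "simple present": v.s_form if agreement == "3sg" else v.base,
        "present continuous": f"{BE_PRESENT[agreement]} {v.present_participle}",
        "present perfect": f"{HAVE_PRESENT[agreement]} {v.past_participle}",
        "present perfect continuous": f"{HAVE_PRESENT[agreement]} been {v.present_participle}",
        "past": v.past,
        "past continuous": f"{BE_PAST[agreement]} {v.present_participle}",
        "past perfect": f"had {v.past_participle}",
        "past perfect continuous": f"had been {v.present_participle}",
        "future": f"will {v.base}",
        "future continuous": f"will be {v.present_participle}",
        "future perfect": f"will have {v.past_participle}",
        "future perfect continuous": f"will have been {v.present_participle}",
    }
    return patterns[tense]
```

    For example, with sing = Verb("sing", "sings", "sang", "sung", "singing"), the call verb_phrase(sing, "present perfect", "3sg") yields the verb phrase of "The boy has sung a new song".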

    3.4.4 The Complement

    The predicate consists of a verb or verb phrase and its complements, if any. A verb

    that requires no complements is called intransitive. A verb that requires one or two

    complements is called transitive.

    3.5 Semantics of English Language

    Semantics is the study of meaning. It typically focuses on the relation between

    signifiers, such as words, phrases, signs and symbols, and what they stand for [162].

    Semantics can be classified into three groups, namely word level meaning, sentence

    level meaning and paragraph level meaning.

    3.5.1 Word Level Semantics

    Word level semantics refers to the meaning carried by the individual words in a sentence. As

    an example, consider the following sample sentences: "This is a red rose", "This paper

    is red", and "The supervisor flashes the red light for his student". The word "red"

    has a different meaning in each sentence.

    3.5.2 Sentence Level Semantics

    Sentence level semantics refers to the meaning that depends on the sentence as a whole.

    Analyzing the sentence level semantics of a sentence is very important in many

    areas [37].

    3.5.3 Paragraph Level Semantics

    Paragraph level semantic analysis [173] is a solution to word sense

    ambiguity [80]. Further, many researchers have conducted research on analyzing

    paragraph level semantics [127].

    3.6 The Sinhala Language

    The Sinhala language is constitutionally recognized as an official language of Sri

    Lanka, along with Tamil. Sinhala is the mother tongue of the Sinhalese. The Sinhala

    language has its own writing system, which is an offspring of the Brahmi script [22].

    Dhivehi, spoken in the Maldives, is the closest relative of Sinhala. Further, the Sinhala

    script is ranked as the world's 16th most creative alphabet among today's functional

    languages [35]. The most important historical chronicle of the Sinhalese, the Mahavansa [102], notes that

    Prince Vijaya and his entourage, who came from India in the 5th century BC,

    merged with the native Hela tribes known as Yakka and Naga, who spoke the Elu

    language (the ancient form of the Sinhalese language), and that the new nation called

    Sinhala came into existence together with the Sinhala language.

    Further, Sinhala differs from all other Indo-Aryan languages. It contains a pair of

    vowel sounds that are unique to it: the short vowel ඇ (ae) and the long vowel

    ඈ (aae). Sinhala also contains a set of five nasal sounds known as half nasals or

    prenasalized stops. These sounds, as represented in modern Sinhala writing, and

    their Romanized notations are as follows: ඟ (nng), ඦ (ndj), ඬ (nnd), ඳ (nd), ඹ (mb)

    [88].

    The next sub section briefly describes the Sinhala alphabet, morphology and the

    syntax of the Sinhala language.

    3.6.1 Sinhala Alphabet

    The Sinhala alphabet consists of 61 letters comprising 18 vowels, 41 consonants and

    2 semi-consonants [40][22]. These symbols represent 40 sounds: 14 vowel sounds

    and 26 consonant sounds. This is quite similar to other Indic alphabets, as all of

    them appear to be offshoots of the Sanskrit alphabet [50]. Table 3.6 shows the

    Sinhala alphabet.

    Table 3.6: The Sinhala Alphabet

    Letter Type   Sinhala Letters

    Vowels        අ, ආ, ඇ, ඈ, ඉ, ඊ, උ, ඌ, ඍ, ඎ, ඏ, ඐ, එ, ඒ, ඓ, ඔ, ඕ, ඖ

    Consonants    ක, ඛ, ග, ඝ, ඞ, ඟ, ච, ඡ, ජ, ඣ, ඤ, ඥ, ඦ, ට, ඨ, ඩ, ඪ, ණ,
                  ඬ, ත, ථ, ද, ධ, න, ඳ, ප, ඵ, බ, භ, ම, ඹ, ය, ර, ල, ව, ශ, ෂ, ස,
                  හ, ළ, ෆ

    Furthermore, some graphical symbols, called strokes, are used in conjunction with

    consonants. They are used in writing some vowels too (for example ආ, ඒ, ඓ). Unlike

    in English, a stroke may be positioned on any of the four sides of the base letter.

    Table 3.7 shows the Sinhala strokes and their positions [42].

    Table 3.7: Vocalic strokes and their positions

    No   Stroke   Name               Position   Example

    1    ◌්       Al-lakuna 1        Upper      ස්

         ◌්       Al-lakuna 2        Upper

    2    ◌ා       Aela-pilla         Right      කා

    3    ◌ැ       Ketti aedapilla    Right      කැ

    4    ◌ෑ       Diga aedapilla     Right      කෑ

    5    ◌ි       Ketti ispilla      Upper      කි

    6    ◌ී       Diga ispilla       Upper      කී

    7    ◌ු       Ketti paa pilla 1  Lower      බු

         ◌ු       Ketti paa pilla 2  Lower      කු

    8    ◌ූ       Diga paa pilla 1   Lower      බූ

         ◌ූ       Diga paa pilla 2   Lower      කූ

    9    ◌ෘ       Gaettapilla        Right      සෘ

    10   ◌ෙ       Kombuva            Left       මෙ

    11   ◌ෟ       Gayanukitta        Right      ඖ

    In addition to the above, Sinhala letters (characters) are formed from vowels,

    consonants and the combination of consonants with strokes. Table 3.8 shows the

    combination of the consonant ක් (k) with the vocalic strokes.

    Table 3.8: The consonant ක් with vocalic strokes

    No   Character   Letter

    1    ක්           ක්

    2    ක් + අ       ක

    3    ක් + ආ       කා

    4    ක් + ඇ       කැ

    5    ක් + ඈ       කෑ

    6    ක් + ඉ       කි

    7    ක් + ඊ       කී

    8    ක් + උ       කු

    9    ක් + ඌ       කූ

    10   ක් + ඍ       කෘ

    11   ක් + ඎ       කෲ

    12   ක් + එ       කෙ

    13   ක් + ඒ       කේ

    14   ක් + ඓ       කෛ

    15   ක් + ඔ       කො

    16   ක් + ඕ       කෝ

    17   ක් + ඖ       කෞ
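
    In the Unicode encoding of Sinhala, the combination of a consonant with a vocalic stroke is stored as the consonant code point followed by the code point of a dependent vowel sign. The Python sketch below illustrates this for the consonant k; the names KA, AL_LAKUNA, VOWEL_SIGNS and combine are hypothetical, and the sketch assumes Unicode text rather than the legacy font encoding often used in older Sinhala documents.

```python
KA = "\u0D9A"          # Sinhala letter ka, which carries the inherent vowel a
AL_LAKUNA = "\u0DCA"   # sign that suppresses the inherent vowel (pure consonant k)

# Independent vowel letter -> dependent vowel sign (stroke).
# The vowel a needs no sign because it is inherent in the consonant letter.
VOWEL_SIGNS = {
    "\u0D85": "",        # a
    "\u0D86": "\u0DCF",  # aa  -> aela-pilla
    "\u0D87": "\u0DD0",  # ae  -> ketti aeda-pilla
    "\u0D88": "\u0DD1",  # aae -> diga aeda-pilla
    "\u0D89": "\u0DD2",  # i   -> ketti is-pilla
    "\u0D8A": "\u0DD3",  # ii  -> diga is-pilla
    "\u0D8B": "\u0DD4",  # u   -> ketti paa-pilla
    "\u0D8C": "\u0DD6",  # uu  -> diga paa-pilla
    "\u0D8D": "\u0DD8",  # ri  -> gaetta-pilla
    "\u0D8E": "\u0DF2",  # rii -> diga gaetta-pilla
    "\u0D91": "\u0DD9",  # e   -> kombuva
    "\u0D92": "\u0DDA",  # ee  -> diga kombuva
    "\u0D93": "\u0DDB",  # ai  -> kombu deka
    "\u0D94": "\u0DDC",  # o   -> kombuva haa aela-pilla
    "\u0D95": "\u0DDD",  # oo  -> kombuva haa diga aela-pilla
    "\u0D96": "\u0DDE",  # au  -> kombuva haa gayanukitta
}


def combine(consonant, vowel=None):
    """Return the written form of consonant + vowel; no vowel gives the pure consonant."""
    if vowel is None:
        return consonant + AL_LAKUNA
    return consonant + VOWEL_SIGNS[vowel]
```

    Note that the sign always follows the consonant in the stored text, even for strokes such as the kombuva that are drawn to the left of the base letter; the visual placement is left to the rendering engine.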

    3.7 Sinhala Language Morphology

    Sinhala is an inflectionally rich language, and its nouns and verbs undergo inflection,

    derivation and conjugation. Inflection is the modification of a word to express

    different grammatical categories such as tense, mood, voice, aspect, person, number,

    gender and case [54]. Derivation is "used to form new words, as with happiness

    and un-happy from happy, or determination from determine" [162], and conjugation

    refers to the creation of derived forms of a verb from its principal parts by inflection.

    Conjugation may be affected by person, number, gender, tense, aspect, mood, voice

    or other grammatical categories. A table giving all the conjugated variants of a verb

    in a given language is called a conjugation table or a verb paradigm.

    3.7.1 Sinhala Noun Morphology

    The Sinhala noun is a word class that covers the noun, the pronoun and the adjective of

    the English language. The Sinhala noun has four types of inflection: Gender

    (lingaya), Number (Wachana), Person (Purusha) and Case (Vibhakthi). There are

    three genders, namely masculine, feminine and neuter. The numbers are singular

    and plural, and there are three persons, namely first person

    (Uthtamapurusha), second person (Maddamapurusha) and third person

    (prathamapurusha). There are also nine cases in Sinhala: Nominative

    (prathama), Accusative (karma), Instrumental (kaththru), Auxiliary (karana), Dative

    (sampadana), Ablative (avadhi), Genitive (Sambanda), Locative (adara) and

    Vocative (alapana) [54][134]. A single base noun generates 27 inflected forms

    through the combination of the nine cases (Vibhakthi) with the article and the number. For example, the Sinhala base word

    ගව inflects as ගවයා, ගවයෝ, ගවයෙක්, etc. The base word is directly affected by

    the nine cases. Some case suffixes are written together with the base word and some are

    written separately. Table 3.9 shows sample case markers of the Sinhala noun. A

    number of case marker forms are available in Sinhala, depending on the gender of

    the noun.

    From the morphological point of view, a Sinhala noun is also a word, and nouns

    undergo inflection and derivation. The Sinhala nouns can be divided into thr