
  • CONTRIBUTIONS TO ENGLISH TO HINDI

    MACHINE TRANSLATION USING

    EXAMPLE-BASED APPROACH

    DEEPA GUPTA

    DEPARTMENT OF MATHEMATICS

    INDIAN INSTITUTE OF TECHNOLOGY DELHI

    HAUZ KHAS, NEW DELHI-110016, INDIA

    JANUARY, 2005

  • CONTRIBUTIONS TO ENGLISH TO HINDI

    MACHINE TRANSLATION USING

    EXAMPLE-BASED APPROACH

    by

    DEEPA GUPTA

    Department of Mathematics

    Submitted

    in fulfilment of the requirement of

    the degree of

    Doctor of Philosophy

    to the

    Indian Institute of Technology Delhi

    Hauz Khas, New Delhi-110016, India

    January, 2005

  • Dedicated to

    My Parents,

    My Brother Ashish and

    My Thesis Supervisor...

  • Certificate

    This is to certify that the thesis entitled Contributions to English to Hindi

    Machine Translation Using Example-Based Approach submitted by Ms.

    Deepa Gupta to the Department of Mathematics, Indian Institute of Technology

    Delhi, for the award of the degree of Doctor of Philosophy, is a record of bona fide

    research work carried out by her under my guidance and supervision.

    The thesis has reached the standards fulfilling the requirements of the regulations

    relating to the degree. The work contained in this thesis has not been submitted to

    any other university or institute for the award of any degree or diploma.

    Dr. Niladri Chatterjee

    Assistant Professor

    Department of Mathematics

    Indian Institute of Technology Delhi

    Delhi (INDIA)

  • Acknowledgement

    If I say that this is my thesis it would be totally untrue. It is like a dream come true.

    There are people in this world, some of them so wonderful, who helped in making

    this dream, a product that you are holding in your hand. I would like to thank all

    of them, and in particular:

    Dr. Niladri Chatterjee - mentor, guru and friend, taught me the basics of research

and stayed with me right till the end. His efforts, comments, advice and ideas

    developed my thinking, and improved my way of presentation. Without his con-

    stant encouragement, keen interest, inspiring criticism and invaluable guidance, I

    would not have accomplished my work. I admit that his efforts need much more

    acknowledgement than expressed here.

    I acknowledge and thank the Indian Institute of Technology Delhi and Tata Infotech

Research Lab, which funded this research. I sincerely thank all the faculty members of

the Department of Mathematics; in particular, I express my gratitude to Prof. B. Chandra

and Dr. R. K. Sharma for providing me with continuous moral support and help. I

    thank my SRC members, Prof. Saroj Kaushik and Prof. B. R. Handa, for their time

    and efforts. I also thank the department administrative staff for their assistance. I

    extend my thanks to Prof. R. B. Nair and Dr. Wagish Shukla of IIT Delhi, and

    Prof. Vaishna Narang, Prof. P. K. Pandey, Prof. G. V. Singh, Dr. D. K. Lobiyal,

    and Dr. Girish Nath Jha of Jawaharlal Nehru University Delhi, for the enlightening

    discussions on basics of languages.

    I would like to express my sincere thanks to my friends Priya and Dharmendra

    for many fruitful discussions regarding my research problem. I thank Mr. Gaurav

  • Kashyap for helping me in the implementation of the algorithms. In particular, I

would like to thank Inderdeep Singh for his help in writing some parts of the thesis.

    I want to give special thanks to my friends, Sonia, Pranita and Nutan, for helping

    me in both good and bad times. I would like to thank Prabhakhar for his brotherly

    support. I extend my thanks to Manju, Anita, Sarita, Subhashini and Anju for

    cheering me, always.

    Shailly and Geeta - amazing friends who read the manuscript and gave honest com-

    ments. Both of them also stayed with me in the process, and handled me, and

sometimes my out-of-control emotions, so well. In particular, I wish to extend my

thanks to Geeta for letting me stay in her hostel room, and also for her wonderful

help when my leg got fractured, when we had known each other for only a month. I wish

    to acknowledge Krishna for his constant help, both academic and nonacademic, and

    his continuous encouragement.

    I convey my sincere regards to my parents, and brothers for the sacrifices they have

    made, for the patience they have shown, and for the love and blessing they have

showered. I thank Arun for his moral support. Most important of all, I would like

    to express my profound sense of gratitude and appreciation to my sister Neetu. Her

    irrational and unbreakable belief in me bordered on craziness at times.

I cannot avoid mentioning my friend Sharad, who deserves more than a little ac-

knowledgement. His constant inspiration and untiring support have sustained my

    confidence throughout this work.

Finally, I thank GOD for everything.

    Deepa Gupta

  • Abstract

This research focuses on the development of an Example-Based Machine Translation (EBMT)

system for English to Hindi. Development of a machine translation (MT) system

typically demands a large volume of computational resources. For example, rule-

based MT systems require extraction of syntactic and semantic knowledge in the

form of rules, while statistics-based MT systems require a huge parallel corpus containing

sentences in the source language and their translations in the target language. The require-

ment of such computational resources is much smaller for EBMT. This makes

development of EBMT systems for English to Hindi translation feasible, where avail-

ability of large-scale computational resources is still scarce. The primary motivation

for this work comes from the following:

    a) Although a small number of English to Hindi MT systems are already available,

    the outputs produced by them are not of high quality all the time. Through

this work we intend to analyze the difficulties that lead to this below-par

    performance, and try to provide some solutions for them.

b) There are several other major languages (e.g., Bengali, Punjabi, Gujarati) in

    the Indian subcontinent. Demand for developing MT systems from English to

    these languages is increasing rapidly. But at the same time, development of

computational resources in these languages is still in its infancy. Since many

of these languages are similar to Hindi, both syntactically and lexically,

the research carried out here should help in developing MT systems from English

    to these languages as well.


  • The major contributions of this research may be described as follows:

    1) Development of a systematic adaptation scheme. We proposed an adaptation

    scheme consisting of ten basic operations. These operations work not only at

the word level, but at the suffix level as well. This makes adaptation less expensive in

    many situations.

    2) Study of Divergence. We observe that occurrence of divergence causes major

difficulty for any MT system. In this work we make an in-depth study of the

    different types of divergence, and categorize them.

    3) Development of Retrieval scheme. We propose a novel approach for measuring

similarity between sentences. We suggest that a retrieval strategy, with respect

    to an EBMT system, will be most efficient if it measures similarity on the basis

    of cost of adaptation. In this work we provide a complete framework for an

    efficient retrieval scheme on the basis of our studies on divergence and cost

    of adaptation.

    4) Dealing with Complex sentences. Handling complex sentences by an MT sys-

    tem is generally considered to be difficult. In this work we propose a split

    and translate technique for translating complex sentences under an EBMT

    framework.

    We feel that the overall scheme proposed in this research will pave the way for

    developing an efficient EBMT system for translating from English to Hindi. We

    hope that this research will also help development of MT systems from English to

    other languages of the Indian subcontinent.


  • Contents

    1 Introduction 1

    1.1 Description of the Work Done and Summary of the Chapters . . . . . 6

    1.2 Some Critical Points . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

    2 Adaptation in English to Hindi Translation: A Systematic Ap-

    proach 23

    2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

    2.2 Description of the Adaptation Operations . . . . . . . . . . . . . . . 29

    2.3 Study of Adaptation Procedure for Morphological Variation of Active

    Verbs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

    2.3.1 Same Tense Same Verb Form . . . . . . . . . . . . . . . . . . 38

    2.3.2 Different Tenses Same Verb Form . . . . . . . . . . . . . . . . 42

    2.3.3 Same Tense Different Verb Forms . . . . . . . . . . . . . . . . 46

    2.3.4 Different Tenses Different Verb Forms . . . . . . . . . . . . . . 48

    2.4 Adaptation Procedure for Morphological Variation of Passive Verbs . 51

    2.5 Study of Adaptation Procedures for Subject/ Object Functional Slot 56

    2.5.1 Adaptation Rules for Variations in the Morpho Tags of @DN> 59


    2.5.2 Adaptation Rules for Variations in the Morpho Tags of @GN> 60

    2.5.3 Adaptation Rules for Variations in the Morpho Tags of @QN . 64

    2.5.4 Adaptation Rules for Variations in the Morpho Tags of Pre-

    modifier Adjective @AN> . . . . . . . . . . . . . . . . . . . . 64

    2.5.5 Adaptation Rules for Variations in the Morpho Tags of @SUB 69

    2.6 Adaptation of Interrogative Words . . . . . . . . . . . . . . . . . . . 73

    2.7 Adaptation Rules for Variation in Kind of Sentences . . . . . . . . . . 83

    2.8 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . 85

    3 An FT and SPAC Based Divergence Identification Technique From

    Example Base 87

    3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87

    3.2 Divergence and Its Identification: Some Relevant Past Work . . . . . 89

    3.3 Divergences and Their Identification in English to Hindi Translation . 96

    3.3.1 Structural Divergence . . . . . . . . . . . . . . . . . . . . . . . 97

    3.3.2 Categorial Divergence . . . . . . . . . . . . . . . . . . . . . . 100

    3.3.3 Nominal Divergence . . . . . . . . . . . . . . . . . . . . . . . 104

    3.3.4 Pronominal Divergence . . . . . . . . . . . . . . . . . . . . . . 107

    3.3.5 Demotional Divergence . . . . . . . . . . . . . . . . . . . . . . 111

    3.3.6 Conflational Divergence . . . . . . . . . . . . . . . . . . . . . 117

    3.3.7 Possessional Divergence . . . . . . . . . . . . . . . . . . . . . 121

    3.3.8 Some Critical Comments . . . . . . . . . . . . . . . . . . . . . 131


    3.4 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . 132

    4 A Corpus-Evidence Based Approach for Prior Determination of

    Divergence 135

    4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135

    4.2 Corpus-Based Evidences and Their Use in Divergence Identification . 136

    4.2.1 Roles of Different Functional Tags . . . . . . . . . . . . . . . . 138

    4.3 The Proposed Approach . . . . . . . . . . . . . . . . . . . . . . . . . 147

    4.4 Illustrations and Experimental Results . . . . . . . . . . . . . . . . . 155

    4.4.1 Illustration 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . 155

    4.4.2 Illustration 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . 157

    4.4.3 Illustration 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . 158

    4.4.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . 166

    4.5 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . 168

    5 A Cost of Adaptation Based Scheme for Efficient Retrieval of Trans-

    lation Examples 171

    5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171

    5.2 Brief Review of Related Past Work . . . . . . . . . . . . . . . . . . . 171

    5.3 Evaluation of Cost of Adaptation . . . . . . . . . . . . . . . . . . . . 178

    5.3.1 Cost of Different Adaptation Operations . . . . . . . . . . . . 182

    5.4 Cost Due to Different Functional Slots and Kind of Sentences . . . . 185


    5.4.1 Costs Due to Variation in Kind of Sentences . . . . . . . . . . 186

    5.4.2 Cost Due to Active Verb Morphological Variation . . . . . . . 187

    5.4.3 Cost Due to Subject/Object Functional Slot . . . . . . . . . . 192

    5.4.4 Use of Adaptation Cost as a Measure of Similarity . . . . . . . 197

5.5 The Proposed Approach vis-à-vis Some Similarity Measurement Schemes 198

    5.5.1 Semantic Similarity . . . . . . . . . . . . . . . . . . . . . . . . 198

    5.5.2 Syntactic Similarity . . . . . . . . . . . . . . . . . . . . . . . . 201

    5.5.3 A Proposed Approach: Cost of Adaptation Based Similarity . 203

    5.5.4 Drawbacks of the Proposed Scheme . . . . . . . . . . . . . . . 211

    5.6 Two-level Filtration Scheme . . . . . . . . . . . . . . . . . . . . . . . 213

    5.6.1 Measurement of Structural Similarity . . . . . . . . . . . . . . 214

    5.6.2 Measurement of Characteristic Feature Dissimilarity . . . . . . 217

    5.7 Complexity Analysis of the Proposed Scheme . . . . . . . . . . . . . 222

    5.8 Difficulties in Handling Complex Sentences . . . . . . . . . . . . . . . 226

5.9 Splitting Rules for Converting Complex Sentence into Simple Sentences . . . 229

    5.9.1 Splitting Rule for the Connectives when, where, when-

    ever and wherever . . . . . . . . . . . . . . . . . . . . . . . 231

    5.9.2 Splitting Rule for the Connective who . . . . . . . . . . . . 241

    5.10 Adaptation Procedure for Complex Sentence . . . . . . . . . . . . . . 253

    5.10.1 Adaptation Procedure for Connectives when, where, when-

    ever and wherever . . . . . . . . . . . . . . . . . . . . . . . 254


    5.10.2 Adaptation Procedure for Connective who . . . . . . . . . . 256

    5.11 Illustrations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260

    5.11.1 Illustration 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . 260

    5.11.2 Illustration 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . 262

    5.12 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . 264

    6 Discussions and Conclusions 267

    6.1 Goals and Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . 267

    6.2 Contributions Made by This Research . . . . . . . . . . . . . . . . . . 268

    6.3 Possible extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272

    6.4 Epilogue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273

    6.4.1 Pre-editing and Post-editing . . . . . . . . . . . . . . . . . . . 274

    6.4.2 Evaluation Measures of Machine Translation . . . . . . . . . . 276

    Appendices 280

    A 281

    A.1 English and Hindi Language Variations . . . . . . . . . . . . . . . . . 281

    A.2 Verb Morphological and Structure Variations . . . . . . . . . . . . . . 285

    A.2.1 Conjugation of Root Verb . . . . . . . . . . . . . . . . . . . . 286

    B 291

    B.1 Functional Tags . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291

    B.2 Morpho Tags . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 294


    C 299

C.1 Definitions of Some Non-typical Functional Tags and SPAC Structures . . . 299

    D 303

    D.1 Semantic Similarity . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303

    E 305

E.1 Cost Due to Adapting Pre-modifier Adjective to Pre-modifier Adjective . . . 305

    Bibliography 308


  • List of Figures

    1.1 An Example Sentence with Its Morpho-Functional Tags . . . . . . . . 20

    2.1 The five possible scenarios in the SL SL TL interface of partial

    case matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

    2.2 Example of Different Adaptation Operations . . . . . . . . . . . . . . 34

    2.3 Some Typical Sentence Structures . . . . . . . . . . . . . . . . . . . . 83

    3.1 Algorithm for Identification of Structural Divergence . . . . . . . . . 99

    3.2 Correspondence of SPACs of E and H for Identification of Structural

    Divergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99

3.3 Algorithm for Identification of Categorial Divergence . . . . . . . . . 103

    3.4 Correspondence of SPACs for the Categorial Divergence Example of

    Sub-type 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104

3.5 Algorithm for Identification of Nominal Divergence . . . . . . . . . . 106

    3.6 Correspondence of SPAC E and SPAC H of Nominal Divergence of

    Sub-type 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107

3.7 Algorithm for Identification of Pronominal Divergence . . . . . . . . 110


    3.8 Correspondence of SPAC E and SPAC H of Pronominal Divergence

    of Sub-type 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110

    3.9 Algorithm for Identification of Demotional Divergence . . . . . . . . . 114

    3.10 Correspondence of SPAC E and SPAC H for Demotional Sub-type 4 115

    3.11 SPAC Correspondence for Demotional Divergence of Sub-type 1 . . . 116

    3.12 Algorithm for Identification of Conflational Divergence . . . . . . . . 120

    3.13 Correspondence of SPAC E and SPAC H for Conflational Divergence

    of Sub-type 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121

    3.14 Algorithm for Identification of Possessional Divergence . . . . . . . . 129

    3.15 Correspondence of SPAC E and SPAC H for Possessional Divergence

    of Sub-type 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130

    3.16 Correspondence of SPAC E and SPAC H for Possessional Divergence

    of Sub-type 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131

    4.1 Schematic Diagram of the Proposed Algorithm . . . . . . . . . . . . . 153

    4.2 Continuation of the Figure 4.1 . . . . . . . . . . . . . . . . . . . . . . 154

    5.1 Schematic View of Module 1 for Identification of Complex Sentence

    with Connective any of when, where, whenever, or wherever . 232

    5.2 Schematic View of Module 2 . . . . . . . . . . . . . . . . . . . . . . . 237

    5.3 Schematic View of Module 3 . . . . . . . . . . . . . . . . . . . . . . . 240

    5.4 Schematic View of Module 1 for Identification of Complex Sentence

    with Connective who . . . . . . . . . . . . . . . . . . . . . . . . . . 244


    5.5 Schematic View of the SUBROUTINE SPLIT . . . . . . . . . . . . . 246

    5.6 Schematic View of Module 2 . . . . . . . . . . . . . . . . . . . . . . . 247

    5.7 Schematic View of Module 3 . . . . . . . . . . . . . . . . . . . . . . . 249

    5.8 Schematic View of Module 4 . . . . . . . . . . . . . . . . . . . . . . . 250


  • List of Tables

    1.1 Output of AnglaHindi and Shakti MT System . . . . . . . . . . 5

    2.2 Notations Used in Sentence Patterns . . . . . . . . . . . . . . . . . . 35

    2.3 Adaptation Operations of Verb Morphological Variations in Present

    Indefinite to Present Indefinite . . . . . . . . . . . . . . . . . . . . . . 39

    2.4 Adaptation Operations of Verb Morphological Variations in Present

    Indefinite to Past Indefinite . . . . . . . . . . . . . . . . . . . . . . . 44

2.5 Different Functional Tags Under the Functional Slot Subject or Object . . 56

2.6 Different Possible Morpho Tags for Each of the Functional Tags under

the Functional Slot Subject or Object . . . . . . . . 58

    2.8 Adaptation Operations for Genitive Case to Genitive Case . . . . . . 62

    2.10 Adaptation Operations for Pre-modifier Adjective to Pre-modifier Ad-

    jective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

    2.11 Adaptation Operations for Subject to Subject Variations . . . . . . . 71

    2.12 Different Sentence Patterns of Interrogative Words . . . . . . . . . . . 77


    2.13 Functional & Morpho Tags Corresponding to Each Interrogative Sen-

    tence Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

    2.14 Adaptability Rules for Group G5 Sentence Patterns . . . . . . . . . . 83

    2.15 Adaptation Rules for Variation in Kind of Sentences . . . . . . . . . . 84

    3.1 Different Semantic Similarity Score between shock with trouble

    and panic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88

    4.1 FT-features Instrumental for Creating Divergence . . . . . . . . . . . 138

    4.2 Relevance of FT-features in Different Divergence Types . . . . . . . . 139

    4.3 FT of the Problematic Words for Each Divergence Type . . . . . . . 142

    4.4 Frequency of Words in Different Sections . . . . . . . . . . . . . . . . 144

    4.5 PSD/NSD Schematic Representations . . . . . . . . . . . . . . . . . . 145

    4.6 Values of s(di) and m(di) for Illustration 3 . . . . . . . . . . . . . . . 160

    4.7 Some Illustrations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163

    4.8 Continuation of Table 4.7 . . . . . . . . . . . . . . . . . . . . . . . . 165

    4.9 Results of Our Experiments . . . . . . . . . . . . . . . . . . . . . . . 166

    5.1 Cost Due to Variation in Kind of Sentences . . . . . . . . . . . . . . . 187

    5.2 Cost Due to Verb Morphological Variation Present Indefinite to Present

    Indefinite . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188

    5.3 Adaptation Operations of Verb Morphological Variation Present In-

    definite to Past indefinite . . . . . . . . . . . . . . . . . . . . . . . . . 192

    5.4 Costs Due to Adapting Genitive Case to Genitive Case . . . . . . . . 195


    5.5 Cost of Adaptation Due to Subject/Object to Subject/Object . . . . 197

    5.6 Best Five Matches by Using Semantic Similarity for the Input Sen-

    tence I work. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200

    5.7 Best Five Matches by Using Semantic Similarity for the Input Sen-

    tence Sita sings ghazals. . . . . . . . . . . . . . . . . . . . . . . . . 201

    5.8 Weighting Scheme for Different POS and Syntactic Role . . . . . . . 202

    5.9 Best Five Matches by Syntactic Similarity for the Input Sentence I

    work. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202

    5.10 Best Five Matches by Syntactic Similarity for the Input Sentence Sita

    sings ghazals. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203

    5.11 Functional-morpho Tags for the Input English Sentence (IE) and the

    Retrieved English Sentence (RE) . . . . . . . . . . . . . . . . . . . . 204

    5.12 Retrieval on the Basis of Cost of Adaptation Based Scheme for the

    Input Sentence I work. . . . . . . . . . . . . . . . . . . . . . . . . . 207

    5.13 Retrieval on the Basis of Cost of Adaptation Based Similarity for the

    Input Sentence Sita sings ghazals. . . . . . . . . . . . . . . . . . . . 207

    5.14 Cost of Adaptation for Retrieved Best Five Matches for the Input

    Sentence I work. by Using Semantic and Syntactic Based Similarity

    Schemes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209

    5.15 Cost of Adaptation for Retrieved Best Five Matches for the Input

    Sentence Sita sings ghazals by Using Semantic and Syntactic based

    Similarity Schemes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210

    5.16 Weights Used for Characteristic Features . . . . . . . . . . . . . . . . 220


    5.17 Notation Used in the Complexity Analysis . . . . . . . . . . . . . . . 222

    5.19 Typical Examples of Complex Sentence with Connective when, where,

    whenever or wherever Handled by Module 2 . . . . . . . . . . . . 235

    5.20 Typical Examples of Complex Sentence with Connective when, where,

    whenever or wherever Handled by Module 3 . . . . . . . . . . . . 239

    5.21 Typical Complex Sentences with Relative Adverb who Handled by

    Module 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242

    5.22 Typical Complex Sentences with Relative Adverb who Handled by

    Module 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242

    5.23 Typical Complex Sentences with Relative Adverb who Handled by

    Module 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243

    5.24 Hindi Translation of Relative Adverbs . . . . . . . . . . . . . . . . . . 254

    5.25 Patterns of Complex Sentence with Connective when, where,

    whenever and wherever . . . . . . . . . . . . . . . . . . . . . . . . 255

    5.26 Patterns of Complex Sentence with Connective who . . . . . . . . . 257

    5.27 Five Most Similar Sentence for RC You go to India. Using Cost of

    Adaptation Based Scheme . . . . . . . . . . . . . . . . . . . . . . . . 261

    5.28 Five Most Similar Sentence for MC You should speak Hindi. Using

    Cost of Adaptation based Scheme . . . . . . . . . . . . . . . . . . . . 261

    5.29 Five Most Similar Sentence for RC He wants to learn Hindi. Using

    Cost of Adaptation Based Scheme . . . . . . . . . . . . . . . . . . . . 263

    5.30 Five Most Similar Sentence for MC The student should study this

    book. Using Cost of Adaptation Based Scheme . . . . . . . . . . . . 263


    A.2 Different Case Ending in Hindi . . . . . . . . . . . . . . . . . . . . . 283

    A.3 Suffixes and Morpho-Words for Hindi Verb Conjugations . . . . . . . 286

    A.4 Verb Morphological Changes From English to Hindi Translation . . . 288

E.1 Costs Due to Adapting Pre-modifier Adjective to Pre-modifier Adjective . . . 307


  • Chapter 1

    Introduction


    Machine Translation (MT) is the process of translating text units of one language

    (source language) into a second language (target language) by using computers. The

    need for MT is greatly felt in the modern age due to globalization of information,

where the global information base needs to be accessed from different parts of the world.

    Although most of this information is available online, the major difficulty in dealing

with this information is that its language is primarily English. From science,

technology and education to gadget manuals and commercial advertisements, the

predominant presence of English as the medium of communication can be easily

observed. The world, however, is multilingual, and different languages are spoken

    in different regions. This necessitates the development of good MT systems for

    translating these works into other languages so that a larger population can access,

    retrieve and understand them. Consequently, in a country like India, where English

    is understood by less than 3% of the population (Sinha and Jain, 2003), the need

    for developing MT systems for translating from English into some native Indian

    languages is very acute. In this work we looked into different aspects of designing an

    English to Hindi MT system using Example-Based (Nagao, 1984) technique. Two

    fundamental questions that we feel we should answer at this point are:

    The rationale behind choosing Example-Based Machine Translation (EBMT)

    as the paradigm of interest;

    The reason behind selecting Hindi as the preferred language.

    Below we provide justifications behind these choices.

    Development of MT systems has taken a big leap in the last two decades. Typ-

    ically, machine translation requires handcrafted and complicated large-scale knowl-


  • edge (Sumita and Iida, 1991). Various MT paradigms have so far evolved depending

    upon how the translation knowledge is acquired and used. For example,

    1. Rule-Based Machine Translation (RBMT): Here rules are used for analysis

    and representation of the meaning of the source language texts, and the

    generation of equivalent target language texts (Grishman and Kosaka, 1992),

    (Thurmair, 1990), (Arnold and Sadler, 1990).

    2. Statistical- (or Corpus-) Based Machine Translation (SBMT): Statistical trans-

lation models are trained on a sentence-aligned translation corpus; the approach is

based on n-gram modelling and on the probability distribution of the occurrence of

a source-target language pair in a very large corpus. This technique was pro-

    posed by IBM in early 1990s (Brown, 1990), (Brown et. al., 1992), (Brown et.

    al., 1993), (Germann, 2001).
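
As a concrete reference point (our own summary, not text from the thesis), the statistical formulation cited in item 2 above is usually written as the standard noisy-channel decomposition, in which a candidate target sentence e for a source sentence f is scored by a translation model together with an n-gram language model:

    % Standard noisy-channel view of statistical MT (cf. Brown et al.):
    \[
      \hat{e} \;=\; \arg\max_{e} \, P(e \mid f)
              \;=\; \arg\max_{e} \, P(f \mid e)\, P(e),
      \qquad
      P(e) \;\approx\; \prod_{i=1}^{|e|} P\bigl(e_i \mid e_{i-n+1}, \ldots, e_{i-1}\bigr).
    \]

Reliable estimation of the translation model P(f | e), in particular, is what demands the very large parallel corpus referred to above.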

    However, these techniques have their own drawbacks. The main drawback of

    RBMT systems is that sentences in any natural language may assume a large vari-

    ety of structures. Also, machine translation often suffers from ambiguities of various

types (Dorr et. al., 1998). As a consequence, translation from one natural lan-

    guage into another requires enormous knowledge about the syntax and semantics of

both the source and target languages. Capturing all this knowledge in rule form is

a daunting task, if not an impossible one. On the other hand, SBMT techniques depend on

    how accurately various probabilities are measured. Realistic measurements of these

    probabilities can be made only if a large volume of parallel corpus is made available.

However, such a huge volume of data is not easily available. Consequently, this scheme is

viable only for a small number of language pairs.


    Example-based Machine Translation (Nagao, 1984), (Carl and Way, 2003) makes

    use of past translation examples to generate the translation of a given input. An

EBMT system stores in its example base translation examples between two lan-

    guages, the source language (SL) and the target language (TL). These examples are

    subsequently used as guidance for future translation tasks. In order to translate a

new input sentence in SL, a similar SL sentence1 is retrieved from the example base,

    along with its translation in TL. This example is then adapted suitably to generate a

    translation of the given input. It has been found that EBMT has several advantages

    in comparison with other MT paradigms (Sumita and Iida, 1991):

    1. It can be upgraded easily by adding more examples to the example base;

2. It utilizes translators' expertise, and adds a reliability factor to the translation;

    3. It can be accelerated easily by indexing and parallel computing;

    4. It is robust because of best-match reasoning.

    Even other researchers (e.g. (Somers, 1999), (Kit et. al., 2002)) have considered

EBMT to be a major and effective approach among the different MT paradigms,

    primarily because it exploits the linguistic knowledge stored in an aligned text in a

    more efficient way.

We infer from the above observations that for the development of MT systems

    from English to Indian languages, EBMT should be one of the preferred approaches.

    This is because a significant volume of parallel corpus is available between English

    and different Indian languages in the form of government notices, translation books,

1. Sometimes more than one sentence is also retrieved.


  • advertisement material etc. Although this data is generally not available in elec-

tronic form yet, converting it into machine-readable form is much easier than

    formulating explicit translation rules as required by an RBMT system. In fact some

    parallel data in electronic form has been made available through some projects (e.g.

    EMILLE :http://www.emille.lancs.ac.uk/home.html). Also, there has been some

concerted effort from various government organizations like TDIL2, CIIL Mysore3 and

C-DAC Noida4 (Vikas, 2001), and various institutes, e.g., IIT Bombay5, IIT Kan-

pur6 and LTRC (IIIT Hyderabad)7, to develop linguistic resources. At the same time

    this data is not large enough to design an English to Hindi SBMT, which typically

requires several hundred thousand sentences. These resources, we hope, will be

    fruitfully utilized for developing different EBMT systems involving Indian languages.

    Of the different Indian languages8 Hindi has some major advantages over the oth-

    ers as far as working on MT is concerned. Not only is Hindi the national language of

    India, it is also the most popular among all Indian languages. With respect to Indian

    languages, all the major works that have been reported so far (e.g. ANGLAHINDI

(Sinha et. al., 2002), SHIVA (http://shiva.iiit.net/), SHAKTI (Sangal, 2004), Ma-

Tra (Human aided MT)9) are primarily concerned with English and Hindi as their pre-

ferred languages. In 2003, Hindi was chosen as the "surprise language"

    (Oard, 2003) by DARPA. As a consequence, different universities (e.g. CMU, Johns

    Hopkins, USC-ISI) have invested efforts in developing MT systems involving Hindi.

2. http://tdil.mit.gov.in/
3. http://www.ciil.org/
4. http://www.cdacnoida.com/
5. http://www.cfilt.iitb.ac.in
6. http://www.cse.iitk.ac.in/users/isciig/
7. http://ltrc.iiit.net/
8. India has 17 official languages, and more than 1000 dialects (http://azaz.essortment.com/languagesindian rsbo.htm)
9. http://www.ncst.ernet.in/matra/about.shtml


    This world-wide popularity of the language makes the study of English to Hindi

machine translation more meaningful in today's context.

    One major advantage of having the above-mentioned English to Hindi translation

systems available on-line is that we could work with these systems and examine

    the quality of their outputs. In this respect, we find that the outputs given by the

    above systems are not always the correct translations of the inputs. The following

    Table 1.1 illustrates the above statement with respect to the systems AnglaHindi

    and Shakti. In this table we show the translations produced by the above two

    systems for different inputs, and also show the correct translations of these sentences.

Input Sentence          | AnglaHindi Output                | Shakti Output                        | Actual Translation
------------------------|----------------------------------|--------------------------------------|----------------------------------
Ram married Sita.       | raam ne siita vivahaa kiyaa      | raam ne siitaa vivaaha kiyaa         | raam ne siitaa se vivaaha kiyaa
Fan is on.              | pankhaa ho par                   | pankhaa lagaataar hai                | pankhaa chal rahaa hai
This dish tastes good.  | yaha vyanjan achchhaa hotaa hai  | yah thalii achcchaa swaad letii hai  | iss vyanjan kaa swaad achchhaa hai
The soup lacks salt.    | soop namak kam hotaa hai         | shorbaa namak kamii hai              | soop mein namak kam hai
It is raining.          | yah varshaa ho rahii hai         | yah varshaa ho rahii hai             | varshaa ho rahii hai
They have a big fight.  | unke paas eka badhii ladaae hai  | unke badhii ladaaiyaan hain          | unkii ghamasan ladaii huii

Table 1.1: Output of AnglaHindi and Shakti MT System


    We have found many such instances where the outputs produced by the systems

    may not be considered to be correct Hindi translations of the respective inputs. This

observation prompts us to study different aspects of English to Hindi translation in

order to understand the difficulties in machine translation, particularly with respect

to English to Hindi translation, and how these shortcomings can be dealt with

under an EBMT framework. This research is concerned with the above studies.

    1.1 Description of the Work Done and Summary

    of the Chapters

The success of an EBMT system rests on two different modules: (i) Similarity mea-

    surement and Retrieval. (ii) Adaptation. Retrieval is the procedure by which a

suitable translation example is retrieved from a system's example base. Adapta-

    tion is the procedure by which a retrieved translation is modified to generate the

    translation of the given input. Various retrieval strategies have been developed (e.g.

    (Nagao, 1984), (Sato, 1992), (Collins and Cunningham, 1996)). All these retrieval

    strategies aim at retrieving an example from the example base such that the retrieved

    example is similar to the input sentence. This is due to the fact that the fundamental

    intuition behind EBMT is that translations of similar sentences of the source lan-

guage will be similar in the target language as well. Thus the concept of retrieval is

    intricately related with the concept of similarity measurement between sentences.

    But the main difficulty with respect to this assumption is that there is no straight-

    forward way to measure similarity between sentences. In different works different

    approaches have been defined for measuring similarity between sentences. For exam-

ple, Word-based metrics (e.g. (Nirenburg, 1993), (Nagao, 1984)), Character-based


    metrics (e.g. (Sato, 1992)), Syntactic/Semantic based matching (e.g. (Manning and

Schutze, 1999)), DP-matching between word sequences (e.g. (Sumita, 2001)), Hybrid

    retrieval scheme (e.g. (Collins, 1998)).

    In all these works similarity measurement and adaptation are considered

    in isolation. This we feel is the major hindrance with respect to EBMT. In this

    work we therefore propose a novel approach for measuring similarity. We intend

    to look at similarity from the point of view of adaptation. We suggest that a past

    example will be considered as the most similar with respect to an input sentence, if

    its adaptation towards generating the desired translation is the simplest. The work

    carried out in this research is aimed at achieving this goal. Our studies therefore start

    in the following way. We first look at adaptation in detail. An efficient adaptation

    scheme is very important for an EBMT system because even a very large example

    base cannot, in general, guarantee an exact match for a given input sentence. As

    a consequence, the need for an efficient and systematic adaptation scheme arises

    for modifying a retrieved example, and thereby generating the required translation.

Various adaptation schemes have been proposed in the literature, e.g. (Veale and Way,

1997), (Shiri et. al., 1997), (Collins, 1998) and (McTait, 2001). A scrutiny of these

schemes suggests that there are primarily four basic adaptation operations, i.e. word

    addition, word deletion, word replacement and copy.

    In our approach we started with these basic operations: word addition, word

    deletion, word replacement and copy. However, in this respect we notice the follow-

    ing:

1. Both English and Hindi rely heavily on suffixes for morphological changes.

    There are a number of suffixes for achieving declension of verbs and nouns.

    Further, in Hindi there are situations when morphological changes in the ad-


jectives are also required, depending upon the number and gender of the corre-

    sponding noun/pronoun. Since the number of suffixes is limited, we feel that

if adaptation operations are focused on suffixes, instead of being purely

word-based, then in many situations a significant amount of computational

effort may be saved.

    2. A further observation with respect to Hindi is that there are situations when in-

    stead of suffixes whole words are used for bringing in morphological variations.

For example, the present continuous form of Hindi verbs is: <verb root> +

<rahaa / rahii / rahe> + <hai / hain / ho>. Here the words rahaa,

rahii or rahe are used to achieve the morphological variation. Which of

these is used depends upon the number and gender of the subject. Similarly,

hai, hain or ho is used depending upon the number and person of the subject. We term these words as

    morpho-words. Appendix A gives details of different Hindi morpho-words

    and their usages.
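
To make the morpho-word idea concrete, the following minimal sketch (our own illustration, not the thesis implementation; the function names and the simplified agreement rules are assumptions) shows how an adaptation module might pick the progressive morpho-word and the auxiliary from the number, gender and person of the subject:

    def progressive_morpho_word(number, gender):
        # Simplified agreement: rahaa (masc. sg.), rahe (masc. pl.), rahii (fem.).
        if gender == "fem":
            return "rahii"
        return "rahaa" if number == "sg" else "rahe"

    def auxiliary(number, person):
        # Simplified agreement: ho for second person, hai for singular, hain for plural.
        if person == 2:
            return "ho"
        return "hai" if number == "sg" else "hain"

    # e.g. a masculine singular third-person subject: "ladkaa khaa rahaa hai"
    print("ladkaa", "khaa", progressive_morpho_word("sg", "masc"), auxiliary("sg", 3))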

A major fallout of the above observation is that in some situations, adaptation

may be carried out by dealing with the morpho-words instead of whole words, which

is computationally much less expensive than dealing with constituent words as a

whole. Thus we propose an adaptation scheme consisting of ten operations: addition,

deletion and replacement of constituent words; addition, deletion and replacement

of morpho-words; addition, deletion and replacement of suffixes; and copy. Chapter

    2 of the thesis discusses these adaptation operations in detail.
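
The ten operations can be viewed as a small operation set; the sketch below is our own schematic (the two-letter codes other than WA and WR, which appear later in this chapter, are labels of our choosing), with an adaptation of a retrieved translation represented as a sequence of such operations:

    from enum import Enum

    class AdaptOp(Enum):
        WORD_ADD = "WA"      # add a constituent word
        WORD_DEL = "WD"      # delete a constituent word
        WORD_REPL = "WR"     # replace a constituent word
        MORPHO_ADD = "MA"    # add a morpho-word (e.g. rahaa, hai)
        MORPHO_DEL = "MD"    # delete a morpho-word
        MORPHO_REPL = "MR"   # replace a morpho-word
        SUFFIX_ADD = "SA"    # add a suffix (e.g. taa)
        SUFFIX_DEL = "SD"    # delete a suffix
        SUFFIX_REPL = "SR"   # replace a suffix
        COPY = "CP"          # copy a unit unchanged

    # Adapting the translation of "The boy is eating rice." to that of
    # "The boy eats rice everyday." (an example elaborated later in this section):
    plan = [
        (AdaptOp.WORD_ADD, "har roz"),    # add the adverbial for "everyday"
        (AdaptOp.MORPHO_DEL, "rahaa"),    # drop the progressive morpho-word
        (AdaptOp.SUFFIX_ADD, "taa"),      # add the suffix taa to the verb root khaa
    ]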

One point we notice, however, is that the

above-mentioned operations cannot deal with translation divergences in an efficient

    way. Divergence occurs when structurally similar sentences of the source language


do not translate into sentences that are similar in structure in the target language

(Dorr, 1993). We therefore felt that the study of divergence is an important aspect of any

MT system. With respect to an EBMT system the need arises for the following two

    reasons:

    The past example that is retrieved for carrying out the task of adaptation

    has a normal translation, but translation of the input sentence should involve

    divergence.

    The translation of the retrieved example involves divergence, whereas the input

    sentence should have a normal translation.

    In this work we made an in-depth study of divergence with respect to English to

    Hindi translation. In this regard one may note that divergence is a highly language-

    dependent phenomenon. Its nature may change along with the source and target

    languages under consideration. Although divergence has been studied extensively

    with respect to translation between European languages (e.g. (Dorr et. al., 2002),

(Watanabe et. al., 2000)), very few studies on divergence may be found regarding

translations in Indian languages. The only work that came to our notice is (Dave

    et. al., 2002). In this work the author has followed the classifications given in (Dorr,

    1993) and tried to find examples of each of them with respect to English to Hindi

    translation. In this regard it may be noted that Dorr has described seven differ-

ent divergence types: structural, categorial, conflational, promotional, demotional,

    thematic and lexical, with respect to translations between European languages.

However, we find that not all the divergence types explained in Dorr's work

apply with respect to Indian languages. In fact, we found very few (if not

    none) examples of thematic and promotional divergence with respect to English


    to Hindi translation. On the other hand we identified three new types of divergence

    that have not so far been cited in any other works on divergence. We named these

    divergences as nominal, pronominal and possessional, respectively. We have

    further observed that all the different divergence types (barring structural) for

    which we found instances in English to Hindi translation may be further divided into

    several sub-categories. Chapter 3 explains in detail different divergence types and

    their sub-types that we have observed with respect to English to Hindi translation,

    and illustrates them with suitable examples. Some of these results have already been

    presented in (Gupta and Chatterjee, 2003a) and (Gupta and Chatterjee, 2003b).

    Presence of divergence examples in the example base makes straightforward ap-

    plication of the above-mentioned adaptation scheme difficult. As mentioned earlier,

    application of the operations discussed in Chapter 2 will not be able to generate

    the correct translation if the input sentence requires normal translation, whereas

    the translation of retrieved example involves divergence, or vice versa. To overcome

    this difficulty we suggest that the example base may be partitioned into two parts:

    one containing examples of normal translation, the other containing the examples

    of divergence, so that given an input sentence an EBMT system may retrieve an

    example from the appropriate part of the example base. However, implementation

    of the above scheme requires design of algorithms for:

    1) Partitioning the example base sentences.

    2) Designing an efficient retrieval policy.

    We attempt to answer the first one by designing algorithms for identification of

    translation divergence, i.e. if an English sentence and its Hindi translation are given

    as input, these algorithms will detect whether this translation involves any of the said


    types of divergence. The remaining part of Chapter 3 discusses different algorithms

    that we developed for identification of divergence from a given English-Hindi pair

    of sentences. The identification algorithms designed by us consider the Functional

    tag (FT10) of the constituent words and the Syntactic Phrasal Annotated Chunk

    (SPAC11) of the SL and TL sentences. When these two do not match for a source

    language sentence and its translation in the TL, a divergence can be identified. With

respect to each divergence category and its sub-categories we have identified

the appropriate FTs and SPACs whose presence/absence indicates the possibility of

a certain divergence. By systematically analyzing the FTs and SPACs of the English

    sentence and its Hindi translation the algorithms arrive at a decision on whether

    this translation involves any divergence. Thus the algorithm partitions the example

base into two parts: Normal Example Base and Divergence Example Base. Some of

    these algorithms have already been presented in (Gupta and Chatterjee, 2003b).
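
As an illustration of how such identification algorithms could drive the partitioning, here is a minimal sketch (our own, with a hypothetical stub in place of the FT/SPAC tests that Chapter 3 actually specifies):

    def involves_divergence(english_sentence, hindi_translation):
        # Hypothetical stub: compare the Functional Tags (FTs) and SPACs of the
        # English sentence and its Hindi translation; a systematic mismatch
        # signals one of the divergence types described in Chapter 3.
        raise NotImplementedError

    def partition_example_base(pairs):
        # Split translation examples into a Normal Example Base and a
        # Divergence Example Base, as proposed above.
        normal, divergent = [], []
        for en, hi in pairs:
            (divergent if involves_divergence(en, hi) else normal).append((en, hi))
        return normal, divergent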

    To answer the second question, we feel that given an input sentence if it can be

    decided a priori whether its translation will involve divergence then the retrieval can

be made accordingly. To handle the situation when the translation of the input sentence

does not involve any divergence, we devise a cost of adaptation based two-level

filtration scheme that enables quick retrieval from the normal example base12. Chapter 4

describes our scheme of retrieval from the divergence example base in situations involving

    divergence. Here our primary attempt is to develop a procedure so that given an

    input English sentence it can decide whether its Hindi translation will involve any

    type of divergence. Obviously, this decision has to be made before resorting to

    the actual translation. Hence we call it prior identification of divergence. The

10. Appendix B provides details on the FTs.
11. SPAC structure is discussed in detail in Appendix C.
12. This scheme is discussed in Chapter 5.


    algorithm seeks evidence from the example base and the WordNet. In this work we

have used WordNet 2.0 (footnote 13) to measure the semantic similarity between the constituent words

of the input sentence and various words present in the example base sentences, to

    arrive at a decision in this regard. The scheme works in the following way. We first

    identified the roles of different Functional Tags (FT) towards causing divergence.

We observe with respect to different divergence types and sub-types that each FT

may have one of the three following roles:

    1) Its presence is mandatory for the corresponding divergence (sub-)type to occur;

2) Its absence is mandatory for the corresponding divergence (sub-)type to occur;

    3) Occurrence/non-occurrence of the divergence (sub-)type is not influenced by

    the FT under consideration.

    This knowledge is stored in the form of a table (Table 4.2) in Chapter 4. Given

    an input sentence the scheme first determines its constituent FTs. We have used

the ENGCG parser14 for parsing an input sentence and obtaining its FTs. This finding

    is then compared to the above-mentioned knowledge base (Table 4.2) to identify

    the set (D) of divergence types that may possibly occur in the translation of this

    sentence. Further investigation is carried out to discard elements from the set D, so

    that the divergence that may actually occur can be pin-pointed. In this respect we

    proceed in the following way. Corresponding to each divergence type we identify the

    functional tag that is at the root of causing the divergence. We call it the problem-

    atic FT corresponding to that particular divergence. Table 4.3 presents our finding

    in this regard. Corresponding to each possible divergence (as found in D) the scheme

13. http://www.cogsci.princeton.edu/cgi-bin/webwn
14. http://www.lingsoft.fi/cgi-bin/engcg


    works as follows. It first retrieves from the input sentence the constituent word cor-

    responding to the problematic FT of the divergence type under consideration. Then

the semantic similarity of this word with other words is measured. Proximity in this

    semantic distance is then used as a yardstick for similarity measurement. Chapter

    4 discusses this scheme in detail.
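
The candidate-filtering step described above might be sketched as follows (our own schematic; the divergence names are from this thesis, but the tag sets shown are placeholders for the actual entries of Table 4.2):

    # For each divergence (sub-)type: (FTs whose presence is mandatory,
    # FTs whose absence is mandatory).  FT_a, FT_b, ... are placeholders.
    DIVERGENCE_CONSTRAINTS = {
        "nominal":      ({"FT_a"}, {"FT_b"}),
        "pronominal":   ({"FT_c"}, set()),
        "possessional": ({"FT_d", "FT_e"}, {"FT_f"}),
    }

    def candidate_divergences(sentence_fts):
        # Return the set D of divergence types not ruled out by the FT roles;
        # D is then pruned further using the problematic FT and WordNet-based
        # semantic similarity, as described in the text.
        D = set()
        for dtype, (must_have, must_lack) in DIVERGENCE_CONSTRAINTS.items():
            if must_have <= sentence_fts and not (must_lack & sentence_fts):
                D.add(dtype)
        return D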

    Finally, in Chapter 5 we look at how cost of adaptation may be used as a similar-

    ity measurement scheme. It has been stated that no unique definition of similarity

    exists for comparing sentences. Similarity between sentences may be viewed from

different perspectives. In this work, we have first considered the two most general sim-

    ilarity schemes: syntactic similarity and semantic similarity. The ideas have

    been borrowed from the domain of Information Technology (Manning and Schutze,

    1999). According to the definition given therein semantic similarity is measured on

the basis of commonality of words. The greater the number of words common be-

tween two sentences, the more similar the two sentences under consideration are

said to be. However, it has been shown in (Chatterjee, 2001) that this

measurement of similarity is not always helpful from an EBMT point of view. For ex-

    ample, it has been shown there that although the sentences The horse had a good

    run. and The horse is good to run on. have most of the key words common, the

    structure of their Hindi translations are very different. Consequently, adaptation of

    the translation of one of them to generate the translation of the other is computa-

    tionally demanding. On the other hand, syntactic similarity between two sentences

    is measured on the basis of commonality of morpho-functional tags between them.

    In this case, adaptation may require a large number of constituent word replacement

    (WR) operations. Each of these WR operations involves reference to some dictio-

    nary for picking up the appropriate words in the target language. Typically the


    dictionary access will involve accessing an external storage, and thereby will incur

    significant computational cost. Thus a purely syntax-based similarity measurement

    scheme may not be suitable for an EBMT system.
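
For concreteness, the two baseline notions of similarity discussed above can be written as simple overlap measures (a minimal sketch under our own assumptions; the thesis's actual formulations appear in Chapter 5):

    import re

    def semantic_similarity(s1, s2):
        # Word-commonality (Dice-style) overlap: a crude stand-in for the
        # word-based semantic similarity discussed above.
        w1 = set(re.findall(r"[a-z]+", s1.lower()))
        w2 = set(re.findall(r"[a-z]+", s2.lower()))
        return 2 * len(w1 & w2) / (len(w1) + len(w2)) if (w1 or w2) else 0.0

    def syntactic_similarity(tags1, tags2):
        # Overlap of morpho-functional tag sets (e.g. as produced by the ENGCG
        # parser); again an illustrative stand-in, not the thesis's measure.
        t1, t2 = set(tags1), set(tags2)
        return 2 * len(t1 & t2) / (len(t1) + len(t2)) if (t1 or t2) else 0.0

    # The two "horse" sentences cited above share the words horse, good and run,
    # so they score well on word overlap although their Hindi structures differ:
    print(semantic_similarity("The horse had a good run.",
                              "The horse is good to run on."))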

    In this work we therefore propose that from EBMT perspective retrieval and

    adaptation should be looked at in a unified way. In this chapter (i.e. Chapter

    5) we investigate feasibility of the above proposal in depth. In this respect we first

    look into the overall adaptation operations deeply. We have already observed that

    these operations are invoked successively to remove the discrepancies between the

    input sentence and the retrieved example. These discrepancies, as we observe, may

    be in the actual words, or in the overall structure of the sentences. For illustration,

    suppose the input sentence is The boy eats rice everyday., whose Hindi translation

ladkaa har roz chaawal khaataa hai has to be generated. The nature of the adap-

    tation varies depending upon which example is retrieved from the example base. For

    illustration:

    a) If the retrieved example is The boy eats rice, the adaptation procedure needs

    to apply a constituent word addition operation (WA) to take care of the adverb

    everyday.

    b) However, if the retrieved sentence is The boy plays cricket everyday. ladkaa

    roz cricket kheltaa hai, then the adaptation procedure needs to invoke two

constituent word replacement (WR) operations: to replace the Hindi of play,

    i.e. khel with the Hindi of eat, i.e. khaa, and cricket (cricket) with

    chaawal (rice).

    c) In case the retrieved example is The boy is eating rice., one adaptation op-

eration, constituent word addition (WA), is required for the adverb


everyday. Further, to take care of the verb conjugation, some morpho-word and

    suffix operations need to be carried out. This is because the Hindi transla-

    tion of The boy is eating rice is : ladkaa (boy) chaawal (rice) khaa (eat)

    rahaa (..ing) hai (is). But the translation of the input sentence The boy

    eats rice everyday should be ladkaa har roz chaawal khaataa hai. Thus the

    morpho-word rahaa, which is required for the present continuous tense of

    the retrieved sentence needs to be deleted. Further the suffix taa is to be

    added to the root main verb to get the required present indefinite verb form

    of the input.

d) However, if the retrieved example is Does the boy eat rice?, then the adaptation

    procedure needs to take care of the structural variation between the inter-

    rogative form of the retrieved example, and the affirmative form of the input

    sentence.

Obviously, the greater the discrepancy between the retrieved example and

the input sentence, the greater the number of adaptation operations required for

    generating the desired translation. The above illustrations make certain points evi-

    dent:

    a) Adaptation operations are required for performing two general tasks: dealing

    with constituent words (along with their suffixes, morpho-words), and dealing

    with the overall structure of the sentence.

    b) Each invocation of adaptation operation pertains to a particular part of speech,

    such as, noun, verb, adverb etc.

    c) Of the ten adaptation operations (described earlier with respect to Chapter


    2) only the WA and WR operations require dictionary15 searches. Since dic-

tionary search typically involves accessing an external device (e.g. hard disk),

    a dictionary search is computationally more expensive than other operations

    (e.g. constituent word deletion, morpho-word operations) which are purely

    RAM16-based and hence computationally cheaper.

    The above observations help us to proceed towards achieving the intended goal

of using cost of adaptation as a measurement of similarity. As a first step, we

suggest dividing the dictionary into several parts

based on the part-of-speech (POS) of the words. Division of the dictionary

according to POS reduces the search time for each invocation of

the dictionary-based operations. The cost of

    adaptation based similarity measurement approach then proceeds along the following

    line:

    a) We first estimate the average cost for each of the ten adaptation operations.

    We observe that these costs depend on two major types of parameters. On

    the one hand, they depend on certain linguistic aspects, such as the average length

    of the sentences in both source and target languages, the number of suffixes

    (used with different POS), the number of morpho-words etc. On the other

    hand, these costs are related to the machine on which the EBMT system is

    working. Since we aim at analyzing the costs in a general way, we assumed

    these machine-dependent costs to be variables in all our analysis. For the lin-

    guistic parameters, we used values that we have obtained by analyzing about

    15By dictionary we mean a source language to target language word dictionary available on-line.

    16Random Access Memory

    30,000 examples of English to Hindi translations. These examples were collected
    from various sources, namely translation books, advertisement materials, children's
    story books and government notices, all freely available in non-electronic form.

    b) At the second step, we estimated the costs incurred in adapting various func-

    tional tags17. In particular, we have considered the cost of adaptation due to
    variations in active and passive verb morphology, subject/object, pre-modifying
    adjectives, genitive case and wh-family words. These costs are stored in various
    tables in Section 5.4.

    c) At the third step we have considered costs of adaptation due to differences in

    sentence structure. Here, we have considered four different sentence structures:

    affirmative, negative, interrogative and negative-interrogative. These adaptation

    costs too are stored in tabular form. Section 5.4 gives details of this analysis.
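
    The following is a minimal sketch, in Python, of the two ideas just outlined: a dictionary
    partitioned by POS so that each WA/WR lookup searches only one sub-dictionary, and the cost
    of adapting a candidate example computed as the sum of per-operation average costs. All
    dictionary entries, cost values and operation sequences below are illustrative placeholders;
    the actual cost estimates are developed in Chapter 5.

        # Illustrative sketch only: a POS-partitioned dictionary and a cost-of-adaptation
        # similarity measure.  Entries, costs and operation sequences are made-up
        # placeholders, not the values estimated in the thesis.

        POS_DICTIONARY = {                      # POS -> {English word: Hindi word}
            "noun":   {"rice": "chaawal", "boy": "ladkaa"},
            "verb":   {"eat": "khaa", "play": "khel"},
            "adverb": {"everyday": "har roz"},
        }

        # Hypothetical average costs (arbitrary units): WA/WR involve a dictionary
        # (disk) search; the remaining operations are RAM-based and cheaper.
        OPERATION_COST = {"CP": 1, "WD": 2, "MD": 2, "MA": 2, "MR": 2,
                          "SA": 3, "SD": 3, "SR": 3, "WA": 10, "WR": 10}

        def lookup(word, pos):
            """Search only the sub-dictionary of the given POS (smaller search space)."""
            return POS_DICTIONARY.get(pos, {}).get(word)

        def adaptation_cost(operations):
            """Estimated cost of adapting one retrieved example."""
            return sum(OPERATION_COST[op] for op in operations)

        # Candidate examples for the input "The boy eats rice everyday"
        # (cf. illustrations a-c above), each with the operations it would need.
        candidates = [
            ("The boy eats rice",              ["CP", "CP", "CP", "WA"]),
            ("The boy plays cricket everyday", ["CP", "CP", "WR", "WR"]),
            ("The boy is eating rice",         ["CP", "CP", "CP", "WA", "MD", "SA"]),
        ]
        best = min(candidates, key=lambda c: adaptation_cost(c[1]))
        print(lookup("everyday", "adverb"))        # har roz
        print(best[0], adaptation_cost(best[1]))   # The boy eats rice 13

    Under these assumed costs the first candidate is the cheapest to adapt, which matches the
    intuition behind illustration (a) above.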

    Once these basic costs are modelled, we are in a position to experiment with cost
    of adaptation as a similarity measure vis-à-vis the semantics- and syntax-based
    similarity measurement schemes discussed above. Our experiments have clearly established
    the efficiency of the proposed scheme over the others. Part of this work is also presented
    in Gupta and Chatterjee (2003c). Two apparent drawbacks of this scheme are:

    1) It may end up comparing a given input with all the example-base sentences

    to ascertain the least cost of adaptation.

    2) Another major question that may arise is whether the cost of adaptation
    scheme is efficient enough to handle sentences that are structurally more com-

    17In fact we worked on Functional Slots, which are more general than Functional Tags. This is discussed in detail in Section 2.2.

    plicated, e.g. complex or compound sentences. It is a generally accepted fact

    that complex sentences are difficult to handle in an MT system (Dorr et al.,
    1998), (Hutchins, 2003), (Sumita, 2001), (Shimohata et al., 2003).

    In order to deal with the first difficulty we have proposed a two-level filtration scheme.

    This scheme helps in selecting a smaller number of examples from the example base,

    which may subsequently be subjected to rigorous treatment for determining their

    costs of adaptation with respect to the given input. We have also justified that this

    scheme does not leave out the sentences whose translations are easier to adapt for

    the given input.

    In this work we have given a solution for the second problem too. We have

    given rules for splitting a complex sentence into more than one simple sentence.

    Translations of these simple sentences may then be generated by the EBMT system.

    These individual translations may then be combined to obtain the translation of the

    given complex sentence input. If the cost of adaptation based similarity measurement

    scheme is applied for translating the simple sentences, then the cost of adaptation

    of the complex sentence too can be estimated, by adding the individual costs with

    the cost of combining the individual translations. Since the last operation is purely

    algorithmic, its computational complexity can be easily computed, and hence the
    overall cost of adaptation can be estimated. With respect to dealing with complex

    sentences, we have, however, imposed certain restrictions. We considered sentences with

    only one subordinate clause. Further, the presence of a connecting word is also

    mandatory. Evidently, more complicated complex sentence structures are available,

    and further investigations are required for developing techniques for handling them

    in an EBMT framework.
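
    A minimal sketch of this split-translate-recombine idea, under the stated restrictions
    (exactly one subordinate clause, introduced by an explicit connecting word), is given below.
    The connective list, the toy translation lookup and the recombination step are assumptions
    made purely for illustration; the actual splitting rules are developed later in this thesis.

        # Illustrative sketch: split a complex sentence at its connecting word, translate
        # the two simple sentences independently, then recombine the translations.
        CONNECTIVES = {"because": "kyonki", "when": "jab", "if": "agar"}   # illustrative subset

        def split_complex(sentence):
            """Return (main clause, connective, subordinate clause); connective may be None."""
            words = sentence.rstrip(".").split()
            for i, w in enumerate(words):
                if w.lower() in CONNECTIVES and i > 0:
                    return " ".join(words[:i]), w.lower(), " ".join(words[i + 1:])
            return sentence.rstrip("."), None, None

        def translate_simple(sentence):
            """Stand-in for the EBMT engine translating a simple sentence."""
            toy = {"the boy did not go to school": "ladkaa school nahin gayaa",
                   "he was ill": "wah biimaar thaa"}
            return toy.get(sentence.lower(), "<EBMT translation of: %s>" % sentence)

        def translate_complex(sentence):
            main, conn, sub = split_complex(sentence)
            if conn is None:                      # already a simple sentence
                return translate_simple(main)
            # The recombination below is purely algorithmic, so its cost can simply be
            # added to the two individual adaptation costs.
            return "%s %s %s" % (translate_simple(main), CONNECTIVES[conn], translate_simple(sub))

        print(translate_complex("The boy did not go to school because he was ill."))
        # ladkaa school nahin gayaa kyonki wah biimaar thaa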

    In this connection we would like to mention that we have explained the cost of adap-

    tation with respect to a selected set of sentence structures, and for a selected set of

    Functional slots. Definitely many more variations are available with respect to these

    parameters. Consequently, more work has to be done to form rules for handling

    these variations. However, we feel that the work described in this research provides
    a suitable guideline for its further continuation.

    1.2 Some Critical Points

    1) The aim of this research is not to construct an English to Hindi EBMT system.

    Rather our intention is to analyze the requirements that help in building an

    effective EBMT system. The motivation behind this research came from two

    major observations:

    Although some MT systems for translation from English to Hindi already
    exist, the quality of their translation is often not up to the mark. This
    prompted us to look into the process of MT to ascertain the inherent

    difficulties.

    We have chosen EBMT as our preferred paradigm because of certain
    advantages it has over other MT paradigms, such as RBMT and SBMT. One major
    advantage of EBMT is that it requires neither a huge parallel corpus, as
    required by SBMT, nor the framing of a large rule base, as required by
    RBMT. Study of EBMT was therefore feasible for us, as we did not have
    access to such linguistic resources.

    2) In order to design our scheme we have studied about 30,000 English to Hindi

    translation examples available off-line. Although now large volumes of English

  • English sentence: The horses have been running for one hour.
    Tagged form: @DN> ART the, @SUBJ N PL horse %ghodaa%, @+FAUXV V PRES have, @-FAUXV V PCP2 be, @-FMAINV V PCP1 run %daudaa%, @ADVL PREP for, @QN> NUM CARD one %ek%, @

    this research will be helpful for developing MT systems not only for Hindi but also for
    other Indian languages (e.g. Bangla, Gujrati, Panjabi). All these languages suffer
    from the same drawback: unavailability of linguistic resources. However, the demand
    for developing MT systems from English to these languages is increasing with time,
    not only because these are prominent regional languages of India, but also because they
    are important minority languages in other countries such as the U.K. (Somers, 1997).
    The studies made in this research should pave the way for developing EBMT systems
    involving these languages as well.

  • Chapter 2

    Adaptation in English to Hindi

    Translation: A Systematic

    Approach

    2.1 Introduction

    The need for an efficient and systematic adaptation scheme arises for modifying a

    retrieved example, and thereby generating the required translation. This chapter is

    devoted to the study of a systematic adaptation approach. Various approaches have
    been pursued in dealing with the adaptation aspect of an EBMT system. Some of the

    major approaches are described below.

    1. Adaptation in Gaijin (Veale and Way, 1997) is modelled via two categories:

    high-level grafting and keyhole surgery. High-level grafting deals with phrases.

    Here an entire phrasal segment of the target sentence is replaced with another

    phrasal segment from a different example. On the other hand, keyhole surgery

    deals with individual words in an existing target segment of an example. Under

    this operation words are replaced or morphologically fine-tuned to suit the

    current translation task. For instance, suppose the input sentence is The girl

    is playing in the park., and in the example base we have the following examples:

    (a) The boy is playing.

    (b) Rita knows that girl.

    (c) It is a big park.

    (d) Ram studies in the school.

    For the high-level grafting, the sentences (a) and (d) will be used. Then keyhole

    surgery will be applied for putting in the translations of the words park and

    girl. These translations will be extracted from (b) and (c).

    2. Shiri et al. (1997) have proposed another adaptation procedure. It is based on

    three steps: finding the difference, replacing the difference, and smoothing the

    output. The differing segments of the input sentence and the source template

    are identified. Translations of these different segments in the input sentence

    are produced by rule-based methods, and these translated segments are fitted

    into a translation template. The resulting sentence is then smoothed over by

    checking for person and number agreement, and inflection mismatches. For

    example, assume the input sentence and selected template are:

    SI A very efficient lady doctor is busy.

    St A lady doctor is busy.

    Tt mahilaa chikitsak vyasta hai

    The parsing process, however, shows that The very efficient lady doctor is a

    noun phrase, and so matches it with The lady doctor - ek mahilaa chikit-

    sak. The very efficient lady doctor is translated as ek bahut yogya mahilaa

    chikitsak, by the rule-based noun phrase translation system. This is inserted

    into Tt giving the following: Tt: ek bahut yogya mahilaa chikitsak vyasta hai.

    3. The ReVerb system (Collins, 1998) proposed the following adaptation scheme. Here

    two different cases are considered: full-case adaptation and partial-case adap-

    tation. Full-case adaptation is employed when a problem is fully covered by the

    retrieved example. Here the desired translation is created by substitution alone.

    No addition or deletion is required for adapting TL for generating the trans-

    lation of SL. Here TL and SL denote example base target language sentence

    and input source language sentence, respectively. In this case five scenarios

    are possible: SAME, ADAPT, IGNORE, ADAPTZERO and IGNOREZERO.

    Partial-case adaptation is used when a single unifying example does not exist.

    Here three more operations are required on top of the above five. These

    three operations are ADD, DELETE and DELETZERO.

    Figure 2.1: The five possible scenarios in the SL-SL-TL interface of partial-case matching

    Note that there is a subtle difference between ADAPT and ADAPTZERO.
    For ADAPT as well as for ADAPTZERO, the input SL sentence and the example SL
    sentence have the same links but different chunks. If TL has words corresponding
    to the chunk that differs between the two SL sentences, then the words in TL
    should be modified; this is the case of ADAPT. On the other hand, if no corresponding
    chunk is present in TL, then it is the case of ADAPTZERO, and therefore no work is
    needed for adaptation. Similar subtleties may be observed between DELETE
    and DELETZERO, and also between IGNORE and IGNOREZERO. The other
    operations (such as SAME and ADD) have obvious interpretations. Figure 2.1
    provides a conceptual view of partial-case matching.

    4. Somers (2001) proposes adaptation from a case-based reasoning (CBR) point of view.

    The simplest of the CBR adaptation methods is null adaptation where no

    changes are recommended. In a more general situation various substitution

    methods (e.g. reinstantiation, parameter adjustment) or transformation methods

    (e.g. commonsense transformation and model-guided repair) may be applied.

    For example, suppose the input sentence (I) and the retrieved example (R)

    are:

    I That old woman has died.

    R That old man has died. wah boodhaa aadmii mar gayaa

    To generate the desired translation, the translation of the word man, i.e.
    aadmii, is first replaced with the translation of woman, i.e. aurat, in R. This
    operation is called reinstantiation. At this stage an intermediate translation wah boodhaa aurat

    mar gayaa is obtained. To obtain the final translation wah boodhii aurat

    mar gayii, the system must also change the adjective boodhaa to boodhii

    and the word gayaa to gayii. This is called parameter adjustment. (These two
    steps, together with the base-pattern recombination of item 5 below, are sketched
    in code after this list.)

    5. The adaptation scheme proposed by McTait (2001) works in the following way.

    Translation patterns that share lexical items with the input and partially cover

    it are retrieved in a pattern matching procedure. From these, the patterns

    whose SL side covers the SL input to the greatest extent (longest cover) are

    selected. They are termed base patterns, as they provide sentential context in

    the translation process. It is intuitive that the greater the extent of the cover
    provided by the base patterns, the more the context, and the lesser the
    ambiguity and complexity in the translation process. If the SL side of the base

    pattern does not fully cover the SL input, any unmatched segments are bound

    to the variable on the SL side of the base pattern. The translations of the SL

    segments bound to the SL variables of the base pattern are retrieved from the

    remaining set of translation patterns, as the text fragments and variables on

    the TL side of the base pattern form translation strings.

    The following is a simple example: given the source language input I: AIDS

    control programme for Ethiopia, suppose the longest covering base pattern is:

    D1: AIDS control programme for (....) ke liye AIDS contral smahaaroo (...).

    To complete the match between I and the source language side of D1, a trans-

    lation pattern containing the text fragment Ethiopia is required, i.e.

    D2: (...) Ethiopia (...) Ethiopia (...).

    The TL translation T: ethiopia ke liye AIDS contral smahaaroo is generated
    by recombining the text fragments. Ethiopia and ethiopia are aligned in D2,
    as are the variables in the base pattern D1. Since they are aligned on a 1:1
    basis, the TL text fragment ethiopia is bound to the variable on the TL side
    of D1 to produce T.

    6. In HEBMT (Jain, 1995) examples are stored in an abstracted form for deter-

    mining the structural similarity between the input sentence and the example

    sentences. The target language sentence is generated using the target pat-

    tern of the example sentence that has the least distance from the input sentence. The

    system substitutes the corresponding translations of syntactic units identified

    by a finite state machine in the target pattern. Variation in tense of verb,

    and variations due to number, gender etc. are taken care of at this stage for

    generating the appropriate translation. This system translates from Hindi to

    English; therefore, we explain its adaptation process with an example of Hindi

    to English translation.

    For example, suppose the input sentence is merii somavara ko jaa rahii hai

    and it matches with the example sentence R: meraa dosta itavaar ko aayegaa.

    Steps (a) to (f) below show the process of translation.

    (a) merii somavara ko jaa rahii hai (input sentence)

    (b) <snp>1 <npk2>2 <mv>3 (syntactic grouping)

    (c) [Mary] [Monday] [go] (English translation of syntactic groups)

    (d) <snp> <mv> {on} <npk2> (target pattern of example R)

    (e) [Mary] [is going] on [Monday] (Translation after substitution)

    (f) Mary is going on Monday (Final translated output)
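
    Two of the approaches above can be made concrete with a very small sketch: the
    reinstantiation and parameter-adjustment steps of the CBR example in item 4, and the
    base-pattern recombination of item 5. The gender-agreement pairs and the single-variable
    "(...)" pattern representation used below are simplifying assumptions made only for
    illustration, not the representations used by those systems.

        # (i) Somers-style CBR adaptation (item 4): reinstantiation, then parameter
        #     adjustment.  The agreement pairs are illustrative, not a real morphology.
        def reinstantiate(tokens, old, new):
            return [new if t == old else t for t in tokens]

        def parameter_adjust(tokens, feminine_forms):
            return [feminine_forms.get(t, t) for t in tokens]

        retrieved = "wah boodhaa aadmii mar gayaa".split()
        step1 = reinstantiate(retrieved, "aadmii", "aurat")                        # wah boodhaa aurat mar gayaa
        step2 = parameter_adjust(step1, {"boodhaa": "boodhii", "gayaa": "gayii"})
        print(" ".join(step2))                                                     # wah boodhii aurat mar gayii

        # (ii) McTait-style base-pattern recombination (item 5), with one "(...)"
        #      variable on each side of the pattern (an assumed representation).
        def apply_base_pattern(sl_input, sl_pattern, tl_pattern, fragment_translations):
            prefix, suffix = sl_pattern.split("(...)")
            bound = sl_input[len(prefix): len(sl_input) - len(suffix)].strip()     # unmatched SL segment
            return tl_pattern.replace("(...)", fragment_translations[bound])

        print(apply_base_pattern("AIDS control programme for Ethiopia",
                                 "AIDS control programme for (...)",
                                 "(...) ke liye AIDS contral smahaaroo",
                                 {"Ethiopia": "ethiopia"}))
        # ethiopia ke liye AIDS contral smahaaroo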

    Many other EBMT systems are found in the literature, e.g. GEBMT (Brown, 1996,
    1999, 2000, 2001), EDGAR (Carl and Hansen, 1999) and TTL (Guvenir and Cicekli,
    1998). But overall, in our view, the adaptation procedures employed in different

    EBMT systems primarily consist of four operations:

    Copy, where the same chunk of the retrieved translation example is used in

    the generated translation;

    Add, where a new chunk is added to the retrieved translation example;

    Delete, where some chunk of the retrieved example is deleted; and

    Replace, where some chunk of the retrieved example is replaced with a new

    one to meet the requirements of the current input.

    The operations prescribed in different systems vary in the chunks they deal with.

    Depending upon the case it may be a phrase, a word or a sub-word (e.g. declensional

    suffix).

    1 snp: noun, adj+noun, noun + kaa + noun    2 npk2: noun + ko    3 mv: verb-part

    With respect to English and Hindi, we find that both the languages depend

    heavily on suffixes for verb morphology, changing numbers from singular to plu-

    ral and vice versa, case endings, etc. Appendix A provides detailed descriptions

    of various Hindi suffixes. Keeping the above in view, we differentiated the adap-
    tation operations into two groups: word based and suffix based. The word based
    operations are further subdivided into two categories: constituent word based and
    morpho-word based. Thus the adaptation scheme proposed here consists of ten op-
    erations: Copy (CP), Constituent word deletion (WD), Constituent word addition
    (WA), Constituent word replacement (WR), Morpho-word deletion (MD), Morpho-
    word addition (MA), Morpho-word replacement (MR), Suffix addition (SA), Suffix
    deletion (SD) and Suffix replacement (SR). Section 2.2 illustrates the roles of
    these operations in adapting a retrieved translation example.

    The advantage of the above classification of adaptation operations is twofold.

    Firstly, it helps in identifying the specific task that has to be carried out in the step-

    by-step adaptation for a given input. Secondly, it helps in measuring the average

    cost of each of the above operations in a meaningful way, which in turn helps in

    estimating the total adaptation cost for a given sentence. This estimate can be used

    as a tool for similarity measurement between an input and the stored examples.

    These issues are discussed in Chapter 5.

    2.2 Description of the Adaptation Operations

    The ten adaptation operations mentioned above are described below.

    1. Constituent Word Replacement (WR): One may get the translation of the

    input sentence by replacing some words in the retrieved translation example.

    Suppose the input sentence is: The squirrel was eating groundnuts., and the

    most similar example retrieved by the system (along with its Hindi translation)

    is: The elephant was eating fruits. haathii phal khaa rahaa thaa. The

    desired translation may be generated by replacing haathii with the Hindi of

    squirrel, i.e. gilharii and replacing phal with the Hindi of groundnuts,

    i.e. moongphalii. These are examples of the operation of constituent word

    replacement.

    2. Constituent Word Deletion (WD): In some cases one may have to delete some

    words from the translation example to generate the required translation. For

    example, suppose the input sentence is: Animals were dying of thirst. If the

    retrieved translation example is: Birds and Animals were dying of thirst.

    pakshii aur pashu pyaas se mar rahe the, then the desired translation can

    be obtained by deleting pakshii aur (i.e. the Hindi of birds and) from the

    retrieved translation. Thus the adaptation here requires two constituent word

    deletions.

    3. Constituent Word Addition (WA): This operation is the opposite of constituent

    word deletion. Here the addition of some words to the retrieved translation
    example is required for generating the translation. For illustration, one

    may consider the example given above with the roles of input and retrieved

    sentences being reversed.

    4. Morpho-word Replacement (MR): In this case one morpho-word is replaced by

    another morpho-word in the retrieved translation example. Consider a case

    when the input sentence is: The squirrel was eating groundnuts., and the

    retrieved example is: The squirrel is eating groundnuts. gilharii moongfalii

    khaa rahii hai. In order to take care of the variation in tense the morpho-

    word hai is to be replaced with thaa. This is an example of Morpho-word

    replacement.

    5. Morpho-word Deletion (MD): Here some morpho-word(s) are deleted from the

    retrieved translation example. For illustration, if the input sentence is He

    eats rice., and the retrieved example is: He is eating rice. wah chaawal

    khaa rahaa hai, then to obtain the desired translation4 first the morpho-word

    rahaa is to be deleted from the retrieved translation example.

    6. Morpho-word Addition (MA): This is the opposite case of morpho-word dele-

    tion. Here some morpho-words need to be added in the retrieved example in

    order to generate the required translation.

    7. Suffix Replacement (SR): Here the suffix attached to some constituent word

    of the retrieved sentence is replaced with a different suffix to meet the current

    translation requirements. This may happen with respect to a noun, an adjective,
    a verb, or a case ending. For illustration (the suffix operations are also sketched
    in code after this list):

    (a) To change the number of nouns

    Boy (ladkaa) → Boys (ladke)

    The suffix aa is replaced with e in order to get its plural form in

    Hindi.

    (b) Change of Adjectives

    Bad boy (buraa ladkaa) → Bad girl (burii ladkii)

    The suffix aa is replaced with ii to get the adjective burii.

    4Of course the final translation will be obtained by adding the suffix taa to the word khaa.

    (c) Morphological changes in verb

    He reads. (wah padtaa hai) → She reads. (wah padtii hai)

    The suffix taa is replaced with tii to get the verb padtii, which is

    required to indicate that the subject is feminine.

    (d) Morphological changes due to case ending

    boy (ladkaa) → from boy (ladke se)

    room (kamraa) → in room (kamre mein)

    The suffix aa is replaced with e to get the nouns ladke and kamre.

    8. Suffix Deletion (SD): By this operation the suffix attached to some constituent

    word may be removed, and thereby the root word may be obtained. This

    operation is illustrated in the following examples:

    (a) To change the number of nouns

    women (aauraten) → woman (aaurat),

    The suffix en is deleted from aauraten to get the Hindi translation

    of woman.

    (b) Morphological changes in verb

    He reads. (wah padtaa hai) → He is reading. (wah pad rahaa hai)

    The suffix taa is deleted from padtaa to get the root form pad of

    the English verb read.

    (c) Morphological changes due to case ending

    in the houses (gharon mein) → houses (ghar)

    in words (shabdon mein) → words (shabd)

    The suffix on is deleted from gharon and shabdon to get the Hindi

    translation of nouns houses and words, respectively.

    9. Suffix Addition (SA): Here a suffix is added to some constituent word in the

    retrieved example. Note that here the word concerned is in its root form in

    the retrieved example. One may consider the examples given above with the

    roles of input and retrieved sentences reversed as suitable examples for suffix

    addition operation.

    10. Copy (CP): When some word (with or without suffix) of the retrieved example

    is retained in toto in the required translation, then it is called a copy operation.
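
    As mentioned in item 7, the suffix-level operations (SR, SD, SA) can be sketched at the
    string level, reusing the Hindi forms from the examples above. The hard-coded suffix
    strings below are taken from those examples only for illustration; a real implementation
    would consult the suffix tables of Appendix A.

        # String-level sketch of the suffix operations, using the forms from items 7-9.
        def suffix_replace(word, old, new):        # SR
            return word[: -len(old)] + new if word.endswith(old) else word

        def suffix_delete(word, sfx):              # SD
            return word[: -len(sfx)] if word.endswith(sfx) else word

        def suffix_add(word, sfx):                 # SA (reverse of SD)
            return word + sfx

        print(suffix_replace("ladkaa", "aa", "e"))     # ladke   (singular -> plural noun)
        print(suffix_replace("padtaa", "taa", "tii"))  # padtii  (masculine -> feminine verb form)
        print(suffix_delete("gharon", "on"))           # ghar    (case-ending suffix removed)
        print(suffix_add("khaa", "taa"))               # khaataa (root verb -> present indefinite form)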

    Figure 2.2 provides an example of adaptation using the above operations. In this

    example the input sentence is He plays football daily., and the retrieved translation

    example is:

    They are playing football. we football khel rahe hain

    (They) (football) (play) (...ing) (are)

    The translation to be generated is: wah roz football kheltaa hai. When adaptation is
    carried out using both word and suffix operations, the adaptation steps look as

    given in Figure 2.2. In this respect one may note that Hindi is a free word order language,

    and consequently the position of the adverb is not fixed. Hence the above input sentence

    may have different Hindi translations:

    wah roz football kheltaa hai

    wah football roz kheltaa hai

    roz wah football kheltaa hai

    While implementing an EBMT system one has to stick to some specific format.

    The adverb will be added according to the format adopted by the system.

    Input we football khel rahe hain

    Operations WR WA CP SA MD MR

    Output wah roz football kheltaa hai

    Figure 2.2: Example of Different Adaptation Operations
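
    The adaptation of Figure 2.2 can be replayed as an explicit operation sequence. In the
    sketch below the argument of each operation is written out by hand for this one example;
    in the system they are derived by comparing the retrieved example with the input.

        # Replaying Figure 2.2: adapt "we football khel rahe hain" into
        # "wah roz football kheltaa hai" using the operations of Section 2.2.
        def WR(tokens, old, new):  return [new if t == old else t for t in tokens]       # word replacement
        def WA(tokens, pos, word): return tokens[:pos] + [word] + tokens[pos:]           # word addition
        def SA(tokens, word, sfx): return [t + sfx if t == word else t for t in tokens]  # suffix addition
        def MD(tokens, morpho):    return [t for t in tokens if t != morpho]             # morpho-word deletion
        def MR(tokens, old, new):  return [new if t == old else t for t in tokens]       # morpho-word replacement

        out = ["we", "football", "khel", "rahe", "hain"]   # retrieved translation
        out = WR(out, "we", "wah")      # they -> he
        out = WA(out, 1, "roz")         # add the adverb (position follows the chosen format)
        # "football" is simply retained: the copy (CP) operation
        out = SA(out, "khel", "taa")    # khel -> kheltaa (present indefinite)
        out = MD(out, "rahe")           # drop the continuous-tense morpho-word
        out = MR(out, "hain", "hai")    # plural auxiliary -> singular
        print(" ".join(out))            # wah roz football kheltaa hai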

    Which adaptation operations will be required to translate a given input sentence

    depends upon the translation example retrieved from the example base. A variety

    of examples may be adapted to generate the desired translation, but obviously with

    varying computational costs. For efficient performance, an EBMT system, therefore,

    needs to retrieve an example that can be adapted to the desired translation with

    least cost. This brings in the notion of similarity among sentences. The proposed

    adaptation procedure has the advantage that it provides a systematic way of evalu-

    ating the overall adaptation cost. This estimated cost may then be used as a good

    measure of similarity for appropriate retrieval from the example base. How cost of

    adaptation may be used as a yardstick to measure similarity between sentences will

    be described in Chapter 5.

    Here our aim is to count the number of adaptation operations required in adapt-

    ing a retrieved example to generate the translation of a given input. Obviously, de-

    pending upon the situation one has to apply some adaptation operations for changing

    different functional slots5 (Singh, 2003), such as the subject, object and verb
    slots. Also, certain operations are required for changing the kind of sentence, e.g.

    5The following example illustrates the difference between functional slots and functional tags. Consider the sentence The old man is weak. The subject of this sentence is the noun phrase The old man. It consists of three functional tags, viz. @DN>, @AN> and @SUBJ, stating that the is a determiner, old is an adjective, and man is the subject. But, as mentioned above, the entire noun phrase plays the role of subject of the sentence. Thus the functional slot for this phrase is the subject slot. Note that a particular functional slot may have a variable number of words. The sequence of functional slots in a sentence provides the sentence pattern. The difference between various tags (e.g. POS tag, functional tag) is explained in detail in Appendix B.

    affirmative to negative, negative to interrogative etc. Table 2.2 contains the
    notations for the roles of different functional slots and operators, which are
    required for the subsequent discussion.

    Operators and their roles:

    < >    Encloses a functional slot or a part of speech and its transformation.
    &      Both functional slots or parts of speech (and their transformations) should be present.
    or     Either the first slot/tag, the second slot/tag, or both.
    { }    A non-obligatory functional tag/slot, or an optional adaptation operation.
    [ ]    The property of a functional slot/tag.

    Functional slots and their roles:

    Linking verbs; in English: are, am, was, were, become, seem etc., and in Hindi: hai, hain, ho, thaa, the etc.
    Auxiliary verb (if any) and main verb of the sentence
    Auxiliary verb
    Main verb
    Subject
    Object
    First object
    Second object
    Subjective complement
    -ing verb form other than the main verb
    -ed or -en verb forms other than the main verb
    to-infinitive form of the verb
    Adverb
    Adjective phrase
    Preposition phrase
    Preposition

    Table 2.2: Notations Used in Sentence Patterns

    The following sections describe how many such operations are required in dif-

    ferent cases. In particular, we consider the following functional slots and sentence

    kinds:

    1. Tense and Form of the Verb. Since there are three tenses (viz. Present,

    Past and Future) and four forms (Indefinite, Continuous, Perfect, and Perfect

    Continuous), in all one can have 12 different verb structures, along with the
    corresponding passive-form verb structures (these combinations are enumerated in
    a small sketch below).

    2. Subject/Object functional slot. Variations in subject/object functional slot

    may happen in many different ways, such as Proper Noun, Common Noun
    (Singular or Plural), Pronoun, PCP1 form6 and PCP2 form7. We also study variation
    in pre-modifier adjectives, genitive case, quantifier and determiner tags.

    3. Study of wh-family interrogative sentences.

    4. Kind of sentence. Whether the sentence is affirmative, negative, interrogative
    or negative-interrogative.

    Systematic study of these patterns and their components helps in estimating

    the adaptation costs between them.
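
    As noted in item 1 above, the three tenses and four forms give 12 active-verb structures,
    and adapting any retrieved structure to any input structure gives 12 × 12 = 144 combinations,
    which the next section analyses. A two-line enumeration of these counts:

        from itertools import product

        tenses = ["present", "past", "future"]
        forms  = ["indefinite", "continuous", "perfect", "perfect continuous"]
        structures = ["%s %s" % (t, f) for t, f in product(tenses, forms)]
        pairs = list(product(structures, structures))   # (retrieved, input) combinations
        print(len(structures), len(pairs))              # 12 144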

    2.3 Study of Adaptation Procedure for Morpho-

    logical Variation of Active Verbs

    Hindi verb morphological variations depend on four aspects: the gender, number and
    person of the subject, and the tense (and form) of the sentence. All these variations affect

    6 -ing verb form other than the main verb.    7 -ed or -en verb forms other than the main verb.

    the adaptation procedure. In Hindi, these conjugations are realized by using suffixes

    attached to the root verbs, and/or by adding some auxiliary verbs (see Table A.3 of

    Appendix A). Since there are 12 different structures (depending upon the tense and

    form), the adaptation scheme should have the capability to adapt any one of them
    to any input type. Hence altogether 12 × 12, i.e. 144, different combinations are

    possible. However, Table A.3 (Appendix A) shows that in Hindi, perfect cont