-
CONTRIBUTIONS TO ENGLISH TO HINDI
MACHINE TRANSLATION USING
EXAMPLE-BASED APPROACH
DEEPA GUPTA
DEPARTMENT OF MATHEMATICS
INDIAN INSTITUTE OF TECHNOLOGY DELHI
HAUZ KHAS, NEW DELHI-110016, INDIA
JANUARY, 2005
-
CONTRIBUTIONS TO ENGLISH TO HINDI
MACHINE TRANSLATION USING
EXAMPLE-BASED APPROACH
by
DEEPA GUPTA
Department of Mathematics
Submitted
in fulfilment of the requirement of
the degree of
Doctor of Philosophy
to the
Indian Institute of Technology Delhi
Hauz Khas, New Delhi-110016, India
January, 2005
-
Dedicated to
My Parents,
My Brother Ashish and
My Thesis Supervisor...
-
Certificate
This is to certify that the thesis entitled Contributions to English to Hindi
Machine Translation Using Example-Based Approach submitted by Ms.
Deepa Gupta to the Department of Mathematics, Indian Institute of Technology
Delhi, for the award of the degree of Doctor of Philosophy, is a record of bona fide
research work carried out by her under my guidance and supervision.
The thesis has reached the standards fulfilling the requirements of the regulations
relating to the degree. The work contained in this thesis has not been submitted to
any other university or institute for the award of any degree or diploma.
Dr. Niladri Chatterjee
Assistant Professor
Department of Mathematics
Indian Institute of Technology Delhi
Delhi (INDIA)
-
Acknowledgement
To say that this thesis is mine alone would be untrue. It is like a dream come true. There are people in this world, some of them wonderful, who helped make this dream the product you are holding in your hands. I would like to thank all of them, and in particular:
Dr. Niladri Chatterjee - mentor, guru and friend - taught me the basics of research and stayed with me right till the end. His efforts, comments, advice and ideas developed my thinking and improved my presentation. Without his constant encouragement, keen interest, inspiring criticism and invaluable guidance, I would not have accomplished this work. I admit that his efforts deserve far more acknowledgement than is expressed here.
I acknowledge and thank the Indian Institute of Technology Delhi and the Tata Infotech Research Lab, which funded this research. I sincerely thank all the faculty members of the Department of Mathematics, especially Prof. B. Chandra and Dr. R. K. Sharma, for their continuous moral support and help. I
thank my SRC members, Prof. Saroj Kaushik and Prof. B. R. Handa, for their time
and efforts. I also thank the department administrative staff for their assistance. I
extend my thanks to Prof. R. B. Nair and Dr. Wagish Shukla of IIT Delhi, and
Prof. Vaishna Narang, Prof. P. K. Pandey, Prof. G. V. Singh, Dr. D. K. Lobiyal,
and Dr. Girish Nath Jha of Jawaharlal Nehru University Delhi, for the enlightening
discussions on basics of languages.
I would like to express my sincere thanks to my friends Priya and Dharmendra
for many fruitful discussions regarding my research problem. I thank Mr. Gaurav
-
Kashyap for helping me in the implementation of the algorithms. In particular, I
would like to thank Inderdeep Singh, for his help in writing some part of the thesis.
I want to give special thanks to my friends, Sonia, Pranita and Nutan, for helping
me in both good and bad times. I would like to thank Prabhakhar for his brotherly
support. I extend my thanks to Manju, Anita, Sarita, Subhashini and Anju for
cheering me, always.
Shailly and Geeta - amazing friends who read the manuscript and gave honest comments. Both of them also stayed with me through the process and handled me, and sometimes my out-of-control emotions, so well. I especially wish to thank Geeta for letting me stay in her hostel room, and for her wonderful help when I fractured my leg, at a time when we had known each other for only a month. I wish to acknowledge Krishna for his constant help, both academic and non-academic, and his continuous encouragement.
I convey my sincere regards to my parents, and brothers for the sacrifices they have
made, for the patience they have shown, and for the love and blessing they have
showered. I thank Arun for his moral support. Most important of all, I would like to express my profound sense of gratitude and appreciation to my sister Neetu. Her irrational and unbreakable belief in me bordered on craziness at times.
I cannot avoid mentioning my friend Sharad, who deserves more than a little acknowledgement. His constant inspiration and untiring support have sustained my confidence throughout this work.
Finally, I thank GOD for everything.
Deepa Gupta
-
Abstract
This research focuses on the development of an Example-Based Machine Translation (EBMT) system for English to Hindi. Development of a machine translation (MT) system typically demands a large volume of computational resources. For example, rule-based MT systems require extraction of syntactic and semantic knowledge in the form of rules, while statistics-based MT systems require a huge parallel corpus containing sentences in the source language and their translations in the target language. The requirement for such computational resources is much smaller for EBMT. This makes development of EBMT systems feasible for English to Hindi translation, where availability of large-scale computational resources is still scarce. The primary motivation for this work comes from the following:
a) Although a small number of English to Hindi MT systems are already available,
the outputs produced by them are not of high quality all the time. Through
this work we intend to analyze the difficulties that lead to this below par
performance, and try to provide some solutions for them.
b) There are several other major languages (e.g., Bengali, Punjabi, Gujarati) in the Indian subcontinent. Demand for developing MT systems from English to these languages is increasing rapidly, but at the same time the development of computational resources in these languages is still in its infancy. Since many of these languages are similar to Hindi, both syntactically and lexically, the research carried out here should help in developing MT systems from English to these languages as well.
i
-
The major contributions of this research may be described as follows:
1) Development of a systematic adaptation scheme. We propose an adaptation scheme consisting of ten basic operations. These operations work not only at the word level, but at the suffix level as well. This makes adaptation less expensive in many situations.
2) Study of Divergence. We observe that the occurrence of divergence causes major difficulty for any MT system. In this work we make an in-depth study of the different types of divergence, and categorize them.
3) Development of a Retrieval scheme. We propose a novel approach for measuring similarity between sentences. We suggest that a retrieval strategy, with respect to an EBMT system, will be most efficient if it measures similarity on the basis of the cost of adaptation. In this work we provide a complete framework for an efficient retrieval scheme on the basis of our studies on divergence and the cost of adaptation.
4) Dealing with Complex sentences. Handling complex sentences is generally considered difficult for an MT system. In this work we propose a split-and-translate technique for translating complex sentences under an EBMT framework.
We feel that the overall scheme proposed in this research will pave the way for
developing an efficient EBMT system for translating from English to Hindi. We
hope that this research will also help development of MT systems from English to
other languages of the Indian subcontinent.
ii
-
Contents
1 Introduction 1
1.1 Description of the Work Done and Summary of the Chapters . . . . . 6
1.2 Some Critical Points . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2 Adaptation in English to Hindi Translation: A Systematic Ap-
proach 23
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.2 Description of the Adaptation Operations . . . . . . . . . . . . . . . 29
2.3 Study of Adaptation Procedure for Morphological Variation of Active
Verbs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.3.1 Same Tense Same Verb Form . . . . . . . . . . . . . . . . . . 38
2.3.2 Different Tenses Same Verb Form . . . . . . . . . . . . . . . . 42
2.3.3 Same Tense Different Verb Forms . . . . . . . . . . . . . . . . 46
2.3.4 Different Tenses Different Verb Forms . . . . . . . . . . . . . . 48
2.4 Adaptation Procedure for Morphological Variation of Passive Verbs . 51
2.5 Study of Adaptation Procedures for Subject/Object Functional Slot . . 56
2.5.1 Adaptation Rules for Variations in the Morpho Tags of @DN> 59
-
2.5.2 Adaptation Rules for Variations in the Morpho Tags of @GN> 60
2.5.3 Adaptation Rules for Variations in the Morpho Tags of @QN . 64
2.5.4 Adaptation Rules for Variations in the Morpho Tags of Pre-
modifier Adjective @AN> . . . . . . . . . . . . . . . . . . . . 64
2.5.5 Adaptation Rules for Variations in the Morpho Tags of @SUB 69
2.6 Adaptation of Interrogative Words . . . . . . . . . . . . . . . . . . . 73
2.7 Adaptation Rules for Variation in Kind of Sentences . . . . . . . . . . 83
2.8 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
3 An FT and SPAC Based Divergence Identification Technique From
Example Base 87
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
3.2 Divergence and Its Identification: Some Relevant Past Work . . . . . 89
3.3 Divergences and Their Identification in English to Hindi Translation . 96
3.3.1 Structural Divergence . . . . . . . . . . . . . . . . . . . . . . . 97
3.3.2 Categorial Divergence . . . . . . . . . . . . . . . . . . . . . . 100
3.3.3 Nominal Divergence . . . . . . . . . . . . . . . . . . . . . . . 104
3.3.4 Pronominal Divergence . . . . . . . . . . . . . . . . . . . . . . 107
3.3.5 Demotional Divergence . . . . . . . . . . . . . . . . . . . . . . 111
3.3.6 Conflational Divergence . . . . . . . . . . . . . . . . . . . . . 117
3.3.7 Possessional Divergence . . . . . . . . . . . . . . . . . . . . . 121
3.3.8 Some Critical Comments . . . . . . . . . . . . . . . . . . . . . 131
iv
-
3.4 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
4 A Corpus-Evidence Based Approach for Prior Determination of
Divergence 135
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
4.2 Corpus-Based Evidences and Their Use in Divergence Identification . 136
4.2.1 Roles of Different Functional Tags . . . . . . . . . . . . . . . . 138
4.3 The Proposed Approach . . . . . . . . . . . . . . . . . . . . . . . . . 147
4.4 Illustrations and Experimental Results . . . . . . . . . . . . . . . . . 155
4.4.1 Illustration 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
4.4.2 Illustration 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
4.4.3 Illustration 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
4.4.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . 166
4.5 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
5 A Cost of Adaptation Based Scheme for Efficient Retrieval of Trans-
lation Examples 171
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
5.2 Brief Review of Related Past Work . . . . . . . . . . . . . . . . . . . 171
5.3 Evaluation of Cost of Adaptation . . . . . . . . . . . . . . . . . . . . 178
5.3.1 Cost of Different Adaptation Operations . . . . . . . . . . . . 182
5.4 Cost Due to Different Functional Slots and Kind of Sentences . . . . 185
v
-
5.4.1 Costs Due to Variation in Kind of Sentences . . . . . . . . . . 186
5.4.2 Cost Due to Active Verb Morphological Variation . . . . . . . 187
5.4.3 Cost Due to Subject/Object Functional Slot . . . . . . . . . . 192
5.4.4 Use of Adaptation Cost as a Measure of Similarity . . . . . . . 197
5.5 The Proposed Approach vis-à-vis Some Similarity Measurement Schemes
198
5.5.1 Semantic Similarity . . . . . . . . . . . . . . . . . . . . . . . . 198
5.5.2 Syntactic Similarity . . . . . . . . . . . . . . . . . . . . . . . . 201
5.5.3 A Proposed Approach: Cost of Adaptation Based Similarity . 203
5.5.4 Drawbacks of the Proposed Scheme . . . . . . . . . . . . . . . 211
5.6 Two-level Filtration Scheme . . . . . . . . . . . . . . . . . . . . . . . 213
5.6.1 Measurement of Structural Similarity . . . . . . . . . . . . . . 214
5.6.2 Measurement of Characteristic Feature Dissimilarity . . . . . . 217
5.7 Complexity Analysis of the Proposed Scheme . . . . . . . . . . . . . 222
5.8 Difficulties in Handling Complex Sentences . . . . . . . . . . . . . . . 226
5.9 Splitting Rules for Converting Complex Sentences into Simple Sentences . . 229
5.9.1 Splitting Rule for the Connectives when, where, when-
ever and wherever . . . . . . . . . . . . . . . . . . . . . . . 231
5.9.2 Splitting Rule for the Connective who . . . . . . . . . . . . 241
5.10 Adaptation Procedure for Complex Sentence . . . . . . . . . . . . . . 253
5.10.1 Adaptation Procedure for Connectives when, where, when-
ever and wherever . . . . . . . . . . . . . . . . . . . . . . . 254
vi
-
5.10.2 Adaptation Procedure for Connective who . . . . . . . . . . 256
5.11 Illustrations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260
5.11.1 Illustration 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . 260
5.11.2 Illustration 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . 262
5.12 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . 264
6 Discussions and Conclusions 267
6.1 Goals and Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . 267
6.2 Contributions Made by This Research . . . . . . . . . . . . . . . . . . 268
6.3 Possible extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
6.4 Epilogue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
6.4.1 Pre-editing and Post-editing . . . . . . . . . . . . . . . . . . . 274
6.4.2 Evaluation Measures of Machine Translation . . . . . . . . . . 276
Appendices 280
A 281
A.1 English and Hindi Language Variations . . . . . . . . . . . . . . . . . 281
A.2 Verb Morphological and Structure Variations . . . . . . . . . . . . . . 285
A.2.1 Conjugation of Root Verb . . . . . . . . . . . . . . . . . . . . 286
B 291
B.1 Functional Tags . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
B.2 Morpho Tags . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 294
vii
-
C 299
C.1 Definitions of Some Non-typical Functional Tags and SPAC Structures . . 299
D 303
D.1 Semantic Similarity . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303
E 305
E.1 Cost Due to Adapting Pre-modifier Adjective to Pre-modifier Adjective . . 305
Bibliography 308
viii
-
List of Figures
1.1 An Example Sentence with Its Morpho-Functional Tags . . . . . . . . 20
2.1 The five possible scenarios in the SL SL TL interface of partial
case matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.2 Example of Different Adaptation Operations . . . . . . . . . . . . . . 34
2.3 Some Typical Sentence Structures . . . . . . . . . . . . . . . . . . . . 83
3.1 Algorithm for Identification of Structural Divergence . . . . . . . . . 99
3.2 Correspondence of SPACs of E and H for Identification of Structural
Divergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
3.3 Algorithm for Identification of Categorial Divergence . . . . . . . . . 103
3.4 Correspondence of SPACs for the Categorial Divergence Example of
Sub-type 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
3.5 Algorithm for Identification of Nominal Divergence . . . . . . . . . . 106
3.6 Correspondence of SPAC E and SPAC H of Nominal Divergence of
Sub-type 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
3.7 Algorithm for Identification of Pronominal Divergence . . . . . . . . 110
-
3.8 Correspondence of SPAC E and SPAC H of Pronominal Divergence
of Sub-type 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
3.9 Algorithm for Identification of Demotional Divergence . . . . . . . . . 114
3.10 Correspondence of SPAC E and SPAC H for Demotional Sub-type 4 115
3.11 SPAC Correspondence for Demotional Divergence of Sub-type 1 . . . 116
3.12 Algorithm for Identification of Conflational Divergence . . . . . . . . 120
3.13 Correspondence of SPAC E and SPAC H for Conflational Divergence
of Sub-type 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
3.14 Algorithm for Identification of Possessional Divergence . . . . . . . . 129
3.15 Correspondence of SPAC E and SPAC H for Possessional Divergence
of Sub-type 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
3.16 Correspondence of SPAC E and SPAC H for Possessional Divergence
of Sub-type 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
4.1 Schematic Diagram of the Proposed Algorithm . . . . . . . . . . . . . 153
4.2 Continuation of the Figure 4.1 . . . . . . . . . . . . . . . . . . . . . . 154
5.1 Schematic View of Module 1 for Identification of Complex Sentence
with Connective any of when, where, whenever, or wherever . 232
5.2 Schematic View of Module 2 . . . . . . . . . . . . . . . . . . . . . . . 237
5.3 Schematic View of Module 3 . . . . . . . . . . . . . . . . . . . . . . . 240
5.4 Schematic View of Module 1 for Identification of Complex Sentence
with Connective who . . . . . . . . . . . . . . . . . . . . . . . . . . 244
x
-
5.5 Schematic View of the SUBROUTINE SPLIT . . . . . . . . . . . . . 246
5.6 Schematic View of Module 2 . . . . . . . . . . . . . . . . . . . . . . . 247
5.7 Schematic View of Module 3 . . . . . . . . . . . . . . . . . . . . . . . 249
5.8 Schematic View of Module 4 . . . . . . . . . . . . . . . . . . . . . . . 250
xi
-
List of Tables
1.1 Output of AnglaHindi and Shakti MT System . . . . . . . . . . 5
2.2 Notations Used in Sentence Patterns . . . . . . . . . . . . . . . . . . 35
2.3 Adaptation Operations of Verb Morphological Variations in Present
Indefinite to Present Indefinite . . . . . . . . . . . . . . . . . . . . . . 39
2.4 Adaptation Operations of Verb Morphological Variations in Present
Indefinite to Past Indefinite . . . . . . . . . . . . . . . . . . . . . . . 44
2.5 Different Functional Tags Under the Functional Slot Subject or Object . . 56
2.6 Different Possible Morpho Tags for Each of the Functional Tags under
the Functional Slot Subject or Object . . . . . . . . . . . . . . . . . 58
2.8 Adaptation Operations for Genitive Case to Genitive Case . . . . . . 62
2.10 Adaptation Operations for Pre-modifier Adjective to Pre-modifier Ad-
jective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
2.11 Adaptation Operations for Subject to Subject Variations . . . . . . . 71
2.12 Different Sentence Patterns of Interrogative Words . . . . . . . . . . . 77
-
2.13 Functional & Morpho Tags Corresponding to Each Interrogative Sen-
tence Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
2.14 Adaptability Rules for Group G5 Sentence Patterns . . . . . . . . . . 83
2.15 Adaptation Rules for Variation in Kind of Sentences . . . . . . . . . . 84
3.1 Different Semantic Similarity Scores of shock with trouble
and panic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.1 FT-features Instrumental for Creating Divergence . . . . . . . . . . . 138
4.2 Relevance of FT-features in Different Divergence Types . . . . . . . . 139
4.3 FT of the Problematic Words for Each Divergence Type . . . . . . . 142
4.4 Frequency of Words in Different Sections . . . . . . . . . . . . . . . . 144
4.5 PSD/NSD Schematic Representations . . . . . . . . . . . . . . . . . . 145
4.6 Values of s(di) and m(di) for Illustration 3 . . . . . . . . . . . . . . . 160
4.7 Some Illustrations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
4.8 Continuation of Table 4.7 . . . . . . . . . . . . . . . . . . . . . . . . 165
4.9 Results of Our Experiments . . . . . . . . . . . . . . . . . . . . . . . 166
5.1 Cost Due to Variation in Kind of Sentences . . . . . . . . . . . . . . . 187
5.2 Cost Due to Verb Morphological Variation Present Indefinite to Present
Indefinite . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
5.3 Adaptation Operations of Verb Morphological Variation Present In-
definite to Past indefinite . . . . . . . . . . . . . . . . . . . . . . . . . 192
5.4 Costs Due to Adapting Genitive Case to Genitive Case . . . . . . . . 195
xiv
-
5.5 Cost of Adaptation Due to Subject/Object to Subject/Object . . . . 197
5.6 Best Five Matches by Using Semantic Similarity for the Input Sen-
tence I work. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
5.7 Best Five Matches by Using Semantic Similarity for the Input Sen-
tence Sita sings ghazals. . . . . . . . . . . . . . . . . . . . . . . . . 201
5.8 Weighting Scheme for Different POS and Syntactic Role . . . . . . . 202
5.9 Best Five Matches by Syntactic Similarity for the Input Sentence I
work. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
5.10 Best Five Matches by Syntactic Similarity for the Input Sentence Sita
sings ghazals. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
5.11 Functional-morpho Tags for the Input English Sentence (IE) and the
Retrieved English Sentence (RE) . . . . . . . . . . . . . . . . . . . . 204
5.12 Retrieval on the Basis of Cost of Adaptation Based Scheme for the
Input Sentence I work. . . . . . . . . . . . . . . . . . . . . . . . . . 207
5.13 Retrieval on the Basis of Cost of Adaptation Based Similarity for the
Input Sentence Sita sings ghazals. . . . . . . . . . . . . . . . . . . . 207
5.14 Cost of Adaptation for Retrieved Best Five Matches for the Input
Sentence I work. by Using Semantic and Syntactic Based Similarity
Schemes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
5.15 Cost of Adaptation for Retrieved Best Five Matches for the Input
Sentence Sita sings ghazals by Using Semantic and Syntactic based
Similarity Schemes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
5.16 Weights Used for Characteristic Features . . . . . . . . . . . . . . . . 220
xv
-
5.17 Notation Used in the Complexity Analysis . . . . . . . . . . . . . . . 222
5.19 Typical Examples of Complex Sentence with Connective when, where,
whenever or wherever Handled by Module 2 . . . . . . . . . . . . 235
5.20 Typical Examples of Complex Sentence with Connective when, where,
whenever or wherever Handled by Module 3 . . . . . . . . . . . . 239
5.21 Typical Complex Sentences with Relative Adverb who Handled by
Module 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
5.22 Typical Complex Sentences with Relative Adverb who Handled by
Module 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
5.23 Typical Complex Sentences with Relative Adverb who Handled by
Module 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
5.24 Hindi Translation of Relative Adverbs . . . . . . . . . . . . . . . . . . 254
5.25 Patterns of Complex Sentence with Connective when, where,
whenever and wherever . . . . . . . . . . . . . . . . . . . . . . . . 255
5.26 Patterns of Complex Sentence with Connective who . . . . . . . . . 257
5.27 Five Most Similar Sentences for RC You go to India. Using Cost of
Adaptation Based Scheme . . . . . . . . . . . . . . . . . . . . . . . . 261
5.28 Five Most Similar Sentences for MC You should speak Hindi. Using
Cost of Adaptation Based Scheme . . . . . . . . . . . . . . . . . . . . 261
5.29 Five Most Similar Sentences for RC He wants to learn Hindi. Using
Cost of Adaptation Based Scheme . . . . . . . . . . . . . . . . . . . . 263
5.30 Five Most Similar Sentences for MC The student should study this
book. Using Cost of Adaptation Based Scheme . . . . . . . . . . . . . 263
xvi
-
A.2 Different Case Ending in Hindi . . . . . . . . . . . . . . . . . . . . . 283
A.3 Suffixes and Morpho-Words for Hindi Verb Conjugations . . . . . . . 286
A.4 Verb Morphological Changes From English to Hindi Translation . . . 288
E.1 Costs Due to Adapting Pre-modifier Adjective to Pre-modifier Adjective . . 307
xvii
-
Chapter 1
Introduction
-
Machine Translation (MT) is the process of translating text units of one language
(source language) into a second language (target language) by using computers. The
need for MT is greatly felt in the modern age due to globalization of information,
where global information base needs to be accessed from different parts of the world.
Although most of this information is available online, the major difficulty in dealing
with this information is that its language is primarily English. From science, technology and education to gadget manuals and commercial advertisements, the predominant presence of English as the medium of communication can easily be observed. The world, however, is multilingual, with different languages spoken
in different regions. This necessitates the development of good MT systems for
translating these works into other languages so that a larger population can access,
retrieve and understand them. Consequently, in a country like India, where English
is understood by less than 3% of the population (Sinha and Jain, 2003), the need
for developing MT systems for translating from English into some native Indian
languages is very acute. In this work we looked into different aspects of designing an
English to Hindi MT system using Example-Based (Nagao, 1984) technique. Two
fundamental questions that we feel we should answer at this point are:
(i) The rationale behind choosing Example-Based Machine Translation (EBMT) as the paradigm of interest;
(ii) The reason behind selecting Hindi as the preferred language.
Below we provide justifications behind these choices.
Development of MT systems has taken a big leap in the last two decades. Typ-
ically, machine translation requires handcrafted and complicated large-scale knowl-
1
-
edge (Sumita and Iida, 1991). Various MT paradigms have so far evolved depending
upon how the translation knowledge is acquired and used. For example,
1. Rule-Based Machine Translation (RBMT): Here rules are used for analysis
and representation of the meaning of the source language texts, and the
generation of equivalent target language texts (Grishman and Kosaka, 1992),
(Thurmair, 1990), (Arnold and Sadler, 1990).
2. Statistical- (or Corpus-) Based Machine Translation (SBMT): Statistical translation models are trained on a sentence-aligned parallel corpus; translation is based on n-gram modelling and on the probability distribution of source-target language pairs estimated from a very large corpus. This technique was proposed by IBM in the early 1990s (Brown, 1990), (Brown et al., 1992), (Brown et al., 1993), (Germann, 2001).
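In spirit, SBMT selects the target sentence t that maximizes P(t) · P(s | t), combining a target-language model with a translation model. The toy sketch below illustrates only this scoring idea; the sentences and probability values are invented for illustration, and this is not the actual IBM model machinery.

```python
# Toy noisy-channel scoring in the spirit of SBMT: pick the candidate
# translation t that maximizes P(t) * P(s | t). The probability tables
# below are invented for illustration; real systems estimate them from
# a very large sentence-aligned parallel corpus.

def best_translation(source, candidates, lm_prob, tm_prob):
    """Return the candidate t with the highest score P(t) * P(source | t)."""
    return max(candidates, key=lambda t: lm_prob[t] * tm_prob[(source, t)])

# Hypothetical language-model and translation-model probabilities.
lm = {"vah gaatii hai": 0.4, "vah gaataa hai": 0.6}
tm = {("she sings", "vah gaatii hai"): 0.7,
      ("she sings", "vah gaataa hai"): 0.2}

print(best_translation("she sings", ["vah gaatii hai", "vah gaataa hai"], lm, tm))
# 0.4 * 0.7 = 0.28 beats 0.6 * 0.2 = 0.12, so "vah gaatii hai" wins.
```

The dependence of the result on the estimated probability tables is precisely why SBMT needs so large a parallel corpus.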
However, these techniques have their own drawbacks. The main drawback of RBMT systems is that sentences in any natural language may assume a large variety of structures. Also, machine translation often suffers from ambiguities of various types (Dorr et al., 1998). As a consequence, translation from one natural language into another requires enormous knowledge about the syntax and semantics of both the source and target languages. Capturing all this knowledge in rule form is a daunting task, if not an impossible one. SBMT techniques, on the other hand, depend on how accurately the various probabilities are estimated. Realistic estimates of these probabilities can be made only if a large parallel corpus is available, and such a huge volume of data is not easy to obtain. Consequently, this scheme is viable only for a small number of language pairs.
2
-
Example-Based Machine Translation (Nagao, 1984), (Carl and Way, 2003) makes use of past translation examples to generate the translation of a given input. An EBMT system stores in its example base translation examples between two languages, the source language (SL) and the target language (TL). These examples are subsequently used as guidance for future translation tasks. In order to translate a new input sentence in the SL, a similar SL sentence1 is retrieved from the example base, along with its translation in the TL. This example is then adapted suitably to generate a translation of the given input. It has been found that EBMT has several advantages in comparison with other MT paradigms (Sumita and Iida, 1991):
1. It can be upgraded easily by adding more examples to the example base;
2. It utilizes translators' expertise, and adds a reliability factor to the translation;
3. It can be accelerated easily by indexing and parallel computing;
4. It is robust because of best-match reasoning.
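The retrieve-then-adapt cycle described above can be sketched in a few lines. This is a deliberately naive illustration: the word-overlap similarity measure and the word-for-word substitution lexicon below are placeholder assumptions, not the similarity measure or the adaptation operations proposed in this thesis.

```python
# Minimal sketch of the EBMT retrieve-and-adapt cycle. The similarity
# measure (word overlap) and the adaptation step (word-for-word
# substitution via a small bilingual lexicon) are naive placeholders.

def similarity(a, b):
    """Crude similarity: fraction of shared words (Jaccard overlap)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

def translate(input_sl, example_base, lexicon):
    # 1. Retrieval: find the most similar SL sentence in the example base.
    src, tgt = max(example_base, key=lambda ex: similarity(ex[0], input_sl))
    # 2. Adaptation: patch the retrieved TL sentence where the SL differs.
    for old, new in zip(src.split(), input_sl.split()):
        if old != new and old in lexicon and new in lexicon:
            tgt = tgt.replace(lexicon[old], lexicon[new])
    return tgt

# Hypothetical one-example base and toy English-to-Hindi lexicon.
example_base = [("I read books", "main pustakein padhtaa hoon")]
lexicon = {"I": "main", "read": "padhtaa",
           "books": "pustakein", "letters": "patra"}

print(translate("I read letters", example_base, lexicon))
# -> "main patra padhtaa hoon"
```

With a realistic example base, the quality of the retrieved match, and hence how much adaptation is needed, is exactly what the retrieval module must optimize.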
Other researchers (e.g. (Somers, 1999), (Kit et al., 2002)) have also considered EBMT to be one of the major and effective approaches among the different MT paradigms, primarily because it exploits the linguistic knowledge stored in an aligned text in a more efficient way.
From the above observations we infer that for the development of MT systems from English to Indian languages, EBMT should be one of the preferred approaches. This is because a significant volume of parallel text is available between English and different Indian languages in the form of government notices, translated books,
1Sometimes more than one sentence is also retrieved
3
-
advertisement material, etc. Although this data is generally not yet available in electronic form, converting it into machine-readable form is much easier than formulating the explicit translation rules required by an RBMT system. In fact, some parallel data in electronic form has been made available through projects such as EMILLE (http://www.emille.lancs.ac.uk/home.html). Also, there has been some concerted effort from various government organizations like TDIL2, CIIL Mysore3 and C-DAC Noida4 (Vikas, 2001), and from various institutes, e.g., IIT Bombay5, IIT Kanpur6 and LTRC (IIIT Hyderabad)7, to develop linguistic resources. At the same time,
this data is not large enough to design an English to Hindi SBMT system, which typically requires several hundred thousand sentences. These resources, we hope, will be fruitfully utilized for developing different EBMT systems involving Indian languages.
Of the different Indian languages8, Hindi has some major advantages over the others as far as work on MT is concerned. Not only is Hindi the national language of India, it is also the most popular of all Indian languages. With respect to Indian languages, all the major works reported so far (e.g. ANGLAHINDI (Sinha et al., 2002), SHIVA (http://shiva.iiit.net/), SHAKTI (Sangal, 2004), MaTra (human-aided MT)9) are primarily concerned with English and Hindi as their preferred languages. In 2003 Hindi was designated the "surprise language" (Oard, 2003) by DARPA. As a consequence, different universities (e.g. CMU, Johns Hopkins, USC-ISI) have invested effort in developing MT systems involving Hindi.
2http://tdil.mit.gov.in/
3http://www.ciil.org/
4http://www.cdacnoida.com/
5http://www.cfilt.iitb.ac.in
6http://www.cse.iitk.ac.in/users/isciig/
7http://ltrc.iiit.net/
8India has 17 official languages, and more than 1000 dialects (http://azaz.essortment.com/languagesindian rsbo.htm)
9http://www.ncst.ernet.in/matra/about.shtml
4
-
This world-wide popularity of the language makes the study of English to Hindi machine translation all the more meaningful in today's context.
One major advantage of the above-mentioned English to Hindi translation systems being available on-line is that we could work with them and examine the quality of their outputs. In this respect, we find that the outputs given by these systems are not always correct translations of the inputs. Table 1.1 illustrates this for the systems AnglaHindi and Shakti: it shows the translations produced by the two systems for different inputs, together with the correct translations of these sentences.
Input Sentence | Output of AnglaHindi | Output of Shakti | Actual Translation
Ram married Sita. | raam ne siita vivahaa kiyaa | raam ne siitaa vivaaha kiyaa | raam ne siitaa se vivaaha kiyaa
Fan is on. | pankhaa ho par | pankhaa lagaataar hai | pankhaa chal rahaa hai
This dish tastes good. | yaha vyanjan achchhaa hotaa hai | yah thalii achcchaa swaad letii hai | iss vyanjan kaa swaad achchhaa hai
The soup lacks salt. | soop namak kam hotaa hai | shorbaa namak kamii hai | soop mein namak kam hai
It is raining. | yah varshaa ho rahii hai | yah varshaa ho rahii hai | varshaa ho rahii hai
They have a big fight. | unke paas eka badhii ladaae hai | unke badhii ladaaiyaan hain | unkii ghamasan ladaii huii

Table 1.1: Outputs of the AnglaHindi and Shakti MT systems
-
1.1. Description of the Work Done and Summary of the Chapters
We have found many such instances where the outputs produced by the systems
cannot be considered correct Hindi translations of the respective inputs. This
observation prompted us to study different aspects of English to Hindi translation,
in order to understand the difficulties involved in machine translation, particularly
with respect to English to Hindi translation, and how these shortcomings can be
dealt with under an EBMT framework. This research is concerned with the above studies.
1.1 Description of the Work Done and Summary
of the Chapters
The success of an EBMT system rests on two different modules: (i) similarity
measurement and retrieval, and (ii) adaptation. Retrieval is the procedure by which a
suitable translation example is retrieved from a system's example base. Adaptation
is the procedure by which a retrieved translation is modified to generate the
translation of the given input. Various retrieval strategies have been developed (e.g.
(Nagao, 1984), (Sato, 1992), (Collins and Cunningham, 1996)). All these retrieval
strategies aim at retrieving an example from the example base such that the retrieved
example is similar to the input sentence. This is due to the fact that the fundamental
intuition behind EBMT is that translations of similar sentences of the source lan-
guage will be similar in the target language as well. Thus the concept of retrieval is
intricately related to the concept of similarity measurement between sentences.
But the main difficulty with this assumption is that there is no straight-
forward way to measure similarity between sentences. Different works have defined
different approaches for measuring similarity between sentences, for example,
word-based metrics (e.g. (Nirenburg, 1993), (Nagao, 1984)), character-based
metrics (e.g. (Sato, 1992)), syntactic/semantic-based matching (e.g. (Manning and
Schutze, 1999)), DP-matching between word sequences (e.g. (Sumita, 2001)) and
hybrid retrieval schemes (e.g. (Collins, 1998)).
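As an illustration, DP-matching between word sequences (in the spirit of Sumita, 2001) can be sketched as a standard dynamic-programming edit distance over words; the normalization into a [0, 1] similarity score below is our own illustrative choice, not the formulation of any of the cited works.

```python
def dp_distance(a, b):
    """Edit distance between two word sequences via dynamic programming."""
    m, n = len(a), len(b)
    # dist[i][j] = cost of transforming a[:i] into b[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i
    for j in range(n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # word deletion
                             dist[i][j - 1] + 1,        # word insertion
                             dist[i - 1][j - 1] + sub)  # substitution / copy
    return dist[m][n]

def dp_similarity(s1, s2):
    """Similarity in [0, 1]: 1 minus the length-normalized distance."""
    w1, w2 = s1.lower().split(), s2.lower().split()
    return 1.0 - dp_distance(w1, w2) / max(len(w1), len(w2))
```

For instance, "The boy eats rice" and "The boy eats rice everyday" differ by one insertion, giving a high similarity score.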
In all these works similarity measurement and adaptation are considered
in isolation. This, we feel, is a major hindrance with respect to EBMT. In this
work we therefore propose a novel approach for measuring similarity: we look at
similarity from the point of view of adaptation. We suggest that a past example
should be considered the most similar with respect to an input sentence if its
adaptation towards generating the desired translation is the simplest. The work
carried out in this research is aimed at achieving this goal. Our studies therefore start
in the following way. We first look at adaptation in detail. An efficient adaptation
scheme is very important for an EBMT system because even a very large example
base cannot, in general, guarantee an exact match for a given input sentence. As
a consequence, the need arises for an efficient and systematic adaptation scheme
for modifying a retrieved example, and thereby generating the required translation.
Various adaptation schemes have been proposed in the literature, e.g. (Veale and Way,
1997), (Shiri et al., 1997), (Collins, 1998) and (McTait, 2001). A scrutiny of these
schemes suggests that there are primarily four basic adaptation operations, namely
word addition, word deletion, word replacement and copy.
In our approach we started with these basic operations: word addition, word
deletion, word replacement and copy. However, in this respect we notice the
following:
1. Both English and Hindi rely heavily on suffixes for morphological changes.
There are a number of suffixes for achieving declension of verbs and nouns.
Further, in Hindi there are situations when morphological changes in the
adjectives are also required, depending upon the number and gender of the
corresponding noun/pronoun. Since the number of suffixes is limited, we feel that
if adaptation operations are focused on the suffixes instead of being purely
word-based, then in many situations a significant amount of computational
effort may be saved.
2. A further observation with respect to Hindi is that there are situations when,
instead of suffixes, whole words are used for bringing in morphological variations.
For example, the present continuous form of Hindi verbs is: verb root +
rahaa/rahii/rahe + hai/hain/ho. Here the words rahaa,
rahii or rahe are used to achieve the morphological variation; which of
these will be used depends upon the number and gender of the subject. Similarly,
hai, hain and ho are used depending upon the number and person of the
subject. We term these words morpho-words. Appendix A gives details of
different Hindi morpho-words and their usages.
A major fallout of the above observation is that in some situations adaptation
may be carried out by dealing with the morpho-words instead of whole words, which
is computationally much less expensive than dealing with constituent words as a
whole. Thus we propose an adaptation scheme consisting of ten operations: addition,
deletion and replacement of constituent words; addition, deletion and replacement
of morpho-words; addition, deletion and replacement of suffixes; and copy. Chapter
2 of the thesis discusses these adaptation operations in detail.
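The ten operations can be listed programmatically. The sketch below is only an illustrative inventory (the enumeration names are ours); the distinction recorded in `NEEDS_DICTIONARY` — that only constituent word addition and replacement require a bilingual dictionary search — is discussed later in this summary.

```python
from enum import Enum

class AdaptOp(Enum):
    """The ten adaptation operations (illustrative enumeration)."""
    WORD_ADD = "constituent word addition"
    WORD_DEL = "constituent word deletion"
    WORD_REP = "constituent word replacement"
    MORPHO_ADD = "morpho-word addition"
    MORPHO_DEL = "morpho-word deletion"
    MORPHO_REP = "morpho-word replacement"
    SUFFIX_ADD = "suffix addition"
    SUFFIX_DEL = "suffix deletion"
    SUFFIX_REP = "suffix replacement"
    COPY = "copy"

# Only constituent word addition and replacement need a bilingual
# dictionary search; the remaining operations are RAM-based.
NEEDS_DICTIONARY = {AdaptOp.WORD_ADD, AdaptOp.WORD_REP}
```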
One point we notice, however, is that the above-mentioned operations cannot
deal with translation divergences in an efficient way. Divergence occurs when
structurally similar sentences of the source language do not translate into sentences
that are similar in structure in the target language (Dorr, 1993). We therefore felt
that the study of divergence is an important aspect for any MT system. With respect
to an EBMT system the need arises for two reasons:
- The past example that is retrieved for carrying out the task of adaptation
has a normal translation, but the translation of the input sentence should involve
divergence.
- The translation of the retrieved example involves divergence, whereas the input
sentence should have a normal translation.
In this work we have made an in-depth study of divergence with respect to English to
Hindi translation. In this regard one may note that divergence is a highly language-
dependent phenomenon. Its nature may change with the source and target
languages under consideration. Although divergence has been studied extensively
with respect to translation between European languages (e.g. (Dorr et al., 2002),
(Watanabe et al., 2000)), very few studies on divergence may be found regarding
translations involving Indian languages. The only work that has come to our notice
is (Dave et al., 2002), in which the authors followed the classifications given in (Dorr,
1993) and tried to find examples of each of them with respect to English to Hindi
translation. In this regard it may be noted that Dorr described seven different
divergence types with respect to translations between European languages: structural,
categorical, conflational, promotional, demotional, thematic and lexical.
However, we find that not all the different divergence types explained in Dorr's work
apply with respect to Indian languages. In fact, we found very few (if any)
examples of thematic and promotional divergence with respect to English
to Hindi translation. On the other hand we identified three new types of divergence
that have not so far been cited in any other work on divergence. We have named these
divergences nominal, pronominal and possessional, respectively. We have
further observed that all the different divergence types (barring structural) for
which we found instances in English to Hindi translation may be further divided into
several sub-categories. Chapter 3 explains in detail the different divergence types and
their sub-types that we have observed with respect to English to Hindi translation,
and illustrates them with suitable examples. Some of these results have already been
presented in (Gupta and Chatterjee, 2003a) and (Gupta and Chatterjee, 2003b).
The presence of divergence examples in the example base makes straightforward
application of the above-mentioned adaptation scheme difficult. As mentioned earlier,
application of the operations discussed in Chapter 2 will not be able to generate
the correct translation if the input sentence requires a normal translation whereas
the translation of the retrieved example involves divergence, or vice versa. To overcome
this difficulty we suggest that the example base be partitioned into two parts:
one containing examples of normal translation, the other containing examples
of divergence, so that given an input sentence an EBMT system may retrieve an
example from the appropriate part of the example base. However, implementation
of the above scheme requires the design of algorithms for:
1) Partitioning the example base sentences.
2) Designing an efficient retrieval policy.
We attempt to answer the first by designing algorithms for the identification of
translation divergence, i.e. if an English sentence and its Hindi translation are given
as input, these algorithms will detect whether the translation involves any of the said
types of divergence. The remaining part of Chapter 3 discusses the different algorithms
that we have developed for identification of divergence from a given English-Hindi pair
of sentences. The identification algorithms consider the Functional
Tags (FT10) of the constituent words and the Syntactic Phrasal Annotated Chunks
(SPAC11) of the SL and TL sentences. When these two do not match for a source
language sentence and its translation in the TL, a divergence can be identified. With
respect to each divergence category and its sub-categories we have identified
the appropriate FTs and SPACs whose presence/absence indicates the possibility of
a certain divergence. By systematically analyzing the FTs and SPACs of the English
sentence and its Hindi translation, the algorithms arrive at a decision on whether
the translation involves any divergence. Thus the algorithm partitions the example
base into two parts: a Normal Example Base and a Divergence Example Base. Some of
these algorithms have already been presented in (Gupta and Chatterjee, 2003b).
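A highly simplified sketch of such a partitioning step is given below. The tag sequences and the expected mapping are hypothetical stand-ins for the actual FT/SPAC analysis of Chapter 3; the point is only the mechanism of flagging a pair as divergent when the expected and observed patterns do not match.

```python
def is_divergent(en_fts, hi_fts, expected_map):
    """Flag a translation pair as divergent when the functional-tag pattern
    expected from the English side does not match the Hindi side.
    en_fts/hi_fts: tag sequences; expected_map: EN tag -> expected HI tag."""
    mapped = [expected_map.get(tag, tag) for tag in en_fts]
    return mapped != hi_fts

def partition_example_base(pairs, expected_map):
    """Split (en_fts, hi_fts) pairs into normal and divergence example bases."""
    normal, divergent = [], []
    for en_fts, hi_fts in pairs:
        target = divergent if is_divergent(en_fts, hi_fts, expected_map) else normal
        target.append((en_fts, hi_fts))
    return normal, divergent
```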
To answer the second question, we feel that, given an input sentence, if it can be
decided a priori whether its translation will involve divergence, then the retrieval can
be made accordingly. To handle the situation when the translation of the input sentence
does not involve any divergence, we devise a cost-of-adaptation-based two-level
filtration scheme that enables quick retrieval from the normal example base12. Chapter 4
describes our scheme of retrieval from the divergence example base in situations involving
divergence. Here our primary attempt is to develop a procedure that, given an
input English sentence, decides whether its Hindi translation will involve any
type of divergence. Obviously, this decision has to be made before resorting to
the actual translation. Hence we call it prior identification of divergence. The
10Appendix B provides details on the FTs.
11SPAC structure is discussed in detail in Appendix C.
12This scheme is discussed in Chapter 5.
algorithm seeks evidence from the example base and WordNet. In this work we
have used WordNet 2.013 to measure the semantic similarity between the constituent
words of the input sentence and various words present in the example base sentences,
in order to arrive at a decision in this regard. The scheme works in the following way.
We first identified the roles of different Functional Tags (FTs) in causing divergence.
We observe, with respect to different divergence types and sub-types, that each FT
may have one of the following three roles:
1) Its presence is mandatory for the corresponding divergence (sub-)type to occur;
2) Its absence is mandatory for the corresponding divergence (sub-)type to occur;
3) Occurrence/non-occurrence of the divergence (sub-)type is not influenced by
the FT under consideration.
This knowledge is stored in the form of a table (Table 4.2) in Chapter 4. Given
an input sentence, the scheme first determines its constituent FTs. We have used
the ENGCG parser14 for parsing an input sentence and obtaining its FTs. This finding
is then compared against the above-mentioned knowledge base (Table 4.2) to identify
the set (D) of divergence types that may possibly occur in the translation of this
sentence. Further investigation is carried out to discard elements from the set D, so
that the divergence that may actually occur can be pinpointed. In this respect we
proceed in the following way. Corresponding to each divergence type we identify the
functional tag that is at the root of causing the divergence. We call it the
problematic FT corresponding to that particular divergence. Table 4.3 presents our
findings in this regard. Corresponding to each possible divergence (as found in D) the scheme
13http://www.cogsci.princeton.edu/cgi-bin/webwn
14http://www.lingsoft.fi/cgi-bin/engcg
works as follows. It first retrieves from the input sentence the constituent word
corresponding to the problematic FT of the divergence type under consideration. Then
the semantic similarity of this word with the relevant words of the example base
sentences is computed, and proximity in this semantic distance is used as a yardstick
for similarity measurement. Chapter 4 discusses this scheme in detail.
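The semantic-distance idea can be sketched as follows. The thesis uses WordNet 2.0; here a tiny hand-made hypernym map stands in for it, and the distance is simply the number of edges to the lowest common hypernym (all entries below are hypothetical illustrations).

```python
# Toy hypernym taxonomy standing in for WordNet 2.0 (hypothetical entries).
HYPERNYMS = {"soup": "food", "dish": "food", "food": "entity",
             "fan": "device", "device": "entity", "entity": None}

def path_to_root(word):
    """Chain of hypernyms from a word up to the taxonomy root."""
    path = [word]
    while HYPERNYMS.get(word) is not None:
        word = HYPERNYMS[word]
        path.append(word)
    return path

def semantic_distance(w1, w2):
    """Number of edges to the lowest common hypernym (inf if unrelated)."""
    p1, p2 = path_to_root(w1), path_to_root(w2)
    for d1, node in enumerate(p1):
        if node in p2:
            return d1 + p2.index(node)
    return float("inf")
```

Under this toy taxonomy, "soup" is semantically closer to "dish" (a sibling under "food") than to "fan".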
Finally, in Chapter 5 we look at how cost of adaptation may be used as a similar-
ity measurement scheme. It has been stated that no unique definition of similarity
exists for comparing sentences. Similarity between sentences may be viewed from
different perspectives. In this work, we have first considered the two most general
similarity schemes: syntactic similarity and semantic similarity. The ideas have
been borrowed from the domain of Information Technology (Manning and Schutze,
1999). According to the definition given therein, semantic similarity is measured on
the basis of commonality of words: the greater the number of words common between
two sentences, the more similar the two sentences are said to be.
However, it has been shown in (Chatterjee, 2001) that this
measurement of similarity is not always helpful from the EBMT point of view. For
example, it has been shown there that although the sentences The horse had a good
run. and The horse is good to run on. have most of their key words in common, the
structures of their Hindi translations are very different. Consequently, adapting
the translation of one of them to generate the translation of the other is
computationally demanding. On the other hand, syntactic similarity between two sentences
is measured on the basis of the commonality of morpho-functional tags between them.
In this case, adaptation may require a large number of constituent word replacement
(WR) operations. Each of these WR operations involves reference to some dictionary
for picking up the appropriate words in the target language. Typically the
dictionary access will involve accessing an external storage, and thereby will incur
significant computational cost. Thus a purely syntax-based similarity measurement
scheme may not be suitable for an EBMT system.
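The two baseline measures can be sketched with a Dice coefficient over word sets and over tag sets respectively; this is a generic formulation of word/tag commonality, not the exact metric of any cited work. On the horse example above, the word-based measure indeed scores the two sentences as highly similar, despite their very different Hindi structures.

```python
def dice(set_a, set_b):
    """Dice coefficient: 2|A n B| / (|A| + |B|)."""
    if not set_a and not set_b:
        return 1.0
    return 2 * len(set_a & set_b) / (len(set_a) + len(set_b))

def semantic_similarity(s1, s2):
    """Word-commonality measure over the two sentences' word sets."""
    return dice(set(s1.lower().split()), set(s2.lower().split()))

def syntactic_similarity(tags1, tags2):
    """Commonality of morpho-functional tags between two sentences."""
    return dice(set(tags1), set(tags2))
```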
In this work we therefore propose that from the EBMT perspective retrieval and
adaptation should be looked at in a unified way. In this chapter (i.e. Chapter
5) we investigate the feasibility of the above proposal in depth. In this respect we first
look into the overall adaptation operations in detail. We have already observed that
these operations are invoked successively to remove the discrepancies between the
input sentence and the retrieved example. These discrepancies, as we observe, may
be in the actual words, or in the overall structure of the sentences. For illustration,
suppose the input sentence is The boy eats rice everyday., whose Hindi translation
ladkaa har roz chaawal khaataa hai has to be generated. The nature of the
adaptation varies depending upon which example is retrieved from the example base. For
illustration:
a) If the retrieved example is The boy eats rice., the adaptation procedure needs
to apply a constituent word addition (WA) operation to take care of the adverb
everyday.
b) However, if the retrieved sentence is The boy plays cricket everyday. - ladkaa
roz cricket kheltaa hai, then the adaptation procedure needs to invoke two
constituent word replacement (WR) operations: to replace the Hindi of play,
i.e. khel, with the Hindi of eat, i.e. khaa, and cricket (cricket) with
chaawal (rice).
c) In case the retrieved example is The boy is eating rice., one constituent word
addition (WA) operation is required for the adverb everyday. Further, to take
care of verb conjugation, some morpho-word and suffix operations need to be
carried out. This is because the Hindi translation of The boy is eating rice. is
ladkaa (boy) chaawal (rice) khaa (eat) rahaa (..ing) hai (is), but the
translation of the input sentence The boy eats rice everyday. should be ladkaa
har roz chaawal khaataa hai. Thus the morpho-word rahaa, which is required
for the present continuous tense of the retrieved sentence, needs to be deleted.
Further, the suffix taa is to be added to the root main verb to get the required
present indefinite verb form of the input.
d) However, if the retrieved example is Does the boy eat rice?, then the adaptation
procedure needs to take care of the structural variation between the
interrogative form of the retrieved example and the affirmative form of the input
sentence.
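Case (c) above can be traced concretely. The sketch below applies the three operations as plain list manipulations; the helper names and the insertion position of the added words are ours, whereas a real system would locate them via functional slots.

```python
# Toy adaptation of case (c): turn the retrieved translation
#   "ladkaa chaawal khaa rahaa hai"        (The boy is eating rice.)
# into the required
#   "ladkaa har roz chaawal khaataa hai"   (The boy eats rice everyday.)
MORPHO_WORDS = {"rahaa", "rahii", "rahe", "hai", "hain", "ho"}

def delete_morpho_word(words, mw):
    """Morpho-word deletion (mw must be a known morpho-word)."""
    assert mw in MORPHO_WORDS
    return [w for w in words if w != mw]

def add_suffix(words, root, suffix):
    """Suffix addition on the given root word."""
    return [w + suffix if w == root else w for w in words]

def add_words(words, position, new_words):
    """Constituent word addition at a given position."""
    return words[:position] + new_words + words[position:]

retrieved = "ladkaa chaawal khaa rahaa hai".split()
step1 = delete_morpho_word(retrieved, "rahaa")  # drop present-continuous marker
step2 = add_suffix(step1, "khaa", "taa")        # present indefinite verb form
result = add_words(step2, 1, ["har", "roz"])    # add the adverb "everyday"
```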
Obviously, the greater the discrepancy between the retrieved example and
the input sentence, the greater the number of adaptation operations needed to
generate the desired translation. The above illustrations make certain points evident:
a) Adaptation operations are required for performing two general tasks: dealing
with constituent words (along with their suffixes and morpho-words), and dealing
with the overall structure of the sentence.
b) Each invocation of an adaptation operation pertains to a particular part of speech,
such as noun, verb, adverb, etc.
c) Of the ten adaptation operations (described earlier with respect to Chapter
2) only the WA and WR operations require dictionary15 searches. Since a
dictionary search typically involves accessing an external device (e.g. a hard disk),
it is computationally more expensive than the other operations
(e.g. constituent word deletion, morpho-word operations), which are purely
RAM16-based and hence computationally cheaper.
The above observations help us to proceed towards the intended goal
of using cost of adaptation as a measure of similarity. As a first step,
we suggest dividing the dictionary into several parts based on the
part of speech (POS) of the words. This division reduces the search time for each
invocation of the above operations. The cost-of-adaptation-based similarity
measurement approach then proceeds along the following lines:
a) We first estimate the average cost of each of the ten adaptation operations.
We observe that these costs depend on two major types of parameters. On
one hand they depend on certain linguistic aspects, such as the average length
of the sentences in both source and target languages, the number of suffixes
(used with different POS), the number of morpho-words, etc. On the other
hand, these costs are related to the machine on which the EBMT system is
running. Since we aim at analyzing the costs in a general way, we have treated
these machine-dependent costs as variables in all our analysis. For the
linguistic parameters, we used values obtained by analyzing about
15By dictionary we mean a source language to target language word dictionary available on-line.
16Random Access Memory
30,000 examples of English to Hindi translations. These examples were
collected from various sources, namely translation books, advertisement
materials, children's story books and government notices, which are freely available
in non-electronic form.
b) At the second step, we estimated the costs incurred in adapting various
functional tags17. In particular, we have considered the cost of adaptation due to
variations in active and passive verb morphology, subject/object, pre-modifying
adjectives, the genitive case and wh-family words. These costs are stored in various
tables in Section 5.4.
c) At the third step we have considered the costs of adaptation due to differences in
sentence structure. Here, we have considered four different sentence structures:
affirmative, negative, interrogative and negative-interrogative. These adaptation
costs too are stored in tabular form. Section 5.4 gives details of this analysis.
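The resulting retrieval criterion can be sketched as an argmin over estimated adaptation costs. The numeric costs below are purely illustrative placeholders for the modelled costs of Section 5.4, with dictionary-searching operations made an order of magnitude dearer than the RAM-based ones.

```python
# Illustrative relative costs: word addition/replacement need a dictionary
# search; deletion, morpho-word, suffix and copy operations are RAM-based.
COST = {"word_add": 10.0, "word_rep": 10.0,
        "word_del": 1.0, "morpho": 1.0, "suffix": 1.0, "copy": 0.1}

def adaptation_cost(ops):
    """Total estimated cost of an adaptation script."""
    return sum(COST[op] for op in ops)

def most_similar(candidates):
    """candidates: {example sentence: required adaptation operations}.
    The most similar example is the one cheapest to adapt."""
    return min(candidates, key=lambda ex: adaptation_cost(candidates[ex]))
```

Applied to the rice-eating examples above, the example needing a single word addition wins over the one needing two dictionary-backed replacements.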
Once these basic costs are modelled, we are in a position to experiment with cost
of adaptation as a similarity measure vis-à-vis the semantics- and syntax-based
similarity measurement schemes discussed above. Our experiments have clearly
established the efficiency of the proposed scheme over the others. Part of this work is also
presented in Gupta and Chatterjee (2003c). Two apparent drawbacks of this scheme are:
1) It may end up comparing a given input with all the example-base sentences
to ascertain the least cost of adaptation.
2) Another major question is whether the cost-of-adaptation scheme is efficient
enough to handle sentences that are structurally more complicated, e.g. complex
or compound sentences. It is a generally accepted fact that complex sentences
are difficult to handle in an MT system (Dorr et al., 1998), (Hutchins, 2003),
(Sumita, 2001), (Shimohata et al., 2003).
17In fact we worked on Functional Slots, which are more general than Functional Tags. This is discussed in detail in Section 2.2.
In order to deal with the first difficulty we have proposed a two-level filtration scheme.
This scheme helps in selecting a smaller number of examples from the example base,
which may subsequently be subjected to the rigorous treatment of determining their
costs of adaptation with respect to the given input. We have also justified that this
scheme does not leave out the sentences whose translations are easier to adapt for
the given input.
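A sketch of such a filtration cascade is given below. The two filters used here (a cheap sentence-length filter, then a tag-overlap filter) are hypothetical illustrations of the idea of successive shortlisting; the actual filters of Chapter 5 are derived from the cost-of-adaptation analysis.

```python
def level1(input_words, examples, max_len_diff=2):
    """First, cheap filter: keep examples of comparable length."""
    return [e for e in examples
            if abs(len(e.split()) - len(input_words)) <= max_len_diff]

def level2(input_tags, examples, tag_of, min_overlap=0.5):
    """Second filter: keep examples sharing enough functional tags."""
    kept = []
    for e in examples:
        overlap = len(set(tag_of[e]) & set(input_tags)) / len(set(input_tags))
        if overlap >= min_overlap:
            kept.append(e)
    return kept
```

Only the survivors of both filters would then undergo the full cost-of-adaptation computation.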
In this work we have provided a solution for the second problem too. We have
given rules for splitting a complex sentence into more than one simple sentence.
Translations of these simple sentences may then be generated by the EBMT system,
and these individual translations may then be combined to obtain the translation of the
given complex sentence input. If the cost-of-adaptation-based similarity measurement
scheme is applied for translating the simple sentences, then the cost of adaptation
of the complex sentence too can be estimated, by adding the individual costs to
the cost of combining the individual translations. Since the last operation is purely
algorithmic, its computational complexity can be easily computed, and hence the
overall cost of adaptation can be estimated. With respect to dealing with complex
sentences we have, however, imposed certain restrictions: we considered sentences with
only one subordinate clause, and the presence of a connecting word is also
mandatory. Evidently, more complicated complex sentence structures exist,
and further investigation is required to develop techniques for handling them
in an EBMT framework.
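The splitting idea, under the stated restrictions (one subordinate clause, a mandatory connecting word), can be sketched as follows; the connective list is illustrative and not the thesis's actual rule set.

```python
# Illustrative connectives; a sentence is split at the first one found.
CONNECTIVES = ("because", "when", "although", "if", "that")

def split_complex(sentence):
    """Split a complex sentence with one subordinate clause at its
    connecting word; return (main clause, connective, subordinate clause)."""
    words = sentence.rstrip(".").split()
    for i, w in enumerate(words):
        if w.lower() in CONNECTIVES and 0 < i < len(words) - 1:
            main = " ".join(words[:i]) + "."
            subordinate = " ".join(words[i + 1:]) + "."
            return main, w.lower(), subordinate
    return sentence, None, None  # no connective: treat as simple
```

Each simple clause can then be translated separately and the outputs recombined, with the combination cost added to the clause-level adaptation costs.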
In this connection we would like to mention that we have explained the cost of
adaptation with respect to a selected set of sentence structures, and for a selected set of
Functional Slots. Certainly many more variations exist with respect to these
parameters. Consequently, more work has to be done to form rules for handling
these variations. However, we feel that the work described in this research provides a
suitable guideline for further continuation of the research.
1.2 Some Critical Points
1) The aim of this research is not to construct an English to Hindi EBMT system.
Rather, our intention is to analyze the requirements that help in building an
effective EBMT system. The motivation behind this research came from two
major observations:
- Although some MT systems for translation from English to Hindi already
exist, the quality of their translation is often not up to the mark. This
prompted us to look into the process of MT to ascertain the inherent
difficulties.
- We have chosen EBMT as our preferred paradigm because of certain
advantages it has over other MT paradigms, such as RBMT and SBMT. One major
advantage of EBMT is that it requires neither a huge parallel corpus, as
required by SBMT, nor the framing of a large rule base, as required by
RBMT. Study of EBMT is therefore feasible for us, as we did not have
access to such linguistic resources.
2) In order to design our scheme we have studied about 30,000 English to Hindi
translation examples available off-line. Although now large volumes of English
-
English sentence: The horses have been running for one hour.
Tagged form: @DN> ART the, @SUBJ N PL horse %ghodaa%, @+FAUXV V PRES have, @-FAUXV V PCP2 be, @-FMAINV V PCP1 run %daudaa%, @ADVL PREP for, @QN> NUM CARD one %ek%, @
-
this research will be helpful for developing MT systems not only for Hindi but also for
other Indian languages (e.g. Bangla, Gujarati, Panjabi). All these languages suffer
from the same drawback - unavailability of linguistic resources. However, the demand
for developing MT systems from English to these languages is increasing with time,
not only because these are prominent regional languages of India, but also because they
are important minority languages in other countries such as the U.K. (Somers, 1997).
The studies made in this research should pave the way for developing EBMT systems
involving these languages as well.
-
Chapter 2
Adaptation in English to Hindi
Translation: A Systematic
Approach
-
2.1 Introduction
The need for an efficient and systematic adaptation scheme arises for modifying a
retrieved example, and thereby generating the required translation. This chapter is
devoted to the study of a systematic adaptation approach. Various approaches have
been pursued in dealing with the adaptation aspect of an EBMT system. Some of the
major approaches are described below.
1. Adaptation in Gaijin (Veale and Way, 1997) is modelled via two categories:
high-level grafting and keyhole surgery. High-level grafting deals with phrases.
Here an entire phrasal segment of the target sentence is replaced with another
phrasal segment from a different example. On the other hand, keyhole surgery
deals with individual words in an existing target segment of an example. Under
this operation words are replaced or morphologically fine-tuned to suit the
current translation task. For instance, suppose the input sentence is The girl
is playing in the park., and in the example base we have the following examples:
(a) The boy is playing.
(b) Rita knows that girl.
(c) It is a big park.
(d) Ram studies in the school.
For high-level grafting, the sentences (a) and (d) will be used. Then keyhole
surgery will be applied to put in the translations of the words park and
girl. These translations will be extracted from (b) and (c).
2. Shiri et al. (1997) have proposed another adaptation procedure. It is based on
three steps: finding the difference, replacing the difference, and smoothing the
output. The differing segments of the input sentence and the source template
are identified.
are identified. Translations of these different segments in the input sentence
are produced by rule-based methods, and these translated segments are fitted
into a translation template. The resulting sentence is then smoothed over by
checking for person and number agreement, and inflection mismatches. For
example, assume the input sentence and selected template are:
SI A very efficient lady doctor is busy.
St A lady doctor is busy.
Tt mahilaa chikitsak vyasta hai
The parsing process shows that A very efficient lady doctor is a
noun phrase, and so matches it with A lady doctor - ek mahilaa
chikitsak. A very efficient lady doctor is then translated as ek bahut yogya mahilaa
chikitsak by the rule-based noun phrase translation system. This is inserted
into Tt, giving the following: Tt: ek bahut yogya mahilaa chikitsak vyasta hai.
3. The ReVerb system (Collins, 1998) proposed the following adaptation scheme. Here
two different cases are considered: full-case adaptation and partial-case
adaptation. Full-case adaptation is employed when a problem is fully covered by the
retrieved example. Here the desired translation is created by substitution alone;
no addition or deletion is required for adapting TL to generate the translation
of SL. Here TL and SL denote the example-base target language sentence
and the input source language sentence, respectively. In this case five scenarios
are possible: SAME, ADAPT, IGNORE, ADAPTZERO and IGNOREZERO.
Partial-case adaptation is used when a single unifying example does not exist.
Here three more operations are required on top of the above five:
ADD, DELETE and DELETZERO.
-
Figure 2.1: The five possible scenarios in the SL-SL'-TL interface of partial-case matching
Note that there is a subtle difference between ADAPT and ADAPTZERO.
For ADAPT as well as for ADAPTZERO, both SL and SL' have the same links
but different chunks. If TL has words corresponding to the chunk that differs
between SL and SL', then the words in TL should be modified; this is
the case of ADAPT. On the other hand, if no corresponding chunk is present
in TL, then it is the case of ADAPTZERO, and in that case no work is
needed for adaptation. Similar subtleties may be observed between DELETE
and DELETZERO, and also between IGNORE and IGNOREZERO. The other
operations (such as SAME and ADD) have obvious interpretations. Figure 2.1
provides a conceptual view of partial-case matching.
4. Somers (2001) proposes adaptation from a case-based reasoning (CBR) point of view.
The simplest of the CBR adaptation methods is null adaptation, where no
changes are recommended. In more general situations various substitution
methods (e.g. reinstantiation, parameter adjustment) and transformation methods
(e.g. commonsense transformation and model-guided repair) may be applied.
For example, suppose the input sentence (I) and the retrieved example (R)
2.1. Introduction
are:
I: That old woman has died.
R: That old man has died. → wah boodhaa aadmii mar gayaa
To generate the desired translation, the translation of the word man, i.e. aadmii, is first re-
placed with the translation of woman, i.e. aurat, in R. This operation is called
reinstantiation. At this stage an intermediate translation wah boodhaa aurat
mar gayaa is obtained. To obtain the final translation wah boodhii aurat
mar gayii, the system must also change the adjective boodhaa to boodhii
and the word gayaa to gayii. This is called parameter adjustment.
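The two steps above can be sketched as follows. This is a minimal illustration of reinstantiation followed by parameter adjustment on the example in the text; the substitution tables are illustrative assumptions, not a real morphological analyser.

```python
# Illustrative lookup tables (assumptions for this example only).
REINSTANTIATE = {"aadmii": "aurat"}          # man -> woman
PARAM_ADJUST = {"boodhaa": "boodhii",        # masc. -> fem. adjective
                "gayaa": "gayii"}            # masc. -> fem. verb form

def adapt(retrieved_tl):
    tokens = retrieved_tl.split()
    # Step 1: reinstantiation -- swap the differing content word.
    tokens = [REINSTANTIATE.get(t, t) for t in tokens]
    # Step 2: parameter adjustment -- propagate gender agreement.
    tokens = [PARAM_ADJUST.get(t, t) for t in tokens]
    return " ".join(tokens)

print(adapt("wah boodhaa aadmii mar gayaa"))
# → wah boodhii aurat mar gayii
```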
5. The adaptation scheme proposed by McTait (2001) works in the following way.
Translation patterns that share lexical items with the input and partially cover
it are retrieved in a pattern matching procedure. From these, the patterns
whose SL side covers the SL input to the greatest extent (longest cover) are
selected. They are termed base patterns, as they provide sentential context in
the translation process. Intuitively, the greater the extent of the cover
provided by the base patterns, the more the context, and the less the
ambiguity and complexity in the translation process. If the SL side of the base
pattern does not fully cover the SL input, any unmatched segments are bound
to the variables on the SL side of the base pattern. The translations of the SL
segments bound to the SL variables of the base pattern are retrieved from the
remaining set of translation patterns, and the text fragments and variables on
the TL side of the base pattern form the translation string.
The following is a simple example. Given the source language input I: AIDS
control programme for Ethiopia, suppose the longest covering base pattern is:
D1: AIDS control programme for (...) → (...) ke liye AIDS contral smahaaroo.
To complete the match between I and the source language side of D1, a trans-
lation pattern containing the text fragment Ethiopia is required, i.e.
D2: (...) Ethiopia (...) → (...) ethiopia (...).
The TL translation T: ethiopia ke liye AIDS contral smahaaroo is generated
by recombining the text fragments. Since Ethiopia and ethiopia
are aligned in D2 on a 1:1 basis, and so are the variables in the base pattern D1, the
TL text fragment ethiopia is bound to the variable on the TL side of D1 to
produce T.
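The recombination step above can be sketched under strong simplifying assumptions: patterns are strings with a single (...) variable, and SL and TL variables align 1:1. The pattern strings follow the D1 and D2 of the example; the function name and data layout are assumptions for illustration.

```python
# McTait-style recombination sketch: bind the unmatched input segment to the
# base pattern's variable, translating it via an auxiliary pattern (D2).
def recombine(base_sl, base_tl, input_sentence, seg_translation):
    """Fill the base pattern's TL variable with the translated segment."""
    prefix, suffix = base_sl.split("(...)")
    assert input_sentence.startswith(prefix) and input_sentence.endswith(suffix)
    # the input text not covered by the base pattern's SL side
    segment = input_sentence[len(prefix):len(input_sentence) - len(suffix)]
    return base_tl.replace("(...)", seg_translation[segment.strip()])

D1_SL = "AIDS control programme for (...)"
D1_TL = "(...) ke liye AIDS contral smahaaroo"
D2 = {"Ethiopia": "ethiopia"}      # aligned fragments taken from pattern D2
print(recombine(D1_SL, D1_TL, "AIDS control programme for Ethiopia", D2))
# → ethiopia ke liye AIDS contral smahaaroo
```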
6. In HEBMT (Jain, 1995) examples are stored in an abstracted form for deter-
mining the structural similarity between the input sentence and the example
sentences. The target language sentence is generated using the target pat-
tern of the example that is at the least distance from the input sentence. The
system substitutes, in the target pattern, the corresponding translations of the syntactic units identified
by a finite state machine. Variations in the tense of the verb,
and variations due to number, gender etc. are taken care of at this stage for
generating the appropriate translation. Since this system translates from Hindi to
English, we explain its adaptation process with an example of Hindi
to English translation.
For example, suppose the input sentence is merii somavara ko jaa rahii hai
and it matches with the example sentence R: meraa dosta itavaar ko aayegaa.
Steps (a) to (f) below show the process of translation.
(a) merii somavara ko jaa rahii hai (input sentence)
(b) <snp>1 <npk2>2 <mv>3 (syntactic grouping)
(c) [Mary] [Monday] [go] (English translation of syntactic groups)
(d) <snp> <mv> {on} <npk2> (target pattern of example R)
(e) [Mary] [is going] on [Monday] (Translation after substitution)
(f) Mary is going on Monday (Final translated output)
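The substitution step of this process can be sketched as below. The slot names (snp, npk2, mv) follow the footnote in the text, but the target pattern format and the angle-bracket notation are illustrative assumptions, not HEBMT's actual internal representation.

```python
# HEBMT-style generation sketch: substitute the translations of the syntactic
# groups into the target pattern of the retrieved example.
def generate(target_pattern, group_translations):
    out = target_pattern
    for name, translation in group_translations.items():
        out = out.replace("<" + name + ">", translation)
    return out

# syntactic groups of the input, with their English translations
groups = {"snp": "Mary", "npk2": "Monday", "mv": "is going"}
pattern = "<snp> <mv> on <npk2>"        # assumed target pattern of example R
print(generate(pattern, groups))
# → Mary is going on Monday
```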
Many other EBMT systems are found in the literature, e.g. GEBMT (Brown, 1996,
1999, 2000, 2001), EDGAR (Carl and Hansen, 1999) and TTL (Guvenir and Cicekli,
1998). But overall, in our view, the adaptation procedures employed in different
EBMT systems primarily consist of four operations:
Copy, where the same chunk of the retrieved translation example is used in
the generated translation;
Add, where a new chunk is added to the retrieved translation example;
Delete, where some chunk of the retrieved example is deleted; and
Replace, where some chunk of the retrieved example is replaced with a new
one to meet the requirements of the current input.
The operations prescribed in different systems vary in the chunks they deal with.
Depending upon the case it may be a phrase, a word or a sub-word (e.g. declensional
suffix).
1snp: noun, adj+noun, noun+kaa+noun. 2npk2: noun+ko. 3mv: verb-part.
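These four operations can be illustrated mechanically by diffing the retrieved translation against the required one. The sketch below uses Python's difflib; the mapping of difflib opcodes to the four operation names is our illustration, not a procedure from any of the systems surveyed. The example pair is the squirrel/elephant example used later in this chapter.

```python
# Recover Copy/Add/Delete/Replace operations by diffing two token sequences.
import difflib

OP_NAME = {"equal": "Copy", "insert": "Add",
           "delete": "Delete", "replace": "Replace"}

def adaptation_ops(retrieved, required):
    """List (operation, old chunk, new chunk) triples at the word level."""
    a, b = retrieved.split(), required.split()
    sm = difflib.SequenceMatcher(a=a, b=b)
    return [(OP_NAME[tag], a[i1:i2], b[j1:j2])
            for tag, i1, i2, j1, j2 in sm.get_opcodes()]

for op, old, new in adaptation_ops("haathii phal khaa rahaa thaa",
                                   "gilharii moongphalii khaa rahaa thaa"):
    print(op, old, new)
```

Here the diff yields a Replace of the first two words followed by a Copy of the remaining three, matching the constituent word replacement example discussed in Section 2.2.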
With respect to English and Hindi, we find that both languages depend
heavily on suffixes for verb morphology, changing numbers from singular to plu-
ral and vice versa, case endings, etc. Appendix A provides detailed descriptions
of various Hindi suffixes. Keeping the above in view, we differentiate the adap-
tation operations into two groups: word based and suffix based. The word based
operations are further subdivided into two categories: constituent word based and
morpho-word based. Thus the adaptation scheme proposed here consists of ten op-
erations: Copy (CP), Constituent word deletion (WD), Constituent word addition
(WA), Constituent word replacement (WR), Morpho-word deletion (MD), Morpho-
word addition (MA), Morpho-word replacement (MR), Suffix addition (SA), Suffix
deletion (SD) and Suffix replacement (SR). Section 2.2 illustrates the roles of
these operations in adapting a retrieved translation example.
The advantage of the above classification of adaptation operations is twofold.
Firstly, it helps in identifying the specific task that has to be carried out in the step-
by-step adaptation for a given input. Secondly, it helps in measuring the average
cost of each of the above operations in a meaningful way, which in turn helps in
estimating the total adaptation cost for a given sentence. This estimate can be used
as a tool for similarity measurement between an input and the stored examples.
These issues are discussed in Chapter 5.
2.2 Description of the Adaptation Operations
The ten adaptation operations mentioned above are described below.
1. Constituent Word Replacement (WR): One may get the translation of the
input sentence by replacing some words in the retrieved translation example.
Suppose the input sentence is: The squirrel was eating groundnuts., and the
most similar example retrieved by the system (along with its Hindi translation)
is: The elephant was eating fruits. → haathii phal khaa rahaa thaa. The
desired translation may be generated by replacing haathii with the Hindi of
squirrel, i.e. gilharii, and replacing phal with the Hindi of groundnuts,
i.e. moongphalii. These are examples of the operation of constituent word
replacement.
2. Constituent Word Deletion (WD): In some cases one may have to delete some
words from the translation example to generate the required translation. For
example, suppose the input sentence is: Animals were dying of thirst. If the
retrieved translation example is: Birds and animals were dying of thirst. →
pakshii aur pashu pyaas se mar rahe the, then the desired translation can
be obtained by deleting pakshii aur (i.e. the Hindi of birds and) from the
retrieved translation. Thus the adaptation here requires two constituent word
deletions.
3. Constituent Word Addition (WA): This operation is the opposite of constituent
word deletion. Here the addition of some extra words to the retrieved trans-
lation example is required for generating the translation. For illustration, one
may consider the example given above with the roles of the input and retrieved
sentences reversed.
4. Morpho-word Replacement (MR): In this case one morpho-word is replaced by
another morpho-word in the retrieved translation example. Consider a case
when the input sentence is: The squirrel was eating groundnuts., and the
retrieved example is: The squirrel is eating groundnuts. → gilharii moongfalii
khaa rahii hai. In order to take care of the variation in tense, the morpho-
word hai is to be replaced with thii. This is an example of morpho-word
replacement.
5. Morpho-word Deletion (MD): Here some morpho-word(s) are deleted from the
retrieved translation example. For illustration, if the input sentence is He
eats rice., and the retrieved example is: He is eating rice. → wah chaawal
khaa rahaa hai, then to obtain the desired translation4, first the morpho-word
rahaa is to be deleted from the retrieved translation example.
6. Morpho-word Addition (MA): This is the opposite case of morpho-word dele-
tion. Here some morpho-words need to be added in the retrieved example in
order to generate the required translation.
7. Suffix Replacement (SR): Here the suffix attached to some constituent word
of the retrieved sentence is replaced with a different suffix to meet the current
translation requirements. This may happen with respect to a noun, adjective,
verb, or case ending. For illustration,
(a) To change the number of nouns
Boy (ladkaa) → Boys (ladke)
The suffix aa is replaced with e in order to get the plural form in
Hindi.
(b) Change of adjectives
Bad boy (buraa ladkaa) → Bad girl (burii ladkii)
The suffix aa is replaced with ii to get the adjective burii.
4Of course, the final translation will be obtained by adding the suffix taa to the word khaa.
(c) Morphological changes in verb
He reads. (wah padtaa hai) → She reads. (wah padtii hai)
The suffix taa is replaced with tii to get the verb padtii, which is
required to indicate that the subject is feminine.
(d) Morphological changes due to case ending
boy (ladkaa) → from boy (ladke se)
room (kamraa) → in room (kamre mein)
The suffix aa is replaced with e to get the nouns ladke and kamre.
8. Suffix Deletion (SD): By this operation the suffix attached to some constituent
word may be removed, and thereby the root word may be obtained. This
operation is illustrated in the following examples:
(a) To change the number of nouns
women (aauraten) → woman (aaurat)
The suffix en is deleted from aauraten to get the Hindi translation
of woman.
(b) Morphological changes in verb
He reads. (wah padtaa hai) → He is reading. (wah pad rahaa hai)
The suffix taa is deleted from padtaa to get the root form pad of
the English verb read.
(c) Morphological changes due to case ending
in the houses (gharon mein) → houses (ghar)
in words (shabdon mein) → words (shabd)
The suffix on is deleted from gharon and shabdon to get the Hindi
translations of the nouns houses and words, respectively.
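The suffix operations above reduce to simple string manipulations, as the following toy sketch shows. Plain suffix stripping is an illustrative simplification; real Hindi morphology would need a proper analyser.

```python
# Toy sketch of suffix replacement (SR) and suffix deletion (SD).
def suffix_replace(word, old, new):
    """SR: replace suffix `old` with suffix `new`."""
    assert word.endswith(old)
    return word[:-len(old)] + new

def suffix_delete(word, suffix):
    """SD: strip a suffix to recover the root form."""
    assert word.endswith(suffix)
    return word[:-len(suffix)]

print(suffix_replace("ladkaa", "aa", "e"))     # → ladke  (boy -> boys)
print(suffix_replace("padtaa", "taa", "tii"))  # → padtii (he reads -> she reads)
print(suffix_delete("gharon", "on"))           # → ghar   (in the houses -> houses)
```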
9. Suffix Addition (SA): Here a suffix is added to some constituent word in the
retrieved example. Note that here the word concerned is in its root form in
the retrieved example. One may consider the examples given above, with the
roles of the input and retrieved sentences reversed, as suitable examples of the
suffix addition operation.
10. Copy (CP): When some word (with or without suffix) of the retrieved example
is retained in toto in the required translation then it is called a copy operation.
Figure 2.2 provides an example of adaptation using the above operations. In this
example the input sentence is He plays football daily., and the retrieved translation
example is:
They are playing football. → we football khel rahe hain
(They) (football) (play) (...ing) (are)
The translation to be generated is: wah roz football kheltaa hai. When adaptation is
carried out using both word and suffix operations, the steps look as
given in Figure 2.2. In this respect one may note that Hindi is a free word order language,
and consequently the position of the adverb is not fixed. Hence the above input sentence
may have different Hindi translations:
wah roz football kheltaa hai
wah football roz kheltaa hai
roz wah football kheltaa hai
While implementing an EBMT system one has to stick to some specific format.
The adverb will be added according to the format adopted by the system.
Input:      we football khel rahe hain
Operations: WR WA CP SA MD MR
Output:     wah roz football kheltaa hai
Figure 2.2: Example of Different Adaptation Operations
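The steps of Figure 2.2 can be traced in code as below. The (operation, argument) encoding is our illustrative assumption; the operation semantics follow the definitions in this section, and CP needs no explicit action here because football is carried over unchanged.

```python
# Step-by-step sketch of the adaptation in Figure 2.2.
steps = [
    ("WR", ("we", "wah")),       # constituent word replacement
    ("WA", ("roz", 1)),          # constituent word addition at position 1
    ("SA", ("khel", "taa")),     # suffix addition to the root verb
    ("MD", "rahe"),              # morpho-word deletion
    ("MR", ("hain", "hai")),     # morpho-word replacement
]

def apply_steps(sentence, steps):
    tokens = sentence.split()
    for op, arg in steps:
        if op in ("WR", "MR"):                    # replace a token
            tokens = [arg[1] if t == arg[0] else t for t in tokens]
        elif op == "WA":                          # insert a token
            tokens.insert(arg[1], arg[0])
        elif op == "SA":                          # attach a suffix
            tokens = [t + arg[1] if t == arg[0] else t for t in tokens]
        elif op == "MD":                          # delete a morpho-word
            tokens = [t for t in tokens if t != arg]
    return " ".join(tokens)

print(apply_steps("we football khel rahe hain", steps))
# → wah roz football kheltaa hai
```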
Which adaptation operations will be required to translate a given input sentence
depends upon the translation example retrieved from the example base. A variety
of examples may be adapted to generate the desired translation, but obviously with
varying computational costs. For efficient performance, an EBMT system therefore
needs to retrieve an example that can be adapted to the desired translation with
the least cost. This brings in the notion of similarity among sentences. The proposed
adaptation procedure has the advantage that it provides a systematic way of evalu-
ating the overall adaptation cost. This estimated cost may then be used as a good
measure of similarity for appropriate retrieval from the example base. How cost of
adaptation may be used as a yardstick to measure similarity between sentences will
be described in Chapter 5.
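The retrieval idea can be sketched as follows: assign each operation an average cost and select the example whose adaptation is cheapest. The cost values and the second candidate example below are placeholders for illustration, not the thesis's measured figures.

```python
# Sketch: adaptation cost as a similarity measure for retrieval.
# The per-operation costs are assumed placeholder values.
COST = {"CP": 0.0, "WR": 2.0, "WA": 2.5, "WD": 1.5,
        "MR": 1.0, "MA": 1.2, "MD": 0.8, "SR": 0.5, "SA": 0.5, "SD": 0.4}

def adaptation_cost(ops):
    """Total estimated cost of a sequence of adaptation operations."""
    return sum(COST[op] for op in ops)

# operations assumed necessary to adapt each candidate example
candidates = {
    "They are playing football.": ["WR", "WA", "CP", "SA", "MD", "MR"],
    "He is playing football.":    ["WA", "CP", "CP", "SA", "MD", "MR"],
}
# retrieve the example with the least estimated adaptation cost
best = min(candidates, key=lambda s: adaptation_cost(candidates[s]))
print(best)
# → He is playing football.
```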
Here our aim is to count the number of adaptation operations required in adapt-
ing a retrieved example to generate the translation of a given input. Obviously, de-
pending upon the situation one has to apply some adaptation operations for changing
different functional slots5 (Singh, 2003), such as subject, object and verb.
Also certain operations are required for changing the kind of sentence, e.g.
5The following example illustrates the difference between functional slots and functional tags. Consider the sentence The old man is weak. The subject of this sentence is the noun phrase The old man. It consists of three functional tags, viz. @DN>, @AN> and @SUBJ, stating that the is a determiner, old is an adjective, and man is the subject. But, as mentioned above, the entire noun phrase plays the role of subject of the sentence. Thus the functional slot for this phrase is the subject slot. Note that a particular functional slot may have a variable number of words. The sequence of functional slots in a sentence provides the sentence pattern. The difference between various tags (e.g. POS tag, functional tag) is explained in detail in Appendix B.
affirmative to negative, negative to interrogative etc. Table 2.2 contains the nota-
tions for the roles of different functional slots and operators, which are required for
the subsequent discussion.
Operators   Role of operators
< >         For a functional slot or part of speech and its transformation.
&           Both functional slots or parts of speech and their transformations should be present.
or          Either the first slot/tag, the second slot/tag, or both.
{ }         For a non-obligatory functional tag/slot or for an optional adaptation operation.
[ ]         For the property of a functional slot/tag.
Functional Slot   Role of functional slot
(…)     Linking verb. Linking verbs in English are: are, am, was, were, become, seem etc., and in Hindi are: hai, hain, ho, thaa, the etc.
(…)     Auxiliary verb (if any) and main verb of the sentence
(…)     Auxiliary verb
(…)     Main verb
(…)     Subject
(…)     Object
(…)     First object
(…)     Second object
(…)     Subjective complement
PCP1    -ing verb form other than the main verb
PCP2    -ed or -en verb forms other than the main verb
(…)     to-infinitive form of verb
(…)     Adverb
(…)     Adjective phrase
(…)     Preposition phrase
(…)     Preposition
Table 2.2: Notations Used in Sentence Patterns
The following sections describe how many such operations are required in dif-
ferent cases. In particular we consider the following functional slots and sentence
kinds:
1. Tense and form of the verb. Since there are three tenses (viz. Present,
Past and Future) and four forms (Indefinite, Continuous, Perfect, and Perfect
Continuous), in all one can have 12 different active verb structures, along with
the corresponding passive verb structures.
2. Subject/object functional slot. Variations in the subject/object functional slot
may happen in many different ways, such as Proper Noun, Common Noun
(Singular or Plural), Pronoun, PCP1 form6 and PCP2 form7. We also study varia-
tion in pre-modifier adjectives, genitive case, quantifier and determiner tags.
3. Study of wh-family interrogative sentences.
4. Kind of sentence: whether the sentence is affirmative, negative, interrogative
or negative interrogative.
Systematic study of these patterns, and their components helps in estimating
the adaptation costs between them.
2.3 Study of Adaptation Procedure for Morpho-
logical Variation of Active Verbs
Hindi verb morphological variations depend on four aspects: the gender, number and
person of the subject, and the tense (and form) of the sentence. All these variations affect
6-ing verb form other than the main verb. 7-ed or -en verb forms other than the main verb.
the adaptation procedure. In Hindi, these conjugations are realized by using suffixes
attached to the root verbs, and/or by adding some auxiliary verbs (see Table A.3 of
Appendix A). Since there are 12 different structures (depending upon the tense and
form), the adaptation scheme should have the capability to adapt any one of them
to any input type. Hence altogether 12 × 12, i.e. 144, different combinations are
possible. However, Table A.3 (Appendix A) shows that in Hindi, perfect cont