-
CONTRIBUTIONS TO ENGLISH TO HINDI
MACHINE TRANSLATION USING
EXAMPLE-BASED APPROACH
DEEPA GUPTA
DEPARTMENT OF MATHEMATICS
INDIAN INSTITUTE OF TECHNOLOGY DELHI
HAUZ KHAS, NEW DELHI-110016, INDIA
JANUARY, 2005
-
CONTRIBUTIONS TO ENGLISH TO HINDI
MACHINE TRANSLATION USING
EXAMPLE-BASED APPROACH
by
DEEPA GUPTA
Department of Mathematics
Submitted
in fulfilment of the requirement of
the degree of
Doctor of Philosophy
to the
Indian Institute of Technology Delhi
Hauz Khas, New Delhi-110016, India
January, 2005
-
Dedicated to
My Parents,
My Brother Ashish and
My Thesis Supervisor...
-
Certificate
This is to certify that the thesis entitled Contributions to English to Hindi
Machine Translation Using Example-Based Approach submitted by Ms.
Deepa Gupta to the Department of Mathematics, Indian Institute of Technology
Delhi, for the award of the degree of Doctor of Philosophy, is a record of bona fide
research work carried out by her under my guidance and supervision.
The thesis has reached the standards fulfilling the requirements of the regulations
relating to the degree. The work contained in this thesis has not been submitted to
any other university or institute for the award of any degree or diploma.
Dr. Niladri Chatterjee
Assistant Professor
Department of Mathematics
Indian Institute of Technology Delhi
Delhi (INDIA)
-
Acknowledgement
To say that this thesis is mine alone would be untrue. It is like a dream come true. There are people in this world, some of them wonderful, who helped make this dream the product you are holding in your hands. I would like to thank all of them, and in particular:
Dr. Niladri Chatterjee - mentor, guru and friend - taught me the basics of research and stayed with me right till the end. His efforts, comments, advice and ideas developed my thinking and improved my presentation. Without his constant encouragement, keen interest, inspiring criticism and invaluable guidance, I would not have accomplished this work. I admit that his efforts deserve far more acknowledgement than is expressed here.
I acknowledge and thank the Indian Institute of Technology Delhi and the Tata Infotech Research Lab, which funded this research. I sincerely thank all the faculty members of the Department of Mathematics, especially Prof. B. Chandra and Dr. R. K. Sharma, for their continuous moral support and help. I
thank my SRC members, Prof. Saroj Kaushik and Prof. B. R. Handa, for their time
and efforts. I also thank the department administrative staff for their assistance. I
extend my thanks to Prof. R. B. Nair and Dr. Wagish Shukla of IIT Delhi, and
Prof. Vaishna Narang, Prof. P. K. Pandey, Prof. G. V. Singh, Dr. D. K. Lobiyal,
and Dr. Girish Nath Jha of Jawaharlal Nehru University Delhi, for the enlightening
discussions on basics of languages.
I would like to express my sincere thanks to my friends Priya and Dharmendra
for many fruitful discussions regarding my research problem. I thank Mr. Gaurav
-
Kashyap for helping me in the implementation of the algorithms. In particular, I
would like to thank Inderdeep Singh, for his help in writing some part of the thesis.
I want to give special thanks to my friends, Sonia, Pranita and Nutan, for helping
me in both good and bad times. I would like to thank Prabhakhar for his brotherly
support. I extend my thanks to Manju, Anita, Sarita, Subhashini and Anju for
cheering me, always.
Shailly and Geeta - amazing friends who read the manuscript and gave honest comments. Both of them also stayed with me through the process and handled me, and sometimes my out-of-control emotions, so well. I especially wish to thank Geeta for letting me stay in her hostel room, and for her wonderful help when I fractured my leg, at a time when we had known each other for only a month. I wish to acknowledge Krishna for his constant help, both academic and non-academic, and his continuous encouragement.
I convey my sincere regards to my parents, and brothers for the sacrifices they have
made, for the patience they have shown, and for the love and blessing they have
showered. I thank Arun for his moral support. Most important of all, I would like to express my profound sense of gratitude and appreciation to my sister Neetu. Her irrational and unbreakable belief in me bordered on craziness at times.
I cannot avoid mentioning my friend Sharad, who deserves more than a little acknowledgement. His constant inspiration and untiring support have sustained my confidence throughout this work.
Finally, I thank GOD for everything.
Deepa Gupta
-
Abstract
This research focuses on the development of an Example-Based Machine Translation (EBMT) system for English to Hindi. Development of a machine translation (MT) system typically demands a large volume of computational resources. For example, rule-based MT systems require extraction of syntactic and semantic knowledge in the form of rules, while statistics-based MT systems require a huge parallel corpus containing sentences in the source language and their translations in the target language. The requirement for such computational resources is much smaller for EBMT. This makes development of EBMT systems feasible for English to Hindi translation, where availability of large-scale computational resources is still scarce. The primary motivation for this work comes from the following:
a) Although a small number of English to Hindi MT systems are already available,
the outputs produced by them are not of high quality all the time. Through
this work we intend to analyze the difficulties that lead to this below par
performance, and try to provide some solutions for them.
b) There are several other major languages (e.g., Bengali, Punjabi, Gujarati) in the Indian subcontinent. Demand for developing MT systems from English to these languages is increasing rapidly, but at the same time the development of computational resources in these languages is still in its infancy. Since many of these languages are similar to Hindi, both syntactically and lexically, the research carried out here should help in developing MT systems from English to these languages as well.
i
-
The major contributions of this research may be described as follows:
1) Development of a systematic adaptation scheme. We propose an adaptation scheme consisting of ten basic operations. These operations work not only at the word level, but at the suffix level as well. This makes adaptation less expensive in many situations.
2) Study of Divergence. We observe that the occurrence of divergence causes major difficulty for any MT system. In this work we make an in-depth study of the different types of divergence, and categorize them.
3) Development of a Retrieval scheme. We propose a novel approach for measuring similarity between sentences. We suggest that a retrieval strategy, with respect to an EBMT system, will be most efficient if it measures similarity on the basis of the cost of adaptation. In this work we provide a complete framework for an efficient retrieval scheme on the basis of our studies on divergence and the cost of adaptation.
4) Dealing with Complex sentences. Handling complex sentences is generally considered difficult for an MT system. In this work we propose a split-and-translate technique for translating complex sentences under an EBMT framework.
We feel that the overall scheme proposed in this research will pave the way for
developing an efficient EBMT system for translating from English to Hindi. We
hope that this research will also help development of MT systems from English to
other languages of the Indian subcontinent.
ii
-
Contents
1 Introduction 1
1.1 Description of the Work Done and Summary of the Chapters . . . . . 6
1.2 Some Critical Points . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2 Adaptation in English to Hindi Translation: A Systematic Ap-
proach 23
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.2 Description of the Adaptation Operations . . . . . . . . . . . . . . . 29
2.3 Study of Adaptation Procedure for Morphological Variation of Active
Verbs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.3.1 Same Tense Same Verb Form . . . . . . . . . . . . . . . . . . 38
2.3.2 Different Tenses Same Verb Form . . . . . . . . . . . . . . . . 42
2.3.3 Same Tense Different Verb Forms . . . . . . . . . . . . . . . . 46
2.3.4 Different Tenses Different Verb Forms . . . . . . . . . . . . . . 48
2.4 Adaptation Procedure for Morphological Variation of Passive Verbs . 51
2.5 Study of Adaptation Procedures for Subject/Object Functional Slot . . 56
2.5.1 Adaptation Rules for Variations in the Morpho Tags of @DN> 59
-
2.5.2 Adaptation Rules for Variations in the Morpho Tags of @GN> 60
2.5.3 Adaptation Rules for Variations in the Morpho Tags of @QN . 64
2.5.4 Adaptation Rules for Variations in the Morpho Tags of Pre-
modifier Adjective @AN> . . . . . . . . . . . . . . . . . . . . 64
2.5.5 Adaptation Rules for Variations in the Morpho Tags of @SUB 69
2.6 Adaptation of Interrogative Words . . . . . . . . . . . . . . . . . . . 73
2.7 Adaptation Rules for Variation in Kind of Sentences . . . . . . . . . . 83
2.8 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
3 An FT and SPAC Based Divergence Identification Technique From
Example Base 87
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
3.2 Divergence and Its Identification: Some Relevant Past Work . . . . . 89
3.3 Divergences and Their Identification in English to Hindi Translation . 96
3.3.1 Structural Divergence . . . . . . . . . . . . . . . . . . . . . . . 97
3.3.2 Categorial Divergence . . . . . . . . . . . . . . . . . . . . . . 100
3.3.3 Nominal Divergence . . . . . . . . . . . . . . . . . . . . . . . 104
3.3.4 Pronominal Divergence . . . . . . . . . . . . . . . . . . . . . . 107
3.3.5 Demotional Divergence . . . . . . . . . . . . . . . . . . . . . . 111
3.3.6 Conflational Divergence . . . . . . . . . . . . . . . . . . . . . 117
3.3.7 Possessional Divergence . . . . . . . . . . . . . . . . . . . . . 121
3.3.8 Some Critical Comments . . . . . . . . . . . . . . . . . . . . . 131
iv
-
3.4 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
4 A Corpus-Evidence Based Approach for Prior Determination of
Divergence 135
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
4.2 Corpus-Based Evidences and Their Use in Divergence Identification . 136
4.2.1 Roles of Different Functional Tags . . . . . . . . . . . . . . . . 138
4.3 The Proposed Approach . . . . . . . . . . . . . . . . . . . . . . . . . 147
4.4 Illustrations and Experimental Results . . . . . . . . . . . . . . . . . 155
4.4.1 Illustration 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
4.4.2 Illustration 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
4.4.3 Illustration 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
4.4.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . 166
4.5 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
5 A Cost of Adaptation Based Scheme for Efficient Retrieval of Trans-
lation Examples 171
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
5.2 Brief Review of Related Past Work . . . . . . . . . . . . . . . . . . . 171
5.3 Evaluation of Cost of Adaptation . . . . . . . . . . . . . . . . . . . . 178
5.3.1 Cost of Different Adaptation Operations . . . . . . . . . . . . 182
5.4 Cost Due to Different Functional Slots and Kind of Sentences . . . . 185
v
-
5.4.1 Costs Due to Variation in Kind of Sentences . . . . . . . . . . 186
5.4.2 Cost Due to Active Verb Morphological Variation . . . . . . . 187
5.4.3 Cost Due to Subject/Object Functional Slot . . . . . . . . . . 192
5.4.4 Use of Adaptation Cost as a Measure of Similarity . . . . . . . 197
5.5 The Proposed Approach vis-à-vis Some Similarity Measurement Schemes
198
5.5.1 Semantic Similarity . . . . . . . . . . . . . . . . . . . . . . . . 198
5.5.2 Syntactic Similarity . . . . . . . . . . . . . . . . . . . . . . . . 201
5.5.3 A Proposed Approach: Cost of Adaptation Based Similarity . 203
5.5.4 Drawbacks of the Proposed Scheme . . . . . . . . . . . . . . . 211
5.6 Two-level Filtration Scheme . . . . . . . . . . . . . . . . . . . . . . . 213
5.6.1 Measurement of Structural Similarity . . . . . . . . . . . . . . 214
5.6.2 Measurement of Characteristic Feature Dissimilarity . . . . . . 217
5.7 Complexity Analysis of the Proposed Scheme . . . . . . . . . . . . . 222
5.8 Difficulties in Handling Complex Sentences . . . . . . . . . . . . . . . 226
5.9 Splitting Rules for Converting Complex Sentences into Simple Sentences . . 229
5.9.1 Splitting Rule for the Connectives when, where, when-
ever and wherever . . . . . . . . . . . . . . . . . . . . . . . 231
5.9.2 Splitting Rule for the Connective who . . . . . . . . . . . . 241
5.10 Adaptation Procedure for Complex Sentence . . . . . . . . . . . . . . 253
5.10.1 Adaptation Procedure for Connectives when, where, when-
ever and wherever . . . . . . . . . . . . . . . . . . . . . . . 254
vi
-
5.10.2 Adaptation Procedure for Connective who . . . . . . . . . . 256
5.11 Illustrations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260
5.11.1 Illustration 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . 260
5.11.2 Illustration 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . 262
5.12 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . 264
6 Discussions and Conclusions 267
6.1 Goals and Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . 267
6.2 Contributions Made by This Research . . . . . . . . . . . . . . . . . . 268
6.3 Possible extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
6.4 Epilogue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
6.4.1 Pre-editing and Post-editing . . . . . . . . . . . . . . . . . . . 274
6.4.2 Evaluation Measures of Machine Translation . . . . . . . . . . 276
Appendices 280
A 281
A.1 English and Hindi Language Variations . . . . . . . . . . . . . . . . . 281
A.2 Verb Morphological and Structure Variations . . . . . . . . . . . . . . 285
A.2.1 Conjugation of Root Verb . . . . . . . . . . . . . . . . . . . . 286
B 291
B.1 Functional Tags . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
B.2 Morpho Tags . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 294
vii
-
C 299
C.1 Definitions of Some Non-typical Functional Tags and SPAC Structures . . 299
D 303
D.1 Semantic Similarity . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303
E 305
E.1 Cost Due to Adapting Pre-modifier Adjective to Pre-modifier Adjective . . 305
Bibliography 308
viii
-
List of Figures
1.1 An Example Sentence with Its Morpho-Functional Tags . . . . . . . . 20
2.1 The five possible scenarios in the SL SL TL interface of partial
case matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.2 Example of Different Adaptation Operations . . . . . . . . . . . . . . 34
2.3 Some Typical Sentence Structures . . . . . . . . . . . . . . . . . . . . 83
3.1 Algorithm for Identification of Structural Divergence . . . . . . . . . 99
3.2 Correspondence of SPACs of E and H for Identification of Structural
Divergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
3.3 Algorithm for Identification of Categorial Divergence . . . . . . . . . 103
3.4 Correspondence of SPACs for the Categorial Divergence Example of
Sub-type 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
3.5 Algorithm for Identification of Nominal Divergence . . . . . . . . . . 106
3.6 Correspondence of SPAC E and SPAC H of Nominal Divergence of
Sub-type 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
3.7 Algorithm for Identification of Pronominal Divergence . . . . . . . . 110
-
3.8 Correspondence of SPAC E and SPAC H of Pronominal Divergence
of Sub-type 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
3.9 Algorithm for Identification of Demotional Divergence . . . . . . . . . 114
3.10 Correspondence of SPAC E and SPAC H for Demotional Sub-type 4 115
3.11 SPAC Correspondence for Demotional Divergence of Sub-type 1 . . . 116
3.12 Algorithm for Identification of Conflational Divergence . . . . . . . . 120
3.13 Correspondence of SPAC E and SPAC H for Conflational Divergence
of Sub-type 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
3.14 Algorithm for Identification of Possessional Divergence . . . . . . . . 129
3.15 Correspondence of SPAC E and SPAC H for Possessional Divergence
of Sub-type 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
3.16 Correspondence of SPAC E and SPAC H for Possessional Divergence
of Sub-type 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
4.1 Schematic Diagram of the Proposed Algorithm . . . . . . . . . . . . . 153
4.2 Continuation of the Figure 4.1 . . . . . . . . . . . . . . . . . . . . . . 154
5.1 Schematic View of Module 1 for Identification of Complex Sentence
with Connective any of when, where, whenever, or wherever . 232
5.2 Schematic View of Module 2 . . . . . . . . . . . . . . . . . . . . . . . 237
5.3 Schematic View of Module 3 . . . . . . . . . . . . . . . . . . . . . . . 240
5.4 Schematic View of Module 1 for Identification of Complex Sentence
with Connective who . . . . . . . . . . . . . . . . . . . . . . . . . . 244
x
-
5.5 Schematic View of the SUBROUTINE SPLIT . . . . . . . . . . . . . 246
5.6 Schematic View of Module 2 . . . . . . . . . . . . . . . . . . . . . . . 247
5.7 Schematic View of Module 3 . . . . . . . . . . . . . . . . . . . . . . . 249
5.8 Schematic View of Module 4 . . . . . . . . . . . . . . . . . . . . . . . 250
xi
-
List of Tables
1.1 Output of AnglaHindi and Shakti MT System . . . . . . . . . . 5
2.2 Notations Used in Sentence Patterns . . . . . . . . . . . . . . . . . . 35
2.3 Adaptation Operations of Verb Morphological Variations in Present
Indefinite to Present Indefinite . . . . . . . . . . . . . . . . . . . . . . 39
2.4 Adaptation Operations of Verb Morphological Variations in Present
Indefinite to Past Indefinite . . . . . . . . . . . . . . . . . . . . . . . 44
2.5 Different Functional Tags Under the Functional Slot Subject or Object . . 56
2.6 Different Possible Morpho Tags for Each of the Functional Tags under
the Functional Slot Subject or Object . . . . . . . . . . . . . . . . . 58
2.8 Adaptation Operations for Genitive Case to Genitive Case . . . . . . 62
2.10 Adaptation Operations for Pre-modifier Adjective to Pre-modifier Ad-
jective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
2.11 Adaptation Operations for Subject to Subject Variations . . . . . . . 71
2.12 Different Sentence Patterns of Interrogative Words . . . . . . . . . . . 77
-
2.13 Functional & Morpho Tags Corresponding to Each Interrogative Sen-
tence Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
2.14 Adaptability Rules for Group G5 Sentence Patterns . . . . . . . . . . 83
2.15 Adaptation Rules for Variation in Kind of Sentences . . . . . . . . . . 84
3.1 Different Semantic Similarity Scores of shock with trouble
and panic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.1 FT-features Instrumental for Creating Divergence . . . . . . . . . . . 138
4.2 Relevance of FT-features in Different Divergence Types . . . . . . . . 139
4.3 FT of the Problematic Words for Each Divergence Type . . . . . . . 142
4.4 Frequency of Words in Different Sections . . . . . . . . . . . . . . . . 144
4.5 PSD/NSD Schematic Representations . . . . . . . . . . . . . . . . . . 145
4.6 Values of s(di) and m(di) for Illustration 3 . . . . . . . . . . . . . . . 160
4.7 Some Illustrations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
4.8 Continuation of Table 4.7 . . . . . . . . . . . . . . . . . . . . . . . . 165
4.9 Results of Our Experiments . . . . . . . . . . . . . . . . . . . . . . . 166
5.1 Cost Due to Variation in Kind of Sentences . . . . . . . . . . . . . . . 187
5.2 Cost Due to Verb Morphological Variation Present Indefinite to Present
Indefinite . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
5.3 Adaptation Operations of Verb Morphological Variation Present In-
definite to Past indefinite . . . . . . . . . . . . . . . . . . . . . . . . . 192
5.4 Costs Due to Adapting Genitive Case to Genitive Case . . . . . . . . 195
xiv
-
5.5 Cost of Adaptation Due to Subject/Object to Subject/Object . . . . 197
5.6 Best Five Matches by Using Semantic Similarity for the Input Sen-
tence I work. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
5.7 Best Five Matches by Using Semantic Similarity for the Input Sen-
tence Sita sings ghazals. . . . . . . . . . . . . . . . . . . . . . . . . 201
5.8 Weighting Scheme for Different POS and Syntactic Role . . . . . . . 202
5.9 Best Five Matches by Syntactic Similarity for the Input Sentence I
work. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
5.10 Best Five Matches by Syntactic Similarity for the Input Sentence Sita
sings ghazals. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
5.11 Functional-morpho Tags for the Input English Sentence (IE) and the
Retrieved English Sentence (RE) . . . . . . . . . . . . . . . . . . . . 204
5.12 Retrieval on the Basis of Cost of Adaptation Based Scheme for the
Input Sentence I work. . . . . . . . . . . . . . . . . . . . . . . . . . 207
5.13 Retrieval on the Basis of Cost of Adaptation Based Similarity for the
Input Sentence Sita sings ghazals. . . . . . . . . . . . . . . . . . . . 207
5.14 Cost of Adaptation for Retrieved Best Five Matches for the Input
Sentence I work. by Using Semantic and Syntactic Based Similarity
Schemes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
5.15 Cost of Adaptation for Retrieved Best Five Matches for the Input
Sentence Sita sings ghazals by Using Semantic and Syntactic based
Similarity Schemes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
5.16 Weights Used for Characteristic Features . . . . . . . . . . . . . . . . 220
xv
-
5.17 Notation Used in the Complexity Analysis . . . . . . . . . . . . . . . 222
5.19 Typical Examples of Complex Sentence with Connective when, where,
whenever or wherever Handled by Module 2 . . . . . . . . . . . . 235
5.20 Typical Examples of Complex Sentence with Connective when, where,
whenever or wherever Handled by Module 3 . . . . . . . . . . . . 239
5.21 Typical Complex Sentences with Relative Adverb who Handled by
Module 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
5.22 Typical Complex Sentences with Relative Adverb who Handled by
Module 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
5.23 Typical Complex Sentences with Relative Adverb who Handled by
Module 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
5.24 Hindi Translation of Relative Adverbs . . . . . . . . . . . . . . . . . . 254
5.25 Patterns of Complex Sentence with Connective when, where,
whenever and wherever . . . . . . . . . . . . . . . . . . . . . . . . 255
5.26 Patterns of Complex Sentence with Connective who . . . . . . . . . 257
5.27 Five Most Similar Sentences for RC You go to India. Using Cost of
Adaptation Based Scheme . . . . . . . . . . . . . . . . . . . . . . . . 261
5.28 Five Most Similar Sentences for MC You should speak Hindi. Using
Cost of Adaptation Based Scheme . . . . . . . . . . . . . . . . . . . . 261
5.29 Five Most Similar Sentences for RC He wants to learn Hindi. Using
Cost of Adaptation Based Scheme . . . . . . . . . . . . . . . . . . . . 263
5.30 Five Most Similar Sentences for MC The student should study this
book. Using Cost of Adaptation Based Scheme . . . . . . . . . . . . . 263
xvi
-
A.2 Different Case Ending in Hindi . . . . . . . . . . . . . . . . . . . . . 283
A.3 Suffixes and Morpho-Words for Hindi Verb Conjugations . . . . . . . 286
A.4 Verb Morphological Changes From English to Hindi Translation . . . 288
E.1 Costs Due to Adapting Pre-modifier Adjective to Pre-modifier Adjective . . 307
xvii
-
Chapter 1
Introduction
-
Machine Translation (MT) is the process of translating text units of one language
(source language) into a second language (target language) by using computers. The
need for MT is greatly felt in the modern age due to globalization of information,
where global information base needs to be accessed from different parts of the world.
Although most of this information is available online, the major difficulty in dealing
with this information is that its language is primarily English. From science, technology and education to gadget manuals and commercial advertisements, the predominant presence of English as the medium of communication can easily be observed. The world, however, is multilingual, with different languages spoken
in different regions. This necessitates the development of good MT systems for
translating these works into other languages so that a larger population can access,
retrieve and understand them. Consequently, in a country like India, where English
is understood by less than 3% of the population (Sinha and Jain, 2003), the need
for developing MT systems for translating from English into some native Indian
languages is very acute. In this work we looked into different aspects of designing an
English to Hindi MT system using Example-Based (Nagao, 1984) technique. Two
fundamental questions that we feel we should answer at this point are:
(i) The rationale behind choosing Example-Based Machine Translation (EBMT) as the paradigm of interest;
(ii) The reason behind selecting Hindi as the preferred language.
Below we provide justifications behind these choices.
Development of MT systems has taken a big leap in the last two decades. Typ-
ically, machine translation requires handcrafted and complicated large-scale knowl-
1
-
edge (Sumita and Iida, 1991). Various MT paradigms have so far evolved depending
upon how the translation knowledge is acquired and used. For example,
1. Rule-Based Machine Translation (RBMT): Here rules are used for analysis
and representation of the meaning of the source language texts, and the
generation of equivalent target language texts (Grishman and Kosaka, 1992),
(Thurmair, 1990), (Arnold and Sadler, 1990).
2. Statistical- (or Corpus-) Based Machine Translation (SBMT): Statistical translation models are trained on a sentence-aligned parallel corpus; translation is based on n-gram modelling and on the probability distribution of source-target language pairs estimated from a very large corpus. This technique was proposed by IBM in the early 1990s (Brown, 1990), (Brown et al., 1992), (Brown et al., 1993), (Germann, 2001).
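In spirit, SBMT selects the target sentence t that maximizes P(t) · P(s | t), combining a target-language model with a translation model. The toy sketch below illustrates only this scoring idea; the sentences and probability values are invented for illustration, and this is not the actual IBM model machinery.

```python
# Toy noisy-channel scoring in the spirit of SBMT: pick the candidate
# translation t that maximizes P(t) * P(s | t). The probability tables
# below are invented for illustration; real systems estimate them from
# a very large sentence-aligned parallel corpus.

def best_translation(source, candidates, lm_prob, tm_prob):
    """Return the candidate t with the highest score P(t) * P(source | t)."""
    return max(candidates, key=lambda t: lm_prob[t] * tm_prob[(source, t)])

# Hypothetical language-model and translation-model probabilities.
lm = {"vah gaatii hai": 0.4, "vah gaataa hai": 0.6}
tm = {("she sings", "vah gaatii hai"): 0.7,
      ("she sings", "vah gaataa hai"): 0.2}

print(best_translation("she sings", ["vah gaatii hai", "vah gaataa hai"], lm, tm))
# 0.4 * 0.7 = 0.28 beats 0.6 * 0.2 = 0.12, so "vah gaatii hai" wins.
```

The dependence of the result on the estimated probability tables is precisely why SBMT needs so large a parallel corpus.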
However, these techniques have their own drawbacks. The main drawback of RBMT systems is that sentences in any natural language may assume a large variety of structures. Also, machine translation often suffers from ambiguities of various types (Dorr et al., 1998). As a consequence, translation from one natural language into another requires enormous knowledge about the syntax and semantics of both the source and target languages. Capturing all this knowledge in rule form is a daunting task, if not an impossible one. SBMT techniques, on the other hand, depend on how accurately the various probabilities are estimated. Realistic estimates of these probabilities can be made only if a large parallel corpus is available, and such a huge volume of data is not easy to obtain. Consequently, this scheme is viable only for a small number of language pairs.
2
-
Example-Based Machine Translation (Nagao, 1984), (Carl and Way, 2003) makes use of past translation examples to generate the translation of a given input. An EBMT system stores in its example base translation examples between two languages, the source language (SL) and the target language (TL). These examples are subsequently used as guidance for future translation tasks. In order to translate a new input sentence in the SL, a similar SL sentence1 is retrieved from the example base, along with its translation in the TL. This example is then adapted suitably to generate a translation of the given input. It has been found that EBMT has several advantages in comparison with other MT paradigms (Sumita and Iida, 1991):
1. It can be upgraded easily by adding more examples to the example base;
2. It utilizes translators' expertise, and adds a reliability factor to the translation;
3. It can be accelerated easily by indexing and parallel computing;
4. It is robust because of best-match reasoning.
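The retrieve-then-adapt cycle described above can be sketched in a few lines. This is a deliberately naive illustration: the word-overlap similarity measure and the word-for-word substitution lexicon below are placeholder assumptions, not the similarity measure or the adaptation operations proposed in this thesis.

```python
# Minimal sketch of the EBMT retrieve-and-adapt cycle. The similarity
# measure (word overlap) and the adaptation step (word-for-word
# substitution via a small bilingual lexicon) are naive placeholders.

def similarity(a, b):
    """Crude similarity: fraction of shared words (Jaccard overlap)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

def translate(input_sl, example_base, lexicon):
    # 1. Retrieval: find the most similar SL sentence in the example base.
    src, tgt = max(example_base, key=lambda ex: similarity(ex[0], input_sl))
    # 2. Adaptation: patch the retrieved TL sentence where the SL differs.
    for old, new in zip(src.split(), input_sl.split()):
        if old != new and old in lexicon and new in lexicon:
            tgt = tgt.replace(lexicon[old], lexicon[new])
    return tgt

# Hypothetical one-example base and toy English-to-Hindi lexicon.
example_base = [("I read books", "main pustakein padhtaa hoon")]
lexicon = {"I": "main", "read": "padhtaa",
           "books": "pustakein", "letters": "patra"}

print(translate("I read letters", example_base, lexicon))
# -> "main patra padhtaa hoon"
```

With a realistic example base, the quality of the retrieved match, and hence how much adaptation is needed, is exactly what the retrieval module must optimize.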
Other researchers (e.g. (Somers, 1999), (Kit et al., 2002)) have also considered EBMT to be one of the major and effective approaches among the different MT paradigms, primarily because it exploits the linguistic knowledge stored in an aligned text in a more efficient way.
From the above observations we infer that for the development of MT systems from English to Indian languages, EBMT should be one of the preferred approaches. This is because a significant volume of parallel text is available between English and different Indian languages in the form of government notices, translated books,
1Sometimes more than one sentence is also retrieved
3
-
advertisement material, etc. Although this data is generally not yet available in electronic form, converting it into machine-readable form is much easier than formulating the explicit translation rules required by an RBMT system. In fact, some parallel data in electronic form has been made available through projects such as EMILLE (http://www.emille.lancs.ac.uk/home.html). Also, there has been some concerted effort from various government organizations like TDIL2, CIIL Mysore3 and C-DAC Noida4 (Vikas, 2001), and from various institutes, e.g., IIT Bombay5, IIT Kanpur6 and LTRC (IIIT Hyderabad)7, to develop linguistic resources. At the same time,
this data is not large enough to design an English to Hindi SBMT system, which typically requires several hundred thousand sentences. These resources, we hope, will be fruitfully utilized for developing different EBMT systems involving Indian languages.
Of the different Indian languages8, Hindi has some major advantages over the others as far as work on MT is concerned. Not only is Hindi the national language of India, it is also the most popular of all Indian languages. With respect to Indian languages, all the major works reported so far (e.g. ANGLAHINDI (Sinha et al., 2002), SHIVA (http://shiva.iiit.net/), SHAKTI (Sangal, 2004), MaTra (human-aided MT)9) are primarily concerned with English and Hindi as their preferred languages. In 2003 Hindi was designated the "surprise language" (Oard, 2003) by DARPA. As a consequence, different universities (e.g. CMU, Johns Hopkins, USC-ISI) have invested effort in developing MT systems involving Hindi.
2http://tdil.mit.gov.in/
3http://www.ciil.org/
4http://www.cdacnoida.com/
5http://www.cfilt.iitb.ac.in
6http://www.cse.iitk.ac.in/users/isciig/
7http://ltrc.iiit.net/
8India has 17 official languages, and more than 1000 dialects (http://azaz.essortment.com/languagesindian rsbo.htm)
9http://www.ncst.ernet.in/matra/about.shtml
4
-
This world-wide popularity of the language makes the study of English to Hindi machine translation all the more meaningful in today's context.
One major advantage of the above-mentioned English to Hindi translation systems being available on-line is that we could work with them and examine the quality of their outputs. In this respect, we find that the outputs given by these systems are not always correct translations of the inputs. Table 1.1 illustrates this for the systems AnglaHindi and Shakti: it shows the translations produced by the two systems for different inputs, together with the correct translations of these sentences.
Input Sentence | Output of AnglaHindi | Output of Shakti | Actual Translation
Ram married Sita. | raam ne siita vivahaa kiyaa | raam ne siitaa vivaaha kiyaa | raam ne siitaa se vivaaha kiyaa
Fan is on. | pankhaa ho par | pankhaa lagaataar hai | pankhaa chal rahaa hai
This dish tastes good. | yaha vyanjan achchhaa hotaa hai | yah thalii achcchaa swaad letii hai | iss vyanjan kaa swaad achchhaa hai
The soup lacks salt. | soop namak kam hotaa hai | shorbaa namak kamii hai | soop mein namak kam hai
It is raining. | yah varshaa ho rahii hai | yah varshaa ho rahii hai | varshaa ho rahii hai
They have a big fight. | unke paas eka badhii ladaae hai | unke badhii ladaaiyaan hain | unkii ghamasan ladaii huii

Table 1.1: Outputs of the AnglaHindi and Shakti MT systems
-
1.1. Description of the Work Done and Summary of the Chapters
We have found many such instances where the outputs produced by the systems
cannot be considered correct Hindi translations of the respective inputs. This
observation prompted us to study different aspects of English to Hindi translation,
in order to understand the difficulties involved in machine translation, particularly
with respect to English to Hindi translation, and how these shortcomings can be
dealt with under an EBMT framework. This research is concerned with the above studies.
1.1 Description of the Work Done and Summary
of the Chapters
The success of an EBMT system rests on two different modules: (i) similarity
measurement and retrieval, and (ii) adaptation. Retrieval is the procedure by which a
suitable translation example is retrieved from a system's example base. Adaptation
is the procedure by which a retrieved translation is modified to generate the
translation of the given input. Various retrieval strategies have been developed (e.g.
(Nagao, 1984), (Sato, 1992), (Collins and Cunningham, 1996)). All these retrieval
strategies aim at retrieving an example from the example base such that the retrieved
example is similar to the input sentence. This is due to the fact that the fundamental
intuition behind EBMT is that translations of similar sentences of the source lan-
guage will be similar in the target language as well. Thus the concept of retrieval is
intricately related to the concept of similarity measurement between sentences.
But the main difficulty with this assumption is that there is no straight-
forward way to measure similarity between sentences. Different works have defined
different approaches for measuring similarity between sentences, for example,
word-based metrics (e.g. (Nirenburg, 1993), (Nagao, 1984)), character-based
metrics (e.g. (Sato, 1992)), syntactic/semantic-based matching (e.g. (Manning and
Schutze, 1999)), DP-matching between word sequences (e.g. (Sumita, 2001)) and
hybrid retrieval schemes (e.g. (Collins, 1998)).
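As an illustration, DP-matching between word sequences (in the spirit of Sumita, 2001) can be sketched as a standard dynamic-programming edit distance over words; the normalization into a [0, 1] similarity score below is our own illustrative choice, not the formulation of any of the cited works.

```python
def dp_distance(a, b):
    """Edit distance between two word sequences via dynamic programming."""
    m, n = len(a), len(b)
    # dist[i][j] = cost of transforming a[:i] into b[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i
    for j in range(n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # word deletion
                             dist[i][j - 1] + 1,        # word insertion
                             dist[i - 1][j - 1] + sub)  # substitution / copy
    return dist[m][n]

def dp_similarity(s1, s2):
    """Similarity in [0, 1]: 1 minus the length-normalized distance."""
    w1, w2 = s1.lower().split(), s2.lower().split()
    return 1.0 - dp_distance(w1, w2) / max(len(w1), len(w2))
```

For instance, "The boy eats rice" and "The boy eats rice everyday" differ by one insertion, giving a high similarity score.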
In all these works similarity measurement and adaptation are considered
in isolation. This, we feel, is a major hindrance with respect to EBMT. In this
work we therefore propose a novel approach for measuring similarity: we look at
similarity from the point of view of adaptation. We suggest that a past example
should be considered the most similar with respect to an input sentence if its
adaptation towards generating the desired translation is the simplest. The work
carried out in this research is aimed at achieving this goal. Our studies therefore start
in the following way. We first look at adaptation in detail. An efficient adaptation
scheme is very important for an EBMT system because even a very large example
base cannot, in general, guarantee an exact match for a given input sentence. As
a consequence, the need arises for an efficient and systematic adaptation scheme
for modifying a retrieved example, and thereby generating the required translation.
Various adaptation schemes have been proposed in the literature, e.g. (Veale and Way,
1997), (Shiri et al., 1997), (Collins, 1998) and (McTait, 2001). A scrutiny of these
schemes suggests that there are primarily four basic adaptation operations, namely
word addition, word deletion, word replacement and copy.
In our approach we started with these basic operations: word addition, word
deletion, word replacement and copy. However, in this respect we notice the
following:
1. Both English and Hindi rely heavily on suffixes for morphological changes.
There are a number of suffixes for achieving declension of verbs and nouns.
Further, in Hindi there are situations when morphological changes in the
adjectives are also required, depending upon the number and gender of the
corresponding noun/pronoun. Since the number of suffixes is limited, we feel that
if adaptation operations are focused on the suffixes instead of being purely
word-based, then in many situations a significant amount of computational
effort may be saved.
2. A further observation with respect to Hindi is that there are situations when,
instead of suffixes, whole words are used for bringing in morphological variations.
For example, the present continuous form of Hindi verbs is: verb root +
rahaa/rahii/rahe + hai/hain/ho. Here the words rahaa,
rahii or rahe are used to achieve the morphological variation; which of
these will be used depends upon the number and gender of the subject. Similarly,
hai, hain and ho are used depending upon the number and person of the
subject. We term these words morpho-words. Appendix A gives details of
different Hindi morpho-words and their usages.
A major fallout of the above observation is that in some situations adaptation
may be carried out by dealing with the morpho-words instead of whole words, which
is computationally much less expensive than dealing with constituent words as a
whole. Thus we propose an adaptation scheme consisting of ten operations: addition,
deletion and replacement of constituent words; addition, deletion and replacement
of morpho-words; addition, deletion and replacement of suffixes; and copy. Chapter
2 of the thesis discusses these adaptation operations in detail.
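The ten operations can be listed programmatically. The sketch below is only an illustrative inventory (the enumeration names are ours); the distinction recorded in `NEEDS_DICTIONARY` — that only constituent word addition and replacement require a bilingual dictionary search — is discussed later in this summary.

```python
from enum import Enum

class AdaptOp(Enum):
    """The ten adaptation operations (illustrative enumeration)."""
    WORD_ADD = "constituent word addition"
    WORD_DEL = "constituent word deletion"
    WORD_REP = "constituent word replacement"
    MORPHO_ADD = "morpho-word addition"
    MORPHO_DEL = "morpho-word deletion"
    MORPHO_REP = "morpho-word replacement"
    SUFFIX_ADD = "suffix addition"
    SUFFIX_DEL = "suffix deletion"
    SUFFIX_REP = "suffix replacement"
    COPY = "copy"

# Only constituent word addition and replacement need a bilingual
# dictionary search; the remaining operations are RAM-based.
NEEDS_DICTIONARY = {AdaptOp.WORD_ADD, AdaptOp.WORD_REP}
```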
One point we notice, however, is that the above-mentioned operations cannot
deal with translation divergences in an efficient way. Divergence occurs when
structurally similar sentences of the source language do not translate into sentences
that are similar in structure in the target language (Dorr, 1993). We therefore felt
that the study of divergence is an important aspect for any MT system. With respect
to an EBMT system the need arises for two reasons:
- The past example that is retrieved for carrying out the task of adaptation
has a normal translation, but the translation of the input sentence should involve
divergence.
- The translation of the retrieved example involves divergence, whereas the input
sentence should have a normal translation.
In this work we have made an in-depth study of divergence with respect to English to
Hindi translation. In this regard one may note that divergence is a highly language-
dependent phenomenon. Its nature may change with the source and target
languages under consideration. Although divergence has been studied extensively
with respect to translation between European languages (e.g. (Dorr et al., 2002),
(Watanabe et al., 2000)), very few studies on divergence may be found regarding
translations involving Indian languages. The only work that has come to our notice
is (Dave et al., 2002), in which the authors followed the classifications given in (Dorr,
1993) and tried to find examples of each of them with respect to English to Hindi
translation. In this regard it may be noted that Dorr described seven different
divergence types with respect to translations between European languages: structural,
categorical, conflational, promotional, demotional, thematic and lexical.
However, we find that not all the different divergence types explained in Dorr's work
apply with respect to Indian languages. In fact, we found very few (if any)
examples of thematic and promotional divergence with respect to English
to Hindi translation. On the other hand we identified three new types of divergence
that have not so far been cited in any other work on divergence. We have named these
divergences nominal, pronominal and possessional, respectively. We have
further observed that all the different divergence types (barring structural) for
which we found instances in English to Hindi translation may be further divided into
several sub-categories. Chapter 3 explains in detail the different divergence types and
their sub-types that we have observed with respect to English to Hindi translation,
and illustrates them with suitable examples. Some of these results have already been
presented in (Gupta and Chatterjee, 2003a) and (Gupta and Chatterjee, 2003b).
The presence of divergence examples in the example base makes straightforward
application of the above-mentioned adaptation scheme difficult. As mentioned earlier,
application of the operations discussed in Chapter 2 will not be able to generate
the correct translation if the input sentence requires a normal translation whereas
the translation of the retrieved example involves divergence, or vice versa. To overcome
this difficulty we suggest that the example base be partitioned into two parts:
one containing examples of normal translation, the other containing examples
of divergence, so that given an input sentence an EBMT system may retrieve an
example from the appropriate part of the example base. However, implementation
of the above scheme requires the design of algorithms for:
1) Partitioning the example base sentences.
2) Designing an efficient retrieval policy.
We attempt to answer the first by designing algorithms for the identification of
translation divergence, i.e. if an English sentence and its Hindi translation are given
as input, these algorithms will detect whether the translation involves any of the said
types of divergence. The remaining part of Chapter 3 discusses the different algorithms
that we have developed for identification of divergence from a given English-Hindi pair
of sentences. The identification algorithms consider the Functional
Tags (FT10) of the constituent words and the Syntactic Phrasal Annotated Chunks
(SPAC11) of the SL and TL sentences. When these two do not match for a source
language sentence and its translation in the TL, a divergence can be identified. With
respect to each divergence category and its sub-categories we have identified
the appropriate FTs and SPACs whose presence/absence indicates the possibility of
a certain divergence. By systematically analyzing the FTs and SPACs of the English
sentence and its Hindi translation, the algorithms arrive at a decision on whether
the translation involves any divergence. Thus the algorithm partitions the example
base into two parts: a Normal Example Base and a Divergence Example Base. Some of
these algorithms have already been presented in (Gupta and Chatterjee, 2003b).
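A highly simplified sketch of such a partitioning step is given below. The tag sequences and the expected mapping are hypothetical stand-ins for the actual FT/SPAC analysis of Chapter 3; the point is only the mechanism of flagging a pair as divergent when the expected and observed patterns do not match.

```python
def is_divergent(en_fts, hi_fts, expected_map):
    """Flag a translation pair as divergent when the functional-tag pattern
    expected from the English side does not match the Hindi side.
    en_fts/hi_fts: tag sequences; expected_map: EN tag -> expected HI tag."""
    mapped = [expected_map.get(tag, tag) for tag in en_fts]
    return mapped != hi_fts

def partition_example_base(pairs, expected_map):
    """Split (en_fts, hi_fts) pairs into normal and divergence example bases."""
    normal, divergent = [], []
    for en_fts, hi_fts in pairs:
        target = divergent if is_divergent(en_fts, hi_fts, expected_map) else normal
        target.append((en_fts, hi_fts))
    return normal, divergent
```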
To answer the second question, we feel that, given an input sentence, if it can be
decided a priori whether its translation will involve divergence, then the retrieval can
be made accordingly. To handle the situation when the translation of the input sentence
does not involve any divergence, we devise a cost-of-adaptation-based two-level
filtration scheme that enables quick retrieval from the normal example base12. Chapter 4
describes our scheme of retrieval from the divergence example base in situations involving
divergence. Here our primary attempt is to develop a procedure that, given an
input English sentence, decides whether its Hindi translation will involve any
type of divergence. Obviously, this decision has to be made before resorting to
the actual translation. Hence we call it prior identification of divergence. The
10Appendix B provides details on the FTs.
11SPAC structure is discussed in detail in Appendix C.
12This scheme is discussed in Chapter 5.
algorithm seeks evidence from the example base and WordNet. In this work we
have used WordNet 2.013 to measure the semantic similarity between the constituent
words of the input sentence and various words present in the example base sentences,
in order to arrive at a decision in this regard. The scheme works in the following way.
We first identified the roles of different Functional Tags (FTs) in causing divergence.
We observe, with respect to different divergence types and sub-types, that each FT
may have one of the following three roles:
1) Its presence is mandatory for the corresponding divergence (sub-)type to occur;
2) Its absence is mandatory for the corresponding divergence (sub-)type to occur;
3) Occurrence/non-occurrence of the divergence (sub-)type is not influenced by
the FT under consideration.
This knowledge is stored in the form of a table (Table 4.2) in Chapter 4. Given
an input sentence, the scheme first determines its constituent FTs. We have used
the ENGCG parser14 for parsing an input sentence and obtaining its FTs. This finding
is then compared against the above-mentioned knowledge base (Table 4.2) to identify
the set (D) of divergence types that may possibly occur in the translation of this
sentence. Further investigation is carried out to discard elements from the set D, so
that the divergence that may actually occur can be pinpointed. In this respect we
proceed in the following way. Corresponding to each divergence type we identify the
functional tag that is at the root of causing the divergence. We call it the
problematic FT corresponding to that particular divergence. Table 4.3 presents our
findings in this regard. Corresponding to each possible divergence (as found in D) the scheme
13http://www.cogsci.princeton.edu/cgi-bin/webwn
14http://www.lingsoft.fi/cgi-bin/engcg
works as follows. It first retrieves from the input sentence the constituent word
corresponding to the problematic FT of the divergence type under consideration. Then
the semantic similarity of this word with the relevant words of the example base
sentences is computed, and proximity in this semantic distance is used as a yardstick
for similarity measurement. Chapter 4 discusses this scheme in detail.
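The semantic-distance idea can be sketched as follows. The thesis uses WordNet 2.0; here a tiny hand-made hypernym map stands in for it, and the distance is simply the number of edges to the lowest common hypernym (all entries below are hypothetical illustrations).

```python
# Toy hypernym taxonomy standing in for WordNet 2.0 (hypothetical entries).
HYPERNYMS = {"soup": "food", "dish": "food", "food": "entity",
             "fan": "device", "device": "entity", "entity": None}

def path_to_root(word):
    """Chain of hypernyms from a word up to the taxonomy root."""
    path = [word]
    while HYPERNYMS.get(word) is not None:
        word = HYPERNYMS[word]
        path.append(word)
    return path

def semantic_distance(w1, w2):
    """Number of edges to the lowest common hypernym (inf if unrelated)."""
    p1, p2 = path_to_root(w1), path_to_root(w2)
    for d1, node in enumerate(p1):
        if node in p2:
            return d1 + p2.index(node)
    return float("inf")
```

Under this toy taxonomy, "soup" is semantically closer to "dish" (a sibling under "food") than to "fan".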
Finally, in Chapter 5 we look at how cost of adaptation may be used as a similar-
ity measurement scheme. It has been stated that no unique definition of similarity
exists for comparing sentences. Similarity between sentences may be viewed from
different perspectives. In this work, we have first considered the two most general
similarity schemes: syntactic similarity and semantic similarity. The ideas have
been borrowed from the domain of Information Technology (Manning and Schutze,
1999). According to the definition given therein, semantic similarity is measured on
the basis of commonality of words: the greater the number of words common between
two sentences, the more similar the two sentences are said to be.
However, it has been shown in (Chatterjee, 2001) that this
measurement of similarity is not always helpful from the EBMT point of view. For
example, it has been shown there that although the sentences The horse had a good
run. and The horse is good to run on. have most of their key words in common, the
structures of their Hindi translations are very different. Consequently, adapting
the translation of one of them to generate the translation of the other is
computationally demanding. On the other hand, syntactic similarity between two sentences
is measured on the basis of the commonality of morpho-functional tags between them.
In this case, adaptation may require a large number of constituent word replacement
(WR) operations. Each of these WR operations involves reference to some dictionary
for picking up the appropriate words in the target language. Typically the
dictionary access will involve accessing an external storage, and thereby will incur
significant computational cost. Thus a purely syntax-based similarity measurement
scheme may not be suitable for an EBMT system.
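The two baseline measures can be sketched with a Dice coefficient over word sets and over tag sets respectively; this is a generic formulation of word/tag commonality, not the exact metric of any cited work. On the horse example above, the word-based measure indeed scores the two sentences as highly similar, despite their very different Hindi structures.

```python
def dice(set_a, set_b):
    """Dice coefficient: 2|A n B| / (|A| + |B|)."""
    if not set_a and not set_b:
        return 1.0
    return 2 * len(set_a & set_b) / (len(set_a) + len(set_b))

def semantic_similarity(s1, s2):
    """Word-commonality measure over the two sentences' word sets."""
    return dice(set(s1.lower().split()), set(s2.lower().split()))

def syntactic_similarity(tags1, tags2):
    """Commonality of morpho-functional tags between two sentences."""
    return dice(set(tags1), set(tags2))
```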
In this work we therefore propose that from the EBMT perspective retrieval and
adaptation should be looked at in a unified way. In this chapter (i.e. Chapter
5) we investigate the feasibility of the above proposal in depth. In this respect we first
look into the overall adaptation operations in detail. We have already observed that
these operations are invoked successively to remove the discrepancies between the
input sentence and the retrieved example. These discrepancies, as we observe, may
be in the actual words, or in the overall structure of the sentences. For illustration,
suppose the input sentence is The boy eats rice everyday., whose Hindi translation
ladkaa har roz chaawal khaataa hai has to be generated. The nature of the
adaptation varies depending upon which example is retrieved from the example base. For
illustration:
a) If the retrieved example is The boy eats rice., the adaptation procedure needs
to apply a constituent word addition (WA) operation to take care of the adverb
everyday.
b) However, if the retrieved sentence is The boy plays cricket everyday. - ladkaa
roz cricket kheltaa hai, then the adaptation procedure needs to invoke two
constituent word replacement (WR) operations: to replace the Hindi of play,
i.e. khel, with the Hindi of eat, i.e. khaa, and cricket (cricket) with
chaawal (rice).
c) In case the retrieved example is The boy is eating rice., one constituent word
addition (WA) operation is required for the adverb everyday. Further, to take
care of verb conjugation, some morpho-word and suffix operations need to be
carried out. This is because the Hindi translation of The boy is eating rice. is
ladkaa (boy) chaawal (rice) khaa (eat) rahaa (..ing) hai (is), but the
translation of the input sentence The boy eats rice everyday. should be ladkaa
har roz chaawal khaataa hai. Thus the morpho-word rahaa, which is required
for the present continuous tense of the retrieved sentence, needs to be deleted.
Further, the suffix taa is to be added to the root main verb to get the required
present indefinite verb form of the input.
d) However, if the retrieved example is Does the boy eat rice?, then the adaptation
procedure needs to take care of the structural variation between the
interrogative form of the retrieved example and the affirmative form of the input
sentence.
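Case (c) above can be traced concretely. The sketch below applies the three operations as plain list manipulations; the helper names and the insertion position of the added words are ours, whereas a real system would locate them via functional slots.

```python
# Toy adaptation of case (c): turn the retrieved translation
#   "ladkaa chaawal khaa rahaa hai"        (The boy is eating rice.)
# into the required
#   "ladkaa har roz chaawal khaataa hai"   (The boy eats rice everyday.)
MORPHO_WORDS = {"rahaa", "rahii", "rahe", "hai", "hain", "ho"}

def delete_morpho_word(words, mw):
    """Morpho-word deletion (mw must be a known morpho-word)."""
    assert mw in MORPHO_WORDS
    return [w for w in words if w != mw]

def add_suffix(words, root, suffix):
    """Suffix addition on the given root word."""
    return [w + suffix if w == root else w for w in words]

def add_words(words, position, new_words):
    """Constituent word addition at a given position."""
    return words[:position] + new_words + words[position:]

retrieved = "ladkaa chaawal khaa rahaa hai".split()
step1 = delete_morpho_word(retrieved, "rahaa")  # drop present-continuous marker
step2 = add_suffix(step1, "khaa", "taa")        # present indefinite verb form
result = add_words(step2, 1, ["har", "roz"])    # add the adverb "everyday"
```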
Obviously, the greater the discrepancy between the retrieved example and
the input sentence, the greater the number of adaptation operations needed to
generate the desired translation. The above illustrations make certain points evident:
a) Adaptation operations are required for performing two general tasks: dealing
with constituent words (along with their suffixes and morpho-words), and dealing
with the overall structure of the sentence.
b) Each invocation of an adaptation operation pertains to a particular part of speech,
such as noun, verb, adverb, etc.
c) Of the ten adaptation operations (described earlier with respect to Chapter
2) only the WA and WR operations require dictionary15 searches. Since a
dictionary search typically involves accessing an external device (e.g. a hard disk),
it is computationally more expensive than the other operations
(e.g. constituent word deletion, morpho-word operations), which are purely
RAM16-based and hence computationally cheaper.
The above observations help us to proceed towards the intended goal
of using cost of adaptation as a measure of similarity. As a first step,
we suggest dividing the dictionary into several parts based on the
part of speech (POS) of the words. This division reduces the search time for each
invocation of the above operations. The cost-of-adaptation-based similarity
measurement approach then proceeds along the following lines:
a) We first estimate the average cost of each of the ten adaptation operations.
We observe that these costs depend on two major types of parameters. On
one hand they depend on certain linguistic aspects, such as the average length
of the sentences in both source and target languages, the number of suffixes
(used with different POS), the number of morpho-words, etc. On the other
hand, these costs are related to the machine on which the EBMT system is
running. Since we aim at analyzing the costs in a general way, we have treated
these machine-dependent costs as variables in all our analysis. For the
linguistic parameters, we used values obtained by analyzing about
15By dictionary we mean a source language to target language word dictionary available on-line.
16Random Access Memory
30,000 examples of English to Hindi translations. These examples were
collected from various sources, namely translation books, advertisement
materials, children's story books and government notices, which are freely available
in non-electronic form.
b) At the second step, we estimated the costs incurred in adapting various
functional tags17. In particular, we have considered the cost of adaptation due to
variations in active and passive verb morphology, subject/object, pre-modifying
adjectives, the genitive case and wh-family words. These costs are stored in various
tables in Section 5.4.
c) At the third step we have considered the costs of adaptation due to differences in
sentence structure. Here, we have considered four different sentence structures:
affirmative, negative, interrogative and negative-interrogative. These adaptation
costs too are stored in tabular form. Section 5.4 gives details of this analysis.
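The resulting retrieval criterion can be sketched as an argmin over estimated adaptation costs. The numeric costs below are purely illustrative placeholders for the modelled costs of Section 5.4, with dictionary-searching operations made an order of magnitude dearer than the RAM-based ones.

```python
# Illustrative relative costs: word addition/replacement need a dictionary
# search; deletion, morpho-word, suffix and copy operations are RAM-based.
COST = {"word_add": 10.0, "word_rep": 10.0,
        "word_del": 1.0, "morpho": 1.0, "suffix": 1.0, "copy": 0.1}

def adaptation_cost(ops):
    """Total estimated cost of an adaptation script."""
    return sum(COST[op] for op in ops)

def most_similar(candidates):
    """candidates: {example sentence: required adaptation operations}.
    The most similar example is the one cheapest to adapt."""
    return min(candidates, key=lambda ex: adaptation_cost(candidates[ex]))
```

Applied to the rice-eating examples above, the example needing a single word addition wins over the one needing two dictionary-backed replacements.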
Once these basic costs are modelled, we are in a position to experiment with cost
of adaptation as a similarity measure vis-à-vis the semantics- and syntax-based
similarity measurement schemes discussed above. Our experiments have clearly
established the efficiency of the proposed scheme over the others. Part of this work is also
presented in Gupta and Chatterjee (2003c). Two apparent drawbacks of this scheme are:
1) It may end up comparing a given input with all the example-base sentences
to ascertain the least cost of adaptation.
2) Another major question is whether the cost-of-adaptation scheme is efficient
enough to handle sentences that are structurally more complicated, e.g. complex
or compound sentences. It is a generally accepted fact that complex sentences
are difficult to handle in an MT system (Dorr et al., 1998), (Hutchins, 2003),
(Sumita, 2001), (Shimohata et al., 2003).
17In fact we worked on Functional Slots, which are more general than Functional Tags. This is discussed in detail in Section 2.2.
In order to deal with the first difficulty we have proposed a two-level filtration scheme.
This scheme helps in selecting a smaller number of examples from the example base,
which may subsequently be subjected to the rigorous treatment of determining their
costs of adaptation with respect to the given input. We have also justified that this
scheme does not leave out the sentences whose translations are easier to adapt for
the given input.
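A sketch of such a filtration cascade is given below. The two filters used here (a cheap sentence-length filter, then a tag-overlap filter) are hypothetical illustrations of the idea of successive shortlisting; the actual filters of Chapter 5 are derived from the cost-of-adaptation analysis.

```python
def level1(input_words, examples, max_len_diff=2):
    """First, cheap filter: keep examples of comparable length."""
    return [e for e in examples
            if abs(len(e.split()) - len(input_words)) <= max_len_diff]

def level2(input_tags, examples, tag_of, min_overlap=0.5):
    """Second filter: keep examples sharing enough functional tags."""
    kept = []
    for e in examples:
        overlap = len(set(tag_of[e]) & set(input_tags)) / len(set(input_tags))
        if overlap >= min_overlap:
            kept.append(e)
    return kept
```

Only the survivors of both filters would then undergo the full cost-of-adaptation computation.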
In this work we have provided a solution for the second problem too. We have
given rules for splitting a complex sentence into more than one simple sentence.
Translations of these simple sentences may then be generated by the EBMT system,
and these individual translations may then be combined to obtain the translation of the
given complex sentence input. If the cost-of-adaptation-based similarity measurement
scheme is applied for translating the simple sentences, then the cost of adaptation
of the complex sentence too can be estimated, by adding the individual costs to
the cost of combining the individual translations. Since the last operation is purely
algorithmic, its computational complexity can be easily computed, and hence the
overall cost of adaptation can be estimated. With respect to dealing with complex
sentences we have, however, imposed certain restrictions: we considered sentences with
only one subordinate clause, and the presence of a connecting word is also
mandatory. Evidently, more complicated complex sentence structures exist,
and further investigation is required to develop techniques for handling them
in an EBMT framework.
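The splitting idea, under the stated restrictions (one subordinate clause, a mandatory connecting word), can be sketched as follows; the connective list is illustrative and not the thesis's actual rule set.

```python
# Illustrative connectives; a sentence is split at the first one found.
CONNECTIVES = ("because", "when", "although", "if", "that")

def split_complex(sentence):
    """Split a complex sentence with one subordinate clause at its
    connecting word; return (main clause, connective, subordinate clause)."""
    words = sentence.rstrip(".").split()
    for i, w in enumerate(words):
        if w.lower() in CONNECTIVES and 0 < i < len(words) - 1:
            main = " ".join(words[:i]) + "."
            subordinate = " ".join(words[i + 1:]) + "."
            return main, w.lower(), subordinate
    return sentence, None, None  # no connective: treat as simple
```

Each simple clause can then be translated separately and the outputs recombined, with the combination cost added to the clause-level adaptation costs.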
In this connection we would like to mention that we have explained the cost of
adaptation with respect to a selected set of sentence structures, and for a selected set of
Functional Slots. Certainly many more variations exist with respect to these
parameters. Consequently, more work has to be done to form rules for handling
these variations. However, we feel that the work described in this research provides a
suitable guideline for further continuation of the research.
1.2 Some Critical Points
1) The aim of this research is not to construct an English to Hindi EBMT system.
Rather, our intention is to analyze the requirements that help in building an
effective EBMT system. The motivation behind this research came from two
major observations:
- Although some MT systems for translation from English to Hindi already
exist, the quality of their translation is often not up to the mark. This
prompted us to look into the process of MT to ascertain the inherent
difficulties.
- We have chosen EBMT as our preferred paradigm because of certain
advantages it has over other MT paradigms, such as RBMT and SBMT. One major
advantage of EBMT is that it requires neither a huge parallel corpus, as
required by SBMT, nor the framing of a large rule base, as required by
RBMT. Study of EBMT is therefore feasible for us, as we did not have
access to such linguistic resources.
2) In order to design our scheme we have studied about 30,000 English to Hindi
translation examples available off-line. Although now large volumes of English
-
English sentence: The horses have been running for one hour.
Tagged form: @DN> ART the, @SUBJ N PL horse %ghodaa%, @+FAUXV V PRES have, @-FAUXV V PCP2 be, @-FMAINV V PCP1 run %daudaa%, @ADVL PREP for, @QN> NUM CARD one %ek%, @
-
this research will be helpful for developing MT systems not only for Hindi but also for
other Indian languages (e.g. Bangla, Gujarati, Panjabi). All these languages suffer
from the same drawback - unavailability of linguistic resources. However, the demand
for developing MT systems from English to these languages is increasing with time,
not only because these are prominent regional languages of India, but also because they
are important minority languages in other countries such as the U.K. (Somers, 1997).
The studies made in this research should pave the way for developing EBMT systems
involving these languages as well.
-
Chapter 2
Adaptation in English to Hindi
Translation: A Systematic
Approach
-
2.1 Introduction
The need for an efficient and systematic adaptation scheme arises for modifying a
retrieved example, and thereby generating the required translation. This chapter is
devoted to the study of a systematic adaptation approach. Various approaches have
been pursued in dealing with the adaptation aspect of an EBMT system. Some of the
major approaches are described below.
1. Adaptation in Gaijin (Veale and Way, 1997) is modelled via two categories:
high-level grafting and keyhole surgery. High-level grafting deals with phrases.
Here an entire phrasal segment of the target sentence is replaced with another
phrasal segment from a different example. On the other hand, keyhole surgery
deals with individual words in an existing target segment of an example. Under
this operation words are replaced or morphologically fine-tuned to suit the
current translation task. For instance, suppose the input sentence is The girl
is playing in the park., and in the example base we have the following examples:
(a) The boy is playing.
(b) Rita knows that girl.
(c) It is a big park.
(d) Ram studies in the school.
For high-level grafting, the sentences (a) and (d) will be used. Then keyhole
surgery will be applied to put in the translations of the words park and
girl. These translations will be extracted from (b) and (c).
2. Shiri et al. (1997) have proposed another adaptation procedure. It is based on
three steps: finding the difference, replacing the difference, and smoothing the
output. The differing segments of the input sentence and the source template
are identified.
are identified. Translations of these different segments in the input sentence
are produced by rule-based methods, and these translated segments are fitted
into a translation template. The resulting sentence is then smoothed over by
checking for person and number agreement, and inflection mismatches. For
example, assume the input sentence and selected template are:
SI A very efficient lady doctor is busy.
St A lady doctor is busy.
Tt mahilaa chikitsak vyasta hai
The parsing process shows that A very efficient lady doctor is a
noun phrase, and so matches it with A lady doctor - ek mahilaa
chikitsak. A very efficient lady doctor is then translated as ek bahut yogya mahilaa
chikitsak by the rule-based noun phrase translation system. This is inserted
into Tt, giving the following: Tt: ek bahut yogya mahilaa chikitsak vyasta hai.
3. The ReVerb system (Collins, 1998) proposed the following adaptation scheme. Here
two different cases are considered: full-case adaptation and partial-case
adaptation. Full-case adaptation is employed when a problem is fully covered by the
retrieved example. Here the desired translation is created by substitution alone;
no addition or deletion is required for adapting TL to generate the translation
of SL. Here TL and SL denote the example-base target language sentence
and the input source language sentence, respectively. In this case five scenarios
are possible: SAME, ADAPT, IGNORE, ADAPTZERO and IGNOREZERO.
Partial-case adaptation is used when a single unifying example does not exist.
Here three more operations are required on top of the above five:
ADD, DELETE and DELETZERO.
-
Figure 2.1: The five possible scenarios in the SL-SL'-TL interface of partial-case matching
Note that there is a subtle difference between ADAPT and ADAPTZERO.
For ADAPT as well as for ADAPTZERO, both SL and SL' have the same links
but different chunks. If TL has words corresponding to the chunk that differs
between SL and SL', then the words in TL should be modified; this is
the case of ADAPT. On the other hand, if no corresponding chunk is present
in TL, then it is the case of ADAPTZERO, and in that case no work is
needed for adaptation. Similar subtleties may be observed between DELETE
and DELETZERO, and also between IGNORE and IGNOREZERO. The other
operations (such as SAME and ADD) have obvious interpretations. Figure 2.1
provides a conceptual view of partial-case matching.
4. Somers (2001) proposes adaptation from a case-based reasoning (CBR) point of view.
The simplest of the CBR adaptation methods is null adaptation, where no
changes are recommended. In more general situations various substitution
methods (e.g. reinstantiation, parameter adjustment) and transformation methods
(e.g. commonsense transformation and model-guided repair) may be applied.
For example, suppose the input sentence (I) and the retrieved example (R)
2.1. Introduction
are:
I: That old woman has died.
R: That old man has died. → wah boodhaa aadmii mar gayaa
To generate the desired translation, the translation of the word man, i.e. aadmii, is first re-
placed with the translation of woman, i.e. aurat, in R. This operation is called
reinstantiation. At this stage an intermediate translation wah boodhaa aurat
mar gayaa is obtained. To obtain the final translation wah boodhii aurat
mar gayii, the system must also change the adjective boodhaa to boodhii
and the word gayaa to gayii. This is called parameter adjustment.
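The two steps above can be sketched as follows. This is a minimal illustration of reinstantiation followed by parameter adjustment on the example in the text; the substitution tables are illustrative assumptions, not a real morphological analyser.

```python
# Illustrative lookup tables (assumptions for this example only).
REINSTANTIATE = {"aadmii": "aurat"}          # man -> woman
PARAM_ADJUST = {"boodhaa": "boodhii",        # masc. -> fem. adjective
                "gayaa": "gayii"}            # masc. -> fem. verb form

def adapt(retrieved_tl):
    tokens = retrieved_tl.split()
    # Step 1: reinstantiation -- swap the differing content word.
    tokens = [REINSTANTIATE.get(t, t) for t in tokens]
    # Step 2: parameter adjustment -- propagate gender agreement.
    tokens = [PARAM_ADJUST.get(t, t) for t in tokens]
    return " ".join(tokens)

print(adapt("wah boodhaa aadmii mar gayaa"))
# → wah boodhii aurat mar gayii
```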
5. The adaptation scheme proposed by McTait (2001) works in the following way.
Translation patterns that share lexical items with the input and partially cover
it are retrieved in a pattern matching procedure. From these, the patterns
whose SL side covers the SL input to the greatest extent (longest cover) are
selected. They are termed base patterns, as they provide sentential context in
the translation process. Intuitively, the greater the extent of the cover
provided by the base patterns, the more the context, and the less the
ambiguity and complexity in the translation process. If the SL side of the base
pattern does not fully cover the SL input, any unmatched segments are bound
to the variables on the SL side of the base pattern. The translations of the SL
segments bound to the SL variables of the base pattern are retrieved from the
remaining set of translation patterns, and the text fragments and variables on
the TL side of the base pattern form the translation string.
The following is a simple example. Given the source language input I: AIDS
control programme for Ethiopia, suppose the longest covering base pattern is:
D1: AIDS control programme for (...) → (...) ke liye AIDS contral smahaaroo.
To complete the match between I and the source language side of D1, a trans-
lation pattern containing the text fragment Ethiopia is required, i.e.
D2: (...) Ethiopia (...) → (...) ethiopia (...).
The TL translation T: ethiopia ke liye AIDS contral smahaaroo is generated
by recombining the text fragments. Since Ethiopia and ethiopia
are aligned in D2 on a 1:1 basis, and so are the variables in the base pattern D1, the
TL text fragment ethiopia is bound to the variable on the TL side of D1 to
produce T.
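The recombination step above can be sketched under strong simplifying assumptions: patterns are strings with a single (...) variable, and SL and TL variables align 1:1. The pattern strings follow the D1 and D2 of the example; the function name and data layout are assumptions for illustration.

```python
# McTait-style recombination sketch: bind the unmatched input segment to the
# base pattern's variable, translating it via an auxiliary pattern (D2).
def recombine(base_sl, base_tl, input_sentence, seg_translation):
    """Fill the base pattern's TL variable with the translated segment."""
    prefix, suffix = base_sl.split("(...)")
    assert input_sentence.startswith(prefix) and input_sentence.endswith(suffix)
    # the input text not covered by the base pattern's SL side
    segment = input_sentence[len(prefix):len(input_sentence) - len(suffix)]
    return base_tl.replace("(...)", seg_translation[segment.strip()])

D1_SL = "AIDS control programme for (...)"
D1_TL = "(...) ke liye AIDS contral smahaaroo"
D2 = {"Ethiopia": "ethiopia"}      # aligned fragments taken from pattern D2
print(recombine(D1_SL, D1_TL, "AIDS control programme for Ethiopia", D2))
# → ethiopia ke liye AIDS contral smahaaroo
```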
6. In HEBMT (Jain, 1995) examples are stored in an abstracted form for deter-
mining the structural similarity between the input sentence and the example
sentences. The target language sentence is generated using the target pat-
tern of the example that is at the least distance from the input sentence. The
system substitutes, in the target pattern, the corresponding translations of the syntactic units identified
by a finite state machine. Variations in the tense of the verb,
and variations due to number, gender etc. are taken care of at this stage for
generating the appropriate translation. Since this system translates from Hindi to
English, we explain its adaptation process with an example of Hindi
to English translation.
For example, suppose the input sentence is merii somavara ko jaa rahii hai
and it matches with the example sentence R: meraa dosta itavaar ko aayegaa.
Steps (a) to (f) below show the process of translation.
(a) merii somavara ko jaa rahii hai (input sentence)
(b) <snp>1 <npk2>2 <mv>3 (syntactic grouping)
(c) [Mary] [Monday] [go] (English translation of syntactic groups)
(d) <snp> <mv> {on} <npk2> (target pattern of example R)
(e) [Mary] [is going] on [Monday] (Translation after substitution)
(f) Mary is going on Monday (Final translated output)
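The substitution step of this process can be sketched as below. The slot names (snp, npk2, mv) follow the footnote in the text, but the target pattern format and the angle-bracket notation are illustrative assumptions, not HEBMT's actual internal representation.

```python
# HEBMT-style generation sketch: substitute the translations of the syntactic
# groups into the target pattern of the retrieved example.
def generate(target_pattern, group_translations):
    out = target_pattern
    for name, translation in group_translations.items():
        out = out.replace("<" + name + ">", translation)
    return out

# syntactic groups of the input, with their English translations
groups = {"snp": "Mary", "npk2": "Monday", "mv": "is going"}
pattern = "<snp> <mv> on <npk2>"        # assumed target pattern of example R
print(generate(pattern, groups))
# → Mary is going on Monday
```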
Many other EBMT systems are found in the literature, e.g. GEBMT (Brown, 1996,
1999, 2000, 2001), EDGAR (Carl and Hansen, 1999) and TTL (Guvenir and Cicekli,
1998). But overall, in our view, the adaptation procedures employed in different
EBMT systems primarily consist of four operations:
Copy, where the same chunk of the retrieved translation example is used in
the generated translation;
Add, where a new chunk is added to the retrieved translation example;
Delete, where some chunk of the retrieved example is deleted; and
Replace, where some chunk of the retrieved example is replaced with a new
one to meet the requirements of the current input.
The operations prescribed in different systems vary in the chunks they deal with.
Depending upon the case it may be a phrase, a word or a sub-word (e.g. declensional
suffix).
1snp: noun, adj+noun, noun+kaa+noun. 2npk2: noun+ko. 3mv: verb-part.
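These four operations can be illustrated mechanically by diffing the retrieved translation against the required one. The sketch below uses Python's difflib; the mapping of difflib opcodes to the four operation names is our illustration, not a procedure from any of the systems surveyed. The example pair is the squirrel/elephant example used later in this chapter.

```python
# Recover Copy/Add/Delete/Replace operations by diffing two token sequences.
import difflib

OP_NAME = {"equal": "Copy", "insert": "Add",
           "delete": "Delete", "replace": "Replace"}

def adaptation_ops(retrieved, required):
    """List (operation, old chunk, new chunk) triples at the word level."""
    a, b = retrieved.split(), required.split()
    sm = difflib.SequenceMatcher(a=a, b=b)
    return [(OP_NAME[tag], a[i1:i2], b[j1:j2])
            for tag, i1, i2, j1, j2 in sm.get_opcodes()]

for op, old, new in adaptation_ops("haathii phal khaa rahaa thaa",
                                   "gilharii moongphalii khaa rahaa thaa"):
    print(op, old, new)
```

Here the diff yields a Replace of the first two words followed by a Copy of the remaining three, matching the constituent word replacement example discussed in Section 2.2.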
With respect to English and Hindi, we find that both languages depend
heavily on suffixes for verb morphology, changing numbers from singular to plu-
ral and vice versa, case endings, etc. Appendix A provides detailed descriptions
of various Hindi suffixes. Keeping the above in view, we differentiate the adap-
tation operations into two groups: word based and suffix based. The word based
operations are further subdivided into two categories: constituent word based and
morpho-word based. Thus the adaptation scheme proposed here consists of ten op-
erations: Copy (CP), Constituent word deletion (WD), Constituent word addition
(WA), Constituent word replacement (WR), Morpho-word deletion (MD), Morpho-
word addition (MA), Morpho-word replacement (MR), Suffix addition (SA), Suffix
deletion (SD) and Suffix replacement (SR). Section 2.2 illustrates the roles of
these operations in adapting a retrieved translation example.
The advantage of the above classification of adaptation operations is twofold.
Firstly, it helps in identifying the specific task that has to be carried out in the step-
by-step adaptation for a given input. Secondly, it helps in measuring the average
cost of each of the above operations in a meaningful way, which in turn helps in
estimating the total adaptation cost for a given sentence. This estimate can be used
as a tool for similarity measurement between an input and the stored examples.
These issues are discussed in Chapter 5.
2.2 Description of the Adaptation Operations
The ten adaptation operations mentioned above are described below.
1. Constituent Word Replacement (WR): One may get the translation of the
input sentence by replacing some words in the retrieved translation example.
Suppose the input sentence is: The squirrel was eating groundnuts., and the
most similar example retrieved by the system (along with its Hindi translation)
is: The elephant was eating fruits. → haathii phal khaa rahaa thaa. The
desired translation may be generated by replacing haathii with the Hindi of
squirrel, i.e. gilharii, and replacing phal with the Hindi of groundnuts,
i.e. moongphalii. These are examples of the operation of constituent word
replacement.
2. Constituent Word Deletion (WD): In some cases one may have to delete some
words from the translation example to generate the required translation. For
example, suppose the input sentence is: Animals were dying of thirst. If the
retrieved translation example is: Birds and animals were dying of thirst. →
pakshii aur pashu pyaas se mar rahe the, then the desired translation can
be obtained by deleting pakshii aur (i.e. the Hindi of birds and) from the
retrieved translation. Thus the adaptation here requires two constituent word
deletions.
3. Constituent Word Addition (WA): This operation is the opposite of constituent
word deletion. Here the addition of some extra words to the retrieved trans-
lation example is required for generating the translation. For illustration, one
may consider the example given above with the roles of the input and retrieved
sentences reversed.
4. Morpho-word Replacement (MR): In this case one morpho-word is replaced by
another morpho-word in the retrieved translation example. Consider a case
when the input sentence is: The squirrel was eating groundnuts., and the
retrieved example is: The squirrel is eating groundnuts. → gilharii moongfalii
khaa rahii hai. In order to take care of the variation in tense, the morpho-
word hai is to be replaced with thii. This is an example of morpho-word
replacement.
5. Morpho-word Deletion (MD): Here some morpho-word(s) are deleted from the
retrieved translation example. For illustration, if the input sentence is He
eats rice., and the retrieved example is: He is eating rice. → wah chaawal
khaa rahaa hai, then to obtain the desired translation4, first the morpho-word
rahaa is to be deleted from the retrieved translation example.
6. Morpho-word Addition (MA): This is the opposite case of morpho-word dele-
tion. Here some morpho-words need to be added in the retrieved example in
order to generate the required translation.
7. Suffix Replacement (SR): Here the suffix attached to some constituent word
of the retrieved sentence is replaced with a different suffix to meet the current
translation requirements. This may happen with respect to a noun, adjective,
verb, or case ending. For illustration,
(a) To change the number of nouns
Boy (ladkaa) → Boys (ladke)
The suffix aa is replaced with e in order to get the plural form in
Hindi.
(b) Change of adjectives
Bad boy (buraa ladkaa) → Bad girl (burii ladkii)
The suffix aa is replaced with ii to get the adjective burii.
4Of course, the final translation will be obtained by adding the suffix taa to the word khaa.
(c) Morphological changes in verb
He reads. (wah padtaa hai) → She reads. (wah padtii hai)
The suffix taa is replaced with tii to get the verb padtii, which is
required to indicate that the subject is feminine.
(d) Morphological changes due to case ending
boy (ladkaa) → from boy (ladke se)
room (kamraa) → in room (kamre mein)
The suffix aa is replaced with e to get the nouns ladke and kamre.
8. Suffix Deletion (SD): By this operation the suffix attached to some constituent
word may be removed, and thereby the root word may be obtained. This
operation is illustrated in the following examples:
(a) To change the number of nouns
women (aauraten) → woman (aaurat)
The suffix en is deleted from aauraten to get the Hindi translation
of woman.
(b) Morphological changes in verb
He reads. (wah padtaa hai) → He is reading. (wah pad rahaa hai)
The suffix taa is deleted from padtaa to get the root form pad of
the English verb read.
(c) Morphological changes due to case ending
in the houses (gharon mein) → houses (ghar)
in words (shabdon mein) → words (shabd)
The suffix on is deleted from gharon and shabdon to get the Hindi
translations of the nouns houses and words, respectively.
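The suffix operations above reduce to simple string manipulations, as the following toy sketch shows. Plain suffix stripping is an illustrative simplification; real Hindi morphology would need a proper analyser.

```python
# Toy sketch of suffix replacement (SR) and suffix deletion (SD).
def suffix_replace(word, old, new):
    """SR: replace suffix `old` with suffix `new`."""
    assert word.endswith(old)
    return word[:-len(old)] + new

def suffix_delete(word, suffix):
    """SD: strip a suffix to recover the root form."""
    assert word.endswith(suffix)
    return word[:-len(suffix)]

print(suffix_replace("ladkaa", "aa", "e"))     # → ladke  (boy -> boys)
print(suffix_replace("padtaa", "taa", "tii"))  # → padtii (he reads -> she reads)
print(suffix_delete("gharon", "on"))           # → ghar   (in the houses -> houses)
```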
9. Suffix Addition (SA): Here a suffix is added to some constituent word in the
retrieved example. Note that here the word concerned is in its root form in
the retrieved example. One may consider the examples given above, with the
roles of the input and retrieved sentences reversed, as suitable examples of the
suffix addition operation.
10. Copy (CP): When some word (with or without suffix) of the retrieved example
is retained in toto in the required translation then it is called a copy operation.
Figure 2.2 provides an example of adaptation using the above operations. In this
example the input sentence is He plays football daily., and the retrieved translation
example is:
They are playing football. → we football khel rahe hain
(They) (football) (play) (...ing) (are)
The translation to be generated is: wah roz football kheltaa hai. When adaptation is
carried out using both word and suffix operations, the steps look as
given in Figure 2.2. In this respect one may note that Hindi is a free word order language,
and consequently the position of the adverb is not fixed. Hence the above input sentence
may have different Hindi translations:
wah roz football kheltaa hai
wah football roz kheltaa hai
roz wah football kheltaa hai
While implementing an EBMT system one has to stick to some specific format.
The adverb will be added according to the format adopted by the system.
Input:      we football khel rahe hain
Operations: WR WA CP SA MD MR
Output:     wah roz football kheltaa hai
Figure 2.2: Example of Different Adaptation Operations
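The steps of Figure 2.2 can be traced in code as below. The (operation, argument) encoding is our illustrative assumption; the operation semantics follow the definitions in this section, and CP needs no explicit action here because football is carried over unchanged.

```python
# Step-by-step sketch of the adaptation in Figure 2.2.
steps = [
    ("WR", ("we", "wah")),       # constituent word replacement
    ("WA", ("roz", 1)),          # constituent word addition at position 1
    ("SA", ("khel", "taa")),     # suffix addition to the root verb
    ("MD", "rahe"),              # morpho-word deletion
    ("MR", ("hain", "hai")),     # morpho-word replacement
]

def apply_steps(sentence, steps):
    tokens = sentence.split()
    for op, arg in steps:
        if op in ("WR", "MR"):                    # replace a token
            tokens = [arg[1] if t == arg[0] else t for t in tokens]
        elif op == "WA":                          # insert a token
            tokens.insert(arg[1], arg[0])
        elif op == "SA":                          # attach a suffix
            tokens = [t + arg[1] if t == arg[0] else t for t in tokens]
        elif op == "MD":                          # delete a morpho-word
            tokens = [t for t in tokens if t != arg]
    return " ".join(tokens)

print(apply_steps("we football khel rahe hain", steps))
# → wah roz football kheltaa hai
```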
Which adaptation operations will be required to translate a given input sentence
depends upon the translation example retrieved from the example base. A variety
of examples may be adapted to generate the desired translation, but obviously with
varying computational costs. For efficient performance, an EBMT system therefore
needs to retrieve an example that can be adapted to the desired translation with
the least cost. This brings in the notion of similarity among sentences. The proposed
adaptation procedure has the advantage that it provides a systematic way of evalu-
ating the overall adaptation cost. This estimated cost may then be used as a good
measure of similarity for appropriate retrieval from the example base. How cost of
adaptation may be used as a yardstick to measure similarity between sentences will
be described in Chapter 5.
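The retrieval idea can be sketched as follows: assign each operation an average cost and select the example whose adaptation is cheapest. The cost values and the second candidate example below are placeholders for illustration, not the thesis's measured figures.

```python
# Sketch: adaptation cost as a similarity measure for retrieval.
# The per-operation costs are assumed placeholder values.
COST = {"CP": 0.0, "WR": 2.0, "WA": 2.5, "WD": 1.5,
        "MR": 1.0, "MA": 1.2, "MD": 0.8, "SR": 0.5, "SA": 0.5, "SD": 0.4}

def adaptation_cost(ops):
    """Total estimated cost of a sequence of adaptation operations."""
    return sum(COST[op] for op in ops)

# operations assumed necessary to adapt each candidate example
candidates = {
    "They are playing football.": ["WR", "WA", "CP", "SA", "MD", "MR"],
    "He is playing football.":    ["WA", "CP", "CP", "SA", "MD", "MR"],
}
# retrieve the example with the least estimated adaptation cost
best = min(candidates, key=lambda s: adaptation_cost(candidates[s]))
print(best)
# → He is playing football.
```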
Here our aim is to count the number of adaptation operations required in adapt-
ing a retrieved example to generate the translation of a given input. Obviously, de-
pending upon the situation one has to apply some adaptation operations for changing
different functional slots5 (Singh, 2003), such as subject, object and verb.
Also certain operations are required for changing the kind of sentence, e.g.
5The following example illustrates the difference between functional slots and functional tags. Consider the sentence The old man is weak. The subject of this sentence is the noun phrase The old man. It consists of three functional tags, viz. @DN>, @AN> and @SUBJ, stating that the is a determiner, old is an adjective, and man is the subject. But, as mentioned above, the entire noun phrase plays the role of subject of the sentence. Thus the functional slot for this phrase is the subject slot. Note that a particular functional slot may have a variable number of words. The sequence of functional slots in a sentence provides the sentence pattern. The difference between various tags (e.g. POS tag, functional tag) is explained in detail in Appendix B.
affirmative to negative, negative to interrogative etc. Table 2.2 contains the nota-
tions for the roles of different functional slots and operators, which are required for
the subsequent discussion.
Operators   Role of operators
< >         For a functional slot or part of speech and its transformation.
&           Both functional slots or parts of speech and their transformations should be present.
or          Either the first slot/tag, the second slot/tag, or both.
{ }         For a non-obligatory functional tag/slot or for an optional adaptation operation.
[ ]         For the property of a functional slot/tag.
Functional Slot   Role of functional slot
(…)     Linking verb. Linking verbs in English are: are, am, was, were, become, seem etc., and in Hindi are: hai, hain, ho, thaa, the etc.
(…)     Auxiliary verb (if any) and main verb of the sentence
(…)     Auxiliary verb
(…)     Main verb
(…)     Subject
(…)     Object
(…)     First object
(…)     Second object
(…)     Subjective complement
PCP1    -ing verb form other than the main verb
PCP2    -ed or -en verb forms other than the main verb
(…)     to-infinitive form of verb
(…)     Adverb
(…)     Adjective phrase
(…)     Preposition phrase
(…)     Preposition
Table 2.2: Notations Used in Sentence Patterns
The following sections describe how many such operations are required in dif-
ferent cases. In particular we consider the following functional slots and sentence
kinds:
1. Tense and form of the verb. Since there are three tenses (viz. Present,
Past and Future) and four forms (Indefinite, Continuous, Perfect, and Perfect
Continuous), in all one can have 12 different active verb structures, along with
the corresponding passive verb structures.
2. Subject/object functional slot. Variations in the subject/object functional slot
may happen in many different ways, such as Proper Noun, Common Noun
(Singular or Plural), Pronoun, PCP1 form6 and PCP2 form7. We also study varia-
tion in pre-modifier adjectives, genitive case, quantifier and determiner tags.
3. Study of wh-family interrogative sentences.
4. Kind of sentence: whether the sentence is affirmative, negative, interrogative
or negative interrogative.
Systematic study of these patterns, and their components helps in estimating
the adaptation costs between them.
2.3 Study of Adaptation Procedure for Morpho-
logical Variation of Active Verbs
Hindi verb morphological variations depend on four aspects: the gender, number and
person of the subject, and the tense (and form) of the sentence. All these variations affect
6-ing verb form other than the main verb. 7-ed or -en verb forms other than the main verb.
the adaptation procedure. In Hindi, these conjugations are realized by using suffixes
attached to the root verbs, and/or by adding some auxiliary verbs (see Table A.3 of
Appendix A). Since there are 12 different structures (depending upon the tense and
form), the adaptation scheme should have the capability to adapt any one of them
to any input type. Hence altogether 12 × 12, i.e. 144, different combinations are
possible. However, Table A.3 (Appendix A) shows that in Hindi, perfect cont