Chapter 1 Introduction - a reservoir of Indian...
Chapter 1 Introduction
Machine Translation (MT), also known as automatic translation or mechanical
translation, is the computerized method that automates all or part of the process
of translating from one human language to another.
The importance of MT in the modern world as an instrument for bridging the
digital divide, its multi-disciplinary academic thrusts, and the absence of such a
system in the legal domain are justification enough for this research work. This
research work is an attempt to build a system that translates simple sentences in
the legal domain from Punjabi into English. The need for this system arises from the
translation of legal documents transferred from the District Courts of Punjab to
the High Court in Chandigarh. FIRs, which are written in Punjabi, are
translated into English before being presented to the High Court. To the best of my
knowledge, no Machine Translation System has been developed from Punjabi to
English in the legal domain. Similar systems translating from some other Indian
languages to English have been developed, for example Hinglish, a Machine Translation
System for pure (standard) Hindi to pure English forms, developed by R. Mahesh
K. Sinha and Anil Thakur.[1-3]
This chapter introduces Machine Translation, its concepts, the various approaches
to Machine Translation Systems and the key activities involved in them. It also provides
a formal description of the research question undertaken and its objectives, as
well as the need for and scope of the study. The approach followed, along with the
reasons for selecting it to solve this research problem, is explained in
brief. The chapter concludes by presenting the major contributions of the research
work and an outline of the study. This work is based on the Gurmukhi and Roman
scripts. The examples given in this thesis are in Gurmukhi script along with
their transliteration, for example (sakar). The transliteration provided is
based on the transliteration software GTrans, which was developed by Punjabi
University, Patiala, Punjab, India.
1.1 Machine Translation
The term 'Machine Translation' (MT) refers to the computerized system
responsible for the production of translations with or without human assistance.[4]
Machine Translation is an application of computer and language sciences that
helps in the development of systems answering practical needs. Computer programs
may not produce perfect translations of literary texts,
but they do produce useful translations of technical manuals, scientific documents,
commercial prospectuses, administrative memoranda and medical reports.
1.1.1 Some Misconceptions about Machine Translation
There are some misconceptions about MT. It is widely believed that the quality of
translation from an MT system is very poor. This notion has some truth to it,
because no existing system can produce truly perfect translations. However, this
does not make MT useless. A rough translation is very helpful when we have
a document containing very important information written in a language that we
do not understand. Moreover, a human translator does not normally
produce a perfect translation immediately. It is general practice to divide the job of
translating a document into two stages. The first stage is to produce a draft
translation, which need not be perfect. The draft is then revised, either by the
same translator or by another translator, with a view to improving it.
For the most part, the aim of MT is only to automate the first, draft
translation stage and to make the overall translation process faster, simpler and
cheaper.[5] MT does not threaten translators' jobs. Rather, MT systems can take
over some of the boring, repetitive translation jobs and allow human translators to
concentrate on more interesting tasks, where their specialist skills are really
required.
1.1.2 History of Machine Translation
1.1.2.1 Machine Translation of Non-Indian Languages
Although the idea of mechanical dictionaries originated in the 20th century, the origins
of MT can be traced back to the 17th century. The first
tentative ideas of using the newly invented computers for translating natural
languages came in 1946-1947. Warren Weaver received much credit for bringing the
concept of MT to public attention when he published an influential paper on using
computers for translation
in 1949. The early 1950s were a period of intense research in MT in both the
United States and Europe. Massachusetts Institute of Technology (MIT) started
research in this field in 1951. In 1952 the first conference on MT was held, but it
was not until 1954 that a translation system was demonstrated. This system was
not that accurate but it attracted the interest of researchers and media all over the
world. The first journal on Mechanical Translation was published at MIT in 1953.
In 1959, IBM installed an MT system for the United States Air Force, followed by
Georgetown University installing systems at Euratom and the United States
Atomic Energy Agency. Despite some success of early MT systems, MT research
funding was on the verge of serious reduction. The growing dissatisfaction of
research sponsors caused the United States National Academy of Sciences to
set up the Automatic Language Processing Advisory Committee (ALPAC) in
1966. ALPAC, whose members were the major sponsors of current MT research
projects, was to evaluate the effectiveness, costs and potential future progress of
MT. The level of global MT activity probably reached its highest
level during the mid-1960s, at the time of the ALPAC report. 1976 marked a
positive turning point for MT research, when Canada made public its METEO System,
which translated weather forecasts. Later that year, the European Commission
purchased SYSTRAN, a Russian-English system. MT interest and activity have
increased ever since, and MT has been established as a legitimate field of
research. In 1970 came the first doctoral thesis in MT, by Anthony G. Oettinger,
which presented a study for a Russian mechanical dictionary. The largest growth
area has been in the marketing and sale of commercial MT systems, many for
personal computers, and in the provision of MT-based services. The picture in
MT research also changed in the early 1980s; however, it was still concentrated largely
in well-established projects at universities (Grenoble, Saarbrücken, Montréal,
Texas, Kyoto, and the Eurotra project) and in connection with systems such as
Systran, Logos, ALPS and Weidner. It was perhaps these systems, however
crude in terms of linguistic quality, which more than anything else alerted the
translation profession to the possibilities of exploiting the increasing
sophistication of computers in the service of translation. The 1990s saw MT
implemented as an online service. The 2000s have seen even more research
into MT and many new, efficient hybrid algorithms.[6-19]
1.1.2.2 Machine Translation in India
The earliest efforts in Machine Translation in India date from the mid-1980s and
early 1990s. Prominent among these efforts are the research and development
projects at the Indian Institute of Technology Kanpur, the University of Hyderabad,
the National Center for Software Technology, Mumbai and the Center for Development
of Advanced Computing (CDAC), Pune.[20] Since the mid and late 1990s, a few
more projects have been initiated at the Indian Institute of Technology Bombay,
the International Institute of Information Technology Hyderabad, the Anna University
KB Chandrasekhar Research Center, Chennai and Jadavpur University, Kolkata.
There are also a couple of efforts from the private sector, from Super Infosoft
Private Limited and, more recently, the IBM India Research Laboratory. The Department of IT,
Ministry of Communications and Information Technology, Government of India,
has played an instrumental role by funding the projects. Funding has also come from
the Technology Development for Indian Languages (TDIL)
program of the Ministry of Information Technology (MIT) and from the UNDP.
The University Grants Commission (UGC) has also started supporting minor and major
research projects involving the development of linguistic parsers and Machine
Translation Systems. The Indian Institutes of Technology (IITs), Indian Institutes of
Information Technology (IIITs), Centre for Development of Advanced Computing
(C-DAC), Indian Institute of Science (IIS), Indian Statistical Institute (ISI),
Jawaharlal Nehru University (JNU), Mahatma Gandhi International Hindi
University (MGIHU) and other institutes have made significant contributions in this field.
Private enterprises such as Tata Institute of Fundamental Research (TIFR) and Tata
Consultancy Services (TCS) have also funded Indian language technology
research and development.[21,22]
IIT Guwahati, CDAC Kolkata and Jawaharlal Nehru University (JNU), New Delhi are
also involved in developing Machine Translation Systems for different Indian
languages.[20] Punjabi University, Patiala has also entered the field of
Machine Translation and has successfully developed Hindi-Punjabi and Punjabi-Hindi
Machine Translation Systems. Thapar University, Patiala is also working on a
UNL-based Machine Translation System.[22, 23]
1.1.3 Approaches used in Machine Translation
Broadly, the approaches used for translation are rule-based and corpus-based.
The rule-based approach is further classified into direct, interlingual and
transfer-based approaches. The direct translation approach was typical of the "first
generation" of MT systems. The indirect approaches of interlingua and transfer
based systems characterise the "second generation" of MT
systems. Interlingua and transfer approaches are essentially based on the
specification of rules (for morphology, syntax, lexical selection, semantic analysis,
and generation). During the last few years, a
"third generation" of MT systems has begun to emerge, depending upon knowledge-based,
corpus-based and hybrid approaches. Corpus-based methods do not rely on external
knowledge sources such as machine readable dictionaries, concept hierarchies,
or sense-tagged texts. They do not assign sense tags to words; rather, they rely on
monolingual corpora and on methods based on translational equivalence as found in
word-aligned parallel corpora.[25, 26]
1.1.3.1 Rule Based Approaches
Rule-based MT assumes that translation is a process which requires
the analysis and representation of the 'meaning' of source language texts and the
generation of equivalent target language texts. Representations should be
lexically and structurally unambiguous. There are three major approaches: (a)
the 'Direct' approach, in which word-to-word translation is performed without
analyzing the text structurally; (b) the 'Transfer' approach, in which the translation
process operates in three stages: Analysis into abstract source language
representations, Transfer into abstract target language representations, and
Generation or Synthesis into target language texts; and (c) the two-stage
'Interlingua' model, in which analysis is into some language-neutral representation
and generation starts from this interlingua representation.[27]
1.1.3.1.1 Direct Approach
Direct translation is the oldest approach to MT. If an MT system uses direct
translation, it usually means that the source language text is not analyzed
structurally beyond morphology. The translation is based on large dictionaries
and word-by-word translation with some simple grammatical adjustments, e.g. to
word order and morphology. The translation unit of the approach is usually a
word. The lexicon is normally conceived of as the repository of word-specific
information. Traditional lexical resources are machine readable dictionaries that
contain lists of words. These lists might delineate the senses of a word, represent the
meaning of a word, or specify the syntactic frames in which a word can appear,
but the level of granularity with which they are concerned is the individual word.
Systran, one of the oldest MT systems still in use today, is basically a direct
translation system. Its first version was released in 1969. Over the years
the system has been developed considerably, but its translation capability is still
mainly based on very large bilingual dictionaries. No general linguistic theory or
parsing principles are necessarily present for direct translation to work; these
systems depend instead on well developed dictionaries, morphological analysis,
and text processing software.[28] Figure 1.1 shows the block diagram of a Direct
MT System.
Figure 1.1 Direct MT System
For example, a direct translation from Punjabi to English is:
P:
T: pañchī uḍiā
E: Bird flew
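
The word-by-word lookup described above can be sketched as follows. This is only a minimal illustration, not the system built in this work: the dictionary entries use a simplified ASCII romanization chosen for this sketch, and the function name `direct_translate` is hypothetical.

```python
# Toy bilingual dictionary. The romanized Punjabi keys are illustrative
# assumptions for this sketch, not entries from the thesis lexicon.
BILINGUAL_DICT = {
    "panchi": "bird",
    "udia": "flew",
}


def direct_translate(sentence, dictionary=BILINGUAL_DICT):
    """Translate word by word, keeping the source word order."""
    output = []
    for word in sentence.lower().split():
        # Unknown words are passed through unchanged, marked with '?'.
        output.append(dictionary.get(word, "?" + word))
    # Capitalize only the first letter of the resulting sentence.
    return " ".join(output).capitalize()


print(direct_translate("panchi udia"))
```

Because no structural analysis is done, the output always follows the source word order, which is why the approach suits closely related language pairs better than pairs with different word orders.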
The disadvantage of the direct method is that it is unidirectional, i.e., if the target is
to be translated back into the source language, a different transformation must be
used. It needs n(n-1) translation modules for translation among n languages, so
the number of modules grows quadratically in a multi-language translation system. Another
problem with the direct method arises when the structure of the sentence is complex: it
then requires complex grammatical analysis, and the word order of the target language
sentence can often be wrong. Additionally, if lexical ambiguity exists, words may be
translated incorrectly.[28] Analysis of the relations between different parts of
the sentence is often lacking, which can lead to poor translation. Direct
translation is very inaccurate for languages with structural and lexical differences.
A direct translation system is therefore used for closely related language pairs, e.g.
Punjabi-Hindi translation or vice versa.
The Georgetown Automatic Translation (GAT) System, developed by Georgetown
University, used the direct approach for translating Russian texts (mainly from
physics and organic chemistry) into English.[29] The Mark II is also a Russian to
English MT System based on the direct translation approach, built for the U.S. Air Force.[30]
RUSLAN is a direct Machine Translation System between the closely related
languages Czech and Russian.[31] SYSTRAN, as described by Hutchins and Somers,
is a direct Machine Translation System that was originally built for the
English-Russian language pair.[32] In India, the team of G. S. Lehal, G. S.
Joshan and Vishal Goyal at Punjabi University, Patiala developed online Hindi-Punjabi
and Punjabi-Hindi Machine Translation Systems using the direct translation
approach.[33]
1.1.3.1.2 Transfer Approach
When the major shortcomings of direct translation were realized, researchers
started working on the transfer method. It occupies the level above direct
translation in the MT pyramid and is also known as indirect or Linguistic
Knowledge (LK) translation. With the Transformer (Direct MT) architecture, the
translation process relies on some knowledge of the source language and some
knowledge about how to transform partly analysed source sentences into strings
that look like target language sentences. With the Transfer-based architecture,
on the other hand, translation relies on extensive knowledge of both the source
and the target languages and of the relationships between analysed sentences in
both languages. It requires linguistic knowledge of the source and target
languages as well as of the differences between them. The transfer method first
parses the source language sentence in the Analysis stage, then
applies rules that map the grammatical segments of the source sentence to a
representation in the target language in the Transfer stage, and finally passes this
representation to the Synthesis stage, where the target language sentence is generated.[5]
In this approach, the software attempts to deconstruct the grammar of the
input language to build a grammatical model of each sentence. The grammatical
model of the input language is then mapped to the grammatical model of the
output language. Transfer systems divide translation into steps which clearly
differentiate source language and target language parts. In the transfer approach
only the ambiguities inherent in the language in question are tackled. Rather than
operating in two stages, the Transfer approach has three. The first stage
converts texts into intermediate representations in which ambiguities have been
resolved irrespective of any other language. Differences between languages, in
vocabulary and structure, are handled in the intermediary transfer stage, which
produces equivalent representations of the target language. In the
third stage these representations are converted into target
language text. Analysis and generation programs are specific to particular languages
and independent of each other.
For Example
P : (SNP) (ONP) (VP)
T: (SNP) baccē (ONP) miṭhāī (VP) pasand karadē han
E: (SNP)Children (VP)Like (ONP)Sweets
The above example shows the Punjabi to English translation of a sentence. After
analyzing the sentence syntactically and semantically, we can easily translate
it even when the two languages have different structures (SOV → SVO). The transfer approach
uses n(n-1) transfer modules, n analysis components, and n synthesis components,
where n is the number of languages in the translation system. Thus, one of its
bottlenecks is the sheer number of rules needed for its implementation. Figure 1.2
shows the block diagram of a Transfer MT System.
Figure 1.2 Transfer MT System
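
The three stages (Analysis, Transfer, Synthesis) can be illustrated with a small sketch based on the example sentence above. Several simplifications are assumed here: the input is taken to be already chunked into subject, object and verb phrases, the lexicon is a toy with a simplified ASCII romanization, and all function names are hypothetical rather than taken from any real system.

```python
# Toy phrase lexicon with simplified romanized Punjabi keys (illustrative only).
LEXICON = {
    "bacce": "children",
    "mithai": "sweets",
    "pasand karde han": "like",
}


def analyse(chunks):
    """Analysis: label the chunks of an SOV source sentence as SNP, ONP, VP."""
    subject, obj, verb = chunks
    return {"SNP": subject, "ONP": obj, "VP": verb}


def transfer(source_tree):
    """Transfer: map each source phrase to its target-language equivalent."""
    return {role: LEXICON[phrase] for role, phrase in source_tree.items()}


def synthesise(target_tree):
    """Synthesis: emit the phrases in English SVO order."""
    words = [target_tree["SNP"], target_tree["VP"], target_tree["ONP"]]
    return " ".join(words).capitalize()


def translate(chunks):
    return synthesise(transfer(analyse(chunks)))


print(translate(["bacce", "mithai", "pasand karde han"]))
```

Keeping the reordering inside `synthesise` mirrors the point made in the text: analysis and generation are language-specific, while only the transfer step knows about both languages.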
The TAUM project, developed at Montreal in 1970 for the translation of weather
forecasts from English to French, uses a Syntactic Transfer System. AnglaBharti,
an Indian system developed at IIT Kanpur under the expert guidance of R. M. K.
Sinha, deals with Machine Translation from English to Indian languages, primarily
Hindi, using a pseudo-interlingual rule-based Transfer Approach. It uses post-editing
to resolve ambiguity and complexity. It was developed mainly for the public health
domain.[34] MaTra is another Human-Assisted
translation project for English to Indian languages based on the Transfer
Approach.[21] The MaTra lexicon approach is general-purpose, but the system
has been applied mainly in the domains of news, annual reports and technical
phrases. The Computer Science Department at the University of Hyderabad has
been working on an English-Kannada MT system using the Universal Clause
Structure Grammar (UCSG) formalism. It is again based on the transfer
approach, and has been applied to the domain of government circulars.
Jadavpur University at Kolkata developed a rule-based English-Hindi MAT system for
news using the Transfer Approach.[21]
1.1.3.1.3 Interlingual Approach
The Interlingual Approach was historically the next step in the development of
MT. In an interlingua-based MT approach, translation is done via an intermediary
(semantic) representation of the source language text. The interlingua is supposed to
be a language-independent representation from which translations can be
generated into different target languages. The interlingua approach assumes that it
is possible to convert source texts into representations common to more than one
language. From such interlingual representations, texts are generated in other
languages. Translation is thus in two stages: from the source language (SL) to
the interlingua (IL) and from the IL to the target language (TL). Programs for
analysis are independent of programs for generation; in a multilingual
configuration, any analysis program can be linked to any generation program.
Procedures for SL analysis are intended to be SL-specific and not oriented to any
particular TL; likewise, programs for TL synthesis are TL-specific and not
designed for input from particular SLs. Translation from and into n languages
requires n(n-1) bilingual 'direct translation' systems, but with translation via an
interlingua just 2n interlingual programs are needed. With more than three
languages the interlingua approach is claimed to be more economic. On the other
hand, the complexity of the interlingua itself is greatly increased. Perhaps then
"Machine Translation" is not an appropriate term, since the machine only
completes the first stage of the process. It would be more accurate to talk of a
tool that aids the translation process, rather than an independent translation
system.[27] Figure 1.3 shows the block diagram of an Interlingual MT System.
Figure 1.3 Interlingual MT System
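
The economy argument for the interlingua can be checked with a short calculation. The sketch below counts one module per ordered (source, target) language pair, which is an assumption of this illustration (the transfer count is sometimes quoted simply as n squared); the function name `module_counts` is hypothetical.

```python
def module_counts(n):
    """Return the number of components each architecture needs for n languages."""
    pairs = n * (n - 1)  # ordered (source, target) language pairs
    return {
        "direct": pairs,             # one bilingual system per ordered pair
        "transfer": n + n + pairs,   # n analysers + n synthesisers + transfer modules
        "interlingua": 2 * n,        # n analysers + n generators
    }


for n in (2, 3, 4, 10):
    print(n, module_counts(n))
```

With three languages the direct and interlingua counts are equal (six each); from four languages onward the interlingua needs fewer components, which matches the "more than three languages" claim in the text.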
For example, the interlingual representation of a sentence in Universal Networking Language is:
P:
T: pañchī uḍiā
UNL: agt ( , (icl>bird))
There are a few problems with the interlingual approach. The interlingual
approach requires an analyzer for each source language and a generator for
each target language. Analysis of source text requires a deep semantic analysis
that requires extensive world knowledge. Unfortunately, the true meaning of a
sentence cannot always be extracted. Additionally, if a text is analyzed as deeply
as is expected, then much of the source author's style will be lost. A further
problem is that using an interlingua in MT can, in some cases, lead to extra,
unnecessary work.
During the 1970s, the University of Texas developed the METAL system for German and
English using the interlingua approach.[35] At the end of the 1990s, the Institute of
Advanced Studies of the United Nations University, Tokyo began its multinational
interlingua-based MT project, built on a standardized intermediary language, the
Universal Networking Language (UNL). The UNL is an international project of the
United Nations University that aims to create an interlingua for all major
human languages. It was initially intended for the six official languages of the United
Nations and then for other widely spoken languages, namely, Hindi, Arabic,
Chinese, English, French, German, Indonesian, Italian, Japanese, Portuguese,
Russian, and Spanish.[35] IIT Bombay is one of the members of the team
responsible for developing UNL models for Hindi. The AnglaBharti system
(developed at IIT Kanpur) uses a pseudo-interlingua approach: it analyzes English
only once and creates an intermediate structure called Pseudo Lingua for Indian
Language (PLIL), instead of requiring a separate translator from English to each Indian
language. The PLIL structure can be converted to each Indian language through
a process of text generation. The idea of using PLIL is primarily to exploit the
structural similarity among the Indian languages to obtain advantages similar to
those of the interlingua approach.[36] A team at Thapar University, Patiala is
working on a Punjabi Language Server, which includes a Punjabi-UNL Enconverter
and a UNL-Punjabi Deconverter.[33]
1.1.3.2 Data Driven Approaches
Most recently, corpus-based methods have changed the traditional picture.
Corpus-based methods of word sense discrimination are knowledge-lean and do
not rely on external knowledge sources such as machine readable dictionaries,
concept hierarchies, or sense-tagged text. They do not assign sense tags to
words; rather, they discriminate among word meanings based on information
found in unannotated corpora. They rely on methods based on translational
equivalence as found in word-aligned parallel corpora. Corpus-based approaches
to Machine Translation have partially succeeded in replacing traditional rule-based
approaches.[26] The main advantage of corpus-based Machine Translation
Systems is that they are self-customizing, in the sense that they can learn the
translations of terminology and even stylistic phrasing from previously translated
materials.
1.1.3.2.1 Knowledge-Based MT
Knowledge-Based MT (KBMT) is used to fill the gap between the two extremes
of human-only and machine-only translation. It aims to provide high-quality translation
much faster and at much lower cost than human-only translation. It combines tightly integrated
translation technologies with translation processes driven by highly skilled
linguists. The objective of knowledge-based translation is to capture as much as
possible of the linguists' knowledge in the translation system's knowledge base.
For this, the system makes use of source and target language dictionaries,
source and target language structures and rules, word meanings in different
contexts and language constructs, domain-specific terminology, previously
translated words, phrases, sentences and paragraphs, language style, cultural
differences, etc. By drawing on all these knowledge sources, it produces
high-quality output. Figure 1.4 shows a Knowledge-Based Machine Translation System.
Figure 1.4: KBMT Representation
This model does not require total understanding of the source text, but assumes
that an interpretation engine can achieve successful translation into several
languages. KBMT is implemented on the interlingual architecture; it differs from
other interlingual techniques in the depth to which it analyzes the
source language and in its reliance on explicit knowledge of the world.[25] Figure
1.4 gives the graphical representation of the KBMT System.
The KANTOO project is an object-oriented C++ implementation of KANT
technology for Machine Translation. The KANTOO is designed to be a more
robust, efficient and maintainable version of KANT for commercial customers.
Besides the Analyzer and Generator, KANTOO has an integrated set of support tools
for efficient knowledge maintenance. The LUTE project at NTT and the ETL research, a
Japanese multilingual project, also applied the knowledge-based approach.[27, 28]
1.1.3.2.2 Example-Based MT
This method was proposed in 1981, and the Distributed Language Translation (DLT)
System of Japan is based on this approach. The example-based approach is
founded on extracting and selecting equivalent phrases or word
groups from a databank of parallel bilingual texts, which have been aligned either
by statistical methods or by more traditional rule-based methods. The main
advantage of the approach (in comparison with the rule-based approach) is that
there is an assurance that the results will be accurate and idiomatic, since the
texts have been extracted from databanks of actual translations produced by
professional translators.[12]
Figure 1.5 EBMT Representation
The idea behind Example-Based MT (EBMT) is to translate a sentence using
previously analyzed examples of similar sentences. A database of previously
analyzed text is stored in the Translation Memory (TMEM), as shown in Figure
1.5. TMEM enables translators to store original texts and their translated
versions side by side, so that corresponding sentences of the source and target
are aligned. The translator can thus search the translation memory for phrases or
full sentences written in one language, and the system displays the corresponding
phrases in the other language that match the query exactly or approximately.
Ideally, the system will find an exact structural match for the source
sentence and replace the source words with the target words. However, it is often
the case that there is no exact match for a source sentence. In this case, the
system will chunk the source sentence and try to find a match for each chunk in the example
database.
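
The exact-or-approximate lookup in a translation memory can be sketched as below. This is a simplified illustration under stated assumptions: it uses character-level similarity from Python's standard `difflib` module as a stand-in for whatever matching a real EBMT system uses, the memory contents use a simplified ASCII romanization, and the names `TRANSLATION_MEMORY`, `lookup` and the 0.8 threshold are hypothetical. The chunking fallback mentioned above is not implemented here.

```python
from difflib import SequenceMatcher

# Toy translation memory: aligned source/target sentence pairs (illustrative).
TRANSLATION_MEMORY = {
    "panchi udia": "Bird flew",
    "bacce mithai pasand karde han": "Children like sweets",
}


def lookup(sentence, memory=TRANSLATION_MEMORY, threshold=0.8):
    """Return (translation, score) for the best match, or None below threshold."""
    if sentence in memory:
        return memory[sentence], 1.0  # exact match
    best_source, best_score = None, 0.0
    for source in memory:
        # ratio() is a similarity score between 0.0 and 1.0.
        score = SequenceMatcher(None, sentence, source).ratio()
        if score > best_score:
            best_source, best_score = source, score
    if best_score >= threshold:
        return memory[best_source], best_score
    return None


print(lookup("panchi udia"))
print(lookup("bacce mithai pasand karde"))
print(lookup("qqq"))
```

An approximate hit returns the stored translation together with its score, so that a translator (or a later stage) can decide whether the match is close enough to reuse.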
The AnglaBharti system, developed at IIT Kanpur, uses an example base to
identify noun and verb phrases and to resolve their ambiguities.[36] AnglaBharti-II,
launched in 2004, addresses many of the shortcomings of the earlier
architecture. It uses an example-based approach to reduce the difficulty of
modifying the rule base.[37] AnglaHindi, another system
developed at IIT Kanpur, besides using all the modules of AnglaBharti, also
makes use of an abstracted example base for translating frequently encountered
noun phrases and verb phrases. In AnglaHindi, the example-based approach is
invoked before the rule-based approach. The example base is statistically
derived from the corpus.[34]
1.1.3.2.3 Statistical approach
It is a relatively new method, and its strategies are based on statistical
techniques. Here, statistical methods are used as the means of analysis and
generation; no linguistic rules are applied.[25] The essence of this method lies in
aligning phrases, word groups and individual words of the parallel texts, and in
calculating the probabilities that any one word in a sentence of one language
corresponds to a word or words in the translated sentence. Statistical
MT has given more acceptable results by picking, from the candidate words, the word(s)
with the highest probability of occupying the current
position, given the surrounding words. Here, the MT engine is trained on large volumes of existing
content and its translation, known as "bilingual text corpora". The MT engine uses
these large volumes of data to create statistical rules. These rules determine the
appropriate selection based on the probability of correct translation of a given word,
phrase, or sentence of a language. Large volumes of electronic text of similar
content are required to get the best quality output from the MT engine. By the
turn of the century, this newer approach, based on statistical models in which a
word or phrase is translated to one of a number of possibilities according to the
probability of correct translation, had achieved marked success. The best
examples substantially outperform rule-based systems. Statistics-based Machine
Translation (SMT) may also prove easier and less expensive to expand, if the
system can be taught new knowledge domains or languages by giving it large
samples of existing human-translated texts. Despite some success, however,
severe problems still exist: outputs are often ungrammatical, and the quality
and accuracy of translation fall well below those of a human linguist.
Statistical-Based MT (SBMT) includes statistical techniques such as n-gram
modeling, maximum entropy modeling, and decision tree modeling. All pure
SBMT systems derive their data from corpora that they have previously analyzed and do
not rely on linguistic information. SBMT methods select the best representation
choice based on Bayes' theorem: $\hat{w} = \arg\max_{w} P(w \mid s)$. SBMT will pick the word (w)
that has the highest probability of occupying its current position, given the
surrounding words (s).
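
The role of Bayes' theorem in this selection can be made explicit with a short derivation. It is given as a sketch in the standard noisy-channel form; reading s as the observed context (the surrounding words) follows the description in the text and is otherwise an assumption of this illustration.

```latex
\hat{w} = \arg\max_{w} P(w \mid s)
        = \arg\max_{w} \frac{P(s \mid w)\, P(w)}{P(s)}
        = \arg\max_{w} P(s \mid w)\, P(w)
```

The denominator P(s) does not depend on w, so it can be dropped from the maximization. The remaining two factors are the ones a statistical system estimates from corpora: P(w), how likely the candidate is on its own, and P(s | w), how well it explains the observed context.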
The RAND Corporation undertook statistical analyses of a large corpus of Russian
physics texts to extract bilingual glossaries and grammatical information. The
IBM India Research Lab at New Delhi has recently initiated work on statistical MT
between English and Indian languages, building on IBM's existing work on
statistical MT.[21] Google's language translator also uses the statistical approach for
translation. Microsoft Bing Translator allows users to translate texts or entire web
pages into different languages. This translation service also uses a Statistical
Machine Translation strategy to some extent.
1.1.3.3 Hybrid Approach
During the last few years, a "third generation" of hybrid systems, combining the
rule-based approaches of the earlier types with the more recent corpus-based
methods, has also emerged. Hybrid methods are still fundamentally statistics-based,
but incorporate higher-level abstract syntax rules to arrive at the final
translation.
An Interactive Japanese to English Translation System, introduced to support
non-native speakers of English in writing English material, uses a hybrid approach to
translation. The Turkish to English Machine Translation System is a hybrid Machine
Translation System that combines two different approaches to MT. It
transfers a Turkish sentence to all of its possible English translations
using a set of manually written transfer rules. Then, it uses a probabilistic
language model to pick the most probable translation out of this set. SisHiTra,
developed by Gonzalez et al., is also a hybrid Machine Translation System, from
Spanish to Catalan. This project tried to combine knowledge-based and corpus-based
techniques to produce a Spanish-to-Catalan Machine Translation System
with no semantic constraints. The Bengali to Hindi Machine Translation System
developed at IIT Kharagpur also uses the Hybrid Approach for translation.[38-41]
1.1.4 Key Activities[41-45]
The common key activities that make up a Machine Translation System are
described below. These activities are usually executed in sequence. However,
depending upon the technique being followed, one or more of these activities
may be omitted.
PRE-PROCESSING: This module tokenizes the input text into words based on
the list of word boundaries. The pre-processing phase also includes filtering
the text. Text filtering means detecting and marking certain special expressions
such as named entities and collocations. Another important task performed in
pre-processing can be text normalization, which includes checking for spelling
variations and replacing them with standard spellings. Pre-processing may include
activities to reduce the complexity of translating the source language and to
increase the accuracy of the translator.
MORPHOLOGICAL ANALYSIS: The purpose of a morphological analyzer is to
return the root word and grammatical information about all the possible word
classes for a given word. The morphological analysis phase also includes
extraction of grammatical information, including number, gender and tense, for
all the tokens. Since Indian languages have a rich inflectional morphology,
a morphological analyzer is an essential tool for such languages.
PART OF SPEECH TAGGING: The output of the morphological analyzer is
usually ambiguous because a single word in the source language may have
a number of tags. A particular word can be used as a noun, an adjective, a verb
etc. The part of speech tagger disambiguates the output of the morphological
analyzer by using the context in which the word is being used.
PHRASE CHUNKING: Chunking is a way of organizing information into familiar
groupings. Phrase chunking is a natural language process that segments
sentences into their sub-constituents. Typical chunks are noun phrases,
prepositional phrases and verb phrases. Chunking works on POS tagged text, so its
accuracy depends upon the accuracy of the POS tagger. The chunker can be rule
based or probabilistic.
TRANSLATION AND TRANSLITERATION: All of the above activities analyse the
given input. Once all the necessary information about the words in a
sentence is available, the next step is to find their equivalents in the target
language. The translation engine has two parts, Translation and Transliteration.
Translation means finding the word equivalent in a bilingual lexicon.
Transliteration means writing the word in a different script without
interpreting it. The transliteration process also uses a lexicon of character
mappings between source and target language characters. It is used for
out-of-vocabulary words and recognised named entities. All other words are
translated.
In direct MT systems, source language words are simply replaced by target
language words, but in indirect MT systems, synthesizers for target language
phrases are also needed.
SYNTHESIS: If the source language and target language have different word
orders, this step reorders the words according to the grammar of the target
language. For example, the word order in Punjabi is Subject-Object-Verb, whereas
English is a Subject-Verb-Object language, so reordering techniques that follow
the grammar of the target language are required.
POST PROCESSING: Post processing improves the quality of the translation
produced by the machine. The amount of post processing required depends
upon the quality of the output received. This phase improves the translation
quality by making corrections in the generated output. The post processor is
essentially a corrector of ill-formed sentences.
1.2 Research questions
Presently there is no Machine Translation System available from Punjabi to
English in the legal domain; however, a grammar checker and a POS tagger are
available for the Punjabi language. Similar Machine Translation Systems are
available from other Indian languages to English, but these belong to different
domains. Systems are available from Hindi to English for the domains of
public health, news and annual reports, but none is available for Punjabi.
The problem statement for the present research work has been formulated as
below:
To develop algorithms and lexical resources along with a software package
to translate a simple sentence written in Punjabi language to English. The
sentence should lie in legal domain and should follow a particular syntax.
The present research study aims to develop a Punjabi to English Machine
Translation System for legal documents, which translates a simple Punjabi
sentence in the legal domain into English. The system will be helpful to persons
with little knowledge of English. They can easily translate sentences in Punjabi
into sentences in English without the need for an interpreter, thus removing the
language barrier.
1.2.1 Objectives
The objectives of this study are:
1. To study the Punjabi and English languages and their divergences.
2. To study the inflectional morphology of Punjabi and various types of
agreement in Punjabi sentences.
3. To adapt the existing lexical resources such as morph database for part of
speech tagging.
4. To develop lexicon for collocations in Punjabi text.
5. To develop algorithms for part of speech tagging and phrase chunking
modules.
6. To develop a module for finding the gender (Masculine, Feminine or Both)
and number (Singular, Plural) information for the nouns and pronouns
used in the sentence.
7. To develop a module to translate tagged Punjabi words to their English
equivalents.
8. To develop transliteration module for handling named entities and out-of-
vocabulary words.
9. To develop an algorithm for synthesizing translated phrases into an English
sentence.
10. To develop an algorithm for post processing tasks.
11. To develop test cases for evaluating the system critically.
1.2.2 Challenges
MT across languages is a challenging task for several reasons, such as the
difference in the structure of the source and target languages, ambiguity,
multiword units like idioms and phrases, tense generation and many more. Some of
the major challenges faced in the development of a Punjabi to English MT system
are as follows.
1. Word ordering is different for Punjabi and English. In Punjabi, word
order is Subject-Object-Verb (SOV) whereas in English, it is Subject-
Verb-Object (SVO). Lexical differences also exist in these two
languages as in some cases a group of words used in Punjabi has a
single-word equivalent in English.
2. Articles are used in English but not in Punjabi. The articles can be
added at the time of post processing to correct the sentence in some
cases.
3. Lexical resources such as a digital bilingual dictionary and a tagged
corpus are lacking. There is no machine readable dictionary available for
Punjabi to English which can be directly used for translation; however,
dictionaries are available to explain the meaning of a word. The
Morphological Analyzer for Punjabi developed at Punjabi University,
Patiala cannot be used directly in the system. However, its database
can be adapted for the Punjabi to English translation system. No tagged
corpus is available for statistical tagging, so a tagged corpus has been
built using the set of training sentences.
4. There are multiple translations of a Punjabi word to English. It may
depend upon the context in which the word is present in the sentence.
5. The proper nouns present in the sentence have to be identified.
6. Punjabi is a free word order language, so it was a challenging task to
identify the phrase performing the function of subject in the sentence.
7. Developing a rule based system for Machine Translation is itself a major
challenge. A rule made for a particular type of sentence may be
overruled in another type of sentence.
8. The output of the translator needs some grammatical correction.
1.2.3 Need and Scope
The need for the system arises from the translation of legal documents
transferred from district courts of Punjab to the High Court in Chandigarh. FIRs,
which are written in Punjabi, are translated into English before being
presented to the High Court. The scope of the system can be extended to
many legal agencies where translation from Punjabi to English is needed.
The need for legal document translation can arise in a number of different
situations, from the finalisation of large international business deals to the
relocation of employees from one company site to another across national
borders. Serious circumstances, such as litigation and disputes over business
affairs taken to foreign courts, can also call for legal translations. Legal
translation services may be required by any business or individual, though they
are most commonly required by law offices and courts, especially for court
proceedings on an international level. So, when lawyers deal with foreign
documents, legal translation services are a must.
1.2.4 Potential Use
Today the need for translation is much greater than it was in the past. It is
undoubtedly an important topic socially, politically, commercially, scientifically,
and intellectually (or philosophically), and one whose importance is likely to
increase day by day. Some of the areas highlighting the importance of MT are
briefly described below:
The socio-political importance of MT arises in communities where more
than one language is generally spoken. Here, the only viable alternative to
rather widespread use of translation is the adoption of a single common
language, which is not an attractive alternative, because it involves the
dominance of the chosen language, to the disadvantage of speakers of the
other languages, and raises the prospect of the other languages becoming
second-class, and ultimately disappearing. Since the loss of a language
often involves the disappearance of a distinctive culture and a way of
thinking, this is a loss that should matter to everyone. So translation is
necessary for communication, for ordinary human interaction and for
gathering the required information.[10] The major problem of
translation is the scarcity of human translators. There is also a
limit on their productivity without automation. In short, the
automation of translation is a social and political necessity for modern
societies that do not wish to impose a common language on their members.
It is also a necessity of organizations like the European Community and
the UN, for whom multilingualism is both a basic principle and a fact of
everyday life.
The commercial importance of MT is summarized below:
(a) Manual translation is expensive. Translation is a highly skilled job,
requiring knowledge of a number of languages, and in some countries,
translators' salaries are comparable to those of other highly trained
professionals.
(b) Machine Translation is speedy, whereas manual translation is slow
and sometimes causes loss of revenue for the company. A
professional translator translates approximately 4-6 pages
(approximately 2000 words) per day, which increases the
time needed to translate product documentation. Hence the launch of
a new product gets delayed. Considering the above drawbacks of
manual translation, Machine Translation is comparatively more
important in speeding up the process of translation. However, the
output of a machine translator can be further edited by human
translators.
Scientifically, MT is interesting, because it is an obvious application and
testing ground for many ideas in Computer Science, Artificial Intelligence
and Linguistics.
Philosophically, MT is again interesting, because it represents an attempt
to automate an activity that can require the full range of human knowledge.
For example, getting the correct translation of "negatively charged
electrons and protons" into Punjabi depends upon knowledge of the
charge on protons, so the interpretation cannot be something like
"negatively charged electrons and negatively charged protons". In this
sense, the extent to which one can automate translation is an indication of
the extent to which one can automate thinking.[10]
MT is mainly used to bridge the digital divide. The Internet is changing the
world very fast, and a vast amount of information can be found on it, but
most of this information is in English. In the context of rural India, most of
this information is effectively unavailable to the rural masses, who do not
have any knowledge of the English language. In spite of all the progress
that is being made in the field of Information Technology, the rural masses
remain deprived of technological advancements. One of the primary
reasons for this is the inability to distribute information, and the language
barrier is one of the biggest hurdles in this distribution. There is
a great demand to translate Web pages and electronic mail messages into
native languages, and hence a demand for Internet-based online
translation services.
MT can be used to assist human translators. There is a demand for online
versions of electronic dictionaries as translation systems for helping
human translators.
Thus, if MT becomes accurate and efficient enough, it can break down
cultural barriers and make communication much easier among speakers of
different languages.
1.3 Assumptions
We cannot build a fully automatic high quality Machine Translation System. It is
difficult even to build a system for two languages with different word orders,
so the present system handles only a restricted set of sentences. The
assumptions made for the development of this system are:
1. If a paragraph is being input to the system, it should have proper delimiters
for each sentence.
2. The sentences should be simple. The system does not work for complex,
compound, passive and interrogative sentences.
3. The sentence given as input to the translator, should be limited to six
phrases including verb phrase.
4. Word level ambiguity, where a word can have a number of tags with the same
grammatical category but different meanings depending on the context, is not
resolved.
5. Abbreviations must contain a period between the characters.
1.4 Architecture of Punjabi to English Machine Translation System
Fig 1.6 Architecture of Punjabi to English Machine Translation System
1.5 Approach Applied for the System
The approach applied for our Machine Translation System is the rule based
approach. Since the two languages have different word orders and a number of
divergence patterns, the indirect approach is best suited to this type of
translation. The system is broadly divided into three phases: the Analysis
Phase, the Translation Phase and the Generation Phase. The Analysis phase
consists of Pre-processing, Tokenization, Tagging and Chunking. The Translation
phase includes Translation and Transliteration of each token, and the Generation
phase involves Synthesis and Post-Processing.
The following is a brief introduction to the steps involved:
1.5.1 Analysis Component
The analysis component analyzes the source language text and passes the
tagged phrases to the translation engine.
1.5.1.1 Pre-Processing and Tokenization
1.5.1.1.1 Tokenization
The tokenizer takes as input the text generated by the previous module. Using
a space or a punctuation mark as the delimiter, this module extracts tokens
(words) from the text and passes them to the next module for further processing.
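The tokenization step described above can be sketched roughly as follows. This is only an illustration: the delimiter set, the function name `tokenize` and the romanized sample tokens are assumptions, not part of the system described here. The period is deliberately left out of the delimiter set because abbreviations are assumed to contain periods.

```python
import re

# Hypothetical delimiter set: whitespace, common punctuation and the danda (U+0964).
DELIMITERS = r"[\s,;:?!\u0964]+"


def tokenize(text):
    """Split text on spaces or punctuation marks and drop empty strings."""
    return [tok for tok in re.split(DELIMITERS, text) if tok]


# Romanized placeholder input, used only for illustration.
tokens = tokenize("munda school gia\u0964 kuri ghar gai\u0964")
```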
1.5.1.1.2 Pre-Processing
In the pre-processing phase, a number of operations are applied to the input
sentences so that they can be processed by the translator with better
accuracy. The system performs the following pre-processing tasks.
(a) Text Normalization
A small module has been developed to generate the database for
standardization. The module detects spelling variations in each word and
replaces a variant with the standard word taken from the database. For example, in
the word , pairi bindi may be included after the character ; it should be
corrected by taking the correct word from the database. In some words adhak is
used with some characters, while in others it is not used. The database for
standardization has been developed by analyzing the frequency of
occurrence of the variant spellings of words.
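A minimal sketch of such a standardization table is given below. It assumes that, within a group of variant spellings, the most frequent spelling in a corpus is taken as the standard form; that selection rule, the function names and the romanized sample words are all illustrative assumptions.

```python
from collections import Counter


def build_standard_table(variant_groups, corpus_tokens):
    """Map every variant spelling to the most frequent spelling of its group."""
    counts = Counter(corpus_tokens)
    table = {}
    for group in variant_groups:
        standard = max(group, key=lambda word: counts[word])
        for word in group:
            if word != standard:
                table[word] = standard
    return table


def normalize(tokens, table):
    """Replace each known variant with its standard spelling."""
    return [table.get(tok, tok) for tok in tokens]


# Hypothetical romanized data, for illustration only.
corpus = ["zakhmi", "zakhmi", "zakhami", "uh"]
table = build_standard_table([["zakhmi", "zakhami"]], corpus)
normalized = normalize(["uh", "zakhami"], table)
```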
(b) Identifying Collocations
Many words in the input sentences create problems if treated alone. In the
pre-processing phase, we recognize such words and join them into a
single word so that it can be translated into the target language.
P:
T: uhd maut hds vicc h ga
P:
T: lak d bhl shur kar ditt ga
In the above sentences, (h ga) is joined as (h) and (kar ditt)
is joined as (kt)
Pre-processor combines the adjoining words from the sentence to a single word
by checking them from the database of joined words.
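The joining step can be pictured with a small lookup over adjacent word pairs. The table contents and romanized keys below are hypothetical stand-ins for the database of joined words, and the sketch only handles two-word collocations.

```python
# Hypothetical database of joined words: (word, next word) -> single word.
JOINED_WORDS = {
    ("kar", "ditta"): "kita",
    ("ho", "gia"): "hoia",
}


def join_collocations(tokens, table=JOINED_WORDS):
    """Replace adjacent word pairs found in the table with one joined word."""
    out, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in table:
            out.append(table[pair])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


joined = join_collocations(["kam", "shuru", "kar", "ditta", "gia"])
```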
(c) Identifying Named Entities
Named entity tagging refers to the task of identifying named entities (such as
person names) in a text. It is an important subtask for information extraction and
retrieval. The system under discussion extracts only those words which are
names of persons, places etc. The extraction of such words is important so
that they are not translated. After these words are recognized, they are sent
to the transliteration module. To recognize them, some rules have been developed
that check the preceding or succeeding word. For example, names may be
preceded by (sr), (sardr), (sardrn,), (srmt), (misar),
(misz), (mis) etc. or followed by (sigh), (kaur) or surname.
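A rough sketch of these context rules follows. The romanized title and suffix lists are approximations of the markers listed above, and the decision to flag the suffix itself as part of the name is an assumption made for illustration.

```python
# Hypothetical romanized marker lists, approximating the examples above.
TITLES = {"sri", "sardar", "sardarni", "srimati", "mister", "misses", "miss"}
NAME_SUFFIXES = {"singh", "kaur"}


def mark_named_entities(tokens):
    """Return one flag per token; True means 'send to transliteration'."""
    flags = [False] * len(tokens)
    for i, tok in enumerate(tokens):
        # A word that follows a title is treated as a name.
        if tok in TITLES and i + 1 < len(tokens):
            flags[i + 1] = True
        # A word followed by Singh/Kaur is a name, and so is the suffix.
        if tok in NAME_SUFFIXES and i > 0:
            flags[i - 1] = True
            flags[i] = True
    return flags


flags = mark_named_entities(["sardar", "manjeet", "singh", "ne", "kiha"])
```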
1.5.1.2 Morph Analyzing and Tagging
The next step is to tag each word with grammatical information about it. In
Punjabi grammar, the parts of speech include noun, verb, adjective, adverb,
pronoun, preposition, conjunction, interjection, operators, auxiliary verbs etc.
A tag contains information about the grammatical category of the word, its
gender, number and person, and the case in which it can be used. Tagging works
in two steps.
1.5.1.2.1 POS Tagging
The morphological database already created at Punjabi University, Patiala is
used to get the information about each token, and tags are formed according to
the information gathered from it.
A tag contains grammatical category-gender-person-number-case-tense-phrase-type.
The fields not applicable to a particular category are left blank. For
example, tags for the word (d) are ipo- - - - - - - -, v-b-s-s- -f-x- -. These
tags show that the word can be used as an inflected postposition or as a verb
with any gender, singular number, second person and future tense. In Punjabi,
a word can have a number of tags, as a particular word can be used in a number
of ways. The tagger first checks the category of each word in the database and
then adds gender, number, person or tense information to it. [19, 20] Each word
is thus attached to a number of tags. Since a particular word may have a number
of tags, there is a need to check which tag is applicable to the word in a
given sentence.
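The tag layout described above can be unpacked with a small helper. This sketch names only the first six fields, in the order given in the text; how the trailing phrase and type fields are laid out is not clear from the examples, so they are ignored here. The function name is hypothetical.

```python
# Field order taken from the description above; trailing fields are ignored.
FIELDS = ("category", "gender", "person", "number", "case", "tense")


def parse_tag(tag):
    """Split a hyphen-separated tag into named fields; blanks become ''."""
    parts = [part.strip() for part in tag.split("-")]
    return dict(zip(FIELDS, parts))


verb_tag = parse_tag("v-b-s-s- -f-x- -")
postposition_tag = parse_tag("ipo- - - - - - - -")
```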
1.5.1.2.2 Ambiguity Resolution between Different Tags
A Hybrid Approach, which is a combination of the rule based and statistical
approaches, is used to resolve ambiguity for a word with a number of tags. The
first level of ambiguity exists when a particular word can have a number of tags
of different grammatical categories. The probability of a particular tag
is calculated using the Viterbi Algorithm by observing the frequency of
occurrence of the tags of the preceding words.
For Example
P:
T: uh hds vicc zam h gi
In the above sentence, the word (zam) has two tags: one shows that it is
a noun and the other shows that it is an inflected adjective. In this sentence,
the probability of the word occurring as a noun is higher than as an adjective,
so it is considered a noun.
But, if the sentence is
P:
T: uh n zam dm n mr ditt
Here the word (zam) has a higher probability as an adjective.
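The category-level disambiguation above can be illustrated with a simplified Viterbi search over tag sequences. The sketch assumes a bigram model that uses only tag-to-tag transition probabilities; the probabilities, tag names and the `floor` value for unseen pairs are invented for illustration and are not taken from the system's training data.

```python
def viterbi(candidates, transition, start="<s>", floor=1e-6):
    """Return the most probable tag sequence.

    candidates: one list of possible tags per word (e.g. from a morph database).
    transition: dict mapping (previous_tag, tag) to a probability.
    """
    best = {start: (1.0, [])}  # tag -> (score of best path ending here, path)
    for tags in candidates:
        step = {}
        for tag in tags:
            score, path = max(
                ((p * transition.get((prev, tag), floor), prev_path)
                 for prev, (p, prev_path) in best.items()),
                key=lambda item: item[0],
            )
            step[tag] = (score, path + [tag])
        best = step
    return max(best.values(), key=lambda item: item[0])[1]


# Hypothetical transition probabilities between coarse tags.
T = {
    ("<s>", "pn"): 0.9, ("pn", "n"): 0.5, ("pn", "po"): 0.4,
    ("n", "po"): 0.6, ("po", "n"): 0.5, ("po", "adj"): 0.3,
    ("n", "v"): 0.5, ("adj", "v"): 0.1, ("adj", "n"): 0.8,
    ("n", "n"): 0.05, ("po", "v"): 0.4,
}

# The ambiguous word is the one with candidates ["n", "adj"].
first = viterbi([["pn"], ["n"], ["po"], ["n", "adj"], ["v"]], T)
second = viterbi([["pn"], ["po"], ["n", "adj"], ["n"], ["po"], ["v"]], T)
```

With these toy numbers the ambiguous word is resolved as a noun when a verb follows it and as an adjective when a noun follows it, which mirrors the two example sentences.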
The second level of ambiguity that has been resolved arises when a number of
tags show a particular word as a noun that can be either singular or plural.
For example, tags for the word (mu) are n-m- -s-o - - -, n-m- -p-d- - - -. For
this type of ambiguity a rule based approach is used.
The tagged word can be a noun in the singular or a noun in the plural.
In the sentence,
P:
T: sr mu laan ga
In the above case the tag n-m- -p-d- - - - -, is selected as the number of verb
phrase is plural and its appropriate word in English is boys, whereas in the case
P:
T: ik mu n main rki
Here the tag for (mu) should be n-m- -s-o- - - - and its appropriate
equivalent in English is boy. This type of ambiguity can be resolved by
considering the number, i.e. singular or plural, of the auxiliary verb or the
main verb present in the sentence. For resolution of ambiguity, the rules are
ordered according to priority.
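A minimal sketch of the number-agreement rule is shown below. It assumes the number field is the fourth hyphen-separated field of the tag, as in the examples above, and that the verb's number has already been determined; the fallback to the first tag is a hypothetical choice.

```python
def select_by_number(noun_tags, verb_number):
    """Pick the noun tag whose number field ('s' or 'p') agrees with the verb."""
    for tag in noun_tags:
        fields = [field.strip() for field in tag.split("-")]
        if fields[3] == verb_number:
            return tag
    return noun_tags[0]  # hypothetical fallback when no tag agrees


tags = ["n-m- -s-o- - - -", "n-m- -p-d- - - -"]
plural_choice = select_by_number(tags, "p")
singular_choice = select_by_number(tags, "s")
```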
1.5.1.3 Chunking
Chunking involves grouping the words of the input sentence into phrases such as
noun phrases, postpositional phrases and verb phrases. A rule based chunker has
been developed for this purpose. First, the subject of the Punjabi sentence is
chosen by applying the rules for the subject noun phrase, and then the other
phrases are recognized from the predicate.
For Example
P:
T: shkat n ih savikr kt hai
Here (shkat n) is taken as subject noun phrase and the rest of sentence
as predicate.
Noun Phrases, Adjective Phrases, Postpositional Phrases and Verb Phrases are
formed by combining different word classes. A noun phrase consists of
nouns or pronouns. It can be preceded by its modifiers, which can be adjectives.
An adjective phrase is a phrase with an adjective as its head. It can also consist
of adjectives with modifiers. (bahut m ) is an adjective phrase in the
sentence (uhd niyat bahut m s). In Punjabi, a
preposition is called a postposition as it comes after the noun or pronoun. The
postposition and its object make up a postpositional phrase, which can be used
to modify noun phrases. For example, in the sentence
(uh dasv jamt d vidirth s), (dasv jamt d) is the
postpositional phrase. A Verb Phrase consists of the main verb, followed by
operators and an auxiliary verb and preceded by an adverb. Operators are of four
types: Primary operator, Passive operator, Modal operator and Progressive
operator. These operators help to emphasize the working of the main verb.[8] For
implementation in the MT System, a separate database is maintained for conjunct
verbs, and their English equivalent is found by checking the preceding word.
Chunking is performed using the rules of noun phrases, postpositional phrases
and the verb phrases. The rules for division to phrases are stored in the rule base
of Punjabi and the conversion rules are stored in the rule base of target
language.
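The flavour of such a rule based chunker can be sketched as follows. The coarse tag names, the three rules and the romanized example sentence are illustrative assumptions; the actual rule base is described as far richer than this.

```python
def chunk(tagged):
    """Group (word, tag) pairs into NP, PP and VP chunks with simple rules."""
    chunks, words, kind, has_head = [], [], None, False

    def close():
        nonlocal words, kind, has_head
        if words:
            chunks.append((kind, words))
        words, kind, has_head = [], None, False

    for word, tag in tagged:
        if tag in ("adj", "n", "pn"):
            # A new noun phrase starts once the current one already has a head.
            if kind != "NP" or has_head:
                close()
                kind = "NP"
            words.append(word)
            has_head = tag in ("n", "pn")
        elif tag == "po" and kind == "NP":
            # Noun phrase + postposition forms a postpositional phrase.
            words.append(word)
            kind = "PP"
            close()
        elif tag in ("adv", "v", "op", "aux"):
            if kind != "VP":
                close()
                kind = "VP"
            words.append(word)
        else:
            close()
            chunks.append(("O", [word]))
    close()
    return chunks


# Hypothetical romanization and tags for the example sentence above.
sentence = [("uh", "pn"), ("dasvi", "adj"), ("jamat", "n"),
            ("da", "po"), ("vidyarthi", "n"), ("si", "aux")]
phrases = chunk(sentence)
```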
1.5.2 Translation Engine
This component either translates or transliterates each source language token
into its target language equivalent.
1.5.2.1 Transliteration
The named entities recognized in the pre-processing phase and out-of-vocabulary
words are given as input to the transliteration module.
Transliteration means writing a word in the target script character by
character. For
Example, (manjt) in Punjabi is transliterated in English as manjeet, m for
, n for , j for , ee for , t for . This transliteration process also uses a
database of character mappings and certain rules to insert vowels wherever
needed.
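A toy sketch of character-level transliteration with one vowel-insertion rule is given below. It assumes the name in the example is spelled with the Gurmukhi letters U+0A2E, U+0A28, U+0A1C, U+0A40 and U+0A24; the mapping table covers only those letters, and the rule for inserting the inherent vowel "a" is a simplification invented for this illustration.

```python
# Hypothetical mapping tables covering only the letters of the example name.
CONSONANTS = {
    "\u0a2e": "m",  # GURMUKHI LETTER MA
    "\u0a28": "n",  # GURMUKHI LETTER NA
    "\u0a1c": "j",  # GURMUKHI LETTER JA
    "\u0a24": "t",  # GURMUKHI LETTER TA
}
VOWEL_SIGNS = {"\u0a40": "ee"}  # GURMUKHI VOWEL SIGN II
LATIN_VOWELS = ("a", "e", "i", "o", "u")


def transliterate(word):
    """Map characters one by one, inserting 'a' between some consonants."""
    out = ""
    for i, ch in enumerate(word):
        if ch in VOWEL_SIGNS:
            out += VOWEL_SIGNS[ch]
        elif ch in CONSONANTS:
            follows_vowel = out.endswith(LATIN_VOWELS)
            out += CONSONANTS[ch]
            next_is_consonant = i + 1 < len(word) and word[i + 1] in CONSONANTS
            # Toy rule: insert the inherent vowel only when this consonant
            # does not itself follow a vowel and another consonant comes next.
            if next_is_consonant and not follows_vowel:
                out += "a"
        else:
            out += ch  # unknown characters are passed through unchanged
    return out


name = transliterate("\u0a2e\u0a28\u0a1c\u0a40\u0a24")
```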
1.5.2.2 Translation using Bilingual Dictionary
The next step in translation is the use of a bilingual dictionary to translate
each Punjabi word to its English equivalent. The meaning of each word is selected
depending upon the morph information given in the tag attached to it.
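A lookup of this kind might be keyed on the word together with fields from its tag. In the sketch below the lexicon entries, the romanized key and the choice of category and number as the selecting fields are assumptions for illustration; a missing entry returns `None`, which would signal that the word should go to transliteration instead.

```python
# Hypothetical bilingual lexicon keyed on (word, category, number).
LEXICON = {
    ("munde", "n", "s"): "boy",
    ("munde", "n", "p"): "boys",
}


def translate_word(word, tag, lexicon=LEXICON):
    """Look up the English equivalent using the word and its tag fields."""
    fields = [field.strip() for field in tag.split("-")]
    category, number = fields[0], fields[3]
    return lexicon.get((word, category, number))


plural = translate_word("munde", "n-m- -p-d- - - -")
singular = translate_word("munde", "n-m- -s-o- - - -")
```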
1.5.3 Generation Component
This component synthesizes the target language equivalent tokens into
sentences and then corrects those sentences to increase the accuracy of the
translator.
1.5.3.1 Synthesis
After the English equivalent of each word in the Punjabi sentence is obtained,
the words are synthesized first into phrases and then into a sentence using the
structural rules of the English language. These rules are also stored in the
rule base of English.
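The sentence-level part of synthesis can be pictured as a reordering of translated chunks from Subject-Object-Verb to Subject-Verb-Object order. This sketch assumes the first chunk is the subject and ignores reordering inside phrases; the chunk labels and function names are hypothetical.

```python
def reorder_sov_to_svo(chunks):
    """Move the verb phrase so that it directly follows the first (subject) chunk."""
    verb = [c for c in chunks if c[0] == "VP"]
    rest = [c for c in chunks if c[0] != "VP"]
    return rest[:1] + verb + rest[1:]


def synthesize(chunks):
    """Join the reordered chunks into a single English sentence string."""
    return " ".join(w for _, words in reorder_sov_to_svo(chunks) for w in words)


translated = [("NP", ["the", "boy"]), ("NP", ["an", "apple"]), ("VP", ["ate"])]
sentence = synthesize(translated)
```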
1.5.3.2 Post Processing
After all the source text has been converted to target text, some grammatical
errors remain that need to be corrected. For this purpose, we have
formulated rules for correcting grammatical errors. The Post Processing
phase is responsible for applying them to the generated output.
Some of the rules used for Post Editing are:
(a) Deletion of a preposition, if present with an adverb: if a preposition and
an adverb exist together in the translated sentence, the preposition is
deleted.
(b) Replacement of "of" with "to" if it occurs between a verb and a noun.
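The two post-editing rules above can be sketched over tagged English tokens. The tag names are hypothetical, and "together" in the first rule is interpreted here as a preposition immediately followed by an adverb, which is an assumption.

```python
def post_process(tagged):
    """Apply two illustrative post-editing rules to (word, tag) pairs."""
    out = []
    for i, (word, tag) in enumerate(tagged):
        nxt = tagged[i + 1][1] if i + 1 < len(tagged) else None
        prev = tagged[i - 1][1] if i > 0 else None
        # Rule (a): drop a preposition that directly precedes an adverb.
        if tag == "prep" and nxt == "adv":
            continue
        # Rule (b): replace "of" with "to" between a verb and a noun.
        if word == "of" and prev == "v" and nxt == "n":
            word = "to"
        out.append(word)
    return out


fixed_a = post_process([("he", "pn"), ("went", "v"), ("to", "prep"), ("there", "adv")])
fixed_b = post_process([("he", "pn"), ("went", "v"), ("of", "prep"), ("school", "n")])
```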
1.6 Thesis Organization
In the first chapter of this thesis, Machine Translation is introduced and details
about various types of MT systems are provided. The benefits, applications, and
challenges of Machine Translation are described. After the various
approaches used for Machine Translation and the stages in a generic MT system
are elaborated, a formal description of the research question undertaken in this
thesis work, along with the major contributions and achievements of the research,
is provided.
Chapter 2 discusses the existing work in the field of Machine Translation in India
and outside India. This chapter on literature survey forms the basis of our work
on developing the Machine Translation System and later on helps us in
comparing our work with the existing state of the art in Machine Translation
System.
Chapter 3 explains Punjabi and English languages and divergences in their
patterns with respect to Machine Translation.
Chapters 4, 5 and 6 provide the design and implementation details of the various
activities involved in the Machine Translation System. Chapter 4 describes the
Analysis Phase which contains Pre-processing, Morph Analyzing, Tagging and
Chunking. Chapter 5 describes the Translation Engine and Chapter 6 discusses
the Generation Component which includes Synthesis and Post Processing.
Chapter 7 provides the evaluation of the system and its results.
Chapter 8 concludes the thesis by providing a summary of the research work
undertaken, contributions of the research work, assumptions and limitations, and
some directions in which this work could be extended in future.
1.7 Summary
In this chapter, an introduction to Machine Translation, the key activities
involved and the various approaches for developing Machine Translation have been
provided. It is
followed by a formal statement for this research work along with its objectives,
challenges involved, need and scope, and potential application areas of this
system. Further, the approach followed to develop the Punjabi to English
Machine Translation System has been discussed along with an overview of the
architecture of this system. The chapter concludes with a brief outline of this
thesis. The next chapter provides a survey of the existing literature in the field of
Machine Translation.