Chapter 1 Introduction

Machine Translation (MT), also known as automatic translation or mechanical

translation, is the computerized method that automates all or part of the process

of translating from one human language to another.

The importance of MT in the modern global world as an instrument to bridge the

digital divide, its multi-disciplinary academic thrusts and the absence of such a

system in the legal domain are justification enough for this research work. This

research work is an attempt to build a system to translate simple sentences in

legal domain from Punjabi into English. The need of this system arises from the

translations of the legal documents transferred from District Courts of Punjab to

the High Court in Chandigarh. The FIRs which are written in Punjabi are

translated into English before being presented to the High Court. To the best of my

knowledge, no Machine Translation System is being developed from Punjabi to

English in the legal domain. Similar translation systems from some of the Indian languages to

English have been developed. For example, Hinglish is a Machine Translation

System for pure (standard) Hindi to pure English forms, developed by R. Mahesh

K. Sinha and Anil Thakur.[1-3]

This chapter introduces Machine Translation, its concepts, various approaches

for Machine Translation Systems and key activities involved in it. It also provides

a formal description of the research question undertaken, its objectives as

well as need and scope of the study. The approach followed along with the

reasons behind its selection to solve this research problem has been explained in

brief. The chapter concludes by presenting major contributions of the research

work and an outline of the study. This work is based on Gurmukhi and Roman

scripts. The examples given in this thesis work are in Gurmukhi script along with

their transliteration. For example, (sakar). The transliteration provided is

based on the transliteration software GTrans, which was developed by Punjabi

University, Patiala, Punjab, India.

1.1 Machine Translation

The term 'Machine Translation' (MT) refers to the computerized system

responsible for the production of translations with or without human assistance.[4]

Machine Translation is an application of computer and language sciences which

helps in the development of systems answering practical needs. Computer programs

produce translations which may not be perfect renderings of literary texts,

but which are useful for technical manuals, scientific documents,

commercial prospectuses, administrative memoranda and medical reports.

1.1.1 Some Misconceptions about Machine Translation

There are some misconceptions about MT. It is believed that the quality of

translation from an MT system is very poor. Such a notion has some veracity

because no existing system can produce really perfect translations. However, this

does not make MT useless. A rough translation would be very helpful if we have

a document containing very important information, written in a language which we

do not understand. Moreover, a human translator normally does not immediately

produce a perfect translation. It is a general practice to divide the job of

translating a document into two stages. The first stage is to produce a draft

translation, which may not necessarily be perfect. It is then revised either by the

same translator or by another translator with a view to improving the previous

translation. For the most part, the aim of MT is only to automate the first, draft

translation process and to make the overall translation process faster, simpler and

cheaper.[5] MT does not threaten translators' jobs. However, MT systems can take

over some of the boring, repetitive translation jobs and allow human translators to

concentrate on more interesting tasks, where their specialist skills are really

required.

1.1.2 History of Machine Translation

1.1.2.1 Machine Translation of Non-Indian Languages

Although the idea of mechanical dictionaries took shape in the 20th century, the origins

of MT can be traced back to the 17th century. In 1946-1947, there came the first

tentative idea of using the newly invented computers for translating natural

languages. Weaver received much credit for bringing the concept of MT to the

public when he published an influential paper on using computers for translation

in 1949. The early 1950s were a period of intense research in MT in both the

United States and Europe. Massachusetts Institute of Technology (MIT) started

research in this field in 1951. In 1952 the first conference on MT was held, but it

was not until 1954 that a translation system was demonstrated. This system was

not that accurate but it attracted the interest of researchers and media all over the

world. The first journal on Mechanical Translation was published at MIT in 1953.

In 1959, IBM installed an MT system for the United States Air Force, followed by

Georgetown University installing systems at Euratom and the United States

Atomic Energy Agency. Despite some success of early MT systems, MT research

funding was on the verge of serious reduction. The growing dissatisfaction of

research sponsors caused the United States National Academy of Sciences to

set up the Automatic Language Processing Advisory Committee (ALPAC) in

1964; its report followed in 1966. ALPAC, whose members were the major sponsors of current MT research

projects, was to evaluate the effectiveness, costs and potential future progress of

MT. The level of global MT activity probably reached its highest

level in the mid-1960s, at the time of the ALPAC report. 1976 marked a

positive turning point for MT research: Canada made public its METEO System,

which translated weather forecasts. Later that year, the European Commission

purchased SYSTRAN, a Russian-English system. MT interest and activity has

increased ever since, and MT has been established as a legitimate field of

research. In 1970, there was the first doctoral thesis in MT, by Anthony G. Oettinger,

which presented a study for a Russian mechanical dictionary. The largest growth

area has been in the marketing and sale of commercial MT systems, many for

personal computers, and in the provision of MT-based services. The picture in

MT research also changed in the early 1980s; however, it was still concentrated largely

in well-established projects at universities (Grenoble, Saarbrücken, Montréal,

Texas, Kyoto, and the Eurotra project) and in connection with systems such as

Systran, Logos, ALPS and Weidner. It was perhaps these systems, however

crude in terms of linguistic quality, which more than anything else alerted the

translation profession to the possibilities of exploiting the increasing

sophistication of computers in the service of translation. The 1990s saw MT

implemented as an online service. The 2000s have shown even more research

into MT and many new, efficient hybrid algorithms.[6-19]

1.1.2.2 Machine Translation in India

The earliest efforts in Machine Translation in India date from the mid 80s and

early 90s. The prominent among these efforts are the research and development

projects at Indian Institute of Technology Kanpur, University of Hyderabad,

National Center for Software Technology, Mumbai and Center for Development

of Advanced Computing (CDAC), Pune.[20] Since the mid and late 90s, a few

more projects have been initiated at Indian Institute of Technology Bombay,

International Institute of Information Technology Hyderabad, Anna University

KB Chandrasekhar Research Center Chennai and Jadavpur University, Kolkata.

There are also a couple of efforts from the private sector - from Super Infosoft

Private Limited, and more recently, the IBM India Research Laboratory. The Department of IT,

Ministry of Communications and Information Technology, Government of India,

has played an instrumental role by funding the projects, as have the Technology Development for Indian Languages (TDIL)

program of the Ministry of Information Technology (MIT) and also the UNDP.

University Grants Commission (UGC) also started supporting minor and major

research projects involving development of linguistic parsers and Machine

Translation Systems. Indian Institutes of Technology (IITs), Indian Institutes of

Information Technology (IIITs), Centre for Development of Advanced Computing

(C-DAC), Indian Institute of Science (IIS), Indian Statistical Institute (ISI),

Jawaharlal Nehru University (JNU), Mahatma Gandhi International Hindi

University (MGIHU) and other institutes have significant contributions in this field.

The private enterprises like Tata Institute of Fundamental Research (TIFR), Tata

Consultancy Services (TCS) have also funded Indian language technology

research and development.[21,22]

IIT Guwahati, CDAC Kolkata, Jawaharlal Nehru University (JNU) New Delhi are

also involved in developing the Machine Translation Systems for different Indian

languages.[20] Punjabi University Patiala has also entered the field of

Machine Translation and successfully developed a Hindi-Punjabi Machine

Translation System and vice versa. Thapar University Patiala is also working on

a UNL based Machine Translation System.[22, 23]

1.1.3 Approaches used in Machine Translation

Broadly, the approaches used for translation are classified as Rule based and Corpus

based. The Rule based approach is further classified into direct, interlingual and

transfer based approaches. The direct translation approach was typical of the "first

generation" of MT systems. The indirect approaches of interlingua and transfer

based systems characterise the "second generation" of MT

systems. Interlingua and transfer approaches are essentially based on the

specification of rules (for morphology, syntax, lexical selection, semantic analysis,

and generation). During the last few years, a

"third generation" of MT systems has begun to emerge, depending upon knowledge-based, corpus-based

and hybrid approaches. Corpus-based methods do not rely on external

knowledge sources such as machine readable dictionaries, concept hierarchies,

or sense-tagged texts. They do not assign sense tags to words; rather, they rely on

monolingual corpora and methods based on translational equivalence as found in

word-aligned parallel corpora.[25, 26]

1.1.3.1 Rule Based Approaches

The assumption of rule-based MT is that translation is a process which requires

the analysis and representation of the 'meaning' of source language texts and the

generation of equivalent target language texts. Representations should be

unambiguous lexically and structurally. There are three major approaches: (a)

The Direct approach in which the word to word translation is performed without

analyzing it structurally. (b) The 'Transfer' approach in which the translation

process operates in three stages: Analysis into Abstract Source Language

representations, Transfer into Abstract Target Language representations, and

Generation or Synthesis into Target Language Texts; and (c) the two-stage

'Interlingua' model, where analysis is into some language-neutral representation

and generation starts from this interlingua representation.[27]

1.1.3.1.1 Direct Approach

Direct translation is the oldest approach of MT. If the MT system uses direct

translation, it usually means that the source language text will not be analyzed

structurally beyond morphology. The translation is based on large dictionaries

and word-by-word translation with some simple grammatical adjustments e.g. on

word order and morphology. The translation unit of the approach is usually a

word. The lexicon is normally conceived of as the repository of word-specific

information. Traditional lexical resources are machine readable dictionaries that

contain lists of words. These lists might delineate senses of a word, represent the

meaning of a word, or specify the syntactic frames in which a word can appear,

but the level of granularity with which they are concerned is the individual word.

One of the oldest MT systems still in use today, Systran, is basically a direct

translation system. The first version of it was published in 1969. Over the years

the system has been developed quite a lot, but still its translation capability is

mainly based on very large bilingual dictionaries. No general linguistic theory or

parsing principles are necessarily present for direct translation to work; these

systems depend instead on well developed dictionaries, morphological analysis,

and text processing software.[28] Figure 1.1 shows the block diagram of Direct

MT System.

Figure 1.1 Direct MT System

E.g. Direct translation from Punjabi to English is

P:

T: pach ui

E: Bird flew
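
The word-by-word lookup behind this example can be sketched as follows. This is only a minimal illustration, not the system built in this work: the two-entry lexicon is a toy, the transliterated tokens are copied as they are printed in the example, and the names LEXICON and direct_translate are hypothetical.

```python
# Toy bilingual dictionary; the two entries mirror the example above
# (transliterated tokens copied as printed, names are hypothetical).
LEXICON = {"pach": "bird", "ui": "flew"}


def direct_translate(sentence, lexicon):
    """Translate word by word; unknown words are passed through unchanged."""
    words = [lexicon.get(token, token) for token in sentence.split()]
    text = " ".join(words)
    # Simple surface adjustment: capitalise the first letter of the sentence.
    return text[:1].upper() + text[1:]


print(direct_translate("pach ui", LEXICON))  # Bird flew
```

No structural analysis takes place here, which is why the approach works only when source and target word order already agree.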

The disadvantage of the direct method is that it is unidirectional, i.e., if the target is

to be translated back into the source language, a different transformation must be

used. It needs n(n-1), i.e. roughly n², translation modules for translation among n languages, so

the number of modules grows quadratically in a multi-language translation system. Another

problem with the direct method arises when the structure of a sentence is complex: it

requires complex grammatical analysis, and word ordering in the target language

sentence can often be wrong. Additionally, if lexical ambiguity exists, incorrect

translation of words occurs.[28] Analysis of relations between different parts of

the sentence is often lacking, which can lead to poor translation. Direct

translation is very inaccurate for languages with structural and lexical differences.

A direct translation system is used for a closely related language pair, e.g.

Punjabi-Hindi translation or vice versa.

[Figure 1.1 block diagram: Language X → Analysis of Language X → Synthesis of Language Y → Language Y]

Georgetown Automatic Translation (GAT) System developed by Georgetown

University, used direct approach for translating Russian texts (mainly from

physics and organic chemistry) to English.[29] The Mark II is also a direct

translation approach based Russian to English MT System for U.S. Air Force.[30]

RUSLAN is a direct Machine Translation System between closely related

languages, Czech and Russian.[31] SYSTRAN is a direct Machine Translation

System described by Hutchins and Somers. The system was originally built for

English-Russian Language Pair. [32] In India, the team of G. S. Lehal, G. S.

Joshan and Vishal Goyal at Punjabi University Patiala developed online Hindi-

Punjabi and Punjabi-Hindi Machine Translation Systems using direct translation

approach.[33]

1.1.3.1.2 Transfer Approach

When the major shortcomings of direct translation were realized, researchers

started working on the transfer method. It occupies the level above direct

translation in the MT pyramid and is also known as indirect or Linguistic

Knowledge (LK) translation. With the Transformer architecture or Direct MT, the

translation process relies on some knowledge of the source language and some

knowledge about how to transform partly analysed source sentences into strings

that look like target language sentences. With the Transfer Based architecture,

on the other hand, translation relies on extensive knowledge of both the source

and the target languages and of the relationships between analysed sentences in

both languages. It requires linguistic knowledge of the source and target

languages as well as the differences between them. The transfer method first

parses the sentence of the source language in the Analysis stage and then

applies rules that map the grammatical segments of the source sentence to a

representation in the target language in the Transfer stage, and finally passes this to the

Synthesis stage, where the target language sentences are generated.[5]

In this approach, the software attempts to deconstruct the grammar of the

input language to build a grammatical model of each sentence. The grammatical

model of the input language is then mapped to the grammatical model of the

output language. Transfer systems divide translation into steps which clearly

differentiate source language and target language parts. In the transfer approach

only the ambiguities inherent in the language in question are tackled. Rather than

operating in two stages, Transfer approach has three stages. The first stage

converts texts into intermediate representations in which ambiguities have been

resolved irrespective of any other language. Differences between languages, in

vocabulary and structure, are handled in the intermediary transfer stage. In the

third stage these are converted into equivalent representations of the target

language. Analysis and generation programs are specific for particular languages

and independent of each other.

For Example

P : (SNP) (ONP) (VP)

T: (SNP) bacc (ONP) mih (VP) pasand karad han

E: (SNP)Children (VP)Like (ONP)Sweets

The above example shows the Punjabi to English translation of a sentence. After

syntactically and semantically analyzing the sentence, we can easily translate a

sentence even between different structures (SOV to SVO). The transfer approach

uses n² transfer modules, n analysis components, and n synthesis components,

where n is the number of languages in the translation system. Thus, one of its

bottlenecks is the sheer number of rules needed for its implementation. Figure 1.2

shows the block diagram of the Transfer MT System.
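
The three stages can be illustrated on the example sentence with a small sketch. It assumes the Analysis stage has already produced the labelled constituents (SNP, ONP, VP); the transliterated tokens are copied as printed in the example, and the tables LEXICAL_TRANSFER and TARGET_ORDER and both function names are hypothetical.

```python
# Output of the Analysis stage (assumed given): labelled source constituents.
analysed = [("SNP", "bacc"), ("ONP", "mih"), ("VP", "pasand karad han")]

# Toy transfer knowledge: a lexical mapping and the target constituent order.
LEXICAL_TRANSFER = {
    "bacc": "children",
    "mih": "sweets",
    "pasand karad han": "like",
}
TARGET_ORDER = ["SNP", "VP", "ONP"]  # SOV source order -> SVO target order


def transfer(chunks):
    """Map each constituent to the target language and reorder it."""
    by_label = {label: LEXICAL_TRANSFER[text] for label, text in chunks}
    return [(label, by_label[label]) for label in TARGET_ORDER]


def synthesise(chunks):
    """Generate the target sentence from the transferred constituents."""
    sentence = " ".join(text for _, text in chunks)
    return sentence[:1].upper() + sentence[1:]


print(synthesise(transfer(analysed)))  # Children like sweets
```

A real transfer component would hold many such structural rules per language pair, which is where the rule-base size mentioned above comes from.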

Figure 1.2 Transfer MT System

The TAUM project, developed at Montreal in 1970 for translation of weather

forecasts from English to French, uses a Syntactic Transfer System. AnglaBharti,

an Indian system developed at IIT Kanpur under the expert guidance of R.M.K.

Sinha, deals with Machine Translation from English to Indian languages, primarily

Hindi, using a pseudo-interlingual rule-based Transfer Approach. It uses post-editing

to resolve ambiguity/complexity. It is mainly developed for the public health

domain.[34] MaTra is another Indian languages based Human-Assisted

translation project for English to Indian languages based on Transfer

Approach.[21] The MaTra lexicon approach is general-purpose, but the system

has been applied mainly in the domains of news, annual reports and technical

phrases. The Computer Science Department at the University of Hyderabad has

been working on an English-Kannada MT system, using the Universal Clause

Structure Grammar (UCSG) formalism. It is again based on transfer-based

approach, and has been applied to the domain of government circulars. The

Jadavpur University at Kolkata has worked on a rule-based English-Hindi MAT system for

news using the Transfer Approach.[21]

1.1.3.1.3 Interlingual Approach

The Interlingual Approach was historically the next step in the development of

MT. In an Interlingual based MT approach translation is done via an intermediary

(semantic) representation of the source language text. Interlingua is supposed to

be a language independent representation from which translations can be

generated to different target languages. The interlingua approach assumes that it

is possible to convert source texts into representations common to more than one

language. From such interlingual representations texts are generated into other

languages. Translation is thus in two stages: from the source language (SL) to

the interlingua (IL) and from the IL to the target language (TL). Programs for

analysis are independent from programs for generation; in a multilingual

configuration, any analysis program can be linked to any generation program.

Procedures for SL analysis are intended to be SL-specific and not oriented to any

particular TL; likewise, programs for TL synthesis are TL-specific and not

designed for input from particular SLs. Translation from and into n languages

requires n(n-1) bilingual 'direct translation' systems; but with translation via an

interlingua just 2n interlingual programs are needed. With more than three

languages the interlingua approach is claimed to be more economic. On the other

hand, the complexity of the interlingua itself is greatly increased. Perhaps then

"Machine Translation" is not an appropriate term, since the machine only

completes the first stage of the process. It would be more accurate to talk of a

tool that aids the translation process, rather than an independent translation

system.[27] Figure 1.3 shows the block diagram of Interlingual MT System.
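
The module counts quoted above, n(n-1) for direct translation and 2n for the interlingua, can be checked with a short sketch. The function names are hypothetical and serve only to make the arithmetic explicit.

```python
def direct_modules(n):
    """Bilingual direct systems needed for n languages: one per ordered pair."""
    return n * (n - 1)


def interlingua_modules(n):
    """Interlingual programs needed: one analyzer and one generator per language."""
    return 2 * n


for n in (2, 3, 4, 10):
    print(n, direct_modules(n), interlingua_modules(n))
```

For n = 3 both counts are 6, so the interlingua only becomes more economical with more than three languages, as stated above; for n = 10 the counts are 90 and 20.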

Fig 1.3 Interlingual MT System

Eg. Interlingual representation of sentence in Universal Networking Language is

P:

T: pach ui

UNL: agt ( , (icl>bird))

There are a few problems with the interlingual approach. The interlingual

approach requires an analyzer for each source language and a generator for

each target language. Analysis of source text requires a deep semantic analysis

that requires extensive world knowledge. Unfortunately, the true meaning of a

sentence cannot always be extracted. Additionally, if a text is analyzed as deeply

as is expected, then much of the source author's style will be lost. A further

problem is that using an interlingua in MT can lead to extra, unnecessary work, in

some cases.

During the 1970s, the University of Texas developed the METAL system for German and

English using the interlingua approach.[35] At the end of the 1990s, the Institute of

Advanced Studies of the United Nations University, Tokyo began its multinational

interlingua based MT project based on a standardized intermediary language,

Universal Networking Language (UNL). The UNL is an international project of the

United Nations University, with an aim to create an Interlingua for all major

human languages. It was initially for the six official languages of the United

Nations and then for other widely spoken languages, namely, Hindi, Arabic,

Chinese, English, French, German, Indonesian, Italian, Japanese, Portuguese,

Russian, and Spanish.[35] IIT, Bombay is one of the members of the team

responsible for developing UNL models for Hindi. The AnglaBharti system

(developed at IIT Kanpur) uses a pseudo-interlingua approach to analyze English

only once and creates an intermediate structure called Pseudo Lingua for Indian

Language (PLIL) instead of designing translators for English to each Indian

language. The PLIL structure can be converted to each Indian language through

a process of text-generation. The idea of using PLIL is primarily to exploit

structural similarity among the Indian languages to obtain advantages similar to

that of using interlingua approach.[36] A team at Thapar University, Patiala is

working on Punjabi Language Server which includes Punjabi-UNL Enconverter

and UNL-Punjabi Deconverter. [33]

1.1.3.2 Data Driven Approaches

Most recently, corpus-based methods have changed the traditional picture.

Corpus-based methods of word sense discrimination are knowledge-lean, and do

not rely on external knowledge sources such as machine readable dictionaries,

concept hierarchies, or sense-tagged text. They do not assign sense tags to

words; rather, they discriminate among word meanings based on information

found in unannotated corpora. They rely on methods based on translational

equivalence as found in word-aligned parallel corpora. Corpus-based approaches

to Machine Translation have partially succeeded in replacing traditional rule-based

approaches.[26] The main advantage of corpus-based Machine Translation

Systems is that these are self-customizing in the sense that they can learn the

translations of terminology and even stylistic phrasing from previously translated

materials.

1.1.3.2.1 Knowledge-Based MT

Knowledge-Based MT (KBMT) is used to fill the gaps between the two extremes

of human-only and machine-only translations. It provides high-quality translation

much faster and at much lower cost. It is the combination of tightly integrated

translation technologies with unique translation processes driven by highly skilled

linguists. The objective of knowledge-based translation is to capture as much as

possible of the linguists' knowledge in the translation system's knowledge base.

For this, the system makes use of source and target language dictionaries,

source and target language structures and rules, word meanings in different

contexts and language constructs, domain specific terminology, previously

translated words, phrases, sentences, paragraphs, language style and cultural

differences etc. By capturing all these knowledge sources it produces high

quality output. Figure 1.4 shows the Knowledge Based Machine Translation System.

Figure 1.4: KBMT Representation

This model does not require total understanding of the source text but assumes

that an interpretation engine can achieve successful translation into several

languages. KBMT is implemented on the interlingual architecture; it differs from

other interlingual techniques in the extent of depth to which it will analyze the

source language and its reliance on explicit knowledge of the world.[25] Figure

1.4 gives the graphical representation of KBMT System.

The KANTOO project is an object-oriented C++ implementation of KANT

technology for Machine Translation. The KANTOO is designed to be a more

robust, efficient and maintainable version of KANT for commercial customers.

Besides Analyzer and Generator, KANTOO has an integrated set of support tools

for efficient knowledge maintenance. The LUTE project at NTT and ETL research, a

Japanese multilingual project, also applied the knowledge based approach.[27, 28]

1.1.3.2.2 Example-Based MT

This method was proposed in 1981 and Distributed Language Translation (DLT)

System of Japan is based on this approach. The example-based approach was

founded on processes of extracting and selecting equivalent phrases or word

groups from a databank of parallel bilingual texts, which have been aligned either

by statistical methods or by more traditional rule-based methods. The main

advantage of the approach (in comparison with the rule-based approach) is that

there is an assurance that the results will be accurate and idiomatic, since the

texts have been extracted from databanks of actual translations produced by

professional translators.[12]

Figure 1.5 EBMT Representation

The idea behind Example-Based MT (EBMT) is to translate a sentence using

previously analyzed examples of similar sentences. A database of previously

analyzed text is stored in the Translation Memory (TMEM) as shown in Figure

1.5. TMEM enables translators to store original texts and their translated

versions side by side, so that corresponding sentences of the source and target

are aligned. The translator can thus search the translation memory for phrases or full sentences written

in one language and display the corresponding

phrases in the other language, which match the source exactly or approximately.

Ideally, the system will find an exact structural match for the source

sentence and replace the source words with the target words. However, it is often

the case that there is no exact match for a source sentence. In this case, the

system will chunk the source sentence and try to find a match in the example

database.
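
The exact and approximate lookup in a translation memory can be sketched as below. This is a simplified illustration under stated assumptions: the memory holds two toy English-French pairs (hypothetical data), approximate matching is done on whole character strings with Python's standard difflib module rather than on analysed structures, and the name lookup and the cutoff value are illustrative choices.

```python
import difflib

# Toy translation memory: aligned source/target pairs (hypothetical data).
TRANSLATION_MEMORY = {
    "good morning": "bonjour",
    "thank you": "merci",
}


def lookup(source, memory, cutoff=0.8):
    """Return (translation, matched_source), preferring an exact match.

    If no exact match exists, fall back to the closest stored source
    sentence whose similarity ratio is at least `cutoff`.
    """
    if source in memory:
        return memory[source], source
    close = difflib.get_close_matches(source, list(memory), n=1, cutoff=cutoff)
    if close:
        return memory[close[0]], close[0]
    return None, None


print(lookup("thank you", TRANSLATION_MEMORY))   # exact match
print(lookup("thank you!", TRANSLATION_MEMORY))  # approximate match
print(lookup("where is the station", TRANSLATION_MEMORY))  # no match
```

When nothing is close enough, a fuller system would chunk the source sentence and repeat the lookup on the chunks, as described above.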

The AnglaBharti system, developed at IIT Kanpur, uses an example-base to

identify noun and verb phrases to resolve their ambiguities.[36] AnglaBharti-II,

launched in 2004, addresses many of the shortcomings of the earlier

architecture. It uses an example-based approach to ease the difficulties of

modifying the rule-base.[37] AnglaHindi, another system

developed at IIT Kanpur, besides using all the modules of AnglaBharti, also

makes use of an abstracted example-base for translating frequently encountered

noun phrases and verb phrases. In AnglaHindi, the example-based approach is

invoked before the rule-based approach. The example-base is statistically

derived from the corpus.[34]

1.1.3.2.3 Statistical approach

It is a relatively new method whose strategies are based upon statistical

approaches. Here, statistical methods are used as the means of analysis and

generation; no linguistic rules are applied.[25] The essence of this method lies in

aligning phrases, word groups and individual words of the parallel texts, and in

calculating the probabilities that any one word in a sentence of one language

corresponds to a word or words in the translated sentence. Statistical

MT has given more acceptable results by picking, given the

surrounding words, the word(s) with the highest probability of occupying the current

position. Here, the MT engine is trained on large volumes of existing

content and its translations, known as "bilingual text corpora." The MT engine uses

the large volumes of data to create statistical rules. These rules determine the

appropriate selection based on the probability of correct translation of given word,

phrase, or sentence of a language. Large volumes of electronic text of similar

content are required to get the best quality output from the MT engine. By the

turn of the century, this newer approach based on statistical models where a

word or phrase is translated to one of a number of possibilities based on the

probability of correct translation has achieved marked success. The best

examples substantially outperform rule-based systems. Statistics-based Machine

Translation (SMT) also may prove easier and less expensive to expand, if the

system can be taught new knowledge domains or languages by giving it large

samples of existing human-translated texts. Despite some success, however,

severe problems still exist, i.e., outputs are often ungrammatical and the quality

and accuracy of translation falls well below that of a human linguist. Statistical-

Based MT (SBMT) includes some statistical techniques such as n-gram

modeling, maximum entropy modeling, and decision tree modeling. All pure

SBMT systems derive data from corpora that they have previously analyzed and do

not rely on linguistic information. SBMT methods select the best representation

choice based on Bayes' theorem: $\arg\max_{w} P(w \mid s)$. SBMT will pick the word (w)

that has the highest probability of occupying its current position, given the

surrounding words (s).
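
The selection rule can be made concrete with a small sketch. As a simplifying assumption, the surrounding words are reduced to the single preceding word (a bigram model), and the probabilities are maximum-likelihood estimates from a four-sentence toy corpus; the corpus and the names prob and best_word are hypothetical.

```python
from collections import Counter

# Toy monolingual corpus used to estimate the probabilities (hypothetical data).
corpus = ["the bird flew", "the bird sang", "the bird flew away", "a plane flew"]

bigrams = Counter()   # counts of (previous word, word)
contexts = Counter()  # counts of each word in the "previous word" position
for line in corpus:
    tokens = line.split()
    contexts.update(tokens[:-1])
    bigrams.update(zip(tokens, tokens[1:]))


def prob(word, prev):
    """Maximum-likelihood estimate of P(word | prev)."""
    return bigrams[(prev, word)] / contexts[prev] if contexts[prev] else 0.0


def best_word(candidates, prev):
    """argmax over the candidates of P(w | prev)."""
    return max(candidates, key=lambda w: prob(w, prev))


print(best_word(["sang", "flew"], "bird"))  # flew
```

In this corpus "bird" is followed by "flew" twice and by "sang" once, so P(flew | bird) = 2/3 and "flew" is selected.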

RAND Corporation undertook statistical analyses of a large corpus of Russian

Physics texts, to extract bilingual glossaries and grammatical information. The

IBM India Research Lab at New Delhi has recently initiated work on statistical MT

between English and Indian languages, building on IBM's existing work on

statistical MT.[21] Google language translator also uses statistical approach for

translation. Microsoft Bing Translator allows users to translate texts or entire web

pages into different languages. This translation service is also using Statistical

Machine Translation strategy to some extent.

1.1.3.3 Hybrid Approach

During the last few years, the "third generation" of hybrid systems combining the

rule-based approaches of the earlier types and the more recent corpus-based

methods has also emerged. Hybrid methods are still fundamentally statistics-based,

but incorporate higher level abstract syntax rules to arrive at the final

translation.

An Interactive Japanese to English Translation System, introduced to support

non-native writers of English in producing English material, uses a hybrid approach for

translation. A Turkish to English Machine Translation System is a Hybrid Machine

Translation System that combines two different approaches to MT. The Hybrid

Approach transfers a Turkish sentence to all of its possible English translations,

using a set of manually written transfer rules. Then, it uses a probabilistic

language model to pick the most probable translation out of this set. SisHiTra,

developed by Gonzalez et al., is also a hybrid Machine Translation System from

Spanish to Catalan. This project tried to combine knowledge-based and corpus-

based techniques to produce a Spanish-to-Catalan Machine Translation System

with no semantic constraints. Bengali to Hindi Machine Translation System

developed at IIT Kharagpur also uses Hybrid Approach for translation.[38-41]
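The generate-then-rank idea described for the Turkish to English system can be sketched as follows. This is a minimal illustration and not that system's implementation: the candidate lists stand in for the output of hand-written transfer rules, and the add-one-smoothed bigram model, the toy corpus and all function names are assumptions made for the example.

```python
import math
from collections import Counter


def train_bigram_model(sentences):
    """Count unigrams and bigrams in tokenized target-language sentences."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def log_prob(tokens, unigrams, bigrams):
    """Add-one smoothed bigram log-probability of a token sequence."""
    vocab = len(unigrams)
    return sum(
        math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab))
        for a, b in zip(tokens, tokens[1:])
    )


def pick_translation(candidates, unigrams, bigrams):
    """Return the candidate translation the language model scores highest."""
    return max(candidates, key=lambda c: log_prob(c, unigrams, bigrams))


# Toy English corpus standing in for real language-model training data.
corpus = [["he", "read", "the", "book"], ["she", "read", "the", "letter"]]
uni, bi = train_bigram_model(corpus)

# Hypothetical outputs of the rule-based transfer step for one sentence.
candidates = [["he", "the", "book", "read"], ["he", "read", "the", "book"]]
best = pick_translation(candidates, uni, bi)
```

With this toy corpus the second candidate wins, because its word pairs were all seen in the training sentences.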

1.1.4 Key Activities[41-45]

The key activities that commonly make up a Machine Translation System are

described below. These activities are usually executed in

sequence. However, depending upon the technique being followed, one or more

of these activities may be omitted.

PRE-PROCESSING: This module tokenizes the input text into words based on

the list of word boundaries. The pre-processing phase also includes filtering the

text, which means detecting and marking certain special expressions such as named

entities and collocations. Another important pre-processing task is text

normalization, which checks for spelling variations and replaces them with the

standard spellings. Pre-processing may also include activities that reduce the

complexity of translating the source language and increase the accuracy of the

translator.

MORPHOLOGICAL ANALYSIS: The purpose of a morphological analyzer is to

return the root word and grammatical information about all the possible word classes

for a given word. The morphological analysis phase also extracts grammatical

information, including number, gender and tense, for all the tokens. Since Indian

languages have a rich inflectional morphology, a morphological analyzer is an

essential tool for such languages.

PART OF SPEECH TAGGING: The output of the morphological analyzer is

usually ambiguous because a single word in the source language may have a

number of tags. A particular word can be used as a noun, an adjective, a verb

and so on. The part of speech tagger disambiguates the output of the morphological

analyzer by using the context in which the word is used.

PHRASE CHUNKING: Chunking is a way of organizing information into familiar

groupings. Phrase chunking is a natural language process that segments sentences

into their sub-constituents. Typical chunks are noun phrases, prepositional phrases

and verb phrases. Chunking works on POS-tagged text, so its accuracy depends

upon the accuracy of the POS tagger. The chunker can be rule based or

probabilistic.

TRANSLATION AND TRANSLITERATION: All of the above activities analyse the

given input. Having all the necessary information regarding the words in a

sentence, the next step is to find its equivalent in the target language. The

translation engine has two parts, Translation and Transliteration. Translation

includes finding the word equivalent from a bilingual lexicon. Transliteration is

writing a word in a different script without interpreting its meaning. The

transliteration process uses a lexicon of character mappings between source and

target language characters. It is used for out-of-vocabulary words and recognised

named entities. All other words are translated.

In direct MT systems, source language words are simply replaced by target

language words, but in indirect MT systems, synthesizers for target language

phrases are also needed.

SYNTHESIS: If the source language and target language have different word

orders, this step reorders the words according to the grammar of the target

language. For example, the word order in Punjabi is Subject-Object-Verb, whereas

English is a Subject-Verb-Object language, so translated phrases have to be

reordered to follow English grammar.


POST PROCESSING: Post processing improves the quality of the translation

produced by the machine by making corrections in the generated output. How much

post processing is required depends upon the quality of the output received. The

post processor is essentially a corrector of ill-formed sentences.

1.2 Research questions

Presently there is no Machine Translation System available from Punjabi to

English in the legal domain; however, a grammar checker and a POS tagger are

available for the Punjabi language. Similar Machine Translation Systems are

available from other Indian languages to English, but these belong to different

domains. Systems are available from Hindi to English for the domains of

public health, news and annual reports, but none is available for Punjabi.

The problem statement for the present research work has been formulated as

below:

To develop algorithms and lexical resources along with a software package

to translate a simple sentence written in Punjabi language to English. The

sentence should lie in legal domain and should follow a particular syntax.

The present research study develops a Punjabi to English Machine

Translation System for legal documents, which translates a simple Punjabi

sentence in the legal domain to English. The system will be helpful to persons

with little knowledge of English. They can translate Punjabi sentences easily

into English sentences without the need for an interpreter, thus removing the

language barrier.

1.2.1 Objectives

The objectives of this study are:

1. To study the Punjabi and English languages and their divergences.

2. To study the inflectional morphology of Punjabi and various types of

agreement in Punjabi sentences.

3. To adapt the existing lexical resources, such as the morph database, for part of

speech tagging.

4. To develop a lexicon for collocations in Punjabi text.

5. To develop algorithms for the part of speech tagging and phrase chunking

modules.

6. To develop a module for finding the gender (Masculine, Feminine or Both)

and number (Singular, Plural) information for the nouns and pronouns

used in the sentence.

7. To develop a module to translate tagged Punjabi words to their English

equivalents.

8. To develop a transliteration module for handling named entities and

out-of-vocabulary words.

9. To develop an algorithm for synthesizing translated phrases into an English

sentence.

10. To develop an algorithm for post processing tasks.

11. To develop test cases for evaluating the system critically.

1.2.2 Challenges

MT between languages is a challenging task for several reasons, such as

differences in the structure of the source and target languages, ambiguity, multiword

units like idioms and phrases, and tense generation. Some of the

major challenges faced in the development of a Punjabi to English MT system are as

follows.

1. Word ordering is different for Punjabi and English. In Punjabi, word

order is Subject-Object-Verb (SOV) whereas in English, it is Subject-

Verb-Object (SVO). Lexical differences also exist in these two

languages as in some cases a group of words used in Punjabi has a

single-word equivalent in English.

2. Articles are used in English but not in Punjabi. The articles can be

added at the time of post processing to correct the sentence in some

cases.

3. Lexical resources such as a digital bilingual dictionary and a tagged

corpus are lacking. There is no machine-readable Punjabi to English

dictionary that can be directly used for translation, although

dictionaries are available to explain the meaning of a word. The

Morphological Analyzer for Punjabi developed at Punjabi University,

Patiala cannot be used directly in the system; however, its database

can be adapted for the Punjabi to English translation system. No tagged

corpus is available for statistical tagging, so a tagged corpus has been

built using the set of training sentences.

4. A Punjabi word may have multiple English translations. The right one

depends upon the context in which the word is present in the sentence.

5. Proper nouns present in the sentence have to be identified.

6. Punjabi is a free word order language, so it was a challenging task to

identify the phrase performing the function of subject in the sentence.

7. Developing a rule-based system for Machine Translation is itself a major

challenge. A rule made for a particular type of sentence may be overruled

in another type of sentence.

8. The output of the translator needs grammatical correction.

1.2.3 Need and Scope

The need for the system arises from the translation of legal documents

transferred from the district courts of Punjab to the High Court in Chandigarh.

First Information Reports (FIRs), which are written in Punjabi, are translated into

English before being presented to the High Court. The scope of the system can be

extended to many legal agencies where translation from Punjabi to English is needed.

The need for legal document translation can arise in a number of different

situations, from the finalisation of a large international business deal to the

relocation of employees from one company site to another across national

borders. Serious circumstances, such as litigation and disputes over business

affairs taken to foreign courts, can also call for legal translations. Legal translation

services may be required by any business or individual, though they are most

commonly required by law offices and courts, especially for court proceedings on

an international level. So, when lawyers deal with foreign documents, legal

translation services are a must.

1.2.4 Potential Use

Today the need for translation is much greater than it was in the past. Translation is

undoubtedly an important topic socially, politically, commercially, scientifically,

and intellectually (or philosophically), and one whose importance is likely to

increase day by day. Some of the areas highlighting the importance of MT are

briefly described below:

The socio-political importance of MT arises in communities where more

than one language is generally spoken. Here, the only viable alternative to

rather widespread use of translation is the adoption of a single common

language, which is not an attractive alternative, because it involves the

dominance of the chosen language, to the disadvantage of speakers of the

other languages, and raises the prospect of the other languages becoming

second-class, and ultimately disappearing. Since the loss of a language

often involves the disappearance of a distinctive culture, and a way of


thinking, this is a loss that should matter to everyone. So translation is

necessary for communication for ordinary human interaction and for

gathering the required information.[10] The major problem with

translation is the scarcity of human translators. There is also a

limit on the extent of their productivity without automation. In short, the

automation of translation is a social and political necessity for modern

societies that do not wish to impose a common language on their members.

It is also a necessity of organizations like the European Community and

the UN, for whom multilingualism is both a basic principle and a fact of

everyday life.

The commercial importance of MT is summarized below:

(a) Manual translation is expensive. Translation is a highly skilled job,

requiring knowledge of a number of languages, and in some countries,

translators' salaries are comparable to those of other highly trained

professionals.

(b) Machine Translation is fast, whereas manual translation is slow and

sometimes causes loss of revenue for the company. A professional

translator translates approximately 4-6 pages (approximately 2000 words)

per day, which lengthens the time needed to translate product

documentation and delays the launch of a new product. Considering these

drawbacks of manual translation, Machine Translation is comparatively

more important in speeding up the process of translation. However, the

output of a machine translator can be further edited by human

translators.

Scientifically, MT is interesting, because it is an obvious application and

testing ground for many ideas in Computer Science, Artificial Intelligence

and Linguistics.

Philosophically, MT is again interesting, because it represents an attempt

to automate an activity that can require the full range of human knowledge.

For example, getting the correct translation of "negatively charged

electrons and protons" into Punjabi depends upon knowing that protons

are positively charged, so the interpretation cannot be something like

"negatively charged electrons and negatively charged protons". In this

sense, the extent to which one can automate translation is an indication of

the extent to which one can automate thinking.[10]

MT is mainly used to bridge the digital divide. The Internet is changing the

world very fast, and a vast amount of information can be found on it, but

most of this information is in English. In the context of rural India, most of

this information is effectively unavailable to the rural masses, who have

little or no knowledge of English. In spite of all the progress

being made in the field of Information Technology, the rural masses

remain deprived of technological advancements. One of the primary

reasons for this is poor information distribution, and the language

barrier is one of its biggest hurdles. There is

a great demand to translate Web pages and electronic mail messages into

native languages, and hence a demand for Internet-based online

translation services.

MT can be used to assist human translators. There is a demand for online

versions of electronic dictionaries as translation systems for helping

human translators.

Thus, if MT becomes accurate and efficient enough, it can break down

cultural barriers and make communication much easier among speakers of

different languages.

1.3 Assumptions

We cannot build a fully automatic, high-quality Machine Translation System. It is

difficult even to build a system for two languages with different word orders. The

system therefore works on a restricted set of sentences. The assumptions made for

the development of this system are:

1. If a paragraph is given as input to the system, each sentence in it should end

with a proper delimiter.

2. The sentences should be simple. The system does not work for complex,

compound, passive and interrogative sentences.

3. The sentence given as input to the translator should be limited to six

phrases including the verb phrase.

4. Word-level ambiguity, where a word can have a number of tags with the same

grammatical category but different meanings depending on the context, is not

resolved.

5. Abbreviations must contain a period between the characters.

1.4 Architecture of Punjabi to English Machine Translation System

Fig 1.6 Architecture of Punjabi to English Machine Translation System


1.5 Approach Applied for the System

The approach applied for our Machine Translation System is the rule-based

approach. Since the two languages have different word orders and a

number of divergence patterns, the indirect approach is the best suited

for this type of translation. The system is broadly divided into three phases:

the Analysis Phase, the Translation Phase and the Generation Phase. The Analysis

phase consists of Pre-processing, Tokenization, Tagging and Chunking. The

Translation phase includes Translation and Transliteration of each token, and the

Generation phase involves Synthesis and Post-Processing.

The steps involved are briefly introduced below:

1.5.1 Analysis Component

The analysis component analyzes the source language text and passes the

tagged phrases to the translation engine.

1.5.1.1 Pre-Processing and Tokenization

1.5.1.1.1 Tokenization

The tokenizer takes as input the text generated by the previous module. Using a

space or a punctuation mark as the delimiter, it extracts tokens (words)

from the text and passes them to the next module for further processing.
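A delimiter-based tokenizer of this kind can be sketched in a few lines. The sketch below is an illustration under assumptions, not the module built for this system: the delimiter set (whitespace, the danda and a few common punctuation marks), the function name and the romanized sample sentence are all chosen for the example. Note that treating the period as a delimiter would split abbreviations, which a real tokenizer would have to handle separately.

```python
import re

# Runs of characters that are not whitespace, the danda (U+0964),
# or common punctuation marks. The delimiter set is an assumption.
TOKEN_PATTERN = re.compile(r"[^\s\u0964,;:?!.]+")


def tokenize(text):
    """Split text into word tokens, dropping spaces and punctuation marks."""
    return TOKEN_PATTERN.findall(text)


# Hypothetical romanized input ending in a danda.
tokens = tokenize("uh ghar gia, fir vapas aaia\u0964")
```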

1.5.1.1.2 Pre-Processing

In the pre-processing phase, a number of operations are applied to the input

sentences so that they can be processed by the translator with better accuracy.

The system performs the following pre-processing tasks.

(a) Text Normalization

A small module has been developed to generate the database for

standardization. The module finds the spelling mistakes of each word and

replaces them with the correct word taken from the database. For example, in

the word , pairi bindi may be included after the character ; it should be

corrected by taking the correct word from the database. In some words adhak is

used with some characters, in others it is not. The database for

standardization has been developed by analyzing the frequency of

occurrence of the variant spellings of each word.
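The frequency-based standardization can be illustrated as follows. This is a sketch under assumptions: the variant groups are supplied by hand, the most frequent spelling in a corpus is taken as the standard one, and the romanized words and function names are hypothetical stand-ins for the Gurmukhi entries of the actual database.

```python
from collections import Counter


def build_standard_map(variant_groups, corpus_tokens):
    """Map each variant spelling to the most frequent spelling in its group."""
    counts = Counter(corpus_tokens)
    mapping = {}
    for group in variant_groups:
        standard = max(group, key=lambda w: counts[w])
        for word in group:
            if word != standard:
                mapping[word] = standard
    return mapping


def normalize(tokens, mapping):
    """Replace known variant spellings; leave all other tokens unchanged."""
    return [mapping.get(tok, tok) for tok in tokens]


# Hypothetical romanized variants of one word, with and without pairi bindi.
groups = [("zakhmi", "jakhmi")]
corpus = ["zakhmi", "zakhmi", "jakhmi"]
standard_map = build_standard_map(groups, corpus)
cleaned = normalize(["uh", "jakhmi", "si"], standard_map)
```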

(b) Identifying Collocations

Many words in the input sentences create problems if treated alone. In the

pre-processing phase, we recognize such words and join them into a

single word so that it can be translated into the target language.

P:

T: uhd maut hds vicc h ga

P:

T: lak d bhl shur kar ditt ga

In the above sentences, (h ga) is joined as (h) and (kar ditt)

is joined as (kt)


The pre-processor combines adjoining words of the sentence into a single word

by looking them up in the database of joined words.
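The lookup-and-join step can be sketched as a single left-to-right pass over the tokens. The code is illustrative only: the two database entries are ASCII approximations of the pairs mentioned above, the database is modeled as a plain dictionary, and only two-word collocations are handled.

```python
def join_collocations(tokens, joined_words):
    """Replace adjacent word pairs found in the database by their joined form."""
    result, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in joined_words:
            result.append(joined_words[pair])
            i += 2  # both words of the pair are consumed
        else:
            result.append(tokens[i])
            i += 1
    return result


# Assumed romanized entries standing in for the database of joined words.
JOINED_WORDS = {("ho", "gai"): "hoi", ("kar", "ditti"): "kiti"}
joined = join_collocations(["maut", "hadse", "vich", "ho", "gai"], JOINED_WORDS)
```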

(c) Identifying Named Entities

Named entity tagging refers to the task of identifying named entities (such as

person names) in a text. It is an important subtask for information extraction and

retrieval. The system under discussion extracts only those words that are the

names of persons, places etc. Extracting such words is important so

that they are not translated. Once recognized, these words are sent

to the transliteration module. To recognize them, some rules have been developed

that check the preceding or succeeding word. For example, names may be

preceded by (sr), (sardr), (sardrn,), (srmt), (misar),

(misz), (mis) etc. or followed by (sigh), (kaur) or surname.
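Rules of this kind, based on the preceding or succeeding word, can be sketched as below. The word lists are ASCII approximations of the titles and name markers listed above and are assumptions, as are the function name and the sample sentence; surname lookup is left out.

```python
# Assumed romanized forms of the titles and name markers.
TITLES = {"sri", "sardar", "sardarni", "srimati", "mistar", "misiz", "mis"}
NAME_MARKERS = {"singh", "kaur"}


def find_named_entities(tokens):
    """Return tokens judged to be parts of names by their neighbours."""
    flags = [False] * len(tokens)
    for i, tok in enumerate(tokens):
        low = tok.lower()
        if low in TITLES and i + 1 < len(tokens):
            flags[i + 1] = True  # the word after a title is a name
        if low in NAME_MARKERS:
            flags[i] = True      # the marker itself is part of the name
            if i > 0:
                flags[i - 1] = True  # and so is the word before it
    return [tok for tok, is_name in zip(tokens, flags) if is_name]


names = find_named_entities(
    ["sardar", "manjeet", "singh", "ne", "ih", "savikar", "kita"]
)
```

The tokens returned would then be routed to the transliteration module instead of the bilingual dictionary.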

1.5.1.2 Morph Analyzing and Tagging

The next step is to tag each word with the grammatical information about it. In

Punjabi grammar, the parts of speech include noun, verb, adjective, adverb,

pronoun, preposition, conjunction, interjection, operators, auxiliary verbs etc. A tag

contains information about the grammatical category of the word, its gender, number,

person and the case in which it can be used. Tagging works in two steps.


1.5.1.2.1 POS Tagging

The morphological database already created at Punjabi University, Patiala is

used to get the information about each token, and tags are formed according to

the information gathered from it.

A tag has the form grammatical

category-gender-person-number-case-tense-phrase-type. The fields not applicable

to a particular category are left blank. For

example, the tags for the word (d) are ipo- - - - - - - -, v-b-s-s- -f-x- -. These

tags show that the word can be used as an inflected postposition or as a verb

with any gender, second person, singular number and future tense. In Punjabi,

a word can have a number of tags because it can be used in a number of

ways. The tagger first checks the category of each word in the database and

then adds gender, number, person or tense information to it.[19, 20] Each word

is thus attached to a number of tags, and it is necessary to check which tag

applies to the word in a given sentence.
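The hyphen-separated tag format can be unpacked with a small helper. This sketch reads only the first six fields, whose order (category, gender, person, number, case, tense) follows the description above; the field names and the function are assumptions, and the trailing fields are ignored because their exact layout is not spelled out here.

```python
# Field order taken from the tag description; names are assumptions.
FIELDS = ("category", "gender", "person", "number", "case", "tense")


def parse_tag(tag):
    """Split a tag into its first six fields; blank fields become ''."""
    parts = [part.strip() for part in tag.split("-")]
    parts += [""] * (len(FIELDS) - len(parts))  # pad short tags
    return dict(zip(FIELDS, parts))


verb_reading = parse_tag("v-b-s-s- -f-x- -")
postposition_reading = parse_tag("ipo- - - - - - - -")
```

For the verb reading this gives category v, gender b, person s, number s, a blank case and tense f, matching the interpretation given in the text.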

1.5.1.2.2 Ambiguity Resolution between Different Tags

A hybrid approach, combining rule-based and statistical

approaches, is used to resolve ambiguity for a word with a number of tags. The first

level of ambiguity exists when a word can have a number of tags of different

grammatical categories. The probability of each candidate tag

is calculated with the Viterbi algorithm from the frequency of

occurrence of the tags of the preceding words.

For Example

P:

T: uh hds vicc zam h gi

In the above sentence, the word (zam) has two tags: one shows that it is

a noun and the other shows that it is an inflected adjective. In this sentence, the

probability of the word occurring as a noun is higher than as an adjective, so

it is considered a noun.

But, if the sentence is

P:

T: uh n zam dm n mr ditt

Here the word (zam) has a higher probability of being an adjective.
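The noun-versus-adjective decision in the two sentences above can be reproduced with a small Viterbi search over tag sequences. The sketch below makes several simplifying assumptions that are not taken from the thesis: only tag-to-tag transition frequencies are used (all candidate tags of a word are treated as equally likely), probabilities are add-one smoothed, and the tag names and the three training sequences are invented for the example.

```python
import math
from collections import Counter

START = "<s>"


def train_transitions(tag_sequences):
    """Count tag bigrams (previous tag, tag) in a tagged training corpus."""
    counts, totals, tagset = Counter(), Counter(), set()
    for tags in tag_sequences:
        prev = START
        for tag in tags:
            counts[(prev, tag)] += 1
            totals[prev] += 1
            tagset.add(tag)
            prev = tag
    return counts, totals, tagset


def viterbi(candidates, counts, totals, tagset):
    """Pick the most probable tag path; candidates[i] lists word i's tags."""
    def log_p(prev, tag):  # add-one smoothed transition probability
        return math.log((counts[(prev, tag)] + 1) / (totals[prev] + len(tagset)))

    best = {START: (0.0, [])}  # tag -> (score, best path ending in that tag)
    for options in candidates:
        step = {}
        for tag in options:
            score, path = max(
                ((s + log_p(prev, tag), p) for prev, (s, p) in best.items()),
                key=lambda item: item[0],
            )
            step[tag] = (score, path + [tag])
        best = step
    return max(best.values(), key=lambda item: item[0])[1]


# Hypothetical tagged training data: pn pronoun, n noun, psp postposition.
training = [
    ["pn", "n", "psp", "n", "v"],
    ["pn", "psp", "adj", "n", "psp", "v"],
    ["n", "psp", "n", "v"],
]
model = train_transitions(training)
first = viterbi([["pn"], ["n"], ["psp"], ["n", "adj"], ["v"]], *model)
second = viterbi([["pn"], ["psp"], ["n", "adj"], ["n"], ["psp"], ["v"]], *model)
```

In the first sequence the ambiguous word is followed by a verb and is resolved as a noun; in the second it is followed by a noun and is resolved as an adjective.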

The second level of ambiguity that has been resolved arises when a number of

tags show a particular word as a noun that can be either singular or plural.

For example, the tags for the word (mu) are n-m- -s-o - - -, n-m- -p-d- - - -. A

rule-based approach is used for this type of ambiguity.

The tagged word can be a noun in singular or a noun in plural.

In the sentence,

P:

T: sr mu laan ga


In the above case the tag n-m- -p-d- - - - - is selected, as the number of the verb

phrase is plural, and the appropriate English word is boys, whereas in the case

P:

T: ik mu n main rki

Here the tag for (mu) should be n-m- -s-o- - - - and its appropriate

equivalent in English is boy. This type of ambiguity can be resolved by

considering the number, i.e. singular or plural, of the auxiliary verb or the main

verb present in the sentence. For the resolution of ambiguity, the rules are ordered

according to priority.
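The agreement rule can be expressed as a short selection function. This is a minimal sketch assuming that the number is the fourth hyphen-separated field of a tag and that the number of the verb phrase is already known; the function name and the fallback to the first tag are choices made for the example, not the system's prioritized rule set.

```python
def select_noun_tag(noun_tags, verb_number):
    """Return the first candidate tag whose number field matches the verb's."""
    for tag in noun_tags:
        fields = [field.strip() for field in tag.split("-")]
        if len(fields) > 3 and fields[3] == verb_number:
            return tag
    return noun_tags[0]  # fall back to the first tag if nothing agrees


candidate_tags = ["n-m- -s-o- - - -", "n-m- -p-d- - - -"]
plural_choice = select_noun_tag(candidate_tags, "p")
singular_choice = select_noun_tag(candidate_tags, "s")
```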

1.5.1.3 Chunking

Chunking involves grouping of words of input sentence into phrases such as

noun phrase, postpositional phrase and verb phrase. A rule based chunker has

been developed for this purpose. First, the subject of the Punjabi sentence is

chosen by applying the rules for the subject noun phrase, and then the other

phrases are recognized from the predicate.

For Example

P:

T: shkat n ih savikr kt hai

Here (shkat n) is taken as the subject noun phrase and the rest of the sentence

as the predicate.


By combining different word classes, noun phrases, adjective phrases,

postpositional phrases and verb phrases are formed. A noun phrase consists of

nouns or pronouns. It can be preceded by modifiers, which can be adjectives.

An adjective phrase is a phrase with an adjective as its head. It can also consist

of adjectives with modifiers. (bahut m ) is an adjective phrase in the

sentence (uhd niyat bahut m s). In Punjabi, a

preposition is called a postposition, as it comes after the noun or pronoun. The

postposition and its object make up a postpositional phrase, which can be used

to modify noun phrases. For example, in the sentence

(uh dasv jamt d vidirth s), (dasv jamt d) is the

postpositional phrase. A verb phrase consists of the main verb, followed by operators

and an auxiliary verb and preceded by an adverb. Operators are of four types:

primary, passive, modal and progressive.

These operators help to emphasize the action of the main verb.[8] For

implementation in the MT system, a separate database is maintained for conjunct

verbs, whose English equivalents are found by checking the preceding word.

Chunking is performed using the rules for noun phrases, postpositional phrases

and verb phrases. The rules for division into phrases are stored in the rule base

of Punjabi, and the conversion rules are stored in the rule base of the target

language.
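A rule-based chunker along these lines can be sketched over POS-tagged tokens. The rules below are deliberately simplified assumptions: a noun phrase is a run of adjectives ending in one noun or pronoun, a postposition turns the preceding noun phrase into a postpositional phrase, and adverbs, verbs, operators and auxiliaries are grouped into a verb phrase. The tag names and the ASCII romanization of the example sentence are also assumptions.

```python
NOMINAL = {"n", "pn", "adj"}   # word classes that can occur in a noun phrase
HEADS = {"n", "pn"}            # word classes that can head a noun phrase
VERBAL = {"adv", "v", "op", "aux"}


def chunk(tagged_tokens):
    """Group (word, pos) pairs into (phrase_type, words) chunks."""
    chunks = []
    words, kind, has_head = [], None, False

    def flush():
        nonlocal words, kind, has_head
        if words:
            chunks.append((kind, words))
        words, kind, has_head = [], None, False

    for word, pos in tagged_tokens:
        if pos == "psp" and kind == "NP":
            # A postposition closes the preceding noun phrase as a PP.
            words.append(word)
            kind = "PP"
            flush()
            continue
        new_kind = "NP" if pos in NOMINAL else "VP" if pos in VERBAL else "X"
        if new_kind != kind or (kind == "NP" and has_head):
            flush()
            kind = new_kind
        words.append(word)
        if pos in HEADS:
            has_head = True
    flush()
    return chunks


phrases = chunk([
    ("uh", "pn"), ("dasvin", "adj"), ("jamat", "n"),
    ("da", "psp"), ("vidiarthi", "n"), ("si", "aux"),
])
```

For this sentence the sketch yields a pronoun noun phrase, the postpositional phrase discussed above, a second noun phrase and the verb phrase.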


1.5.2 Translation Engine

This component either translates or transliterates each source language token

into its target language equivalent.

1.5.2.1 Transliteration

The named entities recognized in the pre-processing phase and the

out-of-vocabulary words are given as input to the transliteration module.

Transliteration means writing a word character by character in the target script. For

example, ਮਨਜੀਤ (manjīt) in Punjabi is transliterated in English as manjeet: m for

ਮ, n for ਨ, j for ਜ, ee for ੀ, t for ਤ. The transliteration process also uses a

database of character mappings and certain rules to insert vowels wherever

needed.
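Character mapping with vowel insertion can be sketched for the example name. Everything in the block is an assumption made for illustration: the mapping table covers only the five Gurmukhi characters of that name (written as Unicode escapes), and the vowel rule, which inserts an "a" after a syllable-opening consonant that is directly followed by another consonant, is a crude heuristic rather than the rule set used in the system.

```python
# Assumed character table covering only the letters of the example word.
CONSONANTS = {"\u0a2e": "m", "\u0a28": "n", "\u0a1c": "j", "\u0a24": "t"}
VOWEL_SIGNS = {"\u0a40": "ee"}


def transliterate(word):
    """Map Gurmukhi characters to Roman ones, inserting 'a' where needed."""
    out = []
    open_syllable = True  # True when the next consonant starts a new syllable
    for i, ch in enumerate(word):
        nxt = word[i + 1] if i + 1 < len(word) else ""
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch])
            if open_syllable and nxt in CONSONANTS:
                out.append("a")        # inherent vowel between two consonants
                open_syllable = False  # the next consonant closes the syllable
            else:
                open_syllable = True
        elif ch in VOWEL_SIGNS:
            out.append(VOWEL_SIGNS[ch])
        else:
            out.append(ch)  # pass unknown characters through unchanged
    return "".join(out)


name = transliterate("\u0a2e\u0a28\u0a1c\u0a40\u0a24")  # the name Manjeet
```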

1.5.2.2 Translation using Bilingual Dictionary

The next step in translation is the use of a bilingual dictionary to translate each

Punjabi word to its English equivalent. The meaning of each word is selected

according to the morph information given in the tag attached to it.
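A tag-sensitive dictionary lookup can be sketched as follows. The lexicon layout (keyed by word, grammatical category and number), the romanized entry and the identity fallback used in place of the transliteration module are all assumptions made for this example.

```python
def translate_token(word, tag, lexicon, fallback=lambda w: w):
    """Look up a word by (word, category, number); else use the fallback."""
    fields = [field.strip() for field in tag.split("-")]
    fields += [""] * (4 - len(fields))  # pad tags with fewer than four fields
    key = (word, fields[0], fields[3])
    return lexicon.get(key, fallback(word))


# Hypothetical lexicon entries for one noun in singular and plural.
LEXICON = {("munde", "n", "p"): "boys", ("munde", "n", "s"): "boy"}
plural = translate_token("munde", "n-m- -p-d- - - -", LEXICON)
singular = translate_token("munde", "n-m- -s-o- - - -", LEXICON)
unknown = translate_token("manjeet", "n-m- -s-d- - - -", LEXICON)
```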

1.5.3 Generation Component

This component synthesizes the target language equivalent tokens into

sentences and then corrects those sentences to increase the accuracy of the

translator.


1.5.3.1 Synthesis

After the English equivalent of each word in the Punjabi sentence is obtained, the

words are synthesized first into phrases and then into a sentence using the

structural rules of the English language. These rules are stored in the rule base of

English.
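The simplest structural rule, moving from Subject-Object-Verb to Subject-Verb-Object order, can be sketched with a stable sort over role-labelled phrases. The role labels, the placement of any other phrase after the object and the sample phrases are assumptions; the rule base of the actual system is richer than this.

```python
# Target position of each role in an English sentence (assumed labels).
TARGET_ORDER = {"S": 0, "V": 1, "O": 2}


def reorder_sov_to_svo(chunks):
    """Reorder (role, phrase) pairs into English order and join them."""
    # sorted() is stable, so phrases with the same rank keep their order.
    ordered = sorted(chunks, key=lambda c: TARGET_ORDER.get(c[0], 3))
    return " ".join(phrase for _, phrase in ordered)


# Translated phrases in source (SOV) order; the wording is illustrative.
sentence = reorder_sov_to_svo(
    [("S", "the complainant"), ("O", "this"), ("V", "has accepted")]
)
```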

1.5.3.2 Post Processing

After all the source text has been converted to target text, some

grammatical errors remain in the generated output. The Post Processing phase

corrects them using rules that we have formulated for this purpose.

Some of the rules used for post editing are:

Deletion of a preposition, if present with an adverb: if a preposition and an

adverb occur together in the translated sentence, the preposition is deleted.

Replace "of" with "to" if it is between a verb and a noun.
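The two post-editing rules can be sketched over POS-tagged English output. The sketch reads "exist together" as "are adjacent", which is an assumption, and the tag names and sample sentences are invented for the example.

```python
def post_edit(tagged):
    """Apply two post-editing rules to a list of (word, pos) pairs."""
    result = []
    for i, (word, pos) in enumerate(tagged):
        prev_pos = tagged[i - 1][1] if i > 0 else None
        next_pos = tagged[i + 1][1] if i + 1 < len(tagged) else None
        # Rule 1: drop a preposition that stands next to an adverb.
        if pos == "prep" and "adv" in (prev_pos, next_pos):
            continue
        # Rule 2: replace "of" with "to" between a verb and a noun.
        if word == "of" and prev_pos == "v" and next_pos == "n":
            word = "to"
        result.append(word)
    return result


fixed_adverb = post_edit(
    [("he", "pn"), ("went", "v"), ("to", "prep"), ("there", "adv")]
)
fixed_of = post_edit(
    [("he", "pn"), ("complained", "v"), ("of", "prep"), ("police", "n")]
)
```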

1.6 Thesis Organization

In the first chapter of this thesis, Machine Translation is introduced and details

about various types of MT systems are provided. The benefits, applications, and

challenges of Machine Translation are described. After elaborating the various

approaches used for Machine Translation and stages in a generic MT system, a

formal description about the research question that we intend to undertake in this


thesis work along with the major contribution and achievements of the research

are provided.

Chapter 2 discusses the existing work in the field of Machine Translation in India

and outside India. This chapter on literature survey forms the basis of our work

on developing the Machine Translation System and later on helps us in

comparing our work with the existing state of the art in Machine Translation

System.

Chapter 3 explains Punjabi and English languages and divergences in their

patterns with respect to Machine Translation.

Chapters 4, 5 and 6 provide the design and implementation details of various

activities involved in the Machine Translation System. Chapter 4 describes the

Analysis Phase which contains Pre-processing, Morph Analyzing, Tagging and

Chunking. Chapter 5 describes the Translation Engine and Chapter 6 discusses

the Generation Component which includes Synthesis and Post Processing.

Chapter 7 provides the evaluation of the system and its results.

Chapter 8 concludes the thesis by providing a summary of the research work

undertaken, contributions of the research work, assumptions and limitations, and

some directions in which this work could be extended in future.


1.7 Summary

In this chapter, an introduction to Machine Translation, the key activities involved and

the various approaches for developing Machine Translation have been provided. It is

followed by a formal statement for this research work along with its objectives,

challenges involved, need and scope, and potential application areas of this

system. Further, the approach followed to develop the Punjabi to English

Machine Translation System has been discussed along with an overview of the

architecture of this system. The chapter concludes with a brief outline of this

thesis. The next chapter provides a survey of the existing literature in the field of

Machine Translation.
