Machine Learning for Machine Translationacmsc/TMW2014/P_bhattacharyya.pdfMachine Learning for...

126
Machine Learning for Machine Translation Pushpak Bhattacharyya CSE Dept., IIT Bombay ISI Kolkata, Jan 6, 2014 Jan 6, 2014 ISI Kolkata, ML for MT 1

Transcript of Machine Learning for Machine Translationacmsc/TMW2014/P_bhattacharyya.pdfMachine Learning for...

Machine Learning for

Machine Translation

Pushpak Bhattacharyya

CSE Dept

IIT Bombay

ISI Kolkata

Jan 6 2014

Jan 6 2014 ISI Kolkata ML for MT 1

NLP Trinity

Algorithm

Problem

Language Hindi

Marathi

English

French Morph Analysis

Part of Speech Tagging

Parsing

Semantics

CRF

HMM

MEMM

NLP Trinity

Jan 6 2014 ISI Kolkata ML for MT 2

NLP Layer

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

Increased Complexity Of Processing

Jan 6 2014 ISI Kolkata ML for MT 3

Motivation for MT

MT NLP Complete

NLP AI complete

AI CS complete

How will the world be different when the language barrier disappears

Volume of text required to be translated currently exceeds translatorsrsquo capacity (demand gt supply)

Solution automation

Jan 6 2014 ISI Kolkata ML for MT 4

Taxonomy of MT systems

MT Approaches

Knowledge Based Rule Based MT

Data driven Machine Learning Based

Example Based MT (EBMT)

Statistical MT

Interlingua Based Transfer Based

Jan 6 2014 ISI Kolkata ML for MT 5

Why is MT difficult

Language divergence

Jan 6 2014 ISI Kolkata ML for MT 6

Why is MT difficult Language Divergence

One of the main complexities of MT Language Divergence

Languages have different ways of expressing meaning

Lexico-Semantic Divergence

Structural Divergence

Our work on English-IL Language Divergence with illustrations from Hindi (Dave Parikh Bhattacharyya Journal of MT 2002)

Jan 6 2014 ISI Kolkata ML for MT 7

Languages differ in expressing thoughts Agglutination

Finnish ldquoistahtaisinkohanrdquo

English I wonder if I should sit down for a whileldquo

Analysis

ist + sit verb stem

ahta + verb derivation morpheme to do something for a while

isi + conditional affix

n + 1st person singular suffix

ko + question particle

han a particle for things like reminder (with declaratives) or softening (with questions and imperatives)

Jan 6 2014 ISI Kolkata ML for MT 8

Language Divergence Theory Lexico-

Semantic Divergences (few examples)

Conflational divergence F vomir E to be sick E stab H chure se maaranaa (knife-with hit) S Utrymningsplan E escape plan

Categorial divergence

Change is in POS category

The play is on_PREP (vs The play is Sunday) Khel chal_rahaa_haai_VM (vs khel ravivaar ko haai)

Jan 6 2014 ISI Kolkata ML for MT 9

Language Divergence Theory Structural Divergences

SVOSOV E Peter plays basketball H piitar basketball kheltaa haai

Head swapping divergence E Prime Minister of India H bhaarat ke pradhaan mantrii (India-of Prime

Minister)

Jan 6 2014 ISI Kolkata ML for MT 10

Language Divergence Theory Syntactic

Divergences (few examples)

Constituent Order divergence

E Singh the PM of India will address the nation today

H bhaarat ke pradhaan mantrii singh hellip (India-of PM Singhhellip)

Adjunction Divergence

E She will visit here in the summer

H vah yahaa garmii meM aayegii (she here summer-in will come)

Preposition-Stranding divergence

E Who do you want to go with

H kisake saath aap jaanaa chaahate ho (who withhellip)

Jan 6 2014 ISI Kolkata ML for MT 11

Vauquois Triangle

Jan 6 2014 ISI Kolkata ML for MT 12

Kinds of MT Systems (point of entry from source to the target text)

Jan 6 2014 ISI Kolkata ML for MT 13

Illustration of transfer SVOSOV

S

NP VP

V N NP

N John eats

bread

S

NP VP

V N

John eats

NP

N

bread

(transfer svo sov)

Jan 6 2014 ISI Kolkata ML for MT 14

Universality hypothesis

Universality hypothesis At the

level of ldquodeep meaningrdquo all texts

are the ldquosamerdquo whatever the

language

Jan 6 2014 ISI Kolkata ML for MT 15

Understanding the Analysis-Transfer-Generation over Vauquois triangle (14)

H11 सरकार_न चनावो_क_बाद मबई म करो_क_माधयम_स

अपन राजसव_को बढ़ाया | T11 Sarkaar ne chunaawo ke baad Mumbai me karoM ke

maadhyam se apne raajaswa ko badhaayaa

G11 Government_(ergative) elections_after Mumbai_in

taxes_through its revenue_(accusative) increased

E11 The Government increased its revenue after the

elections through taxes in Mumbai

Jan 6 2014 ISI Kolkata ML for MT 16

Interlingual representation complete disambiguation

Washington voted Washington to power Vote

past

Washington power Washington

emphasis

ltis-a gt action

ltis-a gt place

ltis-a gt capability ltis-a gt hellip

ltis-a gt person

goal

Jan 6 2014 ISI Kolkata ML for MT 17

Kinds of disambiguation needed for a complete and correct interlingua graph

N Name

P POS

A Attachment

S Sense

C Co-reference

R Semantic Role

Jan 6 2014 ISI Kolkata ML for MT 18

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech Noun or Verb

Jan 6 2014 ISI Kolkata ML for MT 19

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech

NER

John is the name of a PERSON

Jan 6 2014 ISI Kolkata ML for MT 20

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech

NER

WSD Financial

bank or River bank

Jan 6 2014 ISI Kolkata ML for MT 21

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech

NER

WSD

Co-reference

ldquoitrdquo ldquobankrdquo

Jan 6 2014 ISI Kolkata ML for MT 22

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech

NER

WSD

Co-reference

Subject Drop

Pro drop (subject ldquoIrdquo)

Jan 6 2014 ISI Kolkata ML for MT 23

Typical NLP tools used

POS tagger

Stanford Named Entity Recognizer

Stanford Dependency Parser

XLE Dependency Parser

Lexical Resource

WordNet

Universal Word Dictionary (UW++)

Jan 6 2014 ISI Kolkata ML for MT 24

System Architecture Stanford

Dependency Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple Sentence Analyser

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

Simple Enco

Simple Enco

Simple Enco

Simple Enco

Simple Enco

Simplifier

Jan 6 2014 ISI Kolkata ML for MT 25

Target Sentence Generation from interlingua

Lexical Transfer

Target Sentence

Generation

Syntax

Planning

Morphological

Synthesis

(WordPhrase

Translation ) (Word form

Generation) (Sequence)

Jan 6 2014 ISI Kolkata ML for MT 26

Generation Architecture Deconversion = Transfer + Generation

Jan 6 2014 ISI Kolkata ML for MT 27

Transfer Based MT

Marathi-Hindi

Jan 6 2014 ISI Kolkata ML for MT 28

Indian Language to Indian Language Machine Translation (ILILMT)

Bidirectional Machine Translation System

Developed for nine Indian language pairs

Approach

Transfer based

Modules developed using both rule based and statistical approach

Jan 6 2014 ISI Kolkata ML for MT 29

Architecture of ILILMT System

Morphological Analyzer

Source Text

POS Tagger

Chunker

Vibhakti Computation

Name Entity Recognizer

Word Sense Disambiguatio

n

Lexical Transfer

Agreement Feature

Interchunk

Word Generator

Intrachunk

Target Text

Analysis

Transfer

Generation

Jan 6 2014 ISI Kolkata ML for MT 30

M-H MT system Evaluation

Subjective evaluation based on machine translation quality

Accuracy calculated based on score given by linguists

S5 Number of score 5 Sentences S4 Number of score 4 sentences

S3 Number of score 3 sentences N Total Number of sentences

Accuracy =

Score 5 Correct Translation

Score 4 Understandable with

minor errors

Score 3 Understandable with

major errors

Score 2 Not Understandable

Score 1 Non sense translation

Jan 6 2014 ISI Kolkata ML for MT 31

Evaluation of Marathi to Hindi MT System

0

02

04

06

08

1

12

MorphAnalyzer

POS Tagger Chunker VibhaktiCompute

WSD LexicalTransfer

WordGenerator

Precision

Recall

Module-wise precision and recall

Jan 6 2014 ISI Kolkata ML for MT 32

Evaluation of Marathi to Hindi MT System (cont)

Subjective evaluation on translation quality Evaluated on 500 web sentences Accuracy calculated based on score given according to the

translation quality Accuracy 6532

Result analysis

Morph POS tagger chunker gives more than 90 precision but Transfer WSD generator modules are below 80 hence degrades MT quality

Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Jan 6 2014 ISI Kolkata ML for MT 33

Statistical Machine Translation

Jan 6 2014 ISI Kolkata ML for MT 34

Czeck-English data

[nesu] ldquoI carryrdquo

[ponese] ldquoHe will carryrdquo

[nese] ldquoHe carriesrdquo

[nesou] ldquoThey carryrdquo

[yedu] ldquoI driverdquo

[plavou] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 35

To translate hellip

I will carry

They drive

He swims

They will drive

Jan 6 2014 ISI Kolkata ML for MT 36

Hindi-English data

[DhotA huM] ldquoI carryrdquo

[DhoegA] ldquoHe will carryrdquo

[DhotA hAi] ldquoHe carriesrdquo

[Dhote hAi] ldquoThey carryrdquo

[chalAtA huM] ldquoI driverdquo

[tErte hEM] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 37

Bangla-English data

[bai] ldquoI carryrdquo

[baibe] ldquoHe will carryrdquo

[bay] ldquoHe carriesrdquo

[bay] ldquoThey carryrdquo

[chAlAi] ldquoI driverdquo

[sAMtrAy] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 38

To translate hellip (repeated)

I will carry

They drive

He swims

They will drive

Jan 6 2014 ISI Kolkata ML for MT 39

Foundation

Data driven approach Goal is to find out the English sentence e

given foreign language sentence f whose p(e|f) is maximum

Translations are generated on the basis of statistical model

Parameters are estimated using bilingual parallel corpora

Jan 6 2014 ISI Kolkata ML for MT 40

SMT Language Model

To detect good English sentences

Probability of an English sentence w1w2 helliphellip wn can be written as

Pr(w1w2 helliphellip wn) = Pr(w1) Pr(w2|w1) Pr(wn|w1 w2 wn-1)

Here Pr(wn|w1 w2 wn-1) is the probability that word wn follows word string w1 w2 wn-1 N-gram model probability

Trigram model probability calculation

Jan 6 2014 ISI Kolkata ML for MT 41

SMT Translation Model

P(f|e) Probability of some f given hypothesis English translation e

How to assign the values to p(e|f)

Sentences are infinite not possible to find pair(ef) for all sentences

Introduce a hidden variable a that represents alignments between the individual words in the sentence pair

Sentence level

Word level

Jan 6 2014 ISI Kolkata ML for MT 42

Alignment

If the string e= e1l= e1 e2 hellipel has l words and

the string f= f1m=f1f2fm has m words

then the alignment a can be represented by a series a1

m= a1a2am of m values each between 0 and l such that if the word in position j of the f-string is connected to the word in position i of the e-string then aj= i and

if it is not connected to any English word then aj= O

Jan 6 2014 ISI Kolkata ML for MT 43

Example of alignment

English Ram went to school

Hindi Raama paathashaalaa gayaa

Ram went to school

ltNullgt Raama paathashaalaa gayaa

Jan 6 2014 ISI Kolkata ML for MT 44

Translation Model Exact expression

Five models for estimating parameters in the expression [2]

Model-1 Model-2 Model-3 Model-4 Model-5

Choose alignment given e and m

Choose the identity of foreign word given e m a

Choose the length of foreign language string given e

Jan 6 2014 ISI Kolkata ML for MT 45

a

eafef )|Pr()|Pr(

m

emafeaf )|Pr()|Pr(

m

emafememaf )|Pr()|Pr()|Pr(

m

emafem )|Pr()|Pr(

m

m

j

jj

jj emfaafem1

1

1

1

1 )|Pr()|Pr(

m

j

jj

j

jj

j

m

emfafemfaaem1

1

11

1

1

1

1 )|Pr()|Pr()|Pr(

)|Pr( emaf )|Pr( em

m

j

jj

j

jj

j emfafemfaa1

1

11

1

1

1

1 )|Pr()|Pr(

Proof of Translation Model Exact

expression

m is fixed for a particular f hence

marginalization

marginalization

Jan 6 2014 ISI Kolkata ML for MT 46

Alignment

Jan 6 2014 ISI Kolkata ML for MT 47

Fundamental and ubiquitous

Spell checking

Translation

Transliteration

Speech to text

Text to speeh

Jan 6 2014 ISI Kolkata ML for MT 48

EM for word alignment from

sentence alignment example

English

(1) three rabbits

a b

(2) rabbits of Grenoble

b c d

French

(1) trois lapins

w x

(2) lapins de Grenoble

x y z

Jan 6 2014 ISI Kolkata ML for MT 49

Initial Probabilities

each cell denotes t(a w) t(a x) etc

a b c d

w 14 14 14 14

x 14 14 14 14

y 14 14 14 14

z 14 14 14 14

The counts in IBM Model 1

Works by maximizing P(f|e) over the entire corpus

For IBM Model 1 we get the following relationship

c (w f |w e f e ) =t (w f |w e )

t (w f |we

0 ) +hellip + t (w f |wel )

c (w f |w e f e ) is the fractional count of the alignment of w f

with w e in f and e

t (w f |w e ) is the probability of w f being the translation of w e

is the count of w f in f

is the count of w e in e

Jan 6 2014 ISI Kolkata ML for MT 51

Example of expected count

C[aw (a b)(w x)]

t(aw)

= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)

t(aw)+t(ax)

14

= ----------------- X 1 X 1= 12

14+14

Jan 6 2014 ISI Kolkata ML for MT 52

ldquocountsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 13 13 13

y 0 13 13 13

z 0 13 13 13

a b

w x

a b c d

w 12 12 0 0

x 12 12 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 53

Revised probability example

trevised(a w)

12

= -------------------------------------------------------------------

(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)

Jan 6 2014 ISI Kolkata ML for MT 54

Revised probabilities table

a b c d

w 12 14 0 0

x 12 512 13 13

y 0 16 13 13

z 0 16 13 13

ldquorevised countsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 59 13 13

y 0 29 13 13

z 0 29 13 13

a b

w x

a b c d

w 12 38 0 0

x 12 58 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 56

Re-Revised probabilities table a b c d

w 12 316 0 0

x 12 85144 13 13

y 0 19 13 13

z 0 19 13 13

Continue until convergence notice that (bx) binding gets progressively stronger

b=rabbits x=lapins

Derivation of EM based Alignment Expressions

Hindi)(Say language ofy vocabular

English)(Say language ofry vocalbula

2

1

LV

LV

F

E

what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet

जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet

E1

F1

E2

F2

Jan 6 2014 ISI Kolkata ML for MT 58

Vocabulary mapping

Vocabulary VE VF

what is in a name that which we call rose by any other will smell as sweet

naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii

Jan 6 2014 ISI Kolkata ML for MT 59

Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like

11989011 11989012 hellip 119890

11198971 119891

11 11989112 hellip 119891

11198981

11989021 11989022 hellip 119890

21198972 119891

21 11989122 hellip 119891

21198982

1198901199041 119890

1199042 hellip 119890

119904119897119904 119891

1199041 1198911199042 hellip 119891

119904119898119904

1198901198781 119890

1198782 hellip 119890

119878119897119878 119891

1198781 1198911198782 hellip 119891

119878119898119878

No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890

119904119901 =Index of English word 119890119904119901in English vocabularydictionary

119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary

(Thanks to Sachin Pawar for helping with the maths formulae processing)

Jan 6 2014 ISI Kolkata ML for MT 60

Hidden variables and parameters

Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878

119904=1 where each hidden variable is

as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French

word 119911119901119902119904 = 0 otherwise

Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893

word in French vocabulary

Jan 6 2014 ISI Kolkata ML for MT 61

Likelihoods Data Likelihood L(D Θ)

Data Log-Likelihood LL(D Θ)

Expected value of Data Log-Likelihood E(LL(D Θ))

Jan 6 2014 ISI Kolkata ML for MT 62

Constraint and Lagrangian

119875119894119895

119881119865

119895=1

= 1 forall119894

Jan 6 2014 ISI Kolkata ML for MT 63

Differentiating wrt Pij

Jan 6 2014 ISI Kolkata ML for MT 64

Final E and M steps

M-step

E-step

Jan 6 2014 ISI Kolkata ML for MT 65

Combinatorial considerations

Jan 6 2014 ISI Kolkata ML for MT 66

Example

ISI Kolkata ML for MT Jan 6 2014 67

All possible alignments

ISI Kolkata ML for MT Jan 6 2014 68

First fundamental requirement of SMT

Alignment requires evidence of

bull firstly a translation pair to introduce the

POSSIBILITY of a mapping

bull then another pair to establish with

CERTAINTY the mapping

ISI Kolkata ML for MT Jan 6 2014 69

For the ldquocertaintyrdquo

We have a translation pair containing alignment candidates and none of the other words in the translation pair

OR

We have a translation pair containing all words in the translation pair except the alignment candidates

ISI Kolkata ML for MT Jan 6 2014 70

Thereforehellip

If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty

ISI Kolkata ML for MT Jan 6 2014 71

Rough estimate of data requirement

SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge

ie no meanings or grammatical properties of any words phrases or sentences

Each language has a vocabulary of 100000 words can give rise to about 500000 word forms

through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟

bdquowent‟ and bdquogone‟

ISI Kolkata ML for MT Jan 6 2014 72

Reasons for mapping to multiple words

Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register

polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)

syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood

ISI Kolkata ML for MT Jan 6 2014 73

Estimate of corpora requirement

Assume that on an average a sentence is 10 words long

an additional 9 translation pairs for getting at one of the 5 mappings

10 sentences per mapping per word

a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences

Estimate is not wide off the mark

Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs

ISI Kolkata ML for MT Jan 6 2014 74

Our work on factor based SMT

Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009

Jan 6 2014 ISI Kolkata ML for MT 75

Case Marker and Morphology crucial in E-H MT

Order of magnitiude facelift in Fluency and fidelity

Determined by the combination of suffixes and semantic relations on the English side

Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers

Jan 6 2014 ISI Kolkata ML for MT 76

Semantic relations+SuffixesCase

Markers+inflections

I ate mangoes

I ltagt ate eatpast mangoes ltobj

I ltagt mangoes ltobjpl eatpast

mei_ne aam khaa_yaa

Jan 6 2014 ISI Kolkata ML for MT 77

Our Approach

Factored model (Koehn and Hoang 2007) with the following translation factor

suffix + semantic relation case markersuffix

Experiments with the following relations

Dependency relations from the stanford parser

Deeper semantic roles from Universal Networking Language (UNL)

Jan 6 2014 ISI Kolkata ML for MT 78

Our Factorization

Jan 6 2014 ISI Kolkata ML for MT 79

Experiments

Jan 6 2014 ISI Kolkata ML for MT 80

Corpus Statistics

Jan 6 2014 ISI Kolkata ML for MT 81

Results The impact of suffix and semantic factors

Jan 6 2014 ISI Kolkata ML for MT 82

Results The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 83

Subjective Evaluation The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 84

Impact of sentence length (F Fluency AAdequacy E Errors)

Jan 6 2014 ISI Kolkata ML for MT 85

A feel for the improvement-baseline

Jan 6 2014 ISI Kolkata ML for MT 86

A feel for the improvement-reorder

Jan 6 2014 ISI Kolkata ML for MT 87

A feel for the improvement-Semantic relation

Jan 6 2014 ISI Kolkata ML for MT 88

A recent study

PAN Indian SMT

Jan 6 2014 ISI Kolkata ML for MT 89

Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator

SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi

Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English

Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences

Evaluation with BLEU METEOR scores also show high correlation with BLEU

Jan 6 2014 ISI Kolkata ML for MT 90

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

NLP Trinity

Algorithm

Problem

Language Hindi

Marathi

English

French Morph Analysis

Part of Speech Tagging

Parsing

Semantics

CRF

HMM

MEMM

NLP Trinity

Jan 6 2014 ISI Kolkata ML for MT 2

NLP Layer

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

Increased Complexity Of Processing

Jan 6 2014 ISI Kolkata ML for MT 3

Motivation for MT

MT NLP Complete

NLP AI complete

AI CS complete

How will the world be different when the language barrier disappears

Volume of text required to be translated currently exceeds translatorsrsquo capacity (demand gt supply)

Solution automation

Jan 6 2014 ISI Kolkata ML for MT 4

Taxonomy of MT systems

MT Approaches

Knowledge Based Rule Based MT

Data driven Machine Learning Based

Example Based MT (EBMT)

Statistical MT

Interlingua Based Transfer Based

Jan 6 2014 ISI Kolkata ML for MT 5

Why is MT difficult

Language divergence

Jan 6 2014 ISI Kolkata ML for MT 6

Why is MT difficult Language Divergence

One of the main complexities of MT Language Divergence

Languages have different ways of expressing meaning

Lexico-Semantic Divergence

Structural Divergence

Our work on English-IL Language Divergence with illustrations from Hindi (Dave Parikh Bhattacharyya Journal of MT 2002)

Jan 6 2014 ISI Kolkata ML for MT 7

Languages differ in expressing thoughts Agglutination

Finnish ldquoistahtaisinkohanrdquo

English I wonder if I should sit down for a whileldquo

Analysis

ist + sit verb stem

ahta + verb derivation morpheme to do something for a while

isi + conditional affix

n + 1st person singular suffix

ko + question particle

han a particle for things like reminder (with declaratives) or softening (with questions and imperatives)

Jan 6 2014 ISI Kolkata ML for MT 8

Language Divergence Theory Lexico-

Semantic Divergences (few examples)

Conflational divergence F vomir E to be sick E stab H chure se maaranaa (knife-with hit) S Utrymningsplan E escape plan

Categorial divergence

Change is in POS category

The play is on_PREP (vs The play is Sunday) Khel chal_rahaa_haai_VM (vs khel ravivaar ko haai)

Jan 6 2014 ISI Kolkata ML for MT 9

Language Divergence Theory Structural Divergences

SVOSOV E Peter plays basketball H piitar basketball kheltaa haai

Head swapping divergence E Prime Minister of India H bhaarat ke pradhaan mantrii (India-of Prime

Minister)

Jan 6 2014 ISI Kolkata ML for MT 10

Language Divergence Theory Syntactic

Divergences (few examples)

Constituent Order divergence

E Singh the PM of India will address the nation today

H bhaarat ke pradhaan mantrii singh hellip (India-of PM Singhhellip)

Adjunction Divergence

E She will visit here in the summer

H vah yahaa garmii meM aayegii (she here summer-in will come)

Preposition-Stranding divergence

E Who do you want to go with

H kisake saath aap jaanaa chaahate ho (who withhellip)

Jan 6 2014 ISI Kolkata ML for MT 11

Vauquois Triangle

Jan 6 2014 ISI Kolkata ML for MT 12

Kinds of MT Systems (point of entry from source to the target text)

Jan 6 2014 ISI Kolkata ML for MT 13

Illustration of transfer SVOSOV

S

NP VP

V N NP

N John eats

bread

S

NP VP

V N

John eats

NP

N

bread

(transfer svo sov)

Jan 6 2014 ISI Kolkata ML for MT 14

Universality hypothesis

Universality hypothesis At the

level of ldquodeep meaningrdquo all texts

are the ldquosamerdquo whatever the

language

Jan 6 2014 ISI Kolkata ML for MT 15

Understanding the Analysis-Transfer-Generation over Vauquois triangle (14)

H11 सरकार_न चनावो_क_बाद मबई म करो_क_माधयम_स

अपन राजसव_को बढ़ाया | T11 Sarkaar ne chunaawo ke baad Mumbai me karoM ke

maadhyam se apne raajaswa ko badhaayaa

G11 Government_(ergative) elections_after Mumbai_in

taxes_through its revenue_(accusative) increased

E11 The Government increased its revenue after the

elections through taxes in Mumbai

Jan 6 2014 ISI Kolkata ML for MT 16

Interlingual representation complete disambiguation

Washington voted Washington to power Vote

past

Washington power Washington

emphasis

ltis-a gt action

ltis-a gt place

ltis-a gt capability ltis-a gt hellip

ltis-a gt person

goal

Jan 6 2014 ISI Kolkata ML for MT 17

Kinds of disambiguation needed for a complete and correct interlingua graph

N Name

P POS

A Attachment

S Sense

C Co-reference

R Semantic Role

Jan 6 2014 ISI Kolkata ML for MT 18

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech Noun or Verb

Jan 6 2014 ISI Kolkata ML for MT 19

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech

NER

John is the name of a PERSON

Jan 6 2014 ISI Kolkata ML for MT 20

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech

NER

WSD Financial

bank or River bank

Jan 6 2014 ISI Kolkata ML for MT 21

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech

NER

WSD

Co-reference

ldquoitrdquo ldquobankrdquo

Jan 6 2014 ISI Kolkata ML for MT 22

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech

NER

WSD

Co-reference

Subject Drop

Pro drop (subject ldquoIrdquo)

Jan 6 2014 ISI Kolkata ML for MT 23

Typical NLP tools used

POS tagger

Stanford Named Entity Recognizer

Stanford Dependency Parser

XLE Dependency Parser

Lexical Resource

WordNet

Universal Word Dictionary (UW++)

Jan 6 2014 ISI Kolkata ML for MT 24

System Architecture Stanford

Dependency Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple Sentence Analyser

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

Simple Enco

Simple Enco

Simple Enco

Simple Enco

Simple Enco

Simplifier

Jan 6 2014 ISI Kolkata ML for MT 25

Target Sentence Generation from interlingua

Lexical Transfer

Target Sentence

Generation

Syntax

Planning

Morphological

Synthesis

(WordPhrase

Translation ) (Word form

Generation) (Sequence)

Jan 6 2014 ISI Kolkata ML for MT 26

Generation Architecture Deconversion = Transfer + Generation

Jan 6 2014 ISI Kolkata ML for MT 27

Transfer Based MT

Marathi-Hindi

Jan 6 2014 ISI Kolkata ML for MT 28

Indian Language to Indian Language Machine Translation (ILILMT)

Bidirectional Machine Translation System

Developed for nine Indian language pairs

Approach

Transfer based

Modules developed using both rule based and statistical approach

Jan 6 2014 ISI Kolkata ML for MT 29

Architecture of ILILMT System

Morphological Analyzer

Source Text

POS Tagger

Chunker

Vibhakti Computation

Name Entity Recognizer

Word Sense Disambiguatio

n

Lexical Transfer

Agreement Feature

Interchunk

Word Generator

Intrachunk

Target Text

Analysis

Transfer

Generation

Jan 6 2014 ISI Kolkata ML for MT 30

M-H MT system Evaluation

Subjective evaluation based on machine translation quality

Accuracy calculated based on score given by linguists

S5 Number of score 5 Sentences S4 Number of score 4 sentences

S3 Number of score 3 sentences N Total Number of sentences

Accuracy =

Score 5 Correct Translation

Score 4 Understandable with

minor errors

Score 3 Understandable with

major errors

Score 2 Not Understandable

Score 1 Non sense translation

Jan 6 2014 ISI Kolkata ML for MT 31

Evaluation of Marathi to Hindi MT System

0

02

04

06

08

1

12

MorphAnalyzer

POS Tagger Chunker VibhaktiCompute

WSD LexicalTransfer

WordGenerator

Precision

Recall

Module-wise precision and recall

Jan 6 2014 ISI Kolkata ML for MT 32

Evaluation of Marathi to Hindi MT System (cont)

Subjective evaluation on translation quality Evaluated on 500 web sentences Accuracy calculated based on score given according to the

translation quality Accuracy 6532

Result analysis

Morph POS tagger chunker gives more than 90 precision but Transfer WSD generator modules are below 80 hence degrades MT quality

Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Jan 6 2014 ISI Kolkata ML for MT 33

Statistical Machine Translation

Jan 6 2014 ISI Kolkata ML for MT 34

Czeck-English data

[nesu] ldquoI carryrdquo

[ponese] ldquoHe will carryrdquo

[nese] ldquoHe carriesrdquo

[nesou] ldquoThey carryrdquo

[yedu] ldquoI driverdquo

[plavou] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 35

To translate hellip

I will carry

They drive

He swims

They will drive

Jan 6 2014 ISI Kolkata ML for MT 36

Hindi-English data

[DhotA huM] ldquoI carryrdquo

[DhoegA] ldquoHe will carryrdquo

[DhotA hAi] ldquoHe carriesrdquo

[Dhote hAi] ldquoThey carryrdquo

[chalAtA huM] ldquoI driverdquo

[tErte hEM] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 37

Bangla-English data

[bai] ldquoI carryrdquo

[baibe] ldquoHe will carryrdquo

[bay] ldquoHe carriesrdquo

[bay] ldquoThey carryrdquo

[chAlAi] ldquoI driverdquo

[sAMtrAy] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 38

To translate hellip (repeated)

I will carry

They drive

He swims

They will drive

Jan 6 2014 ISI Kolkata ML for MT 39

Foundation

Data driven approach Goal is to find out the English sentence e

given foreign language sentence f whose p(e|f) is maximum

Translations are generated on the basis of statistical model

Parameters are estimated using bilingual parallel corpora

Jan 6 2014 ISI Kolkata ML for MT 40

SMT Language Model

To detect good English sentences

Probability of an English sentence w1w2 helliphellip wn can be written as

Pr(w1w2 helliphellip wn) = Pr(w1) Pr(w2|w1) Pr(wn|w1 w2 wn-1)

Here Pr(wn|w1 w2 wn-1) is the probability that word wn follows word string w1 w2 wn-1 N-gram model probability

Trigram model probability calculation

Jan 6 2014 ISI Kolkata ML for MT 41

SMT Translation Model

P(f|e) Probability of some f given hypothesis English translation e

How to assign the values to p(e|f)

Sentences are infinite not possible to find pair(ef) for all sentences

Introduce a hidden variable a that represents alignments between the individual words in the sentence pair

Sentence level

Word level

Jan 6 2014 ISI Kolkata ML for MT 42

Alignment

If the string e= e1l= e1 e2 hellipel has l words and

the string f= f1m=f1f2fm has m words

then the alignment a can be represented by a series a1

m= a1a2am of m values each between 0 and l such that if the word in position j of the f-string is connected to the word in position i of the e-string then aj= i and

if it is not connected to any English word then aj= O

Jan 6 2014 ISI Kolkata ML for MT 43

Example of alignment

English Ram went to school

Hindi Raama paathashaalaa gayaa

Ram went to school

ltNullgt Raama paathashaalaa gayaa

Jan 6 2014 ISI Kolkata ML for MT 44

Translation Model Exact expression

Five models for estimating parameters in the expression [2]

Model-1 Model-2 Model-3 Model-4 Model-5

Choose alignment given e and m

Choose the identity of foreign word given e m a

Choose the length of foreign language string given e

Jan 6 2014 ISI Kolkata ML for MT 45

a

eafef )|Pr()|Pr(

m

emafeaf )|Pr()|Pr(

m

emafememaf )|Pr()|Pr()|Pr(

m

emafem )|Pr()|Pr(

m

m

j

jj

jj emfaafem1

1

1

1

1 )|Pr()|Pr(

m

j

jj

j

jj

j

m

emfafemfaaem1

1

11

1

1

1

1 )|Pr()|Pr()|Pr(

)|Pr( emaf )|Pr( em

m

j

jj

j

jj

j emfafemfaa1

1

11

1

1

1

1 )|Pr()|Pr(

Proof of Translation Model Exact

expression

m is fixed for a particular f hence

marginalization

marginalization

Jan 6 2014 ISI Kolkata ML for MT 46

Alignment

Jan 6 2014 ISI Kolkata ML for MT 47

Fundamental and ubiquitous

Spell checking

Translation

Transliteration

Speech to text

Text to speeh

Jan 6 2014 ISI Kolkata ML for MT 48

EM for word alignment from

sentence alignment example

English

(1) three rabbits

a b

(2) rabbits of Grenoble

b c d

French

(1) trois lapins

w x

(2) lapins de Grenoble

x y z

Jan 6 2014 ISI Kolkata ML for MT 49

Initial Probabilities

each cell denotes t(a w) t(a x) etc

a b c d

w 14 14 14 14

x 14 14 14 14

y 14 14 14 14

z 14 14 14 14

The counts in IBM Model 1

Works by maximizing P(f|e) over the entire corpus

For IBM Model 1 we get the following relationship

c (w f |w e f e ) =t (w f |w e )

t (w f |we

0 ) +hellip + t (w f |wel )

c (w f |w e f e ) is the fractional count of the alignment of w f

with w e in f and e

t (w f |w e ) is the probability of w f being the translation of w e

is the count of w f in f

is the count of w e in e

Jan 6 2014 ISI Kolkata ML for MT 51

Example of expected count

C[aw (a b)(w x)]

t(aw)

= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)

t(aw)+t(ax)

14

= ----------------- X 1 X 1= 12

14+14

Jan 6 2014 ISI Kolkata ML for MT 52

ldquocountsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 13 13 13

y 0 13 13 13

z 0 13 13 13

a b

w x

a b c d

w 12 12 0 0

x 12 12 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 53

Revised probability example

trevised(a w)

12

= -------------------------------------------------------------------

(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)

Jan 6 2014 ISI Kolkata ML for MT 54

Revised probabilities table

a b c d

w 12 14 0 0

x 12 512 13 13

y 0 16 13 13

z 0 16 13 13

ldquorevised countsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 59 13 13

y 0 29 13 13

z 0 29 13 13

a b

w x

a b c d

w 12 38 0 0

x 12 58 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 56

Re-Revised probabilities table a b c d

w 12 316 0 0

x 12 85144 13 13

y 0 19 13 13

z 0 19 13 13

Continue until convergence notice that (bx) binding gets progressively stronger

b=rabbits x=lapins

Derivation of EM based Alignment Expressions

Hindi)(Say language ofy vocabular

English)(Say language ofry vocalbula

2

1

LV

LV

F

E

what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet

जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet

E1

F1

E2

F2

Jan 6 2014 ISI Kolkata ML for MT 58

Vocabulary mapping

Vocabulary VE VF

what is in a name that which we call rose by any other will smell as sweet

naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii

Jan 6 2014 ISI Kolkata ML for MT 59

Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like

11989011 11989012 hellip 119890

11198971 119891

11 11989112 hellip 119891

11198981

11989021 11989022 hellip 119890

21198972 119891

21 11989122 hellip 119891

21198982

1198901199041 119890

1199042 hellip 119890

119904119897119904 119891

1199041 1198911199042 hellip 119891

119904119898119904

1198901198781 119890

1198782 hellip 119890

119878119897119878 119891

1198781 1198911198782 hellip 119891

119878119898119878

No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890

119904119901 =Index of English word 119890119904119901in English vocabularydictionary

119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary

(Thanks to Sachin Pawar for helping with the maths formulae processing)

Jan 6 2014 ISI Kolkata ML for MT 60

Hidden variables and parameters

Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878

119904=1 where each hidden variable is

as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French

word 119911119901119902119904 = 0 otherwise

Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893

word in French vocabulary

Jan 6 2014 ISI Kolkata ML for MT 61

Likelihoods Data Likelihood L(D Θ)

Data Log-Likelihood LL(D Θ)

Expected value of Data Log-Likelihood E(LL(D Θ))

Jan 6 2014 ISI Kolkata ML for MT 62

Constraint and Lagrangian

119875119894119895

119881119865

119895=1

= 1 forall119894

Jan 6 2014 ISI Kolkata ML for MT 63

Differentiating wrt Pij

Jan 6 2014 ISI Kolkata ML for MT 64

Final E and M steps

M-step

E-step

Jan 6 2014 ISI Kolkata ML for MT 65

Combinatorial considerations

Jan 6 2014 ISI Kolkata ML for MT 66

Example

ISI Kolkata ML for MT Jan 6 2014 67

All possible alignments

ISI Kolkata ML for MT Jan 6 2014 68

First fundamental requirement of SMT

Alignment requires evidence of

bull firstly a translation pair to introduce the

POSSIBILITY of a mapping

bull then another pair to establish with

CERTAINTY the mapping

ISI Kolkata ML for MT Jan 6 2014 69

For the ldquocertaintyrdquo

We have a translation pair containing alignment candidates and none of the other words in the translation pair

OR

We have a translation pair containing all words in the translation pair except the alignment candidates

ISI Kolkata ML for MT Jan 6 2014 70

Thereforehellip

If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty

ISI Kolkata ML for MT Jan 6 2014 71

Rough estimate of data requirement

SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge

ie no meanings or grammatical properties of any words phrases or sentences

Each language has a vocabulary of 100000 words can give rise to about 500000 word forms

through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟

bdquowent‟ and bdquogone‟

ISI Kolkata ML for MT Jan 6 2014 72

Reasons for mapping to multiple words

Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register

polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)

syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood

ISI Kolkata ML for MT Jan 6 2014 73

Estimate of corpora requirement

Assume that on an average a sentence is 10 words long

an additional 9 translation pairs for getting at one of the 5 mappings

10 sentences per mapping per word

a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences

Estimate is not wide off the mark

Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs

ISI Kolkata ML for MT Jan 6 2014 74

Our work on factor based SMT

Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009

Jan 6 2014 ISI Kolkata ML for MT 75

Case Marker and Morphology crucial in E-H MT

Order of magnitiude facelift in Fluency and fidelity

Determined by the combination of suffixes and semantic relations on the English side

Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers

Jan 6 2014 ISI Kolkata ML for MT 76

Semantic relations+SuffixesCase

Markers+inflections

I ate mangoes

I ltagt ate eatpast mangoes ltobj

I ltagt mangoes ltobjpl eatpast

mei_ne aam khaa_yaa

Jan 6 2014 ISI Kolkata ML for MT 77

Our Approach

Factored model (Koehn and Hoang 2007) with the following translation factor

suffix + semantic relation case markersuffix

Experiments with the following relations

Dependency relations from the stanford parser

Deeper semantic roles from Universal Networking Language (UNL)

Jan 6 2014 ISI Kolkata ML for MT 78

Our Factorization

Jan 6 2014 ISI Kolkata ML for MT 79

Experiments

Jan 6 2014 ISI Kolkata ML for MT 80

Corpus Statistics

Jan 6 2014 ISI Kolkata ML for MT 81

Results The impact of suffix and semantic factors

Jan 6 2014 ISI Kolkata ML for MT 82

Results The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 83

Subjective Evaluation The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 84

Impact of sentence length (F Fluency AAdequacy E Errors)

Jan 6 2014 ISI Kolkata ML for MT 85

A feel for the improvement-baseline

Jan 6 2014 ISI Kolkata ML for MT 86

A feel for the improvement-reorder

Jan 6 2014 ISI Kolkata ML for MT 87

A feel for the improvement-Semantic relation

Jan 6 2014 ISI Kolkata ML for MT 88

A recent study

PAN Indian SMT

Jan 6 2014 ISI Kolkata ML for MT 89

Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator

SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi

Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English

Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences

Evaluation with BLEU METEOR scores also show high correlation with BLEU

Jan 6 2014 ISI Kolkata ML for MT 90

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

NLP Layer

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

Increased Complexity Of Processing

Jan 6 2014 ISI Kolkata ML for MT 3

Motivation for MT

MT NLP Complete

NLP AI complete

AI CS complete

How will the world be different when the language barrier disappears

Volume of text required to be translated currently exceeds translatorsrsquo capacity (demand gt supply)

Solution automation

Jan 6 2014 ISI Kolkata ML for MT 4

Taxonomy of MT systems

MT Approaches

Knowledge Based Rule Based MT

Data driven Machine Learning Based

Example Based MT (EBMT)

Statistical MT

Interlingua Based Transfer Based

Jan 6 2014 ISI Kolkata ML for MT 5

Why is MT difficult

Language divergence

Jan 6 2014 ISI Kolkata ML for MT 6

Why is MT difficult Language Divergence

One of the main complexities of MT Language Divergence

Languages have different ways of expressing meaning

Lexico-Semantic Divergence

Structural Divergence

Our work on English-IL Language Divergence with illustrations from Hindi (Dave Parikh Bhattacharyya Journal of MT 2002)

Jan 6 2014 ISI Kolkata ML for MT 7

Languages differ in expressing thoughts Agglutination

Finnish ldquoistahtaisinkohanrdquo

English I wonder if I should sit down for a whileldquo

Analysis

ist + sit verb stem

ahta + verb derivation morpheme to do something for a while

isi + conditional affix

n + 1st person singular suffix

ko + question particle

han a particle for things like reminder (with declaratives) or softening (with questions and imperatives)

Jan 6 2014 ISI Kolkata ML for MT 8

Language Divergence Theory Lexico-

Semantic Divergences (few examples)

Conflational divergence F vomir E to be sick E stab H chure se maaranaa (knife-with hit) S Utrymningsplan E escape plan

Categorial divergence

Change is in POS category

The play is on_PREP (vs The play is Sunday) Khel chal_rahaa_haai_VM (vs khel ravivaar ko haai)

Jan 6 2014 ISI Kolkata ML for MT 9

Language Divergence Theory Structural Divergences

SVOSOV E Peter plays basketball H piitar basketball kheltaa haai

Head swapping divergence E Prime Minister of India H bhaarat ke pradhaan mantrii (India-of Prime

Minister)

Jan 6 2014 ISI Kolkata ML for MT 10

Language Divergence Theory Syntactic

Divergences (few examples)

Constituent Order divergence

E Singh the PM of India will address the nation today

H bhaarat ke pradhaan mantrii singh hellip (India-of PM Singhhellip)

Adjunction Divergence

E She will visit here in the summer

H vah yahaa garmii meM aayegii (she here summer-in will come)

Preposition-Stranding divergence

E Who do you want to go with

H kisake saath aap jaanaa chaahate ho (who withhellip)

Jan 6 2014 ISI Kolkata ML for MT 11

Vauquois Triangle

Jan 6 2014 ISI Kolkata ML for MT 12

Kinds of MT Systems (point of entry from source to the target text)

Jan 6 2014 ISI Kolkata ML for MT 13

Illustration of transfer SVOSOV

S

NP VP

V N NP

N John eats

bread

S

NP VP

V N

John eats

NP

N

bread

(transfer svo sov)

Jan 6 2014 ISI Kolkata ML for MT 14

Universality hypothesis

Universality hypothesis At the

level of ldquodeep meaningrdquo all texts

are the ldquosamerdquo whatever the

language

Jan 6 2014 ISI Kolkata ML for MT 15

Understanding the Analysis-Transfer-Generation over Vauquois triangle (14)

H11 सरकार_न चनावो_क_बाद मबई म करो_क_माधयम_स

अपन राजसव_को बढ़ाया | T11 Sarkaar ne chunaawo ke baad Mumbai me karoM ke

maadhyam se apne raajaswa ko badhaayaa

G11 Government_(ergative) elections_after Mumbai_in

taxes_through its revenue_(accusative) increased

E11 The Government increased its revenue after the

elections through taxes in Mumbai

Jan 6 2014 ISI Kolkata ML for MT 16

Interlingual representation complete disambiguation

Washington voted Washington to power Vote

past

Washington power Washington

emphasis

ltis-a gt action

ltis-a gt place

ltis-a gt capability ltis-a gt hellip

ltis-a gt person

goal

Jan 6 2014 ISI Kolkata ML for MT 17

Kinds of disambiguation needed for a complete and correct interlingua graph

N Name

P POS

A Attachment

S Sense

C Co-reference

R Semantic Role

Jan 6 2014 ISI Kolkata ML for MT 18

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech Noun or Verb

Jan 6 2014 ISI Kolkata ML for MT 19

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech

NER

John is the name of a PERSON

Jan 6 2014 ISI Kolkata ML for MT 20

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech

NER

WSD Financial

bank or River bank

Jan 6 2014 ISI Kolkata ML for MT 21

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech

NER

WSD

Co-reference

ldquoitrdquo ldquobankrdquo

Jan 6 2014 ISI Kolkata ML for MT 22

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech

NER

WSD

Co-reference

Subject Drop

Pro drop (subject ldquoIrdquo)

Jan 6 2014 ISI Kolkata ML for MT 23

Typical NLP tools used

POS tagger

Stanford Named Entity Recognizer

Stanford Dependency Parser

XLE Dependency Parser

Lexical Resource

WordNet

Universal Word Dictionary (UW++)

Jan 6 2014 ISI Kolkata ML for MT 24

System Architecture Stanford

Dependency Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple Sentence Analyser

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

Simple Enco

Simple Enco

Simple Enco

Simple Enco

Simple Enco

Simplifier

Jan 6 2014 ISI Kolkata ML for MT 25

Target Sentence Generation from interlingua

Lexical Transfer

Target Sentence

Generation

Syntax

Planning

Morphological

Synthesis

(WordPhrase

Translation ) (Word form

Generation) (Sequence)

Jan 6 2014 ISI Kolkata ML for MT 26

Generation Architecture Deconversion = Transfer + Generation

Jan 6 2014 ISI Kolkata ML for MT 27

Transfer Based MT

Marathi-Hindi

Jan 6 2014 ISI Kolkata ML for MT 28

Indian Language to Indian Language Machine Translation (ILILMT)

Bidirectional Machine Translation System

Developed for nine Indian language pairs

Approach

Transfer based

Modules developed using both rule based and statistical approach

Jan 6 2014 ISI Kolkata ML for MT 29

Architecture of ILILMT System

Morphological Analyzer

Source Text

POS Tagger

Chunker

Vibhakti Computation

Name Entity Recognizer

Word Sense Disambiguatio

n

Lexical Transfer

Agreement Feature

Interchunk

Word Generator

Intrachunk

Target Text

Analysis

Transfer

Generation

Jan 6 2014 ISI Kolkata ML for MT 30

M-H MT system Evaluation

Subjective evaluation based on machine translation quality

Accuracy calculated based on score given by linguists

S5 Number of score 5 Sentences S4 Number of score 4 sentences

S3 Number of score 3 sentences N Total Number of sentences

Accuracy =

Score 5 Correct Translation

Score 4 Understandable with

minor errors

Score 3 Understandable with

major errors

Score 2 Not Understandable

Score 1 Non sense translation

Jan 6 2014 ISI Kolkata ML for MT 31

Evaluation of Marathi to Hindi MT System

0

02

04

06

08

1

12

MorphAnalyzer

POS Tagger Chunker VibhaktiCompute

WSD LexicalTransfer

WordGenerator

Precision

Recall

Module-wise precision and recall

Jan 6 2014 ISI Kolkata ML for MT 32

Evaluation of Marathi to Hindi MT System (cont)

Subjective evaluation on translation quality Evaluated on 500 web sentences Accuracy calculated based on score given according to the

translation quality Accuracy 6532

Result analysis

Morph POS tagger chunker gives more than 90 precision but Transfer WSD generator modules are below 80 hence degrades MT quality

Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Jan 6 2014 ISI Kolkata ML for MT 33

Statistical Machine Translation

Jan 6 2014 ISI Kolkata ML for MT 34

Czeck-English data

[nesu] ldquoI carryrdquo

[ponese] ldquoHe will carryrdquo

[nese] ldquoHe carriesrdquo

[nesou] ldquoThey carryrdquo

[yedu] ldquoI driverdquo

[plavou] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 35

To translate hellip

I will carry

They drive

He swims

They will drive

Jan 6 2014 ISI Kolkata ML for MT 36

Hindi-English data

[DhotA huM] ldquoI carryrdquo

[DhoegA] ldquoHe will carryrdquo

[DhotA hAi] ldquoHe carriesrdquo

[Dhote hAi] ldquoThey carryrdquo

[chalAtA huM] ldquoI driverdquo

[tErte hEM] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 37

Bangla-English data

[bai] ldquoI carryrdquo

[baibe] ldquoHe will carryrdquo

[bay] ldquoHe carriesrdquo

[bay] ldquoThey carryrdquo

[chAlAi] ldquoI driverdquo

[sAMtrAy] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 38

To translate hellip (repeated)

I will carry

They drive

He swims

They will drive

Jan 6 2014 ISI Kolkata ML for MT 39

Foundation

Data driven approach Goal is to find out the English sentence e

given foreign language sentence f whose p(e|f) is maximum

Translations are generated on the basis of statistical model

Parameters are estimated using bilingual parallel corpora

Jan 6 2014 ISI Kolkata ML for MT 40

SMT Language Model

To detect good English sentences

Probability of an English sentence w1w2 helliphellip wn can be written as

Pr(w1w2 helliphellip wn) = Pr(w1) Pr(w2|w1) Pr(wn|w1 w2 wn-1)

Here Pr(wn|w1 w2 wn-1) is the probability that word wn follows word string w1 w2 wn-1 N-gram model probability

Trigram model probability calculation

Jan 6 2014 ISI Kolkata ML for MT 41

SMT Translation Model

P(f|e) Probability of some f given hypothesis English translation e

How to assign the values to p(e|f)

Sentences are infinite not possible to find pair(ef) for all sentences

Introduce a hidden variable a that represents alignments between the individual words in the sentence pair

Sentence level

Word level

Jan 6 2014 ISI Kolkata ML for MT 42

Alignment

If the string e= e1l= e1 e2 hellipel has l words and

the string f= f1m=f1f2fm has m words

then the alignment a can be represented by a series a1

m= a1a2am of m values each between 0 and l such that if the word in position j of the f-string is connected to the word in position i of the e-string then aj= i and

if it is not connected to any English word then aj= O

Jan 6 2014 ISI Kolkata ML for MT 43

Example of alignment

English Ram went to school

Hindi Raama paathashaalaa gayaa

Ram went to school

ltNullgt Raama paathashaalaa gayaa

Jan 6 2014 ISI Kolkata ML for MT 44

Translation Model Exact expression

Five models for estimating parameters in the expression [2]

Model-1 Model-2 Model-3 Model-4 Model-5

Choose alignment given e and m

Choose the identity of foreign word given e m a

Choose the length of foreign language string given e

Jan 6 2014 ISI Kolkata ML for MT 45

a

eafef )|Pr()|Pr(

m

emafeaf )|Pr()|Pr(

m

emafememaf )|Pr()|Pr()|Pr(

m

emafem )|Pr()|Pr(

m

m

j

jj

jj emfaafem1

1

1

1

1 )|Pr()|Pr(

m

j

jj

j

jj

j

m

emfafemfaaem1

1

11

1

1

1

1 )|Pr()|Pr()|Pr(

)|Pr( emaf )|Pr( em

m

j

jj

j

jj

j emfafemfaa1

1

11

1

1

1

1 )|Pr()|Pr(

Proof of Translation Model Exact

expression

m is fixed for a particular f hence

marginalization

marginalization

Jan 6 2014 ISI Kolkata ML for MT 46

Alignment

Jan 6 2014 ISI Kolkata ML for MT 47

Fundamental and ubiquitous

Spell checking

Translation

Transliteration

Speech to text

Text to speeh

Jan 6 2014 ISI Kolkata ML for MT 48

EM for word alignment from

sentence alignment example

English

(1) three rabbits

a b

(2) rabbits of Grenoble

b c d

French

(1) trois lapins

w x

(2) lapins de Grenoble

x y z

Jan 6 2014 ISI Kolkata ML for MT 49

Initial Probabilities

each cell denotes t(a w) t(a x) etc

a b c d

w 14 14 14 14

x 14 14 14 14

y 14 14 14 14

z 14 14 14 14

The counts in IBM Model 1

Works by maximizing P(f|e) over the entire corpus

For IBM Model 1 we get the following relationship

c (w f |w e f e ) =t (w f |w e )

t (w f |we

0 ) +hellip + t (w f |wel )

c (w f |w e f e ) is the fractional count of the alignment of w f

with w e in f and e

t (w f |w e ) is the probability of w f being the translation of w e

is the count of w f in f

is the count of w e in e

Jan 6 2014 ISI Kolkata ML for MT 51

Example of expected count

C[aw (a b)(w x)]

t(aw)

= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)

t(aw)+t(ax)

14

= ----------------- X 1 X 1= 12

14+14

Jan 6 2014 ISI Kolkata ML for MT 52

ldquocountsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 13 13 13

y 0 13 13 13

z 0 13 13 13

a b

w x

a b c d

w 12 12 0 0

x 12 12 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 53

Revised probability example

trevised(a w)

12

= -------------------------------------------------------------------

(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)

Jan 6 2014 ISI Kolkata ML for MT 54

Revised probabilities table

a b c d

w 12 14 0 0

x 12 512 13 13

y 0 16 13 13

z 0 16 13 13

ldquorevised countsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 59 13 13

y 0 29 13 13

z 0 29 13 13

a b

w x

a b c d

w 12 38 0 0

x 12 58 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 56

Re-Revised probabilities table a b c d

w 12 316 0 0

x 12 85144 13 13

y 0 19 13 13

z 0 19 13 13

Continue until convergence notice that (bx) binding gets progressively stronger

b=rabbits x=lapins

Derivation of EM based Alignment Expressions

Hindi)(Say language ofy vocabular

English)(Say language ofry vocalbula

2

1

LV

LV

F

E

what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet

जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet

E1

F1

E2

F2

Jan 6 2014 ISI Kolkata ML for MT 58

Vocabulary mapping

Vocabulary VE VF

what is in a name that which we call rose by any other will smell as sweet

naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii

Jan 6 2014 ISI Kolkata ML for MT 59

Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like

11989011 11989012 hellip 119890

11198971 119891

11 11989112 hellip 119891

11198981

11989021 11989022 hellip 119890

21198972 119891

21 11989122 hellip 119891

21198982

1198901199041 119890

1199042 hellip 119890

119904119897119904 119891

1199041 1198911199042 hellip 119891

119904119898119904

1198901198781 119890

1198782 hellip 119890

119878119897119878 119891

1198781 1198911198782 hellip 119891

119878119898119878

No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890

119904119901 =Index of English word 119890119904119901in English vocabularydictionary

119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary

(Thanks to Sachin Pawar for helping with the maths formulae processing)

Jan 6 2014 ISI Kolkata ML for MT 60

Hidden variables and parameters

Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878

119904=1 where each hidden variable is

as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French

word 119911119901119902119904 = 0 otherwise

Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893

word in French vocabulary

Jan 6 2014 ISI Kolkata ML for MT 61

Likelihoods Data Likelihood L(D Θ)

Data Log-Likelihood LL(D Θ)

Expected value of Data Log-Likelihood E(LL(D Θ))

Jan 6 2014 ISI Kolkata ML for MT 62

Constraint and Lagrangian

119875119894119895

119881119865

119895=1

= 1 forall119894

Jan 6 2014 ISI Kolkata ML for MT 63

Differentiating wrt Pij

Jan 6 2014 ISI Kolkata ML for MT 64

Final E and M steps

M-step

E-step

Jan 6 2014 ISI Kolkata ML for MT 65

Combinatorial considerations

Jan 6 2014 ISI Kolkata ML for MT 66

Example

ISI Kolkata ML for MT Jan 6 2014 67

All possible alignments

ISI Kolkata ML for MT Jan 6 2014 68

First fundamental requirement of SMT

Alignment requires evidence of

bull firstly a translation pair to introduce the

POSSIBILITY of a mapping

bull then another pair to establish with

CERTAINTY the mapping

ISI Kolkata ML for MT Jan 6 2014 69

For the ldquocertaintyrdquo

We have a translation pair containing alignment candidates and none of the other words in the translation pair

OR

We have a translation pair containing all words in the translation pair except the alignment candidates

ISI Kolkata ML for MT Jan 6 2014 70

Thereforehellip

If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty

ISI Kolkata ML for MT Jan 6 2014 71

Rough estimate of data requirement

SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge

ie no meanings or grammatical properties of any words phrases or sentences

Each language has a vocabulary of 100000 words can give rise to about 500000 word forms

through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟

bdquowent‟ and bdquogone‟

ISI Kolkata ML for MT Jan 6 2014 72

Reasons for mapping to multiple words

Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register

polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)

syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood

ISI Kolkata ML for MT Jan 6 2014 73

Estimate of corpora requirement

Assume that on an average a sentence is 10 words long

an additional 9 translation pairs for getting at one of the 5 mappings

10 sentences per mapping per word

a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences

Estimate is not wide off the mark

Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs

ISI Kolkata ML for MT Jan 6 2014 74

Our work on factor based SMT

Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009

Jan 6 2014 ISI Kolkata ML for MT 75

Case Marker and Morphology crucial in E-H MT

Order of magnitiude facelift in Fluency and fidelity

Determined by the combination of suffixes and semantic relations on the English side

Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers

Jan 6 2014 ISI Kolkata ML for MT 76

Semantic relations+SuffixesCase

Markers+inflections

I ate mangoes

I ltagt ate eatpast mangoes ltobj

I ltagt mangoes ltobjpl eatpast

mei_ne aam khaa_yaa

Jan 6 2014 ISI Kolkata ML for MT 77

Our Approach

Factored model (Koehn and Hoang 2007) with the following translation factor

suffix + semantic relation case markersuffix

Experiments with the following relations

Dependency relations from the stanford parser

Deeper semantic roles from Universal Networking Language (UNL)

Jan 6 2014 ISI Kolkata ML for MT 78

Our Factorization

Jan 6 2014 ISI Kolkata ML for MT 79

Experiments

Jan 6 2014 ISI Kolkata ML for MT 80

Corpus Statistics

Jan 6 2014 ISI Kolkata ML for MT 81

Results The impact of suffix and semantic factors

Jan 6 2014 ISI Kolkata ML for MT 82

Results The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 83

Subjective Evaluation The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 84

Impact of sentence length (F Fluency AAdequacy E Errors)

Jan 6 2014 ISI Kolkata ML for MT 85

A feel for the improvement-baseline

Jan 6 2014 ISI Kolkata ML for MT 86

A feel for the improvement-reorder

Jan 6 2014 ISI Kolkata ML for MT 87

A feel for the improvement-Semantic relation

Jan 6 2014 ISI Kolkata ML for MT 88

A recent study

PAN Indian SMT

Jan 6 2014 ISI Kolkata ML for MT 89

Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator

SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi

Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English

Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences

Evaluation with BLEU METEOR scores also show high correlation with BLEU

Jan 6 2014 ISI Kolkata ML for MT 90

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Motivation for MT

MT NLP Complete

NLP AI complete

AI CS complete

How will the world be different when the language barrier disappears

Volume of text required to be translated currently exceeds translatorsrsquo capacity (demand gt supply)

Solution automation

Jan 6 2014 ISI Kolkata ML for MT 4

Taxonomy of MT systems

MT Approaches

Knowledge Based Rule Based MT

Data driven Machine Learning Based

Example Based MT (EBMT)

Statistical MT

Interlingua Based Transfer Based

Jan 6 2014 ISI Kolkata ML for MT 5

Why is MT difficult

Language divergence

Jan 6 2014 ISI Kolkata ML for MT 6

Why is MT difficult Language Divergence

One of the main complexities of MT Language Divergence

Languages have different ways of expressing meaning

Lexico-Semantic Divergence

Structural Divergence

Our work on English-IL Language Divergence with illustrations from Hindi (Dave Parikh Bhattacharyya Journal of MT 2002)

Jan 6 2014 ISI Kolkata ML for MT 7

Languages differ in expressing thoughts Agglutination

Finnish ldquoistahtaisinkohanrdquo

English I wonder if I should sit down for a whileldquo

Analysis

ist + sit verb stem

ahta + verb derivation morpheme to do something for a while

isi + conditional affix

n + 1st person singular suffix

ko + question particle

han a particle for things like reminder (with declaratives) or softening (with questions and imperatives)

Jan 6 2014 ISI Kolkata ML for MT 8

Language Divergence Theory Lexico-

Semantic Divergences (few examples)

Conflational divergence F vomir E to be sick E stab H chure se maaranaa (knife-with hit) S Utrymningsplan E escape plan

Categorial divergence

Change is in POS category

The play is on_PREP (vs The play is Sunday) Khel chal_rahaa_haai_VM (vs khel ravivaar ko haai)

Jan 6 2014 ISI Kolkata ML for MT 9

Language Divergence Theory Structural Divergences

SVOSOV E Peter plays basketball H piitar basketball kheltaa haai

Head swapping divergence E Prime Minister of India H bhaarat ke pradhaan mantrii (India-of Prime

Minister)

Jan 6 2014 ISI Kolkata ML for MT 10

Language Divergence Theory Syntactic

Divergences (few examples)

Constituent Order divergence

E Singh the PM of India will address the nation today

H bhaarat ke pradhaan mantrii singh hellip (India-of PM Singhhellip)

Adjunction Divergence

E She will visit here in the summer

H vah yahaa garmii meM aayegii (she here summer-in will come)

Preposition-Stranding divergence

E Who do you want to go with

H kisake saath aap jaanaa chaahate ho (who withhellip)

Jan 6 2014 ISI Kolkata ML for MT 11

Vauquois Triangle

Jan 6 2014 ISI Kolkata ML for MT 12

Kinds of MT Systems (point of entry from source to the target text)

Jan 6 2014 ISI Kolkata ML for MT 13

Illustration of transfer SVOSOV

S

NP VP

V N NP

N John eats

bread

S

NP VP

V N

John eats

NP

N

bread

(transfer svo sov)

Jan 6 2014 ISI Kolkata ML for MT 14

Universality hypothesis

Universality hypothesis At the

level of ldquodeep meaningrdquo all texts

are the ldquosamerdquo whatever the

language

Jan 6 2014 ISI Kolkata ML for MT 15

Understanding the Analysis-Transfer-Generation over Vauquois triangle (14)

H11 सरकार_न चनावो_क_बाद मबई म करो_क_माधयम_स

अपन राजसव_को बढ़ाया | T11 Sarkaar ne chunaawo ke baad Mumbai me karoM ke

maadhyam se apne raajaswa ko badhaayaa

G11 Government_(ergative) elections_after Mumbai_in

taxes_through its revenue_(accusative) increased

E11 The Government increased its revenue after the

elections through taxes in Mumbai

Jan 6 2014 ISI Kolkata ML for MT 16

Interlingual representation complete disambiguation

Washington voted Washington to power Vote

past

Washington power Washington

emphasis

ltis-a gt action

ltis-a gt place

ltis-a gt capability ltis-a gt hellip

ltis-a gt person

goal

Jan 6 2014 ISI Kolkata ML for MT 17

Kinds of disambiguation needed for a complete and correct interlingua graph

N Name

P POS

A Attachment

S Sense

C Co-reference

R Semantic Role

Jan 6 2014 ISI Kolkata ML for MT 18

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech Noun or Verb

Jan 6 2014 ISI Kolkata ML for MT 19

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech

NER

John is the name of a PERSON

Jan 6 2014 ISI Kolkata ML for MT 20

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech

NER

WSD Financial

bank or River bank

Jan 6 2014 ISI Kolkata ML for MT 21

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech

NER

WSD

Co-reference

ldquoitrdquo ldquobankrdquo

Jan 6 2014 ISI Kolkata ML for MT 22

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech

NER

WSD

Co-reference

Subject Drop

Pro drop (subject ldquoIrdquo)

Jan 6 2014 ISI Kolkata ML for MT 23

Typical NLP tools used

POS tagger

Stanford Named Entity Recognizer

Stanford Dependency Parser

XLE Dependency Parser

Lexical Resource

WordNet

Universal Word Dictionary (UW++)

Jan 6 2014 ISI Kolkata ML for MT 24

System Architecture Stanford

Dependency Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple Sentence Analyser

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

Simple Enco

Simple Enco

Simple Enco

Simple Enco

Simple Enco

Simplifier

Jan 6 2014 ISI Kolkata ML for MT 25

Target Sentence Generation from interlingua

Lexical Transfer

Target Sentence

Generation

Syntax

Planning

Morphological

Synthesis

(WordPhrase

Translation ) (Word form

Generation) (Sequence)

Jan 6 2014 ISI Kolkata ML for MT 26

Generation Architecture Deconversion = Transfer + Generation

Jan 6 2014 ISI Kolkata ML for MT 27

Transfer Based MT

Marathi-Hindi

Jan 6 2014 ISI Kolkata ML for MT 28

Indian Language to Indian Language Machine Translation (ILILMT)

Bidirectional Machine Translation System

Developed for nine Indian language pairs

Approach

Transfer based

Modules developed using both rule based and statistical approach

Jan 6 2014 ISI Kolkata ML for MT 29

Architecture of ILILMT System

Morphological Analyzer

Source Text

POS Tagger

Chunker

Vibhakti Computation

Name Entity Recognizer

Word Sense Disambiguatio

n

Lexical Transfer

Agreement Feature

Interchunk

Word Generator

Intrachunk

Target Text

Analysis

Transfer

Generation

Jan 6 2014 ISI Kolkata ML for MT 30

M-H MT system Evaluation

Subjective evaluation based on machine translation quality

Accuracy calculated based on score given by linguists

S5 Number of score 5 Sentences S4 Number of score 4 sentences

S3 Number of score 3 sentences N Total Number of sentences

Accuracy =

Score 5 Correct Translation

Score 4 Understandable with

minor errors

Score 3 Understandable with

major errors

Score 2 Not Understandable

Score 1 Non sense translation

Jan 6 2014 ISI Kolkata ML for MT 31

Evaluation of Marathi to Hindi MT System

0

02

04

06

08

1

12

MorphAnalyzer

POS Tagger Chunker VibhaktiCompute

WSD LexicalTransfer

WordGenerator

Precision

Recall

Module-wise precision and recall

Jan 6 2014 ISI Kolkata ML for MT 32

Evaluation of Marathi to Hindi MT System (cont)

Subjective evaluation on translation quality Evaluated on 500 web sentences Accuracy calculated based on score given according to the

translation quality Accuracy 6532

Result analysis

Morph POS tagger chunker gives more than 90 precision but Transfer WSD generator modules are below 80 hence degrades MT quality

Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Jan 6 2014 ISI Kolkata ML for MT 33

Statistical Machine Translation

Jan 6 2014 ISI Kolkata ML for MT 34

Czeck-English data

[nesu] ldquoI carryrdquo

[ponese] ldquoHe will carryrdquo

[nese] ldquoHe carriesrdquo

[nesou] ldquoThey carryrdquo

[yedu] ldquoI driverdquo

[plavou] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 35

To translate hellip

I will carry

They drive

He swims

They will drive

Jan 6 2014 ISI Kolkata ML for MT 36

Hindi-English data

[DhotA huM] ldquoI carryrdquo

[DhoegA] ldquoHe will carryrdquo

[DhotA hAi] ldquoHe carriesrdquo

[Dhote hAi] ldquoThey carryrdquo

[chalAtA huM] ldquoI driverdquo

[tErte hEM] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 37

Bangla-English data

[bai] ldquoI carryrdquo

[baibe] ldquoHe will carryrdquo

[bay] ldquoHe carriesrdquo

[bay] ldquoThey carryrdquo

[chAlAi] ldquoI driverdquo

[sAMtrAy] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 38

To translate hellip (repeated)

I will carry

They drive

He swims

They will drive

Jan 6 2014 ISI Kolkata ML for MT 39

Foundation

Data driven approach Goal is to find out the English sentence e

given foreign language sentence f whose p(e|f) is maximum

Translations are generated on the basis of statistical model

Parameters are estimated using bilingual parallel corpora

Jan 6 2014 ISI Kolkata ML for MT 40

SMT Language Model

To detect good English sentences

Probability of an English sentence w1w2 helliphellip wn can be written as

Pr(w1w2 helliphellip wn) = Pr(w1) Pr(w2|w1) Pr(wn|w1 w2 wn-1)

Here Pr(wn|w1 w2 wn-1) is the probability that word wn follows word string w1 w2 wn-1 N-gram model probability

Trigram model probability calculation

Jan 6 2014 ISI Kolkata ML for MT 41

SMT Translation Model

P(f|e) Probability of some f given hypothesis English translation e

How to assign the values to p(e|f)

Sentences are infinite not possible to find pair(ef) for all sentences

Introduce a hidden variable a that represents alignments between the individual words in the sentence pair

Sentence level

Word level

Jan 6 2014 ISI Kolkata ML for MT 42

Alignment

If the string e= e1l= e1 e2 hellipel has l words and

the string f= f1m=f1f2fm has m words

then the alignment a can be represented by a series a1

m= a1a2am of m values each between 0 and l such that if the word in position j of the f-string is connected to the word in position i of the e-string then aj= i and

if it is not connected to any English word then aj= O

Jan 6 2014 ISI Kolkata ML for MT 43

Example of alignment

English Ram went to school

Hindi Raama paathashaalaa gayaa

Ram went to school

ltNullgt Raama paathashaalaa gayaa

Jan 6 2014 ISI Kolkata ML for MT 44

Translation Model Exact expression

Five models for estimating parameters in the expression [2]

Model-1 Model-2 Model-3 Model-4 Model-5

Choose alignment given e and m

Choose the identity of foreign word given e m a

Choose the length of foreign language string given e

Jan 6 2014 ISI Kolkata ML for MT 45

a

eafef )|Pr()|Pr(

m

emafeaf )|Pr()|Pr(

m

emafememaf )|Pr()|Pr()|Pr(

m

emafem )|Pr()|Pr(

m

m

j

jj

jj emfaafem1

1

1

1

1 )|Pr()|Pr(

m

j

jj

j

jj

j

m

emfafemfaaem1

1

11

1

1

1

1 )|Pr()|Pr()|Pr(

)|Pr( emaf )|Pr( em

m

j

jj

j

jj

j emfafemfaa1

1

11

1

1

1

1 )|Pr()|Pr(

Proof of Translation Model Exact

expression

m is fixed for a particular f hence

marginalization

marginalization

Jan 6 2014 ISI Kolkata ML for MT 46

Alignment

Jan 6 2014 ISI Kolkata ML for MT 47

Fundamental and ubiquitous

Spell checking

Translation

Transliteration

Speech to text

Text to speeh

Jan 6 2014 ISI Kolkata ML for MT 48

EM for word alignment from

sentence alignment example

English

(1) three rabbits

a b

(2) rabbits of Grenoble

b c d

French

(1) trois lapins

w x

(2) lapins de Grenoble

x y z

Jan 6 2014 ISI Kolkata ML for MT 49

Initial Probabilities

each cell denotes t(a w) t(a x) etc

a b c d

w 14 14 14 14

x 14 14 14 14

y 14 14 14 14

z 14 14 14 14

The counts in IBM Model 1

Works by maximizing P(f|e) over the entire corpus

For IBM Model 1 we get the following relationship

c (w f |w e f e ) =t (w f |w e )

t (w f |we

0 ) +hellip + t (w f |wel )

c (w f |w e f e ) is the fractional count of the alignment of w f

with w e in f and e

t (w f |w e ) is the probability of w f being the translation of w e

is the count of w f in f

is the count of w e in e

Jan 6 2014 ISI Kolkata ML for MT 51

Example of expected count

C[aw (a b)(w x)]

t(aw)

= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)

t(aw)+t(ax)

14

= ----------------- X 1 X 1= 12

14+14

Jan 6 2014 ISI Kolkata ML for MT 52

ldquocountsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 13 13 13

y 0 13 13 13

z 0 13 13 13

a b

w x

a b c d

w 12 12 0 0

x 12 12 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 53

Revised probability example

trevised(a w)

12

= -------------------------------------------------------------------

(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)

Jan 6 2014 ISI Kolkata ML for MT 54

Revised probabilities table

a b c d

w 12 14 0 0

x 12 512 13 13

y 0 16 13 13

z 0 16 13 13

ldquorevised countsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 59 13 13

y 0 29 13 13

z 0 29 13 13

a b

w x

a b c d

w 12 38 0 0

x 12 58 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 56

Re-Revised probabilities table a b c d

w 12 316 0 0

x 12 85144 13 13

y 0 19 13 13

z 0 19 13 13

Continue until convergence notice that (bx) binding gets progressively stronger

b=rabbits x=lapins

Derivation of EM based Alignment Expressions

Hindi)(Say language ofy vocabular

English)(Say language ofry vocalbula

2

1

LV

LV

F

E

what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet

जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet

E1

F1

E2

F2

Jan 6 2014 ISI Kolkata ML for MT 58

Vocabulary mapping

Vocabulary VE VF

what is in a name that which we call rose by any other will smell as sweet

naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii

Jan 6 2014 ISI Kolkata ML for MT 59

Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like

11989011 11989012 hellip 119890

11198971 119891

11 11989112 hellip 119891

11198981

11989021 11989022 hellip 119890

21198972 119891

21 11989122 hellip 119891

21198982

1198901199041 119890

1199042 hellip 119890

119904119897119904 119891

1199041 1198911199042 hellip 119891

119904119898119904

1198901198781 119890

1198782 hellip 119890

119878119897119878 119891

1198781 1198911198782 hellip 119891

119878119898119878

No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890

119904119901 =Index of English word 119890119904119901in English vocabularydictionary

119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary

(Thanks to Sachin Pawar for helping with the maths formulae processing)

Jan 6 2014 ISI Kolkata ML for MT 60

Hidden variables and parameters

Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878

119904=1 where each hidden variable is

as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French

word 119911119901119902119904 = 0 otherwise

Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893

word in French vocabulary

Jan 6 2014 ISI Kolkata ML for MT 61

Likelihoods Data Likelihood L(D Θ)

Data Log-Likelihood LL(D Θ)

Expected value of Data Log-Likelihood E(LL(D Θ))

Jan 6 2014 ISI Kolkata ML for MT 62

Constraint and Lagrangian

119875119894119895

119881119865

119895=1

= 1 forall119894

Jan 6 2014 ISI Kolkata ML for MT 63

Differentiating wrt Pij

Jan 6 2014 ISI Kolkata ML for MT 64

Final E and M steps

M-step

E-step

Jan 6 2014 ISI Kolkata ML for MT 65

Combinatorial considerations

Jan 6 2014 ISI Kolkata ML for MT 66

Example

ISI Kolkata ML for MT Jan 6 2014 67

All possible alignments

ISI Kolkata ML for MT Jan 6 2014 68

First fundamental requirement of SMT

Alignment requires evidence of

bull firstly a translation pair to introduce the

POSSIBILITY of a mapping

bull then another pair to establish with

CERTAINTY the mapping

ISI Kolkata ML for MT Jan 6 2014 69

For the ldquocertaintyrdquo

We have a translation pair containing alignment candidates and none of the other words in the translation pair

OR

We have a translation pair containing all words in the translation pair except the alignment candidates

ISI Kolkata ML for MT Jan 6 2014 70

Thereforehellip

If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty

ISI Kolkata ML for MT Jan 6 2014 71

Rough estimate of data requirement

SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge

ie no meanings or grammatical properties of any words phrases or sentences

Each language has a vocabulary of 100000 words can give rise to about 500000 word forms

through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟

bdquowent‟ and bdquogone‟

ISI Kolkata ML for MT Jan 6 2014 72

Reasons for mapping to multiple words

Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register

polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)

syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood

ISI Kolkata ML for MT Jan 6 2014 73

Estimate of corpora requirement

Assume that on an average a sentence is 10 words long

an additional 9 translation pairs for getting at one of the 5 mappings

10 sentences per mapping per word

a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences

Estimate is not wide off the mark

Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs

ISI Kolkata ML for MT Jan 6 2014 74

Our work on factor based SMT

Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009

Jan 6 2014 ISI Kolkata ML for MT 75

Case Marker and Morphology crucial in E-H MT

Order of magnitiude facelift in Fluency and fidelity

Determined by the combination of suffixes and semantic relations on the English side

Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers

Jan 6 2014 ISI Kolkata ML for MT 76

Semantic relations+SuffixesCase

Markers+inflections

I ate mangoes

I ltagt ate eatpast mangoes ltobj

I ltagt mangoes ltobjpl eatpast

mei_ne aam khaa_yaa

Jan 6 2014 ISI Kolkata ML for MT 77

Our Approach

Factored model (Koehn and Hoang 2007) with the following translation factor

suffix + semantic relation case markersuffix

Experiments with the following relations

Dependency relations from the stanford parser

Deeper semantic roles from Universal Networking Language (UNL)

Jan 6 2014 ISI Kolkata ML for MT 78

Our Factorization

Jan 6 2014 ISI Kolkata ML for MT 79

Experiments

Jan 6 2014 ISI Kolkata ML for MT 80

Corpus Statistics

Jan 6 2014 ISI Kolkata ML for MT 81

Results The impact of suffix and semantic factors

Jan 6 2014 ISI Kolkata ML for MT 82

Results The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 83

Subjective Evaluation The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 84

Impact of sentence length (F Fluency AAdequacy E Errors)

Jan 6 2014 ISI Kolkata ML for MT 85

A feel for the improvement-baseline

Jan 6 2014 ISI Kolkata ML for MT 86

A feel for the improvement-reorder

Jan 6 2014 ISI Kolkata ML for MT 87

A feel for the improvement-Semantic relation

Jan 6 2014 ISI Kolkata ML for MT 88

A recent study

PAN Indian SMT

Jan 6 2014 ISI Kolkata ML for MT 89

Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator

SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi

Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English

Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences

Evaluation with BLEU METEOR scores also show high correlation with BLEU

Jan 6 2014 ISI Kolkata ML for MT 90

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Taxonomy of MT systems

MT Approaches

Knowledge Based Rule Based MT

Data driven Machine Learning Based

Example Based MT (EBMT)

Statistical MT

Interlingua Based Transfer Based

Jan 6 2014 ISI Kolkata ML for MT 5

Why is MT difficult

Language divergence

Jan 6 2014 ISI Kolkata ML for MT 6

Why is MT difficult Language Divergence

One of the main complexities of MT Language Divergence

Languages have different ways of expressing meaning

Lexico-Semantic Divergence

Structural Divergence

Our work on English-IL Language Divergence with illustrations from Hindi (Dave Parikh Bhattacharyya Journal of MT 2002)

Jan 6 2014 ISI Kolkata ML for MT 7

Languages differ in expressing thoughts Agglutination

Finnish ldquoistahtaisinkohanrdquo

English I wonder if I should sit down for a whileldquo

Analysis

ist + sit verb stem

ahta + verb derivation morpheme to do something for a while

isi + conditional affix

n + 1st person singular suffix

ko + question particle

han a particle for things like reminder (with declaratives) or softening (with questions and imperatives)

Jan 6 2014 ISI Kolkata ML for MT 8

Language Divergence Theory Lexico-

Semantic Divergences (few examples)

Conflational divergence F vomir E to be sick E stab H chure se maaranaa (knife-with hit) S Utrymningsplan E escape plan

Categorial divergence

Change is in POS category

The play is on_PREP (vs The play is Sunday) Khel chal_rahaa_haai_VM (vs khel ravivaar ko haai)

Jan 6 2014 ISI Kolkata ML for MT 9

Language Divergence Theory Structural Divergences

SVOSOV E Peter plays basketball H piitar basketball kheltaa haai

Head swapping divergence E Prime Minister of India H bhaarat ke pradhaan mantrii (India-of Prime

Minister)

Jan 6 2014 ISI Kolkata ML for MT 10

Language Divergence Theory Syntactic

Divergences (few examples)

Constituent Order divergence

E Singh the PM of India will address the nation today

H bhaarat ke pradhaan mantrii singh hellip (India-of PM Singhhellip)

Adjunction Divergence

E She will visit here in the summer

H vah yahaa garmii meM aayegii (she here summer-in will come)

Preposition-Stranding divergence

E Who do you want to go with

H kisake saath aap jaanaa chaahate ho (who withhellip)

Jan 6 2014 ISI Kolkata ML for MT 11

Vauquois Triangle

Jan 6 2014 ISI Kolkata ML for MT 12

Kinds of MT Systems (point of entry from source to the target text)

Jan 6 2014 ISI Kolkata ML for MT 13

Illustration of transfer SVOSOV

S

NP VP

V N NP

N John eats

bread

S

NP VP

V N

John eats

NP

N

bread

(transfer svo sov)

Jan 6 2014 ISI Kolkata ML for MT 14

Universality hypothesis

Universality hypothesis At the

level of ldquodeep meaningrdquo all texts

are the ldquosamerdquo whatever the

language

Jan 6 2014 ISI Kolkata ML for MT 15

Understanding the Analysis-Transfer-Generation over Vauquois triangle (14)

H11 सरकार_न चनावो_क_बाद मबई म करो_क_माधयम_स

अपन राजसव_को बढ़ाया | T11 Sarkaar ne chunaawo ke baad Mumbai me karoM ke

maadhyam se apne raajaswa ko badhaayaa

G11 Government_(ergative) elections_after Mumbai_in

taxes_through its revenue_(accusative) increased

E11 The Government increased its revenue after the

elections through taxes in Mumbai

Jan 6 2014 ISI Kolkata ML for MT 16

Interlingual representation complete disambiguation

Washington voted Washington to power Vote

past

Washington power Washington

emphasis

ltis-a gt action

ltis-a gt place

ltis-a gt capability ltis-a gt hellip

ltis-a gt person

goal

Jan 6 2014 ISI Kolkata ML for MT 17

Kinds of disambiguation needed for a complete and correct interlingua graph

N Name

P POS

A Attachment

S Sense

C Co-reference

R Semantic Role

Jan 6 2014 ISI Kolkata ML for MT 18

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech Noun or Verb

Jan 6 2014 ISI Kolkata ML for MT 19

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech

NER

John is the name of a PERSON

Jan 6 2014 ISI Kolkata ML for MT 20

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech

NER

WSD Financial

bank or River bank

Jan 6 2014 ISI Kolkata ML for MT 21

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech

NER

WSD

Co-reference

ldquoitrdquo ldquobankrdquo

Jan 6 2014 ISI Kolkata ML for MT 22

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech

NER

WSD

Co-reference

Subject Drop

Pro drop (subject ldquoIrdquo)

Jan 6 2014 ISI Kolkata ML for MT 23

Typical NLP tools used

POS tagger

Stanford Named Entity Recognizer

Stanford Dependency Parser

XLE Dependency Parser

Lexical Resource

WordNet

Universal Word Dictionary (UW++)

Jan 6 2014 ISI Kolkata ML for MT 24

System Architecture Stanford

Dependency Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple Sentence Analyser

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

Simple Enco

Simple Enco

Simple Enco

Simple Enco

Simple Enco

Simplifier

Jan 6 2014 ISI Kolkata ML for MT 25

Target Sentence Generation from interlingua

Lexical Transfer

Target Sentence

Generation

Syntax

Planning

Morphological

Synthesis

(WordPhrase

Translation ) (Word form

Generation) (Sequence)

Jan 6 2014 ISI Kolkata ML for MT 26

Generation Architecture Deconversion = Transfer + Generation

Jan 6 2014 ISI Kolkata ML for MT 27

Transfer Based MT

Marathi-Hindi

Jan 6 2014 ISI Kolkata ML for MT 28

Indian Language to Indian Language Machine Translation (ILILMT)

Bidirectional Machine Translation System

Developed for nine Indian language pairs

Approach

Transfer based

Modules developed using both rule based and statistical approach

Jan 6 2014 ISI Kolkata ML for MT 29

Architecture of ILILMT System

Morphological Analyzer

Source Text

POS Tagger

Chunker

Vibhakti Computation

Name Entity Recognizer

Word Sense Disambiguatio

n

Lexical Transfer

Agreement Feature

Interchunk

Word Generator

Intrachunk

Target Text

Analysis

Transfer

Generation

Jan 6 2014 ISI Kolkata ML for MT 30

M-H MT system Evaluation

Subjective evaluation based on machine translation quality

Accuracy calculated based on score given by linguists

S5 Number of score 5 Sentences S4 Number of score 4 sentences

S3 Number of score 3 sentences N Total Number of sentences

Accuracy =

Score 5 Correct Translation

Score 4 Understandable with

minor errors

Score 3 Understandable with

major errors

Score 2 Not Understandable

Score 1 Non sense translation

Jan 6 2014 ISI Kolkata ML for MT 31

Evaluation of Marathi to Hindi MT System

0

02

04

06

08

1

12

MorphAnalyzer

POS Tagger Chunker VibhaktiCompute

WSD LexicalTransfer

WordGenerator

Precision

Recall

Module-wise precision and recall

Jan 6 2014 ISI Kolkata ML for MT 32

Evaluation of Marathi to Hindi MT System (cont)

Subjective evaluation on translation quality Evaluated on 500 web sentences Accuracy calculated based on score given according to the

translation quality Accuracy 6532

Result analysis

Morph POS tagger chunker gives more than 90 precision but Transfer WSD generator modules are below 80 hence degrades MT quality

Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Jan 6 2014 ISI Kolkata ML for MT 33

Statistical Machine Translation

Jan 6 2014 ISI Kolkata ML for MT 34

Czeck-English data

[nesu] ldquoI carryrdquo

[ponese] ldquoHe will carryrdquo

[nese] ldquoHe carriesrdquo

[nesou] ldquoThey carryrdquo

[yedu] ldquoI driverdquo

[plavou] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 35

To translate hellip

I will carry

They drive

He swims

They will drive

Jan 6 2014 ISI Kolkata ML for MT 36

Hindi-English data

[DhotA huM] ldquoI carryrdquo

[DhoegA] ldquoHe will carryrdquo

[DhotA hAi] ldquoHe carriesrdquo

[Dhote hAi] ldquoThey carryrdquo

[chalAtA huM] ldquoI driverdquo

[tErte hEM] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 37

Bangla-English data

[bai] ldquoI carryrdquo

[baibe] ldquoHe will carryrdquo

[bay] ldquoHe carriesrdquo

[bay] ldquoThey carryrdquo

[chAlAi] ldquoI driverdquo

[sAMtrAy] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 38

To translate hellip (repeated)

I will carry

They drive

He swims

They will drive

Jan 6 2014 ISI Kolkata ML for MT 39

Foundation

Data driven approach Goal is to find out the English sentence e

given foreign language sentence f whose p(e|f) is maximum

Translations are generated on the basis of statistical model

Parameters are estimated using bilingual parallel corpora

Jan 6 2014 ISI Kolkata ML for MT 40

SMT Language Model

To detect good English sentences

Probability of an English sentence w1w2 helliphellip wn can be written as

Pr(w1w2 helliphellip wn) = Pr(w1) Pr(w2|w1) Pr(wn|w1 w2 wn-1)

Here Pr(wn|w1 w2 wn-1) is the probability that word wn follows word string w1 w2 wn-1 N-gram model probability

Trigram model probability calculation

Jan 6 2014 ISI Kolkata ML for MT 41

SMT Translation Model

P(f|e) Probability of some f given hypothesis English translation e

How to assign the values to p(e|f)

Sentences are infinite not possible to find pair(ef) for all sentences

Introduce a hidden variable a that represents alignments between the individual words in the sentence pair

Sentence level

Word level

Jan 6 2014 ISI Kolkata ML for MT 42

Alignment

If the string e= e1l= e1 e2 hellipel has l words and

the string f= f1m=f1f2fm has m words

then the alignment a can be represented by a series a1

m= a1a2am of m values each between 0 and l such that if the word in position j of the f-string is connected to the word in position i of the e-string then aj= i and

if it is not connected to any English word then aj= O

Jan 6 2014 ISI Kolkata ML for MT 43

Example of alignment

English Ram went to school

Hindi Raama paathashaalaa gayaa

Ram went to school

ltNullgt Raama paathashaalaa gayaa

Jan 6 2014 ISI Kolkata ML for MT 44

Translation Model Exact expression

Five models for estimating parameters in the expression [2]

Model-1 Model-2 Model-3 Model-4 Model-5

Choose alignment given e and m

Choose the identity of foreign word given e m a

Choose the length of foreign language string given e

Jan 6 2014 ISI Kolkata ML for MT 45

a

eafef )|Pr()|Pr(

m

emafeaf )|Pr()|Pr(

m

emafememaf )|Pr()|Pr()|Pr(

m

emafem )|Pr()|Pr(

m

m

j

jj

jj emfaafem1

1

1

1

1 )|Pr()|Pr(

m

j

jj

j

jj

j

m

emfafemfaaem1

1

11

1

1

1

1 )|Pr()|Pr()|Pr(

)|Pr( emaf )|Pr( em

m

j

jj

j

jj

j emfafemfaa1

1

11

1

1

1

1 )|Pr()|Pr(

Proof of Translation Model Exact

expression

m is fixed for a particular f hence

marginalization

marginalization

Jan 6 2014 ISI Kolkata ML for MT 46

Alignment

Jan 6 2014 ISI Kolkata ML for MT 47

Fundamental and ubiquitous

Spell checking

Translation

Transliteration

Speech to text

Text to speeh

Jan 6 2014 ISI Kolkata ML for MT 48

EM for word alignment from

sentence alignment example

English

(1) three rabbits

a b

(2) rabbits of Grenoble

b c d

French

(1) trois lapins

w x

(2) lapins de Grenoble

x y z

Jan 6 2014 ISI Kolkata ML for MT 49

Initial Probabilities

each cell denotes t(a w) t(a x) etc

a b c d

w 14 14 14 14

x 14 14 14 14

y 14 14 14 14

z 14 14 14 14

The counts in IBM Model 1

Works by maximizing P(f|e) over the entire corpus

For IBM Model 1 we get the following relationship

c (w f |w e f e ) =t (w f |w e )

t (w f |we

0 ) +hellip + t (w f |wel )

c (w f |w e f e ) is the fractional count of the alignment of w f

with w e in f and e

t (w f |w e ) is the probability of w f being the translation of w e

is the count of w f in f

is the count of w e in e

Jan 6 2014 ISI Kolkata ML for MT 51

Example of expected count

C[aw (a b)(w x)]

t(aw)

= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)

t(aw)+t(ax)

14

= ----------------- X 1 X 1= 12

14+14

Jan 6 2014 ISI Kolkata ML for MT 52

ldquocountsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 13 13 13

y 0 13 13 13

z 0 13 13 13

a b

w x

a b c d

w 12 12 0 0

x 12 12 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 53

Revised probability example

trevised(a w)

12

= -------------------------------------------------------------------

(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)

Jan 6 2014 ISI Kolkata ML for MT 54

Revised probabilities table

a b c d

w 12 14 0 0

x 12 512 13 13

y 0 16 13 13

z 0 16 13 13

ldquorevised countsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 59 13 13

y 0 29 13 13

z 0 29 13 13

a b

w x

a b c d

w 12 38 0 0

x 12 58 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 56

Re-Revised probabilities table a b c d

w 12 316 0 0

x 12 85144 13 13

y 0 19 13 13

z 0 19 13 13

Continue until convergence notice that (bx) binding gets progressively stronger

b=rabbits x=lapins

Derivation of EM based Alignment Expressions

Hindi)(Say language ofy vocabular

English)(Say language ofry vocalbula

2

1

LV

LV

F

E

what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet

जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet

E1

F1

E2

F2

Jan 6 2014 ISI Kolkata ML for MT 58

Vocabulary mapping

Vocabulary VE VF

what is in a name that which we call rose by any other will smell as sweet

naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii

Jan 6 2014 ISI Kolkata ML for MT 59

Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like

11989011 11989012 hellip 119890

11198971 119891

11 11989112 hellip 119891

11198981

11989021 11989022 hellip 119890

21198972 119891

21 11989122 hellip 119891

21198982

1198901199041 119890

1199042 hellip 119890

119904119897119904 119891

1199041 1198911199042 hellip 119891

119904119898119904

1198901198781 119890

1198782 hellip 119890

119878119897119878 119891

1198781 1198911198782 hellip 119891

119878119898119878

No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890

119904119901 =Index of English word 119890119904119901in English vocabularydictionary

119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary

(Thanks to Sachin Pawar for helping with the maths formulae processing)

Jan 6 2014 ISI Kolkata ML for MT 60

Hidden variables and parameters

Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878

119904=1 where each hidden variable is

as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French

word 119911119901119902119904 = 0 otherwise

Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893

word in French vocabulary

Jan 6 2014 ISI Kolkata ML for MT 61

Likelihoods Data Likelihood L(D Θ)

Data Log-Likelihood LL(D Θ)

Expected value of Data Log-Likelihood E(LL(D Θ))

Jan 6 2014 ISI Kolkata ML for MT 62

Constraint and Lagrangian

119875119894119895

119881119865

119895=1

= 1 forall119894

Jan 6 2014 ISI Kolkata ML for MT 63

Differentiating wrt Pij

Jan 6 2014 ISI Kolkata ML for MT 64

Final E and M steps

M-step

E-step

Jan 6 2014 ISI Kolkata ML for MT 65

Combinatorial considerations

Jan 6 2014 ISI Kolkata ML for MT 66

Example

ISI Kolkata ML for MT Jan 6 2014 67

All possible alignments

ISI Kolkata ML for MT Jan 6 2014 68

First fundamental requirement of SMT

Alignment requires evidence of

bull firstly a translation pair to introduce the

POSSIBILITY of a mapping

bull then another pair to establish with

CERTAINTY the mapping

ISI Kolkata ML for MT Jan 6 2014 69

For the ldquocertaintyrdquo

We have a translation pair containing alignment candidates and none of the other words in the translation pair

OR

We have a translation pair containing all words in the translation pair except the alignment candidates

ISI Kolkata ML for MT Jan 6 2014 70

Thereforehellip

If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty

ISI Kolkata ML for MT Jan 6 2014 71

Rough estimate of data requirement

SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge

ie no meanings or grammatical properties of any words phrases or sentences

Each language has a vocabulary of 100000 words can give rise to about 500000 word forms

through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟

bdquowent‟ and bdquogone‟

ISI Kolkata ML for MT Jan 6 2014 72

Reasons for mapping to multiple words

Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register

polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)

syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood

ISI Kolkata ML for MT Jan 6 2014 73

Estimate of corpora requirement

Assume that on an average a sentence is 10 words long

an additional 9 translation pairs for getting at one of the 5 mappings

10 sentences per mapping per word

a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences

Estimate is not wide off the mark

Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs

ISI Kolkata ML for MT Jan 6 2014 74

Our work on factor based SMT

Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009

Jan 6 2014 ISI Kolkata ML for MT 75

Case Marker and Morphology crucial in E-H MT

Order of magnitiude facelift in Fluency and fidelity

Determined by the combination of suffixes and semantic relations on the English side

Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers

Jan 6 2014 ISI Kolkata ML for MT 76

Semantic relations+SuffixesCase

Markers+inflections

I ate mangoes

I ltagt ate eatpast mangoes ltobj

I ltagt mangoes ltobjpl eatpast

mei_ne aam khaa_yaa

Jan 6 2014 ISI Kolkata ML for MT 77

Our Approach

Factored model (Koehn and Hoang 2007) with the following translation factor

suffix + semantic relation case markersuffix

Experiments with the following relations

Dependency relations from the stanford parser

Deeper semantic roles from Universal Networking Language (UNL)

Jan 6 2014 ISI Kolkata ML for MT 78

Our Factorization

Jan 6 2014 ISI Kolkata ML for MT 79

Experiments

Jan 6 2014 ISI Kolkata ML for MT 80

Corpus Statistics

Jan 6 2014 ISI Kolkata ML for MT 81

Results The impact of suffix and semantic factors

Jan 6 2014 ISI Kolkata ML for MT 82

Results The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 83

Subjective Evaluation The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 84

Impact of sentence length (F Fluency AAdequacy E Errors)

Jan 6 2014 ISI Kolkata ML for MT 85

A feel for the improvement-baseline

Jan 6 2014 ISI Kolkata ML for MT 86

A feel for the improvement-reorder

Jan 6 2014 ISI Kolkata ML for MT 87

A feel for the improvement-Semantic relation

Jan 6 2014 ISI Kolkata ML for MT 88

A recent study

PAN Indian SMT

Jan 6 2014 ISI Kolkata ML for MT 89

Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator

SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi

Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English

Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences

Evaluation with BLEU METEOR scores also show high correlation with BLEU

Jan 6 2014 ISI Kolkata ML for MT 90

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Why is MT difficult

Language divergence

Jan 6 2014 ISI Kolkata ML for MT 6

Why is MT difficult Language Divergence

One of the main complexities of MT Language Divergence

Languages have different ways of expressing meaning

Lexico-Semantic Divergence

Structural Divergence

Our work on English-IL Language Divergence with illustrations from Hindi (Dave Parikh Bhattacharyya Journal of MT 2002)

Jan 6 2014 ISI Kolkata ML for MT 7

Languages differ in expressing thoughts Agglutination

Finnish ldquoistahtaisinkohanrdquo

English I wonder if I should sit down for a whileldquo

Analysis

ist + sit verb stem

ahta + verb derivation morpheme to do something for a while

isi + conditional affix

n + 1st person singular suffix

ko + question particle

han a particle for things like reminder (with declaratives) or softening (with questions and imperatives)

Jan 6 2014 ISI Kolkata ML for MT 8

Language Divergence Theory Lexico-

Semantic Divergences (few examples)

Conflational divergence F vomir E to be sick E stab H chure se maaranaa (knife-with hit) S Utrymningsplan E escape plan

Categorial divergence

Change is in POS category

The play is on_PREP (vs The play is Sunday) Khel chal_rahaa_haai_VM (vs khel ravivaar ko haai)

Jan 6 2014 ISI Kolkata ML for MT 9

Language Divergence Theory Structural Divergences

SVOSOV E Peter plays basketball H piitar basketball kheltaa haai

Head swapping divergence E Prime Minister of India H bhaarat ke pradhaan mantrii (India-of Prime

Minister)

Jan 6 2014 ISI Kolkata ML for MT 10

Language Divergence Theory Syntactic

Divergences (few examples)

Constituent Order divergence

E Singh the PM of India will address the nation today

H bhaarat ke pradhaan mantrii singh hellip (India-of PM Singhhellip)

Adjunction Divergence

E She will visit here in the summer

H vah yahaa garmii meM aayegii (she here summer-in will come)

Preposition-Stranding divergence

E Who do you want to go with

H kisake saath aap jaanaa chaahate ho (who withhellip)

Jan 6 2014 ISI Kolkata ML for MT 11

Vauquois Triangle

Jan 6 2014 ISI Kolkata ML for MT 12

Kinds of MT Systems (point of entry from source to the target text)

Jan 6 2014 ISI Kolkata ML for MT 13

Illustration of transfer SVOSOV

S

NP VP

V N NP

N John eats

bread

S

NP VP

V N

John eats

NP

N

bread

(transfer svo sov)

Jan 6 2014 ISI Kolkata ML for MT 14

Universality hypothesis

Universality hypothesis At the

level of ldquodeep meaningrdquo all texts

are the ldquosamerdquo whatever the

language

Jan 6 2014 ISI Kolkata ML for MT 15

Understanding the Analysis-Transfer-Generation over Vauquois triangle (14)

H11 सरकार_न चनावो_क_बाद मबई म करो_क_माधयम_स

अपन राजसव_को बढ़ाया | T11 Sarkaar ne chunaawo ke baad Mumbai me karoM ke

maadhyam se apne raajaswa ko badhaayaa

G11 Government_(ergative) elections_after Mumbai_in

taxes_through its revenue_(accusative) increased

E11 The Government increased its revenue after the

elections through taxes in Mumbai

Jan 6 2014 ISI Kolkata ML for MT 16

Interlingual representation complete disambiguation

Washington voted Washington to power Vote

past

Washington power Washington

emphasis

ltis-a gt action

ltis-a gt place

ltis-a gt capability ltis-a gt hellip

ltis-a gt person

goal

Jan 6 2014 ISI Kolkata ML for MT 17

Kinds of disambiguation needed for a complete and correct interlingua graph

N Name

P POS

A Attachment

S Sense

C Co-reference

R Semantic Role

Jan 6 2014 ISI Kolkata ML for MT 18

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech Noun or Verb

Jan 6 2014 ISI Kolkata ML for MT 19

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech

NER

John is the name of a PERSON

Jan 6 2014 ISI Kolkata ML for MT 20

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech

NER

WSD Financial

bank or River bank

Jan 6 2014 ISI Kolkata ML for MT 21

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech

NER

WSD

Co-reference

ldquoitrdquo ldquobankrdquo

Jan 6 2014 ISI Kolkata ML for MT 22

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech

NER

WSD

Co-reference

Subject Drop

Pro drop (subject ldquoIrdquo)

Jan 6 2014 ISI Kolkata ML for MT 23

Typical NLP tools used

POS tagger

Stanford Named Entity Recognizer

Stanford Dependency Parser

XLE Dependency Parser

Lexical Resource

WordNet

Universal Word Dictionary (UW++)

Jan 6 2014 ISI Kolkata ML for MT 24

System Architecture Stanford

Dependency Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple Sentence Analyser

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

Simple Enco

Simple Enco

Simple Enco

Simple Enco

Simple Enco

Simplifier

Jan 6 2014 ISI Kolkata ML for MT 25

Target Sentence Generation from interlingua

Lexical Transfer

Target Sentence

Generation

Syntax

Planning

Morphological

Synthesis

(WordPhrase

Translation ) (Word form

Generation) (Sequence)

Jan 6 2014 ISI Kolkata ML for MT 26

Generation Architecture Deconversion = Transfer + Generation

Jan 6 2014 ISI Kolkata ML for MT 27

Transfer Based MT

Marathi-Hindi

Jan 6 2014 ISI Kolkata ML for MT 28

Indian Language to Indian Language Machine Translation (ILILMT)

Bidirectional Machine Translation System

Developed for nine Indian language pairs

Approach

Transfer based

Modules developed using both rule based and statistical approach

Jan 6 2014 ISI Kolkata ML for MT 29

Architecture of ILILMT System

Morphological Analyzer

Source Text

POS Tagger

Chunker

Vibhakti Computation

Name Entity Recognizer

Word Sense Disambiguatio

n

Lexical Transfer

Agreement Feature

Interchunk

Word Generator

Intrachunk

Target Text

Analysis

Transfer

Generation

Jan 6 2014 ISI Kolkata ML for MT 30

M-H MT system Evaluation

Subjective evaluation based on machine translation quality

Accuracy calculated based on score given by linguists

S5 Number of score 5 Sentences S4 Number of score 4 sentences

S3 Number of score 3 sentences N Total Number of sentences

Accuracy =

Score 5 Correct Translation

Score 4 Understandable with

minor errors

Score 3 Understandable with

major errors

Score 2 Not Understandable

Score 1 Non sense translation

Jan 6 2014 ISI Kolkata ML for MT 31

Evaluation of Marathi to Hindi MT System

0

02

04

06

08

1

12

MorphAnalyzer

POS Tagger Chunker VibhaktiCompute

WSD LexicalTransfer

WordGenerator

Precision

Recall

Module-wise precision and recall

Jan 6 2014 ISI Kolkata ML for MT 32

Evaluation of Marathi to Hindi MT System (cont)

Subjective evaluation on translation quality Evaluated on 500 web sentences Accuracy calculated based on score given according to the

translation quality Accuracy 6532

Result analysis

Morph POS tagger chunker gives more than 90 precision but Transfer WSD generator modules are below 80 hence degrades MT quality

Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Jan 6 2014 ISI Kolkata ML for MT 33

Statistical Machine Translation

Jan 6 2014 ISI Kolkata ML for MT 34

Czeck-English data

[nesu] ldquoI carryrdquo

[ponese] ldquoHe will carryrdquo

[nese] ldquoHe carriesrdquo

[nesou] ldquoThey carryrdquo

[yedu] ldquoI driverdquo

[plavou] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 35

To translate hellip

I will carry

They drive

He swims

They will drive

Jan 6 2014 ISI Kolkata ML for MT 36

Hindi-English data

[DhotA huM] ldquoI carryrdquo

[DhoegA] ldquoHe will carryrdquo

[DhotA hAi] ldquoHe carriesrdquo

[Dhote hAi] ldquoThey carryrdquo

[chalAtA huM] ldquoI driverdquo

[tErte hEM] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 37

Bangla-English data

[bai] ldquoI carryrdquo

[baibe] ldquoHe will carryrdquo

[bay] ldquoHe carriesrdquo

[bay] ldquoThey carryrdquo

[chAlAi] ldquoI driverdquo

[sAMtrAy] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 38

To translate hellip (repeated)

I will carry

They drive

He swims

They will drive

Jan 6 2014 ISI Kolkata ML for MT 39

Foundation

Data driven approach Goal is to find out the English sentence e

given foreign language sentence f whose p(e|f) is maximum

Translations are generated on the basis of statistical model

Parameters are estimated using bilingual parallel corpora

Jan 6 2014 ISI Kolkata ML for MT 40

SMT Language Model

To detect good English sentences

Probability of an English sentence w1w2 helliphellip wn can be written as

Pr(w1w2 helliphellip wn) = Pr(w1) Pr(w2|w1) Pr(wn|w1 w2 wn-1)

Here Pr(wn|w1 w2 wn-1) is the probability that word wn follows word string w1 w2 wn-1 N-gram model probability

Trigram model probability calculation

Jan 6 2014 ISI Kolkata ML for MT 41

SMT Translation Model

P(f|e) Probability of some f given hypothesis English translation e

How to assign the values to p(e|f)

Sentences are infinite not possible to find pair(ef) for all sentences

Introduce a hidden variable a that represents alignments between the individual words in the sentence pair

Sentence level

Word level

Jan 6 2014 ISI Kolkata ML for MT 42

Alignment

If the string e= e1l= e1 e2 hellipel has l words and

the string f= f1m=f1f2fm has m words

then the alignment a can be represented by a series a1

m= a1a2am of m values each between 0 and l such that if the word in position j of the f-string is connected to the word in position i of the e-string then aj= i and

if it is not connected to any English word then aj= O

Jan 6 2014 ISI Kolkata ML for MT 43

Example of alignment

English Ram went to school

Hindi Raama paathashaalaa gayaa

Ram went to school

ltNullgt Raama paathashaalaa gayaa

Jan 6 2014 ISI Kolkata ML for MT 44

Translation Model Exact expression

Five models for estimating parameters in the expression [2]

Model-1 Model-2 Model-3 Model-4 Model-5

Choose alignment given e and m

Choose the identity of foreign word given e m a

Choose the length of foreign language string given e

Jan 6 2014 ISI Kolkata ML for MT 45

a

eafef )|Pr()|Pr(

m

emafeaf )|Pr()|Pr(

m

emafememaf )|Pr()|Pr()|Pr(

m

emafem )|Pr()|Pr(

m

m

j

jj

jj emfaafem1

1

1

1

1 )|Pr()|Pr(

m

j

jj

j

jj

j

m

emfafemfaaem1

1

11

1

1

1

1 )|Pr()|Pr()|Pr(

)|Pr( emaf )|Pr( em

m

j

jj

j

jj

j emfafemfaa1

1

11

1

1

1

1 )|Pr()|Pr(

Proof of Translation Model Exact

expression

m is fixed for a particular f hence

marginalization

marginalization

Jan 6 2014 ISI Kolkata ML for MT 46

Alignment

Jan 6 2014 ISI Kolkata ML for MT 47

Fundamental and ubiquitous

Spell checking

Translation

Transliteration

Speech to text

Text to speeh

Jan 6 2014 ISI Kolkata ML for MT 48

EM for word alignment from

sentence alignment example

English

(1) three rabbits

a b

(2) rabbits of Grenoble

b c d

French

(1) trois lapins

w x

(2) lapins de Grenoble

x y z

Jan 6 2014 ISI Kolkata ML for MT 49

Initial Probabilities

each cell denotes t(a w) t(a x) etc

a b c d

w 14 14 14 14

x 14 14 14 14

y 14 14 14 14

z 14 14 14 14

The counts in IBM Model 1

Works by maximizing P(f|e) over the entire corpus

For IBM Model 1 we get the following relationship

c (w f |w e f e ) =t (w f |w e )

t (w f |we

0 ) +hellip + t (w f |wel )

c (w f |w e f e ) is the fractional count of the alignment of w f

with w e in f and e

t (w f |w e ) is the probability of w f being the translation of w e

is the count of w f in f

is the count of w e in e

Jan 6 2014 ISI Kolkata ML for MT 51

Example of expected count

C[aw (a b)(w x)]

t(aw)

= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)

t(aw)+t(ax)

14

= ----------------- X 1 X 1= 12

14+14

Jan 6 2014 ISI Kolkata ML for MT 52

ldquocountsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 13 13 13

y 0 13 13 13

z 0 13 13 13

a b

w x

a b c d

w 12 12 0 0

x 12 12 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 53

Revised probability example

trevised(a w)

12

= -------------------------------------------------------------------

(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)

Jan 6 2014 ISI Kolkata ML for MT 54

Revised probabilities table

a b c d

w 12 14 0 0

x 12 512 13 13

y 0 16 13 13

z 0 16 13 13

ldquorevised countsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 59 13 13

y 0 29 13 13

z 0 29 13 13

a b

w x

a b c d

w 12 38 0 0

x 12 58 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 56

Re-Revised probabilities table a b c d

w 12 316 0 0

x 12 85144 13 13

y 0 19 13 13

z 0 19 13 13

Continue until convergence notice that (bx) binding gets progressively stronger

b=rabbits x=lapins

Derivation of EM based Alignment Expressions

Hindi)(Say language ofy vocabular

English)(Say language ofry vocalbula

2

1

LV

LV

F

E

what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet

जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet

E1

F1

E2

F2

Jan 6 2014 ISI Kolkata ML for MT 58

Vocabulary mapping

Vocabulary VE VF

what is in a name that which we call rose by any other will smell as sweet

naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii

Jan 6 2014 ISI Kolkata ML for MT 59

Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like

11989011 11989012 hellip 119890

11198971 119891

11 11989112 hellip 119891

11198981

11989021 11989022 hellip 119890

21198972 119891

21 11989122 hellip 119891

21198982

1198901199041 119890

1199042 hellip 119890

119904119897119904 119891

1199041 1198911199042 hellip 119891

119904119898119904

1198901198781 119890

1198782 hellip 119890

119878119897119878 119891

1198781 1198911198782 hellip 119891

119878119898119878

No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890

119904119901 =Index of English word 119890119904119901in English vocabularydictionary

119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary

(Thanks to Sachin Pawar for helping with the maths formulae processing)

Jan 6 2014 ISI Kolkata ML for MT 60

Hidden variables and parameters

Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878

119904=1 where each hidden variable is

as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French

word 119911119901119902119904 = 0 otherwise

Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893

word in French vocabulary

Jan 6 2014 ISI Kolkata ML for MT 61

Likelihoods Data Likelihood L(D Θ)

Data Log-Likelihood LL(D Θ)

Expected value of Data Log-Likelihood E(LL(D Θ))

Jan 6 2014 ISI Kolkata ML for MT 62

Constraint and Lagrangian

119875119894119895

119881119865

119895=1

= 1 forall119894

Jan 6 2014 ISI Kolkata ML for MT 63

Differentiating wrt Pij

Jan 6 2014 ISI Kolkata ML for MT 64

Final E and M steps

M-step

E-step

Jan 6 2014 ISI Kolkata ML for MT 65

Combinatorial considerations

Jan 6 2014 ISI Kolkata ML for MT 66

Example

ISI Kolkata ML for MT Jan 6 2014 67

All possible alignments

ISI Kolkata ML for MT Jan 6 2014 68

First fundamental requirement of SMT

Alignment requires evidence of

bull firstly a translation pair to introduce the

POSSIBILITY of a mapping

bull then another pair to establish with

CERTAINTY the mapping

ISI Kolkata ML for MT Jan 6 2014 69

For the ldquocertaintyrdquo

We have a translation pair containing alignment candidates and none of the other words in the translation pair

OR

We have a translation pair containing all words in the translation pair except the alignment candidates

ISI Kolkata ML for MT Jan 6 2014 70

Thereforehellip

If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty

ISI Kolkata ML for MT Jan 6 2014 71

Rough estimate of data requirement

SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge

ie no meanings or grammatical properties of any words phrases or sentences

Each language has a vocabulary of 100000 words can give rise to about 500000 word forms

through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟

bdquowent‟ and bdquogone‟

ISI Kolkata ML for MT Jan 6 2014 72

Reasons for mapping to multiple words

Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register

polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)

syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood

ISI Kolkata ML for MT Jan 6 2014 73

Estimate of corpora requirement

Assume that on an average a sentence is 10 words long

an additional 9 translation pairs for getting at one of the 5 mappings

10 sentences per mapping per word

a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences

Estimate is not wide off the mark

Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs

ISI Kolkata ML for MT Jan 6 2014 74

Our work on factor based SMT

Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009

Jan 6 2014 ISI Kolkata ML for MT 75

Case Marker and Morphology crucial in E-H MT

Order of magnitiude facelift in Fluency and fidelity

Determined by the combination of suffixes and semantic relations on the English side

Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers

Jan 6 2014 ISI Kolkata ML for MT 76

Semantic relations+SuffixesCase

Markers+inflections

I ate mangoes

I ltagt ate eatpast mangoes ltobj

I ltagt mangoes ltobjpl eatpast

mei_ne aam khaa_yaa

Jan 6 2014 ISI Kolkata ML for MT 77

Our Approach

Factored model (Koehn and Hoang 2007) with the following translation factor

suffix + semantic relation case markersuffix

Experiments with the following relations

Dependency relations from the stanford parser

Deeper semantic roles from Universal Networking Language (UNL)

Jan 6 2014 ISI Kolkata ML for MT 78

Our Factorization

Jan 6 2014 ISI Kolkata ML for MT 79

Experiments

Jan 6 2014 ISI Kolkata ML for MT 80

Corpus Statistics

Jan 6 2014 ISI Kolkata ML for MT 81

Results The impact of suffix and semantic factors

Jan 6 2014 ISI Kolkata ML for MT 82

Results The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 83

Subjective Evaluation The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 84

Impact of sentence length (F Fluency AAdequacy E Errors)

Jan 6 2014 ISI Kolkata ML for MT 85

A feel for the improvement-baseline

Jan 6 2014 ISI Kolkata ML for MT 86

A feel for the improvement-reorder

Jan 6 2014 ISI Kolkata ML for MT 87

A feel for the improvement-Semantic relation

Jan 6 2014 ISI Kolkata ML for MT 88

A recent study

PAN Indian SMT

Jan 6 2014 ISI Kolkata ML for MT 89

Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator

SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi

Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English

Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences

Evaluation with BLEU METEOR scores also show high correlation with BLEU

Jan 6 2014 ISI Kolkata ML for MT 90

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Why is MT difficult Language Divergence

One of the main complexities of MT Language Divergence

Languages have different ways of expressing meaning

Lexico-Semantic Divergence

Structural Divergence

Our work on English-IL Language Divergence with illustrations from Hindi (Dave Parikh Bhattacharyya Journal of MT 2002)

Jan 6 2014 ISI Kolkata ML for MT 7

Languages differ in expressing thoughts Agglutination

Finnish ldquoistahtaisinkohanrdquo

English I wonder if I should sit down for a whileldquo

Analysis

ist + sit verb stem

ahta + verb derivation morpheme to do something for a while

isi + conditional affix

n + 1st person singular suffix

ko + question particle

han a particle for things like reminder (with declaratives) or softening (with questions and imperatives)

Jan 6 2014 ISI Kolkata ML for MT 8

Language Divergence Theory Lexico-

Semantic Divergences (few examples)

Conflational divergence F vomir E to be sick E stab H chure se maaranaa (knife-with hit) S Utrymningsplan E escape plan

Categorial divergence

Change is in POS category

The play is on_PREP (vs The play is Sunday) Khel chal_rahaa_haai_VM (vs khel ravivaar ko haai)

Jan 6 2014 ISI Kolkata ML for MT 9

Language Divergence Theory Structural Divergences

SVOSOV E Peter plays basketball H piitar basketball kheltaa haai

Head swapping divergence E Prime Minister of India H bhaarat ke pradhaan mantrii (India-of Prime

Minister)

Jan 6 2014 ISI Kolkata ML for MT 10

Language Divergence Theory Syntactic

Divergences (few examples)

Constituent Order divergence

E Singh the PM of India will address the nation today

H bhaarat ke pradhaan mantrii singh hellip (India-of PM Singhhellip)

Adjunction Divergence

E She will visit here in the summer

H vah yahaa garmii meM aayegii (she here summer-in will come)

Preposition-Stranding divergence

E Who do you want to go with

H kisake saath aap jaanaa chaahate ho (who withhellip)

Jan 6 2014 ISI Kolkata ML for MT 11

Vauquois Triangle

Jan 6 2014 ISI Kolkata ML for MT 12

Kinds of MT Systems (point of entry from source to the target text)

Jan 6 2014 ISI Kolkata ML for MT 13

Illustration of transfer SVOSOV

S

NP VP

V N NP

N John eats

bread

S

NP VP

V N

John eats

NP

N

bread

(transfer svo sov)

Jan 6 2014 ISI Kolkata ML for MT 14

Universality hypothesis

Universality hypothesis At the

level of ldquodeep meaningrdquo all texts

are the ldquosamerdquo whatever the

language

Jan 6 2014 ISI Kolkata ML for MT 15

Understanding the Analysis-Transfer-Generation over Vauquois triangle (14)

H11 सरकार_न चनावो_क_बाद मबई म करो_क_माधयम_स

अपन राजसव_को बढ़ाया | T11 Sarkaar ne chunaawo ke baad Mumbai me karoM ke

maadhyam se apne raajaswa ko badhaayaa

G11 Government_(ergative) elections_after Mumbai_in

taxes_through its revenue_(accusative) increased

E11 The Government increased its revenue after the

elections through taxes in Mumbai

Jan 6 2014 ISI Kolkata ML for MT 16

Interlingual representation complete disambiguation

Washington voted Washington to power Vote

past

Washington power Washington

emphasis

ltis-a gt action

ltis-a gt place

ltis-a gt capability ltis-a gt hellip

ltis-a gt person

goal

Jan 6 2014 ISI Kolkata ML for MT 17

Kinds of disambiguation needed for a complete and correct interlingua graph

N Name

P POS

A Attachment

S Sense

C Co-reference

R Semantic Role

Jan 6 2014 ISI Kolkata ML for MT 18

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech Noun or Verb

Jan 6 2014 ISI Kolkata ML for MT 19

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech

NER

John is the name of a PERSON

Jan 6 2014 ISI Kolkata ML for MT 20

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech

NER

WSD Financial

bank or River bank

Jan 6 2014 ISI Kolkata ML for MT 21

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech

NER

WSD

Co-reference

ldquoitrdquo ldquobankrdquo

Jan 6 2014 ISI Kolkata ML for MT 22

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech

NER

WSD

Co-reference

Subject Drop

Pro drop (subject ldquoIrdquo)

Jan 6 2014 ISI Kolkata ML for MT 23

Typical NLP tools used

POS tagger

Stanford Named Entity Recognizer

Stanford Dependency Parser

XLE Dependency Parser

Lexical Resource

WordNet

Universal Word Dictionary (UW++)

Jan 6 2014 ISI Kolkata ML for MT 24

System Architecture Stanford

Dependency Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple Sentence Analyser

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

Simple Enco

Simple Enco

Simple Enco

Simple Enco

Simple Enco

Simplifier

Jan 6 2014 ISI Kolkata ML for MT 25

Target Sentence Generation from interlingua

Lexical Transfer

Target Sentence

Generation

Syntax

Planning

Morphological

Synthesis

(WordPhrase

Translation ) (Word form

Generation) (Sequence)

Jan 6 2014 ISI Kolkata ML for MT 26

Generation Architecture Deconversion = Transfer + Generation

Jan 6 2014 ISI Kolkata ML for MT 27

Transfer Based MT

Marathi-Hindi

Jan 6 2014 ISI Kolkata ML for MT 28

Indian Language to Indian Language Machine Translation (ILILMT)

Bidirectional Machine Translation System

Developed for nine Indian language pairs

Approach

Transfer based

Modules developed using both rule based and statistical approach

Jan 6 2014 ISI Kolkata ML for MT 29

Architecture of ILILMT System

Morphological Analyzer

Source Text

POS Tagger

Chunker

Vibhakti Computation

Name Entity Recognizer

Word Sense Disambiguatio

n

Lexical Transfer

Agreement Feature

Interchunk

Word Generator

Intrachunk

Target Text

Analysis

Transfer

Generation

Jan 6 2014 ISI Kolkata ML for MT 30

M-H MT system Evaluation

Subjective evaluation based on machine translation quality

Accuracy calculated based on score given by linguists

S5 Number of score 5 Sentences S4 Number of score 4 sentences

S3 Number of score 3 sentences N Total Number of sentences

Accuracy =

Score 5 Correct Translation

Score 4 Understandable with

minor errors

Score 3 Understandable with

major errors

Score 2 Not Understandable

Score 1 Non sense translation

Jan 6 2014 ISI Kolkata ML for MT 31

Evaluation of Marathi to Hindi MT System

0

02

04

06

08

1

12

MorphAnalyzer

POS Tagger Chunker VibhaktiCompute

WSD LexicalTransfer

WordGenerator

Precision

Recall

Module-wise precision and recall

Jan 6 2014 ISI Kolkata ML for MT 32

Evaluation of Marathi to Hindi MT System (cont)

Subjective evaluation on translation quality Evaluated on 500 web sentences Accuracy calculated based on score given according to the

translation quality Accuracy 6532

Result analysis

Morph POS tagger chunker gives more than 90 precision but Transfer WSD generator modules are below 80 hence degrades MT quality

Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Jan 6 2014 ISI Kolkata ML for MT 33

Statistical Machine Translation

Jan 6 2014 ISI Kolkata ML for MT 34

Czeck-English data

[nesu] ldquoI carryrdquo

[ponese] ldquoHe will carryrdquo

[nese] ldquoHe carriesrdquo

[nesou] ldquoThey carryrdquo

[yedu] ldquoI driverdquo

[plavou] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 35

To translate hellip

I will carry

They drive

He swims

They will drive

Jan 6 2014 ISI Kolkata ML for MT 36

Hindi-English data

[DhotA huM] ldquoI carryrdquo

[DhoegA] ldquoHe will carryrdquo

[DhotA hAi] ldquoHe carriesrdquo

[Dhote hAi] ldquoThey carryrdquo

[chalAtA huM] ldquoI driverdquo

[tErte hEM] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 37

Bangla-English data

[bai] ldquoI carryrdquo

[baibe] ldquoHe will carryrdquo

[bay] ldquoHe carriesrdquo

[bay] ldquoThey carryrdquo

[chAlAi] ldquoI driverdquo

[sAMtrAy] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 38

To translate hellip (repeated)

I will carry

They drive

He swims

They will drive

Jan 6 2014 ISI Kolkata ML for MT 39

Foundation

Data driven approach Goal is to find out the English sentence e

given foreign language sentence f whose p(e|f) is maximum

Translations are generated on the basis of statistical model

Parameters are estimated using bilingual parallel corpora

Jan 6 2014 ISI Kolkata ML for MT 40

SMT Language Model

To detect good English sentences

Probability of an English sentence w1w2 helliphellip wn can be written as

Pr(w1w2 helliphellip wn) = Pr(w1) Pr(w2|w1) Pr(wn|w1 w2 wn-1)

Here Pr(wn|w1 w2 wn-1) is the probability that word wn follows word string w1 w2 wn-1 N-gram model probability

Trigram model probability calculation

Jan 6 2014 ISI Kolkata ML for MT 41

SMT Translation Model

P(f|e) Probability of some f given hypothesis English translation e

How to assign the values to p(e|f)

Sentences are infinite not possible to find pair(ef) for all sentences

Introduce a hidden variable a that represents alignments between the individual words in the sentence pair

Sentence level

Word level

Jan 6 2014 ISI Kolkata ML for MT 42

Alignment

If the string e= e1l= e1 e2 hellipel has l words and

the string f= f1m=f1f2fm has m words

then the alignment a can be represented by a series a1

m= a1a2am of m values each between 0 and l such that if the word in position j of the f-string is connected to the word in position i of the e-string then aj= i and

if it is not connected to any English word then aj= O

Jan 6 2014 ISI Kolkata ML for MT 43

Example of alignment

English Ram went to school

Hindi Raama paathashaalaa gayaa

Ram went to school

ltNullgt Raama paathashaalaa gayaa

Jan 6 2014 ISI Kolkata ML for MT 44

Translation Model Exact expression

Five models for estimating parameters in the expression [2]

Model-1 Model-2 Model-3 Model-4 Model-5

Choose alignment given e and m

Choose the identity of foreign word given e m a

Choose the length of foreign language string given e

Jan 6 2014 ISI Kolkata ML for MT 45

a

eafef )|Pr()|Pr(

m

emafeaf )|Pr()|Pr(

m

emafememaf )|Pr()|Pr()|Pr(

m

emafem )|Pr()|Pr(

m

m

j

jj

jj emfaafem1

1

1

1

1 )|Pr()|Pr(

m

j

jj

j

jj

j

m

emfafemfaaem1

1

11

1

1

1

1 )|Pr()|Pr()|Pr(

)|Pr( emaf )|Pr( em

m

j

jj

j

jj

j emfafemfaa1

1

11

1

1

1

1 )|Pr()|Pr(

Proof of Translation Model Exact

expression

m is fixed for a particular f hence

marginalization

marginalization

Jan 6 2014 ISI Kolkata ML for MT 46

Alignment

Jan 6 2014 ISI Kolkata ML for MT 47

Fundamental and ubiquitous

Spell checking

Translation

Transliteration

Speech to text

Text to speeh

Jan 6 2014 ISI Kolkata ML for MT 48

EM for word alignment from

sentence alignment example

English

(1) three rabbits

a b

(2) rabbits of Grenoble

b c d

French

(1) trois lapins

w x

(2) lapins de Grenoble

x y z

Jan 6 2014 ISI Kolkata ML for MT 49

Initial Probabilities

each cell denotes t(a w) t(a x) etc

a b c d

w 14 14 14 14

x 14 14 14 14

y 14 14 14 14

z 14 14 14 14

The counts in IBM Model 1

Works by maximizing P(f|e) over the entire corpus

For IBM Model 1 we get the following relationship

c (w f |w e f e ) =t (w f |w e )

t (w f |we

0 ) +hellip + t (w f |wel )

c (w f |w e f e ) is the fractional count of the alignment of w f

with w e in f and e

t (w f |w e ) is the probability of w f being the translation of w e

is the count of w f in f

is the count of w e in e

Jan 6 2014 ISI Kolkata ML for MT 51

Example of expected count

C[aw (a b)(w x)]

t(aw)

= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)

t(aw)+t(ax)

14

= ----------------- X 1 X 1= 12

14+14

Jan 6 2014 ISI Kolkata ML for MT 52

ldquocountsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 13 13 13

y 0 13 13 13

z 0 13 13 13

a b

w x

a b c d

w 12 12 0 0

x 12 12 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 53

Revised probability example

trevised(a w)

12

= -------------------------------------------------------------------

(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)

Jan 6 2014 ISI Kolkata ML for MT 54

Revised probabilities table

a b c d

w 12 14 0 0

x 12 512 13 13

y 0 16 13 13

z 0 16 13 13

ldquorevised countsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 59 13 13

y 0 29 13 13

z 0 29 13 13

a b

w x

a b c d

w 12 38 0 0

x 12 58 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 56

Re-Revised probabilities table a b c d

w 12 316 0 0

x 12 85144 13 13

y 0 19 13 13

z 0 19 13 13

Continue until convergence notice that (bx) binding gets progressively stronger

b=rabbits x=lapins

Derivation of EM based Alignment Expressions

Hindi)(Say language ofy vocabular

English)(Say language ofry vocalbula

2

1

LV

LV

F

E

what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet

जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet

E1

F1

E2

F2

Jan 6 2014 ISI Kolkata ML for MT 58

Vocabulary mapping

Vocabulary VE VF

what is in a name that which we call rose by any other will smell as sweet

naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii

Jan 6 2014 ISI Kolkata ML for MT 59

Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like

11989011 11989012 hellip 119890

11198971 119891

11 11989112 hellip 119891

11198981

11989021 11989022 hellip 119890

21198972 119891

21 11989122 hellip 119891

21198982

1198901199041 119890

1199042 hellip 119890

119904119897119904 119891

1199041 1198911199042 hellip 119891

119904119898119904

1198901198781 119890

1198782 hellip 119890

119878119897119878 119891

1198781 1198911198782 hellip 119891

119878119898119878

No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890

119904119901 =Index of English word 119890119904119901in English vocabularydictionary

119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary

(Thanks to Sachin Pawar for helping with the maths formulae processing)

Jan 6 2014 ISI Kolkata ML for MT 60

Hidden variables and parameters

Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878

119904=1 where each hidden variable is

as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French

word 119911119901119902119904 = 0 otherwise

Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893

word in French vocabulary

Jan 6 2014 ISI Kolkata ML for MT 61

Likelihoods Data Likelihood L(D Θ)

Data Log-Likelihood LL(D Θ)

Expected value of Data Log-Likelihood E(LL(D Θ))

Jan 6 2014 ISI Kolkata ML for MT 62

Constraint and Lagrangian

119875119894119895

119881119865

119895=1

= 1 forall119894

Jan 6 2014 ISI Kolkata ML for MT 63

Differentiating wrt Pij

Jan 6 2014 ISI Kolkata ML for MT 64

Final E and M steps

M-step

E-step

Jan 6 2014 ISI Kolkata ML for MT 65

Combinatorial considerations

Jan 6 2014 ISI Kolkata ML for MT 66

Example

ISI Kolkata ML for MT Jan 6 2014 67

All possible alignments

ISI Kolkata ML for MT Jan 6 2014 68

First fundamental requirement of SMT

Alignment requires evidence of

bull firstly a translation pair to introduce the

POSSIBILITY of a mapping

bull then another pair to establish with

CERTAINTY the mapping

ISI Kolkata ML for MT Jan 6 2014 69

For the ldquocertaintyrdquo

We have a translation pair containing alignment candidates and none of the other words in the translation pair

OR

We have a translation pair containing all words in the translation pair except the alignment candidates

ISI Kolkata ML for MT Jan 6 2014 70

Thereforehellip

If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty

ISI Kolkata ML for MT Jan 6 2014 71

Rough estimate of data requirement

SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge

ie no meanings or grammatical properties of any words phrases or sentences

Each language has a vocabulary of 100000 words can give rise to about 500000 word forms

through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟

bdquowent‟ and bdquogone‟

ISI Kolkata ML for MT Jan 6 2014 72

Reasons for mapping to multiple words

Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register

polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)

syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood

ISI Kolkata ML for MT Jan 6 2014 73

Estimate of corpora requirement

Assume that on an average a sentence is 10 words long

an additional 9 translation pairs for getting at one of the 5 mappings

10 sentences per mapping per word

a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences

Estimate is not wide off the mark

Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs

ISI Kolkata ML for MT Jan 6 2014 74

Our work on factor based SMT

Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009

Jan 6 2014 ISI Kolkata ML for MT 75

Case Marker and Morphology crucial in E-H MT

Order of magnitiude facelift in Fluency and fidelity

Determined by the combination of suffixes and semantic relations on the English side

Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers

Jan 6 2014 ISI Kolkata ML for MT 76

Semantic relations+SuffixesCase

Markers+inflections

I ate mangoes

I ltagt ate eatpast mangoes ltobj

I ltagt mangoes ltobjpl eatpast

mei_ne aam khaa_yaa

Jan 6 2014 ISI Kolkata ML for MT 77

Our Approach

Factored model (Koehn and Hoang 2007) with the following translation factor

suffix + semantic relation case markersuffix

Experiments with the following relations

Dependency relations from the stanford parser

Deeper semantic roles from Universal Networking Language (UNL)

Jan 6 2014 ISI Kolkata ML for MT 78

Our Factorization

Jan 6 2014 ISI Kolkata ML for MT 79

Experiments

Jan 6 2014 ISI Kolkata ML for MT 80

Corpus Statistics

Jan 6 2014 ISI Kolkata ML for MT 81

Results The impact of suffix and semantic factors

Jan 6 2014 ISI Kolkata ML for MT 82

Results The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 83

Subjective Evaluation The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 84

Impact of sentence length (F Fluency AAdequacy E Errors)

Jan 6 2014 ISI Kolkata ML for MT 85

A feel for the improvement-baseline

Jan 6 2014 ISI Kolkata ML for MT 86

A feel for the improvement-reorder

Jan 6 2014 ISI Kolkata ML for MT 87

A feel for the improvement-Semantic relation

Jan 6 2014 ISI Kolkata ML for MT 88

A recent study

PAN Indian SMT

Jan 6 2014 ISI Kolkata ML for MT 89

Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator

SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi

Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English

Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences

Evaluation with BLEU METEOR scores also show high correlation with BLEU

Jan 6 2014 ISI Kolkata ML for MT 90

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Languages differ in expressing thoughts Agglutination

Finnish ldquoistahtaisinkohanrdquo

English I wonder if I should sit down for a whileldquo

Analysis

ist + sit verb stem

ahta + verb derivation morpheme to do something for a while

isi + conditional affix

n + 1st person singular suffix

ko + question particle

han a particle for things like reminder (with declaratives) or softening (with questions and imperatives)

Jan 6 2014 ISI Kolkata ML for MT 8

Language Divergence Theory Lexico-

Semantic Divergences (few examples)

Conflational divergence F vomir E to be sick E stab H chure se maaranaa (knife-with hit) S Utrymningsplan E escape plan

Categorial divergence

Change is in POS category

The play is on_PREP (vs The play is Sunday) Khel chal_rahaa_haai_VM (vs khel ravivaar ko haai)

Jan 6 2014 ISI Kolkata ML for MT 9

Language Divergence Theory Structural Divergences

SVOSOV E Peter plays basketball H piitar basketball kheltaa haai

Head swapping divergence E Prime Minister of India H bhaarat ke pradhaan mantrii (India-of Prime

Minister)

Jan 6 2014 ISI Kolkata ML for MT 10

Language Divergence Theory Syntactic

Divergences (few examples)

Constituent Order divergence

E Singh the PM of India will address the nation today

H bhaarat ke pradhaan mantrii singh hellip (India-of PM Singhhellip)

Adjunction Divergence

E She will visit here in the summer

H vah yahaa garmii meM aayegii (she here summer-in will come)

Preposition-Stranding divergence

E Who do you want to go with

H kisake saath aap jaanaa chaahate ho (who withhellip)

Jan 6 2014 ISI Kolkata ML for MT 11

Vauquois Triangle

Jan 6 2014 ISI Kolkata ML for MT 12

Kinds of MT Systems (point of entry from source to the target text)

Jan 6 2014 ISI Kolkata ML for MT 13

Illustration of transfer SVOSOV

S

NP VP

V N NP

N John eats

bread

S

NP VP

V N

John eats

NP

N

bread

(transfer svo sov)

Jan 6 2014 ISI Kolkata ML for MT 14

Universality hypothesis

Universality hypothesis At the

level of ldquodeep meaningrdquo all texts

are the ldquosamerdquo whatever the

language

Jan 6 2014 ISI Kolkata ML for MT 15

Understanding the Analysis-Transfer-Generation over Vauquois triangle (14)

H11 सरकार_न चनावो_क_बाद मबई म करो_क_माधयम_स

अपन राजसव_को बढ़ाया | T11 Sarkaar ne chunaawo ke baad Mumbai me karoM ke

maadhyam se apne raajaswa ko badhaayaa

G11 Government_(ergative) elections_after Mumbai_in

taxes_through its revenue_(accusative) increased

E11 The Government increased its revenue after the

elections through taxes in Mumbai

Jan 6 2014 ISI Kolkata ML for MT 16

Interlingual representation complete disambiguation

Washington voted Washington to power Vote

past

Washington power Washington

emphasis

ltis-a gt action

ltis-a gt place

ltis-a gt capability ltis-a gt hellip

ltis-a gt person

goal

Jan 6 2014 ISI Kolkata ML for MT 17

Kinds of disambiguation needed for a complete and correct interlingua graph

N Name

P POS

A Attachment

S Sense

C Co-reference

R Semantic Role

Jan 6 2014 ISI Kolkata ML for MT 18

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech Noun or Verb

Jan 6 2014 ISI Kolkata ML for MT 19

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech

NER

John is the name of a PERSON

Jan 6 2014 ISI Kolkata ML for MT 20

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech

NER

WSD Financial

bank or River bank

Jan 6 2014 ISI Kolkata ML for MT 21

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech

NER

WSD

Co-reference

ldquoitrdquo ldquobankrdquo

Jan 6 2014 ISI Kolkata ML for MT 22

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech

NER

WSD

Co-reference

Subject Drop

Pro drop (subject ldquoIrdquo)

Jan 6 2014 ISI Kolkata ML for MT 23

Typical NLP tools used

POS tagger

Stanford Named Entity Recognizer

Stanford Dependency Parser

XLE Dependency Parser

Lexical Resource

WordNet

Universal Word Dictionary (UW++)

Jan 6 2014 ISI Kolkata ML for MT 24

System Architecture Stanford

Dependency Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple Sentence Analyser

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

Simple Enco

Simple Enco

Simple Enco

Simple Enco

Simple Enco

Simplifier

Jan 6 2014 ISI Kolkata ML for MT 25

Target Sentence Generation from interlingua

Lexical Transfer

Target Sentence

Generation

Syntax

Planning

Morphological

Synthesis

(WordPhrase

Translation ) (Word form

Generation) (Sequence)

Jan 6 2014 ISI Kolkata ML for MT 26

Generation Architecture Deconversion = Transfer + Generation

Jan 6 2014 ISI Kolkata ML for MT 27

Transfer Based MT

Marathi-Hindi

Jan 6 2014 ISI Kolkata ML for MT 28

Indian Language to Indian Language Machine Translation (ILILMT)

Bidirectional Machine Translation System

Developed for nine Indian language pairs

Approach

Transfer based

Modules developed using both rule based and statistical approach

Jan 6 2014 ISI Kolkata ML for MT 29

Architecture of ILILMT System

Morphological Analyzer

Source Text

POS Tagger

Chunker

Vibhakti Computation

Name Entity Recognizer

Word Sense Disambiguatio

n

Lexical Transfer

Agreement Feature

Interchunk

Word Generator

Intrachunk

Target Text

Analysis

Transfer

Generation

Jan 6 2014 ISI Kolkata ML for MT 30

M-H MT system Evaluation

Subjective evaluation based on machine translation quality

Accuracy calculated based on score given by linguists

S5 Number of score 5 Sentences S4 Number of score 4 sentences

S3 Number of score 3 sentences N Total Number of sentences

Accuracy =

Score 5 Correct Translation

Score 4 Understandable with

minor errors

Score 3 Understandable with

major errors

Score 2 Not Understandable

Score 1 Non sense translation

Jan 6 2014 ISI Kolkata ML for MT 31

Evaluation of Marathi to Hindi MT System

0

02

04

06

08

1

12

MorphAnalyzer

POS Tagger Chunker VibhaktiCompute

WSD LexicalTransfer

WordGenerator

Precision

Recall

Module-wise precision and recall

Jan 6 2014 ISI Kolkata ML for MT 32

Evaluation of Marathi to Hindi MT System (cont)

Subjective evaluation on translation quality Evaluated on 500 web sentences Accuracy calculated based on score given according to the

translation quality Accuracy 6532

Result analysis

Morph POS tagger chunker gives more than 90 precision but Transfer WSD generator modules are below 80 hence degrades MT quality

Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Jan 6 2014 ISI Kolkata ML for MT 33

Statistical Machine Translation

Jan 6 2014 ISI Kolkata ML for MT 34

Czeck-English data

[nesu] ldquoI carryrdquo

[ponese] ldquoHe will carryrdquo

[nese] ldquoHe carriesrdquo

[nesou] ldquoThey carryrdquo

[yedu] ldquoI driverdquo

[plavou] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 35

To translate hellip

I will carry

They drive

He swims

They will drive

Jan 6 2014 ISI Kolkata ML for MT 36

Hindi-English data

[DhotA huM] ldquoI carryrdquo

[DhoegA] ldquoHe will carryrdquo

[DhotA hAi] ldquoHe carriesrdquo

[Dhote hAi] ldquoThey carryrdquo

[chalAtA huM] ldquoI driverdquo

[tErte hEM] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 37

Bangla-English data

[bai] ldquoI carryrdquo

[baibe] ldquoHe will carryrdquo

[bay] ldquoHe carriesrdquo

[bay] ldquoThey carryrdquo

[chAlAi] ldquoI driverdquo

[sAMtrAy] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 38

To translate hellip (repeated)

I will carry

They drive

He swims

They will drive

Jan 6 2014 ISI Kolkata ML for MT 39

Foundation

Data driven approach Goal is to find out the English sentence e

given foreign language sentence f whose p(e|f) is maximum

Translations are generated on the basis of statistical model

Parameters are estimated using bilingual parallel corpora

Jan 6 2014 ISI Kolkata ML for MT 40

SMT Language Model

To detect good English sentences

Probability of an English sentence w1w2 helliphellip wn can be written as

Pr(w1w2 helliphellip wn) = Pr(w1) Pr(w2|w1) Pr(wn|w1 w2 wn-1)

Here Pr(wn|w1 w2 wn-1) is the probability that word wn follows word string w1 w2 wn-1 N-gram model probability

Trigram model probability calculation

Jan 6 2014 ISI Kolkata ML for MT 41

SMT Translation Model

P(f|e) Probability of some f given hypothesis English translation e

How to assign the values to p(e|f)

Sentences are infinite not possible to find pair(ef) for all sentences

Introduce a hidden variable a that represents alignments between the individual words in the sentence pair

Sentence level

Word level

Jan 6 2014 ISI Kolkata ML for MT 42

Alignment

If the string e= e1l= e1 e2 hellipel has l words and

the string f= f1m=f1f2fm has m words

then the alignment a can be represented by a series a1

m= a1a2am of m values each between 0 and l such that if the word in position j of the f-string is connected to the word in position i of the e-string then aj= i and

if it is not connected to any English word then aj= O

Jan 6 2014 ISI Kolkata ML for MT 43

Example of alignment

English Ram went to school

Hindi Raama paathashaalaa gayaa

Ram went to school

ltNullgt Raama paathashaalaa gayaa

Jan 6 2014 ISI Kolkata ML for MT 44

Translation Model Exact expression

Five models for estimating parameters in the expression [2]

Model-1 Model-2 Model-3 Model-4 Model-5

Choose alignment given e and m

Choose the identity of foreign word given e m a

Choose the length of foreign language string given e

Jan 6 2014 ISI Kolkata ML for MT 45

a

eafef )|Pr()|Pr(

m

emafeaf )|Pr()|Pr(

m

emafememaf )|Pr()|Pr()|Pr(

m

emafem )|Pr()|Pr(

m

m

j

jj

jj emfaafem1

1

1

1

1 )|Pr()|Pr(

m

j

jj

j

jj

j

m

emfafemfaaem1

1

11

1

1

1

1 )|Pr()|Pr()|Pr(

)|Pr( emaf )|Pr( em

m

j

jj

j

jj

j emfafemfaa1

1

11

1

1

1

1 )|Pr()|Pr(

Proof of Translation Model Exact

expression

m is fixed for a particular f hence

marginalization

marginalization

Jan 6 2014 ISI Kolkata ML for MT 46

Alignment

Jan 6 2014 ISI Kolkata ML for MT 47

Fundamental and ubiquitous

Spell checking

Translation

Transliteration

Speech to text

Text to speeh

Jan 6 2014 ISI Kolkata ML for MT 48

EM for word alignment from

sentence alignment example

English

(1) three rabbits

a b

(2) rabbits of Grenoble

b c d

French

(1) trois lapins

w x

(2) lapins de Grenoble

x y z

Jan 6 2014 ISI Kolkata ML for MT 49

Initial Probabilities

each cell denotes t(a w) t(a x) etc

a b c d

w 14 14 14 14

x 14 14 14 14

y 14 14 14 14

z 14 14 14 14

The counts in IBM Model 1

Works by maximizing P(f|e) over the entire corpus

For IBM Model 1 we get the following relationship

c (w f |w e f e ) =t (w f |w e )

t (w f |we

0 ) +hellip + t (w f |wel )

c (w f |w e f e ) is the fractional count of the alignment of w f

with w e in f and e

t (w f |w e ) is the probability of w f being the translation of w e

is the count of w f in f

is the count of w e in e

Jan 6 2014 ISI Kolkata ML for MT 51

Example of expected count

C[aw (a b)(w x)]

t(aw)

= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)

t(aw)+t(ax)

14

= ----------------- X 1 X 1= 12

14+14

Jan 6 2014 ISI Kolkata ML for MT 52

ldquocountsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 13 13 13

y 0 13 13 13

z 0 13 13 13

a b

w x

a b c d

w 12 12 0 0

x 12 12 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 53

Revised probability example

trevised(a w)

12

= -------------------------------------------------------------------

(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)

Jan 6 2014 ISI Kolkata ML for MT 54

Revised probabilities table

a b c d

w 12 14 0 0

x 12 512 13 13

y 0 16 13 13

z 0 16 13 13

ldquorevised countsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 59 13 13

y 0 29 13 13

z 0 29 13 13

a b

w x

a b c d

w 12 38 0 0

x 12 58 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 56

Re-Revised probabilities table a b c d

w 12 316 0 0

x 12 85144 13 13

y 0 19 13 13

z 0 19 13 13

Continue until convergence notice that (bx) binding gets progressively stronger

b=rabbits x=lapins

Derivation of EM based Alignment Expressions

Hindi)(Say language ofy vocabular

English)(Say language ofry vocalbula

2

1

LV

LV

F

E

what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet

जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet

E1

F1

E2

F2

Jan 6 2014 ISI Kolkata ML for MT 58

Vocabulary mapping

Vocabulary VE VF

what is in a name that which we call rose by any other will smell as sweet

naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii

Jan 6 2014 ISI Kolkata ML for MT 59

Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like

11989011 11989012 hellip 119890

11198971 119891

11 11989112 hellip 119891

11198981

11989021 11989022 hellip 119890

21198972 119891

21 11989122 hellip 119891

21198982

1198901199041 119890

1199042 hellip 119890

119904119897119904 119891

1199041 1198911199042 hellip 119891

119904119898119904

1198901198781 119890

1198782 hellip 119890

119878119897119878 119891

1198781 1198911198782 hellip 119891

119878119898119878

No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890

119904119901 =Index of English word 119890119904119901in English vocabularydictionary

119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary

(Thanks to Sachin Pawar for helping with the maths formulae processing)

Jan 6 2014 ISI Kolkata ML for MT 60

Hidden variables and parameters

Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878

119904=1 where each hidden variable is

as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French

word 119911119901119902119904 = 0 otherwise

Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893

word in French vocabulary

Jan 6 2014 ISI Kolkata ML for MT 61

Likelihoods Data Likelihood L(D Θ)

Data Log-Likelihood LL(D Θ)

Expected value of Data Log-Likelihood E(LL(D Θ))

Jan 6 2014 ISI Kolkata ML for MT 62

Constraint and Lagrangian

119875119894119895

119881119865

119895=1

= 1 forall119894

Jan 6 2014 ISI Kolkata ML for MT 63

Differentiating wrt Pij

Jan 6 2014 ISI Kolkata ML for MT 64

Final E and M steps

M-step

E-step

Jan 6 2014 ISI Kolkata ML for MT 65

Combinatorial considerations

Jan 6 2014 ISI Kolkata ML for MT 66

Example

ISI Kolkata ML for MT Jan 6 2014 67

All possible alignments

ISI Kolkata ML for MT Jan 6 2014 68

First fundamental requirement of SMT

Alignment requires evidence of

bull firstly a translation pair to introduce the

POSSIBILITY of a mapping

bull then another pair to establish with

CERTAINTY the mapping

ISI Kolkata ML for MT Jan 6 2014 69

For the ldquocertaintyrdquo

We have a translation pair containing alignment candidates and none of the other words in the translation pair

OR

We have a translation pair containing all words in the translation pair except the alignment candidates

ISI Kolkata ML for MT Jan 6 2014 70

Thereforehellip

If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty

ISI Kolkata ML for MT Jan 6 2014 71

Rough estimate of data requirement

SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge

ie no meanings or grammatical properties of any words phrases or sentences

Each language has a vocabulary of 100000 words can give rise to about 500000 word forms

through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟

bdquowent‟ and bdquogone‟

ISI Kolkata ML for MT Jan 6 2014 72

Reasons for mapping to multiple words

Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register

polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)

syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood

ISI Kolkata ML for MT Jan 6 2014 73

Estimate of corpora requirement

Assume that on an average a sentence is 10 words long

an additional 9 translation pairs for getting at one of the 5 mappings

10 sentences per mapping per word

a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences

Estimate is not wide off the mark

Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs

ISI Kolkata ML for MT Jan 6 2014 74

Our work on factor based SMT

Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009

Jan 6 2014 ISI Kolkata ML for MT 75

Case Marker and Morphology crucial in E-H MT

Order of magnitiude facelift in Fluency and fidelity

Determined by the combination of suffixes and semantic relations on the English side

Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers

Jan 6 2014 ISI Kolkata ML for MT 76

Semantic relations+SuffixesCase

Markers+inflections

I ate mangoes

I ltagt ate eatpast mangoes ltobj

I ltagt mangoes ltobjpl eatpast

mei_ne aam khaa_yaa

Jan 6 2014 ISI Kolkata ML for MT 77

Our Approach

Factored model (Koehn and Hoang 2007) with the following translation factor

suffix + semantic relation case markersuffix

Experiments with the following relations

Dependency relations from the stanford parser

Deeper semantic roles from Universal Networking Language (UNL)

Jan 6 2014 ISI Kolkata ML for MT 78

Our Factorization

Jan 6 2014 ISI Kolkata ML for MT 79

Experiments

Jan 6 2014 ISI Kolkata ML for MT 80

Corpus Statistics

Jan 6 2014 ISI Kolkata ML for MT 81

Results The impact of suffix and semantic factors

Jan 6 2014 ISI Kolkata ML for MT 82

Results The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 83

Subjective Evaluation The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 84

Impact of sentence length (F Fluency AAdequacy E Errors)

Jan 6 2014 ISI Kolkata ML for MT 85

A feel for the improvement-baseline

Jan 6 2014 ISI Kolkata ML for MT 86

A feel for the improvement-reorder

Jan 6 2014 ISI Kolkata ML for MT 87

A feel for the improvement-Semantic relation

Jan 6 2014 ISI Kolkata ML for MT 88

A recent study

PAN Indian SMT

Jan 6 2014 ISI Kolkata ML for MT 89

Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator

SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi

Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English

Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences

Evaluation with BLEU METEOR scores also show high correlation with BLEU

Jan 6 2014 ISI Kolkata ML for MT 90

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Language Divergence Theory Lexico-

Semantic Divergences (few examples)

Conflational divergence F vomir E to be sick E stab H chure se maaranaa (knife-with hit) S Utrymningsplan E escape plan

Categorial divergence

Change is in POS category

The play is on_PREP (vs The play is Sunday) Khel chal_rahaa_haai_VM (vs khel ravivaar ko haai)

Jan 6 2014 ISI Kolkata ML for MT 9

Language Divergence Theory Structural Divergences

SVOSOV E Peter plays basketball H piitar basketball kheltaa haai

Head swapping divergence E Prime Minister of India H bhaarat ke pradhaan mantrii (India-of Prime

Minister)

Jan 6 2014 ISI Kolkata ML for MT 10

Language Divergence Theory Syntactic

Divergences (few examples)

Constituent Order divergence

E Singh the PM of India will address the nation today

H bhaarat ke pradhaan mantrii singh hellip (India-of PM Singhhellip)

Adjunction Divergence

E She will visit here in the summer

H vah yahaa garmii meM aayegii (she here summer-in will come)

Preposition-Stranding divergence

E Who do you want to go with

H kisake saath aap jaanaa chaahate ho (who withhellip)

Jan 6 2014 ISI Kolkata ML for MT 11

Vauquois Triangle

Jan 6 2014 ISI Kolkata ML for MT 12

Kinds of MT Systems (point of entry from source to the target text)

Jan 6 2014 ISI Kolkata ML for MT 13

Illustration of transfer SVOSOV

S

NP VP

V N NP

N John eats

bread

S

NP VP

V N

John eats

NP

N

bread

(transfer svo sov)

Jan 6 2014 ISI Kolkata ML for MT 14

Universality hypothesis

Universality hypothesis At the

level of ldquodeep meaningrdquo all texts

are the ldquosamerdquo whatever the

language

Jan 6 2014 ISI Kolkata ML for MT 15

Understanding the Analysis-Transfer-Generation over Vauquois triangle (14)

H11 सरकार_न चनावो_क_बाद मबई म करो_क_माधयम_स

अपन राजसव_को बढ़ाया | T11 Sarkaar ne chunaawo ke baad Mumbai me karoM ke

maadhyam se apne raajaswa ko badhaayaa

G11 Government_(ergative) elections_after Mumbai_in

taxes_through its revenue_(accusative) increased

E11 The Government increased its revenue after the

elections through taxes in Mumbai

Jan 6 2014 ISI Kolkata ML for MT 16

Interlingual representation complete disambiguation

Washington voted Washington to power Vote

past

Washington power Washington

emphasis

ltis-a gt action

ltis-a gt place

ltis-a gt capability ltis-a gt hellip

ltis-a gt person

goal

Jan 6 2014 ISI Kolkata ML for MT 17

Kinds of disambiguation needed for a complete and correct interlingua graph

N Name

P POS

A Attachment

S Sense

C Co-reference

R Semantic Role

Jan 6 2014 ISI Kolkata ML for MT 18

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech Noun or Verb

Jan 6 2014 ISI Kolkata ML for MT 19

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech

NER

John is the name of a PERSON

Jan 6 2014 ISI Kolkata ML for MT 20

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech

NER

WSD Financial

bank or River bank

Jan 6 2014 ISI Kolkata ML for MT 21

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech

NER

WSD

Co-reference

ldquoitrdquo ldquobankrdquo

Jan 6 2014 ISI Kolkata ML for MT 22

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech

NER

WSD

Co-reference

Subject Drop

Pro drop (subject ldquoIrdquo)

Jan 6 2014 ISI Kolkata ML for MT 23

Typical NLP tools used

POS tagger

Stanford Named Entity Recognizer

Stanford Dependency Parser

XLE Dependency Parser

Lexical Resource

WordNet

Universal Word Dictionary (UW++)

Jan 6 2014 ISI Kolkata ML for MT 24

System Architecture Stanford

Dependency Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple Sentence Analyser

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

Simple Enco

Simple Enco

Simple Enco

Simple Enco

Simple Enco

Simplifier

Jan 6 2014 ISI Kolkata ML for MT 25

Target Sentence Generation from interlingua

Lexical Transfer

Target Sentence

Generation

Syntax

Planning

Morphological

Synthesis

(WordPhrase

Translation ) (Word form

Generation) (Sequence)

Jan 6 2014 ISI Kolkata ML for MT 26

Generation Architecture Deconversion = Transfer + Generation

Jan 6 2014 ISI Kolkata ML for MT 27

Transfer Based MT

Marathi-Hindi

Jan 6 2014 ISI Kolkata ML for MT 28

Indian Language to Indian Language Machine Translation (ILILMT)

Bidirectional Machine Translation System

Developed for nine Indian language pairs

Approach

Transfer based

Modules developed using both rule based and statistical approach

Jan 6 2014 ISI Kolkata ML for MT 29

Architecture of ILILMT System

Morphological Analyzer

Source Text

POS Tagger

Chunker

Vibhakti Computation

Name Entity Recognizer

Word Sense Disambiguatio

n

Lexical Transfer

Agreement Feature

Interchunk

Word Generator

Intrachunk

Target Text

Analysis

Transfer

Generation

Jan 6 2014 ISI Kolkata ML for MT 30

M-H MT system Evaluation

Subjective evaluation based on machine translation quality

Accuracy calculated based on score given by linguists

S5 Number of score 5 Sentences S4 Number of score 4 sentences

S3 Number of score 3 sentences N Total Number of sentences

Accuracy =

Score 5 Correct Translation

Score 4 Understandable with

minor errors

Score 3 Understandable with

major errors

Score 2 Not Understandable

Score 1 Non sense translation

Jan 6 2014 ISI Kolkata ML for MT 31

Evaluation of Marathi to Hindi MT System

0

02

04

06

08

1

12

MorphAnalyzer

POS Tagger Chunker VibhaktiCompute

WSD LexicalTransfer

WordGenerator

Precision

Recall

Module-wise precision and recall

Jan 6 2014 ISI Kolkata ML for MT 32

Evaluation of Marathi to Hindi MT System (cont)

Subjective evaluation on translation quality Evaluated on 500 web sentences Accuracy calculated based on score given according to the

translation quality Accuracy 6532

Result analysis

Morph POS tagger chunker gives more than 90 precision but Transfer WSD generator modules are below 80 hence degrades MT quality

Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Jan 6 2014 ISI Kolkata ML for MT 33

Statistical Machine Translation

Jan 6 2014 ISI Kolkata ML for MT 34

Czeck-English data

[nesu] ldquoI carryrdquo

[ponese] ldquoHe will carryrdquo

[nese] ldquoHe carriesrdquo

[nesou] ldquoThey carryrdquo

[yedu] ldquoI driverdquo

[plavou] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 35

To translate hellip

I will carry

They drive

He swims

They will drive

Jan 6 2014 ISI Kolkata ML for MT 36

Hindi-English data

[DhotA huM] ldquoI carryrdquo

[DhoegA] ldquoHe will carryrdquo

[DhotA hAi] ldquoHe carriesrdquo

[Dhote hAi] ldquoThey carryrdquo

[chalAtA huM] ldquoI driverdquo

[tErte hEM] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 37

Bangla-English data

[bai] ldquoI carryrdquo

[baibe] ldquoHe will carryrdquo

[bay] ldquoHe carriesrdquo

[bay] ldquoThey carryrdquo

[chAlAi] ldquoI driverdquo

[sAMtrAy] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 38

To translate hellip (repeated)

I will carry

They drive

He swims

They will drive

Jan 6 2014 ISI Kolkata ML for MT 39

Foundation

Data driven approach Goal is to find out the English sentence e

given foreign language sentence f whose p(e|f) is maximum

Translations are generated on the basis of statistical model

Parameters are estimated using bilingual parallel corpora

Jan 6 2014 ISI Kolkata ML for MT 40

SMT Language Model

To detect good English sentences

Probability of an English sentence w1w2 helliphellip wn can be written as

Pr(w1w2 helliphellip wn) = Pr(w1) Pr(w2|w1) Pr(wn|w1 w2 wn-1)

Here Pr(wn|w1 w2 wn-1) is the probability that word wn follows word string w1 w2 wn-1 N-gram model probability

Trigram model probability calculation

Jan 6 2014 ISI Kolkata ML for MT 41

SMT Translation Model

P(f|e) Probability of some f given hypothesis English translation e

How to assign the values to p(e|f)

Sentences are infinite not possible to find pair(ef) for all sentences

Introduce a hidden variable a that represents alignments between the individual words in the sentence pair

Sentence level

Word level

Jan 6 2014 ISI Kolkata ML for MT 42

Alignment

If the string e= e1l= e1 e2 hellipel has l words and

the string f= f1m=f1f2fm has m words

then the alignment a can be represented by a series a1

m= a1a2am of m values each between 0 and l such that if the word in position j of the f-string is connected to the word in position i of the e-string then aj= i and

if it is not connected to any English word then aj= O

Jan 6 2014 ISI Kolkata ML for MT 43

Example of alignment

English Ram went to school

Hindi Raama paathashaalaa gayaa

Ram went to school

ltNullgt Raama paathashaalaa gayaa

Jan 6 2014 ISI Kolkata ML for MT 44

Translation Model Exact expression

Five models for estimating parameters in the expression [2]

Model-1 Model-2 Model-3 Model-4 Model-5

Choose alignment given e and m

Choose the identity of foreign word given e m a

Choose the length of foreign language string given e

Jan 6 2014 ISI Kolkata ML for MT 45

a

eafef )|Pr()|Pr(

m

emafeaf )|Pr()|Pr(

m

emafememaf )|Pr()|Pr()|Pr(

m

emafem )|Pr()|Pr(

m

m

j

jj

jj emfaafem1

1

1

1

1 )|Pr()|Pr(

m

j

jj

j

jj

j

m

emfafemfaaem1

1

11

1

1

1

1 )|Pr()|Pr()|Pr(

)|Pr( emaf )|Pr( em

m

j

jj

j

jj

j emfafemfaa1

1

11

1

1

1

1 )|Pr()|Pr(

Proof of Translation Model Exact

expression

m is fixed for a particular f hence

marginalization

marginalization

Jan 6 2014 ISI Kolkata ML for MT 46

Alignment

Jan 6 2014 ISI Kolkata ML for MT 47

Fundamental and ubiquitous

Spell checking

Translation

Transliteration

Speech to text

Text to speeh

Jan 6 2014 ISI Kolkata ML for MT 48

EM for word alignment from

sentence alignment example

English

(1) three rabbits

a b

(2) rabbits of Grenoble

b c d

French

(1) trois lapins

w x

(2) lapins de Grenoble

x y z

Jan 6 2014 ISI Kolkata ML for MT 49

Initial Probabilities

each cell denotes t(a w) t(a x) etc

a b c d

w 14 14 14 14

x 14 14 14 14

y 14 14 14 14

z 14 14 14 14

The counts in IBM Model 1

Works by maximizing P(f|e) over the entire corpus

For IBM Model 1 we get the following relationship

c (w f |w e f e ) =t (w f |w e )

t (w f |we

0 ) +hellip + t (w f |wel )

c (w f |w e f e ) is the fractional count of the alignment of w f

with w e in f and e

t (w f |w e ) is the probability of w f being the translation of w e

is the count of w f in f

is the count of w e in e

Jan 6 2014 ISI Kolkata ML for MT 51

Example of expected count

C[aw (a b)(w x)]

t(aw)

= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)

t(aw)+t(ax)

14

= ----------------- X 1 X 1= 12

14+14

Jan 6 2014 ISI Kolkata ML for MT 52

ldquocountsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 13 13 13

y 0 13 13 13

z 0 13 13 13

a b

w x

a b c d

w 12 12 0 0

x 12 12 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 53

Revised probability example

trevised(a w)

12

= -------------------------------------------------------------------

(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)

Jan 6 2014 ISI Kolkata ML for MT 54

Revised probabilities table

a b c d

w 12 14 0 0

x 12 512 13 13

y 0 16 13 13

z 0 16 13 13

ldquorevised countsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 59 13 13

y 0 29 13 13

z 0 29 13 13

a b

w x

a b c d

w 12 38 0 0

x 12 58 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 56

Re-Revised probabilities table a b c d

w 12 316 0 0

x 12 85144 13 13

y 0 19 13 13

z 0 19 13 13

Continue until convergence notice that (bx) binding gets progressively stronger

b=rabbits x=lapins

Derivation of EM based Alignment Expressions

Hindi)(Say language ofy vocabular

English)(Say language ofry vocalbula

2

1

LV

LV

F

E

what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet

जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet

E1

F1

E2

F2

Jan 6 2014 ISI Kolkata ML for MT 58

Vocabulary mapping

Vocabulary VE VF

what is in a name that which we call rose by any other will smell as sweet

naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii

Jan 6 2014 ISI Kolkata ML for MT 59

Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like

11989011 11989012 hellip 119890

11198971 119891

11 11989112 hellip 119891

11198981

11989021 11989022 hellip 119890

21198972 119891

21 11989122 hellip 119891

21198982

1198901199041 119890

1199042 hellip 119890

119904119897119904 119891

1199041 1198911199042 hellip 119891

119904119898119904

1198901198781 119890

1198782 hellip 119890

119878119897119878 119891

1198781 1198911198782 hellip 119891

119878119898119878

No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890

119904119901 =Index of English word 119890119904119901in English vocabularydictionary

119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary

(Thanks to Sachin Pawar for helping with the maths formulae processing)

Jan 6 2014 ISI Kolkata ML for MT 60

Hidden variables and parameters

Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878

119904=1 where each hidden variable is

as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French

word 119911119901119902119904 = 0 otherwise

Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893

word in French vocabulary

Jan 6 2014 ISI Kolkata ML for MT 61

Likelihoods Data Likelihood L(D Θ)

Data Log-Likelihood LL(D Θ)

Expected value of Data Log-Likelihood E(LL(D Θ))

Jan 6 2014 ISI Kolkata ML for MT 62

Constraint and Lagrangian

119875119894119895

119881119865

119895=1

= 1 forall119894

Jan 6 2014 ISI Kolkata ML for MT 63

Differentiating wrt Pij

Jan 6 2014 ISI Kolkata ML for MT 64

Final E and M steps

M-step

E-step

Jan 6 2014 ISI Kolkata ML for MT 65

Combinatorial considerations

Jan 6 2014 ISI Kolkata ML for MT 66

Example

ISI Kolkata ML for MT Jan 6 2014 67

All possible alignments

ISI Kolkata ML for MT Jan 6 2014 68

First fundamental requirement of SMT

Alignment requires evidence of

bull firstly a translation pair to introduce the

POSSIBILITY of a mapping

bull then another pair to establish with

CERTAINTY the mapping

ISI Kolkata ML for MT Jan 6 2014 69

For the ldquocertaintyrdquo

We have a translation pair containing alignment candidates and none of the other words in the translation pair

OR

We have a translation pair containing all words in the translation pair except the alignment candidates

ISI Kolkata ML for MT Jan 6 2014 70

Thereforehellip

If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty

ISI Kolkata ML for MT Jan 6 2014 71

Rough estimate of data requirement

SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge

ie no meanings or grammatical properties of any words phrases or sentences

Each language has a vocabulary of 100000 words can give rise to about 500000 word forms

through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟

bdquowent‟ and bdquogone‟

ISI Kolkata ML for MT Jan 6 2014 72

Reasons for mapping to multiple words

Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register

polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)

syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood

ISI Kolkata ML for MT Jan 6 2014 73

Estimate of corpora requirement

Assume that on an average a sentence is 10 words long

an additional 9 translation pairs for getting at one of the 5 mappings

10 sentences per mapping per word

a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences

Estimate is not wide off the mark

Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs

ISI Kolkata ML for MT Jan 6 2014 74

Our work on factor based SMT

Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009

Jan 6 2014 ISI Kolkata ML for MT 75

Case Marker and Morphology crucial in E-H MT

Order of magnitiude facelift in Fluency and fidelity

Determined by the combination of suffixes and semantic relations on the English side

Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers

Jan 6 2014 ISI Kolkata ML for MT 76

Semantic relations+SuffixesCase

Markers+inflections

I ate mangoes

I ltagt ate eatpast mangoes ltobj

I ltagt mangoes ltobjpl eatpast

mei_ne aam khaa_yaa

Jan 6 2014 ISI Kolkata ML for MT 77

Our Approach

Factored model (Koehn and Hoang 2007) with the following translation factor

suffix + semantic relation case markersuffix

Experiments with the following relations

Dependency relations from the stanford parser

Deeper semantic roles from Universal Networking Language (UNL)

Jan 6 2014 ISI Kolkata ML for MT 78

Our Factorization

Jan 6 2014 ISI Kolkata ML for MT 79

Experiments

Jan 6 2014 ISI Kolkata ML for MT 80

Corpus Statistics

Jan 6 2014 ISI Kolkata ML for MT 81

Results The impact of suffix and semantic factors

Jan 6 2014 ISI Kolkata ML for MT 82

Results The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 83

Subjective Evaluation The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 84

Impact of sentence length (F Fluency AAdequacy E Errors)

Jan 6 2014 ISI Kolkata ML for MT 85

A feel for the improvement-baseline

Jan 6 2014 ISI Kolkata ML for MT 86

A feel for the improvement-reorder

Jan 6 2014 ISI Kolkata ML for MT 87

A feel for the improvement-Semantic relation

Jan 6 2014 ISI Kolkata ML for MT 88

A recent study

PAN Indian SMT

Jan 6 2014 ISI Kolkata ML for MT 89

Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator

SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi

Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English

Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences

Evaluation with BLEU METEOR scores also show high correlation with BLEU

Jan 6 2014 ISI Kolkata ML for MT 90

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Language Divergence Theory Structural Divergences

SVOSOV E Peter plays basketball H piitar basketball kheltaa haai

Head swapping divergence E Prime Minister of India H bhaarat ke pradhaan mantrii (India-of Prime

Minister)

Jan 6 2014 ISI Kolkata ML for MT 10

Language Divergence Theory Syntactic

Divergences (few examples)

Constituent Order divergence

E Singh the PM of India will address the nation today

H bhaarat ke pradhaan mantrii singh hellip (India-of PM Singhhellip)

Adjunction Divergence

E She will visit here in the summer

H vah yahaa garmii meM aayegii (she here summer-in will come)

Preposition-Stranding divergence

E Who do you want to go with

H kisake saath aap jaanaa chaahate ho (who withhellip)

Jan 6 2014 ISI Kolkata ML for MT 11

Vauquois Triangle

Jan 6 2014 ISI Kolkata ML for MT 12

Kinds of MT Systems (point of entry from source to the target text)

Jan 6 2014 ISI Kolkata ML for MT 13

Illustration of transfer SVOSOV

S

NP VP

V N NP

N John eats

bread

S

NP VP

V N

John eats

NP

N

bread

(transfer svo sov)

Jan 6 2014 ISI Kolkata ML for MT 14

Universality hypothesis

Universality hypothesis At the

level of ldquodeep meaningrdquo all texts

are the ldquosamerdquo whatever the

language

Jan 6 2014 ISI Kolkata ML for MT 15

Understanding the Analysis-Transfer-Generation over Vauquois triangle (14)

H11 सरकार_न चनावो_क_बाद मबई म करो_क_माधयम_स

अपन राजसव_को बढ़ाया | T11 Sarkaar ne chunaawo ke baad Mumbai me karoM ke

maadhyam se apne raajaswa ko badhaayaa

G11 Government_(ergative) elections_after Mumbai_in

taxes_through its revenue_(accusative) increased

E11 The Government increased its revenue after the

elections through taxes in Mumbai

Jan 6 2014 ISI Kolkata ML for MT 16

Interlingual representation complete disambiguation

Washington voted Washington to power Vote

past

Washington power Washington

emphasis

ltis-a gt action

ltis-a gt place

ltis-a gt capability ltis-a gt hellip

ltis-a gt person

goal

Jan 6 2014 ISI Kolkata ML for MT 17

Kinds of disambiguation needed for a complete and correct interlingua graph

N Name

P POS

A Attachment

S Sense

C Co-reference

R Semantic Role

Jan 6 2014 ISI Kolkata ML for MT 18

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech Noun or Verb

Jan 6 2014 ISI Kolkata ML for MT 19

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech

NER

John is the name of a PERSON

Jan 6 2014 ISI Kolkata ML for MT 20

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech

NER

WSD Financial

bank or River bank

Jan 6 2014 ISI Kolkata ML for MT 21

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech

NER

WSD

Co-reference

ldquoitrdquo ldquobankrdquo

Jan 6 2014 ISI Kolkata ML for MT 22

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech

NER

WSD

Co-reference

Subject Drop

Pro drop (subject ldquoIrdquo)

Jan 6 2014 ISI Kolkata ML for MT 23

Typical NLP tools used

POS tagger

Stanford Named Entity Recognizer

Stanford Dependency Parser

XLE Dependency Parser

Lexical Resource

WordNet

Universal Word Dictionary (UW++)

Jan 6 2014 ISI Kolkata ML for MT 24

System Architecture Stanford

Dependency Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple Sentence Analyser

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

Simple Enco

Simple Enco

Simple Enco

Simple Enco

Simple Enco

Simplifier

Jan 6 2014 ISI Kolkata ML for MT 25

Target Sentence Generation from interlingua

Lexical Transfer

Target Sentence

Generation

Syntax

Planning

Morphological

Synthesis

(WordPhrase

Translation ) (Word form

Generation) (Sequence)

Jan 6 2014 ISI Kolkata ML for MT 26

Generation Architecture Deconversion = Transfer + Generation

Jan 6 2014 ISI Kolkata ML for MT 27

Transfer Based MT

Marathi-Hindi

Jan 6 2014 ISI Kolkata ML for MT 28

Indian Language to Indian Language Machine Translation (ILILMT)

Bidirectional Machine Translation System

Developed for nine Indian language pairs

Approach

Transfer based

Modules developed using both rule based and statistical approach

Jan 6 2014 ISI Kolkata ML for MT 29

Architecture of ILILMT System

Morphological Analyzer

Source Text

POS Tagger

Chunker

Vibhakti Computation

Name Entity Recognizer

Word Sense Disambiguatio

n

Lexical Transfer

Agreement Feature

Interchunk

Word Generator

Intrachunk

Target Text

Analysis

Transfer

Generation

Jan 6 2014 ISI Kolkata ML for MT 30

M-H MT system Evaluation

Subjective evaluation based on machine translation quality

Accuracy calculated based on score given by linguists

S5 Number of score 5 Sentences S4 Number of score 4 sentences

S3 Number of score 3 sentences N Total Number of sentences

Accuracy =

Score 5 Correct Translation

Score 4 Understandable with

minor errors

Score 3 Understandable with

major errors

Score 2 Not Understandable

Score 1 Non sense translation

Jan 6 2014 ISI Kolkata ML for MT 31

Evaluation of Marathi to Hindi MT System

0

02

04

06

08

1

12

MorphAnalyzer

POS Tagger Chunker VibhaktiCompute

WSD LexicalTransfer

WordGenerator

Precision

Recall

Module-wise precision and recall

Jan 6 2014 ISI Kolkata ML for MT 32

Evaluation of Marathi to Hindi MT System (cont)

Subjective evaluation on translation quality Evaluated on 500 web sentences Accuracy calculated based on score given according to the

translation quality Accuracy 6532

Result analysis

Morph POS tagger chunker gives more than 90 precision but Transfer WSD generator modules are below 80 hence degrades MT quality

Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Jan 6 2014 ISI Kolkata ML for MT 33

Statistical Machine Translation

Jan 6 2014 ISI Kolkata ML for MT 34

Czeck-English data

[nesu] ldquoI carryrdquo

[ponese] ldquoHe will carryrdquo

[nese] ldquoHe carriesrdquo

[nesou] ldquoThey carryrdquo

[yedu] ldquoI driverdquo

[plavou] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 35

To translate hellip

I will carry

They drive

He swims

They will drive

Jan 6 2014 ISI Kolkata ML for MT 36

Hindi-English data

[DhotA huM] ldquoI carryrdquo

[DhoegA] ldquoHe will carryrdquo

[DhotA hAi] ldquoHe carriesrdquo

[Dhote hAi] ldquoThey carryrdquo

[chalAtA huM] ldquoI driverdquo

[tErte hEM] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 37

Bangla-English data

[bai] ldquoI carryrdquo

[baibe] ldquoHe will carryrdquo

[bay] ldquoHe carriesrdquo

[bay] ldquoThey carryrdquo

[chAlAi] ldquoI driverdquo

[sAMtrAy] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 38

To translate hellip (repeated)

I will carry

They drive

He swims

They will drive

Jan 6 2014 ISI Kolkata ML for MT 39

Foundation

Data driven approach Goal is to find out the English sentence e

given foreign language sentence f whose p(e|f) is maximum

Translations are generated on the basis of statistical model

Parameters are estimated using bilingual parallel corpora

Jan 6 2014 ISI Kolkata ML for MT 40

SMT Language Model

To detect good English sentences

Probability of an English sentence w1w2 helliphellip wn can be written as

Pr(w1w2 helliphellip wn) = Pr(w1) Pr(w2|w1) Pr(wn|w1 w2 wn-1)

Here Pr(wn|w1 w2 wn-1) is the probability that word wn follows word string w1 w2 wn-1 N-gram model probability

Trigram model probability calculation

Jan 6 2014 ISI Kolkata ML for MT 41

SMT Translation Model

P(f|e) Probability of some f given hypothesis English translation e

How to assign the values to p(e|f)

Sentences are infinite not possible to find pair(ef) for all sentences

Introduce a hidden variable a that represents alignments between the individual words in the sentence pair

Sentence level

Word level

Jan 6 2014 ISI Kolkata ML for MT 42

Alignment

If the string e= e1l= e1 e2 hellipel has l words and

the string f= f1m=f1f2fm has m words

then the alignment a can be represented by a series a1

m= a1a2am of m values each between 0 and l such that if the word in position j of the f-string is connected to the word in position i of the e-string then aj= i and

if it is not connected to any English word then aj= O

Jan 6 2014 ISI Kolkata ML for MT 43

Example of alignment

English Ram went to school

Hindi Raama paathashaalaa gayaa

Ram went to school

ltNullgt Raama paathashaalaa gayaa

Jan 6 2014 ISI Kolkata ML for MT 44

Translation Model Exact expression

Five models for estimating parameters in the expression [2]

Model-1 Model-2 Model-3 Model-4 Model-5

Choose alignment given e and m

Choose the identity of foreign word given e m a

Choose the length of foreign language string given e

Jan 6 2014 ISI Kolkata ML for MT 45

a

eafef )|Pr()|Pr(

m

emafeaf )|Pr()|Pr(

m

emafememaf )|Pr()|Pr()|Pr(

m

emafem )|Pr()|Pr(

m

m

j

jj

jj emfaafem1

1

1

1

1 )|Pr()|Pr(

m

j

jj

j

jj

j

m

emfafemfaaem1

1

11

1

1

1

1 )|Pr()|Pr()|Pr(

)|Pr( emaf )|Pr( em

m

j

jj

j

jj

j emfafemfaa1

1

11

1

1

1

1 )|Pr()|Pr(

Proof of Translation Model Exact

expression

m is fixed for a particular f hence

marginalization

marginalization

Jan 6 2014 ISI Kolkata ML for MT 46

Alignment

Jan 6 2014 ISI Kolkata ML for MT 47

Fundamental and ubiquitous

Spell checking

Translation

Transliteration

Speech to text

Text to speeh

Jan 6 2014 ISI Kolkata ML for MT 48

EM for word alignment from

sentence alignment example

English

(1) three rabbits

a b

(2) rabbits of Grenoble

b c d

French

(1) trois lapins

w x

(2) lapins de Grenoble

x y z

Jan 6 2014 ISI Kolkata ML for MT 49

Initial Probabilities

each cell denotes t(a w) t(a x) etc

a b c d

w 14 14 14 14

x 14 14 14 14

y 14 14 14 14

z 14 14 14 14

The counts in IBM Model 1

Works by maximizing P(f|e) over the entire corpus

For IBM Model 1 we get the following relationship

c (w f |w e f e ) =t (w f |w e )

t (w f |we

0 ) +hellip + t (w f |wel )

c (w f |w e f e ) is the fractional count of the alignment of w f

with w e in f and e

t (w f |w e ) is the probability of w f being the translation of w e

is the count of w f in f

is the count of w e in e

Jan 6 2014 ISI Kolkata ML for MT 51

Example of expected count

C[aw (a b)(w x)]

t(aw)

= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)

t(aw)+t(ax)

14

= ----------------- X 1 X 1= 12

14+14

Jan 6 2014 ISI Kolkata ML for MT 52

ldquocountsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 13 13 13

y 0 13 13 13

z 0 13 13 13

a b

w x

a b c d

w 12 12 0 0

x 12 12 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 53

Revised probability example

trevised(a w)

12

= -------------------------------------------------------------------

(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)

Jan 6 2014 ISI Kolkata ML for MT 54

Revised probabilities table

a b c d

w 12 14 0 0

x 12 512 13 13

y 0 16 13 13

z 0 16 13 13

ldquorevised countsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 59 13 13

y 0 29 13 13

z 0 29 13 13

a b

w x

a b c d

w 12 38 0 0

x 12 58 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 56

Re-Revised probabilities table a b c d

w 12 316 0 0

x 12 85144 13 13

y 0 19 13 13

z 0 19 13 13

Continue until convergence notice that (bx) binding gets progressively stronger

b=rabbits x=lapins

Derivation of EM based Alignment Expressions

Hindi)(Say language ofy vocabular

English)(Say language ofry vocalbula

2

1

LV

LV

F

E

what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet

जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet

E1

F1

E2

F2

Jan 6 2014 ISI Kolkata ML for MT 58

Vocabulary mapping

Vocabulary VE VF

what is in a name that which we call rose by any other will smell as sweet

naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii

Jan 6 2014 ISI Kolkata ML for MT 59

Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like

11989011 11989012 hellip 119890

11198971 119891

11 11989112 hellip 119891

11198981

11989021 11989022 hellip 119890

21198972 119891

21 11989122 hellip 119891

21198982

1198901199041 119890

1199042 hellip 119890

119904119897119904 119891

1199041 1198911199042 hellip 119891

119904119898119904

1198901198781 119890

1198782 hellip 119890

119878119897119878 119891

1198781 1198911198782 hellip 119891

119878119898119878

No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890

119904119901 =Index of English word 119890119904119901in English vocabularydictionary

119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary

(Thanks to Sachin Pawar for helping with the maths formulae processing)

Jan 6 2014 ISI Kolkata ML for MT 60

Hidden variables and parameters

Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878

119904=1 where each hidden variable is

as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French

word 119911119901119902119904 = 0 otherwise

Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893

word in French vocabulary

Jan 6 2014 ISI Kolkata ML for MT 61

Likelihoods Data Likelihood L(D Θ)

Data Log-Likelihood LL(D Θ)

Expected value of Data Log-Likelihood E(LL(D Θ))

Jan 6 2014 ISI Kolkata ML for MT 62

Constraint and Lagrangian

119875119894119895

119881119865

119895=1

= 1 forall119894

Jan 6 2014 ISI Kolkata ML for MT 63

Differentiating wrt Pij

Jan 6 2014 ISI Kolkata ML for MT 64

Final E and M steps

M-step

E-step

Jan 6 2014 ISI Kolkata ML for MT 65

Combinatorial considerations

Jan 6 2014 ISI Kolkata ML for MT 66

Example

ISI Kolkata ML for MT Jan 6 2014 67

All possible alignments

ISI Kolkata ML for MT Jan 6 2014 68

First fundamental requirement of SMT

Alignment requires evidence of

bull firstly a translation pair to introduce the

POSSIBILITY of a mapping

bull then another pair to establish with

CERTAINTY the mapping

ISI Kolkata ML for MT Jan 6 2014 69

For the ldquocertaintyrdquo

We have a translation pair containing alignment candidates and none of the other words in the translation pair

OR

We have a translation pair containing all words in the translation pair except the alignment candidates

ISI Kolkata ML for MT Jan 6 2014 70

Thereforehellip

If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty

ISI Kolkata ML for MT Jan 6 2014 71

Rough estimate of data requirement

SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge

ie no meanings or grammatical properties of any words phrases or sentences

Each language has a vocabulary of 100000 words can give rise to about 500000 word forms

through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟

bdquowent‟ and bdquogone‟

ISI Kolkata ML for MT Jan 6 2014 72

Reasons for mapping to multiple words

Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register

polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)

syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood

ISI Kolkata ML for MT Jan 6 2014 73

Estimate of corpora requirement

Assume that on an average a sentence is 10 words long

an additional 9 translation pairs for getting at one of the 5 mappings

10 sentences per mapping per word

a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences

Estimate is not wide off the mark

Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs

ISI Kolkata ML for MT Jan 6 2014 74

Our work on factor based SMT

Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009

Jan 6 2014 ISI Kolkata ML for MT 75

Case Marker and Morphology crucial in E-H MT

Order of magnitiude facelift in Fluency and fidelity

Determined by the combination of suffixes and semantic relations on the English side

Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers

Jan 6 2014 ISI Kolkata ML for MT 76

Semantic relations+SuffixesCase

Markers+inflections

I ate mangoes

I ltagt ate eatpast mangoes ltobj

I ltagt mangoes ltobjpl eatpast

mei_ne aam khaa_yaa

Jan 6 2014 ISI Kolkata ML for MT 77

Our Approach

Factored model (Koehn and Hoang 2007) with the following translation factor

suffix + semantic relation case markersuffix

Experiments with the following relations

Dependency relations from the stanford parser

Deeper semantic roles from Universal Networking Language (UNL)

Jan 6 2014 ISI Kolkata ML for MT 78

Our Factorization

Jan 6 2014 ISI Kolkata ML for MT 79

Experiments

Jan 6 2014 ISI Kolkata ML for MT 80

Corpus Statistics

Jan 6 2014 ISI Kolkata ML for MT 81

Results The impact of suffix and semantic factors

Jan 6 2014 ISI Kolkata ML for MT 82

Results The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 83

Subjective Evaluation The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 84

Impact of sentence length (F Fluency AAdequacy E Errors)

Jan 6 2014 ISI Kolkata ML for MT 85

A feel for the improvement-baseline

Jan 6 2014 ISI Kolkata ML for MT 86

A feel for the improvement-reorder

Jan 6 2014 ISI Kolkata ML for MT 87

A feel for the improvement-Semantic relation

Jan 6 2014 ISI Kolkata ML for MT 88

A recent study

PAN Indian SMT

Jan 6 2014 ISI Kolkata ML for MT 89

Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator

SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi

Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English

Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences

Evaluation with BLEU METEOR scores also show high correlation with BLEU

Jan 6 2014 ISI Kolkata ML for MT 90

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Language Divergence Theory Syntactic

Divergences (few examples)

Constituent Order divergence

E Singh the PM of India will address the nation today

H bhaarat ke pradhaan mantrii singh hellip (India-of PM Singhhellip)

Adjunction Divergence

E She will visit here in the summer

H vah yahaa garmii meM aayegii (she here summer-in will come)

Preposition-Stranding divergence

E Who do you want to go with

H kisake saath aap jaanaa chaahate ho (who withhellip)

Jan 6 2014 ISI Kolkata ML for MT 11

Vauquois Triangle

Jan 6 2014 ISI Kolkata ML for MT 12

Kinds of MT Systems (point of entry from source to the target text)

Jan 6 2014 ISI Kolkata ML for MT 13

Illustration of transfer SVOSOV

S

NP VP

V N NP

N John eats

bread

S

NP VP

V N

John eats

NP

N

bread

(transfer svo sov)

Jan 6 2014 ISI Kolkata ML for MT 14

Universality hypothesis

Universality hypothesis At the

level of ldquodeep meaningrdquo all texts

are the ldquosamerdquo whatever the

language

Jan 6 2014 ISI Kolkata ML for MT 15

Understanding the Analysis-Transfer-Generation over Vauquois triangle (14)

H11 सरकार_न चनावो_क_बाद मबई म करो_क_माधयम_स

अपन राजसव_को बढ़ाया | T11 Sarkaar ne chunaawo ke baad Mumbai me karoM ke

maadhyam se apne raajaswa ko badhaayaa

G11 Government_(ergative) elections_after Mumbai_in

taxes_through its revenue_(accusative) increased

E11 The Government increased its revenue after the

elections through taxes in Mumbai

Jan 6 2014 ISI Kolkata ML for MT 16

Interlingual representation complete disambiguation

Washington voted Washington to power Vote

past

Washington power Washington

emphasis

ltis-a gt action

ltis-a gt place

ltis-a gt capability ltis-a gt hellip

ltis-a gt person

goal

Jan 6 2014 ISI Kolkata ML for MT 17

Kinds of disambiguation needed for a complete and correct interlingua graph

N Name

P POS

A Attachment

S Sense

C Co-reference

R Semantic Role

Jan 6 2014 ISI Kolkata ML for MT 18

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech Noun or Verb

Jan 6 2014 ISI Kolkata ML for MT 19

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech

NER

John is the name of a PERSON

Jan 6 2014 ISI Kolkata ML for MT 20

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech

NER

WSD Financial

bank or River bank

Jan 6 2014 ISI Kolkata ML for MT 21

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech

NER

WSD

Co-reference

ldquoitrdquo ldquobankrdquo

Jan 6 2014 ISI Kolkata ML for MT 22

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech

NER

WSD

Co-reference

Subject Drop

Pro drop (subject ldquoIrdquo)

Jan 6 2014 ISI Kolkata ML for MT 23

Typical NLP tools used

POS tagger

Stanford Named Entity Recognizer

Stanford Dependency Parser

XLE Dependency Parser

Lexical Resource

WordNet

Universal Word Dictionary (UW++)

Jan 6 2014 ISI Kolkata ML for MT 24

System Architecture Stanford

Dependency Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple Sentence Analyser

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

Simple Enco

Simple Enco

Simple Enco

Simple Enco

Simple Enco

Simplifier

Jan 6 2014 ISI Kolkata ML for MT 25

Target Sentence Generation from interlingua

Lexical Transfer

Target Sentence

Generation

Syntax

Planning

Morphological

Synthesis

(WordPhrase

Translation ) (Word form

Generation) (Sequence)

Jan 6 2014 ISI Kolkata ML for MT 26

Generation Architecture Deconversion = Transfer + Generation

Jan 6 2014 ISI Kolkata ML for MT 27

Transfer Based MT

Marathi-Hindi

Jan 6 2014 ISI Kolkata ML for MT 28

Indian Language to Indian Language Machine Translation (ILILMT)

Bidirectional Machine Translation System

Developed for nine Indian language pairs

Approach

Transfer based

Modules developed using both rule based and statistical approach

Jan 6 2014 ISI Kolkata ML for MT 29

Architecture of ILILMT System

Morphological Analyzer

Source Text

POS Tagger

Chunker

Vibhakti Computation

Name Entity Recognizer

Word Sense Disambiguatio

n

Lexical Transfer

Agreement Feature

Interchunk

Word Generator

Intrachunk

Target Text

Analysis

Transfer

Generation

Jan 6 2014 ISI Kolkata ML for MT 30

M-H MT system Evaluation

Subjective evaluation based on machine translation quality

Accuracy calculated based on score given by linguists

S5 Number of score 5 Sentences S4 Number of score 4 sentences

S3 Number of score 3 sentences N Total Number of sentences

Accuracy =

Score 5 Correct Translation

Score 4 Understandable with

minor errors

Score 3 Understandable with

major errors

Score 2 Not Understandable

Score 1 Non sense translation

Jan 6 2014 ISI Kolkata ML for MT 31

Evaluation of Marathi to Hindi MT System

0

02

04

06

08

1

12

MorphAnalyzer

POS Tagger Chunker VibhaktiCompute

WSD LexicalTransfer

WordGenerator

Precision

Recall

Module-wise precision and recall

Jan 6 2014 ISI Kolkata ML for MT 32

Evaluation of Marathi to Hindi MT System (cont)

Subjective evaluation on translation quality Evaluated on 500 web sentences Accuracy calculated based on score given according to the

translation quality Accuracy 6532

Result analysis

Morph POS tagger chunker gives more than 90 precision but Transfer WSD generator modules are below 80 hence degrades MT quality

Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Jan 6 2014 ISI Kolkata ML for MT 33

Statistical Machine Translation

Jan 6 2014 ISI Kolkata ML for MT 34

Czeck-English data

[nesu] ldquoI carryrdquo

[ponese] ldquoHe will carryrdquo

[nese] ldquoHe carriesrdquo

[nesou] ldquoThey carryrdquo

[yedu] ldquoI driverdquo

[plavou] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 35

To translate hellip

I will carry

They drive

He swims

They will drive

Jan 6 2014 ISI Kolkata ML for MT 36

Hindi-English data

[DhotA huM] ldquoI carryrdquo

[DhoegA] ldquoHe will carryrdquo

[DhotA hAi] ldquoHe carriesrdquo

[Dhote hAi] ldquoThey carryrdquo

[chalAtA huM] ldquoI driverdquo

[tErte hEM] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 37

Bangla-English data

[bai] ldquoI carryrdquo

[baibe] ldquoHe will carryrdquo

[bay] ldquoHe carriesrdquo

[bay] ldquoThey carryrdquo

[chAlAi] ldquoI driverdquo

[sAMtrAy] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 38

To translate hellip (repeated)

I will carry

They drive

He swims

They will drive

Jan 6 2014 ISI Kolkata ML for MT 39

Foundation

Data driven approach Goal is to find out the English sentence e

given foreign language sentence f whose p(e|f) is maximum

Translations are generated on the basis of statistical model

Parameters are estimated using bilingual parallel corpora

Jan 6 2014 ISI Kolkata ML for MT 40

SMT Language Model

To detect good English sentences

Probability of an English sentence w1w2 helliphellip wn can be written as

Pr(w1w2 helliphellip wn) = Pr(w1) Pr(w2|w1) Pr(wn|w1 w2 wn-1)

Here Pr(wn|w1 w2 wn-1) is the probability that word wn follows word string w1 w2 wn-1 N-gram model probability

Trigram model probability calculation

Jan 6 2014 ISI Kolkata ML for MT 41

SMT Translation Model

P(f|e) Probability of some f given hypothesis English translation e

How to assign the values to p(e|f)

Sentences are infinite not possible to find pair(ef) for all sentences

Introduce a hidden variable a that represents alignments between the individual words in the sentence pair

Sentence level

Word level

Jan 6 2014 ISI Kolkata ML for MT 42

Alignment

If the string e= e1l= e1 e2 hellipel has l words and

the string f= f1m=f1f2fm has m words

then the alignment a can be represented by a series a1

m= a1a2am of m values each between 0 and l such that if the word in position j of the f-string is connected to the word in position i of the e-string then aj= i and

if it is not connected to any English word then aj= O

Jan 6 2014 ISI Kolkata ML for MT 43

Example of alignment

English Ram went to school

Hindi Raama paathashaalaa gayaa

Ram went to school

ltNullgt Raama paathashaalaa gayaa

Jan 6 2014 ISI Kolkata ML for MT 44

Translation Model Exact expression

Five models for estimating parameters in the expression [2]

Model-1 Model-2 Model-3 Model-4 Model-5

Choose alignment given e and m

Choose the identity of foreign word given e m a

Choose the length of foreign language string given e

Jan 6 2014 ISI Kolkata ML for MT 45

a

eafef )|Pr()|Pr(

m

emafeaf )|Pr()|Pr(

m

emafememaf )|Pr()|Pr()|Pr(

m

emafem )|Pr()|Pr(

m

m

j

jj

jj emfaafem1

1

1

1

1 )|Pr()|Pr(

m

j

jj

j

jj

j

m

emfafemfaaem1

1

11

1

1

1

1 )|Pr()|Pr()|Pr(

)|Pr( emaf )|Pr( em

m

j

jj

j

jj

j emfafemfaa1

1

11

1

1

1

1 )|Pr()|Pr(

Proof of Translation Model Exact

expression

m is fixed for a particular f hence

marginalization

marginalization

Jan 6 2014 ISI Kolkata ML for MT 46

Alignment

Jan 6 2014 ISI Kolkata ML for MT 47

Fundamental and ubiquitous

Spell checking

Translation

Transliteration

Speech to text

Text to speeh

Jan 6 2014 ISI Kolkata ML for MT 48

EM for word alignment from

sentence alignment example

English

(1) three rabbits

a b

(2) rabbits of Grenoble

b c d

French

(1) trois lapins

w x

(2) lapins de Grenoble

x y z

Jan 6 2014 ISI Kolkata ML for MT 49

Initial Probabilities

each cell denotes t(a w) t(a x) etc

a b c d

w 14 14 14 14

x 14 14 14 14

y 14 14 14 14

z 14 14 14 14

The counts in IBM Model 1

Works by maximizing P(f|e) over the entire corpus

For IBM Model 1 we get the following relationship

c (w f |w e f e ) =t (w f |w e )

t (w f |we

0 ) +hellip + t (w f |wel )

c (w f |w e f e ) is the fractional count of the alignment of w f

with w e in f and e

t (w f |w e ) is the probability of w f being the translation of w e

is the count of w f in f

is the count of w e in e

Jan 6 2014 ISI Kolkata ML for MT 51

Example of expected count

C[aw (a b)(w x)]

t(aw)

= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)

t(aw)+t(ax)

14

= ----------------- X 1 X 1= 12

14+14

Jan 6 2014 ISI Kolkata ML for MT 52

ldquocountsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 13 13 13

y 0 13 13 13

z 0 13 13 13

a b

w x

a b c d

w 12 12 0 0

x 12 12 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 53

Revised probability example

trevised(a w)

12

= -------------------------------------------------------------------

(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)

Jan 6 2014 ISI Kolkata ML for MT 54

Revised probabilities table

a b c d

w 12 14 0 0

x 12 512 13 13

y 0 16 13 13

z 0 16 13 13

ldquorevised countsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 59 13 13

y 0 29 13 13

z 0 29 13 13

a b

w x

a b c d

w 12 38 0 0

x 12 58 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 56

Re-Revised probabilities table a b c d

w 12 316 0 0

x 12 85144 13 13

y 0 19 13 13

z 0 19 13 13

Continue until convergence notice that (bx) binding gets progressively stronger

b=rabbits x=lapins

Derivation of EM based Alignment Expressions

Hindi)(Say language ofy vocabular

English)(Say language ofry vocalbula

2

1

LV

LV

F

E

what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet

जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet

E1

F1

E2

F2

Jan 6 2014 ISI Kolkata ML for MT 58

Vocabulary mapping

Vocabulary VE VF

what is in a name that which we call rose by any other will smell as sweet

naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii

Jan 6 2014 ISI Kolkata ML for MT 59

Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like

11989011 11989012 hellip 119890

11198971 119891

11 11989112 hellip 119891

11198981

11989021 11989022 hellip 119890

21198972 119891

21 11989122 hellip 119891

21198982

1198901199041 119890

1199042 hellip 119890

119904119897119904 119891

1199041 1198911199042 hellip 119891

119904119898119904

1198901198781 119890

1198782 hellip 119890

119878119897119878 119891

1198781 1198911198782 hellip 119891

119878119898119878

No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890

119904119901 =Index of English word 119890119904119901in English vocabularydictionary

119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary

(Thanks to Sachin Pawar for helping with the maths formulae processing)

Jan 6 2014 ISI Kolkata ML for MT 60

Hidden variables and parameters

Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878

119904=1 where each hidden variable is

as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French

word 119911119901119902119904 = 0 otherwise

Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893

word in French vocabulary

Jan 6 2014 ISI Kolkata ML for MT 61

Likelihoods Data Likelihood L(D Θ)

Data Log-Likelihood LL(D Θ)

Expected value of Data Log-Likelihood E(LL(D Θ))

Jan 6 2014 ISI Kolkata ML for MT 62

Constraint and Lagrangian

119875119894119895

119881119865

119895=1

= 1 forall119894

Jan 6 2014 ISI Kolkata ML for MT 63

Differentiating wrt Pij

Jan 6 2014 ISI Kolkata ML for MT 64

Final E and M steps

M-step

E-step

Jan 6 2014 ISI Kolkata ML for MT 65

Combinatorial considerations

Jan 6 2014 ISI Kolkata ML for MT 66

Example

ISI Kolkata ML for MT Jan 6 2014 67

All possible alignments

ISI Kolkata ML for MT Jan 6 2014 68

First fundamental requirement of SMT

Alignment requires evidence of

bull firstly a translation pair to introduce the

POSSIBILITY of a mapping

bull then another pair to establish with

CERTAINTY the mapping

ISI Kolkata ML for MT Jan 6 2014 69

For the ldquocertaintyrdquo

We have a translation pair containing alignment candidates and none of the other words in the translation pair

OR

We have a translation pair containing all words in the translation pair except the alignment candidates

ISI Kolkata ML for MT Jan 6 2014 70

Thereforehellip

If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty

ISI Kolkata ML for MT Jan 6 2014 71

Rough estimate of data requirement

SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge

ie no meanings or grammatical properties of any words phrases or sentences

Each language has a vocabulary of 100000 words can give rise to about 500000 word forms

through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟

bdquowent‟ and bdquogone‟

ISI Kolkata ML for MT Jan 6 2014 72

Reasons for mapping to multiple words

Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register

polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)

syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood

ISI Kolkata ML for MT Jan 6 2014 73

Estimate of corpora requirement

Assume that on an average a sentence is 10 words long

an additional 9 translation pairs for getting at one of the 5 mappings

10 sentences per mapping per word

a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences

Estimate is not wide off the mark

Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs

ISI Kolkata ML for MT Jan 6 2014 74

Our work on factor based SMT

Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009

Jan 6 2014 ISI Kolkata ML for MT 75

Case Marker and Morphology crucial in E-H MT

Order of magnitiude facelift in Fluency and fidelity

Determined by the combination of suffixes and semantic relations on the English side

Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers

Jan 6 2014 ISI Kolkata ML for MT 76

Semantic relations+SuffixesCase

Markers+inflections

I ate mangoes

I ltagt ate eatpast mangoes ltobj

I ltagt mangoes ltobjpl eatpast

mei_ne aam khaa_yaa

Jan 6 2014 ISI Kolkata ML for MT 77

Our Approach

Factored model (Koehn and Hoang 2007) with the following translation factor

suffix + semantic relation case markersuffix

Experiments with the following relations

Dependency relations from the stanford parser

Deeper semantic roles from Universal Networking Language (UNL)

Jan 6 2014 ISI Kolkata ML for MT 78

Our Factorization

Jan 6 2014 ISI Kolkata ML for MT 79

Experiments

Jan 6 2014 ISI Kolkata ML for MT 80

Corpus Statistics

Jan 6 2014 ISI Kolkata ML for MT 81

Results The impact of suffix and semantic factors

Jan 6 2014 ISI Kolkata ML for MT 82

Results The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 83

Subjective Evaluation The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 84

Impact of sentence length (F Fluency AAdequacy E Errors)

Jan 6 2014 ISI Kolkata ML for MT 85

A feel for the improvement-baseline

Jan 6 2014 ISI Kolkata ML for MT 86

A feel for the improvement-reorder

Jan 6 2014 ISI Kolkata ML for MT 87

A feel for the improvement-Semantic relation

Jan 6 2014 ISI Kolkata ML for MT 88

A recent study

PAN Indian SMT

Jan 6 2014 ISI Kolkata ML for MT 89

Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator

SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi

Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English

Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences

Evaluation with BLEU METEOR scores also show high correlation with BLEU

Jan 6 2014 ISI Kolkata ML for MT 90

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Vauquois Triangle

Jan 6 2014 ISI Kolkata ML for MT 12

Kinds of MT Systems (point of entry from source to the target text)

Jan 6 2014 ISI Kolkata ML for MT 13

Illustration of transfer SVOSOV

S

NP VP

V N NP

N John eats

bread

S

NP VP

V N

John eats

NP

N

bread

(transfer svo sov)

Jan 6 2014 ISI Kolkata ML for MT 14

Universality hypothesis

Universality hypothesis At the

level of ldquodeep meaningrdquo all texts

are the ldquosamerdquo whatever the

language

Jan 6 2014 ISI Kolkata ML for MT 15

Understanding the Analysis-Transfer-Generation over Vauquois triangle (14)

H11 सरकार_न चनावो_क_बाद मबई म करो_क_माधयम_स

अपन राजसव_को बढ़ाया | T11 Sarkaar ne chunaawo ke baad Mumbai me karoM ke

maadhyam se apne raajaswa ko badhaayaa

G11 Government_(ergative) elections_after Mumbai_in

taxes_through its revenue_(accusative) increased

E11 The Government increased its revenue after the

elections through taxes in Mumbai

Jan 6 2014 ISI Kolkata ML for MT 16

Interlingual representation complete disambiguation

Washington voted Washington to power Vote

past

Washington power Washington

emphasis

ltis-a gt action

ltis-a gt place

ltis-a gt capability ltis-a gt hellip

ltis-a gt person

goal

Jan 6 2014 ISI Kolkata ML for MT 17

Kinds of disambiguation needed for a complete and correct interlingua graph

N Name

P POS

A Attachment

S Sense

C Co-reference

R Semantic Role

Jan 6 2014 ISI Kolkata ML for MT 18

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech Noun or Verb

Jan 6 2014 ISI Kolkata ML for MT 19

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech

NER

John is the name of a PERSON

Jan 6 2014 ISI Kolkata ML for MT 20

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech

NER

WSD Financial

bank or River bank

Jan 6 2014 ISI Kolkata ML for MT 21

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech

NER

WSD

Co-reference

ldquoitrdquo ldquobankrdquo

Jan 6 2014 ISI Kolkata ML for MT 22

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech

NER

WSD

Co-reference

Subject Drop

Pro drop (subject ldquoIrdquo)

Jan 6 2014 ISI Kolkata ML for MT 23

Typical NLP tools used

POS tagger

Stanford Named Entity Recognizer

Stanford Dependency Parser

XLE Dependency Parser

Lexical Resource

WordNet

Universal Word Dictionary (UW++)

Jan 6 2014 ISI Kolkata ML for MT 24

System Architecture Stanford

Dependency Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple Sentence Analyser

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

Simple Enco

Simple Enco

Simple Enco

Simple Enco

Simple Enco

Simplifier

Jan 6 2014 ISI Kolkata ML for MT 25

Target Sentence Generation from interlingua

Lexical Transfer

Target Sentence

Generation

Syntax

Planning

Morphological

Synthesis

(WordPhrase

Translation ) (Word form

Generation) (Sequence)

Jan 6 2014 ISI Kolkata ML for MT 26

Generation Architecture Deconversion = Transfer + Generation

Jan 6 2014 ISI Kolkata ML for MT 27

Transfer Based MT

Marathi-Hindi

Jan 6 2014 ISI Kolkata ML for MT 28

Indian Language to Indian Language Machine Translation (ILILMT)

Bidirectional Machine Translation System

Developed for nine Indian language pairs

Approach

Transfer based

Modules developed using both rule based and statistical approach

Jan 6 2014 ISI Kolkata ML for MT 29

Architecture of ILILMT System

Morphological Analyzer

Source Text

POS Tagger

Chunker

Vibhakti Computation

Name Entity Recognizer

Word Sense Disambiguatio

n

Lexical Transfer

Agreement Feature

Interchunk

Word Generator

Intrachunk

Target Text

Analysis

Transfer

Generation

Jan 6 2014 ISI Kolkata ML for MT 30

M-H MT system Evaluation

Subjective evaluation based on machine translation quality

Accuracy calculated based on score given by linguists

S5 Number of score 5 Sentences S4 Number of score 4 sentences

S3 Number of score 3 sentences N Total Number of sentences

Accuracy =

Score 5 Correct Translation

Score 4 Understandable with

minor errors

Score 3 Understandable with

major errors

Score 2 Not Understandable

Score 1 Non sense translation

Jan 6 2014 ISI Kolkata ML for MT 31

Evaluation of Marathi to Hindi MT System

0

02

04

06

08

1

12

MorphAnalyzer

POS Tagger Chunker VibhaktiCompute

WSD LexicalTransfer

WordGenerator

Precision

Recall

Module-wise precision and recall

Jan 6 2014 ISI Kolkata ML for MT 32

Evaluation of Marathi to Hindi MT System (cont)

Subjective evaluation on translation quality Evaluated on 500 web sentences Accuracy calculated based on score given according to the

translation quality Accuracy 6532

Result analysis

Morph POS tagger chunker gives more than 90 precision but Transfer WSD generator modules are below 80 hence degrades MT quality

Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Jan 6 2014 ISI Kolkata ML for MT 33

Statistical Machine Translation

Jan 6 2014 ISI Kolkata ML for MT 34

Czeck-English data

[nesu] ldquoI carryrdquo

[ponese] ldquoHe will carryrdquo

[nese] ldquoHe carriesrdquo

[nesou] ldquoThey carryrdquo

[yedu] ldquoI driverdquo

[plavou] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 35

To translate hellip

I will carry

They drive

He swims

They will drive

Jan 6 2014 ISI Kolkata ML for MT 36

Hindi-English data

[DhotA huM] ldquoI carryrdquo

[DhoegA] ldquoHe will carryrdquo

[DhotA hAi] ldquoHe carriesrdquo

[Dhote hAi] ldquoThey carryrdquo

[chalAtA huM] ldquoI driverdquo

[tErte hEM] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 37

Bangla-English data

[bai] ldquoI carryrdquo

[baibe] ldquoHe will carryrdquo

[bay] ldquoHe carriesrdquo

[bay] ldquoThey carryrdquo

[chAlAi] ldquoI driverdquo

[sAMtrAy] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 38

To translate hellip (repeated)

I will carry

They drive

He swims

They will drive

Jan 6 2014 ISI Kolkata ML for MT 39

Foundation

Data driven approach Goal is to find out the English sentence e

given foreign language sentence f whose p(e|f) is maximum

Translations are generated on the basis of statistical model

Parameters are estimated using bilingual parallel corpora

Jan 6 2014 ISI Kolkata ML for MT 40

SMT Language Model

To detect good English sentences

Probability of an English sentence w1w2 helliphellip wn can be written as

Pr(w1w2 helliphellip wn) = Pr(w1) Pr(w2|w1) Pr(wn|w1 w2 wn-1)

Here Pr(wn|w1 w2 wn-1) is the probability that word wn follows word string w1 w2 wn-1 N-gram model probability

Trigram model probability calculation

Jan 6 2014 ISI Kolkata ML for MT 41

SMT Translation Model

P(f|e) Probability of some f given hypothesis English translation e

How to assign the values to p(e|f)

Sentences are infinite not possible to find pair(ef) for all sentences

Introduce a hidden variable a that represents alignments between the individual words in the sentence pair

Sentence level

Word level

Jan 6 2014 ISI Kolkata ML for MT 42

Alignment

If the string e= e1l= e1 e2 hellipel has l words and

the string f= f1m=f1f2fm has m words

then the alignment a can be represented by a series a1

m= a1a2am of m values each between 0 and l such that if the word in position j of the f-string is connected to the word in position i of the e-string then aj= i and

if it is not connected to any English word then aj= O

Jan 6 2014 ISI Kolkata ML for MT 43

Example of alignment

English Ram went to school

Hindi Raama paathashaalaa gayaa

Ram went to school

ltNullgt Raama paathashaalaa gayaa

Jan 6 2014 ISI Kolkata ML for MT 44

Translation Model Exact expression

Five models for estimating parameters in the expression [2]

Model-1 Model-2 Model-3 Model-4 Model-5

Choose alignment given e and m

Choose the identity of foreign word given e m a

Choose the length of foreign language string given e

Jan 6 2014 ISI Kolkata ML for MT 45

a

eafef )|Pr()|Pr(

m

emafeaf )|Pr()|Pr(

m

emafememaf )|Pr()|Pr()|Pr(

m

emafem )|Pr()|Pr(

m

m

j

jj

jj emfaafem1

1

1

1

1 )|Pr()|Pr(

m

j

jj

j

jj

j

m

emfafemfaaem1

1

11

1

1

1

1 )|Pr()|Pr()|Pr(

)|Pr( emaf )|Pr( em

m

j

jj

j

jj

j emfafemfaa1

1

11

1

1

1

1 )|Pr()|Pr(

Proof of Translation Model Exact

expression

m is fixed for a particular f hence

marginalization

marginalization

Jan 6 2014 ISI Kolkata ML for MT 46

Alignment

Jan 6 2014 ISI Kolkata ML for MT 47

Fundamental and ubiquitous

Spell checking

Translation

Transliteration

Speech to text

Text to speeh

Jan 6 2014 ISI Kolkata ML for MT 48

EM for word alignment from

sentence alignment example

English

(1) three rabbits

a b

(2) rabbits of Grenoble

b c d

French

(1) trois lapins

w x

(2) lapins de Grenoble

x y z

Jan 6 2014 ISI Kolkata ML for MT 49

Initial Probabilities

each cell denotes t(a w) t(a x) etc

a b c d

w 14 14 14 14

x 14 14 14 14

y 14 14 14 14

z 14 14 14 14

The counts in IBM Model 1

Works by maximizing P(f|e) over the entire corpus

For IBM Model 1 we get the following relationship

c (w f |w e f e ) =t (w f |w e )

t (w f |we

0 ) +hellip + t (w f |wel )

c (w f |w e f e ) is the fractional count of the alignment of w f

with w e in f and e

t (w f |w e ) is the probability of w f being the translation of w e

is the count of w f in f

is the count of w e in e

Jan 6 2014 ISI Kolkata ML for MT 51

Example of expected count

C[aw (a b)(w x)]

t(aw)

= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)

t(aw)+t(ax)

14

= ----------------- X 1 X 1= 12

14+14

Jan 6 2014 ISI Kolkata ML for MT 52

ldquocountsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 13 13 13

y 0 13 13 13

z 0 13 13 13

a b

w x

a b c d

w 12 12 0 0

x 12 12 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 53

Revised probability example

trevised(a w)

12

= -------------------------------------------------------------------

(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)

Jan 6 2014 ISI Kolkata ML for MT 54

Revised probabilities table

a b c d

w 12 14 0 0

x 12 512 13 13

y 0 16 13 13

z 0 16 13 13

ldquorevised countsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 59 13 13

y 0 29 13 13

z 0 29 13 13

a b

w x

a b c d

w 12 38 0 0

x 12 58 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 56

Re-Revised probabilities table a b c d

w 12 316 0 0

x 12 85144 13 13

y 0 19 13 13

z 0 19 13 13

Continue until convergence notice that (bx) binding gets progressively stronger

b=rabbits x=lapins

Derivation of EM based Alignment Expressions

Hindi)(Say language ofy vocabular

English)(Say language ofry vocalbula

2

1

LV

LV

F

E

what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet

जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet

E1

F1

E2

F2

Jan 6 2014 ISI Kolkata ML for MT 58

Vocabulary mapping

Vocabulary VE VF

what is in a name that which we call rose by any other will smell as sweet

naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii

Jan 6 2014 ISI Kolkata ML for MT 59

Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like

11989011 11989012 hellip 119890

11198971 119891

11 11989112 hellip 119891

11198981

11989021 11989022 hellip 119890

21198972 119891

21 11989122 hellip 119891

21198982

1198901199041 119890

1199042 hellip 119890

119904119897119904 119891

1199041 1198911199042 hellip 119891

119904119898119904

1198901198781 119890

1198782 hellip 119890

119878119897119878 119891

1198781 1198911198782 hellip 119891

119878119898119878

No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890

119904119901 =Index of English word 119890119904119901in English vocabularydictionary

119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary

(Thanks to Sachin Pawar for helping with the maths formulae processing)

Jan 6 2014 ISI Kolkata ML for MT 60

Hidden variables and parameters

Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878

119904=1 where each hidden variable is

as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French

word 119911119901119902119904 = 0 otherwise

Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893

word in French vocabulary

Jan 6 2014 ISI Kolkata ML for MT 61

Likelihoods Data Likelihood L(D Θ)

Data Log-Likelihood LL(D Θ)

Expected value of Data Log-Likelihood E(LL(D Θ))

Jan 6 2014 ISI Kolkata ML for MT 62

Constraint and Lagrangian

119875119894119895

119881119865

119895=1

= 1 forall119894

Jan 6 2014 ISI Kolkata ML for MT 63

Differentiating wrt Pij

Jan 6 2014 ISI Kolkata ML for MT 64

Final E and M steps

M-step

E-step

Jan 6 2014 ISI Kolkata ML for MT 65

Combinatorial considerations

Jan 6 2014 ISI Kolkata ML for MT 66

Example

ISI Kolkata ML for MT Jan 6 2014 67

All possible alignments

ISI Kolkata ML for MT Jan 6 2014 68

First fundamental requirement of SMT

Alignment requires evidence of

bull firstly a translation pair to introduce the

POSSIBILITY of a mapping

bull then another pair to establish with

CERTAINTY the mapping

ISI Kolkata ML for MT Jan 6 2014 69

For the ldquocertaintyrdquo

We have a translation pair containing alignment candidates and none of the other words in the translation pair

OR

We have a translation pair containing all words in the translation pair except the alignment candidates

ISI Kolkata ML for MT Jan 6 2014 70

Thereforehellip

If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty

ISI Kolkata ML for MT Jan 6 2014 71

Rough estimate of data requirement

SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge

ie no meanings or grammatical properties of any words phrases or sentences

Each language has a vocabulary of 100000 words can give rise to about 500000 word forms

through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟

bdquowent‟ and bdquogone‟

ISI Kolkata ML for MT Jan 6 2014 72

Reasons for mapping to multiple words

Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register

polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)

syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood

ISI Kolkata ML for MT Jan 6 2014 73

Estimate of corpora requirement

Assume that on an average a sentence is 10 words long

an additional 9 translation pairs for getting at one of the 5 mappings

10 sentences per mapping per word

a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences

Estimate is not wide off the mark

Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs

ISI Kolkata ML for MT Jan 6 2014 74

Our work on factor based SMT

Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009

Jan 6 2014 ISI Kolkata ML for MT 75

Case Marker and Morphology crucial in E-H MT

Order of magnitiude facelift in Fluency and fidelity

Determined by the combination of suffixes and semantic relations on the English side

Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers

Jan 6 2014 ISI Kolkata ML for MT 76

Semantic relations+SuffixesCase

Markers+inflections

I ate mangoes

I ltagt ate eatpast mangoes ltobj

I ltagt mangoes ltobjpl eatpast

mei_ne aam khaa_yaa

Jan 6 2014 ISI Kolkata ML for MT 77

Our Approach

Factored model (Koehn and Hoang 2007) with the following translation factor

suffix + semantic relation case markersuffix

Experiments with the following relations

Dependency relations from the stanford parser

Deeper semantic roles from Universal Networking Language (UNL)

Jan 6 2014 ISI Kolkata ML for MT 78

Our Factorization

Jan 6 2014 ISI Kolkata ML for MT 79

Experiments

Jan 6 2014 ISI Kolkata ML for MT 80

Corpus Statistics

Jan 6 2014 ISI Kolkata ML for MT 81

Results The impact of suffix and semantic factors

Jan 6 2014 ISI Kolkata ML for MT 82

Results The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 83

Subjective Evaluation The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 84

Impact of sentence length (F Fluency AAdequacy E Errors)

Jan 6 2014 ISI Kolkata ML for MT 85

A feel for the improvement-baseline

Jan 6 2014 ISI Kolkata ML for MT 86

A feel for the improvement-reorder

Jan 6 2014 ISI Kolkata ML for MT 87

A feel for the improvement-Semantic relation

Jan 6 2014 ISI Kolkata ML for MT 88

A recent study

PAN Indian SMT

Jan 6 2014 ISI Kolkata ML for MT 89

Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator

SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi

Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English

Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences

Evaluation with BLEU METEOR scores also show high correlation with BLEU

Jan 6 2014 ISI Kolkata ML for MT 90

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Kinds of MT Systems (point of entry from source to the target text)

Jan 6 2014 ISI Kolkata ML for MT 13

Illustration of transfer SVOSOV

S

NP VP

V N NP

N John eats

bread

S

NP VP

V N

John eats

NP

N

bread

(transfer svo sov)

Jan 6 2014 ISI Kolkata ML for MT 14

Universality hypothesis

Universality hypothesis At the

level of ldquodeep meaningrdquo all texts

are the ldquosamerdquo whatever the

language

Jan 6 2014 ISI Kolkata ML for MT 15

Understanding the Analysis-Transfer-Generation over Vauquois triangle (14)

H11 सरकार_न चनावो_क_बाद मबई म करो_क_माधयम_स

अपन राजसव_को बढ़ाया | T11 Sarkaar ne chunaawo ke baad Mumbai me karoM ke

maadhyam se apne raajaswa ko badhaayaa

G11 Government_(ergative) elections_after Mumbai_in

taxes_through its revenue_(accusative) increased

E11 The Government increased its revenue after the

elections through taxes in Mumbai

Jan 6 2014 ISI Kolkata ML for MT 16

Interlingual representation complete disambiguation

Washington voted Washington to power Vote

past

Washington power Washington

emphasis

ltis-a gt action

ltis-a gt place

ltis-a gt capability ltis-a gt hellip

ltis-a gt person

goal

Jan 6 2014 ISI Kolkata ML for MT 17

Kinds of disambiguation needed for a complete and correct interlingua graph

N Name

P POS

A Attachment

S Sense

C Co-reference

R Semantic Role

Jan 6 2014 ISI Kolkata ML for MT 18

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech Noun or Verb

Jan 6 2014 ISI Kolkata ML for MT 19

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech

NER

John is the name of a PERSON

Jan 6 2014 ISI Kolkata ML for MT 20

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech

NER

WSD Financial

bank or River bank

Jan 6 2014 ISI Kolkata ML for MT 21

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech

NER

WSD

Co-reference

ldquoitrdquo ldquobankrdquo

Jan 6 2014 ISI Kolkata ML for MT 22

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech

NER

WSD

Co-reference

Subject Drop

Pro drop (subject ldquoIrdquo)

Jan 6 2014 ISI Kolkata ML for MT 23

Typical NLP tools used

POS tagger

Stanford Named Entity Recognizer

Stanford Dependency Parser

XLE Dependency Parser

Lexical Resource

WordNet

Universal Word Dictionary (UW++)

Jan 6 2014 ISI Kolkata ML for MT 24

System Architecture Stanford

Dependency Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple Sentence Analyser

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

Simple Enco

Simple Enco

Simple Enco

Simple Enco

Simple Enco

Simplifier

Jan 6 2014 ISI Kolkata ML for MT 25

Target Sentence Generation from interlingua

Lexical Transfer

Target Sentence

Generation

Syntax

Planning

Morphological

Synthesis

(WordPhrase

Translation ) (Word form

Generation) (Sequence)

Jan 6 2014 ISI Kolkata ML for MT 26

Generation Architecture Deconversion = Transfer + Generation

Jan 6 2014 ISI Kolkata ML for MT 27

Transfer Based MT

Marathi-Hindi

Jan 6 2014 ISI Kolkata ML for MT 28

Indian Language to Indian Language Machine Translation (ILILMT)

Bidirectional Machine Translation System

Developed for nine Indian language pairs

Approach

Transfer based

Modules developed using both rule based and statistical approach

Jan 6 2014 ISI Kolkata ML for MT 29

Architecture of ILILMT System

Morphological Analyzer

Source Text

POS Tagger

Chunker

Vibhakti Computation

Name Entity Recognizer

Word Sense Disambiguatio

n

Lexical Transfer

Agreement Feature

Interchunk

Word Generator

Intrachunk

Target Text

Analysis

Transfer

Generation

Jan 6 2014 ISI Kolkata ML for MT 30

M-H MT system Evaluation

Subjective evaluation based on machine translation quality

Accuracy calculated based on score given by linguists

S5 Number of score 5 Sentences S4 Number of score 4 sentences

S3 Number of score 3 sentences N Total Number of sentences

Accuracy =

Score 5 Correct Translation

Score 4 Understandable with

minor errors

Score 3 Understandable with

major errors

Score 2 Not Understandable

Score 1 Non sense translation

Jan 6 2014 ISI Kolkata ML for MT 31

Evaluation of Marathi to Hindi MT System

0

02

04

06

08

1

12

MorphAnalyzer

POS Tagger Chunker VibhaktiCompute

WSD LexicalTransfer

WordGenerator

Precision

Recall

Module-wise precision and recall

Jan 6 2014 ISI Kolkata ML for MT 32

Evaluation of Marathi to Hindi MT System (cont)

Subjective evaluation on translation quality Evaluated on 500 web sentences Accuracy calculated based on score given according to the

translation quality Accuracy 6532

Result analysis

Morph POS tagger chunker gives more than 90 precision but Transfer WSD generator modules are below 80 hence degrades MT quality

Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Jan 6 2014 ISI Kolkata ML for MT 33

Statistical Machine Translation

Jan 6 2014 ISI Kolkata ML for MT 34

Czeck-English data

[nesu] ldquoI carryrdquo

[ponese] ldquoHe will carryrdquo

[nese] ldquoHe carriesrdquo

[nesou] ldquoThey carryrdquo

[yedu] ldquoI driverdquo

[plavou] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 35

To translate hellip

I will carry

They drive

He swims

They will drive

Jan 6 2014 ISI Kolkata ML for MT 36

Hindi-English data

[DhotA huM] ldquoI carryrdquo

[DhoegA] ldquoHe will carryrdquo

[DhotA hAi] ldquoHe carriesrdquo

[Dhote hAi] ldquoThey carryrdquo

[chalAtA huM] ldquoI driverdquo

[tErte hEM] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 37

Bangla-English data

[bai] ldquoI carryrdquo

[baibe] ldquoHe will carryrdquo

[bay] ldquoHe carriesrdquo

[bay] ldquoThey carryrdquo

[chAlAi] ldquoI driverdquo

[sAMtrAy] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 38

To translate hellip (repeated)

I will carry

They drive

He swims

They will drive

Jan 6 2014 ISI Kolkata ML for MT 39

Foundation

Data driven approach Goal is to find out the English sentence e

given foreign language sentence f whose p(e|f) is maximum

Translations are generated on the basis of statistical model

Parameters are estimated using bilingual parallel corpora

Jan 6 2014 ISI Kolkata ML for MT 40

SMT Language Model

To detect good English sentences

Probability of an English sentence w1w2 helliphellip wn can be written as

Pr(w1w2 helliphellip wn) = Pr(w1) Pr(w2|w1) Pr(wn|w1 w2 wn-1)

Here Pr(wn|w1 w2 wn-1) is the probability that word wn follows word string w1 w2 wn-1 N-gram model probability

Trigram model probability calculation

Jan 6 2014 ISI Kolkata ML for MT 41

SMT Translation Model

P(f|e) Probability of some f given hypothesis English translation e

How to assign the values to p(e|f)

Sentences are infinite not possible to find pair(ef) for all sentences

Introduce a hidden variable a that represents alignments between the individual words in the sentence pair

Sentence level

Word level

Jan 6 2014 ISI Kolkata ML for MT 42

Alignment

If the string e= e1l= e1 e2 hellipel has l words and

the string f= f1m=f1f2fm has m words

then the alignment a can be represented by a series a1

m= a1a2am of m values each between 0 and l such that if the word in position j of the f-string is connected to the word in position i of the e-string then aj= i and

if it is not connected to any English word then aj= O

Jan 6 2014 ISI Kolkata ML for MT 43

Example of alignment

English Ram went to school

Hindi Raama paathashaalaa gayaa

Ram went to school

ltNullgt Raama paathashaalaa gayaa

Jan 6 2014 ISI Kolkata ML for MT 44

Translation Model Exact expression

Five models for estimating parameters in the expression [2]

Model-1 Model-2 Model-3 Model-4 Model-5

Choose alignment given e and m

Choose the identity of foreign word given e m a

Choose the length of foreign language string given e

Jan 6 2014 ISI Kolkata ML for MT 45

a

eafef )|Pr()|Pr(

m

emafeaf )|Pr()|Pr(

m

emafememaf )|Pr()|Pr()|Pr(

m

emafem )|Pr()|Pr(

m

m

j

jj

jj emfaafem1

1

1

1

1 )|Pr()|Pr(

m

j

jj

j

jj

j

m

emfafemfaaem1

1

11

1

1

1

1 )|Pr()|Pr()|Pr(

)|Pr( emaf )|Pr( em

m

j

jj

j

jj

j emfafemfaa1

1

11

1

1

1

1 )|Pr()|Pr(

Proof of Translation Model Exact

expression

m is fixed for a particular f hence

marginalization

marginalization

Jan 6 2014 ISI Kolkata ML for MT 46

Alignment

Jan 6 2014 ISI Kolkata ML for MT 47

Fundamental and ubiquitous

Spell checking

Translation

Transliteration

Speech to text

Text to speeh

Jan 6 2014 ISI Kolkata ML for MT 48

EM for word alignment from

sentence alignment example

English

(1) three rabbits

a b

(2) rabbits of Grenoble

b c d

French

(1) trois lapins

w x

(2) lapins de Grenoble

x y z

Jan 6 2014 ISI Kolkata ML for MT 49

Initial Probabilities

each cell denotes t(a w) t(a x) etc

a b c d

w 14 14 14 14

x 14 14 14 14

y 14 14 14 14

z 14 14 14 14

The counts in IBM Model 1

Works by maximizing P(f|e) over the entire corpus

For IBM Model 1 we get the following relationship

c (w f |w e f e ) =t (w f |w e )

t (w f |we

0 ) +hellip + t (w f |wel )

c (w f |w e f e ) is the fractional count of the alignment of w f

with w e in f and e

t (w f |w e ) is the probability of w f being the translation of w e

is the count of w f in f

is the count of w e in e

Jan 6 2014 ISI Kolkata ML for MT 51

Example of expected count

C[aw (a b)(w x)]

t(aw)

= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)

t(aw)+t(ax)

14

= ----------------- X 1 X 1= 12

14+14

Jan 6 2014 ISI Kolkata ML for MT 52

ldquocountsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 13 13 13

y 0 13 13 13

z 0 13 13 13

a b

w x

a b c d

w 12 12 0 0

x 12 12 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 53

Revised probability example

trevised(a w)

12

= -------------------------------------------------------------------

(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)

Jan 6 2014 ISI Kolkata ML for MT 54

Revised probabilities table

a b c d

w 12 14 0 0

x 12 512 13 13

y 0 16 13 13

z 0 16 13 13

ldquorevised countsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 59 13 13

y 0 29 13 13

z 0 29 13 13

a b

w x

a b c d

w 12 38 0 0

x 12 58 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 56

Re-Revised probabilities table a b c d

w 12 316 0 0

x 12 85144 13 13

y 0 19 13 13

z 0 19 13 13

Continue until convergence notice that (bx) binding gets progressively stronger

b=rabbits x=lapins

Derivation of EM based Alignment Expressions

Hindi)(Say language ofy vocabular

English)(Say language ofry vocalbula

2

1

LV

LV

F

E

what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet

जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet

E1

F1

E2

F2

Jan 6 2014 ISI Kolkata ML for MT 58

Vocabulary mapping

Vocabulary VE VF

what is in a name that which we call rose by any other will smell as sweet

naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii

Jan 6 2014 ISI Kolkata ML for MT 59

Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like

11989011 11989012 hellip 119890

11198971 119891

11 11989112 hellip 119891

11198981

11989021 11989022 hellip 119890

21198972 119891

21 11989122 hellip 119891

21198982

1198901199041 119890

1199042 hellip 119890

119904119897119904 119891

1199041 1198911199042 hellip 119891

119904119898119904

1198901198781 119890

1198782 hellip 119890

119878119897119878 119891

1198781 1198911198782 hellip 119891

119878119898119878

No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890

119904119901 =Index of English word 119890119904119901in English vocabularydictionary

119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary

(Thanks to Sachin Pawar for helping with the maths formulae processing)

Jan 6 2014 ISI Kolkata ML for MT 60

Hidden variables and parameters

Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878

119904=1 where each hidden variable is

as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French

word 119911119901119902119904 = 0 otherwise

Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893

word in French vocabulary

Jan 6 2014 ISI Kolkata ML for MT 61

Likelihoods Data Likelihood L(D Θ)

Data Log-Likelihood LL(D Θ)

Expected value of Data Log-Likelihood E(LL(D Θ))

Jan 6 2014 ISI Kolkata ML for MT 62

Constraint and Lagrangian

119875119894119895

119881119865

119895=1

= 1 forall119894

Jan 6 2014 ISI Kolkata ML for MT 63

Differentiating wrt Pij

Jan 6 2014 ISI Kolkata ML for MT 64

Final E and M steps

M-step

E-step

Jan 6 2014 ISI Kolkata ML for MT 65

Combinatorial considerations

Jan 6 2014 ISI Kolkata ML for MT 66

Example

ISI Kolkata ML for MT Jan 6 2014 67

All possible alignments

ISI Kolkata ML for MT Jan 6 2014 68

First fundamental requirement of SMT

Alignment requires evidence of

bull firstly a translation pair to introduce the

POSSIBILITY of a mapping

bull then another pair to establish with

CERTAINTY the mapping

ISI Kolkata ML for MT Jan 6 2014 69

For the ldquocertaintyrdquo

We have a translation pair containing alignment candidates and none of the other words in the translation pair

OR

We have a translation pair containing all words in the translation pair except the alignment candidates

ISI Kolkata ML for MT Jan 6 2014 70

Thereforehellip

If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty

ISI Kolkata ML for MT Jan 6 2014 71

Rough estimate of data requirement

SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge

ie no meanings or grammatical properties of any words phrases or sentences

Each language has a vocabulary of 100000 words can give rise to about 500000 word forms

through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟

bdquowent‟ and bdquogone‟

ISI Kolkata ML for MT Jan 6 2014 72

Reasons for mapping to multiple words

Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register

polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)

syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood

ISI Kolkata ML for MT Jan 6 2014 73

Estimate of corpora requirement

Assume that on an average a sentence is 10 words long

an additional 9 translation pairs for getting at one of the 5 mappings

10 sentences per mapping per word

a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences

Estimate is not wide off the mark

Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs

ISI Kolkata ML for MT Jan 6 2014 74

Our work on factor based SMT

Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009

Jan 6 2014 ISI Kolkata ML for MT 75

Case Marker and Morphology crucial in E-H MT

Order of magnitiude facelift in Fluency and fidelity

Determined by the combination of suffixes and semantic relations on the English side

Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers

Jan 6 2014 ISI Kolkata ML for MT 76

Semantic relations+SuffixesCase

Markers+inflections

I ate mangoes

I ltagt ate eatpast mangoes ltobj

I ltagt mangoes ltobjpl eatpast

mei_ne aam khaa_yaa

Jan 6 2014 ISI Kolkata ML for MT 77

Our Approach

Factored model (Koehn and Hoang 2007) with the following translation factor

suffix + semantic relation case markersuffix

Experiments with the following relations

Dependency relations from the stanford parser

Deeper semantic roles from Universal Networking Language (UNL)

Jan 6 2014 ISI Kolkata ML for MT 78

Our Factorization

Jan 6 2014 ISI Kolkata ML for MT 79

Experiments

Jan 6 2014 ISI Kolkata ML for MT 80

Corpus Statistics

Jan 6 2014 ISI Kolkata ML for MT 81

Results The impact of suffix and semantic factors

Jan 6 2014 ISI Kolkata ML for MT 82

Results The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 83

Subjective Evaluation The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 84

Impact of sentence length (F Fluency AAdequacy E Errors)

Jan 6 2014 ISI Kolkata ML for MT 85

A feel for the improvement-baseline

Jan 6 2014 ISI Kolkata ML for MT 86

A feel for the improvement-reorder

Jan 6 2014 ISI Kolkata ML for MT 87

A feel for the improvement-Semantic relation

Jan 6 2014 ISI Kolkata ML for MT 88

A recent study

PAN Indian SMT

Jan 6 2014 ISI Kolkata ML for MT 89

Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator

SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi

Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English

Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences

Evaluation with BLEU METEOR scores also show high correlation with BLEU

Jan 6 2014 ISI Kolkata ML for MT 90

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Illustration of transfer SVOSOV

S

NP VP

V N NP

N John eats

bread

S

NP VP

V N

John eats

NP

N

bread

(transfer svo sov)

Jan 6 2014 ISI Kolkata ML for MT 14

Universality hypothesis

Universality hypothesis At the

level of ldquodeep meaningrdquo all texts

are the ldquosamerdquo whatever the

language

Jan 6 2014 ISI Kolkata ML for MT 15

Understanding the Analysis-Transfer-Generation over Vauquois triangle (14)

H11 सरकार_न चनावो_क_बाद मबई म करो_क_माधयम_स

अपन राजसव_को बढ़ाया | T11 Sarkaar ne chunaawo ke baad Mumbai me karoM ke

maadhyam se apne raajaswa ko badhaayaa

G11 Government_(ergative) elections_after Mumbai_in

taxes_through its revenue_(accusative) increased

E11 The Government increased its revenue after the

elections through taxes in Mumbai

Jan 6 2014 ISI Kolkata ML for MT 16

Interlingual representation complete disambiguation

Washington voted Washington to power Vote

past

Washington power Washington

emphasis

ltis-a gt action

ltis-a gt place

ltis-a gt capability ltis-a gt hellip

ltis-a gt person

goal

Jan 6 2014 ISI Kolkata ML for MT 17

Kinds of disambiguation needed for a complete and correct interlingua graph

N Name

P POS

A Attachment

S Sense

C Co-reference

R Semantic Role

Jan 6 2014 ISI Kolkata ML for MT 18

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech Noun or Verb

Jan 6 2014 ISI Kolkata ML for MT 19

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech

NER

John is the name of a PERSON

Jan 6 2014 ISI Kolkata ML for MT 20

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech

NER

WSD Financial

bank or River bank

Jan 6 2014 ISI Kolkata ML for MT 21

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech

NER

WSD

Co-reference

ldquoitrdquo ldquobankrdquo

Jan 6 2014 ISI Kolkata ML for MT 22

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech

NER

WSD

Co-reference

Subject Drop

Pro drop (subject ldquoIrdquo)

Jan 6 2014 ISI Kolkata ML for MT 23

Typical NLP tools used

POS tagger

Stanford Named Entity Recognizer

Stanford Dependency Parser

XLE Dependency Parser

Lexical Resource

WordNet

Universal Word Dictionary (UW++)

Jan 6 2014 ISI Kolkata ML for MT 24

System Architecture Stanford

Dependency Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple Sentence Analyser

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

Simple Enco

Simple Enco

Simple Enco

Simple Enco

Simple Enco

Simplifier

Jan 6 2014 ISI Kolkata ML for MT 25

Target Sentence Generation from interlingua

Lexical Transfer

Target Sentence

Generation

Syntax

Planning

Morphological

Synthesis

(WordPhrase

Translation ) (Word form

Generation) (Sequence)

Jan 6 2014 ISI Kolkata ML for MT 26

Generation Architecture Deconversion = Transfer + Generation

Jan 6 2014 ISI Kolkata ML for MT 27

Transfer Based MT

Marathi-Hindi

Jan 6 2014 ISI Kolkata ML for MT 28

Indian Language to Indian Language Machine Translation (ILILMT)

Bidirectional Machine Translation System

Developed for nine Indian language pairs

Approach

Transfer based

Modules developed using both rule based and statistical approach

Jan 6 2014 ISI Kolkata ML for MT 29

Architecture of ILILMT System

Morphological Analyzer

Source Text

POS Tagger

Chunker

Vibhakti Computation

Name Entity Recognizer

Word Sense Disambiguatio

n

Lexical Transfer

Agreement Feature

Interchunk

Word Generator

Intrachunk

Target Text

Analysis

Transfer

Generation

Jan 6 2014 ISI Kolkata ML for MT 30

M-H MT system Evaluation

Subjective evaluation based on machine translation quality

Accuracy calculated based on score given by linguists

S5 Number of score 5 Sentences S4 Number of score 4 sentences

S3 Number of score 3 sentences N Total Number of sentences

Accuracy =

Score 5 Correct Translation

Score 4 Understandable with

minor errors

Score 3 Understandable with

major errors

Score 2 Not Understandable

Score 1 Non sense translation

Jan 6 2014 ISI Kolkata ML for MT 31

Evaluation of Marathi to Hindi MT System

0

02

04

06

08

1

12

MorphAnalyzer

POS Tagger Chunker VibhaktiCompute

WSD LexicalTransfer

WordGenerator

Precision

Recall

Module-wise precision and recall

Jan 6 2014 ISI Kolkata ML for MT 32

Evaluation of Marathi to Hindi MT System (cont)

Subjective evaluation on translation quality Evaluated on 500 web sentences Accuracy calculated based on score given according to the

translation quality Accuracy 6532

Result analysis

Morph POS tagger chunker gives more than 90 precision but Transfer WSD generator modules are below 80 hence degrades MT quality

Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Jan 6 2014 ISI Kolkata ML for MT 33

Statistical Machine Translation

Jan 6 2014 ISI Kolkata ML for MT 34

Czeck-English data

[nesu] ldquoI carryrdquo

[ponese] ldquoHe will carryrdquo

[nese] ldquoHe carriesrdquo

[nesou] ldquoThey carryrdquo

[yedu] ldquoI driverdquo

[plavou] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 35

To translate hellip

I will carry

They drive

He swims

They will drive

Jan 6 2014 ISI Kolkata ML for MT 36

Hindi-English data

[DhotA huM] ldquoI carryrdquo

[DhoegA] ldquoHe will carryrdquo

[DhotA hAi] ldquoHe carriesrdquo

[Dhote hAi] ldquoThey carryrdquo

[chalAtA huM] ldquoI driverdquo

[tErte hEM] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 37

Bangla-English data

[bai] ldquoI carryrdquo

[baibe] ldquoHe will carryrdquo

[bay] ldquoHe carriesrdquo

[bay] ldquoThey carryrdquo

[chAlAi] ldquoI driverdquo

[sAMtrAy] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 38

To translate hellip (repeated)

I will carry

They drive

He swims

They will drive

Jan 6 2014 ISI Kolkata ML for MT 39

Foundation

Data driven approach Goal is to find out the English sentence e

given foreign language sentence f whose p(e|f) is maximum

Translations are generated on the basis of statistical model

Parameters are estimated using bilingual parallel corpora

Jan 6 2014 ISI Kolkata ML for MT 40

SMT Language Model

To detect good English sentences

Probability of an English sentence w1w2 helliphellip wn can be written as

Pr(w1w2 helliphellip wn) = Pr(w1) Pr(w2|w1) Pr(wn|w1 w2 wn-1)

Here Pr(wn|w1 w2 wn-1) is the probability that word wn follows word string w1 w2 wn-1 N-gram model probability

Trigram model probability calculation

Jan 6 2014 ISI Kolkata ML for MT 41

SMT Translation Model

P(f|e) Probability of some f given hypothesis English translation e

How to assign the values to p(e|f)

Sentences are infinite not possible to find pair(ef) for all sentences

Introduce a hidden variable a that represents alignments between the individual words in the sentence pair

Sentence level

Word level

Jan 6 2014 ISI Kolkata ML for MT 42

Alignment

If the string e= e1l= e1 e2 hellipel has l words and

the string f= f1m=f1f2fm has m words

then the alignment a can be represented by a series a1

m= a1a2am of m values each between 0 and l such that if the word in position j of the f-string is connected to the word in position i of the e-string then aj= i and

if it is not connected to any English word then aj= O

Jan 6 2014 ISI Kolkata ML for MT 43

Example of alignment

English Ram went to school

Hindi Raama paathashaalaa gayaa

Ram went to school

ltNullgt Raama paathashaalaa gayaa

Jan 6 2014 ISI Kolkata ML for MT 44

Translation Model Exact expression

Five models for estimating parameters in the expression [2]

Model-1 Model-2 Model-3 Model-4 Model-5

Choose alignment given e and m

Choose the identity of foreign word given e m a

Choose the length of foreign language string given e

Jan 6 2014 ISI Kolkata ML for MT 45

a

eafef )|Pr()|Pr(

m

emafeaf )|Pr()|Pr(

m

emafememaf )|Pr()|Pr()|Pr(

m

emafem )|Pr()|Pr(

m

m

j

jj

jj emfaafem1

1

1

1

1 )|Pr()|Pr(

m

j

jj

j

jj

j

m

emfafemfaaem1

1

11

1

1

1

1 )|Pr()|Pr()|Pr(

)|Pr( emaf )|Pr( em

m

j

jj

j

jj

j emfafemfaa1

1

11

1

1

1

1 )|Pr()|Pr(

Proof of Translation Model Exact

expression

m is fixed for a particular f hence

marginalization

marginalization

Jan 6 2014 ISI Kolkata ML for MT 46

Alignment

Jan 6 2014 ISI Kolkata ML for MT 47

Fundamental and ubiquitous

Spell checking

Translation

Transliteration

Speech to text

Text to speeh

Jan 6 2014 ISI Kolkata ML for MT 48

EM for word alignment from

sentence alignment example

English

(1) three rabbits

a b

(2) rabbits of Grenoble

b c d

French

(1) trois lapins

w x

(2) lapins de Grenoble

x y z

Jan 6 2014 ISI Kolkata ML for MT 49

Initial Probabilities

each cell denotes t(a w) t(a x) etc

a b c d

w 14 14 14 14

x 14 14 14 14

y 14 14 14 14

z 14 14 14 14

The counts in IBM Model 1

Works by maximizing P(f|e) over the entire corpus

For IBM Model 1 we get the following relationship

c (w f |w e f e ) =t (w f |w e )

t (w f |we

0 ) +hellip + t (w f |wel )

c (w f |w e f e ) is the fractional count of the alignment of w f

with w e in f and e

t (w f |w e ) is the probability of w f being the translation of w e

is the count of w f in f

is the count of w e in e

Jan 6 2014 ISI Kolkata ML for MT 51

Example of expected count

C[aw (a b)(w x)]

t(aw)

= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)

t(aw)+t(ax)

14

= ----------------- X 1 X 1= 12

14+14

Jan 6 2014 ISI Kolkata ML for MT 52

ldquocountsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 13 13 13

y 0 13 13 13

z 0 13 13 13

a b

w x

a b c d

w 12 12 0 0

x 12 12 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 53

Revised probability example

trevised(a w)

12

= -------------------------------------------------------------------

(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)

Jan 6 2014 ISI Kolkata ML for MT 54

Revised probabilities table

a b c d

w 12 14 0 0

x 12 512 13 13

y 0 16 13 13

z 0 16 13 13

ldquorevised countsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 59 13 13

y 0 29 13 13

z 0 29 13 13

a b

w x

a b c d

w 12 38 0 0

x 12 58 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 56

Re-Revised probabilities table a b c d

w 12 316 0 0

x 12 85144 13 13

y 0 19 13 13

z 0 19 13 13

Continue until convergence notice that (bx) binding gets progressively stronger

b=rabbits x=lapins

Derivation of EM based Alignment Expressions

Hindi)(Say language ofy vocabular

English)(Say language ofry vocalbula

2

1

LV

LV

F

E

what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet

जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet

E1

F1

E2

F2

Jan 6 2014 ISI Kolkata ML for MT 58

Vocabulary mapping

Vocabulary VE VF

what is in a name that which we call rose by any other will smell as sweet

naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii

Jan 6 2014 ISI Kolkata ML for MT 59

Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like

11989011 11989012 hellip 119890

11198971 119891

11 11989112 hellip 119891

11198981

11989021 11989022 hellip 119890

21198972 119891

21 11989122 hellip 119891

21198982

1198901199041 119890

1199042 hellip 119890

119904119897119904 119891

1199041 1198911199042 hellip 119891

119904119898119904

1198901198781 119890

1198782 hellip 119890

119878119897119878 119891

1198781 1198911198782 hellip 119891

119878119898119878

No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890

119904119901 =Index of English word 119890119904119901in English vocabularydictionary

119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary

(Thanks to Sachin Pawar for helping with the maths formulae processing)

Jan 6 2014 ISI Kolkata ML for MT 60

Hidden variables and parameters

Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878

119904=1 where each hidden variable is

as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French

word 119911119901119902119904 = 0 otherwise

Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893

word in French vocabulary

Jan 6 2014 ISI Kolkata ML for MT 61

Likelihoods Data Likelihood L(D Θ)

Data Log-Likelihood LL(D Θ)

Expected value of Data Log-Likelihood E(LL(D Θ))

Jan 6 2014 ISI Kolkata ML for MT 62

Constraint and Lagrangian

119875119894119895

119881119865

119895=1

= 1 forall119894

Jan 6 2014 ISI Kolkata ML for MT 63

Differentiating wrt Pij

Jan 6 2014 ISI Kolkata ML for MT 64

Final E and M steps

M-step

E-step

Jan 6 2014 ISI Kolkata ML for MT 65

Combinatorial considerations

Jan 6 2014 ISI Kolkata ML for MT 66

Example

ISI Kolkata ML for MT Jan 6 2014 67

All possible alignments

ISI Kolkata ML for MT Jan 6 2014 68

First fundamental requirement of SMT

Alignment requires evidence of

bull firstly a translation pair to introduce the

POSSIBILITY of a mapping

bull then another pair to establish with

CERTAINTY the mapping

ISI Kolkata ML for MT Jan 6 2014 69

For the ldquocertaintyrdquo

We have a translation pair containing alignment candidates and none of the other words in the translation pair

OR

We have a translation pair containing all words in the translation pair except the alignment candidates

ISI Kolkata ML for MT Jan 6 2014 70

Thereforehellip

If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty

ISI Kolkata ML for MT Jan 6 2014 71

Rough estimate of data requirement

SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge

ie no meanings or grammatical properties of any words phrases or sentences

Each language has a vocabulary of 100000 words can give rise to about 500000 word forms

through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟

bdquowent‟ and bdquogone‟

ISI Kolkata ML for MT Jan 6 2014 72

Reasons for mapping to multiple words

Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register

polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)

syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood

ISI Kolkata ML for MT Jan 6 2014 73

Estimate of corpora requirement

Assume that on an average a sentence is 10 words long

an additional 9 translation pairs for getting at one of the 5 mappings

10 sentences per mapping per word

a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences

Estimate is not wide off the mark

Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs

ISI Kolkata ML for MT Jan 6 2014 74

Our work on factor based SMT

Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009

Jan 6 2014 ISI Kolkata ML for MT 75

Case Marker and Morphology crucial in E-H MT

Order of magnitiude facelift in Fluency and fidelity

Determined by the combination of suffixes and semantic relations on the English side

Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers

Jan 6 2014 ISI Kolkata ML for MT 76

Semantic relations+SuffixesCase

Markers+inflections

I ate mangoes

I ltagt ate eatpast mangoes ltobj

I ltagt mangoes ltobjpl eatpast

mei_ne aam khaa_yaa

Jan 6 2014 ISI Kolkata ML for MT 77

Our Approach

Factored model (Koehn and Hoang 2007) with the following translation factor

suffix + semantic relation case markersuffix

Experiments with the following relations

Dependency relations from the stanford parser

Deeper semantic roles from Universal Networking Language (UNL)

Jan 6 2014 ISI Kolkata ML for MT 78

Our Factorization

Jan 6 2014 ISI Kolkata ML for MT 79

Experiments

Jan 6 2014 ISI Kolkata ML for MT 80

Corpus Statistics

Jan 6 2014 ISI Kolkata ML for MT 81

Results The impact of suffix and semantic factors

Jan 6 2014 ISI Kolkata ML for MT 82

Results The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 83

Subjective Evaluation The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 84

Impact of sentence length (F Fluency AAdequacy E Errors)

Jan 6 2014 ISI Kolkata ML for MT 85

A feel for the improvement-baseline

Jan 6 2014 ISI Kolkata ML for MT 86

A feel for the improvement-reorder

Jan 6 2014 ISI Kolkata ML for MT 87

A feel for the improvement-Semantic relation

Jan 6 2014 ISI Kolkata ML for MT 88

A recent study

PAN Indian SMT

Jan 6 2014 ISI Kolkata ML for MT 89

Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator

SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi

Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English

Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences

Evaluation with BLEU METEOR scores also show high correlation with BLEU

Jan 6 2014 ISI Kolkata ML for MT 90

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Universality hypothesis

Universality hypothesis At the

level of ldquodeep meaningrdquo all texts

are the ldquosamerdquo whatever the

language

Jan 6 2014 ISI Kolkata ML for MT 15

Understanding the Analysis-Transfer-Generation over Vauquois triangle (14)

H11 सरकार_न चनावो_क_बाद मबई म करो_क_माधयम_स

अपन राजसव_को बढ़ाया | T11 Sarkaar ne chunaawo ke baad Mumbai me karoM ke

maadhyam se apne raajaswa ko badhaayaa

G11 Government_(ergative) elections_after Mumbai_in

taxes_through its revenue_(accusative) increased

E11 The Government increased its revenue after the

elections through taxes in Mumbai

Jan 6 2014 ISI Kolkata ML for MT 16

Interlingual representation complete disambiguation

Washington voted Washington to power Vote

past

Washington power Washington

emphasis

ltis-a gt action

ltis-a gt place

ltis-a gt capability ltis-a gt hellip

ltis-a gt person

goal

Jan 6 2014 ISI Kolkata ML for MT 17

Kinds of disambiguation needed for a complete and correct interlingua graph

N Name

P POS

A Attachment

S Sense

C Co-reference

R Semantic Role

Jan 6 2014 ISI Kolkata ML for MT 18

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech Noun or Verb

Jan 6 2014 ISI Kolkata ML for MT 19

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech

NER

John is the name of a PERSON

Jan 6 2014 ISI Kolkata ML for MT 20

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech

NER

WSD Financial

bank or River bank

Jan 6 2014 ISI Kolkata ML for MT 21

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech

NER

WSD

Co-reference

ldquoitrdquo ldquobankrdquo

Jan 6 2014 ISI Kolkata ML for MT 22

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech

NER

WSD

Co-reference

Subject Drop

Pro drop (subject ldquoIrdquo)

Jan 6 2014 ISI Kolkata ML for MT 23

Typical NLP tools used

POS tagger

Stanford Named Entity Recognizer

Stanford Dependency Parser

XLE Dependency Parser

Lexical Resource

WordNet

Universal Word Dictionary (UW++)

Jan 6 2014 ISI Kolkata ML for MT 24

System Architecture Stanford

Dependency Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple Sentence Analyser

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

Simple Enco

Simple Enco

Simple Enco

Simple Enco

Simple Enco

Simplifier

Jan 6 2014 ISI Kolkata ML for MT 25

Target Sentence Generation from interlingua

Lexical Transfer

Target Sentence

Generation

Syntax

Planning

Morphological

Synthesis

(WordPhrase

Translation ) (Word form

Generation) (Sequence)

Jan 6 2014 ISI Kolkata ML for MT 26

Generation Architecture Deconversion = Transfer + Generation

Jan 6 2014 ISI Kolkata ML for MT 27

Transfer Based MT

Marathi-Hindi

Jan 6 2014 ISI Kolkata ML for MT 28

Indian Language to Indian Language Machine Translation (ILILMT)

Bidirectional Machine Translation System

Developed for nine Indian language pairs

Approach

Transfer based

Modules developed using both rule based and statistical approach

Jan 6 2014 ISI Kolkata ML for MT 29

Architecture of ILILMT System

Morphological Analyzer

Source Text

POS Tagger

Chunker

Vibhakti Computation

Name Entity Recognizer

Word Sense Disambiguatio

n

Lexical Transfer

Agreement Feature

Interchunk

Word Generator

Intrachunk

Target Text

Analysis

Transfer

Generation

Jan 6 2014 ISI Kolkata ML for MT 30

M-H MT system Evaluation

Subjective evaluation based on machine translation quality

Accuracy calculated based on score given by linguists

S5 Number of score 5 Sentences S4 Number of score 4 sentences

S3 Number of score 3 sentences N Total Number of sentences

Accuracy =

Score 5 Correct Translation

Score 4 Understandable with

minor errors

Score 3 Understandable with

major errors

Score 2 Not Understandable

Score 1 Non sense translation

Jan 6 2014 ISI Kolkata ML for MT 31

Evaluation of Marathi to Hindi MT System

0

02

04

06

08

1

12

MorphAnalyzer

POS Tagger Chunker VibhaktiCompute

WSD LexicalTransfer

WordGenerator

Precision

Recall

Module-wise precision and recall

Jan 6 2014 ISI Kolkata ML for MT 32

Evaluation of Marathi to Hindi MT System (cont)

Subjective evaluation on translation quality Evaluated on 500 web sentences Accuracy calculated based on score given according to the

translation quality Accuracy 6532

Result analysis

Morph POS tagger chunker gives more than 90 precision but Transfer WSD generator modules are below 80 hence degrades MT quality

Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Jan 6 2014 ISI Kolkata ML for MT 33

Statistical Machine Translation

Jan 6 2014 ISI Kolkata ML for MT 34

Czeck-English data

[nesu] ldquoI carryrdquo

[ponese] ldquoHe will carryrdquo

[nese] ldquoHe carriesrdquo

[nesou] ldquoThey carryrdquo

[yedu] ldquoI driverdquo

[plavou] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 35

To translate hellip

I will carry

They drive

He swims

They will drive

Jan 6 2014 ISI Kolkata ML for MT 36

Hindi-English data

[DhotA huM] ldquoI carryrdquo

[DhoegA] ldquoHe will carryrdquo

[DhotA hAi] ldquoHe carriesrdquo

[Dhote hAi] ldquoThey carryrdquo

[chalAtA huM] ldquoI driverdquo

[tErte hEM] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 37

Bangla-English data

[bai] ldquoI carryrdquo

[baibe] ldquoHe will carryrdquo

[bay] ldquoHe carriesrdquo

[bay] ldquoThey carryrdquo

[chAlAi] ldquoI driverdquo

[sAMtrAy] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 38

To translate hellip (repeated)

I will carry

They drive

He swims

They will drive

Jan 6 2014 ISI Kolkata ML for MT 39

Foundation

Data driven approach Goal is to find out the English sentence e

given foreign language sentence f whose p(e|f) is maximum

Translations are generated on the basis of statistical model

Parameters are estimated using bilingual parallel corpora

Jan 6 2014 ISI Kolkata ML for MT 40

SMT Language Model

To detect good English sentences

Probability of an English sentence w1w2 helliphellip wn can be written as

Pr(w1w2 helliphellip wn) = Pr(w1) Pr(w2|w1) Pr(wn|w1 w2 wn-1)

Here Pr(wn|w1 w2 wn-1) is the probability that word wn follows word string w1 w2 wn-1 N-gram model probability

Trigram model probability calculation

Jan 6 2014 ISI Kolkata ML for MT 41

SMT Translation Model

P(f|e) Probability of some f given hypothesis English translation e

How to assign the values to p(e|f)

Sentences are infinite not possible to find pair(ef) for all sentences

Introduce a hidden variable a that represents alignments between the individual words in the sentence pair

Sentence level

Word level

Jan 6 2014 ISI Kolkata ML for MT 42

Alignment

If the string e= e1l= e1 e2 hellipel has l words and

the string f= f1m=f1f2fm has m words

then the alignment a can be represented by a series a1

m= a1a2am of m values each between 0 and l such that if the word in position j of the f-string is connected to the word in position i of the e-string then aj= i and

if it is not connected to any English word then aj= O

Jan 6 2014 ISI Kolkata ML for MT 43

Example of alignment

English Ram went to school

Hindi Raama paathashaalaa gayaa

Ram went to school

ltNullgt Raama paathashaalaa gayaa

Jan 6 2014 ISI Kolkata ML for MT 44

Translation Model Exact expression

Five models for estimating parameters in the expression [2]

Model-1 Model-2 Model-3 Model-4 Model-5

Choose alignment given e and m

Choose the identity of foreign word given e m a

Choose the length of foreign language string given e

Jan 6 2014 ISI Kolkata ML for MT 45

a

eafef )|Pr()|Pr(

m

emafeaf )|Pr()|Pr(

m

emafememaf )|Pr()|Pr()|Pr(

m

emafem )|Pr()|Pr(

m

m

j

jj

jj emfaafem1

1

1

1

1 )|Pr()|Pr(

m

j

jj

j

jj

j

m

emfafemfaaem1

1

11

1

1

1

1 )|Pr()|Pr()|Pr(

)|Pr( emaf )|Pr( em

m

j

jj

j

jj

j emfafemfaa1

1

11

1

1

1

1 )|Pr()|Pr(

Proof of Translation Model Exact

expression

m is fixed for a particular f hence

marginalization

marginalization

Jan 6 2014 ISI Kolkata ML for MT 46

Alignment

Jan 6 2014 ISI Kolkata ML for MT 47

Fundamental and ubiquitous

Spell checking

Translation

Transliteration

Speech to text

Text to speeh

Jan 6 2014 ISI Kolkata ML for MT 48

EM for word alignment from

sentence alignment example

English

(1) three rabbits

a b

(2) rabbits of Grenoble

b c d

French

(1) trois lapins

w x

(2) lapins de Grenoble

x y z

Jan 6 2014 ISI Kolkata ML for MT 49

Initial Probabilities

each cell denotes t(a w) t(a x) etc

a b c d

w 14 14 14 14

x 14 14 14 14

y 14 14 14 14

z 14 14 14 14

The counts in IBM Model 1

Works by maximizing P(f|e) over the entire corpus

For IBM Model 1 we get the following relationship

c (w f |w e f e ) =t (w f |w e )

t (w f |we

0 ) +hellip + t (w f |wel )

c (w f |w e f e ) is the fractional count of the alignment of w f

with w e in f and e

t (w f |w e ) is the probability of w f being the translation of w e

is the count of w f in f

is the count of w e in e

Jan 6 2014 ISI Kolkata ML for MT 51

Example of expected count

C[aw (a b)(w x)]

t(aw)

= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)

t(aw)+t(ax)

14

= ----------------- X 1 X 1= 12

14+14

Jan 6 2014 ISI Kolkata ML for MT 52

ldquocountsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 13 13 13

y 0 13 13 13

z 0 13 13 13

a b

w x

a b c d

w 12 12 0 0

x 12 12 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 53

Revised probability example

trevised(a w)

12

= -------------------------------------------------------------------

(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)

Jan 6 2014 ISI Kolkata ML for MT 54

Revised probabilities table

a b c d

w 12 14 0 0

x 12 512 13 13

y 0 16 13 13

z 0 16 13 13

ldquorevised countsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 59 13 13

y 0 29 13 13

z 0 29 13 13

a b

w x

a b c d

w 12 38 0 0

x 12 58 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 56

Re-Revised probabilities table a b c d

w 12 316 0 0

x 12 85144 13 13

y 0 19 13 13

z 0 19 13 13

Continue until convergence notice that (bx) binding gets progressively stronger

b=rabbits x=lapins

Derivation of EM based Alignment Expressions

Hindi)(Say language ofy vocabular

English)(Say language ofry vocalbula

2

1

LV

LV

F

E

what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet

जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet

E1

F1

E2

F2

Jan 6 2014 ISI Kolkata ML for MT 58

Vocabulary mapping

Vocabulary VE VF

what is in a name that which we call rose by any other will smell as sweet

naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii

Jan 6 2014 ISI Kolkata ML for MT 59

Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like

11989011 11989012 hellip 119890

11198971 119891

11 11989112 hellip 119891

11198981

11989021 11989022 hellip 119890

21198972 119891

21 11989122 hellip 119891

21198982

1198901199041 119890

1199042 hellip 119890

119904119897119904 119891

1199041 1198911199042 hellip 119891

119904119898119904

1198901198781 119890

1198782 hellip 119890

119878119897119878 119891

1198781 1198911198782 hellip 119891

119878119898119878

No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890

119904119901 =Index of English word 119890119904119901in English vocabularydictionary

119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary

(Thanks to Sachin Pawar for helping with the maths formulae processing)

Jan 6 2014 ISI Kolkata ML for MT 60

Hidden variables and parameters

Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878

119904=1 where each hidden variable is

as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French

word 119911119901119902119904 = 0 otherwise

Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893

word in French vocabulary

Jan 6 2014 ISI Kolkata ML for MT 61

Likelihoods Data Likelihood L(D Θ)

Data Log-Likelihood LL(D Θ)

Expected value of Data Log-Likelihood E(LL(D Θ))

Jan 6 2014 ISI Kolkata ML for MT 62

Constraint and Lagrangian

119875119894119895

119881119865

119895=1

= 1 forall119894

Jan 6 2014 ISI Kolkata ML for MT 63

Differentiating wrt Pij

Jan 6 2014 ISI Kolkata ML for MT 64

Final E and M steps

M-step

E-step

Jan 6 2014 ISI Kolkata ML for MT 65

Combinatorial considerations

Jan 6 2014 ISI Kolkata ML for MT 66

Example

ISI Kolkata ML for MT Jan 6 2014 67

All possible alignments

ISI Kolkata ML for MT Jan 6 2014 68

First fundamental requirement of SMT

Alignment requires evidence of

bull firstly a translation pair to introduce the

POSSIBILITY of a mapping

bull then another pair to establish with

CERTAINTY the mapping

ISI Kolkata ML for MT Jan 6 2014 69

For the ldquocertaintyrdquo

We have a translation pair containing alignment candidates and none of the other words in the translation pair

OR

We have a translation pair containing all words in the translation pair except the alignment candidates

ISI Kolkata ML for MT Jan 6 2014 70

Thereforehellip

If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty

ISI Kolkata ML for MT Jan 6 2014 71

Rough estimate of data requirement

SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge

ie no meanings or grammatical properties of any words phrases or sentences

Each language has a vocabulary of 100000 words can give rise to about 500000 word forms

through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟

bdquowent‟ and bdquogone‟

ISI Kolkata ML for MT Jan 6 2014 72

Reasons for mapping to multiple words

Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register

polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)

syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood

ISI Kolkata ML for MT Jan 6 2014 73

Estimate of corpora requirement

Assume that on an average a sentence is 10 words long

an additional 9 translation pairs for getting at one of the 5 mappings

10 sentences per mapping per word

a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences

Estimate is not wide off the mark

Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs

ISI Kolkata ML for MT Jan 6 2014 74

Our work on factor based SMT

Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009

Jan 6 2014 ISI Kolkata ML for MT 75

Case Marker and Morphology crucial in E-H MT

Order of magnitiude facelift in Fluency and fidelity

Determined by the combination of suffixes and semantic relations on the English side

Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers

Jan 6 2014 ISI Kolkata ML for MT 76

Semantic relations+SuffixesCase

Markers+inflections

I ate mangoes

I ltagt ate eatpast mangoes ltobj

I ltagt mangoes ltobjpl eatpast

mei_ne aam khaa_yaa

Jan 6 2014 ISI Kolkata ML for MT 77

Our Approach

Factored model (Koehn and Hoang 2007) with the following translation factor

suffix + semantic relation case markersuffix

Experiments with the following relations

Dependency relations from the stanford parser

Deeper semantic roles from Universal Networking Language (UNL)

Jan 6 2014 ISI Kolkata ML for MT 78

Our Factorization

Jan 6 2014 ISI Kolkata ML for MT 79

Experiments

Jan 6 2014 ISI Kolkata ML for MT 80

Corpus Statistics

Jan 6 2014 ISI Kolkata ML for MT 81

Results The impact of suffix and semantic factors

Jan 6 2014 ISI Kolkata ML for MT 82

Results The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 83

Subjective Evaluation The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 84

Impact of sentence length (F Fluency AAdequacy E Errors)

Jan 6 2014 ISI Kolkata ML for MT 85

A feel for the improvement-baseline

Jan 6 2014 ISI Kolkata ML for MT 86

A feel for the improvement-reorder

Jan 6 2014 ISI Kolkata ML for MT 87

A feel for the improvement-Semantic relation

Jan 6 2014 ISI Kolkata ML for MT 88

A recent study

PAN Indian SMT

Jan 6 2014 ISI Kolkata ML for MT 89

Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator

SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi

Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English

Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences

Evaluation with BLEU METEOR scores also show high correlation with BLEU

Jan 6 2014 ISI Kolkata ML for MT 90

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Understanding the Analysis-Transfer-Generation over Vauquois triangle (14)

H11 सरकार_न चनावो_क_बाद मबई म करो_क_माधयम_स

अपन राजसव_को बढ़ाया | T11 Sarkaar ne chunaawo ke baad Mumbai me karoM ke

maadhyam se apne raajaswa ko badhaayaa

G11 Government_(ergative) elections_after Mumbai_in

taxes_through its revenue_(accusative) increased

E11 The Government increased its revenue after the

elections through taxes in Mumbai

Jan 6 2014 ISI Kolkata ML for MT 16

Interlingual representation complete disambiguation

Washington voted Washington to power Vote

past

Washington power Washington

emphasis

ltis-a gt action

ltis-a gt place

ltis-a gt capability ltis-a gt hellip

ltis-a gt person

goal

Jan 6 2014 ISI Kolkata ML for MT 17

Kinds of disambiguation needed for a complete and correct interlingua graph

N Name

P POS

A Attachment

S Sense

C Co-reference

R Semantic Role

Jan 6 2014 ISI Kolkata ML for MT 18

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech Noun or Verb

Jan 6 2014 ISI Kolkata ML for MT 19

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech

NER

John is the name of a PERSON

Jan 6 2014 ISI Kolkata ML for MT 20

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech

NER

WSD Financial

bank or River bank

Jan 6 2014 ISI Kolkata ML for MT 21

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech

NER

WSD

Co-reference

ldquoitrdquo ldquobankrdquo

Jan 6 2014 ISI Kolkata ML for MT 22

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech

NER

WSD

Co-reference

Subject Drop

Pro drop (subject ldquoIrdquo)

Jan 6 2014 ISI Kolkata ML for MT 23

Typical NLP tools used

POS tagger

Stanford Named Entity Recognizer

Stanford Dependency Parser

XLE Dependency Parser

Lexical Resource

WordNet

Universal Word Dictionary (UW++)

Jan 6 2014 ISI Kolkata ML for MT 24

System Architecture Stanford

Dependency Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple Sentence Analyser

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

Simple Enco

Simple Enco

Simple Enco

Simple Enco

Simple Enco

Simplifier

Jan 6 2014 ISI Kolkata ML for MT 25

Target Sentence Generation from interlingua

Lexical Transfer

Target Sentence

Generation

Syntax

Planning

Morphological

Synthesis

(WordPhrase

Translation ) (Word form

Generation) (Sequence)

Jan 6 2014 ISI Kolkata ML for MT 26

Generation Architecture Deconversion = Transfer + Generation

Jan 6 2014 ISI Kolkata ML for MT 27

Transfer Based MT

Marathi-Hindi

Jan 6 2014 ISI Kolkata ML for MT 28

Indian Language to Indian Language Machine Translation (ILILMT)

Bidirectional Machine Translation System

Developed for nine Indian language pairs

Approach

Transfer based

Modules developed using both rule based and statistical approach

Jan 6 2014 ISI Kolkata ML for MT 29

Architecture of ILILMT System

Morphological Analyzer

Source Text

POS Tagger

Chunker

Vibhakti Computation

Name Entity Recognizer

Word Sense Disambiguatio

n

Lexical Transfer

Agreement Feature

Interchunk

Word Generator

Intrachunk

Target Text

Analysis

Transfer

Generation

Jan 6 2014 ISI Kolkata ML for MT 30

M-H MT system Evaluation

Subjective evaluation based on machine translation quality

Accuracy calculated based on score given by linguists

S5 Number of score 5 Sentences S4 Number of score 4 sentences

S3 Number of score 3 sentences N Total Number of sentences

Accuracy =

Score 5 Correct Translation

Score 4 Understandable with

minor errors

Score 3 Understandable with

major errors

Score 2 Not Understandable

Score 1 Non sense translation

Jan 6 2014 ISI Kolkata ML for MT 31

Evaluation of Marathi to Hindi MT System

0

02

04

06

08

1

12

MorphAnalyzer

POS Tagger Chunker VibhaktiCompute

WSD LexicalTransfer

WordGenerator

Precision

Recall

Module-wise precision and recall

Jan 6 2014 ISI Kolkata ML for MT 32

Evaluation of Marathi to Hindi MT System (cont)

Subjective evaluation on translation quality Evaluated on 500 web sentences Accuracy calculated based on score given according to the

translation quality Accuracy 6532

Result analysis

Morph POS tagger chunker gives more than 90 precision but Transfer WSD generator modules are below 80 hence degrades MT quality

Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Jan 6 2014 ISI Kolkata ML for MT 33

Statistical Machine Translation

Jan 6 2014 ISI Kolkata ML for MT 34

Czeck-English data

[nesu] ldquoI carryrdquo

[ponese] ldquoHe will carryrdquo

[nese] ldquoHe carriesrdquo

[nesou] ldquoThey carryrdquo

[yedu] ldquoI driverdquo

[plavou] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 35

To translate hellip

I will carry

They drive

He swims

They will drive

Jan 6 2014 ISI Kolkata ML for MT 36

Hindi-English data

[DhotA huM] ldquoI carryrdquo

[DhoegA] ldquoHe will carryrdquo

[DhotA hAi] ldquoHe carriesrdquo

[Dhote hAi] ldquoThey carryrdquo

[chalAtA huM] ldquoI driverdquo

[tErte hEM] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 37

Bangla-English data

[bai] ldquoI carryrdquo

[baibe] ldquoHe will carryrdquo

[bay] ldquoHe carriesrdquo

[bay] ldquoThey carryrdquo

[chAlAi] ldquoI driverdquo

[sAMtrAy] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 38

To translate hellip (repeated)

I will carry

They drive

He swims

They will drive

Jan 6 2014 ISI Kolkata ML for MT 39

Foundation

Data driven approach Goal is to find out the English sentence e

given foreign language sentence f whose p(e|f) is maximum

Translations are generated on the basis of statistical model

Parameters are estimated using bilingual parallel corpora

Jan 6 2014 ISI Kolkata ML for MT 40

SMT Language Model

To detect good English sentences

Probability of an English sentence w1w2 helliphellip wn can be written as

Pr(w1w2 helliphellip wn) = Pr(w1) Pr(w2|w1) Pr(wn|w1 w2 wn-1)

Here Pr(wn|w1 w2 wn-1) is the probability that word wn follows word string w1 w2 wn-1 N-gram model probability

Trigram model probability calculation

Jan 6 2014 ISI Kolkata ML for MT 41

SMT Translation Model

P(f|e) Probability of some f given hypothesis English translation e

How to assign the values to p(e|f)

Sentences are infinite not possible to find pair(ef) for all sentences

Introduce a hidden variable a that represents alignments between the individual words in the sentence pair

Sentence level

Word level

Jan 6 2014 ISI Kolkata ML for MT 42

Alignment

If the string e= e1l= e1 e2 hellipel has l words and

the string f= f1m=f1f2fm has m words

then the alignment a can be represented by a series a1

m= a1a2am of m values each between 0 and l such that if the word in position j of the f-string is connected to the word in position i of the e-string then aj= i and

if it is not connected to any English word then aj= O

Jan 6 2014 ISI Kolkata ML for MT 43

Example of alignment

English Ram went to school

Hindi Raama paathashaalaa gayaa

Ram went to school

ltNullgt Raama paathashaalaa gayaa

Jan 6 2014 ISI Kolkata ML for MT 44

Translation Model Exact expression

Five models for estimating parameters in the expression [2]

Model-1 Model-2 Model-3 Model-4 Model-5

Choose alignment given e and m

Choose the identity of foreign word given e m a

Choose the length of foreign language string given e

Jan 6 2014 ISI Kolkata ML for MT 45

a

eafef )|Pr()|Pr(

m

emafeaf )|Pr()|Pr(

m

emafememaf )|Pr()|Pr()|Pr(

m

emafem )|Pr()|Pr(

m

m

j

jj

jj emfaafem1

1

1

1

1 )|Pr()|Pr(

m

j

jj

j

jj

j

m

emfafemfaaem1

1

11

1

1

1

1 )|Pr()|Pr()|Pr(

)|Pr( emaf )|Pr( em

m

j

jj

j

jj

j emfafemfaa1

1

11

1

1

1

1 )|Pr()|Pr(

Proof of Translation Model Exact

expression

m is fixed for a particular f hence

marginalization

marginalization

Jan 6 2014 ISI Kolkata ML for MT 46

Alignment

Jan 6 2014 ISI Kolkata ML for MT 47

Fundamental and ubiquitous

Spell checking

Translation

Transliteration

Speech to text

Text to speeh

Jan 6 2014 ISI Kolkata ML for MT 48

EM for word alignment from

sentence alignment example

English

(1) three rabbits

a b

(2) rabbits of Grenoble

b c d

French

(1) trois lapins

w x

(2) lapins de Grenoble

x y z

Jan 6 2014 ISI Kolkata ML for MT 49

Initial Probabilities

each cell denotes t(a w) t(a x) etc

a b c d

w 14 14 14 14

x 14 14 14 14

y 14 14 14 14

z 14 14 14 14

The counts in IBM Model 1

Works by maximizing P(f|e) over the entire corpus

For IBM Model 1 we get the following relationship

c (w f |w e f e ) =t (w f |w e )

t (w f |we

0 ) +hellip + t (w f |wel )

c (w f |w e f e ) is the fractional count of the alignment of w f

with w e in f and e

t (w f |w e ) is the probability of w f being the translation of w e

is the count of w f in f

is the count of w e in e

Jan 6 2014 ISI Kolkata ML for MT 51

Example of expected count

C[aw (a b)(w x)]

t(aw)

= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)

t(aw)+t(ax)

14

= ----------------- X 1 X 1= 12

14+14

Jan 6 2014 ISI Kolkata ML for MT 52

ldquocountsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 13 13 13

y 0 13 13 13

z 0 13 13 13

a b

w x

a b c d

w 12 12 0 0

x 12 12 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 53

Revised probability example

trevised(a w)

12

= -------------------------------------------------------------------

(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)

Jan 6 2014 ISI Kolkata ML for MT 54

Revised probabilities table

a b c d

w 12 14 0 0

x 12 512 13 13

y 0 16 13 13

z 0 16 13 13

ldquorevised countsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 59 13 13

y 0 29 13 13

z 0 29 13 13

a b

w x

a b c d

w 12 38 0 0

x 12 58 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 56

Re-Revised probabilities table a b c d

w 12 316 0 0

x 12 85144 13 13

y 0 19 13 13

z 0 19 13 13

Continue until convergence notice that (bx) binding gets progressively stronger

b=rabbits x=lapins

Derivation of EM based Alignment Expressions

Hindi)(Say language ofy vocabular

English)(Say language ofry vocalbula

2

1

LV

LV

F

E

what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet

जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet

E1

F1

E2

F2

Jan 6 2014 ISI Kolkata ML for MT 58

Vocabulary mapping

Vocabulary VE VF

what is in a name that which we call rose by any other will smell as sweet

naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii

Jan 6 2014 ISI Kolkata ML for MT 59

Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like

11989011 11989012 hellip 119890

11198971 119891

11 11989112 hellip 119891

11198981

11989021 11989022 hellip 119890

21198972 119891

21 11989122 hellip 119891

21198982

1198901199041 119890

1199042 hellip 119890

119904119897119904 119891

1199041 1198911199042 hellip 119891

119904119898119904

1198901198781 119890

1198782 hellip 119890

119878119897119878 119891

1198781 1198911198782 hellip 119891

119878119898119878

No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890

119904119901 =Index of English word 119890119904119901in English vocabularydictionary

119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary

(Thanks to Sachin Pawar for helping with the maths formulae processing)

Jan 6 2014 ISI Kolkata ML for MT 60

Hidden variables and parameters

Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878

119904=1 where each hidden variable is

as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French

word 119911119901119902119904 = 0 otherwise

Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893

word in French vocabulary

Jan 6 2014 ISI Kolkata ML for MT 61

Likelihoods Data Likelihood L(D Θ)

Data Log-Likelihood LL(D Θ)

Expected value of Data Log-Likelihood E(LL(D Θ))

Jan 6 2014 ISI Kolkata ML for MT 62

Constraint and Lagrangian

119875119894119895

119881119865

119895=1

= 1 forall119894

Jan 6 2014 ISI Kolkata ML for MT 63

Differentiating wrt Pij

Jan 6 2014 ISI Kolkata ML for MT 64

Final E and M steps

M-step

E-step

Jan 6 2014 ISI Kolkata ML for MT 65

Combinatorial considerations

Jan 6 2014 ISI Kolkata ML for MT 66

Example

ISI Kolkata ML for MT Jan 6 2014 67

All possible alignments

ISI Kolkata ML for MT Jan 6 2014 68

First fundamental requirement of SMT

Alignment requires evidence of

bull firstly a translation pair to introduce the

POSSIBILITY of a mapping

bull then another pair to establish with

CERTAINTY the mapping

ISI Kolkata ML for MT Jan 6 2014 69

For the ldquocertaintyrdquo

We have a translation pair containing alignment candidates and none of the other words in the translation pair

OR

We have a translation pair containing all words in the translation pair except the alignment candidates

ISI Kolkata ML for MT Jan 6 2014 70

Thereforehellip

If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty

ISI Kolkata ML for MT Jan 6 2014 71

Rough estimate of data requirement

SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge

ie no meanings or grammatical properties of any words phrases or sentences

Each language has a vocabulary of 100000 words can give rise to about 500000 word forms

through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟

bdquowent‟ and bdquogone‟

ISI Kolkata ML for MT Jan 6 2014 72

Reasons for mapping to multiple words

Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register

polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)

syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood

ISI Kolkata ML for MT Jan 6 2014 73

Estimate of corpora requirement

Assume that on an average a sentence is 10 words long

an additional 9 translation pairs for getting at one of the 5 mappings

10 sentences per mapping per word

a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences

Estimate is not wide off the mark

Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs

ISI Kolkata ML for MT Jan 6 2014 74

Our work on factor based SMT

Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009

Jan 6 2014 ISI Kolkata ML for MT 75

Case Marker and Morphology crucial in E-H MT

Order of magnitiude facelift in Fluency and fidelity

Determined by the combination of suffixes and semantic relations on the English side

Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers

Jan 6 2014 ISI Kolkata ML for MT 76

Semantic relations+SuffixesCase

Markers+inflections

I ate mangoes

I ltagt ate eatpast mangoes ltobj

I ltagt mangoes ltobjpl eatpast

mei_ne aam khaa_yaa

Jan 6 2014 ISI Kolkata ML for MT 77

Our Approach

Factored model (Koehn and Hoang 2007) with the following translation factor

suffix + semantic relation case markersuffix

Experiments with the following relations

Dependency relations from the stanford parser

Deeper semantic roles from Universal Networking Language (UNL)

Jan 6 2014 ISI Kolkata ML for MT 78

Our Factorization

Jan 6 2014 ISI Kolkata ML for MT 79

Experiments

Jan 6 2014 ISI Kolkata ML for MT 80

Corpus Statistics

Jan 6 2014 ISI Kolkata ML for MT 81

Results The impact of suffix and semantic factors

Jan 6 2014 ISI Kolkata ML for MT 82

Results The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 83

Subjective Evaluation The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 84

Impact of sentence length (F Fluency AAdequacy E Errors)

Jan 6 2014 ISI Kolkata ML for MT 85

A feel for the improvement-baseline

Jan 6 2014 ISI Kolkata ML for MT 86

A feel for the improvement-reorder

Jan 6 2014 ISI Kolkata ML for MT 87

A feel for the improvement-Semantic relation

Jan 6 2014 ISI Kolkata ML for MT 88

A recent study

PAN Indian SMT

Jan 6 2014 ISI Kolkata ML for MT 89

Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator

SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi

Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English

Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences

Evaluation with BLEU METEOR scores also show high correlation with BLEU

Jan 6 2014 ISI Kolkata ML for MT 90

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Interlingual representation complete disambiguation

Washington voted Washington to power Vote

past

Washington power Washington

emphasis

ltis-a gt action

ltis-a gt place

ltis-a gt capability ltis-a gt hellip

ltis-a gt person

goal

Jan 6 2014 ISI Kolkata ML for MT 17

Kinds of disambiguation needed for a complete and correct interlingua graph

N Name

P POS

A Attachment

S Sense

C Co-reference

R Semantic Role

Jan 6 2014 ISI Kolkata ML for MT 18

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech Noun or Verb

Jan 6 2014 ISI Kolkata ML for MT 19

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech

NER

John is the name of a PERSON

Jan 6 2014 ISI Kolkata ML for MT 20

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech

NER

WSD Financial

bank or River bank

Jan 6 2014 ISI Kolkata ML for MT 21

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech

NER

WSD

Co-reference

ldquoitrdquo ldquobankrdquo

Jan 6 2014 ISI Kolkata ML for MT 22

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech

NER

WSD

Co-reference

Subject Drop

Pro drop (subject ldquoIrdquo)

Jan 6 2014 ISI Kolkata ML for MT 23

Typical NLP tools used

POS tagger

Stanford Named Entity Recognizer

Stanford Dependency Parser

XLE Dependency Parser

Lexical Resource

WordNet

Universal Word Dictionary (UW++)

Jan 6 2014 ISI Kolkata ML for MT 24

System Architecture Stanford

Dependency Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple Sentence Analyser

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

Simple Enco

Simple Enco

Simple Enco

Simple Enco

Simple Enco

Simplifier

Jan 6 2014 ISI Kolkata ML for MT 25

Target Sentence Generation from interlingua

Lexical Transfer

Target Sentence

Generation

Syntax

Planning

Morphological

Synthesis

(WordPhrase

Translation ) (Word form

Generation) (Sequence)

Jan 6 2014 ISI Kolkata ML for MT 26

Generation Architecture Deconversion = Transfer + Generation

Jan 6 2014 ISI Kolkata ML for MT 27

Transfer Based MT

Marathi-Hindi

Jan 6 2014 ISI Kolkata ML for MT 28

Indian Language to Indian Language Machine Translation (ILILMT)

Bidirectional Machine Translation System

Developed for nine Indian language pairs

Approach

Transfer based

Modules developed using both rule based and statistical approach

Jan 6 2014 ISI Kolkata ML for MT 29

Architecture of ILILMT System

Morphological Analyzer

Source Text

POS Tagger

Chunker

Vibhakti Computation

Name Entity Recognizer

Word Sense Disambiguatio

n

Lexical Transfer

Agreement Feature

Interchunk

Word Generator

Intrachunk

Target Text

Analysis

Transfer

Generation

Jan 6 2014 ISI Kolkata ML for MT 30

M-H MT system Evaluation

Subjective evaluation based on machine translation quality

Accuracy calculated based on score given by linguists

S5 Number of score 5 Sentences S4 Number of score 4 sentences

S3 Number of score 3 sentences N Total Number of sentences

Accuracy =

Score 5 Correct Translation

Score 4 Understandable with

minor errors

Score 3 Understandable with

major errors

Score 2 Not Understandable

Score 1 Non sense translation

Jan 6 2014 ISI Kolkata ML for MT 31

Evaluation of Marathi to Hindi MT System

0

02

04

06

08

1

12

MorphAnalyzer

POS Tagger Chunker VibhaktiCompute

WSD LexicalTransfer

WordGenerator

Precision

Recall

Module-wise precision and recall

Jan 6 2014 ISI Kolkata ML for MT 32

Evaluation of Marathi to Hindi MT System (cont)

Subjective evaluation on translation quality Evaluated on 500 web sentences Accuracy calculated based on score given according to the

translation quality Accuracy 6532

Result analysis

Morph POS tagger chunker gives more than 90 precision but Transfer WSD generator modules are below 80 hence degrades MT quality

Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Jan 6 2014 ISI Kolkata ML for MT 33

Statistical Machine Translation

Jan 6 2014 ISI Kolkata ML for MT 34

Czeck-English data

[nesu] ldquoI carryrdquo

[ponese] ldquoHe will carryrdquo

[nese] ldquoHe carriesrdquo

[nesou] ldquoThey carryrdquo

[yedu] ldquoI driverdquo

[plavou] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 35

To translate hellip

I will carry

They drive

He swims

They will drive

Jan 6 2014 ISI Kolkata ML for MT 36

Hindi-English data

[DhotA huM] ldquoI carryrdquo

[DhoegA] ldquoHe will carryrdquo

[DhotA hAi] ldquoHe carriesrdquo

[Dhote hAi] ldquoThey carryrdquo

[chalAtA huM] ldquoI driverdquo

[tErte hEM] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 37

Bangla-English data

[bai] ldquoI carryrdquo

[baibe] ldquoHe will carryrdquo

[bay] ldquoHe carriesrdquo

[bay] ldquoThey carryrdquo

[chAlAi] ldquoI driverdquo

[sAMtrAy] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 38

To translate hellip (repeated)

I will carry

They drive

He swims

They will drive

Jan 6 2014 ISI Kolkata ML for MT 39

Foundation

Data driven approach Goal is to find out the English sentence e

given foreign language sentence f whose p(e|f) is maximum

Translations are generated on the basis of statistical model

Parameters are estimated using bilingual parallel corpora

Jan 6 2014 ISI Kolkata ML for MT 40

SMT Language Model

To detect good English sentences

Probability of an English sentence w1w2 helliphellip wn can be written as

Pr(w1w2 helliphellip wn) = Pr(w1) Pr(w2|w1) Pr(wn|w1 w2 wn-1)

Here Pr(wn|w1 w2 wn-1) is the probability that word wn follows word string w1 w2 wn-1 N-gram model probability

Trigram model probability calculation

Jan 6 2014 ISI Kolkata ML for MT 41

SMT Translation Model

P(f|e) Probability of some f given hypothesis English translation e

How to assign the values to p(e|f)

Sentences are infinite not possible to find pair(ef) for all sentences

Introduce a hidden variable a that represents alignments between the individual words in the sentence pair

Sentence level

Word level

Jan 6 2014 ISI Kolkata ML for MT 42

Alignment

If the string e= e1l= e1 e2 hellipel has l words and

the string f= f1m=f1f2fm has m words

then the alignment a can be represented by a series a1

m= a1a2am of m values each between 0 and l such that if the word in position j of the f-string is connected to the word in position i of the e-string then aj= i and

if it is not connected to any English word then aj= O

Jan 6 2014 ISI Kolkata ML for MT 43

Example of alignment

English Ram went to school

Hindi Raama paathashaalaa gayaa

Ram went to school

ltNullgt Raama paathashaalaa gayaa

Jan 6 2014 ISI Kolkata ML for MT 44

Translation Model Exact expression

Five models for estimating parameters in the expression [2]

Model-1 Model-2 Model-3 Model-4 Model-5

Choose alignment given e and m

Choose the identity of foreign word given e m a

Choose the length of foreign language string given e

Jan 6 2014 ISI Kolkata ML for MT 45

a

eafef )|Pr()|Pr(

m

emafeaf )|Pr()|Pr(

m

emafememaf )|Pr()|Pr()|Pr(

m

emafem )|Pr()|Pr(

m

m

j

jj

jj emfaafem1

1

1

1

1 )|Pr()|Pr(

m

j

jj

j

jj

j

m

emfafemfaaem1

1

11

1

1

1

1 )|Pr()|Pr()|Pr(

)|Pr( emaf )|Pr( em

m

j

jj

j

jj

j emfafemfaa1

1

11

1

1

1

1 )|Pr()|Pr(

Proof of Translation Model Exact

expression

m is fixed for a particular f hence

marginalization

marginalization

Jan 6 2014 ISI Kolkata ML for MT 46

Alignment

Jan 6 2014 ISI Kolkata ML for MT 47

Fundamental and ubiquitous

Spell checking

Translation

Transliteration

Speech to text

Text to speeh

Jan 6 2014 ISI Kolkata ML for MT 48

EM for word alignment from

sentence alignment example

English

(1) three rabbits

a b

(2) rabbits of Grenoble

b c d

French

(1) trois lapins

w x

(2) lapins de Grenoble

x y z

Jan 6 2014 ISI Kolkata ML for MT 49

Initial Probabilities

each cell denotes t(a w) t(a x) etc

a b c d

w 14 14 14 14

x 14 14 14 14

y 14 14 14 14

z 14 14 14 14

The counts in IBM Model 1

Works by maximizing P(f|e) over the entire corpus

For IBM Model 1 we get the following relationship

c (w f |w e f e ) =t (w f |w e )

t (w f |we

0 ) +hellip + t (w f |wel )

c (w f |w e f e ) is the fractional count of the alignment of w f

with w e in f and e

t (w f |w e ) is the probability of w f being the translation of w e

is the count of w f in f

is the count of w e in e

Jan 6 2014 ISI Kolkata ML for MT 51

Example of expected count

C[aw (a b)(w x)]

t(aw)

= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)

t(aw)+t(ax)

14

= ----------------- X 1 X 1= 12

14+14

Jan 6 2014 ISI Kolkata ML for MT 52

ldquocountsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 13 13 13

y 0 13 13 13

z 0 13 13 13

a b

w x

a b c d

w 12 12 0 0

x 12 12 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 53

Revised probability example

trevised(a w)

12

= -------------------------------------------------------------------

(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)

Jan 6 2014 ISI Kolkata ML for MT 54

Revised probabilities table

a b c d

w 12 14 0 0

x 12 512 13 13

y 0 16 13 13

z 0 16 13 13

ldquorevised countsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 59 13 13

y 0 29 13 13

z 0 29 13 13

a b

w x

a b c d

w 12 38 0 0

x 12 58 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 56

Re-Revised probabilities table a b c d

w 12 316 0 0

x 12 85144 13 13

y 0 19 13 13

z 0 19 13 13

Continue until convergence notice that (bx) binding gets progressively stronger

b=rabbits x=lapins

Derivation of EM based Alignment Expressions

Hindi)(Say language ofy vocabular

English)(Say language ofry vocalbula

2

1

LV

LV

F

E

what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet

जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet

E1

F1

E2

F2

Jan 6 2014 ISI Kolkata ML for MT 58

Vocabulary mapping

Vocabulary VE VF

what is in a name that which we call rose by any other will smell as sweet

naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii

Jan 6 2014 ISI Kolkata ML for MT 59

Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like

11989011 11989012 hellip 119890

11198971 119891

11 11989112 hellip 119891

11198981

11989021 11989022 hellip 119890

21198972 119891

21 11989122 hellip 119891

21198982

1198901199041 119890

1199042 hellip 119890

119904119897119904 119891

1199041 1198911199042 hellip 119891

119904119898119904

1198901198781 119890

1198782 hellip 119890

119878119897119878 119891

1198781 1198911198782 hellip 119891

119878119898119878

No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890

119904119901 =Index of English word 119890119904119901in English vocabularydictionary

119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary

(Thanks to Sachin Pawar for helping with the maths formulae processing)

Jan 6 2014 ISI Kolkata ML for MT 60

Hidden variables and parameters

Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878

119904=1 where each hidden variable is

as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French

word 119911119901119902119904 = 0 otherwise

Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893

word in French vocabulary

Jan 6 2014 ISI Kolkata ML for MT 61

Likelihoods Data Likelihood L(D Θ)

Data Log-Likelihood LL(D Θ)

Expected value of Data Log-Likelihood E(LL(D Θ))

Jan 6 2014 ISI Kolkata ML for MT 62

Constraint and Lagrangian

119875119894119895

119881119865

119895=1

= 1 forall119894

Jan 6 2014 ISI Kolkata ML for MT 63

Differentiating wrt Pij

Jan 6 2014 ISI Kolkata ML for MT 64

Final E and M steps

M-step

E-step

Jan 6 2014 ISI Kolkata ML for MT 65

Combinatorial considerations

Jan 6 2014 ISI Kolkata ML for MT 66

Example

ISI Kolkata ML for MT Jan 6 2014 67

All possible alignments

ISI Kolkata ML for MT Jan 6 2014 68

First fundamental requirement of SMT

Alignment requires evidence of

bull firstly a translation pair to introduce the

POSSIBILITY of a mapping

bull then another pair to establish with

CERTAINTY the mapping

ISI Kolkata ML for MT Jan 6 2014 69

For the ldquocertaintyrdquo

We have a translation pair containing alignment candidates and none of the other words in the translation pair

OR

We have a translation pair containing all words in the translation pair except the alignment candidates

ISI Kolkata ML for MT Jan 6 2014 70

Thereforehellip

If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty

ISI Kolkata ML for MT Jan 6 2014 71

Rough estimate of data requirement

SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge

ie no meanings or grammatical properties of any words phrases or sentences

Each language has a vocabulary of 100000 words can give rise to about 500000 word forms

through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟

bdquowent‟ and bdquogone‟

ISI Kolkata ML for MT Jan 6 2014 72

Reasons for mapping to multiple words

Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register

polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)

syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood

ISI Kolkata ML for MT Jan 6 2014 73

Estimate of corpora requirement

Assume that on an average a sentence is 10 words long

an additional 9 translation pairs for getting at one of the 5 mappings

10 sentences per mapping per word

a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences

Estimate is not wide off the mark

Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs

ISI Kolkata ML for MT Jan 6 2014 74

Our work on factor based SMT

Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009

Jan 6 2014 ISI Kolkata ML for MT 75

Case Marker and Morphology crucial in E-H MT

Order of magnitiude facelift in Fluency and fidelity

Determined by the combination of suffixes and semantic relations on the English side

Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers

Jan 6 2014 ISI Kolkata ML for MT 76

Semantic relations+SuffixesCase

Markers+inflections

I ate mangoes

I ltagt ate eatpast mangoes ltobj

I ltagt mangoes ltobjpl eatpast

mei_ne aam khaa_yaa

Jan 6 2014 ISI Kolkata ML for MT 77

Our Approach

Factored model (Koehn and Hoang 2007) with the following translation factor

suffix + semantic relation case markersuffix

Experiments with the following relations

Dependency relations from the stanford parser

Deeper semantic roles from Universal Networking Language (UNL)

Jan 6 2014 ISI Kolkata ML for MT 78

Our Factorization

Jan 6 2014 ISI Kolkata ML for MT 79

Experiments

Jan 6 2014 ISI Kolkata ML for MT 80

Corpus Statistics

Jan 6 2014 ISI Kolkata ML for MT 81

Results The impact of suffix and semantic factors

Jan 6 2014 ISI Kolkata ML for MT 82

Results The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 83

Subjective Evaluation The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 84

Impact of sentence length (F Fluency AAdequacy E Errors)

Jan 6 2014 ISI Kolkata ML for MT 85

A feel for the improvement-baseline

Jan 6 2014 ISI Kolkata ML for MT 86

A feel for the improvement-reorder

Jan 6 2014 ISI Kolkata ML for MT 87

A feel for the improvement-Semantic relation

Jan 6 2014 ISI Kolkata ML for MT 88

A recent study

PAN Indian SMT

Jan 6 2014 ISI Kolkata ML for MT 89

Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator

SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi

Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English

Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences

Evaluation with BLEU METEOR scores also show high correlation with BLEU

Jan 6 2014 ISI Kolkata ML for MT 90

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Kinds of disambiguation needed for a complete and correct interlingua graph

N Name

P POS

A Attachment

S Sense

C Co-reference

R Semantic Role

Jan 6 2014 ISI Kolkata ML for MT 18

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech Noun or Verb

Jan 6 2014 ISI Kolkata ML for MT 19

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech

NER

John is the name of a PERSON

Jan 6 2014 ISI Kolkata ML for MT 20

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech

NER

WSD Financial

bank or River bank

Jan 6 2014 ISI Kolkata ML for MT 21

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech

NER

WSD

Co-reference

ldquoitrdquo ldquobankrdquo

Jan 6 2014 ISI Kolkata ML for MT 22

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech

NER

WSD

Co-reference

Subject Drop

Pro drop (subject ldquoIrdquo)

Jan 6 2014 ISI Kolkata ML for MT 23

Typical NLP tools used

POS tagger

Stanford Named Entity Recognizer

Stanford Dependency Parser

XLE Dependency Parser

Lexical Resource

WordNet

Universal Word Dictionary (UW++)

Jan 6 2014 ISI Kolkata ML for MT 24

System Architecture Stanford

Dependency Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple Sentence Analyser

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

Simple Enco

Simple Enco

Simple Enco

Simple Enco

Simple Enco

Simplifier

Jan 6 2014 ISI Kolkata ML for MT 25

Target Sentence Generation from interlingua

Lexical Transfer

Target Sentence

Generation

Syntax

Planning

Morphological

Synthesis

(WordPhrase

Translation ) (Word form

Generation) (Sequence)

Jan 6 2014 ISI Kolkata ML for MT 26

Generation Architecture Deconversion = Transfer + Generation

Jan 6 2014 ISI Kolkata ML for MT 27

Transfer Based MT

Marathi-Hindi

Jan 6 2014 ISI Kolkata ML for MT 28

Indian Language to Indian Language Machine Translation (ILILMT)

Bidirectional Machine Translation System

Developed for nine Indian language pairs

Approach

Transfer based

Modules developed using both rule based and statistical approach

Jan 6 2014 ISI Kolkata ML for MT 29

Architecture of ILILMT System

Morphological Analyzer

Source Text

POS Tagger

Chunker

Vibhakti Computation

Name Entity Recognizer

Word Sense Disambiguatio

n

Lexical Transfer

Agreement Feature

Interchunk

Word Generator

Intrachunk

Target Text

Analysis

Transfer

Generation

Jan 6 2014 ISI Kolkata ML for MT 30

M-H MT system Evaluation

Subjective evaluation based on machine translation quality

Accuracy calculated based on score given by linguists

S5 Number of score 5 Sentences S4 Number of score 4 sentences

S3 Number of score 3 sentences N Total Number of sentences

Accuracy =

Score 5 Correct Translation

Score 4 Understandable with

minor errors

Score 3 Understandable with

major errors

Score 2 Not Understandable

Score 1 Non sense translation

Jan 6 2014 ISI Kolkata ML for MT 31

Evaluation of Marathi to Hindi MT System

0

02

04

06

08

1

12

MorphAnalyzer

POS Tagger Chunker VibhaktiCompute

WSD LexicalTransfer

WordGenerator

Precision

Recall

Module-wise precision and recall

Jan 6 2014 ISI Kolkata ML for MT 32

Evaluation of Marathi to Hindi MT System (cont)

Subjective evaluation on translation quality Evaluated on 500 web sentences Accuracy calculated based on score given according to the

translation quality Accuracy 6532

Result analysis

Morph POS tagger chunker gives more than 90 precision but Transfer WSD generator modules are below 80 hence degrades MT quality

Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Jan 6 2014 ISI Kolkata ML for MT 33

Statistical Machine Translation

Jan 6 2014 ISI Kolkata ML for MT 34

Czeck-English data

[nesu] ldquoI carryrdquo

[ponese] ldquoHe will carryrdquo

[nese] ldquoHe carriesrdquo

[nesou] ldquoThey carryrdquo

[yedu] ldquoI driverdquo

[plavou] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 35

To translate hellip

I will carry

They drive

He swims

They will drive

Jan 6 2014 ISI Kolkata ML for MT 36

Hindi-English data

[DhotA huM] ldquoI carryrdquo

[DhoegA] ldquoHe will carryrdquo

[DhotA hAi] ldquoHe carriesrdquo

[Dhote hAi] ldquoThey carryrdquo

[chalAtA huM] ldquoI driverdquo

[tErte hEM] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 37

Bangla-English data

[bai] ldquoI carryrdquo

[baibe] ldquoHe will carryrdquo

[bay] ldquoHe carriesrdquo

[bay] ldquoThey carryrdquo

[chAlAi] ldquoI driverdquo

[sAMtrAy] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 38

To translate hellip (repeated)

I will carry

They drive

He swims

They will drive

Jan 6 2014 ISI Kolkata ML for MT 39

Foundation

Data driven approach Goal is to find out the English sentence e

given foreign language sentence f whose p(e|f) is maximum

Translations are generated on the basis of statistical model

Parameters are estimated using bilingual parallel corpora

Jan 6 2014 ISI Kolkata ML for MT 40

SMT Language Model

To detect good English sentences

Probability of an English sentence w1w2 helliphellip wn can be written as

Pr(w1w2 helliphellip wn) = Pr(w1) Pr(w2|w1) Pr(wn|w1 w2 wn-1)

Here Pr(wn|w1 w2 wn-1) is the probability that word wn follows word string w1 w2 wn-1 N-gram model probability

Trigram model probability calculation

Jan 6 2014 ISI Kolkata ML for MT 41

SMT Translation Model

P(f|e) Probability of some f given hypothesis English translation e

How to assign the values to p(e|f)

Sentences are infinite not possible to find pair(ef) for all sentences

Introduce a hidden variable a that represents alignments between the individual words in the sentence pair

Sentence level

Word level

Jan 6 2014 ISI Kolkata ML for MT 42

Alignment

If the string e= e1l= e1 e2 hellipel has l words and

the string f= f1m=f1f2fm has m words

then the alignment a can be represented by a series a1

m= a1a2am of m values each between 0 and l such that if the word in position j of the f-string is connected to the word in position i of the e-string then aj= i and

if it is not connected to any English word then aj= O

Jan 6 2014 ISI Kolkata ML for MT 43

Example of alignment

English Ram went to school

Hindi Raama paathashaalaa gayaa

Ram went to school

ltNullgt Raama paathashaalaa gayaa

Jan 6 2014 ISI Kolkata ML for MT 44

Translation Model Exact expression

Five models for estimating parameters in the expression [2]

Model-1 Model-2 Model-3 Model-4 Model-5

Choose alignment given e and m

Choose the identity of foreign word given e m a

Choose the length of foreign language string given e

Jan 6 2014 ISI Kolkata ML for MT 45

a

eafef )|Pr()|Pr(

m

emafeaf )|Pr()|Pr(

m

emafememaf )|Pr()|Pr()|Pr(

m

emafem )|Pr()|Pr(

m

m

j

jj

jj emfaafem1

1

1

1

1 )|Pr()|Pr(

m

j

jj

j

jj

j

m

emfafemfaaem1

1

11

1

1

1

1 )|Pr()|Pr()|Pr(

)|Pr( emaf )|Pr( em

m

j

jj

j

jj

j emfafemfaa1

1

11

1

1

1

1 )|Pr()|Pr(

Proof of Translation Model Exact

expression

m is fixed for a particular f hence

marginalization

marginalization

Jan 6 2014 ISI Kolkata ML for MT 46

Alignment

Jan 6 2014 ISI Kolkata ML for MT 47

Fundamental and ubiquitous

Spell checking

Translation

Transliteration

Speech to text

Text to speeh

Jan 6 2014 ISI Kolkata ML for MT 48

EM for word alignment from

sentence alignment example

English

(1) three rabbits

a b

(2) rabbits of Grenoble

b c d

French

(1) trois lapins

w x

(2) lapins de Grenoble

x y z

Jan 6 2014 ISI Kolkata ML for MT 49

Initial Probabilities

each cell denotes t(a w) t(a x) etc

a b c d

w 14 14 14 14

x 14 14 14 14

y 14 14 14 14

z 14 14 14 14

The counts in IBM Model 1

Works by maximizing P(f|e) over the entire corpus

For IBM Model 1 we get the following relationship

c (w f |w e f e ) =t (w f |w e )

t (w f |we

0 ) +hellip + t (w f |wel )

c (w f |w e f e ) is the fractional count of the alignment of w f

with w e in f and e

t (w f |w e ) is the probability of w f being the translation of w e

is the count of w f in f

is the count of w e in e

Jan 6 2014 ISI Kolkata ML for MT 51

Example of expected count

C[aw (a b)(w x)]

t(aw)

= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)

t(aw)+t(ax)

14

= ----------------- X 1 X 1= 12

14+14

Jan 6 2014 ISI Kolkata ML for MT 52

ldquocountsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 13 13 13

y 0 13 13 13

z 0 13 13 13

a b

w x

a b c d

w 12 12 0 0

x 12 12 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 53

Revised probability example

trevised(a w)

12

= -------------------------------------------------------------------

(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)

Jan 6 2014 ISI Kolkata ML for MT 54

Revised probabilities table

a b c d

w 12 14 0 0

x 12 512 13 13

y 0 16 13 13

z 0 16 13 13

ldquorevised countsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 59 13 13

y 0 29 13 13

z 0 29 13 13

a b

w x

a b c d

w 12 38 0 0

x 12 58 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 56

Re-Revised probabilities table a b c d

w 12 316 0 0

x 12 85144 13 13

y 0 19 13 13

z 0 19 13 13

Continue until convergence notice that (bx) binding gets progressively stronger

b=rabbits x=lapins

Derivation of EM based Alignment Expressions

Hindi)(Say language ofy vocabular

English)(Say language ofry vocalbula

2

1

LV

LV

F

E

what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet

जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet

E1

F1

E2

F2

Jan 6 2014 ISI Kolkata ML for MT 58

Vocabulary mapping

Vocabulary VE VF

what is in a name that which we call rose by any other will smell as sweet

naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii

Jan 6 2014 ISI Kolkata ML for MT 59

Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like

11989011 11989012 hellip 119890

11198971 119891

11 11989112 hellip 119891

11198981

11989021 11989022 hellip 119890

21198972 119891

21 11989122 hellip 119891

21198982

1198901199041 119890

1199042 hellip 119890

119904119897119904 119891

1199041 1198911199042 hellip 119891

119904119898119904

1198901198781 119890

1198782 hellip 119890

119878119897119878 119891

1198781 1198911198782 hellip 119891

119878119898119878

No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890

119904119901 =Index of English word 119890119904119901in English vocabularydictionary

119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary

(Thanks to Sachin Pawar for helping with the maths formulae processing)

Jan 6 2014 ISI Kolkata ML for MT 60

Hidden variables and parameters

Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878

119904=1 where each hidden variable is

as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French

word 119911119901119902119904 = 0 otherwise

Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893

word in French vocabulary

Jan 6 2014 ISI Kolkata ML for MT 61

Likelihoods Data Likelihood L(D Θ)

Data Log-Likelihood LL(D Θ)

Expected value of Data Log-Likelihood E(LL(D Θ))

Jan 6 2014 ISI Kolkata ML for MT 62

Constraint and Lagrangian

119875119894119895

119881119865

119895=1

= 1 forall119894

Jan 6 2014 ISI Kolkata ML for MT 63

Differentiating wrt Pij

Jan 6 2014 ISI Kolkata ML for MT 64

Final E and M steps

M-step

E-step

Jan 6 2014 ISI Kolkata ML for MT 65

Combinatorial considerations

Jan 6 2014 ISI Kolkata ML for MT 66

Example

ISI Kolkata ML for MT Jan 6 2014 67

All possible alignments

ISI Kolkata ML for MT Jan 6 2014 68

First fundamental requirement of SMT

Alignment requires evidence of

bull firstly a translation pair to introduce the

POSSIBILITY of a mapping

bull then another pair to establish with

CERTAINTY the mapping

ISI Kolkata ML for MT Jan 6 2014 69

For the ldquocertaintyrdquo

We have a translation pair containing alignment candidates and none of the other words in the translation pair

OR

We have a translation pair containing all words in the translation pair except the alignment candidates

ISI Kolkata ML for MT Jan 6 2014 70

Thereforehellip

If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty

ISI Kolkata ML for MT Jan 6 2014 71

Rough estimate of data requirement

SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge

ie no meanings or grammatical properties of any words phrases or sentences

Each language has a vocabulary of 100000 words can give rise to about 500000 word forms

through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟

bdquowent‟ and bdquogone‟

ISI Kolkata ML for MT Jan 6 2014 72

Reasons for mapping to multiple words

Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register

polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)

syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood

ISI Kolkata ML for MT Jan 6 2014 73

Estimate of corpora requirement

Assume that on an average a sentence is 10 words long

an additional 9 translation pairs for getting at one of the 5 mappings

10 sentences per mapping per word

a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences

Estimate is not wide off the mark

Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs

ISI Kolkata ML for MT Jan 6 2014 74

Our work on factor based SMT

Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009

Jan 6 2014 ISI Kolkata ML for MT 75

Case Marker and Morphology crucial in E-H MT

Order of magnitiude facelift in Fluency and fidelity

Determined by the combination of suffixes and semantic relations on the English side

Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers

Jan 6 2014 ISI Kolkata ML for MT 76

Semantic relations+SuffixesCase

Markers+inflections

I ate mangoes

I ltagt ate eatpast mangoes ltobj

I ltagt mangoes ltobjpl eatpast

mei_ne aam khaa_yaa

Jan 6 2014 ISI Kolkata ML for MT 77

Our Approach

Factored model (Koehn and Hoang 2007) with the following translation factor

suffix + semantic relation case markersuffix

Experiments with the following relations

Dependency relations from the stanford parser

Deeper semantic roles from Universal Networking Language (UNL)

Jan 6 2014 ISI Kolkata ML for MT 78

Our Factorization

Jan 6 2014 ISI Kolkata ML for MT 79

Experiments

Jan 6 2014 ISI Kolkata ML for MT 80

Corpus Statistics

Jan 6 2014 ISI Kolkata ML for MT 81

Results The impact of suffix and semantic factors

Jan 6 2014 ISI Kolkata ML for MT 82

Results The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 83

Subjective Evaluation The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 84

Impact of sentence length (F Fluency AAdequacy E Errors)

Jan 6 2014 ISI Kolkata ML for MT 85

A feel for the improvement-baseline

Jan 6 2014 ISI Kolkata ML for MT 86

A feel for the improvement-reorder

Jan 6 2014 ISI Kolkata ML for MT 87

A feel for the improvement-Semantic relation

Jan 6 2014 ISI Kolkata ML for MT 88

A recent study

PAN Indian SMT

Jan 6 2014 ISI Kolkata ML for MT 89

Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator

SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi

Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English

Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences

Evaluation with BLEU METEOR scores also show high correlation with BLEU

Jan 6 2014 ISI Kolkata ML for MT 90

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech Noun or Verb

Jan 6 2014 ISI Kolkata ML for MT 19

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech

NER

John is the name of a PERSON

Jan 6 2014 ISI Kolkata ML for MT 20

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech

NER

WSD Financial

bank or River bank

Jan 6 2014 ISI Kolkata ML for MT 21

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech

NER

WSD

Co-reference

ldquoitrdquo ldquobankrdquo

Jan 6 2014 ISI Kolkata ML for MT 22

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech

NER

WSD

Co-reference

Subject Drop

Pro drop (subject ldquoIrdquo)

Jan 6 2014 ISI Kolkata ML for MT 23

Typical NLP tools used

POS tagger

Stanford Named Entity Recognizer

Stanford Dependency Parser

XLE Dependency Parser

Lexical Resource

WordNet

Universal Word Dictionary (UW++)

Jan 6 2014 ISI Kolkata ML for MT 24

System Architecture Stanford

Dependency Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple Sentence Analyser

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

Simple Enco

Simple Enco

Simple Enco

Simple Enco

Simple Enco

Simplifier

Jan 6 2014 ISI Kolkata ML for MT 25

Target Sentence Generation from interlingua

Lexical Transfer

Target Sentence

Generation

Syntax

Planning

Morphological

Synthesis

(WordPhrase

Translation ) (Word form

Generation) (Sequence)

Jan 6 2014 ISI Kolkata ML for MT 26

Generation Architecture Deconversion = Transfer + Generation

Jan 6 2014 ISI Kolkata ML for MT 27

Transfer Based MT

Marathi-Hindi

Jan 6 2014 ISI Kolkata ML for MT 28

Indian Language to Indian Language Machine Translation (ILILMT)

Bidirectional Machine Translation System

Developed for nine Indian language pairs

Approach

Transfer based

Modules developed using both rule based and statistical approach

Jan 6 2014 ISI Kolkata ML for MT 29

Architecture of ILILMT System

Morphological Analyzer

Source Text

POS Tagger

Chunker

Vibhakti Computation

Name Entity Recognizer

Word Sense Disambiguatio

n

Lexical Transfer

Agreement Feature

Interchunk

Word Generator

Intrachunk

Target Text

Analysis

Transfer

Generation

Jan 6 2014 ISI Kolkata ML for MT 30

M-H MT system Evaluation

Subjective evaluation based on machine translation quality

Accuracy calculated based on score given by linguists

S5 Number of score 5 Sentences S4 Number of score 4 sentences

S3 Number of score 3 sentences N Total Number of sentences

Accuracy =

Score 5 Correct Translation

Score 4 Understandable with

minor errors

Score 3 Understandable with

major errors

Score 2 Not Understandable

Score 1 Non sense translation

Jan 6 2014 ISI Kolkata ML for MT 31

Evaluation of Marathi to Hindi MT System

0

02

04

06

08

1

12

MorphAnalyzer

POS Tagger Chunker VibhaktiCompute

WSD LexicalTransfer

WordGenerator

Precision

Recall

Module-wise precision and recall

Jan 6 2014 ISI Kolkata ML for MT 32

Evaluation of Marathi to Hindi MT System (cont)

Subjective evaluation on translation quality Evaluated on 500 web sentences Accuracy calculated based on score given according to the

translation quality Accuracy 6532

Result analysis

Morph POS tagger chunker gives more than 90 precision but Transfer WSD generator modules are below 80 hence degrades MT quality

Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Jan 6 2014 ISI Kolkata ML for MT 33

Statistical Machine Translation

Jan 6 2014 ISI Kolkata ML for MT 34

Czeck-English data

[nesu] ldquoI carryrdquo

[ponese] ldquoHe will carryrdquo

[nese] ldquoHe carriesrdquo

[nesou] ldquoThey carryrdquo

[yedu] ldquoI driverdquo

[plavou] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 35

To translate hellip

I will carry

They drive

He swims

They will drive

Jan 6 2014 ISI Kolkata ML for MT 36

Hindi-English data

[DhotA huM] ldquoI carryrdquo

[DhoegA] ldquoHe will carryrdquo

[DhotA hAi] ldquoHe carriesrdquo

[Dhote hAi] ldquoThey carryrdquo

[chalAtA huM] ldquoI driverdquo

[tErte hEM] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 37

Bangla-English data

[bai] ldquoI carryrdquo

[baibe] ldquoHe will carryrdquo

[bay] ldquoHe carriesrdquo

[bay] ldquoThey carryrdquo

[chAlAi] ldquoI driverdquo

[sAMtrAy] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 38

To translate hellip (repeated)

I will carry

They drive

He swims

They will drive

Jan 6 2014 ISI Kolkata ML for MT 39

Foundation

Data driven approach Goal is to find out the English sentence e

given foreign language sentence f whose p(e|f) is maximum

Translations are generated on the basis of statistical model

Parameters are estimated using bilingual parallel corpora

Jan 6 2014 ISI Kolkata ML for MT 40

SMT Language Model

To detect good English sentences

Probability of an English sentence w1w2 helliphellip wn can be written as

Pr(w1w2 helliphellip wn) = Pr(w1) Pr(w2|w1) Pr(wn|w1 w2 wn-1)

Here Pr(wn|w1 w2 wn-1) is the probability that word wn follows word string w1 w2 wn-1 N-gram model probability

Trigram model probability calculation

Jan 6 2014 ISI Kolkata ML for MT 41

SMT Translation Model

P(f|e) Probability of some f given hypothesis English translation e

How to assign the values to p(e|f)

Sentences are infinite not possible to find pair(ef) for all sentences

Introduce a hidden variable a that represents alignments between the individual words in the sentence pair

Sentence level

Word level

Jan 6 2014 ISI Kolkata ML for MT 42

Alignment

If the string e= e1l= e1 e2 hellipel has l words and

the string f= f1m=f1f2fm has m words

then the alignment a can be represented by a series a1

m= a1a2am of m values each between 0 and l such that if the word in position j of the f-string is connected to the word in position i of the e-string then aj= i and

if it is not connected to any English word then aj= O

Jan 6 2014 ISI Kolkata ML for MT 43

Example of alignment

English Ram went to school

Hindi Raama paathashaalaa gayaa

Ram went to school

ltNullgt Raama paathashaalaa gayaa

Jan 6 2014 ISI Kolkata ML for MT 44

Translation Model Exact expression

Five models for estimating parameters in the expression [2]

Model-1 Model-2 Model-3 Model-4 Model-5

Choose alignment given e and m

Choose the identity of foreign word given e m a

Choose the length of foreign language string given e

Jan 6 2014 ISI Kolkata ML for MT 45

a

eafef )|Pr()|Pr(

m

emafeaf )|Pr()|Pr(

m

emafememaf )|Pr()|Pr()|Pr(

m

emafem )|Pr()|Pr(

m

m

j

jj

jj emfaafem1

1

1

1

1 )|Pr()|Pr(

m

j

jj

j

jj

j

m

emfafemfaaem1

1

11

1

1

1

1 )|Pr()|Pr()|Pr(

)|Pr( emaf )|Pr( em

m

j

jj

j

jj

j emfafemfaa1

1

11

1

1

1

1 )|Pr()|Pr(

Proof of Translation Model Exact

expression

m is fixed for a particular f hence

marginalization

marginalization

Jan 6 2014 ISI Kolkata ML for MT 46

Alignment

Jan 6 2014 ISI Kolkata ML for MT 47

Fundamental and ubiquitous

Spell checking

Translation

Transliteration

Speech to text

Text to speeh

Jan 6 2014 ISI Kolkata ML for MT 48

EM for word alignment from

sentence alignment example

English

(1) three rabbits

a b

(2) rabbits of Grenoble

b c d

French

(1) trois lapins

w x

(2) lapins de Grenoble

x y z

Jan 6 2014 ISI Kolkata ML for MT 49

Initial Probabilities

each cell denotes t(a w) t(a x) etc

a b c d

w 14 14 14 14

x 14 14 14 14

y 14 14 14 14

z 14 14 14 14

The counts in IBM Model 1

Works by maximizing P(f|e) over the entire corpus

For IBM Model 1 we get the following relationship

c (w f |w e f e ) =t (w f |w e )

t (w f |we

0 ) +hellip + t (w f |wel )

c (w f |w e f e ) is the fractional count of the alignment of w f

with w e in f and e

t (w f |w e ) is the probability of w f being the translation of w e

is the count of w f in f

is the count of w e in e

Jan 6 2014 ISI Kolkata ML for MT 51

Example of expected count

C[aw (a b)(w x)]

t(aw)

= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)

t(aw)+t(ax)

14

= ----------------- X 1 X 1= 12

14+14

Jan 6 2014 ISI Kolkata ML for MT 52

ldquocountsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 13 13 13

y 0 13 13 13

z 0 13 13 13

a b

w x

a b c d

w 12 12 0 0

x 12 12 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 53

Revised probability example

trevised(a w)

12

= -------------------------------------------------------------------

(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)

Jan 6 2014 ISI Kolkata ML for MT 54

Revised probabilities table

a b c d

w 12 14 0 0

x 12 512 13 13

y 0 16 13 13

z 0 16 13 13

ldquorevised countsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 59 13 13

y 0 29 13 13

z 0 29 13 13

a b

w x

a b c d

w 12 38 0 0

x 12 58 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 56

Re-Revised probabilities table a b c d

w 12 316 0 0

x 12 85144 13 13

y 0 19 13 13

z 0 19 13 13

Continue until convergence notice that (bx) binding gets progressively stronger

b=rabbits x=lapins

Derivation of EM based Alignment Expressions

Hindi)(Say language ofy vocabular

English)(Say language ofry vocalbula

2

1

LV

LV

F

E

what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet

जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet

E1

F1

E2

F2

Jan 6 2014 ISI Kolkata ML for MT 58

Vocabulary mapping

Vocabulary VE VF

what is in a name that which we call rose by any other will smell as sweet

naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii

Jan 6 2014 ISI Kolkata ML for MT 59

Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like

11989011 11989012 hellip 119890

11198971 119891

11 11989112 hellip 119891

11198981

11989021 11989022 hellip 119890

21198972 119891

21 11989122 hellip 119891

21198982

1198901199041 119890

1199042 hellip 119890

119904119897119904 119891

1199041 1198911199042 hellip 119891

119904119898119904

1198901198781 119890

1198782 hellip 119890

119878119897119878 119891

1198781 1198911198782 hellip 119891

119878119898119878

No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890

119904119901 =Index of English word 119890119904119901in English vocabularydictionary

119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary

(Thanks to Sachin Pawar for helping with the maths formulae processing)

Jan 6 2014 ISI Kolkata ML for MT 60

Hidden variables and parameters

Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878

119904=1 where each hidden variable is

as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French

word 119911119901119902119904 = 0 otherwise

Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893

word in French vocabulary

Jan 6 2014 ISI Kolkata ML for MT 61

Likelihoods Data Likelihood L(D Θ)

Data Log-Likelihood LL(D Θ)

Expected value of Data Log-Likelihood E(LL(D Θ))

Jan 6 2014 ISI Kolkata ML for MT 62

Constraint and Lagrangian

119875119894119895

119881119865

119895=1

= 1 forall119894

Jan 6 2014 ISI Kolkata ML for MT 63

Differentiating wrt Pij

Jan 6 2014 ISI Kolkata ML for MT 64

Final E and M steps

M-step

E-step

Jan 6 2014 ISI Kolkata ML for MT 65

Combinatorial considerations

Jan 6 2014 ISI Kolkata ML for MT 66

Example

ISI Kolkata ML for MT Jan 6 2014 67

All possible alignments

ISI Kolkata ML for MT Jan 6 2014 68

First fundamental requirement of SMT

Alignment requires evidence of

bull firstly a translation pair to introduce the

POSSIBILITY of a mapping

bull then another pair to establish with

CERTAINTY the mapping

ISI Kolkata ML for MT Jan 6 2014 69

For the ldquocertaintyrdquo

We have a translation pair containing alignment candidates and none of the other words in the translation pair

OR

We have a translation pair containing all words in the translation pair except the alignment candidates

ISI Kolkata ML for MT Jan 6 2014 70

Thereforehellip

If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty

ISI Kolkata ML for MT Jan 6 2014 71

Rough estimate of data requirement

SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge

ie no meanings or grammatical properties of any words phrases or sentences

Each language has a vocabulary of 100000 words can give rise to about 500000 word forms

through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟

bdquowent‟ and bdquogone‟

ISI Kolkata ML for MT Jan 6 2014 72

Reasons for mapping to multiple words

Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register

polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)

syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood

ISI Kolkata ML for MT Jan 6 2014 73

Estimate of corpora requirement

Assume that on an average a sentence is 10 words long

an additional 9 translation pairs for getting at one of the 5 mappings

10 sentences per mapping per word

a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences

Estimate is not wide off the mark

Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs

ISI Kolkata ML for MT Jan 6 2014 74

Our work on factor based SMT

Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009

Jan 6 2014 ISI Kolkata ML for MT 75

Case Marker and Morphology crucial in E-H MT

Order of magnitiude facelift in Fluency and fidelity

Determined by the combination of suffixes and semantic relations on the English side

Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers

Jan 6 2014 ISI Kolkata ML for MT 76

Semantic relations+SuffixesCase

Markers+inflections

I ate mangoes

I ltagt ate eatpast mangoes ltobj

I ltagt mangoes ltobjpl eatpast

mei_ne aam khaa_yaa

Jan 6 2014 ISI Kolkata ML for MT 77

Our Approach

Factored model (Koehn and Hoang 2007) with the following translation factor

suffix + semantic relation case markersuffix

Experiments with the following relations

Dependency relations from the stanford parser

Deeper semantic roles from Universal Networking Language (UNL)

Jan 6 2014 ISI Kolkata ML for MT 78

Our Factorization

Jan 6 2014 ISI Kolkata ML for MT 79

Experiments

Jan 6 2014 ISI Kolkata ML for MT 80

Corpus Statistics

Jan 6 2014 ISI Kolkata ML for MT 81

Results The impact of suffix and semantic factors

Jan 6 2014 ISI Kolkata ML for MT 82

Results The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 83

Subjective Evaluation The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 84

Impact of sentence length (F Fluency AAdequacy E Errors)

Jan 6 2014 ISI Kolkata ML for MT 85

A feel for the improvement-baseline

Jan 6 2014 ISI Kolkata ML for MT 86

A feel for the improvement-reorder

Jan 6 2014 ISI Kolkata ML for MT 87

A feel for the improvement-Semantic relation

Jan 6 2014 ISI Kolkata ML for MT 88

A recent study

PAN Indian SMT

Jan 6 2014 ISI Kolkata ML for MT 89

Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator

SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi

Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English

Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences

Evaluation with BLEU METEOR scores also show high correlation with BLEU

Jan 6 2014 ISI Kolkata ML for MT 90

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech

NER

John is the name of a PERSON

Jan 6 2014 ISI Kolkata ML for MT 20

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech

NER

WSD Financial

bank or River bank

Jan 6 2014 ISI Kolkata ML for MT 21

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech

NER

WSD

Co-reference

ldquoitrdquo ldquobankrdquo

Jan 6 2014 ISI Kolkata ML for MT 22

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech

NER

WSD

Co-reference

Subject Drop

Pro drop (subject ldquoIrdquo)

Jan 6 2014 ISI Kolkata ML for MT 23

Typical NLP tools used

POS tagger

Stanford Named Entity Recognizer

Stanford Dependency Parser

XLE Dependency Parser

Lexical Resource

WordNet

Universal Word Dictionary (UW++)

Jan 6 2014 ISI Kolkata ML for MT 24

System Architecture Stanford

Dependency Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple Sentence Analyser

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

Simple Enco

Simple Enco

Simple Enco

Simple Enco

Simple Enco

Simplifier

Jan 6 2014 ISI Kolkata ML for MT 25

Target Sentence Generation from interlingua

Lexical Transfer

Target Sentence

Generation

Syntax

Planning

Morphological

Synthesis

(WordPhrase

Translation ) (Word form

Generation) (Sequence)

Jan 6 2014 ISI Kolkata ML for MT 26

Generation Architecture Deconversion = Transfer + Generation

Jan 6 2014 ISI Kolkata ML for MT 27

Transfer Based MT

Marathi-Hindi

Jan 6 2014 ISI Kolkata ML for MT 28

Indian Language to Indian Language Machine Translation (ILILMT)

Bidirectional Machine Translation System

Developed for nine Indian language pairs

Approach

Transfer based

Modules developed using both rule based and statistical approach

Jan 6 2014 ISI Kolkata ML for MT 29

Architecture of ILILMT System

Morphological Analyzer

Source Text

POS Tagger

Chunker

Vibhakti Computation

Name Entity Recognizer

Word Sense Disambiguatio

n

Lexical Transfer

Agreement Feature

Interchunk

Word Generator

Intrachunk

Target Text

Analysis

Transfer

Generation

Jan 6 2014 ISI Kolkata ML for MT 30

M-H MT system Evaluation

Subjective evaluation based on machine translation quality

Accuracy calculated based on score given by linguists

S5 Number of score 5 Sentences S4 Number of score 4 sentences

S3 Number of score 3 sentences N Total Number of sentences

Accuracy =

Score 5 Correct Translation

Score 4 Understandable with

minor errors

Score 3 Understandable with

major errors

Score 2 Not Understandable

Score 1 Non sense translation

Jan 6 2014 ISI Kolkata ML for MT 31

Evaluation of Marathi to Hindi MT System

0

02

04

06

08

1

12

MorphAnalyzer

POS Tagger Chunker VibhaktiCompute

WSD LexicalTransfer

WordGenerator

Precision

Recall

Module-wise precision and recall

Jan 6 2014 ISI Kolkata ML for MT 32

Evaluation of Marathi to Hindi MT System (cont)

Subjective evaluation on translation quality Evaluated on 500 web sentences Accuracy calculated based on score given according to the

translation quality Accuracy 6532

Result analysis

Morph POS tagger chunker gives more than 90 precision but Transfer WSD generator modules are below 80 hence degrades MT quality

Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Jan 6 2014 ISI Kolkata ML for MT 33

Statistical Machine Translation

Jan 6 2014 ISI Kolkata ML for MT 34

Czeck-English data

[nesu] ldquoI carryrdquo

[ponese] ldquoHe will carryrdquo

[nese] ldquoHe carriesrdquo

[nesou] ldquoThey carryrdquo

[yedu] ldquoI driverdquo

[plavou] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 35

To translate hellip

I will carry

They drive

He swims

They will drive

Jan 6 2014 ISI Kolkata ML for MT 36

Hindi-English data

[DhotA huM] ldquoI carryrdquo

[DhoegA] ldquoHe will carryrdquo

[DhotA hAi] ldquoHe carriesrdquo

[Dhote hAi] ldquoThey carryrdquo

[chalAtA huM] ldquoI driverdquo

[tErte hEM] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 37

Bangla-English data

[bai] ldquoI carryrdquo

[baibe] ldquoHe will carryrdquo

[bay] ldquoHe carriesrdquo

[bay] ldquoThey carryrdquo

[chAlAi] ldquoI driverdquo

[sAMtrAy] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 38

To translate hellip (repeated)

I will carry

They drive

He swims

They will drive

Jan 6 2014 ISI Kolkata ML for MT 39

Foundation

Data driven approach Goal is to find out the English sentence e

given foreign language sentence f whose p(e|f) is maximum

Translations are generated on the basis of statistical model

Parameters are estimated using bilingual parallel corpora

Jan 6 2014 ISI Kolkata ML for MT 40

SMT Language Model

To detect good English sentences

Probability of an English sentence w1w2 helliphellip wn can be written as

Pr(w1w2 helliphellip wn) = Pr(w1) Pr(w2|w1) Pr(wn|w1 w2 wn-1)

Here Pr(wn|w1 w2 wn-1) is the probability that word wn follows word string w1 w2 wn-1 N-gram model probability

Trigram model probability calculation

Jan 6 2014 ISI Kolkata ML for MT 41

SMT Translation Model

P(f|e) Probability of some f given hypothesis English translation e

How to assign the values to p(e|f)

Sentences are infinite not possible to find pair(ef) for all sentences

Introduce a hidden variable a that represents alignments between the individual words in the sentence pair

Sentence level

Word level

Jan 6 2014 ISI Kolkata ML for MT 42

Alignment

If the string e= e1l= e1 e2 hellipel has l words and

the string f= f1m=f1f2fm has m words

then the alignment a can be represented by a series a1

m= a1a2am of m values each between 0 and l such that if the word in position j of the f-string is connected to the word in position i of the e-string then aj= i and

if it is not connected to any English word then aj= O

Jan 6 2014 ISI Kolkata ML for MT 43

Example of alignment

English Ram went to school

Hindi Raama paathashaalaa gayaa

Ram went to school

ltNullgt Raama paathashaalaa gayaa

Jan 6 2014 ISI Kolkata ML for MT 44

Translation Model Exact expression

Five models for estimating parameters in the expression [2]

Model-1 Model-2 Model-3 Model-4 Model-5

Choose alignment given e and m

Choose the identity of foreign word given e m a

Choose the length of foreign language string given e

Jan 6 2014 ISI Kolkata ML for MT 45

a

eafef )|Pr()|Pr(

m

emafeaf )|Pr()|Pr(

m

emafememaf )|Pr()|Pr()|Pr(

m

emafem )|Pr()|Pr(

m

m

j

jj

jj emfaafem1

1

1

1

1 )|Pr()|Pr(

m

j

jj

j

jj

j

m

emfafemfaaem1

1

11

1

1

1

1 )|Pr()|Pr()|Pr(

)|Pr( emaf )|Pr( em

m

j

jj

j

jj

j emfafemfaa1

1

11

1

1

1

1 )|Pr()|Pr(

Proof of Translation Model Exact

expression

m is fixed for a particular f hence

marginalization

marginalization

Jan 6 2014 ISI Kolkata ML for MT 46

Alignment

Jan 6 2014 ISI Kolkata ML for MT 47

Fundamental and ubiquitous

Spell checking

Translation

Transliteration

Speech to text

Text to speeh

Jan 6 2014 ISI Kolkata ML for MT 48

EM for word alignment from

sentence alignment example

English

(1) three rabbits

a b

(2) rabbits of Grenoble

b c d

French

(1) trois lapins

w x

(2) lapins de Grenoble

x y z

Jan 6 2014 ISI Kolkata ML for MT 49

Initial Probabilities

each cell denotes t(a w) t(a x) etc

a b c d

w 14 14 14 14

x 14 14 14 14

y 14 14 14 14

z 14 14 14 14

The counts in IBM Model 1

Works by maximizing P(f|e) over the entire corpus

For IBM Model 1 we get the following relationship

c (w f |w e f e ) =t (w f |w e )

t (w f |we

0 ) +hellip + t (w f |wel )

c (w f |w e f e ) is the fractional count of the alignment of w f

with w e in f and e

t (w f |w e ) is the probability of w f being the translation of w e

is the count of w f in f

is the count of w e in e

Jan 6 2014 ISI Kolkata ML for MT 51

Example of expected count

C[aw (a b)(w x)]

t(aw)

= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)

t(aw)+t(ax)

14

= ----------------- X 1 X 1= 12

14+14

Jan 6 2014 ISI Kolkata ML for MT 52

ldquocountsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 13 13 13

y 0 13 13 13

z 0 13 13 13

a b

w x

a b c d

w 12 12 0 0

x 12 12 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 53

Revised probability example

trevised(a w)

12

= -------------------------------------------------------------------

(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)

Jan 6 2014 ISI Kolkata ML for MT 54

Revised probabilities table

a b c d

w 12 14 0 0

x 12 512 13 13

y 0 16 13 13

z 0 16 13 13

ldquorevised countsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 59 13 13

y 0 29 13 13

z 0 29 13 13

a b

w x

a b c d

w 12 38 0 0

x 12 58 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 56

Re-Revised probabilities table a b c d

w 12 316 0 0

x 12 85144 13 13

y 0 19 13 13

z 0 19 13 13

Continue until convergence notice that (bx) binding gets progressively stronger

b=rabbits x=lapins

Derivation of EM based Alignment Expressions

Hindi)(Say language ofy vocabular

English)(Say language ofry vocalbula

2

1

LV

LV

F

E

what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet

जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet

E1

F1

E2

F2

Jan 6 2014 ISI Kolkata ML for MT 58

Vocabulary mapping

Vocabulary VE VF

what is in a name that which we call rose by any other will smell as sweet

naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii

Jan 6 2014 ISI Kolkata ML for MT 59

Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like

11989011 11989012 hellip 119890

11198971 119891

11 11989112 hellip 119891

11198981

11989021 11989022 hellip 119890

21198972 119891

21 11989122 hellip 119891

21198982

1198901199041 119890

1199042 hellip 119890

119904119897119904 119891

1199041 1198911199042 hellip 119891

119904119898119904

1198901198781 119890

1198782 hellip 119890

119878119897119878 119891

1198781 1198911198782 hellip 119891

119878119898119878

No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890

119904119901 =Index of English word 119890119904119901in English vocabularydictionary

119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary

(Thanks to Sachin Pawar for helping with the maths formulae processing)

Jan 6 2014 ISI Kolkata ML for MT 60

Hidden variables and parameters

Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878

119904=1 where each hidden variable is

as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French

word 119911119901119902119904 = 0 otherwise

Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893

word in French vocabulary

Jan 6 2014 ISI Kolkata ML for MT 61

Likelihoods Data Likelihood L(D Θ)

Data Log-Likelihood LL(D Θ)

Expected value of Data Log-Likelihood E(LL(D Θ))

Jan 6 2014 ISI Kolkata ML for MT 62

Constraint and Lagrangian

119875119894119895

119881119865

119895=1

= 1 forall119894

Jan 6 2014 ISI Kolkata ML for MT 63

Differentiating wrt Pij

Jan 6 2014 ISI Kolkata ML for MT 64

Final E and M steps

M-step

E-step

Jan 6 2014 ISI Kolkata ML for MT 65

Combinatorial considerations

Jan 6 2014 ISI Kolkata ML for MT 66

Example

ISI Kolkata ML for MT Jan 6 2014 67

All possible alignments

ISI Kolkata ML for MT Jan 6 2014 68

First fundamental requirement of SMT

Alignment requires evidence of

bull firstly a translation pair to introduce the

POSSIBILITY of a mapping

bull then another pair to establish with

CERTAINTY the mapping

ISI Kolkata ML for MT Jan 6 2014 69

For the ldquocertaintyrdquo

We have a translation pair containing alignment candidates and none of the other words in the translation pair

OR

We have a translation pair containing all words in the translation pair except the alignment candidates

ISI Kolkata ML for MT Jan 6 2014 70

Thereforehellip

If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty

ISI Kolkata ML for MT Jan 6 2014 71

Rough estimate of data requirement

SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge

ie no meanings or grammatical properties of any words phrases or sentences

Each language has a vocabulary of 100000 words can give rise to about 500000 word forms

through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟

bdquowent‟ and bdquogone‟

ISI Kolkata ML for MT Jan 6 2014 72

Reasons for mapping to multiple words

Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register

polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)

syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood

ISI Kolkata ML for MT Jan 6 2014 73

Estimate of corpora requirement

Assume that on an average a sentence is 10 words long

an additional 9 translation pairs for getting at one of the 5 mappings

10 sentences per mapping per word

a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences

Estimate is not wide off the mark

Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs

ISI Kolkata ML for MT Jan 6 2014 74

Our work on factor based SMT

Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009

Jan 6 2014 ISI Kolkata ML for MT 75

Case Marker and Morphology crucial in E-H MT

Order of magnitiude facelift in Fluency and fidelity

Determined by the combination of suffixes and semantic relations on the English side

Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers

Jan 6 2014 ISI Kolkata ML for MT 76

Semantic relations+SuffixesCase

Markers+inflections

I ate mangoes

I ltagt ate eatpast mangoes ltobj

I ltagt mangoes ltobjpl eatpast

mei_ne aam khaa_yaa

Jan 6 2014 ISI Kolkata ML for MT 77

Our Approach

Factored model (Koehn and Hoang 2007) with the following translation factor

suffix + semantic relation case markersuffix

Experiments with the following relations

Dependency relations from the stanford parser

Deeper semantic roles from Universal Networking Language (UNL)

Jan 6 2014 ISI Kolkata ML for MT 78

Our Factorization

Jan 6 2014 ISI Kolkata ML for MT 79

Experiments

Jan 6 2014 ISI Kolkata ML for MT 80

Corpus Statistics

Jan 6 2014 ISI Kolkata ML for MT 81

Results The impact of suffix and semantic factors

Jan 6 2014 ISI Kolkata ML for MT 82

Results The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 83

Subjective Evaluation The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 84

Impact of sentence length (F Fluency AAdequacy E Errors)

Jan 6 2014 ISI Kolkata ML for MT 85

A feel for the improvement-baseline

Jan 6 2014 ISI Kolkata ML for MT 86

A feel for the improvement-reorder

Jan 6 2014 ISI Kolkata ML for MT 87

A feel for the improvement-Semantic relation

Jan 6 2014 ISI Kolkata ML for MT 88

A recent study

PAN Indian SMT

Jan 6 2014 ISI Kolkata ML for MT 89

Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator

SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi

Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English

Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences

Evaluation with BLEU METEOR scores also show high correlation with BLEU

Jan 6 2014 ISI Kolkata ML for MT 90

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech

NER

WSD Financial

bank or River bank

Jan 6 2014 ISI Kolkata ML for MT 21

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech

NER

WSD

Co-reference

ldquoitrdquo ldquobankrdquo

Jan 6 2014 ISI Kolkata ML for MT 22

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech

NER

WSD

Co-reference

Subject Drop

Pro drop (subject ldquoIrdquo)

Jan 6 2014 ISI Kolkata ML for MT 23

Typical NLP tools used

POS tagger

Stanford Named Entity Recognizer

Stanford Dependency Parser

XLE Dependency Parser

Lexical Resource

WordNet

Universal Word Dictionary (UW++)

Jan 6 2014 ISI Kolkata ML for MT 24

System Architecture Stanford

Dependency Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple Sentence Analyser

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

Simple Enco

Simple Enco

Simple Enco

Simple Enco

Simple Enco

Simplifier

Jan 6 2014 ISI Kolkata ML for MT 25

Target Sentence Generation from interlingua

Lexical Transfer

Target Sentence

Generation

Syntax

Planning

Morphological

Synthesis

(WordPhrase

Translation ) (Word form

Generation) (Sequence)

Jan 6 2014 ISI Kolkata ML for MT 26

Generation Architecture Deconversion = Transfer + Generation

Jan 6 2014 ISI Kolkata ML for MT 27

Transfer Based MT

Marathi-Hindi

Jan 6 2014 ISI Kolkata ML for MT 28

Indian Language to Indian Language Machine Translation (ILILMT)

Bidirectional Machine Translation System

Developed for nine Indian language pairs

Approach

Transfer based

Modules developed using both rule based and statistical approach

Jan 6 2014 ISI Kolkata ML for MT 29

Architecture of ILILMT System

Morphological Analyzer

Source Text

POS Tagger

Chunker

Vibhakti Computation

Name Entity Recognizer

Word Sense Disambiguatio

n

Lexical Transfer

Agreement Feature

Interchunk

Word Generator

Intrachunk

Target Text

Analysis

Transfer

Generation

Jan 6 2014 ISI Kolkata ML for MT 30

M-H MT system Evaluation

Subjective evaluation based on machine translation quality

Accuracy calculated based on score given by linguists

S5 Number of score 5 Sentences S4 Number of score 4 sentences

S3 Number of score 3 sentences N Total Number of sentences

Accuracy =

Score 5 Correct Translation

Score 4 Understandable with

minor errors

Score 3 Understandable with

major errors

Score 2 Not Understandable

Score 1 Non sense translation

Jan 6 2014 ISI Kolkata ML for MT 31

Evaluation of Marathi to Hindi MT System

0

02

04

06

08

1

12

MorphAnalyzer

POS Tagger Chunker VibhaktiCompute

WSD LexicalTransfer

WordGenerator

Precision

Recall

Module-wise precision and recall

Jan 6 2014 ISI Kolkata ML for MT 32

Evaluation of Marathi to Hindi MT System (cont)

Subjective evaluation on translation quality Evaluated on 500 web sentences Accuracy calculated based on score given according to the

translation quality Accuracy 6532

Result analysis

Morph POS tagger chunker gives more than 90 precision but Transfer WSD generator modules are below 80 hence degrades MT quality

Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Jan 6 2014 ISI Kolkata ML for MT 33

Statistical Machine Translation

Jan 6 2014 ISI Kolkata ML for MT 34

Czeck-English data

[nesu] ldquoI carryrdquo

[ponese] ldquoHe will carryrdquo

[nese] ldquoHe carriesrdquo

[nesou] ldquoThey carryrdquo

[yedu] ldquoI driverdquo

[plavou] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 35

To translate hellip

I will carry

They drive

He swims

They will drive

Jan 6 2014 ISI Kolkata ML for MT 36

Hindi-English data

[DhotA huM] ldquoI carryrdquo

[DhoegA] ldquoHe will carryrdquo

[DhotA hAi] ldquoHe carriesrdquo

[Dhote hAi] ldquoThey carryrdquo

[chalAtA huM] ldquoI driverdquo

[tErte hEM] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 37

Bangla-English data

[bai] ldquoI carryrdquo

[baibe] ldquoHe will carryrdquo

[bay] ldquoHe carriesrdquo

[bay] ldquoThey carryrdquo

[chAlAi] ldquoI driverdquo

[sAMtrAy] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 38

To translate hellip (repeated)

I will carry

They drive

He swims

They will drive

Jan 6 2014 ISI Kolkata ML for MT 39

Foundation

Data driven approach Goal is to find out the English sentence e

given foreign language sentence f whose p(e|f) is maximum

Translations are generated on the basis of statistical model

Parameters are estimated using bilingual parallel corpora

Jan 6 2014 ISI Kolkata ML for MT 40

SMT Language Model

To detect good English sentences

Probability of an English sentence w1w2 helliphellip wn can be written as

Pr(w1w2 helliphellip wn) = Pr(w1) Pr(w2|w1) Pr(wn|w1 w2 wn-1)

Here Pr(wn|w1 w2 wn-1) is the probability that word wn follows word string w1 w2 wn-1 N-gram model probability

Trigram model probability calculation

Jan 6 2014 ISI Kolkata ML for MT 41

SMT Translation Model

P(f|e) Probability of some f given hypothesis English translation e

How to assign the values to p(e|f)

Sentences are infinite not possible to find pair(ef) for all sentences

Introduce a hidden variable a that represents alignments between the individual words in the sentence pair

Sentence level

Word level

Jan 6 2014 ISI Kolkata ML for MT 42

Alignment

If the string e= e1l= e1 e2 hellipel has l words and

the string f= f1m=f1f2fm has m words

then the alignment a can be represented by a series a1

m= a1a2am of m values each between 0 and l such that if the word in position j of the f-string is connected to the word in position i of the e-string then aj= i and

if it is not connected to any English word then aj= O

Jan 6 2014 ISI Kolkata ML for MT 43

Example of alignment

English Ram went to school

Hindi Raama paathashaalaa gayaa

Ram went to school

ltNullgt Raama paathashaalaa gayaa

Jan 6 2014 ISI Kolkata ML for MT 44

Translation Model Exact expression

Five models for estimating parameters in the expression [2]

Model-1 Model-2 Model-3 Model-4 Model-5

Choose alignment given e and m

Choose the identity of foreign word given e m a

Choose the length of foreign language string given e

Jan 6 2014 ISI Kolkata ML for MT 45

a

eafef )|Pr()|Pr(

m

emafeaf )|Pr()|Pr(

m

emafememaf )|Pr()|Pr()|Pr(

m

emafem )|Pr()|Pr(

m

m

j

jj

jj emfaafem1

1

1

1

1 )|Pr()|Pr(

m

j

jj

j

jj

j

m

emfafemfaaem1

1

11

1

1

1

1 )|Pr()|Pr()|Pr(

)|Pr( emaf )|Pr( em

m

j

jj

j

jj

j emfafemfaa1

1

11

1

1

1

1 )|Pr()|Pr(

Proof of Translation Model Exact

expression

m is fixed for a particular f hence

marginalization

marginalization

Jan 6 2014 ISI Kolkata ML for MT 46

Alignment

Jan 6 2014 ISI Kolkata ML for MT 47

Fundamental and ubiquitous

Spell checking

Translation

Transliteration

Speech to text

Text to speeh

Jan 6 2014 ISI Kolkata ML for MT 48

EM for word alignment from

sentence alignment example

English

(1) three rabbits

a b

(2) rabbits of Grenoble

b c d

French

(1) trois lapins

w x

(2) lapins de Grenoble

x y z

Jan 6 2014 ISI Kolkata ML for MT 49

Initial Probabilities

each cell denotes t(a w) t(a x) etc

a b c d

w 14 14 14 14

x 14 14 14 14

y 14 14 14 14

z 14 14 14 14

The counts in IBM Model 1

Works by maximizing P(f|e) over the entire corpus

For IBM Model 1 we get the following relationship

c (w f |w e f e ) =t (w f |w e )

t (w f |we

0 ) +hellip + t (w f |wel )

c (w f |w e f e ) is the fractional count of the alignment of w f

with w e in f and e

t (w f |w e ) is the probability of w f being the translation of w e

is the count of w f in f

is the count of w e in e

Jan 6 2014 ISI Kolkata ML for MT 51

Example of expected count

C[aw (a b)(w x)]

t(aw)

= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)

t(aw)+t(ax)

14

= ----------------- X 1 X 1= 12

14+14

Jan 6 2014 ISI Kolkata ML for MT 52

ldquocountsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 13 13 13

y 0 13 13 13

z 0 13 13 13

a b

w x

a b c d

w 12 12 0 0

x 12 12 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 53

Revised probability example

trevised(a w)

12

= -------------------------------------------------------------------

(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)

Jan 6 2014 ISI Kolkata ML for MT 54

Revised probabilities table

a b c d

w 12 14 0 0

x 12 512 13 13

y 0 16 13 13

z 0 16 13 13

ldquorevised countsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 59 13 13

y 0 29 13 13

z 0 29 13 13

a b

w x

a b c d

w 12 38 0 0

x 12 58 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 56

Re-Revised probabilities table a b c d

w 12 316 0 0

x 12 85144 13 13

y 0 19 13 13

z 0 19 13 13

Continue until convergence notice that (bx) binding gets progressively stronger

b=rabbits x=lapins

Derivation of EM based Alignment Expressions

Hindi)(Say language ofy vocabular

English)(Say language ofry vocalbula

2

1

LV

LV

F

E

what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet

जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet

E1

F1

E2

F2

Jan 6 2014 ISI Kolkata ML for MT 58

Vocabulary mapping

Vocabulary VE VF

what is in a name that which we call rose by any other will smell as sweet

naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii

Jan 6 2014 ISI Kolkata ML for MT 59

Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like

11989011 11989012 hellip 119890

11198971 119891

11 11989112 hellip 119891

11198981

11989021 11989022 hellip 119890

21198972 119891

21 11989122 hellip 119891

21198982

1198901199041 119890

1199042 hellip 119890

119904119897119904 119891

1199041 1198911199042 hellip 119891

119904119898119904

1198901198781 119890

1198782 hellip 119890

119878119897119878 119891

1198781 1198911198782 hellip 119891

119878119898119878

No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890

119904119901 =Index of English word 119890119904119901in English vocabularydictionary

119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary

(Thanks to Sachin Pawar for helping with the maths formulae processing)

Jan 6 2014 ISI Kolkata ML for MT 60

Hidden variables and parameters

Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878

119904=1 where each hidden variable is

as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French

word 119911119901119902119904 = 0 otherwise

Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893

word in French vocabulary

Jan 6 2014 ISI Kolkata ML for MT 61

Likelihoods Data Likelihood L(D Θ)

Data Log-Likelihood LL(D Θ)

Expected value of Data Log-Likelihood E(LL(D Θ))

Jan 6 2014 ISI Kolkata ML for MT 62

Constraint and Lagrangian

119875119894119895

119881119865

119895=1

= 1 forall119894

Jan 6 2014 ISI Kolkata ML for MT 63

Differentiating wrt Pij

Jan 6 2014 ISI Kolkata ML for MT 64

Final E and M steps

M-step

E-step

Jan 6 2014 ISI Kolkata ML for MT 65

Combinatorial considerations

Jan 6 2014 ISI Kolkata ML for MT 66

Example

ISI Kolkata ML for MT Jan 6 2014 67

All possible alignments

ISI Kolkata ML for MT Jan 6 2014 68

First fundamental requirement of SMT

Alignment requires evidence of

bull firstly a translation pair to introduce the

POSSIBILITY of a mapping

bull then another pair to establish with

CERTAINTY the mapping

ISI Kolkata ML for MT Jan 6 2014 69

For the ldquocertaintyrdquo

We have a translation pair containing alignment candidates and none of the other words in the translation pair

OR

We have a translation pair containing all words in the translation pair except the alignment candidates

ISI Kolkata ML for MT Jan 6 2014 70

Thereforehellip

If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty

ISI Kolkata ML for MT Jan 6 2014 71

Rough estimate of data requirement

SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge

ie no meanings or grammatical properties of any words phrases or sentences

Each language has a vocabulary of 100000 words can give rise to about 500000 word forms

through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟

bdquowent‟ and bdquogone‟

ISI Kolkata ML for MT Jan 6 2014 72

Reasons for mapping to multiple words

Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register

polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)

syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood

ISI Kolkata ML for MT Jan 6 2014 73

Estimate of corpora requirement

Assume that on an average a sentence is 10 words long

an additional 9 translation pairs for getting at one of the 5 mappings

10 sentences per mapping per word

a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences

Estimate is not wide off the mark

Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs

ISI Kolkata ML for MT Jan 6 2014 74

Our work on factor based SMT

Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009

Jan 6 2014 ISI Kolkata ML for MT 75

Case Marker and Morphology crucial in E-H MT

Order of magnitiude facelift in Fluency and fidelity

Determined by the combination of suffixes and semantic relations on the English side

Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers

Jan 6 2014 ISI Kolkata ML for MT 76

Semantic relations+SuffixesCase

Markers+inflections

I ate mangoes

I ltagt ate eatpast mangoes ltobj

I ltagt mangoes ltobjpl eatpast

mei_ne aam khaa_yaa

Jan 6 2014 ISI Kolkata ML for MT 77

Our Approach

Factored model (Koehn and Hoang 2007) with the following translation factor

suffix + semantic relation case markersuffix

Experiments with the following relations

Dependency relations from the stanford parser

Deeper semantic roles from Universal Networking Language (UNL)

Jan 6 2014 ISI Kolkata ML for MT 78

Our Factorization

Jan 6 2014 ISI Kolkata ML for MT 79

Experiments

Jan 6 2014 ISI Kolkata ML for MT 80

Corpus Statistics

Jan 6 2014 ISI Kolkata ML for MT 81

Results The impact of suffix and semantic factors

Jan 6 2014 ISI Kolkata ML for MT 82

Results The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 83

Subjective Evaluation The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 84

Impact of sentence length (F Fluency AAdequacy E Errors)

Jan 6 2014 ISI Kolkata ML for MT 85

A feel for the improvement-baseline

Jan 6 2014 ISI Kolkata ML for MT 86

A feel for the improvement-reorder

Jan 6 2014 ISI Kolkata ML for MT 87

A feel for the improvement-Semantic relation

Jan 6 2014 ISI Kolkata ML for MT 88

A recent study

PAN Indian SMT

Jan 6 2014 ISI Kolkata ML for MT 89

Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator

SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi

Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English

Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences

Evaluation with BLEU METEOR scores also show high correlation with BLEU

Jan 6 2014 ISI Kolkata ML for MT 90

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech

NER

WSD

Co-reference

ldquoitrdquo ldquobankrdquo

Jan 6 2014 ISI Kolkata ML for MT 22

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech

NER

WSD

Co-reference

Subject Drop

Pro drop (subject ldquoIrdquo)

Jan 6 2014 ISI Kolkata ML for MT 23

Typical NLP tools used

POS tagger

Stanford Named Entity Recognizer

Stanford Dependency Parser

XLE Dependency Parser

Lexical Resource

WordNet

Universal Word Dictionary (UW++)

Jan 6 2014 ISI Kolkata ML for MT 24

System Architecture Stanford

Dependency Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple Sentence Analyser

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

Simple Enco

Simple Enco

Simple Enco

Simple Enco

Simple Enco

Simplifier

Jan 6 2014 ISI Kolkata ML for MT 25

Target Sentence Generation from interlingua

Lexical Transfer

Target Sentence

Generation

Syntax

Planning

Morphological

Synthesis

(WordPhrase

Translation ) (Word form

Generation) (Sequence)

Jan 6 2014 ISI Kolkata ML for MT 26

Generation Architecture Deconversion = Transfer + Generation

Jan 6 2014 ISI Kolkata ML for MT 27

Transfer Based MT

Marathi-Hindi

Jan 6 2014 ISI Kolkata ML for MT 28

Indian Language to Indian Language Machine Translation (ILILMT)

Bidirectional Machine Translation System

Developed for nine Indian language pairs

Approach

Transfer based

Modules developed using both rule based and statistical approach

Jan 6 2014 ISI Kolkata ML for MT 29

Architecture of ILILMT System

Morphological Analyzer

Source Text

POS Tagger

Chunker

Vibhakti Computation

Name Entity Recognizer

Word Sense Disambiguatio

n

Lexical Transfer

Agreement Feature

Interchunk

Word Generator

Intrachunk

Target Text

Analysis

Transfer

Generation

Jan 6 2014 ISI Kolkata ML for MT 30

M-H MT system Evaluation

Subjective evaluation based on machine translation quality

Accuracy calculated based on score given by linguists

S5 Number of score 5 Sentences S4 Number of score 4 sentences

S3 Number of score 3 sentences N Total Number of sentences

Accuracy =

Score 5 Correct Translation

Score 4 Understandable with

minor errors

Score 3 Understandable with

major errors

Score 2 Not Understandable

Score 1 Non sense translation

Jan 6 2014 ISI Kolkata ML for MT 31

Evaluation of Marathi to Hindi MT System

0

02

04

06

08

1

12

MorphAnalyzer

POS Tagger Chunker VibhaktiCompute

WSD LexicalTransfer

WordGenerator

Precision

Recall

Module-wise precision and recall

Jan 6 2014 ISI Kolkata ML for MT 32

Evaluation of Marathi to Hindi MT System (cont)

Subjective evaluation on translation quality Evaluated on 500 web sentences Accuracy calculated based on score given according to the

translation quality Accuracy 6532

Result analysis

Morph POS tagger chunker gives more than 90 precision but Transfer WSD generator modules are below 80 hence degrades MT quality

Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Jan 6 2014 ISI Kolkata ML for MT 33

Statistical Machine Translation

Jan 6 2014 ISI Kolkata ML for MT 34

Czeck-English data

[nesu] ldquoI carryrdquo

[ponese] ldquoHe will carryrdquo

[nese] ldquoHe carriesrdquo

[nesou] ldquoThey carryrdquo

[yedu] ldquoI driverdquo

[plavou] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 35

To translate hellip

I will carry

They drive

He swims

They will drive

Jan 6 2014 ISI Kolkata ML for MT 36

Hindi-English data

[DhotA huM] ldquoI carryrdquo

[DhoegA] ldquoHe will carryrdquo

[DhotA hAi] ldquoHe carriesrdquo

[Dhote hAi] ldquoThey carryrdquo

[chalAtA huM] ldquoI driverdquo

[tErte hEM] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 37

Bangla-English data

[bai] ldquoI carryrdquo

[baibe] ldquoHe will carryrdquo

[bay] ldquoHe carriesrdquo

[bay] ldquoThey carryrdquo

[chAlAi] ldquoI driverdquo

[sAMtrAy] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 38

To translate hellip (repeated)

I will carry

They drive

He swims

They will drive

Jan 6 2014 ISI Kolkata ML for MT 39

Foundation

Data driven approach Goal is to find out the English sentence e

given foreign language sentence f whose p(e|f) is maximum

Translations are generated on the basis of statistical model

Parameters are estimated using bilingual parallel corpora

Jan 6 2014 ISI Kolkata ML for MT 40

SMT Language Model

To detect good English sentences

Probability of an English sentence w1w2 helliphellip wn can be written as

Pr(w1w2 helliphellip wn) = Pr(w1) Pr(w2|w1) Pr(wn|w1 w2 wn-1)

Here Pr(wn|w1 w2 wn-1) is the probability that word wn follows word string w1 w2 wn-1 N-gram model probability

Trigram model probability calculation

Jan 6 2014 ISI Kolkata ML for MT 41

SMT Translation Model

P(f|e) Probability of some f given hypothesis English translation e

How to assign the values to p(e|f)

Sentences are infinite not possible to find pair(ef) for all sentences

Introduce a hidden variable a that represents alignments between the individual words in the sentence pair

Sentence level

Word level

Jan 6 2014 ISI Kolkata ML for MT 42

Alignment

If the string e= e1l= e1 e2 hellipel has l words and

the string f= f1m=f1f2fm has m words

then the alignment a can be represented by a series a1

m= a1a2am of m values each between 0 and l such that if the word in position j of the f-string is connected to the word in position i of the e-string then aj= i and

if it is not connected to any English word then aj= O

Jan 6 2014 ISI Kolkata ML for MT 43

Example of alignment

English Ram went to school

Hindi Raama paathashaalaa gayaa

Ram went to school

ltNullgt Raama paathashaalaa gayaa

Jan 6 2014 ISI Kolkata ML for MT 44

Translation Model Exact expression

Five models for estimating parameters in the expression [2]

Model-1 Model-2 Model-3 Model-4 Model-5

Choose alignment given e and m

Choose the identity of foreign word given e m a

Choose the length of foreign language string given e

Jan 6 2014 ISI Kolkata ML for MT 45

a

eafef )|Pr()|Pr(

m

emafeaf )|Pr()|Pr(

m

emafememaf )|Pr()|Pr()|Pr(

m

emafem )|Pr()|Pr(

m

m

j

jj

jj emfaafem1

1

1

1

1 )|Pr()|Pr(

m

j

jj

j

jj

j

m

emfafemfaaem1

1

11

1

1

1

1 )|Pr()|Pr()|Pr(

)|Pr( emaf )|Pr( em

m

j

jj

j

jj

j emfafemfaa1

1

11

1

1

1

1 )|Pr()|Pr(

Proof of Translation Model Exact

expression

m is fixed for a particular f hence

marginalization

marginalization

Jan 6 2014 ISI Kolkata ML for MT 46

Alignment

Jan 6 2014 ISI Kolkata ML for MT 47

Fundamental and ubiquitous

Spell checking

Translation

Transliteration

Speech to text

Text to speeh

Jan 6 2014 ISI Kolkata ML for MT 48

EM for word alignment from

sentence alignment example

English

(1) three rabbits

a b

(2) rabbits of Grenoble

b c d

French

(1) trois lapins

w x

(2) lapins de Grenoble

x y z

Jan 6 2014 ISI Kolkata ML for MT 49

Initial Probabilities

each cell denotes t(a w) t(a x) etc

a b c d

w 14 14 14 14

x 14 14 14 14

y 14 14 14 14

z 14 14 14 14

The counts in IBM Model 1

Works by maximizing P(f|e) over the entire corpus

For IBM Model 1 we get the following relationship

c (w f |w e f e ) =t (w f |w e )

t (w f |we

0 ) +hellip + t (w f |wel )

c (w f |w e f e ) is the fractional count of the alignment of w f

with w e in f and e

t (w f |w e ) is the probability of w f being the translation of w e

is the count of w f in f

is the count of w e in e

Jan 6 2014 ISI Kolkata ML for MT 51

Example of expected count

C[aw (a b)(w x)]

t(aw)

= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)

t(aw)+t(ax)

14

= ----------------- X 1 X 1= 12

14+14

Jan 6 2014 ISI Kolkata ML for MT 52

ldquocountsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 13 13 13

y 0 13 13 13

z 0 13 13 13

a b

w x

a b c d

w 12 12 0 0

x 12 12 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 53

Revised probability example

trevised(a w)

12

= -------------------------------------------------------------------

(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)

Jan 6 2014 ISI Kolkata ML for MT 54

Revised probabilities table

a b c d

w 12 14 0 0

x 12 512 13 13

y 0 16 13 13

z 0 16 13 13

ldquorevised countsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 59 13 13

y 0 29 13 13

z 0 29 13 13

a b

w x

a b c d

w 12 38 0 0

x 12 58 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 56

Re-Revised probabilities table a b c d

w 12 316 0 0

x 12 85144 13 13

y 0 19 13 13

z 0 19 13 13

Continue until convergence notice that (bx) binding gets progressively stronger

b=rabbits x=lapins

Derivation of EM based Alignment Expressions

Hindi)(Say language ofy vocabular

English)(Say language ofry vocalbula

2

1

LV

LV

F

E

what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet

जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet

E1

F1

E2

F2

Jan 6 2014 ISI Kolkata ML for MT 58

Vocabulary mapping

Vocabulary VE VF

what is in a name that which we call rose by any other will smell as sweet

naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii

Jan 6 2014 ISI Kolkata ML for MT 59

Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like

11989011 11989012 hellip 119890

11198971 119891

11 11989112 hellip 119891

11198981

11989021 11989022 hellip 119890

21198972 119891

21 11989122 hellip 119891

21198982

1198901199041 119890

1199042 hellip 119890

119904119897119904 119891

1199041 1198911199042 hellip 119891

119904119898119904

1198901198781 119890

1198782 hellip 119890

119878119897119878 119891

1198781 1198911198782 hellip 119891

119878119898119878

No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890

119904119901 =Index of English word 119890119904119901in English vocabularydictionary

119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary

(Thanks to Sachin Pawar for helping with the maths formulae processing)

Jan 6 2014 ISI Kolkata ML for MT 60

Hidden variables and parameters

Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878

119904=1 where each hidden variable is

as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French

word 119911119901119902119904 = 0 otherwise

Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893

word in French vocabulary

Jan 6 2014 ISI Kolkata ML for MT 61

Likelihoods Data Likelihood L(D Θ)

Data Log-Likelihood LL(D Θ)

Expected value of Data Log-Likelihood E(LL(D Θ))

Jan 6 2014 ISI Kolkata ML for MT 62

Constraint and Lagrangian

119875119894119895

119881119865

119895=1

= 1 forall119894

Jan 6 2014 ISI Kolkata ML for MT 63

Differentiating wrt Pij

Jan 6 2014 ISI Kolkata ML for MT 64

Final E and M steps

M-step

E-step

Jan 6 2014 ISI Kolkata ML for MT 65

Combinatorial considerations

Jan 6 2014 ISI Kolkata ML for MT 66

Example

ISI Kolkata ML for MT Jan 6 2014 67

All possible alignments

ISI Kolkata ML for MT Jan 6 2014 68

First fundamental requirement of SMT

Alignment requires evidence of

bull firstly a translation pair to introduce the

POSSIBILITY of a mapping

bull then another pair to establish with

CERTAINTY the mapping

ISI Kolkata ML for MT Jan 6 2014 69

For the ldquocertaintyrdquo

We have a translation pair containing alignment candidates and none of the other words in the translation pair

OR

We have a translation pair containing all words in the translation pair except the alignment candidates

ISI Kolkata ML for MT Jan 6 2014 70

Thereforehellip

If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty

ISI Kolkata ML for MT Jan 6 2014 71

Rough estimate of data requirement

SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge

ie no meanings or grammatical properties of any words phrases or sentences

Each language has a vocabulary of 100000 words can give rise to about 500000 word forms

through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟

bdquowent‟ and bdquogone‟

ISI Kolkata ML for MT Jan 6 2014 72

Reasons for mapping to multiple words

Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register

polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)

syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood

ISI Kolkata ML for MT Jan 6 2014 73

Estimate of corpora requirement

Assume that on an average a sentence is 10 words long

an additional 9 translation pairs for getting at one of the 5 mappings

10 sentences per mapping per word

a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences

Estimate is not wide off the mark

Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs

ISI Kolkata ML for MT Jan 6 2014 74

Our work on factor based SMT

Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009

Jan 6 2014 ISI Kolkata ML for MT 75

Case Marker and Morphology crucial in E-H MT

Order of magnitiude facelift in Fluency and fidelity

Determined by the combination of suffixes and semantic relations on the English side

Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers

Jan 6 2014 ISI Kolkata ML for MT 76

Semantic relations+SuffixesCase

Markers+inflections

I ate mangoes

I ltagt ate eatpast mangoes ltobj

I ltagt mangoes ltobjpl eatpast

mei_ne aam khaa_yaa

Jan 6 2014 ISI Kolkata ML for MT 77

Our Approach

Factored model (Koehn and Hoang 2007) with the following translation factor

suffix + semantic relation case markersuffix

Experiments with the following relations

Dependency relations from the stanford parser

Deeper semantic roles from Universal Networking Language (UNL)

Jan 6 2014 ISI Kolkata ML for MT 78

Our Factorization

Jan 6 2014 ISI Kolkata ML for MT 79

Experiments

Jan 6 2014 ISI Kolkata ML for MT 80

Corpus Statistics

Jan 6 2014 ISI Kolkata ML for MT 81

Results The impact of suffix and semantic factors

Jan 6 2014 ISI Kolkata ML for MT 82

Results The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 83

Subjective Evaluation The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 84

Impact of sentence length (F Fluency AAdequacy E Errors)

Jan 6 2014 ISI Kolkata ML for MT 85

A feel for the improvement-baseline

Jan 6 2014 ISI Kolkata ML for MT 86

A feel for the improvement-reorder

Jan 6 2014 ISI Kolkata ML for MT 87

A feel for the improvement-Semantic relation

Jan 6 2014 ISI Kolkata ML for MT 88

A recent study

PAN Indian SMT

Jan 6 2014 ISI Kolkata ML for MT 89

Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator

SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi

Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English

Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences

Evaluation with BLEU METEOR scores also show high correlation with BLEU

Jan 6 2014 ISI Kolkata ML for MT 90

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Issues to handle

Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed

ISSUES Part Of Speech

NER

WSD

Co-reference

Subject Drop

Pro drop (subject ldquoIrdquo)

Jan 6 2014 ISI Kolkata ML for MT 23

Typical NLP tools used

POS tagger

Stanford Named Entity Recognizer

Stanford Dependency Parser

XLE Dependency Parser

Lexical Resource

WordNet

Universal Word Dictionary (UW++)

Jan 6 2014 ISI Kolkata ML for MT 24

System Architecture Stanford

Dependency Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple Sentence Analyser

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

Simple Enco

Simple Enco

Simple Enco

Simple Enco

Simple Enco

Simplifier

Jan 6 2014 ISI Kolkata ML for MT 25

Target Sentence Generation from interlingua

Lexical Transfer

Target Sentence

Generation

Syntax

Planning

Morphological

Synthesis

(WordPhrase

Translation ) (Word form

Generation) (Sequence)

Jan 6 2014 ISI Kolkata ML for MT 26

Generation Architecture Deconversion = Transfer + Generation

Jan 6 2014 ISI Kolkata ML for MT 27

Transfer Based MT

Marathi-Hindi

Jan 6 2014 ISI Kolkata ML for MT 28

Indian Language to Indian Language Machine Translation (ILILMT)

Bidirectional Machine Translation System

Developed for nine Indian language pairs

Approach

Transfer based

Modules developed using both rule based and statistical approach

Jan 6 2014 ISI Kolkata ML for MT 29

Architecture of ILILMT System

Morphological Analyzer

Source Text

POS Tagger

Chunker

Vibhakti Computation

Name Entity Recognizer

Word Sense Disambiguatio

n

Lexical Transfer

Agreement Feature

Interchunk

Word Generator

Intrachunk

Target Text

Analysis

Transfer

Generation

Jan 6 2014 ISI Kolkata ML for MT 30

M-H MT system Evaluation

Subjective evaluation based on machine translation quality

Accuracy calculated based on score given by linguists

S5 Number of score 5 Sentences S4 Number of score 4 sentences

S3 Number of score 3 sentences N Total Number of sentences

Accuracy =

Score 5 Correct Translation

Score 4 Understandable with

minor errors

Score 3 Understandable with

major errors

Score 2 Not Understandable

Score 1 Non sense translation

Jan 6 2014 ISI Kolkata ML for MT 31

Evaluation of Marathi to Hindi MT System

0

02

04

06

08

1

12

MorphAnalyzer

POS Tagger Chunker VibhaktiCompute

WSD LexicalTransfer

WordGenerator

Precision

Recall

Module-wise precision and recall

Jan 6 2014 ISI Kolkata ML for MT 32

Evaluation of Marathi to Hindi MT System (cont)

Subjective evaluation on translation quality Evaluated on 500 web sentences Accuracy calculated based on score given according to the

translation quality Accuracy 6532

Result analysis

Morph POS tagger chunker gives more than 90 precision but Transfer WSD generator modules are below 80 hence degrades MT quality

Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Jan 6 2014 ISI Kolkata ML for MT 33

Statistical Machine Translation

Jan 6 2014 ISI Kolkata ML for MT 34

Czeck-English data

[nesu] ldquoI carryrdquo

[ponese] ldquoHe will carryrdquo

[nese] ldquoHe carriesrdquo

[nesou] ldquoThey carryrdquo

[yedu] ldquoI driverdquo

[plavou] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 35

To translate hellip

I will carry

They drive

He swims

They will drive

Jan 6 2014 ISI Kolkata ML for MT 36

Hindi-English data

[DhotA huM] ldquoI carryrdquo

[DhoegA] ldquoHe will carryrdquo

[DhotA hAi] ldquoHe carriesrdquo

[Dhote hAi] ldquoThey carryrdquo

[chalAtA huM] ldquoI driverdquo

[tErte hEM] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 37

Bangla-English data

[bai] ldquoI carryrdquo

[baibe] ldquoHe will carryrdquo

[bay] ldquoHe carriesrdquo

[bay] ldquoThey carryrdquo

[chAlAi] ldquoI driverdquo

[sAMtrAy] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 38

To translate hellip (repeated)

I will carry

They drive

He swims

They will drive

Jan 6 2014 ISI Kolkata ML for MT 39

Foundation

Data driven approach Goal is to find out the English sentence e

given foreign language sentence f whose p(e|f) is maximum

Translations are generated on the basis of statistical model

Parameters are estimated using bilingual parallel corpora

Jan 6 2014 ISI Kolkata ML for MT 40

SMT Language Model

To detect good English sentences

Probability of an English sentence w1w2 helliphellip wn can be written as

Pr(w1w2 helliphellip wn) = Pr(w1) Pr(w2|w1) Pr(wn|w1 w2 wn-1)

Here Pr(wn|w1 w2 wn-1) is the probability that word wn follows word string w1 w2 wn-1 N-gram model probability

Trigram model probability calculation

Jan 6 2014 ISI Kolkata ML for MT 41

SMT Translation Model

P(f|e) Probability of some f given hypothesis English translation e

How to assign the values to p(e|f)

Sentences are infinite not possible to find pair(ef) for all sentences

Introduce a hidden variable a that represents alignments between the individual words in the sentence pair

Sentence level

Word level

Jan 6 2014 ISI Kolkata ML for MT 42

Alignment

If the string e= e1l= e1 e2 hellipel has l words and

the string f= f1m=f1f2fm has m words

then the alignment a can be represented by a series a1

m= a1a2am of m values each between 0 and l such that if the word in position j of the f-string is connected to the word in position i of the e-string then aj= i and

if it is not connected to any English word then aj= O

Jan 6 2014 ISI Kolkata ML for MT 43

Example of alignment

English Ram went to school

Hindi Raama paathashaalaa gayaa

Ram went to school

ltNullgt Raama paathashaalaa gayaa

Jan 6 2014 ISI Kolkata ML for MT 44

Translation Model Exact expression

Five models for estimating parameters in the expression [2]

Model-1 Model-2 Model-3 Model-4 Model-5

Choose alignment given e and m

Choose the identity of foreign word given e m a

Choose the length of foreign language string given e

Jan 6 2014 ISI Kolkata ML for MT 45

a

eafef )|Pr()|Pr(

m

emafeaf )|Pr()|Pr(

m

emafememaf )|Pr()|Pr()|Pr(

m

emafem )|Pr()|Pr(

m

m

j

jj

jj emfaafem1

1

1

1

1 )|Pr()|Pr(

m

j

jj

j

jj

j

m

emfafemfaaem1

1

11

1

1

1

1 )|Pr()|Pr()|Pr(

)|Pr( emaf )|Pr( em

m

j

jj

j

jj

j emfafemfaa1

1

11

1

1

1

1 )|Pr()|Pr(

Proof of Translation Model Exact

expression

m is fixed for a particular f hence

marginalization

marginalization

Jan 6 2014 ISI Kolkata ML for MT 46

Alignment

Jan 6 2014 ISI Kolkata ML for MT 47

Fundamental and ubiquitous

Spell checking

Translation

Transliteration

Speech to text

Text to speeh

Jan 6 2014 ISI Kolkata ML for MT 48

EM for word alignment from

sentence alignment example

English

(1) three rabbits

a b

(2) rabbits of Grenoble

b c d

French

(1) trois lapins

w x

(2) lapins de Grenoble

x y z

Jan 6 2014 ISI Kolkata ML for MT 49

Initial Probabilities

each cell denotes t(a w) t(a x) etc

a b c d

w 14 14 14 14

x 14 14 14 14

y 14 14 14 14

z 14 14 14 14

The counts in IBM Model 1

Works by maximizing P(f|e) over the entire corpus

For IBM Model 1 we get the following relationship

c (w f |w e f e ) =t (w f |w e )

t (w f |we

0 ) +hellip + t (w f |wel )

c (w f |w e f e ) is the fractional count of the alignment of w f

with w e in f and e

t (w f |w e ) is the probability of w f being the translation of w e

is the count of w f in f

is the count of w e in e

Jan 6 2014 ISI Kolkata ML for MT 51

Example of expected count

C[aw (a b)(w x)]

t(aw)

= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)

t(aw)+t(ax)

14

= ----------------- X 1 X 1= 12

14+14

Jan 6 2014 ISI Kolkata ML for MT 52

ldquocountsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 13 13 13

y 0 13 13 13

z 0 13 13 13

a b

w x

a b c d

w 12 12 0 0

x 12 12 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 53

Revised probability example

trevised(a w)

12

= -------------------------------------------------------------------

(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)

Jan 6 2014 ISI Kolkata ML for MT 54

Revised probabilities table

a b c d

w 12 14 0 0

x 12 512 13 13

y 0 16 13 13

z 0 16 13 13

ldquorevised countsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 59 13 13

y 0 29 13 13

z 0 29 13 13

a b

w x

a b c d

w 12 38 0 0

x 12 58 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 56

Re-Revised probabilities table a b c d

w 12 316 0 0

x 12 85144 13 13

y 0 19 13 13

z 0 19 13 13

Continue until convergence notice that (bx) binding gets progressively stronger

b=rabbits x=lapins

Derivation of EM based Alignment Expressions

Hindi)(Say language ofy vocabular

English)(Say language ofry vocalbula

2

1

LV

LV

F

E

what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet

जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet

E1

F1

E2

F2

Jan 6 2014 ISI Kolkata ML for MT 58

Vocabulary mapping

Vocabulary VE VF

what is in a name that which we call rose by any other will smell as sweet

naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii

Jan 6 2014 ISI Kolkata ML for MT 59

Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like

11989011 11989012 hellip 119890

11198971 119891

11 11989112 hellip 119891

11198981

11989021 11989022 hellip 119890

21198972 119891

21 11989122 hellip 119891

21198982

1198901199041 119890

1199042 hellip 119890

119904119897119904 119891

1199041 1198911199042 hellip 119891

119904119898119904

1198901198781 119890

1198782 hellip 119890

119878119897119878 119891

1198781 1198911198782 hellip 119891

119878119898119878

No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890

119904119901 =Index of English word 119890119904119901in English vocabularydictionary

119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary

(Thanks to Sachin Pawar for helping with the maths formulae processing)

Jan 6 2014 ISI Kolkata ML for MT 60

Hidden variables and parameters

Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878

119904=1 where each hidden variable is

as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French

word 119911119901119902119904 = 0 otherwise

Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893

word in French vocabulary

Jan 6 2014 ISI Kolkata ML for MT 61

Likelihoods Data Likelihood L(D Θ)

Data Log-Likelihood LL(D Θ)

Expected value of Data Log-Likelihood E(LL(D Θ))

Jan 6 2014 ISI Kolkata ML for MT 62

Constraint and Lagrangian

119875119894119895

119881119865

119895=1

= 1 forall119894

Jan 6 2014 ISI Kolkata ML for MT 63

Differentiating wrt Pij

Jan 6 2014 ISI Kolkata ML for MT 64

Final E and M steps

M-step

E-step

Jan 6 2014 ISI Kolkata ML for MT 65

Combinatorial considerations

Jan 6 2014 ISI Kolkata ML for MT 66

Example

ISI Kolkata ML for MT Jan 6 2014 67

All possible alignments

ISI Kolkata ML for MT Jan 6 2014 68

First fundamental requirement of SMT

Alignment requires evidence of

bull firstly a translation pair to introduce the

POSSIBILITY of a mapping

bull then another pair to establish with

CERTAINTY the mapping

ISI Kolkata ML for MT Jan 6 2014 69

For the ldquocertaintyrdquo

We have a translation pair containing alignment candidates and none of the other words in the translation pair

OR

We have a translation pair containing all words in the translation pair except the alignment candidates

ISI Kolkata ML for MT Jan 6 2014 70

Thereforehellip

If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty

ISI Kolkata ML for MT Jan 6 2014 71

Rough estimate of data requirement

SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge

ie no meanings or grammatical properties of any words phrases or sentences

Each language has a vocabulary of 100000 words can give rise to about 500000 word forms

through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟

bdquowent‟ and bdquogone‟

ISI Kolkata ML for MT Jan 6 2014 72

Reasons for mapping to multiple words

Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register

polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)

syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood

ISI Kolkata ML for MT Jan 6 2014 73

Estimate of corpora requirement

Assume that on an average a sentence is 10 words long

an additional 9 translation pairs for getting at one of the 5 mappings

10 sentences per mapping per word

a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences

Estimate is not wide off the mark

Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs

ISI Kolkata ML for MT Jan 6 2014 74

Our work on factor based SMT

Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009

Jan 6 2014 ISI Kolkata ML for MT 75

Case Marker and Morphology crucial in E-H MT

Order of magnitiude facelift in Fluency and fidelity

Determined by the combination of suffixes and semantic relations on the English side

Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers

Jan 6 2014 ISI Kolkata ML for MT 76

Semantic relations+SuffixesCase

Markers+inflections

I ate mangoes

I ltagt ate eatpast mangoes ltobj

I ltagt mangoes ltobjpl eatpast

mei_ne aam khaa_yaa

Jan 6 2014 ISI Kolkata ML for MT 77

Our Approach

Factored model (Koehn and Hoang 2007) with the following translation factor

suffix + semantic relation case markersuffix

Experiments with the following relations

Dependency relations from the stanford parser

Deeper semantic roles from Universal Networking Language (UNL)

Jan 6 2014 ISI Kolkata ML for MT 78

Our Factorization

Jan 6 2014 ISI Kolkata ML for MT 79

Experiments

Jan 6 2014 ISI Kolkata ML for MT 80

Corpus Statistics

Jan 6 2014 ISI Kolkata ML for MT 81

Results The impact of suffix and semantic factors

Jan 6 2014 ISI Kolkata ML for MT 82

Results The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 83

Subjective Evaluation The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 84

Impact of sentence length (F Fluency AAdequacy E Errors)

Jan 6 2014 ISI Kolkata ML for MT 85

A feel for the improvement-baseline

Jan 6 2014 ISI Kolkata ML for MT 86

A feel for the improvement-reorder

Jan 6 2014 ISI Kolkata ML for MT 87

A feel for the improvement-Semantic relation

Jan 6 2014 ISI Kolkata ML for MT 88

A recent study

PAN Indian SMT

Jan 6 2014 ISI Kolkata ML for MT 89

Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator

SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi

Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English

Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences

Evaluation with BLEU METEOR scores also show high correlation with BLEU

Jan 6 2014 ISI Kolkata ML for MT 90

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Typical NLP tools used

POS tagger

Stanford Named Entity Recognizer

Stanford Dependency Parser

XLE Dependency Parser

Lexical Resource

WordNet

Universal Word Dictionary (UW++)

Jan 6 2014 ISI Kolkata ML for MT 24

System Architecture Stanford

Dependency Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple Sentence Analyser

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

Simple Enco

Simple Enco

Simple Enco

Simple Enco

Simple Enco

Simplifier

Jan 6 2014 ISI Kolkata ML for MT 25

Target Sentence Generation from interlingua

Lexical Transfer

Target Sentence

Generation

Syntax

Planning

Morphological

Synthesis

(WordPhrase

Translation ) (Word form

Generation) (Sequence)

Jan 6 2014 ISI Kolkata ML for MT 26

Generation Architecture Deconversion = Transfer + Generation

Jan 6 2014 ISI Kolkata ML for MT 27

Transfer Based MT

Marathi-Hindi

Jan 6 2014 ISI Kolkata ML for MT 28

Indian Language to Indian Language Machine Translation (ILILMT)

Bidirectional Machine Translation System

Developed for nine Indian language pairs

Approach

Transfer based

Modules developed using both rule based and statistical approach

Jan 6 2014 ISI Kolkata ML for MT 29

Architecture of ILILMT System

Morphological Analyzer

Source Text

POS Tagger

Chunker

Vibhakti Computation

Name Entity Recognizer

Word Sense Disambiguatio

n

Lexical Transfer

Agreement Feature

Interchunk

Word Generator

Intrachunk

Target Text

Analysis

Transfer

Generation

Jan 6 2014 ISI Kolkata ML for MT 30

M-H MT system Evaluation

Subjective evaluation based on machine translation quality

Accuracy calculated based on score given by linguists

S5 Number of score 5 Sentences S4 Number of score 4 sentences

S3 Number of score 3 sentences N Total Number of sentences

Accuracy =

Score 5 Correct Translation

Score 4 Understandable with

minor errors

Score 3 Understandable with

major errors

Score 2 Not Understandable

Score 1 Non sense translation

Jan 6 2014 ISI Kolkata ML for MT 31

Evaluation of Marathi to Hindi MT System

0

02

04

06

08

1

12

MorphAnalyzer

POS Tagger Chunker VibhaktiCompute

WSD LexicalTransfer

WordGenerator

Precision

Recall

Module-wise precision and recall

Jan 6 2014 ISI Kolkata ML for MT 32

Evaluation of Marathi to Hindi MT System (cont)

Subjective evaluation on translation quality Evaluated on 500 web sentences Accuracy calculated based on score given according to the

translation quality Accuracy 6532

Result analysis

Morph POS tagger chunker gives more than 90 precision but Transfer WSD generator modules are below 80 hence degrades MT quality

Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Jan 6 2014 ISI Kolkata ML for MT 33

Statistical Machine Translation

Jan 6 2014 ISI Kolkata ML for MT 34

Czeck-English data

[nesu] ldquoI carryrdquo

[ponese] ldquoHe will carryrdquo

[nese] ldquoHe carriesrdquo

[nesou] ldquoThey carryrdquo

[yedu] ldquoI driverdquo

[plavou] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 35

To translate hellip

I will carry

They drive

He swims

They will drive

Jan 6 2014 ISI Kolkata ML for MT 36

Hindi-English data

[DhotA huM] ldquoI carryrdquo

[DhoegA] ldquoHe will carryrdquo

[DhotA hAi] ldquoHe carriesrdquo

[Dhote hAi] ldquoThey carryrdquo

[chalAtA huM] ldquoI driverdquo

[tErte hEM] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 37

Bangla-English data

[bai] ldquoI carryrdquo

[baibe] ldquoHe will carryrdquo

[bay] ldquoHe carriesrdquo

[bay] ldquoThey carryrdquo

[chAlAi] ldquoI driverdquo

[sAMtrAy] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 38

To translate hellip (repeated)

I will carry

They drive

He swims

They will drive

Jan 6 2014 ISI Kolkata ML for MT 39

Foundation

Data driven approach Goal is to find out the English sentence e

given foreign language sentence f whose p(e|f) is maximum

Translations are generated on the basis of statistical model

Parameters are estimated using bilingual parallel corpora

Jan 6 2014 ISI Kolkata ML for MT 40

SMT Language Model

To detect good English sentences

Probability of an English sentence w1w2 helliphellip wn can be written as

Pr(w1w2 helliphellip wn) = Pr(w1) Pr(w2|w1) Pr(wn|w1 w2 wn-1)

Here Pr(wn|w1 w2 wn-1) is the probability that word wn follows word string w1 w2 wn-1 N-gram model probability

Trigram model probability calculation

Jan 6 2014 ISI Kolkata ML for MT 41

SMT Translation Model

P(f|e) Probability of some f given hypothesis English translation e

How to assign the values to p(e|f)

Sentences are infinite not possible to find pair(ef) for all sentences

Introduce a hidden variable a that represents alignments between the individual words in the sentence pair

Sentence level

Word level

Jan 6 2014 ISI Kolkata ML for MT 42

Alignment

If the string e= e1l= e1 e2 hellipel has l words and

the string f= f1m=f1f2fm has m words

then the alignment a can be represented by a series a1

m= a1a2am of m values each between 0 and l such that if the word in position j of the f-string is connected to the word in position i of the e-string then aj= i and

if it is not connected to any English word then aj= O

Jan 6 2014 ISI Kolkata ML for MT 43

Example of alignment

English Ram went to school

Hindi Raama paathashaalaa gayaa

Ram went to school

ltNullgt Raama paathashaalaa gayaa

Jan 6 2014 ISI Kolkata ML for MT 44

Translation Model Exact expression

Five models for estimating parameters in the expression [2]

Model-1 Model-2 Model-3 Model-4 Model-5

Choose alignment given e and m

Choose the identity of foreign word given e m a

Choose the length of foreign language string given e

Jan 6 2014 ISI Kolkata ML for MT 45

a

eafef )|Pr()|Pr(

m

emafeaf )|Pr()|Pr(

m

emafememaf )|Pr()|Pr()|Pr(

m

emafem )|Pr()|Pr(

m

m

j

jj

jj emfaafem1

1

1

1

1 )|Pr()|Pr(

m

j

jj

j

jj

j

m

emfafemfaaem1

1

11

1

1

1

1 )|Pr()|Pr()|Pr(

)|Pr( emaf )|Pr( em

m

j

jj

j

jj

j emfafemfaa1

1

11

1

1

1

1 )|Pr()|Pr(

Proof of Translation Model Exact

expression

m is fixed for a particular f hence

marginalization

marginalization

Jan 6 2014 ISI Kolkata ML for MT 46

Alignment

Jan 6 2014 ISI Kolkata ML for MT 47

Fundamental and ubiquitous

Spell checking

Translation

Transliteration

Speech to text

Text to speeh

Jan 6 2014 ISI Kolkata ML for MT 48

EM for word alignment from

sentence alignment example

English

(1) three rabbits

a b

(2) rabbits of Grenoble

b c d

French

(1) trois lapins

w x

(2) lapins de Grenoble

x y z

Jan 6 2014 ISI Kolkata ML for MT 49

Initial Probabilities

each cell denotes t(a w) t(a x) etc

a b c d

w 14 14 14 14

x 14 14 14 14

y 14 14 14 14

z 14 14 14 14

The counts in IBM Model 1

Works by maximizing P(f|e) over the entire corpus

For IBM Model 1 we get the following relationship

c (w f |w e f e ) =t (w f |w e )

t (w f |we

0 ) +hellip + t (w f |wel )

c (w f |w e f e ) is the fractional count of the alignment of w f

with w e in f and e

t (w f |w e ) is the probability of w f being the translation of w e

is the count of w f in f

is the count of w e in e

Jan 6 2014 ISI Kolkata ML for MT 51

Example of expected count

C[aw (a b)(w x)]

t(aw)

= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)

t(aw)+t(ax)

14

= ----------------- X 1 X 1= 12

14+14

Jan 6 2014 ISI Kolkata ML for MT 52

ldquocountsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 13 13 13

y 0 13 13 13

z 0 13 13 13

a b

w x

a b c d

w 12 12 0 0

x 12 12 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 53

Revised probability example

trevised(a w)

12

= -------------------------------------------------------------------

(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)

Jan 6 2014 ISI Kolkata ML for MT 54

Revised probabilities table

a b c d

w 12 14 0 0

x 12 512 13 13

y 0 16 13 13

z 0 16 13 13

ldquorevised countsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 59 13 13

y 0 29 13 13

z 0 29 13 13

a b

w x

a b c d

w 12 38 0 0

x 12 58 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 56

Re-Revised probabilities table a b c d

w 12 316 0 0

x 12 85144 13 13

y 0 19 13 13

z 0 19 13 13

Continue until convergence notice that (bx) binding gets progressively stronger

b=rabbits x=lapins

Derivation of EM based Alignment Expressions

Hindi)(Say language ofy vocabular

English)(Say language ofry vocalbula

2

1

LV

LV

F

E

what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet

जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet

E1

F1

E2

F2

Jan 6 2014 ISI Kolkata ML for MT 58

Vocabulary mapping

Vocabulary VE VF

what is in a name that which we call rose by any other will smell as sweet

naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii

Jan 6 2014 ISI Kolkata ML for MT 59

Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like

11989011 11989012 hellip 119890

11198971 119891

11 11989112 hellip 119891

11198981

11989021 11989022 hellip 119890

21198972 119891

21 11989122 hellip 119891

21198982

1198901199041 119890

1199042 hellip 119890

119904119897119904 119891

1199041 1198911199042 hellip 119891

119904119898119904

1198901198781 119890

1198782 hellip 119890

119878119897119878 119891

1198781 1198911198782 hellip 119891

119878119898119878

No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890

119904119901 =Index of English word 119890119904119901in English vocabularydictionary

119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary

(Thanks to Sachin Pawar for helping with the maths formulae processing)

Jan 6 2014 ISI Kolkata ML for MT 60

Hidden variables and parameters

Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878

119904=1 where each hidden variable is

as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French

word 119911119901119902119904 = 0 otherwise

Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893

word in French vocabulary

Jan 6 2014 ISI Kolkata ML for MT 61

Likelihoods Data Likelihood L(D Θ)

Data Log-Likelihood LL(D Θ)

Expected value of Data Log-Likelihood E(LL(D Θ))

Jan 6 2014 ISI Kolkata ML for MT 62

Constraint and Lagrangian

119875119894119895

119881119865

119895=1

= 1 forall119894

Jan 6 2014 ISI Kolkata ML for MT 63

Differentiating wrt Pij

Jan 6 2014 ISI Kolkata ML for MT 64

Final E and M steps

M-step

E-step

Jan 6 2014 ISI Kolkata ML for MT 65

Combinatorial considerations

Jan 6 2014 ISI Kolkata ML for MT 66

Example

ISI Kolkata ML for MT Jan 6 2014 67

All possible alignments

ISI Kolkata ML for MT Jan 6 2014 68

First fundamental requirement of SMT

Alignment requires evidence of

bull firstly a translation pair to introduce the

POSSIBILITY of a mapping

bull then another pair to establish with

CERTAINTY the mapping

ISI Kolkata ML for MT Jan 6 2014 69

For the ldquocertaintyrdquo

We have a translation pair containing alignment candidates and none of the other words in the translation pair

OR

We have a translation pair containing all words in the translation pair except the alignment candidates

ISI Kolkata ML for MT Jan 6 2014 70

Thereforehellip

If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty

ISI Kolkata ML for MT Jan 6 2014 71

Rough estimate of data requirement

SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge

ie no meanings or grammatical properties of any words phrases or sentences

Each language has a vocabulary of 100000 words can give rise to about 500000 word forms

through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟

bdquowent‟ and bdquogone‟

ISI Kolkata ML for MT Jan 6 2014 72

Reasons for mapping to multiple words

Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register

polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)

syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood

ISI Kolkata ML for MT Jan 6 2014 73

Estimate of corpora requirement

Assume that on an average a sentence is 10 words long

an additional 9 translation pairs for getting at one of the 5 mappings

10 sentences per mapping per word

a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences

Estimate is not wide off the mark

Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs

ISI Kolkata ML for MT Jan 6 2014 74

Our work on factor based SMT

Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009

Jan 6 2014 ISI Kolkata ML for MT 75

Case Marker and Morphology crucial in E-H MT

Order of magnitiude facelift in Fluency and fidelity

Determined by the combination of suffixes and semantic relations on the English side

Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers

Jan 6 2014 ISI Kolkata ML for MT 76

Semantic relations+SuffixesCase

Markers+inflections

I ate mangoes

I ltagt ate eatpast mangoes ltobj

I ltagt mangoes ltobjpl eatpast

mei_ne aam khaa_yaa

Jan 6 2014 ISI Kolkata ML for MT 77

Our Approach

Factored model (Koehn and Hoang 2007) with the following translation factor

suffix + semantic relation case markersuffix

Experiments with the following relations

Dependency relations from the stanford parser

Deeper semantic roles from Universal Networking Language (UNL)

Jan 6 2014 ISI Kolkata ML for MT 78

Our Factorization

Jan 6 2014 ISI Kolkata ML for MT 79

Experiments

Jan 6 2014 ISI Kolkata ML for MT 80

Corpus Statistics

Jan 6 2014 ISI Kolkata ML for MT 81

Results The impact of suffix and semantic factors

Jan 6 2014 ISI Kolkata ML for MT 82

Results The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 83

Subjective Evaluation The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 84

Impact of sentence length (F Fluency AAdequacy E Errors)

Jan 6 2014 ISI Kolkata ML for MT 85

A feel for the improvement-baseline

Jan 6 2014 ISI Kolkata ML for MT 86

A feel for the improvement-reorder

Jan 6 2014 ISI Kolkata ML for MT 87

A feel for the improvement-Semantic relation

Jan 6 2014 ISI Kolkata ML for MT 88

A recent study

PAN Indian SMT

Jan 6 2014 ISI Kolkata ML for MT 89

Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator

SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi

Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English

Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences

Evaluation with BLEU METEOR scores also show high correlation with BLEU

Jan 6 2014 ISI Kolkata ML for MT 90

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

System Architecture Stanford

Dependency Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple Sentence Analyser

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

Simple Enco

Simple Enco

Simple Enco

Simple Enco

Simple Enco

Simplifier

Jan 6 2014 ISI Kolkata ML for MT 25

Target Sentence Generation from interlingua

Lexical Transfer

Target Sentence

Generation

Syntax

Planning

Morphological

Synthesis

(WordPhrase

Translation ) (Word form

Generation) (Sequence)

Jan 6 2014 ISI Kolkata ML for MT 26

Generation Architecture Deconversion = Transfer + Generation

Jan 6 2014 ISI Kolkata ML for MT 27

Transfer Based MT

Marathi-Hindi

Jan 6 2014 ISI Kolkata ML for MT 28

Indian Language to Indian Language Machine Translation (ILILMT)

Bidirectional Machine Translation System

Developed for nine Indian language pairs

Approach

Transfer based

Modules developed using both rule based and statistical approach

Jan 6 2014 ISI Kolkata ML for MT 29

Architecture of ILILMT System

Morphological Analyzer

Source Text

POS Tagger

Chunker

Vibhakti Computation

Name Entity Recognizer

Word Sense Disambiguatio

n

Lexical Transfer

Agreement Feature

Interchunk

Word Generator

Intrachunk

Target Text

Analysis

Transfer

Generation

Jan 6 2014 ISI Kolkata ML for MT 30

M-H MT system Evaluation

Subjective evaluation based on machine translation quality

Accuracy calculated based on score given by linguists

S5 Number of score 5 Sentences S4 Number of score 4 sentences

S3 Number of score 3 sentences N Total Number of sentences

Accuracy =

Score 5 Correct Translation

Score 4 Understandable with

minor errors

Score 3 Understandable with

major errors

Score 2 Not Understandable

Score 1 Non sense translation

Jan 6 2014 ISI Kolkata ML for MT 31

Evaluation of Marathi to Hindi MT System

0

02

04

06

08

1

12

MorphAnalyzer

POS Tagger Chunker VibhaktiCompute

WSD LexicalTransfer

WordGenerator

Precision

Recall

Module-wise precision and recall

Jan 6 2014 ISI Kolkata ML for MT 32

Evaluation of Marathi to Hindi MT System (cont)

Subjective evaluation on translation quality Evaluated on 500 web sentences Accuracy calculated based on score given according to the

translation quality Accuracy 6532

Result analysis

Morph POS tagger chunker gives more than 90 precision but Transfer WSD generator modules are below 80 hence degrades MT quality

Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Jan 6 2014 ISI Kolkata ML for MT 33

Statistical Machine Translation

Jan 6 2014 ISI Kolkata ML for MT 34

Czeck-English data

[nesu] ldquoI carryrdquo

[ponese] ldquoHe will carryrdquo

[nese] ldquoHe carriesrdquo

[nesou] ldquoThey carryrdquo

[yedu] ldquoI driverdquo

[plavou] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 35

To translate hellip

I will carry

They drive

He swims

They will drive

Jan 6 2014 ISI Kolkata ML for MT 36

Hindi-English data

[DhotA huM] ldquoI carryrdquo

[DhoegA] ldquoHe will carryrdquo

[DhotA hAi] ldquoHe carriesrdquo

[Dhote hAi] ldquoThey carryrdquo

[chalAtA huM] ldquoI driverdquo

[tErte hEM] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 37

Bangla-English data

[bai] ldquoI carryrdquo

[baibe] ldquoHe will carryrdquo

[bay] ldquoHe carriesrdquo

[bay] ldquoThey carryrdquo

[chAlAi] ldquoI driverdquo

[sAMtrAy] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 38

To translate hellip (repeated)

I will carry

They drive

He swims

They will drive

Jan 6 2014 ISI Kolkata ML for MT 39

Foundation

Data driven approach Goal is to find out the English sentence e

given foreign language sentence f whose p(e|f) is maximum

Translations are generated on the basis of statistical model

Parameters are estimated using bilingual parallel corpora

Jan 6 2014 ISI Kolkata ML for MT 40

SMT Language Model

To detect good English sentences

Probability of an English sentence w1w2 helliphellip wn can be written as

Pr(w1w2 helliphellip wn) = Pr(w1) Pr(w2|w1) Pr(wn|w1 w2 wn-1)

Here Pr(wn|w1 w2 wn-1) is the probability that word wn follows word string w1 w2 wn-1 N-gram model probability

Trigram model probability calculation

Jan 6 2014 ISI Kolkata ML for MT 41

SMT Translation Model

P(f|e) Probability of some f given hypothesis English translation e

How to assign the values to p(e|f)

Sentences are infinite not possible to find pair(ef) for all sentences

Introduce a hidden variable a that represents alignments between the individual words in the sentence pair

Sentence level

Word level

Jan 6 2014 ISI Kolkata ML for MT 42

Alignment

If the string e= e1l= e1 e2 hellipel has l words and

the string f= f1m=f1f2fm has m words

then the alignment a can be represented by a series a1

m= a1a2am of m values each between 0 and l such that if the word in position j of the f-string is connected to the word in position i of the e-string then aj= i and

if it is not connected to any English word then aj= O

Jan 6 2014 ISI Kolkata ML for MT 43

Example of alignment

English Ram went to school

Hindi Raama paathashaalaa gayaa

Ram went to school

ltNullgt Raama paathashaalaa gayaa

Jan 6 2014 ISI Kolkata ML for MT 44

Translation Model Exact expression

Five models for estimating parameters in the expression [2]

Model-1 Model-2 Model-3 Model-4 Model-5

Choose alignment given e and m

Choose the identity of foreign word given e m a

Choose the length of foreign language string given e

Jan 6 2014 ISI Kolkata ML for MT 45

a

eafef )|Pr()|Pr(

m

emafeaf )|Pr()|Pr(

m

emafememaf )|Pr()|Pr()|Pr(

m

emafem )|Pr()|Pr(

m

m

j

jj

jj emfaafem1

1

1

1

1 )|Pr()|Pr(

m

j

jj

j

jj

j

m

emfafemfaaem1

1

11

1

1

1

1 )|Pr()|Pr()|Pr(

)|Pr( emaf )|Pr( em

m

j

jj

j

jj

j emfafemfaa1

1

11

1

1

1

1 )|Pr()|Pr(

Proof of Translation Model Exact

expression

m is fixed for a particular f hence

marginalization

marginalization

Jan 6 2014 ISI Kolkata ML for MT 46

Alignment

Jan 6 2014 ISI Kolkata ML for MT 47

Fundamental and ubiquitous

Spell checking

Translation

Transliteration

Speech to text

Text to speeh

Jan 6 2014 ISI Kolkata ML for MT 48

EM for word alignment from

sentence alignment example

English

(1) three rabbits

a b

(2) rabbits of Grenoble

b c d

French

(1) trois lapins

w x

(2) lapins de Grenoble

x y z

Jan 6 2014 ISI Kolkata ML for MT 49

Initial Probabilities

each cell denotes t(a w) t(a x) etc

a b c d

w 14 14 14 14

x 14 14 14 14

y 14 14 14 14

z 14 14 14 14

The counts in IBM Model 1

Works by maximizing P(f|e) over the entire corpus

For IBM Model 1 we get the following relationship

c (w f |w e f e ) =t (w f |w e )

t (w f |we

0 ) +hellip + t (w f |wel )

c (w f |w e f e ) is the fractional count of the alignment of w f

with w e in f and e

t (w f |w e ) is the probability of w f being the translation of w e

is the count of w f in f

is the count of w e in e

Jan 6 2014 ISI Kolkata ML for MT 51

Example of expected count

C[aw (a b)(w x)]

t(aw)

= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)

t(aw)+t(ax)

14

= ----------------- X 1 X 1= 12

14+14

Jan 6 2014 ISI Kolkata ML for MT 52

ldquocountsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 13 13 13

y 0 13 13 13

z 0 13 13 13

a b

w x

a b c d

w 12 12 0 0

x 12 12 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 53

Revised probability example

trevised(a w)

12

= -------------------------------------------------------------------

(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)

Jan 6 2014 ISI Kolkata ML for MT 54

Revised probabilities table

a b c d

w 12 14 0 0

x 12 512 13 13

y 0 16 13 13

z 0 16 13 13

ldquorevised countsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 59 13 13

y 0 29 13 13

z 0 29 13 13

a b

w x

a b c d

w 12 38 0 0

x 12 58 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 56

Re-Revised probabilities table a b c d

w 12 316 0 0

x 12 85144 13 13

y 0 19 13 13

z 0 19 13 13

Continue until convergence notice that (bx) binding gets progressively stronger

b=rabbits x=lapins

Derivation of EM based Alignment Expressions

Hindi)(Say language ofy vocabular

English)(Say language ofry vocalbula

2

1

LV

LV

F

E

what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet

जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet

E1

F1

E2

F2

Jan 6 2014 ISI Kolkata ML for MT 58

Vocabulary mapping

Vocabulary VE VF

what is in a name that which we call rose by any other will smell as sweet

naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii

Jan 6 2014 ISI Kolkata ML for MT 59

Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like

11989011 11989012 hellip 119890

11198971 119891

11 11989112 hellip 119891

11198981

11989021 11989022 hellip 119890

21198972 119891

21 11989122 hellip 119891

21198982

1198901199041 119890

1199042 hellip 119890

119904119897119904 119891

1199041 1198911199042 hellip 119891

119904119898119904

1198901198781 119890

1198782 hellip 119890

119878119897119878 119891

1198781 1198911198782 hellip 119891

119878119898119878

No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890

119904119901 =Index of English word 119890119904119901in English vocabularydictionary

119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary

(Thanks to Sachin Pawar for helping with the maths formulae processing)

Jan 6 2014 ISI Kolkata ML for MT 60

Hidden variables and parameters

Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878

119904=1 where each hidden variable is

as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French

word 119911119901119902119904 = 0 otherwise

Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893

word in French vocabulary

Jan 6 2014 ISI Kolkata ML for MT 61

Likelihoods Data Likelihood L(D Θ)

Data Log-Likelihood LL(D Θ)

Expected value of Data Log-Likelihood E(LL(D Θ))

Jan 6 2014 ISI Kolkata ML for MT 62

Constraint and Lagrangian

119875119894119895

119881119865

119895=1

= 1 forall119894

Jan 6 2014 ISI Kolkata ML for MT 63

Differentiating wrt Pij

Jan 6 2014 ISI Kolkata ML for MT 64

Final E and M steps

M-step

E-step

Jan 6 2014 ISI Kolkata ML for MT 65

Combinatorial considerations

Jan 6 2014 ISI Kolkata ML for MT 66

Example

ISI Kolkata ML for MT Jan 6 2014 67

All possible alignments

ISI Kolkata ML for MT Jan 6 2014 68

First fundamental requirement of SMT

Alignment requires evidence of

bull firstly a translation pair to introduce the

POSSIBILITY of a mapping

bull then another pair to establish with

CERTAINTY the mapping

ISI Kolkata ML for MT Jan 6 2014 69

For the ldquocertaintyrdquo

We have a translation pair containing alignment candidates and none of the other words in the translation pair

OR

We have a translation pair containing all words in the translation pair except the alignment candidates

ISI Kolkata ML for MT Jan 6 2014 70

Thereforehellip

If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty

ISI Kolkata ML for MT Jan 6 2014 71

Rough estimate of data requirement

SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge

ie no meanings or grammatical properties of any words phrases or sentences

Each language has a vocabulary of 100000 words can give rise to about 500000 word forms

through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟

bdquowent‟ and bdquogone‟

ISI Kolkata ML for MT Jan 6 2014 72

Reasons for mapping to multiple words

Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register

polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)

syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood

ISI Kolkata ML for MT Jan 6 2014 73

Estimate of corpora requirement

Assume that on an average a sentence is 10 words long

an additional 9 translation pairs for getting at one of the 5 mappings

10 sentences per mapping per word

a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences

Estimate is not wide off the mark

Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs

ISI Kolkata ML for MT Jan 6 2014 74

Our work on factor based SMT

Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009

Jan 6 2014 ISI Kolkata ML for MT 75

Case Marker and Morphology crucial in E-H MT

Order of magnitiude facelift in Fluency and fidelity

Determined by the combination of suffixes and semantic relations on the English side

Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers

Jan 6 2014 ISI Kolkata ML for MT 76

Semantic relations+SuffixesCase

Markers+inflections

I ate mangoes

I ltagt ate eatpast mangoes ltobj

I ltagt mangoes ltobjpl eatpast

mei_ne aam khaa_yaa

Jan 6 2014 ISI Kolkata ML for MT 77

Our Approach

Factored model (Koehn and Hoang 2007) with the following translation factor

suffix + semantic relation case markersuffix

Experiments with the following relations

Dependency relations from the stanford parser

Deeper semantic roles from Universal Networking Language (UNL)

Jan 6 2014 ISI Kolkata ML for MT 78

Our Factorization

Jan 6 2014 ISI Kolkata ML for MT 79

Experiments

Jan 6 2014 ISI Kolkata ML for MT 80

Corpus Statistics

Jan 6 2014 ISI Kolkata ML for MT 81

Results The impact of suffix and semantic factors

Jan 6 2014 ISI Kolkata ML for MT 82

Results The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 83

Subjective Evaluation The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 84

Impact of sentence length (F Fluency AAdequacy E Errors)

Jan 6 2014 ISI Kolkata ML for MT 85

A feel for the improvement-baseline

Jan 6 2014 ISI Kolkata ML for MT 86

A feel for the improvement-reorder

Jan 6 2014 ISI Kolkata ML for MT 87

A feel for the improvement-Semantic relation

Jan 6 2014 ISI Kolkata ML for MT 88

A recent study

PAN Indian SMT

Jan 6 2014 ISI Kolkata ML for MT 89

Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator

SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi

Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English

Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences

Evaluation with BLEU METEOR scores also show high correlation with BLEU

Jan 6 2014 ISI Kolkata ML for MT 90

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Target Sentence Generation from interlingua

Lexical Transfer

Target Sentence

Generation

Syntax

Planning

Morphological

Synthesis

(WordPhrase

Translation ) (Word form

Generation) (Sequence)

Jan 6 2014 ISI Kolkata ML for MT 26

Generation Architecture Deconversion = Transfer + Generation

Jan 6 2014 ISI Kolkata ML for MT 27

Transfer Based MT

Marathi-Hindi

Jan 6 2014 ISI Kolkata ML for MT 28

Indian Language to Indian Language Machine Translation (ILILMT)

Bidirectional Machine Translation System

Developed for nine Indian language pairs

Approach

Transfer based

Modules developed using both rule based and statistical approach

Jan 6 2014 ISI Kolkata ML for MT 29

Architecture of ILILMT System

Morphological Analyzer

Source Text

POS Tagger

Chunker

Vibhakti Computation

Name Entity Recognizer

Word Sense Disambiguatio

n

Lexical Transfer

Agreement Feature

Interchunk

Word Generator

Intrachunk

Target Text

Analysis

Transfer

Generation

Jan 6 2014 ISI Kolkata ML for MT 30

M-H MT system Evaluation

Subjective evaluation based on machine translation quality

Accuracy calculated based on score given by linguists

S5 Number of score 5 Sentences S4 Number of score 4 sentences

S3 Number of score 3 sentences N Total Number of sentences

Accuracy =

Score 5 Correct Translation

Score 4 Understandable with

minor errors

Score 3 Understandable with

major errors

Score 2 Not Understandable

Score 1 Non sense translation

Jan 6 2014 ISI Kolkata ML for MT 31

Evaluation of Marathi to Hindi MT System

0

02

04

06

08

1

12

MorphAnalyzer

POS Tagger Chunker VibhaktiCompute

WSD LexicalTransfer

WordGenerator

Precision

Recall

Module-wise precision and recall

Jan 6 2014 ISI Kolkata ML for MT 32

Evaluation of Marathi to Hindi MT System (cont)

Subjective evaluation on translation quality Evaluated on 500 web sentences Accuracy calculated based on score given according to the

translation quality Accuracy 6532

Result analysis

Morph POS tagger chunker gives more than 90 precision but Transfer WSD generator modules are below 80 hence degrades MT quality

Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Jan 6 2014 ISI Kolkata ML for MT 33

Statistical Machine Translation

Jan 6 2014 ISI Kolkata ML for MT 34

Czeck-English data

[nesu] ldquoI carryrdquo

[ponese] ldquoHe will carryrdquo

[nese] ldquoHe carriesrdquo

[nesou] ldquoThey carryrdquo

[yedu] ldquoI driverdquo

[plavou] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 35

To translate hellip

I will carry

They drive

He swims

They will drive

Jan 6 2014 ISI Kolkata ML for MT 36

Hindi-English data

[DhotA huM] ldquoI carryrdquo

[DhoegA] ldquoHe will carryrdquo

[DhotA hAi] ldquoHe carriesrdquo

[Dhote hAi] ldquoThey carryrdquo

[chalAtA huM] ldquoI driverdquo

[tErte hEM] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 37

Bangla-English data

[bai] ldquoI carryrdquo

[baibe] ldquoHe will carryrdquo

[bay] ldquoHe carriesrdquo

[bay] ldquoThey carryrdquo

[chAlAi] ldquoI driverdquo

[sAMtrAy] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 38

To translate hellip (repeated)

I will carry

They drive

He swims

They will drive

Jan 6 2014 ISI Kolkata ML for MT 39

Foundation

Data driven approach Goal is to find out the English sentence e

given foreign language sentence f whose p(e|f) is maximum

Translations are generated on the basis of statistical model

Parameters are estimated using bilingual parallel corpora

Jan 6 2014 ISI Kolkata ML for MT 40

SMT Language Model

To detect good English sentences

Probability of an English sentence w1w2 helliphellip wn can be written as

Pr(w1w2 helliphellip wn) = Pr(w1) Pr(w2|w1) Pr(wn|w1 w2 wn-1)

Here Pr(wn|w1 w2 wn-1) is the probability that word wn follows word string w1 w2 wn-1 N-gram model probability

Trigram model probability calculation

Jan 6 2014 ISI Kolkata ML for MT 41

SMT Translation Model

P(f|e) Probability of some f given hypothesis English translation e

How to assign the values to p(e|f)

Sentences are infinite not possible to find pair(ef) for all sentences

Introduce a hidden variable a that represents alignments between the individual words in the sentence pair

Sentence level

Word level

Jan 6 2014 ISI Kolkata ML for MT 42

Alignment

If the string e= e1l= e1 e2 hellipel has l words and

the string f= f1m=f1f2fm has m words

then the alignment a can be represented by a series a1

m= a1a2am of m values each between 0 and l such that if the word in position j of the f-string is connected to the word in position i of the e-string then aj= i and

if it is not connected to any English word then aj= O

Jan 6 2014 ISI Kolkata ML for MT 43

Example of alignment

English Ram went to school

Hindi Raama paathashaalaa gayaa

Ram went to school

ltNullgt Raama paathashaalaa gayaa

Jan 6 2014 ISI Kolkata ML for MT 44

Translation Model Exact expression

Five models for estimating parameters in the expression [2]

Model-1 Model-2 Model-3 Model-4 Model-5

Choose alignment given e and m

Choose the identity of foreign word given e m a

Choose the length of foreign language string given e

Jan 6 2014 ISI Kolkata ML for MT 45

a

eafef )|Pr()|Pr(

m

emafeaf )|Pr()|Pr(

m

emafememaf )|Pr()|Pr()|Pr(

m

emafem )|Pr()|Pr(

m

m

j

jj

jj emfaafem1

1

1

1

1 )|Pr()|Pr(

m

j

jj

j

jj

j

m

emfafemfaaem1

1

11

1

1

1

1 )|Pr()|Pr()|Pr(

)|Pr( emaf )|Pr( em

m

j

jj

j

jj

j emfafemfaa1

1

11

1

1

1

1 )|Pr()|Pr(

Proof of Translation Model Exact

expression

m is fixed for a particular f hence

marginalization

marginalization

Jan 6 2014 ISI Kolkata ML for MT 46

Alignment

Jan 6 2014 ISI Kolkata ML for MT 47

Fundamental and ubiquitous

Spell checking

Translation

Transliteration

Speech to text

Text to speeh

Jan 6 2014 ISI Kolkata ML for MT 48

EM for word alignment from

sentence alignment example

English

(1) three rabbits

a b

(2) rabbits of Grenoble

b c d

French

(1) trois lapins

w x

(2) lapins de Grenoble

x y z

Jan 6 2014 ISI Kolkata ML for MT 49

Initial Probabilities

each cell denotes t(a w) t(a x) etc

a b c d

w 14 14 14 14

x 14 14 14 14

y 14 14 14 14

z 14 14 14 14

The counts in IBM Model 1

Works by maximizing P(f|e) over the entire corpus

For IBM Model 1 we get the following relationship

c (w f |w e f e ) =t (w f |w e )

t (w f |we

0 ) +hellip + t (w f |wel )

c (w f |w e f e ) is the fractional count of the alignment of w f

with w e in f and e

t (w f |w e ) is the probability of w f being the translation of w e

is the count of w f in f

is the count of w e in e

Jan 6 2014 ISI Kolkata ML for MT 51

Example of expected count

C[aw (a b)(w x)]

t(aw)

= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)

t(aw)+t(ax)

14

= ----------------- X 1 X 1= 12

14+14

Jan 6 2014 ISI Kolkata ML for MT 52

ldquocountsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 13 13 13

y 0 13 13 13

z 0 13 13 13

a b

w x

a b c d

w 12 12 0 0

x 12 12 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 53

Revised probability example

trevised(a w)

12

= -------------------------------------------------------------------

(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)

Jan 6 2014 ISI Kolkata ML for MT 54

Revised probabilities table

a b c d

w 12 14 0 0

x 12 512 13 13

y 0 16 13 13

z 0 16 13 13

ldquorevised countsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 59 13 13

y 0 29 13 13

z 0 29 13 13

a b

w x

a b c d

w 12 38 0 0

x 12 58 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 56

Re-Revised probabilities table a b c d

w 12 316 0 0

x 12 85144 13 13

y 0 19 13 13

z 0 19 13 13

Continue until convergence notice that (bx) binding gets progressively stronger

b=rabbits x=lapins

Derivation of EM based Alignment Expressions

Hindi)(Say language ofy vocabular

English)(Say language ofry vocalbula

2

1

LV

LV

F

E

what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet

जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet

E1

F1

E2

F2

Jan 6 2014 ISI Kolkata ML for MT 58

Vocabulary mapping

Vocabulary VE VF

what is in a name that which we call rose by any other will smell as sweet

naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii

Jan 6 2014 ISI Kolkata ML for MT 59

Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like

11989011 11989012 hellip 119890

11198971 119891

11 11989112 hellip 119891

11198981

11989021 11989022 hellip 119890

21198972 119891

21 11989122 hellip 119891

21198982

1198901199041 119890

1199042 hellip 119890

119904119897119904 119891

1199041 1198911199042 hellip 119891

119904119898119904

1198901198781 119890

1198782 hellip 119890

119878119897119878 119891

1198781 1198911198782 hellip 119891

119878119898119878

No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890

119904119901 =Index of English word 119890119904119901in English vocabularydictionary

119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary

(Thanks to Sachin Pawar for helping with the maths formulae processing)

Jan 6 2014 ISI Kolkata ML for MT 60

Hidden variables and parameters

Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878

119904=1 where each hidden variable is

as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French

word 119911119901119902119904 = 0 otherwise

Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893

word in French vocabulary

Jan 6 2014 ISI Kolkata ML for MT 61

Likelihoods Data Likelihood L(D Θ)

Data Log-Likelihood LL(D Θ)

Expected value of Data Log-Likelihood E(LL(D Θ))

Jan 6 2014 ISI Kolkata ML for MT 62

Constraint and Lagrangian

119875119894119895

119881119865

119895=1

= 1 forall119894

Jan 6 2014 ISI Kolkata ML for MT 63

Differentiating wrt Pij

Jan 6 2014 ISI Kolkata ML for MT 64

Final E and M steps

M-step

E-step

Jan 6 2014 ISI Kolkata ML for MT 65

Combinatorial considerations

Jan 6 2014 ISI Kolkata ML for MT 66

Example

ISI Kolkata ML for MT Jan 6 2014 67

All possible alignments

ISI Kolkata ML for MT Jan 6 2014 68

First fundamental requirement of SMT

Alignment requires evidence of

bull firstly a translation pair to introduce the

POSSIBILITY of a mapping

bull then another pair to establish with

CERTAINTY the mapping

ISI Kolkata ML for MT Jan 6 2014 69

For the ldquocertaintyrdquo

We have a translation pair containing alignment candidates and none of the other words in the translation pair

OR

We have a translation pair containing all words in the translation pair except the alignment candidates

ISI Kolkata ML for MT Jan 6 2014 70

Thereforehellip

If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty

ISI Kolkata ML for MT Jan 6 2014 71

Rough estimate of data requirement

SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge

ie no meanings or grammatical properties of any words phrases or sentences

Each language has a vocabulary of 100000 words can give rise to about 500000 word forms

through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟

bdquowent‟ and bdquogone‟

ISI Kolkata ML for MT Jan 6 2014 72

Reasons for mapping to multiple words

Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register

polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)

syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood

ISI Kolkata ML for MT Jan 6 2014 73

Estimate of corpora requirement

Assume that on an average a sentence is 10 words long

an additional 9 translation pairs for getting at one of the 5 mappings

10 sentences per mapping per word

a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences

Estimate is not wide off the mark

Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs

ISI Kolkata ML for MT Jan 6 2014 74

Our work on factor based SMT

Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009

Jan 6 2014 ISI Kolkata ML for MT 75

Case Marker and Morphology crucial in E-H MT

Order of magnitiude facelift in Fluency and fidelity

Determined by the combination of suffixes and semantic relations on the English side

Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers

Jan 6 2014 ISI Kolkata ML for MT 76

Semantic relations+SuffixesCase

Markers+inflections

I ate mangoes

I ltagt ate eatpast mangoes ltobj

I ltagt mangoes ltobjpl eatpast

mei_ne aam khaa_yaa

Jan 6 2014 ISI Kolkata ML for MT 77

Our Approach

Factored model (Koehn and Hoang 2007) with the following translation factor

suffix + semantic relation case markersuffix

Experiments with the following relations

Dependency relations from the stanford parser

Deeper semantic roles from Universal Networking Language (UNL)

Jan 6 2014 ISI Kolkata ML for MT 78

Our Factorization

Jan 6 2014 ISI Kolkata ML for MT 79

Experiments

Jan 6 2014 ISI Kolkata ML for MT 80

Corpus Statistics

Jan 6 2014 ISI Kolkata ML for MT 81

Results The impact of suffix and semantic factors

Jan 6 2014 ISI Kolkata ML for MT 82

Results The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 83

Subjective Evaluation The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 84

Impact of sentence length (F Fluency AAdequacy E Errors)

Jan 6 2014 ISI Kolkata ML for MT 85

A feel for the improvement-baseline

Jan 6 2014 ISI Kolkata ML for MT 86

A feel for the improvement-reorder

Jan 6 2014 ISI Kolkata ML for MT 87

A feel for the improvement-Semantic relation

Jan 6 2014 ISI Kolkata ML for MT 88

A recent study

PAN Indian SMT

Jan 6 2014 ISI Kolkata ML for MT 89

Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator

SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi

Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English

Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences

Evaluation with BLEU METEOR scores also show high correlation with BLEU

Jan 6 2014 ISI Kolkata ML for MT 90

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Generation Architecture Deconversion = Transfer + Generation

Jan 6 2014 ISI Kolkata ML for MT 27

Transfer Based MT

Marathi-Hindi

Jan 6 2014 ISI Kolkata ML for MT 28

Indian Language to Indian Language Machine Translation (ILILMT)

Bidirectional Machine Translation System

Developed for nine Indian language pairs

Approach

Transfer based

Modules developed using both rule based and statistical approach

Jan 6 2014 ISI Kolkata ML for MT 29

Architecture of ILILMT System

Morphological Analyzer

Source Text

POS Tagger

Chunker

Vibhakti Computation

Name Entity Recognizer

Word Sense Disambiguatio

n

Lexical Transfer

Agreement Feature

Interchunk

Word Generator

Intrachunk

Target Text

Analysis

Transfer

Generation

Jan 6 2014 ISI Kolkata ML for MT 30

M-H MT system Evaluation

Subjective evaluation based on machine translation quality

Accuracy calculated based on score given by linguists

S5 Number of score 5 Sentences S4 Number of score 4 sentences

S3 Number of score 3 sentences N Total Number of sentences

Accuracy =

Score 5 Correct Translation

Score 4 Understandable with

minor errors

Score 3 Understandable with

major errors

Score 2 Not Understandable

Score 1 Non sense translation

Jan 6 2014 ISI Kolkata ML for MT 31

Evaluation of Marathi to Hindi MT System

0

02

04

06

08

1

12

MorphAnalyzer

POS Tagger Chunker VibhaktiCompute

WSD LexicalTransfer

WordGenerator

Precision

Recall

Module-wise precision and recall

Jan 6 2014 ISI Kolkata ML for MT 32

Evaluation of Marathi to Hindi MT System (cont)

Subjective evaluation on translation quality Evaluated on 500 web sentences Accuracy calculated based on score given according to the

translation quality Accuracy 6532

Result analysis

Morph POS tagger chunker gives more than 90 precision but Transfer WSD generator modules are below 80 hence degrades MT quality

Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Jan 6 2014 ISI Kolkata ML for MT 33

Statistical Machine Translation

Jan 6 2014 ISI Kolkata ML for MT 34

Czeck-English data

[nesu] ldquoI carryrdquo

[ponese] ldquoHe will carryrdquo

[nese] ldquoHe carriesrdquo

[nesou] ldquoThey carryrdquo

[yedu] ldquoI driverdquo

[plavou] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 35

To translate hellip

I will carry

They drive

He swims

They will drive

Jan 6 2014 ISI Kolkata ML for MT 36

Hindi-English data

[DhotA huM] ldquoI carryrdquo

[DhoegA] ldquoHe will carryrdquo

[DhotA hAi] ldquoHe carriesrdquo

[Dhote hAi] ldquoThey carryrdquo

[chalAtA huM] ldquoI driverdquo

[tErte hEM] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 37

Bangla-English data

[bai] ldquoI carryrdquo

[baibe] ldquoHe will carryrdquo

[bay] ldquoHe carriesrdquo

[bay] ldquoThey carryrdquo

[chAlAi] ldquoI driverdquo

[sAMtrAy] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 38

To translate hellip (repeated)

I will carry

They drive

He swims

They will drive

Jan 6 2014 ISI Kolkata ML for MT 39

Foundation

Data driven approach Goal is to find out the English sentence e

given foreign language sentence f whose p(e|f) is maximum

Translations are generated on the basis of statistical model

Parameters are estimated using bilingual parallel corpora

Jan 6 2014 ISI Kolkata ML for MT 40

SMT Language Model

To detect good English sentences

Probability of an English sentence w1w2 helliphellip wn can be written as

Pr(w1w2 helliphellip wn) = Pr(w1) Pr(w2|w1) Pr(wn|w1 w2 wn-1)

Here Pr(wn|w1 w2 wn-1) is the probability that word wn follows word string w1 w2 wn-1 N-gram model probability

Trigram model probability calculation

Jan 6 2014 ISI Kolkata ML for MT 41

SMT Translation Model

P(f|e) Probability of some f given hypothesis English translation e

How to assign the values to p(e|f)

Sentences are infinite not possible to find pair(ef) for all sentences

Introduce a hidden variable a that represents alignments between the individual words in the sentence pair

Sentence level

Word level

Jan 6 2014 ISI Kolkata ML for MT 42

Alignment

If the string e= e1l= e1 e2 hellipel has l words and

the string f= f1m=f1f2fm has m words

then the alignment a can be represented by a series a1

m= a1a2am of m values each between 0 and l such that if the word in position j of the f-string is connected to the word in position i of the e-string then aj= i and

if it is not connected to any English word then aj= O

Jan 6 2014 ISI Kolkata ML for MT 43

Example of alignment

English Ram went to school

Hindi Raama paathashaalaa gayaa

Ram went to school

ltNullgt Raama paathashaalaa gayaa

Jan 6 2014 ISI Kolkata ML for MT 44

Translation Model Exact expression

Five models for estimating parameters in the expression [2]

Model-1 Model-2 Model-3 Model-4 Model-5

Choose alignment given e and m

Choose the identity of foreign word given e m a

Choose the length of foreign language string given e

Jan 6 2014 ISI Kolkata ML for MT 45

a

eafef )|Pr()|Pr(

m

emafeaf )|Pr()|Pr(

m

emafememaf )|Pr()|Pr()|Pr(

m

emafem )|Pr()|Pr(

m

m

j

jj

jj emfaafem1

1

1

1

1 )|Pr()|Pr(

m

j

jj

j

jj

j

m

emfafemfaaem1

1

11

1

1

1

1 )|Pr()|Pr()|Pr(

)|Pr( emaf )|Pr( em

m

j

jj

j

jj

j emfafemfaa1

1

11

1

1

1

1 )|Pr()|Pr(

Proof of Translation Model Exact

expression

m is fixed for a particular f hence

marginalization

marginalization

Jan 6 2014 ISI Kolkata ML for MT 46

Alignment

Jan 6 2014 ISI Kolkata ML for MT 47

Fundamental and ubiquitous

Spell checking

Translation

Transliteration

Speech to text

Text to speeh

Jan 6 2014 ISI Kolkata ML for MT 48

EM for word alignment from

sentence alignment example

English

(1) three rabbits

a b

(2) rabbits of Grenoble

b c d

French

(1) trois lapins

w x

(2) lapins de Grenoble

x y z

Jan 6 2014 ISI Kolkata ML for MT 49

Initial Probabilities

each cell denotes t(a w) t(a x) etc

a b c d

w 14 14 14 14

x 14 14 14 14

y 14 14 14 14

z 14 14 14 14

The counts in IBM Model 1

Works by maximizing P(f|e) over the entire corpus

For IBM Model 1 we get the following relationship

c (w f |w e f e ) =t (w f |w e )

t (w f |we

0 ) +hellip + t (w f |wel )

c (w f |w e f e ) is the fractional count of the alignment of w f

with w e in f and e

t (w f |w e ) is the probability of w f being the translation of w e

is the count of w f in f

is the count of w e in e

Jan 6 2014 ISI Kolkata ML for MT 51

Example of expected count

C[aw (a b)(w x)]

t(aw)

= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)

t(aw)+t(ax)

14

= ----------------- X 1 X 1= 12

14+14

Jan 6 2014 ISI Kolkata ML for MT 52

ldquocountsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 13 13 13

y 0 13 13 13

z 0 13 13 13

a b

w x

a b c d

w 12 12 0 0

x 12 12 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 53

Revised probability example

trevised(a w)

12

= -------------------------------------------------------------------

(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)

Jan 6 2014 ISI Kolkata ML for MT 54

Revised probabilities table

a b c d

w 12 14 0 0

x 12 512 13 13

y 0 16 13 13

z 0 16 13 13

ldquorevised countsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 59 13 13

y 0 29 13 13

z 0 29 13 13

a b

w x

a b c d

w 12 38 0 0

x 12 58 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 56

Re-Revised probabilities table a b c d

w 12 316 0 0

x 12 85144 13 13

y 0 19 13 13

z 0 19 13 13

Continue until convergence notice that (bx) binding gets progressively stronger

b=rabbits x=lapins

Derivation of EM based Alignment Expressions

Hindi)(Say language ofy vocabular

English)(Say language ofry vocalbula

2

1

LV

LV

F

E

what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet

जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet

E1

F1

E2

F2

Jan 6 2014 ISI Kolkata ML for MT 58

Vocabulary mapping

Vocabulary VE VF

what is in a name that which we call rose by any other will smell as sweet

naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii

Jan 6 2014 ISI Kolkata ML for MT 59

Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like

11989011 11989012 hellip 119890

11198971 119891

11 11989112 hellip 119891

11198981

11989021 11989022 hellip 119890

21198972 119891

21 11989122 hellip 119891

21198982

1198901199041 119890

1199042 hellip 119890

119904119897119904 119891

1199041 1198911199042 hellip 119891

119904119898119904

1198901198781 119890

1198782 hellip 119890

119878119897119878 119891

1198781 1198911198782 hellip 119891

119878119898119878

No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890

119904119901 =Index of English word 119890119904119901in English vocabularydictionary

119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary

(Thanks to Sachin Pawar for helping with the maths formulae processing)

Jan 6 2014 ISI Kolkata ML for MT 60

Hidden variables and parameters

Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878

119904=1 where each hidden variable is

as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French

word 119911119901119902119904 = 0 otherwise

Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893

word in French vocabulary

Jan 6 2014 ISI Kolkata ML for MT 61

Likelihoods Data Likelihood L(D Θ)

Data Log-Likelihood LL(D Θ)

Expected value of Data Log-Likelihood E(LL(D Θ))

Jan 6 2014 ISI Kolkata ML for MT 62

Constraint and Lagrangian

119875119894119895

119881119865

119895=1

= 1 forall119894

Jan 6 2014 ISI Kolkata ML for MT 63

Differentiating wrt Pij

Jan 6 2014 ISI Kolkata ML for MT 64

Final E and M steps

M-step

E-step

Jan 6 2014 ISI Kolkata ML for MT 65

Combinatorial considerations

Jan 6 2014 ISI Kolkata ML for MT 66

Example

ISI Kolkata ML for MT Jan 6 2014 67

All possible alignments

ISI Kolkata ML for MT Jan 6 2014 68

First fundamental requirement of SMT

Alignment requires evidence of

bull firstly a translation pair to introduce the

POSSIBILITY of a mapping

bull then another pair to establish with

CERTAINTY the mapping

ISI Kolkata ML for MT Jan 6 2014 69

For the ldquocertaintyrdquo

We have a translation pair containing alignment candidates and none of the other words in the translation pair

OR

We have a translation pair containing all words in the translation pair except the alignment candidates

ISI Kolkata ML for MT Jan 6 2014 70

Thereforehellip

If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty

ISI Kolkata ML for MT Jan 6 2014 71

Rough estimate of data requirement

SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge

ie no meanings or grammatical properties of any words phrases or sentences

Each language has a vocabulary of 100000 words can give rise to about 500000 word forms

through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟

bdquowent‟ and bdquogone‟

ISI Kolkata ML for MT Jan 6 2014 72

Reasons for mapping to multiple words

Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register

polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)

syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood

ISI Kolkata ML for MT Jan 6 2014 73

Estimate of corpora requirement

Assume that on an average a sentence is 10 words long

an additional 9 translation pairs for getting at one of the 5 mappings

10 sentences per mapping per word

a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences

Estimate is not wide off the mark

Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs

ISI Kolkata ML for MT Jan 6 2014 74

Our work on factor based SMT

Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009

Jan 6 2014 ISI Kolkata ML for MT 75

Case Marker and Morphology crucial in E-H MT

Order of magnitiude facelift in Fluency and fidelity

Determined by the combination of suffixes and semantic relations on the English side

Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers

Jan 6 2014 ISI Kolkata ML for MT 76

Semantic relations+SuffixesCase

Markers+inflections

I ate mangoes

I ltagt ate eatpast mangoes ltobj

I ltagt mangoes ltobjpl eatpast

mei_ne aam khaa_yaa

Jan 6 2014 ISI Kolkata ML for MT 77

Our Approach

Factored model (Koehn and Hoang 2007) with the following translation factor

suffix + semantic relation case markersuffix

Experiments with the following relations

Dependency relations from the stanford parser

Deeper semantic roles from Universal Networking Language (UNL)

Jan 6 2014 ISI Kolkata ML for MT 78

Our Factorization

Jan 6 2014 ISI Kolkata ML for MT 79

Experiments

Jan 6 2014 ISI Kolkata ML for MT 80

Corpus Statistics

Jan 6 2014 ISI Kolkata ML for MT 81

Results The impact of suffix and semantic factors

Jan 6 2014 ISI Kolkata ML for MT 82

Results The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 83

Subjective Evaluation The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 84

Impact of sentence length (F Fluency AAdequacy E Errors)

Jan 6 2014 ISI Kolkata ML for MT 85

A feel for the improvement-baseline

Jan 6 2014 ISI Kolkata ML for MT 86

A feel for the improvement-reorder

Jan 6 2014 ISI Kolkata ML for MT 87

A feel for the improvement-Semantic relation

Jan 6 2014 ISI Kolkata ML for MT 88

A recent study

PAN Indian SMT

Jan 6 2014 ISI Kolkata ML for MT 89

Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator

SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi

Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English

Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences

Evaluation with BLEU METEOR scores also show high correlation with BLEU

Jan 6 2014 ISI Kolkata ML for MT 90

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Transfer Based MT

Marathi-Hindi

Jan 6 2014 ISI Kolkata ML for MT 28

Indian Language to Indian Language Machine Translation (ILILMT)

Bidirectional Machine Translation System

Developed for nine Indian language pairs

Approach

Transfer based

Modules developed using both rule based and statistical approach

Jan 6 2014 ISI Kolkata ML for MT 29

Architecture of ILILMT System

Morphological Analyzer

Source Text

POS Tagger

Chunker

Vibhakti Computation

Name Entity Recognizer

Word Sense Disambiguatio

n

Lexical Transfer

Agreement Feature

Interchunk

Word Generator

Intrachunk

Target Text

Analysis

Transfer

Generation

Jan 6 2014 ISI Kolkata ML for MT 30

M-H MT system Evaluation

Subjective evaluation based on machine translation quality

Accuracy calculated based on score given by linguists

S5 Number of score 5 Sentences S4 Number of score 4 sentences

S3 Number of score 3 sentences N Total Number of sentences

Accuracy =

Score 5 Correct Translation

Score 4 Understandable with

minor errors

Score 3 Understandable with

major errors

Score 2 Not Understandable

Score 1 Non sense translation

Jan 6 2014 ISI Kolkata ML for MT 31

Evaluation of Marathi to Hindi MT System

0

02

04

06

08

1

12

MorphAnalyzer

POS Tagger Chunker VibhaktiCompute

WSD LexicalTransfer

WordGenerator

Precision

Recall

Module-wise precision and recall

Jan 6 2014 ISI Kolkata ML for MT 32

Evaluation of Marathi to Hindi MT System (cont)

Subjective evaluation on translation quality Evaluated on 500 web sentences Accuracy calculated based on score given according to the

translation quality Accuracy 6532

Result analysis

Morph POS tagger chunker gives more than 90 precision but Transfer WSD generator modules are below 80 hence degrades MT quality

Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Jan 6 2014 ISI Kolkata ML for MT 33

Statistical Machine Translation

Jan 6 2014 ISI Kolkata ML for MT 34

Czeck-English data

[nesu] ldquoI carryrdquo

[ponese] ldquoHe will carryrdquo

[nese] ldquoHe carriesrdquo

[nesou] ldquoThey carryrdquo

[yedu] ldquoI driverdquo

[plavou] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 35

To translate hellip

I will carry

They drive

He swims

They will drive

Jan 6 2014 ISI Kolkata ML for MT 36

Hindi-English data

[DhotA huM] ldquoI carryrdquo

[DhoegA] ldquoHe will carryrdquo

[DhotA hAi] ldquoHe carriesrdquo

[Dhote hAi] ldquoThey carryrdquo

[chalAtA huM] ldquoI driverdquo

[tErte hEM] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 37

Bangla-English data

[bai] ldquoI carryrdquo

[baibe] ldquoHe will carryrdquo

[bay] ldquoHe carriesrdquo

[bay] ldquoThey carryrdquo

[chAlAi] ldquoI driverdquo

[sAMtrAy] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 38

To translate hellip (repeated)

I will carry

They drive

He swims

They will drive

Jan 6 2014 ISI Kolkata ML for MT 39

Foundation

Data driven approach Goal is to find out the English sentence e

given foreign language sentence f whose p(e|f) is maximum

Translations are generated on the basis of statistical model

Parameters are estimated using bilingual parallel corpora

Jan 6 2014 ISI Kolkata ML for MT 40

SMT Language Model

To detect good English sentences

Probability of an English sentence w1w2 helliphellip wn can be written as

Pr(w1w2 helliphellip wn) = Pr(w1) Pr(w2|w1) Pr(wn|w1 w2 wn-1)

Here Pr(wn|w1 w2 wn-1) is the probability that word wn follows word string w1 w2 wn-1 N-gram model probability

Trigram model probability calculation

Jan 6 2014 ISI Kolkata ML for MT 41

SMT Translation Model

P(f|e) Probability of some f given hypothesis English translation e

How to assign the values to p(e|f)

Sentences are infinite not possible to find pair(ef) for all sentences

Introduce a hidden variable a that represents alignments between the individual words in the sentence pair

Sentence level

Word level

Jan 6 2014 ISI Kolkata ML for MT 42

Alignment

If the string e= e1l= e1 e2 hellipel has l words and

the string f= f1m=f1f2fm has m words

then the alignment a can be represented by a series a1

m= a1a2am of m values each between 0 and l such that if the word in position j of the f-string is connected to the word in position i of the e-string then aj= i and

if it is not connected to any English word then aj= O

Jan 6 2014 ISI Kolkata ML for MT 43

Example of alignment

English Ram went to school

Hindi Raama paathashaalaa gayaa

Ram went to school

ltNullgt Raama paathashaalaa gayaa

Jan 6 2014 ISI Kolkata ML for MT 44

Translation Model Exact expression

Five models for estimating parameters in the expression [2]

Model-1 Model-2 Model-3 Model-4 Model-5

Choose alignment given e and m

Choose the identity of foreign word given e m a

Choose the length of foreign language string given e

Jan 6 2014 ISI Kolkata ML for MT 45

a

eafef )|Pr()|Pr(

m

emafeaf )|Pr()|Pr(

m

emafememaf )|Pr()|Pr()|Pr(

m

emafem )|Pr()|Pr(

m

m

j

jj

jj emfaafem1

1

1

1

1 )|Pr()|Pr(

m

j

jj

j

jj

j

m

emfafemfaaem1

1

11

1

1

1

1 )|Pr()|Pr()|Pr(

)|Pr( emaf )|Pr( em

m

j

jj

j

jj

j emfafemfaa1

1

11

1

1

1

1 )|Pr()|Pr(

Proof of Translation Model Exact

expression

m is fixed for a particular f hence

marginalization

marginalization

Jan 6 2014 ISI Kolkata ML for MT 46

Alignment

Jan 6 2014 ISI Kolkata ML for MT 47

Fundamental and ubiquitous

Spell checking

Translation

Transliteration

Speech to text

Text to speeh

Jan 6 2014 ISI Kolkata ML for MT 48

EM for word alignment from

sentence alignment example

English

(1) three rabbits

a b

(2) rabbits of Grenoble

b c d

French

(1) trois lapins

w x

(2) lapins de Grenoble

x y z

Jan 6 2014 ISI Kolkata ML for MT 49

Initial Probabilities

each cell denotes t(a w) t(a x) etc

a b c d

w 14 14 14 14

x 14 14 14 14

y 14 14 14 14

z 14 14 14 14

The counts in IBM Model 1

Works by maximizing P(f|e) over the entire corpus

For IBM Model 1 we get the following relationship

c (w f |w e f e ) =t (w f |w e )

t (w f |we

0 ) +hellip + t (w f |wel )

c (w f |w e f e ) is the fractional count of the alignment of w f

with w e in f and e

t (w f |w e ) is the probability of w f being the translation of w e

is the count of w f in f

is the count of w e in e

Jan 6 2014 ISI Kolkata ML for MT 51

Example of expected count

C[aw (a b)(w x)]

t(aw)

= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)

t(aw)+t(ax)

14

= ----------------- X 1 X 1= 12

14+14

Jan 6 2014 ISI Kolkata ML for MT 52

ldquocountsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 13 13 13

y 0 13 13 13

z 0 13 13 13

a b

w x

a b c d

w 12 12 0 0

x 12 12 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 53

Revised probability example

trevised(a w)

12

= -------------------------------------------------------------------

(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)

Jan 6 2014 ISI Kolkata ML for MT 54

Revised probabilities table

a b c d

w 12 14 0 0

x 12 512 13 13

y 0 16 13 13

z 0 16 13 13

ldquorevised countsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 59 13 13

y 0 29 13 13

z 0 29 13 13

a b

w x

a b c d

w 12 38 0 0

x 12 58 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 56

Re-Revised probabilities table a b c d

w 12 316 0 0

x 12 85144 13 13

y 0 19 13 13

z 0 19 13 13

Continue until convergence notice that (bx) binding gets progressively stronger

b=rabbits x=lapins

Derivation of EM based Alignment Expressions

Hindi)(Say language ofy vocabular

English)(Say language ofry vocalbula

2

1

LV

LV

F

E

what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet

जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet

E1

F1

E2

F2

Jan 6 2014 ISI Kolkata ML for MT 58

Vocabulary mapping

Vocabulary VE VF

what is in a name that which we call rose by any other will smell as sweet

naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii

Jan 6 2014 ISI Kolkata ML for MT 59

Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like

11989011 11989012 hellip 119890

11198971 119891

11 11989112 hellip 119891

11198981

11989021 11989022 hellip 119890

21198972 119891

21 11989122 hellip 119891

21198982

1198901199041 119890

1199042 hellip 119890

119904119897119904 119891

1199041 1198911199042 hellip 119891

119904119898119904

1198901198781 119890

1198782 hellip 119890

119878119897119878 119891

1198781 1198911198782 hellip 119891

119878119898119878

No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890

119904119901 =Index of English word 119890119904119901in English vocabularydictionary

119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary

(Thanks to Sachin Pawar for helping with the maths formulae processing)

Jan 6 2014 ISI Kolkata ML for MT 60

Hidden variables and parameters

Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878

119904=1 where each hidden variable is

as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French

word 119911119901119902119904 = 0 otherwise

Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893

word in French vocabulary

Jan 6 2014 ISI Kolkata ML for MT 61

Likelihoods Data Likelihood L(D Θ)

Data Log-Likelihood LL(D Θ)

Expected value of Data Log-Likelihood E(LL(D Θ))

Jan 6 2014 ISI Kolkata ML for MT 62

Constraint and Lagrangian

119875119894119895

119881119865

119895=1

= 1 forall119894

Jan 6 2014 ISI Kolkata ML for MT 63

Differentiating wrt Pij

Jan 6 2014 ISI Kolkata ML for MT 64

Final E and M steps

M-step

E-step

Jan 6 2014 ISI Kolkata ML for MT 65

Combinatorial considerations

Jan 6 2014 ISI Kolkata ML for MT 66

Example

ISI Kolkata ML for MT Jan 6 2014 67

All possible alignments

ISI Kolkata ML for MT Jan 6 2014 68

First fundamental requirement of SMT

Alignment requires evidence of

bull firstly a translation pair to introduce the

POSSIBILITY of a mapping

bull then another pair to establish with

CERTAINTY the mapping

ISI Kolkata ML for MT Jan 6 2014 69

For the ldquocertaintyrdquo

We have a translation pair containing alignment candidates and none of the other words in the translation pair

OR

We have a translation pair containing all words in the translation pair except the alignment candidates

ISI Kolkata ML for MT Jan 6 2014 70

Thereforehellip

If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty

ISI Kolkata ML for MT Jan 6 2014 71

Rough estimate of data requirement

SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge

ie no meanings or grammatical properties of any words phrases or sentences

Each language has a vocabulary of 100000 words can give rise to about 500000 word forms

through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟

bdquowent‟ and bdquogone‟

ISI Kolkata ML for MT Jan 6 2014 72

Reasons for mapping to multiple words

Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register

polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)

syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood

ISI Kolkata ML for MT Jan 6 2014 73

Estimate of corpora requirement

Assume that on an average a sentence is 10 words long

an additional 9 translation pairs for getting at one of the 5 mappings

10 sentences per mapping per word

a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences

Estimate is not wide off the mark

Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs

ISI Kolkata ML for MT Jan 6 2014 74

Our work on factor based SMT

Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009

Jan 6 2014 ISI Kolkata ML for MT 75

Case Marker and Morphology crucial in E-H MT

Order of magnitiude facelift in Fluency and fidelity

Determined by the combination of suffixes and semantic relations on the English side

Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers

Jan 6 2014 ISI Kolkata ML for MT 76

Semantic relations+SuffixesCase

Markers+inflections

I ate mangoes

I ltagt ate eatpast mangoes ltobj

I ltagt mangoes ltobjpl eatpast

mei_ne aam khaa_yaa

Jan 6 2014 ISI Kolkata ML for MT 77

Our Approach

Factored model (Koehn and Hoang 2007) with the following translation factor

suffix + semantic relation case markersuffix

Experiments with the following relations

Dependency relations from the stanford parser

Deeper semantic roles from Universal Networking Language (UNL)

Jan 6 2014 ISI Kolkata ML for MT 78

Our Factorization

Jan 6 2014 ISI Kolkata ML for MT 79

Experiments

Jan 6 2014 ISI Kolkata ML for MT 80

Corpus Statistics

Jan 6 2014 ISI Kolkata ML for MT 81

Results The impact of suffix and semantic factors

Jan 6 2014 ISI Kolkata ML for MT 82

Results The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 83

Subjective Evaluation The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 84

Impact of sentence length (F Fluency AAdequacy E Errors)

Jan 6 2014 ISI Kolkata ML for MT 85

A feel for the improvement-baseline

Jan 6 2014 ISI Kolkata ML for MT 86

A feel for the improvement-reorder

Jan 6 2014 ISI Kolkata ML for MT 87

A feel for the improvement-Semantic relation

Jan 6 2014 ISI Kolkata ML for MT 88

A recent study

PAN Indian SMT

Jan 6 2014 ISI Kolkata ML for MT 89

Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator

SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi

Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English

Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences

Evaluation with BLEU METEOR scores also show high correlation with BLEU

Jan 6 2014 ISI Kolkata ML for MT 90

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Indian Language to Indian Language Machine Translation (ILILMT)

Bidirectional Machine Translation System

Developed for nine Indian language pairs

Approach

Transfer based

Modules developed using both rule based and statistical approach

Jan 6 2014 ISI Kolkata ML for MT 29

Architecture of ILILMT System

Morphological Analyzer

Source Text

POS Tagger

Chunker

Vibhakti Computation

Name Entity Recognizer

Word Sense Disambiguatio

n

Lexical Transfer

Agreement Feature

Interchunk

Word Generator

Intrachunk

Target Text

Analysis

Transfer

Generation

Jan 6 2014 ISI Kolkata ML for MT 30

M-H MT system Evaluation

Subjective evaluation based on machine translation quality

Accuracy calculated based on score given by linguists

S5 Number of score 5 Sentences S4 Number of score 4 sentences

S3 Number of score 3 sentences N Total Number of sentences

Accuracy =

Score 5 Correct Translation

Score 4 Understandable with

minor errors

Score 3 Understandable with

major errors

Score 2 Not Understandable

Score 1 Non sense translation

Jan 6 2014 ISI Kolkata ML for MT 31

Evaluation of Marathi to Hindi MT System

0

02

04

06

08

1

12

MorphAnalyzer

POS Tagger Chunker VibhaktiCompute

WSD LexicalTransfer

WordGenerator

Precision

Recall

Module-wise precision and recall

Jan 6 2014 ISI Kolkata ML for MT 32

Evaluation of Marathi to Hindi MT System (cont)

Subjective evaluation on translation quality Evaluated on 500 web sentences Accuracy calculated based on score given according to the

translation quality Accuracy 6532

Result analysis

Morph POS tagger chunker gives more than 90 precision but Transfer WSD generator modules are below 80 hence degrades MT quality

Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Jan 6 2014 ISI Kolkata ML for MT 33

Statistical Machine Translation

Jan 6 2014 ISI Kolkata ML for MT 34

Czeck-English data

[nesu] ldquoI carryrdquo

[ponese] ldquoHe will carryrdquo

[nese] ldquoHe carriesrdquo

[nesou] ldquoThey carryrdquo

[yedu] ldquoI driverdquo

[plavou] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 35

To translate hellip

I will carry

They drive

He swims

They will drive

Jan 6 2014 ISI Kolkata ML for MT 36

Hindi-English data

[DhotA huM] ldquoI carryrdquo

[DhoegA] ldquoHe will carryrdquo

[DhotA hAi] ldquoHe carriesrdquo

[Dhote hAi] ldquoThey carryrdquo

[chalAtA huM] ldquoI driverdquo

[tErte hEM] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 37

Bangla-English data

[bai] ldquoI carryrdquo

[baibe] ldquoHe will carryrdquo

[bay] ldquoHe carriesrdquo

[bay] ldquoThey carryrdquo

[chAlAi] ldquoI driverdquo

[sAMtrAy] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 38

To translate hellip (repeated)

I will carry

They drive

He swims

They will drive

Jan 6 2014 ISI Kolkata ML for MT 39

Foundation

Data driven approach Goal is to find out the English sentence e

given foreign language sentence f whose p(e|f) is maximum

Translations are generated on the basis of statistical model

Parameters are estimated using bilingual parallel corpora

Jan 6 2014 ISI Kolkata ML for MT 40

SMT Language Model

To detect good English sentences

Probability of an English sentence w1w2 helliphellip wn can be written as

Pr(w1w2 helliphellip wn) = Pr(w1) Pr(w2|w1) Pr(wn|w1 w2 wn-1)

Here Pr(wn|w1 w2 wn-1) is the probability that word wn follows word string w1 w2 wn-1 N-gram model probability

Trigram model probability calculation

Jan 6 2014 ISI Kolkata ML for MT 41

SMT Translation Model

P(f|e) Probability of some f given hypothesis English translation e

How to assign the values to p(e|f)

Sentences are infinite not possible to find pair(ef) for all sentences

Introduce a hidden variable a that represents alignments between the individual words in the sentence pair

Sentence level

Word level

Jan 6 2014 ISI Kolkata ML for MT 42

Alignment

If the string e= e1l= e1 e2 hellipel has l words and

the string f= f1m=f1f2fm has m words

then the alignment a can be represented by a series a1

m= a1a2am of m values each between 0 and l such that if the word in position j of the f-string is connected to the word in position i of the e-string then aj= i and

if it is not connected to any English word then aj= O

Jan 6 2014 ISI Kolkata ML for MT 43

Example of alignment

English Ram went to school

Hindi Raama paathashaalaa gayaa

Ram went to school

ltNullgt Raama paathashaalaa gayaa

Jan 6 2014 ISI Kolkata ML for MT 44

Translation Model Exact expression

Five models for estimating parameters in the expression [2]

Model-1 Model-2 Model-3 Model-4 Model-5

Choose alignment given e and m

Choose the identity of foreign word given e m a

Choose the length of foreign language string given e

Jan 6 2014 ISI Kolkata ML for MT 45

a

eafef )|Pr()|Pr(

m

emafeaf )|Pr()|Pr(

m

emafememaf )|Pr()|Pr()|Pr(

m

emafem )|Pr()|Pr(

m

m

j

jj

jj emfaafem1

1

1

1

1 )|Pr()|Pr(

m

j

jj

j

jj

j

m

emfafemfaaem1

1

11

1

1

1

1 )|Pr()|Pr()|Pr(

)|Pr( emaf )|Pr( em

m

j

jj

j

jj

j emfafemfaa1

1

11

1

1

1

1 )|Pr()|Pr(

Proof of Translation Model Exact

expression

m is fixed for a particular f hence

marginalization

marginalization

Jan 6 2014 ISI Kolkata ML for MT 46

Alignment

Jan 6 2014 ISI Kolkata ML for MT 47

Fundamental and ubiquitous

Spell checking

Translation

Transliteration

Speech to text

Text to speeh

Jan 6 2014 ISI Kolkata ML for MT 48

EM for word alignment from

sentence alignment example

English

(1) three rabbits

a b

(2) rabbits of Grenoble

b c d

French

(1) trois lapins

w x

(2) lapins de Grenoble

x y z

Jan 6 2014 ISI Kolkata ML for MT 49

Initial Probabilities

each cell denotes t(a w) t(a x) etc

a b c d

w 14 14 14 14

x 14 14 14 14

y 14 14 14 14

z 14 14 14 14

The counts in IBM Model 1

Works by maximizing P(f|e) over the entire corpus

For IBM Model 1 we get the following relationship

c (w f |w e f e ) =t (w f |w e )

t (w f |we

0 ) +hellip + t (w f |wel )

c (w f |w e f e ) is the fractional count of the alignment of w f

with w e in f and e

t (w f |w e ) is the probability of w f being the translation of w e

is the count of w f in f

is the count of w e in e

Jan 6 2014 ISI Kolkata ML for MT 51

Example of expected count

C[aw (a b)(w x)]

t(aw)

= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)

t(aw)+t(ax)

14

= ----------------- X 1 X 1= 12

14+14

Jan 6 2014 ISI Kolkata ML for MT 52

ldquocountsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 13 13 13

y 0 13 13 13

z 0 13 13 13

a b

w x

a b c d

w 12 12 0 0

x 12 12 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 53

Revised probability example

trevised(a w)

12

= -------------------------------------------------------------------

(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)

Jan 6 2014 ISI Kolkata ML for MT 54

Revised probabilities table

a b c d

w 12 14 0 0

x 12 512 13 13

y 0 16 13 13

z 0 16 13 13

ldquorevised countsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 59 13 13

y 0 29 13 13

z 0 29 13 13

a b

w x

a b c d

w 12 38 0 0

x 12 58 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 56

Re-Revised probabilities table a b c d

w 12 316 0 0

x 12 85144 13 13

y 0 19 13 13

z 0 19 13 13

Continue until convergence notice that (bx) binding gets progressively stronger

b=rabbits x=lapins

Derivation of EM based Alignment Expressions

Hindi)(Say language ofy vocabular

English)(Say language ofry vocalbula

2

1

LV

LV

F

E

what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet

जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet

E1

F1

E2

F2

Jan 6 2014 ISI Kolkata ML for MT 58

Vocabulary mapping

Vocabulary VE VF

what is in a name that which we call rose by any other will smell as sweet

naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii

Jan 6 2014 ISI Kolkata ML for MT 59

Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like

11989011 11989012 hellip 119890

11198971 119891

11 11989112 hellip 119891

11198981

11989021 11989022 hellip 119890

21198972 119891

21 11989122 hellip 119891

21198982

1198901199041 119890

1199042 hellip 119890

119904119897119904 119891

1199041 1198911199042 hellip 119891

119904119898119904

1198901198781 119890

1198782 hellip 119890

119878119897119878 119891

1198781 1198911198782 hellip 119891

119878119898119878

No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890

119904119901 =Index of English word 119890119904119901in English vocabularydictionary

119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary

(Thanks to Sachin Pawar for helping with the maths formulae processing)

Jan 6 2014 ISI Kolkata ML for MT 60

Hidden variables and parameters

Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878

119904=1 where each hidden variable is

as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French

word 119911119901119902119904 = 0 otherwise

Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893

word in French vocabulary

Jan 6 2014 ISI Kolkata ML for MT 61

Likelihoods Data Likelihood L(D Θ)

Data Log-Likelihood LL(D Θ)

Expected value of Data Log-Likelihood E(LL(D Θ))

Jan 6 2014 ISI Kolkata ML for MT 62

Constraint and Lagrangian

119875119894119895

119881119865

119895=1

= 1 forall119894

Jan 6 2014 ISI Kolkata ML for MT 63

Differentiating wrt Pij

Jan 6 2014 ISI Kolkata ML for MT 64

Final E and M steps

M-step

E-step

Jan 6 2014 ISI Kolkata ML for MT 65

Combinatorial considerations

Jan 6 2014 ISI Kolkata ML for MT 66

Example

ISI Kolkata ML for MT Jan 6 2014 67

All possible alignments

ISI Kolkata ML for MT Jan 6 2014 68

First fundamental requirement of SMT

Alignment requires evidence of

bull firstly a translation pair to introduce the

POSSIBILITY of a mapping

bull then another pair to establish with

CERTAINTY the mapping

ISI Kolkata ML for MT Jan 6 2014 69

For the ldquocertaintyrdquo

We have a translation pair containing alignment candidates and none of the other words in the translation pair

OR

We have a translation pair containing all words in the translation pair except the alignment candidates

ISI Kolkata ML for MT Jan 6 2014 70

Thereforehellip

If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty

ISI Kolkata ML for MT Jan 6 2014 71

Rough estimate of data requirement

SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge

ie no meanings or grammatical properties of any words phrases or sentences

Each language has a vocabulary of 100000 words can give rise to about 500000 word forms

through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟

bdquowent‟ and bdquogone‟

ISI Kolkata ML for MT Jan 6 2014 72

Reasons for mapping to multiple words

Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register

polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)

syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood

ISI Kolkata ML for MT Jan 6 2014 73

Estimate of corpora requirement

Assume that on an average a sentence is 10 words long

an additional 9 translation pairs for getting at one of the 5 mappings

10 sentences per mapping per word

a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences

Estimate is not wide off the mark

Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs

ISI Kolkata ML for MT Jan 6 2014 74

Our work on factor based SMT

Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009

Jan 6 2014 ISI Kolkata ML for MT 75

Case Marker and Morphology crucial in E-H MT

Order of magnitiude facelift in Fluency and fidelity

Determined by the combination of suffixes and semantic relations on the English side

Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers

Jan 6 2014 ISI Kolkata ML for MT 76

Semantic relations+SuffixesCase

Markers+inflections

I ate mangoes

I ltagt ate eatpast mangoes ltobj

I ltagt mangoes ltobjpl eatpast

mei_ne aam khaa_yaa

Jan 6 2014 ISI Kolkata ML for MT 77

Our Approach

Factored model (Koehn and Hoang 2007) with the following translation factor

suffix + semantic relation case markersuffix

Experiments with the following relations

Dependency relations from the stanford parser

Deeper semantic roles from Universal Networking Language (UNL)

Jan 6 2014 ISI Kolkata ML for MT 78

Our Factorization

Jan 6 2014 ISI Kolkata ML for MT 79

Experiments

Jan 6 2014 ISI Kolkata ML for MT 80

Corpus Statistics

Jan 6 2014 ISI Kolkata ML for MT 81

Results The impact of suffix and semantic factors

Jan 6 2014 ISI Kolkata ML for MT 82

Results The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 83

Subjective Evaluation The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 84

Impact of sentence length (F Fluency AAdequacy E Errors)

Jan 6 2014 ISI Kolkata ML for MT 85

A feel for the improvement-baseline

Jan 6 2014 ISI Kolkata ML for MT 86

A feel for the improvement-reorder

Jan 6 2014 ISI Kolkata ML for MT 87

A feel for the improvement-Semantic relation

Jan 6 2014 ISI Kolkata ML for MT 88

A recent study

PAN Indian SMT

Jan 6 2014 ISI Kolkata ML for MT 89

Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator

SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi

Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English

Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences

Evaluation with BLEU METEOR scores also show high correlation with BLEU

Jan 6 2014 ISI Kolkata ML for MT 90

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Architecture of ILILMT System

Morphological Analyzer

Source Text

POS Tagger

Chunker

Vibhakti Computation

Name Entity Recognizer

Word Sense Disambiguatio

n

Lexical Transfer

Agreement Feature

Interchunk

Word Generator

Intrachunk

Target Text

Analysis

Transfer

Generation

Jan 6 2014 ISI Kolkata ML for MT 30

M-H MT system Evaluation

Subjective evaluation based on machine translation quality

Accuracy calculated based on score given by linguists

S5 Number of score 5 Sentences S4 Number of score 4 sentences

S3 Number of score 3 sentences N Total Number of sentences

Accuracy =

Score 5 Correct Translation

Score 4 Understandable with

minor errors

Score 3 Understandable with

major errors

Score 2 Not Understandable

Score 1 Non sense translation

Jan 6 2014 ISI Kolkata ML for MT 31

Evaluation of Marathi to Hindi MT System

0

02

04

06

08

1

12

MorphAnalyzer

POS Tagger Chunker VibhaktiCompute

WSD LexicalTransfer

WordGenerator

Precision

Recall

Module-wise precision and recall

Jan 6 2014 ISI Kolkata ML for MT 32

Evaluation of Marathi to Hindi MT System (cont)

Subjective evaluation on translation quality Evaluated on 500 web sentences Accuracy calculated based on score given according to the

translation quality Accuracy 6532

Result analysis

Morph POS tagger chunker gives more than 90 precision but Transfer WSD generator modules are below 80 hence degrades MT quality

Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Jan 6 2014 ISI Kolkata ML for MT 33

Statistical Machine Translation

Jan 6 2014 ISI Kolkata ML for MT 34

Czeck-English data

[nesu] ldquoI carryrdquo

[ponese] ldquoHe will carryrdquo

[nese] ldquoHe carriesrdquo

[nesou] ldquoThey carryrdquo

[yedu] ldquoI driverdquo

[plavou] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 35

To translate hellip

I will carry

They drive

He swims

They will drive

Jan 6 2014 ISI Kolkata ML for MT 36

Hindi-English data

[DhotA huM] ldquoI carryrdquo

[DhoegA] ldquoHe will carryrdquo

[DhotA hAi] ldquoHe carriesrdquo

[Dhote hAi] ldquoThey carryrdquo

[chalAtA huM] ldquoI driverdquo

[tErte hEM] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 37

Bangla-English data

[bai] ldquoI carryrdquo

[baibe] ldquoHe will carryrdquo

[bay] ldquoHe carriesrdquo

[bay] ldquoThey carryrdquo

[chAlAi] ldquoI driverdquo

[sAMtrAy] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 38

To translate hellip (repeated)

I will carry

They drive

He swims

They will drive

Jan 6 2014 ISI Kolkata ML for MT 39

Foundation

Data driven approach Goal is to find out the English sentence e

given foreign language sentence f whose p(e|f) is maximum

Translations are generated on the basis of statistical model

Parameters are estimated using bilingual parallel corpora

Jan 6 2014 ISI Kolkata ML for MT 40

SMT Language Model

To detect good English sentences

Probability of an English sentence w1w2 helliphellip wn can be written as

Pr(w1w2 helliphellip wn) = Pr(w1) Pr(w2|w1) Pr(wn|w1 w2 wn-1)

Here Pr(wn|w1 w2 wn-1) is the probability that word wn follows word string w1 w2 wn-1 N-gram model probability

Trigram model probability calculation

Jan 6 2014 ISI Kolkata ML for MT 41

SMT Translation Model

P(f|e) Probability of some f given hypothesis English translation e

How to assign the values to p(e|f)

Sentences are infinite not possible to find pair(ef) for all sentences

Introduce a hidden variable a that represents alignments between the individual words in the sentence pair

Sentence level

Word level

Jan 6 2014 ISI Kolkata ML for MT 42

Alignment

If the string e= e1l= e1 e2 hellipel has l words and

the string f= f1m=f1f2fm has m words

then the alignment a can be represented by a series a1

m= a1a2am of m values each between 0 and l such that if the word in position j of the f-string is connected to the word in position i of the e-string then aj= i and

if it is not connected to any English word then aj= O

Jan 6 2014 ISI Kolkata ML for MT 43

Example of alignment

English Ram went to school

Hindi Raama paathashaalaa gayaa

Ram went to school

ltNullgt Raama paathashaalaa gayaa

Jan 6 2014 ISI Kolkata ML for MT 44

Translation Model Exact expression

Five models for estimating parameters in the expression [2]

Model-1 Model-2 Model-3 Model-4 Model-5

Choose alignment given e and m

Choose the identity of foreign word given e m a

Choose the length of foreign language string given e

Jan 6 2014 ISI Kolkata ML for MT 45

a

eafef )|Pr()|Pr(

m

emafeaf )|Pr()|Pr(

m

emafememaf )|Pr()|Pr()|Pr(

m

emafem )|Pr()|Pr(

m

m

j

jj

jj emfaafem1

1

1

1

1 )|Pr()|Pr(

m

j

jj

j

jj

j

m

emfafemfaaem1

1

11

1

1

1

1 )|Pr()|Pr()|Pr(

)|Pr( emaf )|Pr( em

m

j

jj

j

jj

j emfafemfaa1

1

11

1

1

1

1 )|Pr()|Pr(

Proof of Translation Model Exact

expression

m is fixed for a particular f hence

marginalization

marginalization

Jan 6 2014 ISI Kolkata ML for MT 46

Alignment

Jan 6 2014 ISI Kolkata ML for MT 47

Fundamental and ubiquitous

Spell checking

Translation

Transliteration

Speech to text

Text to speeh

Jan 6 2014 ISI Kolkata ML for MT 48

EM for word alignment from

sentence alignment example

English

(1) three rabbits

a b

(2) rabbits of Grenoble

b c d

French

(1) trois lapins

w x

(2) lapins de Grenoble

x y z

Jan 6 2014 ISI Kolkata ML for MT 49

Initial Probabilities

each cell denotes t(a w) t(a x) etc

a b c d

w 14 14 14 14

x 14 14 14 14

y 14 14 14 14

z 14 14 14 14

The counts in IBM Model 1

Works by maximizing P(f|e) over the entire corpus

For IBM Model 1 we get the following relationship

c (w f |w e f e ) =t (w f |w e )

t (w f |we

0 ) +hellip + t (w f |wel )

c (w f |w e f e ) is the fractional count of the alignment of w f

with w e in f and e

t (w f |w e ) is the probability of w f being the translation of w e

is the count of w f in f

is the count of w e in e

Jan 6 2014 ISI Kolkata ML for MT 51

Example of expected count

C[aw (a b)(w x)]

t(aw)

= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)

t(aw)+t(ax)

14

= ----------------- X 1 X 1= 12

14+14

Jan 6 2014 ISI Kolkata ML for MT 52

ldquocountsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 13 13 13

y 0 13 13 13

z 0 13 13 13

a b

w x

a b c d

w 12 12 0 0

x 12 12 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 53

Revised probability example

trevised(a w)

12

= -------------------------------------------------------------------

(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)

Jan 6 2014 ISI Kolkata ML for MT 54

Revised probabilities table

a b c d

w 12 14 0 0

x 12 512 13 13

y 0 16 13 13

z 0 16 13 13

ldquorevised countsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 59 13 13

y 0 29 13 13

z 0 29 13 13

a b

w x

a b c d

w 12 38 0 0

x 12 58 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 56

Re-Revised probabilities table a b c d

w 12 316 0 0

x 12 85144 13 13

y 0 19 13 13

z 0 19 13 13

Continue until convergence notice that (bx) binding gets progressively stronger

b=rabbits x=lapins

Derivation of EM based Alignment Expressions

Hindi)(Say language ofy vocabular

English)(Say language ofry vocalbula

2

1

LV

LV

F

E

what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet

जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet

E1

F1

E2

F2

Jan 6 2014 ISI Kolkata ML for MT 58

Vocabulary mapping

Vocabulary VE VF

what is in a name that which we call rose by any other will smell as sweet

naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii

Jan 6 2014 ISI Kolkata ML for MT 59

Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like

11989011 11989012 hellip 119890

11198971 119891

11 11989112 hellip 119891

11198981

11989021 11989022 hellip 119890

21198972 119891

21 11989122 hellip 119891

21198982

1198901199041 119890

1199042 hellip 119890

119904119897119904 119891

1199041 1198911199042 hellip 119891

119904119898119904

1198901198781 119890

1198782 hellip 119890

119878119897119878 119891

1198781 1198911198782 hellip 119891

119878119898119878

No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890

119904119901 =Index of English word 119890119904119901in English vocabularydictionary

119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary

(Thanks to Sachin Pawar for helping with the maths formulae processing)

Jan 6 2014 ISI Kolkata ML for MT 60

Hidden variables and parameters

Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878

119904=1 where each hidden variable is

as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French

word 119911119901119902119904 = 0 otherwise

Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893

word in French vocabulary

Jan 6 2014 ISI Kolkata ML for MT 61

Likelihoods Data Likelihood L(D Θ)

Data Log-Likelihood LL(D Θ)

Expected value of Data Log-Likelihood E(LL(D Θ))

Jan 6 2014 ISI Kolkata ML for MT 62

Constraint and Lagrangian

119875119894119895

119881119865

119895=1

= 1 forall119894

Jan 6 2014 ISI Kolkata ML for MT 63

Differentiating wrt Pij

Jan 6 2014 ISI Kolkata ML for MT 64

Final E and M steps

M-step

E-step

Jan 6 2014 ISI Kolkata ML for MT 65

Combinatorial considerations

Jan 6 2014 ISI Kolkata ML for MT 66

Example

ISI Kolkata ML for MT Jan 6 2014 67

All possible alignments

ISI Kolkata ML for MT Jan 6 2014 68

First fundamental requirement of SMT

Alignment requires evidence of

bull firstly a translation pair to introduce the

POSSIBILITY of a mapping

bull then another pair to establish with

CERTAINTY the mapping

ISI Kolkata ML for MT Jan 6 2014 69

For the ldquocertaintyrdquo

We have a translation pair containing alignment candidates and none of the other words in the translation pair

OR

We have a translation pair containing all words in the translation pair except the alignment candidates

ISI Kolkata ML for MT Jan 6 2014 70

Thereforehellip

If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty

ISI Kolkata ML for MT Jan 6 2014 71

Rough estimate of data requirement

SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge

ie no meanings or grammatical properties of any words phrases or sentences

Each language has a vocabulary of 100000 words can give rise to about 500000 word forms

through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟

bdquowent‟ and bdquogone‟

ISI Kolkata ML for MT Jan 6 2014 72

Reasons for mapping to multiple words

Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register

polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)

syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood

ISI Kolkata ML for MT Jan 6 2014 73

Estimate of corpora requirement

Assume that on an average a sentence is 10 words long

an additional 9 translation pairs for getting at one of the 5 mappings

10 sentences per mapping per word

a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences

Estimate is not wide off the mark

Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs

ISI Kolkata ML for MT Jan 6 2014 74

Our work on factor based SMT

Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009

Jan 6 2014 ISI Kolkata ML for MT 75

Case Marker and Morphology crucial in E-H MT

Order of magnitiude facelift in Fluency and fidelity

Determined by the combination of suffixes and semantic relations on the English side

Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers

Jan 6 2014 ISI Kolkata ML for MT 76

Semantic relations+SuffixesCase

Markers+inflections

I ate mangoes

I ltagt ate eatpast mangoes ltobj

I ltagt mangoes ltobjpl eatpast

mei_ne aam khaa_yaa

Jan 6 2014 ISI Kolkata ML for MT 77

Our Approach

Factored model (Koehn and Hoang 2007) with the following translation factor

suffix + semantic relation case markersuffix

Experiments with the following relations

Dependency relations from the stanford parser

Deeper semantic roles from Universal Networking Language (UNL)

Jan 6 2014 ISI Kolkata ML for MT 78

Our Factorization

Jan 6 2014 ISI Kolkata ML for MT 79

Experiments

Jan 6 2014 ISI Kolkata ML for MT 80

Corpus Statistics

Jan 6 2014 ISI Kolkata ML for MT 81

Results The impact of suffix and semantic factors

Jan 6 2014 ISI Kolkata ML for MT 82

Results The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 83

Subjective Evaluation The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 84

Impact of sentence length (F Fluency AAdequacy E Errors)

Jan 6 2014 ISI Kolkata ML for MT 85

A feel for the improvement-baseline

Jan 6 2014 ISI Kolkata ML for MT 86

A feel for the improvement-reorder

Jan 6 2014 ISI Kolkata ML for MT 87

A feel for the improvement-Semantic relation

Jan 6 2014 ISI Kolkata ML for MT 88

A recent study

PAN Indian SMT

Jan 6 2014 ISI Kolkata ML for MT 89

Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator

SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi

Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English

Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences

Evaluation with BLEU METEOR scores also show high correlation with BLEU

Jan 6 2014 ISI Kolkata ML for MT 90

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

M-H MT system Evaluation

Subjective evaluation based on machine translation quality

Accuracy calculated based on score given by linguists

S5 Number of score 5 Sentences S4 Number of score 4 sentences

S3 Number of score 3 sentences N Total Number of sentences

Accuracy =

Score 5 Correct Translation

Score 4 Understandable with

minor errors

Score 3 Understandable with

major errors

Score 2 Not Understandable

Score 1 Non sense translation

Jan 6 2014 ISI Kolkata ML for MT 31

Evaluation of Marathi to Hindi MT System

0

02

04

06

08

1

12

MorphAnalyzer

POS Tagger Chunker VibhaktiCompute

WSD LexicalTransfer

WordGenerator

Precision

Recall

Module-wise precision and recall

Jan 6 2014 ISI Kolkata ML for MT 32

Evaluation of Marathi to Hindi MT System (cont)

Subjective evaluation on translation quality Evaluated on 500 web sentences Accuracy calculated based on score given according to the

translation quality Accuracy 6532

Result analysis

Morph POS tagger chunker gives more than 90 precision but Transfer WSD generator modules are below 80 hence degrades MT quality

Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Jan 6 2014 ISI Kolkata ML for MT 33

Statistical Machine Translation

Jan 6 2014 ISI Kolkata ML for MT 34

Czeck-English data

[nesu] ldquoI carryrdquo

[ponese] ldquoHe will carryrdquo

[nese] ldquoHe carriesrdquo

[nesou] ldquoThey carryrdquo

[yedu] ldquoI driverdquo

[plavou] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 35

To translate hellip

I will carry

They drive

He swims

They will drive

Jan 6 2014 ISI Kolkata ML for MT 36

Hindi-English data

[DhotA huM] ldquoI carryrdquo

[DhoegA] ldquoHe will carryrdquo

[DhotA hAi] ldquoHe carriesrdquo

[Dhote hAi] ldquoThey carryrdquo

[chalAtA huM] ldquoI driverdquo

[tErte hEM] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 37

Bangla-English data

[bai] ldquoI carryrdquo

[baibe] ldquoHe will carryrdquo

[bay] ldquoHe carriesrdquo

[bay] ldquoThey carryrdquo

[chAlAi] ldquoI driverdquo

[sAMtrAy] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 38

To translate hellip (repeated)

I will carry

They drive

He swims

They will drive

Jan 6 2014 ISI Kolkata ML for MT 39

Foundation

Data driven approach Goal is to find out the English sentence e

given foreign language sentence f whose p(e|f) is maximum

Translations are generated on the basis of statistical model

Parameters are estimated using bilingual parallel corpora

Jan 6 2014 ISI Kolkata ML for MT 40

SMT Language Model

To detect good English sentences

Probability of an English sentence w1w2 helliphellip wn can be written as

Pr(w1w2 helliphellip wn) = Pr(w1) Pr(w2|w1) Pr(wn|w1 w2 wn-1)

Here Pr(wn|w1 w2 wn-1) is the probability that word wn follows word string w1 w2 wn-1 N-gram model probability

Trigram model probability calculation

Jan 6 2014 ISI Kolkata ML for MT 41

SMT Translation Model

P(f|e) Probability of some f given hypothesis English translation e

How to assign the values to p(e|f)

Sentences are infinite not possible to find pair(ef) for all sentences

Introduce a hidden variable a that represents alignments between the individual words in the sentence pair

Sentence level

Word level

Jan 6 2014 ISI Kolkata ML for MT 42

Alignment

If the string e= e1l= e1 e2 hellipel has l words and

the string f= f1m=f1f2fm has m words

then the alignment a can be represented by a series a1

m= a1a2am of m values each between 0 and l such that if the word in position j of the f-string is connected to the word in position i of the e-string then aj= i and

if it is not connected to any English word then aj= O

Jan 6 2014 ISI Kolkata ML for MT 43

Example of alignment

English Ram went to school

Hindi Raama paathashaalaa gayaa

Ram went to school

ltNullgt Raama paathashaalaa gayaa

Jan 6 2014 ISI Kolkata ML for MT 44

Translation Model Exact expression

Five models for estimating parameters in the expression [2]

Model-1 Model-2 Model-3 Model-4 Model-5

Choose alignment given e and m

Choose the identity of foreign word given e m a

Choose the length of foreign language string given e

Jan 6 2014 ISI Kolkata ML for MT 45

a

eafef )|Pr()|Pr(

m

emafeaf )|Pr()|Pr(

m

emafememaf )|Pr()|Pr()|Pr(

m

emafem )|Pr()|Pr(

m

m

j

jj

jj emfaafem1

1

1

1

1 )|Pr()|Pr(

m

j

jj

j

jj

j

m

emfafemfaaem1

1

11

1

1

1

1 )|Pr()|Pr()|Pr(

)|Pr( emaf )|Pr( em

m

j

jj

j

jj

j emfafemfaa1

1

11

1

1

1

1 )|Pr()|Pr(

Proof of Translation Model Exact

expression

m is fixed for a particular f hence

marginalization

marginalization

Jan 6 2014 ISI Kolkata ML for MT 46

Alignment

Jan 6 2014 ISI Kolkata ML for MT 47

Fundamental and ubiquitous

Spell checking

Translation

Transliteration

Speech to text

Text to speeh

Jan 6 2014 ISI Kolkata ML for MT 48

EM for word alignment from

sentence alignment example

English

(1) three rabbits

a b

(2) rabbits of Grenoble

b c d

French

(1) trois lapins

w x

(2) lapins de Grenoble

x y z

Jan 6 2014 ISI Kolkata ML for MT 49

Initial Probabilities

each cell denotes t(a w) t(a x) etc

a b c d

w 14 14 14 14

x 14 14 14 14

y 14 14 14 14

z 14 14 14 14

The counts in IBM Model 1

Works by maximizing P(f|e) over the entire corpus

For IBM Model 1 we get the following relationship

c (w f |w e f e ) =t (w f |w e )

t (w f |we

0 ) +hellip + t (w f |wel )

c (w f |w e f e ) is the fractional count of the alignment of w f

with w e in f and e

t (w f |w e ) is the probability of w f being the translation of w e

is the count of w f in f

is the count of w e in e

Jan 6 2014 ISI Kolkata ML for MT 51

Example of expected count

C[aw (a b)(w x)]

t(aw)

= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)

t(aw)+t(ax)

14

= ----------------- X 1 X 1= 12

14+14

Jan 6 2014 ISI Kolkata ML for MT 52

ldquocountsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 13 13 13

y 0 13 13 13

z 0 13 13 13

a b

w x

a b c d

w 12 12 0 0

x 12 12 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 53

Revised probability example

trevised(a w)

12

= -------------------------------------------------------------------

(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)

Jan 6 2014 ISI Kolkata ML for MT 54

Revised probabilities table

a b c d

w 12 14 0 0

x 12 512 13 13

y 0 16 13 13

z 0 16 13 13

ldquorevised countsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 59 13 13

y 0 29 13 13

z 0 29 13 13

a b

w x

a b c d

w 12 38 0 0

x 12 58 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 56

Re-Revised probabilities table a b c d

w 12 316 0 0

x 12 85144 13 13

y 0 19 13 13

z 0 19 13 13

Continue until convergence notice that (bx) binding gets progressively stronger

b=rabbits x=lapins

Derivation of EM based Alignment Expressions

Hindi)(Say language ofy vocabular

English)(Say language ofry vocalbula

2

1

LV

LV

F

E

what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet

जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet

E1

F1

E2

F2

Jan 6 2014 ISI Kolkata ML for MT 58

Vocabulary mapping

Vocabulary VE VF

what is in a name that which we call rose by any other will smell as sweet

naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii

Jan 6 2014 ISI Kolkata ML for MT 59

Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like

11989011 11989012 hellip 119890

11198971 119891

11 11989112 hellip 119891

11198981

11989021 11989022 hellip 119890

21198972 119891

21 11989122 hellip 119891

21198982

1198901199041 119890

1199042 hellip 119890

119904119897119904 119891

1199041 1198911199042 hellip 119891

119904119898119904

1198901198781 119890

1198782 hellip 119890

119878119897119878 119891

1198781 1198911198782 hellip 119891

119878119898119878

No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890

119904119901 =Index of English word 119890119904119901in English vocabularydictionary

119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary

(Thanks to Sachin Pawar for helping with the maths formulae processing)

Jan 6 2014 ISI Kolkata ML for MT 60

Hidden variables and parameters

Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878

119904=1 where each hidden variable is

as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French

word 119911119901119902119904 = 0 otherwise

Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893

word in French vocabulary

Jan 6 2014 ISI Kolkata ML for MT 61

Likelihoods Data Likelihood L(D Θ)

Data Log-Likelihood LL(D Θ)

Expected value of Data Log-Likelihood E(LL(D Θ))

Jan 6 2014 ISI Kolkata ML for MT 62

Constraint and Lagrangian

119875119894119895

119881119865

119895=1

= 1 forall119894

Jan 6 2014 ISI Kolkata ML for MT 63

Differentiating wrt Pij

Jan 6 2014 ISI Kolkata ML for MT 64

Final E and M steps

M-step

E-step

Jan 6 2014 ISI Kolkata ML for MT 65

Combinatorial considerations

Jan 6 2014 ISI Kolkata ML for MT 66

Example

ISI Kolkata ML for MT Jan 6 2014 67

All possible alignments

ISI Kolkata ML for MT Jan 6 2014 68

First fundamental requirement of SMT

Alignment requires evidence of

bull firstly a translation pair to introduce the

POSSIBILITY of a mapping

bull then another pair to establish with

CERTAINTY the mapping

ISI Kolkata ML for MT Jan 6 2014 69

For the ldquocertaintyrdquo

We have a translation pair containing alignment candidates and none of the other words in the translation pair

OR

We have a translation pair containing all words in the translation pair except the alignment candidates

ISI Kolkata ML for MT Jan 6 2014 70

Thereforehellip

If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty

ISI Kolkata ML for MT Jan 6 2014 71

Rough estimate of data requirement

SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge

ie no meanings or grammatical properties of any words phrases or sentences

Each language has a vocabulary of 100000 words can give rise to about 500000 word forms

through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟

bdquowent‟ and bdquogone‟

ISI Kolkata ML for MT Jan 6 2014 72

Reasons for mapping to multiple words

Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register

polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)

syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood

ISI Kolkata ML for MT Jan 6 2014 73

Estimate of corpora requirement

Assume that on an average a sentence is 10 words long

an additional 9 translation pairs for getting at one of the 5 mappings

10 sentences per mapping per word

a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences

Estimate is not wide off the mark

Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs

ISI Kolkata ML for MT Jan 6 2014 74

Our work on factor based SMT

Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009

Jan 6 2014 ISI Kolkata ML for MT 75

Case Marker and Morphology crucial in E-H MT

Order of magnitiude facelift in Fluency and fidelity

Determined by the combination of suffixes and semantic relations on the English side

Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers

Jan 6 2014 ISI Kolkata ML for MT 76

Semantic relations+SuffixesCase

Markers+inflections

I ate mangoes

I ltagt ate eatpast mangoes ltobj

I ltagt mangoes ltobjpl eatpast

mei_ne aam khaa_yaa

Jan 6 2014 ISI Kolkata ML for MT 77

Our Approach

Factored model (Koehn and Hoang 2007) with the following translation factor

suffix + semantic relation case markersuffix

Experiments with the following relations

Dependency relations from the stanford parser

Deeper semantic roles from Universal Networking Language (UNL)

Jan 6 2014 ISI Kolkata ML for MT 78

Our Factorization

Jan 6 2014 ISI Kolkata ML for MT 79

Experiments

Jan 6 2014 ISI Kolkata ML for MT 80

Corpus Statistics

Jan 6 2014 ISI Kolkata ML for MT 81

Results The impact of suffix and semantic factors

Jan 6 2014 ISI Kolkata ML for MT 82

Results The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 83

Subjective Evaluation The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 84

Impact of sentence length (F Fluency AAdequacy E Errors)

Jan 6 2014 ISI Kolkata ML for MT 85

A feel for the improvement-baseline

Jan 6 2014 ISI Kolkata ML for MT 86

A feel for the improvement-reorder

Jan 6 2014 ISI Kolkata ML for MT 87

A feel for the improvement-Semantic relation

Jan 6 2014 ISI Kolkata ML for MT 88

A recent study

PAN Indian SMT

Jan 6 2014 ISI Kolkata ML for MT 89

Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator

SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi

Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English

Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences

Evaluation with BLEU METEOR scores also show high correlation with BLEU

Jan 6 2014 ISI Kolkata ML for MT 90

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Evaluation of Marathi to Hindi MT System

0

02

04

06

08

1

12

MorphAnalyzer

POS Tagger Chunker VibhaktiCompute

WSD LexicalTransfer

WordGenerator

Precision

Recall

Module-wise precision and recall

Jan 6 2014 ISI Kolkata ML for MT 32

Evaluation of Marathi to Hindi MT System (cont)

Subjective evaluation on translation quality Evaluated on 500 web sentences Accuracy calculated based on score given according to the

translation quality Accuracy 6532

Result analysis

Morph POS tagger chunker gives more than 90 precision but Transfer WSD generator modules are below 80 hence degrades MT quality

Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Jan 6 2014 ISI Kolkata ML for MT 33

Statistical Machine Translation

Jan 6 2014 ISI Kolkata ML for MT 34

Czeck-English data

[nesu] ldquoI carryrdquo

[ponese] ldquoHe will carryrdquo

[nese] ldquoHe carriesrdquo

[nesou] ldquoThey carryrdquo

[yedu] ldquoI driverdquo

[plavou] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 35

To translate hellip

I will carry

They drive

He swims

They will drive

Jan 6 2014 ISI Kolkata ML for MT 36

Hindi-English data

[DhotA huM] ldquoI carryrdquo

[DhoegA] ldquoHe will carryrdquo

[DhotA hAi] ldquoHe carriesrdquo

[Dhote hAi] ldquoThey carryrdquo

[chalAtA huM] ldquoI driverdquo

[tErte hEM] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 37

Bangla-English data

[bai] ldquoI carryrdquo

[baibe] ldquoHe will carryrdquo

[bay] ldquoHe carriesrdquo

[bay] ldquoThey carryrdquo

[chAlAi] ldquoI driverdquo

[sAMtrAy] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 38

To translate hellip (repeated)

I will carry

They drive

He swims

They will drive

Jan 6 2014 ISI Kolkata ML for MT 39

Foundation

Data driven approach Goal is to find out the English sentence e

given foreign language sentence f whose p(e|f) is maximum

Translations are generated on the basis of statistical model

Parameters are estimated using bilingual parallel corpora

Jan 6 2014 ISI Kolkata ML for MT 40

SMT Language Model

To detect good English sentences

Probability of an English sentence w1w2 helliphellip wn can be written as

Pr(w1w2 helliphellip wn) = Pr(w1) Pr(w2|w1) Pr(wn|w1 w2 wn-1)

Here Pr(wn|w1 w2 wn-1) is the probability that word wn follows word string w1 w2 wn-1 N-gram model probability

Trigram model probability calculation

Jan 6 2014 ISI Kolkata ML for MT 41

SMT Translation Model

P(f|e) Probability of some f given hypothesis English translation e

How to assign the values to p(e|f)

Sentences are infinite not possible to find pair(ef) for all sentences

Introduce a hidden variable a that represents alignments between the individual words in the sentence pair

Sentence level

Word level

Jan 6 2014 ISI Kolkata ML for MT 42

Alignment

If the string e= e1l= e1 e2 hellipel has l words and

the string f= f1m=f1f2fm has m words

then the alignment a can be represented by a series a1

m= a1a2am of m values each between 0 and l such that if the word in position j of the f-string is connected to the word in position i of the e-string then aj= i and

if it is not connected to any English word then aj= O

Jan 6 2014 ISI Kolkata ML for MT 43

Example of alignment

English Ram went to school

Hindi Raama paathashaalaa gayaa

Ram went to school

ltNullgt Raama paathashaalaa gayaa

Jan 6 2014 ISI Kolkata ML for MT 44

Translation Model Exact expression

Five models for estimating parameters in the expression [2]

Model-1 Model-2 Model-3 Model-4 Model-5

Choose alignment given e and m

Choose the identity of foreign word given e m a

Choose the length of foreign language string given e

Jan 6 2014 ISI Kolkata ML for MT 45

a

eafef )|Pr()|Pr(

m

emafeaf )|Pr()|Pr(

m

emafememaf )|Pr()|Pr()|Pr(

m

emafem )|Pr()|Pr(

m

m

j

jj

jj emfaafem1

1

1

1

1 )|Pr()|Pr(

m

j

jj

j

jj

j

m

emfafemfaaem1

1

11

1

1

1

1 )|Pr()|Pr()|Pr(

)|Pr( emaf )|Pr( em

m

j

jj

j

jj

j emfafemfaa1

1

11

1

1

1

1 )|Pr()|Pr(

Proof of Translation Model Exact

expression

m is fixed for a particular f hence

marginalization

marginalization

Jan 6 2014 ISI Kolkata ML for MT 46

Alignment

Jan 6 2014 ISI Kolkata ML for MT 47

Fundamental and ubiquitous

Spell checking

Translation

Transliteration

Speech to text

Text to speeh

Jan 6 2014 ISI Kolkata ML for MT 48

EM for word alignment from

sentence alignment example

English

(1) three rabbits

a b

(2) rabbits of Grenoble

b c d

French

(1) trois lapins

w x

(2) lapins de Grenoble

x y z

Jan 6 2014 ISI Kolkata ML for MT 49

Initial Probabilities

each cell denotes t(a w) t(a x) etc

a b c d

w 14 14 14 14

x 14 14 14 14

y 14 14 14 14

z 14 14 14 14

The counts in IBM Model 1

Works by maximizing P(f|e) over the entire corpus

For IBM Model 1 we get the following relationship

c (w f |w e f e ) =t (w f |w e )

t (w f |we

0 ) +hellip + t (w f |wel )

c (w f |w e f e ) is the fractional count of the alignment of w f

with w e in f and e

t (w f |w e ) is the probability of w f being the translation of w e

is the count of w f in f

is the count of w e in e

Jan 6 2014 ISI Kolkata ML for MT 51

Example of expected count

C[aw (a b)(w x)]

t(aw)

= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)

t(aw)+t(ax)

14

= ----------------- X 1 X 1= 12

14+14

Jan 6 2014 ISI Kolkata ML for MT 52

ldquocountsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 13 13 13

y 0 13 13 13

z 0 13 13 13

a b

w x

a b c d

w 12 12 0 0

x 12 12 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 53

Revised probability example

trevised(a w)

12

= -------------------------------------------------------------------

(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)

Jan 6 2014 ISI Kolkata ML for MT 54

Revised probabilities table

a b c d

w 12 14 0 0

x 12 512 13 13

y 0 16 13 13

z 0 16 13 13

ldquorevised countsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 59 13 13

y 0 29 13 13

z 0 29 13 13

a b

w x

a b c d

w 12 38 0 0

x 12 58 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 56

Re-Revised probabilities table a b c d

w 12 316 0 0

x 12 85144 13 13

y 0 19 13 13

z 0 19 13 13

Continue until convergence notice that (bx) binding gets progressively stronger

b=rabbits x=lapins

Derivation of EM based Alignment Expressions

Hindi)(Say language ofy vocabular

English)(Say language ofry vocalbula

2

1

LV

LV

F

E

what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet

जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet

E1

F1

E2

F2

Jan 6 2014 ISI Kolkata ML for MT 58

Vocabulary mapping

Vocabulary VE VF

what is in a name that which we call rose by any other will smell as sweet

naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii

Jan 6 2014 ISI Kolkata ML for MT 59

Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like

11989011 11989012 hellip 119890

11198971 119891

11 11989112 hellip 119891

11198981

11989021 11989022 hellip 119890

21198972 119891

21 11989122 hellip 119891

21198982

1198901199041 119890

1199042 hellip 119890

119904119897119904 119891

1199041 1198911199042 hellip 119891

119904119898119904

1198901198781 119890

1198782 hellip 119890

119878119897119878 119891

1198781 1198911198782 hellip 119891

119878119898119878

No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890

119904119901 =Index of English word 119890119904119901in English vocabularydictionary

119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary

(Thanks to Sachin Pawar for helping with the maths formulae processing)

Jan 6 2014 ISI Kolkata ML for MT 60

Hidden variables and parameters

Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878

119904=1 where each hidden variable is

as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French

word 119911119901119902119904 = 0 otherwise

Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893

word in French vocabulary

Jan 6 2014 ISI Kolkata ML for MT 61

Likelihoods Data Likelihood L(D Θ)

Data Log-Likelihood LL(D Θ)

Expected value of Data Log-Likelihood E(LL(D Θ))

Jan 6 2014 ISI Kolkata ML for MT 62

Constraint and Lagrangian

119875119894119895

119881119865

119895=1

= 1 forall119894

Jan 6 2014 ISI Kolkata ML for MT 63

Differentiating wrt Pij

Jan 6 2014 ISI Kolkata ML for MT 64

Final E and M steps

M-step

E-step

Jan 6 2014 ISI Kolkata ML for MT 65

Combinatorial considerations

Jan 6 2014 ISI Kolkata ML for MT 66

Example

ISI Kolkata ML for MT Jan 6 2014 67

All possible alignments

ISI Kolkata ML for MT Jan 6 2014 68

First fundamental requirement of SMT

Alignment requires evidence of

bull firstly a translation pair to introduce the

POSSIBILITY of a mapping

bull then another pair to establish with

CERTAINTY the mapping

ISI Kolkata ML for MT Jan 6 2014 69

For the ldquocertaintyrdquo

We have a translation pair containing alignment candidates and none of the other words in the translation pair

OR

We have a translation pair containing all words in the translation pair except the alignment candidates

ISI Kolkata ML for MT Jan 6 2014 70

Thereforehellip

If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty

ISI Kolkata ML for MT Jan 6 2014 71

Rough estimate of data requirement

SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge

ie no meanings or grammatical properties of any words phrases or sentences

Each language has a vocabulary of 100000 words can give rise to about 500000 word forms

through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟

bdquowent‟ and bdquogone‟

ISI Kolkata ML for MT Jan 6 2014 72

Reasons for mapping to multiple words

Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register

polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)

syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood

ISI Kolkata ML for MT Jan 6 2014 73

Estimate of corpora requirement

Assume that on an average a sentence is 10 words long

an additional 9 translation pairs for getting at one of the 5 mappings

10 sentences per mapping per word

a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences

Estimate is not wide off the mark

Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs

ISI Kolkata ML for MT Jan 6 2014 74

Our work on factor based SMT

Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009

Jan 6 2014 ISI Kolkata ML for MT 75

Case Marker and Morphology crucial in E-H MT

Order of magnitiude facelift in Fluency and fidelity

Determined by the combination of suffixes and semantic relations on the English side

Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers

Jan 6 2014 ISI Kolkata ML for MT 76

Semantic relations+SuffixesCase

Markers+inflections

I ate mangoes

I ltagt ate eatpast mangoes ltobj

I ltagt mangoes ltobjpl eatpast

mei_ne aam khaa_yaa

Jan 6 2014 ISI Kolkata ML for MT 77

Our Approach

Factored model (Koehn and Hoang 2007) with the following translation factor

suffix + semantic relation case markersuffix

Experiments with the following relations

Dependency relations from the stanford parser

Deeper semantic roles from Universal Networking Language (UNL)

Jan 6 2014 ISI Kolkata ML for MT 78

Our Factorization

Jan 6 2014 ISI Kolkata ML for MT 79

Experiments

Jan 6 2014 ISI Kolkata ML for MT 80

Corpus Statistics

Jan 6 2014 ISI Kolkata ML for MT 81

Results The impact of suffix and semantic factors

Jan 6 2014 ISI Kolkata ML for MT 82

Results The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 83

Subjective Evaluation The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 84

Impact of sentence length (F Fluency AAdequacy E Errors)

Jan 6 2014 ISI Kolkata ML for MT 85

A feel for the improvement-baseline

Jan 6 2014 ISI Kolkata ML for MT 86

A feel for the improvement-reorder

Jan 6 2014 ISI Kolkata ML for MT 87

A feel for the improvement-Semantic relation

Jan 6 2014 ISI Kolkata ML for MT 88

A recent study

PAN Indian SMT

Jan 6 2014 ISI Kolkata ML for MT 89

Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator

SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi

Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English

Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences

Evaluation with BLEU METEOR scores also show high correlation with BLEU

Jan 6 2014 ISI Kolkata ML for MT 90

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Evaluation of Marathi to Hindi MT System (cont)

Subjective evaluation on translation quality Evaluated on 500 web sentences Accuracy calculated based on score given according to the

translation quality Accuracy 6532

Result analysis

Morph POS tagger chunker gives more than 90 precision but Transfer WSD generator modules are below 80 hence degrades MT quality

Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Jan 6 2014 ISI Kolkata ML for MT 33

Statistical Machine Translation

Jan 6 2014 ISI Kolkata ML for MT 34

Czeck-English data

[nesu] ldquoI carryrdquo

[ponese] ldquoHe will carryrdquo

[nese] ldquoHe carriesrdquo

[nesou] ldquoThey carryrdquo

[yedu] ldquoI driverdquo

[plavou] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 35

To translate hellip

I will carry

They drive

He swims

They will drive

Jan 6 2014 ISI Kolkata ML for MT 36

Hindi-English data

[DhotA huM] ldquoI carryrdquo

[DhoegA] ldquoHe will carryrdquo

[DhotA hAi] ldquoHe carriesrdquo

[Dhote hAi] ldquoThey carryrdquo

[chalAtA huM] ldquoI driverdquo

[tErte hEM] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 37

Bangla-English data

[bai] ldquoI carryrdquo

[baibe] ldquoHe will carryrdquo

[bay] ldquoHe carriesrdquo

[bay] ldquoThey carryrdquo

[chAlAi] ldquoI driverdquo

[sAMtrAy] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 38

To translate hellip (repeated)

I will carry

They drive

He swims

They will drive

Jan 6 2014 ISI Kolkata ML for MT 39

Foundation

Data driven approach Goal is to find out the English sentence e

given foreign language sentence f whose p(e|f) is maximum

Translations are generated on the basis of statistical model

Parameters are estimated using bilingual parallel corpora

Jan 6 2014 ISI Kolkata ML for MT 40

SMT Language Model

To detect good English sentences

Probability of an English sentence w1w2 helliphellip wn can be written as

Pr(w1w2 helliphellip wn) = Pr(w1) Pr(w2|w1) Pr(wn|w1 w2 wn-1)

Here Pr(wn|w1 w2 wn-1) is the probability that word wn follows word string w1 w2 wn-1 N-gram model probability

Trigram model probability calculation

Jan 6 2014 ISI Kolkata ML for MT 41

SMT Translation Model

P(f|e) Probability of some f given hypothesis English translation e

How to assign the values to p(e|f)

Sentences are infinite not possible to find pair(ef) for all sentences

Introduce a hidden variable a that represents alignments between the individual words in the sentence pair

Sentence level

Word level

Jan 6 2014 ISI Kolkata ML for MT 42

Alignment

If the string e= e1l= e1 e2 hellipel has l words and

the string f= f1m=f1f2fm has m words

then the alignment a can be represented by a series a1

m= a1a2am of m values each between 0 and l such that if the word in position j of the f-string is connected to the word in position i of the e-string then aj= i and

if it is not connected to any English word then aj= O

Jan 6 2014 ISI Kolkata ML for MT 43

Example of alignment

English Ram went to school

Hindi Raama paathashaalaa gayaa

Ram went to school

ltNullgt Raama paathashaalaa gayaa

Jan 6 2014 ISI Kolkata ML for MT 44

Translation Model Exact expression

Five models for estimating parameters in the expression [2]

Model-1 Model-2 Model-3 Model-4 Model-5

Choose alignment given e and m

Choose the identity of foreign word given e m a

Choose the length of foreign language string given e

Jan 6 2014 ISI Kolkata ML for MT 45

a

eafef )|Pr()|Pr(

m

emafeaf )|Pr()|Pr(

m

emafememaf )|Pr()|Pr()|Pr(

m

emafem )|Pr()|Pr(

m

m

j

jj

jj emfaafem1

1

1

1

1 )|Pr()|Pr(

m

j

jj

j

jj

j

m

emfafemfaaem1

1

11

1

1

1

1 )|Pr()|Pr()|Pr(

)|Pr( emaf )|Pr( em

m

j

jj

j

jj

j emfafemfaa1

1

11

1

1

1

1 )|Pr()|Pr(

Proof of Translation Model Exact

expression

m is fixed for a particular f hence

marginalization

marginalization

Jan 6 2014 ISI Kolkata ML for MT 46

Alignment

Jan 6 2014 ISI Kolkata ML for MT 47

Fundamental and ubiquitous

Spell checking

Translation

Transliteration

Speech to text

Text to speeh

Jan 6 2014 ISI Kolkata ML for MT 48

EM for word alignment from

sentence alignment example

English

(1) three rabbits

a b

(2) rabbits of Grenoble

b c d

French

(1) trois lapins

w x

(2) lapins de Grenoble

x y z

Jan 6 2014 ISI Kolkata ML for MT 49

Initial Probabilities

each cell denotes t(a w) t(a x) etc

a b c d

w 14 14 14 14

x 14 14 14 14

y 14 14 14 14

z 14 14 14 14

The counts in IBM Model 1

Works by maximizing P(f|e) over the entire corpus

For IBM Model 1 we get the following relationship

c (w f |w e f e ) =t (w f |w e )

t (w f |we

0 ) +hellip + t (w f |wel )

c (w f |w e f e ) is the fractional count of the alignment of w f

with w e in f and e

t (w f |w e ) is the probability of w f being the translation of w e

is the count of w f in f

is the count of w e in e

Jan 6 2014 ISI Kolkata ML for MT 51

Example of expected count

C[aw (a b)(w x)]

t(aw)

= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)

t(aw)+t(ax)

14

= ----------------- X 1 X 1= 12

14+14

Jan 6 2014 ISI Kolkata ML for MT 52

ldquocountsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 13 13 13

y 0 13 13 13

z 0 13 13 13

a b

w x

a b c d

w 12 12 0 0

x 12 12 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 53

Revised probability example

trevised(a w)

12

= -------------------------------------------------------------------

(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)

Jan 6 2014 ISI Kolkata ML for MT 54

Revised probabilities table

a b c d

w 12 14 0 0

x 12 512 13 13

y 0 16 13 13

z 0 16 13 13

ldquorevised countsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 59 13 13

y 0 29 13 13

z 0 29 13 13

a b

w x

a b c d

w 12 38 0 0

x 12 58 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 56

Re-Revised probabilities table a b c d

w 12 316 0 0

x 12 85144 13 13

y 0 19 13 13

z 0 19 13 13

Continue until convergence notice that (bx) binding gets progressively stronger

b=rabbits x=lapins

Derivation of EM based Alignment Expressions

Hindi)(Say language ofy vocabular

English)(Say language ofry vocalbula

2

1

LV

LV

F

E

what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet

जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet

E1

F1

E2

F2

Jan 6 2014 ISI Kolkata ML for MT 58

Vocabulary mapping

Vocabulary VE VF

what is in a name that which we call rose by any other will smell as sweet

naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii

Jan 6 2014 ISI Kolkata ML for MT 59

Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like

11989011 11989012 hellip 119890

11198971 119891

11 11989112 hellip 119891

11198981

11989021 11989022 hellip 119890

21198972 119891

21 11989122 hellip 119891

21198982

1198901199041 119890

1199042 hellip 119890

119904119897119904 119891

1199041 1198911199042 hellip 119891

119904119898119904

1198901198781 119890

1198782 hellip 119890

119878119897119878 119891

1198781 1198911198782 hellip 119891

119878119898119878

No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890

119904119901 =Index of English word 119890119904119901in English vocabularydictionary

119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary

(Thanks to Sachin Pawar for helping with the maths formulae processing)

Jan 6 2014 ISI Kolkata ML for MT 60

Hidden variables and parameters

Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878

119904=1 where each hidden variable is

as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French

word 119911119901119902119904 = 0 otherwise

Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893

word in French vocabulary

Jan 6 2014 ISI Kolkata ML for MT 61

Likelihoods Data Likelihood L(D Θ)

Data Log-Likelihood LL(D Θ)

Expected value of Data Log-Likelihood E(LL(D Θ))

Jan 6 2014 ISI Kolkata ML for MT 62

Constraint and Lagrangian

119875119894119895

119881119865

119895=1

= 1 forall119894

Jan 6 2014 ISI Kolkata ML for MT 63

Differentiating wrt Pij

Jan 6 2014 ISI Kolkata ML for MT 64

Final E and M steps

M-step

E-step

Jan 6 2014 ISI Kolkata ML for MT 65

Combinatorial considerations

Jan 6 2014 ISI Kolkata ML for MT 66

Example

ISI Kolkata ML for MT Jan 6 2014 67

All possible alignments

ISI Kolkata ML for MT Jan 6 2014 68

First fundamental requirement of SMT

Alignment requires evidence of

bull firstly a translation pair to introduce the

POSSIBILITY of a mapping

bull then another pair to establish with

CERTAINTY the mapping

ISI Kolkata ML for MT Jan 6 2014 69

For the ldquocertaintyrdquo

We have a translation pair containing alignment candidates and none of the other words in the translation pair

OR

We have a translation pair containing all words in the translation pair except the alignment candidates

ISI Kolkata ML for MT Jan 6 2014 70

Thereforehellip

If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty

ISI Kolkata ML for MT Jan 6 2014 71

Rough estimate of data requirement

SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge

ie no meanings or grammatical properties of any words phrases or sentences

Each language has a vocabulary of 100000 words can give rise to about 500000 word forms

through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟

bdquowent‟ and bdquogone‟

ISI Kolkata ML for MT Jan 6 2014 72

Reasons for mapping to multiple words

Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register

polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)

syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood

ISI Kolkata ML for MT Jan 6 2014 73

Estimate of corpora requirement

Assume that on an average a sentence is 10 words long

an additional 9 translation pairs for getting at one of the 5 mappings

10 sentences per mapping per word

a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences

Estimate is not wide off the mark

Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs

ISI Kolkata ML for MT Jan 6 2014 74

Our work on factor based SMT

Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009

Jan 6 2014 ISI Kolkata ML for MT 75

Case Marker and Morphology crucial in E-H MT

Order of magnitiude facelift in Fluency and fidelity

Determined by the combination of suffixes and semantic relations on the English side

Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers

Jan 6 2014 ISI Kolkata ML for MT 76

Semantic relations+SuffixesCase

Markers+inflections

I ate mangoes

I ltagt ate eatpast mangoes ltobj

I ltagt mangoes ltobjpl eatpast

mei_ne aam khaa_yaa

Jan 6 2014 ISI Kolkata ML for MT 77

Our Approach

Factored model (Koehn and Hoang 2007) with the following translation factor

suffix + semantic relation case markersuffix

Experiments with the following relations

Dependency relations from the stanford parser

Deeper semantic roles from Universal Networking Language (UNL)

Jan 6 2014 ISI Kolkata ML for MT 78

Our Factorization

Jan 6 2014 ISI Kolkata ML for MT 79

Experiments

Jan 6 2014 ISI Kolkata ML for MT 80

Corpus Statistics

Jan 6 2014 ISI Kolkata ML for MT 81

Results The impact of suffix and semantic factors

Jan 6 2014 ISI Kolkata ML for MT 82

Results The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 83

Subjective Evaluation The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 84

Impact of sentence length (F Fluency AAdequacy E Errors)

Jan 6 2014 ISI Kolkata ML for MT 85

A feel for the improvement-baseline

Jan 6 2014 ISI Kolkata ML for MT 86

A feel for the improvement-reorder

Jan 6 2014 ISI Kolkata ML for MT 87

A feel for the improvement-Semantic relation

Jan 6 2014 ISI Kolkata ML for MT 88

A recent study

PAN Indian SMT

Jan 6 2014 ISI Kolkata ML for MT 89

Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator

SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi

Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English

Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences

Evaluation with BLEU METEOR scores also show high correlation with BLEU

Jan 6 2014 ISI Kolkata ML for MT 90

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Statistical Machine Translation

Jan 6 2014 ISI Kolkata ML for MT 34

Czeck-English data

[nesu] ldquoI carryrdquo

[ponese] ldquoHe will carryrdquo

[nese] ldquoHe carriesrdquo

[nesou] ldquoThey carryrdquo

[yedu] ldquoI driverdquo

[plavou] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 35

To translate hellip

I will carry

They drive

He swims

They will drive

Jan 6 2014 ISI Kolkata ML for MT 36

Hindi-English data

[DhotA huM] ldquoI carryrdquo

[DhoegA] ldquoHe will carryrdquo

[DhotA hAi] ldquoHe carriesrdquo

[Dhote hAi] ldquoThey carryrdquo

[chalAtA huM] ldquoI driverdquo

[tErte hEM] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 37

Bangla-English data

[bai] ldquoI carryrdquo

[baibe] ldquoHe will carryrdquo

[bay] ldquoHe carriesrdquo

[bay] ldquoThey carryrdquo

[chAlAi] ldquoI driverdquo

[sAMtrAy] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 38

To translate hellip (repeated)

I will carry

They drive

He swims

They will drive

Jan 6 2014 ISI Kolkata ML for MT 39

Foundation

Data driven approach Goal is to find out the English sentence e

given foreign language sentence f whose p(e|f) is maximum

Translations are generated on the basis of statistical model

Parameters are estimated using bilingual parallel corpora

Jan 6 2014 ISI Kolkata ML for MT 40

SMT Language Model

To detect good English sentences

Probability of an English sentence w1w2 helliphellip wn can be written as

Pr(w1w2 helliphellip wn) = Pr(w1) Pr(w2|w1) Pr(wn|w1 w2 wn-1)

Here Pr(wn|w1 w2 wn-1) is the probability that word wn follows word string w1 w2 wn-1 N-gram model probability

Trigram model probability calculation

Jan 6 2014 ISI Kolkata ML for MT 41

SMT Translation Model

P(f|e) Probability of some f given hypothesis English translation e

How to assign the values to p(e|f)

Sentences are infinite not possible to find pair(ef) for all sentences

Introduce a hidden variable a that represents alignments between the individual words in the sentence pair

Sentence level

Word level

Jan 6 2014 ISI Kolkata ML for MT 42

Alignment

If the string e= e1l= e1 e2 hellipel has l words and

the string f= f1m=f1f2fm has m words

then the alignment a can be represented by a series a1

m= a1a2am of m values each between 0 and l such that if the word in position j of the f-string is connected to the word in position i of the e-string then aj= i and

if it is not connected to any English word then aj= O

Jan 6 2014 ISI Kolkata ML for MT 43

Example of alignment

English Ram went to school

Hindi Raama paathashaalaa gayaa

Ram went to school

ltNullgt Raama paathashaalaa gayaa

Jan 6 2014 ISI Kolkata ML for MT 44

Translation Model Exact expression

Five models for estimating parameters in the expression [2]

Model-1 Model-2 Model-3 Model-4 Model-5

Choose alignment given e and m

Choose the identity of foreign word given e m a

Choose the length of foreign language string given e

Jan 6 2014 ISI Kolkata ML for MT 45

a

eafef )|Pr()|Pr(

m

emafeaf )|Pr()|Pr(

m

emafememaf )|Pr()|Pr()|Pr(

m

emafem )|Pr()|Pr(

m

m

j

jj

jj emfaafem1

1

1

1

1 )|Pr()|Pr(

m

j

jj

j

jj

j

m

emfafemfaaem1

1

11

1

1

1

1 )|Pr()|Pr()|Pr(

)|Pr( emaf )|Pr( em

m

j

jj

j

jj

j emfafemfaa1

1

11

1

1

1

1 )|Pr()|Pr(

Proof of Translation Model Exact

expression

m is fixed for a particular f hence

marginalization

marginalization

Jan 6 2014 ISI Kolkata ML for MT 46

Alignment

Jan 6 2014 ISI Kolkata ML for MT 47

Fundamental and ubiquitous

Spell checking

Translation

Transliteration

Speech to text

Text to speeh

Jan 6 2014 ISI Kolkata ML for MT 48

EM for word alignment from

sentence alignment example

English

(1) three rabbits

a b

(2) rabbits of Grenoble

b c d

French

(1) trois lapins

w x

(2) lapins de Grenoble

x y z

Jan 6 2014 ISI Kolkata ML for MT 49

Initial Probabilities

each cell denotes t(a w) t(a x) etc

a b c d

w 14 14 14 14

x 14 14 14 14

y 14 14 14 14

z 14 14 14 14

The counts in IBM Model 1

Works by maximizing P(f|e) over the entire corpus

For IBM Model 1 we get the following relationship

c (w f |w e f e ) =t (w f |w e )

t (w f |we

0 ) +hellip + t (w f |wel )

c (w f |w e f e ) is the fractional count of the alignment of w f

with w e in f and e

t (w f |w e ) is the probability of w f being the translation of w e

is the count of w f in f

is the count of w e in e

Jan 6 2014 ISI Kolkata ML for MT 51

Example of expected count

C[aw (a b)(w x)]

t(aw)

= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)

t(aw)+t(ax)

14

= ----------------- X 1 X 1= 12

14+14

Jan 6 2014 ISI Kolkata ML for MT 52

ldquocountsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 13 13 13

y 0 13 13 13

z 0 13 13 13

a b

w x

a b c d

w 12 12 0 0

x 12 12 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 53

Revised probability example

trevised(a w)

12

= -------------------------------------------------------------------

(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)

Jan 6 2014 ISI Kolkata ML for MT 54

Revised probabilities table

a b c d

w 12 14 0 0

x 12 512 13 13

y 0 16 13 13

z 0 16 13 13

ldquorevised countsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 59 13 13

y 0 29 13 13

z 0 29 13 13

a b

w x

a b c d

w 12 38 0 0

x 12 58 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 56

Re-Revised probabilities table a b c d

w 12 316 0 0

x 12 85144 13 13

y 0 19 13 13

z 0 19 13 13

Continue until convergence notice that (bx) binding gets progressively stronger

b=rabbits x=lapins

Derivation of EM based Alignment Expressions

Hindi)(Say language ofy vocabular

English)(Say language ofry vocalbula

2

1

LV

LV

F

E

what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet

जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet

E1

F1

E2

F2

Jan 6 2014 ISI Kolkata ML for MT 58

Vocabulary mapping

Vocabulary VE VF

what is in a name that which we call rose by any other will smell as sweet

naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii

Jan 6 2014 ISI Kolkata ML for MT 59

Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like

11989011 11989012 hellip 119890

11198971 119891

11 11989112 hellip 119891

11198981

11989021 11989022 hellip 119890

21198972 119891

21 11989122 hellip 119891

21198982

1198901199041 119890

1199042 hellip 119890

119904119897119904 119891

1199041 1198911199042 hellip 119891

119904119898119904

1198901198781 119890

1198782 hellip 119890

119878119897119878 119891

1198781 1198911198782 hellip 119891

119878119898119878

No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890

119904119901 =Index of English word 119890119904119901in English vocabularydictionary

119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary

(Thanks to Sachin Pawar for helping with the maths formulae processing)

Jan 6 2014 ISI Kolkata ML for MT 60

Hidden variables and parameters

Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878

119904=1 where each hidden variable is

as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French

word 119911119901119902119904 = 0 otherwise

Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893

word in French vocabulary

Jan 6 2014 ISI Kolkata ML for MT 61

Likelihoods Data Likelihood L(D Θ)

Data Log-Likelihood LL(D Θ)

Expected value of Data Log-Likelihood E(LL(D Θ))

Jan 6 2014 ISI Kolkata ML for MT 62

Constraint and Lagrangian

119875119894119895

119881119865

119895=1

= 1 forall119894

Jan 6 2014 ISI Kolkata ML for MT 63

Differentiating wrt Pij

Jan 6 2014 ISI Kolkata ML for MT 64

Final E and M steps

M-step

E-step

Jan 6 2014 ISI Kolkata ML for MT 65

Combinatorial considerations

Jan 6 2014 ISI Kolkata ML for MT 66

Example

ISI Kolkata ML for MT Jan 6 2014 67

All possible alignments

ISI Kolkata ML for MT Jan 6 2014 68

First fundamental requirement of SMT

Alignment requires evidence of

bull firstly a translation pair to introduce the

POSSIBILITY of a mapping

bull then another pair to establish with

CERTAINTY the mapping

ISI Kolkata ML for MT Jan 6 2014 69

For the ldquocertaintyrdquo

We have a translation pair containing alignment candidates and none of the other words in the translation pair

OR

We have a translation pair containing all words in the translation pair except the alignment candidates

ISI Kolkata ML for MT Jan 6 2014 70

Thereforehellip

If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty

ISI Kolkata ML for MT Jan 6 2014 71

Rough estimate of data requirement

SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge

ie no meanings or grammatical properties of any words phrases or sentences

Each language has a vocabulary of 100000 words can give rise to about 500000 word forms

through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟

bdquowent‟ and bdquogone‟

ISI Kolkata ML for MT Jan 6 2014 72

Reasons for mapping to multiple words

Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register

polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)

syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood

ISI Kolkata ML for MT Jan 6 2014 73

Estimate of corpora requirement

Assume that on an average a sentence is 10 words long

an additional 9 translation pairs for getting at one of the 5 mappings

10 sentences per mapping per word

a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences

Estimate is not wide off the mark

Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs

ISI Kolkata ML for MT Jan 6 2014 74

Our work on factor based SMT

Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009

Jan 6 2014 ISI Kolkata ML for MT 75

Case Marker and Morphology crucial in E-H MT

Order of magnitiude facelift in Fluency and fidelity

Determined by the combination of suffixes and semantic relations on the English side

Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers

Jan 6 2014 ISI Kolkata ML for MT 76

Semantic relations+SuffixesCase

Markers+inflections

I ate mangoes

I ltagt ate eatpast mangoes ltobj

I ltagt mangoes ltobjpl eatpast

mei_ne aam khaa_yaa

Jan 6 2014 ISI Kolkata ML for MT 77

Our Approach

Factored model (Koehn and Hoang 2007) with the following translation factor

suffix + semantic relation case markersuffix

Experiments with the following relations

Dependency relations from the stanford parser

Deeper semantic roles from Universal Networking Language (UNL)

Jan 6 2014 ISI Kolkata ML for MT 78

Our Factorization

Jan 6 2014 ISI Kolkata ML for MT 79

Experiments

Jan 6 2014 ISI Kolkata ML for MT 80

Corpus Statistics

Jan 6 2014 ISI Kolkata ML for MT 81

Results The impact of suffix and semantic factors

Jan 6 2014 ISI Kolkata ML for MT 82

Results The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 83

Subjective Evaluation The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 84

Impact of sentence length (F Fluency AAdequacy E Errors)

Jan 6 2014 ISI Kolkata ML for MT 85

A feel for the improvement-baseline

Jan 6 2014 ISI Kolkata ML for MT 86

A feel for the improvement-reorder

Jan 6 2014 ISI Kolkata ML for MT 87

A feel for the improvement-Semantic relation

Jan 6 2014 ISI Kolkata ML for MT 88

A recent study

PAN Indian SMT

Jan 6 2014 ISI Kolkata ML for MT 89

Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator

SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi

Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English

Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences

Evaluation with BLEU METEOR scores also show high correlation with BLEU

Jan 6 2014 ISI Kolkata ML for MT 90

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Czeck-English data

[nesu] ldquoI carryrdquo

[ponese] ldquoHe will carryrdquo

[nese] ldquoHe carriesrdquo

[nesou] ldquoThey carryrdquo

[yedu] ldquoI driverdquo

[plavou] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 35

To translate hellip

I will carry

They drive

He swims

They will drive

Jan 6 2014 ISI Kolkata ML for MT 36

Hindi-English data

[DhotA huM] ldquoI carryrdquo

[DhoegA] ldquoHe will carryrdquo

[DhotA hAi] ldquoHe carriesrdquo

[Dhote hAi] ldquoThey carryrdquo

[chalAtA huM] ldquoI driverdquo

[tErte hEM] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 37

Bangla-English data

[bai] ldquoI carryrdquo

[baibe] ldquoHe will carryrdquo

[bay] ldquoHe carriesrdquo

[bay] ldquoThey carryrdquo

[chAlAi] ldquoI driverdquo

[sAMtrAy] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 38

To translate hellip (repeated)

I will carry

They drive

He swims

They will drive

Jan 6 2014 ISI Kolkata ML for MT 39

Foundation

Data driven approach Goal is to find out the English sentence e

given foreign language sentence f whose p(e|f) is maximum

Translations are generated on the basis of statistical model

Parameters are estimated using bilingual parallel corpora

Jan 6 2014 ISI Kolkata ML for MT 40

SMT Language Model

To detect good English sentences

Probability of an English sentence w1w2 helliphellip wn can be written as

Pr(w1w2 helliphellip wn) = Pr(w1) Pr(w2|w1) Pr(wn|w1 w2 wn-1)

Here Pr(wn|w1 w2 wn-1) is the probability that word wn follows word string w1 w2 wn-1 N-gram model probability

Trigram model probability calculation

Jan 6 2014 ISI Kolkata ML for MT 41

SMT Translation Model

P(f|e) Probability of some f given hypothesis English translation e

How to assign the values to p(e|f)

Sentences are infinite not possible to find pair(ef) for all sentences

Introduce a hidden variable a that represents alignments between the individual words in the sentence pair

Sentence level

Word level

Jan 6 2014 ISI Kolkata ML for MT 42

Alignment

If the string e= e1l= e1 e2 hellipel has l words and

the string f= f1m=f1f2fm has m words

then the alignment a can be represented by a series a1

m= a1a2am of m values each between 0 and l such that if the word in position j of the f-string is connected to the word in position i of the e-string then aj= i and

if it is not connected to any English word then aj= O

Jan 6 2014 ISI Kolkata ML for MT 43

Example of alignment

English Ram went to school

Hindi Raama paathashaalaa gayaa

Ram went to school

ltNullgt Raama paathashaalaa gayaa

Jan 6 2014 ISI Kolkata ML for MT 44

Translation Model Exact expression

Five models for estimating parameters in the expression [2]

Model-1 Model-2 Model-3 Model-4 Model-5

Choose alignment given e and m

Choose the identity of foreign word given e m a

Choose the length of foreign language string given e

Jan 6 2014 ISI Kolkata ML for MT 45

a

eafef )|Pr()|Pr(

m

emafeaf )|Pr()|Pr(

m

emafememaf )|Pr()|Pr()|Pr(

m

emafem )|Pr()|Pr(

m

m

j

jj

jj emfaafem1

1

1

1

1 )|Pr()|Pr(

m

j

jj

j

jj

j

m

emfafemfaaem1

1

11

1

1

1

1 )|Pr()|Pr()|Pr(

)|Pr( emaf )|Pr( em

m

j

jj

j

jj

j emfafemfaa1

1

11

1

1

1

1 )|Pr()|Pr(

Proof of Translation Model Exact

expression

m is fixed for a particular f hence

marginalization

marginalization

Jan 6 2014 ISI Kolkata ML for MT 46

Alignment

Jan 6 2014 ISI Kolkata ML for MT 47

Fundamental and ubiquitous

Spell checking

Translation

Transliteration

Speech to text

Text to speeh

Jan 6 2014 ISI Kolkata ML for MT 48

EM for word alignment from

sentence alignment example

English

(1) three rabbits

a b

(2) rabbits of Grenoble

b c d

French

(1) trois lapins

w x

(2) lapins de Grenoble

x y z

Jan 6 2014 ISI Kolkata ML for MT 49

Initial Probabilities

each cell denotes t(a w) t(a x) etc

a b c d

w 14 14 14 14

x 14 14 14 14

y 14 14 14 14

z 14 14 14 14

The counts in IBM Model 1

Works by maximizing P(f|e) over the entire corpus

For IBM Model 1 we get the following relationship

c (w f |w e f e ) =t (w f |w e )

t (w f |we

0 ) +hellip + t (w f |wel )

c (w f |w e f e ) is the fractional count of the alignment of w f

with w e in f and e

t (w f |w e ) is the probability of w f being the translation of w e

is the count of w f in f

is the count of w e in e

Jan 6 2014 ISI Kolkata ML for MT 51

Example of expected count

C[aw (a b)(w x)]

t(aw)

= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)

t(aw)+t(ax)

14

= ----------------- X 1 X 1= 12

14+14

Jan 6 2014 ISI Kolkata ML for MT 52

ldquocountsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 13 13 13

y 0 13 13 13

z 0 13 13 13

a b

w x

a b c d

w 12 12 0 0

x 12 12 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 53

Revised probability example

trevised(a w)

12

= -------------------------------------------------------------------

(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)

Jan 6 2014 ISI Kolkata ML for MT 54

Revised probabilities table

a b c d

w 12 14 0 0

x 12 512 13 13

y 0 16 13 13

z 0 16 13 13

ldquorevised countsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 59 13 13

y 0 29 13 13

z 0 29 13 13

a b

w x

a b c d

w 12 38 0 0

x 12 58 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 56

Re-Revised probabilities table a b c d

w 12 316 0 0

x 12 85144 13 13

y 0 19 13 13

z 0 19 13 13

Continue until convergence notice that (bx) binding gets progressively stronger

b=rabbits x=lapins

Derivation of EM based Alignment Expressions

Hindi)(Say language ofy vocabular

English)(Say language ofry vocalbula

2

1

LV

LV

F

E

what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet

जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet

E1

F1

E2

F2

Jan 6 2014 ISI Kolkata ML for MT 58

Vocabulary mapping

Vocabulary VE VF

what is in a name that which we call rose by any other will smell as sweet

naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii

Jan 6 2014 ISI Kolkata ML for MT 59

Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like

11989011 11989012 hellip 119890

11198971 119891

11 11989112 hellip 119891

11198981

11989021 11989022 hellip 119890

21198972 119891

21 11989122 hellip 119891

21198982

1198901199041 119890

1199042 hellip 119890

119904119897119904 119891

1199041 1198911199042 hellip 119891

119904119898119904

1198901198781 119890

1198782 hellip 119890

119878119897119878 119891

1198781 1198911198782 hellip 119891

119878119898119878

No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890

119904119901 =Index of English word 119890119904119901in English vocabularydictionary

119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary

(Thanks to Sachin Pawar for helping with the maths formulae processing)

Jan 6 2014 ISI Kolkata ML for MT 60

Hidden variables and parameters

Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878

119904=1 where each hidden variable is

as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French

word 119911119901119902119904 = 0 otherwise

Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893

word in French vocabulary

Jan 6 2014 ISI Kolkata ML for MT 61

Likelihoods Data Likelihood L(D Θ)

Data Log-Likelihood LL(D Θ)

Expected value of Data Log-Likelihood E(LL(D Θ))

Jan 6 2014 ISI Kolkata ML for MT 62

Constraint and Lagrangian

119875119894119895

119881119865

119895=1

= 1 forall119894

Jan 6 2014 ISI Kolkata ML for MT 63

Differentiating wrt Pij

Jan 6 2014 ISI Kolkata ML for MT 64

Final E and M steps

M-step

E-step

Jan 6 2014 ISI Kolkata ML for MT 65

Combinatorial considerations

Jan 6 2014 ISI Kolkata ML for MT 66

Example

ISI Kolkata ML for MT Jan 6 2014 67

All possible alignments

ISI Kolkata ML for MT Jan 6 2014 68

First fundamental requirement of SMT

Alignment requires evidence of

bull firstly a translation pair to introduce the

POSSIBILITY of a mapping

bull then another pair to establish with

CERTAINTY the mapping

ISI Kolkata ML for MT Jan 6 2014 69

For the ldquocertaintyrdquo

We have a translation pair containing alignment candidates and none of the other words in the translation pair

OR

We have a translation pair containing all words in the translation pair except the alignment candidates

ISI Kolkata ML for MT Jan 6 2014 70

Thereforehellip

If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty

ISI Kolkata ML for MT Jan 6 2014 71

Rough estimate of data requirement

SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge

ie no meanings or grammatical properties of any words phrases or sentences

Each language has a vocabulary of 100000 words can give rise to about 500000 word forms

through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟

bdquowent‟ and bdquogone‟

ISI Kolkata ML for MT Jan 6 2014 72

Reasons for mapping to multiple words

Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register

polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)

syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood

ISI Kolkata ML for MT Jan 6 2014 73

Estimate of corpora requirement

Assume that on an average a sentence is 10 words long

an additional 9 translation pairs for getting at one of the 5 mappings

10 sentences per mapping per word

a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences

Estimate is not wide off the mark

Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs

ISI Kolkata ML for MT Jan 6 2014 74

Our work on factor based SMT

Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009

Jan 6 2014 ISI Kolkata ML for MT 75

Case Marker and Morphology crucial in E-H MT

Order of magnitiude facelift in Fluency and fidelity

Determined by the combination of suffixes and semantic relations on the English side

Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers

Jan 6 2014 ISI Kolkata ML for MT 76

Semantic relations+SuffixesCase

Markers+inflections

I ate mangoes

I ltagt ate eatpast mangoes ltobj

I ltagt mangoes ltobjpl eatpast

mei_ne aam khaa_yaa

Jan 6 2014 ISI Kolkata ML for MT 77

Our Approach

Factored model (Koehn and Hoang 2007) with the following translation factor

suffix + semantic relation case markersuffix

Experiments with the following relations

Dependency relations from the stanford parser

Deeper semantic roles from Universal Networking Language (UNL)

Jan 6 2014 ISI Kolkata ML for MT 78

Our Factorization

Jan 6 2014 ISI Kolkata ML for MT 79

Experiments

Jan 6 2014 ISI Kolkata ML for MT 80

Corpus Statistics

Jan 6 2014 ISI Kolkata ML for MT 81

Results The impact of suffix and semantic factors

Jan 6 2014 ISI Kolkata ML for MT 82

Results The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 83

Subjective Evaluation The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 84

Impact of sentence length (F Fluency AAdequacy E Errors)

Jan 6 2014 ISI Kolkata ML for MT 85

A feel for the improvement-baseline

Jan 6 2014 ISI Kolkata ML for MT 86

A feel for the improvement-reorder

Jan 6 2014 ISI Kolkata ML for MT 87

A feel for the improvement-Semantic relation

Jan 6 2014 ISI Kolkata ML for MT 88

A recent study

PAN Indian SMT

Jan 6 2014 ISI Kolkata ML for MT 89

Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator

SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi

Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English

Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences

Evaluation with BLEU METEOR scores also show high correlation with BLEU

Jan 6 2014 ISI Kolkata ML for MT 90

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

To translate hellip

I will carry

They drive

He swims

They will drive

Jan 6 2014 ISI Kolkata ML for MT 36

Hindi-English data

[DhotA huM] ldquoI carryrdquo

[DhoegA] ldquoHe will carryrdquo

[DhotA hAi] ldquoHe carriesrdquo

[Dhote hAi] ldquoThey carryrdquo

[chalAtA huM] ldquoI driverdquo

[tErte hEM] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 37

Bangla-English data

[bai] ldquoI carryrdquo

[baibe] ldquoHe will carryrdquo

[bay] ldquoHe carriesrdquo

[bay] ldquoThey carryrdquo

[chAlAi] ldquoI driverdquo

[sAMtrAy] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 38

To translate hellip (repeated)

I will carry

They drive

He swims

They will drive

Jan 6 2014 ISI Kolkata ML for MT 39

Foundation

Data driven approach Goal is to find out the English sentence e

given foreign language sentence f whose p(e|f) is maximum

Translations are generated on the basis of statistical model

Parameters are estimated using bilingual parallel corpora

Jan 6 2014 ISI Kolkata ML for MT 40

SMT Language Model

To detect good English sentences

Probability of an English sentence w1w2 helliphellip wn can be written as

Pr(w1w2 helliphellip wn) = Pr(w1) Pr(w2|w1) Pr(wn|w1 w2 wn-1)

Here Pr(wn|w1 w2 wn-1) is the probability that word wn follows word string w1 w2 wn-1 N-gram model probability

Trigram model probability calculation

Jan 6 2014 ISI Kolkata ML for MT 41

SMT Translation Model

P(f|e) Probability of some f given hypothesis English translation e

How to assign the values to p(e|f)

Sentences are infinite not possible to find pair(ef) for all sentences

Introduce a hidden variable a that represents alignments between the individual words in the sentence pair

Sentence level

Word level

Jan 6 2014 ISI Kolkata ML for MT 42

Alignment

If the string e= e1l= e1 e2 hellipel has l words and

the string f= f1m=f1f2fm has m words

then the alignment a can be represented by a series a1

m= a1a2am of m values each between 0 and l such that if the word in position j of the f-string is connected to the word in position i of the e-string then aj= i and

if it is not connected to any English word then aj= O

Jan 6 2014 ISI Kolkata ML for MT 43

Example of alignment

English Ram went to school

Hindi Raama paathashaalaa gayaa

Ram went to school

ltNullgt Raama paathashaalaa gayaa

Jan 6 2014 ISI Kolkata ML for MT 44

Translation Model Exact expression

Five models for estimating parameters in the expression [2]

Model-1 Model-2 Model-3 Model-4 Model-5

Choose alignment given e and m

Choose the identity of foreign word given e m a

Choose the length of foreign language string given e

Jan 6 2014 ISI Kolkata ML for MT 45

a

eafef )|Pr()|Pr(

m

emafeaf )|Pr()|Pr(

m

emafememaf )|Pr()|Pr()|Pr(

m

emafem )|Pr()|Pr(

m

m

j

jj

jj emfaafem1

1

1

1

1 )|Pr()|Pr(

m

j

jj

j

jj

j

m

emfafemfaaem1

1

11

1

1

1

1 )|Pr()|Pr()|Pr(

)|Pr( emaf )|Pr( em

m

j

jj

j

jj

j emfafemfaa1

1

11

1

1

1

1 )|Pr()|Pr(

Proof of Translation Model Exact

expression

m is fixed for a particular f hence

marginalization

marginalization

Jan 6 2014 ISI Kolkata ML for MT 46

Alignment

Jan 6 2014 ISI Kolkata ML for MT 47

Fundamental and ubiquitous

Spell checking

Translation

Transliteration

Speech to text

Text to speeh

Jan 6 2014 ISI Kolkata ML for MT 48

EM for word alignment from

sentence alignment example

English

(1) three rabbits

a b

(2) rabbits of Grenoble

b c d

French

(1) trois lapins

w x

(2) lapins de Grenoble

x y z

Jan 6 2014 ISI Kolkata ML for MT 49

Initial Probabilities

each cell denotes t(a w) t(a x) etc

a b c d

w 14 14 14 14

x 14 14 14 14

y 14 14 14 14

z 14 14 14 14

The counts in IBM Model 1

Works by maximizing P(f|e) over the entire corpus

For IBM Model 1 we get the following relationship

c (w f |w e f e ) =t (w f |w e )

t (w f |we

0 ) +hellip + t (w f |wel )

c (w f |w e f e ) is the fractional count of the alignment of w f

with w e in f and e

t (w f |w e ) is the probability of w f being the translation of w e

is the count of w f in f

is the count of w e in e

Jan 6 2014 ISI Kolkata ML for MT 51

Example of expected count

C[aw (a b)(w x)]

t(aw)

= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)

t(aw)+t(ax)

14

= ----------------- X 1 X 1= 12

14+14

Jan 6 2014 ISI Kolkata ML for MT 52

ldquocountsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 13 13 13

y 0 13 13 13

z 0 13 13 13

a b

w x

a b c d

w 12 12 0 0

x 12 12 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 53

Revised probability example

trevised(a w)

12

= -------------------------------------------------------------------

(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)

Jan 6 2014 ISI Kolkata ML for MT 54

Revised probabilities table

a b c d

w 12 14 0 0

x 12 512 13 13

y 0 16 13 13

z 0 16 13 13

ldquorevised countsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 59 13 13

y 0 29 13 13

z 0 29 13 13

a b

w x

a b c d

w 12 38 0 0

x 12 58 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 56

Re-Revised probabilities table a b c d

w 12 316 0 0

x 12 85144 13 13

y 0 19 13 13

z 0 19 13 13

Continue until convergence notice that (bx) binding gets progressively stronger

b=rabbits x=lapins

Derivation of EM based Alignment Expressions

Hindi)(Say language ofy vocabular

English)(Say language ofry vocalbula

2

1

LV

LV

F

E

what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet

जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet

E1

F1

E2

F2

Jan 6 2014 ISI Kolkata ML for MT 58

Vocabulary mapping

Vocabulary VE VF

what is in a name that which we call rose by any other will smell as sweet

naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii

Jan 6 2014 ISI Kolkata ML for MT 59

Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like

11989011 11989012 hellip 119890

11198971 119891

11 11989112 hellip 119891

11198981

11989021 11989022 hellip 119890

21198972 119891

21 11989122 hellip 119891

21198982

1198901199041 119890

1199042 hellip 119890

119904119897119904 119891

1199041 1198911199042 hellip 119891

119904119898119904

1198901198781 119890

1198782 hellip 119890

119878119897119878 119891

1198781 1198911198782 hellip 119891

119878119898119878

No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890

119904119901 =Index of English word 119890119904119901in English vocabularydictionary

119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary

(Thanks to Sachin Pawar for helping with the maths formulae processing)

Jan 6 2014 ISI Kolkata ML for MT 60

Hidden variables and parameters

Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878

119904=1 where each hidden variable is

as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French

word 119911119901119902119904 = 0 otherwise

Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893

word in French vocabulary

Jan 6 2014 ISI Kolkata ML for MT 61

Likelihoods Data Likelihood L(D Θ)

Data Log-Likelihood LL(D Θ)

Expected value of Data Log-Likelihood E(LL(D Θ))

Jan 6 2014 ISI Kolkata ML for MT 62

Constraint and Lagrangian

119875119894119895

119881119865

119895=1

= 1 forall119894

Jan 6 2014 ISI Kolkata ML for MT 63

Differentiating wrt Pij

Jan 6 2014 ISI Kolkata ML for MT 64

Final E and M steps

M-step

E-step

Jan 6 2014 ISI Kolkata ML for MT 65

Combinatorial considerations

Jan 6 2014 ISI Kolkata ML for MT 66

Example

ISI Kolkata ML for MT Jan 6 2014 67

All possible alignments

ISI Kolkata ML for MT Jan 6 2014 68

First fundamental requirement of SMT

Alignment requires evidence of

bull firstly a translation pair to introduce the

POSSIBILITY of a mapping

bull then another pair to establish with

CERTAINTY the mapping

ISI Kolkata ML for MT Jan 6 2014 69

For the ldquocertaintyrdquo

We have a translation pair containing alignment candidates and none of the other words in the translation pair

OR

We have a translation pair containing all words in the translation pair except the alignment candidates

ISI Kolkata ML for MT Jan 6 2014 70

Thereforehellip

If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty

ISI Kolkata ML for MT Jan 6 2014 71

Rough estimate of data requirement

SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge

ie no meanings or grammatical properties of any words phrases or sentences

Each language has a vocabulary of 100000 words can give rise to about 500000 word forms

through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟

bdquowent‟ and bdquogone‟

ISI Kolkata ML for MT Jan 6 2014 72

Reasons for mapping to multiple words

Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register

polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)

syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood

ISI Kolkata ML for MT Jan 6 2014 73

Estimate of corpora requirement

Assume that on an average a sentence is 10 words long

an additional 9 translation pairs for getting at one of the 5 mappings

10 sentences per mapping per word

a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences

Estimate is not wide off the mark

Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs

ISI Kolkata ML for MT Jan 6 2014 74

Our work on factor based SMT

Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009

Jan 6 2014 ISI Kolkata ML for MT 75

Case Marker and Morphology crucial in E-H MT

Order of magnitiude facelift in Fluency and fidelity

Determined by the combination of suffixes and semantic relations on the English side

Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers

Jan 6 2014 ISI Kolkata ML for MT 76

Semantic relations+SuffixesCase

Markers+inflections

I ate mangoes

I ltagt ate eatpast mangoes ltobj

I ltagt mangoes ltobjpl eatpast

mei_ne aam khaa_yaa

Jan 6 2014 ISI Kolkata ML for MT 77

Our Approach

Factored model (Koehn and Hoang 2007) with the following translation factor

suffix + semantic relation case markersuffix

Experiments with the following relations

Dependency relations from the stanford parser

Deeper semantic roles from Universal Networking Language (UNL)

Jan 6 2014 ISI Kolkata ML for MT 78

Our Factorization

Jan 6 2014 ISI Kolkata ML for MT 79

Experiments

Jan 6 2014 ISI Kolkata ML for MT 80

Corpus Statistics

Jan 6 2014 ISI Kolkata ML for MT 81

Results The impact of suffix and semantic factors

Jan 6 2014 ISI Kolkata ML for MT 82

Results The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 83

Subjective Evaluation The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 84

Impact of sentence length (F Fluency AAdequacy E Errors)

Jan 6 2014 ISI Kolkata ML for MT 85

A feel for the improvement-baseline

Jan 6 2014 ISI Kolkata ML for MT 86

A feel for the improvement-reorder

Jan 6 2014 ISI Kolkata ML for MT 87

A feel for the improvement-Semantic relation

Jan 6 2014 ISI Kolkata ML for MT 88

A recent study

PAN Indian SMT

Jan 6 2014 ISI Kolkata ML for MT 89

Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator

SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi

Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English

Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences

Evaluation with BLEU METEOR scores also show high correlation with BLEU

Jan 6 2014 ISI Kolkata ML for MT 90

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Hindi-English data

[DhotA huM] ldquoI carryrdquo

[DhoegA] ldquoHe will carryrdquo

[DhotA hAi] ldquoHe carriesrdquo

[Dhote hAi] ldquoThey carryrdquo

[chalAtA huM] ldquoI driverdquo

[tErte hEM] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 37

Bangla-English data

[bai] ldquoI carryrdquo

[baibe] ldquoHe will carryrdquo

[bay] ldquoHe carriesrdquo

[bay] ldquoThey carryrdquo

[chAlAi] ldquoI driverdquo

[sAMtrAy] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 38

To translate hellip (repeated)

I will carry

They drive

He swims

They will drive

Jan 6 2014 ISI Kolkata ML for MT 39

Foundation

Data driven approach Goal is to find out the English sentence e

given foreign language sentence f whose p(e|f) is maximum

Translations are generated on the basis of statistical model

Parameters are estimated using bilingual parallel corpora

Jan 6 2014 ISI Kolkata ML for MT 40

SMT Language Model

To detect good English sentences

Probability of an English sentence w1w2 helliphellip wn can be written as

Pr(w1w2 helliphellip wn) = Pr(w1) Pr(w2|w1) Pr(wn|w1 w2 wn-1)

Here Pr(wn|w1 w2 wn-1) is the probability that word wn follows word string w1 w2 wn-1 N-gram model probability

Trigram model probability calculation

Jan 6 2014 ISI Kolkata ML for MT 41

SMT Translation Model

P(f|e) Probability of some f given hypothesis English translation e

How to assign the values to p(e|f)

Sentences are infinite not possible to find pair(ef) for all sentences

Introduce a hidden variable a that represents alignments between the individual words in the sentence pair

Sentence level

Word level

Jan 6 2014 ISI Kolkata ML for MT 42

Alignment

If the string e= e1l= e1 e2 hellipel has l words and

the string f= f1m=f1f2fm has m words

then the alignment a can be represented by a series a1

m= a1a2am of m values each between 0 and l such that if the word in position j of the f-string is connected to the word in position i of the e-string then aj= i and

if it is not connected to any English word then aj= O

Jan 6 2014 ISI Kolkata ML for MT 43

Example of alignment

English Ram went to school

Hindi Raama paathashaalaa gayaa

Ram went to school

ltNullgt Raama paathashaalaa gayaa

Jan 6 2014 ISI Kolkata ML for MT 44

Translation Model Exact expression

Five models for estimating parameters in the expression [2]

Model-1 Model-2 Model-3 Model-4 Model-5

Choose alignment given e and m

Choose the identity of foreign word given e m a

Choose the length of foreign language string given e

Jan 6 2014 ISI Kolkata ML for MT 45

a

eafef )|Pr()|Pr(

m

emafeaf )|Pr()|Pr(

m

emafememaf )|Pr()|Pr()|Pr(

m

emafem )|Pr()|Pr(

m

m

j

jj

jj emfaafem1

1

1

1

1 )|Pr()|Pr(

m

j

jj

j

jj

j

m

emfafemfaaem1

1

11

1

1

1

1 )|Pr()|Pr()|Pr(

)|Pr( emaf )|Pr( em

m

j

jj

j

jj

j emfafemfaa1

1

11

1

1

1

1 )|Pr()|Pr(

Proof of Translation Model Exact

expression

m is fixed for a particular f hence

marginalization

marginalization

Jan 6 2014 ISI Kolkata ML for MT 46

Alignment

Jan 6 2014 ISI Kolkata ML for MT 47

Fundamental and ubiquitous

Spell checking

Translation

Transliteration

Speech to text

Text to speeh

Jan 6 2014 ISI Kolkata ML for MT 48

EM for word alignment from

sentence alignment example

English

(1) three rabbits

a b

(2) rabbits of Grenoble

b c d

French

(1) trois lapins

w x

(2) lapins de Grenoble

x y z

Jan 6 2014 ISI Kolkata ML for MT 49

Initial Probabilities

each cell denotes t(a w) t(a x) etc

a b c d

w 14 14 14 14

x 14 14 14 14

y 14 14 14 14

z 14 14 14 14

The counts in IBM Model 1

Works by maximizing P(f|e) over the entire corpus

For IBM Model 1 we get the following relationship

c (w f |w e f e ) =t (w f |w e )

t (w f |we

0 ) +hellip + t (w f |wel )

c (w f |w e f e ) is the fractional count of the alignment of w f

with w e in f and e

t (w f |w e ) is the probability of w f being the translation of w e

is the count of w f in f

is the count of w e in e

Jan 6 2014 ISI Kolkata ML for MT 51

Example of expected count

C[aw (a b)(w x)]

t(aw)

= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)

t(aw)+t(ax)

14

= ----------------- X 1 X 1= 12

14+14

Jan 6 2014 ISI Kolkata ML for MT 52

ldquocountsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 13 13 13

y 0 13 13 13

z 0 13 13 13

a b

w x

a b c d

w 12 12 0 0

x 12 12 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 53

Revised probability example

trevised(a w)

12

= -------------------------------------------------------------------

(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)

Jan 6 2014 ISI Kolkata ML for MT 54

Revised probabilities table

a b c d

w 12 14 0 0

x 12 512 13 13

y 0 16 13 13

z 0 16 13 13

ldquorevised countsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 59 13 13

y 0 29 13 13

z 0 29 13 13

a b

w x

a b c d

w 12 38 0 0

x 12 58 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 56

Re-Revised probabilities table a b c d

w 12 316 0 0

x 12 85144 13 13

y 0 19 13 13

z 0 19 13 13

Continue until convergence notice that (bx) binding gets progressively stronger

b=rabbits x=lapins

Derivation of EM based Alignment Expressions

Hindi)(Say language ofy vocabular

English)(Say language ofry vocalbula

2

1

LV

LV

F

E

what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet

जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet

E1

F1

E2

F2

Jan 6 2014 ISI Kolkata ML for MT 58

Vocabulary mapping

Vocabulary VE VF

what is in a name that which we call rose by any other will smell as sweet

naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii

Jan 6 2014 ISI Kolkata ML for MT 59

Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like

11989011 11989012 hellip 119890

11198971 119891

11 11989112 hellip 119891

11198981

11989021 11989022 hellip 119890

21198972 119891

21 11989122 hellip 119891

21198982

1198901199041 119890

1199042 hellip 119890

119904119897119904 119891

1199041 1198911199042 hellip 119891

119904119898119904

1198901198781 119890

1198782 hellip 119890

119878119897119878 119891

1198781 1198911198782 hellip 119891

119878119898119878

No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890

119904119901 =Index of English word 119890119904119901in English vocabularydictionary

119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary

(Thanks to Sachin Pawar for helping with the maths formulae processing)

Jan 6 2014 ISI Kolkata ML for MT 60

Hidden variables and parameters

Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878

119904=1 where each hidden variable is

as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French

word 119911119901119902119904 = 0 otherwise

Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893

word in French vocabulary

Jan 6 2014 ISI Kolkata ML for MT 61

Likelihoods Data Likelihood L(D Θ)

Data Log-Likelihood LL(D Θ)

Expected value of Data Log-Likelihood E(LL(D Θ))

Jan 6 2014 ISI Kolkata ML for MT 62

Constraint and Lagrangian

119875119894119895

119881119865

119895=1

= 1 forall119894

Jan 6 2014 ISI Kolkata ML for MT 63

Differentiating wrt Pij

Jan 6 2014 ISI Kolkata ML for MT 64

Final E and M steps

M-step

E-step

Jan 6 2014 ISI Kolkata ML for MT 65

Combinatorial considerations

Jan 6 2014 ISI Kolkata ML for MT 66

Example

ISI Kolkata ML for MT Jan 6 2014 67

All possible alignments

ISI Kolkata ML for MT Jan 6 2014 68

First fundamental requirement of SMT

Alignment requires evidence of

bull firstly a translation pair to introduce the

POSSIBILITY of a mapping

bull then another pair to establish with

CERTAINTY the mapping

ISI Kolkata ML for MT Jan 6 2014 69

For the ldquocertaintyrdquo

We have a translation pair containing alignment candidates and none of the other words in the translation pair

OR

We have a translation pair containing all words in the translation pair except the alignment candidates

ISI Kolkata ML for MT Jan 6 2014 70

Thereforehellip

If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty

ISI Kolkata ML for MT Jan 6 2014 71

Rough estimate of data requirement

SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge

ie no meanings or grammatical properties of any words phrases or sentences

Each language has a vocabulary of 100000 words can give rise to about 500000 word forms

through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟

bdquowent‟ and bdquogone‟

ISI Kolkata ML for MT Jan 6 2014 72

Reasons for mapping to multiple words

Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register

polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)

syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood

ISI Kolkata ML for MT Jan 6 2014 73

Estimate of corpora requirement

Assume that on an average a sentence is 10 words long

an additional 9 translation pairs for getting at one of the 5 mappings

10 sentences per mapping per word

a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences

Estimate is not wide off the mark

Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs

ISI Kolkata ML for MT Jan 6 2014 74

Our work on factor based SMT

Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009

Jan 6 2014 ISI Kolkata ML for MT 75

Case Marker and Morphology crucial in E-H MT

Order of magnitiude facelift in Fluency and fidelity

Determined by the combination of suffixes and semantic relations on the English side

Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers

Jan 6 2014 ISI Kolkata ML for MT 76

Semantic relations+SuffixesCase

Markers+inflections

I ate mangoes

I ltagt ate eatpast mangoes ltobj

I ltagt mangoes ltobjpl eatpast

mei_ne aam khaa_yaa

Jan 6 2014 ISI Kolkata ML for MT 77

Our Approach

Factored model (Koehn and Hoang 2007) with the following translation factor

suffix + semantic relation case markersuffix

Experiments with the following relations

Dependency relations from the stanford parser

Deeper semantic roles from Universal Networking Language (UNL)

Jan 6 2014 ISI Kolkata ML for MT 78

Our Factorization

Jan 6 2014 ISI Kolkata ML for MT 79

Experiments

Jan 6 2014 ISI Kolkata ML for MT 80

Corpus Statistics

Jan 6 2014 ISI Kolkata ML for MT 81

Results The impact of suffix and semantic factors

Jan 6 2014 ISI Kolkata ML for MT 82

Results The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 83

Subjective Evaluation The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 84

Impact of sentence length (F Fluency AAdequacy E Errors)

Jan 6 2014 ISI Kolkata ML for MT 85

A feel for the improvement-baseline

Jan 6 2014 ISI Kolkata ML for MT 86

A feel for the improvement-reorder

Jan 6 2014 ISI Kolkata ML for MT 87

A feel for the improvement-Semantic relation

Jan 6 2014 ISI Kolkata ML for MT 88

A recent study

PAN Indian SMT

Jan 6 2014 ISI Kolkata ML for MT 89

Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator

SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi

Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English

Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences

Evaluation with BLEU METEOR scores also show high correlation with BLEU

Jan 6 2014 ISI Kolkata ML for MT 90

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Bangla-English data

[bai] ldquoI carryrdquo

[baibe] ldquoHe will carryrdquo

[bay] ldquoHe carriesrdquo

[bay] ldquoThey carryrdquo

[chAlAi] ldquoI driverdquo

[sAMtrAy] ldquoThey swimrdquo

Jan 6 2014 ISI Kolkata ML for MT 38

To translate hellip (repeated)

I will carry

They drive

He swims

They will drive

Jan 6 2014 ISI Kolkata ML for MT 39

Foundation

Data driven approach Goal is to find out the English sentence e

given foreign language sentence f whose p(e|f) is maximum

Translations are generated on the basis of statistical model

Parameters are estimated using bilingual parallel corpora

Jan 6 2014 ISI Kolkata ML for MT 40

SMT Language Model

To detect good English sentences

Probability of an English sentence w1w2 helliphellip wn can be written as

Pr(w1w2 helliphellip wn) = Pr(w1) Pr(w2|w1) Pr(wn|w1 w2 wn-1)

Here Pr(wn|w1 w2 wn-1) is the probability that word wn follows word string w1 w2 wn-1 N-gram model probability

Trigram model probability calculation

Jan 6 2014 ISI Kolkata ML for MT 41

SMT Translation Model

P(f|e) Probability of some f given hypothesis English translation e

How to assign the values to p(e|f)

Sentences are infinite not possible to find pair(ef) for all sentences

Introduce a hidden variable a that represents alignments between the individual words in the sentence pair

Sentence level

Word level

Jan 6 2014 ISI Kolkata ML for MT 42

Alignment

If the string e= e1l= e1 e2 hellipel has l words and

the string f= f1m=f1f2fm has m words

then the alignment a can be represented by a series a1

m= a1a2am of m values each between 0 and l such that if the word in position j of the f-string is connected to the word in position i of the e-string then aj= i and

if it is not connected to any English word then aj= O

Jan 6 2014 ISI Kolkata ML for MT 43

Example of alignment

English Ram went to school

Hindi Raama paathashaalaa gayaa

Ram went to school

ltNullgt Raama paathashaalaa gayaa

Jan 6 2014 ISI Kolkata ML for MT 44

Translation Model Exact expression

Five models for estimating parameters in the expression [2]

Model-1 Model-2 Model-3 Model-4 Model-5

Choose alignment given e and m

Choose the identity of foreign word given e m a

Choose the length of foreign language string given e

Jan 6 2014 ISI Kolkata ML for MT 45

a

eafef )|Pr()|Pr(

m

emafeaf )|Pr()|Pr(

m

emafememaf )|Pr()|Pr()|Pr(

m

emafem )|Pr()|Pr(

m

m

j

jj

jj emfaafem1

1

1

1

1 )|Pr()|Pr(

m

j

jj

j

jj

j

m

emfafemfaaem1

1

11

1

1

1

1 )|Pr()|Pr()|Pr(

)|Pr( emaf )|Pr( em

m

j

jj

j

jj

j emfafemfaa1

1

11

1

1

1

1 )|Pr()|Pr(

Proof of Translation Model Exact

expression

m is fixed for a particular f hence

marginalization

marginalization

Jan 6 2014 ISI Kolkata ML for MT 46

Alignment

Jan 6 2014 ISI Kolkata ML for MT 47

Fundamental and ubiquitous

Spell checking

Translation

Transliteration

Speech to text

Text to speeh

Jan 6 2014 ISI Kolkata ML for MT 48

EM for word alignment from

sentence alignment example

English

(1) three rabbits

a b

(2) rabbits of Grenoble

b c d

French

(1) trois lapins

w x

(2) lapins de Grenoble

x y z

Jan 6 2014 ISI Kolkata ML for MT 49

Initial Probabilities

each cell denotes t(a w) t(a x) etc

a b c d

w 14 14 14 14

x 14 14 14 14

y 14 14 14 14

z 14 14 14 14

The counts in IBM Model 1

Works by maximizing P(f|e) over the entire corpus

For IBM Model 1 we get the following relationship

c (w f |w e f e ) =t (w f |w e )

t (w f |we

0 ) +hellip + t (w f |wel )

c (w f |w e f e ) is the fractional count of the alignment of w f

with w e in f and e

t (w f |w e ) is the probability of w f being the translation of w e

is the count of w f in f

is the count of w e in e

Jan 6 2014 ISI Kolkata ML for MT 51

Example of expected count

C[aw (a b)(w x)]

t(aw)

= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)

t(aw)+t(ax)

14

= ----------------- X 1 X 1= 12

14+14

Jan 6 2014 ISI Kolkata ML for MT 52

ldquocountsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 13 13 13

y 0 13 13 13

z 0 13 13 13

a b

w x

a b c d

w 12 12 0 0

x 12 12 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 53

Revised probability example

trevised(a w)

12

= -------------------------------------------------------------------

(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)

Jan 6 2014 ISI Kolkata ML for MT 54

Revised probabilities table

a b c d

w 12 14 0 0

x 12 512 13 13

y 0 16 13 13

z 0 16 13 13

ldquorevised countsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 59 13 13

y 0 29 13 13

z 0 29 13 13

a b

w x

a b c d

w 12 38 0 0

x 12 58 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 56

Re-Revised probabilities table a b c d

w 12 316 0 0

x 12 85144 13 13

y 0 19 13 13

z 0 19 13 13

Continue until convergence notice that (bx) binding gets progressively stronger

b=rabbits x=lapins

Derivation of EM based Alignment Expressions

Hindi)(Say language ofy vocabular

English)(Say language ofry vocalbula

2

1

LV

LV

F

E

what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet

जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet

E1

F1

E2

F2

Jan 6 2014 ISI Kolkata ML for MT 58

Vocabulary mapping

Vocabulary VE VF

what is in a name that which we call rose by any other will smell as sweet

naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii

Jan 6 2014 ISI Kolkata ML for MT 59

Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like

11989011 11989012 hellip 119890

11198971 119891

11 11989112 hellip 119891

11198981

11989021 11989022 hellip 119890

21198972 119891

21 11989122 hellip 119891

21198982

1198901199041 119890

1199042 hellip 119890

119904119897119904 119891

1199041 1198911199042 hellip 119891

119904119898119904

1198901198781 119890

1198782 hellip 119890

119878119897119878 119891

1198781 1198911198782 hellip 119891

119878119898119878

No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890

119904119901 =Index of English word 119890119904119901in English vocabularydictionary

119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary

(Thanks to Sachin Pawar for helping with the maths formulae processing)

Jan 6 2014 ISI Kolkata ML for MT 60

Hidden variables and parameters

Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878

119904=1 where each hidden variable is

as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French

word 119911119901119902119904 = 0 otherwise

Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893

word in French vocabulary

Jan 6 2014 ISI Kolkata ML for MT 61

Likelihoods Data Likelihood L(D Θ)

Data Log-Likelihood LL(D Θ)

Expected value of Data Log-Likelihood E(LL(D Θ))

Jan 6 2014 ISI Kolkata ML for MT 62

Constraint and Lagrangian

119875119894119895

119881119865

119895=1

= 1 forall119894

Jan 6 2014 ISI Kolkata ML for MT 63

Differentiating wrt Pij

Jan 6 2014 ISI Kolkata ML for MT 64

Final E and M steps

M-step

E-step

Jan 6 2014 ISI Kolkata ML for MT 65

Combinatorial considerations

Jan 6 2014 ISI Kolkata ML for MT 66

Example

ISI Kolkata ML for MT Jan 6 2014 67

All possible alignments

ISI Kolkata ML for MT Jan 6 2014 68

First fundamental requirement of SMT

Alignment requires evidence of

bull firstly a translation pair to introduce the

POSSIBILITY of a mapping

bull then another pair to establish with

CERTAINTY the mapping

ISI Kolkata ML for MT Jan 6 2014 69

For the ldquocertaintyrdquo

We have a translation pair containing alignment candidates and none of the other words in the translation pair

OR

We have a translation pair containing all words in the translation pair except the alignment candidates

ISI Kolkata ML for MT Jan 6 2014 70

Thereforehellip

If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty

ISI Kolkata ML for MT Jan 6 2014 71

Rough estimate of data requirement

SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge

ie no meanings or grammatical properties of any words phrases or sentences

Each language has a vocabulary of 100000 words can give rise to about 500000 word forms

through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟

bdquowent‟ and bdquogone‟

ISI Kolkata ML for MT Jan 6 2014 72

Reasons for mapping to multiple words

Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register

polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)

syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood

ISI Kolkata ML for MT Jan 6 2014 73

Estimate of corpora requirement

Assume that on an average a sentence is 10 words long

an additional 9 translation pairs for getting at one of the 5 mappings

10 sentences per mapping per word

a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences

Estimate is not wide off the mark

Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs

ISI Kolkata ML for MT Jan 6 2014 74

Our work on factor based SMT

Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009

Jan 6 2014 ISI Kolkata ML for MT 75

Case Marker and Morphology crucial in E-H MT

Order of magnitiude facelift in Fluency and fidelity

Determined by the combination of suffixes and semantic relations on the English side

Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers

Jan 6 2014 ISI Kolkata ML for MT 76

Semantic relations+SuffixesCase

Markers+inflections

I ate mangoes

I ltagt ate eatpast mangoes ltobj

I ltagt mangoes ltobjpl eatpast

mei_ne aam khaa_yaa

Jan 6 2014 ISI Kolkata ML for MT 77

Our Approach

Factored model (Koehn and Hoang 2007) with the following translation factor

suffix + semantic relation case markersuffix

Experiments with the following relations

Dependency relations from the stanford parser

Deeper semantic roles from Universal Networking Language (UNL)

Jan 6 2014 ISI Kolkata ML for MT 78

Our Factorization

Jan 6 2014 ISI Kolkata ML for MT 79

Experiments

Jan 6 2014 ISI Kolkata ML for MT 80

Corpus Statistics

Jan 6 2014 ISI Kolkata ML for MT 81

Results The impact of suffix and semantic factors

Jan 6 2014 ISI Kolkata ML for MT 82

Results The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 83

Subjective Evaluation The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 84

Impact of sentence length (F Fluency AAdequacy E Errors)

Jan 6 2014 ISI Kolkata ML for MT 85

A feel for the improvement-baseline

Jan 6 2014 ISI Kolkata ML for MT 86

A feel for the improvement-reorder

Jan 6 2014 ISI Kolkata ML for MT 87

A feel for the improvement-Semantic relation

Jan 6 2014 ISI Kolkata ML for MT 88

A recent study

PAN Indian SMT

Jan 6 2014 ISI Kolkata ML for MT 89

Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator

SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi

Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English

Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences

Evaluation with BLEU METEOR scores also show high correlation with BLEU

Jan 6 2014 ISI Kolkata ML for MT 90

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

To translate hellip (repeated)

I will carry

They drive

He swims

They will drive

Jan 6 2014 ISI Kolkata ML for MT 39

Foundation

Data driven approach Goal is to find out the English sentence e

given foreign language sentence f whose p(e|f) is maximum

Translations are generated on the basis of statistical model

Parameters are estimated using bilingual parallel corpora

Jan 6 2014 ISI Kolkata ML for MT 40

SMT Language Model

To detect good English sentences

Probability of an English sentence w1w2 helliphellip wn can be written as

Pr(w1w2 helliphellip wn) = Pr(w1) Pr(w2|w1) Pr(wn|w1 w2 wn-1)

Here Pr(wn|w1 w2 wn-1) is the probability that word wn follows word string w1 w2 wn-1 N-gram model probability

Trigram model probability calculation

Jan 6 2014 ISI Kolkata ML for MT 41

SMT Translation Model

P(f|e) Probability of some f given hypothesis English translation e

How to assign the values to p(e|f)

Sentences are infinite not possible to find pair(ef) for all sentences

Introduce a hidden variable a that represents alignments between the individual words in the sentence pair

Sentence level

Word level

Jan 6 2014 ISI Kolkata ML for MT 42

Alignment

If the string e= e1l= e1 e2 hellipel has l words and

the string f= f1m=f1f2fm has m words

then the alignment a can be represented by a series a1

m= a1a2am of m values each between 0 and l such that if the word in position j of the f-string is connected to the word in position i of the e-string then aj= i and

if it is not connected to any English word then aj= O

Jan 6 2014 ISI Kolkata ML for MT 43

Example of alignment

English Ram went to school

Hindi Raama paathashaalaa gayaa

Ram went to school

ltNullgt Raama paathashaalaa gayaa

Jan 6 2014 ISI Kolkata ML for MT 44

Translation Model Exact expression

Five models for estimating parameters in the expression [2]

Model-1 Model-2 Model-3 Model-4 Model-5

Choose alignment given e and m

Choose the identity of foreign word given e m a

Choose the length of foreign language string given e

Jan 6 2014 ISI Kolkata ML for MT 45

a

eafef )|Pr()|Pr(

m

emafeaf )|Pr()|Pr(

m

emafememaf )|Pr()|Pr()|Pr(

m

emafem )|Pr()|Pr(

m

m

j

jj

jj emfaafem1

1

1

1

1 )|Pr()|Pr(

m

j

jj

j

jj

j

m

emfafemfaaem1

1

11

1

1

1

1 )|Pr()|Pr()|Pr(

)|Pr( emaf )|Pr( em

m

j

jj

j

jj

j emfafemfaa1

1

11

1

1

1

1 )|Pr()|Pr(

Proof of Translation Model Exact

expression

m is fixed for a particular f hence

marginalization

marginalization

Jan 6 2014 ISI Kolkata ML for MT 46

Alignment

Jan 6 2014 ISI Kolkata ML for MT 47

Fundamental and ubiquitous

Spell checking

Translation

Transliteration

Speech to text

Text to speeh

Jan 6 2014 ISI Kolkata ML for MT 48

EM for word alignment from

sentence alignment example

English

(1) three rabbits

a b

(2) rabbits of Grenoble

b c d

French

(1) trois lapins

w x

(2) lapins de Grenoble

x y z

Jan 6 2014 ISI Kolkata ML for MT 49

Initial Probabilities

each cell denotes t(a w) t(a x) etc

a b c d

w 14 14 14 14

x 14 14 14 14

y 14 14 14 14

z 14 14 14 14

The counts in IBM Model 1

Works by maximizing P(f|e) over the entire corpus

For IBM Model 1 we get the following relationship

c (w f |w e f e ) =t (w f |w e )

t (w f |we

0 ) +hellip + t (w f |wel )

c (w f |w e f e ) is the fractional count of the alignment of w f

with w e in f and e

t (w f |w e ) is the probability of w f being the translation of w e

is the count of w f in f

is the count of w e in e

Jan 6 2014 ISI Kolkata ML for MT 51

Example of expected count

C[aw (a b)(w x)]

t(aw)

= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)

t(aw)+t(ax)

14

= ----------------- X 1 X 1= 12

14+14

Jan 6 2014 ISI Kolkata ML for MT 52

ldquocountsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 13 13 13

y 0 13 13 13

z 0 13 13 13

a b

w x

a b c d

w 12 12 0 0

x 12 12 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 53

Revised probability example

trevised(a w)

12

= -------------------------------------------------------------------

(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)

Jan 6 2014 ISI Kolkata ML for MT 54

Revised probabilities table

a b c d

w 12 14 0 0

x 12 512 13 13

y 0 16 13 13

z 0 16 13 13

ldquorevised countsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 59 13 13

y 0 29 13 13

z 0 29 13 13

a b

w x

a b c d

w 12 38 0 0

x 12 58 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 56

Re-Revised probabilities table a b c d

w 12 316 0 0

x 12 85144 13 13

y 0 19 13 13

z 0 19 13 13

Continue until convergence notice that (bx) binding gets progressively stronger

b=rabbits x=lapins

Derivation of EM based Alignment Expressions

Hindi)(Say language ofy vocabular

English)(Say language ofry vocalbula

2

1

LV

LV

F

E

what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet

जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet

E1

F1

E2

F2

Jan 6 2014 ISI Kolkata ML for MT 58

Vocabulary mapping

Vocabulary VE VF

what is in a name that which we call rose by any other will smell as sweet

naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii

Jan 6 2014 ISI Kolkata ML for MT 59

Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like

11989011 11989012 hellip 119890

11198971 119891

11 11989112 hellip 119891

11198981

11989021 11989022 hellip 119890

21198972 119891

21 11989122 hellip 119891

21198982

1198901199041 119890

1199042 hellip 119890

119904119897119904 119891

1199041 1198911199042 hellip 119891

119904119898119904

1198901198781 119890

1198782 hellip 119890

119878119897119878 119891

1198781 1198911198782 hellip 119891

119878119898119878

No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890

119904119901 =Index of English word 119890119904119901in English vocabularydictionary

119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary

(Thanks to Sachin Pawar for helping with the maths formulae processing)

Jan 6 2014 ISI Kolkata ML for MT 60

Hidden variables and parameters

Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878

119904=1 where each hidden variable is

as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French

word 119911119901119902119904 = 0 otherwise

Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893

word in French vocabulary

Jan 6 2014 ISI Kolkata ML for MT 61

Likelihoods Data Likelihood L(D Θ)

Data Log-Likelihood LL(D Θ)

Expected value of Data Log-Likelihood E(LL(D Θ))

Jan 6 2014 ISI Kolkata ML for MT 62

Constraint and Lagrangian

119875119894119895

119881119865

119895=1

= 1 forall119894

Jan 6 2014 ISI Kolkata ML for MT 63

Differentiating wrt Pij

Jan 6 2014 ISI Kolkata ML for MT 64

Final E and M steps

M-step

E-step

Jan 6 2014 ISI Kolkata ML for MT 65

Combinatorial considerations

Jan 6 2014 ISI Kolkata ML for MT 66

Example

ISI Kolkata ML for MT Jan 6 2014 67

All possible alignments

ISI Kolkata ML for MT Jan 6 2014 68

First fundamental requirement of SMT

Alignment requires evidence of

bull firstly a translation pair to introduce the

POSSIBILITY of a mapping

bull then another pair to establish with

CERTAINTY the mapping

ISI Kolkata ML for MT Jan 6 2014 69

For the ldquocertaintyrdquo

We have a translation pair containing alignment candidates and none of the other words in the translation pair

OR

We have a translation pair containing all words in the translation pair except the alignment candidates

ISI Kolkata ML for MT Jan 6 2014 70

Thereforehellip

If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty

ISI Kolkata ML for MT Jan 6 2014 71

Rough estimate of data requirement

SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge

ie no meanings or grammatical properties of any words phrases or sentences

Each language has a vocabulary of 100000 words can give rise to about 500000 word forms

through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟

bdquowent‟ and bdquogone‟

ISI Kolkata ML for MT Jan 6 2014 72

Reasons for mapping to multiple words

Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register

polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)

syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood

ISI Kolkata ML for MT Jan 6 2014 73

Estimate of corpora requirement

Assume that on an average a sentence is 10 words long

an additional 9 translation pairs for getting at one of the 5 mappings

10 sentences per mapping per word

a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences

Estimate is not wide off the mark

Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs

ISI Kolkata ML for MT Jan 6 2014 74

Our work on factor based SMT

Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009

Jan 6 2014 ISI Kolkata ML for MT 75

Case Marker and Morphology crucial in E-H MT

Order of magnitiude facelift in Fluency and fidelity

Determined by the combination of suffixes and semantic relations on the English side

Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers

Jan 6 2014 ISI Kolkata ML for MT 76

Semantic relations+SuffixesCase

Markers+inflections

I ate mangoes

I ltagt ate eatpast mangoes ltobj

I ltagt mangoes ltobjpl eatpast

mei_ne aam khaa_yaa

Jan 6 2014 ISI Kolkata ML for MT 77

Our Approach

Factored model (Koehn and Hoang 2007) with the following translation factor

suffix + semantic relation case markersuffix

Experiments with the following relations

Dependency relations from the stanford parser

Deeper semantic roles from Universal Networking Language (UNL)

Jan 6 2014 ISI Kolkata ML for MT 78

Our Factorization

Jan 6 2014 ISI Kolkata ML for MT 79

Experiments

Jan 6 2014 ISI Kolkata ML for MT 80

Corpus Statistics

Jan 6 2014 ISI Kolkata ML for MT 81

Results The impact of suffix and semantic factors

Jan 6 2014 ISI Kolkata ML for MT 82

Results The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 83

Subjective Evaluation The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 84

Impact of sentence length (F Fluency AAdequacy E Errors)

Jan 6 2014 ISI Kolkata ML for MT 85

A feel for the improvement-baseline

Jan 6 2014 ISI Kolkata ML for MT 86

A feel for the improvement-reorder

Jan 6 2014 ISI Kolkata ML for MT 87

A feel for the improvement-Semantic relation

Jan 6 2014 ISI Kolkata ML for MT 88

A recent study

PAN Indian SMT

Jan 6 2014 ISI Kolkata ML for MT 89

Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator

SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi

Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English

Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences

Evaluation with BLEU METEOR scores also show high correlation with BLEU

Jan 6 2014 ISI Kolkata ML for MT 90

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Foundation

Data driven approach Goal is to find out the English sentence e

given foreign language sentence f whose p(e|f) is maximum

Translations are generated on the basis of statistical model

Parameters are estimated using bilingual parallel corpora

Jan 6 2014 ISI Kolkata ML for MT 40

SMT Language Model

To detect good English sentences

Probability of an English sentence w1w2 helliphellip wn can be written as

Pr(w1w2 helliphellip wn) = Pr(w1) Pr(w2|w1) Pr(wn|w1 w2 wn-1)

Here Pr(wn|w1 w2 wn-1) is the probability that word wn follows word string w1 w2 wn-1 N-gram model probability

Trigram model probability calculation

Jan 6 2014 ISI Kolkata ML for MT 41

SMT Translation Model

P(f|e) Probability of some f given hypothesis English translation e

How to assign the values to p(e|f)

Sentences are infinite not possible to find pair(ef) for all sentences

Introduce a hidden variable a that represents alignments between the individual words in the sentence pair

Sentence level

Word level

Jan 6 2014 ISI Kolkata ML for MT 42

Alignment

If the string e= e1l= e1 e2 hellipel has l words and

the string f= f1m=f1f2fm has m words

then the alignment a can be represented by a series a1

m= a1a2am of m values each between 0 and l such that if the word in position j of the f-string is connected to the word in position i of the e-string then aj= i and

if it is not connected to any English word then aj= O

Jan 6 2014 ISI Kolkata ML for MT 43

Example of alignment

English Ram went to school

Hindi Raama paathashaalaa gayaa

Ram went to school

ltNullgt Raama paathashaalaa gayaa

Jan 6 2014 ISI Kolkata ML for MT 44

Translation Model Exact expression

Five models for estimating parameters in the expression [2]

Model-1 Model-2 Model-3 Model-4 Model-5

Choose alignment given e and m

Choose the identity of foreign word given e m a

Choose the length of foreign language string given e

Jan 6 2014 ISI Kolkata ML for MT 45

a

eafef )|Pr()|Pr(

m

emafeaf )|Pr()|Pr(

m

emafememaf )|Pr()|Pr()|Pr(

m

emafem )|Pr()|Pr(

m

m

j

jj

jj emfaafem1

1

1

1

1 )|Pr()|Pr(

m

j

jj

j

jj

j

m

emfafemfaaem1

1

11

1

1

1

1 )|Pr()|Pr()|Pr(

)|Pr( emaf )|Pr( em

m

j

jj

j

jj

j emfafemfaa1

1

11

1

1

1

1 )|Pr()|Pr(

Proof of Translation Model Exact

expression

m is fixed for a particular f hence

marginalization

marginalization

Jan 6 2014 ISI Kolkata ML for MT 46

Alignment

Jan 6 2014 ISI Kolkata ML for MT 47

Fundamental and ubiquitous

Spell checking

Translation

Transliteration

Speech to text

Text to speeh

Jan 6 2014 ISI Kolkata ML for MT 48

EM for word alignment from

sentence alignment example

English

(1) three rabbits

a b

(2) rabbits of Grenoble

b c d

French

(1) trois lapins

w x

(2) lapins de Grenoble

x y z

Jan 6 2014 ISI Kolkata ML for MT 49

Initial Probabilities

each cell denotes t(a w) t(a x) etc

a b c d

w 14 14 14 14

x 14 14 14 14

y 14 14 14 14

z 14 14 14 14

The counts in IBM Model 1

Works by maximizing P(f|e) over the entire corpus

For IBM Model 1 we get the following relationship

c (w f |w e f e ) =t (w f |w e )

t (w f |we

0 ) +hellip + t (w f |wel )

c (w f |w e f e ) is the fractional count of the alignment of w f

with w e in f and e

t (w f |w e ) is the probability of w f being the translation of w e

is the count of w f in f

is the count of w e in e

Jan 6 2014 ISI Kolkata ML for MT 51

Example of expected count

C[aw (a b)(w x)]

t(aw)

= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)

t(aw)+t(ax)

14

= ----------------- X 1 X 1= 12

14+14

Jan 6 2014 ISI Kolkata ML for MT 52

ldquocountsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 13 13 13

y 0 13 13 13

z 0 13 13 13

a b

w x

a b c d

w 12 12 0 0

x 12 12 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 53

Revised probability example

trevised(a w)

12

= -------------------------------------------------------------------

(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)

Jan 6 2014 ISI Kolkata ML for MT 54

Revised probabilities table

a b c d

w 12 14 0 0

x 12 512 13 13

y 0 16 13 13

z 0 16 13 13

ldquorevised countsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 59 13 13

y 0 29 13 13

z 0 29 13 13

a b

w x

a b c d

w 12 38 0 0

x 12 58 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 56

Re-Revised probabilities table a b c d

w 12 316 0 0

x 12 85144 13 13

y 0 19 13 13

z 0 19 13 13

Continue until convergence notice that (bx) binding gets progressively stronger

b=rabbits x=lapins

Derivation of EM based Alignment Expressions

Hindi)(Say language ofy vocabular

English)(Say language ofry vocalbula

2

1

LV

LV

F

E

what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet

जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet

E1

F1

E2

F2

Jan 6 2014 ISI Kolkata ML for MT 58

Vocabulary mapping

Vocabulary VE VF

what is in a name that which we call rose by any other will smell as sweet

naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii

Jan 6 2014 ISI Kolkata ML for MT 59

Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like

11989011 11989012 hellip 119890

11198971 119891

11 11989112 hellip 119891

11198981

11989021 11989022 hellip 119890

21198972 119891

21 11989122 hellip 119891

21198982

1198901199041 119890

1199042 hellip 119890

119904119897119904 119891

1199041 1198911199042 hellip 119891

119904119898119904

1198901198781 119890

1198782 hellip 119890

119878119897119878 119891

1198781 1198911198782 hellip 119891

119878119898119878

No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890

119904119901 =Index of English word 119890119904119901in English vocabularydictionary

119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary

(Thanks to Sachin Pawar for helping with the maths formulae processing)

Jan 6 2014 ISI Kolkata ML for MT 60

Hidden variables and parameters

Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878

119904=1 where each hidden variable is

as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French

word 119911119901119902119904 = 0 otherwise

Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893

word in French vocabulary

Jan 6 2014 ISI Kolkata ML for MT 61

Likelihoods Data Likelihood L(D Θ)

Data Log-Likelihood LL(D Θ)

Expected value of Data Log-Likelihood E(LL(D Θ))

Jan 6 2014 ISI Kolkata ML for MT 62

Constraint and Lagrangian

119875119894119895

119881119865

119895=1

= 1 forall119894

Jan 6 2014 ISI Kolkata ML for MT 63

Differentiating wrt Pij

Jan 6 2014 ISI Kolkata ML for MT 64

Final E and M steps

M-step

E-step

Jan 6 2014 ISI Kolkata ML for MT 65

Combinatorial considerations

Jan 6 2014 ISI Kolkata ML for MT 66

Example

ISI Kolkata ML for MT Jan 6 2014 67

All possible alignments

ISI Kolkata ML for MT Jan 6 2014 68

First fundamental requirement of SMT

Alignment requires evidence of

bull firstly a translation pair to introduce the

POSSIBILITY of a mapping

bull then another pair to establish with

CERTAINTY the mapping

ISI Kolkata ML for MT Jan 6 2014 69

For the ldquocertaintyrdquo

We have a translation pair containing alignment candidates and none of the other words in the translation pair

OR

We have a translation pair containing all words in the translation pair except the alignment candidates

ISI Kolkata ML for MT Jan 6 2014 70

Thereforehellip

If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty

ISI Kolkata ML for MT Jan 6 2014 71

Rough estimate of data requirement

SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge

ie no meanings or grammatical properties of any words phrases or sentences

Each language has a vocabulary of 100000 words can give rise to about 500000 word forms

through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟

bdquowent‟ and bdquogone‟

ISI Kolkata ML for MT Jan 6 2014 72

Reasons for mapping to multiple words

Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register

polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)

syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood

ISI Kolkata ML for MT Jan 6 2014 73

Estimate of corpora requirement

Assume that on an average a sentence is 10 words long

an additional 9 translation pairs for getting at one of the 5 mappings

10 sentences per mapping per word

a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences

Estimate is not wide off the mark

Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs

ISI Kolkata ML for MT Jan 6 2014 74

Our work on factor based SMT

Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009

Jan 6 2014 ISI Kolkata ML for MT 75

Case Marker and Morphology crucial in E-H MT

Order of magnitiude facelift in Fluency and fidelity

Determined by the combination of suffixes and semantic relations on the English side

Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers

Jan 6 2014 ISI Kolkata ML for MT 76

Semantic relations+SuffixesCase

Markers+inflections

I ate mangoes

I ltagt ate eatpast mangoes ltobj

I ltagt mangoes ltobjpl eatpast

mei_ne aam khaa_yaa

Jan 6 2014 ISI Kolkata ML for MT 77

Our Approach

Factored model (Koehn and Hoang 2007) with the following translation factor

suffix + semantic relation case markersuffix

Experiments with the following relations

Dependency relations from the stanford parser

Deeper semantic roles from Universal Networking Language (UNL)

Jan 6 2014 ISI Kolkata ML for MT 78

Our Factorization

Jan 6 2014 ISI Kolkata ML for MT 79

Experiments

Jan 6 2014 ISI Kolkata ML for MT 80

Corpus Statistics

Jan 6 2014 ISI Kolkata ML for MT 81

Results The impact of suffix and semantic factors

Jan 6 2014 ISI Kolkata ML for MT 82

Results The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 83

Subjective Evaluation The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 84

Impact of sentence length (F Fluency AAdequacy E Errors)

Jan 6 2014 ISI Kolkata ML for MT 85

A feel for the improvement-baseline

Jan 6 2014 ISI Kolkata ML for MT 86

A feel for the improvement-reorder

Jan 6 2014 ISI Kolkata ML for MT 87

A feel for the improvement-Semantic relation

Jan 6 2014 ISI Kolkata ML for MT 88

A recent study

PAN Indian SMT

Jan 6 2014 ISI Kolkata ML for MT 89

Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator

SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi

Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English

Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences

Evaluation with BLEU METEOR scores also show high correlation with BLEU

Jan 6 2014 ISI Kolkata ML for MT 90

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

SMT Language Model

To detect good English sentences

Probability of an English sentence w1w2 helliphellip wn can be written as

Pr(w1w2 helliphellip wn) = Pr(w1) Pr(w2|w1) Pr(wn|w1 w2 wn-1)

Here Pr(wn|w1 w2 wn-1) is the probability that word wn follows word string w1 w2 wn-1 N-gram model probability

Trigram model probability calculation

Jan 6 2014 ISI Kolkata ML for MT 41

SMT Translation Model

P(f|e) Probability of some f given hypothesis English translation e

How to assign the values to p(e|f)

Sentences are infinite not possible to find pair(ef) for all sentences

Introduce a hidden variable a that represents alignments between the individual words in the sentence pair

Sentence level

Word level

Jan 6 2014 ISI Kolkata ML for MT 42

Alignment

If the string e= e1l= e1 e2 hellipel has l words and

the string f= f1m=f1f2fm has m words

then the alignment a can be represented by a series a1

m= a1a2am of m values each between 0 and l such that if the word in position j of the f-string is connected to the word in position i of the e-string then aj= i and

if it is not connected to any English word then aj= O

Jan 6 2014 ISI Kolkata ML for MT 43

Example of alignment

English Ram went to school

Hindi Raama paathashaalaa gayaa

Ram went to school

ltNullgt Raama paathashaalaa gayaa

Jan 6 2014 ISI Kolkata ML for MT 44

Translation Model Exact expression

Five models for estimating parameters in the expression [2]

Model-1 Model-2 Model-3 Model-4 Model-5

Choose alignment given e and m

Choose the identity of foreign word given e m a

Choose the length of foreign language string given e

Jan 6 2014 ISI Kolkata ML for MT 45

a

eafef )|Pr()|Pr(

m

emafeaf )|Pr()|Pr(

m

emafememaf )|Pr()|Pr()|Pr(

m

emafem )|Pr()|Pr(

m

m

j

jj

jj emfaafem1

1

1

1

1 )|Pr()|Pr(

m

j

jj

j

jj

j

m

emfafemfaaem1

1

11

1

1

1

1 )|Pr()|Pr()|Pr(

)|Pr( emaf )|Pr( em

m

j

jj

j

jj

j emfafemfaa1

1

11

1

1

1

1 )|Pr()|Pr(

Proof of Translation Model Exact

expression

m is fixed for a particular f hence

marginalization

marginalization

Jan 6 2014 ISI Kolkata ML for MT 46

Alignment

Jan 6 2014 ISI Kolkata ML for MT 47

Fundamental and ubiquitous

Spell checking

Translation

Transliteration

Speech to text

Text to speeh

Jan 6 2014 ISI Kolkata ML for MT 48

EM for word alignment from

sentence alignment example

English

(1) three rabbits

a b

(2) rabbits of Grenoble

b c d

French

(1) trois lapins

w x

(2) lapins de Grenoble

x y z

Jan 6 2014 ISI Kolkata ML for MT 49

Initial Probabilities

each cell denotes t(a w) t(a x) etc

a b c d

w 14 14 14 14

x 14 14 14 14

y 14 14 14 14

z 14 14 14 14

The counts in IBM Model 1

Works by maximizing P(f|e) over the entire corpus

For IBM Model 1 we get the following relationship

c (w f |w e f e ) =t (w f |w e )

t (w f |we

0 ) +hellip + t (w f |wel )

c (w f |w e f e ) is the fractional count of the alignment of w f

with w e in f and e

t (w f |w e ) is the probability of w f being the translation of w e

is the count of w f in f

is the count of w e in e

Jan 6 2014 ISI Kolkata ML for MT 51

Example of expected count

C[aw (a b)(w x)]

t(aw)

= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)

t(aw)+t(ax)

14

= ----------------- X 1 X 1= 12

14+14

Jan 6 2014 ISI Kolkata ML for MT 52

ldquocountsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 13 13 13

y 0 13 13 13

z 0 13 13 13

a b

w x

a b c d

w 12 12 0 0

x 12 12 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 53

Revised probability example

trevised(a w)

12

= -------------------------------------------------------------------

(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)

Jan 6 2014 ISI Kolkata ML for MT 54

Revised probabilities table

a b c d

w 12 14 0 0

x 12 512 13 13

y 0 16 13 13

z 0 16 13 13

ldquorevised countsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 59 13 13

y 0 29 13 13

z 0 29 13 13

a b

w x

a b c d

w 12 38 0 0

x 12 58 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 56

Re-Revised probabilities table a b c d

w 12 316 0 0

x 12 85144 13 13

y 0 19 13 13

z 0 19 13 13

Continue until convergence notice that (bx) binding gets progressively stronger

b=rabbits x=lapins

Derivation of EM based Alignment Expressions

Hindi)(Say language ofy vocabular

English)(Say language ofry vocalbula

2

1

LV

LV

F

E

what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet

जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet

E1

F1

E2

F2

Jan 6 2014 ISI Kolkata ML for MT 58

Vocabulary mapping

Vocabulary VE VF

what is in a name that which we call rose by any other will smell as sweet

naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii

Jan 6 2014 ISI Kolkata ML for MT 59

Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like

11989011 11989012 hellip 119890

11198971 119891

11 11989112 hellip 119891

11198981

11989021 11989022 hellip 119890

21198972 119891

21 11989122 hellip 119891

21198982

1198901199041 119890

1199042 hellip 119890

119904119897119904 119891

1199041 1198911199042 hellip 119891

119904119898119904

1198901198781 119890

1198782 hellip 119890

119878119897119878 119891

1198781 1198911198782 hellip 119891

119878119898119878

No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890

119904119901 =Index of English word 119890119904119901in English vocabularydictionary

119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary

(Thanks to Sachin Pawar for helping with the maths formulae processing)

Jan 6 2014 ISI Kolkata ML for MT 60

Hidden variables and parameters

Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878

119904=1 where each hidden variable is

as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French

word 119911119901119902119904 = 0 otherwise

Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893

word in French vocabulary

Jan 6 2014 ISI Kolkata ML for MT 61

Likelihoods Data Likelihood L(D Θ)

Data Log-Likelihood LL(D Θ)

Expected value of Data Log-Likelihood E(LL(D Θ))

Jan 6 2014 ISI Kolkata ML for MT 62

Constraint and Lagrangian

119875119894119895

119881119865

119895=1

= 1 forall119894

Jan 6 2014 ISI Kolkata ML for MT 63

Differentiating wrt Pij

Jan 6 2014 ISI Kolkata ML for MT 64

Final E and M steps

M-step

E-step

Jan 6 2014 ISI Kolkata ML for MT 65

Combinatorial considerations

Jan 6 2014 ISI Kolkata ML for MT 66

Example

ISI Kolkata ML for MT Jan 6 2014 67

All possible alignments

ISI Kolkata ML for MT Jan 6 2014 68

First fundamental requirement of SMT

Alignment requires evidence of

bull firstly a translation pair to introduce the

POSSIBILITY of a mapping

bull then another pair to establish with

CERTAINTY the mapping

ISI Kolkata ML for MT Jan 6 2014 69

For the ldquocertaintyrdquo

We have a translation pair containing alignment candidates and none of the other words in the translation pair

OR

We have a translation pair containing all words in the translation pair except the alignment candidates

ISI Kolkata ML for MT Jan 6 2014 70

Thereforehellip

If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty

ISI Kolkata ML for MT Jan 6 2014 71

Rough estimate of data requirement

SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge

ie no meanings or grammatical properties of any words phrases or sentences

Each language has a vocabulary of 100000 words can give rise to about 500000 word forms

through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟

bdquowent‟ and bdquogone‟

ISI Kolkata ML for MT Jan 6 2014 72

Reasons for mapping to multiple words

Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register

polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)

syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood

ISI Kolkata ML for MT Jan 6 2014 73

Estimate of corpora requirement

Assume that on an average a sentence is 10 words long

an additional 9 translation pairs for getting at one of the 5 mappings

10 sentences per mapping per word

a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences

Estimate is not wide off the mark

Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs

ISI Kolkata ML for MT Jan 6 2014 74

Our work on factor based SMT

Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009

Jan 6 2014 ISI Kolkata ML for MT 75

Case Marker and Morphology crucial in E-H MT

Order of magnitiude facelift in Fluency and fidelity

Determined by the combination of suffixes and semantic relations on the English side

Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers

Jan 6 2014 ISI Kolkata ML for MT 76

Semantic relations+SuffixesCase

Markers+inflections

I ate mangoes

I ltagt ate eatpast mangoes ltobj

I ltagt mangoes ltobjpl eatpast

mei_ne aam khaa_yaa

Jan 6 2014 ISI Kolkata ML for MT 77

Our Approach

Factored model (Koehn and Hoang 2007) with the following translation factor

suffix + semantic relation case markersuffix

Experiments with the following relations

Dependency relations from the stanford parser

Deeper semantic roles from Universal Networking Language (UNL)

Jan 6 2014 ISI Kolkata ML for MT 78

Our Factorization

Jan 6 2014 ISI Kolkata ML for MT 79

Experiments

Jan 6 2014 ISI Kolkata ML for MT 80

Corpus Statistics

Jan 6 2014 ISI Kolkata ML for MT 81

Results The impact of suffix and semantic factors

Jan 6 2014 ISI Kolkata ML for MT 82

Results The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 83

Subjective Evaluation The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 84

Impact of sentence length (F Fluency AAdequacy E Errors)

Jan 6 2014 ISI Kolkata ML for MT 85

A feel for the improvement-baseline

Jan 6 2014 ISI Kolkata ML for MT 86

A feel for the improvement-reorder

Jan 6 2014 ISI Kolkata ML for MT 87

A feel for the improvement-Semantic relation

Jan 6 2014 ISI Kolkata ML for MT 88

A recent study

PAN Indian SMT

Jan 6 2014 ISI Kolkata ML for MT 89

Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator

SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi

Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English

Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences

Evaluation with BLEU METEOR scores also show high correlation with BLEU

Jan 6 2014 ISI Kolkata ML for MT 90

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

SMT Translation Model

P(f|e) Probability of some f given hypothesis English translation e

How to assign the values to p(e|f)

Sentences are infinite not possible to find pair(ef) for all sentences

Introduce a hidden variable a that represents alignments between the individual words in the sentence pair

Sentence level

Word level

Jan 6 2014 ISI Kolkata ML for MT 42

Alignment

If the string e= e1l= e1 e2 hellipel has l words and

the string f= f1m=f1f2fm has m words

then the alignment a can be represented by a series a1

m= a1a2am of m values each between 0 and l such that if the word in position j of the f-string is connected to the word in position i of the e-string then aj= i and

if it is not connected to any English word then aj= O

Jan 6 2014 ISI Kolkata ML for MT 43

Example of alignment

English Ram went to school

Hindi Raama paathashaalaa gayaa

Ram went to school

ltNullgt Raama paathashaalaa gayaa

Jan 6 2014 ISI Kolkata ML for MT 44

Translation Model Exact expression

Five models for estimating parameters in the expression [2]

Model-1 Model-2 Model-3 Model-4 Model-5

Choose alignment given e and m

Choose the identity of foreign word given e m a

Choose the length of foreign language string given e

Jan 6 2014 ISI Kolkata ML for MT 45

a

eafef )|Pr()|Pr(

m

emafeaf )|Pr()|Pr(

m

emafememaf )|Pr()|Pr()|Pr(

m

emafem )|Pr()|Pr(

m

m

j

jj

jj emfaafem1

1

1

1

1 )|Pr()|Pr(

m

j

jj

j

jj

j

m

emfafemfaaem1

1

11

1

1

1

1 )|Pr()|Pr()|Pr(

)|Pr( emaf )|Pr( em

m

j

jj

j

jj

j emfafemfaa1

1

11

1

1

1

1 )|Pr()|Pr(

Proof of Translation Model Exact

expression

m is fixed for a particular f hence

marginalization

marginalization

Jan 6 2014 ISI Kolkata ML for MT 46

Alignment

Jan 6 2014 ISI Kolkata ML for MT 47

Fundamental and ubiquitous

Spell checking

Translation

Transliteration

Speech to text

Text to speeh

Jan 6 2014 ISI Kolkata ML for MT 48

EM for word alignment from

sentence alignment example

English

(1) three rabbits

a b

(2) rabbits of Grenoble

b c d

French

(1) trois lapins

w x

(2) lapins de Grenoble

x y z

Jan 6 2014 ISI Kolkata ML for MT 49

Initial Probabilities

each cell denotes t(a w) t(a x) etc

a b c d

w 14 14 14 14

x 14 14 14 14

y 14 14 14 14

z 14 14 14 14

The counts in IBM Model 1

Works by maximizing P(f|e) over the entire corpus

For IBM Model 1 we get the following relationship

c (w f |w e f e ) =t (w f |w e )

t (w f |we

0 ) +hellip + t (w f |wel )

c (w f |w e f e ) is the fractional count of the alignment of w f

with w e in f and e

t (w f |w e ) is the probability of w f being the translation of w e

is the count of w f in f

is the count of w e in e

Jan 6 2014 ISI Kolkata ML for MT 51

Example of expected count

C[aw (a b)(w x)]

t(aw)

= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)

t(aw)+t(ax)

14

= ----------------- X 1 X 1= 12

14+14

Jan 6 2014 ISI Kolkata ML for MT 52

ldquocountsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 13 13 13

y 0 13 13 13

z 0 13 13 13

a b

w x

a b c d

w 12 12 0 0

x 12 12 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 53

Revised probability example

trevised(a w)

12

= -------------------------------------------------------------------

(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)

Jan 6 2014 ISI Kolkata ML for MT 54

Revised probabilities table

a b c d

w 12 14 0 0

x 12 512 13 13

y 0 16 13 13

z 0 16 13 13

ldquorevised countsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 59 13 13

y 0 29 13 13

z 0 29 13 13

a b

w x

a b c d

w 12 38 0 0

x 12 58 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 56

Re-Revised probabilities table a b c d

w 12 316 0 0

x 12 85144 13 13

y 0 19 13 13

z 0 19 13 13

Continue until convergence notice that (bx) binding gets progressively stronger

b=rabbits x=lapins

Derivation of EM based Alignment Expressions

Hindi)(Say language ofy vocabular

English)(Say language ofry vocalbula

2

1

LV

LV

F

E

what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet

जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet

E1

F1

E2

F2

Jan 6 2014 ISI Kolkata ML for MT 58

Vocabulary mapping

Vocabulary VE VF

what is in a name that which we call rose by any other will smell as sweet

naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii

Jan 6 2014 ISI Kolkata ML for MT 59

Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like

11989011 11989012 hellip 119890

11198971 119891

11 11989112 hellip 119891

11198981

11989021 11989022 hellip 119890

21198972 119891

21 11989122 hellip 119891

21198982

1198901199041 119890

1199042 hellip 119890

119904119897119904 119891

1199041 1198911199042 hellip 119891

119904119898119904

1198901198781 119890

1198782 hellip 119890

119878119897119878 119891

1198781 1198911198782 hellip 119891

119878119898119878

No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890

119904119901 =Index of English word 119890119904119901in English vocabularydictionary

119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary

(Thanks to Sachin Pawar for helping with the maths formulae processing)

Jan 6 2014 ISI Kolkata ML for MT 60

Hidden variables and parameters

Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878

119904=1 where each hidden variable is

as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French

word 119911119901119902119904 = 0 otherwise

Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893

word in French vocabulary

Jan 6 2014 ISI Kolkata ML for MT 61

Likelihoods Data Likelihood L(D Θ)

Data Log-Likelihood LL(D Θ)

Expected value of Data Log-Likelihood E(LL(D Θ))

Jan 6 2014 ISI Kolkata ML for MT 62

Constraint and Lagrangian

119875119894119895

119881119865

119895=1

= 1 forall119894

Jan 6 2014 ISI Kolkata ML for MT 63

Differentiating wrt Pij

Jan 6 2014 ISI Kolkata ML for MT 64

Final E and M steps

M-step

E-step

Jan 6 2014 ISI Kolkata ML for MT 65

Combinatorial considerations

Jan 6 2014 ISI Kolkata ML for MT 66

Example

ISI Kolkata ML for MT Jan 6 2014 67

All possible alignments

ISI Kolkata ML for MT Jan 6 2014 68

First fundamental requirement of SMT

Alignment requires evidence of

bull firstly a translation pair to introduce the

POSSIBILITY of a mapping

bull then another pair to establish with

CERTAINTY the mapping

ISI Kolkata ML for MT Jan 6 2014 69

For the ldquocertaintyrdquo

We have a translation pair containing alignment candidates and none of the other words in the translation pair

OR

We have a translation pair containing all words in the translation pair except the alignment candidates

ISI Kolkata ML for MT Jan 6 2014 70

Thereforehellip

If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty

ISI Kolkata ML for MT Jan 6 2014 71

Rough estimate of data requirement

SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge

ie no meanings or grammatical properties of any words phrases or sentences

Each language has a vocabulary of 100000 words can give rise to about 500000 word forms

through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟

bdquowent‟ and bdquogone‟

ISI Kolkata ML for MT Jan 6 2014 72

Reasons for mapping to multiple words

Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register

polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)

syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood

ISI Kolkata ML for MT Jan 6 2014 73

Estimate of corpora requirement

Assume that on an average a sentence is 10 words long

an additional 9 translation pairs for getting at one of the 5 mappings

10 sentences per mapping per word

a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences

Estimate is not wide off the mark

Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs

ISI Kolkata ML for MT Jan 6 2014 74

Our work on factor based SMT

Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009

Jan 6 2014 ISI Kolkata ML for MT 75

Case Marker and Morphology crucial in E-H MT

Order of magnitiude facelift in Fluency and fidelity

Determined by the combination of suffixes and semantic relations on the English side

Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers

Jan 6 2014 ISI Kolkata ML for MT 76

Semantic relations+SuffixesCase

Markers+inflections

I ate mangoes

I ltagt ate eatpast mangoes ltobj

I ltagt mangoes ltobjpl eatpast

mei_ne aam khaa_yaa

Jan 6 2014 ISI Kolkata ML for MT 77

Our Approach

Factored model (Koehn and Hoang 2007) with the following translation factor

suffix + semantic relation case markersuffix

Experiments with the following relations

Dependency relations from the stanford parser

Deeper semantic roles from Universal Networking Language (UNL)

Jan 6 2014 ISI Kolkata ML for MT 78

Our Factorization

Jan 6 2014 ISI Kolkata ML for MT 79

Experiments

Jan 6 2014 ISI Kolkata ML for MT 80

Corpus Statistics

Jan 6 2014 ISI Kolkata ML for MT 81

Results The impact of suffix and semantic factors

Jan 6 2014 ISI Kolkata ML for MT 82

Results The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 83

Subjective Evaluation The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 84

Impact of sentence length (F Fluency AAdequacy E Errors)

Jan 6 2014 ISI Kolkata ML for MT 85

A feel for the improvement-baseline

Jan 6 2014 ISI Kolkata ML for MT 86

A feel for the improvement-reorder

Jan 6 2014 ISI Kolkata ML for MT 87

A feel for the improvement-Semantic relation

Jan 6 2014 ISI Kolkata ML for MT 88

A recent study

PAN Indian SMT

Jan 6 2014 ISI Kolkata ML for MT 89

Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator

SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi

Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English

Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences

Evaluation with BLEU METEOR scores also show high correlation with BLEU

Jan 6 2014 ISI Kolkata ML for MT 90

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Alignment

If the string e= e1l= e1 e2 hellipel has l words and

the string f= f1m=f1f2fm has m words

then the alignment a can be represented by a series a1

m= a1a2am of m values each between 0 and l such that if the word in position j of the f-string is connected to the word in position i of the e-string then aj= i and

if it is not connected to any English word then aj= O

Jan 6 2014 ISI Kolkata ML for MT 43

Example of alignment

English Ram went to school

Hindi Raama paathashaalaa gayaa

Ram went to school

ltNullgt Raama paathashaalaa gayaa

Jan 6 2014 ISI Kolkata ML for MT 44

Translation Model Exact expression

Five models for estimating parameters in the expression [2]

Model-1 Model-2 Model-3 Model-4 Model-5

Choose alignment given e and m

Choose the identity of foreign word given e m a

Choose the length of foreign language string given e

Jan 6 2014 ISI Kolkata ML for MT 45

a

eafef )|Pr()|Pr(

m

emafeaf )|Pr()|Pr(

m

emafememaf )|Pr()|Pr()|Pr(

m

emafem )|Pr()|Pr(

m

m

j

jj

jj emfaafem1

1

1

1

1 )|Pr()|Pr(

m

j

jj

j

jj

j

m

emfafemfaaem1

1

11

1

1

1

1 )|Pr()|Pr()|Pr(

)|Pr( emaf )|Pr( em

m

j

jj

j

jj

j emfafemfaa1

1

11

1

1

1

1 )|Pr()|Pr(

Proof of Translation Model Exact

expression

m is fixed for a particular f hence

marginalization

marginalization

Jan 6 2014 ISI Kolkata ML for MT 46

Alignment

Jan 6 2014 ISI Kolkata ML for MT 47

Fundamental and ubiquitous

Spell checking

Translation

Transliteration

Speech to text

Text to speeh

Jan 6 2014 ISI Kolkata ML for MT 48

EM for word alignment from

sentence alignment example

English

(1) three rabbits

a b

(2) rabbits of Grenoble

b c d

French

(1) trois lapins

w x

(2) lapins de Grenoble

x y z

Jan 6 2014 ISI Kolkata ML for MT 49

Initial Probabilities

each cell denotes t(a w) t(a x) etc

a b c d

w 14 14 14 14

x 14 14 14 14

y 14 14 14 14

z 14 14 14 14

The counts in IBM Model 1

Works by maximizing P(f|e) over the entire corpus

For IBM Model 1 we get the following relationship

c (w f |w e f e ) =t (w f |w e )

t (w f |we

0 ) +hellip + t (w f |wel )

c (w f |w e f e ) is the fractional count of the alignment of w f

with w e in f and e

t (w f |w e ) is the probability of w f being the translation of w e

is the count of w f in f

is the count of w e in e

Jan 6 2014 ISI Kolkata ML for MT 51

Example of expected count

C[aw (a b)(w x)]

t(aw)

= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)

t(aw)+t(ax)

14

= ----------------- X 1 X 1= 12

14+14

Jan 6 2014 ISI Kolkata ML for MT 52

ldquocountsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 13 13 13

y 0 13 13 13

z 0 13 13 13

a b

w x

a b c d

w 12 12 0 0

x 12 12 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 53

Revised probability example

trevised(a w)

12

= -------------------------------------------------------------------

(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)

Jan 6 2014 ISI Kolkata ML for MT 54

Revised probabilities table

a b c d

w 12 14 0 0

x 12 512 13 13

y 0 16 13 13

z 0 16 13 13

ldquorevised countsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 59 13 13

y 0 29 13 13

z 0 29 13 13

a b

w x

a b c d

w 12 38 0 0

x 12 58 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 56

Re-Revised probabilities table a b c d

w 12 316 0 0

x 12 85144 13 13

y 0 19 13 13

z 0 19 13 13

Continue until convergence notice that (bx) binding gets progressively stronger

b=rabbits x=lapins

Derivation of EM based Alignment Expressions

Hindi)(Say language ofy vocabular

English)(Say language ofry vocalbula

2

1

LV

LV

F

E

what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet

जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet

E1

F1

E2

F2

Jan 6 2014 ISI Kolkata ML for MT 58

Vocabulary mapping

Vocabulary VE VF

what is in a name that which we call rose by any other will smell as sweet

naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii

Jan 6 2014 ISI Kolkata ML for MT 59

Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like

11989011 11989012 hellip 119890

11198971 119891

11 11989112 hellip 119891

11198981

11989021 11989022 hellip 119890

21198972 119891

21 11989122 hellip 119891

21198982

1198901199041 119890

1199042 hellip 119890

119904119897119904 119891

1199041 1198911199042 hellip 119891

119904119898119904

1198901198781 119890

1198782 hellip 119890

119878119897119878 119891

1198781 1198911198782 hellip 119891

119878119898119878

No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890

119904119901 =Index of English word 119890119904119901in English vocabularydictionary

119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary

(Thanks to Sachin Pawar for helping with the maths formulae processing)

Jan 6 2014 ISI Kolkata ML for MT 60

Hidden variables and parameters

Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878

119904=1 where each hidden variable is

as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French

word 119911119901119902119904 = 0 otherwise

Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893

word in French vocabulary

Jan 6 2014 ISI Kolkata ML for MT 61

Likelihoods Data Likelihood L(D Θ)

Data Log-Likelihood LL(D Θ)

Expected value of Data Log-Likelihood E(LL(D Θ))

Jan 6 2014 ISI Kolkata ML for MT 62

Constraint and Lagrangian

119875119894119895

119881119865

119895=1

= 1 forall119894

Jan 6 2014 ISI Kolkata ML for MT 63

Differentiating wrt Pij

Jan 6 2014 ISI Kolkata ML for MT 64

Final E and M steps

M-step

E-step

Jan 6 2014 ISI Kolkata ML for MT 65

Combinatorial considerations

Jan 6 2014 ISI Kolkata ML for MT 66

Example

ISI Kolkata ML for MT Jan 6 2014 67

All possible alignments

ISI Kolkata ML for MT Jan 6 2014 68

First fundamental requirement of SMT

Alignment requires evidence of

bull firstly a translation pair to introduce the

POSSIBILITY of a mapping

bull then another pair to establish with

CERTAINTY the mapping

ISI Kolkata ML for MT Jan 6 2014 69

For the ldquocertaintyrdquo

We have a translation pair containing alignment candidates and none of the other words in the translation pair

OR

We have a translation pair containing all words in the translation pair except the alignment candidates

ISI Kolkata ML for MT Jan 6 2014 70

Thereforehellip

If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty

ISI Kolkata ML for MT Jan 6 2014 71

Rough estimate of data requirement

SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge

ie no meanings or grammatical properties of any words phrases or sentences

Each language has a vocabulary of 100000 words can give rise to about 500000 word forms

through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟

bdquowent‟ and bdquogone‟

ISI Kolkata ML for MT Jan 6 2014 72

Reasons for mapping to multiple words

Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register

polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)

syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood

ISI Kolkata ML for MT Jan 6 2014 73

Estimate of corpora requirement

Assume that on an average a sentence is 10 words long

an additional 9 translation pairs for getting at one of the 5 mappings

10 sentences per mapping per word

a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences

Estimate is not wide off the mark

Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs

ISI Kolkata ML for MT Jan 6 2014 74

Our work on factor based SMT

Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009

Jan 6 2014 ISI Kolkata ML for MT 75

Case Marker and Morphology crucial in E-H MT

Order of magnitiude facelift in Fluency and fidelity

Determined by the combination of suffixes and semantic relations on the English side

Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers

Jan 6 2014 ISI Kolkata ML for MT 76

Semantic relations+SuffixesCase

Markers+inflections

I ate mangoes

I ltagt ate eatpast mangoes ltobj

I ltagt mangoes ltobjpl eatpast

mei_ne aam khaa_yaa

Jan 6 2014 ISI Kolkata ML for MT 77

Our Approach

Factored model (Koehn and Hoang 2007) with the following translation factor

suffix + semantic relation case markersuffix

Experiments with the following relations

Dependency relations from the stanford parser

Deeper semantic roles from Universal Networking Language (UNL)

Jan 6 2014 ISI Kolkata ML for MT 78

Our Factorization

Jan 6 2014 ISI Kolkata ML for MT 79

Experiments

Jan 6 2014 ISI Kolkata ML for MT 80

Corpus Statistics

Jan 6 2014 ISI Kolkata ML for MT 81

Results The impact of suffix and semantic factors

Jan 6 2014 ISI Kolkata ML for MT 82

Results The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 83

Subjective Evaluation The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 84

Impact of sentence length (F Fluency AAdequacy E Errors)

Jan 6 2014 ISI Kolkata ML for MT 85

A feel for the improvement-baseline

Jan 6 2014 ISI Kolkata ML for MT 86

A feel for the improvement-reorder

Jan 6 2014 ISI Kolkata ML for MT 87

A feel for the improvement-Semantic relation

Jan 6 2014 ISI Kolkata ML for MT 88

A recent study

PAN Indian SMT

Jan 6 2014 ISI Kolkata ML for MT 89

Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator

SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi

Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English

Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences

Evaluation with BLEU METEOR scores also show high correlation with BLEU

Jan 6 2014 ISI Kolkata ML for MT 90

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Example of alignment

English Ram went to school

Hindi Raama paathashaalaa gayaa

Ram went to school

ltNullgt Raama paathashaalaa gayaa

Jan 6 2014 ISI Kolkata ML for MT 44

Translation Model Exact expression

Five models for estimating parameters in the expression [2]

Model-1 Model-2 Model-3 Model-4 Model-5

Choose alignment given e and m

Choose the identity of foreign word given e m a

Choose the length of foreign language string given e

Jan 6 2014 ISI Kolkata ML for MT 45

a

eafef )|Pr()|Pr(

m

emafeaf )|Pr()|Pr(

m

emafememaf )|Pr()|Pr()|Pr(

m

emafem )|Pr()|Pr(

m

m

j

jj

jj emfaafem1

1

1

1

1 )|Pr()|Pr(

m

j

jj

j

jj

j

m

emfafemfaaem1

1

11

1

1

1

1 )|Pr()|Pr()|Pr(

)|Pr( emaf )|Pr( em

m

j

jj

j

jj

j emfafemfaa1

1

11

1

1

1

1 )|Pr()|Pr(

Proof of Translation Model Exact

expression

m is fixed for a particular f hence

marginalization

marginalization

Jan 6 2014 ISI Kolkata ML for MT 46

Alignment

Jan 6 2014 ISI Kolkata ML for MT 47

Fundamental and ubiquitous

Spell checking

Translation

Transliteration

Speech to text

Text to speeh

Jan 6 2014 ISI Kolkata ML for MT 48

EM for word alignment from

sentence alignment example

English

(1) three rabbits

a b

(2) rabbits of Grenoble

b c d

French

(1) trois lapins

w x

(2) lapins de Grenoble

x y z

Jan 6 2014 ISI Kolkata ML for MT 49

Initial Probabilities

each cell denotes t(a w) t(a x) etc

a b c d

w 14 14 14 14

x 14 14 14 14

y 14 14 14 14

z 14 14 14 14

The counts in IBM Model 1

Works by maximizing P(f|e) over the entire corpus

For IBM Model 1 we get the following relationship

c (w f |w e f e ) =t (w f |w e )

t (w f |we

0 ) +hellip + t (w f |wel )

c (w f |w e f e ) is the fractional count of the alignment of w f

with w e in f and e

t (w f |w e ) is the probability of w f being the translation of w e

is the count of w f in f

is the count of w e in e

Jan 6 2014 ISI Kolkata ML for MT 51

Example of expected count

C[aw (a b)(w x)]

t(aw)

= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)

t(aw)+t(ax)

14

= ----------------- X 1 X 1= 12

14+14

Jan 6 2014 ISI Kolkata ML for MT 52

ldquocountsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 13 13 13

y 0 13 13 13

z 0 13 13 13

a b

w x

a b c d

w 12 12 0 0

x 12 12 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 53

Revised probability example

trevised(a w)

12

= -------------------------------------------------------------------

(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)

Jan 6 2014 ISI Kolkata ML for MT 54

Revised probabilities table

a b c d

w 12 14 0 0

x 12 512 13 13

y 0 16 13 13

z 0 16 13 13

ldquorevised countsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 59 13 13

y 0 29 13 13

z 0 29 13 13

a b

w x

a b c d

w 12 38 0 0

x 12 58 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 56

Re-Revised probabilities table a b c d

w 12 316 0 0

x 12 85144 13 13

y 0 19 13 13

z 0 19 13 13

Continue until convergence notice that (bx) binding gets progressively stronger

b=rabbits x=lapins

Derivation of EM based Alignment Expressions

Hindi)(Say language ofy vocabular

English)(Say language ofry vocalbula

2

1

LV

LV

F

E

what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet

जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet

E1

F1

E2

F2

Jan 6 2014 ISI Kolkata ML for MT 58

Vocabulary mapping

Vocabulary VE VF

what is in a name that which we call rose by any other will smell as sweet

naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii

Jan 6 2014 ISI Kolkata ML for MT 59

Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like

11989011 11989012 hellip 119890

11198971 119891

11 11989112 hellip 119891

11198981

11989021 11989022 hellip 119890

21198972 119891

21 11989122 hellip 119891

21198982

1198901199041 119890

1199042 hellip 119890

119904119897119904 119891

1199041 1198911199042 hellip 119891

119904119898119904

1198901198781 119890

1198782 hellip 119890

119878119897119878 119891

1198781 1198911198782 hellip 119891

119878119898119878

No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890

119904119901 =Index of English word 119890119904119901in English vocabularydictionary

119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary

(Thanks to Sachin Pawar for helping with the maths formulae processing)

Jan 6 2014 ISI Kolkata ML for MT 60

Hidden variables and parameters

Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878

119904=1 where each hidden variable is

as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French

word 119911119901119902119904 = 0 otherwise

Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893

word in French vocabulary

Jan 6 2014 ISI Kolkata ML for MT 61

Likelihoods Data Likelihood L(D Θ)

Data Log-Likelihood LL(D Θ)

Expected value of Data Log-Likelihood E(LL(D Θ))

Jan 6 2014 ISI Kolkata ML for MT 62

Constraint and Lagrangian

119875119894119895

119881119865

119895=1

= 1 forall119894

Jan 6 2014 ISI Kolkata ML for MT 63

Differentiating wrt Pij

Jan 6 2014 ISI Kolkata ML for MT 64

Final E and M steps

M-step

E-step

Jan 6 2014 ISI Kolkata ML for MT 65

Combinatorial considerations

Jan 6 2014 ISI Kolkata ML for MT 66

Example

ISI Kolkata ML for MT Jan 6 2014 67

All possible alignments

ISI Kolkata ML for MT Jan 6 2014 68

First fundamental requirement of SMT

Alignment requires evidence of

bull firstly a translation pair to introduce the

POSSIBILITY of a mapping

bull then another pair to establish with

CERTAINTY the mapping

ISI Kolkata ML for MT Jan 6 2014 69

For the ldquocertaintyrdquo

We have a translation pair containing alignment candidates and none of the other words in the translation pair

OR

We have a translation pair containing all words in the translation pair except the alignment candidates

ISI Kolkata ML for MT Jan 6 2014 70

Thereforehellip

If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty

ISI Kolkata ML for MT Jan 6 2014 71

Rough estimate of data requirement

SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge

ie no meanings or grammatical properties of any words phrases or sentences

Each language has a vocabulary of 100000 words can give rise to about 500000 word forms

through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟

bdquowent‟ and bdquogone‟

ISI Kolkata ML for MT Jan 6 2014 72

Reasons for mapping to multiple words

Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register

polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)

syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood

ISI Kolkata ML for MT Jan 6 2014 73

Estimate of corpora requirement

Assume that on an average a sentence is 10 words long

an additional 9 translation pairs for getting at one of the 5 mappings

10 sentences per mapping per word

a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences

Estimate is not wide off the mark

Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs

ISI Kolkata ML for MT Jan 6 2014 74

Our work on factor based SMT

Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009

Jan 6 2014 ISI Kolkata ML for MT 75

Case Marker and Morphology crucial in E-H MT

Order of magnitiude facelift in Fluency and fidelity

Determined by the combination of suffixes and semantic relations on the English side

Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers

Jan 6 2014 ISI Kolkata ML for MT 76

Semantic relations+SuffixesCase

Markers+inflections

I ate mangoes

I ltagt ate eatpast mangoes ltobj

I ltagt mangoes ltobjpl eatpast

mei_ne aam khaa_yaa

Jan 6 2014 ISI Kolkata ML for MT 77

Our Approach

Factored model (Koehn and Hoang 2007) with the following translation factor

suffix + semantic relation case markersuffix

Experiments with the following relations

Dependency relations from the stanford parser

Deeper semantic roles from Universal Networking Language (UNL)

Jan 6 2014 ISI Kolkata ML for MT 78

Our Factorization

Jan 6 2014 ISI Kolkata ML for MT 79

Experiments

Jan 6 2014 ISI Kolkata ML for MT 80

Corpus Statistics

Jan 6 2014 ISI Kolkata ML for MT 81

Results The impact of suffix and semantic factors

Jan 6 2014 ISI Kolkata ML for MT 82

Results The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 83

Subjective Evaluation The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 84

Impact of sentence length (F Fluency AAdequacy E Errors)

Jan 6 2014 ISI Kolkata ML for MT 85

A feel for the improvement-baseline

Jan 6 2014 ISI Kolkata ML for MT 86

A feel for the improvement-reorder

Jan 6 2014 ISI Kolkata ML for MT 87

A feel for the improvement-Semantic relation

Jan 6 2014 ISI Kolkata ML for MT 88

A recent study

PAN Indian SMT

Jan 6 2014 ISI Kolkata ML for MT 89

Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator

SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi

Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English

Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences

Evaluation with BLEU METEOR scores also show high correlation with BLEU

Jan 6 2014 ISI Kolkata ML for MT 90

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Translation Model Exact expression

Five models for estimating parameters in the expression [2]

Model-1 Model-2 Model-3 Model-4 Model-5

Choose alignment given e and m

Choose the identity of foreign word given e m a

Choose the length of foreign language string given e

Jan 6 2014 ISI Kolkata ML for MT 45

a

eafef )|Pr()|Pr(

m

emafeaf )|Pr()|Pr(

m

emafememaf )|Pr()|Pr()|Pr(

m

emafem )|Pr()|Pr(

m

m

j

jj

jj emfaafem1

1

1

1

1 )|Pr()|Pr(

m

j

jj

j

jj

j

m

emfafemfaaem1

1

11

1

1

1

1 )|Pr()|Pr()|Pr(

)|Pr( emaf )|Pr( em

m

j

jj

j

jj

j emfafemfaa1

1

11

1

1

1

1 )|Pr()|Pr(

Proof of Translation Model Exact

expression

m is fixed for a particular f hence

marginalization

marginalization

Jan 6 2014 ISI Kolkata ML for MT 46

Alignment

Jan 6 2014 ISI Kolkata ML for MT 47

Fundamental and ubiquitous

Spell checking

Translation

Transliteration

Speech to text

Text to speeh

Jan 6 2014 ISI Kolkata ML for MT 48

EM for word alignment from

sentence alignment example

English

(1) three rabbits

a b

(2) rabbits of Grenoble

b c d

French

(1) trois lapins

w x

(2) lapins de Grenoble

x y z

Jan 6 2014 ISI Kolkata ML for MT 49

Initial Probabilities

each cell denotes t(a w) t(a x) etc

a b c d

w 14 14 14 14

x 14 14 14 14

y 14 14 14 14

z 14 14 14 14

The counts in IBM Model 1

Works by maximizing P(f|e) over the entire corpus

For IBM Model 1 we get the following relationship

c (w f |w e f e ) =t (w f |w e )

t (w f |we

0 ) +hellip + t (w f |wel )

c (w f |w e f e ) is the fractional count of the alignment of w f

with w e in f and e

t (w f |w e ) is the probability of w f being the translation of w e

is the count of w f in f

is the count of w e in e

Jan 6 2014 ISI Kolkata ML for MT 51

Example of expected count

C[aw (a b)(w x)]

t(aw)

= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)

t(aw)+t(ax)

14

= ----------------- X 1 X 1= 12

14+14

Jan 6 2014 ISI Kolkata ML for MT 52

ldquocountsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 13 13 13

y 0 13 13 13

z 0 13 13 13

a b

w x

a b c d

w 12 12 0 0

x 12 12 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 53

Revised probability example

trevised(a w)

12

= -------------------------------------------------------------------

(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)

Jan 6 2014 ISI Kolkata ML for MT 54

Revised probabilities table

a b c d

w 12 14 0 0

x 12 512 13 13

y 0 16 13 13

z 0 16 13 13

ldquorevised countsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 59 13 13

y 0 29 13 13

z 0 29 13 13

a b

w x

a b c d

w 12 38 0 0

x 12 58 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 56

Re-Revised probabilities table a b c d

w 12 316 0 0

x 12 85144 13 13

y 0 19 13 13

z 0 19 13 13

Continue until convergence notice that (bx) binding gets progressively stronger

b=rabbits x=lapins

Derivation of EM based Alignment Expressions

Hindi)(Say language ofy vocabular

English)(Say language ofry vocalbula

2

1

LV

LV

F

E

what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet

जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet

E1

F1

E2

F2

Jan 6 2014 ISI Kolkata ML for MT 58

Vocabulary mapping

Vocabulary VE VF

what is in a name that which we call rose by any other will smell as sweet

naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii

Jan 6 2014 ISI Kolkata ML for MT 59

Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like

11989011 11989012 hellip 119890

11198971 119891

11 11989112 hellip 119891

11198981

11989021 11989022 hellip 119890

21198972 119891

21 11989122 hellip 119891

21198982

1198901199041 119890

1199042 hellip 119890

119904119897119904 119891

1199041 1198911199042 hellip 119891

119904119898119904

1198901198781 119890

1198782 hellip 119890

119878119897119878 119891

1198781 1198911198782 hellip 119891

119878119898119878

No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890

119904119901 =Index of English word 119890119904119901in English vocabularydictionary

119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary

(Thanks to Sachin Pawar for helping with the maths formulae processing)

Jan 6 2014 ISI Kolkata ML for MT 60

Hidden variables and parameters

Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878

119904=1 where each hidden variable is

as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French

word 119911119901119902119904 = 0 otherwise

Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893

word in French vocabulary

Jan 6 2014 ISI Kolkata ML for MT 61

Likelihoods Data Likelihood L(D Θ)

Data Log-Likelihood LL(D Θ)

Expected value of Data Log-Likelihood E(LL(D Θ))

Jan 6 2014 ISI Kolkata ML for MT 62

Constraint and Lagrangian

119875119894119895

119881119865

119895=1

= 1 forall119894

Jan 6 2014 ISI Kolkata ML for MT 63

Differentiating wrt Pij

Jan 6 2014 ISI Kolkata ML for MT 64

Final E and M steps

M-step

E-step

Jan 6 2014 ISI Kolkata ML for MT 65

Combinatorial considerations

Jan 6 2014 ISI Kolkata ML for MT 66

Example

ISI Kolkata ML for MT Jan 6 2014 67

All possible alignments

ISI Kolkata ML for MT Jan 6 2014 68

First fundamental requirement of SMT

Alignment requires evidence of

bull firstly a translation pair to introduce the

POSSIBILITY of a mapping

bull then another pair to establish with

CERTAINTY the mapping

ISI Kolkata ML for MT Jan 6 2014 69

For the ldquocertaintyrdquo

We have a translation pair containing alignment candidates and none of the other words in the translation pair

OR

We have a translation pair containing all words in the translation pair except the alignment candidates

ISI Kolkata ML for MT Jan 6 2014 70

Thereforehellip

If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty

ISI Kolkata ML for MT Jan 6 2014 71

Rough estimate of data requirement

SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge

ie no meanings or grammatical properties of any words phrases or sentences

Each language has a vocabulary of 100000 words can give rise to about 500000 word forms

through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟

bdquowent‟ and bdquogone‟

ISI Kolkata ML for MT Jan 6 2014 72

Reasons for mapping to multiple words

Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register

polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)

syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood

ISI Kolkata ML for MT Jan 6 2014 73

Estimate of corpora requirement

Assume that on an average a sentence is 10 words long

an additional 9 translation pairs for getting at one of the 5 mappings

10 sentences per mapping per word

a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences

Estimate is not wide off the mark

Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs

ISI Kolkata ML for MT Jan 6 2014 74

Our work on factor based SMT

Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009

Jan 6 2014 ISI Kolkata ML for MT 75

Case Marker and Morphology crucial in E-H MT

Order of magnitiude facelift in Fluency and fidelity

Determined by the combination of suffixes and semantic relations on the English side

Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers

Jan 6 2014 ISI Kolkata ML for MT 76

Semantic relations+SuffixesCase

Markers+inflections

I ate mangoes

I ltagt ate eatpast mangoes ltobj

I ltagt mangoes ltobjpl eatpast

mei_ne aam khaa_yaa

Jan 6 2014 ISI Kolkata ML for MT 77

Our Approach

Factored model (Koehn and Hoang 2007) with the following translation factor

suffix + semantic relation case markersuffix

Experiments with the following relations

Dependency relations from the stanford parser

Deeper semantic roles from Universal Networking Language (UNL)

Jan 6 2014 ISI Kolkata ML for MT 78

Our Factorization

Jan 6 2014 ISI Kolkata ML for MT 79

Experiments

Jan 6 2014 ISI Kolkata ML for MT 80

Corpus Statistics

Jan 6 2014 ISI Kolkata ML for MT 81

Results The impact of suffix and semantic factors

Jan 6 2014 ISI Kolkata ML for MT 82

Results The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 83

Subjective Evaluation The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 84

Impact of sentence length (F Fluency AAdequacy E Errors)

Jan 6 2014 ISI Kolkata ML for MT 85

A feel for the improvement-baseline

Jan 6 2014 ISI Kolkata ML for MT 86

A feel for the improvement-reorder

Jan 6 2014 ISI Kolkata ML for MT 87

A feel for the improvement-Semantic relation

Jan 6 2014 ISI Kolkata ML for MT 88

A recent study

PAN Indian SMT

Jan 6 2014 ISI Kolkata ML for MT 89

Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator

SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi

Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English

Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences

Evaluation with BLEU METEOR scores also show high correlation with BLEU

Jan 6 2014 ISI Kolkata ML for MT 90

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

a

eafef )|Pr()|Pr(

m

emafeaf )|Pr()|Pr(

m

emafememaf )|Pr()|Pr()|Pr(

m

emafem )|Pr()|Pr(

m

m

j

jj

jj emfaafem1

1

1

1

1 )|Pr()|Pr(

m

j

jj

j

jj

j

m

emfafemfaaem1

1

11

1

1

1

1 )|Pr()|Pr()|Pr(

)|Pr( emaf )|Pr( em

m

j

jj

j

jj

j emfafemfaa1

1

11

1

1

1

1 )|Pr()|Pr(

Proof of Translation Model Exact

expression

m is fixed for a particular f hence

marginalization

marginalization

Jan 6 2014 ISI Kolkata ML for MT 46

Alignment

Jan 6 2014 ISI Kolkata ML for MT 47

Fundamental and ubiquitous

Spell checking

Translation

Transliteration

Speech to text

Text to speeh

Jan 6 2014 ISI Kolkata ML for MT 48

EM for word alignment from

sentence alignment example

English

(1) three rabbits

a b

(2) rabbits of Grenoble

b c d

French

(1) trois lapins

w x

(2) lapins de Grenoble

x y z

Jan 6 2014 ISI Kolkata ML for MT 49

Initial Probabilities

each cell denotes t(a w) t(a x) etc

a b c d

w 14 14 14 14

x 14 14 14 14

y 14 14 14 14

z 14 14 14 14

The counts in IBM Model 1

Works by maximizing P(f|e) over the entire corpus

For IBM Model 1 we get the following relationship

c (w f |w e f e ) =t (w f |w e )

t (w f |we

0 ) +hellip + t (w f |wel )

c (w f |w e f e ) is the fractional count of the alignment of w f

with w e in f and e

t (w f |w e ) is the probability of w f being the translation of w e

is the count of w f in f

is the count of w e in e

Jan 6 2014 ISI Kolkata ML for MT 51

Example of expected count

C[aw (a b)(w x)]

t(aw)

= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)

t(aw)+t(ax)

14

= ----------------- X 1 X 1= 12

14+14

Jan 6 2014 ISI Kolkata ML for MT 52

ldquocountsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 13 13 13

y 0 13 13 13

z 0 13 13 13

a b

w x

a b c d

w 12 12 0 0

x 12 12 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 53

Revised probability example

trevised(a w)

12

= -------------------------------------------------------------------

(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)

Jan 6 2014 ISI Kolkata ML for MT 54

Revised probabilities table

a b c d

w 12 14 0 0

x 12 512 13 13

y 0 16 13 13

z 0 16 13 13

ldquorevised countsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 59 13 13

y 0 29 13 13

z 0 29 13 13

a b

w x

a b c d

w 12 38 0 0

x 12 58 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 56

Re-Revised probabilities table a b c d

w 12 316 0 0

x 12 85144 13 13

y 0 19 13 13

z 0 19 13 13

Continue until convergence notice that (bx) binding gets progressively stronger

b=rabbits x=lapins

Derivation of EM based Alignment Expressions

Hindi)(Say language ofy vocabular

English)(Say language ofry vocalbula

2

1

LV

LV

F

E

what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet

जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet

E1

F1

E2

F2

Jan 6 2014 ISI Kolkata ML for MT 58

Vocabulary mapping

Vocabulary VE VF

what is in a name that which we call rose by any other will smell as sweet

naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii

Jan 6 2014 ISI Kolkata ML for MT 59

Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like

11989011 11989012 hellip 119890

11198971 119891

11 11989112 hellip 119891

11198981

11989021 11989022 hellip 119890

21198972 119891

21 11989122 hellip 119891

21198982

1198901199041 119890

1199042 hellip 119890

119904119897119904 119891

1199041 1198911199042 hellip 119891

119904119898119904

1198901198781 119890

1198782 hellip 119890

119878119897119878 119891

1198781 1198911198782 hellip 119891

119878119898119878

No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890

119904119901 =Index of English word 119890119904119901in English vocabularydictionary

119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary

(Thanks to Sachin Pawar for helping with the maths formulae processing)

Jan 6 2014 ISI Kolkata ML for MT 60

Hidden variables and parameters

Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878

119904=1 where each hidden variable is

as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French

word 119911119901119902119904 = 0 otherwise

Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893

word in French vocabulary

Jan 6 2014 ISI Kolkata ML for MT 61

Likelihoods Data Likelihood L(D Θ)

Data Log-Likelihood LL(D Θ)

Expected value of Data Log-Likelihood E(LL(D Θ))

Jan 6 2014 ISI Kolkata ML for MT 62

Constraint and Lagrangian

119875119894119895

119881119865

119895=1

= 1 forall119894

Jan 6 2014 ISI Kolkata ML for MT 63

Differentiating wrt Pij

Jan 6 2014 ISI Kolkata ML for MT 64

Final E and M steps

M-step

E-step

Jan 6 2014 ISI Kolkata ML for MT 65

Combinatorial considerations

Jan 6 2014 ISI Kolkata ML for MT 66

Example

ISI Kolkata ML for MT Jan 6 2014 67

All possible alignments

ISI Kolkata ML for MT Jan 6 2014 68

First fundamental requirement of SMT

Alignment requires evidence of

bull firstly a translation pair to introduce the

POSSIBILITY of a mapping

bull then another pair to establish with

CERTAINTY the mapping

ISI Kolkata ML for MT Jan 6 2014 69

For the ldquocertaintyrdquo

We have a translation pair containing alignment candidates and none of the other words in the translation pair

OR

We have a translation pair containing all words in the translation pair except the alignment candidates

ISI Kolkata ML for MT Jan 6 2014 70

Thereforehellip

If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty

ISI Kolkata ML for MT Jan 6 2014 71

Rough estimate of data requirement

SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge

ie no meanings or grammatical properties of any words phrases or sentences

Each language has a vocabulary of 100000 words can give rise to about 500000 word forms

through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟

bdquowent‟ and bdquogone‟

ISI Kolkata ML for MT Jan 6 2014 72

Reasons for mapping to multiple words

Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register

polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)

syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood

ISI Kolkata ML for MT Jan 6 2014 73

Estimate of corpora requirement

Assume that on an average a sentence is 10 words long

an additional 9 translation pairs for getting at one of the 5 mappings

10 sentences per mapping per word

a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences

Estimate is not wide off the mark

Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs

ISI Kolkata ML for MT Jan 6 2014 74

Our work on factor based SMT

Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009

Jan 6 2014 ISI Kolkata ML for MT 75

Case Marker and Morphology crucial in E-H MT

Order of magnitiude facelift in Fluency and fidelity

Determined by the combination of suffixes and semantic relations on the English side

Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers

Jan 6 2014 ISI Kolkata ML for MT 76

Semantic relations+SuffixesCase

Markers+inflections

I ate mangoes

I ltagt ate eatpast mangoes ltobj

I ltagt mangoes ltobjpl eatpast

mei_ne aam khaa_yaa

Jan 6 2014 ISI Kolkata ML for MT 77

Our Approach

Factored model (Koehn and Hoang 2007) with the following translation factor

suffix + semantic relation case markersuffix

Experiments with the following relations

Dependency relations from the stanford parser

Deeper semantic roles from Universal Networking Language (UNL)

Jan 6 2014 ISI Kolkata ML for MT 78

Our Factorization

Jan 6 2014 ISI Kolkata ML for MT 79

Experiments

Jan 6 2014 ISI Kolkata ML for MT 80

Corpus Statistics

Jan 6 2014 ISI Kolkata ML for MT 81

Results The impact of suffix and semantic factors

Jan 6 2014 ISI Kolkata ML for MT 82

Results The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 83

Subjective Evaluation The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 84

Impact of sentence length (F Fluency AAdequacy E Errors)

Jan 6 2014 ISI Kolkata ML for MT 85

A feel for the improvement-baseline

Jan 6 2014 ISI Kolkata ML for MT 86

A feel for the improvement-reorder

Jan 6 2014 ISI Kolkata ML for MT 87

A feel for the improvement-Semantic relation

Jan 6 2014 ISI Kolkata ML for MT 88

A recent study

PAN Indian SMT

Jan 6 2014 ISI Kolkata ML for MT 89

Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator

SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi

Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English

Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences

Evaluation with BLEU METEOR scores also show high correlation with BLEU

Jan 6 2014 ISI Kolkata ML for MT 90

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Alignment

Jan 6 2014 ISI Kolkata ML for MT 47

Fundamental and ubiquitous

Spell checking

Translation

Transliteration

Speech to text

Text to speeh

Jan 6 2014 ISI Kolkata ML for MT 48

EM for word alignment from

sentence alignment example

English

(1) three rabbits

a b

(2) rabbits of Grenoble

b c d

French

(1) trois lapins

w x

(2) lapins de Grenoble

x y z

Jan 6 2014 ISI Kolkata ML for MT 49

Initial Probabilities

each cell denotes t(a w) t(a x) etc

a b c d

w 14 14 14 14

x 14 14 14 14

y 14 14 14 14

z 14 14 14 14

The counts in IBM Model 1

Works by maximizing P(f|e) over the entire corpus

For IBM Model 1 we get the following relationship

c (w f |w e f e ) =t (w f |w e )

t (w f |we

0 ) +hellip + t (w f |wel )

c (w f |w e f e ) is the fractional count of the alignment of w f

with w e in f and e

t (w f |w e ) is the probability of w f being the translation of w e

is the count of w f in f

is the count of w e in e

Jan 6 2014 ISI Kolkata ML for MT 51

Example of expected count

C[aw (a b)(w x)]

t(aw)

= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)

t(aw)+t(ax)

14

= ----------------- X 1 X 1= 12

14+14

Jan 6 2014 ISI Kolkata ML for MT 52

ldquocountsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 13 13 13

y 0 13 13 13

z 0 13 13 13

a b

w x

a b c d

w 12 12 0 0

x 12 12 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 53

Revised probability example

trevised(a w)

12

= -------------------------------------------------------------------

(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)

Jan 6 2014 ISI Kolkata ML for MT 54

Revised probabilities table

a b c d

w 12 14 0 0

x 12 512 13 13

y 0 16 13 13

z 0 16 13 13

ldquorevised countsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 59 13 13

y 0 29 13 13

z 0 29 13 13

a b

w x

a b c d

w 12 38 0 0

x 12 58 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 56

Re-Revised probabilities table a b c d

w 12 316 0 0

x 12 85144 13 13

y 0 19 13 13

z 0 19 13 13

Continue until convergence notice that (bx) binding gets progressively stronger

b=rabbits x=lapins

Derivation of EM based Alignment Expressions

Hindi)(Say language ofy vocabular

English)(Say language ofry vocalbula

2

1

LV

LV

F

E

what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet

जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet

E1

F1

E2

F2

Jan 6 2014 ISI Kolkata ML for MT 58

Vocabulary mapping

Vocabulary VE VF

what is in a name that which we call rose by any other will smell as sweet

naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii

Jan 6 2014 ISI Kolkata ML for MT 59

Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like

11989011 11989012 hellip 119890

11198971 119891

11 11989112 hellip 119891

11198981

11989021 11989022 hellip 119890

21198972 119891

21 11989122 hellip 119891

21198982

1198901199041 119890

1199042 hellip 119890

119904119897119904 119891

1199041 1198911199042 hellip 119891

119904119898119904

1198901198781 119890

1198782 hellip 119890

119878119897119878 119891

1198781 1198911198782 hellip 119891

119878119898119878

No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890

119904119901 =Index of English word 119890119904119901in English vocabularydictionary

119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary

(Thanks to Sachin Pawar for helping with the maths formulae processing)

Jan 6 2014 ISI Kolkata ML for MT 60

Hidden variables and parameters

Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878

119904=1 where each hidden variable is

as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French

word 119911119901119902119904 = 0 otherwise

Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893

word in French vocabulary

Jan 6 2014 ISI Kolkata ML for MT 61

Likelihoods Data Likelihood L(D Θ)

Data Log-Likelihood LL(D Θ)

Expected value of Data Log-Likelihood E(LL(D Θ))

Jan 6 2014 ISI Kolkata ML for MT 62

Constraint and Lagrangian

119875119894119895

119881119865

119895=1

= 1 forall119894

Jan 6 2014 ISI Kolkata ML for MT 63

Differentiating wrt Pij

Jan 6 2014 ISI Kolkata ML for MT 64

Final E and M steps

M-step

E-step

Jan 6 2014 ISI Kolkata ML for MT 65

Combinatorial considerations

Jan 6 2014 ISI Kolkata ML for MT 66

Example

ISI Kolkata ML for MT Jan 6 2014 67

All possible alignments

ISI Kolkata ML for MT Jan 6 2014 68

First fundamental requirement of SMT

Alignment requires evidence of

bull firstly a translation pair to introduce the

POSSIBILITY of a mapping

bull then another pair to establish with

CERTAINTY the mapping

ISI Kolkata ML for MT Jan 6 2014 69

For the ldquocertaintyrdquo

We have a translation pair containing alignment candidates and none of the other words in the translation pair

OR

We have a translation pair containing all words in the translation pair except the alignment candidates

ISI Kolkata ML for MT Jan 6 2014 70

Thereforehellip

If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty

ISI Kolkata ML for MT Jan 6 2014 71

Rough estimate of data requirement

SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge

ie no meanings or grammatical properties of any words phrases or sentences

Each language has a vocabulary of 100000 words can give rise to about 500000 word forms

through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟

bdquowent‟ and bdquogone‟

ISI Kolkata ML for MT Jan 6 2014 72

Reasons for mapping to multiple words

Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register

polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)

syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood

ISI Kolkata ML for MT Jan 6 2014 73

Estimate of corpora requirement

Assume that on an average a sentence is 10 words long

an additional 9 translation pairs for getting at one of the 5 mappings

10 sentences per mapping per word

a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences

Estimate is not wide off the mark

Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs

ISI Kolkata ML for MT Jan 6 2014 74

Our work on factor based SMT

Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009

Jan 6 2014 ISI Kolkata ML for MT 75

Case Marker and Morphology crucial in E-H MT

Order of magnitiude facelift in Fluency and fidelity

Determined by the combination of suffixes and semantic relations on the English side

Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers

Jan 6 2014 ISI Kolkata ML for MT 76

Semantic relations+SuffixesCase

Markers+inflections

I ate mangoes

I ltagt ate eatpast mangoes ltobj

I ltagt mangoes ltobjpl eatpast

mei_ne aam khaa_yaa

Jan 6 2014 ISI Kolkata ML for MT 77

Our Approach

Factored model (Koehn and Hoang 2007) with the following translation factor

suffix + semantic relation case markersuffix

Experiments with the following relations

Dependency relations from the stanford parser

Deeper semantic roles from Universal Networking Language (UNL)

Jan 6 2014 ISI Kolkata ML for MT 78

Our Factorization

Jan 6 2014 ISI Kolkata ML for MT 79

Experiments

Jan 6 2014 ISI Kolkata ML for MT 80

Corpus Statistics

Jan 6 2014 ISI Kolkata ML for MT 81

Results The impact of suffix and semantic factors

Jan 6 2014 ISI Kolkata ML for MT 82

Results The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 83

Subjective Evaluation The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 84

Impact of sentence length (F Fluency AAdequacy E Errors)

Jan 6 2014 ISI Kolkata ML for MT 85

A feel for the improvement-baseline

Jan 6 2014 ISI Kolkata ML for MT 86

A feel for the improvement-reorder

Jan 6 2014 ISI Kolkata ML for MT 87

A feel for the improvement-Semantic relation

Jan 6 2014 ISI Kolkata ML for MT 88

A recent study

PAN Indian SMT

Jan 6 2014 ISI Kolkata ML for MT 89

Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator

SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi

Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English

Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences

Evaluation with BLEU METEOR scores also show high correlation with BLEU

Jan 6 2014 ISI Kolkata ML for MT 90

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Fundamental and ubiquitous

Spell checking

Translation

Transliteration

Speech to text

Text to speeh

Jan 6 2014 ISI Kolkata ML for MT 48

EM for word alignment from

sentence alignment example

English

(1) three rabbits

a b

(2) rabbits of Grenoble

b c d

French

(1) trois lapins

w x

(2) lapins de Grenoble

x y z

Jan 6 2014 ISI Kolkata ML for MT 49

Initial Probabilities

each cell denotes t(a w) t(a x) etc

a b c d

w 14 14 14 14

x 14 14 14 14

y 14 14 14 14

z 14 14 14 14

The counts in IBM Model 1

Works by maximizing P(f|e) over the entire corpus

For IBM Model 1 we get the following relationship

c (w f |w e f e ) =t (w f |w e )

t (w f |we

0 ) +hellip + t (w f |wel )

c (w f |w e f e ) is the fractional count of the alignment of w f

with w e in f and e

t (w f |w e ) is the probability of w f being the translation of w e

is the count of w f in f

is the count of w e in e

Jan 6 2014 ISI Kolkata ML for MT 51

Example of expected count

C[aw (a b)(w x)]

t(aw)

= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)

t(aw)+t(ax)

14

= ----------------- X 1 X 1= 12

14+14

Jan 6 2014 ISI Kolkata ML for MT 52

ldquocountsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 13 13 13

y 0 13 13 13

z 0 13 13 13

a b

w x

a b c d

w 12 12 0 0

x 12 12 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 53

Revised probability example

trevised(a w)

12

= -------------------------------------------------------------------

(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)

Jan 6 2014 ISI Kolkata ML for MT 54

Revised probabilities table

a b c d

w 12 14 0 0

x 12 512 13 13

y 0 16 13 13

z 0 16 13 13

ldquorevised countsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 59 13 13

y 0 29 13 13

z 0 29 13 13

a b

w x

a b c d

w 12 38 0 0

x 12 58 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 56

Re-Revised probabilities table a b c d

w 12 316 0 0

x 12 85144 13 13

y 0 19 13 13

z 0 19 13 13

Continue until convergence notice that (bx) binding gets progressively stronger

b=rabbits x=lapins

Derivation of EM based Alignment Expressions

Hindi)(Say language ofy vocabular

English)(Say language ofry vocalbula

2

1

LV

LV

F

E

what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet

जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet

E1

F1

E2

F2

Jan 6 2014 ISI Kolkata ML for MT 58

Vocabulary mapping

Vocabulary VE VF

what is in a name that which we call rose by any other will smell as sweet

naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii

Jan 6 2014 ISI Kolkata ML for MT 59

Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like

11989011 11989012 hellip 119890

11198971 119891

11 11989112 hellip 119891

11198981

11989021 11989022 hellip 119890

21198972 119891

21 11989122 hellip 119891

21198982

1198901199041 119890

1199042 hellip 119890

119904119897119904 119891

1199041 1198911199042 hellip 119891

119904119898119904

1198901198781 119890

1198782 hellip 119890

119878119897119878 119891

1198781 1198911198782 hellip 119891

119878119898119878

No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890

119904119901 =Index of English word 119890119904119901in English vocabularydictionary

119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary

(Thanks to Sachin Pawar for helping with the maths formulae processing)

Jan 6 2014 ISI Kolkata ML for MT 60

Hidden variables and parameters

Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878

119904=1 where each hidden variable is

as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French

word 119911119901119902119904 = 0 otherwise

Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893

word in French vocabulary

Jan 6 2014 ISI Kolkata ML for MT 61

Likelihoods Data Likelihood L(D Θ)

Data Log-Likelihood LL(D Θ)

Expected value of Data Log-Likelihood E(LL(D Θ))

Jan 6 2014 ISI Kolkata ML for MT 62

Constraint and Lagrangian

119875119894119895

119881119865

119895=1

= 1 forall119894

Jan 6 2014 ISI Kolkata ML for MT 63

Differentiating wrt Pij

Jan 6 2014 ISI Kolkata ML for MT 64

Final E and M steps

M-step

E-step

Jan 6 2014 ISI Kolkata ML for MT 65

Combinatorial considerations

Jan 6 2014 ISI Kolkata ML for MT 66

Example

ISI Kolkata ML for MT Jan 6 2014 67

All possible alignments

ISI Kolkata ML for MT Jan 6 2014 68

First fundamental requirement of SMT

Alignment requires evidence of

bull firstly a translation pair to introduce the

POSSIBILITY of a mapping

bull then another pair to establish with

CERTAINTY the mapping

ISI Kolkata ML for MT Jan 6 2014 69

For the ldquocertaintyrdquo

We have a translation pair containing alignment candidates and none of the other words in the translation pair

OR

We have a translation pair containing all words in the translation pair except the alignment candidates

ISI Kolkata ML for MT Jan 6 2014 70

Thereforehellip

If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty

ISI Kolkata ML for MT Jan 6 2014 71

Rough estimate of data requirement

SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge

ie no meanings or grammatical properties of any words phrases or sentences

Each language has a vocabulary of 100000 words can give rise to about 500000 word forms

through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟

bdquowent‟ and bdquogone‟

ISI Kolkata ML for MT Jan 6 2014 72

Reasons for mapping to multiple words

Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register

polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)

syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood

ISI Kolkata ML for MT Jan 6 2014 73

Estimate of corpora requirement

Assume that on an average a sentence is 10 words long

an additional 9 translation pairs for getting at one of the 5 mappings

10 sentences per mapping per word

a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences

Estimate is not wide off the mark

Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs

ISI Kolkata ML for MT Jan 6 2014 74

Our work on factor based SMT

Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009

Jan 6 2014 ISI Kolkata ML for MT 75

Case Marker and Morphology crucial in E-H MT

Order of magnitiude facelift in Fluency and fidelity

Determined by the combination of suffixes and semantic relations on the English side

Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers

Jan 6 2014 ISI Kolkata ML for MT 76

Semantic relations+SuffixesCase

Markers+inflections

I ate mangoes

I ltagt ate eatpast mangoes ltobj

I ltagt mangoes ltobjpl eatpast

mei_ne aam khaa_yaa

Jan 6 2014 ISI Kolkata ML for MT 77

Our Approach

Factored model (Koehn and Hoang 2007) with the following translation factor

suffix + semantic relation case markersuffix

Experiments with the following relations

Dependency relations from the stanford parser

Deeper semantic roles from Universal Networking Language (UNL)

Jan 6 2014 ISI Kolkata ML for MT 78

Our Factorization

Jan 6 2014 ISI Kolkata ML for MT 79

Experiments

Jan 6 2014 ISI Kolkata ML for MT 80

Corpus Statistics

Jan 6 2014 ISI Kolkata ML for MT 81

Results The impact of suffix and semantic factors

Jan 6 2014 ISI Kolkata ML for MT 82

Results The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 83

Subjective Evaluation The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 84

Impact of sentence length (F Fluency AAdequacy E Errors)

Jan 6 2014 ISI Kolkata ML for MT 85

A feel for the improvement-baseline

Jan 6 2014 ISI Kolkata ML for MT 86

A feel for the improvement-reorder

Jan 6 2014 ISI Kolkata ML for MT 87

A feel for the improvement-Semantic relation

Jan 6 2014 ISI Kolkata ML for MT 88

A recent study

PAN Indian SMT

Jan 6 2014 ISI Kolkata ML for MT 89

Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator

SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi

Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English

Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences

Evaluation with BLEU METEOR scores also show high correlation with BLEU

Jan 6 2014 ISI Kolkata ML for MT 90

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

EM for word alignment from

sentence alignment example

English

(1) three rabbits

a b

(2) rabbits of Grenoble

b c d

French

(1) trois lapins

w x

(2) lapins de Grenoble

x y z

Jan 6 2014 ISI Kolkata ML for MT 49

Initial Probabilities

each cell denotes t(a w) t(a x) etc

a b c d

w 14 14 14 14

x 14 14 14 14

y 14 14 14 14

z 14 14 14 14

The counts in IBM Model 1

Works by maximizing P(f|e) over the entire corpus

For IBM Model 1 we get the following relationship

c (w f |w e f e ) =t (w f |w e )

t (w f |we

0 ) +hellip + t (w f |wel )

c (w f |w e f e ) is the fractional count of the alignment of w f

with w e in f and e

t (w f |w e ) is the probability of w f being the translation of w e

is the count of w f in f

is the count of w e in e

Jan 6 2014 ISI Kolkata ML for MT 51

Example of expected count

C[aw (a b)(w x)]

t(aw)

= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)

t(aw)+t(ax)

14

= ----------------- X 1 X 1= 12

14+14

Jan 6 2014 ISI Kolkata ML for MT 52

ldquocountsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 13 13 13

y 0 13 13 13

z 0 13 13 13

a b

w x

a b c d

w 12 12 0 0

x 12 12 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 53

Revised probability example

trevised(a w)

12

= -------------------------------------------------------------------

(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)

Jan 6 2014 ISI Kolkata ML for MT 54

Revised probabilities table

a b c d

w 12 14 0 0

x 12 512 13 13

y 0 16 13 13

z 0 16 13 13

ldquorevised countsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 59 13 13

y 0 29 13 13

z 0 29 13 13

a b

w x

a b c d

w 12 38 0 0

x 12 58 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 56

Re-Revised probabilities table a b c d

w 12 316 0 0

x 12 85144 13 13

y 0 19 13 13

z 0 19 13 13

Continue until convergence notice that (bx) binding gets progressively stronger

b=rabbits x=lapins

Derivation of EM based Alignment Expressions

Hindi)(Say language ofy vocabular

English)(Say language ofry vocalbula

2

1

LV

LV

F

E

what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet

जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet

E1

F1

E2

F2

Jan 6 2014 ISI Kolkata ML for MT 58

Vocabulary mapping

Vocabulary VE VF

what is in a name that which we call rose by any other will smell as sweet

naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii

Jan 6 2014 ISI Kolkata ML for MT 59

Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like

11989011 11989012 hellip 119890

11198971 119891

11 11989112 hellip 119891

11198981

11989021 11989022 hellip 119890

21198972 119891

21 11989122 hellip 119891

21198982

1198901199041 119890

1199042 hellip 119890

119904119897119904 119891

1199041 1198911199042 hellip 119891

119904119898119904

1198901198781 119890

1198782 hellip 119890

119878119897119878 119891

1198781 1198911198782 hellip 119891

119878119898119878

No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890

119904119901 =Index of English word 119890119904119901in English vocabularydictionary

119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary

(Thanks to Sachin Pawar for helping with the maths formulae processing)

Jan 6 2014 ISI Kolkata ML for MT 60

Hidden variables and parameters

Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878

119904=1 where each hidden variable is

as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French

word 119911119901119902119904 = 0 otherwise

Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893

word in French vocabulary

Jan 6 2014 ISI Kolkata ML for MT 61

Likelihoods Data Likelihood L(D Θ)

Data Log-Likelihood LL(D Θ)

Expected value of Data Log-Likelihood E(LL(D Θ))

Jan 6 2014 ISI Kolkata ML for MT 62

Constraint and Lagrangian

119875119894119895

119881119865

119895=1

= 1 forall119894

Jan 6 2014 ISI Kolkata ML for MT 63

Differentiating wrt Pij

Jan 6 2014 ISI Kolkata ML for MT 64

Final E and M steps

M-step

E-step

Jan 6 2014 ISI Kolkata ML for MT 65

Combinatorial considerations

Jan 6 2014 ISI Kolkata ML for MT 66

Example

ISI Kolkata ML for MT Jan 6 2014 67

All possible alignments

ISI Kolkata ML for MT Jan 6 2014 68

First fundamental requirement of SMT

Alignment requires evidence of

bull firstly a translation pair to introduce the

POSSIBILITY of a mapping

bull then another pair to establish with

CERTAINTY the mapping

ISI Kolkata ML for MT Jan 6 2014 69

For the ldquocertaintyrdquo

We have a translation pair containing alignment candidates and none of the other words in the translation pair

OR

We have a translation pair containing all words in the translation pair except the alignment candidates

ISI Kolkata ML for MT Jan 6 2014 70

Thereforehellip

If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty

ISI Kolkata ML for MT Jan 6 2014 71

Rough estimate of data requirement

SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge

ie no meanings or grammatical properties of any words phrases or sentences

Each language has a vocabulary of 100000 words can give rise to about 500000 word forms

through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟

bdquowent‟ and bdquogone‟

ISI Kolkata ML for MT Jan 6 2014 72

Reasons for mapping to multiple words

Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register

polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)

syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood

ISI Kolkata ML for MT Jan 6 2014 73

Estimate of corpora requirement

Assume that on an average a sentence is 10 words long

an additional 9 translation pairs for getting at one of the 5 mappings

10 sentences per mapping per word

a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences

Estimate is not wide off the mark

Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs

ISI Kolkata ML for MT Jan 6 2014 74

Our work on factor based SMT

Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009

Jan 6 2014 ISI Kolkata ML for MT 75

Case Marker and Morphology crucial in E-H MT

Order of magnitiude facelift in Fluency and fidelity

Determined by the combination of suffixes and semantic relations on the English side

Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers

Jan 6 2014 ISI Kolkata ML for MT 76

Semantic relations+SuffixesCase

Markers+inflections

I ate mangoes

I ltagt ate eatpast mangoes ltobj

I ltagt mangoes ltobjpl eatpast

mei_ne aam khaa_yaa

Jan 6 2014 ISI Kolkata ML for MT 77

Our Approach

Factored model (Koehn and Hoang 2007) with the following translation factor

suffix + semantic relation case markersuffix

Experiments with the following relations

Dependency relations from the stanford parser

Deeper semantic roles from Universal Networking Language (UNL)

Jan 6 2014 ISI Kolkata ML for MT 78

Our Factorization

Jan 6 2014 ISI Kolkata ML for MT 79

Experiments

Jan 6 2014 ISI Kolkata ML for MT 80

Corpus Statistics

Jan 6 2014 ISI Kolkata ML for MT 81

Results The impact of suffix and semantic factors

Jan 6 2014 ISI Kolkata ML for MT 82

Results The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 83

Subjective Evaluation The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 84

Impact of sentence length (F Fluency AAdequacy E Errors)

Jan 6 2014 ISI Kolkata ML for MT 85

A feel for the improvement-baseline

Jan 6 2014 ISI Kolkata ML for MT 86

A feel for the improvement-reorder

Jan 6 2014 ISI Kolkata ML for MT 87

A feel for the improvement-Semantic relation

Jan 6 2014 ISI Kolkata ML for MT 88

A recent study

PAN Indian SMT

Jan 6 2014 ISI Kolkata ML for MT 89

Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator

SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi

Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English

Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences

Evaluation with BLEU METEOR scores also show high correlation with BLEU

Jan 6 2014 ISI Kolkata ML for MT 90

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Initial Probabilities

each cell denotes t(a w) t(a x) etc

a b c d

w 14 14 14 14

x 14 14 14 14

y 14 14 14 14

z 14 14 14 14

The counts in IBM Model 1

Works by maximizing P(f|e) over the entire corpus

For IBM Model 1 we get the following relationship

c (w f |w e f e ) =t (w f |w e )

t (w f |we

0 ) +hellip + t (w f |wel )

c (w f |w e f e ) is the fractional count of the alignment of w f

with w e in f and e

t (w f |w e ) is the probability of w f being the translation of w e

is the count of w f in f

is the count of w e in e

Jan 6 2014 ISI Kolkata ML for MT 51

Example of expected count

C[aw (a b)(w x)]

t(aw)

= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)

t(aw)+t(ax)

14

= ----------------- X 1 X 1= 12

14+14

Jan 6 2014 ISI Kolkata ML for MT 52

ldquocountsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 13 13 13

y 0 13 13 13

z 0 13 13 13

a b

w x

a b c d

w 12 12 0 0

x 12 12 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 53

Revised probability example

trevised(a w)

12

= -------------------------------------------------------------------

(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)

Jan 6 2014 ISI Kolkata ML for MT 54

Revised probabilities table

a b c d

w 12 14 0 0

x 12 512 13 13

y 0 16 13 13

z 0 16 13 13

ldquorevised countsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 59 13 13

y 0 29 13 13

z 0 29 13 13

a b

w x

a b c d

w 12 38 0 0

x 12 58 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 56

Re-Revised probabilities table a b c d

w 12 316 0 0

x 12 85144 13 13

y 0 19 13 13

z 0 19 13 13

Continue until convergence notice that (bx) binding gets progressively stronger

b=rabbits x=lapins

Derivation of EM based Alignment Expressions

Hindi)(Say language ofy vocabular

English)(Say language ofry vocalbula

2

1

LV

LV

F

E

what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet

जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet

E1

F1

E2

F2

Jan 6 2014 ISI Kolkata ML for MT 58

Vocabulary mapping

Vocabulary VE VF

what is in a name that which we call rose by any other will smell as sweet

naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii

Jan 6 2014 ISI Kolkata ML for MT 59

Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like

11989011 11989012 hellip 119890

11198971 119891

11 11989112 hellip 119891

11198981

11989021 11989022 hellip 119890

21198972 119891

21 11989122 hellip 119891

21198982

1198901199041 119890

1199042 hellip 119890

119904119897119904 119891

1199041 1198911199042 hellip 119891

119904119898119904

1198901198781 119890

1198782 hellip 119890

119878119897119878 119891

1198781 1198911198782 hellip 119891

119878119898119878

No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890

119904119901 =Index of English word 119890119904119901in English vocabularydictionary

119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary

(Thanks to Sachin Pawar for helping with the maths formulae processing)

Jan 6 2014 ISI Kolkata ML for MT 60

Hidden variables and parameters

Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878

119904=1 where each hidden variable is

as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French

word 119911119901119902119904 = 0 otherwise

Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893

word in French vocabulary

Jan 6 2014 ISI Kolkata ML for MT 61

Likelihoods Data Likelihood L(D Θ)

Data Log-Likelihood LL(D Θ)

Expected value of Data Log-Likelihood E(LL(D Θ))

Jan 6 2014 ISI Kolkata ML for MT 62

Constraint and Lagrangian

119875119894119895

119881119865

119895=1

= 1 forall119894

Jan 6 2014 ISI Kolkata ML for MT 63

Differentiating wrt Pij

Jan 6 2014 ISI Kolkata ML for MT 64

Final E and M steps

M-step

E-step

Jan 6 2014 ISI Kolkata ML for MT 65

Combinatorial considerations

Jan 6 2014 ISI Kolkata ML for MT 66

Example

ISI Kolkata ML for MT Jan 6 2014 67

All possible alignments

ISI Kolkata ML for MT Jan 6 2014 68

First fundamental requirement of SMT

Alignment requires evidence of

bull firstly a translation pair to introduce the

POSSIBILITY of a mapping

bull then another pair to establish with

CERTAINTY the mapping

ISI Kolkata ML for MT Jan 6 2014 69

For the ldquocertaintyrdquo

We have a translation pair containing alignment candidates and none of the other words in the translation pair

OR

We have a translation pair containing all words in the translation pair except the alignment candidates

ISI Kolkata ML for MT Jan 6 2014 70

Thereforehellip

If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty

ISI Kolkata ML for MT Jan 6 2014 71

Rough estimate of data requirement

SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge

ie no meanings or grammatical properties of any words phrases or sentences

Each language has a vocabulary of 100000 words can give rise to about 500000 word forms

through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟

bdquowent‟ and bdquogone‟

ISI Kolkata ML for MT Jan 6 2014 72

Reasons for mapping to multiple words

Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register

polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)

syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood

ISI Kolkata ML for MT Jan 6 2014 73

Estimate of corpora requirement

Assume that on an average a sentence is 10 words long

an additional 9 translation pairs for getting at one of the 5 mappings

10 sentences per mapping per word

a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences

Estimate is not wide off the mark

Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs

ISI Kolkata ML for MT Jan 6 2014 74

Our work on factor based SMT

Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009

Jan 6 2014 ISI Kolkata ML for MT 75

Case Marker and Morphology crucial in E-H MT

Order of magnitiude facelift in Fluency and fidelity

Determined by the combination of suffixes and semantic relations on the English side

Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers

Jan 6 2014 ISI Kolkata ML for MT 76

Semantic relations+SuffixesCase

Markers+inflections

I ate mangoes

I ltagt ate eatpast mangoes ltobj

I ltagt mangoes ltobjpl eatpast

mei_ne aam khaa_yaa

Jan 6 2014 ISI Kolkata ML for MT 77

Our Approach

Factored model (Koehn and Hoang 2007) with the following translation factor

suffix + semantic relation case markersuffix

Experiments with the following relations

Dependency relations from the stanford parser

Deeper semantic roles from Universal Networking Language (UNL)

Jan 6 2014 ISI Kolkata ML for MT 78

Our Factorization

Jan 6 2014 ISI Kolkata ML for MT 79

Experiments

Jan 6 2014 ISI Kolkata ML for MT 80

Corpus Statistics

Jan 6 2014 ISI Kolkata ML for MT 81

Results The impact of suffix and semantic factors

Jan 6 2014 ISI Kolkata ML for MT 82

Results The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 83

Subjective Evaluation The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 84

Impact of sentence length (F Fluency AAdequacy E Errors)

Jan 6 2014 ISI Kolkata ML for MT 85

A feel for the improvement-baseline

Jan 6 2014 ISI Kolkata ML for MT 86

A feel for the improvement-reorder

Jan 6 2014 ISI Kolkata ML for MT 87

A feel for the improvement-Semantic relation

Jan 6 2014 ISI Kolkata ML for MT 88

A recent study

PAN Indian SMT

Jan 6 2014 ISI Kolkata ML for MT 89

Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator

SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi

Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English

Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences

Evaluation with BLEU METEOR scores also show high correlation with BLEU

Jan 6 2014 ISI Kolkata ML for MT 90

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

The counts in IBM Model 1

Works by maximizing P(f|e) over the entire corpus

For IBM Model 1 we get the following relationship

c (w f |w e f e ) =t (w f |w e )

t (w f |we

0 ) +hellip + t (w f |wel )

c (w f |w e f e ) is the fractional count of the alignment of w f

with w e in f and e

t (w f |w e ) is the probability of w f being the translation of w e

is the count of w f in f

is the count of w e in e

Jan 6 2014 ISI Kolkata ML for MT 51

Example of expected count

C[aw (a b)(w x)]

t(aw)

= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)

t(aw)+t(ax)

14

= ----------------- X 1 X 1= 12

14+14

Jan 6 2014 ISI Kolkata ML for MT 52

ldquocountsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 13 13 13

y 0 13 13 13

z 0 13 13 13

a b

w x

a b c d

w 12 12 0 0

x 12 12 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 53

Revised probability example

trevised(a w)

12

= -------------------------------------------------------------------

(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)

Jan 6 2014 ISI Kolkata ML for MT 54

Revised probabilities table

a b c d

w 12 14 0 0

x 12 512 13 13

y 0 16 13 13

z 0 16 13 13

ldquorevised countsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 59 13 13

y 0 29 13 13

z 0 29 13 13

a b

w x

a b c d

w 12 38 0 0

x 12 58 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 56

Re-Revised probabilities table a b c d

w 12 316 0 0

x 12 85144 13 13

y 0 19 13 13

z 0 19 13 13

Continue until convergence notice that (bx) binding gets progressively stronger

b=rabbits x=lapins

Derivation of EM based Alignment Expressions

Hindi)(Say language ofy vocabular

English)(Say language ofry vocalbula

2

1

LV

LV

F

E

what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet

जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet

E1

F1

E2

F2

Jan 6 2014 ISI Kolkata ML for MT 58

Vocabulary mapping

Vocabulary VE VF

what is in a name that which we call rose by any other will smell as sweet

naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii

Jan 6 2014 ISI Kolkata ML for MT 59

Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like

11989011 11989012 hellip 119890

11198971 119891

11 11989112 hellip 119891

11198981

11989021 11989022 hellip 119890

21198972 119891

21 11989122 hellip 119891

21198982

1198901199041 119890

1199042 hellip 119890

119904119897119904 119891

1199041 1198911199042 hellip 119891

119904119898119904

1198901198781 119890

1198782 hellip 119890

119878119897119878 119891

1198781 1198911198782 hellip 119891

119878119898119878

No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890

119904119901 =Index of English word 119890119904119901in English vocabularydictionary

119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary

(Thanks to Sachin Pawar for helping with the maths formulae processing)

Jan 6 2014 ISI Kolkata ML for MT 60

Hidden variables and parameters

Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878

119904=1 where each hidden variable is

as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French

word 119911119901119902119904 = 0 otherwise

Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893

word in French vocabulary

Jan 6 2014 ISI Kolkata ML for MT 61

Likelihoods Data Likelihood L(D Θ)

Data Log-Likelihood LL(D Θ)

Expected value of Data Log-Likelihood E(LL(D Θ))

Jan 6 2014 ISI Kolkata ML for MT 62

Constraint and Lagrangian

119875119894119895

119881119865

119895=1

= 1 forall119894

Jan 6 2014 ISI Kolkata ML for MT 63

Differentiating wrt Pij

Jan 6 2014 ISI Kolkata ML for MT 64

Final E and M steps

M-step

E-step

Jan 6 2014 ISI Kolkata ML for MT 65

Combinatorial considerations

Jan 6 2014 ISI Kolkata ML for MT 66

Example

ISI Kolkata ML for MT Jan 6 2014 67

All possible alignments

ISI Kolkata ML for MT Jan 6 2014 68

First fundamental requirement of SMT

Alignment requires evidence of

bull firstly a translation pair to introduce the

POSSIBILITY of a mapping

bull then another pair to establish with

CERTAINTY the mapping

ISI Kolkata ML for MT Jan 6 2014 69

For the ldquocertaintyrdquo

We have a translation pair containing alignment candidates and none of the other words in the translation pair

OR

We have a translation pair containing all words in the translation pair except the alignment candidates

ISI Kolkata ML for MT Jan 6 2014 70

Thereforehellip

If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty

ISI Kolkata ML for MT Jan 6 2014 71

Rough estimate of data requirement

SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge

ie no meanings or grammatical properties of any words phrases or sentences

Each language has a vocabulary of 100000 words can give rise to about 500000 word forms

through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟

bdquowent‟ and bdquogone‟

ISI Kolkata ML for MT Jan 6 2014 72

Reasons for mapping to multiple words

Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register

polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)

syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood

ISI Kolkata ML for MT Jan 6 2014 73

Estimate of corpora requirement

Assume that on an average a sentence is 10 words long

an additional 9 translation pairs for getting at one of the 5 mappings

10 sentences per mapping per word

a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences

Estimate is not wide off the mark

Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs

ISI Kolkata ML for MT Jan 6 2014 74

Our work on factor based SMT

Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009

Jan 6 2014 ISI Kolkata ML for MT 75

Case Marker and Morphology crucial in E-H MT

Order of magnitiude facelift in Fluency and fidelity

Determined by the combination of suffixes and semantic relations on the English side

Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers

Jan 6 2014 ISI Kolkata ML for MT 76

Semantic relations+SuffixesCase

Markers+inflections

I ate mangoes

I ltagt ate eatpast mangoes ltobj

I ltagt mangoes ltobjpl eatpast

mei_ne aam khaa_yaa

Jan 6 2014 ISI Kolkata ML for MT 77

Our Approach

Factored model (Koehn and Hoang 2007) with the following translation factor

suffix + semantic relation case markersuffix

Experiments with the following relations

Dependency relations from the stanford parser

Deeper semantic roles from Universal Networking Language (UNL)

Jan 6 2014 ISI Kolkata ML for MT 78

Our Factorization

Jan 6 2014 ISI Kolkata ML for MT 79

Experiments

Jan 6 2014 ISI Kolkata ML for MT 80

Corpus Statistics

Jan 6 2014 ISI Kolkata ML for MT 81

Results The impact of suffix and semantic factors

Jan 6 2014 ISI Kolkata ML for MT 82

Results The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 83

Subjective Evaluation The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 84

Impact of sentence length (F Fluency AAdequacy E Errors)

Jan 6 2014 ISI Kolkata ML for MT 85

A feel for the improvement-baseline

Jan 6 2014 ISI Kolkata ML for MT 86

A feel for the improvement-reorder

Jan 6 2014 ISI Kolkata ML for MT 87

A feel for the improvement-Semantic relation

Jan 6 2014 ISI Kolkata ML for MT 88

A recent study

PAN Indian SMT

Jan 6 2014 ISI Kolkata ML for MT 89

Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator

SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi

Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English

Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences

Evaluation with BLEU METEOR scores also show high correlation with BLEU

Jan 6 2014 ISI Kolkata ML for MT 90

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Example of expected count

C[aw (a b)(w x)]

t(aw)

= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)

t(aw)+t(ax)

14

= ----------------- X 1 X 1= 12

14+14

Jan 6 2014 ISI Kolkata ML for MT 52

ldquocountsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 13 13 13

y 0 13 13 13

z 0 13 13 13

a b

w x

a b c d

w 12 12 0 0

x 12 12 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 53

Revised probability example

trevised(a w)

12

= -------------------------------------------------------------------

(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)

Jan 6 2014 ISI Kolkata ML for MT 54

Revised probabilities table

a b c d

w 12 14 0 0

x 12 512 13 13

y 0 16 13 13

z 0 16 13 13

ldquorevised countsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 59 13 13

y 0 29 13 13

z 0 29 13 13

a b

w x

a b c d

w 12 38 0 0

x 12 58 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 56

Re-Revised probabilities table a b c d

w 12 316 0 0

x 12 85144 13 13

y 0 19 13 13

z 0 19 13 13

Continue until convergence notice that (bx) binding gets progressively stronger

b=rabbits x=lapins

Derivation of EM based Alignment Expressions

Hindi)(Say language ofy vocabular

English)(Say language ofry vocalbula

2

1

LV

LV

F

E

what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet

जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet

E1

F1

E2

F2

Jan 6 2014 ISI Kolkata ML for MT 58

Vocabulary mapping

Vocabulary VE VF

what is in a name that which we call rose by any other will smell as sweet

naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii

Jan 6 2014 ISI Kolkata ML for MT 59

Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like

11989011 11989012 hellip 119890

11198971 119891

11 11989112 hellip 119891

11198981

11989021 11989022 hellip 119890

21198972 119891

21 11989122 hellip 119891

21198982

1198901199041 119890

1199042 hellip 119890

119904119897119904 119891

1199041 1198911199042 hellip 119891

119904119898119904

1198901198781 119890

1198782 hellip 119890

119878119897119878 119891

1198781 1198911198782 hellip 119891

119878119898119878

No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890

119904119901 =Index of English word 119890119904119901in English vocabularydictionary

119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary

(Thanks to Sachin Pawar for helping with the maths formulae processing)

Jan 6 2014 ISI Kolkata ML for MT 60

Hidden variables and parameters

Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878

119904=1 where each hidden variable is

as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French

word 119911119901119902119904 = 0 otherwise

Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893

word in French vocabulary

Jan 6 2014 ISI Kolkata ML for MT 61

Likelihoods Data Likelihood L(D Θ)

Data Log-Likelihood LL(D Θ)

Expected value of Data Log-Likelihood E(LL(D Θ))

Jan 6 2014 ISI Kolkata ML for MT 62

Constraint and Lagrangian

119875119894119895

119881119865

119895=1

= 1 forall119894

Jan 6 2014 ISI Kolkata ML for MT 63

Differentiating wrt Pij

Jan 6 2014 ISI Kolkata ML for MT 64

Final E and M steps

M-step

E-step

Jan 6 2014 ISI Kolkata ML for MT 65

Combinatorial considerations

Jan 6 2014 ISI Kolkata ML for MT 66

Example

ISI Kolkata ML for MT Jan 6 2014 67

All possible alignments

ISI Kolkata ML for MT Jan 6 2014 68

First fundamental requirement of SMT

Alignment requires evidence of

bull firstly a translation pair to introduce the

POSSIBILITY of a mapping

bull then another pair to establish with

CERTAINTY the mapping

ISI Kolkata ML for MT Jan 6 2014 69

For the ldquocertaintyrdquo

We have a translation pair containing alignment candidates and none of the other words in the translation pair

OR

We have a translation pair containing all words in the translation pair except the alignment candidates

ISI Kolkata ML for MT Jan 6 2014 70

Thereforehellip

If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty

ISI Kolkata ML for MT Jan 6 2014 71

Rough estimate of data requirement

SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge

ie no meanings or grammatical properties of any words phrases or sentences

Each language has a vocabulary of 100000 words can give rise to about 500000 word forms

through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟

bdquowent‟ and bdquogone‟

ISI Kolkata ML for MT Jan 6 2014 72

Reasons for mapping to multiple words

Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register

polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)

syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood

ISI Kolkata ML for MT Jan 6 2014 73

Estimate of corpora requirement

Assume that on an average a sentence is 10 words long

an additional 9 translation pairs for getting at one of the 5 mappings

10 sentences per mapping per word

a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences

Estimate is not wide off the mark

Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs

ISI Kolkata ML for MT Jan 6 2014 74

Our work on factor based SMT

Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009

Jan 6 2014 ISI Kolkata ML for MT 75

Case Marker and Morphology crucial in E-H MT

Order of magnitiude facelift in Fluency and fidelity

Determined by the combination of suffixes and semantic relations on the English side

Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers

Jan 6 2014 ISI Kolkata ML for MT 76

Semantic relations+SuffixesCase

Markers+inflections

I ate mangoes

I ltagt ate eatpast mangoes ltobj

I ltagt mangoes ltobjpl eatpast

mei_ne aam khaa_yaa

Jan 6 2014 ISI Kolkata ML for MT 77

Our Approach

Factored model (Koehn and Hoang 2007) with the following translation factor

suffix + semantic relation case markersuffix

Experiments with the following relations

Dependency relations from the stanford parser

Deeper semantic roles from Universal Networking Language (UNL)

Jan 6 2014 ISI Kolkata ML for MT 78

Our Factorization

Jan 6 2014 ISI Kolkata ML for MT 79

Experiments

Jan 6 2014 ISI Kolkata ML for MT 80

Corpus Statistics

Jan 6 2014 ISI Kolkata ML for MT 81

Results The impact of suffix and semantic factors

Jan 6 2014 ISI Kolkata ML for MT 82

Results The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 83

Subjective Evaluation The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 84

Impact of sentence length (F Fluency AAdequacy E Errors)

Jan 6 2014 ISI Kolkata ML for MT 85

A feel for the improvement-baseline

Jan 6 2014 ISI Kolkata ML for MT 86

A feel for the improvement-reorder

Jan 6 2014 ISI Kolkata ML for MT 87

A feel for the improvement-Semantic relation

Jan 6 2014 ISI Kolkata ML for MT 88

A recent study

PAN Indian SMT

Jan 6 2014 ISI Kolkata ML for MT 89

Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator

SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi

Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English

Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences

Evaluation with BLEU METEOR scores also show high correlation with BLEU

Jan 6 2014 ISI Kolkata ML for MT 90

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

ldquocountsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 13 13 13

y 0 13 13 13

z 0 13 13 13

a b

w x

a b c d

w 12 12 0 0

x 12 12 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 53

Revised probability example

trevised(a w)

12

= -------------------------------------------------------------------

(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)

Jan 6 2014 ISI Kolkata ML for MT 54

Revised probabilities table

a b c d

w 12 14 0 0

x 12 512 13 13

y 0 16 13 13

z 0 16 13 13

ldquorevised countsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 59 13 13

y 0 29 13 13

z 0 29 13 13

a b

w x

a b c d

w 12 38 0 0

x 12 58 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 56

Re-Revised probabilities table a b c d

w 12 316 0 0

x 12 85144 13 13

y 0 19 13 13

z 0 19 13 13

Continue until convergence notice that (bx) binding gets progressively stronger

b=rabbits x=lapins

Derivation of EM based Alignment Expressions

Hindi)(Say language ofy vocabular

English)(Say language ofry vocalbula

2

1

LV

LV

F

E

what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet

जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet

E1

F1

E2

F2

Jan 6 2014 ISI Kolkata ML for MT 58

Vocabulary mapping

Vocabulary VE VF

what is in a name that which we call rose by any other will smell as sweet

naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii

Jan 6 2014 ISI Kolkata ML for MT 59

Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like

11989011 11989012 hellip 119890

11198971 119891

11 11989112 hellip 119891

11198981

11989021 11989022 hellip 119890

21198972 119891

21 11989122 hellip 119891

21198982

1198901199041 119890

1199042 hellip 119890

119904119897119904 119891

1199041 1198911199042 hellip 119891

119904119898119904

1198901198781 119890

1198782 hellip 119890

119878119897119878 119891

1198781 1198911198782 hellip 119891

119878119898119878

No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890

119904119901 =Index of English word 119890119904119901in English vocabularydictionary

119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary

(Thanks to Sachin Pawar for helping with the maths formulae processing)

Jan 6 2014 ISI Kolkata ML for MT 60

Hidden variables and parameters

Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878

119904=1 where each hidden variable is

as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French

word 119911119901119902119904 = 0 otherwise

Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893

word in French vocabulary

Jan 6 2014 ISI Kolkata ML for MT 61

Likelihoods Data Likelihood L(D Θ)

Data Log-Likelihood LL(D Θ)

Expected value of Data Log-Likelihood E(LL(D Θ))

Jan 6 2014 ISI Kolkata ML for MT 62

Constraint and Lagrangian

119875119894119895

119881119865

119895=1

= 1 forall119894

Jan 6 2014 ISI Kolkata ML for MT 63

Differentiating wrt Pij

Jan 6 2014 ISI Kolkata ML for MT 64

Final E and M steps

M-step

E-step

Jan 6 2014 ISI Kolkata ML for MT 65

Combinatorial considerations

Jan 6 2014 ISI Kolkata ML for MT 66

Example

ISI Kolkata ML for MT Jan 6 2014 67

All possible alignments

ISI Kolkata ML for MT Jan 6 2014 68

First fundamental requirement of SMT

Alignment requires evidence of

bull firstly a translation pair to introduce the

POSSIBILITY of a mapping

bull then another pair to establish with

CERTAINTY the mapping

ISI Kolkata ML for MT Jan 6 2014 69

For the ldquocertaintyrdquo

We have a translation pair containing alignment candidates and none of the other words in the translation pair

OR

We have a translation pair containing all words in the translation pair except the alignment candidates

ISI Kolkata ML for MT Jan 6 2014 70

Thereforehellip

If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty

ISI Kolkata ML for MT Jan 6 2014 71

Rough estimate of data requirement

SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge

ie no meanings or grammatical properties of any words phrases or sentences

Each language has a vocabulary of 100000 words can give rise to about 500000 word forms

through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟

bdquowent‟ and bdquogone‟

ISI Kolkata ML for MT Jan 6 2014 72

Reasons for mapping to multiple words

Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register

polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)

syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood

ISI Kolkata ML for MT Jan 6 2014 73

Estimate of corpora requirement

Assume that on an average a sentence is 10 words long

an additional 9 translation pairs for getting at one of the 5 mappings

10 sentences per mapping per word

a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences

Estimate is not wide off the mark

Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs

ISI Kolkata ML for MT Jan 6 2014 74

Our work on factor based SMT

Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009

Jan 6 2014 ISI Kolkata ML for MT 75

Case Marker and Morphology crucial in E-H MT

Order of magnitiude facelift in Fluency and fidelity

Determined by the combination of suffixes and semantic relations on the English side

Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers

Jan 6 2014 ISI Kolkata ML for MT 76

Semantic relations+SuffixesCase

Markers+inflections

I ate mangoes

I ltagt ate eatpast mangoes ltobj

I ltagt mangoes ltobjpl eatpast

mei_ne aam khaa_yaa

Jan 6 2014 ISI Kolkata ML for MT 77

Our Approach

Factored model (Koehn and Hoang 2007) with the following translation factor

suffix + semantic relation case markersuffix

Experiments with the following relations

Dependency relations from the stanford parser

Deeper semantic roles from Universal Networking Language (UNL)

Jan 6 2014 ISI Kolkata ML for MT 78

Our Factorization

Jan 6 2014 ISI Kolkata ML for MT 79

Experiments

Jan 6 2014 ISI Kolkata ML for MT 80

Corpus Statistics

Jan 6 2014 ISI Kolkata ML for MT 81

Results The impact of suffix and semantic factors

Jan 6 2014 ISI Kolkata ML for MT 82

Results The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 83

Subjective Evaluation The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 84

Impact of sentence length (F Fluency AAdequacy E Errors)

Jan 6 2014 ISI Kolkata ML for MT 85

A feel for the improvement-baseline

Jan 6 2014 ISI Kolkata ML for MT 86

A feel for the improvement-reorder

Jan 6 2014 ISI Kolkata ML for MT 87

A feel for the improvement-Semantic relation

Jan 6 2014 ISI Kolkata ML for MT 88

A recent study

PAN Indian SMT

Jan 6 2014 ISI Kolkata ML for MT 89

Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator

SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi

Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English

Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences

Evaluation with BLEU METEOR scores also show high correlation with BLEU

Jan 6 2014 ISI Kolkata ML for MT 90

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Revised probability example

trevised(a w)

12

= -------------------------------------------------------------------

(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)

Jan 6 2014 ISI Kolkata ML for MT 54

Revised probabilities table

a b c d

w 12 14 0 0

x 12 512 13 13

y 0 16 13 13

z 0 16 13 13

ldquorevised countsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 59 13 13

y 0 29 13 13

z 0 29 13 13

a b

w x

a b c d

w 12 38 0 0

x 12 58 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 56

Re-Revised probabilities table a b c d

w 12 316 0 0

x 12 85144 13 13

y 0 19 13 13

z 0 19 13 13

Continue until convergence notice that (bx) binding gets progressively stronger

b=rabbits x=lapins

Derivation of EM based Alignment Expressions

Hindi)(Say language ofy vocabular

English)(Say language ofry vocalbula

2

1

LV

LV

F

E

what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet

जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet

E1

F1

E2

F2

Jan 6 2014 ISI Kolkata ML for MT 58

Vocabulary mapping

Vocabulary VE VF

what is in a name that which we call rose by any other will smell as sweet

naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii

Jan 6 2014 ISI Kolkata ML for MT 59

Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like

11989011 11989012 hellip 119890

11198971 119891

11 11989112 hellip 119891

11198981

11989021 11989022 hellip 119890

21198972 119891

21 11989122 hellip 119891

21198982

1198901199041 119890

1199042 hellip 119890

119904119897119904 119891

1199041 1198911199042 hellip 119891

119904119898119904

1198901198781 119890

1198782 hellip 119890

119878119897119878 119891

1198781 1198911198782 hellip 119891

119878119898119878

No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890

119904119901 =Index of English word 119890119904119901in English vocabularydictionary

119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary

(Thanks to Sachin Pawar for helping with the maths formulae processing)

Jan 6 2014 ISI Kolkata ML for MT 60

Hidden variables and parameters

Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878

119904=1 where each hidden variable is

as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French

word 119911119901119902119904 = 0 otherwise

Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893

word in French vocabulary

Jan 6 2014 ISI Kolkata ML for MT 61

Likelihoods Data Likelihood L(D Θ)

Data Log-Likelihood LL(D Θ)

Expected value of Data Log-Likelihood E(LL(D Θ))

Jan 6 2014 ISI Kolkata ML for MT 62

Constraint and Lagrangian

119875119894119895

119881119865

119895=1

= 1 forall119894

Jan 6 2014 ISI Kolkata ML for MT 63

Differentiating wrt Pij

Jan 6 2014 ISI Kolkata ML for MT 64

Final E and M steps

M-step

E-step

Jan 6 2014 ISI Kolkata ML for MT 65

Combinatorial considerations

Jan 6 2014 ISI Kolkata ML for MT 66

Example

ISI Kolkata ML for MT Jan 6 2014 67

All possible alignments

ISI Kolkata ML for MT Jan 6 2014 68

First fundamental requirement of SMT

Alignment requires evidence of

bull firstly a translation pair to introduce the

POSSIBILITY of a mapping

bull then another pair to establish with

CERTAINTY the mapping

ISI Kolkata ML for MT Jan 6 2014 69

For the ldquocertaintyrdquo

We have a translation pair containing alignment candidates and none of the other words in the translation pair

OR

We have a translation pair containing all words in the translation pair except the alignment candidates

ISI Kolkata ML for MT Jan 6 2014 70

Thereforehellip

If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty

ISI Kolkata ML for MT Jan 6 2014 71

Rough estimate of data requirement

SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge

ie no meanings or grammatical properties of any words phrases or sentences

Each language has a vocabulary of 100000 words can give rise to about 500000 word forms

through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟

bdquowent‟ and bdquogone‟

ISI Kolkata ML for MT Jan 6 2014 72

Reasons for mapping to multiple words

Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register

polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)

syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood

ISI Kolkata ML for MT Jan 6 2014 73

Estimate of corpora requirement

Assume that on an average a sentence is 10 words long

an additional 9 translation pairs for getting at one of the 5 mappings

10 sentences per mapping per word

a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences

Estimate is not wide off the mark

Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs

ISI Kolkata ML for MT Jan 6 2014 74

Our work on factor based SMT

Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009

Jan 6 2014 ISI Kolkata ML for MT 75

Case Marker and Morphology crucial in E-H MT

Order of magnitiude facelift in Fluency and fidelity

Determined by the combination of suffixes and semantic relations on the English side

Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers

Jan 6 2014 ISI Kolkata ML for MT 76

Semantic relations+SuffixesCase

Markers+inflections

I ate mangoes

I ltagt ate eatpast mangoes ltobj

I ltagt mangoes ltobjpl eatpast

mei_ne aam khaa_yaa

Jan 6 2014 ISI Kolkata ML for MT 77

Our Approach

Factored model (Koehn and Hoang 2007) with the following translation factor

suffix + semantic relation case markersuffix

Experiments with the following relations

Dependency relations from the stanford parser

Deeper semantic roles from Universal Networking Language (UNL)

Jan 6 2014 ISI Kolkata ML for MT 78

Our Factorization

Jan 6 2014 ISI Kolkata ML for MT 79

Experiments

Jan 6 2014 ISI Kolkata ML for MT 80

Corpus Statistics

Jan 6 2014 ISI Kolkata ML for MT 81

Results The impact of suffix and semantic factors

Jan 6 2014 ISI Kolkata ML for MT 82

Results The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 83

Subjective Evaluation The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 84

Impact of sentence length (F Fluency AAdequacy E Errors)

Jan 6 2014 ISI Kolkata ML for MT 85

A feel for the improvement-baseline

Jan 6 2014 ISI Kolkata ML for MT 86

A feel for the improvement-reorder

Jan 6 2014 ISI Kolkata ML for MT 87

A feel for the improvement-Semantic relation

Jan 6 2014 ISI Kolkata ML for MT 88

A recent study

PAN Indian SMT

Jan 6 2014 ISI Kolkata ML for MT 89

Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator

SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi

Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English

Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences

Evaluation with BLEU METEOR scores also show high correlation with BLEU

Jan 6 2014 ISI Kolkata ML for MT 90

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Revised probabilities table

a b c d

w 12 14 0 0

x 12 512 13 13

y 0 16 13 13

z 0 16 13 13

ldquorevised countsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 59 13 13

y 0 29 13 13

z 0 29 13 13

a b

w x

a b c d

w 12 38 0 0

x 12 58 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 56

Re-Revised probabilities table a b c d

w 12 316 0 0

x 12 85144 13 13

y 0 19 13 13

z 0 19 13 13

Continue until convergence notice that (bx) binding gets progressively stronger

b=rabbits x=lapins

Derivation of EM based Alignment Expressions

Hindi)(Say language ofy vocabular

English)(Say language ofry vocalbula

2

1

LV

LV

F

E

what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet

जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet

E1

F1

E2

F2

Jan 6 2014 ISI Kolkata ML for MT 58

Vocabulary mapping

Vocabulary VE VF

what is in a name that which we call rose by any other will smell as sweet

naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii

Jan 6 2014 ISI Kolkata ML for MT 59

Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like

11989011 11989012 hellip 119890

11198971 119891

11 11989112 hellip 119891

11198981

11989021 11989022 hellip 119890

21198972 119891

21 11989122 hellip 119891

21198982

1198901199041 119890

1199042 hellip 119890

119904119897119904 119891

1199041 1198911199042 hellip 119891

119904119898119904

1198901198781 119890

1198782 hellip 119890

119878119897119878 119891

1198781 1198911198782 hellip 119891

119878119898119878

No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890

119904119901 =Index of English word 119890119904119901in English vocabularydictionary

119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary

(Thanks to Sachin Pawar for helping with the maths formulae processing)

Jan 6 2014 ISI Kolkata ML for MT 60

Hidden variables and parameters

Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878

119904=1 where each hidden variable is

as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French

word 119911119901119902119904 = 0 otherwise

Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893

word in French vocabulary

Jan 6 2014 ISI Kolkata ML for MT 61

Likelihoods Data Likelihood L(D Θ)

Data Log-Likelihood LL(D Θ)

Expected value of Data Log-Likelihood E(LL(D Θ))

Jan 6 2014 ISI Kolkata ML for MT 62

Constraint and Lagrangian

119875119894119895

119881119865

119895=1

= 1 forall119894

Jan 6 2014 ISI Kolkata ML for MT 63

Differentiating wrt Pij

Jan 6 2014 ISI Kolkata ML for MT 64

Final E and M steps

M-step

E-step

Jan 6 2014 ISI Kolkata ML for MT 65

Combinatorial considerations

Jan 6 2014 ISI Kolkata ML for MT 66

Example

ISI Kolkata ML for MT Jan 6 2014 67

All possible alignments

ISI Kolkata ML for MT Jan 6 2014 68

First fundamental requirement of SMT

Alignment requires evidence of

bull firstly a translation pair to introduce the

POSSIBILITY of a mapping

bull then another pair to establish with

CERTAINTY the mapping

ISI Kolkata ML for MT Jan 6 2014 69

For the ldquocertaintyrdquo

We have a translation pair containing alignment candidates and none of the other words in the translation pair

OR

We have a translation pair containing all words in the translation pair except the alignment candidates

ISI Kolkata ML for MT Jan 6 2014 70

Thereforehellip

If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty

ISI Kolkata ML for MT Jan 6 2014 71

Rough estimate of data requirement

SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge

ie no meanings or grammatical properties of any words phrases or sentences

Each language has a vocabulary of 100000 words can give rise to about 500000 word forms

through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟

bdquowent‟ and bdquogone‟

ISI Kolkata ML for MT Jan 6 2014 72

Reasons for mapping to multiple words

Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register

polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)

syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood

ISI Kolkata ML for MT Jan 6 2014 73

Estimate of corpora requirement

Assume that on an average a sentence is 10 words long

an additional 9 translation pairs for getting at one of the 5 mappings

10 sentences per mapping per word

a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences

Estimate is not wide off the mark

Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs

ISI Kolkata ML for MT Jan 6 2014 74

Our work on factor based SMT

Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009

Jan 6 2014 ISI Kolkata ML for MT 75

Case Marker and Morphology crucial in E-H MT

Order of magnitiude facelift in Fluency and fidelity

Determined by the combination of suffixes and semantic relations on the English side

Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers

Jan 6 2014 ISI Kolkata ML for MT 76

Semantic relations+SuffixesCase

Markers+inflections

I ate mangoes

I ltagt ate eatpast mangoes ltobj

I ltagt mangoes ltobjpl eatpast

mei_ne aam khaa_yaa

Jan 6 2014 ISI Kolkata ML for MT 77

Our Approach

Factored model (Koehn and Hoang 2007) with the following translation factor

suffix + semantic relation case markersuffix

Experiments with the following relations

Dependency relations from the stanford parser

Deeper semantic roles from Universal Networking Language (UNL)

Jan 6 2014 ISI Kolkata ML for MT 78

Our Factorization

Jan 6 2014 ISI Kolkata ML for MT 79

Experiments

Jan 6 2014 ISI Kolkata ML for MT 80

Corpus Statistics

Jan 6 2014 ISI Kolkata ML for MT 81

Results The impact of suffix and semantic factors

Jan 6 2014 ISI Kolkata ML for MT 82

Results The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 83

Subjective Evaluation The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 84

Impact of sentence length (F Fluency AAdequacy E Errors)

Jan 6 2014 ISI Kolkata ML for MT 85

A feel for the improvement-baseline

Jan 6 2014 ISI Kolkata ML for MT 86

A feel for the improvement-reorder

Jan 6 2014 ISI Kolkata ML for MT 87

A feel for the improvement-Semantic relation

Jan 6 2014 ISI Kolkata ML for MT 88

A recent study

PAN Indian SMT

Jan 6 2014 ISI Kolkata ML for MT 89

Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator

SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi

Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English

Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences

Evaluation with BLEU METEOR scores also show high correlation with BLEU

Jan 6 2014 ISI Kolkata ML for MT 90

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

ldquorevised countsrdquo b c d

x y z

a b c d

w 0 0 0 0

x 0 59 13 13

y 0 29 13 13

z 0 29 13 13

a b

w x

a b c d

w 12 38 0 0

x 12 58 0 0

y 0 0 0 0

z 0 0 0 0

Jan 6 2014 ISI Kolkata ML for MT 56

Re-Revised probabilities table a b c d

w 12 316 0 0

x 12 85144 13 13

y 0 19 13 13

z 0 19 13 13

Continue until convergence notice that (bx) binding gets progressively stronger

b=rabbits x=lapins

Derivation of EM based Alignment Expressions

Hindi)(Say language ofy vocabular

English)(Say language ofry vocalbula

2

1

LV

LV

F

E

what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet

जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet

E1

F1

E2

F2

Jan 6 2014 ISI Kolkata ML for MT 58

Vocabulary mapping

Vocabulary VE VF

what is in a name that which we call rose by any other will smell as sweet

naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii

Jan 6 2014 ISI Kolkata ML for MT 59

Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like

11989011 11989012 hellip 119890

11198971 119891

11 11989112 hellip 119891

11198981

11989021 11989022 hellip 119890

21198972 119891

21 11989122 hellip 119891

21198982

1198901199041 119890

1199042 hellip 119890

119904119897119904 119891

1199041 1198911199042 hellip 119891

119904119898119904

1198901198781 119890

1198782 hellip 119890

119878119897119878 119891

1198781 1198911198782 hellip 119891

119878119898119878

No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890

119904119901 =Index of English word 119890119904119901in English vocabularydictionary

119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary

(Thanks to Sachin Pawar for helping with the maths formulae processing)

Jan 6 2014 ISI Kolkata ML for MT 60

Hidden variables and parameters

Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878

119904=1 where each hidden variable is

as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French

word 119911119901119902119904 = 0 otherwise

Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893

word in French vocabulary

Jan 6 2014 ISI Kolkata ML for MT 61

Likelihoods Data Likelihood L(D Θ)

Data Log-Likelihood LL(D Θ)

Expected value of Data Log-Likelihood E(LL(D Θ))

Jan 6 2014 ISI Kolkata ML for MT 62

Constraint and Lagrangian

119875119894119895

119881119865

119895=1

= 1 forall119894

Jan 6 2014 ISI Kolkata ML for MT 63

Differentiating wrt Pij

Jan 6 2014 ISI Kolkata ML for MT 64

Final E and M steps

M-step

E-step

Jan 6 2014 ISI Kolkata ML for MT 65

Combinatorial considerations

Jan 6 2014 ISI Kolkata ML for MT 66

Example

ISI Kolkata ML for MT Jan 6 2014 67

All possible alignments

ISI Kolkata ML for MT Jan 6 2014 68

First fundamental requirement of SMT

Alignment requires evidence of

bull firstly a translation pair to introduce the

POSSIBILITY of a mapping

bull then another pair to establish with

CERTAINTY the mapping

ISI Kolkata ML for MT Jan 6 2014 69

For the ldquocertaintyrdquo

We have a translation pair containing alignment candidates and none of the other words in the translation pair

OR

We have a translation pair containing all words in the translation pair except the alignment candidates

ISI Kolkata ML for MT Jan 6 2014 70

Thereforehellip

If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty

ISI Kolkata ML for MT Jan 6 2014 71

Rough estimate of data requirement

SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge

ie no meanings or grammatical properties of any words phrases or sentences

Each language has a vocabulary of 100000 words can give rise to about 500000 word forms

through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟

bdquowent‟ and bdquogone‟

ISI Kolkata ML for MT Jan 6 2014 72

Reasons for mapping to multiple words

Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register

polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)

syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood

ISI Kolkata ML for MT Jan 6 2014 73

Estimate of corpora requirement

Assume that on an average a sentence is 10 words long

an additional 9 translation pairs for getting at one of the 5 mappings

10 sentences per mapping per word

a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences

Estimate is not wide off the mark

Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs

ISI Kolkata ML for MT Jan 6 2014 74

Our work on factor based SMT

Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009

Jan 6 2014 ISI Kolkata ML for MT 75

Case Marker and Morphology crucial in E-H MT

Order of magnitiude facelift in Fluency and fidelity

Determined by the combination of suffixes and semantic relations on the English side

Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers

Jan 6 2014 ISI Kolkata ML for MT 76

Semantic relations+SuffixesCase

Markers+inflections

I ate mangoes

I ltagt ate eatpast mangoes ltobj

I ltagt mangoes ltobjpl eatpast

mei_ne aam khaa_yaa

Jan 6 2014 ISI Kolkata ML for MT 77

Our Approach

Factored model (Koehn and Hoang 2007) with the following translation factor

suffix + semantic relation case markersuffix

Experiments with the following relations

Dependency relations from the stanford parser

Deeper semantic roles from Universal Networking Language (UNL)

Jan 6 2014 ISI Kolkata ML for MT 78

Our Factorization

Jan 6 2014 ISI Kolkata ML for MT 79

Experiments

Jan 6 2014 ISI Kolkata ML for MT 80

Corpus Statistics

Jan 6 2014 ISI Kolkata ML for MT 81

Results The impact of suffix and semantic factors

Jan 6 2014 ISI Kolkata ML for MT 82

Results The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 83

Subjective Evaluation The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 84

Impact of sentence length (F Fluency AAdequacy E Errors)

Jan 6 2014 ISI Kolkata ML for MT 85

A feel for the improvement-baseline

Jan 6 2014 ISI Kolkata ML for MT 86

A feel for the improvement-reorder

Jan 6 2014 ISI Kolkata ML for MT 87

A feel for the improvement-Semantic relation

Jan 6 2014 ISI Kolkata ML for MT 88

A recent study

PAN Indian SMT

Jan 6 2014 ISI Kolkata ML for MT 89

Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator

SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi

Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English

Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences

Evaluation with BLEU METEOR scores also show high correlation with BLEU

Jan 6 2014 ISI Kolkata ML for MT 90

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Re-Revised probabilities table a b c d

w 12 316 0 0

x 12 85144 13 13

y 0 19 13 13

z 0 19 13 13

Continue until convergence notice that (bx) binding gets progressively stronger

b=rabbits x=lapins

Derivation of EM based Alignment Expressions

Hindi)(Say language ofy vocabular

English)(Say language ofry vocalbula

2

1

LV

LV

F

E

what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet

जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet

E1

F1

E2

F2

Jan 6 2014 ISI Kolkata ML for MT 58

Vocabulary mapping

Vocabulary VE VF

what is in a name that which we call rose by any other will smell as sweet

naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii

Jan 6 2014 ISI Kolkata ML for MT 59

Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like

11989011 11989012 hellip 119890

11198971 119891

11 11989112 hellip 119891

11198981

11989021 11989022 hellip 119890

21198972 119891

21 11989122 hellip 119891

21198982

1198901199041 119890

1199042 hellip 119890

119904119897119904 119891

1199041 1198911199042 hellip 119891

119904119898119904

1198901198781 119890

1198782 hellip 119890

119878119897119878 119891

1198781 1198911198782 hellip 119891

119878119898119878

No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890

119904119901 =Index of English word 119890119904119901in English vocabularydictionary

119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary

(Thanks to Sachin Pawar for helping with the maths formulae processing)

Jan 6 2014 ISI Kolkata ML for MT 60

Hidden variables and parameters

Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878

119904=1 where each hidden variable is

as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French

word 119911119901119902119904 = 0 otherwise

Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893

word in French vocabulary

Jan 6 2014 ISI Kolkata ML for MT 61

Likelihoods Data Likelihood L(D Θ)

Data Log-Likelihood LL(D Θ)

Expected value of Data Log-Likelihood E(LL(D Θ))

Jan 6 2014 ISI Kolkata ML for MT 62

Constraint and Lagrangian

119875119894119895

119881119865

119895=1

= 1 forall119894

Jan 6 2014 ISI Kolkata ML for MT 63

Differentiating wrt Pij

Jan 6 2014 ISI Kolkata ML for MT 64

Final E and M steps

M-step

E-step

Jan 6 2014 ISI Kolkata ML for MT 65

Combinatorial considerations

Jan 6 2014 ISI Kolkata ML for MT 66

Example

ISI Kolkata ML for MT Jan 6 2014 67

All possible alignments

ISI Kolkata ML for MT Jan 6 2014 68

First fundamental requirement of SMT

Alignment requires evidence of

bull firstly a translation pair to introduce the

POSSIBILITY of a mapping

bull then another pair to establish with

CERTAINTY the mapping

ISI Kolkata ML for MT Jan 6 2014 69

For the ldquocertaintyrdquo

We have a translation pair containing alignment candidates and none of the other words in the translation pair

OR

We have a translation pair containing all words in the translation pair except the alignment candidates

ISI Kolkata ML for MT Jan 6 2014 70

Thereforehellip

If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty

ISI Kolkata ML for MT Jan 6 2014 71

Rough estimate of data requirement

SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge

ie no meanings or grammatical properties of any words phrases or sentences

Each language has a vocabulary of 100000 words can give rise to about 500000 word forms

through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟

bdquowent‟ and bdquogone‟

ISI Kolkata ML for MT Jan 6 2014 72

Reasons for mapping to multiple words

Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register

polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)

syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood

ISI Kolkata ML for MT Jan 6 2014 73

Estimate of corpora requirement

Assume that on an average a sentence is 10 words long

an additional 9 translation pairs for getting at one of the 5 mappings

10 sentences per mapping per word

a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences

Estimate is not wide off the mark

Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs

ISI Kolkata ML for MT Jan 6 2014 74

Our work on factor based SMT

Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009

Jan 6 2014 ISI Kolkata ML for MT 75

Case Marker and Morphology crucial in E-H MT

Order of magnitiude facelift in Fluency and fidelity

Determined by the combination of suffixes and semantic relations on the English side

Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers

Jan 6 2014 ISI Kolkata ML for MT 76

Semantic relations+SuffixesCase

Markers+inflections

I ate mangoes

I ltagt ate eatpast mangoes ltobj

I ltagt mangoes ltobjpl eatpast

mei_ne aam khaa_yaa

Jan 6 2014 ISI Kolkata ML for MT 77

Our Approach

Factored model (Koehn and Hoang 2007) with the following translation factor

suffix + semantic relation case markersuffix

Experiments with the following relations

Dependency relations from the stanford parser

Deeper semantic roles from Universal Networking Language (UNL)

Jan 6 2014 ISI Kolkata ML for MT 78

Our Factorization

Jan 6 2014 ISI Kolkata ML for MT 79

Experiments

Jan 6 2014 ISI Kolkata ML for MT 80

Corpus Statistics

Jan 6 2014 ISI Kolkata ML for MT 81

Results The impact of suffix and semantic factors

Jan 6 2014 ISI Kolkata ML for MT 82

Results The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 83

Subjective Evaluation The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 84

Impact of sentence length (F Fluency AAdequacy E Errors)

Jan 6 2014 ISI Kolkata ML for MT 85

A feel for the improvement-baseline

Jan 6 2014 ISI Kolkata ML for MT 86

A feel for the improvement-reorder

Jan 6 2014 ISI Kolkata ML for MT 87

A feel for the improvement-Semantic relation

Jan 6 2014 ISI Kolkata ML for MT 88

A recent study

PAN Indian SMT

Jan 6 2014 ISI Kolkata ML for MT 89

Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator

SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi

Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English

Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences

Evaluation with BLEU METEOR scores also show high correlation with BLEU

Jan 6 2014 ISI Kolkata ML for MT 90

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Derivation of EM based Alignment Expressions

Hindi)(Say language ofy vocabular

English)(Say language ofry vocalbula

2

1

LV

LV

F

E

what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet

जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet

E1

F1

E2

F2

Jan 6 2014 ISI Kolkata ML for MT 58

Vocabulary mapping

Vocabulary VE VF

what is in a name that which we call rose by any other will smell as sweet

naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii

Jan 6 2014 ISI Kolkata ML for MT 59

Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like

11989011 11989012 hellip 119890

11198971 119891

11 11989112 hellip 119891

11198981

11989021 11989022 hellip 119890

21198972 119891

21 11989122 hellip 119891

21198982

1198901199041 119890

1199042 hellip 119890

119904119897119904 119891

1199041 1198911199042 hellip 119891

119904119898119904

1198901198781 119890

1198782 hellip 119890

119878119897119878 119891

1198781 1198911198782 hellip 119891

119878119898119878

No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890

119904119901 =Index of English word 119890119904119901in English vocabularydictionary

119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary

(Thanks to Sachin Pawar for helping with the maths formulae processing)

Jan 6 2014 ISI Kolkata ML for MT 60

Hidden variables and parameters

Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878

119904=1 where each hidden variable is

as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French

word 119911119901119902119904 = 0 otherwise

Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893

word in French vocabulary

Jan 6 2014 ISI Kolkata ML for MT 61

Likelihoods Data Likelihood L(D Θ)

Data Log-Likelihood LL(D Θ)

Expected value of Data Log-Likelihood E(LL(D Θ))

Jan 6 2014 ISI Kolkata ML for MT 62

Constraint and Lagrangian

119875119894119895

119881119865

119895=1

= 1 forall119894

Jan 6 2014 ISI Kolkata ML for MT 63

Differentiating wrt Pij

Jan 6 2014 ISI Kolkata ML for MT 64

Final E and M steps

M-step

E-step

Jan 6 2014 ISI Kolkata ML for MT 65

Combinatorial considerations

Jan 6 2014 ISI Kolkata ML for MT 66

Example

ISI Kolkata ML for MT Jan 6 2014 67

All possible alignments

ISI Kolkata ML for MT Jan 6 2014 68

First fundamental requirement of SMT

Alignment requires evidence of

bull firstly a translation pair to introduce the

POSSIBILITY of a mapping

bull then another pair to establish with

CERTAINTY the mapping

ISI Kolkata ML for MT Jan 6 2014 69

For the ldquocertaintyrdquo

We have a translation pair containing alignment candidates and none of the other words in the translation pair

OR

We have a translation pair containing all words in the translation pair except the alignment candidates

ISI Kolkata ML for MT Jan 6 2014 70

Thereforehellip

If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty

ISI Kolkata ML for MT Jan 6 2014 71

Rough estimate of data requirement

SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge

ie no meanings or grammatical properties of any words phrases or sentences

Each language has a vocabulary of 100000 words can give rise to about 500000 word forms

through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟

bdquowent‟ and bdquogone‟

ISI Kolkata ML for MT Jan 6 2014 72

Reasons for mapping to multiple words

Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register

polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)

syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood

ISI Kolkata ML for MT Jan 6 2014 73

Estimate of corpora requirement

Assume that on an average a sentence is 10 words long

an additional 9 translation pairs for getting at one of the 5 mappings

10 sentences per mapping per word

a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences

Estimate is not wide off the mark

Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs

ISI Kolkata ML for MT Jan 6 2014 74

Our work on factor based SMT

Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009

Jan 6 2014 ISI Kolkata ML for MT 75

Case Marker and Morphology crucial in E-H MT

Order of magnitiude facelift in Fluency and fidelity

Determined by the combination of suffixes and semantic relations on the English side

Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers

Jan 6 2014 ISI Kolkata ML for MT 76

Semantic relations+SuffixesCase

Markers+inflections

I ate mangoes

I ltagt ate eatpast mangoes ltobj

I ltagt mangoes ltobjpl eatpast

mei_ne aam khaa_yaa

Jan 6 2014 ISI Kolkata ML for MT 77

Our Approach

Factored model (Koehn and Hoang 2007) with the following translation factor

suffix + semantic relation case markersuffix

Experiments with the following relations

Dependency relations from the stanford parser

Deeper semantic roles from Universal Networking Language (UNL)

Jan 6 2014 ISI Kolkata ML for MT 78

Our Factorization

Jan 6 2014 ISI Kolkata ML for MT 79

Experiments

Jan 6 2014 ISI Kolkata ML for MT 80

Corpus Statistics

Jan 6 2014 ISI Kolkata ML for MT 81

Results The impact of suffix and semantic factors

Jan 6 2014 ISI Kolkata ML for MT 82

Results The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 83

Subjective Evaluation The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 84

Impact of sentence length (F Fluency AAdequacy E Errors)

Jan 6 2014 ISI Kolkata ML for MT 85

A feel for the improvement-baseline

Jan 6 2014 ISI Kolkata ML for MT 86

A feel for the improvement-reorder

Jan 6 2014 ISI Kolkata ML for MT 87

A feel for the improvement-Semantic relation

Jan 6 2014 ISI Kolkata ML for MT 88

A recent study

PAN Indian SMT

Jan 6 2014 ISI Kolkata ML for MT 89

Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator

SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi

Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English

Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences

Evaluation with BLEU METEOR scores also show high correlation with BLEU

Jan 6 2014 ISI Kolkata ML for MT 90

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Vocabulary mapping

Vocabulary VE VF

what is in a name that which we call rose by any other will smell as sweet

naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii

Jan 6 2014 ISI Kolkata ML for MT 59

Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like

11989011 11989012 hellip 119890

11198971 119891

11 11989112 hellip 119891

11198981

11989021 11989022 hellip 119890

21198972 119891

21 11989122 hellip 119891

21198982

1198901199041 119890

1199042 hellip 119890

119904119897119904 119891

1199041 1198911199042 hellip 119891

119904119898119904

1198901198781 119890

1198782 hellip 119890

119878119897119878 119891

1198781 1198911198782 hellip 119891

119878119898119878

No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890

119904119901 =Index of English word 119890119904119901in English vocabularydictionary

119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary

(Thanks to Sachin Pawar for helping with the maths formulae processing)

Jan 6 2014 ISI Kolkata ML for MT 60

Hidden variables and parameters

Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878

119904=1 where each hidden variable is

as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French

word 119911119901119902119904 = 0 otherwise

Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893

word in French vocabulary

Jan 6 2014 ISI Kolkata ML for MT 61

Likelihoods Data Likelihood L(D Θ)

Data Log-Likelihood LL(D Θ)

Expected value of Data Log-Likelihood E(LL(D Θ))

Jan 6 2014 ISI Kolkata ML for MT 62

Constraint and Lagrangian

119875119894119895

119881119865

119895=1

= 1 forall119894

Jan 6 2014 ISI Kolkata ML for MT 63

Differentiating wrt Pij

Jan 6 2014 ISI Kolkata ML for MT 64

Final E and M steps

M-step

E-step

Jan 6 2014 ISI Kolkata ML for MT 65

Combinatorial considerations

Jan 6 2014 ISI Kolkata ML for MT 66

Example

ISI Kolkata ML for MT Jan 6 2014 67

All possible alignments

ISI Kolkata ML for MT Jan 6 2014 68

First fundamental requirement of SMT

Alignment requires evidence of

bull firstly a translation pair to introduce the

POSSIBILITY of a mapping

bull then another pair to establish with

CERTAINTY the mapping

ISI Kolkata ML for MT Jan 6 2014 69

For the ldquocertaintyrdquo

We have a translation pair containing alignment candidates and none of the other words in the translation pair

OR

We have a translation pair containing all words in the translation pair except the alignment candidates

ISI Kolkata ML for MT Jan 6 2014 70

Thereforehellip

If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty

ISI Kolkata ML for MT Jan 6 2014 71

Rough estimate of data requirement

SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge

ie no meanings or grammatical properties of any words phrases or sentences

Each language has a vocabulary of 100000 words can give rise to about 500000 word forms

through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟

bdquowent‟ and bdquogone‟

ISI Kolkata ML for MT Jan 6 2014 72

Reasons for mapping to multiple words

Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register

polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)

syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood

ISI Kolkata ML for MT Jan 6 2014 73

Estimate of corpora requirement

Assume that on an average a sentence is 10 words long

an additional 9 translation pairs for getting at one of the 5 mappings

10 sentences per mapping per word

a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences

Estimate is not wide off the mark

Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs

ISI Kolkata ML for MT Jan 6 2014 74

Our work on factor based SMT

Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009

Jan 6 2014 ISI Kolkata ML for MT 75

Case Marker and Morphology crucial in E-H MT

Order of magnitiude facelift in Fluency and fidelity

Determined by the combination of suffixes and semantic relations on the English side

Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers

Jan 6 2014 ISI Kolkata ML for MT 76

Semantic relations+SuffixesCase

Markers+inflections

I ate mangoes

I ltagt ate eatpast mangoes ltobj

I ltagt mangoes ltobjpl eatpast

mei_ne aam khaa_yaa

Jan 6 2014 ISI Kolkata ML for MT 77

Our Approach

Factored model (Koehn and Hoang 2007) with the following translation factor

suffix + semantic relation case markersuffix

Experiments with the following relations

Dependency relations from the stanford parser

Deeper semantic roles from Universal Networking Language (UNL)

Jan 6 2014 ISI Kolkata ML for MT 78

Our Factorization

Jan 6 2014 ISI Kolkata ML for MT 79

Experiments

Jan 6 2014 ISI Kolkata ML for MT 80

Corpus Statistics

Jan 6 2014 ISI Kolkata ML for MT 81

Results The impact of suffix and semantic factors

Jan 6 2014 ISI Kolkata ML for MT 82

Results The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 83

Subjective Evaluation The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 84

Impact of sentence length (F Fluency AAdequacy E Errors)

Jan 6 2014 ISI Kolkata ML for MT 85

A feel for the improvement-baseline

Jan 6 2014 ISI Kolkata ML for MT 86

A feel for the improvement-reorder

Jan 6 2014 ISI Kolkata ML for MT 87

A feel for the improvement-Semantic relation

Jan 6 2014 ISI Kolkata ML for MT 88

A recent study

PAN Indian SMT

Jan 6 2014 ISI Kolkata ML for MT 89

Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator

SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi

Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English

Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences

Evaluation with BLEU METEOR scores also show high correlation with BLEU

Jan 6 2014 ISI Kolkata ML for MT 90

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like

11989011 11989012 hellip 119890

11198971 119891

11 11989112 hellip 119891

11198981

11989021 11989022 hellip 119890

21198972 119891

21 11989122 hellip 119891

21198982

1198901199041 119890

1199042 hellip 119890

119904119897119904 119891

1199041 1198911199042 hellip 119891

119904119898119904

1198901198781 119890

1198782 hellip 119890

119878119897119878 119891

1198781 1198911198782 hellip 119891

119878119898119878

No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890

119904119901 =Index of English word 119890119904119901in English vocabularydictionary

119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary

(Thanks to Sachin Pawar for helping with the maths formulae processing)

Jan 6 2014 ISI Kolkata ML for MT 60

Hidden variables and parameters

Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878

119904=1 where each hidden variable is

as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French

word 119911119901119902119904 = 0 otherwise

Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893

word in French vocabulary

Jan 6 2014 ISI Kolkata ML for MT 61

Likelihoods Data Likelihood L(D Θ)

Data Log-Likelihood LL(D Θ)

Expected value of Data Log-Likelihood E(LL(D Θ))

Jan 6 2014 ISI Kolkata ML for MT 62

Constraint and Lagrangian

119875119894119895

119881119865

119895=1

= 1 forall119894

Jan 6 2014 ISI Kolkata ML for MT 63

Differentiating wrt Pij

Jan 6 2014 ISI Kolkata ML for MT 64

Final E and M steps

M-step

E-step

Jan 6 2014 ISI Kolkata ML for MT 65

Combinatorial considerations

Jan 6 2014 ISI Kolkata ML for MT 66

Example

ISI Kolkata ML for MT Jan 6 2014 67

All possible alignments

ISI Kolkata ML for MT Jan 6 2014 68

First fundamental requirement of SMT

Alignment requires evidence of

bull firstly a translation pair to introduce the

POSSIBILITY of a mapping

bull then another pair to establish with

CERTAINTY the mapping

ISI Kolkata ML for MT Jan 6 2014 69

For the ldquocertaintyrdquo

We have a translation pair containing alignment candidates and none of the other words in the translation pair

OR

We have a translation pair containing all words in the translation pair except the alignment candidates

ISI Kolkata ML for MT Jan 6 2014 70

Thereforehellip

If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty

ISI Kolkata ML for MT Jan 6 2014 71

Rough estimate of data requirement

SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge

ie no meanings or grammatical properties of any words phrases or sentences

Each language has a vocabulary of 100000 words can give rise to about 500000 word forms

through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟

bdquowent‟ and bdquogone‟

ISI Kolkata ML for MT Jan 6 2014 72

Reasons for mapping to multiple words

Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register

polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)

syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood

ISI Kolkata ML for MT Jan 6 2014 73

Estimate of corpora requirement

Assume that on an average a sentence is 10 words long

an additional 9 translation pairs for getting at one of the 5 mappings

10 sentences per mapping per word

a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences

Estimate is not wide off the mark

Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs

ISI Kolkata ML for MT Jan 6 2014 74

Our work on factor based SMT

Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009

Jan 6 2014 ISI Kolkata ML for MT 75

Case Marker and Morphology crucial in E-H MT

Order of magnitiude facelift in Fluency and fidelity

Determined by the combination of suffixes and semantic relations on the English side

Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers

Jan 6 2014 ISI Kolkata ML for MT 76

Semantic relations+SuffixesCase

Markers+inflections

I ate mangoes

I ltagt ate eatpast mangoes ltobj

I ltagt mangoes ltobjpl eatpast

mei_ne aam khaa_yaa

Jan 6 2014 ISI Kolkata ML for MT 77

Our Approach

Factored model (Koehn and Hoang 2007) with the following translation factor

suffix + semantic relation case markersuffix

Experiments with the following relations

Dependency relations from the stanford parser

Deeper semantic roles from Universal Networking Language (UNL)

Jan 6 2014 ISI Kolkata ML for MT 78

Our Factorization

Jan 6 2014 ISI Kolkata ML for MT 79

Experiments

Jan 6 2014 ISI Kolkata ML for MT 80

Corpus Statistics

Jan 6 2014 ISI Kolkata ML for MT 81

Results The impact of suffix and semantic factors

Jan 6 2014 ISI Kolkata ML for MT 82

Results The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 83

Subjective Evaluation The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 84

Impact of sentence length (F Fluency AAdequacy E Errors)

Jan 6 2014 ISI Kolkata ML for MT 85

A feel for the improvement-baseline

Jan 6 2014 ISI Kolkata ML for MT 86

A feel for the improvement-reorder

Jan 6 2014 ISI Kolkata ML for MT 87

A feel for the improvement-Semantic relation

Jan 6 2014 ISI Kolkata ML for MT 88

A recent study

PAN Indian SMT

Jan 6 2014 ISI Kolkata ML for MT 89

Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator

SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi

Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English

Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences

Evaluation with BLEU METEOR scores also show high correlation with BLEU

Jan 6 2014 ISI Kolkata ML for MT 90

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Hidden variables and parameters

Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878

119904=1 where each hidden variable is

as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French

word 119911119901119902119904 = 0 otherwise

Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893

word in French vocabulary

Jan 6 2014 ISI Kolkata ML for MT 61

Likelihoods Data Likelihood L(D Θ)

Data Log-Likelihood LL(D Θ)

Expected value of Data Log-Likelihood E(LL(D Θ))

Jan 6 2014 ISI Kolkata ML for MT 62

Constraint and Lagrangian

119875119894119895

119881119865

119895=1

= 1 forall119894

Jan 6 2014 ISI Kolkata ML for MT 63

Differentiating wrt Pij

Jan 6 2014 ISI Kolkata ML for MT 64

Final E and M steps

M-step

E-step

Jan 6 2014 ISI Kolkata ML for MT 65

Combinatorial considerations

Jan 6 2014 ISI Kolkata ML for MT 66

Example

ISI Kolkata ML for MT Jan 6 2014 67

All possible alignments

ISI Kolkata ML for MT Jan 6 2014 68

First fundamental requirement of SMT

Alignment requires evidence of

bull firstly a translation pair to introduce the

POSSIBILITY of a mapping

bull then another pair to establish with

CERTAINTY the mapping

ISI Kolkata ML for MT Jan 6 2014 69

For the ldquocertaintyrdquo

We have a translation pair containing alignment candidates and none of the other words in the translation pair

OR

We have a translation pair containing all words in the translation pair except the alignment candidates

ISI Kolkata ML for MT Jan 6 2014 70

Thereforehellip

If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty

ISI Kolkata ML for MT Jan 6 2014 71

Rough estimate of data requirement

SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge

ie no meanings or grammatical properties of any words phrases or sentences

Each language has a vocabulary of 100000 words can give rise to about 500000 word forms

through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟

bdquowent‟ and bdquogone‟

ISI Kolkata ML for MT Jan 6 2014 72

Reasons for mapping to multiple words

Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register

polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)

syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood

ISI Kolkata ML for MT Jan 6 2014 73

Estimate of corpora requirement

Assume that on an average a sentence is 10 words long

an additional 9 translation pairs for getting at one of the 5 mappings

10 sentences per mapping per word

a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences

Estimate is not wide off the mark

Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs

ISI Kolkata ML for MT Jan 6 2014 74

Our work on factor based SMT

Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009

Jan 6 2014 ISI Kolkata ML for MT 75

Case Marker and Morphology crucial in E-H MT

Order of magnitiude facelift in Fluency and fidelity

Determined by the combination of suffixes and semantic relations on the English side

Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers

Jan 6 2014 ISI Kolkata ML for MT 76

Semantic relations+SuffixesCase

Markers+inflections

I ate mangoes

I ltagt ate eatpast mangoes ltobj

I ltagt mangoes ltobjpl eatpast

mei_ne aam khaa_yaa

Jan 6 2014 ISI Kolkata ML for MT 77

Our Approach

Factored model (Koehn and Hoang 2007) with the following translation factor

suffix + semantic relation case markersuffix

Experiments with the following relations

Dependency relations from the stanford parser

Deeper semantic roles from Universal Networking Language (UNL)

Jan 6 2014 ISI Kolkata ML for MT 78

Our Factorization

Jan 6 2014 ISI Kolkata ML for MT 79

Experiments

Jan 6 2014 ISI Kolkata ML for MT 80

Corpus Statistics

Jan 6 2014 ISI Kolkata ML for MT 81

Results The impact of suffix and semantic factors

Jan 6 2014 ISI Kolkata ML for MT 82

Results The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 83

Subjective Evaluation The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 84

Impact of sentence length (F Fluency AAdequacy E Errors)

Jan 6 2014 ISI Kolkata ML for MT 85

A feel for the improvement-baseline

Jan 6 2014 ISI Kolkata ML for MT 86

A feel for the improvement-reorder

Jan 6 2014 ISI Kolkata ML for MT 87

A feel for the improvement-Semantic relation

Jan 6 2014 ISI Kolkata ML for MT 88

A recent study

PAN Indian SMT

Jan 6 2014 ISI Kolkata ML for MT 89

Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator

SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi

Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English

Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences

Evaluation with BLEU METEOR scores also show high correlation with BLEU

Jan 6 2014 ISI Kolkata ML for MT 90

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Likelihoods Data Likelihood L(D Θ)

Data Log-Likelihood LL(D Θ)

Expected value of Data Log-Likelihood E(LL(D Θ))

Jan 6 2014 ISI Kolkata ML for MT 62

Constraint and Lagrangian

119875119894119895

119881119865

119895=1

= 1 forall119894

Jan 6 2014 ISI Kolkata ML for MT 63

Differentiating wrt Pij

Jan 6 2014 ISI Kolkata ML for MT 64

Final E and M steps

M-step

E-step

Jan 6 2014 ISI Kolkata ML for MT 65

Combinatorial considerations

Jan 6 2014 ISI Kolkata ML for MT 66

Example

ISI Kolkata ML for MT Jan 6 2014 67

All possible alignments

ISI Kolkata ML for MT Jan 6 2014 68

First fundamental requirement of SMT

Alignment requires evidence of

bull firstly a translation pair to introduce the

POSSIBILITY of a mapping

bull then another pair to establish with

CERTAINTY the mapping

ISI Kolkata ML for MT Jan 6 2014 69

For the ldquocertaintyrdquo

We have a translation pair containing alignment candidates and none of the other words in the translation pair

OR

We have a translation pair containing all words in the translation pair except the alignment candidates

ISI Kolkata ML for MT Jan 6 2014 70

Thereforehellip

If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty

ISI Kolkata ML for MT Jan 6 2014 71

Rough estimate of data requirement

SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge

ie no meanings or grammatical properties of any words phrases or sentences

Each language has a vocabulary of 100000 words can give rise to about 500000 word forms

through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟

bdquowent‟ and bdquogone‟

ISI Kolkata ML for MT Jan 6 2014 72

Reasons for mapping to multiple words

Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register

polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)

syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood

ISI Kolkata ML for MT Jan 6 2014 73

Estimate of corpora requirement

Assume that on an average a sentence is 10 words long

an additional 9 translation pairs for getting at one of the 5 mappings

10 sentences per mapping per word

a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences

Estimate is not wide off the mark

Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs

ISI Kolkata ML for MT Jan 6 2014 74

Our work on factor based SMT

Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009

Jan 6 2014 ISI Kolkata ML for MT 75

Case Marker and Morphology crucial in E-H MT

Order of magnitiude facelift in Fluency and fidelity

Determined by the combination of suffixes and semantic relations on the English side

Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers

Jan 6 2014 ISI Kolkata ML for MT 76

Semantic relations+SuffixesCase

Markers+inflections

I ate mangoes

I ltagt ate eatpast mangoes ltobj

I ltagt mangoes ltobjpl eatpast

mei_ne aam khaa_yaa

Jan 6 2014 ISI Kolkata ML for MT 77

Our Approach

Factored model (Koehn and Hoang 2007) with the following translation factor

suffix + semantic relation case markersuffix

Experiments with the following relations

Dependency relations from the stanford parser

Deeper semantic roles from Universal Networking Language (UNL)

Jan 6 2014 ISI Kolkata ML for MT 78

Our Factorization

Jan 6 2014 ISI Kolkata ML for MT 79

Experiments

Jan 6 2014 ISI Kolkata ML for MT 80

Corpus Statistics

Jan 6 2014 ISI Kolkata ML for MT 81

Results The impact of suffix and semantic factors

Jan 6 2014 ISI Kolkata ML for MT 82

Results The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 83

Subjective Evaluation The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 84

Impact of sentence length (F Fluency AAdequacy E Errors)

Jan 6 2014 ISI Kolkata ML for MT 85

A feel for the improvement-baseline

Jan 6 2014 ISI Kolkata ML for MT 86

A feel for the improvement-reorder

Jan 6 2014 ISI Kolkata ML for MT 87

A feel for the improvement-Semantic relation

Jan 6 2014 ISI Kolkata ML for MT 88

A recent study

PAN Indian SMT

Jan 6 2014 ISI Kolkata ML for MT 89

Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator

SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi

Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English

Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences

Evaluation with BLEU METEOR scores also show high correlation with BLEU

Jan 6 2014 ISI Kolkata ML for MT 90

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Constraint and Lagrangian

119875119894119895

119881119865

119895=1

= 1 forall119894

Jan 6 2014 ISI Kolkata ML for MT 63

Differentiating wrt Pij

Jan 6 2014 ISI Kolkata ML for MT 64

Final E and M steps

M-step

E-step

Jan 6 2014 ISI Kolkata ML for MT 65

Combinatorial considerations

Jan 6 2014 ISI Kolkata ML for MT 66

Example

ISI Kolkata ML for MT Jan 6 2014 67

All possible alignments

ISI Kolkata ML for MT Jan 6 2014 68

First fundamental requirement of SMT

Alignment requires evidence of

bull firstly a translation pair to introduce the

POSSIBILITY of a mapping

bull then another pair to establish with

CERTAINTY the mapping

ISI Kolkata ML for MT Jan 6 2014 69

For the ldquocertaintyrdquo

We have a translation pair containing alignment candidates and none of the other words in the translation pair

OR

We have a translation pair containing all words in the translation pair except the alignment candidates

ISI Kolkata ML for MT Jan 6 2014 70

Thereforehellip

If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty

ISI Kolkata ML for MT Jan 6 2014 71

Rough estimate of data requirement

SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge

ie no meanings or grammatical properties of any words phrases or sentences

Each language has a vocabulary of 100000 words can give rise to about 500000 word forms

through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟

bdquowent‟ and bdquogone‟

ISI Kolkata ML for MT Jan 6 2014 72

Reasons for mapping to multiple words

Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register

polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)

syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood

ISI Kolkata ML for MT Jan 6 2014 73

Estimate of corpora requirement

Assume that on an average a sentence is 10 words long

an additional 9 translation pairs for getting at one of the 5 mappings

10 sentences per mapping per word

a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences

Estimate is not wide off the mark

Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs

ISI Kolkata ML for MT Jan 6 2014 74

Our work on factor based SMT

Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009

Jan 6 2014 ISI Kolkata ML for MT 75

Case Marker and Morphology crucial in E-H MT

Order of magnitiude facelift in Fluency and fidelity

Determined by the combination of suffixes and semantic relations on the English side

Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers

Jan 6 2014 ISI Kolkata ML for MT 76

Semantic relations+SuffixesCase

Markers+inflections

I ate mangoes

I ltagt ate eatpast mangoes ltobj

I ltagt mangoes ltobjpl eatpast

mei_ne aam khaa_yaa

Jan 6 2014 ISI Kolkata ML for MT 77

Our Approach

Factored model (Koehn and Hoang 2007) with the following translation factor

suffix + semantic relation case markersuffix

Experiments with the following relations

Dependency relations from the stanford parser

Deeper semantic roles from Universal Networking Language (UNL)

Jan 6 2014 ISI Kolkata ML for MT 78

Our Factorization

Jan 6 2014 ISI Kolkata ML for MT 79

Experiments

Jan 6 2014 ISI Kolkata ML for MT 80

Corpus Statistics

Jan 6 2014 ISI Kolkata ML for MT 81

Results The impact of suffix and semantic factors

Jan 6 2014 ISI Kolkata ML for MT 82

Results The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 83

Subjective Evaluation The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 84

Impact of sentence length (F Fluency AAdequacy E Errors)

Jan 6 2014 ISI Kolkata ML for MT 85

A feel for the improvement-baseline

Jan 6 2014 ISI Kolkata ML for MT 86

A feel for the improvement-reorder

Jan 6 2014 ISI Kolkata ML for MT 87

A feel for the improvement-Semantic relation

Jan 6 2014 ISI Kolkata ML for MT 88

A recent study

PAN Indian SMT

Jan 6 2014 ISI Kolkata ML for MT 89

Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator

SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi

Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English

Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences

Evaluation with BLEU METEOR scores also show high correlation with BLEU

Jan 6 2014 ISI Kolkata ML for MT 90

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Differentiating wrt Pij

Jan 6 2014 ISI Kolkata ML for MT 64

Final E and M steps

M-step

E-step

Jan 6 2014 ISI Kolkata ML for MT 65

Combinatorial considerations

Jan 6 2014 ISI Kolkata ML for MT 66

Example

ISI Kolkata ML for MT Jan 6 2014 67

All possible alignments

ISI Kolkata ML for MT Jan 6 2014 68

First fundamental requirement of SMT

Alignment requires evidence of

bull firstly a translation pair to introduce the

POSSIBILITY of a mapping

bull then another pair to establish with

CERTAINTY the mapping

ISI Kolkata ML for MT Jan 6 2014 69

For the ldquocertaintyrdquo

We have a translation pair containing alignment candidates and none of the other words in the translation pair

OR

We have a translation pair containing all words in the translation pair except the alignment candidates

ISI Kolkata ML for MT Jan 6 2014 70

Thereforehellip

If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty

ISI Kolkata ML for MT Jan 6 2014 71

Rough estimate of data requirement

SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge

ie no meanings or grammatical properties of any words phrases or sentences

Each language has a vocabulary of 100000 words can give rise to about 500000 word forms

through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟

bdquowent‟ and bdquogone‟

ISI Kolkata ML for MT Jan 6 2014 72

Reasons for mapping to multiple words

Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register

polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)

syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood

ISI Kolkata ML for MT Jan 6 2014 73

Estimate of corpora requirement

Assume that on an average a sentence is 10 words long

an additional 9 translation pairs for getting at one of the 5 mappings

10 sentences per mapping per word

a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences

Estimate is not wide off the mark

Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs

ISI Kolkata ML for MT Jan 6 2014 74

Our work on factor based SMT

Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009

Jan 6 2014 ISI Kolkata ML for MT 75

Case Marker and Morphology crucial in E-H MT

Order of magnitiude facelift in Fluency and fidelity

Determined by the combination of suffixes and semantic relations on the English side

Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers

Jan 6 2014 ISI Kolkata ML for MT 76

Semantic relations+SuffixesCase

Markers+inflections

I ate mangoes

I ltagt ate eatpast mangoes ltobj

I ltagt mangoes ltobjpl eatpast

mei_ne aam khaa_yaa

Jan 6 2014 ISI Kolkata ML for MT 77

Our Approach

Factored model (Koehn and Hoang 2007) with the following translation factor

suffix + semantic relation case markersuffix

Experiments with the following relations

Dependency relations from the stanford parser

Deeper semantic roles from Universal Networking Language (UNL)

Jan 6 2014 ISI Kolkata ML for MT 78

Our Factorization

Jan 6 2014 ISI Kolkata ML for MT 79

Experiments

Jan 6 2014 ISI Kolkata ML for MT 80

Corpus Statistics

Jan 6 2014 ISI Kolkata ML for MT 81

Results The impact of suffix and semantic factors

Jan 6 2014 ISI Kolkata ML for MT 82

Results The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 83

Subjective Evaluation The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 84

Impact of sentence length (F Fluency AAdequacy E Errors)

Jan 6 2014 ISI Kolkata ML for MT 85

A feel for the improvement-baseline

Jan 6 2014 ISI Kolkata ML for MT 86

A feel for the improvement-reorder

Jan 6 2014 ISI Kolkata ML for MT 87

A feel for the improvement-Semantic relation

Jan 6 2014 ISI Kolkata ML for MT 88

A recent study

PAN Indian SMT

Jan 6 2014 ISI Kolkata ML for MT 89

Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator

SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi

Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English

Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences

Evaluation with BLEU METEOR scores also show high correlation with BLEU

Jan 6 2014 ISI Kolkata ML for MT 90

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Final E and M steps

M-step

E-step

Jan 6 2014 ISI Kolkata ML for MT 65

Combinatorial considerations

Jan 6 2014 ISI Kolkata ML for MT 66

Example

ISI Kolkata ML for MT Jan 6 2014 67

All possible alignments

ISI Kolkata ML for MT Jan 6 2014 68

First fundamental requirement of SMT

Alignment requires evidence of

bull firstly a translation pair to introduce the

POSSIBILITY of a mapping

bull then another pair to establish with

CERTAINTY the mapping

ISI Kolkata ML for MT Jan 6 2014 69

For the ldquocertaintyrdquo

We have a translation pair containing alignment candidates and none of the other words in the translation pair

OR

We have a translation pair containing all words in the translation pair except the alignment candidates

ISI Kolkata ML for MT Jan 6 2014 70

Thereforehellip

If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty

ISI Kolkata ML for MT Jan 6 2014 71

Rough estimate of data requirement

SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge

ie no meanings or grammatical properties of any words phrases or sentences

Each language has a vocabulary of 100000 words can give rise to about 500000 word forms

through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟

bdquowent‟ and bdquogone‟

ISI Kolkata ML for MT Jan 6 2014 72

Reasons for mapping to multiple words

Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register

polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)

syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood

ISI Kolkata ML for MT Jan 6 2014 73

Estimate of corpora requirement

Assume that on an average a sentence is 10 words long

an additional 9 translation pairs for getting at one of the 5 mappings

10 sentences per mapping per word

a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences

Estimate is not wide off the mark

Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs

ISI Kolkata ML for MT Jan 6 2014 74

Our work on factor based SMT

Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009

Jan 6 2014 ISI Kolkata ML for MT 75

Case Marker and Morphology crucial in E-H MT

Order of magnitiude facelift in Fluency and fidelity

Determined by the combination of suffixes and semantic relations on the English side

Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers

Jan 6 2014 ISI Kolkata ML for MT 76

Semantic relations+SuffixesCase

Markers+inflections

I ate mangoes

I ltagt ate eatpast mangoes ltobj

I ltagt mangoes ltobjpl eatpast

mei_ne aam khaa_yaa

Jan 6 2014 ISI Kolkata ML for MT 77

Our Approach

Factored model (Koehn and Hoang 2007) with the following translation factor

suffix + semantic relation case markersuffix

Experiments with the following relations

Dependency relations from the stanford parser

Deeper semantic roles from Universal Networking Language (UNL)

Jan 6 2014 ISI Kolkata ML for MT 78

Our Factorization

Jan 6 2014 ISI Kolkata ML for MT 79

Experiments

Jan 6 2014 ISI Kolkata ML for MT 80

Corpus Statistics

Jan 6 2014 ISI Kolkata ML for MT 81

Results The impact of suffix and semantic factors

Jan 6 2014 ISI Kolkata ML for MT 82

Results The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 83

Subjective Evaluation The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 84

Impact of sentence length (F Fluency AAdequacy E Errors)

Jan 6 2014 ISI Kolkata ML for MT 85

A feel for the improvement-baseline

Jan 6 2014 ISI Kolkata ML for MT 86

A feel for the improvement-reorder

Jan 6 2014 ISI Kolkata ML for MT 87

A feel for the improvement-Semantic relation

Jan 6 2014 ISI Kolkata ML for MT 88

A recent study

PAN Indian SMT

Jan 6 2014 ISI Kolkata ML for MT 89

Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator

SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi

Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English

Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences

Evaluation with BLEU METEOR scores also show high correlation with BLEU

Jan 6 2014 ISI Kolkata ML for MT 90

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Combinatorial considerations

Jan 6 2014 ISI Kolkata ML for MT 66

Example

ISI Kolkata ML for MT Jan 6 2014 67

All possible alignments

ISI Kolkata ML for MT Jan 6 2014 68

First fundamental requirement of SMT

Alignment requires evidence of

bull firstly a translation pair to introduce the

POSSIBILITY of a mapping

bull then another pair to establish with

CERTAINTY the mapping

ISI Kolkata ML for MT Jan 6 2014 69

For the ldquocertaintyrdquo

We have a translation pair containing alignment candidates and none of the other words in the translation pair

OR

We have a translation pair containing all words in the translation pair except the alignment candidates

ISI Kolkata ML for MT Jan 6 2014 70

Thereforehellip

If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty

ISI Kolkata ML for MT Jan 6 2014 71

Rough estimate of data requirement

SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge

ie no meanings or grammatical properties of any words phrases or sentences

Each language has a vocabulary of 100000 words can give rise to about 500000 word forms

through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟

bdquowent‟ and bdquogone‟

ISI Kolkata ML for MT Jan 6 2014 72

Reasons for mapping to multiple words

Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register

polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)

syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood

ISI Kolkata ML for MT Jan 6 2014 73

Estimate of corpora requirement

Assume that on an average a sentence is 10 words long

an additional 9 translation pairs for getting at one of the 5 mappings

10 sentences per mapping per word

a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences

Estimate is not wide off the mark

Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs

ISI Kolkata ML for MT Jan 6 2014 74

Our work on factor based SMT

Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009

Jan 6 2014 ISI Kolkata ML for MT 75

Case Marker and Morphology crucial in E-H MT

Order of magnitiude facelift in Fluency and fidelity

Determined by the combination of suffixes and semantic relations on the English side

Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers

Jan 6 2014 ISI Kolkata ML for MT 76

Semantic relations+SuffixesCase

Markers+inflections

I ate mangoes

I ltagt ate eatpast mangoes ltobj

I ltagt mangoes ltobjpl eatpast

mei_ne aam khaa_yaa

Jan 6 2014 ISI Kolkata ML for MT 77

Our Approach

Factored model (Koehn and Hoang 2007) with the following translation factor

suffix + semantic relation case markersuffix

Experiments with the following relations

Dependency relations from the stanford parser

Deeper semantic roles from Universal Networking Language (UNL)

Jan 6 2014 ISI Kolkata ML for MT 78

Our Factorization

Jan 6 2014 ISI Kolkata ML for MT 79

Experiments

Jan 6 2014 ISI Kolkata ML for MT 80

Corpus Statistics

Jan 6 2014 ISI Kolkata ML for MT 81

Results The impact of suffix and semantic factors

Jan 6 2014 ISI Kolkata ML for MT 82

Results The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 83

Subjective Evaluation The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 84

Impact of sentence length (F Fluency AAdequacy E Errors)

Jan 6 2014 ISI Kolkata ML for MT 85

A feel for the improvement-baseline

Jan 6 2014 ISI Kolkata ML for MT 86

A feel for the improvement-reorder

Jan 6 2014 ISI Kolkata ML for MT 87

A feel for the improvement-Semantic relation

Jan 6 2014 ISI Kolkata ML for MT 88

A recent study

PAN Indian SMT

Jan 6 2014 ISI Kolkata ML for MT 89

Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator

SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi

Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English

Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences

Evaluation with BLEU METEOR scores also show high correlation with BLEU

Jan 6 2014 ISI Kolkata ML for MT 90

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Example

ISI Kolkata ML for MT Jan 6 2014 67

All possible alignments

ISI Kolkata ML for MT Jan 6 2014 68

First fundamental requirement of SMT

Alignment requires evidence of

bull firstly a translation pair to introduce the

POSSIBILITY of a mapping

bull then another pair to establish with

CERTAINTY the mapping

ISI Kolkata ML for MT Jan 6 2014 69

For the ldquocertaintyrdquo

We have a translation pair containing alignment candidates and none of the other words in the translation pair

OR

We have a translation pair containing all words in the translation pair except the alignment candidates

ISI Kolkata ML for MT Jan 6 2014 70

Thereforehellip

If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty

ISI Kolkata ML for MT Jan 6 2014 71

Rough estimate of data requirement

SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge

ie no meanings or grammatical properties of any words phrases or sentences

Each language has a vocabulary of 100000 words can give rise to about 500000 word forms

through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟

bdquowent‟ and bdquogone‟

ISI Kolkata ML for MT Jan 6 2014 72

Reasons for mapping to multiple words

Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register

polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)

syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood

ISI Kolkata ML for MT Jan 6 2014 73

Estimate of corpora requirement

Assume that on an average a sentence is 10 words long

an additional 9 translation pairs for getting at one of the 5 mappings

10 sentences per mapping per word

a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences

Estimate is not wide off the mark

Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs

ISI Kolkata ML for MT Jan 6 2014 74

Our work on factor based SMT

Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009

Jan 6 2014 ISI Kolkata ML for MT 75

Case Marker and Morphology crucial in E-H MT

Order of magnitiude facelift in Fluency and fidelity

Determined by the combination of suffixes and semantic relations on the English side

Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers

Jan 6 2014 ISI Kolkata ML for MT 76

Semantic relations+SuffixesCase

Markers+inflections

I ate mangoes

I ltagt ate eatpast mangoes ltobj

I ltagt mangoes ltobjpl eatpast

mei_ne aam khaa_yaa

Jan 6 2014 ISI Kolkata ML for MT 77

Our Approach

Factored model (Koehn and Hoang 2007) with the following translation factor

suffix + semantic relation case markersuffix

Experiments with the following relations

Dependency relations from the stanford parser

Deeper semantic roles from Universal Networking Language (UNL)

Jan 6 2014 ISI Kolkata ML for MT 78

Our Factorization

Jan 6 2014 ISI Kolkata ML for MT 79

Experiments

Jan 6 2014 ISI Kolkata ML for MT 80

Corpus Statistics

Jan 6 2014 ISI Kolkata ML for MT 81

Results The impact of suffix and semantic factors

Jan 6 2014 ISI Kolkata ML for MT 82

Results The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 83

Subjective Evaluation The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 84

Impact of sentence length (F Fluency AAdequacy E Errors)

Jan 6 2014 ISI Kolkata ML for MT 85

A feel for the improvement-baseline

Jan 6 2014 ISI Kolkata ML for MT 86

A feel for the improvement-reorder

Jan 6 2014 ISI Kolkata ML for MT 87

A feel for the improvement-Semantic relation

Jan 6 2014 ISI Kolkata ML for MT 88

A recent study

PAN Indian SMT

Jan 6 2014 ISI Kolkata ML for MT 89

Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator

SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi

Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English

Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences

Evaluation with BLEU METEOR scores also show high correlation with BLEU

Jan 6 2014 ISI Kolkata ML for MT 90

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

All possible alignments

ISI Kolkata ML for MT Jan 6 2014 68

First fundamental requirement of SMT

Alignment requires evidence of

bull firstly a translation pair to introduce the

POSSIBILITY of a mapping

bull then another pair to establish with

CERTAINTY the mapping

ISI Kolkata ML for MT Jan 6 2014 69

For the ldquocertaintyrdquo

We have a translation pair containing alignment candidates and none of the other words in the translation pair

OR

We have a translation pair containing all words in the translation pair except the alignment candidates

ISI Kolkata ML for MT Jan 6 2014 70

Thereforehellip

If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty

ISI Kolkata ML for MT Jan 6 2014 71

Rough estimate of data requirement

SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge

ie no meanings or grammatical properties of any words phrases or sentences

Each language has a vocabulary of 100000 words can give rise to about 500000 word forms

through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟

bdquowent‟ and bdquogone‟

ISI Kolkata ML for MT Jan 6 2014 72

Reasons for mapping to multiple words

Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register

polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)

syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood

ISI Kolkata ML for MT Jan 6 2014 73

Estimate of corpora requirement

Assume that on an average a sentence is 10 words long

an additional 9 translation pairs for getting at one of the 5 mappings

10 sentences per mapping per word

a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences

Estimate is not wide off the mark

Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs

ISI Kolkata ML for MT Jan 6 2014 74

Our work on factor based SMT

Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009

Jan 6 2014 ISI Kolkata ML for MT 75

Case Marker and Morphology crucial in E-H MT

Order of magnitiude facelift in Fluency and fidelity

Determined by the combination of suffixes and semantic relations on the English side

Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers

Jan 6 2014 ISI Kolkata ML for MT 76

Semantic relations+SuffixesCase

Markers+inflections

I ate mangoes

I ltagt ate eatpast mangoes ltobj

I ltagt mangoes ltobjpl eatpast

mei_ne aam khaa_yaa

Jan 6 2014 ISI Kolkata ML for MT 77

Our Approach

Factored model (Koehn and Hoang 2007) with the following translation factor

suffix + semantic relation case markersuffix

Experiments with the following relations

Dependency relations from the stanford parser

Deeper semantic roles from Universal Networking Language (UNL)

Jan 6 2014 ISI Kolkata ML for MT 78

Our Factorization

Jan 6 2014 ISI Kolkata ML for MT 79

Experiments

Jan 6 2014 ISI Kolkata ML for MT 80

Corpus Statistics

Jan 6 2014 ISI Kolkata ML for MT 81

Results The impact of suffix and semantic factors

Jan 6 2014 ISI Kolkata ML for MT 82

Results The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 83

Subjective Evaluation The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 84

Impact of sentence length (F Fluency AAdequacy E Errors)

Jan 6 2014 ISI Kolkata ML for MT 85

A feel for the improvement-baseline

Jan 6 2014 ISI Kolkata ML for MT 86

A feel for the improvement-reorder

Jan 6 2014 ISI Kolkata ML for MT 87

A feel for the improvement-Semantic relation

Jan 6 2014 ISI Kolkata ML for MT 88

A recent study

PAN Indian SMT

Jan 6 2014 ISI Kolkata ML for MT 89

Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator

SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi

Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English

Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences

Evaluation with BLEU METEOR scores also show high correlation with BLEU

Jan 6 2014 ISI Kolkata ML for MT 90

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

First fundamental requirement of SMT

Alignment requires evidence of

bull firstly a translation pair to introduce the

POSSIBILITY of a mapping

bull then another pair to establish with

CERTAINTY the mapping

ISI Kolkata ML for MT Jan 6 2014 69

For the ldquocertaintyrdquo

We have a translation pair containing alignment candidates and none of the other words in the translation pair

OR

We have a translation pair containing all words in the translation pair except the alignment candidates

ISI Kolkata ML for MT Jan 6 2014 70

Thereforehellip

If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty

ISI Kolkata ML for MT Jan 6 2014 71

Rough estimate of data requirement

SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge

ie no meanings or grammatical properties of any words phrases or sentences

Each language has a vocabulary of 100000 words can give rise to about 500000 word forms

through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟

bdquowent‟ and bdquogone‟

ISI Kolkata ML for MT Jan 6 2014 72

Reasons for mapping to multiple words

Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register

polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)

syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood

ISI Kolkata ML for MT Jan 6 2014 73

Estimate of corpora requirement

Assume that on an average a sentence is 10 words long

an additional 9 translation pairs for getting at one of the 5 mappings

10 sentences per mapping per word

a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences

Estimate is not wide off the mark

Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs

ISI Kolkata ML for MT Jan 6 2014 74

Our work on factor based SMT

Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009

Jan 6 2014 ISI Kolkata ML for MT 75

Case Marker and Morphology crucial in E-H MT

Order of magnitiude facelift in Fluency and fidelity

Determined by the combination of suffixes and semantic relations on the English side

Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers

Jan 6 2014 ISI Kolkata ML for MT 76

Semantic relations+SuffixesCase

Markers+inflections

I ate mangoes

I ltagt ate eatpast mangoes ltobj

I ltagt mangoes ltobjpl eatpast

mei_ne aam khaa_yaa

Jan 6 2014 ISI Kolkata ML for MT 77

Our Approach

Factored model (Koehn and Hoang 2007) with the following translation factor

suffix + semantic relation case markersuffix

Experiments with the following relations

Dependency relations from the stanford parser

Deeper semantic roles from Universal Networking Language (UNL)

Jan 6 2014 ISI Kolkata ML for MT 78

Our Factorization

Jan 6 2014 ISI Kolkata ML for MT 79

Experiments

Jan 6 2014 ISI Kolkata ML for MT 80

Corpus Statistics

Jan 6 2014 ISI Kolkata ML for MT 81

Results The impact of suffix and semantic factors

Jan 6 2014 ISI Kolkata ML for MT 82

Results The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 83

Subjective Evaluation The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 84

Impact of sentence length (F Fluency AAdequacy E Errors)

Jan 6 2014 ISI Kolkata ML for MT 85

A feel for the improvement-baseline

Jan 6 2014 ISI Kolkata ML for MT 86

A feel for the improvement-reorder

Jan 6 2014 ISI Kolkata ML for MT 87

A feel for the improvement-Semantic relation

Jan 6 2014 ISI Kolkata ML for MT 88

A recent study

PAN Indian SMT

Jan 6 2014 ISI Kolkata ML for MT 89

Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator

SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi

Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English

Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences

Evaluation with BLEU METEOR scores also show high correlation with BLEU

Jan 6 2014 ISI Kolkata ML for MT 90

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

For the ldquocertaintyrdquo

We have a translation pair containing alignment candidates and none of the other words in the translation pair

OR

We have a translation pair containing all words in the translation pair except the alignment candidates

ISI Kolkata ML for MT Jan 6 2014 70

Thereforehellip

If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty

ISI Kolkata ML for MT Jan 6 2014 71

Rough estimate of data requirement

SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge

ie no meanings or grammatical properties of any words phrases or sentences

Each language has a vocabulary of 100000 words can give rise to about 500000 word forms

through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟

bdquowent‟ and bdquogone‟

ISI Kolkata ML for MT Jan 6 2014 72

Reasons for mapping to multiple words

Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register

polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)

syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood

ISI Kolkata ML for MT Jan 6 2014 73

Estimate of corpora requirement

Assume that on an average a sentence is 10 words long

an additional 9 translation pairs for getting at one of the 5 mappings

10 sentences per mapping per word

a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences

Estimate is not wide off the mark

Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs

ISI Kolkata ML for MT Jan 6 2014 74

Our work on factor based SMT

Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009

Jan 6 2014 ISI Kolkata ML for MT 75

Case Marker and Morphology crucial in E-H MT

Order of magnitiude facelift in Fluency and fidelity

Determined by the combination of suffixes and semantic relations on the English side

Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers

Jan 6 2014 ISI Kolkata ML for MT 76

Semantic relations+SuffixesCase

Markers+inflections

I ate mangoes

I ltagt ate eatpast mangoes ltobj

I ltagt mangoes ltobjpl eatpast

mei_ne aam khaa_yaa

Jan 6 2014 ISI Kolkata ML for MT 77

Our Approach

Factored model (Koehn and Hoang 2007) with the following translation factor

suffix + semantic relation case markersuffix

Experiments with the following relations

Dependency relations from the stanford parser

Deeper semantic roles from Universal Networking Language (UNL)

Jan 6 2014 ISI Kolkata ML for MT 78

Our Factorization

Jan 6 2014 ISI Kolkata ML for MT 79

Experiments

Jan 6 2014 ISI Kolkata ML for MT 80

Corpus Statistics

Jan 6 2014 ISI Kolkata ML for MT 81

Results The impact of suffix and semantic factors

Jan 6 2014 ISI Kolkata ML for MT 82

Results The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 83

Subjective Evaluation The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 84

Impact of sentence length (F Fluency AAdequacy E Errors)

Jan 6 2014 ISI Kolkata ML for MT 85

A feel for the improvement-baseline

Jan 6 2014 ISI Kolkata ML for MT 86

A feel for the improvement-reorder

Jan 6 2014 ISI Kolkata ML for MT 87

A feel for the improvement-Semantic relation

Jan 6 2014 ISI Kolkata ML for MT 88

A recent study

PAN Indian SMT

Jan 6 2014 ISI Kolkata ML for MT 89

Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator

SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi

Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English

Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences

Evaluation with BLEU METEOR scores also show high correlation with BLEU

Jan 6 2014 ISI Kolkata ML for MT 90

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Thereforehellip

If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty

ISI Kolkata ML for MT Jan 6 2014 71

Rough estimate of data requirement

SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge

ie no meanings or grammatical properties of any words phrases or sentences

Each language has a vocabulary of 100000 words can give rise to about 500000 word forms

through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟

bdquowent‟ and bdquogone‟

ISI Kolkata ML for MT Jan 6 2014 72

Reasons for mapping to multiple words

Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register

polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)

syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood

ISI Kolkata ML for MT Jan 6 2014 73

Estimate of corpora requirement

Assume that on an average a sentence is 10 words long

an additional 9 translation pairs for getting at one of the 5 mappings

10 sentences per mapping per word

a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences

Estimate is not wide off the mark

Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs

ISI Kolkata ML for MT Jan 6 2014 74

Our work on factor based SMT

Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009

Jan 6 2014 ISI Kolkata ML for MT 75

Case Marker and Morphology crucial in E-H MT

Order of magnitiude facelift in Fluency and fidelity

Determined by the combination of suffixes and semantic relations on the English side

Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers

Jan 6 2014 ISI Kolkata ML for MT 76

Semantic relations+SuffixesCase

Markers+inflections

I ate mangoes

I ltagt ate eatpast mangoes ltobj

I ltagt mangoes ltobjpl eatpast

mei_ne aam khaa_yaa

Jan 6 2014 ISI Kolkata ML for MT 77

Our Approach

Factored model (Koehn and Hoang 2007) with the following translation factor

suffix + semantic relation case markersuffix

Experiments with the following relations

Dependency relations from the stanford parser

Deeper semantic roles from Universal Networking Language (UNL)

Jan 6 2014 ISI Kolkata ML for MT 78

Our Factorization

Jan 6 2014 ISI Kolkata ML for MT 79

Experiments

Jan 6 2014 ISI Kolkata ML for MT 80

Corpus Statistics

Jan 6 2014 ISI Kolkata ML for MT 81

Results The impact of suffix and semantic factors

Jan 6 2014 ISI Kolkata ML for MT 82

Results The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 83

Subjective Evaluation The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 84

Impact of sentence length (F Fluency AAdequacy E Errors)

Jan 6 2014 ISI Kolkata ML for MT 85

A feel for the improvement-baseline

Jan 6 2014 ISI Kolkata ML for MT 86

A feel for the improvement-reorder

Jan 6 2014 ISI Kolkata ML for MT 87

A feel for the improvement-Semantic relation

Jan 6 2014 ISI Kolkata ML for MT 88

A recent study

PAN Indian SMT

Jan 6 2014 ISI Kolkata ML for MT 89

Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator

SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi

Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English

Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences

Evaluation with BLEU METEOR scores also show high correlation with BLEU

Jan 6 2014 ISI Kolkata ML for MT 90

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Rough estimate of data requirement

SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge

ie no meanings or grammatical properties of any words phrases or sentences

Each language has a vocabulary of 100000 words can give rise to about 500000 word forms

through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟

bdquowent‟ and bdquogone‟

ISI Kolkata ML for MT Jan 6 2014 72

Reasons for mapping to multiple words

Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register

polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)

syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood

ISI Kolkata ML for MT Jan 6 2014 73

Estimate of corpora requirement

Assume that on an average a sentence is 10 words long

an additional 9 translation pairs for getting at one of the 5 mappings

10 sentences per mapping per word

a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences

Estimate is not wide off the mark

Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs

ISI Kolkata ML for MT Jan 6 2014 74

Our work on factor based SMT

Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009

Jan 6 2014 ISI Kolkata ML for MT 75

Case Marker and Morphology crucial in E-H MT

Order of magnitiude facelift in Fluency and fidelity

Determined by the combination of suffixes and semantic relations on the English side

Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers

Jan 6 2014 ISI Kolkata ML for MT 76

Semantic relations+SuffixesCase

Markers+inflections

I ate mangoes

I ltagt ate eatpast mangoes ltobj

I ltagt mangoes ltobjpl eatpast

mei_ne aam khaa_yaa

Jan 6 2014 ISI Kolkata ML for MT 77

Our Approach

Factored model (Koehn and Hoang 2007) with the following translation factor

suffix + semantic relation case markersuffix

Experiments with the following relations

Dependency relations from the stanford parser

Deeper semantic roles from Universal Networking Language (UNL)

Jan 6 2014 ISI Kolkata ML for MT 78

Our Factorization

Jan 6 2014 ISI Kolkata ML for MT 79

Experiments

Jan 6 2014 ISI Kolkata ML for MT 80

Corpus Statistics

Jan 6 2014 ISI Kolkata ML for MT 81

Results The impact of suffix and semantic factors

Jan 6 2014 ISI Kolkata ML for MT 82

Results The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 83

Subjective Evaluation The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 84

Impact of sentence length (F Fluency AAdequacy E Errors)

Jan 6 2014 ISI Kolkata ML for MT 85

A feel for the improvement-baseline

Jan 6 2014 ISI Kolkata ML for MT 86

A feel for the improvement-reorder

Jan 6 2014 ISI Kolkata ML for MT 87

A feel for the improvement-Semantic relation

Jan 6 2014 ISI Kolkata ML for MT 88

A recent study

PAN Indian SMT

Jan 6 2014 ISI Kolkata ML for MT 89

Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator

SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi

Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English

Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences

Evaluation with BLEU METEOR scores also show high correlation with BLEU

Jan 6 2014 ISI Kolkata ML for MT 90

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Reasons for mapping to multiple words

Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register

polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)

syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood

ISI Kolkata ML for MT Jan 6 2014 73

Estimate of corpora requirement

Assume that on an average a sentence is 10 words long

an additional 9 translation pairs for getting at one of the 5 mappings

10 sentences per mapping per word

a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences

Estimate is not wide off the mark

Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs

ISI Kolkata ML for MT Jan 6 2014 74

Our work on factor based SMT

Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009

Jan 6 2014 ISI Kolkata ML for MT 75

Case Marker and Morphology crucial in E-H MT

Order of magnitiude facelift in Fluency and fidelity

Determined by the combination of suffixes and semantic relations on the English side

Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers

Jan 6 2014 ISI Kolkata ML for MT 76

Semantic relations+SuffixesCase

Markers+inflections

I ate mangoes

I ltagt ate eatpast mangoes ltobj

I ltagt mangoes ltobjpl eatpast

mei_ne aam khaa_yaa

Jan 6 2014 ISI Kolkata ML for MT 77

Our Approach

Factored model (Koehn and Hoang 2007) with the following translation factor

suffix + semantic relation case markersuffix

Experiments with the following relations

Dependency relations from the stanford parser

Deeper semantic roles from Universal Networking Language (UNL)

Jan 6 2014 ISI Kolkata ML for MT 78

Our Factorization

Jan 6 2014 ISI Kolkata ML for MT 79

Experiments

Jan 6 2014 ISI Kolkata ML for MT 80

Corpus Statistics

Jan 6 2014 ISI Kolkata ML for MT 81

Results The impact of suffix and semantic factors

Jan 6 2014 ISI Kolkata ML for MT 82

Results The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 83

Subjective Evaluation The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 84

Impact of sentence length (F Fluency AAdequacy E Errors)

Jan 6 2014 ISI Kolkata ML for MT 85

A feel for the improvement-baseline

Jan 6 2014 ISI Kolkata ML for MT 86

A feel for the improvement-reorder

Jan 6 2014 ISI Kolkata ML for MT 87

A feel for the improvement-Semantic relation

Jan 6 2014 ISI Kolkata ML for MT 88

A recent study

PAN Indian SMT

Jan 6 2014 ISI Kolkata ML for MT 89

Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator

SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi

Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English

Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences

Evaluation with BLEU METEOR scores also show high correlation with BLEU

Jan 6 2014 ISI Kolkata ML for MT 90

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Estimate of corpora requirement

Assume that on an average a sentence is 10 words long

an additional 9 translation pairs for getting at one of the 5 mappings

10 sentences per mapping per word

a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences

Estimate is not wide off the mark

Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs

ISI Kolkata ML for MT Jan 6 2014 74

Our work on factor based SMT

Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009

Jan 6 2014 ISI Kolkata ML for MT 75

Case Marker and Morphology crucial in E-H MT

Order of magnitiude facelift in Fluency and fidelity

Determined by the combination of suffixes and semantic relations on the English side

Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers

Jan 6 2014 ISI Kolkata ML for MT 76

Semantic relations+SuffixesCase

Markers+inflections

I ate mangoes

I ltagt ate eatpast mangoes ltobj

I ltagt mangoes ltobjpl eatpast

mei_ne aam khaa_yaa

Jan 6 2014 ISI Kolkata ML for MT 77

Our Approach

Factored model (Koehn and Hoang 2007) with the following translation factor

suffix + semantic relation case markersuffix

Experiments with the following relations

Dependency relations from the stanford parser

Deeper semantic roles from Universal Networking Language (UNL)

Jan 6 2014 ISI Kolkata ML for MT 78

Our Factorization

Jan 6 2014 ISI Kolkata ML for MT 79

Experiments

Jan 6 2014 ISI Kolkata ML for MT 80

Corpus Statistics

Jan 6 2014 ISI Kolkata ML for MT 81

Results The impact of suffix and semantic factors

Jan 6 2014 ISI Kolkata ML for MT 82

Results The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 83

Subjective Evaluation The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 84

Impact of sentence length (F Fluency AAdequacy E Errors)

Jan 6 2014 ISI Kolkata ML for MT 85

A feel for the improvement-baseline

Jan 6 2014 ISI Kolkata ML for MT 86

A feel for the improvement-reorder

Jan 6 2014 ISI Kolkata ML for MT 87

A feel for the improvement-Semantic relation

Jan 6 2014 ISI Kolkata ML for MT 88

A recent study

PAN Indian SMT

Jan 6 2014 ISI Kolkata ML for MT 89

Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator

SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi

Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English

Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences

Evaluation with BLEU METEOR scores also show high correlation with BLEU

Jan 6 2014 ISI Kolkata ML for MT 90

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Our work on factor based SMT

Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009

Jan 6 2014 ISI Kolkata ML for MT 75

Case Marker and Morphology crucial in E-H MT

Order of magnitiude facelift in Fluency and fidelity

Determined by the combination of suffixes and semantic relations on the English side

Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers

Jan 6 2014 ISI Kolkata ML for MT 76

Semantic relations+SuffixesCase

Markers+inflections

I ate mangoes

I ltagt ate eatpast mangoes ltobj

I ltagt mangoes ltobjpl eatpast

mei_ne aam khaa_yaa

Jan 6 2014 ISI Kolkata ML for MT 77

Our Approach

Factored model (Koehn and Hoang 2007) with the following translation factor

suffix + semantic relation case markersuffix

Experiments with the following relations

Dependency relations from the stanford parser

Deeper semantic roles from Universal Networking Language (UNL)

Jan 6 2014 ISI Kolkata ML for MT 78

Our Factorization

Jan 6 2014 ISI Kolkata ML for MT 79

Experiments

Jan 6 2014 ISI Kolkata ML for MT 80

Corpus Statistics

Jan 6 2014 ISI Kolkata ML for MT 81

Results The impact of suffix and semantic factors

Jan 6 2014 ISI Kolkata ML for MT 82

Results The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 83

Subjective Evaluation The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 84

Impact of sentence length (F Fluency AAdequacy E Errors)

Jan 6 2014 ISI Kolkata ML for MT 85

A feel for the improvement-baseline

Jan 6 2014 ISI Kolkata ML for MT 86

A feel for the improvement-reorder

Jan 6 2014 ISI Kolkata ML for MT 87

A feel for the improvement-Semantic relation

Jan 6 2014 ISI Kolkata ML for MT 88

A recent study

PAN Indian SMT

Jan 6 2014 ISI Kolkata ML for MT 89

Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator

SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi

Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English

Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences

Evaluation with BLEU METEOR scores also show high correlation with BLEU

Jan 6 2014 ISI Kolkata ML for MT 90

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Case Marker and Morphology crucial in E-H MT

Order of magnitiude facelift in Fluency and fidelity

Determined by the combination of suffixes and semantic relations on the English side

Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers

Jan 6 2014 ISI Kolkata ML for MT 76

Semantic relations+SuffixesCase

Markers+inflections

I ate mangoes

I ltagt ate eatpast mangoes ltobj

I ltagt mangoes ltobjpl eatpast

mei_ne aam khaa_yaa

Jan 6 2014 ISI Kolkata ML for MT 77

Our Approach

Factored model (Koehn and Hoang 2007) with the following translation factor

suffix + semantic relation case markersuffix

Experiments with the following relations

Dependency relations from the stanford parser

Deeper semantic roles from Universal Networking Language (UNL)

Jan 6 2014 ISI Kolkata ML for MT 78

Our Factorization

Jan 6 2014 ISI Kolkata ML for MT 79

Experiments

Jan 6 2014 ISI Kolkata ML for MT 80

Corpus Statistics

Jan 6 2014 ISI Kolkata ML for MT 81

Results The impact of suffix and semantic factors

Jan 6 2014 ISI Kolkata ML for MT 82

Results The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 83

Subjective Evaluation The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 84

Impact of sentence length (F Fluency AAdequacy E Errors)

Jan 6 2014 ISI Kolkata ML for MT 85

A feel for the improvement-baseline

Jan 6 2014 ISI Kolkata ML for MT 86

A feel for the improvement-reorder

Jan 6 2014 ISI Kolkata ML for MT 87

A feel for the improvement-Semantic relation

Jan 6 2014 ISI Kolkata ML for MT 88

A recent study

PAN Indian SMT

Jan 6 2014 ISI Kolkata ML for MT 89

Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator

SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi

Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English

Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences

Evaluation with BLEU METEOR scores also show high correlation with BLEU

Jan 6 2014 ISI Kolkata ML for MT 90

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Semantic relations+SuffixesCase

Markers+inflections

I ate mangoes

I ltagt ate eatpast mangoes ltobj

I ltagt mangoes ltobjpl eatpast

mei_ne aam khaa_yaa

Jan 6 2014 ISI Kolkata ML for MT 77

Our Approach

Factored model (Koehn and Hoang 2007) with the following translation factor

suffix + semantic relation case markersuffix

Experiments with the following relations

Dependency relations from the stanford parser

Deeper semantic roles from Universal Networking Language (UNL)

Jan 6 2014 ISI Kolkata ML for MT 78

Our Factorization

Jan 6 2014 ISI Kolkata ML for MT 79

Experiments

Jan 6 2014 ISI Kolkata ML for MT 80

Corpus Statistics

Jan 6 2014 ISI Kolkata ML for MT 81

Results The impact of suffix and semantic factors

Jan 6 2014 ISI Kolkata ML for MT 82

Results The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 83

Subjective Evaluation The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 84

Impact of sentence length (F Fluency AAdequacy E Errors)

Jan 6 2014 ISI Kolkata ML for MT 85

A feel for the improvement-baseline

Jan 6 2014 ISI Kolkata ML for MT 86

A feel for the improvement-reorder

Jan 6 2014 ISI Kolkata ML for MT 87

A feel for the improvement-Semantic relation

Jan 6 2014 ISI Kolkata ML for MT 88

A recent study

PAN Indian SMT

Jan 6 2014 ISI Kolkata ML for MT 89

Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator

SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi

Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English

Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences

Evaluation with BLEU METEOR scores also show high correlation with BLEU

Jan 6 2014 ISI Kolkata ML for MT 90

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Our Approach

Factored model (Koehn and Hoang 2007) with the following translation factor

suffix + semantic relation case markersuffix

Experiments with the following relations

Dependency relations from the stanford parser

Deeper semantic roles from Universal Networking Language (UNL)

Jan 6 2014 ISI Kolkata ML for MT 78

Our Factorization

Jan 6 2014 ISI Kolkata ML for MT 79

Experiments

Jan 6 2014 ISI Kolkata ML for MT 80

Corpus Statistics

Jan 6 2014 ISI Kolkata ML for MT 81

Results The impact of suffix and semantic factors

Jan 6 2014 ISI Kolkata ML for MT 82

Results The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 83

Subjective Evaluation The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 84

Impact of sentence length (F Fluency AAdequacy E Errors)

Jan 6 2014 ISI Kolkata ML for MT 85

A feel for the improvement-baseline

Jan 6 2014 ISI Kolkata ML for MT 86

A feel for the improvement-reorder

Jan 6 2014 ISI Kolkata ML for MT 87

A feel for the improvement-Semantic relation

Jan 6 2014 ISI Kolkata ML for MT 88

A recent study

PAN Indian SMT

Jan 6 2014 ISI Kolkata ML for MT 89

Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator

SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi

Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English

Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences

Evaluation with BLEU METEOR scores also show high correlation with BLEU

Jan 6 2014 ISI Kolkata ML for MT 90

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Our Factorization

Jan 6 2014 ISI Kolkata ML for MT 79

Experiments

Jan 6 2014 ISI Kolkata ML for MT 80

Corpus Statistics

Jan 6 2014 ISI Kolkata ML for MT 81

Results The impact of suffix and semantic factors

Jan 6 2014 ISI Kolkata ML for MT 82

Results The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 83

Subjective Evaluation The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 84

Impact of sentence length (F Fluency AAdequacy E Errors)

Jan 6 2014 ISI Kolkata ML for MT 85

A feel for the improvement-baseline

Jan 6 2014 ISI Kolkata ML for MT 86

A feel for the improvement-reorder

Jan 6 2014 ISI Kolkata ML for MT 87

A feel for the improvement-Semantic relation

Jan 6 2014 ISI Kolkata ML for MT 88

A recent study

PAN Indian SMT

Jan 6 2014 ISI Kolkata ML for MT 89

Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator

SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi

Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English

Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences

Evaluation with BLEU METEOR scores also show high correlation with BLEU

Jan 6 2014 ISI Kolkata ML for MT 90

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Experiments

Jan 6 2014 ISI Kolkata ML for MT 80

Corpus Statistics

Jan 6 2014 ISI Kolkata ML for MT 81

Results The impact of suffix and semantic factors

Jan 6 2014 ISI Kolkata ML for MT 82

Results The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 83

Subjective Evaluation The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 84

Impact of sentence length (F Fluency AAdequacy E Errors)

Jan 6 2014 ISI Kolkata ML for MT 85

A feel for the improvement-baseline

Jan 6 2014 ISI Kolkata ML for MT 86

A feel for the improvement-reorder

Jan 6 2014 ISI Kolkata ML for MT 87

A feel for the improvement-Semantic relation

Jan 6 2014 ISI Kolkata ML for MT 88

A recent study

PAN Indian SMT

Jan 6 2014 ISI Kolkata ML for MT 89

Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator

SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi

Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English

Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences

Evaluation with BLEU METEOR scores also show high correlation with BLEU

Jan 6 2014 ISI Kolkata ML for MT 90

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Corpus Statistics

Jan 6 2014 ISI Kolkata ML for MT 81

Results The impact of suffix and semantic factors

Jan 6 2014 ISI Kolkata ML for MT 82

Results The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 83

Subjective Evaluation The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 84

Impact of sentence length (F Fluency AAdequacy E Errors)

Jan 6 2014 ISI Kolkata ML for MT 85

A feel for the improvement-baseline

Jan 6 2014 ISI Kolkata ML for MT 86

A feel for the improvement-reorder

Jan 6 2014 ISI Kolkata ML for MT 87

A feel for the improvement-Semantic relation

Jan 6 2014 ISI Kolkata ML for MT 88

A recent study

PAN Indian SMT

Jan 6 2014 ISI Kolkata ML for MT 89

Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator

SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi

Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English

Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences

Evaluation with BLEU METEOR scores also show high correlation with BLEU

Jan 6 2014 ISI Kolkata ML for MT 90

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Results The impact of suffix and semantic factors

Jan 6 2014 ISI Kolkata ML for MT 82

Results The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 83

Subjective Evaluation The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 84

Impact of sentence length (F Fluency AAdequacy E Errors)

Jan 6 2014 ISI Kolkata ML for MT 85

A feel for the improvement-baseline

Jan 6 2014 ISI Kolkata ML for MT 86

A feel for the improvement-reorder

Jan 6 2014 ISI Kolkata ML for MT 87

A feel for the improvement-Semantic relation

Jan 6 2014 ISI Kolkata ML for MT 88

A recent study

PAN Indian SMT

Jan 6 2014 ISI Kolkata ML for MT 89

Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator

SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi

Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English

Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences

Evaluation with BLEU METEOR scores also show high correlation with BLEU

Jan 6 2014 ISI Kolkata ML for MT 90

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Results The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 83

Subjective Evaluation The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 84

Impact of sentence length (F Fluency AAdequacy E Errors)

Jan 6 2014 ISI Kolkata ML for MT 85

A feel for the improvement-baseline

Jan 6 2014 ISI Kolkata ML for MT 86

A feel for the improvement-reorder

Jan 6 2014 ISI Kolkata ML for MT 87

A feel for the improvement-Semantic relation

Jan 6 2014 ISI Kolkata ML for MT 88

A recent study

PAN Indian SMT

Jan 6 2014 ISI Kolkata ML for MT 89

Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator

SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi

Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English

Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences

Evaluation with BLEU METEOR scores also show high correlation with BLEU

Jan 6 2014 ISI Kolkata ML for MT 90

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Subjective Evaluation The impact of reordering and semantic relations

Jan 6 2014 ISI Kolkata ML for MT 84

Impact of sentence length (F Fluency AAdequacy E Errors)

Jan 6 2014 ISI Kolkata ML for MT 85

A feel for the improvement-baseline

Jan 6 2014 ISI Kolkata ML for MT 86

A feel for the improvement-reorder

Jan 6 2014 ISI Kolkata ML for MT 87

A feel for the improvement-Semantic relation

Jan 6 2014 ISI Kolkata ML for MT 88

A recent study

PAN Indian SMT

Jan 6 2014 ISI Kolkata ML for MT 89

Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator

SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi

Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English

Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences

Evaluation with BLEU METEOR scores also show high correlation with BLEU

Jan 6 2014 ISI Kolkata ML for MT 90

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Impact of sentence length (F Fluency AAdequacy E Errors)

Jan 6 2014 ISI Kolkata ML for MT 85

A feel for the improvement-baseline

Jan 6 2014 ISI Kolkata ML for MT 86

A feel for the improvement-reorder

Jan 6 2014 ISI Kolkata ML for MT 87

A feel for the improvement-Semantic relation

Jan 6 2014 ISI Kolkata ML for MT 88

A recent study

PAN Indian SMT

Jan 6 2014 ISI Kolkata ML for MT 89

Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator

SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi

Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English

Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences

Evaluation with BLEU METEOR scores also show high correlation with BLEU

Jan 6 2014 ISI Kolkata ML for MT 90

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

A feel for the improvement-baseline

Jan 6 2014 ISI Kolkata ML for MT 86

A feel for the improvement-reorder

Jan 6 2014 ISI Kolkata ML for MT 87

A feel for the improvement-Semantic relation

Jan 6 2014 ISI Kolkata ML for MT 88

A recent study

PAN Indian SMT

Jan 6 2014 ISI Kolkata ML for MT 89

Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator

SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi

Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English

Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences

Evaluation with BLEU METEOR scores also show high correlation with BLEU

Jan 6 2014 ISI Kolkata ML for MT 90

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

A feel for the improvement-reorder

Jan 6 2014 ISI Kolkata ML for MT 87

A feel for the improvement-Semantic relation

Jan 6 2014 ISI Kolkata ML for MT 88

A recent study

PAN Indian SMT

Jan 6 2014 ISI Kolkata ML for MT 89

Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator

SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi

Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English

Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences

Evaluation with BLEU METEOR scores also show high correlation with BLEU

Jan 6 2014 ISI Kolkata ML for MT 90

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

A feel for the improvement-Semantic relation

Jan 6 2014 ISI Kolkata ML for MT 88

A recent study

PAN Indian SMT

Jan 6 2014 ISI Kolkata ML for MT 89

Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator

SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi

Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English

Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences

Evaluation with BLEU METEOR scores also show high correlation with BLEU

Jan 6 2014 ISI Kolkata ML for MT 90

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

A recent study

PAN Indian SMT

Jan 6 2014 ISI Kolkata ML for MT 89

Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator

SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi

Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English

Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences

Evaluation with BLEU METEOR scores also show high correlation with BLEU

Jan 6 2014 ISI Kolkata ML for MT 90

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator

SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi

Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English

Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences

Evaluation with BLEU METEOR scores also show high correlation with BLEU

Jan 6 2014 ISI Kolkata ML for MT 90

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

SMT Systems Trained

Phrase-based (PBSMT) baseline system (S1)

E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)

E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)

IL-IL PBSMT with transliteration post-editing (S4)

Jan 6 2014 ISI Kolkata ML for MT 91

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Natural Partitioning of SMT systems

Clear partitioning of translation pairs by language family pairs based on translation accuracy

Shared characteristics within language families make translation simpler

Divergences among language families make translation difficult

Baseline PBSMT - BLEU scores (S1)

Jan 6 2014 ISI Kolkata ML for MT 92

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

The Challenge of Morphology Morphological complexity vs BLEU

Training Corpus size vs BLEU

Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size

bull Translation accuracy decreases with increasing morphology

bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages

bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Common Divergences Shared Solutions

All Indian languages have similar word order The same structural divergence between English and Indian

languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL

translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common

framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )

Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)

Jan 6 2014 ISI Kolkata ML for MT 94

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Harnessing Shared Characteristics

Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common

phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach

Harnessing common characteristics can improve SMT output

PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)

Jan 6 2014 ISI Kolkata ML for MT 95

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Cognition and Translation Measuring Translation Difficulty

Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013

96 Jan 6 2014 ISI Kolkata ML for MT

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Scenario

Sentences

John ate jam

John ate jam made from apples

John is in a jam

Subjective notion of difficulty

Easy

Moderate

Difficult

Jan 6 2014 ISI Kolkata ML for MT 97

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Use behavioural data

Use behavioural data to decipher strong AI algorithms

Specifically

For WSD by humans see where the eye rests for clues

For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences

Jan 6 2014 ISI Kolkata ML for MT 98

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of

Fixations

Saccades

Jan 6 2014 ISI Kolkata ML for MT 99

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Eye Tracking data Gaze points Position of eye-gaze on the

screen

Fixations A long stay of the gaze on a particular object on the screen

Fixations have both Spatial (coordinates) and Temporal (duration) properties

Saccade A very rapid movement of eye between the positions of rest

Scanpath A path connecting a series of fixations

Regression Revisiting a previously read segment

Jan 6 2014 ISI Kolkata ML for MT 100

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Controlling the experimental setup for eye-tracking

Eye movement patterns influenced by factors like age working proficiency environmental distractions etc

Guidelines for eye tracking Participants metadata (age expertise occupation)

etc

Performing a fresh calibration before each new experiment

Minimizing the head movement

Introduce adequate line spacing in the text and avoid scrolling

Carrying out the experiments in a relatively low light environment

Jan 6 2014 ISI Kolkata ML for MT 101

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Use of eye tracking

Used extensively in Psychology

Mainly to study reading processes

Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354

Used in flight simulators for pilot training

Jan 6 2014 ISI Kolkata ML for MT 102

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

NLP and Eye Tracking research

Kliegl (2011)- Predict word frequency

and pattern from eye movements Doherty et al (2010)- Eye-tracking as an

automatic Machine Translation Evaluation Technique

Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis

Dragsted (2010)- Co-ordination of reading and writing process during translation

Relatively new and open research direction

Jan 6 2014 ISI Kolkata ML for MT 103

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Translation Difficulty Index (TDI)

Motivation route sentences to translators with right competence as per difficulty of translating

On a crowdsourcing platform eg

TDI is a function of

sentence length (l)

degree of polysemy of constituent words (p) and

structural complexity (s)

104 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Contributor to TDI length

What is more difficult to translate

John eats jam vs

John eats jam made from apples vs

John eats jam made from apples grown in orchards vs

John eats bread made from apples grown in orchards on black soil

105 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Contributor to TDI polysemy

What is more difficult to translate

John is in a jam vs

John is in difficulty

Jam has 4 diverse senses difficulty has 4 related senses

106 Jan 6 2014 ISI Kolkata ML for MT

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Contributor to TDI structural complexity

What is more difficult to translate

John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs

John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them

107 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Measuring translation through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

108 Jan 6 2014 ISI Kolkata ML for MT

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Measuring translation difficulty through Gaze data

Translation difficulty indicated by

staying of eye on segments

Jumping back and forth between segments

Example

The horse raced past the garden fell

बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa

gir gayaa

The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Scanpaths indicator of translation difficulty

(Malsburg et al 2007)

Sentence 2 is a clear case of ldquoGarden pathingrdquo

which imposes cognitive load on participants and the prefer syntactic re-analysis

Jan 6 2014 ISI Kolkata ML for MT 110

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program

Built with a purpose of recording gaze and key-stroke data during translation

Can be used for other reading and writing related studies

Using Translog one can Create and Customize translationreading and writing

experiments involving eye-tracking and keystroke logging

Calibrate the eye-tracker

Replay and analyze the recorded log files

Manually correct errors in gaze recording

Jan 6 2014 ISI Kolkata ML for MT 111

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

TPR Database The Translation Process Research (TPR) database

(Carl 2012) is a database containing behavioral data for translation activities

Contains Gaze and Keystroke information for more than 450 experiments

40 different paragraphs are translated into 7 different languages from English by multiple translators

At least 5 translators per language

Source and target paragraphs are annotated with POS tags lemmas dependency relations etc

Easy to use XML data format

Jan 6 2014 ISI Kolkata ML for MT 112

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Experimental setup (12)

Translators translate sentence by sentence typing to a text box

The display screen is attached with a remote eye-tracker which

constantly records the eye movement of the translator

113 Jan 6 2014 ISI Kolkata ML for MT

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Experimental setup (22)

Extracted 20 different text categories from the data

Each piece of text contains 5-10 sentences

For each category we had at least 10 participants who translated the text into different target languages

114 Jan 6 2014 ISI Kolkata ML for MT

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

A predictive framework for TDI

Direct annotation of TDI is fraught with subjectivity and ad-hocism

We use translator‟s gaze data as annotation to prepare training data

Training data

Regressor

Labeling through gaze analysis Features

Test Data

TDI

Jan 6 2014 ISI Kolkata ML for MT 115

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Annotation of TDI (14)

First approximation -gt TDI equivalent to ldquotime taken to translaterdquo

However time taken to translate may not be strongly related to translation difficulty

It is difficult to know what fraction of the total time is spent on translation related thinking

Sensitive to distractions from the environment

Jan 6 2014 ISI Kolkata ML for MT 116

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Annotation of TDI (24)

Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo

This is called Translation Processing Time given by

119879119901 = 119879119888119900119898119901+119879119892119890119899

Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively

Jan 6 2014 ISI Kolkata ML for MT 117

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Annotation of TDI (34)

Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed

f- fixation s- saccade Fs- source Ft- target

119879119901 = 119889119906119903 119891 +

119891 isin 119865119904

119889119906119903 119904 +

119904 isin 119878119904

119889119906119903 119891 +

119891 isin 119865119905

119889119906119903 119904

119904 isin 119878119905

Jan 6 2014 ISI Kolkata ML for MT 118

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Annotation of TDI (44)

The measured TDI score is the Tp normalized over sentence length

119879119863119868119898119890119886119904119906119903119890119889 =119879119901

119904119890119899119905119890119899119888119890_119897119890119899119892119905119893

Jan 6 2014 ISI Kolkata ML for MT 119

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Features

Length Word count of the sentences

Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length

Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence

Measured TDI for TPR database for 80 sentences

Jan 6 2014 ISI Kolkata ML for MT 120

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Experiment and results

Training data of 80 examples 10-fold cross validation

Features computed using Princeton WordNet and Stanford Dependency Parser

Support Vector Regression technique (Joachims et al 1999) along with different kernels

Error analysis was done by Mean Squared Error estimate

We also computed the correlation of the predicted TDI with the measured TDI

Jan 6 2014 ISI Kolkata ML for MT 121

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Examples from the dataset

Jan 6 2014 ISI Kolkata ML for MT 122

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Summary

Covered Interlingual based MT the oldest approach to MT

Covered SMT the newest approach to MT

Presented some recent study in the context of Indian Languages

Covered eye tracking bring cognitive aspects to MT

123 Jan 6 2014 ISI Kolkata ML for MT

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Summary

SMT is the ruling paradigm

But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua

Large scale effort sponsored by ministry of IT TDIL program to create MT systems

Parallel corpora creation is also going on in a consortium mode

Jan 6 2014 ISI Kolkata ML for MT 124

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Conclusions NLP has assumed great importance because

of large amount of text in e-form

Machine learning techniques are increasingly applied

Highly relevant for India where multilinguality is way of life

Machine Translation is more fundamental and ubiquitous than just mapping between two languages

Utterancethought

Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126

Pubs httpwwcseiitbacin~pb

Resources and tools

httpwwwcfiltiitbacin

Jan 6 2014 ISI Kolkata ML for MT 126