Introduction to Machine Translation

Mitch Marcus

CIS 530

Some slides adapted from slides by

John Hutchins, Bonnie Dorr, Martha Palmer


Why use computers in translation?
• Too much translation for humans
• Technical materials too boring for humans
• Greater consistency required
• Need results more quickly
• Not everything needs to be top quality
• Reduce costs

Any one of these may justify machine translation or computer aids


The Early History of NLP (Hutchins): MT in the 1950s and 1960s

Sponsored by government bodies in the USA and USSR (also CIA and KGB)
• assumed goal was fully automatic high-quality output, i.e. of publishable quality [dissemination]
• actual need was translation for information gathering [assimilation]

Survey by Bar-Hillel of MT research:
• criticised the assumption of FAHQT as the goal
• demonstrated the ‘non-feasibility’ of FAHQT (without ‘unrealisable’ encyclopedic knowledge bases)
• advocated “man-machine symbiosis”, i.e. HAMT and MAHT

ALPAC 1966, set up by disillusioned funding agencies:
• compared the latest systems with early unedited MT output (IBM-GU demo, 1954), and criticised them for still needing post-editing
• advocated machine aids, and no further support of MT research
• but failed to identify the actual needs of funders [assimilation]
• therefore failed to see that the output of the IBM-USAF Translator and Georgetown systems was used and appreciated


Consequences of ALPAC

MT research virtually ended in the US

Identification of actual needs:
• assimilation vs. dissemination

Recognition that ‘perfectionism’ (FAHQT) had neglected:
• operational factors and requirements
• the expertise of translators
• machine aids for translators

Henceforth three strands of MT:
• translation tools (HAMT, MAHT)
• operational systems (post-editing, controlled languages, domain-specific systems)
• research (new approaches, new methods)

Computational linguistics born in the aftermath

Machine Translation (Pass 0 – From Intro Lectures)


Why use computers in translation?
• Too much translation for humans
• Technical materials too boring for humans
• Greater consistency required
• Need results more quickly
• Not everything needs to be top quality
• Reduce costs

Any one of these may justify machine translation or computer aids

(next several slides adapted from Language Weaver)


Statistical Machine Translation Technology

[Diagram: Spanish/English bilingual text and English text are each run through statistical analysis; the resulting models turn Spanish input into “broken English” candidates, which are then smoothed into fluent English.]

Example: “Que hambre tengo yo” (Spanish)
→ broken English candidates: “What hunger have I”, “Hungry I am so”, “I am so hungry”, “Have I that hunger …”
→ English: “I am so hungry”


How A Statistical MT System Learns


Translating a New Document


[Screenshots: successive Language Weaver releases (v.2.0, v.2.4, v.3.0) translating the same article; source: Aljazeera, January 8, 2005]


Translingual Chat – Instant Messaging

[Screenshot: instant-message conversation shown with each original message beside its translation]


Language Weaver (Al Jazeera 8/2007)

LanguageWeaver Demo Website


Language Weaver Hybrid Translation Technology

Chinese Source Text, Sample 1:

 车展,一向是衡量一个国家汽车消费现状和市场潜力的“晴雨表”。本届北京国际车展有 24个国家的 1200余家厂商参展, 8天接待 40余万名参观者,创下了中国车展的新纪录,让人深切地感受到汽车市场启动的信号。     “中国是世界最后一个最大的汽车市场”。多年来,这句话更多地包含着汽车商人的一种希冀。然而如今,越来越多的事实预示着它正在变为现实。     来自本届车展的一组数据很有说服力。《北京青年报》的一份现场调查显示, 35岁以下参观者约占 35%; 62.1%的被访者表示,参观车展主要是为近期买车搜集信息,甚至在展会上就有可能购买或预订合适的产品; 76%的被访者表示最近两年会购买私家车。     今年以来,国内轿车市场的强劲增长让厂家喜上眉梢。据国家统计局公布的数字,前 4个月,全国共生产轿车 26.79万辆,增长 27.6%;特别是 4月份,生 产轿车 9万辆,同比增长 50.5%,创造了十几年来轿车月产增长的最高纪录。从销售看,一季度,全国轿车生产企业共销售轿车18.8万辆,同比增长  22%,产销率达 105%;轿车库存比年初下降 1.1万辆,下降幅度近 25%。 Language Weaver Experimental Syntax MT Sample 1 : The motor show, has always been the' barometer' of a national car consumption and market potential. The Beijing International Auto Show has more than 1,200 exhibitors from 24 countries and 8 days of receiving more than 40 million visitors, setting a new record in China's auto show, are deeply aware of the automobile market signals.  "China is one of the largest automobile market in the world. Over the years, this phrase implies more auto businessmen. But now, more and more facts indicates that it is to become a reality.  Data from the Motor Show is very convincing. The Beijing Qingnian Bao Report on-the-spot investigation showed that about 35 percent of 35-year-old visitors, 62.1 percent of the respondents said that the truck was mainly to buy a car in the near future to collect information, even at the exhibition may purchase or suitable products; 76% of respondents indicated in the past two years to buy private cars.  Since the beginning of this year, the strong growth of the domestic car market. According to the figures released by the National Bureau of Statistics, in the first four months, the country produced 267,900 vehicles, up 27.6 percent; in particular, in April, the production of 90,000 vehicles, an increase of 50.5% over the same period last year, setting a record high for the monthly output growth over the past 10-odd years. In terms of sales in the first quarter, manufacturing enterprises in the country sold 188,000 cars, up 22 percent over the same period of last year, up 10.5 percent; 11,000 vehicles, dropping by nearly 25 percent lower than the beginning of the year.


Broadcast Monitoring BBN MAPS & Language Weaver MT


Three MT Approaches: Direct, Transfer, Interlingual (Vauquois triangle)

[Vauquois triangle: a source text is analyzed upward through word structure, syntactic structure, and semantic structure to an interlingua at the apex, and the target text is generated back down the other side (morphological, syntactic, and semantic analysis mirrored by semantic, syntactic, and morphological generation). Translation can short-cut the climb at any level: direct translation at the word level, syntactic transfer, semantic transfer, or the full interlingual route via semantic composition and decomposition.]


Examples of Three Approaches

Direct:
• I checked his answers against those of the teacher → Yo comparé sus respuestas a las de la profesora
• Rule: [check X against Y] → [comparar X a Y]
(a toy sketch of applying such a rule appears after this slide)

Transfer:
• Ich habe ihn gesehen → I have seen him
• Rule: [clause agt aux obj pred] → [clause agt aux pred obj]

Interlingual:
• I like Mary → Mary me gusta a mí
• Rep: [BeIdent (I [ATIdent (I, Mary)] Like+ingly)]
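To make the direct strategy concrete, here is a toy Python sketch, not taken from any real system: the rule, the glossary, and the function name are invented for illustration. It applies the [check X against Y] → [comparar X a Y] rule as a pattern substitution and then substitutes the remaining words one by one, as a direct system would.

```python
import re

# One hand-written "direct" structural rule of the kind shown above
# (hypothetical toy data, for illustration only).
RULES = [
    # [check X against Y] -> [comparar X a Y]
    (re.compile(r"checked (?P<x>.+?) against (?P<y>.+)"),
     r"comparé \g<x> a \g<y>"),
]

# A tiny word-for-word glossary applied after the structural rule;
# real direct systems relied on large bilingual dictionaries.
GLOSSARY = {"I": "Yo", "his": "sus", "answers": "respuestas",
            "those": "las", "of": "de", "the": "la", "teacher": "profesora"}

def direct_translate(sentence: str) -> str:
    out = sentence
    for pattern, template in RULES:                     # structural substitution
        out = pattern.sub(template, out)
    words = [GLOSSARY.get(w, w) for w in out.split()]   # word-for-word substitution
    return " ".join(words)

print(direct_translate("I checked his answers against those of the teacher"))
# -> Yo comparé sus respuestas a las de la profesora
```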


Direct MT: Pros and Cons

Pros:
• Fast
• Simple
• Inexpensive

Cons:
• Unreliable
• Not powerful
• Rule proliferation
• Requires too much context
• Major restructuring after lexical substitution


Transfer MT: Pros and Cons

Pros:
• Don’t need to find a language-neutral rep
• No translation rules hidden in the lexicon
• Relatively fast

Cons:
• N² sets of transfer rules: difficult to extend
• Proliferation of language-specific rules in lexicon and syntax
• Cross-language generalizations lost


Interlingual MT: Pros and Cons

Pros:
• Portable (avoids the N² problem)
• Lexical rules and structural transformations stated more simply on a normalized representation
• Explanatory adequacy

Cons:
• Difficult to deal with terms at the primitive level: universals?
• Must decompose and reassemble concepts
• Useful information lost (paraphrase)
• (Is thought really language neutral??)


MT Challenges: Ambiguity

Syntactic Ambiguity:
• I saw the man on the hill with the telescope

Lexical Ambiguity:
• E: book → S: libro, reservar

Semantic Ambiguity:
• Homography: ball (E) = pelota, baile (S)
• Polysemy: kill (E) = matar, acabar (S)
• Semantic granularity: esperar (S) = wait, expect, hope (E); be (E) = ser, estar (S); fish (E) = pez, pescado (S)


MT Challenges: Divergences

• Meaning of two translationally equivalent phrases is distributed differently in the two languages

• Example:
  - English: [RUN INTO ROOM]
  - Spanish: [ENTER IN ROOM RUNNING]


Spanish/Arabic Divergences

Divergence | E / E’ (Spanish) | E / E’ (Arabic)
Categorial | be jealous → have jealousy [tener celos] | when he returns → upon his return [عند رجوعه]
Conflational | float → go floating [ir flotando] | come again → return [عاد]
Structural | enter the house → enter in the house [entrar en la casa] | seek → search for [بحث عن]
Head Swap | run in → enter running [entrar corriendo] | do something quickly → go-quickly in doing something [اسرع]
Thematic | I have a headache → my-head hurts me [me duele la cabeza] | —


Divergence Frequency

• 32% of sentences in UN Spanish/English Corpus (5K)
• 35% of sentences in TREC El Norte Corpus (19K)

Divergence Types
• Categorial (X tener hambre → X have hunger) [98%]
• Conflational (X dar puñaladas a Z → X stab Z) [83%]
• Structural (X entrar en Y → X enter Y) [35%]
• Head Swapping (X cruzar Y nadando → X swim across Y) [8%]
• Thematic (X gustar a Y → Y like X) [6%]


MT Lexical Choice – WSD

Iraq lost the battle.
Ilakuka centwey ciessta.  [Iraq] [battle] [lost].

John lost his computer.
John-i computer-lul ilepelyessta.  [John] [computer] [misplaced].


WSD with Source Language Semantic Class Constraints

lose1(Agent, Patient: competition) <=> ciessta

lose2 (Agent, Patient: physobj) <=> ilepelyessta
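A minimal sketch of how such sense constraints could drive lexical choice. The lose1/lose2 senses and the Korean verb forms come from the slides; the semantic-class table, data structures, and function name are hypothetical stand-ins for a real ontology and lexicon.

```python
# Toy lexical choice driven by the semantic class of the Patient argument
# (hypothetical lexicon; Korean forms as glossed on the slides).

SEMANTIC_CLASS = {            # stand-in for a real ontology lookup
    "battle": "competition",
    "election": "competition",
    "computer": "physobj",
    "keys": "physobj",
}

LOSE_SENSES = [               # lose1 / lose2 from the slide
    ("competition", "ciessta"),      # lose1(Agent, Patient: competition)
    ("physobj", "ilepelyessta"),     # lose2(Agent, Patient: physobj)
]

def translate_lose(patient_noun: str) -> str:
    patient_class = SEMANTIC_CLASS.get(patient_noun)
    for required_class, korean_verb in LOSE_SENSES:
        if patient_class == required_class:
            return korean_verb
    raise ValueError(f"no sense of 'lose' licenses patient {patient_noun!r}")

print(translate_lose("battle"))    # -> ciessta
print(translate_lose("computer"))  # -> ilepelyessta
```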


Lexical Gaps: English to Chinese

[Diagram: the English verbs break, smash, shatter, snap mapped against Chinese verbs (the “?” marks a lexical gap):
• da po - irregular pieces
• da sui - small pieces
• pie duan - line segments]

A Gentle Introduction to Statistical MT: 1949 to 1988


Warren Weaver – 1949 Memorandum I

Proposes Local Word Sense Disambiguation!

‘If one examines the words in a book, one at a time through an opaque mask with a hole in it one word wide, then it is obviously impossible to determine, one at a time, the meaning of words. "Fast" may mean "rapid"; or it may mean "motionless"; and there is no way of telling which.

But, if one lengthens the slit in the opaque mask, until one can see not only the central word in question but also say N words on either side, then, if N is large enough one can unambiguously decide the meaning. . .’


Warren Weaver – 1949 Memorandum II

Proposes Interlingua for Machine Translation!

‘Thus it may be true that the way to translate from Chinese to Arabic, or from Russian to Portuguese, is not to attempt the direct route, shouting from tower to tower. Perhaps the way is to descend, from each language, down to the common base of human communication—the real but as yet undiscovered universal language—and—then re-emerge by whatever particular route is convenient.’


Warren Weaver – 1949 Memorandum III

Proposes Machine Translation using Information Theory!

‘It is very tempting to say that a book written in Chinese is simply a book written in English which was coded into the "Chinese code." If we have useful methods for solving almost any cryptographic problem, may it not be that with proper interpretation we already have useful methods for translation?’

Weaver, W. (1949): ‘Translation’. Repr. in: Locke, W.N. and Booth, A.D. (eds.) Machine translation of languages: fourteen essays (Cambridge, Mass.: Technology Press of the Massachusetts Institute of Technology, 1955), pp. 15-23.


IBM Adopts Statistical MT Approach I (early 1990s)

‘In 1949, Warren Weaver proposed that statistical techniques from the emerging field of information theory might make it possible to use modern digital computers to translate text from one natural language to another automatically. Although Weaver's scheme foundered on the rocky reality of the limited computer resources of the day, a group of IBM researchers in the late 1980's felt that the increase in computer power over the previous forty years made reasonable a new look at the applicability of statistical techniques to translation. Thus the "Candide" project, aimed at developing an experimental machine translation system, was born at IBM TJ Watson Research Center.’


IBM Adopts Statistical MT Approach II

‘The Candide group adopted an information-theoretic perspective on the MT problem, which goes as follows. In speaking a French sentence F, a French speaker originally thought up a sentence E in English, but somewhere in the noisy channel between his brain and mouth, the sentence E got "corrupted" to its French translation F. The task of an MT system is to discover E* = argmax(E') p(F|E') p(E'); that is, the MAP-optimal English sentence, given the observed French sentence. This approach involves constructing a model of likely English sentences, and a model of how English sentences translate to French sentences. Both these tasks are accomplished automatically with the help of a large amount of bilingual text.

As wacky as this perspective might sound, it's no stranger than the view that an English sentence gets corrupted into an acoustic signal in passing from the person's brain to his mouth, and this perspective is now essentially universal in automatic speech recognition.’
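Written out, the decision rule quoted above is the standard noisy-channel (MAP) objective; by Bayes' rule it is equivalent to choosing the most probable English sentence given the observed French:

```latex
% Candide's decision rule, as described in the passage above.
\[
  E^{*} \;=\; \operatorname*{arg\,max}_{E'} \; p(F \mid E')\, p(E')
        \;=\; \operatorname*{arg\,max}_{E'} \; \frac{p(E')\, p(F \mid E')}{p(F)}
        \;=\; \operatorname*{arg\,max}_{E'} \; p(E' \mid F)
\]
```

Here p(E') is the language model estimated from English text, and p(F | E') is the channel (translation) model estimated from bilingual text.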


The Channel Model for Machine Translation

this and three of the following four slides are from the original 1990 IBM MT paper


Noisy Channel - Why useful?

Word reordering in translation handled by P(S)
• the P(S) factor frees P(T|S) from worrying about word order in the “source” language

Word choice in translation handled by P(T|S)
• the P(T|S) factor frees P(S) from worrying about picking the right translation
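A schematic Python sketch of this division of labour, with S the hypothesized English and T the observed French as in the slide's notation. All candidate sentences and probabilities below are invented toy numbers, not output of a real system.

```python
import math

# Toy noisy-channel scoring: choose the English hypothesis S maximizing
# log P(S) + log P(T | S) for the observed French sentence T.

LM = {                                   # P(S): fluent, well-ordered English scores higher
    "I love you": 0.020,
    "you love I": 0.0001,
    "I you love": 0.0001,
}

CHANNEL = {                              # P(T | S): word choice, largely order-blind
    ("Je t'aime", "I love you"): 0.30,
    ("Je t'aime", "you love I"): 0.30,
    ("Je t'aime", "I you love"): 0.30,
}

def decode(french, candidates):
    """Return the candidate with the best combined language + channel score."""
    def score(english):
        return math.log(LM[english]) + math.log(CHANNEL[(french, english)])
    return max(candidates, key=score)

print(decode("Je t'aime", ["I love you", "you love I", "I you love"]))
# -> "I love you": the LM term settles word order, the channel term word choice.
```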


An Alignment

[Figure: an example word alignment, annotated with distortion (words landing in different positions) and fertility (a single input word generating several output words)]


Fertilities and Lexical Probabilities for not


Fertilities and Lexical Probabilities for hear


Schematic of Translation Model

[Schematic: the channel model generates its output in stages: fertility, word-by-word translation, distortion (reordering), and words contributed by null cepts]

from What's New in Statistical Machine Translation, Kevin Knight and Philipp Koehn, Tutorial at HLT/NAACL 2003
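The schematic lists the generative steps of the full IBM channel model. As a hedged illustration of how translation tables are learned from bilingual text, here is a compact EM sketch for IBM Model 1 only: it keeps just the word-translation step and ignores fertility, distortion, and the null cept, and the three-sentence parallel corpus is invented.

```python
from collections import defaultdict

# IBM Model 1 EM on a tiny invented English/Spanish corpus.
corpus = [
    ("the house".split(), "la casa".split()),
    ("the table".split(), "la mesa".split()),
    ("a house".split(),   "una casa".split()),
]

english_vocab = {e for es, _ in corpus for e in es}
t = defaultdict(lambda: 1.0 / len(english_vocab))     # uniform init of t(f | e)

for _ in range(20):                                    # EM iterations
    count = defaultdict(float)                         # expected counts c(f, e)
    total = defaultdict(float)                         # expected counts c(e)
    for es, fs in corpus:
        for f in fs:                                   # E-step: fractional alignment counts
            norm = sum(t[(f, e)] for e in es)
            for e in es:
                delta = t[(f, e)] / norm
                count[(f, e)] += delta
                total[e] += delta
    for (f, e), c in count.items():                    # M-step: renormalize per English word
        t[(f, e)] = c / total[e]

for e in sorted(english_vocab):
    best_f = max((f for (f, ee) in t if ee == e), key=lambda f: t[(f, e)])
    print(e, "->", best_f, round(t[(best_f, e)], 2))
# Expected best translations: a -> una, house -> casa, table -> mesa, the -> la
```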


How do we evaluate MT?

Human-based Metrics
• Semantic Invariance
• Pragmatic Invariance
• Lexical Invariance
• Structural Invariance
• Spatial Invariance
• Fluency
• Accuracy: number of human edits required
  • HTER: Human Translation Error Rate (a rough edit-rate sketch follows below)
• “Do you get it?”

Automatic Metrics: BLEU
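As a rough, hedged illustration of an edit-rate metric: the sketch below computes a plain word-level edit rate (edits per reference word). Real HTER counts the edits needed to turn the system output into a human-corrected reference, and TER proper additionally allows block shifts; both refinements are omitted here.

```python
def edit_distance(hyp, ref):
    """Word-level Levenshtein distance (insertions, deletions, substitutions)."""
    d = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        d[i][0] = i
    for j in range(len(ref) + 1):
        d[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(hyp)][len(ref)]

def edit_rate(hypothesis: str, reference: str) -> float:
    hyp, ref = hypothesis.split(), reference.split()
    return edit_distance(hyp, ref) / len(ref)      # edits per reference word

print(edit_rate("the cat sat on mat", "the cat sat on the mat"))  # 1 edit / 6 words = 0.1666...
```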


BiLingual Evaluation Understudy (BLEU; Papineni, 2001)

An automatic technique, but …
• requires the pre-existence of human (reference) translations
• compares n-gram matches between the candidate translation and one or more reference translations
(a sketch of the core n-gram computation follows the example below)


Bleu Metric

Chinese-English Translation Example:

Candidate 1: It is a guide to action which ensures that the military always obeys the commands of the party.

Candidate 2: It is to insure the troops forever hearing the activity guidebook that party direct.

Reference 1: It is a guide to action that ensures that the military will forever heed Party commands.

Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party.

Reference 3: It is the practical guide for the army always to heed the directions of the party.
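A hedged sketch of BLEU's core quantity, clipped ("modified") n-gram precision, applied to the example above with naive whitespace tokenization. The full metric also combines precisions across several n-gram orders and applies a brevity penalty, both omitted here.

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    """Clipped n-gram precision as in BLEU (no brevity penalty, single n)."""
    cand_counts = Counter(ngrams(candidate.lower().split(), n))
    max_ref_counts = Counter()                 # per n-gram, max count in any one reference
    for ref in references:
        for g, c in Counter(ngrams(ref.lower().split(), n)).items():
            max_ref_counts[g] = max(max_ref_counts[g], c)
    clipped = sum(min(c, max_ref_counts[g]) for g, c in cand_counts.items())
    return clipped / max(sum(cand_counts.values()), 1)

refs = [
    "It is a guide to action that ensures that the military will forever heed Party commands.",
    "It is the guiding principle which guarantees the military forces always being under the command of the Party.",
    "It is the practical guide for the army always to heed the directions of the party.",
]
cand1 = "It is a guide to action which ensures that the military always obeys the commands of the party."
cand2 = "It is to insure the troops forever hearing the activity guidebook that party direct."

for name, cand in [("Candidate 1", cand1), ("Candidate 2", cand2)]:
    print(name, round(modified_precision(cand, refs, 1), 2), round(modified_precision(cand, refs, 2), 2))
# Candidate 1 should score markedly higher than Candidate 2 on both unigram and bigram precision.
```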
