Introduction to Machine Translation

Mitch Marcus

CIS 530

Some slides adapted from slides by

John Hutchins, Bonnie Dorr, Martha Palmer


Why use computers in translation?
• Too much translation for humans
• Technical materials too boring for humans
• Greater consistency required
• Need results more quickly
• Not everything needs to be top quality
• Reduce costs

Any one of these may justify machine translation or computer aids


The Early History of NLP (Hutchins): MT in the 1950s and 1960s

Sponsored by government bodies in the USA and USSR (also CIA and KGB)
• assumed goal was fully automatic high-quality output, i.e. of publishable quality [dissemination]
• actual need was translation for information gathering [assimilation]

Survey by Bar-Hillel of MT research:
• criticised the assumption of FAHQT as the goal
• demonstrated the ‘non-feasibility’ of FAHQT (without ‘unrealisable’ encyclopedic knowledge bases)
• advocated “man-machine symbiosis”, i.e. HAMT and MAHT

ALPAC 1966, set up by disillusioned funding agencies:
• compared the latest systems with early unedited MT output (IBM-GU demo, 1954), and criticised them for still needing post-editing
• advocated machine aids, and no further support of MT research
• but failed to identify the actual needs of funders [assimilation]
• therefore failed to see that the output of the IBM-USAF Translator and Georgetown systems was used and appreciated


Consequences of ALPAC

MT research virtually ended in the US

Identification of actual needs:
• assimilation vs. dissemination

Recognition that ‘perfectionism’ (FAHQT) had neglected:
• operational factors and requirements
• the expertise of translators
• machine aids for translators

Henceforth three strands of MT:
• translation tools (HAMT, MAHT)
• operational systems (post-editing, controlled languages, domain-specific systems)
• research (new approaches, new methods)

Computational linguistics born in the aftermath

Machine Translation (Pass 0 – From Intro Lectures)


Why use computers in translation?
• Too much translation for humans
• Technical materials too boring for humans
• Greater consistency required
• Need results more quickly
• Not everything needs to be top quality
• Reduce costs

Any one of these may justify machine translation or computer aids

(next several slides adapted from Language Weaver)


Statistical Machine Translation Technology

[Diagram: Spanish/English bilingual text and English text are each run through statistical analysis; the resulting models turn Spanish input into “broken English” candidates, which are then smoothed into fluent English.]

Example: “Que hambre tengo yo” (Spanish)
→ broken English candidates: “What hunger have I”, “Hungry I am so”, “I am so hungry”, “Have I that hunger …”
→ English: “I am so hungry”


How A Statistical MT System Learns


Translating a New Document


[Screenshots: successive Language Weaver releases (v.2.0, v.2.4, v.3.0) translating the same article; source: Aljazeera, January 8, 2005]


Translingual Chat – Instant Messaging

[Screenshot: instant-message conversation shown with each original message beside its translation]


Language Weaver (Al Jazeera 8/2007)

LanguageWeaver Demo Website


Language Weaver Hybrid Translation Technology

Chinese Source Text, Sample 1:

 车展,一向是衡量一个国家汽车消费现状和市场潜力的“晴雨表”。本届北京国际车展有 24个国家的 1200余家厂商参展, 8天接待 40余万名参观者,创下了中国车展的新纪录,让人深切地感受到汽车市场启动的信号。     “中国是世界最后一个最大的汽车市场”。多年来,这句话更多地包含着汽车商人的一种希冀。然而如今,越来越多的事实预示着它正在变为现实。     来自本届车展的一组数据很有说服力。《北京青年报》的一份现场调查显示, 35岁以下参观者约占 35%; 62.1%的被访者表示,参观车展主要是为近期买车搜集信息,甚至在展会上就有可能购买或预订合适的产品; 76%的被访者表示最近两年会购买私家车。     今年以来,国内轿车市场的强劲增长让厂家喜上眉梢。据国家统计局公布的数字,前 4个月,全国共生产轿车 26.79万辆,增长 27.6%;特别是 4月份,生 产轿车 9万辆,同比增长 50.5%,创造了十几年来轿车月产增长的最高纪录。从销售看,一季度,全国轿车生产企业共销售轿车18.8万辆,同比增长  22%,产销率达 105%;轿车库存比年初下降 1.1万辆,下降幅度近 25%。 Language Weaver Experimental Syntax MT Sample 1 : The motor show, has always been the' barometer' of a national car consumption and market potential. The Beijing International Auto Show has more than 1,200 exhibitors from 24 countries and 8 days of receiving more than 40 million visitors, setting a new record in China's auto show, are deeply aware of the automobile market signals.  "China is one of the largest automobile market in the world. Over the years, this phrase implies more auto businessmen. But now, more and more facts indicates that it is to become a reality.  Data from the Motor Show is very convincing. The Beijing Qingnian Bao Report on-the-spot investigation showed that about 35 percent of 35-year-old visitors, 62.1 percent of the respondents said that the truck was mainly to buy a car in the near future to collect information, even at the exhibition may purchase or suitable products; 76% of respondents indicated in the past two years to buy private cars.  Since the beginning of this year, the strong growth of the domestic car market. According to the figures released by the National Bureau of Statistics, in the first four months, the country produced 267,900 vehicles, up 27.6 percent; in particular, in April, the production of 90,000 vehicles, an increase of 50.5% over the same period last year, setting a record high for the monthly output growth over the past 10-odd years. In terms of sales in the first quarter, manufacturing enterprises in the country sold 188,000 cars, up 22 percent over the same period of last year, up 10.5 percent; 11,000 vehicles, dropping by nearly 25 percent lower than the beginning of the year.


Broadcast Monitoring BBN MAPS & Language Weaver MT


Three MT Approaches: Direct, Transfer, Interlingual (Vauquois triangle)

[Vauquois triangle: a source text is analyzed upward through word structure, syntactic structure, and semantic structure to an interlingua at the apex, and the target text is generated back down the other side (morphological, syntactic, and semantic analysis mirrored by semantic, syntactic, and morphological generation). Translation can short-cut the climb at any level: direct translation at the word level, syntactic transfer, semantic transfer, or the full interlingual route via semantic composition and decomposition.]


Examples of Three Approaches

Direct:
• I checked his answers against those of the teacher → Yo comparé sus respuestas a las de la profesora
• Rule: [check X against Y] → [comparar X a Y]
(a toy sketch of applying such a rule appears after this slide)

Transfer:
• Ich habe ihn gesehen → I have seen him
• Rule: [clause agt aux obj pred] → [clause agt aux pred obj]

Interlingual:
• I like Mary → Mary me gusta a mí
• Rep: [BeIdent (I [ATIdent (I, Mary)] Like+ingly)]
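To make the direct strategy concrete, here is a toy Python sketch, not taken from any real system: the rule, the glossary, and the function name are invented for illustration. It applies the [check X against Y] → [comparar X a Y] rule as a pattern substitution and then substitutes the remaining words one by one, as a direct system would.

```python
import re

# One hand-written "direct" structural rule of the kind shown above
# (hypothetical toy data, for illustration only).
RULES = [
    # [check X against Y] -> [comparar X a Y]
    (re.compile(r"checked (?P<x>.+?) against (?P<y>.+)"),
     r"comparé \g<x> a \g<y>"),
]

# A tiny word-for-word glossary applied after the structural rule;
# real direct systems relied on large bilingual dictionaries.
GLOSSARY = {"I": "Yo", "his": "sus", "answers": "respuestas",
            "those": "las", "of": "de", "the": "la", "teacher": "profesora"}

def direct_translate(sentence: str) -> str:
    out = sentence
    for pattern, template in RULES:                     # structural substitution
        out = pattern.sub(template, out)
    words = [GLOSSARY.get(w, w) for w in out.split()]   # word-for-word substitution
    return " ".join(words)

print(direct_translate("I checked his answers against those of the teacher"))
# -> Yo comparé sus respuestas a las de la profesora
```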


Direct MT: Pros and Cons

Pros:
• Fast
• Simple
• Inexpensive

Cons:
• Unreliable
• Not powerful
• Rule proliferation
• Requires too much context
• Major restructuring after lexical substitution


Transfer MT: Pros and Cons

Pros:
• Don’t need to find a language-neutral rep
• No translation rules hidden in the lexicon
• Relatively fast

Cons:
• N² sets of transfer rules: difficult to extend
• Proliferation of language-specific rules in lexicon and syntax
• Cross-language generalizations lost


Interlingual MT: Pros and Cons

Pros:
• Portable (avoids the N² problem)
• Lexical rules and structural transformations stated more simply on a normalized representation
• Explanatory adequacy

Cons:
• Difficult to deal with terms at the primitive level: universals?
• Must decompose and reassemble concepts
• Useful information lost (paraphrase)
• (Is thought really language neutral??)


MT Challenges: Ambiguity

Syntactic Ambiguity:
• I saw the man on the hill with the telescope

Lexical Ambiguity:
• E: book → S: libro, reservar

Semantic Ambiguity:
• Homography: ball (E) = pelota, baile (S)
• Polysemy: kill (E) = matar, acabar (S)
• Semantic granularity: esperar (S) = wait, expect, hope (E); be (E) = ser, estar (S); fish (E) = pez, pescado (S)


MT Challenges: Divergences

• Meaning of two translationally equivalent phrases is distributed differently in the two languages

• Example:
  - English: [RUN INTO ROOM]
  - Spanish: [ENTER IN ROOM RUNNING]


Spanish/Arabic Divergences

Divergence | E / E’ (Spanish) | E / E’ (Arabic)
Categorial | be jealous → have jealousy [tener celos] | when he returns → upon his return [عند رجوعه]
Conflational | float → go floating [ir flotando] | come again → return [عاد]
Structural | enter the house → enter in the house [entrar en la casa] | seek → search for [بحث عن]
Head Swap | run in → enter running [entrar corriendo] | do something quickly → go-quickly in doing something [اسرع]
Thematic | I have a headache → my-head hurts me [me duele la cabeza] | —


Divergence Frequency

• 32% of sentences in UN Spanish/English Corpus (5K)
• 35% of sentences in TREC El Norte Corpus (19K)

Divergence Types
• Categorial (X tener hambre → X have hunger) [98%]
• Conflational (X dar puñaladas a Z → X stab Z) [83%]
• Structural (X entrar en Y → X enter Y) [35%]
• Head Swapping (X cruzar Y nadando → X swim across Y) [8%]
• Thematic (X gustar a Y → Y like X) [6%]


MT Lexical Choice – WSD

Iraq lost the battle.
Ilakuka centwey ciessta.  [Iraq] [battle] [lost].

John lost his computer.
John-i computer-lul ilepelyessta.  [John] [computer] [misplaced].


WSD with Source Language Semantic Class Constraints

lose1(Agent, Patient: competition) <=> ciessta

lose2 (Agent, Patient: physobj) <=> ilepelyessta
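A minimal sketch of how such sense constraints could drive lexical choice. The lose1/lose2 senses and the Korean verb forms come from the slides; the semantic-class table, data structures, and function name are hypothetical stand-ins for a real ontology and lexicon.

```python
# Toy lexical choice driven by the semantic class of the Patient argument
# (hypothetical lexicon; Korean forms as glossed on the slides).

SEMANTIC_CLASS = {            # stand-in for a real ontology lookup
    "battle": "competition",
    "election": "competition",
    "computer": "physobj",
    "keys": "physobj",
}

LOSE_SENSES = [               # lose1 / lose2 from the slide
    ("competition", "ciessta"),      # lose1(Agent, Patient: competition)
    ("physobj", "ilepelyessta"),     # lose2(Agent, Patient: physobj)
]

def translate_lose(patient_noun: str) -> str:
    patient_class = SEMANTIC_CLASS.get(patient_noun)
    for required_class, korean_verb in LOSE_SENSES:
        if patient_class == required_class:
            return korean_verb
    raise ValueError(f"no sense of 'lose' licenses patient {patient_noun!r}")

print(translate_lose("battle"))    # -> ciessta
print(translate_lose("computer"))  # -> ilepelyessta
```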


Lexical Gaps: English to Chinese

[Diagram: the English verbs break, smash, shatter, snap mapped against Chinese verbs (the “?” marks a lexical gap):
• da po - irregular pieces
• da sui - small pieces
• pie duan - line segments]

A Gentle Introduction to Statistical MT: 1949 to 1988


Warren Weaver – 1949 Memorandum I

Proposes Local Word Sense Disambiguation!

‘If one examines the words in a book, one at a time through an opaque mask with a hole in it one word wide, then it is obviously impossible to determine, one at a time, the meaning of words. "Fast" may mean "rapid"; or it may mean "motionless"; and there is no way of telling which.

But, if one lengthens the slit in the opaque mask, until one can see not only the central word in question but also say N words on either side, then, if N is large enough one can unambiguously decide the meaning. . .’


Warren Weaver – 1949 Memorandum II

Proposes Interlingua for Machine Translation!

‘Thus it may be true that the way to translate from Chinese to Arabic, or from Russian to Portuguese, is not to attempt the direct route, shouting from tower to tower. Perhaps the way is to descend, from each language, down to the common base of human communication—the real but as yet undiscovered universal language—and—then re-emerge by whatever particular route is convenient.’


Warren Weaver – 1949 Memorandum III

Proposes Machine Translation using Information Theory!

‘It is very tempting to say that a book written in Chinese is simply a book written in English which was coded into the "Chinese code." If we have useful methods for solving almost any cryptographic problem, may it not be that with proper interpretation we already have useful methods for translation?’

Weaver, W. (1949): ‘Translation’. Repr. in: Locke, W.N. and Booth, A.D. (eds.) Machine translation of languages: fourteen essays (Cambridge, Mass.: Technology Press of the Massachusetts Institute of Technology, 1955), pp. 15-23.


IBM Adopts Statistical MT Approach I (early 1990s)

‘In 1949, Warren Weaver proposed that statistical techniques from the emerging field of information theory might make it possible to use modern digital computers to translate text from one natural language to another automatically. Although Weaver's scheme foundered on the rocky reality of the limited computer resources of the day, a group of IBM researchers in the late 1980's felt that the increase in computer power over the previous forty years made reasonable a new look at the applicability of statistical techniques to translation. Thus the "Candide" project, aimed at developing an experimental machine translation system, was born at IBM TJ Watson Research Center.’


IBM Adopts Statistical MT Approach II

‘The Candide group adopted an information-theoretic perspective on the MT problem, which goes as follows. In speaking a French sentence F, a French speaker originally thought up a sentence E in English, but somewhere in the noisy channel between his brain and mouth, the sentence E got "corrupted" to its French translation F. The task of an MT system is to discover E* = argmax(E') p(F|E') p(E'); that is, the MAP-optimal English sentence, given the observed French sentence. This approach involves constructing a model of likely English sentences, and a model of how English sentences translate to French sentences. Both these tasks are accomplished automatically with the help of a large amount of bilingual text.

As wacky as this perspective might sound, it's no stranger than the view that an English sentence gets corrupted into an acoustic signal in passing from the person's brain to his mouth, and this perspective is now essentially universal in automatic speech recognition.’
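Written out, the decision rule quoted above is the standard noisy-channel (MAP) objective; by Bayes' rule it is equivalent to choosing the most probable English sentence given the observed French:

```latex
% Candide's decision rule, as described in the passage above.
\[
  E^{*} \;=\; \operatorname*{arg\,max}_{E'} \; p(F \mid E')\, p(E')
        \;=\; \operatorname*{arg\,max}_{E'} \; \frac{p(E')\, p(F \mid E')}{p(F)}
        \;=\; \operatorname*{arg\,max}_{E'} \; p(E' \mid F)
\]
```

Here p(E') is the language model estimated from English text, and p(F | E') is the channel (translation) model estimated from bilingual text.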


The Channel Model for Machine Translation

this and three of the following four slides are from the original 1990 IBM MT paper


Noisy Channel - Why useful?

Word reordering in translation handled by P(S)
• the P(S) factor frees P(T|S) from worrying about word order in the “source” language

Word choice in translation handled by P(T|S)
• the P(T|S) factor frees P(S) from worrying about picking the right translation
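A schematic Python sketch of this division of labour, with S the hypothesized English and T the observed French as in the slide's notation. All candidate sentences and probabilities below are invented toy numbers, not output of a real system.

```python
import math

# Toy noisy-channel scoring: choose the English hypothesis S maximizing
# log P(S) + log P(T | S) for the observed French sentence T.

LM = {                                   # P(S): fluent, well-ordered English scores higher
    "I love you": 0.020,
    "you love I": 0.0001,
    "I you love": 0.0001,
}

CHANNEL = {                              # P(T | S): word choice, largely order-blind
    ("Je t'aime", "I love you"): 0.30,
    ("Je t'aime", "you love I"): 0.30,
    ("Je t'aime", "I you love"): 0.30,
}

def decode(french, candidates):
    """Return the candidate with the best combined language + channel score."""
    def score(english):
        return math.log(LM[english]) + math.log(CHANNEL[(french, english)])
    return max(candidates, key=score)

print(decode("Je t'aime", ["I love you", "you love I", "I you love"]))
# -> "I love you": the LM term settles word order, the channel term word choice.
```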


An Alignment

[Figure: an example word alignment, annotated with distortion (words landing in different positions) and fertility (a single input word generating several output words)]


Fertilities and Lexical Probabilities for not


Fertilities and Lexical Probabilities for hear


Schematic of Translation Model

[Schematic: the channel model generates its output in stages: fertility, word-by-word translation, distortion (reordering), and words contributed by null cepts]

from What's New in Statistical Machine Translation, Kevin Knight and Philipp Koehn, Tutorial at HLT/NAACL 2003
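The schematic lists the generative steps of the full IBM channel model. As a hedged illustration of how translation tables are learned from bilingual text, here is a compact EM sketch for IBM Model 1 only: it keeps just the word-translation step and ignores fertility, distortion, and the null cept, and the three-sentence parallel corpus is invented.

```python
from collections import defaultdict

# IBM Model 1 EM on a tiny invented English/Spanish corpus.
corpus = [
    ("the house".split(), "la casa".split()),
    ("the table".split(), "la mesa".split()),
    ("a house".split(),   "una casa".split()),
]

english_vocab = {e for es, _ in corpus for e in es}
t = defaultdict(lambda: 1.0 / len(english_vocab))     # uniform init of t(f | e)

for _ in range(20):                                    # EM iterations
    count = defaultdict(float)                         # expected counts c(f, e)
    total = defaultdict(float)                         # expected counts c(e)
    for es, fs in corpus:
        for f in fs:                                   # E-step: fractional alignment counts
            norm = sum(t[(f, e)] for e in es)
            for e in es:
                delta = t[(f, e)] / norm
                count[(f, e)] += delta
                total[e] += delta
    for (f, e), c in count.items():                    # M-step: renormalize per English word
        t[(f, e)] = c / total[e]

for e in sorted(english_vocab):
    best_f = max((f for (f, ee) in t if ee == e), key=lambda f: t[(f, e)])
    print(e, "->", best_f, round(t[(best_f, e)], 2))
# Expected best translations: a -> una, house -> casa, table -> mesa, the -> la
```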


How do we evaluate MT?

Human-based Metrics
• Semantic Invariance
• Pragmatic Invariance
• Lexical Invariance
• Structural Invariance
• Spatial Invariance
• Fluency
• Accuracy: number of human edits required
  • HTER: Human Translation Error Rate (a rough edit-rate sketch follows below)
• “Do you get it?”

Automatic Metrics: BLEU
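As a rough, hedged illustration of an edit-rate metric: the sketch below computes a plain word-level edit rate (edits per reference word). Real HTER counts the edits needed to turn the system output into a human-corrected reference, and TER proper additionally allows block shifts; both refinements are omitted here.

```python
def edit_distance(hyp, ref):
    """Word-level Levenshtein distance (insertions, deletions, substitutions)."""
    d = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        d[i][0] = i
    for j in range(len(ref) + 1):
        d[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(hyp)][len(ref)]

def edit_rate(hypothesis: str, reference: str) -> float:
    hyp, ref = hypothesis.split(), reference.split()
    return edit_distance(hyp, ref) / len(ref)      # edits per reference word

print(edit_rate("the cat sat on mat", "the cat sat on the mat"))  # 1 edit / 6 words = 0.1666...
```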


BiLingual Evaluation Understudy (BLEU; Papineni, 2001)

An automatic technique, but …
• requires the pre-existence of human (reference) translations
• compares n-gram matches between the candidate translation and one or more reference translations
(a sketch of the core n-gram computation follows the example below)


Bleu Metric

Chinese-English Translation Example:

Candidate 1: It is a guide to action which ensures that the military always obeys the commands of the party.

Candidate 2: It is to insure the troops forever hearing the activity guidebook that party direct.

Reference 1: It is a guide to action that ensures that the military will forever heed Party commands.

Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party.

Reference 3: It is the practical guide for the army always to heed the directions of the party.
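A hedged sketch of BLEU's core quantity, clipped ("modified") n-gram precision, applied to the example above with naive whitespace tokenization. The full metric also combines precisions across several n-gram orders and applies a brevity penalty, both omitted here.

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    """Clipped n-gram precision as in BLEU (no brevity penalty, single n)."""
    cand_counts = Counter(ngrams(candidate.lower().split(), n))
    max_ref_counts = Counter()                 # per n-gram, max count in any one reference
    for ref in references:
        for g, c in Counter(ngrams(ref.lower().split(), n)).items():
            max_ref_counts[g] = max(max_ref_counts[g], c)
    clipped = sum(min(c, max_ref_counts[g]) for g, c in cand_counts.items())
    return clipped / max(sum(cand_counts.values()), 1)

refs = [
    "It is a guide to action that ensures that the military will forever heed Party commands.",
    "It is the guiding principle which guarantees the military forces always being under the command of the Party.",
    "It is the practical guide for the army always to heed the directions of the party.",
]
cand1 = "It is a guide to action which ensures that the military always obeys the commands of the party."
cand2 = "It is to insure the troops forever hearing the activity guidebook that party direct."

for name, cand in [("Candidate 1", cand1), ("Candidate 2", cand2)]:
    print(name, round(modified_precision(cand, refs, 1), 2), round(modified_precision(cand, refs, 2), 2))
# Candidate 1 should score markedly higher than Candidate 2 on both unigram and bigram precision.
```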
