CMSC 723 / LING 723: Computational Linguistics I

September 5, 2007: Dorr

Part I: MT (cont.), MT Evaluation (J&M, 24)
Part II: Regular Expressions, Finite-State Automata (J&M, 2)

Prof. Bonnie J. Dorr
Co-Instructor: Nitin Madnani

TA: Hamid Shahri

MT Challenges: Ambiguity

Syntactic Ambiguity
– I saw the man on the hill with the telescope

Lexical Ambiguity
– E: book
– S: libro, reservar

Semantic Ambiguity
– Homography: ball(E) = pelota, baile (S)
– Polysemy: kill(E) = matar, acabar (S)
– Semantic granularity:
  esperar(S) = wait, expect, hope (E)
  be(E) = ser, estar (S)
  fish(E) = pez, pescado (S)

Language Typology: MT Divergences (Dorr, 1990)

Meaning of two translationally equivalent phrases is distributed differently in the two languages

Example:– English: [RUN INTO ROOM]– Spanish: [ENTER IN ROOM RUNNING]

Divergence examples, E → E′ (Spanish and Arabic):

Categorial
– Spanish: be jealous → have jealousy [tener celos]
– Arabic: when he returns → upon his return [عند رجوعه]

Conflational
– Spanish: float → go floating [ir flotando]
– Arabic: come again → return [عاد]

Structural
– Spanish: enter the house → enter in the house [entrar en la casa]
– Arabic: seek → search for [بحث عن]

Head Swap
– Spanish: run in → enter running [entrar corriendo]
– Arabic: do something quickly → go-quickly in doing something [اسرع]

Thematic
– Spanish: I have a headache → my-head hurts me [me duele la cabeza]
– Arabic: —

Spanish/Arabic Divergences (Dorr, Habash, and Hwa, 2002)

[Arg1 [V]] → [Arg1 [MotionV] Modifier(V)]
“The boat floated” → “The boat went floating”

Approximate IL Approach

Tap into richness of TL resources

Use some, but not all, components of IL representation

Generate multiple sentences that are statistically pared down

Approximating IL: Handling Divergences

Primitives
Semantic Relations
Lexical Information

Generation-Heavy Hybrid MT (GHMT; Habash, 2003)

Interlingual vs. Approximate IL

Interlingual MT:
– primitives & relations
– bi-directional lexicons
– analysis: compose IL
– generation: decompose IL

Approximate IL:
– hybrid symbolic/statistical design
– overgeneration with statistical ranking
– uses dependency-representation input and structural expansion for “deeper” overgeneration

Mapping from Input Dependency to English Dependency Tree

Knowledge Resources in English only (LVD, Dorr, 2001; CatVar, Habash & Dorr, 2003).

[Figure: input dependency tree — GIVE(V) with Agent MARY, Theme KICK(N), Goal JOHN and primitive [CAUSE GO] — mapped to the English dependency tree KICK(V) with Agent MARY, Goal JOHN and primitive [CAUSE GO]]

Mary le dio patadas a John → Mary kicked John

Statistical Extraction

Mary kicked John . [-0.670270 ]

Mary gave a kick at John . [-2.175831]

Mary gave the kick at John . [-3.969686]

Mary gave an kick at John . [-4.489933]

Mary gave a kick by John . [-4.803054]

Mary gave a kick to John . [-5.045810]

Mary gave a kick into John . [-5.810673]

Mary gave a kick through John . [-5.836419]

Mary gave a foot wound by John . [-6.041891]

Mary gave John a foot wound . [-6.212851]

Benefits of Approximate IL Approach

Explaining behaviors that appear to be statistical in nature

“Re-sourceability”: Re-use of already existing components for MT from new languages.

Application to monolingual alternations

What Resources are Required?

Deep TL resources
– Requires SL parser and tralex
– TL resources are richer: LVD representations, CatVar database

Constrained overgeneration

http://clipdemos.umiacs.umd.edu/catvar/
http://clipdemos.umiacs.umd.edu/englcslex/

Divergence Frequency (as measured by Habash and others, 2003)

32% of sentences in UN Spanish/English Corpus (5K)
35% of sentences in TREC El Norte Corpus (19K)

Divergence Types

– Categorial (X tener hambre → X have hunger) [98%]

– Conflational (X dar puñaladas a Z → X stab Z) [83%]

– Structural (X entrar en Y → X enter Y) [35%]

– Head Swapping (X cruzar Y nadando → X swim across Y) [8%]

– Thematic (X gustar a Y → Y like X) [6%]

Language Divergences Impact Bilingual Alignment for Statistical MT

Word-level alignments of bilingual texts are an integral part of Statistical MT models

Divergences present a great challenge to the alignment task

Common divergence types can be found in multiple language pairs, systematically identified, and resolved

What is alignment?

I began to eat the fish

Yo empecé a comer el pescado

Why is this a hard problem?

I run into the room

Yo entro en el cuarto corriendo

Divergences!
English: [RUN INTO ROOM]
Spanish: [ENTER IN ROOM RUNNING]

What can be done?

Divergence Detection:
– Increase the number of aligned words
– Decrease multiple alignments

DUSTer Approach: Divergence Unraveling

E:  I run into the room
E′: I move-in running the room
S:  Yo entro en el cuarto corriendo

Word-Level Alignment (1): Test Setup

[Figure: word-level alignment — English words (John, run, into, room) linked to transformed English words (John, enter, room, running)]

Ex: John ran into the room → John entered the room running

Divergence Detection: Categorize English sentences into one of 5 divergence types

Divergence Correction: Apply appropriate structural transformation [E → E′]
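The correction step can be pictured as a tiny rewrite rule. Below is a minimal Python sketch for the run-into example from the earlier slide; the string-level pattern and the “move-in” placeholder verb are illustrative assumptions — the actual DUSTer system applies parse-based structural transformations, not surface regexes.

```python
import re

# Toy E -> E' head-swap rewrite (illustration only, not DUSTer's rules):
# "X run(s)/ran into Y" -> "X move-in running Y",
# mirroring the Spanish structure [ENTER IN ROOM RUNNING].
def head_swap(sentence):
    return re.sub(r"\b(?:run|runs|ran)\s+into\b", "move-in running", sentence)

print(head_swap("I run into the room"))     # I move-in running the room
print(head_swap("John ran into the room"))  # John move-in running the room
```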

Word-Level Alignment Results

Number of aligned words:
– English-Spanish: aligned words increased from 82.8% to 86%
– English-Arabic: aligned words increased from 61.5% to 88.1%

Multiple Alignments:
– English-Spanish: number of links went from 1.35 to 1.16

– English-Arabic: number of links increased from 1.48 to 1.72

Divergence Unraveling Conclusions

Divergence handling shows promise for improvement of automatic alignment

Conservative lower bound on divergence frequency

Effective solution: syntactic transformation of English

Validity of solution shown through alignment experiments

How do we evaluate MT?

Human-based Metrics
– Invariance: Semantic, Pragmatic, Lexical, Structural, Spatial
– Fluency
– Accuracy
– Adequacy
– Edit cost of post-editing
– Informativeness: “Do you get it?”

Automatic Metrics:
– Bleu
– NIST
– METEOR
– Precision & Recall
– TER, HTER
– GTM

BiLingual Evaluation Understudy (BLEU — Papineni, 2001)

Automatic Technique, but…
– Requires the pre-existence of Human (Reference) Translations

Approach:
– Produce corpus of high-quality human translations
– Judge “closeness” numerically (word-error rate)
– Compare n-gram matches between candidate translation and 1 or more reference translations

http://www.research.ibm.com/people/k/kishore/RC22176.pdf

Bleu Comparison

Chinese-English Translation Example:

Candidate 1: It is a guide to action which ensures that the military always obeys the commands of the party.

Candidate 2: It is to insure the troops forever hearing the activity guidebook that party direct.

Reference 1: It is a guide to action that ensures that the military will forever heed Party commands.

Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party.

Reference 3: It is the practical guide for the army always to heed the directions of the party.

How Do We Compute Bleu Scores?

Key Idea: A reference word should be considered exhausted after a matching candidate word is identified.

• For each word compute: (1) candidate word count, (2) maximum reference count.

• Add counts for each candidate word using the lower of the two numbers.

• Divide by the number of candidate words.

Modified Unigram Precision: Candidate #1

Reference 1: It is a guide to action that ensures that the military will forever heed Party commands.

Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party.

Reference 3: It is the practical guide for the army always to heed the directions of the party.

It(1) is(1) a(1) guide(1) to(1) action(1) which(1) ensures(1) that(2) the(4) military(1) always(1) obeys(0) the commands(1) of(1) the party(1)

What’s the answer??????

17/18

Modified Unigram Precision: Candidate #2

It(1) is(1) to(1) insure(0) the(4) troops(0) forever(1) hearing(0) the activity(0) guidebook(0) that(2) party(1) direct(0)

What’s the answer??????

8/14

Reference 1: It is a guide to action that ensures that the military will forever heed Party commands.

Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party.

Reference 3: It is the practical guide for the army always to heed the directions of the party.

Modified Bigram Precision: Candidate #1

It is(1) is a(1) a guide(1) guide to(1) to action(1) action which(0) which ensures(0) ensures that(1) that the(1) the military(1) military always(0) always obeys(0) obeys the(0) the commands(0) commands of(0) of the(1) the party(1)

What’s the answer??????

10/17

Reference 1: It is a guide to action that ensures that the military will forever heed Party commands.

Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party.

Reference 3: It is the practical guide for the army always to heed the directions of the party.

Modified Bigram Precision: Candidate #2

Reference 1: It is a guide to action that ensures that the military will forever heed Party commands.

Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party.

Reference 3: It is the practical guide for the army always to heed the directions of the party.

It is(1) is to(0) to insure(0) insure the(0) the troops(0) troops forever(0) forever hearing(0) hearing the(0) the activity(0) activity guidebook(0) guidebook that(0) that party(0) party direct(0)

What’s the answer??????

1/13

Catching Cheaters

Reference 1: The cat is on the mat

Reference 2: There is a cat on the mat

Candidate: the the the the the the the

the(2) the(0) the(0) the(0) the(0) the(0) the(0)

What’s the unigram answer?

2/7

What’s the bigram answer?

0/6 (a seven-word candidate has only six bigrams, none of which appears in either reference)
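The worked examples above, including the cheating candidate, can be reproduced in a few lines. This is a sketch of modified n-gram precision only (full BLEU additionally combines several n-gram orders geometrically and applies a brevity penalty); punctuation is dropped and everything lowercased for simplicity.

```python
from collections import Counter
from fractions import Fraction

def modified_precision(candidate, references, n=1):
    """BLEU-style modified n-gram precision: clip each candidate
    n-gram count at its maximum count in any one reference."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand = ngrams(candidate.lower().split())
    refs = [ngrams(ref.lower().split()) for ref in references]
    clipped = sum(min(count, max(ref[gram] for ref in refs))
                  for gram, count in cand.items())
    return Fraction(clipped, sum(cand.values()))

refs = [
    "it is a guide to action that ensures that the military"
    " will forever heed party commands",
    "it is the guiding principle which guarantees the military"
    " forces always being under the command of the party",
    "it is the practical guide for the army always to heed"
    " the directions of the party",
]
cand1 = ("it is a guide to action which ensures that the military"
         " always obeys the commands of the party")
cand2 = ("it is to insure the troops forever hearing the activity"
         " guidebook that party direct")

print(modified_precision(cand1, refs))        # 17/18
print(modified_precision(cand2, refs))        # 4/7 (i.e. 8/14)
print(modified_precision(cand1, refs, n=2))   # 10/17
print(modified_precision(cand2, refs, n=2))   # 1/13

# The cheating candidate from the previous slide:
cheat_refs = ["the cat is on the mat", "there is a cat on the mat"]
print(modified_precision("the the the the the the the", cheat_refs))       # 2/7
print(modified_precision("the the the the the the the", cheat_refs, n=2))  # 0
```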

Part II of Lecture 2

Regular Expressions and Finite State Automata (J&M, 2)

Regular Expressions and Finite State Automata

REs: Language for specifying text strings
– Search for a document containing a string
  • Searching for “woodchuck”
  • Searching for “woodchucks” with an optional final “s”

Finite-state automata (FSA) (singular: automaton)

“How much wood would a woodchuck chuck if a woodchuck would chuck wood?”

Regular Expressions

Basic regular expression patterns
Perl-based syntax (slightly different from other notations for regular expressions)
Disjunctions: /[wW]oodchuck/

Regular Expressions

Ranges [A-Z]

Negations [^Ss]
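These character-class patterns behave the same way in Python’s `re` module as in Perl; a quick sketch:

```python
import re

# Disjunction via a character class: "woodchuck" or "Woodchuck"
print(bool(re.search(r'[wW]oodchuck', 'a Woodchuck chucks')))  # True

# Range: any capital letter A-Z
print(bool(re.search(r'[A-Z]', 'we went to Ramallah')))        # True

# Negation: any character that is NOT S or s
print(re.findall(r'[^Ss]', 'Sassy'))                           # ['a', 'y']
```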

Regular Expressions

Optional characters: ?, * and +
– ? (0 or 1)
  • /colou?r/ → color or colour
– * (0 or more)
  • /oo*h!/ → oh! or ooh! or oooh!
– + (1 or more)
  • /o+h!/ → oh! or ooh! or oooh!

Wild cards: .
– /beg.n/ → begin or began or begun

(* and + are the Kleene operators, named for Stephen Cole Kleene)
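The counters and the wildcard can be checked directly; `fullmatch` below is a Python convenience that anchors the pattern to the whole string:

```python
import re

# ? : zero or one "u"
pat = re.compile(r'colou?r')
print([w for w in ['color', 'colour', 'colouur'] if pat.fullmatch(w)])
# ['color', 'colour']

# * : zero or more extra "o"s
pat = re.compile(r'oo*h!')
print([w for w in ['oh!', 'ooh!', 'oooh!', 'h!'] if pat.fullmatch(w)])
# ['oh!', 'ooh!', 'oooh!']

# + : one or more "o"s
pat = re.compile(r'o+h!')
print([w for w in ['oh!', 'ooh!', 'h!'] if pat.fullmatch(w)])
# ['oh!', 'ooh!']

# . : any single character
pat = re.compile(r'beg.n')
print([w for w in ['begin', 'began', 'begun', 'begging'] if pat.fullmatch(w)])
# ['begin', 'began', 'begun']
```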

Regular Expressions

Anchors: ^ and $
– /^[A-Z]/ → “Ramallah, Palestine”

– /^[^A-Z]/ → “¿verdad?” “really?”

– /\.$/ → “It is over.”

– /.$/ → ?

Boundaries: \b and \B
– /\bon\b/ → “on my way” (but not “Monday”)

– /\Bon\b/ → “automaton”

Disjunction: |
– /yours|mine/ → “it is either yours or mine”
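The anchors and boundaries above carry over unchanged to Python’s `re` (which shares Perl’s \b and \B):

```python
import re

print(bool(re.search(r'^[A-Z]', 'Ramallah, Palestine')))  # True: starts with a capital
print(bool(re.search(r'^[^A-Z]', 'really?')))             # True: starts with a non-capital
print(bool(re.search(r'\.$', 'It is over.')))             # True: ends with a literal period

# \b word boundary: "on" only as a whole word
print(bool(re.search(r'\bon\b', 'on my way')))            # True
print(bool(re.search(r'\bon\b', 'Monday')))               # False

# \B non-boundary: "on" ending a longer word
print(bool(re.search(r'\Bon\b', 'automaton')))            # True
```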

Disjunction, Grouping, Precedence

Column 1 Column 2 Column 3 …
How do we express this?
/Column [0-9]+ */
/(Column [0-9]+ *)*/

Precedence (highest to lowest):
– Parentheses ()
– Counters * + ? {}
– Sequences and anchors: the ^my end$
– Disjunction |

REs are greedy!
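That greediness can be demonstrated directly (Python’s `re` follows the same convention; `.*?` is its non-greedy variant):

```python
import re

line = 'Column 1 Column 2 Column 3'
# (Column [0-9]+ *)* gobbles the whole row in one greedy match
m = re.match(r'(Column [0-9]+ *)*', line)
print(m.group(0))                               # 'Column 1 Column 2 Column 3'

# Greedy vs. non-greedy: .* takes as much as possible, .*? as little
print(re.match(r'<.*>', '<a><b>').group(0))     # '<a><b>'
print(re.match(r'<.*?>', '<a><b>').group(0))    # '<a>'
```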

Perl Commands

while ($line = <STDIN>) {
    if ($line =~ /the/) {
        print "MATCH: $line";
    }
}

Writing correct expressions

Exercise: Write a Perl regular expression to match the English article “the”:

/the/
/[tT]he/
/\b[tT]he\b/
/[^a-zA-Z][tT]he[^a-zA-Z]/
/(^|[^a-zA-Z])[tT]he[^a-zA-Z]/
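A quick way to check the final pattern against tricky inputs, in Python. One caveat: the trailing `|$` alternative below is my addition so the pattern can also match at the very end of a line; the slide’s version requires a following non-letter.

```python
import re

# Slide's final pattern, extended with ($) so "the" can end the line
the_re = r'(^|[^a-zA-Z])[tT]he([^a-zA-Z]|$)'

tests = {
    'The cat sat': True,
    'the end.': True,
    'other theories': False,  # "the" only embedded inside words
    'lathe': False,           # embedded at the end of a word
}
for text, expected in tests.items():
    print(text, bool(re.search(the_re, text)) == expected)  # all True
```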

A more complex example

Exercise: Write a regular expression that will match “any PC with more than 500MHz and 32 Gb of disk space for less than $1000”:

/$[0-9]+/
/$[0-9]+\.[0-9][0-9]/
/\b$[0-9]+(\.[0-9][0-9])?\b/
/\b$[0-9][0-9]?[0-9]?(\.[0-9][0-9])?\b/
/\b[0-9]+ *([MG]Hz|[Mm]egahertz|[Gg]igahertz)\b/
/\b[0-9]+ *(Mb|[Mm]egabytes?)\b/
/\b[0-9](\.[0-9]+) *(Gb|[Gg]igabytes?)\b/

Advanced operators

Substitutions and Memory

Substitutions:
s/colour/color/
s/colour/color/g   (/g: substitute as many times as possible!)

Case-insensitive matching:
s/colour/color/i

Memory (\1, \2, etc. refer back to matches):
s/([0-9]+)/<\1>/
/the (.*)er they were, the \1er they will be/
/the (.*)er they (.*), the \1er they \2/
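The same substitutions in Python’s `re`, where Perl’s /g behavior is the default for `re.sub` and /i becomes a flag:

```python
import re

print(re.sub(r'colour', 'color', 'colour colour', count=1))  # 'color colour' (first only)
print(re.sub(r'colour', 'color', 'colour colour'))           # 'color color' (like Perl /g)

# Memory: \1 refers back to the first capture group
print(re.sub(r'([0-9]+)', r'<\1>', 'the 35 boxes'))          # 'the <35> boxes'

# Backreference inside a match pattern: both blanks must agree
back_pat = r'the (.*)er they were, the \1er they will be'
print(bool(re.search(back_pat, 'the bigger they were, the bigger they will be')))  # True
print(bool(re.search(back_pat, 'the bigger they were, the faster they will be'))) # False

# Case-insensitive matching (Perl's /i)
print(re.sub(r'colour', 'color', 'COLOUR', flags=re.IGNORECASE))  # 'color'
```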

Eliza [Weizenbaum, 1966]

User: Men are all alike

ELIZA: IN WHAT WAY

User: They’re always bugging us about something or other

ELIZA: CAN YOU THINK OF A SPECIFIC EXAMPLE?

User: Well, my boyfriend made me come here

ELIZA: YOUR BOYFRIEND MADE YOU COME HERE

User: He says I’m depressed much of the time

ELIZA: I AM SORRY TO HEAR THAT YOU ARE DEPRESSED

Eliza-style regular expressions

s/.* YOU ARE (depressed|sad) .*/I AM SORRY TO HEAR YOU ARE \1/

s/.* YOU ARE (depressed|sad) .*/WHY DO YOU THINK YOU ARE \1/

s/.* all .*/IN WHAT WAY/

s/.* always .*/CAN YOU THINK OF A SPECIFIC EXAMPLE/

Step 1: replace first-person references with second-person references
s/\bI(’m| am)\b/YOU ARE/g

s/\bmy\b/YOUR/g

s/\bmine\b/YOURS/g

Step 2: use additional regular expressions to generate replies

Step 3: use scores to rank possible transformations
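A toy end-to-end sketch of the three steps in Python. Step 3’s scoring is reduced here to trying the rules in a fixed order, the patterns are slightly relaxed from the slide so short inputs match, and the “PLEASE GO ON” fallback is my addition — all illustrative assumptions, not Weizenbaum’s actual rule set.

```python
import re

def eliza(sentence):
    # Step 1: flip first-person references to second person
    s = re.sub(r"\bI(’m| am)\b", "YOU ARE", sentence)
    s = re.sub(r"\bmy\b", "YOUR", s, flags=re.IGNORECASE)
    s = re.sub(r"\bmine\b", "YOURS", s, flags=re.IGNORECASE)
    # Step 2: pattern-based replies; Step 3 reduced to fixed rule order
    rules = [
        (r".*YOU ARE (depressed|sad).*", r"I AM SORRY TO HEAR YOU ARE \1"),
        (r".*all.*", "IN WHAT WAY"),
        (r".*always.*", "CAN YOU THINK OF A SPECIFIC EXAMPLE"),
    ]
    for pattern, reply in rules:
        if re.match(pattern, s):
            return re.sub(pattern, reply, s)
    return "PLEASE GO ON"

print(eliza("Men are all alike"))  # IN WHAT WAY
print(eliza("I am depressed"))     # I AM SORRY TO HEAR YOU ARE depressed
```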

Finite-state Automata

Finite-state automata (FSA)
Regular languages
Regular expressions

Finite-state Automata (Machines)

/baa+!/

[Figure: FSA for /baa+!/ — states q0 q1 q2 q3 q4; arcs q0 –b→ q1 –a→ q2 –a→ q3 –!→ q4, plus a self-loop “a” on q3; q4 is the final state]

baa! baaa! baaaa! baaaaa! ...

Input Tape

[Figure: tape “a b a ! b” — the machine starts in q0, has no transition on the first symbol “a”, and REJECTs]

Input Tape

[Figure: tape “b a a a !” — state sequence q0 q1 q2 q3 q3 q4; q4 is final: ACCEPT]

Finite-state Automata

Q: a finite set of N states
– Q = {q0, q1, q2, q3, q4}

Σ: a finite input alphabet of symbols
– Σ = {a, b, !}

q0: the start state

F: the set of final states
– F = {q4}

δ(q,i): transition function
– Given state q and input symbol i, return new state q′
– δ(q3,!) → q4

State-transition Tables

        Input
State   b   a   !
0       1   0   0
1       0   2   0
2       0   3   0
3       0   3   4
4:      0   0   0

(“:” marks the accept state; 0 means no transition)

D-RECOGNIZE

function D-RECOGNIZE(tape, machine) returns accept or reject
  index ← Beginning of tape
  current-state ← Initial state of machine
  loop
    if End of input has been reached then
      if current-state is an accept state then
        return accept
      else
        return reject
    elsif transition-table[current-state, tape[index]] is empty then
      return reject
    else
      current-state ← transition-table[current-state, tape[index]]
      index ← index + 1
  end
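D-RECOGNIZE translates almost line for line into Python, driven by the /baa+!/ state-transition table from the previous slide; missing dictionary entries play the role of the table’s empty cells:

```python
# Transition table for /baa+!/; empty cells are simply absent keys.
TABLE = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,   # self-loop: allows extra a's
    (3, "!"): 4,
}
ACCEPT_STATES = {4}

def d_recognize(tape):
    current_state = 0                   # initial state of the machine
    for symbol in tape:                 # index walks the tape left to right
        nxt = TABLE.get((current_state, symbol))
        if nxt is None:                 # empty table cell
            return "reject"
        current_state = nxt
    return "accept" if current_state in ACCEPT_STATES else "reject"

print(d_recognize("baa!"))    # accept
print(d_recognize("baaaa!"))  # accept
print(d_recognize("ba!"))     # reject
print(d_recognize("abab"))    # reject
```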

Adding a failing state

[Figure: the /baa+!/ FSA with an explicit fail state qF added — every previously missing transition from q0–q4 (wrong symbol b, a, or !) now leads to qF]

Adding an “all else” arc

[Figure: the same FSA with a single “all else” arc (labeled =) from each state to qF, standing for every symbol not covered by an explicit arc]

Languages and Automata

Can use FSA as a generator as well as a recognizer

Formal language L: defined by machine M that both generates and recognizes all and only the strings of that language. – L(M) = {baa!, baaa!, baaaa!, …}

Regular languages vs. non-regular languages

Languages and Automata

Deterministic vs. Non-deterministic FSAs

Epsilon (ε) transitions

Using NFSAs to accept strings

Backup: add markers at choice points, then possibly revisit unexplored arcs at the marked choice points.

Look-ahead: look ahead in the input.
Parallelism: look at alternatives in parallel.

Using NFSAs

        Input
State   b    a    !    ε
0       1    0    0    0
1       0    2    0    0
2       0    2,3  0    0
3       0    0    4    0
4       0    0    0    0
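The backup strategy amounts to a depth-first search over this table. A minimal Python sketch, where recursion plays the role of the choice-point markers (the ε column is unused here, since this table has no ε-transitions):

```python
# Non-deterministic table for /baa+!/: on "a" in state 2
# the machine may go to state 2 or state 3.
N_TABLE = {
    (0, "b"): [1],
    (1, "a"): [2],
    (2, "a"): [2, 3],
    (3, "!"): [4],
}
ACCEPT = {4}

def nd_recognize(tape, state=0):
    if not tape:
        return state in ACCEPT
    # Try each successor in turn; on failure, "back up" and take
    # the next unexplored arc (the recursion records choice points).
    for nxt in N_TABLE.get((state, tape[0]), []):
        if nd_recognize(tape[1:], nxt):
            return True
    return False

print(nd_recognize("baa!"))    # True
print(nd_recognize("baaaa!"))  # True
print(nd_recognize("baa"))     # False
```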

Readings for next time

Python Readings (see syllabus).