MT Book Ch. 7: Optimization

Ch. 7: Optimization

M2 Yuichiro Sawai

1 15/07/09

Overview


• MT decoding: $\langle \hat{e}, \hat{d} \rangle = \operatorname{argmax}_{\langle e, d \rangle} w^\top h(f, e, d)$

• Need to find w that assigns higher scores to better translations (e, d)
  •  Better translations = translations with lower error

f: source sentence, e: target sentence, d: derivation, w: weight vector, h(・): feature function
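A minimal sketch of decoding over a fixed candidate list, assuming each candidate already carries its feature vector h. The feature values and the helper names `score` and `decode_kbest` are hypothetical; a real decoder searches a much larger space.

```python
from typing import List, Sequence, Tuple

Candidate = Tuple[str, List[float]]  # (translation e, feature vector h)

def score(w: Sequence[float], h: Sequence[float]) -> float:
    """Linear model score w^T h."""
    return sum(wi * hi for wi, hi in zip(w, h))

def decode_kbest(w: Sequence[float], candidates: List[Candidate]) -> Candidate:
    """Return the candidate with the highest score under w."""
    return max(candidates, key=lambda c: score(w, c[1]))

# Toy candidates with made-up two-dimensional features.
candidates = [
    ("I saw black cat", [-1.0, 2.0]),
    ("see red dog", [2.0, -5.0]),
]
best = decode_kbest([0.5, 1.0], candidates)
```

Changing w changes which candidate wins, which is why the choice of w matters.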

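Loss Minimization
• Given parallel corpus (F, E), find w that minimizes loss function l(・) plus a regularization term: $\hat{w} = \operatorname{argmin}_w \; l(F, E; w) + \lambda \, \Omega(w)$
  •  e.g., l(F, E; w) = 1 – BLEU(E, decode_w(F))
  •  Ω(w) is the regularization term; λ is a regularization constant to avoid overfitting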

Problems to Consider
1.  Search space is vast
  •  impossible to consider all candidates
  •  the correct translation is rarely reachable

2.  Approximation of error function
  •  Error metrics (e.g. BLEU) are not differentiable
  •  Split corpus-level metrics into sentence level

3.  How to calculate argmin $w^\top h$


Batch Learning
• Given parallel corpus (F, E), initialize w and iteratively

1.  decode whole corpus F with current w, and get k-best lists C
2.  optimize w
3.  loop until convergence

•  vs. online learning
  •  optimize w per sentence
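The three batch steps can be sketched as a loop. This is only an outline under stated assumptions: `decode_kbest` and `optimize` are hypothetical callables supplied by the caller, and "convergence" is taken to mean that w stops changing by more than `tol`.

```python
from typing import Callable, List, Sequence

Weights = List[float]

def batch_learn(
    sources: Sequence[str],
    w: Weights,
    decode_kbest: Callable[[str, Weights], list],
    optimize: Callable[[Weights, List[list]], Weights],
    max_iters: int = 20,
    tol: float = 1e-6,
) -> Weights:
    for _ in range(max_iters):
        # 1. Decode the whole corpus with the current w to get k-best lists C.
        kbest_lists = [decode_kbest(f, w) for f in sources]
        # 2. Optimize w on those k-best lists.
        new_w = optimize(w, kbest_lists)
        # 3. Loop until w no longer changes (assumed convergence test).
        if max(abs(a - b) for a, b in zip(new_w, w)) < tol:
            return new_w
        w = new_w
    return w
```

An online learner would instead call the update inside the loop over sentences, once per sentence.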


Minimum Error Rate Training (MERT)
• Given error function error(E, Ê), directly minimize it
  •  E: reference translations, Ê: system translations
  •  e.g. error(E, Ê) = 1 – BLEU(E, Ê)

•  In other words, $\hat{w} = \operatorname{argmin}_w \mathrm{error}(E, \hat{E}(w))$, where Ê(w) is the set of 1-best translations decoded with w

•  Since error(・) is not differentiable w.r.t. w, gradient-based methods are not applicable
  •  Instead, use Powell's method
  •  gradients not required


Powell's Method
•  Iteratively fix a direction and find the optimal w in that direction
•  Applicable when gradients are not available


[Figure: points w0, w1, w2, w3 plotted against axes x1 and x2]
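A simplified sketch of gradient-free search along one direction at a time. It is not full Powell's method: it assumes the directions are the fixed coordinate axes and uses a coarse grid as the line search, whereas Powell's method also updates its direction set. The function names and the grid are illustrative choices.

```python
from typing import Callable, List, Sequence, Tuple

Objective = Callable[[Sequence[float]], float]

def line_search(f: Objective, w: List[float], direction: List[float],
                steps: List[float]) -> Tuple[List[float], float]:
    """Return the best point w + gamma * direction over a fixed grid of gammas."""
    best_w, best_val = list(w), f(w)
    for gamma in steps:
        cand = [wi + gamma * di for wi, di in zip(w, direction)]
        val = f(cand)
        if val < best_val:
            best_w, best_val = cand, val
    return best_w, best_val

def direction_search(f: Objective, w0: Sequence[float], n_rounds: int = 10) -> List[float]:
    """Minimize f without gradients by cycling through the coordinate axes."""
    steps = [s / 10.0 for s in range(-50, 51)]  # gamma grid from -5.0 to 5.0
    w = list(w0)
    for _ in range(n_rounds):
        prev = f(w)
        for m in range(len(w)):
            direction = [1.0 if j == m else 0.0 for j in range(len(w))]
            w, _ = line_search(f, w, direction, steps)
        if f(w) >= prev:  # no improvement in a full round: stop
            break
    return w
```

Because only function values are used, the same loop works for a non-differentiable objective such as 1 – BLEU.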

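Optimization in One Direction
•  1-best translation parameterized by scalar γ: $\hat{e}(\gamma) = \operatorname{argmax}_e \, (w + \gamma b_m)^\top h(f, e, d) = \operatorname{argmax}_e \, \underbrace{w^\top h(f, e, d)}_{\text{intercept}} + \gamma \, \underbrace{h_m(f, e, d)}_{\text{slope}}$

b_m: one-hot vector with mth dim = 1

[Figure, top: each candidate c1 to c4 is a line $w^\top h + \gamma h_m$ over γ; candidates with the highest score are selected, forming the envelope. Bottom: error over γ for the selected candidates c1, c3, c4]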

e.g.)
f = 黒い猫を見た
e = I saw a black cat
c1 = I saw black cat
c2 = saw a black cat
…
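A sketch of the one-direction search for a single sentence, assuming every candidate is summarized as a line (intercept, slope) plus its error. Instead of the usual sweep over the envelope, it takes a brute-force shortcut: it probes one γ inside every interval between line intersections. The intercepts, slopes, and errors below are made-up toy numbers.

```python
from typing import List, Tuple

Line = Tuple[float, float, float]  # (intercept w^T h, slope h_m, error)

def breakpoints(lines: List[Line]) -> List[float]:
    """All gammas where two candidate lines intersect."""
    pts = set()
    for i, (a1, b1, _) in enumerate(lines):
        for a2, b2, _ in lines[i + 1:]:
            if b1 != b2:  # parallel lines never cross
                pts.add((a2 - a1) / (b1 - b2))
    return sorted(pts)

def probe_points(pts: List[float]) -> List[float]:
    """One gamma inside every interval delimited by the breakpoints."""
    if not pts:
        return [0.0]
    mids = [(lo + hi) / 2.0 for lo, hi in zip(pts, pts[1:])]
    return [pts[0] - 1.0] + mids + [pts[-1] + 1.0]

def best_gamma(lines: List[Line]) -> Tuple[float, float]:
    """Return (gamma, error) where the 1-best candidate has the lowest error."""
    best = None
    for gamma in probe_points(breakpoints(lines)):
        top = max(lines, key=lambda l: l[0] + gamma * l[1])  # on the envelope
        if best is None or top[2] < best[1]:
            best = (gamma, top[2])
    return best

# Toy lines for c1..c4; c2 never reaches the envelope.
lines = [(2.0, -1.0, 0.6), (0.5, 0.0, 0.5), (1.5, 0.0, 0.2), (0.0, 1.0, 0.4)]
gamma, err = best_gamma(lines)
```

The error can only change at an intersection, so checking one γ per interval is enough to see every value the error takes.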

Corpus-level Error
•  Sentence-level losses are summed to get corpus-level error


[Figure: sentence-level envelopes and sentence-level errors for sentence 1 and sentence 2 are added to give the multi-sentence error over γ]

γ*: find the γ that minimizes overall error!
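A sketch of the multi-sentence case, assuming the corpus-level error is a plain sum of sentence-level errors as stated above. Each sentence is a list of (intercept, slope, error) lines; all values are hypothetical toy numbers, and the search again probes one γ per interval between intersections.

```python
from typing import List, Tuple

Line = Tuple[float, float, float]  # (intercept, slope, sentence-level error)

def one_best_error(lines: List[Line], gamma: float) -> float:
    """Error of the highest-scoring candidate of one sentence at this gamma."""
    return max(lines, key=lambda l: l[0] + gamma * l[1])[2]

def corpus_best_gamma(corpus: List[List[Line]]) -> Tuple[float, float]:
    """Return (gamma*, summed error) over all sentences."""
    pts = set()
    for lines in corpus:
        for i, (a1, b1, _) in enumerate(lines):
            for a2, b2, _ in lines[i + 1:]:
                if b1 != b2:
                    pts.add((a2 - a1) / (b1 - b2))
    ordered = sorted(pts)
    if ordered:
        mids = [(lo + hi) / 2.0 for lo, hi in zip(ordered, ordered[1:])]
        probes = [ordered[0] - 1.0] + mids + [ordered[-1] + 1.0]
    else:
        probes = [0.0]
    totals = [(sum(one_best_error(lines, g) for lines in corpus), g) for g in probes]
    total, gamma = min(totals)
    return gamma, total

corpus = [
    [(1.0, -1.0, 0.5), (0.0, 1.0, 0.1)],   # sentence 1: switches at gamma = 0.5
    [(0.0, -1.0, 0.2), (-2.0, 1.0, 0.7)],  # sentence 2: switches at gamma = 1.0
]
gamma_star, total_error = corpus_best_gamma(corpus)
```

Here the best γ lies between the two switch points, where sentence 1 has already moved to its low-error candidate and sentence 2 has not yet left its own.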

Problems of Powell's Method
•  Sensitive to initialization of w
•  Not suitable for high-dimensional feature vectors


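Softmax Loss
•  Translation probability: $p(e, d \mid f; w) = \dfrac{\exp(w^\top h(f, e, d))}{\sum_{\langle e', d' \rangle} \exp(w^\top h(f, e', d'))}$

•  Loss is the negative log-likelihood of the oracle translations, where the oracle translations are the candidates with the lowest error

• Gradient-based methods (e.g. L-BFGS) are applicable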
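A sketch of the softmax loss and its gradient for one k-best list, assuming the sum runs over the k-best candidates only and that the indices of the oracle candidates are given. The function names are illustrative; the returned pair is what a gradient-based optimizer such as L-BFGS would consume.

```python
import math
from typing import List, Sequence, Tuple

def dot(w: Sequence[float], h: Sequence[float]) -> float:
    return sum(wi * hi for wi, hi in zip(w, h))

def log_sum_exp(xs: Sequence[float]) -> float:
    """Numerically stable log(sum(exp(x)))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def softmax_loss_and_grad(
    w: Sequence[float], feats: List[Sequence[float]], oracle_idx: List[int]
) -> Tuple[float, List[float]]:
    """Negative log-likelihood of the oracle candidates and its gradient in w."""
    scores = [dot(w, h) for h in feats]
    log_z = log_sum_exp(scores)
    log_oracle = log_sum_exp([scores[i] for i in oracle_idx])
    loss = log_z - log_oracle
    grad = [0.0] * len(w)
    for i, h in enumerate(feats):
        p = math.exp(scores[i] - log_z)  # probability under the model
        # probability among oracles only (zero for non-oracle candidates)
        q = math.exp(scores[i] - log_oracle) if i in oracle_idx else 0.0
        for j, hj in enumerate(h):
            grad[j] += (p - q) * hj  # E_model[h] - E_oracle[h]
    return loss, grad
```

The gradient is the gap between the expected features under the model and under the oracle set, so it vanishes once the model puts all its mass on the oracles.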


Max Margin Loss


• Make sure distances between correct translations and incorrect translations are large

•  For example: for all oracle and non-oracle pairs …, penalize when the difference in error is greater than the difference in score

• Optimization methods for SVM are applicable (e.g. SMO)

f: 黒い猫を見た, e (correct): I saw a black cat

                               error   score (= $w^\top h$)
e* (oracle)  I saw black cat   0.1     0.4
e (system)   see red dog       0.9     0.3

Difference in error is large but difference in score is small: bad!
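A sketch of one possible margin loss matching the description above, assuming a hinge of the form max(0, difference in error minus difference in score) summed over oracle and non-oracle pairs. The exact form on the original slide is not recoverable, so treat this formulation as an assumption. The numbers are the ones from the example table.

```python
from typing import List, Tuple

Cand = Tuple[float, float]  # (error, score = w^T h)

def pair_hinge(oracle: Cand, other: Cand) -> float:
    """Penalty when the error gap exceeds the score gap for one pair."""
    err_diff = other[0] - oracle[0]    # how much worse the non-oracle is
    score_diff = oracle[1] - other[1]  # how much higher the oracle scores
    return max(0.0, err_diff - score_diff)

def max_margin_loss(oracles: List[Cand], others: List[Cand]) -> float:
    """Sum the hinge penalty over all oracle / non-oracle pairs."""
    return sum(pair_hinge(o, c) for o in oracles for c in others)

oracle = (0.1, 0.4)  # e* (oracle): I saw black cat
system = (0.9, 0.3)  # e (system): see red dog
loss = pair_hinge(oracle, system)  # error gap 0.8 vs score gap 0.1
```

With these values the penalty is positive, which matches the "bad!" annotation: the scores barely separate two candidates whose errors differ a lot.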

Pairwise Ranking Optimization (PRO)
•  Parameter estimation as ranking problem

•  Classifier learns w to rank candidates by error
• Generate training examples from pairs of candidates
  •  positive example: h(cand1) – h(cand2) = (-4, 6)
  •  negative example: h(cand3) – h(cand1) = (3, -7)
  •  $w^\top \{h(\text{cand1}) - h(\text{cand2})\} > 0 \Leftrightarrow w^\top h(\text{cand1}) > w^\top h(\text{cand2})$

• Off-the-shelf linear binary classifiers can be used


f: 黒い猫を見た, e (correct): I saw a black cat

                              error   h          score (= $w^\top h$)
e (cand1)  I see black cat    0.3     (-1, 2)    ???
e (cand2)  see black dog      0.7     (3, -4)    ???
e (cand3)  see red dog        0.9     (2, -5)    ???
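A sketch of PRO on the three candidates in the table, assuming every pair with different errors is used (practical setups usually sample pairs). A plain perceptron stands in for the "off-the-shelf linear binary classifier"; that choice and the function names are illustrative.

```python
from itertools import combinations
from typing import List, Sequence, Tuple

Cand = Tuple[float, Tuple[float, ...]]       # (error, feature vector h)
Example = Tuple[Tuple[float, ...], int]      # (difference vector, label)

def pairwise_examples(cands: Sequence[Cand]) -> List[Example]:
    """Turn each candidate pair into a positive and a negative example."""
    examples: List[Example] = []
    for (err_a, h_a), (err_b, h_b) in combinations(cands, 2):
        if err_a == err_b:
            continue  # no preference between equally good candidates
        diff = tuple(a - b for a, b in zip(h_a, h_b))
        label = 1 if err_a < err_b else -1  # +1 when the first is better
        examples.append((diff, label))
        examples.append((tuple(-d for d in diff), -label))
    return examples

def train_perceptron(examples: List[Example], dim: int, epochs: int = 10) -> List[float]:
    """Stand-in linear binary classifier over difference vectors."""
    w = [0.0] * dim
    for _ in range(epochs):
        for x, y in examples:
            if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
    return w

cands = [(0.3, (-1.0, 2.0)), (0.7, (3.0, -4.0)), (0.9, (2.0, -5.0))]
examples = pairwise_examples(cands)
w = train_perceptron(examples, dim=2)
scores = [sum(wi * hi for wi, hi in zip(w, h)) for _, h in cands]
```

On this toy data the learned w scores cand1 above cand2 above cand3, the same order as their errors.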

Minimum Bayes Risk


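• Minimize the expected loss (error) under the candidate distribution $p_\gamma$

where $p_\gamma(e, d \mid f) \propto \exp(\gamma \, w^\top h(f, e, d))$
  •  γ = 0: all candidates are equally likely
  •  γ = 1: softmax
  •  γ→∞: highest scoring candidate with probability 1 (MERT)

• Differentiable and considers many candidates <e,d>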
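A sketch of the expected loss for one sentence, assuming the candidate distribution is proportional to exp(γ · score) over a fixed candidate list. The scores and errors are toy values, and the function names are illustrative.

```python
import math
from typing import List, Sequence

def scaled_probs(scores: Sequence[float], gamma: float) -> List[float]:
    """p_gamma(c) proportional to exp(gamma * score(c)), computed stably."""
    m = max(gamma * s for s in scores)
    exps = [math.exp(gamma * s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def expected_loss(scores: Sequence[float], errors: Sequence[float], gamma: float) -> float:
    """Expected error of the candidates under p_gamma."""
    return sum(p * e for p, e in zip(scaled_probs(scores, gamma), errors))

scores = [0.4, 0.3]
errors = [0.1, 0.9]
uniform = expected_loss(scores, errors, 0.0)   # gamma = 0: plain average of errors
sharp = expected_loss(scores, errors, 1000.0)  # large gamma: close to the 1-best error
```

As γ grows, the expectation collapses onto the error of the highest-scoring candidate, which is the quantity MERT minimizes.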

Sentence-‐level BLEU

•  Sentence-level error functions are needed for optimization
  •  BLEU is a corpus-level metric

•  4-gram precision is often 0 on sentence level
  •  diverges from human judgments

•  Sentence-level error
  •  Linear BLEU
  •  (Expected BLEU)
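A small sketch of why plain BLEU breaks down on a single sentence. It computes clipped n-gram matches for one hypothesis against one reference, using the example sentences from the MERT slide. Whitespace tokenization and the helper names are assumptions, and the brevity penalty is left out.

```python
from collections import Counter
from typing import List, Tuple

def ngrams(tokens: List[str], n: int) -> Counter:
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_precisions(hyp: str, ref: str, max_n: int = 4) -> List[Tuple[int, int]]:
    """Return (matched, total) n-gram counts for n = 1..max_n."""
    hyp_toks, ref_toks = hyp.split(), ref.split()
    result = []
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp_toks, n), ngrams(ref_toks, n)
        # Clip each hypothesis n-gram count by its count in the reference.
        matched = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        result.append((matched, sum(hyp_counts.values())))
    return result

precisions = ngram_precisions("I saw black cat", "I saw a black cat")
```

Here no 3-gram or 4-gram matches, and since BLEU is a geometric mean of the n-gram precisions, the unsmoothed sentence-level score is 0 even though the hypothesis is nearly correct.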


Linear BLEU

•  Linear approximation of change in BLEU

c: sum of sentence lengths
m_n: # matched n-grams

•  Add one sentence: (c, m_n) -> (c', m_n')

•  Linear BLEU error of candidate e


[Figure: log BLEU at (c, m_n) and (c', m'_n), with Δ the change between them. Annotation: # matched n-grams in e]
