
Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 175–186, Brussels, Belgium, November 1, 2018. © 2018 Association for Computational Linguistics.


An Operation Sequence Model for Explainable Neural Machine Translation

Felix Stahlberg and Danielle Saunders and Bill Byrne
Department of Engineering
University of Cambridge, UK
{fs439,ds636,wjb31}@cam.ac.uk

Abstract

We propose to achieve explainable neural machine translation (NMT) by changing the output representation to explain itself. We present a novel approach to NMT which generates the target sentence by monotonically walking through the source sentence. Word reordering is modeled by operations which allow setting markers in the target sentence and moving a target-side write head between those markers. In contrast to many modern neural models, our system emits explicit word alignment information which is often crucial to practical machine translation as it improves explainability. Our technique can outperform a plain text system in terms of BLEU score under the recent Transformer architecture on Japanese-English and Portuguese-English, and is within 0.5 BLEU difference on Spanish-English.

1 Introduction

Neural machine translation (NMT) models (Sutskever et al., 2014; Bahdanau et al., 2015; Gehring et al., 2017; Vaswani et al., 2017) are remarkably effective in modelling the distribution over target sentences conditioned on the source sentence, and yield superior translation performance compared to traditional statistical machine translation (SMT) on many language pairs. However, it is often difficult to extract a comprehensible explanation for the predictions of these models as information in the network is represented by real-valued vectors or matrices (Ding et al., 2017). In contrast, the translation process in SMT is ‘transparent’ as it can identify the source word which caused a target word through word alignment. Most NMT models do not use the concept of word alignment. It is tempting to interpret encoder-decoder attention matrices (Bahdanau et al., 2015) in neural models as (soft) alignments, but previous work has found that the attention weights in NMT are often erratic (Cheng et al., 2016) and differ significantly from traditional word alignments (Koehn and Knowles, 2017; Ghader and Monz, 2017). We will discuss the difference between attention and alignment in detail in Sec. 4.

The goal of this paper is explainable NMT by developing a transparent translation process for neural models. Our approach does not change the neural architecture, but represents the translation together with its alignment as a linear sequence of operations. The neural model predicts this operation sequence, and thus simultaneously generates a translation and an explanation for it in terms of alignments from the target words to the source words that generate them. The operation sequence is “self-explanatory”; it does not explain an underlying NMT system but is rather a single representation produced by the NMT system that can be used to generate translations along with an accompanying explanatory alignment to the source sentence. We report competitive results of our method on Spanish-English, Portuguese-English, and Japanese-English, with the benefit of producing hard alignments for better interpretability. We discuss the theoretical connection between our approach and hierarchical SMT (Chiang, 2005) by showing that an operation sequence can be seen as a derivation in a formal grammar.

2 A Neural Operation Sequence Model

Our operation sequence neural machine translation (OSNMT) model is inspired by the operation sequence model for SMT (Durrani et al., 2011), but changes the set of operations to be more appropriate for neural sequence models. OSNMT is not restricted to a particular architecture, i.e. any seq2seq model such as RNN-based, convolutional, or self-attention-based models (Bahdanau et al., 2015; Vaswani et al., 2017; Gehring et al., 2017) could be used. In this paper, we use the recent Transformer model architecture (Vaswani et al., 2017) in all experiments.

In OSNMT, the neural seq2seq model learns to produce a sequence of operations. An OSNMT operation sequence describes a translation (the ‘compiled’ target sentence) and explains each target token with a hard link into the source sentence. OSNMT keeps track of the positions of a source-side read head and a target-side write head. The read head monotonically walks through the source sentence, whereas the position of the write head can be moved from marker to marker in the target sentence. OSNMT defines the following operations to control head positions and produce output words.

• POP SRC: Move the read head right by one token.

• SET MARKER: Insert a marker symbol into the target sentence at the position of the write head.

• JMP FWD: Move the write head to the nearest marker right of the current head position in the target sentence.

• JMP BWD: Move the write head to the nearest marker left of the current head position in the target sentence.

• INSERT(t): Insert a target token t into the target sentence at the position of the write head.

Tab. 1 illustrates the generation of a Japanese-English translation in detail. The neural seq2seq model is trained to produce the sequence of operations in the first column of Tab. 1. The initial state of the target sentence is a single marker symbol X1. Generative operations like SET MARKER or INSERT(t) insert a single symbol left of the current marker (marked with square brackets in Tab. 1). The model begins with a SET MARKER operation, which indicates that the translation of the first word in the source sentence is not at the beginning of the target sentence. Indeed, after “translating” the identities ‘2000’ and ‘hr’, in time step 6 the model jumps back to the marker X2 and continues writing left of ‘2000’. The translation process terminates when the read head is at the end of the source sentence. The final translation in plain text can be obtained by removing all markers from the (compiled) target sentence.

2.1 OSNMT Represents Alignments

The word alignment can be derived from the operation sequence by looking up the position of the read head for each generated target token. The alignment for the example in Tab. 1 is shown in Fig. 1. Note that similarly to the IBM models (Brown et al., 1993) and the OSM for SMT (Durrani et al., 2011), our OSNMT can only represent 1:n alignments. Thus, each target token is aligned to exactly one source token, but a source token can generate any number of (possibly non-consecutive) target tokens.
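To make the operation semantics concrete, the following minimal Python sketch (illustrative only, not the code used in our experiments) executes an operation sequence: markers are represented as integers and inserted tokens as strings tagged with the current read head position, so that both the plain translation and the 1:n alignment of this section fall out of a single pass.

```python
from typing import List, Tuple

SRC_POP, SET_MARKER, JMP_FWD, JMP_BWD = "SRC_POP", "SET_MARKER", "JMP_FWD", "JMP_BWD"


def execute_osnmt(ops: List[str]) -> Tuple[List[str], List[Tuple[int, int]]]:
    """Execute an operation sequence; return the compiled target tokens
    (markers removed) and the alignment as 1-based (target, source) links."""
    sent = [1]               # target state: ints are markers, X1 is present initially
    head = 0                 # index of the marker the write head currently sits at
    n_markers, src = 1, 1    # marker counter and source-side read head position
    for op in ops:
        if op == SRC_POP:                       # read head moves one source token right
            src += 1
        elif op == SET_MARKER:                  # new marker appears left of the write head
            n_markers += 1
            sent.insert(head, n_markers)
            head += 1
        elif op == JMP_FWD:                     # nearest marker to the right
            head = next(k for k in range(head + 1, len(sent)) if isinstance(sent[k], int))
        elif op == JMP_BWD:                     # nearest marker to the left
            head = next(k for k in range(head - 1, -1, -1) if isinstance(sent[k], int))
        else:                                   # INSERT(t): token goes left of the write head,
            sent.insert(head, (op, src))        # tagged with the current read head position
            head += 1
    written = [w for w in sent if not isinstance(w, int)]   # drop the markers
    tokens = [t for t, _ in written]
    links = [(j + 1, s) for j, (_, s) in enumerate(written)]
    return tokens, links
```

Applied to the operation sequence in the first column of Tab. 1 (with POP SRC written as SRC_POP), this returns the tokens “stable operation of 2000 hr was confirmed” together with the alignment links shown in Fig. 1.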

2.2 OSNMT Represents Hierarchical Structure

We can also derive a tree structure from the operation sequence in Tab. 1 (Fig. 2) in which each marker is represented by a nonterminal node with outgoing arcs to symbols inserted at that marker. The target sentence can be read off the tree by depth-first search traversal (post-order).
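The tree of Fig. 2 can be recovered with the same kind of execution loop: every marker collects the symbols inserted at it, and the markers are additionally kept in left-to-right target order so that the jump operations can be resolved. A sketch under the same assumptions as above:

```python
SRC_POP, SET_MARKER, JMP_FWD, JMP_BWD = "SRC_POP", "SET_MARKER", "JMP_FWD", "JMP_BWD"


def osnmt_tree(ops):
    """Map each marker id to the symbols (tokens or child marker ids)
    inserted at it, in insertion order; marker 1 (X1) is the root."""
    children = {1: []}
    markers = [1]        # markers in left-to-right target order
    head = 0             # index into `markers` of the write head
    for op in ops:
        if op == SRC_POP:
            continue                                   # does not touch the target side
        if op == SET_MARKER:
            new_id = len(children) + 1
            children[new_id] = []
            children[markers[head]].append(new_id)     # new marker is inserted at the head marker
            markers.insert(head, new_id)               # ...and sits directly to its left
            head += 1
        elif op == JMP_FWD:
            head += 1
        elif op == JMP_BWD:
            head -= 1
        else:                                          # INSERT(t)
            children[markers[head]].append(op)
    return children


def read_off(children, node=1):
    """Depth-first readout: the terminals, read left to right, give the target sentence."""
    out = []
    for c in children[node]:
        out.extend(read_off(children, c) if c in children else [c])
    return out
```

For the sequence in Tab. 1 this yields X1 → (X2, 2000, hr, was, confirmed), X2 → (X3, of) and X3 → (stable, operation), i.e. the tree in Fig. 2, and read_off reproduces the target sentence.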

More formally, synchronous context-free grammars (SCFGs) generate pairs of strings by pairing two context-free grammars. Phrase-based hierarchical SMT (Chiang, 2005) uses SCFGs to model the relation between the source sentence and the target sentence. Multitext grammars (MTGs) are a generalization of SCFGs to more than two output streams (Melamed, 2003; Melamed et al., 2004). We find that an OSNMT sequence can be interpreted as a sequence of rules of a tertiary MTG G which generates 1.) the source sentence, 2.) the target sentence, and 3.) the position of the target-side write head. The start symbol of G is

[(S), (X1), (P1)]^T   (1)

which initializes the source sentence stream with a single nonterminal S, the target sentence with the initial marker X1, and the position of the write head with 1 (P1). Following Melamed et al. (2004) we denote rules in G as

[(α1), (α2), (α3)]^T → [(β1), (β2), (β3)]^T   (2)

where α1, α2, α3 are single nonterminals or empty, β1, β2, β3 are strings of terminals and nonterminals, and αi → βi for all i ∈ {1, 2, 3} with nonempty αi are the rewriting rules for each of the three individual components, which need to be applied simultaneously.


Operation | Source sentence | Target sentence (compiled)
(initial state) | [2000] hr の 安定 動作 を 確認 し た | [X1]
1 SET MARKER | [2000] hr の 安定 動作 を 確認 し た | X2 [X1]
2 2000 | [2000] hr の 安定 動作 を 確認 し た | X2 2000 [X1]
3 POP SRC | 2000 [hr] の 安定 動作 を 確認 し た | X2 2000 [X1]
4 hr | 2000 [hr] の 安定 動作 を 確認 し た | X2 2000 hr [X1]
5 POP SRC | 2000 hr [の] 安定 動作 を 確認 し た | X2 2000 hr [X1]
6 JMP BWD | 2000 hr [の] 安定 動作 を 確認 し た | [X2] 2000 hr X1
7 SET MARKER | 2000 hr [の] 安定 動作 を 確認 し た | X3 [X2] 2000 hr X1
8 of | 2000 hr [の] 安定 動作 を 確認 し た | X3 of [X2] 2000 hr X1
9 POP SRC | 2000 hr の [安定] 動作 を 確認 し た | X3 of [X2] 2000 hr X1
10 JMP BWD | 2000 hr の [安定] 動作 を 確認 し た | [X3] of X2 2000 hr X1
11 stable | 2000 hr の [安定] 動作 を 確認 し た | stable [X3] of X2 2000 hr X1
12 POP SRC | 2000 hr の 安定 [動作] を 確認 し た | stable [X3] of X2 2000 hr X1
13 operation | 2000 hr の 安定 [動作] を 確認 し た | stable operation [X3] of X2 2000 hr X1
14 POP SRC | 2000 hr の 安定 動作 [を] 確認 し た | stable operation [X3] of X2 2000 hr X1
15 JMP FWD | 2000 hr の 安定 動作 [を] 確認 し た | stable operation X3 of [X2] 2000 hr X1
16 JMP FWD | 2000 hr の 安定 動作 [を] 確認 し た | stable operation X3 of X2 2000 hr [X1]
17 was | 2000 hr の 安定 動作 [を] 確認 し た | stable operation X3 of X2 2000 hr was [X1]
18 POP SRC | 2000 hr の 安定 動作 を [確認] し た | stable operation X3 of X2 2000 hr was [X1]
19 POP SRC | 2000 hr の 安定 動作 を 確認 [し] た | stable operation X3 of X2 2000 hr was [X1]
20 confirmed | 2000 hr の 安定 動作 を 確認 [し] た | stable operation X3 of X2 2000 hr was confirmed [X1]
21 POP SRC | 2000 hr の 安定 動作 を 確認 し [た] | stable operation X3 of X2 2000 hr was confirmed [X1]

Table 1: Generation of the target sentence “stable operation of 2000 hr was confirmed” from the source sentence “2000 hr の 安定 動作 を 確認 し た”. The neural model produces the linear sequence of operations in the first column. The positions of the source-side read head and the target-side write head are marked with square brackets. The marker in the target sentence produced by the i-th SET MARKER operation is denoted with Xi+1; X1 is the initial marker. We denote INSERT(t) operations as t to simplify notation.

Figure 1: The translation and the alignment derived from the operation sequence in Tab. 1.

POP SRC extends the source sentence prefix in the first stream by one token:

POP SRC: ∀s ∈ Vsrc: [(S), (), ()]^T → [(s S), (), ()]^T   (3)

where Vsrc is the source language vocabulary. A jump from marker Xi to Xj is realized by replacing Pi with Pj in the third grammar component:

JMP: ∀i, j ∈ N: [(), (), (Pi)]^T → [(), (), (i Pj)]^T   (4)

where N = {k ∈ ℕ | k ≤ n} is the set of the first n natural numbers for a sufficiently large n. The generative operations (SET MARKER and INSERT(t)) insert symbols into the second component.

Figure 2: Target-side tree representation of the operation sequence in Tab. 1.


Derivation | OSNMT
[(S), (X1), (P1)]^T |
→ (Eq. 3) [(2000 S), (X1), (P1)]^T | SET MARKER
→ (Eq. 5) [(2000 S), (X2 X1), (P1)]^T | 2000
→ (Eq. 6) [(2000 S), (X2 2000 X1), (P1)]^T | POP SRC
→ (Eq. 3) [(2000 hr S), (X2 2000 X1), (P1)]^T | hr
→ (Eq. 6) [(2000 hr S), (X2 2000 hr X1), (P1)]^T | POP SRC
→ (Eq. 3) [(2000 hr の S), (X2 2000 hr X1), (P1)]^T | JMP BWD
→ (Eq. 4) [(2000 hr の S), (X2 2000 hr X1), (1 P2)]^T | SET MARKER
→ (Eq. 5) [(2000 hr の S), (X3 X2 2000 hr X1), (1 P2)]^T | of
→ (Eq. 6) [(2000 hr の S), (X3 of X2 2000 hr X1), (1 P2)]^T |
... | ...

Table 2: Derivation in G for the example of Tab. 1.

SET MARKER: ∀i ∈ N: [(), (Xi), (Pi)]^T → [(), (Xi+1 Xi), (Pi)]^T   (5)

INSERT: ∀i ∈ N, t ∈ Vtrg: [(), (Xi), (Pi)]^T → [(), (t Xi), (Pi)]^T   (6)

where Vtrg is the target language vocabulary. The identity mapping Pi → Pi in the third component enforces that the write head is at marker Xi. We note that G is not only context-free but also regular in the first and third components (but not in the second component due to Eq. 5). Rules of the form in Eq. 6 are directly related to alignment links (cf. Fig. 1) as they represent the fact that target token t is aligned to the last terminal symbol in the first stream. We formalize removing markers/nonterminals at the end by introducing a special nonterminal T which is eventually mapped to the end-of-sentence symbol EOS:

[(S), (), ()]^T → [(T), (), ()]^T   (7)
[(T), (), ()]^T → [(EOS), (), ()]^T   (8)
∀i ∈ N: [(T), (Xi), ()]^T → [(T), (ε), ()]^T   (9)
∀i ∈ N: [(T), (), (Pi)]^T → [(T), (), (ε)]^T   (10)

Tab. 2 illustrates that there is a 1:1 correspondence between a derivation in G and an OSNMT operation sequence. The target-side derivation (the second component in G) is structurally similar to a binarized version of the tree in Fig. 2. However, we assign scores to the structure via the corresponding OSNMT sequence which does not need to obey the usual conditional independence assumptions in hierarchical SMT. Therefore, even though G is context-free in the second component, our scoring model for G is more powerful as it conditions on the OSNMT history which potentially contains context information. Note that OSNMT is deficient (Brown et al., 1993) as it assigns non-zero probability mass to any operation sequence, not only those with a derivation in G.

We further note that subword-based OSNMT can potentially represent any alignment to any target sentence as long as the alignment does not violate the 1:n restriction. This is in contrast to phrase-based SMT where reference translations often do not have a derivation in the SMT system due to coverage problems (Auli et al., 2009).

2.3 Comparison to the OSM for SMT

Our OSNMT set of operations (POP SRC, SET MARKER, JMP FWD, JMP BWD, and INSERT(t)) is inspired by the original OSM for SMT (Durrani et al., 2011) as it also represents the translation process as a linear sequence of operations. However, there are significant differences which make OSNMT more suitable for neural models. First, OSNMT is monotone on the source side, and allows jumps on the target side. SMT-OSM operations jump in the source sentence. We argue that source side monotonicity potentially mitigates coverage issues of neural models (over- and under-translation (Tu et al., 2016)) as the attention can learn to scan the source sentence from left to right. Another major difference is that we use markers rather than gaps, and do not close a gap/marker after jumping to it. This is an implication of OSNMT jumps being defined on the target side since the size of a span is unknown at inference time.


Algorithm 1 Align2OSNMT(a, x, y)

holes ← ⟨(0, ∞)⟩
ops ← ⟨⟩    {Initialize with empty list}
head ← 0
for i ← 1 to |x| do
    for all j ∈ {j | aj = i} do
        hole_idx ← holes.find(j)
        d ← hole_idx − head
        if d < 0 then
            ops.extend(JMP BWD.repeat(−d))
        end if
        if d > 0 then
            ops.extend(JMP FWD.repeat(d))
        end if
        head ← hole_idx
        (s, t) ← holes[head]
        if s ≠ j then
            holes.insert(head, (s, j − 1))
            head ← head + 1
            ops.append(SET MARKER)
        end if
        ops.append(yj)
        holes[head] ← (j + 1, t)
    end for
    ops.append(SRC POP)
end for
return ops
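For illustration, the following Python sketch mirrors Alg. 1 (it is not the released implementation linked in Sec. 3). It assumes 1-based positions, that every target position carries exactly one alignment link, and it keeps the list of unwritten target spans in left-to-right order so that the number of jump operations is simply an index difference.

```python
from typing import Dict, List

SRC_POP, SET_MARKER, JMP_FWD, JMP_BWD = "SRC_POP", "SET_MARKER", "JMP_FWD", "JMP_BWD"


def align2osnmt(a: Dict[int, int], x: List[str], y: List[str]) -> List[str]:
    """a[j] = i links target position j to source position i (both 1-based)."""
    holes = [(1, len(y))]        # unwritten target spans, one per marker, left to right
    ops: List[str] = []
    head = 0                     # index of the span the write head currently sits in
    for i in range(1, len(x) + 1):
        for j in sorted(p for p, q in a.items() if q == i):
            hole_idx = next(k for k, (s, t) in enumerate(holes) if s <= j <= t)
            d = hole_idx - head
            ops.extend([JMP_BWD] * -d if d < 0 else [JMP_FWD] * d)
            head = hole_idx
            s, t = holes[head]
            if s != j:                           # an unwritten span remains to the left:
                holes.insert(head, (s, j - 1))   # reserve it with a new marker
                head += 1
                ops.append(SET_MARKER)
            ops.append(y[j - 1])                 # the INSERT(t) operation
            holes[head] = (j + 1, t)
        ops.append(SRC_POP)                      # advance the read head
    return ops
```

On the sentence pair and alignment of Tab. 1 this reproduces the operation sequence in the first column, up to trailing SRC POP operations for unaligned source tokens at the end of the sentence.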

3 Training

We train our Transformer model as usual by minimising the negative log-likelihood of the target sequence. However, in contrast to plain text NMT, the target sequence is not a plain sequence of subword or word tokens but a sequence of operations. Consequently, we need to map the target sentences in the training corpus to OSNMT representations. We first run a statistical word aligner like Giza++ (Och and Ney, 2003) to obtain an aligned training corpus. We delete all alignment links which violate the 1:n restriction of OSNMT (cf. Sec. 2). The alignments together with the target sentences are then used to generate the reference operation sequences for training. The algorithm for this conversion is shown in Alg. 1.¹ Note that an operation sequence represents one specific alignment, which means that the only way for an OSNMT sequence to be generated correctly is if both the word alignment and the target sentence are also correct. Thereby, the neural model learns to align and translate at the same time. However, there is spurious ambiguity as one alignment can be represented by different OSNMT sequences. For instance, simply adding a SET MARKER operation at the end of an OSNMT sequence does not change the alignment represented by it.

¹A Python implementation is available at https://github.com/fstahlberg/ucam-scripts/blob/master/t2t/align2osm.py.

Corpus | Language pair | # Sentences
Scielo | Spanish-English | 587K
Scielo | Portuguese-English | 513K
WAT | Japanese-English | 1M

Table 3: Training set sizes.

4 Results

We evaluate on three language pairs: Japanese-English (ja-en), Spanish-English (es-en), and Portuguese-English (pt-en). We use the ASPEC corpus (Nakazawa et al., 2016) for ja-en and the health science portion of the Scielo corpus (Neves and Névéol, 2016) for es-en and pt-en. Training set sizes are summarized in Tab. 3. We use byte pair encoding (Sennrich et al., 2016) with 32K merge operations for all systems (joint encoding models for es-en and pt-en and separate source/target models for ja-en). We trained Transformer models (Vaswani et al., 2017)² until convergence (250K steps for plain text, 350K steps for OSNMT) on a single GPU using Tensor2Tensor (Vaswani et al., 2018) after removing sentences with more than 250 tokens. Batches contain around 4K source and 4K target tokens. Transformer training is very sensitive to the batch size and the number of GPUs (Popel and Bojar, 2018). Therefore, we delay SGD updates (Saunders et al., 2018) to every 8 steps to simulate 8 GPU training as recommended by Vaswani et al. (2017). Based on the performance on the ja-en dev set we decode the plain text systems with a beam size of 4 and OSNMT with a beam size of 8 using our SGNMT decoder (Stahlberg et al., 2017). We use length normalization for ja-en but not for es-en or pt-en. We report cased multi-bleu.pl BLEU scores on the tokenized text to be comparable with the WAT evaluation campaign on ja-en.³

²We follow the transformer base configuration and use 6 layers, 512 hidden units, and 8 attention heads in both the encoder and decoder.

³http://lotus.kuee.kyoto-u.ac.jp/WAT/evaluation/list.php?t=2&o=4


Method | BLEU es-en | BLEU pt-en
Align on subword level | 36.7 | 38.1
Convert word level alignments | 37.1 | 38.4

Table 4: Generating training alignments on the subword level.

Type | Frequency
Valid | 92.49%
Not enough SRC POP | 7.28%
Too many SRC POP | 0.22%
Write head out of range | 0.06%

Table 5: Frequency of invalid OSNMT sequences produced by an unconstrained decoder on the ja-en test set.

Generating training alignments  As outlined in Sec. 3 we use Giza++ (Och and Ney, 2003) to generate alignments for training OSNMT. We experimented with two different methods to obtain alignments on the subword level. First, Giza++ can directly align the source-side subword sequences to target-side subword sequences. Alternatively, we can run Giza++ on the word level, and convert the word alignments to subword alignments in a post-processing step by linking subwords if the words they belong to are aligned with each other. Tab. 4 compares both methods and shows that converting word alignments is marginally better. Thus, we use this method in all other experiments.
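A sketch of the second method (the data layout is illustrative, not that of our pipeline): given the word-level links and, for every subword, the index of the word it belongs to, two subwords are linked whenever their words are aligned; the 1:n restriction is then enforced as described in Sec. 3.

```python
from typing import List, Set, Tuple


def word_to_subword_alignment(word_links: Set[Tuple[int, int]],
                              src_word_of: List[int],
                              trg_word_of: List[int]) -> Set[Tuple[int, int]]:
    """Project word-level links (src_word, trg_word) onto subword positions.
    src_word_of[k] / trg_word_of[k] give the word index of the k-th subword."""
    return {
        (si, ti)
        for si, sw in enumerate(src_word_of)
        for ti, tw in enumerate(trg_word_of)
        if (sw, tw) in word_links      # subwords linked iff their words are aligned
    }
```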

Constrained beam search  Unconstrained neural decoding can yield invalid OSNMT sequences. For example, the JMP FWD and JMP BWD operations are undefined if the write head is currently at the position of the last or first marker, respectively. The number of SRC POP operations must be equal to the number of source tokens in order for the read head to scan the entire source sentence. Therefore, we constrain these operations during decoding. We have implemented the constraints in our publicly available SGNMT decoding platform (Stahlberg et al., 2017). However, these constraints are only needed for a small fraction of the sentences. Tab. 5 shows that even unconstrained decoding yields valid OSNMT sequences in 92.49% of the cases.
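The constraints amount to a simple check on the decoder state, sketched below (the state bookkeeping is ours and not part of the SGNMT API): SET MARKER and INSERT(t) are always applicable, jumps require a marker on the corresponding side, SRC POP is blocked once the read head has consumed the source, and the end-of-sentence symbol is only allowed after exactly |x| SRC POP operations.

```python
SRC_POP, SET_MARKER, JMP_FWD, JMP_BWD = "SRC_POP", "SET_MARKER", "JMP_FWD", "JMP_BWD"
INSERT, EOS = "INSERT", "EOS"    # INSERT stands for any target token; EOS ends decoding


def allowed_ops(src_len: int, n_src_pops: int, marker_idx: int, n_markers: int) -> set:
    """Operations permitted in the current partial hypothesis.
    marker_idx is the write head's position in left-to-right marker order."""
    ops = {SET_MARKER, INSERT}               # always valid
    if n_src_pops < src_len:
        ops.add(SRC_POP)                     # read head must not overrun the source
    else:
        ops.add(EOS)                         # may only terminate once the source is consumed
    if marker_idx > 0:
        ops.add(JMP_BWD)                     # there is a marker to the left
    if marker_idx + 1 < n_markers:
        ops.add(JMP_FWD)                     # there is a marker to the right
    return ops
```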

Comparison with plain text NMT  Tab. 6 compares our OSNMT systems with standard plain text models on all three language pairs. OSNMT performs better on the pt-en and ja-en test sets, but slightly worse on es-en. We think that more engineering work such as optimizing the set of operations or improving the training alignments could lead to more consistent gains from using OSNMT. However, we leave this to future work since the main motivation for this paper is explainable NMT and not primarily improving translation quality.

Representation | es-en | pt-en | ja-en dev | ja-en test
Plain | 37.6 | 37.5 | 28.3 | 28.1
OSNMT | 37.1 | 38.4 | 28.1 | 28.8

Table 6: Comparison (BLEU) between plain text and OSNMT on Spanish-English (es-en), Portuguese-English (pt-en), and Japanese-English (ja-en).

Alignment quality  Tab. 7 contains example translations and subword-alignments generated from our Portuguese-English OSNMT model. Alignment links from source words consisting of multiple subwords are mapped to the final subword, visible for the words ‘temperamento’ in the first example and ‘pennisetum’ in the second one. The length of the operation sequences increases with alignment complexity as operation sequences for monotone alignments consist only of INSERT(t) and SRC POP operations (example 1). However, even complex mappings are captured very well by OSNMT as demonstrated by the third example. Note that OSNMT can represent long-range reorderings very efficiently: the movement from ‘para’ in the first position to ‘to’ in the tenth position is simply achieved by starting the operation sequence with ‘SET MARKER to’ and a JMP BWD operation later. The first example in particular demonstrates the usefulness of such alignments as the wrong lexical choice (‘abroad’ rather than ‘body shape’) can be traced back to the source word ‘exterior’.

For a quantitative assessment of the alignments produced by OSNMT we ran Giza++ to align the generated translations to the source sentences, enforced the 1:n restriction of OSNMT, and used the resulting alignments as reference for computing the alignment error rate (Och and Ney, 2003, AER). Fig. 3 shows that as training proceeds, OSNMT learns to both produce high quality translations (increasing BLEU score) and accurate alignments (decreasing AER).

As mentioned in the introduction, a light-weight way to extract 1:n alignments from a vanilla attentional LSTM-based seq2seq model is to take the maximum over attention weights for each target token. This is possible because, unlike the Transformer, LSTM-based models usually only have a single soft attention matrix. However, in our experiments, LSTM-based NMT was more than 4.5 BLEU points worse than the Transformer on Japanese-English. Therefore, to compare AERs under comparable BLEU scores, we used the LSTM-based models in forced decoding mode on the output of our plain text Transformer model from Tab. 6. We trained two different LSTM models: one standard model by optimizing the likelihood of the training set, and a second one with supervised attention following Liu et al. (2016). Tab. 8 shows that the supervised attention loss of Liu et al. (2016) improves the AER of the LSTM model. However, OSNMT is able to produce much better alignments since it generates the alignment along with the translation in a single decoding run.
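The light-weight extraction mentioned above amounts to an argmax per target position over the single attention matrix (a sketch; att is assumed to hold the soft attention weights with one row per target token):

```python
import numpy as np


def attention_to_alignment(att: np.ndarray):
    """att[j, i]: attention weight of target position j on source position i.
    Link every target token to its most attended source token (a 1:n alignment)."""
    return [(j + 1, int(att[j].argmax()) + 1) for j in range(att.shape[0])]
```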


Operation sequence: SRC POP ab road SRC POP as SRC POP an indicator SRC POP of SRC POP performance SRC POP and SRC POP SRC POP temper ament SRC POP
Reference: the body shape as an indicative of performance and temperament

Operation sequence: behavior SRC POP of SRC POP SET MARKER clones SRC POP SRC POP SRC POP SRC POP SRC POP JMP BWD pen n is et um SRC POP JMP FWD subjected SRC POP to SRC POP SET MARKER periods SRC POP SRC POP JMP BWD SET MARKER restriction SRC POP JMP BWD SET MARKER water SRC POP JMP BWD controlled SRC POP
Reference: response of pennisetum clons to periods of controlled hidric restriction

Operation sequence: SET MARKER to SRC POP analyze SRC POP these SRC POP data SRC POP JMP BWD SET MARKER should be SRC POP used SRC POP JMP BWD SET MARKER methodologies SRC POP JMP BWD appropriate SRC POP JMP FWD JMP FWD JMP FWD . SRC POP
Reference: to analyze these data suitable methods should be used .

Table 7: Examples of Portuguese-English translations together with their (subword-)alignments induced by the operation sequence. Alignment links from source words consisting of multiple subwords were mapped to the final subword in the training data, visible for ‘temperamento’ and ‘pennisetum’.

Representation | Alignment extraction | AER dev (%) | AER test (%)
Plain | LSTM forced decoding | 63.9 | 63.7
Plain | LSTM forced decoding with supervised attention (Liu et al., 2016, Cross Entropy loss) | 54.9 | 54.7
OSNMT | OSNMT | 24.2 | 21.5

Table 8: Comparison between OSNMT and using the attention matrix from forced decoding with a recurrent model.

Figure 3: AER and BLEU training curves for OSNMT on the Japanese-English dev set (x-axis: training iterations; y-axes: dev set BLEU and alignment error rate).

Figure 4: Encoder-decoder attention weights. (a) Layer 4, head 1; attending to the source-side read head. (b) Layer 2, head 3; attending to the right trigram context of the read head.

OSNMT sequences contain target words in source sentence order  An OSNMT sequence can be seen as a sequence of target words in source sentence order, interspersed with instructions on how to put them together to form a fluent target sentence. For example, if we strip out all SRC POP, SET MARKER, JMP FWD, and JMP BWD operations in the OSNMT sequence in the second example of Tab. 7 we get:

    behavior of clones pennisetum subjected to periods restriction water controlled

The word-by-word translation back to Portuguese is:

    comportamento de clones pennisetum submetidos a períodos restrição hídrica controlada

This restores the original source sentence (cf. Tab. 7) up to unaligned source words. Therefore, we can view the operations for controlling the write head (SET MARKER, JMP FWD, and JMP BWD) as reordering instructions for the target words which appear in source sentence word order within the OSNMT sequence.
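In code, the target words in source sentence order are obtained by simply filtering out the head-control operations (a one-line sketch):

```python
CONTROL_OPS = {"SRC_POP", "SET_MARKER", "JMP_FWD", "JMP_BWD"}


def target_words_in_source_order(ops):
    """Drop all head-control operations; what remains are the INSERT(t) tokens."""
    return [op for op in ops if op not in CONTROL_OPS]
```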

Role of multi-head attention  In this paper, we use a standard seq2seq model (the Transformer architecture (Vaswani et al., 2017)) to generate OSNMT sequences from the source sentence. This means that our neural model is representation-agnostic: we do not explicitly incorporate the notion of read and write heads into the neural architecture. In particular, neither in training nor in decoding do we explicitly bias the Transformer's attention layers towards consistency with the alignment represented by the OSNMT sequence. Our Transformer model has 48 encoder-decoder attention matrices due to multi-head attention (8 heads in each of the 6 layers). We have found that many of these attention matrices have strong and interpretable links to the translation process represented by the OSNMT sequence. For example, Fig. 4a shows that the first head in layer 4 follows the source-side read head position very closely: at each SRC POP operation the attention shifts by one to the next source token. Other attention heads have learned to take other responsibilities. For instance, head 3 in layer 2 (Fig. 4b) attends to the trigram right of the source head.

5 Related Work

Explainable and interpretable machine learning is attracting more and more attention in the research community (Ribeiro et al., 2016; Doshi-Velez and Kim, 2017), particularly in the context of natural language processing (Karpathy et al., 2015; Li et al., 2016; Alvarez-Melis and Jaakkola, 2017; Ding et al., 2017; Feng et al., 2018). These approaches aim to explain (the predictions of) an existing model. In contrast, we change the target representation such that the generated sequences themselves convey important information about the translation process such as the word alignments.

Despite considerable consensus about the importance of word alignments in practice (Koehn and Knowles, 2017), e.g. to enforce constraints on the output (Hasler et al., 2018) or to preserve text formatting, introducing explicit alignment information to NMT is still an open research problem. Word alignments have been used as a supervision signal for the NMT attention model (Mi et al., 2016; Chen et al., 2016; Liu et al., 2016; Alkhouli and Ney, 2017). Cohn et al. (2016) showed how to reintroduce concepts known from traditional statistical alignment models (Brown et al., 1993) like fertility and agreement over translation direction to NMT. Some approaches to simultaneous translation explicitly control for reading source tokens and writing target tokens and thereby generate monotonic alignments on the segment level (Yu et al., 2016, 2017; Gu et al., 2017). Alkhouli et al. (2016) used separate alignment and lexical models and thus were able to hypothesize explicit alignment links during decoding. While our motivation is very similar to Alkhouli et al. (2016), our approach is very different as we represent the alignment as an operation sequence, and we do not use separate models for reordering and lexical translation.

The operation sequence model for SMT (Durrani et al., 2011, 2015) has been used in a number of MT evaluation systems (Durrani et al., 2014; Peter et al., 2016; Durrani et al., 2016) and for post-editing (Pal et al., 2016), often in combination with a phrase-based model. The main difference to our OSNMT is that we have adapted the set of operations for neural models and are able to use it as a stand-alone system, and not on top of a phrase-based system.

Our operation sequence model has some similarities with transition-based models used in other areas of NLP (Stenetorp, 2013; Dyer et al., 2015; Aharoni and Goldberg, 2017). In particular, our POP SRC operation is very similar to the step action of the hard alignment model of Aharoni and Goldberg (2017). However, Aharoni and Goldberg (2017) investigated monotonic alignments for morphological inflections whereas we use a larger operation/action set to model complex word reorderings in machine translation.

6 Conclusion

We have presented a way to use standard seq2seq models to generate a translation together with an alignment as a linear sequence of operations. This greatly improves the interpretability of the model output as it establishes explicit alignment links between source and target tokens. However, the neural architecture we used in this paper is representation-agnostic, i.e. we did not explicitly incorporate the alignments induced by an operation sequence into the neural model. For future work we are planning to adapt the Transformer model, for example by using positional embeddings of the source read head and the target write head in the Transformer attention layers.

Acknowledgments

This work was supported in part by the U.K. Engineering and Physical Sciences Research Council (EPSRC grant EP/L027623/1). We thank Joanna Stadnik who produced the recurrent translation and alignment models during her 4th year project.

References

Roee Aharoni and Yoav Goldberg. 2017. Morphological inflection generation with hard monotonic attention. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2004–2015, Vancouver, Canada. Association for Computational Linguistics.

Tamer Alkhouli, Gabriel Bretschner, Jan-Thorsten Peter, Mohammed Hethnawi, Andreas Guta, and Hermann Ney. 2016. Alignment-based neural machine translation. In Proceedings of the First Conference on Machine Translation, pages 54–65, Berlin, Germany. Association for Computational Linguistics.


Tamer Alkhouli and Hermann Ney. 2017. Biasing attention-based recurrent neural networks using external alignment information. In Proceedings of the Second Conference on Machine Translation, pages 108–117, Copenhagen, Denmark. Association for Computational Linguistics.

David Alvarez-Melis and Tommi Jaakkola. 2017. A causal framework for explaining the predictions of black-box sequence-to-sequence models. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 412–421. Association for Computational Linguistics.

Michael Auli, Adam Lopez, Hieu Hoang, and Philipp Koehn. 2009. A systematic analysis of translation model search spaces. In Proceedings of the Fourth Workshop on Statistical Machine Translation, pages 224–232. Association for Computational Linguistics.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In ICLR, San Diego, CA, USA.

Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2).

Wenhu Chen, Evgeny Matusov, Shahram Khadivi, and Jan-Thorsten Peter. 2016. Guided alignment training for topic-aware neural machine translation. arXiv preprint arXiv:1607.01628.

Yong Cheng, Shiqi Shen, Zhongjun He, Wei He, Hua Wu, Maosong Sun, and Yang Liu. 2016. Agreement-based joint training for bidirectional attention-based neural machine translation. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI'16, pages 2761–2767. AAAI Press.

David Chiang. 2005. A hierarchical phrase-based model for statistical machine translation. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 263–270, Ann Arbor, Michigan. Association for Computational Linguistics.

Trevor Cohn, Cong Duy Vu Hoang, Ekaterina Vymolova, Kaisheng Yao, Chris Dyer, and Gholamreza Haffari. 2016. Incorporating structural alignment biases into an attentional neural translation model. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 876–885, San Diego, California. Association for Computational Linguistics.

Yanzhuo Ding, Yang Liu, Huanbo Luan, and Maosong Sun. 2017. Visualizing and understanding neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1150–1159. Association for Computational Linguistics.

Finale Doshi-Velez and Been Kim. 2017. Towards a rigorous science of interpretable machine learning.

Nadir Durrani, Fahim Dalvi, Hassan Sajjad, and Stephan Vogel. 2016. QCRI machine translation systems for IWSLT 16. In International Workshop on Spoken Language Translation, Seattle, WA, USA.

Nadir Durrani, Barry Haddow, Philipp Koehn, and Kenneth Heafield. 2014. Edinburgh's phrase-based machine translation systems for WMT-14. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 97–104, Baltimore, Maryland, USA. Association for Computational Linguistics.

Nadir Durrani, Helmut Schmid, and Alexander Fraser. 2011. A joint sequence translation model with integrated reordering. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 1045–1054, Portland, Oregon, USA. Association for Computational Linguistics.

Nadir Durrani, Helmut Schmid, Alexander Fraser, Philipp Koehn, and Hinrich Schütze. 2015. The operation sequence model—combining n-gram-based and phrase-based statistical machine translation. Computational Linguistics, 41(2):157–186.

Chris Dyer, Miguel Ballesteros, Wang Ling, Austin Matthews, and Noah A. Smith. 2015. Transition-based dependency parsing with stack long short-term memory. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 334–343, Beijing, China. Association for Computational Linguistics.

Shi Feng, Eric Wallace, Alvin Grissom II, Mohit Iyyer, Pedro Rodriguez, and Jordan Boyd-Graber. 2018. Pathologies of neural models make interpretations difficult. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.

Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. 2017. Convolutional sequence to sequence learning. ArXiv e-prints.

Hamidreza Ghader and Christof Monz. 2017. What does attention in neural machine translation pay attention to? In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 30–39. Asian Federation of Natural Language Processing.

Jiatao Gu, Graham Neubig, Kyunghyun Cho, and Victor O.K. Li. 2017. Learning to translate in real-time with neural machine translation. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 1053–1062. Association for Computational Linguistics.

Eva Hasler, Adrià de Gispert, Gonzalo Iglesias, and Bill Byrne. 2018. Neural machine translation decoding with terminology constraints. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics.

Andrej Karpathy, Justin Johnson, and Li Fei-Fei. 2015. Visualizing and understanding recurrent networks. arXiv preprint arXiv:1506.02078.

Philipp Koehn and Rebecca Knowles. 2017. Six challenges for neural machine translation. In Proceedings of the First Workshop on Neural Machine Translation, pages 28–39. Association for Computational Linguistics.

Jiwei Li, Xinlei Chen, Eduard Hovy, and Dan Jurafsky. 2016. Visualizing and understanding neural models in NLP. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 681–691. Association for Computational Linguistics.

Lemao Liu, Masao Utiyama, Andrew Finch, and Eiichiro Sumita. 2016. Neural machine translation with supervised attention. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 3093–3102, Osaka, Japan. The COLING 2016 Organizing Committee.

I. Dan Melamed. 2003. Multitext grammars and synchronous parsers. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics.

I. Dan Melamed, Giorgio Satta, and Benjamin Wellington. 2004. Generalized multitext grammars. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04).

Haitao Mi, Zhiguo Wang, and Abe Ittycheriah. 2016. Supervised attentions for neural machine translation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2283–2288, Austin, Texas. Association for Computational Linguistics.

Toshiaki Nakazawa, Manabu Yaguchi, Kiyotaka Uchimoto, Masao Utiyama, Eiichiro Sumita, Sadao Kurohashi, and Hitoshi Isahara. 2016. ASPEC: Asian scientific paper excerpt corpus. In LREC, pages 2204–2208, Portorož, Slovenia.

Mariana L. Neves and Aurélie Névéol. 2016. The Scielo corpus: a parallel corpus of scientific publications for biomedicine. In LREC, pages 2942–2948, Portorož, Slovenia.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Santanu Pal, Marcos Zampieri, and Josef van Genabith. 2016. USAAR: An operation sequential model for automatic statistical post-editing. In Proceedings of the First Conference on Machine Translation, pages 759–763, Berlin, Germany. Association for Computational Linguistics.

Jan-Thorsten Peter, Andreas Guta, Nick Rossenbach, Miguel Graça, and Hermann Ney. 2016. The RWTH Aachen machine translation system for IWSLT 2016. In International Workshop on Spoken Language Translation, Seattle, WA, USA.

Martin Popel and Ondřej Bojar. 2018. Training tips for the transformer model. arXiv preprint arXiv:1804.00247.

Marco Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. “Why should I trust you?”: Explaining the predictions of any classifier. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, pages 97–101. Association for Computational Linguistics.

Danielle Saunders, Felix Stahlberg, Adrià de Gispert, and Bill Byrne. 2018. Multi-representation ensembles and delayed SGD updates improve syntax-based NMT. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725. Association for Computational Linguistics.

Felix Stahlberg, Eva Hasler, Danielle Saunders, and Bill Byrne. 2017. SGNMT – A flexible NMT decoding platform for quick prototyping of new models and search strategies. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 25–30. Association for Computational Linguistics. Full documentation available at http://ucam-smt.github.io/sgnmt/html/.

Pontus Stenetorp. 2013. Transition-based dependency parsing using recursive neural networks. In NIPS Workshop on Deep Learning. Citeseer.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 3104–3112. Curran Associates, Inc.


Zhaopeng Tu, Zhengdong Lu, Yang Liu, Xiaohua Liu, and Hang Li. 2016. Modeling coverage for neural machine translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 76–85. Association for Computational Linguistics.

Ashish Vaswani, Samy Bengio, Eugene Brevdo, François Chollet, Aidan N. Gomez, Stephan Gouws, Llion Jones, Łukasz Kaiser, Nal Kalchbrenner, Niki Parmar, et al. 2018. Tensor2Tensor for neural machine translation. arXiv preprint arXiv:1803.07416.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30, pages 6000–6010. Curran Associates, Inc.

Lei Yu, Phil Blunsom, Chris Dyer, Edward Grefenstette, and Tomáš Kočiský. 2017. The neural noisy channel. In Proceedings of the 5th International Conference on Learning Representations (ICLR), Toulon, France. Computational and Biological Learning Society.

Lei Yu, Jan Buys, and Phil Blunsom. 2016. Online segment to segment neural transduction. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1307–1316, Austin, Texas. Association for Computational Linguistics.