A Lightweight and High Performance Monolingual Word Aligner

Xuchen Yao, Benjamin Van Durme (Johns Hopkins)

Chris Callison-Burch (UPenn) and Peter Clark (Vulcan)

2013-8-6 ACL 2013, Sofia 2

monolingual word alignment

• Aligning one sentence pair from RTE2

• Premise: Linda Johnson, who lives with her husband, Charles, and two cats in ... , said Katrina has ...

• Hypothesis: Linda Johnson is married to Charles

• alignment contributed by Brockett (2007)

monolingual vs. bilingual alignment

• less training data (labeled or unlabeled), but more lexical resources

• semantic relatedness: cued by distributional word similarities

• the same grammar shared by source/target sentences

a discriminative model

• first proposed by Blunsom and Cohn (2006):

• s, t: source (observation), target sentence

• a: target word indices (0 to target length); state 0 is the NULL state for deletion

• f(): feature functions
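
The equation from this slide did not survive in the transcript. A linear-chain CRF over the symbols defined above is conventionally written as below. The weights \lambda_k, the normalizer Z, and the source position index i are notation assumed here, not recovered from the slide.

```latex
% Assumed notation: i ranges over source positions, k over feature functions,
% \lambda_k are learned feature weights, Z normalizes over all alignments.
p(\mathbf{a} \mid \mathbf{s}, \mathbf{t}) =
  \frac{\exp\left( \sum_{i} \sum_{k} \lambda_k \, f_k(a_{i-1}, a_i, \mathbf{s}, \mathbf{t}) \right)}
       {Z(\mathbf{s}, \mathbf{t})},
\qquad
Z(\mathbf{s}, \mathbf{t}) =
  \sum_{\mathbf{a}'} \exp\left( \sum_{i} \sum_{k} \lambda_k \, f_k(a'_{i-1}, a'_i, \mathbf{s}, \mathbf{t}) \right)
```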

desired Viterbi decoding path
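
The lattice figure for this slide is not in the transcript. As a minimal sketch of what Viterbi decoding over alignment states looks like, assuming one state per target word plus a NULL state 0: the function name `viterbi_align` and the two scoring callbacks are hypothetical, and the toy scores stand in for the trained feature weights.

```python
from typing import Callable, List


def viterbi_align(
    num_source: int,
    num_target: int,
    node_score: Callable[[int, int], float],
    edge_score: Callable[[int, int], float],
) -> List[int]:
    """Return the highest-scoring state sequence, one state per source word.

    States are 0..num_target: 0 is the NULL state (source word deleted),
    j >= 1 means "aligned to target word j".
    node_score(i, j): score of source position i (0-based) taking state j.
    edge_score(p, j): score of moving from state p to state j.
    """
    if num_source == 0:
        return []
    states = range(num_target + 1)
    best = [[node_score(0, j) for j in states]]
    back: List[List[int]] = []
    for i in range(1, num_source):
        row, ptr = [], []
        for j in states:
            # Best predecessor state for landing in state j at position i.
            prev = max(states, key=lambda p: best[-1][p] + edge_score(p, j))
            row.append(best[-1][prev] + edge_score(prev, j) + node_score(i, j))
            ptr.append(prev)
        best.append(row)
        back.append(ptr)
    # Trace back from the best final state.
    path = [max(states, key=lambda j: best[-1][j])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]


# Toy example (hypothetical scores): exact match = 1, NULL = 0, mismatch = -1.
source = ["Linda", "is", "married"]
target = ["Linda", "married"]


def toy_node_score(i: int, j: int) -> float:
    if j == 0:
        return 0.0
    return 1.0 if source[i] == target[j - 1] else -1.0


path = viterbi_align(len(source), len(target), toy_node_score, lambda p, j: 0.0)
print(path)  # [1, 0, 2]: "is" goes to the NULL state
```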

features

• string similarity

– Jaro-Winkler, Dice-Sorensen, Hamming, Jaccard, Levenshtein, n-gram overlap and common prefix matching

• POS tags matching

• WordNet

– hypernym, hyponym, synonym, derived form, entailing, causing, members of, have member, substances of, have substances, parts of, have part
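
A few of the string similarity measures listed above can be sketched in plain Python. This is an illustration under assumptions, not the jacana-align implementation: the function names, the use of character bigrams for Jaccard and Dice, and the normalization by the longer word are all choices made here.

```python
from typing import Dict, Set


def char_ngrams(word: str, n: int = 2) -> Set[str]:
    """Character n-grams; words shorter than n count as a single gram."""
    if len(word) < n:
        return {word} if word else set()
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def jaccard(a: str, b: str) -> float:
    x, y = char_ngrams(a), char_ngrams(b)
    return len(x & y) / len(x | y) if x | y else 0.0


def dice(a: str, b: str) -> float:
    x, y = char_ngrams(a), char_ngrams(b)
    return 2 * len(x & y) / (len(x) + len(y)) if x or y else 0.0


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insert, delete and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def common_prefix_len(a: str, b: str) -> int:
    n = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        n += 1
    return n


def string_features(src: str, tgt: str) -> Dict[str, float]:
    """Feature values in [0, 1] for one source/target word pair."""
    a, b = src.lower(), tgt.lower()
    longest = max(len(a), len(b)) or 1
    return {
        "jaccard": jaccard(a, b),
        "dice": dice(a, b),
        "levenshtein_sim": 1.0 - levenshtein(a, b) / longest,
        "common_prefix": common_prefix_len(a, b) / longest,
    }


print(string_features("married", "marry"))
```

Each value would act as one real-valued feature f_k for a candidate word pair; the WordNet relations would need a lexical database and are not sketched here.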

features

• positional

– offset difference between src/tgt word

• context

– whether neighboring words are similar

– helps to align functional words

• distortion (Markov feature)

– how far apart are two aligned target words
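
The three feature families above might look like the following. This is a hedged sketch: the function names are hypothetical, exact string match stands in for "similar", and the relative-position normalization is an assumption rather than the definition used in jacana-align.

```python
from typing import List


def positional_feature(i: int, j: int, src_len: int, tgt_len: int) -> float:
    """Difference between the relative offsets of source word i and
    target word j (both 0-based); 0.0 means the same relative position."""
    return abs(i / src_len - j / tgt_len)


def context_feature(src: List[str], i: int, tgt: List[str], j: int) -> float:
    """Fraction of the two immediate neighbors (left, right) that match.
    Exact lowercase match is used here as a stand-in for similarity."""
    hits = 0
    for d in (-1, 1):
        si, tj = i + d, j + d
        if 0 <= si < len(src) and 0 <= tj < len(tgt):
            if src[si].lower() == tgt[tj].lower():
                hits += 1
    return hits / 2


def distortion_feature(prev_state: int, state: int) -> int:
    """Distance between the target positions of two consecutive source
    words. States are 1-based target indices; 0 is NULL and yields 0."""
    if prev_state == 0 or state == 0:
        return 0
    return abs(state - prev_state)


src = ["the", "cat", "sat"]
tgt = ["a", "cat", "sat", "down"]
print(context_feature(src, 1, tgt, 1))   # 0.5: right neighbor "sat" matches
print(distortion_feature(2, 5))          # 3
```

The distortion feature depends on the previous state as well as the current one, which is why the slide calls it a Markov feature.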

Implementation: jacana-align

source code at http://code.google.com/p/jacana

• lightweight: only used a POS tagger and WordNet

• written in Scala, optimized with L-BFGS

• platform independent, compiles to a .jar file, fully interoperable with Java

• high performance? -> evaluation

Baselines

• GIZA++

• Tree Edit Distance (with stem/WordNet matching)

• MANLI

– MacCartney, B.; Galley, M. & Manning, C. D. A Phrase-Based Alignment Model for Natural Language Inference. EMNLP 2008

• MANLI-constraint (decoding with ILP)

– Thadani, K. & McKeown, K. Optimal and syntactically-informed decoding for monolingual phrase-based alignment. ACL 2011

performance in F1

[F1 comparison charts not recoverable from the transcript; annotated figures: 10.3% on the first slide, then 0.8% and 3.3% on the second]
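
For reference, alignment F1 is usually computed over aligned token pairs. The sketch below assumes that convention and represents each alignment as a set of (source index, target index) pairs. The function name and the toy data are hypothetical, and the slides do not state exactly how their scores were computed.

```python
from typing import Iterable, Tuple

Pair = Tuple[int, int]  # (source index, target index)


def alignment_prf(predicted: Iterable[Pair], gold: Iterable[Pair]) -> Tuple[float, float, float]:
    """Precision, recall and F1 of predicted alignment pairs against gold."""
    pred_set, gold_set = set(predicted), set(gold)
    correct = len(pred_set & gold_set)
    precision = correct / len(pred_set) if pred_set else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy data: two of three predicted pairs agree with the gold alignment.
gold = {(0, 0), (1, 1), (3, 2)}
predicted = {(0, 0), (1, 1), (2, 2)}
p, r, f1 = alignment_prf(predicted, gold)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")
```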

performance in speed (seconds per sentence)

• when sentences are more balanced, jacana-align is about 20x faster

corpus   sentence pair length   MANLI-approx.   MANLI-exact   jacana-align
RTE2     29/11                  1.67s           0.08s         0.025s
FUSION   27/27                  61.96s          2.45s         0.096s

(annotated on the FUSION row: 20x, 20x)

performance in speed (seconds per sentence)

• the speed of jacana-align is not as sensitive to sentence length increase

corpus   sentence pair length   MANLI-approx.   MANLI-exact   jacana-align
RTE2     29/11                  1.67s           0.08s         0.025s
FUSION   27/27                  61.96s          2.45s         0.096s
(slowdown, RTE2 to FUSION)      30x             30x           4x
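
The ratios behind the two speed slides can be rechecked from the table values. The sketch below only does the arithmetic. The helper names are made up for illustration, and the slide figures appear to be loosely rounded, so the computed ratios need not match them exactly.

```python
# Seconds per sentence, copied from the table on the slides.
times = {
    "RTE2": {"MANLI-approx.": 1.67, "MANLI-exact": 0.08, "jacana-align": 0.025},
    "FUSION": {"MANLI-approx.": 61.96, "MANLI-exact": 2.45, "jacana-align": 0.096},
}


def slowdown(system: str) -> float:
    """How much slower a system gets going from RTE2 to FUSION."""
    return times["FUSION"][system] / times["RTE2"][system]


def speedup(corpus: str, slow: str, fast: str) -> float:
    """How many times faster `fast` is than `slow` on one corpus."""
    return times[corpus][slow] / times[corpus][fast]


for system in times["RTE2"]:
    print(f"{system}: {slowdown(system):.1f}x slower on FUSION than on RTE2")

ratio = speedup("FUSION", "MANLI-exact", "jacana-align")
print(f"jacana-align vs. MANLI-exact on FUSION: {ratio:.1f}x faster")
```

On these numbers jacana-align slows down by a factor of about 3.8 between the two corpora, while both MANLI variants slow down by a factor of 30 or more.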

Conclusion

• state-of-the-art monolingual word aligner

– in accuracy

– in speed

• open source, use it and hack it!

thank you

with a demo
