ABC---A Phrase-to-Phrase Alignment Method

Integrating monolingual and bilingual information in sub-sentential phrase alignment

Ying Zhang (Joy) Joy+@cs.cmu.edu

ISL, Carnegie Mellon Univ.

July 08, 2002

07/09/2002 Copyright. Joy, Joy+@cs.cmu.edu

Overview

The advantage of phrase to phrase alignment
Existing methods
Algorithm
Integrating bilingual information with monolingual information
Experiments and results
Discussion and future work

SMT and sub-sentential alignment

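A Statistical Machine Translation (SMT) system is based on the noisy channel model:

$$
\begin{aligned}
\mathrm{Translation} &= \arg\max_{tgt} P(tgt \mid src) \\
&= \arg\max_{tgt} \frac{P(src \mid tgt)\, P(tgt)}{P(src)} \\
&= \arg\max_{tgt} P(src \mid tgt)\, P(tgt)
\end{aligned}
$$

Translation Model: P(src|tgt)

Language Model: P(tgt)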
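As a minimal sketch of the decision rule above, the following picks the target sentence that maximizes P(src|tgt) * P(tgt) in log space; P(src) is dropped because it is constant over candidates. The candidate list and the tables `tm_prob` and `lm_prob` are hypothetical toy values, not numbers from the slides.

```python
import math
from typing import Dict, List


def best_translation(candidates: List[str],
                     tm_prob: Dict[str, float],
                     lm_prob: Dict[str, float]) -> str:
    """Return the argmax over tgt of P(src|tgt) * P(tgt), computed in log space."""
    return max(candidates,
               key=lambda tgt: math.log(tm_prob[tgt]) + math.log(lm_prob[tgt]))


# Hypothetical toy scores for two candidate translations of one source sentence.
candidates = ["the development of pudong", "pudong of development the"]
tm_prob = {candidates[0]: 0.2, candidates[1]: 0.2}     # P(src | tgt)
lm_prob = {candidates[0]: 0.05, candidates[1]: 0.001}  # P(tgt)
best = best_translation(candidates, tm_prob, lm_prob)
```

With equal translation-model scores, the language model decides in favour of the fluent word order.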

SMT and sub-sentential alignment (Cont.)

Through sub-sentential alignment, we are training the Translation Model (TM)

In our system, TM contains word to word, or phrase to phrase transducers. E.g.

Why phrases?

Mismatch between languages

Why phrases? (Cont.)

Phrases encapsulate the context of words

– Tense: e.g.

Word to word alignment

Phrase to phrase alignment

Why phrases? (Cont.)

Local reordering

– E.g. relative clauses in Chinese

Global reordering is still needed; that is our future work

Why phrases? (Cont.)

For languages that need word segmentation, such as Chinese

– The word segmenter cannot segment the sentence perfectly, due to the incomplete coverage of the word list and segmentation ambiguity

– Previous work (Zhang 2001) tries to identify phrases in the corpus using only monolingual information and augments the word list with the new phrases found

Precision: hard to decide on the phrase boundary

Prediction: a phrase identified may not occur in future testing data

Why phrases? (Cont.)

Example of using phrases to compensate for word segmentation failure

Some alignment algorithms

IBM models (Brown 93)
HMM alignment: phrase to phrase (Vogel 96)
Competitive links: word to word (Melamed 97)
Flow network (Gaussier 98)
Bitext Map (Melamed 01)

Algorithm

Given a sentence pair (S,T),

S=<s1,s2,…,si,…sm>

T=<t1,t2,…,tj,…,tn>, where si/tj are src/tgt words.

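Given an m*n matrix B, where

$$
B(i,j) = \mathrm{co\text{-}occurrence}(s_i, t_j) = \frac{N\,(ad - bc)^2}{(a+b)(c+d)(a+c)(b+d)}, \qquad N = a + b + c + d
$$

and a, b, c, d are the cells of the contingency table for si and tj:

         tj    ~tj
si       a     b
~si      c     d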
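A minimal sketch of how such a score could be computed, assuming the counts are taken over sentence pairs (a: both si and tj occur, b: si only, c: tj only, d: neither) and that the score has the form N(ad-bc)^2 / ((a+b)(c+d)(a+c)(b+d)), which is the chi-square statistic of a 2x2 table. The function names and the toy corpus are hypothetical.

```python
from typing import List, Sequence, Tuple

SentencePair = Tuple[Sequence[str], Sequence[str]]


def contingency_counts(pairs: List[SentencePair], s: str, t: str) -> Tuple[int, int, int, int]:
    """Count sentence pairs by presence of source word s and target word t."""
    a = b = c = d = 0
    for src, tgt in pairs:
        has_s, has_t = s in src, t in tgt
        if has_s and has_t:
            a += 1      # both occur
        elif has_s:
            b += 1      # s without t
        elif has_t:
            c += 1      # t without s
        else:
            d += 1      # neither
    return a, b, c, d


def cooccurrence_score(a: int, b: int, c: int, d: int) -> float:
    """N(ad-bc)^2 / ((a+b)(c+d)(a+c)(b+d)) for one 2x2 contingency table."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # assumption: a degenerate table carries no evidence
    return n * (a * d - b * c) ** 2 / denom


# Hypothetical toy corpus of tokenized sentence pairs.
corpus = [
    (["shanghai", "pudong"], ["shanghai's", "pudong"]),
    (["pudong"], ["pudong"]),
    (["shanghai"], ["shanghai"]),
    (["development"], ["development"]),
]
counts = contingency_counts(corpus, "pudong", "pudong")
score = cooccurrence_score(*counts)
```

Words that always appear together get the maximum score N; statistically independent words get 0.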

Algorithm (Cont.)

Goal: find a partition over matrix B, under the constraint that one src/tgt word can only align to one tgt/src word or one tgt/src phrase (adjacent word sequence)

[Figures: legal segmentation, imperfect alignment; illegal segmentation, perfect alignment]

Algorithm (Cont.)

While(still has row or column not aligned){

Find cell[i,j], where B(i,j) is the max among all available(not aligned) cells;

Expand cell[i,j] with similarity sim_thresh to region[RowStart,RowEnd; ColStart,ColEnd]

Mark all the cells in the region as aligned

}

Output the aligned regions as phrases

Algorithm (Cont.)

Expand cell[i,j] with sim_thresh

Current aligned region: region[RowStart=i, RowEnd=i; ColStart=j, ColEnd=j]

While(still ok to expand){

if all cells[m,n], where m=RowStart-1, ColStart<=n<=ColEnd, B(m,n) is similar to B(i,j) then RowStart = RowStart - 1; //expand to north

if all cells[m,n], where m=RowEnd+1, ColStart<=n<=ColEnd, B(m,n) is similar to B(i,j) then RowEnd = RowEnd + 1; //expand to south

… //expand to east

… //expand to west

}

Define similar(x,y) = true, if abs((x-y)/y) < 1 - sim_thresh
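The greedy segmentation and the region expansion can be sketched in Python as follows. This is an illustration under stated assumptions, not the author's implementation: east/west are taken to mean ColEnd+1 and ColStart-1, a row or column already used by a region cannot be absorbed by another one, a zero seed score is similar to nothing, and the loop stops once no free cell remains. The first matrix reuses one of the score tables from the Example slide.

```python
from typing import List, Tuple

Matrix = List[List[float]]
Region = Tuple[int, int, int, int]  # (row_start, row_end, col_start, col_end), inclusive


def similar(x: float, y: float, sim_thresh: float) -> bool:
    """similar(x, y) as defined on the slide: abs((x - y) / y) < 1 - sim_thresh."""
    if y == 0:
        return False  # assumption: nothing is similar to a zero seed
    return abs((x - y) / y) < 1 - sim_thresh


def expand(B: Matrix, i: int, j: int, sim_thresh: float,
           row_free: List[bool], col_free: List[bool]) -> Region:
    """Grow a region around seed cell (i, j) while neighbouring rows/columns are similar."""
    seed = B[i][j]
    r0, r1, c0, c1 = i, i, j, j

    def row_ok(r: int) -> bool:
        return (0 <= r < len(B) and row_free[r]
                and all(similar(B[r][c], seed, sim_thresh) for c in range(c0, c1 + 1)))

    def col_ok(c: int) -> bool:
        return (0 <= c < len(B[0]) and col_free[c]
                and all(similar(B[r][c], seed, sim_thresh) for r in range(r0, r1 + 1)))

    grew = True
    while grew:
        grew = False
        if row_ok(r0 - 1):   # north
            r0, grew = r0 - 1, True
        if row_ok(r1 + 1):   # south
            r1, grew = r1 + 1, True
        if col_ok(c1 + 1):   # east (assumed: next column)
            c1, grew = c1 + 1, True
        if col_ok(c0 - 1):   # west (assumed: previous column)
            c0, grew = c0 - 1, True
    return r0, r1, c0, c1


def segment(B: Matrix, sim_thresh: float) -> List[Region]:
    """Repeatedly seed at the best free cell, expand it, and mark its rows/columns as used."""
    m, n = len(B), len(B[0])
    row_free, col_free = [True] * m, [True] * n
    regions: List[Region] = []
    while any(row_free) and any(col_free):
        i, j = max(((r, c) for r in range(m) if row_free[r]
                    for c in range(n) if col_free[c]),
                   key=lambda rc: B[rc[0]][rc[1]])
        r0, r1, c0, c1 = expand(B, i, j, sim_thresh, row_free, col_free)
        for r in range(r0, r1 + 1):
            row_free[r] = False
        for c in range(c0, c1 + 1):
            col_free[c] = False
        regions.append((r0, r1, c0, c1))
    return regions


# Rows: shanghai, pudong, development; columns: the, development, of, shanghai's, pudong.
example = [[0.225, 0.056, 0.061, 7.477, 0.413],
           [0.047, 0.021, 0.0, 0.261, 53.949],
           [0.116, 9.641, 0.024, 0.014, 0.106]]
word_regions = segment(example, 0.5)

# Hypothetical matrix in which one source word covers a two-word target phrase.
phrase_regions = segment([[10.0, 9.0, 0.1], [0.1, 0.2, 8.0]], 0.5)
```

Each iteration consumes at least one row and one column, so the outer loop runs at most min(m, n) times.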

Algorithm (Cont.)

[Figures: Expand to North; Expand to South; Expand to East; Expand to West]

Find the best similarity threshold

Similarity_threshold is critical in this algorithm

The algorithm described above uses ONE Similarity_threshold value for all region expansions in the matrix, and the same ONE value for all sentence pairs

Ideally, it is better to use different threshold values for each region and find the globally best segmentation for one matrix

– A search tree, combinatorial explosion

Find the best similarity threshold (Cont.)

One practical solution:

For one matrix B:

For(st=0.1;st<=0.9;st+=0.1){

find segmentation of B given similarity_threshold = st;

}

Select the solution with the highest performance(solution)

$$
\mathrm{Performance}(solution) = \frac{\sum_{[i,j] \in \mathrm{AlignedCells}} B(i,j)}{\sum_{[i,j] \in \mathrm{AlignedCells}} 1}
$$
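The threshold sweep can be sketched as below, assuming performance(solution) is the mean of B(i,j) over all aligned cells. `segment_fn` stands for any segmentation routine that maps (B, threshold) to a list of regions; `toy_segmenter` is a hypothetical stand-in used only to make the example run.

```python
from typing import Callable, List, Tuple

Matrix = List[List[float]]
Region = Tuple[int, int, int, int]  # (row_start, row_end, col_start, col_end), inclusive


def performance(B: Matrix, regions: List[Region]) -> float:
    """Mean score of all cells covered by the aligned regions."""
    cells = [(r, c) for r0, r1, c0, c1 in regions
             for r in range(r0, r1 + 1) for c in range(c0, c1 + 1)]
    if not cells:
        return 0.0
    return sum(B[r][c] for r, c in cells) / len(cells)


def best_segmentation(B: Matrix,
                      segment_fn: Callable[[Matrix, float], List[Region]]
                      ) -> Tuple[float, float, List[Region]]:
    """Try st = 0.1 ... 0.9 and keep the segmentation with the highest performance."""
    best = None
    for k in range(1, 10):
        st = k / 10  # integer counter avoids floating-point drift from st += 0.1
        regions = segment_fn(B, st)
        score = performance(B, regions)
        if best is None or score > best[0]:
            best = (score, st, regions)
    return best


def toy_segmenter(B: Matrix, st: float) -> List[Region]:
    """Hypothetical: low thresholds merge everything, high ones keep the diagonal."""
    if st < 0.5:
        return [(0, 1, 0, 1)]
    return [(0, 0, 0, 0), (1, 1, 1, 1)]


B = [[4.0, 2.0], [1.0, 3.0]]
score, st, regions = best_segmentation(B, toy_segmenter)
```

Because the sweep runs nine segmentations per matrix, it multiplies the cost of alignment by a constant factor rather than exploring the full search tree of per-region thresholds.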

Integrating monolingual information

Motivation:

– Use more information in the alignment

– Easier for aligning phrases

– There is much more monolingual data than bilingual data

[Figure labels: Pittsburgh, Somerset, Union town; Los Angeles, Santa Monica, Santa Clarita, Corona]

Integrating monolingual information (Cont.)

Given a sentence pair (S,T),

S=<s1,s2,…,si,…sm> and T=<t1,t2,…,tj,…,tn>, where si/tj are src/tgt words.

Construct an m*m matrix A, where A(i,j) = collocation(si, sj); only A(i,i-1) and A(i,i+1) have values

Construct an n*n matrix C, where C(i,j) = collocation(ti, tj); only C(j-1,j) and C(j+1,j) have values

Construct an m*n matrix B, where

B(i,j)= co-occurrence(si, tj).

Integrating monolingual information (Cont.)

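Normalization:

– Assign self2self value α(si) to A(i,i), 0<=α(si)<=1

– Assign self2self value β(tj) to C(j,j), 0<=β(tj)<=1

– Normalize A so that:

$$
\begin{aligned}
A'(i,i-1) &= \frac{A(i,i-1)}{A(i,i-1) + A(i,i+1)}\,\bigl(1 - \alpha(s_i)\bigr) \\
A'(i,i+1) &= \frac{A(i,i+1)}{A(i,i-1) + A(i,i+1)}\,\bigl(1 - \alpha(s_i)\bigr) \\
A'(i,i) &= \alpha(s_i) \\
\sum_j A'(i,j) &= 1
\end{aligned}
$$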

Integrating monolingual information (Cont.)

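– Normalize C so that:

$$
\begin{aligned}
C'(j-1,j) &= \frac{C(j-1,j)}{C(j-1,j) + C(j+1,j)}\,\bigl(1 - \beta(t_j)\bigr) \\
C'(j+1,j) &= \frac{C(j+1,j)}{C(j-1,j) + C(j+1,j)}\,\bigl(1 - \beta(t_j)\bigr) \\
C'(j,j) &= \beta(t_j) \\
\sum_i C'(i,j) &= 1
\end{aligned}
$$

– Normalize B so that:

$$
\sum_{i=1}^{m} \sum_{j=1}^{n} B'(i,j) = 1
$$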
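A minimal sketch of the three normalizations, assuming collocation scores are symmetric and stored as one value per adjacent word pair, and that a word with no scored neighbour keeps all weight on itself. The helper names and the toy numbers are hypothetical. C' is obtained by transposing the row-normalized matrix, so that its columns (rather than its rows) sum to 1.

```python
from typing import List

Matrix = List[List[float]]


def normalize_collocation(colloc: List[float], self2self: List[float]) -> Matrix:
    """Build a k x k matrix whose rows sum to 1.

    colloc[i] is the collocation score between word i and word i+1 (length k-1);
    self2self[i] is the diagonal weight of word i.
    """
    k = len(self2self)
    A = [[0.0] * k for _ in range(k)]
    for i in range(k):
        left = colloc[i - 1] if i > 0 else 0.0
        right = colloc[i] if i < k - 1 else 0.0
        total = left + right
        if total == 0:
            A[i][i] = 1.0  # assumption: no neighbour to borrow from
            continue
        A[i][i] = self2self[i]
        if i > 0:
            A[i][i - 1] = left / total * (1 - self2self[i])
        if i < k - 1:
            A[i][i + 1] = right / total * (1 - self2self[i])
    return A


def transpose(M: Matrix) -> Matrix:
    return [list(col) for col in zip(*M)]


def normalize_total(B: Matrix) -> Matrix:
    """Scale B so that all cells sum to 1 (assumes a positive total)."""
    total = sum(sum(row) for row in B)
    return [[x / total for x in row] for row in B]


# Hypothetical scores: three source words, two target words.
A_norm = normalize_collocation([2.0, 6.0], [0.6, 0.6, 0.6])       # rows sum to 1
C_norm = transpose(normalize_collocation([1.0], [0.48, 0.48]))    # columns sum to 1
B_norm = normalize_total([[1.0, 3.0], [2.0, 2.0], [0.0, 2.0]])    # cells sum to 1
```

For the middle source word, the neighbour weights 0.1 and 0.3 split the remaining 1 - 0.6 in proportion to the collocation scores 2 and 6.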

Integrating monolingual information (Cont.)

Calculating the new src-tgt matrix B'':

$$
B'' = A' \times B' \times C'
$$

OK. That’s it! Yes, that’s the whole story!
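The integration step is a plain matrix product of the normalized source collocation matrix (m x m), the normalized co-occurrence matrix (m x n), and the normalized target collocation matrix (n x n). A small pure-Python sketch, with hypothetical 2 x 2 inputs:

```python
from typing import List

Matrix = List[List[float]]


def matmul(X: Matrix, Y: Matrix) -> Matrix:
    """Standard matrix product; zip(*Y) iterates over the columns of Y."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def integrate(A_norm: Matrix, B_norm: Matrix, C_norm: Matrix) -> Matrix:
    """Combine monolingual (A, C) and bilingual (B) information: A * B * C."""
    return matmul(matmul(A_norm, B_norm), C_norm)


# Hypothetical normalized inputs: rows of A sum to 1, columns of C sum to 1.
A_norm = [[0.6, 0.4], [0.4, 0.6]]
B_norm = [[0.5, 0.0], [0.0, 0.5]]
C_norm = [[0.8, 0.3], [0.2, 0.7]]
integrated = integrate(A_norm, B_norm, C_norm)
```

Cells that were 0 in the bilingual matrix receive weight from their neighbours, which is what lets adjacent words end up with similar scores and be merged into one phrase region.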

Example

With pure bilingual information:

               the      development   of       shanghai's   pudong
shanghai       0.225    0.056         0.061    7.477        0.413
pudong         0.047    0.021         0        0.261        53.949
development    0.116    9.641         0.024    0.014        0.106

After integration with monolingual information:

               the      development   of       shanghai's   pudong
shanghai       0.134    0.053         0.059    7.016        10.196
pudong         0.076    0.061         0.029    9.963        38.152
development    2.993    5.482         1.851    1.954        8.213

Visualization

Left: Using pure bilingual information

Right: Integrated with monolingual information

What Is the Self2self Value?

Take a look at:

$$
B' = A \times B
$$

What actually happens is:

$$
B'_{i,j} = \frac{A_{i,i-1}}{A_{i,i-1} + A_{i,i+1}}\,(1 - \alpha_i)\, B_{i-1,j} \;+\; \alpha_i\, B_{i,j} \;+\; \frac{A_{i,i+1}}{A_{i,i-1} + A_{i,i+1}}\,(1 - \alpha_i)\, B_{i+1,j}
$$

(1 − α_i) stands for how much word s_i should “make use of” its neighbours’ relation with the target words.

For content words, the self2self value should be higher, and for function words, it should be lower.
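A worked instance may help: with hypothetical values α_i = 0.6, A_{i,i-1} = 2, A_{i,i+1} = 6, B_{i-1,j} = 1, B_{i,j} = 5 and B_{i+1,j} = 3 (none of these numbers come from the slides), the word keeps 60% of its own score and borrows the remaining 40% from its two neighbours in proportion 2:6.

```latex
\begin{aligned}
B'_{i,j} &= \frac{2}{2+6}\,(1-0.6)\cdot 1 \;+\; 0.6\cdot 5 \;+\; \frac{6}{2+6}\,(1-0.6)\cdot 3 \\
         &= 0.1 + 3.0 + 0.9 \\
         &= 4.0
\end{aligned}
```

Setting α_i = 1 would reproduce B_{i,j} = 5 unchanged, i.e. the pure bilingual score.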

How To Set the Self2self Values

Well, this is tricky

Before the June evaluation I set α = 0.6 for all src words and β = 0.48 for all tgt words

– Not good

– “the” should have a lower self2self value and “Pittsburgh” should have a higher self2self value

Calculating Self2self Values

Observation: Source language content words tend to align to a few target words with high scores while function words tend to align to many target words with low scores

[Figure: example words “has”, “the”, “beijing”, “computer”, “bus”, “in”]

Calculating Self2self Values (Cont.)

Calculating the entropy of a word over the distribution of normalized co-occurrence scores

– Given word si, for all the possible co-occurring words tj, their co-occurrence score is C(i,j)

– Let

$$
prob(t_j \mid s_i) = \frac{C(i,j)}{\sum_{j'} C(i,j')}
$$

– Define

$$
H(s_i) = -\sum_j prob(t_j \mid s_i)\, \log\bigl(prob(t_j \mid s_i)\bigr)
$$

Map the score linearly to a value between 0~1

Better to map the scores to a range narrower than 0~1, e.g. 0.45~0.85. Why?
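A minimal sketch of the entropy computation and the linear mapping. The slides do not state the direction of the mapping; this sketch assumes that the lowest entropy (a content-like word) maps to the top of the range and the highest entropy (a function-like word) to the bottom, in line with the remark that content words should get higher self2self values. Function names and the default range endpoints' use as parameters are illustrative.

```python
import math
from typing import List


def entropy(scores: List[float]) -> float:
    """Entropy of the distribution obtained by normalizing co-occurrence scores."""
    total = sum(scores)
    if total <= 0:
        return 0.0
    probs = [s / total for s in scores if s > 0]  # zero scores contribute nothing
    return sum(-p * math.log(p) for p in probs)


def self2self_from_entropy(entropies: List[float],
                           low: float = 0.45, high: float = 0.85) -> List[float]:
    """Map entropies linearly into [low, high]; assumed direction: high entropy -> low value."""
    h_min, h_max = min(entropies), max(entropies)
    if h_max == h_min:
        return [(low + high) / 2] * len(entropies)
    return [high - (h - h_min) / (h_max - h_min) * (high - low) for h in entropies]


# Hypothetical score profiles: one concentrated (content-like), one flat (function-like).
h_content = entropy([5.0, 0.0, 0.0])
h_function = entropy([1.0, 1.0, 1.0, 1.0])
alphas = self2self_from_entropy([h_content, h_function])
```

Keeping the range away from 0 and 1 means no word ignores its neighbours completely and no word ignores its own score completely.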

A Modification to the Segmentation Algorithm

Original algorithm calculates A*B*C only once

In the modified version:

– Set B[i,j] to 0 for all aligned cells when a new aligned region is found

– Re-calculate A*B*C

Motivation:

– Since we found an aligned region, the boundary of this phrase is known. It should not affect the unaligned neighbors

More computationally expensive

Experiments showed better performance
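The modified loop can be sketched as follows. Region finding is passed in as a callable; `pick_max_cell` is a hypothetical stand-in that aligns only the single best free cell, used here to keep the example short. The inputs are toy values.

```python
from typing import Callable, List, Tuple

Matrix = List[List[float]]
Region = Tuple[int, int, int, int]  # (row_start, row_end, col_start, col_end), inclusive


def matmul(X: Matrix, Y: Matrix) -> Matrix:
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def pick_max_cell(M: Matrix, row_free: List[bool], col_free: List[bool]) -> Region:
    """Hypothetical region finder: the best free cell, without any expansion."""
    i, j = max(((r, c) for r in range(len(M)) if row_free[r]
                for c in range(len(M[0])) if col_free[c]),
               key=lambda rc: M[rc[0]][rc[1]])
    return i, i, j, j


def segment_with_recompute(A: Matrix, B: Matrix, C: Matrix,
                           find_region: Callable[..., Region] = pick_max_cell
                           ) -> List[Region]:
    """Zero out aligned cells of B and re-calculate A*B*C before each new region."""
    B = [row[:] for row in B]  # work on a copy
    row_free = [True] * len(B)
    col_free = [True] * len(B[0])
    regions: List[Region] = []
    while any(row_free) and any(col_free):
        M = matmul(matmul(A, B), C)  # recomputed every round
        r0, r1, c0, c1 = find_region(M, row_free, col_free)
        for r in range(r0, r1 + 1):
            row_free[r] = False
            for c in range(c0, c1 + 1):
                B[r][c] = 0.0  # aligned cells stop influencing their neighbours
        for c in range(c0, c1 + 1):
            col_free[c] = False
        regions.append((r0, r1, c0, c1))
    return regions


A = [[0.6, 0.4], [0.4, 0.6]]
B = [[0.5, 0.1], [0.1, 0.3]]
C = [[1.0, 0.0], [0.0, 1.0]]
regions = segment_with_recompute(A, B, C)
```

The extra cost is one A*B*C product per region instead of one per sentence pair.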

Updating Bilingual Information by Iteration

Using EM to update the bilingual co-occurrence scores

– Doesn’t help too much

Results

The Dev-test on the small data track (3540 sentence pairs of training data + 10K glossary)

                           NIST              Bleu
Baseline(IBM1+Gloss)       6.0097            0.1231
Original Algorithm         6.3673 (+5.9%)    0.1478 (+20.0%)
Modified Algorithm         6.4310 (+7.0%)    0.1507 (+22.4%)

After LM-fill              NIST              Bleu
Baseline(IBM1+Gloss)+LM    6.3775            0.1417
Original Algorithm+LM      6.6754 (+4.7%)    0.1611 (+13.7%)
Modified Algorithm+LM      6.7987 (+6.6%)    0.1712 (+20.8%)

Results (Cont.)

                        No LM-fill          With LM-fill
                        NIST      Bleu      NIST      Bleu
Baseline(IBM1+Gloss)    6.0097    0.1231    6.3775    0.1417
HMM+IBM1+Gloss          6.1802    0.1305    6.4750    0.1459
ARV+IBM1+Gloss          6.3636    0.1473    6.7405    0.1681
JOY+IBM1+Gloss          6.4310    0.1507    6.7987    0.1712
ARV+JOY+IBM1+Gloss      6.5117    0.1569    6.8790    0.1776

Conclusion

Simple

Efficient

– Unlike stochastic bracketing (Wu 95), which is O(m^3 n^3), the algorithm of segmenting the matrix is linear, O(min(m,n)). The construction of A*B*C is O(m*n)

Effective

– Improved the translation quality from the baseline (NIST=6.0097, Bleu=0.1231) to (NIST=6.4310, Bleu=0.1507) on the small data track dev-test

Future work

Find a better segmentation algorithm (dynamic threshold)

Find a method which is mathematically more sound for self2self values

Investigate the possibility of using trigram or distance bi-gram monolingual information

References

Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263-311.

Gaussier, E. (1998) Flow Network Models for Word Alignment and Terminology Extraction from Bilingual Corpora. In Proceedings of COLING-ACL-98, Montreal, pp. 444-450.

I. Dan Melamed. "A Word-to-Word Model of Translational Equivalence". In Procs. of the ACL97. pp 490--497. Madrid Spain, 1997.

I. Dan Melamed (2001). Empirical Methods for Exploiting Parallel Texts. MIT Press.

Stephan Vogel, Hermann Ney, and Christoph Tillmann. 1996. HMM-based word alignment in statistical translation. In COLING '96: The 16th Int. Conf. on Computational Linguistics, pages 836-841, Copenhagen, August.

Dekai Wu, An Algorithm for Simultaneously Bracketing Parallel Texts by Aligning Words, ACL, June 1995.

Ying Zhang, Ralf D. Brown, Robert E. Frederking and Alon Lavie. "Pre-processing of Bilingual Corpora for Mandarin-English EBMT". MT Summit VIII, Sep. 2001.

Acknowledgement

I would like to thank Stephan Vogel, Jian Zhang, Jie Yang, Jerry Zhu, Ashish and other people for their valuable advice and suggestions during this work.

Questions and Comments