Artificial Intelligence Laboratory, Jung Sung-won
Statistical Alignment and Machine Translation
2
Contents
• Machine Translation
• Text Alignment
  – Length-based methods
  – Offset alignment by signal processing techniques
  – Lexical methods of sentence alignment
• Word Alignment
• Statistical Machine Translation
3
Different Strategies for MT (1)

[Figure: the translation pyramid. At the bottom, English Text (word string) <-> French Text (word string), linked by word-for-word translation; one level up, English and French syntactic parses, linked by syntactic transfer; above that, English and French semantic representations, linked by semantic transfer; at the top, a single Interlingua (knowledge representation) supporting knowledge-based translation.]
4
Different Strategies for MT (2)
• Machine translation: an important but hard problem
• Why is MT hard?
  – Word-for-word approach
    • Lexical ambiguity
    • Different word order
  – Syntactic transfer approach
    • Can solve problems of word order
    • Syntactic ambiguity remains
  – Semantic transfer approach
    • Can fix cases of syntactic mismatch
    • Output may be unnatural or unintelligible
  – Interlingua
5
MT & Statistical Methods
• In theory, each of the arrows in the prior figure can be implemented based on a probabilistic model.
  – Most MT systems are a mix of probabilistic and non-probabilistic components.
• Text alignment
  – Used to create lexical resources such as bilingual dictionaries and parallel grammars, and to improve the quality of MT.
  – There has been more work on text alignment than on MT itself in statistical NLP.
6
Text Alignment
• Parallel texts or bitexts
  – The same content is available in several languages
  – Official documents of countries with multiple official languages -> literal, consistent translations
• Alignment
  – Paragraph to paragraph, sentence to sentence, word to word
• Uses of aligned text
  – Bilingual lexicography
  – Machine translation
  – Word sense disambiguation
  – Multilingual information retrieval
  – Assisting tools for translators
7
Aligning sentences and paragraphs(1)
• Problems
  – Not always one sentence to one sentence
  – Reordering
  – Large pieces of material can disappear
• Methods
  – Length-based vs. lexical-content-based
  – Matching corresponding points vs. forming sentence beads
8
Aligning sentences and paragraphs(2)
9
Aligning sentences and paragraphs(3)
• BEAD: n:m grouping
  – S, T : texts in two languages
  – S = (s1, s2, …, si)
  – T = (t1, t2, …, tj)
  – Bead types: 0:1, 1:0, 1:1, 2:1, 1:2, 2:2, 2:3, 3:2, …
  – Each sentence can occur in only one bead
  – No crossing

[Figure: sentences s1 … si of S and t1 … tj of T grouped into beads b1 … bk]
10
Dynamic Programming(1)

[Figure: a layered shortest-path graph with vertex layers V0 = {v01}, V1 = {v11, v12, v13}, V2 = {v21, v22, v23}, V3 = {v31, v32}, V4 = {v41, v42, v43}, V5 = {v51}; edges between adjacent layers carry the weights shown on the slide.]
11
Dynamic Programming(2)
• Computing the shortest path P_min from v0 to v5:

  P_min = min_j { d(v01, v1j) + d_min(v1j) }
  d_min(v1j) = min_i { d(v1j, v2i) + d_min(v2i) }
  d_min(v2j) = min_i { d(v2j, v3i) + d_min(v3i) }
  d_min(v3j) = min_i { d(v3j, v4i) + d_min(v4i) }
  d_min(v4j) = min_i { d(v4j, v5i) + d_min(v5i) }

  (given: d(v41, v51) = 4, d_min(v42) = 6, d_min(v43) = 3)

• Intermediate values d_min(v_ij), as given on the slide:

  j\i     1        2        3        4       5
  1    22(v12)  20(v21)  11(v32)  5(v43)  4(v51)
  2    14(v22)  12(v31)   6(v41)  6(v51)
  3    18(v22)  10(v32)   3(v51)
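The recurrence above amounts to filling in d_min layer by layer and following back-pointers. A minimal Python sketch; the layer structure and edge weights below are illustrative, not the values from the slide's figure:

```python
# Shortest path through a layered graph by dynamic programming.
layers = [["v01"], ["v11", "v12"], ["v21", "v22"], ["v51"]]
weight = {  # weight[(u, v)] = cost of the edge u -> v (illustrative values)
    ("v01", "v11"): 3, ("v01", "v12"): 8,
    ("v11", "v21"): 6, ("v11", "v22"): 9,
    ("v12", "v21"): 4, ("v12", "v22"): 5,
    ("v21", "v51"): 2, ("v22", "v51"): 7,
}

def shortest_path(layers, weight):
    d_min = {layers[0][0]: 0}  # d_min[v] = cheapest cost from the source to v
    back = {}                  # predecessor of v on that cheapest path
    for prev, cur in zip(layers, layers[1:]):
        for v in cur:
            # d_min(v) = min over u in the previous layer of d_min(u) + d(u, v)
            u = min(prev, key=lambda u: d_min[u] + weight[(u, v)])
            d_min[v] = d_min[u] + weight[(u, v)]
            back[v] = u
    goal = layers[-1][0]
    path = [goal]
    while path[-1] in back:    # follow back-pointers to recover the path
        path.append(back[path[-1]])
    return d_min[goal], path[::-1]

print(shortest_path(layers, weight))  # -> (11, ['v01', 'v11', 'v21', 'v51'])
```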
12
Length-based methods
• Rationale
  – Short sentence -> short sentence
  – Long sentence -> long sentence
  – Ignores richer information, but quite effective
• Length
  – # of words or # of characters
• Pros
  – Efficient (for similar languages)
  – Rapid
13
Gale and Church (1)
• Find the alignment A (S, T : parallel texts):

  argmax_A P(A | S, T) = argmax_A P(A, S, T)

• Decompose the aligned texts into a sequence of aligned beads (B_1, …, B_K):

  P(A, S, T) = Π_{k=1}^{K} P(B_k)

• The method
  – Lengths of source and translation sentences measured in characters
  – Assumes similar languages and literal translations
  – Applied to the Union Bank of Switzerland (UBS) corpus
    • English, French, German
    • Aligned at the paragraph level
14
Gale and Church (2)

• D(i, j) : the lowest-cost alignment between sentences s_1, …, s_i and t_1, …, t_j

  D(i, j) = min {
    D(i, j-1)   + cost(0:1 align(∅; t_j)),
    D(i-1, j)   + cost(1:0 align(s_i; ∅)),
    D(i-1, j-1) + cost(1:1 align(s_i; t_j)),
    D(i-1, j-2) + cost(1:2 align(s_i; t_{j-1}, t_j)),
    D(i-2, j-1) + cost(2:1 align(s_{i-1}, s_i; t_j)),
    D(i-2, j-2) + cost(2:2 align(s_{i-1}, s_i; t_{j-1}, t_j))
  }
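The recurrence can be sketched as a table-filling loop. In this minimal Python sketch, `bead_cost` is a placeholder for the length-based cost defined on slide 16, and the toy cost function and sentence lengths are illustrative:

```python
def align_cost(src_lens, tgt_lens, bead_cost):
    """D(i, j): lowest-cost alignment of s_1..s_i with t_1..t_j.
    src_lens / tgt_lens are sentence lengths; bead_cost scores one bead."""
    I, J = len(src_lens), len(tgt_lens)
    INF = float("inf")
    D = [[INF] * (J + 1) for _ in range(I + 1)]
    D[0][0] = 0.0
    # the six bead types of the recurrence: 0:1, 1:0, 1:1, 1:2, 2:1, 2:2
    beads = [(0, 1), (1, 0), (1, 1), (1, 2), (2, 1), (2, 2)]
    for i in range(I + 1):
        for j in range(J + 1):
            for di, dj in beads:
                if i >= di and j >= dj and D[i - di][j - dj] < INF:
                    c = bead_cost(src_lens[i - di:i], tgt_lens[j - dj:j])
                    D[i][j] = min(D[i][j], D[i - di][j - dj] + c)
    return D[I][J]

# Toy cost: character-length mismatch plus a small penalty for
# non-1:1 beads (a stand-in for the probabilistic cost of slide 16).
def toy_cost(ss, ts):
    return abs(sum(ss) - sum(ts)) + (0.0 if (len(ss), len(ts)) == (1, 1) else 2.0)

# A 2:2 bead (71 vs 71 chars) plus a 1:1 bead is the cheapest reading here.
print(align_cost([20, 51, 30], [22, 49, 31], toy_cost))  # -> 3.0
```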
15
Gale and Church (3)

[Figure: two candidate alignments of the L1 sentences s1–s4 with the L2 sentences t1–t3.
Alignment 1: cost(align(s1, s2, t1)) + cost(align(s3, t2)) + cost(align(s4, t3)).
Alignment 2: cost(align(s1, t1)) + cost(align(s2, t2)) + cost(align(s3, ∅)) + cost(align(s4, t3)).]
16
Gale and Church (4)
• l1, l2 : the lengths in characters of the sentences of each language in the bead
• The ratio of character lengths between the two languages
  – normally distributed ~ (μ, s²)

  δ = (l2 − l1·μ) / sqrt(l1·s²)

  cost(l1, l2) = −log P(align | δ(l1, l2, μ, s²))
               ∝ −log P(δ | align) P(align)

• Average 4% error rate
• 2% error rate for 1:1 alignments
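The length-based cost can be sketched as follows. The mean/variance parameters and bead-type priors below are of the kind Gale and Church (1993) report; treat them here as illustrative values, not the exact published ones for any language pair:

```python
import math

# Gale & Church-style length cost for one bead. MU (mean target chars
# per source char), S2 (variance), and the bead priors are illustrative.
MU, S2 = 1.0, 6.8
PRIOR = {(1, 1): 0.89, (1, 0): 0.0099, (0, 1): 0.0099,
         (2, 1): 0.089, (1, 2): 0.089, (2, 2): 0.011}

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def length_cost(l1, l2, bead=(1, 1)):
    """-log P(align) - log P(delta | align) for one bead of lengths l1, l2."""
    delta = (l2 - l1 * MU) / math.sqrt(l1 * S2)
    # two-sided tail probability of the standard normal at |delta|
    p_delta = max(2.0 * (1.0 - norm_cdf(abs(delta))), 1e-300)
    return -math.log(PRIOR[bead]) - math.log(p_delta)

# similar lengths are cheap; very different lengths are expensive
print(length_cost(100, 102) < length_cost(100, 160))  # -> True
```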
17
Other Research
• Brown et al. (1991c)
  – Corpus: Canadian Hansard (English, French)
  – Method: comparing sentence lengths in words rather than characters
  – Goal: produce an aligned subset of the corpus
  – Feature: EM algorithm
• Wu (1994)
  – Corpus: Hong Kong Hansard (English, Cantonese)
  – Method: Gale and Church (1993) method
  – Result: assumptions not as clearly met when dealing with unrelated languages
  – Feature: uses lexical cues
18
Offset alignment by signal processing techniques
• Shows roughly what offset in one text aligns with what offset in the other.
• Church (1993)
  – Background: noisy text (OCR output)
  – Method
    • Define cognates at the character-sequence level -> pure cognates + proper names + numbers
    • Dot-plot method (character 4-grams)
  – Result: very small error rate
  – Drawbacks
    • Different character sets
    • No or extremely few identical character sequences
19
DOT-PLOT
a a c g g c t t a c g
g ● ● ●
g ● ● ●
c ● ● ●
t ● ●
t ● ●
t ● ●
c ● ● ●
g ● ● ●
g ● ● ●
a a c g g c t t a c g
g
g ●
c ● ●
t
t ●
t ●
c ●
g ●
g ●
Unigram matches (top); bigram matches (bottom)
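A dot plot like the one above can be built directly from character n-grams (Church used 4-grams): mark every position pair that shares an n-gram, and matching stretches show up as diagonal runs. The two strings below are illustrative:

```python
# Build a dot plot from character n-grams: mark (i, j) whenever the
# n-gram at position i of text a equals the n-gram at position j of b.
def dot_plot(a, b, n=4):
    at = {}
    for j in range(len(b) - n + 1):
        at.setdefault(b[j:j + n], []).append(j)
    return {(i, j)
            for i in range(len(a) - n + 1)
            for j in at.get(a[i:i + n], [])}

a = "the cat sat on the mat"
b = "xxthe cat sat on a mat"
pts = dot_plot(a, b)
print((0, 2) in pts)  # "the " at offset 0 of a matches offset 2 of b -> True
```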
20
Fung and McKeown
• Conditions
  – Without having found sentence boundaries
  – In only roughly parallel texts
  – With unrelated languages
• Languages: English and Cantonese
• Method
  – Arrival vector
  – Small bilingual dictionary
• A word's offsets (1, 263, 267, 519) => arrival vector (262, 4, 252)
• Choose English-Cantonese word pairs of high similarity => small bilingual dictionary => anchors for text alignment
• A strong signal in a line along the diagonal of the dot plot => good alignment
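The arrival vector is simply the sequence of gaps between successive offsets at which a word occurs; a one-line sketch reproducing the slide's example:

```python
# Fung and McKeown's arrival vector: gaps between successive offsets
# at which a word occurs in the text.
def arrival_vector(offsets):
    return [b - a for a, b in zip(offsets, offsets[1:])]

print(arrival_vector([1, 263, 267, 519]))  # -> [262, 4, 252]
```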
21
Lexical methods of sentence alignment(1)
• Align beads of sentences in robust ways using lexical information
• Kay and Röscheisen (1993)
  – Features: lexical cues, a process of convergence
  – Algorithm
    • Set initial anchors
    • Until most sentences are aligned:
      – Form an envelope of possible alignments
      – Choose pairs of words that tend to co-occur in these potential partial alignments
      – Find pairs of source and target sentences which contain many possible lexical correspondences
22
Lexical methods of sentence alignment(2)
• 96% coverage after four passes on Scientific American articles
• 7 errors after 5 passes on 1000 Hansard sentences
• Drawbacks
  – Computationally intensive
  – Pillow-shaped envelope => problems when text is moved or deleted
23
Lexical methods of sentence alignment(3)
• Chen (1993)
  – Similar to the model of Gale and Church (1993)
  – A simple translation model is used to estimate the cost of an alignment
  – Corpora
    • Canadian Hansard, European Economic Community proceedings (millions of sentences)
  – Estimated error rate: 0.4%
    • Most errors are due to the sentence-boundary detection method => no further improvement
24
Lexical methods of sentence alignment(4)
• Haruno and Yamazaki (1996)
  – Aligns structurally different languages
  – A variant of Kay and Röscheisen (1993)
  – Does lexical matching on content words only
    • POS tagger
  – To align short texts, uses an online dictionary
  – Knowledge-rich approach
  – The combined methods
    • Good results even on short texts between very different languages
25
Word Alignment
• Uses
  – Terminology databases, bilingual dictionaries
• Methods
  – Text alignment -> word alignment
  – χ² measure
  – EM algorithm
• Use of existing bilingual dictionaries
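The χ² measure scores how strongly an English word and a French word co-occur across aligned sentence pairs. A minimal sketch over an invented five-pair toy corpus:

```python
# Chi-square association for a candidate (English, French) word pair:
# build a 2x2 contingency table over aligned sentence pairs and compute
# N * (ad - bc)^2 / ((a+b)(c+d)(a+c)(b+d)). The toy corpus is invented.
def chi_square(we, wf, pairs):
    a = b = c = d = 0
    for e_sent, f_sent in pairs:
        in_e, in_f = we in e_sent.split(), wf in f_sent.split()
        if in_e and in_f:
            a += 1
        elif in_e:
            b += 1
        elif in_f:
            c += 1
        else:
            d += 1
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0

pairs = [("the dog barks", "le chien aboie"),
         ("the cat sleeps", "le chat dort"),
         ("a dog runs", "un chien court"),
         ("a cat runs", "un chat court"),
         ("the dog eats", "le chien mange")]
# "dog" is associated with "chien" far more strongly than with "le"
print(chi_square("dog", "chien", pairs) > chi_square("dog", "le", pairs))  # -> True
```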
26
Statistical Machine Translation(1)
• Noisy channel model in MT
  – Language model: P(e)
  – Translation model: P(f | e)
  – Decoder: ê = argmax_e P(e | f)

[Figure: e -> Translation Model P(f | e) -> f -> Decoder (ê = argmax_e P(e | f)) -> ê, with the Language Model supplying P(e)]
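On a toy scale, the decoder's job is just the argmax of P(e)·P(f | e). The candidate list and probability tables below are invented for illustration; a real decoder searches an unbounded space of English sentences:

```python
import math

# Noisy-channel scoring: pick the English candidate e maximizing
# log P(e) + log P(f | e). All probabilities here are made up.
lm = {"the dog": 0.6, "dog the": 0.1}                # language model P(e)
tm = {("le chien", "the dog"): 0.5,                  # translation model P(f | e)
      ("le chien", "dog the"): 0.5}

def decode(f, candidates):
    return max(candidates,
               key=lambda e: math.log(lm[e]) + math.log(tm[(f, e)]))

print(decode("le chien", ["the dog", "dog the"]))  # -> the dog
```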
27
Statistical Machine Translation (2)
• Translation model
  – Compute P(f | e) by summing the probabilities of all alignments:

  P(f | e) = (1/Z) Σ_{a1=0}^{l} … Σ_{am=0}^{l} Π_{j=1}^{m} P(f_j | e_{a_j})

  e : English sentence          l : the length of e in words
  f : French sentence           m : the length of f in words
  f_j : word j in f
  a_j : the position in e that f_j is aligned with
  e_{a_j} : the word in e that f_j is aligned with
  P(w_f | w_e) : translation probability
  Z : normalization constant

[Figure: each French word f_j connected to the English word e_{a_j} it is aligned with]
28
Statistical Machine Translation (3)
• Translation probability: P(w_f | w_e)
  – Assume that we have a corpus of aligned sentences.
  – EM algorithm:
    • Random initialization of P(w_f | w_e)
    • E step: z_{w_f, w_e} = Σ_{(e,f) s.t. w_e ∈ e, w_f ∈ f} P(w_f | w_e)
    • M step: P(w_f | w_e) = z_{w_f, w_e} / Σ_v z_{v, w_e}

• Decoder

  ê = argmax_e P(e | f) = argmax_e P(e) P(f | e) / P(f) = argmax_e P(e) P(f | e)

  – The search space is infinite => stack search
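The EM procedure can be sketched concretely (essentially IBM Model 1 without an empty word). The slide uses random initialization; uniform initialization is used here so the toy run is deterministic, and the two-sentence corpus is illustrative:

```python
from collections import defaultdict

# EM for the translation probabilities t(wf | we).
corpus = [(["the", "dog"], ["le", "chien"]),
          (["the", "cat"], ["le", "chat"])]

e_vocab = {we for e, _ in corpus for we in e}
f_vocab = {wf for _, f in corpus for wf in f}
# uniform initialization of t(wf | we)
t = {(wf, we): 1.0 / len(f_vocab) for wf in f_vocab for we in e_vocab}

for _ in range(20):
    count = defaultdict(float)       # E step: expected counts z_{wf, we}
    total = defaultdict(float)
    for e, f in corpus:
        for wf in f:
            z = sum(t[(wf, we)] for we in e)   # normalize within the sentence
            for we in e:
                count[(wf, we)] += t[(wf, we)] / z
                total[we] += t[(wf, we)] / z
    for wf, we in t:                 # M step: renormalize expected counts
        t[(wf, we)] = count[(wf, we)] / total[we] if total[we] else 0.0

# probability mass concentrates on the correct word pairs
print(t[("chien", "dog")] > 0.9)  # -> True
```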
29
Statistical Machine Translation(4)
• Problems
  – Distortion
  – Fertility: the number of French words one English word generates
• Experiment
  – 48% of French sentences were decoded correctly
  – Incorrect decodings
  – Ungrammatical decodings
30
Statistical Machine Translation(5)
• Detailed problems
  – Model problems
    • Fertility is asymmetric
    • Independence assumptions
    • Sensitivity to training data
    • Efficiency
  – Lack of linguistic knowledge
    • No notion of phrase
    • Non-local dependencies
    • Morphology
    • Sparse data problems