Building Bilingual Lexicons Using Lexical Translation Probabilities via Pivot Language
Takashi Tsunakawa (1), Naoaki Okazaki (1), Jun’ichi Tsujii (1, 2)
(1) Department of Computer Science, Graduate School of Information Science and Technology, University of Tokyo
(2) School of Computer Science, University of Manchester / National Centre for Text Mining
LREC 2008, 29 May 2008

Introduction
Building bilingual lexicons via pivot languages
[Figure: a C-E lexicon links the Chinese term 计步器 (jìbùqì) to the English terms "odometer" and "pedometer"; an E-J lexicon links these English terms to the Japanese terms オドメーター (odomētā), ペドメータ (pedomēta), ペドメーター (pedomētā), 歩数計 (hosūkei), and 万歩計 (mampokei).]

Introduction
Building bilingual lexicons via pivot languages
[Figure: the Chinese term 计步器 (jìbùqì) reaches two groups of Japanese terms through the English terms "odometer" and "pedometer": (1) オドメーター (odomētā); (2) ペドメータ (pedomēta), ペドメーター (pedomētā), 歩数計 (hosūkei), 万歩計 (mampokei).]
Image: Creative Commons Attribution ShareAlike 2.0 License, by skippy13

Advantages of the pivotal approach
• Constructing a Japanese-Chinese lexicon from Japanese-English and English-Chinese lexicons through English terms
• J-E and E-C lexicons are well supported for many terms and domains, compared with J-C lexicons
• Especially for technical terms, there are few J-C lexicons, because technical terms are in most cases first written in English
• The pivotal approach could help us find J-C translation term pairs (semi-)automatically

Mismatch problem
Chinese terms | English terms | Japanese terms
全球变暖 (qúanqíu-bìannŭan) | global heating | (n/a)
(n/a) | global warming | 地球温暖化 (chikyū-ondanka)
We cannot find a Chinese-Japanese term pair whose two terms do not share an identical English translation.
Is it possible to generate the following lexical item?
Chinese terms | English terms | Japanese terms
全球变暖 | global heating, global warming | 地球温暖化

Merging Two Bilingual Lexicons
• “Exact merging” cannot merge pairs that do not share an identical English translation (the mismatch problem)
• Two attempts to merge more terms: “word-based merging” and “alignment-based merging”
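
Exact merging can be sketched as a join on the English side. This is a minimal illustration, assuming each lexicon is a list of (term, English term) pairs; the function name `exact_merge` and the toy entries are illustrative and not taken from the authors' implementation.

```python
from collections import defaultdict


def exact_merge(ce_pairs, je_pairs):
    """Join C-E and J-E entries that share an identical English term."""
    # Index the Japanese terms by their English translation.
    japanese_by_english = defaultdict(set)
    for japanese, english in je_pairs:
        japanese_by_english[english].add(japanese)

    # Emit a (Chinese, Japanese) pair for every shared English term.
    merged = set()
    for chinese, english in ce_pairs:
        for japanese in japanese_by_english.get(english, ()):
            merged.add((chinese, japanese))
    return merged


# Toy lexicons built from the terms shown on the slides.
ce_pairs = [("计步器", "pedometer"), ("全球变暖", "global heating")]
je_pairs = [
    ("歩数計", "pedometer"),
    ("万歩計", "pedometer"),
    ("地球温暖化", "global warming"),
]
merged = exact_merge(ce_pairs, je_pairs)
```

In this toy run 计步器 is paired with both Japanese terms for "pedometer", while 全球变暖 stays unmerged because "global heating" and "global warming" are different strings, which is the mismatch problem.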

Word-based merging
Tokenize a term into word tokens, and translate each word with the bilingual lexicon
Chinese terms | English terms | Japanese terms
全球变暖 | global heating | (n/a)
(n/a) | global warming | 地球温暖化
(n/a) | global | 地球
(n/a) | heating | 温暖化
[Figure: 全球变暖 (qúanqíu-bìannŭan) → global heating → 地球 温暖化 (chikyū-ondanka)]
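
The word-by-word step can be sketched as follows. This is a minimal illustration, assuming the pivot term is whitespace-tokenized and that word-level entries are available as a dictionary; `word_based_merge` and `ej_words` are hypothetical names.

```python
from itertools import product


def word_based_merge(english_term, word_lexicon):
    """Translate each English word separately and combine the results."""
    options = []
    for word in english_term.split():
        translations = word_lexicon.get(word)
        if not translations:
            # One untranslatable word makes the whole term untranslatable.
            return []
        options.append(translations)
    # Every combination of per-word translations is a candidate term.
    return [" ".join(choice) for choice in product(*options)]


# Word-level entries taken from the table above.
ej_words = {"global": ["地球"], "heating": ["温暖化"]}
candidates = word_based_merge("global heating", ej_words)
```

With the two word-level entries from the table, "global heating" yields the single candidate 地球 温暖化, even though the J-E lexicon has no entry for the full term.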

Alignment-based merging: Overview
Align each word, calculate word translation probabilities, and translate each word using the probabilities
Chinese terms | English terms | Japanese terms
全球 变暖 | global heating | (n/a)
(n/a) | global warming | 地球 温暖化
(n/a) | heating | 温暖化
[Figure: word alignments in the lexicon entries, linking 全球 变暖 with "global heating", 地球 with "global", and 温暖化 with "heating" and "warming"]

Alignment-based merging: Overview
[Figure: pipeline diagram with the steps "Word-by-word translation" and "Merging word pairs & re-calculating probabilities", annotated "(Add term frequencies on Web)"]

Alignment-based merging
• Apply word alignment (GIZA++) (Och & Ney, 2003) to all term pairs
• Calculate word translation probabilities from co-occurrence frequencies
• Do this for both bilingual lexicons, source(f)-pivot(p) and pivot(p)-target(e):
$$
p(w_f \mid w_p; a_{fp}) = \frac{C(w_p, w_f; a_{fp})}{C(w_p)}, \qquad
p(w_p \mid w_e; a_{pe}) = \frac{C(w_e, w_p; a_{pe})}{C(w_e)}
$$
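
The relative-frequency estimate can be sketched as below. This is a minimal illustration, assuming the word aligner has already produced a list of aligned word pairs and that C(w) is the number of alignment links containing w; the function name and the toy links are hypothetical.

```python
from collections import Counter


def word_translation_probs(links):
    """Estimate p(out | given) = C(given, out) / C(given) from aligned pairs.

    links: list of (given_word, out_word) alignment links.
    """
    pair_counts = Counter(links)
    given_counts = Counter(given for given, _ in links)
    return {
        (given, out): count / given_counts[given]
        for (given, out), count in pair_counts.items()
    }


# Toy alignment links (illustrative counts, not from the experiments).
links = [
    ("heating", "温暖化"),
    ("heating", "温暖化"),
    ("heating", "暖房"),
    ("warming", "温暖化"),
]
probs = word_translation_probs(links)
```

Here "heating" is linked to 温暖化 in two of its three links, so the estimate for that pair is 2/3.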

Alignment-based merging
Calculate word translation probabilities from a target-language word to a source-language word (Utiyama & Isahara, 2007):
$$
p(w_f \mid w_e) = p(w_f \mid w_e; a_{fp}, a_{pe}) = \sum_{w_p} p(w_f \mid w_p; a_{fp})\, p(w_p \mid w_e; a_{pe})
$$
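
The sum over pivot words can be sketched with nested dictionaries. This is a minimal illustration, assuming `p_f_given_p[w_p][w_f]` and `p_p_given_e[w_e][w_p]` hold the two word translation tables; the names and the probabilities in the toy tables are hypothetical.

```python
from collections import defaultdict


def pivot_probs(p_f_given_p, p_p_given_e):
    """Compute p(w_f | w_e) = sum over w_p of p(w_f | w_p) * p(w_p | w_e)."""
    result = defaultdict(lambda: defaultdict(float))
    for w_e, pivots in p_p_given_e.items():
        for w_p, prob_pe in pivots.items():
            # A pivot word missing from the other table contributes nothing.
            for w_f, prob_fp in p_f_given_p.get(w_p, {}).items():
                result[w_e][w_f] += prob_fp * prob_pe
    return {w_e: dict(row) for w_e, row in result.items()}


# Toy tables with made-up probabilities.
p_p_given_e = {"温暖化": {"heating": 0.4, "warming": 0.6}}
p_f_given_p = {"heating": {"变暖": 0.5}, "warming": {"变暖": 1.0}}
merged = pivot_probs(p_f_given_p, p_p_given_e)
```

Both pivot words contribute to the same source-target pair here, giving 0.4 × 0.5 + 0.6 × 1.0 = 0.8 for (变暖, 温暖化).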

Alignment-based merging
• Calculate the translation probabilities (scores) based on the noisy channel model (Brown et al., 1990):
$$
\Pr(w_e \mid w_f) \propto p(w_e)\, p(w_f \mid w_e) = p(w_e) \prod_i p(w_{f,i} \mid w_{e,i})
$$
• The language model p(w_e) is calculated from the number of Web search results (Google) for the term w_e: p(w_e) ∝ (hit count of w_e)
• Generate the merged lexicon from the pairs whose translation probabilities are greater than zero in both directions:
  New_Lexicon = {(w_f, w_e) | Pr(w_e|w_f) > 0 and Pr(w_f|w_e) > 0}
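
The scoring and filtering steps can be sketched as below. This is a minimal illustration, assuming the score is the Web hit count of the candidate multiplied by the product of its word translation probabilities (this reading is consistent with the figures in the example tables, e.g. hit count 10 and log10 probability -3.34 giving a score of about 0.00457); the function names are hypothetical.

```python
import math


def channel_score(hit_count, word_probs):
    """Score proportional to Pr(w_e | w_f).

    hit_count stands in for the language model p(w_e);
    word_probs are the per-word probabilities p(w_f,i | w_e,i).
    """
    return hit_count * math.prod(word_probs)


def new_lexicon(score_e_given_f, score_f_given_e):
    """Keep (w_f, w_e) pairs whose score is positive in both directions."""
    return {
        pair
        for pair, score in score_e_given_f.items()
        if score > 0 and score_f_given_e.get(pair, 0) > 0
    }


example = channel_score(10, [10 ** -3.34])
kept = new_lexicon(
    {("a", "x"): 0.2, ("b", "y"): 0.0},
    {("a", "x"): 0.1, ("b", "y"): 0.3},
)
```

A candidate with zero Web hits gets a score of zero and is therefore dropped from the merged lexicon, however probable its word-by-word translation is.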

Experimental settings
• Used lexicons: bilingual lexicons that consist of technical terms
  • C-E: Wanfang Data E-C & C-E Science and Technology Dictionary
  • J-E: JST Machine Translation Dictionary
• By “exact merging,” we can translate about 22% of Japanese (or Chinese) terms
Lexicon | # of terms (J) | # of terms (E) | # of terms (C)
J-E | 465,563 | 416,578 | -
C-E | - | 429,766 | 439,795
# of distinct E terms | - | 777,344 | -
C-J by “exact merging” | 103,437 (22.2%) | 68,996 | 98,537 (22.4%)

Experimental results
Utilization ratio
• Alignment-based merging drastically improved the utilization ratio, and the size of the merged lexicon also increased
Method | # of terms (J) (utilization ratio of J) | # of terms (C) (utilization ratio of C)
Exact merging | 103,437 (22.2%) | 98,537 (22.4%)
Word-based merging | 124,945 (26.8%) | 167,929 (38.1%)
Alignment-based merging | 438,976 (94.2%) | 342,229 (77.8%)
Accuracy (by manual evaluation)
• MRR: Mean Reciprocal Rank (Voorhees, 1999), the mean of the reciprocal ranks over all source terms
• Prec1: precision of the highest-ranked terms
• Prec10: proportion of source terms whose 10-best outputs include the correct one
Source-Target | MRR | Prec1 | Prec10
Japanese-to-Chinese | 0.242 | 0.14 | 0.46
Chinese-to-Japanese | 0.258 | 0.20 | 0.40
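
The three accuracy measures can be sketched as below. This is a minimal illustration, assuming each source term has a ranked candidate list and a set of translations judged correct, and that a term with no correct candidate contributes 0 to every measure; the function name and the toy data are hypothetical.

```python
def evaluate(ranked, gold):
    """Return (MRR, Prec1, Prec10) over all source terms.

    ranked: source term -> list of candidates, best first.
    gold:   source term -> set of correct translations.
    """
    reciprocal_sum = 0.0
    top1 = 0
    top10 = 0
    for term, candidates in ranked.items():
        correct = gold.get(term, set())
        # Rank (1-based) of the highest-ranked correct candidate, if any.
        best = next(
            (rank for rank, c in enumerate(candidates, start=1) if c in correct),
            None,
        )
        if best is not None:
            reciprocal_sum += 1 / best
            top1 += best == 1
            top10 += best <= 10
    n = len(ranked)
    return reciprocal_sum / n, top1 / n, top10 / n


ranked = {"s1": ["a", "b"], "s2": ["x", "y", "z"], "s3": ["q"]}
gold = {"s1": {"a"}, "s2": {"z"}, "s3": {"r"}}
mrr, prec1, prec10 = evaluate(ranked, gold)
```

In the toy data the correct translations sit at ranks 1 and 3, and the third term has none, so MRR is (1 + 1/3 + 0) / 3.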

Experimental results: Examples (1/2)
A Chinese-to-Japanese example of “角膜 实质 炎” (jiăomó-shízhì-yán; keratitis parenchymatosa)
Japanese translation | J-to-E literal translation | Score | Log10 prob. | Hit count | Correct
角膜 実質 炎 | kerato- parenchymatitis | 0.057 | -2.89 | 432 | OK
角膜 的 炎 | kerato- inflammation | 0.00457 | -3.34 | 10 |
角膜 物質 炎 | kerato- material inflammation | 0 | -2.24 | 0 |
角膜 物質 関節 | kerato- material joint | 0 | -2.49 | 0 |
角膜 実 炎 | kerato- real inflammation | 0 | -2.63 | 0 |
角膜 物質 性 | kerato- materiality | 0 | -2.66 | 0 |
角膜 材料 炎 | kerato- stuff inflammation | 0 | -2.66 | 0 |
角膜 物質 高安 | kerato- material high-low | 0 | -2.83 | 0 |
角膜 物質 胃腸 | kerato- material stomach | 0 | -2.87 | 0 |

Experimental results: Examples (2/2)
A Japanese-to-Chinese example of “発育 状態” (hatsuiku-jōtai; growth status)
Chinese translation | C-to-E literal translation | Score | Log10 prob. | Hit count | Correct
的 状态 | state of | 7249 | -2.43 | 1960000 |
发展 状态 | development state | 6593 | -1.58 | 252000 |
发展 条件 | development condition | 6001 | -2.05 | 674000 |
的 条件 | condition of | 3159 | -2.90 | 2510000 |
发展 国家 | development country | 2715 | -2.57 | 998000 |
生长 状态 | growing state | 2688 | -1.51 | 87900 | OK
生长 条件 | growing condition | 2248 | -1.98 | 216000 |
增长 状态 | rising state | 1343 | -1.72 | 69800 | OK
开发 条件 | development condition | 1260 | -2.78 | 192000 |

Conclusion
• We proposed alignment-based merging of two bilingual lexicons via a pivot language
• Alignment-based merging achieved a utilization ratio of at least 75% in our experiments
• The precision still remains at 0.14 (Japanese-to-Chinese) and 0.20 (Chinese-to-Japanese); a more sophisticated scoring method could improve it
Future directions
• Choose the correct translation by examining the context or semantic classes of the source and target terms
• Evaluate a machine translation system with this lexicon integrated

Thank you for your attention
Acknowledgments
• MEXT, Japan
• Japan Science and Technology Agency (JST), Japan
• NICT, Japan
• Wanfang Data, China

Experimental Results
Our system could generate at least one Japanese translation for 73.4% (385509/525259) of the entries in the C-E lexicon

Chinese input term: 传染性 肝炎 病毒 (infectious hepatitis virus; Japanese reference translation: 感染性肝炎ウイルス)
Japanese candidate | Score
感染 性 肝炎 ウイルス | -8.29
感染 肝炎 ウィールス | -16.58
感染 肝炎 ウイルス | -16.60
感染 性 肝炎 ウイルス | -17.24
感染 性 肝炎 ウイルス | -17.42
伝染 性 肝炎 ウィールス | -17.63
伝染 性 肝炎 ウイルス | -17.65

Chinese input term: 大肠 杆菌 噬菌体 (coliphage; Japanese reference translation: 大腸菌ファージ)
Japanese candidate | Score
大腸 菌 ファージ | -17.68
大腸 ファージ | -17.82
大腸 菌 型 ファージ | -18.48
大腸 菌 ファージ の | -18.88
大腸 菌 バクテリオファージ | -18.88
コリフォーム ファージ | -19.01
大腸 ファージ の | -19.02

Experimental Results

Chinese input term: 补码 形式 (complement form; Japanese reference translation: 補数形式)
Japanese candidate | Score
補 形式 | -18.38
補 体 形式 | -18.47
補 形 | -18.63
補完 形式 | -18.68
補 体 形 | -18.72
追加 形式 | -18.81
補完 形 | -18.93
補助 形式 | -18.95
保健 形式 | -18.97
追加 形 | -19.05

Chinese input term: 声 延迟 线存 储器 (acoustic delay line storage; Japanese reference translation: 音響遅延線記憶装置)
Japanese candidate | Score
音声 遅延 線 記憶 装置 | -17.15
音 遅延 線 記憶 装置 | -17.51
音声 遅延 記憶 装置 | -17.80
音響 遅延 線 記憶 装置 | -17.87
音 遅延 記憶 装置 | -18.16
音響 記憶 | -18.17
音響 遅延 線 記憶 装置 | -18.36
超 音波 遅延 線 記憶 装置 | -18.42
音響 貯蔵 | -18.50
音響 遅延 記憶 装置 | -18.52

Note: some candidates use the same character, but the meanings are not identical

Manual evaluation
A human evaluator checked the translation results of 200 Chinese terms classified in the category “Computer” by the C-E lexicon
• Terms that could be translated into Japanese: 181 (90.5%)
• Terms whose top-10 translations included the correct one: 135 (67.5%)
• Terms whose top translation was correct: 73 (36.5%)
• MRR (mean reciprocal rank) = 0.466: the average of the inverse ranks of the highest-ranked correct translations

Terms whose top translation was correct:
激光 存储器 电路 – laser memory circuit – レーザー メモリ 回路
虚拟 处理 – dummy treatment – 仮想 処理
综合 数字网 – integrated digital network – 総合 ディジタル 網

Terms whose top translation was incorrect / terms that could not be translated:
数 组 元素 – array element – 配列 元素
计算机 化 管理 学会 – ICM – 特 発 性 心筋 障害
信息量 – information content – 量
转镜 式激 光束 影像 记录 仪 – laser beam rotating mirror image recorder – (NO)

Conclusion
• We proposed a method that uses phrase-based SMT to construct a J-C lexicon from J-E and C-E lexicons.
• We obtained Japanese translations for 73.4% of the items in the C-E lexicon, outperforming “exact matching” (22.2%).
• 36.5% of the top Japanese translations were correct, and 67.5% of the top-10 Japanese translation lists included the correct one.
• This method could support the manual construction of bilingual dictionaries, and the resulting lexicon could be used for MT.
Future work
• Parameter optimization of SMT by using existing J-C lexicons
• Chinese character similarity that considers the similarity between individual characters
• More sophisticated reordering model (considering parts of speech)
• Other translation directions (EJ, JC, EC)

Acquisition of Translation Pairs of Technical Terms
• Large-scale translation dictionaries (lexicons) of technical terms are required for translating technical documents
• To construct such dictionaries, we must rely on experts who can deal with both languages
  • This incurs huge costs
  • We must keep up with the rapid increase of new terms
Automatic acquisition of translation candidates of technical terms
• Support for constructing the dictionary
• Improvement of the performance of machine translation systems

J-E bilingual lexicon
• 527,206 translation pairs
• Numbers of distinct terms: 465,565 J terms, 509,259 E terms
Japanese terms | English terms
“外装・内装”派 | "exterior・interior" fraction
(案) | (draft)
(案) | (plan)
(株) | Co.,Ltd.
(株) | Inc.
… | …
ころがり接触疲労 | rolling contact fatigue
ころがり損失 | rolling loss
ころがり対偶 | rolling pair
ころがり疲れ寿命 | rolling fatigue life

C-E bilingual lexicon
Wanfang Data E-C & C-E Science and Technology Dictionary: 525,259 pairs
id | Chinese terms | English terms | Category
1 | ……的瞬时值 | Instantaneous… | 科技 (science and technology)
2 | Ⅰ-Ⅴ族化合物半导体 | group Ⅰ-Ⅴ compound semiconductor | 电子 (electronic)
3 | Ⅰ-Ⅵ族化合物半导体 | group I-VI compound semiconductor | 电子
4 | Ⅰ-Ⅶ族化合物半导体 | group Ⅰ-Ⅶ compound semiconductor | 电子
5 | ⅠA族化合物 | ⅠA compound | 无化 (inorganic chemistry)
… | | |
525259 | 专利发明 | patent | 专利 (patent)

Construction of the C-J bilingual lexicon
Attach Japanese translations to each lexical item of the C-E lexicon
Chinese terms | English terms | Japanese terms
……的瞬时值 | Instantaneous… | 瞬間…
Ⅰ-Ⅴ族化合物半导体 | group Ⅰ-Ⅴ compound semiconductor | Ⅰ-V族化合物半導体
Ⅰ-Ⅵ族化合物半导体 | group I-VI compound semiconductor | Ⅰ-Ⅵ族化合物半導体
Ⅰ-Ⅶ族化合物半导体 | group Ⅰ-Ⅶ compound semiconductor | Ⅰ-Ⅶ族化合物半導体
ⅠA族化合物 | ⅠA compound | ⅠA族化合物
… | |
专利发明 | patent | 特許

Overview of constructing J-C lexicon
• We treat the C-E and J-E lexicons as parallel corpora and use them as training data for constructing a J-C SMT system
  • Applying an SMT approach to the C-E and J-E lexicons makes word/phrase-level merging through English possible
• We apply C-J phrase-based SMT to the Chinese terms in the C-E lexicon
  • Statistical approaches seem to be effective because of the similarities in semantics and word order between Chinese and Japanese
  • It is easy to introduce other clues such as Chinese character similarity

Collecting J-E & C-E translation phrase pairs
• Apply morphological analyzers, and obtain word alignments with GIZA++ (Och and Ney, 2003) for the J-E and C-E lexicons
• Collect phrase pairs by the “grow-diag-final” method (using Moses; Koehn et al., 2007) and calculate the probabilities from the relative frequencies
[Figure: word alignment between ころがり 疲れ 寿命 and "rolling fatigue life"]
Japanese phrases | English phrases | p(e|j) | p(j|e)
ころがり | rolling | 0.733 | 0.083
疲れ | fatigue | 0.973 | 0.503
寿命 | life | 0.565 | 0.210
ころがり 疲れ | rolling fatigue | 1 | 1
疲れ 寿命 | fatigue life | 1 | 0.545
ころがり 疲れ 寿命 | rolling fatigue life | 1 | 1

Merging phrase pairs (Utiyama & Isahara, 2007) (J-E & E-C phrases to J-C phrases)
Japanese phrases | English phrases | p(e|j) | p(j|e)
ころがり | rolling | 0.733 | 0.083
疲れ | fatigue | 0.973 | 0.503
寿命 | life | 0.565 | 0.210
ころがり 疲れ | rolling fatigue | 1 | 1
疲れ 寿命 | fatigue life | 1 | 0.545
ころがり 疲れ 寿命 | rolling fatigue life | 1 | 1

Chinese phrases | English phrases | p(e|c) | p(c|e)
侧倾 | rolling | 0.182 | 0.029
横摇 | rolling | 0.5 | 0.014
… | … | … | …
疲乏 | fatigue | 1 | 0.011
… | … | … | …
疲劳 寿命 | fatigue life | 1 | 1

Merging phrase pairs (Utiyama & Isahara, 2007) (J-E & E-C phrases to J-C phrases)
$$
p(w_f \mid w_e) = \frac{1}{Z_e} \sum_{w_p} p(w_f \mid w_p)\, p(w_p \mid w_e), \qquad
Z_e = \sum_{w_f} \sum_{w_p} p(w_f \mid w_p)\, p(w_p \mid w_e)
$$
(Z_e is a normalizing factor)
Japanese phrases | Chinese phrases | p(c|j) | p(j|c)
ころがり | 侧倾 | … | 0.015
ころがり | 横摇 | … | 0.042
… | … | … | …
疲れ | 疲乏 | … | 0.297
… | … | … | …
疲れ 寿命 | 疲劳 寿命 | … | 0.545
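
As a worked check of the merging formula against the tables, take the Japanese phrase as w_f, the Chinese phrase as w_e, and English as the pivot. The calculation below assumes that "rolling" is the only English phrase shared by the two sides and that the normalizing factor is 1; both are assumptions, since the tables elide the remaining rows.

$$
p(\text{ころがり} \mid \text{侧倾}) \approx p(\text{ころがり} \mid \text{rolling})\, p(\text{rolling} \mid \text{侧倾}) = 0.083 \times 0.182 \approx 0.015
$$

$$
p(\text{疲れ 寿命} \mid \text{疲劳 寿命}) \approx p(\text{疲れ 寿命} \mid \text{fatigue life})\, p(\text{fatigue life} \mid \text{疲劳 寿命}) = 0.545 \times 1 = 0.545
$$

Both products agree with the p(j|c) values listed for these pairs. The listed value for 疲れ and 疲乏 (0.297) does not equal 0.503 × 1, so under this reading the normalization or additional pivot phrases must matter for that row.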

Features for learning of the log-linear model
We employ the following features h1-h4 for the log-linear model:
$$
\hat{w}_e = \arg\max_{w_e} \sum_{m=1}^{M} \lambda_m h_m(w_e, w_f)
$$
1. Phrase translation probability
$$
h_1(w_e, w_f) = \sum_i \log p(w_e^{(i)}, w_f^{(i)})
$$
   where (w_e^{(i)}, w_f^{(i)}) is the i-th phrase pair of the translation
2. 3-gram language model of the target language
$$
h_2(w_e, w_f) = \log p(w_e)
$$
   where p(w_e) is a language model probability from other monolingual corpora
3. Phrase reordering penalty (Koehn et al., 2003)
4. Chinese character similarity (Zhang et al., 2005)
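
Choosing the output under a log-linear model amounts to a weighted sum of feature values followed by an argmax. This is a minimal sketch, assuming the candidate translations and their four feature values h1-h4 are already computed; the weights, candidate names, and feature values are hypothetical.

```python
def loglinear_score(features, weights):
    """Weighted sum of the feature values h_1 ... h_M."""
    return sum(weight * value for weight, value in zip(weights, features))


def best_translation(candidates, weights):
    """Return the candidate with the highest log-linear score.

    candidates: candidate string -> list of feature values [h1, h2, h3, h4].
    """
    return max(candidates, key=lambda c: loglinear_score(candidates[c], weights))


# Hypothetical weights and feature values, for illustration only.
weights = [1.0, 0.5, 0.2, 0.3]
candidates = {
    "candidate A": [-2.0, -3.0, 0.0, 0.333],
    "candidate B": [-1.5, -6.0, -4.0, 0.0],
}
best = best_translation(candidates, weights)
```

Candidate B has the better phrase translation feature, but its language model and reordering features pull its total below candidate A's, so A is selected.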

Feature 3: Phrase reordering penalty (Koehn et al., 2003)
The feature value is the sum of the penalties d defined by the following formula over the phrase pairs of w_e and w_f:
$$
h_3(w_e, w_f) = \sum_i d(w_e^{(i)}, w_f^{(i)}), \qquad
d(w_e^{(i)}, w_f^{(i)}) = -\lvert a_i - b_{i-1} - 1 \rvert
$$
where a_i is the position of the first word of w_f^{(i)} and b_{i-1} is the position of the last word of the w_f phrase translated in the previous step
[Figure: source words f1 ... f8 aligned to the target phrases e1 e2, e3, e4, and e5 e6]
d(e1 e2, f1 f2 f3) = 0
d(e3, f8) = –|8 – 3 – 1| = –4
d(e4, f6 f7) = –|6 – 8 – 1| = –3
d(e5 e6, f4 f5) = –|4 – 7 – 1| = –4
h3(e1…e6, f1…f8) = –11
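
The penalty can be computed directly from the source-side spans of the phrases, taken in the order the target phrases are produced. This is a minimal sketch, assuming 1-based word positions and b_0 = 0 before the first phrase; the function name and the span representation are hypothetical.

```python
def reordering_penalty(source_spans):
    """Sum of d = -|a_i - b_(i-1) - 1| over the phrase pairs.

    source_spans: (first, last) source word positions of each phrase,
    1-based, listed in target-side order.
    """
    total = 0
    previous_last = 0  # b_0: nothing has been translated yet
    for first, last in source_spans:
        total += -abs(first - previous_last - 1)
        previous_last = last
    return total


# The example above: e1 e2 <- f1..f3, e3 <- f8, e4 <- f6 f7, e5 e6 <- f4 f5.
spans = [(1, 3), (8, 8), (6, 7), (4, 5)]
h3 = reordering_penalty(spans)
```

The four phrase pairs contribute 0, -4, -3, and -4, reproducing the total of -11 from the example.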

Feature 4: Chinese character similarity
• Chinese and Japanese writing systems both use Chinese characters, and their similarity should be a powerful clue for deriving translation phrase pairs (Zhang et al., 2005)
• We define the feature value h4 between w_e and w_f as follows:
$$
h_4(w_e, w_f) = 1 - \frac{\text{edit distance of Chinese characters between } w_e \text{ and } w_f}{\max(\text{number of characters in } w_e,\ \text{number of characters in } w_f)}
$$
• Differences between the Chinese and Japanese forms of a character are ignored
• Example: h4(万歩計, 计步器) = h4(万歩計, 計歩器) = h4(ABC, CBD) = 1 – 2/3 = 0.333
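
The similarity can be sketched with a standard Levenshtein edit distance over characters. This is a minimal illustration: the two-entry `VARIANTS` table that maps Chinese character forms to Japanese ones is a hypothetical stand-in for whatever normalization the authors used, and the function names are illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, character by character."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


# Hypothetical, deliberately tiny table of character-form variants.
VARIANTS = {"计": "計", "步": "歩"}


def normalize(term):
    """Map known Chinese character forms to their Japanese forms."""
    return "".join(VARIANTS.get(ch, ch) for ch in term)


def char_similarity(w_e, w_f):
    """h4 = 1 - edit distance / length of the longer term."""
    a, b = normalize(w_e), normalize(w_f)
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1 - edit_distance(a, b) / longest


similarity = char_similarity("万歩計", "计步器")
```

After normalization the pair becomes 万歩計 and 計歩器, which differ by two substitutions over three characters, the same pattern as ABC and CBD in the example.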