Detecting Cross-lingual Semantic Similarity Using Parallel PropBanks
Shumin Wu, Jinho Choi, Martha Palmer
Institute of Cognitive Science, University of Colorado at Boulder
Resource
PropBank
- A corpus annotated with verbal propositions and their arguments.
- Adds semantic information (semantic roles) to the phrase structures.
- e.g. John opened the door with his foot
Parallel Corpus, each side with PropBank Annotation
- Parallel sentences: sentences s and t are called parallel if t is a translation of s.
Chinese Sentence:
大 通道 建设 搞活 了 大 西南 的 物流、 人流、 信息流,
  big passage construction invigorate big southwest material flow people flow information flow
促进了 沿海 港口 城市 经济 的 发展。
  promote coastal port city economy develop
English Sentence:
Construction of the main passage has activated the flow of materials, the flow of people and the flow of information in the great southwest, and has promoted development in the coastal port cities’ economies.
PropBank Annotations:
搞活.01:
  Arg0: 大通道建设
  Arg1: 大西南的物流、人流、信息流
促进.01:
  Arg0: 大通道建设
  Arg1: 沿海港口城市经济的发展
activate.01:
  Arg0: construction of the main passage
  Arg1: the flow of materials, the flow of people and the flow of information
  ArgM-LOC: in the great southwest
promote.01:
  Arg0: construction of the main passage
  Arg1: development
  ArgM-LOC: in the coastal port cities’ economies
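As a rough illustration, the two English propositions above could be held in a small data structure like the following Python sketch. The class name `Proposition`, its fields, and the `core_args` helper are hypothetical and are not part of PropBank or any toolkit.

```python
from dataclasses import dataclass, field


@dataclass
class Proposition:
    """One PropBank-style proposition: a predicate roleset plus labeled arguments."""
    roleset: str                              # e.g. "activate.01"
    args: dict = field(default_factory=dict)  # role label -> argument text

    def core_args(self):
        """Return only the numbered arguments (Arg0, Arg1, ...), skipping ArgM-* modifiers."""
        return {label: text for label, text in self.args.items()
                if not label.upper().startswith("ARGM")}


activate = Proposition(
    "activate.01",
    {
        "Arg0": "construction of the main passage",
        "Arg1": "the flow of materials, the flow of people and the flow of information",
        "ArgM-LOC": "in the great southwest",
    },
)
promote = Proposition(
    "promote.01",
    {
        "Arg0": "construction of the main passage",
        "Arg1": "development",
        "ArgM-LOC": "in the coastal port cities' economies",
    },
)

print(sorted(activate.core_args()))
```

Separating numbered arguments from modifiers mirrors the distinction the poster later draws between main arguments (ARG0, ARG1) and the rest.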
Motivation
Detecting Cross-lingual Semantic Similarity
- Align the PropBank annotations across the parallel corpus
- Group semantically similar Chinese propositions
  : generate a Chinese semantic resource
- Deduce semantic similarity between the two languages
  : use semantic mapping to improve word alignment and machine translation
Word Alignment
Word Alignment:
- Given parallel sentences, align words that are semantically close.
- GIZA++: a statistical machine translation toolkit used to train word-alignment models.
  : provides word-to-word alignment in each direction
  : using only GIZA++ to find parallel predicate pairs misses close to 20% of the predicate occurrences that a human annotator aligns
Percentage of aligned predicates on 200 random sentences in the Xinhua Corpus
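A minimal sketch of how predicate pairs might be read off word alignments produced in both directions. The alignment links below are made up for illustration and are not GIZA++ output; keeping only links found in both directions (intersection) is one common symmetrization heuristic, assumed here rather than taken from the poster. The function name `aligned_predicates` is hypothetical.

```python
def aligned_predicates(ch_to_en, en_to_ch, ch_preds, en_preds):
    """Return (chinese_index, english_index) predicate pairs supported by both directions.

    ch_to_en: set of (ch_idx, en_idx) links from the Chinese-to-English model
    en_to_ch: set of (en_idx, ch_idx) links from the English-to-Chinese model
    ch_preds, en_preds: token indices of the annotated predicates
    """
    # Flip the English-to-Chinese links so both sets use (ch_idx, en_idx) order.
    flipped = {(c, e) for (e, c) in en_to_ch}
    both = ch_to_en & flipped
    return sorted((c, e) for (c, e) in both if c in ch_preds and e in en_preds)


# Toy sentence pair (token indices are 0-based).
ch_tokens = ["大", "通道", "建设", "搞活", "了", "大", "西南", "的", "物流"]
en_tokens = ["Construction", "of", "the", "main", "passage", "has", "activated",
             "the", "flow", "of", "materials", "in", "the", "great", "southwest"]

# Hypothetical links, for illustration only.
ch_to_en = {(2, 0), (1, 4), (3, 6), (8, 10), (6, 14)}
en_to_ch = {(0, 2), (4, 1), (6, 3), (10, 8), (13, 5)}

pairs = aligned_predicates(ch_to_en, en_to_ch, ch_preds={3}, en_preds={6})
print([(ch_tokens[c], en_tokens[e]) for c, e in pairs])
```

A predicate whose link appears in only one direction drops out under intersection, which is one way predicate occurrences can go unaligned.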
Evaluating English-Chinese Semantic Classes
Chinese semantic classes
- No Chinese verb class resource is currently available
- Manually evaluate verb groups (that semantically map to the same English verb) on a scale of 0-3
  : score of 0: not related
  : score of 1: related in context
  : score of 2: hypernym/hyponym relations
  : score of 3: direct/dictionary translation
English semantic classes
- Use WordNet semantic relations for evaluation
- Merging through hypernym relationship
  Ex: The taxonomy of {decrease, drop, fall} indicates that decrease is a hypernym of drop and a synonym of fall
- Sense merging
  Ex: The taxonomy of {sponsor, hold} indicates that sponsor is a hyponym of support.1, and support.4 is a synonym of hold
- Number of semantic classes after merging
  Ex: Taxonomy of {appear, occur, emerge, exhibit}. Even after sense merging, WordNet did not find any relationship between exhibit and the other verbs, resulting in 2 semantic classes
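The class counts in these examples can be mimicked with a small union-find sketch: verbs connected by any relation end up in one class. The relation lists are hand-written for illustration and do not query WordNet. The links among appear, occur and emerge are assumptions; only the absence of a link to exhibit comes from the text. The function name `count_classes` is hypothetical.

```python
def count_classes(verbs, related_pairs):
    """Count groups of verbs connected by at least one relation (synonym or hypernym)."""
    parent = {v: v for v in verbs}

    def find(v):
        # Follow parent links to the representative of v's group.
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving keeps chains short
            v = parent[v]
        return v

    for a, b in related_pairs:
        parent[find(a)] = find(b)

    return len({find(v) for v in verbs})


# Relations stated in the text: decrease is a hypernym of drop and a synonym of fall.
print(count_classes(["decrease", "drop", "fall"],
                    [("decrease", "drop"), ("decrease", "fall")]))

# Assumed links among the first three verbs; exhibit stays unconnected.
print(count_classes(["appear", "occur", "emerge", "exhibit"],
                    [("appear", "occur"), ("appear", "emerge")]))
```

The first call yields one class and the second yields two, matching the counts described in the examples.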
Corpus Description
English Chinese Translation Treebank (ECTB)
- A parallel corpus between English and Chinese
- The corpus is divided into two parts
  : Xinhua Chinese newswire with literal English translations (4,363 parallel sentences)
  : Sinorama Chinese news magazine with non-literal English translations (not used for semantic mapping)
Symmetric Predicate Mapping (SPM)
- Based on GIZA++ word alignment
- Pair-wise similarity measure between semantic roles based on aligned words of the predicate and arguments
  : weighs alignment of the predicate and main arguments (ARG0, ARG1) more heavily than other arguments
  : uses both Chinese/English and English/Chinese GIZA++ word alignment output to generate a bidirectional similarity measure (harmonic mean of the two)
- Find the best one-to-one mapping (linear assignment problem) using the Kuhn-Munkres method
- Ex:
  Matches: 搞活.01 ↔ activate.01, 促进.01 ↔ promote.01
Results
Alignment            GIZA++   Human Annotator
Ch.pred → En.pred    48.1%    60.1%
En.pred → Ch.pred    59.2%    73.8%
Ch.pred ↔ En.pred    53.1%    66.3%
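Symmetric similarity (β = 1 gives the harmonic mean of the two directions):

$$Sim_{SYM} = \frac{(\beta^2 + 1)\, Sim_{EC}\, Sim_{CE}}{\beta^2\, Sim_{EC} + Sim_{CE}}$$

One-to-one mapping, where i ranges over the Chinese predicates C and j over the English predicates E:

$$\hat{x} = \arg\max_{x} \sum_{i \in C} \sum_{j \in E} Sim_{SYM,ij}\, x_{ij}, \qquad \sum_{i} x_{ij} = 1, \qquad \sum_{j} x_{ij} = 1$$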
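The bidirectional measure (the harmonic mean of the two directional scores) can be sketched as a small function. The name `symmetric_similarity` is hypothetical, and which direction the β weight scales is an assumption; with the default β = 1 the two directions are treated equally, as the poster describes.

```python
def symmetric_similarity(sim_ce, sim_ec, beta=1.0):
    """Weighted harmonic mean of two directional similarity scores in [0, 1].

    sim_ce: similarity computed from the Chinese-to-English alignment
    sim_ec: similarity computed from the English-to-Chinese alignment
    beta:   relative weight of the two directions (assumed convention)
    """
    if sim_ce == 0.0 or sim_ec == 0.0:
        return 0.0  # no overlap in one direction means no symmetric similarity
    b2 = beta * beta
    return (b2 + 1.0) * sim_ec * sim_ce / (b2 * sim_ec + sim_ce)


# Made-up directional scores, for illustration only.
print(round(symmetric_similarity(0.9, 0.6), 2))
```

Compared with an arithmetic mean, the harmonic mean penalizes pairs that align well in only one direction.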
            activate.01   promote.01
搞活.01     0.77          0.25
促进.01     0.23          0.49
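The one-to-one mapping for this 2×2 similarity matrix can be checked with a short sketch. Brute-force enumeration of permutations stands in here for the Kuhn-Munkres method named on the poster; it gives the same optimum on small square matrices but does not scale. The function name `best_one_to_one` is hypothetical.

```python
from itertools import permutations


def best_one_to_one(sim):
    """Return the (row, column) assignment with the highest total similarity.

    Tries every permutation of columns, so it is only practical for a handful
    of predicates. Assumes a square matrix (equal counts on both sides).
    """
    n = len(sim)
    best = max(permutations(range(n)),
               key=lambda cols: sum(sim[r][c] for r, c in enumerate(cols)))
    return list(enumerate(best))


ch_preds = ["搞活.01", "促进.01"]
en_preds = ["activate.01", "promote.01"]
sim = [[0.77, 0.25],
       [0.23, 0.49]]

matches = [(ch_preds[r], en_preds[c]) for r, c in best_one_to_one(sim)]
print(matches)
```

The diagonal assignment totals 1.26 against 0.48 for the swapped one, so the sketch pairs 搞活.01 with activate.01 and 促进.01 with promote.01. For larger matrices, SciPy's `scipy.optimize.linear_sum_assignment(sim, maximize=True)` solves the same problem efficiently.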
Method    precision   recall   F-score
GIZA++    84.2%       67.5%    74.9%
SPM       87.0%       88.1%    87.5%
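The F-score column can be recomputed from precision and recall if F-score is read as the balanced harmonic mean (F1). That reading is an assumption, though the reported figures reproduce under it. The function name `f_score` is hypothetical.

```python
def f_score(precision, recall):
    """Balanced F-score: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


for method, p, r in [("GIZA++", 84.2, 67.5), ("SPM", 87.0, 88.1)]:
    print(f"{method}: {f_score(p, r):.1f}")
```

The gain from SPM comes mostly from recall (67.5% to 88.1%), which the harmonic mean rewards more than a similar gain in an already high precision.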
[Figure: word alignment between "Construction of the main passage has activated the flow of materials in the great southwest" and "大 通道 建设 搞活 了 大 西南 的 物流"]
English Semantic Class Results
Experiment Setup
- Start with the Chinese verbs in the previous section, retrieve mapped English verbs from the Xinhua corpus (excluding light verbs, single occurrences, and single-member verb sets)
Results
- Number of English verb sets: 57
- Total number of English verbs: 127
[Figure: taxonomy tree height from each verb to the lowest common hypernym within its verb set]
[Figure: number of sense merges required]
[Figure: resulting number of semantic classes for each verb set]
Summary and Acknowledgements
- Exploring symmetric predicate similarity using PropBank predicate-argument structure
- Automatically generating English-to-Chinese and Chinese-to-English semantic class mapping
- Verifying English semantic class mapping using WordNet
We gratefully acknowledge the support of the National Science Foundation Grant CISE-CRI-0551615, and a grant from the Defense Advanced Research Projects Agency (DARPA/IPTO) under the GALE program, DARPA/CMO Contract No. HR0011-06-C-0022, subcontract from BBN, Inc. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.
Chinese Semantic Class Results
Experiment Setup
- Choose the 50 most diversely-mapped (to Chinese) English verbs from the Xinhua corpus (excluding light verbs and single occurrences)
Results
- Total number of Chinese verbs: 218
- Average membership of Chinese semantic class: 4.36
[Figure: human scores of the Chinese semantic classes]