Detecting Cross-lingual Semantic Similarity Using Parallel PropBanks


description

This paper proposes a method for detecting cross-lingual semantic similarity using parallel PropBanks. We begin by using the information available in parallel PropBanks to improve the word alignments that GIZA++ generates for verb predicates. We apply the Kuhn-Munkres method to a measure of predicate-argument matching, which improves the F-score of verb predicate alignment by 12.6 percentage points. Using the enhanced word alignments, we check the set of target verbs aligned to a specific source verb for semantic consistency. For a set of English verbs aligned to a Chinese verb, we check whether the English verbs belong to the same semantic class using an existing lexical database, WordNet. For a set of Chinese verbs aligned to an English verb, we manually check the semantic similarity between the Chinese verbs within the set. Our results show that the verb sets we generate correlate highly with semantic classes. This could potentially lead to an automatic technique for generating semantic classes for verbs.

Transcript of Detecting Cross-lingual Semantic Similarity Using Parallel PropBanks


Shumin Wu, Jinho Choi, Martha Palmer
Institute of Cognitive Science, University of Colorado at Boulder

Resource

PropBank
- A corpus annotated with verbal propositions and their arguments.
- Adds semantic information (semantic roles) to the phrase structures.
- e.g. John opened the door with his foot

Parallel Corpus, each side with PropBank Annotation
- Parallel sentences: sentences s and t are called parallel if t is a translation of s.

Chinese Sentence:
大 通道 建设 搞活 了 大 西南 的 物流、 人流、 信息流,
(gloss: big passage construction invigorate big southwest material flow people flow information flow)
促进了 沿海 港口 城市 经济 的 发展。
(gloss: promote coastal port city economy develop)

English Sentence:
Construction of the main passage has activated the flow of materials, the flow of people and the flow of information in the great southwest, and has promoted development in the coastal port cities' economies.

[Figure: phrase structure with PropBank annotations]

PropBank Annotations (Chinese):
- 搞活.01: Arg0: 大通道建设 Arg1: 大西南的物流、人流、信息流
- 促进.01: Arg0: 大通道建设 Arg1: 沿海港口城市经济的发展

PropBank Annotations (English):
- activate.01: Arg0: construction of the main passage Arg1: the flow of materials, the flow of people and the flow of information Argm-loc: in the great southwest
- promote.01: Arg0: construction of the main passage Arg1: development Argm-loc: in the coastal port cities' economies
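The example propositions can be held in a small data structure, which makes the later alignment steps easier to follow. This is only an illustrative sketch: the `Proposition` class and the `shared_roles` helper are hypothetical names introduced here, not part of any PropBank tool.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Proposition:
    """One PropBank proposition: a predicate roleset plus its labeled arguments."""
    roleset: str                                          # e.g. "activate.01"
    args: Dict[str, str] = field(default_factory=dict)   # role label -> text span


def shared_roles(a: Proposition, b: Proposition) -> List[str]:
    """Role labels that occur in both propositions (hypothetical helper)."""
    return sorted(set(a.args) & set(b.args))


# The four propositions from the example sentence pair on the poster.
chinese = [
    Proposition("搞活.01", {"Arg0": "大通道建设",
                          "Arg1": "大西南的物流、人流、信息流"}),
    Proposition("促进.01", {"Arg0": "大通道建设",
                          "Arg1": "沿海港口城市经济的发展"}),
]
english = [
    Proposition("activate.01", {
        "Arg0": "construction of the main passage",
        "Arg1": "the flow of materials, the flow of people and the flow of information",
        "Argm-loc": "in the great southwest",
    }),
    Proposition("promote.01", {
        "Arg0": "construction of the main passage",
        "Arg1": "development",
        "Argm-loc": "in the coastal port cities' economies",
    }),
]

# Roles annotated on both sides of the first candidate pair.
print(shared_roles(chinese[0], english[0]))
```

Note that the English propositions carry a locative modifier that has no separately labeled counterpart on the Chinese side, which is one reason a role-by-role comparison needs word alignment rather than label matching alone.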

Motivation

Detecting Cross-lingual Semantic Similarity
- Align the PropBank annotations across the parallel corpus
- Group semantically similar Chinese propositions
  : generate a Chinese semantic resource
- Deduce semantic similarity between the two languages
  : use the semantic mapping to improve word alignment and machine translation


Word Alignment

Word Alignment:
- Given parallel sentences, align words that are semantically close.
- GIZA++: a statistical machine translation toolkit used to train word-alignment models.
  : provides word-to-word alignment in each direction
  : using only GIZA++ to find parallel predicate pairs misses close to 20% of the predicate occurrences that a human annotator aligns

Percentage of aligned predicates on 200 random sentences in the Xinhua corpus: see the Alignment table under Results.

Evaluating English-Chinese Semantic Classes

Chinese semantic classes
- No Chinese verb class resource is currently available
- Manually evaluate verb groups (Chinese verbs that semantically map to the same English verb) on a scale of 0-3
  : score of 0: not related
  : score of 1: related in context
  : score of 2: hypernym/hyponym relations
  : score of 3: direct/dictionary translation

English semantic classes
- Use WordNet semantic relations for evaluation
- Merging through hypernym relationship
  Ex: The taxonomy of {decrease, drop, fall} indicates that decrease is a hypernym of drop and a synonym of fall.
- Sense merging
  Ex: The taxonomy of {sponsor, hold} indicates that sponsor is a hyponym of support.1 and that support.4 is a synonym of hold.
- Number of semantic classes after merging
  Ex: Taxonomy of {appear, occur, emerge, exhibit}: even after sense merging, WordNet did not find any relationship between exhibit and the other verbs, resulting in 2 semantic classes.
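The merging procedure in the three examples can be sketched as a connected-components count. This is a minimal sketch that assumes the WordNet relations have already been looked up and are passed in as plain pairs; `count_semantic_classes` is a hypothetical helper, and the links among appear, occur and emerge are assumed for illustration (the poster only states that exhibit is unrelated to the others).

```python
def count_semantic_classes(verbs, links):
    """Count the semantic classes left after merging linked items.

    verbs: the verbs of one verb set.
    links: pairs of items related through a hypernym, synonym or
           sense-merge relation (may mention intermediate senses).
    """
    parent = {}

    def find(v):
        # Union-find lookup with path halving.
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for a, b in links:
        parent[find(a)] = find(b)

    return len({find(v) for v in verbs})


# decrease is a hypernym of drop and a synonym of fall: one class.
print(count_semantic_classes(
    ["decrease", "drop", "fall"],
    [("decrease", "drop"), ("decrease", "fall")]))

# sponsor and hold only meet once support.1 and support.4 are merged.
sponsor_links = [("sponsor", "support.1"), ("support.4", "hold")]
print(count_semantic_classes(["sponsor", "hold"], sponsor_links))
print(count_semantic_classes(
    ["sponsor", "hold"], sponsor_links + [("support.1", "support.4")]))

# exhibit stays unlinked (links among the other three are assumed).
print(count_semantic_classes(
    ["appear", "occur", "emerge", "exhibit"],
    [("appear", "occur"), ("appear", "emerge")]))
```

Intermediate senses such as support.1 take part in the merging but are not counted, since only the verbs of the set determine the number of classes.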

Corpus Description

English Chinese Translation Treebank (ECTB)
- A parallel corpus between English and Chinese
- The corpus is divided into two parts
  : Xinhua Chinese newswire with literal English translations (4,363 parallel sentences)
  : Sinorama Chinese news magazine with non-literal English translations (not used for semantic mapping)

Symmetric Predicate Mapping (SPM)

- Based on GIZA++ word alignment
- Pair-wise similarity measure between semantic roles based on aligned words of the predicate and arguments
  : weighs alignment of the predicate and main arguments (ARG0, ARG1) more heavily than that of other arguments
  : uses both Chinese/English and English/Chinese GIZA++ word alignment output to generate a bidirectional similarity measure (harmonic mean of the two)
- Find the best one-to-one mapping (linear assignment problem) using the Kuhn-Munkres method
- Ex: Matches: 搞活.01 ↔ activate.01, 促进.01 ↔ promote.01
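The one-to-one mapping step can be illustrated on the similarity values the poster reports for the example sentence pair (0.77, 0.25, 0.23, 0.49). The sketch below is not the Kuhn-Munkres method itself: for a matrix this small it simply enumerates every assignment, which yields the same optimum. `best_one_to_one` is a hypothetical helper and assumes there are no more rows than columns.

```python
from itertools import permutations


def best_one_to_one(sim):
    """Return the highest-scoring one-to-one mapping of rows to columns.

    sim[i][j] is the similarity between row predicate i and column
    predicate j. Exhaustive search stands in for Kuhn-Munkres here and
    is only practical for a handful of predicates per sentence.
    Assumes len(sim) <= len(sim[0]).
    """
    n_rows, n_cols = len(sim), len(sim[0])
    best_score, best_pairs = float("-inf"), []
    for cols in permutations(range(n_cols), n_rows):
        score = sum(sim[i][j] for i, j in enumerate(cols))
        if score > best_score:
            best_score, best_pairs = score, list(enumerate(cols))
    return best_pairs, best_score


chinese_preds = ["搞活.01", "促进.01"]
english_preds = ["activate.01", "promote.01"]
similarity = [
    [0.77, 0.25],   # 搞活.01 vs activate.01, promote.01
    [0.23, 0.49],   # 促进.01 vs activate.01, promote.01
]

pairs, total = best_one_to_one(similarity)
for i, j in pairs:
    print(chinese_preds[i], "<->", english_preds[j])
```

On these values the search pairs 搞活.01 with activate.01 and 促进.01 with promote.01, in agreement with the matches listed on the poster. For real sentences a polynomial-time assignment solver would replace the enumeration.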

Results

Alignment            GIZA++    Human Annotator
Ch.pred → En.pred    48.1%     60.1%
En.pred → Ch.pred    59.2%     73.8%
Ch.pred ↔ En.pred    53.1%     66.3%
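The statement that GIZA++ alone misses close to 20% of predicate occurrences can be checked against the alignment percentages above, assuming the figure is meant relative to the human annotator's coverage (an interpretation, not something the poster states explicitly). `relative_shortfall` is a hypothetical helper.

```python
def relative_shortfall(automatic, human):
    """Fraction of the human-aligned predicates that the automatic aligner misses."""
    return (human - automatic) / human


# Percentages of aligned predicates as reported on the poster.
giza = {"Ch->En": 48.1, "En->Ch": 59.2, "Ch<->En": 53.1}
human = {"Ch->En": 60.1, "En->Ch": 73.8, "Ch<->En": 66.3}

shortfalls = {k: relative_shortfall(giza[k], human[k]) for k in giza}
for direction, value in shortfalls.items():
    print(f"{direction}: {value:.1%}")
```

Under this reading, all three directions come out just under 20%.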

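Symmetric similarity (weighted harmonic mean of the two directional similarities):

$$Sim_{SYM} = \frac{(\beta^{2} + 1)\, Sim_{CE}\, Sim_{EC}}{\beta^{2}\, Sim_{CE} + Sim_{EC}}$$

Best one-to-one mapping between the Chinese predicates C and the English predicates E:

$$\arg\max_{x} \sum_{i \in C} \sum_{j \in E} Sim_{SYM_{ij}}\, x_{ij}, \qquad \sum_{i} x_{ij} \le 1, \qquad \sum_{j} x_{ij} \le 1$$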
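The bidirectional measure is described as a harmonic mean of the Chinese-to-English and English-to-Chinese similarities. A minimal sketch follows, assuming an F-measure-style weighting in which `beta` controls the balance between the two directions and `beta = 1` gives the plain harmonic mean. The function name and the input values are illustrative only.

```python
def symmetric_similarity(sim_ce, sim_ec, beta=1.0):
    """Weighted harmonic mean of two directional similarities.

    sim_ce: similarity computed from the Chinese-to-English alignment.
    sim_ec: similarity computed from the English-to-Chinese alignment.
    beta:   relative weight (assumed F-measure-style weighting).
    """
    denominator = beta ** 2 * sim_ce + sim_ec
    if denominator == 0:
        return 0.0   # both directions found nothing in common
    return (beta ** 2 + 1) * sim_ce * sim_ec / denominator


# Hypothetical directional scores for one predicate pair.
print(symmetric_similarity(0.8, 0.4))
print(symmetric_similarity(0.8, 0.0))
```

A harmonic mean stays low unless both directions agree, so a predicate pair supported by only one alignment direction scores zero, whereas an arithmetic mean would still give it half credit.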

            activate.01    promote.01
搞活.01     0.77           0.25
促进.01     0.23           0.49

Method    precision    recall    F-score
GIZA++    84.2%        67.5%     74.9%
SPM       87.0%        88.1%     87.5%
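The F-scores in this table can be reproduced from the precision and recall columns, assuming the balanced F1 measure (the poster only says "F-score").

```python
def f_score(precision, recall):
    """Balanced F1: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall (in percent) as reported on the poster.
results = {"GIZA++": (84.2, 67.5), "SPM": (87.0, 88.1)}

scores = {method: f_score(p, r) for method, (p, r) in results.items()}
gain = scores["SPM"] - scores["GIZA++"]

for method, score in scores.items():
    print(f"{method}: {score:.1f}")
print(f"gain: {gain:.1f} points")
```

Rounded to one decimal, this gives 74.9 for GIZA++ and 87.5 for SPM, a gain of 12.6 points that comes almost entirely from the jump in recall.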

[Figure: word alignment between the English example sentence and the Chinese example sentence]

English Semantic Class Results

Experiment Setup
- Start with the Chinese verbs from the Chinese semantic class experiment and retrieve the English verbs mapped to them in the Xinhua corpus (excluding light verbs, single occurrences, and single-member verb sets)

Results
- Number of English verb sets: 57
- Total number of English verbs: 127

[Chart: taxonomy tree height from each verb to the lowest common hypernym within its verb set]

[Chart: number of sense merges required]

[Chart: resulting number of semantic classes for each verb set]

Summary and Acknowledgements

- Exploring symmetric predicate similarity using PropBank predicate-argument structure
- Automatically generating English-to-Chinese and Chinese-to-English semantic class mapping
- Verifying English semantic class mapping using WordNet

We gratefully acknowledge the support of the National Science Foundation Grants CISE-CRI-0551615, and a grant from the Defense Advanced Research Projects Agency (DARPA/IPTO) under the GALE program, DARPA/CMO Contract No. HR0011-06-C-0022, subcontract from BBN, Inc. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

Chinese Semantic Class Results

Experiment Setup
- Choose the 50 most diversely-mapped (to Chinese) English verbs from the Xinhua corpus (excluding light verbs and single occurrences)

Results
- Total number of Chinese verbs: 218
- Average membership of Chinese semantic class: 4.36
- Human score: