Iterative Translation Disambiguation for Cross Language Information Retrieval
Christof Monz and Bonnie J. Dorr
Institute for Advanced Computer Studies, University of Maryland
SIGIR 2005
INTRODUCTION
Query translation requires access to some form of translation dictionary
Use a machine translation system to translate the entire query into the target language
Use a dictionary to produce a number of target-language translations for words or phrases in the source language
Use a parallel corpus to estimate the probabilities that w in the source language translates into w' in the target language
INTRODUCTION
An approach that does not require a parallel corpus to induce translation probabilities, needing only:
a machine-readable dictionary (without any rankings or frequency statistics)
a monolingual corpus in the target language
TRANSLATION SELECTION
Translation ambiguity is very common
One option is to apply word-sense disambiguation, but:
For most languages the appropriate resources do not exist
Word-sense disambiguation is a non-trivial enterprise
TRANSLATION SELECTION
Our approach uses co-occurrences between terms to model context for the problem of word selection
Ex. s1 => t1,1, t1,2, t1,3; s2 => t2,1, t2,2; s3 => t3,1 (each source term si has a set of candidate translations ti,j)
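To make the notation concrete, here is a minimal sketch of how the candidate translation sets and the cross-term candidate pairs could be represented in Python; the term names are placeholders following the slide's notation, not actual dictionary entries:

```python
# Candidate translation sets: each source term si maps to its candidates ti,j.
# The names are placeholders following the slide's notation.
candidates = {
    "s1": ["t1_1", "t1_2", "t1_3"],
    "s2": ["t2_1", "t2_2"],
    "s3": ["t3_1"],
}

# Co-occurrence evidence is gathered only between candidates of *different*
# source terms; each unordered pair of source terms is listed once.
pairs = [
    (t, t2)
    for s, ts in candidates.items()
    for s2, ts2 in candidates.items() if s < s2
    for t in ts
    for t2 in ts2
]
```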
TRANSLATION SELECTION
Computing co-occurrence statistics for a larger number of terms induces a data-sparseness issue
Possible remedies: use very large corpora (the Web) or apply smoothing techniques
ITERATIVE DISAMBIGUATION
Only examine pairs of terms in order to gather partial evidence for the likelihood of a translation in a given context
ITERATIVE DISAMBIGUATION
Assume that ti,1 occurs more frequently with tj,1 than any other pair of translation candidates for si and sj
On the other hand, assume that ti,1 and tj,1 do not co-occur with tk,1 at all, but ti,2 and tj,2 do
Which should be preferred: ti,1 and tj,1, or ti,2 and tj,2?
ITERATIVE DISAMBIGUATION
Associate with each translation candidate a weight (t is a translation candidate for si)
Each term weight is recomputed based on two different inputs: its own previous weight, and the weights of the terms that link to the term (WL(t,t') = link weight between t and t')
ITERATIVE DISAMBIGUATION
Normalize term weights after each iteration, so that the weights of the translation candidates of each source term sum to one
The iteration stops if the changes in term weights become smaller than some threshold (WT = the vector of all term weights, vk = kth element in the vector)
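A minimal sketch of this iterative reweighting loop in Python, assuming uniform initial weights over each source term's candidates and a precomputed pairwise association function as the link weight; the paper's exact update formula may differ in detail:

```python
from collections import defaultdict

def iterative_disambiguation(candidates, link_weight, threshold=1e-4, max_iter=50):
    """Iteratively reweight translation candidates.

    candidates: dict mapping each source term to its list of candidate
        translations (candidate strings assumed unique across source terms).
    link_weight: function (t, t2) -> association strength between two
        target-language terms, e.g. one of the measures on the next slide.
    """
    # Uniform initialization: each candidate of s starts at 1 / |tr(s)|.
    w = {t: 1.0 / len(ts) for ts in candidates.values() for t in ts}
    owner = {t: s for s, ts in candidates.items() for t in ts}

    for _ in range(max_iter):
        # Update: add the association-weighted evidence from the candidates
        # of all *other* source terms to each candidate's previous weight.
        new_w = {}
        for t in w:
            inlink = (t2 for t2 in w if owner[t2] != owner[t])
            new_w[t] = w[t] + sum(link_weight(t, t2) * w[t2] for t2 in inlink)

        # Normalize so the candidates of each source term sum to one.
        totals = defaultdict(float)
        for t, wt in new_w.items():
            totals[owner[t]] += wt
        new_w = {t: wt / totals[owner[t]] for t, wt in new_w.items()}

        # Stop when the largest per-term change falls below the threshold.
        done = max(abs(new_w[t] - w[t]) for t in w) < threshold
        w = new_w
        if done:
            break
    return w
```

With the example candidate sets above and, say, the Dice coefficient plugged in as link_weight, the weights drift toward the candidates that co-occur most consistently across the query's source terms.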
ITERATIVE DISAMBIGUATION
There are a number of ways to compute the association strength between two terms:
Mutual information (MI)
Dice coefficient
Log-likelihood ratio
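Sketches of these three measures, assuming document-level co-occurrence counts from the monolingual target-language corpus (n_xy = documents containing both terms, n_x and n_y = documents containing each term, n = total documents); the paper's exact contingency definitions may differ:

```python
import math

def pointwise_mi(n_xy, n_x, n_y, n):
    """(Pointwise) mutual information; assumes n_x > 0 and n_y > 0."""
    if n_xy == 0:
        return 0.0
    return math.log(n_xy * n / (n_x * n_y))

def dice(n_xy, n_x, n_y):
    """Dice coefficient: twice the joint count over the sum of the marginals."""
    return 2.0 * n_xy / (n_x + n_y) if n_x + n_y > 0 else 0.0

def log_likelihood_ratio(n_xy, n_x, n_y, n):
    """Dunning-style -2 log lambda from the 2x2 contingency table;
    assumes 0 < n_y < n. Degenerate probabilities contribute zero."""
    def ll(k, m, p):
        # Binomial log-likelihood of k successes in m trials.
        if p <= 0.0 or p >= 1.0:
            return 0.0
        return k * math.log(p) + (m - k) * math.log(1.0 - p)
    k1, m1 = n_xy, n_y            # docs with x among the docs containing y
    k2, m2 = n_x - n_xy, n - n_y  # docs with x among the remaining docs
    p, p1, p2 = n_x / n, k1 / m1, k2 / m2
    return 2.0 * (ll(k1, m1, p1) + ll(k2, m2, p2) - ll(k1, m1, p) - ll(k2, m2, p))
```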
ITERATIVE DISAMBIGUATION
Example
EXPERIMENT Set-Up: Test Data
CLEF 2003 English-to-German bilingual data
Contains 60 topics, four of which were removed by the CLEF organizers because they had no relevant documents
Each topic has a title, a description, and a narrative field; for our experiments, we used only the title field to formulate the queries
EXPERIMENT Set-Up: Morphological Normalization
Since the dictionary only contains base forms, the words in the topics must be mapped to their respective base forms as well
Compounds are very frequent in German; instead of de-compounding, we use character 5-grams, an approach that yields almost the same retrieval performance as de-compounding
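A minimal sketch of character 5-gram tokenization; the exact details (lowercasing, handling of words shorter than n) are assumptions, not the paper's specification:

```python
def char_ngrams(word, n=5):
    """Split a word into overlapping character n-grams.
    Words shorter than n are kept whole so short terms are not lost."""
    word = word.lower()
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]

# A German compound shares most of its 5-grams with its component words,
# so matching works without explicit de-compounding:
print(char_ngrams("Frauenkonferenz"))
# ['fraue', 'rauen', 'auenk', 'uenko', 'enkon', 'nkonf',
#  'konfe', 'onfer', 'nfere', 'feren', 'erenz']
```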
EXPERIMENT Set-Up: Example Topics
Intermediate results of the query formulation process
EXPERIMENT Set-Up: Retrieval Model
Lnu.ltc weighting scheme
We used sl = 0.1, pv = the average number of unique words per document, uwd = the number of unique words in document d, w(i) = the weight of term i
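For reference, a sketch of the Lnu document-term weight with pivoted unique-word normalization, following Singhal et al.'s standard formulation; whether the paper uses exactly this variant is an assumption:

```python
import math

def lnu_weight(tf, mean_tf, uwd, pv, sl=0.1):
    """Lnu document term weight (pivoted unique-word normalization).

    tf: frequency of the term in the document (assumed > 0)
    mean_tf: average term frequency within the document
    uwd: number of unique words in document d
    pv: average number of unique words per document in the collection
    sl: slope parameter (0.1 in the experiments)
    """
    l_part = (1.0 + math.log(tf)) / (1.0 + math.log(mean_tf))  # L: averaged log tf
    nu_part = (1.0 - sl) * pv + sl * uwd                       # nu: pivoted norm
    return l_part / nu_part
```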
Experimental Results
Experimental Results
Individual average precision decreases for a number of queries
6% of all English query terms were not in the dictionary
Unknown words are treated as proper names, and the original word from the source language is included in the target-language query
Ex. the word "Women" is falsely considered a proper noun; although faulty translations of this type affect both the baseline system and the run using term weights, the latter is affected more severely
RELATED WORK
Pirkola's approach does not consider disambiguation at all
Jang's approach uses MI to re-compute translation probabilities for cross-language retrieval
It only considers mutual information between consecutive terms in the query and does not compute the translation probabilities in an iterative fashion
RELATED WORK
Adriani's approach is similar to the approach by Jang and does not benefit from using multiple iterations
Gao et al. use a decaying mutual-information score in combination with syntactic dependency relations; we did not consider distances between words
RELATED WORK
Maeda et al. compare a number of co-occurrence statistics with respect to their usefulness for improving retrieval effectiveness
They consider all pairs of possible translations of words in the query
They use co-occurrence information to select translations of words from the topic for query formulation, instead of re-weighting them
RELATED WORK
Kikui's approach only needs a dictionary and monolingual resources in the target language
It computes the coherence between all possible combinations of translation candidates of the source terms
CONCLUSIONS
We introduced a new algorithm for computing topic-dependent translation probabilities for cross-language information retrieval
We experimented with different term association measures; experimental results show that the log-likelihood ratio has the strongest positive impact on retrieval effectiveness
CONCLUSIONS
An important advantage of our approach is that it only requires a bilingual dictionary and a monolingual corpus
An issue that remains open at this point is the handling of query terms that are not covered by the bilingual dictionary