TUFIŞ, BARBU, ION: EXTRACTING MULTILINGUAL LEXICONS 1
Extracting multilingual lexicons from parallel
corpora
DAN TUFIŞ1, ANA MARIA BARBU2 AND RADU ION3
1,2,3 Romanian Academy (RACAI), 13 “13 Septembrie”, 050711, Bucharest 5, Romania
[email protected], [email protected], [email protected]
Abstract. The paper describes our recent developments in automatic extraction of translation
equivalents from parallel corpora. We describe three increasingly complex algorithms: a simple
baseline iterative method, and two more elaborate non-iterative versions. While the baseline
algorithm is mainly described for illustrative purposes, the non-iterative algorithms outline the
use of different working hypotheses which may be motivated by different kinds of applications
and to some extent by the languages concerned. The first two algorithms rely on cross-lingual
POS preservation, while with the third one POS invariance is not an extraction condition. The
evaluation of the algorithms was conducted on three different corpora and several pairs of
languages.
Keywords: alignment, evaluation, lemmatization, tagging, translation equivalence
1 INTRODUCTION
Automatic extraction of bilingual lexicons from parallel texts might seem a futile
task, given that more and more bilingual lexicons are printed nowadays and they can be
easily turned into machine-readable lexicons. However, if one considers the
possibility of automatically enriching the presently available electronic lexicons, with very
limited manpower and lexicographic expertise, the problem reveals a lot of potential.
The scientific and technological advancement in many domains is a constant source of
new-term coinage and therefore keeping up with multilingual lexicography in such areas
1 Computers and the Humanities Volume 38, Issue 2, May 2004, pp. 163-198 © 2004. Kluwer Academic Publishers. Printed in the Netherlands.
is very difficult unless computational means are used. On the other hand, bilingual
translation lexicons appear to be quite different from the corresponding printed lexicons
meant for human users. The marked difference between printed bilingual lexicons and
bilingual lexicons as needed for automatic translation is not really surprising. Traditional
lexicography deals with translation equivalence (the underlying concept of bilingual
lexicography) in an inherently discrete way. What is to be found in a printed dictionary
or lexicon (bi- or multilingual) is just a set of general basic translations. In the case of
specialised registers, general lexicons are usually not very useful.
The recent interest in semantic markup of texts, motivated by the Semantic Web
technologies, raises the issue of exploiting the markup existing in one language text to
automatically generate the semantic annotations in the second language parallel text.
Finding the lexical correspondences in a parallel text creates the possibility of
bidirectional import of semantic annotations that might exist in either of the two parallel
texts.
The basic concept in extracting translation lexicons is the notion of translation
equivalence relation (Gale and Church, 1991). One of the widely accepted definitions
(Melamed, 2001) of the translation equivalence defines it as a (symmetric) relation that
holds between two different language texts, such that expressions appearing in
corresponding parts of the two texts are reciprocal translations. These expressions are
called translation equivalents. A parallel text, or a bitext, having its translation
equivalents linked is an aligned bitext. Translation equivalence may be defined at
various granularity levels: paragraph, sentence, lexical. Automatic detection of the
translation equivalents in a bitext is increasingly more difficult as the granularity
becomes finer. Here we are concerned with the finest alignment granularity, namely the
lexical one. If not stated otherwise, in the rest of the paper by translation equivalents we
will mean lexical translation equivalents.
Most approaches to automatic extraction of translation equivalents roughly fall into two
categories. The hypotheses-testing methods such as (Gale and Church, 1991; Smadja et
al., 1996) rely on a generative device that produces a list of translation equivalence
candidates (TECs), each of them being subject to an independence statistical test. The
TECs that show an association measure higher than expected under the independence
assumption are assumed to be translation-equivalence pairs (TEPs). The TEPs are
extracted independently of one another and therefore the process might be characterised
as a local maximisation (greedy) one. The estimating approach (e.g. Brown et al., 1993;
Kupiec, 1993; Hiemstra, 1997) is based on building a statistical bitext model from data,
the parameters of which are to be estimated according to a given set of assumptions. The
bitext model allows for global maximisation of the translation equivalence relation,
considering not individual translation equivalents but sets of translation equivalents
(sometimes called assignments). There are pros and cons for each type of approach,
some of them discussed in (Hiemstra, 1997).
Our method comes closer to the hypotheses-testing approach. It generates first a list of
translation equivalent candidates and then successively extracts the most likely
translation-equivalence pairs. The extraction process does not need a pre-existing
bilingual lexicon for the considered languages. Yet, if such a lexicon exists, it can be
used to eliminate spurious candidate translation-equivalence pairs and thus to speed up
the process and increase its accuracy.
2 CORPUS ENCODING
In our experiments, we used three parallel corpora. The largest one, henceforth
“NAACL2003”, is bilingual (Romanian and English), contains 866,036 words in the
English part and 770,635 words in the Romanian part, and consists mainly of journalistic
texts. The raw texts in this corpus have been collected and provided by Rada Mihalcea
from the University of North Texas for the purpose of the Shared Task on word-
alignment organised by Rada Mihalcea and Ted Pedersen at the HLT-NAACL2003
workshop on “Building and Using Parallel Texts: Data Driven Machine Translation and
Beyond“ (see http://www.cs.unt.edu/~rada/wpt/). The smallest parallel text, henceforth
“VAT”, is trilingual (French, Dutch and English), contains about 42,000 words per
language, and is a legal text (the EEC Sixth VAT Directive, 77/388/EEC). It was
built for the FF-POIROT European project (http://www.starlab.vub.ac.be/
research/projects/poirot/), as a testbed for multilingual term extraction and alignment.
The third corpus, henceforth “1984”, is the result of the MULTEXT-EAST and
CONCEDE European projects and it is based on Orwell’s novel Nineteen Eighty-Four,
translated in 6 languages (Bulgarian, Czech, Estonian, Hungarian, Romanian, Slovene)
with the English original included as the hub. Each translation was aligned to the
English hub, thus yielding 6 bitexts. From these 6 integral bitexts, containing on average
100,000 words per language, we selected only the sentences that were 1:1 aligned with
the English original, thus obtaining the 7-language parallel corpus with an average of
80,000 words per language. Out of the three corpora this is the most accurate, being
hand validated (Erjavec & Ide, 1998; Tufiş et al. 1998).
The input to our algorithms is represented by a parallel corpus encoded according to a
simplified version of XCES specification (http://www.cs.vassar.edu/XCES). This
encoding requires preliminary pre-processing of each monolingual part of the parallel
corpus and, afterwards, the sentence alignment of all monolingual texts. The aligned
fragments of text, in two or more languages present in the parallel corpus, make a
translation unit. Each translation unit consists of several segments, one per language. A
segment is made of one uniquely identified sentence. Each sentence is made up of one or
more tokens for which the lemma and the morpho-syntactic code are explicitly encoded
as tag attributes. More often than not, a token corresponds to what is generally called a
word, but this is not always the case. Depending on the lexical resources used in the
(monolingual) text segmentation, a multiword expression may be treated as a single
lexical token and encoded as such. As an example of the encoding used by our
algorithms, Figure 1 shows the translation unit "Ozz.42" of the “1984” corpus: <tu id="Ozz.42"> <seg lang="en"> <s id="Oen.1.1.10.2"> <tok lemma="there" ana="Pt3">There</tok> <tok lemma="be" ana="Vmis-p">were</tok> <tok lemma="no" ana="Dg">no</tok> <tok lemma="window" ana="Ncnp">windows</tok> <tok lemma="in" ana="Sp">in</tok> <tok lemma="it" ana="Pp3ns">it</tok> <tok lemma="at_all" ana="Rmp">at all</tok> <c>.</c> </s> </seg> <seg lang="ro"> <s id="Oro.1.2.10.2"> <tok lemma="nu" ana="Qz">Nu</tok> <tok lemma="avea" ana="Vmii3s">avea</tok> <tok lemma="deloc" ana="Rgp">deloc</tok> <tok lemma="fereastră" ana="Ncfp-n">ferestre</tok> <c>.</c> </s> </seg> <seg lang="sl"> <s id="Osl.1.2.11.2"> <tok lemma="okno" ana="Ncnpg">Oken</tok> <tok lemma="na" ana="Spsl">na</tok> <tok lemma="on" ana="Pp3nsl--n-n">njem</tok> <tok lemma="sploh" ana="Q">sploh</tok> <tok lemma="biti" ana="Vcip3s--y">ni</tok> <tok lemma="biti" ana="Vcps-sna">bilo</tok> <c>.</c> </s> </seg> <seg lang="cs"> <s id="Ocs.1.1.10.2"> <tok lemma="mít" ana="Vmps-snay----n">Nemělo</tok> <tok lemma="vůbec" ana="Rgp">vůbec</tok> <tok lemma="okno" ana="Ncnpa">okna</tok> <c>.</c> </s> </seg> <seg lang="bg"> <s id="Obg.1.1.9.2"><tok lemma="то" ana="PP3">То</tok> <tok lemma="изобщо" ana="RG"> изобщо</tok> <tok lemma="нямам" ana="VMII3S">нямаше</tok> <tok lemma="прозорец" ana="NCMP-N"> прозорци</tok>
5 Computers and the Humanities Volume 38, Issue 2, May 2004, pp. 163-198 © 2004. Kluwer Academic Publishers. Printed in the Netherlands.
TUFIŞ, BARBU, ION: EXTRACTING MULTILINGUAL LEXICONS 6
<c>.</c> </s> </seg> <seg lang="et"> <s id="Oet.1.2.10.2"> <tok lemma="see" ana="Pd--s3">Sel</tok> <tok lemma="ei" ana="Va------y">ei</tok> <tok lemma="olema" ana="Vmii---ay">olnud</tok> <tok lemma="üks" ana="Pi--s1">ühtki</tok> <tok lemma="aken" ana="Nc-s1">akent</tok> <c>.</c> </s> </seg> <seg lang="hu"> <s id="Ohu.1.2.10.2"> <tok lemma="egyáltalán" ana="Rg">Egyáltalán</tok> <tok lemma="nem" ana="Rg">nem</tok> <tok lemma="van" ana="Vmis3p---n">voltak</tok> <tok lemma="õ" ana="P---sp">rajta</tok> <tok lemma="ablak" ana="Nc-pn">ablakok</tok> <c>.</c> </s> </seg> </tu>
Figure 1: Corpus encoding for the translation equivalence extraction algorithms
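This encoding can be consumed with any XML toolkit. As a minimal sketch (using Python's standard xml.etree.ElementTree over a shortened, two-language translation unit; the function name is ours), one might read out the (lemma, tag) pairs per language like this:

```python
import xml.etree.ElementTree as ET

# A shortened two-language translation unit in the Figure 1 encoding.
TU_XML = """<tu id="Ozz.42">
 <seg lang="en">
  <s id="Oen.1.1.10.2">
   <tok lemma="there" ana="Pt3">There</tok>
   <tok lemma="be" ana="Vmis-p">were</tok>
   <tok lemma="no" ana="Dg">no</tok>
   <tok lemma="window" ana="Ncnp">windows</tok>
   <c>.</c>
  </s>
 </seg>
 <seg lang="ro">
  <s id="Oro.1.2.10.2">
   <tok lemma="nu" ana="Qz">Nu</tok>
   <tok lemma="avea" ana="Vmii3s">avea</tok>
   <tok lemma="fereastră" ana="Ncfp-n">ferestre</tok>
   <c>.</c>
  </s>
 </seg>
</tu>"""

def tokens_by_language(tu_xml):
    """Return {language code: [(lemma, morpho-syntactic tag), ...]};
    punctuation elements (<c>) are simply skipped."""
    tu = ET.fromstring(tu_xml)
    return {
        seg.get("lang"): [(tok.get("lemma"), tok.get("ana"))
                          for tok in seg.iter("tok")]
        for seg in tu.findall("seg")
    }
```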
The next section briefly describes the pre-processing steps, used by the corpus
generation module that provides the input for the translation equivalence extraction
algorithms.
3 PRELIMINARY PROCESSING
3.1 SEGMENTATION; WORDS AND MULTIWORD LEXICAL TOKENS
A lexical item is usually considered to be a space- or punctuation-delimited string of
characters. However, especially in multilingual studies, it is convenient, and frequently
linguistically motivated, to consider some sequences of traditional words as making up a
single lexical unit. For translation purposes considering multiword expressions as single
lexical units is a regular practice justified both by conceptual and computational reasons.
The recognition of multiword expressions as single lexical tokens, and the splitting of
single words into multiple lexical tokens (when it is the case) are generically called text
segmentation and the program that performs this task is called a segmenter or a
tokenizer. In the following we will refer to words and multiword expressions as lexical
tokens or, simply, tokens.
The multilingual segmenter we used, MtSeg, is a public domain tool
(http://www.lpl.univ-aix.fr/projects/multext/MtSeg/) and was developed by Philippe di
Cristo within the MULTEXT project. The segmenter is able to recognise dates,
numbers, various fixed phrases, to split clitics or contractions, etc. We implemented a
collocation extractor, based on NSP, an n-gram statistical package
(http://www.d.umn.edu/~tpederse/nsp.html) developed by Ted Pedersen. The list of
generated n-grams is subject to a regular expression filtering that considers language-
specific constituency restrictions. After validation, the new multi-word expressions may
be added to the segmenter’s resources. A complementary approach to overcome the
inherent incompleteness of the language-specific tokenization resources is described
in detail in (Tufiş, 2001).
3.2 SENTENCE ALIGNMENT
We used a slightly modified version of Gale and Church’s CharAlign sentence aligner
(Gale and Church, 1993). In general, sentence alignments of all bitexts of our
multilingual corpora are of the type 1:1, i.e. in most cases (more than 95%) one sentence
is translated as one sentence. In the following we will refer to the alignment units as
translation units (TU). In general, sentence alignment is a highly accurate process, and
in our corpora alignment is in fact error-free, either because of manual validation and
correction (“1984”) or because the raw texts were published already aligned by their
authors (“VAT” and “NAACL2003”).
3.3 TAGGING AND LEMMATIZATION
For highly inflectional languages, morphological variation may generate diffusion of the
statistical evidence for translation equivalents. In order to avoid data sparseness we
added a tagging and lemmatization phase as a front-end pre-processing of the parallel
corpus. For instance, the English adjective “unacceptable”, occurring nine times in one
of our corpora, has been translated in Romanian by nine different word-forms,
representing inflected forms (singular/plural, masculine/feminine) of three adjectival
lemmas (inacceptabil, inadmisibil, intolerabil): inacceptabil, inacceptabile, inadmisibil,
inadmisibile, inadmisibilă, inadmisibilului, intolerabil, intolerabilă and intolerabilei.
Without lemmatization all translation pairs would be “hapax-legomena” pairs and thus
their statistical recognition and extraction would be hampered. The lemmatization
ensured sufficient evidence for the algorithm to extract all three translations of the
English word.
The monolingual lexicons developed within MULTEXT-EAST contain, for each word-
form, its lemma and the morpho-syntactic codes that apply for the current word-form.
With these monolingual resources lemmatisation is a by-product of tagging: knowing the
word-form and its associated tag, the lemma extraction, for those words that are in the
lexicon, is just a matter of lexicon lookup; for unknown words, the lemma is implicitly
set to the word-form itself, unless a lemmatiser is available. Erjavec and Ide (1998)
provide a description of the MULTEXT-EAST lexicon encoding principles. A detailed
presentation of their application to Romanian is given in (Tufiş et al., 1997).
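The lookup scheme just described can be sketched in a few lines (the lexicon entries below are a hypothetical fragment for illustration, not the actual MULTEXT-EAST data):

```python
# Hypothetical fragment of a word-form lexicon:
# (word-form, morpho-syntactic tag) -> lemma.
LEXICON = {
    ("were", "Vmis-p"): "be",
    ("windows", "Ncnp"): "window",
    ("ferestre", "Ncfp-n"): "fereastră",
}

def lemmatize(word_form, tag):
    """Lemmatization as a by-product of tagging: look up the
    (word-form, tag) pair; unknown words fall back to the
    word-form itself, as no separate lemmatiser is assumed."""
    return LEXICON.get((word_form, tag), word_form)
```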
For morpho-syntactic disambiguation we use a tiered-tagging approach with combined
language models (Tufiş, 1999) based on TnT - a trigram HMM tagger (Brants, 2000).
For Romanian, this approach has been shown to provide an average accuracy of more
than 98.5%. The tiered-tagging model relies on two different tagsets. The first one,
which is best suited for statistical processing, is used internally, while the second (used in a
morpho-syntactic lexicon and in most cases more linguistically motivated) is used in the
tagger’s output. The mapping between the two tagsets is in most cases deterministic (via
a lexicon lookup) or, in the rare cases where it is not, a few regular expressions may
solve the non-determinism. The idea of tiered tagging works not only for very fine-
grained tagsets, but also for very low-information tagsets, such as those containing only
part of speech. In such cases the mapping from the hidden tagset to the coarse-grained
tagset is strictly deterministic. In (Tufiş, 2000) we showed that using the coarse-grained
tagset (14 non-punctuation tags) directly gave a 93% average accuracy, while using a
tiered tagging and combined language model approach (92 non-punctuation tags in the
hidden tagset) the accuracy was never below 99.5%.
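The hidden-to-output tag mapping might be sketched as follows. The lexicon fragment and the convention that a hidden tag is a prefix of the full morpho-syntactic descriptor are illustrative assumptions, not the actual tiered-tagging resources, and the few disambiguating regular expressions of the real system are not modelled:

```python
# Illustrative word-form lexicon: word-form -> fine-grained MSDs.
LEXICON = {
    "ferestre": ["Ncfp-n"],
    "avea": ["Vmii3s", "Vmii3p"],
}

def recover_msd(word_form, hidden_tag):
    """Map a hidden-tagset tag back to the fine-grained output tag.
    The mapping is deterministic (a lexicon lookup) when exactly one
    entry is compatible; ambiguous or unknown cases keep the hidden
    tag here, standing in for the rule-based resolution."""
    compatible = [msd for msd in LEXICON.get(word_form, [])
                  if msd.startswith(hidden_tag)]
    if len(compatible) == 1:
        return compatible[0]
    return hidden_tag
```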
4 LEXICONS EXTRACTION
4.1 UNDERLYING ASSUMPTIONS
Extracting translation equivalents from parallel corpora is a very complex task that can
easily turn into a computationally intractable enterprise. Fortunately, there are several
assumptions one can consider in order to simplify the problem and lower the
computational complexity of its solution. Yet, we have to mention that these empirical
simplifications usually produce information loss and/or noisy results. Post-processing, as
we will describe in section 5.3, may significantly improve both precision and recall by
eliminating some wrong translation equivalence pairs and finding some good ones,
previously undiscovered.
The assumptions we used in our basic algorithm are the following:
• a lexical token in one half of the TU corresponds to at most one non-empty lexical
unit in the other half of the TU; this is the 1:1 mapping assumption which underlies
the work of many other researchers (e.g. Kay and Röscheisen, 1993; Melamed,
2001; Ahrenberg et al. 2000; Hiemstra, 1997; Brew and McKelvie, 1996). When this
hypothesis does not hold, the result is a partial translation. However, remember that
a lexical token could be a multiple-word expression tokenized as such by an
adequate segmenter; non-translated tokens are not of interest here.
• a lexical token, if used several times in the same TU, is used with the same meaning;
this assumption is explicitly used also by (Melamed, 2001) and implicitly by all the
previously mentioned authors; the rationale for this assumption comes from the
pragmatics of regular natural language communication: the reuse of a lexical token,
in the same sentence and with a different meaning, generates extra cognitive load on
the recipient and thus is usually avoided; exceptions from this communicative
behavior, more often than not, represent either bad style or a game of words.
• a lexical token in one part of a TU can be aligned to a lexical token in the other part
of the TU only if the two tokens have the same part-of-speech; this is one very
efficient way to cut off the combinatorial complexity and avoid dealing with
irregular ways of cross-POS translations; as we will show in section 4.4 this
assumption can be nicely circumvented without too high a price in computational
performance;
• although word order is not an invariant of translation, it is not random either; when
two or more candidate translation pairs are equally scored, the one containing tokens
that are closer in relative position is preferred. This preference is also used in
(Ahrenberg et al. 2000).
Based on sentence alignment, POS tagging and lemmatisation, the first step is to
compute a list of translation equivalence candidates (TECL). By collecting all the tokens
of the same POSk (in the order they appear in the text and removing duplicates) in each
part of TUj (the j-th translation unit) one builds the ordered sets LSj^POSk and
LTj^POSk. For each POSi let TUj^POSi be defined as LSj^POSi ⊗ LTj^POSi, with ‘⊗’
representing the Cartesian product operator. Then, CTUj (candidates in the j-th TU) is
defined as follows:

    CTUj = ∪ (i = 1 .. number of POSes) TUj^POSi

With these notations, and considering that there are n translation units in the whole
bitext, TECL is defined as:

    TECL = ∪ (j = 1 .. n) CTUj
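The construction of CTUj for one translation unit can be sketched as follows (the token representation and names are ours; TECL is then the multiset union of CTUj over all n translation units):

```python
from itertools import product

def candidate_pairs(tu_source, tu_target):
    """Build CTU_j for one translation unit: for every POS, pair the
    (deduplicated, order-preserving) source tokens of that POS with
    the target tokens of the same POS (Cartesian product).
    Tokens are (lemma, POS) pairs."""
    def by_pos(tokens):
        ordered = {}
        for lemma, pos in tokens:
            ordered.setdefault(pos, [])
            if lemma not in ordered[pos]:   # remove duplicates, keep order
                ordered[pos].append(lemma)
        return ordered

    src, tgt = by_pos(tu_source), by_pos(tu_target)
    pairs = []
    for pos in src:
        if pos in tgt:
            pairs.extend(product(src[pos], tgt[pos]))
    return pairs
```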
TECL contains a lot of noise and many TECs are very improbable, so filtering is
necessary. Any filtering would eliminate many wrong TECs but also some good ones.
The ratio between the number of good TECs rejected and the number of wrong TECs
rejected is the criterion we used in deciding which test to use and what would be the
threshold score below which any TEC will be removed from TECL. After various
empirical tests we decided to use the loglikelihood test (LL) with the threshold set to 9.
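For one candidate pair, the log-likelihood score can be computed from its 2×2 co-occurrence counts; a compact sketch of the standard G² statistic (variable names are ours):

```python
import math

LL_THRESHOLD = 9  # candidates scoring below this are dropped from TECL

def log_likelihood(n11, n1_, n_1, n):
    """G^2 log-likelihood ratio for a candidate pair, from its 2x2
    contingency counts: n11 co-occurrences, n1_ and n_1 marginal
    occurrence counts, n the total number of translation units."""
    observed_expected = [
        (n11, n1_ * n_1 / n),
        (n1_ - n11, n1_ * (n - n_1) / n),
        (n_1 - n11, (n - n1_) * n_1 / n),
        (n - n1_ - n_1 + n11, (n - n1_) * (n - n_1) / n),
    ]
    # 2 * sum of O * ln(O/E); empty cells contribute nothing
    return 2 * sum(o * math.log(o / e) for o, e in observed_expected if o > 0)
```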
4.2 THE BASELINE ALGORITHM (BASE)
Our baseline is a simple iterative algorithm and has some similarities to the algorithm
presented in (Ahrenberg et al. 2000) but unlike it, our algorithm avoids computing
various probabilities (or, more precisely, probability estimates) and scores (t-score).
Based on the TECL, an initial m × n contingency table (TBL0) is constructed for each
POS (see Figure 2), with m the number of token types in the first part of the bitext
and n the number of token types in the other part of the bitext.
Figure 2. Contingency table with counts for TECs at step k

          TT1   …   TTn
   TS1    n11   …   n1n   n1*
   …      …     …   …     …
   TSm    nm1   …   nmn   nm*
          n*1   …   n*n   n**
The rows of the table are indexed by the distinct source tokens and the columns are
indexed by the distinct target tokens (of the same POS). Each cell (i,j) contains the
number of occurrences in TECL of <TSi, TTj>. All the pairs <TSi, TTj> that at step k
satisfy the equation below (EQ1) are recorded as TEPs and removed from the
contingency table TBLk (the cells (i,j) are zeroed) thus obtaining a new contingency
table TBLk+1.
(EQ1)  TPk = { <TSi, TTj> ∈ TBLk | ∀p,q: (nij ≥ niq) ∧ (nij ≥ npj) }
Equation (EQ1) expresses the common intuition that in order to select <TSi, TTj> as a
translation equivalence pair, the number of associations of TSi with TTj must be higher
than (or at least equal to) any other TTp (p≠j). The same holds the other way around. One
of the main deficiencies of the BASE algorithm is that it is quite sensitive to what
(Melamed, 2001) calls indirect associations. If <TSi, TTj> has a high association score
and TTj collocates with TTk, it might very well happen that <TSi, TTk> also gets a high
association score. Although, as observed by Melamed, the indirect associations
generally have lower scores than the direct (correct) ones, they could receive higher
scores than many correct pairs and this not only generates wrong translation equivalents,
but also eliminates from further considerations several correct pairs. To weaken this
sensitivity, we had to additionally impose an occurrence threshold for the selected pairs,
so that the equation (EQ1) became:
(EQ2)  TPk = { <TSi, TTj> ∈ TBLk | ∀p,q: (nij ≥ niq) ∧ (nij ≥ npj) ∧ (nij ≥ 3) }
This modification significantly improved the precision (more than 98%) but seriously
degraded the recall, more than 75% of correct pairs being missed. The BASE
algorithm’s sensitivity to the indirect associations, and thus the necessity of an
occurrence threshold, is explained by the fact that it looks at the association scores
globally, not checking whether the tokens in a TEC are both in the same TU.
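The selection step of one BASE iteration, with the EQ2 occurrence threshold, can be sketched as follows (the contingency table is represented as a plain list of count rows; names are ours):

```python
def select_teps(table, min_count=3):
    """One BASE iteration over the contingency table: select every
    cell that is both a row maximum and a column maximum (EQ1) and
    meets the occurrence threshold n_ij >= 3 (EQ2), then zero the
    selected cells to obtain the next table TBL_{k+1}."""
    teps = []
    for i, row in enumerate(table):
        for j, n_ij in enumerate(row):
            if n_ij < min_count:
                continue
            column = [table[p][j] for p in range(len(table))]
            if n_ij >= max(row) and n_ij >= max(column):
                teps.append((i, j))
    for i, j in teps:          # remove extracted pairs from the table
        table[i][j] = 0
    return teps
```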
4.3 A BETTER EXTRACTION ALGORITHM (BETA)
To diminish the influence of indirect associations and thus remove the occurrence
threshold, we modified the BASE algorithm so that the maximum score is not
considered globally but within each of the TUs. This brings BETA closer to the
competitive linking algorithm described in (Melamed, 2001). The competing pairs are
only the TECs generated from the current TU out of which the pair with the best LL-
score (computed, as before, from the entire corpus) is the first selected. Based on the 1:1
mapping hypothesis, any TEC containing either of the tokens in the winning pair is discarded.
Then, the next best scored TEC in the current TU is selected and again the remaining
pairs that include one of the two tokens in the selected pair are discarded. The multiple-
step control in BASE, where each TU was scanned several times (once in each
iteration), is not necessary anymore. The BETA algorithm sees each TU only once,
but the TU is processed until no further TEPs can be reliably extracted or the TU is
emptied. This modification improves both the precision and the recall as compared to
the BASE algorithm. When two or more TEC pairs of the same TU share the same
token, and they are equally scored, the algorithm has to make a decision and choose only
one of them, in accordance with the 1:1 mapping hypothesis. We used two heuristics:
string similarity scoring and relative distance.
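The per-TU extraction loop of BETA amounts to greedy competitive linking; a minimal sketch (the score function stands in for the corpus-wide LL score, and the two tie-breaking heuristics are omitted):

```python
def link_tu(tu_candidates, score):
    """Greedy competitive linking inside one translation unit:
    repeatedly select the best-scored remaining candidate pair and,
    by the 1:1 mapping hypothesis, discard every other pair that
    shares one of its tokens."""
    used_src, used_tgt, teps = set(), set(), []
    for ts, tt in sorted(tu_candidates, key=score, reverse=True):
        if ts not in used_src and tt not in used_tgt:
            teps.append((ts, tt))
            used_src.add(ts)
            used_tgt.add(tt)
    return teps
```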
The similarity measure, COGN(TS, TT), is very similar to the XXDICE score described in
(Brew and McKelvie, 1996). If TS is a string of k characters α1α2 . . . αk and TT is a
string of m characters β1β2 . . . βm then we construct two new strings T’S and T’T by
inserting special displacement characters into TS and TT where necessary. The
displacement characters will cause both T’S and T’T to have the same length p (max (k,
m)≤p<k+m) and the maximum number of positional matches. Let δ(αi) be the number of
displacement characters that immediately precede the character αi which matches the
character βi, and δ(βi) be the number of displacement characters that immediately
precede the character βi which matches the character αi. Let q be the number of
matching characters. With these notations, equation EQ3 defines the COGN(TS, TT)
similarity measure as follows:
(EQ3)  COGN(TS, TT) = (2 / (k + m)) · Σ (i = 1 .. q) 1 / (1 + |δ(αi) − δ(βi)|)   if q > 2
       COGN(TS, TT) = 0                                                          if q ≤ 2
Using the COGN test as a filtering device is a heuristic based on the cognate conjecture
which says that when the two tokens of a translation pair are orthographically similar,
they are very likely to have similar meanings (i.e. they are cognates). The threshold for
the COGN(TS, TT) test was empirically set to 0.42. This value depends on the pair of
languages in the bitext. The actual implementation of the COGN test includes a
language-dependent normalisation step, which strips some suffixes, discards the
diacritics, reduces some consonant doubling, etc. This normalisation step was hand
written, but, based on available lists of cognates, it could be automatically induced.
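An illustrative approximation of COGN can be written with Python's difflib: the offset of each matching block stands in for the displacement difference |δ(αi) − δ(βi)|, and the language-dependent normalisation step is omitted, so this is a sketch rather than the actual implementation:

```python
from difflib import SequenceMatcher

def cogn(ts, tt):
    """Approximate COGN(TS, TT): 0 when at most two characters match;
    otherwise 2/(k+m) times the displacement-penalised match count,
    where each matching block's offset |a - b| approximates the
    difference in preceding displacement characters."""
    blocks = SequenceMatcher(None, ts, tt).get_matching_blocks()
    q = sum(b.size for b in blocks)            # number of matching characters
    if q <= 2:
        return 0.0
    penalised = sum(b.size / (1 + abs(b.a - b.b)) for b in blocks)
    return 2 * penalised / (len(ts) + len(tt))
```

With the 0.42 threshold from the text, a cognate pair such as "unacceptable" / "inacceptabil" scores well above it, while unrelated tokens score 0.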
The second filtering condition, DIST(TS, TT) considers relative distance between the
tokens in a pair and is defined as follows (where n and m are indexes of TS and TT in the
considered TU):
if (<TS, TT> ∈ LSj^POSk ⊗ LTj^POSk) and (TS is the n-th element in LSj^POSk) and
(TT is the m-th element in LTj^POSk) then DIST(TS, TT) = |n − m|
The COGN(TS, TT) test is a more reliable heuristic than DIST(TS, TT), so that the TEC
with the highest similarity score is the preferred one. If the similarity score is irrelevant,
the weaker filter DIST(TS, TT) gives priority to the pairs with the smallest relative
distance between the constituent tokens.
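The DIST filter itself is a one-liner over the per-POS ordered sets of the current TU (names are ours):

```python
def dist(ls_pos, lt_pos, ts, tt):
    """DIST(TS, TT): absolute difference between the positions of the
    two tokens in the ordered per-POS sets LSj and LTj of the TU;
    among equally scored pairs, the smallest DIST wins."""
    return abs(ls_pos.index(ts) - lt_pos.index(tt))
```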
The main use, up to now, of the BETA algorithm was in the European project
BALKANET (Stamou et al. 2002) aimed at building a EuroWordNet-like lexical
ontology. We used this algorithm for automatic acquisition of bilingual Romanian-
English resources and also for consistency checking of the interlingual projection of the
consortium monolingual wordnets. The multilinguality of EuroWordNet and its
BALKANET extension is ensured by linking monolingual synsets to interlingual
records that correspond to the Princeton Wordnet synsets. If two or more monolingual
wordnets are consistently projected over the interlingual index, then translation
equivalents extracted from a parallel corpus should be (ideally) projected over the same
interlingual record, or, (more realistically) onto interlingual records that correspond to
closely related meanings (according to a given metric). For this particular use, POS
identity of the translation equivalents was a definite requirement. However, in general,
imposing POS identity on the translation equivalents is too restrictive for a series of
multilingual applications. On the other hand, in the vast majority of cases, the cross-
lingual variation of the POS for translation equivalents is not arbitrary. This observation
led us to the implementation of TREQ, an improved translation-equivalents extractor,
more general than BASE.
4.4 A FURTHER ENHANCED EXTRACTION ALGORITHM (TREQ)
Besides algorithmic developments to be discussed in this section, TREQ has been
equipped with a graphical user interface which integrates additional functionality for
exploiting parallel corpora (editing the parallel corpora, generating word alignment
maps, multi-word term extraction, building multi-lingual and multi-word terminological
glossaries, etc.).
In section 4.1 we described four simplifying assumptions used in the implementation of
the translation-equivalents extraction procedures. The implementation of TREQ
dispenses with two of them, namely the assumption that the translation equivalence
preserves the POS and the assumption that repeated tokens in a sentence have the same
meaning.
4.4.1 Meta-categories
As noted before, when translation equivalents have different parts of speech this
alternation is not arbitrary and it can be generalized. TREQ allows the user to define for
each language pair the possible POS alternations. A set of grammar categories in one
language that could be mapped by the translation-equivalence relation over one or more
categories in the other language is called a meta-category. The user defines for each
language the meta-categories and then specifies their interlingual correspondence. For
instance, English participles and gerunds are often translated with Romanian nouns or
adjectives and vice versa. So, for this pair of languages we defined, in both languages,
the meta-category MC1 subsuming common nouns, adjectives and (impersonal) verbs,
and stipulated that if the source lexical token belongs to MC1, then its translation
equivalent should belong to the same meta-category. Another example of a meta-
category we found useful, MC2, subsumes the following pronominal adjectives:
demonstrative, indefinite and negative. These types of adjectives are used differently in
the two languages (e.g. a negative adjective allowed in Romanian by the negative
concord phenomenon may correspond to an indefinite or even a demonstrative adjective in
English). For uniformity, any category not explicitly included in a user-defined meta-category is treated as the single member of an automatically generated meta-category. For such meta-categories, the cross-lingual mapping is equivalent to POS identity. For instance, the abbreviations, which in our multilingual corpora are labeled
with the tag X, are subsumed in this way by the MC30 meta-category. In order not to
lose information from the tagged parallel corpora, TREQ adds the meta-category
(actually a number) as a prefix to the actual tag of each token. The search space (TECL)
is computed as described in section 4.1, the only modification being that instead of POS
the meta-category prefix is used. Figure 3 shows the English and Romanian segments
from Figure 2 with the meta-category prefix added to the token tags.

<tu id="Ozz.42">
  <seg lang="en">
    <s id="Oen.1.1.10.2">
      <tok lemma="there" ana="22+Pt3">There</tok>
      <tok lemma="be" ana="1+Vmis-p">were</tok>
      <tok lemma="no" ana="2+Dg">no</tok>
      <tok lemma="window" ana="1+Ncnp">windows</tok>
      <tok lemma="in" ana="5+Sp">in</tok>
      <tok lemma="it" ana="13+Pp3ns">it</tok>
      <tok lemma="at_all" ana="14+Rmp">at all</tok>
      <c>.</c>
    </s>
  </seg>
  <seg lang="ro">
    <s id="Oro.1.2.10.2">
      <tok lemma="nu" ana="7+Qz">Nu</tok>
      <tok lemma="avea" ana="1+Vmii3s">avea</tok>
      <tok lemma="deloc" ana="14+Rgp">deloc</tok>
      <tok lemma="fereastră" ana="1+Ncfp-n">ferestre</tok>
      <c>.</c>
    </s>
  </seg>
</tu>
Figure 3: Corpus encoding using meta-categories for the POS tagging
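The prefixing mechanism described above can be sketched as follows. This is our own illustration, not the TREQ code: the POS-to-meta-category tables and the MC numbers below are invented for the example, and tags are assumed to follow the MULTEXT-EAST convention of encoding the part of speech in the first character.

```python
# Illustrative sketch of the meta-category prefixing step.
META_CATEGORIES = {
    "en": {"N": 1, "A": 1, "V": 1,  # MC1: common nouns, adjectives, verbs
           "D": 2},                 # MC2: pronominal adjectives/determiners
    "ro": {"N": 1, "A": 1, "V": 1,
           "D": 2},
}
_next_auto_id = {"value": 100}      # counter for auto-generated singletons

def prefix_tag(tag, lang):
    """Prefix a POS tag with its meta-category number, e.g. 'Ncnp' -> '1+Ncnp'.

    A tag whose POS is not covered by a user-defined meta-category receives
    a fresh, automatically generated singleton meta-category, mirroring the
    behaviour described in the text."""
    table = META_CATEGORIES[lang]
    pos = tag[0]
    if pos not in table:
        _next_auto_id["value"] += 1
        table[pos] = _next_auto_id["value"]
    return "%d+%s" % (table[pos], tag)
```

With these tables, `prefix_tag("Ncnp", "en")` yields `"1+Ncnp"`, while a tag such as `X` (abbreviations) falls into an automatically generated singleton meta-category.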
As the TECL becomes much larger with the introduction of meta-categories, the
memory book-keeping mechanisms were optimized to release memory that is no longer
needed and, in the case of large parallel corpora, to take advantage of disk-resident
virtual memory.
Besides accounting for real POS alternations in translation, the meta-category
mechanism also overcomes some tagging errors that would otherwise show up as POS
alternations. But probably the most important advantage of the meta-category
mechanism is the possibility of working with very different tagsets. In (Tufiş et al. 2003)
we describe a system (based on TREQ) participating in a shared task on Romanian-
English word-alignment. The English parts of the training and evaluation data were
tagged using the Penn TreeBank tagset while the Romanian parts were tagged using the
MULTEXT-EAST tagset. Using meta-categories was a very convenient way of coping
with the different encodings and granularities of the two tagsets.
Finally, we should observe that the algorithm by no means requires that the meta-
categories with the same cross-lingual identifier subsume the same grammatical
categories in the two languages; and also, that defining a meta-category that subsumes
all the categories in the languages considered is equivalent to completely ignoring the
POS information (thus tagging becomes unnecessary).
4.4.2 Repeated tokens
The second simplifying hypothesis which was dropped in the TREQ implementation
was to assume that the same token (with the same POS tag), used several times in the
same sentence, has the same meaning. Based on this assumption, in the previous
versions, only one occurrence of the token was preserved. As this hypothesis did not save
significant computational resources, we decided to keep all the repeated tokens. This
modification slightly improved the precision of the algorithm, allowing the extraction of
translation pairs that appeared in only one translation unit, but several times. Also, when
the tokens repeated in one language were translated differently (by synonyms) in the
other language, not purging the duplicates allowed the extraction of (synonymic)
translation pairs which would otherwise have been lost.
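The effect of keeping repeated tokens can be illustrated with a minimal sketch of the candidate-generation step (our own illustration, not the TREQ code): a lemma repeated in one language can now pair with each of its possibly synonymic translations in the other language.

```python
from itertools import product

def candidate_pairs(src_tokens, tgt_tokens):
    """Candidate translation pairs for one translation unit.

    Each token is a (lemma, metacategory) pair. Duplicates are kept, so a
    lemma repeated in one language can be paired with each of its (possibly
    synonymic) translations in the other language."""
    return [(sl, tl)
            for (sl, sc), (tl, tc) in product(src_tokens, tgt_tokens)
            if sc == tc]  # the meta-category must be preserved

# A repeated English noun translated by two Romanian synonyms:
pairs = candidate_pairs([("window", 1), ("window", 1)],
                        [("fereastra", 1), ("geam", 1)])
```

Had the duplicate "window" been purged, only one of the two synonymic target lemmas could have accumulated evidence for the pair.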
4.4.3 Other improvements
We evaluated the cognate conjecture for the Romanian-English language pair and found
it to be correct in more than 98% of the cases when the similarity threshold was set to
0.68. We also noted that many candidates, rejected either because of a low log-likelihood
score or because they occurred only once, were cognates. Therefore, we modified the
algorithm to also include in the list of extracted translation equivalents all the candidates
which, in spite of failing the log-likelihood test, have a cognate score above the 0.68 threshold.
This change improved both precision and recall (see next section).
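The cognate-based rescue step can be sketched as below. Since the exact COGN similarity measure is not reproduced in this section, difflib's string-similarity ratio is used here purely as a stand-in; the paper's actual measure may differ.

```python
from difflib import SequenceMatcher

def cognate_score(a, b):
    """String-similarity stand-in for the COGN score (illustrative only)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def rescue_cognates(rejected_candidates, threshold=0.68):
    """Re-admit candidate pairs that failed the log-likelihood test but
    whose cognate score exceeds the threshold (section 4.4.3)."""
    return [(s, t) for (s, t) in rejected_candidates
            if cognate_score(s, t) >= threshold]
```

For instance, a rejected pair such as (exemplu, example) scores well above 0.68 under this measure and would be rescued, while (fereastra, window) would not.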
4.4.4 The Graphical User Interface
The Graphical User Interface has been developed mainly for the purpose of validation
and correction (in context) of the translation equivalents, a task entrusted to linguists
without (much) computer training. Besides the lexical translation-equivalents
extraction, the Graphical User Interface incorporates several other useful corpus
management and mining utilities:
a) selecting a corpus from a collection of several corpora;
b) editing and correcting the tokenization, tagging or lemmatization;
c) updating accordingly the extracted lexicons;
d) extracting compound-term translations in one language based on an inventory of
compound terms in the other language;
e) extracting multi-word collocations (monolingually) for updating the segmenter’s
resources for the languages concerned.
Figure 4 exemplifies the parameter settings for the extraction process: the parallel corpus,
the language pairs, the statistical method used for independence-hypothesis testing, the
test threshold, the type of alignment (either by POS or by meta-categories), and, in the
case of POS alignment, which grammatical categories are of interest for the extracted
lexicon.
Figure 4: Parameter settings for a GUI-TREQ translation extraction session
Figure 5 displays the results of the extraction process. By displaying the running texts as
pairs of aligned sentences in two languages, the Graphical User Interface facilitates
evaluation in context of the extracted translation equivalents. When the user points to a
word in either language, its translation equivalent in the other language is displayed.
Figure 5: The “1984” corpus: Romanian-English translation equivalents extracted from the OZZ.113 translation unit
A detailed presentation of the facilities and operating procedures is given in the TREQ
user manual (Mititelu, 2003).
5 EXPERIMENTS AND EVALUATION
We conducted translation equivalents extraction experiments on the three corpora
mentioned before (“1984”, “VAT” and “NAACL2003”) and for various pairs of
languages.
The bilingual lexicons extracted from the integral bitexts for English-Estonian, English-
Hungarian, English-Romanian and English-Slovene were evaluated by native speakers
of the languages paired with English who had a good command of English. The
evaluation protocol specified that all the translation pairs are to be judged in context, so
that if one pair is found to be correct in at least one context, then it should be judged as
correct. The evaluation was done for both the BASE and BETA algorithms but on
different scales. The BASE algorithm was run on all the 6 integral bitexts with the
English hub, and 4 out of the 6 bilingual lexicons were hand-validated. The lexicons
contained all parts of speech defined in the MULTEXT-EAST lexicon specifications
except for interjections, particles and residuals. The BETA and TREQ algorithms were
run on the Romanian-English partial bitext extracted from the “1984” 7-language
parallel corpus and we validated only the noun pairs. For comparison purposes, we also
re-ran the BASE algorithm on the Romanian-English partial bitext.
The translation equivalents extracted from the “VAT” corpus by means of TREQ were
not explicitly evaluated, but were used in a multilingual term-extraction experiment for
the purposes of the FF-POIROT European project. The preliminary comparative
evaluation conducted by native speakers of French and Dutch, with excellent command
of English, showed that both precision (approx. 80%) and recall (approx. 75%) of our
results are significantly better than those of other extractors used in the comparison.
Since we do not yet have the full results of this evaluation, we will not go into further detail.
The bilingual lexicon extracted from the “NAACL2003” corpus by TREQ has been
evaluated based on the test data used by the organisers of the HLT-NAACL2003 Shared
Task on word-alignment. The test text has been manually aligned at word level. This
valuable data and the program that computes precision, recall and F-measure of any
alignment against a gold standard have been graciously made public after the closing of
the shared task competition. From the word-aligned bitext used for evaluation we
removed the null alignments (words not translated in either part of the bitext) and
purged the duplicate translation pairs, and thus obtained the gold standard Romanian-
English lexicon. The evaluation considered all the words. The tables below give an
overview of the corpora and the gold standard alignment text we used for the evaluation
of the translation-equivalents extractors.
Language             BU      CZ      EN      ET      HU      RO      SI
No. of tokens*       72020   66909   87232   66058   68195   85569   76177
No. of word forms*   15093   17659    9192   16811   19250   14023   16402
No. of lemmas*        8225    8677    6871    8403    9729    6987    7157

Figure 6. The "1984" corpus overview
* the counts refer only to 1:1 aligned sentences and do not include interjections, particles and residuals
Language             EN      FR      NL
No. of occurrences   41722   45458   40594
No. of word forms*   3473    3961    3976
No. of lemmas*       2641    2755    3165

Figure 7. The "VAT" corpus overview
                     "NAACL2003" corpus    Word-aligned bitext
Language             EN       RO           EN      RO
No. of tokens        866036   770653       4940    4563
No. of word forms*   27598    48707        1517    1787
No. of lemmas*       19139    23134        1289    1370

Figure 8. Overview of the "NAACL2003" corpus (left) and the word-aligned bitext (right)
5.1 THE EVALUATION OF THE BASE ALGORITHM
For validation purposes we limited the number of iteration steps to 4. The extracted
lexicons contain adjectives (A), conjunctions (C), determiners (D), numerals (M), nouns
(N), pronouns (P), adverbs (R), prepositions (S) and verbs (V). Figure 9 shows the
evaluation results provided by human evaluators2. The precision (Prec) was computed
as the number of correct TEPs divided by the total number of extracted TEPs. The recall
(considered for the non-English language in the bitext) was computed two ways: the first
one, Rec*, took into account only the tokens processed by the algorithm (those that
appeared at least three times). The second one, Rec, took into account all the tokens
irrespective of their frequency counts. Rec* is defined as the number of source lemma
types in the correct TEPs divided by the number of lemma types in the source language
with at least 3 occurrences. Rec is defined as the number of source lemma types in the
correct TEPs divided by the number of lemma types in the source language. The F-
measure is defined as 2*Prec*Rec/(Prec+Rec) and we consider it to be the most
informative score.
The rationale for showing Rec* is to estimate the proportion of the missed tokens out of
the considered ones. This might be of interest when precision is of the utmost
importance.
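For concreteness, the four measures defined above can be computed as in the following sketch (our own illustration; the function and variable names are assumptions):

```python
def evaluate_lexicon(extracted, correct, n_lemma_types, n_lemma_types_min3):
    """Prec, Rec*, Rec and F-measure as defined in section 5.1.

    `extracted` and `correct` are sets of (source_lemma, target_lemma)
    pairs (TEPs); the two counts are the number of source-language lemma
    types (all of them, and those occurring at least 3 times)."""
    prec = len(correct) / len(extracted)
    src_types = {s for (s, _) in correct}  # source lemma types in correct TEPs
    rec_star = len(src_types) / n_lemma_types_min3
    rec = len(src_types) / n_lemma_types
    f_measure = 2 * prec * rec / (prec + rec)
    return prec, rec_star, rec, f_measure
```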
Bitext (4 steps)     ET-EN               HU-EN               RO-EN               SI-EN
Entries              1911                1935                2227                1646
Prec/Rec/F-measure   96.18/18.79/31.16   96.89/19.27/32.14   98.38/25.21/40.13   98.66/22.69/36.89
Rec*                 57.86               56.92               58.75               57.92

Figure 9. “1984” integral bitexts; partial evaluation of the BASE algorithm after 4 iteration steps with the occurrence threshold set to 3
The evaluation of the lexicons was performed fully for Estonian, Hungarian and Romanian,
and partially for Slovene (the first step was fully evaluated, while the rest were evaluated
on randomly selected pairs). As one can see in Figure 9, the precision is higher than
98% for Romanian and Slovene, almost 97% for Hungarian and more than 96% for
Estonian. The Rec* measure ranges from 50.92% (Slovene) to 63.90% (Estonian). The
standard recall Rec varies between 19.27% and 32.46% (quite modest, since on average,
the BASE algorithm did not consider 60% of the lemmas). Due to the low Rec value,
the composite F-measure is also low (ranging between 31.16% and 41.13%) in spite of
the very good precision. Our analysis showed that the accuracy of the extracted entries
varies with the part of speech. Noun extraction had the second worst accuracy (adverbs
had the worst), and therefore we considered that an in-depth evaluation of this case
would be more informative than a global evaluation. Moreover, to facilitate the
comparison between the BASE and BETA algorithms, we set no limit on the number of
steps, lowered the occurrence threshold to 2 and extracted only the noun pairs from the
partial Romanian-English bitext included in the “1984” 7-language parallel corpus. The
BASE program stopped after 10 steps with 1673 extracted noun translation pairs, out of
which 112 were wrong (see Figure 10).
Compared with the 4-step run, the precision decreased to 93.30%, but both Rec
(36.45%) and the F-measure increased significantly, showing that an occurrence
threshold of 2 leads to a better precision/recall compromise than 3.
Noun types   Entries   Correct   Noun types in     Prec/Rec/F-measure
in text                entries   correct entries
3116         1673      1561      1136              93.30/36.45/52.42

Figure 10. “1984” corpus; evaluation of the BASE algorithm with the noun lexicon extracted from the Romanian-English partial bitext; 10 iteration steps, the occurrence threshold set to 2
If the occurrence threshold is removed, the sensitivity to indirect associations degrades
the precision of BASE too much for the lexicon to be really useful.
5.2 THE EVALUATION OF THE BETA ALGORITHM
The BETA algorithm preserves the simplicity of the BASE algorithm but significantly
improves its global performance (F-measure) due to a much better recall (Rec) obtained
at the expense of some loss in precision (Prec). Keeping the occurrence threshold set at
two (that is, ignoring hapax-legomena translation-equivalence candidates) the results of
BETA evaluation on the same data are shown in Figure 11 below:
Noun types   Entries   Correct   Noun types in     Prec/Rec/F-measure
in text                entries   correct entries
3116         2291      2183      1735              95.28/55.68/70.28

Figure 11. “1984” corpus; partial evaluation of the BETA algorithm (noun lexicon extracted from the partial Romanian-English bitext), the occurrence threshold set to 2
Moreover, the sensitivity to indirect associations is greatly reduced, so that removing the
occurrence threshold yields even better global results:
Noun types   Entries   Correct   Noun types in     Prec/Rec/F-measure
in text                entries   correct entries
3116         3128      2516      2114              80.43/67.84/73.60

Figure 12. “1984” corpus; partial evaluation of the BETA algorithm (noun lexicon extracted from the partial Romanian-English bitext), no occurrence threshold
Besides the occurrence threshold, the BETA algorithm offers another way to trade off
Prec for Rec: the COGN similarity score. In the experiments evaluated in Figure 12, the
threshold was set to 0.42.
We should mention that in spite of the general practice in computing recall for bilingual
lexicon-extraction tasks (be it Rec* or Rec), this is only an approximation of the real
recall. The reason for this approximation is that in order to compute the real recall one
should have a gold standard with all the words aligned by human evaluators. Usually
such a gold standard bitext is not available and the recall is either approximated as
above, or is evaluated on a small sample and the result is taken to be more or less true
for the whole bitext.
5.3 THE EVALUATION OF THE TREQ ALGORITHM
To facilitate comparison with the BASE and BETA algorithms we ran TREQ on the
same data and used the same evaluation procedure for the extracted noun-translation
pairs. The results are shown in Figure 13 and (as expected) they are superior to those
provided by BETA.
Noun types   Entries   Correct   Noun types in     Prec/Rec/F-measure
in text                entries   correct entries
3116         3001      2525      2084              84.14/66.88/74.52

Figure 13. “1984” corpus; evaluation of the TREQ algorithm (noun lexicon extracted from the Romanian-English partial bitext)
All the previous evaluations were based on an approximation of the recall measure,
motivated by the lack of a gold standard lexicon. As mentioned before, for the purpose
of the shared task on word alignment at NAACL2003 workshop, the organisers created a
short hand-aligned Romanian-English bitext (248 sentences) which was made public
after the competition. We used this word-alignment data to extract a Gold Standard
Romanian-English Lexicon allowing a precise evaluation of the recall. The complete set
of links in the word-aligned bitext contains 7149 links. Each token in either language is
bi-directionally linked to a token representing its translation in the other language or to
the empty string if it was not translated. Removing the empty links we were left with
6195 links representing pairs of translation equivalents: <RO-word EN-word>. Deleting
the links for punctuation, purging the links corresponding to identical lexical pairs and
eliminating the pairs not preserving the meta-category3 we obtained a Gold Standard
Lexicon containing 1706 entries. Out of these entries 1547 are POS-preserving
translation pairs, the rest being legitimate alternations. The Gold Standard Lexicon
includes all the grammatical categories defined in the revised MULTEXT-EAST
specifications for lexicon encoding (Erjavec, 2001). Figure 14 shows the exact
evaluation of the TREQ performances.
No. of entries in    No. of entries   No. of correct   Prec/Rec/F-measure
the Gold Standard    extracted        entries
1706                 1308             1041             79.58/61.01/69.06

Figure 14. “NAACL2003” word-aligned bitext; exact evaluation of the TREQ algorithm
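The derivation of the Gold Standard Lexicon from the hand-aligned links can be sketched as follows. This is a simplification of the actual filtering described above: in particular, the punctuation test is reduced to a single-character check and the meta-category-preservation filter is omitted.

```python
import string

def gold_standard_lexicon(links):
    """Derive a lexicon from hand-aligned (ro, en) links: drop null links,
    drop punctuation links, drop identical lexical pairs and purge
    duplicates (the meta-category filter used in the paper is omitted)."""
    lexicon = set()
    for ro, en in links:
        if not ro or not en:                 # null link (untranslated word)
            continue
        if ro in string.punctuation or en in string.punctuation:
            continue
        if ro == en:                         # identical lexical pair
            continue
        lexicon.add((ro, en))
    return lexicon
```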
The scores of the exact evaluation are significantly lower than expected, compared to the
approximate evaluation procedure used before on the “1984” corpus. Given the scarcity of
statistical evidence (the NAACL evaluation bitext is almost 20 times smaller than
the bitext extracted from the “1984” corpus), the performance decrease is not surprising.
On the other hand, the exact calculation of the recall shows that considering only lemma
types in one part of the bitext and of the lexicon (as the approximate recall calculation
does) slightly over-estimates the real recall by ignoring the multiple senses a lemma
might have. If we compute the recall as before, it shows an increase of more
than 2% (63.08%) and thus a better F-measure (70.67%).
We mentioned at the beginning of the paper that by adding a post-processing phase to
the basic translation-equivalence extraction procedure, one may further improve the
accuracy and coverage of the extracted lexicons. In the next section we give an overview
of such a post-processing phase, and show how the performance of the translation-
equivalence extraction was improved.
5.3.1 TREQ-AL and word-alignment maps
In (Tufiş et al. 2003) we described our TREQ-AL system which participated in the
Shared Task proposed by the organizers of the workshop on “Building and Using
Parallel Texts: Data Driven Machine Translation and Beyond” at the HLT-NAACL
2003 conference (http://www.cs.unt.edu/~rada/wpt). TREQ-AL builds on TREQ and
generates a word-alignment map for a parallel text (a bitext). The word alignment as it
was defined in the shared task is different from, and harder than, the problem of translation
equivalence as previously addressed. In a lexicon-extraction task, a translation pair is
considered correct if there is at least one context in which it has been correctly observed.
A multiply-occurring pair would count only once for the final lexicon. This is in sharp
contrast with the alignment task where each occurrence of the same pair counts equally.
The word-alignment task requires that each word (irrespective of its POS) and
punctuation mark in both parts of the bitext be paired to a translation in the other part (or
the null translation, if applicable). Such a pair is called a link. In a non-null link, both
elements of the link are non-empty words from the bitext. If either the source word or
the target word is not translated in the other language, this is represented by a null link.
Finally, the evaluations of the two tasks, even if both use the same measures, such as
precision and recall, have to be judged differently. The null links have no significance in
a lexicon-extraction task, while in a word-alignment task they play an important role (in
the Romanian-English gold standard data the null links represent 13.35% of the total
number of links). Since TREQ-AL is built on TREQ, any improvement in the precision
and recall of the extracted lexicons will have a crucial impact on the precision and recall
of the alignment links produced by TREQ-AL. This is also true the other way around: as
described in (Tufiş et al., 2003), several wrong translation pairs extracted by TREQ are
discarded by TREQ-AL and, moreover, many translation pairs missed by TREQ are
generated by the TREQ-AL alignment. This is clearly shown by the scores in Figure 15
as compared to those in Figure 13.
Noun types   Entries   Correct   Noun types in     Prec/Rec/F-measure
in text                entries   correct entries
3116         3724      3263      2450              87.62/75.08/80.87

Figure 15. “1984” corpus; evaluation of the TREQ-AL algorithm (noun lexicon extracted from the Romanian-English partial bitext)
The first three columns in Figure 16 give the initial evaluation of TREQ-AL on the
shared-task data.
             Non-null links only   Null links included   TREQ-AL lexicon
Precision    81.38%                60.43%                84.42%
Recall       60.71%                62.80%                77.72%
F-measure    69.54%                61.59%                80.93%
Figure 16. Evaluation of TREQ-AL in the “NAACL2003” shared task on word alignment
The error analysis pinpointed some minor programming errors and we were able to fix
them in a short period of time. We also decided to see how an external resource, namely
a bilingual seed lexicon, would improve the performance of TREQ and TREQ-AL. We
used our Romanian WordNet, under development, as a source for a seed bilingual
lexicon. The Romanian WordNet contains 11,000 verb and noun synsets which are
linked to the Princeton WordNet. From one Romanian synset SRO, containing M literals,
and its equivalent synset in English SEN, containing N literals, we generated M*N
translation pairs, thus producing a bilingual seed lexicon containing about 40,000
entries. This lexicon contains some noise since not all M*N translation pairs obtained
from two linked synsets are expected to be real translation-equivalence pairs4.
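Generating the seed lexicon from a pair of linked synsets amounts to taking the cross-product of their literals, as in the following sketch (the example synsets are invented):

```python
from itertools import product

def seed_pairs(synset_ro, synset_en):
    """All M*N candidate translation pairs obtained from one Romanian
    synset (M literals) linked to an English synset (N literals). As the
    text notes, some of these pairs are noise rather than real
    translation equivalents."""
    return list(product(synset_ro, synset_en))

# A 2-literal Romanian synset linked to a 3-literal English one gives 6 pairs:
pairs = seed_pairs(["casa", "locuinta"], ["house", "home", "dwelling"])
```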
In Figure 17 we give the new evaluation results (using the official programs and
evaluation data) of the new versions of TREQ and TREQ-AL.
             Non-null links only   Null links included   TREQ-AL lexicon   TREQ lexicon
Precision    84.43%                65.58%                86.68%            79.58%
Recall       64.34%                66.08%                81.96%            61.01%
F-measure    73.03%                65.83%                84.26%            69.06%
Figure 17. Re-evaluation of TREQ-AL and TREQ on the “NAACL2003” shared task without a seed lexicon
As shown in Figure 17, TREQ-AL dramatically improves the performance of TREQ:
the precision increased by more than 7%, while the recall of TREQ-AL is more than
20% better than the recall of TREQ.
The evaluation of TREQ-AL when TREQ started with a seed lexicon showed no
improvement in the final extracted dictionary. However, the results for the word-
alignment shared task improved (apparently the balance between what was found and
what was lost made the difference, which is in any case not statistically significant).
             Non-null links only   Null links included   TREQ-AL lexicon
Precision    84.72%                66.07%                86.56%
Recall       64.73%                66.43%                81.85%
F-measure    73.39%                66.25%                84.13%
Figure 18. Re-evaluation of TREQ-AL and TREQ on the “NAACL2003” shared task with an initial bilingual dictionary
Figures 19a and 19b show the performance of all participating teams on the Romanian-
English word alignment shared task. There were two distinct evaluations: the NON-
NULL-alignments only considered the links that represented non-null translations while
the NULL-alignments took into account both the non-null and the null translations.
RACAI.RE.2 is the evaluation of TREQ-AL with an initial seed lexicon and
RACAI.RE.1 is the evaluation of TREQ-AL without an initial seed lexicon. The
systems were evaluated in terms of three figures of merit: the Fs-measure, the Fp-measure,
and the Alignment Error Rate (AER). Since the Romanian Gold Standard contains only sure
alignments, AER reduces to 1 - Fp-measure. For all systems that assigned only sure
alignments, Fp-measure = Fs-measure (see Mihalcea & Pedersen (2003) for further
details).
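Under the customary shared-task definitions (sure links S, probable links P with S a subset of P, and a system alignment A), the figures of merit can be computed as in the sketch below; when S = P, the AER indeed reduces to 1 minus the F-measure, as noted above.

```python
def alignment_scores(A, S, P):
    """Precision, recall, F-measure and AER for a system alignment A,
    given the sure links S and the probable links P, all represented as
    sets of (source_index, target_index) pairs."""
    prec = len(A & P) / len(A)
    rec = len(A & S) / len(S)
    f_measure = 2 * prec * rec / (prec + rec)
    aer = 1 - (len(A & S) + len(A & P)) / (len(A) + len(S))
    return prec, rec, f_measure, aer
```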
[Bar chart showing, for each system, the Sure F-measure, the Probable F-measure and the AER, ranked in decreasing order of performance: RACAI.RE.2, RACAI.RE.1, XRCE.Nolem-56K.RE.2, Proalign.RE.1, XRCE.Trilex.RE.3, XRCE.Trilex.RE.4, XRCE.Base.RE.1, Ralign.RE.1, BiBr.RE.1, BiBr.RE.3, BiBr.RE.2, UMD.RE.2, UMD.RE.1, Fourday.RE.1; y-axis from 0% to 80%]

Figure 19a. NAACL2003 Shared Task: ranked results of Romanian-English non-NULL alignments
[Bar chart showing, for each system, the Sure F-measure, the Probable F-measure and the AER, ranked in decreasing order of performance: RACAI.RE.2, RACAI.RE.1, XRCE.Trilex.RE.3, Proalign.RE.1, XRCE.Trilex.RE.4, XRCE.Base.RE.1, XRCE.Nolem-56K.RE.2, Ralign.RE.1, BiBr.RE.1, BiBr.RE.3, BiBr.RE.2, UMD.RE.2, UMD.RE.1, Fourday.RE.1; y-axis from 0% to 70%]

Figure 19b. NAACL2003 Shared Task: ranked results of Romanian-English NULL alignments
6 IMPLEMENTATION, CONCLUSIONS AND FURTHER WORK
The extraction programs, BASE, BETA and TREQ, as well as TREQ-AL, run on both
Windows and Unix machines5. Throughput is very fast: on a Pentium 4 (1.7 GHz) with
512 MB of RAM, extracting the noun bilingual lexicon from “1984” took 109 seconds
(72 s. for TREQ plus 37 s. for TREQ-AL) while the full dictionary was generated in 285
seconds (204 s. for TREQ plus 81 s for TREQ-AL). These figures are comparable to
those reported in (Tufiş and Barbu, 2002) for BETA although the machine on which
those evaluations were conducted was a less powerful Pentium II (233 MHz) processor
accessing 96MB of RAM.
An approach quite similar to our BASE algorithm (also implemented in Perl) is
presented in (Ahrenberg et al., 2000). They used a frequency threshold of 3, and the best
results reported are 92.5% precision and 54.6% partial recall (what we called Rec*). The
BETA and TREQ algorithms exploit the idea of competitive linking underlying
Melamed’s extractor (Melamed, 2001), although our program never returns to a visited
translation unit. Melamed’s evaluation is made in terms of accuracy and coverage,
where accuracy is more or less our precision and coverage is defined as the percentage
of tokens in the corpus for which a translation has been found. With the best 90%
coverage, the accuracy of his lexicon was 92.8±1.1%. Coverage is a much weaker
evaluation function than recall, especially for large corpora, since it favours frequent
tokens to the detriment of hapax legomena. Melamed (2001) showed that the 4.5% most
frequent translation pair types in the Hansard parallel corpus cover more than 61% of the
tokens in a random sample of 800 sentences. Moreover, the approximation used by
Melamed in computing coverage over-estimates the true value, since it does not consider
whether the translations found for the words in the corpus are correct or not. Based on the Gold
Standard Lexicon, we could compute exact precision, recall, coverage and also the
approximated coverage (Melamed’s way). As Figure 20 shows, in spite of a very small
text, there are significant differences between the exact coverage and the approximated
coverage. The differences are much more significant in the case of a larger text.
             Exact Coverage   Estimated Coverage   Precision
Romanian     91.91%           96.92%
English      91.98%           97.21%               86.56%

Figure 20: Exact and Estimated Coverage for the lexicon extracted by TREQ-AL from the NAACL2003 Gold Standard Alignment
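The difference between the two coverage figures can be made concrete with a small sketch (our own illustration): estimated coverage counts every token for which the lexicon proposes some translation, while exact coverage, which requires a gold standard, counts only the tokens whose proposed translation is correct.

```python
def coverage(tokens, proposed, correct=None):
    """Percentage of corpus tokens covered by the lexicon.

    `proposed` is the set of words for which the lexicon offers a
    translation; if `correct` (the set of words whose proposed translation
    is right, known only from a gold standard) is given, the exact
    coverage is returned instead of the estimated one."""
    covered = [t for t in tokens if t in proposed]
    if correct is not None:
        covered = [t for t in covered if t in correct]
    return 100.0 * len(covered) / len(tokens)
```

Because the estimated variant never checks correctness, it can only over-estimate the exact one, which is the discrepancy Figure 20 quantifies.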
We ran TREQ-AL (without the seed lexicon mentioned before) on the entire
NAACL2003 corpus, extracting a 48287-entry lexicon. Following Melamed’s (2001)
procedure, we took five random samples (with replacement) of 100 entries and validated
them by hand. The average resulting precision was 91.67% with an estimated coverage
of 95.21% for Romanian and 96.56% for English. However, as demonstrated in Figure
20, without a gold standard, such estimated evaluations should be regarded cautiously.
All the algorithms we presented are based on a 1:1 mapping hypothesis. We argued that
when a language-specific tokenizer is responsible for pre-processing the input to the
extractor, the 1:1 mapping approach is no longer an important limitation.
Incompleteness of the segmenter’s resources may be compensated for by a post-
processing phase that recovers the partial translations. In (Tufiş, 2001) such a
recovery phase is presented that takes advantage of the already extracted entries.
Additional means, such as collocation extraction based on n-gram statistics and partial
grammar filtering (as included in the GUI-TREQ), are effective ways of continuously
improving the segmenter’s resources and greatly reduce the restrictions imposed by the
1:1 mapping hypothesis.
Finally, we should note that while TREQ is quite mature, TREQ-AL is still under
development, and we are confident that there is ample room for future performance
improvements.
Acknowledgements
The research on translation equivalents started as an AUPELF/UREF co-operation
project with LIMSI/CNRS (CADBFR) and used the multilingual corpora and lexical
resources developed within the MULTEXT-EAST, TELRI and CONCEDE EU
projects. The continuous improvements of the methods and tools described in this paper
were motivated and supported by two European projects we are currently involved in:
FF-POIROT (IST-2001-38248) and BALKANET (IST-2000-29388).
We are grateful to the editor of this issue and to an anonymous reviewer, who did a
great job in improving the content and the readability of this paper. All the remaining
problems are entirely ours.
References
Ahrenberg, L., Andersson, M., Merkel, M. (2000). "A knowledge-lite approach to word
alignment", in Véronis, J. (ed). Parallel Text Processing. Text, Speech and Language
Technology Series, Kluwer Academic Publishers, Vol. 13, pp. 97-116.
Brants, T. (2000). “TnT – A Statistical Part-of-Speech Tagger”, in Proceedings ANLP-2000,
April 29 – May 3, Seattle, WA.
Brew, C., McKelvie, D. (1996). “Word-pair extraction for lexicography.” Available at
http://www.ltg.ed.ac.uk/~chrisbr/papers/nemplap96.
Brown, P. F., Della Pietra, S. A., Della Pietra, V. J. and Mercer, R. L. (1993). "The
mathematics of statistical machine translation: parameter estimation", in Computational
Linguistics, 19/2, pp. 263-311.
Dunning, T. (1993), “Accurate Methods for the Statistics of Surprise and Coincidence” in
Computational Linguistics, 19/1, pp. 61-74.
Dimitrova, L, Erjavec, T., Ide, N., Kaalep, H., Petkevic, V. and Tufiş, D. (1998) "Multext-East:
Parallel and Comparable Corpora and Lexicons for Six Central and East European
Languages" in Proceedings ACL-COLING’1998, Montreal, Canada, pp. 315-319.
Gale, W.A. and Church, K.W. (1991). "Identifying word correspondences in parallel texts". In
Fourth DARPA Workshop on Speech and Natural Language, pp. 152-157.
Gale, W.A. and Church, K.W. (1993). “A Program for Aligning Sentences in Bilingual
Corpora”. In Computational Linguistics, 19/1, pp. 75-102.
Erjavec, T. (ed.) (2001). “Specifications and Notations for MULTEXT-East Lexicon Encoding”.
Multext-East/Concede Edition, March 21, 210 pages, available at
http://nl.ijs.si/ME/V2/msd/html/.
Erjavec, T., Lawson, A., Romary, L. (1998). East Meets West: A Compendium of Multilingual
Resources. TELRI-MULTEXT EAST CD-ROM, ISBN: 3-922641-46-6.
Erjavec T., Ide, N. (1998) “The Multext-East corpus”. In Proceedings LREC’1998, Granada,
Spain, pp. 971-974.
Hiemstra, D. (1997). "Deriving a bilingual lexicon for cross language information retrieval". In
Proceedings of Gronics, pp. 21-26.
Ide, N., Véronis, J. (1995). “Corpus Encoding Standard”, MULTEXT/EAGLES Report.
Available at http://www.lpl.univ-aix.fr/projects/multext/CES/CES1.html.
Kay, M., Röscheisen, M. (1993). “Text-Translation Alignment”. In Computational Linguistics,
19/1, pp. 121-142.
Kupiec, J. (1993). "An algorithm for finding noun phrase correspondences in bilingual corpora".
In Proceedings of the 31st Annual Meeting of the Association for Computational
Linguistics, pp. 17-22.
Melamed, D. (2001). Empirical Methods for Exploiting Parallel Texts. The MIT Press,
Cambridge Massachusetts, London England, 195 pages.
Mihalcea, R., Pedersen, T. (2003). “An Evaluation Exercise for Word Alignment”. Proceedings of
the HLT-NAACL 2003 Workshop on Building and Using Parallel Texts: Data Driven
Machine Translation and Beyond, May-June, Edmonton, Canada, pp. 1-10.
Mititelu, C. (2003). TREQ User Manual, Technical Report, RACAI, May, 25 pages.
Smadja, F., McKeown, K.R. and Hatzivassiloglou, V. (1996). "Translating collocations for
bilingual lexicons: A statistical approach". Computational Linguistics, 22/1, pp. 1-38.
Stamou, S., Oflazer, K., Pala, K., Christodoulakis, D., Cristea, D., Tufiş, D., Koeva, S., Totkov,
G., Dutoit, D., Grigoriadou, M. (2002). “BALKANET: A Multilingual Semantic Network
for the Balkan Languages”, in Proceedings of the International Wordnet Conference,
Mysore, India, 21-25 January.
Tufiş, D., Barbu, A.M., Pătraşcu, V., Rotariu, G., Popescu, C. (1997). ”Corpora and Corpus-
Based Morpho-Lexical Processing”, in D. Tufiş, P. Andersen (eds.), Recent Advances in
Romanian Language Technology, Editura Academiei, pp. 35-56.
Tufiş, D., Ide, N., Erjavec, T. (1998). “Standardized Specifications, Development and
Assessment of Large Morpho-Lexical Resources for Six Central and Eastern European
Languages”. Proceedings LREC’1998, Granada, Spain, pp. 233-240.
Tufiş, D. (2000). “Using a Large Set of Eagles-compliant Morpho-Syntactic Descriptors as a
Tagset for Probabilistic Tagging”. Proceedings LREC’2000, Athens, pp. 1105-1112.
Tufiş, D. (2001). “Partial translations recovery in a 1:1 word alignment approach”, RACAI
Technical Report, 2001 (in Romanian), 18 pages.
Tufiş, D. (2002) ”A cheap and fast way to build useful translation lexicons” in Proceedings of
the 19th International Conference on Computational Linguistics, COLING2002, Taipei,
25-30 August, pp. 1030-1036.
Tufiş, D., Barbu, A.M. (2002). ”Revealing translators’ knowledge: statistical methods in
constructing practical translation lexicons for language and speech processing”, in
International Journal of Speech Technology, Kluwer Academic Publishers, no. 5, pp. 199-209.
Tufiş, D., Barbu, A.M., Ion, R. (2003) “TREQ-AL: A word alignment system with limited
language resources”, Proceedings of the HLT-NAACL 2003 Workshop on Building and
Using Parallel Texts: Data Driven Machine Translation and Beyond, May-June,
Edmonton, Canada, pp. 36-39.
NOTES

1 MtSeg has tokenization resources for many Western European languages, further enhanced in
the MULTEXT-EAST project (Erjavec and Ide, 1998; Dimitrova et al., 1998; Tufiş et al., 1998)
with corresponding resources for Bulgarian, Czech, Estonian, Hungarian, Romanian and
Slovene.
2 The lexicons were evaluated by Heiki Kaalep of the University of Tartu (ET-EN), Tamás Váradi
of the Linguistic Institute of the Hungarian Academy (HU-EN), Ana Maria Barbu of RACAI
(RO-EN) and Tomaž Erjavec of the IJS Ljubljana. All of them are gratefully acknowledged.
3 This was necessary because of the way the Gold Standard Alignment dealt with compounds: an
expression in Romanian consisting of N words, aligned to its equivalent expression in English
containing M words, was represented by N*M word links. In this case we considered only one
lexicon entry instead of N*M.
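The collapsing described in this note can be sketched as follows; representing each word link as a (Romanian word, English word, expression identifier) triple is a hypothetical encoding for illustration, not the actual format of the Gold Standard Alignment.

```python
def collapse_links(word_links):
    """Turn the N*M word links of each expression pair into a single
    multi-word lexicon entry, preserving the order of first occurrence."""
    entries = {}
    for ro_word, en_word, expr_id in word_links:
        ro_words, en_words = entries.setdefault(expr_id, ([], []))
        if ro_word not in ro_words:
            ro_words.append(ro_word)
        if en_word not in en_words:
            en_words.append(en_word)
    # one (Romanian expression, English expression) entry per identifier
    return [(" ".join(ro), " ".join(en)) for ro, en in entries.values()]
```

For example, a 2-word Romanian expression aligned to a 2-word English one yields 2*2 = 4 word links but only a single lexicon entry.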
4 The existing errors in our synsets definition might be the simplest explanation.
5 The programs are written in Perl and we tested them on Unix, Linux and Windows. The
graphical user interface of TREQ combines technologies like DHTML, XML, and XSL with the
languages HTML, JavaScript, Perl, and PerlScript.