TUFIŞ, BARBU, ION: EXTRACTING MULTILINGUAL LEXICONS 1
Extracting multilingual lexicons from parallel
corpora
DAN TUFIŞ1, ANA MARIA BARBU2 AND RADU ION3
1,2,3 Romanian Academy (RACAI), 13 “13 Septembrie”, 050711, Bucharest 5, Romania
[email protected], [email protected], [email protected]
Abstract. The paper describes our recent developments in automatic extraction of translation
equivalents from parallel corpora. We describe three increasingly complex algorithms: a simple
baseline iterative method, and two more elaborate non-iterative versions. While the baseline
algorithm is mainly described for illustrative purposes, the non-iterative algorithms outline the
use of different working hypotheses which may be motivated by different kinds of applications
and to some extent by the languages concerned. The first two algorithms rely on cross-lingual
POS preservation, while with the third one POS invariance is not an extraction condition. The
evaluation of the algorithms was conducted on three different corpora and several pairs of
languages.
Keywords: alignment, evaluation, lemmatization, tagging, translation equivalence
1 INTRODUCTION
Automatic extraction of bilingual lexicons from parallel texts might seem a futile
task, given that more and more bilingual lexicons are printed nowadays and they can be
easily turned into machine-readable lexicons. However, if one considers the
possibility of automatically enriching the presently available electronic lexicons, with very
limited manpower and lexicographic expertise, the problem reveals a lot of potential.
The scientific and technological advancement in many domains is a constant source of
new-term coinage and therefore keeping up with multilingual lexicography in such areas
1 Computers and the Humanities Volume 38, Issue 2, May 2004, pp. 163-198 © 2004. Kluwer Academic Publishers. Printed in the Netherlands.
is very difficult unless computational means are used. On the other hand, bilingual
translation lexicons appear to be quite different from the corresponding printed lexicons
meant for human users. The marked difference between printed bilingual lexicons and
bilingual lexicons as needed for automatic translation is not really surprising. Traditional
lexicography deals with translation equivalence (the underlying concept of bilingual
lexicography) in an inherently discrete way. What is to be found in a printed dictionary
or lexicon (bi- or multilingual) is just a set of general basic translations. In the case of
specialised registers, general lexicons are usually not very useful.
The recent interest in semantic markup of texts, motivated by the Semantic Web
technologies, raises the issue of exploiting the markup existing in one language text to
automatically generate the semantic annotations in the second language parallel text.
Finding the lexical correspondences in a parallel text creates the possibility of
bidirectional import of semantic annotations that might exist in either of the two parallel
texts.
The basic concept in extracting translation lexicons is the notion of translation
equivalence relation (Gale and Church, 1991). One of the widely accepted definitions
(Melamed, 2001) of the translation equivalence defines it as a (symmetric) relation that
holds between two different language texts, such that expressions appearing in
corresponding parts of the two texts are reciprocal translations. These expressions are
called translation equivalents. A parallel text, or a bitext, having its translation
equivalents linked is an aligned bitext. Translation equivalence may be defined at
various granularity levels: paragraph, sentence, lexical. Automatic detection of the
translation equivalents in a bitext is increasingly more difficult as the granularity
becomes finer. Here we are concerned with the finest alignment granularity, namely the
lexical one. If not stated otherwise, in the rest of the paper by translation equivalents we
will mean lexical translation equivalents.
Most approaches to automatic extraction of translation equivalents roughly fall into two
categories. The hypotheses-testing methods such as (Gale and Church, 1991; Smadja et
al., 1996) rely on a generative device that produces a list of translation equivalence
candidates (TECs), each of them being subject to an independence statistical test. The
TECs that show an association measure higher than expected under the independence
assumption are assumed to be translation-equivalence pairs (TEPs). The TEPs are
extracted independently of one another and therefore the process might be characterised
as a local maximisation (greedy) one. The estimating approach (e.g. Brown et al., 1993;
Kupiec, 1993; Hiemstra, 1997) is based on building a statistical bitext model from data,
the parameters of which are to be estimated according to a given set of assumptions. The
bitext model allows for global maximisation of the translation equivalence relation,
considering not individual translation equivalents but sets of translation equivalents
(sometimes called assignments). There are pros and cons for each type of approach,
some of them discussed in (Hiemstra, 1997).
Our method comes closer to the hypotheses-testing approach. It generates first a list of
translation equivalent candidates and then successively extracts the most likely
translation-equivalence pairs. The extraction process does not need a pre-existing
bilingual lexicon for the considered languages. Yet, if such a lexicon exists, it can be
used to eliminate spurious candidate translation-equivalence pairs and thus to speed up
the process and increase its accuracy.
2 CORPUS ENCODING
In our experiments, we used three parallel corpora. The largest one, henceforth
“NAACL2003”, is bilingual (Romanian and English), contains 866,036 words in the
English part and 770,635 words in the Romanian part, and consists mainly of journalistic
texts. The raw texts in this corpus have been collected and provided by Rada Mihalcea
from the University of North Texas for the purpose of the Shared Task on word-
alignment organised by Rada Mihalcea and Ted Pedersen at the HLT-NAACL2003
workshop on “Building and Using Parallel Texts: Data Driven Machine Translation and
Beyond“ (see http://www.cs.unt.edu/~rada/wpt/). The smallest parallel text, henceforth
“VAT”, is trilingual (French, Dutch and English), contains about 42,000 words per
language, and is a legal text (the EEC Sixth VAT Directive, 77/388/EEC). It was
built for the FF-POIROT European project (http://www.starlab.vub.ac.be/
research/projects/poirot/), as a testbed for multilingual term extraction and alignment.
The third corpus, henceforth “1984”, is the result of the MULTEXT-EAST and
CONCEDE European projects and it is based on Orwell’s novel Nineteen Eighty-Four,
translated in 6 languages (Bulgarian, Czech, Estonian, Hungarian, Romanian, Slovene)
with the English original included as the hub. Each translation was aligned to the
English hub, thus yielding 6 bitexts. From these 6 integral bitexts, containing on average
100,000 words per language, we selected only the sentences that were 1:1 aligned with
the English original, thus obtaining the 7-language parallel corpus with an average of
80,000 words per language. Out of the three corpora this is the most accurate, being
hand validated (Erjavec & Ide, 1998; Tufiş et al. 1998).
The input to our algorithms is represented by a parallel corpus encoded according to a
simplified version of XCES specification (http://www.cs.vassar.edu/XCES). This
encoding requires preliminary pre-processing of each monolingual part of the parallel
corpus and, afterwards, the sentence alignment of all monolingual texts. The aligned
fragments of text, in two or more languages present in the parallel corpus, make a
translation unit. Each translation unit consists of several segments, one per language. A
segment is made of one uniquely identified sentence. Each sentence is made up of one or
more tokens for which the lemma and the morpho-syntactic code are explicitly encoded
as tag attributes. More often than not, a token corresponds to what is generally called a
word, but this is not always the case. Depending on the lexical resources used in the
(monolingual) text segmentation, a multiword expression may be treated as a single
lexical token and encoded as such. As an example of the encoding used by our
algorithms, Figure 1 shows the translation unit "Ozz.42" of the “1984” corpus: <tu id="Ozz.42"> <seg lang="en"> <s id="Oen.1.1.10.2"> <tok lemma="there" ana="Pt3">There</tok> <tok lemma="be" ana="Vmis-p">were</tok> <tok lemma="no" ana="Dg">no</tok> <tok lemma="window" ana="Ncnp">windows</tok> <tok lemma="in" ana="Sp">in</tok> <tok lemma="it" ana="Pp3ns">it</tok> <tok lemma="at_all" ana="Rmp">at all</tok> <c>.</c> </s> </seg> <seg lang="ro"> <s id="Oro.1.2.10.2"> <tok lemma="nu" ana="Qz">Nu</tok> <tok lemma="avea" ana="Vmii3s">avea</tok> <tok lemma="deloc" ana="Rgp">deloc</tok> <tok lemma="fereastră" ana="Ncfp-n">ferestre</tok> <c>.</c> </s> </seg> <seg lang="sl"> <s id="Osl.1.2.11.2"> <tok lemma="okno" ana="Ncnpg">Oken</tok> <tok lemma="na" ana="Spsl">na</tok> <tok lemma="on" ana="Pp3nsl--n-n">njem</tok> <tok lemma="sploh" ana="Q">sploh</tok> <tok lemma="biti" ana="Vcip3s--y">ni</tok> <tok lemma="biti" ana="Vcps-sna">bilo</tok> <c>.</c> </s> </seg> <seg lang="cs"> <s id="Ocs.1.1.10.2"> <tok lemma="mít" ana="Vmps-snay----n">Nemělo</tok> <tok lemma="vůbec" ana="Rgp">vůbec</tok> <tok lemma="okno" ana="Ncnpa">okna</tok> <c>.</c> </s> </seg> <seg lang="bg"> <s id="Obg.1.1.9.2"><tok lemma="то" ana="PP3">То</tok> <tok lemma="изобщо" ana="RG"> изобщо</tok> <tok lemma="нямам" ana="VMII3S">нямаше</tok> <tok lemma="прозорец" ana="NCMP-N"> прозорци</tok>
5 Computers and the Humanities Volume 38, Issue 2, May 2004, pp. 163-198 © 2004. Kluwer Academic Publishers. Printed in the Netherlands.
TUFIŞ, BARBU, ION: EXTRACTING MULTILINGUAL LEXICONS 6
<c>.</c> </s> </seg> <seg lang="et"> <s id="Oet.1.2.10.2"> <tok lemma="see" ana="Pd--s3">Sel</tok> <tok lemma="ei" ana="Va------y">ei</tok> <tok lemma="olema" ana="Vmii---ay">olnud</tok> <tok lemma="üks" ana="Pi--s1">ühtki</tok> <tok lemma="aken" ana="Nc-s1">akent</tok> <c>.</c> </s> </seg> <seg lang="hu"> <s id="Ohu.1.2.10.2"> <tok lemma="egyáltalán" ana="Rg">Egyáltalán</tok> <tok lemma="nem" ana="Rg">nem</tok> <tok lemma="van" ana="Vmis3p---n">voltak</tok> <tok lemma="õ" ana="P---sp">rajta</tok> <tok lemma="ablak" ana="Nc-pn">ablakok</tok> <c>.</c> </s> </seg> </tu>
Figure 1: Corpus encoding for the translation equivalence extraction algorithms
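This encoding can be consumed with any XML toolkit. As a minimal sketch (using Python's standard xml.etree.ElementTree over a shortened, two-language translation unit; the function name is ours), one might read out the (lemma, tag) pairs per language like this:

```python
import xml.etree.ElementTree as ET

# A shortened two-language translation unit in the Figure 1 encoding.
TU_XML = """<tu id="Ozz.42">
 <seg lang="en">
  <s id="Oen.1.1.10.2">
   <tok lemma="there" ana="Pt3">There</tok>
   <tok lemma="be" ana="Vmis-p">were</tok>
   <tok lemma="no" ana="Dg">no</tok>
   <tok lemma="window" ana="Ncnp">windows</tok>
   <c>.</c>
  </s>
 </seg>
 <seg lang="ro">
  <s id="Oro.1.2.10.2">
   <tok lemma="nu" ana="Qz">Nu</tok>
   <tok lemma="avea" ana="Vmii3s">avea</tok>
   <tok lemma="fereastră" ana="Ncfp-n">ferestre</tok>
   <c>.</c>
  </s>
 </seg>
</tu>"""

def tokens_by_language(tu_xml):
    """Return {language code: [(lemma, morpho-syntactic tag), ...]};
    punctuation elements (<c>) are simply skipped."""
    tu = ET.fromstring(tu_xml)
    return {
        seg.get("lang"): [(tok.get("lemma"), tok.get("ana"))
                          for tok in seg.iter("tok")]
        for seg in tu.findall("seg")
    }
```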
The next section briefly describes the pre-processing steps, used by the corpus
generation module that provides the input for the translation equivalence extraction
algorithms.
3 PRELIMINARY PROCESSING
3.1 SEGMENTATION; WORDS AND MULTIWORD LEXICAL TOKENS
A lexical item is usually considered to be a space- or punctuation-delimited string of
characters. However, especially in multilingual studies, it is convenient, and frequently
linguistically motivated, to consider some sequences of traditional words as making up a
single lexical unit. For translation purposes considering multiword expressions as single
lexical units is a regular practice justified both by conceptual and computational reasons.
The recognition of multiword expressions as single lexical tokens, and the splitting of
single words into multiple lexical tokens (when it is the case) are generically called text
segmentation and the program that performs this task is called a segmenter or a
tokenizer. In the following we will refer to words and multiword expressions as lexical
tokens or, simply, tokens.
The multilingual segmenter we used, MtSeg, is a public domain tool
(http://www.lpl.univ-aix.fr/projects/multext/MtSeg/) and was developed by Philippe di
Cristo within the MULTEXT project. The segmenter is able to recognise dates,
numbers, various fixed phrases, to split clitics or contractions, etc. We implemented a
collocation extractor, based on NSP, an n-gram statistical package
(http://www.d.umn.edu/~tpederse/nsp.html) developed by Ted Pedersen. The list of
generated n-grams is subject to a regular expression filtering that considers language-
specific constituency restrictions. After validation, the new multi-word expressions may
be added to the segmenter’s resources. A complementary approach to overcome the
inherent incompleteness of the language-specific tokenization resources is described
in detail in (Tufiş, 2001).
3.2 SENTENCE ALIGNMENT
We used a slightly modified version of Gale and Church’s CharAlign sentence aligner
(Gale and Church, 1993). In general, sentence alignments of all bitexts of our
multilingual corpora are of the type 1:1, i.e. in most cases (more than 95%) one sentence
is translated as one sentence. In the following we will refer to the alignment units as
translation units (TU). In general, sentence alignment is a highly accurate process, and
in our corpora alignment is in fact error-free, either because of manual validation and
correction (“1984”) or because the raw texts were published already aligned by their
authors (“VAT” and “NAACL2003”).
3.3 TAGGING AND LEMMATIZATION
For highly inflectional languages, morphological variation may generate diffusion of the
statistical evidence for translation equivalents. In order to avoid data sparseness we
added a tagging and lemmatization phase as a front-end pre-processing of the parallel
corpus. For instance, the English adjective “unacceptable”, occurring nine times in one
of our corpora, has been translated in Romanian by nine different word-forms,
representing inflected forms (singular/plural, masculine/feminine) of three adjectival
lemmas (inacceptabil, inadmisibil, intolerabil): inacceptabil, inacceptabile, inadmisibil,
inadmisibile, inadmisibilă, inadmisibilului, intolerabil, intolerabilă and intolerabilei.
Without lemmatization all translation pairs would be “hapax-legomena” pairs and thus
their statistical recognition and extraction would be hampered. The lemmatization
ensured sufficient evidence for the algorithm to extract all three translations of the
English word.
The monolingual lexicons developed within MULTEXT-EAST contain, for each word-
form, its lemma and the morpho-syntactic codes that apply for the current word-form.
With these monolingual resources lemmatisation is a by-product of tagging: knowing the
word-form and its associated tag, the lemma extraction, for those words that are in the
lexicon, is just a matter of lexicon lookup; for unknown words, the lemma is implicitly
set to the word-form itself, unless a lemmatiser is available. Erjavec and Ide (1998)
provide a description of the MULTEXT-EAST lexicon encoding principles. A detailed
presentation of their application to Romanian is given in (Tufiş et al., 1997).
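The lookup scheme just described can be sketched in a few lines (the lexicon entries below are a hypothetical fragment for illustration, not the actual MULTEXT-EAST data):

```python
# Hypothetical fragment of a word-form lexicon:
# (word-form, morpho-syntactic tag) -> lemma.
LEXICON = {
    ("were", "Vmis-p"): "be",
    ("windows", "Ncnp"): "window",
    ("ferestre", "Ncfp-n"): "fereastră",
}

def lemmatize(word_form, tag):
    """Lemmatization as a by-product of tagging: look up the
    (word-form, tag) pair; unknown words fall back to the
    word-form itself, as no separate lemmatiser is assumed."""
    return LEXICON.get((word_form, tag), word_form)
```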
For morpho-syntactic disambiguation we use a tiered-tagging approach with combined
language models (Tufiş, 1999) based on TnT - a trigram HMM tagger (Brants, 2000).
For Romanian, this approach has been shown to provide an average accuracy of more
than 98.5%. The tiered-tagging model relies on two different tagsets. The first one,
which is best suited for statistical processing, is used internally, while the second (used in a
morpho-syntactic lexicon and in most cases more linguistically motivated) is used in the
tagger’s output. The mapping between the two tagsets is in most cases deterministic (via
a lexicon lookup) or, in the rare cases where it is not, a few regular expressions may
solve the non-determinism. The idea of tiered tagging works not only for very fine-
grained tagsets, but also for very low-information tagsets, such as those containing only
part of speech. In such cases the mapping from the hidden tagset to the coarse-grained
tagset is strictly deterministic. In (Tufiş, 2000) we showed that using the coarse-grained
tagset (14 non-punctuation tags) directly gave a 93% average accuracy, while using a
tiered tagging and combined language model approach (92 non-punctuation tags in the
hidden tagset) the accuracy was never below 99.5%.
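The hidden-to-output tag mapping might be sketched as follows. The lexicon fragment and the convention that a hidden tag is a prefix of the full morpho-syntactic descriptor are illustrative assumptions, not the actual tiered-tagging resources, and the few disambiguating regular expressions of the real system are not modelled:

```python
# Illustrative word-form lexicon: word-form -> fine-grained MSDs.
LEXICON = {
    "ferestre": ["Ncfp-n"],
    "avea": ["Vmii3s", "Vmii3p"],
}

def recover_msd(word_form, hidden_tag):
    """Map a hidden-tagset tag back to the fine-grained output tag.
    The mapping is deterministic (a lexicon lookup) when exactly one
    entry is compatible; ambiguous or unknown cases keep the hidden
    tag here, standing in for the rule-based resolution."""
    compatible = [msd for msd in LEXICON.get(word_form, [])
                  if msd.startswith(hidden_tag)]
    if len(compatible) == 1:
        return compatible[0]
    return hidden_tag
```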
4 LEXICONS EXTRACTION
4.1 UNDERLYING ASSUMPTIONS
Extracting translation equivalents from parallel corpora is a very complex task that can
easily turn into a computationally intractable enterprise. Fortunately, there are several
assumptions one can consider in order to simplify the problem and lower the
computational complexity of its solution. Yet, we have to mention that these empirical
simplifications usually produce information loss and/or noisy results. Post-processing, as
we will describe in section 5.3, may significantly improve both precision and recall by
eliminating some wrong translation equivalence pairs and finding some good ones,
previously undiscovered.
The assumptions we used in our basic algorithm are the following:
• a lexical token in one half of the TU corresponds to at most one non-empty lexical
unit in the other half of the TU; this is the 1:1 mapping assumption which underlies
the work of many other researchers (e.g. Kay and Röscheisen, 1993; Melamed,
2001; Ahrenberg et al. 2000; Hiemstra, 1997; Brew and McKelvie, 1996). When this
hypothesis does not hold, the result is a partial translation. However, remember that
a lexical token could be a multiple-word expression tokenized as such by an
adequate segmenter; non-translated tokens are not of interest here.
• a lexical token, if used several times in the same TU, is used with the same meaning;
this assumption is explicitly used also by (Melamed, 2001) and implicitly by all the
previously mentioned authors; the rationale for this assumption comes from the
pragmatics of regular natural language communication: the reuse of a lexical token,
in the same sentence and with a different meaning, generates extra cognitive load on
the recipient and thus is usually avoided; exceptions from this communicative
behavior, more often than not, represent either bad style or a game of words.
• a lexical token in one part of a TU can be aligned to a lexical token in the other part
of the TU only if the two tokens have the same part-of-speech; this is one very
efficient way to cut off the combinatorial complexity and avoid dealing with
irregular ways of cross-POS translations; as we will show in section 4.4 this
assumption can be nicely circumvented without too high a price in computational
performance;
• although word order is not an invariant of translation, it is not random either; when
two or more candidate translation pairs are equally scored, the one containing tokens
that are closer in relative position is preferred. This preference is also used in
(Ahrenberg et al. 2000).
Based on sentence alignment, POS tagging and lemmatisation, the first step is to
compute a list of translation equivalence candidates (TECL). By collecting all the tokens
of the same POSk (in the order they appear in the text and removing duplicates) in each
part of TUj (the j-th translation unit) one builds the ordered sets LSj^POSk and
LTj^POSk. For each POSi let TUj^POSi be defined as LSj^POSi ⊗ LTj^POSi, with ‘⊗’
representing the Cartesian product operator. Then, CTUj (candidates in the j-th TU) is
defined as follows:

    CTUj = ∪ (i = 1 .. number of POSes) TUj^POSi

With these notations, and considering that there are n translation units in the whole
bitext, TECL is defined as:

    TECL = ∪ (j = 1 .. n) CTUj
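The construction of CTUj for one translation unit can be sketched as follows (the token representation and names are ours; TECL is then the multiset union of CTUj over all n translation units):

```python
from itertools import product

def candidate_pairs(tu_source, tu_target):
    """Build CTU_j for one translation unit: for every POS, pair the
    (deduplicated, order-preserving) source tokens of that POS with
    the target tokens of the same POS (Cartesian product).
    Tokens are (lemma, POS) pairs."""
    def by_pos(tokens):
        ordered = {}
        for lemma, pos in tokens:
            ordered.setdefault(pos, [])
            if lemma not in ordered[pos]:   # remove duplicates, keep order
                ordered[pos].append(lemma)
        return ordered

    src, tgt = by_pos(tu_source), by_pos(tu_target)
    pairs = []
    for pos in src:
        if pos in tgt:
            pairs.extend(product(src[pos], tgt[pos]))
    return pairs
```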
TECL contains a lot of noise and many TECs are very improbable, so filtering is
necessary. Any filtering would eliminate many wrong TECs but also some good ones.
The ratio between the number of good TECs rejected and the number of wrong TECs
rejected is the criterion we used in deciding which test to use and what would be the
threshold score below which any TEC will be removed from TECL. After various
empirical tests we decided to use the loglikelihood test (LL) with the threshold set to 9.
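For one candidate pair, the log-likelihood score can be computed from its 2×2 co-occurrence counts; a compact sketch of the standard G² statistic (variable names are ours):

```python
import math

LL_THRESHOLD = 9  # candidates scoring below this are dropped from TECL

def log_likelihood(n11, n1_, n_1, n):
    """G^2 log-likelihood ratio for a candidate pair, from its 2x2
    contingency counts: n11 co-occurrences, n1_ and n_1 marginal
    occurrence counts, n the total number of translation units."""
    observed_expected = [
        (n11, n1_ * n_1 / n),
        (n1_ - n11, n1_ * (n - n_1) / n),
        (n_1 - n11, (n - n1_) * n_1 / n),
        (n - n1_ - n_1 + n11, (n - n1_) * (n - n_1) / n),
    ]
    # 2 * sum of O * ln(O/E); empty cells contribute nothing
    return 2 * sum(o * math.log(o / e) for o, e in observed_expected if o > 0)
```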
4.2 THE BASELINE ALGORITHM (BASE)
Our baseline is a simple iterative algorithm and has some similarities to the algorithm
presented in (Ahrenberg et al. 2000) but unlike it, our algorithm avoids computing
various probabilities (or, more precisely, probability estimates) and scores (t-score).
Based on the TECL, an initial m × n contingency table (TBL0) is constructed for each
POS (see Figure 2), with m the number of token types in the first part of the bitext
and n the number of token types in the other part of the bitext.
Figure 2. Contingency table with counts for TECs at step k

          TT1   …   TTn
   TS1    n11   …   n1n   n1*
   …      …     …   …     …
   TSm    nm1   …   nmn   nm*
          n*1   …   n*n   n**
The rows of the table are indexed by the distinct source tokens and the columns are
indexed by the distinct target tokens (of the same POS). Each cell (i,j) contains the
number of occurrences in TECL of <TSi, TTj>. All the pairs <TSi, TTj> that at step k
satisfy the equation below (EQ1) are recorded as TEPs and removed from the
contingency table TBLk (the cells (i,j) are zeroed) thus obtaining a new contingency
table TBLk+1.
(EQ1)  TPk = { <TSi, TTj> ∈ TBLk | ∀p,q: (nij ≥ niq) ∧ (nij ≥ npj) }
Equation (EQ1) expresses the common intuition that in order to select <TSi, TTj> as a
translation equivalence pair, the number of associations of TSi with TTj must be higher
than (or at least equal to) any other TTp (p≠j). The same holds the other way around. One
of the main deficiencies of the BASE algorithm is that it is quite sensitive to what
(Melamed, 2001) calls indirect associations. If <TSi, TTj> has a high association score
and TTj collocates with TTk, it might very well happen that <TSi, TTk> also gets a high
association score. Although, as observed by Melamed, the indirect associations
generally have lower scores than the direct (correct) ones, they could receive higher
scores than many correct pairs and this not only generates wrong translation equivalents,
but also eliminates from further considerations several correct pairs. To weaken this
sensitivity, we had to additionally impose an occurrence threshold for the selected pairs,
so that the equation (EQ1) became:
(EQ2)  TPk = { <TSi, TTj> ∈ TBLk | ∀p,q: (nij ≥ niq) ∧ (nij ≥ npj) ∧ (nij ≥ 3) }
This modification significantly improved the precision (more than 98%) but seriously
degraded the recall, more than 75% of correct pairs being missed. The BASE
algorithm’s sensitivity to the indirect associations, and thus the necessity of an
occurrence threshold, is explained by the fact that it looks at the association scores
globally, not checking whether the tokens in a TEC are both in the same TU.
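The selection step of one BASE iteration, with the EQ2 occurrence threshold, can be sketched as follows (the contingency table is represented as a plain list of count rows; names are ours):

```python
def select_teps(table, min_count=3):
    """One BASE iteration over the contingency table: select every
    cell that is both a row maximum and a column maximum (EQ1) and
    meets the occurrence threshold n_ij >= 3 (EQ2), then zero the
    selected cells to obtain the next table TBL_{k+1}."""
    teps = []
    for i, row in enumerate(table):
        for j, n_ij in enumerate(row):
            if n_ij < min_count:
                continue
            column = [table[p][j] for p in range(len(table))]
            if n_ij >= max(row) and n_ij >= max(column):
                teps.append((i, j))
    for i, j in teps:          # remove extracted pairs from the table
        table[i][j] = 0
    return teps
```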
4.3 A BETTER EXTRACTION ALGORITHM (BETA)
To diminish the influence of indirect associations and thus remove the occurrence
threshold, we modified the BASE algorithm so that the maximum score is not
considered globally but within each of the TUs. This brings BETA closer to the
competitive linking algorithm described in (Melamed, 2001). The competing pairs are
only the TECs generated from the current TU out of which the pair with the best LL-
score (computed, as before, from the entire corpus) is the first selected. Based on the 1:1
mapping hypothesis, any TEC containing either of the tokens in the winning pair is discarded.
Then, the next best scored TEC in the current TU is selected and again the remaining
pairs that include one of the two tokens in the selected pair are discarded. The multiple-
step control in BASE, where each TU was scanned several times (once in each
iteration), is not necessary anymore. The BETA algorithm sees each TU only once,
but the TU is processed until no further TEPs can be reliably extracted or the TU is
emptied. This modification improves both the precision and the recall as compared to
the BASE algorithm. When two or more TEC pairs of the same TU share the same
token, and they are equally scored, the algorithm has to make a decision and choose only
one of them, in accordance with the 1:1 mapping hypothesis. We used two heuristics:
string similarity scoring and relative distance.
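The per-TU extraction loop of BETA amounts to greedy competitive linking; a minimal sketch (the score function stands in for the corpus-wide LL score, and the two tie-breaking heuristics are omitted):

```python
def link_tu(tu_candidates, score):
    """Greedy competitive linking inside one translation unit:
    repeatedly select the best-scored remaining candidate pair and,
    by the 1:1 mapping hypothesis, discard every other pair that
    shares one of its tokens."""
    used_src, used_tgt, teps = set(), set(), []
    for ts, tt in sorted(tu_candidates, key=score, reverse=True):
        if ts not in used_src and tt not in used_tgt:
            teps.append((ts, tt))
            used_src.add(ts)
            used_tgt.add(tt)
    return teps
```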
The similarity measure, COGN(TS, TT), is very similar to the XXDICE score described in
(Brew and McKelvie, 1996). If TS is a string of k characters α1α2 . . . αk and TT is a
string of m characters β1β2 . . . βm then we construct two new strings T’S and T’T by
inserting special displacement characters into TS and TT where necessary. The
displacement characters will cause both T’S and T’T to have the same length p (max (k,
m)≤p<k+m) and the maximum number of positional matches. Let δ(αi) be the number of
displacement characters that immediately precede the character αi which matches the
character βi, and δ(βi) be the number of displacement characters that immediately
precede the character βi which matches the character αi. Let q be the number of
matching characters. With these notations, equation EQ3 defines the COGN(TS, TT)
similarity measure as follows:
(EQ3)  COGN(TS, TT) = (2 / (k + m)) · Σ (i = 1 .. q) 1 / (1 + |δ(αi) − δ(βi)|)   if q > 2
       COGN(TS, TT) = 0                                                          if q ≤ 2
Using the COGN test as a filtering device is a heuristic based on the cognate conjecture
which says that when the two tokens of a translation pair are orthographically similar,
they are very likely to have similar meanings (i.e. they are cognates). The threshold for
the COGN(TS, TT) test was empirically set to 0.42. This value depends on the pair of
languages in the bitext. The actual implementation of the COGN test includes a
language-dependent normalisation step, which strips some suffixes, discards the
diacritics, reduces some consonant doubling, etc. This normalisation step was hand
written, but, based on available lists of cognates, it could be automatically induced.
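An illustrative approximation of COGN can be written with Python's difflib: the offset of each matching block stands in for the displacement difference |δ(αi) − δ(βi)|, and the language-dependent normalisation step is omitted, so this is a sketch rather than the actual implementation:

```python
from difflib import SequenceMatcher

def cogn(ts, tt):
    """Approximate COGN(TS, TT): 0 when at most two characters match;
    otherwise 2/(k+m) times the displacement-penalised match count,
    where each matching block's offset |a - b| approximates the
    difference in preceding displacement characters."""
    blocks = SequenceMatcher(None, ts, tt).get_matching_blocks()
    q = sum(b.size for b in blocks)            # number of matching characters
    if q <= 2:
        return 0.0
    penalised = sum(b.size / (1 + abs(b.a - b.b)) for b in blocks)
    return 2 * penalised / (len(ts) + len(tt))
```

With the 0.42 threshold from the text, a cognate pair such as "unacceptable" / "inacceptabil" scores well above it, while unrelated tokens score 0.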
The second filtering condition, DIST(TS, TT) considers relative distance between the
tokens in a pair and is defined as follows (where n and m are indexes of TS and TT in the
considered TU):
if (<TS, TT> ∈ LSj^POSk ⊗ LTj^POSk) and (TS is the n-th element in LSj^POSk) and
(TT is the m-th element in LTj^POSk) then DIST(TS, TT) = |n − m|
The COGN(TS, TT) test is a more reliable heuristic than DIST(TS, TT), so that the TEC
with the highest similarity score is the preferred one. If the similarity score is irrelevant,
the weaker filter DIST(TS, TT) gives priority to the pairs with the smallest relative
distance between the constituent tokens.
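The DIST filter itself is a one-liner over the per-POS ordered sets of the current TU (names are ours):

```python
def dist(ls_pos, lt_pos, ts, tt):
    """DIST(TS, TT): absolute difference between the positions of the
    two tokens in the ordered per-POS sets LSj and LTj of the TU;
    among equally scored pairs, the smallest DIST wins."""
    return abs(ls_pos.index(ts) - lt_pos.index(tt))
```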
The main use, up to now, of the BETA algorithm was in the European project
BALKANET (Stamou et al. 2002) aimed at building a EuroWordNet-like lexical
ontology. We used this algorithm for automatic acquisition of bilingual Romanian-
English resources and also for consistency checking of the interlingual projection of the
consortium monolingual wordnets. The multilinguality of EuroWordNet and its
BALKANET extension is ensured by linking monolingual synsets to interlingual
records that correspond to the Princeton Wordnet synsets. If two or more monolingual
wordnets are consistently projected over the interlingual index, then translation
equivalents extracted from a parallel corpus should be (ideally) projected over the same
interlingual record, or, (more realistically) onto interlingual records that correspond to
closely related meanings (according to a given metric). For this particular use, POS
identity of the translation equivalents was a definite requirement. However, in general,
imposing POS identity on the translation equivalents is too restrictive for a series of
multilingual applications. On the other hand, in the vast majority of cases, the cross-
lingual variation of the POS for translation equivalents is not arbitrary. This observation
led us to the implementation of TREQ, an improved translation-equivalents extractor,
more general than BASE.
4.4 A FURTHER ENHANCED EXTRACTION ALGORITHM (TREQ)
Besides algorithmic developments to be discussed in this section, TREQ has been
equipped with a graphical user interface which integrates additional functionality for
exploiting parallel corpora (editing the parallel corpora, generating word alignment
maps, multi-word term extraction, building multi-lingual and multi-word terminological
glossaries, etc.).
In section 4.1 we described four simplifying assumptions used in the implementation of
the translation-equivalents extraction procedures. The implementation of TREQ
dispenses with two of them, namely the assumption that the translation equivalence
preserves the POS and the assumption that repeated tokens in a sentence have the same
meaning.
4.4.1 Meta-categories
As noted before, when translation equivalents have different parts of speech this
alternation is not arbitrary and it can be generalized. TREQ allows the user to define for
each language pair the possible POS alternations. A set of grammar categories in one
language that could be mapped by the translation-equivalence relation over one or more
categories in the other language is called a meta-category. The user defines for each
language the meta-categories and then specifies their interlingual correspondence. For
instance, English participles and gerunds are often translated with Romanian nouns or
adjectives and vice versa. So, for this pair of languages we defined, in both languages,
the meta-category MC1 subsuming common nouns, adjectives and (impersonal) verbs,
and stipulated that if the source lexical token belongs to MC1, then its translation
equivalent should belong to the same meta-category. Another example of a meta-
category we found useful, MC2, subsumes the following pronominal adjectives:
demonstrative, indefinite and negative. These types of adjectives are used differently in
the two languages (e.g. a negative adjective allowed in Romanian by the negative
concord phenomenon may correspond to an indefinite or even a demonstrative adjective in
English). For uniformity, any category not explicitly included in a user-defined meta-category is treated as the single member of an automatically generated meta-category. For such meta-categories, the cross-lingual mapping is equivalent to POS identity. For instance, the abbreviations, which in our multilingual corpora are labeled
with the tag X, are subsumed in this way by the MC30 meta-category. In order not to
lose information from the tagged parallel corpora, TREQ adds the meta-category
(actually a number) as a prefix to the actual tag of each token. The search space (TECL)
is computed as described in section 4.1, the only modification being that instead of POS
the meta-category prefix is used. Figure 3 shows the English and Romanian segments
from Figure 2 with the meta-category prefix added to the token tags.

<tu id="Ozz.42">
  <seg lang="en">
    <s id="Oen.1.1.10.2">
      <tok lemma="there" ana="22+Pt3">There</tok>
      <tok lemma="be" ana="1+Vmis-p">were</tok>
      <tok lemma="no" ana="2+Dg">no</tok>
      <tok lemma="window" ana="1+Ncnp">windows</tok>
      <tok lemma="in" ana="5+Sp">in</tok>
      <tok lemma="it" ana="13+Pp3ns">it</tok>
      <tok lemma="at_all" ana="14+Rmp">at all</tok>
      <c>.</c>
    </s>
  </seg>
  <seg lang="ro">
    <s id="Oro.1.2.10.2">
      <tok lemma="nu" ana="7+Qz">Nu</tok>
      <tok lemma="avea" ana="1+Vmii3s">avea</tok>
      <tok lemma="deloc" ana="14+Rgp">deloc</tok>
      <tok lemma="fereastră" ana="1+Ncfp-n">ferestre</tok>
      <c>.</c>
    </s>
  </seg>
</tu>
Figure 3: Corpus encoding using meta-categories for the POS tagging
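The prefixing mechanism described above can be sketched as follows. This is our own illustration, not the TREQ code: the POS-to-meta-category tables and the MC numbers below are invented for the example, and tags are assumed to follow the MULTEXT-EAST convention of encoding the part of speech in the first character.

```python
# Illustrative sketch of the meta-category prefixing step.
META_CATEGORIES = {
    "en": {"N": 1, "A": 1, "V": 1,  # MC1: common nouns, adjectives, verbs
           "D": 2},                 # MC2: pronominal adjectives/determiners
    "ro": {"N": 1, "A": 1, "V": 1,
           "D": 2},
}
_next_auto_id = {"value": 100}      # counter for auto-generated singletons

def prefix_tag(tag, lang):
    """Prefix a POS tag with its meta-category number, e.g. 'Ncnp' -> '1+Ncnp'.

    A tag whose POS is not covered by a user-defined meta-category receives
    a fresh, automatically generated singleton meta-category, mirroring the
    behaviour described in the text."""
    table = META_CATEGORIES[lang]
    pos = tag[0]
    if pos not in table:
        _next_auto_id["value"] += 1
        table[pos] = _next_auto_id["value"]
    return "%d+%s" % (table[pos], tag)
```

With these tables, `prefix_tag("Ncnp", "en")` yields `"1+Ncnp"`, while a tag such as `X` (abbreviations) falls into an automatically generated singleton meta-category.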
As the TECL becomes much larger with the introduction of meta-categories, the
memory book-keeping mechanisms were optimized to release memory that is no longer
needed and, in the case of large parallel corpora, to take advantage of disk-resident
virtual memory.
Besides accounting for real POS alternations in translation, the meta-category
mechanism also overcomes some tagging errors that would otherwise show up as POS
alternations. But probably the most important advantage of the meta-category
mechanism is the possibility of working with very different tagsets. In (Tufiş et al. 2003)
we describe a system (based on TREQ) participating in a shared task on Romanian-
English word-alignment. The English parts of the training and evaluation data were
tagged using the Penn TreeBank tagset while the Romanian parts were tagged using the
MULTEXT-EAST tagset. Using meta-categories was a very convenient way of coping
with the different encodings and granularities of the two tagsets.
Finally, we should observe that the algorithm by no means requires that the meta-
categories with the same cross-lingual identifier subsume the same grammatical
categories in the two languages; and also, that defining a meta-category that subsumes
all the categories in the languages considered is equivalent to completely ignoring the
POS information (thus tagging becomes unnecessary).
4.4.2 Repeated tokens
The second simplifying hypothesis which was dropped in the TREQ implementation
was to assume that the same token (with the same POS tag), used several times in the
same sentence, has the same meaning. Based on this assumption, in the previous
versions, only one occurrence of the token was preserved. As this hypothesis did not save
significant computational resources, we decided to keep all the repeated tokens. This
modification slightly improved the precision of the algorithm, allowing the extraction of
translation pairs that appeared in only one translation unit, but several times. Also, when
the tokens repeated in one language were translated differently (by synonyms) in the
other language, not purging the duplicates allowed the extraction of (synonymic)
translation pairs which would otherwise have been lost.
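The effect of keeping repeated tokens can be illustrated with a minimal sketch of the candidate-generation step (our own illustration, not the TREQ code): a lemma repeated in one language can now pair with each of its possibly synonymic translations in the other language.

```python
from itertools import product

def candidate_pairs(src_tokens, tgt_tokens):
    """Candidate translation pairs for one translation unit.

    Each token is a (lemma, metacategory) pair. Duplicates are kept, so a
    lemma repeated in one language can be paired with each of its (possibly
    synonymic) translations in the other language."""
    return [(sl, tl)
            for (sl, sc), (tl, tc) in product(src_tokens, tgt_tokens)
            if sc == tc]  # the meta-category must be preserved

# A repeated English noun translated by two Romanian synonyms:
pairs = candidate_pairs([("window", 1), ("window", 1)],
                        [("fereastra", 1), ("geam", 1)])
```

Had the duplicate "window" been purged, only one of the two synonymic target lemmas could have accumulated evidence for the pair.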
4.4.3 Other improvements
We evaluated the cognate conjecture for the Romanian-English language pair and found
it to be correct in more than 98% of the cases when the similarity threshold was set to
0.68. We also noted that many candidates, rejected either because of a low log-likelihood
score or because they occurred only once, were cognates. Therefore, we modified the
algorithm to also include in the list of extracted translation equivalents all the candidates
which, in spite of failing the log-likelihood test, have a cognate score above the 0.68 threshold.
This change improved both precision and recall (see next section).
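The cognate-based rescue step can be sketched as below. Since the exact COGN similarity measure is not reproduced in this section, difflib's string-similarity ratio is used here purely as a stand-in; the paper's actual measure may differ.

```python
from difflib import SequenceMatcher

def cognate_score(a, b):
    """String-similarity stand-in for the COGN score (illustrative only)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def rescue_cognates(rejected_candidates, threshold=0.68):
    """Re-admit candidate pairs that failed the log-likelihood test but
    whose cognate score exceeds the threshold (section 4.4.3)."""
    return [(s, t) for (s, t) in rejected_candidates
            if cognate_score(s, t) >= threshold]
```

For instance, a rejected pair such as (exemplu, example) scores well above 0.68 under this measure and would be rescued, while (fereastra, window) would not.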
4.4.4 The Graphical User Interface
The Graphical User Interface has been developed mainly for the purpose of validation
and correction (in context) of the translation equivalents, a task entrusted to linguists
without (much) computer training. Besides the lexical translation-equivalents
extraction, the Graphical User Interface incorporates several other useful corpus
management and mining utilities:
a) selecting a corpus from a collection of several corpora;
b) editing and correcting the tokenization, tagging or lemmatization;
c) updating accordingly the extracted lexicons;
d) extracting compound-term translations in one language based on an inventory of
compound terms in the other language;
e) extracting multi-word collocations (monolingually) for updating the segmenter’s
resources for the languages concerned.
Figure 4 exemplifies the parameter settings for the extraction process: the parallel corpus,
the language pairs, the statistical method used for independence-hypothesis testing, the
test threshold, the type of alignment (either by POS or by meta-categories), and, in the
case of POS alignment, which grammatical categories are of interest for the extracted
lexicon.
Figure 4: Parameter settings for a GUI-TREQ translation extraction session
Figure 5 displays the results of the extraction process. By displaying the running texts as
pairs of aligned sentences in two languages, the Graphical User Interface facilitates
evaluation in context of the extracted translation equivalents. When the user points to a
word in either language, its translation equivalent in the other language is displayed.
Figure 5: The “1984” corpus: Romanian-English translation equivalents extracted from the OZZ.113 translation unit
A detailed presentation of the facilities and operating procedures is given in the TREQ
user manual (Mititelu, 2003).
5 EXPERIMENTS AND EVALUATION
We conducted translation equivalents extraction experiments on the three corpora
mentioned before (“1984”, “VAT” and “NAACL2003”) and for various pairs of
languages.
The bilingual lexicons extracted from the integral bitexts for English-Estonian, English-
Hungarian, English-Romanian and English-Slovene were evaluated by native speakers
of the languages paired with English who had a good command of English. The
evaluation protocol specified that all the translation pairs are to be judged in context, so
that if one pair is found to be correct in at least one context, then it should be judged as
correct. The evaluation was done for both the BASE and BETA algorithms but on
different scales. The BASE algorithm was run on all the 6 integral bitexts with the
English hub, and 4 out of the 6 bilingual lexicons were hand-validated. The lexicons
contained all parts of speech defined in the MULTEXT-EAST lexicon specifications
except for interjections, particles and residuals. The BETA and TREQ algorithms were
run on the Romanian-English partial bitext extracted from the “1984” 7-language
parallel corpus and we validated only the noun pairs. For comparison purposes, we also
re-ran the BASE algorithm on the Romanian-English partial bitext.
The translation equivalents extracted from the “VAT” corpus by means of TREQ were
not explicitly evaluated, but were used in a multilingual term-extraction experiment for
the purposes of the FF-POIROT European project. The preliminary comparative
evaluation conducted by native speakers of French and Dutch, with excellent command
of English, showed that both precision (approx. 80%) and recall (approx. 75%) of our
results are significantly better than those of other extractors used in the comparison.
Since we do not yet have the full results of this evaluation, we will not go into further detail.
The bilingual lexicon extracted from the “NAACL2003” corpus by TREQ has been
evaluated based on the test data used by the organisers of the HLT-NAACL2003 Shared
Task on word-alignment. The test text has been manually aligned at word level. This
valuable data and the program that computes precision, recall and F-measure of any
alignment against a gold standard have been graciously made public after the closing of
the shared task competition. From the word-aligned bitext used for evaluation we
removed the null alignments (words not translated in either part of the bitext) and
purged the duplicate translation pairs, and thus obtained the gold standard Romanian-
English lexicon. The evaluation considered all the words. The tables below give an
overview of the corpora and the gold standard alignment text we used for the evaluation
of the translation-equivalents extractors.
Language             BU      CZ      EN      ET      HU      RO      SI
No. of tokens*       72020   66909   87232   66058   68195   85569   76177
No. of word forms*   15093   17659    9192   16811   19250   14023   16402
No. of lemmas*        8225    8677    6871    8403    9729    6987    7157

Figure 6. The "1984" corpus overview
* the counts refer only to 1:1 aligned sentences and do not include interjections, particles and residuals
Language             EN      FR      NL
No. of occurrences   41722   45458   40594
No. of word forms*   3473    3961    3976
No. of lemmas*       2641    2755    3165

Figure 7. The "VAT" corpus overview
                     "NAACL2003" corpus    Word-aligned bitext
Language             EN       RO           EN      RO
No. of tokens        866036   770653       4940    4563
No. of word forms*   27598    48707        1517    1787
No. of lemmas*       19139    23134        1289    1370

Figure 8. Overview of the "NAACL2003" corpus (left) and the word-aligned bitext (right)
5.1 THE EVALUATION OF THE BASE ALGORITHM
For validation purposes we limited the number of iteration steps to 4. The extracted
lexicons contain adjectives (A), conjunctions (C), determiners (D), numerals (M), nouns
(N), pronouns (P), adverbs (R), prepositions (S) and verbs (V). Figure 9 shows the
evaluation results provided by human evaluators2. The precision (Prec) was computed
as the number of correct TEPs divided by the total number of extracted TEPs. The recall
(considered for the non-English language in the bitext) was computed two ways: the first
one, Rec*, took into account only the tokens processed by the algorithm (those that
appeared at least three times). The second one, Rec, took into account all the tokens
irrespective of their frequency counts. Rec* is defined as the number of source lemma
types in the correct TEPs divided by the number of lemma types in the source language
with at least 3 occurrences. Rec is defined as the number of source lemma types in the
correct TEPs divided by the number of lemma types in the source language. The F-
measure is defined as 2*Prec*Rec/(Prec+Rec) and we consider it to be the most
informative score.
The rationale for showing Rec* is to estimate the proportion of the missed tokens out of
the considered ones. This might be of interest when precision is of the utmost
importance.
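For concreteness, the four measures defined above can be computed as in the following sketch (our own illustration; the function and variable names are assumptions):

```python
def evaluate_lexicon(extracted, correct, n_lemma_types, n_lemma_types_min3):
    """Prec, Rec*, Rec and F-measure as defined in section 5.1.

    `extracted` and `correct` are sets of (source_lemma, target_lemma)
    pairs (TEPs); the two counts are the number of source-language lemma
    types (all of them, and those occurring at least 3 times)."""
    prec = len(correct) / len(extracted)
    src_types = {s for (s, _) in correct}  # source lemma types in correct TEPs
    rec_star = len(src_types) / n_lemma_types_min3
    rec = len(src_types) / n_lemma_types
    f_measure = 2 * prec * rec / (prec + rec)
    return prec, rec_star, rec, f_measure
```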
Bitext (4 steps)     ET-EN               HU-EN               RO-EN               SI-EN
Entries              1911                1935                2227                1646
Prec/Rec/F-measure   96.18/18.79/31.16   96.89/19.27/32.14   98.38/25.21/40.13   98.66/22.69/36.89
Rec*                 57.86               56.92               58.75               57.92

Figure 9. “1984” integral bitexts; partial evaluation of the BASE algorithm after 4 iteration steps with the occurrence threshold set to 3
The evaluation of the lexicons was performed fully for Estonian, Hungarian and Romanian,
and partially for Slovene (the first step was fully evaluated, while the rest were evaluated
on randomly selected pairs). As one can see in Figure 9, the precision is higher than
98% for Romanian and Slovene, almost 97% for Hungarian and more than 96% for
Estonian. The Rec* measure ranges from 50.92% (Slovene) to 63.90% (Estonian). The
standard recall Rec varies between 19.27% and 32.46% (quite modest, since on average,
the BASE algorithm did not consider 60% of the lemmas). Due to the low Rec value,
the composite F-measure is also low (ranging between 31.16% and 41.13%) in spite of
the very good precision. Our analysis showed that the accuracy of the extracted entries
varies with the part of speech. Noun extraction had the second worst accuracy (adverbs
had the worst), and therefore we considered that an in-depth evaluation of this case
would be more informative than a global evaluation. Moreover, to facilitate the
comparison between the BASE and BETA algorithms, we set no limit on the number of
steps, lowered the occurrence threshold to 2 and extracted only the noun pairs from the
partial Romanian-English bitext included in the “1984” 7-language parallel corpus. The
BASE program stopped after 10 steps with 1673 extracted noun translation pairs, out of
which 112 were wrong (see Figure 10).
Compared with the 4-step run, the precision decreased to 93.30%, but both Rec
(36.45%) and the F-measure increased significantly, showing that an occurrence
threshold of 2 leads to a better precision/recall compromise than 3.
Noun types   Entries   Correct   Noun types in     Prec/Rec/F-measure
in text                entries   correct entries
3116         1673      1561      1136              93.30/36.45/52.42

Figure 10. “1984” corpus; evaluation of the BASE algorithm with the noun lexicon extracted from the Romanian-English partial bitext; 10 iteration steps, the occurrence threshold set to 2
If the occurrence threshold is removed, the sensitivity to indirect associations degrades
the precision of BASE too much for the lexicon to be really useful.
5.2 THE EVALUATION OF THE BETA ALGORITHM
The BETA algorithm preserves the simplicity of the BASE algorithm but significantly
improves its global performance (F-measure) due to a much better recall (Rec) obtained
at the expense of some loss in precision (Prec). Keeping the occurrence threshold set at
two (that is, ignoring hapax-legomena translation-equivalence candidates) the results of
BETA evaluation on the same data are shown in Figure 11 below:
Noun types   Entries   Correct   Noun types in     Prec/Rec/F-measure
in text                entries   correct entries
3116         2291      2183      1735              95.28/55.68/70.28

Figure 11. “1984” corpus; partial evaluation of the BETA algorithm (noun lexicon extracted from the partial Romanian-English bitext), the occurrence threshold set to 2
Moreover, the sensitivity to indirect associations is greatly reduced, so that removing the
occurrence threshold yields even better global results:
Noun types   Entries   Correct   Noun types in     Prec/Rec/F-measure
in text                entries   correct entries
3116         3128      2516      2114              80.43/67.84/73.60

Figure 12. “1984” corpus; partial evaluation of the BETA algorithm (noun lexicon extracted from the partial Romanian-English bitext), no occurrence threshold
Besides the occurrence threshold, the BETA algorithm offers another way to trade off
Prec for Rec: the COGN similarity score. In the experiments evaluated in Figure 12, the
threshold was set to 0.42.
We should mention that in spite of the general practice in computing recall for bilingual
lexicon-extraction tasks (be it Rec* or Rec), this is only an approximation of the real
recall. The reason for this approximation is that in order to compute the real recall one
should have a gold standard with all the words aligned by human evaluators. Usually
such a gold standard bitext is not available and the recall is either approximated as
above, or is evaluated on a small sample and the result is taken to be more or less true
for the whole bitext.
5.3 THE EVALUATION OF THE TREQ ALGORITHM
To facilitate comparison with the BASE and BETA algorithms we ran TREQ on the
same data and used the same evaluation procedure for the extracted noun-translation
pairs. The results are shown in Figure 13 and (as expected) they are superior to those
provided by BETA.
Noun types   Entries   Correct   Noun types in     Prec/Rec/F-measure
in text                entries   correct entries
3116         3001      2525      2084              84.14/66.88/74.52

Figure 13. “1984” corpus; evaluation of the TREQ algorithm (noun lexicon extracted from the Romanian-English partial bitext)
All the previous evaluations were based on an approximation of the recall measure,
motivated by the lack of a gold standard lexicon. As mentioned before, for the purpose
of the shared task on word alignment at NAACL2003 workshop, the organisers created a
short hand-aligned Romanian-English bitext (248 sentences) which was made public
after the competition. We used this word-alignment data to extract a Gold Standard
Romanian-English Lexicon allowing a precise evaluation of the recall. The complete set
of links in the word-aligned bitext contains 7149 links. Each token in either language is
bi-directionally linked to a token representing its translation in the other language or to
the empty string if it was not translated. Removing the empty links we were left with
6195 links representing pairs of translation equivalents: <RO-word EN-word>. Deleting
the links for punctuation, purging the links corresponding to identical lexical pairs and
eliminating the pairs not preserving the meta-category3 we obtained a Gold Standard
Lexicon containing 1706 entries. Out of these entries 1547 are POS-preserving
translation pairs, the rest being legitimate alternations. The Gold Standard Lexicon
includes all the grammatical categories defined in the revised MULTEXT-EAST
specifications for lexicon encoding (Erjavec, 2001). Figure 14 shows the exact
evaluation of the TREQ performances.
No. of entries in    No. of entries   No. of correct   Prec/Rec/F-measure
the Gold Standard    extracted        entries
1706                 1308             1041             79.58/61.01/69.06

Figure 14. “NAACL2003” word-aligned bitext; exact evaluation of the TREQ algorithm
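The derivation of the Gold Standard Lexicon from the hand-aligned links can be sketched as follows. This is a simplification of the actual filtering described above: in particular, the punctuation test is reduced to a single-character check and the meta-category-preservation filter is omitted.

```python
import string

def gold_standard_lexicon(links):
    """Derive a lexicon from hand-aligned (ro, en) links: drop null links,
    drop punctuation links, drop identical lexical pairs and purge
    duplicates (the meta-category filter used in the paper is omitted)."""
    lexicon = set()
    for ro, en in links:
        if not ro or not en:                 # null link (untranslated word)
            continue
        if ro in string.punctuation or en in string.punctuation:
            continue
        if ro == en:                         # identical lexical pair
            continue
        lexicon.add((ro, en))
    return lexicon
```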
The scores of the exact evaluation are significantly lower than expected, compared to the
approximate evaluation procedure used before on the “1984” corpus. Given the scarcity of
statistical evidence (the NAACL evaluation bitext is almost 20 times smaller than
the bitext extracted from the “1984” corpus), the performance decrease is not surprising.
On the other hand, the exact calculation of the recall shows that considering only lemma
types in one part of the bitext and of the lexicon (as the approximate recall calculation
does) slightly over-estimates the real recall by ignoring the multiple senses a lemma
might have. If we compute the recall as before, it shows an increase of more
than 2% (63.08%) and thus a better F-measure (70.67%).
We mentioned at the beginning of the paper that by adding a post-processing phase to
the basic translation-equivalence extraction procedure, one may further improve the
accuracy and coverage of the extracted lexicons. In the next section we give an overview
of such a post-processing phase, and show how the performance of the translation-
equivalence extraction was improved.
5.3.1 TREQ-AL and word-alignment maps
In (Tufiş et al. 2003) we described our TREQ-AL system which participated in the
Shared Task proposed by the organizers of the workshop on “Building and Using
Parallel Texts: Data Driven Machine Translation and Beyond” at the HLT-NAACL
2003 conference (http://www.cs.unt.edu/~rada/wpt). TREQ-AL builds on TREQ and
generates a word-alignment map for a parallel text (a bitext). The word alignment as it
was defined in the shared task is different from, and harder than, the problem of translation
equivalence as previously addressed. In a lexicon-extraction task, a translation pair is
considered correct if there is at least one context in which it has been correctly observed.
A multiply-occurring pair would count only once for the final lexicon. This is in sharp
contrast with the alignment task where each occurrence of the same pair counts equally.
The word-alignment task requires that each word (irrespective of its POS) and
punctuation mark in both parts of the bitext be paired to a translation in the other part (or
the null translation, if applicable). Such a pair is called a link. In a non-null link, both
elements of the link are non-empty words from the bitext. If either the source word or
the target word is not translated in the other language, this is represented by a null link.
Finally, the evaluations of the two tasks, even if both use the same measures, such as
precision and recall, have to be judged differently. The null links have no significance in
a lexicon-extraction task, while in a word-alignment task they play an important role (in
the Romanian-English gold standard data the null links represent 13.35% of the total
number of links). Since TREQ-AL is built on TREQ, any improvement in the precision
and recall of the extracted lexicons will have a crucial impact on the precision and recall
of the alignment links produced by TREQ-AL. This is also true the other way around: as
described in (Tufiş et al., 2003), several wrong translation pairs extracted by TREQ are
discarded by TREQ-AL and, moreover, many translation pairs missed by TREQ are
generated by the TREQ-AL alignment. This is clearly shown by the scores in Figure 15
as compared to those in Figure 13.
Noun types   Entries   Correct   Noun types in     Prec/Rec/F-measure
in text                entries   correct entries
3116         3724      3263      2450              87.62/75.08/80.87

Figure 15. “1984” corpus; evaluation of the TREQ-AL algorithm (noun lexicon extracted from the Romanian-English partial bitext)
The first three columns in Figure 16 give the initial evaluation of TREQ-AL on the
shared-task data.
             Non-null links only   Null links included   TREQ-AL lexicon
Precision    81.38%                60.43%                84.42%
Recall       60.71%                62.80%                77.72%
F-measure    69.54%                61.59%                80.93%
Figure 16. Evaluation of TREQ-AL in the “NAACL2003” shared task on word alignment
The error analysis pinpointed some minor programming errors and we were able to fix
them in a short period of time. We also decided to see how an external resource, namely
a bilingual seed lexicon, would improve the performance of TREQ and TREQ-AL. We
used our Romanian WordNet, under development, as a source for a seed bilingual
lexicon. The Romanian WordNet contains 11,000 verb and noun synsets which are
linked to the Princeton WordNet. From one Romanian synset SRO, containing M literals,
and its equivalent synset in English SEN, containing N literals, we generated M*N
translation pairs, thus producing a bilingual seed lexicon containing about 40,000
entries. This lexicon contains some noise since not all M*N translation pairs obtained
from two linked synsets are expected to be real translation-equivalence pairs4.
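Generating the seed lexicon from a pair of linked synsets amounts to taking the cross-product of their literals, as in the following sketch (the example synsets are invented):

```python
from itertools import product

def seed_pairs(synset_ro, synset_en):
    """All M*N candidate translation pairs obtained from one Romanian
    synset (M literals) linked to an English synset (N literals). As the
    text notes, some of these pairs are noise rather than real
    translation equivalents."""
    return list(product(synset_ro, synset_en))

# A 2-literal Romanian synset linked to a 3-literal English one gives 6 pairs:
pairs = seed_pairs(["casa", "locuinta"], ["house", "home", "dwelling"])
```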
In Figure 17 we give the new evaluation results (using the official programs and
evaluation data) of the new versions of TREQ and TREQ-AL.
             Non-null links only   Null links included   TREQ-AL lexicon   TREQ lexicon
Precision    84.43%                65.58%                86.68%            79.58%
Recall       64.34%                66.08%                81.96%            61.01%
F-measure    73.03%                65.83%                84.26%            69.06%
Figure 17. Re-evaluation of TREQ-AL and TREQ on the “NAACL2003” shared task without a seed lexicon
As shown in Figure 17, TREQ-AL dramatically improves the performance of TREQ:
the precision increased by more than 7%, while the recall of TREQ-AL is more than
20% better than the recall of TREQ.
The evaluation of TREQ-AL when TREQ started with a seed lexicon showed no
improvement in the final extracted dictionary. However, the results for the word-
alignment shared task improved (apparently the balance between what was found and
what was lost made the difference, which is in any case not statistically significant).
             Non-null links only   Null links included   TREQ-AL lexicon
Precision    84.72%                66.07%                86.56%
Recall       64.73%                66.43%                81.85%
F-measure    73.39%                66.25%                84.13%
Figure 18. Re-evaluation of TREQ-AL and TREQ on the “NAACL2003” shared task with an initial bilingual dictionary
Figures 19a and 19b show the performance of all participating teams on the Romanian-
English word alignment shared task. There were two distinct evaluations: the NON-
NULL-alignments only considered the links that represented non-null translations while
the NULL-alignments took into account both the non-null and the null translations.
RACAI.RE.2 is the evaluation of TREQ-AL with an initial seed lexicon and
RACAI.RE.1 is the evaluation of TREQ-AL without an initial seed lexicon. The
systems were evaluated in terms of three figures of merit: the Fs-measure, the Fp-measure,
and the Alignment Error Rate (AER). Since the Romanian Gold Standard contains only sure
alignments, AER reduces to 1 - Fp-measure. For all systems that assigned only sure
alignments, Fp-measure = Fs-measure (see Mihalcea & Pedersen (2003) for further
details).
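Under the customary shared-task definitions (sure links S, probable links P with S a subset of P, and a system alignment A), the figures of merit can be computed as in the sketch below; when S = P, the AER indeed reduces to 1 minus the F-measure, as noted above.

```python
def alignment_scores(A, S, P):
    """Precision, recall, F-measure and AER for a system alignment A,
    given the sure links S and the probable links P, all represented as
    sets of (source_index, target_index) pairs."""
    prec = len(A & P) / len(A)
    rec = len(A & S) / len(S)
    f_measure = 2 * prec * rec / (prec + rec)
    aer = 1 - (len(A & S) + len(A & P)) / (len(A) + len(S))
    return prec, rec, f_measure, aer
```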
[Bar chart showing, for each system, the Sure F-measure, the Probable F-measure and the AER, ranked in decreasing order of performance: RACAI.RE.2, RACAI.RE.1, XRCE.Nolem-56K.RE.2, Proalign.RE.1, XRCE.Trilex.RE.3, XRCE.Trilex.RE.4, XRCE.Base.RE.1, Ralign.RE.1, BiBr.RE.1, BiBr.RE.3, BiBr.RE.2, UMD.RE.2, UMD.RE.1, Fourday.RE.1; y-axis from 0% to 80%]

Figure 19a. NAACL2003 Shared Task: ranked results of Romanian-English non-NULL alignments
[Bar chart showing, for each system, the Sure F-measure, the Probable F-measure and the AER, ranked in decreasing order of performance: RACAI.RE.2, RACAI.RE.1, XRCE.Trilex.RE.3, Proalign.RE.1, XRCE.Trilex.RE.4, XRCE.Base.RE.1, XRCE.Nolem-56K.RE.2, Ralign.RE.1, BiBr.RE.1, BiBr.RE.3, BiBr.RE.2, UMD.RE.2, UMD.RE.1, Fourday.RE.1; y-axis from 0% to 70%]

Figure 19b. NAACL2003 Shared Task: ranked results of Romanian-English NULL alignments
6 IMPLEMENTATION, CONCLUSIONS AND FURTHER WORK
The extraction programs, BASE, BETA and TREQ, as well as TREQ-AL, run on both
Windows and Unix machines5. Throughput is very fast: on a Pentium 4 (1.7 GHz) with
512 MB of RAM, extracting the noun bilingual lexicon from “1984” took 109 seconds
(72 s. for TREQ plus 37 s. for TREQ-AL) while the full dictionary was generated in 285
seconds (204 s. for TREQ plus 81 s for TREQ-AL). These figures are comparable to
those reported in (Tufiş and Barbu, 2002) for BETA although the machine on which
those evaluations were conducted was a less powerful Pentium II (233 MHz) processor
accessing 96MB of RAM.
An approach quite similar to our BASE algorithm (also implemented in Perl) is
presented in (Ahrenberg et al., 2000). They used a frequency threshold of 3, and the best
results reported are 92.5% precision and 54.6% partial recall (what we called Rec*). The
BETA and TREQ algorithms exploit the idea of competitive linking underlying
Melamed’s extractor (Melamed, 2001), although our program never returns to a visited
translation unit. Melamed’s evaluation is made in terms of accuracy and coverage,
where accuracy is more or less our precision and coverage is defined as the percentage
of tokens in the corpus for which a translation has been found. With the best 90%
coverage, the accuracy of his lexicon was 92.8±1.1%. Coverage is a much weaker
evaluation function than recall, especially for large corpora, since it favours frequent
tokens to the detriment of hapax legomena. Melamed (2001) showed that the 4.5% most
frequent translation pair types in the Hansard parallel corpus cover more than 61% of the
tokens in a random sample of 800 sentences. Moreover, the approximation used by
Melamed in computing coverage over-estimates the true value, since it does not consider
whether the translations found for the words in the corpus are correct or not. Based on the Gold
Standard Lexicon, we could compute exact precision, recall, coverage and also the
approximated coverage (Melamed’s way). As Figure 20 shows, in spite of a very small
text, there are significant differences between the exact coverage and the approximated
coverage. The differences are much more significant in the case of a larger text.
             Exact Coverage   Estimated Coverage   Precision
Romanian     91.91%           96.92%
English      91.98%           97.21%               86.56%

Figure 20: Exact and Estimated Coverage for the lexicon extracted by TREQ-AL from the NAACL2003 Gold Standard Alignment
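The difference between the two coverage figures can be made concrete with a small sketch (our own illustration): estimated coverage counts every token for which the lexicon proposes some translation, while exact coverage, which requires a gold standard, counts only the tokens whose proposed translation is correct.

```python
def coverage(tokens, proposed, correct=None):
    """Percentage of corpus tokens covered by the lexicon.

    `proposed` is the set of words for which the lexicon offers a
    translation; if `correct` (the set of words whose proposed translation
    is right, known only from a gold standard) is given, the exact
    coverage is returned instead of the estimated one."""
    covered = [t for t in tokens if t in proposed]
    if correct is not None:
        covered = [t for t in covered if t in correct]
    return 100.0 * len(covered) / len(tokens)
```

Because the estimated variant never checks correctness, it can only over-estimate the exact one, which is the discrepancy Figure 20 quantifies.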
We ran TREQ-AL (without the seed lexicon mentioned before) on the entire
NAACL2003 corpus, extracting a 48287-entry lexicon. Following Melamed’s (2001)
procedure, we took five random samples (with replacement) of 100 entries and validated
them by hand. The average resulting precision was 91.67% with an estimated coverage
of 95.21% for Romanian and 96.56% for English. However, as demonstrated in Figure
20, without a gold standard, such estimated evaluations should be regarded cautiously.
All the algorithms we presented are based on a 1:1 mapping hypothesis. We argued that
when a language-specific tokenizer is responsible for pre-processing the input to the
extractor, the 1:1 mapping approach is no longer an important limitation.
Incompleteness of the segmenter’s resources may be compensated for by a post-
processing phase that recovers the partial translations. In (Tufiş, 2001) such a
recovery phase is presented that takes advantage of the already extracted entries.
Additional means, such as collocation extraction based on n-gram statistics and partial
grammar filtering (as included in the GUI-TREQ), are effective ways of continuously
improving the segmenter’s resources and greatly reduce the restrictions imposed by the
1:1 mapping hypothesis.
Finally, we should note that while TREQ is quite mature, TREQ-AL is still under
development, and we are confident that there is ample room for future performance
improvements.
Acknowledgements
The research on translation equivalents started as an AUPELF/UREF co-operation
project with LIMSI/CNRS (CADBFR) and used the multilingual corpora and lexical
resources developed within the MULTEXT-EAST, TELRI and CONCEDE EU
projects. The continuous improvements of the methods and tools described in this paper
were motivated and supported by two European projects we are currently involved in:
FF-POIROT (IST-2001-38248) and BALKANET (IST-2000-29388).
We are grateful to the editor of this issue and to an anonymous reviewer, who did a
great job in improving the content and the readability of this paper. All the remaining
problems are entirely ours.
References
Ahrenberg, L., Andersson, M., Merkel, M. (2000). "A knowledge-lite approach to word
alignment", in Véronis, J. (ed). Parallel Text Processing. Text, Speech and Language
Technology Series, Kluwer Academic Publishers, Vol. 13, pp. 97-116.
Brants, T. (2000). “TnT – A Statistical Part-of-Speech Tagger”, in Proceedings ANLP-2000,
April 29 – May 3, Seattle, WA.
Brew, C., McKelvie, D. (1996). “Word-pair extraction for lexicography.” Available at
http://www.ltg.ed.ac.uk/~chrisbr/papers/nemplap96.
Brown, P. F., Della Pietra, S. A., Della Pietra, V. J. and Mercer, R. L. (1993). "The
mathematics of statistical machine translation: parameter estimation", in Computational
Linguistics, 19/2, pp. 263-311.
Dunning, T. (1993), “Accurate Methods for the Statistics of Surprise and Coincidence” in
Computational Linguistics, 19/1, pp. 61-74.
Dimitrova, L, Erjavec, T., Ide, N., Kaalep, H., Petkevic, V. and Tufiş, D. (1998) "Multext-East:
Parallel and Comparable Corpora and Lexicons for Six Central and East European
Languages" in Proceedings ACL-COLING’1998, Montreal, Canada, pp. 315-319.
Gale, W.A. and Church, K.W. (1991). "Identifying word correspondences in parallel texts". In
Fourth DARPA Workshop on Speech and Natural Language, pp. 152-157.
Gale, W.A. and Church, K.W. (1993). “A Program for Aligning Sentences in Bilingual
Corpora”. In Computational Linguistics, 19/1, pp. 75-102.
Erjavec, T. (ed.) (2001). “Specifications and Notations for MULTEXT-East Lexicon Encoding”.
Multext-East/Concede Edition, March 21, 210 pages, available at
http://nl.ijs.si/ME/V2/msd/html/.
Erjavec, T., Lawson, A., Romary, L. (1998). East Meets West: A Compendium of Multilingual
Resources. TELRI-MULTEXT EAST CD-ROM, ISBN: 3-922641-46-6.
Erjavec T., Ide, N. (1998) “The Multext-East corpus”. In Proceedings LREC’1998, Granada,
Spain, pp. 971-974.
Hiemstra, D. (1997). "Deriving a bilingual lexicon for cross language information retrieval". In
Proceedings of Gronics, pp. 21-26.
Ide, N., Véronis, J. (1995). “Corpus Encoding Standard”, MULTEXT/EAGLES Report.
Available at http://www.lpl.univ-aix.fr/projects/multext/CES/CES1.html.
Kay, M., Röscheisen, M. (1993). “Text-Translation Alignment”. In Computational Linguistics,
19/1, pp. 121-142.
Kupiec, J. (1993). "An algorithm for finding noun phrase correspondences in bilingual corpora".
In Proceedings of the 31st Annual Meeting of the Association for Computational
Linguistics, pp. 17-22.
Melamed, D. (2001). Empirical Methods for Exploiting Parallel Texts. The MIT Press,
Cambridge Massachusetts, London England, 195 pages.
Mihalcea, R., Pedersen, T. (2003). “An Evaluation Exercise for Word Alignment”. Proceedings of
the HLT-NAACL 2003 Workshop on Building and Using Parallel Texts: Data Driven
Machine Translation and Beyond, May-June, Edmonton, Canada, pp. 1-10.
Mititelu, C. (2003). TREQ User Manual, Technical Report, RACAI, May, 25 pages.
Smadja, F., McKeown, K.R. and Hatzivassiloglou, V. (1996). "Translating collocations for
bilingual lexicons: A statistical approach". Computational Linguistics, 22/1, pp. 1-38.
Stamou, S., Oflazer, K., Pala, K., Christodoulakis, D., Cristea, D., Tufiş, D., Koeva, S., Totkov,
G., Dutoit, D., Grigoriadou, M. (2002). “BALKANET: A Multilingual Semantic Network
for the Balkan Languages”, in Proceedings of the International Wordnet Conference,
Mysore, India, 21-25 January.
Tufiş, D., Barbu, A.M., Pătraşcu, V., Rotariu, G., Popescu, C. (1997). ”Corpora and Corpus-
Based Morpho-Lexical Processing”, in D. Tufiş, P. Andersen (eds.), Recent Advances in
Romanian Language Technology, Editura Academiei, pp. 35-56.
Tufiş, D., Ide, N., Erjavec, T. (1998). “Standardized Specifications, Development and
Assessment of Large Morpho-Lexical Resources for Six Central and Eastern European
Languages”. Proceedings LREC’1998, Granada, Spain, pp. 233-240.
Tufiş, D. (2000). “Using a Large Set of Eagles-compliant Morpho-Syntactic Descriptors as a
Tagset for Probabilistic Tagging”. Proceedings LREC’2000, Athens, pp. 1105-1112.
Tufiş, D. (2001). “Partial translations recovery in a 1:1 word alignment approach”, RACAI
Technical Report, 2001 (in Romanian), 18 pages.
Tufiş, D. (2002) ”A cheap and fast way to build useful translation lexicons” in Proceedings of
the 19th International Conference on Computational Linguistics, COLING2002, Taipei,
25-30 August, pp. 1030-1036.
Tufiş, D., Barbu, A.M. (2002). ”Revealing translators’ knowledge: statistical methods in
constructing practical translation lexicons for language and speech processing”, in
International Journal of Speech Technology, Kluwer Academic Publishers, no. 5, pp. 199-209.
Tufiş, D., Barbu, A.M., Ion, R. (2003) “TREQ-AL: A word alignment system with limited
language resources”, Proceedings of the HLT-NAACL 2003 Workshop on Building and
Using Parallel Texts: Data Driven Machine Translation and Beyond, May-June,
Edmonton, Canada, pp. 36-39.
NOTES

1 MtSeg has tokenization resources for many Western European languages, further enhanced in
the MULTEXT-EAST project (Erjavec and Ide, 1998; Dimitrova et al., 1998; Tufiş et al., 1998)
with corresponding resources for Bulgarian, Czech, Estonian, Hungarian, Romanian and
Slovene.
2 The lexicons were evaluated by Heiki Kaalep of the University of Tartu (ET-EN), Tamás Váradi
of the Linguistic Institute of the Hungarian Academy (HU-EN), Ana Maria Barbu of RACAI
(RO-EN) and Tomaž Erjavec of the IJS Ljubljana. All of them are gratefully acknowledged.
3 This was necessary because of the way the Gold Standard Alignment dealt with compounds: an
expression in Romanian consisting of N words, aligned to its equivalent expression in English
containing M words, was represented by N*M word links. In this case we considered only one
lexicon entry instead of N*M.
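The collapsing described in this note can be sketched as follows; representing each word link as a (Romanian word, English word, expression identifier) triple is a hypothetical encoding for illustration, not the actual format of the Gold Standard Alignment.

```python
def collapse_links(word_links):
    """Turn the N*M word links of each expression pair into a single
    multi-word lexicon entry, preserving the order of first occurrence."""
    entries = {}
    for ro_word, en_word, expr_id in word_links:
        ro_words, en_words = entries.setdefault(expr_id, ([], []))
        if ro_word not in ro_words:
            ro_words.append(ro_word)
        if en_word not in en_words:
            en_words.append(en_word)
    # one (Romanian expression, English expression) entry per identifier
    return [(" ".join(ro), " ".join(en)) for ro, en in entries.values()]
```

For example, a 2-word Romanian expression aligned to a 2-word English one yields 2*2 = 4 word links but only a single lexicon entry.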
4 The existing errors in our synsets definition might be the simplest explanation.
5 The programs are written in Perl and we tested them on Unix, Linux and Windows. The
graphical user interface of TREQ combines technologies like DHTML, XML, and XSL with the
languages HTML, JavaScript, Perl, and PerlScript.