Bilingual terminology mining
Estelle Delpech
30th November, 2010
4th intensive summer school on Natural Language Processing
About me
● Estelle Delpech
● Research engineer at Lingua et Machina, France
  ● CAT tools provider
  ● ed(at)lingua-et-machina(dot)com
  ● www.lingua-et-machina.com
● PhD candidate at LINA, France
  ● TALN team: specialises in NLP
  ● estelle.delpech(at)univ-nantes(dot)fr
Presentation outline
● About terms, terminology, terminology mining
● Term Extraction
● Term Alignment

What is a term?
● Classical definition:
  ● "unequivocal expression of a concept within a technical domain"
  ● Traces back to 1930, Eugen Wüster, « General Theory of Terminology »
  ● Specialized language is / should be unambiguous
[Figure: Ogden semiotic triangle, with vertices concept, term and referent]

What is a term?
● Classical terminology was challenged in the 1990s by:
  ● sociolinguistics
  ● corpus-based linguistics
  ● computational terminology
● Observing terms in texts shows that:
  ● there is variation and polysemy
  ● concepts evolve over time
  ● there is no clear-cut border between specialized and general language

What is a term?
● The definition of « term » depends on the application / audience of the terminology
● Domain expert:
  ● unit of knowledge
● Information retrieval:
  ● descriptors for indexing
● Translation:
  ● word or phrase that:
    ● is not part of general language
    ● translates differently in a particular domain
  ● can be:
    ● a noun, adjective or verb
    ● a noun phrase, verb phrase, etc.

What is a terminology?
● Set of terms + terminological records
● Terminological record:
  ● part of speech
  ● frequency
  ● variants
  ● contexts
● Relations between terms / concepts:
  ● Hypernymy: cat is a sort of animal
  ● Meronymy: head is part of body
● Bilingual terminology:
  ● translation relations
http://www.termiumplus.gc.ca/

Where do you find terms?
● In specialized texts:
  ● research papers on breast cancer
  ● plane crash reports
● Corpus building:
  ● it is important to gather texts from a well-defined domain / topic

Bilingual terminology mining (1)
[Diagram labels: specialized texts; term extraction / data mining; terms, terms; term alignment; bilingual terminology; terminology management software]

Bilingual terminology mining (2)
[Diagram labels: specialized texts; synchronized term extraction and alignment; terms, terms; bilingual terminology; terminology management software]

Term extraction: semi-supervised process
● The notion of term is « slippery »
● The same lexical unit may or may not be considered a term depending on:
  ● audience
  ● domain
  ● application
● Term extractors extract candidate terms, i.e. units that:
  ● are frequent in texts of a given domain
    ● HER2 gene
  ● look like terms: well-formed phrases
    ● human cell lines
  ● are groups of words that frequently occur together
    ● to compile a program
L'Homme, 2004

Term extraction: semi-supervised, lexico-semantic process
[Diagram labels: specialized texts; term extractor; manual selection; candidate terms; terms; automatic indexing; terminology; texts; terms; concepts]

Termhood clues (1): frequency
● A term occurs frequently in specialized texts
  ● the higher the frequency, the better?
● Comparison with general language:
  ● does the term occur more frequently than expected in general language?
● Compute significance tests:
  ● e.g. chi-square (χ²)
L'Homme, 2004
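
The frequency comparison above can be sketched as a chi-square test on a 2x2 table (specialized vs. general corpus, the candidate vs. all other words). This is only an illustration: the function name, its parameters and the counts in the usage line are hypothetical, and 3.84 is the usual critical value for one degree of freedom at p = 0.05.

```python
def chi_square_2x2(freq_spec, size_spec, freq_gen, size_gen):
    """Pearson chi-square statistic for a 2x2 table.

    Rows: specialized corpus / general corpus.
    Columns: occurrences of the candidate / occurrences of any other word.
    """
    observed = [
        [freq_spec, size_spec - freq_spec],
        [freq_gen, size_gen - freq_gen],
    ]
    total = size_spec + size_gen
    row_totals = [size_spec, size_gen]
    col_totals = [freq_spec + freq_gen, total - freq_spec - freq_gen]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count if the word were equally frequent in both corpora
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2


# Hypothetical counts: 120 occurrences in a 100,000-word specialized corpus,
# 30 occurrences in a 1,000,000-word general corpus.
is_candidate = chi_square_2x2(120, 100_000, 30, 1_000_000) > 3.84
```

A high statistic only says that the two relative frequencies differ; a real extractor would also check that the word is the more frequent one in the specialized corpus.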

Termhood clues (2): form
● A term is a well-formed phrase
  ● ...HER2/neu oncogenes are members of...
● It matches a morpho-syntactic pattern
  ● e.g. NOUN + NOUN
● There are many patterns:
  ● NOUN PREP DET NOUN
    ● alternation of the gene
  ● NOUN PREP NOUN COORD ADJ NOUN
    ● susceptibility to breast and ovarian cancer
  ● NOUN NOUN NOUN NOUN NOUN
    ● human breast cancer cell lines

Termhood clues (2): form
● Preprocessing:
  ● tokenization
  ● lemmatisation
  ● POS tagging
● Example: … HER-2/neu oncogenes are members of …

| Token | HER-2/neu | oncogenes | are | members | of |
|---|---|---|---|---|---|
| POS tag | NOUN | NOUN | VERB | NOUN | PREP |
| Lemma | HER-2/neu | oncogene | be | member | of |

Identification of syntactic patterns
● Patterns are expressed as regular expressions / finite-state automata
  ● e.g. NOUN (PREP? NOUN)?
[Figure: finite-state automaton for this pattern, with START, NOUN and PREP labels]
● The pattern matches:
  ● NOUN: gene
  ● NOUN NOUN: HER2 gene
  ● NOUN PREP NOUN: member of family
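
One way to apply the pattern NOUN (PREP? NOUN)? is to encode each POS tag as a single character and run an ordinary regular expression over the resulting string. The sketch below assumes the text is already tokenized, lemmatized and tagged; the tag codes and the function name are made up for the illustration.

```python
import re

# One character per tag; any tag not listed here becomes "x".
TAG_CODES = {"NOUN": "N", "PREP": "P"}
# NOUN (PREP? NOUN)?
PATTERN = re.compile(r"N(P?N)?")


def extract_candidates(tagged):
    """tagged: list of (lemma, pos) pairs -> list of candidate term strings."""
    codes = "".join(TAG_CODES.get(pos, "x") for _, pos in tagged)
    candidates = []
    for match in PATTERN.finditer(codes):
        # Character offsets in `codes` are token offsets in `tagged`.
        lemmas = [lemma for lemma, _ in tagged[match.start():match.end()]]
        candidates.append(" ".join(lemmas))
    return candidates


sentence = [("HER-2/neu", "NOUN"), ("oncogene", "NOUN"), ("be", "VERB"),
            ("member", "NOUN"), ("of", "PREP")]
candidates = extract_candidates(sentence)
```

Note that `finditer` returns non-overlapping, leftmost matches only, so nested candidates (for instance "oncogene" inside "HER-2/neu oncogene") are not listed separately here.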

Termhood clues (3): word association
● Significant cooccurrences are good clues for termhood:
  ● … breast cancer …
  ● … breast remains …
  ● … alternative cancer …
● Must take into account:
  ● the number of times the two words cooccur
  ● the number of times word A occurs
  ● the number of times word B occurs

Measure for cooccurrence significance
● Mutual Information:

$$\mathrm{MI}(a,b) = \log_2 \frac{P(a,b)}{P(a)\cdot P(b)}$$

$$P(a,b) = \frac{\mathrm{nbocc}(a,b)}{N}, \qquad P(a) = \frac{\mathrm{nbocc}(a)}{N}$$

where N is the total number of words in the corpus.

| Pair (a b) | nbocc(a,b) | nbocc(a) | nbocc(b) | MI |
|---|---|---|---|---|
| invasive carcinoma | 20 | 30 | 20 | 9.7 |
| cancer means | 50 | 800 | 800 | 1.69 |

● Remarkable attraction between invasive and carcinoma despite a relatively low number of cooccurrences
Church and Hanks, 1990; L'Homme, 2004
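
The mutual information formula translates directly into a few lines of code. In this sketch the corpus size is a hypothetical parameter: the slide does not give N, so the calls below illustrate the ranking of the two pairs rather than reproduce the MI values of the table.

```python
import math


def mutual_information(n_ab, n_a, n_b, n_total):
    """MI(a, b) = log2( P(a, b) / (P(a) * P(b)) ), with probabilities
    estimated as counts divided by the corpus size."""
    p_ab = n_ab / n_total
    p_a = n_a / n_total
    p_b = n_b / n_total
    return math.log2(p_ab / (p_a * p_b))


N = 100_000  # assumed corpus size, not taken from the slides
mi_invasive_carcinoma = mutual_information(20, 30, 20, N)
mi_cancer_means = mutual_information(50, 800, 800, N)
```

Whatever the value of N, the rare but tightly bound pair gets the higher score, which is the point made on the slide.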
Presentation outline
● About terms, terminology, terminology mining
● Term Extraction
● Term Alignment
  ● in parallel corpora
  ● in comparable corpora

Parallel and comparable corpora
● Parallel corpora
  ● source and target texts are translations of each other
  ● reduce the search space little by little:
    ● first align sentences
    ● then align terms
● Comparable corpora
  ● not translations, but very similar in topic
  ● contain a good proportion of term translations
  ● search space:
    ● all terms of the target corpus

Sentence alignment (1)
● Gale and Church (1993)'s hypothesis:
  ● translated sentences have roughly the same length
  ● the probability P(S,T) that sentence S translates into T is based on the length difference
● Improvement: use a seed lexicon
  ● the probability P(S,T) is based on the number of words in common
Gale and Church, 1993
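
A length-based score in the spirit of this hypothesis could look like the sketch below. It is a simplification, not the full Gale and Church model: lengths are assumed to be in characters, and the constants c = 1 and s2 = 6.8 are the values usually quoted for that model, to be treated here as assumptions.

```python
import math


def length_match_score(len_s, len_t, c=1.0, s2=6.8):
    """Score in [0, 1]: close to 1 when the target length is what one would
    expect from the source length, close to 0 when it is far off."""
    # Normalized length difference (assumed ~ standard normal for true pairs)
    delta = (len_t - len_s * c) / math.sqrt(max(len_s, 1) * s2)
    # Cumulative standard normal distribution at |delta|
    phi = 0.5 * (1.0 + math.erf(abs(delta) / math.sqrt(2.0)))
    # Two-tailed probability of a difference at least this large
    return 2.0 * (1.0 - phi)


same_length = length_match_score(100, 100)
longer_target = length_match_score(100, 140)
```

Such scores can then fill the matrix of sentence pairs on which the alignment path is searched.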

Sentence alignment (2)
● Compute probabilities for all pairs (S,T)
● Build a matrix where M(i,j) contains the probability that sentence i translates to sentence j

| | 0 | 1 | 2 | ... | n |
|---|---|---|---|---|---|
| 0 | 0.89 | 0.56 | 0.2 | ... | ... |
| 1 | 0.45 | 0.9 | 0.1 | ... | ... |
| 2 | ... | 0.23 | 0.9 | 0.3 | ... |
| ... | ... | ... | 0.44 | 0.76 | ... |
| m | ... | ... | ... | ... | 0.88 |

Gale and Church, 1993

Sentence alignment (2)
● Use dynamic programming to find the best « path », i.e. the best alignments
[Figure: same matrix as on the previous slide]
Gale and Church, 1993
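
The search for the best path can be sketched with a small dynamic program over log-probabilities. This version only handles 1-1 matches and skipped sentences (the original method also allows 2-1, 1-2 and 2-2 groupings); the skip probability and the toy matrices are illustrative values, not taken from the slides.

```python
import math


def best_alignment(matrix, skip_prob=0.05):
    """Return the 1-1 sentence pairs (i, j) on the best monotone path."""
    m, n = len(matrix), len(matrix[0])
    skip = math.log(skip_prob)
    score = [[float("-inf")] * (n + 1) for _ in range(m + 1)]
    back = [[None] * (n + 1) for _ in range(m + 1)]
    score[0][0] = 0.0
    for i in range(m + 1):
        for j in range(n + 1):
            moves = []
            if i > 0 and j > 0 and matrix[i - 1][j - 1] > 0:
                match = score[i - 1][j - 1] + math.log(matrix[i - 1][j - 1])
                moves.append((match, "match"))
            if i > 0:  # source sentence i-1 left unaligned
                moves.append((score[i - 1][j] + skip, "skip_source"))
            if j > 0:  # target sentence j-1 left unaligned
                moves.append((score[i][j - 1] + skip, "skip_target"))
            if moves:
                score[i][j], back[i][j] = max(moves)
    # Walk back from the bottom-right corner to recover the path
    pairs, i, j = [], m, n
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "match":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif move == "skip_source":
            i -= 1
        else:
            j -= 1
    return pairs[::-1]


toy_matrix = [[0.89, 0.56, 0.2],
              [0.45, 0.9, 0.1],
              [0.05, 0.23, 0.9]]
path = best_alignment(toy_matrix)
```

Working in log space turns the product of probabilities along a path into a sum, which is what makes the cell-by-cell recurrence possible.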

Sub-sentential alignment: AnyMalign (Lardilleux, 2010)
● AnyMalign is a sub-sentential aligner
● It aligns words and groups of words for MT translation tables
● Aligned groups of words:
  ● are more or less like statistical collocations
  ● it is possible to find term patterns in these groups of words
Lardilleux et al., 2010

AnyMalign (Lardilleux, 2010)
● The algorithm is based on « perfect alignments »:
  ● words or groups of words that occur in exactly the same aligned sentences
● Example corpus of aligned sentences:
  ● a d ↔ A D
  ● b ↔ B
  ● b ↔ C
  ● a e ↔ A DD
● a ↔ A is a perfect alignment
Lardilleux et al., 2010
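
The notion of perfect alignment can be illustrated by grouping words according to the set of sentence pairs in which they occur. This is a simplified reading of the idea, not the actual AnyMalign implementation; the function name and data layout are assumptions.

```python
from collections import defaultdict


def perfect_alignments(corpus):
    """corpus: list of (source_sentence, target_sentence) strings.
    Returns (source_words, target_words) pairs whose words occur in
    exactly the same sentence pairs."""
    occurrences = defaultdict(set)
    for idx, (src, tgt) in enumerate(corpus):
        for word in src.split():
            occurrences[("src", word)].add(idx)
        for word in tgt.split():
            occurrences[("tgt", word)].add(idx)
    # Words sharing the same set of sentence indices go in the same group
    groups = defaultdict(lambda: {"src": [], "tgt": []})
    for (side, word), sentence_ids in occurrences.items():
        groups[frozenset(sentence_ids)][side].append(word)
    return [(" ".join(g["src"]), " ".join(g["tgt"]))
            for g in groups.values() if g["src"] and g["tgt"]]


corpus = [("a d", "A D"), ("b", "B"), ("b", "C"), ("a e", "A DD")]
alignments = perfect_alignments(corpus)
```

On the toy corpus, a ↔ A is found because both words occur in the first and last sentence pairs only, while b stays unaligned because neither B nor C covers both of its sentences. Under this simplified criterion, words that occur once on each side of the same pair (d and D, e and DD) also come out as aligned.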

AnyMalign (Lardilleux, 2010)
● How to get more « perfect alignments »?
  ● with smaller corpora
● How to get smaller corpora?
  ● randomly select subcorpora from your corpus
● Example, with the corpus a d ↔ A D, b ↔ B, b ↔ C, a e ↔ A DD:
  ● Subcorpus 1: b ↔ B
  ● Subcorpus 2: a ↔ A
Lardilleux et al., 2010

AnyMalign (Lardilleux, 2010)
● Complements of perfect alignments are likely to be good alignments too
● Example, with the corpus a d ↔ A D, b ↔ B, b ↔ C, a e ↔ A DD:
  ● perfect alignment: a ↔ A
  ● complements: d ↔ D, e ↔ DD
Lardilleux et al., 2010

AnyMalign (Lardilleux, 2010)
● Process:
  ● iteratively extract random samples of random size from your corpus
  ● extract « perfect alignments » and their complements
  ● the same alignment can occur several times
  ● count, for each alignment, the number of times it occurs
Lardilleux et al., 2010

AnyMalign (Lardilleux, 2010)
● Output:
  ● alignments sorted by descending number of occurrences
● Alignment probability:

$$P(S \mid T) = \frac{C(S,T)}{C(T)}$$

where
● S = source group of words
● T = target group of words
● C(S,T) = number of times S was aligned with T
● C(T) = number of times T appears in an alignment
Lardilleux et al., 2010
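
Given the list of alignments collected over all random samples, the counts and the probability P(S|T) = C(S,T) / C(T) can be computed as in this sketch. The function name and the toy observations are hypothetical.

```python
from collections import Counter


def alignment_probabilities(observed):
    """observed: list of (source_group, target_group) pairs, one entry per
    time the alignment was extracted from a sample.
    Returns {(S, T): P(S | T)}."""
    pair_counts = Counter(observed)                      # C(S, T)
    target_counts = Counter(t for _, t in observed)      # C(T)
    return {(s, t): count / target_counts[t]
            for (s, t), count in pair_counts.items()}


observed = [("a", "A")] * 3 + [("b", "A")] + [("d", "D")]
probabilities = alignment_probabilities(observed)
# Alignments sorted by descending number of occurrences
ranking = [pair for pair, _ in Counter(observed).most_common()]
```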

AnyMalign (Lardilleux, 2010)
Advantages:
● Can perform alignment with more than 2 languages at the same time
  ● 1 language → statistical collocations
● Extracts and aligns non-contiguous sequences of words
  ● to give something up
  ● to let someone down
● No a priori expectations on terms
  ● sometimes a term in the source language is not translated by a term
  ● terms = what you can align
Lardilleux et al., 2010

AnyMalign (Lardilleux, 2010)
● Word groups are not grammatical phrases:
  ● the aligner outputs « that sample sentences and », « exchange format fitted for the »
  ● but not « sample sentences », « exchange format »
● Solutions:
  ● find term patterns
  ● use heuristics
    ● trim stop words
Lardilleux et al., 2010
Advantages of comparable corpora
● More available
  ● new languages
  ● new language pairs
  ● new topics / domains
● Less expensive to build
● More natural
  ● data was produced spontaneously
  ● no influence from a source text

Contextual approach
● Based on distributional linguistics (Z. Harris)
● Words with similar meanings appear in similar contexts
● If source and target words have similar contexts, they might be translations
● Compute contexts for each source and target word
● Compare contexts
● Find the most similar contexts

Contextual approach
● Representation of the context of a given word with a vector:
  ● head word + collocates
● The vector associates the « head » word with its most frequent collocates
  ● + some indication of the strength of association between head word and collocates
[Figure: context vector of the head word « drink », with collocates water, beer, mouth, ..., glass]

Building context vector for « drink »
● Collocates: words occurring within a distance of n words from the head word
● Example contexts:
  ● is variety of reasons to drink plenty of water each day
  ● simple as a glass of drinking water be the key to the
  ● popular in Japan today to drink water from glass after waking
● Resulting counts:
  ● (drink, water) = 3
  ● (drink, glass) = 2
  ● (drink, Japan) = 1
  ● (drink, reason) = 1
  ● (drink, plenty) = 1
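
The counting step can be sketched as follows. The sketch assumes the contexts are already lemmatized (so "drinking" and "reasons" appear as "drink" and "reason"), uses a window of 3 words on each side, and filters a small hand-made stop list; all three choices are assumptions made for the illustration.

```python
from collections import Counter

# Hypothetical stop list, just large enough for the toy contexts below
STOP_WORDS = {"is", "of", "to", "as", "a", "be", "the", "in",
              "from", "each", "after"}


def context_vector(sentences, head, window=3):
    """Count the words found within `window` tokens of each occurrence
    of `head`, ignoring stop words."""
    vector = Counter()
    for tokens in sentences:
        for i, token in enumerate(tokens):
            if token != head:
                continue
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            vector.update(w for w in left + right if w not in STOP_WORDS)
    return vector


sentences = [
    "be variety of reason to drink plenty of water each day".split(),
    "simple as a glass of drink water be the key to the".split(),
    "popular in Japan today to drink water from glass after wake".split(),
]
drink_vector = context_vector(sentences, "drink")
```

With these settings the counts for water, glass, Japan, reason and plenty match the slide; "today" is also counted once because it is not in the toy stop list.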

Normalized cooccurrence frequency
● Normalization: use a measure like MI or the log-likelihood ratio to counteract the influence of high-frequency words
● Example: log-likelihood ratio
  ● 1000 cooc. in corpus
  ● (drink, x) = 75 cooc.
  ● (water, y) = 75 cooc.
  ● (drink, water) = 25 cooc.

| | water | ¬ water | total |
|---|---|---|---|
| drink | 50 | 25 | 75 |
| ¬ drink | 25 | 900 | 925 |
| total | 75 | 925 | 1000 |

Dunning, 1993

Log-likelihood ratio
● Contingency table:

| | water | ¬ water | total |
|---|---|---|---|
| drink | a | b | e |
| ¬ drink | c | d | h |
| total | f | g | N |

$$\mathrm{LLR}(\text{water}, \text{drink}) = a\log a + b\log b + c\log c + d\log d + N\log N - e\log e - f\log f - g\log g - h\log h$$

● log-likelihood ratio (drink, water) = 45.05
Dunning, 1993
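
The formula can be checked numerically with a short function. The base of the logarithm is an assumption, since the slides do not state it: with base-10 logarithms and the cell counts a = 50, b = 25, c = 25, d = 900 the result is about 45.0, in line with the value quoted above. Dunning's statistic is more commonly reported as twice the natural-log version of this sum.

```python
import math


def xlogx(x, base=10):
    """x * log(x), with the usual convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x, base)


def log_likelihood_ratio(a, b, c, d, base=10):
    """a, b, c, d: the four inner cells of the contingency table."""
    e, h = a + b, c + d          # row totals
    f, g = a + c, b + d          # column totals
    n = a + b + c + d            # grand total
    return (xlogx(a, base) + xlogx(b, base) + xlogx(c, base)
            + xlogx(d, base) + xlogx(n, base)
            - xlogx(e, base) - xlogx(f, base)
            - xlogx(g, base) - xlogx(h, base))


llr_drink_water = log_likelihood_ratio(50, 25, 25, 900)
```

When the two words are independent (for instance all four cells equal), the terms cancel out and the score is zero.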

Context vector comparison
● Compute context vectors for words in the source and target corpora
[Figure: context vector of « drink », with collocates water, beer, mouth, ..., glass]
● How to compare word contexts in different languages?
[Figure: context vector of « ดื่ม », with collocates น้ำ, เบียร์, ปาก, ..., แก้ว]
Rapp 1995 ; Fung 1997

Context vector comparison
● Use a seed lexicon to map collocates
[Figure: the collocates of « drink » (water, beer, mouth, ..., glass) are mapped to the collocates of « ดื่ม » (น้ำ, เบียร์, ปาก, ..., แก้ว) through a Thai-English seed lexicon]
Rapp 1995 ; Fung 1997

Context vector comparison
● Measuring the context similarity of words a and b
  ● = measuring the cosine of the angle between the vector of a and the vector of b

$$\cos(a,b) = \frac{\sum_{c \in a \cup b} w(c,a)\cdot w(c,b)}{\sqrt{\sum_{c \in a} w(c,a)^2 \cdot \sum_{c \in b} w(c,b)^2}}$$

where c ∈ x is a collocate in the vector of x, and w(c,x) is the weight of association of collocate c with head word x.
● Select the 1, 10 or 20 closest words as candidate translations
Rapp 1995 ; Fung 1997
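
Putting the pieces together, the comparison step can be sketched as: translate the collocates of the source vector with the seed lexicon, compute the cosine with every target vector, and keep the best-ranked target words. The vectors, weights and lexicon entries below are toy values made up for the illustration, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(c, 0.0) for c, weight in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())
                     * sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


def translate_vector(vector, seed_lexicon):
    """Map each collocate to the target language; collocates that are
    missing from the seed lexicon are dropped."""
    translated = {}
    for collocate, weight in vector.items():
        if collocate in seed_lexicon:
            target = seed_lexicon[collocate]
            translated[target] = translated.get(target, 0.0) + weight
    return translated


def rank_candidates(source_vector, target_vectors, seed_lexicon, top_n=20):
    translated = translate_vector(source_vector, seed_lexicon)
    scored = sorted(((cosine(translated, vec), word)
                     for word, vec in target_vectors.items()), reverse=True)
    return [word for _, word in scored[:top_n]]


seed_lexicon = {"water": "น้ำ", "beer": "เบียร์", "glass": "แก้ว", "mouth": "ปาก"}
drink = {"water": 3.0, "glass": 2.0, "beer": 1.0}
target_vectors = {
    "ดื่ม": {"น้ำ": 4.0, "แก้ว": 2.0, "เบียร์": 1.0},
    "กิน": {"ข้าว": 5.0, "ปาก": 1.0, "น้ำ": 1.0},
}
candidates = rank_candidates(drink, target_vectors, seed_lexicon, top_n=2)
```

Collocates absent from the seed lexicon simply contribute nothing, so the quality of the ranking depends heavily on lexicon coverage.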

Contextual approach: improvements
● Using syntactic collocates
● Improving the dictionary with cognates, transliterations, other dictionaries
● Giving more weight to « anchor words »:
  ● cognates, transliterations
  ● frequent, monosemous words
● Filtering with part of speech
● Favoring reciprocal translations
[Figure: SOURCE words a, b, c, d and TARGET words a', b', c', d']
Chiao and Zweigenbaum, 2002; Sadat et al., 2003; Gamallo and Campos, 2005; Koehn and Knight, 2002; Prochasson, 2010

Variant to direct translation of vector
● « Interlingual » approach: translate the n closest words instead of the context vector
● Seed lexicon: some mappings between source and target words
● To translate term T:
  ● find its n closest words
  ● these closest words are in the seed lexicon
  ● find the target term which is the closest to the n closest words
[Figure: SOURCE and TARGET words linked through the seed lexicon]
Déjean and Gaussier, 2002

Adaptation to multi-word terms
● Context vector:
  ● union of the vectors of each word of the term
[Figure: the vector of « energy » (strong, beer, ..., glass) and the vector of « drink » (..., beer, mouth, glass) are merged into the vector of « energy drink » (strong, beer, mouth, glass)]
Morin et al., 2004; Morin and Daille, 2009
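
A minimal sketch of the union of vectors is given below. The slides do not say how the weights of shared collocates are combined; summing them is an assumption made here, and the toy weights are invented.

```python
from collections import Counter


def multiword_vector(term, word_vectors):
    """Merge the context vectors of the words of a multi-word term.
    Weights of collocates shared by several words are summed (assumption)."""
    merged = Counter()
    for word in term.split():
        merged.update(word_vectors.get(word, {}))
    return merged


word_vectors = {
    "energy": {"strong": 2, "beer": 1, "glass": 1},
    "drink": {"beer": 3, "mouth": 2, "glass": 2},
}
energy_drink = multiword_vector("energy drink", word_vectors)
```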

Evaluation
● Precision on TopN candidates
  ● e.g. 50% on Top20: the correct translation is among the 20 best candidates for 50% of source terms

| Units to translate | Corpus | Precision |
|---|---|---|
| Single-word units | big, general-language corpus | 80% |
| Multi-word units | small, specialized corpus | 60% |
| Multi-word terms | small, specialized corpus | 42% |

● big = hundreds of millions of words
● small = 100 thousand to one million words
Morin and Daille, 2010
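
The TopN precision described above can be computed as in this sketch, where the ranked candidates and the reference translations are toy data and the function name is hypothetical.

```python
def precision_at_n(candidates, reference, n=20):
    """Share of source terms whose reference translation appears among
    the n best-ranked candidates.

    candidates: {source_term: ranked list of candidate translations}
    reference:  {source_term: set of accepted translations}
    """
    if not reference:
        return 0.0
    hits = sum(1 for term, gold in reference.items()
               if gold & set(candidates.get(term, [])[:n]))
    return hits / len(reference)


candidates = {
    "breast cancer": ["cancer du sein", "tumeur"],
    "cell line": ["cellule", "lignée cellulaire"],
}
reference = {
    "breast cancer": {"cancer du sein"},
    "cell line": {"lignée cellulaire"},
}
top1 = precision_at_n(candidates, reference, n=1)
top2 = precision_at_n(candidates, reference, n=2)
```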

Why is it so difficult?
● The translation might not be present in the target corpus
● The target term has not been extracted
● Polysemous words: non-discriminating, fuzzy vector
● Low-frequency words: statistically non-significant vector
● The translation has a different usage in the target language
● Big search space: all words of the target corpus
→ cannot be fully automatic
→ semi-supervised term alignment
Thank you
ed(a)lingua-et-machina.com
Franco-Thai Workshop 2010
4th intensive summer school on Natural Language Processing