Post on 14-Dec-2015
Automatic Eurovoc Indexing
Bruno Pouliquen
Joint Research Centre, European Commission
Ispra-Italy
http://www.jrc.cec.eu.int/langtech
Addressing the Language Barrier Problem in the Enlarged EU
Automating Eurovoc Descriptor Assignment
Contents
• Overview of the process
• Background
  – Eurovoc Thesaurus
  – Corpus of texts
  – Approaches to thesaurus indexing
  – Vector space
• Training
  – Pre-processing the texts
  – Building Eurovoc profiles
  – Tuning various parameters
• Assignment
  – “Guessing” the descriptors for a new text
  – … results
Overview: starting point
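• Set of texts, manually indexed
[Figure: example texts shown as their words, each labelled with its manually assigned descriptor. Texts indexed with “Poland” contain words such as Poland, programme, Polish, Europe; texts indexed with “Cultural policy” contain words such as Culture, programme, Europe, Cultural, revival, artistic. One text (Poland, Culture, programme, Artistic, Polish) is indexed with both descriptors.]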
Overview: learning process for automatic assignment
• Produce descriptor profiles
[Figure: the manually indexed texts are combined into one weighted word list (profile) per descriptor.
Profile for “Poland”: Poland 23, Polish 20, …, Producers 9, …
Profile for “Cultural policy”: Culture 41, Cultural 32, …, Artistic 21, Revival 10, …]
Overview: assignment
• A new document is compared with descriptor profiles
...
Overview: assignment results
DECISION No 3/96 OF THE ASSOCIATION COUNCIL between the European Communities and their Member States, of the one part, and the Republic of Poland, of the other part of 16 July 1996 settling the dispute between the European Communities and the Republic of Poland concerning skins and hides in accordance with Article 105 (1) and (2) of the Europe Agreement between the European Communities and their Member States, of the one part, and the Republic of Poland, of the other part (96/496/Euratom, ECSC, EC)
THE ASSOCIATION COUNCIL,
Having regard to the Europe Agreement establishing an Association between the European Communities and their Member States, of the one part, and the Republic of Poland, of the other part (hereinafter 'the Europe Agreement'), and in particular Article 105 thereof,
Whereas it is laid down in Article 105 (1) and (2) of the Europe Agreement that the Association Council may settle by means of a decision any dispute relating to the application or interpretation of the Europe Agreement;
Considering that in view of a critical shortage of raw material in the form of skins and hides the Republic of Poland introduced on 1 January 1994 a quota for the export of skins and hides set at 1 400 tonnes for 1994 and 1995 and 3 000 tonnes for 1996, invoking Article 31 of the Europe Agreement;
Recognizing that, at the first meeting of the Association Council held in Warsaw on 23 and 24 June 1994, the Community requested Poland to increase the quota to 15 000 tonnes for 1994 and 20 000 tonnes for 1995 in order to maintain a balance, in accordance with the Europe Agreement, between the measures taken by Poland and the real shortage of the raw material existing;
Considering that Poland informed the Community that the restriction had been introduced temporarily, was the result of the existing shortage and would be withdrawn as soon as the causes of its implementation disappeared;
Considering that both sides have not reached a common understanding;
Recognizing that in its letter of 28 July the Community referred the matter to the Association Council in accordance with Article 105 (1) of the Europe Agreement in order that it might settle the dispute;
Considering that at the second meeting of the Association Council held in Brussels on 17 July 1995 the Community proposed that the quota for 1995 be raised to 13 500 tonnes;
Recognizing that, as Poland could not accept the Community's proposal and as various proposals by Poland to increase the quota had not been accepted by the Community, both sides agreed to the application of Article 105 (4) of the Europe Agreement;
Considering that the Republic of Poland and the Community have both notified their arbiters;
Considering that, in the meantime the Republic of Poland in its letter of 18 March 1996 submitted a compromise proposal concerning the establishment of a timetable for liberalization of the export of skins and hides which envisages the final withdrawal of restrictions on 1 January 1999 at the latest and provides for another investigation of the matter in 1997 in order to hasten the process of full liberalization by one year;
Recognizing that in such circumstances both sides have decided to stop the arbitration procedure provided for in Article 105 (4) and finish it according to Article 105 (2) of the Europe Agreement,
HAS DECIDED AS FOLLOWS:
Article 1
The amount of the annual quota for exports from Poland of skins and hides, set by Poland at 3 000 tonnes for 1996 shall be increased for the same products to 10 000 tonnes for 1996, 12 000 tonnes for 1997 and 15 000 tonnes for 1998. The Republic of Poland will eliminate the restriction in export of skins and hides with effect from 1 January 1999.
TITLE: Association council between the European Communities and the Republic of Poland
Poland 28
EC association agreement 16
Eastern Europe 14
…
Cultural Policy 0.05
Visual overview of the process
[Figure: two pipelines. Training: training corpus → preprocessing → descriptor profiles. Assignment: new text → preprocessing → comparison with the descriptor profiles → descriptors.]
Background: Eurovoc thesaurus
• Created for indexing European texts
• Created for human use
• Contains some abstract concepts (cultural policy)
• Hierarchically organised (BT/NT)
Background: Corpus
• Corpus: set of (homogeneous) texts
• Collected:
  – Parliamentary questions
  – Debate
  – Resolution
  – Protocol
  – Council regulation
  – Council decision
  – Council proposition
  – Agreements and contracts
  – ...
Background: Corpus
• Our corpus:
  – About 75000 texts in de, en, fr, it…
  – About 30000 in fi, sv...
  – And...
    • 22000 in Lithuanian
    • 8000 in Hungarian
    • Hopefully 8000, or more, in the new EU languages
    • ...
Main approaches for thesaurus indexing
• Look for the descriptor text in documents
• Linguistic, rule-based approach
• Machine learning, statistical approach (JRC)
Look for the descriptor text in documents
• Most intuitive attempt for Eurovoc indexing
  – Try to find a descriptor text explicitly in documents
• Many texts (69%) do not include descriptor text explicitly

Example text:
Article 1
Poland shall participate in the Culture 2000 programme according to the terms and conditions set out in Annexes I and II which shall form an integral part of this Decision.
Article 2
This Decision shall enter into force on the day of its adoption. It shall apply for the duration of the Culture 2000 programme, starting from 1 January 20…

Descriptors assigned to this text:
Poland
Cultural policy
Community programme
Financing of the community project
Look for the descriptor text in documents (2)
• Many documents (90%) contain some descriptor text without being indexed with it

Example text:
Article 1
Poland shall participate in the Culture 2000 programme according to the terms and conditions set out in Annexes I and II which shall form an integral part of this Decision.
Article 2
This Decision shall enter into force on the day of its adoption. It shall apply for the duration of the Culture 2000 programme, starting from 1 January 20…

Descriptor texts found in this text but not assigned to it:
Culture
Form
Decision
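
The two slides above can be illustrated with a minimal Python sketch of the naive lookup. It assumes that whole-word, case-insensitive matching is what "finding the descriptor text" means; the function name `explicit_descriptors` and the split into `MANUAL` and `OTHER_LABELS` are illustrative choices, with the labels taken from the example slides.

```python
import re

def explicit_descriptors(text, labels):
    """Return the labels that occur verbatim in text (whole words, any case)."""
    found = set()
    for label in labels:
        pattern = r"\b" + re.escape(label) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.add(label)
    return found

TEXT = ("Poland shall participate in the Culture 2000 programme according to "
        "the terms and conditions set out in Annexes I and II which shall form "
        "an integral part of this Decision.")

# Descriptors assigned by the human indexer (from the slide).
MANUAL = {"Poland", "Cultural policy", "Community programme",
          "Financing of the community project"}
# Other descriptor labels that happen to occur in the text (from the slide).
OTHER_LABELS = {"Culture", "Form", "Decision"}

found = explicit_descriptors(TEXT, MANUAL | OTHER_LABELS)
missed = MANUAL - found      # assigned manually, but not literally in the text
spurious = found - MANUAL    # literally in the text, but not assigned manually
```

Only one of the four manual descriptors is recovered, while three labels that were not assigned are matched, which mirrors both problems named on the slides.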
Rule-based approach for Eurovoc indexing
• Manually write a set of rules for each descriptor
  Example: [Hlava & Hainebach 1996]
• 40000 rules for English
  – Words and word combinations
    » Fishery AND management
  – Word locations in the text
    » Proximity (words in the same sentence…)
    » Location (title/text/beginning of sentence…)
    » Format (capital letters)
  – Exploit legal references
    » E.g. “Council directive 79/112/EEC”
• Too expensive, difficult to update
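
To give a feel for what one such hand-written rule might look like, here is a small Python sketch combining a word combination ("Fishery AND management") with a proximity constraint (same sentence). It is a hypothetical rule written for illustration, not one taken from [Hlava & Hainebach 1996]; the sentence splitting is deliberately crude.

```python
import re

def fishery_management_rule(text):
    """Hypothetical rule: fire if 'fishery' and 'management' share a sentence."""
    # Crude sentence split on ., ! or ? followed by whitespace (an assumption).
    sentences = re.split(r"(?<=[.!?])\s+", text)
    for sentence in sentences:
        words = set(re.findall(r"[a-z]+", sentence.lower()))
        if {"fishery", "management"} <= words:
            return True
    return False
```

Every descriptor would need many rules of this kind, per language, which is why the slide calls the approach expensive and difficult to update.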
Machine learning
• Inductive process
  – Tries to “learn” from manually indexed examples
  – Tries to “reproduce” indexing on new document
• Advantages
  – Fast and cheap
  – Easy to adapt to new languages
  – Easier to update
    • Possibility to re-index old documents with new descriptors
  – More consistent than manual indexing
  – Ranked assignment (better relevance ranking in search)
Machine learning: our approach
– Trained on manually Eurovoc-indexed documents
– Basic and fast representation of texts: “bag of words”
  • No consideration of relationships between words (syntax, semantics, discourse…)
– Build a representation of each descriptor
  • The profile: (weighted) list of the most representative words
Vector space representation
• A text is represented as a vector
• The dimensions are the words
• A Eurovoc descriptor profile is also a vector
[Figure: the descriptor “Nuclear Material” and the document “Agreement in the form of an exchange of letters (…) relating to the amendment of the Convention of 20 May 1987 on a common transit procedure” shown as vectors]
See “Vector space” annex for more information
Training Eurovoc assignment: text pre-processing
• Why? Find a better representation in the “bag-of-words”
  – Semantically not relevant to consider some common words
    • ‘a’, ‘the’, ‘very’…
  – Unify word forms
    • ‘culture’, ‘cultures’, ‘cultural’…
  – Word sense disambiguation
    • ‘pilot’, ‘order’…
Training Eurovoc assignment: text pre-processing
• How?
  – “Stop word” list: avoid the use of non-meaningful words
  – Lemmatisation: replace inflected word forms by their base form (lemma)
    • ‘towns’ => ‘town’
    • ‘difficulties’ => ‘difficulty’
    • ‘mice’ => ‘mouse’
  – Multi-word units: reduce polysemy
    • ‘European_Union’
    • ‘in_order_to’
    • ‘sustainable_development’
    • ‘pilot_project’
  – Remove annexes and signatures
Text pre-processing example

Original text:
Article 1
Poland shall participate in the Culture 2000 programme according to the terms and conditions set out in Annexes I and II which shall form an integral part of this Decision.
Article 2
This Decision shall enter into force on the day of its adoption. It shall apply for the duration of the Culture 2000 programme, starting from 1 January 20 …

After pre-processing:
Article 1 Poland shall participate_in the Culture 2000 programme accord_to the term_and_condition set_out in annex_i and II which shall form an integral_part of this Decision .
Article 2 This Decision shall enter_into_force on the day of its adoption . It shall apply_for the duration of the Culture 2000 programme , start from 1 January 2001 .
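
The pre-processing steps can be sketched in a few lines of Python. This is a toy version under stated assumptions: the stop-word list, the lemma dictionary and the multi-word units below are tiny hand-made samples (a real system would use full language resources), tokens are lowercased, and multi-word units are marked before stop words are removed so that units containing a stop word ("term_and_condition") survive.

```python
# Tiny illustrative resources; all three are assumptions, not the JRC data.
STOP_WORDS = {"a", "an", "the", "in", "of", "this", "and", "to", "which", "shall"}
LEMMAS = {"terms": "term", "conditions": "condition", "annexes": "annex"}
MULTI_WORD_UNITS = {
    ("term", "and", "condition"): "term_and_condition",
    ("integral", "part"): "integral_part",
}

def preprocess(text):
    """Lowercase, lemmatise, mark multi-word units, then drop stop words."""
    tokens = [t.strip(".,;:").lower() for t in text.split()]
    tokens = [LEMMAS.get(t, t) for t in tokens if t]
    merged, i = [], 0
    while i < len(tokens):
        for unit, joined in MULTI_WORD_UNITS.items():
            if tuple(tokens[i:i + len(unit)]) == unit:
                merged.append(joined)   # replace the word sequence by one token
                i += len(unit)
                break
        else:
            merged.append(tokens[i])    # no unit starts here, keep the token
            i += 1
    return [t for t in merged if t not in STOP_WORDS]
```

For example, `preprocess("the terms and conditions which shall form an integral part of this Decision.")` keeps only `term_and_condition`, `form`, `integral_part` and `decision`.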
Machine learning: training
• “Learn” from examples
• Based on human assignment
• Take every text indexed with a given descriptor as the “training sample” of this descriptor
[Figure: the manually indexed example texts are grouped by descriptor. Texts indexed with “Poland” (words such as Poland, producers, Polish, Europe) form the training sample for Poland; texts indexed with “Cultural policy” (words such as Culture, programme, Europe, Cultural, revival, artistic) form the training sample for Cultural policy. The text indexed with both descriptors (Poland, Culture, programme, Artistic, Polish) appears in both samples.]
Training: representation of texts
• Bag-of-Words representation of text
Text:
Article 1
Poland shall participate in the Culture 2000 programme according to the terms and conditions set out in Annexes I and II which shall form an integral part of this Decision.
Article 2
This Decision shall enter into force on the day of its adoption. It shall apply for the duration of the Culture 2000 programme, starting from 1 January 20…

Bag of words:
Poland, Culture, Programme, Decision, starting, Terms, …
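
A bag of words can be sketched with Python's `collections.Counter`: word order is discarded and only counts remain. The helper `to_vector`, which projects a bag onto a fixed list of dimension words, is an illustrative addition and the token list is made up.

```python
from collections import Counter

def bag_of_words(tokens):
    """Count each token; word order and syntax are deliberately ignored."""
    return Counter(tokens)

def to_vector(bag, vocabulary):
    """Project a bag onto a fixed, ordered list of dimension words."""
    return [bag.get(word, 0) for word in vocabulary]

tokens = ["poland", "culture", "programme", "decision", "culture", "programme"]
bag = bag_of_words(tokens)
vector = to_vector(bag, ["poland", "culture", "programme"])
```

Two texts containing the same words in a different order produce the same bag, which is exactly the simplification the approach accepts in exchange for speed.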
Training: identifying most representative words
Training texts indexed with RADIOACTIVE MATERIALS:
Text 1: radioactive, ukraine, resolution, plutonium, deuterium, parliament, nuclear, blottnitz, ...
Text 2: plutonium, deuterium, assembly, nuclear, schmidt, radioactive, korea, iaea, ...
Text 3: Illegal_traffic, chernobyl, radioactive, ukrainian, plutonium, lithium, dangerous, mox, ...

Text 1 + Text 2 + Text 3 =
radioactive (3), plutonium (3), nuclear (2), deuterium (2), Illegal_traffic (1), chernobyl (1), ...
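
The counting on this slide can be reproduced with a short Python sketch: for each word, count the number of training texts of the descriptor in which it occurs. The word lists are copied from the slide (lowercased); the function name `text_counts` is an illustrative choice.

```python
from collections import Counter

TRAINING_TEXTS = [
    ["radioactive", "ukraine", "resolution", "plutonium",
     "deuterium", "parliament", "nuclear", "blottnitz"],
    ["plutonium", "deuterium", "assembly", "nuclear",
     "schmidt", "radioactive", "korea", "iaea"],
    ["illegal_traffic", "chernobyl", "radioactive", "ukrainian",
     "plutonium", "lithium", "dangerous", "mox"],
]

def text_counts(texts):
    """For each word, count the number of texts it appears in."""
    counts = Counter()
    for words in texts:
        counts.update(set(words))   # set(): a word counts once per text
    return counts

counts = text_counts(TRAINING_TEXTS)
```

Words shared by all three texts ("radioactive", "plutonium") rise to the top, while names that occur in a single text stay at 1.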
Training: identifying most representative words
[Figure: word counts accumulate as each training text of a descriptor is added.
Poland: “Poland, Polish, producers” → “Poland 2, Polish 2, producers, europe” → “Poland 3, Polish 2, producers, Europe, programme” → … → “Poland 23, Polish 20, …, Producers 9, …”
Cultural policy: “Culture 41, Cultural 32, …, Artistic 21, Revival 10, …”]
Building Eurovoc descriptor profiles
For each Eurovoc descriptor
– Find the texts it appears in
– Find the words appearing in those texts
– Combine various weights to compute the weight of each word for this descriptor
– Various normalisations used:
  • A very common word has less impact than a rare word
    – The word ‘contradiction’ (400 times) is less meaningful than ‘cloud’ (40 times)
  • A word used with only one descriptor has higher impact
    – ‘Chernobyl’ does not appear with many descriptors
    – ‘redistribute’ (same frequency in texts) appears with various descriptors
  • Texts indexed with one descriptor have more impact than those indexed with 20
Descriptor profile: weight of a word
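Weight of a word in a descriptor profile
– Based on the frequency of the word
  • Number of texts it appears in
  • Each text being indexed with Nd descriptors, word contribution is 1/Nd
– Normalised by the number of other descriptors it appears in

$$Weight_{l,d} = TF_{l,d} \cdot IDF_l$$

$$TF_{l,d} = \sum_{t \in T_{l,d}} \frac{1}{Nd_t}$$

$$IDF_l = \log_2\left(\frac{Max_{DF}}{DF_l}\right) + 1$$

$$Weight_{l,d} = \left(\sum_{t \in T_{l,d}} \frac{1}{Nd_t}\right) \cdot \left(\log_2\left(\frac{Max_{DF}}{DF_l}\right) + 1\right)$$

where $T_{l,d}$ is the set of training texts indexed with descriptor $d$ that contain word $l$, $Nd_t$ is the number of descriptors assigned to text $t$, $DF_l$ is the number of descriptors word $l$ appears with, and $Max_{DF}$ is the largest value of $DF_l$.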
Eurovoc descriptor profile
• List of weighted words (associates)

TOWN PLANNING
Associate             Weight
urban_policy          3.98
urban                 3.40
urban_renewal         3.18
derelict              3.02
urban_development     2.90
run_down              2.79
city                  2.77
urban_plan            2.70
revitalize            2.60
town                  2.38
esdp                  2.20
conurbations          2.20
heritage              1.88
spatial_development   1.87
interregional         1.87
quality_of_life       1.82
regeneration          1.8
urban_environment     1.77
spatial               1.76
planning              1.75
urban_area            1.74
regional_plan         1.74
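Associate List: RADIOACTIVE MATERIALS
[Figure: associate list for RADIOACTIVE MATERIALS]
Associate List: FISHERY MANAGEMENT
[Figure: associate list for FISHERY MANAGEMENT, with associates marked as fishery-related or management-related]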
Tuning: various parameters
– Minimum size and number of training texts available for each descriptor. We chose to require at least 5 texts (with at least 2000 characters each).
– Select words in texts: log-likelihood formula (p-value set to the low value of 0.15 to produce long associate lists)
– Reference corpus for the log-likelihood formula: a general corpus vs. the training corpus
– Meta-text vs. individual texts
– Minimum number of texts per descriptor for which the word is an associate (a word has to appear in at least 2 texts to be an associate)
– Use: number of texts / cumulated frequency of word / log-likelihood value
– Impact of all associates occurring at least 10% as often as the most common associate word
– We do not consider the length of each training text
– Minimum weight threshold for each associate
– Minimum length of the associate list
Assignment Phase
• Normalise new document (lemmatise, multi-word mark-up)
• Produce word frequency list (excluding stop words)
• Calculate similarity between word frequency list and descriptor associate lists, using statistical formulae
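
The three steps can be sketched in Python as below. The sketch assumes the document has already been normalised into tokens, and it uses the simplest possible similarity (a weighted sum of word frequencies) as a stand-in for the statistical formulae; the function name `rank_descriptors` and the profile weights in the usage example are made up.

```python
from collections import Counter

def rank_descriptors(tokens, profiles, stop_words=frozenset()):
    """Rank descriptors for a normalised document.

    tokens:   lemmatised tokens with multi-word units already marked
    profiles: {descriptor: {associate: weight}}
    Returns a list of (descriptor, score), best first; zero scores are dropped.
    """
    freq = Counter(t for t in tokens if t not in stop_words)
    scores = {}
    for descriptor, associates in profiles.items():
        score = sum(weight * freq[word]
                    for word, weight in associates.items() if word in freq)
        if score > 0:
            scores[descriptor] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Illustrative profiles with invented weights.
profiles = {
    "POLAND": {"poland": 2.0, "polish": 1.5},
    "CULTURAL POLICY": {"culture": 2.0, "artistic": 1.0},
    "FISHERY MANAGEMENT": {"fishery": 3.0},
}
ranking = rank_descriptors(
    ["poland", "the", "culture", "poland", "polish"], profiles, {"the"})
```

The output is a ranked list rather than a yes/no decision per descriptor, which is what allows a threshold or a top-N cut to be applied afterwards.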
...
Eurovoc assignment: example
Resolution on human rights in Ethiopia
Descriptor                   Score   Matching associates
ETHIOPIA                     30%     Ethiopia, ethiopian
HUMAN RIGHTS                 25%     human_rights, (condemn, respect…)
POLITICAL VIOLENCE           19%     human_rights, (condemn, killing…)
REPRESSION                   18%     human_rights, (condemn, repression…)
DEMOCRATIZATION              18%     human_rights, (condemn, democratic…)
…
EXTREMISM                    10%     human_rights, (condemn, call_on…)
DEATH PENALTY                10%     human_rights, (condemn, call_on…)
…
TREATY ON EUROPEAN UNION     6%      human_rights, (citizen, respect…)
[Figure: Eurovoc assignment in vector space. The document and two descriptors (Descriptor 1, Descriptor 2) are drawn as vectors against the axes Keyword 1, Keyword 2 and Keyword 3.]
Formulae tested for descriptor assignment
TF.IDF (Term Frequency, Inverse Document Frequency): considers the occurrence frequency of lemma (l) in the meta-text (TF_{l,t}) and the number of descriptors (d) for which the lemma is an associate (DF_l)

$$TFIDF_{l,d} = TF_{l,d} \cdot \left(\log_2\left(\frac{N}{DF_l}\right) + 1\right)$$

Cosine: uses TF.IDF; computes the angle between two multi-dimensional vectors (that of the document (t) and that of the descriptor associate list (d))

$$COSINE(d,t) = \frac{\sum_{l \in t \cap d} TFIDF_{l,d} \cdot TFIDF_{l,t}}{\sqrt{\left(\sum_{l \in d} TFIDF_{l,d}^2\right) \cdot \left(\sum_{l \in t} TFIDF_{l,t}^2\right)}}$$

Okapi: considers the occurrence frequency of the lemma as an associate (DF_l); the number of associates in the associate list (size, |d|); the average size of descriptor associate lists (M); the total number of descriptors used (N)

$$Okapi_{d,t} = \sum_{l \in t \cap d} \frac{TF_{l,d}}{TF_{l,d} + \frac{|d|}{M}} \cdot \log\left(\frac{N - DF_l}{DF_l}\right)$$

‘Scalar Product’: adds the products of the TF.IDF values of associates and text lemmas

$$Sproduct(d,t) = \sum_{l \in t \cap d} TFIDF_{l,d} \cdot TFIDF_{l,t}$$

‘622’ mixed formula: uses all of the above

$$Mixed(d,t) = 0.61 \cdot \frac{COSINE}{\max(COSINE)} + 0.21 \cdot \frac{Okapi}{\max(Okapi)} + 0.18 \cdot \frac{Sproduct}{\max(Sproduct)}$$
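
Two of the similarity measures and the idea behind the mixed formula can be sketched in Python on sparse vectors stored as dictionaries. The sketch assumes the vectors already hold TF.IDF values; Okapi is left out; and `mix` only illustrates the general scheme of dividing each formula's scores by their maximum before a weighted sum. The slide lists the weights 0.61, 0.21 and 0.18, but which weight belongs to which formula is an assumption here, so the example uses made-up weights.

```python
import math

def scalar_product(doc, profile):
    """Sum of products over the lemmas shared by document and profile."""
    return sum(w * profile[lemma] for lemma, w in doc.items() if lemma in profile)

def cosine(doc, profile):
    """Scalar product divided by the product of the two vector lengths."""
    norm = math.sqrt(sum(w * w for w in doc.values())
                     * sum(w * w for w in profile.values()))
    return scalar_product(doc, profile) / norm if norm else 0.0

def mix(scores_by_formula, weights):
    """Combine several formulas after scaling each by its best score.

    scores_by_formula: {formula_name: {descriptor: score}}
    weights:           {formula_name: weight}
    """
    mixed = {}
    for name, scores in scores_by_formula.items():
        top = max(scores.values()) or 1.0   # avoid dividing by zero
        for descriptor, score in scores.items():
            mixed[descriptor] = (mixed.get(descriptor, 0.0)
                                 + weights[name] * score / top)
    return mixed
```

Scaling by the maximum puts the formulas on a common 0 to 1 range, so that a formula with large raw values (such as the scalar product) does not dominate the combination.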
Sample Assignment Result
Title: Legislative resolution embodying Parliament's opinion on the proposal for a Council Regulation amending Regulation N. 2847/93 establishing a control system applicable to the common fisheries policy (COM(95)0256 - C4-0272/95 - 95/ 0146) (Consultation procedure)
Results: … starting from a plain text …

Resolution on human rights in Ethiopia
The European Parliament,
- having regard to its resolution of 18 July 1996 on human rights in Ethiopia ((OJ C 261, 9.9.1996, p. 166.)),
A. whereas respect for human rights, democratic principles and the rule of law constitute essential elements of the revised Lomé IV Convention and whereas the Ethiopian constitution also includes respect for human rights,
B. having regard to the continuing process of democratic and institutional change in Ethiopia,
C. concerned by the repression of civil society associations which recently forced into exile leaders like Mr Kefale Mamo and Mr Mulugeta Lule (president and vice-president respectively of the Ethiopian Free Press Journalist Association), Mr Gemorav Kassa (General Secretary of the Ethiopian Teachers Association), and Mr Dawi Ibrahim (president of the Confederation of Ethiopian Trade Unions),
D. deeply concerned by the killing on 11 June 1997 of Assefa Maru, an executive board member of both the Ethiopian Teachers Association and of the Ethiopian Human Rights Council,
1. Condemns the killing of Assefa Maru;
2. Condemns all human rights violations committed by the government and military forces;
3. Calls on the Ethiopian authorities to guarantee the fundamental rights of all Ethiopian citizens and to put an end to politically motivated persecutions and to abuses such as extrajudicial disappearances, torture, detention, rapes and arrests, in accordance with the Ethiopian constitution;
4. Calls on the Ethiopian authorities fully to respect freedom of the press, independence of unions and the right of association of citizens;
5. Urges the Ethiopian Government to release all prisoners of conscience and to provide corrective procedures to the judiciary system whereby people can be charged and tried in a fair way;
6. Calls on the Council, the Commission and the Member States to monitor closely human rights in Ethiopia and use all means to improve the situation;
7. Instructs its President to forward this resolution to the Council, the Commission, the Government of Ethiopia, the Secretary-General of the United Nations and the UN High Commissioner for Human Rights.
Pre-processing and keyword extraction
[Figure: new text → preprocessing → assignment → descriptors]
What next?
• New languages
• Experiments, using output of current assignment
  – SVM
  – Other categorization techniques
• Filter output
  – Geographic descriptors
Automatic Eurovoc Indexing: vector model
Annex: Vector space representation
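[Figure: a document drawn as a vector against the axes Word 1, Word 2 and Word 3]
Vector space representation
Example: document in a three dimensional space
[Figure: the document “EU-Poland culture 2000” drawn as a vector against the axes Poland, Culture and Programme]
Vector space representation
A Eurovoc descriptor in the same three dimensional space
[Figure: the descriptor “Cultural policy” drawn as a vector against the axes Poland, Culture and Programme]
Vector space representation
Eurovoc descriptor and documents comparison
[Figure: a Eurovoc descriptor and a document drawn as vectors against the axes Keyword 1, Keyword 2 and Keyword 3]
Vector space representation
Eurovoc descriptor and documents comparison
[Figure: the descriptor “Cultural policy” and the document “EU-Poland culture 2000” drawn as vectors against the axes Poland, Culture and Programme]
Eurovoc assignment in vector space
• A text is “compared” to each Eurovoc descriptor profile
[Figure: a document and two descriptors (Descriptor 1, Descriptor 2) drawn as vectors against the axes Keyword 1, Keyword 2 and Keyword 3]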
$$sim(t,d) = \sum_{l \in t \cap d} Weight_{l,d} \cdot Weight_{l,t}$$