Managing Morphologically Complex Languages in Information Retrieval Kal Järvelin & Many Others...


Page 1: Managing Morphologically Complex Languages in Information Retrieval Kal Järvelin & Many Others University of Tampere.

Managing Morphologically Complex Languages in Information Retrieval

Kal Järvelin & Many Others

University of Tampere

Page 2: Managing Morphologically Complex Languages in Information Retrieval Kal Järvelin & Many Others University of Tampere.

1. Introduction

Morphologically complex languages
- unlike English, Chinese
- rich inflectional and derivational morphology
- rich compound formation

U. Tampere experiences 1998 - 2008
- monolingual IR
- cross-language IR
- focus: Finnish, Germanic languages, English

Page 3: Managing Morphologically Complex Languages in Information Retrieval Kal Järvelin & Many Others University of Tampere.

Methods for Morphology Variation Management

[Diagram: taxonomy of methods]
Reductive Methods: Stemming, Lemmatization
Generative Methods: Infl stem Generation, Word Form Generation
Leaf labels: Rule-based; Rules + Dict; Inflectional Stems; Inflectional Stems enhanced; FCG; Generating All Forms

Page 4: Managing Morphologically Complex Languages in Information Retrieval Kal Järvelin & Many Others University of Tampere.

Agenda

1. Introduction
2. Reductive Methods
3. Compounds
4. Generative Methods
5. Query Structures
6. OOV Words
7. Conclusion

Page 5: Managing Morphologically Complex Languages in Information Retrieval Kal Järvelin & Many Others University of Tampere.

2. Normalization

Reductive methods, conflation
- stemming
- lemmatization
+ conflation -> simpler searching
+ smaller index
+ provides query expansion

Stemming available for many languages (e.g. Porter stemmer)

Lemmatizers less available and more demanding (dictionary requirement)

Page 6: Managing Morphologically Complex Languages in Information Retrieval Kal Järvelin & Many Others University of Tampere.

Alkula 2001

Boolean environment, inflected index, Finnish:
- manual truncation vs. automatic stemming
- stemming improves P and hurts R
- many derivatives are lost

Boolean environment, infl vs. lemma index, Finnish:
- manual truncation vs. lemmatization
- lemmatization improves P and hurts R
- many derivatives are lost, others correctly avoided

Differences not great between automatic methods

Page 7: Managing Morphologically Complex Languages in Information Retrieval Kal Järvelin & Many Others University of Tampere.

Kettunen & al 2005

Ranked retrieval, Finnish: three problems
- how do lemmatization and inflectional stem generation compare in a best-match environment?
- is a stemmer realistic for handling Finnish morphology?
- feasibility of simulated truncation in a best-match system?

Lemmatized vs. inflected form vs. stemmed index.

Page 8: Managing Morphologically Complex Languages in Information Retrieval Kal Järvelin & Many Others University of Tampere.

Kettunen & al. 2005

Method        Index     MAP   Change %
FinTWOL       lemmas    35.0  --
Inf Stem Gen  inflform  34.2  -2.3
Porter        stemmed   27.7  -20.9
Raw           inflform  18.9  -46.0

But very long queries for inflectional stem generation & expansion (thousands of words); weaker generations shorter but progressively deteriorating results.

(InQuery/TUTK/graded-35/regular)
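
The Change % column above can be reproduced from the MAP values by taking the FinTWOL lemma index as the baseline. A minimal sketch; the function and variable names are illustrative and not taken from the original study:

```python
def relative_change(baseline, value):
    """Relative change of `value` against `baseline`, in percent."""
    return (value - baseline) / baseline * 100.0


# MAP values from the table; FinTWOL (lemma index) is the baseline.
BASELINE_MAP = 35.0
runs = {"Inf Stem Gen": 34.2, "Porter": 27.7, "Raw": 18.9}

changes = {
    name: round(relative_change(BASELINE_MAP, map_value), 1)
    for name, map_value in runs.items()
}

for name, change in changes.items():
    print(f"{name:13s} {change:+.1f} %")
```

Note that these are relative changes in MAP, not differences in percentage points.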

Page 9: Managing Morphologically Complex Languages in Information Retrieval Kal Järvelin & Many Others University of Tampere.

Kettunen & al. 2005Kettunen & al. 2005


Page 10: Managing Morphologically Complex Languages in Information Retrieval Kal Järvelin & Many Others University of Tampere.

MonoIR: Airio 2006

Language  Index type  Average P %  Diff to Base %
English   Inflected   43.4
          Lemmas      45.6         2.2
          Stemmed     46.3         2.9
Finnish   Inflected   31.0
          Lemmas      47.0         16.0
          Stemmed     48.5         17.5
Swedish   Inflected   30.2
          Lemmas      31.4         1.2
          Stemmed     33.5         3.3
German    Inflected   30.2
          Lemmas      31.9         1.7
          Stemmed     35.7         5.5

InQuery/CLEF/TD/TWOL&Porter&Raw

Page 11: Managing Morphologically Complex Languages in Information Retrieval Kal Järvelin & Many Others University of Tampere.

CLIR: Inflectional Morphology

NL queries contain inflected form source keys

Dictionary headwords are in basic form (lemmas)

Problem significance varies by language

Stemming
- stem both the dictionary and the query words
- but may cause all too many translations

Stemming in dictionary translation best applied after translation.

Page 12: Managing Morphologically Complex Languages in Information Retrieval Kal Järvelin & Many Others University of Tampere.

Lemmatization in CLIR

Lemmatization
- easy to access dictionaries
- but tokens may be ambiguous
- dictionary translations not always in basic form
- lemmatizer's dictionary coverage
  - insufficient -> non-lemmatized source keys, OOVs
  - too broad coverage -> too many senses provided

Page 13: Managing Morphologically Complex Languages in Information Retrieval Kal Järvelin & Many Others University of Tampere.

CLIR Findings: Airio 2006

English -> X

Target X  Index type  Average P %  Diff to Split %
Finnish   Lemmas      29.0         -6.5
          Stemmed     20.8         -14.7
Swedish   Lemmas      17.4         -9.7
          Stemmed     19.0         -8.1
German    Lemmas      26.4         -4.6
          Stemmed     25.7         -5.3

InQuery/UTAClir/CLEF/GlobalDix/TWOL&Porter

Page 14: Managing Morphologically Complex Languages in Information Retrieval Kal Järvelin & Many Others University of Tampere.

Agenda

1. Introduction
2. Reductive Methods
3. Compounds
4. Generative Methods
5. Query Structures
6. OOV Words
7. Conclusion

Page 15: Managing Morphologically Complex Languages in Information Retrieval Kal Järvelin & Many Others University of Tampere.

3. Compounds

Compounds, compound word types
- determinative: Weinkeller, vinkällare, life-jacket
- copulative: schwarzweiss, svartvit, black-and-white
- compositional: Stadtverwaltung, stadsförvaltning
- non-compositional: Erdbeere, jordgubbe, strawberry

Note on spelling: compound word components are spelled together (if not -> phrases)

Page 16: Managing Morphologically Complex Languages in Information Retrieval Kal Järvelin & Many Others University of Tampere.

Compound Word Translation

Not all compounds are in the dictionary
- some languages are very productive
- small dictionaries: atomic words, old non-compositional compounds
- large dictionaries: many compositional compounds added

Compounds remove phrase identification problems, but cause translation and query formulation problems

Page 17: Managing Morphologically Complex Languages in Information Retrieval Kal Järvelin & Many Others University of Tampere.

Joining Morphemes

Joining morphemes complicate compound analysis & translation

Joining morpheme types in Swedish
  <omission>  flicknamn
  -s          rättsfall
  -e          flickebarn
  -a          gästabud
  -u          gatubelysning
  -o          människokärlek

Joining morpheme types in German
  -s          Handelsvertrag
  -n          Affenhaus
  -e          Gästebett
  -en         Fotographenausbildung
  -er         Gespensterhaus
  -es         Freundeskreis
  -ens        Herzensbrecher
  <omission>  Sprachwissenschaft

Suggestive finding that the treatment of joining morphemes improves MAP by 2 % (Hedlund 2002, SWE->ENG, 11 Qs)
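
To illustrate why joining morphemes complicate compound analysis, here is a minimal splitting sketch for the Swedish examples. It assumes a hand-made toy lexicon of component stems (`LEXICON` is hypothetical); a real system would rely on a morphological analyser rather than this brute-force search, and this is not the method used in the cited study.

```python
# Swedish joining morpheme types listed on the slide; "" stands for <omission>.
JOINING_MORPHEMES = ("", "s", "e", "a", "u", "o")

# Hypothetical toy lexicon of component stems, for illustration only.
LEXICON = {"rätt", "fall", "gäst", "bud", "flick", "namn"}


def split_compound(word, lexicon, joiners=JOINING_MORPHEMES):
    """Return all (head, joiner, tail) splits whose head and tail are in the lexicon."""
    splits = []
    for i in range(1, len(word)):
        head_plus_joiner, tail = word[:i], word[i:]
        if tail not in lexicon:
            continue
        for joiner in joiners:
            if joiner and not head_plus_joiner.endswith(joiner):
                continue
            # Strip the joining morpheme (if any) before the lexicon lookup.
            head = head_plus_joiner[: len(head_plus_joiner) - len(joiner)]
            if head in lexicon:
                splits.append((head, joiner, tail))
    return splits


for compound in ("rättsfall", "gästabud", "flicknamn"):
    print(compound, "->", split_compound(compound, LEXICON))
```

Without stripping the joining morpheme, the head "rätts" would not be found in the lexicon and the compound could not be translated component by component.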

Page 18: Managing Morphologically Complex Languages in Information Retrieval Kal Järvelin & Many Others University of Tampere.

Compound Processing, 2

A Finnish natural language query:
  lääkkeet sydänvaivoihin (medicines for heart problems)

The output of morphological analysis:
  lääke
  sydänvaiva, sydän, vaiva

Dictionary translation and the output of component tagging:
  lääke ---> medication drug
  sydänvaiva - ”not in dict”
  sydän ---> heart
  vaiva ---> ailment, complaint, discomfort, inconvenience, trouble, vexation

Many ways to combine components in query

Page 19: Managing Morphologically Complex Languages in Information Retrieval Kal Järvelin & Many Others University of Tampere.

Compound Processing, 3

Sample English CLIR query:
  #sum( #syn( medication drug ) heart #syn( ailment, complaint, discomfort, inconvenience, trouble, vexation ))

i.e. translating as if source compounds were phrases

Source compound handling may vary here:
  #sum( #syn( medication drug ) #syn( #uw3( heart ailment ) #uw3( heart complaint ) #uw3( heart discomfort ) #uw3( heart inconvenience ) #uw3( heart trouble ) #uw3( heart vexation )))

#uw3 = proximity operator for three intervening words, free word order

i.e. forming all proximity combinations as synonym sets.
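
Forming all proximity combinations amounts to taking the Cartesian product of the component translations. The following sketch builds the query string shown above; the function name is illustrative, and the exact operator spelling is copied from the slide rather than checked against a particular InQuery release.

```python
from itertools import product


def proximity_synonym_set(component_translations, window=3):
    """Pick one translation per compound component, wrap each combination
    in an unordered-window operator, and join all of them in one #syn set."""
    groups = [
        f"#uw{window}( {' '.join(combination)} )"
        for combination in product(*component_translations)
    ]
    return "#syn( " + " ".join(groups) + " )"


# Translations of the components of "sydänvaiva" (sydän, vaiva) from the slides.
components = [
    ["heart"],
    ["ailment", "complaint", "discomfort", "inconvenience", "trouble", "vexation"],
]

query = "#sum( #syn( medication drug ) " + proximity_synonym_set(components) + " )"
print(query)
```

The number of proximity groups is the product of the translation counts per component, so queries grow quickly for compounds with several ambiguous components.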

Page 20: Managing Morphologically Complex Languages in Information Retrieval Kal Järvelin & Many Others University of Tampere.

Compound Processing, 4

No clear benefits seen from using proximity combinations.

Nor did we observe a great effect from changing the proximity operator (OD vs. UW).

Some monolingual results follow (Airio 2006)

Page 21: Managing Morphologically Complex Languages in Information Retrieval Kal Järvelin & Many Others University of Tampere.

Language  Index type      Average P %  Diff to Baseline
English   Inflected       43.4
          Lemmas          45.6         2.2
          Stemmed         46.3         2.9
Finnish   Inflected       31.0
          Lemma & decomp  50.5         19.5
          Lemmas          47.0         16.0
          Stemmed         48.5         17.5
Swedish   Inflected       30.2
          Lemma & decomp  38.8         8.6
          Lemmas          31.4         1.2
          Stemmed         33.5         3.3
German    Inflected       30.2
          Lemma & decomp  36.2         6.0
          Lemmas          31.9         1.7
          Stemmed         35.7         5.5

InQuery/CLEF/Raw&TWOL&Porter

Page 22: Managing Morphologically Complex Languages in Information Retrieval Kal Järvelin & Many Others University of Tampere.

[Three figures, not reproduced in the transcript. Panels: Finnish, English, Swedish. Caption: Morphological complexity increases]

Page 23: Managing Morphologically Complex Languages in Information Retrieval Kal Järvelin & Many Others University of Tampere.

Hedlund 2002

Compound translation as compounds:
- 47 German CLEF 2001 topics, English docs collection.
- comprehensive dictionary (many compounds) vs. small dict (no compounds)
- mean AP 34.7% vs. 30.4%
- dictionary matters ...

Alternative approach: if not translatable, split and translate components
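
The alternative approach can be sketched as a simple fallback: translate the compound as a whole when the dictionary has it, otherwise translate its components. The example reuses the Finnish entries from the Compound Processing slides; `translate_key` and `DICTIONARY` are illustrative names, and the component split is assumed to come from a morphological analyser.

```python
def translate_key(word, components, dictionary):
    """Return a list of translation sets for one source key.

    The whole word is tried first; if it is not in the dictionary, each
    component is translated separately. Untranslatable items are kept as-is.
    """
    if word in dictionary:
        return [dictionary[word]]
    if not components:
        return [[word]]
    return [dictionary.get(part, [part]) for part in components]


# Toy dictionary built from the entries shown on the slides.
DICTIONARY = {
    "lääke": ["medication", "drug"],
    "sydän": ["heart"],
    "vaiva": ["ailment", "complaint", "discomfort",
              "inconvenience", "trouble", "vexation"],
}

print(translate_key("lääke", [], DICTIONARY))
print(translate_key("sydänvaiva", ["sydän", "vaiva"], DICTIONARY))
```

Each returned translation set can then be combined into the target query, for instance as one synonym group per set.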

Page 24: Managing Morphologically Complex Languages in Information Retrieval Kal Järvelin & Many Others University of Tampere.

CLEF Ger -> Eng

1. best manually translated 0,4465

2. large dict, no comp splitting 0,3520

3. limited dict, no comp splitting 0,3057

4. large dictionary & comp splitting 0,3830

5. limited dict & comp splitting 0,3547

InQuery/UTAClir/CLEF/Duden/TWOL/UW 5+n

Page 25: Managing Morphologically Complex Languages in Information Retrieval Kal Järvelin & Many Others University of Tampere.

CLIR Findings: Airio 2006

English -> X

Target language  Index type      Average P %  Diff to Baseline %
Finnish          Lemma & decomp  35.5
                 Lemmas          29.0         -6.5
                 Stemmed         20.8         -14.7
Swedish          Lemma & decomp  27.1
                 Lemmas          17.4         -9.7
                 Stemmed         19.0         -8.1
German           Lemma & decomp  31.0
                 Lemmas          26.4         -4.6
                 Stemmed         25.7         -5.3

InQuery/UTAClir/CLEF/GlobalDix/TWOL&Porter

Page 26: Managing Morphologically Complex Languages in Information Retrieval Kal Järvelin & Many Others University of Tampere.

[Figures, not reproduced in the transcript. Panels: Eng->Ger, Eng->Swe, Eng->Fin]

Page 27: Managing Morphologically Complex Languages in Information Retrieval Kal Järvelin & Many Others University of Tampere.

Agenda

1. Introduction
2. Reductive Methods
3. Compounds
4. Generative Methods
5. Query Structures
6. OOV Words
7. Conclusion

Page 28: Managing Morphologically Complex Languages in Information Retrieval Kal Järvelin & Many Others University of Tampere.

4. Generative Methods

[Diagram: taxonomy of variation handling methods]
Reductive Methods: Stemming, Lemmatization
Generative Methods: Infl stem Generation, Word Form Generation
Leaf labels: Rule-based; Rules + Dict; Inflectional Stems; Inflectional Stems, enhanced; FCG; Generating All Forms

Page 29: Managing Morphologically Complex Languages in Information Retrieval Kal Järvelin & Many Others University of Tampere.

Generative Methods: inf stems

Instead of normalization, generate inflectional stems for an inflectional index.
- then using stems harvest full forms from the index
- long queries ...

Page 30: Managing Morphologically Complex Languages in Information Retrieval Kal Järvelin & Many Others University of Tampere.

... OR ...

Instead of normalization, generate full inflectional forms for an inflectional index.
- Long queries? Sure!
- Sounds absolutely crazy ...

Page 31: Managing Morphologically Complex Languages in Information Retrieval Kal Järvelin & Many Others University of Tampere.

... BUT!

Are morphologically complex languages that complex in IR in practice?

Instead of full form generation, only generate sufficient forms -> FCG

In Finnish, 9-12 forms cover 85% of all occurrences of nouns
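
The idea of generating only "sufficient" forms can be sketched as picking the most frequent inflectional forms until a target share of occurrences is covered. The counts below are invented placeholders, not corpus statistics, and the function name is illustrative; the actual FCG procedure may differ.

```python
def frequent_forms(form_counts, target_coverage):
    """Pick the most frequent forms until their cumulative share of all
    occurrences reaches target_coverage (a fraction between 0 and 1)."""
    total = sum(form_counts.values())
    ranked = sorted(form_counts.items(), key=lambda item: item[1], reverse=True)
    chosen, covered = [], 0
    for form, count in ranked:
        if covered / total >= target_coverage:
            break
        chosen.append(form)
        covered += count
    return chosen, covered / total


# INVENTED counts for nine hypothetical inflectional forms (illustration only).
TOY_COUNTS = {
    "form_a": 30, "form_b": 25, "form_c": 15, "form_d": 10, "form_e": 8,
    "form_f": 5, "form_g": 4, "form_h": 2, "form_i": 1,
}

chosen, coverage = frequent_forms(TOY_COUNTS, 0.85)
print(chosen, f"{coverage:.0%}")
```

With a skewed distribution, a handful of forms covers most occurrences, which is what keeps the generated queries short compared with full form generation.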

Page 32: Managing Morphologically Complex Languages in Information Retrieval Kal Järvelin & Many Others University of Tampere.

Kettunen & al 2006: Finnish

IR MAP for relevance level

Method    Liberal  Normal  Stringent
TWOL      37.8     35.0    24.1
FCG12     32.7     30.0    21.4
FCG6      30.9     28.0    21.0
Snowball  29.8     27.7    20.0
Raw       19.6     18.9    12.4

... monolingual ...

Page 33: Managing Morphologically Complex Languages in Information Retrieval Kal Järvelin & Many Others University of Tampere.

Kettunen & al 2007: Other Langs

IR MAP for Language

Method    Swe      Ger      Rus
TWOL      32.6     39.7     ..
FCG       30.6 /4  38.0 /4  32.7 /2
FCG       29.1 /2  36.8 /2  29.2 /6
Snowball  28.5     39.1     34.7
Raw       24.0     35.9     29.8

Results for long queries ... monolingual ...

Page 34: Managing Morphologically Complex Languages in Information Retrieval Kal Järvelin & Many Others University of Tampere.

CLIR Findings: Airio 2008

Language pairs  Raw transl  Fi-FCG_9 / Sv-FCG_4  Fi-FCG_12 / Sv-FCG_7  Lemmatized
Fin -> Eng      11.2        32.4                 32.5                  39.6
Fin -> Swe      14.3        22.6                 23.9                  35.2
Eng -> Swe      18.1        25.1                 27.3                  34.1
Swe -> Fin      11.7        28.0                 27.9                  37.6

Page 35: Managing Morphologically Complex Languages in Information Retrieval Kal Järvelin & Many Others University of Tampere.

Agenda

1. Introduction
2. Reductive Methods
3. Compounds
4. Generative Methods
5. Query Structures
6. OOV Words
7. Conclusion

Page 36: Managing Morphologically Complex Languages in Information Retrieval Kal Järvelin & Many Others University of Tampere.

5. Query Structures

Translation ambiguity such as ...
- Homonymy: homophony, homography
  - Examples: platform, bank, book
- Inflectional homography
  - Examples: train, trains, training
  - Examples: book, books, booking
- Polysemy
  - Examples: back, train

... a problem in CLIR.

Page 37: Managing Morphologically Complex Languages in Information Retrieval Kal Järvelin & Many Others University of Tampere.

Ambiguity Resolution

Methods
- Part-of-speech tagging (e.g. Ballesteros & Croft ‘98)
- Corpus-based methods (Ballesteros & Croft ‘96; ‘97; Chen & al. ‘99)
- Query Expansion
- Collocations

Query structuring - the Pirkola Method (1998)

Page 38: Managing Morphologically Complex Languages in Information Retrieval Kal Järvelin & Many Others University of Tampere.

Query Structuring

From weak to strong query structures by recognition of ...
- concepts
- expression weights
- phrases, compounds

Queries may be combined ... query fusion

[Diagram: decision tree with the questions Concepts?, Weighting? and Phrases? (no / yes at each node), ranging from the weakest structure to the strongest]
Weakest: #sum(a b c d e)
Strongest: #wsum(1 3 #syn(a #3(b c)) 1 #syn(d e))

Page 39: Managing Morphologically Complex Languages in Information Retrieval Kal Järvelin & Many Others University of Tampere.

Structured Queries in CLIR

CLIR performance (Pirkola 1998, 1999)
- English baselines, manual Finnish translations
- Automatic dictionary translation FIN -> ENG
  - natural language queries (NL) vs. concept queries (BL)
  - structured vs. unstructured translations
  - single words (NL/S) vs. phrases marked (NL/WP)
  - general and/or special dictionary translation
- 500.000 document TREC subcollection
- probabilistic retrieval (InQuery)
- 30 health-related requests

Page 40: Managing Morphologically Complex Languages in Information Retrieval Kal Järvelin & Many Others University of Tampere.

The Pirkola Method

All translations of all senses provided by the dictionary are incorporated in the query

All translations of each source language word are combined by the synonym operator, synonym groups by #and or #sum
- this effectively provides disambiguation

Page 41: Managing Morphologically Complex Languages in Information Retrieval Kal Järvelin & Many Others University of Tampere.

An Example

Consider the Finnish natural language query:
  lääke sydänvaiva [= medicine heart_problem]

Sample English CLIR query:
  #sum( #syn( medication drug ) heart #syn( ailment, complaint, discomfort, inconvenience, trouble, vexation ) )

Each source word forming a synonym set
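
The difference between a structured and an unstructured translation can be sketched by building both query strings from the same translation sets. The helper names are illustrative, and the operator syntax is copied from the slides (without the commas inside #syn) rather than verified against InQuery itself.

```python
def syn_group(translations):
    """Wrap the translations of one source word in #syn; a single
    translation is left bare, as 'heart' is in the sample query."""
    if len(translations) == 1:
        return translations[0]
    return "#syn( " + " ".join(translations) + " )"


def structured_query(translation_sets, operator="#sum"):
    """One synonym group per source word, combined by #sum (or #and)."""
    groups = " ".join(syn_group(t) for t in translation_sets)
    return f"{operator}( {groups} )"


def unstructured_query(translation_sets):
    """All translations of all source words in one flat #sum."""
    flat = [word for translations in translation_sets for word in translations]
    return "#sum( " + " ".join(flat) + " )"


translation_sets = [
    ["medication", "drug"],
    ["heart"],
    ["ailment", "complaint", "discomfort", "inconvenience", "trouble", "vexation"],
]

print(structured_query(translation_sets))
print(unstructured_query(translation_sets))
```

In the flat version a source word with many translations contributes many query keys, whereas the synonym grouping treats them as alternatives for one key, which is the disambiguation effect described above.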

Page 42: Managing Morphologically Complex Languages in Information Retrieval Kal Järvelin & Many Others University of Tampere.

Query Translation Test Set-up

[Diagram: test set-up on InQuery / TREC / Unix server]
Boxes: English Request; Translated Finnish Request; Finnish NL Query; Finnish BL Query; General Dict; Med. Dict.; Translated English Queries; Baseline Queries

Page 43: Managing Morphologically Complex Languages in Information Retrieval Kal Järvelin & Many Others University of Tampere.

Unstructured NL/S Queries

Only 38% of the average baseline precision (sd & gd)

#sum(tw11, tw12, ... , tw21, tw22, ... twn1, ... , twnk)

Page 44: Managing Morphologically Complex Languages in Information Retrieval Kal Järvelin & Many Others University of Tampere.

Structured Queries w/ Special Dictionary

[Figure: precision-recall curves, not reproduced in the transcript. Axes: Precision (0-40) vs. Recall (10-100). Curve labels as extracted: GD, SD --> GD, SD and GD, BL, Baseline]

77% of the average baseline precision (sd & gd)

Structure doubles precision in all cases

#and(#syn(tw11, tw12, ... ), #syn(tw21, tw22, ...), #syn( twn1, ..., twnk))

Page 45: Managing Morphologically Complex Languages in Information Retrieval Kal Järvelin & Many Others University of Tampere.

Query Structuring, More Results

CLEF                     Unstructured queries,  Structured queries,  Change %
                         no s-grams             no s-grams
Topic set 2000 (N = 33)
  Finnish - Eng          0.1609                 0.2150               33.6
  German - Eng           0.2097                 0.2639               25.8
  Swedish - Eng          0.2015                 0.2242               11.3
Topic set 2001 (N = 47)
  Finnish - Eng          0.2407                 0.3443               43.0
  German - Eng           0.3242                 0.3830               18.1
  Swedish - Eng          0.3466                 0.3465               0.0

Page 46: Managing Morphologically Complex Languages in Information Retrieval Kal Järvelin & Many Others University of Tampere.

Transit CLIR – Query Structures

Average precision for the transitive, bilingual and monolingual runs of CLEF 2001 topics (N = 50)

Page 47: Managing Morphologically Complex Languages in Information Retrieval Kal Järvelin & Many Others University of Tampere.

Transitive CLIR Results, 2

[Figure: precision-recall curves for the 2001 topics, not reproduced in the transcript. Axes: Precision (0-80) vs. Recall (0-100). Curves: swe-fi-eng, swe-eng, fin-swe-en, fin-eng, monol eng]

Page 48: Managing Morphologically Complex Languages in Information Retrieval Kal Järvelin & Many Others University of Tampere.

Transitive CLIR Effectiveness

Regular relevance, N=35

             MAP   % monolingual performance  % direct performance
Swe-Eng-Fin  21.8  59**                       68**
Eng-Swe-Fin  27.5  75                         88
Ger-Eng-Fin  24.3  66**                       83
Ger-Swe-Fin  29.3  79                         100
Mean         25.7  70                         85

Lehtokangas & al 2008

Page 49: Managing Morphologically Complex Languages in Information Retrieval Kal Järvelin & Many Others University of Tampere.

TransCLIR + pRF effectiveness

Regular relevance, N=35

             MAP   % monolingual + pRF  % monolingual  % direct  % pRF exp direct
Swe-Eng-Fin  28.7  68**                 78             90        78*
Eng-Swe-Fin  32.3  76*                  88             103       87
Ger-Eng-Fin  33.6  79                   91             115       104
Ger-Swe-Fin  34.5  81                   94             118       107
Mean         32.3  76                   88             107       94

Page 50: Managing Morphologically Complex Languages in Information Retrieval Kal Järvelin & Many Others University of Tampere.

Agenda

1. Introduction
2. Reductive Methods
3. Compounds
4. Generative Methods
5. Query Structures
6. OOV Words
7. Conclusion

Page 51: Managing Morphologically Complex Languages in Information Retrieval Kal Järvelin & Many Others University of Tampere.

6. OOV Words

Low coverage -- non-translated words
- Domain-specific terms in general dictionaries
  - e.g. dystrophy
  - Covered in domain-specific dictionaries
- Compound words
- Proper names: persons, geographical names, …
- Often central for query effectiveness

Large dictionaries solution? BUT:
- Excessive number of senses and words for each
- Increases ambiguity problems
- and still many words remain OOVs

Page 52: Managing Morphologically Complex Languages in Information Retrieval Kal Järvelin & Many Others University of Tampere.

OOV Processing

Proper names are often spelled differently between languages
- Transliteration variation
- Brussels, Bryssel; Chernobyl, Tshernobyl; Chechnya, Tsetsenia

Non-translated keys are often used as such in CLIR queries -- simplistic approach, not optimal

In some languages, proper names may inflect
- In Finnish ”in Cairo” = Kairossa

Page 53: Managing Morphologically Complex Languages in Information Retrieval Kal Järvelin & Many Others University of Tampere.

Approximate String Matching

Means for
- correcting spelling errors
- matching variant word forms (e.g. proper names between languages)

Methods
- edit-distance, n-grams, skip-grams
- soundex, phonix, based on phone similarity
- transliteration
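
As an illustration of the first method in the list, here is a minimal sketch of the standard Levenshtein edit distance (unit cost for insertion, deletion and substitution), applied to a proper-name pair from the OOV slides. It is a textbook dynamic-programming version, not the implementation used in the studies cited here.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


print(edit_distance("Bryssel", "Brussels"))
```

A small distance relative to word length suggests that two spellings are variants of the same name.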

Page 54: Managing Morphologically Complex Languages in Information Retrieval Kal Järvelin & Many Others University of Tampere.

Sample digram Matching

Sample words: Bryssel, Brussels, Bruxelles
  Bryssel --> N1 = {br, ry, ys, ss, se, el}
  Brussels --> N2 = {br, ru, us, ss, se, el, ls}
  Bruxelles --> N3 = {br, ru, ux, xe, el, ll, le, es}

sim(N1, N2) = |N1 ∩ N2| / |N1 ∪ N2| = 4 / 9 = 0.444
sim(N1, N3) = 2 / 12 = 1/6 = 0.167
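
The digram similarity above is the size of the intersection of the two digram sets divided by the size of their union. A minimal sketch that reproduces the two values on the slide (function names are illustrative):

```python
def digrams(word):
    """Set of adjacent character pairs of a lower-cased word (no padding)."""
    lowered = word.lower()
    return {lowered[i:i + 2] for i in range(len(lowered) - 1)}


def digram_similarity(word_a, word_b):
    """Shared digrams divided by all distinct digrams of the two words."""
    set_a, set_b = digrams(word_a), digrams(word_b)
    return len(set_a & set_b) / len(set_a | set_b)


print(round(digram_similarity("Bryssel", "Brussels"), 3))   # 0.444
print(round(digram_similarity("Bryssel", "Bruxelles"), 3))  # 0.167
```

Bryssel and Brussels share br, ss, se and el out of nine distinct digrams; Bryssel and Bruxelles share only br and el out of twelve.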

Page 55: Managing Morphologically Complex Languages in Information Retrieval Kal Järvelin & Many Others University of Tampere.

Skip-grams

Generalizing n-gram matching
- The strings to be compared are split into substrings of length n
- Skipping characters is allowed
- Substrings produced using various skip lengths
- n-grams remain a special case: no skips

(Pirkola & Keskustalo & al 2002)

Page 56: Managing Morphologically Complex Languages in Information Retrieval Kal Järvelin & Many Others University of Tampere.

An Example

katalyyttinen – catalytic

skip-0:
  {_k, ka, at, ta, al, ly, yy, yt, tt, ti, in, ne, en, n_}
  {_c, ca, at, ta, al, ly, yt, ti, ic, c_}

skip-1:
  {_a, kt, aa, tl, ay, ly, yt, yt, ti, tn, ie, nn, e_}
  {_a, ct, aa, tl, ay, lt, yi, tc, i_}

skip-2:
  {_t, ka, al, ty, ay, lt, yt, yi, tn, te, in, n_}
  {_t, ca, al, ty, at, li, yc, t_}

calculate similarity over different skip-gram sets
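The skip-gram lists in this example can be generated mechanically: pad the word with "_" on both sides and pair each character with the one `skip + 1` positions later. The sketch below does that. The scoring function is an assumption made for illustration only (grams from all skip lengths are pooled and compared by set overlap); the slide does not spell out the exact scheme used in Pirkola & Keskustalo & al 2002.

```python
def skip_grams(word, skip, pad="_"):
    """Character pairs (w[i], w[i + skip + 1]) of the padded word, in order.
    skip=0 gives ordinary digrams."""
    w = pad + word.lower() + pad
    step = skip + 1
    return [w[i] + w[i + step] for i in range(len(w) - step)]


def skip_gram_similarity(a, b, skips=(0, 1, 2)):
    """Assumed scoring, not taken from the slides: pool the grams of all skip
    lengths, tagged with their skip length so that a skip-0 'at' and a
    skip-2 'at' stay distinct, then divide shared grams by all grams."""
    ga = {(s, g) for s in skips for g in skip_grams(a, s)}
    gb = {(s, g) for s in skips for g in skip_grams(b, s)}
    union = ga | gb
    return len(ga & gb) / len(union) if union else 0.0


for s in (0, 1, 2):
    print(s, skip_grams("katalyyttinen", s))
    print(s, skip_grams("catalytic", s))
print(skip_gram_similarity("katalyyttinen", "catalytic"))
```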

Page 57: Managing Morphologically Complex Languages in Information Retrieval Kal Järvelin & Many Others University of Tampere.

Skip-gram effectiveness

Several n-gram methods tested
  n-grams & skip-grams
  with & without padding

The relative improvement for some s-grams vs. n-grams in X -> Finnish name matching:
  18.2% (Eng medical)
  49.7% (Eng geograph)
  20.7% (Ger geograph)
  17.1% (Swe geograph)

Statistically significant, p = 0.01-0.001

Page 58: Managing Morphologically Complex Languages in Information Retrieval Kal Järvelin & Many Others University of Tampere.

CLIR Findings: Järvelin & al 2008

Closely related languages
  Norwegian and Swedish

Translation by string matching alone
  no dictionary at all, no other vocabulary source
  encouraging results

Page 59: Managing Morphologically Complex Languages in Information Retrieval Kal Järvelin & Many Others University of Tampere.

CLIR Findings: Järvelin & al 2008

[Figure not available in this transcript]

Page 60: Managing Morphologically Complex Languages in Information Retrieval Kal Järvelin & Many Others University of Tampere.

CLIR Findings: Airio 2008

Language pair   Raw transl *)   Fi-sgram_9 /    Fi-sgram_12 /   Lemmatized
                                Sv-sgram_4 *)   Sv-sgram_7 *)
Fin -> Eng      11.2            29.2            31.0            39.6
Eng -> Swe      18.1            25.7            25.3            34.1
Swe -> Fin      11.7            26.1            26.7            37.6
Fin -> Swe      14.3            26.7            22.6            35.2

*) target index not normalized

Page 61: Managing Morphologically Complex Languages in Information Retrieval Kal Järvelin & Many Others University of Tampere.

Rule-based Transliteration

Between languages, the originally 'same' words have different spellings:
  Proper names, technical terms

Transliteration rules are based on regular variations in the spelling of equivalent word forms between languages
  construction - konstruktio, c -> k
  somatology - somatologia, y -> ia
  universiti - university, i -> y

Transliteration rules are mined from a bilingual word list

Their frequency and reliability are recorded

Page 62: Managing Morphologically Complex Languages in Information Retrieval Kal Järvelin & Many Others University of Tampere.

Sample Rules German-to-English

Source string   Target string   Location of the rule   Confidence factor
ekt             ect             middle                 89.2
m               ma              end                    21.1
akt             act             middle                 86.7
ko              co              beginning              80.7
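To make the rule format concrete, here is a minimal sketch of applying such rules to a source word to produce target-language spelling candidates. Only the four rules and their confidence factors come from the table; the data layout, the function names, the confidence cut-off and the sample German words are illustrative assumptions, and the actual TRT implementation may work differently.

```python
# (source string, target string, location, confidence factor) from the table
RULES = [
    ("ekt", "ect", "middle", 89.2),
    ("m", "ma", "end", 21.1),
    ("akt", "act", "middle", 86.7),
    ("ko", "co", "beginning", 80.7),
]


def location_matches(word, start, length, location):
    """Check whether a match at word[start:start + length] sits at the
    beginning, at the end, or strictly inside the word."""
    at_start = start == 0
    at_end = start + length == len(word)
    if location == "beginning":
        return at_start
    if location == "end":
        return at_end
    return not at_start and not at_end  # "middle"


def generate_candidates(word, rules, min_cf=0.0):
    """Return the word plus every spelling reachable by optionally applying
    each sufficiently confident rule wherever its location allows."""
    results = {word}
    for source, target, location, cf in rules:
        if cf < min_cf:
            continue
        new = set()
        for cand in results:
            start = cand.find(source)
            while start != -1:
                if location_matches(cand, start, len(source), location):
                    new.add(cand[:start] + target + cand[start + len(source):])
                start = cand.find(source, start + 1)
        results |= new
    return results


print(sorted(generate_candidates("kollektiv", RULES, min_cf=50.0)))
print(sorted(generate_candidates("reaktion", RULES)))
```

Because every rule is optional, the candidate set grows quickly with the number of applicable rules, so the output still needs a separate step that picks the right form.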

Page 63: Managing Morphologically Complex Languages in Information Retrieval Kal Järvelin & Many Others University of Tampere.

TRT Translation – 2 Steps

[Diagram: (1) TRT Translation turns an OOV Source Word into Translation Candidates, using the TRT Rule Base built by TRT Rule Production; (2) Skip-gram matching against the Target Index yields the Identified Target Word(s). Evaluation: Precision, Recall.]

Page 64: Managing Morphologically Complex Languages in Information Retrieval Kal Järvelin & Many Others University of Tampere.

TRT – rule collections

Rule collection    # Rules   # Rules (CF > 4.0%, Fr. > 2)
Spanish-English    8800      1295
Spanish-German     5412      984
Spanish-French     9724      1430
German-English     8609      1219
French-English     9873      1170

Page 65: Managing Morphologically Complex Languages in Information Retrieval Kal Järvelin & Many Others University of Tampere.

TRT Effectiveness

Finnish-to-English translation (HCF)

Term type       Digrams   TRT + digrams   % chg
Bio terms       61.4      72.0            +17.3
Place names     30.0      35.9            +19.7
Economics       32.2      38.0            +18.0
Technology      31.6      53.7            +69.9
Miscellaneous   33.8      40.6            +20.1
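The "% chg" column is the relative change of the TRT + digrams score over the digrams-only score. The short check below recomputes it from the table's own figures; the function and variable names are illustrative.

```python
def relative_change(baseline, improved):
    """Relative change of improved over baseline, in percent."""
    return (improved - baseline) / baseline * 100.0


# (digrams, TRT + digrams) per term type, taken from the table
ROWS = {
    "Bio terms": (61.4, 72.0),
    "Place names": (30.0, 35.9),
    "Economics": (32.2, 38.0),
    "Technology": (31.6, 53.7),
    "Miscellaneous": (33.8, 40.6),
}

for term_type, (digram_score, trt_score) in ROWS.items():
    print(f"{term_type:15s} {relative_change(digram_score, trt_score):+.1f}")
```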

Page 66: Managing Morphologically Complex Languages in Information Retrieval Kal Järvelin & Many Others University of Tampere.

A Problem with TRT

TRT gives many target word forms for a source word (even tens of thousands) but does not indicate the correct one

For example, in Spanish-English translation TRT gives the following forms for the Spanish word biosintesis:
  biosintesis, biosintessis, biosinthesis, biosinthessis, biosyntesis, biosyntessis, biosynthesis, biosynthessis

To identify the correct equivalent, we use FITE
  frequency-based identification of equivalents

Page 67: Managing Morphologically Complex Languages in Information Retrieval Kal Järvelin & Many Others University of Tampere.

Sample Frequency Pattern

The frequency pattern associated with the English target word candidates for the Spanish word biosintesis:

Target candidate   Doc Freq
biosynthesis       2 230 000
biosintesis        909
biosyntesis        634
biosinthesis       255
biosynthessis      3
biosintessis       0
...                ...
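The idea behind frequency-based identification can be sketched as follows: pick the candidate with the highest document frequency, but only if it clearly dominates the runner-up and its length is close to that of the source word. The document frequencies are the ones in the table; the function name, the two checks as coded and both threshold values are assumptions for illustration, not the published FITE criteria.

```python
def identify_equivalent(source_word, doc_freq, min_ratio=10.0, max_len_diff=3):
    """Return the dominant candidate, or None if no candidate stands out.
    min_ratio and max_len_diff are hypothetical thresholds."""
    ranked = sorted(doc_freq.items(), key=lambda item: item[1], reverse=True)
    if not ranked or ranked[0][1] == 0:
        return None  # no candidate occurs at all
    best, best_freq = ranked[0]
    runner_up_freq = ranked[1][1] if len(ranked) > 1 else 0
    if runner_up_freq > 0 and best_freq / runner_up_freq < min_ratio:
        return None  # frequency pattern is not decisive
    if abs(len(best) - len(source_word)) > max_len_diff:
        return None  # candidate length is too far from the source word
    return best


DOC_FREQ = {
    "biosynthesis": 2_230_000,
    "biosintesis": 909,
    "biosyntesis": 634,
    "biosinthesis": 255,
    "biosynthessis": 3,
    "biosintessis": 0,
}

print(identify_equivalent("biosintesis", DOC_FREQ))  # prints biosynthesis
```

Returning None when the pattern is not decisive favours precision over recall: the word is left untranslated rather than translated wrongly.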

Page 68: Managing Morphologically Complex Languages in Information Retrieval Kal Järvelin & Many Others University of Tampere.

FITE-TRT Translation

[Diagram: (1) TRT Translation turns an OOV Source Word into Translation Candidates, using the TRT Rule Base built by TRT Rule Production; (2) FITE Identification uses Frequency Statistics and the checks "Frequency pattern OK", "Relative frequency OK" and "Length difference OK". Outcomes shown: Identified Target Word, Native Source Word. Evaluation: Precision, Recall.]

Page 69: Managing Morphologically Complex Languages in Information Retrieval Kal Järvelin & Many Others University of Tampere.

FITE-TRT effectiveness

Spanish-English biological and medical spelling variants (n = 89)

Finnish-English biological and medical spelling variants (n = 89)

Translation toward English by TRT
English equivalent identification by FITE

Page 70: Managing Morphologically Complex Languages in Information Retrieval Kal Järvelin & Many Others University of Tampere.

FITE-TRT Effectiveness

Translation Recall and Precision for bio-terms

Source language                Translation Recall %   Translation Prec %
Spanish           Web          91.0                   98.8
                  Freq List    82.0                   98.8
Finnish           Web          71.9                   97.0
                  Freq List    67.4                   97.3

Page 71: Managing Morphologically Complex Languages in Information Retrieval Kal Järvelin & Many Others University of Tampere.

FITE-TRT effectiveness

UTACLIR
  TREC Genomics Track 2004 topics
  Spanish-English actual OOV words (n = 93+5)
  Finnish-English actual OOV words (n = 48+5)

Translation toward English by TRT
English equivalents identification by FITE

Page 72: Managing Morphologically Complex Languages in Information Retrieval Kal Järvelin & Many Others University of Tampere.

FITE-TRT Effectiveness

Translation Recall and Precision for actual OOV words

Source language                Translation Recall %   Translation Precision %
Spanish           Web          89.2                   97.6
                  Freq List    87.1                   97.6
Finnish           Web          72.9                   97.2
                  Freq List    79.2                   95.0

Page 73: Managing Morphologically Complex Languages in Information Retrieval Kal Järvelin & Many Others University of Tampere.

Genomics CLIR experiment

German - English CLIR in genomics
  TREC Genomics Track data, 50 topics (20 training + 30 testing)
  MedLine 4.6 M medical abstracts
  Baseline: raw German as such + Dict transl
  Test queries - FITE-TRT

Page 74: Managing Morphologically Complex Languages in Information Retrieval Kal Järvelin & Many Others University of Tampere.

Genomics CLIR experiment

[Figure not available in this transcript]

Page 75: Managing Morphologically Complex Languages in Information Retrieval Kal Järvelin & Many Others University of Tampere.

Agenda

1. Introduction
2. Reductive Methods
3. Compounds
4. Generative Methods
5. Query Structures
6. OOV Words
7. Conclusion

Page 76: Managing Morphologically Complex Languages in Information Retrieval Kal Järvelin & Many Others University of Tampere.

8. Conclusions

Monolingual retrieval
  morphological complexity
  who owns the index?
  reductive and generative approaches
  skewed distributions; surprisingly little may be enough
  compound handling perhaps not critical in monolingual retrieval

Page 77: Managing Morphologically Complex Languages in Information Retrieval Kal Järvelin & Many Others University of Tampere.

More Conclusions

Cross-language retrieval
  Query structuring simple and effective
  Closely related languages / dialects
    simple language-independent techniques may be enough
  Observe what happens between compounding vs isolating languages
    compound splitting seems essential
  Inflected form indices
    skip-grams or FCG after translation may be a solution
  Specific domains: domain dictionaries or aligned corpora

Page 78: Managing Morphologically Complex Languages in Information Retrieval Kal Järvelin & Many Others University of Tampere.

Thanks!