DCU meets MET: Bengali and Hindi Morpheme Extraction
Debasis Ganguly, Johannes Leveling, Gareth J.F. Jones
CNGL, School of Computing, Dublin City University, Ireland
-
Outline
  Motivation
  Task Description
  Bengali Stemming Approach
  Hindi Stemming Approach
  Results
  Conclusions and Future Work
-
Motivation
  Some languages have complex inflectional and derivational morphology, i.e. the same base form can correspond to multiple surface word forms
    Example: company, companies → company; hopeful → hope
  For information retrieval, indexing surface forms leads to many mismatches between query terms and the index terms extracted from documents
  Index base forms/stems: reduce different surface forms to the same index form (stem, lemma) to increase the chance of matching query terms with document terms
-
Task Description
  Morpheme Extraction Task: investigate the effect of morphological analysis, lemmatization, and stemming on information retrieval (IR) performance for Indian languages
  Subtasks:
    Subtask 1: manual evaluation of morpheme extraction
    Subtask 2: IR evaluation using the proposed morpheme representation as index terms; the evaluation metric is mean average precision (MAP)
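Since Subtask 2 is scored by MAP, a minimal sketch of the metric may help. It uses the standard definition (precision averaged over the ranks of the relevant documents, then averaged over queries); the function names and the toy document ids are hypothetical and are not taken from the MET evaluation scripts.

```python
def average_precision(ranked_doc_ids, relevant_ids):
    """Average precision of one ranked result list for one query."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant hit
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """Mean of the per-query average precision values.

    `runs` is a list of (ranked_doc_ids, relevant_ids) pairs, one per query.
    """
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Hypothetical toy data: two queries with made-up document ids.
toy_runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),
    (["d5", "d6"], {"d6"}),
]
toy_map = mean_average_precision(toy_runs)
```

A stemmer helps this score when it lets a query term match relevant documents that use a different surface form, which moves those documents up the ranking.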
-
Stemming Approaches
  Light vs. aggressive stemming
  Rule-based vs. corpus-based stemming
    manually created rules vs. clusters of related words
  Iteratively remove word suffixes
  Problems:
    overstemming, i.e. the removed suffix is too long, e.g. international/intern; news/new
    understemming, i.e. the removed suffix is too short, e.g. forgetfulness/forgetful
    irregular forms, e.g. feet/foot; women/woman
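The trade-off between light and aggressive stemming can be illustrated with a toy iterative suffix stripper for English. This is only a sketch: the two suffix lists and the minimum stem length are assumptions chosen to reproduce the examples on this slide, not rules from any real stemmer.

```python
def strip_suffixes(word, suffixes, min_stem_len=3):
    """Iteratively remove the longest matching suffix from `word`.

    Stops when no suffix matches or when removing one would leave
    fewer than `min_stem_len` characters (an assumed safety guard).
    """
    ordered = sorted(suffixes, key=len, reverse=True)
    changed = True
    while changed:
        changed = False
        for suffix in ordered:
            if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
                word = word[: -len(suffix)]
                changed = True
                break
    return word


# Hypothetical rule sets for illustration only.
LIGHT_RULES = ["ness", "s"]
AGGRESSIVE_RULES = ["ational", "ness", "ful", "s"]

overstemmed = strip_suffixes("international", AGGRESSIVE_RULES)  # "intern"
overstemmed_news = strip_suffixes("news", LIGHT_RULES)           # "new"
understemmed = strip_suffixes("forgetfulness", LIGHT_RULES)      # "forgetful"
fully_stemmed = strip_suffixes("forgetfulness", AGGRESSIVE_RULES)  # "forget"
irregular = strip_suffixes("feet", AGGRESSIVE_RULES)             # "feet", never "foot"
```

No suffix list fixes the irregular forms; those need a separate lookup or exclusion list.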
-
Our Bengali Stemming Approach
  Rule-based stemmer created by a native speaker
  Focus on nouns (most important for IR)
  Four suffix categories [Bhattacharya et al. 2005]:
    Title markers added as suffixes to proper nouns, e.g. suffixes meaning "Mrs.", "sir"
    Classifiers for plurality and specificity/gender of a noun, e.g. forms meaning "pictures", "the picture", "female student"
    Case markers for possessive or accusative relations, e.g. a form meaning "family's"
    Emphasizers that emphasize the current word, e.g. forms meaning "only a picture", "only this picture"
-
Bengali Stemmer
  Drop emphasizers (iteratively)
  Drop classifiers and case markers
  Drop title markers
  Drop plural suffixes
  Drop derivational suffixes
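The order of the steps above can be sketched as a small pipeline. The suffix lists below are a handful of common Bengali suffixes chosen for illustration; they are an assumption and not the rule set of the stemmer described here (which was implemented in C). The helper names and the minimum stem length are likewise hypothetical.

```python
# Illustrative suffix lists only; NOT the rule set of the stemmer described above.
EMPHASIZERS = ("\u0987", "\u0993")  # ই, ও
CLASSIFIERS_AND_CASE_MARKERS = (
    "\u0997\u09c1\u09b2\u09cb",  # গুলো (plural classifier)
    "\u0997\u09c1\u09b2\u09bf",  # গুলি (plural classifier)
    "\u099f\u09bf",              # টি (definite classifier)
    "\u099f\u09be",              # টা (definite classifier)
    "\u0995\u09c7",              # কে (accusative case marker)
)
TITLE_MARKERS = ()               # intentionally empty in this sketch
PLURAL_SUFFIXES = ("\u09b0\u09be",)  # রা
DERIVATIONAL_SUFFIXES = ()       # intentionally empty in this sketch

MIN_STEM_LEN = 2  # assumed guard against stripping a word down to nothing


def drop_suffix(word, suffixes, repeat=False):
    """Remove the longest matching suffix; keep removing if `repeat` is True."""
    ordered = sorted(suffixes, key=len, reverse=True)
    while True:
        for suffix in ordered:
            if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
                word = word[: -len(suffix)]
                break
        else:
            return word  # no suffix matched
        if not repeat:
            return word


def stem_bengali(word):
    """Apply the five steps in the order listed on the slide."""
    word = drop_suffix(word, EMPHASIZERS, repeat=True)
    word = drop_suffix(word, CLASSIFIERS_AND_CASE_MARKERS)
    word = drop_suffix(word, TITLE_MARKERS)
    word = drop_suffix(word, PLURAL_SUFFIXES)
    word = drop_suffix(word, DERIVATIONAL_SUFFIXES)
    return word
```

With these lists, a form built from ছবি ("picture") plus the classifier টি and the emphasizer ই is reduced back to ছবি: the emphasizer is removed first, which exposes the classifier for the second step.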
-
Our Hindi Stemming Approach
  Hindi has less complex inflectional morphology → fewer stemming rules
  Rule-based stemmer
  Stemming rules manually created by a native Hindi speaker
-
Hindi Stemmer
  Iteratively remove Hindi vowels, matras, anusvara, and the character ya from the right of a string until the first consonant is encountered
  Drop derivational suffixes, e.g. forms meaning "to boys" → "boy"; "to girls" → "girl"
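The first step of the Hindi stemmer can be sketched directly from its description. This is a minimal sketch, not the authors' C code: the exact Unicode ranges used for "vowels" and "matras" and the guard that keeps at least one character are assumptions, and the derivational-suffix step is left out because its suffix list is not given here.

```python
# Devanagari code points (Unicode block U+0900..U+097F).
ANUSVARA = "\u0902"                                           # ं
YA = "\u092f"                                                 # य
INDEPENDENT_VOWELS = {chr(c) for c in range(0x0905, 0x0915)}  # अ .. औ
MATRAS = {chr(c) for c in range(0x093E, 0x094D)}              # ा .. ौ (vowel signs)

STRIPPABLE = INDEPENDENT_VOWELS | MATRAS | {ANUSVARA, YA}


def strip_trailing_vowels(word):
    """Remove vowels, matras, anusvara and ya from the right of `word`.

    Stops at the first other character (in practice a consonant).
    Assumption: at least one character is always kept.
    """
    end = len(word)
    while end > 1 and word[end - 1] in STRIPPABLE:
        end -= 1
    return word[:end]


# किताबें ("books") and किताबों (oblique plural) both reduce to किताब ("book").
kitab = "\u0915\u093f\u0924\u093e\u092c"
stem_direct = strip_trailing_vowels(kitab + "\u0947\u0902")   # किताबें
stem_oblique = strip_trailing_vowels(kitab + "\u094b\u0902")  # किताबों
```

Because ya is itself a consonant, "first consonant" is read here as the first consonant other than ya, which is what allows endings such as -iyon to be removed in one pass.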
-
MET Experiments
  Experiments for Bengali and Hindi
  Stemmers implemented in C
  Submission as source code
  Stemmed forms are used for retrieval with Terrier
-
Results
Team        Language  MAP     Change vs. baseline
Baseline    Bengali   0.2740
JU          Bengali   0.3307  +20.69%
DCU         Bengali   0.3300  +20.44%
IIT-KGP     Bengali   0.3225  +17.70%
CVPR-Team   Bengali   0.3159  +15.29%
ISM         Bengali   0.3103  +13.25%
Baseline    Hindi     0.2821
DCU         Hindi     0.2963  +5.03%
ISM         Hindi     0.2793  -0.99%
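The percentages in the results are relative changes in MAP with respect to the unstemmed baseline for the same language. A short sketch of that arithmetic, with a hypothetical function name:

```python
def relative_change_percent(score, baseline):
    """Relative change of `score` over `baseline`, in percent."""
    return (score - baseline) / baseline * 100.0


# MAP values taken from the results above.
dcu_bengali = relative_change_percent(0.3300, 0.2740)
dcu_hindi = relative_change_percent(0.2963, 0.2821)
ism_hindi = relative_change_percent(0.2793, 0.2821)
```

Rounded to two decimals these give +20.44, +5.03 and -0.99, matching the reported figures.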
-
Conclusions
Bengali stemmer: 2nd best performance
Hindi stemmer: best performance
Both have also been used successfully in previous ad-hoc IR experiments for FIRE
-
Future work
  Explore the use of exclusion lists for irregular cases
  Extend the rule set (i.e. handle verbs)
  Compare to other stemmers for Bengali/Hindi, e.g. the Indian language support in version 4 of Lucene and the stemmers from Jacques Savoy's web page on cross-language IR
  Investigate the morphology of named entities
-
Thank+s for your attention
Any question+s?