DCU meets MET: Bengali and Hindi Morpheme Extraction Debasis Ganguly, Johannes Leveling, Gareth J.F. Jones CNGL, School of Computing, Dublin City University, Ireland

  • DCU meets MET: Bengali and Hindi Morpheme Extraction
      • Debasis Ganguly, Johannes Leveling, Gareth J.F. Jones
      • CNGL, School of Computing, Dublin City University, Ireland

  • Outline
      • Motivation
      • Task Description
      • Bengali Stemming Approach
      • Hindi Stemming Approach
      • Results
      • Conclusions and Future Work

  • Motivation
      • Some languages have complex inflectional and derivational morphology, i.e. the same base form can correspond to multiple surface word forms
      • Example: company, companies → company; hopeful → hope
      • For information retrieval, indexing surface forms would lead to many mismatches between query terms and the index terms extracted from documents
      • Index base forms/stems: reduce different surface forms to the same index form (stem, lemma) to increase the chance of matching query terms with document terms

  • Task Description
      • Morpheme Extraction Task (MET): investigate the effect of morphological analysis, lemmatization, and stemming on information retrieval (IR) performance for Indian languages

    Subtasks:
      • Subtask 1: manual evaluation of morpheme extraction
      • Subtask 2: IR evaluation using the proposed morpheme representation as index terms; the evaluation metric is mean average precision (MAP)
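
For readers unfamiliar with the Subtask 2 metric, here is a minimal sketch of how non-interpolated mean average precision can be computed. The function names, query IDs, and document IDs are hypothetical and chosen only for illustration; the official evaluation tooling used in the task is not described on the slides.

```python
def average_precision(ranked_doc_ids, relevant_ids):
    """Non-interpolated average precision for a single query."""
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            # Precision at the rank of each relevant document.
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant_ids)


def mean_average_precision(runs, qrels):
    """Mean of the per-query average precision over all judged queries."""
    scores = [average_precision(runs.get(q, []), rel) for q, rel in qrels.items()]
    return sum(scores) / len(scores) if scores else 0.0


# Toy data (hypothetical): ranked result lists and relevance judgments.
runs = {"q1": ["d1", "d2", "d3", "d4"], "q2": ["d5", "d6", "d7"]}
qrels = {"q1": {"d1", "d3"}, "q2": {"d7"}}
print(round(mean_average_precision(runs, qrels), 4))
```

In this toy run, q1 scores (1/1 + 2/3) / 2 and q2 scores 1/3, so a stemmer that moves relevant documents higher in the ranking raises MAP.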

  • Stemming Approaches
      • Light vs. aggressive stemming
      • Rule-based vs. corpus-based stemming: manually created rules vs. clusters of related words
      • Iteratively remove word suffixes
      • Problems:
          • Overstemming, i.e. the removed suffix is too long, e.g. international/intern; news/new
          • Understemming, i.e. the removed suffix is too short, e.g. forgetfulness/forgetful
          • Irregular forms, e.g. feet/foot; women/woman
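
The trade-offs listed on this slide can be reproduced with a small sketch of iterative suffix stripping. The English suffix list, the minimum stem length, and the `max_passes` parameter are assumptions made for illustration; they are not the rules of any stemmer discussed in this talk.

```python
# Hypothetical English suffix list, for illustration only.
SUFFIXES = ["ational", "ness", "ful", "ies", "es", "s"]
MIN_STEM_LEN = 3  # assumed guard so that very short stems are not produced


def strip_suffixes(word, max_passes=None):
    """Remove the longest matching suffix, repeating until nothing matches.

    max_passes=None strips iteratively (aggressive);
    max_passes=1 removes at most one suffix (light).
    """
    ordered = sorted(SUFFIXES, key=len, reverse=True)
    passes = 0
    while max_passes is None or passes < max_passes:
        for suffix in ordered:
            if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
                word = word[: -len(suffix)]
                break
        else:
            return word  # no suffix matched in this pass
        passes += 1
    return word


print(strip_suffixes("international"))                # overstemming
print(strip_suffixes("news"))                         # overstemming
print(strip_suffixes("forgetfulness", max_passes=1))  # understemming
print(strip_suffixes("forgetfulness"))                # iterative stripping
print(strip_suffixes("feet"))                         # irregular form is left as-is
```

Suffix rules alone cannot map feet to foot, which is why irregular forms need separate handling.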

  • Our Bengali Stemming Approach
      • Rule-based stemmer created by a native speaker
      • Focus on nouns (most important for IR)
      • Four suffix categories [Bhattacharya et al. 2005]:
          • Title markers added as suffixes to proper nouns, e.g. (Mrs.), (sir)
          • Classifiers for plurality and specificity/gender of a noun, e.g. (pictures), (the picture), (female student)
          • Case markers for possessive or accusative relations, e.g. (family's)
          • Emphasizers to emphasize the current word, e.g. (only a picture), (only this picture)

  • Bengali Stemmer
      • Drop emphasizers (iteratively)
      • Drop classifiers and case markers
      • Drop title markers
      • Drop plural suffixes
      • Drop derivational suffixes

  • Our Hindi Stemming Approach
      • Hindi has less complex inflectional morphology, so fewer stemming rules are needed
      • Rule-based stemmer
      • Stemming rules manually created by a native Hindi speaker

  • Hindi Stemmer
      • Iteratively remove Hindi vowels, matras, the anusvara, and the character ya from the right of a string until the first consonant is encountered
      • Drop derivational suffixes, e.g. (to boys) → (boy); (to girls) → (girl)
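
The first Hindi rule can be sketched in a few lines. This is an illustrative Python version, not the authors' C implementation. The Unicode ranges chosen for "vowels" (U+0905 to U+0914) and "matras" (U+093E to U+094C), the `min_len` guard, and the function name are assumptions, because the slides do not list the exact character set.

```python
ANUSVARA = "\u0902"  # DEVANAGARI SIGN ANUSVARA
YA = "\u092f"        # DEVANAGARI LETTER YA

# Assumed character classes; the slides do not enumerate them.
INDEPENDENT_VOWELS = {chr(cp) for cp in range(0x0905, 0x0915)}  # U+0905..U+0914
MATRAS = {chr(cp) for cp in range(0x093E, 0x094D)}              # U+093E..U+094C
STRIPPABLE = INDEPENDENT_VOWELS | MATRAS | {ANUSVARA, YA}


def strip_right(word, min_len=1):
    """Drop strippable characters from the right until another character
    (in practice a consonant) is reached, keeping at least min_len characters."""
    end = len(word)
    while end > min_len and word[end - 1] in STRIPPABLE:
        end -= 1
    return word[:end]


# la, dda, nukta, ka, o-matra, anusvara: oblique plural of "boy"
ladkon = "\u0932\u0921\u093c\u0915\u094b\u0902"
# la, dda, nukta, ka, aa-matra: "boy"
ladka = "\u0932\u0921\u093c\u0915\u093e"

print(strip_right(ladkon) == strip_right(ladka))
```

Both forms reduce to the same consonant-final string, which is the conflation that matters for indexing. Under these assumed character classes the rule is fairly aggressive: stripping ya together with the surrounding matras would also reduce the feminine forms to that same string.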

  • MET Experiments
      • Experiments for Bengali and Hindi
      • Stemmers implemented in C
      • Submission as source code
      • Stemmed forms are used for retrieval with Terrier

  • Results

    Team        Language   MAP (change vs. baseline)
    Baseline    Bengali    0.2740
    JU          Bengali    0.3307 (+20.69%)
    DCU         Bengali    0.3300 (+20.44%)
    IIT-KGP     Bengali    0.3225 (+17.70%)
    CVPR-Team   Bengali    0.3159 (+15.29%)
    ISM         Bengali    0.3103 (+13.25%)
    Baseline    Hindi      0.2821
    DCU         Hindi      0.2963 (+5.03%)
    ISM         Hindi      0.2793 (-0.99%)
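
The percentages in parentheses are relative changes against the per-language baseline MAP. A short check, with a hypothetical helper name, reproduces them from the reported scores:

```python
def relative_change_pct(map_score, baseline):
    """Relative change of a MAP score against the baseline, in percent."""
    return (map_score - baseline) / baseline * 100.0


BENGALI_BASELINE = 0.2740
HINDI_BASELINE = 0.2821

print(f"{relative_change_pct(0.3300, BENGALI_BASELINE):+.2f}%")  # DCU, Bengali
print(f"{relative_change_pct(0.2963, HINDI_BASELINE):+.2f}%")    # DCU, Hindi
print(f"{relative_change_pct(0.2793, HINDI_BASELINE):+.2f}%")    # ISM, Hindi
```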

  • Conclusions

    Bengali stemmer: second-best performance (MAP 0.3300, +20.44% over the baseline)

    Hindi stemmer: best performance (MAP 0.2963, +5.03% over the baseline)

    Both have also been used successfully in previous ad-hoc IR experiments for FIRE

  • Future work
      • Explore the use of exclusion lists for irregular cases
      • Extend the rule set (i.e. handle verbs)
      • Compare to other stemmers for Bengali/Hindi, e.g. the Indian language stemmers in version 4 of Lucene and the stemmers from Jacques Savoy's web page on cross-language IR
      • Investigate the morphology of named entities

  • Thank+s for your attention. Any question+s?
