ISM@FIRE 2012: Adhoc Retrieval Task & Morpheme Extraction Task Avinash Yadav Robins Yadav Sukomal Pal Department of Computer Science & Engineering Indian School of Mines Dhanbad, India


ISM@FIRE 2012: Adhoc Retrieval Task & Morpheme Extraction Task

Avinash Yadav, Robins Yadav, Sukomal Pal

Department of Computer Science & Engineering, Indian School of Mines, Dhanbad, India


Contents

Introduction
Adhoc retrieval task participation
Morpheme Extraction Task participation
Conclusion


Introduction

Stemmer
ISMstemmer
Evaluation


Stemmer

Attempts to reduce word variants to their stem or root form.

Example: education, educating, educative will all reduce to educat.

Approaches for stemming:
Language-based approach
Statistical approach


ISMstemmer

Statistical stemmer
Based on suffix extraction
Suffix frequency
Algorithm


Data Preprocessing

Convert the corpus into a single file:
File 1, File 2, ..., File n → Single File

Cleaning of data:
Input: John asked a girl with an apple of Kashmir, “do you have the time”. She said, “yes”.
Output: John asked a girl with an apple of Kashmir do you have the time she said yes

Removing stop words:
Input: John asked a girl with an apple of Kashmir do you have the time she said yes
Output: John asked girl with apple Kashmir you time she said yes

Convert file into single column:
Input: John asked girl with apple Kashmir you time she said yes
Output (one word per line): John, asked, girl, with, apple, Kashmir, you, time, she, said, yes
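The preprocessing steps above can be sketched in Python roughly as follows. This is a minimal illustration, not the authors' code: the stop-word list is hypothetical (it contains only the words dropped in the slide's example), the function names are made up for this sketch, and case is left untouched because the slides do not state a casing rule.

```python
import string

# Hypothetical stop-word list: only the words the slide's example drops.
STOP_WORDS = {"a", "an", "of", "do", "have", "the"}

# Characters to blank out: ASCII punctuation plus the curly quotes in the example.
_PUNCT_TABLE = str.maketrans({ch: " " for ch in string.punctuation + "“”‘’"})


def clean(text):
    """Replace punctuation with spaces so only whitespace-separated words remain."""
    return text.translate(_PUNCT_TABLE)


def preprocess(documents):
    """Merge documents, clean them, drop stop words, return one word per list item."""
    merged = " ".join(documents)      # convert the corpus into a single file
    tokens = clean(merged).split()    # cleaning of data
    # Removing stop words; the comparison is case-insensitive (an assumption).
    return [t for t in tokens if t.lower() not in STOP_WORDS]


def unique_words(words):
    """Sorted list of distinct words (the vocabulary used by later steps)."""
    return sorted(set(words))
```

Writing the returned list with one item per line gives the single-column file used in the following steps.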


Data preprocessing (contd.)

Unique words extracted:
Hindi: 4,90,391
English: 7,95,144


Find valid suffixes

Reverse the words of the single-column file:
Input: aborning absolution absorption abuilding acquisition activation added addition admiration admitted admitting agreed agreeing allotted allotting ambling angling
Output: gninroba noitulosba noitprosba gnidliuba noitisiuqca noitavitca dedda noitidda noitarimda dettimda gnittimda deerga gnieerga dettolla gnittolla gnilbma gnilgna

Sort the reversed list:
Output: dedda deerga dettimda dettolla gnidliuba gnieerga gnilbma gnilgna gninroba gnittimda gnittolla noitarimda noitavitca noitidda noitisiuqca noitprosba noitulosba

Find suffixes according to threshold:
[Diagram: the sorted list above, annotated with reversed suffixes and their frequencies. Surviving labels: "degniniot" (garbled), "gni", 17%, 40%]
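A rough sketch of threshold-based suffix selection is given below. The slides do not spell out how suffix frequency is computed, so this assumes it is the share of unique words that end in a given suffix; `candidate_suffixes` and `max_len` are hypothetical names introduced only for illustration.

```python
from collections import Counter


def candidate_suffixes(words, threshold, max_len=5):
    """Return {suffix: relative frequency} for suffixes at or above the threshold.

    Words are reversed and sorted, as on the slide, so that a suffix becomes a
    prefix of the reversed word and equal suffixes end up next to each other.
    """
    reversed_words = sorted(w[::-1] for w in set(words))
    counts = Counter()
    for rw in reversed_words:
        # Leave at least one character of the word as the stem.
        for n in range(1, min(max_len, len(rw) - 1) + 1):
            counts[rw[:n]] += 1
    total = len(reversed_words)
    return {
        prefix[::-1]: count / total
        for prefix, count in counts.items()
        if count / total >= threshold
    }
```

On the 17 example words, "ing" ends 7 of them, so it is kept for any threshold up to 7/17, while "ting" ends only 2 and is dropped at a threshold of 0.2.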


Threshold used

English: 0.01 – 0.1%

Hindi: 0.1 – 1.0%


Stemming of corpus

Stem the reversed words with the reversed valid suffixes:
Input: dedda deerga dettimda dettolla gnidliuba gnieerga gnilbma gnilgna gninroba gnittimda gnittolla noitarimda noitavitca noitidda noitisiuqca noitprosba noitulosba
Output: dda erga ttimda ttolla dliuba eerga lbma lgna nroba ttimda ttolla arimda avitca idda isiuqca prosba ulosba

Reverse the stemmed words to restore the original character order:
Input: dda erga ttimda ttolla dliuba eerga lbma lgna nroba ttimda ttolla arimda avitca idda isiuqca prosba ulosba
Output: add agre admitt allott abuild agree ambl angl aborn admitt allott admira activa addi acquisi absorp absolu
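The reverse, strip, reverse procedure above could look like the following in Python. The suffix set {"ed", "ing", "tion"} is inferred from the example output, and trying the longest suffix first is an assumption of this sketch; `stem_reversed` is a hypothetical name.

```python
def stem_reversed(word, suffixes):
    """Strip one valid suffix by working on the reversed word, as on the slide."""
    reversed_word = word[::-1]
    # Try longer reversed suffixes first (an assumption; the slides do not say).
    for rev_suffix in sorted((s[::-1] for s in suffixes), key=len, reverse=True):
        if reversed_word.startswith(rev_suffix) and len(reversed_word) > len(rev_suffix):
            stemmed_reversed = reversed_word[len(rev_suffix):]
            return stemmed_reversed[::-1]  # back to the original character order
    return word  # no valid suffix matched
```

For example, `stem_reversed("admitted", {"ed", "ing", "tion"})` goes through "dettimda" and "ttimda" and yields "admitt".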


Note:

If a word would be shorter than 3 letters after stemming, it is not stemmed.

Example: aging and king are left unchanged, because stemming would reduce them to ag and k.
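The length rule in this note can be added as a simple guard. In this sketch the minimum length of 3 comes from the slide, while the function name, the longest-first order and the fall-through to shorter suffixes are assumptions.

```python
MIN_STEM_LEN = 3  # from the slide: stems shorter than 3 letters are rejected


def stem(word, suffixes, min_len=MIN_STEM_LEN):
    """Strip the longest valid suffix that leaves at least `min_len` letters."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix):
            candidate = word[:len(word) - len(suffix)]
            if len(candidate) >= min_len:
                return candidate
            # Stem too short: fall through and try a shorter suffix, if any.
    return word  # leave the word unstemmed
```

With the suffix "ing", `stem("aging", {"ing"})` and `stem("king", {"ing"})` return the words unchanged, while `stem("ambling", {"ing"})` returns "ambl".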


Evaluation of ISMstemmer

To evaluate ISMstemmer, we participated in:

1. Monolingual Adhoc retrieval task in English and Hindi Languages

2. Morpheme Extraction Task (MET) of FIRE-2012


Adhoc Retrieval Task (ART) Participation

Monolingual task
Languages chosen:
English
  Approach
  Results
Hindi
  Approach
  Results


ART: English

Approach:
Indexing: search engine used: Indri (IndriBuildIndex)
Retrieval: search engine used: Lemur (RetEval)

Data provided:
Corpus from The Telegraph and BD News
Set of 50 queries


ART: English (contd.)

Results:

Run id                 No. of queries  No. of results  No. of relevant docs  No. of rel. docs retrieved  MAP value
EE.ism.unstemmed       50              50000           3539                  2503                        0.2264
EE.ism.krovetzstemmer  50              50000           3539                  2504                        0.2255
EE.ism.ismstemmer      50              50000           3539                  2415                        0.2096
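For readers unfamiliar with the MAP column, the sketch below shows the standard (uninterpolated) definition of mean average precision. The slides do not say which evaluation tool produced the reported values, so this is only an illustration of the metric; the function names and the document ids used with it are hypothetical.

```python
def average_precision(ranked_doc_ids, relevant_ids):
    """Average precision of one ranked list against a set of relevant doc ids."""
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant document
    return precision_sum / len(relevant_ids)


def mean_average_precision(runs, qrels):
    """runs: query id -> ranked doc ids; qrels: query id -> set of relevant ids."""
    per_query = [average_precision(runs.get(q, []), rel) for q, rel in qrels.items()]
    return sum(per_query) / len(per_query)
```

A query whose two relevant documents appear at ranks 1 and 3 has average precision (1/1 + 2/3) / 2; relevant documents that are never retrieved contribute zero.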


ART: Hindi

Approach:
Indexing: search engine used: Indri (IndriBuildIndex)
Retrieval: search engine used: Indri (IndriRunQuery)

Data provided:
Corpus from Navbharat Times and Amar Ujala
Set of 50 queries


ART: Hindi (contd.)

Results:

Run id                            No. of queries  No. of results  No. of relevant docs  No. of rel. docs retrieved  MAP value
HH.ism.unstemmed.indri            50              50000           2309                  222                         0.0173
HH.stemmmedcorpus.unstemmedquery  50              50000           2309                  98                          0.0026
HH.stemmmedcorpus.stemmedquery    50              50000           2309                  209                         0.0137
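The relevant-document counts in the English and Hindi ART result tables also give recall directly. The short sketch below does that arithmetic; the numbers are copied from the tables, and the helper name `recall` is introduced here only for illustration.

```python
# (run id, relevant docs, relevant docs retrieved), copied from the ART tables.
RUNS = [
    ("EE.ism.unstemmed", 3539, 2503),
    ("EE.ism.krovetzstemmer", 3539, 2504),
    ("EE.ism.ismstemmer", 3539, 2415),
    ("HH.ism.unstemmed.indri", 2309, 222),
    ("HH.stemmmedcorpus.unstemmedquery", 2309, 98),
    ("HH.stemmmedcorpus.stemmedquery", 2309, 209),
]


def recall(relevant, relevant_retrieved):
    """Fraction of the relevant documents that the run retrieved."""
    return relevant_retrieved / relevant


recall_percent = {run: 100 * recall(rel, ret) for run, rel, ret in RUNS}

if __name__ == "__main__":
    for run, value in recall_percent.items():
        print(f"{run}: {value:.1f}%")
```

The English runs retrieve roughly two thirds or more of the relevant documents, whereas every Hindi run retrieves fewer than one in ten.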


Morpheme Extraction Task Participation

Tool submitted
Results


MET Tool Submission

ISMstemmer submitted
Evaluated at IR Labs: DAIICT, Gujarat
Tested on 6 languages of South Asian origin
Has given efficient results with 3 languages


MET Results:

1. BENGALI

Institute    Language  MAP Obtained
Baseline     Bengali   0.2740
JU           Bengali   0.3307
DCU          Bengali   0.3300
IIT-KGP      Bengali   0.3225
CVPR-Team1   Bengali   0.3159
ISM          Bengali   0.3103
CVPR-Team2+  Bengali   NA


MET Results (contd.)

2. GUJARATI

Institute  Language  MAP Obtained
Baseline   Gujarati  0.2677
ISM        Gujarati  0.2824

3. MARATHI

Institute  Language  MAP Obtained
Baseline   Marathi   0.2320
ISM        Marathi   0.2797
IIT-B      Marathi   0.2684


MET Results (contd.)

4. ODIA

Institute  Language  MAP Obtained
Baseline   Odia      0.1537
IIIT-Bh    Odia      0.1537
ISM        Odia      0.1537

5. HINDI

Institute  Language  MAP Obtained
Baseline   Hindi     0.2821
DCU        Hindi     0.2963
ISM        Hindi     0.2793


MET Results (contd.)

6. TAMIL

Institute  Language  MAP Obtained
Baseline   Tamil     NA
AUCEG      Tamil     NA
ISM        Tamil     NA

NA: results are not available, due to non-availability of qrels
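To compare ISM against the baseline across languages, the MAP values in the MET tables can be turned into relative changes. The sketch below only does that arithmetic on the reported figures; Tamil is left out because no MAP is available, and `relative_change` is a name introduced for this illustration.

```python
# language -> (baseline MAP, ISM MAP), copied from the MET result tables.
MET_MAP = {
    "Bengali": (0.2740, 0.3103),
    "Gujarati": (0.2677, 0.2824),
    "Marathi": (0.2320, 0.2797),
    "Odia": (0.1537, 0.1537),
    "Hindi": (0.2821, 0.2793),
}


def relative_change(baseline, system):
    """Percentage change of the system MAP relative to the baseline MAP."""
    return (system - baseline) / baseline * 100


changes = {lang: relative_change(base, ism) for lang, (base, ism) in MET_MAP.items()}

if __name__ == "__main__":
    for lang, pct in changes.items():
        print(f"{lang}: {pct:+.2f}%")
```

The sign pattern matches the results reported in the tables: gains for Bengali, Gujarati and Marathi, no change for Odia, and a small loss for Hindi.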


Reasons for Underperformance with Hindi

Overstemming
Undesired stemming of proper nouns


Overstemming

This refers to words that should not be grouped together by stemming, but are.

Example:

1. accent, accentual, accentuate
   Stem word: accent

2. accept, acceptant, acceptor
   Stem word: accept

3. access, accessible, accession
   Stem word: access

Due to overstemming, all of these words may be grouped under the wrong stem: acce
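The effect can be reproduced with a toy suffix stripper. Both suffix lists below are hypothetical and chosen only so that the first keeps the three groups apart while the second collapses them into "acce"; they are not the suffixes ISMstemmer actually learned.

```python
from collections import defaultdict

WORDS = ["accent", "accentual", "accentuate",
         "accept", "acceptant", "acceptor",
         "access", "accessible", "accession"]

# Hypothetical suffix lists, for illustration only.
CONSERVATIVE = {"ual", "uate", "ant", "or", "ible", "ion"}
AGGRESSIVE = {"nt", "ntual", "ntuate", "pt", "ptant", "ptor", "ss", "ssible", "ssion"}


def strip_longest_suffix(word, suffixes):
    """Remove the longest matching suffix, keeping at least one character."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[:len(word) - len(suffix)]
    return word


def group_by_stem(words, suffixes):
    """Map each produced stem to the list of words that were reduced to it."""
    groups = defaultdict(list)
    for word in words:
        groups[strip_longest_suffix(word, suffixes)].append(word)
    return dict(groups)
```

`group_by_stem(WORDS, CONSERVATIVE)` yields the three stems accent, accept and access, whereas `group_by_stem(WORDS, AGGRESSIVE)` puts all nine words under acce.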


Undesired stemming of proper nouns

Proper nouns should not be stemmed, as they are not inflected.

Example: Beijing

It will get stemmed to Beij.


Conclusion

ART:
  English: not satisfactory
  Hindi: poor
  Reasons:
    overstemming
    undesired stemming of proper nouns

MET:
  performed efficiently with Bengali, Gujarati and Marathi languages
  performed up to the mark with Odia
  underperformed with Hindi



THANK YOU!!