Marathi – Marathi Monolingual Information Retrieval Mr. Ashish Almeida Prof. Pushpak...
Marathi – Marathi Monolingual Information Retrieval
Mr. Ashish Almeida
Prof. Pushpak Bhattacharyya
Overview
• Morphological analyzer
• Suffix processing
• Stop-words
• Future work
Present work
• A search for “भारत” (bhaarat, Bharat) will not match pages that contain inflected forms such as
  – भारताचा (bharataachaa): of Bharat
  – भारतात (bharataat): in Bharat
• Lack of a large corpus
• Unavailability of tools
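The matching problem above can be reproduced with a toy inverted index over raw surface forms. This is only a sketch: the three short documents are invented for illustration, and tokenization is plain whitespace splitting.

```python
from collections import defaultdict

# Invented example documents (not from the corpus described in these slides).
docs = {
    "d1": "भारत जिंकला",     # contains the bare form
    "d2": "भारताचा विजय",    # contains "of Bharat"
    "d3": "भारतात पाऊस",     # contains "in Bharat"
}

# Index every whitespace token exactly as it appears, with no normalization.
index = defaultdict(set)
for doc_id, text in docs.items():
    for token in text.split():
        index[token].add(doc_id)

def search(term):
    """Return the ids of documents that contain exactly this token."""
    return sorted(index.get(term, set()))

print(search("भारत"))
```

Only the document with the bare form is returned; the two inflected forms are indexed as unrelated terms.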
Corpus Statistics - Marathi
• 99,275 documents (510 MB)
  – Maharashtra Times
  – Sakal News
• April 2004 to September 2007
• UTF-8 encoding
• XML tags
  – DOC: document
  – DOCNO: document identifier
  – TEXT: article
Document: example
<DOC>
<DOCNO>MaharashtraC06E811C6B.htm.txt</DOCNO>
<TEXT>
मोहफूल वेचण्यास गेलेल्या तरुणावर बिबट्याचा हल्ला (A leopard attacks a young man who had gone to collect Moha flowers)
इस्लापूर, ता. २२ - चारोळी आणि मोहफूल वेचण्यासाठी जंगलात गेलेल्या एका आदिवासी तरुणावर बिबट्याने अचानक हल्ला केल्याने तो तरुण गंभीर जखमी झाला आहे. ही घटना शुक्रवारी (ता. २०) मुळझरा (ता. किनवट) या गावाच्या जंगलात घडली. .......
इस्लापूर वन परिक्षेत्र कार्यालयाअंतर्गत येणाऱ्या मुळझरा येथील आदिवासी तरुण मनोहर . . .. . .
</TEXT>
</DOC>
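A record in this layout can be read with a few regular expressions. The sketch below assumes each record is wrapped in DOC, DOCNO and TEXT tags exactly as in the example; the function name `parse_docs` is illustrative and not part of any tool named in the slides.

```python
import re

DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
DOCNO_RE = re.compile(r"<DOCNO>(.*?)</DOCNO>", re.DOTALL)
TEXT_RE = re.compile(r"<TEXT>(.*?)</TEXT>", re.DOTALL)

def parse_docs(raw):
    """Yield (docno, text) pairs from a string holding one or more DOC records."""
    for match in DOC_RE.finditer(raw):
        body = match.group(1)
        docno = DOCNO_RE.search(body)
        text = TEXT_RE.search(body)
        # Skip records that lack either field instead of guessing.
        if docno and text:
            yield docno.group(1).strip(), text.group(1).strip()

sample = """<DOC>
<DOCNO>MaharashtraC06E811C6B.htm.txt</DOCNO>
<TEXT>
मोहफूल वेचण्यास गेलेल्या तरुणावर बिबट्याचा हल्ला
</TEXT>
</DOC>"""

records = list(parse_docs(sample))
print(records[0][0])
```

Regular expressions are used instead of an XML parser because collection files of this kind are often concatenated records without a single root element.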
Topics
• 100 topics
• Aligned with English topics
• XML tags
  – num: query identifier
  – title: title of the query
  – desc: description
  – narr: additional information about the query
• Cover a range of issues, both local and international
Topic example
<top>
<num>1
<title>ट्वेंटी - २० विश्वचषकातील भारताचे क्रीडापटुत्व (India’s championship in Twenty-20 World Cup)
<desc> पहिल्या आयसीसी विश्व ट्वेंटी - २० सर्वोत्कृष्ट-विजेता- स्पर्धेतील भारताच्या विजयाचे वृत्त देणारे लेख शोधा.</desc>
<narr>ट्वेंटी - २० विश्वचषक स्पर्धेमधील पाकिस्तान विरूद्ध भारताचा विजय, ह्या ऐतिहासिक विजया निमित्त खेळाडूंनी केलेले विक्रम त्यांनी मिळविलेली बक्षिसे व पुरस्काराची रक्कम सामनावीराचे तसेच मालिकावीराचे नाव, माजी खेळाडूंनी आणि जगभरातील लोकांनी केलेली प्रशंसा यासंदर्भात आम्ही उचित माहिती मिळवत आहोत.
</top>
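Because some tags in the topic example are never closed, a tolerant reader is safer than a strict XML parser. The sketch below takes each field to be the text between its opening tag and the next tag of any kind; the helper name `parse_topic` and the placeholder desc and narr text are assumptions for illustration.

```python
import re

# A field runs from its opening tag up to the next "<", closed or not.
FIELD_RE = re.compile(r"<(num|title|desc|narr)>([^<]*)")

def parse_topic(raw):
    """Return a dict mapping field name to stripped field text."""
    return {name: value.strip() for name, value in FIELD_RE.findall(raw)}

# Placeholder topic: the title is the English gloss from the slide,
# the desc and narr values are dummy text.
sample = (
    "<top><num>1<title>India's championship in Twenty-20 World Cup"
    "<desc> description text </desc><narr>narrative text\n</top>"
)

topic = parse_topic(sample)
print(topic["num"], topic["title"])
```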
Tools
• Terrier
  – Open source IR system
  – Models
    • TF-IDF (vector space model)
    • DFR-BM25 (probabilistic)
  – Both models are available in Terrier
• Evaluation against relevance-judged documents for 25 queries
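For orientation, the two model families can be sketched with their textbook term weights. These are not Terrier's formulas: Terrier ships its own TF_IDF and DFR_BM25 weighting models, whose exact definitions and default parameters may differ. The values `k1=1.2` and `b=0.75` are common defaults assumed here.

```python
import math

def tf_idf(tf, df, n_docs):
    """Textbook weight: term frequency times log inverse document frequency."""
    return tf * math.log(n_docs / df)

def bm25(tf, df, n_docs, doc_len, avg_len, k1=1.2, b=0.75):
    """Textbook Okapi BM25 weight of one term in one document."""
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
    # Length normalization: longer-than-average documents are penalized.
    norm = tf + k1 * (1.0 - b + b * doc_len / avg_len)
    return idf * tf * (k1 + 1.0) / norm

# A term occurring twice in an average-length document, found in 10 of 1000 documents.
print(tf_idf(2, 10, 1000), bm25(2, 10, 1000, 100, 100))
```

The practical difference is that the TF-IDF weight grows linearly with term frequency, while the BM25 weight saturates and also accounts for document length.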
Lemmatizer vs. stemmer
  – भारताला (bhaarataalaa): for Bharat
  – भारताचा (bhaarataachaa): of Bharat
  – भारतात (bhaarataat): in Bharat
  – भारतावर (bhaarataavar): on Bharat
• Lemmatizer finds the lemma
  – भारत
• Stemmer finds the stem: the longest unchangeable word prefix
  – भारता
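The difference can be checked on the four forms above. In this sketch the stem is computed as the longest prefix shared by all forms, following the slide's definition, and the lemma comes from a small hand-made table that stands in for a real lemmatizer (an assumption for illustration).

```python
import os

forms = ["भारताला", "भारताचा", "भारतात", "भारतावर"]

# Stem in the slide's sense: the longest prefix common to all inflected forms.
# os.path.commonprefix compares character by character, so it works on any strings.
stem = os.path.commonprefix(forms)

# Hypothetical lemma table standing in for a real lemmatizer.
LEMMA_TABLE = {form: "भारत" for form in forms}

def lemma(word):
    """Return the dictionary form if known, otherwise the word itself."""
    return LEMMA_TABLE.get(word, word)

print(stem, lemma("भारतात"))
```

The stem keeps the trailing vowel sign of the oblique form, so a query for the bare word still would not match it; the lemma does match.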
Marathi suffixes
• Suffixes include case markers, postposition markers, etc.
• Suffixes may be attached after another suffix
• Example:
  – घरासमोरचादेखिल
  – घरा-समोर-चा-देखिल
  – gharaa-samor-chaa-dekhil
  – house-front-of-also
  – Root word: घर (ghar) (house)
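Stacked suffixes can be peeled off from the right, one at a time. The sketch below uses only the three suffixes from the example and a one-entry table mapping the oblique form to its root; both are illustrative stand-ins, not the analyzer used in this work.

```python
SUFFIXES = ["देखिल", "समोर", "चा"]   # illustrative subset taken from the example
OBLIQUE_TO_ROOT = {"घरा": "घर"}       # hypothetical oblique-form lookup

def strip_suffixes(word):
    """Peel known suffixes off the right end until none matches.

    Returns (root, suffixes) with the suffixes in their original left-to-right order.
    """
    stripped = []
    changed = True
    while changed:
        changed = False
        for suffix in SUFFIXES:
            # Never strip a suffix that would consume the whole word.
            if word.endswith(suffix) and len(word) > len(suffix):
                word = word[: -len(suffix)]
                stripped.insert(0, suffix)
                changed = True
                break
    return OBLIQUE_TO_ROOT.get(word, word), stripped

root, suffixes = strip_suffixes("घरासमोरचादेखिल")
print(root, suffixes)
```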
Morphological analyzer
• Use of a Marathi morphological analyzer
  – Better matching of words
    • राम versus रामा
• Gives all possible roots
  – Selects the first root, which is the most frequent
• Used at both the indexing and the query processing end
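The key point is that the same normalization runs on documents and on queries. In the sketch below, the `ANALYSES` table is a hypothetical stand-in for analyzer output (candidate roots per surface form, most frequent first), and the documents are invented.

```python
from collections import defaultdict

# Hypothetical analyzer output; a real analyzer may return several candidates.
ANALYSES = {
    "रामा": ["राम"],
    "भारताचा": ["भारत"],
    "भारतात": ["भारत"],
}

def normalize(token):
    """Pick the first (most frequent) root, or keep the token if it is unknown."""
    roots = ANALYSES.get(token)
    return roots[0] if roots else token

def build_index(docs):
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[normalize(token)].add(doc_id)
    return index

def search(index, query):
    """Normalize query tokens the same way as document tokens, then look them up."""
    hits = set()
    for token in query.split():
        hits |= index.get(normalize(token), set())
    return sorted(hits)

docs = {"d1": "भारताचा विजय", "d2": "भारतात पाऊस", "d3": "रामा"}
index = build_index(docs)
print(search(index, "भारत"))
```

A query in either the bare or an inflected form now reaches the same index entry.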
Lemmatizer Results

| Run | MAP | R-precision | Precision at 5 | Precision at 10 | Recall |
|---|---|---|---|---|---|
| TF-IDF without lemmatizer | 0.3366 | 0.2944 | 0.3167 | 0.2583 | 0.8724 |
| TF-IDF + lemmatizer | 0.4003 | 0.3551 | 0.3417 | 0.2917 | 0.9686 |
| DFR-BM25 without lemmatizer | 0.3455 | 0.3209 | 0.3500 | 0.2667 | 0.8744 |
| DFR-BM25 + lemmatizer | 0.4140 | 0.3686 | 0.3833 | 0.3083 | 0.9619 |
| DFR-BM25 + lemmatizer (FIRE submission) | 0.3625 | 0.3797 | 0.4600 | 0.3960 | 0.9178 |
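For readers unfamiliar with the column headings, the measures can be sketched with their standard definitions. The slides do not say how the figures were computed (for example, the ranking depth used for recall), so this is the usual textbook form, not necessarily the exact evaluation procedure used here.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top k results that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k

def r_precision(ranked, relevant):
    """Precision at rank R, where R is the number of relevant documents."""
    return precision_at_k(ranked, relevant, len(relevant))

def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant document appears."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """MAP over a list of (ranked, relevant) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

def recall(ranked, relevant):
    """Fraction of the relevant documents that were retrieved at all."""
    return len(set(ranked) & relevant) / len(relevant)

ranked = ["a", "x", "b", "y"]
relevant = {"a", "b", "c"}
print(average_precision(ranked, relevant), r_precision(ranked, relevant))
```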
Suffixes
• Usually ignored
• Indexing suffixes: not studied
• Index selected suffixes
  – Suffixes of space and time
    • वर (var): on
    • समोर (samor): in front of
    • मध्ये (madhye): in
    • नंतर (nanter): after
• Created manually
  – List of 66 words
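One way to read "index selected suffixes" is that a token contributes its lemma plus any suffix found in the hand-made list. The sketch below assumes that reading; the `ANALYSES` table is a hypothetical stand-in for analyzer output, and only the four suffixes shown on the slide are listed out of the 66.

```python
# Four of the 66 manually selected suffixes (the only ones shown on the slide).
INDEXED_SUFFIXES = {"वर", "समोर", "मध्ये", "नंतर"}

# Hypothetical analyzer output: (lemma, suffixes) per surface form.
ANALYSES = {
    "घरासमोर": ("घर", ["समोर"]),
    "भारताचा": ("भारत", ["चा"]),
}

def index_terms(token):
    """Return the lemma followed by those suffixes that are selected for indexing."""
    lemma, suffixes = ANALYSES.get(token, (token, []))
    return [lemma] + [s for s in suffixes if s in INDEXED_SUFFIXES]

print(index_terms("घरासमोर"))
print(index_terms("भारताचा"))
```

The spatial suffix survives as its own index term, while the case marker is dropped as before.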
Stop-words
• Most frequently occurring words
• Little discriminatory value
• Occur in 80 % or more of the documents
• Selected stop-words
– त+, त�, या�, ू ने, अस, आह, या�, ह, कार, त
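The 80 % criterion is a document-frequency threshold. The sketch below applies it to a toy collection with placeholder Latin tokens; the function name `find_stop_words` is illustrative, and the slides do not say whether the final list was also curated by hand.

```python
from collections import Counter

def find_stop_words(docs, threshold=0.8):
    """Return the terms that occur in at least `threshold` of the documents."""
    df = Counter()
    for text in docs:
        # Count each term once per document (document frequency, not term frequency).
        df.update(set(text.split()))
    return {term for term, count in df.items() if count / len(docs) >= threshold}

# Placeholder collection: "a" is in 5 of 5 documents, "b" in 3, "c" in 2.
docs = ["a b c", "a b d", "a c e", "a b f", "a g h"]
print(find_stop_words(docs))
```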
Results: suffix indexing and stop-words

| Run | MAP | R-precision | Precision at 5 | Precision at 10 | Recall |
|---|---|---|---|---|---|
| DFR-BM25 + lemmatization + suffix indexing | 0.4381 | 0.3846 | 0.3917 | 0.3167 | 0.97085 |
| DFR-BM25 + lemmatization + suffix indexing + stop-words | 0.4433 | 0.3798 | 0.4000 | 0.3208 | 0.9731 |
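The size of each step is easier to judge as a relative change in MAP. The sketch below uses the DFR-BM25 MAP values reported in the two result tables of these slides; the run labels are shortened for readability.

```python
# MAP values for the DFR-BM25 runs, copied from the result tables.
MAP = {
    "baseline": 0.3455,
    "lemmatization": 0.4140,
    "lemmatization + suffixes": 0.4381,
    "lemmatization + suffixes + stop-words": 0.4433,
}

def relative_gain(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return 100.0 * (new - old) / old

for name, value in MAP.items():
    print(f"{name}: {relative_gain(value, MAP['baseline']):+.1f} % vs. baseline")
```

Most of the gain comes from lemmatization; suffix indexing and stop-word removal add smaller increments on top.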
P-R graph
• Precision-recall graph for all four cases is shown below
[Figure: precision (y-axis, 0 to 0.8) against recall % (x-axis, 0 to 100) for four runs: base-line; lemmatization; lemmatization and indexing suffixes; lemmatization, indexing suffixes and stopwords]
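A curve with recall ticks every 10 % is typically drawn from interpolated precision at 11 standard recall levels. The slides do not state how the graph was produced, so the sketch below shows the common method as an assumption, for a single query.

```python
def interpolated_precision(ranked, relevant, levels=None):
    """Interpolated precision at each recall level (11 standard levels by default).

    The interpolated value at a level is the highest precision observed
    at any recall greater than or equal to that level.
    """
    if levels is None:
        levels = [i / 10 for i in range(11)]
    points = []
    hits = 0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    return [
        max((p for r, p in points if r >= level), default=0.0)
        for level in levels
    ]

curve = interpolated_precision(["a", "x", "b", "y"], {"a", "b"})
print(curve)
```

Averaging such curves over all queries gives one line of the graph per run.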
Future work
• Morphological analyzer
  – Accuracy 94.5 %
    • Needs to be improved
• Heuristic suffix stripping for unknown words
• Handle derivational morphology
• Spelling variations, common spelling mistakes
Acknowledgement
• “Cross Lingual Information Access” Project
• Maharashtra Times: Times Media Group
  – http://in.indiatimes.com/aboutus.cms
• Sakal: Sakal Media Group
  – http://www.sakaal.in/
References
• http://ir.dcs.gla.ac.uk/terrier/
• Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Modern Information Retrieval
• Jacques Savoy, Searching strategies for the Bulgarian language
• Morphological Analyzer, CFILT
Thank you