Pre-Retrieval based Strategies for Cross Language News Story Search
Presented by:
Aarti Kumar (Research Scholar) & Sujoy Das (Associate Professor)
Department of Computer Applications
MANIT, Bhopal
CLINSS 2013
Task: to find cross-language news stories that share the same news event and the same focal event, for the English-Hindi language pair.
A set of 50691 potential source news stories S,
written in Hindi.
A set of 25 target news stories T, written in English.
Objective of Study
To test pre-retrieval strategies
To compare the dictionary based and machine translation based CLIR approaches
Approach
Pre-retrieval strategies:
Query formed using proper nouns
Query formed using words whose frequency is equal to or higher than the average word frequency
The queries are used to retrieve the Hindi news stories.
Translation strategy: dictionary based or machine translation based
Indexing and retrieval: Terrier 3.5 retrieval engine
Pre Retrieval Approach for CLINSS
[Figure: flow diagram. Boxes: English Documents, Preprocessing, Pre Retrieval Strategies (Proper Noun; Greater than equal to average frequency), Formulated Query, Dictionary based CLIR System, Machine Translation Based System, Retrieval Engine, Hindi Documents, Top 100 Retrieved Hindi Documents.]
Experiment
Preprocessing
Query Formulation:
Pre-Retrieval Strategies and Dictionary Based approach for MANIT-1-Run-1 and MANIT-1-Run-2
Pre-Retrieval Strategy and Machine Translation Based approach for MANIT-1-Run-3
Indexing and retrieval
Preprocessing
For all three runs, all the words of <title> and <content> were extracted from each English document.
Punctuation was removed at the time of tokenization. Stop-words, verbs and adverbs were removed from the <content> part only, using a list of 430 stop-words [5] and a list of 1514 verbs and adverbs [6], both compiled from the web.
Dates and numbers were also removed from both <title> and <content> during preprocessing, because trial runs showed that the query started drifting when they were kept.
No other preprocessing was done on the <title>; all its remaining words were taken as they are at the time of query formulation.
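The preprocessing steps above could be sketched roughly as follows. This is a minimal illustration, not the authors' code: the word lists are tiny stand-ins for the 430 stop-words and 1514 verbs/adverbs, the function names are hypothetical, and "dates and numbers" is approximated by dropping any token that contains a digit (so month names, for example, are not handled).

```python
import re

# Tiny stand-in lists (assumption); the slides use 430 stop-words
# and 1514 verbs and adverbs compiled from the web.
STOP_WORDS = {"the", "a", "an", "in", "of", "on", "and", "to", "was"}
VERBS_ADVERBS = {"said", "visited", "announced", "quickly"}

# Keeping only alphanumeric runs drops punctuation during tokenization.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+")


def tokenize(text):
    """Split text into tokens, discarding punctuation."""
    return TOKEN_RE.findall(text)


def has_digit(token):
    """Crude test for numbers and numeric dates (assumption)."""
    return any(ch.isdigit() for ch in token)


def preprocess_title(title):
    """Title: remove only punctuation, dates and numbers."""
    return [t for t in tokenize(title) if not has_digit(t)]


def preprocess_content(content):
    """Content: additionally remove stop-words, verbs and adverbs."""
    kept = []
    for token in tokenize(content):
        if has_digit(token):
            continue
        lowered = token.lower()
        if lowered in STOP_WORDS or lowered in VERBS_ADVERBS:
            continue
        kept.append(token)
    return kept
```

Original casing is kept in the output because the proper noun strategy relies on capital letters.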
Query Formulation
Pre-Retrieval Strategies and Dictionary Based approach for MANIT-1-Run-1 and MANIT-1-Run-2
MANIT-1-Run-1
In this run only proper nouns are extracted from the <content> of the English news story.
The grammar rule that a proper noun begins with a capital letter is used to identify proper nouns, instead of a part-of-speech tagger.
Proper nouns are chosen for formulating the queries because they stay the same when text is translated, and in news stories they name the important entities.
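A minimal sketch of the capital-letter rule might look like this. The function name is hypothetical, and the slides do not say how sentence-initial words (which are also capitalized) were treated, so this sketch simply keeps them.

```python
def extract_proper_nouns(tokens):
    """Keep tokens that begin with a capital letter.

    Assumption: no part-of-speech tagging and no special handling of
    sentence-initial words, which this rule also lets through.
    """
    return [token for token in tokens if token[:1].isupper()]
```

For example, `extract_proper_nouns(["Mayawati", "plan", "Bhopal", "rally"])` keeps only `Mayawati` and `Bhopal`.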
MANIT-1-Run-2
In this run only those words whose frequency is greater than or equal to the average word frequency of the <content> are selected at the time of query formulation.
The reasoning is that, among the words that appear at least an average number of times, some must be important for catching the linked documents.
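The frequency threshold can be illustrated with a short sketch. The slides do not define "average word frequency" precisely; the version below assumes it is the total number of tokens divided by the number of distinct words, and the function name is hypothetical.

```python
from collections import Counter


def select_frequent_words(tokens):
    """Return the distinct words whose count is >= the average count.

    Assumption: average = total tokens / number of distinct words.
    """
    counts = Counter(tokens)
    if not counts:
        return []
    average = sum(counts.values()) / len(counts)
    # Counter keeps first-seen order, so the result follows document order.
    return [word for word, count in counts.items() if count >= average]
```

With the tokens `election, election, election, party, party, rally` the counts are 3, 2 and 1, the average is 2, and the selected words are `election` and `party`.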
Query Formulation continued…
In both of these runs:
Porter Stemmer [10] is used for stemming.
A dictionary based approach is used for translating the query into Hindi.
The Shabdanjali dictionary [9] is used for translating English tokens to Hindi, and only the first Hindi translation of each word is considered.
Words that did not have a Hindi equivalent in the Shabdanjali dictionary were transliterated using a transliterator developed by us.
The translated queries are submitted to the Terrier retrieval engine [11] and the top 100 documents are retrieved.
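The "first translation, else transliterate" rule can be sketched as below. Everything here is illustrative: `TOY_DICTIONARY` and `toy_transliterate` are hypothetical stand-ins for the Shabdanjali dictionary and the authors' transliterator, stemming is left out because the slides do not say where it sits relative to the lookup, and the sketch returns a single transliteration per word rather than several candidates.

```python
# Hypothetical stand-in for an English-to-Hindi dictionary: each English
# word maps to a list of Hindi translations in dictionary order.
TOY_DICTIONARY = {
    "water": ["पानी", "जल"],
    "house": ["घर", "मकान"],
}

# Hypothetical stand-in for a transliterator: a fixed lookup table.
TOY_TRANSLITERATIONS = {"Kamal": "कमल"}


def toy_transliterate(word):
    """Return a Hindi spelling if known, else the word unchanged."""
    return TOY_TRANSLITERATIONS.get(word, word)


def translate_query(tokens, dictionary, transliterate):
    """Translate each token with its first dictionary entry.

    Tokens missing from the dictionary fall back to transliteration.
    """
    hindi_terms = []
    for token in tokens:
        translations = dictionary.get(token.lower())
        if translations:
            hindi_terms.append(translations[0])  # first translation only
        else:
            hindi_terms.append(transliterate(token))
    return hindi_terms
```

Taking only the first dictionary entry keeps the query short, at the cost of missing the intended sense whenever it is not listed first.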
MANIT-1-Run-3
It is the same as MANIT-1-Run-1, but a machine translation based approach is used for translating the query words.
The freely available online Google Translate [7] is used to translate/transliterate English query words to Hindi.
For those words which Google Translate [7] failed to transliterate, the online Changathi Hindi transliterator [8] was used.
The process was carried out manually. The purpose of this manual intervention was to get the correct Hindi words and then compare the results thus obtained with our fully automated approaches used for MANIT-1-Run-1 and MANIT-1-Run-2.
Problems with transliteration: a few examples
Interpretation of the letter ‘a’ in Hindi
Kamal कमल
Mayawati मायावती
Mulayam मुलायम
Akriti आकृति (not transliterated by Google)
Akhilesh अखिलेश
Interpretation of the bigram ‘an’ in Hindi
Anubha अनुभा (not transliterated by Google)
Anshu अंशु
Anand आनंद
Kanak कनक
Janki जानकी
Pranav प्रणव
‘Banka’ was transliterated as ‘बाँका’ but ‘BANKA’ was not transliterated by Google.
Our transliterator gave 1-8 combinations for such words.
Indexing and retrieval
Indexing of the Hindi documents and retrieval of linked Hindi news stories for each English document were done with Terrier 3.5 [11], using the TF-IDF ranking model.
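To make the ranking step concrete, here is a textbook TF-IDF ranking sketch. It is not Terrier's implementation and does not reproduce Terrier's exact term weighting; the function name, the raw term-frequency weighting and the plain logarithmic IDF are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def rank_tfidf(query_terms, documents, top_k=100):
    """Rank documents by a simple sum of tf * idf over the query terms.

    documents: dict mapping document id -> list of tokens.
    Returns up to top_k document ids, best first. Documents that share
    no term with the query are left out.
    """
    n_docs = len(documents)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for tokens in documents.values():
        doc_freq.update(set(tokens))

    scores = {}
    for doc_id, tokens in documents.items():
        term_freq = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term_freq[term]:
                score += term_freq[term] * math.log(n_docs / doc_freq[term])
        if score > 0:
            scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

The default `top_k=100` mirrors the top 100 documents retrieved per query in the runs described here.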
Result
MANIT-1-Run-1 gives performance of 0.6, 0.545 and 0.5388 for NDCG@1, NDCG@5 and NDCG@10 respectively.
MANIT-1-Run-2 gives performance of 0.56, 0.4521 and 0.4828 for NDCG@1, NDCG@5 and NDCG@10 respectively.
MANIT-1-Run-3 gives performance of 0.5, 0.4803 and 0.4867 for NDCG@1, NDCG@5 and NDCG@10 respectively.
The proper noun based pre-retrieval strategy combined with the dictionary based CLIR approach (MANIT-1-Run-1) scored highest at all three cut-offs. At NDCG@5 and NDCG@10 the Google Translate based approach (MANIT-1-Run-3) came second; at NDCG@1, MANIT-1-Run-2 came second.
Comparative performance
Run            NDCG@1   NDCG@5   NDCG@10
run-1-manit1   0.6      0.545    0.5388
run-2-manit1   0.56     0.4521   0.4828
run-3-manit1   0.5      0.4803   0.4867
Table 1. Comparative performance of the three runs
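For readers unfamiliar with the metric, NDCG@k can be computed along the following lines. The slides do not state which gain and discount the track used, so this sketch assumes one common formulation (gain 2^rel - 1, log2 rank discount) with graded relevance scores such as 1 and 2; the function names are hypothetical.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k relevance grades."""
    return sum(
        (2 ** rel - 1) / math.log2(rank + 2)
        for rank, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(ranked_relevances, all_relevances, k):
    """NDCG@k for one query.

    ranked_relevances: grades of the retrieved documents, in rank order
    (0 for a non-relevant document).
    all_relevances: grades of every relevant document for the query,
    used to build the ideal ranking.
    """
    ideal = sorted(all_relevances, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_k(ranked_relevances, k) / ideal_dcg
```

A run's reported score would then be the mean of `ndcg_at_k` over the English query documents.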
Analysis 1
Out of 140 relevant documents   Run-1 (Proper D)   Run-2 (GT)   Run-3 (Proper Google)
Found as 1st                    16                 15           14
Found among top 5               38                 31           44
Found among top 10              48                 46           63
Found among top 100             95                 75           120
Not found in the top 100        45                 65           20
Analysis 1 continued…
[Figure: bar chart of the Analysis 1 counts (found as 1st, among top 5, among top 10, among top 100, not found in the top 100) for Run-1 (Proper), Run-2 (GT) and Run-3 (Proper Google).]
Analysis 2
Out of the 8 documents with score 2, i.e. documents with "same news event + same focal event", the number of documents retrieved by the different query strategies is:
                  Run-1 (Proper)   Run-2 (GT)   Run-3 (Proper Google)
As 1st document   5                4            5
in top 5          7                5            6
in top 10         8                7            6
MANIT-1-Run-1 performed the best here. This might be the reason for the lower NDCG performance of the queries formed using Google Translate.
Analysis 2 continued…
[Figure: bar chart of the Analysis 2 counts (as 1st document, in top 5, in top 10) for the three runs.]
Analysis 3
English-Hindi Relevant Document Pair    Linked Hindi Documents
english-document-00002.txt 0 hindi-document-00416.txt 1 For 2 and 23
english-document-00016.txt 0 hindi-document-48171.txt 1 For 16 and 9
english-document-00016.txt 0 hindi-document-29606.txt 1 For 16 and 23
english-document-00016.txt 0 hindi-document-32003.txt 1 For 16 and 13
english-document-00005.txt 0 hindi-document-00414.txt 1 For 5 and 11
english-document-00019.txt 0 hindi-document-10863.txt 1 For 19 and 21
english-document-00019.txt 0 hindi-document-19273.txt 1 For 19 and 25
english-document-00019.txt 0 hindi-document-19272.txt 1 For 19 and 25
english-document-00001.txt 0 hindi-document-16606.txt 1 For 1 and 4
english-document-00001.txt 0 hindi-document-39272.txt 1 For 1 and 4
english-document-00001.txt 0 hindi-document-17481.txt 1 For 1 and 2
english-document-00001.txt 0 hindi-document-08897.txt 1 For 1, 21 and 24
english-document-00001.txt 0 hindi-document-19255.txt 1 For 1 and 8
english-document-00001.txt 0 hindi-document-46293.txt 1 For 1 and 4
english-document-00001.txt 0 hindi-document-08773.txt 1 For 1 and 4
english-document-00017.txt 0 hindi-document-14001.txt 2 For 17, 9 and 12
english-document-00004.txt 0 hindi-document-20282.txt 2 For 4, 10 and 21
english-document-00004.txt 0 hindi-document-37101.txt 1 For 4 and 21
Conclusion
The dictionary based approach combined with the proper noun based pre-retrieval strategy performed better than the other two runs at all three cut-offs (NDCG@1, NDCG@5 and NDCG@10).
MANIT-1-Run-3, which aimed at getting the right translation and transliteration for the given query words, did not perform well at NDCG@1.
In this study some pre-retrieval strategies to retrieve a subset of source Hindi documents from a large corpus have been studied. Post-processing techniques to link the exact news stories will be studied in future.
Acknowledgement
We are thankful to the Terrier group for providing us with the Terrier Retrieval Engine to carry out our research work.
One of the presenters, Aarti Kumar, is thankful to Maulana Azad National Institute of Technology, Bhopal for providing her with financial support to pursue her doctoral work as a full-time research scholar.
References
[1] Paul D. Clough, Department of Computer Science, University of Sheffield, England: Measuring Text Reuse in Journalistic Domain
[2] Parth Gupta, Paul Clough, Paolo Rosso, Mark Stevenson: PAN@FIRE: Overview of the Cross-Language !ndian News Story Search (CL!NSS) Track. In: Forum for Information Retrieval Evaluation, ISI, Kolkata, India (2012)
[3] Yurii Palkovskii, Alexei Belov: Using TF-IDF Weight Ranking Model in CLINSS as Effective Similarity Measure to Identify Cases of Journalistic Text Re-use. In: Overview paper CLINSS 2012, Forum for Information Retrieval Evaluation, ISI, Kolkata, India (2012)
[4] Nitish Aggarwal, Kartik Asooja, Paul Buitelaar, Tamara Polajnar, Jorge Gracia: Cross-Lingual Linking of News Stories using ESA. In: Overview paper CLINSS 2012, Forum for Information Retrieval Evaluation, ISI, Kolkata, India (2012)
[5] List of stop-words, available on http://www.ranks.nl/resources/stopwords.html, http://norm.al/2009/04/14/list-of-english-stop-words/, http://www.webconfs.com/stop-words.php, http://jmlr.org/papers/volume5/lewis04a/a11-smart-stop-list/english.stop
[6] List of verbs and adverbs, available on http://www.englishclub.com/vocabulary/regular-verbs-list.htm, http://www.momswhothink.com/reading/list-of-verbs.html, http://www.linguanaut.com/verbs.htm, http://www.acme2k.co.uk/acme/3star%20verbs.htm, http://www.enchantedlearning.com/wordlist/verbs.shtml, http://www.enchantedlearning.com/wordlist/adverbs.shtml
[7] Google Translate, available on http://translate.google.com/?prev=hp&hl=en&text=&sl=en&tl=hi#en/hi/-
[8] Changathi Transliterator, available on http://hindi.changathi.com/
[9] Shabdanjali, available on http://ltrc.iiit.ac.in/onlineServices/Dictionaries/Dict_Frame.html
[10] Porter stemmer, available on http://ir.dcs.gla.ac.uk/resources/linguistic_utils/porter.java
[11] Terrier 3.5, available on http://terrier.org/download/