Post on 14-Jan-2016
Using TF-IDF Weight Ranking Model in CLINSS as Effective Similarity
Measure to Identify Cases of Journalistic Text Re-use
Palkovskii Y., Belov A.
Zhytomyr State University, Institute of Foreign Philology
In affiliation with SkyLine LLC [Plagiarism Prevention Solutions]
Zhytomyr, Ukraine
Who we are \ what we do
Small, devoted group of students \ professors at ZSU.
Focused on Plagiarism Detection \ Cross-Language PD.
We develop a core text comparison engine for a number of PD-related commercial products for SkyLine LLC:
We like to participate in Plagiarism Detection competitions (especially in hot countries) and are proud to have taken part in: PAN09 Spain, PAN10 Italy, PAN11 Amsterdam, PAN12 Italy, CL!TR11 India Mumbai, IIT
In affiliation with Zhytomyr State Uni and SkyLine LLC
© Palkovskii, Belov et al. 2012
TF-IDF Weight Ranking Model as news similarity measure
Plagiarism Detector Accumulator Server [PDAS]
Plagiarism Detector Client [PDC]
CL!NSS proposed task
What are we looking for? The “same news event” within a pair of documents
Pair-wise document comparison
Reasonable processing time
Resolution issues for focal news events are not a requirement, at least at this point
Focus on the final result and a “starting point” prototype
How does it work?
Language normalization via Google Translate
Text preprocessing that included removal of the most frequent words (harvested beforehand from both corpora and sorted by frequency)
Running a comparison of each document against the test corpus and saving the retrieved data for further analysis
Each cached result for every pair is scored by a predefined filter set.
The top 100 list is formed by ascending score value.
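The steps above can be illustrated with a minimal sketch. It is not the authors' C# engine: it assumes the documents are already machine-translated into one language, uses cosine similarity over TF-IDF vectors as the pair score (the slides do not name the exact scoring filters), and ranks the most similar candidates first. All function names and parameters here are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercase whitespace tokenization; a real system would use a proper tokenizer.
    return text.lower().split()


def frequent_words(docs, n):
    # Harvest the n most frequent words across the given documents.
    counts = Counter(tok for doc in docs for tok in tokenize(doc))
    return {word for word, _ in counts.most_common(n)}


def tfidf_vectors(docs, stopwords):
    # Build one sparse TF-IDF vector (a dict) per document.
    tokenized = [[t for t in tokenize(d) if t not in stopwords] for d in docs]
    df = Counter(t for toks in tokenized for t in set(toks))
    n_docs = len(docs)
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append(
            {t: (c / len(toks)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return vectors


def cosine(a, b):
    # Cosine similarity of two sparse vectors (assumed pair score).
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_candidates(query, candidates, top_k=100, n_stop=0):
    # Score every candidate against the query and keep the top_k best.
    docs = [query] + candidates
    stop = frequent_words(docs, n_stop) if n_stop else set()
    vecs = tfidf_vectors(docs, stop)
    scored = [(cosine(vecs[0], v), i) for i, v in enumerate(vecs[1:])]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(i, score) for score, i in scored[:top_k]]
```

Calling `rank_candidates(query_text, candidate_texts, top_k=100)` returns `(candidate_index, score)` pairs. Computing IDF over the query plus its candidates is a simplification; a full run would compute it once over the whole corpus.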
Our evaluation methods
[Image via Google Images: news set about the “Curiosity” landing on Mars]
[Image via Google Images: latest Bollywood newsfeeds]
In detail
Inserting manually crafted news pairs into both corpora and evaluating their final ranking positions
Different degrees of news story uniqueness, ranging from news about the Curiosity landing on Mars to the latest Bollywood film news (i.e. matching the context, character and exact vocabulary of the training set)
10 news items planted; 9 out of 10 fell into the “top 10” ranking, thus confirming the initial hypothesis
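A planted-pair check of this kind could be scored as in the sketch below. It assumes each planted source document has exactly one expected target, and that “top 10” means the position of that target in the source's ranked list. The function name and the data layout are hypothetical.

```python
def planted_hit_rate(rankings, planted, k=10):
    """Return (hit rate, positions) for planted pairs.

    rankings: dict mapping source id -> list of target ids, best first.
    planted:  dict mapping source id -> the expected target id.
    """
    hits = 0
    positions = {}
    for src, expected in planted.items():
        ranked = rankings.get(src, [])
        # 1-based position of the planted target, or None if it was not retrieved.
        pos = ranked.index(expected) + 1 if expected in ranked else None
        positions[src] = pos
        if pos is not None and pos <= k:
            hits += 1
    return hits / len(planted), positions
```

With ten planted pairs of which nine land within the first ten positions, the returned hit rate would be 0.9, matching the figure reported on the slide.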
Detailed document comparison
PAN 2012 prototype, the “iGTC” project: based on an n-gram matching principle with 3 levels of graphically based clusterization, already tuned by a GA for last year's FIRE\PAN to tackle medium-to-high degrees of obfuscation as well as translated and simulated plagiarism
We did not use it. The main reason: to retain a purely statistical approach based on TF-IDF values
CL!NSS Results Achieved: Hindi\English
Rank  Run                             NDCG@1  NDCG@5  NDCG@10
1     run-1-english-hindi-palkovskii  0.3229  0.3259  0.3380
2     run-2-english-hindi-deriupm     0.2100  0.2136  0.2613
3     run-1-english-hindi-deriupm     0.1900  0.2110  0.2168
4     run-1-english-hindi-iiith       0.1939  0.1994  0.2154
5     run-3-english-hindi-deriupm     0.1500  0.1886  0.2030
6     run-3-english-hindi-iiith       0.1837  0.1557  0.1722
7     run-2-english-hindi-iiith       0.0204  0.0462  0.0512
CL!NSS Results Achieved: Gujarati\English
Rank  Run                             NDCG@1  NDCG@5  NDCG@10
1     run-1-english-hindi-palkovskii  0.0541  0.0843  0.0955
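For readers unfamiliar with the NDCG@k columns in the tables above, here is a minimal sketch of one common formulation: gain `rel` at 1-based rank `i` is discounted by `log2(i + 1)`. Whether the CL!NSS evaluation used this exact variant is an assumption, and the function names are hypothetical.

```python
import math


def dcg_at_k(relevances, k):
    # Discounted cumulative gain over the first k positions (index 0 = rank 1).
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k, pool_relevances=None):
    # ranked_relevances: graded relevance of each returned document, in rank order.
    # pool_relevances: relevance grades of all judged documents, if known;
    # falls back to the returned list itself (a simplification).
    pool = ranked_relevances if pool_relevances is None else pool_relevances
    ideal_dcg = dcg_at_k(sorted(pool, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_k(ranked_relevances, k) / ideal_dcg
```

A per-run score such as those in the tables would then be the mean of `ndcg_at_k` over all query documents.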
Ideas to consider:
Different efficiency for different source types and news types\structure\origin [according to Parth Gupta's analysis of CL!NSS]
MT substitute for Gujarati
PAN2011\CLEF CLPD Baseline
Manual: 0.37 P-det, R: .69, P: .26, G: 1
Automatic: 0.92 P-det, R: .97, P: .88, G: 1
Comparison problem: NDCG* metrics vs P-det (any ideas?)
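Assuming “P-det” refers to the PAN plagdet score, i.e. the F1 measure of precision and recall divided by log2(1 + granularity), the baseline figures above can be roughly reproduced with the sketch below. The function name is hypothetical.

```python
import math


def plagdet(precision, recall, granularity):
    # F1 of precision and recall, penalized by detection granularity (>= 1).
    if precision + recall == 0:
        return 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return f1 / math.log2(1 + granularity)
```

With the manual baseline values (P = .26, R = .69, G = 1) this gives about 0.378, and with the automatic ones (P = .88, R = .97, G = 1) about 0.923, in line with the 0.37 and 0.92 quoted on the slide. Since plagdet scores passage-level detections while NDCG scores a ranked list, the two are not directly comparable, which is the problem the slide raises.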
Hardware\Runtime
Moderately computationally intensive
Single Intel 6-core 990 ex
6 GB RAM (RAM-intensive usage)
Single SSD drive
Total runtime of 12 hours for the test corpus (excluding the PAN2012 comparer filter)
Software used
Microsoft Windows 7
Microsoft Visual Studio 2010\C#
What we missed
Exhaustive tuning of meta-parameters
NER
A hybrid approach that uses the PAN2012 text comparer prototype as an additional scoring mechanism (left out because of runtime limitations and the idea to stick to a single methodology)
Post-analysis of successful and failed detections, including results visualization in the hope of further insights
Our competitive colleagues from Austria, Romania, Chile, etc.!
Layered analysis of each influencing scoring factor [ref. to PAN 2011\2012 analysis]
Things we’re happy to discuss
Results evaluation
Achieved baseline in comparison to PAN results
The corpus size
Automatic evaluation platform for result processing and evaluation
Perspectives of machine learning
Hybrid approaches
Baseline comparison with other related tracks
Letters are powered by people:
I would like to thank those people – thank you for your assistance and help:
And an additional “thank you” for getting as far as Kolkata!
• Mandar Mitra
• Parth Gupta
• Anwar Shaikh