Approaches for Intrinsic and External Plagiarism Detection fileWho we are Gabriel Oberreuter - 2011...

26
Approaches for Intrinsic and External Plagiarism Detection Notebook for PAN at CLEF 2011 Gabriel Oberreuter Gallardo [email protected] University of Chile - September 2011 Group members : Gastón L’Huillier, Sebastián Ríos and Juan D. Velásquez

Transcript of Approaches for Intrinsic and External Plagiarism Detection fileWho we are Gabriel Oberreuter - 2011...

Page 1: Approaches for Intrinsic and External Plagiarism Detection fileWho we are Gabriel Oberreuter - 2011 3 DOCODE Engine Integration of document copy detection algorithms, collecting and

Approaches for Intrinsic and External Plagiarism Detection

Notebook for PAN at CLEF 2011

Gabriel Oberreuter [email protected]

University of Chile - September 2011

Group members : Gastón L’Huillier, Sebastián Ríos and Juan D. Velásquez

Page 2: Approaches for Intrinsic and External Plagiarism Detection fileWho we are Gabriel Oberreuter - 2011 3 DOCODE Engine Integration of document copy detection algorithms, collecting and

Who are we?

Group of students, professionals and professors from University of Chile

We work on studying plagiarism in academia

www.docode.cl

Gabriel Oberreuter - 2011 2

Page 3: Approaches for Intrinsic and External Plagiarism Detection fileWho we are Gabriel Oberreuter - 2011 3 DOCODE Engine Integration of document copy detection algorithms, collecting and

Who we are

Gabriel Oberreuter - 2011 3

DOCODE EngineIntegration of document copy detection algorithms, collecting and indexing documents,queuing large scale copy detection petitions.

DOCODE Impact AnalysisSurveys and interviews in schools, social analysis of copy and paste phenomenon,evaluation of the impact of such tools in education.

Application Service DOCODE Web application for DOCODE, design of document copy detection reporting tools, software engineering, features and requirement analysis.

Docode Engine

Docode ASP

Docode Impact

Page 4: Approaches for Intrinsic and External Plagiarism Detection fileWho we are Gabriel Oberreuter - 2011 3 DOCODE Engine Integration of document copy detection algorithms, collecting and

Index

1. Where we were @2010

2. External Plagiarism Detection

3. Intrinsic Plagiarism Detection

4. Results @PAN2011

5. Conclusions

Gabriel Oberreuter - 2011 4

Page 5: Approaches for Intrinsic and External Plagiarism Detection fileWho we are Gabriel Oberreuter - 2011 3 DOCODE Engine Integration of document copy detection algorithms, collecting and

Where we were @2010

Gabriel Oberreuter - 2011 5

Page 6: Approaches for Intrinsic and External Plagiarism Detection fileWho we are Gabriel Oberreuter - 2011 3 DOCODE Engine Integration of document copy detection algorithms, collecting and

Where we were @2010

Focused on external

plagiarism detection.

No cross-lingual

consideration.

Based on word bi-grams and

word tri-grams.

Selecting samples for each

document in order to reduce

compute time.

Gabriel Oberreuter - 2011 6

Page 7: Approaches for Intrinsic and External Plagiarism Detection fileWho we are Gabriel Oberreuter - 2011 3 DOCODE Engine Integration of document copy detection algorithms, collecting and

Where we were @2010

Gabriel Oberreuter - 2011 7

External Comparison between 2 documents

Page 8: Approaches for Intrinsic and External Plagiarism Detection fileWho we are Gabriel Oberreuter - 2011 3 DOCODE Engine Integration of document copy detection algorithms, collecting and

Where we were @2010

Gabriel Oberreuter - 2011 8

External Comparison between 2 documents

Page 9: Approaches for Intrinsic and External Plagiarism Detection fileWho we are Gabriel Oberreuter - 2011 3 DOCODE Engine Integration of document copy detection algorithms, collecting and

Where we were @2010

Gabriel Oberreuter - 2011 9

Monolingual

Based on word bi-grams and word tri-grams.

No intrinsic detection

Acceptable precision,

detecting half of cases and

good granularity.

Comparison computed on two eight-core Servers, each with 6

GB of RAM.

Java Implementation.

Reducing Search Space: ~20 Hours.

Finding Plagiarized Passages: ~12 Hours.

Page 10: Approaches for Intrinsic and External Plagiarism Detection fileWho we are Gabriel Oberreuter - 2011 3 DOCODE Engine Integration of document copy detection algorithms, collecting and

External Plagiarism

Detection

Gabriel Oberreuter - 2011 10

Page 11: Approaches for Intrinsic and External Plagiarism Detection fileWho we are Gabriel Oberreuter - 2011 3 DOCODE Engine Integration of document copy detection algorithms, collecting and

External Plagiarism Detection

Gabriel Oberreuter - 2011 11

From 2010 experience, we decided to focus on:

Better precision

Better recall

Reduce processing time

External detector @2011

Uses word 4-grams, removing SW for search space reduction

Uses word 3-grams for exhaustive search

Page 12: Approaches for Intrinsic and External Plagiarism Detection fileWho we are Gabriel Oberreuter - 2011 3 DOCODE Engine Integration of document copy detection algorithms, collecting and

External Plagiarism Detection

Gabriel Oberreuter - 2011 12

Some results on 2010 PAN Corpus (includes intrinsic and external plagiarism):

Dual core notebook with 4GB RAM.

Java Implementation.

Reducing Search Space :~2 Hours. ( 0.3% promising doc pairs)

Exhaustive Search :~1 Hour.

AlgorithmVersion

Overall Recall Precision Granularity

2010 0.61 0.48 0.85 1.001

2011 0.73 0.6 0.94 1.001

Page 13: Approaches for Intrinsic and External Plagiarism Detection fileWho we are Gabriel Oberreuter - 2011 3 DOCODE Engine Integration of document copy detection algorithms, collecting and

Intrinsic Plagiarism

Detection

Gabriel Oberreuter - 2011 13

Page 14: Approaches for Intrinsic and External Plagiarism Detection fileWho we are Gabriel Oberreuter - 2011 3 DOCODE Engine Integration of document copy detection algorithms, collecting and

Intrinsic Plagiarism Detection

Gabriel Oberreuter - 2011 14

Given a document, determine whether

one of its paragraphs belong to the

average writing style

Page 15: Approaches for Intrinsic and External Plagiarism Detection fileWho we are Gabriel Oberreuter - 2011 3 DOCODE Engine Integration of document copy detection algorithms, collecting and

Intrinsic Plagiarism Detection

Following Stamatatos’ (2009) approach:

Divide the document in partitions

Compare each partition’s writing style characterization

against the whole document’s style

If a partition deviates from the mean value past some

threshold, flag it

Gabriel Oberreuter - 2011 15

Page 16: Approaches for Intrinsic and External Plagiarism Detection fileWho we are Gabriel Oberreuter - 2011 3 DOCODE Engine Integration of document copy detection algorithms, collecting and

Intrinsic Plagiarism Detection

If some of the words used on the document are

author-specific, one can think that those words

could be concentrated on the paragraphs (or more

general, on the segments) that the mentioned

author wrote

Gabriel Oberreuter - 2011 16

On the characterization of writing style…

Page 17: Approaches for Intrinsic and External Plagiarism Detection fileWho we are Gabriel Oberreuter - 2011 3 DOCODE Engine Integration of document copy detection algorithms, collecting and

Intrinsic Plagiarism Detection

Fundamentals:

Divide document in partitions of equal length

Word Frequencies

No stopword removal

Only chars from a-z

Gabriel Oberreuter - 2011 17

Page 18: Approaches for Intrinsic and External Plagiarism Detection fileWho we are Gabriel Oberreuter - 2011 3 DOCODE Engine Integration of document copy detection algorithms, collecting and

Intrinsic Plagiarism Detection

Gabriel Oberreuter - 2011 18

Comparing the partitions against the whole document…

Page 19: Approaches for Intrinsic and External Plagiarism Detection fileWho we are Gabriel Oberreuter - 2011 3 DOCODE Engine Integration of document copy detection algorithms, collecting and

Intrinsic Plagiarism Detection

Gabriel Oberreuter - 2011 19

Example 1: Document written by single author

Page 20: Approaches for Intrinsic and External Plagiarism Detection fileWho we are Gabriel Oberreuter - 2011 3 DOCODE Engine Integration of document copy detection algorithms, collecting and

Intrinsic Plagiarism Detection

Gabriel Oberreuter - 2011 20

Example 2: Document written by multiple authors

Page 21: Approaches for Intrinsic and External Plagiarism Detection fileWho we are Gabriel Oberreuter - 2011 3 DOCODE Engine Integration of document copy detection algorithms, collecting and

Intrinsic Plagiarism Detection

Gabriel Oberreuter - 2011 21

Some results on 2009 PAN Intrinsic Corpus:

Dual core notebook with 4GB RAM.

Java Implementation.

Run under 10 minutes (~6.000 documents).

Overall Recall Precision Granularity

Stamatatos 0.25 0.46 0.23 1.38

Oberreuter 0.34 0.31 0.39 1.01

Page 22: Approaches for Intrinsic and External Plagiarism Detection fileWho we are Gabriel Oberreuter - 2011 3 DOCODE Engine Integration of document copy detection algorithms, collecting and

Results @PAN2011

Gabriel Oberreuter - 2011 22

Page 23: Approaches for Intrinsic and External Plagiarism Detection fileWho we are Gabriel Oberreuter - 2011 3 DOCODE Engine Integration of document copy detection algorithms, collecting and

Results @PAN2011

External Plagiarism Detection Performance

Ranked third after Grman&Ravas (0.56) and Grozea&Popescu (0.42)

Overall good precision, but low recall for obfuscated plagiarism and

simulated plagiarism

Gabriel Oberreuter - 2011 23

Page 24: Approaches for Intrinsic and External Plagiarism Detection fileWho we are Gabriel Oberreuter - 2011 3 DOCODE Engine Integration of document copy detection algorithms, collecting and

Results @PAN2011

Intrinsic Plagiarism Detection Performance

Ranked first with a good recall-precision balance

Overall score of 0.32, with better results with medium- and long-length

documents

Gabriel Oberreuter - 2011 24

Page 25: Approaches for Intrinsic and External Plagiarism Detection fileWho we are Gabriel Oberreuter - 2011 3 DOCODE Engine Integration of document copy detection algorithms, collecting and

Conclusions

Word tri-grams and word 4-grams can be used effectively as

tokens for external plagiarism detection

The effectiveness of the approach is strongly correlated to the

ability to detect those dense coincidence zones

When no sources are available, the use of words appear to be a

good starting point to model the writing style present in

documents

Best result in self-information task, but the scores are overall still

too low

Gabriel Oberreuter - 2011 25

Page 26: Approaches for Intrinsic and External Plagiarism Detection fileWho we are Gabriel Oberreuter - 2011 3 DOCODE Engine Integration of document copy detection algorithms, collecting and

Approaches for Intrinsic and External Plagiarism Detection

Notebook for PAN at CLEF 2011

Gabriel Oberreuter [email protected]

University of Chile - September 2011

Group members : Gastón L’Huillier, Sebastián Ríos and Juan D. Velásquez