A Pipeline for Scalable Text Milad Alshomary Reuse Analysis · A Pipeline for Scalable Text Reuse...
Transcript of A Pipeline for Scalable Text Milad Alshomary Reuse Analysis · A Pipeline for Scalable Text Reuse...
05.07.2018Pipeline for TR extractionMilad Alshomary
A Pipeline for Scalable Text Reuse Analysis
Milad Alshomary
05.07.2018
Bauhaus Universität
1
05.07.2018Pipeline for TR extractionMilad Alshomary
Overview
2
● Motivation
● A Pipeline for Scalable Text Reuse Extraction
● Application on Wikipedia
● Application on Wikipedia and Common Crawl
● Conclusion
05.07.2018Pipeline for TR extractionMilad Alshomary
Text Reuse (TR)Motivation
3
● Quoting
● Verbatim
● Paraphrasing
● Translation
● Summarization
05.07.2018Pipeline for TR extractionMilad Alshomary
TR Detection ApplicationsMotivation
4
METER project(Measuring Text Reuse)
Plagiarism detection
05.07.2018Pipeline for TR extractionMilad Alshomary
Motivation
Plagiarism detection
5
METER projet(Measuring Text Reuse)
TR Detection Applications
05.07.2018Pipeline for TR extractionMilad Alshomary
Motivation
6
Plagiarism detectionMETER projet
(Measuring Text Reuse)
TR Detection Applications
05.07.2018Pipeline for TR extractionMilad Alshomary
Wikipedia vs The WorldMotivation
● Digital Encyclopedia
● Collaborative environment
● Giant public source of
information
● Free to use
7
05.07.2018Pipeline for TR extractionMilad Alshomary
Wikipedia vs The WorldMotivation
● Digital Encyclopedia
● Collaborative environment
● Giant public source of
information
● Free to use
8
05.07.2018Pipeline for TR extractionMilad Alshomary
Wikipedia vs The WorldMotivation
● Digital Encyclopedia
● Collaborative environment
● Giant public source of
information
● Free to use
9
05.07.2018Pipeline for TR extractionMilad Alshomary
Wikipedia vs The WorldMotivation
● Digital Encyclopedia
● Collaborative environment
● Giant public source of
information
● Free to use
10
05.07.2018Pipeline for TR extractionMilad Alshomary
Motivation
Wikipedia vs The World
Quality Flaws
11
Scientific community
05.07.2018Pipeline for TR extractionMilad Alshomary
Motivation
Wikipedia vs The World
- Web pages = Wikipedia text + advertisements
12
05.07.2018Pipeline for TR extractionMilad Alshomary
Motivation
Research Questions
➔ What kinds of text reuse occur within Wikipedia?
➔ How much of the web is a copy of Wikipedia content?
➔ How much revenue does this content generate?
13
05.07.2018Pipeline for TR extractionMilad Alshomary
Motivation
14
➔ What kinds of text reuse occur within Wikipedia?
➔ How much of the web is a copy of Wikipedia content?
➔ How much revenue does this content generate?
Research Questions
05.07.2018Pipeline for TR extractionMilad Alshomary
Motivation
15
➔ What kinds of text reuse occur within Wikipedia?
➔ How much of the web is a copy of Wikipedia content?
➔ How much revenue does this content generate?
Research Questions
05.07.2018Pipeline for TR extractionMilad Alshomary
A Pipeline for Scalable Text Reuse Extraction
16
05.07.2018Pipeline for TR extractionMilad Alshomary
Text Reuse Pipeline
TR PipelineD1
D2
➔ Input: Two datasets
➔ Output: Text reuse
cases
17
A Pipeline for Scalable Text Reuse Extraction
05.07.2018Pipeline for TR extractionMilad Alshomary
Text Reuse Pipeline
TR PipelineD1
D2
➔ Input: Two datasets
➔ Output: Text reuse
cases
18
A Pipeline for Scalable Text Reuse Extraction
05.07.2018Pipeline for TR extractionMilad Alshomary
Text Reuse Pipeline
Text Preprocessing
Candidate Elimination
Text Alignment
19
➔ Content extraction
➔ Chunking
➔ Feature extraction
A Pipeline for Scalable Text Reuse Extraction
05.07.2018Pipeline for TR extractionMilad Alshomary
Text Reuse Pipeline
Text Preprocessing
Candidate Elimination
Text Alignment
20
➔ Content extraction
➔ Chunking
➔ Feature extraction
➔ Pairwise scan
➔ Text Reuse heuristics
A Pipeline for Scalable Text Reuse Extraction
05.07.2018Pipeline for TR extractionMilad Alshomary
Text Reuse Pipeline
Text Preprocessing
Candidate Elimination
Text Alignment
➔ Content extraction
➔ Chunking
➔ Feature extraction
➔ Pairwise scan
➔ Text Reuse heuristics
➔ Detailed scan of text
reuse
➔ Picapica framework
21
A Pipeline for Scalable Text Reuse Extraction
05.07.2018Pipeline for TR extractionMilad Alshomary
Candidate Elimination
Text Preprocessing
Candidate Elimination
Text Alignment
Keys for scaling-up:
➔ Cluster computing
➔ Heuristics based candidate elimination
algorithms
22
A Pipeline for Scalable Text Reuse Extraction
05.07.2018Pipeline for TR extractionMilad Alshomary
Candidate Elimination
Text Preprocessing
Candidate Elimination
Text Alignment
Keys for scaling-up:
➔ Cluster computing
➔ Heuristics based candidate elimination
algorithms
23
A Pipeline for Scalable Text Reuse Extraction
05.07.2018Pipeline for TR extractionMilad Alshomary
Candidate Elimination
For a candidacy function we proposed
the following methods:
- Cosine similarity of TF-IDF (semantic)
- Paragraph embedding (semantic)
- Stopwords N-grams (structure)
- Weighted average of Stopwords Ngrams and
Paragraph embedding (semantic + structure)
24
d2n
D1 D2
candidacy(d11, d21) → [0, 1]
d11 d21
d22d12
d1n
A Pipeline for Scalable Text Reuse Extraction
05.07.2018Pipeline for TR extractionMilad Alshomary
Candidate Elimination
Wikipedia Document Sample
Text alignment using picapica framework
TR sample
Sample 1k documents
Generate TR Sample from Wikipedia:
- Sample 1k documents from
Wikipedia
- Using Picapica framework to find
TR cases
25
A Pipeline for Scalable Text Reuse Extraction
05.07.2018Pipeline for TR extractionMilad Alshomary
Candidate Elimination
Wikipedia Document Sample
Text alignment using picapica framework
TR sample
Sample 1k documents
Generate TR Sample from Wikipedia:
- Sample 1k documents from
Wikipedia
- Using Picapica framework to find
TR cases
26
A Pipeline for Scalable Text Reuse Extraction
05.07.2018Pipeline for TR extractionMilad Alshomary
Candidate Elimination
Wikipedia Document Sample
Text alignment using picapica framework
TR sample
Sample 1k documents
Generate TR Sample from Wikipedia:
- Sample 1k documents from
Wikipedia
- Using Picapica framework to find
TR cases
27
- 232 documents
- ~ 90% have < 10 alignements (TR case)
A Pipeline for Scalable Text Reuse Extraction
05.07.2018Pipeline for TR extractionMilad Alshomary
Candidate Elimination
TR sampleEvaluation of “candidacy” function:
- For each document in TR sample:
- Sort all Wikipedia articles
according to the proposed
“candidacy” .
- Precision/Recall on
Thresholds of [1, 101,..,100k]
- A True Positive (TP) is a pair of
documents that have TR.
28
T1 T2 T3
A Pipeline for Scalable Text Reuse Extraction
05.07.2018Pipeline for TR extractionMilad Alshomary
Candidate Elimination
TR sampleEvaluation of “candidacy” function:
29
T1 T2 T3
r1r2
p1
p2
A Pipeline for Scalable Text Reuse Extraction
05.07.2018Pipeline for TR extractionMilad Alshomary
Candidate Elimination
Semantic hashing function:
- Hashes documents into binary
hashes.
- Similar documents get similar or
exact binary hash.
30
011001 011001
A Pipeline for Scalable Text Reuse Extraction
05.07.2018Pipeline for TR extractionMilad Alshomary
Candidate Elimination
Semantic hashing function:
- Hashing all documents.
- Inverted index.
- Hash document’s chunks.
- Apply candidacy function only on
documents that intersect in one
hash at least.001001
011001
001000
Inverted index
011001
011001
D1D2
31
A Pipeline for Scalable Text Reuse Extraction
05.07.2018Pipeline for TR extractionMilad Alshomary
Candidate Elimination
001001
011001
001000
Inverted index
011001
011001
D1D2
32
Semantic hashing function:
- Hashing all documents.
- Inverted index.
- Hash document’s chunks.
- Apply candidacy function only on
documents that intersect in one
hash at least.
A Pipeline for Scalable Text Reuse Extraction
05.07.2018Pipeline for TR extractionMilad Alshomary
Candidate Elimination
001001
011001
001000
Inverted index
011001
011001
D1D2
33
Semantic hashing function:
- Hashing all documents.
- Inverted index.
- Hash document’s chunks.
- Apply candidacy function only on
documents that intersect in one
hash at least.
A Pipeline for Scalable Text Reuse Extraction
05.07.2018Pipeline for TR extractionMilad Alshomary
Candidate Elimination
001001
011001
001000
Inverted index
011001
011001
D1D2
34
Semantic hashing function:
- Hashing all documents.
- Inverted index.
- Hash document’s chunks.
- Apply candidacy function only on
documents that intersect in one
hash at least.
A Pipeline for Scalable Text Reuse Extraction
05.07.2018Pipeline for TR extractionMilad Alshomary
Candidate Elimination
Proposed semantic hashing methods:
- Random Projection (data
independent)
- Variational Deep Semantic
Hashing (data dependent)
35
di
dj
A Pipeline for Scalable Text Reuse Extraction
05.07.2018Pipeline for TR extractionMilad Alshomary
Candidate Elimination
Proposed semantic hashing methods:
- Random Projection (data
independent)
- Variational Deep Semantic
Hashing (data dependent)
36
di001
100
dj
001
A Pipeline for Scalable Text Reuse Extraction
05.07.2018Pipeline for TR extractionMilad Alshomary
Candidate Elimination
Proposed semantic hashing methods:
- Random Projection (data
independent)
- Variational Deep Semantic
Hashing (data dependent)
37
Learning
VDSH
A Pipeline for Scalable Text Reuse Extraction
05.07.2018Pipeline for TR extractionMilad Alshomary
Candidate Elimination
Transform
011001
38
Learning
VDSH VDSH
Proposed semantic hashing methods:
- Random Projection (data
independent)
- Variational Deep Semantic
Hashing (data dependent)
A Pipeline for Scalable Text Reuse Extraction
05.07.2018Pipeline for TR extractionMilad Alshomary
Candidate Elimination
39
Hashing methods evaluation:
- Using same TR sample for
evaluation.
- Hashing all documents using the
proposed hashing function.
- Compute precision and recall.
TR sample
A Pipeline for Scalable Text Reuse Extraction
05.07.2018Pipeline for TR extractionMilad Alshomary
Candidate Elimination
40
Hashing methods evaluation:
- Using same TR sample for
evaluation.
- Hashing all documents using the
proposed hashing function.
- Compute precision and recall.
TR sample
101
001
111
101 101 101
110000
100
A Pipeline for Scalable Text Reuse Extraction
05.07.2018Pipeline for TR extractionMilad Alshomary
Candidate Elimination
41
Hashing methods evaluation:
- Using same TR sample for
evaluation.
- Hashing all documents using the
proposed hashing function.
- Compute precision and recall.
TR sample
101
001
111
101 101 101
110000
100
Precision = 2/3 Recall = 1.0
A Pipeline for Scalable Text Reuse Extraction
05.07.2018Pipeline for TR extractionMilad Alshomary
Candidate Elimination
Random projection
bits precision recall
…. 8 3.1 x 10-4 0.8741
…. 16 9.9 x 10-4 0.324
VDSH bits precision recall
…. 8 2.8 x 10-4 0.88
…. 16 4.5 x 10-3 0.73
42
Hashing methods evaluation
- Using same TR sample for
evaluation.
- Hashing all documents using the
proposed hashing function.
- Compute precision and recall.
A Pipeline for Scalable Text Reuse Extraction
05.07.2018Pipeline for TR extractionMilad Alshomary
Candidate Elimination
VDSH bits precision recall
8 2.8 x 10-4 0.88
16 4.5 x 10-3 0.73
● Retains 73% of the recall● By experiment:
○ Reduces the computations needed by 3 order of magnitude
43
Hashing methods evaluation
- Using same TR sample for
evaluation.
- Hashing all documents using the
proposed hashing function.
- Compute precision and recall.
A Pipeline for Scalable Text Reuse Extraction
05.07.2018Pipeline for TR extractionMilad Alshomary
Application on Wikipedia
44
05.07.2018Pipeline for TR extractionMilad Alshomary
Text Reuse In WikipediaApplication on Wikipedia
➔ What kinds of text reuse occur within Wikipedia?
➔ How much of the web is a copy of Wikipedia content?
➔ How much revenue does this content generate?
45
05.07.2018Pipeline for TR extractionMilad Alshomary
Text Reuse In WikipediaApplication on Wikipedia
100 million text reuse
TR Pipeline
Wikipedia
Wikipedia
46
Wikipedia Articles
360k Wikipedia Article
05.07.2018Pipeline for TR extractionMilad Alshomary
What kinds of text reuse occur in
Wikipedia?
- Reasons behind text reuse:
(1) Two texts describe the same
topic.
(2) Two texts describe two
different topics, that share similar
characteristics
47
Text Reuse In WikipediaApplication on Wikipedia
05.07.2018Pipeline for TR extractionMilad Alshomary
What kinds of text reuse occur in
Wikipedia?
- Reasons behind text reuse:
(1) Two texts describe the same
topic.
Text Reuse
Structure Text Reuse
Content Text Reuse
48
Text Reuse In WikipediaApplication on Wikipedia
05.07.2018Pipeline for TR extractionMilad Alshomary
What kinds of text reuse occur in
Wikipedia?
- Reasons behind text reuse:
- Tow texts describing same
topic.
Text Reuse
Structure Text Reuse
Content Text Reuse
49
Text Reuse In WikipediaApplication on Wikipedia
05.07.2018Pipeline for TR extractionMilad Alshomary
What kinds of text reuse occur in
Wikipedia?
- Reasons behind text reuse:
(2) Two texts describe two
different topics, that share similar
characteristics
Text Reuse
Structure Text Reuse
Content Text Reuse
50
Text Reuse In WikipediaApplication on Wikipedia
05.07.2018Pipeline for TR extractionMilad Alshomary
What kinds of text reuse occur in
Wikipedia?
- Reasons behind text reuse:
(2) Two texts describe two
different topics, that share similar
characteristics
Text Reuse
Structure Text Reuse
Content Text Reuse
51
Text Reuse In WikipediaApplication on Wikipedia
05.07.2018Pipeline for TR extractionMilad Alshomary 52
Text Reuse In WikipediaApplication on Wikipedia
- Vertical alignment → Content TR
- Horizontal alignment → Structure TR
Vertical relation Horizontal relation
05.07.2018Pipeline for TR extractionMilad Alshomary 53
Text Reuse In WikipediaApplication on Wikipedia
- Vertical alignment → Content TR
- Horizontal alignment → Structure TR
05.07.2018Pipeline for TR extractionMilad Alshomary 54
Text Reuse In WikipediaApplication on Wikipedia
- Vertical alignment → Content TR
- Horizontal alignment → Structure TR
05.07.2018Pipeline for TR extractionMilad Alshomary
Application on Wikipedia and Common Crawl
55
05.07.2018Pipeline for TR extractionMilad Alshomary
Wikipedia vs The WebApplication on Wikipedia and Common Crawl
56
➔ What kinds of text reuse occur within Wikipedia?
➔ How much of the web is a copy of Wikipedia content?
➔ How much revenue does this content generate?
05.07.2018Pipeline for TR extractionMilad Alshomary
Wikipedia vs The Web
WWW
- Crawling
Extracted web content
- Content extraction
- Keeping only english
pages
Web Sample
10% random
sample
57
Application on Wikipedia and Common Crawl
05.07.2018Pipeline for TR extractionMilad Alshomary
Wikipedia vs The Web
WWW
- Crawling
Extracted web content
- Content extraction
- Keeping only english
pages
Web Sample
10% random
sample
- 59 million web pages.
- 1.4 million websites.
- 70% of these websites
contains less than 10 web
pages
Number of web pages
Num
ber
of w
ebsi
tes
58
Application on Wikipedia and Common Crawl
05.07.2018Pipeline for TR extractionMilad Alshomary
Wikipedia vs The Web
TR Pipeline
Web Sample
Wikipedia
- 1.6 million text reuse cases.
- 15k pages reuse Wikipedia text.
- 4.8k websites reuse Wikipedia text.
59
Application on Wikipedia and Common Crawl
05.07.2018Pipeline for TR extractionMilad Alshomary
Wikipedia vs The Web
Monthly revenue estimation:
- Rough estimate of Ads revenue
- Based on CPM (Cost Per Millie)
- Sampled 100 webpages and
manually checked the existence of
Advertisements.
60
Application on Wikipedia and Common Crawl
05.07.2018Pipeline for TR extractionMilad Alshomary
Wikipedia vs The Web
Revenue estimation:
- Per website (all websites)
- Per website (highly reusing)
- Per Wikipedia web pagewebsite Monthly
revenue Percentage of reuse Monthly Wikipedia value
pdxretro.com $195 0.012 $2.5
seqrchquarry.com $8,850 0.096 $850
asiatees.com $36,000 0.017 $613
…. ….. ….. ….
Total $1.2 million
61
Application on Wikipedia and Common Crawl
05.07.2018Pipeline for TR extractionMilad Alshomary
Wikipedia vs The Web
Revenue estimation:
- Per website (all websites)
- Per website (highly reusing)
- Per Wikipedia web pagewebsite Monthly
revenue Percentage of reuse Monthly Wikipedia value
pdxretro.com $195 0.012 $2.5
seqrchquarry.com $8,850 0.096 $850
asiatees.com $36,000 0.017 $613
…. ….. ….. ….
Total $1.2 million
62
Application on Wikipedia and Common Crawl
05.07.2018Pipeline for TR extractionMilad Alshomary
Wikipedia vs The Web
Revenue estimation:
- Per website (all websites)
- Per website (highly reusing)
- Per Wikipedia web pagewebsite Monthly
revenue Percentage of reuse Monthly Wikipedia value
pdxretro.com $195 0.012 $2.5
seqrchquarry.com $8,850 0.096 $850
asiatees.com $36,000 0.017 $613
…. ….. ….. ….
Total $1.2 million
63
Application on Wikipedia and Common Crawl
05.07.2018Pipeline for TR extractionMilad Alshomary
Wikipedia vs The Web
Revenue estimation:
- Per website (all websites)
- Per website (highly reusing)
- Per Wikipedia web pagewebsite Monthly
revenue Percentage of reuse Monthly Wikipedia value
pdxretro.com $195 0.012 $2.5
seqrchquarry.com $8,850 0.096 $850
asiatees.com $36,000 0.017 $613
…. ….. ….. ….
Total $1.2 million
64
Application on Wikipedia and Common Crawl
05.07.2018Pipeline for TR extractionMilad Alshomary
Wikipedia vs The Web
Revenue estimation:
- Per website (all websites)
- Per website (highly reusing)
- Per Wikipedia web pagewebsite Monthly
revenue Percentage of reuse Monthly Wikipedia value
pdxretro.com $195 0.012 $2.5
seqrchquarry.com $8,850 0.096 $850
asiatees.com $36,000 0.017 $613
…. ….. ….. ….
Total $1.2 million
The rough estimate of monthly revenue of Wikipedia content
65
Application on Wikipedia and Common Crawl
05.07.2018Pipeline for TR extractionMilad Alshomary
Wikipedia vs The Web
Revenue estimation:
- Per website (all websites)
- Per website (highly reusing)
- Per Wikipedia web page
- Percentage of pages reusing Wikipedia >= 0.5
- 87 websites.
- Estimated monthly revenue: $15k
66
Application on Wikipedia and Common Crawl
05.07.2018Pipeline for TR extractionMilad Alshomary
Wikipedia vs The Web
Extracted from Wikipedia API
67
Revenue estimation:
- Per website (all websites)
- Per website (highly reusing)
- Per Wikipedia web page Reused Wikipedia page
Average page views Average CPM Average monthly
revenue
Nuclear renaissance 645 $2.8 $1.806
Second Chechen War 34655 $2.8 $97
Enumerated powers 12858 $2.8 $36
…. ….. ….. ….
Total $900k
Application on Wikipedia and Common Crawl
05.07.2018Pipeline for TR extractionMilad Alshomary
Wikipedia vs The Web
Estimated from marketing reports
68
Revenue estimation:
- Per website (all websites)
- Per website (highly reusing)
- Per Wikipedia web page Reused Wikipedia page
Average page views Average CPM Average monthly
revenue
Nuclear renaissance 645 $2.8 $1.806
Second Chechen War 34655 $2.8 $97
Enumerated powers 12858 $2.8 $36
…. ….. ….. ….
Total $900k
Application on Wikipedia and Common Crawl
05.07.2018Pipeline for TR extractionMilad Alshomary
Wikipedia vs The Web
69
Revenue estimation:
- Per website (all websites)
- Per website (highly reusing)
- Per Wikipedia web page Reused Wikipedia page
Average page views Average CPM Average monthly
revenue
Nuclear renaissance 645 $2.8 $1.806
Second Chechen War 34655 $2.8 $97
Enumerated powers 12858 $2.8 $36
…. ….. ….. ….
Total $900k
Application on Wikipedia and Common Crawl
05.07.2018Pipeline for TR extractionMilad Alshomary
Wikipedia vs The Web
Reused Wikipedia page
Average page views Average CPM Average monthly
revenue
Nuclear renaissance 645 $2.8 $1.806
Second Chechen War 34655 $2.8 $97
Enumerated powers 12858 $2.8 $36
…. ….. ….. ….
Total $900k
70
Revenue estimation:
- Per website (all websites)
- Per website (highly reusing)
- Per Wikipedia web page
Application on Wikipedia and Common Crawl
05.07.2018Pipeline for TR extractionMilad Alshomary
Wikipedia vs The Web
71
Monthly revenue:
Application on Wikipedia and Common Crawl
Per Web sample Number of reusing web pages Revenue(per webpage)
59 million 15k $900k
590 million 150k $9 million
05.07.2018Pipeline for TR extractionMilad Alshomary
Wikipedia vs The Web
Per Web sample Number of reusing web pages Revenue(per webpage)
59 million 15k $900k
590 million 150k $9 million
72
Monthly revenue:
Application on Wikipedia and Common Crawl
05.07.2018Pipeline for TR extractionMilad Alshomary
Conclusion
73
05.07.2018Pipeline for TR extractionMilad Alshomary
Summaryconclusion
74
● Pipeline for TR Extraction
● Text Reuse in Wikipedia
● Text Reuse between
Wikipedia and the Web
Text Preprocessing
Candidate Elimination Text Alignment
Text Reuse
Structure Text Reuse
Content Text Reuse
Per website (all websites)
Per website (highly reuse)
Per Webpage
$1.2 million $15k $900k
05.07.2018Pipeline for TR extractionMilad Alshomary
Future Workconclusion
TR Pipeline
Wikipedia
?
75
● Using the pipeline to extract and analyze
TR between Wikipedia and the scientific
community.
● Experiments on the Text Alignment
subtask.
● Further analysis of the extracted Text
Reuse cases.
● More accurate estimation on the
monthly revenue generated by
Wikipedia content.
05.07.2018Pipeline for TR extractionMilad Alshomary
ConclusionFuture Work
TR Pipeline
Wikipedia
?● Using the pipeline to extract and analyze
TR between Wikipedia and the scientific
community.
● Experiments on the Text Alignment
subtask.
● Further analysis of the extracted Text
Reuse cases.
● More accurate estimation on the
monthly revenue generated by
Wikipedia content.
Text Preprocessing
Candidate Elimination Text Alignment
76
05.07.2018Pipeline for TR extractionMilad Alshomary
ConclusionFuture Work
TR Pipeline
Wikipedia
?
Text Preprocessing
Candidate Elimination Text Alignment
77
● Using the pipeline to extract and analyze
TR between Wikipedia and the scientific
community.
● Experiments on the Text Alignment
subtask.
● Further analysis of the extracted Text
Reuse cases.
● More accurate estimation on the
monthly revenue generated by
Wikipedia content.
05.07.2018Pipeline for TR extractionMilad Alshomary
ConclusionFuture Work
TR Pipeline
Wikipedia
?
Text Preprocessing
Candidate Elimination Text Alignment
78
● Using the pipeline to extract and analyze
TR between Wikipedia and the scientific
community.
● Experiments on the Text Alignment
subtask.
● Further analysis of the extracted Text
Reuse cases.
● More accurate estimation on the
monthly revenue generated by
Wikipedia content.
05.07.2018Pipeline for TR extractionMilad Alshomary
Backup Slides
79
05.07.2018Pipeline for TR extractionMilad Alshomary 80
- Candidate Elimination functions:
05.07.2018Pipeline for TR extractionMilad Alshomary 81
- Stopwords N-grams procedure:
Wiki paragraphs stopwords stopword
ngrams
Extract stop words generate n-grams filtered stopword ngrams
Top 50 frequent stopwords:
the, of, and, a, in, to,is, was, it, for, with, he,
be, on, i, that, by, at, you, 's, are, not,his,
this, from, but, had, which, she, they, or, an,
were, we, their, been, has, have, will, would,
her, there, can, all,as, if, who, what, said
filter n-grams
- Let C = {the, of, and, a, in, to, ’s} stopwords that increases false positive.
- X is accepted n-gram if:- It doesn’t contain more than n-1
stopwords from C- The maximal sequence of stopwords
belonging to C is less than n-2
binary count vector
- Binary count vector ignores the frequency in which a specific n-gram happened in a paragraph.
- We apply the scoring function on the binary count vector
05.07.2018Pipeline for TR extractionMilad Alshomary 82
- VDSH explained:
VDSH USAGE
05.07.2018Pipeline for TR extractionMilad Alshomary 83
- Performance of candidacy functions on different thresholds:
Documents from sample who have number of aligned docs <= 10
Documents from sample who have number of aligned docs > 10
Thresholds between (1 to 1000 and step of 5)
RECALLRECALL
Prec
isio
n
Prec
isio
n
05.07.2018Pipeline for TR extractionMilad Alshomary 84
- Performance of candidacy functions on different thresholds:
Documents from sample who have number of aligned docs <= 10
Documents from sample who have number of aligned docs > 10
Thresholds between (1 to 1000 and step of 5)
RECALLRECALL
Prec
isio
n
Prec
isio
n
05.07.2018Pipeline for TR extractionMilad Alshomary 85
- Candidate Elimination procedure over the cluster:
05.07.2018Pipeline for TR extractionMilad Alshomary 86
- Hash based Candidate Elimination procedure over the cluster:
05.07.2018Pipeline for TR extractionMilad Alshomary 87
- Hash based Candidate Elimination procedure over the cluster:
05.07.2018Pipeline for TR extractionMilad Alshomary 88
- Heuristics:
- H1: ne_sim ∈ (0.5, 1.0] AND 10grams_sim > 0.5 AND (s_percent_reused < 0.5 or
t_percent_reused < 0.5) => content reuse otherwise structure reuse
- 6700 content reuse cases only
- Validation on two random samples of size 100:
Structure reuse Content reuse
Sample1 100% 58%
Sample2 (Text1 or Text2 > 200)
100% 73%