Crowdsourcing Interaction Logs to Understand Text …...Crowdsourcing Interaction Logs to Understand...
Transcript of Crowdsourcing Interaction Logs to Understand Text …...Crowdsourcing Interaction Logs to Understand...
Crowdsourcing Interaction Logs to UnderstandText Reuse from the Web
Martin Potthast Matthias Hagen Michael Völske Benno Stein
Bauhaus-Universität Weimarwww.webis.de
Outline · Introduction
· The Webis-TRC-12 Dataset
· Categorizing Crowdsourced Text Reuse
· Search Missions For Source Retrieval
· Summary
IntroductionModeling Text Reuse From the Web
Search I‘m Feeling Lucky
Search Copy & Paste Modification
3 [∧] © Völske 2013
IntroductionPrevious Text Reuse Corpora
Source Selection InsertionAutomaticExcerpt
AutomaticModification
q PAN-PC-09/10/11: >25.000 documents; >60.000 plagiarism cases each
q Automatically generated artificial plagiarism
q Automatic modifications do not preserve semantics
4 [∧] © Völske 2013
The Webis-TRC-12 Dataset
5 [∧] © Völske 2013
The Webis-TRC-12 DatasetConstruction Overview
ClueWeb API
Revision log ClueWebQuery log
Editor
Topic
ChatNoir SE
Author
6 [∧] © Völske 2013
The Webis-TRC-12 DatasetConstruction Overview: Topics
ClueWeb API
Revision log ClueWebQuery log
Editor
Topic
ChatNoir SE
Author
7 [∧] © Völske 2013
The Webis-TRC-12 DatasetConstruction Overview: Topics
Example topic:
Obama’s family.
Write about President Barack Obama’s family history, including genealogy, nationalorigins, places and dates of birth, etc. Where did Barack Obama’s parents andgrandparents come from? Also include a brief biography of Obama’s mother.
q Based on TREC Web Track topics 2009–2011 [details]
q 150 topics, 297 essays
q Target essay length: 5000 words
8 [∧] © Völske 2013
The Webis-TRC-12 DatasetConstruction Overview: Authors
ClueWeb API
Revision log ClueWebQuery log
Editor
Topic
ChatNoir SE
Author
9 [∧] © Völske 2013
The Webis-TRC-12 DatasetConstruction Overview: Authors
Num
ber
of E
ssay
s
Author ID
q Crowdsourcing: 27 total
q Professional writers hired on oDesk + volunteers
q Fluent English speakers
10 [∧] © Völske 2013
The Webis-TRC-12 DatasetConstruction Overview: Authors
Num
ber
of E
ssay
s
Author ID
Author Demographics (n=12)Age (Median) 37Years Writing (Median) 8Academic degreePostgrad 41%Undergrad 25%None 17%n/a 17%EnglishNative 67%Second Language 33%
q Crowdsourcing: 27 total
q Professional writers hired on oDesk + volunteers
q Fluent English speakers
11 [∧] © Völske 2013
The Webis-TRC-12 DatasetConstruction Overview: Sources
ClueWeb API
Revision log ClueWebQuery log
Editor
Topic
ChatNoir SE
Author
12 [∧] © Völske 2013
The Webis-TRC-12 DatasetConstruction Overview: Sources
q ClueWeb09: 500 million English pages
q Representative sample of the web
q Commonly used in search engine evaluation (TREC)
13 [∧] © Völske 2013
The Webis-TRC-12 DatasetConstruction Overview: Search Engine
ClueWeb API
Revision log ClueWebQuery log
Editor
Topic
ChatNoir SE
Author
14 [∧] © Völske 2013
The Webis-TRC-12 DatasetConstruction Overview: Search Engine
q Used for source retrieval [chatnoir.webis.de]
q Indexes ClueWeb
q Records fine-grained interaction log [example]
15 [∧] © Völske 2013
The Webis-TRC-12 DatasetConstruction Overview: Editor
ClueWeb API
Revision log ClueWebQuery log
Editor
Topic
ChatNoir SE
Author
16 [∧] © Völske 2013
The Webis-TRC-12 DatasetConstruction Overview: Editor
17 [∧] © Völske 2013
The Webis-TRC-12 DatasetConstruction Overview: Editor
q Custom web-based rich text editor
q Records sources of re-used text passages
q New revision every 300ms of inactivity
q Detailed revision history [example]
18 [∧] © Völske 2013
The Webis-TRC-12 DatasetThree Main Data Sources
ClueWeb API
Revision log ClueWebQuery log
Editor
Topic
ChatNoir SE
Author
19 [∧] © Völske 2013
The Webis-TRC-12 DatasetResearch Questions
20 [∧] © Völske 2013
The Webis-TRC-12 DatasetResearch Questions
1. Different text reuse approaches distinguishable?
2. Relation to existing text reuse categorizations?
3. Influence of text reuse task on search engine interaction?
We expect new research insights and impacts to
q text reuse detectionq query formulationq paraphrasing
21 [∧] © Völske 2013
Categorizing Crowdsourced Text Reuse
22 [∧] © Völske 2013
Categorizing Crowdsourced Text ReuseBuild-Up Versus Boil-Down Reuse
Build-up reuse (left) versus boil-down reuse (right).
q text length (y-axis) over text revision (x-axis)q colors: different source documents (original text is white)q blue dots: position of the writer’s last editq Build-up: 45%; boil-down: 40%; mixed: 12%
23 [∧] © Völske 2013
Categorizing Crowdsourced Text ReuseBuild-Up Versus Boil-Down Reuse
Build-up reuse (left) versus boil-down reuse (right).
q text length (y-axis) over text revision (x-axis)q colors: different source documents (original text is white)q blue dots: position of the writer’s last editq Build-up: 45%; boil-down: 40%; mixed: 12%
24 [∧] © Völske 2013
Categorizing Crowdsourced Text ReuseBuild-Up Versus Boil-Down Reuse
Text
leng
th
Edit number
Text
leng
th
Edit number
Text
leng
th
Edit number
Author 6 (12 topics) Author 20 (9 topics) Author 21 (21 topics)
Build-up reuse: Averaged editing histories by authors.
q one author per plotq gray lines: individual essaysq black line: average
25 [∧] © Völske 2013
Categorizing Crowdsourced Text ReuseBuild-Up Versus Boil-Down Reuse
Text
leng
th
Edit number
Text
leng
th
Edit number
Text
leng
th
Edit number
Author 2 (66 topics) Author 7 (20 topics) Author 24 (27 topics)
Boil-down reuse: Averaged editing histories by authors.
q one author per plotq gray lines: individual essaysq black line: average
26 [∧] © Völske 2013
Categorizing Crowdsourced Text ReuseClassification Scheme for Text Reuse
F ind-R eplac e R emix
C lone, C trl-C Mas hup
Classification Scheme for Text Reuse.
q types of plagiarism as distinguished by Turnitin [Turnitin 2012]
q interpret text reuse (plagiarism) as a combination of two factors:paraphrasing and interleaving
27 [∧] © Völske 2013
Categorizing Crowdsourced Text ReuseClassification Scheme for Text Reuse
Interleavinglow high
low
high
Pa
rap
hra
sin
g
F ind-R eplac e R emix
C lone, C trl-C Mas hup
Classification Scheme for Text Reuse.
q types of plagiarism as distinguished by Turnitin [Turnitin 2012]
q interpret text reuse (plagiarism) as a combination of two factors:paraphrasing and interleaving
28 [∧] © Völske 2013
Categorizing Crowdsourced Text ReuseClassification Scheme for Text Reuse
Interleavinglow high
low
high
Pa
rap
hra
sin
g
F ind-R eplac e R emix
C lone, C trl-C Mas hup
q Quantify: N-Gram similarity and ratio of passagesto sources[details]
q Measure for all essays
q Hypothesis: will show evidence of authors’ individ-ual text reuse styles
Classification Scheme for Text Reuse.
q types of plagiarism as distinguished by Turnitin [Turnitin 2012]
q interpret text reuse (plagiarism) as a combination of two factors:paraphrasing and interleaving
29 [∧] © Völske 2013
Categorizing Crowdsourced Text ReuseClassification Scheme for Text Reuse
Interleavinglow high
low
high
Pa
rap
hra
sin
g
F ind-R eplac e R emix
C lone, C trl-C Mas hup
Passages per source (interleaving)
Med
ian
5-gr
am p
assa
ge s
imila
rity
(par
aphr
asin
g)1
0
0.2
0.8
0.6
0.4
1 2 4 8
Author (topic count)
A14 (1)
A15 (3)
A16 (1)
A17 (33)
A18 (32)
A19 (7)
A20 (8)
A21 (21)
A22 (1)
A23 (1)
A24 (27)
A25 (1)
A26 (1)
A0 (2)
A1 (3)
A2 (66)
A3 (1)
A4 (1)
A5 (37)
A6 (11)
A7 (20)
A8 (1)
A9 (1)
A10 (12)
A11 (1)
A12 (1)
A13 (1)
Classification Scheme for Text Reuse.
q types of plagiarism as distinguished by Turnitin [Turnitin 2012]
q interpret text reuse (plagiarism) as a combination of two factors:paraphrasing and interleaving
30 [∧] © Völske 2013
Categorizing Crowdsourced Text ReuseClassification Scheme for Text Reuse
Passages per source (interleaving)
Med
ian
5-gr
am p
assa
ge s
imila
rity
(par
aphr
asin
g)
1
0
0.2
0.8
0.6
0.4
1 2 4 8
Author (essay count)
A14 (1)
A15 (3)
A16 (1)
A17 (33)
A18 (32)
A19 (7)
A20 (8)
A21 (21)
A22 (1)
A23 (1)
A24 (27)
A25 (1)
A26 (1)
A0 (2)
A1 (3)
A2 (66)
A3 (1)
A4 (1)
A5 (37)
A6 (11)
A7 (20)
A8 (1)
A9 (1)
A10 (12)
A11 (1)
A12 (1)
A13 (1)
31 [∧] © Völske 2013
Categorizing Crowdsourced Text ReuseClassification Scheme for Text Reuse
Passages per source (interleaving)
Med
ian
5-gr
am p
assa
ge s
imila
rity
(par
aphr
asin
g)
1
0
0.2
0.8
0.6
0.4
1 2 4 8
Author (essay count)
A14 (1)
A15 (3)
A16 (1)
A17 (33)
A18 (32)
A19 (7)
A20 (8)
A21 (21)
A22 (1)
A23 (1)
A24 (27)
A25 (1)
A26 (1)
A0 (2)
A1 (3)
A2 (66)
A3 (1)
A4 (1)
A5 (37)
A6 (11)
A7 (20)
A8 (1)
A9 (1)
A10 (12)
A11 (1)
A12 (1)
A13 (1)
32 [∧] © Völske 2013
Categorizing Crowdsourced Text ReuseClassification Scheme for Text Reuse
Passages per source (interleaving)
Med
ian
5-gr
am p
assa
ge s
imila
rity
(par
aphr
asin
g)
1
0
0.2
0.8
0.6
0.4
1 2 4 8
Author (essay count)
A14 (1)
A15 (3)
A16 (1)
A17 (33)
A18 (32)
A19 (7)
A20 (8)
A21 (21)
A22 (1)
A23 (1)
A24 (27)
A25 (1)
A26 (1)
A0 (2)
A1 (3)
A2 (66)
A3 (1)
A4 (1)
A5 (37)
A6 (11)
A7 (20)
A8 (1)
A9 (1)
A10 (12)
A11 (1)
A12 (1)
A13 (1)
33 [∧] © Völske 2013
Categorizing Crowdsourced Text ReuseClassification Scheme for Text Reuse
Passages per source (interleaving)
Med
ian
5-gr
am p
assa
ge s
imila
rity
(par
aphr
asin
g)
1
0
0.2
0.8
0.6
0.4
1 2 4 8
Author (essay count)
A14 (1)
A15 (3)
A16 (1)
A17 (33)
A18 (32)
A19 (7)
A20 (8)
A21 (21)
A22 (1)
A23 (1)
A24 (27)
A25 (1)
A26 (1)
A0 (2)
A1 (3)
A2 (66)
A3 (1)
A4 (1)
A5 (37)
A6 (11)
A7 (20)
A8 (1)
A9 (1)
A10 (12)
A11 (1)
A12 (1)
A13 (1)
34 [∧] © Völske 2013
Search Missions For Source Retrieval
35 [∧] © Völske 2013
Search Missions For Source RetrievalDistribution of Queries Over Time
1653320161582105870113231811928271962334710924840148153113154319
642630182083524341142844652605248669750138364234703457
1206167410162326910641362810898474610555088481989421848198
11276201471701395610632370607410451423011116944150274489215599
241588418140135461181851429133611723782466803368121626076
6210822424275691814720830241731616522636960864427464A
F
E
D
C
B
1 5 10 15 20 25
Distribution of queries over time.
q fraction of posed queries (y-axis) over elapsed time (x-axis)between the first query until essay completion
q each cell represents one of 150 essaysq the numbers denote the total amount of posed queriesq the cells are sorted by area under the curve
36 [∧] © Völske 2013
Search Missions For Source RetrievalDistribution of Queries Over Time
1653320161582105870113231811928271962334710924840148153113154319
642630182083524341142844652605248669750138364234703457
1206167410162326910641362810898474610555088481989421848198
11276201471701395610632370607410451423011116944150274489215599
241588418140135461181851429133611723782466803368121626076
6210822424275691814720830241731616522636960864427464A
F
E
D
C
B
1 5 10 15 20 25
Distribution of queries over time.
q fraction of posed queries (y-axis) over elapsed time (x-axis)between the first query until essay completion
q each cell represents one of 150 essaysq the numbers denote the total amount of posed queriesq the cells are sorted by area under the curve
37 [∧] © Völske 2013
Search Missions For Source RetrievalDistribution of Queries Over Time
64 112 165
Distribution of queries over time.
q fraction of posed queries (y-axis) over elapsed time (x-axis)between the first query until essay completion
q each cell represents one of 150 essaysq the numbers denote the total amount of posed queriesq the cells are sorted by area under the curve
38 [∧] © Völske 2013
Search Missions For Source RetrievalCorrelation of Editing and Querying
0 0.2 0.4 0.6 0.8 10
0.2
0.4
0.6
0.8
1
0 0.2 0.4 0.6 0.8 10
0.2
0.4
0.6
0.8
1
0 0.2 0.4 0.6 0.8 10
0.2
0.4
0.6
0.8
1
0 0.2 0.4 0.6 0.8 10
0.2
0.4
0.6
0.8
1
Author 5 (18 topics) Author 2 (33 topics) Author 24 (13 topics)Author 20 (9 topics)
Correlation of editing and querying behavior.
averaged editing histories by authors [plots]
distribution of queries over time [plots]
39 [∧] © Völske 2013
Summary
1. Novel quality of the Webis-TRC-12 dataset of crowdsourced text reuse
2. Evidence of two fundamental editing strategies: build-up & boil-down
3. New classification scheme for documents in a text reuse corpus
4. Relationship between editing behavior and search engine use
Future Work
1. Interleaving and paraphrasing in the time dimension
2. Authors’ text reuse strategies across multiple documents
3. Paraphrasing study: track individual passages over time
Thank you for your attention!40 [∧] © Völske 2013
Summary
1. Novel quality of the Webis-TRC-12 dataset of crowdsourced text reuse
2. Evidence of two fundamental editing strategies: build-up & boil-down
3. New classification scheme for documents in a text reuse corpus
4. Relationship between editing behavior and search engine use
Future Work
1. Interleaving and paraphrasing in the time dimension
2. Authors’ text reuse strategies across multiple documents
3. Paraphrasing study: track individual passages over time
Thank you for your attention!41 [∧] © Völske 2013
Summary
1. Novel quality of the Webis-TRC-12 dataset of crowdsourced text reuse
2. Evidence of two fundamental editing strategies: build-up & boil-down
3. New classification scheme for documents in a text reuse corpus
4. Relationship between editing behavior and search engine use
Future Work
1. Interleaving and paraphrasing in the time dimension
2. Authors’ text reuse strategies across multiple documents
3. Paraphrasing study: track individual passages over time
Thank you for your attention!42 [∧] © Völske 2013
Example topic:
Obama’s family.
Write about President Barack Obama’s family history, including genealogy, nationalorigins, places and dates of birth, etc. Where did Barack Obama’s parents andgrandparents come from? Also include a brief biography of Obama’s mother.
Original topic 001 of the TREC Web Track 2009:
Query. obama family tree
Description. Find information on President Barack Obama’s family history, includinggenealogy, national origins, places and dates of birth, etc.
Sub-topic 1. Find the TIME magazine photo essay “Barack Obama’s Family Tree.”
Sub-topic 2. Where did Barack Obama’s parents and grandparents come from?
Sub-topic 3. Find biographical information on Barack Obama’s mother.
[<]
Type RankFrequency Severity
Clone 1 1Exact copy of another author’s work
Mashup 2 3A mix of material copied verbatim from several sources
Ctrl-C 3 2Significant portions of text copied from a single source
Remix 4 9Paraphrasing from several sources and making the content fit together seamlessly
Recycle 5 5Self-plagiarism
Re-Tweet 6 10Proper citation, but closely follows a single source
Find-Replace 7 7Near copy of a single source, with key phrases changed
Aggregator 8 4Proper citation, but (almost) no original work
404 Error 9 6Citations to non-existent or inaccurate information about sources
Hybrid 10 8Combining properly cited sources with plagiarism in one paper
[Turnitin 2012]
[<]
Details: Paraphrasing & InterleavingClassification Scheme for Text Reuse
Interleavinglow high
low
high
Pa
rap
hra
sin
g
F ind-R eplac e R emix
C lone, C trl-C Mas hup
How to quantify?
q Measure at the passage levelq Passage: Block of text reused from the
same sourceq Paraphrasing: simple N-Gram similarity
Classification Scheme for Text Reuse.
q types of plagiarism as distinguished by Turnitin [Turnitin 2012]
q interpret text reuse (plagiarism) as a combination of two factors:paraphrasing and interleaving
[<]
Details: Paraphrasing & InterleavingClassification Scheme for Text Reuse
Interleavinglow high
low
high
Pa
rap
hra
sin
g
F ind-R eplac e R emix
C lone, C trl-C Mas hup
Three passages:
Classification Scheme for Text Reuse.
q types of plagiarism as distinguished by Turnitin [Turnitin 2012]
q interpret text reuse (plagiarism) as a combination of two factors:paraphrasing and interleaving
[<]
Details: Paraphrasing & InterleavingClassification Scheme for Text Reuse
Interleavinglow high
low
high
Pa
rap
hra
sin
g
F ind-R eplac e R emix
C lone, C trl-C Mas hup
Paraphrasing: N-Gram similarity
Classification Scheme for Text Reuse.
q types of plagiarism as distinguished by Turnitin [Turnitin 2012]
q interpret text reuse (plagiarism) as a combination of two factors:paraphrasing and interleaving
[<]
Details: Paraphrasing & InterleavingClassification Scheme for Text Reuse
Interleavinglow high
low
high
Pa
rap
hra
sin
g
F ind-R eplac e R emix
C lone, C trl-C Mas hup
Paraphrasing: N-Gram similarity
Classification Scheme for Text Reuse.
q types of plagiarism as distinguished by Turnitin [Turnitin 2012]
q interpret text reuse (plagiarism) as a combination of two factors:paraphrasing and interleaving
[<]
Details: Paraphrasing & InterleavingClassification Scheme for Text Reuse
Interleavinglow high
low
high
Pa
rap
hra
sin
g
F ind-R eplac e R emix
C lone, C trl-C Mas hup
Paraphrasing: N-Gram similarity
Classification Scheme for Text Reuse.
q types of plagiarism as distinguished by Turnitin [Turnitin 2012]
q interpret text reuse (plagiarism) as a combination of two factors:paraphrasing and interleaving
[<]
Details: Paraphrasing & InterleavingClassification Scheme for Text Reuse
Interleavinglow high
low
high
Pa
rap
hra
sin
g
F ind-R eplac e R emix
C lone, C trl-C Mas hup
Paraphrasing: N-Gram similarity
ϕn(Nc, Ns) :=|Nc ∩Ns||Nc|
q Nc: N-Grams in the passageq Ns: N-Grams in the source
We choose n = 5.
Classification Scheme for Text Reuse.
q types of plagiarism as distinguished by Turnitin [Turnitin 2012]
q interpret text reuse (plagiarism) as a combination of two factors:paraphrasing and interleaving
[<]
Details: Paraphrasing & InterleavingClassification Scheme for Text Reuse
Interleavinglow high
low
high
Pa
rap
hra
sin
g
F ind-R eplac e R emix
C lone, C trl-C Mas hup
Interleaving: Passages per source
pps(C, S) :=|C||S|
q C: passagesq S: sources
Classification Scheme for Text Reuse.
q types of plagiarism as distinguished by Turnitin [Turnitin 2012]
q interpret text reuse (plagiarism) as a combination of two factors:paraphrasing and interleaving
[<]