The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track: Notebook Paper
TREC 2011 Gaithersburg, USA · November 18th Picture by Michael Dornbierer
Practical and Effective Design of a Crowdsourcing Task for
Unconventional Relevance Judging Julián Urbano @julian_urbano
Mónica Marrero, Diego Martín, Jorge Morato, Karina Robles and Juan Lloréns University Carlos III of Madrid
Task I
Crowdsourcing Individual Relevance Judgments
In a Nutshell
• Amazon Mechanical Turk, External HITs
• All 5 documents per set in a single HIT = 435 HITs
• $0.20 per HIT = $0.04 per document
| | graded | slider | hterms |
|---|---|---|---|
| Hours to complete | 8.5 | 38 | 20.5 |
| HITs submitted (overhead) | 438 (+1%) | 535 (+23%) | 448 (+3%) |
| Workers who submitted (who only previewed) | 29 (102) | 83 (383) | 30 (163) |
| Average documents per worker | 76 | 32 | 75 |
| Total cost (including fees) | $95.7 | $95.7 | $95.7 |

(slider: ran out of time)
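A quick sanity check of the cost figures above (assuming AMT's 10% requester fee at the time):

```python
HITS = 435            # 435 sets of 5 documents, one set per HIT
REWARD = 0.20         # dollars per HIT
DOCS_PER_HIT = 5
FEE = 0.10            # assumed 10% AMT requester fee

per_doc = REWARD / DOCS_PER_HIT    # cost per document, before fees
total = HITS * REWARD * (1 + FEE)  # reward plus fees

print(round(per_doc, 2))  # 0.04
print(round(total, 2))    # 95.7
```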
Document Preprocessing
• Ensure smooth loading and safe rendering
– Null hyperlinks
– Embed all external resources
– Remove CSS unrelated to style or layout
– Remove unsafe HTML elements
– Remove irrelevant HTML attributes
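The preprocessing code itself is not on the slides; a minimal sketch of the same idea with Python's stdlib `html.parser` (the `UNSAFE_TAGS` and `KEEP_ATTRS` lists are illustrative assumptions, not the actual ones used):

```python
from html.parser import HTMLParser

UNSAFE_TAGS = {"script", "iframe", "object", "embed", "form"}      # assumed
KEEP_ATTRS = {"style", "class", "src", "alt", "width", "height"}   # assumed

class Sanitizer(HTMLParser):
    """Null hyperlinks, drop unsafe elements, keep only style/layout attributes."""
    def __init__(self):
        super().__init__()
        self.out = []
        self.skip_depth = 0  # >0 while inside an unsafe element

    def handle_starttag(self, tag, attrs):
        if tag in UNSAFE_TAGS:
            self.skip_depth += 1
            return
        if self.skip_depth:
            return
        if tag == "a":
            attrs = [("href", "#")]  # null hyperlinks so workers stay on the page
        else:
            attrs = [(k, v) for k, v in attrs if k in KEEP_ATTRS]
        rendered = "".join(f' {k}="{v}"' for k, v in attrs)
        self.out.append(f"<{tag}{rendered}>")

    def handle_endtag(self, tag):
        if tag in UNSAFE_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if not self.skip_depth:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)

def sanitize(html: str) -> str:
    p = Sanitizer()
    p.feed(html)
    return "".join(p.out)

print(sanitize('<p onclick="x()">Hi <a href="http://evil">link</a></p><script>bad()</script>'))
# <p>Hi <a href="#">link</a></p>
```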
Display Mode
• With images
• Black & white, no images
hterms run
Display Mode (and II)
• Previous experiment
– Workers seem to prefer images and colors
– But some definitely go for just text
• Allow them both, but images by default
• Black and white best with highlighting
– 7 (24%) workers in graded
– 21 (25%) in slider
– 12 (40%) in hterms
HIT Design
Relevance Question
• graded: focus on binary labels
• Binary label
– Bad = 0, Good = 1
– Fair: map with different probabilities? We chose 1 too
• Ranking
– Order by relevance, then by failures in Quality Control and then by time spent
Relevance Question (II)
• slider: focus on ranking
• Do not show handle at the beginning
– Bias
– Lazy workers indistinguishable from undecided ones
• Seemed unclear it was a slider
Relevance Question (III)
[Figure: histogram of slider values — frequency (0–700) vs. slider value (0–100)]
Relevance Question (IV)
• Binary label
– Threshold
– Normalized between 0 and 100
– Worker-Normalized
– Set-Normalized
– Set-Normalized Threshold
– Cluster
• Ranking label
– Implicit
threshold = 0.4
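The binary mappings above can be sketched as follows (a minimal illustration, not the submitted implementation; only the 0.4 threshold comes from the slide):

```python
def threshold_label(value, t=0.4):
    """Fixed threshold on the raw slider value, normalized to [0, 1]."""
    return 1 if value / 100 >= t else 0

def worker_normalized(values):
    """Rescale one worker's slider values to their own [min, max] range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5] * len(values)
    return [(v - lo) / (hi - lo) for v in values]

def set_normalized(set_values):
    """Same rescaling, but within each 5-document set."""
    return worker_normalized(set_values)

# e.g. a worker who never uses the low end of the slider:
raw = [60, 70, 90, 100, 65]
print([threshold_label(v) for v in raw])     # [1, 1, 1, 1, 1]
norm = worker_normalized(raw)
print([1 if v >= 0.4 else 0 for v in norm])  # [0, 0, 1, 1, 0]
```

The example shows why normalization matters: a fixed threshold labels everything by this worker relevant, while the worker-normalized mapping recovers a contrast.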
Relevance Question (and V)
• hterms: focus on ranking, seriously
• Still unclear?
[Figures: two histograms of slider values — frequency (0–600) vs. slider value (0–100)]
Quality Control
• Worker Level: demographic filters
• Task Level: additional info/questions
– Implicit: work time, behavioral patterns
– Explicit: additional verifiable questions
• Process Level: trap questions, training
• Aggregation Level: consensus from redundancy
QC: Worker Level
• At least 100 total approved HITs
• At least 95% approved HITs
– 98% in hterms
• Work on at most 50 HITs
• Also tried
– Country
– Master Qualifications
QC: Implicit Task Level
• Time spent in each document
– Images and Text modes together
• Don’t use the time reported by Amazon
– It lumps preview and work time together
• Time failure: less than 4.5 secs
QC: Implicit Task Level (and II)
Time Spent (secs)

| | graded | slider | hterms |
|---|---|---|---|
| Min | 3 | 3 | 3 |
| 1st Q | 10 | 14 | 11 |
| Median | 15 | 23 | 19 |
QC: Explicit Task Level
• There is previous work with Wikipedia
– Number of images
– Headings
– References
– Paragraphs
• With music / video
– Approximate song duration
• Impractical with arbitrary Web documents
QC: Explicit Task Level (II)
• Ideas
– Spot nonsensical but syntactically correct sentences
“the car bought a computer about eating the sea”
• Not easy to find the right spot to insert it
• Too annoying for clearly (non)relevant documents
– Report what paragraph made them decide
• Kinda useless without redundancy
• Might be several answers
• Reading comprehension test
QC: Explicit Task Level (III)
• Previous experiment
– Give us 5-10 keywords to describe the document
• 4 AMT runs with different demographics
• 4 faculty members
– Nearly always gave the top 1-2 most frequent terms
• Stemming and removing stop words
• Offered two sets of 5 keywords; workers choose the one that better describes the document
QC: Explicit Task Level (and IV)
• Correct
– 3 most frequent + 2 in the next 5
• Incorrect
– 5 in the 25 least frequent
• Shuffle and random picks
• Keyword failure: chose the incorrect terms
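The construction above can be sketched like this (assuming tokens are already stemmed and stop words removed, and that the document has enough distinct terms; the seed is only for reproducibility):

```python
import random
from collections import Counter

def keyword_question(tokens, seed=None):
    """Build the two 5-keyword sets: one 'correct' (the 3 most frequent terms
    plus 2 drawn at random from the next 5) and one 'incorrect' (5 drawn at
    random from the 25 least frequent terms), both shuffled."""
    rng = random.Random(seed)
    by_freq = [t for t, _ in Counter(tokens).most_common()]
    correct = by_freq[:3] + rng.sample(by_freq[3:8], 2)
    incorrect = rng.sample(by_freq[-25:], 5)
    rng.shuffle(correct)
    rng.shuffle(incorrect)
    return correct, incorrect
```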
QC: Process Level
• Previous NIST judgments as trap questions?
• No
– Need previous judgments
– Not expected to be balanced
– Overhead cost
– More complex process
– Tell us nothing about the non-trap examples
Reject Work and Block Workers
• Limit the number of failures in QC
| Action | Failure | graded | slider | hterms |
|---|---|---|---|---|
| Reject HIT | Keyword | 1 | 0 | 1 |
| Reject HIT | Time | 2 | 1 | 1 |
| Block Worker | Keyword | 1 | 1 | 1 |
| Block Worker | Time | 2 | 1 | 1 |
| Total HITs rejected | | 3 (1%) | 100 (23%) | 13 (3%) |
| Total Workers blocked | | 0 (0%) | 40 (48%) | 4 (13%) |
Workers by Country (values per run, in graded/slider/hterms order where present: Preview, Accept, Reject counts and %P, %A, %R rates)

- Australia: 8 100%
- Bangladesh: 15 3 2 75% 60% 40%
- Belgium: 2 100%
- Canada: 2 100% 1 100% 1 100%
- Croatia: 4 1 80% 100% 0%
- Egypt: 1 100%
- Finland: 11 50 1 18% 98% 2% 4 100% 8 43 1 15% 98% 2%
- France: 9 24 2 26% 92% 8%
- Germany: 1 100%
- Guatemala: 1 100%
- India: 236 214 1 52% 100% 0% 543 235 63 65% 79% 21% 235 190 7 54% 96% 4%
- Indonesia: 2 100%
- Jamaica: 8 5 62% 100% 0%
- Japan: 6 100% 2 100%
- Kenya: 2 100%
- Lebanon: 3 100%
- Lithuania: 6 1 86% 100% 0%
- Macedonia: 1 100%
- Moldova: 1 100%
- Netherlands: 1 100%
- Pakistan: 1 100% 4 1 80% 100% 0% 1 100%
- Philippines: 8 100% 3 100%
- Poland: 1 100%
- Portugal: 2 100%
- Romania: 1 100% 1 100% 5 100%
- Saudi Arabia: 3 1 75% 100% 0% 4 4 50% 100% 0%
- Slovenia: 2 1 67% 100% 0% 3 1 75% 100% 0% 8 2 80% 100% 0%
- Spain: 16 100% 15 100% 9 100%
- Switzerland: 1 100% 7 100%
- United Arab Emirates: 27 12 1 68% 92% 8% 35 9 80% 100% 0%
- United Kingdom: 8 3 73% 100% 0% 18 18 2 47% 90% 10% 8 16 33% 100% 0%
- United States: 246 166 1 60% 99% 1% 381 110 28 73% 80% 20% 242 174 5 57% 97% 3%
- Yugoslavia: 17 14 2 52% 88% 13% 1 100%
- Average: graded 527 / 435 / 3 (77%, 99%, 1%) · slider 1086 / 428 / 100 (83%, 90%, 10%) · hterms 579 / 435 / 13 (85%, 99%, 1%)
Results (unofficial)

Consensus Truth

| Run | Acc | Rec | Prec | Spec | AP | NDCG |
|---|---|---|---|---|---|---|
| Median | .761 | .752 | .789 | .700 | .798 | .831 |
| graded | .667 | .702 | .742 | .651 | .731 | .785 |
| slider | .659 | .678 | .710 | .632 | .778 | .819 |
| hterms | .725 | .725 | .781 | .726 | .818 | .846 |

NIST Truth

| Run | Acc | Rec | Prec | Spec | AP | NDCG |
|---|---|---|---|---|---|---|
| Median | .623 | .729 | .773 | .536 | .931 | .922 |
| graded | .748 | .802 | .841 | .632 | .922 | .958 |
| slider | .690 | .720 | .821 | .607 | .889 | .935 |
| hterms | .731 | .737 | .857 | .728 | .894 | .932 |
Task II
Aggregating Multiple Relevance Judgments
Good and Bad Workers
• Bad ones in politics might still be good in sports
• Topic categories to distinguish
– Type: Closed, limited, navigational, open-ended, etc.
– Subject: politics, people, shopping, etc.
– Rareness: topic keywords in WordNet?
– Readability: Flesch test
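For the readability category, the Flesch Reading Ease score can be computed as in this sketch (the vowel-group syllable counter is a crude approximation; real implementations use pronunciation dictionaries):

```python
import re

def count_syllables(word):
    """Crude approximation: count groups of consecutive vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    """206.835 - 1.015 * (words/sentences) - 84.6 * (syllables/words).
    Higher scores mean easier-to-read text."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    n = max(1, len(words))
    return 206.835 - 1.015 * (n / sentences) - 84.6 * (syllables / n)
```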
GetAnotherLabel
• Input
– Some known labels
– Worker responses
• Output
– Expected label of unknowns
– Expected quality for each worker
– Confusion matrix for each worker
Step-Wise GetAnotherLabel
• For each worker w_i, compute the expected quality q_i over all topics and the quality q_ij on each topic type t_j
• For the topics in t_j, use only the workers with q_ij > q_i
• That is, we did not use all the known labels by good workers to compute their expected quality (and the final labels), but only their labels within the topic category
• Rareness seemed to work slightly better
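A minimal sketch of the step-wise filtering; here worker quality is simply agreement with known labels, whereas the actual runs used GetAnotherLabel's expected-quality estimates:

```python
def accuracy(pairs):
    """pairs: list of (given_label, true_label) tuples."""
    return sum(g == t for g, t in pairs) / len(pairs) if pairs else 0.0

def stepwise_workers(responses, topic_of):
    """responses: {worker: [(topic, given_label, true_label), ...]}
    topic_of:  {topic: category}
    Returns, per category, the workers whose within-category quality q_ij
    exceeds their overall quality q_i."""
    selected = {}
    for w, resp in responses.items():
        q_i = accuracy([(g, t) for _, g, t in resp])  # overall quality
        by_cat = {}
        for topic, g, t in resp:
            by_cat.setdefault(topic_of[topic], []).append((g, t))
        for cat, pairs in by_cat.items():
            if accuracy(pairs) > q_i:  # keep worker only where they do better
                selected.setdefault(cat, set()).add(w)
    return selected
```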
Train Rule and SVM Models
• Relevant-to-nonrelevant ratio
– Unbiased majority voting
• For all workers, the average correct-to-incorrect ratio when saying relevant/nonrelevant
• For all workers, the average posterior probability of relevant/nonrelevant
– Based on the confusion matrices from GetAnotherLabel
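The posterior-probability feature can be illustrated with Bayes' rule over a worker's confusion matrix (the matrix values and prior below are made up for the example):

```python
def posterior_relevant(confusion, prior_rel, answer):
    """confusion[true][given]: P(worker says `given` | true label).
    Returns P(relevant | worker's answer) via Bayes' rule."""
    p_ans_rel = confusion["rel"][answer]
    p_ans_non = confusion["non"][answer]
    num = p_ans_rel * prior_rel
    den = num + p_ans_non * (1 - prior_rel)
    return num / den if den else prior_rel

# a worker who says "rel" for 80% of relevant docs and 30% of nonrelevant ones:
cm = {"rel": {"rel": 0.8, "non": 0.2}, "non": {"rel": 0.3, "non": 0.7}}
print(round(posterior_relevant(cm, 0.5, "rel"), 3))  # 0.727
```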
Results (unofficial)

Consensus Truth

| Run | Acc | Rec | Prec | Spec | AP | NDCG |
|---|---|---|---|---|---|---|
| Median | .811 | .818 | .830 | .716 | .806 | .921 |
| rule | .854 | .818 | .943 | .915 | .791 | .904 |
| svm | .847 | .798 | .953 | .931 | .855 | .958 |
| wordnet | .630 | .663 | .730 | .574 | .698 | .823 |

NIST Truth

| Run | Acc | Rec | Prec | Spec | AP | NDCG |
|---|---|---|---|---|---|---|
| Median | .640 | .754 | .625 | .560 | .111 | .359 |
| rule | .699 | .754 | .679 | .644 | .166 | .415 |
| svm | .714 | .750 | .700 | .678 | .082 | .331 |
| wordnet | .571 | .659 | .560 | .484 | .060 | .299 |
Sum Up
• Really work on the task design
– “Make it simple, but not simpler” (A. Einstein)
– Make sure they understand it before scaling up
• Find good QC methods at the explicit task level for arbitrary Web pages
– Was our question too obvious?
• Pretty decent judgments compared to NIST’s
• Look at the whole picture: system rankings
• Study long-term reliability of Crowdsourcing
– You can’t prove God doesn’t exist
– You can’t prove Crowdsourcing works