Can I use crowdsourcing to process my data?
Maxine Eskenazi
Language Technologies Institute
Carnegie Mellon University
What is the problem? How to ensure that crowdsourcing results are reliable
The solutions:
◦ Testing the equipment
◦ Framing the task
◦ Testing the workers
◦ Training the workers
◦ Assessing the work
In this talk
Crowdsourcing is a great resource!
◦ You have large amounts of data to process
◦ It’s faster and cheaper while maintaining high quality
But you can make it say what you want
◦ Example: Looking for sentences that include a well-pronounced example of the word “table”:
  “Do you agree that the word “table” was said in this sentence?”
  vs.
  “Please annotate this sentence”
You can get results that are meaningless
But you can get great results if you are careful!
What is the problem?
Testing the equipment - for those who will listen to something (to annotate, for example)
◦ Ask them to use a headset, then ask them to click yes if they can hear something
  Relying on worker self-assessment is nice, but not very reliable
◦ Play something to them and ask them to write down what they heard (see the sketch below)
  Compare what they wrote to what was played (you have already written this down) and, if they still cannot hear, give them feedback on how to connect the headset
How to ensure that crowdsourcing results are reliable
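A minimal sketch of the listening check above, assuming you already hold the reference transcription for each clip played to the worker; the 0.8 similarity threshold is an illustrative placeholder, not a value from the talk.

```python
import difflib

def passes_headset_check(worker_text: str, reference_text: str,
                         threshold: float = 0.8) -> bool:
    """True if what the worker typed is close enough to what was played
    to believe they can actually hear the audio."""
    norm = lambda s: " ".join(s.lower().split())
    ratio = difflib.SequenceMatcher(None, norm(worker_text),
                                    norm(reference_text)).ratio()
    return ratio >= threshold

# reference_text is your own transcription of the clip that was played
print(passes_headset_check("the cat sat on the mat", "The cat sat on the mat"))  # True
```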
Testing the equipment - for those who will record something
◦ Ask them to speak into the microphone, play it back to them, and ask whether they heard something
  Relying on worker self-assessment has sometimes worked in this case
◦ Ask them to read something from the screen, then use a speech recognizer to align what they said with what they read (see the sketch below)
  MIT has the WAMI toolkit for this, and there are others as well
◦ Have another worker listen to what they said and annotate it, then compare that annotation to the text
  This may take too much time
How to ensure that crowdsourcing results are reliable
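A sketch of the read-speech check, assuming some recognizer (WAMI or another ASR system) has already produced `asr_output` for the worker's recording; only the word error rate comparison is shown here, and the 0.3 cutoff is a placeholder.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between prompt and ASR output,
    normalized by the length of the prompt."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

def microphone_check(prompt: str, asr_output: str, max_wer: float = 0.3) -> bool:
    # asr_output comes from whatever recognizer you use; not shown here.
    return word_error_rate(prompt, asr_output) <= max_wer
```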
Framing the task - Workers need to know what the task is and how to do it
◦ Write a description of the task and instructions on what to do
  Get others to read that description and follow your instructions - sandbox it
  Revise and try it out again
◦ Give examples and counterexamples
  Give at least two to three of each
◦ Become a worker and try others’ tasks yourself!!
  You understand the issues better when you put yourself in their shoes
How to ensure that crowdsourcing results are reliable
Framing the task - VERY IMPORTANT
◦ Keep the cognitive load as low as possible! Break one complex task into several tasks
◦ Example - instead of “label the words you hear as well as the non-words, parts of words and pauses”, you would ask:
  “label the words you hear”, then
  in a separate task, “label the non-words, like lipsmacks, you hear”
  in a separate task, “label the parts of words, like restarts, you hear”
  in a separate task, “label where the pauses are”
How to ensure that crowdsourcing results are reliable
Framing the task
◦ Another example
  Interspeech 2013 - 25th anniversary
  Statistics on the past 25 years - 18 categories
    e.g., total number of papers, total number of different authors
    2 harder-to-define categories, e.g., total number of cohorts of authors
  1500 attendees were quizzed
  The crowd had a close-to-correct or right answer on the first 16, nothing close on the last 2
How to ensure that crowdsourcing results are reliable
Framing the task
◦ Workers will choose the task they want to work on for several reasons:
  How much they can make per hour
    Calculate how much you should pay them so that they make at least minimum wage, based on how much time it takes to complete one task (see the sketch below)
  How can you make the task go faster?
    Put all of one task on one page without scrolling - no scrolling saves their time
    Example: ten sentences to annotate plus the instructions
    Let them minimize the instructions if they want
    Change the font size and the space between sentences to get it all on the screen at the same time
    Eliminate any other unnecessary keystrokes
How to ensure that crowdsourcing results are reliable
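A back-of-the-envelope sketch of the pay calculation; the task time and the $7.25 hourly rate (the US federal minimum wage, used here only as a placeholder) should be replaced by your own measurements and jurisdiction.

```python
def price_per_task(seconds_per_task: float, hourly_wage: float = 7.25) -> float:
    """Smallest payment (in dollars) per task so that a worker completing
    tasks at the measured speed earns at least the target hourly wage."""
    tasks_per_hour = 3600.0 / seconds_per_task
    return round(hourly_wage / tasks_per_hour, 2)

# If one page of ten sentences takes about 90 seconds to annotate:
print(price_per_task(90))  # 0.18 dollars per page
```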
Framing the task
◦ What it will be used for
  You make your task more appealing when you tell people why you want them to do it
  Example from our work: “We are asking you to simplify some sentences. They are taken from everyday documents like driver license applications. This is so that we can automatically simplify everyday documents.”
◦ How nice it looks
  A subliminal detail that has been shown to be effective
How to ensure that crowdsourcing results are reliable
Testing the workers - why?
◦ Do not assume they are native speakers of X - test them!
  Just because you have geolocation, that does not mean the person fluently speaks the language of that country
◦ Do not assume that all speakers of Y can write down what they hear - test them!
◦ Not everyone is honest, and there are bots
How to ensure that crowdsourcing results are reliable
Testing the workers - how?
◦ To test for speakers of X, you could ask them to translate (type in) something from English into the target language (see the sketch below)
  Make sure that there is some word or expression that Google Translate or another MT system would get wrong
  You have already translated this sentence by hand
  Compare the two texts
How to ensure that crowdsourcing results are reliable
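A sketch of that language screen: the worker's translation is compared both to your hand translation and to the machine translation you expect to be wrong, so that copy-pasting from Google Translate is caught. The similarity measure and thresholds are illustrative, not prescribed by the talk.

```python
import difflib

def similarity(a: str, b: str) -> float:
    return difflib.SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def passes_language_screen(worker_translation: str,
                           hand_translation: str,
                           machine_translation: str) -> bool:
    """Accept a worker who is close to the hand translation and is not
    simply pasting the (deliberately tricky) machine translation."""
    return (similarity(worker_translation, hand_translation) >= 0.6 and
            similarity(worker_translation, machine_translation) < 0.9)
```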
Testing the workers - how?
◦ Give a new worker three items to do
  Say you want them to listen to a sentence and annotate it
  Give them three sentences to annotate
  Compare their annotation with the hand annotation you have already done for these (see the sketch below)
◦ Getting good work often requires some human expert work to establish a “gold standard” ahead of time!!
  So if you have lots of data the investment is worth it, but it may not be for small datasets
How to ensure that crowdsourcing results are reliable
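A sketch of the three-item pretest, assuming exact-match scoring against your hand annotations; passing 2 of 3 is an arbitrary example threshold.

```python
def passes_pretest(worker_answers: dict, gold_answers: dict,
                   min_correct: int = 2) -> bool:
    """Both dicts map item id -> annotation string."""
    correct = sum(1 for item, gold in gold_answers.items()
                  if worker_answers.get(item, "").strip().lower()
                  == gold.strip().lower())
    return correct >= min_correct

gold = {"s1": "the cat sat", "s2": "on the mat", "s3": "by the door"}
worker = {"s1": "the cat sat", "s2": "on the mat", "s3": "by the floor"}
print(passes_pretest(worker, gold))  # True: 2 of 3 correct
```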
Training the workers
• The pretesting you have done should serve as training for most tasks
• You could give more specific feedback if there is something they are doing that can be corrected
• Example: you asked for annotations that end with a $, and one worker is not adding that $ but is annotating well. Just send that person a message to add the $, and keep the worker (see the sketch below)
How to ensure that crowdsourcing results are reliable
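A small sketch of that kind of feedback check, separating a fixable formatting slip (the missing “$”) from genuinely bad work; the rule itself is just the example from the slide.

```python
def annotation_issues(annotation: str) -> list:
    """Flag formatting problems worth a friendly message rather than a
    rejection, such as the missing terminal '$' from the example above."""
    issues = []
    if not annotation.rstrip().endswith("$"):
        issues.append("missing terminal '$' - please add it")
    return issues

print(annotation_issues("the cat sat on the mat"))    # ["missing terminal '$' - please add it"]
print(annotation_issues("the cat sat on the mat $"))  # []
```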
Training the workers
• You can put up a small number of tasks to start
  Say 100 tasks (for example, 100 utterances to annotate)
• Check whether the tasks are being done correctly
• Check whether each worker is doing the work correctly (see the sketch below)
• Revise your task if all workers are not doing well
• Or notify a worker if they are not doing as well as the other workers
  They risk not being paid and may want to abandon your tasks
How to ensure that crowdsourcing results are reliable
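A sketch of the pilot-batch check, assuming some gold items were mixed into the first 100 tasks; per-worker accuracy tells you whether to revise the task (everyone scores low) or to message one worker (only they score low).

```python
from collections import defaultdict

def per_worker_accuracy(results, gold):
    """results: iterable of (worker_id, item_id, answer) tuples.
    gold: item_id -> correct answer (only gold items are scored).
    Returns worker_id -> accuracy on the gold items they saw."""
    right, total = defaultdict(int), defaultdict(int)
    for worker, item, answer in results:
        if item in gold:
            total[worker] += 1
            right[worker] += int(answer == gold[item])
    return {w: right[w] / total[w] for w in total}

# If every worker scores low, revise the task; if only one does, message them.
```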
Assessing the work
◦ There are three places where you can assess work:
  Before starting the task
    See training and testing
  While tasks are still live
    This is the best place to get rid of bots and cheaters
  After tasks are done (post-processing)
How to ensure that crowdsourcing results are reliable
◦ During the task
  Compare work to the gold standard (see the sketch below)
    Create a dataset of human-expert-labelled items (about 10 percent of the total items to be processed)
    For every ten items, put in one gold standard item
    Compare worker output to that item
  Compare one worker’s output to that of others (inter-worker)
    Majority wins, so have an odd number of workers for each task
  Compare one worker’s output to their own work (intra-worker)
    Give the worker the same item every 20 or 30 items and compare his/her performance on that item - consistency
How to ensure that crowdsourcing results are reliable
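A sketch of the two mechanics described above: interleaving one gold item per ten real items, and taking a majority vote over an odd number of workers. The 1-in-10 ratio and the odd-number rule come from the slide; everything else is an assumption.

```python
from collections import Counter

def interleave_gold(items, gold_items, every=10):
    """Insert one gold standard item after every `every` real items."""
    out, g = [], 0
    for i, item in enumerate(items, start=1):
        out.append(item)
        if i % every == 0 and g < len(gold_items):
            out.append(gold_items[g])
            g += 1
    return out

def majority_label(labels):
    """Majority vote over an odd number of worker labels for one item;
    returns None if there is no strict majority."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count > len(labels) / 2 else None

print(majority_label(["yes", "yes", "no"]))  # 'yes'
```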
Assessing the work during the task
◦ Another thing to watch out for is bots and cheaters (see the sketch below)
  Bots - their creators model the task
  Cheaters - get through the task as quickly as possible
  While you would pay a poor worker, you should refuse to pay a bot and someone who you are sure is a cheater
◦ For cheaters, look at how much time it took to do each item
  Too fast? It’s a cheater
◦ Give a series of multiple choice items
  If a worker answers B consistently, they are either a bot or a cheater
◦ Put up small groups of tasks with different names
  The tasks will be finished too quickly for a bot to be created (a model of your task to be made)
How to ensure that crowdsourcing results are reliable
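A sketch of the two cheater heuristics from this slide - implausibly fast items, and (for multiple choice) nearly always the same answer; the thresholds are placeholders you would tune for your own task.

```python
def looks_like_cheater(times_seconds, answers,
                       min_seconds=3.0, same_answer_share=0.95):
    """times_seconds: seconds spent on each item; answers: the worker's
    multiple-choice answers. Flags workers who are too fast or who pick
    (almost) the same option every time."""
    median_time = sorted(times_seconds)[len(times_seconds) // 2]
    too_fast = median_time < min_seconds
    top_share = max(answers.count(a) for a in set(answers)) / len(answers)
    always_same = top_share >= same_answer_share
    return too_fast or always_same

print(looks_like_cheater([1.2, 0.9, 1.1, 1.0], ["B", "B", "B", "B"]))  # True
```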
Assessing the work - after the task, on all of the data at once
◦ Gold standard
  Pull out the gold standard you created and compare the work that you have collected to it
◦ Intra-worker comparison
  Does a worker consistently agree with the crowd?
  Ask the worker if they are confident in their answer - if they consistently say no, do not use their work
  Note that consulting the workers often brings in good feedback!
How to ensure that crowdsourcing results are reliable
Assessing the work - after the task, on all of the data at once
◦ Inter-worker comparison (see the sketch below)
  In the same way that you would compare the work of one worker to the gold standard, you can compare the work of one worker to another
  Look for one worker who does not agree with all of the others (uneven numbers again)
  No gold standard is needed for this, so your expert may need to label less data
◦ Assess the work of one crowd by another
  Ask one crowd to do the task
  Give the same task to another crowd, showing the first crowd’s work, for example:
    “Please correct the following”
    “Does this text match what was said?” (yes-no, or change what was wrong)
How to ensure that crowdsourcing results are reliable
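A sketch of the inter-worker comparison: pairwise agreement between workers over the items they share, used to spot the one worker who disagrees with everyone else. The 0.6 threshold is an illustrative placeholder.

```python
def pairwise_agreement(a_labels, b_labels):
    """a_labels, b_labels: item_id -> label for two workers.
    Returns the share of shared items on which they agree (None if none)."""
    shared = set(a_labels) & set(b_labels)
    if not shared:
        return None
    return sum(a_labels[i] == b_labels[i] for i in shared) / len(shared)

def outlier_workers(all_labels, min_mean_agreement=0.6):
    """all_labels: worker_id -> {item_id: label}. Flags workers whose mean
    agreement with every other worker falls below the threshold."""
    flagged = []
    for w, labels in all_labels.items():
        scores = [pairwise_agreement(labels, other)
                  for o, other in all_labels.items() if o != w]
        scores = [s for s in scores if s is not None]
        if scores and sum(scores) / len(scores) < min_mean_agreement:
            flagged.append(w)
    return flagged
```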
We have seen ways to ensure that what you get is high quality and makes sense
◦ Equipment can be tested reliably
◦ Instructions and all of the setup that ensures the task makes sense can be tested
◦ Workers can be pretested and trained
◦ Bots and cheaters can be eliminated
◦ The work can be assessed before, during, or after the task is completed
Summing up
Too much information?
These slides will be up on my website
Google for Maxine Eskenazi Research
Any questions from the crowd?