Can I use crowdsourcing to process my data?
Maxine Eskenazi
Language Technologies Institute
Carnegie Mellon University
What is the problem? How to ensure that crowdsourcing results are reliable
The solutions:
◦ Testing the equipment
◦ Framing the task
◦ Testing the workers
◦ Training the workers
◦ Assessing the work
In this talk
Crowdsourcing is a great resource!
◦ You have large amounts of data to process
◦ It’s faster and cheaper while maintaining high quality
But you can make it say what you want
◦ Example: Looking for sentences that include a well-pronounced example of the word “table”:
  “Do you agree that the word “table” was said in this sentence?”
  vs.
  “Please annotate this sentence”
You can get results that are meaningless
But you can get great results if you are careful!
What is the problem?
Testing the equipment - for those who will listen to something (to annotate, for example)
◦ Ask them to use a headset, then ask them to click yes if they can hear something
  Relying on worker self-assessment is nice, but not very reliable
◦ Play something to them and ask them to write down what they heard (see the sketch below)
  Compare what they wrote to what was played (you have already written this down) and, if they still cannot hear, give them feedback on how to connect the headset
How to ensure that crowdsourcing results are reliable
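A minimal sketch of the listening check above, assuming you already hold the reference transcription for each clip played to the worker; the 0.8 similarity threshold is an illustrative placeholder, not a value from the talk.

```python
import difflib

def passes_headset_check(worker_text: str, reference_text: str,
                         threshold: float = 0.8) -> bool:
    """True if what the worker typed is close enough to what was played
    to believe they can actually hear the audio."""
    norm = lambda s: " ".join(s.lower().split())
    ratio = difflib.SequenceMatcher(None, norm(worker_text),
                                    norm(reference_text)).ratio()
    return ratio >= threshold

# reference_text is your own transcription of the clip that was played
print(passes_headset_check("the cat sat on the mat", "The cat sat on the mat"))  # True
```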
Testing the equipment - for those who will record something
◦ Ask them to speak into the microphone, play it back to them, and ask whether they heard something
  Relying on worker self-assessment has sometimes worked in this case
◦ Ask them to read something from the screen, then use a speech recognizer to align what they said with what they read (see the sketch below)
  MIT has the WAMI toolkit for this, and there are others as well
◦ Have another worker listen to what they said and annotate it, then compare that annotation to the text
  This may take too much time
How to ensure that crowdsourcing results are reliable
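A sketch of the read-speech check, assuming some recognizer (WAMI or another ASR system) has already produced `asr_output` for the worker's recording; only the word error rate comparison is shown here, and the 0.3 cutoff is a placeholder.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between prompt and ASR output,
    normalized by the length of the prompt."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

def microphone_check(prompt: str, asr_output: str, max_wer: float = 0.3) -> bool:
    # asr_output comes from whatever recognizer you use; not shown here.
    return word_error_rate(prompt, asr_output) <= max_wer
```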
Framing the task - Workers need to know what the task is and how to do it
◦ Write a description of the task and instructions on what to do
  Get others to read that description and follow your instructions - sandbox it
  Revise and try it out again
◦ Give examples and counterexamples
  Give at least two to three of each
◦ Become a worker and try others’ tasks yourself!!
  You understand the issues better when you put yourself in their shoes
How to ensure that crowdsourcing results are reliable
Framing the task - VERY IMPORTANT
◦ Keep the cognitive load as low as possible! Break one complex task into several tasks
◦ Example - instead of “label the words you hear as well as the non-words, parts of words and pauses”, you would ask:
  “label the words you hear”, then
  in a separate task, “label the non-words, like lipsmacks, you hear”
  in a separate task, “label the parts of words, like restarts, you hear”
  in a separate task, “label where the pauses are”
How to ensure that crowdsourcing results are reliable
Framing the task
◦ Another example
  Interspeech 2013 - 25th anniversary
  Statistics on the past 25 years - 18 categories
    e.g., total number of papers, total number of different authors
    2 harder-to-define categories, e.g., total number of cohorts of authors
  1500 attendees were quizzed
  The crowd had a close-to-correct or right answer on the first 16, nothing close on the last 2
How to ensure that crowdsourcing results are reliable
Framing the task
◦ Workers will choose the task they want to work on for several reasons:
  How much they can make per hour
    Calculate how much you should pay them so that they make at least minimum wage, based on how much time it takes to complete one task (see the sketch below)
  How can you make the task go faster?
    Put all of one task on one page without scrolling - no scrolling saves their time
    Example: ten sentences to annotate plus the instructions
    Let them minimize the instructions if they want
    Change the font size and the space between sentences to get it all on the screen at the same time
    Eliminate any other unnecessary keystrokes
How to ensure that crowdsourcing results are reliable
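A back-of-the-envelope sketch of the pay calculation; the task time and the $7.25 hourly rate (the US federal minimum wage, used here only as a placeholder) should be replaced by your own measurements and jurisdiction.

```python
def price_per_task(seconds_per_task: float, hourly_wage: float = 7.25) -> float:
    """Smallest payment (in dollars) per task so that a worker completing
    tasks at the measured speed earns at least the target hourly wage."""
    tasks_per_hour = 3600.0 / seconds_per_task
    return round(hourly_wage / tasks_per_hour, 2)

# If one page of ten sentences takes about 90 seconds to annotate:
print(price_per_task(90))  # 0.18 dollars per page
```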
Framing the task
◦ What it will be used for
  You make your task more appealing when you tell people why you want them to do it
  Example from our work: “We are asking you to simplify some sentences. They are taken from everyday documents like driver license applications. This is so that we can automatically simplify everyday documents.”
◦ How nice it looks
  A subliminal detail that has been shown to be effective
How to ensure that crowdsourcing results are reliable
Testing the workers - why?
◦ Do not assume they are native speakers of X - test them!
  Just because you have geolocation, that does not mean the person fluently speaks the language of that country
◦ Do not assume that all speakers of Y can write down what they hear - test them!
◦ Not everyone is honest, and there are bots
How to ensure that crowdsourcing results are reliable
Testing the workers - how?
◦ To test for speakers of X, you could ask them to translate (type in) something from English into the target language (see the sketch below)
  Make sure that there is some word or expression that Google Translate or another MT system would get wrong
  You have already translated this sentence by hand
  Compare the two texts
How to ensure that crowdsourcing results are reliable
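A sketch of that language screen: the worker's translation is compared both to your hand translation and to the machine translation you expect to be wrong, so that copy-pasting from Google Translate is caught. The similarity measure and thresholds are illustrative, not prescribed by the talk.

```python
import difflib

def similarity(a: str, b: str) -> float:
    return difflib.SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def passes_language_screen(worker_translation: str,
                           hand_translation: str,
                           machine_translation: str) -> bool:
    """Accept a worker who is close to the hand translation and is not
    simply pasting the (deliberately tricky) machine translation."""
    return (similarity(worker_translation, hand_translation) >= 0.6 and
            similarity(worker_translation, machine_translation) < 0.9)
```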
Testing the workers - how?
◦ Give a new worker three items to do
  Say you want them to listen to a sentence and annotate it
  Give them three sentences to annotate
  Compare their annotation with the hand annotation you have already done for these (see the sketch below)
◦ Getting good work often requires some human expert work to establish a “gold standard” ahead of time!!
  So if you have lots of data the investment is worth it, but it may not be for small datasets
How to ensure that crowdsourcing results are reliable
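A sketch of the three-item pretest, assuming exact-match scoring against your hand annotations; passing 2 of 3 is an arbitrary example threshold.

```python
def passes_pretest(worker_answers: dict, gold_answers: dict,
                   min_correct: int = 2) -> bool:
    """Both dicts map item id -> annotation string."""
    correct = sum(1 for item, gold in gold_answers.items()
                  if worker_answers.get(item, "").strip().lower()
                  == gold.strip().lower())
    return correct >= min_correct

gold = {"s1": "the cat sat", "s2": "on the mat", "s3": "by the door"}
worker = {"s1": "the cat sat", "s2": "on the mat", "s3": "by the floor"}
print(passes_pretest(worker, gold))  # True: 2 of 3 correct
```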
Training the workers
• The pretesting you have done should serve as training for most tasks
• You could give more specific feedback if there is something they are doing that can be corrected
• Example: you asked for annotations that end with a $, and one worker is not adding that $ but is annotating well. Just send that person a message to add the $, and keep the worker (see the sketch below)
How to ensure that crowdsourcing results are reliable
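A small sketch of that kind of feedback check, separating a fixable formatting slip (the missing “$”) from genuinely bad work; the rule itself is just the example from the slide.

```python
def annotation_issues(annotation: str) -> list:
    """Flag formatting problems worth a friendly message rather than a
    rejection, such as the missing terminal '$' from the example above."""
    issues = []
    if not annotation.rstrip().endswith("$"):
        issues.append("missing terminal '$' - please add it")
    return issues

print(annotation_issues("the cat sat on the mat"))    # ["missing terminal '$' - please add it"]
print(annotation_issues("the cat sat on the mat $"))  # []
```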
Training the workers
• You can put up a small number of tasks to start
  Say 100 tasks (for example, 100 utterances to annotate)
• Check whether the tasks are being done correctly
• Check whether each worker is doing the work correctly (see the sketch below)
• Revise your task if all workers are not doing well
• Or notify a worker if they are not doing as well as the other workers
  They risk not being paid and may want to abandon your tasks
How to ensure that crowdsourcing results are reliable
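A sketch of the pilot-batch check, assuming some gold items were mixed into the first 100 tasks; per-worker accuracy tells you whether to revise the task (everyone scores low) or to message one worker (only they score low).

```python
from collections import defaultdict

def per_worker_accuracy(results, gold):
    """results: iterable of (worker_id, item_id, answer) tuples.
    gold: item_id -> correct answer (only gold items are scored).
    Returns worker_id -> accuracy on the gold items they saw."""
    right, total = defaultdict(int), defaultdict(int)
    for worker, item, answer in results:
        if item in gold:
            total[worker] += 1
            right[worker] += int(answer == gold[item])
    return {w: right[w] / total[w] for w in total}

# If every worker scores low, revise the task; if only one does, message them.
```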
Assessing the work
◦ There are three places where you can assess work:
  Before starting the task
    See training and testing
  While tasks are still live
    This is the best place to get rid of bots and cheaters
  After tasks are done (post-processing)
How to ensure that crowdsourcing results are reliable
◦ During the task
  Compare work to the gold standard (see the sketch below)
    Create a dataset of human-expert-labelled items (about 10 percent of the total items to be processed)
    For every ten items, put in one gold standard item
    Compare worker output to that item
  Compare one worker’s output to that of others (inter-worker)
    Majority wins, so have an odd number of workers for each task
  Compare one worker’s output to their own work (intra-worker)
    Give the worker the same item every 20 or 30 items and compare his/her performance on that item - consistency
How to ensure that crowdsourcing results are reliable
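A sketch of the two mechanics described above: interleaving one gold item per ten real items, and taking a majority vote over an odd number of workers. The 1-in-10 ratio and the odd-number rule come from the slide; everything else is an assumption.

```python
from collections import Counter

def interleave_gold(items, gold_items, every=10):
    """Insert one gold standard item after every `every` real items."""
    out, g = [], 0
    for i, item in enumerate(items, start=1):
        out.append(item)
        if i % every == 0 and g < len(gold_items):
            out.append(gold_items[g])
            g += 1
    return out

def majority_label(labels):
    """Majority vote over an odd number of worker labels for one item;
    returns None if there is no strict majority."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count > len(labels) / 2 else None

print(majority_label(["yes", "yes", "no"]))  # 'yes'
```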
Assessing the work during the task
◦ Another thing to watch out for is bots and cheaters (see the sketch below)
  Bots - their creators model the task
  Cheaters - get through the task as quickly as possible
  While you would pay a poor worker, you should refuse to pay a bot and someone who you are sure is a cheater
◦ For cheaters, look at how much time it took to do each item
  Too fast? It’s a cheater
◦ Give a series of multiple choice items
  If a worker answers B consistently, they are either a bot or a cheater
◦ Put up small groups of tasks with different names
  The tasks will be finished too quickly for a bot to be created (a model of your task to be made)
How to ensure that crowdsourcing results are reliable
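A sketch of the two cheater heuristics from this slide - implausibly fast items, and (for multiple choice) nearly always the same answer; the thresholds are placeholders you would tune for your own task.

```python
def looks_like_cheater(times_seconds, answers,
                       min_seconds=3.0, same_answer_share=0.95):
    """times_seconds: seconds spent on each item; answers: the worker's
    multiple-choice answers. Flags workers who are too fast or who pick
    (almost) the same option every time."""
    median_time = sorted(times_seconds)[len(times_seconds) // 2]
    too_fast = median_time < min_seconds
    top_share = max(answers.count(a) for a in set(answers)) / len(answers)
    always_same = top_share >= same_answer_share
    return too_fast or always_same

print(looks_like_cheater([1.2, 0.9, 1.1, 1.0], ["B", "B", "B", "B"]))  # True
```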
Assessing the work - after the task, on all of the data at once
◦ Gold standard
  Pull out the gold standard you created and compare the work that you have collected to it
◦ Intra-worker comparison
  Does a worker consistently agree with the crowd?
  Ask the worker if they are confident in their answer - if they consistently say no, do not use their work
  Note that consulting the workers often brings in good feedback!
How to ensure that crowdsourcing results are reliable
Assessing the work - after the task, on all of the data at once
◦ Inter-worker comparison (see the sketch below)
  In the same way that you would compare the work of one worker to the gold standard, you can compare the work of one worker to another
  Look for one worker who does not agree with all of the others (uneven numbers again)
  No gold standard is needed for this, so your expert may need to label less data
◦ Assess the work of one crowd by another
  Ask one crowd to do the task
  Give the same task to another crowd, showing the first crowd’s work, for example:
    “Please correct the following”
    “Does this text match what was said?” (yes-no, or change what was wrong)
How to ensure that crowdsourcing results are reliable
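A sketch of the inter-worker comparison: pairwise agreement between workers over the items they share, used to spot the one worker who disagrees with everyone else. The 0.6 threshold is an illustrative placeholder.

```python
def pairwise_agreement(a_labels, b_labels):
    """a_labels, b_labels: item_id -> label for two workers.
    Returns the share of shared items on which they agree (None if none)."""
    shared = set(a_labels) & set(b_labels)
    if not shared:
        return None
    return sum(a_labels[i] == b_labels[i] for i in shared) / len(shared)

def outlier_workers(all_labels, min_mean_agreement=0.6):
    """all_labels: worker_id -> {item_id: label}. Flags workers whose mean
    agreement with every other worker falls below the threshold."""
    flagged = []
    for w, labels in all_labels.items():
        scores = [pairwise_agreement(labels, other)
                  for o, other in all_labels.items() if o != w]
        scores = [s for s in scores if s is not None]
        if scores and sum(scores) / len(scores) < min_mean_agreement:
            flagged.append(w)
    return flagged
```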
We have seen ways to ensure that what you get is high quality and makes sense
◦ Equipment can be tested reliably
◦ Instructions and all of the setup that ensures the task makes sense can be tested
◦ Workers can be pretested and trained
◦ Bots and cheaters can be eliminated
◦ The work can be assessed before, during, or after the task is completed
Summing up
Too much information?
These slides will be up on my website
Google for Maxine Eskenazi Research
Any questions from the crowd?