
Confusing the Crowd: Task Instruction Quality on Amazon Mechanical Turk

Meng-Han Wu, Alexander J. Quinn
Purdue University

West Lafayette, Indiana, USA
{wu784, aq}@purdue.edu

Abstract

Task instruction quality is widely presumed to affect outcomes, such as accuracy, throughput, trust, and worker satisfaction. Best practices guides written by experienced requesters share their advice about how to craft task interfaces. However, there is little evidence of how specific task design attributes affect actual outcomes. This paper presents a set of studies that expose the relationship between three sets of measures: (a) workers' perceptions of task quality, (b) adherence to popular best practices, and (c) actual outcomes when tasks are posted (including accuracy, throughput, trust, and worker satisfaction). These were investigated using collected task interfaces, along with a model task that we systematically mutated to test the effects of specific task design guidelines.

Introduction

Task interfaces are the primary link between requesters and workers on microtask platforms. Without a clear specification of what information is requested, and how to produce it, workers can hardly be expected to produce consistently good quality results. Advice, in the form of "best practices," abounds, but links between the guidelines and actual outcomes have only just begun to be measured (Jain et al. 2017).

Through our other work, we have long experienced the challenges of designing tasks. With early versions of systems, we received subpar results, and sometimes complaints from frustrated workers. In conversations with other requesters and workers, we have heard widespread complaints about poorly designed tasks. However, we are aware of no systematically obtained evidence of this supposed problem.

This paper presents a multidimensional analysis of task instruction quality, including (a) workers' perceptions of task quality, (b) adherence to popular best practices, and (c) actual outcomes when tasks are posted (including accuracy, throughput, trust, and worker satisfaction). The studies were performed with Mechanical Turk, but should apply to any microtask platform supporting freeform task design.

Besides communicating a request, task interfaces also form a semi-formal contract between the worker and requester. On Mechanical Turk, they typically express the criteria by which requesters will decide whether to approve submitted work, or reject it (and deny payment).

Copyright © 2017, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

The challenge of writing clear instructions is due, in part, to the diversity of workers' backgrounds, including cultural context, life experience, and educational level (Ipeirotis 2010). Also, unlike in-person workers, web workers normally have no ongoing investment in—or even awareness of—the overarching project the tasks are intended to support. Thus, extra care is necessary to ensure efficiency and clarity. Tasks should not require workers to understand any more than they need to complete the task at hand. This poses an impediment to the attainment of quality results, especially for less experienced users.

We view this as a problem because it impedes users from realizing the potential benefits of delegating data-intensive work to online workers. We hope that, with the increasing popularity of crowdsourcing, even novices without expertise in interface design and crowdsourcing can participate in a freely flowing market for labor transfer. Thus, it is important to understand what makes instructions bad and how to write instructions that people can understand efficiently.

The key contributions of this work are as follows:

1. Evaluations of tasks scraped from Mechanical Turk measured (a) workers' assessment of the task quality, and (b) adherence to established best practices for task design.

2. Actual effects of specific best practices on desired outcomes (e.g., accuracy, worker acceptability, etc.) were measured for a single task by creating systematic mutations and measuring workers' perceptions (by inspection) and actual performance (by posting the tasks).

3. Results of the experiments show that (a) adherence to best practices can affect outcomes, but that (b) workers are more resilient to flaws in task quality than current popular belief might suggest.

Background

The impact of interface design (in general) on human performance is well-established (Nielsen 1994; Shneiderman 2010). When building systems that leverage crowd work for new application types, or using new strategies, task design can be a formidable challenge. Despite the importance of task design, research about improving quality of data obtained from crowds has long been dominated by studies of incentives (Ho et al. 2015; Horton and Chilton 2010; Mason and Watts 2010; Rogstadius et al. 2011; Yin, Chen, and Sun 2013; Shaw, Horton, and Chen 2011, etc.) and optimization algorithms (Ipeirotis, Provost, and Wang 2010; Barowy et al. 2012; Dai, Mausam, and Weld 2010, etc.). The assumption behind these approaches is that workers are lazy, and will submit poor quality work, unless the reward structure incents them to do otherwise. This is at odds with the prevailing belief in usability circles, that design is the responsibility of designers.

Importance of instructions

Unclear instructions are often cited as a main challenge for crowd workers. Misunderstandings can adversely affect workers' ratings by forcing them to abandon a HIT or leave it incomplete (Silberman et al. 2010). A subsequent study by Schulze et al. reported that task clarity impacts workers' acceptance of a task (Schulze et al. 2011). Clarity of instructions has also been found to improve usability for low-income workers in India (Khanna et al. 2010).

Poorly designed instructions affect not only the result quality (Kittur et al. 2013) but also the relationship between requesters and workers (McInnis et al. 2016). Long-standing policies of Mechanical Turk give workers little recourse in case of disputes over the correctness of their work. The task instructions are the only explicit expression of the task requirements, and hence serve as an informal contract.

Studies of task design

A recent study1 by Gadiraju et al. (2017) investigated task clarity using a sample of 7,100 tasks from Mechanical Turk. Workers rated them for clarity, which the authors frame in terms of "goal" (what is needed) and "role" (steps to achieve it). The sampled task instructions were found to be moderately clear. They also tested whether a computational model could predict clarity based on a set of features, including task metadata, type, content, and readability. Using the model to analyze clarity on the site over a 1-year period, they found no monotonic trend in the overall average task clarity. (In other words, clarity of tasks waxes and wanes over time.) The goals of that work are similar to ours—to better understand the role of task clarity in microtasks. This paper is complementary in that it tests interactions with popular best practices (i.e., by inspection and by a random selection experiment) and with desirable outcomes (i.e., accuracy, worker acceptability, etc.).

Jain et al. recently analyzed 27 million microtasks from the marketplace. Their examination included a spectrum of facets, including marketplace dynamics, task design, and worker behavior. Their work examined a far greater number of task designs, but using mostly automated methods. For example, they focus on what components in a page—e.g., examples, images, number of inputs—predict agreement (Jain et al. 2017).

Alagarai et al. studied task design from a psychological perspective. They modified the visual saliency of the target fields and the working memory requirements of the task to see how these cognitively inspired features affect workers. The results suggested that the performance of crowd workers can be maximized by using these features for task design (Alagarai Sampath, Rajeshuni, and Indurkhya 2014).

1 The study by Gadiraju et al. was published in July 2017. We learned of it just prior to the final submission of this paper.

Alonso and Baeza-Yates (2011) explored the design and execution of relevance judgments. Design here has a broader meaning: it refers not only to the interface design but also to task mechanics such as when to post the task and how to filter the workers.

Complementary approaches

If task design is an acquired skill, then one possible mitigation is to aid novice requesters in producing clearer tasks. Fantasktic was a system developed to test that approach. It used a wizard interface to elicit task requirements from requesters, and allowed them to preview the task before posting. It led to more consistent results, but was limited to a narrow set of task types (Gutheim and Hartmann 2012).

Instructions can only be effective if they are read. One approach is to use a modified Instructional Manipulation Check (IMC) (Oppenheimer, Meyvis, and Davidenko 2009) to check if workers are reading instructions carefully (Kapelner and Chandler 2010; Goodman, Cryder, and Cheema 2013). These are often referred to as "attention checks" in the context of crowdsourcing (Peer, Vosgerau, and Acquisti 2013; Hauser and Schwarz 2016, etc.).

As mentioned above, incentives are often a focus of research on quality in crowdsourcing. Besides incenting good-faith effort, some incentive structures can also increase workers' intrinsic motivation or help to select personality profiles that are well-suited to the task. Studies of workers' intrinsic motivation (Law et al. 2016) found that workers tend to be more productive if tasks are framed as something meaningful. For tasks that benefit from workers with particular motivation profiles, it is possible to design incentives to attract those workers (Hsieh and Kocielnik 2016).

While our study focused on the task design—the instructions and form fields a worker sees after agreeing to do the task—others have tested the effects of other parameters, such as the number of labels requested per image, the number of HITs posted at once, and the reward. Huang et al. generated behavioral models to predict the results of an image labeling task from those factors (Huang et al. 2010). They showed that simple models can accurately predict the quality of output per unit task. On the other hand, Grady and Lease tested the effects of a similar set of factors (e.g., query, terminology, pay, bonus) on accuracy, cost, and time spent, for the task of relevance assessment (Grady and Lease 2010). Although their findings were largely inconclusive, they identify obstacles and potential paths for future investigation.

Evaluation of existing HIT instructions

To assess the status quo of task design on Mechanical Turk, workers were engaged to assess the quality of tasks found on the site, based on screenshots.

Justification

The goal was to acquire descriptive metrics of existing tasks, and learn how the tasks would perform. Several possible approaches were considered. The metrics of interest (Figure 1) all relate to interactions with the workers (e.g., whether they understand the vocabulary), so any approach would need to involve workers. Thus, a computational analysis (i.e., NLP) would be inappropriate.

Figure 1: This questionnaire was used to measure workers' perceptions of task quality for scraped HITs.

Ideally, we could have measured the actual uptake rate (% of workers who accept after previewing), enjoyment, and accuracy. However, many HITs are backed by web applications which could not be easily replicated without access to requesters' code. Directly engaging with requesters was ruled out, since participation bias would be inevitable.

Instruction Metrics

The questions used to elicit workers' assessments of tasks (Figure 1) were based on two types of metrics: descriptive (properties of the task itself) and prospective (workers' prediction of outcomes related to accuracy, worker acceptability, etc.).

Descriptive metrics Although most instructions on Amazon Mechanical Turk fall within the scope of procedural instructions, they differ in that people create crowdsourcing tasks for many different purposes. Therefore, common guidelines or tips for creating instructions are not directly applicable to these tasks. We chose three descriptive metrics that are relevant to instructions in general (Wright 1998) and also applicable to crowdsourcing.

1. Vocabulary use in instructions. Differences in domain knowledge influence how people interpret and comply with instructions. Not only technical jargon but also uncommon vocabulary can fail to communicate. The vocabulary used in instructions is critical; it is the fundamental element of every sentence and key to helping workers understand the instructions.

2. Data specification. The specification of the desired data is critical for microtask crowdsourcing, because it is widely used for data collection. Workers are at high risk of being rejected if they do not know the criteria for the data and do not understand what they should submit.

3. Logical order of the instruction layout. People form a mental model in the process of reading instructions; they aim to find the information that fills the slots in their current mental schema. Any contradiction may lead to confusion and misunderstanding. Also, the order of the instructions is critical, especially when people are unfamiliar with the task.

Prospective metrics The other three questions were designed to measure workers' expected reactions if they encountered the HIT. These are based on the outcomes mentioned throughout this paper: accuracy, throughput, trust, and worker satisfaction.

4. Confidence that the worker "could provide correct answers" is a proxy for accuracy. This presumes that confident answers are more likely to be accurate than unconfident ones.

5. Enjoyment is important because we presume that workers who enjoy a task will be more likely to continue it, thus learning to perform the task well and contributing to accuracy and throughput.

6. Acceptance refers to the likelihood that a worker would accept the task after previewing it. This is important for throughput. (In this paper, "acceptance" refers to a single worker, while "uptake" refers to the proportion of all workers who accept a HIT after previewing it. They refer to the same thing.)

The prospective metrics are essentially expressed preferences. These were used, despite the inherent limitations, because revealed preferences would be infeasible to observe for any substantial sample of in-the-wild tasks.

Data collection

The metrics described above were measured using a 5-point Likert scale presented directly above a screenshot of the scraped task design (Figure 1). One task design was presented in each of our HITs. Workers were allowed to rate multiple HITs, but were not required to rate all of them. The evaluation form was available on Mechanical Turk for 3 days. A total of 167 distinct workers rated 135 HITs, with 3 judgments per HIT (405 judgments total). Workers were paid $0.50 per HIT (average hourly rate: $10.70 per hour).

Sampling of scraped HITs Scraped HITs were sampled to represent the population of requesters, rather than the set of all HITs or HIT groups available at a given time. Top requesters account for more than 30 percent of the overall activity of the market (Ipeirotis 2010). To avoid picking too many HITs from the top requesters and neglecting the minority, we categorized the HITs by requester name. There were 1300 requesters in total, from which we randomly selected 135 requesters, and one HIT per requester (135 HITs total). Web-scraped HITs that were obviously not answerable, such as totally blank tasks, consent forms, and those which hide instructions in preview mode or require a link to an external website for further information, were filtered out. Each screenshot was collected before the task started. While we were collecting screenshots, we also collected each requester's history of HIT posts using Mechanical Turk-Tracker (Ipeirotis 2010).

Figure 2: Most HITs were evaluated positively by workers on the descriptive metrics. (Note: Definitions of each factor were given on the previous page. Display order in this figure and Figure 3 are different from the questionnaire in Figure 1.)

Factor               Mean   Std. Dev
Vocabulary use       4.25   0.70
Data specification   3.99   0.78
Logical order        4.07   0.79
Overall score        4.10   0.66

Table 1: Descriptive metrics, mean of 3 ratings × 135 HITs
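The requester-stratified sampling described above can be sketched in a few lines. This is a minimal illustration assuming the scraped HITs are available as a list of records with hypothetical field names; it is not the authors' actual scraping pipeline.

    import random

    # scraped_hits: hypothetical list of dicts, one per scraped HIT, e.g.
    #   {"requester": "Some Requester", "hit_id": "ABC123", "screenshot": "..."}
    def sample_one_hit_per_requester(scraped_hits, n_requesters=135, seed=0):
        rng = random.Random(seed)
        by_requester = {}
        for hit in scraped_hits:
            by_requester.setdefault(hit["requester"], []).append(hit)
        # Draw 135 distinct requesters, then one HIT from each, so that
        # highly active requesters cannot dominate the sample.
        chosen = rng.sample(sorted(by_requester), k=n_requesters)
        return [rng.choice(by_requester[name]) for name in chosen]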

Results

Results were analyzed based on the average of the 3 judgments, with values between 1 and 5. The most positive value (e.g., "very easy to understand", etc.) was coded as 5.
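As a small illustration of this coding, the per-HIT scores can be computed by averaging the three judgments for each HIT and metric; the record format below is hypothetical, not the authors' data format.

    from collections import defaultdict

    # judgments: hypothetical records such as
    #   {"hit_id": "H017", "metric": "vocabulary", "rating": 5}
    # where ratings are coded 1 (most negative) to 5 (most positive).
    def mean_scores(judgments):
        totals = defaultdict(lambda: [0.0, 0])
        for j in judgments:
            key = (j["hit_id"], j["metric"])
            totals[key][0] += j["rating"]
            totals[key][1] += 1
        # Average of the 3 judgments collected for each HIT and metric.
        return {key: s / n for key, (s, n) in totals.items()}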

Descriptive metrics

Among the three descriptive metrics, "Vocabulary use" had the highest mean score (4.25 out of 5.00). This indicates that the overall vocabulary used in these HITs was easy for workers to understand.

The lowest mean scores were for Data specification (3.99) and Logical order (4.07). (See Table 1.)

Most HITs had a mean score between 3.00 and 5.00 for all of the descriptive metrics, with relatively high proportions between 4.00 and 5.00 (Figure 2). The mean score across all of the descriptive metrics was 4.10. These results suggest that the instructions created by most requesters on Amazon Mechanical Turk may actually be designed adequately, despite the prevalent concerns about poor task design.

Figure 3: Prospective metrics were widely distributed. The most positive expected outcome was acceptance (of a HIT if it were offered). The most negative was enjoyment. This might suggest some degree of willingness to do tasks they do not enjoy.

Expectation      Mean   Std. Dev
Acceptance       3.33   1.07
Enjoyment        3.09   0.92
Confidence       3.60   0.98
Overall score    3.35   0.93

Table 2: Prospective metrics, mean of 3 ratings × 135 HITs

Prospective metrics

The means of each of the three prospective metrics were all over 3.00. The average of the three together was 3.35. This indicates an overall positive assessment with respect to workers' expected reactions if they encountered these HITs. This is not unexpected. If most workers truly felt that they would not accept most HITs, and would be unconfident of their accuracy and enjoyment when they did accept HITs, the market would not be expected to function productively. However, variation in the response distribution suggests that differences among tasks may affect these metrics. For each prospective measure, a significant proportion of the HITs received average scores below 3.00 (Table 2). For those HITs, workers did not express confidence that they would be accepted, enjoyed, and/or performed correctly.

Using the results from these two parts, we performed a linear regression test to find whether there is a linear relationship between the factors measured in the previous part and workers' expectations toward the instructions. We summarize the results in Table 3.

                  Acceptance   Enjoyment   Confidence
Vocabulary use    0.369        0.391       0.339
Data specs        0.429        0.381       0.488
Logical order     0.393        0.427       0.243

Table 3: Adjusted R² of each pair from the linear regression test, p < 0.05

Except for Logical order vs. Confidence, the adjusted R² of each pair is larger than 0.3. This means that, from the workers' view, a portion of the variation in their expected reactions if the HITs were encountered (prospective metrics) is accounted for by the quality of the instructions (descriptive metrics). If the prospective metrics are predictive of workers' actual reactions, then this would suggest that the quality of instructions does affect the outcomes (accuracy, worker acceptability, etc.).

Instructions are not the only property expected to influence these outcomes (Schulze et al. 2011). Therefore, an adjusted R² of 0.3 indicates that, of the variance that can be explained by instructions, these facets of task design quality have an important role.
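For readers who want to reproduce this kind of analysis, a minimal sketch for one (descriptive, prospective) pair is shown below, using SciPy's simple linear regression and the standard adjusted R² formula. The variable names are illustrative, and the inputs are assumed to be the per-HIT mean scores rather than any released data.

    import numpy as np
    from scipy.stats import linregress

    def adjusted_r2(descriptive_scores, prospective_scores):
        # Regress one prospective metric (e.g., Acceptance) on one
        # descriptive metric (e.g., Data specification), per-HIT means.
        x = np.asarray(descriptive_scores, dtype=float)
        y = np.asarray(prospective_scores, dtype=float)
        fit = linregress(x, y)
        r2 = fit.rvalue ** 2
        n, p = len(x), 1          # 135 HITs, one predictor
        return 1 - (1 - r2) * (n - 1) / (n - p - 1)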

Causation cannot be inferred from this study, since a random selection design was not feasible with in-the-wild task designs. In the next section, we present a separate, complementary study, which used random selection (of best practices) for a model task design.

Adherence to best practices

For an objective assessment, all of the scraped HITs werechecked for adherence to the following best practices:

1. BULLETS. Use bullets for steps, data items, or rules.

2. MEANING. Criteria and meaning of inputs should be concretely defined.

3. FORMAT. Specify formatting requirements explicitly.

4. TOOLS. Specify how the task should be performed, including any tools or methods required.

5. EXAMPLES. Use examples to illustrate expectations.

6. EXPLANATIONS. Explain criteria for acceptance in detail and clearly state what kind of errors would trigger a rejection of the HIT, to avoid unfair rejections.

7. INDEPENDENCE. Ensure that every question will be answerable, regardless of the answers to the others.

These guidelines are recommended by several guides and blogs written by experienced requesters. Our primary source was the Amazon Requester Best Practices Guide2, since its authors presumably have a particularly strong interest in helping requesters produce task instructions that will help the market function productively. We also surveyed several other guides, including the ProPublica Guide to Mechanical Turk3, the MTurk blog4, Crowdsource.com Do's and Don'ts5, a guide by Edwin Chen6, and detailed forum posts found at Quora.com7 and TurkerNation8.

2 https://mturkpublic.s3.amazonaws.com/docs/MTURK_BP.pdf
3 https://www.propublica.org/article/propublicas-guide-to-mechanical-turk
4 https://blog.mturk.com/
5 https://www.crowdsource.com/blog/2012/07/hit-layouts-dos-and-donts-part-1

Figure 4: The 135 task instructions were categorized by an independent rater to determine which of the guidelines were followed. Tasks for which a particular guideline would be irrelevant or could not be evaluated were marked as N/A, and are excluded from the given column. These guidelines are explained in the text.

The seven guidelines above were widely repeated across these sources. Note that we excluded guides that focus on compensation or ethical considerations (e.g., WeAreDynamo guidelines, etc.).

These guidelines do not serve as golden rules for creating instructions. Some of the guidelines cannot be applied to certain kinds of tasks. For example, answering a survey does not require examples for inputs or a specification of tools or methods. Instructions that can be clearly explained in one sentence do not need bullets. Therefore, before the rating, we set up rules to separate out these cases and applied the measures only to the instructions for which they apply. An expert rated the 135 instructions in 5 hours; the results are in Figure 4. For every criterion except "Explain accept/reject criteria and approval process," the proportion of requesters who followed it was larger than the proportion who did not. Up to 81% of the requesters clearly explained the criteria and meaning of inputs, while 77% of requesters did not explain the accept/reject criteria. These results show that the sampled instructions follow most of the surveyed guidelines, which adds validity to the "negative" results in the previous sections.
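A sketch of how the per-guideline adherence proportions in Figure 4 could be tallied, with N/A cases excluded, is shown below; the rating records and field names are hypothetical, not the rater's actual worksheet.

    GUIDELINES = ["BULLETS", "MEANING", "FORMAT", "TOOLS",
                  "EXAMPLES", "EXPLANATIONS", "INDEPENDENCE"]

    # ratings: hypothetical records, one per (HIT, guideline) pair,
    # with verdict in {"yes", "no", "n/a"}.
    def adherence_rates(ratings):
        rates = {}
        for g in GUIDELINES:
            verdicts = [r["verdict"] for r in ratings if r["guideline"] == g]
            applicable = [v for v in verdicts if v != "n/a"]  # drop N/A tasks
            if applicable:
                rates[g] = applicable.count("yes") / len(applicable)
        return rates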

Discussion

We set out to document the problems with task design that have been widely presumed to be prevalent. We evaluated the instructions in three different ways: (a) descriptive metrics rated by workers, (b) prospective metrics rated by workers, and (c) adherence to best practices.

Descriptive metrics were more positive than expected, across all three. For 92.6% (125/135) of the HITs, scores were between 3.00 and 5.00, indicating that from the workers' perspective, most instructions were properly designed.

6 http://archive.is/tBmxM#selection-21.0-32.0
7 http://qr.ae/Tbc2qa
8 http://turkernation.com/showthread.php?21300-How-to-improve-the-poor-task-instruction


One possibility is that our metrics did not capture some other factors that bother workers. It is possible that additional factors might emerge only in the actual performance of the tasks, a situation which we were unable to measure. Another possibility is that confusing tasks cause memorable annoyance to workers that is disproportionate to the actual problem. Given the productive functioning of the market, we find this explanation plausible.

Prospective metrics revealed much greater variation, compared with the descriptive metrics. In this case, the possibility of missed metrics would not be a factor, since the prospective metrics are related to desirable outcomes (accuracy, worker acceptability, etc.). One possible explanation for the difference in variation between prospective and descriptive metrics is that workers' responses to the HITs might be influenced by exogenous factors (e.g., interest in the topic) that are unrelated to the quality of the task design.

Adherence to best practices was relatively high. The most violated guideline was Explanations; many requesters do not explicitly explain their criteria for acceptance of results.

The positive assessment of the scraped task instructions contradicts our observations and experiences, and those of many other crowdsourcing researchers with whom we have discussed this. One possible explanation is that, as researchers, the interfaces we need to create are inherently more demanding, and thus our experiences are not representative.

Another possible factor is that in our evaluation, workers only inspected static screenshots of task interfaces, and did not perform them under organic conditions. Thus, they may not have encountered the problems that would cause frustration due to poor instruction quality. In other words, our evaluation measured expressed preferences rather than revealed preferences. Measuring workers' actual behavior while doing these tasks would normally be impossible.

Since usability is well-known to impact performance and satisfaction, and learning to design interfaces takes some combination of experience and possibly aptitude, requesters who are unable to achieve good results in their initial attempts may be expected to leave and try other avenues. The type of task may also affect this, since some types of tasks are much easier to design for than others, in our experience. When requesters with such tasks receive poor results, they may also leave. What would remain would be only the requesters with the resources to create good quality tasks, and with tasks that are amenable. Therefore, the assessment of the scraped task instructions is biased by requesters who "survived" and are still active on the platform.

This theory can be framed in terms of survivorship bias (Brown et al. 1992), a type of selection bias in experiments. The stereotypical example of survivorship bias is a pharmaceutical trial in which participants who do not benefit from a treatment are allowed to exit the study, leading to a final evaluation based on participants who disproportionately benefited. In our case, we are supposing that requesters who did not benefit from the help of crowd workers—perhaps because the requesters' task designs were inadequate and led to poor data—chose to stop using the platform, leaving only proficient task designers.

These assumptions are supported by the historical data from MTurk-Tracker (Difallah et al. 2015), a site which monitors activity on Mechanical Turk. Of the 135 requesters covered by our sample, 128 had posted HITs in the past. If task design is an acquired skill, then inexperienced requesters would be expected to produce less clear instructions than experts. More evidence is needed to test this theory.

Systematically mutated task instructions

To see how instructions influence workers, we systematically mutated instructions based on the same best practices used in the previous study. The goal was to understand how these influence actual outcomes (e.g., accuracy, worker acceptability, etc.). We also measured the mutated tasks using the questionnaire from the previous study, though this part was inconclusive.

Methods

A single model task was used (Figure 5). It was chosen because it met a few necessary criteria: (a) Ground truth is known. (b) An accurate response depends on careful reading of the instructions. (c) All of the best practices are directly applicable, and can be applied independent of one another.

The task asked workers to extract information from a picture and provide the right information based on the given constraints. Note that the purpose of this experiment is to study how the design principles affect workers' actual performance. Since wide variation in the task would make this study infeasible, we kept the task contents static for all workers (except for the variations in adherence to best practices). Although the results would be more generalizable if more task types were tested, we believe this study design still advances understanding of how task design guidelines affect outcomes.

To effect a random selection experiment, we selectively followed one of the design principles at a time, while deliberately violating the rest. For example, we can choose whether or not to use bullets when writing the instructions.

To validate that the mutations matched our study design, an expert rater (a graduate student from outside our lab) reviewed all of the variations and confirmed that they meet or do not meet the best practices as we intended. (In other words, the code for mutating the tasks works correctly.)

We measured workers’ performance in the followingways: workers’ uptake (i.e., meaning the percentage ofworkers who accept versus those who viewed); workers’ en-joyment of doing the task; time spent on the task; and theaccuracy of the results.

Each version of the task adhered to one of the principles while deliberately violating the others. For the control, we used a version that deliberately violated all of the best practices. Thus, eight versions of the instructions (1 control + 7 experimental)—all requesting the same work—were used. To measure workers' enjoyment, we added a question at the bottom of the instructions and asked them to report how they felt. Each worker was only allowed to do the task once.
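The condition structure can be written down compactly: one control that violates every guideline, plus seven versions that each follow exactly one. The sketch below is only an illustration of that assignment, not the code used to generate the actual task pages.

    PRINCIPLES = ["BULLETS", "MEANING", "FORMAT", "TOOLS",
                  "EXAMPLES", "EXPLANATIONS", "INDEPENDENCE"]

    def build_conditions():
        # Control: every best practice deliberately violated.
        conditions = {"CONTROL": {p: False for p in PRINCIPLES}}
        # Seven experimental versions, each adhering to exactly one principle.
        for followed in PRINCIPLES:
            conditions[followed] = {p: (p == followed) for p in PRINCIPLES}
        return conditions  # 8 versions of the same underlying task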


Figure 5: Task used in this part. The instructions on the left are the control group. We used the guidelines to modify the task and created 7 different sets of instructions. Each version of the instructions applied only one design principle.

Results for systematically mutated instructions

A total of 120 unique workers performed the task. Each of the 8 versions of the task (7 best practices + 1 control) was performed by 15 workers, who were paid $0.75 each (average hourly rate: $9.59 per hour).

We performed a one-way between-subjects ANOVA to test whether there exist significant differences between the groups. For time spent (p = 0.457) and satisfaction (p = 0.865), there are no significant differences between the groups (p > 0.05). This was not unexpected, since there are many other factors involved. Time spent may be influenced by a worker's environment or overall capability to perform the task. Satisfaction may be influenced by a worker's interest in the task content, or the adequacy and importance of the reward (Kaufmann, Schulze, and Veit 2011; Rogstadius et al. 2011). We believe this experiment may not have been powerful enough to determine if there was an effect. As for uptake rate and accuracy, we found significant differences among the groups (p < 0.05).
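A minimal sketch of such a one-way between-subjects test with SciPy is shown below; the group lists are placeholders for the per-worker measurements (e.g., accuracy or time spent), not the collected data.

    from scipy.stats import f_oneway

    def anova_across_conditions(groups):
        # groups: dict mapping condition name -> list of per-worker values
        # (15 workers per condition in this study).
        stat, p_value = f_oneway(*groups.values())
        return stat, p_value  # p < 0.05: at least one condition differs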

We use Dunnett’s test as post-hoc test to find which prac-tice affects uptake rate. The results are reported in Table 4.From the table, only the absolute difference of the instruc-tions with explained accept/reject criteria is larger than Crit-ical Difference, which means applying this practice signifi-cantly affects workers’ uptake rate. However, it also meansthat applying this practice lowers workers’ willingness toaccept the task, because the control group has the highest

Bullets Meaning Format Tools Examples Exp Ind

0.195 0.262 0.209 0.182 0.108 0.454 0.199

Table 4: Uptake absolute differences between control andexperiment groups. Critical Difference = 0.283. “Exp” =“Explanation” (of acceptance criteria); Ind = “Indepen-dence” (of questions in the HIT). p<0.05.

uptake rate among all groups. To understand the issue, wecount the words in each group and find that the instruc-tions with accept/reject criteria have 199 words, which isthe largest word count among the group and is much largerthan the average word count of the remaining groups (90.7).Based on this observation, the significance might be causedby the length of the instructions. Workers are likely to skipthe content-heavy task and thus lower the uptake rate. Onthe other hand, the control group only has 89 words, whichis nearly close to the average word count (90.7). The differ-ences might be too subtle to affect a typical worker’s impres-sion of the task or explain the results, which echoes priorfindings that workers prefer short and straightforward task(Schulze et al. 2011).
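The control-versus-treatment comparisons can also be run directly in SciPy (version 1.11 or later provides scipy.stats.dunnett). The sketch below uses placeholder per-worker values rather than the study's data, so it illustrates the procedure, not the reported numbers.

    from scipy.stats import dunnett

    def compare_to_control(control_values, treatment_groups):
        # treatment_groups: dict of condition name -> list of per-worker values.
        # Dunnett's test compares every treatment against the single control
        # while controlling the family-wise error rate.
        names = list(treatment_groups)
        result = dunnett(*(treatment_groups[n] for n in names),
                         control=control_values)
        return dict(zip(names, result.pvalue))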

Lesson: Compactness has the greatest impact on uptake, among all of these guidelines. To reduce job turnaround time, keep tasks short.

             Bullets  Meaning  Format  Tools  Examples  Exp    Ind
Difference   0.033    0.100    0.100   0.300  0.200     0.033  0.200

Table 5: Accuracy absolute difference between the control group and experiment groups. Critical Difference = 0.320. "Exp" = "Explanation" (of acceptance criteria); "Ind" = "Independence" (of questions in the HIT). p < 0.05.

For accuracy, we adopt the same method to compare the experiment groups with the control group. The results show that there is no significant difference between the control group and the experiment groups, because all the absolute differences are less than the Critical Difference (see Table 5). We then use Tukey's method to make pairwise comparisons for all possible pairs. The critical difference for this method is 0.372. Results show that there are significant differences between Tools and Explanations (difference = 0.4 > 0.372), Tools and Independence (difference = 0.5 > 0.372), and Examples and Independence (difference = 0.4 > 0.372). In our task, we ask workers to provide shortened URLs. However, many workers did not shorten the URL as requested. From the results above, we find that two guidelines—Tools and Examples—are related to the shortened URLs. These principles may help workers enter the correct results.
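A sketch of the all-pairs comparison with statsmodels' Tukey HSD is shown below, operating on flattened (value, condition) arrays; the inputs are placeholders, not the study's accuracy data.

    import numpy as np
    from statsmodels.stats.multicomp import pairwise_tukeyhsd

    def tukey_all_pairs(groups, alpha=0.05):
        # groups: dict mapping condition name -> list of per-worker accuracies.
        values = np.concatenate([np.asarray(v, dtype=float) for v in groups.values()])
        labels = np.concatenate([[name] * len(v) for name, v in groups.items()])
        # Tukey's HSD tests every pair of conditions at the given alpha.
        return pairwise_tukeyhsd(values, labels, alpha=alpha)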

Lesson: For accurate results, explicitly state the toolsneeded for the task, and give concrete examples.

In this experiment, we systematically modified the instructions based on the seven design principles. Eight different sets of instructions with the same task type were generated. From the results, we find that the design principles do not affect workers' time spent on a task or their enjoyment. However, they directly or indirectly affect workers' uptake and accuracy. For example, applying these design principles changes the number of words in the instructions and affects workers' uptake. Workers are likely to skip content-heavy instructions even when the task type is the same.

On the other hand, applying these principles directly affects the accuracy of the results. For example, a common mistake made by workers was failing to shorten the URL. In that case, we find a significant accuracy improvement between the instructions that applied the design principles related to shortening the URL and those which did not. This implies that the design of instructions does have an influence on workers' behavior. However, it does not mean that using certain principles guarantees the quality of the results. Requesters should consider the requirements of the task and apply the principles appropriately when designing instructions for different situations.

Ratings. As before, each HIT was evaluated by 3 workers. We compared HITs on the average of the 3 ratings. Workers were paid $0.75 (average hourly rate: $4.90 per hour).

To test for an effect between the ratings and actual performance, we used a two-way ANOVA on the enjoyment data, the measure common to the first study and the mutated instructions, to see if there exists a difference between workers who rated the task by inspection and workers who actually did the task. We found no statistically significant effect (p = 0.0556).

Conclusion

We examined the relationship between task design best practices, worker perceptions of quality, and actual outcomes.

1. Collected HITs were assessed by workers, and checked by an expert for adherence to best practices.

2. A model HIT was systematically mutated to learn how best practices affect outcomes.

The first study found that, contrary to our expectations, tasks created by requesters currently using the market were positively viewed by most workers. We also found widespread adherence to most of the best practices, except that many tasks fail to explain criteria for acceptance. We suspect survivorship bias may be a factor. Requesters who get poor results from workers may either improve their task designs or stop using the platform. This would suggest that these results are higher than what would be expected if all requesters were novices.

The second study showed substantial variation in the effects of the best practice guidelines on outcomes. One insight was that worker uptake—the proportion of workers who accept the task, relative to those who viewed it—is higher for short tasks. This illustrates a tension: although more details can increase trust (e.g., clear criteria) and accuracy (e.g., clear steps, format specifications), they may also make tasks less appealing to workers, leading to longer delays to receive final results.

Acknowledgement

We gratefully acknowledge Gaoping Huang and Hsuan Hsieh for thoughtful feedback on the experiment design, Yu-Hsuan Lai for evaluating workers' instructions, and the 597 workers who contributed throughout all phases of this work.

References

Alagarai Sampath, H.; Rajeshuni, R.; and Indurkhya, B. 2014. Cognitively inspired task design to improve user performance on crowdsourcing platforms. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '14, 3665–3674. New York, NY, USA: ACM.
Alonso, O., and Baeza-Yates, R. 2011. Design and implementation of relevance assessments using crowdsourcing. In Advances in Information Retrieval. Springer. 153–164.
Barowy, D. W.; Curtsinger, C.; Berger, E. D.; and McGregor, A. 2012. AutoMan: A platform for integrating human-based and digital computation. In Proceedings of the ACM International Conference on Object Oriented Programming Systems Languages and Applications, OOPSLA '12, 639–654. New York, NY, USA: ACM.
Brown, S. J.; Goetzmann, W.; Ibbotson, R. G.; and Ross, S. A. 1992. Survivorship bias in performance studies. Review of Financial Studies 5(4):553–580.
Dai, P.; Mausam; and Weld, D. S. 2010. Decision-theoretic control of crowd-sourced workflows. In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, AAAI '10, 1168–1174. Atlanta, Georgia: AAAI Press.
Difallah, D. E.; Catasta, M.; Demartini, G.; Ipeirotis, P. G.; and Cudre-Mauroux, P. 2015. The dynamics of micro-task crowdsourcing: The case of Amazon MTurk. In Proceedings of the 24th International Conference on World Wide Web, WWW '15, 238–247. Republic and Canton of Geneva, Switzerland: International World Wide Web Conferences Steering Committee.
Gadiraju, U.; Yang, J.; and Bozzon, A. 2017. Clarity is a worthwhile quality: On the role of task clarity in microtask crowdsourcing. In Proceedings of the 28th ACM Conference on Hypertext and Social Media, HT '17, 5–14. New York, NY, USA: ACM.
Goodman, J. K.; Cryder, C. E.; and Cheema, A. 2013. Data collection in a flat world: The strengths and weaknesses of Mechanical Turk samples. Journal of Behavioral Decision Making 26(3):213–224.
Grady, C., and Lease, M. 2010. Crowdsourcing document relevance assessment with Mechanical Turk. In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, CSLDAMT '10, 172–179. Stroudsburg, PA, USA: Association for Computational Linguistics.
Gutheim, P., and Hartmann, B. 2012. Fantasktic: Improving quality of results for novice crowdsourcing users. Master's thesis, EECS Department, University of California, Berkeley.
Hauser, D. J., and Schwarz, N. 2016. Attentive Turkers: MTurk participants perform better on online attention checks than do subject pool participants. Behavior Research Methods 48(1):400–407.
Ho, C.-J.; Slivkins, A.; Suri, S.; and Vaughan, J. W. 2015. Incentivizing high quality crowdwork. In Proceedings of the 24th International Conference on World Wide Web, WWW '15, 419–429. Republic and Canton of Geneva, Switzerland: International World Wide Web Conferences Steering Committee.
Horton, J. J., and Chilton, L. B. 2010. The labor economics of paid crowdsourcing. In Proceedings of the 11th ACM Conference on Electronic Commerce, EC '10, 209–218. New York, NY, USA: ACM.
Hsieh, G., and Kocielnik, R. 2016. You get who you pay for: The impact of incentives on participation bias. In Proceedings of the 19th ACM Conference on Computer-Supported Cooperative Work and Social Computing, CSCW '16, 823–835. New York, NY, USA: ACM.
Huang, E.; Zhang, H.; Parkes, D. C.; Gajos, K. Z.; and Chen, Y. 2010. Toward automatic task design: A progress report. In Proceedings of the ACM SIGKDD Workshop on Human Computation, HCOMP '10, 77–85. New York, NY, USA: ACM.
Ipeirotis, P. G.; Provost, F.; and Wang, J. 2010. Quality management on Amazon Mechanical Turk. In Proceedings of the ACM SIGKDD Workshop on Human Computation, HCOMP '10, 64–67. New York, NY, USA: ACM.
Ipeirotis, P. G. 2010. Analyzing the Amazon Mechanical Turk marketplace. XRDS 17(2):16–21.
Jain, A.; Sarma, A. D.; Parameswaran, A.; and Widom, J. 2017. Understanding workers, developing effective tasks, and enhancing marketplace dynamics: A study of a large crowdsourcing marketplace. Proc. VLDB Endow. 10(7):829–840.
Kapelner, A., and Chandler, D. 2010. Preventing satisficing in online surveys. In Proceedings of 2010 CrowdConf.
Kaufmann, N.; Schulze, T.; and Veit, D. 2011. More than fun and money. Worker motivation in crowdsourcing - a study on Mechanical Turk. In AMCIS, volume 11, 1–11.
Khanna, S.; Ratan, A.; Davis, J.; and Thies, W. 2010. Evaluating and improving the usability of Mechanical Turk for low-income workers in India. In Proceedings of the First ACM Symposium on Computing for Development, ACM DEV '10, 12:1–12:10. New York, NY, USA: ACM.
Kittur, A.; Nickerson, J. V.; Bernstein, M.; Gerber, E.; Shaw, A.; Zimmerman, J.; Lease, M.; and Horton, J. 2013. The future of crowd work. In Proceedings of the 16th Conference on Computer Supported Cooperative Work and Social Computing, CSCW '13, 1301–1318. New York, NY, USA: ACM.
Law, E.; Yin, M.; Goh, J.; Chen, K.; Terry, M. A.; and Gajos, K. Z. 2016. Curiosity killed the cat, but makes crowdwork better. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, CHI '16, 4098–4110. New York, NY, USA: ACM.
Mason, W., and Watts, D. J. 2010. Financial incentives and the "Performance of Crowds". SIGKDD Explor. Newsl. 11(2):100–108.
McInnis, B.; Cosley, D.; Nam, C.; and Leshed, G. 2016. Taking a HIT: Designing around rejection, mistrust, risk, and workers' experiences in Amazon Mechanical Turk. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, CHI '16, 2271–2282. New York, NY, USA: ACM.
Nielsen, J. 1994. Usability Engineering. Elsevier.
Oppenheimer, D. M.; Meyvis, T.; and Davidenko, N. 2009. Instructional manipulation checks: Detecting satisficing to increase statistical power. Journal of Experimental Social Psychology 45(4):867–872.
Peer, E.; Vosgerau, J.; and Acquisti, A. 2013. Reputation as a sufficient condition for data quality on Amazon Mechanical Turk. Behavior Research Methods 46(4):1023–1031.
Rogstadius, J.; Kostakos, V.; Kittur, A.; Smus, B.; Laredo, J.; and Vukovic, M. 2011. An assessment of intrinsic and extrinsic motivation on task performance in crowdsourcing markets. In Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media, ICWSM '11, 321–328.
Schulze, T.; Seedorf, S.; Geiger, D.; Kaufmann, N.; and Schader, M. 2011. Exploring task properties in crowdsourcing - an empirical study on Mechanical Turk. In 19th European Conference on Information Systems, ECIS '11. Atlanta, GA: AISeL.
Shaw, A. D.; Horton, J. J.; and Chen, D. L. 2011. Designing incentives for inexpert human raters. In Proceedings of the ACM 2011 Conference on Computer Supported Cooperative Work, CSCW '11, 275–284. New York, NY, USA: ACM.
Shneiderman, B. 2010. Designing the User Interface: Strategies for Effective Human-Computer Interaction. Pearson Education India.
Silberman, M. S.; Ross, J.; Irani, L.; and Tomlinson, B. 2010. Sellers' problems in human computation markets. In Proceedings of the ACM SIGKDD Workshop on Human Computation, HCOMP '10, 18–21. New York, NY, USA: ACM.
Wright, P. 1998. Printed instructions: Can research make a difference. Visual Information for Everyday Use: Design and Research Perspectives 45–66.
Yin, M.; Chen, Y.; and Sun, Y.-A. 2013. The effects of performance-contingent financial incentives in online labor markets. In Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence, AAAI '13, 1191–1197. AAAI Press.
