
Improving Worker Engagement Through Conversational Microtask Crowdsourcing

Sihang Qiu
Delft University of Technology
Delft, The Netherlands
[email protected]

Ujwal Gadiraju
L3S Research Center, Leibniz Universität Hannover
Hannover, Germany
[email protected]

Alessandro Bozzon
Delft University of Technology
Delft, The Netherlands
[email protected]

ABSTRACT
The rise in popularity of conversational agents has enabled humans to interact with machines more naturally. Recent work has shown that crowd workers in microtask marketplaces can complete a variety of human intelligence tasks (HITs) using conversational interfaces with output quality similar to that of traditional Web interfaces. In this paper, we investigate the effectiveness of using conversational interfaces to improve worker engagement in microtask crowdsourcing. We designed a text-based conversational agent that assists workers in task execution, and tested the performance of workers when interacting with agents having different conversational styles. We conducted a rigorous experimental study on Amazon Mechanical Turk with 800 unique workers, to explore whether the output quality, worker engagement, and perceived cognitive load of workers can be affected by the conversational agent and its conversational styles. Our results show that conversational interfaces can be effective in engaging workers, and that a suitable conversational style has the potential to improve worker engagement further. Our findings have important implications for workflows and task design with regard to better engaging workers in microtask crowdsourcing marketplaces.

Author Keywords
Microtask crowdsourcing; conversational interface; conversational style; user engagement; cognitive task load.

CCS Concepts
• Information systems → Chat; Crowdsourcing; • Human-centered computing → Empirical studies in HCI;

INTRODUCTION
There has been a gradual rise in the use of conversational interfaces aiming to provide seamless means of interaction with virtual assistants, chatbots, or messaging services. There is a growing familiarity of people with conversational interfaces owing to the widespread proliferation of mobile devices and messaging services such as WhatsApp, Telegram, and Messenger.


Today, over half the population on our planet has access to the Internet with ever-lowering barriers of accessibility. This has led to flourishing paid crowdsourcing marketplaces like Amazon's Mechanical Turk (AMT) or Figure-Eight (F8), where people around the world can participate in online work with an aim to earn their primary livelihood, or as a secondary source of income.

Recent work by Mavridis et al. [29] has explored the suitability of conversational interfaces for microtask crowdsourcing by juxtaposing them with standard Web interfaces in a variety of popularly crowdsourced tasks. The authors found that conversational interfaces were positively received by crowd workers, who indicated overall satisfaction and an intention to use similar interfaces in the future. Tasks executed using the conversational interfaces took similar execution times as those using the standard Web interfaces, and yielded comparable output quality. Although these findings suggest that conversational interfaces are a viable alternative to the existing standard, little is known about the impact of conversational microtasking on the engagement of workers. Previous works have studied the nature of tasks that are popularly crowdsourced on AMT, showing that tasks are often deployed in large batches consisting of similar HITs (human intelligence tasks) [1, 8]. Long and monotonous batches of HITs pose challenges with regard to engaging workers, potentially leading to sloppy work due to boredom and fatigue [6]. There is a lack of understanding of whether conversational microtasking would alleviate or amplify the concerns surrounding worker engagement. In this work, we aim to fill this knowledge gap.

We conducted a study on AMT, involving 800 unique workers across 16 different experimental conditions, to address the following research questions.

#RQ1: To what extent can conversational agents improve worker engagement in microtask crowdsourcing?

#RQ2: How do conversational agents with different conversational styles affect the performance of workers and their cognitive load while completing tasks?

We deployed batches of different types of HITs (information finding, sentiment analysis, CAPTCHA recognition, and image classification tasks) on the traditional web interface and on three conversational interfaces having different conversational styles (4 task types × 4 interface variants).

We first investigated the effect of conversational interfaces with different conversational styles on quality-related outcomes, in comparison to the traditional web interface. We addressed RQ1 by using two measures of worker engagement: (i) worker retention in the batches of tasks, and (ii) self-reported scores on the short-form user engagement scale [32, 44]. We addressed RQ2 by considering different conversational styles within the conversational interfaces that workers interact with, and by using the NASA-TLX instrument to measure cognitive load after workers complete as many tasks as they wish to. Our results show that conversational interfaces have positive effects on worker engagement, as well as on the perceived cognitive load, in comparison to traditional web interfaces. We found that a suitable conversational style has the potential to engage workers further (in specific task types), although our results were inconclusive in this regard. Our work takes crucial strides towards furthering the understanding of conversational interfaces for microtasking, revealing insights into the role of conversational styles across a variety of tasks.

RELATED WORK

Conversational Agents
Conversational interfaces have been argued to have advantages over traditional graphical user interfaces due to having a more human-like interaction [30]. Owing to this, conversational interfaces are on the rise in various domains of our everyday life and show great potential to expand [43]. Recent work in the HCI community has investigated the experiences of people using conversational agents, understanding user needs and user satisfaction [4, 5, 27]. Other works have studied the scope of using conversational agents in specific domains. Vandenberghe introduced the concept of bot personas, which act as off-the-shelf users to allow design teams to interact with rich user data throughout the design process [40]. Others have studied the use of conversational agents in the domains of complex search [2, 22, 41] or food tracking [14]. These works have shown that conversational agents can improve user experiences and have highlighted the need to further investigate the use of conversational agents in different scenarios. In contrast to existing works, we explore the use of conversational agents in improving worker engagement in microtask crowdsourcing.

Crowdsourced Conversational Interfaces
Prior research has combined crowdsourcing with conversational agents, for instance to train the dialogue manager or natural language processing components [24]. Lasecki et al. designed and developed Chorus, a conversational assistant able to assist users with general knowledge tasks [26]. Conversations with Chorus are powered by workers who propose responses in the background, encouraged by a game-theoretic incentive scheme. Workers can see the working memory (chat history) and vote on candidate responses in a web-based worker interface. Based on Chorus, an improved conversational assistant named Evorus was proposed; it reduces the effort of workers by partially automating the voting process [19]. The same authors also developed a crowdsourced system called Guardian, which enables both expert and non-expert workers to collaboratively translate Web APIs into a dialogue system format [20].

A conversational agent called Curious Cat approaches the combination of crowdsourcing and conversation from a different perspective [3]. While most crowdsourced conversational agents provide information to users according to their requests, Curious Cat was designed as a knowledge acquisition tool that actively asks users for data. In our study, we propose a conversational agent that serves as a conversational interface for workers, to perform different types of popular crowdsourcing tasks.

Conversational Styles
Previous works have shown that conversational styles play an important role in human lexical communication. Lakoff suggested that conversational styles can be classified into four categories, ranging from the least to the most relationship between participants: 1) Clarity, an ideal mode of discourse; 2) Distance, a style that does not impose on others; 3) Deference, a style giving options; and 4) Camaraderie, the direct expression of desires [25]. Based on Lakoff's system, Tannen performed a systematic analysis of conversational style using the conversations recorded at a Thanksgiving dinner [36, 37]. Tannen identified several important features, and accordingly distinguished conversational styles ranging from Involvement (overlapping with Lakoff's camaraderie strategy) to Considerateness (overlapping with Lakoff's distance strategy).

Researchers have also attempted to apply theories pertaining to conversational styles in the field of human-computer interaction. [34] studied the preferred conversational style for a conversational agent. Their results suggested that users preferred the agent whose style matched their own. A similar conclusion was drawn from an analytical study of information-seeking conversations conducted by [38] using the MISC dataset [39]. [21] compared the quality of survey response data acquired from a web platform and a chatbot. In particular, they performed the experiment using formal and casual styles. The chatbot using the "casual" conversational style in their study tried to establish a relationship with users, and exhibits linguistic features from both Tannen's "High-Involvement" and "High-Considerateness" styles. The chatbot using the "formal" style is akin to the "clarity style" (showing no involvement and the least relationship with the user) as summarized by Lakoff [25].

Our work uses features and linguistic devices from Tannen's "High-Involvement" and "High-Considerateness" styles to design the conversation for microtask crowdsourcing, and we conduct experiments to see the effects of using different conversational styles.

Worker Engagement
Crowdsourcing microtasks can often be monotonous and repetitive in nature. Previous works have attempted to tackle the issues of boredom and fatigue manifesting in crowdsourcing marketplaces as a result of the long batches of similar tasks that workers often encounter. A variety of methods to retain and engage workers have been proposed.


[33] suggested introducing micro-breaks into workflows to refresh workers, and showed that under certain conditions micro-breaks aid worker retention and marginally improve their accuracy. Similarly, [6] proposed interspersing diversions (small periods of entertainment) to improve worker experience in lengthy, monotonous microtasks and found that such micro-diversions can significantly improve worker retention rate while maintaining worker performance. Other works proposed the use of gamification to increase worker retention and throughput [10]. [28] studied worker engagement, characterized how workers perceive tasks, and proposed methods to predict when workers would stop performing tasks. [7] introduced pricing schemes to improve worker retention, and showed that paying periodic bonuses according to pre-defined milestones has the biggest impact on the retention rate of workers.

In this work, we measure worker engagement by using the proxy of worker retention, and a standardized questionnaire called the 'user engagement scale' (UES), introduced by [32]. The UES was recently used by [44] to study the impact of worker moods on their engagement in crowdsourced information finding tasks.

METHOD: CONVERSATIONAL INTERFACE FOR MICROTASK CROWDSOURCING
In this study, we design and implement conversational interfaces that enable the entire task execution process, while exploring the impact of different conversational styles on worker performance and engagement. The reader can directly experience interaction with the conversational interface on the companion page (https://qiusihang.github.io/csbot).

Workflow of Conversational Microtasking
The conversational interface is designed to help workers in carrying out crowdsourcing tasks. The main building blocks of conversational microtasks are similar to those of traditional Web interfaces; they include initiating the conversation (starting the task execution), answering questions, and finally paying the workers. The workflow of conversational microtask crowdsourcing, as realized in our study to assist workers in task execution, is depicted in Figure 1 and described below.

1) After a worker accepts the task and opens the task page, the conversational interface is initialized with opening greetings from the conversational agent. The worker can respond by selecting one of two options. During this step, the conversational agent prompts brief information about the task, such as the task name and the time limit. The goal of this step is twofold: to make users familiar with the conversational interface, and to estimate the conversational style of the worker. As explained later (in Section Aligning Conversational Styles), this step is needed to align the agent's conversational style with that of the worker.

2) If the worker asks for the task instructions after the opening greetings, the conversational agent prompts the task instructions. Otherwise, this step is skipped.

3) Next, the conversational agent presents tasks framed as questions to the worker. On answering a question, another one is presented in sequence. Each new question contains a brief transition sentence (e.g. "Good! The next one."), the question number (helping workers find and edit previous questions), and the content itself (which can contain any HTML-based task type). Furthermore, the conversational interface supports two modes of input from workers: free text and multiple choices. When the expected input form of the answer is free text (e.g. in character recognition or audio transcription tasks), the worker must type the answer in the text area of the conversational interface. When the task includes multiple-choice answers, the worker can either type the answer as free text (exactly the same value as one of the options), or simply click the corresponding UI button.

4) After the worker has answered 10 questions, the conversational agent gives a break to relieve workers from the monotony of the batch of tasks. During the break, the conversational agent may send a "meme" or a joke for amusement, and then remind workers that they can stop answering and submit their answers whenever they want.

5) When a worker decides to stop task execution, or when no more pending questions are available, the conversational agent sends a list of the answers provided by the worker, for review. The worker is then allowed to review one or more previous answers and make any preferred edits.

6) The conversational agent then uploads the worker's final answers to the server. Once it confirms the answers have been successfully uploaded, a Task Token is given to the worker.

7) By pasting the Task Token on AMT, the worker can claim the corresponding monetary compensation, proportional to the number of answered questions.
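As an illustration, a minimal sketch of how such a conversational microtasking loop could be driven in the browser is shown below. The helper functions (sendMessage, readWorkerInput, uploadAnswers, generateTaskToken) are hypothetical; only the step ordering, the break after every 10 questions, the "stop task" command, and the Task Token hand-off follow the workflow described above.

```javascript
// Hypothetical sketch of the conversational microtasking loop (steps 1-7).
// sendMessage(), readWorkerInput(), uploadAnswers() and generateTaskToken()
// are assumed helpers, not part of any real library.
async function runConversationalTask(task) {
  // Step 1: opening greetings with brief task information.
  await sendMessage(`Hi! This task is "${task.name}" and the time limit is ${task.timeLimitMinutes} minutes.`);
  // Step 2: optional task instructions.
  const wantsInstructions = await readWorkerInput(['Show instructions', 'Skip instructions']);
  if (wantsInstructions === 'Show instructions') {
    await sendMessage(task.instructions);
  }

  const answers = [];
  for (let i = 0; i < task.questions.length; i++) {
    // Step 3: one question at a time, prefixed with its question number.
    await sendMessage(`Question ${i + 1}: ${task.questions[i].content}`);
    const answer = await readWorkerInput(task.questions[i].options);
    if (answer === 'stop task') break;   // the worker decides to stop early
    answers.push(answer);

    if ((i + 1) % 10 === 0) {
      // Step 4: a break (e.g. a meme) after every 10 answered questions.
      await sendMessage('Time for a short break! Remember, you can type "stop task" whenever you want.');
    }
  }

  // Step 5: review of all provided answers.
  await sendMessage(`Here are your answers: ${answers.join(', ')}. Edit any of them or type "submit".`);
  // Step 6: upload the final answers and hand out the Task Token.
  await uploadAnswers(task.id, answers);
  const token = generateTaskToken(task.id);
  // Step 7: the worker pastes the token on AMT to claim the reward.
  await sendMessage(`Your Task Token is ${token}. Paste it on AMT to claim your reward.`);
}
```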

Figure 1. The workflow of conversational microtask crowdsourcing. (Flowchart summary: worker activities are starting task execution, answering each question, optionally modifying answers, and getting paid; chatbot activities are sending task instructions, sending questions while questions are pending, checking whether each answer is valid, giving a break when the worker feels tired, sending the answer review when the worker wants to stop, and uploading the answers.)


Table 1. Design criteria for conversation styles of the agent.

Criteria | High-Involvement | High-Considerateness
C1. Rate of speech | fast | slow
C2. Turn taking | fast | slow
C3. Introduction of topics | w/o hesitation | w/ hesitation
C4. Use of syntax | simple | complex
C5. Directness of content | direct | indirect
C6. Utterance of questions | frequent | rare

Conversational Styles: Involvement or Considerateness
Tannen's analysis of conversational style [36] is based on an audio-taped conversation at a Thanksgiving dinner that took place in Berkeley, California, on November 23, 1978. Tannen found that, among the 6 participants present, 3 were New Yorkers who shared a conversational style. Tannen named the style of the New Yorkers "High-Involvement", which can be characterized as follows: "When in doubt, talk. Ask questions. Talk fast, loud, soon. Overlap. Show enthusiasm. Prefer personal topics, and so on." The conversational style of the non-New Yorkers was called "High-Considerateness", and can be characterized as follows: "Allow longer pauses. Hesitate. Don't impose one's topics, ideas, personal information. Use moderate paralinguistic effects, and so on". We selected Tannen's classification to define the conversational styles of agents, since recent work has shown its suitability for understanding styles in human-human conversations, and also in human-agent conversations [34]. Moreover, Tannen's classification has served as the basis for aligning the style of an end-to-end voice-based agent with that of an interlocutor [17].

Tannen identified four main features of conversational style, namely topic, pacing, narrative strategies, and expressive paralinguistics [37]. Based on these features and some linguistic devices used in the conversation at the Thanksgiving dinner, we created the following criteria to design conversations consistent with the High-Involvement and High-Considerateness styles for the conversational agent, as shown in Table 1. The criteria can be organised into two categories:

1) Pacing (C1, C2): Since the conversational agent communicates with the worker by typing text instead of via voice utterances, we use typing speed and the pause before sending a bubble (message) to simulate the rate of speech and the pause before turn taking. The High-Involvement style has a faster rate of speech and turn taking. Hence, we set a 1 ms delay per character to simulate typing speed (C1), and a 100 ms pause before sending a bubble to simulate turn taking (C2). As for the High-Considerateness style, which corresponds to a slower pace, we set a 2 ms delay per character and a 200 ms pause before animating the bubble (see the configuration sketch below).

2) Content (C3, C4, C5, C6): The conversational agent corresponding to the High-Involvement style introduces a new topic to the worker (for instance, telling workers how to answer questions, how to edit answers, and how to submit answers) without hesitation (C3). On the contrary, we use words or paralanguage such as "Well.." and "Hmm.." to simulate the hesitation of the High-Considerateness conversational agent (C3). Furthermore, the conversational agent of High-Involvement style uses simpler syntax (C4) and chats directly (C5), while the agent of High-Considerateness style uses relatively complex syntax (C4) and tends to express ideas/topics in an indirect or polite way (C5). Tannen also emphasized the importance of asking questions for the High-Involvement style [37]. Therefore, we use the frequency of questioning as one of the criteria (C6) for conversation design.

Figure 2. Options given to the worker for conversational style estimation and alignment. (For each of the three interactions (1. opening greetings, 2. time requirement, 3. task instructions), the figure shows the visible options offered in the High-Involvement style (e.g. "Sure!", "Absolutely!", "Give me instructions!", "Skip instructions!") and in the High-Considerateness style (e.g. "Hmm... Let me have a look.", "Well... It should be enough.", "Let me think... I need instructions.", "Mhm... I don't think I need it."), and how they map onto the invisible actual options: go ahead, understand the time requirement, show instructions, skip instructions.)

Based on the content criteria described above, we created templates of conversation for microtask crowdsourcing, as shown in Table 2.
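To illustrate how the pacing criteria (C1, C2) and the style-specific templates can be combined, the sketch below encodes both styles as a configuration object. The object layout and helper names are hypothetical; only the delays (1 ms vs. 2 ms per character, 100 ms vs. 200 ms pause) and the example utterances are taken from the criteria above and Table 2.

```javascript
// Hypothetical encoding of the two conversational styles.
// The delays implement criteria C1 (rate of speech) and C2 (turn taking);
// the message templates follow Table 2.
const STYLES = {
  involvement: {
    msPerCharacter: 1,     // C1: fast "typing"
    msBeforeBubble: 100,   // C2: fast turn taking
    greeting: (taskName) => `Hey! Can you help me with a task called ${taskName}?`,
    firstQuestion: 'Listen, the first question!',
  },
  considerateness: {
    msPerCharacter: 2,     // C1: slow "typing"
    msBeforeBubble: 200,   // C2: slow turn taking
    greeting: (taskName) => `Thank you in advance for helping me with a task called ${taskName}.`,
    firstQuestion: 'Good! Here is the first question.',
  },
};

// Simulate the rate of speech and the pause before a bubble appears.
function sendBubble(text, style, render) {
  const typingDelay = text.length * style.msPerCharacter;
  setTimeout(() => render(text), style.msBeforeBubble + typingDelay);
}
```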

Aligning Conversational Styles
Previous studies suggest that there is no single best conversational style, since a style needs to be adapted to the interlocutor [34, 38]. We therefore estimate the conversational style of the worker, and investigate whether aligning the style of the conversational agent with the conversational style of the worker can positively affect quality-related outcomes in the tasks being completed.

To estimate the conversational style of the worker, a basic strategy could be to analyze features of the worker's replies and classify the replies using these features. Note that the conversational style of a worker must be estimated and aligned before the worker starts answering questions, since replies given during the actual task execution are in essence answers to the crowdsourcing tasks, rather than natural conversation. Therefore, the conversational style of the agent should be aligned right after the "opening greetings", "time requirement" and "task instruction" interactions (in Table 2). However, such conversational elements are typically not rich enough to enable feature extraction and style classification. In this study, we therefore give workers dual options of conversational styles to select from (Figure 2), and then adapt the style of the conversational agent according to the worker's selection.

We estimate the conversational style of workers as follows: 1) For each interaction, we provide one or two options that lead the worker to the next interaction (we call these actual options). These options serve the purpose of ensuring progressivity in the interaction [11, 35].


Table 2. Conversation templates for conversational agents with high-involvement and high-considerateness styles, designed according to criteria distilled from Tannen's characterization of conversation styles (cf. Table 1).

Interactions | High-Involvement | High-Considerateness | Criteria
Opening greetings | Hey! Can you help me with a task called [TASK NAME]? | Thank you in advance for helping me with a task called [TASK NAME]. | C4, C6
Time requirements | You must complete this task within 30 minutes, otherwise I won't pay you :-) | I think 30 minutes should be more than enough for you to finish :-) | C5
Task instructions | Here is the task instructions. Take a look! | I kindly ask you to have a look at the task instructions. | C4
Introducing questions | Listen, the first question! / OK! The next one. / Here you go. | Good! Here is the first question. / Okay, I got it. Here is the next question. / Alright, this is the question you want to have a look again. | C4
Completing mandatory questions | Hey, good job! The mandatory part has been done! I know you want to continue, right? | OK, you have finished the mandatory part of the task. Well... please let me know if you want to answer more questions. | C3, C6
Receiving an invalid answer | Oops, I don't understand your answer. Do you forget how to answer the question? Just type "instruction". | Hmm... Sorry, I don't get it. Maybe you can type "instruction" to learn how to answer the question. | C3, C6
Break | Are you feeling tired? If I'm driving you crazy, you can type "stop task" to leave me. | Well... alright, it seems that you have answered a lot of questions. No worries, you can type "stop task" if you don't want to continue. | C3, C5, C6
Review | You have completed the task! Here are your answers: [ANSWERS]. Something wrong? Just edit the answer by typing its question number, or type "submit" to submit your answers. | Good job! The task has been completed. Here is the review of your answers: [ANSWERS]. Well... if you find something wrong here, please edit the answer by typing its question number. Otherwise, you can type "submit" to submit your answers. | C3, C4, C6
Bye | Your task token is [TASK TOKEN]. I'm off ;) | Your task token is [TASK TOKEN]. Thank you! Your answers have been submitted. Nice talking to you. Bye! | C4

Note that actual options are invisible to workers. The only actual option corresponding to "opening greetings" is go ahead, while the only actual option of "time requirement" is understand the time requirement. For the "task instructions" interaction, there are two actual options: show instructions and skip instructions, where the former elucidates how to answer the crowdsourcing question and the latter directly leads the worker through to the task execution stage. 2) As actual options are invisible to workers, we create two visible options (referring to High-Involvement and High-Considerateness respectively) for each actual option. To proceed, workers select a single response from the provided visible options. 3) As a result of these three interactions, we obtain three specifically selected responses from each worker. If two or more replies refer to a High-Involvement style, we consider the conversational style of the worker to be High-Involvement, and vice versa.

On determining the conversational style of the worker, the style of the conversational agent is immediately aligned with that of the worker.
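The estimation rule amounts to a majority vote over the three pre-task interactions. The following sketch makes the rule explicit, with hypothetical labels for the two styles.

```javascript
// Estimate a worker's style from the three visible options they picked during
// "opening greetings", "time requirement" and "task instructions".
// Each pick is labelled either 'involvement' or 'considerateness'.
function estimateWorkerStyle(picks) {
  const involvementVotes = picks.filter((p) => p === 'involvement').length;
  // Two or more involvement-style picks => High Involvement, otherwise High Considerateness.
  return involvementVotes >= 2 ? 'involvement' : 'considerateness';
}

// Example: the agent then switches its own templates to the estimated style, e.g.
// agent.style = STYLES[estimateWorkerStyle(['involvement', 'considerateness', 'involvement'])];
```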

EXPERIMENTAL DESIGN
The main goal of our study is to investigate the impact of the conversational interface on output quality, worker engagement, and cognitive task load. We therefore consider the traditional web interface (Web) for comparison, wherein the input elements are the default HTML-based question widgets provided by AMT. This allows us to analyse our results in the light of recent findings from work by Mavridis et al. [29]. Another important objective is to study the effect that different conversational styles have on the performance of workers completing microtasks through conversational interfaces. We thereby set up three different conversational interfaces: one with a High-Involvement style (Con+I), one with a High-Considerateness style (Con+C), and one with an aligned style (aligning the style of the agent with the estimated style of the worker, Con+A).

The conversational interface with High Involvement or High Considerateness (namely, Con+I or Con+C) initiates with its corresponding conversational style and maintains it through all interactions, while the conversational interface with style alignment (Con+A) initiates randomly with either High Involvement or High Considerateness, and adjusts its conversational style after the conversational style estimation.

In terms of the task types, we consider two input types (free text and multiple choices) and two data types (text and images), resulting in a cross-section of 4 different types of tasks (as shown in Table 3): Information Finding, Sentiment Analysis, CAPTCHA Recognition, and Image Classification [12].

Table 3. Summary of task types.
Input type | Text | Imagery
Free text | Information Finding | CAPTCHA Recognition
Multiple choices | Sentiment Analysis | Image Classification

Information Finding (IF). Workers are asked to find a given store on Google Maps and report its rating (i.e., the number of stars). The information corresponding to stores is obtained from the publicly available Yelp Open Dataset (https://www.yelp.com/dataset).

Sentiment Analysis (SA). Workers are asked to read given reviews of restaurants from the Yelp dataset, and judge the overall sentiment of the review.

CAPTCHA Recognition (CR). Workers are asked to report the alphanumeric string contained in a CAPTCHA generated by Claptcha (https://github.com/kuszaj/claptcha), in the same order as the characters appear in the image.

Image Classification (IC). Workers are asked to analyse images pertaining to 6 animal species (butterfly, crocodile, dolphin, panda, pigeon, and rooster) selected from the Caltech101 dataset [9]. They are tasked with determining which animal a given image contains, and selecting the corresponding option.

Our experimental study is therefore composed of 16 experimental conditions (4 task types × 4 interfaces).

Task Design
The task is organised in four steps: a demographic survey, the microtask, the User Engagement Scale Short Form (UES-SF), and the NASA Task Load Index form (NASA-TLX).

The demographic survey consists of 6 general background questions. The microtask contains 5 mandatory questions and 45 optional questions. When a worker completes the 5 mandatory questions, the conversational agent asks the worker whether he/she wants to continue, while the traditional Web interface features a button named I want to answer more questions that prompts additional questions when clicked. During task execution, both the Web interface and the conversational agent induce a small break after 10 consecutive questions. During the breaks, the conversational agent (as well as the Web interface) shows a "meme" for amusement. The rationale behind such a micro-diversion is to ensure that worker responses are not affected by boredom or fatigue [6, 33], making our experimental setup robust while measuring worker engagement across different conditions. Thereafter, the conversational agent periodically reminds workers that they can stop anytime and asks the worker if he/she wants to continue. Similarly, on the Web interface, a click on the I want to answer more questions button prompts a meme and 10 more questions. Workers could quit at any point after the mandatory questions by entering 'stop task' in the conversational interfaces or clicking a stop button on the Web interface; this could be used by workers to exit the tasks and claim rewards for work completed.

Next, workers are asked to complete the short form of the User Engagement Scale (UES-SF) [31, 32]. The UES-SF contains four sub-scales with 12 items, comprising a tool that is widely used for measuring user engagement in various digital domains. Each item is presented as a statement using a 7-point Likert scale from "1: Strongly Disagree" to "7: Strongly Agree". We chose the UES-SF since it has been validated in a variety of HCI contexts, and to date, it is the most tested questionnaire that measures user engagement. The UES-SF fits our context of online crowdsourcing well: with a total of only 12 items, it is easy to motivate workers to respond. Finally, workers are asked to complete the NASA Task Load Index (NASA-TLX) questionnaire, where workers rate their feelings about the task workload (NASA-TLX: Task Load Index, https://humansystems.arc.nasa.gov/groups/TLX/). The questionnaire has six measurements (questions) about Mental Demand, Physical Demand, Temporal Demand, Performance, Effort, and Frustration respectively. We use the NASA-TLX due to considerable evidence of its robustness in measuring the cognitive task load (across 6 dimensions) of users accomplishing given tasks, which aligns with the goal of our study [16].

Worker Interface
Both the Web interface and the conversational interface are designed and implemented on top of AMT (see Figure 3). For both interfaces, the demographic survey, UES-SF, and NASA-TLX are created using the default HTML-based question widgets provided by AMT, using the Crowd HTML Element.

The element crowd-radio-group, including several crowd-radio-buttons, is used for creating all the background questions of the demographic survey. The worker can select only one crowd-radio-button from the crowd-radio-group. The element crowd-slider is used for creating all the questions of the UES-SF and NASA-TLX, since the corresponding responses are on an integer scale ranging from 1 to 7 (UES-SF) or from 0 to 100 (NASA-TLX).

In recent work that has explored conversational interfaces for microtasking, the interface was built on top of platforms such as Telegram [29], demanding extra effort from crowd workers to register an account (if they were not registered users before) and redirecting workers to the social platform from the crowdsourcing platform. To overcome this limitation and fairly compare the conversational interface with the traditional web interface, we designed and implemented the conversational agent purely based on HTML and Javascript. Thus, it can be embedded directly on the AMT task page without any restrictions. Finally, a crowd-input element is placed below the conversational agent for entering the Task Token received on completion of the tasks.

The only difference between the interfaces (traditional Web versus conversational) is in the interaction with the user and how input is received. The Web interface contains either crowd-input or crowd-radio-group elements, respectively for free text and multiple choices, whereas the conversational interface uses a textarea (shown at the bottom) and bubble-like buttons for the same purposes. As shown in Figure 3, we developed a rule-based conversational agent based on chat-bubble (https://github.com/dmitrizzle/chat-bubble).
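To illustrate the dual input mode, the sketch below shows how a reply to a multiple-choice question could be validated: a button click is always accepted, while typed free text must exactly match one of the options, as required by the workflow described earlier. The function and field names are hypothetical.

```javascript
// Hypothetical validation of a worker reply to a multiple-choice question.
// A reply is valid if it came from a bubble-like button click, or if the typed
// text exactly equals one of the available options.
function resolveMultipleChoiceReply(reply, options) {
  if (reply.fromButton) return reply.value;              // button click: always valid
  const typed = reply.text.trim();
  const match = options.find((opt) => opt === typed);    // typed answer must equal an option
  return match !== undefined ? match : null;             // null triggers the "invalid answer" message
}
```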

Experimental Setup
Each experimental condition (modeled as a batch of HITs) consists of 50 questions, and we recruit 50 unique workers to answer these 50 questions. Each worker is asked to complete at least 5 mandatory questions. Across the 16 experimental conditions, we thereby acquired responses from 16 × 50 = 800 unique workers in total.

When a worker successfully completes the demographic survey, UES-SF, NASA-TLX, and at least 5 mandatory questions, the worker immediately receives 0.5$. The reward for the optional questions is given to workers through the "bonusing" function on AMT. We estimated the execution time and paid workers a bonus of 0.01$ per optional task for the image tasks (Image Classification and CAPTCHA Recognition), and 0.02$ per optional task for the text tasks (Information Finding and Sentiment Analysis). On task completion, we instantly bonused workers the difference required to meet an hourly pay of 7.25$, based on the total time they spent on tasks (including the time for breaks). The instructions clearly explained the rewards for each optional task; workers knew of the base reward and bonuses at the onset, ensuring that there was no unnatural financial uncertainty other than what is typical on AMT.
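The reward scheme can be summarized as a base payment plus per-question bonuses, topped up to an hourly rate of 7.25$. The sketch below illustrates this computation; the function name and task-type encoding are hypothetical, while the amounts come from the description above.

```javascript
// Hypothetical computation of a worker's total reward under the scheme described above.
// timeSpentHours includes breaks; taskType is 'image' (IC, CR) or 'text' (IF, SA).
function computeReward(taskType, optionalAnswered, timeSpentHours) {
  const base = 0.5;                                        // paid for survey + 5 mandatory questions
  const perOptional = taskType === 'image' ? 0.01 : 0.02;  // bonus per optional question
  const earned = base + optionalAnswered * perOptional;
  const hourlyFloor = 7.25 * timeSpentHours;               // top up to 7.25$ per hour if needed
  return Math.max(earned, hourlyFloor);
}

// Example: a text-task worker answering 20 optional questions in 0.4 hours:
// computeReward('text', 20, 0.4) -> max(0.5 + 0.4, 2.9) = 2.9
```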

Figure 3. The comparison of conversational interfaces embedded on the user interface of AMT and traditional web interfaces using HTML elements provided by AMT, where the worker needs to provide a Task Token acquired from the conversational interface after the task is completed. (Panels: (a) conversational interface for a multiple-choices task, using bubble-like buttons and a textarea; (b) traditional web interface for a multiple-choices task, using crowd-radio-group; (c) conversational interface for a free-text task; (d) traditional web interface for a free-text task, using crowd-input.)

Quality Control
To prevent malicious workers from executing the crowdsourcing tasks, we only accept participants whose overall HIT approval rates are greater than 95%. Using Javascript and tracking worker IDs, we also ensure that each worker submits at most one assignment across all experimental conditions, to avoid learning biases due to repeated participation.

Evaluation Metrics
The dependent variables in our experiments are output quality, worker engagement, and cognitive task load. We use pairwise independent tests to test for statistical significance (expected α = 0.05, two-tailed, corrected to control for Type-I error inflation in our multiple comparisons). We use the Holm-Bonferroni correction to control the family-wise error rate (FWER) [18].
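For reference, the Holm-Bonferroni procedure compares the i-th smallest of m p-values against α/(m - i + 1) and stops at the first non-rejection. The sketch below is a generic implementation of this rule, not the code used in our analysis.

```javascript
// Generic Holm-Bonferroni correction: given p-values from a family of m comparisons,
// return a boolean per p-value indicating significance at family-wise level alpha.
function holmBonferroni(pValues, alpha = 0.05) {
  const m = pValues.length;
  const order = pValues.map((p, i) => [p, i]).sort((a, b) => a[0] - b[0]);
  const significant = new Array(m).fill(false);
  for (let rank = 0; rank < m; rank++) {
    const [p, originalIndex] = order[rank];
    if (p <= alpha / (m - rank)) {
      significant[originalIndex] = true;   // reject H0 for this comparison
    } else {
      break;                               // stop at the first non-rejection
    }
  }
  return significant;
}
```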

Output quality is measured in terms of the judgment accuracy of workers, obtained by comparing the workers' responses with the ground truth. Thus, a given worker's accuracy is the fraction of correct answers among all the answers provided by the worker. In the case of Information Finding tasks, the stars provided by workers should exactly match the stars from Google Maps. For the other task types, the workers' answers (strings) should be identical to the ground truth (case insensitive).

Worker engagement is measured using two popular approaches: 1) worker retention, i.e. the number of answered optional questions, and the proportion of workers answering at least one optional question; and 2) the UES-SF overall score (ranging from 1 to 7; the higher the UES score, the more engaged the worker).

Cognitive task load is evaluated using the unweighted NASA-TLX test. Through the TLX scores (ranging from 0 to 100; a higher score means a heavier task load), we study if and how conversational interfaces affect the perceived cognitive load for the executed task.
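Assuming the standard scoring of the two questionnaires (the overall UES-SF score as the mean of the 12 item ratings after recoding any reverse-coded items, and the raw, unweighted NASA-TLX score as the mean of the six subscale ratings), the two scores can be computed as follows; the helper names are hypothetical.

```javascript
// Hypothetical scoring helpers, assuming standard questionnaire scoring.
const mean = (xs) => xs.reduce((a, b) => a + b, 0) / xs.length;

// itemRatings: 12 values in [1, 7], reverse-coded items already recoded.
function uesShortFormOverall(itemRatings) {
  return mean(itemRatings);
}

// subscaleRatings: 6 values in [0, 100], one per NASA-TLX dimension.
function rawNasaTlx(subscaleRatings) {
  return mean(subscaleRatings);
}

// Example: rawNasaTlx([50, 40, 30, 60, 45, 35]) -> 43.33...
```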

RESULTS

Worker Demographics
Of the 800 unique workers, 37.8% were female and 62.2% were male. Most workers (89.8%) were under 45 years old.

72.7% of the workers reported education levels at or above a Bachelor's degree. 37.9% of the workers claimed AMT as their primary source of income, while more than half of the workers (55.8%) reported that AMT was their secondary source of income.

Distribution of Conversational Styles
We estimated the conversational style of workers across all the conversational interface conditions using the method illustrated in Figure 2. The number of workers whose conversational styles were estimated as High Involvement and High Considerateness is shown in Figure 4.

Figure 4. Distributions of estimated styles across all conditions. (Bar chart: for each task type (Information Finding, Sentiment Analysis, CAPTCHA Recognition, Image Classification) and each conversational condition (Con+I, Con+C, Con+A), the number of workers out of 50 estimated as High Involvement versus High Considerateness.)

Figure 5. Distributions of estimated styles of conversational interfaces with style alignment by two initial styles. (Bar chart: for each task type, the number of workers in the Con+A condition estimated as High Involvement versus High Considerateness, broken down by the initial style of the conversational agent, Involvement or Considerateness.)

As we described earlier, the conversational agent maintains the High-Involvement and High-Considerateness styles in the Con+I and Con+C conditions respectively, while in the Con+A conditions the conversational agent initiates randomly with either the High-Involvement or the High-Considerateness style. Figure 5 shows the number of workers whose conversational styles were estimated as High Involvement and High Considerateness, respectively, in the conversational interfaces with style alignment (Con+A), across all task types and the two initial conversational styles (High Involvement and High Considerateness).

Output Quality
Main result: In terms of output quality, conversational interfaces show no significant difference (min p = 0.09) compared to the traditional web interface, and there is no significant difference across conversational styles.

Table 4 shows the mean and standard deviation of worker accuracy across the 16 experimental conditions. Since the Image Classification task is objective and simple, we obtained high-accuracy (98%-100%) results across the 4 different interface conditions.

Pairwise independent t-tests revealed no significant difference in output quality across the four interfaces (conversational styles) within each task type. This aligns with the findings from previous work [29]. For Image Classification tasks, worker accuracy across all interfaces and conversational styles is higher than for the other types of tasks due to the relative simplicity of the task.

Table 4. Worker accuracy (µ ± σ: mean and standard deviation) and p-values across different task types and interface conditions.

Task type | Web (vs. Con+I, C, A) | Con+I (vs. Con+C, A) | Con+C (vs. Con+A) | Con+A
IF | 0.66±0.29 (p = 0.69, 0.2, 0.88) | 0.63±0.3 (p = 0.37, 0.81) | 0.58±0.3 (p = 0.26) | 0.65±0.29
SA | 0.62±0.27 (p = 0.18, 0.99, 0.74) | 0.54±0.29 (p = 0.16, 0.09) | 0.62±0.26 (p = 0.74) | 0.64±0.27
CR | 0.72±0.16 (p = 0.33, 0.13, 0.23) | 0.69±0.14 (p = 0.48, 0.02) | 0.67±0.19 (p = 0.01) | 0.75±0.12
IC | 1.0±0.03 (p = 0.19, 0.39, 0.09) | 0.98±0.09 (p = 0.41, 0.95) | 0.99±0.04 (p = 0.29) | 0.98±0.07

Worker Engagement

Worker Retention.
Main result: Conversational interfaces lead to significantly higher worker retention in multiple-choice tasks compared to the traditional web interface. In particular, a High-Involvement style corresponds to significantly higher worker retention across all task types compared to the web interface.

Figure 6 shows a violin plot representing the number of optional tasks completed by workers. In this figure, each "violin" represents the distribution of workers in one of the experimental conditions. The width of the violin at any point represents the number of workers who answered the corresponding number of optional questions. The distribution does not meet the assumptions for parametric tests. Thus, we use the Wilcoxon Rank-Sum test (expected α = 0.05, two-tailed, corrected by the Holm-Bonferroni method) to test the significance of the pairwise differences. Results are shown in Table 5. We found that across all task types, the number of optional tasks completed by workers using the Web interface was significantly lower than in the conversational interface with the Involvement style (RQ2). Compared with Web, the conversational interface with style alignment (Con+A) also shows significantly higher worker retention except in the Information Finding task, while the Considerateness style shows significantly higher worker retention in multiple-choice tasks (RQ2).

We found that workers using conversational interfaces were generally better retained than Web workers in multiple-choice tasks, and that none of the Web workers completed all the available optional questions in the Information Finding task (RQ1).

Table 5. Worker retention (µ ± σ: mean and standard deviation; unit: number of optional tasks completed by workers) and p-values across different task types and interface conditions.

Task type | Web (vs. Con+I, C, A) | Con+I (vs. Con+C, A) | Con+C (vs. Con+A) | Con+A
IF | 4.18±8.66 (p = 1.8e-4*, 0.02, 5.1e-3) | 15.47±18.0 (p = 0.08, 0.17) | 7.55±12.14 (p = 0.67) | 9.4±13.63
SA | 5.4±11.99 (p = 2.7e-5*, 8.3e-5*, 2.3e-6*) | 11.78±15.26 (p = 0.6, 0.29) | 8.63±10.32 (p = 0.09) | 14.92±16.09
CR | 8.4±16.75 (p = 1.3e-3*, 2.3e-3, 2.0e-4*) | 14.96±16.73 (p = 0.98, 0.22) | 15.14±16.8 (p = 0.19) | 21.37±19.9
IC | 8.7±17.17 (p = 1.2e-6*, 6.1e-5*, 2.9e-5*) | 28.6±17.97 (p = 0.07, 0.67) | 20.29±19.08 (p = 0.34) | 25.61±20.52

* = statistically significant (corrected Wilcoxon Rank-Sum test)

Table 6 lists the number and percentage of workers who answered at least one optional question. While only 26%-32% of workers decided to answer at least one optional question in the Web condition, 60%-84% of the workers operating with the conversational agents answered at least one optional question. This result also suggests a higher degree of retention associated with the conversational interface.

Table 6. The number of workers (with percentages) who completed at least one optional question across all task types and the four interfaces.

Task type | Web | Con+I | Con+C | Con+A
IF | 16 (32%) | 34 (68%) | 30 (60%) | 32 (64%)
SA | 13 (26%) | 40 (80%) | 39 (78%) | 41 (82%)
CR | 14 (28%) | 34 (68%) | 34 (68%) | 35 (70%)
IC | 13 (26%) | 42 (84%) | 38 (76%) | 37 (74%)
Overall | 56 (28%) | 150 (75%) | 141 (70.5%) | 145 (72.5%)

User Engagement Scale (UES-SF).
Main result: Input and data types can significantly affect the UES-SF score, while interfaces and conversational styles were found to have no significant impact.

Table 7 lists the UES-SF scores across all the experimental conditions. Pairwise independent t-tests (expected α = 0.05, two-tailed, corrected by the Holm-Bonferroni method) between the web and conversational interfaces (RQ1) with different conversational styles (RQ2) show that the UES-SF scores have no significant difference across the four interfaces (conversational styles) within each task type.

However, as shown in Table 8 (p-values), between-task pairwise independent t-tests (expected α = 0.05, two-tailed, corrected by the Holm-Bonferroni method) revealed that the overall Perceived Usability of image-based tasks (CAPTCHA Recognition and Image Classification) is significantly higher than that of text-based tasks (Information Finding and Sentiment Analysis). In terms of overall Aesthetic Appeal, Reward Factor, and Overall UES score, the scores of multiple-choice tasks (Sentiment Analysis and Image Classification) are higher than those of free-text tasks (Information Finding and CAPTCHA Recognition) with statistical significance.


Figure 6. A violin plot representing the number of optional questions answered by workers across different task types and different interfaces, where the black dots represent the mean value. A violin plot is a hybrid of a box plot and a kernel density plot, revealing peaks in the data that cannot be visualized using box plots.

Table 7. The UES-SF score (µ ± σ: mean and standard deviation) of all task types with four interfaces.

Information Finding:
Categories | Web | Con+I | Con+C | Con+A | Overall
Focused attention | 4.12±1.40 | 4.39±1.44 | 3.68±1.45 | 3.81±1.53 | 3.98±1.51
Perceived usability | 3.71±1.67 | 3.70±1.61 | 3.86±1.70 | 4.24±1.61 | 3.86±1.68
Aesthetic appeal | 4.23±1.46 | 4.29±1.29 | 4.10±1.12 | 4.01±1.58 | 4.14±1.40
Reward factor | 4.35±1.23 | 4.44±1.53 | 4.41±1.33 | 4.17±1.49 | 4.33±1.44
Overall | 4.10±0.85 | 4.21±0.85 | 4.01±0.69 | 4.06±1.00 | 4.07±0.90

Sentiment Analysis:
Categories | Web | Con+I | Con+C | Con+A | Overall
Focused attention | 4.12±1.30 | 4.43±1.20 | 4.07±1.46 | 4.28±1.38 | 4.21±1.37
Perceived usability | 3.91±1.83 | 3.86±1.67 | 4.19±1.63 | 4.42±1.85 | 4.08±1.78
Aesthetic appeal | 4.75±1.28 | 4.67±1.51 | 4.84±1.31 | 4.86±1.12 | 4.76±1.35
Reward factor | 4.99±1.23 | 4.90±1.33 | 4.95±1.31 | 5.05±1.37 | 4.95±1.36
Overall | 4.44±0.87 | 4.46±0.98 | 4.51±0.90 | 4.65±0.88 | 4.50±0.97

CAPTCHA Recognition:
Categories | Web | Con+I | Con+C | Con+A | Overall
Focused attention | 3.83±1.76 | 3.92±1.61 | 3.93±1.56 | 4.39±1.55 | 4.00±1.66
Perceived usability | 4.95±1.57 | 4.71±1.66 | 4.56±1.64 | 4.74±1.43 | 4.71±1.62
Aesthetic appeal | 3.74±1.71 | 4.10±1.73 | 3.95±1.81 | 3.94±1.69 | 3.92±1.76
Reward factor | 4.43±1.71 | 4.42±1.79 | 4.50±1.68 | 4.25±1.66 | 4.38±1.74
Overall | 4.24±1.27 | 4.29±1.11 | 4.23±1.12 | 4.33±1.14 | 4.25±1.20

Image Classification:
Categories | Web | Con+I | Con+C | Con+A | Overall
Focused attention | 4.30±1.45 | 4.21±1.77 | 4.35±1.62 | 4.16±1.85 | 4.23±1.70
Perceived usability | 4.41±1.93 | 4.91±1.53 | 4.90±1.67 | 4.68±1.78 | 4.70±1.77
Aesthetic appeal | 4.73±1.42 | 4.53±1.56 | 4.75±1.37 | 4.75±1.65 | 4.67±1.54
Reward factor | 4.97±1.32 | 5.09±1.73 | 4.87±1.56 | 5.14±1.60 | 5.00±1.60
Overall | 4.60±0.91 | 4.69±1.20 | 4.72±1.03 | 4.68±1.19 | 4.65±1.13

Table 8. p-values of between-task statistical tests of the UES-SF score.

Categories | IF vs. SA | IF vs. CR | IF vs. IC | SA vs. CR | SA vs. IC | CR vs. IC
Focused attention | 0.11 | 0.90 | 0.11 | 0.17 | 0.85 | 0.16
Perceived usability | 0.21 | 2.8e-7* | 1.2e-6* | 1.8e-4* | 4.1e-4* | 0.96
Aesthetic appeal | 8.0e-6* | 0.16 | 3.6e-4* | 1.1e-7* | 0.53 | 6.5e-6*
Reward factor | 9.4e-6* | 0.73 | 1.2e-5* | 2.8e-4* | 0.75 | 2.4e-4*
Overall | 7.3e-6* | 9.4e-2 | 3.2e-8* | 2.4e-2* | 0.14 | 6.6e-4*

* = statistically significant (corrected t-test)

Cognitive Task Load
Main result: We found no significant difference in NASA-TLX scores across the different interfaces (web vs. conversational interface, and between conversational styles).

To answer RQ2, we calculated and listed the unweighted NASA-TLX scores in Table 9. According to pairwise independent t-tests (expected α = 0.05, two-tailed, corrected by the Holm-Bonferroni method), the NASA-TLX scores show no significant difference across the four interfaces (conversational styles) within each task type. However, the conversational interface with the aligned style has the potential to reduce cognitive task load for the Information Finding task compared with the web interface (not significant; p = 0.033, which is less than 0.05 but higher than the corrected α).

DISCUSSION
Aspects such as task complexity [42], task type, and instructions [13] are instrumental in shaping crowd work [23]. However, previous work has shown that conversational interfaces can effectively benefit workers from different perspectives, such as satisfaction [29] and effort [26, 19].

Table 9. The unweighted NASA-TLX score (µ ± σ: mean and standard deviation) and p-values of all task types with four interfaces.

Task type | Web (vs. Con+I, C, A) | Con+I (vs. Con+C, A) | Con+C (vs. Con+A) | Con+A
IF | 52.35±20.75 (p = 0.51, 0.12, 0.03) | 49.62±20.39 (p = 0.37, 0.13) | 46.05±19.63 (p = 0.51) | 43.4±20.25
SA | 50.27±17.76 (p = 0.95, 0.31, 0.17) | 50.02±20.54 (p = 0.37, 0.22) | 46.54±18.26 (p = 0.71) | 45.15±18.85
CR | 38.23±19.56 (p = 0.81, 0.74, 0.6) | 37.29±20.26 (p = 0.58, 0.78) | 39.54±20.2 (p = 0.4) | 36.14±19.89
IC | 43.38±22.64 (p = 0.46, 0.07, 0.44) | 40.22±19.56 (p = 0.23, 0.94) | 35.57±19.11 (p = 0.3) | 39.89±21.94

Conversational interfaces are on the rise across different domains, and it is important to study how conversational styles and style alignment can improve worker experience and satisfaction.

Through our experiments, we found that workers preferred using the High-Considerateness style while conducting Information Finding and Sentiment Analysis tasks. In contrast, we found that workers tended to use the High-Involvement style while completing CAPTCHA Recognition and Image Classification tasks. This suggests that workers are likely to exhibit an involved conversational style when they are relatively more confident, or when the tasks are less difficult (RQ2). The results of style alignment further show that workers' conversational styles are mainly affected by task types rather than by the initial style of the agent. We note that Information Finding and Sentiment Analysis tasks are typically more complex [42] in comparison to CAPTCHA Recognition and Image Classification. This calls for further exploration of the impact of task complexity on task outcomes within conversational microtask crowdsourcing.

In terms of the effect of conversational styles on worker retention, there was no significant difference between the different styles.


A possible explanation is the cap of 45 available optional tasks that a worker could answer: many workers who conducted image-based tasks (i.e., CAPTCHA Recognition and Image Classification) on the conversational interfaces with the High-Involvement style and style alignment completed all 45 available optional tasks. Our findings regarding the impact of conversational style on worker retention suggest that a High-Involvement conversational style can provide workers with engagement stimuli for long-term retention (RQ2). Since the cap of 45 optional tasks limits the scale of worker retention observable in this study, future research should consider an unbounded number of optional tasks when analyzing worker retention.

Our results showed significant differences between image-based and text-based tasks with regard to UES-SF scores. This is potentially due to task complexity: the two text-based tasks are more taxing than the two image-based tasks. The results also suggest that the input type (free text vs. multiple choice) has a principal impact on UES-SF scores, which weakens the effect of the different interfaces and conversational styles. The influence of task complexity and its mediating interaction with conversational styles should be examined in future work.

There was no significant difference in the NASA-TLX scores of workers between web and conversational interfaces. As Information Finding and Sentiment Analysis are more demanding than CAPTCHA Recognition and Image Classification, the NASA-TLX results also suggest that task complexity has an impact on perceived cognitive load. We aim to study this further and tease out the interaction between task complexity and cognitive load in conversational microtasking.

Design Implications for HCI
We found that workers tend to exhibit different conversational styles due to the effect of task complexity. However, our results on aligning the conversational style of the agent with that of the workers suggest that giving the conversational agent a High-Involvement style can generally improve worker retention in conversational microtask crowdsourcing.

A healthy relationship between workers and requesters is critical to the sustainability of microtask marketplaces, and it is in the interest of requesters to take steps to ensure this. By adopting conversational interfaces, requesters can improve worker engagement, particularly in less complex tasks as suggested by our findings, allowing workers to complete more work, earn more money, and foster good faith in the long-term requester-worker relationship.

These constitute important design implications that task requesters can consider while optimizing for worker engagement in long batches of HITs. Distilling the complex interactions between task difficulty, conversational styles, and quality-related outcomes in conversational microtasking can help make crowdsourcing systems more engaging and effective. The HCI community is uniquely suited to further explore the impact of conversational styles on quality-related outcomes in microtask crowdsourcing, and we believe our work presents an important first step in this direction. Accurately estimating the general or preferred conversational styles of individuals, so as to adapt the conversational styles of agents accordingly, can bear great dividends in domains beyond conversational microtasking.

In this study, we designed and developed a web-based conversational agent that can execute crowdsourcing microtasks of common types (including Information Finding, Sentiment Analysis, Optical Character Recognition, Image Classification, Audio Transcription, and surveys). Furthermore, since the conversational interface is purely HTML-based, elements used in traditional web interfaces can easily be ported into conversational interfaces. The overhead of designing and implementing conversational interfaces can therefore be kept low, a small price to pay for an increase in worker engagement. The code corresponding to our conversational agent, along with the data, is publicly available on the companion page for the benefit of the community.
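To illustrate what such porting can look like (a hypothetical sketch with assumed function names and CSS classes, not the released code from the companion page), a multiple-choice element taken from a web task form can simply be wrapped in a chat-bubble container and presented as a message by the agent:

```python
# Hypothetical illustration of reusing standard HTML form elements inside an
# HTML-based conversational interface; names and CSS classes are assumptions.
def multiple_choice(name, options):
    """Render the same radio buttons a traditional web task form would use."""
    rows = "".join(
        f'<label><input type="radio" name="{name}" value="{opt}"> {opt}</label><br>'
        for opt in options
    )
    return f"<form>{rows}</form>"

def as_chat_message(sender, inner_html):
    """Wrap any task element in a chat bubble so the agent can present it as a turn."""
    return f'<div class="chat-bubble {sender}">{inner_html}</div>'

question = multiple_choice("sentiment", ["Positive", "Neutral", "Negative"])
print(as_chat_message("agent", "What is the sentiment of this tweet?"))
print(as_chat_message("agent", question))
```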

Caveats and Limitations
Our findings with respect to the impact of conversational interfaces on worker engagement across different task types suggest that different conversational styles of the agent can affect worker retention, albeit not consistently. Moreover, further experiments that decouple the impact of task difficulty [42] are needed to fully uncover the impact of conversational styles in conversational microtask crowdsourcing. Having said that, our findings are an important first step towards optimizing novel conversational interfaces for microtask crowdsourcing.

Influence of Monetary Incentives. Workers earned monetary rewards across all conditions in our study. Monetary rewards have been shown to incentivize workers to complete more work [7]. However, we ensured that the pay per unit of time (reward) was identical across all conditions and task types, making comparisons across the conditions in our study valid and meaningful. Our long-term goal with conversational microtasking is to improve engagement, help workers overcome fatigue or boredom, and reduce task abandonment [15].

Implementing Conversational Interfaces. For task requesters, it can be difficult to adapt some types of tasks to conversational interfaces (such as drawing free-form boundaries around objects). However, as research in conversational microtasking advances, so will the support for assisting requesters in realizing such interfaces with ease. Requesters can further weigh the trade-off between implementation costs and the benefits of increased worker engagement.

CONCLUSIONS
In this paper, we studied how worker engagement can be affected by conversational interfaces and conversational styles. We conducted online crowdsourcing experiments to study whether worker engagement can be affected by the conversational interface (RQ1). We used post-task surveys to measure workers' engagement and cognitive load while completing tasks using conversational interfaces with different conversational styles (RQ2). We show that the use of conversational interfaces can improve perceived worker engagement, and that the adopted conversational style can also have an effect on worker retention.


REFERENCES
[1] Alan Aipe and Ujwal Gadiraju. 2018. SimilarHITs: Revealing the Role of Task Similarity in Microtask Crowdsourcing. In Proceedings of the 29th on Hypertext and Social Media. ACM, 115–122.
[2] Sandeep Avula, Gordon Chadwick, Jaime Arguello, and Robert Capra. 2018. SearchBots: User Engagement with ChatBots During Collaborative Search. In Proceedings of the 2018 Conference on Human Information Interaction & Retrieval. ACM, 52–61.
[3] Luka Bradeško, Michael Witbrock, Janez Starc, Zala Herga, Marko Grobelnik, and Dunja Mladenic. 2017. Curious Cat – Mobile, Context-Aware Conversational Crowdsourcing Knowledge Acquisition. ACM Transactions on Information Systems (TOIS) 35, 4 (2017), 33.
[4] Phil Cohen, Adam Cheyer, Eric Horvitz, Rana El Kaliouby, and Steve Whittaker. 2016. On the future of personal assistants. In Proceedings of the 2016 CHI Conference Extended Abstracts on Human Factors in Computing Systems. ACM, 1032–1037.
[5] Benjamin R Cowan, Nadia Pantidi, David Coyle, Kellie Morrissey, Peter Clarke, Sara Al-Shehri, David Earley, and Natasha Bandeira. 2017. What can I help you with?: Infrequent users' experiences of intelligent personal assistants. In Proceedings of the 19th International Conference on Human-Computer Interaction with Mobile Devices and Services. ACM, 43.
[6] Peng Dai, Jeffrey M Rzeszotarski, Praveen Paritosh, and Ed H Chi. 2015. And now for something completely different: Improving crowdsourcing workflows with micro-diversions. In Proceedings of the 18th ACM Conference on Computer-Supported Cooperative Work and Social Computing. ACM, 628–638.
[7] Djellel Eddine Difallah, Michele Catasta, Gianluca Demartini, and Philippe Cudré-Mauroux. 2014. Scaling-up the crowd: Micro-task pricing schemes for worker retention and latency improvement. In Second AAAI Conference on Human Computation and Crowdsourcing.
[8] Djellel Eddine Difallah, Michele Catasta, Gianluca Demartini, Panagiotis G Ipeirotis, and Philippe Cudré-Mauroux. 2015. The dynamics of micro-task crowdsourcing: The case of Amazon MTurk. In Proceedings of the 24th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 238–247.
[9] Li Fei-Fei, Rob Fergus, and Pietro Perona. 2007. Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. Computer Vision and Image Understanding 106, 1 (2007), 59–70.
[10] Oluwaseyi Feyisetan, Elena Simperl, Max Van Kleek, and Nigel Shadbolt. 2015. Improving paid microtasks through gamification and adaptive furtherance incentives. In Proceedings of the 24th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 333–343.
[11] Joel E Fischer, Stuart Reeves, Martin Porcheron, and Rein Ove Sikveland. 2019. Progressivity for voice interface design. In Proceedings of the 1st International Conference on Conversational User Interfaces. ACM, 26.
[12] Ujwal Gadiraju, Ricardo Kawase, and Stefan Dietze. 2014. A taxonomy of microtasks on the web. In Proceedings of the 25th ACM Conference on Hypertext and Social Media. ACM, 218–223.
[13] Ujwal Gadiraju, Jie Yang, and Alessandro Bozzon. 2017. Clarity is a worthwhile quality: On the role of task clarity in microtask crowdsourcing. In Proceedings of the 28th ACM Conference on Hypertext and Social Media. ACM, 5–14.
[14] Bettina Graf, Maike Krüger, Felix Müller, Alexander Ruhland, and Andrea Zech. 2015. Nombot: Simplify food tracking. In Proceedings of the 14th International Conference on Mobile and Ubiquitous Multimedia. ACM, 360–363.
[15] Lei Han, Kevin Roitero, Ujwal Gadiraju, Cristina Sarasua, Alessandro Checco, Eddy Maddalena, and Gianluca Demartini. 2019. All those wasted hours: On task abandonment in crowdsourcing. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining. ACM, 321–329.
[16] Sandra G Hart. 2006. NASA-Task Load Index (NASA-TLX); 20 years later. In Proceedings of the Human Factors and Ergonomics Society Annual Meeting, Vol. 50. Sage Publications, Los Angeles, CA, 904–908.
[17] Rens Hoegen, Deepali Aneja, Daniel McDuff, and Mary Czerwinski. 2019. An End-to-End Conversational Style Matching Agent. arXiv preprint arXiv:1904.02760 (2019).
[18] Sture Holm. 1979. A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics (1979), 65–70.
[19] Ting-Hao Kenneth Huang, Joseph Chee Chang, and Jeffrey P Bigham. 2018. Evorus: A Crowd-powered Conversational Assistant Built to Automate Itself Over Time. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems. ACM, 295.
[20] Ting-Hao Kenneth Huang, Walter S Lasecki, and Jeffrey P Bigham. 2015. Guardian: A crowd-powered spoken dialog system for web APIs. In Third AAAI Conference on Human Computation and Crowdsourcing.
[21] Soomin Kim, Joonhwan Lee, and Gahgene Gweon. 2019. Comparing Data from Chatbot and Web Surveys: Effects of Platform and Conversational Style on Survey Response Quality. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems (CHI '19). ACM, New York, NY, USA, Article 86, 12 pages. DOI: http://dx.doi.org/10.1145/3290605.3300316


[22] Julia Kiseleva, Kyle Williams, Jiepu Jiang, Ahmed Hassan Awadallah, Aidan C Crook, Imed Zitouni, and Tasos Anastasakos. 2016. Understanding user satisfaction with intelligent assistants. In Proceedings of the 2016 ACM on Conference on Human Information Interaction and Retrieval. ACM, 121–130.
[23] Aniket Kittur, Jeffrey V Nickerson, Michael Bernstein, Elizabeth Gerber, Aaron Shaw, John Zimmerman, Matt Lease, and John Horton. 2013. The future of crowd work. In Proceedings of the 2013 Conference on Computer Supported Cooperative Work. ACM, 1301–1318.
[24] Pavel Kucherbaev, Alessandro Bozzon, and Geert-Jan Houben. 2018. Human-Aided Bots. IEEE Internet Computing 22, 6 (2018), 36–43.
[25] Robin Tolmach Lakoff. 1979. Stylistic strategies within a grammar of style. Annals of the New York Academy of Sciences 327, 1 (1979), 53–78.
[26] Walter S Lasecki, Rachel Wesley, Jeffrey Nichols, Anand Kulkarni, James F Allen, and Jeffrey P Bigham. 2013. Chorus: A crowd-powered conversational assistant. In Proceedings of the 26th Annual ACM Symposium on User Interface Software and Technology. ACM, 151–162.
[27] Ewa Luger and Abigail Sellen. 2016. Like having a really bad PA: The gulf between user expectation and experience of conversational agents. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems. ACM, 5286–5297.
[28] Andrew Mao, Ece Kamar, and Eric Horvitz. 2013. Why stop now? Predicting worker engagement in online crowdsourcing. In First AAAI Conference on Human Computation and Crowdsourcing.
[29] Panagiotis Mavridis, Owen Huang, Sihang Qiu, Ujwal Gadiraju, and Alessandro Bozzon. 2019. Chatterbox: Conversational Interfaces for Microtask Crowdsourcing. In Proceedings of the 27th ACM Conference on User Modeling, Adaptation and Personalization. ACM, 243–251.
[30] Robert J Moore, Raphael Arar, Guang-Jie Ren, and Margaret H Szymanski. 2017. Conversational UX design. In Proceedings of the 2017 CHI Conference Extended Abstracts on Human Factors in Computing Systems. ACM, 492–497.
[31] Heather O'Brien. 2016. Theoretical perspectives on user engagement. In Why Engagement Matters. Springer, 1–26.
[32] Heather L O'Brien, Paul Cairns, and Mark Hall. 2018. A practical approach to measuring user engagement with the refined user engagement scale (UES) and new UES short form. International Journal of Human-Computer Studies 112 (2018), 28–39.
[33] Jeffrey M Rzeszotarski, Ed Chi, Praveen Paritosh, and Peng Dai. 2013. Inserting micro-breaks into crowdsourcing workflows. In First AAAI Conference on Human Computation and Crowdsourcing.
[34] Ameneh Shamekhi, Mary Czerwinski, Gloria Mark, Margeigh Novotny, and Gregory A Bennett. 2016. An exploratory study toward the preferred conversational style for compatible virtual agents. In International Conference on Intelligent Virtual Agents. Springer, 40–50.
[35] Tanya Stivers and Jeffrey D Robinson. 2006. A preference for progressivity in interaction. Language in Society 35, 3 (2006), 367–392.
[36] Deborah Tannen. 1987. Conversational style. Psycholinguistic Models of Production (1987), 251–267.
[37] Deborah Tannen. 2005. Conversational Style: Analyzing Talk Among Friends. Oxford University Press.
[38] Paul Thomas, Mary Czerwinski, Daniel McDuff, Nick Craswell, and Gloria Mark. 2018. Style and alignment in information-seeking conversation. In Proceedings of the 2018 Conference on Human Information Interaction & Retrieval. ACM, 42–51.
[39] Paul Thomas, Daniel McDuff, Mary Czerwinski, and Nick Craswell. 2017. MISC: A data set of information-seeking conversations. In SIGIR 1st International Workshop on Conversational Approaches to Information Retrieval (CAIR'17), Vol. 5.
[40] Bert Vandenberghe. 2017. Bot personas as off-the-shelf users. In Proceedings of the 2017 CHI Conference Extended Abstracts on Human Factors in Computing Systems. ACM, 782–789.
[41] Alexandra Vtyurina, Denis Savenkov, Eugene Agichtein, and Charles LA Clarke. 2017. Exploring conversational search with humans, assistants, and wizards. In Proceedings of the 2017 CHI Conference Extended Abstracts on Human Factors in Computing Systems. ACM, 2187–2193.
[42] Jie Yang, Judith Redi, Gianluca Demartini, and Alessandro Bozzon. 2016. Modeling task complexity in crowdsourcing. In Fourth AAAI Conference on Human Computation and Crowdsourcing.
[43] Xi Yang, Marco Aurisicchio, and Weston Baxter. 2019. Understanding Affective Experiences With Conversational Agents. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems. ACM, 542.
[44] Mengdie Zhuang and Ujwal Gadiraju. 2019. In What Mood Are You Today?: An Analysis of Crowd Workers' Mood, Performance and Engagement. In Proceedings of the 11th ACM Conference on Web Science (WebSci 2019), Boston, MA, USA, June 30 – July 03, 2019. 373–382. DOI: http://dx.doi.org/10.1145/3292522.3326010