
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pages 5625–5635, Hong Kong, China, November 3–7, 2019. © 2019 Association for Computational Linguistics


Automatic Argument Quality Assessment - New Datasets and Methods

Assaf Toledo∗, Shai Gretz∗, Edo Cohen-Karlik∗, Roni Friedman∗, Elad Venezian, Dan Lahav, Michal Jacovi, Ranit Aharonov and Noam Slonim

IBM Research

Abstract

We explore the task of automatic assessment of argument quality. To that end, we actively collected 6.3k arguments, more than a factor of five compared to previously examined data. Each argument was explicitly and carefully annotated for its quality. In addition, 14k pairs of arguments were annotated independently, identifying the higher quality argument in each pair. In spite of the inherently subjective nature of the task, both annotation schemes led to surprisingly consistent results. We release the labeled datasets to the community. Furthermore, we suggest neural methods based on a recently released language model, for argument ranking as well as for argument-pair classification. In the former task, our results are comparable to the state of the art; in the latter task our results significantly outperform earlier methods.

1 Introduction

Computational argumentation has been receiving growing interest in the NLP community in recent years (Reed, 2016). With this field rapidly expanding, various methods have been developed for sub-tasks such as argument detection (Lippi and Torroni, 2016; Levy et al., 2014; Rinott et al., 2015), stance detection (Bar-Haim et al., 2017) and argument clustering (Reimers et al., 2019).

Recently, IBM introduced Project Debater, the first AI system able to debate humans on complex topics. The system participated in a live debate against a world champion debater, and was able to mine arguments, use them for composing a speech supporting its side of the debate, and also rebut its human competitor.1 The underlying technology is intended to enhance decision-making.

∗ These authors equally contributed to this work.

1 For more details: https://www.research.ibm.com/artificial-intelligence/project-debater/live/

More recently, IBM also introduced Speech by Crowd, a service which supports the collection of free-text arguments from large audiences on debatable topics to generate meaningful narratives. A real-world use-case of Speech by Crowd is in the field of civic engagement, where the aim is to exploit the wisdom of the crowd to enhance decision making on various topics. There are already several public organizations and commercial companies in this domain, e.g., Decide Madrid2 and Zencity.3 As part of the development of Speech by Crowd, 6.3k arguments were collected from contributors of various levels, and are released as part of this work.

An important sub-task of such a service is the automatic assessment of argument quality, which has already shown its importance for prospective applications such as automated decision making (Bench-Capon et al., 2009), argument search (Wachsmuth et al., 2017b), and writing support (Stab and Gurevych, 2014). Identifying argument quality in the context of Speech by Crowd allows for the top-quality arguments to surface out of many contributions.

Assessing argument quality has driven practitioners in a plethora of fields for centuries, from philosophers (Aristotle et al., 1991), through academic debaters, to argumentation scholars (Walton et al., 2008). An inherent difficulty in this domain is the presumably subjective nature of the task. Wachsmuth et al. (2017a) proposed a taxonomy of quantifiable dimensions of argument quality, comprised of high-level dimensions such as cogency and effectiveness, and sub-dimensions such as relevance and clarity, that together enable the assignment of a holistic quality score to an argument.

2 https://decide.madrid.es
3 https://zencity.io

Habernal and Gurevych (2016b) and Simpson and Gurevych (2018) take a relative approach and treat the problem as relation classification. They focus on convincingness – a primary dimension of quality – and determine it by comparing pairs of arguments with similar stance. In this view, the convincingness of an individual argument is a derivative of its relative convincingness: arguments that are judged as more convincing when compared to others are attributed higher scores. These works explore the labeling and automatic assessment of argument convincingness using two datasets introduced by Habernal and Gurevych (2016b): UKPConvArgRank (henceforth, UKPRank) and UKPConvArgAll, which contain 1k and 16k arguments and argument-pairs, respectively.

Gleize et al. (2019) also take a relative approach to argument quality, focusing on ranking convincingness of evidence. Their solution is based on a Siamese neural network, which outperforms the results achieved in Simpson and Gurevych (2018) on the UKP datasets, as well as several baselines on their own dataset, IBM-ConvEnv.4

Here, we extend earlier work in several ways: (1) introducing a large dataset of actively collected arguments, carefully annotated for quality; (2) suggesting a method for argument-pair classification, which outperforms state-of-the-art accuracy on available datasets; (3) suggesting a method for individual argument ranking, which achieves results comparable to the state of the art.

Our data was collected actively, via a dedicated user interface. This is in contrast to previous datasets, which were sampled from online debate portals. We believe that our approach to data collection is more controlled and reduces noise in the data, thus making it easier to utilize in the context of learning algorithms (see Section 7).

Moreover, we applied various cleansing methods to ensure the high quality of the contributed data and the annotations, as detailed in Section 3.

We packaged our data in the following datasets, which are released to the research community5:

• IBM-ArgQ-6.3kArgs - the full dataset, comprised of all 6.3k arguments that were collected and annotated with an individual quality score in the range [0, 1].

4 As this work is relatively recent and was published after our submission, we were not able to compare to it.

5 https://www.research.ibm.com/haifa/dept/vst/debating_data.shtml#ArgumentQuality

• IBM-ArgQ-14kPairs - 14k argument pairs annotated with a relative quality label, indicating which argument is of higher quality.

• IBM-ArgQ-5.3kArgs - the subset of 5.3k arguments from IBM-ArgQ-6.3kArgs that passed our cleansing process. This set is used in the argument ranking experiments in Section 9.2; henceforth: IBMRank.

• IBM-ArgQ-9.1kPairs - the subset of 9.1k argument pairs from IBM-ArgQ-14kPairs that passed our cleansing process, used in the argument-pair classification experiments in Section 9.1. Henceforth: IBMPairs.

The dataset IBMRank differs from UKPRank in a number of ways. Firstly, IBMRank includes 5.3k arguments, making it more than 5 times larger than UKPRank. Secondly, the arguments in IBMRank were collected actively from contributors. Thirdly, IBMRank includes explicit quality labeling of all individual arguments, which is absent from earlier data, enabling us to explore the potential of training quality-prediction methods on top of such labels, which are presumably easier to collect.

Finally, with the abundance of technologies such as automated personal assistants, we envision automated argument quality assessment expanding to applications that include oral communication. Such use-cases pose new challenges, overlooked by prior work, which mainly focused on written arguments. As an initial attempt to address these issues, in the newly contributed data we guided annotators to assess the quality of an argument within the context of using the argument as-is to generate a persuasive speech on the topic. Correspondingly, we expect these data to reflect additional quality dimensions – e.g., a quality premium on efficiently phrased arguments, and low tolerance to blunt mistakes such as typos that may lead to poorly stated arguments.

2 Argument Collection

As part of the development of Speech by Crowd, online and on-site experiments have been conducted, enabling us to test the ability of the service to generate a narrative based on collected arguments. Arguments were collected from two main sources: (1) debate club members of all levels, from novices to experts; and (2) a broad audience of people attending the experiments.


For the purpose of collecting arguments, we first selected 11 well known controversial concepts, common in the debate world, such as Social Media, Doping in Sports and Flu vaccination. Using debate jargon, each concept is used to phrase two “motions”, by proposing two specific and opposing policies or views towards that concept. For example, for the concept Autonomous Cars, we suggested the motions We should promote Autonomous Cars and We should limit Autonomous Cars.6 The full list of motions appears in Table 1 with the number of arguments collected for each.7

Guidelines

Contributors were invited to a dedicated user interface in which they were guided to contribute arguments per concept, using the following concise instructions:

You can submit as many arguments as you like, both pro and con, using original language and no personal information (i.e., information about an identifiable person).

In addition, to exemplify the type of arguments that we expect to receive, contributors were shown an example of one argument related to the motion, provided by a professional debater. The arguments collected had to have 8–36 words, aiming at obtaining efficiently phrased arguments (longer/shorter arguments were rejected by the UI). In total, we collected 6,257 arguments.

3 Argument Quality Labeling

We explored two approaches to labeling argument quality: (a) labeling individual arguments (absolute approach): each individual argument is directly labeled for its quality; and (b) labeling argument pairs (relative approach): each argument pair is labeled for which of the two arguments is of higher quality. In this section we describe the pros and cons of each approach as well as the associated labeling process.

Approaches to Argument Quality Labeling

The effort in labeling individual arguments scales linearly with the number of arguments, compared to the quadratic scaling of labeling pairs (within the same motion); thus, it is clearly more feasible when considering a large number of arguments. However, the task of determining the quality of arguments in isolation is presumably more challenging; it requires evaluating the quality of an argument without a clear reference point (except for the motion text). This is where the relative approach has its strength, as it frames the labeling task in a specific context of two competing arguments, and is expected to yield higher inter-annotator agreement. Indeed, a comparative approach is widely used in many NLP applications, e.g., in Chen et al. (2013) for assessing reading difficulty of documents and in Aranberri et al. (2017) for machine translation. In light of these considerations, here we decided to investigate and compare both approaches. We used the Figure Eight platform8, with a relatively large number of 15–17 annotators per instance, to improve the reliability of the collected annotations.

6 Habernal and Gurevych (2016b) use the term topic for what we refer to as motion.
7 In Table 1, vvg stands for violent video games.
8 http://figure-eight.com/

Motion | #Args
Flu vaccination should be mandatory | 204
Flu vaccination should not be mandatory | 174
Gambling should be banned | 342
Gambling should not be banned | 382
Online shopping brings more harm than good | 198
Online shopping brings more good than harm | 215
Social media brings more harm than good | 879
Social media brings more good than harm | 686
We should adopt cryptocurrency | 172
We should abandon cryptocurrency | 160
We should adopt vegetarianism | 221
We should abandon vegetarianism | 179
We should ban the sale of vvg to minors | 275
We should allow the sale of vvg to minors | 240
We should ban fossil fuels | 146
We should not ban fossil fuels | 116
We should legalize doping in sport | 212
We should ban doping in sport | 215
We should limit autonomous cars | 313
We should promote autonomous cars | 480
We should support information privacy laws | 355
We should discourage information privacy laws | 93

Table 1: Motion list and statistics on data collection.

3.1 Labeling Individual Arguments

The goal of this task is to assign a quality score for each individual argument. Annotators were presented with the following binary question per argument:

Disregarding your own opinion on the topic, would you recommend a friend preparing a speech supporting/contesting the topic to use this argument as is in the speech? (yes/no)

All arguments that were collected as described in Section 2 were labeled in this task. We model the quality of each individual argument as a real value in the range [0, 1], by calculating the fraction of ‘yes’ answers. To ensure that the annotators carefully read each argument, the labeling of each argument started with a test question about the stance of the argument towards the concept (pro or con). The annotators’ performance on these test questions was used in the quality control process described in Section 4, and also in determining which pairs of arguments to label.

3.2 Labeling Argument Pairs

In this task, annotators were presented with a pair of arguments having the same stance towards the concept (to reduce bias due to the annotator’s opinion), and were asked the following:

Which of the two arguments would have been preferred by most people to support/contest the topic?

Table 2 presents an example of such an argument pair, in which the annotators unanimously preferred the first argument.

Argument 1: Children emulate the media they consume and so will be more violent if you don’t ban them from violent video games

Argument 2: These are less fun and more harmful games but specifically violent games are played in groups and exclude softer souls

Table 2: An example of an argument pair for the motion We should ban the sale of violent video games to minors. The first argument was unanimously preferred by all annotators.

As mentioned, annotating all pairs in a large collection of arguments is often not feasible. Thus, we focused our attention on pairs that are presumably most valuable for training a learning algorithm. Specifically, we annotated 14k randomly selected pairs that satisfy the following criteria (a selection sketch follows the list):

1. At least 80% of the annotators agreed on the stance of each argument, aiming to focus on clearly stated arguments.

2. Individual quality scores in each pair differ by at least 0.2, aiming for pairs with a relatively high chance of a clear winner.

3. The length of both arguments, as measured by number of tokens, differs by ≤ 20%, aiming to focus the task on dimensions beyond argument length.
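To make the selection procedure concrete, the following is a minimal sketch of how such a filter could be implemented. It is not the code used in this work; the record fields (stance_agreement, quality, n_tokens) and the exact interpretation of the 20% length criterion are illustrative assumptions.

```python
# A minimal sketch (not the authors' code) of the pair-selection criteria above.
import itertools
import random

def select_pairs(args, n_pairs=14000, seed=0):
    """args: list of dicts with assumed keys 'id', 'motion', 'stance',
    'stance_agreement', 'quality', 'n_tokens'."""
    candidates = []
    for a, b in itertools.combinations(args, 2):
        same_side = a["motion"] == b["motion"] and a["stance"] == b["stance"]
        clear_stance = min(a["stance_agreement"], b["stance_agreement"]) >= 0.8   # criterion 1
        quality_gap = abs(a["quality"] - b["quality"]) >= 0.2                     # criterion 2
        longer = max(a["n_tokens"], b["n_tokens"])
        shorter = min(a["n_tokens"], b["n_tokens"])
        similar_length = (longer - shorter) / longer <= 0.2                       # criterion 3
        if same_side and clear_stance and quality_gap and similar_length:
            candidates.append((a["id"], b["id"]))
    random.Random(seed).shuffle(candidates)   # random selection among eligible pairs
    return candidates[:n_pairs]
```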

4 Quality Control

To monitor and ensure the quality of the collected annotations, we employed the following analyses:

Kappa Analysis –

1. Pairwise Cohen’s kappa (κ) (Cohen, 1960) is calculated for each pair of annotators that share at least 50 common argument/argument-pair judgments, and based only on those common judgments.

2. Annotator-κ is obtained by averaging all pairwise κ values for this annotator as calculated in Step 1, if and only if this annotator had ≥ 5 pairwise κ values estimated. This is used to ignore annotators as described later.

3. Averaging all Annotator-κ values calculated in Step 2 results in Task-Average-κ.9 A minimal computation sketch is given after this list.
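The sketch below illustrates, under assumptions, how the three steps could be computed with scikit-learn's cohen_kappa_score; it is not the authors' implementation, and the data layout (a per-annotator dictionary of item judgments) is assumed for illustration.

```python
# A minimal sketch (assumed data layout, not the authors' code) of pairwise Cohen's
# kappa on shared judgments, Annotator-kappa, and Task-Average-kappa.
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

def agreement_stats(judgments, min_shared=50, min_pairs=5):
    """judgments: dict annotator -> dict item_id -> label."""
    pairwise = {a: [] for a in judgments}
    for a, b in combinations(judgments, 2):
        shared = set(judgments[a]) & set(judgments[b])
        if len(shared) < min_shared:
            continue  # Step 1: require at least 50 common judgments
        kappa = cohen_kappa_score([judgments[a][i] for i in shared],
                                  [judgments[b][i] for i in shared])
        pairwise[a].append(kappa)
        pairwise[b].append(kappa)
    # Step 2: Annotator-kappa only for annotators with >= 5 pairwise estimates
    annotator_kappa = {a: sum(ks) / len(ks)
                       for a, ks in pairwise.items() if len(ks) >= min_pairs}
    # Step 3: Task-Average-kappa over all valid Annotator-kappa values
    task_average = (sum(annotator_kappa.values()) / len(annotator_kappa)
                    if annotator_kappa else float("nan"))
    return annotator_kappa, task_average
```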

Test Questions Analysis – Hidden embedded test questions, based on ground truth, are often valuable for monitoring crowd work. In our setup, at least one fifth of the judgments provided by each annotator are on test questions. When annotators fail a test question, they are alerted. Thus, beyond monitoring annotator quality, test questions also provide annotators with feedback on task expectations. In addition, an annotator who fails more than a pre-specified fraction (e.g., 20%) of the test questions is removed from the task, and their judgments are ignored.

High Prior Analysis – An annotator who always answers ‘yes’ to a particular question should obviously be ignored; more generally, we discarded the judgments contributed by annotators with a relatively high prior to answer positively on the presented questions.

Note that if an annotator is discarded due to failure in any of the above analyses, they are further discarded from the estimation of Annotator-κ and Task-Average-κ.

9 It is noted that some annotators remain without a valid Annotator-κ and cannot be filtered out based on their κ. Similarly, those annotators do not contribute to the Task-Average-κ. However, in both annotation tasks, those annotators contributed only 0.01–0.03 of the judgments collected.


5 Data Cleansing

5.1 Cleansing of Individual Argument Judgments

To enhance the quality of the collected data, we discard judgments by annotators who (1) failed ≥ 20% of the test questions10; and/or (2) obtained Annotator-κ ≤ 0.35 in the stance judgment task; and/or (3) answered ‘yes’ for ≥ 80% of the quality judgment questions. Finally, we discarded arguments that were left with fewer than 7 valid judgments. This process left us with 5.3k arguments, each with 11.4 valid annotations on average. The Task-Average-κ was 0.69 on the stance question and 0.1 on the quality question. We refer to the full, unfiltered set as IBM-ArgQ-6.3kArgs, and to the filtered set as IBM-ArgQ-5.3kArgs (IBMRank).

10 Since quality judgments are relatively subjective, we focused the test questions on the stance question.
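As an illustration only (the field names and bookkeeping are assumptions, not the authors' pipeline), these cleansing rules and the resulting quality scores could be computed roughly as follows:

```python
# A minimal sketch of the cleansing rules for individual-argument judgments: drop
# annotators who violate the thresholds above, then score each argument as the
# fraction of 'yes' answers among its remaining judgments, requiring at least 7.
def keep_annotator(a):
    """a: dict with assumed keys 'test_fail_rate', 'stance_kappa' (None if undefined),
    and 'yes_rate' (fraction of positive quality answers)."""
    return (a["test_fail_rate"] < 0.20
            and (a["stance_kappa"] is None or a["stance_kappa"] > 0.35)
            and a["yes_rate"] < 0.80)

def quality_scores(judgments, annotators, min_valid=7):
    """judgments: list of dicts with 'arg_id', 'annotator', 'answer' ('yes'/'no')."""
    by_arg = {}
    for j in judgments:
        if keep_annotator(annotators[j["annotator"]]):
            by_arg.setdefault(j["arg_id"], []).append(j["answer"])
    return {arg: sum(ans == "yes" for ans in answers) / len(answers)
            for arg, answers in by_arg.items() if len(answers) >= min_valid}
```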

For completeness, we also attempted to utilize an alternative data cleansing tool, MACE (Hovy et al., 2013). We ran MACE with a threshold k, keeping the top k percent of arguments according to their entropy. We then re-calculated Task-Average-κ on the resulting dataset. We ran MACE with k=0.95, as used in Habernal and Gurevych (2016b), and with k=0.85, as this results in a dataset similar in size to IBMRank. The resulting Task-Average-κ is 0.08 and 0.09, respectively, lower than our reported 0.1. We thus maintain our approach to data cleansing as described above.

The low average κ of 0.1 for quality judgments is expected due to the subjective nature of the task, but nonetheless requires further attention. Based on the following observations, we argue that the labels inferred from these annotations are still meaningful and valuable: (1) the high Task-Average-κ on the stance task conveys that the annotators carefully read the arguments before providing their judgments; (2) we report high agreement of the individual quality labels with the argument-pair annotations, for which much better κ values were obtained (see Section 6.1); (3) we demonstrate that the collected labels can be successfully used by a neural network to predict argument ranking (see Section 9.2), suggesting these labels carry a real signal related to the properties of the arguments.

5.2 Cleansing of Argument Pair Labeling

To enhance the quality of the collected pairwise data, we discard judgments by annotators who (1) failed ≥ 30% of the test questions; and/or (2) obtained Annotator-κ ≤ 0.15 in this task. Here, the test questions directly addressed the (relative) quality judgment of pairs, and not the stance of the arguments. In initial annotation rounds, the test questions were created based on the previously collected individual argument labels, considering pairs in which the difference in individual quality scores was ≥ 0.6.11 In subsequent annotation rounds, the test questions were defined based on pairs for which ≥ 90% of the annotators agreed on the winning argument. Following this process, we were left with an average of 15.9 valid annotations for each pair, and with a Task-Average-κ of 0.42 on the quality judgments – a relatively high value for such a subjective task. As an additional cleansing step, for training the learning algorithms we considered only pairs for which ≥ 70% of the annotators agreed on the winner, leaving us with a total of 9.1k pairs for training and evaluation. We refer to the full, unfiltered set as IBM-ArgQ-14kPairs, and to the filtered set as IBM-ArgQ-9.1kPairs.

11 Note that annotators have the option of contesting problematic test questions, and thus unfitting ones were disabled during the task by our team.

6 Data Consistency

6.1 Consistency of Labeling Tasks

Provided with both individual and pairwise quality labeling, we estimated the consistency of these two approaches. For each pair of arguments, we define the expected winning argument as the one with the higher individual argument score, and compare it to the actual winning argument, namely the argument preferred by most annotators when considering the pair directly. Overall, in 75% of the pairs the actual winner was the expected one. Moreover, when focusing on pairs in which the individual argument scores differ by > 0.5, this agreement reaches 84.3% of the pairs.
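A minimal sketch of this consistency check, assuming a list of annotated pairs and a dictionary of individual quality scores (the field names are illustrative, not the authors' code):

```python
# Compare the "expected" winner (higher individual quality score) with the "actual"
# winner (the argument preferred by the majority of pairwise annotators).
def pairwise_consistency(pairs, scores, min_gap=0.0):
    """pairs: list of dicts with 'arg_a', 'arg_b', 'frac_prefer_a';
    scores: dict arg_id -> individual quality score in [0, 1]."""
    checked, agree = 0, 0
    for p in pairs:
        gap = scores[p["arg_a"]] - scores[p["arg_b"]]
        if abs(gap) <= min_gap:
            continue  # skip ties (or small gaps when min_gap > 0)
        expected_a = gap > 0
        actual_a = p["frac_prefer_a"] > 0.5
        checked += 1
        agree += expected_a == actual_a
    return agree / checked

# pairwise_consistency(pairs, scores)              -> overall agreement
# pairwise_consistency(pairs, scores, min_gap=0.5) -> agreement on high-gap pairs
```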

6.2 Reproducibility Evaluation

An important property of a valid annotation is its reproducibility. For this purpose, a random sample of 500 argument pairs from the IBMPairs dataset was relabeled by the crowd. This relabeling took place a few months after the main annotation tasks, with the exact task and data cleansing methods that were employed originally. For measuring correlation, the following A score was defined: the fraction of valid annotations selecting “argument A” in an argument pair (A, B) as having higher quality, out of the total number of valid annotations. Pearson’s correlation coefficient between the A scores in the initial and secondary annotations of the defined sample was 0.81.

A similar process was followed with the individual argument quality labeling. Instead of relabeling, we split the existing annotations into two even groups. We chose only individual arguments for which at least 14 valid annotations remained after data cleansing (1,154 such arguments). This resulted in two sets of labels for the same data, each based on at least 7 annotations. Pearson’s correlation coefficient between the quality scores of the two sets was 0.53. We then divided the quality score, which ranges between 0 and 1, into 10 equal bins. The bin frequency counts between the two sets are displayed in the heatmap in Figure 1.

Figure 1: Counts of quality score bins between two equally sized sets of annotators.

6.3 Transitivity Evaluation

Following Habernal and Gurevych (2016b), we further examined to what extent our labeled pairs satisfy transitivity. Specifically, a triplet of arguments (A, B, C) in which A is preferred over B, and B is preferred over C, is considered transitive if and only if A is also preferred over C. We examined all 892 argument triplets for which all pairwise combinations were labeled, and found that transitivity holds in 96.2% of the triplets, further strengthening the validity of our data.
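The transitivity check can be phrased compactly over the set of labeled pairwise outcomes; the sketch below is an illustration under that assumed representation, not the authors' code.

```python
# A minimal sketch of the transitivity check: over all argument triplets whose three
# pairs were labeled, verify that the three outcomes admit a consistent ordering.
from itertools import combinations, permutations

def transitivity_rate(wins):
    """wins: set of (winner, loser) tuples, one per labeled argument pair."""
    args = {a for pair in wins for a in pair}
    total, transitive = 0, 0
    for triple in combinations(sorted(args), 3):
        pairs = list(combinations(triple, 2))
        if not all((x, y) in wins or (y, x) in wins for x, y in pairs):
            continue  # skip triplets with an unlabeled pair
        total += 1
        # transitive iff some ordering a > b > c agrees with all three labels
        if any((a, b) in wins and (b, c) in wins and (a, c) in wins
               for a, b, c in permutations(triple)):
            transitive += 1
    return transitive / total if total else float("nan")
```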

7 Comparison of IBMRank and UKPRank

A distinctive feature of our IBMRank dataset is that it was collected actively, via a dedicated user interface with clear instructions and enforced length limitations. Correspondingly, we end up with cleaner texts, which are also more homogeneous in terms of length, compared to UKPRank, which relies on arguments collected from debate portals.

Text Cleanliness

We counted tokens representing a malformed span of text in IBMRank and UKPRank. These are HTML markup tags, links, excessive punctuation12, and tokens not found in the GloVe vocabulary (Pennington et al., 2014). Our findings show that 94.78% of IBMRank arguments contain no malformed text, 4.38% include one such token, and 0.71% include two such tokens. In the case of UKPRank, only 62.36% of the arguments are free of malformed text, 17.59% include one such token, and 20.05% include two or more tokens of malformed text.

12 Sequences of three or more punctuation characters, e.g., “?!?!?!”
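A rough sketch of how such malformed-token counting might look; the exact rules used in this work are not specified beyond the description above, so the regular expressions and the case-folded GloVe lookup are assumptions.

```python
# A minimal sketch (assumed rules, not the authors' script) for flagging "malformed"
# tokens: HTML markup, links, runs of 3+ punctuation characters, and tokens that are
# out of vocabulary with respect to a GloVe word set.
import re

HTML_RE = re.compile(r"</?\w+[^>]*>")
LINK_RE = re.compile(r"https?://\S+|www\.\S+")
PUNCT_RUN_RE = re.compile(r"[^\w\s]{3,}")   # e.g. "?!?!?!"

def count_malformed(tokens, glove_vocab):
    """tokens: list of token strings; glove_vocab: set of known GloVe words."""
    malformed = 0
    for tok in tokens:
        if (HTML_RE.search(tok) or LINK_RE.search(tok)
                or PUNCT_RUN_RE.search(tok)
                or tok.lower() not in glove_vocab):
            malformed += 1
    return malformed
```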

Text Length

As depicted in Figure 2, the arguments in IBMRank are substantially more homogeneous in their length compared to UKPRank. A potential drawback of the length limitation is that it possibly prevents any learning system from being able to model long arguments correctly. However, by imposing this restriction we expect our quality labeling to be less biased by argument length, holding greater potential to reveal other properties that contribute to argument quality. We confirmed this intuition with respect to the argument-pair labeling, as described in Section 9.1.

Figure 2: Histograms of argument length in IBMRank and UKPRank. X-axis: length (token count). Y-axis: the number of arguments at that length.

Data Size and Individual Argument Labeling

Finally, IBMRank covers 5,298 arguments, compared to 1,052 in UKPRank. In addition, in UKPRank no individual labeling is provided, and individual quality scores are inferred from the pair labeling. In contrast, for IBMRank each argument is individually labeled for quality, and we explicitly demonstrate the consistency of this individual labeling with the provided pairwise labeling.

8 Methods

In this section we describe neural methods for predicting the individual score and the pairwise classification of arguments. We devise two methods corresponding to the two newly introduced datasets. Our methods are based upon a powerful language representation model named Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2018), which achieves state-of-the-art results on a wide range of tasks in NLP (Wang et al., 2018; Rajpurkar et al., 2016, 2018). BERT has been extensively trained over large corpora to perform two tasks: (1) Masked Language Model – randomly replace words with a predefined token, [MASK], and predict the missing word; (2) Next Sentence Prediction – given a pair of sentences A and B, predict whether sentence B follows sentence A. Due to its bidirectional nature, BERT achieves remarkable results when fine-tuned to different tasks without the need for specific modifications per task. For further details refer to Devlin et al. (2018).

8.1 Argument-Pair Classification

We fine-tune BERT’s Base Uncased English pre-trained model for a binary classification task.13

13 Initial experiments with BERT’s Large model showed only minor improvements, so for the purpose of the experiments detailed in Section 9 we used the Base model.

The fine-tuning process is initialized with weights from the general-purpose pre-trained model, and a task-specific weight matrix Wout ∈ R^{768×2} is added to the 12-layer base network. Following standard practice with BERT, given a pair of arguments A and B, we feed the network the sequence ‘[CLS] A [SEP] B’. The [SEP] token indicates to the network that the input is to be treated as a pair, and [CLS] is a token used to obtain a contextual embedding for the entire sequence. The network is trained for 3 epochs with a learning rate of 2e-5. We refer to this model as Arg-Classifier.
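For illustration, a minimal sketch of this fine-tuning setup with the Hugging Face transformers library (an assumption; the paper does not state which implementation was used). The tokenizer's text-pair mode produces "[CLS] A [SEP] B [SEP]", a close variant of the input format described above; the batch size and other details not given in the paper are placeholders.

```python
# A minimal sketch (assumptions, not the authors' code) of fine-tuning BERT-base
# on argument pairs as a binary classification task.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

def train(pairs, labels, epochs=3, batch_size=16):
    """pairs: list of (argument_a, argument_b) strings; labels: 0/1 winner per pair."""
    model.train()
    for _ in range(epochs):
        for i in range(0, len(pairs), batch_size):
            batch = pairs[i:i + batch_size]
            enc = tokenizer([a for a, _ in batch], [b for _, b in batch],
                            padding=True, truncation=True, return_tensors="pt")
            target = torch.tensor(labels[i:i + batch_size])
            out = model(**enc, labels=target)   # cross-entropy loss over the two classes
            out.loss.backward()
            optimizer.step()
            optimizer.zero_grad()
```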

8.2 Argument Ranking

To train a model that outputs a score in the range [0, 1], we obtain contextual embeddings from the fine-tuned Arg-Classifier model. We concatenate the last 4 layers of the model output to obtain an embedding vector of size 4 × 768 = 3072. The embedding vectors are used as input to a neural network with a single output and one hidden layer of 300 neurons. In order for the network to output values in [0, 1], we use a sigmoid activation, σ_sigmoid(x) = 1 / (1 + e^{−x}). Denoting the weight matrices W1 ∈ R^{3072×300} and W2 ∈ R^{300×1}, the regressor model fR is a 2-layered neural network with σ_relu(x) = max{0, x} activation. fR can be written as:14

fR(x) = σ_sigmoid(W2ᵀ σ_relu(W1ᵀ x)),

where x ∈ R^{3072} is the embedding vector representing an argument. We refer to this regression model as Arg-Ranker.

14 We omit bias terms for readability.
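A minimal PyTorch sketch of the Arg-Ranker regressor described above. The architecture (3072-d input, 300-unit ReLU hidden layer, sigmoid output) follows the text; the loss function and optimizer are assumptions, since the paper does not specify them, and bias terms are included for simplicity.

```python
# A minimal sketch (assumptions, not the authors' code) of the Arg-Ranker regressor.
import torch
import torch.nn as nn

class ArgRanker(nn.Module):
    def __init__(self, in_dim=4 * 768, hidden=300):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
            nn.Sigmoid(),            # constrain the predicted quality score to [0, 1]
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)

model = ArgRanker()
optimizer = torch.optim.Adam(model.parameters())
loss_fn = nn.MSELoss()               # assumed regression loss against crowd scores

def train_step(embeddings, scores):
    """embeddings: (batch, 3072) float tensor; scores: (batch,) labels in [0, 1]."""
    optimizer.zero_grad()
    loss = loss_fn(model(embeddings), scores)
    loss.backward()
    optimizer.step()
    return loss.item()
```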

9 Experiments

9.1 Argument-Pair Classification

In this section we evaluate the methods described in Section 8. First, we evaluate the accuracy of Arg-Classifier on our IBMPairs dataset and on UKPConvArgStrict (henceforth, UKPStrict), the filtered argument-pairs dataset of Habernal and Gurevych (2016b), in k-fold cross-validation.15 We calculate accuracy and ROC area under curve (AUC) for each fold, and report the weighted averages over all folds. We also evaluate Simpson and Gurevych (2018)’s GPPL median heuristic method with GloVe + ling features in cross-validation on our IBMPairs dataset. For completeness, we quote Simpson and Gurevych (2018)’s figures for GPPL opt. and GPC on UKPStrict.16

15 22 and 32 folds, respectively.

We add a simple baseline classifying arguments based on their token count (Arg-Length).

IBMPairs | Arg-Length | Arg-Classifier | GPPL
Acc. | .55 | .80 | .71
AUC | .59 | .86 | .78

UKPStrict | Arg-Length | Arg-Classifier | GPPL | GPPL opt. | GPC
Acc. | .76 | .83 | .79 | .80 | .81
AUC | .78 | .89 | .87 | .87 | .89

Table 3: Accuracy and AUC on IBMPairs and UKPStrict.

As can be seen in Table 3, Arg-Classifier improves on the GPPL method on both datasets (p ≪ .01 using a two-tailed Wilcoxon signed-rank test).17 We note that Arg-Classifier’s accuracy on the UKPStrict set is higher than that of all methods tested on this dataset in Habernal and Gurevych (2016b) and Simpson and Gurevych (2018). Interestingly, all methods reach higher accuracy on UKPStrict compared to IBMPairs, presumably indicating that the data in IBMPairs is more challenging to classify. With regards to Arg-Length, we can see that it is inaccurate on IBMPairs but achieves a respectable result on UKPStrict. This is in agreement with Habernal and Gurevych (2016a), who analyzed the reasons that annotators provided for their labeling; in most cases the reason indicated a preference for arguments with more information – which is what longer arguments tend to provide. This further strengthens the value of creating IBMPairs and IBMRank as much more homogeneous datasets in terms of argument length.

16 We were unable to reproduce the results reported in Simpson and Gurevych (2018) by running the GPPL opt. and GPC algorithms on the UKPStrict dataset. We approached the authors and reported the issue, which was not solved by the time this paper was published; hence we only quote the figures as reported there.
17 The results per fold in both tasks are included in the supplementary material.

9.2 Argument Ranking

We proceed to evaluate the Arg-Ranker on the IBMRank and UKPRank datasets in k-fold cross-validation, and report weighted correlation measures. We also evaluate the Arg-Ranker by feeding it vanilla BERT embeddings, instead of the fine-tuned embeddings generated by the Arg-Classifier model. We refer to this version as Arg-Ranker-base. In both the Arg-Ranker and Arg-Ranker-base evaluations we report the mean of 3 runs.18

18 The GPPL regressor of Simpson and Gurevych (2018) relies on pairwise (relative) labeling of arguments and as a result it cannot be used for predicting the individual (absolute) labeling of arguments, as in IBMRank.

Metric | IBMRank Arg-Ranker-base | IBMRank Arg-Ranker | UKPRank Arg-Ranker-base | UKPRank Arg-Ranker | UKPRank GPPL
r | .41 | .42 | .44 | .49 | .45
ρ | .38 | .41 | .57 | .59 | .65

Table 4: Pearson’s (r) and Spearman’s (ρ) correlation of Arg-Ranker-base, Arg-Ranker and GPPL on the IBMRank and UKPRank datasets.

As can be seen in Table 4, on the UKPRank dataset, Arg-Ranker is slightly better than GPPL for Pearson’s correlation, but slightly worse for Spearman’s correlation. Additionally, using direct BERT embeddings provides worse correlation19 than using the Arg-Classifier embeddings for both datasets, justifying the use of the fine-tuned embeddings. Finally, similarly to the findings in the argument-pair classification task, the IBMRank dataset is harder to predict.20

19 Significantly so for the IBMRank data on both measures, and for the UKPRank data on Pearson’s correlation, p ≪ .05.
20 For the experiments on IBMRank, we included by mistake a small fraction of arguments which actually should have been filtered. The effect on the results is minimal.

10 Error Analysis

We present a qualitative analysis of examples that the Arg-Classifier and Arg-Ranker models did not predict correctly. For each of the argument-pair and ranking tasks, we analyzed 50–100 arguments from three motions on which the performance of the respective model was poor. For each motion we selected the arguments in which the model was most confident in the wrong direction.

A prominent insight from this analysis, common to both models, is that the model tends to fail when the argument’s persuasiveness outweighs its delivery quality (such as bad phrasing or typos). An example of this is shown in row 1 of Table 5. In this case, Argument 2 is labeled as having a higher quality, even though it contains multiple typos, and thus is typical of arguments that the model was trained to avoid selecting.

Row 1 – Motion: We should ban fossil fuels; Type: Impact over delivery.
Argument 1: the only way to provide any space for energy alternatives to enter the market is by artificially decreasing the power of fossil fuels through a ban.
Argument 2: fossil fuels are bad for the environment, they have so2 in them that is the thing that maks acid rain and it is today harming the environment and will only be wors.

Row 2 – Motion: Flu vaccination should not be mandatory; Type: Provocative or not grounded.
Argument 1: the only responsible persons for kids are their parents. if they dont think that their kids should get the vaccine its their own decision.
Argument 2: the body has an automatic vaccination due to evolution, those who got sick and died are the weakest link and we are better off without them

Row 3 – Motion: We should abandon vegetarianism; Type: Consistent annotator preference.
Argument 1: it’s harder to get all the things you need for a balanced diet while being vegetarian.
Argument 2: animals deserve less rights than humans, and it is legitimate for humans to prioritize their enjoyment over the suffering of animals.

Table 5: Examples of argument pairs for which there is a high difference between the argument selected by the annotators, marked in bold, and the argument predicted to be of higher quality by the model, marked in italics.

Another phenomenon that both our models fail to address is arguments that are off-topic, too provocative or not grounded. An example of this, from the argument-pair task, is shown in row 2: Argument 2 is presumably considered harsh by annotators, even though it is fine in terms of grammatical structure and impact on the topic. These types of arguments are becoming more important to recognize, especially in the “fake-news” era. We leave dealing with them for future work.

Finally, we also noticed that certain arguments were consistently preferred by annotators, regardless of the quality of the opposing argument. This pattern, relevant only to the Arg-Classifier model, is shown in row 3.

11 Conclusions and Future Work

A significant barrier in developing automatic methods for estimating argument quality is the lack of suitable data. An important contribution of this work is a newly introduced dataset composed of 6.3k carefully annotated arguments, compared to 1k arguments in previously considered data. Another barrier is the inherent subjectivity of the manual task of determining argument quality. To overcome this issue, we employed a relatively large set of crowd annotators to consider each instance, associated with various measures to ensure the quality of the annotations associated with the released data. In addition, while previous work focused on arguments collected from web debate portals, here we collected arguments via a dedicated interface, enforcing length limitations and providing contributors with clear guidance. Moreover, previous work relied solely on annotating pairs of arguments, and used these annotations to infer the individual ranking of arguments; in contrast, here we annotated all individual arguments for their quality, and further annotated 14k pairs. This two-fold approach allowed us, for the first time, to explicitly examine the relation between relative (pairwise) annotation and explicit (individual) annotation of argument quality. Our analysis suggests that these two schemes provide relatively consistent results. In addition, these annotation efforts may complement each other. As pairs of arguments with a high difference in individual quality scores appear to agree with argument-pair annotations, one may deduce the latter from the former. Thus, it may be beneficial to dedicate the more expensive pairwise annotation efforts to pairs in which the difference in individual quality scores is small, reminiscent of active learning (Settles, 2009). In future work we intend to further investigate this approach, as well as explore in more detail the low fraction of cases where these two schemes led to clearly different results.

The second contribution of this work is suggesting neural methods, based on Devlin et al. (2018), for argument ranking as well as for argument-pair classification. In the former task, our results are comparable to the state of the art; in the latter task they significantly outperform earlier methods (Habernal and Gurevych, 2016b).

Finally, to the best of our knowledge, current approaches do not deal with argument pairs of relatively similar quality. A natural extension is to develop a ternary-class classification model that will be trained and evaluated on such pairs, as we intend to explore in future work.

Acknowledgements

We thank Tel Aviv University Debating Society, Ben Gurion University Debating Society, Yale Debate Association, HWS Debate Team, Seawolf Debate Program of the University of Alaska, and many other individual debaters.


References

Nora Aranberri, Gorka Labaka, Arantza Díaz de Ilarraza, and Kepa Sarasola. 2017. Ebaluatoia: Crowd evaluation for English–Basque machine translation. Language Resources and Evaluation, 51(4):1053–1084.

Aristotle, G.A. Kennedy, and G.A. Kennedy. 1991. On Rhetoric: A Theory of Civic Discourse. Oxford University Press.

Roy Bar-Haim, Indrajit Bhattacharya, Francesco Dinuzzo, Amrita Saha, and Noam Slonim. 2017. Stance classification of context-dependent claims. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 251–261, Valencia, Spain. Association for Computational Linguistics.

T. Bench-Capon, K. Atkinson, and Peter McBurney. 2009. Altruism and agents: An argumentation based approach to designing agent decision mechanisms, pages 1073–1080.

Xi Chen, Paul N. Bennett, Kevyn Collins-Thompson, and Eric Horvitz. 2013. Pairwise ranking aggregation in a crowdsourced setting. In Proceedings of the Sixth ACM International Conference on Web Search and Data Mining, WSDM ’13, pages 193–202, New York, NY, USA. ACM.

Jacob Cohen. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805.

Martin Gleize, Eyal Shnarch, Leshem Choshen, Lena Dankin, Guy Moshkowich, Ranit Aharonov, and Noam Slonim. 2019. Are you convinced? Choosing the more convincing evidence with a Siamese network. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 967–976, Florence, Italy. Association for Computational Linguistics.

Ivan Habernal and Iryna Gurevych. 2016a. What makes a convincing argument? Empirical analysis and detecting attributes of convincingness in web argumentation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1214–1223, Austin, Texas. Association for Computational Linguistics.

Ivan Habernal and Iryna Gurevych. 2016b. Which argument is more convincing? Analyzing and predicting convincingness of web arguments using bidirectional LSTM. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1589–1599. Association for Computational Linguistics.

Dirk Hovy, Taylor Berg-Kirkpatrick, Ashish Vaswani, and Eduard Hovy. 2013. Learning whom to trust with MACE. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1120–1130, Atlanta, Georgia. Association for Computational Linguistics.

Ran Levy, Yonatan Bilu, Daniel Hershcovich, Ehud Aharoni, and Noam Slonim. 2014. Context dependent claim detection. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 1489–1500, Dublin, Ireland. Dublin City University and Association for Computational Linguistics.

Marco Lippi and Paolo Torroni. 2016. Argumentation mining: State of the art and emerging trends. ACM Transactions on Internet Technology, 16(2):10:1–10:25.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don’t know: Unanswerable questions for SQuAD. CoRR, abs/1806.03822.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392. Association for Computational Linguistics.

Chris Reed. 2016. Proceedings of the Third Workshop on Argument Mining (ArgMining2016). Association for Computational Linguistics.

Nils Reimers, Benjamin Schiller, Tilman Beck, Johannes Daxenberger, Christian Stab, and Iryna Gurevych. 2019. Classification and clustering of arguments with contextualized word embeddings. CoRR, abs/1906.09821.

Ruty Rinott, Lena Dankin, Carlos Alzate Perez, Mitesh M. Khapra, Ehud Aharoni, and Noam Slonim. 2015. Show me your evidence – an automatic method for context dependent evidence detection. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 440–450, Lisbon, Portugal. Association for Computational Linguistics.

Burr Settles. 2009. Active learning literature survey. Computer Sciences Technical Report 1648, University of Wisconsin–Madison.

Edwin D. Simpson and Iryna Gurevych. 2018. Finding convincing arguments using scalable Bayesian preference learning. Transactions of the Association for Computational Linguistics, 6:357–371.

Christian Stab and Iryna Gurevych. 2014. Annotating argument components and relations in persuasive essays. In COLING, pages 1501–1510. ACL.

Henning Wachsmuth, Nona Naderi, Yufang Hou, Yonatan Bilu, Vinodkumar Prabhakaran, Tim Alberdingk Thijm, Graeme Hirst, and Benno Stein. 2017a. Computational argumentation quality assessment in natural language. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 176–187. Association for Computational Linguistics.

Henning Wachsmuth, Martin Potthast, Khalid Al Khatib, Yamen Ajjour, Jana Puschmann, Jiani Qu, Jonas Dorsch, Viorel Morari, Janek Bevendorff, and Benno Stein. 2017b. Building an argument search engine for the web. In ArgMining@EMNLP, pages 49–59. Association for Computational Linguistics.

Douglas Walton, Chris Reed, and Fabrizio Macagno. 2008. Argumentation Schemes. Cambridge University Press.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. CoRR, abs/1804.07461.