Comparing Test Sets with Item Response Theory


Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, pages 1141–1158

August 1–6, 2021. ©2021 Association for Computational Linguistics


Clara Vania♠∗† Phu Mon Htut♣∗ William Huang♣∗ Dhara Mungra♣ Richard Yuanzhe Pang♣ Jason Phang♣ Haokun Liu♦† Kyunghyun Cho♣ Samuel R. Bowman♣

♠Amazon ♣New York University ♦Allen Institute for AI

[email protected], [email protected]

Abstract

Recent years have seen numerous NLP datasets introduced to evaluate the performance of fine-tuned models on natural language understanding tasks. Recent results from large pretrained models, though, show that many of these datasets are largely saturated and unlikely to be able to detect further progress. What kind of datasets are still effective at discriminating among strong models, and what kind of datasets should we expect to be able to detect future improvements? To measure this uniformly across datasets, we draw on Item Response Theory and evaluate 29 datasets using predictions from 18 pretrained Transformer models on individual test examples. We find that Quoref, HellaSwag, and MC-TACO are best suited for distinguishing among state-of-the-art models, while SNLI, MNLI, and CommitmentBank seem to be saturated for current strong models. We also observe that the span selection task format, which is used for QA datasets like QAMR or SQuAD2.0, is effective in differentiating between strong and weak models.

1 Introduction

Many datasets have been created to evaluate various aspects of natural language understanding (NLU) in English. These datasets are useful to measure progress; however, it is evident from various leaderboards (Wang et al., 2018, 2019b; Rajpurkar et al., 2016; Zellers et al., 2018) that many of them are no longer challenging or discriminative enough to differentiate strong models such as those based on Transformers (Vaswani et al., 2017).1 Even if these benchmarks are sound tests of important (and potentially unsolved) tasks, their usefulness is limited if they cannot measure further progress. In this paper, we ask: Which datasets are best at distinguishing current and possible future strong models?

∗ Equal contribution. † Work done while at New York University.

1 For example, the recent DeBERTa model (He et al., 2020) achieves parity with human annotators on the SuperGLUE benchmark score: https://super.gluebenchmark.com/leaderboard.

We aim to compare datasets using a single metric that accounts for their effectiveness in separating current stronger and weaker models. To that end, we use Item Response Theory (IRT; Baker and Kim, 1993), a statistical framework from psychometrics that is widely used for the evaluation of test items in educational assessment. IRT assumes that the probability that a model will correctly handle an example in a test set depends on the model's latent ability parameter and three example-specific parameters, typically measuring example difficulty (how strong does a model have to be to get it right), discrimination (how effective the example is for differentiating between similar models), and guessing (how likely a weak model is to get the example right for spurious reasons).

This paper presents a large-scale IRT analysis of existing English NLU datasets. Unlike previous work, which focuses on example-level analysis within individual datasets (Lalor et al., 2016, 2018), we analyze example characteristics from a broader perspective by comparing individual examples across datasets. We evaluate test sets from 29 datasets in different formats—classification, multiple-choice QA, and span-selection QA. As responses, we use model predictions from 18 Transformer-based models, including some limited-capacity models chosen to better expose each dataset's ability to discriminate weaker from stronger predictors. We then fit a single IRT model on these responses using a variational inference method.2

2 Our data and code can be found at https://github.com/nyu-mll/nlu-test-sets.


Figure 1: Distribution of test examples according to our proposed locally estimated headroom (LEH) scores (§4.1.1), which measure the local slope of the item characteristic curve (ICC) for an example at the ability level corresponding to the best model, and thus reflect the effectiveness of that single example at distinguishing between near-state-of-the-art models. Datasets are grouped by task format: classification (green), sentence-level multiple choice (blue), paragraph-level multiple choice (red), and span selection (grey). Within each format, the datasets are sorted by their release date. More details on the datasets are given in Table 1.

We find:

• Quoref, HellaSwag, and MC-TACO contain the highest number of examples that can differentiate between near-state-of-the-art models, making them very likely to be effective at tracking near-future progress on the skills that they actually test (Figure 1).

• SQuAD2.0, NewsQA, QuAIL, MC-TACO, and ARC-Challenge have the most difficult examples.

• Span-based QA is an effective task format for discriminating between strong and weak models.

• CosmosQA, MC-TACO, WinoGrande, and ARC-Challenge consist mostly of hard examples, while for most datasets, the example difficulty levels are more widely distributed.

2 Item Response Theory

Baker and Kim (1993) introduce Item Response Theory (IRT), a statistical framework for measuring the probability that a responder (human or AI system) predicts the correct answer for a given item (test example). The probability of a responder i answering an item j correctly is estimated as a function of the responder's latent ability θi and the item's characteristics; this function is referred to as the item characteristic curve (ICC).

We use the 3-parameter (3PL) IRT model, where item behavior is governed by discrimination, difficulty, and guessing parameters. The discrimination parameter (α) defines how effective an item is for distinguishing predictors along the ability axis. The difficulty parameter (β) defines a minimum level of ability at which we expect to see high responder performance. The guessing parameter (γ) defines the probability of correctly answering an item by random guessing. Figure 2 shows example ICCs with different parameter values.

Figure 2: An example of item characteristic curves (ICCs) with different values for the discrimination (α), difficulty (β), and guessing (γ) parameters. p(θ) is the probability of a correct answer for a given θ. θ measures a model's ability level (higher is better). α governs the steepness of the function, β determines the θ value at which the curve is steepest, while γ defines the baseline likelihood that an arbitrarily weak model can guess correctly.

Formally, the probability of individual i answering item j correctly is modeled as:

pj(θi) = γj + (1 − γj) / (1 + exp(−αj(θi − βj)))    (1)
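As an illustration, here is a minimal sketch of Eq. (1) in Python. It is not taken from the paper's released code, and the parameter values in the usage example are made up for illustration.

```python
import numpy as np

def icc(theta, alpha, beta, gamma):
    """3PL item characteristic curve: probability that a responder with ability
    `theta` answers an item with parameters (alpha, beta, gamma) correctly."""
    return gamma + (1.0 - gamma) / (1.0 + np.exp(-alpha * (theta - beta)))

# A discriminative, hard item (high alpha, high beta, low gamma), with made-up values:
print(icc(theta=0.0, alpha=2.0, beta=1.5, gamma=0.1))  # weak model: close to gamma
print(icc(theta=3.0, alpha=2.0, beta=1.5, gamma=0.1))  # strong model: close to 1
```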

Page 3: Comparing Test Sets with Item Response Theory

1143

2.1 IRT with Variational Inference

We use variational inference to infer IRT parameters from model response patterns using Pyro (Ranganath et al., 2014; Bingham et al., 2019). Lalor et al. (2019) found this method effective when fitting IRT models to responses on SNLI. Let n be the number of items and let m be the number of responders. The response pattern is Y ∈ {0, 1}^{m×n}, where the i-th row corresponds to responder i and the j-th column corresponds to item j. We define yij ∈ {0, 1} as the response of model i to item j, where yij = 1 indicates a correct response and yij = 0 indicates an incorrect response. We approximate the joint posterior of the parameters π(θ, α, β, γ | Y) with a variational posterior:

q(θ, α, β, γ) = ∏_{i=1}^{I} πθi(θi) · ∏_{j=1}^{J} παj(αj) πβj(βj) πγj(γj)    (2)

where πρ(·) denotes the density for parameter ρ. For each parameter, we choose the following distributions:

θ ∼ N(µθ, σ²θ)    (3)
log α ∼ N(µα, σ²α)    (4)
β ∼ N(µβ, σ²β)    (5)
sigmoid⁻¹(γ) ∼ N(µγ, σ²γ)    (6)

We fit the posterior parameters by maximizing the evidence lower bound (ELBO). When calculating the ELBO, we weight the log-likelihood of each item's parameters by the inverse of the item's dataset size to control for test set size.

Following Lalor et al. (2019), we use a prior of N(0, 1) for θ, β, and sigmoid⁻¹(γ). While Lalor et al. (2019) use N(0, 10³) for the item parameter priors, we encountered degenerate runs and instead use N(0, 1). For log α, we use N(0, σ²α), where we set σα by searching [0.25, 0.5] in increments of 0.05 and using the value yielding the highest ELBO after excluding degenerate runs. We use a sigmoid transformation for γ to constrain the guessing probability to (0, 1).
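To make the setup concrete, below is a minimal, hedged sketch of a 3PL IRT model fitted with variational inference in Pyro. It is not the authors' released implementation: the response matrix Y, the hyperparameter values (e.g., σα = 0.3), and the use of an AutoNormal guide are illustrative assumptions, and the per-item inverse-dataset-size weighting of the ELBO described above is omitted for brevity (it could be added with pyro.poutine.scale).

```python
import torch
import pyro
import pyro.distributions as dist
from pyro.infer import SVI, Trace_ELBO
from pyro.infer.autoguide import AutoNormal
from pyro.optim import Adam

def irt_3pl(Y):
    """Y: float tensor of 0/1 responses with shape (num_models, num_items)."""
    n_models, n_items = Y.shape
    with pyro.plate("models", n_models):
        theta = pyro.sample("theta", dist.Normal(0.0, 1.0))             # ability, Eq. (3)
    with pyro.plate("items", n_items):
        log_alpha = pyro.sample("log_alpha", dist.Normal(0.0, 0.3))     # Eq. (4); 0.3 is an assumed sigma_alpha
        beta = pyro.sample("beta", dist.Normal(0.0, 1.0))               # difficulty, Eq. (5)
        logit_gamma = pyro.sample("logit_gamma", dist.Normal(0.0, 1.0)) # Eq. (6)
    alpha, gamma = log_alpha.exp(), torch.sigmoid(logit_gamma)
    # Item characteristic curve, Eq. (1), broadcast to shape (num_models, num_items).
    p = gamma + (1.0 - gamma) * torch.sigmoid(alpha * (theta.unsqueeze(-1) - beta))
    with pyro.plate("obs_models", n_models, dim=-2), pyro.plate("obs_items", n_items, dim=-1):
        pyro.sample("y", dist.Bernoulli(probs=p), obs=Y)

def fit(Y, steps=2000, lr=0.01):
    guide = AutoNormal(irt_3pl)                        # mean-field variational posterior
    svi = SVI(irt_3pl, guide, Adam({"lr": lr}), loss=Trace_ELBO())
    for _ in range(steps):
        svi.step(Y)                                    # one gradient step on the (negative) ELBO
    return guide
```

After fitting, point estimates of the latent parameters can be read off the guide (e.g., via its median or by sampling from the approximate posterior).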

3 Experiments

3.1 Datasets

Our goal is to perform a fine-grained evaluation of English NLU datasets that appear to discriminate among widely used Transformer-based models. To that end, we choose datasets based on the following criteria:

• They are plausibly unsolved, in that the best reported model performance does not exceed estimated human performance (if available) by more than three metric points.

• They are relatively easy to use with current large pretrained models, and in particular, their inputs fit within a typical pretrained Transformer's 512-token limit. (This rules out tasks with full-document contexts or retrieval components.)

• They are evaluated at the example level, i.e., we focus our analysis on QA and other classification datasets, where each example corresponds to one item in the IRT analysis. (This rules out structured prediction and sequence tagging tasks.)

• They have simple and reliable automatic metrics at the example level. (This rules out generation-based tasks.)

Table 1 lists the datasets we evaluate. For MNLI, we combine the matched and mismatched portions of the development and custom test sets for our analysis. For ANLI, we train models on SNLI, MNLI, and ANLI training examples. Similar to MNLI, we combine ANLI's three evaluation rounds of the development and test sets for our analysis.

Custom Test Splits  Some of our selected datasets do not have publicly available labeled test examples. For such cases, we create a new custom split by randomly sampling 50% of the validation examples as a new test set and keeping the rest for validation ("Cust." column in Table 1). For Natural Questions, we use the MRQA 2019 version (Fisch et al., 2019), as the original version includes some examples with very long contexts.3 For MC-TACO, the original dataset does not come with a training set. For our experiment, we use 80% of the validation set as our training set and the rest as our validation set, while leaving the original test set untouched.

3 https://github.com/mrqa/MRQA-Shared-Task-2019
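For concreteness, here is a minimal sketch of the custom-split procedure described above, assuming the examples are available as a Python list; the seed value is an arbitrary illustration, not the one used in the paper.

```python
import random

def make_custom_split(validation_examples, seed=1234):
    """Randomly hold out 50% of the validation examples as a new test set."""
    rng = random.Random(seed)              # fixed seed so the split is reproducible
    examples = list(validation_examples)
    rng.shuffle(examples)
    half = len(examples) // 2
    new_validation, new_test = examples[:half], examples[half:]
    return new_validation, new_test
```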

Page 4: Comparing Test Sets with Item Response Theory

1144

Dataset | Train | Dev | Test | Cust. | Metric | RoBERTa | Human

Classification
RTE (Dagan et al., 2005, et seq.) | 2,490 | 138 | 139 | ✓ | Acc. | 87.6 | 93.6
SNLI (Bowman et al., 2015) | 550,152 | 10,000 | 10,000 | | Acc. | 92.7 | –
MNLI (Williams et al., 2018) | 392,702 | 9,823 | 9,824 | ✓ | Acc. | 89.7 | 92.0
CommitmentBank (CB; De Marneffe et al., 2019) | 250 | 28 | 28 | ✓ | Acc. | 90.5 | 95.8
ANLI (Nie et al., 2020) | 1,105,719 | 3,200 | 3,200 | | Acc. | 50.8 | –

Sentence-Level Multiple Choice
COPA (Roemmele et al., 2011) | 400 | 50 | 50 | ✓ | Acc. | 86.0 | 100.0
WSC (Levesque et al., 2012) | 554 | 52 | 52 | ✓ | Acc. | 78.8 | 100.0
CommonsenseQA (CSQA; Talmor et al., 2019) | 9,741 | 610 | 611 | ✓ | Acc. | 74.6 | 88.9
MC-TACO (Zhou et al., 2019) | 3,026 | 757 | 9,442 | ✓ | EM | 55.9 | 75.8
SocialIQA (Sap et al., 2019) | 33,410 | 977 | 977 | ✓ | Acc. | 79.9 | 88.1
WiC (Pilehvar and Camacho-Collados, 2019) | 5,428 | 319 | 319 | ✓ | Acc. | 71.5 | 80.0
Abductive NLI (AbductNLI; Bhagavatula et al., 2020) | 169,654 | 766 | 766 | ✓ | Acc. | 85.0 | 92.9
PIQA (Bisk et al., 2020) | 16,113 | 919 | 919 | ✓ | Acc. | 77.6 | 94.9
WinoGrande (Sakaguchi et al., 2020) | 40,398 | 633 | 634 | ✓ | Acc. | 77.3 | 94.0

Paragraph-Level Multiple Choice
ARC-Easy (Clark et al., 2018) | 2,251 | 570 | 2,376 | | Acc. | 62.5 | –
ARC-Challenge (Clark et al., 2018) | 1,119 | 299 | 1,172 | | Acc. | 37.5 | –
ARCT (Habernal et al., 2018) | 1,211 | 317 | 445 | | Acc. | 86.7 | 79.8
MCScript (Ostermann et al., 2018) | 14,191 | 2,020 | 3,610 | | Acc. | 92.8 | 98.2
BoolQ (Clark et al., 2019) | 9,427 | 1,635 | 1,635 | ✓ | Acc. | 85.7 | 89.0
Cosmos QA (Huang et al., 2019) | 25,262 | 1,492 | 1,493 | ✓ | Acc. | 79.4 | 94.0
HellaSwag (Zellers et al., 2019) | 39,905 | 5,021 | 5,021 | ✓ | Acc. | 84.1 | 95.6
MuTual (Cui et al., 2020) | 7,088 | 443 | 443 | ✓ | Acc. | 87.8 | 93.8
MuTual+ (Cui et al., 2020) | 7,088 | 443 | 443 | ✓ | Acc. | 77.9 | 93.0
QuAIL (Rogers et al., 2020) | 10,246 | 2,164 | 556 | | Acc. | 73.3 | –

Span Selection
QAMR (Michael et al., 2018) | 50,615 | 18,908 | 18,770 | | EM | 79.6 | –
NewsQA (Trischler et al., 2017) | 76,568 | 4,343 | 4,293 | | EM | 57.8 | 46.5
SQuAD2.0 (Rajpurkar et al., 2018) | 130,319 | 5,675 | 6,198 | ✓ | EM | 91.5 | 86.8
MRQA-NQ (Kwiatkowski et al., 2019) | 104,071 | 6,418 | 6,418 | ✓ | EM | 69.9 | –
Quoref (Dasigi et al., 2019) | 19,399 | 1,209 | 1,209 | ✓ | EM | 78.7 | 93.0

Table 1: Datasets grouped by their task format and ordered by release year. Train/Dev/Test: number of examples in each split. Cust. denotes cases where we use our own custom split. Metric: evaluation metric used in this study. RoBERTa: model performance using RoBERTaLarge. Human: human performance.

3.2 Models

We aim to understand how examples from different datasets contribute to the evaluations of models with near-state-of-the-art abilities, so we include several pretrained Transformer-based models to approximate this. However, using only high-performing models could result in a poor IRT model fit (Martínez-Plumed et al., 2019). To avoid this, we add both weaker models and under-trained versions of our original models. We use ALBERT-XXL-v2 (Lan et al., 2020), RoBERTaLarge and RoBERTaBase (Liu et al., 2019), BERTLarge and BERTBase (Devlin et al., 2019), XLM-R (Conneau et al., 2020), and 12 MiniBERTas (Zhang et al., 2021b).4 For each of the 18 Transformer-based models, we evaluate five different checkpoints—at 1%, 10%, 25%, and 50% of the maximum steps of the maximum epochs (Section 3.3), as well as the best checkpoint on the validation set, which need not be one of the other four. This yields a total of 90 model predictions for each test example.

4 The MiniBERTas are RoBERTa models pretrained on 1M, 10M, 100M, or 1B words of raw text, varying slightly in model size. There are three pretrained models for each pretraining data quantity, pretrained using different near-optimal hyperparameter values. We use all three variants in producing responses for IRT.

3.3 Experimental Setup

Optimization  We perform a hyperparameter sweep on each dataset, varying the learning rate ∈ {1e-5, 3e-5, 5e-6}. We tune the maximum number of epochs ∈ {10, 40} for small datasets (< 5k training examples), and ∈ {3, 10} for other datasets (Zhang et al., 2021a). We use the jiant (Pruksachatkun et al., 2020b) library, which is based on PyTorch (Paszke et al., 2019) and HuggingFace Transformers (Wolf et al., 2020).

We only perform hyperparameter tuning with the RoBERTaLarge model and apply the best configuration to train all the other Transformer models. We use NVIDIA V100 Tensor Core GPUs for our experiments. On average, it takes approximately four hours to train RoBERTa on small datasets (< 3k training examples), one day for medium-sized datasets (< 10k), and four days for large datasets (> 10k).

Figure 3: The best validation performance of ALBERT-XXL-v2, RoBERTaLarge, and the smallest MiniBERTa (RoBERTa-Med-Small-1M-2) on each dataset. The full results table with the performance of all models is reported in the Appendix (Table 3).

4 Results and Analysis

Figure 3 shows the performance of RoBERTaLarge, ALBERT-XXL-v2, and one of the low-performing MiniBERTas (RoBERTa-Med-Small-1M-2) on all validation sets. Unsurprisingly, ALBERT-XXL-v2 and RoBERTaLarge are the best-performing models, while the small MiniBERTa model achieves much lower performance. Full results using all 18 models can be found in the Appendix (Table 3).

4.1 IRT Analysis

4.1.1 Item Characteristics

Metric  As our primary metric, we introduce the Locally Estimated Headroom (LEH) score, which measures the ability of each test example to contribute to the evaluation of near-future progress. We calculate it as the derivative of the example's ICC (Figure 2) with respect to ability, evaluated at the highest latent ability score, which corresponds to ALBERT-XXL-v2. A high LEH score indicates that the best-performing model is still far from the example's saturation points—the flat sections of the ICC inferred by our model. There is enough space along the curve that the IRT model expects the example to be able to differentiate future state-of-the-art models. Typically, different near-state-of-the-art models both succeed and fail on this kind of example, while weaker models mostly fail. A high LEH score implies that there is still enough room for potentially stronger models to perform better on this dataset.
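Since the 3PL ICC has a closed-form derivative, the LEH score as described above can be computed directly. The sketch below is our own illustration (with made-up parameter values), not code from the paper.

```python
import numpy as np

def leh_score(theta_best, alpha, beta, gamma):
    """Slope of the 3PL ICC, d p(theta)/d theta, evaluated at the best model's ability."""
    s = 1.0 / (1.0 + np.exp(-alpha * (theta_best - beta)))  # logistic part of Eq. (1)
    return (1.0 - gamma) * alpha * s * (1.0 - s)

# Made-up item parameters: an already-saturated easy item vs. an item with headroom.
print(leh_score(theta_best=2.5, alpha=1.5, beta=-1.0, gamma=0.2))  # low LEH: the curve is flat here
print(leh_score(theta_best=2.5, alpha=1.5, beta=2.5, gamma=0.2))   # high LEH: still on the steep part
```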

To validate the use of LEH scores for detecting near-future improvements, we compare two IRT models. The first is fitted using responses from all models, while the second is fitted using responses from BERT and other weaker models (excluding RoBERTaLarge, RoBERTaBase, XLM-R, and ALBERT-XXL-v2). After that, we compute the correlation between the two sets of LEH scores, focusing on the 75th percentile for each dataset. The Pearson correlation is 95.5%, with a median absolute difference of 0.007 and a standard deviation of 0.011. Out of the 29 datasets, only SQuAD2.0, CommonsenseQA, MuTual, Quoref, and HellaSwag have more than a 0.02 absolute difference in LEH scores. This strong correlation suggests that our ICC fits are not overly sensitive to the exact characteristics of current state-of-the-art models.
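A sketch of this robustness check, assuming hypothetical dictionaries leh_full and leh_weak that map each dataset name to the array of example-level LEH scores from the two fits:

```python
import numpy as np
from scipy.stats import pearsonr

def compare_fits(leh_full, leh_weak):
    """Correlate per-dataset 75th-percentile LEH scores from two IRT fits."""
    names = sorted(leh_full)
    full = np.array([np.percentile(leh_full[n], 75) for n in names])
    weak = np.array([np.percentile(leh_weak[n], 75) for n in names])
    r, _ = pearsonr(full, weak)
    diffs = np.abs(full - weak)
    return r, np.median(diffs), diffs.std()
```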

Analysis by LEH Scores  Figure 1 shows the distribution of test examples for each dataset based on their LEH scores. For our analysis, we focus on the 75th percentile of example scores in each dataset as a rough proxy for how likely a dataset is to have a significant number of examples that are difficult or discriminative for near-future models.

We observe that Quoref, HellaSwag, and MC-TACO have the examples with the highest LEH scores, suggesting sufficient headroom for future state-of-the-art models with higher ability to achieve better performance on these datasets. SNLI, CommitmentBank, and MNLI have relatively low LEH scores, indicating that performance on these datasets is largely saturated. Additionally, we also measure how the 75th-percentile LEH scores correlate with the human-RoBERTa gap. Using the 22 datasets that have human performance numbers (Table 1), we find that the Pearson correlation between the two is weakly positive (0.21).

Analysis by Item Parameters  Next, we analyze the distribution of test examples according to their discrimination and difficulty parameters (Figure 4). We observe that datasets with the span selection format (QAMR, NewsQA, SQuAD, MRQA-NQ, and Quoref) have higher discrimination scores than other datasets, highlighting span selection as an effective task format for discriminating among strong and weak models. However, this might be because this task format typically features a much larger space of possible model outputs than the other formats we consider. It does not necessarily mean that span selection is the most suitable format for testing models' ability to understand language. As the span-based format restricts answers to text spans in the given passage, there are concerns that it rarely requires reasoning over answers not mentioned in the passage, and thus does not reflect human comprehension ability (Lai et al., 2017; Sugawara et al., 2018).

Figure 4: Distribution of test examples for each dataset based on the log discrimination (log α) parameter (top) and the difficulty (β) parameter (bottom).

For the difficulty parameter, we do not observe a single task format that is clearly superior to the others. However, we notice that the highest difficulty scores are obtained by QA datasets such as SQuAD2.0, NewsQA, QuAIL, ARC-Challenge, and MC-TACO. ANLI, which is created with adversarial model-in-the-loop crowdsourcing, also has many hard examples. Impressionistically, training set size and creation date do not seem to correlate with either the examples' difficulty or discrimination parameters.

Figure 5 shows the distribution of examples jointly according to their difficulty and log discrimination parameters. We notice a half-moon-shaped pattern in most datasets, which indicates that most of the discriminative examples are either very easy or very difficult. Referring to the ICC (Figure 2), this indicates that there is high agreement among strong models or among weak models, which corresponds to one of the saturation points of the ICC (upper or lower). The only dataset that does not show this pattern is WinoGrande, which is difficult for all models.

ARC-Challenge, QuAIL, HellaSwag, CommonsenseQA, and MC-TACO show clusters with high density in the top right regions, indicating a large number of examples with high discrimination and difficulty scores. Other datasets have more scattered distributions. SNLI, MNLI, and MCScript show higher density in the bottom right regions, while NewsQA, SQuAD2.0, and MRQA-NQ show higher density in both the top and bottom right regions. Further analysis of the guessing parameters can be found in Appendix A.

4.2 Examples with Unanimous Responses

When fitting ICCs on examples that have only correct responses or only incorrect responses, the discrimination parameter is unconstrained. We find that these examples make up 4% of our data. 13 of the 29 datasets contain at least one such example. Roughly 16% of NewsQA examples are incorrectly answered by all models, while the remaining 12 datasets have less than 10% of such all-correct or all-incorrect examples. To study the effect of examples with all correct or all incorrect responses, we fit an IRT model on responses excluding such examples and compare against parameters from the full set of responses. We find that the Pearson correlation for the discrimination parameter at the 75th percentile is 97.2%, with a median absolute difference of 0.016 and a standard deviation of 0.015. MC-TACO, CommitmentBank, and WSC differ by more than 0.04. Further, we find that the Pearson correlation for the LEH score at the 75th percentile is 98.9%, with a median absolute difference of 0.006 and a standard deviation of 0.005. RTE, WiC, WinoGrande, QAMR, NewsQA, MRQA-NQ, MC-TACO, and BoolQ differ by 0.01. Given these high correlations, we do not exclude these examples when reporting our main results.

Figure 5: Distributions of the log discrimination (log α) versus the difficulty (β) parameters for each dataset.

4.3 Analysis by Task Group

Next, we analyze each task-type group in more detail, focusing on the examples' scores around the 75th percentile.

Classification  We observe that all datasets have moderate discrimination scores. Most ANLI examples have relatively high difficulty scores, while SNLI, MNLI, and CommitmentBank have the lowest difficulty scores.

Sentence-Level Multiple Choice  All of the datasets in this group have relatively low discrimination scores compared to span selection datasets. Figure 5 shows that MC-TACO, WinoGrande, and CommonsenseQA all have a higher density of difficult examples, while for other datasets the distribution is more spread out.

Paragraph-Level Multiple Choice  QuAIL and ARC-Challenge examples have high difficulty but moderate discrimination scores. As seen in Figure 5, these datasets have a higher density in the top right regions, showing a large proportion of difficult examples. ARCT shows moderate difficulty despite its known artifacts (Niven and Kao, 2019), indicating that it can still be challenging for models. Compared to other datasets, BoolQ has the highest number of easy examples. However, as it is a binary classification task, the random baseline performance is already high.

To investigate this, we calculate the number of examples in each test set that have a γ parameter below 0.5. In general, we find that 88% of the test examples have γ < 0.5, implying that most of the examples contributed to the inference of α, β, and θ. BoolQ was the only exception, in which approximately 56% of examples were assigned γ > 0.5. After filtering out these guessable examples in BoolQ, we find that its test examples have slightly higher discrimination scores with little change in difficulty scores.

Span Selection  We observe that span selection datasets are the most discriminative. However, in terms of difficulty, only SQuAD2.0 and NewsQA are among the top five.

4.3.1 Analysis on Model Ability

For a sanity check, we further analyze how each model scores according to our fitted IRT parameters. We observe a positive correlation between ability and average model accuracy (Appendix B). Generally, within a model, the best validation checkpoint obtains the highest average model accuracy and/or ability score. Across models, ALBERT-XXL-v2 typically performs best.

MNLI (difficulty β = 3.27)
Premise: And, you know, with this, you know, it wasn't many opportunities for kids to be special, because kids weren't, you know, you were pushed out of adult conversation, and just really pushed to the side.
Hypothesis: Children were pushed out of adult conversation, and really just pushed to the side in general.
Label: entailment

MNLI (difficulty β = -1.87)
Premise: Look, it's your skin, but you're going to be in trouble if you don't get busy.
Hypothesis: The boss will fire you if he sees you slacking off.
Label: neutral

MC-TACO (difficulty β = 2.86)
The Beatles are giving a press conference about their new film, Magical Mystery Tour. What time of day was the press conference?
(1) 4:00 PM ✓ (2) 12:00 PM ✓ (3) 3 p.m. ✓ (4) 6:00 AM ✗

MC-TACO (difficulty β = -1.67)
Because then they feel like they are forced to stay in that situation." On average, how often do they feel stuck in the situation?
(1) 54 months ✗ (2) 6 centuries ✗ (3) once every 6 years ✗ (4) every few seconds ✗ (5) once every 2 seconds ✗ (6) once every 18 years ✗

Table 2: Hardest and easiest examples, along with their estimated difficulty scores, for MNLI and MC-TACO. ✓ marks answer choices labeled True and ✗ marks answer choices labeled False.

4.4 Qualitative Analysis

To better understand what kinds of examples are difficult or discriminating, we analyze the 20 examples with the lowest and highest scores for the discrimination and difficulty parameters from five datasets: SQuAD2.0, MC-TACO, QuAIL, MNLI, and BoolQ. The first three are datasets with high discrimination and/or difficulty scores. MNLI and BoolQ have moderate discrimination and difficulty scores and low label entropy (three-class classification for MNLI and binary choice for BoolQ).

We observe that the 20 most difficult BoolQ examples are labeled False (the minority class), while 19 of the 20 easiest examples are labeled True. For MNLI, we find that the 20 easiest MNLI examples are labeled neutral, while the 20 hardest examples are a mixture of entailment and contradiction.

In MC-TACO, each example contains a varying number of answer choices. For each choice, a model needs to predict whether the answer is True or False. We find that all answer choices in the top 20 easiest examples are labeled False (the majority class), whereas for difficult examples the answer choices are either all True or a mix of True and False (Table 2). For SQuAD2.0 and QuAIL, we analyze the context length, the answerability of a question, and the lexical overlap between context and questions. However, we do not find any clear evidence that any of them indicates the difficulty level of test examples.

For BoolQ, we observe that the 20 most discriminating examples are all labeled False, while 13 of the 20 least discriminating examples are labeled True. Table 2 shows the hardest and the easiest examples of MNLI and MC-TACO.

5 Related Work

Prior work on using IRT to evaluate NLP systems mostly relies on human responses. Hopkins and May (2013) use IRT to estimate the relative ability of a set of machine translation systems using responses from pairwise comparisons of system outputs by human judges. Otani et al. (2016) extend this work by including a baseline translation in the pairwise comparison. Lalor et al. (2016, 2018) use IRT to identify hard examples in natural language inference data based on human responses. In a follow-up study, Lalor et al. (2019) compare human and model responses, find that the two are positively correlated, and demonstrate use cases of IRT parameters in training set filtering. Sedoc and Ungar (2020) use IRT to evaluate chatbot systems.

The work by Martínez-Plumed et al. (2019) is the first to study the idea of using model responses (as opposed to human responses) for IRT in machine learning research. For NLU, Lalor and Yu (2020) use model responses to estimate difficulty parameters of several GLUE datasets for dynamic data selection in curriculum learning. In concurrent work, Rodriguez et al. (2021) study how IRT can be used for more nuanced leaderboard evaluations. Their experiments demonstrate that IRT can produce a more reliable ranking of models than traditional metrics. They also show that IRT is not only useful for better understanding individual examples in a dataset and task, but is also effective in identifying annotation errors.

For other dataset evaluations, in addition to providing a benchmark, the SuperGLUE paper also compares a set of candidate datasets using a fixed pool of machine learning models and human annotators (Nangia and Bowman, 2019). Wang et al. (2019a) investigate pretraining tasks and paradigms for effective transfer learning methods. Pruksachatkun et al. (2020a) study when and why intermediate-task training is useful for a given target task. Vu et al. (2020) introduce task embeddings to predict the most beneficial source task for a given target task. Schlegel et al. (2020) propose an evaluation framework for machine reading comprehension (MRC) datasets and reveal some concerns regarding factual correctness and the presence of linguistic cues in existing MRC gold datasets.

6 Conclusion

Given the large number of NLU datasets introduced in recent years, what kinds of datasets are effective for measuring near-future progress? Our analysis of 29 test sets using IRT gives us reason to believe that, among the datasets we evaluate, Quoref, HellaSwag, and MC-TACO are best able to discriminate among current (and likely future) strong models. Meanwhile, SNLI, MNLI, and CommitmentBank seem to be saturated and ineffective for measuring future progress.

Our analysis of examples' difficulty and discrimination parameters shows that datasets with many hard examples do not always contain examples that can discriminate between strong and weak models. We find that QA datasets are more difficult than other datasets. We also find that span selection is the most effective task format for discriminating between strong and weak models.

According to our LEH score, datasets that seem to be solved are unlikely to see improvements with future pretrained models. Therefore, the skills they intend to test are either largely solved, to the extent that they are solvable, or not well isolated (e.g., due to data artifacts). Focusing on the skills that these solved test sets were originally designed to evaluate would most likely require a new dataset that better isolates the reasoning ability of interest.

On the other hand, datasets that perform well according to our LEH metric show the best signs of being amenable to future hill-climbing. This does not entail that we should focus future research on these benchmarks, since we do not evaluate whether they test the skills they mean to test, or whether these skills are important for scientific or practical progress on natural language understanding. Finally, we argue that this evaluation should be done periodically, as datasets and models improve over time.

For future work, one can study multi-dimensional variables for both model ability and item parameters, which could reveal a factorization of datasets by skills. Other potential directions include expanding our analysis to a broader range of tasks and analyzing the relationship between the estimated IRT parameters and the human-model gap.

Acknowledgments

We thank John Lalor, João Sedoc, Nikita Nangia, Sebastian Schuster, Iacer Calixto, and the anonymous reviewers for feedback. This work has benefited from financial support to SB by Eric and Wendy Schmidt (made by recommendation of the Schmidt Futures program), Samsung Research (under the project Improving Deep Learning using Latent Structure), Apple, and Intuit, and from in-kind support by the NYU High-Performance Computing Center and by NVIDIA Corporation (with the donation of a Titan V GPU). This material is based upon work supported by the National Science Foundation under Grant No. 1922658. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Ethical Considerations

We present an objective approach for comparing the difficulty of test set examples across datasets and demonstrate it on a large set of established datasets. We expect this to contribute to the development of more challenging NLP benchmarks and, potentially, of more challenging models. One concern worth noting is that most of the evaluation datasets we study are crowdsourced or drawn from naturally occurring data. Thus, they likely demonstrate harmful stereotypes to some degree or even score models more highly for demonstrating them. In general, models that perform well on these datasets should not be deployed directly without additional measures to measure and eliminate any harms that stereotypes like these could cause in the target application settings.

References

Frank B. Baker and Seock-Ho Kim. 1993. Item response theory: Parameter estimation techniques. Journal of the American Statistical Association, 88:707.

Luisa Bentivogli, Peter Clark, Ido Dagan, and DaniloGiampiccolo. 2009. The Fifth PASCAL Recogniz-ing Textual Entailment Challenge. In TAC.

Chandra Bhagavatula, Ronan Le Bras, ChaitanyaMalaviya, Keisuke Sakaguchi, Ari Holtzman, Han-nah Rashkin, Doug Downey, Wen tau Yih, and YejinChoi. 2020. Abductive Commonsense Reasoning.In International Conference on Learning Represen-tations.

Eli Bingham, Jonathan P. Chen, Martin Jankowiak,Fritz Obermeyer, Neeraj Pradhan, Theofanis Kar-aletsos, Rohit Singh, Paul A. Szerlip, Paul Horsfall,and Noah D. Goodman. 2019. Pyro: Deep UniversalProbabilistic Programming. J. Mach. Learn. Res.,20:28:1–28:6.

Yonatan Bisk, Rowan Zellers, Ronan LeBras, JianfengGao, and Yejin Choi. 2020. PIQA: Reasoning aboutPhysical Commonsense in Natural Language. InThe Thirty-Fourth AAAI Conference on Artificial In-telligence, AAAI 2020, The Thirty-Second Innova-tive Applications of Artificial Intelligence Confer-ence, IAAI 2020, The Tenth AAAI Symposium on Ed-ucational Advances in Artificial Intelligence, EAAI2020, New York, NY, USA, February 7-12, 2020,pages 7432–7439. AAAI Press.

Samuel R. Bowman, Gabor Angeli, Christopher Potts,and Christopher D. Manning. 2015. A large anno-tated corpus for learning natural language inference.In Proceedings of the 2015 Conference on Empiri-cal Methods in Natural Language Processing, pages632–642, Lisbon, Portugal. Association for Compu-tational Linguistics.

Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. 2019. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2924–2936, Minneapolis, Minnesota. Association for Computational Linguistics.

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot,Ashish Sabharwal, Carissa Schoenick, and OyvindTafjord. 2018. Think You Have Solved Question An-swering? Try ARC, the AI2 Reasoning Challenge.arXiv preprint arXiv:1803.05457.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal,Vishrav Chaudhary, Guillaume Wenzek, FranciscoGuzmán, Edouard Grave, Myle Ott, Luke Zettle-moyer, and Veselin Stoyanov. 2020. Unsupervisedcross-lingual representation learning at scale. InProceedings of the 58th Annual Meeting of the Asso-ciation for Computational Linguistics, pages 8440–8451, Online. Association for Computational Lin-guistics.

Leyang Cui, Yu Wu, Shujie Liu, Yue Zhang, and MingZhou. 2020. MuTual: A dataset for multi-turn dia-logue reasoning. In Proceedings of the 58th AnnualMeeting of the Association for Computational Lin-guistics, pages 1406–1416, Online. Association forComputational Linguistics.

Ido Dagan, Oren Glickman, and Bernardo Magnini.2005. The PASCAL recognising textual entailmentchallenge. In Machine Learning Challenges Work-shop, pages 177–190. Springer.

Pradeep Dasigi, Nelson F. Liu, Ana Marasovic,Noah A. Smith, and Matt Gardner. 2019. Quoref:A reading comprehension dataset with questions re-quiring coreferential reasoning. In Proceedings ofthe 2019 Conference on Empirical Methods in Nat-ural Language Processing and the 9th InternationalJoint Conference on Natural Language Processing(EMNLP-IJCNLP), pages 5925–5932, Hong Kong,China. Association for Computational Linguistics.

Marie-Catherine De Marneffe, Mandy Simons, andJudith Tonhauser. 2019. The CommitmentBank:Investigating projection in naturally occurring dis-course. In proceedings of Sinn und Bedeutung, vol-ume 23, pages 107–124.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, andKristina Toutanova. 2019. BERT: Pre-training ofdeep bidirectional transformers for language under-standing. In Proceedings of the 2019 Conferenceof the North American Chapter of the Associationfor Computational Linguistics: Human LanguageTechnologies, Volume 1 (Long and Short Papers),pages 4171–4186, Minneapolis, Minnesota. Associ-ation for Computational Linguistics.

Adam Fisch, Alon Talmor, Robin Jia, Minjoon Seo, Eu-nsol Choi, and Danqi Chen. 2019. MRQA 2019shared task: Evaluating generalization in readingcomprehension. In Proceedings of the 2nd Work-shop on Machine Reading for Question Answering,pages 1–13, Hong Kong, China. Association forComputational Linguistics.


Danilo Giampiccolo, Bernardo Magnini, Ido Dagan,and Bill Dolan. 2007. The third PASCAL recogniz-ing textual entailment challenge. In Proceedings ofthe ACL-PASCAL Workshop on Textual Entailmentand Paraphrasing, pages 1–9, Prague. Associationfor Computational Linguistics.

Ivan Habernal, Henning Wachsmuth, Iryna Gurevych,and Benno Stein. 2018. The argument reasoningcomprehension task: Identification and reconstruc-tion of implicit warrants. In Proceedings of the 2018Conference of the North American Chapter of theAssociation for Computational Linguistics: HumanLanguage Technologies, Volume 1 (Long Papers),pages 1930–1940, New Orleans, Louisiana. Associ-ation for Computational Linguistics.

R Bar Haim, Ido Dagan, Bill Dolan, Lisa Ferro, DaniloGiampiccolo, Bernardo Magnini, and Idan Szpektor.2006. The second pascal recognising textual entail-ment challenge. In Proceedings of the Second PAS-CAL Challenges Workshop on Recognising TextualEntailment.

Pengcheng He, Xiaodong Liu, Jianfeng Gao, andWeizhu Chen. 2020. Deberta: Decoding-enhancedbert with disentangled attention.

Mark Hopkins and Jonathan May. 2013. Models oftranslation competitions. In Proceedings of the 51stAnnual Meeting of the Association for Computa-tional Linguistics (Volume 1: Long Papers), pages1416–1424, Sofia, Bulgaria. Association for Compu-tational Linguistics.

Lifu Huang, Ronan Le Bras, Chandra Bhagavatula, andYejin Choi. 2019. Cosmos QA: Machine readingcomprehension with contextual commonsense rea-soning. In Proceedings of the 2019 Conference onEmpirical Methods in Natural Language Processingand the 9th International Joint Conference on Natu-ral Language Processing (EMNLP-IJCNLP), pages2391–2401, Hong Kong, China. Association forComputational Linguistics.

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Red-field, Michael Collins, Ankur Parikh, Chris Al-berti, Danielle Epstein, Illia Polosukhin, Jacob De-vlin, Kenton Lee, Kristina Toutanova, Llion Jones,Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai,Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019.Natural questions: A benchmark for question an-swering research. Transactions of the Associationfor Computational Linguistics, 7:452–466.

Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang,and Eduard Hovy. 2017. RACE: Large-scale ReAd-ing comprehension dataset from examinations. InProceedings of the 2017 Conference on EmpiricalMethods in Natural Language Processing, pages785–794, Copenhagen, Denmark. Association forComputational Linguistics.

John P. Lalor, Hao Wu, Tsendsuren Munkhdalai, and Hong Yu. 2018. Understanding deep learning performance through an examination of test set difficulty: A psychometric case study. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4711–4716, Brussels, Belgium. Association for Computational Linguistics.

John P. Lalor, Hao Wu, and Hong Yu. 2016. Build-ing an evaluation scale using item response theory.In Proceedings of the 2016 Conference on Empiri-cal Methods in Natural Language Processing, pages648–657, Austin, Texas. Association for Computa-tional Linguistics.

John P. Lalor, Hao Wu, and Hong Yu. 2019. Learn-ing latent parameters without human response pat-terns: Item response theory with artificial crowds. InProceedings of the 2019 Conference on EmpiricalMethods in Natural Language Processing and the9th International Joint Conference on Natural Lan-guage Processing (EMNLP-IJCNLP), pages 4249–4259, Hong Kong, China. Association for Computa-tional Linguistics.

John P. Lalor and Hong Yu. 2020. Dynamic data selec-tion for curriculum learning via ability estimation.In Findings of the Association for ComputationalLinguistics: EMNLP 2020, pages 545–555, Online.Association for Computational Linguistics.

Zhenzhong Lan, Mingda Chen, Sebastian Goodman,Kevin Gimpel, Piyush Sharma, and Radu Soricut.2020. ALBERT: A lite BERT for self-supervisedlearning of language representations. In 8th Inter-national Conference on Learning Representations,ICLR 2020.

Hector Levesque, Ernest Davis, and Leora Morgen-stern. 2012. The Winograd Schema Challenge. InThirteenth International Conference on the Princi-ples of Knowledge Representation and Reasoning.Citeseer.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Man-dar Joshi, Danqi Chen, Omer Levy, Mike Lewis,Luke Zettlemoyer, and Veselin Stoyanov. 2019.RoBERTa: A robustly optimized bert pretraining ap-proach. Unpublished manuscript available on arXiv.

Fernando Martínez-Plumed, Ricardo B.C. Prudêncio,Adolfo Martínez-Usó, and José Hernández-Orallo.2019. Item response theory in ai: Analysing ma-chine learning classifiers at the instance level. Artifi-cial Intelligence, 271:18 – 42.

Julian Michael, Gabriel Stanovsky, Luheng He, Ido Da-gan, and Luke Zettlemoyer. 2018. Crowdsourcingquestion-answer meaning representations. In Pro-ceedings of the 2018 Conference of the North Amer-ican Chapter of the Association for ComputationalLinguistics: Human Language Technologies, Vol-ume 2 (Short Papers), pages 560–568, New Orleans,Louisiana. Association for Computational Linguis-tics.


Nikita Nangia and Samuel R. Bowman. 2019. Humanvs. muppet: A conservative estimate of human per-formance on the GLUE benchmark. In Proceedingsof the 57th Annual Meeting of the Association forComputational Linguistics, pages 4566–4575, Flo-rence, Italy. Association for Computational Linguis-tics.

Yixin Nie, Adina Williams, Emily Dinan, MohitBansal, Jason Weston, and Douwe Kiela. 2020. Ad-versarial NLI: A new benchmark for natural lan-guage understanding. In Proceedings of the 58th An-nual Meeting of the Association for ComputationalLinguistics, pages 4885–4901, Online. Associationfor Computational Linguistics.

Timothy Niven and Hung-Yu Kao. 2019. Probing neu-ral network comprehension of natural language ar-guments. In Proceedings of the 57th Annual Meet-ing of the Association for Computational Linguis-tics, pages 4658–4664, Florence, Italy. Associationfor Computational Linguistics.

Simon Ostermann, Ashutosh Modi, Michael Roth, Ste-fan Thater, and Manfred Pinkal. 2018. MCScript:A novel dataset for assessing machine comprehen-sion using script knowledge. In Proceedings ofthe Eleventh International Conference on LanguageResources and Evaluation (LREC 2018), Miyazaki,Japan. European Language Resources Association(ELRA).

Naoki Otani, Toshiaki Nakazawa, Daisuke Kawahara,and Sadao Kurohashi. 2016. IRT-based aggrega-tion model of crowdsourced pairwise comparisonfor evaluating machine translations. In Proceed-ings of the 2016 Conference on Empirical Methodsin Natural Language Processing, pages 511–520,Austin, Texas. Association for Computational Lin-guistics.

Adam Paszke, Sam Gross, Francisco Massa, AdamLerer, James Bradbury, Gregory Chanan, TrevorKilleen, Zeming Lin, Natalia Gimelshein, LucaAntiga, Alban Desmaison, Andreas Kopf, EdwardYang, Zachary DeVito, Martin Raison, Alykhan Te-jani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang,Junjie Bai, and Soumith Chintala. 2019. PyTorch:An Imperative Style, High-Performance Deep Learn-ing Library. In H. Wallach, H. Larochelle,A. Beygelzimer, F. d’ Alché-Buc, E. Fox, and R. Gar-nett, editors, Advances in Neural Information Pro-cessing Systems 32, pages 8024–8035. Curran Asso-ciates, Inc.

Mohammad Taher Pilehvar and Jose Camacho-Collados. 2019. WiC: the word-in-context datasetfor evaluating context-sensitive meaning represen-tations. In Proceedings of the 2019 Conferenceof the North American Chapter of the Associationfor Computational Linguistics: Human LanguageTechnologies, Volume 1 (Long and Short Papers),pages 1267–1273, Minneapolis, Minnesota. Associ-ation for Computational Linguistics.

Yada Pruksachatkun, Jason Phang, Haokun Liu,Phu Mon Htut, Xiaoyi Zhang, Richard YuanzhePang, Clara Vania, Katharina Kann, and Samuel R.Bowman. 2020a. Intermediate-task transfer learningwith pretrained language models: When and whydoes it work? In Proceedings of the 58th AnnualMeeting of the Association for Computational Lin-guistics, pages 5231–5247, Online. Association forComputational Linguistics.

Yada Pruksachatkun, Phil Yeres, Haokun Liu, JasonPhang, Phu Mon Htut, Alex Wang, Ian Tenney, andSamuel R. Bowman. 2020b. jiant: A softwaretoolkit for research on general-purpose text under-standing models. In Proceedings of the 58th AnnualMeeting of the Association for Computational Lin-guistics: System Demonstrations, pages 109–117,Online. Association for Computational Linguistics.

Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018.Know what you don’t know: Unanswerable ques-tions for SQuAD. In Proceedings of the 56th An-nual Meeting of the Association for ComputationalLinguistics (Volume 2: Short Papers), pages 784–789, Melbourne, Australia. Association for Compu-tational Linguistics.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, andPercy Liang. 2016. SQuAD: 100,000+ questions formachine comprehension of text. In Proceedings ofthe 2016 Conference on Empirical Methods in Natu-ral Language Processing, pages 2383–2392, Austin,Texas. Association for Computational Linguistics.

Rajesh Ranganath, Sean Gerrish, and David M. Blei.2014. Black Box Variational Inference. In Proceed-ings of the Seventeenth International Conference onArtificial Intelligence and Statistics, AISTATS 2014,Reykjavik, Iceland, April 22-25, 2014, volume 33of JMLR Workshop and Conference Proceedings,pages 814–822. JMLR.org.

Pedro Rodriguez, Joe Barrow, Alexander Hoyle, John P.Lalor, Robin Jia, and Boyd-Graber Jordan. 2021.Evaluation examples are not equally informative:How should that change nlp leaderboards? In Pro-ceedings of the 59th Annual Meeting of the Associa-tion for Computational Linguistics, Online. Associa-tion for Computational Linguistics.

Melissa Roemmele, Cosmin Adrian Bejan, and An-drew S Gordon. 2011. Choice of Plausible Alterna-tives: An Evaluation of Commonsense Causal Rea-soning. In AAAI spring symposium: logical formal-izations of commonsense reasoning, pages 90–95.

Anna Rogers, Olga Kovaleva, Matthew Downey, andAnna Rumshisky. 2020. Getting closer to AI com-plete question answering: A set of prerequisitereal tasks. In The Thirty-Fourth AAAI Conferenceon Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelli-gence Conference, IAAI 2020, The Tenth AAAI Sym-posium on Educational Advances in Artificial Intel-ligence, EAAI 2020, New York, NY, USA, February7-12, 2020, pages 8722–8731. AAAI Press.


Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavat-ula, and Yejin Choi. 2020. Winogrande: An adver-sarial winograd schema challenge at scale. In Pro-ceedings of the AAAI Conference on Artificial Intel-ligence, volume 34, pages 8732–8740.

Maarten Sap, Hannah Rashkin, Derek Chen, RonanLe Bras, and Yejin Choi. 2019. Social IQa: Com-monsense reasoning about social interactions. InProceedings of the 2019 Conference on EmpiricalMethods in Natural Language Processing and the9th International Joint Conference on Natural Lan-guage Processing (EMNLP-IJCNLP), pages 4463–4473, Hong Kong, China. Association for Computa-tional Linguistics.

Viktor Schlegel, Marco Valentino, Andre Freitas,Goran Nenadic, and Riza Batista-Navarro. 2020. Aframework for evaluation of machine reading com-prehension gold standards. In Proceedings of the12th Language Resources and Evaluation Confer-ence, pages 5359–5369, Marseille, France. Euro-pean Language Resources Association.

João Sedoc and Lyle Ungar. 2020. Item response the-ory for efficient human evaluation of chatbots. InProceedings of the First Workshop on Evaluationand Comparison of NLP Systems, pages 21–33, On-line. Association for Computational Linguistics.

Saku Sugawara, Kentaro Inui, Satoshi Sekine, andAkiko Aizawa. 2018. What makes reading com-prehension questions easier? In Proceedings ofthe 2018 Conference on Empirical Methods in Nat-ural Language Processing, pages 4208–4219, Brus-sels, Belgium. Association for Computational Lin-guistics.

Alon Talmor, Jonathan Herzig, Nicholas Lourie, andJonathan Berant. 2019. CommonsenseQA: A ques-tion answering challenge targeting commonsenseknowledge. In Proceedings of the 2019 Conferenceof the North American Chapter of the Associationfor Computational Linguistics: Human LanguageTechnologies, Volume 1 (Long and Short Papers),pages 4149–4158, Minneapolis, Minnesota. Associ-ation for Computational Linguistics.

Adam Trischler, Tong Wang, Xingdi Yuan, Justin Har-ris, Alessandro Sordoni, Philip Bachman, and Ka-heer Suleman. 2017. NewsQA: A machine compre-hension dataset. In Proceedings of the 2nd Work-shop on Representation Learning for NLP, pages191–200, Vancouver, Canada. Association for Com-putational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, JakobUszkoreit, Llion Jones, Aidan N Gomez, Ł ukaszKaiser, and Illia Polosukhin. 2017. Attention is Allyou Need. In Advances in Neural Information Pro-cessing Systems, volume 30, pages 5998–6008. Cur-ran Associates, Inc.

Tu Vu, Tong Wang, Tsendsuren Munkhdalai, Alessandro Sordoni, Adam Trischler, Andrew Mattarella-Micke, Subhransu Maji, and Mohit Iyyer. 2020. Exploring and predicting transferability across NLP tasks. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7882–7926, Online. Association for Computational Linguistics.

Alex Wang, Jan Hula, Patrick Xia, Raghavendra Pappagari, R. Thomas McCoy, Roma Patel, Najoung Kim, Ian Tenney, Yinghui Huang, Katherin Yu, Shuning Jin, Berlin Chen, Benjamin Van Durme, Edouard Grave, Ellie Pavlick, and Samuel R. Bowman. 2019a. Can you tell me how to get past sesame street? Sentence-level pretraining beyond language modeling. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4465–4476, Florence, Italy. Association for Computational Linguistics.

Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2019b. SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 3266–3280. Curran Associates, Inc.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, Brussels, Belgium. Association for Computational Linguistics.

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122, New Orleans, Louisiana. Association for Computational Linguistics.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.

Rowan Zellers, Yonatan Bisk, Roy Schwartz, and Yejin Choi. 2018. SWAG: A large-scale adversarial dataset for grounded commonsense inference. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 93–104, Brussels, Belgium. Association for Computational Linguistics.

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. HellaSwag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791–4800, Florence, Italy. Association for Computational Linguistics.

Tianyi Zhang, Felix Wu, Arzoo Katiyar, Kilian Q Weinberger, and Yoav Artzi. 2021a. Revisiting few-sample BERT fine-tuning. In International Conference on Learning Representations.

Yian Zhang, Alex Warstadt, Haau-Sing Li, and Samuel R. Bowman. 2021b. When do you need billions of words of pretraining data? In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics, Online. Association for Computational Linguistics.

Ben Zhou, Daniel Khashabi, Qiang Ning, and Dan Roth. 2019. “Going on a vacation” takes longer than “going for a walk”: A study of temporal commonsense understanding. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3363–3369, Hong Kong, China. Association for Computational Linguistics.


A Discrimination vs. Guessing

In addition to the analysis of discrimination versus difficulty parameters, we also look at the distribution of the guessing (γ) parameters. From Figure 6, we observe that all QA datasets with span selection format generally have low guessing parameters, meaning that they are difficult to predict correctly by random guessing. This makes sense, as span selection has higher label entropy than classification or multiple-choice tasks. We also find that several datasets have examples with widely varying guessing parameters: for SNLI, we see a high density of examples that can be predicted easily by random guessing, while for MNLI, HellaSwag, and MCScript, there are more examples with low guessing parameters.
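To make the role of γ concrete, the short Python sketch below evaluates the standard three-parameter logistic (3PL) item response function; the symbols (ability θ, discrimination α, difficulty β, guessing γ) follow the usual 3PL convention, and the numeric values are illustrative only, not taken from our fitted model.

import numpy as np

def p_correct(theta, alpha, beta, gamma):
    # Three-parameter logistic (3PL) item response function: the probability
    # that a model with ability theta answers an item with discrimination
    # alpha, difficulty beta, and guessing gamma correctly. As theta
    # decreases, the probability approaches the floor gamma, which is why
    # low-gamma (span-selection) items are hard to get right by chance.
    return gamma + (1.0 - gamma) / (1.0 + np.exp(-alpha * (theta - beta)))

# A weak model (theta = -2) on an item of average difficulty:
print(p_correct(theta=-2.0, alpha=1.0, beta=0.0, gamma=0.05))  # ~0.16, span-selection-like floor
print(p_correct(theta=-2.0, alpha=1.0, beta=0.0, gamma=0.50))  # ~0.56, binary-classification-like floor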

B Additional Analysis on Model Ability

Figure 7 plots model abilities θ against their average accuracy over all test examples, where each point represents a model checkpoint (Section 3.2). We use different colors for different models (e.g., dark blue for ALBERT-XXL-v2) and different shapes to mark different checkpoints.

Since we only perform hyperparameter tuning on RoBERTa-Large, some of these models might perform worse than if they were tuned individually.
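As a rough illustration of how such a plot can be produced from fitted abilities and per-example responses, the sketch below scatters average accuracy against ability for a handful of hypothetical checkpoints; the record fields (model, checkpoint, theta, responses) and the example values are placeholders, not the format used in our released code.

import matplotlib.pyplot as plt
import numpy as np

# Hypothetical records: one entry per model checkpoint, with its fitted
# ability (theta) and a 0/1 response vector over all test examples.
checkpoints = [
    {"model": "roberta-large", "checkpoint": 1, "theta": 1.2,
     "responses": np.array([1, 1, 0, 1])},
    {"model": "roberta-base", "checkpoint": 1, "theta": 0.4,
     "responses": np.array([1, 0, 0, 1])},
]

for ckpt in checkpoints:
    accuracy = ckpt["responses"].mean()  # average accuracy over all examples
    plt.scatter(ckpt["theta"], accuracy,
                label="{}-{}".format(ckpt["model"], ckpt["checkpoint"]))

plt.xlabel("ability (theta)")
plt.ylabel("average accuracy")
plt.legend()
plt.savefig("ability_vs_accuracy.png")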

C Task Descriptions

In this section, we provide a short description of each dataset.

RTE The series of Recognizing Textual Entailment datasets (Dagan et al., 2005; Haim et al., 2006; Giampiccolo et al., 2007; Bentivogli et al., 2009) corresponds to a two-class textual entailment classification task. Given a premise sentence and a hypothesis sentence, the task is to decide whether the premise entails the hypothesis.

SNLI The Stanford Natural Language Inference corpus (Bowman et al., 2015) is a textual entailment dataset, formulated as a three-class classification task. Given a premise sentence and a hypothesis sentence, the task is to determine if the premise entails the hypothesis, contradicts it, or neither. The SNLI dataset is created using premises taken from image captions.

MNLI The Multi-Genre Natural Language Inference corpus (Williams et al., 2018) is also a textual entailment dataset, similar to SNLI. The MNLI dataset is built to cover a broad range of genres, including written and spoken text. Half of its test set is created from text that is out of domain relative to the training set.

CommitmentBank CommitmentBank (De Marneffe et al., 2019) is a dataset formulated as a three-class textual entailment classification task. Given a piece of text and an embedded clause, models must decide whether the embedded clause is entailed by the text.

ARCT The Argument Reasoning Comprehension Task (Habernal et al., 2018) is a multiple-choice question answering dataset. Given an argument, a claim, and a premise, the task is to select the correct implicit warrant (which explains why the premise implies the claim) from two choices.

ARC-Easy ARC (Clark et al., 2018) is a multiple-choice QA dataset composed of real grade-school multiple-choice science questions. ARC-Easy is composed of the easier questions that do not satisfy the criteria used to build ARC-Challenge (described below).

ARC-Challenge ARC-Challenge (Clark et al., 2018) is the subset of ARC that contains questions that are incorrectly answered by both a retrieval-based algorithm and a word co-occurrence algorithm.

MCScript MCScript (Ostermann et al., 2018) is a multiple-choice QA dataset. The dataset tests models' commonsense knowledge, in particular script knowledge, which corresponds to the sequences of actions people perform in particular situations.

Cosmos QA Cosmos QA (Huang et al., 2019) is a multiple-choice reading comprehension dataset intended to require extensive abstractive commonsense reasoning. Unlike CommonsenseQA, Cosmos QA requires comprehension of an accompanying passage, instead of simply responding to a free-standing question.

HellaSwag HellaSwag (Zellers et al., 2019) is a commonsense reasoning multiple-choice dataset. It is built using adversarial filtering with BERT. Given a story, the task is to select the most plausible continuation.

Figure 6: Plots of the log discrimination (log α) versus the guessing (γ) parameters for each dataset.

Figure 7: Average model accuracy over all datasets vs. ability (θ). The three different hyperparameter configurations of each MiniBERTa are represented by a single color for ease of readability. Best viewed in color.

BoolQ BoolQ (Clark et al., 2019) is a boolean (yes/no) reading comprehension QA dataset built using the same pipeline used to produce the (non-boolean) Natural Questions (Kwiatkowski et al., 2019).

MuTual MuTual (Cui et al., 2020) is a multiple-choice QA dataset for multi-turn dialogue reasoning. The dataset is created from Chinese students' English listening comprehension exams, and it is intended to require a variety of commonsense reasoning skills.

MuTual-Plus MuTual-Plus (Cui et al., 2020) is a variant of MuTual in which one of the choices in each answer set is replaced by a safe response (i.e., "could you repeat that"). This variant is built to evaluate whether a model selects the safe response when all other options are incorrect.

QuAIL QuAIL (Rogers et al., 2020) is a reading comprehension dataset formulated as a multiple-choice task. One feature of QuAIL is that it combines "commonsense, text-based, and unanswerable questions." It is also designed such that it has a balanced distribution of genres and reasoning types.

COPA Choice of Plausible Alternatives (Roemmele et al., 2011) is a sentence-level multiple-choice dataset. Given a premise and a question that asks for the cause or effect of the premise, the task is to choose the more plausible hypothesis from two options.

WSC The Winograd Schema Challenge (Levesque et al., 2012) is a sentence-level multiple-choice commonsense reasoning dataset. Given a piece of text, a pronoun, and a list of possible noun phrases, the model must choose the correct referent of the pronoun. The dataset is designed such that world knowledge is required to make the correct choice. We use the SuperGLUE (Wang et al., 2019b) version of the dataset.

CommonsenseQA CommonsenseQA (Talmor et al., 2019) is a multiple-choice QA dataset which is designed to test a range of commonsense knowledge.

SocialIQA SocialIQA (Sap et al., 2019) is a dataset that is specifically designed to test a model's capabilities related to emotional and social intelligence in everyday situations.

MC-TACO MC-TACO (Zhou et al., 2019) is a multiple-choice QA dataset that is designed to test temporal commonsense reasoning, in particular: duration, temporal ordering, typical time, frequency, and stationarity. Each question consists of a varying number of choices, and for each answer choice, a model needs to predict whether the answer is correct or incorrect.

WiC The Word-in-Context dataset (Pilehvar and Camacho-Collados, 2019) is designed to test the word sense disambiguation skill of a model. Given two pieces of text (a phrase or a sentence) with a polysemous word in both, a model needs to predict whether the two words are used in the same sense.

PIQA The Physical Interaction Question Answering dataset (Bisk et al., 2020) is a multiple-choice QA dataset that is designed to test physical commonsense reasoning. Given a physical task expressed in text, a model needs to select the most sensible solution.

WinoGrande The WinoGrande dataset (Sakaguchi et al., 2020) is built through a crowdsourcing procedure that incorporates adversarial filtering. Given a sentence with a blank (where the blank corresponds to a noun phrase), the task is to select the correct filler. The dataset is designed to test commonsense reasoning.

Abductive NLI The Abductive Natural Language Inference dataset (Bhagavatula et al., 2020) is a multiple-choice dataset. Given a premise, the task is to select the most likely explanation from the given hypotheses.

QAMR The Question-Answer Meaning Representations dataset (Michael et al., 2018) is a QA dataset where the question-answer pairs are created from sentences' predicate-argument relationships.

NewsQA NewsQA (Trischler et al., 2017) is a QA dataset formulated as a span selection task. The dataset is built by crowdworkers using passages taken from CNN news articles.

SQuAD2.0 SQuAD2.0 (Rajpurkar et al., 2018) is a QA dataset that combines the span-selection reading-comprehension questions in SQuAD 1.1 (Rajpurkar et al., 2016) with over 50,000 unanswerable questions. The unanswerable questions were written by crowdworkers to look like the answerable ones. A model must either select an answer span or decline to answer.

Quoref Quoref (Dasigi et al., 2019) is a QA dataset that is designed to test coreferential reasoning ability. The dataset is formulated as a span selection QA task.

MRQA Natural Questions The Natural Questions dataset (Kwiatkowski et al., 2019) is designed to test a model's reading comprehension ability. The questions are taken from real-world queries, while the context passages are taken from Wikipedia articles. We use the MRQA version, which contains a preprocessed subset of the questions in Natural Questions.

ANLI The Adversarial Natural Language Inference dataset (Nie et al., 2020) is a textual entailment dataset built using an iterative human-and-model-in-the-loop procedure in order to find hard examples.


Table 3: Results of all our models on each validation set. RL: RoBERTa-Large, RB: RoBERTa-Base, BL: BERT-Large, BB: BERT-Base, RMS: RoBERTa-Med-Small. Best denotes the best known performance on the original dataset's validation set.