Is My Model Using The Right Evidence? Systematic Probes for Examining Evidence-Based Tabular Reasoning

Vivek Gupta* (University of Utah)
Riyaz A. Bhat (Verisk Inc.)
Atreya Ghosal (LTRC, IIIT-H)
Manish Srivastava (LTRC, IIIT-H)
Maneesh Singh (Verisk Inc.)
Vivek Srikumar (University of Utah)

Abstract

While neural models routinely report state-of-the-art performance across NLP tasks involving reasoning, their outputs are often observed to not properly use and reason on the evidence presented to them in the inputs. A model that reasons properly is expected to attend to the right parts of the input, be self-consistent in its predictions across examples, avoid spurious patterns in inputs, and ignore biasing from its underlying pretrained language model in a nuanced, context-sensitive fashion (e.g., handling counterfactuals). Do today's models do so? In this paper, we study this question using the problem of reasoning on tabular data. The tabular nature of the input is particularly suited for the study as it admits systematic probes targeting the properties listed above. Our experiments demonstrate that a BERT-based model representative of today's state-of-the-art fails to properly reason on the following counts: it often (a) misses the relevant evidence, (b) suffers from hypothesis and knowledge biases, and (c) relies on annotation artifacts and knowledge from pretrained language models as primary evidence rather than relying on reasoning on the premises in the tabular input.

1 Introduction

The problem of understanding tabular or semi-structured knowledge presents a reasoning challenge for modern NLP systems. Recently, Chen et al. (2019) and Gupta et al. (2020) have posed this problem in the form of a natural language inference question (NLI, Dagan et al., 2013; Bowman et al., 2015, inter alia), where a model is asked to determine whether a hypothesis is entailed or contradicted given a premise, or is unrelated to it. To study the problem as an NLI task, they have extended the standard NLI datasets with the TabFact (Chen et al., 2019) and INFOTABS (Gupta et al., 2020) datasets containing tabular premises.

* Corresponding author.

One strategy to model such tabular reasoning tasks involves relying on the successes of contextualized representations (e.g., Devlin et al., 2019; Liu et al., 2019b) for the sentential version of the problem. To encode tabular data into a form amenable for these models, the tables are flattened into artificial sentences using heuristics. Surprisingly, even this naïve strategy leads to high predictive accuracy (e.g., Chen et al., 2019; Gupta et al., 2020; Eisenschlos et al., 2020; Yin et al., 2020).

In this paper, we ask: do these seemingly accurate models for tabular reasoning with contextualized embeddings actually reason properly about their semi-structured inputs? We posit that a model which properly reasons about its inputs should (a) use the evidence presented to it, and the right parts thereof, (b) be self-consistent in its predictions across controlled variants of the input, and (c) avoid being biased by knowledge encoded in the pretrained embeddings.

We design systematic probes to examine whether these properties hold for a tabular NLI system. Our probes exploit the semi-structured nature of the premises, allowing us not only to semi-automatically construct a large number of them, but also to define expected model behavior unambiguously. Specifically, we construct three kinds of probes that either introduce controlled edits to the premise or the hypothesis, or create counterfactual examples. Our experiments reveal that despite seemingly high test set accuracy, a model based on RoBERTa (Liu et al., 2019b) is far from being able to reason correctly about tabular information. Not only does it ignore relevant evidence from its inputs, but it also relies heavily on annotation artifacts and pre-trained knowledge.

The dataset, MTurk templates, and the scripts are available at https://github.com/utahnlp/table_probing.


Breakfast in America
Released: 29 March 1979
Recorded: May–December 1978
Studio: The Village Recorder (Studio B) in Los Angeles
Genre: pop ; art rock ; soft rock
Length: 46:06
Label: A&M
Producer: Peter Henderson, Supertramp

H1: Breakfast in America is a pop album with a length of 46 minutes.
H2: Breakfast in America was released at the end of 1979.
H3: Most of Breakfast in America was recorded in the last month of 1978.
H4: Breakfast in America has 6 tracks.

Figure 1: A tabular premise example. The hypothesis H1 is entailed by it, H2 is a contradiction, and H3 and H4 are neutral, i.e., neither entailed nor contradicted.

2 Preliminaries: Tabular NLI

Tabular natural language inference is a reasoning task similar to standard NLI in that it examines if a natural language hypothesis can be derived from the given premise. Unlike standard NLI, where evidence is presented in the form of sentences, the premises in tabular NLI are semi-structured tabular data. Recently, datasets such as TabFact (Chen et al., 2019) and INFOTABS (Gupta et al., 2020), as well as a SemEval shared task (Wang et al., 2021a), have sparked interest in tabular NLI research.

Tabular NLI data. In this paper, we use the INFOTABS dataset for our investigations. It consists of 23,738 premise-hypothesis pairs. The tabular premises are based on Wikipedia infoboxes, and the hypotheses are short statements labeled as ENTAIL, CONTRADICT or NEUTRAL. Figure 1 shows an example of a table and four hypotheses from this dataset and will serve as our running example. The dataset contains 2,540 distinct infoboxes representing a variety of domains. All hypotheses were written and labeled by Amazon Mechanical Turk workers. All tables contain a title and two columns, as shown in the example. Since each row takes the form of a key-value pair, we will refer to the elements in the left column as the keys, while the right column provides the corresponding values.

The diversity of topics in INFOTABS over the TabFact data motivates our choice of the former. Additionally, TabFact does not include NEUTRAL statements, and only annotates ENTAIL and CONTRADICT hypotheses. Furthermore, in addition to the usual train and development sets, INFOTABS includes three test sets, α1, α2 and α3. α1 represents a standard test set that is both topically and lexically similar to the training data. In α2, hypotheses are designed to be lexically adversarial, and α3 tables are drawn from topics unavailable in the training set. We will use all three test sets for our analysis.

Representation of Tabular Premises. Unlike standard NLI, which can use off-the-shelf pre-trained contextualized embeddings, the semi-structured nature of premises in tabular NLI necessitates a different modeling approach. Following Chen et al. (2019), tabular premises are flattened into token sequences that fit the input interface of such models. Different flattening strategies exist in the literature. In this work, we follow the Table as Paragraph strategy of Gupta et al. (2020), wherein each row of the table is converted to a sentence of the form "The key of title is value." This strategy achieved the highest accuracy in the original work, as shown in Table 1. The table also shows the hypothesis-only baseline (Poliak et al., 2018; Gururangan et al., 2018) and human agreement on the labels.

Model              dev     α1      α2      α3
Human              79.78   84.04   83.88   79.33
Hypothesis Only    60.51   60.48   48.26   48.89
RoBERTaL           75.55   74.88   65.55   64.94

Table 1: Performance of the Table as Paragraph strategy on various INFOTABS subsets, along with the hypothesis-only baseline and human agreement results. All results are reproduced from Gupta et al. (2020).
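To make the flattening concrete, the following is a minimal sketch of the Table as Paragraph idea: each key-value row is rendered as an artificial sentence "The key of title is value." and the sentences are concatenated into the premise. The helper name and the exact punctuation are illustrative assumptions, not the original implementation.

```python
# Minimal sketch of the "Table as Paragraph" flattening (illustrative only).
def flatten_table(title, rows):
    """rows: list of (key, value) pairs from an infobox-style table."""
    # Each row becomes an artificial sentence of the form
    # "The <key> of <title> is <value>."
    sentences = [f"The {key} of {title} is {value}." for key, value in rows]
    return " ".join(sentences)

premise = flatten_table(
    "Breakfast in America",
    [("Released", "29 March 1979"),
     ("Genre", "pop ; art rock ; soft rock"),
     ("Length", "46:06")],
)
# premise starts with: "The Released of Breakfast in America is 29 March 1979. ..."
```

The flattened premise, paired with a hypothesis, can then be fed to the NLI classifier as a regular sentence pair.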

3 Reasoning: An illusion?

Given the high accuracies in Table 1, should we conclude that the RoBERTa-based model reasons about tabular inputs? Merely achieving high accuracy on a labeling task is not sufficient evidence of reasoning: the model may be arriving at the right answer for the wrong reasons. This observation is in line with recent work that points out that the high-capacity models we use may be relying on spurious correlations (e.g., Poliak et al., 2018).

"Reasoning" is a multi-faceted phenomenon, and fully characterizing it is beyond the scope of this work. However, we can probe for the absence of specific kinds of reasoning by studying how the model responds to carefully constructed inputs and their variants. The guiding idea for this work is:


Any system that claims to reason about data should demonstrate expected, predictable behavior in response to controlled changes to its inputs.

In practice, the contrapositive of this idea provides a blueprint for showing that our models do not reason. That is, for a given dimension of reasoning, if we can show that a model does not exhibit the expected behavior in response to controlled changes to its input, then the model cannot claim to successfully reason along that dimension. We note that this strategy has been either explicitly or implicitly employed in several lines of recent work (Ribeiro et al., 2020; Gardner et al., 2020).

In this work, we instantiate the above strategy along three specific dimensions, which we briefly introduce using the running example in Figure 1. Each dimension defines several concrete probes that subsequent sections detail.

1. Evidence Selection. A model should use the correct evidence in the premise for determining the hypothesis label. For example, ascertaining that hypothesis H1 is entailed requires the rows Genre and Length in the table. When any relevant row is removed from a table, a model that originally predicts the ENTAIL or the CONTRADICT label should predict the NEUTRAL label. However, when an irrelevant row is removed, the model should not change its prediction from ENTAIL to NEUTRAL or vice versa.

2. Avoiding Annotation Artifacts. The model should not rely on spurious lexical artifacts to arrive at a label. It should predict the correct label for hypotheses that may be lexically different (even opposing) but essentially rely on the same reasoning steps (perhaps up to negation). For example, in hypothesis H2, if the token 'end' is replaced with 'start', then the model prediction should change from CONTRADICT to ENTAIL.

3. Robustness to Counterfactual Changes. The model should be grounded in the provided information even if it contradicts the real world, i.e., counterfactual information. For example, if the month in the Released date is changed to 'December', then the model should change the label of H2 from CONTRADICT to ENTAIL. Note that this information about the release date contradicts the real world; hence a language model cannot rely on knowledge learned from Wikipedia or the common web, the primary data used for pre-training. Thus, for the model to predict the label correctly, it needs to reason with the information in the table as the primary evidence, rather than its pre-trained knowledge.

A model capable of such robust reasoning is expected to respond in a predictable manner to controlled perturbations of the input that test its reasoning along these dimensions. There are certain pieces of information in the premise (irrelevant to the hypothesis) which do not impact the outcome, making the outcome invariant to the change. For example, deleting irrelevant rows from the premise tables should not change the model's predicted label. In contrast, there is the relevant information ('evidence') in the premise: changing these pieces of information varies the outcome in a predictable manner, making the model covariant with these changes. For example, deleting or modifying relevant evidence rows in the premise tables should change the model's predicted label.

The three properties described above are not limited to tabular inference. They can be extended to other NLP tasks too, such as reading comprehension, question answering, and standard sentential NLI. However, directly checking for such properties would require a lot of additional and richly labeled data, a big practical impediment.

Fortunately, in the case of tabular inference, the (in-/co-)variants associated with these properties allow controlled and semi-automatic edits to the inputs, leading to predictable variation of the expected output.

This insight underlies the design of the probes with which we examine the robustness of the reasoning employed by a model performing tabular inference. As we demonstrate in the following sections, highly effective probes can be designed in a mostly unsupervised manner, requiring minimal supervision in rare cases.

4 Automated Probing of Evidence Selection

Let us examine the first of our dimensions of reasoning, namely evidence selection. To check if some information is important for a prediction, we can modify it and observe the change in the model prediction.

We define four types of modifications to the tabular premises, none of which need manual labeling: (a) deleting one row, i.e., deleting information, (b) inserting a new row, i.e., inserting new information, (c) updating the value of an existing row, i.e., changing existing information, and (d) permuting the rows, i.e., reordering information. Each modification admits only certain label invariants and covariants. Let us examine these in detail; a sketch of the four operations follows.
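The sketch below illustrates the four operations on a table represented as an ordered list of (key, value) rows. The helper names and the multi-value separator are assumptions for illustration, not the released scripts.

```python
import copy
import random

def delete_row(rows, idx):
    # (a) Deletion: drop the row at position idx.
    return [row for i, row in enumerate(rows) if i != idx]

def insert_row(rows, new_key, new_value):
    # (b) Insertion: only keys absent from the table are added, so the new
    # row cannot contradict existing information (see footnote 1).
    assert all(key != new_key for key, _ in rows)
    return rows + [(new_key, new_value)]

def update_row(rows, idx, donor_value, rng=random):
    # (c) Update: replace one sub-value of a multi-value row (e.g., one genre)
    # with a value borrowed from another table that has the same key.
    key, value = rows[idx]
    values = [v.strip() for v in value.split(";")]
    values[rng.randrange(len(values))] = donor_value
    updated = copy.deepcopy(rows)
    updated[idx] = (key, " ; ".join(values))
    return updated

def permute_rows(rows, seed=0):
    # (d) Permutation: reorder the rows without changing their content.
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return shuffled
```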

Row Deletion. When we delete a relevant row from a premise table (e.g., the row Length for the hypothesis H1 in our running example), then irrespective of what the label for the original table was, the modified table should lead to a NEUTRAL label. If the deleted row is irrelevant (e.g., the row Producer for the same hypothesis), then the original label should be retained. Note that the label NEUTRAL remains unaffected by row deletion.
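As a sketch of these invariants, the following checks whether a (before, after) label transition is permissible under row deletion, both when the deleted row's relevance is known (as in Section 6) and when a random row is deleted and its relevance is unknown (as in this section). The label strings are assumptions about the data format.

```python
def deletion_allowed(before, after, row_is_relevant):
    # NEUTRAL is unaffected by deletion; deleting a relevant row must yield
    # NEUTRAL; deleting an irrelevant row must keep the original label.
    if before == "NEUTRAL":
        return after == "NEUTRAL"
    return after == ("NEUTRAL" if row_is_relevant else before)

def deletion_allowed_unknown_relevance(before, after):
    # When a random row is deleted, ENTAIL and CONTRADICT may either stay
    # (irrelevant row) or move to NEUTRAL (relevant row), but may not swap
    # with each other; NEUTRAL must stay NEUTRAL.
    allowed = {
        "ENTAIL": {"ENTAIL", "NEUTRAL"},
        "CONTRADICT": {"CONTRADICT", "NEUTRAL"},
        "NEUTRAL": {"NEUTRAL"},
    }
    return after in allowed[before]
```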

Row Insertion. When we insert a row that does not contradict an existing table, the original prediction should be retained in all but rare cases. However, in the uncommon situation when the original model prediction is NEUTRAL and the added row perfectly aligns with the hypothesis, the label may change to ENTAIL or CONTRADICT.1

Singles: The Logical Song; Breakfast in America; Goodbye Stranger; Take the Long Way Home

For example, suppose we add the above row 'Singles' to our example table. All hypotheses except H4 retain their label after insertion, while H4 becomes a CONTRADICT with the new information.

Row Update. When the value associated with a row is updated, either the label should change to CONTRADICT or it should remain the same. To ensure that the update operation is legal, the updated value must be appropriate for the key in question; we therefore select the value from the corresponding row of another random table that has the same key.2 To avoid self-contradictions after updating, we only update values in multi-value rows. A case in point is the multi-value key Genre: if we change pop to hard rock (taken from another table), then the hypothesis H1 is contradicted. However, if the new value is pop as well, H1 stays entailed. For either change, the other hypotheses corresponding to the table retain their original label.3

1 To ensure that the information added is not contradictory to existing rows, we only add rows with new keys instead of adding existing keys with different values.

2 A shift from a CONTRADICT to an ENTAIL prediction is feasible, but with our present automatic updating method it is exceedingly unusual; therefore we consider it a prohibited transition.

3 We do not examine row updates of single values, since this may be thought of as a deletion followed by an insertion, i.e., a composition operation; see Appendix C.

Row Permutation. Since the premises in INFOTABS are infoboxes, the order of the rows should not influence whether a hypothesis is entailed, contradicted or neither. In other words, the label should be invariant to row permutation.

Figure 2: Possible prediction label changes after deletion of a row. Red lines represent impossible transitions; dashed and solid lines represent possible transitions for an irrelevant and a relevant row perturbation respectively. * marks transitions possible with any row perturbation. The numbers on each edge give the percentage (%) of transitions for the α1, α2 and α3 sets, in order.

Results and Analysis. We studied the impact of the four perturbations described above on the α1, α2 and α3 test sets.

Figure 2 shows the label changes after row deletions (performed one at a time) as a directed labeled graph. The nodes in this graph (and subsequent ones) represent the three labels in the task, and the edges denote the transition from the prediction for the original table to that for the modified one. The source node of an edge represents the original prediction, and the end node represents the label after modification. The numbers on an edge represent the percentage of such transitions from the original label to the one after modification. The three numbers on each edge in this figure correspond to the α1, α2 and α3 datasets, respectively.

We see that the model makes invalid transitions (colored red) in all three datasets, as shown in Figure 2. Table 2 summarizes the prohibited transitions by summing them for each original predicted label. Furthermore, the percentage of inconsistencies is higher for originally ENTAIL predictions than for the other two labels, CONTRADICT and NEUTRAL, on average. This is because, after row deletion, many examples incorrectly transition to CONTRADICT rather than NEUTRAL. The opposite trend is observed for the original CONTRADICT predictions. The α2 set has more prohibited transitions than the other two.



Dataset        α1     α2     α3     Average
ENTAIL         5.76   7.26   5.01   6.01
NEUTRAL        4.43   3.91   5.24   4.53
CONTRADICT     3.23   3.70   3.01   3.31
Average        4.47   4.96   4.42   -

Table 2: Percentage (%) of prohibited transitions in the row deletion perturbation.
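A sketch of how such prohibited-transition percentages (Tables 2-5) can be tallied from pairs of predictions, given an `allowed(before, after)` predicate like the one sketched earlier; the data layout is an assumption.

```python
from collections import Counter

def prohibited_rates(pairs, allowed):
    """pairs: iterable of (original prediction, prediction after perturbation)."""
    totals, prohibited = Counter(), Counter()
    for before, after in pairs:
        totals[before] += 1
        if not allowed(before, after):
            prohibited[before] += 1
    # Percentage of prohibited transitions per original predicted label.
    return {label: 100.0 * prohibited[label] / totals[label] for label in totals}
```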

Figure 3: Possible prediction label changes after insertion of a new row. The notation convention is the same as in Figure 2.

Figure 3 shows the label changes after a new row insertion as a directed labeled graph, and the results are summarized in Table 3. Here, for NEUTRAL all transitions are valid, so no invalid transitions exist. Compared to ENTAIL, we observe that CONTRADICT has more invalid transitions on average; substantially more invalid transitions occur from CONTRADICT to ENTAIL than vice versa. Moreover, invalid transitions into NEUTRAL are more frequent from CONTRADICT than from ENTAIL. Here, we also find that row insertion affects the adversarial dataset α2 more than the other two.

Figure 4 shows that the hypothesis label does not change drastically as we update values in multi-value keys such as Genre.

Dataset        α1     α2     α3     Average
ENTAIL         2.81   4.99   2.51   3.44
NEUTRAL        0      0      0      0
CONTRADICT     6.77   6.54   6.35   6.55
Average        3.19   3.84   2.95   -

Table 3: Percentage (%) of prohibited transitions in the row insertion perturbation.

Figure 4: Effect on prediction labels after updating the value of an existing row. The notation convention is the same as in Figure 2.

Dataset        α1     α2     α3     Average
ENTAIL         0.08   0.22   0.12   0.14
NEUTRAL        0.12   0.11   0.09   0.11
CONTRADICT     0.49   0.30   0.19   0.33
Average        0.23   0.21   0.13   -

Table 4: Percentage (%) of prohibited transitions in the row update perturbation.

Figure 5: Effect on prediction labels after permuting the rows. Red lines represent impossible transitions; solid lines represent possible transitions after the perturbation. The rest of the notation convention is the same as in Figure 2.

The notation convention is the same as in Figure 2, and Table 4 summarizes the results for Figure 4. One reason for this is that only one or two values out of many (on average 5-6 in each table row) are used for hypothesis generation. Hence, when we update a sub-value, it often has little to no effect on the row in consideration, and the change may not be perceivable by the model.

Finally, from Figure 5, it is evident that even a simple shuffling of rows, where no relevant information has been tampered with, can affect performance.


Dataset        α1      α2     α3      Average
ENTAIL         9.25    12.22  14.58   12.02
NEUTRAL        7.10    6.81   12.45   8.79
CONTRADICT     11.63   8.76   13.70   11.36
Average        9.34    9.26   13.58   -

Table 5: Percentage (%) of prohibited transitions in the row permutation perturbation.

This indicates that the model incorrectly relies on row positions when making predictions, when in fact the semantics of the table dictate that the rows are to be (largely) treated as a set rather than an ordered list. We summarize the composite invalid transitions from Figure 5 in Table 5. Here, we find that the ENTAIL and CONTRADICT labels are slightly more affected by row permutation on average; α3, which has more rows per table, also shows more violations under row permutation. Furthermore, invalid transitions from ENTAIL to CONTRADICT and vice versa are higher than other transitions. For α2, ENTAIL to CONTRADICT is also higher than for the α1 set.

5 Crowdsourcing Evidence

To further investigate whether the model looks at the correct evidence, we ask human annotators on Amazon Mechanical Turk to mark the relevant rows for each given table and hypothesis. We use the development and test sets (4 × 1,800 pairs) of the INFOTABS dataset for this purpose.4 Each HIT consists of three table-sentence pairs from the same table. Annotators are asked to mark the rows which are relevant to the given hypothesis.5

Since many hypothesis sentences (especially ones with neutral labels) use out-of-table information, we add an additional option of selecting out-of-table (OOT) information, which is marked only when information not present in the table is used in the hypothesis. We do not provide the labels to annotators, to avoid any bias arising from the correlation between the NLI label and row selections.6

4 The templates with detailed instructions and examples for the annotation are provided at https://github.com/utahnlp/table_probing.

5 Note that here we consider each row as an independent option to select or not select, hence making this a multi-label selection problem.

6 One example of such bias is always selecting (or not selecting) OOT depending on the neutral (or non-neutral) label.

Agreement     Range        Percentage (%)
Poor          < 0          0.17
Slight        0.01–0.20    1.53
Fair          0.21–0.40    5.88
Moderate      0.41–0.60    14.60
Substantial   0.61–0.80    24.40
Perfect       0.81–1.00    53.43

Table 6: Percentage of annotated examples in each Fleiss' kappa bucket.

We repeated the task five times for every hypothesis. For every row, we took the most common label (relevant or not) as the ground truth. We also followed standard approaches to improve annotation quality: we employed high-quality master annotators, released examples in batches (50 batches including 2 pilot batches, each with 50 HITs, where each HIT has 1 table with 3 sentences), blocked bad annotators and rewarded bonuses to good ones.7 Furthermore, after every batch, we re-annotated examples with poor consensus, and removed HITs corresponding to 452 pairs (3.6%) and 22 annotators due to poor overall consensus. Our annotation scheme ensures that samples have at least 5 diverse annotations each, as shown in Figure 8 in the appendix.

Inter-annotator Agreement. Table 7 shows inter-annotator agreement via the macro average F1-score w.r.t. the majority for all the annotations, before and after data cleaning.8 Overall, we obtain an agreement macro F1 score of more than 90% across the four sets.
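A minimal sketch of the per-row majority aggregation described above, assuming each annotation is given as the set of row indices a worker marked relevant; the data structures are illustrative, not the actual annotation format.

```python
from collections import Counter

def majority_relevant_rows(annotations, num_rows):
    """annotations: one set of relevant row indices per annotator."""
    votes = Counter(i for marked in annotations for i in marked)
    threshold = len(annotations) / 2
    # A row is gold-relevant if more than half of the annotators marked it.
    return {i for i in range(num_rows) if votes[i] > threshold}

# Example: five annotators on a four-row table.
gold = majority_relevant_rows([{0, 2}, {0}, {0, 2}, {0, 3}, {0, 2}], num_rows=4)
# gold == {0, 2}
```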

Furthermore, we find that only 19% and 25% of examples have precision and recall below 80% and 90% respectively, indicating high coverage and precision for the selection task, as shown in Figure 9 in the appendix. The high average Fleiss' kappa of 0.78 (std: 0.22) also indicates high inter-annotator agreement; a detailed analysis for each kappa bucket is shown in Table 6. Additionally, we found that around 82.4% and 61.84% of examples have at least 3 or 4 annotators, respectively, selecting the exact same rows as relevant, as shown in Figure 10 in the appendix.

Human Bias with Information. Our annotation allows us to ask: were the annotators who wrote the hypotheses in the original data biased towards specific kinds of rows?

7 We used the degree of agreement between an annotator and the majority to identify good and bad annotators. The supplement at https://github.com/utahnlp/table_probing has more details.

8 We ensure that annotators are paid above minimum wage by timing the HITs ourselves. The final cost of the whole annotation was $1,750, including the pilots, re-annotation, bonuses and other expenses.


Cleaning   #anno   Precision   Recall   F1 score
Before     72      87.39       87.86    87.51
After      50      88.86       89.63    89.15

Table 7: Macro average of all annotators' agreement with the majority selection label for the relevant rows.

We found that INFOTABS predominantly has hypotheses which focus more on some rows from the premise table than others; some row keys are completely ignored. Of course, not all keys occur equally in all tables, and therefore we only consider keys that reoccur significantly in the premise tables (ignoring rarely occurring keys). The detailed results are shown in Figure 15 in the appendix. We find that rows with keys such as born, died, years active, label, genres, release date, spouse, and occupation are over-represented in the hypotheses. Rows with keys such as website, birth-name, type, origin, budget, language, edited by, directed by, and cinematography are ignored, and are mostly irrelevant for most hypotheses.

Although INFOTABS discusses the presence of hypothesis bias arising from workers knowingly writing similar hypotheses, it does not discuss the possible bias based on the usage of particular premise rows (keys) and its possible effects on model prediction. We suspect such bias occurs because, during NLI data creation, annotators excessively use keys with numerical and temporal values to create hypotheses, as that makes samples easier and faster to create. One possible approach to handle this data bias would be to force NLI data creation annotators to write hypotheses using randomly selected parts of the premises.

6 Probing by Deleting Relevant Rows

We extend our approach of probing through table perturbation by deleting a relevant row instead of a random one. If a relevant row is deleted, the ENTAIL and CONTRADICT predictions should change to NEUTRAL. This is in contrast to the analysis in Section 4, where even after deleting a random row the model may correctly retain its original label if the deleted row was irrelevant. When only relevant rows are deleted, the ENTAIL to ENTAIL and CONTRADICT to CONTRADICT transitions are prohibited. Figure 6 shows the possible transitions of a label after deletion of a relevant row.

Figure 6: Possible prediction label changes after deletion of relevant rows. Red lines represent impossible transitions; solid lines represent possible transitions after deleting the relevant row. The numbers on each edge give the percentage of transitions for the α1, α2 and α3 sets, in order.

Datasets       α1      α2      α3      Average
ENTAIL         75.41   74.70   77.28   75.79
NEUTRAL        8.39    6.58    7.97    7.64
CONTRADICT     77.02   81.10   77.80   78.64
Average        53.60   54.14   54.35   -

Table 8: Percentage (%) of prohibited transitions after deletion of relevant rows as marked by human annotators.

Results and Analysis. We see that the model often retains its label even after row deletion, when it ideally should change its prediction. Further, we also observe that the model can incorrectly change its prediction when a relevant row is deleted. We summarize the combined prohibited transitions for each label in Table 8. Here, the percentage of prohibited transitions is significantly higher than for random row deletion in Figure 2.9 Notably, the model often retains its original label despite the absence of a relevant row for ENTAIL and CONTRADICT predictions.

There are two reasons for these observations. First, (a) the deleted rows are related to the hypothesis and directly impact the ENTAIL and CONTRADICT predictions, which confuses the model; this contrasts with the previous case, where a row unrelated to the hypothesis could be deleted. Second, (b) relevant rows are significantly fewer than irrelevant ones: most hypotheses relate to only one or two rows, whereas α1 and α2 tables have nine rows and α3 tables have thirteen rows on average. Random row deletion is therefore more likely to select irrelevant rows.

9 The dashed black line in Figure 2 is now a red, i.e., prohibited, transition in Figure 6.


Furthermore, many more prohibited ENTAIL-to-CONTRADICT transitions occur than vice versa. Similarly, after relevant row deletion, more prohibited transitions happen from NEUTRAL to CONTRADICT than to ENTAIL.

Model Evidence Selection Evaluation. To check if the model uses the correct evidence, we use the human-annotated relevant row markings of 4,600 hypothesis-table pairs with ENTAIL and CONTRADICT labels as gold labels for evaluation.10 We collect all the rows in the example pairs whose deletion affects the model's (RoBERTaL) predictions. Since removing these rows changes the prediction of the model, we call them the model-relevant rows and compare them against the human-annotated relevant rows as ground truth.

Overall, across all 4,600 examples, the model on average has around 41% precision and 40.9% recall on the human-annotated relevant rows. Further analysis reveals that the model (a) uses all relevant rows in 27% of cases, (b) uses incorrect rows as evidence in 52% of cases, and (c) is only partially accurate in selecting relevant rows in the remaining 21% of examples. Furthermore, among the 52% of pairs where the model completely misses the relevant rows, in 46% of cases the model pays no attention to rows at all (i.e., row deletion has no effect on the model predictions), which is a significantly high number compared to 1.95% for the human-annotated case.

Furthermore, the model has a 70% overall accuracy on these cases, with 37% coming from wrong evidence rows, 12% from partially accurate evidence rows, and just 21% from looking at all the correct evidence rows. This study shows that even a model with 70% accuracy cannot be trusted to select the correct evidence.
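A sketch of the evidence-selection evaluation described above: a row counts as model-relevant if deleting it changes the model's prediction, and this set is scored against the human-annotated relevant rows. The `predict` wrapper around the RoBERTa NLI model (over the flattened premise) is an assumed helper, not part of the released code.

```python
def model_relevant_rows(predict, title, rows, hypothesis):
    # `predict(title, rows, hypothesis)` is an assumed wrapper returning one of
    # "ENTAIL" / "CONTRADICT" / "NEUTRAL" for a flattened table premise.
    original = predict(title, rows, hypothesis)
    relevant = set()
    for i in range(len(rows)):
        reduced = rows[:i] + rows[i + 1:]
        # A row is model-relevant if removing it changes the prediction.
        if predict(title, reduced, hypothesis) != original:
            relevant.add(i)
    return relevant

def precision_recall(model_rows, gold_rows):
    if not model_rows or not gold_rows:
        return 0.0, 0.0
    overlap = len(model_rows & gold_rows)
    return overlap / len(model_rows), overlap / len(gold_rows)
```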

7 Probing for Annotation Artifacts

Models tend to leverage spurious patterns in hypotheses for their predictions while bypassing the evidence from premises (Poliak et al., 2018; Gururangan et al., 2018; Geva et al., 2019; Gupta et al., 2020, and others). To measure the reliance of our models on annotation artifacts, we alter the hypotheses such that these patterns are uniformly distributed across labels. This creates ambiguous artifacts, and a model which relies heavily on these artifacts will degrade.

10 Here, we did not include the 2,400 NEUTRAL examples and the 200 ambiguous ENTAIL or CONTRADICT examples whose annotation majority indicates zero relevant rows.

To introduce such ambiguity, we perturb the hypotheses in such a manner that their labels change. However, we also perform hypothesis changes that preserve the labels, to affirm the presence and the role of annotation artifacts. We alter the content words in a hypothesis while preserving its overall structure. In particular, we alter the expressions in a hypothesis if they are of one of the following types:

• Named Entities: substitute entities such as Person, Location, Organisation, etc.;

• Numerical Expressions: modify numeric expressions such as weights, percentages, areas, etc.;

• Temporal Expressions: modify dates and times;

• Quantification: introduce, delete or modify quantifiers such as most, many, every, etc.;

• Negation: introduce or delete negation markers; and

• Nominal Modifiers: add nominal phrases or clauses.

The above list is a subset of the reasoning types that the annotators used for hypothesis generation, per Gupta et al. (2020). Although we can easily track these expressions in a hypothesis using tools like entity recognizers and parsers, it is non-trivial to modify them in such a way that the effect of the change on the hypothesis label is known. In some cases, label changes can be deterministically known even with somewhat random changes to the hypothesis. For example, we can convert a hypothesis from ENTAIL to CONTRADICT by replacing a named entity in the hypothesis with a random entity of the same type. At the other extreme, some label changes can only be controlled if the target expression in the hypothesis is correctly aligned with the facts in the premise. Such cases include CONTRADICT to ENTAIL, NEUTRAL to CONTRADICT, and NEUTRAL to ENTAIL, which are difficult without extensive expression-level annotations. Therefore, we chose to annotate these label-flipping perturbations manually.11 We avoid perturbations involving the NEUTRAL label altogether, as they often need table changes as well.

To automatically perturb hypotheses, we leverage both the syntactic structure of the hypothesis and the monotonicity properties of function words like adpositions.

11 Annotation was done by an expert linguist who was well versed in the NLI task.


Adposition   Upward Monotonicity   Downward Monotonicity
over         CONTRADICT            ENTAIL
under        ENTAIL                CONTRADICT
more than    CONTRADICT            ENTAIL
less than    ENTAIL                CONTRADICT
before       ENTAIL                CONTRADICT
after        CONTRADICT            ENTAIL

Table 9: Monotonicity properties of some of the adpositions.

First, we perform a syntactic analysis of the hypothesis to identify named entities and their relations to the title expressions via dependency paths. Based on the entity type, we then either substitute or modify them. Named entities such as person names and locations are mostly substituted with entities of the same type. On the other hand, expressions containing numbers are modified using the monotonicity property of the adpositions (or other function words) governing them in their corresponding syntactic trees.

Given the monotonicity property of an adposition (see Table 9), we can modify its governing numerical expression in a hypothesis in the reverse direction to flip, and in the same direction to preserve, the hypothesis label. Consider the hypothesis in Table 11, which contains an adposition (over) with upward monotonicity. Because of upward monotonicity, we can increase the number of hours mentioned in the hypothesis without altering the label, and we can flip the label by decreasing the number. For label flipping, however, we need a good step size: during our experiments, we observed that a step size of half the actual number often works.
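The sketch below captures this monotonicity-driven perturbation of numeric expressions under one reading of Table 9: for 'over'-style adpositions, raising the number pushes the hypothesis toward CONTRADICT and lowering it toward ENTAIL, with the opposite behavior for 'under'-style adpositions. The function names, the half-value step size, and the target-label interface are illustrative assumptions.

```python
# Raising the number pushes these toward CONTRADICT ("over 3 hrs" -> "over 4.5 hrs").
RAISE_TOWARD_CONTRADICT = {"over", "more than", "after"}
# Lowering the number pushes these toward CONTRADICT ("under 3 hrs" -> "under 1.5 hrs").
LOWER_TOWARD_CONTRADICT = {"under", "less than", "before"}

def perturb_number(value, adposition, target_label):
    """Move `value` by half of itself toward the desired label (ENTAIL or CONTRADICT)."""
    step = value / 2
    raise_it = (
        (adposition in RAISE_TOWARD_CONTRADICT and target_label == "CONTRADICT")
        or (adposition in LOWER_TOWARD_CONTRADICT and target_label == "ENTAIL")
    )
    return value + step if raise_it else value - step

# H5 in Table 11 ("over 3 hrs", gold CONTRADICT): perturb_number(3, "over", "CONTRADICT")
# gives 4.5 (label preserved), while perturb_number(3, "over", "ENTAIL") gives 1.5
# (label flipped, since the actual running time is 125 minutes).
```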

We generated 1,688 and 794 perturbed examples from the α1 set in the label-flipping and label-preserving cases, respectively. We also generated 7,275 and 4,275 such examples from the Train set. Some example perturbations using different types of expressions are listed in Table 10.

Results and Analysis. The results in Table 12 affirm the impact of hypothesis bias, or spurious annotation artifacts, on model prediction. As we speculated above, the introduction of ambiguity in annotation artifacts degrades the model. Whether the model is trained using both hypotheses and premises or just using hypotheses, its performance goes down.

Type                  Perturbed Hypothesis
Nominal Modifier      H1 (E→E): Breakfast in America which was produced by Pert Handerson is a pop album with a length of 46 minutes.
Temporal Expression   H1 (E→C): Breakfast in America is a pop album with a length of 56 minutes.
Negation              H2 (C→E): Breakfast in America was not released towards the end of 1979.
Temporal Expression   H2 (C→C): Breakfast in America was released towards the end of 1989.

Table 10: Examples of different hypothesis perturbations for the hypotheses and premise shown in Figure 1. The notation (X→Y) gives the gold label and the new label after the perturbation (E = ENTAIL, C = CONTRADICT).

Bridesmaids
Running time: 125 minutes

H5: Bridesmaids has a running time of over 3 hrs.

Table 11: H5 contradicts the running time shown in the table.

When labels are flipped by perturbations, we see an average decrease in performance of about 18.08% and 39.06% on the α1 set and the Train set respectively, for both models. On the other hand, the perturbations that preserve the hypothesis label show mixed results. For the α1 set, the performance of the models remains almost the same (a 0.62% improvement), whereas for the training set the performance decreases substantially, by 6.56%. We expected our models to have similar performance on these perturbations, as they are structure-preserving and annotation-artifact-preserving transformations. The results of the hypothesis-only model on the Train set may appear slightly surprising at first. However, given that the model was trained on this dataset, it seems reasonable to assume that it has over-fitted; it is therefore expected to be vulnerable to slight modifications of the examples it was trained on, leading to the huge drop of 26.16%. For the α1 set, in contrast, the drop in performance is much smaller: 2.25%. Overall, these results demonstrate that our models ignore the information content in the hypotheses, and thereby the aligned facts in the premises, and instead rely heavily on spurious patterns in the hypotheses.

8 Probing by Counterfactual Examples

Since INFOTABS is a factual dataset based on Wikipedia, large pre-trained language models such as RoBERTa, which are trained on Wikipedia and other similar data sources, may have already encountered the facts in INFOTABS during training.


Model        Original        Label Preserved   Label Flipped

Train Set (w/o NEUTRAL)
Prem+Hypo    99.44 (0.06)    92.98 (0.20)      53.92 (0.28)
Hypo-Only    96.39 (0.13)    70.23 (0.35)      19.23 (0.27)

α1 Set (w/o NEUTRAL)
Prem+Hypo    68.94 (0.76)    69.56 (0.77)      51.48 (0.86)
Hypo-Only    63.52 (0.75)    60.27 (0.85)      31.02 (0.63)

Table 12: Results of the hypothesis-only model and the full model (trained on hypothesis and premise together) on gold and perturbed examples from the Train and α1 sets. Here, α1 and Train have 967 and 4,274 example pairs respectively. The Label Flipped perturbations are manually annotated by experts, whereas the Label Preserved ones are automated. Each entry gives the mean with the corresponding std. in parentheses, calculated over 100 random 80% splits. Human accuracy on the α1 set is between 88% and 89%, estimated via 3 expert annotators.

As a result, reasoning, for our NLI models built on top of RoBERTaL, boils down to determining whether a hypothesis is supported by the knowledge of the pre-trained language model. More specifically, the model may be relying on "confirmation bias", in which it selects evidence and patterns from both the premise and the hypothesis that match its prior knowledge. To test whether our models are grounded in the evidence provided in the form of tables, we create counterfactual examples: the information in the tabular premises is modified such that the content does not reflect the real world.

While world knowledge is necessary for table NLI (Neeraja et al., 2021), models should still treat premises as the primary evidence. This aids better generalization to unseen data, especially with temporally situated information (e.g., leaders of states). We only consider premise changes related to ENTAIL and CONTRADICT hypotheses. Most NEUTRAL cases in INFOTABS (Gupta et al., 2020) use out-of-table information, for which creating counterfactuals requires the addition of novel rows, making the table completely different.

The task of creating counterfactual tables presents two challenges. First, the modified table should not be self-contradictory. Second, we need to determine the labels of the associated hypotheses after the table is modified, and for that, we need to modify the relevant rows.

We employ a simple solution for these challenges: we locate the relevant rows for a particular premise-hypothesis pair and move them to another table (with a different title) in the same category, along with the hypothesis.

Breakfast in America
Released: 29 December 1979
Recorded: May–December 1978
Studio: The Village Recorder (Studio B) in Los Angeles
Genre: pop ; art rock ; soft rock
Length: 40:06
Label: A&M
Producer: Peter Henderson, Supertramp

Figure 7: A modified tabular premise example (counterfactual values for 'Length' and 'Released') derived from the example premise in Figure 1. Hypothesis H1 changes from ENTAIL to CONTRADICT, H2 changes from CONTRADICT to ENTAIL, and H3 and H4 stay NEUTRAL.

Finally, in the hypothesis, we replace the title with that of the relocated table. Because we shift all relevant rows simultaneously, both concerns of relevance and self-contradiction are addressed.13 In a few situations where the relevant rows were not accessible, we instead changed the table title in the current hypothesis to generate a counterfactual pair.
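A sketch of the automatic label-preserving counterfactual construction described above: the relevant rows are grafted onto another table from the same category and the title in the hypothesis is swapped. Tables are modeled as (title, rows) pairs; the helper name and the key-based de-duplication are illustrative assumptions.

```python
def make_counterfactual(source, target, hypothesis, relevant_keys):
    """source, target: (title, rows) pairs; rows are lists of (key, value)."""
    src_title, src_rows = source
    tgt_title, tgt_rows = target
    moved = [(k, v) for k, v in src_rows if k in relevant_keys]
    # Drop target rows that share a key with the moved ones so the resulting
    # table is not self-contradictory, then graft the relevant rows onto it.
    kept = [(k, v) for k, v in tgt_rows if k not in relevant_keys]
    counterfactual_rows = kept + moved
    # The hypothesis now talks about the target table's title.
    counterfactual_hypothesis = hypothesis.replace(src_title, tgt_title)
    return (tgt_title, counterfactual_rows), counterfactual_hypothesis
```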

However, one issue with these approaches is that they only create counterfactual examples whose label is the same as that of the original example (i.e., Label Preserved). Therefore, in addition to the automatic data with preserved labels, we manually created counterfactual table-hypothesis pairs such that the label is flipped. This is done to ensure that the model does not rely on the hypothesis, or use the table merely as evidence to validate its own knowledge. An example of a counterfactual table is shown in Figure 7: the values of Length and Released are changed such that the label of H1 flips from ENTAIL to CONTRADICT and the label of H2 flips from CONTRADICT to ENTAIL. This label-flipped data is important to ensure that the counterfactual data is devoid of hypothesis artifacts and other spurious correlations.

To generate the Label Flipped counterfactual data, three annotators manually modified tables from the Train and α1 sets for hypotheses with ENTAIL and CONTRADICT labels. The annotators cross-verified the labels against each other to estimate human accuracy, which was 88.45% for Train and 86.57% for the α1 set. In sum, for this setting we ended up with 885 counterfactual examples for the Train set and 942 for the α1 set.

13 Although there may be a few self-contradictory instances, we expect that the inconsistency does not exist in the rows linked to the hypothesis under our method.


The hypothesis-only accuracy is 60.89% and 27.68% for the original and counterfactual data, respectively, after the perturbation for α1. Similarly, for the training set, the hypothesis-only accuracy is 99.94% and 0.06% for the original and counterfactual data after the perturbation.

The results are shown in Table 13. For both datasets, we also analyze pairs of examples that share the same hypothesis, comparing the predictions with the original table and with the counterfactual table as premise, as well as against a hypothesis-only baseline with the original label. Table 14 shows the results for the original and counterfactual pairs. Here ✗ and ✓ represent incorrect and correct predictions respectively. For example, Orig: ✓, Counter: ✗ represents hypotheses which are predicted correctly w.r.t. the original table and incorrectly with the counterfactual table.

Model        Original        Label Preserved   Label Flipped

Train Set (w/o NEUTRAL)
Prem+Hypo    94.38 (0.39)    78.53 (0.65)      48.70 (0.72)
Hypo-Only    99.94 (0.06)    82.23 (0.65)      0.06 (0.01)

α1 Set (w/o NEUTRAL)
Prem+Hypo    71.99 (0.69)    69.65 (0.78)      44.01 (0.72)
Hypo-Only    60.89 (0.76)    58.19 (0.91)      27.68 (0.65)

Table 13: Accuracy of the RoBERTaL model on the original data and its counterfactuals. There are 885 examples for the Train set and 942 examples for the α1 set. The Label Preserved counterfactuals are automatic perturbations, whereas the Label Flipped ones are manually created. Each entry gives the mean with the corresponding std. in parentheses, calculated over 100 random 80% splits.

Results and Analysis. From Table 13, we see that the model does not handle counterfactual information well, and performs close to random prediction (48.70% for Train and 44.01% for α1) in the label flipped setting.

Original   Counterfactual   Train (%)   α1 (%)
✗          ✗                1.92        10.53
✗          ✓                3.57        17.91
✓          ✗                49.36       44.91
✓          ✓                45.15       26.64

Table 14: Percentage of Train and α1 examples, paired by the same hypothesis, broken down by the prediction w.r.t. the original premise table and the counterfactual premise table for the full premise-hypothesis model (w/o NEUTRAL) in the label flipped setting. ✓ represents a correct prediction and ✗ an incorrect prediction. Numbers are computed on the full set to avoid variation.

Original   Counterfactual   Hypo   Train (%)   α1 (%)
✗          ✓                ✗      0.00        11.43
✗          ✓                ✓      3.57        6.48
✓          ✗                ✓      49.36       33.12
✓          ✗                ✗      0.00        11.79

Table 15: Percentage of Train and α1 examples, paired by the same hypothesis, broken down by the prediction w.r.t. the original premise table, the counterfactual premise table, and the hypothesis-only model (Hypo) with the original label (w/o NEUTRAL), in the label flipped setting. ✓ represents a correct prediction and ✗ an incorrect prediction. Numbers are computed on the full set to avoid variation.

The performance is better in the label preserved setting; we conjecture that the model exploits hypothesis artifacts, even though premise artifacts and pre-trained knowledge do not help there. Surprisingly, this drop is larger for the Train set (15.85%) than for α1 (just 2.70%). This shows that the model not only memorizes hypothesis artifacts, but also memorizes several premise artifacts and pre-trained knowledge to make its entailment judgments.

Furthermore, as shown in Table 14, for the label flipped data, the examples where both the original and the counterfactual predictions are correct are limited to 26.64% in α1. The percentage of examples where the original prediction is incorrect but the counterfactual prediction is correct is 17.91% on the α1 set and 3.57% on the Train set. In contrast, the opposite case is substantially more frequent: 44.91% on the α1 set and 49.36% on the Train set.

Furthermore, in Table 15, we analyze the cases where the original prediction is correct, the counterfactual prediction is incorrect, and the error can be attributed to hypothesis bias, i.e., the hypothesis-only baseline is also correct: this accounts for 33.12% (out of 44.91%) on the α1 set and 49.36% (out of 49.36%) on the Train set, indicating that hypothesis bias is the primary reason for these predictions. In the opposite setting, where the original prediction is incorrect and the counterfactual one is correct, we get 6.48% (out of 17.91%) on the α1 set and 3.57% (out of 3.57%) on the Train set; the remaining 11.43% of correct counterfactual predictions on α1 are not explained by the hypothesis-only baseline, which also indicates that the model relies more on pre-trained knowledge for α1.

Overall, we show that the model relies on hypothesis artifacts and on its pre-trained knowledge for making predictions. Thus, one needs to evaluate models on counterfactual information to ensure robustness.


9 Discussion and Related Work

What did we learn? First, through systematic probing, we have shown that a neural model, despite good performance on evaluation sets, fails to perform correct reasoning in a large percentage of cases. The model does not look at the correct evidence required for reasoning, as is evident from the evidence selection probing. Rather, it leverages spurious correlations to make predictions. A recent study by Lewis et al. (2021) on question answering shows that models indeed leverage spurious patterns for answering a whopping 60-70% of questions.

Second, from our analysis of hypothesis perturbations, we show that the model relies heavily on correlations between sentence structure and label, or between certain rows and the label, or on artifacts in the data (for example, the fact that several sentences with similar syntactic structure and root words carry a particular label). Therefore, existing models should be evaluated on adversarial sets like α2 for robustness and sensitivity. This observation has also been drawn by multiple studies that probe deep learning models on adversarial examples in a variety of tasks such as question answering, sentiment analysis, document classification, natural language inference, etc. (Ribeiro et al., 2020; Richardson et al., 2020; Goel et al., 2021; Lewis et al., 2021; Tarunesh et al., 2021).

Lastly, from our counterfactual probes, we found that the model relies more on the pre-trained knowledge of MLMs than on tabular evidence as the primary source of knowledge for prediction. We found that the model uses premise memory acquired during training and the knowledge of pre-trained MLMs on the counterfactual evaluation data, instead of grounding its predictions solely in the given premise. This is in addition to the spurious patterns or hypothesis artifacts leveraged by the model. Similar observations are made by Clark and Etzioni (2016); Jia and Liang (2017); Kaushik et al. (2019); Huang et al. (2020); Gardner et al. (2020); Tu et al. (2020); Liu et al. (2021); Zhang et al. (2021); Wang et al. (2021b) for unstructured data.

Benefit of Tabular Data. Unlike unstructured data (Ribeiro et al., 2020; Goel et al., 2021; Mishra et al., 2021; Yin et al., 2021), we have demonstrated that we can analyze semi-structured tabular data effectively. Although connected through the title, the rows in the table are still mutually exclusive, linguistically and otherwise. One can treat each row as an independent and exclusive source of information. Thus, controlled experiments are easier to design and study.

For example, the analysis done for evidence selection via multiple table perturbation operations, such as row deletion and update, is possible mainly due to the tabular nature of the data. Compared to raw text, this independent granularity is mostly absent at the token, sentence and even paragraph level. Even where such granularity can be located, designing suitable probes around it is a challenging task. Thus, more effort should be put into using semi-structured data for model evaluation. Such approaches can be applied to other datasets such as WikiTableQA (Pasupat and Liang, 2015), TabFact (Chen et al., 2019), HybridQA (Chen et al., 2020b; Zayats et al., 2021; Oguz et al., 2020), OpenTableQA (Chen et al., 2021), ToTTo (Parikh et al., 2020) and Turing Tables (Yoran et al., 2021), i.e., table-to-text generation tasks, and LogicTable (Chen et al., 2020a), as well as to recently proposed tabular reasoning models such as TAPAS (Müller et al., 2021; Herzig et al., 2020), TaBERT (Yin et al., 2020), TABBIE (Iida et al., 2021), TabGCN (Pramanick and Bhattacharya, 2021) and RCI (Glass et al., 2021).

Interpretability for NLI Models. Our analysis shows that, for classification tasks such as natural language inference, correct predictions do not imply correct model reasoning. More effort is needed to either make the model more interpretable or at least generate explanations for its predictions (Feng et al., 2018; Serrano and Smith, 2019; Jain and Wallace, 2019; Wiegreffe and Pinter, 2019; DeYoung et al., 2020; Paranjape et al., 2020). Investigation is needed into how the model arrives at a particular prediction: is it looking at the correct evidence? Furthermore, NLI models should be subjected to multiple test sets with adversarial settings (e.g., Ribeiro et al., 2016, 2018a,b; Zhao et al., 2018; Iyyer et al., 2018; Glockner et al., 2018; Naik et al., 2018; McCoy et al., 2019; Nie et al., 2019; Liu et al., 2019a) focusing on particular properties or aspects of reasoning, such as perturbed premises for evidence selection, zero-shot transfer (α3), counterfactual premises or alternate facts, and contrasting hypotheses via perturbation (α2). Model performance should be checked in all these settings to identify problems and weaknesses.


Such behavioral probing with evaluation on multiple test benchmarks (e.g., Ribeiro et al., 2020; Tarunesh et al., 2021) and controlled probes (e.g., Hewitt and Liang, 2019; Richardson and Sabharwal, 2020; Ravichander et al., 2021) is essential to estimate the actual reasoning ability of large pre-trained language models. Recently, several shared tasks based on semi-structured table reasoning (such as SemEval 2021 Task 9 (Wang et al., 2021a) and FEVEROUS (Aly et al., 2021)) have highlighted the importance of, and the challenges associated with, evidence extraction for claim verification over both unstructured and semi-structured data.

10 Conclusion

In this paper, we show how targeted probing that focuses on invariant properties can highlight the limitations of tabular reasoning models, through a case study on a tabular inference task on INFOTABS. We showed that an existing pre-trained language model such as RoBERTa_L, despite good performance on standard splits, fails at selecting correct evidence, predicts incorrectly on adversarial hypotheses requiring similar reasoning, and does not ground itself in the provided counterfactual evidence for the task of tabular reasoning. Our work shows the importance of such probes in evaluating the reasoning ability of pre-trained language models for tabular reasoning. While our results are for a particular task and a single model, we believe that our analysis and findings transfer to other datasets and models. Therefore, in the future, we plan to expand our probing to other tasks such as tabular question answering and tabular sentence generation. Furthermore, we plan to use the insights from the probing tasks to design rationale selection techniques based on structural constraints for tabular inference and other tasks. We also hope that our work encourages the research community to evaluate models more often on contrast and counterfactual adversarial test sets.

References

Rami Aly, Zhijiang Guo, Michael Schlichtkrull, James Thorne, Andreas Vlachos, Christos Christodoulopoulos, Oana Cocarascu, and Arpit Mittal. 2021. FEVEROUS: Fact extraction and verification over unstructured and structured information. arXiv preprint arXiv:2106.05707.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A Large Annotated Corpus for Learning Natural Language Inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing.

Wenhu Chen, Ming-Wei Chang, Eva Schlinger, William Wang, and William W Cohen. 2021. Open question answering over tables and text. Proceedings of ICLR 2021.

Wenhu Chen, Jianshu Chen, Yu Su, Zhiyu Chen, and William Yang Wang. 2020a. Logical natural language generation from open-domain tables. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7929–7942.

Wenhu Chen, Hongmin Wang, Jianshu Chen, Yunkai Zhang, Hong Wang, Shiyang Li, Xiyou Zhou, and William Yang Wang. 2019. TabFact: A large-scale dataset for table-based fact verification. In International Conference on Learning Representations.

Wenhu Chen, Hanwen Zha, Zhiyu Chen, Wenhan Xiong, Hong Wang, and William Wang. 2020b. HybridQA: A dataset of multi-hop question answering over tabular and textual data. Findings of EMNLP 2020.

Peter Clark and Oren Etzioni. 2016. My Computer Is an Honor Student — but How Intelligent Is It? Standardized Tests as a Measure of AI. AI Magazine, 37(1):5–12.

Ido Dagan, Dan Roth, Mark Sammons, and Fabio Massimo Zanzotto. 2013. Recognizing textual entailment: Models and applications. Synthesis Lectures on Human Language Technologies, 6(4):1–220.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

Jay DeYoung, Sarthak Jain, Nazneen Fatema Rajani, Eric Lehman, Caiming Xiong, Richard Socher, and Byron C. Wallace. 2020. ERASER: A benchmark to evaluate rationalized NLP models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4443–4458, Online. Association for Computational Linguistics.

Julian Eisenschlos, Syrine Krichene, and Thomas Mueller. 2020. Understanding tables with intermediate pre-training. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pages 281–296.

Shi Feng, Eric Wallace, Alvin Grissom II, Mohit Iyyer, Pedro Rodriguez, and Jordan Boyd-Graber. 2018. Pathologies of neural models make interpretations difficult. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3719–3728.


Matt Gardner, Yoav Artzi, Victoria Basmov, Jonathan Berant, Ben Bogin, Sihao Chen, Pradeep Dasigi, Dheeru Dua, Yanai Elazar, Ananth Gottumukkala, et al. 2020. Evaluating models' local decision boundaries via contrast sets. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pages 1307–1323.

Mor Geva, Yoav Goldberg, and Jonathan Berant. 2019. Are we modeling the task or the annotator? An investigation of annotator bias in natural language understanding datasets. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1161–1166.

Michael Glass, Mustafa Canim, Alfio Gliozzo, Saneem Chemmengath, Vishwajeet Kumar, Rishav Chakravarti, Avi Sil, Feifei Pan, Samarth Bharadwaj, and Nicolas Rodolfo Fauceglia. 2021. Capturing row and column semantics in transformer based question answering over tables. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1212–1224, Online. Association for Computational Linguistics.

Max Glockner, Vered Shwartz, and Yoav Goldberg. 2018. Breaking NLI systems with sentences that require simple lexical inferences. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 650–655, Melbourne, Australia. Association for Computational Linguistics.

Karan Goel, Nazneen Fatema Rajani, Jesse Vig, Zachary Taschdjian, Mohit Bansal, and Christopher Ré. 2021. Robustness Gym: Unifying the NLP evaluation landscape. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Demonstrations, pages 42–55.

Vivek Gupta, Maitrey Mehta, Pegah Nokhiz, and Vivek Srikumar. 2020. INFOTABS: Inference on tables as semi-structured data. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2309–2324, Online. Association for Computational Linguistics.

Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel Bowman, and Noah A Smith. 2018. Annotation artifacts in natural language inference data. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 107–112.

Jonathan Herzig, Pawel Krzysztof Nowak, Thomas Müller, Francesco Piccinno, and Julian Eisenschlos. 2020. TaPas: Weakly supervised table parsing via pre-training. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4320–4333, Online. Association for Computational Linguistics.

John Hewitt and Percy Liang. 2019. Designing and interpreting probes with control tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2733–2743.

William Huang, Haokun Liu, and Samuel Bowman. 2020. Counterfactually-augmented SNLI training data does not yield better generalization than unaugmented data. In Proceedings of the First Workshop on Insights from Negative Results in NLP, pages 82–87.

Hiroshi Iida, Dung Thai, Varun Manjunatha, and Mohit Iyyer. 2021. TABBIE: Pretrained representations of tabular data. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3446–3456, Online. Association for Computational Linguistics.

Mohit Iyyer, John Wieting, Kevin Gimpel, and Luke Zettlemoyer. 2018. Adversarial example generation with syntactically controlled paraphrase networks. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1875–1885.

Sarthak Jain and Byron C Wallace. 2019. Attention is not explanation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3543–3556.

Robin Jia and Percy Liang. 2017. Adversarial examples for evaluating reading comprehension systems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2021–2031, Copenhagen, Denmark. Association for Computational Linguistics.

Divyansh Kaushik, Eduard Hovy, and Zachary Lipton. 2019. Learning the difference that makes a difference with counterfactually-augmented data. In International Conference on Learning Representations.

Patrick Lewis, Pontus Stenetorp, and Sebastian Riedel. 2021. Question and answer test-train overlap in open-domain question answering datasets. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 1000–1008.

Leo Z Liu, Yizhong Wang, Jungo Kasai, Hannaneh Hajishirzi, and Noah A Smith. 2021. Probing across time: What does RoBERTa know and when? arXiv preprint arXiv:2104.07885.


Nelson F Liu, Roy Schwartz, and Noah A Smith. 2019a. Inoculation by fine-tuning: A method for analyzing challenge datasets. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2171–2179.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019b. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692.

Tom McCoy, Ellie Pavlick, and Tal Linzen. 2019. Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3428–3448.

Anshuman Mishra, Dhruvesh Patel, Aparna Vijayakumar, Xiang Lorraine Li, Pavan Kapanipathi, and Kartik Talamadupula. 2021. Looking beyond sentence-level natural language inference for question answering and text summarization. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1322–1336, Online. Association for Computational Linguistics.

Thomas Müller, Julian Martin Eisenschlos, and Syrine Krichene. 2021. TaPas at SemEval-2021 Task 9: Reasoning over tables with intermediate pre-training. arXiv preprint arXiv:2104.01099.

Aakanksha Naik, Abhilasha Ravichander, Norman Sadeh, Carolyn Rose, and Graham Neubig. 2018. Stress test evaluation for natural language inference. In Proceedings of the 27th International Conference on Computational Linguistics, pages 2340–2353.

J. Neeraja, Vivek Gupta, and Vivek Srikumar. 2021. Incorporating external knowledge to enhance tabular reasoning. In Proceedings of the 2021 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Online. Association for Computational Linguistics.

Yixin Nie, Yicheng Wang, and Mohit Bansal. 2019. Analyzing compositionality-sensitivity of NLI models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6867–6874.

Barlas Oguz, Xilun Chen, Vladimir Karpukhin, Stan Peshterliev, Dmytro Okhonko, Michael Schlichtkrull, Sonal Gupta, Yashar Mehdad, and Scott Yih. 2020. Unified open-domain question answering with structured and unstructured knowledge. arXiv preprint arXiv:2012.14610.

Bhargavi Paranjape, Mandar Joshi, John Thickstun, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2020. An information bottleneck approach for controlling conciseness in rationale extraction. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1938–1952.

Ankur P Parikh, Xuezhi Wang, Sebastian Gehrmann, Manaal Faruqui, Bhuwan Dhingra, Diyi Yang, and Dipanjan Das. 2020. ToTTo: A controlled table-to-text generation dataset. In Proceedings of EMNLP.

Panupong Pasupat and Percy Liang. 2015. Compositional semantic parsing on semi-structured tables. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1470–1480.

Adam Poliak, Jason Naradowsky, Aparajita Haldar, Rachel Rudinger, and Benjamin Van Durme. 2018. Hypothesis only baselines in natural language inference. NAACL HLT 2018, page 180.

Aniket Pramanick and Indrajit Bhattacharya. 2021. Joint learning of representations for web-tables, entities and types using graph convolutional network. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 1197–1206.

Abhilasha Ravichander, Yonatan Belinkov, and Eduard Hovy. 2021. Probing the probing paradigm: Does probing accuracy entail task relevance? In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 3363–3377.

Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why should I trust you?" Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144.

Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2018a. Anchors: High-precision model-agnostic explanations. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32.

Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2018b. Semantically equivalent adversarial rules for debugging NLP models. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 856–865.

Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. 2020. Beyond accuracy: Behavioral testing of NLP models with CheckList. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4902–4912.

Kyle Richardson, Hai Hu, Lawrence Moss, and Ashish Sabharwal. 2020. Probing natural language inference models through semantic fragments. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 8713–8721.


Kyle Richardson and Ashish Sabharwal. 2020. What does my QA model know? Devising controlled probes using expert knowledge. Transactions of the Association for Computational Linguistics, 8:572–588.

Nancy Xin Ru Wang, Diwakar Mahajan, Marina Danilevsky, and Sara Rosenthal. 2021. SemEval-2021 Task 9: Fact verification and evidence finding for tabular data in scientific documents (SEM-TAB-FACTS). Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021).

Sofia Serrano and Noah A Smith. 2019. Is attention interpretable? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2931–2951.

Ishan Tarunesh, Somak Aditya, and Monojit Choudhury. 2021. Trusting RoBERTa over BERT: Insights from checklisting the natural language inference task. arXiv preprint arXiv:2107.07229.

Lifu Tu, Garima Lalwani, Spandana Gella, and He He. 2020. An empirical study on robustness to spurious correlations using pre-trained language models. Transactions of the Association for Computational Linguistics, 8:621–633.

Nancy XR Wang, Diwakar Mahajan, Marina Danilevsky, Sara Rosenthal, et al. 2021a. SemEval-2021 Task 9: Fact verification and evidence finding for tabular data in scientific documents (SEM-TAB-FACTS). arXiv preprint arXiv:2105.13995.

Siyuan Wang, Wanjun Zhong, Duyu Tang, Zhongyu Wei, Zhihao Fan, Daxin Jiang, Ming Zhou, and Nan Duan. 2021b. Logic-driven context extension and data augmentation for logical reasoning of text. arXiv preprint arXiv:2105.03659.

Sarah Wiegreffe and Yuval Pinter. 2019. Attention is not not explanation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 11–20.

Pengcheng Yin, Graham Neubig, Wen-tau Yih, and Sebastian Riedel. 2020. TaBERT: Pretraining for joint understanding of textual and tabular data. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8413–8426, Online. Association for Computational Linguistics.

Wenpeng Yin, Dragomir Radev, and Caiming Xiong. 2021. DocNLI: A large-scale dataset for document-level natural language inference. arXiv preprint arXiv:2106.09449.

Ori Yoran, Alon Talmor, and Jonathan Berant. 2021. Turning tables: Generating examples from semi-structured tables for endowing language models with reasoning skills. arXiv preprint arXiv:2107.07261.

Vicky Zayats, Kristina Toutanova, and Mari Ostendorf. 2021. Representations for question answering from documents with tables and text. arXiv preprint arXiv:2101.10573.

Chong Zhang, Jieyu Zhao, Huan Zhang, Kai-Wei Chang, and Cho-Jui Hsieh. 2021. Double Perturbation: On the Robustness of Robustness and Counterfactual Bias Evaluation. In Proceedings of NAACL-HLT.

Zhengli Zhao, Dheeru Dua, and Sameer Singh. 2018. Generating natural adversarial examples. In International Conference on Learning Representations.

A Details: Inter-Annotator Agreement

Table 16 shows details of the inter-annotator agreement for the relevant-row annotation task. We obtain an average F1 score of 89.0% (macro) and 88.6% (micro) across all our experiments.

Type         Prec   Recall   F1 score
macro-avg    88.8   91.0     89.0
micro-avg    88.2   89.0     88.6

Table 16: Final inter-annotator agreement, averaged over all example pairs, for the majority selection label with respect to all annotations for the pair.
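For concreteness, the sketch below shows one way such macro- and micro-averaged precision/recall/F1 scores could be computed, assuming each individual annotation and the majority reference are sets of relevant-row indices; this is an illustration, not necessarily the exact protocol used here.

# Illustrative sketch: macro- vs. micro-averaged precision/recall/F1 of
# individual row annotations against a majority reference.
# The data format (sets of row indices) is an assumption.

def prf(tp: int, fp: int, fn: int):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def agreement(annotations, references):
    """annotations, references: lists of sets of relevant-row indices."""
    per_example, totals = [], [0, 0, 0]
    for ann, ref in zip(annotations, references):
        tp, fp, fn = len(ann & ref), len(ann - ref), len(ref - ann)
        per_example.append(prf(tp, fp, fn))
        totals[0] += tp; totals[1] += fp; totals[2] += fn
    macro = tuple(sum(x) / len(per_example) for x in zip(*per_example))
    micro = prf(*totals)            # pooled counts over all examples
    return macro, micro

# toy usage
anns = [{0, 1}, {2}, {1, 3}]
refs = [{0, 1}, {2, 4}, {1}]
print(agreement(anns, refs))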

Figure 8: Percentage of example pairs vs. number of diverse annotations for the relevant rows. (Bar chart; x-axis: No. of Annotators, 4–10; y-axis: Examples (%), 0–100.)

Figure 9 shows fine-grained agreement, i.e., the percentage of examples with varying precision and recall. Figure 8 shows the percentage of examples by the number of annotations (distinct annotators). Figure 10 shows the percentage of example pairs (y-axis) vs. the number of annotations with the exact same labeling for the relevant rows (x-axis).

B Deletion of Irrelevant Row

Below, we show the operation of deleting an irrelevant row in Figure 11, with a summary in Table 17. NEUTRAL has the fewest improbable transitions, and the α3 set is the most affected by irrelevant-row deletion. This demonstrates that even irrelevant rows affect the model prediction (which should not be the case), indicating that the model is focusing on incorrect evidence for its reasoning.

Figure 9: Fine-grained agreement analysis showing the percentage of examples with a given precision and recall.

Figure 10: Percentage of example pairs (y-axis) vs. number of annotations with the exact same labeling for the relevant rows (x-axis). (Bar chart; x-axis bins: ≥8, ≥7, ≥6, ≥5, ≥4, ≥3; y-axis: Percentage of Examples (%), 0–100.)

Datasets      α1     α2     α3     Average
ENTAIL        5.14   6.97   6.09   6.07
NEUTRAL       3.90   3.54   5.01   4.15
CONTRADICT    5.94   5.09   6.91   5.98
Average       4.99   5.20   6.01   -

Table 17: Percentage (%) of prohibited transitions after deleting rows that humans did not mark as relevant (irrelevant-row deletion).
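To make the prohibited-transition measurement concrete, here is a minimal sketch of counting label changes that should not occur after deleting a row humans marked as irrelevant; the data format and function name are assumptions for illustration.

# Illustrative sketch: percentage of prohibited label transitions after
# deleting an irrelevant row. Since the deleted row is irrelevant to the
# hypothesis, any change in the predicted label counts as prohibited.

def prohibited_transition_rate(preds_before, preds_after):
    """preds_before/preds_after: lists of labels ('E', 'N', 'C') for the
    same examples, predicted on the original and perturbed tables."""
    changed = sum(b != a for b, a in zip(preds_before, preds_after))
    return 100.0 * changed / len(preds_before)

# toy usage: 1 of 5 predictions flips after an irrelevant-row deletion
before = ["E", "C", "N", "E", "C"]
after  = ["E", "C", "N", "C", "C"]
print(f"{prohibited_transition_rate(before, after):.1f}% prohibited")  # 20.0%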

C Composition Operation

Figure 11: Possible prediction label changes after the deletion of an irrelevant row. Red lines represent impossible transitions; full lines represent possible transitions after the row deletion. The label numbers on each line give the transition percentages for the α1, α2, and α3 sets, in order. (Transition diagram over the labels E, N, C.)

Figure 13 demonstrates what happens when a composition operation is used, such as multiple deletes, multiple inserts, multiple substitutes (value updates), or any other combination of delete, insert, and substitute (value update). Interestingly, the possible transitions have a symmetric property: they are order and repetition invariant as long as the final table change is the same. Therefore, operations may be combined in any form and order, and the resulting transition can be derived by composing the transition functions of just the distinct operation types involved. Furthermore, all sequences involving the same operator collapse into a single transition.
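To illustrate the order- and repetition-invariance claim, here is a minimal sketch that collapses an arbitrary sequence of operations into its set of distinct operation types and composes per-operation transition relations; the specific transition relations below are hypothetical placeholders, and the whole sketch is an illustration of the idea rather than the implementation used here.

# Illustrative sketch: a sequence of perturbation operations
# (D = delete, I = insert, S = substitute/value update) collapses to the
# set of distinct operation types; allowed label transitions are obtained
# by composing per-operation transition relations (relations below are
# hypothetical placeholders).

LABELS = ["E", "N", "C"]

ALLOWED = {
    "D": {"E": {"E", "N"}, "N": {"N"}, "C": {"C", "N"}},            # delete a row
    "I": {"E": {"E"}, "N": {"N", "E", "C"}, "C": {"C"}},            # insert a row
    "S": {"E": {"E", "N", "C"}, "N": {"N"}, "C": {"C", "N", "E"}},  # update a value
}

def compose(rel_a, rel_b):
    """Relation composition: label -> labels reachable via rel_a then rel_b."""
    return {x: {z for y in rel_a[x] for z in rel_b[y]} for x in LABELS}

def allowed_for_sequence(ops):
    """Collapse a sequence like 'DDSD' to its distinct types and compose."""
    distinct = sorted(set(ops))        # order and repetition invariant
    rel = {x: {x} for x in LABELS}     # start from the identity relation
    for op in distinct:
        rel = compose(rel, ALLOWED[op])
    return rel

# Order and repetition invariance: DDDS, DDS, SDD, DSD, DS, SD all agree.
seqs = ["DDDS", "DDS", "SDD", "DSD", "DS", "SD"]
assert all(allowed_for_sequence(s) == allowed_for_sequence("DS") for s in seqs)
print(allowed_for_sequence("DS"))

The invariance here is built in by reducing every sequence to its set of distinct operation types, which mirrors the DDDS = DDS = SDD = DSD = DS = SD equivalence stated in Figure 13.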

Below, we show the composition operation of deleting one row followed by inserting another row in Figure 12, with a summary in Table 18. Here, we observe that improbable transitions from ENTAIL to CONTRADICT are highest for the α2 set, followed by α3. For the opposite case, i.e., CONTRADICT to ENTAIL, they are highest for the α1 set, followed by α2. Also, on average, we observe that the invalid transitions for CONTRADICT are twice as many as for ENTAIL.

Datasets      α1     α2     α3     Average
ENTAIL        3.02   6.53   4.16   4.57
NEUTRAL       0.00   0.00   0.00   0.00
CONTRADICT    9.81   7.88   6.71   8.13
Average       4.28   4.80   3.63   -

Table 18: Percentage (%) of prohibited transitions after row deletion followed by row insertion, i.e., the composition operation.


Figure 12: Possible prediction label changes after a deletion followed by an insert operation. Notation as in Figure 11.

D Multi Row Reasoning

We also analyze the proportion of annotated examples that use more than one row for a given hypothesis, i.e., multiple-row reasoning. As shown in Figure 14, around 54% and 37% of examples have only one or two rows relevant to the hypothesis, respectively. Furthermore, we find that annotators mark OOT mostly for the neutral labels, i.e., 71% compared to 5% for the combined ENTAIL and CONTRADICT labels (footnote 14). We also find that a negligible fraction (< 0.25%) of examples have zero relevant rows; we suspect this is because of annotation ambiguity or the hypothesis being a universal factual statement. Overall, our annotation analysis verifies the claim made by INFOTABS (Gupta et al., 2020) in their reasoning analysis.

Figure 13: Composition operations. For three or more operations the symmetric property holds, i.e., DDDS = DDS = SDD = DSD = DS = SD: the result is order and repetition invariant.

14 The marking of OOT for ENTAIL and CONTRADICT is mostly attributed to unexpected use of common sense or external knowledge.

Figure 14: Percentage of examples where multiple rows are relevant for a given hypothesis, for ENTAIL, CONTRADICT, NEUTRAL, and in total. It also shows the percentage of examples where Out-of-Table (OOT) information is used. (Four bar-chart panels: total (E, N, C), Entail, Neutral, Contradiction; x-axis: No. of Rows, 0–8; y-axis: No. of Examples; legend: Total, OOT.)

E Human Bias in Annotation

Figure 15: Figure depicting the human bias problem with the annotations for the ENTAIL, CONTRADICT, and NEUTRAL labels. Here, green and red circlings represent overused and underused keys for hypothesis generation, respectively. To account for the varying appearance frequency of keys in the tables, we only considered keys with frequency ≥ 180 for a fair comparison.

Figure 15 shows the human bias problem with the INFOTABS annotations. For the ENTAIL and CONTRADICT labels, row keys such as born, died, years active, label, genres, release date, spouse, and occupation are overused. Similarly, for the NEUTRAL label, row keys such as born, associated act, and spouse are overused. On the other hand, for the ENTAIL, NEUTRAL, and CONTRADICT labels, row keys such as website, birth name, type, origin, budget, language, edited by, directed by, running time, nationality, and cinematography are underused despite frequently appearing in the tables. Some keys, such as years active, label, genres, release date, spouse, and cinematography, are underused only for the NEUTRAL label. In all the analyses, we only consider row keys that are frequent, i.e., appear ≥ 180 times in the training set.
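As a rough illustration of how such over- and underuse of keys could be quantified, the sketch below compares how often a key is referenced by hypotheses of a given label against how often the key appears in tables, restricted to keys appearing at least 180 times; the usage/appearance comparison and the 1.5x / 0.5x cutoffs are assumptions, with only the ≥ 180 threshold taken from the analysis above.

# Illustrative sketch: flag over- and underused table keys per label by
# comparing hypothesis usage against table appearance frequency.
# The ratio-based criterion and 1.5x / 0.5x cutoffs are hypothetical;
# only the >= 180 appearance threshold comes from the analysis above.

from collections import Counter

def key_bias(key_appearances: Counter, key_usage_by_label: dict, min_freq=180):
    report = {}
    total_appear = sum(key_appearances.values()) or 1
    for label, usage in key_usage_by_label.items():
        total_usage = sum(usage.values()) or 1
        over, under = [], []
        for key, appear in key_appearances.items():
            if appear < min_freq:
                continue  # only frequent keys, for a fair comparison
            expected = appear / total_appear             # share among table keys
            observed = usage.get(key, 0) / total_usage   # share among hypothesis keys
            if observed > 1.5 * expected:
                over.append(key)
            elif observed < 0.5 * expected:
                under.append(key)
        report[label] = {"overused": over, "underused": under}
    return report

# toy usage with made-up counts
appear = Counter({"born": 400, "website": 300, "spouse": 250})
usage = {"ENTAIL": Counter({"born": 120, "spouse": 60, "website": 5})}
print(key_bias(appear, usage))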