
EMNLP 2018

Fact Extraction and VERification

Proceedings of the First Workshop

November 1, 2018, Brussels, Belgium


We thank our sponsor Amazon Research in Cambridge for their generous support.

©2018 The Association for Computational Linguistics

Order copies of this and other ACL proceedings from:

Association for Computational Linguistics (ACL)
209 N. Eighth Street
Stroudsburg, PA 18360
USA
Tel: +1-570-476-8006
Fax: +1-570-476-0860
[email protected]

ISBN 978-1-948087-73-5



Introduction

With billions of individual pages on the web providing information on almost every conceivable topic, we should have the ability to collect facts that answer almost every conceivable question. However, only a small fraction of this information is contained in structured sources (Wikidata, Freebase, etc.) – we are therefore limited by our ability to transform free-form text to structured knowledge. There is, however, another problem that has become the focus of a lot of recent research and media coverage: false information coming from unreliable sources.

In an effort to jointly address both problems, we proposed this workshop to promote research in joint Fact Extraction and VERification (FEVER). We aim for FEVER to be a long-term venue for work in verifiable knowledge extraction.

To stimulate progress in this direction, we also hosted the FEVER Challenge, an information verification shared task on a purposely-constructed dataset. We received entries from 23 competing teams, 19 of which scored higher than the previously published baseline. We invited descriptions of the participating systems and we received 15 system descriptions, all of which are included in these proceedings. We offered the top 4 systems oral presentations.

For the main workshop, we received 23 submissions, out of which we accepted 14 (3 oral presentations and 11 posters).


Organizers:

James Thorne (University of Sheffield)
Andreas Vlachos (University of Sheffield)
Oana Cocarascu (Imperial College London)
Christos Christodoulopoulos (Amazon)
Arpit Mittal (Amazon)

Program Committee:

Nikolaos Aletras (University of Sheffield), Fernando Alva-Manchego (University of Sheffield), Isabelle Augenstein (University of Copenhagen), Esma Balkir (University of Edinburgh), Daniele Bonadiman (University of Trento), Matko Bošnjak (University College London), Kris Cao (University of Cambridge), Tuhin Chakrabarty (Columbia University), Weiwei Cheng (Amazon), Bich-Ngoc Do (Heidelberg University), Micha Elsner (The Ohio State University), Diego Esteves (Universität Bonn), Fréderic Godin (ELIS - IDLab, Ghent University), Ivan Habernal (UKP Lab, Technische Universität Darmstadt), Andreas Hanselowski (UKP Lab, Technische Universität Darmstadt), Christopher Hidey (Columbia University), Julia Hockenmaier (University of Illinois Urbana-Champaign), Alexandre Klementiev (Amazon Development Center Germany), Jan Kowollik (University of Duisburg-Essen), Anjishnu Kumar (Amazon), Nayeon Lee (Hong Kong University of Science and Technology), Pranava Swaroop Madhyastha (University of Sheffield), Christopher Malon (NEC Laboratories America), Marie-Catherine de Marneffe (The Ohio State University), Stephen Mayhew (University of Pennsylvania), Marie-Francine Moens (KU Leuven), Jason Naradowsky (University College London), Yixin Nie (UNC), Farhad Nooralahzadeh (University of Oslo), Wolfgang Otto (GESIS – Leibniz-Institute for the Social Sciences in Cologne), Ankur Padia (University of Maryland, Baltimore County), Mithun Paul (University of Arizona), Tamara Polajnar (University of Cambridge), Hoifung Poon (Microsoft Research), Preethi Raghavan (IBM Research TJ Watson), Marek Rei (University of Cambridge), Laura Rimell (DeepMind), Tim Rocktäschel (University College London and Facebook AI Research), Jodi Schneider (UIUC), Claudia Schulz (UKP Lab, Technische Universität Darmstadt), Diarmuid Ó Séaghdha (Apple), Sameer Singh (University of California, Irvine), Kevin Small (Amazon), Christian Stab (UKP Lab, Technische Universität Darmstadt), Motoki Taniguchi (Fuji Xerox), Paolo Torroni (Alma Mater - Università di Bologna), Serena Villata (Université Côte d'Azur, CNRS, Inria, I3S), Zeerak Waseem (University of Sheffield), Noah Weber (Stony Brook University), Johannes Welbl (University College London), Menglin Xia (University of Cambridge), Takuma Yoneda (Toyota Technological Institute)

Invited Speakers:

Luna Dong (Amazon)
Marie-Francine Moens (KU Leuven)
Delip Rao (Joostware AI Research, Johns Hopkins University)
Tim Rocktäschel (Facebook AI Research, University College London)


Table of Contents

The Fact Extraction and VERification (FEVER) Shared Task
James Thorne, Andreas Vlachos, Oana Cocarascu, Christos Christodoulopoulos and Arpit Mittal . . . . . 1

The Data Challenge in Misinformation Detection: Source Reputation vs. Content Veracity
Fatemeh Torabi Asr and Maite Taboada . . . . . 10

Crowdsourcing Semantic Label Propagation in Relation Classification
Anca Dumitrache, Lora Aroyo and Chris Welty . . . . . 16

Retrieve and Re-rank: A Simple and Effective IR Approach to Simple Question Answering over Knowledge Graphs
Vishal Gupta, Manoj Chinnakotla and Manish Shrivastava . . . . . 22

Information Nutrition Labels: A Plugin for Online News Evaluation
Vincentius Kevin, Birte Högden, Claudia Schwenger, Ali Sahan, Neelu Madan, Piush Aggarwal, Anusha Bangaru, Farid Muradov and Ahmet Aker . . . . . 28

Joint Modeling for Query Expansion and Information Extraction with Reinforcement Learning
Motoki Taniguchi, Yasuhide Miura and Tomoko Ohkuma . . . . . 34

Towards Automatic Fake News Detection: Cross-Level Stance Detection in News Articles
Costanza Conforti, Mohammad Taher Pilehvar and Nigel Collier . . . . . 40

Belittling the Source: Trustworthiness Indicators to Obfuscate Fake News on the Web
Diego Esteves, Aniketh Janardhan Reddy, Piyush Chawla and Jens Lehmann . . . . . 50

Automated Fact-Checking of Claims in Argumentative Parliamentary Debates
Nona Naderi and Graeme Hirst . . . . . 60

Stance Detection in Fake News A Combined Feature Representation
Bilal Ghanem, Paolo Rosso and Francisco Rangel . . . . . 66

Zero-shot Relation Classification as Textual Entailment
Abiola Obamuyide and Andreas Vlachos . . . . . 72

Teaching Syntax by Adversarial Distraction
Juho Kim, Christopher Malon and Asim Kadav . . . . . 79

Where is Your Evidence: Improving Fact-checking by Justification Modeling
Tariq Alhindi, Savvas Petridis and Smaranda Muresan . . . . . 85

Affordance Extraction and Inference based on Semantic Role Labeling
Daniel Loureiro and Alípio Jorge . . . . . 91

UCL Machine Reading Group: Four Factor Framework For Fact Finding (HexaF)
Takuma Yoneda, Jeff Mitchell, Johannes Welbl, Pontus Stenetorp and Sebastian Riedel . . . . . 97

Multi-Sentence Textual Entailment for Claim Verification
Andreas Hanselowski, Hao Zhang, Zile Li, Daniil Sorokin, Benjamin Schiller, Claudia Schulz and Iryna Gurevych . . . . . 103

Team Papelo: Transformer Networks at FEVER
Christopher Malon . . . . . 109


Uni-DUE Student Team: Tackling fact checking through decomposable attention neural network
Jan Kowollik and Ahmet Aker . . . . . 114

SIRIUS-LTG: An Entity Linking Approach to Fact Extraction and Verification
Farhad Nooralahzadeh and Lilja Øvrelid . . . . . 119

Integrating Entity Linking and Evidence Ranking for Fact Extraction and Verification
Motoki Taniguchi, Tomoki Taniguchi, Takumi Takahashi, Yasuhide Miura and Tomoko Ohkuma . . . . . 124

Robust Document Retrieval and Individual Evidence Modeling for Fact Extraction and Verification.
Tuhin Chakrabarty, Tariq Alhindi and Smaranda Muresan . . . . . 127

DeFactoNLP: Fact Verification using Entity Recognition, TFIDF Vector Comparison and Decomposable Attention
Aniketh Janardhan Reddy, Gil Rocha and Diego Esteves . . . . . 132

An End-to-End Multi-task Learning Model for Fact Checking
sizhen li, Shuai Zhao, Bo Cheng and Hao Yang . . . . . 138

Team GESIS Cologne: An all in all sentence-based approach for FEVER
Wolfgang Otto . . . . . 145

Team SWEEPer: Joint Sentence Extraction and Fact Checking with Pointer Networks
Christopher Hidey and Mona Diab . . . . . 150

QED: A fact verification system for the FEVER shared task
Jackson Luken, Nanjiang Jiang and Marie-Catherine de Marneffe . . . . . 156

Team UMBC-FEVER: Claim verification using Semantic Lexical Resources
Ankur Padia, Francis Ferraro and Tim Finin . . . . . 161

A mostly unlexicalized model for recognizing textual entailment
Mithun Paul, Rebecca Sharp and Mihai Surdeanu . . . . . 166


Conference Program

Thursday, November 1, 2018

09:00–09:15 Welcome Talk
Organizers

09:15–10:00 Invited Talk: Learning with Explanations
Tim Rocktäschel

10:00–10:30 Research Talks 1

10:00–10:15 The Data Challenge in Misinformation Detection: Source Reputation vs. Content Veracity
Fatemeh Torabi Asr and Maite Taboada

10:15–10:30 Towards Automated Factchecking: Developing an Annotation Schema and Benchmark for Consistent Automated Claim Detection
Lev Konstantinovskiy, Oliver Price, Mevan Babakar and Arkaitz Zubiaga

10:30–11:30 Research Posters

Crowdsourcing Semantic Label Propagation in Relation Classification
Anca Dumitrache, Lora Aroyo and Chris Welty

Retrieve and Re-rank: A Simple and Effective IR Approach to Simple Question Answering over Knowledge Graphs
Vishal Gupta, Manoj Chinnakotla and Manish Shrivastava

Information Nutrition Labels: A Plugin for Online News Evaluation
Vincentius Kevin, Birte Högden, Claudia Schwenger, Ali Sahan, Neelu Madan, Piush Aggarwal, Anusha Bangaru, Farid Muradov and Ahmet Aker

Joint Modeling for Query Expansion and Information Extraction with Reinforcement Learning
Motoki Taniguchi, Yasuhide Miura and Tomoko Ohkuma

Towards Automatic Fake News Detection: Cross-Level Stance Detection in News Articles
Costanza Conforti, Mohammad Taher Pilehvar and Nigel Collier

Belittling the Source: Trustworthiness Indicators to Obfuscate Fake News on the Web
Diego Esteves, Aniketh Janardhan Reddy, Piyush Chawla and Jens Lehmann


Thursday, November 1, 2018 (continued)

Automated Fact-Checking of Claims in Argumentative Parliamentary Debates
Nona Naderi and Graeme Hirst

Stance Detection in Fake News A Combined Feature Representation
Bilal Ghanem, Paolo Rosso and Francisco Rangel

Zero-shot Relation Classification as Textual Entailment
Abiola Obamuyide and Andreas Vlachos

Teaching Syntax by Adversarial Distraction
Juho Kim, Christopher Malon and Asim Kadav

Where is Your Evidence: Improving Fact-checking by Justification Modeling
Tariq Alhindi, Savvas Petridis and Smaranda Muresan

11:30–12:15 Invited Talk: Argumentation Mining and Generation Supporting the Verification of Content
Marie-Francine Moens

12:15–12:30 Research Talks 2

12:15–12:30 Affordance Extraction and Inference based on Semantic Role Labeling
Daniel Loureiro and Alípio Jorge

14:00–14:45 Invited Talk: Building a broad knowledge graph for products
Luna Dong


Thursday, November 1, 2018 (continued)

14:45–15:30 Shared Task Flash Talks

14:45–14:50 The Fact Extraction and VERification (FEVER) Shared Task
Organizers

14:50–15:00 Combining Fact Extraction and Claim Verification in an NLI Model
Yixin Nie, Haonan Chen and Mohit Bansal

15:00–15:10 UCL Machine Reading Group: Four Factor Framework For Fact Finding (HexaF)
Takuma Yoneda, Jeff Mitchell, Johannes Welbl, Pontus Stenetorp and Sebastian Riedel

15:10–15:20 Multi-Sentence Textual Entailment for Claim Verification
Andreas Hanselowski, Hao Zhang, Zile Li, Daniil Sorokin, Benjamin Schiller, Claudia Schulz and Iryna Gurevych

15:20–15:30 Team Papelo: Transformer Networks at FEVER
Christopher Malon

15:30–16:30 Shared Task Posters

Uni-DUE Student Team: Tackling fact checking through decomposable attention neural network
Jan Kowollik and Ahmet Aker

SIRIUS-LTG: An Entity Linking Approach to Fact Extraction and Verification
Farhad Nooralahzadeh and Lilja Øvrelid

Integrating Entity Linking and Evidence Ranking for Fact Extraction and Verification
Motoki Taniguchi, Tomoki Taniguchi, Takumi Takahashi, Yasuhide Miura and Tomoko Ohkuma

Robust Document Retrieval and Individual Evidence Modeling for Fact Extraction and Verification.
Tuhin Chakrabarty, Tariq Alhindi and Smaranda Muresan

DeFactoNLP: Fact Verification using Entity Recognition, TFIDF Vector Comparison and Decomposable Attention
Aniketh Janardhan Reddy, Gil Rocha and Diego Esteves


Thursday, November 1, 2018 (continued)

An End-to-End Multi-task Learning Model for Fact Checking
sizhen li, Shuai Zhao, Bo Cheng and Hao Yang

Team GESIS Cologne: An all in all sentence-based approach for FEVER
Wolfgang Otto

Team SWEEPer: Joint Sentence Extraction and Fact Checking with Pointer Networks
Christopher Hidey and Mona Diab

QED: A fact verification system for the FEVER shared task
Jackson Luken, Nanjiang Jiang and Marie-Catherine de Marneffe

Team UMBC-FEVER: Claim verification using Semantic Lexical Resources
Ankur Padia, Francis Ferraro and Tim Finin

A mostly unlexicalized model for recognizing textual entailment
Mithun Paul, Rebecca Sharp and Mihai Surdeanu

16:30–17:15 Invited Talk: Call for Help: Putting Computation in Computational Fact Checking
Delip Rao

17:15–17:30 Prizes + Closing Remarks
Organizers


Proceedings of the First Workshop on Fact Extraction and VERification (FEVER), pages 1–9, Brussels, Belgium, November 1, 2018. ©2018 Association for Computational Linguistics

The Fact Extraction and VERification (FEVER) Shared Task

James Thorne1, Andreas Vlachos1, Oana Cocarascu2, Christos Christodoulopoulos3, and Arpit Mittal3

1Department of Computer Science, University of Sheffield
2Department of Computing, Imperial College London
3Amazon Research Cambridge
{j.thorne, a.vlachos}@sheffield.ac.uk
[email protected], [email protected]

Abstract

We present the results of the first Fact Extraction and VERification (FEVER) Shared Task. The task challenged participants to classify whether human-written factoid claims could be SUPPORTED or REFUTED using evidence retrieved from Wikipedia. We received entries from 23 competing teams, 19 of which scored higher than the previously published baseline. The best performing system achieved a FEVER score of 64.21%. In this paper, we present the results of the shared task and a summary of the systems, highlighting commonalities and innovations among participating systems.

1 Introduction

Information extraction is a well studied domain and the outputs of such systems enable many natural language technologies such as question answering and text summarization. However, since information sources can contain errors, there exists an additional need to verify whether the information is correct. For this purpose, we hosted the first Fact Extraction and VERification (FEVER) shared task to raise interest in and awareness of the task of automatic information verification, a research domain that is orthogonal to information extraction. This shared task required participants to develop systems to predict the veracity of human-generated textual claims against textual evidence to be retrieved from Wikipedia.

We constructed a purpose-built dataset for this task (Thorne et al., 2018) that contains 185,445 human-generated claims, manually verified against the introductory sections of Wikipedia pages and labeled as SUPPORTED, REFUTED or NOTENOUGHINFO. The claims were generated by paraphrasing facts from Wikipedia and mutating them in a variety of ways, some of which were meaning-altering. For each claim, and without the knowledge of where the claim was generated from, annotators selected evidence in the form of sentences from Wikipedia to justify the labeling of the claim.

The systems participating in the FEVER shared task were required to label claims with the correct class and also return the sentence(s) forming the necessary evidence for the assigned label. Performing well at this task requires both identifying relevant evidence and reasoning correctly with respect to the claim. A key difference between this task and other textual entailment and natural language inference tasks (Dagan et al., 2009; Bowman et al., 2015) is the need to identify the evidence from a large textual corpus. Furthermore, in comparison to large-scale question answering tasks (Chen et al., 2017), systems must reason about information that is not present in the claim. We hope that research in these fields will be stimulated by the challenges present in FEVER.

One of the limitations of using human annotators to identify correct evidence when constructing the dataset was the trade-off between annotation velocity and evidence recall (Thorne et al., 2018). Evidence selected by annotators was often incomplete. As part of the FEVER shared task, any evidence retrieved by participating systems that was not contained in the original dataset was annotated and used to augment the evidence in the test set.

In this paper, we present a short description of the task and dataset, present a summary of the submissions and the leaderboard, and highlight future research directions.

2 Task Description

Candidate systems for the FEVER shared task were given a sentence of unknown veracity called a claim. The systems must identify suitable evidence from Wikipedia at the sentence level and assign a label indicating whether, given the evidence, the claim is SUPPORTED or REFUTED, or whether there is NOTENOUGHINFO in Wikipedia to reach a conclusion. In 16.82% of cases, claims required the combination of more than one sentence as supporting or refuting evidence. An example is provided in Figure 1.

Claim: The Rodney King riots took place in the most populous county in the USA.

[wiki/Los Angeles Riots] The 1992 Los Angeles riots, also known as the Rodney King riots, were a series of riots, lootings, arsons, and civil disturbances that occurred in Los Angeles County, California in April and May 1992.

[wiki/Los Angeles County] Los Angeles County, officially the County of Los Angeles, is the most populous county in the USA.

Verdict: Supported

Figure 1: Example claim from the FEVER shared task: a manually verified claim that requires evidence from multiple Wikipedia pages.

2.1 Data

Training and development data was released through the FEVER website.1 We used the reserved portion of the data presented in Thorne et al. (2018) as a blind test set. Disjoint training, development and test splits of the dataset were generated by splitting the dataset by the page used to generate the claim. The development and test datasets were balanced by randomly discarding claims from the more populous classes.

Split      SUPPORTED   REFUTED   NEI
Training   80,035      29,775    35,639
Dev        6,666       6,666     6,666
Test       6,666       6,666     6,666

Table 1: Dataset split sizes for the SUPPORTED, REFUTED and NOTENOUGHINFO (NEI) classes.

1 http://fever.ai
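The released data is distributed as one JSON object per line. Purely as an illustration, the following sketch shows one way to load a split and count its class distribution; the file name and field names ("claim", "label", "evidence") are assumptions based on the public FEVER release, not something specified in this paper.

```python
import json
from collections import Counter

def load_claims(path):
    """Load a FEVER-style JSONL file into a list of dicts.

    Assumed fields per line (check your copy of the data):
    "id", "claim", "label", and "evidence" as a list of evidence sets.
    """
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

claims = load_claims("train.jsonl")  # hypothetical local path
print(Counter(claim["label"] for claim in claims))
```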

2.2 Scoring Metric

We used the scoring metric described in Thorne et al. (2018) to evaluate the submissions. The FEVER shared task requires submission of evidence to justify the labeling of a claim. The training, development and test data splits contain multiple sets of evidence for each claim, each set being a minimal set of sentences that fully support or refute it. The primary scoring metric for the task is the label accuracy conditioned on providing at least one complete set of evidence, referred to as the FEVER score. Claims labeled (correctly) as NOTENOUGHINFO do not require evidence. Correctly labeled claims with no or only partial evidence received no points for the FEVER score. Where multiple sets of evidence were annotated in the data, only one set was required for the claim to be considered correct for the FEVER score.

Since the development and evaluation data splits are balanced, random baseline label accuracy ignoring the requirement for evidence is 33.33%. This performance level can also be achieved for the FEVER score by predicting NOTENOUGHINFO for every claim. However, as the FEVER score requires evidence for SUPPORTED and REFUTED claims, a random baseline is expected to score lower on this metric.

We provide an open-source release of the scoring software.2 Beyond the FEVER score, it computes precision, recall, F1, and label accuracy to provide diagnostic information. The recall point is awarded, as is the case for the FEVER score, only by providing a complete set of evidence for the claim.
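The official scorer linked in footnote 2 should be used for evaluation; purely as an illustration of the definition above, the following sketch awards a point only when the predicted label is correct and, for SUPPORTED and REFUTED claims, the predicted evidence fully covers at least one annotated evidence set. The label strings and field names are assumptions, not the official implementation.

```python
def fever_score(instances):
    """Toy FEVER score over instances with assumed fields:
    'label', 'predicted_label', 'evidence' (list of gold sets of
    (page, sentence_id) pairs), 'predicted_evidence' (list of pairs)."""
    correct = 0
    for inst in instances:
        if inst["predicted_label"] != inst["label"]:
            continue
        if inst["label"] == "NOT ENOUGH INFO":
            correct += 1  # no evidence required for NEI claims
            continue
        predicted = set(map(tuple, inst["predicted_evidence"]))
        # at least one complete gold evidence set must be fully retrieved
        if any(set(map(tuple, gold)) <= predicted for gold in inst["evidence"]):
            correct += 1
    return correct / len(instances)
```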

2.3 Submissions

The FEVER shared task was hosted as a competition on Codalab3 which allowed submissions to be scored against the blind test set without the need to publish the correct labels. The scoring system was open from 24th to 27th July 2018. Participants were limited to 10 submissions (max. 2 per day).4

2 The scorer, test cases and examples can be found in the following GitHub repository: https://github.com/sheffieldnlp/fever-scorer

3 https://competitions.codalab.org/competitions/18814

4 An extra half-day was given as an artifact of the competition closing at midnight Pacific time.


Rank  Team Name                    Evidence Precision (%)  Evidence Recall (%)  Evidence F1 (%)  Label Accuracy (%)  FEVER Score (%)
1     UNC-NLP                      42.27                   70.91                52.96            68.21               64.21
2     UCL Machine Reading Group    22.16                   82.84                34.97            67.62               62.52
3     Athene UKP TU Darmstadt      23.61                   85.19                36.97            65.46               61.58
4     Papelo                       92.18                   50.02                64.85            61.08               57.36
5     SWEEPer                      18.48                   75.39                29.69            59.72               49.94
6     Columbia NLP                 23.02                   75.89                35.33            57.45               49.06
7     Ohio State University        77.23                   47.12                58.53            50.12               43.42
8     GESIS Cologne                12.09                   51.69                19.60            54.15               40.77
9     FujiXerox                    11.37                   29.99                16.49            47.13               38.81
10    withdrawn                    46.60                   51.94                49.12            51.25               38.59
11    Uni-DuE Student Team         50.65                   36.02                42.10            50.02               38.50
12    Directed Acyclic Graph       51.91                   36.36                42.77            51.36               38.33
13    withdrawn                    12.90                   54.58                20.87            53.97               37.13
14    Py.ro                        21.15                   49.38                29.62            43.48               36.58
15    SIRIUS-LTG-UIO               19.19                   70.82                30.19            48.87               36.55
16    withdrawn                    0.00                    0.01                 0.00             33.45               30.20
17    BUPT-NLPer                   45.18                   35.45                39.73            45.37               29.22
18    withdrawn                    23.75                   86.07                37.22            33.33               28.67
19    withdrawn                    7.69                    32.11                12.41            50.80               28.40
20    FEVER Baseline               11.28                   47.87                18.26            48.84               27.45
21    withdrawn                    49.01                   29.66                36.95            44.89               23.76
22    withdrawn                    26.81                   12.08                16.65            57.32               22.89
23    withdrawn                    26.33                   12.20                16.68            55.42               21.71
24    University of Arizona        11.28                   47.87                18.26            36.94               19.00

Table 2: Results on the test dataset.

3 Participants and Results

86 submissions (excluding the baseline) were made to Codalab for scoring on the blind test set. There were 23 different teams which participated in the task (presented in Table 2). 19 of these teams scored higher than the baseline presented in Thorne et al. (2018). All participating teams were invited to submit a description of their systems. We received 15 descriptions at the time of writing and the remaining are considered as withdrawn. The system with the highest score was submitted by UNC-NLP (FEVER score: 64.21%).

Most participants followed a similar pipeline structure to the baseline model. This consisted of three stages: document selection, sentence selection and natural language inference. However, some teams constructed models to jointly select sentences and perform inference in a single pipeline step, while others added an additional step, discarding inconsistent evidence after performing inference.
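As a rough illustration of this shared structure (not any particular team's implementation), the pipeline can be sketched as follows, with the three callables standing in for the team-specific components.

```python
def verify(claim, wiki, retrieve_documents, select_sentences, infer_label):
    """Generic three-stage FEVER pipeline sketch: document selection,
    sentence selection, natural language inference.  The callables are
    placeholders for whatever models a given team used."""
    pages = retrieve_documents(claim, wiki)        # e.g. NER + search API
    evidence = select_sentences(claim, pages)      # e.g. similarity or a classifier
    label = infer_label(claim, evidence)           # SUPPORTED / REFUTED / NEI
    return label, evidence
```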

Based on the team-submitted system description summaries (Appendix A), in the following section we present an overview of which models and techniques were applied to the task and their relative performance.

4 Analysis

4.1 Document Selection

A large number of teams report a multi-step approach to document selection. The majority of submissions report extracting some combination of Named Entities, Noun Phrases and Capitalized Expressions from the claim. These were used either as inputs to a search API (i.e. Wikipedia Search or Google Search), a search server (e.g. Lucene5 or Solr6) or as keywords for matching against Wikipedia page titles or article bodies. BUPT-NLPer report using S-MART for entity linking (Yang and Chang, 2015) and the highest scoring team, UNC-NLP, report using page viewership statistics to rank the candidate pages.

5 http://lucene.apache.org/
6 http://lucene.apache.org/solr/


This approach cleverly exploits a bias in the dataset construction, as the most visited pages were sampled for claim generation. GESIS Cologne report directly selecting sentences using the Solr search, bypassing the need to perform document retrieval as a separate step.

The team which scored highest on evidence precision and evidence F1 was Papelo (precision = 92.18% and F1 = 64.85%), who report using a combination of TF-IDF for document retrieval and string matching using named entities and capitalized expressions.

The teams which scored highest on evidence recall were Athene UKP TU Darmstadt (recall = 85.19%) and UCL Machine Reading Group (recall = 82.84%).7 8 Athene report extracting noun phrases from the claim and using these to query the Wikipedia search API. A similar approach was used by Columbia NLP who query the Wikipedia search API using named entities extracted from the claim as a query string, all the text before the first lowercase verb phrase as a query string, and also combine this result with Wikipedia pages identified with Google search using the entire claim. UCL Machine Reading Group report a document retrieval approach that identifies Wikipedia article titles within the claim and ranks the results using features such as capitalization, sentence position and token match.
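As an illustration of this search-API style of document retrieval (a sketch only, not a reproduction of any submitted system; it assumes spaCy for noun-phrase extraction and the public MediaWiki search endpoint):

```python
import requests
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

def candidate_pages(claim, per_phrase=3):
    """Extract noun phrases from a claim and query the Wikipedia search
    API for each phrase, collecting candidate page titles."""
    phrases = {chunk.text for chunk in nlp(claim).noun_chunks}
    titles = []
    for phrase in phrases:
        resp = requests.get(
            "https://en.wikipedia.org/w/api.php",
            params={"action": "query", "list": "search", "srsearch": phrase,
                    "srlimit": per_phrase, "format": "json"},
            timeout=10,
        ).json()
        titles.extend(hit["title"] for hit in resp["query"]["search"])
    return titles
```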

4.2 Sentence Selection

There were three common approaches to sentence selection: keyword matching, supervised classification and sentence similarity scoring. Ohio State and UCL Machine Reading Group report using keyword matching techniques: matching either named entities or tokens appearing in both the claim and article body. UNC-NLP, Athene UKP TU Darmstadt and Columbia NLP modeled the task as supervised binary classification, using architectures such as Enhanced LSTM (Chen et al., 2016), Decomposable Attention (Parikh et al., 2016) or similar to them. SWEEPer and BUPT-NLPer present jointly learned models for sentence selection and natural language inference.

7 The withdrawn team that ranked 18th on F1 score had the highest recall: 86.07%. A system description was not submitted by this team, preventing us from including it in our analysis.

8 The scores for precision, recall and F1 were computed independent of the label accuracy and FEVER score.

Other teams report scoring based on sentence similarity using Word Mover's Distance (Kusner et al., 2015) or cosine similarity over smooth inverse frequency weightings (Arora et al., 2017), ELMo embeddings (Peters et al., 2018) and TF-IDF (Salton et al., 1983). UCL Machine Reading Group and Directed Acyclic Graph report an additional aggregation stage after the classification stage in the pipeline where evidence that is inconsistent is discarded.
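A minimal sketch of the similarity-scoring variant using TF-IDF cosine similarity (one of the options listed above; not any specific team's code):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_sentences(claim, sentences, top_k=5):
    """Rank candidate sentences by TF-IDF cosine similarity to the claim
    and return the top_k most similar ones with their scores."""
    vectorizer = TfidfVectorizer().fit(sentences + [claim])
    scores = cosine_similarity(vectorizer.transform([claim]),
                               vectorizer.transform(sentences))[0]
    return sorted(zip(sentences, scores), key=lambda pair: -pair[1])[:top_k]
```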

4.3 Natural Language Inference

NLI was modeled as supervised classification in all reported submissions. We compare and discuss the approaches for combining the evidence sentences together with the claim, sentence representations and training schemes. While many different approaches were used for sentence pair classification, e.g. Enhanced LSTM (Chen et al., 2016), Decomposable Attention (Parikh et al., 2016), the Transformer model (Radford and Salimans, 2018), Random Forests (Svetnik et al., 2003) and ensembles thereof, these are not specific to the task and it is difficult to assess their impact due to the differences in the processing preceding this stage.

Evidence Combination: UNC-NLP (the highest scoring team) concatenate the evidence sentences into a single string for classification; UCL Machine Reading Group classify each evidence-claim pair individually and aggregate the results using a simple multilayer perceptron (MLP); Columbia NLP perform majority voting; and finally, Athene-UKP TU Darmstadt encode each evidence-claim pair individually using an Enhanced LSTM, pool the resulting vectors and use an MLP for classification.
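Two of these combination strategies can be sketched as follows; nli_model is a hypothetical placeholder for whichever sentence-pair classifier a team used, taking a (premise, claim) pair and returning a label.

```python
from collections import Counter

def classify_concatenated(claim, evidence_sentences, nli_model):
    """Concatenation strategy: join all evidence into one premise and
    classify the (premise, claim) pair once."""
    return nli_model(" ".join(evidence_sentences), claim)

def classify_by_vote(claim, evidence_sentences, nli_model):
    """Per-pair strategy: classify each evidence-claim pair separately
    and take the majority vote over the predicted labels."""
    labels = [nli_model(sentence, claim) for sentence in evidence_sentences]
    return Counter(labels).most_common(1)[0][0]
```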

Sentence Representation: University of Arizona explore using non-lexical features for predicting entailment, considering the proportion of negated verbs, presence of antonyms and noun overlap. Columbia NLP learn universal sentence representations (Conneau et al., 2017). UNC-NLP include an additional token-level feature, the sentence similarity score from the sentence selection module. Both Ohio State and UNC-NLP report alternative token encodings: UNC-NLP report using ELMo (Peters et al., 2018) and WordNet (Miller, 1995) and Ohio State report using vector representations of named entities. FujiXerox report representing sentences using DEISTE (Yin et al., 2018).

Training: BUPT-NLPer and SWEEPer model the evidence selection and claim verification using a multi-task learning model under the hypothesis that information from each task supplements the other. SWEEPer also report parameter tuning using reinforcement learning.

5 Additional Annotation

As mentioned in the introduction, to increase the evidence coverage in the test set, the evidence submitted by participating systems was annotated by shared task volunteers after the competition ended. There were 18,846 claims where at least one system returned an incorrect label, according to the FEVER score, i.e. taking evidence into account. These claims were sampled for annotation with a probability proportional to the number of systems which labeled each of them incorrectly.

The evidence sentences returned by each system for each claim were sampled further with a probability proportional to the system's FEVER score, in an attempt to focus annotation efforts towards higher quality candidate evidence. These extra annotations were performed by volunteers from the teams participating in the shared task and three of the organizers. Annotators were asked to label whether the retrieved evidence sentences supported or refuted the claim in question, and to highlight which sentences (if any), either individually or as a group, can be used as evidence. We retained the annotation guidelines from Thorne et al. (2018) (see Sections A.7.1, A.7.3 and A.8 from that paper for more details).

At the time of writing, 1,003 annotations were collected for 618 claims. This identified 3 claims that were incorrectly labeled as SUPPORTED or REFUTED and 87 claims that were originally labeled as NOTENOUGHINFO that should be relabeled as SUPPORTED or REFUTED through the introduction of new evidence (44 and 43 claims respectively). 308 new evidence sets were identified for claims originally labeled as SUPPORTED or REFUTED, consisting of 280 single sentences and 28 sets of 2 or more sentences.

Further annotation is in progress, and the data collected as well as the final results will be made public at the workshop.

6 Conclusions

The first Fact Extraction and VERification shared task attracted 86 submissions from 23 teams. 19 of these teams exceeded the score of the baseline presented in Thorne et al. (2018). For the teams which provided a system description, we highlighted the approaches, identifying commonalities and features that could be further explored.

Future work will address limitations in human-annotated evidence and explore other subtasks needed to predict the veracity of information extracted from untrusted sources.

Acknowledgements

The work reported was partly conducted while James Thorne was at Amazon Research Cambridge. Andreas Vlachos is supported by the EU H2020 SUMMA project (grant agreement number 688139).

References

Sanjeev Arora, Yingyu Liang, and Tengyu Ma. 2017. A simple but tough to beat baseline for sentence embeddings. In ICLR, pages 1–14.

Sean Baird, Doug Sibley, and Yuxi Pan. 2017. Cisco's Talos Intelligence Group Blog: Talos Targets Disinformation with Fake News Challenge Victory.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing.

Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to Answer Open-Domain Questions. In Proceedings of ACL 2017.

Qian Chen, Xiaodan Zhu, Zhenhua Ling, Si Wei, Hui Jiang, and Diana Inkpen. 2016. Enhanced LSTM for Natural Language Inference. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pages 1657–1668, Vancouver, Canada.

Alexis Conneau, Douwe Kiela, Holger Schwenk, Loic Barrault, and Antoine Bordes. 2017. Supervised Learning of Universal Sentence Representations from Natural Language Inference Data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 670–680, Copenhagen, Denmark.

Ido Dagan, Bill Dolan, Bernardo Magnini, and Dan Roth. 2009. Recognizing textual entailment: Rational, evaluation and approaches. Natural Language Engineering, 15(4):i–xvii.

Tushar Khot, Ashish Sabharwal, and Peter Clark. 2018. SCITAIL: A Textual Entailment Dataset from Science Question Answering. In AAAI.

Matt J. Kusner, Yu Sun, Nicholas I. Kolkin, and Kilian Q. Weinberger. 2015. From Word Embeddings To Document Distances. Proceedings of The 32nd International Conference on Machine Learning, 37:957–966.

George A. Miller. 1995. WordNet: a lexical database for English. Communications of the ACM, 38(11):39–41.

Ankur P. Parikh, Oscar Tackstrom, Dipanjan Das, and Jakob Uszkoreit. 2016. A Decomposable Attention Model for Natural Language Inference. Pages 2249–2255.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of NAACL-HLT, pages 2227–2237, New Orleans, Louisiana.

Alec Radford and Tim Salimans. 2018. Improving Language Understanding by Generative Pre-Training. arXiv, pages 1–12.

Gerard Salton, Edward A. Fox, and Harry Wu. 1983. Extended boolean information retrieval. Communications of the ACM, 26(11):1022–1036.

Vladimir Svetnik, Andy Liaw, Christopher Tong, J. Christopher Culberson, Robert P. Sheridan, and Bradley P. Feuston. 2003. Random forest: a classification and regression tool for compound classification and QSAR modeling. Journal of Chemical Information and Computer Sciences, 43(6):1947–1958.

James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018. FEVER: a large-scale dataset for fact extraction and verification. In NAACL-HLT.

Yi Yang and Ming-Wei Chang. 2015. S-MART: Novel Tree-based Structured Learning Algorithms Applied to Tweet Entity Linking. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, pages 504–513, Beijing, China.

Wenpeng Yin, Hinrich Schutze, and Dan Roth. 2018. End-Task Oriented Textual Entailment via Deep Explorations of Inter-Sentence Interactions. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Short Papers), pages 540–545, Melbourne, Australia.

A Short System Descriptions Submitted by Participants

A.1 UNC-NLP

Our system is composed of three connected components, namely a document retriever, a sentence selector, and a claim verifier. The document retriever chooses candidate wiki-documents via matching of keywords between the claims and the wiki-document titles, also using external page-view frequency statistics for wiki-page ranking. The sentence selector is a sequence-matching neural network that conducts further fine-grained selection of evidential sentences by comparing the given claim with all the sentences in the candidate documents. This module is trained as a binary classifier given the ground truth evidence as positive examples and all the other sentences as negative examples with an annealing sampling strategy. Finally, the claim verifier is a state-of-the-art 3-way neural natural language inference (NLI) classifier (with WordNet and ELMo features) that takes the concatenation of all selected evidence as the premise and the claim as the hypothesis, and labels each such evidence-claim pair as one of 'support', 'refute', or 'not enough info'. To improve the claim verifier via better awareness of the selected evidence, we further combine the last two modules by feeding the sentence similarity score (produced by the sentence selector) as an additional token-level feature to the claim verifier.

A.2 UCL Machine Reading Group

The UCLMR system is a four stage model consisting of document retrieval, sentence retrieval, natural language inference and aggregation. Document retrieval attempts to find the name of a Wikipedia article in the claim, and then ranks each article based on capitalization, sentence position and token match features. A set of sentences are then retrieved from the top ranked articles, based on token matches with the claim and position in the article. A natural language inference model is then applied to each of these sentences paired with the claim, giving a prediction for each potential evidence. These predictions are then aggregated using a simple MLP, and the sentences are reranked to keep only the evidence consistent with the final prediction.

A.3 Athene UKP TU Darmstadt

Document retrieval: We applied the constituency parser from AllenNLP to extract noun phrases in the claim and made use of the Wikipedia API to search corresponding pages for each noun phrase. So as to remove noisy pages from the results, we have stemmed the words of their titles and the claim, and then discarded pages whose stemmed words of the title are not completely included in the set of stemmed words in the claim.

Sentence selection: The hinge loss with negative sampling is applied to train the enhanced LSTM. For a given positive claim-evidence pair, negative samples are generated by randomly sampling sentences from the retrieved documents.

RTE: We combine the 5 sentences from sentence selection and the claim to form 5 pairs and then apply the enhanced LSTM for each pair. We combine the resulting representations using average and max pooling and feed the resulting vector through an MLP for classification.

A.4 Papelo

We develop a system for the FEVER fact extraction and verification challenge that uses a high precision entailment classifier based on transformer networks pretrained with language modeling (Radford and Salimans, 2018) to classify a broad set of potential evidence. The precision of the entailment classifier allows us to enhance recall by considering every statement from several articles to decide upon each claim. We include not only the articles best matching the claim text by TFIDF score, but read additional articles whose titles match named entities and capitalized expressions occurring in the claim text. The entailment module evaluates potential evidence one statement at a time, together with the title of the page the evidence came from (providing a hint about possible pronoun antecedents). In preliminary evaluation, the system achieved 57.36% FEVER score, 61.08% label accuracy, and 64.85% evidence F1 on the FEVER shared task test set.

A.5 SWEEPer

Our model for fact checking and verification consists of two stages: 1) identifying relevant documents using lexical and syntactic features from the claim and the first two sentences in the Wikipedia article, and 2) jointly modeling sentence extraction and verification. As the tasks of fact checking and finding evidence are dependent on each other, an ideal model would consider the veracity of the claim when finding evidence and also find only the evidence that supports/refutes the position of the claim. We thus jointly model the second stage by using a pointer network with the claim and evidence sentence represented using the ESIM module. For stage 2, we first train both components using multi-task learning over a larger memory of extracted sentences, then tune parameters using reinforcement learning to first extract sentences and predict the relation over only the extracted sentences.

A.6 Columbia NLP

For document retrieval we use three components: 1) use the Google custom search API with the claim as a query and return the top 2 Wikipedia pages; 2) extract all named entities from the claims and use the Wikipedia python API to return a page for each named entity; and 3) use the prefix of the claim until the first lowercase verb phrase, and use the Wikipedia API to return the top page.

For sentence selection we used the modified document retrieval component of DrQA to get the top 5 sentences and then further extracted the top 3 sentences using cosine similarity between vectors obtained from ELMo (Peters et al., 2018) sentence embeddings of the claim and the evidence.

For RTE we used the same model as outlined by Conneau et al. (2017) in their work for recognizing textual entailment and learning universal sentence representations. If at least one out of the three evidences SUPPORTS/REFUTES the claim and the rest are NOT ENOUGH INFO, then we treat the label as SUPPORTS/REFUTES; else we return the majority among the three classes as the predicted label.
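This aggregation rule can be written out as the following sketch (label strings are assumptions and the helper is hypothetical, not the team's code):

```python
from collections import Counter

def aggregate_labels(labels):
    """If at least one per-evidence prediction is decisive (SUPPORTS or
    REFUTES) and all decisive predictions agree, return that label;
    otherwise fall back to the majority class."""
    decisive = [label for label in labels if label != "NOT ENOUGH INFO"]
    if decisive and len(set(decisive)) == 1:
        return decisive[0]
    return Counter(labels).most_common(1)[0][0]
```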

A.7 Ohio State University

Our system was developed using a heuristics-based approach for evidence extraction and a modified version of the inference model by Parikh et al. (2016) for classification into refute, support, or not enough info. Our process is broken down into three distinct phases. First, potentially relevant documents are gathered based on key words/phrases in the claim that appear in the wiki dump. Second, any possible evidence sentences inside those documents are extracted by breaking down the claim into named entities plus nouns and finding any sentences which match those entities, while allowing for various exceptions and additional potential criteria to increase recall. Finally, every sentence is classified into one of the three categories by the inference tool, after additional vectors are added based on named entity types. NEI sentences are discarded and the highest scored label of the remaining sentences is assigned to the claim.


A.8 GESIS Cologne

In our approach we used a sentence-wise approach in all components. To find the sentences we set up a Solr database and indexed every sentence, including information about the article where the sentence is from. We created queries based on the named entities and noun chunks of the claims. For the entailment task we used a Decomposable Attention model similar to the one used in the baseline approach. But instead of comparing the claim with all top 5 sentences at once, we treat every sentence separately. The results for the top 5 sentences were then joined with an ensemble learner, including the rank from the sentence retriever of the Wikipedia sentences.

A.9 FujiXerox

We prepared a pipeline system which is composed of document selection, sentence retrieval, and recognizing textual entailment (RTE) components. A simple entity linking approach with text match is used as the document selection component; this component identifies relevant documents for a given claim by using mentioned entities as clues. The sentence retrieval component selects relevant sentences as candidate evidence from the documents based on TF-IDF. Finally, the RTE component selects evidence sentences by ranking the sentences and classifies the claim as SUPPORTED, REFUTED, or NOTENOUGHINFO simultaneously. As the RTE component, we adopted the DEISTE (Deep Explorations of Inter-Sentence interactions for Textual Entailment) model (Yin et al., 2018), which is the state of the art in the RTE task.

A.10 Uni-DuE Student Team

We generate a Lucene index from the provided Wikipedia dump. Then we use two neural networks, one for named entity recognition and the other for constituency parsing, and also the Stanford dependency parser to create the keywords used inside the Lucene queries. Depending on the amount of keywords found for each claim, we run multiple Lucene searches on the generated index to create a list of candidate sentences for each claim. The resulting list of claim-candidate pairs is processed in three ways:

1. We use the Stanford POS-Tagger to generate POS tags for the claim and candidate sentences, which are then used in a hand-crafted scoring script to assign a score on a 0 to 15 scale.

2. We run each pair through a modified version of the Decomposable Attention network.

3. We merge all candidate sentences per claim into one long piece of text and run the result paired with the claim through the same modified Decomposable Attention network as in (2).

We then make the final prediction in a hand-crafted script combining the results of the three previous steps.

A.11 Directed Acyclic Graph

In this paper, we describe the system we designed for the FEVER 2018 Shared Task. The aim of this task was to conceive a system that can not only automatically assess the veracity of a claim but also retrieve evidence supporting this assessment from Wikipedia. In our approach, the Wikipedia documents whose Term Frequency - Inverse Document Frequency (TFIDF) vectors are most similar to the vector of the claim and those documents whose names are similar to the named entities (NEs) mentioned in the claim are identified as the documents which might contain evidence. The sentences in these documents are then supplied to a decomposable attention-based textual entailment recognition module. This module calculates the probability of each sentence supporting the claim, contradicting the claim or not providing any relevant information. Various features computed using these probabilities are finally used by a Random Forest classifier to determine the overall truthfulness of the claim. The sentences which support this classification are returned as evidence. Our approach achieved a 42.77% evidence F1-score, a 51.36% label accuracy and a 38.33% FEVER score.

A.12 Py.ro

We NER tagged the claim using SpaCy and used the named entities as candidate page IDs. We resolved redirects by following the Wikipedia URL if an item was not in the preprocessed dump. If a page could not be found, we fell back to the baseline document selection method. The rest of the system was identical to the baseline system, although we used our document retrieval system to generate alternative training data.

A.13 SIRIUS-LTG-UIO

This article presents the SIRIUS-LTG system for the Fact Extraction and VERification (FEVER) Shared Task. Our system consists of three components:

1. Wikipedia Page Retrieval: First we extract the entities in the claim, then we find potential Wikipedia URI candidates for each of the entities using a SPARQL query over DBpedia.

2. Sentence selection: We investigate various techniques, i.e. SIF embedding, Word Mover's Distance (WMD), Soft-Cosine Similarity, and cosine similarity with unigram TF-IDF, to rank sentences by their similarity to the claim.

3. Textual Entailment: We compare three models for the claim classification. We apply a Decomposable Attention (DA) model (Parikh et al., 2016), a Decomposed Graph Entailment (DGE) model (Khot et al., 2018) and a Gradient-Boosted Decision Trees (TalosTree) model (Baird et al., 2017) for this task.

The experiments show that the pipeline with simple cosine similarity using TF-IDF in sentence selection, along with DA as the labeling model, achieves better results on the development and test datasets.

A.14 BUPT-NLPer

We introduce an end-to-end multi-task learning model for fact extraction and verification with bi-directional attention. We propose a multi-task learning framework for the evidence extraction and claim verification because these two tasks can be accomplished at the same time. Each task provides supplementary information for the other and improves the results of the other task.

For each claim, our system first uses the entity linking tool S-MART to retrieve relevant pages from Wikipedia. Then, we use attention mechanisms in both directions, claim-to-page and page-to-claim, which provide complementary information to each other. Aimed at the different tasks, our system obtains claim-aware sentence representations for evidence extraction and page-aware claim representations for claim verification.

A.15 University of Arizona

Many approaches to automatically recognizing entailment relations have employed classifiers over hand-engineered lexicalized features, or deep learning models that implicitly capture lexicalization through word embeddings. This reliance on lexicalization may complicate the adaptation of these tools between domains. For example, such a system trained in the news domain may learn that a sentence like "Palestinians recognize Texas as part of Mexico" tends to be unsupported, a fact which has no value in, say, a scientific domain. To mitigate this dependence on lexicalized information, in this paper we propose a model that reads two sentences, from any given domain, to determine entailment without using any lexicalized features. Instead our model relies on features like the proportion of negated verbs, antonyms, noun overlap, etc. In its current implementation, this model does not perform well on the FEVER dataset, due to two reasons. First, for the information retrieval part of the task we used the baseline system provided, since this was not the aim of our project. Second, this is work in progress and we are still in the process of identifying more features and gradually increasing the accuracy of our model. In the end, we hope to build a generic end-to-end classifier, which can be used in a domain outside the one in which it was trained, with no or minimal re-training.


Proceedings of the First Workshop on Fact Extraction and VERification (FEVER), pages 10–15, Brussels, Belgium, November 1, 2018. ©2018 Association for Computational Linguistics

The Data Challenge in Misinformation Detection: Source Reputation vs. Content Veracity

Fatemeh Torabi Asr
Discourse Processing Lab
Simon Fraser University
Burnaby, BC, Canada
[email protected]

Maite Taboada
Discourse Processing Lab
Simon Fraser University
Burnaby, BC, Canada
[email protected]

Abstract

Misinformation detection at the level of full news articles is a text classification problem. Reliably labeled data in this domain is rare. Previous work relied on news articles collected from so-called "reputable" and "suspicious" websites and labeled accordingly. We leverage fact-checking websites to collect individually-labeled news articles with regard to the veracity of their content and use this data to test the cross-domain generalization of a classifier trained on bigger text collections but labeled according to source reputation. Our results suggest that reputation-based classification is not sufficient for predicting the veracity level of the majority of news articles, and that the system performance on different test datasets depends on topic distribution. Therefore collecting well-balanced and carefully-assessed training data is a priority for developing robust misinformation detection systems.

1 Introduction

Automatic detection of fake from legitimate news in different formats such as headlines, tweets and full news articles has been approached in recent Natural Language Processing literature (Vlachos and Riedel, 2014; Vosoughi, 2015; Jin et al., 2016; Rashkin et al., 2017; Volkova et al., 2017; Wang, 2017; Pomerleau and Rao, 2017; Thorne et al., 2018). The most important challenge in automatic misinformation detection using modern NLP techniques, especially at the level of full news articles, is data. Most previous systems built to identify fake news articles rely on training data labeled with respect to the general reputation of the sources, i.e., domains/user accounts (Fogg et al., 2001; Lazer et al., 2017; Rashkin et al., 2017). Even though some of these studies try to identify fake news based on linguistic cues, the question is whether they learn publishers' general writing style (e.g., common writing features of a few clickbaity websites) or deceptive style (similarities among news articles that contain misinformation).

In this study, we collect two new datasets that include the full text of news articles and individually assigned veracity labels. We then address the above question by conducting a set of cross-domain experiments: training a text classification system on data collected in a batch manner from suspicious and reputable websites and then testing the system on news articles that have been assessed in a one-by-one fashion. Our experiments reveal that the generalization power of a model trained on reputation-based labeled data is not impressive on individually assessed articles. Therefore, we propose to collect and verify larger collections of news articles with reliably assigned labels that would be useful for building more robust fake news detection systems.

2 Data Collection

Most studies on fake news detection have examined microblogs, headlines and claims in the form of short statements. A few recent studies have examined full articles (i.e., actual 'fake news') to extract discriminative linguistic features of misinformation (Yang et al., 2017; Rashkin et al., 2017; Horne and Adali, 2017). The issue with these studies is the data collection methodology. Texts are harvested from websites that are assumed to be fake news publishers (according to a list of suspicious websites), with no individual labeling of data. The so-called suspicious sources, however, sometimes do publish facts and valid information, and reputable websites sometimes publish inaccurate information (Mantzarlis, 2017). The key to collecting more reliable data, then, is to not rely on the source but on the text of the article itself, and only after the text has been assessed by human annotators and determined to contain false information.


Currently, there exist only small collections of reliably-labeled news articles (Rubin et al., 2016; Allcott and Gentzkow, 2017; Zhang et al., 2018; Baly et al., 2018) because this type of annotation is laborious. The Liar dataset (Wang, 2017) is the first large dataset collected through reliable annotation, but it contains only short statements. Another recently published large dataset is FEVER (Thorne et al., 2018), which contains both claims and texts from Wikipedia pages that support or refute those claims. This dataset, however, has been built to serve the slightly different purpose of stance detection (Pomerleau and Rao, 2017; Mohtarami et al., 2018), the claims have been artificially generated, and the texts are not news articles.

Our objective is to elaborate on the distinction between classifying reputation-based labeled news articles and individually-assessed news articles. We do so by collecting and using datasets of the second type in the evaluation of a text classifier trained on the first type of data. In this section, we first introduce one large collection of news text from previous studies that has been labeled according to a list of suspicious websites, and one small collection that was labeled manually for each and every news article, but only contains satirical and legitimate instances. We then introduce two datasets that we have scraped from the web by leveraging links to news articles mentioned by fact-checking websites (Buzzfeed and Snopes). The distinguishing feature of these new collections is that they contain not only the full text of real news articles found online, but also individually assigned veracity labels indicative of their misinformative content.

Rashkin et al. dataset: Rashkin et al. (2017) published a collection of roughly 20k news articles from eight sources categorized into four classes: propaganda (The Natural News and Activist Report), satire (The Onion, The Borowitz Report, and Clickhole), hoax (American News and DC Gazette) and trusted (Gigaword News). This dataset is balanced across classes, and since the articles in their training and test splits come from different websites, the accuracy of the trained model on test data should be demonstrative of its understanding of the general writing style of each target class rather than author-specific cues. However, we suspect that the noisy strategy of labeling all articles of a publisher based on its reputation highly biases the classifier decisions and limits its power to distinguish individual misinformative from truthful news articles.

Rubin et al. dataset: As part of a study on satirical cues, Rubin et al. (2016) published a dataset of 360 news articles. This dataset contains balanced numbers of individually evaluated satirical and legitimate texts. Even though it is small, it is clean data with which to test the generalization power of a system trained on noisy data such as the dataset described above. We use this data to make our point about the need for careful annotation of news articles in a one-by-one fashion, rather than harvesting from websites generally known as hoax, propaganda or satire publishers.

BuzzfeedUSE dataset: The first source of information that we used to harvest full news articles with veracity labels is the Buzzfeed fact-checking company. Buzzfeed has published a collection of links to Facebook posts, originally compiled for a study around the 2016 US election (Silverman et al., 2016). Each URL in this dataset was given to human experts so they could rate the amount of false information contained in the linked article. The links were collected from nine Facebook pages (three right-wing, three left-wing and three mainstream publishers).1 We had to follow the Facebook URLs and then the link to the original news articles to obtain the news texts. We scraped the full text of each news article from its original source. The resulting dataset includes a total of 1,380 news articles on a focused topic (the US election and candidates). Veracity labels come in a 4-way classification scheme including 1,090 mostly true, 170 mixture of true and false, 64 mostly false and 56 articles containing no factual content.

Snopes312 dataset: The second source of information that we used to harvest full news articles with veracity labels is Snopes, a well-known rumor-debunking website run by a team of expert editors. We scraped the entire archive of fact-checking pages. On each page, the editors discuss a claim, cite the sources (news articles, forums or social networks where the claim was distributed) and provide a veracity label for the claim. We automatically extracted all links mentioned on a Snopes page, followed the link to each original news article, and extracted the text.

1 https://www.kaggle.com/mrisdal/fact-checking-facebook-politics-pages


Table 1: Results of the manual assessment of the Snopes312 collection for items of each veracity label

Assessment / Veracity label   false  mixture  mostly false  mostly true  true  All
ambiguous                         2        0             1            0     0    3
context                          19       31            17           32    26  125
debunking                         0        1             0            0     0    1
irrelevant                        9       10             7            2    10   38
supporting                       21       30            28           37    29  145
All                              51       72            53           71    65  312

Table 2: Contingency table on disagreements between the first and second annotator in the Snopes312 dataset

First annotator / Second annotator   ambiguous  context  debunking  irrelevant  supporting  All
ambiguous                                    0        0          0           0           0    0
context                                      1        0          1           8          71   81
debunking                                    0        0          0           0           1    1
irrelevant                                   0       36          0           0          16   52
supporting                                   0       11          1           0           0   12
All                                          1       47          2           8          88  146

The resulting datafile includes roughly 4,000 rows, each containing a claim discussed by Snopes annotators, the veracity label assigned to it, and the text of a news article related to the claim. The main challenge in using this data for training/testing a fake news detector is that some of the links on a Snopes page that we collect automatically do not actually point to the discussed news article, i.e., the source of the claim. Many links are to pages that provide contextual information for the fact-checking of the claim. Therefore, not all the texts in our automatically extracted dataset are reliable, or simply the "supporting" source of the claim. To come up with a reliable set of veracity-labeled news articles, we randomly selected 312 items and assessed them manually. Two annotators performed independent assessments on the 312 items. A third annotator went through the entire list of items for a final check and to resolve disagreements. Snopes has a fine-grained veracity labeling system. We selected [fully] true, mostly true, mixture of true and false, mostly false, and [fully] false stories. Table 1 shows the distribution of these labels in the manually assessed 312 items, and how many news articles from each category were verified to be the "supporting" source (distributing the discussed claim), "context" (providing background or related information about the topic of the claim), "debunking" (against the claim), "irrelevant" (completely unrelated to the claim or distorted text) or "ambiguous" (not sure how the text relates to the claim). Table 2 provides information on the confusing choices: about 50% of the items received different category labels from the first two annotators. The first annotator had a more conservative bias, trying to avoid mistakes in the "supporting" category, whereas the second annotator often assigned either "supporting" or "context", and rarely "irrelevant". For the disagreed items, the third annotator (who had access to all outputs) chose the final category. Results in Table 1 are based on this final assessment. We use the "supporting" portion of the data (145 items) in the following experiments.

3 Experiments

In text classification, Convolutional Neural Networks (CNNs) have been competing with the TF-IDF model, a simple but strong baseline using scored n-grams (Le and Mikolov, 2014; Zhang et al., 2015; Conneau et al., 2017; Medvedeva et al., 2017). These methods have been used for fake news detection in previous work (Rashkin et al., 2017; Wang, 2017). For our experiments, we trained and tuned different architectures of CNNs and several classic classifiers (Naive Bayes and Support Vector Machines) with TF-IDF features on Rashkin et al.'s dataset. The best results on the development data were obtained from a Support Vector Machine (SVM) classifier using unigram TF-IDF features with L2 regularization.2

2 We used the same train/dev/test split as in Rashkin's paper. However, the performance of our SVM classifier was significantly better on both dev and test sets: 0.96 and 0.75 F1-score, respectively, compared to 0.91 and 0.65 reported in their paper. Source code will be made available at https://github.com/sfu-discourse-lab/Misinformation_detection


Figure 1: Classification of news articles from four test datasets by a model trained on Rashkin et al.'s training data. Labels assigned by the classifier are Capitalized (plot legend); actual labels of test items are in lowercase (x-axis).

Therefore, we use this model to demonstrate how a classifier trained on data labeled according to the publisher's reputation would identify misinformative news articles.
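A minimal scikit-learn sketch of this kind of unigram TF-IDF + L2-regularized SVM pipeline is shown below; the toy texts and labels are placeholders standing in for Rashkin et al.'s corpus, and this is not the authors' released code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy placeholders standing in for the four-class training corpus.
train_texts = ["example propaganda article ...", "example trusted article ..."]
train_labels = ["propaganda", "trusted"]

# Unigram TF-IDF features + linear SVM (L2 regularization is the LinearSVC default).
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 1)), LinearSVC(penalty="l2"))
model.fit(train_texts, train_labels)
print(model.predict(["another article to classify ..."]))
```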

It is evident in the first section of Figure 1 that the model performs well on similarly collected test items, i.e., Hoax, Satire, Propaganda and Trusted news articles within Rashkin et al.'s test dataset. However, when the model is applied to Rubin et al.'s data, which was carefully assessed for satirical cues in each and every article, the performance drops considerably (see the second section of the figure). Although the classifier detects more of the satirical texts in Rubin et al.'s data, the distribution of the labels it assigns is not very different from that for legitimate texts. One important feature of Rubin et al.'s data is that the topics of the legitimate instances were matched and balanced with the topics of the satirical instances. The results here suggest that the similarities captured by the classifier can be very dependent on the topics of the news articles.

Next, we examine the same model on our collected datasets, BuzzfeedUSE and Snopes312, as test material. The BuzzfeedUSE data comes with 4 categories (Figure 1). The classifier does seem to have some sensitivity to true vs. false information in this dataset, as more of the mostly true articles were labeled as Trusted. The difference with mostly false articles, however, is negligible. The most frequent label assigned by the classifier was Hoax in all four categories, which suggests that most BuzzfeedUSE articles looked like Hoax in Rashkin's data. Finally, the last section of Figure 1 shows the results on Snopes312, plotted along the 6-category distinction. A stronger correlation can be observed between the classifier decisions and the veracity labels in this data compared to BuzzfeedUSE. This suggests that distinguishing between news articles with true and false information is a more difficult task when topics are the same (the BuzzfeedUSE data is all related to the US election). In Snopes312, news articles come from a variety of topics. The strong alignment between the classifier's Propaganda and Hoax labels and the mostly false and [fully] false categories in this dataset reveals that most misinformative news articles indeed discuss the topics or use the language of generally suspicious publishers. This is an encouraging result in the sense that, with surface features such as n-grams and approximate reputation-based training data, we can already detect some of the misinformative news articles. Observing classification errors across these experiments, however, indicates that the model performance varies a lot with the type of test material: in a focused-topic situation, it fails to distinguish between categories (false vs. true, or satirical vs. legitimate articles). While a correlation is consistently observed between labels assigned by the classifier and the actual labels of target news articles,3 reputation-based classification does not seem to be sufficient for predicting the veracity level of the majority of news articles.

4 Conclusion

We found that collecting reliable data for automatic misinformation detection at the level of full news articles is a challenging but necessary task for building robust models. If we want to benefit from state-of-the-art text classification techniques, such as CNNs, we require larger datasets than what is currently available.

3 A chi-square test indicates a significant correlation (p < 0.001) between assigned and actual labels in all four datasets.


We took the first steps by scraping claims and veracity labels from fact-checking websites, extracting and cleaning the texts of the original news articles (resulting in roughly 4,000 items), and finally manually assessing a subset of the data to provide reliable test material for misinformation detection. Our future plan is to crowd-source annotators for the remaining scraped texts and publish a large set of labeled news articles for training purposes.

Acknowledgement

This project was funded by a Discovery Grant from the Natural Sciences and Engineering Research Council of Canada. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research. We would like to thank members of the Discourse Processing Lab at SFU, especially Yajie Zhou and Jerry Sun, for their help checking the datasets.

References

Hunt Allcott and Matthew Gentzkow. 2017. Social media and fake news in the 2016 election. Journal of Economic Perspectives, 31:211–236.

Ramy Baly, Mitra Mohtarami, James Glass, Lluís Màrquez, Alessandro Moschitti, and Preslav Nakov. 2018. Integrating stance detection and fact checking in a unified corpus. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 21–27.

Alexis Conneau, Holger Schwenk, Loïc Barrault, and Yann LeCun. 2017. Very deep convolutional networks for text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, pages 1107–1116, Valencia.

B. J. Fogg, Jonathan Marshall, Othman Laraki, Alex Osipovich, Chris Varma, Nicholas Fang, Jyoti Paul, Akshay Rangnekar, John Shon, Preeti Swani, and Marissa Treinen. 2001. What makes web sites credible? A report on a large quantitative study. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '01, pages 61–68, New York.

Benjamin D. Horne and Sibel Adali. 2017. This just in: Fake news packs a lot in title, uses simpler, repetitive content in text body, more similar to satire than real news. arXiv preprint arXiv:1703.09398.

Zhiwei Jin, Juan Cao, Yongdong Zhang, and Jiebo Luo. 2016. News verification by exploiting conflicting social viewpoints in microblogs. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pages 2972–2978, Phoenix.

David Lazer, Matthew Baum, Nir Grinberg, Lisa Friedland, Kenneth Joseph, Will Hobbs, and Carolina Mattsson. 2017. Combating fake news: An agenda for research and action. Harvard Kennedy School, Shorenstein Center on Media, Politics and Public Policy, 2.

Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning, pages II-1188–II-1196, Beijing.

Alexios Mantzarlis. 2017. Not fake news, just plain wrong: Top media corrections of 2017. Poynter News. https://www.poynter.org/news/not-fake-news-just-plain-wrong-top-media-corrections-2017.

Maria Medvedeva, Martin Kroon, and Barbara Plank. 2017. When sparse traditional models outperform dense neural networks: The curious case of discriminating between similar languages. In Proceedings of the 4th Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), pages 156–163, Valencia.

Mitra Mohtarami, Ramy Baly, James Glass, Preslav Nakov, Lluís Màrquez, and Alessandro Moschitti. 2018. Automatic stance detection using end-to-end memory networks. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 767–776, New Orleans, LA.

Dean Pomerleau and Delip Rao. 2017. Fake news challenge. http://fakenewschallenge.org/.

Hannah Rashkin, Eunsol Choi, Jin Yea Jang, Svitlana Volkova, and Yejin Choi. 2017. Truth of varying shades: Analyzing language in fake news and political fact-checking. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2921–2927, Copenhagen.

Victoria L. Rubin, Niall J. Conroy, Yimin Chen, and Sarah Cornwell. 2016. Fake news or truth? Using satirical cues to detect potentially misleading news. In Proceedings of NAACL-HLT, pages 7–17, San Diego.

Craig Silverman, Lauren Strapagiel, Hamza Shaban, Ellie Hall, and Jeremy Singer-Vine. 2016. Hyperpartisan Facebook pages are publishing false and misleading information at an alarming rate. BuzzFeed News. https://www.buzzfeed.com/craigsilverman/partisan-fb-pages-analysis.


James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018. FEVER: A large-scale dataset for Fact Extraction and VERification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 809–819, New Orleans, LA.

Andreas Vlachos and Sebastian Riedel. 2014. Fact checking: Task definition and dataset construction. In Proceedings of the ACL 2014 Workshop on Language Technologies and Computational Social Science, pages 18–22.

Svitlana Volkova, Kyle Shaffer, Jin Yea Jang, and Nathan Hodas. 2017. Separating facts from fiction: Linguistic models to classify suspicious and trusted news posts on Twitter. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, volume 2, pages 647–653, Vancouver.

Soroush Vosoughi. 2015. Automatic detection and verification of rumors on Twitter. Ph.D. thesis, Massachusetts Institute of Technology.

William Yang Wang. 2017. 'Liar, liar pants on fire': A new benchmark dataset for fake news detection. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, volume 2, pages 422–426, Vancouver.

Fan Yang, Arjun Mukherjee, and Eduard Dragut. 2017. Satirical news detection and analysis using attention mechanism and linguistic features. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1979–1989.

Amy X. Zhang, Aditya Ranganathan, Sarah Emlen Metz, Scott Appling, Connie Moon Sehat, Norman Gilmore, Nick B. Adams, Emmanuel Vincent, Jennifer Lee, Martin Robbins, et al. 2018. A structured response to misinformation: Defining and annotating credibility indicators in news articles. In Companion of The Web Conference 2018, pages 603–612. International World Wide Web Conferences Steering Committee.

Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Proceedings of the 28th International Conference on Neural Information Processing Systems, pages 649–657, Montreal.


Crowdsourcing Semantic Label Propagation in Relation Classification

Anca Dumitrache
Vrije Universiteit Amsterdam
IBM CAS
[email protected]

Lora Aroyo
Vrije Universiteit Amsterdam
[email protected]

Chris Welty
Google Research, New York
[email protected]

Abstract

Distant supervision is a popular method for performing relation extraction from text that is known to produce noisy labels. Most progress in relation extraction and classification has been made with crowdsourced corrections to distant-supervised labels, and there is evidence that indicates still more would be better. In this paper, we explore the problem of propagating human annotation signals gathered for open-domain relation classification through the CrowdTruth methodology for crowdsourcing, that captures ambiguity in annotations by measuring inter-annotator disagreement. Our approach propagates annotations to sentences that are similar in a low-dimensional embedding space, expanding the number of labels by two orders of magnitude. Our experiments show significant improvement in a sentence-level multi-class relation classifier.

1 Introduction

Distant supervision (DS) (Mintz et al., 2009) is a popular method for performing relation extraction from text. It is based on the assumption that, when a knowledge base contains a relation between a pair of terms, then any sentence that contains that pair is likely to express the relation. This approach can generate false positives, as not every mention of a term pair in a sentence means a relation is also expressed (Feng et al., 2017).

Recent results (Angeli et al., 2014; Liu et al., 2016) have shown strong evidence that the community needs more annotated data to improve the quality of DS data. This work explores the possibility of automatically expanding smaller human-annotated datasets to DS scale. Sterckx et al. (2016) proposed a method to correct labels of sentence dependency paths by using expert annotators, and then propagating the corrected labels to a corpus of DS sentences by calculating the similarity between the labeled and unlabeled sentences in the embedding space of their dependency paths.

In this paper, we adapt and simplify semantic label propagation to propagate labels without computing dependency paths, using the crowd instead of experts, which is more scalable. Our simplified algorithm propagates crowdsourced annotations from a small sample of sentences to a large DS corpus. To evaluate our approach, we perform an experiment in open-domain relation classification in English, using a corpus of sentences (Dumitrache et al., 2017) whose labels have been collected using the CrowdTruth method (Aroyo and Welty, 2014).

2 Related Work

There exist several efforts to correct DS with the help of crowdsourcing. Angeli et al. (2014) present an active learning approach to select the most useful sentences that need human re-labeling using a query-by-committee strategy. Zhang et al. (2012) show that labeled data has a statistically significant, but relatively low, impact on improving the quality of DS training data, while increasing the size of the DS corpus has a more significant impact. In contrast, Liu et al. (2016) prove that a corpus of labeled sentences from a pool of highly qualified workers can significantly improve DS quality. All of these methods employ large annotated corpora of 10,000 to 20,000 sentences. In our experiment, we show that a comparatively smaller corpus of 2,050 sentences is enough to correct DS errors through semantic label propagation.

Levy et al. (2017) have shown that a small crowdsourced dataset of questions about relations can be exploited to perform zero-shot learning for relation extraction. Pershina et al. (2014) use a small dataset of hand-labeled data to generate relation-specific guidelines that are used as additional features in the relation extraction.


Figure 1: Fragment of the crowdsourcing task template.

The label propagation method was introduced by Xiaojin and Zoubin (2002), while Chen et al. (2006) first applied it to correct DS, by calculating similarity between labeled and unlabeled examples using an extensive list of features, including part-of-speech tags and target entity types. In contrast, our approach calculates similarity between examples in the word2vec (Mikolov et al., 2013) feature space, which it then uses to correct the labels of training sentences. This makes it easy to reuse by the state-of-the-art in both relation classification and relation extraction – convolutional (Ji et al., 2017) and recurrent neural network methods (Zhou et al., 2016) that do not use extensive feature sets. To evaluate our approach, we used a simple convolutional neural network to perform relation classification in sentences (Nguyen and Grishman, 2015).

3 Experimental Setup

3.1 Annotated Data

The labeled data used in our experiments consists of 4,100 sentences: 2,050 sentences from the CrowdTruth corpus (Dumitrache et al., 2017), which we have augmented by another 2,050 sentences picked at random from the corpus of Angeli et al. (2014). The resulting corpus contains sentences for 16 popular relations from the open domain, as shown in Figure 1,1 as well as candidate term pairs and DS seed relations for each sentence.

1 The alternate names relation appears twice in the list, once referring to alternate names of persons, and the other referring to organizations.

As some relations are more general than others, the relation frequency in the corpus is slightly unequal – e.g. places of residence is more likely to be in a sentence when place of birth and place of death occur, but not the opposite.

The crowdsourcing task (Figure 1) was designed in our previous work (Dumitrache et al., 2017). We asked workers to read the given sentence where the candidate term pair is highlighted, and then pick between the 16 relations or none of the above, if none of the presented relations apply. The task was multiple choice and was run on the Figure Eight2 and Amazon Mechanical Turk3 crowdsourcing platforms. Each sentence was judged by 15 workers, and each worker was paid $0.05 per sentence.

Crowdsourcing annotations are usually aggregated by measuring the consensus of the workers (e.g. using majority vote). This is based on the assumption that a single right annotation exists for each example. In the problem of relation classification, the notion of a single truth is reflected in the fact that a majority of proposed solutions treat relations as mutually exclusive, and the objective of the classification task is usually to find the best relation for a given sentence and term pair. In contrast, the CrowdTruth methodology proposes that crowd annotations are inherently diverse (Aroyo and Welty, 2015), due to a variety of factors such as the ambiguity that is inherent in natural language. We use a comparatively large number of workers per sentence (15) in order to collect inter-annotator disagreement, which results in a more fine-grained ground truth that separates between clear and ambiguous expressions of relations.

2 https://www.figure-eight.com/
3 https://www.mturk.com/


This is achieved by labeling examples with the inter-annotator agreement on a continuous scale, as opposed to using binary labels.

To aggregate the results of the crowd, we use CrowdTruth metrics4 (Dumitrache et al., 2018) to capture and interpret inter-annotator disagreement as quality metrics for the workers, sentences, and relations in the corpus. The annotations of one worker over one sentence are encoded as a binary worker vector with 17 components, one for each relation and including none. The quality metrics for the workers, sentences and relations are based on average cosine similarity over the worker vectors – e.g. the quality of a worker w is given by the average cosine similarity between the worker vector of w and the vectors of all other workers that annotated the same sentences. These metrics are mutually dependent (e.g. the sentence quality is weighted by the relation quality and worker quality), the intuition being that low-quality workers should not count as much in determining sentence quality, and ambiguous sentences should have less of an impact in determining worker quality, etc.

We reused these scores in our experiment, focusing on the sentence-relation score (srs), representing the degree to which a relation is expressed in the sentence. It is the ratio of workers that picked the relation to all the workers that read the sentence, weighted by the worker and relation quality. A higher srs should indicate that the relation is more clearly expressed in a sentence.
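A much-simplified sketch of these two quantities is given below: worker quality as average cosine similarity of binary annotation vectors, and srs as a quality-weighted ratio. It ignores relation quality and the mutual weighting of the full CrowdTruth metrics, so it is illustrative only; the hypothetical data layout is documented in the docstrings.

```python
import numpy as np

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def worker_quality(annotations):
    """annotations: dict sentence_id -> dict worker_id -> 17-dim binary np.array
    (16 relations + none). A worker's quality is approximated as the average
    cosine similarity with the co-annotators of the sentences they judged."""
    sims = {}
    for by_worker in annotations.values():
        for w, v in by_worker.items():
            others = [u for x, u in by_worker.items() if x != w]
            if others:
                sims.setdefault(w, []).append(np.mean([cosine(v, u) for u in others]))
    return {w: float(np.mean(s)) for w, s in sims.items()}

def srs(by_worker, quality, rel_idx):
    """Quality-weighted ratio of workers who selected relation rel_idx for one sentence."""
    num = sum(quality[w] * by_worker[w][rel_idx] for w in by_worker)
    den = sum(quality[w] for w in by_worker)
    return num / den if den else 0.0
```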

3.2 Propagating Annotations

Inspired by the semantic label propagation method (Sterckx et al., 2016), we propagate the vectors of srs scores on each crowd-annotated sentence to a much larger set of distant supervised (DS) sentences (see the datasets description in Section 3.3), scaling the vectors linearly by the distance in low-dimensional word2vec vector space (Mikolov et al., 2013). One of the reasons we chose the CrowdTruth set for this experiment is that the annotation vectors give us a score for each relation to propagate to the DS sentences, which have only one binary label.

Similarly to Sultan et al. (2015), we calculate the vector representation of a sentence as the average over its word vectors, and like Sterckx et al. (2016) we get the similarity between sentences using cosine similarity.

4 https://github.com/CrowdTruth/CrowdTruth-core

Additionally, we restrict the sentence representation to only contain the words between the term pair, in order to reduce the vector space to the one that is most likely to express the relations. For each sentence s in the DS dataset, we find the sentence l′ from the crowd-annotated set that is most similar to s: l′ = argmax_{l ∈ Crowd} cos_sim(l, s). The score for relation r of sentence s is calculated as the weighted average between srs(l′, r) and the original DS annotation, weighted by the cosine similarity to s (cos_sim(s, s) = 1 for the DS term, and cos_sim(s, l′) for the srs term):

    DS*(s, r) = [DS(s, r) + cos_sim(s, l′) · srs(l′, r)] / [1 + cos_sim(s, l′)]        (1)

where DS(s, r) ∈ {0, 1} is the original DS annotation for the relation r on sentence s.
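The sketch below implements Eq. (1) directly over averaged word2vec vectors. The data structures (a word-to-vector dict, per-sentence token lists already restricted to the span between the term pair, per-relation score vectors) are assumptions about how the inputs are represented, not the authors' code.

```python
import numpy as np

def sent_vec(tokens, w2v, dim=300):
    """Average word2vec vector of the tokens between the term pair."""
    vecs = [w2v[t] for t in tokens if t in w2v]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def cos_sim(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def propagate(ds_tokens, ds_labels, crowd_tokens, crowd_srs, w2v):
    """Eq. (1): blend the binary DS label vector of one DS sentence with the srs
    vector of its most similar crowd-annotated sentence, weighted by cosine similarity.
    ds_labels and each crowd_srs[i] are vectors with one score per relation."""
    s = sent_vec(ds_tokens, w2v)
    sims = [cos_sim(s, sent_vec(l, w2v)) for l in crowd_tokens]
    best = int(np.argmax(sims))
    w = sims[best]
    return (np.asarray(ds_labels, dtype=float) + w * np.asarray(crowd_srs[best])) / (1.0 + w)
```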

3.3 Training the Model

The crowdsourced data is split evenly into a dev and a test set of 2,050 sentences each, chosen at random. In addition, we used a training set of 235,000 sentences annotated by DS from Freebase relations, used in Riedel et al. (2013).

The relation classification model employed is based on Nguyen and Grishman (2015), who implement a convolutional neural network with four main layers: an embedding layer for the words in the sentence and the position of the candidate term pair in the sentence, a convolutional layer with a sliding window of variable length of 2 to 5 words that recognizes n-grams, a pooling layer that determines the most relevant features, and a softmax layer to perform classification.

We have adapted this model to be both multi-class and multi-label – we use a sigmoid cross-entropy loss function instead of softmax cross-entropy, and the final layer is normalized with the sigmoid function instead of softmax – in order to make it possible for more than one relation to hold between two terms in one sentence. The loss function is computed using continuous labels instead of binary positive/negative labels, in order to accommodate the use of the srs in training. The features of the model are the word2vec embeddings of the words in the sentences, together with the position embeddings of the two terms that express the relation.
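The authors implemented their network in TensorFlow; the fragment below is only an illustrative PyTorch rendering of the multi-label adaptation just described (sigmoid outputs per relation, sigmoid cross-entropy against continuous srs-derived targets), not their code, and the 100-dimensional pooled feature size is an assumption.

```python
import torch
import torch.nn as nn

n_relations = 17                                # 16 relations + "none" (assumption)
final_layer = nn.Linear(100, n_relations)       # on top of the pooled CNN features
loss_fn = nn.BCEWithLogitsLoss()                # sigmoid cross-entropy; accepts soft targets

def training_step(pooled_features, soft_labels):
    """pooled_features: (batch, 100) output of the pooling layer;
    soft_labels: (batch, n_relations) continuous scores in [0, 1], e.g. DS* values."""
    logits = final_layer(pooled_features)
    return loss_fn(logits, soft_labels)

# Example with random tensors:
loss = training_step(torch.randn(8, 100), torch.rand(8, n_relations))
```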


Figure 2: Precision/Recall curve, calculated for each sentence-relation pair.

The word embeddings are initialized with 300-dimensional word2vec vectors pre-trained on the Google News corpus5. Both the position and word embeddings are non-static and are optimized during training of the model. The model is trained for 25,000 iterations, past the point of stabilization of the training loss. The values of the other hyper-parameters are the same as those reported by Nguyen and Grishman (2015). The model was implemented in Tensorflow (Abadi et al., 2016) and trained in a distributed manner on the DAS-5 cluster (Bal et al., 2016).

For our experiment, we split the crowd data into a dev and a test set of equal size, and compared the performance of the model on the held-out test set when trained on the following datasets:

1. DS: The 235,000 sentences annotated by DS.

2. DS + CT: The 2,050 crowd-annotated dev sentences added directly to the DS dataset.

3. DS + W2V CT: The DS* dataset (Eq. 1), with relation scores propagated over the 2,050 crowd dev sentences.

4 Results and Discussion

To evaluate the performance of the models, we calculate the micro precision and recall (Figure 2), as well as the cosine similarity per sentence with the test set (Figure 3). In order to calculate the precision and recall, a threshold of 0.5 was set on the srs, and each sentence-relation pair was labeled either as positive or negative. However, for calculating the cosine similarity, the srs was used without change, in order to better reflect the degree of agreement the crowd had over annotating each example.

5 https://code.google.com/archive/p/word2vec/

Figure 3: Distribution of sentence-level cosine similarity with test set values.

We observe that DS + W2V CT, with a precision/recall AUC = 0.512, significantly outperforms DS (P/R AUC = 0.294). DS + CT (P/R AUC = 0.372) also does slightly better than DS, but not enough to compete with the semantic label propagation method. The cosine similarity result (Figure 3) shows that DS + W2V CT also produces model predictions that are closer to the different agreement levels of the crowd. Taking advantage of the agreement scores in the CrowdTruth corpus, the cosine similarity evaluation allows us to assess relation confidence scores on a continuous scale. The crowdsourcing results and model predictions are available online.6

One reason why the semantic label propagation method works better than simply adding the correctly labeled sentences to the train set is the high rate of incorrectly labeled examples in the DS training data. Figure 4 shows that some relations, such as origin and places of residence, have a ratio of over 0.8 false positive sentences, meaning that a vast majority of training examples are incorrectly labeled. The success of DS + W2V CT comes in part because the method relabels all sentences in DS. Adding correctly labeled sentences to the train set would require a significantly larger corpus in order to correct the high false positive rate, but semantic label propagation only requires a small corpus (two orders of magnitude smaller than the train set) to achieve significant improvements.

5 Conclusion and Future Work

This paper explores the problem of propagating human annotation signals in distant supervision data for open-domain relation classification.

6 https://github.com/CrowdTruth/Open-Domain-Relation-Extraction


Figure 4: DS false positive ratio in the combined crowd dev and test sets.

Our approach propagates human annotations to sentences that are similar in a low-dimensional embedding space, using a small crowdsourced dataset of 2,050 sentences to correct training data labeled with distant supervision. We present experimental results from training a relation classifier, where our method shows significant improvement over the DS baseline, as well as over just adding the labeled examples to the train set.

Unlike Sterckx et al. (2016), who employ experts to label the dependency path representation of sentences, our method uses the general crowd to annotate the actual sentence text, and is thus easier to scale and not dependent on methods for extracting dependency paths, so it can be more easily adapted to other languages and domains. Also, since the semantic label propagation is applied to the data before training is completed, this method can easily be reused to correct train data for any model, regardless of the features used in learning. In our future work, we plan to use this method to correct training data for state-of-the-art models in relation classification, but also relation extraction and knowledge-base population.

We also plan to explore different ways of collecting and aggregating data from the crowd. CrowdTruth (Dumitrache et al., 2017) proposes capturing ambiguity through inter-annotator disagreement, which necessitates multiple annotators per sentence, while Liu et al. (2016) propose increasing the number of labeled examples added to the training set by using one high-quality worker per sentence. We will compare the two methods to determine whether quality or quantity of data is more useful for semantic label propagation. To achieve this, we will investigate whether disagreement-based metrics such as sentence and relation quality can also be propagated through the training data.

References

Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. 2016. Tensorflow: A system for large-scale machine learning. In OSDI, volume 16, pages 265–283.

Gabor Angeli, Julie Tibshirani, Jean Wu, and Christopher D. Manning. 2014. Combining distant and partial supervision for relation extraction. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1556–1567.

Lora Aroyo and Chris Welty. 2014. The Three Sides of CrowdTruth. Journal of Human Computation, 1:31–34.

Lora Aroyo and Chris Welty. 2015. Truth is a lie: Crowd truth and the seven myths of human annotation. AI Magazine, 36(1):15–24.

Henri Bal, Dick Epema, Cees de Laat, Rob van Nieuwpoort, John Romein, Frank Seinstra, Cees Snoek, and Harry Wijshoff. 2016. A medium-scale distributed system for computer science research: Infrastructure for the long term. Computer, 49(5):54–63.

Jinxiu Chen, Donghong Ji, Chew Lim Tan, and Zhengyu Niu. 2006. Relation extraction using label propagation based semi-supervised learning. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, ACL-44, pages 129–136, Stroudsburg, PA, USA. Association for Computational Linguistics.

Anca Dumitrache, Lora Aroyo, and Chris Welty. 2017. False positive and cross-relation signals in distant supervision data. In Proceedings of the 6th Workshop on Automated Knowledge Base Construction.

Anca Dumitrache, Oana Inel, Lora Aroyo, Benjamin Timmermans, and Chris Welty. 2018. CrowdTruth 2.0: Quality Metrics for Crowdsourcing with Disagreement. arXiv preprint arXiv:1808.06080.

Xiaocheng Feng, Jiang Guo, Bing Qin, Ting Liu, and Yongjie Liu. 2017. Effective deep memory networks for distant supervised relation extraction. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI 2017, Melbourne, Australia, August 19-25, 2017, pages 4002–4008. ijcai.org.

Guoliang Ji, Kang Liu, Shizhu He, Jun Zhao, et al. 2017. Distant supervision for relation extraction with sentence-level attention and entity descriptions. In AAAI, pages 3060–3066.


Omer Levy, Minjoon Seo, Eunsol Choi, and Luke Zettlemoyer. 2017. Zero-shot relation extraction via reading comprehension. CoNLL 2017, page 333.

Angli Liu, Stephen Soderland, Jonathan Bragg, Christopher H. Lin, Xiao Ling, and Daniel S. Weld. 2016. Effective crowd annotation for relation extraction. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 897–906.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2, pages 1003–1011. Association for Computational Linguistics.

Thien Huu Nguyen and Ralph Grishman. 2015. Relation extraction: Perspective from convolutional neural networks. In Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing, pages 39–48.

Maria Pershina, Bonan Min, Wei Xu, and Ralph Grishman. 2014. Infusion of labeled data into distant supervision for relation extraction. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), volume 2, pages 732–738.

Sebastian Riedel, Limin Yao, Andrew McCallum, and Benjamin M. Marlin. 2013. Relation extraction with matrix factorization and universal schemas. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 74–84.

Lucas Sterckx, Thomas Demeester, Johannes Deleu, and Chris Develder. 2016. Knowledge base population using semantic label propagation. Knowledge-Based Systems, 108(C):79–91.

Md Arafat Sultan, Steven Bethard, and Tamara Sumner. 2015. DLS@CU: Sentence Similarity from Word Alignment and Semantic Vector Composition. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages 148–153.

Zhu Xiaojin and Ghahramani Zoubin. 2002. Learning from labeled and unlabeled data with label propagation. Technical Report CMU-CALD-02-107, Carnegie Mellon University.

Ce Zhang, Feng Niu, Christopher Re, and Jude Shavlik. 2012. Big data versus the crowd: Looking for relationships in all the right places. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, pages 825–834. Association for Computational Linguistics.

Peng Zhou, Wei Shi, Jun Tian, Zhenyu Qi, Bingchen Li, Hongwei Hao, and Bo Xu. 2016. Attention-based bidirectional long short-term memory networks for relation classification. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), volume 2, pages 207–212.


Retrieve and Re-rank: A Simple and Effective IR Approach to Simple Question Answering over Knowledge Graphs

Vishal Gupta
IIIT Hyderabad, India
vishal.gupta@research.iiit.ac.in

Manoj Chinnakotla
Microsoft, Bellevue, USA
[email protected]

Manish Shrivastava
IIIT Hyderabad, India
m.shrivastava@iiit.ac.in

Abstract

SimpleQuestions is a commonly used benchmark for single-factoid question answering (QA) over Knowledge Graphs (KG). Existing QA systems rely on various components to solve different sub-tasks of the problem (such as entity detection, entity linking, relation prediction and evidence integration). In this work, we propose a different approach to the problem and present an information retrieval style solution for it. We adopt a two-phase approach: candidate generation and candidate re-ranking to answer questions. We propose a Triplet-Siamese-Hybrid CNN (TSHCNN) to re-rank candidate answers. Our approach achieves an accuracy of 80%, which sets a new state-of-the-art on the SimpleQuestions dataset.

1 Introduction and Related Work

Knowledge Bases (KB) like Freebase (Google, 2017) and DBpedia1 contain a vast wealth of information. A KB has information in the form of tuples, i.e. a combination of subject, predicate and object (s, p, o). SimpleQuestions (Bordes et al., 2015) is a common benchmark used for single-factoid QA over KB.

Question answering (QA), both on KBs (Lukovnikov et al., 2017; Yin et al., 2016; Fader et al., 2014) and in the open domain (Chen et al., 2017; Hermann et al., 2015), is a well-studied problem. Learning-to-rank approaches have also been applied successfully in QA (Agarwal et al., 2012; Bordes et al., 2014).

In this paper, we introduce an information retrieval (IR) style approach to the QA task and propose a Triplet-Siamese-Hybrid Convolutional Neural Network (TSHCNN) that jointly learns to rank candidate answers.

1 http://dbpedia.org/

Many earlier works (Ture and Jojic, 2017; Yu et al., 2017; Yin et al., 2016) that tackle SimpleQuestions divide the task into multiple sub-tasks (such as entity detection, entity linking, relation prediction and evidence integration), whereas our model tackles all sub-tasks jointly. Lukovnikov et al. (2017) is more similar to our approach in that they train a neural network in an end-to-end manner. However, we differ in the fact that we generate candidate answers jointly (matching both subject and predicate using a single query), and in the fact that we combine both the subject and predicate as well as the question before obtaining the similarity score. At no stage in our approach do we differentiate between the subject and the predicate. Thus, our approach can also be applied in other QA scenarios with or without KBs.

Compared to existing approaches (Yin et al., 2016; Yu et al., 2017; Golub and He, 2016), our model does not employ Bi-LSTMs, attention mechanisms or separate segmentation models, and it achieves state-of-the-art results. We also introduce a custom negative sampling technique that improves results significantly. We conclude with an evaluation of our method and show an ablation study as well as a qualitative analysis of our approach.

2 Our System: IRQA

Our system consists of two components: (1) a candidate generation method for finding the set of relevant candidate answers, and (2) a candidate re-ranking model for getting the top answer from the list of candidate answers.

2.1 Candidate Generation

Any tuple in Freebase (specifically, the object in a tuple is the answer to the question) can be an answer to our question. Freebase contains millions of tuples, and the FB2M subset provided with SimpleQuestions contains 10.8 million tuples.


Figure 1: TSHCNN Architecture

As such, it is important to reduce the search space to make it feasible to apply semantic-based neural approaches. Thus, we propose a candidate retrieval system to narrow down our search space and focus on re-ranking only the most relevant candidates.

Solr2 is an inverted-index search system. We use Solr to index all our Freebase tuples (FB2M) and query for the top-k relevant candidates, providing the question as the input query. We adopt BM25 as the scoring metric to rank results. Our results demonstrate the effectiveness of the proposed method.
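A minimal sketch of this retrieval step using the pysolr client is shown below; the core name, the field names and the BM25-via-schema assumption are illustrative, not the authors' actual configuration.

```python
import pysolr

# Hypothetical Solr core in which each FB2M tuple is a document with a searchable
# "tuple_text" field plus stored subject/predicate/object fields. BM25 similarity
# is assumed to be configured in the core's schema (the default in recent Solr).
solr = pysolr.Solr("http://localhost:8983/solr/fb2m", timeout=10)

def generate_candidates(question, k=200):
    """Return the top-k candidate tuples for a question, ranked by the index's similarity."""
    escaped = question.replace(":", " ").replace('"', " ")   # very light query cleanup
    results = solr.search("tuple_text:(%s)" % escaped, rows=k)
    return [(doc.get("subject"), doc.get("predicate"), doc.get("object"))
            for doc in results]

# candidates = generate_candidates("what is the have wheels will travel book about?")
```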

2.2 Candidate Re-ranking

We use Convolutional Neural Networks (CNNs) to learn the semantic representation of the input text (Kim, 2014; Hu et al., 2015; Zhang et al., 2015). CNNs learn globally word-order-invariant features and at the same time pick up word order in short phrases. Thus, CNNs are ideal for a QA task, since different users may paraphrase the same question in different ways. Siamese networks have shown promising results in distance-based learning methods (Bromley et al., 1993; Chopra et al., 2005; Das et al., 2016) and they possess the capability to learn a similarity metric between questions and answers.

Our candidate re-ranking module is motivated by the success of neural models in various image and text tasks (Vo and Hays, 2016; Das et al., 2016).

2 http://lucene.apache.org/solr/

Our network, as shown in Figure 1, is a Triplet-Siamese-Hybrid Convolutional Neural Network (TSHCNN). Vo and Hays (2016) show that classification-siamese hybrid and triplet networks work well on image similarity tasks. TSHCNN can jointly extract and exchange information from the question and tuple inputs. We attribute this to the fact that we concatenate the pooled outputs of the question and tuple before input to the fully connected network.

All convolution layers in TSHCNN are siamese and share weights. The fully connected layers also share weights. This weight sharing guarantees that a question and its relevant answer are nearer to each other in the semantic space and irrelevant answers are far away. It also reduces the number of parameters to be learned.

We provide an additional input to our network, which is the concatenation of the input question and tuple. This additional input is motivated by the need to learn features for both the question and tuple.

2.2.1 Loss Function

We use the distance-based logistic triplet loss, which Vo and Hays (2016) report exhibits better performance in image similarity tasks. Considering Spos / Sneg as the scores obtained by the question+positive tuple / question+negative tuple, respectively, and L as the logistic triplet loss, we have:

    L = log_e(1 + e^(S_neg − S_pos))        (1)
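A compact PyTorch sketch of a network in this spirit is given below, roughly following Table 1 (90/10/10 filters of width 1/2/3, 100-unit fully connected layers) and Eq. (1) for the loss. Exactly how the three inputs are wired into the shared layers is our guess from the description, so treat it as illustrative rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class TSHCNNSketch(nn.Module):
    """Shared (siamese) convolution + fully connected layers that score a
    question together with a candidate tuple; the same weights score the
    positive and the negative tuple branches."""
    def __init__(self, vocab_size=10000, emb_dim=300, hidden=100):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        # 90, 10 and 10 filters of width 1, 2 and 3 (Table 1), shared by all inputs
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, nf, k) for nf, k in [(90, 1), (10, 2), (10, 3)]])
        self.fc = nn.Sequential(
            nn.Linear(3 * 110, hidden), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(hidden, 1))

    def encode(self, ids):
        x = self.emb(ids).transpose(1, 2)                     # (batch, emb_dim, len)
        return torch.cat([torch.relu(c(x)).max(dim=2).values  # global max pooling
                          for c in self.convs], dim=1)        # (batch, 110)

    def score(self, question_ids, tuple_ids, concat_ids):
        # pooled features of the question, the tuple and their concatenation
        feats = torch.cat([self.encode(question_ids), self.encode(tuple_ids),
                           self.encode(concat_ids)], dim=1)
        return self.fc(feats).squeeze(1)

def logistic_triplet_loss(s_pos, s_neg):
    """Eq. (1): L = log(1 + exp(S_neg - S_pos))."""
    return torch.log1p(torch.exp(s_neg - s_pos)).mean()
```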


Table 1: Network Parameters

Parameter             Value
Batch Size            100
Non-linearity         Relu
CNN Filters & Width   90, 10 and 10 filters of width 1, 2 and 3 resp.
Pool Type             Global Max Pooling
Stride Length         1
FC Layer 1            100 units + 0.2 Dropout
FC Layer 2            100 units + 0.2 Dropout
FC Layer 3            1 unit + No Relu
Optimizer             Adam (default params)

Table 2: End-to-End Answer Accuracy for English Questions

Model                                                      Acc.
Memory NN, Bordes et al. (2015)                            62.7
Attn. LSTM, Golub and He (2016)                            70.9
GRU, Lukovnikov et al. (2017)                              71.2
BiLSTM & BiGRU, Mohammed et al. (2017)                     74.9
CNN & Attn. CNN & BiLSTM-CRF, Yin et al. (2016)            76.4
HR-BiLSTM & CNN & BiLSTM-CRF, Yu et al. (2017)             77.0
BiLSTM-CRF & BiLSTM, Petrochuk and Zettlemoyer (2018)      78.1
Candidate Generation (Ours)                                68.4
Solr & TSHCNN (Ours)                                       80.0

Table 3: Candidate generation results: Recall of top-k answer candidates.

K         1     2     5    10    50   100   200
Recall  68.4  75.7  82.3  85.6  91.4  92.9  94.3

Table 4: Candidate Re-ranking: Ablation Study. CQT: Additional inputs, concatenate question and tuple; SCNS: Solr Candidates as Negative Samples

CQT   SCNS   Accuracy
no    no     49.1
yes   no     68.2
no    yes    69.6
yes   yes    80.0

3 Experiments

We show experiments on the SimpleQuestions (Bordes et al., 2015) dataset, which comprises 75.9k/10.8k/21.7k training/validation/test questions. Each question is associated with an answer, i.e. a tuple (subject, predicate, object) from a Freebase subset (FB2M or FB5M). The subject is given as an MID (a unique ID referring to entities in Freebase), and we obtain its corresponding entity name by processing the Freebase data dumps. We were unable to obtain entity name mappings for some MIDs and removed these from our final set. Our resulting set contained 74,509/10,639/21,300 training/validation/test questions. As in previous work, we show results over the 2M subset of Freebase (FB2M).

We use pre-trained word embeddings3 provided by fastText (Bojanowski et al., 2016) and randomly initialized embeddings between [-0.25, 0.25] for words without embeddings.

3.1 Generating negative samples

In our experiments, we observe that the negative sample generation method has a significant influence on the results. We develop a custom negative sample generation method that generates negative samples similar to the actual answer and helps further increase the discriminatory ability of our network.

We generate 10 negative samples for each training sample. We use the approach of Bordes et al. (2014) to generate 5 of these 10 negative samples. These candidates are samples picked at random and then corrupted following Bordes et al. (2014). Essentially, given (q, t) ∈ D, Bordes et al. (2014) create a corrupted triple t′ with the following method: pick another random triple t̃ from K, and then replace, with 66% chance, each member of t (left entity, predicate and right entity) by the corresponding element in t̃.

Further, we obtain 5 more negative samples by querying the Solr index for the top-5 candidates (excluding the answer candidate), providing each question in the training set as the input query. This second policy is unique in that we generate negative samples closer to the actual answer, thereby providing fine-grained negative samples to our network, as compared to Bordes et al. (2014), who generate only randomly corrupted negative samples.
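A sketch of this sampling scheme is shown below; `solr_topk` stands for a retrieval call like the `generate_candidates` function sketched earlier, and the 66% corruption rate follows the description above.

```python
import random

def corrupt(gold, all_triples):
    """Bordes-style corruption: pick a random triple and, with 66% chance each,
    replace the subject, predicate and object of the gold triple with the
    corresponding element of that random triple."""
    other = random.choice(all_triples)
    return tuple(o if random.random() < 0.66 else g for g, o in zip(gold, other))

def negative_samples(question, gold, all_triples, solr_topk):
    """10 negatives per training pair: 5 random corruptions plus the top Solr
    candidates for the question, excluding the gold answer."""
    negs = [corrupt(gold, all_triples) for _ in range(5)]
    negs += [c for c in solr_topk(question, k=6) if c != gold][:5]
    return negs
```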

3 https://fasttext.cc/


Table 5: Qualitative Analysis. CA: Correct Answer, PA: Predicted Answer

Example 1
  Question: what is the have wheels will travel book about?
  CA: (have wheels will travel, book written work subjects, family)
  PA: (have wheels will travel, book written work subjects, adolescence)

Example 2
  Question: which quartet is known for traditional music?
  CA: (traditional music, music genre artists, the henrys)
  PA: (traditional music, music genre albums, music and friends)

3.2 Evaluation

We report results using the standard evaluation criteria (Bordes et al., 2015), in terms of path-level accuracy, which is the percentage of questions for which the top-ranked candidate fact is correct. A prediction is correct if the system retrieves the correct subject and predicate. Network parameters and decisions are presented in Table 1. We use the top-200 candidates as input to the re-ranking step.
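For clarity, the metric can be sketched as follows (our own illustration; predictions and gold answers are assumed to be (subject, predicate, object) tuples):

def path_level_accuracy(predictions, gold_answers):
    # A prediction counts as correct when both the subject and the predicate
    # of the top-ranked candidate match the gold answer; the object is not checked.
    correct = sum(p[0] == g[0] and p[1] == g[1]
                  for p, g in zip(predictions, gold_answers))
    return 100.0 * correct / len(gold_answers)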

4 Results

In Table 3, we report candidate generation results. As expected, recall increases as we increase k. This initial candidate generation step already surpasses (Table 2) the original Bordes et al. (2015) approach and comes close to more complex neural approaches (Golub and He, 2016; Lukovnikov et al., 2017). This is surprising since this initial step is an inverted-index based approach which retrieves the most relevant candidates based on term matching.

In Table 2, we present end-to-end results4 of existing approaches as well as our model. There is a significant improvement of 17% in accuracy after candidate re-ranking, which we attribute to our TSHCNN model. To gain insight into these improvements, we perform an ablation study (Table 4) of the various components in TSHCNN and describe them in more detail below.

SCNS: Using Solr Candidates as Negative Samples. The scores obtained using our custom negative sample generation method (described in Section 3.1) were 17.3% and 41.8% higher compared to using only the 10 negative samples generated as per Bordes et al. (2014), with and without additional inputs respectively. This is a significant improvement in scores, which we attribute to the fact that negative candidates similar to the actual answer increase the discriminatory ability of the network and lead to more robust training.

CQT: Additional inputs, concatenated question and tuple. Compared to our model without additional inputs, we obtain an improvement of 14.9% and 38.9% in our scores when we provide additional inputs in the form of the concatenated question and tuple, with and without our custom negative sampling approach respectively. One possible explanation for this increase is that the augmented network has 50% more features, which help it learn better intermediate representations. To verify this, we added more filters to our convolution layer such that the total number of features equalled that available when the additional input is provided; however, the improvement in results was only marginal. Another explanation is that the max pooling layer picks out the dominant features from this additional input, and these features improve the distinguishing ability of our network.

Combining both techniques, we gain an impressive 62.9% in scores compared to our model with neither of them. Overall, we achieve an accuracy of 80%, a new state of the art despite having a simple model.

In Table 5, some example outputs of our model are shown. Example 1 shows that the predicted answer is correct (subject and predicate match) but does not match the answer that comes with the question. Example 2 shows that we correctly predict the subject but cannot obtain the correct predicate, owing to the high similarity between the correct answer predicate and the predicted answer predicate.

4 Ture and Jojic (2017) reported an accuracy of 86.8%, but Petrochuk and Zettlemoyer (2018) and Mohammed et al. (2017) have not been able to replicate their results.

5 Conclusion

This paper proposes a simple and effective IR-style approach for QA over a KB. Our TSHCNN model shows impressive results on the SimpleQuestions benchmark. It outperforms many other approaches that use Bi-LSTMs, attention mechanisms or separate segmentation models. We also introduce a negative sample generation method which significantly improves results: the negative samples obtained through Solr increase the discriminatory ability of our network. Our experiments highlight the effectiveness of simple IR models for the SimpleQuestions benchmark.

References

Arvind Agarwal, Hema Raghavan, Karthik Subbian, Prem Melville, Richard D. Lawrence, David Gondek, and James Z. Fan. 2012. Learning to rank for robust question answering. In CIKM.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606.

Antoine Bordes, Nicolas Usunier, Sumit Chopra, and Jason Weston. 2015. Large-scale simple question answering with memory networks.

Antoine Bordes, Jason Weston, and Nicolas Usunier. 2014. Open question answering with weakly supervised embedding models. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), volume 8724 LNAI, pages 165–180.

Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard Sackinger, and Roopak Shah. 1993. Signature verification using a "siamese" time delay neural network. In Proceedings of the 6th International Conference on Neural Information Processing Systems, NIPS'93, pages 737–744, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.

Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to answer open-domain questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1870–1879. Association for Computational Linguistics.

Sumit Chopra, Raia Hadsell, and Yann LeCun. 2005. Learning a similarity metric discriminatively, with application to face verification. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05) - Volume 1, CVPR '05, pages 539–546, Washington, DC, USA. IEEE Computer Society.

Arpita Das, Harish Yenala, Manoj Kumar Chinnakotla, and Manish Shrivastava. 2016. Together we stand: Siamese networks for similar question retrieval. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers.

Anthony Fader, Luke Zettlemoyer, and Oren Etzioni. 2014. Open question answering over curated and extracted knowledge bases. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '14, pages 1156–1165, New York, NY, USA. ACM.

David Golub and Xiaodong He. 2016. Character-level question answering with attention.

Google. 2017. Freebase data dumps. https://developers.google.com/freebase/data.

Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS'15, pages 1693–1701, Cambridge, MA, USA. MIT Press.

Baotian Hu, Zhengdong Lu, Hang Li, and Qingcai Chen. 2015. Convolutional neural network architectures for matching natural language sentences. NIPS, page 2009.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. Pages 1746–1751.

Denis Lukovnikov, Asja Fischer, Jens Lehmann, and Soren Auer. 2017. Neural network-based question answering over knowledge graphs on word and character level. In Proceedings of the 26th International Conference on World Wide Web - WWW '17, pages 1211–1220.

Salman Mohammed, Peng Shi, and Jimmy Lin. 2017. Strong baselines for simple question answering over knowledge graphs with and without neural networks.

Michael Petrochuk and Luke Zettlemoyer. 2018. SimpleQuestions nearly solved: A new upperbound and baseline approach.

Ferhan Ture and Oliver Jojic. 2017. No need to pay attention: Simple recurrent neural networks work! (for answering "simple" questions). In Empirical Methods in Natural Language Processing (EMNLP), pages 2866–2872.

Nam N. Vo and James Hays. 2016. Localizing and orienting street views using overhead imagery. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 9905 LNCS:494–509.

Wenpeng Yin, Mo Yu, Bing Xiang, Bowen Zhou, and Hinrich Schutze. 2016. Simple question answering by attentive convolutional neural network. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 1746–1756. The COLING 2016 Organizing Committee.

Mo Yu, Wenpeng Yin, Kazi Saidul Hasan, Cicero dos Santos, Bing Xiang, and Bowen Zhou. 2017. Improved neural relation detection for knowledge base question answering. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 571–581, Stroudsburg, PA, USA. Association for Computational Linguistics.

Xiang Zhang, Junbo Jake Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In NIPS.


Proceedings of the First Workshop on Fact Extraction and VERification (FEVER), pages 28–33, Brussels, Belgium, November 1, 2018. ©2018 Association for Computational Linguistics

Information Nutrition Labels: A Plugin for Online News Evaluation

Kevin Vincentius, Piyush Aggarwal, Ali Sahan, Birte Hogden, Neelu Madan, Anusha Bangaru, Claudia Schwenger, Farid Muradov, Ahmet Aker

University of Duisburg-Essen
[email protected]

Abstract

In this paper we present a browser plugin, NewsScan, that assists online news readers in evaluating the quality of the online content they read by providing information nutrition labels for online news articles. In analogy to groceries, where nutrition labels help consumers make choices that they consider best for themselves, information nutrition labels tag online news articles with data that help readers judge the articles they engage with. This paper discusses the choice of the labels, their implementation and visualization.

1 Introduction

Nowadays, the amount of online news content is immense and its sources are very diverse. For readers and other consumers of online news who value balanced, diverse and reliable information, it is necessary to have access to methods for evaluating the news articles available to them.

This is somewhat similar to food consumption, where consumers are presented with a huge variety of alternatives and therefore face the challenge of deciding what is good for their health. This is why food packages come with nutrition labels that guide consumers in their decision making. Taking up this analogy, Fuhr and colleagues (2018) discuss the idea of implementing information nutrition labels for news articles. They propose to label every online news article with information nutrition labels that describe the ingredients of the article and give readers a chance to make an informed judgment about what they are reading. The authors discuss nine different information nutrition labels: factuality, readability, virality, emotion/sentiment, opinion/subjectivity/objectivity, controversy, authority/credibility/trust, technicality and topicality. Gollub and colleagues (2018) categorize these labels into fewer dimensions; their aim is to establish group labels that are easily understood by readers. However, neither study goes beyond discussing, proposing and grouping the labels.

In this work we actually implement information nutrition labels and deliver them as a browser plugin that we call NewsScan. We thereby provide a basis for evaluating how well the labels describe online news content and for investigating how useful they are to real users for making decisions about whether to read a news article and whether to trust its content. To avoid biasing the user in any way with respect to the consumption of an article, the information is solely presented but not interpreted. Judgments about news sources should be made by users themselves. Whether an article will be read or discarded depends on the user's own weighing of the importance of the information nutrition labels.

The plugin supports the reader in these tasks through easy-to-understand visualizations of the labels. In this paper we discuss the methods behind the label computation (Section 2) and the design of the user interface (Section 3).

2 Information nutrition labels

NewsScan implements six information nutrition labels: source popularity, article popularity, ease of reading, sentiment, objectivity and political bias. Ease of reading, sentiment and objectivity have been proposed by Fuhr et al. (2018). We propose to add three more nutrition labels: source popularity, article popularity and political bias. Similar to food nutrition labels, the information nutrition labels aim to give the reader a basis for judging the reliability of an article's content. The credibility nutrition label proposed by Fuhr et al. (2018), for instance, indicates whether, e.g., the source the article comes from is credible or not. However, the credibility label already entails a judgment: it aggregates several pieces of information and draws a conclusion from them. We think that instead of presenting the reader with such a judgment, the user might be better informed if we provide the information that forms a possible basis for computing, e.g., the credibility label. The three proposed new labels serve this purpose, i.e. they provide enough detail to enable the user to make an informed judgment about an article's content.

In the following we describe the nutrition labelscurrently implemented within the plugin.

2.1 Source popularity

The label source popularity encompasses two dimensions: the reputation of the news source and its influence.

The reputation of a source is analyzed using the Web of Trust score1. This score is computed by an Internet browser extension that helps people make informed decisions about whether to trust a website or not. It is based on a unique crowdsourcing approach that collects ratings and reviews from a global community of millions of users who rate and comment on websites based on their personal experiences.

The influence of a source is computed using Alexa Rank, Google PageRank2 and popularity on Twitter.

Alexa Rank is a ranking system provided by Alexa.com (a subsidiary of Amazon) that audits and publishes the frequency of visits to various websites. The Alexa ranking is the geometric mean of reach and page views, averaged over a period of three months.

Google PageRank is a link analysis algorithm that assigns a numerical weight to each element of a hyperlinked set of documents, such as the World Wide Web, with the purpose of measuring its relative importance within the set.

Twitter popularity is calculated as the average of the scores for the following two metrics:

• Followers Count: the number of users that are following the source.

• Listed Count: the number of memberships of the source in different topics. It is based on users' activity to add/remove the source from their customized lists. The higher it is, the more diverse the source is.

An overall source popularity score shown to the user is calculated by averaging these four metrics. However, when the icon card is flipped, the user can also see detailed information about each of the above scores.

1 https://www.mywot.com/
2 https://www.domcop.com/openpagerank/
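A minimal Python sketch of this aggregation, assuming each metric has already been normalized to the 0-100 range used by the plugin (the normalization itself is not described here):

def twitter_popularity(followers_score, listed_score):
    # Average of the two Twitter metrics described above.
    return (followers_score + listed_score) / 2.0

def source_popularity(wot_score, alexa_score, pagerank_score, twitter_score):
    # Overall source popularity shown on the front of the card.
    return (wot_score + alexa_score + pagerank_score + twitter_score) / 4.0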

2.2 Article popularity

Popularity = a · log(bx + 1) (1)

where x is the average number of tweets per hour, so that the article popularity is 0 when x is 0. The most popular article we found had around 23 tweets per hour in its peak 24 hours. This is used as a reference value, i.e. an article must have this many tweets to reach a score of 100. The logarithmic function is used because the output has to be scaled properly: for example, an article with five tweets per hour is still relatively popular, even though this is just a fraction of the reference value. Choosing a small value for b makes the function close to linear, which causes even relatively popular articles to have low scores, while a large b makes the function more curved. If b is too large, however, any article with a decent number of tweets will have a score very close to 100. b is chosen empirically to be 1 so that the scores are distributed well between 0 and 100 over a variety of typical news articles, and a is set to 73 to give the reference article a score of 100.
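As a worked example (our own sketch, not the plugin's code): with b = 1 and a = 73, the reference article with 23 tweets per hour scores roughly 100, which implies a base-10 logarithm, and an article with 5 tweets per hour still scores about 57.

import math

def article_popularity(tweets_per_hour, a=73.0, b=1.0):
    # Equation (1); a base-10 logarithm is assumed, since
    # 73 * log10(23 + 1) is approximately 100 for the reference article.
    return min(100.0, a * math.log10(b * tweets_per_hour + 1.0))

# article_popularity(0)  -> 0.0
# article_popularity(5)  -> about 56.8
# article_popularity(23) -> 100 (capped)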

2.3 Ease of reading

As described by Schwarm and Ostendorf (2005), the readability level is used to characterize the educational level a reader needs to understand a text. This topic has been researched since the 1930s, and several automatic solutions have been proposed to determine the readability level of an input text (Vajjala and Meurers, 2013; Xia et al., 2016; Schwarm and Ostendorf, 2005). The core concept in these studies is to use machine learning along with feature engineering covering lexical, structural and heuristic-based features. We followed this core concept and used a Random Forest with features inspired by earlier studies. This approach achieved 73% accuracy on a data set of texts written by students in Cambridge English examinations (Xia et al., 2016). The classifier predicts five different levels of readability varying from A2 (easy) to C2 (difficult) (Xia et al., 2016). We map these values to percentages so that A2 becomes 100% (easy to read) and C2 becomes 20% (difficult to read), as shown in Table 1.

Table 1: Levels of readability

Text level  A2    B1   B2   C1   C2
Value       100%  80%  60%  40%  20%

2.4 Sentiment

A text containing sentiment is written in an emotional style. To determine the sentiment value of an article, our algorithm uses the pattern3.en library (Hayden and de Smet). In this library every word is assigned a sentiment value, which can be negative or positive [-1, 1]. If a word expresses intense positive emotion (e.g. happy, amazing), it is given a high positive value. In line with that, a term indicating intense negative emotion (e.g. bad, disgusting) is assigned a high negative value. A word not carrying any emotion (e.g. the, you, house) has a value near zero. First, the algorithm calculates the sentiment value of every sentence by averaging the absolute sentiment values of its words. After that, the overall sentiment value of the whole news article is calculated: the average over the sentences is taken and multiplied by 100.3

2.5 Objectivity

Objectivity is given when a text is written from a neutral rather than a personal perspective. Phrases like "in my opinion" or "I think" are used by authors to express their individual thoughts, beliefs and attitudes. The process of determining the objectivity of a text is similar to the process of calculating the sentiment value. The aforementioned library pattern3.en (Hayden and de Smet) also includes a subjectivity value for every word.4 Therefore we use it to obtain an objectivity score for articles. Values range from 0 to 1, with a value near 0 indicating objectivity and a value near 1 indicating subjectivity. The overall subjectivity score of an article is calculated as the average over all sentences. However, since we want to measure the objectivity rather than the subjectivity of a text, the values need to be inverted:

Objectivity = 1 − Subjectivity (2)

This score is normalized by multiplying it by 100 to attain a consistent score range for all labels.

3 Since we use the absolute values of the sentiment scores, we are interested in how sentimental a news article is rather than in the valence of the emotions.

4 Similar to sentiment, the algorithm calculates the subjectivity score of every sentence by averaging the subjectivity scores of its words.
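The following Python sketch summarizes the sentiment and objectivity computations of Sections 2.4 and 2.5. It is our own illustration: word_scores is a hypothetical lookup from a word to its (polarity, subjectivity) pair, standing in for the pattern3.en lexicon.

def sentiment_and_objectivity(sentences, word_scores):
    sent_per_sentence, subj_per_sentence = [], []
    for sentence in sentences:
        words = sentence.split()
        n = max(len(words), 1)
        polarities = [abs(word_scores.get(w, (0.0, 0.0))[0]) for w in words]
        subjectivities = [word_scores.get(w, (0.0, 0.0))[1] for w in words]
        # Per-sentence averages of absolute polarity and of subjectivity.
        sent_per_sentence.append(sum(polarities) / n)
        subj_per_sentence.append(sum(subjectivities) / n)
    sentiment = 100.0 * sum(sent_per_sentence) / len(sent_per_sentence)
    subjectivity = sum(subj_per_sentence) / len(subj_per_sentence)
    objectivity = 100.0 * (1.0 - subjectivity)   # Equation (2), scaled to 0-100
    return sentiment, objectivity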

2.6 Political bias

Bias measures the degree to which an article is written from a one-sided perspective that pushes readers to believe in a specific viewpoint without considering opposing arguments.

For calculating political bias we followed Fairbanks et al. (2018) and used two classes that represent different political orientations: conservatism (sources that are biased towards the right) and liberalism (sources that are biased towards the left). The authors also argue that the content of an article is a strong discriminant for distinguishing between biased and non-biased articles. Following the authors, we built a content-based model for predicting political bias in news articles. To achieve this, a logistic regression classifier is trained on a dataset containing articles from The Global Database of Events, Language and Tone Project (The GDELT Project). This database monitors the world's broadcast news in over 100 languages and provides a computing platform. However, it does not contain any information about political bias. To obtain the bias of an article, we crawled the required bias information from Media Bias Fact Check5. Media Bias Fact Check contains human-annotated fact checks for various source domains. For our articles we have left-biased, right-biased and neutral labels. We use a simple bag-of-words approach as features for our logistic regression model. As the label values in our plugin are all shown in a range from 0-100%, the label's landing page shows 0% when the article has no political bias and 100% otherwise, regardless of whether the article is left- or right-biased. When the label's card is flipped, the reader can see whether the article has a left or right political bias.

5 https://mediabiasfactcheck.com
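A minimal sketch of such a content-based bias classifier, assuming scikit-learn and leaving out the GDELT crawling and the Media Bias Fact Check label alignment (the hyperparameters are illustrative, not the paper's):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def train_bias_classifier(article_texts, bias_labels):
    # bias_labels are "left", "right" or "neutral", as crawled from
    # Media Bias Fact Check for each article's source domain.
    model = make_pipeline(
        CountVectorizer(),                  # simple bag-of-words features
        LogisticRegression(max_iter=1000),
    )
    model.fit(article_texts, bias_labels)
    return model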


3 Visualization

3.1 Colors and icons

Our information nutrition labels are represented by simple, easy-to-identify and well-known icons, which have been shown to be easily understood by users (Antunez et al., 2013; Campos et al., 2011; Hersey et al., 2013; Roberto and Khandpur, 2014). Moreover, previous work reports that simple additional texts allow for quicker processing of the information represented by an icon (Campos et al., 2011). Therefore each information nutrition label is additionally shown as text.

To make nutrition labels more comprehensible, colors indicating the amounts of nutrients are helpful (Aschemann-Witzel et al., 2013; Crosetto et al., 2016; Ducrot et al., 2016). Relevant research reports that traffic light colors and monochromatic colors work equally well (Aschemann-Witzel et al., 2013). Traffic light colors are most common, with red indicating high (i.e. negative) and green indicating low (i.e. positive) levels of nutrients (Kim et al., 2018). Since we do not want to bias users towards reading an article or not, but rather give information about its content, we chose to use different shades of blue in our plugin. A light blue indicates low and a dark blue indicates high levels of a certain label. Additionally, blue stands for trust, honesty and security (Venngage, 2018), which should indicate that the user is operating a reliable tool.

When deciding which charts and figures to use, we again took into account that simple and commonly known visualizations are easiest to comprehend (Campos et al., 2011). Thus, we chose plain bar charts for representing overall nutrition label scores as well as the scores of sub-labels, and we enriched them with percentages and coloring. Consequently, the amount of a label contained in a news article is visualized in an understandable and easy-to-process way.

For lettering, the font Futura is used. It is a modern, straightforward and clean typeface often used in state-of-the-art websites and fits the simple and genuine layout.

3.2 Positioning and information distribution

Following the so-called gestalt laws of grouping6, objects that are closer to each other (law of proximity) are perceived as belonging together. Moreover, to indicate the grouping of information, separations between those groups are useful (law of continuity). Therefore, the distinct labels are separated from each other by horizontal and vertical lines.

To obviate information overload (Eppler and Mengis, 2004) when using NewsScan, we reduced the information on the landing page to a minimum. Only the icons with their respective overall scores are visible on the front side of the cards we use for visualization (as shown in Figure 1 for article popularity, ease of reading, sentiment, objectivity and political bias). In this way, users can get a first impression of the different nutrition labels of the article. When time is scarce, a simple visualization where users can easily find the demanded information is most practicable (Crosetto et al., 2016). However, for users interested in more detailed information, we created a backside for each card. The backside also shows the total score and, if available, the relevant sub-labels that are used to calculate the overall score (see Figure 1: source popularity). Additionally, on hovering over the names of the labels and sub-labels, the user gets a short explanation of what the wording and score mean. To further avoid possible confusion on the user side, all of the labels are represented in the same way: overall and sub-label scores are mapped to a range from 0-100%, and icons, texts and charts are arranged consistently.

6 e-teaching.org

Figure 1: NewsScan plugin

4 Evaluation

To evaluate NewsScan in terms of wording, coloring and usability, we will conduct qualitative user studies. Participants will be interviewed and asked about their perception of the tool in general as well as of specific features. Since our aim is not to bias the user towards the consumption of one news article or another, we need to evaluate the plugin in that regard. In particular, the wording of the features could affect the user in forming an opinion about an article. However, we want to help our users make an informed decision, so we need some kind of guidance, and hence wording. To ensure that our tool works only as a guide and not as a specific recommender, we do not interpret, for example, an easy-to-read article as not being worthwhile to read but just as easy to understand. However, we believe some threshold label values for worthwhile and not worthwhile articles would indeed help readers in their decision making. In our evaluation we will aim to incorporate such information and draw conclusions by comparing cases with and without threshold values.

5 Conclusion and implications

In this paper we introduced NewsScan, a browser plugin that assists consumers of online news websites in their decision making about the content they engage with. Readers are guided to make an informed decision about editorials based on six labels: source popularity, article popularity, ease of reading, sentiment, objectivity and political bias. Label values are computed when a news article is retrieved. Through simple visualizations and an intuitive design, the user is presented with the meta-information of the respective piece. To avoid biasing the user in any way with respect to the consumption of an article, the information is solely presented but not interpreted. Judgments about news sources should be made by users themselves. Whether an article is read or discarded relies on the user's opinion and individual weighing of the importance of the six labels.

In our immediate future work we plan to conduct user studies to analyse the validity of information nutrition labels and their usefulness for users. We also plan to investigate and integrate further information nutrition labels. Moreover, it would be interesting to apply NewsScan to further media, such as videos or images accompanying news.

References

Lucia Antunez, Leticia Vidal, Alejandra Sapolinski, Ana Gimenez, Alejandro Maiche, and Gaston Ares. 2013. How do design features influence consumer attention when looking for nutritional information on food labels? Results from an eye-tracking study on pan bread labels. International Journal of Food Sciences and Nutrition, 64(5):515–527.

Jessica Aschemann-Witzel, Klaus G. Grunert, Hans C.M. van Trijp, Svetlana Bialkova, Monique M. Raats, Charo Hodgkins, Grazyna Wasowicz-Kirylo, and Joerg Koenigstorfer. 2013. Effects of nutrition label format and product assortment on the healthfulness of food choice. Appetite, 71:63–74.

Sarah Campos, Juliana Doxey, and David Hammond. 2011. Nutrition labels on pre-packaged foods: a systematic review. Public Health Nutrition, 14(8):1496–1506.

Paolo Crosetto, Laurent Muller, and Bernard Ruffieux. 2016. Helping consumers with a front-of-pack label: Numbers or colors? Experimental comparison between guideline daily amount and traffic light in a diet-building exercise. Journal of Economic Psychology, 55:30–50. Special issue on Food consumption behavior: Economic and psychological perspectives.

Pauline Ducrot, Chantal Julia, Caroline Méjean, Emmanuelle Kesse-Guyot, Mathilde Touvier, Léopold K. Fezeu, Serge Hercberg, and Sandrine Péneau. 2016. Impact of different front-of-pack nutrition labels on consumer purchasing intentions: A randomized controlled trial. American Journal of Preventive Medicine, 50(5):627–636.

Martin J. Eppler and Jeanne Mengis. 2004. The concept of information overload: A review of literature from organization science, accounting, marketing, MIS, and related disciplines. The Information Society, 20(5):325–344.

James Fairbanks, Fitch, Knauf, and Briscoe. 2018. Credibility assessment in the news: Do we need to read? ACM ISBN 123-4567-24-567/08/06.

Norbert Fuhr, Anastasia Giachanou, Gregory Grefenstette, Iryna Gurevych, Andreas Hanselowski, Kalervo Jarvelin, Rosie Jones, Yiqun Liu, Josiane Mothe, Wolfgang Nejdl, et al. 2018. An information nutritional label for online documents. In ACM SIGIR Forum, volume 51, pages 46–66. ACM.

Tim Gollub, Martin Potthast, and Benno Stein. 2018. Shaping the information nutrition label. ECIR.

Andy Hayden and Tom de Smet. Pattern 3. Retrieved July 27, 2018 from https://github.com/pattern3.

James C. Hersey, Kelly C. Wohlgenant, Joanne E. Arsenault, Katherine M. Kosa, and Mary K. Muth. 2013. Effects of front-of-package and shelf nutrition labeling systems on consumers. Nutrition Reviews, 71(1):1–14.

Eojina Kim, Liang (Rebecca) Tang, Chase Meusel, and Manjul Gupta. 2018. Optimization of menu-labeling formats to drive healthy dining: An eye tracking study. International Journal of Hospitality Management, 70:37–48.

Christina A. Roberto and Neha Khandpur. 2014. Improving the design of nutrition labels to promote healthier food choices and reasonable portion sizes. International Journal of Obesity, 38(S1):S25.

Sarah E. Schwarm and Mari Ostendorf. 2005. Reading level assessment using support vector machines and statistical language models. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pages 523–530. Association for Computational Linguistics.

The GDELT Project. Watching our world unfold. Retrieved July 27, 2018 from https://www.gdeltproject.org/.

Sowmya Vajjala and Detmar Meurers. 2013. On the applicability of readability models to web texts. In Proceedings of the Second Workshop on Predicting and Improving Text Readability for Target Reader Populations, pages 59–68.

Venngage. 2018. What marketers should know about the psychology of visual content. Retrieved July 25, 2018 from https://venngage.com/blog/marketing-psychology/.

Menglin Xia, Ekaterina Kochmar, and Ted Briscoe. 2016. Text readability assessment for second language learners. In Proceedings of the 11th Workshop on Innovative Use of NLP for Building Educational Applications, pages 12–22.


Proceedings of the First Workshop on Fact Extraction and VERification (FEVER), pages 34–39, Brussels, Belgium, November 1, 2018. ©2018 Association for Computational Linguistics

Joint Modeling for Query Expansion and Information Extraction with Reinforcement Learning

Motoki Taniguchi, Yasuhide Miura and Tomoko Ohkuma
Fuji Xerox Co., Ltd.

motoki.taniguchi, yasuhide.miura, [email protected]

Abstract

Information extraction about an event can be improved by incorporating external evidence. In this study, we propose a joint model for pseudo-relevance feedback based query expansion and information extraction with reinforcement learning. Our model generates an event-specific query to effectively retrieve documents relevant to the event. We demonstrate that our model has comparable or better performance than the previous model on two publicly available datasets. Furthermore, we analyze the influence of the retrieval effectiveness in our model on the extraction performance.

1 Introduction

Information extraction about an event is gaining growing interest because of the increase in text data. The task of information extraction about an event from text is defined as identifying a set of values, including entities, temporal expressions and numerical expressions, that serve as participants or attributes of the event. The extracted information is useful for various applications such as risk monitoring (Borsje et al., 2010) and decision making support (Wei and Lee, 2004).

Conventional information extraction systems provide higher performance if the amount of labeled data is larger. Labeled training data is expensive to produce, and thus the amount of data is limited. In this case, extraction accuracy can be improved using an alternative approach that incorporates evidence from external sources such as the Web (Kanani and McCallum, 2012; Narasimhan et al., 2016; Sharma et al., 2017). However, this approach faces the following challenges: issuing an effective query to the external source, identifying documents relevant to a target event in the retrieval results, and reconciling the values extracted from the relevant documents.

To overcome these problems, several attempts have been made to model the decisions as a Markov Decision Process (MDP) with deep reinforcement learning (RL) (Narasimhan et al., 2016; Sharma et al., 2017). The agent of these models is trained to maximize expected rewards (extraction accuracy) by performing actions to select an expanded query for the external source and to reconcile values extracted from documents retrieved from the external source. The models use the title of the source document as the original query and templates to expand this query. The expansion terms of the templates are the same for any event, even though the optimal query depends on the event. Therefore, it is still a challenge to issue an effective query to an external source.

In this study, we extend the previous models by introducing query expansion based on pseudo-relevance feedback (PRF) (Xu and Croft, 1996). PRF based query expansion assumes that the top-ranked documents retrieved by an original query are relevant to that query. An agent of our model selects a term from those documents and generates an expanded query by adding the term to the original query. PRF based query expansion enables us to add an event-specific term to the query without additional resources. For instance, consider information extraction about a shooting incident. The query "Shooting on Warren Ave. leaves one dead" retrieves documents which may contain the term "New York". Adding the event-specific term "New York" to the query filters out irrelevant documents and thus improves retrieval performance, because "Warren Ave." is located in "New York". We therefore expect to improve extraction accuracy by introducing PRF based query expansion.

In contrast to the previous models, the candidate terms for query expansion in our model vary depending on the event. Therefore, we exploit the original query and its candidate terms as inputs to the policy networks, in addition to the state information.

Figure 1: Illustration of states, actions and state transitions in our framework. For simplicity, only the sets of current and newly extracted values are shown for the states.

The contributions of this paper are as follows:

• We propose a joint model for PRF based query expansion and information extraction with RL.

• We investigate the oracle extraction accuracy as an indicator of the model's retrieval performance and reveal that PRF based query expansion outperforms the template query on two publicly available datasets.

• We demonstrate that our model achieves comparable or better extraction performance than the previous model on these datasets.

2 Related Work

Information extraction incorporating external sources: Information extraction incorporating external sources has been increasingly investigated in knowledge base population (Ji and Grishman, 2011; West et al., 2014) and multiple document extraction (Mann and Yarowsky, 2005). In contrast to the tasks in these studies, our task poses the challenge of identifying documents relevant to a target event, which complicates the extraction of information.

Narasimhan et al. (2016) and Sharma et al. (2017) modeled information extraction tasks as an MDP with RL. They demonstrated that their models outperformed conventional extractors and meta-classifier models. There are two crucial differences between our model and the aforementioned models. First, the proposed model is trained to select an optimal term for query expansion for each original query instead of a query template. Second, the proposed model also leverages the original query and its candidate terms as input to the policy networks, whereas the above-mentioned models use only state information.

Query expansion: Query expansion can be categorized into global and local methods (Manning et al., 2008). The global methods include query expansion using a manual thesaurus (Voorhees, 1994) and an automatically generated thesaurus based on word co-occurrence statistics over a corpus (Qiu and Frei, 1993). The template query used in the previous RL based models belongs to the global methods.

Several local methods use query logs (Cui et al., 2002) or are based on PRF (Xu and Croft, 1996). We employ a PRF based method because it does not require additional resources. Moreover, local methods have been shown to outperform global methods in information retrieval (Xu and Croft, 1996).

Nogueira and Cho (2017) proposed an RL based approach to model query reformulation in information retrieval. They also reformulate a query through PRF. In contrast, our proposed approach targets information extraction rather than document retrieval: the goal of our task is to extract multiple values of event attributes as well as to retrieve relevant documents. Moreover, the document collection in their RL-based approach (Nogueira and Cho, 2017) is limited to Wikipedia pages or academic papers, while we use the Web as an open document collection.

3 Proposed Model

3.1 Framework

We model the information extraction task as an MDP in a similar manner to Narasimhan et al. (2016) and Sharma et al. (2017); however, the query-selection strategy differs. An agent in our framework selects a term to add to an original query instead of selecting a query template. In this section, we mainly describe the difference between our framework and the previous framework.

At the beginning of each episode, an agent is given a source document from which to extract information about an event. Figure 1 illustrates an example of a state transition. The state comprises the confidence scores of the current and newly extracted values, the match between current and new values, the term frequency-inverse document frequency (TF-IDF) of context words, and the TF-IDF similarity between the source and current documents, similar to the method by Narasimhan et al. (2016). At each step, the agent performs two actions: a reconciliation decision ad, which involves accepting the extracted values for one or all attributes, rejecting all newly extracted values, or ending the episode, and a term selection for query expansion aw. In the example, the agent takes a reconciliation decision ad to update the value of the City attribute from Baltimore to Philadelphia based on the state st at time step t. Simultaneously, the agent performs the term selection aw that is used to form the expanded query for retrieving the next document. The candidate terms W = {w1, w2, ..., wN} are collected from the documents retrieved by an original query q0. Here, we use the title of the source document as q0. The agent selects a term wi from W and generates the expanded query qi by adding wi to q0. qi is used to retrieve the next document, and new values are extracted from that document by the base extractor. The state st+1 of the next step is determined according to the updated values of the event attributes and the newly extracted values. The agent receives a reward, which is defined as the difference between the extraction accuracy in the current time step and that in the previous time step; we add a negative constant to the reward to discourage long episodes. In the subsequent time steps, the agent sequentially chooses the two actions ad and aw until ad is a stop decision. For more details on the state, the reward and the base extractor, refer to Narasimhan et al. (2016).
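As an illustration of the reward just described (a sketch of our own; the value of the negative constant is not given in this excerpt):

def step_reward(prev_values, new_values, gold_values, step_penalty=-0.1):
    # Reward = change in extraction accuracy between consecutive steps,
    # plus a small negative constant that discourages long episodes.
    def accuracy(values):
        return sum(v == g for v, g in zip(values, gold_values)) / len(gold_values)
    return accuracy(new_values) - accuracy(prev_values) + step_penalty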

The agent is trained to optimize the expected rewards by choosing actions to select an expansion term for query expansion and to reconcile values extracted from documents.

Figure 2: Overview of our model. FC: fully connected layer; ⊕: concatenation of vectors.

3.2 Network Architecture

We use neural networks to model the decision policy πd(ad|st) and the term selection policy πq(aw|st) as probability distributions over the candidate actions ad and aw. Figure 2 presents an overview of our policy networks. The decision policy πd(ad|st) and the value function V(st) are calculated using a state representation rst that is obtained with two fully connected layers, in the same manner as in Sharma et al. (2017).

In contrast to the previous framework, the candidate terms in our framework depend on the event. Hence, we utilize a pairwise interaction function whose inputs are the state representation rst and the expanded query representation rqi to calculate the term selection policy πq(aw|st). The words of q0 are embedded using a word embedding layer and processed by a convolutional neural network (CNN) and a max pooling layer, similar to the method by Kim (2014). rqi for each term wi is obtained by feeding the concatenation of the output of the max pooling layer and the word embedding of the candidate term wi to a fully connected layer FCQ1. We then feed the concatenation of rst and rqi for each term wi to a fully connected layer FCQ2. The parameters of FCQ1 and FCQ2 are shared among the candidate terms. πq(aw|st) is obtained by normalizing the outputs of FCQ2 over the candidate terms.
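The following PyTorch sketch illustrates this architecture. The layer names (FCS1, FCS2, FCQ1, FCQ2, FCV) and sizes follow Section 4.1, while the tensor handling, the activations and the decision head are our own assumptions rather than the authors' code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class JointPolicyNetwork(nn.Module):
    def __init__(self, vocab_size, state_dim, n_decisions, emb_dim=300):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # State representation r_st (FCS1, FCS2 with 20 units each).
        self.fcs1 = nn.Linear(state_dim, 20)
        self.fcs2 = nn.Linear(20, 20)
        self.fc_decision = nn.Linear(20, n_decisions)   # head for pi_d (assumed)
        self.fcv = nn.Linear(20, 1)                     # FCV: value function V(s_t)
        # Query encoder: CNN with 200 feature maps, window size 3, max pooling.
        self.conv = nn.Conv1d(emb_dim, 200, kernel_size=3, padding=1)
        self.fcq1 = nn.Linear(200 + emb_dim, 300)       # FCQ1: r_qi per candidate
        self.fcq2 = nn.Linear(20 + 300, 1)              # FCQ2: pairwise interaction

    def forward(self, state, query_ids, cand_ids):
        # state: (state_dim,), query_ids: (L,), cand_ids: (N,) candidate terms
        r_st = torch.relu(self.fcs2(torch.relu(self.fcs1(state))))
        pi_d = F.softmax(self.fc_decision(r_st), dim=-1)
        value = self.fcv(r_st)
        q = self.embed(query_ids).t().unsqueeze(0)           # (1, emb_dim, L)
        q = torch.relu(self.conv(q)).max(dim=2).values       # global max pooling
        q = q.squeeze(0)                                      # (200,)
        cand = self.embed(cand_ids)                           # (N, emb_dim)
        q_rep = q.unsqueeze(0).expand(cand.size(0), -1)
        r_qi = torch.relu(self.fcq1(torch.cat([q_rep, cand], dim=1)))
        pair = torch.cat([r_st.unsqueeze(0).expand(cand.size(0), -1), r_qi], dim=1)
        pi_q = F.softmax(self.fcq2(pair).squeeze(1), dim=0)   # over the N candidates
        return pi_d, pi_q, value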

We train the policy and value networks using the Asynchronous Advantage Actor-Critic (A3C) algorithm (Mnih et al., 2016), as in Sharma et al. (2017). A3C speeds up the learning process of the policy networks by training multiple agents asynchronously. Further details on A3C can be found in Sharma et al. (2017).


Table 1: The number of source documents and downloaded documents in each set.

4 Experiments

4.1 Experimental Setting

We evaluate our model on the Shooting and Adulteration datasets used in Narasimhan et al. (2016). For each original query, we collect candidate terms from the first M words of the top-K documents retrieved through the Bing Search API1. The vocabulary size of the candidate terms, N, is defined as MK + 1 because the null token, namely no query expansion, is also included in the candidate terms. We download the top 20 documents from the Bing Search API as external sources for an expanded query. Statistics of the original datasets and the downloaded documents are given in Table 1.
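A sketch of this candidate collection step (our own illustration; bing_search is a hypothetical wrapper around the Bing Search API that returns document texts):

def collect_candidate_terms(original_query, bing_search, M, K):
    # First M words of each of the top-K retrieved documents, plus a null
    # token for "no expansion", so the vocabulary size is N = M*K + 1.
    docs = bing_search(original_query, top=K)
    terms = []
    for doc in docs:
        terms.extend(doc.split()[:M])
    return terms + ["<null>"]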

Word embeddings are fixed vectors of 300 dimensions, initialized with word2vec embeddings trained on the Google News dataset2. We set the unit sizes of FCS1 and FCS2 to 20, FCQ1 to 300, and FCV and FCQ2 to 1. Further, we set the number of CNN feature maps to 200 and the CNN window size to 3. The discount factor and the constant of the entropy regularization term are set to 0.8 and 0.01, respectively. We use RMSprop (Hinton et al., 2012) as the optimizer and set the number of threads in A3C to 16.

We employ RLIE-A3C (Sharma et al., 2017) as a baseline model to compare with our model, using their public implementation3.

4.2 Results

We evaluate the average extraction accuracy over attributes, as done in Sharma et al. (2017). Table 2 shows the results on Shooting and Adulteration.

1 https://azure.microsoft.com/en-us/services/cognitive-services/bing-web-search-api/
2 https://code.google.com/archive/p/word2vec/
3 https://github.com/adi-sharma/RLIE_A3C

Our model achieves a 1.9 pt increase in average accuracy on Shooting and a 0.2 pt decrease on Adulteration against the RLIE-A3C model when the number of expanded queries N is 5 for Shooting and 4 for Adulteration; these correspond to (M, K) = (4, 1) and (3, 1), respectively. We also vary (M, K) to (10, 1), (10, 2) and (10, 3), which correspond to N = 11, 21 and 31, to evaluate the effect of the number of expanded queries. The accuracy of our models barely changes even as the number of expanded queries increases.

Table 2: Extraction accuracy of RLIE-A3C and our models on the Shooting and Adulteration datasets. Underlined values represent the best values on each dataset.

4.3 Discussion

We evaluate the oracle extraction accuracy to determine the effectiveness of PRF based query expansion and to understand why our model does not perform well on Adulteration. The oracle extraction accuracy is calculated as if an agent perfectly takes actions to select a term for the query and to reconcile the values from the documents. In other words, the oracle extraction accuracy can be regarded as an indicator of retrieval performance.

Table 3 presents the oracle extraction accuracies using only the source documents, the documents retrieved by the original queries, and the documents retrieved by the queries expanded using templates and PRF. Compared with template queries, PRF based query expansion with the same number of queries performed better on both Shooting and Adulteration.

Table 3: Oracle extraction accuracy. "Documents" refers to the documents from which information is extracted.


Table 4: Examples of titles of source documents in the Shooting and Adulteration datasets

Shooting:
  1 Dead, 3 Wounded In Shooting On Roxbury Street
  Police: 3 Dead, 1 In Surgery After Dallas Shooting CBS Dallas
  San Francisco police confirm 4 dead following shooting in Hayes Valley

Adulteration:
  Subway investigates china media reports of doctored expiry dates
  High level of seafood fraud found Denmark
  Fake honey worthing leads to trading standards prosecution

The oracle extraction accuracy further improves with increases in the number of queries N. However, no corresponding difference is found in the extraction performance (see Table 2). This indicates that increasing the number of queries N complicates the selection of an optimal term for query expansion.

Compared to Shooting, the oracle accuracy of the original queries on Adulteration is relatively low. Therefore, the assumption that the top-ranked documents retrieved by the original query are relevant to the original query is not satisfied on Adulteration. We consider this to be why our model does not achieve an improvement in extraction performance on Adulteration. Table 4 shows examples of the titles of source documents used as original queries. We observe that named entities and numerical expressions appear in more titles of the source documents in Shooting than in Adulteration. Therefore, the original queries in Adulteration lack specificity, which weakens the extraction performance.

5 Conclusions

We integrated PRF based query expansion into the task of information extraction with RL. Our model can expand a query into an event-specific query without additional resources. To integrate the PRF based query expansion, we introduced a pairwise interaction function to calculate the term selection policy πq(aw|s). Experimental results showed that our model is comparable to or better than the previous model in terms of extraction performance on two datasets. Furthermore, we analyzed the relationship between retrieval effectiveness and extraction performance.

In future work, we plan to develop a model that can generate a complete term sequence for the expanded query rather than adding a single term to the query.

References

Jethro Borsje, Frederik Hogenboom, and Flavius Frasincar. 2010. Semi-automatic financial events discovery based on lexico-semantic patterns. Int. J. Web Eng. Technol., 6(2):115–140.

Hang Cui, Ji-Rong Wen, Jian-Yun Nie, and Wei-Ying Ma. 2002. Probabilistic query expansion using query logs. In Proceedings of the 11th International Conference on World Wide Web, WWW '02, pages 325–332, New York, NY, USA. ACM.

Geoffrey Hinton, Nitish Srivastava, and Kevin Swersky. 2012. Neural networks for machine learning, lecture 6a: Overview of mini-batch gradient descent.

Heng Ji and Ralph Grishman. 2011. Knowledge base population: Successful approaches and challenges. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 1148–1158. Association for Computational Linguistics.

Pallika H. Kanani and Andrew K. McCallum. 2012. Selecting actions for resource-bounded information extraction using reinforcement learning. In Proceedings of the Fifth ACM International Conference on Web Search and Data Mining, WSDM '12, pages 253–262, New York, NY, USA. ACM.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751. Association for Computational Linguistics.

Gideon Mann and David Yarowsky. 2005. Multi-field information extraction and cross-document fusion. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 483–490. Association for Computational Linguistics.

Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schutze. 2008. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA.

Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. 2016. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937.

Karthik Narasimhan, Adam Yala, and Regina Barzilay. 2016. Improving information extraction by acquiring external evidence with reinforcement learning. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2355–2365. Association for Computational Linguistics.

Rodrigo Nogueira and Kyunghyun Cho. 2017. Task-oriented query reformulation with reinforcement learning. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 574–583. Association for Computational Linguistics.

Yonggang Qiu and Hans-Peter Frei. 1993. Concept based query expansion. In Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '93, pages 160–169, New York, NY, USA. ACM.

Aditya Sharma, Zarana Parekh, and Partha Talukdar. 2017. Speeding up reinforcement learning-based information extraction training using asynchronous methods. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2658–2663. Association for Computational Linguistics.

Ellen M. Voorhees. 1994. Query expansion using lexical-semantic relations. In Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '94, pages 61–69, New York, NY, USA. Springer-Verlag New York, Inc.

Chih-Ping Wei and Yen-Hsien Lee. 2004. Event detection from online news documents for supporting environmental scanning. Decis. Support Syst., 36(4):385–401.

Robert West, Evgeniy Gabrilovich, Kevin Murphy, Shaohua Sun, Rahul Gupta, and Dekang Lin. 2014. Knowledge base completion via search-based question answering. In WWW.

Jinxi Xu and W. Bruce Croft. 1996. Query expansion using local and global document analysis. In Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '96, pages 4–11, New York, NY, USA. ACM.


Proceedings of the First Workshop on Fact Extraction and VERification (FEVER), pages 40–49, Brussels, Belgium, November 1, 2018. ©2018 Association for Computational Linguistics

Towards Automatic Fake News Detection: Cross-Level Stance Detection in News Articles

Costanza Conforti
[email protected]

Mohammad Taher Pilehvar
[email protected]

Nigel Collier
[email protected]

Language Technology Lab, University of Cambridge

Abstract

In this paper, we propose to adapt the four-staged pipeline proposed by Zubiaga et al. (2018) for the Rumor Verification task to the problem of Fake News Detection. We show that the recently released FNC-1 corpus covers two of its steps, namely the Tracking and the Stance Detection tasks. We identify asymmetry in length in the input to be a key characteristic of the latter step when adapted to the framework of Fake News Detection, and propose to handle it as a specific type of Cross-Level Stance Detection. Inspired by theories from the field of Journalism Studies, we implement and test two architectures to successfully model the internal structure of an article and its interactions with a claim.

1 Introduction

The rise of social media platforms, which al-low for real-time posting of news with very lit-tle (or none at all) editorial review at the source,is responsible for an unprecedented growth in theamount of the information available to the pub-lic. While this constitutes an invaluable source offree information, it also facilitates the spread ofmisinformation. In particular, the literature dis-tinguishes between rumors, i.e., pieces of infor-mation which are unverified at the time of post-ing and therefore can turn out to be true or false,and fake news (or hoaxes), i.e., false stories whichare instrumentally made up with the intent to mis-lead the readers and spread disinformation (Zubi-aga et al., 2018).

Both Rumor Verification (RV) and Fake News Detection (FND) constitute very difficult tasks even for trained professionals. Therefore, approaching them in an end-to-end fashion has generally been avoided. Both tasks, however, can be easily split into a number of sub-steps. For instance, Zubiaga et al. (2018) proposed a model for RV which consists of four stages: a rumor detection stage, where potentially rumorous posts are identified, followed by a tracking stage, where posts concerning the identified rumor are collected; after determining the orientation expressed in each post with respect to the rumor (stance detection), the final truth value of the rumor is obtained by aggregating those single stance judgments (veracity classification). As shown in Figure 1, this pipeline can be naturally adapted to FND.

In recent years, several efforts have been made by the research community toward the automatization of some of these stages, in order to provide effective tools to enhance the performance of human journalists in rumor and fake news debunking (Thorne and Vlachos, 2018). Concerning FND, Pomerleau and Rao (2017) recently released a dataset for the Stance Detection step in the framework of the Fake News Challenge1 (FNC-1). The core of the corpus is constituted by a collection of articles discussing 566 claims, 300 of which come from the EMERGENT dataset (Ferreira and Vlachos, 2016). Each article is summarized in a headline and labeled as agreeing (AGR), disagreeing (DSG) or discussing (DSC) the claim. Additionally, unrelated (UNR) samples were created by pairing headlines with random articles. The goal of the challenge was to classify the pairs constituted by a headline and an article as AGR, DSG, DSC or UNR.

Following the pipeline discussed above, it is clear that the FNC-1 actually covers two of the four steps, namely: (1) the tracking step, consisting in filtering out the irrelevant UNR samples; (2) the actual stance detection step, consisting in the classification of a related headline/article pair into AGR, DSG or DSC.

1http://www.fakenewschallenge.org/


[Figure 1 depicts the four-stage pipeline (1. Detection, 2. Tracking, 3. (Asymmetric) Stance Detection, 4. Veracity Classification), with the inputs and outputs of each stage for RV (rumor/not rumor; support, deny, query, comment; true, false, unverified) and FND (potentially fake/not fake; agree, disagree, discuss; true (not fake), false (fake), unverified).]

Figure 1: The rumor verification (RV) pipeline proposed by Zubiaga et al. (2018). The first row describes the corresponding step whereas the second row shows the outputs of each step for both the RV and the fake news detection (FND) tasks. The red rectangle indicates steps covered by the FNC-1 corpus. Figure adapted from Zubiaga et al. (2018).

Note that the amount of semantic understanding needed for the second task is much higher than for the first. In fact, even humans struggle in the related sample classification, as empirically demonstrated by Hanselowski et al. (2018): the inter-annotator agreement of five human judges drops from a Fleiss' κ of .686 to .218 after filtering out the UNR samples. For this reason, we concentrate on the stance detection step, and we make the following contributions:

1. We identify asymmetry in length between headlines and articles as a key characteristic of the FNC-1 corpus: on average, an article contains more than 30 times the number of words contained in its associated headline. This is peculiar with respect to most of the commonly used datasets for stance detection (Mohammad et al., 2017) and requires the development of architectures specifically tailored to this considerable asymmetry. Following the terminology introduced by Jurgens et al. (2014) for Semantic Similarity, we propose to handle the problem as a Cross-Level Stance Detection task. To our knowledge, it is the first time that this task is investigated in isolation.

2. Inspired by theoretical principles in the field of Journalism Studies, we propose two simple neural architectures to model the argumentative structure of an article, and its complex interplay with a headline. We demonstrate that our systems can beat a strong feature-based baseline, based on one of the FNC-1 winning architectures, and that they can successfully model the internal structure of a news article and its relations with a claim, leveraging only word embeddings as input.

2 Related Work

2.1 Stance Detection

Stance Detection (SD) has been defined as the task of determining the attitude expressed in a short piece of text with respect to a target, usually expressed with one or a few words (e.g., Feminism or Climate Change, Mohammad et al. (2016)). In fact, most of the available corpora for SD consider very short samples, such as Tweets. SD became very popular in recent years, resulting in a large number of publications (Mohammad et al., 2017).

To our knowledge, however, no one has explicitly considered the problem of stance detection given as input two items which are considerably asymmetric in length, that is, a long and structured document and a target expressed in the form of a complete sentence rather than as a concept. For this reason, we propose to call the task introduced in the FNC-1 challenge Cross-Level Stance Detection. This is in line with the definition of Cross-Level Semantic Similarity, which measures the degree to which the meaning of a larger (in terms of length) linguistic item is captured by a smaller item (Jurgens et al., 2014).

After reporting on the systems participating in the FNC-1, which released the first SD dataset collecting long documents, we briefly mention some of the most relevant works on SD using Twitter data.

Fake News Challenge. With more than 50 participating groups, the FNC-1 drew high interest from both the research community and industry.


[Figure 2 reproduces an article from the FNC-1 test set about the "Crabzilla" claim, segmented into LEAD, BODY and TAIL, with each sentence annotated as AGR, DSG, DSC or noise with respect to the claim: "Crabzilla! Satellite Picture Reveals Giant Crustacean Lurking Off The Coast Of Whitstable".]

Figure 2: Article from the FNC-1 test set (sample no. 998), analyzed following the inverted pyramid principles (Scanlan, 1999). Notice that single sentences may express a different stance with respect to a claim, while others can be irrelevant, as shown in the leftmost column.

Due to the high number of UNR samples, which constituted almost three quarters of the training set, most of the groups proposed architectures which could perform well on this specific class, that is, in the tracking step of the FND pipeline. The second (Hanselowski et al., 2017) and third (Riedel et al., 2017) classified teams proposed multi-layer perceptron (MLP)-based systems. The best performing system (Baird et al., 2017) is an ensemble of a convolutional neural network (CNN) and a gradient-boosted decision tree. All models, with the exception of the CNN, take as input a number of hand-engineered features. Recently, Hanselowski et al. (2018) enriched the feature set used in Hanselowski et al. (2017) and added a stacked BiLSTM layer to their model, resulting in a modest gain in performance.

All models described above performed very well in the UNR classification (with F1 usually above .98 for this class), achieving considerably worse results on the related samples (Hanselowski et al., 2018).

Rumor Stance Detection on Tweets. The most commonly used datasets for rumor stance detection, the RumorEval (Derczynski et al., 2017) and the PHEME (Zubiaga et al., 2016b) corpora, collect Tweets. State-of-the-art results on the PHEME corpus have been obtained by Aker et al. (2017), who used a very rich set of problem-specific features. Their model beat the previous state-of-the-art system by Zubiaga et al. (2016a), who modeled the tree-structured Twitter conversations using an LSTM, taking as input one conversation branch at a time.

2.2 Journalism Studies: News-writing Prose

Each genre develops its peculiar narrative forms, which allow for the most effective transmission of a message. In modern news-writing prose, especially in Anglo-Saxon journalism, the inverted pyramid style is widely adopted (Scanlan, 1999).

A key element of this well-standardized template is that the most newsworthy facts (the so-called 5W) are presented at the very beginning of the article, the lead, with the remaining information following, in order of importance, in the body of the article: in this section, we can find less essential elements such as quotes, interviews and background or explanatory information; any additional input, such as related stories, images and credits, is put in the very last paragraphs, the tail (Scanlan, 1999). Usually, no more than one or two ideas are expressed in the same paragraph (Sun Technical Publications, 2003). These characteristic elements are clearly visible in Figure 2. This style is particularly suited for rapidly evolving breaking news events, where a journalist can update an article by attaching a new paragraph with the latest updates at the beginning of it. Moreover, putting the most newsworthy facts at the beginning of an article allows impatient readers to quickly decide on their level of interest in the report.

After manual analysis of excerpts of the FNC-1 corpus, we concluded that most articles were actually written following the inverted pyramid principles.


[Figure 3 sketches the article encoder: the word embeddings of each sentence are fed to forward and backward LSTMs, and the encoding of each sentence is conditioned on the previous sentence.]

Figure 3: The architecture of our article encoder, which is based on that of Augenstein et al. (2016). Dotted arrows represent conditional encoding and networks with the same color share the weights.


3 Modeling

3.1 Encoding the article

Based on the elements of Journalism Studies discussed above, we propose a simple architecture based on bidirectional conditional encoding (Augenstein et al., 2016) to encode an article split into n sentences.

Each sentence S_i is first converted into its embedding representation E^{S_i} ∈ R^{e×s_i}, where e is the embedding size and s_i is the length of the i-th tokenized sentence. Then, we encode the article using BiLSTM_A, a bidirectional LSTM which reads the article sentence by sentence in backward order, initializing the first states of its forward and backward components with the last states it has produced after processing the previous sentence (Figure 3).

Notice that we process the article from the bottom to the top, as we assume the most salient information to be concentrated in the lead. By considering an article as an ensemble of sentences which are separately encoded conditioned on their preceding ones, we can model the relationship of each sentence with respect to the others and, at the same time, reduce the number of parameters.
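The following is a minimal illustrative sketch of this sentence-by-sentence conditional encoding, written in PyTorch for brevity (the paper's implementation used Keras and TensorFlow); the hidden size, batching and padding handling are assumptions.

```python
# A minimal sketch (not the authors' released code) of the conditional article
# encoder described above: one shared BiLSTM reads the article sentence by
# sentence in backward order, carrying its final states over between sentences.
import torch
import torch.nn as nn

class ConditionalArticleEncoder(nn.Module):
    def __init__(self, emb_size=300, hidden_size=128):
        super().__init__()
        # One shared BiLSTM (BiLSTM_A) applied to every sentence of the article.
        self.bilstm_a = nn.LSTM(emb_size, hidden_size,
                                bidirectional=True, batch_first=True)

    def forward(self, sentences):
        # sentences: list of tensors of shape (1, sent_len, emb_size),
        # ordered from the first to the last sentence of the article.
        state = None          # zero initial state for the first processed sentence
        encoded = []
        # Read sentences in *backward* order, initialising each sentence's LSTM
        # with the final states produced for the previously processed sentence.
        for sent in reversed(sentences):
            out, state = self.bilstm_a(sent, state)
            encoded.append(out)   # hidden states for every token of the sentence
        return list(reversed(encoded))   # restore top-to-bottom sentence order

# Usage with random embeddings standing in for word2vec vectors:
article = [torch.randn(1, n, 300) for n in (12, 25, 18)]
reps = ConditionalArticleEncoder()(article)
print([r.shape for r in reps])   # three tensors of shape (1, sent_len, 256)
```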

[Figure 4 sketches the forward component of the double-conditional encoder: each sentence LSTM is conditioned first on the headline and then on the previous sentence.]

Figure 4: Detail of the forward component of the double-conditional encoding architecture (best seen in color). Dotted arrows represent conditional encoding and networks with the same color share the weights. The system reads an article from the last sentence to the first, processing each sentence twice: first conditioning on the headline, then conditioning on the previous sentence. Due to lack of space, only the first two sentences of the article are represented.

3.2 Encoding the relationship between the headline and the article

After having encoded the article, we model its relationship with the headline. As shown in Figure 2, we expect single sentences to express a potentially different stance with respect to the headline, while some sentences, especially in the body and the tail, can be irrelevant. For this reason, we separately evaluate the relationship of each sentence, conditioned on the previous sentence(s), with the headline. In this paper, we consider two approaches:

Double-conditional Encoding. As a first method, we modeled the relationship between the headline and the article using conditional encoding.

First, the headline is encoded using a bidirectional LSTM. Then, we separately process each sentence of the article with BiLSTM_H, a BiLSTM conditioned on the last states of the BiLSTM which processed the headline. We finally stack BiLSTM_A on top of BiLSTM_H. In this way, we obtain a matrix H^{S_i} ∈ R^{l×s_i} for each sentence S_i.


Following Wang et al. (2018), we notate this as:

H^{S_i} = BiLSTM_H(E^{S_i}),   ∀ i ∈ 1, ..., n     (1)

H^{S_i} = BiLSTM_A(H^{S_i}),   ∀ i ∈ 1, ..., n     (2)

This process is shown in Figure 4. In this way, we read each sentence S_i, which is encoded in a headline-specific manner, conditioning on the previous sentence(s). Clearly, it would have been possible to obtain a hidden representation for each sentence by first conditioning on the previous sentence(s), and then on the headline. Preliminary experiments, however, showed worse results for this variant, suggesting that having the conditioning on the previous sentence(s) nearer to the decoder is beneficial for the cross-level stance detection task.
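A hedged sketch of the stacking in Eq. 1-2, again in PyTorch rather than the Keras/TensorFlow implementation used by the authors; the state plumbing and layer sizes are assumptions.

```python
# Illustrative only: each sentence is first encoded by BiLSTM_H conditioned on
# the headline's final states (Eq. 1), and its output is re-encoded by BiLSTM_A
# conditioned on the previously processed sentence (Eq. 2).
import torch
import torch.nn as nn

emb, hid = 300, 128
bilstm_head = nn.LSTM(emb, hid, bidirectional=True, batch_first=True)
bilstm_h = nn.LSTM(emb, hid, bidirectional=True, batch_first=True)       # Eq. 1
bilstm_a = nn.LSTM(2 * hid, hid, bidirectional=True, batch_first=True)   # Eq. 2

headline = torch.randn(1, 15, emb)
sentences = [torch.randn(1, n, emb) for n in (20, 30)]

_, head_state = bilstm_head(headline)      # encode the headline once
article_state = None
for sent in reversed(sentences):           # last sentence first (backward reading)
    h_si, _ = bilstm_h(sent, head_state)                     # headline-conditioned
    h_si, article_state = bilstm_a(h_si, article_state)      # sentence-conditioned
    print(h_si.shape)                      # (1, sent_len, 256) per sentence
```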

Co-matching Attention. We also explored the use of attention in order to connect the headline H^H ∈ R^{l×c}, encoded using a BiLSTM layer, with the article's sentences H^{S_1}, ..., H^{S_n}, embedded as explained in Subsection 3.1. Inspired by the architecture proposed by Wang et al. (2018) for multi-choice reading comprehension, we obtain a matrix H^{S_i}, attentively read with respect to the headline, for each sentence at position i ∈ 1, ..., n as follows: we first obtain an aggregated representation of the headline and the i-th sentence, H^{H_i} ∈ R^{l×s_i} (Eq 4), obtained by the dot product of H^H with the attention weights A_i ∈ R^{c×s_i} (Eq 3); then, we obtain co-matching states of each sentence with H^{H_i} using Eq 5:

A_i = softmax((W_h H^H + b_h)^T H^{S_i})     (3)

H^{H_i} = H^H A_i     (4)

H^{S_i} = ReLU(W_s [H^{H_i} ⊖ H^{S_i} ; H^{H_i} ⊗ H^{S_i}] + b_s)     (5)

where W_h ∈ R^{l×l}, W_s ∈ R^{l×2l}, b_h ∈ R^l and b_s ∈ R^{2l} are the parameters to learn. As in Wang et al. (2018), we use the element-wise subtraction ⊖ and multiplication ⊗ to build matching representations of the headline.
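Below is an illustrative (not the authors') PyTorch sketch of Eq. 3-5; the hidden size l, the softmax axis and the tensor layout are assumptions.

```python
# Hedged sketch of the co-matching attention step: attention weights over the
# headline positions (Eq. 3), an aggregated headline representation per sentence
# token (Eq. 4), and co-matching states via subtraction/multiplication (Eq. 5).
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoMatching(nn.Module):
    def __init__(self, l=256):
        super().__init__()
        self.W_h = nn.Linear(l, l)        # W_h, b_h in Eq. 3
        self.W_s = nn.Linear(2 * l, l)    # W_s, b_s in Eq. 5

    def forward(self, H_h, H_s):
        # H_h: (l, c) encoded headline; H_s: (l, s_i) one encoded sentence.
        A = F.softmax(self.W_h(H_h.t()) @ H_s, dim=0)            # Eq. 3: (c, s_i)
        H_agg = H_h @ A                                          # Eq. 4: (l, s_i)
        matched = torch.cat([H_agg - H_s, H_agg * H_s], dim=0)   # (2l, s_i)
        return torch.relu(self.W_s(matched.t()).t())             # Eq. 5: (l, s_i)

# Example with random encodings for a 15-token headline and a 30-token sentence:
cm = CoMatching(l=256)
out = cm(torch.randn(256, 15), torch.randn(256, 30))
print(out.shape)   # torch.Size([256, 30])
```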

Self-attention. After encoding the relationship between the headline and the article, we employ a self-attention mechanism similar to that of Yang et al. (2016) in order to soft-select the most relevant elements of the encoded sentence. Given the sequence of vectors h_{i1}, ..., h_{is_i} in H^{S_i}, obtained with the double-conditional encoding or the co-matching attention approaches described above, the final vector representation of the i-th sentence S_i is obtained as follows:

u_{it} = tanh(W_s h_{it} + b_s)     (6)

α_{it} = exp(u_{it}^T u_s) / Σ_t exp(u_{it}^T u_s)     (7)

s_i = Σ_t α_{it} h_{it}     (8)

where the hidden representation of the word at position t, u_{it}, is obtained through a one-layer MLP (Eq 6). The normalized attention weight α_{it} is then obtained through a softmax operation (Eq 7). Finally, s_i is computed as a weighted sum of all hidden states h_{it} with the weights α_{it} (Eq 8).

3.3 Decoding

Following the inverted pyramid principles, according to which the most relevant information is concentrated at the beginning of the article, we aggregate the sentence vector representations s_1, ..., s_n using a backward LSTM. The final prediction y is then obtained with a softmax operation over the tagset.
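A sketch, under assumed shapes, of the self-attention pooling of Eq. 6-8 followed by the backward-LSTM decoder described in this subsection (PyTorch, for illustration only).

```python
# Illustrative sketch, not the released implementation: pool each sentence's
# hidden states into a single vector (Eq. 6-8), then aggregate the sentence
# vectors with a backward LSTM and predict over the three related classes.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SentencePoolingAndDecoder(nn.Module):
    def __init__(self, l=256, n_classes=3):
        super().__init__()
        self.mlp = nn.Linear(l, l)                 # W_s, b_s in Eq. 6
        self.u_s = nn.Parameter(torch.randn(l))    # context vector u_s in Eq. 7
        self.decoder = nn.LSTM(l, l, batch_first=True)
        self.out = nn.Linear(l, n_classes)

    def pool(self, H):                # H: (l, s_i) hidden states of one sentence
        u = torch.tanh(self.mlp(H.t()))            # Eq. 6: (s_i, l)
        alpha = F.softmax(u @ self.u_s, dim=0)     # Eq. 7: (s_i,)
        return H @ alpha                           # Eq. 8: weighted sum, (l,)

    def forward(self, sentence_states):
        # sentence_states: list of (l, s_i) matrices, first to last sentence.
        s = torch.stack([self.pool(H) for H in sentence_states])   # (n, l)
        # Backward aggregation: reverse the sequence so the leading (most
        # newsworthy) sentences are read last by the LSTM.
        _, (h_n, _) = self.decoder(torch.flip(s, dims=[0]).unsqueeze(0))
        return F.log_softmax(self.out(h_n[-1]), dim=-1)   # AGR/DSG/DSC scores

model = SentencePoolingAndDecoder()
print(model([torch.randn(256, 20), torch.randn(256, 35)]).shape)   # (1, 3)
```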

4 Experimental Setup

4.1 Data and Preprocessing

We downloaded the FNC-1 corpus from the challenge website2. As we wanted to concentrate on the cross-level stance detection sub-task, we only considered related (AGR, DSG and DSC) samples, discarding the noisy UNR samples, which would constitute the output of the tracking step. The distribution of related samples is also very unbalanced, with the DSC class constituting more than a half of the subset and the DSG samples accounting for only 7.5% of the related samples, as shown in Table 1.

        samples    AGR     DSG    DSC     UNR
all     75,385     7.4%    2.0%   17.7%   72.8%
REL     20,491     27.2%   7.5%   65.2%   -

Table 1: Label distribution for the FNC-1 dataset, considering all classes, or only the related samples.

As discussed in the Introduction, the cross-level stance detection task is characterized by an asymmetry in length in the input: on average, headlines are 12.40 tokens long, while articles span from 4 up to 4788 tokens, with an average length of 417.69 tokens.

2https://github.com/FakeNewsChallenge/


                  headline   entire article   sentence
avg no. tokens    12.40      417.69           30.88

Table 2: Asymmetry in length between headlines and articles in the FNC-1 corpus.

An article, however, presents a compositional internal structure, as it can be divided into smaller elements. We used the NLTK sentence tokenizer3 to split articles into sentences, obtaining an average number of 11.97 sentences per article. On average, sentences are 30.88 tokens long, as reported in Table 2.
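For concreteness, a small sketch of this preprocessing step with the NLTK sentence tokenizer; the statistics helper is ours, not part of the released code.

```python
# Split article bodies into sentences with NLTK and compute length statistics
# of the kind reported in Table 2 (average sentences per article and average
# tokens per sentence).
import nltk

nltk.download("punkt", quiet=True)

def article_stats(articles):
    """articles: list of raw article bodies (strings)."""
    sent_counts, sent_lengths = [], []
    for body in articles:
        sentences = nltk.sent_tokenize(body)
        sent_counts.append(len(sentences))
        sent_lengths.extend(len(nltk.word_tokenize(s)) for s in sentences)
    return (sum(sent_counts) / len(sent_counts),
            sum(sent_lengths) / len(sent_lengths))

avg_sents, avg_tokens = article_stats(["First sentence. Second one here."])
print(avg_sents, avg_tokens)
```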

4.2 Baseline

As a baseline, we implemented the Athene model proposed by Hanselowski et al. (2017), which scored second in the FNC-1. We did not use the first-ranked system, as it is an ensemble model, nor the modification to Athene proposed in Hanselowski et al. (2018), as the new feature set and the BiLSTM layer did not significantly improve the performance of the original model. The model consists of a 7-layer deep MLP, with a varying number of units, followed by a softmax. Input is presented as a large matrix of concatenated features, some of which separately encode the headline or the body:

• Presence of refuting and polarity words.
• Tf-idf-weighted BoW unigram features, considering a vocabulary of 5000 entries.

while others jointly consider the headline/body:

• Word overlap between headline and article.
• Count of headline's tokens and ngrams which appear in the article.
• Cosine similarity of the embeddings of nouns and verbs of the headline and the article.

Moreover, they use topic models based on non-negative matrix factorization, latent Dirichlet allocation, and latent semantic indexing. This results in a final set of 10 features. In this way, the asymmetry in length is handled by compressing both the headline and the article into two fixed-size vectors of the same size.
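To make the kind of hand-engineered input concrete, here is a hedged sketch of two of the joint headline/body features listed above (word overlap and headline-token counts); it is illustrative only and not the Athene implementation.

```python
# Two toy feature functions in the spirit of the baseline's joint features:
# Jaccard-style word overlap and the count of headline tokens found in the body.
def word_overlap(headline, body):
    h, b = set(headline.lower().split()), set(body.lower().split())
    return len(h & b) / max(len(h | b), 1)

def headline_token_hits(headline, body):
    body_tokens = body.lower().split()
    return sum(body_tokens.count(tok) for tok in set(headline.lower().split()))

print(word_overlap("Giant crab spotted", "A giant crab was spotted offshore"))
print(headline_token_hits("Giant crab spotted", "A giant crab was spotted offshore"))
```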

The same hyperparameters as in Hanselowski et al. (2017) have been used for the implementation of the model.

3https://www.nltk.org/_modules/nltk/tokenize.html

Max headline length (in tokens)       15
Max sentence length (in tokens)       35
Max number of sentences per article   7
Word embedding size                   300
BiLSTM cell size                      2×128
Embedding dropout                     0.1
BiLSTM dropout                        0.3
Dense layer dropout                   0.2
Epochs                                70
Batch size                            32
Optimizer                             Adam
Learning rate                         0.001

Experiments using additional input channels

Max word length (in characters)       35
Character embedding size              30
Character BiLSTM cell size            2×64
Character BiLSTM dropout              0.2
NE embedding size                     30

Table 3: Hyperparameter configuration.

For training, we downloaded the feature matrices which had been used in Athene's best submission4, taking only the indices corresponding to the related samples.

4.3 (Hyper-)Parameters

The high-level structure of the models has been implemented with Keras, while single layers have been written in TensorFlow. The (hyper-)parameters used for training, useful for replicating the experiments, are reported in Table 3. Concerning vocabulary creation, we included only words occurring more than 7 times. The embedding matrix has been initialized with word2vec embeddings5, which performed better than other sets of pre-trained embeddings according to some preliminary experiments. This can be partially explained by the fact that word2vec embeddings are trained on part of the Google News corpus, thus on the same domain as the FNC-1 dataset. OOV words have been zero-initialized. In order to avoid overfitting, we did not update word vectors during training.
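A sketch of this vocabulary and embedding initialisation (minimum frequency, word2vec vectors, zero-initialised OOV words); the gensim usage and the file path are assumptions, not details from the paper.

```python
# Build a vocabulary of words occurring more than 7 times and an embedding
# matrix initialised from pre-trained word2vec vectors; OOV rows stay at zero.
import numpy as np
from collections import Counter
from gensim.models import KeyedVectors

def build_embedding_matrix(tokenized_texts, w2v_path, dim=300, min_count=7):
    counts = Counter(tok for text in tokenized_texts for tok in text)
    vocab = [w for w, c in counts.items() if c > min_count]
    kv = KeyedVectors.load_word2vec_format(w2v_path, binary=True)
    matrix = np.zeros((len(vocab) + 1, dim), dtype=np.float32)   # row 0 = padding
    for i, word in enumerate(vocab, start=1):
        if word in kv:
            matrix[i] = kv[word]   # pre-trained vector
        # OOV words remain zero-initialised; vectors are frozen during training
    word2id = {w: i for i, w in enumerate(vocab, start=1)}
    return matrix, word2id
```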

4.4 Evaluation Metrics

As we are not considering the UNR samples, the FNC-1 score would not constitute a good metric, as it distinguishes between related and unrelated samples for scoring6. Following Zubiaga et al. (2018) and Hanselowski et al. (2018), we use macro-averaged precision, recall and F1 measure, which are less affected by the high class unbalance (Table 1). We also consider the accuracy with respect to the single AGR, DSG and DSC classes.
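An illustrative computation of these metrics with scikit-learn; per-class accuracy is approximated here by per-class recall, and the toy labels are ours.

```python
# Macro-averaged precision/recall/F1 plus a per-class score over the three
# related classes, using scikit-learn.
from sklearn.metrics import precision_recall_fscore_support, recall_score

labels = ["AGR", "DSG", "DSC"]
y_true = ["AGR", "DSC", "DSG", "DSC", "AGR"]
y_pred = ["AGR", "DSC", "DSC", "DSC", "DSG"]

p, r, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, labels=labels, average="macro", zero_division=0)
per_class = recall_score(y_true, y_pred, labels=labels, average=None, zero_division=0)
print(f"macro P={p:.3f} R={r:.3f} F1={f1:.3f}")
print(dict(zip(labels, per_class)))
```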

4https://drive.google.com/open?id=0B0-muIdcdTp7UWVyU0duSDRUd3c

5https://code.google.com/archive/p/word2vec/

6https://github.com/FakeNewsChallenge/fnc-1-baseline/blob/master/utils/score.py


Figure 5: Performance of the co-matching encoder in terms of macro-averaged F1 score on the test set, considering the first n sentences of an article. Blue and red violins represent respectively backward and forward encoding of the considered sentences.


5 Results and Discussion

As shown in Table 4, both encoders described in Section 3 outperformed the baseline (line 1), despite having a considerably smaller number of parameters. In particular, the feature-based model obtained a relatively good performance in classifying the very infrequent DSG labels, probably thanks to its large number of hand-engineered features. However, it shows some difficulties in discriminating between AGR and DSC samples. This is probably a consequence of the system flattening the entire article into a fixed-size vector: this inevitably causes the system to lose the subtle nuances in the argumentative structure of the news story, which allow for distinguishing between AGR and DSC samples, and to favor the most common DSC class. On the contrary, our architectures approach the asymmetry in length in the input by carefully encoding the articles as a hierarchical sequence of sentences, and by separately modeling their relative positions with respect to the headline. In this way, they are able to successfully discriminate between AGR and DSC samples.

In general, the encoder based on co-matching attention performed clearly better than the architecture based on double-conditional encoding (lines 2 and 10), reaching a higher performance in all metrics but the classification of the DSC samples.

5.1 Modeling the Inverted Pyramid

In order to test our assumption that the great majority of the FNC-1 corpus was written following the inverted pyramid style, we took the co-matching attention model, which performed better than the double-conditionally encoded architecture, and progressively reduced the number of considered sentences. Moreover, we modified the article encoder (Subsection 3.1) in order to process the input sequence in forward and not in backward order. For each of these 13 settings7, we ran 10 simulations.

As the violin plots in Figure 5 show (blue violins), considering a reduced number of sentences does not correlate with an overly big drop in performance, until fewer than four sentences are taken. Below this threshold, the ability of the system to correctly classify the stance of the article is compromised. This can be explained with the inverted pyramid theory: as long as we consider a number of sentences sufficient to include the lead and part of the body of the article, the system can rely on a sufficient number of elements to discriminate its stance. On the contrary, if we consider only the very first sentences, the system can get confused, being exposed to only a portion of the (sometimes opposing) opinions expressed in the article.

7Specifically: 7 backward-encoded co-matching architectures (considering a number of sentences from 1 up to 7) and 6 forward-encoded co-matching architectures (considering a number of sentences from 2 up to 7).


     Model                          input          anonymized   accuracy (AGR, DSG, DSC)   macro-averaged (P, R, F1)
1    Baseline                       -              -            26.69   11.76   74.77      39.39   37.74   38.00

     Double-conditional Encoding
2                                   -              no           68.84   09.61   77.42      52.50   51.25   49.81
3                                   -              yes          63.11   09.76   76.03      52.31   49.63   51.77
4                                   + char         no           51.45   23.96   76.32      50.12   50.57   50.32
5                                   + char         yes          59.64   16.93   77.64      53.27   51.40   51.38
6                                   + ner          no           69.57   05.02   76.97      52.14   51.22   48.86
7                                   + ner          yes          75.78   09.33   69.96      54.41   51.31   51.17
8                                   + char + ner   no           62.11   13.20   77.11      52.83   50.80   50.42
9                                   + char + ner   yes          76.77*  12.34   67.47      53.45   49.85   50.56

     Co-matching attention
10                                  -              no           69.57   33.0*   74.91      64.14*  58.53*  59.01*
11                                  -              yes          64.37   29.27   78.94*     59.64   55.20   57.12
12                                  + char         no           74.46   24.82   71.95      61.77   56.46   58.55
13                                  + char         yes          70.52   15.06   76.39      63.88   55.23   56.01
14                                  + ner          no           66.05   18.79   72.38      58.51   51.48   52.52
15                                  + ner          yes          69.42   31.85   72.38      58.59   54.67   57.32
16                                  + char + ner   no           63.95   20.95   77.76      57.32   53.10   53.90
17                                  + char + ner   yes          67.26   11.05   76.99      57.19   51.84   54.85

Table 4: Results of experiments using double-conditional encoding or co-matching attention. Best results for each encoding type are shown in bold. Best results overall are indicated with an asterisk.

Interestingly, our system seems to be quite robust to the noisy sentences which could be included when considering a higher number of sentences.

The assumption that most of the articles in the FNC-1 corpus are written following the inverted pyramid principles is further confirmed by the fact that, above the threshold of 4 considered sentences, simulations using forward encoding consistently perform worse than those using backward encoding (the red violins in Figure 5). Reasonably, below this threshold, we do not observe a considerable difference in performance between backward and forward models.

5.2 Additional Experiments

5.2.1 Using additional Input Channels

To investigate the impact of features other than word embeddings, we consider two further input channels:

• Named Entities (NE): NEs were obtained using the Stanford NE Recognizer (Finkel et al., 2005), resulting in a tagset of 13 labels.
• Characters: each input word was split into characters. Only characters occurring more than 100 times in the training set were considered, obtaining a final vocabulary of 149 characters. As in Lample et al. (2016), we concatenate the output of a BiLSTM run over the character sequence.

The output of each input channel is concatenated with the word embedding, and passed to the article encoder described in Subsection 3.1. The hyperparameters used for these experiments are reported in Table 3.

5.2.2 Anonymizing the input

After manual analysis of the predictions, we suspected that some models could have spotted correlations between certain Named Entities and a specific stance in the training set. Some of those correlations are well known and can be useful in veracity detection (Wang, 2017). In this paper, however, we wanted to train a model for stance detection based only on its language understanding, without counting on such possibly accidental correlations.

In order to prevent the systems from relying on chance correlations, which would not generalize to the test set, we modified the input sequences by substituting all input tokens labeled as <PERSON>, <ORGANIZATION> and <LOCATION> by the Stanford Named Entity Recognizer with the corresponding NE tags.
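A minimal sketch of this anonymisation step; the (token, tag) pairs are assumed to come from the Stanford NE Recognizer and are hard-coded here for the example.

```python
# Replace tokens tagged as PERSON, ORGANIZATION or LOCATION by their NE tag.
ANON_TAGS = {"PERSON", "ORGANIZATION", "LOCATION"}

def anonymize(tagged_tokens):
    """tagged_tokens: list of (token, ne_tag) pairs."""
    return [f"<{tag}>" if tag in ANON_TAGS else tok for tok, tag in tagged_tokens]

tagged = [("Quinton", "PERSON"), ("Winter", "PERSON"), ("spotted", "O"),
          ("the", "O"), ("crab", "O"), ("in", "O"), ("Whitstable", "LOCATION")]
print(" ".join(anonymize(tagged)))
# <PERSON> <PERSON> spotted the crab in <LOCATION>
```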

5.2.3 Results

Results of the experiments concatenating the previously mentioned features to the word embedding input of both architectures are reported in Table 4 (even lines). In general, using NE embeddings alone with word embeddings was not beneficial for either model. Considering the architecture based on double-conditional encoding, using both


characters and NE features actually led to (sometimes small) improvements in almost all considered evaluation metrics. Moving to the architecture using co-matching attention, adding character or NE embeddings, even in combination, caused a considerable drop in all evaluation metrics, apart from some single-label classification results (such as the AGR class).

As shown in Table 4 (odd lines), anonymizing the input was always useful for the architecture using double-conditional encoding, resulting in a consistently higher macro-averaged F1

score. Considering the architecture based on co-matching attention, however, anonymizing the input was beneficial only for architectures leveraging NE tags (alone with word embeddings, or in combination with character embeddings), which were also the ones showing the highest drop in performance with respect to the model using only word embeddings.

The best performance according to macro-averaged precision, recall and F1 score is obtained using the co-matching attention model leveraging only word embeddings. The high performance of this model is mainly due to its ability to discriminate the very infrequent DSG class.

6 Conclusions

We proposed two simple architectures for Cross-Level Stance Detection, which were carefully designed to model the internal structure of a news article and its relations with a claim. Results show that our "journalistically" motivated approach can beat a strong feature-based baseline, without relying on any language-specific resources other than word embeddings. This indicates that an interdisciplinary dialogue between Natural Language Processing and Journalism Studies can be very fruitful for fighting Fake News.

In future work, we aim to put together the different stages of the FND pipeline. Following the work of Kochkina et al. (2018) for RV, it could be interesting to compare a sequential approach, which separately solves each step of the pipeline in isolation, with a joint multi-task system. The generalizability of the models trained on the FND pipeline to other domains could be tested with the recently released ARC corpus (Hanselowski et al., 2018), which has similar statistical characteristics to the FNC-1 corpus.

Acknowledgments

Special thanks to Chiara Severgnini, journalist at 7 (the weekly magazine supplement of Corriere della Sera), for her comments on Section 2.2. The first author (CC) would like to thank the Siemens Machine Intelligence Group (CT RDA BAM MIC-DE, Munich) and the NERC DREAM CDT (grant no. 1945246) for partially funding this work. The third author (NC) is grateful for support from the UK EPSRC (grant no. EP/MOO5089/1). We thank the anonymous reviewers of this paper for their efforts and for the constructive comments and suggestions.

References

Ahmet Aker, Leon Derczynski, and Kalina Bontcheva. 2017. Simple open stance classification for rumour analysis. In Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017, Varna, Bulgaria, September 2-8, 2017, pages 31–39.

Isabelle Augenstein, Tim Rocktäschel, Andreas Vlachos, and Kalina Bontcheva. 2016. Stance detection with bidirectional conditional encoding. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, pages 876–885.

Sean Baird, Doug Sibley, and Yuxi Pan. 2017. Talos targets disinformation with fake news challenge victory. https://blog.talosintelligence.com/2017/06/.

Leon Derczynski, Kalina Bontcheva, Maria Liakata, Rob Procter, Geraldine Wong Sak Hoi, and Arkaitz Zubiaga. 2017. SemEval-2017 task 8: RumourEval: Determining rumour veracity and support for rumours. In Proceedings of the 11th International Workshop on Semantic Evaluation, SemEval@ACL 2017, Vancouver, Canada, August 3-4, 2017, pages 69–76.

William Ferreira and Andreas Vlachos. 2016. Emergent: a novel data-set for stance classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1163–1168.

Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pages 363–370. Association for Computational Linguistics.


Andreas Hanselowski, PVS Avinesh, Benjamin Schiller, and Felix Caspelherr. 2017. Description of the system developed by team Athene in the FNC-1. Technical report.

Andreas Hanselowski, Avinesh P. V. S., Benjamin Schiller, Felix Caspelherr, Debanjan Chaudhuri, Christian M. Meyer, and Iryna Gurevych. 2018. A retrospective analysis of the fake news challenge stance-detection task. In Proceedings of the 27th International Conference on Computational Linguistics, COLING 2018, Santa Fe, New Mexico, USA, August 20-26, 2018, pages 1859–1874.

David Jurgens, Mohammad Taher Pilehvar, and Roberto Navigli. 2014. SemEval-2014 task 3: Cross-level semantic similarity. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pages 17–26.

Elena Kochkina, Maria Liakata, and Arkaitz Zubiaga. 2018. All-in-one: Multi-task learning for rumour verification. In Proceedings of the 27th International Conference on Computational Linguistics, COLING 2018, Santa Fe, New Mexico, USA, August 20-26, 2018, pages 3402–3413.

Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, USA, June 12-17, 2016, pages 260–270.

Saif Mohammad, Svetlana Kiritchenko, Parinaz Sobhani, Xiaodan Zhu, and Colin Cherry. 2016. SemEval-2016 task 6: Detecting stance in tweets. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 31–41.

Saif M. Mohammad, Parinaz Sobhani, and Svetlana Kiritchenko. 2017. Stance and sentiment in tweets. ACM Transactions on Internet Technology (TOIT), 17(3):26.

Dean Pomerleau and Delip Rao. 2017. Fake news challenge. http://www.fakenewschallenge.org/.

Benjamin Riedel, Isabelle Augenstein, Georgios P. Spithourakis, and Sebastian Riedel. 2017. A simple but tough-to-beat baseline for the fake news challenge stance detection task. arXiv preprint arXiv:1707.03264.

Christopher Scanlan. 1999. Reporting and Writing: Basics for the 21st Century. Oxford University Press.

Sun Technical Publications. 2003. Read Me First!: A Style Guide for the Computer Industry. Prentice Hall Professional.

James Thorne and Andreas Vlachos. 2018. Automated fact checking: Task formulations, methods and future directions. In Proceedings of the 27th International Conference on Computational Linguistics, COLING 2018, Santa Fe, New Mexico, USA, August 20-26, 2018, pages 3346–3359.

Shuohang Wang, Mo Yu, Jing Jiang, and Shiyu Chang. 2018. A co-matching model for multi-choice reading comprehension. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 2: Short Papers, pages 746–751.

William Yang Wang. 2017. "Liar, liar pants on fire": A new benchmark dataset for fake news detection. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 2: Short Papers, pages 422–426.

Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1480–1489.

Arkaitz Zubiaga, Ahmet Aker, Kalina Bontcheva, Maria Liakata, and Rob Procter. 2018. Detection and resolution of rumours in social media: A survey. ACM Computing Surveys (CSUR), 51(2):32.

Arkaitz Zubiaga, Elena Kochkina, Maria Liakata, Rob Procter, and Michal Lukasik. 2016a. Stance classification in rumours as a sequential task exploiting the tree structure of social media conversations. In COLING 2016, 26th International Conference on Computational Linguistics, Proceedings of the Conference: Technical Papers, December 11-16, 2016, Osaka, Japan, pages 2438–2448.

Arkaitz Zubiaga, Maria Liakata, Rob Procter, Geraldine Wong Sak Hoi, and Peter Tolmie. 2016b. Analysing how people orient to and spread rumours in social media by looking at conversational threads. PloS ONE, 11(3):e0150989.


Proceedings of the First Workshop on Fact Extraction and VERification (FEVER), pages 50–59, Brussels, Belgium, November 1, 2018. ©2018 Association for Computational Linguistics

Belittling the Source: Trustworthiness Indicators to Obfuscate Fake News on the Web

Diego Esteves
SDA Research, University of Bonn, Germany
[email protected]

Aniketh Janardhan Reddy+,*
Carnegie Mellon University, Pittsburgh, USA
[email protected]

Piyush Chawla*
The Ohio State University, Ohio, USA
[email protected]

Jens Lehmann
SDA Research / Fraunhofer IAIS, University of Bonn, Germany
[email protected]

Abstract

With the growth of the internet, the number of fake news online has been proliferating every year. The consequences of such phenomena are manifold, ranging from lousy decision-making processes to bullying and violence episodes. Therefore, fact-checking algorithms became a valuable asset. To this aim, an important step to detect fake news is to have access to a credibility score for a given information source. However, most of the widely used Web indicators have either been shut down to the public (e.g., Google PageRank) or are not free to use (Alexa Rank). Further, existing databases are short, manually curated lists of online sources, which do not scale. Finally, most of the research on the topic is theory-based or explores confidential data in a restricted simulation environment. In this paper we explore current research, highlight the challenges and propose solutions to tackle the problem of classifying websites on a credibility scale. The proposed model automatically extracts source reputation cues and computes a credibility factor, providing valuable insights which can help in belittling dubious and confirming trustful unknown websites. Experimental results outperform the state of the art in the 2-class and 5-class settings.

1 Introduction

With the enormous daily growth of the Web, the number of fake-news sources has also been increasing considerably (Li et al., 2012). This social network era has provoked a communication revolution that has boosted the spread of misinformation, hoaxes, lies and questionable claims. The proliferation of unregulated sources of information allows any person to become an opinion provider with no restrictions.

+Work was completed while the author was a student at the Birla Institute of Technology and Science, India and was interning at SDA Research.

*These two authors contributed equally to this work.

For instance, websites spreading manipulative political content or hoaxes can be persuasive. To tackle this problem, different fact-checking tools and frameworks have been proposed (Zubiaga et al., 2017), mainly divided into two categories: fact-checking over natural language claims (Thorne and Vlachos, 2018) and fact-checking over knowledge bases, i.e., triple-based approaches (Esteves et al., 2018). Overall, fact-checking algorithms aim at determining the veracity of claims, which is considered a very challenging task due to the nature of the underlying steps, from natural language understanding (e.g., argumentation mining) to common-sense verification (i.e., humans have prior knowledge that makes it far easier to judge which arguments are plausible and which are not). Yet an important underlying fact-checking step relies upon computing the credibility of sources of information, i.e., indicators that allow answering the question: "How reliable is a given provider of information?". Due to the obvious importance of the Web and the negative impact that misinformation can cause, methods to demote dubious websites also become a valuable asset. In this sense, the high number of new websites appearing every day (Netcraft, 2016) makes straightforward approaches, such as blacklists and whitelists, impractical. Moreover, such approaches are not designed to compute credibility scores for a given website, but rather to label websites in a binary fashion. Thus, they aim at detecting mostly "fake" (threatening) websites, e.g., phishing detection, which is out of the scope of this work. Hence, open credibility models have great importance, especially due to the increase of fake news being propagated. There is much research into credibility factors. However, it is mostly grouped as follows: (1) theoretical research on psychological aspects of credibility and (2) experiments performed over private


and confidential user information, mostly from web browser activities (strongly supported by private companies). Therefore, while (1) lacks practical results, (2) reports findings which are not very appealing to the broad open-source community, given the non-open nature of the conducted experiments and data privacy concerns. Finally, recent research on credibility has also pointed out important drawbacks, as follows:

1. Manual (human) annotation of credibility indicators for a set of websites is costly (Haas and Unkel, 2017).

2. Search engine results pages (SERP) do not provide more than a few information cues (URL, title and snippet), and the dominant heuristic happens to be the search engine (SE) rank itself (Haas and Unkel, 2017).

3. Only around 42.67% of websites are covered by the credibility evaluation knowledge base, where most domains have a low credibility confidence (Liu et al., 2015).

Therefore, automated credibility models play an important role in the community, although they are not yet broadly explored in practice. In this paper, we focus on designing computational models to predict the credibility of a given website rather than performing sociological experiments or experiments with end users (simulations). In this scenario, we expect that a website from a domain such as bbc.com gets a higher trustworthiness score compared to one from wordpress.com, for instance.

2 Related Work

Credibility is an important research subject in several different communities and has been the subject of study over the past decades. Most of the research, however, focuses on theoretical aspects of credibility and its persuasive effect on different fundamental problems, such as economic theories (Sobel, 1985).

2.1 Fundamental Research

A thorough examination of psychological aspects in evaluating document credibility has been conducted (Fogg and Tseng, 1999; Fogg et al., 2001, 2003), which reports numerous challenges. Apart from sociological experiments, Web Credibility, from a more practical perspective, has a different focus of research, described as follows:

Rating Systems, Simulations are mostly platform-based solutions to conduct experiments (mostly using private data) in order to detect credibility factors. Nakamura et al. (2007) surveyed internet users from all age groups to understand how they identified trustworthy websites. Based on the results of this survey, they built a graph-based ranking method which helped users in gauging the trustworthiness of search results retrieved by a search engine when issued a query Q. A study by Stanford University revealed important factors that people notice when assessing website credibility (Fogg et al., 2003), mostly visual aspects (web site design, look and information design). The writing style and bias of information play a small role in defining the level of credibility (selected by approximately 10% of the comments). However, this process of evaluating the credibility of web pages by users is impacted only by the number of heuristics they are aware of (Fogg, 2003), biasing the human evaluation w.r.t. a limited and specific set of features. An important factor considered by humans to judge credibility relies on the search engine results page (SERP): the higher ranked a website is when compared to other retrieved websites, the more credible people judge it to be (Schwarz and Morris, 2011). Popularity is yet another major credibility factor (Giudice, 2010). Liu et al. (2015) proposed to integrate recommendation functionality into a Web Credibility Evaluation System (WCES), focusing on the user's feedback. Shah et al. (2015) propose a full list of important features for credibility aspects, such as (1) the quality of the design of the website and (2) how well the information is structured. In particular, the perceived accuracy of the information was ranked only in 6th place. Thus, superficial website characteristics as heuristics play a key role in credibility evaluation. Dong et al. (2015) propose a different method (KBT) to estimate the trustworthiness of a web source based on the information given by the source (i.e., it applies fact-checking to infer credibility). This information is represented in the form of triples extracted from the web source. The trustworthiness of the source is determined by the correctness of the extracted triples. Thus, the score is computed based on endogenous signals (e.g., correctness of facts) rather than exogenous signals (e.g., links).


Unfortunately, this research from Google does not provide open data. It is worth mentioning that, surprisingly, their hypothesis (content is more important than visuals) contradicts previous research findings (Fogg et al., 2003; Shah et al., 2015). While this might be due to the dynamic characteristics of the Web, this contradiction highlights the need for more research into the real use of web credibility factors w.r.t. automated web credibility models. Similar to Nakamura et al. (2007), Singal and Kohli (2016) propose a tool (dubbed TNM) to re-rank URLs extracted from the Google search engine according to the trust maintained by the actual users. Apart from the search engine API, their tool uses several other APIs to collect website usage information (e.g., traffic and engagement info). Kakol et al. (2017) perform extensive crowdsourcing experiments that contain credibility evaluations, textual comments, and labels for these comments.

SPAM/phishing detection: Abbasi et al. (2010) propose a set of design guidelines which advocated the development of SLT-based classification systems for fraudulent website detection, i.e., websites that, despite seeming credible, try to obtain private information and defraud visitors. PhishZoo (Afroz and Greenstadt, 2011) is a phishing detection system which helps users in identifying phishing websites which look similar to a given set of protected websites through the creation of profiles.

2.2 Automated Web Credibility

Automated Web Credibility models for website classification are not broadly explored in practice. The aim is to produce a predictive model given training data (annotated website ranks) regardless of an input query Q. Existing gold standard data is generated from surveys and simulations (see the Rating Systems, Simulations related work). Currently, state of the art (SOTA) experiments rely on the Microsoft Credibility dataset1 (Schwarz and Morris, 2011). Recent research uses the website labels (Likert scale) released in the Microsoft dataset as a gold standard to train automated web credibility models, as follows:

Olteanu et al. (2013) proposes a number of properties (37 linguistic and textual features) and

1It is worth mentioning that this survey is mostly based on confidential data and it is not available to the open community (e.g., overall popularity, popularity among domain experts, geo-location of users and number of awards).

applies machine learning methods to recognize trust levels, obtaining 22 relevant features for the task. Wawer et al. (2014) improve this work using psychosocial and psycholinguistic features (through the General Inquirer (GI) Lexical Database (Stone and Hunt, 1963)), achieving state of the art results.

Finally, another resource is the Content Credibility Corpus (C3) (Kakol et al., 2017), the largest Web credibility corpus publicly available so far. However, in this work the authors did not perform experiments w.r.t. automated credibility models using a standard measure (i.e., the Likert scale), such as in (Olteanu et al., 2013; Wawer et al., 2014). Instead, they rather focused on evaluating the theories of web credibility in order to produce a much larger and richer corpus.

3 Experimental Setup

3.1 State-of-the-art (SOTA) Features

Recent research on credibility factors for websites (Olteanu et al., 2013) initially divided the features into the following logical groups:

1. Content-based (25 features): number of special characters in the text, spelling errors, website category, etc.

(a) Text (20 features)
(b) Appearance (4 features)
(c) Meta-information (1 feature)

2. Social-based (12 features): social media metadata (e.g., Facebook shares, Tweets pointing to a certain URL, etc.), PageRank, Alexa Rank and similar.

(a) Social Popularity (9 features)
(b) General Popularity (1 feature)
(c) Link structure (2 features)

According to Olteanu et al. (2013), a resulting number of 22 features (out of 37) were selected as most significant (10 content-based and all social-based features). Surprisingly (but also following Dong et al. (2015)), none came from the subgroup Appearance, although studies have systematically shown the opposite, i.e., that visual aspects are among the most important features (Fogg et al., 2003; Shah et al., 2015; Haas and Unkel, 2017).

In this picture, we claim the most negative aspect is the reliance on Social-based features. This dependency not only affects the final performance


of the credibility model, but also implies financial costs, as well as presenting high discriminative capacity, adding a strong bias to the performance of the model2. The computation of these features relies heavily on external (e.g., Facebook API3 and AdBlock4) and commercial libraries (Alchemy5, PageRank6, Alexa Rank7). Thus, engineering and financial costs are unavoidable. Furthermore, popularity on Facebook or Twitter can be measured only by the data owners. Additionally, vendors may change the underlying algorithms without further explanation. Therefore, also following Wawer et al. (2014), in this paper we have excluded Social-based features from our experimental setup.

On top of that, Wawer et al. (2014) extended the model, adding features extracted from the General Inquirer (GI) Lexical Database, resulting in a vector of 183 extra categories on top of the selected 22 base features, i.e., a total of 205 features (this is, however, subject to contradictions; please see Section 4.1 for more information).

3.2 Datasets

3.2.1 Website credibility evaluation

The Microsoft Dataset (Schwarz and Morris, 2011) consists of thousands of URLs and their credibility ratings (five-point Likert scale8), ranging from 1 ("very non-credible") to 5 ("very credible"). In this study, participants were asked to rate the websites as credible following the definition: "A credible webpage is one whose information one can accept as the truth without needing to look elsewhere". Studies by Olteanu et al. (2013) and Wawer et al. (2014) use this dataset for evaluation. The Content Credibility Corpus (C3)9 is the most recent and the largest credibility dataset currently publicly available for research (Kakol et al., 2017). It contains 15,750 evaluations of 5,543 URLs from 2,041 participants with some additional information about website characteristics and basic demographic features of users. Among the many metadata fields existing in the dataset, in this work we are only interested in the URLs and their respective five-point Likert scale ratings, so that we obtain the same information available in the Microsoft dataset.

2The authors applied an ANOVA test confirming this finding.
3https://developers.facebook.com/
4https://adblockplus.org/
5www.alchemyapi.com
6Excepting heuristic computations, calculation of PageRank requires crawling the whole Internet.
7https://www.alexa.com/siteinfo
8https://en.wikipedia.org/wiki/Likert_scale
9Also known as the Reconcile Corpus.


3.2.2 Fact-checking influence

In order to verify the impact of our web credibility model in a real use-case scenario, we ran a fact-checking framework to verify a set of input claims. Then we collected the sources (URLs) containing proofs to support a given claim. We used this as a dataset to evaluate our web credibility model.

The primary objective is to verify whether our model is able, on average, to assign lower scores to the websites that contain proofs supporting claims which are labeled as false in the FactBench dataset (i.e., the source is providing false information, thus should have a lower credibility score). Similarly, we expect that websites that support positive claims are assigned higher scores (i.e., the source is supporting an accurate claim, thus should have a higher credibility score).

The (gold standard) input claims were obtained from the FactBench dataset10, a multilingual benchmark for the evaluation of fact validation algorithms. It contains a set of RDF11 models (10 different relations), where each model contains a singular fact expressed as a subject-predicate-object triple. The data was automatically extracted from the DBpedia and Freebase KBs, and manually curated in order to generate true and false examples.

The website list extraction was carried out by DeFacto (Gerber et al., 2015), a fact-checking framework designed for RDF KBs. DeFacto returns a set of websites as pieces of evidence to support its prediction (true or false) for a given input claim.

3.3 Final Features
We implemented a set of Content-based features (Section 3.1), adding more lexical and textual features. Social-based features were not considered due to the financial costs associated with paid APIs. The final set of features for each website w is defined as follows:

1. Web Archive: the temporal information w.r.t. cache and freshness. ∆b and ∆e correspond to the temporal differences of the first and last 2 updates, respectively, ∆a represents the age of w, and ∆u represents the temporal difference from the last update to today. γ is a penalization factor applied when the information is obtained from the domain of w (wd) instead of w:

10 https://github.com/DeFacto/FactBench
11 https://www.w3.org/RDF/


$f_{arc}(w) = \left[ \frac{1}{\log(\Delta_b \times \Delta_e)} + \log(\Delta_a) + \frac{1}{\Delta_u} \right] \times \gamma$

2. Domain: refers to the (encoded) domain of w (e.g., org).

3. Authority: searches for authoritative keywords within the page HTML content wc (e.g., contact email, business address, etc.).

4. Outbound Links: counts the number of distinct outbound links in w ∧ wd ∈ d, i.e., $\sum_{n=1}^{P} \phi(w_c)$, where P is the number of web-based protocols.

5. Text Category: returns a vector containing the probabilities P for each pre-trained category c of w w.r.t. the sentences of the website ws and the page title wt: $\sum_{s=1}^{w_s} \gamma(s) \frown \gamma(w_t)$. We trained a set of binary multinomial Naive Bayes (NB) classifiers, one per class: business, entertainment, politics, religion, sports, and tech.

6. Text Category - LexRank: reduces the noise of wb by classifying only the top N sentences generated by applying LexRank (Erkan and Radev, 2004), a graph-based text summarization technique, over wb (S′ = Γ(wb, N)): $\sum_{s'=1}^{S'} \gamma(s') \frown \gamma(w_t)$.

7. Text Category - LSA: similarly, we apply Latent Semantic Analysis (LSA) (Steinberger and Ježek, 2004) to detect semantically important sentences in wb (S′ = Ω(wb, N)): $\sum_{s'=1}^{S'} \gamma(s') \frown \gamma(w_t)$.

8. Readability Metrics: returns a vector resulting from the concatenation of several R readability metrics (Si and Callan, 2001).

9. SPAM: detects whether wb or wt is classified as spam: $\psi(w_b) \frown \psi(w_t)$.

10. Social Tags: returns the frequency of social tags in wb: $\bigcup_{i=1}^{R} \varphi(i, w_b)$.

11. OpenSources: returns the OpenSources classification (x) for a given website: $x = \begin{cases} 1, & \text{if } w \in O \\ 0, & \text{if } w \notin O \end{cases}$

12. PageRankCC: PageRank information computed over the CommonCrawl12 corpus.

12 http://commoncrawl.org/

13. General Inquirer (Stone and Hunt, 1963): a 182-length vector containing several lexicons.

14. Vader Lexicon: a lexicon- and rule-based sentiment analysis tool that is specifically attuned to sentiment.

15. HTML2Seq: we introduce the concept of bag-of-tags, where, similarly to bag-of-words13, we group the HTML tag occurrences in each website. We additionally cast this concept as a sequence problem, i.e., we encode the tags and evaluate them considering a window size (offset) starting from the header of the page (a minimal sketch is given below).
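The following is a minimal sketch of how such a tag sequence could be extracted and padded to a fixed window; BeautifulSoup and Keras-style padding are illustrative choices, not necessarily the authors' implementation, and `pages` is a hypothetical list of raw HTML strings:

```python
from bs4 import BeautifulSoup
from tensorflow.keras.preprocessing.sequence import pad_sequences

def html_to_tag_sequence(html, tag_index, window=100):
    """Encode the first `window` HTML tags of a page as integer ids (sequence of tags)."""
    soup = BeautifulSoup(html, "html.parser")
    tags = [t.name for t in soup.find_all(True)]   # tag names in document order
    return [tag_index.setdefault(t, len(tag_index) + 1) for t in tags[:window]]

# `pages` is a hypothetical list of raw HTML strings.
tag_index = {}
sequences = [html_to_tag_sequence(p, tag_index, window=100) for p in pages]
X = pad_sequences(sequences, maxlen=100, padding="post")   # fixed-length input matrix
```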

4 Experiments

Previous research proposes two application settings w.r.t. the classification itself: (A.1) casting the credibility problem as a classification problem, and (A.2) evaluating credibility on a five-point Likert scale (regression). In the classification scenario, the models are evaluated in both a 2-class and a 3-class setting. In the 2-class scenario, websites rated from 1 to 3 are labeled as "low" credibility, whereas 4 and 5 are labeled as "high". Analogously, in the 3-class scenario, websites rated 1 and 2 are converted to "low", 3 remains "medium", and 4 and 5 are grouped into the "high" class.
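As a concrete illustration of these two mappings, a minimal sketch (the function names are ours, not the authors'):

```python
def to_two_class(rating: int) -> str:
    # Likert 1-3 -> "low" credibility; 4-5 -> "high"
    return "low" if rating <= 3 else "high"

def to_three_class(rating: int) -> str:
    # Likert 1-2 -> "low"; 3 -> "medium"; 4-5 -> "high"
    if rating <= 2:
        return "low"
    return "medium" if rating == 3 else "high"

assert to_two_class(3) == "low" and to_three_class(3) == "medium"
```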

We first explore the impact of the bag-of-tags strategy. We encode and convert the tags into a sequence of tags, similar to a sequence of sentences (looking for opening and closing tags, e.g., <a> and </a>), and then perform document classification over the resulting vectors. Figures 1a to 1d show the results of this strategy for both the 2-class and 3-class scenarios. The x-axis is the log scale of the paddings (i.e., the offset of HTML tags we retrieved from w, ranging from 25 to 10,000). The charts reveal an interesting pattern in both gold-standard datasets (Microsoft Dataset and C3 Corpus): the first tags are the most relevant for predicting the credibility class. Although this strategy does not achieve state-of-the-art performance14, it presents reasonable performance by inspecting website metadata alone: F1-measures of 0.690 and 0.571 for the 2-class and 3-class settings, respectively. However, it is worth mentioning that the main advantage of this approach lies in the fact that it is language agnostic (while current research focuses on English), as well as being less susceptible to overfitting.

13 https://en.wikipedia.org/wiki/Bag-of-words_model
14 F1 measures = 0.745 (2-class) and 0.652 (3-class).


Microsoft Dataset (Gradient Boosting, K = 25)

Class     Precision  Recall  F1
low       0.851      0.588   0.695
high      0.752      0.924   0.829
weighted  0.794      0.781   0.772
micro     0.781      0.781   0.781
macro     0.801      0.756   0.762

C3 Corpus (AdaBoost, K = 75)

Class     Precision  Recall  F1
low       0.558      0.355   0.434
high      0.732      0.862   0.792
weighted  0.675      0.695   0.674
micro     0.695      0.695   0.695
macro     0.645      0.609   0.613

Table 1: Text+HTML2Seq features (2-class): best classifier performance.

We then evaluate the performance of the textual features (Section 3.3) in isolation. Results for the 2-class scenario are presented as follows: Figure 2a highlights the best model performance using textual features only. While this feature set alone does not outperform the lexical features, when we combine it with the bag-of-tags approach (prediction probabilities for each class) we boost the performance (F1 from 0.738 to 0.772) and outperform the state of the art (0.745), as shown in Figure 2b. Tables 1 to 3 show detailed results for both datasets (2-class, 3-class, and 5-class configurations, respectively). For the 5-class regression, we found the best pad = 100 for the Microsoft dataset and the best pad = 175 for the C3 Corpus. We preceded the computation of both the classification and regression models with feature selection according to a percentile of the highest scoring features (SelectKBest). We tested percentiles of 3, 5, 10, 25, 50, 75, and K = 100 (i.e., no selection) and did not find a single K value that was best in every case. It is worth noting that, in general, it is easy to detect highly credible sources (F1 for the "high" class around 0.80 in all experiments and both datasets), but recall for "low" credibility sources is still an issue.
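A minimal sketch of this selection-plus-classification step, assuming a precomputed feature matrix X and 2-class labels y; scikit-learn's SelectPercentile is used here to express the percentile-based selection described above, and the estimator settings are our assumptions:

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# X, y: precomputed feature matrix and 2-class labels (assumed available)
best = None
for pct in (3, 5, 10, 25, 50, 75, 100):          # 100 = keep all features
    model = Pipeline([
        ("select", SelectPercentile(f_classif, percentile=pct)),  # ANOVA-based scoring
        ("clf", GradientBoostingClassifier()),
    ])
    score = cross_val_score(model, X, y, scoring="f1_macro", cv=5).mean()
    if best is None or score > best[0]:
        best = (score, pct)
print("best percentile:", best[1], "macro-F1:", round(best[0], 3))
```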

Table 4 shows statistics on the data generated by

Microsoft Dataset (Gradient Boosting, K = 75)

Class     Precision  Recall  F1
low       0.567      0.447   0.500
medium    0.467      0.237   0.315
high      0.714      0.916   0.803
weighted  0.626      0.662   0.626
micro     0.662      0.662   0.662
macro     0.583      0.534   0.539

C3 Corpus (AdaBoost, K = 100)

Class     Precision  Recall  F1
low       0.143      0.031   0.051
medium    0.410      0.177   0.247
high      0.701      0.916   0.794
weighted  0.583      0.660   0.598
micro     0.660      0.660   0.660
macro     0.418      0.375   0.364

Table 2: Text+HTML2Seq features (3-class): best classifier performance.

the fact-checking algorithm. For 1,500 claims, it collected pieces of evidence from over 27,000 websites. Table 5 depicts the impact of the credibility model in the fact-checking context. We collected a small subset of 186 URLs from the FactBench dataset and manually annotated15 the credibility of each URL (following the Likert scale). The model correctly labeled around 80% of the URLs associated with a positive claim and, more importantly, 70% of the non-credible websites linked to false claims were correctly identified. This helps to minimize the number of non-credible information providers that supply information supporting a false claim.

15 By four human annotators. In the event of a tie we exclude the URL from the final dataset.

4.1 Discussion
Reproducibility is still one of the cornerstones of science and scientific projects (Baker, 2016). In the following, we list some relevant issues encountered while performing our experiments:

Experimental results: this gap is also observed w.r.t. the results reported by Olteanu et al. (2013), which is acknowledged by Wawer et al. (2014), despite numerous attempts to replicate the experiments. The authors of Wawer et al. (2014) believe this is due to the lack of parameters and hyperparameters explicitly cited in the previous research (Olteanu et al., 2013).


[Figure 1: HTML2Seq (F1-measure) over different padding sizes. Panels: (a) Microsoft dataset, 2-class; (b) Microsoft dataset, 3-class; (c) C3 Corpus, 2-class; (d) C3 Corpus, 3-class.]

[Figure 2: Evaluating distinct classifiers in the 2-class setting (Microsoft dataset): an increase of almost +3% (from 0.745 to 0.772) in average F1 (Gradient Boosting). Feature selection performed with the ANOVA SelectKBest method, K = 0.25. Panels: (a) Textual features; (b) Textual+HTML2Seq (best padding) features.]


Microsoft Dataset

Model  K   R2     RMSE   MAE    EVar
SVR    3   0.232  0.861  0.691  0.238
Ridge  3   0.268  0.841  0.683  0.269

C3 Corpus

Model  K   R2     RMSE   MAE    EVar
SVR    25  0.096  0.939  0.739  0.102
Ridge  25  0.133  0.920  0.750  0.134

Table 3: Text+HTML2Seq: regression measures (5-class). Selecting the top K lexical features.

FactBench (Credibility Model)

Label  Claims  Sites
true   750     14,638
false  750     13,186
total  1,500   27,824

Table 4: FactBench: Websites collected from claims.

Microsoft dataset: presents inconsistencies. Although all the web pages are cached (in theory) in order to guarantee a deterministic environment, the dataset, in its original form16, has a number of problems: (a) web pages not physically cached; (b) URLs not matching (dataset links versus cached files); (c) invalid file formats (e.g., PDF). Even though these issues have also been identified by related research (Olteanu et al., 2013), it is not clear what the URLs for the final dataset (i.e., the support) are, nor where this new version is available.

Contradictions: the divergence regarding the importance of visual features between (Dong et al., 2015) and (Fogg, 2003; Shah et al., 2015) has drawn our attention and corroborates the need for more practical methods to solve the web credibility problem. The main hypothesis behind this contradiction is that feature-based credibility evaluation eventually ignites a cat-and-mouse game between scientists and people interested in manipulating the models. In this case, reinforcement learning methods are a promising alternative for adaptation.

16 The original dataset can be downloaded from http://research.microsoft.com/en-us/projects/credibility/

FactBench (Sample - Human Annotation)

Label  Claims  Sites  Non-cred  Cred
true   5       96     57        39
false  5       80     48        32
total  10      186    105       71

FactBench (Sample - Credibility Model)

Label  Non-cred  %     Cred  %
true   40        0.81  31    0.79
false  34        0.70  24    0.75

Table 5: FactBench dataset: analyzing the performance of the credibility model in the fact-checking task.

Proposed features: the acknowledgement made by the authors of (Wawer et al., 2014) that "solutions based purely on external APIs are difficult to use beyond scientific application and are prone for manipulation", which confirms the need to exclude the social features of (Olteanu et al., 2013), is contradicted by their own experiments, in which they mention the usage of all features proposed by (Olteanu et al., 2013): "Table 1 presents regression results for the dataset described in [13] in its original version (37 features) and extended with 183 variables from the General Inquirer (to 221 features)".

Therefore, due to the number of relevant issues w.r.t. reproducibility and contradictory arguments, comparison with recent research becomes more difficult. In this work, we solved the technical issues in the Microsoft dataset and released a new fixed version17. Also, since we need to perform evaluations in a deterministic environment, we cached and released the websites of the C3 corpus. After scraping, 2,977 URLs were used (out of 5,543); the others were discarded due to processing errors (e.g., 404). The algorithms, their hyperparameters, and further relevant metadata are available through the MEX Interchange Format (Esteves et al., 2015). By doing this, we provide a computational environment for safer comparisons, engaging with recent discussions about mechanisms to measure and enhance the reproducibility of scientific projects (Wilkinson et al., 2016).

17 More information is available at the project website: https://github.com/DeFacto/WebCredibility


5 Conclusion

In this work, we discussed existing alternatives, gaps, and current challenges in tackling the problem of web credibility. More specifically, we focused on automated models to compute a credibility factor for a given website. This research follows the earlier studies by (Olteanu et al., 2013; Wawer et al., 2014) and presents several contributions. First, we propose different features to avoid the financial cost imposed by external APIs for accessing website credibility indicators. This issue has become even more relevant in light of the challenges that emerged after the shutdown of Google PageRank, for instance. To bridge this gap, we have proposed the concept of bag-of-tags. Similar to (Wawer et al., 2014), we conduct experiments in a high-dimensional feature space, but also consider web page metadata, which outperforms state-of-the-art results in the 2-class and 5-class settings. Second, we identified and fixed several problems in a gold standard dataset for web credibility (Microsoft), and indexed several web pages for the C3 Corpus. Finally, we evaluated the impact of the model in a real fact-checking use case. We show that the proposed model can help in penalizing or supporting websites that contain evidence for false and true claims, respectively, which helps the very challenging fact verification task. As future work, we plan to explore deep learning methods over the HTML2Seq module.

Acknowledgments

This research was partially supported by an EU H2020 grant for the WDAqua project (GA no. 642795) and by DAAD under the project "International promovieren in Deutschland für alle" (IPID4all).

References

Ahmed Abbasi, Zhu Zhang, David Zimbra, Hsinchun Chen, and Jay F. Nunamaker Jr. 2010. Detecting fake websites: the contribution of statistical learning theory. MIS Quarterly, pages 435–461.
Sadia Afroz and Rachel Greenstadt. 2011. Phishzoo: Detecting phishing websites by looking at them. 2012 IEEE Sixth International Conference on Semantic Computing, 00:368–375.
Monya Baker. 2016. 1,500 scientists lift the lid on reproducibility. Nature News, 533(7604):452.
Xin Luna Dong, Evgeniy Gabrilovich, Kevin Murphy, Van Dang, Wilko Horn, Camillo Lugaresi, Shaohua Sun, and Wei Zhang. 2015. Knowledge-based trust: Estimating the trustworthiness of web sources. Proc. VLDB Endow., 8(9):938–949.
Gunes Erkan and Dragomir R. Radev. 2004. LexRank: Graph-based lexical centrality as salience in text summarization. J. Artif. Int. Res., 22(1):457–479.
Diego Esteves, Diego Moussallem, Ciro Baron Neto, Tommaso Soru, Ricardo Usbeck, Markus Ackermann, and Jens Lehmann. 2015. MEX vocabulary: A lightweight interchange format for machine learning experiments. In Proceedings of the 11th International Conference on Semantic Systems, SEMANTICS '15, pages 169–176, New York, NY, USA. ACM.
Diego Esteves, Anisa Rula, Aniketh Janardhan Reddy, and Jens Lehmann. 2018. Toward veracity assessment in RDF knowledge bases: An exploratory analysis. Journal of Data and Information Quality (JDIQ), 9(3):16.
BJ Fogg, Jonathan Marshall, Othman Laraki, Alex Osipovich, Chris Varma, Nicholas Fang, Jyoti Paul, Akshay Rangnekar, John Shon, Preeti Swani, et al. 2001. What makes web sites credible?: a report on a large quantitative study. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 61–68. ACM.
BJ Fogg and Hsiang Tseng. 1999. The elements of computer credibility. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 80–87. ACM.
Brian J. Fogg. 2003. Prominence-interpretation theory: Explaining how people assess credibility online. In CHI '03 Extended Abstracts on Human Factors in Computing Systems, pages 722–723. ACM.
Brian J. Fogg, Cathy Soohoo, David R. Danielson, Leslie Marable, Julianne Stanford, and Ellen R. Tauber. 2003. How do users evaluate the credibility of web sites?: a study with over 2,500 participants. In Proceedings of the 2003 Conference on Designing for User Experiences, pages 1–15. ACM.
Daniel Gerber, Diego Esteves, Jens Lehmann, Lorenz Buhmann, Ricardo Usbeck, Axel-Cyrille Ngonga Ngomo, and Rene Speck. 2015. DeFacto - temporal and multilingual deep fact validation. Web Semantics: Science, Services and Agents on the World Wide Web.
Katherine Del Giudice. 2010. Crowdsourcing credibility: The impact of audience feedback on web page credibility. In Proceedings of the 73rd ASIS&T Annual Meeting on Navigating Streams in an Information Ecosystem - Volume 47, page 59. American Society for Information Science.


Alexander Haas and Julian Unkel. 2017. Ranking versus reputation: perception and effects of search result credibility. Behaviour & Information Technology, 36(12):1285–1298.
Michal Kakol, Radoslaw Nielek, and Adam Wierzbicki. 2017. Understanding and predicting web content credibility using the content credibility corpus. Information Processing & Management, 53(5):1043–1061.
Xian Li, Xin Luna Dong, Kenneth Lyons, Weiyi Meng, and Divesh Srivastava. 2012. Truth finding on the deep web: Is the problem solved? Proc. VLDB Endow., 6(2):97–108.
Xin Liu, Radoslaw Nielek, Paulina Adamska, Adam Wierzbicki, and Karl Aberer. 2015. Towards a highly effective and robust web credibility evaluation system. Decision Support Systems, 79:99–108.
Satoshi Nakamura, Shinji Konishi, Adam Jatowt, Hiroaki Ohshima, Hiroyuki Kondo, Taro Tezuka, Satoshi Oyama, and Katsumi Tanaka. 2007. Trustworthiness Analysis of Web Search Results. Springer Berlin Heidelberg, Berlin, Heidelberg.
Netcraft. 2016. Netcraft survey (2016). http://www.webcitation.org/6lhJlHtez. Accessed: 2017-10-01.
Alexandra Olteanu, Stanislav Peshterliev, Xin Liu, and Karl Aberer. 2013. Web credibility: features exploration and credibility prediction. In European Conference on Information Retrieval, pages 557–568. Springer.
Julia Schwarz and Meredith Morris. 2011. Augmenting web pages and search results to support credibility assessment. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 1245–1254. ACM.
Asad Ali Shah, Sri Devi Ravana, Suraya Hamid, and Maizatul Akmar Ismail. 2015. Web credibility assessment: affecting factors and assessment techniques. Information Research, 20(1):20–1.
Luo Si and Jamie Callan. 2001. A statistical model for scientific readability. In Proceedings of the Tenth International Conference on Information and Knowledge Management, CIKM '01, pages 574–576, New York, NY, USA. ACM.
Himani Singal and Shruti Kohli. 2016. Trust necessitated through metrics: Estimating the trustworthiness of websites. Procedia Computer Science, 85:133–140.
Joel Sobel. 1985. A theory of credibility. The Review of Economic Studies, 52(4):557–573.
Josef Steinberger and Karel Ježek. 2004. Using latent semantic analysis in text summarization and summary evaluation. In Proc. ISIM '04, pages 93–100.
Philip J. Stone and Earl B. Hunt. 1963. A computer approach to content analysis: Studies using the General Inquirer system. In Proceedings of the May 21-23, 1963, Spring Joint Computer Conference, AFIPS '63 (Spring), pages 241–256, New York, NY, USA. ACM.
James Thorne and Andreas Vlachos. 2018. Automated fact checking: Task formulations, methods and future directions. CoRR, abs/1806.07687.
Aleksander Wawer, Radoslaw Nielek, and Adam Wierzbicki. 2014. Predicting webpage credibility using linguistic features. In Proceedings of the 23rd International Conference on World Wide Web, pages 1135–1140. ACM.
Mark D. Wilkinson, Michel Dumontier, IJsbrand Jan Aalbersberg, Gabrielle Appleton, Myles Axton, Arie Baak, Niklas Blomberg, Jan-Willem Boiten, Luiz Bonino da Silva Santos, Philip E. Bourne, et al. 2016. The FAIR guiding principles for scientific data management and stewardship. Scientific Data, 3.
Arkaitz Zubiaga, Ahmet Aker, Kalina Bontcheva, Maria Liakata, and Rob Procter. 2017. Detection and resolution of rumours in social media: A survey. CoRR, abs/1704.00656.


Automated Fact-Checking of Claims in Argumentative Parliamentary Debates

Nona Naderi
Department of Computer Science
University of Toronto
Toronto, Ontario, Canada
[email protected]

Graeme Hirst
Department of Computer Science
University of Toronto
Toronto, Ontario, Canada
[email protected]

Abstract

We present an automated approach to distinguish true, false, stretch, and dodge statements in questions and answers in the Canadian Parliament. We leverage the truthfulness annotations of a U.S. fact-checking corpus by training a neural net model and incorporating the prediction probabilities into our models. We find that, in concert with other linguistic features, these probabilities can improve the multi-class classification results. We further show that dodge statements can be detected with an F1 measure as high as 82.57% in binary classification settings.

1 Introduction

Governments and parliaments that are selected and chosen by citizens' votes have ipso facto attracted a certain level of trust. However, governments and parliamentarians use combinations of true statements, false statements, and exaggerations in strategic ways to question other parties' trustworthiness and to thereby create distrust towards them while gaining credibility for themselves. Creating distrust and alienation may be achieved by using ad hominem arguments or by raising questions about someone's character and honesty (Walton, 2005). For example, consider the claims made within the following question that was asked in the Canadian Parliament:

Example 1.1 [Dominic LeBlanc, 2013-10-21] The RCMP and Mike Duffy's lawyer have shown us that the Prime Minister has not been honest about this scandal. When will he come clean and stop hiding his own role in this scandal?

These claims, including the presupposition of the second sentence that the Prime Minister has a role in the scandal that he is hiding, may be true, false, or simply exaggerations. In order to be able to analyze how these claims serve their presenter's purpose or intention, we need to determine their truth.

Here, we will examine the linguistic characteristics of true statements, false statements, dodges, and stretches in argumentative parliamentary statements. We examine whether falsehoods told by members of parliament can be identified with previously proposed approaches, and we find that while some of these approaches improve the classification, identifying falsehoods by members of parliament remains challenging.

2 Related work

Vlachos and Riedel (2014) proposed using data from fact-checking websites such as PolitiFact for the fact-checking task, and suggested that one way to approach this task would be to use the semantic similarity between statements. Hassan et al. (2015) used presidential debates and proposed three labels (Non-Factual, Unimportant Factual, and Check-worthy Factual sentence) for the fact-checking task. They used a traditional feature-based method and trained their models using sentiment scores from AlchemyAPI, word counts of a sentence, bag of words, part-of-speech tags, and entity types to classify the debates into these three labels. They found that the part-of-speech tag of cardinal numbers was the most informative feature and word counts the second most informative. They also found that check-worthy factual claims were more likely to contain numeric values and non-factual sentences were less likely to contain numeric values.

Patwari et al. (2017) used primary debates and presidential debates for analyzing check-worthy statements. They used topics extracted using LDA, entity history and type counts, part-of-speech tuples, counts of part-of-speech tags, unigrams, sentiment, and token counts for their classification task.

60

Page 73: Fact Extraction and VERication · Oana Cocarascu (Imperial College London) Christos Christodoulopoulos (Amazon) Arpit Mittal (Amazon) Program Committee: Nikolaos Aletras (University

Label  True  False  Dodge  Stretch  Total
#      255   60     70     93       478

Table 1: Distribution of labels in the Toronto Star dataset.

Label                #
True                 1,780
Mostly true          2,003
Half true            2,152
Mostly false         1,717
False                1,964
Pants-on-fire false  867
Total                10,483

Table 2: Distribution of labels in the PolitiFact dataset.

Ma et al. (2017) used a kernel-based model to detect rumors in tweets. Wang (2017) used the statements from PolitiFact and the 6-point scale of truthfulness; he compared the performance of multiple classifiers and reported some improvement by using metadata related to the person making the statements.

Rashkin et al. (2017) examined the effectiveness of LIWC (Linguistic Inquiry and Word Count) and stylistic lexicon features in determining the reliability of a news corpus and the truthfulness of the PolitiFact dataset. The only reliability measurement reported on the PolitiFact dataset is by Wang (2017), who manually analyzed 200 statements from PolitiFact and reached an agreement of 0.82 with the journalists' labels, using Cohen's kappa. Jaradat et al. (2018) used a set of linguistic features to rank check-worthy claims. Thorne et al. (2018) created a dataset for claim verification. This dataset consists of 185,445 claims verified against Wikipedia pages. Here, we do not consider any external resources and we focus only on the text of claims to determine whether we can classify claims as true, false, dodge, or stretch.

3 Data

For our analysis, we extracted our data from a project by the Toronto Star newspaper.1 The Star reporters2 fact-checked and annotated questions and answers from the Oral Question Period of the Canadian Parliament (over five days in April and May 2018).

1 http://projects.thestar.com/question-period/index.html. All the data is publicly available.
2 Bruce Campion-Smith, Brendan Kennedy, Marco Chown Oved, Alex Ballingall, Alex Boutilier, and Tonda MacCharles.

Oral Question Period is a 45-minute session in which the Opposition and Government backbenchers ask questions of ministers of the government, and the ministers must respond. The reporters annotated all assertions within both the questions and the answers as either true, false, stretch (half true), or dodge (not actually answering the question). Further, they provided a narrative justification for the assignment of each label (we do not use that data here). Here is an example of the annotated data (not including the justifications):

Example 3.1 Q. [Michelle Rempel] Mr. Speaker, [social programs across Canada are under severe strain due to tens of thousands of unplanned immigrants illegally crossing into Canada from the United States.]False [Forty per cent in Toronto's homeless shelters are recent asylum claimants.]True [This, food bank usage, and unemployment rates show that many new asylum claimants are not having successful integration experiences.]False

A. [Ahmed Hussen (Minister of Immigration, Refugees and Citizenship)] Mr. Speaker, we commend the City of Toronto, as well as the Province of Ontario, the Province of Quebec, and all Canadians, on their generosity toward newcomers. That is something this country is proud of, and we will always be proud of our tradition. [In terms of asylum processing, making sure that there are minimal impacts on provincial social services, we have provided $74 million to make sure that the Immigration and Refugee Board does its work so that legitimate claimants can move on with their lives and those who do not have legitimate claims can be removed from Canada.]True

Here is an example of a dodge annotation:

Example 3.2 Q. [Jacques Gourde] . . . How much money does that represent for the families that will be affected by the sexist carbon tax over a one-year period?

A. [Catherine McKenna (Minister of Environment and Climate Change)] [Mr. Speaker, I am quite surprised to hear them say they are concerned about sexism. That is the party that closed 12 out of 16 Status of Women Canada offices.]Dodge We know that we must take action on climate change. Canadians know that we have a plan, but they are not so sure if the Conservatives do.


Features                               F1     Accuracy  Dodge  True   False  Stretch
Majority class (True)                  –      53.35
BOW (tf-idf)                           49.20  53.14     55.20  67.00  4.60   24.80
+ POS                                  52.92  58.15     62.40  71.00  4.80   27.40
+ NUM                                  53.40  58.58     63.80  70.80  4.80   28.80
+ Superlatives (Rashkin et al., 2017)  54.24  59.42     63.80  71.60  9.20   30.00
+ PolitiFact predictions               55.10  59.63     63.60  71.60  12.80  30.80
BOW + NE                               50.66  53.33     57.40  66.40  17.20  24.40

Table 3: Five-fold cross-validation results (F1 and % accuracy) of four-way classification of fact-checking for the overall dataset and F1 for each class.

For our analysis, we extracted the annotated span of the text with its associated label. The distribution of the labels in this dataset is shown in Table 1. This is a skewed dataset, with more than half of the statements annotated as true.

We also use a publicly available dataset from PolitiFact, a website at which statements by American politicians and officials are annotated with a 6-point scale of truthfulness.3 The distribution of labels in this data is shown in Table 2. We examine the PolitiFact data to determine whether these annotations can help the classification of the Toronto Star annotations.

4 Method

We formulate the analysis as a multi-class classification task; given a statement, we identify whether the statement is true, false, stretch, or a dodge.

We first examine the features found effective for identifying deceptive texts in the prior literature (a brief sketch of two of these surface features follows the list below):

• Tuples of words and their part-of-speech tags (unigrams and bigrams weighted by tf-idf, represented by POS in the result tables).

• Number of words in the statement (Hassan et al., 2015; Patwari et al., 2017).

• Named entity type counts, including organizations and locations (Patwari et al., 2017) (represented by NE in the result tables).

• Total number of numbers in the text, e.g., six organizations heard the assistant deputy minister (Hassan et al., 2015) (represented by NUM in the result tables).

3 The dataset has been made available by Hannah Rashkin at https://homes.cs.washington.edu/~hrashkin/factcheck.html.

• LIWC (Tausczik and Pennebaker, 2010) features (Rashkin et al., 2017).

• Five lexicons of intensifying words from Wiktionary: superlatives, comparatives, action adverbs, manner adverbs, modal adverbs (Rashkin et al., 2017).
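A brief sketch of two of these surface features (statement length and the count of numeric tokens), using NLTK for tokenization; the function name is ours and this is only an approximation of the features described above:

```python
import re
from nltk.tokenize import word_tokenize

def surface_features(statement: str) -> dict:
    tokens = word_tokenize(statement)
    return {
        "num_words": len(tokens),   # statement-length feature
        "num_numbers": sum(bool(re.fullmatch(r"\d[\d.,]*", t)) for t in tokens),  # numeric tokens
    }

print(surface_features("We have provided $74 million over 5 years."))
```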

In addition, we leverage the American PolitiFact data to fact-check the Canadian Parliamentary questions and answers by training a Gated Recurrent Unit (GRU) classifier (Cho et al., 2014) on this data. We will use the truthfulness predictions of this classifier (the probabilities of the 6-point-scale labels) as additional features for our SVM classifier (using the scikit-learn package (Pedregosa et al., 2011)). For training the GRU classifier, we initialized the word representations using the publicly available GloVe pre-trained 100-dimension word embeddings (Pennington et al., 2014)4, and restricted the vocabulary to the 5,000 most frequent words and a sequence length of 300. We added a dropout of 0.6 after the embedding layer and a dropout layer of 0.8 before the final sigmoid unit layer. The model was trained with categorical cross-entropy with the Adam optimizer (Kingma and Ba, 2014) for 10 epochs and a batch size of 64. We used 10% of the data for validation, with the model achieving an average F1 measure of 31.44% on this data.
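A minimal Keras sketch of a GRU classifier configured roughly as described; the GloVe initialization is omitted, the recurrent hidden size is our assumption, and a Softmax output is used here as the standard pairing with categorical cross-entropy (the paper mentions a sigmoid unit):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Dropout, GRU, Dense

VOCAB_SIZE, EMB_DIM, NUM_CLASSES = 5000, 100, 6   # 5,000-word vocabulary, GloVe-100d, 6 labels

model = Sequential([
    Embedding(VOCAB_SIZE, EMB_DIM),            # would be initialized from GloVe in practice
    Dropout(0.6),                              # dropout after the embedding layer
    GRU(64),                                   # hidden size is our assumption; not stated in the paper
    Dropout(0.8),                              # dropout before the output layer
    Dense(NUM_CLASSES, activation="softmax"),  # 6-point truthfulness scale
])
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
# Inputs are integer sequences padded/truncated to length 300:
# model.fit(X_train, y_train, epochs=10, batch_size=64, validation_split=0.1)
```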

5 Results and discussion

We approach the fact-checking of the statements as a multi-class classification task. Our baselines are the majority class (truths) and an SVM classifier trained with unigrams extracted from the annotated spans of text (weighted by tf-idf).

4 https://nlp.stanford.edu/projects/glove/


Features                                     Dodge  Stretch  False

True
Majority class                               54.84  52.25    58.62
BOW                                          76.09  54.21    58.20
BOW + NE                                     75.65  52.99    61.67
BOW + LIWC                                   52.38  49.11    53.41
BOW + PolitiFact                             77.96  55.73    58.11
BOW + NE + PolitiFact                        76.25  53.76    63.69
BOW + POS + NUM + Superlative + PolitiFact   77.51  54.96    55.24

False
Majority class                               53.85  60.00
BOW                                          81.36  55.89
BOW + NE                                     82.57  56.91
BOW + LIWC                                   52.02  53.31
BOW + PolitiFact                             80.69  52.97
BOW + NE + PolitiFact                        82.52  55.08
BOW + POS + NUM + Superlative + PolitiFact   78.29  54.82

Stretch
Majority class                               57.06
BOW                                          75.15
BOW + NE                                     76.93
BOW + LIWC                                   45.37
BOW + PolitiFact                             79.39
BOW + NE + PolitiFact                        77.73
BOW + POS + NUM + Superlative + PolitiFact   80.59

Table 4: Average F1 of different models for two-way classification of fact-checking (five-fold cross-validation).

We performed five-fold cross-validation. Table 3 reports the results on the multi-class classification task with these baselines and with the additional features described in Section 4, including the truthfulness predictions of the GRU classifier trained on PolitiFact data. The best result is achieved using unigrams, POS tags, the total number of numbers (NUM), superlatives, and the GRU's truthfulness predictions (PolitiFact predictions). We examined all five lexicons from Wiktionary provided by Rashkin et al. (2017); however, only superlatives affected the performance of the classifier, so we report only the results using superlatives.
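A minimal scikit-learn sketch of the BOW baseline under five-fold cross-validation, assuming `statements` and `labels` lists extracted from the annotated spans; a linear-kernel SVM is our assumption:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# statements: list of annotated text spans; labels: true/false/stretch/dodge (assumed available)
baseline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 1))),   # unigrams weighted by tf-idf
    ("svm", LinearSVC()),
])
scores = cross_val_score(baseline, statements, labels, cv=5, scoring="f1_macro")
print("macro-F1: %.4f" % scores.mean())
```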

We also report in Table 3 the average F1 measure for the classification of the four labels in the multi-class setting using five-fold cross-validation. The truthfulness predictions did not improve the classification of the dodge and true labels in the multi-class setting. Superlatives slightly improved the classification of all labels except dodge.

We further perform pairwise classification (one-versus-one) for all possible pairs of labels to get better insight into the impact of the features and the characteristics of the labels. Therefore, we created three rather balanced datasets of truths and falsehoods by randomly resampling the true statements without replacement (85 true statements in each dataset). The same method was used for comparing true labels with dodge and stretch labels, i.e., we created three relatively balanced datasets for analyzing true and dodge labels and three datasets for analyzing true and stretch labels. This allows us to compare the prior work on the 6-point scale truthfulness labels on the U.S. data with the Canadian 4-point scale.
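A minimal sketch of this balancing step, assuming lists of labeled statements; the seed and function name are ours:

```python
import random

def make_balanced_sets(true_stmts, other_stmts, n_sets=3, n_true=85, seed=13):
    """Build several near-balanced datasets by downsampling the majority (true) class."""
    rng = random.Random(seed)
    # sample() draws without replacement, as described in the text
    return [rng.sample(true_stmts, n_true) + list(other_stmts) for _ in range(n_sets)]
```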

Table 4 presents the classification results using five-fold cross-validation with an SVM classifier. The reported F1 measure is the average of the results on all three datasets for each pairwise setting. Dodge statements were classified more accurately than the other statements, with an F1 measure as high as 82.57%. This shows that answers that do not provide a response to the question can be detected with relatively high confidence. The most effective features for classifying false against true and dodge statements were named entities.

The predictions obtained from training the GRU model on the PolitiFact annotations were, on their own, not able to distinguish false from true and stretch statements. However, the predictions did help in distinguishing true against stretch and dodge statements. None of the models were able to improve the classification of false against stretch statements over the majority baseline.

Overall, stretch statements were the most difficult statements to identify in the binary classification setting. This could also be due to some inconsistency in the annotation process, with stretch and false not always clearly separated. Here is an example of a stretch in the data:

Example 5.1 [Catherine McKenna] Carbon pricing works and it can be done while growing the economy. . . . Once again, I ask the member opposite, "What are you going to do?" [Under 10 years of the [Conservative] Harper government, you did nothing.]Stretch

Elsewhere in the data, essentially the same claim is labelled false:

Example 5.2 [Justin Trudeau] The Conservatives promised that they would also tackle environmental challenges and that they would do so by means other than carbon pricing. . . . They have no proposals, [they did nothing for 10 years.]False


We further performed the analysis using the two predictions of more true and more false from the PolitiFact dataset; however, we did not observe any improvements. Using the total number of words in the statements also did not improve the results.

While Rashkin et al. (2017) found that LIWC features were effective for predicting the truthfulness of the statements in PolitiFact, we did not observe any improvements in the performance of the classifier in our classification task on Canadian Parliamentary data. Furthermore, we did not observe any improvements in the classification tasks using sentiment and subjectivity features extracted with OpinionFinder (Wilson et al., 2005; Riloff et al., 2003; Riloff and Wiebe, 2003).

6 Comparison with PolitiFact dataset

In this section, we perform a direct analysis with the PolitiFact dataset. We first train a GRU model (using a sequence length of 200, with the other hyperparameters the same as in the experiment described above) using 3-point scale annotations of PolitiFact (using 10% of the data for validation). We treat the top two truthful ratings (true and mostly true) as true; half true and mostly false as stretch; and the last two ratings (false and pants-on-fire false) as false. We then test the model on the three annotations of true, stretch, and false from the Toronto Star project. The results are presented in Table 5. As the results show, none of the false statements are detected as false and the overall F1 score is lower than the majority baseline.
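A minimal sketch of this 3-point collapsing, and of the 2-point collapsing described in the next paragraph (the dictionary names are ours):

```python
THREE_POINT = {
    "true": "true", "mostly true": "true",
    "half true": "stretch", "mostly false": "stretch",
    "false": "false", "pants-on-fire false": "false",
}
TWO_POINT = {
    "true": "true", "mostly true": "true", "half true": "true",
    "mostly false": "false", "false": "false", "pants-on-fire false": "false",
}

labels = ["mostly true", "pants-on-fire false", "half true"]   # toy PolitiFact labels
print([THREE_POINT[l] for l in labels])   # ['true', 'false', 'stretch']
print([TWO_POINT[l] for l in labels])     # ['true', 'false', 'true']
```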

We further train a GRU model (trained with binary cross-entropy and a sequence length of 200, with the other hyperparameters the same as above) using a 2-point scale, where we treat the top three truthful ratings as true and the last three false ratings as false. We then test the model on the two annotations of true and false from the Toronto Star project. The results are presented in Table 6; the F1 score remains below the baseline.

The PolitiFact dataset provided by Rashkin et al. includes a subset of direct quotes by the original speakers. We further performed the 3-point scale and 2-point scale analysis using only the direct quotes. Using only the direct quotes, also shown in Tables 5 and 6, did not improve the classification performance.

           F1   True  Stretch  False
Majority   63
GRU (All)  40   53    29       0
GRU (DQ)   50   75    13       8

Table 5: 3-point scale comparison of the PolitiFact data and Toronto Star annotations. All: GRU model trained with all PolitiFact data and tested on Toronto Star annotations. DQ: GRU model trained with only direct quotes from the PolitiFact data and tested on Toronto Star annotations.

           F1   True  False
Majority   81
GRU (All)  73   84    29
GRU (DQ)   72   88    8

Table 6: 2-point scale comparison of the PolitiFact data and Toronto Star annotations. All: GRU model trained with all PolitiFact data and tested on Toronto Star annotations. DQ: GRU model trained with only direct quotes from the PolitiFact data and tested on Toronto Star annotations.

7 Conclusion

We have analyzed the classification of truths, falsehoods, dodges, and stretches in the Canadian Parliament and compared it with the truthfulness classification of statements in the PolitiFact dataset. We studied whether the features found effective in prior research can help us characterize truthfulness in Canadian Parliamentary debates, and found that while some of these features help us identify dodge statements with an F1 measure as high as 82.57%, they were not very effective in identifying false and stretch statements. The truthfulness predictions obtained from training a model on annotations of American politicians' statements, when used with other features, helped slightly in distinguishing truths from other statements. In future work, we will take advantage of the journalists' justifications in determining the truthfulness of the statements, as relying on linguistic features alone is not enough for determining falsehoods in parliament.

Acknowledgements

This research is financially supported by an Ontario Graduate Scholarship, the Natural Sciences and Engineering Research Council of Canada, and the University of Toronto. We thank the anonymous reviewers for their thoughtful comments and suggestions.


References

KyungHyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259.
Naeemul Hassan, Chengkai Li, and Mark Tremayne. 2015. Detecting check-worthy factual claims in presidential debates. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, pages 1835–1838. ACM.
Israa Jaradat, Pepa Gencheva, Alberto Barrón-Cedeño, Lluís Màrquez, and Preslav Nakov. 2018. ClaimRank: Detecting check-worthy claims in Arabic and English. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, pages 26–30. Association for Computational Linguistics.
Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Jing Ma, Wei Gao, and Kam-Fai Wong. 2017. Detect rumors in microblog posts using propagation structure via kernel learning. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 708–717, Vancouver, Canada. Association for Computational Linguistics.
Ayush Patwari, Dan Goldwasser, and Saurabh Bagchi. 2017. TATHYA: A multi-classifier system for detecting check-worthy statements in political debates. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pages 2259–2262. ACM.
F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.
Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.
Hannah Rashkin, Eunsol Choi, Jin Yea Jang, Svitlana Volkova, and Yejin Choi. 2017. Truth of varying shades: Analyzing language in fake news and political fact-checking. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2931–2937, Copenhagen, Denmark. Association for Computational Linguistics.
Ellen Riloff and Janyce Wiebe. 2003. Learning extraction patterns for subjective expressions. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, EMNLP '03, pages 105–112, Stroudsburg, PA, USA. Association for Computational Linguistics.
Ellen Riloff, Janyce Wiebe, and Theresa Wilson. 2003. Learning subjective nouns using extraction pattern bootstrapping. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 - Volume 4, CONLL '03, pages 25–32, Stroudsburg, PA, USA. Association for Computational Linguistics.
Yla R. Tausczik and James W. Pennebaker. 2010. The psychological meaning of words: LIWC and computerized text analysis methods. Journal of Language and Social Psychology, 29(1):24–54.
James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018. FEVER: a large-scale dataset for fact extraction and verification. In NAACL-HLT.
Andreas Vlachos and Sebastian Riedel. 2014. Fact checking: Task definition and dataset construction. In Proceedings of the ACL 2014 Workshop on Language Technologies and Computational Social Science, pages 18–22, Baltimore, MD, USA. Association for Computational Linguistics.
Douglas Walton. 2005. Fundamentals of Critical Argumentation. Cambridge University Press.
William Yang Wang. 2017. "Liar, liar pants on fire": A new benchmark dataset for fake news detection. arXiv preprint arXiv:1705.00648.
Theresa Wilson, Janyce Wiebe, and Paul Hoffmann. 2005. Recognizing contextual polarity in phrase-level sentiment analysis. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, HLT '05, pages 347–354, Stroudsburg, PA, USA. Association for Computational Linguistics.


Stance Detection in Fake News: A Combined Feature Representation

Bilal Ghanem
PRHLT Research Center, Universitat Politecnica de Valencia, Spain
[email protected]

Paolo Rosso
PRHLT Research Center, Universitat Politecnica de Valencia, Spain
[email protected]

Francisco Rangel
PRHLT Research Center, Universitat Politecnica de Valencia / Autoritas Consulting, Spain
[email protected]

Abstract

With the uncontrolled increase of fake news and rumors on the Web, different approaches have been proposed to address the problem. In this paper, we present an approach that combines lexical, word embedding, and n-gram features to detect the stance in fake news. Our approach has been tested on the Fake News Challenge (FNC-1) dataset. Given a news title-article pair, the FNC-1 task aims at determining the relevance of the article to the title. Our proposed approach achieves a result (59.6% Macro F1) that is close to the state of the art (a 0.013 difference) using a simple feature representation. Furthermore, we have investigated the importance of different lexicons for detecting the classification labels.

1 Introduction

Recently, many phenomena have appeared and spread on the Internet, especially with the huge propagation of information and the growth of social networks. Some of these phenomena are fake news, rumors, and misinformation. In general, the detection of these phenomena is crucial since in many situations they expose people to danger1. Journalism has made several efforts to address these problems by presenting validity proofs to the audience. Unfortunately, these manual attempts take much time and effort from the journalists and, at the same time, they cannot cover the rapid spread of fake news. Hence, there is a need to address the problem from an automatic perspective. Fake news has recently gained large attention from the natural language processing (NLP) research community and many approaches have been proposed. These approaches investigated fake news from network and textual perspectives (Shu et al., 2017). Some of the textual approaches handled the phenomenon from a validity aspect, labeling a claim as "False", "True", or "Half-True". Others tried to tackle it from a stance perspective, similar to stance detection work on Twitter (Mohammad et al., 2016; Taule et al., 2017; Lai et al., 2018) that tried to determine whether a tweet is in favor of, against, or neutral towards a given target entity (person, organization, etc.). In fake news, the tuple of the tweet and the target entity is replaced with a claim and an article, and a different set of stances is used (agree, disagree, discuss, and unrelated).

Several shared tasks have been proposed: the Fake News Challenge (FNC-1) (Rao and Pomerleau, 2017), RumorEval (Derczynski et al., 2017), CheckThat (Barron-Cedeno et al., 2018), and Fact Extraction and VERification (FEVER)2. In FNC-1, the organizers proposed approaching the task from a stance perspective; the goal is to predict how other articles orient to a specific fact, similarly to RumorEval (task A). In both RumorEval (task B) and CheckThat (task B), a rumor/claim is submitted and the task objective is to validate the truthfulness of this sentence (true, half-true, or false). In the first task of CheckThat (task A), participants were asked to detect claims that are worth checking (i.e., that may contain facts), as a preliminary step to task B. Finally, the purpose of the FEVER shared task is to evaluate the ability of a system to verify a factual claim using evidence from Wikipedia, where each retrieved piece of evidence (in case there are many) should be labeled as "Supported", "Refuted", or "NotEnoughInfo" (if there is not sufficient evidence to either support or refute it).

1 https://www.theguardian.com/media/2016/dec/18/what-is-fake-news-pizzagate
2 http://fever.ai/task.html


Fake news and rumor detection have received more attention in the literature than the detection of check-worthy claims; work on the latter has focused on inferring worthy claims using linguistic and stylistic aspects (Ghanem et al., 2018c; Hassan et al., 2015).

2 Related Work

From an NLP perspective, many approaches proposed to employ statistical (Magdy and Wanas, 2010), linguistic (Markowitz and Hancock, 2014; Volkova et al., 2017), and stylistic (Potthast et al., 2017) features. Other approaches incorporated different combinations of features, such as word or character n-gram overlap scores, bag-of-words (BOW), word embeddings, and latent semantic analysis features (Riedel et al., 2017; Hanselowski et al., 2017; Karadzhov et al., 2018). In some cases, authors used external features and retrieved evidence from the Web. For example, in (Ghanem et al., 2018b) the authors utilized both the Google and Bing search engines to investigate the factuality of political claims. In (Mihaylov et al., 2015), a similar work also retrieved evidence from Google and online blogs to validate sentences in question answering forums. In other attempts, some approaches utilized deep learning architectures to validate fake news. In (Baird et al., 2017), an approach combined a Convolutional Neural Network with a Gradient Boost classifier to predict the stance on FNC-1; this approach achieved the highest accuracy in the task results. Using a different deep learning architecture, the authors in (Hanselowski et al., 2018) used a Long Short-Term Memory (LSTM) network combined with other features such as bag-of-characters (BOC), BOW, and topic model features based on non-negative matrix factorization, Latent Dirichlet Allocation, and Latent Semantic Indexing. They achieved state-of-the-art results (60.9% Macro F1) on the FNC-1 dataset.

The approaches proposed for fake news and for rumor detection are slightly different, since the two phenomena were studied in different environments. Fake news datasets were generally collected from formal sources (political debates or Web news articles). On the other hand, Twitter was the source of the rumor datasets. Therefore, the proposed approaches for rumors focused more on the propagation of tweets (e.g., retweet ratio (Enayet and El-Beltagy, 2017)) and the writing style of the tweets (Kochkina et al., 2017).

3 Stance Detection in FNC-1

3.1 Task

Given a pair of text fragments (title and article) obtained from news, the task goal is to estimate the relative perspective (stance) of these two fragments with respect to a specific topic; in other words, to predict the stance of an article towards its title. For each input pair, there are 4 stance labels: Agree, Disagree, Discuss, and Unrelated. "Agree" if the article supports the title; "disagree" if it refutes it; "discuss" when the article discusses the title without taking a stance in favor or against; and "unrelated" when the article describes a different topic from that of the title. The task's dataset is highly imbalanced (see the next section). Therefore, the organizers introduced a weighted accuracy score for the evaluation. Their proposed score gives 25% of the final score to predicting the unrelated class and 75% to the other classes. Later, the authors in (Hanselowski et al., 2018) presented an in-depth analysis of the FNC-1 experimental setup. They showed that this accuracy metric is not appropriate and fails to take the imbalanced class distribution into account, favoring models that perform well on the majority class and poorly on the minority classes. Therefore, they proposed the Macro F1 metric for this task. Accordingly, in this paper we report experimental results using the Macro F1 measure.

3.2 Corpus

The dataset was built using 300 different topics. The training part consists of 49,972 tuples in the form of title, article, and label, while the test part consists of 25,413 tuples. The ratio of each label (class) in the dataset is: 73.13% Unrelated, 17.82% Discuss, 7.36% Agree, and 1.68% Disagree. Clearly, the dataset is heavily biased towards the unrelated label. Title length ranges between 8 and 40 words, whereas article length ranges between 600 and 7,000 words (Bhatt et al., 2018). These numbers illustrate the real challenge of predicting the stance between two fragments of very different lengths.


3.3 Tough-to-beat Baseline

The organizers presented a tough baseline using a Gradient Boost decision tree classifier. In contrast to other shared tasks, their baseline employed more sophisticated features: n-gram co-occurrence between the titles and articles using both character and word grams (with a combination of multiple lengths), along with other hand-crafted features such as word overlap between the title and the article and the presence of highly polarized words from a lexicon (e.g., fake, hoax). Their baseline achieved an FNC-1 score of 75% and a Macro F1 of 45.4%.
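A minimal sketch of two such hand-crafted features (title-article token overlap and a polarized-word flag); the word list here is a tiny illustrative subset, not the baseline's actual lexicon:

```python
POLARIZED = {"fake", "hoax", "fraud", "debunk"}   # tiny illustrative subset only

def overlap_and_polarity(title: str, article: str) -> dict:
    t, a = set(title.lower().split()), set(article.lower().split())
    return {
        "word_overlap": len(t & a) / max(len(t | a), 1),   # Jaccard-style title/article overlap
        "has_polarized": int(bool(POLARIZED & (t | a))),   # 1 if any polarized word appears
    }

print(overlap_and_polarity("Celebrity rips up record deal",
                           "The story is a hoax, a spokesperson said."))
```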

4 Approach and Results

The literature work on the FNC dataset showedthat the best results are not obtained with a puredeep learning architecture, and simple BOW rep-resentations showed a good performance. In ourapproach, we combine n-grams, word embeddingsand cue words to detect the stance of the title withrespect to its article.

4.1 Preprocessing

Before building the feature representation, we perform a set of text preprocessing steps. Some articles contain links, hashtags, and user mentions (e.g., @USER), so we remove them to make the text less biased. Similarly, we remove non-English and special characters.
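A minimal sketch of these steps with regular expressions (the exact character whitelist is our assumption; the paper does not list one):

    import re

    def clean(text):
        text = re.sub(r"https?://\S+|www\.\S+", " ", text)       # links
        text = re.sub(r"[@#]\w+", " ", text)                      # user mentions and hashtags
        text = re.sub(r"[^A-Za-z0-9.,!?'\" ]+", " ", text)        # non-English and special characters
        return re.sub(r"\s+", " ", text).strip()                  # collapse whitespace

    # clean("Read this http://t.co/xyz #hoax @USER – “fake news”!") -> "Read this fake news !"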

4.2 Features

In our approach we combine simple feature representations to model the title–article tuples:

• Cue words: We employ a set of cue word categories used previously by Bahuleyan and Vechtomova (2017) to identify the stance of Twitter users towards rumor tweets. As Table 1 shows, the cue word categories are Belief, Denial, Doubt, Report, Knowledge, Negation and Fake. The Fake cue list combines some words from the FNC-1 baseline polarized-word list with words from the original list. Since the provided set of cue words is quite small, we expand it using Google News word2vec, retrieving the 5 most similar words for each cue word (see the sketch after this list). For example, for the word "misinform" we retrieve "mislead", "misinforming", "disinform", "misinformation", and "demonize" as the most similar words.

Feature     Example words
Belief      assume, believe, think, consider
Denial      refuse, reject, rebuff, oppose
Doubt       wonder, unsure, guess, doubt
Report      evidence, assert, told, claim
Knowledge   confirm, definitely, support
Negation    no, not, never, don't, can't
Fake        liar, false, rumor, hoax, debunk

Table 1: The cue word categories and example words.

• Google News word2vec embeddings: For each title–article tuple, we measure the cosine similarity between the embeddings of the two fragments. We also use the full 300-dimensional embedding vectors of both the title and the article, where a sentence embedding is obtained by averaging its word embeddings. Previously, Ghanem et al. (2018a) showed that using the main sentence components (verbs, nouns, and adjectives) rather than the full sentence improved the accuracy of a plagiarism detection approach3. Therefore, we build these embedding vectors from the main sentence components. Furthermore, we retain the set of cue words described in the previous point.

• FNC-1 features: We use the same feature set as the baseline (see Section 3.3).
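As referenced above, a minimal sketch of the cue-word expansion and the title–article embedding similarity using gensim (the model path is a placeholder, and the content-word filtering step is omitted; the paper selects verbs, nouns and adjectives with the NLTK POS tagger):

    import numpy as np
    from gensim.models import KeyedVectors

    # Pre-trained Google News vectors; the file name is illustrative.
    w2v = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

    def expand_cues(cue_words, topn=5):
        """Add the topn most similar word2vec neighbours of every cue word."""
        expanded = set(cue_words)
        for word in cue_words:
            if word in w2v:
                expanded.update(w for w, _ in w2v.most_similar(word, topn=topn))
        return expanded

    def sentence_vector(tokens):
        """Average the embeddings of the tokens of a fragment (ideally content words only)."""
        vecs = [w2v[t] for t in tokens if t in w2v]
        return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

    def title_article_similarity(title_tokens, article_tokens):
        a, b = sentence_vector(title_tokens), sentence_vector(article_tokens)
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    # expand_cues({"misinform"}) adds e.g. "mislead", "misinforming", "disinform", ...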

4.3 Experiments

In our experiments, we tested Support Vector Machines (SVM) (with both linear and RBF kernels), Gradient Boosting, Random Forest and Naive Bayes classifiers, but a Neural Network (NN) showed better results6. Our NN architecture consists of two hidden layers with the rectified linear unit (ReLU) activation function as the non-linearity for the hidden layers and a Softmax activation function for the output layer. We also employed the Adam optimizer with a batch size of 200.
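A minimal scikit-learn sketch consistent with this description (hidden layer sizes are not reported in the paper, so the values below are placeholders):

    from sklearn.neural_network import MLPClassifier

    clf = MLPClassifier(hidden_layer_sizes=(100, 100),  # two hidden layers; sizes assumed
                        activation="relu",              # ReLU non-linearity
                        solver="adam",                  # Adam optimizer
                        batch_size=200)                 # batch size reported above
    # clf.fit(X_train, y_train) expects one feature vector per title-article pair;
    # the multi-class output applies a softmax over the four stance labels.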

3 For extracting the main sentence components, we used the NLTK POS tagger: https://www.nltk.org/book/ch05.html.

5 The stackLSTM is not one of the approaches that participated in FNC-1, but it achieved the state-of-the-art result.

6 The Scikit-learn Python package was used in our implementation.


Systems                                  Macro F1
Majority vote                            0.210
FNC-1 baseline                           0.454
Talos (Baird et al., 2017)               0.582
UCLMR (Riedel et al., 2017)              0.583
Athene (Hanselowski et al., 2017)        0.604
stackLSTM (Hanselowski et al., 2018)     0.609
Our approach                             0.596
  Cue words                              0.250
  Word2vec embeddings                    0.488

Table 2: The Macro F1 score results of our approach and of the participants in the FNC-1 challenge.5

Table 2 shows the results of our approach and those of the FNC-1 participants. We investigated the score of each of our feature sets independently: the word2vec embedding features achieve a Macro F1 of 0.488, while the cue words alone achieve 0.25. The expansion of the cue words improved the final result by 2.5%.

The tuples of the "Unrelated" class were created artificially by pairing titles with articles from different documents. This artificial distribution can affect the result of the cue words feature when it is tested independently: since we extract the cue words feature from the article part only (without the titles), and the same article can appear with different class labels, this can bias the classification process. As mentioned previously, the state-of-the-art result was obtained by an approach that combined an LSTM with other features (see Section 2). Our approach achieves a Macro F1 of 0.596, which is very close to the best result.

The combination of the cue word categories with the other features improved the overall result; each of them had an impact on the classification process. In Figure 1, we show the importance of each category using Information Gain. We extract it with the Gradient Boosting classifier, as it achieves the highest result among the decision-tree-based classifiers. The figure shows that Report is the category with the highest importance in the classification process, Negation and Belief have lower importance, and both the Denial and Knowledge categories have the lowest importance. Surprisingly, both the Fake and Doubt categories have lower importance than the other three.

Figure 1: The importance of each cue word category, measured by Information Gain.

Our intuition was that the Fake category would have the highest importance in discriminating the classes, since it contains words that may not appear in "Agree" class records, appear profusely in the "Disagree" class (where the title is a fake claim and the article refutes it), and appear in moderate amounts in the "Discuss" class. Similarly, the Doubt category may appear frequently in both the "Discuss" and "Disagree" classes, where its words are normally used when an article discusses a specific idea or refutes it. To better understand our Information Gain results, we conducted another experiment to infer the importance of each category with respect to each classification class.

To do so, we use the coefficients of an SVM classifier (linear kernel) to extract the most important category for each class. In our initial experiments, the SVM produced a result similar to the NN (58% Macro F1); based on this good performance we used it in this experiment, since we could not extract feature importances from the NN. Once the SVM fits the data and creates a hyperplane that uses support vectors to maximize the margin between the classes, the importance of the features can be estimated from the absolute size of the coefficients (the hyperplane coordinates). Table 3 shows the importance order of the categories for each class. We notice that for the "Agree" class the categories used when there is a disagreement (Denial, Fake, Negation) generally tend to be less important than the other categories. On the contrary, for "Disagree", the disagreement categories generally appear higher in the order compared to the "Agree" class.
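A rough sketch of how such a per-class ranking can be extracted; LinearSVC (one-vs-rest) stands in here for the linear-kernel SVM because it exposes one coefficient row per class, and the feature matrix is random placeholder data:

    import numpy as np
    from sklearn.svm import LinearSVC

    CATEGORIES = ["Belief", "Denial", "Doubt", "Report", "Knowledge", "Negation", "Fake"]

    # Placeholder data: one column per cue-word category, one of the four stance labels per row.
    rng = np.random.RandomState(0)
    X_cues = rng.rand(400, len(CATEGORIES))
    y = rng.choice(["agree", "disagree", "discuss", "unrelated"], size=400)

    svm = LinearSVC().fit(X_cues, y)
    for label, coefs in zip(svm.classes_, svm.coef_):
        ranking = [CATEGORIES[i] for i in np.argsort(-np.abs(coefs))]  # largest |coefficient| first
        print(label, ranking)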


#   Unrelated    Discuss      Disagree     Agree
1   Belief       Fake         Report       Belief
2   Negation     Negation     Fake         Report
3   Report       Belief       Denial       Doubt
4   Knowledge    Knowledge    Belief       Knowledge
5   Doubt        Denial       Negation     Denial
6   Fake         Doubt        Knowledge    Fake
7   Denial       Report       Doubt        Negation

Table 3: Importance order of the cue word categories for each class.

For the "Discuss" class, where articles do not show a clear stance in favor or against the title, we notice an overlap at the top of the ranking between the categories that are important for the "Disagree" and "Agree" classes. Finally, since the articles in the "Unrelated" class were created artificially by pairing titles with articles from different documents, the order of the categories for this class is not meaningful.

5 Conclusion and Future Work

Fake news is still an open research topic, and further contributions are required, especially to deal automatically with the massive growth of information on the Web. Our work approached stance detection for fake news with a simple model based on a combination of n-grams, word embeddings and a lexical representation of cue words; these lexical cue words have previously been used in rumor stance detection approaches. Although we used a simple feature set, we achieved results similar to the state of the art. This work is an initial step towards a further investigation of features to improve stance detection in fake news. As future work, we plan to focus on summarizing the articles in the dataset: as mentioned in Section 3.2, the length difference between the titles and the articles is large, so summarizing the articles may be a worthwhile attempt to improve the comparison between the two text fragments.

Acknowledgement

This research work was done in the framework of the SomEMBED TIN2015-71147-C2-1-P MINECO research project.

References

Hareesh Bahuleyan and Olga Vechtomova. 2017. UWaterloo at SemEval-2017 Task 8: Detecting stance towards rumours with topic independent features. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 461–464.

Sean Baird, Doug Sibley, and Yuxi Pan. 2017. Talos targets disinformation with Fake News Challenge victory. http://blog.talosintelligence.com/2017/06/talos-fake-news-challenge.

Alberto Barrón-Cedeño, Tamer Elsayed, Reem Suwaileh, Lluís Màrquez, Pepa Atanasova, Wajdi Zaghouani, Spas Kyuchukov, Giovanni Da San Martino, and Preslav Nakov. 2018. Overview of the CLEF-2018 CheckThat! lab on automatic identification and verification of political claims, task 2: Factuality. In CLEF 2018 Working Notes. Working Notes of CLEF 2018 - Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, Avignon, France. CEUR-WS.org.

Gaurav Bhatt, Aman Sharma, Shivam Sharma, Ankush Nagpal, Balasubramanian Raman, and Ankush Mittal. 2018. Combining neural, statistical and external features for fake news stance identification. In Companion of The Web Conference 2018, pages 1353–1357. International World Wide Web Conferences Steering Committee.

Leon Derczynski, Kalina Bontcheva, Maria Liakata, Rob Procter, Geraldine Wong Sak Hoi, and Arkaitz Zubiaga. 2017. SemEval-2017 Task 8: RumourEval: Determining rumour veracity and support for rumours. arXiv preprint arXiv:1704.05972.

Omar Enayet and Samhaa R. El-Beltagy. 2017. NileTMRG at SemEval-2017 Task 8: Determining rumour and veracity support for rumours on Twitter. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 470–474.

Bilal Ghanem, Labib Arafeh, Paolo Rosso, and Fernando Sánchez-Vega. 2018a. HYPLAG: Hybrid Arabic text plagiarism detection system. In International Conference on Applications of Natural Language to Information Systems, pages 315–323. Springer.

Bilal Ghanem, Manuel Montes-y-Gómez, Francisco Rangel, and Paolo Rosso. 2018b. UPV-INAOE-Autoritas - Check That: An approach based on external sources to detect claims credibility. In CLEF 2018 Working Notes. Working Notes of CLEF 2018 - Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, Avignon, France. CEUR-WS.org.

Bilal Ghanem, Manuel Montes-y-Gómez, Francisco Rangel, and Paolo Rosso. 2018c. UPV-INAOE-Autoritas - Check That: Preliminary approach for checking worthiness of claims. In CLEF 2018 Working Notes. Working Notes of CLEF 2018 - Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, Avignon, France. CEUR-WS.org.

Andreas Hanselowski, Avinesh PVS, Benjamin Schiller, and Felix Caspelherr. 2017. Team Athene on the Fake News Challenge. https://medium.com/@andre134679/team-athene-on-the-fake-news-challenge-28a5cf5e017b.

Andreas Hanselowski, Avinesh PVS, Benjamin Schiller, Felix Caspelherr, Debanjan Chaudhuri, Christian M. Meyer, and Iryna Gurevych. 2018. A retrospective analysis of the fake news challenge stance detection task. arXiv preprint arXiv:1806.05180.

Naeemul Hassan, Chengkai Li, and Mark Tremayne. 2015. Detecting check-worthy factual claims in presidential debates. In Proceedings of the 24th ACM International Conference on Information and Knowledge Management, pages 1835–1838. ACM.

Georgi Karadzhov, Pepa Gencheva, Preslav Nakov, and Ivan Koychev. 2018. We built a fake news & click-bait filter: What happened next will blow your mind! arXiv preprint arXiv:1803.03786.

Elena Kochkina, Maria Liakata, and Isabelle Augenstein. 2017. Turing at SemEval-2017 Task 8: Sequential approach to rumour stance classification with branch-LSTM. arXiv preprint arXiv:1704.07221.

Mirko Lai, Viviana Patti, Giancarlo Ruffo, and Paolo Rosso. 2018. Stance evolution and Twitter interactions in an Italian political debate. In 23rd International Conference on Applications of Natural Language to Information Systems, pages 15–27. Springer.

Amr Magdy and Nayer Wanas. 2010. Web-based statistical fact checking of textual documents. In Proceedings of the 2nd International Workshop on Search and Mining User-Generated Contents, pages 103–110. ACM.

David M. Markowitz and Jeffrey T. Hancock. 2014. Linguistic traces of a scientific fraud: The case of Diederik Stapel. PLoS ONE, 9(8):e105937.

Todor Mihaylov, Georgi Georgiev, and Preslav Nakov. 2015. Finding opinion manipulation trolls in news community forums. In Proceedings of the Nineteenth Conference on Computational Natural Language Learning, pages 310–314.

Saif Mohammad, Svetlana Kiritchenko, Parinaz Sobhani, Xiaodan Zhu, and Colin Cherry. 2016. SemEval-2016 Task 6: Detecting stance in tweets. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 31–41, San Diego, California.

Martin Potthast, Johannes Kiesel, Kevin Reinartz, Janek Bevendorff, and Benno Stein. 2017. A stylometric inquiry into hyperpartisan and fake news. arXiv preprint arXiv:1702.05638.

Delip Rao and Dean Pomerleau. 2017. Fake News Challenge. http://www.fakenewschallenge.org.

Benjamin Riedel, Isabelle Augenstein, Georgios P. Spithourakis, and Sebastian Riedel. 2017. A simple but tough-to-beat baseline for the Fake News Challenge stance detection task. arXiv preprint arXiv:1707.03264.

Kai Shu, Amy Sliva, Suhang Wang, Jiliang Tang, and Huan Liu. 2017. Fake news detection on social media: A data mining perspective. ACM SIGKDD Explorations Newsletter, 19(1):22–36.

Mariona Taulé, M. Antònia Martí, Francisco M. Rangel, Paolo Rosso, Cristina Bosco, Viviana Patti, et al. 2017. Overview of the task on stance and gender detection in tweets on Catalan independence at IberEval 2017. In 2nd Workshop on Evaluation of Human Language Technologies for Iberian Languages, IberEval 2017, volume 1881, pages 157–177. CEUR-WS.

Svitlana Volkova, Kyle Shaffer, Jin Yea Jang, and Nathan Hodas. 2017. Separating facts from fiction: Linguistic models to classify suspicious and trusted news posts on Twitter. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), volume 2, pages 647–653.


Zero-shot Relation Classification as Textual Entailment

Abiola Obamuyide
Department of Computer Science
University of Sheffield
[email protected]

Andreas Vlachos
Department of Computer Science
University of Sheffield
[email protected]

Abstract

We consider the task of relation classification, and pose this task as one of textual entailment. We show that this formulation leads to several advantages, including the ability to (i) perform zero-shot relation classification by exploiting relation descriptions, (ii) utilize existing textual entailment models, and (iii) leverage readily available textual entailment datasets, to enhance the performance of relation classification systems. Our experiments show that the proposed approach achieves 20.16% and 61.32% F1 in zero-shot classification performance on two datasets, which further improves to 22.80% and 64.78% respectively with the use of conditional encoding.

1 Introduction

The task of determining the relation between various entities from text is an important one for many natural language understanding systems, including question answering, knowledge base construction and web search. Relation classification is an essential part of many high-performing relation extraction systems in the NIST-organised TAC Knowledge Base Population (TAC-KBP) track (Ji and Grishman, 2011; Adel et al., 2016). As a result of its wide application, many approaches and systems have been proposed for this task (Zelenko et al., 2003; Surdeanu et al., 2012; Riedel et al., 2013; Zhang et al., 2017).

A shortcoming common to previously proposed approaches, however, is that they identify only relations observed at training time, and are unable to generalize to new (unobserved) relations at test time. To address this challenge, we propose to formulate relation classification as follows: given a unit of text T which mentions a subject X and a candidate object Y of a knowledge base relation R(X,Y), and a natural language description d of R, we wish to evaluate whether T expresses R(X,Y). We formulate this task as a textual entailment problem in which the unit of text and the relation description can be considered as the premise P and hypothesis H respectively. The challenge then becomes that of determining the truthfulness of the hypothesis given the premise. Table 1 gives examples of knowledge base relations and their natural language descriptions.

This formulation brings a number of advantages. First, we are able to perform zero-shot classification of new relations by generalizing from the descriptions of relations seen at training time to those of unseen relations at test time. Given a collection of relations, for instance spouse(X,Y) and city of birth(X,Y), together with their natural language descriptions and training examples, we can learn a model that can classify other instances of these relations, as well as instances of other relations that were not observed at training time, for instance child(X,Y), given their descriptions. In addition to being able to utilize existing state-of-the-art textual entailment models for relation classification, our approach can use distant supervision data together with data from textual entailment as additional supervision for relation classification.

In experiments on two datasets, we assess the performance of our approach in two supervision settings: a zero-shot setting, where no supervision examples are available for new relations, and a few-shot setting, where our models have access to limited supervision examples of new relations. In the former setting our approach achieves 20.16% and 61.32% F1 classification performance on the two datasets considered, which further improves to 22.80% and 64.78% respectively with the use of conditional encoding. Similar improvements hold in the latter setting as well.


Relation: religious order   Subject (X): Lorenzo Ricci   Object (Y): Society of Jesus
  Text (Premise): X (August 1, 1703 – November 24, 1775) was an Italian Jesuit, elected the 18th Superior General of the Y.
  Description (Hypothesis): X was a member of the group Y

Relation: director   Subject (X): Kispus   Object (Y): Erik Balling
  Text (Premise): X is a 1956 Danish romantic comedy written and directed by Y.
  Description (Hypothesis): The director of X is Y

Relation: designer   Subject (X): Red Baron II   Object (Y): Dynamix
  Text (Premise): X is a computer game for the PC, developed by Y and published by Sierra Entertainment.
  Description (Hypothesis): Y is the designer of X

Table 1: Examples of relations, entities, sample text instances, and relation descriptions.

2 Related Work

Most recent work, including Adel et al. (2016) and Zhang et al. (2017), proposed models that assume the availability of supervised data for the task of relation classification. Rocktäschel et al. (2015) and Demeester et al. (2016) inject prior knowledge in the form of propositional logic rules to improve relation extraction for new relations with zero or few training labels, in the context of the universal schema approach (Riedel et al., 2013). They considered the use of propositional logic rules which, for instance, can be mined from external knowledge bases (such as Freebase (Bollacker et al., 2008)) or obtained from ontologies such as WordNet (Miller, 1995). However, the use of propositional logic rules assumes prior knowledge of the possible relations between entities, and is thus of limited application in extracting new relations.

Levy et al. (2017) showed that a related and complementary task, that of entity/attribute relation extraction, can be reduced to a question answering problem. The task we address in this work is that of zero-shot relation classification, which determines whether a given relation holds between two given entities in text. As a result, the output of our approach is a binary classification decision indicating whether a given relation exists between two given entities in text. The task performed by Levy et al. (2017) is that of zero-shot entity/attribute relation extraction, since their approach returns the span corresponding to the relation arguments ("answers") from the text.1 In addition, our approach for zero-shot relation classification utilizes relation descriptions, which are typically available in relation ontologies, and is thus not reliant on crowd-sourcing.

Our approach also takes inspiration from various approaches for leveraging knowledge from a set of source tasks to target tasks, such as recent transfer learning methods in natural language processing (Peters et al., 2018; McCann et al., 2017).

1 Note that for this reason, a direct comparison between the two approaches is not straightforward, as this would be akin to comparing a text classification model and a question answering model.

Closest to our work is that of Conneau et al. (2017), who showed that representations learned from natural language inference data can enhance performance when transferred to a number of other natural language tasks. In this work, we consider the task of zero-shot relation classification by utilizing relation descriptions.

3 Model

Our approach takes as input two pieces of text, a sentence containing the subject and object entities of a candidate relation and the relation's description, and returns as output a binary response indicating whether the meaning of the description can be inferred between the two entities in the sentence. See Table 1 for some examples. The problem of determining whether the meaning of a text fragment can be inferred from another is that of natural language inference/textual entailment (Dagan et al., 2005; Bowman et al., 2015).

We take as our base model the Enhanced Sequential Inference Model (ESIM) introduced by Chen et al. (2017), one of the commonly used models for text pair tasks. ESIM utilizes Bidirectional Long Short-Term Memory (BiLSTM) units (Hochreiter and Schmidhuber, 1997; Graves and Schmidhuber, 2005) as a building block and accepts two sequences of text as input. It then passes the two sequences through three model stages (input encoding, local inference modelling and inference composition), and returns the class c with the highest classification score, where c in textual entailment is one of entailment, contradiction or neutral. In our experiments, for each (sentence, relation description) pair we return a 2-way classification prediction instead.

In this section we briefly describe the input encoding and inference composition stages, which we adapt using conditional encoding as described in the following subsection. The input encoding and inference composition stages operate analogously: each receives as input two sequences of vectors, $\{p_i\}$ and $\{h_j\}$, or more compactly two matrices $P \in \mathbb{R}^{I \times d}$ for the premise and $H \in \mathbb{R}^{J \times d}$ for the hypothesis, where $I$ and $J$ are respectively the number of words in the premise and hypothesis, and $d$ is the dimensionality of each vector representation. In the case of the input encoding layer, $P$ and $H$ are the word embeddings of the premise and hypothesis respectively, while in the case of inference composition, $P$ and $H$ are internal model representations derived from the preceding local inference modelling stage. The input sequences are then processed with BiLSTM units to yield new sequences $\bar{P} \in \mathbb{R}^{I \times 2d}$ for the premise and $\bar{H} \in \mathbb{R}^{J \times 2d}$ for the hypothesis:

$$\bar{P},\ \overrightarrow{c}_p,\ \overleftarrow{c}_p = \mathrm{BiLSTM}(P) \qquad (1)$$

$$\bar{H},\ \overrightarrow{c}_h,\ \overleftarrow{c}_h = \mathrm{BiLSTM}(H) \qquad (2)$$

where $\overrightarrow{c}_p, \overleftarrow{c}_p \in \mathbb{R}^d$ are respectively the last cell states in the forward and reverse directions of the BiLSTM that reads the premise, and $\overrightarrow{c}_h, \overleftarrow{c}_h \in \mathbb{R}^d$ are similarly defined for the hypothesis.

3.1 Conditional encoding for ESIM

When used for zero-shot relation classification, ESIM encodes the sentence independently of the relation description. Given a new target relation's description, it is desirable for the representations computed for the sentence to take into account the representations of the target relation description. We therefore explicitly condition the representations of the sentence on those of the relation description using a conditional BiLSTM (cBiLSTM) unit (Rocktäschel et al., 2016). Thus, Equation 1 is replaced with:

$$\bar{P} = \mathrm{cBiLSTM}(P, \overrightarrow{c}_h, \overleftarrow{c}_h) \qquad (3)$$

where $\overrightarrow{c}_h$ and $\overleftarrow{c}_h$ respectively denote the last memory cell states in the forward and reverse directions of the BiLSTM that reads the relation description. This adaptation is made to both the input encoding and inference composition stages. We refer to the adapted ESIM as the Conditioned Inference Model (CIM) in subsequent sections.
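As an illustration, a minimal TensorFlow 1.x-style sketch of one way to realise Equation 3 (variable names are ours, and the real model applies the same conditioning in both the input encoding and inference composition stages):

    import tensorflow as tf  # TensorFlow 1.x graph-mode API

    def conditional_bilstm(premise_emb, hypothesis_emb, num_units):
        """Encode the hypothesis, then initialise the premise BiLSTM's cell states
        with the hypothesis' final cell states (its hidden states start at zero)."""
        with tf.variable_scope("hypothesis"):
            h_out, (h_state_fw, h_state_bw) = tf.nn.bidirectional_dynamic_rnn(
                tf.nn.rnn_cell.LSTMCell(num_units), tf.nn.rnn_cell.LSTMCell(num_units),
                hypothesis_emb, dtype=tf.float32)
        with tf.variable_scope("premise"):
            zero_h = tf.zeros_like(h_state_fw.h)
            init_fw = tf.nn.rnn_cell.LSTMStateTuple(c=h_state_fw.c, h=zero_h)
            init_bw = tf.nn.rnn_cell.LSTMStateTuple(c=h_state_bw.c, h=zero_h)
            p_out, _ = tf.nn.bidirectional_dynamic_rnn(
                tf.nn.rnn_cell.LSTMCell(num_units), tf.nn.rnn_cell.LSTMCell(num_units),
                premise_emb, initial_state_fw=init_fw, initial_state_bw=init_bw)
        # Concatenate forward and backward outputs: shape [batch, length, 2 * num_units].
        return tf.concat(p_out, axis=-1), tf.concat(h_out, axis=-1)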

4 Datasets

We evaluate our approach using the datasets of Adel et al. (2016) and Levy et al. (2017). The dataset of Adel et al. (2016) (LMU-RC) is split into training, development and evaluation sets. The training set was generated by distant supervision, and the development and test data were obtained from manually annotated TAC-KBP system outputs. We obtained the descriptions for the relations from the TAC-KBP relation ontology guidelines.2 This resulted in a dataset of about 6 million positive and negative instances, each consisting of a relation, its subject and object entities, a sentence containing both entities, and a relation description.

We applied a similar process to the relation extraction dataset of Levy et al. (2017) (UW-RE). It consists of 120 relations and a set of question templates for each relation, containing both positive and negative relation instances, with each instance consisting of a subject entity, a knowledge base relation, a question template for the relation, and a sentence retrieved from the subject entity's Wikipedia page. We wrote descriptions for each of the 120 relations in the dataset, with each relation's question templates serving as a guide. Thus all instances in the dataset (30 million positive and 2 million negative ones) now include the corresponding relation description, making them suitable for relation classification using our approach.

In addition to the two datasets, we also utilize the MultiNLI natural language inference corpus (Williams et al., 2018) in our experiments as a source of supervision. We map its entailment and contradiction class instances to positive and negative relation instances respectively.
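A small sketch of that mapping (field names follow the public MultiNLI JSONL release; how neutral pairs are handled is not specified in the paper, so they are skipped here):

    import json

    LABEL_MAP = {"entailment": 1, "contradiction": 0}  # positive / negative relation instance

    def multinli_as_relation_supervision(jsonl_path):
        with open(jsonl_path) as f:
            for line in f:
                ex = json.loads(line)
                label = LABEL_MAP.get(ex.get("gold_label"))
                if label is None:   # skip neutral and unlabelled pairs
                    continue
                # the premise plays the role of the sentence, the hypothesis that of the description
                yield ex["sentence1"], ex["sentence2"], label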

5 Experiments and Results

We conduct two sets of experiments. The first set of experiments tests the performance of our approach in the zero-shot setting, where no supervision instances are available for new relations (Section 5.1). The second set of experiments measures the performance of our approach in the limited supervision regime, where varying levels of supervision are available (Section 5.2).

Implementation Details. Our model is implemented in TensorFlow (Abadi et al., 2016).

2https://tac.nist.gov/2015/KBP/ColdStart/guidelines/TAC_KBP_2015_Slot_Descriptions_V1.0.pdf


Dataset   Model   F1 (%)
LMU-RC    ESIM    20.16
          CIM     22.80
UW-RE     ESIM    61.32
          CIM     64.78

Table 2: Zero-shot relation learning results for ESIM and CIM.

Dataset   Supervision   F1 (%)
LMU-RC    TE            25.54
          TE+DS         26.28
UW-RE     TE            44.38
          TE+DS         62.33

Table 3: Zero-shot relation learning results for model CIM pre-trained on two sources of data: Textual Entailment (TE), or both Distant Supervision and Textual Entailment (TE+DS). The results in Table 2 correspond to DS-only supervision.

We initialize word embeddings with 300-dimensional GloVe vectors (Pennington et al., 2014). We found a few epochs of training (generally fewer than 5) to be sufficient for convergence. We apply dropout with a keep probability of 0.9 to all layers. The result reported for each experiment is the average over five runs with independent random initializations. In order to prevent overfitting to specific entities, we mask out the subject and object entities with the tokens SUBJECT ENTITY and OBJECT ENTITY respectively.
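A sketch of the masking step (plain string replacement; the underscore form of the placeholder tokens and the lack of tokenisation are our simplifications):

    def mask_entities(sentence, subject, obj):
        """Replace entity mentions with placeholder tokens so the model cannot
        memorise specific entities."""
        return (sentence.replace(subject, "SUBJECT_ENTITY")
                        .replace(obj, "OBJECT_ENTITY"))

    # mask_entities("Kispus is a 1956 Danish romantic comedy written and directed by Erik Balling.",
    #               "Kispus", "Erik Balling")
    # -> "SUBJECT_ENTITY is a 1956 Danish romantic comedy written and directed by OBJECT_ENTITY."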

5.1 Zero-shot Relation Learning

For this experiment we created ten folds of each dataset, with each fold partitioned into train/dev/test splits along relations. In each fold, a relation belongs exclusively to either the train, dev or test split.

Table 2 shows the averaged F1 across the folds for the models on the LMU-RC and UW-RE datasets. We observe that, using only distant supervision for the training relations and without supervision for the test relations, the models were still able to make predictions for them, though at different performance levels. CIM obtained better performance than ESIM, as a result of its use of conditional encoding.

Table 3 shows F1 scores of model CIM pre-trained on only MultiNLI (referred to as TE) or a combination of MultiNLI and distant supervision (referred to as TE+DS) data in the zero-shot setting.

Figure 1: Limited supervision results: F1 scores on UW-RE as the fraction of training data (τ) is varied. When τ = 0, we recover the zero-shot results in Table 2.

We find that CIM pre-trained on only textual entailment data is already able to make predictions for unseen test relations, while using a combination of distant supervision and textual entailment data achieves improved F1 scores across both datasets, demonstrating the validity of our approach in this setting. We also note that using TE+DS data performs worse than DS data alone in the case of the UW-RE dataset, unlike in the case of LMU-RC. We hypothesize that this is because DS data performs much better for the former.

5.2 Few-shot Relation Learning

For the experiments in the limited-supervision setting, we randomly partition the dataset along relations into a train/dev/test split. Similar to the zero-shot setting, a relation belongs to each split exclusively. Then, for each experiment, we make available to each model a fraction τ of example instances of the relations in the test set as supervision. Note that these particular example instances are a disjoint set of instances which are not present in the development and evaluation sets. In addition to ESIM and our proposed model CIM, we also report results for the TACRED Relation Extractor (TACRED-RE), the position-aware RNN model that was found to achieve state-of-the-art results on the TACRED dataset (Zhang et al., 2017). TACRED-RE is a supervised model that expects labelled data for all relations during training, and is thus not applicable in the zero-shot setup.

Results for this set of experiments are shown in Figure 1 for the UW-RE dataset. We find that only about 5% of the training data is required for both ESIM and CIM to reach around 80% F1 performance, with CIM outperforming ESIM in the 0-6% interval.


Figure 2: F1 scores on LMU-RC as the fraction of training data (τ) is varied.

However, beyond this interval we do not observe any major difference in performance between ESIM and CIM, demonstrating that CIM performs well in both the zero-shot and limited-supervision settings. For context, when given full supervision on the UW-RE dataset, CIM and TACRED-RE obtain F1 scores of 94.82% and 87.73% respectively. A similar trend is observed for the LMU-RC dataset, whose plot is shown in Figure 2.

In general, all models obtain better results on UW-RE than on LMU-RC. We hypothesize that the performance difference is due to UW-RE being derived from Wikipedia documents (which typically have well-written text), while LMU-RC was obtained from different genres and sources (such as discussion forum posts and web documents), which tend to be noisier.

5.3 Qualitative Results

Figure 3 depicts a visualization of the normalized attention weights assigned by model CIM to instances drawn from the development set. We observe that it is able to attend to words that are semantically coherent with the premise ("novel" and "author", Figure 3a; "studied" and "university", Figure 3b).

6 Conclusions

We show that the task of relation classification can be addressed through the use of relation descriptions, by formulating the task as one of textual entailment between the relation description and the piece of text. This leads to several advantages, including the ability to perform zero-shot relation classification and to use textual entailment models and datasets to improve performance.

Figure 3: Attention visualization (panels (a) and (b)).

Acknowledgments

We are grateful to Pasquale Minervini and Jeff Mitchell for helpful conversations and suggestions, and to the Sheffield NLP group and anonymous reviewers for valuable feedback. This research is supported by the EU H2020 SUMMA project (grant agreement number 688139).

References

Martin Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, and others. 2016. TensorFlow: A system for large-scale machine learning. In OSDI, volume 16, pages 265–283.

Heike Adel, Benjamin Roth, and Hinrich Schütze. 2016. Comparing convolutional neural networks to traditional models for slot filling. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 828–838, San Diego, California. Association for Computational Linguistics.


Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: A collaboratively created graph database for structuring human knowledge. In SIGMOD '08: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pages 1247–1250.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In EMNLP.

Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, Hui Jiang, and Diana Inkpen. 2017. Enhanced LSTM for natural language inference. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1657–1668. Association for Computational Linguistics.

Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 670–680. Association for Computational Linguistics.

Ido Dagan, Oren Glickman, and Bernardo Magnini. 2005. The PASCAL Recognising Textual Entailment Challenge. In Machine Learning Challenges, Evaluating Predictive Uncertainty, Visual Object Classification and Recognizing Textual Entailment, First PASCAL Machine Learning Challenges Workshop, MLCW 2005, Southampton, UK, April 11-13, 2005, Revised Selected Papers, pages 177–190.

Thomas Demeester, Tim Rocktäschel, and Sebastian Riedel. 2016. Lifted rule injection for relation embeddings. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP-16), pages 1389–1399.

Alex Graves and Jürgen Schmidhuber. 2005. Framewise phoneme classification with bidirectional LSTM networks. In Proceedings of the International Joint Conference on Neural Networks, volume 4, pages 2047–2052.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Heng Ji and Ralph Grishman. 2011. Knowledge base population: Successful approaches and challenges. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 1148–1158. Association for Computational Linguistics.

Omer Levy, Minjoon Seo, Eunsol Choi, and Luke Zettlemoyer. 2017. Zero-shot relation extraction via reading comprehension.

Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in translation: Contextualized word vectors. In Advances in Neural Information Processing Systems, pages 6297–6308.

George A. Miller. 1995. WordNet: A lexical database for English. Communications of the ACM, 38(11):39–41.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proc. of NAACL.

Sebastian Riedel, Limin Yao, Andrew McCallum, and Benjamin M. Marlin. 2013. Relation extraction with matrix factorization and universal schemas. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 74–84.

Tim Rocktäschel, Edward Grefenstette, Karl Moritz Hermann, Tomáš Kočiský, and Phil Blunsom. 2016. Reasoning about entailment with neural attention. In International Conference on Learning Representations (ICLR).

Tim Rocktäschel, Sameer Singh, and Sebastian Riedel. 2015. Injecting logical background knowledge into embeddings for relation extraction. In NAACL HLT 2015, The 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Denver, Colorado, USA, May 31 - June 5, 2015, pages 1119–1129.

Mihai Surdeanu, Julie Tibshirani, Ramesh Nallapati, and Christopher D. Manning. 2012. Multi-instance multi-label learning for relation extraction. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP '12, pages 455–465.

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122, New Orleans, Louisiana. Association for Computational Linguistics.

Dmitry Zelenko, Chinatsu Aone, and Anthony Richardella. 2003. Kernel methods for relation extraction. The Journal of Machine Learning Research, 3:1083–1106.


Yuhao Zhang, Victor Zhong, Danqi Chen, Gabor Angeli, and Christopher D. Manning. 2017. Position-aware attention and supervised data improve slot filling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 35–45. Association for Computational Linguistics.


Teaching Syntax by Adversarial Distraction

Juho Kim
University of Illinois
Urbana, IL
[email protected]

Christopher Malon
NEC Laboratories America
Princeton, NJ
[email protected]

Asim Kadav
NEC Laboratories America
Princeton, NJ
[email protected]

Abstract

Existing entailment datasets mainly pose problems which can be answered without attention to grammar or word order. Learning syntax requires comparing examples where different grammar and word order change the desired classification. We introduce several datasets based on synthetic transformations of natural entailment examples in SNLI or FEVER, to teach aspects of grammar and word order. We show that without retraining, popular entailment models are unaware that these syntactic differences change meaning. With retraining, some but not all popular entailment models can learn to compare the syntax properly.

1 Introduction

Natural language inference (NLI) is the task of identifying the entailment relationship between a premise sentence and a hypothesis sentence. Given the premise, a hypothesis may be true (entailment), false (contradiction), or not clearly determined (neutral). NLI is an essential aspect of natural language understanding. The release of datasets with hundreds of thousands of example pairs, such as SNLI (Bowman et al., 2015) and MultiNLI (Williams et al., 2018), has enabled the development of models based on deep neural networks that have achieved near human-level performance.

However, high accuracies on these datasets do not mean that the NLI problem is solved. Annotation artifacts make it possible to correctly guess the label for many hypotheses without even considering the premise (Gururangan et al., 2018; Poliak et al., 2018; Tsuchiya, 2018). Successfully trained systems can be disturbed by small changes to the input (Glockner et al., 2018; Naik et al., 2018).

In this paper, we show that existing trained NLI systems are mostly unaware of the relation between syntax and semantics, particularly of how word order affects meaning. We develop a technique, "adversarial distraction," to teach networks to properly use this information. The adversarial distraction technique consists of creating pairs of examples where information matching the premise is present in the hypothesis in both cases, but differing syntactic structure leads to different entailment labels. We generate adversarial distractions automatically from SNLI and an NLI dataset derived from FEVER (Thorne et al., 2018), thus augmenting the datasets. We observe the behavior of several existing NLI models on the added examples, finding that they are mostly unaware that the syntactic changes have affected the meaning. We then retrain the models with the added examples, and determine whether these weaknesses are limitations of the models or simply due to the lack of appropriate training data.

2 Related work

Datasets for NLI: SNLI and MultiNLI are both based on crowdsourced annotation. In SNLI all the premises came from image captions (Young et al., 2014), whereas MultiNLI collected premises from several genres including fiction, letters, telephone speech, and a government report. SciTail (Khot et al., 2018) constructed more complicated hypotheses based on multiple-choice science exams, whose premises were taken from web text. More recently, FEVER introduced a fact verification task, where claims are to be verified using all of Wikipedia. As FEVER established ground truth evidence for or against each claim, premises can be collected with a retrieval module and labeled as supporting, contradictory, or neutral for an NLI dataset.

Neural network based NLI systems: Dozens of neural network based models have been submitted to the SNLI leaderboard. Some systems have been developed based on sentence representations (Conneau et al., 2017; Nie and Bansal, 2017), but the most common models apply attention between tokens in the premise and hypothesis. We focus on three influential models of this kind: Decomposable Attention (Parikh et al., 2016), ESIM (Chen et al., 2017), and a pre-trained transformer network (Radford et al., 2018) which obtains state-of-the-art results for various NLI datasets including SNLI and SciTail.

Adversarial examples for NLI systems: Jia and Liang (2017) introduced the notion of distraction for reading comprehension systems by trying to fool systems for SQuAD (Rajpurkar et al., 2016) with information nearly matching the question, added to the end of a supporting passage. Glockner et al. (2018) showed that many NLI systems were confused by hypotheses that were identical to the premise except for the replacement of a word by a synonym, hypernym, co-hyponym, or antonym. Naik et al. (2018) found that adding the same strings of words to NLI examples without changing the logical relation could significantly change results, because of word overlap, negation, or length mismatches.

Other work (Kang et al., 2018; Zhao et al., 2018) aimed to improve model robustness in the framework of generative adversarial networks (Goodfellow et al., 2014). Ribeiro et al. (2018) generated semantically equivalent examples using a set of paraphrase rules derived from a machine translation model. In contrast to these kinds of adversarial examples, we focus on the model not being sensitive enough to small changes that do change meaning.

3 Teaching Syntax

Our adversarial examples attack NLI systems from a new direction: not in their failure to capture the relations between words as in Glockner et al. (2018), but in their failure to consider the syntax and word order of premises and hypotheses to decide upon the entailment relation. As position-agnostic approaches such as Decomposable Attention and SWEM (Shen et al., 2018) provide competitive baselines on existing datasets, the power of models to interpret token position information has not been rigorously tested.

3.1 Passive voice

We first evaluate and teach the use of the passive voice. By changing a hypothesis to the passive, we can obtain a semantically equivalent hypothesis with identical tokens except for a different conjugation of the verb, the insertion of the word "by," and a different word order.

To perform the conversion, we use the semantic role labeling results of SENNA (Collobert et al., 2011) to identify verbs and their ARG0 and ARG1 relations. We change the form of the verb and move the ARG0 and ARG1 phrases to opposite sides of the verb to form the passive. Here, we use head words identified in dependency parsing results and the part-of-speech tag of the verb from spaCy (Honnibal and Johnson, 2015) to change the verb form correctly according to the plurality of these nouns and the tense of the verb. The transformation is applied only at the root verb identified in the dependency parsing output.

If the addition of the passive were the only augmentation, models would not be learning that word order matters. Thus, in cases where the original pair is an entailment, we add an adversarial distraction where the label is contradiction, by reversing the subject and object in the hypothesis after transformation to passive. We call this the passive reversal. We filter out cases where the root verb in the hypothesis is a reciprocal verb, such as "meet" or "kiss," or a verb appearing with the preposition "with," so that the resulting sentence is surely not implied by the premise if the original is. For example, the hypothesis "A woman is using a large umbrella" (entailment) generates the passive example "A large umbrella is being used by a woman" (entailment) and the passive reversal "A woman is being used by a large umbrella" (contradiction).
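A simplified sketch of this construction. It assumes the SRL and parsing steps (SENNA and spaCy in the paper) have already produced the root verb, its past participle, and the ARG0/ARG1 spans; only the string surgery and the reversal are shown:

    RECIPROCAL = {"meet", "kiss", "marry", "hug"}  # illustrative filter list

    def to_passive(agent, past_participle, patient, plural_patient=False, progressive=True):
        """ARG0 VERB ARG1  ->  'ARG1 is/are (being) VERB-ed by ARG0'."""
        be = "are" if plural_patient else "is"
        aux = be + " being" if progressive else be
        return "{} {} {} by {}".format(patient, aux, past_participle, agent)

    def passive_examples(agent, verb_lemma, past_participle, patient, label):
        if verb_lemma in RECIPROCAL:
            return []
        examples = [(to_passive(agent, past_participle, patient), label)]
        if label == "entailment":
            # adversarial distraction: swap the arguments to flip the label
            examples.append((to_passive(patient, past_participle, agent), "contradiction"))
        return examples

    # passive_examples("a woman", "use", "used", "a large umbrella", "entailment")
    # -> [("a large umbrella is being used by a woman", "entailment"),
    #     ("a woman is being used by a large umbrella", "contradiction")]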

3.2 Person reversal

One weakness of adversarial distraction by passive reversal is that many hypotheses become ridiculous. A model leveraging language model information can guess that a hypothesis such as "A man is being worn by a hat" is a contradiction without considering the premise. Indeed, when we train a hypothesis-only baseline (Poliak et al., 2018) with default parameters on the SNLI dataset augmented with passive and passive reversal examples, 95.98% of passive reversals are classified correctly from the hypothesis alone, while only 67.89% of the original and 64.69% of the passive examples are guessed correctly.


               SNLI                                    FEVER
               original   passive   passive rev   original   person rev   birthday
# train        549,367    129,832   39,482        602,240    3,154        143,053
# validation     9,842      2,371      724         42,541       95          9,764
# test           9,824      2,325      722         42,970       69          8,880

Table 1: The number of examples in the original SNLI and FEVER data, and the number of examples generated for each adversarial distraction.

To generate more plausible adversarial distractions, we try reversing named entities referring to people. Most person names should be equally likely with or without reversal with respect to a language model, so the generated examples should rely on understanding the syntax of the premise correctly.

SNLI generally lacks named entities, because it is sourced from image captions, so we consider FEVER data instead. The baseline FEVER system (Thorne et al., 2018) retrieves up to five sentences of potential evidence from Wikipedia for each claim, by comparing TFIDF scores. We label each of these evidence/claim pairs as entailment or contradiction, according to the claim label, if the evidence appears in the ground truth evidence set, and as neutral otherwise. Because the potential evidence is pulled from the middle of an article, it may be taken out of context with coreference relations unresolved. To provide a bit of context, we prefix each premise with the title of the Wikipedia page it comes from, punctuated by brackets.

Our person reversal dataset generates contradictions from entailment pairs by considering person named entities identified by SENNA in the hypothesis, and reversing them if they appear within the ARG0 and ARG1 phrases of the root verb. Again, we filter examples with reciprocal verbs and the preposition "with." For example, the FEVER claim "Lois Lane's name was taken from Lola Lane's name" (entailment) leads to a person reversal of "Lola Lane's name was taken from Lois Lane's name" (contradiction).

To compare the plausibility of the examples, we train the same hypothesis-only baseline on the FEVER dataset augmented with the person reversals. It achieves 15.94% accuracy on the added examples, showing that the person reversals are more plausible than the passive reversals.

3.3 Life spans

Our third adversarial distraction (birthday) involves distinguishing birth and death date information. It randomly inserts birth and death dates into a premise following a person named entity, in parentheses, using one of two date formats. If it chooses a future death date, no death date is inserted. A newly generated hypothesis randomly gives a statement about either birth or death, and either the year or the month, with labels equally balanced among entailment, contradiction, and neutral. For half of the contradictions it simply reverses the birth and death dates; otherwise it randomly chooses another date. For the neutral examples it asks about a different named entity, taken from the same sentence if possible.

For example, a birthday and a death date are randomly generated to yield the premise, "[Daenerys Targaryen] Daenerys Targaryen is a fictional character in George R. R. Martin's A Song of Ice and Fire series of novels, as well as the television adaptation, Game of Thrones, where she is portrayed by Emilia Clarke (April 25, 860 – November 9, 920)," and the hypothesis, "Emilia Clarke died in April" (contradiction).
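A simplified sketch of the generator (only one date format and year-level hypotheses are shown; the real generator also produces month-level questions, omits future death dates, adds neutral entities, and balances the labels):

    import random

    MONTHS = ["January", "February", "March", "April", "May", "June", "July",
              "August", "September", "October", "November", "December"]

    def format_date(year, month, day):
        return "{} {}, {}".format(MONTHS[month - 1], day, year)

    def birthday_example(premise, person, rng=random):
        """Insert a random life span after `person` and build one year-level hypothesis."""
        birth = (rng.randint(800, 1980), rng.randint(1, 12), rng.randint(1, 28))
        death = (birth[0] + rng.randint(20, 90), rng.randint(1, 12), rng.randint(1, 28))
        span = " ({} – {})".format(format_date(*birth), format_date(*death))
        new_premise = premise.replace(person, person + span, 1)
        if rng.random() < 0.5:
            return new_premise, "{} was born in {}".format(person, birth[0]), "entailment"
        # contradiction: claim the death year as the birth year
        return new_premise, "{} was born in {}".format(person, death[0]), "contradiction"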

4 Evaluation

4.1 Experiments

We consider three NLI systems based on deep neural networks: Decomposable Attention (DA) (Parikh et al., 2016), ESIM (Chen et al., 2017), and a Finetuned Transformer Language Model (FTLM) (Radford et al., 2018). For Decomposable Attention we take the AllenNLP implementation (Gardner et al., 2017) without ELMo features (Peters et al., 2018); for the others, we take the releases from the authors. We modify the code released for FTLM to support entailment classification, following the description in the paper.

For the FEVER-based datasets, for DA and ESIM, we reweight each class in the training data in inverse proportion to its number of examples.


        SNLI                                   FEVER
        Original   Passive   Passive rev.   Original        Person rev.   Birthday
DA      .8456      .8301     .0111          .8416 (.1503)   .0435         .2909
ESIM    .8786      .8077     .0139          .8445 (.3905)   .0290         .3134
FTLM    .8980      .8430     .0540          .9585 (.6656)   .0000         .2953

Table 2: Accuracy and (Cohen's Kappa) when training on original SNLI or FEVER data and testing on original or added examples.

        SNLI                                   Person rev.               Birthday
        Original   Passive   Passive rev.   Original        Added     Original        Added
DA      .8517      .7333     .5042          .8552 (.1478)   .1449     .8925 (.1550)   .4700
ESIM    .8781      .8667     .9833          .8406 (.3809)   .6232     .8721 (.4404)   .9684
FTLM    .8953      .8920     .9917          .9581 (.6610)   .7536     .9605 (.6809)   .9926

Table 3: Accuracy and (Cohen's Kappa) when training on augmented SNLI (SNLI + passive + passive reversal) or augmented FEVER (FEVER + person reversal or FEVER + birthday) and testing on original or added examples.

This reweighting is necessary to produce nontrivial (most frequent class) results; the NLI training set we derive from FEVER has 92% neutral, 6% entailment, and 2% contradiction examples. FTLM requires no such reweighting. When evaluating on the original FEVER examples, we report Cohen's Kappa between the predicted and ground truth classifications, in addition to accuracy, because the imbalance pushes DA and ESIM below the accuracy of a trivial classifier.

Whereas FTLM uses a byte-pair encoding vocabulary (Sennrich et al., 2016) that can represent any word as a combination of subword tokens, DA and ESIM rely on word embeddings from GloVe (Pennington et al., 2014), with a single out-of-vocabulary (OOV) token shared by all unknown words. Therefore it is unreasonable to expect DA and ESIM not to confuse named entities in the FEVER tasks. We extend each of these models by allocating 10,000 random vectors for out-of-vocabulary words, and taking a hash of each OOV word to select one of the vectors. The vectors are initialized from a normal distribution with mean 0 and standard deviation 1.
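A sketch of this OOV extension (the particular hash function is our choice; any deterministic hash of the word will do):

    import hashlib
    import numpy as np

    NUM_BUCKETS = 10000
    EMBED_DIM = 300
    rng = np.random.RandomState(0)
    oov_vectors = rng.normal(loc=0.0, scale=1.0, size=(NUM_BUCKETS, EMBED_DIM))  # N(0, 1) vectors

    def lookup(word, glove):
        """Return the GloVe vector if the word is known, otherwise a hashed OOV vector.
        `glove` is assumed to be a dict mapping words to 300-d numpy arrays."""
        if word in glove:
            return glove[word]
        bucket = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % NUM_BUCKETS
        return oov_vectors[bucket]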

4.2 Results

When we train the three models using the original SNLI without augmentation, the models have slightly lower performance on the passive examples than on the original data. However, all three models fail to properly classify the passive reversal data: without training, it looks too similar to the original hypothesis. The augmented data succeeds in training two out of three of the models about the passive voice: ESIM and FTLM can classify the passive examples with approximately the accuracy of the original examples, and with even higher accuracy they can pick out the passive reversals as preposterous. However, DA cannot do better than guess whether a passive sentence is reversed or not. This is because its model is defined so that its output is invariant to changes in word order. Because it must consider the possibility of a passive reversal, its performance on the passive examples actually goes down after training with augmentations.

Person reversal also stumps all three models be-fore retraining. Of course DA can get a person re-versal right only when it gets an original examplewrong, because of its insensitivity to word order.ESIM and FTLM find person reversals to be moredifficult than the original examples. Compared topassive reversal, the lack of language hints seemsto make the problem more challenging. However,the multitude of conditions necessary to performa person reversal makes the added examples lessthan 1% of the overall training data.

No model trained on FEVER initially does better than random guessing on the birthday problem. By training with augmented examples, ESIM and FTLM learn to use the structure of the premise properly to solve this problem. DA learns some hints after retraining, but essentially the problem depends on word order, which it is blind to. It is noteworthy that the performance of all three models on the original data improves after the birthday examples are added, unlike the other two augmentations, where performance remains the same. Four percent of the original FEVER claims use the word "born" or "died," and extra practice with these concepts proves beneficial.

5 Discussion

We have taken two basic aspects of syntax, the equivalence of passive and active forms of the same sentence and the distinction between subject and direct object, and shown that they are not naturally learned through existing NLI training sets. Two of the models we evaluated could master these concepts with added training data, but the Decomposable Attention model could not, even after retraining.

We automatically generated training data to teach these syntactic concepts, but doing so required rather complicated programs to manipulate sentences based on parsing and SRL results. Generating large numbers of examples for more complicated or rarer aspects of syntax will be challenging. An obvious difficulty in extending our approach is the need to make a distraction template that affects the meaning in a known way. The other difficulty lies in making transformed examples plausible enough not to be rejected by language model likelihood.

References

Samuel Bowman, Gabor Angeli, Christopher Potts, and Christopher Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, Hui Jiang, and Diana Inkpen. 2017. Enhanced LSTM for natural language inference. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL) Volume 1, pages 1657–1668.

Ronan Collobert, Jason Weston, Leon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research (JMLR), 12:2493–2537.

Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 670–680.

Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew Peters, Michael Schmitz, and Luke S. Zettlemoyer. 2017. AllenNLP: A deep semantic natural language processing platform. CoRR, abs/1803.07640.

Max Glockner, Vered Shwartz, and Yoav Goldberg. 2018. Breaking NLI systems with sentences that require simple lexical inferences. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL) Volume 2, pages 650–655.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems (NIPS) 27, pages 2672–2680.

Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel Bowman, and Noah A. Smith. 2018. Annotation artifacts in natural language inference data. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT) Volume 2, pages 107–112.

Matthew Honnibal and Mark Johnson. 2015. An improved non-monotonic transition system for dependency parsing. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1373–1378.

Robin Jia and Percy Liang. 2017. Adversarial examples for evaluating reading comprehension systems. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2021–2031.

Dongyeop Kang, Tushar Khot, Ashish Sabharwal, and Eduard Hovy. 2018. Adversarial training for textual entailment with knowledge-guided examples. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL).

Tushar Khot, Ashish Sabharwal, and Peter Clark. 2018. SciTail: A textual entailment dataset from science question answering. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence.

Aakanksha Naik, Abhilasha Ravichander, Norman Sadeh, Carolyn Rose, and Graham Neubig. 2018. Stress test evaluation for natural language inference. In Proceedings of the 27th International Conference on Computational Linguistics (COLING), Santa Fe, New Mexico, USA.

Yixin Nie and Mohit Bansal. 2017. Shortcut-stacked sentence encoders for multi-domain inference. In Proceedings of the 2nd Workshop on Evaluating Vector Space Representations for NLP, RepEval@EMNLP, pages 41–45.

Ankur Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. 2016. A decomposable attention model for natural language inference. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2249–2255.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT) Volume 1, pages 2227–2237.

Adam Poliak, Jason Naradowsky, Aparajita Haldar, Rachel Rudinger, and Benjamin Van Durme. 2018. Hypothesis only baselines in natural language inference. In Proceedings of the Joint Conference on Lexical and Computational Semantics (*SEM).

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. In OpenAI Blog.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2383–2392.

Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2018. Semantically equivalent adversarial rules for debugging NLP models. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL) Volume 1, pages 856–865.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL) Volume 1, pages 1715–1725.

Dinghan Shen, Guoyin Wang, Wenlin Wang, Martin Renqiang Min, Qinliang Su, Yizhe Zhang, Chunyuan Li, Ricardo Henao, and Lawrence Carin. 2018. Baseline needs more love: On simple word-embedding-based models and associated pooling mechanisms. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL) Volume 1, pages 440–450.

James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018. FEVER: a large-scale dataset for fact extraction and verification. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT) Volume 1.

Masatoshi Tsuchiya. 2018. Performance impact caused by hidden bias of training data for recognizing textual entailment. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC).

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT) Volume 1, pages 1112–1122.

Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics (TACL), 2:67–78.

Zhengli Zhao, Dheeru Dua, and Sameer Singh. 2018. Generating natural adversarial examples. In Proceedings of the International Conference on Learning Representations (ICLR).


Proceedings of the First Workshop on Fact Extraction and VERification (FEVER), pages 85–90, Brussels, Belgium, November 1, 2018. ©2018 Association for Computational Linguistics

Where is your Evidence: Improving Fact-checking by Justification Modeling

Tariq Alhindi and Savvas Petridis
Department of Computer Science, Columbia University
[email protected]

Smaranda Muresan
Department of Computer Science and Data Science Institute, Columbia University
[email protected]

Abstract

Fact-checking is a journalistic practice that compares a claim made publicly against trusted sources of facts. Wang (2017) introduced a large dataset of validated claims from the POLITIFACT.com website (LIAR dataset), enabling the development of machine learning approaches for fact-checking. However, approaches based on this dataset have focused primarily on modeling the claim and speaker-related metadata, without considering the evidence used by humans in labeling the claims. We extend the LIAR dataset by automatically extracting the justification from the fact-checking article used by humans to label a given claim. We show that modeling the extracted justification in conjunction with the claim (and metadata) provides a significant improvement regardless of the machine learning model used (feature-based or deep learning) both in a binary classification task (true, false) and in a six-way classification task (pants on fire, false, mostly false, half true, mostly true, true).

1 Introduction

Fact-checking is the process of assessing the veracity of claims. It requires identifying evidence from trusted sources, understanding the context, and reasoning about what can be inferred from the evidence. Several organizations such as FACTCHECK.org, POLITIFACT.com and FULLFACT.org are devoted to such activities, and the final verdict can reflect varying degrees of truth (e.g., POLITIFACT labels claims as true, mostly true, half true, mostly false, false and pants on fire).

Until recently, the bottleneck for developing automatic methods for fact-checking has been the lack of large datasets for building machine learning models. Thorne and Vlachos (2018) provide a survey of current datasets and models for fact-checking (e.g., (Wang, 2017; Rashkin et al., 2017; Vlachos and Riedel, 2014; Thorne et al., 2018; Long et al., 2017; Potthast et al., 2018; Wang et al., 2018)). Wang (2017) has introduced a large dataset (LIAR) of claims from POLITIFACT, the associated metadata for each claim and the verdict (6 class labels). Most work on the LIAR dataset has focused on modeling the content of the claim (including hedging, sentiment and emotion analysis) and the speaker-related metadata (Wang, 2017; Rashkin et al., 2017; Long et al., 2017).

However, these approaches do not use the evidence and the justification provided by humans to predict the label. Extracting evidence from (trusted) sources for fact-checking or for argument mining is a difficult task (Rinott et al., 2015; Thorne et al., 2018; Baly et al., 2018). For the purpose of our paper, we rely on the fact-checking article associated with the claim. We extend the original LIAR dataset by automatically extracting the justification given by humans for labeling the claim from the fact-checking article (Section 2). We release the extended LIAR dataset (LIAR-PLUS) to the community (https://github.com/Tariq60/LIAR-PLUS).

The main contribution of this paper is to show that modeling the extracted justification in conjunction with the claim (and metadata) provides a significant improvement regardless of the machine learning model used (feature-based or deep learning) both in a binary classification task (true, false) and in a six-way classification task (pants on fire, false, mostly false, half-true, mostly true, true) (Section 4). We provide a detailed error analysis and per-class results.

Our work complements the recent work on providing datasets and models that enable the development of an end-to-end pipeline for fact-checking ((Thorne et al., 2018) for English and (Baly et al., 2018) for Arabic). We are primarily concerned with showing the impact of modeling the human-provided justification for predicting the veracity of a claim. In addition, our task aims to capture the varying degrees of truth that some claims might have and that are usually labeled as such by professionals (rather than binary true vs. false labels).

2 Dataset

The LIAR dataset introduced by Wang (2017) consists of 12,836 short statements taken from POLITIFACT and labeled by humans for truthfulness, subject, context/venue, speaker, state, party, and prior history. For truthfulness, the LIAR dataset has six labels: pants-fire, false, mostly-false, half-true, mostly-true, and true. These six labels are relatively balanced in size. The statements were collected from a variety of broadcasting media, such as TV interviews, speeches, tweets and debates, and they cover a broad range of topics such as the economy, health care, taxes and elections.

We extend the LIAR dataset to the LIAR-PLUS dataset by automatically extracting, for each claim, the justification that humans have provided in the fact-checking article associated with the claim. Most of the articles end with a summary that has the headline "our ruling" or "summing up". This summary usually contains several justification sentences that are related to the statement. We extract all sentences in these summary sections, or the last five sentences in the fact-checking article when no summary exists. We filter out the sentence that contains the verdict and related words. The extracted sentences can support or contradict the statement, which is expected to enhance the accuracy of the classification approaches. An excerpt from the LIAR-PLUS dataset is shown in Table 1.
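The extraction heuristic can be sketched roughly as below. This is not the released LIAR-PLUS code; the sentence segmentation, the header pattern and the crude substring-based verdict filter are simplifying assumptions.

    # Take the "our ruling" / "summing up" section if present, otherwise the
    # last five sentences, then drop sentences that mention the verdict label.
    import re

    LABELS = ["pants on fire", "pants-fire", "false", "mostly-false",
              "half-true", "mostly-true", "true"]

    def extract_justification(article_sentences, verdict):
        start = None
        for i, sent in enumerate(article_sentences):
            if re.search(r"\b(our ruling|summing up)\b", sent, flags=re.IGNORECASE):
                start = i + 1
                break
        summary = article_sentences[start:] if start is not None else article_sentences[-5:]
        # Crude filter: remove sentences containing the verdict or any label word.
        forbidden = [verdict.lower()] + LABELS
        return [s for s in summary if not any(w in s.lower() for w in forbidden)]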

3 Methods

The main goal of our paper is to show that modeling the human-provided justification — which can be seen as summary evidence — improves the assessment of a claim's truth when compared to modeling the claim (and metadata) alone, regardless of the machine learning models (feature-based vs. deep learning models). All our models use 4 different conditions: basic claim/statement representation (in the rest of the paper we refer to the claim as the statement) using just word representations (S condition); enhanced claim/statement representation that captures additional information shown to be useful such as hedging, sentiment strength and emotion (Rashkin et al., 2017), as well as metadata information (S+M condition); basic claim/statement and the associated extracted justification (SJ condition); and finally enhanced claim/statement representation, metadata and justification (S+MJ condition).

Statement: "Says Rick Scott cut education to pay for even more tax breaks for big, powerful, well-connected corporations."
Speaker: Florida Democratic Party
Context: TV Ad
Label: half-true
Extracted Justification: A TV ad by the Florida Democratic Party says Scott "cut education to pay for even more tax breaks for big, powerful, well-connected corporations." However, the ad exaggerates when it focuses attention on tax breaks for "big, powerful, well-connected corporations." Some such companies benefited, but so did many other types of businesses. And the question of whether the tax cuts and the education cuts had any causal relationship is murkier than the ad lets on.

Table 1: Excerpt from the LIAR-PLUS dataset

Feature-based Machine Learning. We experiment with both Logistic Regression (LR) and Support Vector Machines (SVM) with a linear kernel. For the basic representation of the claim/statement (S condition) we experimented with unigram features, tf-idf weighted unigram features and GloVe word embeddings (Pennington et al., 2014). The best representation proved to be unigrams. For the enhanced statement representation (S+) we modeled: sentiment strength using SentiStrength, which measures the negativity and positivity of a statement on a scale of 1-to-5 (Thelwall et al., 2010); emotion using the NRC Emotion Lexicon (EmoLex), which associates each word with eight basic emotions (Mohammad and Turney, 2010); and the Linguistic Inquiry and Word Count (LIWC) lexicon (Pennebaker et al., 2001). In addition, we include metadata information such as the number of claims each speaker makes for every truth label (history) (Wang, 2017; Long et al., 2017). Finally, for representing the justification in the SJ and S+MJ conditions, we just use unigram features.
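A minimal sketch of the feature-based S and SJ conditions, assuming scikit-learn with plain unigram counts; the toy data and the pipeline choices are ours, not the authors' exact setup.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    def build_model(kind="lr"):
        clf = LogisticRegression(max_iter=1000) if kind == "lr" else LinearSVC()
        return make_pipeline(CountVectorizer(), clf)

    # Toy examples only, for illustration.
    statements = ["We have the highest tax rate anywhere in the world.",
                  "In the month of January, Canada created more new jobs than we did."]
    justifications = ["All sets of data we examined prove him wrong.",
                      "It reflects the general pattern when you account for population."]
    labels = ["false", "true"]

    # S condition: statement only.  SJ condition: statement + justification text.
    s_model = build_model("lr").fit(statements, labels)
    sj_model = build_model("lr").fit(
        [s + " " + j for s, j in zip(statements, justifications)], labels)

For the S+M and S+MJ conditions, the lexicon-based and metadata features described above would be appended to the unigram vectors before classification.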


Cond.   Model      Binary (valid)   Binary (test)   Six-way (valid)   Six-way (test)
S       LR         0.58             0.61            0.23              0.25
        SVM        0.56             0.59            0.25              0.23
        BiLSTM     0.59             0.60            0.26              0.23
SJ      LR         0.68             0.67            0.37              0.37
        SVM        0.65             0.66            0.34              0.34
        BiLSTM     0.70             0.68            0.34              0.31
        P-BiLSTM   0.69             0.67            0.36              0.35
S+M     LR         0.61             0.61            0.26              0.25
        SVM        0.57             0.60            0.26              0.25
        BiLSTM     0.62             0.62            0.27              0.25
S+MJ    LR         0.69             0.67            0.38              0.37
        SVM        0.66             0.66            0.35              0.35
        BiLSTM     0.71             0.68            0.34              0.32
        P-BiLSTM   0.70             0.70            0.37              0.36

Table 2: Classification Results

Deep Learning Models. We chose to use Bi-Directional Long Short-Term Memory (BiLSTM) (Hochreiter and Schmidhuber, 1997) architectures, which have been shown to be successful for various related NLP tasks such as textual entailment and argument mining. For the S condition we use just one BiLSTM to model the statement. We use GloVe pre-trained word embeddings (Pennington et al., 2014), a 100-dimension embedding layer that is followed by a BiLSTM layer of size 32. The output of the BiLSTM layer is passed to a softmax layer. In the S+M condition, a normalized count vector of those features (described above) is concatenated with the output of the BiLSTM layer to form a merge layer before the softmax. We used a categorical cross-entropy loss function and the ADAM optimizer (Kingma and Ba, 2014), and trained the model for 10 epochs. For the SJ and S+MJ conditions we experiment with two architectures: in the first one we just concatenate the justification to the statement and pass it to a single BiLSTM, and in the second one we use a dual/parallel architecture where one BiLSTM reads the statement and another one reads the justification (architecture denoted as P-BiLSTM). The outputs of these BiLSTMs are concatenated and passed to a softmax layer. This latter architecture has proven to be effective for tasks that model two inputs, such as textual entailment (Conneau et al., 2017) or sarcasm detection based on conversation context (Ghosh et al., 2017; Ghosh and Veale, 2017).
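The dual/parallel architecture can be sketched with the Keras functional API as below; the embedding and BiLSTM sizes follow the text (100 and 32), while the vocabulary size, sequence length and all other details are assumptions rather than the authors' code.

    from tensorflow.keras import Input, Model
    from tensorflow.keras.layers import (Bidirectional, Concatenate, Dense,
                                         Embedding, LSTM)

    VOCAB, MAXLEN, NUM_CLASSES = 20000, 60, 6

    def encoder(name):
        inp = Input(shape=(MAXLEN,), name=name)
        x = Embedding(VOCAB, 100)(inp)       # GloVe weights would be loaded here
        x = Bidirectional(LSTM(32))(x)
        return inp, x

    stmt_in, stmt_vec = encoder("statement")
    just_in, just_vec = encoder("justification")
    merged = Concatenate()([stmt_vec, just_vec])
    out = Dense(NUM_CLASSES, activation="softmax")(merged)

    model = Model([stmt_in, just_in], out)
    model.compile(optimizer="adam", loss="categorical_crossentropy")

The single-BiLSTM variants for the S and SJ conditions correspond to using one encoder and feeding either the statement alone or the statement concatenated with the justification.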

Class          class size   S: LR   S: BiLSTM   SJ: LR   SJ: BiLSTM   SJ: P-BiLSTM
pants-fire     116          0.18    0.19        0.37     0.34         0.37
false          263          0.28    0.34        0.33     0.30         0.33
mostly-false   237          0.21    0.13        0.35     0.31         0.32
half-true      248          0.22    0.28        0.39     0.31         0.37
mostly-true    251          0.23    0.33        0.40     0.39         0.39
true           169          0.22    0.18        0.37     0.42         0.39
total/avg      1284         0.23    0.26        0.37     0.34         0.36

Table 3: F1 Score Per Class on Validation Set

Class          class size   S: LR   S: BiLSTM   SJ: LR   SJ: BiLSTM   SJ: P-BiLSTM
pants-fire     92           0.12    0.11        0.38     0.33         0.39
false          250          0.31    0.31        0.35     0.32         0.35
mostly-false   214          0.25    0.15        0.35     0.27         0.33
half-true      267          0.24    0.26        0.41     0.27         0.34
mostly-true    249          0.23    0.30        0.35     0.35         0.33
true           211          0.25    0.16        0.37     0.36         0.41
total/avg      1283         0.25    0.23        0.37     0.31         0.35

Table 4: F1 Score Per Class on Test Set

4 Results and Error Analysis

Table 2 shows the results both for the binary and the six-way classification tasks under all 4 conditions (S, SJ, S+M and S+MJ) for our feature-based machine learning models (LR and SVM) and the deep learning models (BiLSTM and P-BiLSTM). For the binary runs we grouped pants on fire, false and mostly false as FALSE, and true, mostly true and half true as TRUE. As reference, Wang (2017)'s best models (text and metadata) obtained 0.277 F1 on the validation set and 0.274 F1 on the test set in the six-way classification, which is relatively similar to our equivalent S+M condition.
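The six-to-two label grouping used for the binary runs can be written as a simple mapping; the exact label strings below are assumed to follow the LIAR labels listed in Section 2.

    # Collapse the six LIAR labels into the binary TRUE/FALSE scheme above.
    SIX_TO_BINARY = {
        "pants-fire": "FALSE", "false": "FALSE", "mostly-false": "FALSE",
        "half-true": "TRUE", "mostly-true": "TRUE", "true": "TRUE",
    }

    def to_binary(label):
        return SIX_TO_BINARY[label]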

It is clear from the results shown in Table 2 that including the justification (SJ and S+MJ conditions) improves over the conditions that do not use the justification (S and S+M, respectively) for all models, both in the binary and the six-way classification tasks. For example, for the six-way classification, we see that the BiLSTM model for the SJ condition obtains 0.35 F1 compared to 0.23 F1 in the S condition. The LR model shows a similar behaviour, with 0.37 F1 for the SJ condition compared to 0.25 F1 in the S condition. For the S+MJ condition the best model (LR) shows an F1 of 0.38 compared to 0.26 F1 in the S+M condition (with similar results for the deep learning models). The dual/parallel BiLSTM architecture provides a small improvement over the single BiLSTM only in the six-way classification.

We also present the per-class results for the six-way classification for the S and SJ conditions. Table 3 shows the results on the validation set, while Table 4 shows them on the test set.


ID 1
Statement: We have the highest tax rate anywhere in the world.
Justification: Trump, while lamenting the condition of the middle class, said the U.S. has "the highest tax rate anywhere in the world." All sets of data we examined for individual and family taxes prove him wrong. Statutory income tax rates in the U.S. fall around the end of the upper quarter of nations. More exhaustive measures - which compute overall tax burden per person and as a percentage of GDP - show the U.S. either is in the middle of the pack or on the lighter end of taxation compared with other advanced industrialized nations.
Label: false
Incorrect under: S

ID 2
Statement: "Says Rick Scott cut education to pay for even more tax breaks for big, powerful, well-connected corporations."
Justification: A TV ad by the Florida Democratic Party says Scott "cut education to pay for even more tax breaks for big, powerful, well-connected corporations." However, the ad exaggerates when it focuses attention on tax breaks for "big, powerful, well-connected corporations." Some such companies benefited, but so did many other types of businesses. And the question of whether the tax cuts and the education cuts had any causal relationship is murkier than the ad lets on.
Label: half-true
Incorrect under: S, S+M

ID 3
Statement: Says Donald Trump has given more money to Democratic candidates than Republican candidates.
Justification: ...but public records show that the real estate tycoon has actually contributed around $350,000 more to Republicans at the state and federal level than Democrats. That, however, is a recent development. Ferguson's statement contains an element of truth but ignores critical facts.
Label: mostly-false
Incorrect under: S, S+M

ID 4
Statement: Says out-of-state abortion clinics have marketed their services to minors in states with parental consent laws.
Justification: As Cousins clinic in New York told Yellow Page users in Pennsylvania, "No state consents." This is information the clinics wanted patients or potential patients to have, and paid money to help them have it. Whether it was to help persuade them to come in or not, it provided pertinent facts that could help them in their decision-making. It fit the definition of marketing.
Label: true
Incorrect under: S, S+M, SJ

ID 5
Statement: Obamacare provision will allow forced home inspections by government agents.
Justification: But the program they pointed to provides grants for voluntary help to at-risk families from trained staff like nurses and social workers. What bloggers describe would be an egregious abuse of the law, not what's allowed by it.
Label: pants-fire
Incorrect under: S, SJ, S+MJ

ID 6
Statement: In the month of January, Canada created more new jobs than we did.
Justification: In November 2010, the U.S. economy created 93,000 jobs, compared to 15,200 for Canada. And in December 2010, the U.S. created 121,000 jobs, compared to 22,000 for Canada. "But on a per capita basis, in recent months U.S. job creation exceeded Canada's only in October." January happened to be a month when U.S. job creation was especially low and Canadian job creation was especially high, but it is the most recent month and it reflects the general pattern when you account for population.
Label: true
Incorrect under: S, S+M, SJ, S+MJ

ID 7
Statement: There has been $5 trillion in debt added over the last four years.
Justification: ...number is either slightly high or a little low, depending on the type of measurement used, and that's actually for a period short of a full four years. His implication that Obama and the Democrats are to blame has some merit, but it ignores the role Republicans have had.
Label: mostly-true
Incorrect under: S, S+M, SJ, S+MJ

Table 5: Error analysis of six-way classification (Logistic Regression). For each example we list the conditions (S, S+M, SJ, S+MJ) under which the model's prediction was incorrect.

In the S condition, we see a larger degree of variation in performance among the classes, with the worst being pants-on-fire for all models, and, for the deep learning model, also the mostly-false and true classes. In the SJ condition, we notice a more uniform performance across all classes for all the models. We notice the biggest improvement for the pants-on-fire class for all models, for half-true for LR, and for mostly-false and true for the deep learning models. When comparing the P-BiLSTM and BiLSTM we noticed that the biggest improvements come from the half-true class and the pants-on-fire class.

Error Analysis. In order to further understand the cause of the errors made by the models, we analyzed several examples by looking at the statement, justification and predictions by the logistic regression model when using the S, S+M, SJ, and S+MJ conditions (Table 5). Logistic regression was selected since it has the best numbers for the six-way classification task.

The first example in Table 5 was wrongly classified in the S condition, but classified correctly in the S+M, SJ and S+MJ conditions. The justification text has a sentence saying "Statutory income tax rates in the U.S. fall around the end of the upper quarter of nations.", which contradicts the statement, and thus the claim is classified correctly when modeling the justification.

The second and third examples in Table 5 were correctly predicted only when the justification was modeled (SJ and S+MJ conditions). For statement 2, the justification text has a sentence, "However, the ad exaggerates...", which indicates that the statement has some false and some true information. Therefore, the model predicts the correct label "half-true" when modeling the justification text. Also, the justification for statement 3 was simple enough for the model to predict the gold label "mostly-false": it has a phrase like "more to Republicans" while the statement had "more to Democratic candidates", which indicates falsehood in the statement, as well as discourse markers indicating concessive moves ("but" and "however").

Sometimes justification features alone were not enough to get the correct prediction without using the enhanced statement and metadata features. The justification for statement 4 in Table 5 is complex and no direct connection can be made to the statement. Therefore, the model fails when using the SJ and S+M conditions and only succeeds when using all features (i.e., the S+MJ condition). In addition, consider the 5th statement in Table 5 about Obamacare: it seems that metadata features, which include the history of the speaker, might have helped in predicting its factuality to be "pants on fire", while it was wrongly classified when modeling only the statement and the justification.

For around half of the instances in the validation set, all models made wrong predictions. This is not surprising since the best model had an average F1 score of less than 0.40. The last two examples in Table 5 are instances where the model makes mistakes under all 4 conditions. The claim and justification refer to temporal information, which is harder to model with the rather simple and shallow approaches we used. Incorporating temporal and numeric information when modeling the claim and justification would be essential for capturing the correct context of a given statement. Another source of error for the justification-based conditions was noise in the extraction of the justification, particularly when the "our ruling" and "summing up" headers were not present and we resorted to extracting the last 5 sentences from the fact-checking article. Improving the extraction method will help to improve the justification-based classification results.

5 Conclusion and Future Work

We presented a study showing that modeling the human-provided justification from the fact-checking article associated with a claim is important, leading to significant improvements compared to modeling just the claim/statement and metadata, for all the machine learning models, both in a binary and a six-way classification task. We released LIAR-PLUS, the extended LIAR dataset that contains the automatically extracted justifications. We also provided an error analysis and a discussion of per-class performance.

Our simple method for extracting the justification from the fact-checking article can lead to slightly noisy text (for example, it can contain a repetition of the claim or it can fail to capture the entire evidence). We plan to further refine the justification extraction method so that it contains just the summary evidence. In addition, we plan to develop methods for evidence extraction from the web (similar to the goals of the FEVER shared task (Thorne et al., 2018)) and compare the results of the automatically extracted evidence with the human-provided justifications for fact-checking the claims.

References

Ramy Baly, Mitra Mohtarami, James Glass, Lluís Màrquez, Alessandro Moschitti, and Preslav Nakov. 2018. Integrating stance detection and fact checking in a unified corpus. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 21–27.

Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 670–680. Association for Computational Linguistics.

Aniruddha Ghosh and Tony Veale. 2017. Magnets for sarcasm: Making sarcasm detection timely, contextual and very personal. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 482–491.

Debanjan Ghosh, R. Alexander Fabbri, and Smaranda Muresan. 2017. The role of conversation context for sarcasm detection in online interactions. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pages 186–196. Association for Computational Linguistics.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. CoRR, abs/1412.6980.

Yunfei Long, Qin Lu, Rong Xiang, Minglei Li, and Chu-Ren Huang. 2017. Fake news detection through multi-perspective speaker profiles. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers).

Saif M. Mohammad and Peter D. Turney. 2010. Emotions evoked by common words and phrases: Using Mechanical Turk to create an emotion lexicon. In Proceedings of the NAACL HLT 2010 Workshop on Computational Approaches to Analysis and Generation of Emotion in Text, pages 26–34. Association for Computational Linguistics.

James W. Pennebaker, Martha E. Francis, and Roger J. Booth. 2001. Linguistic Inquiry and Word Count: LIWC 2001. Mahway: Lawrence Erlbaum Associates, 71(2001):2001.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Martin Potthast, Johannes Kiesel, Kevin Reinartz, Janek Bevendorff, and Benno Stein. 2018. A stylometric inquiry into hyperpartisan and fake news. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 231–240. Association for Computational Linguistics.

Hannah Rashkin, Eunsol Choi, Jin Yea Jang, Svitlana Volkova, and Yejin Choi. 2017. Truth of varying shades: Analyzing language in fake news and political fact-checking. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2931–2937.

Ruty Rinott, Lena Dankin, Carlos Alzate Perez, Mitesh M. Khapra, Ehud Aharoni, and Noam Slonim. 2015. Show me your evidence - an automatic method for context dependent evidence detection. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 440–450.

Mike Thelwall, Kevan Buckley, Georgios Paltoglou, Di Cai, and Arvid Kappas. 2010. Sentiment strength detection in short informal text. Journal of the Association for Information Science and Technology, 61(12):2544–2558.

James Thorne and Andreas Vlachos. 2018. Automated fact checking: Task formulations, methods and future directions. In Proceedings of the 27th International Conference on Computational Linguistics (COLING 2018).

James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018. FEVER: a large-scale dataset for fact extraction and verification. In NAACL-HLT.

Andreas Vlachos and Sebastian Riedel. 2014. Fact checking: Task definition and dataset construction. In Proceedings of the ACL 2014 Workshop on Language Technologies and Computational Social Science, pages 18–22.

William Yang Wang. 2017. "Liar, liar pants on fire": A new benchmark dataset for fake news detection. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017), Vancouver, BC, Canada. ACL.

Xuezhi Wang, Cong Yu, Simon Baumgartner, and Flip Korn. 2018. Relevant document discovery for fact-checking articles. In Companion of The Web Conference 2018, pages 525–533. International World Wide Web Conferences Steering Committee.


Proceedings of the First Workshop on Fact Extraction and VERification (FEVER), pages 91–96, Brussels, Belgium, November 1, 2018. ©2018 Association for Computational Linguistics

Affordance Extraction and Inference based on Semantic Role Labeling

Daniel Loureiro, Alípio Mário Jorge
LIAAD - INESC TEC
Faculty of Sciences - University of Porto
[email protected], [email protected]

Abstract

Common-sense reasoning is becoming increasingly important for the advancement of Natural Language Processing. While word embeddings have been very successful, they cannot explain which aspects of 'coffee' and 'tea' make them similar, or how they could be related to 'shop'. In this paper, we propose an explicit word representation that builds upon the Distributional Hypothesis to represent meaning from semantic roles, and allows inference of relations from their meshing, as supported by the affordance-based Indexical Hypothesis. We find that our model improves the state-of-the-art on unsupervised word similarity tasks while allowing for direct inference of new relations from the same vector space.

1 Introduction

The word representations used more recently in Natural Language Processing (NLP) have been based on the Distributional Hypothesis (DH) (Harris, 1954) — "words that occur in the same contexts tend to have similar meanings". This simple idea has led to the development of powerful word embedding models, starting with Latent Semantic Analysis (LSA) (Landauer and Dumais, 1997) and later the popular word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014) models. Although effective at quantifying the similarity between words (and phrases) such as 'tea' and 'coffee', they cannot relate that judgement to the fact that both can be sold, for instance. Furthermore, current representations cannot inform us about possible relations between words occurring in mostly distinct contexts, such as using a 'newspaper' to cover a 'face'. While there have been substantial improvements to word embedding models over the years, these shortcomings have endured (Camacho-Collados and Pilehvar, 2018).

Word Pairs (w1, w2)    Affordances (w1 as ARG0, w2 as ARG1)
shop, tea              sell, import, cure
doctor, patient        diagnose, prescribe, treat
newspaper, face        cover, expose, poke
man, cup               drink, pour, spill

Table 1: Results from affordance meshing (coordination) using automatically labelled semantic roles.

Glenberg et al. (2000) identified these issues soon after LSA was introduced, and cautioned that high-dimensional word representations, such as those based on the DH, lack the necessary grounding to be proper semantic analogues. Instead, Glenberg proposed the Indexical Hypothesis (IH), which holds that meaning is constructed by (a) indexing words and phrases to real objects or perceptual, analog symbols; (b) deriving affordances from the objects and symbols; and (c) meshing the affordances under the guidance of syntax. Following Glenberg et al. (2000), this work considers an object's affordances as its possibilities for action constrained by its context, including actions which may not be directly perceived, which differs slightly from Gibson (1979)'s original definition. Even though the language grounding advocated by the IH is beyond the reach of NLP by itself, we believe that its representation of meaning through affordances can still be captured to a useful extent.

Our contribution (code, data and demo: https://a2avecs.github.io) is a word-level representation that allows for the affordance correspondence and meshing supported by the IH. These affordances are approximated from occurrences of semantic roles in corpora through an adaptation of models based on the DH. Our work is motivated by two observations: (1) a pressing need to integrate common-sense knowledge in NLP models, and (2) recent improvements to Semantic Role Labeling (SRL) that have made affordance extraction from raw corpora sufficiently reliable. We find that our model (A2Avecs) performs competitively on word similarity tasks while enabling novel 'who-does-what-to-whom' style inferences (Table 1).

[Figure 1: Outline of model pipeline. The corpus is processed with Semantic Role Labeling into PAS-based contexts, which yield the PPMI matrix M (V×R), interpolated into M+ (V×R) via linear combination; a sliding window over the same corpus yields lexical contexts for the SkipGram embeddings A (V×300).]

2 Related Work

This work is closely related to the research area of selectional preferences, where the goal is to predict the likelihood of a verb taking a certain argument for a particular role (e.g., the likelihood of man being an agent of drive). Most notably, Erk et al. (2010) proposed a distributional model of selectional preferences that used SRL annotations as a primary set of verb-role-arguments from which to generalize, using word representations based on the DH and several word similarity metrics. Progress in selectional preferences is usually measured through correlations with human thematic fit judgements and, more recently, neural approaches (de Cruys, 2014; Tilk et al., 2016) obtained state-of-the-art results.

While this work shares some of these same elements (i.e., SRL and word embeddings), they are used to predict potential affordances instead of selectional preferences. Consequently, our representations are designed to enable the meshing proposed by the IH, allowing us to infer affordances that would not be likely under a selectional preference learning scheme (e.g., newspaper-cover-face from Table 1). Additionally, this work is concerned with showing that information derived from SRL is complementary to information derived from DH methods, and thus focuses its evaluation on tasks related to lexical similarity rather than thematic fit correlations.

3 Method

Our word representations are modelled using Predicate-Argument Structures (PASs). These structures are obtained through SRL of raw corpora, and used to populate a sparse word/context co-occurrence matrix W where roles serve as contexts (features), and argument spans serve as the co-occurrence windows. The roles are predicates specified by argument type (e.g., eat|ARG0) and used in place of affordances. See Table 2 for a comparison of this context definition with the traditional lexical definition.

Role contexts:
    Context             Words
    drinks|ARG0         John
    drinks|ARG1         red, wine
    drinks|ARGM-MNR     slowly

Adjacency contexts:
    Word       Context words
    John       drinks, red
    drinks     John, red, wine
    red        John, drinks, wine, slowly
    wine       drinks, red, slowly
    slowly     red, wine

Table 2: Different context definitions applied to the sentence 'John drinks red wine slowly'. Top: our proposed definition; bottom: lexical adjacency definition (with window size of 2).
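The role-as-context definition in the top half of Table 2 can be sketched as follows; the PAS input format and function names are illustrative, not the released code.

    # Turn one labelled predicate-argument structure into (word, role) counts
    # ready to populate the sparse matrix W.
    from collections import Counter

    def pas_contexts(predicate, arguments):
        """arguments maps an argument type (e.g. 'ARG1') to its token span."""
        counts = Counter()
        for arg_type, tokens in arguments.items():
            role = f"{predicate}|{arg_type}"      # e.g. 'drink|ARG1' (lemmatized predicate)
            for word in tokens:
                counts[(word, role)] += 1
        return counts

    example = pas_contexts("drink", {
        "ARG0": ["John"],
        "ARG1": ["red", "wine"],
        "ARGM-MNR": ["slowly"],
    })
    # Counter({('John', 'drink|ARG0'): 1, ('red', 'drink|ARG1'): 1, ...})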

After computing our co-occurrence matrix we follow up with the additional steps employed by traditional bag-of-words models. We use Positive Pointwise Mutual Information (PPMI) to improve co-occurrence statistics, as used successfully by Bullinaria and Levy (2007) and Levy and Goldberg (2014b), and maintain explicit high-dimensional representations in order to preserve the context information required for affordance meshing. Previous works, such as Levy and Goldberg (2014a) and Stanovsky et al. (2015), have also produced word representations from syntactic context definitions (dependency parse trees and open information extraction, respectively) but have opted for following up with word2vec's SkipGram (SG) model, presumably influenced by a much higher number of contexts in their approaches.

We reduce the sparsity of our explicit PPMI matrix by linear combination and interpolation of semantically related vectors. The semantic relatedness is obtained from the cosine similarity of SG vectors. As evidenced by Baroni et al. (2014), SG vectors seem best suited for estimating relatedness (or association). These steps are further described in the remainder of this section (see Fig. 1).


3.1 Extracting PASs

We use the AllenNLP (Gardner et al., 2017) implementation of He et al. (2017)'s state-of-the-art SRL to extract PASs from an English Wikipedia dump from April 2010 (1B words). Since the automatic identification of predicates by an end-to-end SRL system may produce erroneous results, we ensure that these predicates are valid by restricting them to the set of verbs tagged in the Brown corpus (Francis and Kucera, 1979). We also use the spaCy parser (Honnibal and Montani, 2017) to reduce each argument phrase to its head noun phrase, reducing the dilution of the more relevant noun and predicate co-occurrence statistics (see Fig. 2). Additionally, we lemmatize the predicates (verbs) to their root form using WordNet's Morphy lemmatizer (Miller, 1992). Finally, we trim the vocabulary size and the number of roles by discarding those which occur less than 100 times, and consider only core and adjunct argument types. The result is a set C of observed contexts, such as <chase|ARG1, (the, cat)>, used to populate W.
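A rough sketch of this extraction pipeline, assuming the AllenNLP SRL predictor, spaCy and NLTK's WordNet interface; the model path, the valid-verb set and the head reduction (here the syntactic root token of each argument span) are placeholders rather than the authors' exact code.

    import spacy
    from allennlp.predictors.predictor import Predictor
    from nltk.corpus import wordnet as wn

    srl = Predictor.from_path("path/to/srl-model.tar.gz")   # He et al. (2017) SRL model
    nlp = spacy.load("en_core_web_sm")
    VALID_VERBS = {"chase", "drink", "sell"}                 # would be the Brown-corpus verb set

    def extract_pas(sentence):
        result, structures = srl.predict(sentence=sentence), []
        for frame in result["verbs"]:
            pred = wn.morphy(frame["verb"].lower(), wn.VERB) or frame["verb"].lower()
            if pred not in VALID_VERBS:
                continue
            spans = {}
            for word, tag in zip(result["words"], frame["tags"]):
                if tag == "O" or tag.endswith("-V"):
                    continue                                  # skip non-arguments and the verb
                spans.setdefault(tag.split("-", 1)[1], []).append(word)
            # Reduce each argument span to its root token, approximating the
            # head noun phrase reduction described in the paper.
            heads = {arg: nlp(" ".join(words))[:].root.text for arg, words in spans.items()}
            structures.append((pred, heads))
        return structures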

[Figure 2: Parse tree for the sentence 'The dog with white stripes chased the cat.' The label for ARG0 is repositioned to the smaller subtree.]

3.2 Argument-specific PPMI

The authors of PropBank (Kingsbury and Palmer, 2002), which provides the annotations used for learning SRL, state that arguments are predicate-specific. Still, they also acknowledge that there are some trends in the argument assignments. For instance, the most consistent trends are that ARG0 is usually the agent, and ARG1 is the direct object or theme of the relation. This realisation leads us to adapting the PPMI measure to better account for the correlations between roles of the same argument types. Thus, we segment C by argument type, and apply PPMI independently considering each $C_{ARG}$, such that for each $W_{w,p}$:

$$\mathrm{PPMI}(w, r) = \max(\mathrm{PMI}(w, r), 0)$$

$$\mathrm{PMI}(w, r) = \log \frac{f(w, r)}{f(w)\, f(r)} = \log \frac{\#(w, r) \cdot |C_{ARG}|}{\#(w) \cdot \#(r)}$$

where $w$ is a word from the vocabulary $V$, $r$ is a role (context) from the set $R$ of the same argument type as $C_{ARG}$, and $f$ is the probability function. The resulting matrix $M = \mathrm{PPMI}(W)$ maintains the dimensions of $W$ and is slightly sparser.

PST   PMI   THR   RND   MEN          SP
L     -     -     -     .249 (1)     98.71
L+H   -     -     -     .309 (47)    99.41
L+H   P     -     -     .606 (47)    99.58
L+H   A+P   -     -     .611 (47)    99.59
L+H   A+P   0.5   -     N/A*         < 41.2
L+H   A+P   0.5   HDR   .687 (0)     98.21
L+H   A+P   0.6   HDR   .654 (14)    97.98
L+H   A+P   0.4   HDR   .668 (0)     98.77

* Failed after using too much memory.

Table 3: Sensitivity/impact analysis for some parameters of our approach. Legend: PST: post-processing (L: lemmatization; H: head noun phrase isolation); PMI: PMI variations (P: PPMI; A: Arg-specific PPMI); THR: similarity threshold (tested values); RND: rounding (HDR: half down rounding); MEN: MEN-3K task (Spearman correlation, #OOV failures); SP: sparsity (percentage of zero values on a 155K×18K matrix).
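The argument-specific PPMI can be sketched with dense arrays as below (the real matrix W is sparse); names and the per-argument column grouping are illustrative.

    # PPMI computed separately over the columns (roles) of each argument type.
    import numpy as np

    def ppmi(counts):
        total = counts.sum()
        pw = counts.sum(axis=1, keepdims=True) / total    # f(w)
        pr = counts.sum(axis=0, keepdims=True) / total    # f(r)
        with np.errstate(divide="ignore", invalid="ignore"):
            pmi = np.log((counts / total) / (pw * pr))
            pmi = np.where(np.isfinite(pmi) & (pmi > 0), pmi, 0.0)
        return pmi

    def arg_specific_ppmi(W, role_arg_types):
        """role_arg_types[j] gives the argument type (e.g. 'ARG0') of column j."""
        M = np.zeros_like(W, dtype=float)
        for arg in set(role_arg_types):
            cols = [j for j, a in enumerate(role_arg_types) if a == arg]
            M[:, cols] = ppmi(W[:, cols])
        return M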

3.3 Leveraging Association

The constraints imposed by SRL yield a very reduced number of PAS-based contexts that can be extracted from a corpus, in comparison to lexical adjacency-based contexts. Moreover, the post-processing steps we perform, while otherwise beneficial (see Table 3), further trim this information. To mitigate this issue, we also compute an embedding matrix A (see Section 3 for parameters), using the state-of-the-art lexical-based SG model of Bojanowski et al. (2017), and use those embeddings to obtain similarity values that can be used to interpolate missing values in M, through weighted linear combination. This way, existing vectors are re-computed as:

$$\vec{v}_w = \frac{\vec{v}_1 \alpha_1 + \dots + \vec{v}_n \alpha_n}{\sum_{i=1}^{n} \alpha_i}$$

with $\alpha_i$ defined as:

$$\alpha_i = \begin{cases} \dfrac{A\vec{v}_w \cdot A\vec{v}_i}{\lVert A\vec{v}_w \rVert \, \lVert A\vec{v}_i \rVert} = \cos_A(\vec{v}_w, \vec{v}_i), & \text{if } \cos_A(\vec{v}_w, \vec{v}_i) > 0.5 \\ 0, & \text{otherwise} \end{cases}$$

where $\cos_A$ corresponds to the cosine similarity in the SG representations.

The similarity threshold is tested on a few natural choices (0.5 ± 0.1) and validated from results on a single word similarity task (see Table 3). This approach is also used to define representations for words that are out-of-vocabulary (OOV) for M, but can be interpolated from related representations, similarly to Zhao et al. (2015). In conjunction with the interpolation, we apply half down rounding to the vectors, before and after re-computing them, so that our representations remain efficiently sparse while benefitting from improved performance. Finally, we apply a quadratic transformation to enlarge the influence of meaningful co-occurrences, obtaining M+ = interpolate(M, A)^2.
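A compact sketch of this re-computation step, written with dense numpy arrays and a crude half-down rounding; it is an illustration of the formulas above, not the released implementation.

    # Replace each explicit vector in M by an average of vectors whose SkipGram
    # (A) cosine similarity to it exceeds the threshold, then round and square.
    import numpy as np

    def recompute(M, A, threshold=0.5):
        A_norm = A / np.maximum(np.linalg.norm(A, axis=1, keepdims=True), 1e-9)
        sims = A_norm @ A_norm.T                      # cos_A for every word pair
        alphas = np.where(sims > threshold, sims, 0.0)
        M_new = (alphas @ M) / np.maximum(alphas.sum(axis=1, keepdims=True), 1e-9)
        M_new = np.floor(M_new + 0.5 - 1e-12)         # crude half-down rounding
        return M_new ** 2                             # quadratic transformation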

3.4 Inferring Relations

The examples shown in Table 1 are easily obtained with our model through a simple procedure (see Algorithm 1) that matches different arguments of the same predicates. As was the case with Arg-specific PPMI, this procedure is made possible by the fact that a significant portion of argument assignments remain consistent across predicates.

Algorithm 1: Affordance Meshing Algorithm
1: procedure INFERENCE(M+, w1, w2, a1, a2)
2:   relations <- []
3:   v1, v2 <- get_vec(w1, M+), get_vec(w2, M+)
4:   for f1 in features(v1) with arg(f1) = a1 do
5:     for f2 in features(v2) with arg(f2) = a2 do
6:       if pred(f1) = pred(f2) then
7:         relations.add((f1 * f2, pred(f1)))
8:   return sorted(relations)
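A direct Python rendering of Algorithm 1, under the assumption that M+ is exposed as a mapping from each word to its non-zero (predicate, argument type) features; the toy feature weights below are invented for illustration.

    def inference(m_plus, w1, w2, a1, a2):
        relations = []
        v1, v2 = m_plus[w1], m_plus[w2]
        for (pred1, arg1), weight1 in v1.items():
            if arg1 != a1:
                continue
            for (pred2, arg2), weight2 in v2.items():
                if arg2 == a2 and pred1 == pred2:
                    relations.append((weight1 * weight2, pred1))
        return sorted(relations, reverse=True)    # highest-scoring predicates first

    # Which predicates mesh 'shop' as ARG0 with 'tea' as ARG1 (cf. Table 1)?
    m_plus = {
        "shop": {("sell", "ARG0"): 2.1, ("close", "ARG1"): 0.7},
        "tea":  {("sell", "ARG1"): 1.8, ("drink", "ARG1"): 2.5},
    }
    print(inference(m_plus, "shop", "tea", "ARG0", "ARG1"))   # approximately [(3.78, 'sell')]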

4 Evaluation and Experiments

The A2Avecs model introduced in this paper is used to generate 155,183 word vectors of 18,179 affordance dimensions. This section compares our model with lexical-based models (word2vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014) and fastText (Bojanowski et al., 2017)) and other syntactic-based models (Deps (Levy and Goldberg, 2014a) and OpenIE (Stanovsky et al., 2015)). We use the Deps and OpenIE embeddings that the respective authors trained on a Wikipedia corpus and distributed online. Lexical models were trained using the same parameters, wherever applicable: Wikipedia corpus from April 2010 (same as mentioned in Section 3.1); minimum word frequency of 100; window size of 2; 300 vector dimensions; 15 negative samples; and n-grams sized from 3 to 6 characters.

We also show that our approach can make use of high-quality pretrained embeddings. We experiment with a fastText model pretrained on 600B tokens, referred to as 'fastText 600B', in contrast with the fastText model trained on Wikipedia.

4.1 Model Introspection

The explicit nature of the representations produced by our model makes them directly interpretable, similarly to other sparse representations such as Faruqui and Dyer (2015b). The examples presented in Table 4 demonstrate the relational capacity of our model, beyond associating meaningful predicates. In this introspection we highlight the top role contexts for a set of words, inspired by Levy and Goldberg (2014a), which presented the top syntactic contexts for the same words, and note that this introspection produces results that should correspond to Erk et al. (2010)'s inverse selectional preferences.

Our online demonstration provides access to additional introspection queries, such as the top words for given affordances, or which affordances are most distinguishable between a pair of words (determined by absolute difference).

batman: foil|ARG0, flirt|ARGM-MNR, apprehend|ARG0, subdue|ARG0, rescue|ARGM-DIR
hogwarts: ambush|ARGM-MNR, rock|ARGM-LOC, express|ARG0, prevent|ARGM-LOC, expel|ARG2
turing: travel|ARGM-TMP, pass|ARGM-ADV, solve|ARG0, simulate|ARG1, prove|ARG1
florida: base|ARGM-MNR, vacation|ARG1, reside|ARGM-DIS, fort|ARG1, vacation|ARGM-LOC
object-oriented: define|ARG1, define|ARG0, use|ARG1, implement|ARG1, express|ARGM-MNR
dancing: dance|ARG0, dance|ARGM-MNR, dance|ARGM-LOC, dance|ARGM-ADV, dance|ARG1

Table 4: Words and their top role contexts. Using the same words from the introspection of Levy and Goldberg (2014a) to clarify the difference in the representations of both approaches.


Context         Model                 SL-666   SL-999   WS-SIM   WS-ALL   MEN    RG-65
Lexical         word2vec              .426     .414     .762     .672     .721   .793
                GloVe                 .333     .325     .637     .535     .636   .601
                fastText (A)          .426     .419     .779     .702     .751   .799
Syntactic       Deps                  .475     .446     .758     .629     .606   .765
                Open IE               .397     .390     .746     .696     .281   .801
                A2Avecs (M+)          .461     .412     .734     .577     .687   .802
                A2Avecs (SVD(M+))     .436     .386     .672     .509     .599   .789
Lexical SOTA    fastText 600B (A)     .523     .504     .839     .791     .836   .859
Intp. w/SOTA    A2Avecs (M+)          .513     .468     .780     .619     .744   .814
Intp. & Conc.   A2Avecs (M+ || A)     .540     .521     .846     .771     .829   .857
Deps Conc.      Deps || A             .524     .503     .818     .752     .770   .835

Table 5: Spearman correlations for word similarity tasks (see Faruqui and Dyer (2014) for task descriptions). The top section shows results from training on the Wikipedia corpus exclusively. The bottom section shows results where we used SG embeddings (A) trained on a larger corpus for performing interpolation and concatenation on the same set of roles used above. For comparison, we also show results for Deps concatenated with those embeddings.

4.2 Word Similarity

The results presented in Table 5 show that our model can outperform lexical and syntactic models, in spite of maintaining an explicit representation. In fact, applying Singular Value Decomposition (SVD) to obtain dense 300-dimensional embeddings degrades performance. We achieve the best results with the concatenation of the fastText 600B vectors with our model interpolated using those same vectors for the vocabulary V_{M+} ∩ V_A, after normalizing both to unit length (L2). Interestingly, the same concatenation process with Deps embeddings doesn't seem as beneficial, suggesting that our representations are more complementary.
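The normalise-and-concatenate combination can be sketched as follows; this is an illustration with dense numpy arrays, assuming both matrices have already been restricted to the shared vocabulary.

    # L2-normalise each model's row vectors, then concatenate them per word.
    import numpy as np

    def l2_normalize(X):
        return X / np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-9)

    def concatenate_models(M_plus, A):
        """M_plus and A hold row vectors for the same (intersected) vocabulary."""
        return np.hstack([l2_normalize(M_plus), l2_normalize(A)])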

5 Conclusion

Our results suggest that semantic similarity can be captured in a vector space that also allows for the inference of new relations through affordance-based representations, which opens up exciting possibilities for the field. In the process, we presented more evidence to support that information obtained from SRL is complementary to information obtained from adjacency-based contexts, or even contexts based on syntactic dependencies. We believe this work helps bridge the gap between selectional preferences and semantic plausibility, beyond frequentist generalizations based on the DH. In the near term, we expect that specific tasks such as Entity Disambiguation and Coreference can benefit from these representations. With further developments, semantic plausibility assessments should also be useful for broader tasks such as Fact Verification and Story Understanding.

References

Marco Baroni, Georgiana Dinu, and Germán Kruszewski. 2014. Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In ACL.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. TACL, 5:135–146.

John A. Bullinaria and Joseph P. Levy. 2007. Extracting semantic representations from word co-occurrence statistics: a computational study. Behavior Research Methods, 39(3):510–26.

Jose Camacho-Collados and Mohammad Taher Pilehvar. 2018. From word to sense embeddings: A survey on vector representations of meaning.

Tim Van de Cruys. 2014. A neural network approach to selectional preference acquisition. In EMNLP.

Katrin Erk, Sebastian Pado, and Ulrike Pado. 2010. A flexible, corpus-driven model of regular and inverse selectional preferences. Computational Linguistics, 36:723–763.

Manaal Faruqui and Chris Dyer. 2014. Community evaluation and exchange of word vectors at wordvectors.org. In ACL.

Manaal Faruqui and Chris Dyer. 2015b. Non-distributional word vector representations. In ACL.

W. Nelson Francis and Henry Kucera. 1979. The Brown Corpus: A Standard Corpus of Present-Day Edited American English. Brown University Linguistics Department.

Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew Peters, Michael Schmitz, and Luke S. Zettlemoyer. 2017. AllenNLP: A deep semantic natural language processing platform.

J. J. Gibson. 1979. The ecological approach to visual perception. Brain and Language.

Arthur M. Glenberg, David A. Robertson, Brianna Benjamin, Jennifer Dolland, Jeanette Hegyi, Katherine V. Kortenkamp, Erik Kraft, Nathan Pruitt, Dana Scherr, and Sara Steinberg. 2000. Symbol grounding and meaning: A comparison of high-dimensional and embodied theories of meaning.

Zellig Harris. 1954. Distributional structure. Word, 10(2-3):146–162.

Luheng He, Kenton Lee, Mike Lewis, and Luke S. Zettlemoyer. 2017. Deep semantic role labeling: What works and what's next. In ACL.

Matthew Honnibal and Ines Montani. 2017. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear.

Paul Kingsbury and Martha Palmer. 2002. From TreeBank to PropBank. In LREC.

Thomas K. Landauer and Susan T. Dumais. 1997. A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge.

Omer Levy and Yoav Goldberg. 2014a. Dependency-based word embeddings. In ACL.

Omer Levy and Yoav Goldberg. 2014b. Linguistic regularities in sparse and explicit word representations. In CoNLL.

Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013. Linguistic regularities in continuous space word representations. In HLT-NAACL.

George A. Miller. 1992. WordNet: A lexical database for English. Commun. ACM, 38:39–41.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In EMNLP.

Gabriel Stanovsky, Ido Dagan, and Mausam. 2015. Open IE as an intermediate structure for semantic tasks. In ACL.

Ottokar Tilk, Vera Demberg, Asad B. Sayeed, Dietrich Klakow, and Stefan Thater. 2016. Event participant modelling with neural networks. In EMNLP.

Kai Zhao, Hany Hassan, and Michael Auli. 2015. Learning translation models from monolingual continuous representations. In HLT-NAACL.


UCL Machine Reading Group: Four Factor Framework For Fact Finding (HexaF)

Takuma Yoneda
Computational Intelligence Lab., Toyota Technological Institute
[email protected]

Jeff Mitchell, Johannes Welbl, Pontus Stenetorp, Sebastian Riedel
Dept. of Computer Science, University College London
j.mitchell, j.welbl, p.stenetorp, [email protected]

Abstract

In this paper we describe our 2nd place FEVER shared-task system, which achieved a FEVER score of 62.52% on the provisional test set (without additional human evaluation) and 65.41% on the development set. Our system is a four-stage model consisting of document retrieval, sentence retrieval, natural language inference and aggregation. Retrieval is performed leveraging task-specific features, and then a natural language inference model takes each of the retrieved sentences paired with the claimed fact. The resulting predictions are aggregated across retrieved sentences with a Multi-Layer Perceptron, and the evidence is re-ranked to be consistent with the final prediction.

1 Introduction

We often hear the term "fake news" these days. Recently, Russian meddling, for example, has been blamed for the prevalence of inaccurate news stories on social media,1 but even the reporting on this topic often turns out to be fake news (Uberti, 2016). An abundance of incorrect information can plant wrong beliefs in individual citizens and lead to a misinformed public, undermining the democratic process. In this context, technology to automate fact-checking and source verification (Vlachos and Riedel, 2014) is of great interest to both media consumers and publishers.

The Fact Extraction and Verification (FEVER) shared task provides a benchmark for such tools, testing the ability to assess textual claims against a corpus of around 5.4M Wikipedia articles. Each claim is labeled as SUPPORTS, REFUTES or NOT ENOUGH INFO, depending on whether relevant evidence from the corpus can support/refute it. Systems are evaluated on the proportion of claims for which both the predicted label is correct and a complete set of relevant evidence sentences has been identified.

1 https://www.bbc.co.uk/news/world-us-canada-41821359

Figure 1: Illustration of the model pipeline for a claim.

The original dataset description paper (Thorne et al., 2018) evaluates a simple baseline system that achieves a score of ∼33% on this metric, using tf-idf based retrieval to find the relevant evidence and a natural language inference (NLI) model to classify the relation between the returned evidence and the claim. Our system attempts to improve on this baseline by addressing two major weaknesses. Firstly, the original retrieval component only finds a full evidence set for 55% of claims. While tf-idf is an effective task-agnostic approach to information retrieval, we find that a simple linear model using task-specific features is able to achieve much stronger performance. Secondly, the NLI component uses an overly simplistic strategy for aggregating retrieved evidence, by simply concatenating all the sentences into a single paragraph. Instead, we employ an explicit aggregation step to combine the knowledge gained from each evidence sentence. These improvements allow us to achieve a FEVER score of 65.41% on the development set, and 62.52% on the test set.

2 System Description

Our system is a four-stage model consisting of document retrieval, sentence retrieval, NLI and aggregation. Document retrieval attempts to find the name of a Wikipedia article in the claim, and then ranks each article based on capitalisation, sentence position and token match features. A set of sentences is then retrieved from the top-ranked articles, based on token matches with the claim and position in the article. The NLI model is subsequently applied to each of the retrieved sentences paired with the claim, giving a prediction for each potential evidence sentence. The respective predictions are aggregated using a Multi-Layer Perceptron (MLP), and the sentences are finally re-ranked so that the evidence which is consistent with the final prediction is placed at the top.

2.1 Document Retrieval

Our method begins by building a dictionary of article titles, based on the observation that FEVER claims frequently include the title of a Wikipedia article containing the required evidence. These titles are first normalised by lowercasing, converting underscores to spaces and truncating to the first parenthesis if present. An initial list of potential articles is then constructed by detecting any such title in the claim. For each article, the probability of containing the gold evidence is predicted by a logistic regression model, using as features the position and capitalisation within the claim, the presence of stop words, and token match counts between the first sentence of the article and the claim. We likewise include the same counts for the rest of the article as features, alongside whether the name was truncated and whether the excised words are mentioned in the claim (e.g., "Watchmen" vs "Watchmen (film)").
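To make the title detection step concrete, the following Python sketch (our illustration, not the authors' released code; the function names are hypothetical) normalises article titles and detects them in a claim.

    import re

    def normalise_title(title):
        """Lowercase, convert underscores to spaces, truncate at the first parenthesis."""
        title = title.lower().replace("_", " ")
        title = re.sub(r"\s*\(.*$", "", title)  # drop "(film)"-style disambiguators
        return title.strip()

    def candidate_articles(claim, titles):
        """Return every article whose normalised title occurs verbatim in the claim."""
        claim_lower = claim.lower()
        return [t for t in titles
                if normalise_title(t) and normalise_title(t) in claim_lower]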

The model is trained on a balanced set of positive and negative examples drawn from the training set, and the top-ranked articles are then passed on to the sentence retrieval component.

This process is related to, but goes substantially beyond, entity recognition and linking (Mendes et al., 2017). These processes attempt to identify mentions of names from a limited class of entities (e.g. people, places, organisations). In our case, the mentions cover a much wider range of lexical items, including not only names but also common nouns, verbs or adjectives. Nonetheless, both types of model share the objective of finding mentions and linking them to a reference set.

Figure 2: Overview: Aggregation Network.

2.2 Sentence Retrieval

We observed that many evidence sentences appear at the beginning of an article, and that they often mention the article title. We thus train a logistic regression model, using as features the position of the sentence within the article, its length, whether the article name is present, token matching between the sentence and the claim, and the document retrieval score. The top-ranked sentences from this model are then passed to the subsequent NLI stage.

2.3 Natural Language Inference (NLI)

In this component, an NLI model predicts a label for each pair of claim and retrieved evidence sentence. We adopted the Enhanced Sequential Inference Model (ESIM) (Chen et al., 2017) as the NLI model. ESIM employs a bidirectional LSTM (BiLSTM) to encode the premise and hypothesis, and also encodes local inference information so that the model can effectively exploit sequential information. We also experimented with the Decomposable Attention Model (DAM) (Parikh et al., 2016), as used in the baseline model; however, ESIM consistently performed better. The Jack the Reader (Weissenborn et al., 2018) framework was used for both DAM and ESIM.

We first pre-trained the ESIM model on the Stanford Natural Language Inference (SNLI) corpus (Bowman et al., 2015), and then fine-tuned it on the FEVER dataset. We used 300-dimensional pre-trained GloVe word embeddings (Pennington et al., 2014). As training input, we used gold evidence sentences for SUPPORTS and REFUTES samples, and retrieved evidence sentences for NOT ENOUGH INFO.

It is worth noting that there are two kinds of evidence in this task. The first is a complete evidence set, which can support or refute a claim and can consist of multiple sentences. The second is incomplete evidence, which can support or refute the claim only when paired with other evidence.

The baseline model (Thorne et al., 2018) simply concatenates all evidence sentences and feeds them into the NLI model, regardless of their evidence type. In contrast, we generate NLI predictions individually for each predicted evidence sentence, thus processing them in parallel.

Furthermore, we observed that evidence sentences often include a pronoun referring to the subject of the article without explicitly mentioning it. This co-reference is opaque to the NLI model without further information. To resolve this, we prepend the corresponding article title to the sentence, along with a separator, as described in Figure 1. We also experimented with adding line numbers to represent sentence position within the article, which did not, however, improve label accuracy.
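A minimal sketch of this preprocessing step follows (our illustration; the separator string is an assumption, not necessarily the one used in the system):

    def add_title_context(page_title, sentence, separator=" : "):
        """Prepend the Wikipedia page title to an evidence sentence so the NLI
        model can resolve pronouns that refer to the article's subject."""
        readable_title = page_title.replace("_", " ")
        return readable_title + separator + sentence

    # add_title_context("Andrea_Pirlo", "He plays for an American club.")
    # -> "Andrea Pirlo : He plays for an American club."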

Figure 3: Illustration of the co-reference problem with individual sentences: what 'it' refers to is not obvious to an NLI model.

2.4 Aggregation

In the aggregation stage, the model aggregates the predicted NLI labels for each claim-evidence pair and outputs the final prediction.

The NLI model outputs three prediction scores per claim-evidence pair, one for each label. In our aggregation model, these scores are all fed into an MLP, alongside the evidence confidence scores for each of the (ranked) evidence sentences. Since the label balance in the training set is significantly biased, we give the samples training weights which are inversely proportional to the size of their respective class. We also experimented with drawing samples according to the size of each class, but using the full training data with class weights performed better. The final MLP contains 2 hidden layers with 100 hidden units each and Rectified Linear Unit (ReLU) nonlinearities (Nair and Hinton, 2010). We observed only minor performance differences when modifying the size and number of layers of the MLP.

Aside from this neural aggregation module, we also tested logical aggregation, majority-vote aggregation, and top-1 sentence aggregation. In logical aggregation, our module takes the NLI predictions for all evidence sentences and outputs SUPPORTS or REFUTES if at least one prediction carries that label, and NOT ENOUGH INFO if all predictions carry that label. In cases where both SUPPORTS and REFUTES appear among the predictions, we take the label of the highest-ranked evidence. Majority-vote counts the frequency of labels among the predictions and outputs the most frequent label. 15 predicted evidence sentences are used in each aggregation method.
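As we read it, the logical aggregation rule reduces to a few lines; the sketch below is ours, and assumes the predictions are ordered from highest- to lowest-ranked evidence:

    def logical_aggregate(ranked_predictions):
        """Aggregate per-sentence NLI labels into a claim label.

        ranked_predictions: labels ("SUPPORTS", "REFUTES", "NOT ENOUGH INFO"),
        ordered from highest- to lowest-ranked evidence sentence.
        """
        verdicts = [p for p in ranked_predictions if p != "NOT ENOUGH INFO"]
        if not verdicts:
            return "NOT ENOUGH INFO"
        # If both SUPPORTS and REFUTES occur, the highest-ranked evidence wins.
        return verdicts[0]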

3 Results

3.1 Aggregation Results

Table 1 shows the development set results of our model under the different aggregation settings. Note that the Evidence Recall and F1 metrics are calculated based on the top 5 predicted evidence sentences. We observe that the Majority-vote aggregation method reaches only a 43.94% FEVER score and 45.36% label accuracy, both of which are much lower than the other methods. Since there are only a few gold evidence sentences for most claims, the majority of NLI predictions tend to be NOT ENOUGH INFO, rendering majority aggregation impractical.

Conversely, top-1 sentence aggregation uses the top-ranked sentence alone to form a label prediction. In this scenario a failure of the retrieval component is critical; nevertheless, the system achieves a FEVER score of 63.36%, a large margin over the baseline model (Thorne et al., 2018). Logical aggregation improves slightly over omitting aggregation entirely (top-1 sentence). However, the neural aggregation module produces the best overall results, both in terms of FEVER score and label accuracy. This demonstrates the advantage of a neural aggregation model operating on individual NLI confidence scores, compared to the more rigid use of only the predicted labels in logical aggregation.


Aggregation Method | FEVER Score | Label Accuracy | Evidence Recall | Evidence F1
Majority-vote      | 43.94       | 45.36          | 83.91           | 35.36
Top-1 sentence     | 63.36       | 66.30          | 84.62           | 35.72
Logical            | 64.29       | 68.26          | 85.03           | 36.02
MLP                | 65.41       | 69.66          | 84.54           | 35.84

Table 1: Development set scores for different aggregation methods, all numbers in percent.

Gold \ Prediction | supports | refutes | not enough info | Total
supports          | 5,345    | 336     | 985             | 6,666
refutes           | 827      | 4,196   | 1,643           | 6,666
not enough info   | 1,288    | 989     | 4,389           | 6,666
Total             | 7,460    | 5,521   | 7,017           | 19,998

Table 2: Confusion matrix on the development set.

Figure 4: Performance of retrieval models on the development set (proportion of claims with full evidence vs. number of items returned, for documents and for sentences).

Finally, after obtaining the aggregated label with the MLP, the model re-sorts its evidence predictions so that evidence sentences with the same predicted label as the final prediction are ranked above those with a different label (see the upper part of Figure 1). We observed that this re-ranking increased evidence recall by 0.18 points (when used with MLP aggregation).

The overall FEVER score is the proportion of claims for which both the correct evidence is returned and a correct label prediction is made. We first describe the performance of the retrieval components, and then discuss the results for NLI.

3.2 Retrieval Results

On the development set, the initial step of identifying Wikipedia article titles within the text of the claim returns on average 62 articles per claim. These articles cover the full evidence set in 90.8% of cases, and no relevant evidence is returned for only 2.9% of claims. Ranking these articles using the model described above achieves 81.4% HITS@1, and this single top-ranked article contains the full evidence in 74.7% of instances. Taking the text of the 15 best articles and ranking the sentences achieves 73.7% HITS@1, which is equivalent to returning the full evidence for 68% of claims. Figure 4 illustrates the performance of the IR components as the number of returned items increases.

4 Error Analysis

Table 2 shows the confusion matrix for the development set predictions. We observe that the system finds it easiest to classify instances labelled as SUPPORTS, whereas using the NOT ENOUGH INFO label correctly is most difficult.

We next describe some frequent failure cases of our model.

Limitations of word embeddings. Numerical expressions like years (1980s vs 80s) or months (January vs October) tend to have similar word embeddings, making it difficult for an NLI model to distinguish them and correctly predict REFUTES cases. This was the most frequent error type encountered in the development set.

Confusing sentences. An NLI model aligns two sentences and predicts their relationship. For example, when the two sentences are "Bob is in his house and he cannot sleep" and "Bob is awake", a model can conclude that the second sentence follows from the first one simply by aligning Bob with Bob and cannot sleep with awake. However, the model sometimes fails to capture a correct alignment, which results in a failed prediction. For example, "Andrea Pirlo is an American professional footballer" vs "Andrea Pirlo is an Italian professional footballer who plays for an American club."

Sentence complexity. In some cases, alignment alone is not enough to predict the correct label; the model needs to capture the relationship between multiple words. For example, "Virginia keeps all computer chips manufactured within the state for use in Virginian electronics." vs. "Virginia's computer chips became the state's leading export by monetary value."

5 Future Work

For the model to correctly read sentences that include numerical expressions, it could be helpful to explicitly encode the numerical expression and obtain a representation that captures its numerical features (Spithourakis and Riedel, 2018). Leveraging context-dependent pre-trained word embeddings such as ELMo (Peters et al., 2018) could help the model deal better with more complex sentences.

6 Conclusion

In this paper, we described our FEVER shared-task system. We employed a four-stage framework composed of document retrieval, sentence retrieval, natural language inference, and aggregation. By applying task-specific features in the retrieval models and connecting an aggregation network on top of the NLI model, our system achieves a score of 65.41% on the development set and 62.52% on the provisional test set.

Acknowledgments

This work has been supported by the European Union H2020 project SUMMA (grant No. 688139), by an Allen Distinguished Investigator Award, and by an Engineering and Physical Sciences Research Council scholarship.

References

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642. Association for Computational Linguistics.

Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, Hui Jiang, and Diana Inkpen. 2017. Enhanced LSTM for natural language inference. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1657–1668. Association for Computational Linguistics.

Afonso Mendes, David Nogueira, Samuel Broscheit, Filipe Aleixo, Pedro Balage, Rui Martins, Sebastiao Miranda, and Mariana S. C. Almeida. 2017. SUMMA at TAC Knowledge Base Population task 2017. In Proceedings of the Text Analysis Conference (TAC) KBP 2017, Gaithersburg, Maryland, USA. National Institute of Standards and Technology.

Vinod Nair and Geoffrey E. Hinton. 2010. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning, ICML'10, pages 807–814, USA. Omnipress.

Ankur P. Parikh, Oscar Tackstrom, Dipanjan Das, and Jakob Uszkoreit. 2016. A decomposable attention model for natural language inference. CoRR, abs/1606.01933.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In EMNLP, volume 14, pages 1532–1543.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. CoRR, abs/1802.05365.

Georgios P. Spithourakis and Sebastian Riedel. 2018. Numeracy for language models: Evaluating and improving their ability to predict numbers. CoRR, abs/1805.08154.

James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018. FEVER: a large-scale dataset for fact extraction and verification. In NAACL-HLT.

David Uberti. 2016. Washington Post fake news story blurs the definition of fake news. Columbia Journalism Review.

Andreas Vlachos and Sebastian Riedel. 2014. Fact checking: Task definition and dataset construction. In Proceedings of the ACL 2014 Workshop on Language Technologies and Computational Social Science, pages 18–22, Baltimore, MD, USA. Association for Computational Linguistics.

Dirk Weissenborn, Pasquale Minervini, Tim Dettmers, Isabelle Augenstein, Johannes Welbl, Tim Rocktaschel, Matko Bosnjak, Jeff Mitchell, Thomas Demeester, Pontus Stenetorp, and Sebastian Riedel. 2018. Jack the Reader: A Machine Reading Framework. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL) System Demonstrations.


UKP-Athene: Multi-Sentence Textual Entailment for Claim Verification

Andreas Hanselowski†∗, Hao Zhang∗, Zile Li∗, Daniil Sorokin†∗, Benjamin Schiller∗, Claudia Schulz†∗, Iryna Gurevych†∗

†Research Training Group AIPHES, Computer Science Department, Technische Universität Darmstadt
https://www.aiphes.tu-darmstadt.de
∗Ubiquitous Knowledge Processing Lab (UKP-TUDA), Computer Science Department, Technische Universität Darmstadt
https://www.ukp.tu-darmstadt.de/

Abstract

The Fact Extraction and VERification (FEVER) shared task was launched to support the development of systems able to verify claims by extracting supporting or refuting facts from raw text. The shared task organizers provide a large-scale dataset for the consecutive steps involved in claim verification, in particular document retrieval, fact extraction, and claim classification. In this paper, we present our claim verification pipeline approach, which, according to the preliminary results, scored third in the shared task out of 23 competing systems. For document retrieval, we implemented a new entity linking approach. In order to rank candidate facts and classify a claim on the basis of several selected facts, we introduce two extensions to the Enhanced LSTM (ESIM).

1 Introduction

In the past years, the amount of false or misleading content on the Internet has significantly increased. As a result, information evaluation in terms of fact-checking has become increasingly important, as it allows controversial claims stated on the web to be verified. However, due to the large number of fake news and hyperpartisan articles published online every day, manual fact-checking is no longer feasible. Thus, researchers as well as corporations are exploring different techniques to automate the fact-checking process1.

In order to advance research in this direction, the Fact Extraction and VERification (FEVER) shared task2 was launched. The organizers of the FEVER shared task constructed a large-scale dataset (Thorne et al., 2018) based on Wikipedia. This dataset contains 185,445 claims, each of which comes with several evidence sets. An evidence set consists of facts, i.e. sentences from Wikipedia articles, that jointly support or contradict the claim. On the basis of (any one of) its evidence sets, each claim is labeled as Supported, Refuted, or NotEnoughInfo if no decision about the veracity of the claim can be made. Supported by the structure of the dataset, the FEVER shared task encompasses three sub-tasks that need to be solved.

Document retrieval: Given a claim, find (English) Wikipedia articles containing information about this claim.

Sentence selection: From the retrieved articles, extract facts in the form of sentences that are relevant for the verification of the claim.

Recognizing textual entailment: On the basis of the collected sentences (facts), predict one of three labels for the claim: Supported, Refuted, or NotEnoughInfo.

To evaluate the performance of the competing systems, an evaluation metric was devised by the FEVER organizers: a claim is considered correctly verified if, in addition to predicting the correct label, a correct evidence set is retrieved.

In this paper, we describe the pipeline system that we developed to address the FEVER task. For document retrieval, we implemented an entity linking approach based on constituency parsing and handcrafted rules. For sentence selection, we developed a sentence ranking model based on the Enhanced Sequential Inference Model (ESIM) (Chen et al., 2016). We furthermore extended the ESIM for recognizing textual entailment between multiple input sentences and the claim using an attention mechanism. According to the preliminary results of the FEVER shared task, our system came third out of 23 competing teams.

1 https://fullfact.org/media/uploads/full_fact-the_state_of_automated_factchecking_aug_2016.pdf

2 http://fever.ai/task.html

2 Background

In this section, we present the underlying methods that we adopted for the development of our system.

2.1 Entity linking

The document retrieval step requires matching a given claim with the content of a Wikipedia article. A claim frequently features one or multiple entities that form its main content.

Furthermore, Wikipedia can be viewed as a knowledge base in which each article describes a particular entity, denoted by the article title. Thus, the document retrieval step can be framed as an entity linking problem (Cucerzan, 2007): identifying entity mentions in the claim and linking them to the Wikipedia articles describing these entities. The linked Wikipedia articles can then be used as the set of retrieved documents for the subsequent steps.

2.2 Enhanced Sequential Inference Model

Originally developed for the SNLI task (Bowman et al., 2015) of determining entailment between two statements, the ESIM (Enhanced Sequential Inference Model) (Chen et al., 2016) creates a rich representation of statement pairs. Since the FEVER task requires the handling of claim-sentence pairs, we use the ESIM as the basis for both sentence selection and textual entailment. The ESIM solves the entailment problem in three consecutive steps, taking two statements as input.

Input encoding: Using a bidirectional LSTM (BiLSTM) (Graves and Schmidhuber, 2005), representations of the individual tokens of the two input statements are computed.

Local inference modeling: Each token of one statement is used to compute attention weights with respect to each token in the other statement, giving rise to an attention weight matrix. Then, each token representation is multiplied by all of its attention weights and weighted pooling is applied to compute a single representation for each token. This operation gives rise to two new representations of the two statements.

Inference composition: These two statement representations are then fed into two BiLSTMs, which again compute sequences of representations for each statement. Maximum and average pooling is applied to the two sequences to derive two representations, which are then concatenated (the last hidden state of the ESIM) and fed into a multilayer perceptron for the classification of the entailment relation.

3 Our system for fact extraction and claim verification

In this section, we describe the models that we developed for the three FEVER sub-tasks.

3.1 Document retrieval

As explained in Section 2.1, we propose an entity linking approach to the document retrieval sub-task. That is, we find entities in the claims that match the titles of Wikipedia articles (documents). Following the typical entity linking pipeline, we develop a document retrieval component that has three main steps.

Mention extraction: Named entity recognition tools focus only on the main types of entities (Location, Organization, Person). In order to find entities of other categories, such as movie titles, which are numerous in the shared task dataset, we employ the constituency parser from AllenNLP (Gardner et al., 2017). After parsing the claim, we consider every noun phrase as a potential entity mention. However, a movie or a song title may be an adjective or any other type of syntactic phrase. To account for such cases, we use a heuristic that adds all words in the claim before the main verb, as well as the whole claim itself, as potential entity mentions. For example, the claim "Down With Love is a 2003 comedy film." contains the noun phrases 'a 2003 comedy film' and 'Love'. Neither of the noun phrases constitutes an entity mention, but the tokens before the main verb, 'Down With Love', form an entity.

Candidate article search: We use the MediaWiki API3 to search through the titles of all Wikipedia articles for matches with the potential entity mentions found in the claim. The MediaWiki API uses the Wikipedia search engine to find matching articles. The top match is the article whose title has the largest overlap with the query. For each entity mention, we store the seven highest-ranked Wikipedia article matches.
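The mention-extraction heuristic described above can be sketched roughly as follows. This is our illustration only: it assumes noun phrases have already been produced by a constituency parser (AllenNLP in the paper) and that POS tags are available; treating the first verb as the main verb is a simplification.

    def extract_mentions(claim, noun_phrases, pos_tags):
        """Collect potential entity mentions from a claim.

        claim: the claim string.
        noun_phrases: noun-phrase strings taken from a constituency parse.
        pos_tags: list of (token, POS) pairs for the claim, in order.
        """
        mentions = set(noun_phrases)
        # Heuristic: everything before the main verb, plus the claim itself.
        tokens_before_verb = []
        for token, pos in pos_tags:
            if pos.startswith("VB"):  # simplification: first verb = main verb
                break
            tokens_before_verb.append(token)
        if tokens_before_verb:
            mentions.add(" ".join(tokens_before_verb))
        mentions.add(claim.rstrip("."))
        return mentions

    # extract_mentions("Down With Love is a 2003 comedy film.",
    #                  ["a 2003 comedy film", "Love"],
    #                  [("Down", "RB"), ("With", "IN"), ("Love", "NNP"),
    #                   ("is", "VBZ"), ("a", "DT"), ("2003", "CD"),
    #                   ("comedy", "NN"), ("film", "NN"), (".", ".")])
    # adds "Down With Love" alongside the noun phrases and the full claim.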

Figure 1: Sentence selection model.

The MediaWiki API uses the online version of Wikipedia and, since there are some discrepancies between the 2017 dump used in the shared task and the latest version, we also perform an exact search over all Wikipedia article titles in the dump. We add these results to the set of retrieved articles.

3 https://www.mediawiki.org/wiki/API:Main_page

Candidate filtering: The MediaWiki API retrieves articles whose title overlaps with the query. Thus, the results may contain articles with a title longer or shorter than the entity mention used in the query. Similarly to previous work on entity linking (Sorokin and Gurevych, 2018), we remove results that are longer than the entity mention and do not overlap with the rest of the claim. To check this overlap, we first remove the content in parentheses from the Wikipedia article titles and stem the remaining words in the titles and the claim. Then, we discard a Wikipedia article if its stemmed article title is not completely included in the stemmed claim.
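The stemmed-title-inclusion check can be sketched as follows (our code; the Porter stemmer from NLTK is used here for illustration and is an assumption, since the paper does not name a stemmer):

    import re
    from nltk.stem import PorterStemmer

    _stemmer = PorterStemmer()

    def _stem_tokens(text):
        text = re.sub(r"\(.*?\)", " ", text)  # drop parenthesised disambiguators
        return [_stemmer.stem(tok) for tok in re.findall(r"\w+", text.lower())]

    def keep_candidate(article_title, claim):
        """Discard an article if its stemmed title is not fully contained in the stemmed claim."""
        title_stems = _stem_tokens(article_title.replace("_", " "))
        claim_stems = set(_stem_tokens(claim))
        return all(stem in claim_stems for stem in title_stems)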

We collect all retrieved Wikipedia articles for all identified entity mentions in the claim after filtering and supply them to the next step in the pipeline. The evaluation of the document retrieval system on the development data shows the effectiveness of our ad-hoc entity linking approach (see Section 4).

3.2 Sentence selection

In this step, we select candidate sentences as a potential evidence set for a claim from the Wikipedia articles retrieved in the previous step. This is achieved by extending the ESIM to generate a ranking score on the basis of two input statements, instead of predicting the entailment relation between them.

Architecture: The modified ESIM takes as input a claim and a sentence. To generate the ranking score, the last hidden state of the ESIM (see Section 2.2) is fed into a hidden layer which is connected to a single neuron that predicts the ranking score. As a result, we are able to rank all sentences of the retrieved documents according to the computed ranking scores. In order to find a potential evidence set, we select the five highest-ranked sentences.

Training: Our adaptation of the ESIM is illustrated in Fig. 1. In training mode, the ESIM takes as input a claim and the concatenated sentences of an evidence set. As a loss function, we use a modified hinge loss with negative sampling: ∑ max(0, 1 + s_n − s_p), where s_p indicates the positive ranking score and s_n the negative ranking score for a given claim-sentence pair. To get s_p, we feed the network a claim and the concatenated sentences of one of its ground-truth evidence sets. To get s_n, we take all Wikipedia articles from which the ground-truth evidence sets of the claim originate, randomly sample five sentences (excluding the sentences of the ground-truth evidence sets for the claim), and feed the concatenation of these sentences into the same ESIM. With our modified hinge loss function, we then try to maximize the margin between positive and negative samples.

Testing: At test time, we calculate the score between the claim and each sentence in the retrieved documents. For this purpose, we deploy an ensemble of ten models with different random seeds. The mean score of a claim-sentence pair over all ten models of the ensemble is calculated and the scores for all pairs are ranked. Finally, the sentences of the five highest-ranked pairs are taken as the output of the model.
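The margin objective used in training can be written compactly. The sketch below is ours and uses PyTorch purely for illustration (the framework is an assumption); it computes the loss for a batch of paired positive and negative ranking scores.

    import torch

    def ranking_hinge_loss(pos_scores, neg_scores, margin=1.0):
        """Sum of max(0, margin + s_n - s_p) over paired positive/negative samples.

        pos_scores: tensor of ranking scores for gold evidence sets (s_p).
        neg_scores: tensor of ranking scores for sampled negative sets (s_n).
        """
        return torch.clamp(margin + neg_scores - pos_scores, min=0).sum()

    # Example: ranking_hinge_loss(torch.tensor([2.1, 0.3]), torch.tensor([1.5, 0.9]))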

3.3 Recognizing textual entailment

In order to classify the claim as Supported, Refuted or NotEnoughInfo, we use the five sentences retrieved by the sentence selection model described in the previous section. For the classification, we propose another extension to the ESIM, which can predict the entailment relation between multiple input sentences and the claim. Fig. 2 gives an overview of our extended ESIM for the FEVER textual entailment task.

As the word representation for both the claim and the sentences, we concatenate the GloVe (Pennington et al., 2014) and FastText (Bojanowski et al., 2016) embeddings for each word. Since both types of embeddings are pretrained on Wikipedia, they are particularly suitable for our problem setting.


Figure 2: Extended ESIM for recognizing textual entailment.

To process the five input sentences using the ESIM, we combine the claim with each sentence and feed the resulting pairs into the ESIM. The last hidden states of the five individual claim-sentence runs of the ESIM are compressed into one vector using attention and pooling operations.

The attention is based on a representation of the claim that is independent of the five sentences. This representation is obtained by summing up the input encodings of the claim in the five ESIM runs. In the same way, we derive five sentence representations, one from each of the five runs of the ESIM, which are independent of the claim. For each claim-sentence pair, the single sentence representation and the claim representation are then individually fed through a single-layer perceptron. The cosine similarity of these two vectors is then used as an attention weight. The five output vectors of all ESIMs are multiplied with their respective attention weights, and we apply average and max pooling on these vectors in order to reduce them to two representations. Finally, the two representations are concatenated and fed through a 3-layer perceptron to predict one of the three classes Supported, Refuted or NotEnoughInfo. The idea behind the described attention mechanism is to allow the model to extract information from the five sentences that is most relevant for the classification of the claim.
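Our reading of this attention-and-pooling step can be sketched as follows (a sketch only, in PyTorch; dimensions, layer names, and the choice of framework are assumptions, not details taken from the paper):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class EvidenceAttentionPooling(nn.Module):
        """Combine per-sentence ESIM outputs into a single claim-level representation."""

        def __init__(self, hidden_dim):
            super().__init__()
            self.claim_proj = nn.Linear(hidden_dim, hidden_dim)
            self.sent_proj = nn.Linear(hidden_dim, hidden_dim)

        def forward(self, esim_outputs, claim_repr, sent_reprs):
            # esim_outputs: (num_sents, hidden_dim) last hidden states of each claim-sentence run
            # claim_repr:   (hidden_dim,) claim representation (summed input encodings)
            # sent_reprs:   (num_sents, hidden_dim) per-sentence representations
            c = self.claim_proj(claim_repr)                           # project the claim
            s = self.sent_proj(sent_reprs)                            # project each sentence
            weights = F.cosine_similarity(s, c.unsqueeze(0), dim=1)   # one weight per sentence
            attended = esim_outputs * weights.unsqueeze(1)            # scale each ESIM output
            pooled = torch.cat([attended.mean(dim=0), attended.max(dim=0).values])
            return pooled  # fed to a 3-layer MLP classifier downstream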

4 Results

Table 1 shows the performance of our document retrieval and sentence selection systems when retrieving different numbers of the highest-ranked Wikipedia articles. In contrast to the results reported in Table 3, here we consider a single model instead of an ensemble. The results show that both systems benefit from a larger number of retrieved articles.

#search results | doc. accuracy | sent. recall
3               | 92.60         | 85.37
5               | 93.30         | 86.02
7               | 93.55         | 86.24

Table 1: Performance of the retrieval systems using different numbers of MediaWiki search results.

For the subtask of recognizing textual entailment, we also experiment with different numbers of selected sentences. The results in Table 2 demonstrate that our model performs best with all five selected sentences.

#sentence(s) | label accuracy | FEVER score
1            | 67.94          | 63.64
2            | 68.33          | 64.30
3            | 67.82          | 63.72
4            | 67.61          | 63.59
5            | 68.49          | 64.74

Table 2: Performance of the textual entailment model using different numbers of sentences.

In Table 3, we compare the performance of our three systems, as well as the full pipeline, to the baseline systems and pipeline implemented by the shared task organizers (Thorne et al., 2018) on the development set. As the results demonstrate, we were able to significantly improve upon the baseline on each sub-task. The performance gains over the whole pipeline add up to an improvement of about 100% with respect to the baseline pipeline, i.e. the FEVER score roughly doubles.

5 Error analysis

In this section, we present an error analysis for each of the three sub-tasks, which can serve as a basis for further improvements of the system.

5.1 Document retrieval

The typical errors encountered for the document retrieval system can be divided into three classes.


Task (metric)                       | system     | score (%)
Document retrieval (accuracy)       | baseline   | 70.20
                                    | our system | 93.55
Sentence selection (recall)         | baseline   | 44.22
                                    | our system | 87.10
Textual entailment (label accuracy) | baseline   | 52.09
                                    | our system | 68.49
Full pipeline (FEVER score)         | baseline   | 32.27
                                    | our system | 64.74

Table 3: Performance comparison of our system and the baseline system on the development set.

Spelling errors: A word in the claim or in the article title is misspelled, e.g. "Homer Hickman wrote some historical fiction novels." vs. "Homer Hickam". In this case, our document retrieval system discards the article during the filtering phase.

Missing entity mentions: The entity mention represented by the title of the article that needs to be retrieved is not related to any entity mention in the claim. E.g. article title: "Blue Jasmine"; claim: "Cate Blanchett ignored the offer to act in Cate Blanchett.".

Search failures: Some article titles contain a category name in parentheses for the disambiguation of the entity. This makes it difficult to retrieve the exact article title using the MediaWiki API. E.g. the claim "Alex Jones is apolitical." requires the article "Alex Jones (radio host)", but it is not contained in the MediaWiki search results.

5.2 Sentence selection

The most frequent case in which the sentence selection model fails to retrieve a correct evidence set is when the entity mention in the claim does not occur in the annotated evidence set. E.g. the only evidence set for the claim "Daggering is nontraditional." consists of the single sentence "This dance is not a traditional dance.". Here, "this dance" refers to "daggering" and cannot be resolved by our model, since the information that "daggering" is a dance is mentioned neither in the evidence sentence nor in the claim. Some evidence sets contain two sentences, one of which is less related to the claim. E.g. the claim "Henry II of France has three cars." has an evidence set that contains the two sentences "Henry II died in 1559." and "1886 is regarded as the birth year of the modern car.". The second sentence is not directly related to the claim and is thus ranked very low by our model.

5.3 Recognizing textual entailment

A large number of claims are misclassified due to the model's inability to interpret numerical values. For instance, the claim "The heart beats at a resting rate close to 22 beats per minute." is not classified as refuted on the basis of the evidence sentence "The heart beats at a resting rate close to 72 beats per minute.". The only information refuting the claim is the number, but neither GloVe nor FastText embeddings can embed numbers distinctly enough. Another problem is posed by challenging NotEnoughInfo cases. For instance, the claim "Terry Crews played on the Los Angeles Chargers." (annotated as NotEnoughInfo) is classified as refuted, given the sentence "In football, Crews played ... for the Los Angeles Rams, San Diego Chargers and Washington Redskins, ...". The sentence is related to the claim but does not exclude it, which makes this case difficult.

6 Conclusion

In this paper, we presented the system for fact extraction and verification that we developed in the course of the FEVER shared task. According to the preliminary results, our system scored third out of 23 competing teams. The shared task was divided into three parts: (i) given a claim, retrieve Wikipedia documents that contain facts about the claim; (ii) extract these facts from the documents; (iii) verify the claim on the basis of the extracted facts. To address the problem, we developed models for the three sub-tasks. We framed document retrieval as entity linking by identifying entities in the claim and linking them to Wikipedia articles. To extract facts from the articles, we developed a sentence ranking model by extending the ESIM. For claim verification, we proposed another extension to the ESIM, with which we were able to classify the claim on the basis of multiple facts using attention. Each of our three models, as well as the combined pipeline, significantly outperforms the baseline on the development set.

7 Acknowledgements

This work has been supported by the German Research Foundation as part of the Research Training Group Adaptive Preparation of Information from Heterogeneous Sources (AIPHES) at the Technische Universität Darmstadt, grant No. GRK 1994/1.


References

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. arXiv preprint arXiv:1508.05326.

Qian Chen, Xiaodan Zhu, Zhenhua Ling, Si Wei, Hui Jiang, and Diana Inkpen. 2016. Enhanced LSTM for natural language inference. arXiv preprint arXiv:1609.06038.

Silviu Cucerzan. 2007. Large-Scale Named Entity Disambiguation Based on Wikipedia Data. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 708–716, Prague, Czech Republic.

Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew Peters, Michael Schmitz, and Luke S. Zettlemoyer. 2017. AllenNLP: A deep semantic natural language processing platform.

Alex Graves and Jürgen Schmidhuber. 2005. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5-6):602–610.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Daniil Sorokin and Iryna Gurevych. 2018. Mixing Context Granularities for Improved Entity Linking on Question Answering Data across Entity Categories. In Proceedings of the 7th Joint Conference on Lexical and Computational Semantics (*SEM 2018), pages 65–75. Association for Computational Linguistics.

James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018. FEVER: A large-scale dataset for fact extraction and verification. In NAACL-HLT.


Team Papelo: Transformer Networks at FEVER

Christopher Malon
NEC Laboratories America
[email protected]

Abstract

We develop a system for the FEVER fact extraction and verification challenge that uses a high-precision entailment classifier, based on transformer networks pretrained with language modeling, to classify a broad set of potential evidence. The precision of the entailment classifier allows us to enhance recall by considering every statement from several articles to decide upon each claim. We include not only the articles best matching the claim text by TFIDF score, but also read additional articles whose titles match named entities and capitalized expressions occurring in the claim text. The entailment module evaluates potential evidence one statement at a time, together with the title of the page the evidence came from (providing a hint about possible pronoun antecedents). In preliminary evaluation, the system achieves a .5736 FEVER score, .6108 label accuracy, and .6485 evidence F1 on the FEVER shared task test set.

1 Introduction

The release of the FEVER fact extraction and verification dataset (Thorne et al., 2018) provides a large-scale challenge that tests a combination of retrieval and textual entailment capabilities. To verify a claim in the dataset as supported, refuted, or undecided, a system must retrieve relevant articles and sentences from Wikipedia. Then it must decide whether each of those sentences, or some combination of them, entails or refutes the claim, which is an entailment problem. Systems are evaluated on the accuracy of the claim predictions, with credit only given when correct evidence is submitted.

As entailment data, premises in FEVER differ substantially from those in the image caption data used as the basis for the Stanford Natural Language Inference (SNLI) (Bowman et al., 2015) dataset. Sentences are longer (31 compared to 14 words on average), the vocabulary is more abstract, and the prevalence of named entities and out-of-vocabulary terms is higher.

The retrieval aspect of FEVER is not straightforward either. A claim may have little word overlap with the relevant evidence, especially if the claim is refuted by the evidence.

Our approach to FEVER is to fix the most obvious shortcomings of the baseline approaches to retrieval and entailment, and to train a sharp entailment classifier that can be used to filter a broad set of retrieved potential evidence. For the entailment classifier we compare Decomposable Attention (Parikh et al., 2016; Gardner et al., 2017) as implemented in the official baseline, ESIM (Chen et al., 2017), and a transformer network with pre-trained weights (Radford et al., 2018). The transformer network naturally supports out-of-vocabulary words and gives substantially higher performance than the other methods.

2 Transformer network

The core of our system is an entailment module based on a transformer network. Transformer networks (Vaswani et al., 2017) are deep networks applied to sequential input data, with each layer implementing multiple heads of scaled dot-product attention. This attention mechanism allows deep features to be compared across positions in the input.

Many entailment networks have two sequence inputs, but the transformer is designed with just one. A separator token divides the premise from the hypothesis.

We use a specific transformer network released by OpenAI (Radford et al., 2018) that has been pre-trained for language modeling. The network consists of twelve blocks. Each block consists of a multi-head masked self-attention layer, layer normalization (Ba et al., 2016), a feed-forward network, and another layer normalization. After the twelfth block, two branches exist. In one branch, matrix multiplication and softmax layers are applied at the terminal sequence position to predict the entailment classification. In the other branch, a hidden state is multiplied by each token embedding and a softmax is taken to predict the next token. The language modeling branch has been pre-trained on the BookCorpus dataset (Zhu et al., 2015). We take the pre-trained model and train both branches on examples from FEVER.

3 Reframing entailment

The baseline FEVER system (Thorne et al., 2018) ran the AllenNLP (Gardner et al., 2017) implementation of Decomposable Attention (Parikh et al., 2016) to classify a group of five premise statements concatenated together against the claim. These five premise statements were fixed by the retrieval module and not considered individually. In our system, premise statements are individually evaluated.

We collect training data as the five sentences with the highest TFIDF score against the claim, taken from the Wikipedia pages selected by the retrieval module. If any ground truth evidence group for a claim requires more than one sentence, the claim is dropped from the training set. Otherwise, each sentence is labeled with the truth value of the claim if it is in the ground truth evidence set, and labeled as neutral if not. The resulting data forms an entailment problem that we call "FEVER One." For comparison, we form "FEVER Five" and "FEVER Five Oracle" by concatenating all five retrieved sentences, as in the baseline. In FEVER Five Oracle, the ground truth is the claim ground truth (if verifiable), but in FEVER Five, the ground truth depends on whether the retrieved evidence is in the ground truth evidence set.

Several FEVER claims require multiple statements as evidence in order to be supported or refuted. The number of such claims is relatively small: in the first half of the development set, only 623 of 9999 claims were verifiable and had no singleton evidence groups. Furthermore, we disagreed with many of these annotations and thought that less evidence should have sufficed. Thus we chose not to develop a strategy for multiple evidence statements.

To compare results on FEVER Five to FEVER One, we must aggregate decisions about individual sentences of possible evidence into a decision about the claim. We do this by applying the following rules:

1. If any piece of evidence supports the claim, we classify the claim as supported.

2. If any piece of evidence refutes the claim, but no piece of evidence supports it, we classify the claim as refuted.

3. If no piece of evidence supports or refutes the claim, we classify the claim as not having enough information.

We resolve conflicts between supporting and refuting information in favor of the supporting information, because we observed cases in the development data where information was retrieved for different entities with the same name. For example, Ann Richards appeared both as a governor of Texas and as an Australian actress. Information that would be a contradiction regarding the actress should not stop evidence that would support a claim about the politician.
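In code, these rules reduce to a few lines. The sketch below is our paraphrase of the rules, not the released system; the label strings are assumptions.

    def aggregate_claim_label(sentence_labels):
        """Aggregate per-sentence entailment decisions into a claim-level label.

        Conflicts are resolved in favor of SUPPORTS, since contradictory evidence
        may describe a different entity sharing the same name.
        """
        if "SUPPORTS" in sentence_labels:
            return "SUPPORTS"
        if "REFUTES" in sentence_labels:
            return "REFUTES"
        return "NOT ENOUGH INFO"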

Even if a sentence is in the evidence set, it might not be possible for the classifier to correctly determine whether it supports the claim, because the sentence could have pronouns with antecedents outside the given sentence. Ideally, a coreference resolution system could add this information to the sentence, but running one could be time consuming and introduce its own errors. As a cheap alternative, we make the classifier aware of the title of the Wikipedia page. We convert any underscores in the page title to spaces, and insert the title between brackets before the rest of each premise sentence. The dataset constructed in this way is called "FEVER Title One."
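A minimal sketch of this premise construction (ours, not the released code; the exact bracket spacing is an assumption):

    def title_premise(page_title, sentence):
        """Prefix the premise with the page title in brackets, as in "FEVER Title One"."""
        return "[ " + page_title.replace("_", " ") + " ] " + sentence

    # title_premise("Ann_Richards", "She was a governor of Texas.")
    # -> "[ Ann Richards ] She was a governor of Texas."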

Problem                        | Support Accuracy | Support Kappa | Claim Accuracy | Claim Kappa
ESIM on FEVER One              | .760             | .260          | .517           | .297
ESIM on FEVER Title One        | .846             | .394          | .639           | .433
Transformer on FEVER Title One | .958             | .660          | .823           | .622

Table 1: Effect of adding titles to premises.

Problem                                | Support Accuracy | Support Kappa | Claim Accuracy | Claim Kappa
ESIM on FEVER Title Five Oracle        | N/A              | N/A           | .591           | .388
ESIM on FEVER Title Five               | N/A              | N/A           | .573           | .110
ESIM on FEVER Title One                | .846             | .394          | .639           | .433
Transformer on FEVER Title Five Oracle | N/A              | N/A           | .673           | .511
Transformer on FEVER Title Five        | N/A              | N/A           | .801           | .609
Transformer on FEVER Title One         | .958             | .660          | .823           | .622

Table 2: Concatenating evidence or not.

System                      | Retrieval
FEVER Baseline (TFIDF)      | 66.1%
+ Titles in TFIDF           | 68.3%
+ Titles + NE               | 80.8%
+ Titles + NE + Film        | 81.2%
Entire Articles + NE + Film | 90.1%

Table 3: Percentage of evidence retrieved from the first half of the development set. Single-evidence claims only.

System                            | Development | Test
FEVER Title Five Oracle           | .5289       | —
FEVER Title Five                  | .5553       | —
FEVER Title One                   | .5617       | .5539
FEVER Title One (Narrow Evidence) | .5550       | —
FEVER Title One (Entire Articles) | .5844       | .5736

Table 4: FEVER score of various systems. All use NE+Film retrieval.

The FEVER baseline system works by solving FEVER Five Oracle. Using Decomposable Attention, it achieves .505 accuracy on the test half of the development set. Swapping in the Enhanced Sequential Inference Model (ESIM) (Chen et al., 2017) to solve FEVER Five Oracle results in an accuracy of .561. Because ESIM uses a single out-of-vocabulary (OOV) token for all unknown words, we expect it to confuse named entities. Thus we extend the model by allocating 10,000 indices for out-of-vocabulary words with randomly initialized embeddings, and taking a hash of each OOV word to select one of these indices. With extended ESIM, the accuracy is .586. Therefore, we run most later comparisons with extended ESIM or transformer networks as the entailment module, rather than Decomposable Attention.
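The OOV extension amounts to hashing each unknown word into a fixed pool of trainable embeddings. A rough sketch (ours; the specific hash function is an assumption, only the pool size of 10,000 comes from the text):

    import hashlib

    NUM_OOV_BUCKETS = 10_000

    def token_index(token, vocab, vocab_size):
        """Map a token to a vocabulary index, hashing OOV words into one of
        NUM_OOV_BUCKETS extra indices backed by randomly initialized embeddings."""
        if token in vocab:
            return vocab[token]
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        bucket = int(digest, 16) % NUM_OOV_BUCKETS
        return vocab_size + bucket  # indices [vocab_size, vocab_size + NUM_OOV_BUCKETS)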

The FEVER One dataset is highly unbalanced in favor of neutral statements, so that the majority class baseline would achieve 93.0% on this data. This imbalance makes training ESIM a challenge, as the model only learns the trivial majority class predictor if the natural training distribution is followed. We reweight the examples in FEVER One for ESIM so that each class contributes to the loss equally. Then, we use Cohen's Kappa rather than the accuracy to evaluate a model's quality, so that following the bias with purely random agreement is not rewarded in the evaluation. In Table 1 we compare FEVER One to FEVER Title One, both at the level of classifying individual support statements and at the level of classifying the claim by aggregating these decisions as described above. On a support basis, we find a 52% increase in Kappa by adding the titles.

When ESIM is replaced by the transformer network, class reweighting is not necessary. The network naturally learns to perform in excess of the majority class baseline. Cohen's Kappa is 68% higher than that for ESIM.

The possibility of training on oracle labels for a concatenated set of evidence allows a classifier to simply guess whether the hypothesis is true and supported somewhere, rather than having to consider the relationship between hypothesis and premise. For example, it is possible to classify 67% of SNLI examples correctly without reading the premise (Gururangan et al., 2018). As we show in Table 2, for ESIM, we find that this kind of guessing makes the FEVER Title Five Oracle performance better than FEVER Title Five. The transformer model is accurate enough that oracle guessing does not help. Both models perform best when classifying each piece of evidence separately and then aggregating.

4 Improving retrieval

Regardless of how strong the entailment classifier is, the FEVER score is limited by whether the document and sentence retrieval modules, which produce the input to the entailment classifier, find the right evidence. In Table 3, we examine the percentage of claims for which correct evidence is retrieved, before filtering with the entailment classifier. For this calculation, we skip any claim with an evidence group with multiple statements, and count a claim as successfully retrieved if it is not verifiable or if the statement in one of the evidence groups is retrieved. The baseline system retrieves the five articles with the highest TFIDF score, and then extracts the five sentences from that collection with the highest TFIDF score against the claim. It achieves 66.1% evidence retrieval.

Our first modification simply adds the title to each premise statement when computing its TFIDF against the claim, so that statements from a relevant article get credit even if the subject is not repeated. This raises evidence retrieval to 68.3%.

A more significant boost comes from retrieving additional Wikipedia pages based on named entity recognition (NER). We start with phrases tagged as named entities by SpaCy (Honnibal and Johnson, 2015), but these tags are not very reliable, so we include various capitalized phrases. We retrieve Wikipedia pages whose title exactly matches one of these phrases.
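
A sketch of this kind of candidate-phrase extraction (the spaCy model name and the capitalization heuristic are illustrative assumptions):

import spacy

nlp = spacy.load("en_core_web_sm")      # any English pipeline with NER

def candidate_titles(claim):
    # Named entities plus runs of capitalized tokens; each phrase is a
    # candidate Wikipedia page title for exact matching.
    doc = nlp(claim)
    phrases = {ent.text for ent in doc.ents}
    run = []
    for tok in doc:
        if tok.text[:1].isupper():
            run.append(tok.text)
        elif run:
            phrases.add(" ".join(run))
            run = []
    if run:
        phrases.add(" ".join(run))
    return phrases

print(candidate_titles("Soul Food was released by Fox 2000 Pictures."))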

The named entity retrieval strategy boosts the evidence retrieval rate to 80.8%, while less than doubling the processing time. However, sometimes the named entity page thus retrieved is only a Wikipedia disambiguation page with no useful information. Noticing a lot of questions about films in the development set, we modify the strategy to also retrieve a page titled "X (film)" if it exists, whenever "X" is retrieved. The film retrievals raise evidence retrieval to 81.2%.

Finally, we eliminate the TFIDF sentence ranking to expand sentence retrieval from five sentences to entire articles, up to the first fifty sentences from each. Thus we obtain 2.6 million statements to classify regarding the 19,998 claims in the shared task development set, for an average of 128 premises per claim. The evidence retrieval rate, including all these premises, increases to 90.1%. We continue to apply the entailment module trained with only five premise retrievals. Running the entailment module on this batch using a machine with three NVIDIA GeForce GTX 1080Ti GPU cards takes on the order of six hours.

Retrieving more than five sentences means that we can no longer submit all retrieved evidence as support for the claims. Instead, we follow the aggregation strategy from Section 3 to decide the claim label, and only submit statements whose classification matches. Limiting evidence in this way when only five statements are retrieved ("narrow evidence" in Table 4) pushes the FEVER score down very little, to .5550 from .5617 on the development set, so we are confident that the extra retrieval will make up for the loss. Indeed, when the system reviews the extra evidence, the FEVER score goes up to .5844 on the development set.

Table 4 compares the end-to-end performance of systems that evaluate five retrieved statements together, evaluate five retrieved statements separately, and evaluate all statements from entire articles separately. Evaluating the statements separately gives better performance. We submit the systems that retrieve five statements and entire articles for evaluation on the test set, achieving preliminary FEVER scores of .5539 and .5736 respectively (label accuracy of .5754 and .6108, evidence recall of .6245 and .5002, evidence F1 of .2542 and .6485). In preliminary standings, the latter system ranks fourth in FEVER score and first in evidence F1.

5 Discussion

Our approach to FEVER involves a minimum of heuristics and relies mainly on the strength of the Transformer network based entailment classification. The main performance gains come from adding retrievals that resolve named entities rather than matching the claim text only, filtering fewer of the retrievals, and making the entailment classifier somewhat aware of the topic of what it is reading by including the title. If higher quality and more plentiful multi-evidence claims were constructed, it would be natural to incorporate dynamic retrievals into the system, allowing the classifier to decide that it needs more information about keywords it encountered during reading.

References

Lei Jimmy Ba, Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer normalization. CoRR, abs/1607.06450.

Samuel Bowman, Gabor Angeli, Christopher Potts, and Christopher Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, Hui Jiang, and Diana Inkpen. 2017. Enhanced LSTM for natural language inference. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1657–1668.

Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew Peters, Michael Schmitz, and Luke S. Zettlemoyer. 2017. AllenNLP: A deep semantic natural language processing platform. CoRR, abs/1803.07640.

Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel Bowman, and Noah A. Smith. 2018. Annotation artifacts in natural language inference data. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), Volume 2, pages 107–112.

Matthew Honnibal and Mark Johnson. 2015. An improved non-monotonic transition system for dependency parsing. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1373–1378.

Ankur Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. 2016. A decomposable attention model for natural language inference. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2249–2255.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. OpenAI Blog.

James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018. FEVER: a large-scale dataset for fact extraction and verification. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), Volume 1.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30, pages 5998–6008.

Yukun Zhu, Ryan Kiros, Richard S. Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 19–27.


Proceedings of the First Workshop on Fact Extraction and VERification (FEVER), pages 114–118, Brussels, Belgium, November 1, 2018. ©2018 Association for Computational Linguistics

Uni-DUE Student Team: Tackling fact checking through decomposable attention neural network

Jan Kowollik
University of Duisburg-Essen

[email protected]

Ahmet Aker
University of Duisburg-Essen

[email protected]

Abstract

In this paper we present our system for the FEVER Challenge. The task of this challenge is to verify claims by extracting information from Wikipedia. Our system has two parts. In the first part it searches for candidate sentences by treating the claim as a query. In the second part it filters out noise from these candidates and uses the remaining ones to decide whether they support or refute the claim, or do not provide enough information to verify it. We show that this system achieves a FEVER score of 0.3927 on the FEVER shared task development data set, which is a 25.5% improvement over the baseline score.

1 Introduction

In this paper we present our system for the FEVER Challenge.1 The FEVER Challenge is a shared task on fact extraction and claim verification. Initially, Thorne et al. (2018) created an annotated corpus of 185,445 claims and proposed a baseline system to predict the correct labels as well as the pieces of evidence for the claims.

Our system consists of two parts. In the first part we retrieve sentences that are relevant to a claim. The claim is used as a query and is submitted to the Lucene search API. The sentences found are candidates for pieces of evidence for the claim. In the second part we run a modified version of the Decomposable Attention network (Parikh et al., 2016) to predict the textual entailment between a claim and the candidate sentences found through searching, but also between the claim and all candidate sentences merged into one long text. This step gives us entailment probabilities. We also use a point system to filter out some noise (irrelevant sentences). Based on the remaining candidates we perform label prediction, i.e. whether the claim is

1http://fever.ai

supported, refuted, or there is not enough evidence. Our system achieves a FEVER score of 0.3927 on the FEVER shared task development data set, which is a 25.5% improvement over the baseline score.

2 Data

The data consists of two parts: the FEVER dataset and the Wikipedia dump (Thorne et al., 2018).

The Wikipedia dump contains over five million pages, but for each page only the first section was taken. The text on each page was split into sentences and each sentence was assigned an index. The page title is written using underscores between the individual words instead of spaces.

The FEVER data set contains the annotated claims that should be correctly predicted by the developed system. Each claim is annotated with one of three labels: SUPPORTS (verifiably true), REFUTES (verifiably false), and NOT ENOUGH INFO (not verifiable). For claims with the first two labels, the pieces of evidence are provided as a combination of Wikipedia page title and sentence index on that page.

The FEVER data set created by Thorne et al. (2018) is split into a training set with 145,449, a development set with 19,998, and a test set with 19,998 annotated claims. The development and test sets are balanced, while the training set has an approximately 16:6:7 split over the three labels. Each data set, as well as the Wikipedia dump, is available at the FEVER web page.2

3 System

We decided to adopt the general two-part structure of the baseline for our system, with a key difference. The first part takes the claim and finds candidate sentences that ideally have a high chance of

2http://fever.ai/data.html


being evidence for the claim. The second part determines the label and selects evidence sentences.

The baseline system uses the sentences found in the first part directly as evidence. In our system we only find candidate sentences in the first part and select the actual evidence sentences at the end of the second part. This allows us to operate on a larger number of sentences in the second part of the system and achieve higher recall.

3.1 Finding Candidate Sentences

The main idea of the first part of our system is to mimic human behavior when verifying a claim. If we take a claim about a person as an example, a human is likely to take just a few keywords, such as the person's name, and use them to search for the right Wikipedia page to find evidence. We mimic this behavior by first extracting a few keywords from the claim and using them to find candidate sentences in the Wikipedia dump.

Extracting Keywords

We use Named Entity Recognition (NER), Constituency Parsing, and Dependency Parsing to extract keywords from each claim. For NER we use the neural network model created by Peters et al. (2018). We use all found named entities as keywords. For the Constituency Parsing we use the neural network model created by Stern et al. (2017). We extract all NP-tagged phrases from the first two recursion layers as keywords, because we found that this mostly captures the subjects and objects of a sentence. These two neural networks both use the AllenNLP library (Gardner et al., 2018). For the Dependency Parsing we use the Stanford Dependency Parser (Chen and Manning, 2014). We extract all subject and object phrases as keywords.

The NE recognition is our main source for keyword extraction, while the other two systems provide additional keywords that either have not been found by the NER or that are not named entities in the first place. Examples of the keywords extracted from the claims shown in Table 2 are given in Table 1.

Indexing the Wikipedia Dump

After extracting the keywords we use the Lucene search API3 to find candidate sentences for each claim. Before searching with Lucene, the Wikipedia texts need to be indexed. We treat each sentence as a separate document and index it. We

3https://lucene.apache.org/core/

exclude sentences that are empty as well as those longer than 2000 characters.

For each sentence we also add the Wikipedia page title and make it searchable. For the title we replace all underscores with spaces to improve matching. In each sentence we replace the words He, She, It and They with the title of the Wikipedia page that the sentence was found on. When looking at an entire Wikipedia page it is obvious who or what these words refer to, but when searching individual sentences we do not have the necessary context available. We perform this replacement to provide more context.
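
A minimal sketch of this preprocessing step, independent of the actual Lucene indexing code (the record layout and field names are our assumptions):

import re

def prepare_sentence(page_title, sentence):
    # Resolve simple pronouns to the page title and expose a searchable,
    # space-separated title field alongside the sentence text.
    readable_title = page_title.replace("_", " ")
    resolved = re.sub(r"\b(He|She|It|They)\b", readable_title, sentence)
    return {"title": readable_title, "text": resolved}

print(prepare_sentence(
    "Northern_Isles",
    "They are a pair of archipelagos off the north coast of Scotland."))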

Searching for the Candidate Sentences

We use three types of queries to search for candidate sentences for a claim:

• Type 1: For each keyword we split the keyword phrase into individual words and create a query that searches within the Wikipedia page titles, requiring all the individual words to be found.

• Type 2: We split all keywords into individual words and combine them into one query searching within the Wikipedia page titles, to find sentences on pages where as many words as possible match the title of the page.

• Type 3: We combine all keywords as phrases into one query searching within the sentences, to find those sentences where as many keywords as possible match.

We limit the number of results to the two most relevant sentences for the first query type and 20 sentences for the other two queries, because the first query type issues one query per keyword while the other two issue only one query per claim. An example of the queries being generated is given in Table 3. If the same sentence is found twice we do not add it to the candidate list again. For each candidate sentence we add the Wikipedia page title at the beginning of the sentence if it does not already contain it somewhere.
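
As an illustration, the three query types could be written in Lucene's classic query syntax roughly as follows (the field names title and text, and the exact syntax, are our assumptions; the paper gives only the examples in Table 3):

def build_queries(keywords):
    # Returns (query type, query string, result limit) triples.
    queries = []
    # Type 1: one query per keyword; every word of the keyword must
    # appear in the page title (two most relevant hits kept).
    for kw in keywords:
        queries.append(("type1", "title:(" + " AND ".join(kw.split()) + ")", 2))
    # Type 2: all words from all keywords in one title query; pages whose
    # titles match more of the words score higher (20 hits kept).
    all_words = [w for kw in keywords for w in kw.split()]
    queries.append(("type2", "title:(" + " OR ".join(all_words) + ")", 20))
    # Type 3: whole keyword phrases matched against the sentence text (20 hits kept).
    phrases = ['"%s"' % kw for kw in keywords]
    queries.append(("type3", "text:(" + " OR ".join(phrases) + ")", 20))
    return queries

for q in build_queries(["Artificial intelligence", "concern"]):
    print(q)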

3.2 Making the Prediction

The second part of our system first processes the candidate sentences in three independent steps that can be run in parallel:

• We use a modified version of the Decomposable Attention neural network (Parikh et al.,


#   Named Entity Recognition    Constituency Parser                  Dependency Parser
1   Northern Isles, Scotland    The Northern Isles                   The Northern Isles
2   -                           Artificial intelligence, concern     Artificial intelligence, concern
3   Walk of Life                album, the highest grossing album    -

Table 1: Generated keywords from the three systems (see Table 2 for the claims). For the first claim the NE recognition correctly finds the two named entities, while the other two systems miss one entity and include an additional "The" in the keyword. The second claim has no named entities and the other systems correctly find the relevant parts. In the third example the named entity found by the NE recognition is disambiguated by the Constituency Parser.

#   Claim
1   The Northern Isles belong to Scotland.
2   Artificial intelligence raises concern.
3   Walk of Life (album) is the highest grossing album.

Table 2: Example claims used in Tables 1 and 3.

2016) to predict the textual entailment between each candidate sentence and its corresponding claim.

• We merge all candidate sentences of a claim into one block of text and predict the textual entailment between this block of text and the claim.

• We assign points to each candidate sentence based on POS-Tags.

Finally, our system combines the results in order to decide on the label and to predict the evidence sentences for each claim.

Textual Entailment

We started with the Decomposable Attention network (Parikh et al., 2016), which is also used in the baseline, except that we predict the textual entailment for each pair of candidate sentence and claim. We found that for long sentences the network has high attention in different parts of the sentence that semantically belong to different statements. Using the idea that long sentences often contain multiple statements, we made the following additions to the Decomposable Attention network.

We include an additional 2-dimensional convolution layer that operates on the attention matrix in the case of sufficiently long sentences. Based on our testing we decided on a single convolution layer with a kernel size of 12. The output of this convolution layer contains a different number of elements depending on the size of the attention matrix, which makes sense, as longer sentences can contain multiple statements. We use a CnnEncoder4 to convert this variable-length output into a fixed-length output. This is necessary in order to use the result of the convolution layer in a later step of the network, and it can be seen as selecting the correct statement from the available data. The output of the CnnEncoder is concatenated to the input of the aggregate step of the network. If either the claim or the candidate sentence is shorter than 12 tokens, we skip this additional step and concatenate a zero vector instead.
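
A rough PyTorch sketch of this extension, as we read the description (not the authors' code; the filter count and the adaptive pooling that stands in for AllenNLP's CnnEncoder are illustrative assumptions):

import torch
import torch.nn as nn

class AttentionConvFeature(nn.Module):
    # Convolve the claim-by-sentence attention matrix and reduce it to a
    # fixed-size vector that is concatenated to the aggregate-step input.
    def __init__(self, num_filters=32, out_hw=4, kernel_size=12):
        super().__init__()
        self.kernel_size = kernel_size
        self.conv = nn.Conv2d(1, num_filters, kernel_size)
        self.pool = nn.AdaptiveMaxPool2d(out_hw)   # fixed-length stand-in for CnnEncoder
        self.out_dim = num_filters * out_hw * out_hw

    def forward(self, attention):                  # (batch, claim_len, sent_len)
        batch, len_a, len_b = attention.shape
        if min(len_a, len_b) < self.kernel_size:
            return attention.new_zeros(batch, self.out_dim)  # too short: zero vector
        x = self.conv(attention.unsqueeze(1))
        x = self.pool(torch.relu(x))
        return x.flatten(1)

feature = AttentionConvFeature()
print(feature(torch.rand(2, 15, 18)).shape)        # torch.Size([2, 512])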

When predicting the textual entailment we do not reduce the probabilities to a final label immediately, but keep working with the probabilities in the final prediction (see the Final Prediction section).

Merge Sentences

For each claim we merge all the candidate sentences into one block of text, similarly to the baseline. We predict the textual entailment using our modified Decomposable Attention network. We found that the REFUTES label is predicted with very high accuracy. However, this is not the case for the other two labels. By including the results of this step we can improve the predictions for the REFUTES label, as shown in Table 5. Comparing that to the full result given in Table 4, we can see that about 29.3% of correct REFUTES predictions are due to this step.

Creating POS-Tags and Assigning Points

We use the Stanford POS-Tagger (Toutanova et al., 2003) to create POS-Tags for all candidate sentences and all claims. We found that the Stanford POS-Tagger only uses a single CPU core on our system, so we wrote a script that splits the file containing all claim or candidate sentences into multiple files. Then the script calls multiple POS-Tagger instances in parallel, one for each file. The results are then merged back into a single file.

4https://allenai.github.io/allennlp-docs/api/allennlp.modules.seq2vec_encoders.html


Query type   Query                                    Occurrence      Limit
Type 1       "Artificial" "intelligence"              must occur      2
Type 1       "concern"                                must occur      2
Type 2       "Artificial" "intelligence" "concern"    should occur    20
Type 3       "Artificial intelligence" "concern"      should occur    20

Table 3: Generated queries for claim 2 (see Table 2). Claim 2 has two keywords, one of which contains two words. For the Type 1 query we create two queries, where one query contains two separate words. For the Type 2 query we split all words and use them all in one query. For Type 3 we omit the split and use entire keyword phrases as the query.

        SUP    REF    NEI
SUP     3291   370    3005
REF     1000   3159   2507
NEI     1710   1142   3814

Table 4: Confusion matrix of the full system prediction. Columns are predictions and rows the true labels.

        SUP    REF     NEI
SUP     -8     +52     -44
REF     -2     +926    -924
NEI     -4     +152    -148

Table 5: Confusion matrix change due to including the merge feature. Columns are predictions and rows the true labels.

Using the generated POS-Tags we assign scores to the candidate sentences. First, each candidate sentence is assigned 5 different scores, one for each of the following POS-Tag categories: verbs, nouns, adjectives, adverbs and numbers. Each category score starts at 3 and is decreased by 1 for each word of the respective POS-Tag category that is in the claim but not in the candidate sentence. Duplicate words are considered only once. We do not allow the category scores to go negative. At the end the category scores are added together to create the final score, which can be a maximum of 15.
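
A sketch of this scoring scheme (the POS-tagged input format and the coarse tag prefixes are our assumptions):

CATEGORY_PREFIXES = ["VB", "NN", "JJ", "RB", "CD"]   # verbs, nouns, adjectives, adverbs, numbers

def sentence_score(claim_tags, sentence_tags):
    # claim_tags / sentence_tags: lists of (word, POS tag) pairs.
    # Each category starts at 3 and loses one point per claim word of
    # that category missing from the candidate sentence (floor at 0).
    sentence_words = {w.lower() for w, _ in sentence_tags}
    total = 0
    for prefix in CATEGORY_PREFIXES:
        missing = {w.lower() for w, t in claim_tags
                   if t.startswith(prefix) and w.lower() not in sentence_words}
        total += max(0, 3 - len(missing))
    return total                                      # at most 15

# Candidate sentences scoring 11 or fewer points are later down-weighted.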

Final Prediction

We create a matrix from the per-candidate-sentence textual entailment probabilities, with the three labels as columns and one row per candidate. We reduce all three probabilities of a candidate sentence if it received 11 or fewer points. The threshold of 11 was determined empirically on the development set. As shown in Figure 1, we are able to filter out most of the non-evidence sentences by looking only at candidate sentences whose point score is more than 11. The probabilities are reduced by multiplying them by 0.3, which always pushes them below the minimum highest probability of a non-filtered sentence (33.3%).

[Figure 1 omitted: histogram with x-axis = points (1–15), y-axis = percentage of candidate sentences (0–30%), separate bars for evidence and non-evidence sentences, and the filter/keep boundary between 11 and 12.]

Figure 1: Histogram of how many candidate sentences (y-axis) received how many points (x-axis) for the development set. 65.07% of non-evidence and 26.21% of evidence sentences get filtered with the threshold between 11 and 12.

Finally, we predict the label and decide on the evidence sentences. If the Merge Sentences step predicted REFUTES, then we use REFUTES as the final label. Otherwise we find the highest value in the matrix and select the column it appears in as the final label. We sort the matrix based on the column of the final label and select the top 5 candidate sentences as evidence.
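
A compact sketch of this decision rule (variable names and the NumPy formulation are ours):

import numpy as np

LABELS = ["SUPPORTS", "REFUTES", "NOT ENOUGH INFO"]

def final_prediction(probs, points, merge_label, penalty=0.3, threshold=11, top_k=5):
    # probs: (num_candidates, 3) entailment probabilities per sentence;
    # points: POS-based scores; merge_label: label predicted on the merged text.
    probs = probs.copy()
    probs[np.array(points) <= threshold] *= penalty   # down-weight weak candidates
    if merge_label == "REFUTES":
        label_idx = LABELS.index("REFUTES")
    else:
        label_idx = np.unravel_index(probs.argmax(), probs.shape)[1]
    order = np.argsort(-probs[:, label_idx])          # sort by the chosen column
    return LABELS[label_idx], order[:top_k].tolist()

probs = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]])
print(final_prediction(probs, points=[12, 10, 14], merge_label="SUPPORTS"))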

3.3 Training

For training the modified Decomposable Attention network we use the SNLI data set and the FEVER training set (Bowman et al., 2015; Thorne et al., 2018). For claims labeled as NOT ENOUGH INFO we first search for Wikipedia page titles that contain a word from the claim and then randomly choose one of the sentences on that page. If no Wikipedia page is found this way, we randomly select one. We concatenate the generated training data with the SNLI data set to create the final training data containing 849,426 claims.

4 Results

Our system achieves a FEVER score of 0.3927 on the shared task development set containing 19,998


                  Label     Recall    Score
All               0.5132    0.3581    0.3927
Unmodified DA     0.5170    0.3880    0.3909
Without Points    0.4545    0.1169    0.3665
Without Merge     0.4747    0.3294    0.3815

Table 6: Contribution of each feature. Label refers to the label accuracy, while Recall refers to the evidence recall.

claims. This is a 25.5% improvement over the baseline score of 0.3127 on the development set. The confusion matrix for the predicted labels is given in Table 4. It shows that the most incorrect predictions are made for the NOT ENOUGH INFO label, while the REFUTES label is predicted with the fewest errors.

For the test set our system generated 773,862 pairs of candidate sentences and claims. For only a single claim out of all 19,998 claims were no candidate sentences found.

For the development set, the candidate sentences found in the first part of our system include the actual evidence for 77.83% of the claims. In comparison, the baseline (Thorne et al., 2018) only finds 44.22% of the evidence. Our system finds 38.7 sentences per claim on average, while the baseline is limited to 5 sentences per claim.

When looking at how much each feature improves the final score in Table 6, we can see that the point system using POS-Tags results in the biggest improvement.

5 Conclusion

In this paper we have presented our system for the FEVER Challenge. While keeping the two-part structure of the baseline, we replaced the first part completely and heavily modified the second part to achieve a 25.5% FEVER score improvement over the baseline. In our immediate future work we will investigate alternative ways of obtaining higher recall in the first part, and also improve the textual entailment component to further reduce noise.

References

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. CoRR, abs/1508.05326.

Danqi Chen and Christopher Manning. 2014. A fast and accurate dependency parser using neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 740–750. Association for Computational Linguistics.

Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew E. Peters, Michael Schmitz, and Luke Zettlemoyer. 2018. AllenNLP: A deep semantic natural language processing platform. CoRR, abs/1803.07640.

Ankur P. Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. 2016. A decomposable attention model for natural language inference. arXiv preprint arXiv:1606.01933.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. ArXiv e-prints, 1802.05365.

Mitchell Stern, Jacob Andreas, and Dan Klein. 2017. A minimal span-based neural constituency parser. CoRR, abs/1705.03919.

James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018. FEVER: a large-scale dataset for fact extraction and verification. CoRR, abs/1803.05355.

Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics.


Proceedings of the First Workshop on Fact Extraction and VERification (FEVER), pages 119–123, Brussels, Belgium, November 1, 2018. ©2018 Association for Computational Linguistics

SIRIUS-LTG: An Entity Linking Approach to Fact Extraction and Verification

Farhad Nooralahzadeh and Lilja Øvrelid
Department of Informatics
University of Oslo, Norway

farhadno,[email protected]

Abstract

This article presents the SIRIUS-LTG system for the Fact Extraction and VERification (FEVER) Shared Task. It consists of three components: 1) Wikipedia Page Retrieval: First we extract the entities in the claim, then we find potential Wikipedia URI candidates for each of the entities using a SPARQL query over DBpedia. 2) Sentence selection: We investigate various techniques, i.e., Smooth Inverse Frequency (SIF), Word Mover's Distance (WMD), Soft-Cosine Similarity, and Cosine similarity with unigram Term Frequency Inverse Document Frequency (TF-IDF), to rank sentences by their similarity to the claim. 3) Textual Entailment: We compare three models for the task of claim classification. We apply a Decomposable Attention (DA) model (Parikh et al., 2016), a Decomposed Graph Entailment (DGE) model (Khot et al., 2018) and a Gradient-Boosted Decision Trees (TalosTree) model (Sean et al., 2017) for this task. The experiments show that the pipeline with simple Cosine Similarity using TF-IDF in sentence selection, along with the DA model as the labelling model, achieves the best results on the development set (F1 evidence: 32.17, label accuracy: 59.61 and FEVER score: 0.3778). Furthermore, it obtains 30.19, 48.87 and 36.55 in terms of F1 evidence, label accuracy and FEVER score, respectively, on the test set. Our system ranks 15th among 23 participants in the shared task prior to any human evaluation of the evidence.

1 Introduction

The Web contains vast amounts of data from many heterogeneous sources, and the harvesting of information from these sources can be extremely valuable for several domains and applications such as, for instance, business intelligence. The volume and variety of data on the Web are increasing at a very rapid pace, making their use and processing increasingly difficult. A large volume of information on the Web consists of unstructured text which contains facts about named entities (NE) such as people, places and organizations. At the same time, the recent evolution of publishing and connecting data over the Web, dubbed "Linked Data", provides a machine-readable and enriched representation of many of the world's entities, together with their semantic characteristics. These structured data sources are a result of the creation of large knowledge bases (KB) by different communities, which are often interlinked, as is the case of DBpedia (Lehmann et al., 2015)1, Yago2 (Suchanek et al., 2007) and FreeBase3 (Bollacker et al., 2008). This characteristic of the Web of data empowers both humans and computer agents to discover more concepts by easily navigating among the datasets, and can profitably be exploited in complex tasks such as information retrieval, question answering, knowledge extraction and reasoning.

Fact extraction from unstructured text is a task central to knowledge base construction. While this process is vital for many NLP applications, misinformation (false information) or disinformation (deliberately false information) from unreliable sources can provide false output and mislead the readers. Such risks could be properly managed by applying NLP techniques aimed at solving the task of fact verification, i.e., to detect and discriminate misinformation and prevent its propagation. The Fact Extraction and VERification (FEVER) shared task4 (Thorne et al., 2018) addresses both problems. In this work, we introduce a pipeline system for each phase of the FEVER shared task. In our pipeline, we first identify entities in a given claim, then we extract candidate Wikipedia pages for each of the entities, and the most similar sen-

1 dbpedia.org
2 www.mpi-inf.mpg.de/yago/
3 www.freebase.com
4 www.fever.ai


tences are obtained using a textual similarity measure. Finally, we label the claim with regard to the evidence sentences using a textual entailment technique.

2 System description

In this section, we describe our system, which consists of three components that solve the following three tasks: Wikipedia page retrieval, sentence selection, and textual entailment.

2.1 Wiki-page Retrieval

Each claim in the FEVER dataset contains a single piece of information about an entity that its original Wikipedia page describes. Therefore we first extract entities using the Stanford Named Entity Recognizer (StanfordNER) (Finkel et al., 2005). We observe that StanfordNER is sometimes unable to extract entity names in the claim due to limited contextual information, as in Example 1 below:

Example 1: A View to a Kill is an action movie.
NER: []
Noun-Phrases: [A View to a Kill, an action movie]

To tackle this problem, we also extract noun phrases using the parse tree of Stanford CoreNLP (Manning et al., 2014) and the longest multi-word expression that contains words with the first letter in upper case. This enables us to provide a wide range of potential entities for the retrieval process. We then retrieve a set of Wikipedia page candidates for an entity in the claim using a SPARQL (Prud'hommeaux and Seaborne, 2008) query over DBpedia, i.e. the structured version of Wikipedia.

The SPARQL query aids the retrieval process by providing a list of candidates to the subsequent system components, particularly when the claim is about a film, song, music album, band, etc. Listing 1 shows the query employed for the entity Meteora in Example 2 below, which outputs the resulting Wikipedia pages (Pages):

Example 2: Meteora is not a rock album.
Entity: [Meteora]
Pages: ['Meteora', 'Meteora (album)', 'Meteora Monastary', 'Meteora (Greek monasteries)', 'Meteora (film)']

The query retrieves all the Wikipedia pages which contain the entity mention in their title.

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX fn: <http://www.w3.org/2005/xpath-functions/#>

SELECT DISTINCT ?resource
WHERE {
  ?resource rdfs:label ?s .
  ?s <bif:contains> '"Meteora"' .
  FILTER (lang(?s) = "en")
  FILTER (fn:string-length(fn:substring-after(?resource, "http://dbpedia.org/resource/")) > 1)
  FILTER (regex(str(?resource), "http://dbpedia.org/resource")
          && !regex(str(?resource), "http://dbpedia.org/resource/File:")
          && !regex(str(?resource), "http://dbpedia.org/resource/Category:")
          && !regex(str(?resource), "http://dbpedia.org/resource/Template:")
          && !regex(str(?resource), "http://dbpedia.org/resource/List")
          && !regex(str(?resource), "(disambiguation)"))
}

Listing 1: SPARQL query to extract Wikipedia page candidates for an entity mention (e.g., Meteora)
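
For reference, such a query can be issued against the public DBpedia endpoint roughly as follows (a sketch using the SPARQLWrapper package; the disambiguation filters from Listing 1 are omitted for brevity, and the paper does not specify how the query was executed):

from SPARQLWrapper import SPARQLWrapper, JSON

def wiki_page_candidates(entity, endpoint="http://dbpedia.org/sparql"):
    # Retrieve DBpedia resources whose English label contains the entity mention.
    query = """
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT DISTINCT ?resource WHERE {
      ?resource rdfs:label ?s .
      ?s <bif:contains> '"%s"' .
      FILTER (lang(?s) = "en")
    }""" % entity
    sparql = SPARQLWrapper(endpoint)
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    return [b["resource"]["value"] for b in results["results"]["bindings"]]

print(wiki_page_candidates("Meteora")[:5])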

2.2 Sentence Selection

Given a set of Wikipedia page candidates, the similarity between the claim and the individual text lines on each page is obtained. We here experiment with several methods for computing this similarity:

Cosine Similarity using TF-IDF: Sentences are ranked by unigram TF-IDF similarity to the claim. We modified the FEVER baseline code to consider the candidate list from the Wiki-page retrieval component.

Soft-Cosine Similarity: Following the work of Charlet and Damnati (2017), we measure the similarity between the candidate sentences and the claim. This textual similarity measure relies on the introduction of a relation matrix into the classical cosine similarity between bags of words. The relation matrix is calculated using the word2vec representations of words.

Word Mover's Distance (WMD): WMD "measures the dissimilarity between two text documents as the minimum amount of distance that the embedded words of one document need to 'travel' to reach the embedded words of another document" (Kusner et al., 2015). Word2vec embeddings are used to calculate semantic distances between words in the embedding space.

Smooth Inverse Frequency (SIF): We also create sentence embeddings using the SIF weighting scheme (Arora et al., 2017) for the claim and the candidate sentences. Then we calculate the cosine similarity between these embedding vectors.
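
As one concrete example, the WMD variant can be computed with off-the-shelf tooling (a sketch using gensim and pre-trained word2vec vectors; the model file, tokenization, and example sentences are assumptions, not the authors' setup):

from gensim.models import KeyedVectors

# Any word2vec-format vectors work here; the path is illustrative.
vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin.gz", binary=True)

def wmd_rank(claim, sentences):
    # Rank candidate sentences by Word Mover's Distance to the claim
    # (smaller distance = more similar).
    claim_tokens = claim.lower().split()
    scored = [(vectors.wmdistance(claim_tokens, s.lower().split()), s)
              for s in sentences]
    return sorted(scored)

for dist, sent in wmd_rank("Meteora is not a rock album.",
                           ["Meteora is the second studio album by Linkin Park.",
                            "The Meteora monasteries are in Greece."]):
    print(round(dist, 3), sent)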

2.3 Entailment

The previous component provides the most similar sentences as an evidence set for each claim. In this component, the aim is to find out whether the selected sentences enable us to classify a claim as being either SUPPORTED, REFUTED or NOT ENOUGH INFO. In cases where multiple sentences are selected as evidence, their strings are concatenated prior to classification. If the set of selected sentences is empty for a specific claim, due to a failure to find a related Wiki-page, we simply assign NOT ENOUGH INFO as the entailment label. In order to solve the entailment task we experiment with the use of several existing textual entailment systems with somewhat different requirements and properties. We follow the instructions from the GitHub repositories of the three following models and investigate their performance on the FEVER textual entailment subtask:

Decomposable Attention (DA) model (Parikh et al., 2016): We used the publicly available DA model,5 which is trained on the FEVER shared task dataset. We asked the model to predict an inference label for each claim based on the evidence set provided by the sentence selection component.

Decomposed Graph Entailment (DGE) model: Khot et al. (2018) propose a decomposed graph entailment model that uses structure from the claim to calculate entailment probabilities for each node and edge in the graph structure and aggregates them for the final entailment computation. The original DGE model6 uses Open IE (Khot et al., 2017) tuples as a graph representation for the claim. However, it is mentioned that the model can use any graph with labeled edges. Therefore, we provide a syntactic dependency parse tree using the Stanford dependency parser (Manning et al., 2014), which outputs the Enhanced Universal Dependencies representation (Schuster and Manning, 2016), as a graph representation for the claim.

Gradient-Boosted Decision Trees model: We also experiment with the TalosTree model7 (Sean et al.,

5https://github.com/sheffieldnlp/fever-baselines

6https://github.com/allenai/scitail
7https://github.com/Cisco-Talos/fnc-1

                                    Evidence
Similarity                          Precision   Recall   F1
Cosine Similarity using TFIDF       21.14       67.24    32.17
Soft-Cosine Similarity              19.50       65.53    30.05
Word Mover's Distance (WMD)         18.24       59.29    27.90
Smooth Inverse Frequency (SIF)      15.19       50.33    23.33
FEVER Baseline                      -           -        17.18

Table 1: Evidence extraction results on the development set.

2017), which was the winning system in the Fake News Challenge (Pomerleau and Rao, 2017). The TalosTree model utilizes text-based features derived from the claim and the evidence, which are then fed into Gradient Boosted Trees to predict the relation between the claim and the evidence. The features used in the prediction model are word count, TF-IDF, sentiment and a singular-value decomposition feature in combination with word2vec embeddings.

3 Experiments

3.1 Dataset

The shared task (Thorne et al., 2018) provides an annotated dataset of 185,445 claims along with their evidence sets. The shared-task dataset is divided into 145,459, 19,998 and 19,998 train, development and test instances, respectively. The claims are generated from information extracted from Wikipedia. The Wikipedia dump (version June 2017) was processed with Stanford CoreNLP, and the claims were sampled from the introductory sections of approximately 50,000 popular pages.

3.2 Evaluation

In this section we evaluate our system on the two main subtasks of the shared task: I) evidence extraction (wiki-page retrieval and sentence selection) and II) entailment. Since the scoring formula in the shared task considers only the first 5 predicted evidence sentences, we choose the 5 most similar sentences in the sentence selection phase (Section 2.2).

3.2.1 Evidence Extraction

Initially, the impact of different similarity measures in sentence selection is evaluated. Table 1 shows the results of the various similarity measures described in Section 2 for the evidence extraction subtask on the development set.


Model        Label             Precision   Recall   F1      Label Accuracy   FEVER Score
DA           NOT ENOUGH INFO   46.00       10.00    17.00   50.61            37.78
             REFUTES           61.00       60.00    60.00
             SUPPORTS          46.00       82.00    59.00
DGE          NOT ENOUGH INFO   41.00        5.00     8.00   42.24            30.31
             REFUTES           62.00       30.00    41.00
             SUPPORTS          38.00       92.00    54.00
TalosTree    NOT ENOUGH INFO   28.00        1.00     3.00   44.93            31.54
             REFUTES           66.00       42.00    51.00
             SUPPORTS          40.00       92.00    55.00
FEVER baseline*                                             51.37            31.27

Table 2: Pipeline performance on the dev set with the sentence selection module. (*) In the FEVER baseline the label accuracy uses the annotated evidence instead of evidence from the evidence extraction module.

The results suggest that the simple cosine similarity using TF-IDF is the best performing method for the sentence selection component when compared to the other similarity techniques. With an F1-score of 32.17 it clearly outperforms the Soft-Cosine Similarity (F1 30.05), WMD (F1 27.90) and SIF (F1 23.22) measures. This component also clearly outperforms the FEVER baseline for this subtask (F1 17.18).

3.2.2 Entailment

This component is trained on pairs of annotated claims and evidence sets from the FEVER shared-task training dataset. We train two different models, i.e. DGE and TalosTree, and we utilize the pre-trained DA model. We evaluate classification accuracy on the development set, assuming that the evidence sentences are extracted in the evidence extraction phase with the best performing setup. The results are presented in Table 2 and show that the NOT ENOUGH INFO class is difficult to detect for all three models. Furthermore, the DA model achieves the best accuracy and FEVER score compared to the others. We also observe that the label accuracy has a significant impact on the total FEVER score.

3.2.3 Final System

The final system pipeline is established with the SPARQL query and cosine similarity using TF-IDF in the evidence extraction module, and with the decomposable attention model for the entailment subtask. Table 3 depicts the final submission results over the test set using our system.

                   Evidence                  FEVER
System             P        R        F1      Acc.    Score
Our System         19.19    70.72    30.19   48.87   36.55
FEVER Baseline     -        -        18.26   48.84   27.45

Table 3: Final system pipeline results on the test set.

4 Conclusion

We present our system for the FEVER shared task to extract evidence from Wikipedia and verify each claim w.r.t. the obtained evidence. We examine various configurations for each component of the system. The experiments demonstrate the effectiveness of the TF-IDF cosine similarity measure and decomposable attention on both the development and test datasets.

Our future work includes: 1) implementing a semi-supervised machine learning method for evidence extraction, and 2) investigating different neural architectures for the verification task.

References

Sanjeev Arora, Yingyu Liang, and Tengyu Ma. 2017. A simple but tough-to-beat baseline for sentence embeddings.

Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: A collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD '08, pages 1247–1250, New York, NY, USA. ACM.


Delphine Charlet and Geraldine Damnati. 2017. SimBow at SemEval-2017 Task 3: Soft-cosine semantic similarity between questions for community question answering. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 315–319. Association for Computational Linguistics.

Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, ACL '05, pages 363–370, Stroudsburg, PA, USA. Association for Computational Linguistics.

Tushar Khot, Ashish Sabharwal, and Peter Clark. 2017. Answering complex questions using open information extraction. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 311–316. Association for Computational Linguistics.

Tushar Khot, Ashish Sabharwal, and Peter Clark. 2018. SciTail: A textual entailment dataset from science question answering. In AAAI.

Matt J. Kusner, Yu Sun, Nicholas I. Kolkin, and Kilian Q. Weinberger. 2015. From word embeddings to document distances. In Proceedings of the 32nd International Conference on Machine Learning - Volume 37, ICML'15, pages 957–966. JMLR.org.

Jens Lehmann, Robert Isele, Max Jakob, Anja Jentzsch, Dimitris Kontokostas, Pablo N. Mendes, Sebastian Hellmann, Mohamed Morsey, Patrick van Kleef, Sören Auer, and Christian Bizer. 2015. DBpedia - a large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web Journal, 6(2):167–195.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pages 55–60.

Ankur Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. 2016. A decomposable attention model for natural language inference. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2249–2255. Association for Computational Linguistics.

Dean Pomerleau and Delip Rao. 2017. Fake news challenge. http://fakenewschallenge.org/.

Eric Prud'hommeaux and Andy Seaborne. 2008. SPARQL Query Language for RDF. W3C Recommendation. http://www.w3.org/TR/rdf-sparql-query/.

Sebastian Schuster and Christopher D. Manning. 2016. Enhanced English universal dependencies: An improved representation for natural language understanding tasks. In LREC.

Baird Sean, Sibley Doug, and Pan Yuxi. 2017. Talos targets disinformation with Fake News Challenge victory. http://blog.talosintelligence.com/2017/06/talos-fake-news-challenge. Accessed: 2018-07-01.

Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. 2007. Yago: a core of semantic knowledge. In Proceedings of the 16th International Conference on World Wide Web, WWW '07, pages 697–706, New York, NY, USA. ACM.

James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018. FEVER: a large-scale dataset for fact extraction and verification. In NAACL-HLT.


Proceedings of the First Workshop on Fact Extraction and VERification (FEVER), pages 124–126, Brussels, Belgium, November 1, 2018. ©2018 Association for Computational Linguistics

Integrating Entity Linking and Evidence Ranking for Fact Extraction and Verification

Motoki Taniguchi ∗

[email protected]

Tomoki Taniguchi ∗

[email protected]

Takumi [email protected]

Yasuhide [email protected]

Tomoko [email protected]

Fuji Xerox Co., Ltd.

Abstract

We describe here our system and results on the FEVER shared task. We prepared a pipeline system composed of a document selection, a sentence retrieval, and a recognizing textual entailment (RTE) component. A simple entity linking approach with text match is used as the document selection component; it identifies relevant documents for a given claim by using mentioned entities as clues. The sentence retrieval component selects relevant sentences as candidate evidence from the documents based on TF-IDF. Finally, the RTE component selects evidence sentences by ranking the sentences and classifies the claim simultaneously. The experimental results show that our system achieved a FEVER score of 0.4016 and outperformed the official baseline system.

1 Introduction

The increasing amount of textual information on the Web has created a demand for techniques to extract and verify facts. The Fact Extraction and VERification (FEVER) task (Thorne et al., 2018) focuses on verification of textual claims against evidence. In the FEVER shared task, a given claim is classified as SUPPORTED, REFUTED, or NOTENOUGHINFO (NEI). Evidence to justify a given claim is required for SUPPORTED or REFUTED claims. The evidence is not given and must be retrieved from Wikipedia.

This paper describes our participating system in the FEVER shared task. The architecture of our system is designed by following the official baseline system (Thorne et al., 2018). There are two

∗Authors contributed equally

main differences between our system and the baseline system. The first one is identifying documents that contain evidence by using text match between entities mentioned in a given claim and Wikipedia page titles. The details are described in Section 2.1. The second one is a neural network based model, described in Section 2.3, that selects evidence sentences as a ranking task and classifies the claim simultaneously.

2 System

We propose a pipeline system composed of a document selection, a sentence retrieval, and a recognizing textual entailment (RTE) component. A simple entity linking approach with text match is used as the document selection component. This component identifies relevant documents for a given claim by using mentioned entities as clues. The sentence retrieval component selects relevant sentences as candidate evidence from the documents based on Term Frequency-Inverse Document Frequency (TF-IDF). Finally, the RTE component selects evidence sentences by ranking the candidate sentences and simultaneously classifies the claim as SUPPORTED, REFUTED, or NOTENOUGHINFO. Details of the components are described in the following sections.

2.1 Document selection

Wikipedia pages of entities mentioned in a claim can be good candidate documents containing the SUPPORTED/REFUTED evidence. Therefore, we use a simple but efficient entity linking approach as the document selection component. In our entity linking approach, relevant documents are retrieved by using exact match between Wikipedia page titles and words in a claim. We expect


this component to select only surely correct documents. In other words, we decided to prefer evidence precision over recall. In fact, our preliminary experiment indicates that 68% of claims excluding NEI in the development set can be fully supported or refuted by the documents retrieved with our approach. This corresponds roughly to the accuracy of the 10 nearest documents retrieved by the DrQA (Chen et al., 2017) based retrieval approach used in the baseline system. The average number of selected documents in our approach is 3.7, and thus our approach is more efficient than the baseline system.
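
A minimal sketch of this kind of exact title matching (how titles and claims are normalized is our assumption):

import re

def match_titles(claim, page_titles):
    # Return Wikipedia pages whose (underscore-separated) title appears
    # verbatim, on word boundaries, in the claim.
    selected = []
    for title in page_titles:
        readable = title.replace("_", " ")
        if re.search(r"\b" + re.escape(readable) + r"\b", claim):
            selected.append(title)
    return selected

print(match_titles("Barack Obama was born in Hawaii.",
                   ["Barack_Obama", "Obama", "Hawaii", "Chicago"]))
# ['Barack_Obama', 'Obama', 'Hawaii']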

2.2 Sentence retrieval

Following the baseline system, we use a sentence retrieval component which returns the K nearest sentences for a claim using cosine similarity between unigram and bigram TF-IDF vectors. The K nearest sentences are retrieved from the documents selected by the document selection component. We selected the optimal K using grid search over {5, 10, 15, 20, 50} in terms of the performance of the full pipeline system on the development set. The optimal value was K = 15.
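
A sketch of such a retrieval step with scikit-learn (the analyzer settings and toy sentences are illustrative; the baseline's own TF-IDF implementation differs in detail):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def k_nearest_sentences(claim, sentences, k=15):
    # Rank candidate sentences by cosine similarity of unigram+bigram
    # TF-IDF vectors against the claim and return the top k.
    vectorizer = TfidfVectorizer(ngram_range=(1, 2))
    matrix = vectorizer.fit_transform([claim] + sentences)
    sims = cosine_similarity(matrix[0], matrix[1:]).ravel()
    top = sims.argsort()[::-1][:k]
    return [(sentences[i], float(sims[i])) for i in top]

candidates = ["Barack Obama was born in Honolulu, Hawaii.",
              "Honolulu is the capital of Hawaii.",
              "Chicago is a city in Illinois."]
print(k_nearest_sentences("Barack Obama was born in Hawaii.", candidates, k=2))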

2.3 Recognizing textual entailment

As the RTE component, we adopt the DEISTE (Deep Explorations of Inter-Sentence interactions for Textual Entailment) model, which is the state of the art in RTE tasks (Yin et al., 2018). The RTE component is trained on labeled claims paired with sentence-level evidence. To build the model, we utilize the NEARESTP dataset described in Thorne et al. (2018). In cases where multiple sentences are required as evidence, the texts of the sentences are concatenated. We use Adam (Kingma and Ba, 2014) as the optimizer and utilize the 300-dimensional GloVe vectors adopted by the baseline system. The other model parameters are the same as those described in Yin et al. (2018).

Claims labelled as NEI are easier to predict correctly than SUPPORTED and REFUTED because, unlike SUPPORTED and REFUTED, NEI does not need evidence. Therefore, our RTE component is designed to predict a claim as NEI if the model cannot predict it as SUPPORTED or REFUTED with high confidence. The RTE prediction process is composed of three steps. First, we calculate the probability score of each label for pairs of a claim and a candidate sentence using the DEISTE model. Second, we decide a prediction label using the following equations:

$$SR = \underset{s \in S,\, a \in A}{\arg\max}\ P_{s,a}$$

$$P_{\max} = \max_{s \in S,\, a \in A} P_{s,a}$$

$$\mathrm{Label}_{pred} = \begin{cases} SR & (P_{\max} > P_t) \\ \mathrm{NEI} & \text{(otherwise)} \end{cases}$$

where $S$ is the set of pairs of a claim and a candidate sentence; $A = \{\text{SUPPORTED}, \text{REFUTED}\}$; $P_{s,a}$ is the probability score of pair $s$ for label $a$; $P_t$ is a threshold value; and $\mathrm{Label}_{pred}$ is the predicted label for a claim. Finally, we sort the candidate sentences in descending order of their scores and select at most 5 evidence sentences with the same label as the predicted label. We also apply grid search to find the best threshold $P_t$, setting it to 0.93.
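
In code, one reading of this decision and selection rule looks as follows (a sketch; the score container and names are ours, not the authors' implementation):

def predict_label(pair_scores, p_t=0.93, top_k=5):
    # pair_scores: list of (sentence_id, {"SUPPORTED": p, "REFUTED": p, ...}).
    # Take the best SUPPORTED/REFUTED score over all pairs; fall back to
    # NEI when no pair clears the confidence threshold p_t.
    best_label, p_max, scored = "NEI", 0.0, []
    for sid, probs in pair_scores:
        for label in ("SUPPORTED", "REFUTED"):
            scored.append((probs[label], label, sid))
            if probs[label] > p_max:
                p_max, best_label = probs[label], label
    if p_max <= p_t:
        return "NEI", []
    # Evidence: up to top_k sentences, ranked by their score for the chosen label.
    ranked = sorted((p, s) for p, l, s in scored if l == best_label)[::-1]
    return best_label, [s for _, s in ranked[:top_k]]

print(predict_label([(0, {"SUPPORTED": 0.95, "REFUTED": 0.02}),
                     (1, {"SUPPORTED": 0.40, "REFUTED": 0.10})]))
# ('SUPPORTED', [0, 1])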

3 Evaluation

3.1 Dataset

We used the official training dataset for training the RTE component. For parameter tuning and performance evaluation, we used the development and test datasets used in Thorne et al. (2018). Table 1 shows statistics of each dataset.

               SUPPORTED   REFUTED   NEI
Training       80,035      29,775    35,639
Development    3,333       3,333     3,333
Test           3,333       3,333     3,333

Table 1: The number of claims in each dataset.

3.2 In-house Experiment

We evaluated our system and the baseline system on the test dataset with FEVER score, label accuracy, evidence precision, evidence recall and evidence F1. FEVER score is the classification accuracy of claims, with credit given only if the correct evidence is selected. Label accuracy is the classification accuracy of claims if the requirement for correct evidence is ignored. Table 2 shows the evaluation results on the test dataset. Our system achieved a FEVER score of 0.4016 and outperformed the baseline system. As expected, our system produced a significant improvement of 59 points in evidence precision over the baseline system. Though evidence recall decreased, evidence F1 increased by 17 points compared to the baseline system.

Table 3 shows the confusion matrix on the development dataset. Even though our model


            FEVER Score   Label Accuracy   Evidence Precision   Evidence Recall   Evidence F1
Baseline    0.2807        0.5060           0.1084               0.4599            0.1755
Ours        0.4016        0.4851           0.6986               0.2265            0.3421

Table 2: Evaluation results on the test dataset.

Actual class \ Predicted class   SUPPORTED   REFUTED   NEI    Total
SUPPORTED                        929         181       2223   3333
REFUTED                          104         1331      1898   3333
NEI                              363         300       2670   3333
Total                            1396        1812      6791   9999

Table 3: Confusion matrix on the development dataset.

        FEVER Score   Label Accuracy   Evidence F1
Ours    0.3881        0.4713           0.1649

Table 4: Final results of our submission.

tends to predict claims as NEI, the precisions of SUPPORTED (929/1396 = 0.67) and REFUTED (1331/1812 = 0.73) are higher than the precision of NEI (2670/6791 = 0.39).

3.3 Submission run

Table 4 presents the evaluation results of our submission. The model showed behavior similar to the in-house experiment, except for evidence F1. Our submission was ranked in 9th place.

4 Conclusion

We developed a pipeline system composed of a document selection, a sentence retrieval, and an RTE component for the FEVER shared task. Evaluation results of the in-house experiment show that our system achieved an improvement of 12% in FEVER score over the baseline system.

Even though the document selection component of our system contributed to finding more correct evidence documents, the component was too strict, and thus degraded evidence recall. Therefore, as future work, we plan to explore a more sophisticated entity linking approach.

References

Danqi Chen, Adam Fisch, Jason Weston, and AntoineBordes. 2017. Reading wikipedia to answer open-domain questions. In Proceedings of the 55th An-nual Meeting of the Association for ComputationalLinguistics (Volume 1: Long Papers), pages 1870–1879. Association for Computational Linguistics.

Diederik P Kingma and Jimmy Ba. 2014. Adam: Amethod for stochastic optimization. arXiv preprintarXiv:1412.6980.

James Thorne, Andreas Vlachos, ChristosChristodoulopoulos, and Arpit Mittal. 2018.Fever: a large-scale dataset for fact extractionand verification. In Proceedings of the 2018Conference of the North American Chapter of theAssociation for Computational Linguistics: HumanLanguage Technologies, Volume 1 (Long Papers),pages 809–819. Association for ComputationalLinguistics.

Wenpeng Yin, Dan Roth, and Hinrich Schütze. 2018. End-task oriented textual entailment via deep explorations of inter-sentence interactions. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 540–545. Association for Computational Linguistics.




Robust Document Retrieval and Individual Evidence Modeling for Fact Extraction and Verification

Tuhin Chakrabarty and Tariq Alhindi
Department of Computer Science, Columbia University

Smaranda Muresan
Department of Computer Science and Data Science Institute, Columbia University

Abstract

This paper presents the ColumbiaNLP submission for the FEVER Workshop Shared Task. Our system is an end-to-end pipeline that extracts factual evidence from Wikipedia and infers a decision about the truthfulness of the claim based on the extracted evidence. Our pipeline achieves significant improvement over the baseline for all the components (Document Retrieval, Sentence Selection and Textual Entailment), both on the development set and the test set. Our team finished 6th out of 24 teams on the leader-board based on the preliminary results, with a FEVER score of 49.06 on the blind test set compared to 27.45 for the baseline system.

1 Introduction and Background

Fact checking is a type of investigative journalism where experts examine the claims published by others for their veracity. The claims can range from statements made by public figures to stories reported by other publishers. The end goal of a fact checking system is to provide a verdict on whether the claim is true, false, or mixed. Several organizations such as FactCheck.org and PolitiFact are devoted to such activities.

The FEVER shared task aims to evaluate the ability of a system to verify information using evidence from Wikipedia. Given a claim involving one or more entities (mapping to Wikipedia pages), the system must extract textual evidence (sets of sentences from Wikipedia pages) that supports or refutes the claim and then, using this evidence, label the claim as Supported, Refuted or NotEnoughInfo. The dataset for the shared task was introduced by Thorne et al. (2018) and consists of 185,445 claims. Table 1 shows three instances from the dataset with the claim, the evidence and the verdict.

Claim: Fox 2000 Pictures released the film Soul Food. [wiki/Soul Food (film)]
Evidence: Soul Food is a 1997 American comedy-drama film produced by Kenneth "Babyface" Edmonds, Tracey Edmonds and Robert Teitel and released by Fox 2000 Pictures.
Verdict: SUPPORTS

Claim: Murda Beatz's real name is Marshall Mathers. [wiki/Murda Beatz]
Evidence: Shane Lee Lindstrom (born February 11, 1994), known professionally as Murda Beatz, is a Canadian hip hop record producer and songwriter from Fort Erie, Ontario.
Verdict: REFUTES

Claim: L.A. Reid has served as the CEO of Arista Records for four years. [wiki/L.A. Reid]
Evidence: He has served as the chairman and CEO of Epic Records, a division of Sony Music Entertainment, the president and CEO of Arista Records, and the chairman and CEO of the Island Def Jam Music Group.
Verdict: NOT ENOUGH INFO

Table 1: Examples of claims, the extracted evidence from Wikipedia and the verdicts from the shared task dataset (Thorne et al., 2018)

The baseline system described by Thorne et al. (2018) uses three major components:

• Document Retrieval: Given a claim, identify relevant documents from Wikipedia which contain the evidence to verify the claim. Thorne et al. (2018) used the document retrieval component from the DrQA system (Chen et al., 2017), which returns the k nearest documents for a query using cosine similarity between binned unigram and bigram TF-IDF vectors.

• Sentence Selection: Given the set of retrieved documents, identify the candidate evidence sentences. Thorne et al. (2018) used a modified document retrieval component of



DrQA (Chen et al., 2017) to select the top most similar sentences w.r.t. the claim, using bigram TF-IDF with binning.

• Textual Entailment: For the entailment task, training is done using labeled claims paired with evidence (labels are SUPPORTS, REFUTES, NOT ENOUGH INFO). Thorne et al. (2018) used the decomposable attention model (Parikh et al., 2016) for this task. For the case where multiple sentences are required as evidence, the strings were concatenated.

Our system implements changes in all three modules (Section 2), which leads to significant improvements both on the development and test sets. On the shared task development set our document retrieval approach covers 94.4% of the claims requiring evidence, compared to 55.30% in the baseline. Further, on the dev set our evidence recall is improved by 33 points over the baseline. For entailment, our model improves the baseline by 7.5 points on the dev set. Overall, our end-to-end system shows an improvement of 19.56 in FEVER score compared to the baseline (50.83 vs. 31.27) on the dev set. On the blind test set we achieve an evidence recall of 75.89 and an entailment accuracy of 57.45 (9 points above baseline), resulting in a FEVER score of 49.06 (Section 3). Together with the results we discuss some lessons learned based on our error analysis and release our code.1

2 Methods

2.1 Document Retrieval

Document retrieval is a crucial step when building an end-to-end system for fact extraction and verification. Missing a relevant document could lead to missed evidence, while non-relevant documents would add noise for the subsequent tasks of sentence selection and textual entailment. We propose a multi-step approach for retrieving documents relevant to the claims.

• Google Custom Search API: Wang et al. (2018) retrieved relevant documents for fact-checking articles by generating candidates via search. Inspired by this, we first use Google's Custom Search API to retrieve documents with information about the claim.

1https://github.com/tuhinjubcse/FEVER-EMNLP

We add the token wikipedia to the claim, issue a query, and collect the top 2 results.

• Named Entity Recognition: Second, we use the AllenNLP (Gardner et al., 2017) pre-trained bidirectional language model (Peters et al., 2017) for named entity recognition.2 After finding the named entities in the claim, we use the Wikipedia python API3 to collect the top Wikipedia document returned by the API for each named entity.

• Dependency Parse: Third, to increase the chance of detecting relevant entities in the claim, we find the first lower-case verb phrase (VP) in the dependency parse tree and query the Wikipedia API with all the tokens before the VP. The reason for emphasizing the lower-case verb phrase is to avoid missing entities in claims such as "Finding Dory was directed by X", where the relevant entity is "Finding Dory".

To deal with entity ambiguity, we also add the token film to our query when the claim contains keywords such as film, stars, premiered and directed by. For example, in "Marnie was directed by Whoopi Goldberg.", Marnie can refer to both Wikipedia pages Marnie (film) and Marnie. Our page of interest here is Marnie (film). We only experimented with film to capture the performance gains. One of our future goals is to build better computational models to handle entity ambiguity or entity linking.

• Combined: We use the union of the documents returned by the three approaches as the final set of relevant documents to be used by the Sentence Selection module (a sketch of this combination step is given below).
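A minimal sketch of the combination step follows. The callables google_search, extract_named_entities and tokens_before_vp are hypothetical placeholders for the Google Custom Search, NER and dependency-parse components described above; only the wikipedia package's search call is a real API.

import wikipedia  # the "Wikipedia python API" package referenced above

def retrieve_documents(claim, google_search, extract_named_entities, tokens_before_vp):
    docs = set()
    # 1. Google Custom Search restricted to Wikipedia, top-2 results.
    docs.update(google_search(claim + " wikipedia", num_results=2))
    # 2. Top Wikipedia page for every named entity in the claim.
    for entity in extract_named_entities(claim):
        docs.update(wikipedia.search(entity, results=1))
    # 3. Query with all tokens before the first lower-case verb phrase.
    prefix = " ".join(tokens_before_vp(claim))
    if prefix:
        docs.update(wikipedia.search(prefix, results=1))
    return docs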

Method                  Avg k  Coverage
Google API              2      79.5%
NER                     2      77.1%
Dependency Parse        1      80.0%
Combined                3      94.4%
(Thorne et al., 2018)   5      55.3%

Table 2: Coverage of claims that can be fully supported or refuted by the retrieved documents (dev set)

Table 2 shows the percentage of claims that can be fully supported or refuted by the retrieved documents before sentence selection on the dev set.

2 http://demo.allennlp.org/named-entity-recognition
3 https://pypi.org/project/wikipedia/



We see that our best approach (combined) achieves a high coverage of 94.4%, compared to 55.3% for the baseline (Thorne et al., 2018). Because we do not have the gold evidence for the blind test set, we cannot report the claim coverage of our pipeline on it.

2.2 Sentence Selection

For sentence selection, we used the modified document retrieval component of DrQA (Chen et al., 2017) to select sentences using bigram TF-IDF with binning, as proposed by Thorne et al. (2018). We extract the top 5 most similar sentences from the k most relevant documents using TF-IDF vector similarity. Our evidence recall is 78.4, compared to 45.05 on the development set of FEVER (Thorne et al., 2018), which demonstrates the importance of document retrieval in fact extraction and verification. On the blind test set our sentence selection approach achieves an evidence recall of 75.89.

However, even though TF-IDF proves to be a strong baseline for sentence selection, we noticed on the dev set that using all 5 evidence sentences together introduced additional noise to the entailment model. To address this, we further filtered the top 3 evidence sentences from the selected 5 using distributed semantic representations. Peters et al. (2018) show how deep contextualized word representations model both complex characteristics of word use (e.g., syntax and semantics) and usage across various linguistic contexts. Thus, we used ELMo embeddings to convert the claim and evidence to vectors. We then calculated the cosine similarity between the claim and evidence vectors and extracted the top 3 sentences based on the score. Because there was no penalty for poor evidence precision, we returned all five selected sentences as our predicted evidence but used only the top three sentences for the entailment model.
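A rough sketch of this re-ranking step, assuming a hypothetical embed_tokens function that maps a token list to a (num_tokens, dim) array of contextual (e.g. ELMo) embeddings; sentence vectors are obtained here by simple averaging, which is one of several reasonable pooling choices.

import numpy as np

def top_k_by_embedding(claim_tokens, candidate_token_lists, embed_tokens, k=3):
    # Average token embeddings into a single sentence vector.
    def sent_vec(tokens):
        return embed_tokens(tokens).mean(axis=0)

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    claim_vec = sent_vec(claim_tokens)
    scored = [(cosine(claim_vec, sent_vec(toks)), i)
              for i, toks in enumerate(candidate_token_lists)]
    # Indices of the k candidates most similar to the claim.
    return [i for _, i in sorted(scored, reverse=True)[:k]]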

2.3 Textual Entailment

The final stage of our pipeline is recognizing textual entailment. Unlike Thorne et al. (2018), we did not concatenate evidence sentences, but trained our model on each claim-evidence pair. For recognizing textual entailment we used the model introduced by Conneau et al. (2017) in their work on supervised learning of universal sentence representations.

Figure 1: The architecture for recognizing textual entailment (adapted from Conneau et al. (2017))

The architecture is presented in Figure 1. We use bidirectional LSTMs (Hochreiter and Schmidhuber, 1997) with max-pooling to encode the claim and the evidence. The text encoder provides a dense feature representation of an input claim or evidence. Formally, for a sequence of $T$ words $w_{t=1,\dots,T}$, the BiLSTM layer generates a sequence of $h_t$ vectors, where $h_t$ is the concatenation of a forward and a backward LSTM output. The hidden vectors $h_t$ are then converted into a single vector using max-pooling, which chooses the maximum value over each dimension of the hidden units. Overall, the text encoder can be treated as an operator $\text{Text} \rightarrow \mathbb{R}^d$ that provides a $d$-dimensional encoding for a given text.

Out-of-vocabulary issues in pre-trained word embeddings are a major bottleneck for sentence representations. To address this we use fastText embeddings (Bojanowski et al., 2017), which rely on subword information. Also, these embeddings were trained on a Wikipedia corpus, making them an ideal choice for this task.

As shown in Figure 1, the shared sentence encoder outputs a representation for the claim $u$ and the evidence $v$. Once the sentence vectors are generated, the following three methods are applied to extract relations between the claim and the evidence: (i) concatenation of the two representations $(u, v)$; (ii) element-wise product $u * v$; and (iii) absolute element-wise difference $|u - v|$. The resulting vector, which captures information from both the claim and the evidence, is fed into a 3-class classifier consisting of fully connected layers culminating in a softmax layer.
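The feature-combination step can be sketched as follows (PyTorch, with illustrative dimensions; enc_dim=1024 corresponds to a BiLSTM with 512 hidden units per direction and is an assumption rather than a statement of the exact configuration).

import torch
import torch.nn as nn

class ClaimEvidenceClassifier(nn.Module):
    # Combines sentence vectors u (claim) and v (evidence) as (u, v), u*v and |u-v|,
    # then classifies into 3 labels with an MLP ending in a softmax.
    def __init__(self, enc_dim=1024, hidden=512, num_classes=3):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(4 * enc_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, u, v):
        features = torch.cat([u, v, u * v, torch.abs(u - v)], dim=-1)
        return torch.log_softmax(self.mlp(features), dim=-1)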

For the final class label, we first experimented with taking the majority prediction of the three



(claim, evidence) pairs as our entailment label, but this led to lower accuracy on the dev set. So our final predictions are based on the rule outlined in Algorithm 1, where SUPPORTS = S, REFUTES = R, NOT ENOUGH INFO = N, and C is a count function. Because the selected evidence sentences were inherently noisy and our pipeline did not concatenate them, we chose this rule over majority prediction to mitigate the dominance of the NOT ENOUGH INFO class.

Algorithm 1 Prediction Rule
if C(S) = 1 and C(N) = 2 then
    label = S
else if C(R) = 1 and C(N) = 2 then
    label = R
else
    label = argmax(C(S), C(R), C(N))

We also experimented with a classifier that takes the confidence scores of all three claim-evidence pairs along with their positions in the document; we trained a boosted tree classifier, but the accuracy did not improve. Empirically, the rule gave us the best results on the dev set, and we thus used it to obtain the final label.
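The rule of Algorithm 1 amounts to the following small function, shown here as an illustrative sketch over the per-pair labels ('S', 'R', 'N') of the three claim-evidence pairs.

from collections import Counter

def final_label(pair_labels):
    counts = Counter(pair_labels)
    s, r, n = counts["S"], counts["R"], counts["N"]
    if s == 1 and n == 2:
        return "S"
    if r == 1 and n == 2:
        return "R"
    # Otherwise fall back to the most frequent label.
    return max([("S", s), ("R", r), ("N", n)], key=lambda x: x[1])[0]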

Table 3 shows the 3-way classification accuracy using the textual entailment model described above.

Dataset           Accuracy
Shared Task Dev   58.77
Blind Test Set    57.45

Table 3: 3-way classification results

Our entailment accuracy on the shared task dev and test set is 7 and 9 points better than the baseline, respectively.

Implementation Details. The batch size is 64. The model is trained for 15 epochs using the Adam optimizer with a learning rate of 0.001. The size of the LSTM hidden units is set to 512, and for the classifier we use an MLP with one hidden layer of 512 hidden units. The embedding dimension of the words is set to 300.

3 End to End Results and Error Analysis

Table 4 shows the overall FEVER score obtained by our pipeline on the dev and test sets. In the provisional ranking our system ranked 6th.

On closer investigation, we find that neither TF-IDF nor sentence-embedding-based approaches are perfect when it comes to sentence selection, although TF-IDF works better.

Data   Pipeline                FEVER
DEV    (Thorne et al., 2018)   31.27
DEV    Ours                    50.83
TEST   (Thorne et al., 2018)   27.45
TEST   Ours                    49.06

Table 4: FEVER scores on shared task dev and test set

Claim: Fox 2000 Pictures released the film Soul Food
Evidence: Soul Food is a 1997 American comedy-drama film produced by Kenneth "Babyface" Edmonds, Tracey Edmonds and Robert Teitel and released by Fox 2000 Pictures
Cosine similarity: 0.29

Table 5: Cosine similarity between claim and supporting evidence

Table 5 shows that we cannot rely on models that depend entirely on semantics. In spite of the two sentences being similar, the cosine similarity between them is low, mostly because the evidence contains a lot of extra information which might not be relevant to the claim and is difficult for the model to handle.

At seventeen or eighteen years of age, he joined Plato's Academy in Athens and remained there until the age of thirty-seven (c. 347 BC)
Shortly after Plato died, Aristotle left Athens and, at the request of Philip II of Macedon, tutored Alexander the Great beginning in 343 BC

Table 6: The top evidence is selected by annotators and the bottom evidence by our pipeline

We also found instances where the predicted evidence is correct but does not match the gold evidence. For the claim "Aristotle spent time in Athens", both evidence sentences in Table 6 support it, but our system is still penalized for not matching the gold evidence.

We found quite a few annotations to be incorrect, and hence the FEVER scores are lower than expected. Table 7 shows two instances where the gold label for the claim was NOT ENOUGH INFO, while in fact it should have been SUPPORTS and REFUTES, respectively.

Table 8 reflects the fact that NOT ENOUGH INFO is often hard to predict, and that is where our model needs to improve most. The lines between SUPPORTS and NOT ENOUGH INFO are often blurred, as shown in Table 8.



Claim: Natural Born Killers was directed by Oliver Stone
Evidence: Natural Born Killers is a 1994 American satirical crime film directed by Oliver Stone and starring Woody Harrelson, Juliette Lewis, Robert Downey Jr., Tom Sizemore, and Tommy Lee Jones.

Claim: Anne Rice was born in New Jersey
Evidence: Born in New Orleans, Rice spent much of her early life there before moving to Texas, and later to San Francisco

Table 7: Wrong gold label (NOT ENOUGH INFO)

     S     N     R
S    4635  1345  686
N    2211  3269  1186
R    1348  1470  3848

Table 8: Confusion matrix of entailment predictions on shared task dev set

Our models need a better understanding of semantics to be able to identify these cases. Table 9 shows one such example, where the gospel keyword becomes the discriminative factor.

Claim: Happiness in Slavery is a gospel song by Nine Inch Nails
Evidence: Happiness in Slavery is a song by American industrial rock band Nine Inch Nails from their debut extended play (EP), Broken (1992)

Table 9: Example where our model predicts SUPPORTS for a claim labeled as NOT ENOUGH INFO

4 Conclusion

The FEVER shared task is challenging primarily because the annotation requires substantial manual effort. We presented an end-to-end pipeline to automate this human effort and showed empirically that our model outperforms the baseline by a large margin. We also provided a thorough error analysis, which highlights some of the shortcomings of our models and potentially of the gold annotations.

References

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to answer open-domain questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1870–1879.

Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 670–680, Copenhagen, Denmark. Association for Computational Linguistics.

Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson Liu, Matthew Peters, Michael Schmitz, and Luke Zettlemoyer. 2017. AllenNLP: A deep semantic natural language processing platform.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Ankur Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. 2016. A decomposable attention model for natural language inference. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2249–2255, Austin, Texas. Association for Computational Linguistics.

Matthew E. Peters, Waleed Ammar, Chandra Bhagavatula, and Russell Power. 2017. Semi-supervised sequence tagging with bidirectional language models. In ACL.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proc. of NAACL.

James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018. FEVER: a large-scale dataset for fact extraction and verification. In Proceedings of NAACL-HLT 2018, pages 809–819.

Xuezhi Wang, Cong Yu, Simon Baumgartner, and Flip Korn. 2018. Relevant document discovery for fact-checking articles. In Companion Proceedings of The Web Conference 2018 (WWW '18 Companion), pages 525–533.




DeFactoNLP: Fact Verification using Entity Recognition, TFIDF Vector Comparison and Decomposable Attention

Aniketh Janardhan Reddy*
Machine Learning Department, Carnegie Mellon University, Pittsburgh, USA

Gil Rocha
LIACC/DEI, Faculdade de Engenharia, Universidade do Porto, Portugal

Diego Esteves
SDA Research, University of Bonn, Bonn, Germany

Abstract

In this paper, we describe DeFactoNLP1, the system we designed for the FEVER 2018 Shared Task. The aim of this task was to conceive a system that can not only automatically assess the veracity of a claim but also retrieve evidence supporting this assessment from Wikipedia. In our approach, the Wikipedia documents whose Term Frequency-Inverse Document Frequency (TFIDF) vectors are most similar to the vector of the claim and those documents whose names are similar to those of the named entities (NEs) mentioned in the claim are identified as the documents which might contain evidence. The sentences in these documents are then supplied to a textual entailment recognition module. This module calculates the probability of each sentence supporting the claim, contradicting the claim or not providing any relevant information to assess the veracity of the claim. Various features computed using these probabilities are finally used by a Random Forest classifier to determine the overall truthfulness of the claim. The sentences which support this classification are returned as evidence. Our approach achieved2 a 0.4277 evidence F1-score, a 0.5136 label accuracy and a 0.3833 FEVER score3.

* Work was completed while the author was a student at the Birla Institute of Technology and Science, India and was interning at SDA Research.
1 https://github.com/DeFacto/DeFactoNLP
2 The scores and ranks reported in this paper are provisional and were determined prior to any human evaluation of those evidences that were retrieved by the proposed systems but were not identified in the previous rounds of annotation. The organizers of the task plan to update these results after an additional round of annotation.
3 The FEVER score measures the fraction of claims for which at least one complete set of evidence has been retrieved by the fact verification system.

1 Introduction

Given the current trend of massive fake news propagation on social media, the world is desperately in need of automated fact checking systems. Automatically determining the authenticity of a fact is a challenging task that requires the collection and assimilation of a large amount of information. To perform the task, a system is required to find relevant documents, detect and label evidence, and finally output a score which represents the truthfulness of the given claim. The numerous design challenges associated with such systems are discussed by Thorne and Vlachos (2018) and Esteves et al. (2018).

The Fact Extraction and VERification (FEVER) dataset (Thorne et al., 2018) is the first publicly available large-scale dataset designed to facilitate the training and testing of automated fact verification systems. The FEVER 2018 Shared Task required us to design such systems using this dataset. The organizers provided a preprocessed version of the June 2017 Wikipedia dump in which the pages only contained the introductory sections of the respective Wikipedia pages. Given a claim, we were asked to build systems which could determine if there were sentences supporting the claim (labelled as "SUPPORTS") or sentences refuting it (labelled as "REFUTES"). If conclusive evidence either supporting or refuting the claim could not be found in the dump, the system should report the same (labelled "NOT ENOUGH INFO"). However, if conclusive evidence was found, it should also retrieve the sentences which either support or refute the claim.

2 System Architecture

Our approach has four main steps: Relevant Document Retrieval, Relevant Sentence Retrieval, Textual Entailment Recognition, and Final Scoring and Classification.



Figure 1: The main steps of our approach

Given a claim, Named Entity Recognition (NER) and TFIDF vector comparison are first used to retrieve the relevant documents and sentences, as delineated in Section 2.1. The relevant sentences are then supplied to the textual entailment recognition module (Section 2.2), which returns a set of probabilities. Finally, a Random Forest classifier (Breiman, 2001) is employed to assign a label to the claim using features derived from the probabilities returned by the entailment model, as detailed in Section 2.3. The proposed architecture is depicted in Figure 1.

2.1 Retrieval of Relevant Documents and Sentences

We used two methods to identify which Wikipedia documents may contain relevant evidence. Information about the NEs mentioned in a claim can be helpful in determining the claim's veracity. In order to find the Wikipedia documents which describe them, the first method initially uses the Conditional Random Fields-based Stanford NER software (Finkel et al., 2005) to recognize the NEs mentioned in the claim. Then, for every NE which is recognized, it finds the document whose name has the least Levenshtein distance (Levenshtein, 1966) to that of the NE. Hence, we obtain a set of documents which contain information about the NEs mentioned in a claim. Since all of the sentences in such documents might aid the verification, they are all returned as possible evidence.
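A minimal sketch of this matching step, with a plain dynamic-programming edit distance (the actual implementation may differ):

def levenshtein(a, b):
    # Standard dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (ca != cb))) # substitution
        prev = curr
    return prev[-1]

def closest_document(named_entity, document_titles):
    # Wikipedia document whose title is closest to the recognized NE.
    return min(document_titles, key=lambda title: levenshtein(named_entity, title))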

The second method used to retrieve candidate evidence is identical to that used in the baseline system (Thorne et al., 2018) and is based on the rationale that sentences which contain terms similar to those present in the claim are likely to help the verification process. Directly evaluating all of the sentences in the dump is computationally expensive. Hence, the system first retrieves the five most similar documents based on the cosine similarity between binned unigram and bigram TFIDF vectors of the documents and the claim using the DrQA system (Chen et al., 2017). Of all the sentences present in these documents, the five most similar sentences based on the cosine similarity between the binned bigram TFIDF vectors of the sentences and the claim are finally chosen as possible sources of evidence. The number of documents and sentences chosen is based on the analysis presented in the aforementioned work by Thorne et al. (2018).
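The TF-IDF ranking can be approximated with the sketch below; it uses plain unigram/bigram TF-IDF vectors and omits the hashing/binning used by DrQA, so it is a simplification rather than the exact baseline component.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def top_k_sentences(claim, sentences, k=5):
    vectorizer = TfidfVectorizer(ngram_range=(1, 2))
    matrix = vectorizer.fit_transform([claim] + sentences)
    # Similarity of each candidate sentence to the claim.
    sims = cosine_similarity(matrix[0], matrix[1:]).ravel()
    ranked = sims.argsort()[::-1][:k]
    return [sentences[i] for i in ranked]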

The sets of sentences returned by the two methods are combined and fed to the textual entailment recognition module described in Section 2.2.

2.2 Textual Entailment Recognition Module

Recognizing Textual Entailment (RTE) is the process of determining whether a text fragment (Hypothesis H) can be inferred from another fragment (Text T) (Sammons et al., 2012). The RTE module receives the claim and the set of possible evidential sentences from the previous step. Let there be n possible sources of evidence for verifying a claim. For the i-th possible evidence, let s_i denote the probability of it entailing the claim, let r_i denote the probability of it contradicting the claim, and let u_i be the probability of it being uninformative. The RTE module calculates each of these probabilities.

The SNLI corpus (Bowman et al., 2015) is used for training the RTE model. This corpus is composed of sentence pairs 〈T, H〉, where T corresponds to the literal description of an image and H is a manually created sentence. If H can be inferred from T, the "Entailment" label is assigned to the pair. If H contradicts the information in T, the pair is labelled as "Contradiction". Otherwise, the label "Neutral" is assigned.



We chose to employ the state-of-the-art RTE model proposed by Peters et al. (2018), which is a re-implementation of the widely used decomposable attention model developed by Parikh et al. (2016). The model achieves an accuracy of 86.4% on the SNLI test set. We selected it because, at the time of development of this work, it was one of the best performing systems on the task with publicly available code. Additionally, the usage of preprocessing parsing tools is not required and the model is faster to train when compared to the other approaches we tried.

Although the model achieved good scores on the SNLI dataset, we noticed that it does not generalize well when employed to predict the relationships between the candidate claim-evidence pairs present in the FEVER data. In order to improve the generalization capabilities of the RTE model, we decided to fine-tune it using a newly synthesized FEVER SNLI-style dataset (Pratt and Jennings, 1996). This was accomplished in two steps: the RTE model was initially trained using the SNLI dataset and then re-trained using the FEVER SNLI-style dataset.

The FEVER SNLI-style dataset was created using the information present in the FEVER dataset while retaining the format of the SNLI dataset. Let us consider each learning instance in the FEVER dataset of the form 〈c, l, E〉, where c is the claim, l ∈ {SUPPORTS, REFUTES, NOT ENOUGH INFO} is the label and E is the set of evidences. While constructing the FEVER SNLI-style dataset, we only considered the learning instances labeled as "SUPPORTS" or "REFUTES" because these were the instances that provided us with evidence. Given such an instance, we proceeded as follows: for each evidence e ∈ E, we created an SNLI-style example 〈c, e〉 labeled as "Entailment" if l = "SUPPORTS" or "Contradiction" if l = "REFUTES". If e contained more than one sentence, we made a simplifying assumption and only considered the first sentence of e. For each "Entailment" or "Contradiction" which was added to this dataset, a "Neutral" learning instance of the form 〈c, n〉 was also created. n is a randomly selected sentence present in the same document from which e was retrieved. We also ensured that n was not included in any of the other evidences in E. Following this procedure, we obtain examples that are similar (retrieved from the same document) but should be labeled differently.
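The construction procedure can be sketched as follows, assuming an illustrative record layout in which each evidence set is a list of (document, sentence index) pairs and doc_sentences maps a document to its sentences; the field names are hypothetical.

import random

def to_snli_style(instance, doc_sentences):
    claim, label, evidence_sets = instance["claim"], instance["label"], instance["evidence"]
    if label == "NOT ENOUGH INFO":
        return []
    target = "Entailment" if label == "SUPPORTS" else "Contradiction"
    used = {(doc, idx) for ev in evidence_sets for doc, idx in ev}
    pairs = []
    for ev in evidence_sets:
        doc, idx = ev[0]  # keep only the first sentence of a multi-sentence evidence
        pairs.append((claim, doc_sentences[doc][idx], target))
        # Add a "Neutral" sentence drawn from the same document but outside any evidence.
        candidates = [i for i in range(len(doc_sentences[doc])) if (doc, i) not in used]
        if candidates:
            pairs.append((claim, doc_sentences[doc][random.choice(candidates)], "Neutral"))
    return pairs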

Split      Entailment  Contradiction  Neutral
Training   122,892     48,825         147,588
Dev        4,685       4,921          8,184
Test       4,694       4,930          8,432

Table 1: FEVER SNLI-style dataset split sizes for the ENTAILMENT, CONTRADICTION and NEUTRAL classes

Model       Macro  Entail.  Contra.  Neutral
Vanilla     0.45   0.54     0.44     0.37
Fine-tuned  0.70   0.70     0.64     0.77

Table 2: Macro and class-specific F1 scores achieved on the FEVER SNLI-style test set

Thus, we obtained a dataset with the characteristics depicted in Table 1. To correct the unbalanced nature of the dataset, we performed random undersampling (He and Garcia, 2009). The fine-tuning had a hugely positive impact on the generalization capabilities of the model, as shown in Table 2. Using the fine-tuned model, the aforementioned set of probabilities is finally computed.
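Random undersampling here simply means reducing every class to the size of the smallest one, e.g.:

import random
from collections import defaultdict

def undersample(examples, label_of):
    by_label = defaultdict(list)
    for ex in examples:
        by_label[label_of(ex)].append(ex)
    n = min(len(items) for items in by_label.values())
    balanced = []
    for items in by_label.values():
        balanced.extend(random.sample(items, n))  # keep n examples per class
    random.shuffle(balanced)
    return balanced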

2.3 Final Classification

Twelve features were derived using the probabilities computed by the RTE module. We define the following variables for notational convenience:

cs_i = \begin{cases} 1 & \text{if } s_i \ge r_i \text{ and } s_i \ge u_i \\ 0 & \text{otherwise} \end{cases}

cr_i = \begin{cases} 1 & \text{if } r_i \ge s_i \text{ and } r_i \ge u_i \\ 0 & \text{otherwise} \end{cases}

cu_i = \begin{cases} 1 & \text{if } u_i \ge s_i \text{ and } u_i \ge r_i \\ 0 & \text{otherwise} \end{cases}

The twelve features which were computed are:

f_1 = \sum_{i=1}^{n} cs_i, \quad f_2 = \sum_{i=1}^{n} cr_i, \quad f_3 = \sum_{i=1}^{n} cu_i

f_4 = \sum_{i=1}^{n} (s_i \times cs_i), \quad f_5 = \sum_{i=1}^{n} (r_i \times cr_i), \quad f_6 = \sum_{i=1}^{n} (u_i \times cu_i)

f_7 = \max_i s_i, \quad f_8 = \max_i r_i, \quad f_9 = \max_i u_i

f_{10} = \begin{cases} f_4 / f_1 & \text{if } f_1 \ne 0 \\ 0 & \text{otherwise} \end{cases}, \quad
f_{11} = \begin{cases} f_5 / f_2 & \text{if } f_2 \ne 0 \\ 0 & \text{otherwise} \end{cases}, \quad
f_{12} = \begin{cases} f_6 / f_3 & \text{if } f_3 \ne 0 \\ 0 & \text{otherwise} \end{cases}

Each of the possible evidential sentences supports a certain label more than the other labels (this can be determined by looking at the computed probabilities). The variables cs_i, cr_i and cu_i are used to capture this fact. The most obvious way to label a claim would be to assign it the label with the highest support. Hence, we chose to use the features f_1, f_2 and f_3, which represent the number of possible evidential sentences supporting each label. The amount of support lent to a certain label by supporting sentences could also be useful in performing the labelling. This motivated us to use the features f_4, f_5 and f_6, which quantify the amount of support for each label. If a certain sentence can strongly support a label, it might be prudent to assign that label to the claim. Hence, we use the features f_7, f_8 and f_9, which capture how strongly a single sentence can support the claim. Finally, we used the features f_10, f_11 and f_12 because the average strength of the support lent by supporting sentences to a given label could also help the classifier.
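The feature computation can be written compactly as below, taking an (n, 3) array of per-sentence probabilities [s_i, r_i, u_i]; ties are broken by argmax here, which differs slightly from the ≥-based definitions above.

import numpy as np

def twelve_features(probs):
    s, r, u = probs[:, 0], probs[:, 1], probs[:, 2]
    winners = probs.argmax(axis=1)             # 0 = support, 1 = refute, 2 = uninformative
    cs, cr, cu = (winners == 0), (winners == 1), (winners == 2)
    f1, f2, f3 = cs.sum(), cr.sum(), cu.sum()
    f4, f5, f6 = (s * cs).sum(), (r * cr).sum(), (u * cu).sum()
    f7, f8, f9 = s.max(), r.max(), u.max()
    f10 = f4 / f1 if f1 else 0.0
    f11 = f5 / f2 if f2 else 0.0
    f12 = f6 / f3 if f3 else 0.0
    return np.array([f1, f2, f3, f4, f5, f6, f7, f8, f9, f10, f11, f12], dtype=float)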

These features were used by a Random Forest classifier (Breiman, 2001) to determine the label to be assigned to the claim. The classifier was composed of 50 decision trees and the maximum depth of each tree was limited to 3. Information gain was used to measure the quality of a split. 3,000 claims labelled as "SUPPORTS", 3,000 claims labelled as "REFUTES" and 4,000 claims labelled as "NOT ENOUGH INFO" were randomly sampled from the training set. Relevant sentences were then retrieved as detailed in Section 2.1 and supplied to the RTE module (Section 2.2). The probabilities calculated by this module were used to generate the aforementioned features. The classifier was then trained using these features and the actual labels of the claims.

We used the trained classifier to label the claims in the test set. If the "SUPPORTS" label was assigned to a claim, the five candidates with the highest (s_i × cs_i) products were returned as evidence. However, if cs_i = 0 for all i, the label was changed to "NOT ENOUGH INFO" and a null set was returned as evidence. A similar process was employed when the "REFUTES" label was assigned to a claim. If the "NOT ENOUGH INFO" label was assigned, a null set was returned as evidence.
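A sketch of this post-processing rule for the SUPPORTS case (the REFUTES case is symmetric); probs is the (n, 3) probability array from the RTE module and candidate_ids identifies the retrieved candidates. The exact bookkeeping of the submitted system may differ.

def supports_evidence(probs, candidate_ids, k=5):
    winners = probs.argmax(axis=1)             # column 0 corresponds to "supports"
    scores = probs[:, 0] * (winners == 0)      # s_i * cs_i
    if scores.sum() == 0:                      # cs_i = 0 for all i
        return "NOT ENOUGH INFO", []
    top = scores.argsort()[::-1][:k]
    return "SUPPORTS", [candidate_ids[i] for i in top]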

3 Results and Discussion

Our system was evaluated using a blind test set which contained 19,998 claims.

Metric          DeFactoNLP  Baseline  Best
Label Accuracy  0.5136      0.4884    0.6821
Evidence F1     0.4277      0.1826    0.6485
FEVER Score     0.3833      0.2745    0.6421

Table 3: System Performance

Table 3 compares the performance of our system with that of the baseline system. It also lists the best performance for each metric. The evidence precision of our system was 0.5191 and its evidence recall was 0.3636. All of these results were obtained upon submitting our predictions to an online evaluator. DeFactoNLP had the 5th best evidence F1 score, the 11th best label accuracy and the 12th best FEVER score out of the 24 participating systems.

The results show that the evidence F1 score of our system is much better than that of the baseline system. However, the label accuracy of our system is only marginally better than that of the baseline, suggesting that our final classifier is not very reliable. The low label accuracy may have negatively affected the other scores. Our system's low evidence recall can be attributed to the primitive methods employed to retrieve the candidate documents and sentences. Additionally, the RTE module can only detect entailment between a pair of sentences. Hence, claims which require more than one sentence to verify cannot be easily labelled by our system. This is another reason behind our low evidence recall, FEVER score and label accuracy. We aim to study more sophisticated ways to combine the information obtained from the RTE module in the near future.

To better assess the performance of the system, we performed a manual analysis of the predictions made by the system. We observed that for some simple claims (e.g. "Tilda Swinton is a vegan") which were labeled as "NOT ENOUGH INFO" in the gold standard, the sentence retrieval module found many sentences related to the NEs in the claim but none of them had any useful information regarding the claim object (e.g. "vegan"). In some of these cases, the RTE module would label certain sentences as either supporting or refuting the claim, even if they were not relevant to the claim. In the future, we aim to address this shortcoming by exploring triple extraction-based methods to weed out certain sentences (Gerber et al., 2015).

We also noticed that the use of coreference in the Wikipedia articles was responsible for the system missing some evidence, as the RTE module could not accurately assess the sentences which used coreference. Employing a coreference resolution system at the article level is a promising direction to address this problem.

The incorporation of named entity disambiguation into the sentence and document retrieval modules could also boost performance. This is because we noticed that, in some cases, the system used information from unrelated Wikipedia pages whose names were similar to those of the NEs mentioned in a claim to incorrectly label it (e.g. a claim was related to the movie "Soul Food" but some of the retrieved evidence was from the Wikipedia page related to the soundtrack "Soul Food").

4 Conclusion

In this work, we described our fact verification system, DeFactoNLP, which was designed for the FEVER 2018 Shared Task. When supplied a claim, it makes use of NER and TFIDF vector comparison to retrieve candidate Wikipedia sentences which might help in the verification process. An RTE module and a Random Forest classifier are then used to determine the veracity of the claim based on the information present in these sentences. The proposed system achieved a 0.4277 evidence F1-score, a 0.5136 label accuracy and a 0.3833 FEVER score. After analyzing our results, we have identified many ways of improving the system in the future. For instance, triple extraction-based methods can be used to improve the sentence retrieval component as well as to improve the identification of evidential sentences. We also wish to explore more sophisticated methods to combine the information obtained from the RTE module and to employ entity linking methods to perform named entity disambiguation.

Acknowledgments

This research was partially supported by an EU H2020 grant provided for the WDAqua project (GA no. 642795) and by the DAAD under the "International promovieren in Deutschland für alle" (IPID4all) project.

References

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In EMNLP, pages 632–642. The Association for Computational Linguistics.

Leo Breiman. 2001. Random forests. Machine Learning, 45(1):5–32.

Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to answer open-domain questions. In Association for Computational Linguistics (ACL).

Diego Esteves, Anisa Rula, Aniketh Janardhan Reddy, and Jens Lehmann. 2018. Toward veracity assessment in RDF knowledge bases: An exploratory analysis. Journal of Data and Information Quality (JDIQ), 9(3):16.

Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, ACL '05, pages 363–370, Stroudsburg, PA, USA. Association for Computational Linguistics.

Daniel Gerber, Diego Esteves, Jens Lehmann, Lorenz Bühmann, Ricardo Usbeck, Axel-Cyrille Ngonga Ngomo, and René Speck. 2015. DeFacto - temporal and multilingual deep fact validation. Web Semantics: Science, Services and Agents on the World Wide Web.

Haibo He and Edwardo A. Garcia. 2009. Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21(9):1263–1284.

V. I. Levenshtein. 1966. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10:707.

Ankur Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. 2016. A decomposable attention model for natural language inference. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2249–2255. Association for Computational Linguistics.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of NAACL: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237. Association for Computational Linguistics.

Lorien Pratt and Barbara Jennings. 1996. A survey of transfer between connectionist networks. Connection Science, 8(2):163–184.

Mark Sammons, V.G.Vinod Vydiswaran, and Dan Roth. 2012. Recognizing textual entailment. In Daniel M. Bikel and Imed Zitouni, editors, Multilingual Natural Language Applications: From Theory to Practice, pages 209–258. Prentice Hall.



James Thorne and Andreas Vlachos. 2018. Automated fact checking: Task formulations, methods and future directions. CoRR, abs/1806.07687.

James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018. FEVER: a large-scale dataset for fact extraction and verification. In NAACL-HLT.




An End-to-End Multi-task Learning Model for Fact Checking

Sizhen Li, Shuai Zhao and Bo Cheng
State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications

Hao Yang
2012 Labs, Huawei Technologies Co., Ltd.

Abstract

With the huge amount of information generated every day on the web, fact checking is an important and challenging task which can help people identify the authenticity of most claims as well as provide evidence selected from a knowledge source like Wikipedia. Here we decompose this problem into two parts: an entity linking task (retrieving relevant Wikipedia pages) and recognizing textual entailment between the claim and the selected pages. In this paper, we present an end-to-end multi-task learning with bi-direction attention (EMBA) model to classify the claim as "supports", "refutes" or "not enough info" with respect to the pages retrieved and to detect sentences as evidence at the same time. We conduct experiments on the FEVER (Fact Extraction and VERification) paper test dataset and shared task test dataset, a new public dataset for verification against textual sources. Experimental results show that our method achieves comparable performance compared with the baseline system.

1 Introduction

When we get news from newspapers and TV that has been thoroughly investigated and written by professional journalists, most of these messages are well-founded and trustworthy. However, with the popularity of the internet, there are 2.5 quintillion bytes of data created each day at our current pace.1 Everyone online is a producer as well as a recipient of this emerging information, and some of it is incorrect, fabricated or even created with malicious intent. Most of the time it is difficult for us to figure out the truth of such emerging news without professional background knowledge and sufficient investigation. Fact checking first arose and received a lot of attention in the industry

1https://www.forbes.com/sites/bernardmarr/2018/05/21/how-much-data-do-we-create-every-day-the-mind-blowing-stats-everyone-should-read/#62a15bba60ba

of journalism, mainly for verifying the speeches of public figures; it is also important for other domains, e.g. correcting mistaken common sense, rumor detection, content review, etc.

With the increasing demand for automatic claim verification, several datasets for fact checking have been produced in recent years. Vlachos and Riedel (2014) were the first to release a public fake news detection and fact-checking dataset, drawn from two fact checking websites, the fact checking blog of Channel 42 and the Truth-O-Meter from PolitiFact3. This dataset only includes 221 statements. Similarly, from PolitiFact via its API, Wang (2017) collected the LIAR dataset with 12.8K manually labeled short statements, which permits machine learning based methods to be used on this dataset. Neither dataset includes the original justification and evidence, as they were not machine-readable. However, verifying a claim based on the claim itself, without referring to any evidence sources, is not reasonable or convincing.

In 2015, Silverman launched the Emergent Project4, a real-time rumor tracker, as part of a research project with the Tow Center for Digital Journalism5 at Columbia University. Ferreira and Vlachos (2016) first proposed using the data from the Emergent Project as the Emergent dataset for rumor debunking, which contains 300 rumored claims and 2,595 associated news articles. In 2017, the Fake News Challenge (Pomerleau and Rao, 2017) consisted of 50K labeled claim-article pairs similarly derived from the Emergent Project. These two datasets stemming from the Emergent Project simplify the fact checking task by detecting the relationship between claim-article pairs. However, in the more common situation, we

2 http://blogs.channel4.com/factcheck/
3 http://www.politifact.com/truth-o-meter/statements/
4 http://www.emergent.info/
5 https://towcenter.org/



are dealing with plenty of claims online without associated articles which can help to verify them.

The Fact Extraction and VERification (FEVER) dataset (Thorne et al., 2018) consists of 185,445 claims manually verified against the introductory sections of Wikipedia pages and classified as SUPPORTED, REFUTED or NOTENOUGHINFO. For the first two classes, the dataset provides combinations of sentences forming the necessary evidence supporting or refuting the claim. Obviously, this dataset is more difficult than existing fact-checking datasets. In order to achieve a higher FEVER score, a fact-checking system is required to classify the claim correctly as well as to retrieve, from more than 5 million Wikipedia pages, sentences that jointly form correct evidence supporting the judgement.

The baseline method for this task comprises three components: document retrieval, sentence-level evidence selection and textual entailment. For the first two retrieval components, the baseline method uses the document retrieval component of DrQA (Chen et al., 2017), which relies only on unigram and bigram TF-IDF vector similarity and does not understand the semantics of the claim and pages. We find that it extracts many Wikipedia pages which are unrelated to the entities described in the claims. Besides, similarity-based methods prefer extracting supporting evidence over refuting evidence. For the recognizing textual entailment (RTE) module, on one hand, the previous retrieval results limit the performance of the RTE model. On the other hand, the selected sentences concatenated as evidence may also confuse the RTE model due to contradictory information.

In this paper, we introduce an end-to-end multi-task learning with bi-direction attention (EMBA) model for the FEVER task. We utilize the multi-task framework to jointly extract evidence and verify the claim because these two sub-tasks can be accomplished at the same time. For example, after selecting relevant pages, we carefully scan these pages to find supporting or refuting evidence. If we find some, the claim can be labeled as SUPPORTED or REFUTED immediately. If not, the claim will be classified as NOTENOUGHINFO after we read the pages completely. Our model is trained on claim-pages pairs using an attention mechanism in both directions, claim-to-pages and pages-to-claim, which provide complementary information to each other. We obtain a claim-aware sentence representation to predict the correct evidence position and a pages-aware claim representation to detect the relationship between the claim and the pages.

2 Related Work

Natural Language Inference (NLI), or Recognizing Textual Entailment (RTE), detects the relationship between premise-hypothesis pairs as "entailment", "contradiction" or "not related". With the renaissance of neural networks (Krizhevsky et al., 2012; Mikolov et al., 2010; Graves, 2012) and attention mechanisms (Xu et al., 2015; Luong et al., 2015; Bahdanau et al., 2014), the popular framework for RTE is "matching-aggregation" (Parikh et al., 2016; Wang et al., 2017). Under this framework, the words of the two sentences are first aligned, and then the alignment results together with the original vectors are aggregated into a new representation vector to make the final decision. The attention mechanism empowers this framework to capture more interactive features between the two sentences. Compared to the FEVER task, RTE provides the sentence to verify against instead of requiring it to be retrieved from a knowledge source.

Another related task is question answering (QA) and machine reading comprehension (MRC), for which approaches have recently been extended to handle large-scale resources such as Wikipedia (Chen et al., 2017). Similar to the MRC task, which needs to identify the answer span in a passage, the FEVER task requires detecting the evidence sentences in Wikipedia pages. However, an MRC model tends to identify the answer span based on similarity and reasoning between the question and passage, while similarity-based methods are more likely to ignore refuting evidence in the pages. For example, a claim stating "Manchester by the Sea is distributed globally" can be refuted by retrieving "It began a limited release on November 18, 2016" as evidence.

3 Model

The FEVER dataset is derived from Wikipedia pages. So, we assume each claim contains at least one entity in Wikipedia and that the evidence can be retrieved from these relevant pages. Thus, we decompose the FEVER task into two components: (1) entity linking, which detects Wikipedia entities in the claim; we use the pages of the identified entities



(Figure 1 shows the EMBA architecture: claim and pages words pass through an embedding layer, a context embedding layer, an attention matching layer and an aggregation layer, ending in an output layer with one Dense+Softmax head for sentence (evidence) selection and another for claim classification.)

Figure 1: An End-to-End Multi-task Learning Model for Fact Checking

as relevant pages; and (2) an end-to-end multi-task learning with bi-direction attention (EMBA) model (Figure 1), which classifies the claim as "supports", "refutes" or "not enough info" with respect to the retrieved pages and selects sentences as evidence at the same time.

3.1 Entity Linking

S-MART is a Wikipedia entity linking tool for short and noisy text. For each claim, we use S-MART to retrieve the top 5 entities from Wikipedia. These entity pages are joined together as the source pages and then passed on for selecting the correct sentences. For a given claim, S-MART first retrieves all possible Wikipedia entities by surface matching, and then ranks them using a statistical model trained on the frequency counts with which each surface form occurs with the entity.

3.2 Sentence Extraction and Claim Verification

We now proceed to identify the correct sentences as evidence from the relevant pages and, at the same time, to classify the claim as "supports", "refutes" or "not enough info" with respect to the retrieved pages. Inspired by the recent success of attention mechanisms in NLI (Wang et al., 2017) and MRC (Seo et al., 2016; Tan et al., 2017), we propose an end-to-end multi-task learning with bi-direction attention (EMBA) model, which exploits both pages-to-claim attention to verify the claim and claim-to-pages attention to predict the evidence sentence position. Our model consists of:

Embedding layer: This layer represents each word as a fixed-size vector with two components: a word embedding and a character-level embedding. For word embedding, pre-trained word vectors, GloVe (Pennington et al., 2014), provide the fixed-size embedding of each word. For character embedding, following Kim (2014), the characters of each word are embedded into fixed-size embeddings and then fed into a Convolutional Neural Network (CNN). The character and word embedding vectors are concatenated together and passed to a Highway Network (Srivastava et al., 2015). The output of this layer is two sequences of word vectors, one for the claim and one for the pages.

Context embedding layer: The purpose of this layer is to incorporate contextual information into the representation of each word of the claim and the pages. We utilize a bi-directional LSTM (BiLSTM) on top of the embeddings provided by the previous layer to encode a contextual embedding for each word.

Attention matching layer: In this layer, we compute attention in two directions: from pages to claim as well as from claim to pages. To obtain these attention mechanisms, we first calculate



a shared similarity matrix between the contextual embedding of each word of the claim $h^c_i$ and each word of the pages $h^p_j$:

\alpha_{ij} = w\,[h^c_i ; h^p_j ; h^c_i \circ h^p_j] \qquad (1)

where $\alpha_{ij}$ represents the attention weight on the $i$-th claim word by the $j$-th pages word, $w$ is a trainable weight vector, $\circ$ is element-wise multiplication, $[;]$ is vector concatenation across rows, and implicit multiplication is matrix multiplication.

Claim-to-pages attention: Claim-to-pages attention represents which claim words are most relevant to each word of the pages. To obtain the attended pages vector, we take $\alpha_{ij}$ as the weight of $h^p_j$ and compute a weighted sum of the contextual embeddings:

\tilde{h}_j = \sum_{i=1}^{N} \alpha_{ij} h_i \Big/ \sum_{i=1}^{N} \alpha_{ij} \qquad (2)

Finally, we match each contextual embedding with its corresponding attention vector to obtain the claim-aware representation of each word of the pages:

m_j = W\,[h_j ; \tilde{h}_j ; h_j \circ \tilde{h}_j] \qquad (3)

Pages-to-claim attention: Pages-to-claim attention represents which pages words are most relevant to each claim word. Similar to claim-to-pages attention, the attended claim vector and the pages-aware representation are calculated by:

\tilde{h}_i = \sum_{j=1}^{N} \alpha_{ij} h_j \Big/ \sum_{j=1}^{N} \alpha_{ij} \qquad (4)

m_i = W\,[h_i ; \tilde{h}_i ; h_i \circ \tilde{h}_i] \qquad (5)
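A compact PyTorch sketch of Equations 1-5, assuming hc (N x d) and hp (M x d) are the contextual embeddings of the claim and the pages, w is a 3d-dimensional vector and W a 3d x d' matrix; following the formulas, the attended vectors are normalised by the raw attention sums (in practice a softmax over alpha is a common, more stable alternative).

import torch

def bidirectional_attention(hc, hp, w, W):
    N, M, d = hc.size(0), hp.size(0), hc.size(1)
    hc_e = hc.unsqueeze(1).expand(N, M, d)
    hp_e = hp.unsqueeze(0).expand(N, M, d)
    alpha = torch.cat([hc_e, hp_e, hc_e * hp_e], dim=-1) @ w          # Eq. 1, shape (N, M)

    # Claim-to-pages: attended vector for every pages word (Eqs. 2-3).
    att_p = (alpha.t() @ hc) / alpha.sum(dim=0, keepdim=True).t()
    m_p = torch.cat([hp, att_p, hp * att_p], dim=-1) @ W

    # Pages-to-claim: attended vector for every claim word (Eqs. 4-5).
    att_c = (alpha @ hp) / alpha.sum(dim=1, keepdim=True)
    m_c = torch.cat([hc, att_c, hc * att_c], dim=-1) @ W
    return m_p, m_c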

Aggregation layer: The input to the aggregation layer is two sequences of matching vectors, the claim-aware pages word representation and the pages-aware claim word representation. The goal of this layer is to capture the interaction among the pages words conditioned on the claim, as well as among the claim words conditioned on the pages words. This is different from the contextual embedding layer, which captures the interaction among context information independently of matching information.

Sentence selection layer: The FEVER task requires the model to retrieve sentences of the pages as evidence to verify the claim. The sentence representation $s_t$ is obtained by concatenating the vectors from the last time-step of the previous layer's BiLSTM output sequences. We calculate the probability distribution of the evidence position over the whole pages by:

p_t = \mathrm{softmax}(w s_t) \qquad (6)

For this sub-task, the objective function is to minimize the negative log probabilities of the true evidence indices:

L_s = -\sum_{t=1}^{N} \big[ y_t \log p_t + (1 - y_t) \log(1 - p_t) \big] \qquad (7)

where $y_t \in \{0, 1\}$ denotes a label; $y_t = 1$ means the $t$-th sentence is a correct evidence, otherwise $y_t = 0$.

Claim verification layer: The input of this layer is the pages-aware claim representation produced by the matching layer, and the output is a 3-way classification predicting whether the claim is SUPPORTED, REFUTED or NOT ENOUGH INFO given the pages. We use multiple convolution layers with an output of size 3 for classification. We optimize the objective function:

L_c = − Σ_{i=1}^{k} y_i log g_i    (8)

where k is the number of claims and y_i ∈ {0, 1, 2} denotes the label, meaning the i-th claim is SUPPORTED, REFUTED or NOT ENOUGH INFO with respect to the pages, respectively.

Training: The model is trained by minimizing the joint objective function:

L = L_s + α · L_c    (9)

where α is a hyper-parameter that weights the two loss functions.
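The following is a small sketch of the joint objective in Eqs. (7)–(9) under our own naming assumptions; the sentence loss is written as an independent per-sentence binary cross-entropy, a close variant of Eq. (7) rather than the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def joint_loss(sent_logits, sent_labels, claim_logits, claim_label, alpha=0.5):
    """Joint objective L = L_s + alpha * L_c (Eq. 9); names and shapes are
    illustrative assumptions, not the authors' exact code.

    sent_logits:  (num_sentences,) scores for each candidate evidence sentence
    sent_labels:  (num_sentences,) 1.0 if the sentence is gold evidence, else 0.0
    claim_logits: (3,) scores for SUPPORTED / REFUTED / NOT ENOUGH INFO
    claim_label:  scalar class index in {0, 1, 2}
    """
    # Sentence selection loss L_s: binary cross-entropy over evidence positions (Eq. 7 variant).
    l_s = F.binary_cross_entropy_with_logits(sent_logits, sent_labels)
    # Claim verification loss L_c: cross-entropy over the three labels (Eq. 8).
    l_c = F.cross_entropy(claim_logits.unsqueeze(0), claim_label.unsqueeze(0))
    return l_s + alpha * l_c
```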

4 Experiments

In this section, we evaluate our model on the FEVER paper test dataset and the shared task test dataset.

4.1 Model Details

The model architecture used for this task is depicted in Figure 1. The nonlinearity function f = tanh is employed. We use 100 1D filters for the CNN character embedding, each with a width of 5. The hidden state size (d) of the model is 100. We use the Adam optimizer (Kingma and Ba, 2014) with a minibatch size of 32 and an initial learning rate of 0.001. A dropout rate of 0.2 is used for the CNN, all LSTM layers, and the linear transformation. The parameters are initialized with the technique described in (Glorot, 2010). The max value used for max-norm regularization is 5. The L_c loss weight is set to α = 0.5.

                     EMBA (paper test)   Baseline (paper dev)
Evidence Score       30.34               32.57
Label Accuracy       45.06               -
Evidence Precision   46.12               -
Evidence Recall      42.84               -
Evidence F1          44.42               -

Table 1: Our EMBA model results on the FEVER paper test dataset and the baseline method results on the paper dev dataset.

Evidence F1   Label Accuracy   FEVER Score
39.73         45.38            29.22

Table 2: Model results on the FEVER shared task test dataset.

4.2 Experimental Results

We use the official evaluation script6 to compute the evidence F1 score, label accuracy and FEVER score. As shown in Table 1, our method on the FEVER paper test dataset achieves performance comparable with the baseline method on the FEVER paper dev dataset. This result shows that jointly verifying a claim and retrieving evidence at the same time can be as good as a pipelined model. Our method's results on the FEVER shared task test dataset are shown in Table 2. In addition, we calculate and present the confusion matrix of claim classification results on the FEVER paper test dataset in Table 3. Our model is not good at identifying the unrelated relationship between a claim and the retrieved pages. Our model's sentence selection performance is reported in Table 4. We can see that our model does not perform well at retrieving evidence. Although evidence precision is low, our model's average accuracy without the requirement to provide correct evidence (51.97%) is similar to the 52.09% accuracy of the baseline method, which suggests that the claim verification module and the sentence extraction module are relatively independent in our model.

6https://github.com/sheffieldnlp/fever-baselines

             NEI     REFUTES   SUPPORTS   recall
NEI          1285    1432      390        41%
REFUTES       992    1937      155        62.8%
SUPPORTS      942     860      1284       41.6%
precision    39.9%   45.8%     70.2%

Table 3: Confusion matrix for claim classification (NEI = "not enough info"); rows are gold labels, columns are predicted labels.

           Evidence precision   Evidence recall   Evidence F1
SUPPORTS   22.07%                40.08%            28.47%
REFUTES    23.8%                 45.18%            31.18%

Table 4: Sentence retrieval performance.

4.3 Error Analysis

We examine the predictions on the paper test dataset and describe several causes of error below.

Document retrieval: We use an entity linking tool to retrieve relevant Wikipedia pages. Some entity mentions in claims are linked incorrectly, hence we cannot obtain the desired pages containing the correct evidence sentences. The S-MART tool returned correct entities for 70% of the claims in the paper test dataset. A better entity retrieval method should be investigated for the FEVER task.

Pages length: After document retrieval, the relevant pages are concatenated and passed through the EMBA model. However, in order to train and predict efficiently, the length of the pages is limited to 800 tokens. Consequently, if there are many relevant pages and the evidence sentence is positioned near the end of the pages, these correct sentences are cut off.

Evidence composition: Some claims require the composition of evidence from multiple pages. Furthermore, the selection of the second page relies on the correct retrieval of the first page and sentence. For example, the claim "Deepika Padukone has been in at least one Indian films" can be supported by combining "She starred roles in Yeh Jawaani Hai Deewani" and "Yeh Jawaani Hai Deewani is an Indian film" from the "Deepika Padukone" and "Yeh Jawaani Hai Deewani" Wikipedia pages respectively. The second page cannot be found correctly if we do not select the first sentence exactly. 18% of the claims in the train dataset belong to this situation.


5 Conclusion

We propose a novel end-to-end multi-task learning with bi-directional attention (EMBA) model that detects evidence sentences and, at the same time, classifies the claim as "supports", "refutes" or "not enough info" with respect to the retrieved pages. EMBA uses an attention mechanism in both directions to capture interactive features between the claim and the retrieved pages. The model obtains a claim-aware sentence representation to predict the correct evidence position and a pages-aware claim representation to detect the relationship between the claim and the pages. Experimental results on the FEVER paper test dataset show that our approach achieves performance comparable with the baseline method. There are several promising directions worth researching in the future. For instance, in the sentence selection layer, the model currently only predicts whether a sentence is evidence; we could instead directly predict whether a sentence is "supporting", "refuting" or "not related to" the claim. Moreover, the hyper-parameter α for the joint loss function is fixed; a well-chosen value could make the joint model more than the sum of its parts, and we could try to learn this parameter during training.

6 Acknowledgements

This work is supported by the National Natural Science Foundation of China (Grant No. 61501048), the Beijing Natural Science Foundation (Grant No. 4182042), the Fundamental Research Funds for the Central Universities (No. 2017RC12), and the National Natural Science Foundation of China (U1536111, U1536112).

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. Computer Science.

Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to answer open-domain questions. arXiv preprint arXiv:1704.00051.

William Ferreira and Andreas Vlachos. 2016. Emergent: a novel data-set for stance classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1163–1168.

Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In International Conference on Artificial Intelligence and Statistics, pages 249–256.

Alex Graves. 2012. Long Short-Term Memory. Springer Berlin Heidelberg.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. Computer Science.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In International Conference on Neural Information Processing Systems, pages 1097–1105.

Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. Computer Science.

Tomas Mikolov, Martin Karafiat, Lukas Burget, Jan Cernocky, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In INTERSPEECH 2010, Conference of the International Speech Communication Association, Makuhari, Chiba, Japan, September, pages 1045–1048.

Ankur P. Parikh, Oscar Tackstrom, Dipanjan Das, and Jakob Uszkoreit. 2016. A decomposable attention model for natural language inference. arXiv preprint arXiv:1606.01933.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Dean Pomerleau and Delip Rao. 2017. Fake News Challenge. http://fakenewschallenge.org.

Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2016. Bidirectional attention flow for machine comprehension. arXiv preprint arXiv:1611.01603.

Rupesh Kumar Srivastava, Klaus Greff, and Jurgen Schmidhuber. 2015. Highway networks. arXiv preprint arXiv:1505.00387.

Chuanqi Tan, Furu Wei, Nan Yang, Bowen Du, Weifeng Lv, and Ming Zhou. 2017. S-Net: From answer extraction to answer generation for machine reading comprehension. arXiv preprint arXiv:1706.04815.

James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018. FEVER: a large-scale dataset for fact extraction and verification. arXiv preprint arXiv:1803.05355.


Andreas Vlachos and Sebastian Riedel. 2014. Fact checking: Task definition and dataset construction. In Proceedings of the ACL 2014 Workshop on Language Technologies and Computational Social Science, pages 18–22.

William Yang Wang. 2017. "Liar, liar pants on fire": A new benchmark dataset for fake news detection. pages 422–426.

Zhiguo Wang, Wael Hamza, and Radu Florian. 2017. Bilateral multi-perspective matching for natural language sentences.

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. Computer Science, pages 2048–2057.


Team GESIS Cologne: An all in all sentence-based approach for FEVER

Wolfgang Otto
GESIS – Leibniz-Institute for the Social Sciences in Cologne
Unter Sachsenhausen 6-8
50667 Cologne

[email protected]

Abstract

In this system description of our pipeline for participating in the FEVER Shared Task, we describe our sentence-based approach. Throughout all steps of our pipeline, we regard single sentences as our processing unit. In our IR component, we search the set of all possible Wikipedia introduction sentences without limiting sentences to a fixed number of relevant documents. In the entailment module, we judge every sentence separately and combine the results of the classifier for the top 5 sentences with the help of an ensemble classifier to make a judgment on whether the given claim can be verified.

1 Introduction

Our approach is strongly related to the baseline approach. As a first component, it uses a sentence retrieval method without a more in-depth analysis of the semantic properties of Wikipedia sentences. For the IR task, no external resources besides the given Wikipedia sentences are taken into account. The second component is the entailment model, which is the same Decomposable Attention Model as the one used to generate the best baseline results in Thorne et al. (2018). In both components, however, there are some differences. In the IR component, there is no document retrieval as a first step: given a claim, we search directly over all Wikipedia sentences of the reference corpus for possible candidate sentences. In the entailment component, the difference lies not in the model but in the data used during training and inference time. We train the model sentence-wise and not claim-wise, i.e., we split the result set for each claim into combinations of the claim and each sentence separately. To be able to handle evidence consisting of more than one sentence, we introduce new classes: one class to identify evidence sentences which are part of supporting evidence with multiple sentences, and a second one to identify evidence sentences which are part of refuting evidence with more than one sentence.

2 Sentence Retriever

The basic idea of our evidence retrieval engine is the intuition that a preselection of specific Wikipedia articles for a given claim may exclude sentences which are highly related to the claim based on word or even entity overlap, simply because the Wikipedia article has another topic.

Our approach tries to find sentence candidates among all sentences of the given shared task Wikipedia introduction sentences corpus. To keep the system simple and rely on a well-tested environment, we indexed all sentences with a SOLR search engine1 using the default configuration. Our idea for finding relevant candidate sentences, which support or refute the given claim, is to identify those which are connected to the same entities or noun chunks. So we extract this information from the claim and create a SOLR query to get a ranked sentence list from the search engine.

2.1 Preprocessing

Wikipedia Article Introductions: The problem with only working with single sentences is that sentences of a Wikipedia article introduction lose the connection to the article title in many cases. To create good retrieval results, we would need a preprocessing step for coreference resolution to match sentences like "He is the biggest planet in our solar system." from the article about Jupiter to the claim "Jupiter is larger than any other planet in the solar system." We decided on the most straightforward solution and concatenated a cleaned version of the title to the Wikipedia sentences before indexing. For cleaning the title, we cut off all parts beginning with a round bracket.

1http://lucene.apache.org/solr/.


Also, underscores are replaced with spaces. So "Beethoven -LRB-TV series-RRB-"2 will be transformed to "Beethoven", for example.
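A minimal sketch of this title-cleaning step (a hypothetical helper, not the author's code):

```python
def clean_title(raw_title: str) -> str:
    """Strip the disambiguation suffix and underscores from a Wikipedia title.
    Hypothetical helper illustrating the described cleaning step."""
    title = raw_title.replace("_", " ")
    # Cut off everything starting at the tokenized opening bracket "-LRB-".
    bracket = title.find("-LRB-")
    if bracket != -1:
        title = title[:bracket]
    return title.strip()

# Example: clean_title("Beethoven_-LRB-TV_series-RRB-") -> "Beethoven"
```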

Query claim: For the generation of a query for a given claim, we extract all noun chunks and named entities with the natural language processing tool SpaCy3 (Honnibal and Johnson, 2015) with the provided en_core_web_sm language model4. Then we filter all resulting individual words and phrases. Given this set of all words and multi-word units, we create a SOLR query which gives an advantage to adjacent words of the multi-word units that occur within a maximum word distance of two. Additionally, we query each word of each item in the set separately with a should query. The named entity Serena Williams, for example, is searched with a query where "Serena", "Williams" and "Serena Williams"~2 should all be matched. The swung dash in the last part of this query indicates that search results where "Serena" and "Williams" occur within a maximum distance of two will be boosted. The distance of two is chosen because it helps in cases like "Arnold Alois Schwarzenegger" to boost the match of the search query "Arnold Schwarzenegger"~2. Here is a more complete example:

Claim: Serena Williams likes to eat out in a small restaurant in Las Vegas.
Named Entities: Serena Williams, Las Vegas
Noun Chunks: Serena Williams, a small restaurant, Las Vegas
Unigram search terms: Serena, Williams, a, small, restaurant, Las, Vegas
Pushed bigram search terms: Serena Williams, a small, small restaurant, Las Vegas
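The sketch below illustrates how such a query could be assembled with SpaCy and sent to Solr. The pysolr client, the core name, and the field name text are assumptions for illustration; only the noun-chunk/entity extraction and the "phrase"~2 proximity boosting follow the description above.

```python
import spacy
import pysolr  # assumed client; any Solr HTTP client would do

nlp = spacy.load("en_core_web_sm")
solr = pysolr.Solr("http://localhost:8983/solr/fever_sentences")  # assumed core name

def build_query(claim: str) -> str:
    """Build a Solr 'should' query from noun chunks and named entities.
    The field name 'text' is an illustrative assumption."""
    doc = nlp(claim)
    units = {chunk.text for chunk in doc.noun_chunks} | {ent.text for ent in doc.ents}
    clauses = []
    for unit in units:
        words = unit.split()
        # Each individual word becomes an optional (should) clause.
        clauses.extend(f'text:"{w}"' for w in words)
        if len(words) > 1:
            # Boost results where the words of a multi-word unit occur within distance 2.
            clauses.append(f'text:"{unit}"~2')
    return " OR ".join(clauses)

results = solr.search(build_query("Serena Williams likes to eat out in a small "
                                  "restaurant in Las Vegas."), rows=10)
```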

The result of the sentence retriever is a list of sentences and their corresponding Wikipedia titles which match best with respect to named entities and noun chunks, based on the described extraction and querying.

2 Where "-LRB-" stands for "Left Round Bracket" and "-RRB-" for "Right Round Bracket", as can be found in the already tokenized Wikipedia resources which were made available by the organizers of the competition.
3 https://spacy.io/
4 https://spacy.io/models/en#en_core_web_sm

3 Recognizing Textual Entailment

3.1 Preprocessing

During preprocessing of the training and test data, we consider three steps. For the first step of tokenizing the claim and the Wikipedia sentence, we treat the two differently: the Wikipedia sentences are already tokenized, and the claims are tokenized with SpaCy's standard rule-based tokenizer. For textual entailment, the same coreference resolution problem described for the IR component pops up again. Because of this, we decided to add the title information to the Wikipedia sentences as additional information here as well. Adding this information can help the entailment model identify the entity described in the sentence. A working example:

Claim: Stars vs. the Forces of Evil is a series.
Wikipedia title: Star vs. the Forces of Evil
Wikipedia Sentence (Sentence No. 6): On February 12 , 2015 , Disney renewed the series for a second season prior to its premiere on Disney XD .

Of course, this is a heuristic. There are sentences where the added information duplicates the information about the entity, and then again there are sentences where the content does not match the entity described in the title. In practice, we join the tokens of the title to the sentence while excluding the additional disambiguation information, i.e., for "Hulk (Film)" we only add "Hulk" to the corresponding sentence string. For vectorization of the token sequence, we use GloVe word embeddings with a dimension of 300, produced with the method of Pennington et al. (2014). To maximize the overlap with the words used in our Wikipedia-based dataset, we used the vectors trained on Wikipedia 2014 + Gigaword 5 by the Stanford NLP group.5

3.2 Prediction Classes

The data set includes the special case where one single sentence is not enough to support or refute a claim; in this case, multiple-sentence evidence is provided. In 9.0% of the validation set claims where supporting or refuting evidence exists, a minimum of two Wikipedia sentences is needed. In 14.7% of the supporting/refuting claims, there is at least one possible multiple-sentence evidence set. Around 25% of the supporting or refuting sentences in the validation set are part of multiple-sentence evidence. Multiple-sentence evidence poses a problem for our approach of sentence-by-sentence entailment assessment. On the one hand, the combination of a given claim and one sentence of a multiple-sentence support/refute set cannot be classified as supporting or refuting. On the other hand, more information is delivered than in a regular NOT ENOUGH INFO claim-sentence pair. We decided to deal with this by using not three but five classes. For sentence-wise prediction, we extended the given classes SUPPORTS, REFUTES and NOT ENOUGH INFO with the two new classes PART OF SUPPORTS and PART OF REFUTES.

5 https://nlp.stanford.edu/projects/glove/

3.3 Generating NOT ENOUGH INFO sentences

In Thorne et al. (2018), the authors introduced two ways of selecting sentences for claims which are annotated as NOT ENOUGH INFO. They compared classifiers trained on randomly chosen sentences for this class with classifiers trained on data where the top 5 results of the sentence retriever are used as text input. The results show that, on the textual entailment validation set, both classifiers trained on random sentences show better results than the ones trained on top-5 results. But for the whole pipeline, the resulting accuracy drops by around 1% for the Decomposable Attention Model (41.6% vs. 40.6% pipeline accuracy in Thorne et al. (2018)). As we used the same model, we decided to use the approach of selecting the top retrieved sentences for the NOT ENOUGH INFO annotated claims. To keep the number of sentences per class from becoming too unequal, we chose to use the top 3 results of our sentence retriever for the test and validation sets of the Decomposable Attention Model. It should be noted, however, that our sentence retriever works in a slightly different way than the one used in the baseline approach.

For the occurrence of each label in the sentence-wise validation set for the entailment prediction task, see Table 1.

label               frequency
NOT ENOUGH INFO     19348
SUPPORTS             7012
REFUTES              7652
PART OF SUPPORTS     2741
PART OF REFUTES      2452

Table 1: Frequency of each label in the sentence-wise validation set.

3.4 Decomposable Attention Model

For the task of recognizing textual entailment, we use a Decomposable Attention Model as described in Parikh et al. (2016), which is one of the classifiers used in the baseline approach (Thorne et al., 2018). We selected the vanilla version of this network without self-attention on the input sequences. The model compares each word vector representation of one input sequence with representations of phrases of the other input sequence; the process of selecting words from the other sequence for comparison is called attention. After this, the representations of these comparisons are aggregated and, in a final step, used to predict whether one sequence supports or refutes the other, or whether there is not enough information for a decision.

The model is formulated with the aim of learning, during training time, which words are to be compared, how to compare them, and in which entailment relation the two sequences stand to each other.

Basic Parameters: For training, we used the given training and evaluation sets for the shared task, prepared and preprocessed as described above in a sentence-wise manner. We used batches of size 32 with an equal number of words during training and a dropout rate of 0.2 for all three feed-forward layers F, G and H of the neural network model; F, G and H are used here analogously to the terminology in Parikh et al. (2016). We trained the model for 300 epochs on all batches of the training set and chose the best-performing model, measured on the validation set, for the prediction of the test set. Tokens without a word embedding representation in GloVe Wikipedia 2014 + Gigaword 5 (out-of-vocabulary words) are treated with the same approach as in Parikh et al. (2016): the words are hashed to one of 100 random embeddings.
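A small sketch of this out-of-vocabulary treatment, hashing each unknown token to one of 100 fixed random vectors; the helper names and the use of a CRC hash are our own assumptions:

```python
import zlib
import numpy as np

rng = np.random.RandomState(0)
N_OOV_BUCKETS = 100
EMB_DIM = 300
# 100 fixed random embeddings shared by all out-of-vocabulary words.
oov_vectors = rng.normal(size=(N_OOV_BUCKETS, EMB_DIM)).astype(np.float32)

def lookup(token: str, glove: dict) -> np.ndarray:
    """Return the GloVe vector if available, otherwise a hashed random vector."""
    if token in glove:
        return glove[token]
    # Deterministic hash so the same OOV word always maps to the same bucket.
    return oov_vectors[zlib.crc32(token.encode("utf-8")) % N_OOV_BUCKETS]
```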


3.5 Ensemble Learner

The result of the entailment classifier is a judgment for each pair of claim and Wikipedia introduction sentence, indicating whether the claim can be entailed from the sentence based on the five introduced classes. For taking part in the FEVER Shared Task, it is necessary to decide for each claim on one of the labels SUPPORTS, REFUTES or NOT ENOUGH INFO. The second part is to find the right sentences which underpin the judgment. The result of the sentence-based entailment classifier is a list of judgments which might be contradictory. As a result, we need a classifier which aggregates the results for the sentences into one final claim judgment. For this, we combine the entailment judgments from the classifier by using a random forest classifier (Breiman, 1999). As input, we take the probabilities of the judgments for all five classes for the top 5 results of the sentence retriever; in this way, the number of features sums up to 25. We keep the order of the sentences based on the sentence retriever results. We trained this aggregating classifier to predict one of the three classes expected by the FEVER scorer for the evaluation of the shared task.6 Together with the top 5 sentences of the sentence retriever, this result represents the output of the whole pipeline. To generate pipeline results for the validation set, we trained the classifier on half of the validation set and predicted on the other half. For the FEVER Shared Task test set prediction, we trained the classifier on all samples from the validation set.
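A minimal scikit-learn sketch of this aggregation step is shown below. The 5 × 5 = 25 feature layout follows the description above; the variable names, the number of trees, and the placeholder loaders are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def to_features(sentence_probs):
    """Flatten the 5-class probability vectors of the top 5 retrieved sentences
    (retriever order preserved) into one 25-dimensional feature vector."""
    feats = np.zeros((5, 5), dtype=np.float32)
    for i, probs in enumerate(sentence_probs[:5]):
        feats[i] = probs
    return feats.reshape(-1)

# Placeholder loaders: per-claim lists of 5-class probabilities from the
# entailment model, and the gold 3-way labels used to train the aggregator.
train_probs, train_labels = load_half_of_validation_set()   # hypothetical loader
test_probs = load_other_half()                               # hypothetical loader

X_train = np.stack([to_features(p) for p in train_probs])
ensemble = RandomForestClassifier(n_estimators=100, random_state=0)
ensemble.fit(X_train, np.array(train_labels))

X_test = np.stack([to_features(p) for p in test_probs])
final_labels = ensemble.predict(X_test)   # SUPPORTS / REFUTES / NOT ENOUGH INFO
```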

4 Evaluation Results and Discussion

4.1 Sentence Retriever

To be able to measure the results returned by the retrieval component, we take the FEVER scorer into account, too. To get a different view on the outcomes, we measured the recall values for different allowed result set sizes. For additional analysis, we measured, for each result set size, separate recall values for only refuting and only supporting claims. Figure 1 shows that the recall values for the refuting sentences are lower than those for the supporting ones. This seems to reflect the intuition that in refuting sentences the word overlap with the claims is smaller than in supporting sentences. The fact that even with an allowed result set size of 100 the recall of our approach is only 77.3% shows that a simple method which depends on word overlap between claim and sentences reaches its limits. In comparison to the baseline, where 44.22% of the claims in the validation set were fully supported after the document and sentence selection components, in our approach 53.6% of them have full support, even though we do not use a document retrieval component at all. This would lead to an oracle accuracy of 69.1% (baseline: 62.8%).

Figure 1: The recall values for sentence retrieval based on different allowed result set sizes. The black line shows the baseline recall for an allowed result set size of five.

Figure 2: Heatmap of the precision of each class.

6 https://github.com/sheffieldnlp/fever-scorer

4.2 Entailment Classifier and full pipeline

The entailment classifier has an accuracy of 64.7% for the sentence-by-sentence prediction of the five classes. This is not comparable with the results of the baseline because of the sentence-wise comparison and the five-label classification scheme. A look at the class-wise precision of the classifier and the number of sentences per class draws attention to the fact that this value strongly depends on the number of sentences which are generated for the NOT ENOUGH INFO label. This is because for this class the model achieves the best precision values and the label is over-represented in the validation set.

As expected, the classifier has problems differentiating between refuting sentences and those which are part of a multiple-sentence refutation. The same applies to the supporting sentences, as can be seen in Figure 2. For the overall pipeline evaluation, we get a FEVER score of 46.7% on half of the validation set; the other half was used to train the aggregating ensemble learner. On the shared task test set, we achieved a FEVER score of 40.77 (8th place).

The next steps to evolve our system should be to focus on the recall of sentence retrieval for refuting sentences and to split up the strategies for the two types of sentences.

References

Leo Breiman. 1999. Random forests - random features. Technical Report 567.

Matthew Honnibal and Mark Johnson. 2015. An improved non-monotonic transition system for dependency parsing. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1373–1378.

Ankur P. Parikh, Oscar Tackstrom, Dipanjan Das, and Jakob Uszkoreit. 2016. A decomposable attention model for natural language inference. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2249–2255.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018. FEVER: a large-scale dataset for fact extraction and verification. In NAACL-HLT.


Team SWEEPer: Joint Sentence Extraction and Fact Checking with Pointer Networks

Christopher Hidey∗
Department of Computer Science
Columbia University
New York, NY 10027
[email protected]

Mona Diab
Amazon AI Lab
[email protected]

Abstract

Many tasks such as question answering and reading comprehension rely on information extracted from unreliable sources. These systems would thus benefit from knowing whether a statement from an unreliable source is correct. We present experiments on the FEVER (Fact Extraction and VERification) task, a shared task that involves selecting sentences from Wikipedia and predicting whether a claim is supported by those sentences, refuted, or there is not enough information. Fact checking is a task that benefits from not only asserting or disputing the veracity of a claim but also finding evidence for that position. As these tasks are dependent on each other, an ideal model would consider the veracity of the claim when finding evidence and also find only the evidence that is relevant. We thus jointly model sentence extraction and verification on the FEVER shared task. Among all participants, we ranked 5th on the blind test set (prior to any additional human evaluation of the evidence).

1 Introduction

Verifying claims using textual sources is a difficult problem, requiring natural language inference as well as information retrieval if the sources are not provided. The FEVER task (Thorne et al., 2018) provides a large annotated resource for extraction of sentences from Wikipedia and verification of the extracted evidence against a claim. A system that extracts and verifies statements in this framework must consist of three components: 1) retrieving Wikipedia articles, 2) identifying the sentences from Wikipedia that support or refute the claim, and 3) predicting supporting, refuting, or not enough info.

We combine these components into two stages: 1) identifying relevant documents (Wikipedia articles) for a claim and 2) jointly extracting sentences from the top-ranked articles and predicting a relation for whether the claim is supported, refuted, or if there is not enough information in Wikipedia. We first identify relevant documents by ranking Wikipedia articles according to a model using lexical and syntactic features. Then, we derive contextual sentence representations for the claim paired with each evidence sentence in the extracted documents. We use the ESIM module (Chen et al., 2017b) to create embeddings for each claim/evidence pair and use a pointer network (Vinyals et al., 2015) to recurrently extract only relevant evidence sentences while predicting the relation using the entire set of evidence sentences. Finally, given these components, which are pre-trained using multi-task learning (Le et al., 2016), we tune the parameters of the entailment component given the extracted sentences.

∗ Work completed while at Amazon AI Lab

Our experiments reveal that jointly training the model provides a significant boost in performance, suggesting that the contextual representations learn information about which sentences are most important for relation prediction as well as information about the type of relationship the evidence has to the claim.

2 Document Retrieval

For the baseline system, the document retrieval component from DrQA (Chen et al., 2017a) for k = 5 documents only finds the supporting evidence 55% of the time. This drops to 44% using the same model for sentence retrieval at l = 5 sentences. In comparison, in the original work of Chen et al. (2017a), they find a recall of 70-86% for all tasks with k = 5. This is partly due to the misleading information present in the false claims, whereas for question answering, the question is not designed adversarially to contain contradicting information. Examining the supporting and refuting claims in isolation, we find that document retrieval at k = 5 is 59% and 49% respectively and sentence selection at l = 5 is 49% and 39%. For example, the false claim "Murda Beatz was born on February 21, 1994." (his birth date is February 11) also retrieves documents for people born on February 21. In the question answering scenario, an example question might be "Which rap artist was born on February 11, 1994 in Fort Erie, Ontario?", which allows an IR system to return documents using the disambiguating n-grams about his location and place of birth.

This motivates the decision to focus on noun phrases. Many of the claims contain the correct topic of the Wikipedia article in the subject position of both the claim and either the first or second sentence. When the topic is not in the first sentence, it is often because the title is ambiguous and the first sentence is a redirect. For example, for the article "Savages (2012 film)" the first two sentences are "For the 2007 film, see The Savages. Savages is a 2012 American crime thriller film directed by Oliver Stone."

We thus parse Wikipedia and all the claims using CoreNLP (Manning et al., 2014) and train a classifier with the following lexical and syntactic features from the claim and the first two sentences in the Wikipedia article:

• TF/IDF from DrQA for full article

• Overlap between the subject/object/modifier in the claim and the subject/object/modifier in the first and second sentence of Wikipedia. The topic of the article is often the subject of the sentence in Wikipedia, but it is occasionally a disambiguating modifier such as "also known as". The topic of the claim is often in the subject position as well. We also add overlap between named entities (in case the parsing fails) and upper case words (in case the parsing and NER fail). For each of the subject/object/modifier/upper/entity combinations, for both upper and lower case, we consider the coverage of the claim from the first two Wikipedia sentences, for 25 ∗ 2 ∗ 2 = 100 features.

• Average/max/minimum of GloVe (Pennington et al., 2014) embeddings in the claim and the first and second sentence of Wikipedia. This feature captures the type of sentence. If it is a person, it may have words like 'born', 'known as', etc., and if it is a disambiguation sentence it should have words like 'refers' or 'see' or 'confused'. It also allows for the handling of cases where the sentence tokenization splits the first sentence.

• Features based on the title: whether the title is completely contained in the claim and the overlap between the claim and the title (with and without metadata information in parentheses), for both upper and lower case (see the sketch after Figure 1).

Model                        Recall at k = 5
                             Dev     Test
Baseline (DrQA)              55.3    54.2
MLP without title features   82.7    79.7
MLP with title features      90.7    90.5

Figure 1: Document Retrieval Results
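A hypothetical sketch of the title features from the last bullet above (not the authors' code; the exact feature set may differ):

```python
import re

def title_features(claim: str, title: str):
    """Illustrative title-based features: containment and token overlap between
    the claim and the article title, with and without the parenthesized
    metadata, for both original and lower case."""
    feats = []
    stripped = re.sub(r"\s*\(.*?\)\s*$", "", title)          # drop "(2012 film)" etc.
    for t in (title, stripped):
        for c, v in ((claim, t), (claim.lower(), t.lower())):
            feats.append(1.0 if v in c else 0.0)              # full containment
            c_tok, v_tok = set(c.split()), set(v.split())
            feats.append(len(c_tok & v_tok) / max(len(v_tok), 1))  # token overlap
    return feats

# Example: title_features("Savages is a 2012 American crime thriller film.",
#                         "Savages (2012 film)")
```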

2.1 Experiments

We train a Multi-Layer Perceptron (MLP) using Pytorch (Paszke et al., 2017) on the top 1000 documents extracted using DrQA, which obtains a recall of 95.3%, with early stopping on the paper development set. We used the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 0.001 and hyper-parameters of β1 = 0.9, β2 = 0.999, and ε = 10^-8. We used pre-trained 300-dimensional GloVe embeddings. We clip gradients at 2 and use a batch size of 32.

2.2 Results

Document recall at k = 5 articles is presented in Figure 1. We present results on the paper development and test sets, as the evidence for the shared task blind test set was unavailable as of this writing. The model with lexical and syntactic features obtains around 25 points absolute improvement over the DrQA baseline, and the title features, when added, provide an additional 8 points improvement on the development set and 10.8 points on test.

3 Joint Sentence Extraction and Relation Prediction

Recurrent neural networks have been shown to be effective for extractive summarization (Nallapati et al., 2017; Chen and Bansal, 2018). However, in this context we want to extract sentences that also help predict whether the claim is supported or refuted, so we jointly model extraction and relation prediction and train in a multi-task setting.


Figure 2: The contextual claim and evidence sentence representations obtained with ESIM.

First, given the top k extracted documents from our retrieval component, we select the top l sentences using a weighted combination of Jaccard similarity and cosine similarity of average word embeddings. Hyper-parameters were tuned so that the model would have minimal difference in recall from the document retrieval stage but still fit in memory on a Tesla V100 with 16GB of memory. We selected k = 5 and l = 50, with Jaccard similarity weighted at 0.3 and GloVe embeddings weighted at 0.7. We found that recall on the development set was 90.3.
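As a concrete reading of this pre-selection step, the sketch below scores every candidate sentence as 0.3 · Jaccard + 0.7 · cosine similarity of averaged GloVe vectors and keeps the top l; the helper names are our own.

```python
import numpy as np

def jaccard(a_tokens, b_tokens):
    a, b = set(a_tokens), set(b_tokens)
    return len(a & b) / max(len(a | b), 1)

def avg_embedding(tokens, glove, dim=300):
    vecs = [glove[t] for t in tokens if t in glove]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0

def top_l_sentences(claim_tokens, candidate_sentences, glove, l=50,
                    w_jaccard=0.3, w_cosine=0.7):
    """Rank candidate sentences by the weighted similarity and keep the top l.
    candidate_sentences: list of token lists from the k retrieved documents."""
    claim_vec = avg_embedding(claim_tokens, glove)
    scored = []
    for sent in candidate_sentences:
        score = (w_jaccard * jaccard(claim_tokens, sent)
                 + w_cosine * cosine(claim_vec, avg_embedding(sent, glove)))
        scored.append((score, sent))
    scored.sort(key=lambda x: x[0], reverse=True)
    return [sent for _, sent in scored[:l]]
```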

We then store contextual representations of the claim and each evidence sentence in a memory network and use a pointer network to extract the top 5 sentences sequentially. Simultaneously, we use the entire memory of up to l sentences to predict the relation: supports, refutes, or not enough info.

3.1 Sentence Representation

For each sentence in the evidence, we create contextual representations using the ESIM module (Chen et al., 2017b), which first encodes the claim and the evidence sentence using an LSTM, attends to the claim and evidence, composes them using an additional LSTM layer, and finally applies max and average pooling over time (sentence length) to obtain a paired sentence representation for the claim and evidence. This representation is depicted in Figure 2. For more details, please refer to Chen et al. (2017b).

The representation for a claim c and extracted evidence sentence e_p for c is then:

m_p = ESIM(c, e_p)    (1)

3.2 Sentence Extraction

Figure 3: Our multi-task learning architecture with contextual ESIM representations used for evidence sentence extraction via a pointer network and relation prediction with max/average pooling and dense layers.

Next, to select the top 5 evidence sentences, we use a pointer network (Vinyals et al., 2015; Chen and Bansal, 2018) over the evidence for claim c to extract sentences recurrently. The extraction probability1 for sentence e_p at time t < 5 is then:

u^t_p = v_e^T tanh(W [m_p ; h^{t,q}])   if p_t ≠ p_s ∀ s < t
u^t_p = −inf                            otherwise             (2)

P(p_t | p_0 · · · p_{t−1}) = softmax(u^t)    (3)

with h^{t,q} computed using the output of q hops over the evidence (Vinyals et al., 2016; Sukhbaatar et al., 2015):

α^{t,o} = softmax(v_h^T tanh(W_{g1} m_p + W_{g2} h^{t,o−1}))    (4)

h^{t,o} = Σ_j α^{t,o}_j W_{g1} m_j    (5)

At each timestep t, we update the hidden state z_t of the pointer network LSTM. Initially, h^{t,0} is set to z_t. We train and validate the pointer network using the extracted top l sentences. For all training examples, we randomly replace evidence sentences with gold evidence if no gold evidence is found.
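A simplified PyTorch sketch of the extraction step in Eqs. (2)–(5) is given below, using greedy decoding for a single claim instead of beam search; the layer names, sizes, and exact wiring are our own assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class EvidencePointer(nn.Module):
    """Greedy pointer-style evidence selection over a memory of claim/evidence
    pair representations m_p (a sketch of Eqs. (2)-(5), not the authors' code)."""

    def __init__(self, d, hops=3):
        super().__init__()
        self.hops = hops
        self.W_g1 = nn.Linear(d, d, bias=False)
        self.W_g2 = nn.Linear(d, d, bias=False)
        self.v_h = nn.Linear(d, 1, bias=False)
        self.W = nn.Linear(2 * d, d, bias=False)
        self.v_e = nn.Linear(d, 1, bias=False)
        self.lstm = nn.LSTMCell(d, d)

    def forward(self, memory, n_select=5):
        # memory: (L, d), one row per candidate evidence sentence.
        L, d = memory.shape
        state = (memory.new_zeros(1, d), memory.new_zeros(1, d))   # pointer LSTM state z_t
        mask = memory.new_zeros(L, dtype=torch.bool)
        selected = []
        for _ in range(n_select):
            h = state[0].squeeze(0)                       # h_{t,0} initialised to z_t
            for _ in range(self.hops):                    # q hops over the memory, Eqs. (4)-(5)
                scores = self.v_h(torch.tanh(self.W_g1(memory) + self.W_g2(h))).squeeze(-1)
                att = torch.softmax(scores, dim=0)
                h = att @ self.W_g1(memory)
            pair = torch.cat([memory, h.unsqueeze(0).expand(L, -1)], dim=-1)
            u = self.v_e(torch.tanh(self.W(pair))).squeeze(-1)     # Eq. (2)
            u = u.masked_fill(mask, float("-inf"))                 # forbid re-selection
            p = int(torch.argmax(u))                               # greedy version of Eq. (3)
            selected.append(p)
            mask[p] = True
            state = self.lstm(memory[p].unsqueeze(0), state)       # advance z_t
        return selected
```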

3.3 Relation Prediction

In order to predict the support, refute, or not enough info relation, we use a single-layer MLP to obtain an abstract representation of the sentence representation used for extraction, then apply average and max pooling over the contextual representations of the claim and evidence sentences to obtain a single representation m for the entire memory. Finally, we use a 2-layer MLP to predict this relation given m. The entire joint architecture is presented in Figure 3.

1Set to −inf only while testing


3.4 Optimization

We train the model to minimize the negative log likelihood of the extracted evidence sequence2 and the cross-entropy loss L(θ_rel) of the relation prediction. The pointer network is trained as in a sequence-to-sequence model:

L(θ_ptr) = −(1/T) Σ_{t=0..T} log P_{θ_ptr}(p_t | p_{0:t−1})    (6)

and the overall loss is then:

L(θ) = λ L(θ_ptr) + L(θ_rel)    (7)

Since the evidence selection and relation prediction are scored independently for the FEVER task, we select the parameters using early stopping such that one set of parameters performs the best in terms of evidence recall on the validation set and another performs the best for accuracy.

Although the models are trained jointly to select evidence sentences that help predict the relation to the claim, we may obtain additional improvement by tuning the parameters given the output of the sentence extraction step. In order to do so, we first select the top 5 sentences from the sentence extractor and predict the relation using only those sentences rather than the entire memory as before. In this scenario, we pre-train the model using multi-task learning and tune the parameters for relation prediction while keeping the sentence extraction parameters fixed (and using separate representations for ESIM). We also experimented with a reinforcement learning approach to tune the sentence extractor as in Chen and Bansal (2018) but found no additional improvement.

3.5 Experiments

We use Pytorch for our experiments. For the multi-task learning and tuning, we use the Adagrad optimizer with learning rates of 0.01 and 0.001 and gradients clipped to 5 and 2, respectively. For both experiments, we used a batch size of 16. We used the paper development set for early stopping. We initialized the word embeddings with 300-dimensional GloVe vectors and fixed them during training, using a 200-dimensional projection. Out-of-vocabulary words were initialized randomly during both training and evaluation.

2 The evidence as given has no meaningful order, but we use the support/refute sequence as provided in the dataset as it may contain annotator bias in terms of importance.

        Dev                  Test
        LA    ER    F        LA    ER    F
Base    52.1  44.2  32.6     50.9  45.9  31.9
Gold    68.5  96.0  66.2     65.4  95.2  62.8
MLP     60.4  76.6  51.1     58.7  74.0  49.1
Sep.    56.8  74.9  45.3     53.8  72.7  42.3
MTL     64.0  79.6  55.3     60.5  77.7  52.1
Tune    64.5  79.6  55.8     61.9  77.7  53.2

Figure 4: Paper Development and Test Results (LA: Label Accuracy, ER: Evidence Recall, F: FEVER Score)

The second dimension of all other parameter matrices was 200. For the pointer network, we used a beam size of 5 and q = 3 hops. λ was set to 1.

4 Results

In Figure 4, we present the sentence extraction, relation prediction, and overall FEVER score for the paper development and test sets. We compare to the baseline (Base) from the work of Thorne et al. (2018). For comparison, we also provide the results when using gold document retrieval with a perfect oracle to illustrate the upper bound for our model (Gold). First, we illustrate the difference in performance when we train a feedforward network to score the sentences individually (MLP) instead of recurrently with a pointer network. In this setting, the sentence extraction in Figure 3 is replaced by a 2-layer feedforward network that is individually applied to every sentence in the memory. The output of the network is a score which is then used to rank the top l = 5 sentences. The model is still trained using multi-task learning but with a binary cross-entropy loss in Equation 7 instead of L(θ_ptr). Furthermore, we show the results of the sentence extraction and relation prediction components when trained separately (Sep.). We finally present the best results, when trained with multi-task learning (MTL) and then tuned (Tune).

Our results demonstrate that jointly training a pointer network and relation prediction classifier improves over training separately. We also note that the pointer network, which extracts sentences recurrently by considering the previous sentence, improves over selecting sentences independently using an MLP. Although we obtained improvement by tuning, the improvement is slight, which suggests that the parameter space discovered by multi-task learning already covers most of the examples where the model can both identify the correct sentences and label. Finally, we notice that improvements to document retrieval would also improve our model. The gap between Gold and Tune is around 20 points for evidence recall. When using gold Wikipedia articles (1-2 documents), the number of sentences available (around 20 on average) is less than those selected for our model (l = 50), which makes evidence retrieval easier as the memory is smaller. As our models for document retrieval are fairly simple, it is likely that a more complex model could obtain better performance with fewer documents.

        LA    EF1   F
Paper   62.2  31.6  53.5
Blind   59.7  29.7  49.9
Best    68.2  52.9  64.2

Figure 5: Blind Test Results (LA: Label Accuracy, EF1: Evidence F1, F: FEVER Score)

Results on the shared task blind test set (prior to any additional human evaluation of the evidence) are presented in Figure 5. For comparison, we show the results on the paper development and test sets (Paper) when submitted for the shared task as well as the results of the top system on the leaderboard (Best). On the blind test set (Blind), overall performance drops by 2-3 points for every metric compared to the paper set. As our analysis in Section 4.1 shows, the performance drops significantly when 2 or more evidence sentences are required. Thus, the performance decrease on the blind test set may be caused by an increase in the number of examples that require additional evidence, although as of this writing the evidence for the test set has not yet been released.

4.1 Analysis

We present an analysis of the performance of the best model (Tune) when the model requires multiple pieces of evidence. We present results in Figure 6 for no evidence (the "not enough info" case) and for 1, 2, and 3 or more sentences, for label accuracy, evidence recall, and document recall at k = 5. When no evidence is required, the model only obtains 50% accuracy. We found that this increases to 54% accuracy for the MTL model but the accuracy on the other 2 classes decreases. This suggests that using a larger memory improves performance when there is not enough information to make a prediction. Intuitively, reading all relevant Wikipedia articles in their entirety would be necessary for a human to determine this as well. When only 1 sentence is required the model performs well. However, the performance drops significantly on recall when 2 or more sentences are required. This is largely due to the performance of the document retrieval component, which seems to perform poorly when the evidence retrieval requires 2 different Wikipedia articles. When exactly 2 sentences are required, the pointer network retrieves around 60% of the evidence if the retrieved documents are correct. When 3 or more sentences are required, both components perform poorly, suggesting there is significant room for improvement in this case (although there are very few examples requiring this amount of evidence).

Num.  LA   ER    DR
0     50   N/A   N/A
1     74   86    93
2     65   17    26
3+    73   3     14

Figure 6: Paper Development Results by Number of Evidence Sentences Required (Num.: Sentences Required, LA: Label Accuracy, ER: Evidence Recall, DR: Document Recall)

5 Conclusion

We presented the results of our system for the FEVER shared task. We described our document retrieval system using lexical and syntactic features. We also described our joint sentence extraction and relation prediction system with multi-task learning. The results of our model suggest that the largest gains in performance are likely to come from improvements to detection of the "not enough info" class and retrieval of Wikipedia articles (especially when more than one is required).

For future work, we plan to improve document retrieval and experiment with different sentence representations. Using title features improved document retrieval, but for non-Wikipedia data titles would not be available. Furthermore, in other datasets, the titles may not exactly match the text of a claim very often, and named entity disambiguation is sometimes needed. One avenue to explore is neural topic modeling trained using article titles (Bhatia et al., 2016).


References

Shraey Bhatia, Jey Han Lau, and Timothy Baldwin. 2016. Automatic labelling of topics with neural embeddings. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 953–963. The COLING 2016 Organizing Committee.

Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017a. Reading Wikipedia to answer open-domain questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1870–1879. Association for Computational Linguistics.

Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, Hui Jiang, and Diana Inkpen. 2017b. Enhanced LSTM for natural language inference. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1657–1668. Association for Computational Linguistics.

Yen-Chun Chen and Mohit Bansal. 2018. Fast abstractive summarization with reinforce-selected sentence rewriting. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 675–686. Association for Computational Linguistics.

Diederik Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR).

Quoc Le, Ilya Sutskever, Oriol Vinyals, and Lukasz Kaiser. 2016. Multi-task sequence to sequence learning. In International Conference on Learning Representations (ICLR).

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pages 55–60.

Ramesh Nallapati, Feifei Zhai, and Bowen Zhou. 2017. SummaRuNNer: A recurrent neural network based sequence model for extractive summarization of documents. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA, pages 3075–3081.

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. 2015. End-to-end memory networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2440–2448. Curran Associates, Inc.

James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018. FEVER: a large-scale dataset for fact extraction and verification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 809–819. Association for Computational Linguistics.

Oriol Vinyals, Samy Bengio, and Manjunath Kudlur. 2016. Order matters: Sequence to sequence for sets. In International Conference on Learning Representations (ICLR).

Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015. Pointer networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2692–2700. Curran Associates, Inc.


QED: A Fact Verification System for the FEVER Shared Task

Jackson Luken
Department of Computer Science
The Ohio State University
[email protected]

Nanjiang Jiang, Marie-Catherine de Marneffe
Department of Linguistics
The Ohio State University

jiang.1879,[email protected]

Abstract

This paper describes our system submission to the 2018 Fact Extraction and VERification (FEVER) shared task. The system uses a heuristics-based approach for evidence extraction and a modified version of the inference model by Parikh et al. (2016) for classification. Our process is broken down into three modules: potentially relevant documents are gathered based on key phrases in the claim, then any possible evidence sentences inside those documents are extracted, and finally our classifier discards any evidence deemed irrelevant and uses the remaining to classify the claim's veracity. Our system beats the shared task baseline by 12% and is successful at finding correct evidence (evidence retrieval F1 of 62.5% on the development set).

1 Introduction

The FEVER shared task (Thorne et al., 2018) sets out with the goal of creating a system which can take a factual claim and either verify or refute it based on a database of Wikipedia articles. The system is evaluated on the correct labeling of the claims as "Supports," "Refutes," or "Not Enough Info" (NEI) as well as on valid evidence to support the label (except in the case of "NEI"). Each claim can have multiple evidence sets, but only one set needs to be found so long as the correct label is applied. Figure 1 gives an example of a claim along with the evidence sets that support it, as well as a claim and the evidence that refutes it.

We split the task into three distinct modules, with each module building on the data of the previous one. The first module is a document finder, finding key terms in the claim which correspond to the titles of Wikipedia articles and returning those articles. The second module takes each document found and finds all sentences which are close enough to the claim to be considered evidence. Finally, all sentences retrieved for a given claim are classified using an inference system as supporting or refuting the claim, or as "NEI". In the following sections, we detail each module, providing results on the FEVER development set which consists of 19,998 claims (6,666 in each class). Our system focuses on finding evidence sets composed of only one sentence. Of the 13,332 verifiable ("Supports" or "Refutes") claims in the development set, only 9% cannot be satisfied with an evidence set consisting of only one sentence. The code for our system is available at https://github.com/jluken/FEVER.

"Supports" Claim: Ann Richards was professionally involved in politics.
Evidence set 1: Dorothy Ann Willis Richards (September 1, 1933 - September 13, 2006) was an American politician and 45th Governor of Texas.
Evidence set 2: A Democrat, she first came to national attention as the Texas State Treasurer, when she delivered the keynote address at the 1988 Democratic National Convention.

"Refutes" Claim: Andrew Kevin Walker is only Chinese.
Evidence set: Andrew Kevin Walker (born August 14, 1964) is an American BAFTA-nominated screenwriter.

Figure 1: Claim/evidence examples from the FEVER data.

2 Document Finder

To verify or refute a claim, we start by finding Wikipedia articles that correspond to the claim. Key phrases within the claim are extracted and checked against Wikipedia article titles. If a key phrase matches an article title, the corresponding document is returned as potentially containing relevant evidence to assess the claim's veracity.

2.1 Wiki Database Preprocessing

We created three maps of the Wikipedia article titles to deal with unpredictable capitalization and pages with a supplemental descriptor in the title via parenthesis (e.g., "Tool (band)" for the music group vs. the physical item). The first map is simply a case-sensitive map of the document text mapped to its title. The second is titles mapped to lowercase. The third is a list of every title with a parenthesis description mapped to its root title without parenthesis. These are used as "backup" documents to be searched if no evidence is found in documents returned with the two other maps.
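The three maps can be sketched as plain dictionaries (a minimal illustration under our reading of this setup; the function and variable names are ours, and the maps here return candidate titles rather than document text):

    import re
    from collections import defaultdict

    def build_title_maps(titles):
        # titles: iterable of Wikipedia article titles from the FEVER dump (hypothetical input)
        exact = {}                          # case-sensitive title lookup
        lowered = {}                        # lowercased title -> original title
        parenthetical = defaultdict(list)   # root title -> titles carrying a "(...)" descriptor
        for title in titles:
            exact[title] = title
            lowered[title.lower()] = title
            match = re.match(r"^(.*?) \((.+)\)$", title)
            if match:
                parenthetical[match.group(1)].append(title)
        return exact, lowered, parenthetical

    def lookup(key_phrase, exact, lowered, parenthetical):
        # The parenthetical map serves as the "backup" described above.
        if key_phrase in exact:
            return [key_phrase]
        if key_phrase.lower() in lowered:
            return [lowered[key_phrase.lower()]]
        return parenthetical.get(key_phrase, [])

    exact, lowered, paren = build_title_maps(["Tool", "Tool (band)", "Ann Richards"])
    print(lookup("ann richards", exact, lowered, paren))   # ['Ann Richards']
    print(lookup("Tool", exact, lowered, paren))           # ['Tool']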

2.2 Key Phrase Identification

The key phrases aim at capturing the "topic" of the claim. We used capitalization, named entity, phrasal and part-of-speech tags, and dependencies from the CoreNLP system (Manning et al., 2014) to identify key phrases. Subject, direct object, and their modifier/complement dependencies are marked as "topics". Noun phrases containing those topic words are considered key phrases. Consecutive capitalized terms, allowing for lowercase words not capitalized in titles such as prepositions and determiners, are also considered key phrases. For instance, the key phrases for the claims in Figure 1 are: Ann Richards, politics; Andrew Kevin Walker.

Once all possible key phrases in a claim are found, each key phrase is checked against the maps of Wikipedia titles: if there is a full match between a key phrase and a title, the corresponding article is returned. If the article found is a disambiguation page, each article listed on the page is returned. If the disambiguation page is empty, the results from the parenthesis map are returned.

2.3 Results and Analysis

On the development set, when only considering documents found using the case-(in)sensitive maps, we achieve 19.1% precision and 84.8% recall, where at least one of the correct documents is found. However, when the backup documents are also taken into consideration, recall rises to 94.2% while precision drops to 7.5%. The drop in precision is largely due to disambiguation pages, for which every document listed on the page gets returned. At this stage, we focus on recall, extracting as many relevant documents as possible (7.64 on average per claim), which will be filtered out in later stages.

Most of the 5.8% of claims for which the system does not find any correct document involve noun phrases which CoreNLP fails to recognize (such as the song title In the End), as well as number mismatch between the claim and the Wikipedia article title (e.g., the system does not retrieve the page "calcaneal spur" for the claim "Calcaneal spurs are unable to be detected in any situation."). Working on lemmas could alleviate the latter issue.

3 Sentence Finder

Once all potential documents are collected by the Document Finder, each sentence within each document is compared against the claim to see if it is similar enough to be considered relevant evidence.

3.1 Claim Processing

The claim is processed to find information to check each document sentence against. We use the root of the claim and a list of all nouns and named entities in the claim. However, nouns and named entities included in the document's title are discarded from the list. This is done under the assumption that every sentence in a document pertains to the topic of that document (e.g., the second evidence in the "Supports" claim of Figure 1, from the document "Ann Richards", refers to the subject without explicitly stating so).

3.2 Extracting Evidence from Documents

A sentence is deemed potential evidence if it contains the root of the claim, when the root is a verb other than forms of be and have.

We also retrieve sentences whose words sufficiently match the claim's list of nouns and named entities:

- If two or more items are missing, the sentence is discarded.

- If all items in the noun and named entity list can be found in the document sentence, the sentence is added as evidence.

- If there is only one missing noun item, the sentence is added if there are at least two other matching items in the sentence, both the claim and document sentence are of the form "X is Y", or the document sentence contains a synonym of the noun, according to the MIT Java WordNet interface (Finlayson, 2014).


- If there is only one missing named entity, it can be swapped out with a named entity of the same label type. This allows the system to capture evidence for refuting a claim, such as a mismatch in nationality (e.g., swapping out "American" for "Chinese" in the Andrew Walker example in Figure 1). However, if a claim is centered around an action, determined by its root being a verb, an entity can be swapped only if the document sentence contains that same verb (or a synonym of the verb).

When the claim contains a reference to either a birth or a death, the document sentence needs only to have a date enclosed within a set of parentheses to be considered a valid piece of evidence.
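A minimal sketch of the noun/named-entity counting rules above (ignoring the root-verb condition, the "X is Y" pattern, WordNet synonyms, and the entity-swap refinement; all names and the example are illustrative):

    def is_potential_evidence(claim_items, sentence_tokens, title_tokens):
        # claim_items: nouns and named entities extracted from the claim
        # title_tokens: items appearing in the document title are ignored,
        # since every sentence is assumed to be about the document's topic
        items = [item for item in claim_items if item not in title_tokens]
        missing = [item for item in items if item not in sentence_tokens]
        if len(missing) >= 2:
            return False                        # two or more items missing: discard
        if not missing:
            return True                         # all items found: keep as evidence
        return len(items) - len(missing) >= 2   # one missing: need two other matches

    sentence = "Dorothy Ann Willis Richards was an American politician".split()
    print(is_potential_evidence(["Richards", "politician"], sentence, ["Ann", "Richards"]))  # True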

3.3 Results and Analysis

Given a hypothetical perfect Document Finder (PDF), the Sentence Finder achieves 51.9% precision and 50.3% recall on the development set. When using our existing Document Finder, precision drops to 24.0% and recall to 46.6%. However, we found that a number of sentences we retrieve are not part of the gold standard when they are in fact valid evidence for the claim. One such example is the sentence "As Somalia's capital city, many important national institutions are based in Mogadishu" to support the claim "There is a capital called Mogadishu." It is unclear how many examples of this there are.

If we evaluate the Sentence Finder on retrieving at least one accurate piece of evidence for the verifiable claims, it achieves an accuracy of 66.2% with the PDF, and 61.2% with ours. The Sentence Finder performs better on "Supports" claims (70.73% with the PDF) than on "Refutes" claims (61.58%).

On average, using the PDF, 1.14 sentences are returned for every claim for which evidence is found (51.9% of these being in the gold standard). For 24.5% of the verifiable claims, the system fails to return any evidence (20.4% of "Supports" claims, 28.5% of "Refutes" claims). Two of the most common causes of failure to retrieve evidence are misclassification of named entity labels or part-of-speech tags by the CoreNLP pipeline, as well as an unseen correlation of key phrases between the claim and evidence based on context. For instance, our system fails to retrieve any of the evidence for the claim in Figure 1, missing the contextual connection between politics and Governor or Democrat.

4 Inference

Once evidence sentences are retrieved, we use one of the state-of-the-art inference systems to assess whether the sentences verify the claim or not. We chose the decomposable attention model of Parikh et al. (2016) because it is one of the highest-scoring systems on the Stanford Natural Language Inference (SNLI) corpus (Bowman et al., 2015) that has a lightweight architecture with relatively few parameters.

4.1 Preprocessing

Evidence sentences are often in the middle of a paragraph in the document, and the entity the document is about is referred to with a pronoun or a definite description. For instance, The Southwest Golf Classic, in its Wikipedia article, is referred to with the pronoun it or the noun phrase the event. We thus made the simplifying assumption that each pronoun is used to refer to the entity the page is about, and perform a deterministic coreference resolution by replacing the first pronoun in the sentence with the name of the page.
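A sketch of this substitution (a simple first-match replacement over a fixed pronoun list; the pronoun list and function name are ours):

    import re

    PRONOUN_PATTERN = re.compile(r"\b(He|She|It|They|he|she|it|they)\b")

    def resolve_first_pronoun(sentence, page_title):
        # Replace only the first pronoun, treating it as a mention of the page's entity.
        return PRONOUN_PATTERN.sub(page_title, sentence, count=1)

    print(resolve_first_pronoun(
        "A Democrat, she first came to national attention as the Texas State Treasurer.",
        "Ann Richards"))
    # A Democrat, Ann Richards first came to national attention as the Texas State Treasurer.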

We ran a named entity recognizer trained on OntoNotes (Hovy et al., 2006) on claim and evidence sentences to extract all the named entities and their types. The named entities concatenated with the sentence are fed into the word embedding layers, whereas the named entity types are fed into the entity embedding layers, as described below.

4.2 Embedding

We used GloVe word embeddings (Pennington et al., 2014) with 300 dimensions, pre-trained on Common Crawl, to get a vector representation of the evidence sentence. We also experimented with training GloVe word embeddings using the provided Wikipedia data, but found that they did not perform as well as the pre-trained vectors. The word embeddings were normalized to 200 dimensions as described in Parikh et al. (2016). For entity types, we trained an entity type embedding of 200 dimensions. The word embeddings and entity embeddings are concatenated together and used as the input to the network.
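A rough sketch of this input representation under our reading of the setup (a learned 300-to-200 projection of the GloVe vectors concatenated with a 200-dimensional entity-type embedding; the choice of 19 type ids, 18 OntoNotes types plus a "none" tag, and all names are our assumptions):

    import torch
    import torch.nn as nn

    class TokenEncoder(nn.Module):
        # Projects 300-d GloVe vectors to 200 d and concatenates a 200-d
        # embedding of each token's named-entity type (index 0 = no entity).
        def __init__(self, num_entity_types=19):
            super().__init__()
            self.project = nn.Linear(300, 200, bias=False)
            self.type_embedding = nn.Embedding(num_entity_types, 200)

        def forward(self, glove_vectors, entity_type_ids):
            # glove_vectors: (seq_len, 300); entity_type_ids: (seq_len,)
            return torch.cat([self.project(glove_vectors),
                              self.type_embedding(entity_type_ids)], dim=-1)

    encoder = TokenEncoder()
    output = encoder(torch.randn(7, 300), torch.randint(0, 19, (7,)))
    print(output.shape)   # torch.Size([7, 400])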

4.3 Training

All pairs of evidence/claim from the FEVER training data are fed into the network for training.


                      Development set                                    Test set
                      Baseline   Our system                              Baseline   Our system
                      All        All     Supports   Refutes   NEI        All        All
FEVER score           31.3       43.9    54.9       24.7      52.0       27.5       43.4
Label Accuracy        51.4       44.7    68.6       31.3      52.0       48.8       50.1
Evidence Precision    N/A        77.5    77.0       78.1      –          N/A        N/A
Evidence Recall       N/A        52.3    56.3       47.8      –          N/A        N/A
Evidence F1           17.2       62.5    65.0       59.3      –          18.3       58.5

Table 1: Scores on the FEVER development and test sets. Baseline is the system from Thorne et al. (2018). The results are prior to human evaluation of the evidence.

Since the "NEI" class does not have evidence associated with it, we used the evidence found by our Sentence Finder for training the "NEI" class. If our Sentence Finder did not return any evidence for a "NEI" claim, we randomly sampled five sentences from the Wiki database and used them as evidence.¹

The network is trained using the Adam optimizer with a learning rate of 0.002, a batch size of 140, and a dropout ratio of 0.2. The network weights are repeatedly saved and we used the model performing best on the FEVER development set.

4.4 Assigning Class Labels

The network outputs a probability distribution for whether the evidence/claim pair has label "Supports", "Refutes", or "NEI". For a given claim, we examine the labels assigned to all evidence sentences returned for that claim. First, we discard the evidence labeled as "NEI". If there is no evidence left, we mark the claim as "NEI". Otherwise, we add together the remaining prediction distributions and use the highest-scoring label as the label for the claim. We return the five highest-scored evidence sentences, including those marked "NEI".
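This aggregation rule can be sketched as follows (a minimal illustration; `predictions` pairs each retrieved evidence sentence with the classifier's probability distribution, and all names are ours):

    import numpy as np

    LABELS = ["Supports", "Refutes", "NEI"]

    def label_claim(predictions):
        # predictions: list of (evidence_sentence, probs) with probs over LABELS
        kept = [(sent, probs) for sent, probs in predictions
                if LABELS[int(np.argmax(probs))] != "NEI"]     # drop evidence judged "NEI"
        if not kept:
            return "NEI"
        summed = np.sum([probs for _, probs in kept], axis=0)  # add remaining distributions
        return LABELS[int(np.argmax(summed))]

    print(label_claim([("ev1", np.array([0.7, 0.1, 0.2])),
                       ("ev2", np.array([0.2, 0.1, 0.7]))]))   # Supports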

4.5 Results and Analysis

The resulting scores on the development and test sets are in Table 1 (prior to human evaluation of the evidence retrieved by the system). The FEVER score is the percentage of claims for which a complete set of evidence is found and the correct label is assigned. Precision, Recall, and F1 are the metrics for evaluating evidence retrieval (evidence retrieval is not evaluated for the "NEI" class).

¹We could try the NEARESTP sampling method described in Thorne et al. (2018), which achieves better performance with a decomposable attention model for inference than random sampling.

Gold \ Labeled as    Supports   Refutes   NEI
Supports             68.59      2.87      28.55
Refutes              31.13      31.26     37.61
NEI                  42.29      5.75      51.97

Table 2: Contingency matrix (percentage) on the development set.

Table 2 shows the percentage of claims labeled as each class in the development set. We see that both "Refutes" and "NEI" claims are often mislabeled as "Supports", whereas "Supports" claims are often mislabeled as "NEI".

Upon closer inspection of the classification errors, we see that some fine-grained lexical semantics and world knowledge are required to predict the correct label, which the model was not able to capture. For example, the claim "Gin is a drink" is supported by the sentence "Gin is a spirit which derives its predominant flavour from juniper berries (Juniperus communis)", but our system classified the pair as "Refutes".

The network also seemed to pick up on some lexical features present in the annotations. The claim "The Wallace mentions people that never existed" has gold label "NEI", but is labeled as "Refutes" with high probability using three different evidence sentences we retrieved, even though some of the sentences are not relevant at all. This is probably because the word "never" is highly indicative of the "Refutes" class, as we shall see in the next section.

5 Discussion

Our system beats the shared task baseline on evidence retrieval F1 (62.5% vs. 17.2%) and FEVER score (43.9% vs. 31.3%) on the development set. On the test set, prior to human evaluation of the evidence, our system ranked 7th out of 23 teams with a FEVER score of 43.4%. For evidence retrieval F1, we ranked 2nd with a score of 58.5%.

Gururangan et al. (2018) pointed out that natural language inference datasets often contain annotation artifacts. They found that many lexical/syntactic features are highly predictive of entailment classes in most natural language inference datasets. We performed the same analysis on the FEVER training set to see whether a similar pattern holds. We calculated the probability distribution of the length of the claims in tokens for each class. Contrary to Gururangan et al.'s results, all classes have similar mean and standard deviation of sentence length. We also calculated the pointwise mutual information (PMI) between each word and class. We found that negation words such as not, never, neither, and nor have higher PMI values for the "Refutes" class than for the other classes. This is similar to Gururangan et al.'s observation that negation words are strong indicators of contradiction in the SNLI dataset. The "Refutes" claims in the FEVER training data indeed show a high percentage of negation words² (13.9% vs. 0.1% for "Supports" and 1.3% for "NEI").

²We used the neg dependency tag as the criterion for a negation word.
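This analysis can be reproduced with a simple count-based PMI estimate (a sketch; the tokenization and the absence of smoothing are our choices):

    import math
    from collections import Counter

    def word_class_pmi(tokenized_claims, labels):
        # PMI(w, c) = log [ p(w, c) / (p(w) p(c)) ], with probabilities estimated
        # from claim-level counts (a word counts once per claim).
        word_counts, class_counts, joint = Counter(), Counter(labels), Counter()
        for tokens, label in zip(tokenized_claims, labels):
            for word in set(tokens):
                word_counts[word] += 1
                joint[(word, label)] += 1
        n = len(tokenized_claims)
        return {(w, c): math.log((joint[(w, c)] / n) /
                                 ((word_counts[w] / n) * (class_counts[c] / n)))
                for (w, c) in joint}

    claims = ["the wallace mentions people that never existed".split(),
              "ann richards was professionally involved in politics".split()]
    pmi = word_class_pmi(claims, ["Refutes", "Supports"])
    print(pmi[("never", "Refutes")])   # positive: "never" is associated with "Refutes"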

Another source of bias comes from the way evidence annotation in the gold standard has been created, with humans manually verifying the claims in Wikipedia. As pointed out in Section 3.3, evidence automatically retrieved can be correct even though not present in the gold standard. The way a human fact-checks might be different from what a computer can achieve. It would be interesting to analyze the evidence correctly retrieved by the systems participating in the shared task but not present in the gold standard, to see whether some patterns emerge.

References

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642.

Mark Alan Finlayson. 2014. Java libraries for accessing the Princeton WordNet: Comparison and evaluation. In Proceedings of the 7th International Global WordNet Conference, pages 78–85. H. Orav, C. Fellbaum, & P. Vossen (Eds.).

Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel R. Bowman, and Noah A. Smith. 2018. Annotation artifacts in natural language inference data. In Proceedings of NAACL-HLT 2018, pages 107–112.

Eduard Hovy, Mitchell Marcus, Martha Palmer, Lance Ramshaw, and Ralph Weischedel. 2006. OntoNotes: The 90% solution. In Proceedings of the Human Language Technology Conference of the NAACL, pages 57–60.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pages 55–60.

Ankur P. Parikh, Oscar Tackstrom, Dipanjan Das, and Jakob Uszkoreit. 2016. A decomposable attention model for natural language inference. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2249–2255.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018. FEVER: A large-scale dataset for Fact Extraction and VERification. In Proceedings of NAACL-HLT 2018, pages 809–819.


Team UMBC-FEVER: Claim Verification using Semantic Lexical Resources

Ankur Padia, Francis Ferraro and Tim Finin
Computer Science and Electrical Engineering
University of Maryland, Baltimore County
Baltimore, MD 20150 USA
pankur1, ferraro, [email protected]

Abstract

We describe our system used in the 2018 FEVER shared task. The system employed a frame-based information retrieval approach to select Wikipedia sentences providing evidence, and used a two-layer multilayer perceptron to classify a claim as correct or not. Our submission achieved a score of 0.3966 on the Evidence F1 metric, an accuracy of 44.79%, and a FEVER score of 0.2628.

1 Introduction

We describe our system and its use in the FEVER shared task (Thorne et al., 2018). We focused on two parts of the problem: (i) information retrieval and (ii) classification. For the first, we opted for a linguistically-inspired approach: we automatically annotated claim sentences and Wikipedia page sentences with syntactic features and semantic frames from FrameNet (Baker et al., 1998a) and used the result to retrieve sentences relevant to the claims that provide evidence of their veracity. For classification, we used a simple two-layer perceptron and experimented with several configurations to determine the optimal settings.

Though the overall classification score of our best version was lower than the best approach from Thorne et al. (2018), which used a more sophisticated classification approach, we scored 10th out of 24 for the information retrieval task (measured by F1). Our system worked well on the IR task, obtaining a relative improvement of 131% in evidence retrieval over the baseline F1 measure of Thorne et al. (2018).

2 Approach

The FEVER task requires systems to assess a sentence making one or more factual claims (e.g., "Rocky Mountain High is an Australian song") as true or false by finding sentences in Wikipedia that provide evidence to support or refute the claim(s).

Figure 1: Our system used a semantic frame approach to support both retrieval of claim evidence and classification of claims.

This naturally leads to two sub-tasks: an information retrieval task that returns a set of Wikipedia sentences that are relevant to the assessment, and a classification task that analyzes the evidence and labels the claim as Supported, Refuted or NotEnoughInfo. Figure 1 shows the overall flow of our system, which uses semantic frames to analyze and match a claim sentence to potential evidence sentences and a multilayer perceptron for claim classification.

2.1 Finding Relevant Evidence Sentences

Our approach used semantic frames from FrameNet (Baker et al., 1998b) as part of the analysis in matching a claim with sentences that might provide evidence for its veracity. A frame is a semantic schema that describes a situation, event or relation and its participants. The FrameNet collection has more than 1,200 frames and 13,000 lexical units, which are lemmas that evoke or trigger a frame; see Fig. 2 for an example of this schema. Complex concepts and situations can be described by multiple frames. As an example, the sentence "John bought a new bike" can trigger two frames: "Claim ownership"¹ and "Commerce buy".²

Who        Classifier  Training type                   Classification          Predicting evidence
                                                       FEVER score   ACC       Precision   Recall   F1
UMBC1      MLP         NFC                             0.2572        0.4398    0.4868      0.3346   0.3966
UMBC2      MLP         NFUC                            0.2628        0.4479    0.4868      0.3346   0.3966
UMBC3      MLP         NFIC                            0.2599        0.4069    0.4868      0.3346   0.3966
Baseline1  MLP         (Thorne et al., 2018, Tab. 4)   0.1942        40.64     –           –        –
Baseline2  DA          CodaLab results                 0.3127        0.5137    –           –        0.1718

Table 1: Performance of the system on the development dataset under different settings. We achieve comparable classification performance with a simple classifier model thanks to better evidence retrieval.

Figure 2: The FrameNet Commerce buy frame and its immediate neighbors.

Our annotation processing differed slightly for claims and potential evidence. We processed the claims in the dataset with the annotation pipeline described in Ferraro et al. (2014). Each claim was annotated using a named entity recognizer, dependency parser and POS tagger from CoreNLP (Manning et al., 2014) and also by a frame annotator (Das et al., 2010). For the evidence sentences, we used the pre-existing semantically annotated version of Wikipedia (Ferraro et al., 2014) that contained the same types of annotations for all Wikipedia pages from a 2016 Wikipedia dump, serialized as Thrift objects using the Concrete schema (HLTCOE, 2018).

Depending on the dataset, we performed document and sentence retrieval. We did only sentence retrieval for the training data; for the development and testing data we did both document and sentence retrieval. Our motivation was to understand the effect of frame-based retrieval, assuming the named entity recognizer correctly identified the entity.

For the training dataset, we used the dataset's document titles to retrieve Wikipedia documents directly and chose the sentences that triggered some frame as candidate evidence sentences.

¹The Claimant asserts rights or privileges, typically ownership, over some Property. The Claimant may be acting on behalf of a Beneficiary.

²The Buyer wants the Goods and offers Money to a Seller in exchange for them.

We applied exact frame matching, in which a sentence is predicted as evidence if it triggers a frame that is also triggered by the claim. All sentences that had an exact match to one of the claim's frames were added to the final evidence set.
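A minimal sketch of this exact-match filter, assuming frames have already been predicted for the claim and for every document sentence (the data structures and example are illustrative):

    def frame_match_evidence(claim_frames, doc_sentences):
        # claim_frames: set of frame names triggered by the claim
        # doc_sentences: list of (sentence_text, frames_triggered) pairs
        return [sentence for sentence, frames in doc_sentences
                if claim_frames & set(frames)]

    claim_frames = {"Commerce buy"}
    document = [("John bought a new bike.", {"Commerce buy", "Claim ownership"}),
                ("It has two wheels.", {"Dimension"})]
    print(frame_match_evidence(claim_frames, document))   # ['John bought a new bike.']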

The extraction of the documents and the sentences was different in the case of the development and testing datasets, as no gold standard document identifiers were available. We used a two-layer multilayer perceptron to label the claim, given the evidence, as either SUPPORTS, REFUTES or NotEnoughInfo.

One complication was that the evidence was extracted from a different Wikipedia dump (2017) than our frame-annotated Wikipedia corpus (2016). While the page titles aligned well between the two Wikipedia dumps, their sentences exhibited more variation. This is the result of Wikipedia editors making changes to the page, including rearrangements, updates, adding material or stylistic modifications. In order to find the correct sentence index in the page, we used the Hungarian algorithm (Kuhn, 1955) to find the matching sentences. We cast this problem as a dissimilarity minimization problem, where the dissimilarity between a pair of sentences was 1 minus the Jaccard similarity over the sets of sentence tokens.
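This alignment step can be sketched with SciPy's assignment solver (a sketch under our reading; the tokenization and names are ours):

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def jaccard_dissimilarity(sentence_a, sentence_b):
        # 1 minus the Jaccard similarity over the two sentences' token sets.
        a, b = set(sentence_a.lower().split()), set(sentence_b.lower().split())
        return 1.0 - len(a & b) / max(len(a | b), 1)

    def align_sentences(annotated_page, fever_page):
        # Hungarian algorithm (Kuhn, 1955) over the pairwise dissimilarity matrix,
        # mapping sentence indices of the 2016 page to those of the 2017 page.
        cost = np.array([[jaccard_dissimilarity(old, new) for new in fever_page]
                         for old in annotated_page])
        old_idx, new_idx = linear_sum_assignment(cost)
        return dict(zip(old_idx.tolist(), new_idx.tolist()))

    old_page = ["Rocky Mountain High is a folk rock song written by John Denver.",
                "It is one of the two official state songs of Colorado."]
    new_page = ["It is one of two official state songs of Colorado.",
                "Rocky Mountain High is a folk rock song by John Denver and Mike Taylor."]
    print(align_sentences(old_page, new_page))   # {0: 1, 1: 0}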

2.2 Classifying Claims Given the Evidence

To produce features, we converted each claim word to a 400-dimension embedding (Mikolov et al., 2013) representation and took the average over the length of the claim, using a zero vector for out-of-vocabulary words. We trained a two-layer MLP to label the claim using stochastic gradient descent, with L2 regularization and dropout to avoid overfitting. We chose the final parameter values for the claim classifier that gave the best result on the development dataset, which are shown in Table 2.


Parameter           Value
learning rate       0.01
number of layers    2
optimizer           SGD
hidden layer size   50
L2 regularizer      1e-06
epochs              2
batch size          64
dropout             0.5

Table 2: MLP classifier parameter values.
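Under the parameters in Table 2, the claim classifier can be sketched roughly as follows (a sketch, not the authors' code; the averaged 400-d embedding features are described in the text, and everything else here is illustrative):

    import torch
    import torch.nn as nn

    class ClaimMLP(nn.Module):
        # Two-layer MLP over the averaged 400-d claim embedding (Table 2 settings).
        def __init__(self, input_dim=400, hidden_size=50, num_classes=3, dropout=0.5):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(input_dim, hidden_size), nn.ReLU(), nn.Dropout(dropout),
                nn.Linear(hidden_size, num_classes))

        def forward(self, features):
            return self.net(features)

    model = ClaimMLP()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-6)
    loss_fn = nn.CrossEntropyLoss()

    # One illustrative SGD step on a random batch of 64 feature vectors.
    features, labels = torch.randn(64, 400), torch.randint(0, 3, (64,))
    optimizer.zero_grad()
    loss = loss_fn(model(features), labels)
    loss.backward()
    optimizer.step()
    print(float(loss))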

3 Ablation Study

We explored our approach by evaluating performance with settings corresponding to three different information retrieval strategies:

• NFC: NER document retrieval + Frame sentence retrieval + Classification

• NFUC: NER document retrieval + Frame sentence retrieval + (union) introduction section of the Wikipedia page (Thorne et al., 2018) + Classification

• NFIC: NER document retrieval + Frame sentence retrieval + (intersection) introduction section of the Wikipedia page + Classification

3.1 Results

Table 1 shows the performance of our system when trained with the three different settings. The performance for predicting evidence is the same across settings, as we retrieve the frame-based sentences from the documents and add FEVER-processed Wikipedia sentences on the fly at training time. The addition of FEVER-processed Wikipedia sentences slightly increases the classification performance of the system.

Since the frame annotator is not perfect, it sometimes fails to trigger appropriate frames. This means that while the vast majority of claims could be matched with potential evidence, there are claims that cannot be matched with evidence. Such cases were not rare: in the development set, 21.43% of the claims could not be matched with evidence sentences (the testing and training datasets had miss rates of 17.43% and 25.78%, respectively).

As evident from the classification performance, additional data improves performance, with NFUC performing better than the other two settings. On the test dataset, we scored nearly twice as well as the baseline in terms of information retrieval with simple frame matching.

Actual \ Predicted   Support   Refute   Neither
Support              4646      171      1849
Refute               3050      1198     1618
Neither              4123      391      2152

(a) Dev confusion matrix for frame-based sentence retrieval only (NFC).

Actual \ Predicted   Support   Refute   Neither
Support              4499      173      1994
Refute               2777      2125     1764
Neither              3968      365      2333

(b) Dev confusion matrix for the union of frame-based and introduction-based sentence retrieval (NFUC).

Actual \ Predicted   Support   Refute   Neither
Support              4370      87       2209
Refute               3122      1474     2070
Neither              4159      214      2293

(c) Dev confusion matrix for the intersection of frame-based and introduction-based sentence retrieval (NFIC).

Table 3: Classification-without-provenance accuracy confusion matrices on the development dataset for the three classes under each setting.

This is evidence for the effectiveness of using semantic frames in determining the credibility of a claim, despite the recall issues discussed above.

3.2 Discussion

The three settings had similar performance measures because the set of sentences found by our system was a superset of those found by the human assessors. Our frame-based retrieval found 516,670 evidence sentences when matching frames across the entire document mentioning entities, and not just the introduction section. The set found by assessors included 34,797 evidence sentences, all of which were included in the frame-based retrieval set.

Fig. 3 shows a correct and an incorrect example. A manual examination revealed that the predicted evidence was correct nearly every time an appropriate frame was present in the document. When a frame is in the claim and not in the document, the retrieval component gives empty results and, depending on the gold standard, the performance suffers.


Claim: Last Man Standing does not star Tim Allen
Predicted evidence (Correct):

1. Timothy Allen Dick (born June 13, 1953), known professionally as Tim Allen, is an American actor, comedian and author

2. He is known for his role as Tim "The Toolman" Taylor in the ABC television show Home Improvement (1991) as well as for his starring roles in several films, including the role of Buzz Lightyear in the Toy Story franchise

3. From 2011 to 2017, he starred as Mike Baxter in the TV series Last Man Standing

Predicted Label: REFUTES (due to evidence (3))
Actual Label: REFUTES

(a) Relevant evidence is correctly retrieved and is correctly classified as refuting the claim.

Claim: Rocky Mountain High is an Australian song
Predicted evidence (Correct):

1. "Rocky Mountain High" is a folk rock song written by John Denver and Mike Taylor about Colorado, and is one of the two official state songs of Colorado

2. The song also made #3 on the Easy Listening chart, and was played by some country music stations

3. Denver told concert audiences in the mid-1970s that the song took him an unusually long nine months to write

4. Members of the Western Writers of America chose it as one of the Top 100 Western songs of all time

Predicted Label: SUPPORTS
Actual Label: REFUTES

(b) Relevant evidence is correctly retrieved, but was misclassified by the classifier as supporting the claim.

Figure 3: Error analysis examples of predicting evidence and classification. As evident from the examples, the frame-based retrieval extracts high-quality evidence sentences when they are available in the document. However, the performance of the system depends on the classifier predictions and on the accuracy of the automatic frame annotator.

The mismatch happens due to differences between the Wikipedia dump versions: the FEVER dataset used a 2017 dump and we used one from 2016. A second error source was annotation/misclassification by our frame annotation system. However, whenever there is a match, the quality of the evidence is high, as shown by the first and second example claims. Table 3 shows the confusion matrices for the three classes (Support, Refute, Not enough information) for each of the three settings.

4 Discussion and Conclusion

Our submission was an initial attempt to explore the idea of using semantic frames to match claims with sentences providing evidence that might support or refute them. The approach has the advantage of being able to exploit relations between frames, such as entailments, temporal ordering, causality and generalization, that can capture common sense knowledge. While the classification scores were lower than we hoped, the evidence retrieval scores represent impressive and promising improvements.

We plan to continue developing the approach and add it as a component of a larger system for cleaning noisy knowledge graphs (Padia, 2017; Padia et al., 2018).

We expect that the performance measures will improve when the datasets are all extracted from identical Wikipedia versions. Possible enhancements include using the Kelvin (Finin et al., 2015) information extraction system to add entity coreference and better entity linking to a knowledge graph of background knowledge, such as Freebase, DBpedia or Wikidata. This will support linking nominal and pronominal mentions to a canonical named mention and provide access to more common aliases for entities. Such features have been shown to improve entity-based information retrieval (Van Durme et al., 2017).

We also hope to exploit Kelvin's ability to reason about entities and relations. Its knowledge graph knows, for example, that while one can only be born in a single geo-political location, such places are organized in a part-of hierarchy. An event that happens in a place can be said to also take place at its enclosing locations. The system's background knowledge includes that Honolulu is part of Hawaii, which in turn is part of the United States. Moreover, it knows that if you were born in a country, it is very likely that you are a citizen of that country. This will allow it to recognize "Obama was born in Honolulu" as evidence that supports the claim that "Obama is a citizen of the U.S.".

Acknowledgments

This research was partially supported by gifts from the IBM AI Horizons Network and Northrop Grumman and by support from the U.S. National Science Foundation for UMBC's high performance computing environment.

References

Collin F. Baker, Charles J. Fillmore, and John B. Lowe. 1998a. The Berkeley FrameNet project. In Proceedings of the 17th International Conference on Computational Linguistics - Volume 1, COLING '98, pages 86–90. Association for Computational Linguistics.

Collin F. Baker, Charles J. Fillmore, and John B. Lowe. 1998b. The Berkeley FrameNet project. In Proceedings of the 17th International Conference on Computational Linguistics - Volume 1, pages 86–90. Association for Computational Linguistics.

Dipanjan Das, Nathan Schneider, Desai Chen, and Noah A. Smith. 2010. Probabilistic frame-semantic parsing. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 948–956. Association for Computational Linguistics.

Francis Ferraro, Max Thomas, Matthew R. Gormley, Travis Wolfe, Craig Harman, and Benjamin Van Durme. 2014. Concretely annotated corpora. In AKBC Workshop at NIPS.

Tim Finin, Dawn Lawrie, Paul McNamee, James Mayfield, Douglas Oard, Nanyun Peng, Ning Gao, Yiu-Chang Lin, Josh MacLin, and Tim Dowd. 2015. HLTCOE participation in TAC KBP 2015: Cold start and TEDL. In Text Analytics Conference (TAC).

HLTCOE. 2018. Concrete. http://hltcoe.github.io/concrete/.

Harold W. Kuhn. 1955. The Hungarian Method for the assignment problem. Naval Research Logistics Quarterly, pages 83–97.

Christopher Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 55–60.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

Ankur Padia. 2017. Cleaning Noisy Knowledge Graphs. In Proceedings of the Doctoral Consortium at the 16th International Semantic Web Conference, volume 1962. CEUR Workshop Proceedings.

Ankur Padia, Frank Ferraro, and Tim Finin. 2018. KG-Cleaner: Identifying and correcting errors produced by information extraction systems. arXiv preprint arXiv:1808.04816.

James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018. FEVER: a large-scale dataset for fact extraction and verification. CoRR, abs/1803.05355.

Benjamin Van Durme, Tom Lippincott, Kevin Duh, Deana Burchfield, Adam Poliak, Cash Costello, Tim Finin, Scott Miller, James Mayfield, Philipp Koehn, Craig Harmon, Dawn Lawrie, Chandler May, Max Thomas, Annabelle Carrell, and Julianne Chaloux. 2017. CADET: Computer Assisted Discovery Extraction and Translation. In 8th International Joint Conference on Natural Language Processing (System Demonstrations), pages 5–8.


A Mostly Unlexicalized Model For Recognizing Textual Entailment

Mithun Paul, Rebecca Sharp, Mihai Surdeanu
University of Arizona, Tucson, AZ, USA
mithunpaul, bsharp, [email protected]

Abstract

Many approaches to automatically recognizing entailment relations have employed classifiers over hand-engineered lexicalized features, or deep learning models that implicitly capture lexicalization through word embeddings. This reliance on lexicalization may complicate the adaptation of these tools between domains. For example, such a system trained in the news domain may learn that a sentence like "Palestinians recognize Texas as part of Mexico" tends to be unsupported, but this fact (and its corresponding lexicalized cues) has no value in, say, a scientific domain. To mitigate this dependence on lexicalized information, in this paper we propose a model that reads two sentences, from any given domain, to determine entailment without using lexicalized features. Instead, our model relies on features that are either unlexicalized or domain independent, such as the proportion of negated verbs, antonyms, or noun overlap. In its current implementation, this model does not perform well on the FEVER dataset, for two reasons. First, for the information retrieval portion of the task we used the provided baseline system, since this was not the aim of our project. Second, this is work in progress and we are still in the process of identifying more features and gradually increasing the accuracy of our model. In the end, we hope to build a generic end-to-end classifier which can be used in a domain outside the one in which it was trained, with no or minimal re-training.

1 Introduction

The rampant spread of fake data recently (be it in news or scientific facts) and its impact on our day-to-day life has renewed interest in the topic of disambiguating between fake, or unsupported, information and real, or supported, information (Wang, 2017).

Last year, the fake news challenge (Pomerleau and Rao, 2017) was organized as a valuable first step towards creating systems to detect inaccurate claims. This year the fact verification challenge (FEVER) (Thorne et al., 2018) was organized to further this effort. Specifically, it was organized to foster the development of systems that combine information retrieval (IR) and textual entailment recognition (RTE) to address the fake claim problem. However, developing a system that is trained to tackle the issue only in one area (in this case fake news detection) does not solve the problem in other domains. For example, models developed specifically to detect fake news might not work well to detect fake science articles.

An alternative would be to create systems that can be trained on broader domains and can then be used to test on specific domains with minimal modification and/or parameter tuning. Such a system should also be able to capture the underlying idiosyncratic characteristics of the human author who originally created such fake data. For example, some common techniques used by writers of such fake articles include using hedging words (e.g., possibly), circumventing facts, avoiding mentioning direct evidence, hyperbole, etc. To this end, we propose a largely unlexicalized approach that, when coupled with an information retrieval (IR) system that assembles relevant articles for a given claim, would serve as a cross-domain fake-data detection tool which could either stand alone or potentially supplement other domain-specific lexicalized systems.

The goal of this paper is to present a description of such a preliminary system developed for the FEVER challenge, its performance, and our intended future work.


2 Approach

We approach the task of distinguishing between fake and real claims in a series of steps. Specifically, given a claim, we:

1. Information retrieval (IR): We use an IR component to gather claim-relevant texts from a large corpus of evidence (i.e., Wikipedia articles). For retrieving the Wikipedia articles which contain sentences relevant to the given claim, we reused the competition-provided information retrieval system (Thorne et al., 2018), since we are focusing here on the RTE portion of the task. To be specific, we used DrQA (Chen et al., 2017) for document retrieval. For sentence selection, the modified DrQA with binning, compared against a unigram TF-IDF implementation using NLTK (Bird and Loper, 2004), was used. The k and l parameter values, for document retrieval and sentence selection respectively, were also left as is at 5. Any claim which had the label NOT ENOUGH INFO was removed from the Oracle setting.

2. Evidence aggregation: As part of the competition-provided IR system, we next aggregate the top 10 documents and combine the evidence sentences into a single document.

3. Classification: Finally, we compare the claim to the evidence document to classify it as either SUPPORTS or REFUTES. For our learning framework, we employ a support vector machine (SVM) (Hearst et al., 1998) with a linear kernel.

2.1 Features

In this section we describe the various groups of features that were used for the classification task in the last component of the approach introduced above. To create these features, the claim and evidence were tokenized and parsed using CoreNLP (Manning et al., 2014). The majority of the features are either proportions or counts, so as to maintain the unlexicalized aspect. Specific lexical content was used only when the semantics were domain independent (i.e., as with certain discourse markers such as however, possible, not, etc.).

• Word overlap: This set of features was based on the proportion of words that overlap between the claim and the evidence. Specifically, given a claim c and a body of evidence e, we compute the proportion of words in c ∪ e that overlap: |c ∩ e| / |c ∪ e|. We made similar features for verb and noun overlap as well, where we also include two sub-features for the proportion of words in c and the proportion of words in e: |c ∩ e| / |c| and |c ∩ e| / |e|. For all these features, we used the lemma form of the words and first removed stop words (see Appendix A for the list of stop words that were removed). In all, there were 5 features in this feature set, two each for noun and verb overlap as defined above and one for word overlap. (An implementation sketch of several of these feature groups follows this list.)

• Hedging words: Hedging is the process of using cautious or vague language to vary the strength of the argument in a given sentence. When present, it can indicate that the author is trying to circumvent facts. To capture this, we have a set of indicator features that mark the presence of any word from a given list of hedging words (see Appendix A for the list of hedging words used) in either the claim or evidence sentences. This feature set has a total of 60 hedging features. While these features are lexicalized, their semantics are domain-independent and therefore in scope of our approach.

• Refuting words and negations: When present, refuting words can indicate that the author is unequivocally disputing a claim. To capture this, as with the hedging words above, we include a set of indicator features for refuting words (see Appendix A for the complete list of refuting words used) that are present in either the claim or evidence sentences. Also as with the hedging features, the semantics of these words are expected to be consistent across domains. This feature is a one-hot vector denoting the existence (or absence) of any of the aforementioned 19 refuting words, creating a feature vector of length 19.

Another signal of disagreement between two texts is the presence of a verb in one text which is negated in the other text, largely regardless of the identity of the verb (e.g., Barack Obama was not born in the United States). To capture this, features were created to indicate whether a verb in the claim sentence was negated in the evidence and vice versa. This polarity indicator, created through dependency parsing, thus contained 4 features, each indicating a tuple (positive claim-negative evidence, negative claim-positive evidence, etc.).

• Antonyms: The presence of antonyms in evidence sentences may indicate contradictions with the claim (e.g., The movie was impressive vs. the movie was dreadful). This feature captures the number of nouns or adjectives in the evidence sentence that were antonyms of any noun or adjective in the claim sentence (and vice versa). Similar to the word overlap feature mentioned above, every such antonym feature has two sub-features, each denoting the proportion over antonyms in claim and evidence, respectively. Thus, there are a total of 4 antonym features. The list of antonyms used was extracted from WordNet (Miller, 1995).

• Numerical overlap: Human authors of fake articles often exaggerate facts (e.g., claiming Around 100 people were killed as part of the riots, when the evidence shows a lower number). To approximately measure this, we find the intersection and difference of numerical values between the claim and the evidence, making 2 features.

• Measures of lexical similarity: While the use of specific lexical items or their corresponding word embeddings goes against the unlexicalized, domain-independent aim of this work, here we use relative distributional similarities between the texts as features. In particular, the relative position of the words in an embedding space carries significant information for recognizing entailment (Parikh et al., 2016). To make use of this, we find the maximum, minimum, and average pairwise cosine similarities between each of the words in the claim and the evidence. We additionally include the overall similarity between the two texts, using a bag-of-words average to get a single representation for each text. We used the GloVe (Pennington et al., 2014) embeddings for these features.
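A minimal sketch of three of these feature groups (word overlap, WordNet antonym counts, and pairwise GloVe cosine similarities), feeding the linear-kernel SVM described in Section 2; thresholds, preprocessing, and names are illustrative, and `glove` is assumed to be a word-to-vector dictionary:

    import numpy as np
    from nltk.corpus import wordnet as wn     # assumes the WordNet corpus is installed
    from sklearn.svm import LinearSVC

    def overlap_features(claim_tokens, evidence_tokens):
        c, e = set(claim_tokens), set(evidence_tokens)
        inter = c & e
        return [len(inter) / max(len(c | e), 1),    # |c ∩ e| / |c ∪ e|
                len(inter) / max(len(c), 1),        # |c ∩ e| / |c|
                len(inter) / max(len(e), 1)]        # |c ∩ e| / |e|

    def antonym_count(claim_tokens, evidence_tokens):
        # Count evidence words that are WordNet antonyms of some claim word.
        antonyms = {ant.name().lower()
                    for word in claim_tokens
                    for synset in wn.synsets(word)
                    for lemma in synset.lemmas()
                    for ant in lemma.antonyms()}
        return sum(1 for word in evidence_tokens if word.lower() in antonyms)

    def similarity_features(claim_tokens, evidence_tokens, glove):
        # Max / min / mean pairwise cosine similarity between claim and evidence words.
        def cosine(a, b):
            return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
        sims = [cosine(glove[c], glove[e])
                for c in claim_tokens if c in glove
                for e in evidence_tokens if e in glove]
        return [max(sims), min(sims), float(np.mean(sims))] if sims else [0.0, 0.0, 0.0]

    def featurize(claim_tokens, evidence_tokens, glove):
        return (overlap_features(claim_tokens, evidence_tokens)
                + [antonym_count(claim_tokens, evidence_tokens)]
                + similarity_features(claim_tokens, evidence_tokens, glove))

    # Training the linear SVM on such vectors (X: feature matrix, y: SUPPORTS/REFUTES labels):
    # classifier = LinearSVC().fit(X, y)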

Model       Evidence F1   Label Accuracy   FEVER score
Baseline    0.1826        0.4884           0.2745
Our model   0.1826        0.3694           0.1900

Table 1: Performance of our submitted model on the test data.

Model                            Label Accuracy
Baseline (Thorne et al., 2018)   65.13
Our model at submission          55.60
Our model post submission        56.88

Table 2: Oracle classification of claims in the development set using gold sentences as evidence.

3 Experiments

3.1 Data and Tuning

We used the data from the FEVER challenge (Thorne et al., 2018), training on the 145,449 claims provided in the training set and tuning on the provided development set (19,998 claims). Since we were focusing only on the textual entailment part, we removed the claims which had the label NOT ENOUGH INFO during training. As a result, we trained on the remaining 109,810 claims in the training set and tuned on the remaining 13,332 in the development set.

3.2 Baseline

We compare against the provided FEVER baseline. The IR component of the baseline is identical to ours, as we reuse their component, but for textual entailment they use the fully lexicalized state-of-the-art decomposable attention model of Parikh et al. (2016).

4 Results

The performance of our submitted domain-independent model on the test data (using the baseline IR system) can be found in Table 1, along with the performance of the fully lexicalized baseline. The current performance of our model is below that of the baseline, presumably due to the lack of domain-specific lexicalized features.

Since here we focus on the RTE component only, we also provide the model's performance in an oracle setting on the development set, where we use the gold sentences as evidence, in Table 2. Included in the table are the results both at the time of submission and post-submission. At the time of submission, the model included only the word overlap, negated and refuting words, hedging words, and antonym features.


Feature group removed              Accuracy
With all features                  56.89%
− Word overlap                     50.89%
− Hedging words                    50.96%
− Antonyms                         52.17%
− Measures of lexical similarity   55.60%
− Refuting words and negations     55.82%

Table 3: Ablation results: performance of our model on development after removing each feature group, one at a time. Performance is given in the oracle setting, using the gold sentences as evidence.

Post-submission, we added the lexical similarity features.

Making use of the relative interpretability of our feature-based approach, we performed an ablation test on the development data (again, in the oracle setting using gold sentences as evidence) to test the contribution of each feature group. The results are shown in Table 3. The word overlap and hedging features had the largest contribution. The relatively small contribution of the refuting words and negation features, on the other hand, could be due to the limited word set or the lack of explicit refuting language in the evidence documents.

5 Analysis

To find the importance of each of the features as assigned by the classifier, we printed the weights for the top five features for each class, shown in Table 4. As can be seen in this table, the feature that was given the highest weight for the class REFUTES is the polarity feature that indicates a conflict in the polarity of the claim and evidence (as determined by finding a verb which occurs in both, but is negated in the claim and not negated in the evidence). The feature with the second highest weight for the REFUTES class is the proportion of nouns that are common between claim and evidence. Another feature that the classifier gave high importance for this class is the count of numbers that are present in the claim but not in the evidence (numbers are defined as tokens having CD as their part-of-speech tag).

Similarly, the feature with the highest weight for the class SUPPORTS is the word overlap (which denotes the proportion of unique words that appear in both claim and evidence). Notably, some of the hedging words were found to be indicative of the REFUTES class (e.g., question and predict) while others were indicative of the SUPPORTS class (argue, hint, prediction and suggest).

While most of the weights generated by the classifier are intuitive, these features are clearly insufficient, as demonstrated by the low accuracy of the classifier. To address this, we manually analyzed 30 data points from the development data set that were wrongly classified by our model. A particular focus was to try to understand and trace back which features contributed (or did not contribute) to the SUPPORTS and REFUTES classes.

Several of the data points demonstrated ways in which straightforward extensions of the approach (i.e., additional features) could help. For example, consider this data point below, which belongs to the class REFUTES but was classified to be in the class SUPPORTS by our model:

Claim: Vedam was written and directed by Christopher Nolan.
Evidence: Vedam is a 2010 Telugu language Indian drama film written and directed by Radhakrishna Jagarlamudi. ...

We conjecture that this error occurred due to the lack of syntactic information in our system. For example, a simple extension to our approach that could address this example would look for similar (and dissimilar) syntactic dependencies between the claim and evidence.

On the other hand, a few of the data points contained more complex phenomena that would be difficult to capture in the current approach. Consider the following example, which belongs to the class REFUTES but was wrongly classified as SUPPORTS by our model:

Claim: Sean Penn is only ever a stage actor.
Evidence: Following his film debut in the drama Taps and a diverse range of film roles in the 1980s, ... Penn garnered critical attention for his roles in the crime dramas At Close Range, State of Grace, and Carlito's Way.

This example shows the difficulty involved in capturing the underlying complexities of words that indirectly capture negation, such as only, which our features do not capture presently.

Lastly, we found that certain aspects of our approach, even with minimal dependence on lexicalization, are still not as domain-independent as desired. Consider the example below, whose gold label is SUPPORTS, but which was classified as REFUTES by our model.


Weight   Feature Name                  Description
1.30     polarity neg claim pos ev     Presence of a verb negated in the claim but not in the evidence
0.537    noun overlap                  Proportion of nouns in claim and evidence that overlap
0.518    hedging evidence question     Presence of the hedging word question in the evidence
0.455    num overlap diff              Count of numbers present in claim but not in the evidence
0.385    hedging claim predict         Presence of the hedging word predict in the claim
-0.454   hedging evidence suggest      Presence of the hedging word suggest in the evidence
-0.477   hedging evidence prediction   Presence of the hedging word prediction in the evidence
-0.584   hedging claim hint            Presence of the hedging word hint in the claim
-0.585   hedging claim argue           Presence of the hedging word argue in the claim
-1.59    word overlap                  Proportion of words in the claim and evidence that overlap

Table 4: Top five features with the highest weight in each class, where the positive class is REFUTES and the negative class is SUPPORTS.

Claim: The Gettysburg Address is a speech.
Evidence: Despite the speech's prominent place in the history and popular culture of the United States, the exact wording and location of the speech are disputed.

We believe this error occurred because we have more argumentative features (for example, in this case the presence of the word despite), and fewer features to capture the type of neutral sentences common in data sources like Wikipedia pages, which have more informative, objective content. On the other hand, fake news articles contain more subjective language, for which argumentative features are well-suited.

Keeping all these errors in mind, our future goal is to enhance the performance of the system by adding more potent unlexicalized/domain-independent features, including features that take advantage of dependency syntax and discourse information. Another possibility we would like to explore is replacing the current classifier with other non-linear classifiers, including a simple feed-forward neural network. Through these steps, we hope to improve the accuracy of the classifier predictions, pushing the performance closer to that of fully lexicalized systems while remaining able to transfer between domains.

6 Conclusion

Despite our current low performance in the FEVER challenge, we would like to propose this system as a precursor to an effort towards building a cross-domain fake data detection model, especially considering its basic implementation. The added benefit of our simple system, when compared to other complicated neural network/deep learning architectures (which are harder to interpret), is that it also provides an opportunity to peek into what features contribute (or do not contribute) to the development of such a cross-domain system.

Acknowledgements

We would like to thank Ajay Nagesh and Marco A. Valenzuela-Escarcega for their timely help.

References

Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O'Reilly Media, Inc.

Steven Bird and Edward Loper. 2004. NLTK: The Natural Language Toolkit. In Proceedings of the ACL 2004 on Interactive Poster and Demonstration Sessions, page 31. Association for Computational Linguistics.

Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to answer open-domain questions. arXiv preprint arXiv:1704.00051.

Marti A. Hearst, Susan T. Dumais, Edgar Osuna, John Platt, and Bernhard Scholkopf. 1998. Support vector machines. IEEE Intelligent Systems and their Applications, 13(4):18–28.

Christopher Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 55–60.

George A. Miller. 1995. WordNet: a lexical database for English. Communications of the ACM, 38(11):39–41.


Ankur P. Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. 2016. A Decomposable Attention Model for Natural Language Inference. arXiv preprint arXiv:1606.01933.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Dean Pomerleau and Delip Rao. 2017. Fake News Challenge.

James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018. FEVER: A Large-Scale Dataset for Fact Extraction and VERification. arXiv preprint arXiv:1803.05355.

William Yang Wang. 2017. "Liar, Liar Pants on Fire": A New Benchmark Dataset for Fake News Detection. arXiv preprint arXiv:1705.00648.

A Supplemental Material

A.1 Hedging words

allegedly, argument, belief, believe, conjecture, consider, hint, hypotheses, hypothesis, hypothesize, implication, imply, indicate, predict, prediction, previous, previously, proposal, propose, question, reportedly, speculate, speculation, suggest, suspect, theorize, theory, think, whether

A.2 Stop words

We used a subset of the stop words (and partial words) from the Python Natural Language Toolkit (NLTK) (Bird et al., 2009), listed below; a short loading sketch follows the list:

a, about, ain, all, am, an, and, any, are, aren, aren't, as, at, be, been, being, by, can, couldn, couldn't, did, didn't, didn, do, does, doesn, doesn't, doing, don't, few, for, from, further, had, hadn, hadn't, has, hasn, hasn't, have, haven, haven't, having, he, her, here, hers, herself, him, himself, his, how, i, if, in, into, is, isn, isn't, it, it's, its, itself, just, ll, me, mightn, mightn't, more, most, mustn, mustn't, my, myself, needn, needn't, nor, of, on, or, our, ours, ourselves, own, shan, shan't, she, she's, should, should've, shouldn, shouldn't, so, some, such, than, that, that'll, the, their, theirs, them, themselves, then, there, these, they, this, those, through, to, too, until, ve, very, was, wasn, wasn't, we, were, weren, weren't, what, when, where, which, while, who, whom, why, will, with, won't, wouldn, wouldn't, y, you, you'd, you'll, you're, you've, your, yours, yourself, yourselves
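
For reference, a minimal sketch of how such a subset could be assembled from NLTK's English stop-word list is given below; the variable names and the abbreviated subset are illustrative assumptions only.

    # Minimal sketch (assumes nltk.download('stopwords') has been run).
    from nltk.corpus import stopwords

    NLTK_STOPWORDS = set(stopwords.words("english"))
    # Abbreviated illustration of the curated subset; the full list is given above.
    KEPT_SUBSET = {"a", "about", "all", "am", "an", "and", "any", "are", "as", "at"}
    STOP_WORDS = NLTK_STOPWORDS & KEPT_SUBSET

    def remove_stop_words(tokens):
        # Drop stop words before computing the overlap features.
        return [t for t in tokens if t.lower() not in STOP_WORDS]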

A.3 Refuting words

bogus, debunk, denies, deny, despite, doubt, doubts, fake, false, fraud, hoax, neither, no, nope, nor, not, pranks, refute, retract

Author Index

Øvrelid, Lilja, 119

Aggarwal, Piush, 28
Aker, Ahmet, 28, 114
Alhindi, Tariq, 85, 127
Aroyo, Lora, 16

Bangaru, Anusha, 28

Chakrabarty, Tuhin, 127
Chawla, Piyush, 50
Cheng, Bo, 138
Chinnakotla, Manoj, 22
Christodoulopoulos, Christos, 1
Cocarascu, Oana, 1
Collier, Nigel, 40
Conforti, Costanza, 40

de Marneffe, Marie-Catherine, 156
Diab, Mona, 150
Dumitrache, Anca, 16

Esteves, Diego, 50, 132

Ferraro, Francis, 161
Finin, Tim, 161

Ghanem, Bilal, 66
Gupta, Vishal, 22
Gurevych, Iryna, 103

Högden, Birte, 28
Hanselowski, Andreas, 103
Hidey, Christopher, 150
Hirst, Graeme, 60

Jiang, Nanjiang, 156
Jorge, Alípio, 91

Kadav, Asim, 79
Kevin, Vincentius, 28
Kim, Juho, 79
Kowollik, Jan, 114

Lehmann, Jens, 50
li, sizhen, 138
Li, Zile, 103

Loureiro, Daniel, 91
Luken, Jackson, 156

Madan, Neelu, 28
Malon, Christopher, 79, 109
Mitchell, Jeff, 97
Mittal, Arpit, 1
Miura, Yasuhide, 34, 124
Muradov, Farid, 28
Muresan, Smaranda, 85, 127

Naderi, Nona, 60
Nooralahzadeh, Farhad, 119

Obamuyide, Abiola, 72
Ohkuma, Tomoko, 34, 124
Otto, Wolfgang, 145

Padia, Ankur, 161
Paul, Mithun, 166
Petridis, Savvas, 85
Pilehvar, Mohammad Taher, 40

Rangel, Francisco, 66
Reddy, Aniketh Janardhan, 50, 132
Riedel, Sebastian, 97
Rocha, Gil, 132
Rosso, Paolo, 66

Sahan, Ali, 28
Schiller, Benjamin, 103
Schulz, Claudia, 103
Schwenger, Claudia, 28
Sharp, Rebecca, 166
Shrivastava, Manish, 22
Sorokin, Daniil, 103
Stenetorp, Pontus, 97
Surdeanu, Mihai, 166

Taboada, Maite, 10
Takahashi, Takumi, 124
Taniguchi, Motoki, 34, 124
Taniguchi, Tomoki, 124
Thorne, James, 1
Torabi Asr, Fatemeh, 10

Vlachos, Andreas, 1, 72

Welbl, Johannes, 97
Welty, Chris, 16

Yang, Hao, 138
Yoneda, Takuma, 97

Zhang, Hao, 103
Zhao, Shuai, 138