KnowFi: Knowledge Extraction from Long Fictional Texts


Transcript of KnowFi: Knowledge Extraction from Long Fictional Texts

Automated Knowledge Base Construction (2021) Conference paper

KnowFi: Knowledge Extraction from Long Fictional Texts

Cuong Xuan Chu [email protected]

Simon Razniewski [email protected]

Gerhard Weikum [email protected]

Max Planck Institute for Informatics

Saarbrücken, Germany

Abstract

Knowledge base construction has recently been extended to fictional domains like multi-volume novels and TV/movie series, aiming to support explorative queries for fans and sub-culture studies by humanities researchers. This task involves the extraction of relations between entities. State-of-the-art methods are geared for short input texts and basic relations, but fictional domains require tapping very long texts and need to cope with non-standard relations where distant supervision becomes sparse. This work addresses these challenges by a novel method, called KnowFi, that combines BERT-enhanced neural learning with judicious selection and aggregation of text passages. Experiments with several fictional domains demonstrate the gains that KnowFi achieves over the best prior methods for neural relation extraction.

1. Introduction

Motivation and Problem: Relation extraction (RE) from web contents is a key task for the automatic construction of knowledge bases (KB). It involves detecting a pair of entities in a text document and inferring if a certain relation (predicate) holds between them. Extracted triples of the form (subject, predicate, object) are used for populating and growing the KB. Besides this major use case, RE also serves other applications like text annotation and summarization, semantic search, and more.

Work on KB construction has mostly focused on general-purpose encyclopedic knowledge, about prominent people, places, products etc. and basic relations of wide interest such as birthplaces, spouses, writing of books, acting in movies etc. Vertical domains have received some attention, including health, food, and consumer products. Yet another case are KBs about fictional works [Hertling and Paulheim, 2020, Labatut and Bost, 2019], such as Game of Thrones (GoT), the Marvel Comics (MC) universe, Greek Mythology or epic books such as War and Peace by Leo Tolstoy or the Cartel novels by Don Winslow. For KBs about fictional domains, the focus is less on basic relations like birthplaces or spouses, but more on relations that capture traits of characters and key elements of the narration. Relations of interest are allies, enemies, membership in clans, betrayed, killed etc.

Applications of fiction KBs foremost include supporting fans in entity-centric search. Some of the fictional domains have huge fan communities, and search engines frequently receive queries such as “Who killed Catelyn Stark?” (in GoT). Entity summarization is a related task, for example, a user asking for the most salient traits of Ygritte (in GoT). Although fiction serves to entertain, some of the more complex domains reflect sub-cultural trends and the zeitgeist of certain epochs. Analyzing their narrative structures and networks of entities is of interest to humanities scholars. For example, superhero comics originated in the 1940s and boomed in post-war years, reflecting that era’s zeitgeist (revived now). War and Peace has the backdrop of the Napoleonic wars in Russia, and the Cartel trilogy blends facts and fiction about drug trafficking. KBs enable deeper analyses of such complex texts for historians, social scientists, media psychologists and cultural-studies scholars.



State of the Art and its Limitations: RE with pre-specified relations for canonicalized entities is based on distant supervision via pre-compiled seed triples [Mintz et al., 2009, Suchanek et al., 2009]. Typically, these training seeds come from initial KBs, which in turn draw on Wikipedia infoboxes. The best RE methods are based on this paradigm of distant supervision, leveraging it for neural learning (e.g., [Soares et al., 2019, Wang et al., 2020, Yao et al., 2019, Zhang et al., 2017, Han et al., 2020]). They work well for basic relations, as there is no shortage of training samples (e.g., for birthplace or spouse). One of their key limitations is the bounded size of input text passages, typically a few hundred tokens only. This is not a bottleneck for basic relations where single sentences (or short paragraphs) with all three SPO components are frequent enough (e.g., in the full text of Wikipedia articles). However, for RE with non-standard relations over long fictional texts such as entire books, these limitations are major bottlenecks, if not showstoppers. This paper addresses the resulting challenges (also included among the open challenges in the overview by [Han et al., 2020]):

• How to go about distant supervision for RE targeting non-standard relations that have only few seed triples?

• How to cope with very long input texts, such as entire books, where relevant cues for RE are spread across passages?

Approach and Contributions: This paper presents a complete methodology and system for relation extraction from long fictional texts, called KnowFi (Knowledge extraction from Fictional texts). Our method leverages semi-structured content in wikis of fan communities on fandom.com (aka wikia.com). We extract an initial KB of background knowledge for 142 popular domains (TV series, movies, games). This serves to identify interesting relations and to collect distant supervision samples. Yet for many relations this results in very few seeds. To overcome this sparseness challenge and to generalize the training across the wide variety of relations, we devise a similarity-based ranking technique for matching seeds in text passages. Given a long input text, KnowFi judiciously selects a number of context passages containing seed pairs of entities. To infer if a certain relation holds between two entities, KnowFi's neural network is trained jointly for all relations as a multi-label classifier.

Extensive experiments with long books on five different fictional domains show that KnowFi clearly outperforms state-of-the-art RE methods. Even on conventional short-text benchmarks with standard relations, KnowFi is competitive with the best baselines. As an extrinsic use case, we demonstrate the value of KnowFi's KB for the task of entity summarization. The paper's novel contributions are:

• a system architecture for the new problem of relation extraction from long fictional texts, like entire novels and text contents by fan communities (Section 3);

• a method to overcome the challenge of sparse samples for distant supervision for non-standard relations (Section 4);


• a method to overcome the challenge of limited input size for neural learners, by judiciously selecting relevant contexts and aggregating results (Section 5);

• a comprehensive experimental evaluation with a novel benchmark for relation extraction from very long documents (Section 6), with code and data release upon publication.

2. Related Work

Relation Extraction (RE): Early work on RE from text sources has used rules and patterns (e.g., [Agichtein and Gravano, 2000, Craven et al., 1998, Etzioni et al., 2004, Reiss et al., 2008]), with pattern learning based on the principle of relation-pattern duality [Brin, 1998]. Open IE [Banko et al., 2007, Mausam, 2016, Stanovsky et al., 2018] uses linguistic cues to jointly infer patterns and triples, but lacks proper normalization of SPO arguments. RE with pre-specified relations, on the other hand, is usually based on distant supervision via pre-compiled seed triples [Mintz et al., 2009, Suchanek et al., 2009]. A variety of methods have been developed on this paradigm, from probabilistic graphical models (e.g., [Pujara et al., 2015, Sa et al., 2017]) to deep neural networks (e.g., [Soares et al., 2019, Wang et al., 2020, Yao et al., 2019, Zhang et al., 2017, Han et al., 2020]). Distantly supervised neural learning has become the method of choice, with different granularities.

Sentence-level RE: Most neural methods operate on a per-sentence level. Distant-supervision samples of SPO triples serve to identify sentences that contain an entity pair (S and O) which stand in a certain relation. The sentence is then treated as a positive training sample for the neural learner. At test-time, the trained model can tag entity mentions and predict if the sentence expresses a given relation or not. This basic architecture has been advanced with bi-LSTMs, attention mechanisms and other techniques (e.g., [Zhang et al., 2017, Cui et al., 2018, Trisedya et al., 2019]). A widely used benchmark for sentence-level RE is TACRED [Zhang et al., 2017].

With recent advances on pre-trained language models like BERT [Devlin et al., 2019a] (or ELMo, GPT-3, T5 and the like), the currently best RE methods leverage this asset for representation learning [Shi and Lin, 2019, Soares et al., 2019, Wadden et al., 2019, Yu et al., 2020].

Document-level RE: To expand the scope of inputs, Wang et al. [2019] proposed RE from documents, introducing the DocRED benchmark. However, the notion of documents is still very limited in size, given the restrictions in neural network inputs, typically around 10 sentences (e.g., excerpts from Wikipedia articles). Wang et al. [2020] is a state-of-the-art method for this document-level RE task, utilizing BERT and graph convolutions for representation learning. Zhou et al. [2021] further enhanced this approach. None of these methods can handle input documents that are larger than a few tens of sentences. KnowFi is the first method that is geared for book-length input.

Fiction Knowledge Bases: Understanding characters in literary texts and constructing networks of their relationships and interactions has become a small topic in NLP (e.g., [Chaturvedi et al., 2016, Labatut and Bost, 2019, Srivastava et al., 2016]). The work of [Chu et al., 2019, 2020] has advanced this theme for entity typing and type taxonomies for fictional domains. However, this work does not address learning relations between entities for KB population.


Figure 1: Overview of the KnowFi architecture. Wikia infoboxes yield triples (e1, r, e2) and entity pairs (e1, e2); all passages containing (e1, e2) are scored by a passage ranker for distant supervision over relations r1, r2, ...; the selected passages are fed through BERT into context embeddings and a multi-label classifier (multi-context neural learner).


The DBkWik project [Hertling and Paulheim, 2020] has leveraged structured infoboxes of fan communities at Wikia (now renamed to fandom.com), to construct a large KB of fictional characters and their salient properties. However, this is strictly limited to relations and respective instances that are present in infoboxes. Our work leverages Wikia infoboxes for distant supervision, but our method can extract more knowledge from a variety of text sources, including storylines and synopses by fans and, most demandingly, the full text of entire books.

3. System Overview

The architecture of the KnowFi system is illustrated in Figure 1. There are two major components:

• Distant supervision involves pre-processing infoboxes from Wikia-hosted fan communities, to obtain seed pairs of entities. These are used to retrieve relevant passages from the underlying text corpora: either synopses of storylines in Wikia or full-fledged content of original books. As the number of passages per entity pair can be very large in books, we devise a judicious ranking of passages and feed only the top-k passages into the next stage of training the neural network. Details are in Section 4.

• Multi-context neural learning feeds the top-k passages, with entity markup, jointly into a BERT-based encoder [Devlin et al., 2019b]. On top of this representation learning, a multi-label classifier predicts the relations that hold for the input entity pair. Details are in Section 5.

Note that a passage can vary from a single sentence to a long paragraph. The two seed entities would ideally occur in the same sentence, but there are many cases where they are one or two sentences apart. Figure 2 shows example texts from a GoT synopsis in Wikia and from one of the original books.

The pre-processing of Wikia infoboxes resulted in 2.37M SPO triples for ca. 8,000 different relation names between a total of 461.4k entities, obtained from 142 domains (movie/TV series, games etc.). This forms our background knowledge for distant supervision. For obtaining matching passages, we focused on the 64 most frequent relations, including friend, ally, enemy and family relationships. Note that this stage is not domain-specific. Later we apply the learned model to specific domains such as GoT or Marvel Comics.


Excerpt from Game of Thrones synopses at Wikia: Eighteen years before the War of the Five Kings, Rhaegar Targaryen allegedly abducted Lyanna Stark in a scandal that led to the outbreak of Robert's Rebellion. Rhaegar eventually returned to fight in the war, but not before leaving Lyanna behind at the Tower of Joy, guarded by Lord Commander Gerold Hightower and Ser Arthur Dayne of the Kingsguard. Eddard Stark rode to war alongside her betrothed, Robert Baratheon, to rescue his sister and avenge the deaths of their father and brother at the orders of Aerys II, the Mad King.

Excerpt from a Harry Potter book: Harry had been a year old the night that Voldemort - the most powerful Dark wizard for a century, a wizard who had been gaining power steadily for eleven years - arrived at his house and killed his father and mother. Voldemort had then turned his wand on Harry; he had performed the curse that had disposed of many full-grown witches and wizards in his steady rise to power and, incredibly, it had not worked. Instead of killing the small boy, the curse had rebounded upon Voldemort. Harry had survived with nothing but a lightning-shaped cut on his forehead, and Voldemort had been reduced to something barely alive.

Figure 2: Examples of input texts.

4. Distant Supervision with Passage Ranking

The KnowFi approach to distant supervision differs from prior works in two ways:

• Passage ranking: Identifying the best passages that contain seed triples, by judicious ranking, and using only the top-k passages as positive training samples.

• Passages with gaps: Including passages where the entities of a seed triple merely occur in separate sentences with other sentences in between.

Passage ranking: Seed pairs of entities are matched by many sentences or passages in the input corpora. For example, the pair (Hermione, Harry) appears in 1539 sentences in the seven volumes of the Harry Potter series together. Many of these contain cues that they stand in the friends relation, but there are also many sentences where the co-occurrence is merely accidental. This is a standard dilemma in distant supervision for multi-instance learning [Riedel et al., 2010, Li et al., 2020]. Our approach is to identify the best passages among the numerous matches, by judicious ranking on a per-relation basis.

For each relation, we build a prototype representation by selecting sentences that contain lexical matches of all three SPO arguments, where the predicate is matched by its label in the background knowledge or a short list (average length of three) of synonyms and close hyponyms or hypernyms (e.g., “allegiance” or “loyalty” matching ally), manually collected from WordNet. Newly seen passages for entity pairs can then be scored against the per-relation prototypes by casting both into tf-idf-weighted bag-of-word models (or alternatively, word2vec-style embeddings) and computing their cosine distance. This way, we rank candidate passages for each seed pair and target relation.
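To make the ranking step concrete, here is a minimal sketch using scikit-learn's tf-idf vectorizer and cosine similarity; the function name, the stop-word setting and the top-k cut-off are illustrative assumptions, not the exact KnowFi implementation.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_passages(prototype_sentences, candidate_passages, top_k=5):
    """Rank the candidate passages of one entity pair against a relation prototype."""
    vectorizer = TfidfVectorizer(stop_words="english")
    vectorizer.fit(prototype_sentences + candidate_passages)
    # Per-relation prototype: mean tf-idf vector of the sentences that lexically
    # match subject, object and a predicate synonym from WordNet.
    prototype = np.asarray(vectorizer.transform(prototype_sentences).mean(axis=0))
    candidates = vectorizer.transform(candidate_passages)
    scores = cosine_similarity(candidates, prototype).ravel()
    order = np.argsort(-scores)[:top_k]
    return [(candidate_passages[i], float(scores[i])) for i in order]
```

In the main experiments (Section 7.1), passages scoring above a similarity threshold of 0.5 are kept as positive training samples.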

Passages with gaps: Unlike encyclopedic articles, long texts on fictional domains have a narrative style where single sentences are unlikely to give the full information in the most compact way. Therefore, we consider multi-sentence contexts where entity mentions occur across different sentences. In addition to simple paragraphs, we consider passages with gaps where we include sentences that are not necessarily contiguous but leave out uninformative sentences. This way, we maximize the value of limited-size text spans fed into the neural learner. This is in contrast to earlier techniques that consume whole paragraphs and rely on attention mechanisms to give higher weight to informative parts.


Figure 3: Neural network architecture for multi-context RE. Each of the k contexts is encoded as [CLS] ... [E1] e1 [/E1] ... [E2] e2 [/E2] ... [SEP] e1_type [SEP] e2_type and passed through BERT in a per-passage layer, followed by an aggregation layer and a classification layer.

KnowFi has two configuration parameters: the maximum number of sentences allowed between sentences that contain seed entities, and the number of sentences directly preceding or following the occurrence of a seed entity. In our experiments, we include text where the two entities appear at most 2 sentences apart and 1 preceding and 1 following sentence for each of the entity mentions, up to 512 tokens, which is the current limit of BERT-based networks.
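The following sketch illustrates how such a gapped passage could be assembled from pre-split sentences, using the two parameters above (gap of at most 2 sentences, a window of 1 sentence around each entity mention); the helper name and the whitespace-based token cap are simplifications, since KnowFi itself works with BERT's own 512-token limit.

```python
def build_gapped_passage(sentences, e1_idx, e2_idx, max_gap=2, window=1, max_tokens=512):
    """Assemble a passage for two entity mentions located in sentences e1_idx and e2_idx.

    sentences: list of sentence strings of the document
    """
    lo, hi = sorted((e1_idx, e2_idx))
    if hi - lo - 1 > max_gap:              # entities too far apart: no passage
        return None
    keep = set()
    for idx in (lo, hi):                   # entity sentences plus surrounding window
        for j in range(idx - window, idx + window + 1):
            if 0 <= j < len(sentences):
                keep.add(j)
    # Sentences between the two windows are dropped: this is the "gap".
    passage = " ".join(sentences[j] for j in sorted(keep))
    # Crude truncation toward the 512-token BERT input limit (whitespace tokens here).
    return " ".join(passage.split()[:max_tokens])
```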

Negative training samples: In addition to the positive training samples obtained by the above procedure, we generate negative samples by the following random process. For each relation r, we pick random entities e1 and e2 for each of the S and O roles such that there are other entities x and y for which the background knowledge asserts (e1, r, x) and (y, r, e2) with x ≠ e2 and y ≠ e1. This improves on the standard technique of simply choosing any pair e1, e2 for which (e1, r, e2) does not hold, by selecting more difficult cases and thus strengthening the learner. For example, both Hermione and Malfoy have some friends, but they are not friends of each other. The training of KnowFi uses a 1:1 ratio of positive to negative samples.
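A minimal sketch of this negative-sampling scheme, assuming the background knowledge is given as a set of (subject, relation, object) triples; the data structures, function name and per-relation quota are illustrative.

```python
import random
from collections import defaultdict

def sample_negatives(triples, relations, n_per_relation):
    """Pair up, per relation r, a subject and an object that each participate in r
    with *other* entities but are not asserted to stand in r with each other."""
    subjects, objects = defaultdict(set), defaultdict(set)
    positives = set(triples)
    for s, r, o in triples:
        subjects[r].add(s)
        objects[r].add(o)

    negatives = []
    for r in relations:
        candidates = [(e1, e2) for e1 in subjects[r] for e2 in objects[r]
                      if e1 != e2 and (e1, r, e2) not in positives]
        random.shuffle(candidates)
        negatives.extend((e1, r, e2) for e1, e2 in candidates[:n_per_relation])
    return negatives
```

For the 1:1 ratio used in training, n_per_relation would simply be set to the number of positive instances of the respective relation.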

5. Multi-Context Neural Extraction

KnowFi is trained with and applicable to multiple passages as input to an end-to-end Transformer-based network with full backpropagation of cross-entropy loss. Our neural architecture has two specific components: a per-passage layer to learn BERT-based representations for each passage, and an aggregation layer that combines the signals from all input passages. In the experiments in this paper, the aggregation layer is configured to concatenate the representations of all passages, but other options are feasible, too.

Each input passage is encoded with markup of entity mentions. In addition, we determine semantic types for the entities, using the SpaCy tool (https://spacy.io/) that provides one type for each mention, chosen from a set of 18 coarse-grained types (person, nationality/religion, event, etc.). The type of each entity mention in a passage is appended to the input vector. Figure 3 illustrates the neural network for multi-context RE.
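The per-passage/aggregation/classification structure can be sketched with PyTorch and the HuggingFace transformers library as follows. This is a simplified reconstruction under stated assumptions (class name, default number of contexts, use of the [CLS] vector as passage representation), not the released KnowFi code; concatenation is the aggregation variant used in this paper's experiments.

```python
import torch
import torch.nn as nn
from transformers import BertModel

class MultiContextRE(nn.Module):
    """Encode k passages with BERT, concatenate their [CLS] vectors,
    and predict relations with a multi-label classifier (cf. Figure 3)."""

    def __init__(self, num_relations, num_contexts=5, model_name="bert-large-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        hidden = self.bert.config.hidden_size            # 1024 for BERT-Large
        # Aggregation layer: concatenation of the per-passage representations.
        self.classifier = nn.Linear(hidden * num_contexts, num_relations)

    def forward(self, input_ids, attention_mask):
        # input_ids / attention_mask: (batch, num_contexts, seq_len)
        b, k, t = input_ids.shape
        out = self.bert(input_ids=input_ids.view(b * k, t),
                        attention_mask=attention_mask.view(b * k, t))
        per_passage = out.last_hidden_state[:, 0, :]      # (b*k, hidden), [CLS] vector
        aggregated = per_passage.view(b, -1)              # concatenate the k contexts
        return self.classifier(aggregated)                # multi-label logits

# Each passage string carries entity markers and SpaCy types, e.g.
# "[E1] Harry [/E1] ... [E2] Voldemort [/E2] [SEP] PERSON [SEP] PERSON".
# Training uses a binary cross-entropy objective over the relation labels:
# loss = nn.BCEWithLogitsLoss()(logits, multi_hot_labels)
```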


Dataset   # Instances   # Rel.   # Pos. Inst.   # Neg. Inst.   avg. # Pos. Inst./Rel.   avg. # Pas./Inst.
Train     81,025        64       40,920         40,105         640                      1.5
Dev       20,257        64       10,363         9,894          162                      1.5

Table 1: Statistics on training and validation set. (Rel.: relation, Inst.: instances, Pos.: positive instances, Neg.: negative instances, avg. # Pos. Inst./Rel.: average number of positive instances per relation, avg. # Pas./Inst.: average number of passages per instance)

Universe            # rel.   # facts   top relations
Lord of the Rings   13       1,143     race, culture, realm, weapon
Game of Thrones     18       2,547     ally, culture, title, religion
Harry Potter        20       4,706     race, ally, house, owner
Dune                11       133       homeworld, ruler, commander
War and Peace       10       101       relative, child, spouse, sibling

Table 2: Statistics on test data of the five test universes.


6. LoFiDo Benchmark

To evaluate RE from long documents, we introduce the LoFiDo corpus (Long Fiction Documents). We compile SPO triples from infoboxes of 142 Wikia fan communities. After cleaning extractions and clustering synonyms, we obtain a total of 64 relations such as enemy, friend, ally, religion, weapon, ruler-of, etc.

For evaluating KnowFi and various baselines, we focus on 5 especially rich and diverse domains: (i) Lord of the Rings (a series of three epic novels by J.R.R. Tolkien), (ii) A Song of Ice and Fire (a series of five fantasy novels by George R.R. Martin, well-known for the Game of Thrones TV series based on it), (iii) Harry Potter (a series of seven books, written by J.K. Rowling), (iv) Dune (a science-fiction novel by Frank Herbert), and (v) War and Peace (a classic novel by Leo Tolstoy). For the first four, Wikia infoboxes provide ground truth; for War and Peace, we manually crafted a small ground-truth KB. 20% of the triples from each of these universes are withheld for testing. For the first four domains, we consider both original novels as well as narrative synopses from Wikia as input sources. War and Peace is not covered by Wikia.

LoFiDo Statistics. Our LoFiDo corpus contains 81,025 instances for training and 20,257 instances for validation. For testing, we use five specific universes, which take input from both books and Wikia texts. The total number of instances in the test data from Wikia texts is 14,610, while in the case of books, it is 64,120. Ground-truth data for five test universes are provided for evaluation. Table 1 shows statistics on the training and validation data, while Table 2 shows statistics on the ground-truth of five domains in the test data. Further details on this dataset are in Appendix A and B. Code and data of KnowFi are available at https://www.mpi-inf.mpg.de/yago-naga/knowfi.


7. Experiments

7.1 Setup

Baselines. We compare KnowFi to three state-of-the-art baselines on RE:

• BERT-Type [Shi and Lin, 2019], which uses BERT-based encodings augmented with entity type information, also based on SpaCy output in our experiments for fair comparison;

• BERT-EM [Soares et al., 2019], which includes entity markers in input sequences;

• GLRE [Wang et al., 2020], which additionally computes global entity representations and uses them to augment the text sequence encodings.

The first two baselines run on a per-sentence basis, whereas GLRE is a state-of-the-art method for extractions from short documents, which we train on paragraph-level inputs. The inputs for these models (i.e., sentences or paragraphs) are randomly selected.

KnowFi Parameters. For passage selection, we rely on TF-IDF-based bag-of-words similarity between the passages and relation contexts, where each relation context contains the top-100 tokens, based on their tf-idf scores. For selecting passages as multi-context input, we compute the cosine between tf-idf-based vectors of each passage against the relation-specific prototype vector; we select all passages with cosine above 0.5 as positive training samples. For the neural network, we use BERT-Large (https://huggingface.co/transformers/model_doc/bert.html) with 24 layers, hidden size 1024 and 16 attention heads. The learning rate is 5e-5 with Adam, the batch size is 8, and the number of training epochs is 10.
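As a rough illustration of this training configuration, the loop below wires the hyperparameters together; it assumes the MultiContextRE sketch from Section 5 and a train_set that yields pre-tokenized (input_ids, attention_mask, labels) tensors, both of which are illustrative stand-ins rather than the released code.

```python
import torch
from torch.utils.data import DataLoader

# MultiContextRE: sketch from Section 5; train_set: assumed dataset of
# pre-tokenized multi-context passages with multi-hot relation labels.
model = MultiContextRE(num_relations=64)          # 64 target relations (Section 6)
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)
loss_fn = torch.nn.BCEWithLogitsLoss()            # multi-label objective

for epoch in range(10):                           # 10 training epochs
    for input_ids, attention_mask, labels in DataLoader(train_set, batch_size=8, shuffle=True):
        optimizer.zero_grad()
        logits = model(input_ids, attention_mask)
        loss = loss_fn(logits, labels.float())
        loss.backward()
        optimizer.step()
```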

Evaluation Metrics. The evaluation uses standard metrics like precision, recall and F1, averaged over all extracted triples. We report micro-averaged numbers for all relations together, and drill down on selected relations of interest. In addition, we report numbers for HITS@k and MRR (a small sketch of these two metrics follows the list below). Regarding ground truth, we perform two different modes of evaluation:

• Automated evaluation is based on ground-truth from Wikia infoboxes. This is demanding on precision, but penalizes recall because of its limited coverage.

• Manual evaluation is based on obtaining assessments of extracted triples via crowdsourcing. This way, we include correct triples that are not in Wikia infoboxes, and thus achieve higher recall.
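As announced above, here is a minimal sketch of how HITS@k and MRR can be computed over per-(entity, relation) object rankings; the dictionary-based interface is an illustrative assumption.

```python
def hits_and_mrr(ranked_objects, gold_objects, k=5):
    """ranked_objects: (entity, relation) -> list of extracted objects, ordered by confidence.
    gold_objects: (entity, relation) -> set of correct objects from the ground truth."""
    hits, reciprocal_ranks = [], []
    for key, ranking in ranked_objects.items():
        gold = gold_objects.get(key, set())
        if not gold:
            continue
        hits.append(any(obj in gold for obj in ranking[:k]))
        # Reciprocal rank of the first correct extraction (0 if none is correct).
        rank = next((i + 1 for i, obj in enumerate(ranking) if obj in gold), None)
        reciprocal_ranks.append(1.0 / rank if rank else 0.0)
    n = max(len(hits), 1)
    return sum(hits) / n, sum(reciprocal_ranks) / n
```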

7.2 Results

Automated Evaluation. Table 3 shows average precision, recall and F1 score. We can see that sentence-level baselines achieve comparatively high coverage, due to considering every sentence. Yet their precision is extremely low. GLRE and KnowFi achieve much higher precision, though GLRE fails to achieve competitive recall, presumably because its training on all paragraphs lowers its predictive power. As an illustration, GLRE produces only 173 assertions from all Harry Potter books, while KnowFi produces 600.

We also observe that for all methods, extraction from books is considerably harder than from the more concise synopses in Wikia.

In addition to the P/R/F1 scores, in Table 4 we also take an entity-centric view and evaluate how well correct extractions rank. The HITS@k metric reports how often a correct result appears among the top extractions per entity-relation pair (e.g., among top-5 extracted enemies of Harry Potter), while MRR reports the mean reciprocal rank of the first extraction. We can observe that KnowFi outperforms all baselines on both metrics.


Models                     Books                          Wikia Texts
                           Precision  Recall  F1-score    Precision  Recall  F1-score
BERT-Type (Shi and Lin)    0.00       0.07    0.00        0.02       0.05    0.00
BERT-EM (Soares et al.)    0.06       0.11    0.08        0.11       0.20    0.14
GLRE (Wang et al.)         0.17       0.03    0.05        0.18       0.07    0.10
KnowFi                     0.14       0.11    0.12        0.17       0.26    0.21

Table 3: Automated evaluation: average precision, recall and F1 scores.

Models                     Books                            Wikia Texts
                           HIT@1  HIT@3  HIT@5  MRR         HIT@1  HIT@3  HIT@5  MRR
BERT-Type (Shi and Lin)    0.01   0.02   0.04   0.02        0.09   0.20   0.23   0.16
BERT-EM (Soares et al.)    0.24   0.35   0.37   0.35        0.49   0.59   0.61   0.54
GLRE (Wang et al.)         0.40   0.53   0.54   0.46        0.47   0.62   0.68   0.57
KnowFi                     0.45   0.54   0.55   0.50        0.60   0.71   0.72   0.66

Table 4: Automated evaluation: average HIT@K and MRR scores.


Manual Evaluation. The low absolute scores in the above evaluation largely stem from incomplete automated ground truth. We therefore conducted an additional manual evaluation. For each domain, we selected the top 100 extractions from the results and used crowdsourcing to manually label their correctness. The annotators were Amazon master workers with an all-time approval rate > 90%, and additional test questions were used to filter responses. We observed high inter-annotator agreement, 0.81 on average.

Table 5 shows results of our manual evaluation on four domains (Dune was left out due to complexity). As one can see, KnowFi outperforms the baselines on most input texts, and achieves a remarkable precision on both books and Wikia texts (average of 0.57 on books and 0.73 on Wikia texts).

We repeat the entity-centric evaluation with manual labels for three relations of special interest in fiction: friend, enemy and ally. We select 10 popular entities each from LoTR, GoT and Harry Potter. The resulting precision scores are shown in Table 6. As one can see, KnowFi achieves high precision among its top extractions, e.g., 78% and 73% precision at rank 1 for friend assertions from books/Wikia texts.

Evaluation on Short-Text Datasets. To evaluate the robustness of KnowFi, we also evaluate its performance on the existing sentence-level RE dataset TACRED, and the short document-level RE dataset DocRED. The results are shown in Table 7. We find KnowFi's performance on TACRED is on par with BERT-Type and BERT-EM (0.66 test-F1, versus 0.63 and 0.62 for the baselines), the modest gain indicating that the combination of entity types and markers is beneficial. On DocRED, KnowFi achieved 0.51 F1-score, slightly below the GLRE model at 0.57 F1-score. We hypothesize that the modest losses stem from the fact that GLRE is specifically tailored for the short documents of DocRED, where multi-context aggregation is not relevant. At the same time, the single contexts GLRE considers have no inherent size limitation, unlike the 2-sentence distance threshold used in KnowFi.

Models                     Books                             Wikia Texts
                           LoTR  GOT   HP    WP    Avg.      LoTR  GOT   HP    WP    Avg.
BERT-Type (Shi and Lin)    0.01  0.54  0.09  0.11  0.19      0.09  0.12  0.15  0.19  0.14
BERT-EM (Soares et al.)    0.45  0.66  0.37  0.29  0.44      0.70  0.78  0.48  0.50  0.62
GLRE (Wang et al.)         0.27  0.25  0.56  0.47  0.39      0.37  0.56  0.71  0.56  0.55
KnowFi                     0.45  0.76  0.55  0.50  0.57      0.71  0.83  0.71  0.67  0.73

Table 5: Manual evaluation - average precision scores over 4 input texts (LoTR: Lord of the Rings, GOT: Game of Thrones, HP: Harry Potter, WP: War and Peace).


Sources        friend (top-k objects)    enemy (top-k objects)    ally (top-k objects)
               1     3     5             1     3     5            1     3     5
Books          0.78  0.82  0.80          0.55  0.45  0.47         0.63  0.67  0.63
Wikia Texts    0.73  0.76  0.75          0.60  0.48  0.49         0.70  0.67  0.62

Table 6: Manual evaluation - precision of friend, enemy and ally relations.

Models                     TACRED                 DocRED
                           F1 - Dev  F1 - Test    F1 - Dev  F1 - Test
BERT-Type (Shi and Lin)    0.65      0.64         -         -
BERT-EM (Soares et al.)    0.64      0.62         -         -
GLRE (Wang et al.)         -         -            -         0.57
KnowFi                     0.67      0.66         0.52      0.51

Table 7: Automated evaluation - short text datasets TACRED and DocRED.


Ablation Study. To evaluate the impact of passage ranking, we ran KnowFi without passage ranking for both training and prediction. Instead, passages were randomly selected. In automated evaluation, without passage ranking, KnowFi achieves comparable recall but lower precision: 0.07 vs. 0.14 on books and 0.12 vs. 0.17 on Wikia texts. This pattern is also observed in manual evaluation, where KnowFi, without passage ranking, achieves a precision of 0.43 vs. 0.57 on books and 0.55 vs. 0.73 on Wikia texts. Further experiments and ablation studies are in Appendix C and D.

Error Analysis. The precision gain from automated to manual evaluation (Table 3 vs. Table 5) indicates that ground-truth incompleteness is a confounding factor. By inspecting a sample of 50 false positives, we found that 20% originated from incomplete ground truth, while 54% were indeed not inferrable from the given contexts, pointing to limitations of the distant-supervision-based training (e.g., extracting friendship from the sentence “Thorin came to Bilbo's door”). Another 15% were errors in determining the subject or object in complex sentences with many entity mentions, e.g., extracting friend(Boromir, Gimli) from the sentence “Boromir went ahead, Legolas and Gimli, who by now had become friends, followed.” Finally, 7% of the false positives captured semantically related relations but missed the correct ones.

By sampling false negatives, we found that in 52% of the cases the retrieved contexts did not allow the proper inference, indicating limitations in the context retrieval and ranking. In 33% of the cases, a human reader could spot the relation in the top-ranked contexts (e.g., hasCulture(Legolas, Elf) in “He saw Legolas seated with three other Elves”). Appendix E shows some anecdotal examples for the output of KnowFi and baselines.


In Lord of the Rings, which summary is more informative for Frodo Baggins:

Summary 1: <Frodo, has parent, Drogo>, <Frodo, has culture, Shire>, <Frodo, has enemy, Sauron>, <Frodo, has friend, Sam>, <Frodo, has weapon, Sting>

Summary 2: <Frodo, has owner, Gandalf>, <Frodo, has weapon, Ring>, <Frodo, has parent, Drogo>, <Frodo, has affiliation, Sam>, <Frodo, has culture, Marish>

Table 8: Sample task for assessing entity summaries.

8. Extrinsic Use Case: Entity Summarization

To assess the salience of the extractions produced by KnowFi, we pursued a user study to compare entity summaries, one from KnowFi and one by randomly drawing from one of the three baselines. Each entity summary includes at most 5 best extractions (distinct relations) from the book series Lord of the Rings, Game of Thrones and Harry Potter. For each domain, we generate summaries for 5 popular entities. We give pairs of summaries, with randomized order, to Amazon master workers for selecting the more informative one. Table 8 shows an example of this crowdsourcing task. The annotators preferred KnowFi-based summaries over BERT-Type, BERT-EM and GLRE in 93%, 64% and 81% of the cases, respectively.

9. Conclusion

To the best of our knowledge, this paper is the first attempt at relation extraction (RE) from long fictional texts, such as entire books. The presented method, KnowFi, is specifically geared for this task by its judicious selection and ranking of passages. KnowFi outperforms strong baselines on RE by a substantial margin, and it performs competitively even on the short-text benchmarks TACRED and DocRED. The absolute numbers for precision and recall show that there is still a lot of room for improvement. This underlines our hypothesis that long fictional texts are a great challenge for RE. Our LoFiDo corpus of Wikia texts, book contents, and ground-truth labels will be made available to foster further research. All information can be found at https://www.mpi-inf.mpg.de/yago-naga/knowfi.

Acknowledgments

We thank Matteo Cannaviccio and Filipe Mesquita from Diffbot Inc. for their advice and support on labeling data, and Diffbot Inc. for free access to their RE API. We also thank our anonymous reviewers for their valuable comments.


References

Eugene Agichtein and Luis Gravano. Snowball: extracting relations from large plain-text collections. In JCDL, 2000.

Michele Banko, Michael J. Cafarella, Stephen Soderland, Matthew Broadhead, and Oren Etzioni. Open information extraction from the web. In IJCAI, 2007.

Sergey Brin. Extracting patterns and relations from the world wide web. In WebDB Workshop, 1998.

Snigdha Chaturvedi, Shashank Srivastava, Hal Daumé III, and Chris Dyer. Modeling evolving relationships between characters in literary novels. In AAAI, 2016.

Cuong Xuan Chu, Simon Razniewski, and Gerhard Weikum. TiFi: Taxonomy induction for fictional domains. In The Web Conference, 2019.

Cuong Xuan Chu, Simon Razniewski, and Gerhard Weikum. ENTYFI: Entity typing in fictional texts. In WSDM, 2020.

Mark Craven, Dan DiPasquo, Dayne Freitag, Andrew McCallum, Tom M. Mitchell, Kamal Nigam, and Sean Slattery. Learning to extract symbolic knowledge from the world wide web. In AAAI, 1998.

Lei Cui, Furu Wei, and Ming Zhou. Neural open information extraction. ACL, 2018.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, 2019a.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, 2019b.

Oren Etzioni, Michael J. Cafarella, Doug Downey, Stanley Kok, Ana-Maria Popescu, Tal Shaked, Stephen Soderland, Daniel S. Weld, and Alexander Yates. Web-scale information extraction in KnowItAll (preliminary results). In WWW, 2004.

Xu Han, Tianyu Gao, Yankai Lin, Hao Peng, Yaoliang Yang, Chaojun Xiao, Zhiyuan Liu, Peng Li, Jie Zhou, and Maosong Sun. More data, more relations, more context and more openness: A review and outlook for relation extraction. In AACL/IJCNLP, 2020.

Sven Hertling and Heiko Paulheim. DBkWik: extracting and integrating knowledge from thousands of wikis. Knowl. Inf. Syst., 2020.

Vincent Labatut and Xavier Bost. Extraction and analysis of fictional character networks: A survey. ACM Computing Surveys (CSUR), 2019.

Yang Li, Guodong Long, Tao Shen, Tianyi Zhou, Lina Yao, Huan Huo, and Jing Jiang. Self-attention enhanced selective gate with entity-aware embedding for distantly supervised relation extraction. In AAAI, 2020.

Mausam. Open information extraction systems and downstream applications. In IJCAI, 2016.

Filipe Mesquita, Matteo Cannaviccio, Jordan Schmidek, Paramita Mirza, and Denilson Barbosa. KnowledgeNet: A benchmark dataset for knowledge base population. In EMNLP-IJCNLP, 2019.

Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. Distant supervision for relation extraction without labeled data. In ACL-AFNLP, 2009.

Jay Pujara, Hui Miao, Lise Getoor, and William W. Cohen. Using semantics and statistics to turn data into knowledge. AI Mag., 2015.

Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using siamese BERT-networks. In EMNLP, 2019.

Frederick Reiss, Sriram Raghavan, Rajasekar Krishnamurthy, Huaiyu Zhu, and Shivakumar Vaithyanathan. An algebraic approach to rule-based information extraction. In ICDE, 2008.

Sebastian Riedel, Limin Yao, and Andrew McCallum. Modeling relations and their mentions without labeled text. In ECML-PKDD, 2010.

Christopher De Sa, Alexander Ratner, Christopher Ré, Jaeho Shin, Feiran Wang, Sen Wu, and Ce Zhang. Incremental knowledge base construction using DeepDive. VLDB J., 2017.

Peng Shi and Jimmy Lin. Simple BERT models for relation extraction and semantic role labeling. arXiv preprint arXiv:1904.05255, 2019.

Livio Baldini Soares, Nicholas FitzGerald, Jeffrey Ling, and Tom Kwiatkowski. Matching the blanks: Distributional similarity for relation learning. In ACL, 2019.

Shashank Srivastava, Snigdha Chaturvedi, and Tom Mitchell. Inferring interpersonal relations in narrative summaries. In AAAI, 2016.

Gabriel Stanovsky, Julian Michael, Luke Zettlemoyer, and Ido Dagan. Supervised open information extraction. In NAACL, 2018.

Fabian M. Suchanek, Mauro Sozio, and Gerhard Weikum. SOFIE: a self-organizing framework for information extraction. In WWW, 2009.

Bayu Distiawan Trisedya, Gerhard Weikum, Jianzhong Qi, and Rui Zhang. Neural relation extraction for knowledge base enrichment. In ACL, 2019.

David Wadden, Ulme Wennberg, Yi Luan, and Hannaneh Hajishirzi. Entity, relation, and event extraction with contextualized span representations. EMNLP, 2019.

Difeng Wang, Wei Hu, Ermei Cao, and Weijian Sun. Global-to-local neural networks for document-level relation extraction. EMNLP, 2020.

Hong Wang, Christfried Focke, Rob Sylvester, Nilesh Mishra, and William Wang. Fine-tune BERT for DocRED with two-step process. arXiv preprint arXiv:1909.11898, 2019.

Yuan Yao, Deming Ye, Peng Li, Xu Han, Yankai Lin, Zhenghao Liu, Zhiyuan Liu, Lixin Huang, Jie Zhou, and Maosong Sun. DocRED: A large-scale document-level relation extraction dataset. In ACL, 2019.

Dian Yu, Kai Sun, Claire Cardie, and Dong Yu. Dialogue-based relation extraction. ACL, 2020.

Yuhao Zhang, Victor Zhong, Danqi Chen, Gabor Angeli, and Christopher D. Manning. Position-aware attention and supervised data improve slot filling. In EMNLP, 2017.

Wenxuan Zhou, Kevin Huang, Tengyu Ma, and Jing Huang. Document-level relation extraction with adaptive thresholding and localized context pooling. AAAI, 2021.


Appendix A. Training Data Extraction

Many current KBs like Yago, DBpedia or Freebase have been built by extracting information from infoboxes and category networks and by leveraging the markup language of Wikipedia. The relations of these KBs are then used as schema for many later supervised relation extractors. However, for fiction, Wikipedia has too low coverage of entities and relevant relations.

Wikia. Wikia (or Fandom) is the largest web platform for fiction and fantasy. It contains over 385k communities with a total of over 50M pages. Each community (usually discussing one fictional universe) is organized as a single Wiki. With a wide range of coverage on fiction and fantasy, Wikia is one of the 10 most visited websites in the US (as of 2020)1.

Crawling. We download all universes which contain over 1000 content pages and have available dump files from Wikia, obtaining a total of 142 universes in the end. From these universes, we extract all information from their category networks and infoboxes, and build a background knowledge base for each universe.

Definition A.1. The background KB of a universe is a collection of entities, entity mentions, simple facts that describe relations between entities, and a type system of the universe.

Background knowledge extraction. To extract the background KBs, we follow a simple procedure:

• Type system construction: The type system is extracted from the Wikia category network. We adapt the technique from the TiFi system [Chu et al., 2019] to structure and clean the type system.

• Entity detection: Entities and entity mentions can be easily extracted from the dump file. We consider page titles as the entities in the universe (except administration and category pages). On the other hand, entity mentions only appear in texts. By using Wiki markup, each mention can be extracted and linked to the entity with a confidence score which is computed based on its frequency.

• Infobox extraction: Facts about each entity are extracted from its infobox. The infobox is presented in table format with the entity's attributes and their values. Each extracted fact is presented as a triple with subject, predicate (relation) and object. In particular, we consider the main entity as subject, the attributes as predicates, and the values as objects. We manually check if there is any misspelling in the relations and merge them if necessary (a minimal sketch of this step follows the list).
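A minimal sketch of the infobox-to-triple step, assuming the Wikia dump has already been parsed into one attribute-to-values dictionary per page (the dump parsing itself is omitted, and the function name is illustrative):

```python
def infobox_to_triples(page_title, infobox):
    """Turn one parsed infobox (attribute -> list of values) into SPO triples,
    with the page's main entity as subject and the attributes as predicates."""
    triples = []
    for attribute, values in infobox.items():
        predicate = attribute.strip().lower().replace(" ", "_")
        for value in values:
            triples.append((page_title, predicate, value.strip()))
    return triples

# Illustrative example:
# infobox_to_triples("Frodo Baggins", {"weapon": ["Sting"], "culture": ["Shire"]})
# -> [("Frodo Baggins", "weapon", "Sting"), ("Frodo Baggins", "culture", "Shire")]
```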

This results in an average of 158k entities and 13.5k facts per universe. The information from these background KBs is then used for all three later steps.

Relation Filtering. After extracting the background KBs, we get all relations from the facts of all universes and consider them as relation candidates that can be extracted in fictional domains. However, besides meta relations which are not really related to the content of universes, such as season, page, episode, ..., there is much noise in the relations since they are manually created by fans. To remove noise and keep popular relations, we do relation filtering as follows (a small sketch of the frequency-based part follows the list):

1. https://ahrefs.com/blog/most-visited-websites/


Figure 4: Statistics on training data (number of positive training instances per relation, for relations such as homeworld, child, parent, name, type, era, sibling, friend, enemy, ally, residence, title, spouse, race, and others).

• Pre-processing: a combination of stemming and keeping relations with length at least 3 (except for some relations like job, age, son, etc.).

• Infrequent-relation removing: we only keep relations which are in at least 5 universes and appear in over 20 facts.

• Meta-relation removing: we manually check if the relation is a meta-relation. In total, there are 247 relations considered as meta-relations.

• Misspelling detection: Misspelled relations are manually detected and grouped with the correct relations, for example, affilation and affiliation.

• Grouping: Synonym relations are manually grouped together, for example, leader and commander.
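As referenced above, here is a small sketch of the frequency-based part of this filtering (the manual steps, i.e., meta-relation removal, misspelling merging and synonym grouping, are applied on top of it); the thresholds follow the description above, while the data structures and function name are illustrative:

```python
from collections import defaultdict

def filter_relations(universe_triples, min_universes=5, min_facts=20):
    """universe_triples: universe name -> list of (subject, relation, object) triples.
    Keep relations that occur in at least `min_universes` universes and in more
    than `min_facts` facts overall."""
    fact_count = defaultdict(int)
    universe_count = defaultdict(set)
    for universe, triples in universe_triples.items():
        for _, relation, _ in triples:
            fact_count[relation] += 1
            universe_count[relation].add(universe)
    return {r for r in fact_count
            if len(universe_count[r]) >= min_universes and fact_count[r] > min_facts}
```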

After relation filtering, we reduce the number of relations from over 8,000 to 64 relations. These relations are considered as popular relations in fictional domains and used as targets for the relation extraction step. We realize that in fictional domains, the relations expressing the friendly or hostile relationship between two entities are interesting; hence, we keep friend and enemy as two relations which are always extracted. Figure 4 shows statistics on training data. We publish the training data as supplementary material.

Appendix B. Background KB Statistics

One of our contributions is the background KB dataset on popular universes in fictional domains. To give an overview of the dataset, Table 9 shows some statistics on our background KB database, which include information about universes, entities, type systems, relations and facts.

For the 5 domains used for testing, the number of relations varies from 10 to 20, and the number of ground-truth triples varies from 1,100 to 4,700 for the first three domains, and between 100 and 200 for the last two.


Statistics (avg. per universe)        Top Universes (# facts)           Top Relations (# facts / # universes)
# Universes (total)   142             Star Wars           282,440       name           238,290 / 111
# Facts               13,539          Monster Hunter      153,178       type           112,347 /  94
# Relations           163             World of Warcraft   144,586       gender          95,972 /  77
# Entities            158,066         Marvel               77,826       affiliation     85,676 /  61
# Entity Mentions     224,782         DC Comics            69,190       era             53,871 /  12
# Entity Types        1,246           Forgotten Realms     63,360       hair(color)     50,325 /  41

Table 9: Statistics on background KBs.

Threshold    Books                          Wikia Texts
             Precision  Recall  F1-score    Precision  Recall  F1-score
0.4          0.07       0.11    0.09        0.12       0.32    0.17
0.5          0.14       0.11    0.12        0.17       0.26    0.21
0.6          0.20       0.10    0.13        0.21       0.17    0.19

Table 10: Automated Evaluation: Study on the similarity threshold.

Appendix C. Additional Experiments

Similarity Threshold. In our experiments, we consider all passages with cosine above 0.5 as positive training samples (Section 7.1). To assess the effect of the similarity threshold, we conduct an ablation study on it. Table 10 reports the automated results of KnowFi on both books and Wikia texts, where the threshold varies. For the author response we completed two other runs (thresholds 0.4 and 0.6) that indicate modest influence; for the final version we would provide insights for all threshold values from 0 to 1 (in 0.1 step size). The results show that, with a similarity threshold around 0.5, the model achieves the best F1. By increasing the similarity threshold, the model is able to achieve higher precision but lower recall, and vice versa.

Embedding-based Passage Ranking. KnowFi uses a simple TF-IDF-based scheme for passage ranking. To assess the effectiveness of this method, we conduct an ablation study on the ranking step. Instead of using TF-IDF, we compute the embeddings of the passages and the relation contexts using Sentence-BERT [Reimers and Gurevych, 2019]. The cosine similarity between the passage embedding and the relation context embedding is then computed using the sklearn library. We select all passages with cosine above 0.0 (range [-1,1]) as positive training samples, with a maximum of 5 passages per training instance. Table 11 shows the automated results of KnowFi on both books and Wikia texts. The higher recall scores show that the embeddings can help the model capture the semantic relationships between passages and relation contexts better, especially when handling cases of synonymy, while TF-IDF only handles cases of lexical matching. However, in general, both techniques are on par, and embeddings do not improve the results in terms of F1-score.
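A minimal sketch of this embedding-based variant, assuming the sentence-transformers package; the concrete model name is an illustrative choice, while the 0.0 threshold and the cap of 5 passages follow the setup described above:

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")    # illustrative Sentence-BERT model

def rank_passages_embedding(relation_context, candidate_passages,
                            threshold=0.0, max_passages=5):
    """Rank candidate passages by cosine similarity (computed with sklearn)
    between Sentence-BERT embeddings of the passages and the relation context."""
    context_emb = model.encode([relation_context])    # shape (1, dim)
    passage_emb = model.encode(candidate_passages)    # shape (n, dim)
    scores = cosine_similarity(passage_emb, context_emb).ravel()
    ranked = sorted(zip(candidate_passages, scores), key=lambda x: -x[1])
    return [(p, float(s)) for p, s in ranked if s > threshold][:max_passages]
```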

GLRE with Passage Ranking. In our experimental setup, the inputs of GLRE [Wang et al., 2020] (for both training and testing) are randomly selected. To assess the effect of passage ranking on GLRE, we conduct a study where the inputs of GLRE are selected by using our method for passage ranking.


Ranking Methods     Books                          Wikia Texts
                    Precision  Recall  F1-score    Precision  Recall  F1-score
TF-IDF-based        0.14       0.11    0.12        0.17       0.26    0.21
Embedding-based     0.12       0.11    0.12        0.10       0.30    0.15

Table 11: Automated Evaluation: Study on the ranking method.

Models                     Books                          Wikia Texts
                           Precision  Recall  F1-score    Precision  Recall  F1-score
GLRE (Wang et al.)         0.17       0.03    0.05        0.18       0.07    0.10
GLRE + Passage Ranking     0.20       0.03    0.06        0.21       0.10    0.13
KnowFi                     0.14       0.11    0.12        0.17       0.26    0.21

Table 12: Automated Evaluation: GLRE with Passage Ranking.

Table 12 shows that, by using passage ranking to filter the inputs, GLRE is able to achieve higher precision and recall, compared to GLRE without passage ranking. However, this enhanced variant is still inferior to KnowFi by a substantial margin.

Appendix D. Impact of Training Data Quality

Training data is one of the most important factors that impact the quality of supervised models; therefore, it is essential to maintain the quality of the training data, especially when working on specific domains where training data is usually not available. To evaluate the quality of our training data, we compare KnowFi and a variant (i.e., without using passage ranking for training data collection) with two other methods which are trained using manual training datasets:

• TACRED [Zhang et al., 2017], a popular dataset for relation extraction on the sentence level. We train our relation extraction model using TACRED and use the model to extract the relations from the test data.

• Diffbot [Mesquita et al., 2019], a commercial API for relation extraction. We run the Diffbot API on our test data to extract the relations.

Note that these comparisons are not systematic, as there are other confounding differences between these two methods and KnowFi. In particular, the comparison with TACRED also comes with a reduction to sentence-level training examples only, while the Diffbot API is a pre-trained general-purpose extractor with necessary limitations in more specific use cases. We automatically evaluate the extractions on three popular relations, spouse, sibling, child, since they are contained in all datasets.

Results. Table 13 reports the results on two universes, Lord of the Rings and Game of Thrones. The results show that our training data achieves results comparable with the other datasets, and even higher F1-scores, on both books and Wikia texts.


Universe   Models                   Books                          Wikia Texts
                                    Precision  Recall  F1-score    Precision  Recall  F1-score
LoTR       Diffbot                  0.68       1.75    0.98        1.69       54.39   3.28
           TACRED-based             28.57      0.93    1.79        5.34       37.96   9.36
           KnowFi - w/o ranking     1.34       4.38    2.05        2.20       79.82   4.27
           KnowFi                   15.1       2.00    3.53        8.19       27.19   12.58
GOT        Diffbot                  6.10       18.97   9.24        7.85       61.46   13.92
           TACRED-based             8.45       4.61    5.96        19.66      40.11   26.39
           KnowFi - w/o ranking     8.29       15.81   10.87       9.64       47.43   16.03
           KnowFi                   11.63      18.83   12.64       19.8       50.59   28.47

Table 13: Average scores on three popular relations: spouse, sibling, child.

Source: Books

Relation: enemy (BERT-EM: ✓, GLRE: ✗, KnowFi: ✓, GT: -)
  C1: So to gain time Gollum challenged Bilbo to the Riddle-game, saying that if he asked a riddle which Bilbo could not guess, then he would kill him and eat him.
  C2: There Gollum crouched at bay, smelling and listening; and Bilbo was tempted to slay him with his sword.

Relation: weapon (BERT-EM: ✓, GLRE: ✗, KnowFi: ✓, GT: -)
  C1: They watched him rejoin the rest of the Slytherin team, who put their heads together, no doubt asking Malfoy whether Harry's broom really was a Firebolt.
  C2: Faking a look of sudden concentration, Harry pulled his Firebolt around and sped off toward the Slytherin end.
  C3: Harry was prepared to bet everything he owned, including his Firebolt, that it wasn't good news...

Relation: ally (BERT-EM: ✗, GLRE: ✗, KnowFi: ✓, GT: ✓)
  C1: ...Lord Blackwood shall be required to confess his treason and abjure his allegiance to the Starks ...
  C2: ..."I swore an oath to Lady Stark, never again to take up arms against the Starks", said Blackwood ...

Relation: founder (BERT-EM: ✗, GLRE: ✓, KnowFi: ✓, GT: ✓)
  C1: There was a great roar and a surge toward the foot of the stairs; he was pressed back against the wall as they ran past him, the mingled members of the Order of the Phoenix, Dumbledore's Army, and Harry's old Quidditch team, all with their wands drawn, heading up into the main castle.

Source: Wikia Texts

Relation: friend (BERT-EM: ✓, GLRE: ✗, KnowFi: ✓, GT: -)
  C1: Mulciber was also a friend of Severus Snape, which upset Lily Evans, who was Snape's best friend at the time.

Relation: spouse (BERT-EM: ✗, GLRE: ✓, KnowFi: ✓, GT: ✓)
  C1: ...Later, after sweets and nuts and cheese had been served and cleared away, Margaery and Tommen began the dancing, looking more than a bit ridiculous as they whirled about the floor. The Tyrell girl stood a good foot and a half taller than her little husband, and Tommen was a clumsy dancer at best ...

Relation: weapon (BERT-EM: ✓, GLRE: ✗, KnowFi: ✓, GT: ✓)
  C1: Randyll repeatedly berates Sam: he insults his weight, tells him the Night's Watch failed to make a man out of him, and says he will never be a great warrior, or inherit Heartsbane, the Tarly family's ancestral Valyrian steel sword.

Relation: culture (BERT-EM: ✓, GLRE: ✓, KnowFi: ✓, GT: ✓)
  C1: The most powerful Ainu, Melkor (later called Morgoth or "Dark Enemy" by the elves), Tolkien's equivalent of, disrupted the theme, and in response, Eru Iluvatar introduced new themes that enhanced the music beyond the comprehension of the Ainur.
  C2: Melkor's brother was Manwe, although Melkor was greater in power and knowledge than any of the Ainur.

Table 14: Anecdotal examples for the outputs of KnowFi and the baselines (GT: ground-truth; in the original table the subject is shown in red and the object in blue).

Appendix E. Anecdotal Examples

Table 14 gives examples for the output of the various methods on sample contexts. In the original table, the red color denotes subjects and the blue color denotes objects.
