Weakly Supervised Subevent Knowledge Acquisition

Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 5345–5356,November 16–20, 2020. c©2020 Association for Computational Linguistics

5345

Weakly Supervised Subevent Knowledge Acquisition

Wenlin Yao1 Zeyu Dai1 Maitreyi Ramaswamy1 Bonan Min2 Ruihong Huang1

1 Department of Computer Science and Engineering, Texas A&M University2 Raytheon BBN Technologies

(wenlinyao,jzdaizeyu,maitreyiramaswam,huangrh)@[email protected]

Abstract

Subevents elaborate an event and widely existin event descriptions. Subevent knowledge isuseful for discourse analysis and event-centricapplications. Acknowledging the scarcity ofsubevent knowledge, we propose a weakly su-pervised approach to extract subevent relationtuples from text and build the first large scalesubevent knowledge base. We first obtain theinitial set of event pairs that are likely to havethe subevent relation, by exploiting two obser-vations that 1) subevents are temporally con-tained by the parent event, and 2) the defini-tions of the parent event can be used to furtherguide the identification of subevents. Then,we collect rich weak supervision using the ini-tial seed subevent pairs to train a contextualclassifier using BERT and apply the classifierto identify new subevent pairs. The evalua-tion showed that the acquired subevent tuples(239K) are of high quality (90.1% accuracy)and cover a wide range of event types. Theacquired subevent knowledge has been shownuseful for discourse analysis and identifying arange of event-event relations1.

1 Introduction

A subevent is the event that happens as a part of theother event (i.e., parent event) spatio-temporally(Glavaš and Šnajder, 2014). Subevents, whichelaborate and expand an event, widely exist inevent descriptions. For instance, when describ-ing election events, people usually describe typicalsubevents such as “nominate candidates”, “debates”and “people vote”. Knowing typical subevents ofan event can help with analyzing several discourserelations (such as expansion and temporal relations)between text units. Furthermore, knowing typical

1Code and the knowledge base are avail-able at https://github.com/wenlinyao/EMNLP20-SubeventAcquisition

subevents of an event is important for understand-ing the internal structure of the event (what is theevent about?) and its properties (is this a violent orpeaceful event?), and therefore has great potentialto benefit event detection, event tracking, event vi-sualization and event summarization among manyother applications.

While being in high demand, little subeventknowledge can be found in existing knowledgebases. Therefore, we aim to extract subevent knowl-edge from text and build the first subevent knowl-edge base covering a large number of commonlyseen events and their rich subevents.

Little research has focused on identifying thesubevent relation between two events in a text.Several datasets annotated with subevent relations(Glavaš et al., 2014; Araki et al., 2014; O’Gormanet al., 2016) exist, but they are extremely small andusually contain dozens to one/two hundred doc-uments. Subevent relation classifiers trained onthese small datasets are not suitable to use to ex-tract subevent knowledge from text, consideringthat subevent relations can appear in dramaticallydifferent contexts depending on topics and events.

We propose to conduct weakly supervised learn-ing and train a wide-coverage contextual classifierto acquire diverse event pairs of the subevent rela-tion from text. We start by creating weak super-vision, where we aim to identify the initial set ofsubevent relation tuples from a text corpus. With nocontextual classifier at the beginning, it is difficultto extract subevent relation tuples because subeventrelations are rarely stated explicitly. Instead, wepropose a novel two-step approach to indirectlyobtain the initial set of subevent relation tuples, ex-ploiting two key observations that (1) subevents aretemporally contained by the parent event, and thuscan be extracted with linguistic expressions that in-

5346

dicate the temporal containment relationship2, and(2) the definition of the parent event is useful toprune spurious subevent tuples away to improvethe quality.

Specifically, we first use several preposition pat-terns (e.g., ei during ej) that indicate the temporalrelation contained_by between events to identifycandidate subevent relation tuples. Then, we con-duct an event definition-guided semantic consis-tency check to remove spurious subevent tuplesthat often include two temporally overlapping butsemantically incompatible events. For example, anews article may report a bombing event that hap-pened in parallel during a festival, but the intensebombing event is not semantically compatible withthe entertaining event festival, as informed by thecommon definition of festival:

A festival is an organized series of cele-bration events, or an organized series ofconcerts, plays, or movies, typically oneheld annually.

Next, we identify sentences from the text corpusthat contain an event pair, and use these sentencesto train a contextual classifier that can recognize thesubevent relation in text. We train the contextualsubevent relation classifier by fine-tuning the pre-trained BERT model (Devlin et al., 2019). We thenapply the contextual BERT classifier to identifynew event pairs that have the subevent relation.

We have built a large knowledge base of 239Ksubevent relation tuples. The knowledge base con-tains subevents for 10,318 unique events, with eachevent associated with 20.1 subevents on average.Intrinsic evaluation demonstrates that the learnedsubevent relation tuples are of high quality (90.1%of accuracy) and are valuable for event ontologybuilding and exploitation.

The learned subevent knowledge has been shownuseful for identifying subevent relations in text,including both intra-instance and cross-sentencecases. In addition, the learned subevent knowl-edge is shown useful for identifying temporal andcausal relations between events as well, for thechallenging cross-sentence cases where we usuallyhave little contextual clues to rely on. Furthermore,when incorporated into a recent neural discourse

2While subevents are also spatially contained by the parentevent, we did not use this observation to identify candidatesubevent relations because the spatial contained_by relationbetween two events is not frequently stated in text.

parser, the learned subevent knowledge has notice-ably improved the performance for identifying twotypes of implicit discourse relations, expansion andtemporal relations.

In short, we made three main contributions: 1).We developed a novel weakly supervised approachto acquire subevent knowledge from text. 2). Wehave built the first large scale subevent knowledgebase that is of high quality and covers a wide rangeof event types. 3). We performed extensive evalu-ation showing that the harvested subevent knowl-edge not only improves subevent relation extrac-tion, but also improves a wide range of NLP taskssuch as causal and temporal relation extraction anddiscourse parsing.

2 Related Work

Subevent Identification: Only a few studies havefocused on identifying subevent relations in text.(Araki et al., 2014) built a logistic regression modelto classify the relation between two events intofull coreference (FC), subevent parent-child (SP),subevent sister (SS), and no coreference (NC).They improved the prediction of SP relations byperforming SS prediction first and using SS pre-diction results in a voting algorithm. (Glavaš andŠnajder, 2014) trained a logistic regression classi-fier using a range of lexical and syntactic featuresand then used Integer Linear Programming (ILP) toenforce document-level coherence for constructingcoherent event hierarchies from news. Recently,(Aldawsari and Finlayson, 2019) outperformed pre-vious models for subevent relation prediction usinga linear SVM classifier, by introducing several newdiscourse features and narrative features.

Subevent Knowledge Acquisition: Consideringthe generalizability issue of supervised contextualclassifiers trained on small annotated data, our pilotresearch on subevent knowledge acquisition (Bad-gett and Huang, 2016) relies on heuristics, wherewe first identify sentences in news articles that arelikely to contain subevents by exploiting a senten-tial pattern3, and then, we extract subevent phrasesfrom those sentences using a phrasal pattern4. Inaddition, this pilot work does not aim to acquirethe parent event together with subevents, instead,

3Subevents often appear in sentences that start or end withcharacteristic phrases such as “media reports” and “witnesssaid”.

4Subevent phrases often occur together in conjunction con-structions as a sequence of subevent phrases.

5347

Populate

election, debatefestival, bombing

election, debatefestival, bombing

PopulateSemantic

Consistency Check (Section 4.2)

Apply

Train Weak Supervision (Section 4)

Contextual Classifier (Section 5)Corpus

prep. Patterns

Surrounding Subevents

... presidential debate ... election were held ... ... a debate to discuss ... during election of ...

... election compaigns ... engage in a debate ...

campaign, persuadeelection, begin term campaign, persuade

... a campaign in which ... by persuading voters ... ... organized a campaign ... persuaded Congress ...

... after a disputed election ... would begin his term ... Distill

Candidate Seed Pairs

Candidate New Pairs

Contexts

ContextsNew Pairs

Identifying New Pairs (Section 6)

Figure 1: Overview of the Subevent Knowledge Acquisition System

it learns a list of subevent phrases from documentsthat are known to describe a certain type of event.Specifically, in this work, we only acquired 610subevent phrases for one type of parent event, civilunrest events. The recent work (Bosselut et al.,2019; Sap et al., 2019) uses generative languagemodels to generate subevent knowledge amongmany other types of commonsense knowledge.

We can potentially incorporate our learnedsubevent knowledge into a general event ontologyto enrich subevent links in the ontology. For in-stance, the Rich Event Ontology (REO) (Brownet al., 2017) unifies two existing knowledge re-sources (i.e., FrameNet (Fillmore et al., 2003) andVerbNet (Kipper et al., 2008)) and two event anno-tated datasets (i.e., ACE (Doddington et al., 2004)and ERE (Song et al., 2015)) to allow users to querymultiple linguistic resources and combine event an-notations. However, REO contains few subeventrelation links between events.

Identification and Acquisition of other EventRelations: Compared to relatively little researchdevoted to subevent identification and acquisi-tion, significantly more research has been donefor identifying and extracting several other typesof event relations, especially temporal relations(Pustejovsky et al., 2003; Chklovski and Pantel,2004; Bethard, 2013; Llorens et al., 2010; D’Souzaand Ng, 2013; Chambers et al., 2014) and causalrelations (Girju, 2003; Bethard and Martin, 2008;Riaz and Girju, 2010; Do et al., 2011; Riaz andGirju, 2013; Mirza and Tonelli, 2014, 2016).

3 Overview of the Weakly SupervisedApproach

Figure 1 shows the overview of the weakly super-vised learning approach for subevent knowledge

acquisition. The key of this approach is to identifyseed event pairs that are likely to be of the subeventrelation in a two-step procedure (Section 4). Wefirst use several temporal relation patterns (e.g., eiduring ej) to identify candidate seed pairs sincea child event is usually temporally contained byits parent event; and then, we conduct a definition-guided semantic consistency check to remove spu-rious subevent pairs that are semantically incompat-ible and are unlikely to have the subevent relation,e.g., (festival, bombing).

Next, we find occurrences of seed pairs in a largetext corpus to quickly generate many subevent re-lation instances, we will also create negative in-stances to train the subevent relation classifier (Sec-tion 5). Then, the trained contextual classifier willbe used to identify new event pairs of the subeventrelation by examining multiple occurrences of anevent pair in text (Section 6). We use the EnglishGigaword (Napoles et al., 2012) as the text corpus.

4 Weak Supervision

4.1 Seed Event Pair Identification

We use six preposition patterns (i.e., during, in,amid, throughout, including, and within) to ex-tract candidate seed event pairs. Specifically, weuse dependency relations5 to recognize preposi-tion patterns, and extract the governor word anddependent word of each pattern. We then checkwhether both words are event triggering words, andtry to attach an argument to an event word to forman event phrase that tends to be more expressiveand self-contained than a single event word, e.g.,sign agreement vs sign, or, attack on troops vsattack. We consider both verb event phrases and

5We use Stanford dependency relations (Manning et al.,2014), e.g., prep_during.

5348

noun event phrases (Appendix A provides more de-tails). We further require that at least one argumentis included in an event pair which may be attachedto the first or the second event. In other words, wedo not consider event pairs in which neither eventhas an argument.

To select seed subevent pairs, we consider eventpairs that co-occur with at least two different pat-terns for at least three times. In this way, we iden-tified around 43K candidate seed pairs from theGigaword corpus. However, many candidate seedpairs identified by the preposition patterns onlyhave the temporal contained_by relation but do nothave the subevent relation. In order to remove suchspurious subevent pairs, we present an event defi-nition guided approach next to conduct semanticconsistency check between the parent event and thechild event of a candidate subevent relation tuple.

4.2 Definition-Guided Semantic CheckThe intuition is that the definition of a parent eventword describes important aspects of the event’smeanings and signifies its potential subevents. Forexample, based on the definition of festival, eventsrelated to “celebrations”, such as ceremony beingheld and set off fireworks, are likely to be correctsubevents of festival; however, bomb explosion andpeople being killed may be distinct events that onlyhappen temporally in parallel with festival.

Specifically, we perform semantic consistencychecks collectively for many candidate event pairsby considering similarities between events and sim-ilarities between the definition of an event and itssubevents, and we cluster event phrases into groupsso that any two event phrases within a group aresemantically compatible. Therefore, when the clus-tering operation is completed, we will recognizean event pair as a spurious subevent relation pairif its two events fall into different clusters. Next,we describe details on graph construction and theclustering algorithm we used.

4.2.1 Graph ConstructionGiven a set of event pairs needing the semantic con-sistency check, we construct an undirected graphG(V,E), where each node in V represents a uniqueevent phrase. We connect event phrases with twotypes of weighted edges. First, for each candidatesubevent relation tuple, we create an edge of weight1.0 between the parent event and the child event.Second, we create an edge between any two eventphrases if their similarity is greater than a certain

threshold, and the edge weight is their similarityscore. To calculate the similarity between two eventphrases, we pair each word from one event phrase(either the event word or an argument) with eachword from the other event phrase and calculate sim-ilarity between two word embeddings6, then thesimilarity between two event phrases is the averageof their word pair similarities. We set the similar-ity threshold as 0.3, after inspecting 200 randomlyselected event pairs with their similarities. If twoevent phrases are already connected because theyare a candidate subevent relation pair, we add theirsimilarity score to the edge weight.

Next, we incorporate event definitions by addingnew nodes and new edges to the graph. Specif-ically, for each event phrase that appears as theparent event in some candidate subevent relationtuples, we create a new node for its event wordrepresenting the event word definition. If the eventword has multiple meanings and therefore multi-ple definitions, we consider at most five definitionsretrieved from WordNet (Miller, 1995) and cre-ate one node for each definition, assuming eachdefinition of the parent event will attract differenttypes of children events. Then, we connect eachdefinition node of a parent event with its childrenevents, if their similarity is over the same similaritythreshold used previously. The similarity betweena definition node and a child event is calculated byexhaustively pairing each non-stop word from thedefinition sentence and each word from the childevent phrase and taking the average of word pairsimilarities.

4.2.2 The Clustering AlgorithmWe use a graph propagation algorithm calledSpeaker-Listener Label Propagation Algorithm(SLPA) (Xie et al., 2011). SLPA has been shown ef-fective for detecting overlapping clusters (Xie et al.,2013), which is preferred because multiple types ofevents may share common subevents. For instance,people being injured is a commonly seen subeventof conflict events (e.g., combat) as well as disasterevents (e.g., hurricane). In addition, SLPA is self-adapted and can converge to the optimal numberof clusters, with no pre-defined number of clustersneeded. Event clusters often become stable soonafter 50 iterations, to ensure convergence, we ranthe algorithm for 60 iterations.

After performing the semantic consistency

6We used word2vec word embeddings.

5349

check, we retained around 30K seed event pairs.We find occurrences of these event pairs in the Giga-word corpus and obtained around 388K7 sentencescontaining an event pair. These sentences will beused as positive instances to train the contextualclassifier.

5 The Contextual Classifier Using BERT

Recently, BERT (Devlin et al., 2019) pretrainedon massive data has achieved high performanceon various NLP tasks. We fine-tune a pretrainedBERT model to build the contextual classifier forsubevent relation identification.

BERT model is essentially a bi-directionalTransformer-based encoder that consists of mul-tiple layers where each layer has multiple attentionheads. Formally, given a sentence with N tokens,each attention head transforms a token vector tiinto query, key, and value vectors qi, ki, vi throughthree linear transformers. Next, for each token,the head calculates the self-attention scores for allother tokens of the input sentence against this tokenas the softmax-normalized dot products betweentwo query and key vectors. The output oi of each at-tention head is a weighted sum of all value vectors:

oi =NX

j=1

wijvj , wij =exp(qTi kj)PNl=1 exp(q

Ti kl)

In this way, we can obtain N contextualized embed-dings {oi}Ni=1 for all words {wi}Ni=1 in a sentenceusing the BERT model. To enforce the BERT en-coder to look at context information other than thetwo event trigger words of a subevent pair, e.g.,war, person battle, we replace the two event triggerwords in a sentence with a special token [MASK]as the original BERT model did for masking. Thecontextualized embeddings at two event triggers’positions (two [MASK]’s positions) are concate-nated and then fed into a feed-forward neural net-work with a softmax prediction layer for three-wayclassification, i.e., to predict two subevent relations(parent-child and child-parent relations dependingon the textual order of two events) and no subeventrelation (Other).

In our experiments, we use the pretrainedBERTbase model provided by (Devlin et al., 2019)with 12 transformer block layers, 768 hidden size

7Some event pairs appear very frequently in the corpus, toencourage diversity of the training data, we keep at most 20sentences that contain the same event pair.

and 12 self-attention heads8. We train the classifierusing cross-entropy loss and Adam (Kingma andBa, 2015) optimizer with initial learning rate 1e-5,0.5 dropout, batch size 16 and 3 training epochs.

5.1 Negative Training InstancesHigh-quality negative training instances that can ef-fectively compete with positive instances are impor-tant to enable the classifier to distinguish subeventrelations from non-subevent relations. We includetwo types of negative instances to fine-tune theBERT classifier.

First, we randomly sample sentences that con-tain an event pair different from any seed pair orcandidate pair (Section 6.1) as negative instances.We sample such negative sentences equal to fivetimes of positive sentences, considering that mostsentences in a corpus do not contain a subevent rela-tion. Second, we observe that the subevent relationis often confused with temporal and causal eventrelations because a subevent is strictly temporallycontained by its parent event. Therefore, to im-prove the discrimination capability of the classifier,we also include sentences containing temporallyor causally related events as negative instances.Specifically, we apply a similar strategy - usingpatterns9 to extract temporal and causal event pairsand then search for these pairs in the text corpus tocollect sentences that contain a temporal or causalevent pair. Event pairs that co-occur with tempo-ral or causal patterns for at least three times areselected for population. We collected 63K tempo-rally related event pairs and 61K causally relatedevent pairs, which were used to identify 371K sen-tences that contain one of the event pairs. In total,we obtained around 1.8 million negative traininginstances.

6 Identifying New Subevent Pairs

We next apply the contextual BERT classifier toidentify new event pairs that express the subeventrelation. It is unnecessary to test on all possiblepairs of events since two random events that co-occur in a sentence have a small chance to havethe subevent relation. In order to narrow downthe search space, we first identify candidate eventpairs that are likely to have the subevent relation.

8Our implementation was based on https://github.com/huggingface/transformers.

9Three temporal patterns - “following”, “before”, “after”and seven causal patterns - “lead to”, “result in”, “result from”,“cause”, “cause by”, “due to”, “because of” are used.

5350

Then, we apply the contextual classifier to examineinstances of each candidate event pair in order todetermine valid subevent relation pairs.

6.1 Candidate Event PairsWe consider two types of candidate event pairs.First, the preposition patterns used to identify seedsubevent relation tuples are again used to identifycandidate event pairs, but with less strict conditions.Specifically, we consider event pairs that co-occurwith any pattern for at least two times as candidateevent pairs. In this way, we identified 1.4 millioncandidate event pairs from the Gigaword corpus.

Second, when a subevent relation tuple appearsin a sentence, it is common to observe othersubevents of the same parent event in the surround-ing context. Therefore, we collect sentences thatcontain a seed subevent relation tuple, and identifyadditional subevents of the same parent event in thetwo preceding and two following sentences. Fur-thermore, we observe that the additional subeventsoften share the subject or direct object with thesubevent of the seed tuple, as a consequence, weonly consider such event phrases found in the sur-rounding sentences and pair them with the par-ent event of the seed tuple to create new candi-date event pairs. Using this method, we extractedaround 89K candidate event pairs from the Giga-word corpus.

6.2 New Subevent Pair Selection CriteriaWe identify a candidate event pair as a newsubevent relation pair only if the majority of itssentential contexts, specifically more than 50% ofthem, were consistently labeled as showing thesubevent relation by the BERT classifier. In addi-tion, we disregard rare event pairs and require thatat least three instances of an event pair have beenlabeled as showing the subevent relation.

The full weakly supervised learning process ac-quires 239K subevent relation pairs, including 30Kseed pairs and 209K classifier identified pairs. Thesubevent knowledge base has 10,318 unique eventsshown as parent events, and each parent event isassociated with 20.1 children events on average.

6.3 An Example Subevent Knowledge GraphThe initial exploration of the learned subeventknowledge shows two interesting observations ofevent hierarchies. Figure 2 shows an example eventgraph. First, we can draw a partition of the eventspace at multiple granularity levels by grouping

Seed Pairs P/R/F1Before Semantic check 44.9/25.3/32.4After Semantic check 55.9/26.2/35.7

Table 1: Performance of the Contextual Classifier.

events based on subevents they share, e.g., the up-per and the lower sections of the example eventgraph illustrate two broad event clusters sharing nosubevent, and within each cluster, we see smallerevent groups (colored) that share subevents exten-sively within a group while sharing fewer subeventsbetween groups. Second, subevents encode eventsemantics and reveal different development stagesof the parent events, e.g., subevents of natural disas-ter events (top left corner) reflect disaster responseand recovery stages.

7 Intrinsic Evaluation

7.1 Precision of the Contextual ClassifierThe contextual classifier is a key component of ourlearning approach. We evaluate the performanceof the BERT contextual classifier on identifyingsubevent relations against several other types ofevent-event relations (e.g., temporal, causal rela-tions, etc.), using the Richer Event Description(RED) corpus (O’Gorman et al., 2016) that is com-prehensively annotated with rich event-event rela-tions. Since the contextual classifier mainly per-forms at the sentence level, we only consider toidentify intra-sentence subevent relations in theRED dataset10.

Table 1 shows the comparisons between twotraining settings - the BERT classifier trained onseed pairs before vs after applying the semanticcheck (43k vs 30k seed pairs) and their identifiedtraining instances. Conducting the semantic checkimproves the precision of the trained classifier by11% with no loss on recall. Without using any an-notated data, the classifier achieves the precisionof 55.9%. While the precision on predicting eachsentential context is not perfect, note that we retaina candidate subevent relation pair only if the ma-jority and more than three of its sentential contextsshow the subevent relation.

7.2 Accuracy of Acquired Subevent PairsWe randomly sampled around 1% of acquiredsubevent pairs, including 300 from seed subeventpairs and 2,090 from newly learned subevent pairs,

10RED has 2635 intra-sentence event relations, 530 of themare subevent relations.

5351

hurricanelost home

suffer damage

seek refuge

earthquake

flooding

evacuate people

war

demonstration

people diepeople are stranded

protest

people fight

invade country

people are killed

exchange fire

people are injured

summit

forum

seminar

conference

interview

Olympics

tournament

championship

deliver speech

discuss issues

reach agreement

sign agreement

raise concerns

coach people

athlete is tested

victory over location

set record

win medal

fans revel

person glory

person delight

overwhelm opponent

person is named

treat patient

burn flag

block road

arrest people

president resign

flee area

player appear

make mistake

casualty be reported

home crumble

exchange view

make comment

make announcement

bombing

competition

Disaster Response

Disaster Recovery

Figure 2: Example Subevent Knowledge Graph ( ! denotes the Parent!Child subevent relation).

Method RED HiEveTrain and test on intra-sentence event pairs

Basic BERT Classifier 61.8/52.3/56.6 49.0/46.7/47.9+ Subevent links 64.8/55.1/59.5 52.5/49.2/50.8+ Event embeddings 67.4/54.2/60.0 52.8/46.3/49.4

Train and test on cross-sentence event pairsBasic BERT Classifier 65.0/64.8/64.9 33.8/37.4/35.5

+ Subevent links 69.6/66.3/67.9 34.0/37.9/35.8+ Event embeddings 69.2/62.9/65.9 32.5/40.8/36.2

Table 2: Subevent Relation Identification. P/R/F1 (%).We predict Parent-Child and Child-Parent subevent re-lations and report the micro-average performance.

and asked two human annotators to judge whetherthe subevent relation exists between two events.The two annotators labeled 250 event pairs in com-mon and agreed on 93.6% (234) of them, and theremaining subevent pairs were evenly split betweenthe two annotators. According to human annota-tions, the accuracy of seed pairs is 91.6% and theaccuracy of newly learned event pairs is 89.9%,with the overall accuracy of 90.1%.

7.3 Coverage of Acquired Subevent Pairs

To see whether the acquired subevent knowledgehas good coverage of diverse event types, we com-pare the unique events appearing in the acquiredsubevent relation tuples with events annotated intwo datasets, ACE (Doddington et al., 2004) andKBP (Ellis et al., 2015), both with rich event typesannotated and being commonly used for event ex-traction evaluation. We found that 73.8% (656/889)of events in ACE and 66.9% (934/1396) of eventsin KBP match with events in the acquired subevent

pairs. Because we aim to evaluate the coverage ongeneral event types instead of specific events, weignore event arguments and only match event wordlemmas.

In addition, we compare our learned 239Ksubevent pairs with the 30K ConceptNet subeventpairs. Interestingly, the two sets only have 311event pairs in common, which shows that our learn-ing approach extracts subevent pairs from real textsthat are often hard to think of by crowd sourcingworkers, the approach used by ConceptNet.

8 Extrinsic Evaluation

8.1 Subevent Relation Identification

To find out whether the learned subevent knowl-edge can be used to improve subevent relation iden-tification in text, we conducted experiments on twodatasets, RED11 and HiEve12 (Glavaš et al., 2014).In our experiments, we consider intra-sentenceand cross-sentence event pairs separately. We ran-domly split data into five folds and conduct cross-validation for evaluation. We fine-tune the sameBERT model using RED or HiEve annotations to

11RED has 530 intra-sentence and 415 cross-sentencesubevent relations.

12HiEve annotated 3,200 event mentions and theirsubevents as well as coreference relations in 100 documents.We first extended the subevent annotations using transitiveclosure rules and coreference relations (Glavaš et al., 2014;Aldawsari and Finlayson, 2019), which produces 490 intra-sentence and 3.1K cross-sentence subevent relations.

5352

Method Macro Acc Comparison Contingency Expansion TemporalBase Model 50.8/47.8/49.0 56.42 43.8/39.0/41.3 44.7/51.3/47.8 66.6/65.7/66.2 48.2/35.0/40.6

+ Subevent (ours) 53.2/49.5/51.0 59.08 44.3/34.9/39.1 49.2/46.1/47.6 66.3/73.3/69.6 52.8/43.8/47.9

Table 3: Multi-class Classification Results on the PDTB dataset. We report accuracy (Acc), macro-average (Macro)P/R/F1 (%) over four implicit discourse relation categories as well as performance on each category.

Method RED TimeBankTrain and test on intra-sentence event pairs




Table 4: Temporal Relation Identification. P/R/F1 (%).We predict Before and After temporal relations and re-port the micro-average performance.

Method RED ESCTrain and test on intra-sentence event pairs




Table 5: Causal Relation Identification. P/R/F1 (%).We predict Cause-Effect and Effect-Cause relations andreport the micro-average performance.

predict subevent relations vs others1314. Note thatfor cross-sentence event pairs, we simply concate-nate two sentences and insert in between the specialtoken [SEP] used in the original BERT.

We propose two methods to incorporate thelearned subevent knowledge. 1) Subevent links.For a pair of events to classify in the RED or HiEvedataset, we check if they match with our learnedsubevent relation tuples. We ignore event argu-ments for matching events and only consider tomatch event word lemmas, for this reason, onepair of events might match with multiple learnedsubevent relations. We count subevent relationsthat match with a given event pair, (X, Y), in two

13For the RED dataset, we consider all the annotated event-event relations in RED other than subevent relations as others.

14For the HiEve dataset, we exhaustively create event men-tion pairs among all the annotated event mentions in HiEveand consider all the mention pairs that were not annotatedwith the subevent relation as others. In this way, we generated3.5K intra-sentence and 59.5K cross-sentence event mentionpairs as others.

directions (X subevent! Y) and (Y subevent! X) sepa-rately, and encode the log values of the two countsin a vector. 2) Event embedding. Subevent rela-tions can be used to build meaningful event em-beddings to have the embeddings of a parent eventand a child event preserve the subevent relationbetween them. Therefore, we train a BiLSTMencoder15 to build event phrase embeddings, us-ing the knowledge representation learning modelTransE (Bordes et al., 2013)16 such that p+ r ⇡ cgiven a parent-child event pair (p, c) having thesubevent relation r. We will use the trained BiL-STM encoder to obtain an embedding for an eventphrase in the RED or HiEve dataset.

Finally, for subevent relation identification, weconcatenate two event word representations ob-tained by the BERT encoder with either a subeventlink vector or two event embeddings obtained usingthe above two methods. Results are shown in Table2. We can see that compared to the basic BERTclassifier, incorporating learned subevent knowl-edge achieves better performance on both datasets,for both intra-sentence and cross-sentence cases.

8.2 Temporal and Causal RelationIdentification

Subevents indicate how an event emerges and de-velops, and therefore, the learned subevent knowl-edge can further be used to identify other seman-tic relations between events, such as temporal andcausal relations. For evaluation, we use the sameRED 17 dataset plus two more datasets, TimeBankv1.218 (Pustejovsky et al., 2003) and Event Story-line Corpus (ESC) v1.519 (Caselli and Inel, 2018),

15The BiLSTM has the hidden size of 50 and uses max-pooling to encode an event phrase.

16We trained TransE for 20 iterations.17RED has 1104 (1010) intra-sentence and 182 (119) cross-

sentence temporal (causal) relations. We consider all theannotated event-event relations in RED other than temporal(causal) relations as others.

18TimeBank has 1,122 intra-sentence and 247 cross-sentence “before/after” temporal relations. We consider allthe annotated event-event relations in TimeBank other than“before/after” relations as others.

19ESC has 1,649 intra-sentence and 3,952 cross-sentencecausal relations. We exhaustively create event mention pairs

5353

dedicated to evaluate temporal relation and causalrelation identification systems respectively. Weuse the same experimental settings, including 5-fold cross-validations and evaluating predictionsof intra- and cross-sentence cases separately. Inaddition, we repurpose the BERT model to predicttemporal relations vs others or predict causal rela-tions vs others, and we use the same two methodsto incorporate the learned subevent knowledge.

Table 4 and 5 show results of temporal andcausal relation identification. We can see thatsubevent knowledge has little impact for identi-fying intra-sentence temporal and causal relationsthat may heavily rely on local contextual patternswithin a sentence. However, for identifying themore challenging cross-sentence cases that usuallyhave little contextual clues to rely on, the learnedsubevent knowledge has noticeably improved thesystem performance on both datasets. This is truefor both temporal relations and causal relations.Overall, the systems achieved the best performancewhen using the event embedding approach to incor-porate subevent knowledge.

8.3 Implicit Discourse Relation Classification

We expect subevent knowledge to be useful forclassifying discourse relations between two textunits in general because subevent descriptions of-ten elaborate and provide a continued discussionof a parent event introduced earlier in text. Forexperiments, we used our recent discourse parsingsystem (Dai and Huang, 2019) that easily incor-porates external event knowledge as a regularizerinto a two-level hierarchical BiLSTM model (BaseModel) for paragraph-level discourse parsing. Theexperimental setting is exactly the same as in (Daiand Huang, 2019).

Table 3 reports the performance of implicit dis-course relation classification on PDTB 2.0 (Prasadet al., 2008). Incorporating the acquired subeventpairs (239K) into the Base Model improves theoverall macro-average F1-score and accuracy by2.0 and 2.6 points respectively, which is non-trivialconsidering the challenges of implicit discourserelation identification. The performance improve-ments are noticeable on both the expansion relationand the temporal relation categories.

among all the annotated event mentions in ESC and considerall the mention pairs that were not annotated with the causal re-lation as others. In this way, we generated 4.1K intra-sentenceand 34K cross-sentence event mention pairs as others.

9 Conclusions

We have presented a novel weakly supervised learn-ing framework for acquiring subevent knowledgeand built the first large scale subevent knowledgebase containing 239K subevent tuples. Evaluationshowed that the acquired subevent pairs are of highquality (90.1% of accuracy) and cover a wide rangeof event types. We performed extensive evaluationsshowing that the harvested subevent knowledge notonly improves subevent relation extraction, but alsoimprove a wide range of NLP tasks such as causaland temporal relation extraction and discourse pars-ing. In the future, we would like to explore usesof the subevent knowledge base for other event-oriented applications such as event tracking.

Acknowledgments

We thank our anonymous reviewers for provid-ing insightful review comments. We gratefullyacknowledge support from National Science Foun-dation (NSF) via the awards IIS-1942918 andIIS-1755943. This work was also supported byDARPA/I2O and U.S. Army Research Office Con-tract No. W911NF-18-C-0003 under the WorldModelers program, and in part by the Officeof the Director of National Intelligence (ODNI),Intelligence Advanced Research Projects Activ-ity (IARPA), via IARPA Contract No. 2019-19051600006 under the BETTER program. Theviews, opinions, and/or findings expressed arethose of the author(s) and should not be interpretedas representing the official views or policies, eitherexpressed or implied, of NSF, ODNI, IARPA, theDepartment of Defense or the U.S. Government.The U.S. Government is authorized to reproduceand distribute reprints for governmental purposesnotwithstanding any copyright annotation therein.

ReferencesMohammed Aldawsari and Mark Finlayson. 2019. De-

tecting subevents using discourse and narrative fea-tures. In Proceedings of the 57th Annual Meet-ing of the Association for Computational Linguistics,pages 4780–4790.

Jun Araki, Zhengzhong Liu, Eduard H Hovy, andTeruko Mitamura. 2014. Detecting subevent struc-ture for event coreference resolution. In LREC,pages 4553–4558.

Allison Badgett and Ruihong Huang. 2016. Extract-ing subevents via an effective two-phase approach.

5354

In Proceedings of the 2016 Conference on Empiri-cal Methods in Natural Language Processing, pages906–911.

Steven Bethard. 2013. Cleartk-timeml: A minimalistapproach to tempeval 2013. In Second Joint Con-ference on Lexical and Computational Semantics (*SEM), volume 2, pages 10–14.

Steven Bethard and James H Martin. 2008. Learn-ing semantic links from a corpus of parallel tem-poral and causal relations. In Proceedings of the46th Annual Meeting of the Association for Compu-tational Linguistics on Human Language Technolo-gies: Short Papers, pages 177–180. Association forComputational Linguistics.

Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko.2013. Translating embeddings for modeling multi-relational data. In Advances in neural informationprocessing systems, pages 2787–2795.

Antoine Bosselut, Hannah Rashkin, Maarten Sap, Chai-tanya Malaviya, Asli Celikyilmaz, and Yejin Choi.2019. Comet: Commonsense transformers forknowledge graph construction. In Association forComputational Linguistics (ACL).

Susan Brown, Claire Bonial, Leo Obrst, and MarthaPalmer. 2017. The rich event ontology. In Proceed-ings of the Events and Stories in the News Workshop,pages 87–97, Vancouver, Canada. Association forComputational Linguistics.

Tommaso Caselli and Oana Inel. 2018. CrowdsourcingStoryLines: Harnessing the crowd for causal relationannotation. In Proceedings of the Workshop Eventsand Stories in the News 2018, pages 44–54, Santa Fe,New Mexico, U.S.A. Association for ComputationalLinguistics.

Nathanael Chambers, Taylor Cassidy, Bill McDowell,and Steven Bethard. 2014. Dense event orderingwith a multi-pass architecture. Transactions of theAssociation for Computational Linguistics, 2:273–284.

Timothy Chklovski and Patrick Pantel. 2004. Verbo-cean: Mining the web for fine-grained semantic verbrelations. In Proceedings of the 2004 Conferenceon Empirical Methods in Natural Language Process-ing.

Zeyu Dai and Ruihong Huang. 2019. A regulariza-tion approach for incorporating event knowledgeand coreference relations into neural discourse pars-ing. In Proceedings of the 2019 Conference onEmpirical Methods in Natural Language Processingand the 9th International Joint Conference on Natu-ral Language Processing (EMNLP-IJCNLP), pages2974–2985.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, andKristina Toutanova. 2019. Bert: Pre-training of

deep bidirectional transformers for language under-standing. In Proceedings of the 2019 Conference ofthe North American Chapter of the Association forComputational Linguistics: Human Language Tech-nologies, Volume 1 (Long and Short Papers), pages4171–4186.

Quang Xuan Do, Yee Seng Chan, and Dan Roth.2011. Minimally supervised event causality identi-fication. In Proceedings of the Conference on Em-pirical Methods in Natural Language Processing,pages 294–303. Association for Computational Lin-guistics.

George R Doddington, Alexis Mitchell, Mark A Przy-bocki, Lance A Ramshaw, Stephanie M Strassel, andRalph M Weischedel. 2004. The automatic contentextraction (ace) program-tasks, data, and evaluation.In Lrec, volume 2, page 1. Lisbon.

Jennifer D’Souza and Vincent Ng. 2013. Classifyingtemporal relations with rich linguistic knowledge.In HLT-NAACL, pages 918–927.

Joe Ellis, Jeremy Getman, Dana Fore, Neil Kuster,Zhiyi Song, Ann Bies, and Stephanie Strassel. 2015.Overview of linguistic resources for the tac kbp 2015evaluations: Methodologies and reults. In Proceed-ings of the TAC KBP 2015 Workshop, pages 16–17.

Charles J Fillmore, Christopher R Johnson, andMiriam RL Petruck. 2003. Background to framenet.International journal of lexicography, 16(3):235–250.

Roxana Girju. 2003. Automatic detection of causal re-lations for question answering. In Proceedings ofthe ACL 2003 workshop on Multilingual summariza-tion and question answering-Volume 12, pages 76–83. Association for Computational Linguistics.

Goran Glavaš and Jan Šnajder. 2014. Constructing co-herent event hierarchies from news stories. In Pro-ceedings of TextGraphs-9: the workshop on Graph-based Methods for Natural Language Processing,pages 34–38.

Goran Glavaš, Jan Šnajder, Parisa Kordjamshidi, andMarie-Francine Moens. 2014. Hieve: A corpus forextracting event hierarchies from news stories. InProceedings of 9th language resources and evalua-tion conference, pages 3678–3683. ELRA.

Diederik P Kingma and Jimmy Ba. 2015. Adam: Amethod for stochastic optimization. In InternationalConference on Learning Representations (ICLR).

Karin Kipper, Anna Korhonen, Neville Ryant, andMartha Palmer. 2008. A large-scale classification ofenglish verbs. Language Resources and Evaluation,42(1):21–40.

Hector Llorens, Estela Saquete, and Borja Navarro.2010. Tipsem (english and spanish): Evaluating crfsand semantic roles in tempeval-2. In Proceedings of

5355

the 5th International Workshop on Semantic Evalua-tion, pages 284–291. Association for ComputationalLinguistics.

Christopher D. Manning, Mihai Surdeanu, John Bauer,Jenny Finkel, Steven J. Bethard, and David Mc-Closky. 2014. The stanford corenlp natural languageprocessing toolkit. In Proceedings of the 52nd An-nual Meeting of the Association for ComputationalLinguistics (ACL), pages 55–60.

George A Miller. 1995. Wordnet: a lexical database forenglish. Communications of the ACM, 38(11):39–41.

Paramita Mirza and Sara Tonelli. 2014. An analysis ofcausality between events and its relation to temporalinformation. In COLING, pages 2097–2106.

Paramita Mirza and Sara Tonelli. 2016. Catena: Causaland temporal relation extraction from natural lan-guage texts. In The 26th International Conferenceon Computational Linguistics, pages 64–75.

Courtney Napoles, Matthew Gormley, and BenjaminVan Durme. 2012. Annotated gigaword. In Pro-ceedings of the Joint Workshop on Automatic Knowl-edge Base Construction and Web-scale KnowledgeExtraction, pages 95–100. Association for Computa-tional Linguistics.

Tim O’Gorman, Kristin Wright-Bettner, and MarthaPalmer. 2016. Richer event description: Integratingevent coreference with temporal, causal and bridg-ing annotation. In Proceedings of the 2nd Work-shop on Computing News Storylines (CNS 2016),pages 47–56, Austin, Texas. Association for Com-putational Linguistics.

Tim O’Gorman, Kristin Wright-Bettner, and MarthaPalmer. 2016. Richer event description: Integratingevent coreference with temporal, causal and bridg-ing annotation. In Proceedings of the 2nd Workshopon Computing News Storylines (CNS 2016), pages47–56.

R. Prasad, N. Dinesh, Lee A., E. Miltsakaki,L. Robaldo, Joshi A., and B. Webber. 2008. ThePenn Discourse Treebank 2.0. In lrec2008.

James Pustejovsky, Patrick Hanks, Roser Sauri, An-drew See, Robert Gaizauskas, Andrea Setzer,Dragomir Radev, Beth Sundheim, David Day, LisaFerro, et al. 2003. The timebank corpus. In Corpuslinguistics, volume 2003, page 40.

Mehwish Riaz and Roxana Girju. 2010. Another lookat causality: Discovering scenario-specific contin-gency relationships with no supervision. In Seman-tic Computing (ICSC), 2010 IEEE Fourth Interna-tional Conference on, pages 361–368. IEEE.

Mehwish Riaz and Roxana Girju. 2013. Toward a bet-ter understanding of causality between verbal events:

Extraction and analysis of the causal power of verb-verb associations. In Proceedings of the annual SIG-dial Meeting on Discourse and Dialogue (SIGDIAL).Citeseer.

Maarten Sap, Ronan Le Bras, Emily Allaway, Chan-dra Bhagavatula, Nicholas Lourie, Hannah Rashkin,Brendan Roof, Noah A Smith, and Yejin Choi. 2019.Atomic: An atlas of machine commonsense for if-then reasoning. In Proceedings of the AAAI Con-ference on Artificial Intelligence, volume 33, pages3027–3035.

Zhiyi Song, Ann Bies, Stephanie Strassel, Tom Riese,Justin Mott, Joe Ellis, Jonathan Wright, Seth Kulick,Neville Ryant, and Xiaoyi Ma. 2015. From light torich ere: annotation of entities, relations, and events.In Proceedings of the the 3rd Workshop on EVENTS:Definition, Detection, Coreference, and Representa-tion, pages 89–98.

Jierui Xie, Stephen Kelley, and Boleslaw K Szyman-ski. 2013. Overlapping community detection in net-works: The state-of-the-art and comparative study.Acm computing surveys (csur), 45(4):43.

Jierui Xie, Boleslaw K Szymanski, and Xiaoming Liu.2011. Slpa: Uncovering overlapping communi-ties in social networks via a speaker-listener interac-tion dynamic process. In Data Mining Workshops(ICDMW), 2011 IEEE 11th International Confer-ence on, pages 344–349. IEEE.

5356

A Event Representations

We consider both verb event phrases and nounevent phrases.Verb Event Phrases: To ensure good coverage ofregular event pairs, we consider all verbs20 as eventwords except possession verbs21. The thematic pa-tient of a verb refers to the object being acted uponand is essentially part of an event, therefore, wefirst consider the patient of a verb in forming anevent phrase22. The agent is also useful to specifyan event especially for an intransitive verb event,which does not have a patient. Therefore, we in-clude the agent of a verb event in an event phraseif its patient was not found. The patient or agentof a verb is identified using dependency relations23.If neither a patient nor an agent was found, we in-clude a preposition phrase (a preposition and itsobject) that modifies a verb in the event representa-tion to form an event phrase. Example verb eventphrases are “agreement be signed” and “occupyterritory”.Noun Event Phrases: We include a prepositionphrase (a preposition and its object)that modifiesa noun event in the event representation to form anoun event phrase. We first consider a prepositionphrase headed by the preposition of, then a prepo-sition phrase headed by the preposition by, lastlya preposition phrase headed by any other preposi-tion. Example noun event phrases are “ceremonyat location” and “attack on troops”.

Note that many noun words do not refer to anevent, therefore, we apply two strategies to quicklycompile a list of noun event words. First, we obtaina list of deverbal nouns24 (5028 event nouns) byquerying each noun in WordNet (Miller, 1995) andchecking if its root word form has a verb sense.Second, we identify five intuitive textual patterns,e.g., participate in EVENT, and extract their prepo-sitional direct objects as potential noun events. Thefive patterns are: participate in EVENT, involve in

20We used POS tags to detect verb events.21We determined that possession verbs, such as “own”,

“have” and “contain”, mainly express the ownership status sowe discarded these event phrases.

22In particular, we require a light verb (e.g., do, make, takeetc.) to have a direct object because light verbs have littlesemantic content of their own.

23We use Stanford dependency relations (Manning et al.,2014). We identify the patient as the direct object of an activeverb or the subject of a passive verb; we identify the agent asthe subject of an active verb or the object of preposition bymodifying a passive verb.

24Derivative nouns ending with suffixes -er, -or are dis-carded.

EVENT, engage in EVENT, play role in EVENT andseries of EVENT. We rank extractions first by thenumber of times they occur with these patternsand then by the number of unique patterns theyoccur with. We next quickly went through thetop 5,000 nouns and manually removed non-eventwords, which results in 3154 noun event words.Event Phrase Generalization: Including argu-ments into event representations generates eventphrases that are too specific though. In orderto obtain generalized event phrase forms, we re-place named entity arguments with their entitytypes (Manning et al., 2014). We also replace per-sonal pronouns with the entity type PERSON.

Weakly Supervised Subevent Knowledge Acquisition

Documents

Transcript of Weakly Supervised Subevent Knowledge Acquisition