
Stanford’s Distantly Supervised Slot Filling Systems for KBP 2014

Gabor Angeli∗, Sonal Gupta∗, Melvin Johnson Premkumar∗, Christopher D. Manning∗, Christopher Ré∗, Julie Tibshirani∗, Jean Y. Wu∗, Sen Wu∗, Ce Zhang†

∗Stanford University; Stanford, CA 94305
{angeli, sonalg, melvinj, manning}@stanford.edu
{chrismre, jtibs, jeaneis, senwu}@stanford.edu

†University of Wisconsin; Madison, WI 53706
[email protected]

Abstract

We describe Stanford’s entry in the TAC-KBP 2014 Slot Filling challenge. We submitted two broad approaches to Slot Filling, both strongly based on the ideas of distant supervision: one built on the DeepDive framework (Niu et al., 2012), and another based on the multi-instance multi-label relation extractor of Surdeanu et al. (2012). In addition, we evaluate the impact of learned and hard-coded patterns on performance for slot filling, and the impact of the partial annotations described in Angeli et al. (2014).

1 Introduction

We describe Stanford’s two systems in the 2014 KBP Slot Filling competition. The first, and best performing system, is built on top of the DeepDive framework. We describe the system briefly below, and in more detail in Section 2; the central lesson we would like to emphasize from this system is that leveraging large computers and scalable distributed software systems allows for completely removing the information retrieval component of a traditional KBP system, and allows for quick turnaround times while processing the entire source corpus as a single unit. DeepDive offers a convenient framework for developing systems on these large computers, including defining the preprocessing pipelines (feature engineering, entity linking, mention detection, etc.) and then defining and training a relation extraction model.

The second system Stanford submitted is based around the MIML-RE relation extractor, following closely from the 2013 submission, but with the addition of learned patterns, and a new MIML-RE relation extraction model incorporating informative sentence-level labels during training. The central lesson we would like to emphasize from this system is that a relatively small annotation effort (10k sentences) over carefully selected examples can yield a surprisingly large gain in end-to-end performance on the Slot Filling task (4.4 F1). We describe the two systems briefly below.

1.1 DeepDive

Stanford submitted two runs using the DeepDive framework for relation extraction (Niu et al., 2012).1 The systems differed only in the threshold chosen for proposing relations. DeepDive is a framework for constructing trained systems whose goal is to make it easier to integrate domain-specific knowledge and user feedback. It runs in three phases: feature extraction, engineered constraints, and statistical inference and learning. The first phase is implemented as steps in a pipeline, with intermediate results stored in a large distributed database. For example, once initialized with a Sentences table, mention detection runs a simple Python script on each row in the Sentences table and populates the resulting mentions in a Mentions table. Featurization can then take every pair of mentions from this Mentions table, and run a featurization script to populate the features table. Importantly, the entire source corpus is processed as a single unit, without need for an information retrieval component.

Inference (and consequently learning) over the data is performed by grounding a factor graph via a set of database queries, and then running a Gibbs sampler to infer relation probabilities (Niu et al., 2011a). The model uses the distant supervision assumption, as in Mintz et al. (2009).

Negative examples are constructed identically to Stanford’s 2013 system, taking incompatible and incomplete relations with respect to the input knowledge base. The training corpus consists of only the source corpus for the year it is evaluated on: for the 2013 dev set and 2014 evaluation set, this consists of only the 2013 source corpus.

1 http://deepdive.stanford.edu/


Figure 1: An overview of a system built with DeepDive that takes as input both structured and unstructured information and creates a knowledge base as the output. There are three phases of DeepDive’s execution model: (1) Feature Extraction; (2) Probabilistic Engineering; and (3) Inference and Learning. Section 2 contains a more detailed walkthrough of these phases, and Figure 2 shows more details of how to conduct tasks in these three phases using SQL and script languages, e.g., Python.

In addition, snowball sampling (Goodman, 1961) over a large web corpus was used to collect additional named entity triggers, and some high-precision relation triggers. We used Freebase as our source of training data, aligned to the set of KBP relations. Entity linking was performed with a series of high-precision rules.

The system obtained an F1 of 36.7 on the 2014 evaluation. In development runs, the system obtained an F1 of 62 on the 2010 data, and 56 on the 2013 data.

1.2 MIML-RE

In addition to the DeepDive system, Stanford submitted three systems making use of the MIML-RE model (Surdeanu et al., 2012). These are to a large extent similar to Stanford’s submission last year, but with some key improvements.

First, the MIML-RE model was re-trained using selectively human-annotated relation mention labels, as described in Angeli et al. (2014) and summarized in Section 3. Two submitted runs differ only by their use of the unmodified versus active learning models.

In addition, our model employed a somewhat extended set of hand-coded patterns, as well as a larger collection of automatically learned patterns (see Section 3.5). Two runs differ in their use of these patterns versus using only the MIML-RE relation extractor.

For training, we used both the 2010 and 2013 source corpora, as well as a 2013 dump of Wikipedia. For testing, we used only the official corpus for the evaluation year, as this was shown to perform better in the 2013 evaluation than using all three corpora.

The dev F1 scores for each year for the best MIML-RE based system are given below. Slot thresholds were trained on a per-relation basis to maximize the score on the 2013 development set. Our best MIML-RE based system obtained an F1 of 32; active learning gained us an improvement of 4.4 F1, while patterns gained us 1.9 F1. Our scores on the development sets were:

Year  2009   2010   2011   2012   2013
F1    25.54  32.02  28.53  30.74  40.86

We proceed to describe our two approaches in more detail.

2 A Brief Overview of DeepDive

We briefly introduce the programming and execution model of DeepDive. Part of this text is adapted from our previous work (Ré et al., 2014). There are three phases in DeepDive’s execution model, as shown in Figure 1, and described below. All of the experiments on the KBP corpus were performed on a machine with 1 TB of memory; a distributed Greenplum relational database was set up with 48 nodes on that machine. It is key to point out that each of these phases runs in parallel on 48 cores, and primarily in main memory.

(1) Feature Extraction. The input to this stage often contains both structured and unstructured information, and the goal is to produce a relational database that describes the features or signals of the data, which we call the evidence schema.


Figure 2: Illustration of popular operations in DeepDive. (a) Prepare data sets in relational form that can be used by DeepDive; (b) generate labels using distant supervision with SQL; (c) integrate constraints with SQL and logic functions; (d) extract features with SQL and script languages (e.g., Python).

We use the phrase evidence to emphasize that, in this stage, data is not required to be precisely correct; a fact which is very relevant for distantly supervised systems.

(2) Probabilistic Engineering. The goal of this phase is to transform the evidence schema into a probabilistic model, specifically a factor graph that specifies:

• The set of random variables that the user wants to model. To define the set of random variables in DeepDive is to create new relations called query relations, in which each tuple corresponds to one random variable. This operation can be done by using SQL queries on existing relations.

• How those random variables are correlated; e.g., whether the mention “Obama” refers to the president is correlated with the random variable that indicates whether “Obama” is a person. To specify how, the user specifies a factor function. Although our submissions do not make use of more complex constraints, they are easy to encode in the framework. Figure 2(c) shows a rough example that describes the intuition that “people tend to be married to only a single person.” One (of many ways) to say this is to say that there is some correlation between pairs of married tuples; i.e., using the logical function AND(t0, t1), which returns 1 in possible worlds in which both married tuples are true and 0 in others. DeepDive then learns the “strength” of this correlation from the data, which is encoded as a weight.2 Here, −2 indicates that it is less likely that both married tuples are correct. This phase is also where the user can write logical constraints, e.g., hard functional constraints.

Our previous work has shown that this grounding phase (Niu et al., 2011a) can be a serious bottleneck if one does not use scalable relational technology; we have learned this lesson several times.

(3) Inference and Learning. This phase is largely opaque to the user: it takes the factor graph as input, estimates the weights, performs inference, and produces the output database along with a host of diagnostic information, notably calibration plots (see Figure 3). More precisely, the output of DeepDive is a database that contains each random variable declared by the user with its marginal probability. For example, one tuple in the relation HasSpouse might be (Barack Obama, Michelle Obama), and ideally, we hope that DeepDive outputs a large probability for this tuple. The scalability and performance of this component is studied in our previous work (Zhang and Ré, 2013; Zhang and Ré, 2014; Niu et al., 2011b).

2 A weight is roughly the log odds, i.e., $\log \frac{p}{1-p}$, where $p$ is the marginal probability of this random variable. This is standard in Markov Logic Networks (Richardson and Domingos, 2006), on which much of DeepDive’s semantics is based.
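As a concrete illustration of the log-odds interpretation in the footnote above, the following is a minimal sketch (ours, not DeepDive code) showing that, in the simplest case of a single Boolean variable touched by one factor of weight w, the marginal probability is the logistic function of w:

```python
import math

def weight_to_probability(w: float) -> float:
    """For a single Boolean variable with one factor of weight w,
    p / (1 - p) = exp(w), so p = 1 / (1 + exp(-w))."""
    return 1.0 / (1.0 + math.exp(-w))

def probability_to_weight(p: float) -> float:
    """Inverse direction: the weight is the log odds of p."""
    return math.log(p / (1.0 - p))

if __name__ == "__main__":
    print(weight_to_probability(5.0))   # ~0.993: strong positive evidence
    print(weight_to_probability(-2.0))  # ~0.119: the "both married" case is unlikely
    print(probability_to_weight(0.9))   # ~2.197
```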


2.1 Operations to Improve Quality in DeepDive

We describe three routine tasks that a user performs to improve a DeepDive-based system.

Feature Extraction. DeepDive’s data model allows the user to use any scripting language for feature extraction. Figure 2(b) shows one such example using Python. One baseline feature that is often used in relation extraction systems is the word sequence between mention pairs in a sentence (Hoffmann et al., 2010), and Figure 2(b) shows an example of extracting this feature. The user first defines the input to the feature extractor using an SQL query, which selects all available sentences. Then the user defines a UDF (User Defined Function) that will be executed for each tuple returned by the SQL query. In this example, the UDF is a Python function that first reads a sentence from STDIN, extracts mentions from the sentence, extracts features for each mention pair, and outputs the result to STDOUT. DeepDive will then load the output of this UDF into the MentionPairFeature relation.
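To make the data flow concrete, below is a minimal sketch of such a feature-extraction UDF in the style described above. The helpers extract_mentions and word_sequence_feature are hypothetical stand-ins for the real extractors, and the tab-separated STDIN/STDOUT convention is an assumption for illustration rather than DeepDive’s exact wire format.

```python
import sys

def extract_mentions(sentence):
    """Hypothetical mention detector: naively treat runs of capitalized
    tokens as mentions and return their (start, end) token offsets."""
    tokens = sentence.split()
    mentions, start = [], None
    for i, tok in enumerate(tokens):
        if tok[:1].isupper():
            start = i if start is None else start
        else:
            if start is not None:
                mentions.append((start, i))
            start = None
    if start is not None:
        mentions.append((start, len(tokens)))
    return mentions

def word_sequence_feature(sentence, m1, m2):
    """Baseline feature: the word sequence between the two mentions."""
    tokens = sentence.split()
    left, right = min(m1[1], m2[1]), max(m1[0], m2[0])
    return "_".join(tokens[left:right])

if __name__ == "__main__":
    # One sentence per input line; emit one (mention1, mention2, feature) row per pair.
    for line in sys.stdin:
        sentence = line.strip()
        mentions = extract_mentions(sentence)
        for i, m1 in enumerate(mentions):
            for m2 in mentions[i + 1:]:
                feature = word_sequence_feature(sentence, m1, m2)
                print("\t".join([str(m1), str(m2), feature]))
```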

Constraints and Domain Knowledge. One way to improve a KBP system is to integrate domain knowledge. DeepDive supports this operation by allowing the user to integrate constraints and domain knowledge as correlations among random variables, as shown in Figure 2(c).

Imagine that the user wants to integrate a simple rule that says “one person is likely to be the spouse of only one person.” For example, given a single entity “Barack Obama,” this rule gives positive preference to the case where only one of (Barack Obama, Michelle Obama) and (Barack Obama, Michelle Williams) is true. Figure 2(c) shows one example of implementing this rule. The SQL query in Figure 2(c) defines a view in which each tuple corresponds to two relation candidates with the same first entity but different second entities. The function AND(t0, t1) defines the “type of correlation” among variables, and the weight “−2” defines the strength of the correlation. This rule indicates that it is less likely that both (Barack Obama, Michelle Obama) and (Barack Obama, Michelle Williams) are true (i.e., when AND(t0, t1) returns 1). Typically, DeepDive is used to learn the weights from data.
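To show concretely which variable pairs such a rule touches, the sketch below enumerates the candidate pairs that share a first entity but differ in the second, i.e., the tuples the SQL view above would return; the factor-graph machinery itself is omitted, and the representation of candidates as plain Python tuples is our simplification.

```python
from itertools import combinations

def one_spouse_factor_instances(candidates):
    """candidates: list of (e1, e2) HasSpouse candidate tuples (the query relation).
    Yields every pair (t0, t1) with the same first entity but a different second
    entity; each such pair would receive one AND factor with a learned weight."""
    by_first = {}
    for e1, e2 in candidates:
        by_first.setdefault(e1, []).append((e1, e2))
    for group in by_first.values():
        for t0, t1 in combinations(group, 2):
            yield t0, t1

if __name__ == "__main__":
    candidates = [("Barack Obama", "Michelle Obama"),
                  ("Barack Obama", "Michelle Williams"),
                  ("Prince William", "Kate Middleton")]
    for t0, t1 in one_spouse_factor_instances(candidates):
        print(t0, t1)   # only the two Barack Obama candidates are paired
```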

Distant Supervision. Like the MIML-RE system, DeepDive is built on top of the distant supervision assumption, allowing more data to be incorporated easily.

As we have described, the user has at least the above three ways to improve the system, and is free to use one or a combination of them to improve the system’s quality. The question we address next is, “What should the user do next to get the largest quality improvement in the KBP system?”

2.2 Calibration Plots in DeepDive

A DeepDive system is only as good as its features and rules. In the last two years, we have found that understanding which features to add is the most critical step in the process. Part of this process is the usual fine-grained error analysis; however, we have found that a macro-error analysis can be useful to guard against statistical errors and give an at-a-glance description of an otherwise prohibitively large amount of output.

In DeepDive, calibration plots are used to summarize the overall quality of the results. Because DeepDive uses a joint probability model, each random variable is assigned a marginal probability. Ideally, if one takes all the facts to which DeepDive assigns a probability score of 0.95, then 95% of these facts are correct. We believe that probabilities remove a key element: the developer reasons about features, not the algorithms underneath. This is a type of algorithm independence that we believe is critical.

DeepDive programs define one or more test sets for each relation, each of which is essentially a set of labeled data for that particular relation. This set is used to produce a calibration plot. Figure 3 shows an example calibration plot, which provides an aggregated view of how the system behaves. By reading each of the subplots, we can get a rough assessment of the next step to improve our system. We explain each component below.

As shown in Figure 3, a calibration plot contains three components: (a) accuracy; (b) # predictions (test set), which measures the number of extractions in the test set with a certain probability; and (c) # predictions (whole set), which measures the number of extractions in the whole set with a certain probability. The test set is assumed to have labels so that we can measure accuracy, while the whole set does not.

Figure 3: Illustration of a sample calibration plot in DeepDive.

(a) Accuracy. To create the accuracy histogram, we bin each fact extracted by DeepDive on the test set by the probability score assigned to each fact; e.g., we round to the nearest value in the set k/10 for k = 1, . . . , 10. For each bin, we compute the fraction of those predictions that are correct. Ideally, this line would be on the (0,0)-(1,1) line, which means that the DeepDive-produced probability value is calibrated, i.e., it matches the test-set accuracy. For example, Figure 3(a) shows a curve for calibration. Besides calibration errors, differences in these two lines can be caused by noise in the training data, quantization error due to binning, or sparsity in the training data.
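The binning described above is straightforward to reproduce; the following is a minimal sketch of the computation over a labeled test set. The 10-bin rounding mirrors the description, but the function and data layout are our own illustration rather than DeepDive’s implementation.

```python
from collections import defaultdict

def calibration_bins(predictions, num_bins=10):
    """predictions: list of (probability, is_correct) pairs from the test set.
    Returns {bin_center: (accuracy, count)}, where each probability is rounded
    to the nearest k/num_bins before aggregation."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for prob, is_correct in predictions:
        bin_center = round(prob * num_bins) / num_bins
        total[bin_center] += 1
        correct[bin_center] += int(is_correct)
    return {b: (correct[b] / total[b], total[b]) for b in sorted(total)}

if __name__ == "__main__":
    preds = [(0.95, True), (0.92, True), (0.88, False), (0.55, False), (0.12, False)]
    for center, (acc, n) in calibration_bins(preds).items():
        print(f"bin {center:.1f}: accuracy {acc:.2f} over {n} predictions")
```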

(b) # Predictions (Test Set). We also create a histogram of the number of predictions in each bin. In a well-tuned system, the # Predictions histogram should have a “U” shape. That is, most of the extractions are concentrated at high probability and low probability. We do want a number of low-probability events, as this indicates DeepDive is considering plausible but ultimately incorrect alternatives. Figure 3(b) shows a U-shaped curve with some mass around 0.5-0.6. Intuitively, this suggests that there is some hidden type of example for which the system has insufficient features. More generally, facts that fall into bins that are not in (0, 0.1) or (0.9, 1.0) are candidates for improvement, and one goal of improving a KBC system is to “push” these probabilities into either (0, 0.1) or (0.9, 1.0). To do this, we may want to sample from these examples and add more features to resolve this uncertainty.

(c) # Predictions (Whole Set). The final histogram is similar to Figure 3(b), but illustrates the behavior of the system on the whole data set, for which we do not have supervision. We can visually inspect that Figure 3(c) has a similar shape to (b); if not, this would suggest possible overfitting or some bias in the selection of the held-out set.

2.3 Description of the Submission System

We describe system details of the two runs we submitted with DeepDive, with a focus on the lessons we learned during the development of these systems.

Input Data Preparation. The input to DeepDive is a text corpus with NLP markups that are stored as a database relation. The baseline system we built contains all NLP markups produced by Stanford CoreNLP. During our development, we found that in one system snapshot with a recall of 45 (2010 data set), parsing-related errors contributed to 31% of recall errors. A further analysis showed that some of these sentences do have a correct parse in other parsers. Motivated by this observation, our final submission contains the union of parsing results from three parsers, namely Stanford CoreNLP, the Malt parser, and BBN’s SERIF. Among these three parsers, we ran Stanford CoreNLP and the Malt parser ourselves, while the result of SERIF is shipped with the TAC 2014 KBP English Source Corpus (LDC2014E13). We conducted post-processing on the results of the Malt parser and SERIF to make sure they share the same format as the output of Stanford CoreNLP. For the Malt parser, we collapse the dependency tree as in Stanford CoreNLP; for SERIF, we use Stanford CoreNLP (Manning et al., 2014) to translate the constituency parse into collapsed dependencies. These post-processing steps ensure that we can use exactly the same DeepDive program to process the output of all three parsers, and therefore do not need to change the DeepDive program.

Feature Extraction. The features we used in our submission largely follow state-of-the-art work. Baseline features that we use include the dependency paths and word sequences between a mention pair. For each baseline feature, we apply one or a conjunction of the following operations to generate a new feature: (1) replace a verb (resp. noun) with its POS or NER tag; (2) replace a verb (resp. noun) with its WordNet synset; (3) replace a verb (resp. noun) with its WordNet hypernyms. These are the basic features that we use for a mention pair in a sentence. Another class of features that we use involves a mention pair and a keyword (e.g., president, chair, marry, etc.) in the sentence. Given two mentions m1, m2, and a keyword w, we create a feature that is a conjunction of (1) (m1, w)’s feature; (2) (m2, w)’s feature; and (3) w’s lemma form. For (mi, w)’s feature, we use the same feature extractor as we defined for a single mention pair. The dictionary of keywords is generated as described below.
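To illustrate the kind of feature expansion described here, the sketch below generates variants of a word-sequence feature by substituting individual tokens with their POS/NER tags or with WordNet synset and hypernym names. It is a simplified illustration that assumes tokens arrive pre-tagged and that WordNet lookups are supplied externally; the real system’s feature templates are richer.

```python
def feature_variants(tokens, tags, substitutions):
    """tokens: the word sequence between a mention pair, e.g. ["and", "his", "wife"].
    tags:          a parallel list of POS or NER tags for those tokens.
    substitutions: per-token alternatives (e.g. WordNet synset or hypernym names).
    Yields the baseline feature plus one variant per single-token substitution."""
    yield "_".join(tokens)                      # baseline: the raw word sequence
    for i in range(len(tokens)):
        # (1) replace the token with its POS/NER tag
        yield "_".join(tokens[:i] + [tags[i]] + tokens[i + 1:])
        # (2)/(3) replace the token with a WordNet synset or hypernym name
        for alt in substitutions.get(tokens[i], []):
            yield "_".join(tokens[:i] + [alt] + tokens[i + 1:])

if __name__ == "__main__":
    tokens = ["and", "his", "wife"]
    tags = ["CC", "PRP$", "NN"]
    # Hypothetical WordNet lookups; a real system would query WordNet directly.
    subs = {"wife": ["wife.n.01", "spouse.n.01"]}
    for f in feature_variants(tokens, tags, subs):
        print(f)
```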

Distant Supervision. We follow the standard practice of distant supervision to produce positive and negative examples. To improve upon this component, our goal is to harvest more (distantly) labeled training examples. Therefore, our submission involves a labeling phase that is motivated by the success of Hearst’s patterns (Hearst, 1992) in harvesting high-quality taxonomic relations.

Our labeling effort starts from the result of standard distant supervision, which is a set of mention pairs labeled as positive examples for a relation R, denoted R(m1, m2); for each mention pair, we have the set of features associated with it, denoted as a relation F(m1, m2, f). We then pick all features f that (1) are associated with at least one mention pair in R, and (2) appear in the whole corpus at least twice, in distinct sentences. For each such feature, we manually label whether it is an indicator of the relation. Features labeled as indicators are used in the following way: we include all mention pairs associated with the feature as newly generated positive examples (and negative examples for the other relations). One simple optimization we applied is that if a more general feature (e.g., the feature after substituting a noun with its POS tag) is labeled as correct, then all features that are more specific than this feature are also correct and do not need a human label. We manually labeled 100K features in this way. Although 100K sounds like a large number of features to label, we found that it took one student only two days; after getting familiar with the task, the labeling speed can be as fast as one feature per second.
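The selection-and-propagation step above can be sketched roughly as follows; the relation and feature tables are represented here as plain Python collections, which is our simplification of the underlying database queries, and distinct sentences are approximated by distinct mention pairs.

```python
from collections import Counter

def candidate_features(positive_pairs, pair_features, min_occurrences=2):
    """positive_pairs: set of (m1, m2) mention pairs distantly labeled for a relation R.
    pair_features:  dict mapping (m1, m2) -> set of feature strings (the relation F).
    Returns features associated with at least one positive pair and seen at least
    `min_occurrences` times across the corpus (here: across distinct pairs)."""
    corpus_counts = Counter(f for feats in pair_features.values() for f in feats)
    in_positive = set()
    for pair in positive_pairs:
        in_positive.update(pair_features.get(pair, set()))
    return [f for f in in_positive if corpus_counts[f] >= min_occurrences]

def propagate_labels(indicator_features, pair_features):
    """Once a human marks a feature as an indicator of R, every mention pair
    carrying that feature becomes a new positive example for R."""
    indicator_features = set(indicator_features)
    return {pair for pair, feats in pair_features.items() if feats & indicator_features}

if __name__ == "__main__":
    pair_features = {
        ("Barack Obama", "Michelle Obama"): {"and_his_wife"},
        ("Barack Obama", "Joe Biden"): {"and_vice_president"},
        ("Prince William", "Kate Middleton"): {"and_his_wife"},
    }
    positives = {("Barack Obama", "Michelle Obama")}
    cands = candidate_features(positives, pair_features)
    print(cands)                                     # ['and_his_wife']
    print(propagate_labels(cands, pair_features))    # both spouse pairs
```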

2.4 Future Work

Error Analysis. We conducted an error analysis after the scores and ground truth were released, classifying the errors our submission made into multiple categories. We briefly report this error analysis here in the hope that it is interesting to other KBP participants. The analysis is conducted on our submission with 36 F1 (54 precision and 27 recall) and focuses mainly on precision.

Out of all 509 proposed relations, 276 are labeled as correct, 154 wrong, 53 inexact, and 26 redundant. We mainly focus on the 154+53 wrong and inexact relations, which constitute 40 points of precision loss. Because this submission system uses a probability threshold of 0.8 to propose relations, 20 points of these errors are expected due to errors in relation extraction. In the remaining 20 points of precision loss, we find that entity linking errors contribute 7 points. One common pattern is that DeepDive confuses entities which are different from the query entity but share the same name. Another 7 points of precision loss are caused by normalizing answers of type date, age, and title. These two classes of errors require a deeper understanding of the entities that appear in the corpus.

DeepDive Open Datasets. Apart from our effort in building the TAC KBP system, DeepDive has also been applied to other scientific applications, including paleontology (Peters et al., 2014), geology (Zhang et al., 2013), pharmacogenomics, genomics, and other defense-related applications. Similar to our efforts in TAC KBP, these applications usually require time-consuming data collection and NLP preprocessing. In the hope of helping other researchers access these datasets, we have released a set of datasets that are under license, with preprocessing (e.g., OCR and NLP) already applied.3 Examples of these datasets include a full snapshot of Wikipedia, PubMed Central, BioMed Central, PLoS, the Biodiversity Heritage Library, etc., and more datasets will be released in 2015. This effort is generously supported by the CHTC at the University of Wisconsin-Madison,4 which is providing more than 5 million machine hours to prepare these datasets.

3 http://deepdive.stanford.edu/doc/opendata/

4 http://chtc.cs.wisc.edu/


3 Active Learning in MIML-RE

The second system submitted by Stanford is based around the MIML-RE relation extractor of Surdeanu et al. (2012), in a similar vein to Stanford’s 2013 submission. We describe this MIML-RE based KBP system; in particular, we describe the improvements made to the system since Stanford’s 2013 entry.

A key improvement in the 2014 KBP entry is the re-training of MIML-RE incorporating fixed labels for uncertain sentences. The approach is described in more detail in Angeli et al. (2014). We describe the process for incorporating these sentence-level statistics, and the selection criteria used for selecting sentences to annotate. This work yielded a net gain of 4.4 F1 in the evaluation.

3.1 Training with sentence-level annotations

Following Surdeanu et al. (2012), MIML-RE is trained through hard discriminative Expectation Maximization, inferring the latent z values in the E-step and updating the weights for both the z and y classifiers in the M-step. During the E-step, we constrain the latent z to match our sentence-level annotations when available.
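A rough sketch of the constrained hard E-step is shown below, under the assumptions that the z-classifier exposes per-label scores and that annotations are stored in a simple dict keyed by sentence id; the coupling to the bag-level y labels is omitted for brevity, so this is an illustration of the idea rather than the actual MIML-RE code.

```python
def hard_e_step(groups, z_scores, annotations):
    """groups:      dict entity_pair -> list of sentence ids.
    z_scores:    function (sentence_id, label) -> score from the current z-classifier.
    annotations: dict sentence_id -> gold label for the sentences a human labeled.
    Returns the inferred latent z assignment for every sentence."""
    labels = ["per:spouse", "per:title", "no_relation"]  # illustrative label set
    z = {}
    for pair, sentence_ids in groups.items():
        for sid in sentence_ids:
            if sid in annotations:
                # Constrain the latent variable to the human annotation when available.
                z[sid] = annotations[sid]
            else:
                # Otherwise infer the best label under the current model (hard EM).
                z[sid] = max(labels, key=lambda lbl: z_scores(sid, lbl))
    return z

if __name__ == "__main__":
    groups = {("Barack Obama", "Michelle Obama"): ["s1", "s2"]}
    fake_scores = lambda sid, lbl: {"per:spouse": 0.7, "per:title": 0.1, "no_relation": 0.2}[lbl]
    print(hard_e_step(groups, fake_scores, annotations={"s2": "no_relation"}))
```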

We describe three criteria for selecting examples to annotate. The first, sampling uniformly, is a baseline for our hypothesis that intelligently selecting examples is important. For this criterion, we select mentions uniformly at random from the training set to annotate. The other two criteria rely on a metric for disagreement provided by the Query By Committee (QBC) active learning criterion; we describe our adaptation of QBC for MIML-RE as a preliminary to introducing these criteria.

3.2 QBC for MIML-RE

We use a version of QBC based on bootstrapping (Saar-Tsechansky and Provost, 2004). To create the committee of classifiers, we re-sample the training set with replacement 7 times and train a model over each sampled dataset. We measure disagreement on z-labels among the classifiers using a generalized Jensen-Shannon divergence (McCallum and Nigam, 1998), taking the average KL divergence of all classifier judgments.
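The committee construction and the disagreement score can be sketched as follows with NumPy; train_fn and the per-mention probability interface are placeholders for MIML-RE training and inference, and the averaged-KL disagreement follows equation (1) below.

```python
import numpy as np

def bootstrap_committee(train_fn, dataset, k=7, seed=0):
    """Resample the training set with replacement k times and train one model
    per sample. train_fn is a placeholder for MIML-RE training."""
    rng = np.random.default_rng(seed)
    n = len(dataset)
    return [train_fn([dataset[i] for i in rng.integers(0, n, size=n)])
            for _ in range(k)]

def js_disagreement(prob_rows, eps=1e-12):
    """prob_rows: (k, num_labels) array, one row per committee member, giving that
    classifier's distribution over a single latent z. Returns the generalized
    Jensen-Shannon disagreement: the mean KL from each member to the mean."""
    p = np.asarray(prob_rows, dtype=float) + eps
    p /= p.sum(axis=1, keepdims=True)
    mean = p.mean(axis=0)
    return float(np.mean(np.sum(p * np.log(p / mean), axis=1)))

if __name__ == "__main__":
    committee = bootstrap_committee(len, list(range(10)))  # trivial train_fn for the demo
    agree = [[0.9, 0.1], [0.88, 0.12], [0.91, 0.09]]
    disagree = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
    print(js_disagreement(agree), js_disagreement(disagree))  # small vs. large
```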

We first calculate the mention-level confidences. Note that $z_i^{(m)}$, $m \in M_i$, denotes the latent variable in entity pair $i$ with index $m$; $z_i^{(-m)}$ denotes the set of all latent variables except $z_i^{(m)}$:

$$p(z_i^{(m)} \mid y_i, x_i) = \frac{p(y_i, z_i^{(m)} \mid x_i)}{p(y_i \mid x_i)} = \frac{\sum_{z_i^{(-m)}} p(y_i, z_i \mid x_i)}{\sum_{z_i^{(m)}} p(y_i, z_i^{(m)} \mid x_i)}.$$

Notice that the denominator just serves to normalize the probability within a sentence group. We can rewrite the numerator as follows:

$$\sum_{z_i^{(-m)}} p(y_i, z_i \mid x_i) = \sum_{z_i^{(-m)}} p(y_i \mid z_i)\, p(z_i \mid x_i) = p(z_i^{(m)} \mid x_i) \sum_{z_i^{(-m)}} p(y_i \mid z_i)\, p(z_i^{(-m)} \mid x_i).$$

For computational efficiency, we approximate $p(z_i^{(-m)} \mid x_i)$ with a point mass at its maximum. Next, we calculate the Jensen-Shannon (JS) divergence from the $k$ bootstrapped classifiers:

$$\frac{1}{k} \sum_{c=1}^{k} \mathrm{KL}\!\left(p_c(z_i^{(m)} \mid y_i, x_i) \,\big\|\, p_{\mathrm{mean}}(z_i^{(m)} \mid y_i, x_i)\right) \quad (1)$$

where $p_c$ is the probability assigned by each of the $k$ classifiers to the latent $z_i^{(m)}$, and $p_{\mathrm{mean}}$ is the average of these probabilities. We use this metric to capture the disagreement of our model with respect to a particular latent variable. This is then used to inform our selection criteria.

Figure 4: A comparison of models trained with various selection criteria on the evaluation of Surdeanu et al. (2012), all initialized with the corresponding supervised classifier.


System                  P     R     F1
Mintz++                 41.3  28.2  33.5
MIML + Dist             38.0  30.5  33.8
Supervised + SampleJS   33.5  35.0  34.2
MIML + Sup              35.1  35.6  35.5
MIML + Sup + SampleJS   39.4  36.2  37.7

Table 1: A comparison of the best performing supervised classifier with other systems on the 2013 development set. The top section compares the supervised classifier with prior work. The lower section highlights the improvements gained from initializing MIML-RE with a supervised classifier.

We note that QBC may be especially useful in our situation, as our objective is highly nonconvex. If two committee members disagree on a latent variable, it is likely because they converged to different local optima; annotating that example could help bring the classifiers into agreement.

The second selection criterion we consider is the most straightforward application of QBC: selecting the examples with the highest JS disagreement. This allows us to compare our criterion, described next, against an established criterion from the active learning literature.

3.3 Sample by JS Disagreement

We propose a novel active learning sampling criterion that incorporates not only disagreement but also representativeness in selecting examples to annotate. Prior work has taken a weighted combination of an example’s disagreement and a score corresponding to whether the example is drawn from a dense portion of the feature space (e.g., McCallum and Nigam (1998)). However, this requires both selecting a criterion for defining density (e.g., a distance metric in feature space), and tuning a parameter for the relative weight of disagreement versus representativeness.

Instead, we account for choosing representative examples by sampling without replacement proportional to the example’s disagreement. Formally, we define the probability of selecting an example $z_i^{(m)}$ to be proportional to the Jensen-Shannon divergence in (1). Since the training set is an approximation to the prior distribution over examples, sampling uniformly over the training set is an approximation to sampling from the prior probability of seeing an input x. We can view our criterion as an approximation to sampling proportional to the product of two densities: a prior over examples x, and the JS divergence mentioned above.
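A minimal sketch of this SampleJS criterion is shown below: sample examples without replacement with probability proportional to their JS disagreement. The sequential renormalization is one straightforward way to do this; the exact implementation is not specified in the text, so the details here are our own.

```python
import numpy as np

def sample_by_disagreement(example_ids, js_scores, budget, seed=0):
    """Sample `budget` examples without replacement, with probability
    proportional to each example's JS disagreement score."""
    rng = np.random.default_rng(seed)
    ids = list(example_ids)
    weights = np.asarray(js_scores, dtype=float)
    chosen = []
    for _ in range(min(budget, len(ids))):
        probs = weights / weights.sum()
        idx = rng.choice(len(ids), p=probs)
        chosen.append(ids.pop(idx))
        weights = np.delete(weights, idx)
    return chosen

if __name__ == "__main__":
    ids = ["z1", "z2", "z3", "z4"]
    js = [0.02, 0.40, 0.35, 0.01]   # high-disagreement examples dominate the sample
    print(sample_by_disagreement(ids, js, budget=2))
```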

3.4 Analysis of Selection Criteria

It is worth emphasizing at least three results from this line of work. First, the selection criterion for annotating examples is very important for performance. Figure 4 shows the performance of the model using various selection criteria; we note that highJS and sampleJS noticeably outperform the uniform criterion. Second, MIML-RE is very sensitive to initialization. To illustrate, the gains in Figure 4 disappear entirely if the model is initialized with Mintz++ (as in the original paper); gains only appear if the model is initialized from a supervised classifier. Lastly, it is worth noting that this supervised classifier, used on its own, performs surprisingly well on the 2013 development set. Table 1 shows the system’s performance using distant supervision, vanilla MIML-RE (33.8 F1 on 2013), and a supervised classifier trained from the SampleJS examples (34.2 F1). This suggests that it may be reasonable for newcomers to the task to train a relatively simple classifier, and perhaps focus on downstream processes; although, of course, the best classifiers (37.7 F1) are still those which incorporate the examples into a distant supervision framework such as MIML-RE.

3.5 Learning Patterns to Augment MIML-RE

To extract slot fillers, we also used two types of patterns in our system: surface-word lexico-syntactic patterns (such as “X was married to Y”) and dependency patterns (such as “X (nsubj) ← marry → (dobj) Y”). Patterns are often useful for extracting slots with high precision, albeit with low recall. Usually recall is increased by iteratively learning patterns, using the extractions learned in the previous iteration as labeled data. The lexico-syntactic patterns were hand written, and the dependency patterns were both hand written and learned.

We learned the dependency patterns for each slot independently. For each slot, the known entities and slot fillers were first matched to sentences as described above. For each pair of entity head word and slot filler head word in a matched sentence, we extracted the shortest path between them in the dependency tree of the sentence as a candidate pattern. If either the entity or the slot filler was tagged with a named entity tag, we included that restriction in the pattern.


Slot                  Dependency Pattern
per:employee_of       ENTITY (nsubjpass) ← dropped → (prep_from) {ner:/ORGANIZATION/}=FILL
per:country_of_death  {ner:/COUNTRY/}=FILL (nsubj) ← mourns → (dobj) ENTITY
org:founded_by        founders → (conj_and) {ner:/PERSON/}=FILL → (nn) ENTITY

Table 2: Some examples of learned dependency patterns used by the system. The patterns are in Semgrex format (http://nlp.stanford.edu/software/tregex.shtml) over Stanford Dependencies, where ENTITY and FILL are the entity head token and the slot fill head token, respectively.

                      Official Score
System                P     R     F1
DeepDive (recall)     54.4  27.8  36.8
DeepDive (precision)  54.8  24.9  34.3
MIML-RE               36.2  28.7  32.0
MIML-RE (-patterns)   35.3  26.3  30.2
MIML-RE (-active)     29.2  26.2  27.6

Table 3: A summary of Stanford’s performance on the 2014 official evaluation. The first two rows correspond to a recall-tuned and a precision-tuned version of the DeepDive system. The last three rows correspond to the MIML-RE based system, with an ablation study of (1) removing learned and hard-coded patterns, and (2) using the MIML-RE model from 2013.

The candidate patterns were weighted by the measure

$$\frac{\mathit{Pos}}{\sqrt{\mathit{Neg} + \mathit{All} + 2}}$$

where Pos is the number of times the pattern matched a correct pair of entity and slot filler, Neg is the number of times the pattern extracted a pair of entity and slot filler that were related by another slot, and All is the total number of times the pattern matched the sentences. Patterns scoring above a certain threshold were selected. We learned a total of 1073 patterns for all slots. Due to lack of time, we ran the system on a small sample of the training data for one iteration. In the future, we plan to run pattern learning for multiple iterations and to improve pattern scoring using the approach of Gupta and Manning (2014).
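A small sketch of the scoring and thresholding step is given below; it implements the measure above as we reconstructed it (Pos divided by the square root of Neg + All + 2), and the counts and threshold shown are illustrative rather than taken from the system.

```python
import math

def pattern_score(pos: int, neg: int, total: int) -> float:
    """Score a candidate pattern as Pos / sqrt(Neg + All + 2), where Pos and Neg
    count matches of correct vs. other-slot pairs, and All counts all matches."""
    return pos / math.sqrt(neg + total + 2)

def select_patterns(counts, threshold):
    """counts: dict pattern -> (Pos, Neg, All). Keep patterns scoring above threshold,
    sorted from highest to lowest score."""
    scored = {p: pattern_score(*c) for p, c in counts.items()}
    return sorted((p for p, s in scored.items() if s > threshold),
                  key=lambda p: -scored[p])

if __name__ == "__main__":
    counts = {
        "ENTITY <-nsubjpass- dropped -prep_from-> ORG": (8, 1, 12),   # score ~2.07
        "ENTITY <-nsubj- visited -dobj-> ORG": (2, 6, 30),            # score ~0.32
    }
    print(select_patterns(counts, threshold=1.0))
```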

Some examples of the learned dependency patterns are given in Table 2.

4 Summary of Results

We give the results of the official KBP evaluation in Table 3, summarizing the five submissions:

DeepDive (recall) A run using DeepDive, tuned along the calibration curve for the best F1, and correspondingly relatively biased towards recall.

DeepDive (precision) A run using DeepDive, tuned along the calibration curve to favor higher precision (90% precision cutoff).

MIML-RE The MIML-RE system, as in the 2013 submission, but with the addition of learned patterns and the new model from Angeli et al. (2014).

MIML-RE (-patterns) The MIML-RE system with all patterns removed. This is the expected performance of a system based only on MIML-RE and a coreference-based alternate names detector.

MIML-RE (-active) The MIML-RE system, with the model from Stanford’s 2013 submission.

The best performing system based on DeepDive achieved an end-to-end F1 of 36.8; the best performing system based on MIML-RE achieved an F1 of 32.0.

5 Conclusion

We have described Stanford’s two systems in the 2014 KBP competition: one based on DeepDive, and another built around the MIML-RE relation extractor. We hope to convey at least two central messages. First, there is an undervalued benefit to scaling computing resources and software systems proportional to the size of the task being solved; that is to say, using DeepDive to effectively harness large computers to quickly iterate on features and debug a KBP system is very valuable for creating top-performing systems. Second, careful human intervention on selected examples or patterns can, with relatively little effort, yield large improvements in relation extraction accuracy.


Acknowledgments

Stanford University gratefully acknowledges the support of the Defense Advanced Research Projects Agency (DARPA) Deep Exploration and Filtering of Text (DEFT) Program under Air Force Research Laboratory (AFRL) contract no. FA8750-13-2-0040. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the view of DARPA, AFRL, or the US government.

References

Gabor Angeli, Julie Tibshirani, Jean Y. Wu, and Christopher D. Manning. 2014. Combining distant and partial supervision for relation extraction. In EMNLP.

Leo A. Goodman. 1961. Snowball sampling. The Annals of Mathematical Statistics, pages 148–170.

Sonal Gupta and Christopher D. Manning. 2014. Improved pattern learning for bootstrapped entity extraction. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning (CoNLL).

Marti A. Hearst. 1992. Automatic acquisition of hyponyms from large text corpora. In COLING.

Raphael Hoffmann, Congle Zhang, and Daniel S. Weld. 2010. Learning 5000 relational extractors. In ACL.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations.

Andrew McCallum and Kamal Nigam. 1998. Employing EM and pool-based active learning for text classification. In ICML.

Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In ACL.

Feng Niu, Christopher Ré, AnHai Doan, and Jude Shavlik. 2011a. Tuffy: Scaling up statistical inference in Markov logic networks using an RDBMS. PVLDB.

Feng Niu, Benjamin Recht, Christopher Ré, and Stephen J. Wright. 2011b. Hogwild!: A lock-free approach to parallelizing stochastic gradient descent. In NIPS.

Feng Niu, Ce Zhang, Christopher Ré, and Jude W. Shavlik. 2012. DeepDive: Web-scale knowledge-base construction using statistical learning and inference. In VLDS, pages 25–28.

Shanan E. Peters et al. 2014. A machine reading system for assembling synthetic paleontological databases. PLoS ONE.

Christopher Ré, Amir Abbas Sadeghian, Zifei Shan, Jaeho Shin, Feiran Wang, Sen Wu, and Ce Zhang. 2014. Feature engineering for knowledge base construction. IEEE Data Engineering Bulletin.

Matthew Richardson and Pedro Domingos. 2006. Markov logic networks. Machine Learning, 62(1-2):107–136.

Maytal Saar-Tsechansky and Foster Provost. 2004. Active sampling for class probability estimation and ranking. Machine Learning, 54(2):153–178.

Mihai Surdeanu, Julie Tibshirani, Ramesh Nallapati, and Christopher D. Manning. 2012. Multi-instance multi-label learning for relation extraction. In EMNLP.

Ce Zhang and Christopher Ré. 2013. Towards high-throughput Gibbs sampling at scale: A study across storage managers. In SIGMOD.

Ce Zhang and Christopher Ré. 2014. DimmWitted: A study of main-memory statistical analytics. PVLDB.

Ce Zhang, Vidhya Govindaraju, Jackson Borchardt, Tim Foltz, Christopher Ré, and Shanan Peters. 2013. GeoDeepDive: Statistical inference using familiar data-processing languages. In SIGMOD.