Named Entity Recognition without Labelled Data: A Weak … · 2020. 6. 20. · Named Entity...

16
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1518–1533 July 5 - 10, 2020. c 2020 Association for Computational Linguistics 1518 Named Entity Recognition without Labelled Data: A Weak Supervision Approach Pierre Lison 1 , Jeremy Barnes 2 , Aliaksandr Hubin 1 , and Samia Touileb 2 1 Norwegian Computing Center, Oslo, Norway 2 Language Technology Group, University of Oslo, Norway {plison,ahu}@nr.no, {jeremycb,samiat}@ifi.uio.no Abstract Named Entity Recognition (NER) perfor- mance often degrades rapidly when applied to target domains that differ from the texts ob- served during training. When in-domain la- belled data is available, transfer learning tech- niques can be used to adapt existing NER mod- els to the target domain. But what should one do when there is no hand-labelled data for the target domain? This paper presents a simple but powerful approach to learn NER models in the absence of labelled data through weak su- pervision. The approach relies on a broad spec- trum of labelling functions to automatically an- notate texts from the target domain. These an- notations are then merged together using a hid- den Markov model which captures the vary- ing accuracies and confusions of the labelling functions. A sequence labelling model can fi- nally be trained on the basis of this unified annotation. We evaluate the approach on two English datasets (CoNLL 2003 and news arti- cles from Reuters and Bloomberg) and demon- strate an improvement of about 7 percentage points in entity-level F 1 scores compared to an out-of-domain neural NER model. 1 Introduction Named Entity Recognition (NER) constitutes a core component in many NLP pipelines and is employed in a broad range of applications such as information extraction (Raiman and Raiman, 2018), question answering (Moll ´ a et al., 2006), document de-identification (Stubbs et al., 2015), machine translation (Ugawa et al., 2018) and even conversational models (Ghazvininejad et al., 2018). Given a document, the goal of NER is to identify and classify spans referring to an entity belonging to pre-specified categories such as persons, organi- sations or geographical locations. NER models often rely on convolutional or re- current neural architectures, sometimes completed by a CRF layer (Chiu and Nichols, 2016; Lample et al., 2016; Yadav and Bethard, 2018). More re- cently, deep contextualised representations relying on bidirectional LSTMS (Peters et al., 2018), trans- formers (Devlin et al., 2019; Yan et al., 2019) or contextual string embeddings (Akbik et al., 2019) have also been shown to achieve state-of-the-art performance on NER tasks. These neural architectures require large corpora annotated with named entities, such as Ontonotes (Weischedel et al., 2011) or ConLL 2003 (Tjong Kim Sang and De Meulder, 2003). When only mod- est amounts of training data are available, transfer learning approaches can transfer the knowledge ac- quired from related tasks into the target domain, us- ing techniques such as simple transfer (Rodriguez et al., 2018), discriminative fine-tuning (Howard and Ruder, 2018), adversarial transfer (Zhou et al., 2019) or layer-wise domain adaptation approaches (Yang et al., 2017; Lin and Lu, 2018). However, in many practical settings, we wish to apply NER to domains where we have no la- belled data, making such transfer learning methods difficult to apply. This paper presents an alterna- tive approach using weak supervision to bootstrap named entity recognition models without requir- ing any labelled data from the target domain. The approach relies on labelling functions that automati- cally annotate documents with named-entity labels. A hidden Markov model (HMM) is then trained to unify the noisy labelling functions into a single (probabilistic) annotation, taking into account the accuracy and confusions of each labelling function. Finally, a sequence labelling model is trained using a cross-entropy loss on this unified annotation. As in other weak supervision frameworks, the labelling functions allow us to inject expert knowl- edge into the sequence labelling model, which is often critical when data is scarce or non-existent (Hu et al., 2016; Wang and Poon, 2018). New la-

Transcript of Named Entity Recognition without Labelled Data: A Weak … · 2020. 6. 20. · Named Entity...

  • Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1518–1533July 5 - 10, 2020. c©2020 Association for Computational Linguistics

    1518

    Named Entity Recognition without Labelled Data:A Weak Supervision Approach

    Pierre Lison1, Jeremy Barnes2, Aliaksandr Hubin1, and Samia Touileb2

    1Norwegian Computing Center, Oslo, Norway2 Language Technology Group, University of Oslo, Norway

    {plison,ahu}@nr.no, {jeremycb,samiat}@ifi.uio.no

    Abstract

    Named Entity Recognition (NER) perfor-mance often degrades rapidly when applied totarget domains that differ from the texts ob-served during training. When in-domain la-belled data is available, transfer learning tech-niques can be used to adapt existing NER mod-els to the target domain. But what should onedo when there is no hand-labelled data for thetarget domain? This paper presents a simplebut powerful approach to learn NER models inthe absence of labelled data through weak su-pervision. The approach relies on a broad spec-trum of labelling functions to automatically an-notate texts from the target domain. These an-notations are then merged together using a hid-den Markov model which captures the vary-ing accuracies and confusions of the labellingfunctions. A sequence labelling model can fi-nally be trained on the basis of this unifiedannotation. We evaluate the approach on twoEnglish datasets (CoNLL 2003 and news arti-cles from Reuters and Bloomberg) and demon-strate an improvement of about 7 percentagepoints in entity-level F1 scores compared to anout-of-domain neural NER model.

    1 Introduction

    Named Entity Recognition (NER) constitutes acore component in many NLP pipelines and isemployed in a broad range of applications suchas information extraction (Raiman and Raiman,2018), question answering (Mollá et al., 2006),document de-identification (Stubbs et al., 2015),machine translation (Ugawa et al., 2018) and evenconversational models (Ghazvininejad et al., 2018).Given a document, the goal of NER is to identifyand classify spans referring to an entity belongingto pre-specified categories such as persons, organi-sations or geographical locations.

    NER models often rely on convolutional or re-current neural architectures, sometimes completed

    by a CRF layer (Chiu and Nichols, 2016; Lampleet al., 2016; Yadav and Bethard, 2018). More re-cently, deep contextualised representations relyingon bidirectional LSTMS (Peters et al., 2018), trans-formers (Devlin et al., 2019; Yan et al., 2019) orcontextual string embeddings (Akbik et al., 2019)have also been shown to achieve state-of-the-artperformance on NER tasks.

    These neural architectures require large corporaannotated with named entities, such as Ontonotes(Weischedel et al., 2011) or ConLL 2003 (TjongKim Sang and De Meulder, 2003). When only mod-est amounts of training data are available, transferlearning approaches can transfer the knowledge ac-quired from related tasks into the target domain, us-ing techniques such as simple transfer (Rodriguezet al., 2018), discriminative fine-tuning (Howardand Ruder, 2018), adversarial transfer (Zhou et al.,2019) or layer-wise domain adaptation approaches(Yang et al., 2017; Lin and Lu, 2018).

    However, in many practical settings, we wishto apply NER to domains where we have no la-belled data, making such transfer learning methodsdifficult to apply. This paper presents an alterna-tive approach using weak supervision to bootstrapnamed entity recognition models without requir-ing any labelled data from the target domain. Theapproach relies on labelling functions that automati-cally annotate documents with named-entity labels.A hidden Markov model (HMM) is then trainedto unify the noisy labelling functions into a single(probabilistic) annotation, taking into account theaccuracy and confusions of each labelling function.Finally, a sequence labelling model is trained usinga cross-entropy loss on this unified annotation.

    As in other weak supervision frameworks, thelabelling functions allow us to inject expert knowl-edge into the sequence labelling model, which isoften critical when data is scarce or non-existent(Hu et al., 2016; Wang and Poon, 2018). New la-

  • 1519

    belling functions can be easily inserted to leveragethe knowledge sources at our disposal for a giventextual domain. Furthermore, labelling functionscan often be ported across domains, which is notthe case for manual annotations that must be reiter-ated for every target domain.

    The contributions of this paper are as follows:

    1. A broad collection of labelling functions forNER, including neural models trained on vari-ous textual domains, gazetteers, heuristic func-tions, and document-level constraints.

    2. A novel weak supervision model suited forsequence labelling tasks and able to includeprobabilistic labelling predictions.

    3. An open-source implementation of these la-belling functions and aggregation model thatcan scale to large datasets 1.

    2 Related Work

    Unsupervised domain adaptation: Unsuper-vised domain adaptation attempts to adapt knowl-edge from a source domain to predict new instancesin a target domain which often has substantially dif-ferent characteristics. Earlier approaches often tryto adapt the feature space using pivots (Blitzer et al.,2006, 2007; Ziser and Reichart, 2017) to createdomain-invariant representations of predictive fea-tures. Others learn low-dimensional transformationfeatures of the data (Guo et al., 2009; Glorot et al.,2011; Chen et al., 2012; Yu and Jiang, 2016; Barneset al., 2018). Finally, some approaches divide thefeature space into general and domain-dependentfeatures (Daumé III, 2007). Multi-task learningcan also improve cross-domain performance (Pengand Dredze, 2017).

    Recently, Han and Eisenstein (2019) proposeddomain-adaptive fine-tuning, where contextualisedembeddings are first fine-tuned to both the sourceand target domains with a language modelling lossand subsequently fine-tuned to source domain la-belled data. This approach outperforms severalstrong baselines trained on the target domain of theWNUT 2016 NER task (Strauss et al., 2016).

    Aggregation of annotations: Approaches thataggregate annotations from multiples sources havelargely concentrated on noisy data from crowdsourced annotations, with some annotators possibly

    1https://github.com/NorskRegnesentral/weak-supervision-for-NER.

    being adversarial. The Bayesian Classifier Combi-nation approach of Kim and Ghahramani (2012)combines multiple independent classifiers usinga linear combination of predictions. Hovy et al.(2013) learn a generative model able to aggregatecrowd-sourced annotations and estimate the trust-worthiness of annotators. Rodrigues et al. (2014)present an approach based on Conditional RandomFields (CRFs) whose model parameters are learnedjointly using EM. Nguyen et al. (2017b) propose aHidden Markov Model to aggregate crowd-sourcedsequence annotations and find that explicitly mod-elling the annotator leads to improvements for POS-tagging and NER. Finally, Simpson and Gurevych(2019) proposed a fully Bayesian approach to theproblem of aggregating multiple sequential anno-tations, using variational EM to compute posteriordistributions over the model parameters.

    Weak supervision: The aim of weakly super-vised modelling is to reduce the need for hand-annotated data in supervised training. A particularinstance of weak supervision is distant supervision,which relies on external resources such as knowl-edge bases to automatically label documents withentities that are known to belong to a particularcategory (Mintz et al., 2009; Ritter et al., 2013;Shang et al., 2018). Ratner et al. (2017, 2019) gen-eralised this approach with the Snorkel frameworkwhich combines various supervision sources usinga generative model to estimate the accuracy (andpossible correlations) of each source. These ag-gregated supervision sources are then employed totrain a discriminative model. Current frameworksare, however, not easily adaptable to sequence la-belling tasks, as they typically require data points tobe independent. One exception is the work of Wangand Poon (2018), which relies on deep probabilisticlogic to perform joint inference on the full dataset.Finally, Fries et al. (2017) presented a weak super-vision approach to NER in the biomedical domain.However, unlike the model proposed in this paper,their approach relies on an ad-hoc mechanism forgenerating candidate spans to classify.

    The approach most closely related to this paperis Safranchik et al. (2020), which describe a similarweak supervision framework for sequence labellingbased on an extension of HMMs called linked hid-den Markov models. The authors introduce a newtype of noisy rules, called linking rules, to deter-mine how sequence elements should be groupedinto spans of same tag. The main differences be-

    https://github.com/NorskRegnesentral/weak-supervision-for-NERhttps://github.com/NorskRegnesentral/weak-supervision-for-NER

  • 1520

    x1

    y1

    h1

    x2

    y2

    h2

    x3

    y3

    h3

    xt

    yt

    ht...

    Step 1:labellingfunctions

    Step 2:labelaggregation

    Step 3: Training of sequencelabelling model on aggregated labels

    Figure 1: Illustration of the weak supervision approach.

    tween their approach and this paper are the linkingrules, which are not employed here, and the choiceof labelling functions, in particular the document-level relations detailed in Section 3.1.

    Ensemble learning: The proposed approach isalso loosely related to ensemble methods suchbagging, boosting and random forests (Sagi andRokach, 2018). These methods rely on multipleclassifiers run simultaneously and whose outputsare combined at prediction time. In contrast, our ap-proach (as in other weak supervision frameworks)only requires labelling functions to be aggregatedonce, as an intermediary step to create training datafor the final model. This is a non-trivial differ-ence as running all labelling functions at predictiontime is computationally costly due to the need torun multiple neural models along with gazetteersextracted from large knowledge bases.

    3 Approach

    The proposed model collects weak supervisionfrom multiple labelling functions. Each labellingfunction takes a text document as input and out-puts a series of spans associated with NER labels.These outputs are then aggregated using a hiddenMarkov model (HMM) with multiple emissions(one per labelling function) whose parameters areestimated in an unsupervised manner. Finally, theaggregated labels are employed to learn a sequencelabelling model. Figure 1 illustrates this process.The process is performed on documents from thetarget domain, e.g. a corpus of financial news.

    Labelling functions are typically specialised todetect only a subset of possible labels. For instance,a gazetteer based on Wikipedia will only detectmentions of persons, organisations and geograph-ical locations and ignore entities such as dates orpercents. This marks a departure from existing ag-gregation methods, which are originally designedfor crowd-sourced data and where annotators aresupposed to make use of the full label set. In addi-tion, unlike previous weak supervision approaches,

    we allow labelling functions to produce probabilis-tic predictions instead of deterministic values. Theaggregation model described in Section 3.2 directlycaptures these properties in the emission model as-sociated with each labelling function.

    We first briefly describe the labelling functionsintegrated into the current system. We review inSection 3.2 the aggregation model employed tocombine the labelling predictions. The final la-belling model is presented in Section 3.3. Thecomplete list of 52 labelling functions employed inthe experiments is available in Appendix A.

    3.1 Labelling functionsOut-of-domain NER models The first set of la-belling functions are sequence labelling modelstrained in domains from which labelled data isavailable. In the experiments detailed in Section4, we use four such models, respectively trainedon Ontonotes (Weischedel et al., 2011), CoNLL2003 (Tjong Kim Sang and De Meulder, 2003)2,the Broad Twitter Corpus (Derczynski et al., 2016)and a NER-annotated corpus of SEC filings (Sali-nas Alvarado et al., 2015).

    For the experiments in this paper, all afore-mentioned models rely on a transition-based NERmodel (Lample et al., 2016) which extracts featureswith a stack of four convolutional layers with filtersize of three and residual connections. The modeluses attention features and a multi-layer percep-tron to select the next transition. It is initialisedwith GloVe embeddings (Pennington et al., 2014)and implemented in Spacy (Honnibal and Montani,2017). However, the proposed approach does notimpose any constraints on the model architectureand alternative approaches based on e.g. contextu-alised embeddings can also be employed.

    Gazetteers As in distant supervision approaches,we include a number of gazetteers from largeknowledge bases to identify named entities. Con-cretely, we use resources from Wikipedia (Geißet al., 2018), Geonames (Wick, 2015), the Crunch-base Open Data Map, DBPedia (Lehmann et al.,2015) along with lists of countries, languages, na-tionalities and religious or political groups.

    To efficiently search for occurrences of these en-tities in large text collections, we first convert eachknowledge base into a trie data structure. Prefixsearch is then applied to extract matches (using

    2The ConLL 2003 NER model is of course deactivated forthe experimental evaluation on ConLL 2003.

  • 1521

    both case-sensitive and case-insensitive mode, asthey have distinct precision-recall trade-offs).

    Heuristic functions We also include variousheuristic functions, each specialised in the recog-nition of specific types of named entities. Severalfunctions are dedicated to the recognition of propernames based on casing, part-of-speech tags or de-pendency relations. In addition, we integrate avariety of handcrafted functions relying on regularexpressions to detect occurrences of various enti-ties (see Appendix A for details). A probabilisticparser specialised in the recognition of dates, times,money amounts, percents, and cardinal/ordinal val-ues (Braun et al., 2017) is also incorporated.

    Document-level relations All labelling func-tions described above rely on local decisions ontokens or phrases. However, texts are not loosecollections of words, but exhibit a high degree ofinternal coherence (Grosz and Sidner, 1986; Groszet al., 1995) which can be exploited to further im-prove the annotations.

    We introduce one labelling function to capturelabel consistency constraints in a document. Asnoted in (Krishnan and Manning, 2006; Wang et al.,2018), named entities occurring multiple timesthrough a document have a high probability of be-longing to the same category. For instance, whileKomatsu may both refer to a Japanese town or amultinational corporation, a text including this men-tion will either be about the town or the company,but rarely both at the same time. To capture thesenon-local dependencies, we define the followinglabel consistency model: given a text span e occur-ring in a given document, we look for all spans Zein the document that contain the same string as e.The (probabilistic) output of the labelling functionthen corresponds to the relative frequency of eachlabel l for that string in the document:

    Pdoc majority(e)(l) =

    ∑z∈Ze Plabel(z)(l)

    |Ze|(1)

    The above formula depends on a distributionPlabel(z), which can be defined on the basis ofother labelling functions. Alternatively, a two-stagemodel similar to (Krishnan and Manning, 2006)could be employed to first aggregate local labellingfunctions and subsequently apply document-levelfunctions on aggregated predictions.

    Another insight from Grosz and Sidner (1986) isthe importance of the attentional structure. When

    introduced for the first time, named entities areoften referred to in an explicit and univocal manner,while subsequent mentions (once the entity is a partof the focus structure) frequently rely on shorterreferences. The first mention of a person in a giventext is for instance likely to include the person’s fullname, and is often shortened to the person’s lastname in subsequent mentions. As in Ratinov andRoth (2009), we determine whether a proper nameis a substring of another entity mentioned earlier inthe text. If so, the labelling function replicates thelabel distribution of the first entity.

    3.2 Aggregation modelThe outputs of these labelling functions are thenaggregated into a single layer of annotation throughan aggregation model. As we do not have accessto labelled data for the target domain, this model isestimated in a fully unsupervised manner.

    Model We assume a list of J labelling functions{λ1, ...λJ} and a list of S mutually exclusive NERlabels {l1, ...lS}. The aggregation model is repre-sented as an HMM, in which the states correspondto the true underlying labels. This model has multi-ple emissions (one per labelling function) assumedto be mutually independent conditional on the la-tent underlying label.

    Formally, for each token i ∈ {1, ..., n} and la-belling function j, we assume a Dirichlet distribu-tion for the probability labels Pij . The parametersof this Dirichlet are separate vectors αsij ∈ RS[0,1],for each of the latent states si ∈ {1, ..., S}. Thelatent states are assumed to have a Markovian de-pendence structure between the tokens {1, ..., n}.This results in the HMM represented by a depen-dent mixtures of Dirichlet model:

    Pij |αsijind∼ Dirichlet

    (αsij

    ), (2)

    p(si|si−1) = logit−1(ω(si,si−1)

    ), (3)

    logit−1(ω(si,si−1)

    )= e

    ω(si,si−1)

    1+eω(si,si−1)

    . (4)

    Here, ω(si,si−1) ∈ R are the parameters of thetransition probability matrix controlling for a givenstate si−1 the probability of transition to state si.Figure 2 illustrates the model structure.

    Parameter estimation The learnable parametersof this HMM are (a) the transition matrix betweenstates and (b) the α vectors of the Dirichlet distri-bution associated with each labelling function. The

  • 1522

    The plugged wells have ...

    si−1 si si+1 si+2 ...

    αsij P ij

    Labelling function j ∈ {1, ...J}

    Figure 2: Aggregation model using a hidden Markovmodel with multiple probabilistic emissions.

    transition matrix is of size |S|× |S|, while we have|S| × |J | α vectors, each of size |S|. The parame-ters are estimated with the Baum-Welch algorithm,which is a variant of EM algorithm that relies onthe forward-backward algorithm to compute thestatistics for the expectation step.

    To ensure faster convergence, we introduce anew constraint to the likelihood function: for eachtoken position i, the corresponding latent label simust have a non-zero probability in at least onelabelling function (the likelihood of this label isotherwise set to zero for that position). In otherwords, the aggregation model will only predict aparticular label if this label is produced by leastone labelling function. This simple constraint facil-itates EM convergence as it restricts the state spaceto a few possible labels at every time-step.

    Prior distributions The HMM described abovecan be provided with informative priors. In particu-lar, the initial distribution for the latent states canbe defined as a Dirichlet based on counts δ for themost reliable labelling function3:

    p(si)d= Dirichlet(δ). (5)

    The prior for each row k of the transition probabili-ties matrix is also a Dirichlet based on the frequen-cies of transitions between the observed classes forthe most reliable labelling function κk:

    p(si|si−1 = k)d= Dirichlet(κk). (6)

    Finally, to facilitate convergence of the EM algo-rithm, informative starting values can be specifiedfor the emission model of each labelling function.

    3The most reliable labelling function was found in ourexperiments to be the NER model trained on Ontonotes 5.0.

    Assuming we can provide rough estimates of the re-call rjk and precision ρjk for the labelling functionj on label k, the initial values for the parameters ofthe emission model are expressed as:

    αsijk ∝

    {rjk, if si = k,(1− rsik) (1− ρjk) δk, if si 6= k.

    The probability of observing a given label k emit-ted by the labelling function j is thus proportionalto its recall if the true label is indeed k. Otherwise(i.e. if the labelling function made an error), theprobability of emitting k is inversely proportionalto the precision of the labelling function j.

    Decoding Once the parameters of the HMMmodel are estimated, the forward-backward algo-rithm can be employed to associate each tokenmarginally with a posterior probability distributionover possible NER labels (Rabiner, 1990).

    3.3 Sequence labelling modelOnce the labelling functions are aggregated on doc-uments from the target domain, we can train a se-quence labelling model on the unified annotations,without imposing any constraints on the type ofmodel to use. To take advantage of the posteriormarginal distribution p̃s over the latent labels, theoptimisation should seek to minimise the expectedloss with respect to p̃s:

    θ̂ = argminθ

    n∑i

    Ey∼p̃s [loss(hθ(xi), y)] (7)

    where hθ(·) is the output of the sequence labellingmodel. This is equivalent to minimising the cross-entropy error between the outputs of the neuralmodel and the probabilistic labels produced by theaggregation model.

    4 Evaluation

    We evaluate the proposed approach on two English-language datasets, namely the CoNLL 2003 datasetand a collection of sentences from Reuters andBloomberg news articles annotated with namedentities by crowd-sourcing. We include a seconddataset in order to evaluate the approach with amore fine-grained set of NER labels than the onesin CoNLL 2003. As the objective of this paperis to compare approaches to unsupervised domainadaptation, we do not rely on any labelled datafrom these two target domains.

  • 1523

    4.1 DataCoNLL 2003 The CoNLL 2003 dataset (TjongKim Sang and De Meulder, 2003) consists of 1163documents, including a total of 35089 entitiesspread over 4 labels: ORG, PER, LOC and MISC.

    Reuters & Bloomberg We additionally crowdannotate 1054 sentences from Reuters andBloomberg news articles from Ding et al. (2014).We instructed the annotators to tag sentences withthe following 9 Ontonotes-inspired labels: PER-SON, NORP, ORG, LOC, PRODUCT, DATETIME, PER-CENT, MONEY, QUANTITY. Each sentence was an-notated by at least two annotators, and a qualifyingtest with gold-annotated questions was conductedfor quality control. Cohen’s κ for sentences withtwo annotators is 0.39, while Krippendorff’s α forthree annotators is 0.44. We had to remove QUAN-TITY labels from the annotations as the crowd re-sults for this label were highly inconsistent.

    4.2 BaselinesOntonotes-trained NER The first baseline cor-responds to a neural sequence labelling modeltrained on the Ontonotes 5.0 corpus. We use herethe same model from Section 3.1, which is thesingle best-performing labelling function (that is,without aggregating multiple predictions).

    We also experimented with other neural architec-tures but these performed similar or worse than thetransition-based model, presumably because theyare more prone to overfitting on the source domain.

    Majority voting (MV) The simplest method foraggregating outputs is majority voting, i.e. out-putting the most frequent label among the onespredicted by each labelling function. However, spe-cialised labelling functions will output O for mosttokens, which means that the majority label is typ-ically O. To mitigate this problem, we first lookat tokens that are marked with a non-O label byat least T labelling functions (where T is a hyper-parameter tuned experimentally), and then applymajority voting on this set of non-O labels.

    Snorkel model The Snorkel framework (Ratneret al., 2017) does not directly support sequencelabelling tasks as data points are required to beindependent. However, heuristics can be used toextract named-entity candidates and then apply la-belling functions to infer their most likely labels(Fries et al., 2017). For this baseline, we use the

    three functions nnp detector, proper detector and com-pound detector (see Appendix A) to generate candi-date spans. We then create a matrix expressing theoutput of each labelling function for each span (in-cluding a specific ”abstain” value to denote the ab-sence of prediction) and run the matrix-completion-style approach of Ratner et al. (2019) to aggregatethe predictions from all functions.

    mSDA is a strong domain adaptation baseline(Chen et al., 2012) which augments the featurespace of a model with intermediate representationslearned using stacked denoising autoencoders. Inour case, we learn the mSDA representations onthe unlabeled source and target domain data. These800 dimensional vectors are concatenated to 300dimensional word embeddings and fed as input toa two-layer LSTM with a skip connection. Finally,we train the LSTM on the labeled source data andtest on the target domain.

    AdaptaBERT This baseline corresponds to astate-of-the-art unsupervised domain adaptation ap-proach (AdaptaBERT) (Han and Eisenstein, 2019).The approach first uses unlabeled data from boththe source and target domains to domain-tune apretrained BERT model. The model is finally task-tuned in a supervised fashion on the source domainlabelled data (Ontonotes). At inference time, themodel makes use of the pretraining and domaintuning to predict entities in the target domain. Inour experiments, we use the cased-version of thebase BERT model and perform three fine-tuningepochs for both domain-tuning and task-tuning. Weadditionally include an ensemble model, which av-erages the predictions of five BERT models fine-tuned with different random seeds.

    Mixtures of multinomialsFollowing the notation from Section 3.2, we de-fine Yi,j,k = I(Pi,j,k = maxk′∈{1,...,S} Pi,j,k′) tobe the most probable label for word i by source j.One can model Yij with a Multinomial probabil-ity distribution. The first four baselines (the fifthone assumes Markovian dependence between thelatent states) listed below use the following inde-pendent, i.e. p(si, si−1) = p(si)p(si−1), mixturesof Multinomials model for Yij :

    Yij |psijind∼ Multinomial(psij ),

    siind∼ Multinomial(σ).

  • 1524

    Accuracy model (ACC) (Rodrigues et al., 2014)assumes the following constraints on psij :

    psijk =

    {πj , if si = k,1−πjJ−1 si 6= k.

    Here, for each labelling function it is assumed tohave the same accuracy πj for all of the tokens.

    Confusion vector (CV) (Nguyen et al., 2017a)extends ACC by relying on separate success prob-abilities for each token label:

    psijk =

    {πjk, if si = k,1−πjkJ−1 si 6= k.

    Confusion matrix (CM) (Dawid and Skene,1979) allows for distinct accuracies conditional onthe latent states, which results in:

    psijk = πsijk. (8)

    Sequential Confusion Matrix (SEQ) extendsthe CM model of Simpson and Gurevych (2019),where an ”auto-regressive” component is includedin the observed part of the model. We assume de-pendence on a covariate indicating that the labelhas not changed for a given source, i.e.:

    psijk = logit−1(µsijk + I(Y

    Ti−1,j,k = Y

    Ti,j,k)β

    sijk).

    Dependent confusion matrix (DCM) combinesthe CM-distinct accuracies conditional on the latentstates of (8) and the Markovian dependence of (3).

    4.3 ResultsThe evaluation results are shown in Tables 1 and2, respectively for the CoNLL 2003 data and thecrowd-annotated sentences. The metrics are the(micro-averaged) precision, recall and F1 scoresat both the token-level and entity-level. In addi-tion, we indicate the token-level cross-entropy er-ror (in log-scale). As the labelling functions aredefined on a richer annotation scheme than the fourlabels of ConLL 2003, we map GPE to LOC andEVENT, FAC, LANGUAGE, LAW, NORP, PRODUCTand WORK OF ART to MISC. The results for theACC and CV baselines are not included as the pa-rameter estimation did not converge and hence didnot provide reliable posteriors over parameters.

    Table 1 further details the results for subsets oflabelling functions. Of particular interest is the con-tribution of document-level functions, boosting the

    entity-level F1 from 0.702 to 0.716. This highlightsthe importance of these relations in NER.

    The last line of the two tables reports the per-formance of the sequence labelling model (Section3.3) trained on the aggregated labels. We observethat its performance remains close to the HMM-aggregated labels. This shows that the knowledgefrom the labelling functions can be injected into astandard neural model without substantial loss.

    4.4 DiscussionAlthough not shown in the results due to spaceconstraints, we also analysed whether the informa-tive priors described in Section 3.2 influenced theperformance of the aggregation model. We foundinformative and non-informative priors to yield sim-ilar performance for CoNLL 2003. However, theperformance of non-informative priors was verypoor on the Reuters and Bloomberg sentences (F1at 0.12), thereby demonstrating the usefulness ofinformative priors for small datasets.

    We provide in Figure 3 an example with a fewselected labelling functions. In particular, we canobserve that the Ontonotes-trained NER model mis-takenly labels ”Heidrun” as a product. This erro-neous label, however, is counter-balanced by otherlabelling functions, notably a document-level func-tion looking at the global label frequency of thisstring through the document. We do, however, no-tice a few remaining errors, e.g. the labelling of”Status Weekly” as an organisation.

    Figure 4 illustrates the pairwise agreement anddisagreement between labelling functions on theCoNLL 2003 dataset. If both labelling functionsmake the same prediction on a given token, wecount this as an agreement, whereas conflicting pre-dictions (ignoring O labels), are seen as disagree-ment. Large differences may exist between thesefunctions for specific labels, especially MISC. Thefunctions with the highest overlap are those makingpredictions on all labels, while labelling functionsspecialised to few labels (such as legal detector) of-ten have less overlap. We also observe that thetwo gazetteers from Crunchbase and Geonamesdisagree in about 15% of cases, presumably dueto company names that are also geographical loca-tions, as in the earlier Komatsu example.

    In terms of computational efficiency, the estima-tion of HMM parameters is relatively fast, requir-ing less than 30 mins on the entire CoNLL 2003data. Once the aggregation model is estimated, it

  • 1525

    Token-level Entity-levelModel: P R F1 CEE P R F1

    Ontonotes-trained NER 0.719 0.706 0.712 2.671 0.694 0.620 0.654

    Majority voting (MV) 0.815 0.675 0.738 2.047 0.751 0.619 0.678Confusion Matrix (CM) 0.786 0.746 0.766 1.964 0.713 0.700 0.706Sequential Confusion Matrix (SEQ) 0.736 0.716 0.726 2.254 0.642 0.668 0.654Dependent Confusion Matrix (DCM) 0.785 0.744 0.764 1.983 0.710 0.698 0.704Snorkel-aggregated labels 0.710 0.661 0.684 2.264 0.714 0.621 0.664

    mSDA (OntoNotes) 0.640 0.569 0.603 2.813 0.560 0.562 0.561AdaptaBERT (OntoNotes) 0.693 0.733 0.712 2.280 0.652 0.736 0.691AdaptaBERT (Ensemble) 0.704 0.754 0.729 2.103 0.684 0.743 0.712

    HMM-agg. labels (only NER models) 0.658 0.720 0.688 2.653 0.642 0.599 0.620HMM-agg. labels (only gazetteers) 0.759 0.394 0.518 3.678 0.687 0.367 0.478HMM-agg. labels (only heuristics) 0.722 0.771 0.746 1.989 0.718 0.683 0.700HMM-agg. labels (all but doc-level) 0.714 0.778 0.744 1.878 0.713 0.693 0.702HMM-agg. labels (all functions) 0.719 0.794 0.754 1.812 0.721 0.713 0.716

    Neural net trained on HMM-agg. labels 0.712 0.790 0.748 2.282 0.715 0.707 0.710

    Table 1: Evaluation results on CoNLL 2003. MV=Majority Voting, P=Precision, R=Recall, CEE=Cross-entropyError (lower is better). The results are micro-averaged on all labels (PER, ORG, LOC and MISC).

    Token-level Entity-levelModel: P R F1 CEE P R F1

    OntoNotes-trained NER 0.793 0.791 0.792 2.648 0.694 0.635 0.664

    Majority voting (MV) 0.832 0.713 0.768 2.454 0.699 0.644 0.670Confusion Matrix (CM) 0.816 0.702 0.754 2.708 0.667 0.636 0.652Sequential Confusion Matrix (SEQ) 0.741 0.630 0.682 3.261 0.535 0.547 0.540Dependent Confusion Matrix (DCM) 0.819 0.706 0.758 2.702 0.673 0.641 0.656

    mSDA (OntoNotes) 0.749 0.751 0.750 2.501 0.618 0.684 0.649AdaptaBERT (OntoNotes) 0.799 0.801 0.800 2.351 0.668 0.734 0.699AdaptaBERT (Ensemble) 0.813 0.815 0.814 2.265 0.682 0.748 0.713

    HMM-aggregated labels (all functions) 0.804 0.823 0.814 2.219 0.749 0.697 0.722

    Neural net trained on HMM-agg. labels 0.805 0.827 0.816 2.448 0.749 0.701 0.724

    Table 2: Evaluation results on 1094 crowd-annotated sentences from Reuters and Bloomberg news articles. Theresults are micro-averaged on 8 labels (PERSON, NORP, ORG, LOC, PRODUCT, DATE, PERCENT, and MONEY).

    can be directly applied to new texts with a singleforward-backward pass, and can therefore scale todatasets with hundreds of thousands of documents.This runtime performance is an important advan-tage compared to approaches such as AdaptaBERT(Han and Eisenstein, 2019) which are relativelyslow at inference time. The proposed approach canalso be ported to other languages than English, al-though heuristic functions and gazetteers will needto be adapted to the target language.

    5 Conclusion

    This paper presented a weak supervision modelfor sequence labelling tasks such as Named EntityRecognition. To leverage all possible knowledgesources available for the task, the approach usesa broad spectrum of labelling functions, includ-ing data-driven NER models, gazetteers, heuristicfunctions, and document-level relations betweenentities. Labelling functions may be specialisedto recognise specific labels while ignoring oth-

  • 1526

    Well repairs to lift HeidrunPRODUCT

    LOC

    oil output - StatoilCOMPANY

    . OSLOGPE

    1996-08-22DATE

    CARDINAL

    ThreeCARDINAL

    plugged water injection wells on the HeidrunPRODUCT

    LOC

    COMPANY

    oilfield off mid-Norway will be reopened over the next monthDATE

    , operator Den Norske StatsCOMPANY

    Oljeselskap

    PERSON

    AS

    ORG

    ( StatoilCOMPANY

    ) said on Thursday

    DATE

    .

    The plugged wells have accounted for a dip of 30,000CARDINAL

    barrels

    QUANTITY

    per day ( bpd ) in HeidrunLOC

    output to roughly 220,000CARDINAL

    bpd

    QUANTITY

    , according

    to the company ’s Status WeeklyORG

    newsletter . The wells will be reperforated and gravel will be pumped into the reservoir through oneCARDINAL

    TIME

    of the wells to avoid plugging problems in the future , it said . – OsloGPE

    newsroom

    Neural models: Ontonotes-trained NER ;Gazetteers: company uncased ; Heuristic functions: date detector, snips, and number detector ;Document level functions: doc majority uncased ; Aggregated predictions: HMM-aggregated model

    Figure 3: Extended example showing the outputs of 6 labelling functions, along with the HMM-aggregated model.

    prop

    ernn

    pcompo

    und

    misc

    lega

    lcompa

    nyfull_na

    me

    BTC

    SEC

    wiki

    geo

    crun

    chba

    seprod

    uct

    doc_history

    doc_majority

    propernnp

    compoundmisclegal

    companyfull_name

    BTCSECwikigeo

    crunchbaseproduct

    doc_historydoc_majority

    prop

    ernn

    pcompo

    und

    misc

    lega

    lcompa

    nyfull_na

    me

    BTC

    SEC

    wiki

    geo

    crun

    chba

    seprod

    uct

    doc_history

    doc_majority

    propernnp

    compoundmisclegal

    companyfull_name

    BTCSECwikigeo

    crunchbaseproduct

    doc_historydoc_majority0.00

    0.25

    0.50

    0.75

    1.00

    0.00

    0.02

    0.04

    0.06

    0.08

    0.10

    0.12

    0.14

    Figure 4: Pairwise agreement (left) and disagreement (right) between the labelling functions on the CoNLL 2003data with labels PER, ORG, LOC, MISC, normalized by total number of labelled examples.

    ers. Furthermore, unlike previous weak supervi-sion approaches, labelling functions may produceprobabilistic predictions. The outputs of these la-belling functions are then merged together using ahidden Markov model whose parameters are esti-mated with the Baum-Welch algorithm. A neuralsequence labelling model can finally be learned onthe basis of these unified predictions.

    Evaluation results on two datasets (CoNLL 2003and news articles from Reuters and Bloomberg)show that the method can boost NER performanceby about 7 percentage points on entity-level F1.In particular, the proposed model outperforms theunsupervised domain adaptation approach throughcontextualised embeddings of Han and Eisenstein(2019). Of specific linguistic interest is the con-tribution of document-level labelling functions,which take advantage of the internal coherence andnarrative structure of the texts.

    Future work will investigate how to take intoaccount potential correlations between labelling

    functions in the aggregation model, as done ine.g. (Bach et al., 2017). Furthermore, some of thelabelling functions can be rather noisy and modelselection of the optimal subset of the labelling func-tions might well improve the performance of ourmodel. Model selection approaches that can beadapted are discussed in Adams and Beling (2019);Hubin (2019). We also wish to evaluate the ap-proach on other types of sequence labelling tasksbeyond Named Entity Recognition.

    Acknowledgements

    The research presented in this paper was conductedas part of the innovation project ”FinAI: ArtificialIntelligence tool to monitor global financial mar-kets” in collaboration with Exabel AS4. Addition-ally, this work is supported by the SANT project(Sentiment Analysis for Norwegian Text), fundedby the Research Council of Norway.

    4www.exabel.com

    www.exabel.com

  • 1527

    ReferencesStephen Adams and Peter A Beling. 2019. A survey

    of feature selection methods for Gaussian mixturemodels and hidden Markov models. Artificial Intel-ligence Review, 52(3):1739–1779.

    Alan Akbik, Tanja Bergmann, and Roland Vollgraf.2019. Pooled contextualized embeddings for namedentity recognition. In Proceedings of the 2019 Con-ference of the North American Chapter of the Asso-ciation for Computational Linguistics: Human Lan-guage Technologies, Volume 1 (Long and Short Pa-pers), pages 724–728, Minneapolis, Minnesota. As-sociation for Computational Linguistics.

    Stephen H. Bach, Bryan He, Alexander Ratner, andChristopher Ré. 2017. Learning the structure of gen-erative models without labeled data. In Proceed-ings of the 34th International Conference on Ma-chine Learning - Volume 70, ICML’17, pages 273–282. JMLR.org.

    Jeremy Barnes, Roman Klinger, and Sabine Schulte imWalde. 2018. Projecting embeddings for domainadaption: Joint modeling of sentiment analysis indiverse domains. In Proceedings of the 27th Inter-national Conference on Computational Linguistics,pages 818–830, Santa Fe, New Mexico, USA. Asso-ciation for Computational Linguistics.

    John Blitzer, Mark Dredze, and Fernando Pereira. 2007.Biographies, Bollywood, boom-boxes and blenders:Domain adaptation for sentiment classification. InProceedings of the 45th Annual Meeting of the As-sociation of Computational Linguistics, pages 440–447, Prague, Czech Republic. Association for Com-putational Linguistics.

    John Blitzer, Ryan McDonald, and Fernando Pereira.2006. Domain adaptation with structural correspon-dence learning. In Proceedings of the 2006 Con-ference on Empirical Methods in Natural LanguageProcessing, pages 120–128, Sydney, Australia. As-sociation for Computational Linguistics.

    Daniel Braun, Adrian Hernandez Mendez, FlorianMatthes, and Manfred Langen. 2017. Evaluatingnatural language understanding services for conver-sational question answering systems. In Proceed-ings of the 18th Annual SIGdial Meeting on Dis-course and Dialogue, pages 174–185, Saarbrücken,Germany. Association for Computational Linguis-tics.

    Minmin Chen, Zhixiang Xu, Kilian Q. Weinberger,and Fei Sha. 2012. Marginalized denoising autoen-coders for domain adaptation. In Proceedings ofthe 29th International Coference on InternationalConference on Machine Learning, ICML’12, pages1627–1634, USA. Omnipress.

    Jason P.C. Chiu and Eric Nichols. 2016. Named entityrecognition with bidirectional LSTM-CNNs. Trans-actions of the Association for Computational Lin-guistics, 4:357–370.

    Hal Daumé III. 2007. Frustratingly easy domain adap-tation. In Proceedings of the 45th Annual Meeting ofthe Association of Computational Linguistics, pages256–263, Prague, Czech Republic. Association forComputational Linguistics.

    A. P. Dawid and A. M. Skene. 1979. Maximum likeli-hood estimation of observer error-rates using the emalgorithm. Applied Statistics, 28(1):20–28.

    Leon Derczynski, Kalina Bontcheva, and Ian Roberts.2016. Broad twitter corpus: A diverse named entityrecognition resource. In Proceedings of COLING2016, the 26th International Conference on Compu-tational Linguistics: Technical Papers, pages 1169–1179, Osaka, Japan. The COLING 2016 OrganizingCommittee.

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, andKristina Toutanova. 2019. BERT: Pre-training ofdeep bidirectional transformers for language under-standing. In Proceedings of the 2019 Conferenceof the North American Chapter of the Associationfor Computational Linguistics: Human LanguageTechnologies, Volume 1 (Long and Short Papers),pages 4171–4186, Minneapolis, Minnesota. Associ-ation for Computational Linguistics.

    Xiao Ding, Yue Zhang, Ting Liu, and Junwen Duan.2014. Using structured events to predict stock pricemovement: An empirical investigation. In Proceed-ings of the 2014 Conference on Empirical Methodsin Natural Language Processing (EMNLP), pages1415–1425, Doha, Qatar. Association for Computa-tional Linguistics.

    Jason Fries, Sen Wu, Alex Ratner, and Christopher Ré.2017. Swellshark: A generative model for biomedi-cal named entity recognition without labeled data.

    Johanna Geiß, Andreas Spitz, and Michael Gertz. 2018.Neckar: A named entity classifier for wikidata. InLanguage Technologies for the Challenges of theDigital Age, pages 115–129, Cham. Springer Inter-national Publishing.

    Marjan Ghazvininejad, Chris Brockett, Ming-WeiChang, Bill Dolan, Jianfeng Gao, Scott Wen-tau Yih,and Michel Galley. 2018. A knowledge-groundedneural conversation model. In AAAI.

    Xavier Glorot, Antoine Bordes, and Yoshua Bengio.2011. Domain adaptation for large-scale senti-ment classification: A deep learning approach. InProceedings of the 28th International Conferenceon International Conference on Machine Learning,ICML’11, pages 513–520, USA. Omnipress.

    Barbara J. Grosz, Aravind K. Joshi, and Scott Wein-stein. 1995. Centering: A framework for model-ing the local coherence of discourse. ComputationalLinguistics, 21(2):203–225.

    Barbara J. Grosz and Candace L. Sidner. 1986. Atten-tion, intentions, and the structure of discourse. Com-putational Linguistics, 12(3):175–204.

    https://doi.org/10.18653/v1/N19-1078https://doi.org/10.18653/v1/N19-1078http://dl.acm.org/citation.cfm?id=3305381.3305410http://dl.acm.org/citation.cfm?id=3305381.3305410https://www.aclweb.org/anthology/C18-1070https://www.aclweb.org/anthology/C18-1070https://www.aclweb.org/anthology/C18-1070https://www.aclweb.org/anthology/P07-1056https://www.aclweb.org/anthology/P07-1056https://www.aclweb.org/anthology/W06-1615https://www.aclweb.org/anthology/W06-1615https://doi.org/10.18653/v1/W17-5522https://doi.org/10.18653/v1/W17-5522https://doi.org/10.18653/v1/W17-5522http://dl.acm.org/citation.cfm?id=3042573.3042781http://dl.acm.org/citation.cfm?id=3042573.3042781https://doi.org/10.1162/tacl_a_00104https://doi.org/10.1162/tacl_a_00104https://www.aclweb.org/anthology/P07-1033https://www.aclweb.org/anthology/P07-1033http://links.jstor.org/sici?sici=0035-9254%281979%2928%3A1%3C20%3AMLEOOE%3E2.0.CO%3B2-0http://links.jstor.org/sici?sici=0035-9254%281979%2928%3A1%3C20%3AMLEOOE%3E2.0.CO%3B2-0http://links.jstor.org/sici?sici=0035-9254%281979%2928%3A1%3C20%3AMLEOOE%3E2.0.CO%3B2-0https://www.aclweb.org/anthology/C16-1111https://www.aclweb.org/anthology/C16-1111https://doi.org/10.18653/v1/N19-1423https://doi.org/10.18653/v1/N19-1423https://doi.org/10.18653/v1/N19-1423https://doi.org/10.3115/v1/D14-1148https://doi.org/10.3115/v1/D14-1148http://arxiv.org/abs/1704.06360http://arxiv.org/abs/1704.06360https://www.microsoft.com/en-us/research/publication/knowledge-grounded-neural-conversation-model/https://www.microsoft.com/en-us/research/publication/knowledge-grounded-neural-conversation-model/http://dl.acm.org/citation.cfm?id=3104482.3104547http://dl.acm.org/citation.cfm?id=3104482.3104547https://www.aclweb.org/anthology/J95-2003https://www.aclweb.org/anthology/J95-2003https://www.aclweb.org/anthology/J86-3001https://www.aclweb.org/anthology/J86-3001

  • 1528

    Honglei Guo, Huijia Zhu, Zhili Guo, Xiaoxun Zhang,Xian Wu, and Zhong Su. 2009. Domain adaptationwith latent semantic association for named entityrecognition. In Proceedings of Human LanguageTechnologies: The 2009 Annual Conference of theNorth American Chapter of the Association for Com-putational Linguistics, pages 281–289, Boulder, Col-orado. Association for Computational Linguistics.

    Xiaochuang Han and Jacob Eisenstein. 2019. Unsu-pervised domain adaptation of contextualized em-beddings for sequence labeling. In Proceedings ofthe 2019 Conference on Empirical Methods in Nat-ural Language Processing and the 9th InternationalJoint Conference on Natural Language Processing(EMNLP-IJCNLP), pages 4237–4247, Hong Kong,China. Association for Computational Linguistics.

    Matthew Honnibal and Ines Montani. 2017. spacy 2:Natural language understanding with bloom embed-dings, convolutional neural networks and incremen-tal parsing. To appear.

    Dirk Hovy, Taylor Berg-Kirkpatrick, Ashish Vaswani,and Eduard Hovy. 2013. Learning whom to trustwith MACE. In Proceedings of the 2013 Conferenceof the North American Chapter of the Associationfor Computational Linguistics: Human LanguageTechnologies, pages 1120–1130, Atlanta, Georgia.Association for Computational Linguistics.

    Jeremy Howard and Sebastian Ruder. 2018. Universallanguage model fine-tuning for text classification. InProceedings of the 56th Annual Meeting of the As-sociation for Computational Linguistics (Volume 1:Long Papers), pages 328–339, Melbourne, Australia.Association for Computational Linguistics.

    Zhiting Hu, Xuezhe Ma, Zhengzhong Liu, EduardHovy, and Eric Xing. 2016. Harnessing deep neu-ral networks with logic rules. In Proceedings of the54th Annual Meeting of the Association for Compu-tational Linguistics (Volume 1: Long Papers), pages2410–2420, Berlin, Germany. Association for Com-putational Linguistics.

    Aliaksandr Hubin. 2019. An adaptive simulatedannealing EM algorithm for inference on non-homogeneous hidden Markov models. In Proceed-ings of the International Conference on Artificial In-telligence, Information Processing and Cloud Com-puting, pages 1–9.

    Hyun-Chul Kim and Zoubin Ghahramani. 2012.Bayesian classifier combination. In Proceedings ofthe Fifteenth International Conference on ArtificialIntelligence and Statistics, volume 22 of Proceed-ings of Machine Learning Research, pages 619–627,La Palma, Canary Islands. PMLR.

    Vijay Krishnan and Christopher D. Manning. 2006. Aneffective two-stage model for exploiting non-localdependencies in named entity recognition. In Pro-ceedings of the 21st International Conference on

    Computational Linguistics and 44th Annual Meet-ing of the Association for Computational Linguis-tics, pages 1121–1128, Sydney, Australia. Associa-tion for Computational Linguistics.

    Guillaume Lample, Miguel Ballesteros, Sandeep Sub-ramanian, Kazuya Kawakami, and Chris Dyer. 2016.Neural architectures for named entity recognition.In Proceedings of the 2016 Conference of the NorthAmerican Chapter of the Association for Computa-tional Linguistics: Human Language Technologies,pages 260–270, San Diego, California. Associationfor Computational Linguistics.

    Jens Lehmann, Robert Isele, Max Jakob, Anja Jentzsch,Dimitris Kontokostas, Pablo N. Mendes, SebastianHellmann, Mohamed Morsey, Patrick van Kleef,Sören Auer, and Christian Bizer. 2015. Dbpedia -a large-scale, multilingual knowledge base extractedfrom wikipedia. Semantic Web, 6(2):167–195.

    Bill Yuchen Lin and Wei Lu. 2018. Neural adapta-tion layers for cross-domain named entity recogni-tion. In Proceedings of the 2018 Conference onEmpirical Methods in Natural Language Processing,pages 2012–2022, Brussels, Belgium. Associationfor Computational Linguistics.

    Mike Mintz, Steven Bills, Rion Snow, and Daniel Ju-rafsky. 2009. Distant supervision for relation ex-traction without labeled data. In Proceedings ofthe Joint Conference of the 47th Annual Meeting ofthe ACL and the 4th International Joint Conferenceon Natural Language Processing of the AFNLP,pages 1003–1011, Suntec, Singapore. Associationfor Computational Linguistics.

    Diego Mollá, Menno van Zaanen, and Daniel Smith.2006. Named entity recognition for question an-swering. In Proceedings of the Australasian Lan-guage Technology Workshop 2006, pages 51–58,Sydney, Australia.

    An T Nguyen, Byron C Wallace, Junyi Jessy Li, AniNenkova, and Matthew Lease. 2017a. Aggregat-ing and predicting sequence labels from crowd an-notations. In Proceedings of the conference. Asso-ciation for Computational Linguistics. Meeting, vol-ume 2017, page 299. NIH Public Access.

    An Thanh Nguyen, Byron Wallace, Junyi Jessy Li, AniNenkova, and Matthew Lease. 2017b. Aggregatingand predicting sequence labels from crowd annota-tions. In Proceedings of the 55th Annual Meeting ofthe Association for Computational Linguistics (Vol-ume 1: Long Papers), pages 299–309, Vancouver,Canada. Association for Computational Linguistics.

    Nanyun Peng and Mark Dredze. 2017. Multi-task do-main adaptation for sequence tagging. In Proceed-ings of the 2nd Workshop on Representation Learn-ing for NLP, pages 91–100, Vancouver, Canada. As-sociation for Computational Linguistics.

    https://www.aclweb.org/anthology/N09-1032https://www.aclweb.org/anthology/N09-1032https://www.aclweb.org/anthology/N09-1032https://doi.org/10.18653/v1/D19-1433https://doi.org/10.18653/v1/D19-1433https://doi.org/10.18653/v1/D19-1433https://www.aclweb.org/anthology/N13-1132https://www.aclweb.org/anthology/N13-1132https://doi.org/10.18653/v1/P18-1031https://doi.org/10.18653/v1/P18-1031https://doi.org/10.18653/v1/P16-1228https://doi.org/10.18653/v1/P16-1228http://proceedings.mlr.press/v22/kim12.htmlhttps://doi.org/10.3115/1220175.1220316https://doi.org/10.3115/1220175.1220316https://doi.org/10.3115/1220175.1220316https://doi.org/10.18653/v1/N16-1030http://dblp.uni-trier.de/db/journals/semweb/semweb6.html#LehmannIJJKMHMK15http://dblp.uni-trier.de/db/journals/semweb/semweb6.html#LehmannIJJKMHMK15http://dblp.uni-trier.de/db/journals/semweb/semweb6.html#LehmannIJJKMHMK15https://doi.org/10.18653/v1/D18-1226https://doi.org/10.18653/v1/D18-1226https://doi.org/10.18653/v1/D18-1226https://www.aclweb.org/anthology/P09-1113https://www.aclweb.org/anthology/P09-1113https://www.aclweb.org/anthology/U06-1009https://www.aclweb.org/anthology/U06-1009https://doi.org/10.18653/v1/P17-1028https://doi.org/10.18653/v1/P17-1028https://doi.org/10.18653/v1/P17-1028https://doi.org/10.18653/v1/W17-2612https://doi.org/10.18653/v1/W17-2612

  • 1529

    Jeffrey Pennington, Richard Socher, and ChristopherManning. 2014. Glove: Global vectors for word rep-resentation. In Proceedings of the 2014 Conferenceon Empirical Methods in Natural Language Process-ing (EMNLP), pages 1532–1543, Doha, Qatar. Asso-ciation for Computational Linguistics.

    Matthew Peters, Mark Neumann, Mohit Iyyer, MattGardner, Christopher Clark, Kenton Lee, and LukeZettlemoyer. 2018. Deep contextualized word rep-resentations. In Proceedings of the 2018 Confer-ence of the North American Chapter of the Associ-ation for Computational Linguistics: Human Lan-guage Technologies, Volume 1 (Long Papers), pages2227–2237, New Orleans, Louisiana. Associationfor Computational Linguistics.

    Lawrence R. Rabiner. 1990. A tutorial on hiddenmarkov models and selected applications in speechrecognition. In Alex Waibel and Kai-Fu Lee, edi-tors, Readings in Speech Recognition, pages 267–296. Morgan Kaufmann Publishers Inc., San Fran-cisco, CA, USA.

    Jonathan Raiman and Olivier Raiman. 2018. Deep-type: Multilingual entity linking by neural type sys-tem evolution. In Proceedings of the Thirty-SecondAAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial In-telligence (IAAI-18), and the 8th AAAI Symposiumon Educational Advances in Artificial Intelligence(EAAI-18), New Orleans, Louisiana, USA, February2-7, 2018, pages 5406–5413.

    Lev Ratinov and Dan Roth. 2009. Design chal-lenges and misconceptions in named entity recog-nition. In Proceedings of the Thirteenth Confer-ence on Computational Natural Language Learning(CoNLL-2009), pages 147–155, Boulder, Colorado.Association for Computational Linguistics.

    Alexander Ratner, Stephen H. Bach, Henry Ehrenberg,Jason Fries, Sen Wu, and Christopher Ré. 2017.Snorkel: Rapid training data creation with weak su-pervision. Proc. VLDB Endow., 11(3):269–282.

    Alexander Ratner, Stephen H. Bach, Henry Ehrenberg,Jason Fries, Sen Wu, and Christopher Ré. 2019.Snorkel: rapid training data creation with weak su-pervision. The VLDB Journal.

    Alan Ritter, Luke Zettlemoyer, Mausam, and Oren Et-zioni. 2013. Modeling missing data in distant su-pervision for information extraction. Transactionsof the Association for Computational Linguistics,1:367–378.

    Filipe Rodrigues, Francisco Pereira, and BernardeteRibeiro. 2014. Sequence labeling with multiple an-notators. Mach. Learn., 95(2):165–181.

    Juan Diego Rodriguez, Adam Caldwell, and AlexanderLiu. 2018. Transfer learning for entity recognitionof novel classes. In Proceedings of the 27th Inter-national Conference on Computational Linguistics,

    pages 1974–1985, Santa Fe, New Mexico, USA. As-sociation for Computational Linguistics.

    Esteban Safranchik, Shiying Luo, and Stephen H. Bach.2020. Weakly supervised sequence tagging fromnoisy rules. In AAAI Conference on Artificial Intelli-gence (AAAI).

    Omer Sagi and Lior Rokach. 2018. Ensemble learn-ing: A survey. WIREs Data Mining and KnowledgeDiscovery, 8(4):e1249.

    Julio Cesar Salinas Alvarado, Karin Verspoor, and Tim-othy Baldwin. 2015. Domain adaption of named en-tity recognition to support credit risk assessment. InProceedings of the Australasian Language Technol-ogy Association Workshop 2015, pages 84–90, Par-ramatta, Australia.

    Jingbo Shang, Liyuan Liu, Xiaotao Gu, Xiang Ren,Teng Ren, and Jiawei Han. 2018. Learning namedentity tagger using domain-specific dictionary. InProceedings of the 2018 Conference on EmpiricalMethods in Natural Language Processing, pages2054–2064, Brussels, Belgium. Association forComputational Linguistics.

    Edwin D. Simpson and Iryna Gurevych. 2019. ABayesian approach for sequence tagging withcrowds. In Proceedings of the 2019 Conference onEmpirical Methods in Natural Language Processingand the 9th International Joint Conference on Natu-ral Language Processing (EMNLP-IJCNLP), pages1093–1104, Hong Kong, China. Association forComputational Linguistics.

    Benjamin Strauss, Bethany Toma, Alan Ritter, Marie-Catherine de Marneffe, and Wei Xu. 2016. Resultsof the WNUT16 named entity recognition sharedtask. In Proceedings of the 2nd Workshop on NoisyUser-generated Text (WNUT), pages 138–144, Os-aka, Japan. The COLING 2016 Organizing Commit-tee.

    Amber Stubbs, Christopher Kotfila, and Özlem Uzuner.2015. Automated systems for the de-identificationof longitudinal clinical narratives. Journal ofBiomedical Informatics, 58(S):S11–S19.

    Erik F. Tjong Kim Sang and Fien De Meulder.2003. Introduction to the CoNLL-2003 shared task:Language-independent named entity recognition. InProceedings of the Seventh Conference on Natu-ral Language Learning at HLT-NAACL 2003, pages142–147.

    Arata Ugawa, Akihiro Tamura, Takashi Ninomiya, Hi-roya Takamura, and Manabu Okumura. 2018. Neu-ral machine translation incorporating named entity.In Proceedings of the 27th International Conferenceon Computational Linguistics, pages 3240–3250,Santa Fe, New Mexico, USA. Association for Com-putational Linguistics.

    https://doi.org/10.3115/v1/D14-1162https://doi.org/10.3115/v1/D14-1162https://doi.org/10.18653/v1/N18-1202https://doi.org/10.18653/v1/N18-1202http://dl.acm.org/citation.cfm?id=108235.108253http://dl.acm.org/citation.cfm?id=108235.108253http://dl.acm.org/citation.cfm?id=108235.108253https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/17148https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/17148https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/17148https://www.aclweb.org/anthology/W09-1119https://www.aclweb.org/anthology/W09-1119https://www.aclweb.org/anthology/W09-1119https://doi.org/10.14778/3157794.3157797https://doi.org/10.14778/3157794.3157797https://doi.org/10.1007/s00778-019-00552-1https://doi.org/10.1007/s00778-019-00552-1https://doi.org/10.1162/tacl_a_00234https://doi.org/10.1162/tacl_a_00234https://doi.org/10.1007/s10994-013-5411-2https://doi.org/10.1007/s10994-013-5411-2https://www.aclweb.org/anthology/C18-1168https://www.aclweb.org/anthology/C18-1168https://doi.org/10.1002/widm.1249https://doi.org/10.1002/widm.1249https://www.aclweb.org/anthology/U15-1010https://www.aclweb.org/anthology/U15-1010https://doi.org/10.18653/v1/D18-1230https://doi.org/10.18653/v1/D18-1230https://doi.org/10.18653/v1/D19-1101https://doi.org/10.18653/v1/D19-1101https://doi.org/10.18653/v1/D19-1101https://www.aclweb.org/anthology/W16-3919https://www.aclweb.org/anthology/W16-3919https://www.aclweb.org/anthology/W16-3919https://doi.org/10.1016/j.jbi.2015.06.007https://doi.org/10.1016/j.jbi.2015.06.007https://www.aclweb.org/anthology/W03-0419https://www.aclweb.org/anthology/W03-0419https://www.aclweb.org/anthology/C18-1274https://www.aclweb.org/anthology/C18-1274

  • 1530

    Hai Wang and Hoifung Poon. 2018. Deep probabilis-tic logic: A unifying framework for indirect super-vision. In Proceedings of the 2018 Conference onEmpirical Methods in Natural Language Processing,pages 1891–1902, Brussels, Belgium. Associationfor Computational Linguistics.

    Limin Wang, Shoushan Li, Qian Yan, and GuodongZhou. 2018. Domain-specific named entity recogni-tion with document-level optimization. ACM Trans.Asian Low-Resour. Lang. Inf. Process., 17(4):33:1–33:15.

    R. Weischedel, E. Hovy, M. Marcus, Palmer M.,R. Belvin, S. Pradhan, L. Ramshaw, and N. Xue.2011. OntoNotes: A large training corpus forenhanced processing. In Handbook of NaturalLanguage Processing and Machine Translation:DARPA Global Autonomous Language Exploitation.Springer.

    Marc Wick. 2015. Geonames ontology.

    Vikas Yadav and Steven Bethard. 2018. A survey on re-cent advances in named entity recognition from deeplearning models. In Proceedings of the 27th Inter-national Conference on Computational Linguistics,pages 2145–2158, Santa Fe, New Mexico, USA. As-sociation for Computational Linguistics.

    Hang Yan, Bocao Deng, Xiaonan Li, and XipengQiu. 2019. Tener: Adapting transformer en-coder for name entity recognition. arXiv preprintarXiv:1911.04474.

    Zhilin Yang, Ruslan Salakhutdinov, and William W.Cohen. 2017. Transfer learning for sequence tag-ging with hierarchical recurrent networks. In Inter-national Conference on Learning Representations.

    Jianfei Yu and Jing Jiang. 2016. Learning sentence em-beddings with auxiliary tasks for cross-domain senti-ment classification. In Proceedings of the 2016 Con-ference on Empirical Methods in Natural LanguageProcessing, pages 236–246, Austin, Texas. Associa-tion for Computational Linguistics.

    Joey Tianyi Zhou, Hao Zhang, Di Jin, Hongyuan Zhu,Meng Fang, Rick Siow Mong Goh, and KennethKwok. 2019. Dual adversarial neural transfer forlow-resource named entity recognition. In Proceed-ings of the 57th Annual Meeting of the Associationfor Computational Linguistics, pages 3461–3471,Florence, Italy. Association for Computational Lin-guistics.

    Yftah Ziser and Roi Reichart. 2017. Neural structuralcorrespondence learning for domain adaptation. InProceedings of the 21st Conference on Computa-tional Natural Language Learning (CoNLL 2017),pages 400–410, Vancouver, Canada. Association forComputational Linguistics.

    https://doi.org/10.18653/v1/D18-1215https://doi.org/10.18653/v1/D18-1215https://doi.org/10.18653/v1/D18-1215https://doi.org/10.1145/3213544https://doi.org/10.1145/3213544http://www.geonames.org/about.htmlhttps://www.aclweb.org/anthology/C18-1182https://www.aclweb.org/anthology/C18-1182https://www.aclweb.org/anthology/C18-1182https://openreview.net/pdf?id=ByxpMd9lxhttps://openreview.net/pdf?id=ByxpMd9lxhttps://doi.org/10.18653/v1/D16-1023https://doi.org/10.18653/v1/D16-1023https://doi.org/10.18653/v1/D16-1023https://doi.org/10.18653/v1/P19-1336https://doi.org/10.18653/v1/P19-1336https://doi.org/10.18653/v1/K17-1040https://doi.org/10.18653/v1/K17-1040

  • 1531

    A Labelling functions

    Group Function name Description

    NeuralNERmodels

    BTC Model trained on the Broad Twitter CorpusBTC+c Model trained on the Broad Twitter Corpus + postprocessingSEC Model trained on SEC-filingsSEC+c Model trained on SEC-filings + postprocessingconll2003 Model trained on CoNLL 2003conll2003+c Model trained on CoNLL 2003 + postprocessingcore web md Model trained on Ontonotes 5.0core web md+c Model trained on Ontonotes 5.0 + postprocessing

    Gazetteers

    wiki cased Gazetteer (case-sensitive) using Wikipedia entriesmultitoken wiki cased Same as above, but restricted to multitoken entitieswiki uncased Gazetteer (case-insensitive) using Wikipedia entriesmultitoken wiki uncased Same as above, but restricted to multitoken entitieswiki small cased Gazetteer (case-sensitive) using Wikipedia entries with non-empty descriptionmultitoken wiki small cased Same as above, but restricted to multitoken entitieswiki small uncased Gazetteer (case-insensitive) using Wikipedia entries with non-empty descriptionmultitoken wiki small uncased Same as above, but restricted to multitoken entitiescompany cased Gazetteer (case-sensitive) using a large list of company namesmultitoken company cased Same as above, but restricted to multitoken entitiescompany uncased Gazetteer from a large list of company names (case-insensitive)multitoken company uncased Same as above, but restricted to multitoken entitiescrunchbase cased Gazetteer (case-sensitive) using the Crunchbase Open Data Mapmultitoken crunchbase cased Same as above, but restricted to multitoken entitiescrunchbase uncased Gazetteer (case-insensitive) using the Crunchbase Open Data Mapmultitoken crunchbase uncased Same as above, but restricted to multitoken entitiesgeo cased Gazetteer (case-sensitive) using the Geonames databasemultitoken geo cased Same as above, but restricted to multitoken entitiesgeo uncased Gazetteer (case-insensitive) using the Geonames databasemultitoken geo uncased Same as above, but restricted to multitoken entitiesproduct cased Gazetteer (case-sensitive) using products extracted from DBPediamultitoken product cased Same as above, but restricted to multitoken entitiesproduct uncased Gazetteer (case-insensitive) using products extracted from DBPediamultitoken product uncased Same as above, but restricted to multitoken entities

    Heuristicfunctions

    date detector Detection of entities of type DATEtime detector Detection of entities of type TIMEmoney detector Detection of entities of type MONEYnumber detector Detection of entities CARDINAL, ORDINAL, PERCENT and QUANTITYlegal detector Detection of entities of type LAWmisc detector Detection of entities of type NORP, LANGUAGE, FAC or EVENTfull name detector Heuristic function to detect full person namescompany type detector Detection of companies with a legal type suffixnnp detector Detection of sequences of tokens with NNP as POS-taginfrequent nnp detector Detection of sequences of tokens with NNP as POS-tag

    + including at least one infrequent token (rank > 15000 in vocabulary)proper detector Detection of proper names based on casinginfrequent proper detector Detection of proper names based on casing + at least one infrequent tokenproper2 detector Detection of proper names based on casinginfrequent proper2 detector Detection of proper names based on casing + at least one infrequent tokencompound detector Detection of proper noun phrases with compound dependency relationsinfrequent compound detector Detection of proper noun phrases with compound dependency relations

    + including at least one infrequent tokensnips Probabilistic parser specialised in the recognition of dates, times, money

    amounts, percents, and cardinal/ordinal values

    Doc-levelfunctions

    doc history Entity classification based on already introduced entities in the documentdoc majority cased Entity classification based on majority labels in document (case-sensitive)doc majority uncased Entity classification based on majority labels in document (case-insensitive)

    Table 3: Full list of labelling functions employed in the experiments. The neural NER models are provided in twoversions: one that directly outputs the raw model predictions, and one that runs a shallow postprocessing step onthe model predictions to correct known recognition errors (for instance, ensuring that a numeric amount that iseither preceded or followed by a currency symbol is always classified as an entity of type MONEY).

  • 1532

    B Label matching problem

    The baseline models relying on mixtures of multinomials have to address the so-called label matchingproblem, which needs some extra care.

    The following approach was employed in the experiments from Section 4:

    • First, we put strong initial values to the probabilities σ of individual classes based on the frequencyof appearance of these classes in the most reliable labelling function. This is expected to increase theprobability of EM exploring the mode around the initialised values.

    • Second, we perform post-processing and set the labels to the states corresponding to the labels withthe highest pairwise correlations to the latent labels from one of the three options:

    1. the most reliable labelling function (Ontonotes-trained NER);2. the majority voting labelling function;3. the suggested Dirichlet dependent mixture model.

    Additionally, if this highest correlation is below the threshold of 0.1 the O label is assigned to thecorresponding state. We empirically observed that the label matching technique that performed best wasto map the states to the labels produced by the majority voter (based on the pairwise correlations).

  • 1533

    C Detailed results

    In Table 4, we provide the detailed results distributed by NER label for the CoNLL data 2003 which werepresented in micro-averaged form in Table 1 of the main paper.

    Label Frequency Model Token-level Entity-levelP R F1 P R F1

    LOC 30.3 % Ontonotes-trained NER 0.767 0.812 0.788 0.764 0.800 0.782Majority voting (MV) 0.740 0.839 0.786 0.739 0.828 0.780Confusion Matrix 0.721 0.895 0.798 0.714 0.890 0.792Sequential Confusion Matrix 0.681 0.856 0.758 0.664 0.848 0.744Dependent Confusion Matrix 0.718 0.890 0.794 0.710 0.886 0.788Snorkel-aggregated labels 0.634 0.855 0.728 0.676 0.747 0.710HMM (only NER models) 0.601 0.825 0.696 0.650 0.733 0.690HMM (only gazetteers) 0.707 0.632 0.668 0.694 0.630 0.660HMM (heuristics) 0.715 0.870 0.784 0.745 0.832 0.786HMM (all but doc-level) 0.701 0.862 0.774 0.724 0.838 0.776HMM (all functions) 0.726 0.859 0.786 0.738 0.839 0.786NN trained on HMM 0.736 0.851 0.790 0.734 0.850 0.788

    PER 28.7 % Ontonotes-trained NER 0.850 0.833 0.842 0.787 0.741 0.764Majority voting (MV) 0.915 0.871 0.892 0.831 0.775 0.802Confusion Matrix 0.891 0.921 0.906 0.806 0.834 0.820Sequential Confusion Matrix 0.849 0.879 0.864 0.730 0.789 0.758Dependent Confusion Matrix 0.892 0.920 0.906 0.806 0.834 0.820Snorkel-aggregated labels 0.816 0.903 0.858 0.769 0.717 0.742HMM (only NER models) 0.837 0.860 0.848 0.770 0.744 0.756HMM (only gazetteers) 0.917 0.452 0.606 0.835 0.391 0.532HMM (heuristics) 0.836 0.933 0.882 0.791 0.799 0.794HMM (all but doc-level) 0.859 0.917 0.888 0.814 0.782 0.798HMM (all functions) 0.857 0.947 0.900 0.820 0.826 0.822NN trained on HMM 0.856 0.946 0.898 0.814 0.824 0.818

    ORG 26.6 % Ontonotes-trained NER 0.536 0.517 0.526 0.437 0.306 0.360Majority voting (MV) 0.725 0.512 0.600 0.610 0.434 0.508Confusion Matrix 0.698 0.613 0.652 0.571 0.537 0.554Sequential Confusion Matrix 0.632 0.590 0.610 0.485 0.515 0.500Dependent Confusion Matrix 0.696 0.613 0.652 0.567 0.536 0.552Snorkel-aggregated labels 0.512 0.639 0.568 0.519 0.496 0.508HMM (only NER models) 0.516 0.549 0.532 0.425 0.333 0.374HMM (only gazetteers) 0.648 0.304 0.414 0.512 0.235 0.322HMM (heuristics) 0.566 0.625 0.594 0.549 0.501 0.524HMM (all but doc-level) 0.565 0.631 0.596 0.551 0.494 0.520HMM (all functions) 0.542 0.665 0.598 0.545 0.527 0.536NN trained on HMM 0.539 0.665 0.596 0.537 0.519 0.528

    MISC 14.4 % Ontonotes-trained NER 0.676 0.599 0.636 0.702 0.583 0.636Majority voting (MV) 0.861 0.187 0.308 0.809 0.193 0.312Confusion Matrix 0.895 0.319 0.470 0.850 0.332 0.478Sequential Confusion Matrix 0.850 0.320 0.464 0.791 0.333 0.468Dependent Confusion Matrix 0.893 0.318 0.468 0.844 0.330 0.474Snorkel-aggregated labels 0.852 0.398 0.542 0.863 0.400 0.546HMM (only NER models) 0.667 0.544 0.600 0.708 0.518 0.598HMM (only gazetteers) 0.745 0.011 0.022 0.594 0.008 0.016HMM (heuristics) 0.842 0.499 0.626 0.850 0.478 0.612HMM (all but doc-level) 0.714 0.596 0.650 0.781 0.575 0.662HMM (all functions) 0.814 0.571 0.672 0.830 0.565 0.672NN trained on HMM 0.852 0.577 0.688 0.866 0.583 0.696

    Table 4: Detailed evaluation results on the CoNLL2003 dataset.