
Hybrid Neural Tagging Model for Open Relation Extraction

SHENGBIN JIA and YANG XIANG∗, Tongji University

Open relation extraction (ORE) remains a challenge: obtaining a semantic representation by discovering arbitrary relation tuples from unstructured text. Conventional methods heavily depend on feature engineering or syntactic parsing, which makes them inefficient or error-cascading. Recently, leveraging supervised deep learning structures to address the ORE task has become an extraordinarily promising approach. However, there are two main challenges: (1) the lack of a labeled corpus large enough to support supervised training; (2) the exploration of a specific neural architecture that adapts to the characteristics of open relation extraction. In this paper, to overcome these difficulties, we build a large-scale, high-quality training corpus in a fully automated way, and design a tagging scheme to assist in transforming the ORE task into a sequence tagging problem. Furthermore, we propose a hybrid neural network model (HNN4ORT) for open relation tagging. The model employs the Ordered Neurons LSTM to encode potential syntactic information for capturing the associations among the arguments and relations. It also introduces a novel Dual Aware Mechanism, including Local-aware Attention and Global-aware Convolution. The dual awarenesses complement each other so that the model can take sentence-level semantics as a global perspective and, at the same time, exploit salient local features to achieve sparse annotation. Experimental results on various testing sets show that our model can achieve state-of-the-art performances compared to conventional methods or other neural models.

CCS Concepts: •Computing methodologies → Artificial intelligence; Natural language processing; Information extraction; •Information systems → Web mining; Data extraction and integration;

Additional Key Words and Phrases: Open relation extraction, neural sequence tagging, syntactics, supervised, corpus

ACM Reference format:
Shengbin Jia and Yang Xiang. 2019. Hybrid Neural Tagging Model for Open Relation Extraction. 1, 1, Article 1 (November 2019), 18 pages.
DOI: 0000001.0000001

1 INTRODUCTION
Open relation extraction (ORE) [3, 14] is an important NLP task for discovering knowledge from unstructured text. It does not identify pre-defined relation types like traditional relation extraction (TRE) [13, 36, 65]. Instead, it aims to obtain a semantic representation that comprises an arbitrary-type relational phrase and two argument phrases. For example, when tracking a fierce sports competition, news reporting systems should find unpredictable relation descriptions (e.g., (Gatlin, again won, the Championships)). ORE systems can satisfy such personalized scenarios by discovering resourceful relations without pre-defined taxonomies. A robust open relation extractor, especially

* This is the corresponding author

This work is supported by the National Natural Science Foundation of China under Grant No. 71571136, and the 2019 Tencent Marketing Solution Rhino-Bird Focused Research Program.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
© 2019 Copyright held by the owner/author(s). Publication rights licensed to ACM. XXXX-XXXX/2019/11-ART1 $15.00
DOI: 0000001.0000001


arXiv:1908.01761v3 [cs.CL] 13 Feb 2020


working with limited well-annotated data, can greatly promote research not only in knowledge base construction [7], but also in a wide variety of applications [31], for example, intelligent question answering [15, 50] and text comprehension [29, 43, 60]. ORE has gained consistent attention.

However, this task is constrained by the lack of labeled corpora and by inefficient extraction models. Conventional methods are usually based on pattern matching to carry out unsupervised or weakly supervised learning. The patterns are usually syntax-relevant. They are handcrafted by linguists [1, 12, 14, 25] or bootstrapped with the help of syntax analysis tools [5, 31, 32]. As for the former, the manual cost is high and the scalability is poor. The latter heavily depends on syntax analysis tools, and the cascading effect caused by parsing errors is serious.

Deep learning-based methods are popular and have achieved good accomplishments in various information extraction tasks [24, 51, 52, 63]. Recently, a few researchers have tried to solve the ORE task with supervised neural network methods, especially neural sequence tagging models [10, 49]. Nevertheless, there are many challenges and difficulties. (1) There are few public large-scale labeled corpora for training supervised ORE learning models, and labeling a training set with a large number of relations manually is heavily costly. (2) It is always challenging for open-ended tasks to produce supervised systems. The specific neural architecture that adapts to the characteristics of ORE should be explored urgently. Normal sequence labeling models encode word-level context information by assigning each word a tag to indicate the boundary of adjacent segments [58, 66]. However, relation extraction only marks the relational words corresponding to arguments, and multiple relations involved in one sentence only focus on fractional aspects of the text. How to capture the relevant information among the arguments and a relation, and how to focus on the key parts of a sentence according to a certain relation, are worthy of special attention.

We overcome the above challenges with the following works. As for the first challenge, we build a large-scale, high-quality corpus in a fully automated way. We verify that the dataset has good diversification and can motivate models to achieve promising performances. Besides, we design a tagging scheme to transform the extraction task into a sequence tagging problem. Each triple corresponds to its own unique tag sequence, so overlapping triples in a sentence can be represented simultaneously and separately.

As for the second challenge, to adapt to the characteristics of open relation extraction, we present a hybrid neural network model (HNN4ORT) for open relation tagging. The model has two highlights: (a) It employs the Ordered Neurons LSTM (ON-LSTM) to learn temporal semantics while capturing potential syntactic information involved in natural language, where syntax is important for relation extraction to acquire the associations between arguments and relations; (b) We propose the Dual Aware Mechanism, including Local-aware Attention and Global-aware Convolution. The two complement each other and let the model focus on salient local features to achieve sparse annotation while considering sentence-level semantics as a global perspective.

In summary, the main contributions of this work are listed as follows:
• We transform the open relation extraction task into a sequence tagging problem and design a tagging scheme to address multiple overlapping relation triples.
• We present a hybrid neural network model (HNN4ORT), especially the Dual Aware Mechanism, to adapt to the characteristics of the ORE task.
• We construct an automatically labeled corpus 1 in favor of adopting supervised approaches for the ORE task. It is simple to construct and is larger in scale than other existing ORE corpora.

1 The dataset can be obtained from https://github.com/TJUNLP/NSL4OIE.


• Experimental results on various testing sets show that our model can produce state-of-the-art performances compared to the conventional methods or other neural models.

The rest of this article is organized as follows. In Section 2, we review related work. Then we detail the process of preparing the training corpus in Section 3. In Section 4, we present our model in detail. In Sections 5 and 6, we conduct and analyze experiments on multiple datasets. Section 7 concludes this work and discusses future research directions.

2 RELATED WORKS
Relation extraction can be divided into two branches: traditional relation extraction and open relation extraction.

Traditional relation extraction (TRE) [11, 13, 36, 65] can be regarded as a classification task committed to identifying pre-defined relation taxonomies between two arguments; undefined relations will not be found. The mainstream methods are based on neural systems that recognize relations by using large-scale knowledge bases to provide distant supervision [46, 52, 55].

As for open relation extraction (ORE) [18, 38], most of the existing methods use pattern-based matching approaches that leverage linguistic analysis. In general, the patterns are generalized by handcrafting [1, 12, 25] or semi-supervised learning [5, 31, 32] such as bootstrapping. These manual patterns achieve higher accuracy than those produced automatically; however, they are costly and inefficient.

Many extractors, such as TextRunner [3], WOEpos [53], and Reverb [14], focus on efficiency by restricting shallow syntactic parsing to part-of-speech tagging and chunking. Meanwhile, many approaches design complex patterns from complicated syntactic processing, especially dependency parsers, such as WOEparse [53], PATTY [37], OLLIE [32], Open IE-4.x [31], MinIE [17] and so on. These extractors can get significantly better results than the extractors based on shallow syntax. However, they heavily rely on the syntactic parsers. Many papers [12, 32] analyzed the errors made by their extractors and found that parser errors account for a large, even the largest, part of the whole. Parsing errors restrain extraction performance and produce a serious error cascade effect. Taking a certain strategy to refine the extraction results should be an effective way to gather high-quality relation triples.

Sequence tagging tasks, such as Chinese word segmentation (CWS), part-of-speech tagging (POS), and named entity recognition (NER), require assigning representative labels to each word in a sentence. Conventional models are linear statistical models, which include Hidden Markov models [4], Maximum entropy models [33], Conditional random fields (CRF) [27], and so on. Neural methods map input sequences to fixed-dimensional vector representations through various neural networks [9, 20, 24, 59, 61], then predict the target sequences from the vectors using a layer with a Softmax activation function [8, 24] or a special CRF layer [39].

There are a few examples that applied neural models to open information extraction tasks. Following the machine translation mechanism, the extraction process is converted into text generation. Zhang et al. [62] extracted predicate-argument structure phrases by using a sequence-to-sequence model. Cui et al. [10] proposed a multi-layered encoder-decoder framework to generate relation-tuple sequences with special placeholders as markers. Similarly, Bhutani et al. [6] used the encoder-decoder method to generate relation triples, but only from limited question and answer pairs. In addition, Zheng et al. [63] creatively designed a model to transform traditional relation extraction into relation sequence tagging. Later, Stanovsky et al. [49] formulated the ORE task as a sequence tagging problem. They applied an LSTM with a Softmax layer to tag each word. However, we try to design more effective semantic learning frameworks to


Input Sentence: The America President Trump will visit the Apple founded by Steven Paul Jobs
Tagging in (Stanovsky et al., 2018): O O O E0-B R-B R-I E2-B E2-I O O O O O
Tags Sequence1: E1-B E1-E R-S E2-S O O O O O O O O O → (The America, President, Trump)
Tags Sequence2: O O O E1-S R-B R-E E2-B E2-E O O O O O → (Trump, will visit, the Apple)
Tags Sequence3: O O O O O O E1-B E1-E R-B R-E E2-B E2-I E2-E → (the Apple, founded by, Steven Paul Jobs)

Fig. 1. Stanovsky et al. [49] annotate the relation sequence as shown in the 2nd line: a sentence produces only one tag sequence. In contrast, the tagging based on our scheme is shown in lines 3-5: multiple, overlapping triples can be represented, where "E1", "R", and "E2" respectively represent Argument1, Relation, and Argument2. Each "Tags Sequence" is independent of the others.

annotate relations. Besides, we have a larger training set of several hundred thousand instances, whereas theirs contains only a few thousand.

3 TRAINING CORPUS AND TAGGING SCHEME
There are few public large-scale labeled datasets for the ORE task. Stanovsky and Dagan [48] created an evaluation corpus by an automatic translation from QA-SRL annotations [21]. It only contains 10,359 tuples over 3,200 sentences. Then, Stanovsky et al. [49] further expanded it with 17,163 labeled open tuples from the QAMR corpus [34]. However, the accuracy has declined.

Therefore, we adopt a bootstrapping mechanism using multiple existing open relation extractors, without having to resort to expensive annotation efforts. Currently, the training set contains 477,701 triples, and it is easy to expand. We also design a tagging scheme to automatically annotate them.

3.1 Collecting Correct Relation Triples
We use three existing excellent and popular extractors (OLLIE [32], ClausIE [12], and Open IE-4.x [31]) to extract relation triples from the raw text 2. These extractors have different areas of expertise, which ensures the diversity of the extraction results. If a triple is obtained simultaneously by all three extractors, we regard it as correct and add it to our corpus.

All the extractors are based on dependency parsing trees, so an extracted relational phrase may be a recombined product. That is, words that are adjacent in the phrase may be far apart in the original sentence, and the order in which words appear in a triple may differ from their order in the sentence. For example, from the sentence "He thought the current would take him out, then he could bring help to rescue me", we can get the triple (the current, would take out, him). Thus, we define some word-order constraints: the elements of a triple (Argument1, Relation, Argument2) are ordered. All words in Argument1 must appear before all words in the Relation. All words contained in Argument2 must appear after the Relation words. The order of the words in the Relation must be consistent with their order in the original sentence. In addition, each relational word must appear in the original sentence and must not be a modified or added word. These constraints make it easy for a model to apply sequence tagging.

2 The original text is produced from the WMT 2011 News Crawl data, at http://www.statmt.org/lm-benchmark/.
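As a minimal illustration of this filtering step, the following Python sketch keeps only triples that all three extractors agree on and that satisfy the word-order constraints above. The extractor outputs are assumed to be already normalized into tuples of token tuples; the function names are ours, and the first-occurrence indexing is a simplification rather than the released pipeline.

```python
def satisfies_constraints(sentence_tokens, triple):
    """Check the word-order constraints for an (Argument1, Relation, Argument2) triple.
    Each element of `triple` is a tuple of tokens. First-occurrence indexing is a
    simplification; a full implementation would use the extractor-provided offsets."""
    arg1, rel, arg2 = triple
    try:
        arg1_pos = [sentence_tokens.index(w) for w in arg1]
        rel_pos = [sentence_tokens.index(w) for w in rel]   # every relational word must occur
        arg2_pos = [sentence_tokens.index(w) for w in arg2]
    except ValueError:
        return False                                        # a word was added or modified
    if rel_pos != sorted(rel_pos):                          # relation keeps its sentence order
        return False
    # Argument1 precedes the relation; Argument2 follows it.
    return max(arg1_pos) < min(rel_pos) and max(rel_pos) < min(arg2_pos)


def collect_training_triples(sentence_tokens, ollie_out, clausie_out, openie4_out):
    """Keep only triples produced by all three extractors (set intersection)
    that also pass the word-order constraints."""
    agreed = set(ollie_out) & set(clausie_out) & set(openie4_out)
    return [t for t in agreed if satisfies_constraints(sentence_tokens, t)]
```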

We randomly sample 100 triples from the dataset to test the accuracy. The accuracy is up to 0.95. The sampling result verifies the validity of the above operations. In addition, it indicates that extraction noise caused by syntactic analysis errors can also be well filtered out.

In order to build a high-quality corpus for model training, we commit to high-accuracy extractions at the expense of recall. The constructed dataset is imperfect, but it still has acceptable scalability. The experimental sections prove the effectiveness of the dataset.

3.2 Automatic Sequence Tagging
Tagging assigns a special label to each word in a sentence [63]. We use the BIOES annotation (Begin, Inside, Outside, End, Single), which indicates the position of a token in an argument or the relational phrase. It has been reported that this annotation is more expressive than others such as BIO [40]. Words appearing outside the scope of a relation triple are labeled as "O". As for a relation triple, the arguments and the relational phrase can each span several tokens within a sentence. Thus, the arguments and the relational phrase need to be segmented and tagged independently.

In the general tagging scheme [49], only one label is assigned to each word in a sentence. However, a sentence may contain multiple, overlapping triples. An argument can have multiple labels when it belongs to different triples. As a result, the normal tagging program cannot handle the case where a relation is involved in multiple triples, or an argument is related to multiple relations.

To cope with such multiplicity, we design an overlap-aware tagging scheme that can assign multiple labels to each word. Each triple corresponds to its own unique tag sequence. Figure 1 is an example of a sentence tagged by our scheme. Given an argument pair, there can only be one relation between the pair in a sentence 3. Therefore, after the argument tags are pre-identified, the sequence of relation tags is uniquely determined.
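To make the scheme concrete, here is a small sketch that emits one BIOES tag sequence per triple, assuming the token spans of the two arguments and the relation have already been located in the sentence; overlapping triples simply produce separate sequences.

```python
def bioes(span_len, role):
    """BIOES tags for a span of `span_len` tokens playing role 'E1', 'R', or 'E2'."""
    if span_len == 1:
        return [role + "-S"]
    return [role + "-B"] + [role + "-I"] * (span_len - 2) + [role + "-E"]


def tag_sequence(n_tokens, arg1_span, rel_span, arg2_span):
    """Build the tag sequence for one triple; each span is a (start, end) pair of
    token offsets with `end` exclusive. Tokens outside the triple get 'O'."""
    tags = ["O"] * n_tokens
    for (start, end), role in [(arg1_span, "E1"), (rel_span, "R"), (arg2_span, "E2")]:
        tags[start:end] = bioes(end - start, role)
    return tags


# (Trump, will visit, the Apple) from Fig. 1: token offsets 3, 4-5, and 6-7.
print(tag_sequence(13, (3, 4), (4, 6), (6, 8)))
# ['O', 'O', 'O', 'E1-S', 'R-B', 'R-E', 'E2-B', 'E2-E', 'O', 'O', 'O', 'O', 'O']
```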

When using models to predict, we extract the candidate argument pairs in advance 4, then transform them into argument embeddings as model inputs to identify the relation between them.

3.3 From Tag Sequence to Extracted Results
From the tag sequence Tags Sequence1 in Figure 1, "The America" and "Trump" can be combined into a triple whose relation is "President". Because the relation role of "The America" is "1" and that of "Trump" is "2", the final result is (The America, President, Trump). The same applies to (Trump, will visit, the Apple) and (the Apple, founded by, Steven Paul Jobs).
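The reverse direction can be sketched just as simply: given the tokens and one predicted tag sequence, group the tagged tokens by role and assemble the triple. The helper name is ours.

```python
def tags_to_triple(tokens, tags):
    """Recover the (Argument1, Relation, Argument2) phrases from one BIOES tag sequence."""
    phrases = {"E1": [], "R": [], "E2": []}
    for token, tag in zip(tokens, tags):
        if tag == "O":
            continue
        role, _position = tag.split("-")
        phrases[role].append(token)
    return (" ".join(phrases["E1"]), " ".join(phrases["R"]), " ".join(phrases["E2"]))


tokens = "The America President Trump will visit the Apple founded by Steven Paul Jobs".split()
tags1 = ["E1-B", "E1-E", "R-S", "E2-S"] + ["O"] * 9
print(tags_to_triple(tokens, tags1))   # ('The America', 'President', 'Trump')
```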

4 METHODOLOGY
We provide a detailed description of our model (HNN4ORT). It is illustrated in Figure 2. Our model has the following strengths:

(1) It employs the Ordered Neurons LSTM (ON-LSTM) to learn temporal semantics while capturing potential syntactic information in natural language.

3 The coordination of verbs in a sentence should be considered as one complete relational phrase.
4 Candidate argument recognition is easy to do with many existing methods, so we can add it to the pre-processing step. We use the methods in ClausIE [12] and Open IE-4.x [31] to pre-identify the arguments in sentences.


(2) It designs a Dual Aware Mechanism, including Local-aware Attention and Global-aware Convolution, which emphasizes focusing on local tagging while taking sentence-level semantics as a global perspective.

Fig. 2. An illustration of our model: (A) Triplet Embedding Layer; (B-1) Local-aware Attention; (B-2) Global-aware Convolution; (C) CRF decoding.

4.1 ON-LSTM Network
Natural language can be treated as a sequence. However, the underlying structure of language is not strictly sequential; it is hierarchically structured, determined by a set of rules, or syntax [41].

This syntactic information is critical to the relation extraction task. There are strict semantic associations and formal constraints: the head argument is the agent of the relation and the tail argument is the object of the relation. Integrating syntax-semantics into a neural network can encode better representations of natural language sentences.

The recurrent neural network (RNN), and in particular its variant the long short-term memory network (LSTM), is good at grasping the temporal semantics of a sequence. They are useful for all kinds of tagging applications [39, 63, 64]. However, RNNs explicitly impose a chain structure on a sentence. The chain structure is contrary to the potential hierarchical structure of language, so RNN (including LSTM) models may find it difficult to effectively learn potential syntactic information [44].

Therefore, we employ a new RNN unit, the ordered neurons LSTM (ON-LSTM) [44]. It enables the model to perform tree-like syntactic structure composition operations without destroying the sequential form. For a given sentence S = {x_1, x_2, . . . , x_t}, the ON-LSTM returns the representation H = {h_1, h_2, . . . , h_t} of the sequence S.


The ON-LSTM adds a special inductive bias to the standard LSTM. This inductive bias promotes differentiation of the life cycle of the information stored inside each neuron: high-ranking neurons store long-term information, while low-ranking neurons store short-term information.

The ON-LSTM uses an architecture similar to the standard LSTM. An LSTM unit is composed of a cell memory and three multiplicative gates (i.e., input gate, forget gate, and output gate) which control the proportions of information to forget and to pass on to the next time step [22, 30]. The forget gate f_t and input gate i_t in the ON-LSTM are identical to those in the LSTM, where they are used to control the erasing and writing operations on the cell state c_t, as

f_t = σ(W_f x_t + U_f h_{t−1} + b_f)    (1)
i_t = σ(W_i x_t + U_i h_{t−1} + b_i).    (2)

As for the cell memory c_t, it may be difficult for a standard LSTM to discern the hierarchy information, since the gates in the LSTM act independently on each neuron. Therefore, the ON-LSTM enforces an order in which the cells of the neurons are updated. The updated information in the cells is allocated by the activation function cumax(). The contribution of this function is to produce the vectors of the master input gate ĩ_t and master forget gate f̃_t, ensuring that when a given neuron is updated, all of the neurons that follow it in the ordering are also updated.

ĉ_t = tanh(W_c x_t + U_c h_{t−1} + b_c)    (3)
f̃_t = cumax(W_f̃ x_t + U_f̃ h_{t−1} + b_f̃)    (4)
ĩ_t = cumax(W_ĩ x_t + U_ĩ h_{t−1} + b_ĩ)    (5)
ω_t = f̃_t ◦ ĩ_t    (6)
c_t = ω_t ◦ (f_t ◦ c_{t−1} + i_t ◦ ĉ_t) + (f̃_t − ω_t) ◦ c_{t−1} + (ĩ_t − ω_t) ◦ ĉ_t    (7)

According to the output gate o_t and the cell memory c_t, the output hidden state h_t is

o_t = σ(W_o x_t + U_o h_{t−1} + b_o)    (8)
h_t = o_t ◦ tanh(c_t).    (9)

In the above formulas, W and U are trainable parameter matrices and b is a bias term. σ is the logistic sigmoid function, and ◦ denotes the Hadamard product.
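A minimal NumPy sketch of one ON-LSTM step following Eqs. (1)-(9). The parameter handling (a dictionary of (W, U, b) triples per gate) is ours for brevity, and a caveat: Eq. (5) above writes the master input gate as cumax(·), whereas the original ON-LSTM formulation [44] uses 1 − cumax(·); the code follows the equation as written here.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cumax(x):
    """cumax(x) = cumulative sum of softmax(x): a monotonically non-decreasing vector
    in [0, 1] acting as a soft boundary between low- and high-ranking neurons."""
    e = np.exp(x - np.max(x))
    return np.cumsum(e / np.sum(e))

def on_lstm_step(x_t, h_prev, c_prev, params):
    """One ON-LSTM step. `params[name] = (W, U, b)` for name in
    {"f", "i", "c", "o", "mf", "mi"}; x_t has shape (d_in,), h_prev and c_prev (d_hid,)."""
    lin = lambda name: params[name][0] @ x_t + params[name][1] @ h_prev + params[name][2]
    f_t = sigmoid(lin("f"))                   # forget gate, Eq. (1)
    i_t = sigmoid(lin("i"))                   # input gate, Eq. (2)
    c_hat = np.tanh(lin("c"))                 # candidate cell, Eq. (3)
    f_master = cumax(lin("mf"))               # master forget gate, Eq. (4)
    i_master = cumax(lin("mi"))               # master input gate, Eq. (5); see caveat above
    omega = f_master * i_master               # overlap of the master gates, Eq. (6)
    c_t = (omega * (f_t * c_prev + i_t * c_hat)
           + (f_master - omega) * c_prev
           + (i_master - omega) * c_hat)      # cell update, Eq. (7)
    o_t = sigmoid(lin("o"))                   # output gate, Eq. (8)
    h_t = o_t * np.tanh(c_t)                  # hidden state, Eq. (9)
    return h_t, c_t
```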

4.2 Local-aware Attention
The RNN-based encoder generates an equal-level representation for each token of the entire sentence regardless of its context. However, a sentence may present an argument from various aspects, and various relations only focus on fractional aspects of the text.

Given the target arguments, not all of the words/phrases in their textual description are useful for modeling a specific fact. Some of them may be important for predicting one relation but useless for other relations. Empirically, relational words tend to appear in the neighborhoods of arguments. Therefore, it is necessary for a model to pay more attention to the relevant parts of a sentence, according to different target relations. We propose a Local-aware Attention network following the attention mechanism [2, 23, 54], as shown in Figure 2(B-1).

After being given a candidate argument pair a, the attention weight for each token x_t of the sentence is defined as α_t, which is

α_t = exp(V_t) / Σ_{i=1}^{T} exp(V_i)    (10)


V_t(a) = tanh(W_a h_t + U_a a).    (11)

In the above formula, W_a and U_a are parameter matrices. The vector a is the embedding of the candidate argument pair. It is the key information that tells the model what the expected destination is, so that the same sentence can get different attention states for different predicted relations. The vector h_t is obtained by the bidirectional ON-LSTM network. The ON-LSTM processes the sequence S in both directions and encodes each token x_t into a fixed-size vector representation H = [h_1, h_2, . . . , h_t], calculated as

h_t = ON-LSTM_fw(h→_{t−1}, c_t) ⊕ ON-LSTM_bw(h←_{t+1}, c_t),    (12)

where h→_{t−1} is the hidden representation of the forward ON-LSTM (ON-LSTM_fw) at position t − 1, h←_{t+1} is the hidden representation of the backward ON-LSTM (ON-LSTM_bw) at position t + 1, and ⊕ denotes the concatenation operation. We apply the attention weight α_t to h_t, resulting in an updated representation H^(a), as

H^(a) = {α_1 h_1, α_2 h_2, . . . , α_t h_t}.    (13)
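The attention computation in Eqs. (10)-(13) can be sketched as follows. It assumes the bidirectional ON-LSTM outputs H and the argument-pair embedding a are already available, and it reduces V_t(a) to a scalar score per token via vector-valued projections w_a and u_a; those names and that reduction are our assumptions.

```python
import numpy as np

def local_aware_attention(H, a, w_a, u_a):
    """H: (T, d) hidden states from the bidirectional ON-LSTM; a: (d_a,) embedding of
    the candidate argument pair; w_a: (d,) and u_a: (d_a,) projection vectors."""
    V = np.tanh(H @ w_a + a @ u_a)      # Eq. (11): V_t(a) = tanh(W_a h_t + U_a a), one score per token
    alpha = np.exp(V - V.max())
    alpha = alpha / alpha.sum()         # Eq. (10): alpha_t = exp(V_t) / sum_i exp(V_i)
    return alpha[:, None] * H           # Eq. (13): H^(a) = {alpha_1 h_1, ..., alpha_T h_T}
```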

4.3 Global-aware Convolution
Normal sequence labeling models cannot fully encode sentence-level information; they encode word-level context information by assigning each word a tag to indicate the boundary of adjacent segments [58, 66]. However, we believe that the Local-aware Attention mechanism can play its role well only when it is provided with a comprehensive sentence-level context. Therefore, in this section, we propose a Global-aware Convolution network. It and the Local-aware Attention network complement each other and form the Dual Aware Mechanism.

We use a stacked convolutional neural network to learn global information from the whole sentence, since it has powerful spatial perception ability [9, 45]. In the process of recognizing a piece of text, the human brain does not recognize the entire text at the same time; it first perceives each local feature in the text, and then performs a comprehensive operation at a higher level to obtain global information. To simulate this learning process of the human brain, we design the Global-aware Convolution network, as shown in Figure 2(B-2).

Fragment-level feature maps can be derived from a set of convolutions that operate on the embeddings of a sentence S. Convolution is an operation between a vector of weights and the embeddings, and the weight matrix is regarded as the filter for the convolution [57]. We apply a set of convolutional filters W_l and bias terms b_l to the sentence as per Equation 14, to learn a representation.

h_t^l = ReLU(W_l · [x_t, . . . , x_{t+l}] + b_l),  l = 2i + 1,  i = 1, 2, ...    (14)

where h_t^l represents the vector at time t that covers the phrase of length l starting with x_t, and i indicates the index of the layer.

By stacking convolution layers with increasing filter width, we can expand the effective input width to cover most of the length of a sequence. Each layer takes as input the output of the preceding convolution layer, and the kernel size of the current layer is 2 larger than that of the preceding layer.

After convolving, a global max-pooling operation is applied that stores only the highest activation of each filter. It is used for feature compression and extracts the main features. In this way, a final output vector with a fixed length can be obtained as the sentence-level global features.
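In Keras (the library used for the experiments, Section 5.1), the stacked-convolution branch could look roughly like the sketch below; the number of stacked layers, the "same" padding, and the use of 200 filters per layer follow our reading of the text and are assumptions rather than the released configuration.

```python
from keras import layers

def global_aware_convolution(x, n_filters=200, n_layers=3):
    """x: a (batch, T, d) tensor of token representations. Stacks 1D convolutions
    whose kernel size grows by 2 per layer (3, 5, 7, ...), then applies global
    max-pooling to obtain one fixed-length sentence-level feature vector."""
    h = x
    for i in range(1, n_layers + 1):
        kernel_size = 2 * i + 1                        # Eq. (14): l = 2i + 1
        h = layers.Conv1D(n_filters, kernel_size,
                          padding="same", activation="relu")(h)
    return layers.GlobalMaxPooling1D()(h)              # sentence-level global features F^(g)
```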


4.4 Hybrid Framework
Figure 2 illustrates the framework of our model. The main components of the framework are the Triplet Embedding Layer, the Dual Aware Encoder, and the Tag Decoder.

We adopt triplet embeddings to map each token to a vector, as shown in Figure 2(A). Given a sentence S as a sequence of tokens (i.e., the words of the sentence), each token x_t is represented by a set of embeddings (x_t^(w), x_t^(a), x_t^(p)), where x_t^(w) is the word embedding, x_t^(a) denotes the embedding of the candidate argument pair, and x_t^(p) is the part-of-speech (POS) embedding. We incorporate the POS information into our model since it plays an important role in relation extraction.

Next, the token representation vectors S = {x_1, x_2, . . . , x_t} are sent to the Dual Aware Encoder. On one side, these embeddings are fed into the Local-aware Attention module, as shown in Figure 2(B-1), outputting the features F^(l). Meanwhile, given the input S, we execute the deep convolution module (B-2) to output the global-aware features F^(g). Therefore, the encoder returns a representation F = [F^(l); F^(g)].
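To make the data flow of Figure 2 concrete, here is a hedged Keras sketch of how the components could be wired together. It uses stand-ins where the paper's custom layers would go: a plain bidirectional LSTM in place of the ON-LSTM, Keras's dot-product Attention in place of Local-aware Attention, and a time-distributed softmax in place of the CRF decoder, so it approximates the architecture rather than reproducing the released implementation.

```python
from keras import layers, models

def build_hnn4ort_like(seq_len, d_word=300, d_pos=59, d_arg=10, n_tags=13):
    """Rough wiring of Figure 2: triplet embeddings -> Dual Aware Encoder -> tagger.
    n_tags = 13 covers the BIOES tags of E1/R/E2 plus 'O'."""
    word = layers.Input((seq_len, d_word))
    pos = layers.Input((seq_len, d_pos))
    arg = layers.Input((seq_len, d_arg))
    x = layers.Concatenate()([word, pos, arg])                    # (A) Triplet Embedding Layer

    h = layers.Bidirectional(layers.LSTM(200, return_sequences=True))(x)
    f_local = layers.Attention()([h, h])                          # (B-1) stand-in for Local-aware Attention

    c = layers.Conv1D(200, 3, padding="same", activation="relu")(x)
    c = layers.Conv1D(200, 5, padding="same", activation="relu")(c)
    f_global = layers.GlobalMaxPooling1D()(c)                     # (B-2) Global-aware Convolution
    f_global = layers.RepeatVector(seq_len)(f_global)             # broadcast to every time step

    f = layers.Concatenate()([f_local, f_global])                 # F = [F(l); F(g)]
    out = layers.TimeDistributed(layers.Dense(n_tags, activation="softmax"))(f)  # (C) stand-in for CRF
    return models.Model([word, pos, arg], out)
```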

Subsequently, the upper output F is fed into the Tag Decoder to jointly yield the final predictions for each token, as shown in Figure 2(C). We take advantage of sentence-level tag information through a Conditional random fields (CRF) decoding layer.

Notably, there are strong dependencies across output labels, and it is beneficial to consider the correlations between labels in neighborhoods. The CRF can efficiently use past and future tags to predict the current tag. Therefore, it is common to model the label sequence jointly using a CRF layer [28, 30, 56].

In detail, for an output label sequence y = (y_1, y_2, . . . , y_t), we define its score as

s(S, y) = Σ_{i=0}^{T} A_{y_i, y_{i+1}} + Σ_{i=1}^{T} P_{i, y_i},    (15)

where A is a matrix of transition scores such that A_{i,j} represents the score of a transition from tag i to tag j; y_0 and y_n denote the start and end symbols of a sentence, respectively. We regard P as the matrix of scores output by the upper layer, where P_{i,j} corresponds to the score of the j-th tag of the i-th word in the sentence.

We predict the output sequence that obtains the maximum score, given by

y* = argmax_{y ∈ Y_S} s(S, y),    (16)

where Y_S represents all possible tag sequences, including those that do not obey the BIOES format constraints.
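The scoring and decoding in Eqs. (15)-(16) can be sketched as follows; P is the emission-score matrix from the upper layer, A a transition matrix that also covers the virtual start and end tags, and the arg max enumerates all sequences purely for illustration (a real decoder would use Viterbi dynamic programming).

```python
import itertools

def sequence_score(P, A, y, start, end):
    """Eq. (15): sum of transition scores A[y_i, y_{i+1}] (with virtual start/end tags)
    plus emission scores P[i, y_i]. P has shape (T, n_tags); A also indexes start/end."""
    path = [start] + list(y) + [end]
    trans = sum(A[path[i]][path[i + 1]] for i in range(len(path) - 1))
    emit = sum(P[i][tag] for i, tag in enumerate(y))
    return trans + emit

def brute_force_decode(P, A, start, end):
    """Eq. (16): y* = argmax_y s(S, y), by exhaustive enumeration (illustration only)."""
    T, n_tags = len(P), len(P[0])
    return max(itertools.product(range(n_tags), repeat=T),
               key=lambda y: sequence_score(P, A, y, start, end))
```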

5 EXPERIMENTS AND DISCUSSIONS
In this section, we present the experiments in detail. We evaluate the various models with Precision (P), Recall (R), and F-measure (F1).

5.1 Experimental Setting
Testing set. To satisfy the openness and effectiveness of the experiments, we gather four testing sets from previously published works. They should be close to natural text and independent of the training set. Table 1 presents the details of the four datasets.

Firstly, the Reverb dataset is obtained from [14]; it consists of 500 sentences with 1,765 manually labeled extractions. The sentences are obtained from Yahoo. Next, the Wikipedia dataset includes 200 random sentences extracted from Wikipedia, for which we collect 605 extractions manually


Table 1. The datasets used for testing in this work.

Id  Dataset            Source              Sent.  Triples
1   Reverb dataset     Yahoo               500    1,765
2   Wikipedia dataset  Wikipedia           200    605
3   NYT dataset        New York Times      200    578
4   OIE2016 dataset    QA-SRL annotations  3200   10,359

labeled by Del Corro [12]. Then, the NYT dataset contains 578 triples extracted from 200 random sentences in the New York Times collection; it is also created by Del Corro et al. In addition, Stanovsky and Dagan [48] present the OIE2016 dataset, which is an open benchmark for ORE. It contains 10,359 tuples over 3,200 sentences.
Hyperparameters. We implement the neural network using the Keras library 5. The training set and validation set contain 395,715 and 81,986 records, respectively. The batch size is fixed to 256. We use early stopping [19] based on performance on the validation set. The number of LSTM units is 200 and the number of feature maps for each convolutional filter is 200. Parameter optimization is performed with the Adam optimizer [26]. The initial learning rate is 0.001, and it is reduced by a factor of 0.1 if no improvement of the loss function is seen for some epochs. Besides, to mitigate over-fitting, we apply the dropout method [47] to regularize the models. We use three types of embeddings as inputs. The word embedding is pre-trained by word2vec [35] on the corpora; its dimension is 300. The part-of-speech (POS) embedding is obtained by using TreeTagger [42], which is widely adopted to annotate POS categories and contains 59 different tags; it is a 59-dimensional one-hot vector. Besides, we represent the tag information of arguments as an argument embedding, using 10-dimensional one-hot vectors.
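For readers who want to reproduce the setup, the stated hyperparameters translate roughly into the Keras training configuration below; the patience values, the epoch cap, and the monitored quantity are not specified in the text and are placeholders, and a categorical cross-entropy loss stands in here for the CRF loss actually used by the model.

```python
from keras.callbacks import EarlyStopping, ReduceLROnPlateau
from keras.optimizers import Adam

def train(model, x_train, y_train, x_val, y_val):
    """Training setup from Section 5.1: batch size 256, Adam with lr 0.001,
    lr reduced by a factor of 0.1 on plateau, and early stopping on the validation set."""
    model.compile(optimizer=Adam(learning_rate=0.001),
                  loss="categorical_crossentropy")
    callbacks = [
        ReduceLROnPlateau(monitor="val_loss", factor=0.1, patience=3),   # patience is assumed
        EarlyStopping(monitor="val_loss", patience=6, restore_best_weights=True),
    ]
    model.fit(x_train, y_train, batch_size=256, epochs=100,              # epoch cap is assumed
              validation_data=(x_val, y_val), callbacks=callbacks)
```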

5.2 Experimental Results
We report the results 6,7 of various models on the Reverb dataset, Wikipedia dataset, and NYT dataset, as shown in Table 2 and Figure 3. We can draw the following conclusions:

(1) We discuss various popular neural structures and carry out extensive experiments to adapt them to the open relation tagging task. It is valuable to understand the effectiveness of different neural architectures for the ORE task and to help readers reproduce the experiments. These methods include the unidirectional LSTM network (UniLSTM) [30]; the bidirectional LSTM network with a Softmax classifier (LSTM), or with a CRF classifier (LSTM-CRF) [8, 24, 28, 30]; the CNN network with a Softmax classifier (CNN), or with a CRF classifier (CNN-CRF) [9, 45]; the combination of the LSTM and CNN (LSTM-CNN-CRF) [16, 59]; and so on.

Our model HNN4ORT outperforms all the above methods in terms of F1, which shows the effectiveness of its hybrid neural network architecture. Remarkably, previous models (such as LSTM-CNN-CRF [16, 59]) have improved performance by integrating CNN and LSTM. They usually simply concatenate the feature vectors learned from the CNN and LSTM at each time step. In contrast, we propose the Dual Aware Mechanism, which achieves better performances.

5 https://github.com/keras-team/keras
6 To avoid the distortion of the final performance by argument recognition errors, we only evaluate the correctness of the relational phrase in a triple.
7 As for a relation extracted by Reverb, OLLIE, ClausIE, or Open IE-4.x, it is judged correct only when its confidence is greater than 0.5.


Table 2. The predicted results of different models on the Reverb dataset, Wikipedia dataset, and NYT dataset. Bold indicates the best value when our model HNN4ORT is compared to the other sequence tagging models. When comparing with the conventional ORE extractors, the best value is underlined.

Model           Wikipedia dataset     NYT dataset           Reverb dataset        Average
                P     R     F         P     R     F         P     R     F         P     R     F
CNN             0.886 0.527 0.661     0.931 0.493 0.645     0.915 0.504 0.650     0.912 0.506 0.651
UniLSTM         0.559 0.463 0.506     0.607 0.481 0.537     0.632 0.530 0.576     0.612 0.506 0.554
LSTM            0.766 0.683 0.722     0.807 0.637 0.712     0.784 0.716 0.748     0.784 0.693 0.736
ON-LSTM         0.808 0.714 0.758     0.812 0.658 0.726     0.816 0.728 0.769     0.812 0.708 0.756
CNN-CRF         0.878 0.574 0.694     0.854 0.578 0.689     0.892 0.568 0.694     0.882 0.571 0.693
LSTM-CRF        0.804 0.734 0.768     0.781 0.654 0.712     0.810 0.743 0.775     0.804 0.724 0.762
LSTM-CNN-CRF    0.821 0.729 0.772     0.837 0.702 0.764     0.831 0.760 0.794     0.830 0.743 0.784
ON-LSTM-CRF     0.858 0.719 0.782     0.825 0.663 0.735     0.849 0.731 0.786     0.846 0.715 0.775
+Local-Aware    0.830 0.749 0.787     0.818 0.690 0.749     0.830 0.760 0.794     0.828 0.744 0.784
+Global-Aware   0.869 0.744 0.801     0.838 0.656 0.736     0.867 0.748 0.803     0.862 0.729 0.790
HNN4ORT         0.841 0.759 0.798     0.825 0.678 0.745     0.859 0.778 0.817     0.849 0.754 0.799

Reverb          0.770 0.210 0.330     0.557 0.144 0.228     0.595 0.133 0.217     0.641 0.162 0.259
OLLIE           0.994 0.279 0.436     0.986 0.249 0.398     0.975 0.198 0.329     0.985 0.242 0.389
ClausIE         0.795 0.526 0.633     0.656 0.481 0.555     0.953 0.585 0.725     0.801 0.531 0.638
Open IE-4.x     0.766 0.340 0.471     0.801 0.341 0.478     0.810 0.312 0.451     0.792 0.331 0.467

In addition, we analyze the effects of the various networks. Bidirectional LSTM (LSTM) is obviously superior to unidirectional LSTM (UniLSTM) in F1 on average, since it can capture richer temporal semantic information. LSTM is better than CNN in recall and F1; however, CNN obtains better precision. In addition, from the comparisons between LSTM and LSTM-CRF, ON-LSTM and ON-LSTM-CRF, and CNN and CNN-CRF, we can draw the unanimous conclusion that a CRF layer improves model performance much more than a Softmax layer.

(2) We compare the neural sequence models with the conventional methods. When the three models used to construct the corpus (OLLIE, ClausIE, and Open IE-4.x) are regarded as baselines, our model HNN4ORT achieves better recall and F1. In particular, our model achieves a 16.1% improvement in F1 over the best baseline model, ClausIE.

In addition, many of the neural sequence learning methods outperform the conventional syntax-based methods. The conventional methods may be of high accuracy, but they cannot learn rich enough patterns, resulting in lower recall. In addition, patterns are rigid and inflexible, whereas the syntactic structures of sentences are ever-changing. The neural network-based models can learn deep sentence semantics and syntactic information to achieve better precision and recall.

(3) We verify the effectiveness of each module of our model HNN4ORT. First, we test the capability of the ON-LSTM. To our knowledge, this is the first time it has been used for sequence tagging. The experimental results show that the ON-LSTM is better than the LSTM by about 2% in F1 on average. This shows that the potential syntactic information captured by the ON-LSTM is very useful for open relation annotation.

Second, we use the ON-LSTM with the CRF classifier (ON-LSTM-CRF) as the basic model to test the performance of the Dual Aware Mechanism. After the Local-aware Attention network is integrated into the ON-LSTM-CRF network (+Local-Aware), the effect is increased by about 1%. After the


Fig. 3. The precision-recall curves of different ORE systems (ClausIE, HNN4ORT, OLLIE, En-Decoder, OpenIE4) on the OIE2016 dataset. The model En-Decoder comes from [10].

Global-aware Convolution network is integrated into the ON-LSTM-CRF network (+Global-Aware), the effect is increased by about 1.5%. This proves that both modules are valid.

Furthermore, when the dual modules are adopted at the same time, the model improves by more than 2% in average F1. This shows that the Dual Aware Mechanism is highly reasonable: global sentence-level information and local focusing information complement each other for handling open relation extraction.

(4) We use the OIE2016 dataset to evaluate the precision and recall of different systems 8. The precision-recall curves are shown in Figure 3, and the area under the precision-recall curve (AUC) for each system is shown in Figure 4. It is observed that our model HNN4ORT has better precision than the three conventional models over most of the recall range. The HNN4ORT is learned from the bootstrapped outputs of the extractions of the three conventional systems, yet its AUC score is better than theirs. This shows that the HNN4ORT has acquired generalization ability after training on the training set. Although the precision of the neural model En-Decoder [10] is better than that of the HNN4ORT, its recall remains in a lower range than that of the HNN4ORT. In addition, the HNN4ORT achieves the best AUC score of 0.487, which is significantly better than the other systems. In particular, the AUC score of the HNN4ORT is more than two times that of the En-Decoder model.

(5) To sum up, the performances of our model HNN4ORT on the four datasets are stable and superior. It indicates that the model has good robustness and scalability.

8 The model proposed in [49] isn't shown here, because it used the OIE2016 dataset as a training set. We have evaluated it in Table 2. Its model structure is the same as the BiLSTM network with a Softmax output layer (LSTM, i.e., the 4th line).


Fig. 4. The area under the precision-recall curve (AUC) for the systems shown in Figure 3: ClausIE 0.379, HNN4ORT 0.487, OLLIE 0.189, En-Decoder 0.215, OpenIE4 0.392.

6 PERFORMANCE ANALYSIS AND DISCUSSIONS
6.1 Evaluation of Training Corpus
According to whether the three extractors used to construct the training set can correctly identify each test instance, we classify the test instances into four parts: all three extractors identify it correctly (ATE), exactly two extractors identify it correctly (ETE), only one extractor identifies it correctly (OOE), and no extractor identifies it correctly (NOE).

As shown in Figure 5, the model HNN4ORT identifies the instances in the ATE part with an accuracy close to 1. This implies that the model can learn the features of the training set well. In addition, from the perspective of a single extractor, the model HNN4ORT can acquire the extraction ability of that single extractor after training on our training set. The correct results extracted by this single extractor, in addition to appearing in the ATE, will also appear in the ETE and OOE. These instances outside the ATE share the same regularity (data distribution) as those in the ATE. Therefore, the HNN4ORT has a certain recognition capability in three parts (ATE, ETE, and OOE). The accuracy is highest on the ATE, followed by the ETE, and lower on the OOE (but still above 0.7).

In particular, the model can produce certain results in the NOE part, where such instances may appear rarely or not at all in the training set. This shows that the model has strong generalization ability and can obtain a recognition capability beyond any single extractor of the three extraction tools. In addition, as shown in Table 2, the model achieves remarkable performances on various testing sets, although these testing sets and the training set come from different sources. Based on the above phenomena, it can be concluded that the quality of the training corpus should be acceptable.

Other than this, to maintain the high quality of the training set, we only select triples from the ATE. Although the ATE occupies a small proportion of the output of the extractors, the instances in the training set are considerable and are extracted from a large-scale text. This ensures that the training set has good diversification and passable quality.


Fig. 5. The performance of the model HNN4ORT on each part of the testing sets (ATE, ETE, OOE, NOE). The black bar indicates the number of correct identifications of the model, and the gray bar represents the total number of instances in that part.

Table 3. The results of evaluating the influence of the embeddings used on the model HNN4ORT. The results are averaged over the Reverb dataset, Wikipedia dataset, and NYT dataset.

Model                          P      R      F
(word embeddings)              0.783  0.708  0.744
(word, POS embeddings)         0.822  0.751  0.785
(word, POS embeddings)train    0.849  0.754  0.799

6.2 Analysis of Embeddings
We evaluate the effect of the various embeddings, as shown in Table 3. When the model HNN4ORT only uses word embeddings, the results are worse than those of the model with word embeddings and POS embeddings. We confirm that the POS features play a great role in promoting model performance and improve the F1 by 4.1%. In addition, when the input embeddings are adjusted along with model training, the effect is better and the F1 is increased by 1.4%.

6.3 Analysis of Overlapping Triples
As shown in Table 4, the statistical results show that overlapping triples account for a large proportion in the various datasets, all above 30%. By using our overlap-aware tagging scheme, models can identify multiple, overlapping triples, so that they have a better recall capacity. We run the model HNN4ORT to identify overlapping triples, with good results on all three datasets.

6.4 Error Analysis
To find out the factors that affect the performance of ORE tagging models, we analyze the tagging errors of the model HNN4ORT, as shown in Figure 6. There are four major types of errors. 30.3% of the relations are missed by the model, and 22.1% of the extractions are abandoned because their corresponding tag sequences violate the tagging scheme. These two types of errors mainly limit the increase in recall. In addition, the model may wrongly determine the start or end position of a


Table 4. The performances of the model HNN4ORT on the sub-dataset that contains only the overlapping triples of each dataset. The second column shows the proportion of overlapping triples in the total.

Datasets   Proportion  P      R      F
Wikipedia  38.7%       0.849  0.650  0.736
NYT        31.1%       0.779  0.606  0.681
Reverb     44.3%       0.794  0.649  0.714

Fig. 6. Error analysis of the model HNN4ORT. There are four major types of errors: no identification error (30.3%), violating tagging scheme error (22.1%), starting position identification error (21.3%), and end position identification error (26.2%).

relational phrase. As a result, the relation will be recognized as false. Such phenomena affect model precision.

7 CONCLUSIONS
Open relation extraction is an important NLP task, and the effect of conventional methods is not satisfactory. Therefore, we are committed to using advanced deep learning models to solve this task. In this paper, we construct a training set automatically and design a neural sequence tagging model to extract open relations. Taking the ON-LSTM and the Dual Aware Mechanism as the essence, we propose a hybrid neural sequence tagging model (HNN4ORT). The experimental results show the effectiveness of our method. Compared with conventional extractors or other neural models, our approach achieves state-of-the-art performances on multiple testing sets. We also analyze the quality of the automatically constructed training set and conclude that it should be highly acceptable. In future work, we will consider studying a more efficient annotation scheme and using it to deal with n-ary relation tuples.

REFERENCES
[1] Gabor Angeli, Melvin Jose Johnson Premkumar, and Christopher D Manning. 2015. Leveraging Linguistic Structure For Open Domain Information Extraction. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). DOI: http://dx.doi.org/10.3115/v1/P15-1034

[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning toalign and translate. arXiv preprint arXiv:1409.0473 (2014).

[3] Michele Banko, Mj Cafarella, and Stephen Soderland. 2007. Open information extraction for the web. In the 16thInternational Joint Conference on Arti�cial Intelligence (IJCAI 2007). 2670–2676. DOI:h�p://dx.doi.org/10.1145/1409360.1409378

[4] Leonard E. Baum and Ted Petrie. 1966. Statistical Inference for Probabilistic Functions of Finite State Markov Chains.�e Annals of Mathematical Statistics 37, 6 (1966), 1554–1563. DOI:h�p://dx.doi.org/10.1214/aoms/1177699147

[5] Nikita Bhutani, H V Jagadish, and Dragomir Radev. 2016. Nested propositions in open information extraction. InProceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. 55–64.

[6] Nikita Bhutani, Yoshihiko Suhara, Wang-Chiew Tan, Alon Halevy, and HV Jagadish. 2019. Open Information Extractionfrom �estion-Answer Pairs. In Proceedings of the 2019 Conference of the North American Chapter of the Association forComputational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2294–2305.

[7] Andrew Carlson, Justin Be�eridge, Bryan Kisiel, Burr Se�les, Estevam R Hruschka, and Tom M Mitchell. 2010. Towardan architecture for never-ending language learning. In Twenty-Fourth AAAI Conference on Arti�cial Intelligence.

[8] Jason P. C. Chiu and Eric Nichols. 2015. Named Entity Recognition with Bidirectional LSTM-CNNs. Transactions ofthe Association for Computational Linguistics 4 (2015), 357–370. h�p://arxiv.org/abs/1511.08308

[9] Ronan Collobert, Jason Weston, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural LanguageProcessing (Almost) from Scratch. Journal of Machine Learning Research 12, 1 (2011), 2493–2537.

[10] Lei Cui, Furu Wei, and Ming Zhou. 2018. Neural Open Information Extraction. arXiv preprint arXiv:1805.04270 (2018).h�p://arxiv.org/abs/1805.04270

[11] Aron Culo�a, Andrew Mccallum, and Jonathan Betz. 2006. Integrating probabilistic extraction models and data miningto discover relations and pa�erns in text. In Main Conference on Human Language Technology Conference of the NorthAmerican Chapter of the Association of Computational Linguistics. 296–303.

[12] Luciano Del Corro and Rainer Gemulla. 2013. Clausie: clause-based open information extraction. In Proceedings of the22nd international conference on World Wide Web. 355–366.

[13] George R Doddington, Alexis Mitchell, Mark A Przybocki, Lance A Ramshaw, Stephanie Strassel, and Ralph MWeischedel. 2004. �e Automatic Content Extraction (ACE) Program-Tasks, Data, and Evaluation.. In �e fourthinternational conference on Language Resources and Evaluation, Vol. 2. 1.

[14] Anthony Fader, Stephen Soderland, and Oren Etzioni. 2011. Identifying relations for open information extraction. InProceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (EMNLP 2011). 1535–1545.DOI:h�p://dx.doi.org/10.1234/12345678

[15] Anthony Fader, Luke Ze�lemoyer, and Oren Etzioni. 2014. Open question answering over curated and extractedknowledge bases. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and datamining - KDD ’14. 1156–1165. DOI:h�p://dx.doi.org/10.1145/2623330.2623677

[16] Xiaocheng Feng, Bing Qin, and Ting Liu. 2018. A language-independent neural network for event detection. ScienceChina Information Sciences 61, 9 (2018), 092106.

[17] Kiril Gashteovski, Rainer Gemulla, and Luciano Del Corro. 2017. MinIE: Minimizing Facts in Open InformationExtraction. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. 2620–2630.h�ps://www.aclweb.org/anthology/D17-1277

[18] Rafael Glauber and Daniela Barreiro Claro. 2018. A systematic mapping study on open information extraction. ExpertSystems with Applications 112 (2018), 372–387.

[19] Alex Graves, Abdel-rahman Mohamed, and Geo�rey Hinton. 2013. Speech Recognition with Deep Recurrent NeuralNetworks. In Acoustics, speech and signal processing (icassp), 2013 ieee international conference on. IEEE, 6645–6649.DOI:h�p://dx.doi.org/10.1109/ICASSP.2013.6638947

[20] Guoliang Guan and Min Zhu. 2019. New Research on Transfer Learning Model of Named Entity Recognition. In Journal of Physics: Conference Series, Vol. 1267. IOP Publishing, 012017.

[21] Luheng He, Mike Lewis, and Luke Zettlemoyer. 2015. Question-Answer Driven Semantic Role Labeling: Using Natural Language to Annotate Natural Language. In EMNLP.

[22] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.

[23] Xiao Huang, Jingyuan Zhang, Dingcheng Li, and Ping Li. 2019. Knowledge graph embedding based question answering. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining. ACM, 105–113.

[24] Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF Models for Sequence Tagging. arXiv preprint arXiv:1508.01991 (2015).

[25] Shengbin Jia, Shijia E, Maozhen Li, and Yang Xiang. 2018. Chinese Open Relation Extraction and Knowledge Base Establishment. ACM Transactions on Asian and Low-Resource Language Information Processing 17 (2018), 15:1–22.

[26] Diederik Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980 (2014), 1–13.

[27] John Lafferty, Andrew McCallum, and Fernando C N Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML '01). 282–289.

[28] Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural Architectures for Named Entity Recognition. arXiv preprint arXiv:1603.01360 (2016). DOI:http://dx.doi.org/10.18653/v1/N16-1030

[29] Hongyu Lin, Le Sun, and Xianpei Han. 2017. Reasoning with Heterogeneous Knowledge for Commonsense Machine Comprehension. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. 2022–2033. http://aclanthology.info/papers/D17-1215/d17-1215

[30] Xuezhe Ma and Eduard Hovy. 2016. End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF. arXiv preprint arXiv:1603.01354 (2016). DOI:http://dx.doi.org/10.18653/v1/P16-1101

[31] Mausam. 2016. Open Information Extraction Systems and Downstream Applications. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence. AAAI Press, 4074–4077.

[32] Mausam, Michael Schmitz, Robert Bart, Stephen Soderland, and Oren Etzioni. 2012. Open Language Learning for Information Extraction. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. 523–534.

[33] Andrew McCallum, Dayne Freitag, and Fernando Pereira. 2002. Maximum Entropy Markov Models for Information Extraction and Segmentation. In Machine Learning, Vol. 17. 1–26.

[34] Julian Michael, Gabriel Stanovsky, Luheng He, Ido Dagan, and Luke S. Zettlemoyer. 2018. Crowdsourcing Question-Answer Meaning Representations. In NAACL-HLT.

[35] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems. 3111–3119.

[36] Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP (ACL-IJCNLP '09), Vol. 2. Association for Computational Linguistics, 1003. DOI:http://dx.doi.org/10.3115/1690219.1690287

[37] Ndapandula Nakashole, Gerhard Weikum, and Fabian M. Suchanek. 2012. PATTY: A Taxonomy of Relational Patterns with Semantic Types. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL 2012). 1135–1145. http://dl.acm.org/citation.cfm?id=2391076

[38] Christina Niklaus, Matthias Cetto, Andre Freitas, and Siegfried Handschuh. 2018. A Survey on Open Information Extraction. In Proceedings of the 27th International Conference on Computational Linguistics. 3866–3878.

[39] Alessandro Raganato, Claudio Delli Bovi, and Roberto Navigli. 2017. Neural Sequence Learning Models for Word Sense Disambiguation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. 1167–1178. http://www.aclweb.org/anthology/D/D17/D17-1121.pdf

[40] Lev Ratinov and Dan Roth. 2009. Design Challenges and Misconceptions in Named Entity Recognition. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning. 147–155.

[41] Dominiek Sandra and Marcus Taft. 2014. Morphological Structure, Lexical Representation and Lexical Access (RLE Linguistics C: Applied Linguistics): A Special Issue of Language and Cognitive Processes. (2014).

[42] Helmut Schmid. 1994. Probabilistic Part-of-Speech Tagging Using Decision Trees. In Proceedings of the International Conference on New Methods in Language Processing (1994).

[43] Haseeb Shah, Johannes Villmow, Adrian Ulges, Ulrich Schwanecke, and Faisal Shafait. 2019. An Open-World Extension to Knowledge Graph Completion Models. arXiv preprint arXiv:1906.08382 (2019).

[44] Yikang Shen, Shawn Tan, Alessandro Sordoni, and Aaron Courville. 2018. Ordered neurons: Integrating tree structures into recurrent neural networks. arXiv preprint arXiv:1810.09536 (2018).

[45] Yanyao Shen, Hyokun Yun, Zachary C Lipton, Yakov Kronrod, and Animashree Anandkumar. 2017. Deep active learning for named entity recognition. arXiv preprint arXiv:1707.05928 (2017).

[46] Yong Shi, Yang Xiao, and Lingfeng Niu. 2019. A Brief Survey of Relation Extraction Based on Distant Supervision. In International Conference on Computational Science. Springer, 293–303.

[47] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research 15, 1 (2014), 1929–1958.

[48] Gabriel Stanovsky and Ido Dagan. 2016. Creating a Large Benchmark for Open Information Extraction. In Conference on Empirical Methods in Natural Language Processing. 2300–2305.

[49] Gabriel Stanovsky, Luke Zettlemoyer, and Ido Dagan. 2018. Supervised Open Information Extraction. In Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). 885–895.

[50] Haitian Sun, Tania Bedrax-Weiss, and William W Cohen. 2019. PullNet: Open Domain Question Answering with Iterative Retrieval on Knowledge Bases and Text. arXiv preprint arXiv:1904.09537 (2019).

[51] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to Sequence Learning with Neural Networks. In Advances in Neural Information Processing Systems 27. 3104–3112. http://arxiv.org/pdf/1409.3215.pdf

[52] Linlin Wang, Zhu Cao, Gerard De Melo, and Zhiyuan Liu. 2016. Relation Classification via Multi-Level Attention CNNs. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016).

[53] Fei Wu and Daniel S Weld. 2010. Open Information Extraction using Wikipedia. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL 2010). 118–127. http://www.aclweb.org/anthology/P10-1013

[54] Jiacheng Xu, Xipeng Qiu, Kan Chen, and Xuanjing Huang. 2017. Knowledge graph representation with jointly structural and textual encoding. In Proceedings of the 26th International Joint Conference on Artificial Intelligence. AAAI Press, 1318–1324.

[55] Yan Xu, Lili Mou, Ge Li, and Yunchuan Chen. 2015. Classifying Relations via Long Short Term Memory Networks along Shortest Dependency Paths. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP 2015). 1785–1794. DOI:http://dx.doi.org/10.18653/v1/D15-1206

[56] Jie Yang, Shuailong Liang, and Yue Zhang. 2018. Design Challenges and Misconceptions in Neural Sequence Labeling. In Proceedings of the 27th International Conference on Computational Linguistics. 3879–3889.

[57] Daojian Zeng, Kang Liu, Siwei Lai, Guangyou Zhou, and Jun Zhao. 2014. Relation Classification via Convolutional Deep Neural Network. In the 25th International Conference on Computational Linguistics (COLING 2014). 2335–2344. http://www.aclweb.org/anthology/C14-1220

[58] Feifei Zhai, Saloni Potdar, Bing Xiang, and Bowen Zhou. 2017a. Neural Models for Sequence Chunking. (2017).

[59] Feifei Zhai, Saloni Potdar, Bing Xiang, and Bowen Zhou. 2017b. Neural Models for Sequence Chunking. In The 31st AAAI Conference on Artificial Intelligence. 3365–3371. http://arxiv.org/abs/1701.04027

[60] Dongxu Zhang, Subhabrata Mukherjee, Colin Lockard, Xin Luna Dong, and Andrew McCallum. 2019. OpenKI: Integrating Open Information Extraction and Knowledge Bases with Relation Inference. arXiv preprint arXiv:1904.12606 (2019).

[61] Linhao Zhang and Houfeng Wang. 2019. Using Bidirectional Transformer-CRF for Spoken Language Understanding. In CCF International Conference on Natural Language Processing and Chinese Computing. Springer, 130–141.

[62] Sheng Zhang, Kevin Duh, and Benjamin Van Durme. 2017. MT/IE: Cross-lingual Open Information Extraction with Neural Sequence-to-Sequence Models. In Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers. 64–70.

[63] Suncong Zheng, Feng Wang, Hongyun Bao, Yuexing Hao, Peng Zhou, and Bo Xu. 2017. Joint Extraction of Entities and Relations Based on a Novel Tagging Scheme. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1. 1227–1236. DOI:http://dx.doi.org/10.18653/v1/P17-1113

[64] Hao Zhou, Zhenting Yu, Yue Zhang, Shujian Huang, Xinyu Dai, and Jiajun Chen. 2017. Word-Context Character Embeddings for Chinese Word Segmentation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. 771–777. http://aclweb.org/anthology/D/D17/D17-1080.pdf

[65] Hao Zhu, Yankai Lin, Zhiyuan Liu, Jie Fu, Tat-seng Chua, and Maosong Sun. 2019. Graph Neural Networks with Generated Parameters for Relation Extraction. arXiv preprint arXiv:1902.00756 (2019).

[66] Jingwei Zhuo, Yong Cao, Jun Zhu, Bo Zhang, and Zaiqing Nie. 2016. Segment-Level Sequence Modeling using Gated Recursive Semi-Markov Conditional Random Fields. In Meeting of the Association for Computational Linguistics.
