An intensive case study on kernel-based relation extraction


Sung-Pil Choi & Seungwoo Lee & Hanmin Jung & Sa-kwang Song

© Springer Science+Business Media New York 2013

Abstract Relation extraction refers to a method of efficiently detecting and identifying predefined semantic relationships within a set of entities in text documents. Numerous relation extraction techniques have been developed thus far, owing to their innate importance in the domain of information extraction and text mining. The majority of the relation extraction methods proposed to date are based on supervised learning, which requires learning collections; such learning methods can be classified into feature-based, semi-supervised, and kernel-based techniques. Among these methods, a case analysis of kernel-based relation extraction, considered the most successful of the three approaches, is carried out in this paper. Although some previous survey papers on this topic have been published, they failed to select the most essential of the currently available kernel-based relation extraction approaches or to provide an in-depth comparative analysis of them. Unlike existing case studies, the study described in this paper is based on a close analysis of the operating principles and individual characteristics of five vital, representative kernel-based relation extraction methods. In addition, we present deep comparative analysis results for these methods. Finally, to support further research on kernel-based relation extraction with an even higher performance, as well as general high-level kernel studies for linguistic processing and text mining, some additional approaches, including feature-based methods, are introduced based on various criteria.

Keywords Relation extraction · Kernel methods · Text mining · Information extraction · Machine learning

Multimed Tools Appl, DOI 10.1007/s11042-013-1380-5

S.-P. Choi : S. Lee : H. Jung : S.-k. Song (*)
Information S/W Research Center, Korea Institute of Science and Technology Information (KISTI), 335 Gwahangno, Yuseong-gu, Daejeon, South Korea
e-mail: [email protected]

S.-P. Choi
e-mail: [email protected]

S. Lee
e-mail: [email protected]

H. Jung
e-mail: [email protected]

1 Introduction

Relation extraction refers to a method of efficient detection and identification of predefined semantic relationships within a set of entities in text documents [39, 42]. The importance of this method was first recognized at the Message Understanding Conference [30], which was held from 1987 to 1997 under the supervision of DARPA.1 From 1999 to 2008, the Automatic Content Extraction [1] Workshop, promoted as a new project by the NIST,2 facilitated numerous research efforts. The workshop is currently held each year, and is considered the best world forum for the comparison and evaluation of new technologies in the field of information extraction, such as named entity recognition, relation extraction, event extraction, and temporal information extraction. This workshop is conducted as a sub-field of the Text Analytics Conference [36], which is currently under the supervision of the NIST.

According to ACE, an entity within a text is a representation naming a real object. Exemplary entities include the names of persons, locations, facilities, and organizations. A sentence including such entities can express their semantic relationships. For example, in the sentence, “President Clinton was in Washington today,” there is a “Located” relation between “Clinton” and “Washington.” In the sentence, “Steve Ballmer, CEO of Microsoft, said…” the relation “Role (CEO, Microsoft)” can be extracted.

Several relation extraction techniques have been developed within the framework of the various technological workshops mentioned above. Most relation extraction methods developed thus far have been based on supervised learning requiring learning collections. These methods are classified into feature-based methods, semi-supervised learning and bootstrapping methods, and kernel-based methods [4, 9]. Feature-based methods rely on classification models for automatically specifying the category to which a relevant feature vector belongs. In these approaches, surrounding contextual features are mainly used to identify the semantic relations between the two entities in a specific sentence and to represent them as a feature vector. The major drawback of supervised learning-based methods, however, is that they require learning collections. Semi-supervised learning and bootstrapping methods, on the other hand, use large corpora or Web documents based on reduced learning collections that are progressively expanded to overcome the above disadvantage. Kernel-based methods [11], in turn, devise kernel functions that are most appropriate for relation extraction and apply them for learning in the form of a kernel set optimized for syntactic analysis and part-of-speech tagging. The kernel function itself is used for measuring the similarity between two instances, which are the main objects of machine learning. General kernel-based models are discussed in detail in Section 3.

As a representative feature-based method, the research in [23] combines various types of lexical, syntactic, and semantic features required for relation extraction using a maximum entropy model. Although based on the same type of composite features proposed by [23], the research in [44] makes use of support vector machines for relation extraction, allowing for a flexible kernel combination. The authors in [43] classified all features that were available at the time to create individual linear kernels, and attempted relation extraction using composite kernels made up of those individual linear kernels. Most feature-based methods aim at applying feature engineering algorithms for selecting the optimal features for relation extraction, and their application of syntactic structures has been very limited.

1 Defense Advanced Research Projects Agency of the U.S.
2 National Institute of Standards and Technology of the U.S.


Exemplary semi-supervised learning and bootstrapping methods are Snowball [2] and DIPRE [5]. These methods rely on a few seed learning collections and use bootstrapping, similar to the Yarowsky algorithm [37], to gather the various syntactic patterns that denote the relations between two entities in a large Web-based text corpus. Recent developments include the KnowItAll [15] and TextRunner [38] methods for automatically collecting lexical patterns of target relations and entity pairs based on ample Web resources. Although this approach does not require large learning collections, its disadvantages are that many incorrect patterns are detected as the pattern collections expand, and that only one relation can be handled at a time.

Kernel-based relation extraction methods were first attempted by Zelenko et al. [39], who devised both contiguous and sparse subtree kernels for recursively measuring the similarity of two parse trees for application to binary relation extraction, which demonstrated a relatively high performance. A variety of kernel functions for relation extraction have since been suggested, e.g., dependency parse tree kernels [14], convolution parse tree kernels [40], and composite kernels [9, 41], which show even better performances.

In this paper, a case analysis is carried out for kernel-based relation extraction methods, which are considered the most successful approach thus far. While some previous survey papers based on the importance and effect of this methodology have been published [4, 28], they fail to fully analyze the particular functional principles or characteristics of currently available kernel-based relation extraction models, and simply cite the contents of individual articles or describe limited analyses. Although the performance of most kernel-based relation extraction methods has been demonstrated based on ACE evaluation collections, a comparison and analysis of their overall performance has not yet been made.

Unlike existing case studies, the study described herein is based on a close analysis of the operating principles and individual characteristics of five kernel-based relation extraction methods, starting from the research in [39], which is the source of kernel-based relation extraction studies, and ending with the composite kernel, which is considered the most advanced kernel-based relation extraction method currently available [9, 41]. For a comparison of the overall performance of each method, the focus of this study is placed on the ACE collection. We hope this study will contribute to further research on kernel-based relation extraction with an even higher performance, and to high-level general kernel studies for linguistic processing and text mining.

Section 2 outlines supervised learning-based relation extraction methods, and in Section 3 we discuss kernel-based machine learning. Section 4 closely analyzes five exemplary kernel-based relation extraction methods. Section 5 then compares the performances of these methods to analyze their advantages and disadvantages. Finally, Section 6 provides some concluding remarks regarding this research.

2 Supervised learning-based relation extraction

As discussed above, relation extraction methods are classified into three categories. The difference between feature-based and kernel-based methods is shown in Fig. 1. These two methods differ from semi-supervised learning methods with respect to their machine learning procedures.

On the left side of Fig. 1, the individual sentences that make up a learning collection have at least two entities (black squares), whose relation is manually extracted and predefined. Since most relation extraction methods studied thus far work with binary relations, learning examples are modified for a convenient relation extraction from a pair of entities by


preprocessing the original learning collection. Such modified learning examples are referred to as relation instances. A relation instance is defined as an element of the learning collection modified so that it can be efficiently applied to the relevant relation extraction method, based on a specific sentence containing at least two entities.

The aforementioned modification is closely related to the feature information used in relation extraction. Since most supervised learning-based methods use both the entity itself and the contextual information about the entity, it is important to collect contextual information efficiently for an improved performance. Linguistic processing (part-of-speech tagging, base phrase recognition, syntactic analysis, etc.) of the individual learning sentences in the pre-processing step serves as a base for effective feature selection and extraction. For example, when a sentence goes through the syntactic analysis shown in Fig. 1, one relation instance is composed of a parse tree and the locations of the entities indicated in the tree [18, 42, 45]. A single sentence can be represented as a feature vector or syntactic graph [21, 25, 40]. The type of such relation instances depends on the relation extraction method, and can also involve various preprocessing tasks [31, 40].

In general, feature-based relation extraction methods follow the procedures shown in the upper part of Fig. 1. That is, feature collections that “can express individual learning examples the best” are extracted (feature extraction). The learning examples are feature-vectorized, and inductive learning is carried out using selected machine learning models. On the contrary, the kernel-based relation extraction shown in the lower part of Fig. 1 devises a kernel function that “can calculate the similarity of any two learning examples the most effectively” to replace the feature extraction process. Here, the measurement of similarity between the learning examples is not a general similarity measurement. That is, the function that properly computes the similarity between two sentences or instances expressing the same relation is the most effective kernel function in terms of relation extraction. For example, the two sentences, “Washington is in the U.S.” and “Seoul is located in Korea,” use different object entities but feature the same relation (“located”). Therefore, an efficient kernel function would detect a high similarity between these two sentences. On the other hand, since the sentences, “Washington is the capital of the United States” and “Washington is located in the United States,” express the same object entities but different relations, the similarity between them should be determined as very low. As such, in kernel-based relation extraction methods, the selection and creation of kernel functions are the most fundamental aspects that affect the overall performance.

Fig. 1 Learning process for supervised learning-based relation extraction


As shown in Fig. 1, kernel functions (linear, polynomial, and sigmoid) can be used in feature-based methods as well. However, they are functions applied only to vector-expressed instances. On the other hand, kernel-based methods are not limited in terms of the type of instance, and can therefore employ various kernel functions.

3 Overview of kernel-based machine learning methods

Most machine learning methods are carried out on a feature basis. That is, each instance to which an answer (label) is attached is modified into a feature sequence or N-dimensional vector, f1, f2, …, fN, for use in the learning process. For example, important features for identifying the relation between two entities in a single sentence are the entity types; the contextual information between, before, and after the entities; the part-of-speech information for contextual words; and the dependency relation path information for the two entities [9, 23, 25, 40]. These data are selected as single features represented through vectors for an automatic classification of the relation between the entities.

In Section 2, we showed that the essence of feature-based methods is to create a feature vector that can best express the individual learning examples. In many cases, however, it is not possible to express a feature vector reasonably. For example, a huge feature space is required for expressing the syntactic information3 of a specific sentence as a feature vector, and in certain instances it is almost impossible to express it as a feature vector in a limited space [13]. Kernel-based methods used for learning calculate the kernel functions between two examples while maintaining the original learning examples, without an additional feature expression [13]. The kernel function is defined as the mapping $K : X \times X \to [0, \infty)$ from the input space $X$ to a similarity score, $K(x, y) = \phi(x) \cdot \phi(y) = \sum_i \phi_i(x)\phi_i(y)$. Here, $\phi(x)$ is the mapping function from the learning examples in input space $X$ to the multidimensional feature space. The kernel function is symmetric, and exhibits positive semi-definite features. Using the kernel function, it is not necessary to calculate all features one by one, and machine learning can thus be carried out based only on the similarity between two learning examples. Exemplary models in which learning is carried out on the basis of the similarity matrices between the learning examples include Perceptron [35], Voted Perceptron [17], and Support Vector Machines [12, 28]. Kernel-based machine learning methods are drawing an increasingly greater amount of attention, and are widely used for pattern recognition, data and text mining, and Web mining. The performance of kernel methods, however, depends to a great extent on the kernel function selection or configuration [24].

Kernel-based learning methods are also used for natural language processing. Linear, polynomial, and Gaussian kernels are typically used in simple feature vector-based machine learning. A convolution kernel [11] is used for the efficient learning of structural data such as trees or graphs; it is the general type of kernel function underlying sequence kernels [26], tree kernels [14, 34, 39, 40, 42], and graph kernels [19]. A convolution kernel can measure an overall similarity by defining “sub-kernels” that measure the similarity between the components of an individual entity and by calculating the similarity convolution among the components. For example, to measure the similarity of two sequences, a sequence kernel divides the two sequences into subsequences, whose overall similarity is calculated using a similarity measurement. Likewise, a tree kernel divides a tree into its sub-trees, whose similarities are then calculated along with the convolution of these similarities.

3 Dependency grammar relation, parse tree, etc. between words in a sentence.


As described above, another advantage of kernel methods is that learning is possible with a single kernel function over input instance collections of different types. For example, the authors in [9, 41] demonstrated a composite kernel in which a convolution parse tree kernel is combined with an entity kernel for high-performance relation extraction.

4 Kernel-based relation extraction

In this section, five important research results regarding kernel-based relation extraction are discussed. Of course, several other important studies have also drawn a significant amount of attention owing to their high performance. Most of those approaches, however, simply modify or supplement the five methods discussed below.

4.1 Tree kernel-based method [39]

This study is known as the first application of a kernel method to relation extraction. Parse trees derived from shallow parsing are used for measuring the similarity between sentences containing certain entities. In [3], REES (Relation and Event Extraction System) is used to analyze the parts-of-speech and types of the individual words within a sentence, as well as the syntactic structure of the sentence in question. Figure 2 below shows an example result of using REES.

As shown in Fig. 2, when the syntactic analysis of the input sentence is completed, particular information regarding the words of the sentence is analyzed and extracted. Four types of attribute information are attached to all words or entities other than articles and stop words. The Type represents the part-of-speech or entity type of the current word. While “John Smith” is

Fig. 2 Exemplary shallow parsing result for relation extraction


the “Person” type, “scientist” is specified as a “PNP,” representing a personal noun. The Head represents the presence of key words of composite nouns or preposition phrases. The Role represents the relation between the two entities. In Fig. 2, “John Smith” is a “member” of “Hardcom C.,” and in turn, “Hardcom C.” is an “affiliation” of “John Smith.”

As the figure shows, it is possible to identify and extract the relation between the two entities in a specific sentence at a given level if REES is used. However, since the REES system is rule-based, it has certain limitations in terms of scalability and generality [39]. Zelenko et al. [39] constructed tree kernels based on the REES analysis results with a view to better relation extraction, overcoming these limitations through machine learning. They proposed two types of tree kernel functions: a sparse subtree kernel (SSK) and a continuous subtree kernel (CSK). An SSK includes node subsequences in the comparison even when they are not contiguously connected. For example, “Person, PNP” at the left-middle and “Person, PNP” at the right-middle of Fig. 3 are node sequences separated in the parse tree. An SSK includes such subsequences in the comparison. On the contrary, a CSK does not allow such subsequences and excludes them from the comparison.

To measure the performance of the two proposed tree kernels, the authors in [39] used 60 % of the manually constituted data as a learning collection, and carried out a ten-fold cross validation. Only two relations, “Person-Affiliation” and “Organization-Location,” were tested. The test revealed that the kernel-based method offers a better performance than the feature-based method, and that the continuous subtree kernel excels over the sparse subtree kernel. In particular, the tree kernel proposed in [39] has inspired new research on tree kernels.

The study in [39] is generally recognized as an important contribution toward devising kernels for efficient similarity measurements between very complex tree-type structural entities to be used later for relation extraction. Since a great deal of information, including syntactic information, is still required, the method depends highly on the performance of the REES system for creating parse trees. Because the quantity of data was not sufficient for their test, and a binary classification test was carried out on only two types of relations, the authors did not analyze the scalability or generality of the proposed kernel in detail.

4.2 Dependency tree kernel-based method [14]

Relation extraction has been carefully studied since the ACE collection was first constituted and distributed, beginning in 2004. Culotta and Sorensen [14] proposed a kernel-based relation extraction method that uses a dependency parse tree structure based on the tree kernel proposed by Zelenko et al. [39], as described in Section 4.1. This special parse tree, called an Augmented Dependency Tree, uses MXPOST [33] to carry out parsing, and then modifies some

Fig. 3 Execution of a kernel computation


syntactic rules based on the results. Exemplary rules include “subjects are dependent on verbs” and “adjectives are dependent on the nouns they describe.” To improve the analysis results, this study uses even more complex node features than those used in [39]. The hypernym extracted from WordNet [16] is applied to expand the matching function between nodes. A composite kernel is then constructed and applied to relation extraction for the first time.

Figure 4 shows a sample of the augmented dependency tree used in this study. The root of the tree is the verb “advanced,” and the corresponding subject and preposition are the child nodes of the root. The object of the preposition, “Tikrit,” is the child node of the preposition “near.” Each node has eight types of node feature information. Table 1 below outlines each node feature.

In Table 1, the first four features (words, detailed/general part-of-speech information, and phrase information) are the information obtained from parsing, and the rest are named entity features from the ACE collection. Among them, a WordNet hypernym is the result of extracting the highest node for the corresponding word from the WordNet database.

As discussed above, the tree kernel defined by [39] is used in this method. Since the features of each node are added, the matching and similarity functions defined by Zelenko et al. [39] are modified and applied accordingly. In detail, from among the eight features, those to be applied to the matching function and those to be applied to the similarity function were dynamically divided, devising the following model for application.

$t_i$: feature vector representing node $i$
$t_j$: feature vector representing node $j$
$t_i^m$: subset of $t_i$ used in the matching function
$t_i^s$: subset of $t_i$ used in the similarity function

$$m(t_i, t_j) = \begin{cases} 1 & \text{if } t_i^m = t_j^m \\ 0 & \text{otherwise} \end{cases}, \qquad s(t_i, t_j) = \sum_{v_q \in t_i^s} \sum_{v_r \in t_j^s} C(v_q, v_r) \qquad (1)$$

Fig. 4 Sample augmented dependency tree (“Troops advanced near Tikrit”)


where m is the matching function, s is the similarity function, and ti is the feature collection of node i. In addition, C(•,•) is a function used for comparing two feature values based on approximate matching rather than simple perfect matching. For example, recognizing “NN” and “NNP,” shown in the detailed part-of-speech information in Table 1, as the same part-of-speech is implemented by modifying the internal rules of this function. Equations 3 and 4 in Section 4.1 are applied as the tree kernel functions for comparing the similarity of two augmented dependency trees based on these two basic functions.
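As an illustration, the following sketch (our own, with an assumed split of the eight features between the two functions) shows how the m and s of Eq. 1 might operate on feature dictionaries attached to tree nodes:

```python
# A minimal sketch of the node matching function m and similarity
# function s from Eq. 1. Each node carries a feature dict; which
# features feed m versus s is an assumed design choice here.

MATCH_FEATURES = {"general_pos", "entity_type"}   # assumed split
SIM_FEATURES   = {"word", "detailed_pos", "chunk"}

def compatible(v_q, v_r):
    """C(.,.): approximate comparison of two feature values, e.g.
    treating the detailed POS tags NN and NNP as equivalent."""
    if {v_q, v_r} <= {"NN", "NNP"}:
        return 1.0
    return 1.0 if v_q == v_r else 0.0

def m(t_i, t_j):
    """Hard matching: 1 iff all matching features agree."""
    return int(all(t_i[f] == t_j[f] for f in MATCH_FEATURES))

def s(t_i, t_j):
    """Soft similarity: sum of pairwise feature compatibilities."""
    return sum(compatible(t_i[f], t_j[g])
               for f in SIM_FEATURES for g in SIM_FEATURES)

n1 = {"word": "troops", "detailed_pos": "NNS", "general_pos": "Noun",
      "chunk": "NP", "entity_type": "person"}
n2 = {"word": "forces", "detailed_pos": "NN", "general_pos": "Noun",
      "chunk": "NP", "entity_type": "person"}
print(m(n1, n2), s(n1, n2))  # nodes match; s counts compatible pairs
```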

For the evaluation, the initial ACE collection version (2002), released in 2003, was used. This collection defines five entity types and 24 types of relations. Culotta & Sorensen [14] tested relation extraction for only the five major relation types (“AT,” “NEAR,” “PART,” “ROLE,” and “SOCIAL”). The kernels tested were the sparse subtree kernel (K0), the continuous subtree kernel (K1), and the bag-of-words kernel (K2). In addition, two composite kernels combining the tree kernels with the bag-of-words kernel, that is, K3 = K0 + K2 and K4 = K1 + K2, were also constituted. A test consisting of a relation detection4 step and a relation classification5 step revealed that all tree kernel methods, including the composite kernels, show a better performance than the bag-of-words kernel. Unlike the evaluation results in [39], the performance of the continuous subtree kernel was higher here, though the reason for this was not clearly described; the authors also failed to demonstrate the advantage of using a dependency tree instead of a full parse tree in their experiment.

4.3 Shortest path dependency tree kernel method [6]

In Section 4.2, we discussed relation extraction using dependency parse trees with the tree kernel proposed by Zelenko et al. [39]. Bunescu & Mooney [6] studied the dependency path between two named entities in the dependency parse tree and proposed the shortest path dependency kernel for relation extraction. There is always a dependency path between two named entities in a sentence, and Bunescu & Mooney [6] argued that the relation extraction performance is improved by using such syntactic paths. Figure 5 below shows the dependency graph for a sample sentence.

The red nodes in Fig. 5 are named entities specified in the ACE collection. The separation of the entire dependency graph results in ten dependency syntax pairs. To construct a dependency path, it is possible to select the pairs that include named entities from among the syntax pairs, as shown in Fig. 6.

4 Binary classification for identifying the possible relation between two named entities.
5 Relation extraction for all instances with relations shown in the relation identification results.

Table 1 Node features in the augmented dependency tree

  Feature                Example
  Word                   troops, Tikrit
  Detailed POS (24)      NN, NNP
  General POS (5)        Noun, Verb, Adjective
  Chunking information   NP, VP, ADJP
  Entity type            person, geo-political-entity
  Entity level           name, nominal, pronoun
  WordNet hypernyms      social group, city
  Relation argument      ARG_A, ARG_B


As shown in Fig. 6, it is possible to construct dependency paths between the named entities “Protesters” and “stations,” and between “workers” and “stations.” As discussed above, the dependency paths connecting two named entities in a sentence can be extended almost indefinitely. Bunescu & Mooney [6] estimated that the shortest path contributes the most to establishing the relation between two entities. Therefore, it is possible to use kernel-based learning for estimating the relation between two named entities connected through a dependency path. For example, for the path “protesters → seized ← stations,” the relation is estimated as follows: the PERSON entity (“protesters”) did a specific action (“seized”) to the FACILITY entity (“stations”), through which PERSON (“protesters”) is located at FACILITY (“stations”) (“LOCATED_AT”). For the more complex path, “workers → holding ← protesters → seized ← stations,” it is possible to estimate the relation in which PERSON (“workers”) is located at FACILITY (“stations”) (“LOCATED_AT”), given that PERSON (“protesters”) did a particular action (“holding”) to PERSON (“workers”), and PERSON (“protesters”) did an action (“seized”) to FACILITY (“stations”). As such, using a dependency relation path, it is possible to identify the semantic relation between two entities more intuitively.

For learning, Bunescu & Mooney [6] extracted the shortest dependency paths including the two entities from the individual training instances, as shown in Table 2 below.

Fig. 5 Dependency graph and dependency syntax pair list for a sample sentence

Fig. 6 Extracting dependency path including named entities from a dependency syntax pair collection


As shown in Table 2, each relation instance is expressed as a dependency path, both ends of which are named entities. In terms of learning, however, it is not easy to extract sufficient features from such instances. Therefore, as discussed in Section 4.2, various types of supplementary information are created, such as the part-of-speech, entity type, and WordNet synset. As a result, the individual nodes that make up the dependency path comprise a plurality of information elements, and a variety of new paths are finally created, as shown in Fig. 7.

As shown in Fig. 7, with more information available for the individual nodes, 48 new dependency paths can be created through the Cartesian product of the node values. Here, relation extraction is carried out by applying a dependency path kernel that calculates the redundancy of the information included in each node, rather than by comparing all of the newly created paths, as shown below.

$x = x_1 x_2 \cdots x_m, \quad y = y_1 y_2 \cdots y_n$
$x_i$: the set of word classes corresponding to position $i$ of $x$
$y_i$: the set of word classes corresponding to position $i$ of $y$

$$K(x, y) = \begin{cases} 0 & m \neq n \\ \prod_{i=1}^{n} c(x_i, y_i) & m = n \end{cases}, \qquad c(x_i, y_i) = |x_i \cap y_i| \qquad (2)$$

In Eq. 2 above, x and y represent the extended individual instances, m and n denote the lengths of the dependency paths, K(•,•) denotes the dependency path kernel, and c(•,•) is a function for calculating the degree of information element redundancy between two nodes. Figure 8 below shows the process of calculating the kernel value based on Eq. 2.
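Because Eq. 2 is simply a per-position product of overlaps, it admits a very short implementation. The following sketch (ours; the feature sets per position are illustrative) represents each path position as a set of word classes:

```python
from math import prod

# Shortest-path dependency kernel of Eq. 2: each position holds a set
# of word classes (word, POS, entity type, ...); the kernel is the
# product of per-position overlaps, or 0 when the path lengths differ.

def spd_kernel(x, y):
    if len(x) != len(y):
        return 0
    return prod(len(xi & yi) for xi, yi in zip(x, y))

x = [{"protesters", "NNS", "Noun", "Person"},
     {"->"},
     {"seized", "VBD", "Verb"},
     {"<-"},
     {"stations", "NNS", "Noun", "Facility"}]
y = [{"troops", "NNS", "Noun", "Person"},
     {"->"},
     {"moved", "VBD", "Verb"},
     {"<-"},
     {"Tikrit", "NNP", "Noun", "Location"}]

print(spd_kernel(x, y))  # 3 * 1 * 2 * 1 * 1 = 6
```

Note how a single non-overlapping position zeroes the whole product, which is one reason this kernel's structure is considered overly simple.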

Using the same test environment and collection used by Culotta & Sorensen [14], two parsing systems, a CCG parser [20] and a CFG parser [10], were used to construct the shortest dependency paths. For a performance comparison, the test included K4 (bag-of-words kernel + continuous subtree kernel), which had shown the best performance for Culotta & Sorensen [14]. The test revealed that for the same kernel, the CFG-based shortest dependency path offers a better performance than the CCG-based one.

With regard to relation extraction, the shortest dependency path information is considered to be very useful, and is likely to be used in various fields. However, the kernel structure is too simple, and yet another limitation is that only paths of the same length can be compared when calculating the similarity of two dependency paths.

Table 2 Sample relation extraction instances based on the shortest path dependency tree

  Relation     Relation instance
  LOCATED_AT   protesters → seized ← stations
  LOCATED_AT   workers → holding ← protesters → seized ← stations
  LOCATED_AT   detainees → abusing ← Jelisic → created ← at ← camp

[protesters × NNS × Noun × Person] → [seized × VBD × Verb] ← [stations × NNS × Noun × Facility]

Fig. 7 New dependency path information created in a single instance



4.4 Subsequence kernel-based method [7]

The tree kernel presented by Zelenko et al. [39] compares sibling nodes at the same level using a subsequence kernel. Bunescu & Mooney [7] introduced the subsequence kernel itself and attempted relation extraction using only a base phrase analysis (chunking), without applying any syntactic structure. Since the kernel input does not have a complex syntactic structure but instead consists of base phrase sequences, the contextual information essential for relation extraction is easy to select; under this assumption, the feature space is divided into the following three types, with a maximum of four words for each type of feature.

In Fig. 8, [FB] represents the words positioned before and between the two entities, [B] indicates only the words between them, and [BA] indicates the word collections between and after the two entities. These three types of feature collections can capture the typical ways in which a relation is expressed. Furthermore, various types of supplementary word information (part-of-speech, entity type, WordNet synset, etc.) are used to expand the features, as in the methods described above.

Zelenko et al. [39] described how to calculate the subsequence kernel, which is described in detail again here. The kernel calculation function, Kn(s, t), is defined as shown below, based on all subsequences of length n included in the two sequences s and t.

$$K_n(s, t, \lambda) = \sum_{\mathbf{i}: |\mathbf{i}|=n} \sum_{\mathbf{j}: |\mathbf{j}|=n} \lambda^{l(\mathbf{i})+l(\mathbf{j})} \prod_{k=1}^{n} c(s_{i_k}, t_{j_k}) \qquad (3)$$

where i and j represent index subsequences of s and t, respectively, c(•,•) is a function for determining the homogeneity of its two inputs, and λ is a weight given to matching subsequences. In addition, l(i) and l(j) are values indicating how widely each relevant subsequence is spread across its entire sequence. To calculate the weighted length, Equation 3 considers only the subsequences of length n that exist in both s and t. For a simple description, the following two sentences and base phrase analysis results will be used to explain the process of calculating the kernel value.

As shown in Fig. 9, each of the nodes used in the analysis has eight types of lexical information (word, part-of-speech, base phrase type, entity type, etc.). The kernel value, K3(s, t, 0.5), of the two analysis sequences is calculated according to the process shown in Fig. 10, using the features of the subsequences whose length is 3.

There are three subsequence pairs for which it is determined, by

• [FB] Fore-Between: “interaction of [P1] with [P2]”, “activation of [P1] by [P2]”
• [B] Between: “[P1] interacts with [P2]”, “[P1] is activated by [P2]”
• [BA] Between-After: “[P1] – [P2] complex”, “[P1] and [P2] interact”

Fig. 8 Contextual location information for feature extraction


means of the homogeneity decision function c(•,•), that every pair of corresponding nodes shares at least one feature. The scores for the matching subsequences are derived by calculating the cumulative product of c(•,•) over each subsequence, and then multiplying it by the weight. For example, a similarity of 0.84375 is obtained for “troops advanced near” and “forces moved toward.” On the contrary, the similarity of “troops advanced … Tikrit” and “forces moved … Baghdad” is 0.21093. This results from the lowered weight, since the two subsequences are positioned further apart. Finally, the similarities of the subsequences are summed through Eq. 3, which gives 1.477.
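A brute-force sketch of Eq. 3 is given below (ours; the exact definition of the span l(i) varies across formulations, so the numbers need not reproduce Fig. 10 exactly):

```python
from itertools import combinations

# Brute-force subsequence kernel K_n(s, t, lam) of Eq. 3: enumerate all
# index subsequences of length n, weight each pair by lam ** (l(i)+l(j)),
# where l() is taken here as the covered span, and multiply the
# per-position feature overlaps c(s_ik, t_jk) = |s_ik & t_jk|.

def K_n(s, t, n, lam):
    total = 0.0
    for i in combinations(range(len(s)), n):
        for j in combinations(range(len(t)), n):
            span = (i[-1] - i[0] + 1) + (j[-1] - j[0] + 1)  # l(i) + l(j)
            c = 1
            for a, b in zip(i, j):
                c *= len(s[a] & t[b])
                if c == 0:
                    break  # one non-matching position zeroes the product
            total += (lam ** span) * c
    return total

# Each position is a set of features (word, POS, generalized POS, ...).
s = [{"troops", "NNS", "Noun"}, {"advanced", "VBD", "Verb"},
     {"near", "IN"}, {"Tikrit", "NNP", "Noun"}]
t = [{"forces", "NNS", "Noun"}, {"moved", "VBD", "Verb"},
     {"toward", "IN"}, {"Baghdad", "NNP", "Noun"}]

print(K_n(s, t, 3, 0.5))
```

The decay factor λ penalizes subsequences that are spread apart, which is exactly why the non-contiguous pair above ("troops advanced … Tikrit") scores lower than the contiguous one.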

Fig. 9 Two sentences and base phrase analysis results used to illustrate the process of calculating a subsequence kernel

Fig. 10 Process used in calculating K3(s, t, 0.5)


As described above, it is possible to construct the subsequence kernel function based on subsequences of all lengths using contextual location information. Figure 11 shows the surrounding contextual locations where the named entities occur in the sentence.

In Fig. 11, $x_i$ and $y_i$ indicate the named entities, $s_f$ and $t_f$ denote the word lists before the named entities, $s_a$ and $t_a$ are the contextual word collections after the two entities, and $s'_b$ and $t'_b$ represent the contextual information including the two entities. Thus, the subsequence kernel for the two sequences s and t is defined using Eq. 4 as follows.

$$K(s,t) = K_{fb}(s,t) + K_b(s,t) + K_{ba}(s,t)$$
$$K_{b,i}(s,t) = K_i(s_b, t_b, 1) \cdot c(x_1, y_1) \cdot c(x_2, y_2) \cdot \lambda^{l(s'_b) + l(t'_b)}$$
$$K_{fb}(s,t) = \sum_{i \ge 1,\, j \ge 1,\, i+j < fb_{max}} K_{b,i}(s,t) \cdot K'_j(s_f, t_f)$$
$$K_b(s,t) = \sum_{1 \le i \le b_{max}} K_{b,i}(s,t)$$
$$K_{ba}(s,t) = \sum_{i \ge 1,\, j \ge 1,\, i+j < ba_{max}} K_{b,i}(s,t) \cdot K'_j(s_a, t_a) \qquad (4)$$

The subsequence kernel K(s, t) consists of the sum of the contextual kernel before and between the entity pair, Kfb; the intermediate contextual kernel between the entities, Kb; and the contextual kernel between and after the entity pair, Kba. The definitions of the individual contextual kernels are given in the third through fifth lines of Eq. 4. Here, K′n is the same as Kn, with the exception that it measures the length of the relevant subsequence from the location where the subsequence starts to the end of the entire sequence, and is defined as follows.

$$K'_n(s, t, \lambda) = \sum_{\mathbf{i}: |\mathbf{i}|=n} \sum_{\mathbf{j}: |\mathbf{j}|=n} \lambda^{|s| + |t| - i_1 - j_1 + 2} \prod_{k=1}^{n} c(s_{i_k}, t_{j_k}) \qquad (5)$$

In Eq. 5, i1 and j1 indicate the starting positions of subsequences i and j, respectively. Using Eq. 5, each individual contextual kernel calculates the similarity between the two sequences for the subsequences within the regions specified in Fig. 11, and all of the kernel values are summed to determine the final total value.

Fig. 11 Specifying the contextual information depending on the named entity location and defining variables


The performance evaluation under the same test environment described in Sections 4.2 and 4.3 shows an increase in performance, even without complicated pre-processing such as parsing, and without any syntactic information. In conclusion, the evaluation shows that this method is very fast in terms of learning speed, and has the potential for an additional improvement in performance.

4.5 Composite kernel-based method [41]

The release of the ACE 2003 and 2004 versions contributed to a full-scale study of relation extraction. In particular, these collections are characterized by even richer information for the tagged entities. For example, ACE 2003 provides various entity features, e.g., the entity headword, entity type, and entity subtype for a specific named entity, and these features have been used as an important clue for determining the relation between two entities in a specific sentence. In this context, Zhang et al. [40, 41] built a composite kernel in which the convolution parse tree kernel proposed by Collins & Duffy [11] is combined with an entity feature kernel. Equation 6 provides the entity kernel definition.

$$K_L(R_1, R_2) = \sum_{i=1,2} K_E(R_1.E_i, R_2.E_i)$$
$$K_E(E_1, E_2) = \sum_i C(E_1.f_i, E_2.f_i)$$
$$C(f_1, f_2) = \begin{cases} 1 & \text{if } f_1 = f_2 \\ 0 & \text{otherwise} \end{cases} \qquad (6)$$

In Eq. 6, Ri indicates a relation instance, and Ri.Ej is the j-th entity of Ri. Additionally, Ei.fj is the j-th entity feature of Ei, and C(•,•) is a homogeneity function for two feature values. It is possible to calculate the entity kernel KL through a summation based on the feature redundancy decision kernel KE for each pair of entities.
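Concretely, the entity kernel just counts matching features across the two aligned entity pairs. A minimal sketch (ours; the feature layout of headword, entity type, and mention type is an assumption for illustration):

```python
# Entity kernel K_L of Eq. 6: each relation instance carries two
# entities, each entity a fixed tuple of features, and the kernel
# counts exact feature matches (C is a 0/1 indicator).

def K_E(e1, e2):
    """Number of identical features between two entities."""
    return sum(f1 == f2 for f1, f2 in zip(e1, e2))

def K_L(r1, r2):
    """Sum K_E over the two aligned entities of each relation instance."""
    return sum(K_E(e1, e2) for e1, e2 in zip(r1, r2))

# (headword, entity type, mention type) -- assumed feature layout
r1 = [("Clinton", "Person", "name"), ("Washington", "GPE", "name")]
r2 = [("Ballmer", "Person", "name"), ("Microsoft", "Organization", "name")]
print(K_L(r1, r2))  # 2 + 1 = 3
```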

Second, the convolution parse tree kernel expresses one parse tree as an occurrence frequency vector of its subtrees in order to measure the similarity between two parse trees, as shown below.

$$\phi(T) = (\#subtree_1(T), \ldots, \#subtree_i(T), \ldots, \#subtree_n(T)) \qquad (7)$$

In Eq. 7, #subtreei(T) represents the occurrence frequency of the i-th subtree. All parse trees are expressed using such a vector, and the kernel function is calculated as the inner product of two vectors, as follows.

$$K(T_1, T_2) = \langle \phi(T_1), \phi(T_2) \rangle \qquad (8)$$

Figure 12 shows all subtrees of a specific parse tree. There are nine subtrees in the figure altogether, and each subtree is an axis of the vector that expresses the left-side parse tree. If the number of all unique subtrees that can be extracted from N parse trees is M, each parse tree can be expressed as an M-dimensional vector.

As shown in Fig. 12, there are two constraints on the subtrees of a specific parse tree. First, a subtree must have at least two nodes, and second, it must comply with the grammar production rules. For example, [VP → VBD → “got”] cannot become a subtree.

It is necessary to investigate and calculate the frequency of all subtrees in tree T so as to build a vector for each parse tree. This process is quite inefficient, however. Since we only need to compute the similarity of two parse trees in a kernel-based method, we can come up with indirect kernel functions without building a subtree vector for each parse tree, as in


Eq. 7. The following Eq. 9, proposed by Collins & Duffy [11], is used to efficiently calculate the similarity of two parse trees.

$$K(T_1, T_2) = \langle \phi(T_1), \phi(T_2) \rangle = \sum_i \#subtree_i(T_1) \cdot \#subtree_i(T_2)$$
$$= \sum_i \left( \sum_{n_1 \in N_1} I_{subtree_i}(n_1) \right) \cdot \left( \sum_{n_2 \in N_2} I_{subtree_i}(n_2) \right) = \sum_{n_1 \in N_1} \sum_{n_2 \in N_2} \Delta(n_1, n_2)$$

where $N_1$ and $N_2$ are the sets of nodes in trees $T_1$ and $T_2$,

$$I_{subtree_i}(n) = \begin{cases} 1 & \text{if } ROOT(subtree_i) = n \\ 0 & \text{otherwise} \end{cases}, \qquad \Delta(n_1, n_2) = \sum_i I_{subtree_i}(n_1) \cdot I_{subtree_i}(n_2) \qquad (9)$$

In Eq. 9, Ti represents a specific parse tree, Ni represents the node collection of that parse tree, and Isubtree(n) is the function used for checking whether node n is the root node of the specific subtree. The most time-consuming part of Eq. 9 is calculating Δ(n1, n2). To make this efficient, Collins & Duffy [11] came up with the following algorithm.

The function Δ(n1, n2), defined in step (3) of Fig. 13, compares the child nodes of the input nodes to calculate the frequency of the subtrees contained in both parse trees and the product thereof,

Fig. 12 Parsing tree and its subtree collection

Fig. 13 Algorithm used for calculating Δ(n1, n2)


until the end conditions defined in steps (1) and (2) are satisfied. In this case, a decay factor λ, a variable used for downweighting large subtrees (to address the fact that larger subtrees of a parse tree comprise other subtrees), is applied repeatedly in calculating the inner product of the subtree vectors.
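The following sketch (ours, with an assumed minimal Node class) implements the standard form of this recursion: Δ is zero when the productions differ, λ at pre-terminals, and otherwise λ times the product of (1 + Δ) over the aligned children:

```python
# A sketch of the Collins & Duffy Delta recursion from Fig. 13,
# assuming a minimal Node class for parse trees.

class Node:
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)

    def production(self):
        return (self.label, tuple(c.label for c in self.children))

def delta(n1, n2, lam=0.5):
    if not n1.children or not n2.children:
        return 0.0  # a single word cannot root a subtree (>= 2 nodes)
    # (1) zero if the productions rooted at n1 and n2 differ
    if n1.production() != n2.production():
        return 0.0
    # (2) lam if both nodes are pre-terminals (children are leaf words)
    if all(not c.children for c in n1.children):
        return lam
    # (3) otherwise recurse over the aligned children
    out = lam
    for c1, c2 in zip(n1.children, n2.children):
        out *= 1.0 + delta(c1, c2, lam)
    return out

def nodes(root):
    """All nodes of a tree, via an explicit stack."""
    stack = [root]
    while stack:
        n = stack.pop()
        yield n
        stack.extend(n.children)

def tree_kernel(t1, t2, lam=0.5):
    """K(T1, T2) = sum of delta over all node pairs (Eq. 9)."""
    return sum(delta(n1, n2, lam) for n1 in nodes(t1) for n2 in nodes(t2))

t1 = Node("NP", [Node("DT", [Node("the")]), Node("NN", [Node("man")])])
t2 = Node("NP", [Node("DT", [Node("the")]), Node("NN", [Node("dog")])])
print(tree_kernel(t1, t2))  # counts the shared [DT -> the] fragments
```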

The two kernels built as described above, that is, the entity kernel and the convolution parse tree kernel, are combined in the following two ways.

$$K_1(R_1, R_2) = \alpha \cdot \frac{K_L(R_1, R_2)}{\sqrt{K_L(R_1, R_1) \cdot K_L(R_2, R_2)}} + (1 - \alpha) \cdot \frac{K(T_1, T_2)}{\sqrt{K(T_1, T_1) \cdot K(T_2, T_2)}} \qquad (10\text{–}1)$$

$$K_2(R_1, R_2) = \alpha \cdot \left( \frac{K_L(R_1, R_2)}{\sqrt{K_L(R_1, R_1) \cdot K_L(R_2, R_2)}} + 1 \right)^2 + (1 - \alpha) \cdot \frac{K(T_1, T_2)}{\sqrt{K(T_1, T_1) \cdot K(T_2, T_2)}} \qquad (10\text{–}2)$$

In Eqs. 10–1 and 10–2, KL represents the entity kernel and K stands for the convolution parse tree kernel. Equation 10–1 shows that the composite kernel is a linear combination of the two kernels, and Eq. 10–2 defines the composite kernel constructed using a quadratic polynomial combination.
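Both combinations normalize each kernel before mixing; a minimal sketch is given below (ours; α = 0.4 is an arbitrary placeholder, and K_L and K stand for any entity and tree kernel implementations):

```python
from math import sqrt

def normalized(k, a, b):
    """Cosine-style normalization: k(a,b) / sqrt(k(a,a) * k(b,b))."""
    return k(a, b) / sqrt(k(a, a) * k(b, b))

def composite_linear(K_L, K, r1, r2, t1, t2, alpha=0.4):
    """Eq. 10-1: linear combination of the normalized kernels."""
    return (alpha * normalized(K_L, r1, r2)
            + (1 - alpha) * normalized(K, t1, t2))

def composite_poly(K_L, K, r1, r2, t1, t2, alpha=0.4):
    """Eq. 10-2: quadratic polynomial combination on the entity side."""
    return (alpha * (normalized(K_L, r1, r2) + 1) ** 2
            + (1 - alpha) * normalized(K, t1, t2))
```

Normalization keeps the two kernels on a comparable scale, so that α alone controls the balance between the flat entity features and the structural features.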

For the evaluation, Zhang et al. [40, 41] used both ACE 2003 and ACE 2004. They parsed all available relation instances using Charniak's Parser [8] and, based on the parsing results, carried out an instance conversion. A kernel tool developed by Moschitti [29] and SVMLight [22] were used for learning and classification.

The evaluation shows that the composite kernel achieves a better performance than a single syntactic kernel, with the combination using the quadratic polynomial form showing the better performance of the two. This means that the flat (entity) and structural (syntactic) features can be organically combined in a single kernel function. Considering that the Path-enclosed Tree method shows the best performance among all relation instance pruning methods, the relation of two entities in a specific sentence can evidently be estimated using only the core related syntactic information.

4.6 Other recent studies

Choi et al. [9] constructed and tested a composite kernel to which various lexical and contextual features are added by expanding the existing composite kernel. In addition to the syntactic features, they extended the combination range of the flat features from the

Table 3 Description of the ACE collections

  Items                          ACE-2002   ACE-2003   ACE-2004
  # training documents           422        674        451
  # training relation instances  6,156      9,683      5,702
  # test documents               97         97         N/A
  # test relation instances      1,490      1,386      N/A
  # entity types                 5          5          7
  # major relation types         5          5          7
  # relation sub-types           24         24         23


entity features to the contextual features to achieve a higher performance. Mintz et al. [27] proposed a new method that uses Freebase, a semantic database covering thousands of relation types, to gather exemplary sentences for a specific relation, and then carries out relation extraction based on the exemplary sentences obtained. In their test, a collection of 10,000 instances corresponding to 102 types of relations yielded an accuracy of 67.6 %. In addition, T.-V. T. Nguyen et al. [32] designed a new kernel that extends the existing convolution parse tree kernel, and Reichartz et al. [34] proposed a method that extends the dependency tree kernel. As described previously, most of the studies published thus far are based on the kernels described in Sections 4.1 through 4.5.

5 Comparison and analysis

In the previous section, five types of kernel-based relation extraction were analyzed in detail. Here, we discuss the results of a comparison and an analysis of these methods. Section 5.1 briefly describes the criteria used for the comparison and analysis, and Section 5.2 compares the characteristics of the methods. Section 5.3 then covers the performance results in detail. Finally, Section 5.4 summarizes the advantages and disadvantages of each method.

5.1 Criteria for comparison and analysis

A large variety of criteria can generally be used for comparing kernel-based relation extraction methods. The following six criteria, however, were selected and used in this study: (1) The linguistic analysis and pre-processing method refers to the pre-processing analysis methods and the types of the individual instances composing the learning and evaluation collections, e.g., the type of parsing method or parsing system used. (2) The level of linguistic analysis, a criterion related to (1), refers to the level of linguistic analysis carried out for pre-processing and analyzing the instances. Exemplary levels include part-of-speech tagging, base phrase analysis, and dependency or full parsing. (3) The method used for selecting the feature space determines whether the substantial input of the kernel function is an entire sentence or a part thereof. (4) The applied lexical and supplementary feature information refers to the various supplementary features used for addressing the issue of sparse data. (5) The relation extraction method is the practical relation extraction procedure applied on top of the previously constituted learning models. Exemplary methods include a single-phase method that processes the multi-class classification at once, and a cascade method that first separates instances with relations from those without through a binary classification and then carries out relation classification only on the instances with relations. (6) The manual work requirement indicates whether the entire process is carried out fully automatically or whether manual work is required at a particular step. These six criteria were used to analyze the kernel-based relation extraction methods, the results of which are shown in Table 5.

In addition, for the purpose of describing the characteristics of the kernel functions, the description in Section 4 is summarized, and the structure of each argument to be input into each kernel function is described. Modifications of the kernel functions for an optimized speed are also included in the analysis criteria. For a performance comparison of the individual methods, the types and scale of the tested collections and tested relations are analyzed and described in detail. Table 3 below describes the ACE collections generally used for the relation extraction methods developed thus far.


As shown in Table 3, the generally used ACE collection can be divided into three versions. ACE-2002, however, is not widely used owing to its consistency and quality problems. There are five to seven entity types, e.g., Person, Organization, Facility, Location, and Geo-Political Entity. For the relations, all collections are structured at two levels, consisting of 23 to 24 particular relation sub-types corresponding to five major relation types: Role, Part, Located, Near, and Social. As the method used for constructing these collections has advanced and their quality has improved, the scale of the training instances has tended to decrease. Although subsequent collections have been constituted, they have not been publicized, in accordance with the non-disclosure policy of the ACE Workshop. This policy should be improved for the benefit of more active studies.

5.2 Comparison of the characteristics

In accordance with the six comparison and analysis criteria described in Section 5.1, Table 4 summarizes the concept of each kernel function used in the kernel-based relation extraction methods.

Table 5 shows the comparison and analysis of the kernel-based relation extraction methods. The closer to the right side of the table a method is, the more recently it was introduced. A characteristic found in all of the above methods is that various types of feature and syntactic information are used. Such heterogeneous information was initially combined and used within a single kernel function, but has since been separated out and applied through composite kernels.

Table 4 Summary of kernel-based relation extraction methods

Tree Kernel (TK)
▪ Compares each node of the two trees being compared.
▪ Based on BFS, applies a subsequence kernel to the child nodes located at the same level.

Dependency Tree Kernel (DTK)
▪ The decay factor is adjusted and applied to the similarity depending on the length of the subsequences at the same level, or on the level itself.

Shortest Path Dependency Kernel (SPDK)
▪ Compares each element of the two paths, cumulatively calculating the common values of the elements, and computes the similarity by multiplying all of the values.
▪ The similarity is 0 if the lengths of the two paths differ.

Subsequence Kernel (SK)
▪ For the example of measuring the similarity of two words, extracts only the subsequences of length n which exist in both words and, using those subsequences, expresses the two words as vectors (Φ(x)).
▪ Afterwards, obtains the inner product of the two vectors to calculate the similarity.
▪ Generalizes the SSK to compare the planar information (sibling nodes) of the tree kernel.

Composite Kernel (CK)
▪ Finds all subtrees in the typical CFG-type syntactic tree, and establishes them as coordinate axes to represent the parse tree as a vector (Φ(x)).
▪ In this case, the following constraints hold: (1) the number of nodes must be at least 2, and (2) the subtree should comply with the CFG production rules.
▪ Since a subtree can occur multiple times, each coordinate value can be greater than 1, and the similarity is calculated by obtaining the inner product of the two vectors created as such.


Table 5 Comparison of the characteristics of kernel-based relation extraction

Language Processor
  TK: Shallow Parser (REES) | DTK: Statistical Parser (MXPOST) | SPDK: Statistical Parser (Collins' Parser) | SK: Chunker (OpenNLP) | CK: Statistical Parser (Charniak's Parser)

Level of Linguistic Analysis
  TK: PLO Tagging, Parsing | DTK: Parsing | SPDK: Parsing | SK: Chunking | CK: Parsing

Feature Space Selection
  TK: Extracts features from parse trees manually | DTK: Selects a small sub-tree including the two entities from the entire dependency tree | SPDK: Selects the shortest dependency path which starts with one entity and ends with the other | SK: Before / Between / After Entities | CK: MCT, PT, CPT, FPT, FCPT

Features used for Kernel Computation
  TK: Entity Headword, Entity Role, Entity Text, Entity Type | DTK: Word, POS, Chunking Info., Entity Type, Entity Level, WordNet Hypernym, Relation Argument | SPDK: Word, POS, Chunking Info., Entity Type | SK: Word, POS, Chunking Info., Entity Type | CK: Entity Headword, Entity Type, Mention Type, LDC Mention Type, Chunk Headword, Chunking Info.

Relation Extraction Method
  TK: Single Phase (Multiclass SVM) | DTK: Cascade Phase (Relation Detection and Classification) | SPDK: Single Phase | SK: Single Phase (Multiclass SVM) | CK: Single Phase (Multiclass SVM), Cascade Phase

Manual Process
  TK: Necessary (Instance Extraction) | DTK: Necessary (Dependency Tree) | SPDK: Necessary (Dependency Path) | SK: N/A | CK: N/A


With respect to feature space selection, most methods apply portions of the sentence or parse tree rather than the entire structure. Manual work was initially required for extracting relation instances and building parse trees; the more recently developed methods, however, offer full automation. For the relation extraction step, multi-class classification is generally used, in which the case of no relation is included as a single additional class.

5.3 Comparison of performance

Table 6 shows the parameter type of each kernel function, along with the computational complexity of calculating the similarity between two inputs. Most of the kernels show a complexity of O(N²), but SPDK demonstrates a complexity on the order of O(N) and can be considered the most efficient.

It should be noted that the complexity shown in Table 6 is simply the kernel calculation complexity. The overall complexity of the relation extraction can be much higher when the processing time for the parsing and learning is also considered. Table 7 compares the reported performance of each kernel-based relation extraction model.

ACE-2002, which was the first version of the ACE collection, had an issue with data consistency. This problem was continuously addressed in the subsequent versions, and was finally resolved in ACE-2003. Starting from a performance of 52.8 %, achieved by Kambhatla [23] on ACE-2003 using the 24 relation sub-types, the performance was improved up to 59.6 %, as recently announced by Zhou et al. [45]. Similarly, the maximum relation extraction performance for the 23 particular relations in the ACE-2004 collection is currently 66 %.

Although each model performs differently on differently sized relation collections, the composite kernel generally shows the best results. In particular, Zhou et al. [45] demonstrated a high performance across all collections and relation sets, based on extended models initially proposed by Zhang et al. [40, 41]. As described above, this suggests that the various features relevant to relation extraction, that is, syntactic structure and vocabulary, can be efficiently combined in a composite kernel for an even better performance.
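The O(n · |N_1| · |N_2|) entry for SK in Table 6 likewise mirrors its computation: common, possibly gapped, subsequences are counted by dynamic programming, with one table per subsequence length. The sketch below implements the string form of the kernel due to Lodhi et al. [26], which SK generalizes from characters to feature-annotated word positions; the decay value LAM is an assumed illustration.

import numpy as np

LAM = 0.75  # gap/decay factor penalizing spread-out subsequences (assumed)

def ssk(s, t, n):
    """Subsequence kernel: weighted count of common subsequences of length n.
    The three nested loops make the O(n * |s| * |t|) cost in Table 6 explicit."""
    S, T = len(s), len(t)
    Kp = np.zeros((n + 1, S + 1, T + 1))  # Kp[i, a, b] = K'_i(s[:a], t[:b])
    Kp[0, :, :] = 1.0
    for i in range(1, n):
        for a in range(1, S + 1):
            Kpp = 0.0  # running value of the auxiliary term K''_i(s[:a], t[:b])
            for b in range(1, T + 1):
                Kpp = LAM * Kpp + (LAM * LAM * Kp[i - 1, a - 1, b - 1]
                                   if s[a - 1] == t[b - 1] else 0.0)
                Kp[i, a, b] = LAM * Kp[i, a - 1, b] + Kpp
    # final accumulation over all matching end positions
    return sum(LAM * LAM * Kp[n - 1, a - 1, b - 1]
               for a in range(1, S + 1) for b in range(1, T + 1)
               if s[a - 1] == t[b - 1])

print(ssk("cat", "cat", 2))  # 2*LAM**4 + LAM**6: "ca" and "at" are contiguous,
                             # while "ct" spans a gap and is decayed more heavily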



5.4 Comparison and analysis of the advantages and disadvantages of each method

The advantages and disadvantages of the five kernel-based relation extraction methods, along with those of the feature-based approach, are discussed and outlined in Table 8.

Table 8 Analysis of the advantages and disadvantages of kernel-based relation extraction

Feature-based SVM/ME
  Advantages: Applies typical automatic sentence classification without modification. Performance can be further improved by applying various kinds of feature information. Relatively high speed.
  Disadvantages: A lot of effort is required for feature extraction and selection. Performance can be improved only through feature combination.

Tree Kernel (TK)
  Advantages: Calculates the similarity between shallow parse trees. Uses both structural (parenthood) and planar (brotherhood) information. Optimized for improved speed.
  Disadvantages: Very limited use of structural information (syntactic relations). Slow similarity calculation in spite of the speed optimization.

Dependency TK (DTK)
  Advantages: Addresses the insufficient use of structural information, which is a disadvantage of TK. Uses the key words at the core of the relation expression in the sentence as feature information, exploiting the structural characteristic of a dependency tree that the predicate node is raised to a higher position.
  Disadvantages: Predicates and key words in a dependency tree are emphasized only by means of decay factors (low emphasis capability). Slow similarity calculation speed.

Shortest Path DTK (SPDK)
  Advantages: Creates a path between the two named entities by means of dependency relations, reducing noise unrelated to the relation expression. Very fast computation, because the kernel input is paths rather than trees. Various types of supplementary feature information can be added to improve the similarity measurement, thanks to the simple structure of paths.
  Disadvantages: Overly simple structure of the kernel function. Too strong a constraint: the similarity is 0 if the two input paths differ in length.

Subsequence Kernel (SK)
  Advantages: Very efficient, because syntactic analysis information is not used. Various types of supplementary feature information can be added to improve performance.
  Disadvantages: Can include many unnecessary features.

Composite Kernel (CK)
  Advantages: Makes every constituent subtree of a parse tree a feature, so that structural information is fully used in calculating similarity. Optimized for improved speed.
  Disadvantages: Comparison is carried out only on the basis of the sentence-component information of each node (phrase info.); kernel calculation based on composite feature information (word classes, semantic info., etc.) is still required.

As shown in Table 8, each method has certain advantages and disadvantages. A lot of effort is required for the feature selection process used in general feature-based relation extraction. The kernel-based methods do not have this disadvantage, but they do have various limitations. For example, although the shortest path dependency kernel has a variety of potential uses, it achieves a low performance owing to the overly simple structure of its kernel function. Since the composite kernel constructs and compares subtree features only on the basis of the part-of-speech and vocabulary information of each node, the generality of the similarity measurement is not high. A scheme to overcome this limitation is to use word classes or semantic information.

To overcome the shortcomings described above, a scheme for a new kernel design may also be suggested. For example, various types of supplementary feature information (WordNet synsets, thesauri, ontologies, part-of-speech tags, thematic role information, etc.) could be interworked to ensure a more general comparison between subtrees in the composite kernel. The performance can likewise be improved by replacing the current simple linear kernel with a subsequence or other composite kernel and by applying several kinds of supplementary feature information to address the shortcomings of the shortest path dependency kernel.
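As a concrete rendering of this interworking scheme: a non-negative weighted sum of valid kernels is itself a valid kernel, so a structural kernel can be combined with a flat kernel over supplementary feature information without compromising SVM training. The sketch below is ours, in the spirit of the composite kernels of Zhang et al. [41] and Choi et al. [9]; the instance encoding, the stand-in structural kernel, and the weight ALPHA are assumptions for illustration.

ALPHA = 0.4  # mixing weight between structural and flat similarity (assumed)

def flat_kernel(f1, f2):
    """Linear kernel over sets of supplementary features
    (WordNet synsets, POS tags, thematic roles, ...)."""
    return float(len(f1 & f2))

def composite_kernel(k_struct, alpha=ALPHA):
    """Lift a structural kernel into a composite one by weighted summation."""
    def k(x1, x2):
        return (alpha * k_struct(x1["tree"], x2["tree"])
                + (1.0 - alpha) * flat_kernel(x1["flat"], x2["flat"]))
    return k

def toy_struct(t1, t2):
    """Stand-in structural kernel; a real system would plug in a tree kernel."""
    return float(t1 == t2)

K = composite_kernel(toy_struct)
a = {"tree": "T1", "flat": {"PERSON", "ORG", "hire#v#3"}}
b = {"tree": "T1", "flat": {"PERSON", "GPE", "hire#v#3"}}
print(K(a, b))  # 0.4 * 1.0 + 0.6 * 2.0 = 1.6

Replacing flat_kernel here with the subsequence kernel sketched in Section 5.3 is precisely the kind of substitution proposed above for strengthening the shortest path dependency kernel.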

6 Conclusion

In this paper, we analyzed kernel-based relation extraction methods, which are considered the most efficient approach currently available. Previous case studies did not fully cover the specific operation principles of kernel-based relation extraction models, and simply cited the contents of individual studies or conducted a limited analysis. This paper, however, closely examined the operation principles and individual characteristics of five kernel-based relation extraction methods, starting from the original kernel-based relation extraction study [39] and ending with the composite kernel [9, 41], which is considered the most advanced kernel-based method. The overall performance of each method was compared using the ACE collections, and the particular advantages and disadvantages of each method were summarized. This study should contribute to further kernel studies on relation extraction with a higher performance, and to general high-level kernel studies for linguistic processing and text mining.


References

1. ACE, "Automatic Content Extraction," 2009. [Online]. Available: http://www.itl.nist.gov/iad/mig//tests/ace/.
2. Agichtein E, Gravano L (2000) "Snowball: extracting relations from large plain-text collections," in Proceedings of the Fifth ACM Conference on Digital Libraries, pp. 85–94
3. Aone C, Ramos-Santacruz M (2000) "REES: a large-scale relation and event extraction system," in The 6th Applied Natural Language Processing Conference
4. Bach N, Badaskar S (2007) "A survey on relation extraction," Literature review for Language and Statistics II
5. Brin S (1999) Extracting patterns and relations from the World Wide Web. Lect Notes Comput Sci 1590/1999:172–183
6. Bunescu R, Mooney RJ (2005) "A shortest path dependency kernel for relation extraction," in Proceedings of the Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pp. 724–731
7. Bunescu R, Mooney RJ (2006) "Subsequence kernels for relation extraction," in Proceedings of the Ninth Conference on Natural Language Learning (CoNLL-2005)
8. Charniak E (2001) "Immediate-head parsing for language models," in Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics
9. Choi S-P, Jeong C-H, Choi Y-S, Myaeng S-H (2009) Relation extraction based on extended composite kernel using flat lexical features. J KIISE: Softw Appl 36:8
10. Collins M (1997) "Three generative, lexicalised models for statistical parsing," in Proceedings of the 35th Annual Meeting of the ACL (jointly with the 8th Conference of the EACL), Madrid
11. Collins M, Duffy N (2001) "Convolution kernels for natural language," in NIPS-2001
12. Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20(3):273–297
13. Cristianini N, Shawe-Taylor J (2000) An introduction to support vector machines and other kernel-based learning methods. Cambridge University Press
14. Culotta A, Sorensen J (2004) "Dependency tree kernels for relation extraction," in Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics
15. Etzioni O, Cafarella M, Downey D, Popescu A-M, Shaked T, Soderland S, Weld DS, Yates A (2005) Unsupervised named-entity extraction from the Web: an experimental study. Artif Intell 165(1):91–134
16. Fellbaum C (1998) WordNet: an electronic lexical database. MIT Press, Cambridge
17. Freund Y, Schapire RE (1999) Large margin classification using the perceptron algorithm. Mach Learn 37(3):277–296
18. Fundel K, Küffner R, Zimmer R (2007) RelEx—Relation extraction using dependency parse trees. Bioinformatics 23(3):365–371
19. Gartner T, Flach P, Wrobel S (2003) "On graph kernels: hardness results and efficient alternatives," Learning Theory and Kernel Machines, pp. 129–143
20. Hockenmaier J, Steedman M (2002) "Generative models for statistical parsing with combinatory categorial grammar," in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA
21. Jiang J, Zhai C (2007) "A systematic exploration of the feature space for relation extraction," in NAACL HLT
22. Joachims T (1998) "Text categorization with support vector machines: learning with many relevant features," in ECML-1998
23. Kambhatla N (2004) "Combining lexical, syntactic and semantic features with Maximum Entropy models for extracting relations," in ACL-2004
24. Li J, Zhang Z, Li X, Chen H (2008) Kernel-based learning for biomedical relation extraction. J Am Soc Inf Sci Technol 59(5):756–769
25. Li W, Zhang P, Wei F, Hou Y, Lu Q (2008) "A novel feature-based approach to Chinese entity relation extraction," in Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers, pp. 89–92
26. Lodhi H, Saunders C, Shawe-Taylor J, Cristianini N, Watkins C (2002) Text classification using string kernels. J Mach Learn Res 2:419–444
27. Mintz M, Bills S, Snow R, Jurafsky D (2009) "Distant supervision for relation extraction without labeled data," in Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP 2:1003–1011
28. Moncecchi G, Minel JL, Wonsever D (2010) "A survey of kernel methods for relation extraction," in Workshop on NLP and Web-based Technologies (IBERAMIA 2010)
29. Moschitti A (2004) "A study on convolution kernels for shallow semantic parsing," in ACL-2004
30. MUC, "The NIST MUC Website," 2001. [Online]. Available: http://www.itl.nist.gov/iaui/894.02/related_projects/muc/.
31. Nguyen DPT, Matsuo Y, Ishizuka M (2007) "Exploiting syntactic and semantic information for relation extraction from Wikipedia," in IJCAI Workshop on Text-Mining & Link-Analysis (TextLink 2007)
32. Nguyen T-VT, Moschitti A, Riccardi G (2009) "Convolution kernels on constituent, dependency and sequential structures for relation extraction," in Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing 3:1378–1387
33. Ratnaparkhi A (1996) "A maximum entropy part-of-speech tagger," in Proceedings of the Empirical Methods in Natural Language Processing Conference
34. Reichartz F, Korte H, Paass G (2009) "Dependency tree kernels for relation extraction from natural language text," Machine Learning and Knowledge Discovery in Databases, pp. 270–285
35. Rosenblatt F (1958) The perceptron: a probabilistic model for information storage and organization in the brain. Psychol Rev 65(6):386–408
36. TAC, "Text Analysis Conference," 2012. [Online]. Available: http://www.nist.gov/tac/.
37. Yarowsky D (1995) "Unsupervised word sense disambiguation rivaling supervised methods," in Proceedings of the 33rd Annual Meeting on Association for Computational Linguistics, pp. 189–196
38. Yates A, Cafarella M, Banko M, Etzioni O, Broadhead M, Soderland S (2007) "TextRunner: open information extraction on the web," in Proceedings of Human Language Technologies: The Annual Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, pp. 25–26
39. Zelenko D, Aone C, Richardella A (2003) Kernel methods for relation extraction. J Mach Learn Res 3:1083–1106
40. Zhang M, Zhang J, Su J (2006) "Exploring syntactic features for relation extraction using a convolution tree kernel," in Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pp. 288–295
41. Zhang M, Zhang J, Su J, Zhou G (2006) "A composite kernel to extract relations between entities with both flat and structured features," in 21st International Conference on Computational Linguistics and 44th Annual Meeting of the ACL, pp. 825–832
42. Zhang M, Zhou G, Aiti A (2008) Exploring syntactic structured features over parse trees for relation extraction using kernel methods. Inf Process Manag 44(2):687–701
43. Zhao S, Grishman R (2005) "Extracting relations with integrated information using kernel methods," in ACL-2005
44. Zhou G, Su J, Zhang J, Zhang M (2005) "Exploring various knowledge in relation extraction," in ACL-2005
45. Zhou G, Zhang M, Ji D, Zhu Q (2007) "Tree kernel-based relation extraction with context-sensitive structured parse tree information," in The 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning

Sung-Pil Choi received his Ph.D. degree from the Department of Information and Communications Engineering at Korea Advanced Institute of Science and Technology (KAIST) in 2012. He works for Korea Institute of Science and Technology Information (KISTI) as a senior researcher. He has experience in big data processing/analytics and contents management. His research interests are in the areas of text mining, natural language processing, data analytics, and information/knowledge processing.


Seungwoo Lee received the B.S. degree from Kyungpook National University, Korea, in 1997. He received his M.S. and Ph.D. degrees in Computer Science and Engineering from POSTECH, Korea, in 1999 and 2005. Currently, he works as a senior researcher and the leader of the platform research group at Korea Institute of Science and Technology Information (KISTI), Korea. Previously, he studied and developed an information retrieval engine and a named entity recognizer based on Natural Language Processing (NLP) technologies. More recently, he has researched and developed Semantic Web-related tools such as a semantic repository and an inference engine. His current research interests include technology intelligence and the discovery of technology opportunities based on Text Mining and Semantic Web technologies.

Hanmin Jung works as a chief researcher and the head of the Dept. of S/W Research at Korea Institute of Science and Technology Information (KISTI), Korea. He received his M.S. and Ph.D. degrees in Computer Science and Engineering from POSTECH, Korea. His current research interests are intelligent services based on the Semantic Web and Text Mining, Human-Computer Interaction, and Natural Language Processing.


Sa-kwang Song received his B.S. degree in Statistics in 1997 and his M.S. degree in Computer Science in 1999 from Chungnam National University, Korea. He received his Ph.D. degree in Computer Science from Korea Advanced Institute of Science and Technology (KAIST), Korea. He joined Korea Institute of Science and Technology Information (KISTI), Daejeon, Korea, in 2010. He is currently a senior researcher in the Dept. of S/W Research. His current research interests include Text Mining, Natural Language Processing, Information Retrieval, the Semantic Web, and Health-IT.
