Mining methodologies from NLP publications: A case study in automatic terminology recognition

Available online at www.sciencedirect.com

Computer Speech and Language 26 (2012) 105–126

Aleksandar Kovačević a, Zora Konjović a, Branko Milosavljević a, Goran Nenadic b,∗

a Faculty of Technical Sciences, University of Novi Sad, Trg Dositeja Obradovića 6, 21000 Novi Sad, Serbia
b School of Computer Science, University of Manchester, 131 Princess Street, Manchester M1 7DN, UK

Received 31 May 2010; received in revised form 12 April 2011; accepted 5 September 2011; available online 10 September 2011

Abstract

The task of reviewing scientific publications and keeping up with the literature in a particular domain is extremely time-consuming. Extraction and exploration of methodological information, in particular, requires systematic understanding of the literature, but in many cases is performed within a limited context of publications that can be manually reviewed by an individual or group. Automated methodology identification could provide an opportunity for systematic retrieval of relevant documents and for exploring developments within a given discipline. In this paper we present a system for the identification of methodology mentions in scientific publications in the area of natural language processing, and in particular in automatic terminology recognition. The system comprises two major layers: the first layer is an automatic identification of methodological sentences; the second layer highlights methodological phrases (segments). Each mention is categorised in four semantic categories: Task, Method, Resource/Feature and Implementation. Extraction and classification of the segments is formalised as a sequence tagging problem and four separate phrase-based Conditional Random Fields are used to accomplish the task. The system has been evaluated on a manually annotated corpus comprising 45 full text articles. The results for the segment level annotation show an F-measure of 53% for identification of Task and Method mentions (with 70% precision), whereas the F-measures for Resource/Feature and Implementation identification were 61% (with 67% precision) and 75% (with 86% precision) respectively. At the document level, an F-measure of 72% (with 81% precision) for Task mentions, 60% (with 81% precision) for Method mentions, 74% (with 78% precision) for the Resource/Feature and 79% (with 81% precision) for the Implementation categories have been achieved. We provide a detailed analysis of errors and explore the impact that particular groups of features have on the extraction of methodological segments.
© 2011 Elsevier Ltd. All rights reserved.

Keywords: Information extraction; Methodology mining; Conditional Random Fields; Automatic terminology mining

This paper has been recommended for acceptance by 'Edward J. Briscoe'.
∗ Corresponding author. Current address: School of Computer Science, University of Manchester, Oxford Road, Manchester M13 9PL, UK. Tel.: +44 161 30 65936; fax: +44 161 30 61281.
E-mail addresses: [email protected] (A. Kovačević), ftn [email protected] (Z. Konjović), [email protected] (B. Milosavljević), [email protected] (G. Nenadic).

0885-2308/$ – see front matter © 2011 Elsevier Ltd. All rights reserved.
doi:10.1016/j.csl.2011.09.001

1. Introduction

Both the volume and availability of scientific publications are constantly increasing: for example, one biomedical publication is published approximately every two minutes (DeShazo et al., 2009; MEDLINE). The task of keeping up with the literature in a particular domain is typically done through so-called "strategic reading" (Renear and Palmer, 2009), where researchers skim a number of publications to identify those that address tasks, methods, resources and concepts of interest, and then read only a selection of those in detail. This typically raises the question of the number of 'false negative' manuscripts, i.e. the number of articles of interest that may have been missed when reviewing a specific task in a given field. Reviewing methodologies that have been used for a given task, in particular, is a time-consuming process that requires systematic knowledge and understanding of the literature. In many cases it is performed within a limited context of publications that can be manually reviewed by an individual or a group of researchers. Automated methodology identification could provide an alternative approach for improving the retrieval of relevant documents, for exploring developments within a discipline and for enabling identification of best and common practices within particular communities (Eales et al., 2008). Methodological information can be further used to explore patterns in the evolution of a particular domain and to identify "hot topics" and experts (Eales et al., 2008; Buitelaar and Eigner, 2009).

Systematic harvesting of methodological information from the scientific literature through natural language processing (NLP) is relatively new (Kappeler et al., 2008). Most of the work done in this area comes from experimental fields, in particular biology and medicine, where – given the volume and diversity of scientific output – it is extremely difficult to keep pace with the evolution of research methods. However, this is the case with many other domains: NLP itself witnesses a growing number of publications on diverse themes and incorporates an increasing number of methods from other fields. In this paper we focus on the extraction of methodologies from scientific publications in the field of Automatic Term Recognition (ATR). ATR combines methods from a variety of disciplines (including natural language processing, machine learning and statistics) and relies on diverse resources, features and software implementations. This makes ATR an interesting and challenging field for the exploration of methodology extraction.

We represent methodological information through four semantic categories: Task(s) that are addressed by a given methodology, Method(s) that are used, Resource(s) and Feature(s) that the methodology relies on, and its Implementation. We aim to highlight such information in the literature in two steps. In the first step, methodological sentences that present a manuscript's contributions are identified. The second step extracts and classifies methodology mentions into the four semantic categories. The latter step is the focus of this paper: we present a method for the identification and semantic categorisation of textual segments that disseminate methodological information. The method is applied to "methodological" sentences, which are recognised by building on approaches reported in related work (Teufel and Moens, 2002; Teufel, 1999). Here we focus only on sentences that convey methodological information that is part of the manuscript's contributions (we refer to these as Solution sentences). We apply machine learning in both steps, and aim for higher precision, whereas we hypothesise that (document-level) recall would be supported by potentially repeated mentions of methodological information in a given manuscript. To facilitate both the training and testing of the methods developed, we have created a manually annotated corpus comprising 45 full-text articles. Using a five-fold cross-validation process, the proposed method achieved an F-measure between 53% and 60% (with precision around 70%) for Task, Method and Resource/Feature mentions, and a 75% F-measure (with a precision of 86%) for the Implementation category. We also report evaluations with methodological information abstracted and "normalised" at the document level, with F-measures of 72% (with 81% precision) for Task mentions, 60% (with 81% precision) for Method mentions, 74% (with 78% precision) for the Resource/Feature and 78% (with 81% precision) for the Implementation categories.

The paper is organised as follows. Section 2 gives an overview of related work. The annotation guidelines and manually annotated gold-standard corpus are presented in Section 3. Section 4 provides an overview of the system and the features, whereas Section 5 describes the experimental setup used for performance evaluation, along with the results. Discussion and error analysis are given in Section 6. Finally, we conclude the paper and give an outline of topics for future work in Section 7.

2. Related work

Although methodology mining is a relatively new task in NLP, there have been several attempts that can be grouped into two categories. The first category refers to approaches that focus on the identification of sentences that describe methodologies; the second category includes attempts to extract structured information from the literature. We discuss these below.


2.1. Extracting methodological sentences

Several approaches have been reported on the identification of "methodological" sentences and zones from scientific articles, including both full-text manuscripts and abstracts. One of the first approaches to identify methodological sentences in full-text articles is the work by Teufel and Moens (2002) and Teufel (1999). They divide a paper into seven rhetorical zones in terms of "argumentation and intellectual attribution": Background (background knowledge), Aim (goal of research), Basis (specific other work that the presented approach is based on), Contrast (contrasting and comparison to other solutions), Other (specific other work), Textual (textual structure of the paper), and Own (own work including method, results, future work). Each sentence in a document is represented by a set of features that include the following seven types:

• the sentence's location in relation to consecutive segments that the paper is divided into (e.g. 'A' if it belongs to the first segment, 'B' if it belongs to the second, etc.);
• the paper's structure in which the sentence appears (relative position within the section and paragraph, and the type of headline of the current section);
• the sentence's length (a binary feature that indicates if the sentence is longer than a certain threshold in words);
• content features (presence of "significant terms" as determined by the tf*idf measure; presence of the words from the paper and section titles);
• verb syntax (voice and tense of the first finite verb in the sentence and the presence of an auxiliary or modal verb that modifies the first finite verb);
• citation information (presence of a citation or the name of an author contained in the reference list; presence of self-citation; relative location of the citation in the sentence), and
• meta-discourse (e.g. the type of formulaic expression; type of agent, type of action and presence of negation).

Using a Naive Bayes (NB) classifier, Teufel and Moens have experimented with a corpus from computational linguistics, and achieved varying F-measures, ranging from 26% for the Contrast zone to 86% for the Own category (which includes methodological expressions). Mizuta and Collier (2004) and Mizuta et al. (2006) extended this work by allowing shallow nesting of zones. They have divided zones into three groups: (1) background information, including the problems that authors solve, method, result, insight, implication, and else; (2) connection or difference between findings; (3) outline of the paper. The method has been implemented using support vector machines (SVM) and NB, and applied in the domain of molecular biology (Mullen et al., 2005). The authors reported an F-measure of 81% for NB and 87% for SVM for the Method category. On the other hand, Wilbur and colleagues use five dimensions: Focus, Polarity, Certainty, Evidence, and Directionality in order to characterise text in the sense of users' information needs (Wilbur et al., 2006; Shatkay et al., 2008, 2010). The Focus category, in particular, contains scientific facts, experimental methodology, or general knowledge.

Several approaches also attempted the segmentation of research abstracts into zones that correspond to the standard structure of reporting research findings (introduction, methods, results and conclusion). For example, Ruch et al. (2007) use an NB classifier, word-based features and sentence position to classify abstract sentences into one of four classes: Purpose, Method, Result, and Conclusion. Similar work, relying on an SVM classifier, has been reported by McKnight and Srinivasan (2003), with an F-measure of 82% for the Methods category. SVM classifiers have also been used in Shimbo et al. (2003), Ito et al. (2004) and Yamamoto and Takagi (2005), whereas Wu et al. (2006) and Lin et al. (2006) use Hidden Markov Models (HMM) to categorise sentences in abstracts. The reported F-measure for the Method category varies between 50% and 85%.

Kenji et al. (2008) formalise sentence categorisation as a sequence labelling task and use Conditional Random Fields (CRFs) to label a sentence with one of the following categories: objective, methods, results and conclusions. Three groups of features are used to represent a sentence: content (n-grams), relative sentence location and features from the surrounding sentences. Performance of 95.5% per-sentence accuracy and 68.8% per-abstract accuracy (an abstract is considered correct if all constituent sentences are correctly labelled) is reported. CRFs are also used by Lin et al. (2006) to retrieve and categorise sentences from randomised controlled trials and by Chung (2009) to identify the result and conclusion sections of biomedical abstracts. In all cases, the reported F-measure is between 80% and 93%.


2.2. Extraction of methodological information

There have been few attempts to extract structured information about the methodology used in a given paper. Kappeler et al. (2008), for example, used rule- and dictionary-based pattern matching and statistical filtering applied to the biomedical literature to identify mentions of experimental methods by which an interaction between two proteins has been verified (as reported in the literature). The task was to identify method mentions from a pre-specified set of interaction types as defined by a controlled vocabulary. A set of hand-crafted rules was created for the recognition of methods that were most frequent in the training data, with weights based on relative frequency of the method in the training data and precision and recall of all the rules used for the extraction of that method against the training set. The authors reported an F-measure of 45%.

Eales et al. (2008) conducted a study to identify "best practices" in the field of molecular phylogenetics by focusing on the methods used to perform experiments. The authors use a corpus of full-text scientific papers and extract methodologies from each of the articles. They define a methodology as a protocol that consists of four phases that are specific to phylogenetics. Each of the methodological phases is extracted using a set of hand-crafted regular expressions and a manually created controlled vocabulary of important names and terms. The authors reported an F-measure of 87.7% for protocol identification.

2.3. CRF-based tagging

Conditional Random Fields are discriminatively trained undirected graphical models, a special case of which is a linear chain that corresponds to a conditionally trained finite-state machine (Settles, 2004). Unlike other graphical models (such as HMMs) that require a stringent conditional independence assumption, one of the key advantages of CRF models is their flexibility in capturing non-independent features, such as capitalisation, suffixes and surrounding words. Besides the previously described approaches that use CRFs for sentence classification, this technique has been successfully applied to many other tasks, such as the identification of various named entity classes (Settles, 2005; McDonald and Pereira, 2005; Tsai et al., 2005; Sarafraz et al., 2009).

These approaches generally treat a sentence as a sequence of word tokens. Yang et al. (2009), on the other hand, used a phrase-based CRF model for the identification and tagging of transcription factor mentions in the biomedical literature. The model is applied to sentences that were automatically pre-classified as containing mentions of the entities of interest. The method assigns linguistic, domain-specific and context features to each phrase token, including noun and verb phrases. An F-measure of 51.5% with a precision of 62.5% was reported.

Our approach focuses on the identification of mentions of methodological information in an ATR corpus. We use a two-step approach that first identifies methodological sentences, and then uses a set of phrase-based CRFs to annotate mentions of four methodological semantic categories (tasks, methods, resources, implementations). The corpus used in this work is explained next.

3. Gold-standard corpus

We have collected a corpus of 110 publicly available full-text articles from the field of general ATR by searching via the Web and manually scanning papers from the meetings of the Association for Computational Linguistics (ACL), the Association for Computing Machinery (ACM) and the Conference on Computational Linguistics (COLING) between 1992 and 2008. From this set, we have randomly chosen 45 full-text articles (8551 sentences; 133,280 words) as a gold-standard corpus, and manually annotated them using Callisto. The papers were annotated at three levels. At the sentence level, we have categorised sentences into one of the seven categories as defined in Teufel and Moens (2002) and Teufel (1999): Background, Aim, Basis, Contrast, Other, Textual, and Own (see Section 3.1). Sentences referring to methodological information (i.e. belonging to a sub-class of the Own category) have been annotated at the segment level with four semantic categories (Task, Method, Resource/Feature, or Implementation; see Section 3.2). Finally, each article has been annotated at the document level to show a summary of Tasks, Methods, Resources/Features and Implementations that have been reported in it (see Section 3.3).

All annotation tasks have been performed by the first author. In order to measure the objectivity of the annotations, parts of the corpus were double-annotated by an independent qualified annotator (a PhD student) and the Inter-Annotator Agreement Assessment (IAAA) was estimated as reported below.

3.1. Sentence-level annotation

Each sentence has been classified in one of the seven categories (Background, Aim, Basis, Contrast, Other, Textual, and Own) as defined in Teufel and Moens (2002) and Teufel (1999). The annotation scheme has been expanded to further subcategorise the Own category, which includes methodological sentences and any other sentences that refer to a paper's contributions, such as results and future work. In order to focus on methodological sentences, we have introduced three new subcategories of the Own category:

• Solution – sentences that describe the methodology used in the paper;
• Result – sentences that contain the results presented in the paper;
• Own else – any other Own sentence that cannot be categorised as Solution or Result.

Each sentence can belong to exactly one category, or – in the case of the Own category – to exactly one subcategory. Tables 1 and 2 present the detailed statistics of the sentence-level annotations. We note that almost two-thirds of sentences have been categorised as Own, while Solution sentences comprise as much as 41% of all sentences in the full-text document corpus.

In order to assess the objectivity of the annotations, five articles (11% of the corpus) were randomly chosen and annotated by the independent annotator. An IAAA of 0.71 was calculated using the unweighted Cohen kappa statistic (Cohen, 1960), which was considered to be substantial agreement among annotators (Artstein and Poesio, 2008). We have therefore concluded that our corpus can be used as a gold standard for sentence identification.
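
The kappa value reported above can be reproduced with an off-the-shelf implementation. The following is a minimal sketch, not part of the original pipeline: it assumes that the two annotators' per-sentence labels are available as parallel Python lists, and the label values shown are illustrative.

    from sklearn.metrics import cohen_kappa_score

    # Parallel lists of sentence-level labels from the two annotators
    # (toy values; the study used the seven Teufel-style categories plus
    #  the three Own subcategories described above).
    annotator_a = ["Background", "Own-Solution", "Own-Solution", "Aim", "Other"]
    annotator_b = ["Background", "Own-Solution", "Own-Result",   "Aim", "Other"]

    # Unweighted Cohen kappa over the doubly annotated sentences.
    kappa = cohen_kappa_score(annotator_a, annotator_b)
    print(f"Cohen kappa: {kappa:.2f}")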

3.2. Annotation of methodological segments

A sub-corpus for the annotation of methodological segments has been formed from the sentences that were assigned to the Solution category during the manual annotation of sentences. These sentences have been annotated for segments that represent methodological information. We consider four semantic categories that characterise methodological descriptions in the area of natural language processing (and computer science in general):

• Task specifies the main undertaking(s) that authors aim to perform as a part of their exploration. For example, tasks are term recognition, ontology construction, term classification, document clustering, etc.

• Method refers to the means and approaches used to perform the tasks. A task can be performed by various methods or a combination of methods. For example, a classification task can be performed using Decision Tree or Naive Bayes methods, while clustering can be done using k-means or complete-link hierarchical clustering methods, etc. Note that, depending on the context, a term can be either a task or a method. For example, a paper can present a new method for dictionary-based look-up, in which case dictionary-based look-up is a task; another publication can use dictionary-based look-up as a method to perform the task of Automatic Term Recognition.

Table 1
The statistics for the seven sentence categories (as defined in Teufel and Moens, 2002; Teufel, 1999) from the gold-standard corpus.

                      Background   Aim       Basis     Contrast  Other      Textual   Own         Total
Number of sentences   1201 (15%)   170 (2%)  112 (1%)  186 (3%)  959 (11%)  341 (3%)  5542 (65%)  8551 (100%)

Table 2
The statistics for the subcategorisation of the Own category (the gold-standard corpus).

                      Solution     Result     Own else     Total
Number of sentences   3513 (63%)   676 (13%)  1353 (24%)   5542 (100%)


Fig. 1. Examples of sentences with annotated segments.

• Implementation indicates names of software tools or services that authors use to realise the methods. For example, a paper can report using LibSVM to perform SVM-based classification, or ABNER for biomedical named-entity extraction. We also include programming languages and environments in this class (e.g. Perl, Eclipse).

• Resources/Features represent specific resources (e.g. ontologies, corpora, databases) and features (e.g. noun phrase, frequency, alignment score, termhood, rules) that authors use or define as part of their exploration. We also include evaluation and similarity metrics in this category (e.g. precision, recall and F-measure).

The manual annotation exercise aimed at the identification and classification of relevant methodological segments in text as defined by this schema. Each textual segment can be assigned to exactly one of the four categories. Any part of a sentence can be classified in one of the categories, with two constraints: firstly, an annotation segment must be contained within one sentence; secondly, a tagged segment should include only whole phrases as provided by pre-processing (it can include one or more phrases). There are two reasons for the latter constraint: (a) having phrases as segment-building blocks facilitates easier estimation of the inter-annotator agreement; (b) in order to provide a training set for phrase-based machine learning, each shallow-parsed phrase needs to be categorised in exactly one category. In the case where an annotated segment is smaller than the shallow-parsed phrase in which it is embedded, the whole phrase is used as a tagging unit. If a segment contains more than one phrase, then all phrases are tagged with the same tag. Examples of sentences with annotated segments are given in Fig. 1, and annotation statistics are given in Table 3.
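
The phrase-snapping constraint can be illustrated with a short sketch. The helper and the data structures below are ours for illustration only (the actual annotation was performed in Callisto); the sketch assumes phrases are given as character offsets, and expands an annotated span so that every overlapped phrase receives the segment's category.

    # Minimal sketch of the phrase-snapping constraint: an annotated character
    # span is expanded so that every shallow-parsed phrase it overlaps is
    # labelled with the segment's category. Illustrative data structures only.

    def snap_to_phrases(phrases, span, category):
        """phrases: list of (start, end, text); span: (start, end) of the annotation."""
        labels = []
        for start, end, text in phrases:
            overlaps = not (end <= span[0] or start >= span[1])
            labels.append((text, category if overlaps else "O"))
        return labels

    phrases = [(0, 10, "The corpus"), (11, 14, "was"), (15, 24, "processed")]
    # The annotator marked only "corpus" (chars 4-10); the whole phrase is tagged.
    print(snap_to_phrases(phrases, (4, 10), "Resource/Feature"))
    # [('The corpus', 'Resource/Feature'), ('was', 'O'), ('processed', 'O')]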

We note that, on average, more than 150 segments have been annotated per full-text document. To estimate the inter-annotator agreement, 10% of the Solution sentences (350 examples) were double-annotated by the independent annotator. Given that (except for the aforementioned constraints) the annotators have the freedom of determining the size, content and category of a methodological segment, a unifying annotation unit needs to be determined in order to calculate the kappa statistic. Here we used the shallow-parsed phrase (as returned by the Stanford parser (Klein and Manning, 2003a,b), see below) as the basic annotation unit. The unweighted Cohen kappa statistic was 0.68. Given that this annotation task is more complicated than the previous one, the obtained value was considered high enough to use this corpus as a gold standard with methodological segments (Artstein and Poesio, 2008).

Table 3
The statistics for the manually annotated methodological segments (the gold standard).

                                          Task    Method   Resource/Feature   Implementation   Total
Number of segments                        1610    1051     4576               172              11,158
Average length of segments in words       3.09    2.76     2.34               1.43             2.4
Average length of segments in phrases     2.06    1.76     1.35               1.14             1.58
Average number of segments per document   35.78   23.36    101.69             3.82             156.64


Table 4
The statistics for the document-level annotations (the gold standard).

                       Task    Method   Resource/Feature   Implementation   Total
Number of mentions     645     552      1511               63               2748
Average per document   14.33   12.26    33.57              1.40             15.39
Standard deviation     4.61    5.15     9.77               1.73             5.31



3.3. Annotation at the document level

Each of the 45 full-text papers has been additionally annotated for reported Tasks, Methods, Resources/Features and Implementations at the document level by grouping the methodological information presented in a given paper. Several mentions of the same methodological segments have been "normalised" and collapsed together, so that document summaries contain only unique methodological concepts mentioned in a given document. Table 4 presents statistics for the document-level annotations.

We note that, on average, more than 14 tasks and 12 methods have been identified as being discussed in a paper, along with more than 33 resources and features.
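
The document-level grouping in the gold standard was performed manually. Purely as an illustration of the kind of collapsing involved, the sketch below deduplicates extracted mentions per category after a simple string normalisation; the normalisation rule and the helper are our assumptions, not the procedure used to build the corpus.

    # Illustrative only: collapse repeated mentions into one document-level
    # concept per category using whitespace/case normalisation.
    from collections import defaultdict

    def document_summary(mentions):
        """mentions: list of (category, text) pairs extracted from one paper."""
        summary = defaultdict(set)
        for category, text in mentions:
            summary[category].add(" ".join(text.lower().split()))
        return {cat: sorted(items) for cat, items in summary.items()}

    mentions = [("Task", "term recognition"), ("Task", "Term  Recognition"),
                ("Method", "c/nc-value method"), ("Implementation", "Perl")]
    print(document_summary(mentions))
    # {'Task': ['term recognition'], 'Method': ['c/nc-value method'], 'Implementation': ['perl']}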

4. Mining methodological segments – system overview

The system comprises two major steps. The first step is the automatic identification of methodological sentences in a paper. In the second step, the system extracts and classifies methodology segments in all methodological sentences. The classification of segments into the four semantic categories is accomplished by four separate CRF models. The overview of the system is given in Fig. 2.

Fig. 2. The system overview.

4.1. Sentence categorisation

The idea of extracting target information only from certain sections of scientific papers has been applied previously (cf. Teufel and Moens, 2002; Mizuta and Collier, 2004; Mizuta et al., 2006; Mullen et al., 2005; Wilbur et al., 2006; Shatkay et al., 2008, 2010; Yang et al., 2008; Teufel, 1999). We have followed that idea to identify sentences that contain methodological information before extracting methodological segments. In our approach, we have used the method suggested by Teufel and Moens (2002) and Teufel (1999): each sentence is represented as a feature vector containing the features explained in Section 2.1. A classifier is then trained to classify sentences into one of the seven classes (Background, Aim, Basis, Contrast, Other, Textual, Own). Given that we have added three subcategories to the Own category, we used a two-level classification procedure: the first level classifies sentences into the original seven categories, and the classifier at the second level is then used to further classify sentences from the Own category into the Solution, Result or Own else category. The classifier at the second level uses the same set of features as the one at the first level. Feature extraction was performed using the LT-TTT2 software, RapidMiner's text mining plugin (Mierswa et al., 2006) and some tailor-made feature extraction. We experimented with various classification methods including SVM, k-nearest neighbours, Decision trees, and Naïve Bayes.
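
The two-level procedure can be sketched as two chained classifiers. The snippet below uses scikit-learn rather than RapidMiner, and bag-of-words tf*idf features with toy sentences stand in for the richer feature set of Section 2.1, so it illustrates the control flow only; all names and data are placeholders.

    # Sketch of the two-level sentence classification (illustrative features/data).
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    sentences = ["We present a method for term recognition .",
                 "Previous work used mutual information .",
                 "Our classifier achieved 83% F-measure ."]
    level1_labels = ["Own", "Other", "Own"]        # seven-way classification in the full system
    level2_labels = ["Solution", None, "Result"]   # defined only for Own sentences

    # Level 1: seven-category classifier (here reduced to two classes for brevity).
    level1 = make_pipeline(TfidfVectorizer(), MultinomialNB()).fit(sentences, level1_labels)

    # Level 2: trained only on Own sentences, predicts Solution / Result / Own else.
    own_idx = [i for i, lab in enumerate(level1_labels) if lab == "Own"]
    level2 = make_pipeline(TfidfVectorizer(), MultinomialNB()).fit(
        [sentences[i] for i in own_idx], [level2_labels[i] for i in own_idx])

    test = "We describe an approach to term classification ."
    if level1.predict([test])[0] == "Own":
        print(level2.predict([test])[0])   # e.g. 'Solution'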

4.2. Extraction of methodological segments

Sentences that have been classified as belonging to the Solution category were then processed to identify methodological segments. This task is construed as a sequence tagging problem. We chose CRFs for this task because, among a large variety of approaches to entity tagging, they have been shown to be useful for modelling dependencies between constituents (e.g. in protein mention detection, Yeh et al., 2005; Wilbur et al., 2007). However, unlike CRF methods that treat a sentence as a word sequence, our model is based on shallow-parsed phrases (Yang et al., 2009). Each sentence is represented as a sequence of shallow-parsed phrase segments obtained by the Stanford parser (Klein and Manning, 2003a,b). Before parsing, sentences that contained bullet lists were separated using heuristics into separate clauses that were treated as separate sentences. Sentences that could not be parsed were not processed further (i.e. they are ignored): in the gold-standard set, only 3 out of 3513 sentences were not "parsable". For all other sentences, the results of the Stanford parser were used as is, i.e. no post-processing was done. However, words not bearing content (such as determiners, auxiliaries, modals and adverbs) have been filtered out. Each remaining phrase is labelled with a set of features as described below:

• Lexical features represent grammatical information assigned to the phrase and include:
  ◦ phrase (PH): the surface expression of the phrase;
  ◦ normalised phrase (NPH): the morphologically and derivationally normalised phrase, as obtained by the Stanford parser;
  ◦ phrase type (PHT): as returned by the Stanford parser (e.g. noun phrase (NP), verb phrase (VP), prepositional phrase (PP)).

• Syntactic features are engineered from specific relations in which the phrase is a governor1 or a dependant2, as returned by the Stanford parser; in cases where there are several relations, the associated names are alphabetically sorted and merged (e.g. "xcomp" and "nsubj" are combined as "nsubj xcomp"). Syntactic features therefore include:
  ◦ governor (GOV): syntactic relations in which the phrase is the governor;
  ◦ dependant (DEP): syntactic relations in which the phrase is the dependant.

• The semantic feature (ACT) refers to the category of the verb (if the phrase contains a verb); it is determined from the Action lexicon (Teufel, 1999); otherwise the value is 'O'. The Action lexicon is based on argumentative moves (Teufel, 1999) and contains semantic classifications (e.g. presentation, solution, research) of verbs. For example, verbs such as measure, classify and count are tagged as belonging to the research category, whereas perform and provide belong to the solution category.

• The citation feature (CIT) represents the type of citation, as defined in Teufel and Moens (2002) and Teufel (1999), that is contained in the following phrase. Many tasks, methods, resources, etc. are followed by citations and therefore we hypothesised that this feature may be useful for the identification of methodological segments. We use three values (self-citation, citation, author name) as defined in Teufel (1999). A similar feature was used to classify the sentences.
• A set of normalised frequency features (see below) models the distributional statistics of the given phrase in the training set. We removed the stop words from the phrase and calculated the average of the frequencies of all its words.3 The frequency features include:

  ◦ the frequency of the phrase annotated as Task in the training corpus (FT);
  ◦ the frequency of the phrase annotated as Method in the training corpus (FM);
  ◦ the frequency of the phrase annotated as Resource/Feature in the training corpus (FR);
  ◦ the frequency of the phrase annotated as Implementation in the training corpus (FI).

• Label: either Task, Method, Resource/Feature or Implementation (depending on the semantic type), or 'O' if the phrase is not categorised as belonging to the given category for which the CRF is built (these labels have been taken from the gold-standard data).

An example of the features engineered for the sentence "The corpus was processed by using the c/nc-value method [2] for term recognition" is given in Table 5.
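
To make the tabular representation concrete, the sketch below serialises one phrase token into a single line of a CRF++ training file with the columns of Table 5. The helper, the tab-separated layout and the underscore-joining of multi-word values are our assumptions for illustration; in the system the values come from the Stanford parser, the Action lexicon, the citation detector and training-set frequency counts.

    # Sketch: one phrase token as a line of a CRF++ training file (columns as in Table 5).
    def to_crf_line(ph, nph, pht, gov, dep, act, cit, ft, fm, fi, fr, label):
        # Multi-word values are joined with '_' so that each column stays one token.
        cols = [ph, nph, pht, gov, dep, act, cit, str(ft), str(fm), str(fi), str(fr), label]
        return "\t".join(c.replace(" ", "_") for c in cols)

    line = to_crf_line("the c/nc-value method", "c/nc-value method", "NP",
                       "amod appos prep", "dobj", "O", "O", 0, 6, 0, 0, "Method")
    print(line)
    # the_c/nc-value_method  c/nc-value_method  NP  amod_appos_prep  dobj  O  O  0  6  0  0  Method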

The engineered features represent both the content and context information of the phrase tokens. This model was used to build feature templates for each of the four CRFs separately. Each line in a CRF template file is described with a macro %x[row,col], where row is the relative position from the current tagging focus, and col is the absolute position of the column in the input CRF data file, as explained in detail in (Yang et al., 2009; Liu et al., 2010). By specifying different values for the macros, we designed features that make use of phrase properties of a tagging candidate and the surrounding phrases. Overall, 108 templates4 have been created. Table 6 gives examples to illustrate a variety of features based on the CRF data file example shown in Table 5.

1 The governor of a phrase is the element that determines the syntactic function of the whole phrase.
2 The dependant of a phrase is any element in a phrase that does not refer to the same entity that the whole phrase refers to.
3 We experimented with calculating the frequencies for the whole phrases, but that yielded worse performance as expected, given that such phrases were sparse.
4 All templates are available at: http://www.informatika.ftn.uns.ac.rs/AleksandarKovacevic/MethodologyExtraction.


Table 5
Extracted features for the sentence "The corpus was processed by using the c/nc-value method [2] for term recognition.".

Phrase token (PH)      NPH                PHT  GOV                     DEP        ACT  CIT       FT  FM  FI  FR   Label
the corpus             corpus             NP   O                       nsubjpass  O    O         0   0   0   140  Resource/Feature
was                    be                 VP   O                       auxpass    O    O         0   0   0   0    O
processed              process            VP   auxpass nsubjpass prep  O          O    O         24  11  0   0    O
by                     by                 PP   pcomp                   O          O    O         0   0   0   0    O
using                  use                VP   dobj prep               pcomp      USE  O         0   0   0   0    O
the c/nc-value method  c/nc-value method  NP   amod appos prep         dobj       O    O         0   6   0   0    Method
[                      [                  O    O                       aposs      O    O         0   0   0   0    O
2                      2                  NP   O                       aposs      O    AUTO CIT  0   0   0   0    O
]                      ]                  O    O                       aposs      O    O         0   0   0   0    O
for                    for                PP   prep                    pobj       O    O         0   0   0   0    O
term recognition       term recognition   NP   O                       pobj       O    O         98  0   0   0    Task
.                      .                  O    O                       O          O    O         0   0   0   0    O

Table 6
An example of the local and context features for the sentence in Table 5.

Template   Feature value          Feature type
%x[0,1]    the c/nc-value method  Lexical, local
%x[0,3]    NP                     Lexical, local
%x[2,8]    AUTO CIT               Type of citation, context
%x[−1,9]   pcomp                  Syntactic, context

The feature set consists of the local features that describe the candidate's own properties (e.g. lexical properties of the phrase), and context features that incorporate the properties of the neighbouring phrases. Experiments were performed with various context window sizes, ranging from 1 to 3 phrases left and right.
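
For illustration, a minimal CRF++ template file in the %x[row,col] notation described above might be generated as follows. The unigram identifiers and the column indices are placeholders loosely modelled on Table 6, not the 108 templates actually used (those are published at the URL in footnote 4).

    # Sketch of writing a small CRF++ feature template (illustrative indices only).
    template_lines = [
        "U00:%x[0,1]",            # a lexical column of the current token (cf. %x[0,1] in Table 6)
        "U01:%x[0,3]",            # phrase-type column of the current token (cf. %x[0,3] in Table 6)
        "U02:%x[-1,3]/%x[0,3]",   # previous and current phrase types combined (context feature)
        "U03:%x[2,8]",            # citation column two tokens to the right (cf. %x[2,8] in Table 6)
        "B",                      # bigram feature over the output labels
    ]
    with open("crf_template.txt", "w") as f:
        f.write("\n".join(template_lines) + "\n")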

5. Experiments and results

In order to evaluate the performance of the proposed system, experiments were performed on the gold-standard corpus (see Section 3). We used a standard five-fold cross validation approach in which, for each iteration, we train on 80% and test on 20% of the gold-standard data. This setting was used to evaluate the performances of both the sentence classification task and the methodological segment categorisation task. Additionally, we have performed document-level evaluation as explained below.

The performance of each task is measured in terms of precision (P), recall (R) and F-measure (F), which are defined as follows:

P = TP / (TP + FP),    R = TP / (TP + FN),    F-measure = 2PR / (P + R)

where TP, FP and FN denote true positives, false positives and false negatives, respectively. While the interpretation of TP, FP and FN for the sentence classification task is straightforward, determining these for the segment categorisation task needs to be explicitly defined. In this case, given that the minimal annotation unit is a sentence segment, the definition of TP, FP and FN needs to be based on the annotated segment level. Two situations can arise (note that each CRF generates labels only for one category):

• The annotated segment is (part of) exactly one phrase. A TP is generated if the true category of the segment matches the category returned by the CRF model. If the segment belongs to the given category while the predicted category is O, an FN is generated. An FP is produced in the case when the category of the segment is O, while it is predicted to be of the given category.
• The segment spreads over two or more phrases. In this case, the predicted category of the segment is determined by the unanimous vote of the predicted categories of the phrases that constitute that segment. This means that a TP is generated only if all of the phrases within the segment have the same predicted category (see Table 7 for an example), otherwise an FN is generated (see Table 8). In the case when the segment has the category O, and at least one of the phrases has a different predicted category, an FP is generated (see Table 9); these counting rules are sketched in code after this list.
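
The counting rules above can be summarised in a few lines. The function and the toy segments below are illustrative and cover a single category, mirroring the fact that each CRF labels phrases either with its own category or with 'O'.

    # Sketch of the segment-level TP/FP/FN counting for one semantic category.
    def score_segments(segments, category):
        """segments: list of (true_label, [predicted_label per phrase])."""
        tp = fp = fn = 0
        for true_label, phrase_predictions in segments:
            # Unanimous vote: the segment is predicted as the category only if
            # every constituent phrase was predicted as that category.
            predicted = category if all(p == category for p in phrase_predictions) else "O"
            if true_label == category and predicted == category:
                tp += 1
            elif true_label == category and predicted == "O":
                fn += 1
            elif true_label == "O" and any(p == category for p in phrase_predictions):
                fp += 1
        return tp, fp, fn

    segments = [("Method", ["Method", "Method"]),       # all phrases correct -> TP
                ("Method", ["Method", "O", "Method"]),  # one phrase missed   -> FN
                ("O",      ["O", "Method"])]            # spurious phrase     -> FP
    tp, fp, fn = score_segments(segments, "Method")
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    print(tp, fp, fn, precision, recall)   # 1 1 1 0.5 0.5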


Table 7
Examples of TPs. The three phrases in segment 2 and one in segment 4 have correct predicted categories, so a TP is generated for each of these two segments.

Segment  Phrase            True category     Predicted category
1        The cls measure   O                 O
1        was               O                 O
1        tested            O                 O
1        on                O                 O
2        a corpus          Resource/Feature  Resource/Feature
2        of                Resource/Feature  Resource/Feature
2        2008 abstracts    Resource/Feature  Resource/Feature
3        retrieved         O                 O
3        from              O                 O
4        medline database  Resource/Feature  Resource/Feature
5        .                 O                 O


We have also evaluated the performance of our four CRF models at the phrase level. In this case, each phrase is treated separately (independently of the segment to which it belongs), and TPs, FPs and FNs are determined in the usual manner.

For the document-level evaluation exercise, all methodological mentions (both in the gold standard and in the results generated by our system) were considered as a set of "normalised" mentions (for each document separately) and the corresponding sets have been compared manually by the authors. A TP represents a case where a system-generated segment matches a manually generated annotation (of the same type). An FP is generated if there is not a suitable match for a system-obtained segment, whereas an FN is generated for each gold-standard document-level annotation that has not been identified by the system. The results were then micro-averaged (the results for all documents grouped together) and macro-averaged (an average is calculated as the mean of document-level values for P, R and F-measure) for each of the four classes considered.
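
The micro and macro averaging described above can be sketched as follows; the per-document counts are toy values and the helper names are ours.

    # Sketch of micro vs. macro averaging of document-level results for one category.
    # per_doc holds (TP, FP, FN) counts for each document (illustrative values).
    per_doc = [(10, 2, 3), (4, 1, 4), (7, 3, 1)]

    def prf(tp, fp, fn):
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        return p, r, f

    # Micro: pool the counts over all documents, then compute P, R, F once.
    micro = prf(*map(sum, zip(*per_doc)))

    # Macro: compute P, R, F per document, then take the mean of each value.
    per_doc_scores = [prf(*counts) for counts in per_doc]
    macro = tuple(sum(vals) / len(vals) for vals in zip(*per_doc_scores))

    print("micro P/R/F:", micro)
    print("macro P/R/F:", macro)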

5.1. Sentence classification results

Sentence classification experiments were performed in RapidMiner (Mierswa et al., 2006). We experimented with a number of different classification methods (including SVM, k-nearest neighbours, Decision trees, and Naïve Bayes).

Table 8
An example of FN. Since the phrases "concordances" and "automatically recognized terms" in segment 2 have the predicted category O, the whole segment is predicted as O and an FN is generated for that segment.

Segment  Phrase                          True category  Predicted category
1        First                           O              O
1        ,                               O              O
1        we                              O              O
2        collect                         Method         Method
2        concordances                    Method         O
2        for                             Method         Method
2        all                             Method         Method
2        automatically recognized terms  Method         O
3        .                               O              O


Table 9
An example of FP. The phrases "natural language processing" and "noun-phrase chunking" (segment 3) have been annotated as O, but their predicted category is Task, so an FP is generated for this segment.

Segment  Phrase                       True category  Predicted category
1        Besides                      O              O
1        ,                            O              O
1        this dedicated approach      O              O
2        to determine                 Task           Task
2        the unithood                 Task           Task
2        of                           Task           Task
2        word sequences               Task           Task
3        will prove to be             O              O
3        invaluable                   O              O
3        to                           O              O
3        other areas                  O              O
3        in                           O              O
3        natural language processing  O              Task
3        such as                      O              O
3        noun-phrase chunking         O              Task
3        and                          O              O
3        named-entity recognition     O              O
4        .                            O              O

Since the Naïve Bayes approach gave the best performance (data for other approaches not shown), we performed feature selection for it in RapidMiner using the backward selection algorithm.5

The results for the identification of sentences belonging to the Own category were comparable to those reported in the original paper by Teufel and Moens (2002): precision was 80%, recall 85%, and F-measure 83% (compared to 84%, 88% and 86%, respectively, as reported in Teufel and Moens, 2002). We note that a baseline classifier that classifies each sentence into the Own category would provide an F-measure of 78%, with recall of 100% and precision of 65%. We performed a pairwise t-test, which showed that there was a statistically significant difference between the two classifiers (p = 0.004). However, the relatively low precision of 65% for the baseline classifier, and the fact that methodological information would also be extracted from Background, Contrast or Other sentences (whereas we are focused only on methods used in the given paper), could result in potentially misleading methodological information. Therefore, the precision of sentence identification is critical, since we are focused on extracting only the methodology presented in the given paper. Furthermore, our initial hypothesis was that the methodological information that the authors used is likely to appear elsewhere in the document (see Section 5.3), so that we could address the problem of coverage by harvesting repeated methodological information.

Within the Own sentences, our second-step classifier achieved a 78% F-measure (74% precision and 82% recall) for the identification of Solution sentences. When using a baseline classifier that classifies each Own sentence into the Solution category, an F-measure of 77% would be obtained (with recall of 100% and precision of 63%). Although there are no statistically significant differences between the two classifiers, the proposed classifier still provides better precision (an extra 11%), which is more important for the identification of the specific contributions of the given paper.
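
The significance test mentioned above can be reproduced with a paired t-test. The sketch below uses SciPy and assumes the comparison is made over per-fold F-measures; the per-fold values shown are invented placeholders, not the study's results.

    # Sketch of a paired (pairwise) t-test comparing a classifier with a baseline.
    from scipy.stats import ttest_rel

    classifier_f = [0.82, 0.84, 0.83, 0.81, 0.85]   # five cross-validation folds (toy values)
    baseline_f   = [0.78, 0.79, 0.77, 0.78, 0.78]

    t_stat, p_value = ttest_rel(classifier_f, baseline_f)
    print(f"t = {t_stat:.2f}, p = {p_value:.4f}")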

5.2. Results for the extraction of methodological segments

Pre-processing of Solution sentences was performed by the Stanford parser (Klein and Manning, 2003a,b) and the CRF models were built using CRF++. A separate CRF is trained and evaluated for each of the four categories. We have experimented with a single multi-label CRF, but it resulted in lower performance (data not shown).

The performance of the CRF models, formed using all the features described in Section 4.2, for all four categories (Task, Method, Resource/Feature and Implementation), calculated at the segment level, is given in Table 10.

5 The resultant feature subsets are available at: http://www.informatika.ftn.uns.ac.rs/AleksandarKovacevic/MethodologyExtraction.


Table 10
Segment-level performance of the CRF classifiers for the four categories.

Task                      Method                    Resource/Feature          Implementation
P       R       F         P       R       F         P       R       F         P       R       F
0.6959  0.4325  0.5335    0.7046  0.4251  0.5303    0.6761  0.5539  0.6089    0.8566  0.6774  0.7565

Table 11
Phrase-level performance of the CRF classifiers for the four categories.

Task                      Method                    Resource/Feature          Implementation
P       R       F         P       R       F         P       R       F         P       R       F
0.7300  0.4709  0.5725    0.6964  0.4296  0.5314    0.6556  0.5393  0.5918    0.8671  0.6832  0.7643


The overall results for the segment-level evaluation show that the best performance can be achieved for the Implementation category (76% F-measure), followed by Resource/Feature (61% F-measure). Method and Task proved to be challenging (F-measure of 53%).

The performances calculated at the phrase level are given in Table 11. As expected, the results are better, but only slightly compared to those achieved at the segment level. For example, the phrase-level F-measure for the Task category increased by 4%, whereas the F-measures for the other categories were not significantly different.

We note that a baseline classifier that would classify each segment as belonging to a given methodological category would achieve an F-measure of 20% (R = 100%, P = 14.42%) for Task, 17.2% F-measure (R = 100%, P = 9.41%) for Method and 3% F-measure (with R = 100% and P = 1.54%) for Implementation, compared to 53%, 53% and 76% respectively for the proposed CRF-based classifiers. In the case of Resources/Features, the baseline classifier would achieve an F-measure of 58% (R = 100%, P = 41%), which is comparable to our classifier (F-measure of 61%). However, we note that our classifier achieves much better precision (68%) and that recall can be improved by document-level consideration (see the following subsection).

5.3. Document-level results

In this experiment we aimed to evaluate the capabilities for the identification of Tasks, Methods, Resources/Features and Implementations at the document level: we assessed to what extent we could identify methodological information for a given article rather than at the mention level. As with previous tasks, we used a standard five-fold cross validation; Tables 12 and 13 present the results. While there is a minor improvement for the Implementation category (F-measure of 78%, up from 76%), all other methodological categories showed better results, in particular for recall. The results for the Task category, for example, demonstrated significant improvement (+11.34% for P, +21.92% for R), suggesting that there are repeating segments (e.g. in the abstract, introduction, method section, conclusion) that could be grouped at the document level. Resource/Features demonstrated similar increases (+9.95% for P, +14.50% for R).

Table 12
Document-level performance of the CRF classifiers – macro measures.

Task                      Method                    Resource/Feature          Implementation
P       R       F         P       R       F         P       R       F         P       R       F
0.8093  0.6517  0.7220    0.8121  0.4758  0.6000    0.7756  0.6989  0.7353    0.8058  0.7683  0.7866

Table 13
Document-level performance of the CRF classifiers – micro measures.

Task                      Method                    Resource/Feature          Implementation
P       R       F         P       R       F         P       R       F         P       R       F
0.8081  0.6465  0.7183    0.8271  0.4420  0.5761    0.7878  0.7054  0.7444    0.7755  0.6031  0.6785


6. Discussions and error analysis

The impact that particular groups of features have on the extraction of methodological segments has been explored in detail. We have also performed error analysis and identified four major error categories, which are described below (Section 6.2). Finally, we also briefly discuss the results at different levels of annotation and demonstrate the potential of exploring methodological spaces.

6.1. Impact of feature types

The features used in CRFs have been analysed with regard to their four types (as introduced in Section 4.2):

• lexical features (PH, NPH and PHT),
• syntactic features (DEP and GOV),
• frequency (domain) features (FT, FM, FI and FR), and
• the citation feature (CIT) and the semantic category of the verb (ACT).

The analyses have been performed for each of the four categories (Task, Method, Resource/Feature and Implementation) separately. In all experiments, the surrounding window of two phrases showed the best performance for the CRF models and is used to present and discuss the results.
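
The ablation runs reported below can be organised as a loop that regenerates a restricted CRF++ template for each configuration, with one feature group removed at a time. In the sketch, the mapping of feature groups to column indices and the file names are assumptions for illustration; training and testing with crf_learn/crf_test on each template are not shown.

    # Sketch of generating the feature-ablation configurations (illustrative column indices).
    FEATURE_GROUPS = {
        "lexical":   [1, 2, 3],       # PH, NPH, PHT
        "syntactic": [4, 5],          # GOV, DEP
        "cit_act":   [6, 7],          # ACT, CIT
        "frequency": [8, 9, 10, 11],  # FT, FM, FI, FR
    }

    def write_template(columns, path):
        lines = [f"U{i:02d}:%x[0,{col}]" for i, col in enumerate(columns)] + ["B"]
        with open(path, "w") as f:
            f.write("\n".join(lines) + "\n")

    all_columns = sorted(c for cols in FEATURE_GROUPS.values() for c in cols)
    write_template(all_columns, "template_all.txt")
    for group, cols in FEATURE_GROUPS.items():
        kept = [c for c in all_columns if c not in cols]
        write_template(kept, f"template_without_{group}.txt")
        # each template would then be passed to crf_learn / crf_test for evaluation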

6.1.1. Lexical features
The lexical features, in general, are beneficial for the extraction of all four categories. For the Task, Method and Resource categories, precision and recall improve between 5% and almost 20% when these features are used. The Implementation category benefits less, reflecting the variability of expressions used to name various software tools. When only lexical features are used for classification, recall suffers significantly, whereas there is a limited impact on precision, suggesting that there is lexical variability that needs to be compensated for by other features in order to identify methodological mentions. The detailed results are presented in Table 14.

For the Task category, if only the features from this group are used, the model achieves good precision (only 4% less than the model with all the features), whereas recall drops 6%. On the other hand, if the model does not use lexical features, precision drops by 9%.

Table 14
The impact of the lexical features (PH, phrase; NPH, normalised phrase; PHT, phrase type).

                              Task                           Method
                              Precision  Recall  F-measure   Precision  Recall  F-measure
All features                  0.6959     0.4325  0.5335      0.7046     0.4251  0.5303
PH + NPH + PHT only           0.6543     0.3752  0.4769      0.6646     0.3240  0.4356
All, but NPH                  0.6723     0.4211  0.5179      0.5943     0.3486  0.4395
All, but PH                   0.6740     0.4290  0.5243      0.6904     0.4180  0.5208
All, but PHT                  0.7011     0.4253  0.5294      0.6994     0.4168  0.5223
All, but (PH + NPH + PHT)     0.6066     0.3764  0.4646      0.5181     0.3201  0.3957

                              Resource/Feature               Implementation
                              Precision  Recall  F-measure   Precision  Recall  F-measure
All features                  0.6761     0.5539  0.6089      0.8566     0.6774  0.7565
PH + NPH + PHT only           0.6903     0.4258  0.5267      0.7833     0.3380  0.4723
All, but NPH                  0.6459     0.5308  0.5827      0.8611     0.6933  0.7682
All, but PH                   0.6570     0.5321  0.5880      0.8495     0.6878  0.7601
All, but PHT                  0.6707     0.5345  0.5949      0.8706     0.6679  0.7559
All, but (PH + NPH + PHT)     0.5669     0.4853  0.5229      0.8378     0.6674  0.7429


Table 15
The impact of the frequency features (FT, frequency for Task; FM, frequency for Method; FI, frequency for Implementation; FR, frequency for Resource/Feature).

                              Task                           Method
                              Precision  Recall  F-measure   Precision  Recall  F-measure
All features                  0.6959     0.4325  0.5335      0.7046     0.4251  0.5303
FT + FM + FI + FR only        0.3756     0.2488  0.2993      0.3410     0.2207  0.2680
FT (FM)                       0.2689     0.0801  0.1234      0.2724     0.1915  0.2249
All, but FT (FM)              0.6906     0.3808  0.4909      0.7245     0.3566  0.4779
All, but (FT + FM + FI + FR)  0.7243     0.4238  0.5347      0.7377     0.3764  0.4985

                              Resource/Feature               Implementation
                              Precision  Recall  F-measure   Precision  Recall  F-measure
All features                  0.6761     0.5539  0.6089      0.8566     0.6774  0.7565
FT + FM + FI + FR only        0.4980     0.4332  0.4634      0.7582     0.6831  0.7187
FR (FI)                       0.4528     0.2863  0.3508      0.6919     0.5410  0.6072
All, but FR (FI)              0.6846     0.4792  0.5638      0.8204     0.4647  0.5933
All, but (FT + FM + FI + FR)  0.6866     0.4508  0.5442      0.8291     0.4363  0.5718


When only lexical features are used to identify Methods, the F-measure is degraded by 10%. The phrase and the phrase type have a similar effect on the performance (when not used, both precision and recall decrease by 1%). It is interesting that precision drops by almost 19% when lexical features are not used, which suggests that there is a lexical vocabulary that could well characterise Method mentions.

Similar patterns were shown for the Resource/Feature category, with 11% drop in precision and 7% drop in recall when lexical features are ignored. However, using only these features provides the best precision (69%), suggesting again that there are 'lexical' clues that can help identify resources and features used in ATR methodologies. This also suggests that many ATR techniques rely on the same features.

In comparison to other categories, using only the lexical features to identify Implementation mentions has a signifi-antly lower performance, in particular recall, which reflects the lexical variability of tool names as mentioned above.onsistent with this, excluding lexical features has a very small impact on the overall performance results (1–2% drop

n precision and recall).Overall, lexical features achieve high precision when only this group of features is used. This indicates that there

s an important and stable relationship between lexical features and methodologies in this domain, and that it needs toe exploited in order to recognise and classify methodological phrases. On the other hand, lower recall indicates thathere is still lexical variability in expressing methodological knowledge.
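
To make this feature group concrete, the sketch below shows one plausible way of deriving the PH, NPH and PHT values for a shallow-parsed chunk. The Chunk representation, the lowercasing and the use of lemmas for normalisation are assumptions made purely for illustration, not a description of the exact pipeline used in this work.

    # Hypothetical sketch of the three lexical features (PH, NPH, PHT) for a
    # shallow-parsed chunk; the representation and normalisation are assumptions.
    from dataclasses import dataclass

    @dataclass
    class Chunk:
        tokens: list        # surface tokens of the chunk
        lemmas: list        # lemmatised tokens (however obtained)
        chunk_type: str     # e.g. "NP", "VP", "PP" from the shallow parser

    def lexical_features(chunk):
        """Return the PH / NPH / PHT features for one chunk."""
        return {
            "PH":  "_".join(t.lower() for t in chunk.tokens),  # phrase
            "NPH": "_".join(chunk.lemmas),                      # normalised phrase
            "PHT": chunk.chunk_type,                            # phrase type
        }

    # Example: the noun phrase "Conditional Random Fields"
    print(lexical_features(Chunk(["Conditional", "Random", "Fields"],
                                 ["conditional", "random", "field"], "NP")))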

6.1.2. Frequency features
Frequency features have varying effects on the results, depending on the category. In general, there are no significant differences for Task and Method identification, whereas Implementation would suffer an almost 20% drop in recall if frequencies were not used. As expected, frequency features on their own are not beneficial for the identification of methodological mentions. The detailed results are given in Table 15.

The use of frequency features for the Task category model has no significant effect on the F-measure (a slight negative effect on precision (3% drop), but it improves recall). Using only FT (the frequency of the phrase annotated as Task in the training corpus) results in extremely low recall, with poor precision, which is a likely consequence of the limited task-variability of the corpus. It is also interesting that excluding FT from the full model results in a small drop in precision and a 5% drop in recall. This pattern is more extreme in the other categories (bigger drops in recall).

The use of frequency features for the Method category is beneficial for recall (but not for precision). Removing only the frequency for this category (FM) increased the precision by 2% and decreased the recall by 7%, indicating that FM has the greatest impact of all the features in this group.

The frequency features have a significant effect on recall for Resources (a 10% drop when omitted). Overall, adding frequency improves the F-measure by 6%. It is also evident that FR has a greater impact on the results than FT, FM


and FI. Furthermore, the impact on Implementation is even more significant: when frequency features are removed from the full model, there is a 24% drop in recall and a 3% drop in precision. It is, however, interesting that frequency features perform reasonably well on their own (only a 4% drop in F-measure).

In general, the use of frequency features improves recall while decreasing precision. Given that the test corpus is from a relatively narrow domain, it makes sense that some tasks, methods, resources and implementations are mentioned frequently across the domain. The drop in precision can be explained by cases in which a methodological phrase is mentioned but not annotated as such because of the context (e.g. when authors point out why they use the method that they choose and not a particular other method which they mention). In the Implementation category, frequency features seem to have the most influence on performance, probably indicating a small set of software tools that have been proven to perform well for the tasks in the domain of ATR.
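
As a rough illustration of how such counts could be gathered, the sketch below builds per-category frequency tables from training annotations and looks them up for a phrase. The pair-based input format and the absence of any scaling or binning of the raw counts are assumptions made only for this example.

    # Hypothetical sketch: FT, FM, FR and FI taken as counts of how often a
    # normalised phrase was annotated with each category in the training data.
    from collections import Counter

    CATEGORIES = ("Task", "Method", "Resource/Feature", "Implementation")

    def build_frequency_tables(training_annotations):
        """training_annotations: iterable of (normalised_phrase, category) pairs."""
        tables = {cat: Counter() for cat in CATEGORIES}
        for phrase, category in training_annotations:
            tables[category][phrase] += 1
        return tables

    def frequency_features(phrase, tables):
        """Return the four frequency features for one (normalised) phrase."""
        return {"FT": tables["Task"][phrase],
                "FM": tables["Method"][phrase],
                "FR": tables["Resource/Feature"][phrase],
                "FI": tables["Implementation"][phrase]}

    tables = build_frequency_tables([("term recognition", "Task"),
                                     ("term recognition", "Task"),
                                     ("part-of-speech tagging", "Method")])
    print(frequency_features("term recognition", tables))  # FT=2, others 0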

6.1.3. Syntactic features
Syntactic features, in general, improved the results, but only modestly. The detailed results are presented in Table 16.

Overall, the increase in F-measure attributed to syntactic features is between 0% (Resource/Feature) and 6% (Task). While the syntactic features have a positive effect on precision and recall for the identification of Task and Method mentions, this effect is much smaller for the Resource/Feature category (no impact) and for the Implementation category, indicating that syntactic relations may already be redundant if other types of features have been included.
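
A minimal sketch of what the DEP and GOV features could look like is given below, assuming the parse is available as (governor, relation, dependant) triples and that a phrase is represented by its head token; neither assumption is stated here, so this is only an illustration of the feature type.

    # Illustrative sketch only: DEP collects the labels of relations in which the
    # chunk's head token is the dependant, GOV those in which it is the governor.
    def syntactic_features(head_token, dependencies):
        """dependencies: iterable of (governor_token, relation, dependant_token)."""
        dep = sorted(rel for g, rel, d in dependencies if d == head_token)
        gov = sorted(rel for g, rel, d in dependencies if g == head_token)
        return {"DEP": "|".join(dep) or "NONE", "GOV": "|".join(gov) or "NONE"}

    # "... we use a stemming algorithm ...": the chunk headed by "algorithm"
    parse = [("use", "dobj", "algorithm"), ("algorithm", "amod", "stemming")]
    print(syntactic_features("algorithm", parse))  # {'DEP': 'dobj', 'GOV': 'amod'}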

6.1.4. Citation and the semantic category of the verb
If removed from the full model, these features have no significant effect on the results (see Table 17), although they still bring an increase in the overall F-measure. Again, this means that there is either redundancy in the information supplied to the model or that these features are not relevant.

The citation type has a limited positive effect on both precision and recall across all categories, whereas the semantic category of the verb improves performance for the Task and Method categories, mainly as these categories usually contain specific verbal phrases.

6.2. Error analysis

We performed an analysis of a random sample of 200 sentences containing false positive and 200 sentences containing false negative cases. Three major error categories have been identified.

6.2.1. Incorrect tagging of sub-phrases with or around methodological segments
We consider a segment as correctly recognised and classified only if all the phrases that belong to it were tagged with the appropriate category. We estimate that around 20% of errors come from segments where not all sub-phrases have been correctly classified. By carefully analysing these segments, we found that the majority of the misclassified phrases are prepositional chunks associated with the verbal phrase (which was correctly classified).

Table 16
The impact of the syntactic features (DEP, the relations in which the phrase is the dependant; GOV, the relations in which the phrase is the governor).

                          Task                               Method
                          Precision  Recall   F-measure      Precision  Recall   F-measure
All features              0.6959     0.4325   0.5335         0.7046     0.4251   0.5303
DEP + GOV only            0.2642     0.1735   0.2094         0.3173     0.2028   0.2475
All, but (DEP + GOV)      0.5854     0.3958   0.4723         0.6400     0.4071   0.4977

                          Resource/Feature                   Implementation
                          Precision  Recall   F-measure      Precision  Recall   F-measure
All features              0.6761     0.5539   0.6089         0.8566     0.6774   0.7565
DEP + GOV only            0.4542     0.3082   0.3672         0.5786     0.3265   0.4174
All, but (DEP + GOV)      0.6719     0.5519   0.6060         0.8420     0.6571   0.7381


Table 17
The impact of citation and the semantic category of the verb (CIT, type of citation; ACT, category of the verb).

                          Task                               Method
                          Precision  Recall   F-measure      Precision  Recall   F-measure
All features              0.6959     0.4325   0.5335         0.7046     0.4251   0.5303
CIT + ACT only            0.5103     0.2287   0.3159         0.3228     0.2457   0.2790
All, but CIT              0.6979     0.4234   0.5271         0.7006     0.4229   0.5274
All, but ACT              0.6897     0.4266   0.5271         0.6993     0.4231   0.5272
All, but (CIT + ACT)      0.6934     0.4116   0.5166         0.7030     0.4227   0.5279

                          Resource/Feature                   Implementation
                          Precision  Recall   F-measure      Precision  Recall   F-measure
All features              0.6761     0.5539   0.6089         0.8566     0.6774   0.7565
CIT + ACT only            0.4745     0.3753   0.4191         0.5800     0.3204   0.4127
All, but CIT              0.6764     0.5463   0.6044         0.8516     0.6877   0.7610
All, but ACT              0.6746     0.5463   0.6037         0.8652     0.6950   0.7708
All, but (CIT + ACT)      0.6709     0.5406   0.5987         0.8530     0.6896   0.7627

In some cases the entire prepositional phrase would be tagged as methodological, while in others it would not, depending on how detailed the expressed information was. For example, in the sentence fragment "identifying terms of various lengths", the segment "identifying terms" can be annotated as Task, but so can the whole fragment (see Table 18). Future work here would need to take into account possible "normalisation" of methodological segments, e.g. linking to an agreed controlled vocabulary or ontology.

6.2.2. Confusions based on narrow contexts
There are cases where two segments with the same content have different categories, depending on the context in which they have been mentioned (e.g. "part-of-speech tagging" can be either a task or a method). In order to make this kind of distinction, the annotator (human or automated) would need an understanding of the context in which the segment has been mentioned (e.g. a section or even a whole paper). As this wider context is not captured by our CRF models, a number of false positives and negatives arise from it. For example, in the sentence given in Table 19, the segment "machine translation" is tagged as Method even though in this context it is a Task. We estimated that 20% of all errors fell into this category.

6.2.3. Frequent phrases
We have identified a number of cases where phrases frequent for a particular category mislead the prediction process when such phrases are used as part of other segments. For example, in the sentence given in Table 20, the phrase "vacuous rule application" has been tagged as Resource/Feature, given that the word "rule" has a high frequency as a resource (e.g. "decision rule", "derivation rule", "morphological rule"), which resulted in a false positive. We estimated that around 20% of all errors fall into this category for Task, Method and Resource. The frequency features have a particularly strong impact on the Implementation category: we estimated that 60% of all errors in that category are of this type.

Table 18
An example of incorrect tagging of surrounding phrases. The phrases "of" and "various lengths" were misclassified as the Task category.

Phrase            Syntactic relation   Category   Predicted category
identifying       dobj                 Task       Task
terms             dobj                 Task       Task
of                prep                 O          Task
various lengths   pobj                 O          Task


Table 19
An example of a misinterpreted context. The segment "machine translation" is tagged as Method even though in this context it is a Task.

Phrase                          Category           Predicted category
in                              O                  O
this paper                      O                  O
,                               O                  O
we                              O                  O
focus                           O                  O
on                              O                  O
our                             O                  O
experiment                      O                  O
to                              O                  O
extract                         O                  O
chinese multiword expressions   Resource/Feature   O
from                            O                  O
corpus resources                Resource/Feature   O
as                              O                  O
part                            O                  O
of                              O                  O
a larger research effort        O                  O
to                              O                  O
improve                         O                  O
a                               O                  O
machine                         Task               Method
translation                     Task               Method
(                               O                  O
mt                              O                  O
)                               O                  O
system                          O                  O
.                               O                  O

Table 20
An example of a frequent phrase that resulted in misclassification. The phrase "vacuous rule application" has been tagged as Resource/Feature, as the word "rule" has a high frequency as a resource in the corpus.

Phrase                                         Category           Predicted category
If                                             O                  O
we                                             O                  O
take                                           O                  O
an un-derived monomorphemic native wordform    Resource/Feature   Resource/Feature
this                                           O                  O
can                                            O                  O
be                                             O                  O
seen                                           O                  O
conceptually                                   O                  O
as                                             O                  O
passing                                        O                  O
through                                        O                  O
levels 1–3                                     O                  O
,                                              O                  O
with                                           O                  O
vacuous rule application                       Method             Resource/Feature
.                                              O                  O


Table 21
The most frequent tasks, methods, resources and implementations in the area of ATR.

Task                      Overall frequency  Document frequency   Method                    Overall frequency  Document frequency
term recognition          216                61                   part-of-speech tagging    33                 22
classification            163                46                   morphological analysis    24                 16
pattern matching          72                 35                   syntactic parsing         16                 12
similarity calculation    69                 20                   genetic algorithm         10                 9
frequency analysis        34                 24                   stemming                  9                  8
clustering                32                 13                   corpus analysis           7                  7
dictionary construction   23                 16                   lexical lookup            7                  7
rule learning             23                 13                   statistical method        5                  5
disambiguation            20                 10                   suffix checking           5                  5
ontology construction     11                 12                   manual annotation         3                  2

Resource/Feature          Overall frequency  Document frequency   Implementation            Overall frequency  Document frequency
corpus frequency          459                76                   Perl                      6                  3
contextual information    347                61                   Access                    3                  3
syntactic patterns        215                58                   Lucene                    3                  3
linguistic rule           166                41                   Conexor parser            3                  2
dictionary                78                 28                   Chasen system             2                  2
termhood                  68                 30                   Stanford parser           2                  2
ontology                  61                 22                   Atract                    2                  1
lexicon                   55                 20                   Fastr                     2                  1
UMLS                      30                 17                   Sylex                     2                  1
similarity measure        27                 13                   Xtract                    2                  1


6.3. Performance at different levels of annotations

Tables 10 and 11 indicate similar performances measured at the segment and the phrase levels, despite the differences in determining the TPs, FPs and FNs. Since a true positive at the segment level requires the correct prediction of categories for all phrases in that segment (see Table 7), it should be considered more difficult to achieve good performance than at the phrase level. Consequently, as expected, there are fewer TPs at the segment level. Furthermore, all FPs and FNs at the phrase level are counted as only one FP or FN at the segment level when the phrases belong to the same methodological segment (see Tables 8 and 9). This results in fewer FPs and FNs at the segment level in comparison to the phrase level. Therefore, the higher numbers of true positives at the phrase level compared to the segment level are "compensated" by the higher quantities of false positives and negatives, which results in similar overall performance (see Tables 10 and 11).
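
The difference between the two levels of counting can be illustrated with a small sketch. It follows the description above (a segment-level true positive requires all phrases to be correct, and multiple phrase-level errors within one segment collapse to a single error), but it is a simplification of the actual evaluation procedure defined in Tables 7–9, not the evaluation code itself.

    # Simplified sketch of the counting difference; a segment is a list of
    # (gold_label, predicted_label) pairs, one pair per phrase.
    def phrase_level_counts(segment, category):
        tp = sum(g == category and p == category for g, p in segment)
        fp = sum(g != category and p == category for g, p in segment)
        fn = sum(g == category and p != category for g, p in segment)
        return tp, fp, fn

    def segment_level_counts(segment, category):
        # One TP only if every phrase is predicted correctly; otherwise the
        # phrase-level errors collapse into at most one FP and one FN.
        tp, fp, fn = phrase_level_counts(segment, category)
        if fp == 0 and fn == 0 and tp > 0:
            return 1, 0, 0
        return 0, min(fp, 1), min(fn, 1)

    # A three-phrase Task segment with one phrase missed by the model:
    seg = [("Task", "Task"), ("Task", "Task"), ("Task", "O")]
    print(phrase_level_counts(seg, "Task"))   # (2, 0, 1)
    print(segment_level_counts(seg, "Task"))  # (0, 0, 1)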

The increase in performance at the document level for the Task and Resource/Feature categories can be explained by some repetition of associated mentions and wide terminological variability in methodological segments belonging to these categories. If we consider the way that false positives and negatives are calculated at the segment level (see Tables 7–9), we can see that terminological variability plays a significant role. Even the slightest difference in segments between the gold standard and the CRF models results in annotation errors (FPs and FNs). The normalisation of terms at the document level eliminates this variability, and thus decreases the number of such errors. For the evaluation purposes this has been done manually, but an obvious task for future work is to explore automated normalisation and grouping approaches.

While authors mostly perform similar tasks and use similar resources, the methods for performing these tasks vary from one approach to another. However, the document-level results for the Method category indicate that there is less variability and repetition in mentions than for the Task and Resource/Feature categories. This is probably due to the fact that Methods are usually described in dedicated sections and not often repeated elsewhere. As for terminological variability, we have compared the average number of variations for normalised mentions for Methods to Tasks: for each normalised Task (Method) we counted the number of variations in a given paper, summed them up and divided by the number of normalised Tasks (Methods) in the paper. We then averaged those values across all of the papers.

There are 2.3 mentions per normalised Task and 1.5 mentions per normalised Method per paper, which indicates that the Task category has greater terminological variability than the Method category.
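
For clarity, the averaging just described can be written down as a short sketch. The per-paper dictionaries and the example figures below are invented; only the arithmetic mirrors the description above.

    # Sketch of the variability measure: each paper maps a normalised mention to
    # the list of surface variants found for it in that paper.
    from statistics import mean

    def avg_variations_per_normalised_mention(papers):
        per_paper = [mean(len(variants) for variants in paper.values())
                     for paper in papers if paper]
        return mean(per_paper)

    papers = [
        {"term recognition": ["term recognition", "recognition of terms"],
         "classification": ["classification"]},                 # (2 + 1) / 2 = 1.5
        {"term recognition": ["automatic term recognition"]},   # 1 / 1 = 1.0
    ]
    print(avg_variations_per_normalised_mention(papers))        # 1.25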

Implementation mentions come from a restricted set of values (tool names, programming languages, development environments) and thus obviously have less terminological variability, which explains the limited increase in performance for this category at the document level.

6.4. Towards exploration of the ATR methodological space

We further explored the results in a wider context by applying the proposed method to the entire ATR corpus that we have collected (see Section 3). Table 21 provides the most frequent mentions for the four classes in the Solution sentences that describe manuscripts' contributions. As expected, the most frequent tasks in this domain are all focused on recognition and profiling of terms. It is interesting that the most frequent methods used in the area of ATR are part-of-speech tagging, morphological analysis and syntactic parsing, but also genetic algorithms and stemming, which gives some flavour of the methodologies used in this area. The top extracted implementations reveal storage and indexing software (e.g. Access, Lucene), NLP tools (Stanford parser, Chasen system) and names of the systems that authors present in their papers (Atract, Fastr). As one of the most commonly used programming languages for NLP, Perl is found to be the most frequent implementation-related term appearing in this corpus. Obviously, larger-scale document processing and further, wider statistical analyses (e.g. including time of publication) are needed to give a better understanding of the development of the field. We also plan to investigate the distribution of methodological information in other parts of the manuscript (e.g. Result, Own, Background, Basis, Contrast).

7. Conclusions

In this paper we have proposed a system for the extraction of mentions of methodological information from scientific publications in the field of Automatic Term Recognition. To the best of our knowledge, this work is one of the first attempts to provide systematic methodology mining in the field of NLP and, more widely, in Computer Science. The system consists of two major steps. The first step is the automatic classification of methodological sentences, based on previous work of Teufel and Moens (2002) and Teufel (1999); to further filter out methodological sentences inside this category, a second classifier was used to identify sentences that report on the specific contributions of a given paper.

The second layer of the system extracts and classifies methodology mentions (segments) in the sentences obtained from the first layer. The segments are classified into four semantic categories: Task, Method, Resource/Feature and Implementation. Classification of the segments was accomplished by four separate CRF models. Given that most of the methodological segments consist of two or more words, the CRF models are based on shallow-parsed phrases rather than words.

The system was evaluated on the manually annotated corpus using five-fold cross-validation. The corpus was annotated at three levels: document, sentence and segment. Ten percent of the corpus was also double-annotated at all levels by an independent annotator to ensure consistency.

Both sentence classifiers provided good performance in terms of precision and recall. At the mention level, F-measures of 75% for the identification of Implementation mentions (with a precision of 86%) and 61% for the identification of Resources/Features (67% precision) are promising. The results for Task and Method identification were lower (F-measures of 53%) but still with relatively good precision (70%). The recall values for all categories were lower (between 43% and 68%), indicating that the current features have not covered all necessary methodological attributes. Therefore, improving the mention-level coverage is one of the main topics for future work. Still, we note that Kappeler et al. (2008) reported an F-measure of 45% for the identification of experimental method mentions in biomedical text (see Section 2.2), despite their task being focused on a "closed" set of methods as specified by an experimental taxonomy, whereas our approach annotates "open" mentions of ATR methods and is not constrained to a set of techniques. We have further explored clustering methodological mentions at the document level, achieving an F-measure of 72% (with 81% precision) for Task mentions (an increase of almost 20%), 60% (with 81% precision) for Method mentions, 74% (with 78% precision) for the Resource/Feature and 78% (with 80% precision) for the Implementation categories. Further work is needed on automated clustering of methodological mentions at the document level.


We also plan to experiment with rule-based methods to post-filter the errors that we have identified and to incorporate new features to better capture the wider context dependencies. Another area of exploration is the order in which the CRFs are applied, with the goal of incorporating the previously predicted methodological labels as features (e.g. the information that a neighbouring phrase has been predicted as a Task can be useful in determining whether the current phrase is a Method). In order to further semantically connect the resulting phrases, we will explore the use of background knowledge (e.g. an ontology of tasks and methods).

A successful identification of methodological mentions opens a number of possibilities to explore the development of the given area. We therefore plan to conduct further analyses of the methodological mentions in the area of ATR, including mutual information, time series and association links. By integrating these findings, we aim to help semantically enrich scientific outputs by providing information about the methodology used to solve particular problems. Such information can be further used to identify experts in a given field, explore patterns in its evolution, identify "hot topics" (Eales et al., 2008; Buitelaar and Eigner, 2009) and to learn methodological vocabularies (Afzal et al., 2008).

Acknowledgement

The work presented here was partially supported by the Serbian Ministry of Education and Science (projects III47003, III44006).

References

Afzal, H., Stevens, R., Nenadic, G., 2008. Towards semantic annotation of bioinformatics services: building a controlled vocabulary. In: Proceedings of the Third International Symposium on Semantic Mining in Biomedicine, Turku, Finland, pp. 5–12.
Artstein, R., Poesio, M., 2008. Inter-coder agreement for computational linguistics. Comput. Linguist. 34, 555–596.
Buitelaar, P., Eigner, T., 2009. Expertise mining from scientific literature. In: Proceedings of the Fifth International Conference on Knowledge Capture (K-CAP'09), Redondo Beach, CA, USA, pp. 171–172.
"Callisto" http://callisto.mitre.org (last visited 12.05.10).
Cohen, J., 1960. A coefficient of agreement for nominal scales. Educ. Psychol. Meas. 20 (1), 37–46.
Chung, G.Y., 2009. Sentence retrieval for abstracts of randomized controlled trials. BMC Med. Inform. Decis. Mak. 9, 10.
"CRF++" http://crfpp.sourceforge.net/ (last visited 12.05.10).
DeShazo, J.P., LaVallie, D.L., Wolf, M., 2009. Publication trends in the medical informatics literature: 20 years of medical informatics. BMC Med. Inform. Decis. Mak. 9, 7.
Eales, J.M., Pinney, J.W., Stevens, R.D., Robertson, D.L., 2008. Methodology capture: discriminating between the best and the rest of community practice. BMC Bioinformatics 9, 359.
Ito, T., Simbo, M., Yamasaki, T., Matsumoto, Y., 2004. Semi-supervised sentence classification for medline documents. IEIC Technical Report 104:486 (AI2004 34-44), pp. 51–56.
Kappeler, T., Clematide, S., Kaljurand, K., Schneider, G., Rinaldi, F., 2008. Towards automatic detection of experimental methods from biomedical literature. In: Proceedings of the Third International Symposium on Semantic Mining in Biomedicine (SMBM 2008), Turku, Finland, pp. 61–68.
Kenji, H., Okazaki, N., Ananiadou, S., Ishizuka, M., 2008. Identifying sections in scientific abstracts using conditional random fields. In: Proceedings of the Third International Joint Conference on Natural Language Processing (IJCNLP 2008), Hyderabad, India, pp. 381–388.
Klein, D., Manning, C.D., 2003a. Fast exact inference with a factored model for natural language parsing. In: Advances in Neural Information Processing Systems 15 (NIPS 2002), pp. 3–10.
Klein, D., Manning, C.D., 2003b. Accurate unlexicalized parsing. In: Proceedings of the 41st Meeting of the Association for Computational Linguistics, Sapporo, Japan, pp. 423–430.
Liu, Z., Zhu, C., Zhao, T., 2010. Chinese named entity recognition with a sequence labeling approach: based on characters, or based on words? In: Proceedings of the Advanced Intelligent Computing Theories and Applications. With Aspects of Artificial Intelligence, vol. 6216, Changsha, China, pp. 634–640.
Lin, J., Karakos, D., Demner-Fushman, D., Khudanpur, S., 2006. Generative content models for structural analysis of medical abstracts. In: Proceedings of the HLT/NAACL 2006 Workshop on Biomedical Natural Language Processing (BioNLP'06), New York, USA, pp. 65–72.
"LT-TTT2" http://www.ltg.ed.ac.uk/software/lt-ttt2/ (last visited 12.05.10).
McKnight, L., Srinivasan, P., 2003. Categorization of sentence types in medical abstracts. In: Proceedings of the 2003 Annual Symposium of the American Medical Informatics Association (AMIA 2003), pp. 440–444.
McDonald, R., Pereira, F., 2005. Identifying gene and protein mentions in text using conditional random fields. BMC Bioinformatics 6 (Suppl. 1), S6.
MEDLINE http://www.nlm.nih.gov/bsd/stats/cit added.html (last visited 12.05.10).
Mizuta, Y., Collier, N., 2004. Zone identification in biology articles as a basis for information extraction. In: Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications (JNLPBA'04), Geneva, Switzerland, pp. 29–35.


Mizuta, Y., Korhonen, A., Mullen, T., Collier, N., 2006. Zone analysis in biology articles as a basis for information extraction. Int. J. Med. Inform. 75 (6), 468–487.
Mierswa, I., Wurst, M., Klinkenberg, R., Scholz, M., Euler, T., 2006. YALE: rapid prototyping for complex data mining tasks. In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-06), Philadelphia, PA, USA, pp. 935–940.
Mullen, T., Mizuta, Y., Collier, N., 2005. A baseline feature set for learning rhetorical zones using full articles in the biomedical domain. SIGKDD Explor. Newslett. 7 (2), 52–58.
Renear, A.H., Palmer, C.L., 2009. Strategic reading, ontologies, and the future of scientific publishing. Science 325 (5942), 828–832.
Ruch, P., Boyer, C., Chichester, C., Tbahriti, I., Geissbühler, A., Fabry, P., Gobeill, J., Pillet, V., Rebholz-Schuhmann, D., Lovis, C., Veuthey, A.L., 2007. Using argumentation to extract key sentences from biomedical abstracts. Int. J. Med. Inform. 76 (2–3), 195–200.
Settles, B., 2004. Biomedical named entity recognition using Conditional Random Fields and rich feature sets. In: Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications (NLPBA), Geneva, Switzerland, pp. 104–107.
Settles, B., 2005. ABNER: an open source tool for automatically tagging genes. Bioinformatics 21 (14), 3191–3192.
Shatkay, H., Wilbur, W., Rzhetsky, A., 2010. Annotation Guidelines. http://www.ncbi.nlm.nih.gov/CBBresearch/Wilbur/AnnotationGuidelines.pdf (last visited 12.05.10).
Shatkay, H., Pan, F., Rzhetsky, A., Wilbur, W.J., 2008. Multi-dimensional classification of biomedical text: toward automated, practical provision of high-utility text to diverse users. Bioinformatics 24 (18), 2086–2093.
Shimbo, M., Yamasaki, T., Matsumoto, Y., 2003. Using sectioning information for text retrieval: a case study with the medline abstracts. In: Proceedings of the Second International Workshop on Active Mining (AM'03), Maebashi, Japan, pp. 32–41.
Sarafraz, F., Eales, J., Mohammadi, R., Dickerson, D., Robertson, D., Nenadic, G., 2009. Biomedical event detection using rules, conditional random fields and parse tree distances. In: Proceedings of the BioNLP 2009 Workshop Companion Volume for the Shared Task in Event Extraction, Boulder, USA, pp. 115–118.
Teufel, S., Moens, M., 2002. Summarizing scientific articles – experiments with relevance and rhetorical status. Comput. Linguist. 28 (4), 409–445.
Tsai, T.-H., Wu, S.-H., Hsu, W.-L., 2005. Exploitation of linguistic features using a CRF-based biomedical named entity recognizer. In: Proceedings of the ACL Workshop on Linking Biological Literature, Ontologies and Databases: Mining Biological Semantics, Detroit, USA.
Teufel, S., 1999. Argumentative Zoning: Information Extraction from Scientific Text. Ph.D. thesis, School of Cognitive Science, University of Edinburgh, Edinburgh.
Wilbur, W., Rzhetsky, A., Shatkay, H., 2006. New directions in biomedical text annotation: definitions, guidelines and corpus construction. BMC Bioinformatics 7, 356.
Wilbur, J., Smith, L., Tanabe, L., 2007. BioCreative 2. Gene mention task. In: Proceedings of the Second BioCreative Challenge Evaluation Workshop, Madrid, Spain, pp. 7–16.
Wu, J.-C., Chang, Y.-C., Liou, H.-C., Chang, J.S., 2006. Computational analysis of move structures in academic abstracts. In: Proceedings of the COLING/ACL on Interactive Presentation Sessions, Sydney, Australia, pp. 41–44.
Yamamoto, Y., Takagi, T., 2005. A sentence classification system for multi-document summarization in the biomedical domain. In: Proceedings of the International Workshop on Biomedical Data Engineering (BMDE2005), NJ, USA, pp. 90–95.
Yang, H., Nenadic, G., Keane, J.A., 2008. Identification of transcription factor contexts in literature using machine learning approaches. BMC Bioinformatics 9 (Suppl. 3), S11.
Yang, H., Keane, J., Bergman, C.M., Nenadic, G., 2009. Assigning roles to protein mentions: the case of transcription factors. J. Biomed. Inform. 42 (5), 887–894.
Yeh, A., Morgan, A., Colosimo, M., Hirschman, L., 2005. BioCreAtIvE Task 1A: gene mention finding evaluation. BMC Bioinformatics 6 (Suppl. 1), S2.

Aleksandar Kovacevic graduated from the University of Novi Sad, Faculty of Sciences, in 2003. He obtained his MSc (2006) and PhD (2011) in Computer Science from the University of Novi Sad, Faculty of Technical Sciences, where he is a teaching assistant. He has authored papers in international and national journals and conferences. His research interests are in information extraction, text and data mining.

Zora Konjovic received her Bachelor degree in Mathematics from the Faculty of Natural Sciences, Novi Sad (in 1973), and her Master degree and PhD (both in Robotics, in 1985 and 1992 respectively) from the Faculty of Technical Sciences, Novi Sad. She has been a full professor at the Faculty of Technical Sciences, Novi Sad, Serbia, since 2003. Prof. Konjovic has participated in 30 research projects (as the project leader in 18) and has published more than 180 scientific and professional papers. Her current research interests include artificial intelligence, web programming, digital libraries and archives, and geo-informatics.

Branko Milosavljevic received his Bachelor (1997), Master (1999) and PhD (2003) degrees, all in Computer Science, from the University of Novi Sad, Faculty of Technical Sciences. He has been an Associate Professor at the same faculty since 2004. Dr. Milosavljevic has participated in eight research projects, in one of which he was the project leader, and has published more than 70 scientific and professional papers.

Goran Nenadic graduated from the University of Belgrade (MMath) in 1993 and received his Master degree (Computer Science) from the same University in 1997. He was awarded a PhD in Computer Science in 2003 from the University of Salford. He is a Senior Lecturer in Text Mining at the University of Manchester's School of Computer Science and a principal investigator at the Manchester Interdisciplinary BioCentre. His main research interests are in the area of text mining and natural language processing, in particular automated terminology management. He has led a number of projects in those areas and published more than 80 papers.