
Creating a focused corpus of factual outcomes from biomedical experiments

James Eales, George Demetriou, and Robert Stevens

School of Computer Science, University of Manchester, UK

{james.eales,george.demetriou,robert.stevens}@manchester.ac.uk

Abstract. The results of an experiment are often described in a series of textual statements, the most concise of which is the title of the article. Here we implement a novel approach, using standard data mining techniques, to collect a set of concise ‘factual’ statements about a research area. We compare two standard text classification approaches to identify ‘factual’ and ‘non-factual’ sentences in article titles: the first uses a statistical language-modelling approach, and the second a more sophisticated semantic and grammatical approach. We find that the simple approach classifies titles more accurately, achieving 92% overall accuracy compared to 90% for the complex approach. We also implement a strategy to convert the phrasal dependencies in a ‘factual’ title into subject-predicate-object structures (triples). These triples can then be organised according to a schema provided by domain ontologies, by mapping URIs to entities found in the textual labels.

Keywords: text mining, data mining, knowledge network

1 Introduction

Publicly available data repositories, such as GenBank, EMBL and Swiss-Prot, are proof of the willingness of the scientific community to adopt common data formats and automated procedures in order to deposit their data in these resources. In some cases, researchers deposit their data long before the accompanying publications are out. In contrast, the development of many specialised knowledge bases still largely relies on results presented in the literature and this is often a costly and laborious exercise given the vast amounts of text in the field.

In this paper, we aim to identify a focused set of reliable statements or ‘facts’ on what is reported about a relatively narrow, yet important area of biomedical research, relating to disease influencing the kidney and urinary pathway (KUP). We then use the extracted facts to build a network of knowledge for inclusion in an existing knowledge base, the KUP Knowledge Base (KUPKB) [1].


The KUPKB has been created from existing databases and ontologies for use in the e-LICO project¹; it also has its own ontology (KUPO) for modelling relationships between data [1]. Currently the KUPKB has been populated with experimental results manually extracted from the published literature. This process takes a considerable amount of effort to identify statements in the literature and form them into a representation compatible with the knowledge base [2–4]. We assist the enrichment of KUPKB by providing relational triples for inclusion that are mined from the literature and are known to describe an experimental result. We use standard text classification techniques [5] to label titles as ‘factual’ or ‘non-factual’; the ‘factual’ titles are then processed into triple structures using a set of rules applied to the list of phrasal dependencies provided by the Stanford parser.

Traditionally most text mining systems use abstracts or full text articles; instead we use article titles. We are analysing titles for several reasons; first, they are short and to-the-point; second, they are descriptive and tend to use specific rather than general terms. The intention of an article title is to summarise the findings of a whole study into a single sentence. The title is also the first thing a user sees when searching PubMed and it is therefore important for advertising the article to potential readers. If a study identifies a new piece of definable knowledge then the authors will usually want to present this clearly in the title. If a study finds a slightly less than clear result, then the language used to describe it is often softened and we can detect this using text mining methods. On the other hand, our task is quite challenging; titles can use complex language and are also quite short. Nevertheless, our process for mining the information involves the computationally expensive task of dependency parsing [3, 6], therefore it is important to limit the amount of text to be analysed.

Previous work in this area has also used dependency parsing [3, 7, 4] and this has proven useful in the field of pharmacogenomics when looking for relationships between predefined sets of entities. Our approach focuses on identifying the facts presented in arbitrary biological articles, representing them as triples, and then later matching these to entities in existing ontologies. Further work incorporating the use of semantic patterns for identifying entity relationships [8, 9] has proven useful for capturing relationships describing protein structure, metabolic pathways and the function of enzymes. All of these have used a semantic framework to make the results of the analysis more widely usable, but also to make it easier to incorporate newly identified relationships with existing knowledge and then form queries based on the combined relationships; it is this flexibility that we seek in this work.

A by-product of our work is text categorization: a focused corpus of articles relevant to a task can provide time and cost savings to users who may want to use it for further computational analysis or even human curation.

1 http://www.e-lico.eu/


1.1 A corpus of factual titles

Our approach is to collect a set of titles, classify them into factual and non-factual groups and then extract sets of triples from the factual titles. We define a factual title here as ‘A direct sentential report about the outcome of a biomedical investigation’. An example of a factual title is:

‘Bluetongue virus RNA binding protein NS2 is a modulator of viral replication and assembly.’ (PubMed ID: 17241458)

It should be ‘direct’, in that it is neither hedged nor does it merely imply a result, but actually states the result in the title. Such a result can be either a positive or a negative outcome.

Such statements do not contain all the contextual information necessary to fully comprehend the implications of their findings, but this is not our aim; instead we hope to capture what is reported by the authors and then present this to other readers who can investigate further. An example of a non-factual title is:

‘A role for NANOG in G1 to S transition in human embryonic stem cells through direct binding of CDK6 and CDC25A’ (PubMed ID: 19139263)

This title contains many specifics (e.g. the NANOG protein, and the G1 and S cell cycle phases) and does allude to a role of NANOG in cell cycle transition, but the role is not explicitly defined; the title merely suggests that a role exists. Such statements are important and could be used, but the lack of an explicit role would have to be recorded; this will be future work. Our work has also revealed other kinds of titles, such as ones that report ‘hedged’ or possible results, ones that describe tools or methods, and those that simply say what the article is about. In this work we concentrate on the direct descriptions of an investigation to create a focused corpus of statements on a given topic. This corpus can then be integrated with the existing KUPKB framework, thus linking textually represented experimental outcomes with relevant database entries and ontology concepts.

2 Methods

Additional resources and supplementary data for this paper can be found in a myExperiment pack: http://www.myexperiment.org/packs/193.html

2.1 Software and workflows

All data-mining workflows were designed and run in RapidMiner 5.1.006. Workflows can be found on myExperiment as well as input data files, including the annotated training data. The Stanford parser version 1.6.5 was used for all dependency parsing and the OpenNLP POS-tagger was used to tag all sentences; we also used the OpenNLP English sentence detector to split titles into sentences.


2.2 Title classification

Title sentences in the training data were labelled as belonging to one of two classes:

1. Factual: titles that provide a directly reported experimental outcome, e.g.
   – ‘AICAR decreases the activity of two distinct amiloride-sensitive Na+-permeable channels in H441 human lung epithelial cell monolayers.’
   – ‘Inhibition of p38 MAP kinase pathway induces apoptosis and prevents Epstein Barr virus reactivation in Raji cells exposed to lytic cycle inducing compounds’

2. Non-factual: titles that do not fit the description of factual titles, e.g.
   – ‘The CENP-S complex is essential for the stable assembly of outer kinetochore structure.’
   – ‘A dual role of Cdk2 in DNA damage response.’
   – ‘Cytoplasmic signaling in the control of mitochondrial uproar?’
   – ‘Anni 2.0: a multipurpose text-mining tool for the life sciences.’
   – ‘Still no flying cars.’

2.3 Data sets

Titles can be composed of several different sentences and these can be a mix of factual and non-factual statements. Therefore all titles were preprocessed to split them into sentences; this was done using the standard English sentence detector from OpenNLP² and a set of simple rules which account for some vagaries found in biological titles. Our aim with sentence splitting was to provide more clearly defined instances for training a classification model.

We collected a training set of 1,875 article titles (1,939 sentences); these titles were restricted to articles published by a set of 82 biologically-themed journals that are archived by PubMed Central (see myExperiment for the list). The training sentences were manually annotated as ‘factual’ or ‘non-factual’.

We have also collected a set of 86,217 titles (91,626 sentences) that are related to kidney research and therefore the KUP domain. These were identified by a PubMed search using the keywords ‘kidney’ and ‘renal’ in title and abstract text. These titles have not been manually annotated but are to be used as a representative sample of the KUP literature to illustrate the kind of outputs that our approach can provide.

2.4 Attributes/Features

We derive two different sets of attributes for each data set. The first are simple tri-, bi- and unigram word frequency data (the simple attribute set); the second provide a more complex impression of a title (the complex attribute set) but are also more computationally expensive to derive. The simple attribute set originally contained 42,745 attributes but was filtered down to 525 attributes using an SVM weighting method and by selecting the top k attributes that provide the best f-measure score for the ‘good’ (factual) class in a 10-fold stratified cross-validation using a naive Bayes classifier (see Section 3). The simple attribute data are classified using naive Bayes because they form part of a statistical language-modelling approach.

2 http://incubator.apache.org/opennlp/
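As a rough illustration of the simple attribute set, word n-gram frequencies for a single title sentence could be collected as in the following Python sketch. This is not the RapidMiner workflow used in this work; the whitespace tokenisation, the lower-casing and the function name `ngram_counts` are all assumptions made for the example.

```python
from collections import Counter


def ngram_counts(sentence, max_n=3):
    """Count word uni-, bi- and trigrams in one title sentence.

    Assumption: tokens are lower-cased and split on whitespace; the
    tokenisation of the original workflow is not specified here.
    """
    tokens = sentence.lower().split()
    counts = Counter()
    for n in range(1, max_n + 1):
        # Slide a window of n tokens across the sentence.
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


counts = ngram_counts("A dual role of Cdk2 in DNA damage response.")
```

Each distinct n-gram observed across the corpus then becomes one attribute, which is how a vocabulary of tens of thousands of attributes can arise from short titles.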

The complex attribute set is derived from a set of string-based, lexical, semantic and grammatical profiling methods. We collect binary attributes on the presence of question marks, hyphens, colons and verbs. We also provide the part of speech (POS) tag for the first, second and third tokens. We use the Whatizit³ web service to count the mentions of disease, protein and chemical entities in the title and we use our own service for cell type recognition. The Whatizit web service is a third-party service hosted at the EBI⁴ that uses a large dictionary of exact string matches to identify entity mentions in text and return database identifiers for these mentions. Additionally we provide data on the prevalence of different phrase chunk types in the sentence as well as the proportion of nouns, verbs, adverbs and adjectives. The number of all different types of phrasal dependencies (provided by the Stanford parser) as well as the maximum parse tree depth are also used as attributes. Finally we calculate the frequency of a set of common POS-tag tri- and bi-grams in the title. This set of POS-tag n-grams was chosen for their differential frequencies between factual and non-factual titles. There are a total of 795 attributes in the complex set. The complex attribute data are classified using an SVM trained using the Sequential Minimal Optimisation method (as implemented in the WEKA extension to RapidMiner) because this was found to give the best results on this attribute set.
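The string-based part of the complex attribute set is straightforward to picture. The sketch below, in Python, covers only the punctuation checks; the verb-presence attribute needs a POS tagger and is left out. The function and attribute names are hypothetical and chosen for the example only.

```python
def surface_attributes(sentence):
    """Binary string-based attributes for one title sentence.

    Only the punctuation checks are sketched here; attribute names
    are illustrative, not those of the original workflow.
    """
    return {
        "has_question_mark": "?" in sentence,
        "has_hyphen": "-" in sentence,
        "has_colon": ":" in sentence,
    }


tool_title = surface_attributes(
    "Anni 2.0: a multipurpose text-mining tool for the life sciences."
)
question_title = surface_attributes(
    "Cytoplasmic signaling in the control of mitochondrial uproar?"
)
```

Both example sentences come from the non-factual class listed in Section 2.2, which suggests why such cheap surface cues may carry some signal.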

2.5 Triplification

Our triplification process uses the dependency parse of the sentence, provided by the Stanford parser [6, 10], to identify subjects, objects and predicates by the application of heuristic rules. The rules are applied in the following order:

1. Concatenate all compound noun (nn) and adjectival modifier (amod) dependencies on shared governor tokens.

2. Identify all nominal subject (nsubj) and nominal subject passive (nsubjpass) dependencies (the subjects).

3. Identify all direct object (dobj) dependencies (the objects).

4. Attempt to join all subject and object dependencies by a common governor token. The shared governor token becomes the predicate of a new triple, with the dependent tokens of the subject and object dependencies becoming the subject and object of the new triple respectively.

5. For each new triple, look for conjunct (conj) dependencies with a token shared between the triple's object and the dependency's governor token. Create new triples with shared subject and predicate tokens, but with the object set as the dependent token of the conjunct dependency.

6. For each sentence, look for abbreviation (abbrev) dependencies and create a new triple with a ‘has label’ predicate.

3 http://www.ebi.ac.uk/webservices/whatizit
4 http://www.ebi.ac.uk

The following example illustrates how a sentence is transformed into triples using the rules applied to the list of phrasal dependencies: ‘Nitric oxide diminishes apoptosis and p53 gene expression after renal ischemia and reperfusion injury.’ The dependencies are listed in Table 1 and the resultant triples are listed in Table 2.

Table 1. Stanford parser typed dependencies

Dep type  Governor    Dependent
amod      oxide       Nitric
nsubj     diminishes  oxide
dobj      diminishes  apoptosis
cc        apoptosis   and
nn        expression  p53
nn        expression  gene
conj      apoptosis   expression
prep      diminishes  after
amod      ischemia    renal
pobj      after       ischemia
cc        ischemia    and
nn        injury      reperfusion
conj      ischemia    injury

Table 2. Triples produced according to rules from dependencies listed in Table 1

Subject       Predicate  Object
nitric oxide  diminish   apoptosis
nitric oxide  diminish   p53 gene expression
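To make the rules concrete, the following Python sketch applies rules 1 to 5 to the dependencies of Table 1. It is an illustration under several assumptions, not the implementation used in this work: tokens are identified by their string rather than by position in the sentence, modifiers are concatenated in the order they are listed, rule 5 is read as matching the conjunct's governor against the triple's object (which is what the Table 1 example requires), rule 6 is omitted, and no lemmatisation is performed, so the predicate comes out as ‘diminishes’ rather than ‘diminish’.

```python
from collections import defaultdict

# Typed dependencies from Table 1 as (type, governor, dependent).
DEPS = [
    ("amod", "oxide", "Nitric"), ("nsubj", "diminishes", "oxide"),
    ("dobj", "diminishes", "apoptosis"), ("cc", "apoptosis", "and"),
    ("nn", "expression", "p53"), ("nn", "expression", "gene"),
    ("conj", "apoptosis", "expression"), ("prep", "diminishes", "after"),
    ("amod", "ischemia", "renal"), ("pobj", "after", "ischemia"),
    ("cc", "ischemia", "and"), ("nn", "injury", "reperfusion"),
    ("conj", "ischemia", "injury"),
]


def triplify(deps):
    """Apply rules 1-5 to a list of typed dependencies (a sketch)."""
    # Rule 1: gather nn/amod modifiers per governor, in listed order.
    modifiers = defaultdict(list)
    for dep_type, gov, dep in deps:
        if dep_type in ("nn", "amod"):
            modifiers[gov].append(dep)

    def label(token):
        # Modifiers followed by the head token, lower-cased.
        return " ".join(modifiers.get(token, []) + [token]).lower()

    # Rules 2 and 3: subjects and objects, each paired with its governor.
    subjects = [(g, d) for t, g, d in deps if t in ("nsubj", "nsubjpass")]
    objects = [(g, d) for t, g, d in deps if t == "dobj"]

    triples = []
    for subj_gov, subj in subjects:
        for obj_gov, obj in objects:
            if subj_gov != obj_gov:
                continue
            # Rule 4: the shared governor becomes the predicate.
            predicate = subj_gov.lower()
            triples.append((label(subj), predicate, label(obj)))
            # Rule 5: conjuncts of the object yield further triples.
            for dep_type, gov, dep in deps:
                if dep_type == "conj" and gov == obj:
                    triples.append((label(subj), predicate, label(dep)))
    return triples


triples = triplify(DEPS)
```

Apart from the unlemmatised predicate, the two triples produced correspond to the rows of Table 2. The second conjunct dependency (ischemia, injury) produces nothing because ‘ischemia’ is not the object of any triple.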

3 Results

3.1 N-gram attribute weighting

There are 42,745 different n-grams in the simple attribute set, so we systematically tested whether a smaller number of attributes would provide better or worse classification results. We used an SVM weighting method to select the top K best attributes for varying values of K. We assessed the performance of each smaller attribute set using the f-measure for the factual (positive) class of titles. The results of our search can be seen in Figure 1; a peak in classification performance (83.7%) is seen at the 525 best attributes. These top 525 n-grams were used in all further title classification processes.

Fig. 1. Classification f-measure for factual class with increasing number of top weighted n-gram attributes.
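The search over K can be pictured as a simple sweep. The Python sketch below is a generic illustration, not the RapidMiner process used here: `weights` stands for per-attribute weights such as those from an SVM, ranking by absolute weight is an assumption, and `score_fn` is a hypothetical stand-in for the cross-validated f-measure of the factual class.

```python
def best_k(weights, score_fn, candidate_ks):
    """Return (k, score, attributes) for the best-scoring top-k subset.

    weights: dict mapping attribute name to weight.
    score_fn: callable taking a list of attribute names and returning
        a score (stand-in for a cross-validated f-measure).
    """
    # Rank attributes from most to least heavily weighted.
    ranked = sorted(weights, key=lambda a: abs(weights[a]), reverse=True)
    best = None
    for k in candidate_ks:
        subset = ranked[:k]
        score = score_fn(subset)
        if best is None or score > best[1]:
            best = (k, score, subset)
    return best


# Toy data: only "a" and "b" are informative, and each extra
# attribute carries a small penalty.
toy_weights = {"a": 0.9, "b": -0.7, "c": 0.1, "d": 0.05}
toy_score = lambda subset: len(set(subset) & {"a", "b"}) - 0.1 * len(subset)
k, score, chosen = best_k(toy_weights, toy_score, [1, 2, 3, 4])
```

In the toy data the score peaks at K = 2 and then declines, which mirrors the shape of the curve described for Figure 1.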

3.2 Validation of training data

We performed 10-fold stratified cross-validation on the training data using the simple and complex attribute sets; the results are presented in Tables 3 and 4. The performance values for the simple and complex attribute sets are not directly comparable; however, they do provide a strong indication of the abilities of the two attribute sets to classify factual sentences. The simple attribute set performs very well (weighted average F1 = 92.41%) in labelling sentences as ‘good’ (factual) or ‘bad’ (non-factual). The complex attribute set performs in a similar manner, giving a weighted average F1 of 90.01%.

To assess the quality of the manual annotations on our training data, the data were annotated by both RS and GD independently; their annotations were found to disagree on 38 (2%) sentences, giving an inter-annotator agreement (using Cohen's kappa coefficient) of 0.936.
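For readers unfamiliar with the statistic, Cohen's kappa corrects the observed agreement for the agreement expected by chance given each annotator's label frequencies. The Python sketch below computes it for two annotators and two classes. The counts in the usage line are invented toy numbers; the annotators' full agreement table is not given here, so the reported value of 0.936 is not reproduced.

```python
def cohens_kappa(both_pos, a_pos_b_neg, a_neg_b_pos, both_neg):
    """Cohen's kappa for two annotators and two classes.

    Arguments are the four cells of the 2x2 agreement table.
    """
    n = both_pos + a_pos_b_neg + a_neg_b_pos + both_neg
    # Proportion of items on which the annotators agree.
    p_observed = (both_pos + both_neg) / n
    # Each annotator's rate of assigning the positive label.
    a_pos = (both_pos + a_pos_b_neg) / n
    b_pos = (both_pos + a_neg_b_pos) / n
    # Agreement expected if the two labelled independently.
    p_expected = a_pos * b_pos + (1 - a_pos) * (1 - b_pos)
    return (p_observed - p_expected) / (1 - p_expected)


# Toy counts (hypothetical): 100 items, 10 disagreements.
kappa = cohens_kappa(both_pos=40, a_pos_b_neg=5, a_neg_b_pos=5, both_neg=50)
```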

We also performed a 50:50 split-validation (stratified) on the training data using both attribute sets (see Tables 5 and 6). This gave similar results to the cross-validation with a weighted average F1 of 91.12% and 90.25% for the simple and complex attribute sets respectively.

Table 3. Classifier cross-validation output on good/bad labelled training data with the simple attribute set

Class                            Precision  Recall  F-measure (F1)  N
Bad                              96.0       94.18   95.08           1530
Good                             79.68      85.33   82.41           409
Weighted average                 92.56      92.32   92.41
Overall classification accuracy  92.32
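The per-class and weighted-average figures in Table 3 are related by two standard formulas: F1 is the harmonic mean of precision and recall, and the weighted average weights each class by its size N. The short Python check below recomputes them from the table values; the helper names are chosen for the example.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def weighted_average(values, supports):
    """Average of per-class values weighted by class size."""
    return sum(v * n for v, n in zip(values, supports)) / sum(supports)


# Precision and recall (percentages) and class sizes from Table 3.
bad_f1 = f1(96.0, 94.18)
good_f1 = f1(79.68, 85.33)
avg_f1 = weighted_average([bad_f1, good_f1], [1530, 409])
avg_precision = weighted_average([96.0, 79.68], [1530, 409])
```

Because the ‘bad’ class is almost four times larger than the ‘good’ class, the weighted average sits much closer to the ‘bad’ class score, which is why the per-class F1 for ‘good’ is the more informative figure for the factual class.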

Table 4. Classifier cross-validation output on good/bad labelled training data with the complex attribute set

Class                            Precision  Recall  F-measure (F1)  N
Bad                              92.94      94.71   93.82           1530
Good                             78.68      73.11   75.79           409
Weighted average                 89.94      90.15   90.01
Overall classification accuracy  90.15

Table 5. Classifier 50:50 split-validation output on good/bad labelled training data with the simple attribute set

Class                            Precision  Recall  F-measure (F1)  N
Bad                              94.38      94.38   94.38           765
Good                             78.92      78.92   78.92           204
Weighted average                 91.12      91.12   91.12
Overall classification accuracy  91.12

Table 6. Classifier 50:50 split-validation output on good/bad labelled training data with the complex attribute set

Class                            Precision  Recall  F-measure (F1)  N
Bad                              93.51      94.25   93.88           771
Good                             77.78      75.49   76.62           198
Weighted average                 90.2       90.3    90.25
Overall classification accuracy  90.3

3.3 Titles to triples

We applied the complex good/bad classifier (built on all the training data) to the 91,626 KUP sentences. This produced a set of 5,735 (6.3%) sentences that were labelled as ‘good’. We manually annotated the first 300 of these sentences and found that 209 (70%) were ‘true’ good sentences. We then retrieved the list of phrasal dependencies for each of the 5,735 ‘good’ sentences and applied a set of rules to construct subject-predicate-object statements (triples) from each of the sentences. These rules are described in the Methods section (see 2.5), but at a simplistic level they concatenate compound nouns and adjectival modifier tokens and link them together by verbs present in both nominal-subject and direct-object dependencies. The rules generated a set of 7,113 triples, containing 9,080 unique subjects and objects and 1,255 unique predicates. In a manual analysis of a sample of 150 triples extracted from the KUP titles, we found that 96 (64%) were correct. It should be emphasised that titles erroneously classified as ‘good’ were found to commonly produce incorrect triples, thus compounding errors made before triplification. Furthermore, the Stanford parser has not been trained on biomedical text; this can lead to parser errors and therefore dependency and triplification errors. The raw output from the rules can be found on myExperiment.

3.4 Network construction

The triples generated from the ‘good’ KUP titles were formed into a graph by connecting triples with shared subject and object labels. These labels were stemmed and joined based on string matching. The largest connected component of this graph (see myExperiment for Cytoscape graph) contains 2,676 nodes and has 2,765 edges of 603 distinct types (as defined by the predicate label). The central region of this graph (see Figure 2) contains a number of highly connected entities, the most highly connected being ‘rats’. Other highly connected entities are ‘kidney’, ‘renal function’, ‘angiotensin II’ and ‘renal injury’.
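The graph construction can be sketched in a few lines of Python. This is a minimal illustration rather than the procedure used to build the Cytoscape graph: labels are matched by exact string (the stemming step is omitted), edges are treated as undirected for the purpose of connectivity, and the third triple in the usage example is hypothetical.

```python
from collections import defaultdict, deque


def largest_component(triples):
    """Return the node set of the largest connected component.

    Triples are (subject, predicate, object); subjects and objects
    are nodes and each triple links its two nodes.
    """
    neighbours = defaultdict(set)
    for subj, _pred, obj in triples:
        neighbours[subj].add(obj)
        neighbours[obj].add(subj)

    seen, best = set(), set()
    for start in neighbours:
        if start in seen:
            continue
        # Breadth-first search from an unvisited node.
        component, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for nxt in neighbours[node]:
                if nxt not in component:
                    component.add(nxt)
                    queue.append(nxt)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


example = [
    ("nitric oxide", "diminish", "apoptosis"),            # Table 2
    ("nitric oxide", "diminish", "p53 gene expression"),  # Table 2
    ("curcumin", "attenuate", "renal injury"),            # hypothetical
]
core = largest_component(example)
```

Triples join the same component only when they share a subject or object label, so the quality of label normalisation directly controls how connected the resulting network is.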

4 Conclusions

The literature is a vital source of domain knowledge and this knowledge needs to be exposed in integrated, computationally accessible forms. As well as integrating the literature with databases and ontologies we need to integrate factual statements between articles and across the literature in general.

[Figure 2: central region of the largest connected component of the triple graph, with predicate labels on edges (e.g. ‘inhibit’, ‘attenuate’, ‘prevent’) and entity labels on nodes (e.g. ‘curcumin’, ‘losartan’, ‘renal ischemia reperfusion injury’).]

ANONYMOUS-1572

ANONYMOUS-2077

heart rate

ramipril

trimetazidine

ANONYMOUS-2069

ANONYMOUS-745

ANONYMOUS-2451

ANONYMOUS-1186

ANONYMOUS-463

ANONYMOUS-2060

ANONYMOUS-2425

ANONYMOUS-2428

ANONYMOUS-565

dietary glutathione

gum halts

ANONYMOUS-1247

ANONYMOUS-822

ANONYMOUS-1787

recombinant vascular endothelial growth factor 121

podocyte stress

receptor antagonists

ANONYMOUS-292

ratshypertension

atrap deficiency

intracellular na +

macromolecules

ANONYMOUS-1044

ANONYMOUS-1348

ANONYMOUS-1915

pro-oxidant generation

ANONYMOUS-1574

ANONYMOUS-946

infiltration mellitus

ANONYMOUS-2166

ANONYMOUS-601

chronic graft dysfunction

ANONYMOUS-424

ANONYMOUS-997

map

experimental renal warm ischemia

3-hydroxy-3-methylglutaryl coenzyme

enzyme

ANONYMOUS-418

ANONYMOUS-415

ANONYMOUS-1035

arterial blood pressure

renal proximal tubular cells

gap junction protein connexin43

intrarenal adenosine

ANONYMOUS-2207

ANONYMOUS-535

ANONYMOUS-2135

ANONYMOUS-786

naloxone

ANONYMOUS-512

stroke dysfunction

ANONYMOUS-2199

enalapril

increased kinin survival

ANONYMOUS-1069

ANONYMOUS-1317

rat ischemic acute renal failure

high linoleic acid diets

hyperbaric oxygen therapy

glycine

ANONYMOUS-1113

ANONYMOUS-1299

ANONYMOUS-320

ANONYMOUS-2183

ANONYMOUS-1119

hyperhomocysteinemia

ANONYMOUS-961

ANONYMOUS-881

ANONYMOUS-1046

eta receptor blockade

ticlopidine

ANONYMOUS-1556

oral erdosteine administration

bilateral native nephrectomy

spirulina

ANONYMOUS-1360

ANONYMOUS-2200

ANONYMOUS-2516

ANONYMOUS-1183

ANONYMOUS-11

ANONYMOUS-1412

ANONYMOUS-819

ANONYMOUS-1522

ANONYMOUS-2172

ANONYMOUS-2034

ANONYMOUS-2245

renal changes

endothelin type

angiotensin ii infused

preglomerular vascular changes

ANONYMOUS-2155

lacidipine

ANONYMOUS-248

cholesterol

peroxisome proliferator-activated receptor-gamma agonists

ANONYMOUS-1284

ANONYMOUS-91

ANONYMOUS-500

p38alpha mapk

severe functionally significant muscle wasting

ANONYMOUS-2120

dialysis modality

ANONYMOUS-51

ANONYMOUS-459

ANONYMOUS-1608

ANONYMOUS-119

ANONYMOUS-2412

ANONYMOUS-1229

cardiac myocyte o2 consumption

ANONYMOUS-1385

crystals

ANONYMOUS-2375

long-term

ANONYMOUS-208

streptozocine diabetes

cardiac function

ANONYMOUS-1839

aging renal

ANONYMOUS-131

ANONYMOUS-2641

ANONYMOUS-1654

ANONYMOUS-2662

ANONYMOUS-1817

surgery

ANONYMOUS-1126

ANONYMOUS-495

phyllodiscus semoni

contrast media tissue concentration

ANONYMOUS-126

chronicdose a

culture

ANONYMOUS-2576

homocysteine aminothiol concentrations

intra-aortic balloon pumping

successful kidney transplantation

ANONYMOUS-810

prophylactic nesiritide

ANONYMOUS-2610

platelet

azelnidipine

nonencapsulated drug excretionlong term

calcium-sensing receptor

aspirin

tertatolol

delayed kidney function risk

ANONYMOUS-125

plasma

18-year-old due analgesic

ANONYMOUS-2598

renal hemodynamic disorder

ANONYMOUS-1430

ANONYMOUS-2182

ischemia stress

slit diaphragm proteins neph1 glomerular podocyte effacement

enfuvirtide

crystal-cell interaction

ANONYMOUS-1913

ANONYMOUS-384

ANONYMOUS-1903

ANONYMOUS-2680

5-aminoisoquinolinone

ANONYMOUS-1563

ANONYMOUS-2379

ANONYMOUS-1558

mature dendritic cells

ANONYMOUS-1370

ANONYMOUS-1012

anti-monocyte protein-1 gene therapy

ANONYMOUS-547

ANONYMOUS-66

bioflavonoid

endogenous

ANONYMOUS-1867

ANONYMOUS-2634

normal mice

ANONYMOUS-860

connective tissue growth factor expression

quercetin

charybdotoxin-sensitive activated k + channel

multiple autoantibodies

renal vasoconstriction

ANONYMOUS-1202

ANONYMOUS-428

aneurysm

kidneyANONYMOUS-1823

ANONYMOUS-1364

ANONYMOUS-1588

thromboxane receptor

dysfunction

rho-kinase inhibition

renal chronic transplant dysfunction

ANONYMOUS-1751

ANONYMOUS-1573

ANONYMOUS-687

kidney stones

freeze-fractured rat kidney cortex

ANONYMOUS-1902

direct renoprotective effects

renal artery

spironolactone

ANONYMOUS-924

ANONYMOUS-1862

angiotensin-stimulated tubular sodiumatrialnatriuretic

peptidewater reabsorption

contrast enhanced sonography

ANONYMOUS-1320

renal glutamine extraction

interstitial fibroblasts

lipids

Fig. 2. Central portion of the largest connected component of the triple graph. Pred-icates labels are shown in red, subject and object labels are represent as blue nodes.Graph visualised in Cytoscape using organic layout

In this paper, we have presented an approach for digging into factual knowledge presented in scientific papers and for making this knowledge available through a network of facts. We embarked on an exercise of ‘factomics’, where we retrieve and encode fact-like statements from scientific papers that are rich in the context necessary to fully interpret such facts [11]. Text mining technologies offer a tempting means to expose this literature-based knowledge, yet can suffer from the need to create corpora of focused collections of statements of interest. We have presented a method for creating a focused corpus of one kind of statement and turning this into triples to be used in a domain knowledge base. Our approach does not attempt any kind of full extraction of the scientific knowledge necessary for the interpretation of the facts. Instead, we have taken the approach that we are exposing the ‘headlines’ of what has been said, and providing links back to the original paper for when a scientist finds a ‘fact of interest’.

Our main goal was to create a corpus of fact-orientated statements about a particular domain. We did this by training a classifier to recognise titles that provide factual statements about the outcome of an investigation. We then turned these into co-ordinated sets of triples using a dependency parser, in which we expand key verb relationships into triples that link domain entities.
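As an illustration of the triplification step, the following minimal sketch expands verb relationships into subject-predicate-object triples. It assumes the parser output is already available as (relation, head, dependent) tuples using the `nsubj` and `dobj` relation names from the Stanford typed dependencies; the function name and the example title are hypothetical, and our actual implementation handles more relation types than shown here.

```python
from collections import defaultdict


def triples_from_dependencies(dependencies):
    """Build (subject, predicate, object) triples from typed dependency edges.

    `dependencies` is a list of (relation, head, dependent) tuples.
    Only nominal subjects (nsubj) and direct objects (dobj) are used;
    this is a simplifying assumption for the sketch.
    """
    subjects = defaultdict(list)
    objects = defaultdict(list)
    for relation, head, dependent in dependencies:
        if relation == "nsubj":
            subjects[head].append(dependent)
        elif relation == "dobj":
            objects[head].append(dependent)

    triples = []
    # A verb yields a triple only if it has both a subject and an object.
    for verb, verb_subjects in subjects.items():
        for subj in verb_subjects:
            for obj in objects.get(verb, []):
                triples.append((subj, verb, obj))
    return triples


# Hypothetical title: "Curcumin inhibits renal fibrosis"
example = [
    ("nsubj", "inhibits", "curcumin"),
    ("amod", "fibrosis", "renal"),
    ("dobj", "inhibits", "fibrosis"),
]
print(triples_from_dependencies(example))
```

The sketch keeps only the head noun of the object; recovering the full phrase ("renal fibrosis") would require following modifier relations such as `amod`.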

The results of the cross-validation work provide some useful insights into the two different approaches to representing sentences to a classifier. The complex approach, although slow to process, returns very similar results on the classification of ‘factual’ sentences to the simple attribute set. It should, however, be noted that the simple attributes can be derived in approximately 10% of the time required for the complex attributes. Additionally, a classifier model trained on simple attributes can label new sentences at nearly 100 times the rate of a model trained on the complex attributes. These results demonstrate that a simple statistical language-modelling approach can be superior when dealing with large volumes of data, when the number of classes is low and when the number of training examples for each class is not a limiting factor. When training data is scarce, a more complex semantic and grammatical sentence profiling method is likely to provide better results, albeit with greater computational overheads.
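To make the ‘simple’ approach concrete, the sketch below trains a unigram multinomial naive Bayes classifier with add-one smoothing on labelled titles. This is an assumed stand-in for a statistical language-modelling classifier, not a description of the exact model or attribute set used in our experiments; the function names and the four training titles are hypothetical.

```python
import math
from collections import Counter


def train(labelled_titles):
    """Count word occurrences per class from (title, label) pairs."""
    word_counts = {}
    class_counts = Counter()
    vocabulary = set()
    for title, label in labelled_titles:
        tokens = title.lower().split()
        class_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokens)
        vocabulary.update(tokens)
    return word_counts, class_counts, vocabulary


def classify(model, title):
    """Return the label with the highest smoothed log-probability."""
    word_counts, class_counts, vocabulary = model
    total_titles = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_titles in class_counts.items():
        counts = word_counts[label]
        # Add-one (Laplace) smoothing over the training vocabulary.
        denominator = sum(counts.values()) + len(vocabulary)
        score = math.log(n_titles / total_titles)
        for token in title.lower().split():
            score += math.log((counts[token] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy training set; real training uses many more titles.
training = [
    ("curcumin inhibits renal fibrosis", "factual"),
    ("losartan reduces glomerular damage", "factual"),
    ("a review of renal fibrosis", "non-factual"),
    ("the emerging role of endothelin", "non-factual"),
]
model = train(training)
print(classify(model, "sildenafil inhibits renal injury"))
```

Because both training and labelling reduce to token counting and a few additions per word, a model of this kind scales easily to large title collections, which is the trade-off discussed above.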

There is a notable disparity between the proportion of manually verified ‘good’ titles in the KUP titles and in the training data. This is likely to be explained by differences in collection scope for the two title sets. The KUP titles were collected without temporal constraints, whereas the training data is taken from PubMed Central, which was launched in February 2000. Also, the set of biologically-themed journals is likely to commonly publish articles with a strong results-orientated theme.

The results of this work are satisfactory so far, but clearly the complex approach to classification could, for example, be improved with more information on the recognised entities. Additionally, the classification step is vital for removing ‘non-factual’ titles because the triplification output from these titles is frequently erroneous. Given that our aim is to create a focused corpus of facts about a domain, our methods have deliberately tended toward greater precision rather than recall to avoid spurious or unreliable statements. It seems that improvements in the title classification process should pay the greatest dividends, by providing a tighter and more focused set of genuinely factual titles to the triplification process.
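The preference for precision over recall can be illustrated with a small sketch that computes both measures for titles accepted above a classifier score threshold. The scores, labels and thresholds below are hypothetical and are not results from our experiments.

```python
def precision_recall(scored_titles, threshold):
    """Compute precision and recall for titles accepted at `threshold`.

    `scored_titles` is a list of (score, is_factual) pairs, where `score`
    is a classifier confidence and `is_factual` is the manual label.
    """
    tp = sum(1 for score, factual in scored_titles if score >= threshold and factual)
    fp = sum(1 for score, factual in scored_titles if score >= threshold and not factual)
    fn = sum(1 for score, factual in scored_titles if score < threshold and factual)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical scores for five manually labelled titles.
scored = [(0.95, True), (0.90, True), (0.70, False), (0.60, True), (0.30, False)]

for threshold in (0.5, 0.8):
    print(threshold, precision_recall(scored, threshold))
```

Raising the threshold in this toy example removes the one wrongly accepted title at the cost of losing one genuine factual title, which mirrors the trade-off we accept in order to keep erroneous triples out of the corpus.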

One difference from past work on knowledge extraction in this field is that we exploit only a small part of the text, the article title. Although our training data have been collected from a specific domain, our computational approach is generic and our algorithms, feature sets and classification types of titles can be used in other areas without major modifications.

To interlink our triples with the KUPKB, we intend to map our textual subject, predicate and object values to URIs derived from several sources. Using our existing named entity recognition results from Whatizit (see Section 2), we will replace matching subject and object values with URIs for UniProt (in the case of proteins), ChEBI (for chemical entities) and UMLS terms (for diseases). We will also use the NCBO BioPortal Annotator to replace further subject and object labels with URIs from the Mouse anatomy ontology. We will also replace any predicate values that match terms in the Molecular Interactions ontology. All of these mapped ontologies are currently part of the KUPKB, thus easing the integration of literature-derived knowledge into the knowledge base.
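The intended mapping can be sketched as a label-to-URI lookup applied to each position of a triple. The lookup tables, the function name and the `http://example.org/...` URIs below are placeholders chosen for illustration; they are not real UniProt, ChEBI, UMLS or Molecular Interactions identifiers, and in practice the tables would be populated from the Whatizit and NCBO BioPortal Annotator results.

```python
def link_triple(triple, entity_uris, predicate_uris):
    """Replace textual labels in a triple with URIs where a mapping exists.

    Labels without a mapping are left as plain text so that no
    information is lost when a term is not recognised.
    """
    subject, predicate, obj = triple
    return (
        entity_uris.get(subject.lower(), subject),
        predicate_uris.get(predicate.lower(), predicate),
        entity_uris.get(obj.lower(), obj),
    )


# Placeholder mappings (hypothetical URIs, not real ontology identifiers).
ENTITY_URIS = {
    "curcumin": "http://example.org/chemical/curcumin",
    "renal fibrosis": "http://example.org/disease/renal-fibrosis",
}
PREDICATE_URIS = {
    "inhibit": "http://example.org/interaction/inhibition",
}

linked = link_triple(("Curcumin", "inhibit", "renal fibrosis"), ENTITY_URIS, PREDICATE_URIS)
print(linked)
```

Leaving unmapped labels untouched means partially linked triples can still be stored and revisited as further ontologies are added.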

References

1. Jupp, S., Klein, J., Schanstra, J., Stevens, R.: Developing a Kidney and Urinary Pathway Knowledge Base. Proceedings of Bio-ontologies SIG, ISMB, 2010

2. Croset, S., Grabmueller, C., Li, C., Kavaliauskas, S., Rebholz-Schuhmann, D.: The CALBC RDF Triple Store: Retrieval over Large Literature Content. Proceedings of SWAT4LS, 2010

3. Coulet, A., Shah, N.H., Garten, Y., Musen, M.A., Altman, R.B.: Using text to build semantic networks for pharmacogenomics. J. Biomed. Informatics 43(6): 1009–1019, 2010

4. Coulet, A., Garten, Y., Dumontier, M., Altman, R., Musen, M., Shah, N.: Integration and publication of heterogeneous text-mined relationships on the Semantic Web. J. Biomed. Semantics 2(Suppl 2): S10, 2011

5. Sebastiani, F.: Machine learning in automated text categorization. ACM Computing Surveys 34(1): 1–47, 2002

6. Klein, D., Manning, C.: Accurate Unlexicalized Parsing. Proceedings of the 41st Meeting of the Association for Computational Linguistics, Sapporo, Japan, 423–430, 2003

7. Coulet, A., Shah, N.H., Hunter, L., Baral, C., Altman, R.B.: Extraction of Genotype-Phenotype-Drug Relationships from Text: From Entity Recognition to Bioinformatics Application. Proceedings of Pacific Symposium on Biocomputing, 2010

8. Gaizauskas, R., Demetriou, G., Artymiuk, P.J., Willett, P.: Protein structures and information extraction from biological texts: the PASTA system. Bioinformatics 19(1): 135–143, 2003

9. Humphreys, K., Demetriou, G., Gaizauskas, R.: Automatically Augmenting Terminological Lexicons from Untagged Text. Proceedings of the Workshop on Natural Language Processing for Biology, held at the Pacific Symposium on Biocomputing (PSB2000), Hawaii, USA, 505–516, 2000

10. de Marneffe, M.C., MacCartney, B., Manning, C.D.: Generating Typed Dependency Parses from Phrase Structure Parses. Proceedings of LREC, 2006

11. Mons, B., Velterop, J.: Nano-publication in the e-science era. Proceedings of SWASD, 2009