Creating a focused corpus of factual outcomes from ...stevensr/papers/mind2011.pdf · Titles can be...
Transcript of Creating a focused corpus of factual outcomes from ...stevensr/papers/mind2011.pdf · Titles can be...
Creating a focused corpus of factual outcomesfrom biomedical experiments
James Eales, George Demetriou, and Robert Stevens
School of Computer Science,University of Manchester, UK
{james.eales,george.demetriou,robert.stevens}
@manchester.ac.uk
Abstract. Keywords: text mining, data mining, knowledge network
The results of an experiment are often described in a series of textual state-ments, the most concise of which being the title of the article. Here we imple-mented a novel approach, using standard data mining techniques, to collect a setof concise ‘factual’ statements about a research area. We compare two standardtext classification approaches to identify ‘factual’ and ‘non-factual’ sentences inarticle titles; the first of which uses a statistical language-modelling approach,and the second a more sophisticated semantic and grammatical approach. Wefind that the simple approach provides more accurately classified titles; achiev-ing 92% overall accuracy compared to 90% for the complex approach. We alsoimplement a strategy to convert the phrasal dependencies in a ‘factual’ title intosubject-predicate-object structures (triples). These triples can then be organisedaccording to a schema provided by domain ontologies; which occurs by mappingURIs to entities found in the textual labels.
1 Introduction
Publicly available data repositories, such as GenBank, EMBL and Swiss-Prot,are proof of the willingness of the scientific community to adopt common dataformats and automated procedures in order to deposit their data in these re-sources. In some cases, researchers deposit their data long before the accom-panying publications are out. In contrast, the development of many specialisedknowledge bases still largely relies on results presented in the literature and thisis often a costly and laborious exercise given the vast amounts of text in thefield.
In this paper, we aim to identify a focused set of reliable statements or ‘facts’on what is reported about a relatively narrow, yet important area of biomedicalresearch, relating to disease influencing the kidney and urinary pathway (KUP).We then use the extracted facts to build a network of knowledge for inclusion inan existing knowledge base, the KUP Knowledge Base (KUPKB) [1].
2 J. Eales, G. Demetriou, R. Stevens
The KUPKB has been created from existing databases and ontologies foruse in the e-LICO project1, it also has its own ontology (KUPO) for modelingrelationships between data [1]. Currently the KUPKB has been populated withexperimental results manually extracted from the published literature. This pro-cess takes a considerable amount of effort to identify statements in the literatureand form them into a representation compatible with the knowledge base [2–4].We assist the enrichment of KUPKB by providing relational triples for inclusionthat are mined from the literature and are known to describe an experimentalresult. We use standard text classification techniques [5] to label titles as ‘fac-tual’ or ‘non-factual’, the ‘factual’ titles are then processed into triple structuresusing a set of rules applied on the list of phrasal dependencies provided by theStanford parser.
Traditionally most text mining systems use abstracts or full text articles;instead we use article titles. We are analysing titles for several reasons; first, theyare short and to-the-point; second, they are descriptive and tend to use specificrather than general terms. The intention of an article title is to summarise thefindings of a whole study into a single sentence. The title is also the first thing auser sees when searching PubMed and it is therefore important for advertisingthe article to potential readers. If a study identifies a new piece of definableknowledge then the authors will usually want to present this clearly in the title.If a study finds a slightly less than clear result, then the language used to describeit is often softened and we can detect this using text mining methods. On theother hand, our task is quite challenging; titles can use complex language and arealso quite short. Nevertheless, our process for mining the information involvesthe computationally expensive task of dependency parsing [3, 6], therefore it isimportant to limit the amount of text to be analysed.
Previous work in this area has also used dependency parsing [3, 7, 4] and thishas proven useful in the field of pharmacogenomics when looking for relation-ships between predefined sets of entities. Our approach focuses on identifying thefacts presented from arbitrary biological articles, representing them as triples,and then later matching these to entities in existing ontologies. Further workincorporating the use of semantic patterns for identifying entity relationships[8, 9] has proven useful for capturing relationships describing protein structure,metabolic pathways and the function of enzymes. All of which have used a se-mantic framework to make the results of the analysis more widely usable, butalso to make it easier to incorporate newly identified relationships with existingknowledge and then form queries based on the combined relationships; it is thisflexibility that we seek in this work.
A by-product of our work is text categorization: a focused corpus of articlesrelevant to a task can provide time and cost savings to users who may want touse it for futher computational analysis or even human curation.
1 http://www.e-lico.eu/
Creating a focused corpus of factual outcomes from biomedical experiments 3
1.1 A corpus of factual titles
Our approach is to collect a set of titles, classify them into factual and non-factual groups and then extract sets of triples from the factual titles. We definea factual title here as ‘A direct sentential report about the outcome of abiomedical investigation ’. An example of a factual title is:
‘Bluetongue virus RNA binding protein NS2 is a modulator of viral repli-cation and assembly.’ (PubMed ID: 17241458)
It should be ‘direct’, in that, it is neither hedged nor does it merely implya result but actually states the result in the title. Such a result can be both apositive or negative outcome.
Such statements do not contain all the necessary contextual information tofully comprehend the implications of its finding, but this is not our aim; insteadwe hope to capture what is reported by the authors and then present this toother readers who can investigate further. An example of a non-factual title is:
‘A role for NANOG in G1 to S transition in human embryonic stem cellsthrough direct binding of CDK6 and CDC25A’ (PubMed ID: 19139263)
This title contains many specifics (e.g. the NANOG protein, and the G1 andS cell cycle phases) and does allude to a role of NANOG in cell cycle transition,but the role is not explicitly defined, it merely suggests that a role exists. Suchstatements are important and could be used, but the lack of an explicit role wouldhave to be recorded; this will be future work. Our work has also revealed otherkinds of titles, such as ones that report, ‘hedged’ or possible results; describe toolsor methods; and those that simply say what the article is about. In this workwe concentrate on the direct descriptions of an investigation to create a focusedcorpus of statements on a given topic. This corpus can then be integrated withthe existing KUPKB framework, thus linking textually represented experimentaloutcomes with relevant database entries and ontology concepts.
2 Methods
Additional resources and supplementary data for this paper can be found in amyExperiment pack. http://www.myexperiment.org/packs/193.html
2.1 Software and workflows
All data-mining workflows were designed and run in RapidMiner 5.1.006. Work-flows can be found on myExperiment as well as input data files, including theannotated training data. The Stanford parser version 1.6.5 was used for all de-pendency parsing and the OpenNLP POS-tagger was used to tag all sentences,we also used the OpenNLP english sentence detector to split titles into sentences.
4 J. Eales, G. Demetriou, R. Stevens
2.2 Title classification
Title sentences in the training data were labelled as belonging to one of twoclasses, these are
1. Factual: titles that provide a directly reported experimental outcome e.g.– ‘AICAR decreases the activity of two distinct amiloride-sensitive Na+-
permeable channels in H441 human lung epithelial cell monolayers.’– ‘Inhibition of p38 MAP kinase pathway induces apoptosis and prevents
Epstein Barr virus reactivation in Raji cells exposed to lytic cycle induc-ing compounds’
2. Non-factual: Those titles that do not fit with the description of factualtitles e.g.– ‘The CENP-S complex is essential for the stable assembly of outer kine-
tochore structure.’– ‘A dual role of Cdk2 in DNA damage response.’– ‘Cytoplasmic signaling in the control of mitochondrial uproar?’– ‘Anni 2.0: a multipurpose text-mining tool for the life sciences.’– ‘Still no flying cars.’
2.3 Data sets
Titles can be composed of several different sentences and these can be a mix offactual and non-factual statements. Therefore all titles were preprocessed to splitthem into sentences, this was done using the standard english sentence detectorfrom OpenNLP2 and a set of simple rules which account for some vagaries foundin biological titles. Our aim with sentence splitting was to provide more clearlydefined instances for training a classification model.
We collected a training set of 1,875 article titles (1,939 sentences), these titleswere restricted to those articles published by a set of 82 biologically-themedjournals that are archived by PubMed Central (see myExperiment for list). Thetraining sentences were manually annotated as ‘factual’ or ‘non-factual’.
We have also collected a set of 86,217 titles (91,626 sentences) that are relatedto kidney research and therefore the KUP domain. These we identified using aPubMed search using the keywords ‘kidney’ and ‘renal’ in title and abstracttext. These titles have not been manually annotated but are to be used as arepresentative sample of the KUP literature to illustrate the kind of outputsthat our approach can provide.
2.4 Attributes/Features
We derive two different sets of attributes for each data set. The first are simpletri, bi and unigram word frequency data (the simple attribute set), the secondprovide a more complex impression (the complex attribute set) of a title but also
2 http://incubator.apache.org/opennlp/
Creating a focused corpus of factual outcomes from biomedical experiments 5
are more computationally expensive to derive. The simple attribute set originallycontained 42,745 attributes but was filtered down to 525 attributes using an SVMweighting method and by selecting the top k attributes that provide the best‘good’ class classification f-measure score in a 10-fold stratified cross-validationusing a naive Bayes classifier (see section 3). The simple attribute data areclassified using naive Bayes because they form part of a statistical language-modelling approach.
The complex attribute set, is derived from a set of string-based, lexical, se-mantic and grammatical profiling methods. We collect binary attributes on thepresence of question marks, hyphens, colons and verbs. We also provide the partof speech (POS) tag for the first, second and third tokens. We use the Whatizit3
web service to count the mentions of disease, protein and chemical entities inthe title and we use our own service for cell type recognition. The Whatizit webservice is a third-party service hosted at the EBI4 that uses a large dictionaryof exact string matches to identify entity mentions in text and return databaseidentifiers for these mentions. Additionally we provide data on the prevalence ofdifferent phrase chunk types in the sentence as well as the proportion of nouns,verbs, adverbs and adjectives. The number of all different types of phrasal de-pendencies (provided by the Stanford parser) as well as the maximum parse treedepth are also used as attributes. Finally we calculate the frequency of a set ofcommon POS-tag tri- and bi-grams in the title. This set of POS-tag n-gramswere chosen for their differential frequencies between factual and non-factualtitles. There are a total of 795 attributes in the complex set. The complex at-tribute data are classified using an SVM trained using the Sequential MinimalOptimisation method (as implemented in the WEKA extension to RapidMiner)because this was found to give the best results on this attribute set.
2.5 Triplification
Our triplification process uses the dependency parse of the sentence, providedby the Stanford parser [6, 10], to identify subjects, objects and predicates by theapplication of heuristic rules. The rules are applied in the following order:
1. Concatenate all compound noun (nn) and adjectival modifier (amod) depen-dencies on shared governor tokens.
2. Identify all nominal subject (nsubj) and nominal subject passive (nsubjpass)dependencies. (the subjects)
3. Identify all direct object (dobj) dependencies (the objects)4. Attempt to join all subject and object dependencies by a common governor
token. The shared governor token becomes the predicate of a new triple,with the dependent token of the subject and object dependency becomingthe subject and object of the new triple respectively.
5. For each new triple look for conjunct (conj) dependencies with a token sharedbetween the triples object and the dependencys dependent token. Create new
3 http://www.ebi.ac.uk/webservices/whatizit4 http://www.ebi.ac.uk
6 J. Eales, G. Demetriou, R. Stevens
triples with shared subject and predicate tokens, but with the object set asthe dependent token of the conjunct dependency.
6. For each sentence look for abbreviation (abbrev) dependencies and create anew triple with a ‘has label’ predicate.
The following example illustrates how a sentence is transformed into triplesusing the rules applied to the list of phrasal dependencies. ‘Nitric oxide dimin-ishes apoptosis and p53 gene expression after renal ischemia and reperfusioninjury.’ The set of dependencies are listed in Table 1 and the resultant set oftriples are listed in Table 2.
Table 1. Stanford parser typed dependencies
Dep type Governor Dependent
amod oxide Nitricnsubj diminishes oxidedobj diminishes apoptosiscc apoptosis andnn expression p53nn expression geneconj apoptosis expressionprep diminishes afteramod ischemia renalpobj after ischemiacc ischemia andnn injury reperfusionconj ischemia injury
Table 2. Triples produced according to rules from dependencies listed in Table 1
Subject Predicate Object
nitric oxide diminish apoptosisnitric oxide diminish p53 gene expression
3 Results
3.1 N-gram attribute weighting
There are 42,745 different n-grams in the simple attribute set, we systemati-cally tested whether a lower number of attributes would provide better or worseclassification results. We used an SVM weighting method to select the top K
Creating a focused corpus of factual outcomes from biomedical experiments 7
best attributes for varying numbers of K. We assessed the performance of eachsmaller attribute set using the f-measure for the factual (Positive) class of titles.The results of our search can be seen in Figure 1, a peak in classification per-formance (83.7%) is seen at 525 of the best attributes. These top 525 n-gramswere used in all further title classification processes.
Fig. 1. Classification f-measure for factual class with increasing number of top weightedn-gram attributes.
3.2 Validation of training data
We performed 10-fold stratified cross-validation on the training data using thesimple and complex attribute sets, the results of which are presented in Tables 3and 4. The performance values between simple and complex attribute sets arenot directly comparable, however they do provide a strong indication of theabilities of the two attribute sets to classify factual sentences. When using thesimple attribute set we can see how it performs very well (weighted average F1= 92.41%) in labelling sentences as ‘good’ or ‘bad’. The complex attribute setperforms in a similar manner giving a weighted average F1 of 90.01%.
To assess the quality of the manual annotations on our training data, it wasannotated by both RS and GD independently; their annotations were foundto disagree on 38 (2%) sentences, giving an inter-annotator agreement (usingCohens kappa coefficient) of 0.936.
We also performed a 50:50 split-validation (stratified) on the training datausing both attribute sets (see Tables 5 and 6). This gave similar results to the
8 J. Eales, G. Demetriou, R. Stevens
cross-validation with a weighted average F1 of 91.12% and 90.25% for the simpleand complex attribute sets respectively.
Table 3. Classifier cross-validation output on good/bad labelled training data withthe simple attribute set
Class Precision Recall F-measure (F1) N
Bad 96.0 94.18 95.08 1530Good 79.68 85.33 82.41 409
Weighted average 92.56 92.32 92.41
Overall classification accuracy 92.32
Table 4. Classifier cross-validation output on good/bad labelled training data withthe complex attribute set
Class Precision Recall F-measure (F1) N
Bad 92.94 94.71 93.82 1530Good 78.68 73.11 75.79 409
Weighted average 89.94 90.15 90.01
Overall classification accuracy 90.15
Table 5. Classifier 50:50 split-validation output on good/bad labelled training datawith the simple attribute set
Class Precision Recall F-measure (F1) N
Bad 94.38 94.38 94.38 765Good 78.92 78.92 78.92 204
Weighted average 91.12 91.12 91.12
Overall classification accuracy 91.12
3.3 Titles to triples
We applied the complex good/bad classifier (built on all the training data) tothe 91,626 KUP sentences. This produced a set of 5,735 (6.3%) sentences that
Creating a focused corpus of factual outcomes from biomedical experiments 9
Table 6. Classifier 50:50 split-validation output on good/bad labelled training datawith the complex attribute set
Class Precision Recall F-measure (F1) N
Bad 93.51 94.25 93.88 771Good 77.78 75.49 76.62 198
Weighted average 90.2 90.3 90.25
Overall classification accuracy 90.3
were labelled as ‘good’. We manually annotated the first 300 of these sentencesand found that 209 (70%) were ‘true’ good sentences. We then retrieved the listof phrasal dependencies for each of the 5,735 ‘good’ sentences and applied a setof rules to construct subject-predicate-object statements (triples) from each ofthe sentences. These rules are described in the Methods section (see 2.5), butat a simplistic level they concatenate compound nouns and adjectival modifiertokens and link them together by verbs present in both nominal-subject anddirect-object dependencies. The rules generated a set of 7,113 triples, contain-ing 9,080 unique subject and objects and 1,255 unique predicates. In a manualanalysis of a sample of 150 triples extracted from the KUP titles, we found that96 (64%) were correct. It should be emphasised that titles erroneously classifiedas ‘good’ were found to commonly produce incorrect triples, thus compoundingerrors made before triplification. Furthermore the Stanford parser has not beentrained on biomedical text, this can lead to parser errors and therefore depen-dency and triplification errors. The raw output from the rules can be found onmyExperiment.
3.4 Network construction
The triples generated from the ‘good’ KUP titles were formed into a graphby connecting triples with shared subject and object labels. These labels werestemmed and joined based on string matching. The largest connected componentof this graph (see myExperiment for Cytoscape graph) contains 2,676 nodes andhas 2,765 edges of 603 distinct types (as defined by the predicate label). Thecentral region of this graph (see figure 2) contains a number of highly connectedentities, the most highly connected being ‘rats’. Other highly connected entitiesare ‘kidney’, ‘renal function’, ‘angiotensin II’ and ‘renal injury’.
4 Conclusions
The literature is a vital source of domain knowledge and this knowledge needsto be exposed in integrated, computationally accessible forms. As well as inte-grating the literature with databases and ontologies we need to integrate factualstatements between articles and across the literature in general.
10 J. Eales, G. Demetriou, R. Stevens
enhance
vasodilatation
cause
regression
uptake
cause
cause
induction
inhibit
rate
increase
inhibitinhibit
inhibit
reduce
actionsassociation
preventinduceprevent
severity
reduce
develop
reduce
has_labeldecrease
regulate
induce
regulate
decrease
enhance
reducepromoteregulateprotect
reduce
decreaseretard
is_a
attenuate
attenuate
is_a
activity
inhibit
numberexpression
augment
expression
increaseinfluence
increase
severity
induce
mediate
modulate
accelerate
induce
regulateinduce stimulate
promote
preventexacerbate
attenuate
decrease
effect
increase
potentiate
regulate
attenuate
increase
potentiate
inhibit
postoperative
attenuate
aggravate
inhibitis_a
prevent
attenuate
improve
attenuateprevent
is_apromote
bluntbiomarkers
rats
rats
actionhave
stability
increaseupregulateexpression
induce
is_a
transplantation convertinositol
development
is_a
decrease
upregulate
protect
attenuate
attenuate
up-regulates
convert
do
reduce
is_a
affect
strain
up-regulates
decrease
uptake
inhibit
influence
receive
utilization
promote
enhance
accelerate
load
nephrocalcinaffection
choose
pharmacokinetic
have
levelhypertensiondistribution
have
activity
proteinvolume
flow
deposits
adjustment
coagulopathyperfusion
dysfunction
pressuresensitivityperfusion
responses
survival
dialyse
pyelonephritis
mortality
enhance showabundance
decrease
pair
effectshavereduce
collecthas_label
block
ameliorate
induce
regulate
reduce
prevent
utilizationreflect
concentration
cause
progression
prevent
expression
abundanceeffect
activation
expression
reduce
damageattenuate
know
rats
excretion
role
effects
cause
effect
impair
attenuate
develop
activate
protect
protect
ameliorate
is_a
ameliorate
inhibitblockade
develop
antibody
develop
severity
reflect
reduce
aggravate
modulate
influence
suppress
prevent
is_a
attenuate
improve
changesinduce
induce
reduce
ameliorate
onset
attenuate
attenuate
ameliorate
prevent
enhance
decreasederivative
decrease
decreasedecrease
inhibit
glucuronidation
reduce
proliferation
affect
induce
induce
inhibition
injurybind
showinfluence
attenuateinjury
induce
impair
angiotensin
injury
prevent
stress
attenuate
injury
improve
excretion
nephrotoxicity
prevent
progression
prevent
improveblock
phase
worsen
limit
limitdifferences
deletion
modulation
induce
convertis_aturnoverdecrease
inhibit
inhibit
kidney
severityincrease
prevent
induce
subpopulation
activatefiltrationameliorate
treatment
suppress
reactivity
convert
convert
is_a
effect
protection
inhibit
enhance
damage
ameliorate
stress
attenuate
damage
nephrotoxicity
reduce
function
predictangiogenesis
ameliorate
ameliorate injury8-hydroxydeoxyguanosine
decrease
convert
reduce
reduce
deficitnephrotoxicity
shock
levelarf
fibrosis
delivery
nephropathy
guarenlargement
protect
survivalexposure
inducehypercholesterolemia
attenuate
hypertensive
injury
expression
ligate
injury
damage
nephropathy
nephropathy
expression
expression
hypertension
injury
stress
hypertension
alphaenacinjurynephrotoxicity
failure
nephrotoxicity
injury
injurynephrotoxicity
neurotoxicity
injury
damage
levelsinjury
nephrotoxicity
damagerejection
nephropathy
damage
stress
dysfunctionhypertension
rejection
injury
vasoconstriction
fibrosis
progression
nephrotoxicity
nephropathynephropathy
damage
function
injury
protect
hypertension
nephrotoxicity
proliferation
damage
injury
failure
injury
hypertension
injury
hypertensive
hypertension
expression
necrosis
susceptibilitydevelopment
pathogenesis
jeopardize
induceincrease
induce
affectaccelerate
prevent
development
causereversal
cause
attenuate
prevent
development
produce
increaseis_a
effects
pace
expression
has_label
follow
acquireprevent
attenuate
hypertensionattenuate
prevent
protectincrease
improvereduce
inhibition
attenuate
infusion
ameliorate
attenuate reduceimprove
enhance
regulate
cause
attenuate
oxidative
ameliorateinduce
prevent
attenuate
improve
attenuate
patientsreduce
preventcause
phosphorylation
requireinduce
induce
emboli-inducedaction
reduce
increaselesion
inflammasomeexchange
transport
inhibit
cause
adhesion
proliferation
enhanceprevent
regulate
reverse
accompany
preserve
lower
oxalate
expression
form
preserveprevalence
represent
mrus
target
occlude
revascularization
interstitial
nephrographic
ochratoxin
reduce
nephritis
preventinhibitinhibitintegrateexpress
protect
sarcomaprevent
mycobacteria
expressionlesionsablation
effects
protectdecrease
biopsy
percutaneous
changes
membraneprotect
is_a
membrane
abnormalities
prevent
injury
require
synthesis
parameters
induce
consumptioninduce
stenosisinjury
reduce
pathologytolerance
breakultrasonographyindicator
lossactivity
effect
is_a
enhance
increase
growth
ameliorate
utilize
condition
bluntmediate
function
calcify
revascularization
autotransplantation
anomaly
ly-6
activation
transplantation
cool
leiomyoma
tubular
tumoratpase
protect
protect
protect
protect
expression
vasculopathy
destruction
ameliorateexhibit
response
inhibitinhibit
reduce
inhibit
receptor
inhibit
calcium oxalate crystal adherence
ANONYMOUS-2059
ANONYMOUS-678
myoglobin
ANONYMOUS-866
extracellular signal-regulated kinase activation
isolated perfused porcine slaughterhouse kidneys
renal nitric oxide production
adrenomedullinANONYMOUS-1353
glomerulosclerosis
ANONYMOUS-2463
igf
insulin-like growth factor
inflammationsalt appetite
losartanANONYMOUS-979
ANONYMOUS-1697
angiotensin-ii receptor antagonist
ANONYMOUS-1833
ANONYMOUS-2084
aortic tissues
experimental mesangioproliferative glomerulonephritis
estradiol-induced tumorigenesis
nk cells
renal epithelial cell phenotype
c3a receptor deficiency
dithiane prevents
angiotensini i
tin chloride pretreatment
activated macrophages
renal ischemia reperfusion injury
ANONYMOUS-1632
anf
ANONYMOUS-935
renal vascular
alphav beta6 integrin
high salt intake
glomerular damage
lung injury
renal artery stenting minimizes
sildenafil
angiotensin ii type receptor deficiency
alpha-melanocyte-stimulating hormone
poly-adp-ribose polymerase inhibitor
lidocaine-containing euro-collins solution
renalfibrosis
diferuloylmethane
calcium oxalate deposition
hypertension following
kidney injury
ANONYMOUS-576
gentamicin-induced nephrotoxicity
ANONYMOUS-1660
peroxynitrite formation
kidney proximal tubule cells
ANONYMOUS-517
i ia na\/p
mesenchymal stem cells
enzyme activity
p-glycoprotein multispecific organic anion transporter
1,4,5-trisphosphate receptor
ANONYMOUS-1178
bone morphogenetic protein-7
hepatic igf-i mrna
mitochondrial nitric oxide synthase
curcumin
biosynthesis expression icisplatin
t-cell replication
ANONYMOUS-1568
ANONYMOUS-284
kidney transplants
recovery
ANONYMOUS-346
patients
ANONYMOUS-1162
ANONYMOUS-1081
ANONYMOUS-204
ANONYMOUS-523
ANONYMOUS-446
donation
ANONYMOUS-574
renal lactate uptake
duct cells
hepatocytegrowthfactor
ANONYMOUS-1857
ANONYMOUS-60
afferent renal sympathetic nerve activity
ANONYMOUS-1056
tubular epithelial
ANONYMOUS-168
aquaporin water channels
ANONYMOUS-2220
hgf
nuclear factor-kappab
ANONYMOUS-2596
ANONYMOUS-817
we
kidney disease
ANONYMOUS-347
endothelin-1
ANONYMOUS-2294
ANONYMOUS-2461
nitric-oxide-deficient spontaneously hypertensive rats
interstitial fibrosis
renal water-reabsorptive pathways
rosiglitazone
peroxisome proliferator receptor-gamma agonist
endogenous tissue kallikrein
renal cysts
type phospholipase a2
endothelin-1 transgenic mice
ANONYMOUS-1864
ANONYMOUS-331
ANONYMOUS-404
tezosentan
ANONYMOUS-815
pj34
suramin
sex chromosomes
n-type calcium channel blocker cilnidipine
ANONYMOUS-188
pravastatin
renal brush border membranes
ANONYMOUS-1776
ANONYMOUS-2690
what
urine inhibitory factor
renalinjury
indomethacinfurosemide-induced natriuresis
ANONYMOUS-1475
diuresis
osteogenic protein-1
ANONYMOUS-2504
plasma renin activity
renal functional adaptation
ANONYMOUS-2006
cardiotonic steroids
apoptosis
nitric oxide synthase
human kidneys
ANONYMOUS-1518
ANONYMOUS-1989
urinary nitrite excretion
offspring disease progression
escherichia coli-induced pyelonephritis
interferon-alpha
nf-kappab inhibition ameliorates
ANONYMOUS-1378
ANONYMOUS-2526
ANONYMOUS-379
ANONYMOUS-617
ANONYMOUS-1071
ANONYMOUS-1601
diagnostic procedures
interstitial hyaluronan dissipation
ANONYMOUS-147
endogenous il-13
ANONYMOUS-989
ANONYMOUS-869
vectorial water transport
ANONYMOUS-2587
emerging role
ang
neonatal rat heart
ANONYMOUS-173
humoral responses
probenecid
injury
plasma volume
ANONYMOUS-1245
urinary peristaltic machinery
ANONYMOUS-970
ANONYMOUS-511
diuretic renography
angiotensinANONYMOUS-2065
postganglionic sympathetic neurons
ANONYMOUS-847
ANONYMOUS-580
ANONYMOUS-1395
ANONYMOUS-906
ANONYMOUS-878
ANONYMOUS-1128
ginkgo biloba extract
ANONYMOUS-1572
ANONYMOUS-2077
heart rate
ramipril
trimetazidine
ANONYMOUS-2069
ANONYMOUS-745
ANONYMOUS-2451
ANONYMOUS-1186
ANONYMOUS-463
ANONYMOUS-2060
ANONYMOUS-2425
ANONYMOUS-2428
ANONYMOUS-565
dietary glutathione
gum halts
ANONYMOUS-1247
ANONYMOUS-822
ANONYMOUS-1787
recombinant vascular endothelial growth factor 121
podocyte stress
receptor antagonists
ANONYMOUS-292
ratshypertension
atrap deficiency
intracellular na +
macromolecules
ANONYMOUS-1044
ANONYMOUS-1348
ANONYMOUS-1915
pro-oxidant generation
ANONYMOUS-1574
ANONYMOUS-946
infiltration mellitus
ANONYMOUS-2166
ANONYMOUS-601
chronic graft dysfunction
ANONYMOUS-424
ANONYMOUS-997
map
experimental renal warm ischemia
3-hydroxy-3-methylglutaryl coenzyme
enzyme
ANONYMOUS-418
ANONYMOUS-415
ANONYMOUS-1035
arterial blood pressure
renal proximal tubular cells
gap junction protein connexin43
intrarenal adenosine
ANONYMOUS-2207
ANONYMOUS-535
ANONYMOUS-2135
ANONYMOUS-786
naloxone
ANONYMOUS-512
stroke dysfunction
ANONYMOUS-2199
enalapril
increased kinin survival
ANONYMOUS-1069
ANONYMOUS-1317
rat ischemic acute renal failure
high linoleic acid diets
hyperbaric oxygen therapy
glycine
ANONYMOUS-1113
ANONYMOUS-1299
ANONYMOUS-320
ANONYMOUS-2183
ANONYMOUS-1119
hyperhomocysteinemia
ANONYMOUS-961
ANONYMOUS-881
ANONYMOUS-1046
eta receptor blockade
ticlopidine
ANONYMOUS-1556
oral erdosteine administration
bilateral native nephrectomy
spirulina
ANONYMOUS-1360
ANONYMOUS-2200
ANONYMOUS-2516
ANONYMOUS-1183
ANONYMOUS-11
ANONYMOUS-1412
ANONYMOUS-819
ANONYMOUS-1522
ANONYMOUS-2172
ANONYMOUS-2034
ANONYMOUS-2245
renal changes
endothelin type
angiotensin ii infused
preglomerular vascular changes
ANONYMOUS-2155
lacidipine
ANONYMOUS-248
cholesterol
peroxisome proliferator-activated receptor-gamma agonists
ANONYMOUS-1284
ANONYMOUS-91
ANONYMOUS-500
p38alpha mapk
severe functionally significant muscle wasting
ANONYMOUS-2120
dialysis modality
ANONYMOUS-51
ANONYMOUS-459
ANONYMOUS-1608
ANONYMOUS-119
ANONYMOUS-2412
ANONYMOUS-1229
cardiac myocyte o2 consumption
ANONYMOUS-1385
crystals
ANONYMOUS-2375
long-term
ANONYMOUS-208
streptozocine diabetes
cardiac function
ANONYMOUS-1839
aging renal
ANONYMOUS-131
ANONYMOUS-2641
ANONYMOUS-1654
ANONYMOUS-2662
ANONYMOUS-1817
surgery
ANONYMOUS-1126
ANONYMOUS-495
phyllodiscus semoni
contrast media tissue concentration
ANONYMOUS-126
chronicdose a
culture
ANONYMOUS-2576
homocysteine aminothiol concentrations
intra-aortic balloon pumping
successful kidney transplantation
ANONYMOUS-810
prophylactic nesiritide
ANONYMOUS-2610
platelet
azelnidipine
nonencapsulated drug excretionlong term
calcium-sensing receptor
aspirin
tertatolol
delayed kidney function risk
ANONYMOUS-125
plasma
18-year-old due analgesic
ANONYMOUS-2598
renal hemodynamic disorder
ANONYMOUS-1430
ANONYMOUS-2182
ischemia stress
slit diaphragm proteins neph1 glomerular podocyte effacement
enfuvirtide
crystal-cell interaction
ANONYMOUS-1913
ANONYMOUS-384
ANONYMOUS-1903
ANONYMOUS-2680
5-aminoisoquinolinone
ANONYMOUS-1563
ANONYMOUS-2379
ANONYMOUS-1558
mature dendritic cells
ANONYMOUS-1370
ANONYMOUS-1012
anti-monocyte protein-1 gene therapy
ANONYMOUS-547
ANONYMOUS-66
bioflavonoid
endogenous
ANONYMOUS-1867
ANONYMOUS-2634
normal mice
ANONYMOUS-860
connective tissue growth factor expression
quercetin
charybdotoxin-sensitive activated k + channel
multiple autoantibodies
renal vasoconstriction
ANONYMOUS-1202
ANONYMOUS-428
aneurysm
kidneyANONYMOUS-1823
ANONYMOUS-1364
ANONYMOUS-1588
thromboxane receptor
dysfunction
rho-kinase inhibition
renal chronic transplant dysfunction
ANONYMOUS-1751
ANONYMOUS-1573
ANONYMOUS-687
kidney stones
freeze-fractured rat kidney cortex
ANONYMOUS-1902
direct renoprotective effects
renal artery
spironolactone
ANONYMOUS-924
ANONYMOUS-1862
angiotensin-stimulated tubular sodiumatrialnatriuretic
peptidewater reabsorption
contrast enhanced sonography
ANONYMOUS-1320
renal glutamine extraction
interstitial fibroblasts
lipids
Fig. 2. Central portion of the largest connected component of the triple graph. Pred-icates labels are shown in red, subject and object labels are represent as blue nodes.Graph visualised in Cytoscape using organic layout
In this paper, we have presented an approach for digging into factual knowl-edge presented in scientific papers and for making this knowledge availablethrough a network of facts. We embarked on an exercise of ‘factomics’, wherewe retrieve and encode fact-like statements from scientific papers that are richin context necessary to fully interpret such facts [11]. Text mining technologiesoffer a tempting means to expose this literature-based knowledge, yet can sufferfrom the need to create corpora of focused collections of statements of interest.We have presented a method for creating a focused corpus of one kind of state-ment and turning this into triples to be used in a domain knowledge base. Ourapproach does not attempt any kind of full extraction of the scientific knowledgenecessary for the interpretation of the facts. Instead, we have taken the approachthat we are exposing the ‘headlines’ of what has been said, and provided linksback to the original paper for when a scientist finds a ‘fact of interest’.
Our main goal was to create a corpus of fact-orientated statements about aparticular domain. We did this by training a classifier to recognise titles that pro-vide factual statements about the outcome of an investigation. We then turned
Creating a focused corpus of factual outcomes from biomedical experiments 11
these into co-ordinated sets of triples using a dependency parser, in which weexpand key verb relationships into triples that link domain entities.
The results of the cross-validation work, provides some useful insights intothe two different approaches to represent sentences to a classifier. The complexapproach, although slow to process, returns very similar results on the classi-fication of ‘factual’ sentences to the simple attribute set. It should however benoted that the simple attributes can be derived in approximately 10% the timerequired for the complex attributes. Additionally a classifier model trained onsimple attributes can label new sentences at nearly 100 times the rate of a modeltrained on the complex attributes. These results demonstrate that a simple sta-tistical language-modelling approach can be superior, when dealing with largevolumes of data, when the number of classes is low and the number of trainingexamples for each class is not a limiting factor. When training data is scarcethen a more complex semantic and grammatical sentence profiling method islikely to provide better results albeit with greater computational overheads.
There is a notable disparity between the proportion of manually verified‘good’ titles in the KUP titles and in the training data. This is likely to beexplained by differences in collection scope for the two title sets. The KUPtitles were collected without temporal contraints whereas the training data istaken from PubMed Central which was launched in February 2000. Also the setof biologically-themed journals are likely to commonly publish articles with astrong results orientated theme.
The results of this work are satisfactory so far but clearly the complex ap-proach to classification could, for example, be improved with more informationon the recognised entities. Additionally the classification step is vital for re-moving ‘non-factual’ titles because the triplification output from these titles isfrequently erroneous. Give that our aim is to create a focused corpus of factsabout a domain our methods have deliberately tended toward greater precisionrather than recall to avoid spurious or unreliable statements. It seems that im-provements in the title classification process should pay the greatest dividends,by providing a tighter and more focused set of genuinely factual titles to thetriplification process.
One difference with past work on knowledge extraction in this field is thatwe exploit only a small part of the text, the article title. Although our trainingdata have been collected from a specific domain, our computational approach isgeneric and our algorithms, feature sets and classification types of titles can beused in other areas without major modifications.
To interlink our triples with the KUPKB, we intend to map our textualsubject, predicate and object values to URIs derived from several sources. Usingour existing named entity recognition results from Whatizit (see Section 2) wewill replace matching subject and object values with URIs for Uniprot (in thecase of proteins), ChEBI (for chemical entities) and UMLS terms (for diseases).We will also use the NCBO BioPortal Annotator to replace further subject andobject labels with URIs from the Mouse anatomy ontology. We will also replaceany matching predicate values found in the Molecular Interactions ontology. All
12 J. Eales, G. Demetriou, R. Stevens
of these mapped ontologies are currently part of the KUPKB thus easing theintegration of literature-derived knowledge into the knowledge base.
References
1. Jupp, S., Klein, J., Schanstra, J., Stevens, R.: Developing a Kidney and UrinaryPathway Knowledge Base. Proceedings of Bio-ontologies SIG, ISMB 2010
2. Croset, S., Grabmueller, C., Li, C., Kavaliauskas, S. and Rebholz-Schumann, D.: TheCALBC RDF Triple Store: Retrieval over Large Literature Content. Proceedings ofSWAT4LS 2010
3. Coulet, A., Shah, N.H., Garten, Y., Musen, M.A., Altman, R. B.: Using text to buildsemantic networks for pharmacogenomics. J. Biomed. Informatics 43(6): 1009–10192010
4. Coulet, A., Garten, Y., Dumontier, M., Altman, R., Musen, M., Shah, N.: Inte-gration and publication of heterogeneous text-mined relationships on the SemanticWeb. J. Biomed. Semantics (2) Suppl 2, S10, 2011
5. Sebastiani, F.: Machine learning in automated text categorization. ACM ComputingSurveys 34(1): 1-47, 2002
6. Klein, D., Manning C.: Accurate Unlexicalized Parsing. Proceedings of the 41stMeeting of the Association for Computational Linguistics, Sapporo, Japan, 423–430 2003
7. Coulet, A., Shah, N.H., Hunter,L ., Baral, C., Altman, R.B.: Extraction ofGenotype-Phenotype-Drug Relationships from Text: From Entity Recognition toBioinformatics Application. Proceedings of Pacific Symposium on Biocomputing2010
8. Gaizauskas, R., Demetriou, G., Artymiuk, P.J., Wilett, P.: Protein structures andinformation extraction from biological texts: the PASTA system. Bioinformatics19(1): 135–143 2003
9. Humphreys, K., Demetriou, G., Gaizauskas, R.: Automatically Augmenting Termi-nological Lexicons from Untagged Text. Proceedings of the Workshop on NaturalLanguage Processing for Biology, held at the Pacific Symposium on Biocomputing(PSB2000), Hawaii, USA, 505–516 2000
10. de Marneffe, M.C., MacCartney, B., Manning, C.D.: Generating Typed Depen-dency Parses from Phrase Structure Parses. Proceedings of LREC 2006
11. Mons, B., Velterop, J.: Nano-publication in the e-science era. Proceedings ofSWASD 2009