Three Approaches to GO-Tagging Biomedical Abstracts
Neil Davis, Henk Harkema, Rob Gaizauskas, Yikun Guo
University of Sheffield
Jon Ratcliffe
InforSense
Moustafa Ghanem, Tom Barnwell, Yike Guo
Imperial College London
Symposium on Semantic Mining in Biomedicine 2006
12/4/6
SMBM 2006
Introduction
• On-going explosive growth of biomedical literature
• Text Mining techniques can help through:
  • Extractive processes: extracting terms or facts from papers for searching and linking
  • Structuring processes: grouping papers based on content for conceptual navigation of large document collections
• GO-tag project:
  • Annotating biomedical papers with terms from the Gene Ontology
Gene Ontology
• Provides common descriptive framework for genes and gene products across species
• Consists of three structured, controlled vocabularies (ontologies) that describe genes and gene products in terms of:
• Biological processes
• Cellular components
• Molecular functions
Gene Ontology
• Contains almost 20,000 terms
• GO Slim (87 terms): subset of all GO terms
  • Aims to give broad overview of ontology content
  • Can be species-specific
• Typical GO term
  Term name: isotropic cell growth
  Accession: GO:0051210
  Ontology: biological_process
  Synonyms: related: uniform cell growth
  Definition: “The process by which a cell irreversibly increases in size uniformly in all directions. In general, a rounded cell morphology reflects isotropic cell growth.”
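As an illustration only, the fields of the GO term record above could be held in a small data structure such as the following. The class name `GOTerm` and the helper `surface_forms` are hypothetical and not part of the original work; the field values are copied from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GOTerm:
    """One Gene Ontology term, mirroring the fields listed above."""
    accession: str
    name: str
    ontology: str
    synonyms: tuple = ()
    definition: str = ""

    def surface_forms(self):
        """All strings by which the term may be mentioned: name plus synonyms."""
        return (self.name,) + tuple(self.synonyms)


term = GOTerm(
    accession="GO:0051210",
    name="isotropic cell growth",
    ontology="biological_process",
    synonyms=("uniform cell growth",),
    definition=(
        "The process by which a cell irreversibly increases in size "
        "uniformly in all directions."
    ),
)
print(term.surface_forms())
```

Keeping the name and synonyms together is convenient for the look-up and retrieval approaches described later, which both work from the term's textual forms.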
Common Use of GO
• Associations of genes and gene products with GO terms in model organism and protein databases
  • FlyBase, SGD, MGD
• For example (from SGD):

Gene | GO Annotation | References | Evidence Code
ACT1 | Structural constituent of cytoskeleton | Botstein D, et al. (1997) The yeast cytoskeleton | Traceable Author Statement
ACT1 | Exocytosis | Pruyne D and Bretscher (2000) Polarization of … in yeast | Traceable Author Statement
     |            | Botstein D, et al. (1997) The yeast cytoskeleton | Traceable Author Statement
ACT1 | Histone acetyltransferase complex | Galarneau L, et al. (2000) Multiple links between the NuA4 … | Inferred from Direct Assay
GO-Tagging
• Task: given a text (PubMed abstract) and GO/GO Slim, assign 0 or more GO terms to the text if the text is “about” the process/component/function identified by the GO term
• Only most specific terms are assigned
• No association of GO term with specific genes or gene products
• User scenarios:
  • Research scientists: clustering of PubMed search results
  • Database curators: identifying texts that may support Gene-GO term associations
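The rule that only the most specific terms are assigned can be sketched as a filter over candidate terms: any candidate that is an ancestor of another candidate is dropped. This is a minimal sketch, not the project's implementation; the function names and the toy `PARENTS` hierarchy are hypothetical and do not reproduce the real GO graph.

```python
def ancestors(term, parents):
    """Return every ancestor of `term` in a child -> parents mapping."""
    seen = set()
    stack = list(parents.get(term, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return seen


def most_specific(candidates, parents):
    """Drop any candidate that is an ancestor of another candidate."""
    candidates = set(candidates)
    covered = set()
    for term in candidates:
        covered |= ancestors(term, parents)
    return candidates - covered


# Hypothetical toy hierarchy, for illustration only.
PARENTS = {
    "isotropic cell growth": ["cell growth"],
    "cell growth": ["growth"],
    "exocytosis": ["secretion"],
}

kept = most_specific(
    {"growth", "cell growth", "isotropic cell growth", "exocytosis"}, PARENTS
)
print(sorted(kept))
```

Here "growth" and "cell growth" are removed because the more specific "isotropic cell growth" is also a candidate, while "exocytosis" is kept because none of its descendants were proposed.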
Outline of Rest of Talk
• Data sets / Gold standards
  • SGD Gold Standard
  • IC Gold Standard
• Three approaches to GO-tagging
  • Lexical look-up
  • Information retrieval approach
  • Machine learning
• Evaluation results
• Conclusions
SGD Gold Standard
• Derive Gold Standard from SGD model organism database (yeast)
• Given the annotated genes in SGD, assign a GO term T to a paper P if the paper P is referenced in support of a Gene-GO term association involving T
• SGD Gold Standard
  • 4922 PMIDs
  • 2455 GO terms
  • 10485 PMID-GO term pairs
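The derivation rule above (assign term T to paper P if P is cited in support of a Gene-GO term association involving T) amounts to inverting the annotation list. The sketch below assumes the annotations are available as (gene, GO term, PMID) triples; that input format, the function name, and the placeholder PMIDs are assumptions made for illustration.

```python
from collections import defaultdict


def build_gold_standard(associations):
    """Map each PMID to the set of GO terms it is cited in support of.

    `associations` is an iterable of (gene, go_term, pmid) triples.
    """
    gold = defaultdict(set)
    for _gene, go_term, pmid in associations:
        gold[pmid].add(go_term)
    return dict(gold)


# Placeholder PMIDs, for illustration only.
ASSOCIATIONS = [
    ("ACT1", "structural constituent of cytoskeleton", "PMID-A"),
    ("ACT1", "exocytosis", "PMID-A"),
    ("ACT1", "exocytosis", "PMID-B"),
]

gold = build_gold_standard(ASSOCIATIONS)
pair_count = sum(len(terms) for terms in gold.values())
print(pair_count)

# Average number of GO terms per paper implied by the figures above.
print(round(10485 / 4922, 1))
```

The gene is discarded on purpose: the task assigns GO terms to papers, not to specific genes or gene products.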
SGD Gold Standard
• Advantages
  • SGD data already exists: no further annotation work required
  • More Gold Standard data available from other model organism databases
• Disadvantage
  • List of Gene-GO term assignments in SGD is incomplete for our task
    • Each paper is associated with GO terms whose assignment to specific genes it supports, but the paper may be missing other GO terms which can also be legitimately attached to it
    • List does not contain all papers supporting a given assignment
• Consequence
  • SGD Gold Standard is “GO-term incomplete”
    • Weak measure of Recall
    • Precision figures difficult to interpret
SGD Gold Standard
• Further issue:
  • SGD Gene-GO term assignments are based on full papers, whereas system only has access to abstracts
• Consequence:
  • Limit on maximum Recall obtainable by system
IC Gold Standard
• Manually extend SGD Gold Standard to obtain GO-term complete annotation
• Select SGD papers for which all GO term assignments are supported by abstract or title
• Semi-automatically add further GO terms by fuzzy term matching + post-editing
• IC Gold Standard
  • 785 PMIDs
  • 1006 GO terms
  • 5170 PMID-GO term pairs
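The slides do not say which fuzzy matching method was used. As one possible reading, the sketch below proposes candidate GO terms by comparing each term against same-length word windows of the abstract with Python's `difflib.SequenceMatcher`; the function name and the 0.85 threshold are assumptions, and the candidates would still need the manual post-editing step.

```python
from difflib import SequenceMatcher


def fuzzy_candidates(abstract, go_terms, threshold=0.85):
    """Propose GO terms that approximately match some word window of the abstract."""
    tokens = abstract.lower().split()
    found = set()
    for term in go_terms:
        n = len(term.split())
        for i in range(len(tokens) - n + 1):
            window = " ".join(tokens[i:i + n])
            # ratio() is a similarity in [0, 1]; 1.0 means identical strings.
            if SequenceMatcher(None, term.lower(), window).ratio() >= threshold:
                found.add(term)
                break
    return found


candidates = fuzzy_candidates(
    "Actin is required for histone acetyltransferase complexes",
    ["histone acetyltransferase complex", "exocytosis"],
)
print(candidates)
```

A character-level similarity of this kind tolerates small variations such as plural endings, which is also why a Gold Standard built this way tends to favor approaches that look for literal term mentions.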
IC Gold Standard
• Advantage
  • Closer to GO-term complete Gold Standard
• Disadvantages
  • Still not GO-term complete
  • Direct mentions of GO terms vs. semantically inferred GO terms
  • Gold Standard creation method favors lexical look-up approach to GO-tagging
  • Data set is small
Lexical Look-Up
• (Task: given a text (PubMed abstract) and GO/GO Slim, assign 0 or more GO terms to the text if the text is “about” the process/component/function identified by the GO term)
• GO term T is assigned to a paper if term T occurs in the abstract of the paper
• Simple & fast baseline
• GO terms recognized in text can be used as features in Machine Learning approach
Lexical Look-Up
• Web service calls to Termino term tagger
• Term classes in Termino
  • GO terms
  • GO term synonyms
  • SGD yeast gene names
• Lexical look-up method
  • Case-insensitive
  • Simple morphological analysis
    • Cells mapped onto cell
    • Mitochondrial, mitochondria not mapped onto mitochondrion
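The look-up behavior described above can be approximated without Termino. The sketch below is not Termino's algorithm: it assumes a deliberately crude normalization (lower-casing plus stripping a plural "s") that reproduces the two examples on the slide, and all function names are hypothetical.

```python
import re


def normalize(token):
    """Lower-case and strip a plural 's' (a deliberately crude rule)."""
    token = token.lower()
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def tokenize(text):
    """Split on non-alphanumeric characters and normalize each token."""
    return [normalize(t) for t in re.findall(r"[A-Za-z0-9]+", text)]


def lexical_lookup(abstract, go_terms):
    """Return the GO terms whose normalized token sequence occurs in the abstract."""
    text = tokenize(abstract)
    hits = set()
    for term in go_terms:
        pattern = tokenize(term)
        n = len(pattern)
        if n and any(text[i:i + n] == pattern for i in range(len(text) - n + 1)):
            hits.add(term)
    return hits


hits = lexical_lookup(
    "Isotropic Cell Growth was observed in mutant cells; mitochondria were unaffected.",
    ["isotropic cell growth", "cell", "mitochondrion"],
)
print(sorted(hits))
```

"cells" is mapped onto "cell" and therefore matches, while "mitochondria" is left as it is and does not match "mitochondrion", which mirrors the limitation noted above.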
Lexical Look-Up Results
• Recall
  • Full text (SGD) vs. abstracts only (IC)
  • Inherent drawbacks of lexical look-up: term variation, literal mentions
  • Effects of Gold Standard creation method (IC)
• Precision
  • Effects of Gold Standard creation method (IC)
• GO vs. GO Slim
  • Recognizing GO Slim terms is easier than recognizing GO terms
Lexical Look-Up
• Extensions
  • GO term T is assigned to a paper if a synonym of term T occurs in the abstract of the paper
  • GO term T is assigned to a paper if a yeast gene name associated with term T occurs in the abstract of the paper
• Effects on performance
  • Adding synonyms: slight decrease in Precision, substantial increase in Recall
  • Adding yeast gene names: substantial decrease in Precision, substantial increase in Recall
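The two extensions can be pictured as enlarging the set of strings that trigger each GO term. The sketch below uses plain case-insensitive substring matching for brevity; the function names and the tiny lexicon are hypothetical, with the ACT1/exocytosis pairing taken from the SGD example earlier in the talk.

```python
def build_lexicon(names, synonyms=None, gene_names=None):
    """Map each GO term to the strings that trigger it."""
    synonyms = synonyms or {}
    gene_names = gene_names or {}
    return {
        name: [name] + synonyms.get(name, []) + gene_names.get(name, [])
        for name in names
    }


def tag(abstract, lexicon):
    """Assign a GO term if any of its trigger strings occurs in the abstract."""
    text = abstract.lower()
    return {
        term
        for term, forms in lexicon.items()
        if any(form.lower() in text for form in forms)
    }


NAMES = ["isotropic cell growth", "exocytosis"]
SYNONYMS = {"isotropic cell growth": ["uniform cell growth"]}
GENES = {"exocytosis": ["ACT1"]}

abstract = "Uniform cell growth was normal in ACT1 mutants."
base = tag(abstract, build_lexicon(NAMES))
with_synonyms = tag(abstract, build_lexicon(NAMES, SYNONYMS))
with_genes = tag(abstract, build_lexicon(NAMES, SYNONYMS, GENES))
print(base, with_synonyms, with_genes)
```

The last call also suggests one plausible reason for the drop in Precision: a mention of ACT1 triggers "exocytosis" even though nothing in this abstract indicates that it is about exocytosis.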
IR-Based Approach
• Document collection
  • For each GO term, create a document consisting of the GO term, its synonyms, and its definition
• Query
  • For each paper, create a query consisting of the words in the abstract of the paper
• Given a query (i.e., abstract), retrieve relevant documents (i.e., GO terms) from the document collection
• Assign top-ranked 5, 10, … GO terms to abstract
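The inversion described above (GO terms as documents, abstracts as queries) can be sketched with a small TF-IDF and cosine-similarity ranker. This is a stand-in written in plain Python, not the scoring used by the actual system; preprocessing is reduced to lower-casing and tokenization, and the second GO document uses a made-up definition for illustration.

```python
import math
import re
from collections import Counter


def tokens(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_go_terms(abstract, go_documents, top_k=5):
    """Rank GO terms by TF-IDF cosine similarity between abstract and GO document."""
    doc_counts = {term: Counter(tokens(text)) for term, text in go_documents.items()}
    n_docs = len(doc_counts)
    df = Counter(word for counts in doc_counts.values() for word in counts)
    idf = {word: math.log(1 + n_docs / count) for word, count in df.items()}

    def vector(counts):
        # Words unseen in the GO documents carry no weight.
        return {w: tf * idf[w] for w, tf in counts.items() if w in idf}

    def cosine(u, v):
        dot = sum(weight * v.get(w, 0.0) for w, weight in u.items())
        norm_u = math.sqrt(sum(x * x for x in u.values()))
        norm_v = math.sqrt(sum(x * x for x in v.values()))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

    query = vector(Counter(tokens(abstract)))
    scored = [(cosine(query, vector(c)), term) for term, c in doc_counts.items()]
    scored.sort(reverse=True)
    return [term for score, term in scored[:top_k] if score > 0]


GO_DOCUMENTS = {
    "isotropic cell growth": (
        "isotropic cell growth uniform cell growth the process by which a cell "
        "irreversibly increases in size uniformly in all directions"
    ),
    # Made-up definition, for illustration only.
    "exocytosis": "exocytosis release of vesicle contents from a cell",
}
ABSTRACT = "Mutant cells increase in size uniformly in all directions"

ranking = rank_go_terms(ABSTRACT, GO_DOCUMENTS, top_k=5)
print(ranking)
```

Unlike lexical look-up, the abstract never has to contain the term name itself: here the match comes entirely from words shared with the term's definition.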
IR-Based Approach
• Index documents using Lucene search engine
• Standard IR preprocessing: tokenization, stop word removal, case normalization, stemming
• Similarity measure: vector space model
• Two kinds of document
  • Flat document = GO term + synonyms + definition
  • Hierarchical document = GO term + synonyms + definition + terms, synonyms, and definitions of parent GO nodes
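The two document types differ only in what text is concatenated before indexing. In the sketch below, "parent GO nodes" is read as direct parents only, which is an assumption; the record layout, function names, and the placeholder definitions are likewise hypothetical.

```python
def flat_document(term, records):
    """GO term + synonyms + definition, joined into one string."""
    record = records[term]
    return " ".join([term, *record["synonyms"], record["definition"]])


def hierarchical_document(term, records, parents):
    """Flat document of the term followed by the flat documents of its parents."""
    parts = [flat_document(term, records)]
    parts += [flat_document(parent, records) for parent in parents.get(term, ())]
    return " ".join(parts)


# Placeholder definitions, for illustration only.
RECORDS = {
    "isotropic cell growth": {
        "synonyms": ["uniform cell growth"],
        "definition": "a cell increases in size uniformly in all directions",
    },
    "cell growth": {"synonyms": [], "definition": "growth of a cell"},
}
PARENTS = {"isotropic cell growth": ["cell growth"]}

flat = flat_document("isotropic cell growth", RECORDS)
hier = hierarchical_document("isotropic cell growth", RECORDS, PARENTS)
print(hier)
```

The hierarchical version adds further occurrences of general words such as "cell" and "growth", so the specific words of the term make up a smaller share of the document.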
IR-Based Results
• Better performance on IC abstracts than on SGD abstracts
• Hierarchical documents do slightly worse than flat documents
  • Discriminatory effect of specific GO terms may be reduced by occurrence of general terms such as cell and protein
Machine Learning
• Variety of text classification algorithms: Naïve Bayes, Decision Tree, SVM classifier, …
• Naïve Bayes predicts only one GO term per abstract
• SGD GS: 2.1 GO terms/abstract; IC GS: 6.6 GO terms/abstract
• Features: words, frequent phrases
• Preprocessing steps: tokenization, removal of stop words, stemming
• Training on 66% of annotated data, evaluation on remainder of data
• GO term assignments vis-à-vis generic GO Slim to mitigate data sparsity problems
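One reading of the last point is that specific GO labels are replaced by their GO Slim ancestors before training, so that each label has more training examples. The sketch below shows such a relabeling under that assumption; the function names, the toy hierarchy, and the two-term slim set are hypothetical.

```python
def slim_ancestors(term, parents, slim_terms):
    """Walk up from `term` and collect the nearest GO Slim terms on each path."""
    found, seen, stack = set(), set(), [term]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        if node in slim_terms:
            found.add(node)  # stop climbing once a slim term is reached
        else:
            stack.extend(parents.get(node, ()))
    return found


def relabel_with_slim(gold, parents, slim_terms):
    """Replace every GO label in a PMID -> terms mapping by its slim ancestors."""
    return {
        pmid: set().union(*(slim_ancestors(t, parents, slim_terms) for t in terms))
        for pmid, terms in gold.items()
    }


# Hypothetical toy data, for illustration only.
PARENTS = {
    "isotropic cell growth": ["cell growth"],
    "cell growth": ["growth"],
    "exocytosis": ["secretion"],
}
SLIM = {"growth", "secretion"}
GOLD = {"p1": {"isotropic cell growth"}, "p2": {"cell growth", "exocytosis"}}

slim_gold = relabel_with_slim(GOLD, PARENTS, SLIM)
print(slim_gold)
```

Before relabeling, every label in this toy set occurs once; afterwards "growth" occurs in both papers, which is the kind of pooling that eases data sparsity.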
Machine Learning Results
• One GO term vs. multiple GO terms per abstract makes a difference
• Higher precision scores than lexical look-up (SGD): GO terms directly mentioned in text will not be assigned if they are not present in the training set
• Oracle Text Decision Tree (IC): classifier learns systematic, strong correlation between words in text and words in GO terms
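Comparison of Approaches
• Best F scores for GO Slim (R = Recall, P = Precision, F = F score; LLU = lexical look-up, IR = information retrieval, ML = machine learning)
• SGD Gold Standard
       R     P     F
  LLU  51.0  29.9  37.7
  IR   51.5  26.2  34.7
  ML   36.8  51.6  43.0
• IC Gold Standard
       R     P     F
  LLU  79.5  98.5  88.0
  IR   59.5  37.6  46.1
  ML   76.5  83.0  79.6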
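The F values in the tables are consistent with the balanced F score, the harmonic mean of Precision and Recall. The sketch below recomputes F from the reported R and P, and shows how R, P and F could be computed over PMID-GO term pairs; pooling pairs over all papers (micro-averaging) is an assumption, since the slides do not state how the scores were averaged.

```python
def f_score(recall, precision):
    """Balanced F score: harmonic mean of recall and precision."""
    total = recall + precision
    return 2 * recall * precision / total if total else 0.0


def evaluate(predicted, gold):
    """Recall, precision and F over PMID-GO term pairs, pooled over all papers."""
    true_pos = sum(len(predicted.get(pmid, set()) & terms) for pmid, terms in gold.items())
    n_predicted = sum(len(terms) for terms in predicted.values())
    n_gold = sum(len(terms) for terms in gold.values())
    recall = true_pos / n_gold if n_gold else 0.0
    precision = true_pos / n_predicted if n_predicted else 0.0
    return recall, precision, f_score(recall, precision)


# (R, P, F) rows as reported in the tables.
REPORTED = [
    (79.5, 98.5, 88.0), (59.5, 37.6, 46.1), (76.5, 83.0, 79.6),
    (51.0, 29.9, 37.7), (51.5, 26.2, 34.7), (36.8, 51.6, 43.0),
]
recomputed = [round(f_score(r, p), 1) for r, p, _ in REPORTED]
print(recomputed)

# Toy prediction, for illustration only.
scores = evaluate(
    predicted={"p1": {"a", "b"}, "p2": {"c"}},
    gold={"p1": {"a"}, "p2": {"c", "d"}},
)
print(scores)
```

With a GO-term incomplete Gold Standard, a predicted pair that is missing from the gold set counts against Precision even when the assignment is legitimate, which is why the Precision figures on the SGD data are hard to interpret.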
Conclusions
• GO-tagging is an interesting task
  • NLP challenges
  • Benefits of functional GO-tagger for researchers and curators
• Creating valid Gold Standard
  • Completeness of annotation
Conclusions
• Methods for GO-tagging
  • Lexical look-up
    • Fast, simple
    • Term variation, relevant GO terms inferred from text
  • Information retrieval approach
    • Novel perspective
    • Noise from general biomedical terms
  • Machine Learning
    • Able to capture generalizations
    • Feature selection
Future Work
• Enhancements to each of the three simple approaches
• Combining three approaches into a hybrid system
• Improving resources and methodology for evaluating the technology
• Building and evaluating end-user applications employing this technology
• Look at other tasks:
  • Extracting GO term-gene/gene product pairs
  • Assigning evidence codes
Navigating GO-Tagged Document Collections
[Screenshot of a browsing interface with labelled panels: GO Hierarchy, Abstract Titles, Abstract Bodies, GO Terms/Gene Names]