Enriching Bio-ontologies with Non-hierarchical Relations
Workshop on bio-ontologies, October 28, 2005
Olivier Bodenreider
Lister Hill National Center for Biomedical Communications, Bethesda, Maryland - USA

Acknowledgments
• Marc Aubry, UMR 6061 CNRS, Rennes, France
• Anita Burgun, University of Rennes, France

Gene Ontology
• Annotate gene products
• Coverage
  • Molecular functions
  • Cellular components
  • Biological processes
• Explicit relations to other terms within the same hierarchy
• No (explicit) relations
  • To terms across hierarchies
  • To concepts from other biological ontologies

Gene Ontology
[Figure: the three GO hierarchies: Molecular functions, Cellular components, Biological processes]
• BP: metal ion transport
• MF: metal ion transporter activity

Motivation
• Richer ontology
  • Beyond hierarchies
• Easier to maintain
  • Explicit dependence relations
• More consistent annotations
  • Quality assurance
  • Assisted curation

Related work
• Ontologizing GO
  • GONG [Wroe & al., PSB 2003]
• Identifying relations among GO terms across hierarchies
  • Lexical approach [Ogren & al., PSB 2004-2005]
  • Non-lexical approaches [Bodenreider & al., PSB 2005]
• Identifying relations between GO terms and OBO terms
  • ChEBI [Burgun & al., SMBM 2005]
• Representing relations among GO terms and between GO terms and OBO terms
  • Obol [Mungall, CFG 2005]
• See also: [Bada & al., 2004], [Kumar & al., 2004], [Dolan & al., 2005]

Non-lexical approaches to identifying associative relations in the Gene Ontology

GO and annotation databases
• 5 model organisms
  • FlyBase
  • GOA-Human
  • MGI
  • SGD
  • WormBase
• 270,000 gene-term associations

Example annotations (gene, GO term, evidence codes):
Brca1   GO:0000793   IDA
Brca1   GO:0003677   IEA
Brca1   GO:0003684   IDA
Brca1   GO:0003723   ISS
Brca1   GO:0004553   ISS
Brca1   GO:0005515   ISS, TAS
Brca1   GO:0005622   IEA
Brca1   GO:0005634   IDA, ISS
Brca1   …

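All three non-lexical approaches start from gene-term associations like the Brca1 rows above. The sketch below shows one possible in-memory representation: a mapping from each gene to its set of GO terms. The function name `terms_by_gene` and the optional evidence-code filter are assumptions made for illustration; the slides do not say whether any evidence codes were excluded.

```python
from collections import defaultdict

# (gene, GO term, evidence codes) rows copied from the Brca1 example.
ANNOTATIONS = [
    ("Brca1", "GO:0000793", ["IDA"]),
    ("Brca1", "GO:0003677", ["IEA"]),
    ("Brca1", "GO:0003684", ["IDA"]),
    ("Brca1", "GO:0003723", ["ISS"]),
    ("Brca1", "GO:0004553", ["ISS"]),
    ("Brca1", "GO:0005515", ["ISS", "TAS"]),
    ("Brca1", "GO:0005622", ["IEA"]),
    ("Brca1", "GO:0005634", ["IDA", "ISS"]),
]


def terms_by_gene(rows, excluded_evidence=frozenset()):
    """Group GO terms by gene, skipping rows supported only by excluded codes."""
    grouped = defaultdict(set)
    for gene, term, evidence in rows:
        # Keep the row if at least one evidence code is not excluded.
        if set(evidence) - excluded_evidence:
            grouped[gene].add(term)
    return dict(grouped)


all_terms = terms_by_gene(ANNOTATIONS)
# Hypothetical filter: drop annotations inferred only electronically (IEA).
curated_terms = terms_by_gene(ANNOTATIONS, excluded_evidence=frozenset({"IEA"}))
print(len(all_terms["Brca1"]), len(curated_terms["Brca1"]))
```

Each gene's term set can then serve as a row of the gene-by-term matrix or as one transaction, depending on the approach.
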
Three non-lexical approaches
• All based on annotation databases
• Similarity in the vector space model
• Statistical analysis of co-occurring GO terms
• Association rule mining

Similarity in the vector space model
[Figure: the annotation database shown as a genes × GO terms matrix (rows g1 … gn, columns t1 … tn) and as its transpose, a GO terms × genes matrix (rows t1 … tn, columns g1 … gn)]

Similarity in the vector space model
[Figure: in the GO terms × genes matrix (rows t1 … tn, columns g1 … gn), two term vectors ti and tj are compared; the results fill a GO terms × GO terms similarity matrix]
Sim(ti, tj) = ti · tj

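The formula Sim(ti, tj) = ti · tj can be illustrated with a small sketch. The gene and term names below are toy data, and the binary weights (1 if a gene is annotated with the term, 0 otherwise) are an assumption; the slides do not specify the weighting scheme or whether the vectors are normalized. A cosine variant is included only for comparison.

```python
from itertools import combinations
from math import sqrt

# Toy annotation data (hypothetical): gene -> set of GO terms.
annotations = {
    "g1": {"t1", "t2"},
    "g2": {"t1", "t2", "t3"},
    "g3": {"t3"},
}

genes = sorted(annotations)
terms = sorted({t for gene_terms in annotations.values() for t in gene_terms})

# Each GO term becomes a vector over genes (assumed binary weights).
vectors = {t: [1 if t in annotations[g] else 0 for g in genes] for t in terms}


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Dot product of the length-normalized vectors (assumed variant)."""
    return dot(u, v) / (sqrt(dot(u, u)) * sqrt(dot(v, v)))


# Sim(ti, tj) = ti · tj for every pair of distinct terms.
similarity = {
    (ti, tj): dot(vectors[ti], vectors[tj]) for ti, tj in combinations(terms, 2)
}
for pair, score in sorted(similarity.items()):
    print(pair, score, round(cosine(vectors[pair[0]], vectors[pair[1]]), 3))
```

With binary weights the dot product is simply the number of genes annotated with both terms, so some normalization or weighting is usually needed before thresholding; which one was used here is not stated in the slides.
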
Analysis of co-occurring GO terms
[Figure: from the annotation database (genes g1 … gn × GO terms t1 … tn), the set of GO terms annotating each gene is extracted]
Terms per gene:
g2   t2 t7 t9
…    t3 t7 t9
g5   t5
Co-occurrence counts of term pairs:
t2-t7   1
t2-t9   1
t7-t9   2
…
Frequencies of individual terms:
t5   1
t7   2
t9   2
…

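The counting step can be reproduced with a few lines of Python. This is a minimal sketch using the three term sets shown on the slide; the key `"gX"` is a placeholder for the gene that the slide leaves unnamed.

```python
from collections import Counter
from itertools import combinations

# Term sets from the slide; "gX" stands in for the gene shown as "…".
annotations = {
    "g2": {"t2", "t7", "t9"},
    "gX": {"t3", "t7", "t9"},
    "g5": {"t5"},
}

term_freq = Counter()  # number of genes annotated with each term
pair_freq = Counter()  # number of genes annotated with both terms of a pair

for gene_terms in annotations.values():
    term_freq.update(gene_terms)
    # Sort so that each unordered pair always has the same key.
    pair_freq.update(combinations(sorted(gene_terms), 2))

print(pair_freq[("t2", "t7")], pair_freq[("t2", "t9")], pair_freq[("t7", "t9")])
print(term_freq["t5"], term_freq["t7"], term_freq["t9"])
```

The pair counts and term frequencies are the inputs to the independence tests applied next.
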
Analysis of co-occurring GO terms
• Statistical analysis: test independence
  • Likelihood ratio test (G²)
  • Chi-square test (Pearson's χ²)
• Example from GOA (22,720 annotations)
  • GO:0006955 immune response [BP], Freq. = 588
  • GO:0008009 chemokine activity [MF], Freq. = 53
  • Co-oc. = 46

Contingency table (rows: GO:0006955 immune response; columns: GO:0008009 chemokine activity):
           present    absent     Total
present         46       542       588
absent           7    22,125    22,132
Total           53    22,667    22,720

G² = 298.7, p < 0.001

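As a worked check, the likelihood ratio statistic can be recomputed from the frequencies given in the example (22,720 annotations, 588 and 53 occurrences, 46 co-occurrences). The sketch below uses the standard formula G² = 2 Σ O ln(O/E) on the 2×2 table derived from those marginals; the function name `g_squared` is illustrative, and the p-value uses the chi-square distribution with one degree of freedom.

```python
from math import erfc, log, sqrt


def g_squared(both, only_a, only_b, neither):
    """Likelihood ratio statistic G^2 for a 2x2 contingency table."""
    observed = [[both, only_a], [only_b, neither]]
    total = both + only_a + only_b + neither
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            if observed[i][j] > 0:  # an empty cell contributes nothing
                g2 += observed[i][j] * log(observed[i][j] / expected)
    return 2.0 * g2


# Frequencies from the GOA example.
n_total, freq_a, freq_b, co_occ = 22720, 588, 53, 46
g2 = g_squared(
    both=co_occ,
    only_a=freq_a - co_occ,
    only_b=freq_b - co_occ,
    neither=n_total - freq_a - freq_b + co_occ,
)
# Upper tail of the chi-square distribution with 1 degree of freedom.
p_value = erfc(sqrt(g2 / 2.0))
print(round(g2, 1), p_value)
```

The result should land close to the reported G² of 298.7, with a p-value far below any conventional threshold.
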
Association rule mining
[Figure: each gene in the annotation database (genes g1 … gn × GO terms t1 … tn) is treated as a transaction, e.g., g2: t2 t7 t9]
apriori
• Rules: t1 => t2
• Confidence: > .9
• Support: .05

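To make support and confidence concrete, here is a minimal sketch restricted to rules with a single term on each side (t1 => t2). It is not the apriori algorithm itself, which handles itemsets of any size. The transactions are toy data, and reading "Support: .05" as a minimum fraction of transactions is an assumption.

```python
from collections import Counter
from itertools import combinations


def pairwise_rules(transactions, min_support, min_confidence):
    """Return (lhs, rhs, support, confidence) for single-term rules lhs => rhs."""
    n = len(transactions)
    item_count = Counter()
    pair_count = Counter()
    for items in transactions:
        item_count.update(items)
        pair_count.update(combinations(sorted(items), 2))

    rules = []
    for (a, b), count in pair_count.items():
        support = count / n  # fraction of transactions containing both terms
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_count[lhs]  # estimate of P(rhs | lhs)
            if confidence > min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules)


# Toy transactions (hypothetical): one set of GO terms per gene.
transactions = [
    {"t2", "t7", "t9"},
    {"t3", "t7", "t9"},
    {"t5"},
    {"t7", "t9"},
    {"t2", "t5"},
]
rules = pairwise_rules(transactions, min_support=0.05, min_confidence=0.9)
for lhs, rhs, support, confidence in rules:
    print(f"{lhs} => {rhs}  support={support:.2f}  confidence={confidence:.2f}")
```

Note that rules are directional: t3 => t7 holds in this toy data while t7 => t3 does not, because t7 also occurs without t3.
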
Examples of associations
• Vector space model
  • MF: ice binding
  • BP: response to freezing
• Co-occurring terms
  • MF: chromatin binding
  • CC: nuclear chromatin
• Association rule mining
  • MF: carboxypeptidase A activity
  • BP: proteolysis and peptidolysis

Associations identified
         VSM    COC    ARM    LEX
MF-CC    499    893    362    917
MF-BP   3057   1628    577   2523
CC-BP    760   1047    329   2053
Total   4316   3568   1268   5493
7665 by at least one approach

Associations identified: VSM
• MF: ice binding
• BP: response to freezing

Associations identified: COC
• MF: chromatin binding
• CC: nuclear chromatin

Associations identified: ARM
• MF: carboxypeptidase A activity
• BP: proteolysis and peptidolysis

Limited overlap among approaches
• Lexical vs. non-lexical: 7665 non-lexical associations, 5493 lexical associations, 230 in common
• Among non-lexical (VSM, COC, ARM)
  [Figure: Venn diagram of the three approaches; region counts 3689, 2587, 453, 305, 309, 121, 201]

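Overlap figures of this kind can be derived by recording, for each association, the exact combination of approaches that found it. The sketch below uses made-up associations and a hypothetical helper `venn_regions`; it illustrates the bookkeeping only and does not reproduce the counts above.

```python
from collections import Counter


def venn_regions(found_by):
    """Count associations per exact combination of approaches that found them."""
    membership = {}
    for approach, associations in found_by.items():
        for assoc in associations:
            membership.setdefault(assoc, set()).add(approach)
    return Counter(frozenset(approaches) for approaches in membership.values())


# Made-up associations (pairs of terms from different GO hierarchies).
found_by = {
    "VSM": {("MF:a", "BP:b"), ("MF:c", "CC:d"), ("MF:e", "BP:f")},
    "COC": {("MF:c", "CC:d"), ("MF:e", "BP:f"), ("CC:g", "BP:h")},
    "ARM": {("MF:e", "BP:f")},
}
regions = venn_regions(found_by)
for approaches, count in sorted(regions.items(), key=lambda kv: sorted(kv[0])):
    print("+".join(sorted(approaches)), count)
# The region counts add up to the number of associations found by any approach.
print(sum(regions.values()))
```
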
Linking the Gene Ontology to other biological ontologies

Related domains
• Organisms
• Cell types
• Physical entities
  • Gross anatomy
  • Molecules
• Functions
  • Organism functions
  • Cell functions
• Pathology
Examples of GO terms involving these domains:
• cytosolic ribosome (sensu Eukaryota)
• T-cell activation
• brain development
• transferrin receptor activity
• visual perception
• T-cell activation
• regulation of blood pressure

GO and other domains
[adapted from B. Smith]
           Physical entity       Function              Process
Organism   Gross anatomy         Organism functions    Organism processes
Cell       Cellular components   Cellular functions    Cellular processes
Molecule   Molecules             Molecular functions   Molecular processes

GO and other domains (revisited)
[adapted from B. Smith]
Rows are ordered by resolution (organism, cell, molecule).
Physical whole   Physical part          Function              Process
Organism         Organism components    Organism functions    Organism processes
Cell             Cellular components    Cellular functions    Cellular processes
Molecule         Molecular components   Molecular functions   Molecular processes

Biological ontologies (OBO)

Integrating biological ontologies
[Figure: the three GO hierarchies (Cellular components, Molecular functions, Biological processes) shown together with other OBO ontologies covering Molecule, Cell and Organism]

Linking GO to ChEBI
[Burgun & al., SMBM 2005]
ChEBI
• Member of the OBO family
• Ontology of Chemical Entities of Biological Interest
  • Atom
  • Molecule
  • Ion
  • Radical
• 10,516 entities
• 27,097 terms
[Dec. 22, 2005]

Methods
• Every ChEBI term searched in every GO term
• Maximize precision
  • Ignored ChEBI terms of 3 characters or less
  • Proper substring
• Maximize recall
  • Case insensitive matches
  • Normalized ChEBI names (generated singular forms from plurals)

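The matching procedure can be sketched as follows. Several details are assumptions, since the slide gives only the outline: "proper substring" is read as "the ChEBI name occurs inside a longer GO term name", matches are required to fall on word boundaries, and singular forms are generated by naively dropping a final "s". The function names are illustrative.

```python
import re


def singular_forms(name):
    """Return the name plus a naive singular form (assumed rule: drop a final 's')."""
    forms = {name}
    if name.endswith("s") and not name.endswith("ss"):
        forms.add(name[:-1])
    return forms


def find_chebi_in_go(chebi_names, go_names, min_length=4):
    """Return sorted (ChEBI name, GO term name) pairs where the former occurs in the latter."""
    matches = set()
    for chebi in chebi_names:
        for form in singular_forms(chebi.lower()):  # case-insensitive matching
            if len(form) < min_length:
                continue  # ignore names of 3 characters or less
            # Assumption: a match must start and end at a word boundary.
            pattern = re.compile(
                r"(?<![a-z0-9])" + re.escape(form) + r"(?![a-z0-9])"
            )
            for go in go_names:
                # Proper substring: the GO name must be longer than the match.
                if len(form) < len(go) and pattern.search(go.lower()):
                    matches.add((chebi, go))
    return sorted(matches)


chebi_names = ["iron", "carbon", "uronic acid", "ion"]
go_names = [
    "iron ion transport",
    "carbon-carbon lyase activity",
    "uronic acid metabolism",
    "response to carbon dioxide",
]
for chebi, go in find_chebi_in_go(chebi_names, go_names):
    print(f"{chebi} -> {go}")
```

In this toy run "ion" is skipped by the length filter even though it appears in "iron ion transport", which shows the recall cost of the precision-oriented rule.
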
Examples
• iron [CHEBI:18248]
  • BP iron ion transport [GO:0006826]
  • MF iron superoxide dismutase activity [GO:0008382]
  • CC vanadium-iron nitrogenase complex [GO:0016613]
• carbon [CHEBI:27594]
  • BP response to carbon dioxide [GO:0010037]
  • MF carbon-carbon lyase activity [GO:0016830]
• uronic acid [CHEBI:27252]
  • BP uronic acid metabolism [GO:0006063]
  • MF uronic acid transporter activity [GO:0015133]

Quantitative results
• 2,700 ChEBI entities (27%) identified in some GO term
• 9,431 GO terms (55%) include some ChEBI entity in their names
[Figure: 10,516 ChEBI entities linked to 17,250 GO terms by 20,497 associations]

Generalization
[Figure: the GO hierarchies (MF, CC, BP) linked to other OBO ontologies: ChEBI, FMA, Cell types, REX, FIX, Mouse Pathology]
Example terms: cell membrane viscosity, enzymatic reaction, Leydig cell tumor, iron deposition, cerebellar aplasia

Conclusions
Conclusions (1)
• Links across OBO ontologies need to be made explicit
  • Between GO terms across GO hierarchies
  • Between GO terms and OBO terms
  • Between terms across OBO ontologies
• Automatic approaches
  • Effective (GO-GO, GO-ChEBI)
  • At least to bootstrap the process
  • Need to be refined

Conclusions (2)
• Affordable relations
  • Computer-intensive, not labor-intensive
• Methods must be combined
  • Cross-validation
  • Redundancy as a surrogate for reliability
  • Relations identified specifically by one approach
    • False positives
    • Specific strength of a particular method
• Requires (some) manual curation
  • Biologists must be involved

References
• Bodenreider O, Aubry M, Burgun A. Non-lexical approaches to identifying associative relations in the Gene Ontology. In: Altman RB, Dunker AK, Hunter L, Jung TA, Klein TE, editors. Pacific Symposium on Biocomputing 2005: World Scientific; 2005. p. 91-102.
  http://mor.nlm.nih.gov/pubs/pdf/2005-psb-ob.pdf
• Burgun A, Bodenreider O. An ontology of chemical entities helps identify dependence relations among Gene Ontology terms. Proceedings of the First International Symposium on Semantic Mining in Biomedicine (SMBM-2005). Electronic proceedings: CEUR-WS/Vol-148.
  http://mor.nlm.nih.gov/pubs/pdf/2005-smbm-ab.pdf

Medical Ontology Research
Olivier Bodenreider
Lister Hill National Center for Biomedical Communications, Bethesda, Maryland - USA
Contact: [email protected]
Web: mor.nlm.nih.gov