Enriching Bio-ontologies with Non-hierarchical Relations

38
Enriching Bio-ontologies with Non-hierarchical Relations Workshop on bio-ontologies October 28, 2005 Olivier Bodenreider Olivier Bodenreider Lister Hill National Center Lister Hill National Center for Biomedical Communications for Biomedical Communications Bethesda, Maryland - USA Bethesda, Maryland - USA

description

Workshop on bio-ontologies October 28, 2005. Enriching Bio-ontologies with Non-hierarchical Relations. Olivier Bodenreider Lister Hill National Center for Biomedical Communications Bethesda, Maryland - USA. Acknowledgments. Marc Aubry UMR 6061 CNRS, Rennes, France - PowerPoint PPT Presentation

Transcript of Enriching Bio-ontologies with Non-hierarchical Relations

Page 1: Enriching Bio-ontologies with Non-hierarchical Relations

Enriching Bio-ontologieswith Non-hierarchical Relations

Workshop on bio-ontologiesOctober 28, 2005

Olivier BodenreiderOlivier Bodenreider

Lister Hill National CenterLister Hill National Centerfor Biomedical Communicationsfor Biomedical CommunicationsBethesda, Maryland - USABethesda, Maryland - USA

Page 2: Enriching Bio-ontologies with Non-hierarchical Relations

2 Lister Hill National Center for Biomedical CommunicationsLister Hill National Center for Biomedical Communications

AcknowledgmentsAcknowledgments

Marc AubryMarc AubryUMR 6061 CNRS,UMR 6061 CNRS,Rennes, FranceRennes, France

Anita BurgunAnita BurgunUniversity of University of Rennes, FranceRennes, France

Page 3: Enriching Bio-ontologies with Non-hierarchical Relations

3 Lister Hill National Center for Biomedical CommunicationsLister Hill National Center for Biomedical Communications

Gene OntologyGene Ontology

Annotate gene productsAnnotate gene products CoverageCoverage

Molecular functionsMolecular functions Cellular componentsCellular components Biological processesBiological processes

Explicit relations to other terms within the same Explicit relations to other terms within the same hierarchyhierarchy

No (explicit) relationsNo (explicit) relations To terms across hierarchiesTo terms across hierarchies To concepts from other biological ontologiesTo concepts from other biological ontologies

Page 4: Enriching Bio-ontologies with Non-hierarchical Relations

4 Lister Hill National Center for Biomedical CommunicationsLister Hill National Center for Biomedical Communications

Gene OntologyGene Ontology

Molecularfunctions

Cellularcomponents

Biologicalprocesses

BP: metal ion transportMF: metal ion transporter activity

Page 5: Enriching Bio-ontologies with Non-hierarchical Relations

5 Lister Hill National Center for Biomedical CommunicationsLister Hill National Center for Biomedical Communications

MotivationMotivation

Richer ontologyRicher ontology Beyond hierarchiesBeyond hierarchies

Easier to maintainEasier to maintain Explicit dependence relationsExplicit dependence relations

More consistent annotationsMore consistent annotations Quality assurance Quality assurance Assisted curationAssisted curation

Page 6: Enriching Bio-ontologies with Non-hierarchical Relations

6 Lister Hill National Center for Biomedical CommunicationsLister Hill National Center for Biomedical Communications

Related workRelated work

Ontologizing GOOntologizing GO GONGGONG

Identifying relations among GO terms across hierarchiesIdentifying relations among GO terms across hierarchies Lexical approachLexical approach Non-lexical approachesNon-lexical approaches

Identifying relations between GO terms and OBO termsIdentifying relations between GO terms and OBO terms ChEBIChEBI

Representing relations among GO terms and between GO Representing relations among GO terms and between GO terms and OBO termsterms and OBO terms ObolObol

See also:See also:

[Ogren & al., PSB 2004-2005]

[Bodenreider & al., PSB 2005]

[Wroe & al., PSB 2003]

[Mungall, CFG 2005]

[Burgun & al., SMBM 2005]

[Bada & al., 2004], [Kumar & al., 2004], [Dolan & al., 2005]

Page 7: Enriching Bio-ontologies with Non-hierarchical Relations

Non-lexical approaches to identifyingassociative relations in the Gene Ontology

Page 8: Enriching Bio-ontologies with Non-hierarchical Relations

8 Lister Hill National Center for Biomedical CommunicationsLister Hill National Center for Biomedical Communications

GO and annotation databasesGO and annotation databases

5 model organisms5 model organisms FlyBaseFlyBase GOA-HumanGOA-Human MGIMGI SGDSGD WormBaseWormBase

Brca1Brca1 GO:0000793GO:0000793 IDAIDA

Brca1Brca1 GO:0003677GO:0003677 IEAIEA

Brca1Brca1 GO:0003684GO:0003684 IDAIDA

Brca1Brca1 GO:0003723GO:0003723 ISSISS

Brca1Brca1 GO:0004553GO:0004553 ISSISS

Brca1Brca1 GO:0005515GO:0005515 ISS, TASISS, TAS

Brca1Brca1 GO:0005622GO:0005622 IEAIEA

Brca1Brca1 GO:0005634GO:0005634 IDA, ISSIDA, ISS

Brca1Brca1 ……270,000 gene-term associations270,000 gene-term associations

Page 9: Enriching Bio-ontologies with Non-hierarchical Relations

9 Lister Hill National Center for Biomedical CommunicationsLister Hill National Center for Biomedical Communications

Three non-lexical approachesThree non-lexical approaches

All based on annotation databasesAll based on annotation databases

Similarity in the vector space modelSimilarity in the vector space model

Statistical analysis of co-occurring GO termsStatistical analysis of co-occurring GO terms

Association rule miningAssociation rule mining

Page 10: Enriching Bio-ontologies with Non-hierarchical Relations

10 Lister Hill National Center for Biomedical CommunicationsLister Hill National Center for Biomedical Communications

Similarity in the vector space modelSimilarity in the vector space model

Annotationdatabase

Annotationdatabase

GO

term

s

Genesg1 g2 … gn

t1

t2

tn

GO terms

Gen

es

g1

g2

gn

t1 t2 … tn

Page 11: Enriching Bio-ontologies with Non-hierarchical Relations

11 Lister Hill National Center for Biomedical CommunicationsLister Hill National Center for Biomedical Communications

Similarity in the vector space modelSimilarity in the vector space model

GO terms

GO

term

s t1

t2

tn

t1 t2 … tn

Similaritymatrix

Similaritymatrix

Sim(ti,tj) = ti · tj

GO

term

s

Genesg1 g2 … gn

t1

t2

tn

tj

ti

Page 12: Enriching Bio-ontologies with Non-hierarchical Relations

12 Lister Hill National Center for Biomedical CommunicationsLister Hill National Center for Biomedical Communications

Analysis of co-occurring GO termsAnalysis of co-occurring GO terms

Annotationdatabase

Annotationdatabase

GO terms

Gen

es

g1

g2

gn

t1 t2 … tn

g2

t2 t7 t9

… t3 t7 t9

g5

t5

tt22-t-t77 11

tt22-t-t99 11

tt77-t-t99 22

……

tt55 11

tt77 22

tt99 22

……

Page 13: Enriching Bio-ontologies with Non-hierarchical Relations

13 Lister Hill National Center for Biomedical CommunicationsLister Hill National Center for Biomedical Communications

Analysis of co-occurring GO termsAnalysis of co-occurring GO terms

Statistical analysis: test independenceStatistical analysis: test independence Likelihood ratio test (GLikelihood ratio test (G22)) Chi-square test (Pearson’s Chi-square test (Pearson’s χχ22))

Example from GOA (Example from GOA (22,72022,720 annotations) annotations) C0006955 [BP]C0006955 [BP] Freq. = Freq. = 588588 C0008009 [MF]C0008009 [MF] Freq. = Freq. = 5353

presentpresent absentabsent TotalTotal

presentpresent 4646 542542 588588

absentabsent 77 21,58321,583 22,13222,132

totaltotal 5353 22,12522,125 22,72022,720

GO:0008009 immune responseimmune response

GO:0006955

chemokinechemokineactivityactivity

Co-oc. = Co-oc. = 4646

GG22 = 298.7 = 298.7p < 0.000p < 0.000

Page 14: Enriching Bio-ontologies with Non-hierarchical Relations

14 Lister Hill National Center for Biomedical CommunicationsLister Hill National Center for Biomedical Communications

Association rule miningAssociation rule mining

Annotationdatabase

Annotationdatabase

GO terms

Gen

es

g1

g2

gn

t1 t2 … tn

g2

t2 t7 t9

transaction

apriori• Rules: t1 => t2

• Confidence: > .9• Support: .05

Page 15: Enriching Bio-ontologies with Non-hierarchical Relations

15 Lister Hill National Center for Biomedical CommunicationsLister Hill National Center for Biomedical Communications

Examples of associationsExamples of associations

Vector space modelVector space model MF: ice binding BP: response to freezing

Co-occurring terms MF: chromatin binding CC: nuclear chromatin

Association rule mining MF: carboxypeptidase A activity BP: peptolysis and peptidolysis

Page 16: Enriching Bio-ontologies with Non-hierarchical Relations

16 Lister Hill National Center for Biomedical CommunicationsLister Hill National Center for Biomedical Communications

Associations identifiedAssociations identified

VSM COC ARM LEX

MF-CC 499 893 362 917

MF-BP 3057 1628 577 2523

CC-BP 760 1047 329 2053

Total 4316 3568 1268 5493

7665 by at least one approach

Page 17: Enriching Bio-ontologies with Non-hierarchical Relations

17 Lister Hill National Center for Biomedical CommunicationsLister Hill National Center for Biomedical Communications

Associations identified Associations identified VSMVSM

VSM COC ARM LEX

MF-CC 499 893 362 917

MF-BP 3057 1628 577 2523

CC-BP 760 1047 329 2053

Total 4316 3568 1268 5493

MF: ice bindingBP: response to freezing

Page 18: Enriching Bio-ontologies with Non-hierarchical Relations

18 Lister Hill National Center for Biomedical CommunicationsLister Hill National Center for Biomedical Communications

Associations identified Associations identified COCCOC

VSM COC ARM LEX

MF-CC 499 893 362 917

MF-BP 3057 1628 577 2523

CC-BP 760 1047 329 2053

Total 4316 3568 1268 5493

MF: chromatin bindingCC: nuclear chromatin

Page 19: Enriching Bio-ontologies with Non-hierarchical Relations

19 Lister Hill National Center for Biomedical CommunicationsLister Hill National Center for Biomedical Communications

Associations identified Associations identified ARMARM

VSM COC ARM LEX

MF-CC 499 893 362 917

MF-BP 3057 1628 577 2523

CC-BP 760 1047 329 2053

Total 4316 3568 1268 5493

MF: carboxypeptidase A activityBP: peptolysis and peptidolysis

Page 20: Enriching Bio-ontologies with Non-hierarchical Relations

20 Lister Hill National Center for Biomedical CommunicationsLister Hill National Center for Biomedical Communications

Limited overlap among approachesLimited overlap among approaches

Lexical vs. non-lexicalLexical vs. non-lexical Among non-lexicalAmong non-lexical

7665 5493

230

VSM COC

ARM

3689 2587

453

305 309

121201

Page 21: Enriching Bio-ontologies with Non-hierarchical Relations

Linking the Gene Ontologyto other biological ontologies

Page 22: Enriching Bio-ontologies with Non-hierarchical Relations

22 Lister Hill National Center for Biomedical CommunicationsLister Hill National Center for Biomedical Communications

Related domainsRelated domains

OrganismsOrganisms Cell typesCell types Physical entitiesPhysical entities

Gross anatomyGross anatomy MoleculesMolecules

FunctionsFunctions Organism functionsOrganism functions Cell functionsCell functions

PathologyPathology

cytosolic ribosome (sensu Eukaryota)

T-cell activation

brain development

transferrin receptor activity

visual perception

T-cell activation

regulation of blood pressure

Page 23: Enriching Bio-ontologies with Non-hierarchical Relations

23 Lister Hill National Center for Biomedical CommunicationsLister Hill National Center for Biomedical Communications

GO and other domainsGO and other domains

[adapted from B. Smith]

Physical Physical entityentity

FunctionFunction ProcessProcess

OrganismOrganismGross Gross anatomyanatomy

Organism Organism functionsfunctions

Organism Organism processesprocesses

CellCellCellular Cellular componentscomponents

Cellular Cellular functionsfunctions

Cellular Cellular processesprocesses

MoleculeMolecule MoleculesMoleculesMolecular Molecular functionsfunctions

Molecular Molecular processesprocesses

Page 24: Enriching Bio-ontologies with Non-hierarchical Relations

24 Lister Hill National Center for Biomedical CommunicationsLister Hill National Center for Biomedical Communications

GO and other domains (revisited)GO and other domains (revisited)

[adapted from B. Smith]

ResolutionResolutionPhysical Physical wholewhole

PhysicalPhysicalpartpart

FunctionFunction ProcessProcess

OrganismOrganismOrganism Organism componentscomponents

Organism Organism functionsfunctions

Organism Organism processesprocesses

CellCellCellular Cellular componentscomponents

Cellular Cellular functionsfunctions

Cellular Cellular processesprocesses

MoleculeMoleculeMolecular Molecular componentscomponents

Molecular Molecular functionsfunctions

Molecular Molecular processesprocesses

Page 25: Enriching Bio-ontologies with Non-hierarchical Relations

25 Lister Hill National Center for Biomedical CommunicationsLister Hill National Center for Biomedical Communications

Biological ontologies (OBO)Biological ontologies (OBO)

Page 26: Enriching Bio-ontologies with Non-hierarchical Relations

26 Lister Hill National Center for Biomedical CommunicationsLister Hill National Center for Biomedical Communications

GO and other domains (revisited)GO and other domains (revisited)

[adapted from B. Smith]

ResolutionResolutionPhysical Physical wholewhole

PhysicalPhysicalpartpart

FunctionFunction ProcessProcess

OrganismOrganismOrganism Organism componentscomponents

Organism Organism functionsfunctions

Organism Organism processesprocesses

CellCellCellular Cellular componentscomponents

Cellular Cellular functionsfunctions

Cellular Cellular processesprocesses

MoleculeMoleculeMolecular Molecular componentscomponents

Molecular Molecular functionsfunctions

Molecular Molecular processesprocesses

Page 27: Enriching Bio-ontologies with Non-hierarchical Relations

27 Lister Hill National Center for Biomedical CommunicationsLister Hill National Center for Biomedical Communications

Integrating biological ontologiesIntegrating biological ontologies

Cellularcomponents

Cellularcomponents

Molecularfunctions

Molecularfunctions

MoleculeMolecule

Biologocalprocesses

Biologocalprocesses

OrganismOrganism

Other OBOontologies

Other OBOontologies

CellCell

Page 28: Enriching Bio-ontologies with Non-hierarchical Relations

Linking GO to ChEBI

[Burgun & al., SMBM 2005]

Page 29: Enriching Bio-ontologies with Non-hierarchical Relations

29 Lister Hill National Center for Biomedical CommunicationsLister Hill National Center for Biomedical Communications

ChEBIChEBI

Member of the OBO familyMember of the OBO family Ontology ofOntology of

ChChemical emical EEntities of ntities of BBiological iological IInterestnterest AtomAtom MoleculeMolecule IonIon RadicalRadical

10,516 entities10,516 entities 27,097 terms27,097 terms [Dec. 22, 2005]

Page 30: Enriching Bio-ontologies with Non-hierarchical Relations

30 Lister Hill National Center for Biomedical CommunicationsLister Hill National Center for Biomedical Communications

MethodsMethods

Every ChEBI term searched in every GO termEvery ChEBI term searched in every GO term Maximize precisionMaximize precision

Ignored ChEBI terms of 3 characters or lessIgnored ChEBI terms of 3 characters or less Proper substringProper substring

Maximize recallMaximize recall Case insensitive matchesCase insensitive matches Normalized ChEBI namesNormalized ChEBI names

(generated singular forms from plurals)(generated singular forms from plurals)

Page 31: Enriching Bio-ontologies with Non-hierarchical Relations

31 Lister Hill National Center for Biomedical CommunicationsLister Hill National Center for Biomedical Communications

ExamplesExamples

iron [CHEBI:18248]

uronic acid [CHEBI:27252]

carbon [CHEBI:27594]

BP iron ion transport [GO:0006826]

MF iron superoxide dismutase activity [GO:0008382]

CC vanadium-iron nitrogenase complex [GO:0016613]

BP response to carbon dioxide [GO:0010037]

MF carbon-carbon lyase activity [GO:0016830]

BP uronic acid metabolism [GO:0006063]

MF uronic acid transporter activity [GO:0015133]

Page 32: Enriching Bio-ontologies with Non-hierarchical Relations

32 Lister Hill National Center for Biomedical CommunicationsLister Hill National Center for Biomedical Communications

Quantitative resultsQuantitative results

2,700 ChEBI entities 2,700 ChEBI entities (27%) identified in some (27%) identified in some GO termGO term

9,431 GO terms (55%) 9,431 GO terms (55%) include some ChEBI include some ChEBI entity in their namesentity in their names

10,516 entities 17,250 terms

20,497 associations

Page 33: Enriching Bio-ontologies with Non-hierarchical Relations

33 Lister Hill National Center for Biomedical CommunicationsLister Hill National Center for Biomedical Communications

GeneralizationGeneralization

MFCC BP

CHEBI

FMA

Cell types

REX FIX

MousePathology

cellmembraneviscosity

enzymaticreaction

Leydic cell tumor

irondeposition

cerebellar aplasia

Page 34: Enriching Bio-ontologies with Non-hierarchical Relations

Conclusions

Page 35: Enriching Bio-ontologies with Non-hierarchical Relations

35 Lister Hill National Center for Biomedical CommunicationsLister Hill National Center for Biomedical Communications

Conclusions (1)Conclusions (1)

Links across OBO ontologies need to be made Links across OBO ontologies need to be made explicitexplicit Between GO terms across GO hierarchiesBetween GO terms across GO hierarchies Between GO terms and OBO termsBetween GO terms and OBO terms Between terms across OBO ontologiesBetween terms across OBO ontologies

Automatic approachesAutomatic approaches Effective (GO-GO, GO-ChEBI)Effective (GO-GO, GO-ChEBI) At least to bootstrap the processAt least to bootstrap the process Needs to be refinedNeeds to be refined

Page 36: Enriching Bio-ontologies with Non-hierarchical Relations

36 Lister Hill National Center for Biomedical CommunicationsLister Hill National Center for Biomedical Communications

Conclusions (2)Conclusions (2)

Affordable relationsAffordable relations Computer-intensive, not labor-intensiveComputer-intensive, not labor-intensive

Methods must be combinedMethods must be combined Cross-validationCross-validation Redundancy as a surrogate for reliabilityRedundancy as a surrogate for reliability Relations identified specifically by one approachRelations identified specifically by one approach

False positivesFalse positives Specific strength of a particular methodSpecific strength of a particular method

Requires (some) manual curationRequires (some) manual curation Biologists must be involvedBiologists must be involved

Page 37: Enriching Bio-ontologies with Non-hierarchical Relations

37 Lister Hill National Center for Biomedical CommunicationsLister Hill National Center for Biomedical Communications

ReferencesReferences

Bodenreider O, Aubry M, Burgun A. Bodenreider O, Aubry M, Burgun A. Non-lexical approaches to Non-lexical approaches to identifying associative relations in the Gene Ontologyidentifying associative relations in the Gene Ontology. In: Altman RB, . In: Altman RB, Dunker AK, Hunter L, Jung TA, Klein TE, editors. Pacific Dunker AK, Hunter L, Jung TA, Klein TE, editors. Pacific Symposium on Biocomputing 2005: World Scientific; 2005. p. 91-Symposium on Biocomputing 2005: World Scientific; 2005. p. 91-102.102.http://mor.nlm.nih.gov/pubs/pdf/2005-psb-ob.pdfhttp://mor.nlm.nih.gov/pubs/pdf/2005-psb-ob.pdf

Burgun A, Bodenreider O. Burgun A, Bodenreider O. An ontology of chemical entities helps An ontology of chemical entities helps identify dependence relations among Gene Ontology termsidentify dependence relations among Gene Ontology terms. . Proceedings of the First International Symposium on Semantic Mining Proceedings of the First International Symposium on Semantic Mining in Biomedicine (SMBM-2005)in Biomedicine (SMBM-2005)Electronic proceedings: CEUR-WS/Vol-148Electronic proceedings: CEUR-WS/Vol-148http://mor.nlm.nih.gov/pubs/pdf/2005-smbm-ab.pdfhttp://mor.nlm.nih.gov/pubs/pdf/2005-smbm-ab.pdf

Page 38: Enriching Bio-ontologies with Non-hierarchical Relations

MedicalOntologyResearch

Olivier BodenreiderOlivier Bodenreider

Lister Hill National CenterLister Hill National Centerfor Biomedical Communicationsfor Biomedical CommunicationsBethesda, Maryland - USABethesda, Maryland - USA

Contact:Contact:Web:Web:

[email protected]@nlm.nih.govmor.nlm.nih.govmor.nlm.nih.gov