How to build cross-species interoperable ontologies Chris Mungall, LBNL Melissa Haendel, OHSU.
How to build cross-species interoperable ontologies
Chris Mungall, LBNL
Melissa Haendel, OHSU
The challenge…
• There are many fun and interesting issues involved in building and using cross-species ontologies
  – homology
  – evo-devo
  – reasoning using ontologies
  – connecting genomics databases to phenotypes
but…
• Unfortunately, there are many more prosaic issues with unsatisfying solutions
  – multiple ontologies already exist
  – limited cooperation between the developers of these ontologies
  – they differ widely in every aspect imaginable
  – they are heavily embedded in existing databases and applications and slow to change
  – tools and infrastructure support falls short of what we need
• FORTUNATELY, solutions are emerging…
Outline
• Anatomy Ontologies: Background
• Case studies
  – GO: A unified cross-species ontology
  – CL: Cell Ontology: Unifying multiple existing efforts
• Building interoperable gross anatomy ontologies
  – (Melissa)
Ontologies
• Computable qualitative representations of some part of the world
• Relationships with computable properties
  – e.g. transitivity
  – languages and formats like OWL and OBO have a formal semantics
• Entities are grouped into classes
• Relationships are statements about all the members of a class
  – the most common form is the all-some statement
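A minimal sketch of what a "computable property" such as transitivity buys you, assuming a toy set of part_of assertions (the class names and the function `transitive_closure` are illustrative, not taken from any real ontology file or library):

```python
def transitive_closure(edges):
    """Return every (x, z) pair implied by treating `edges` as transitive."""
    closure = set(edges)
    while True:
        # Join the relation with itself: x->y and y->z gives x->z.
        new_pairs = {
            (x, z)
            for (x, y1) in closure
            for (y2, z) in closure
            if y1 == y2
        } - closure
        if not new_pairs:
            return closure
        closure |= new_pairs


# Toy part_of assertions (illustrative only).
asserted_part_of = {
    ("nucleolus", "nucleus"),
    ("nucleus", "cell"),
}
inferred_part_of = transitive_closure(asserted_part_of)
print(("nucleolus", "cell") in inferred_part_of)
```

Only two statements are asserted; the third (nucleolus part_of cell) is derived, which is the kind of work an OWL or OBO reasoner does at scale.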
Ontologies are not smart
• Deductive logic is not flexible
• Example
  – Human knowledge:
    • chromosomes are found in the nucleus
  – Naïve ontology encoding:
    • every chromosome part_of some nucleus
  – But this is wrong
    • Ontologies don’t make exceptions!
  – Solution:
    • (1) create location-specific subclasses
      – nuclear chromosome
      – mitochondrial chromosome
    • (2) invert statement: every nucleus has chromosomes
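The chromosome example can be sketched as a check of an all-some statement against a handful of instances. Everything here is an assumption made for illustration: the instance data, the dictionary layout and the helper `all_some_violations` are hypothetical, and a real reasoner works on class axioms rather than on a closed list of instances.

```python
def all_some_violations(instances, subject_class, relation, filler_class):
    """Return ids of subject_class instances lacking a `relation` link
    to at least one instance of filler_class."""
    violations = []
    for inst_id, inst in instances.items():
        if subject_class not in inst["types"]:
            continue
        targets = inst.get(relation, [])
        if not any(filler_class in instances[t]["types"] for t in targets):
            violations.append(inst_id)
    return sorted(violations)


# Toy instance data (hypothetical).
instances = {
    "nucleus1": {"types": {"nucleus"}},
    "mito1": {"types": {"mitochondrion"}},
    "chr1": {"types": {"chromosome", "nuclear chromosome"},
             "part_of": ["nucleus1"]},
    "mtDNA": {"types": {"chromosome", "mitochondrial chromosome"},
              "part_of": ["mito1"]},
}

# Naive encoding: every chromosome part_of some nucleus.
naive = all_some_violations(instances, "chromosome", "part_of", "nucleus")
# Solution (1): state it only for the location-specific subclass.
fixed = all_some_violations(instances, "nuclear chromosome", "part_of", "nucleus")
print(naive, fixed)
```

The naive statement is contradicted by the mitochondrial chromosome, while the statement restricted to "nuclear chromosome" holds for every instance in the toy data.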
Existing Anatomy Ontologies
• Human AOs
• Model Organism AOs
• Domain specific AOs
• Cross-species AOs
FMA: Foundational Model of Anatomy
• Domain: adult human
  – no develops_from relationships, few embryonic structures
• Size: large (70k+ classes)
• Language: frames
• Approach
  – Formal, strict single inheritance, purely structural perspective
  – No computable definitions
  – Heavily pre-coordinated
    • “Trunk of communicating branch of zygomatic branch of right facial nerve with zygomaticofacial branch of right zygomatic nerve”
    • “Distal epiphysis of distal phalanx of right little toe”
  – Extensive spatial relationships in selected areas
    • e.g. veins, arteries
• Uses
  – not designed for one particular use
FMA Example
/
FMA:62955 ! Anatomical entity
 is_a FMA:61775 ! Physical anatomical entity
  is_a FMA:67165 ! Material anatomical entity
   is_a FMA:67135 ! Anatomical structure
    is_a FMA:67498 ! Organ
     is_a FMA:55670 ! Solid organ
      is_a FMA:55661 ! Parenchymatous organ
       is_a FMA:55662 ! Lobular organ
        is_a FMA:13889 ! Pituitary gland
        is_a FMA:20020 ! Vestibular gland
        is_a FMA:55533 ! Accessory thyroid gland
        is_a FMA:58090 ! Areolar gland
        is_a FMA:59101 ! Lacrimal gland
        is_a FMA:62088 ! Lactiferous gland
        is_a FMA:7195 ! Lung
        is_a FMA:7197 ! Liver
        is_a FMA:7198 ! Pancreas
        is_a FMA:7210 ! Testis
        is_a FMA:76835 ! Accessory pancreas
        is_a FMA:9597 ! Salivary gland
        is_a FMA:9599 ! Bulbo-urethral gland
        is_a FMA:9600 ! Prostate
        is_a FMA:9603 ! Thyroid gland
Model Organism Anatomy Ontologies
• Typically species-centric
  – FBbt: Drosophila melanogaster
  – WBbt: C. elegans
  – ZFA: Danio rerio
  – XAO: Xenopus
  – MA: adult mouse (no develops_from)
  – EMAP/EMAPA: developing mouse
• Uses
  – primarily gene expression, also phenotype description
  – others: Virtual Fly Brain, Phenoscape
• Approach:
  – use-case driven
  – practicality over formality
  – No computable definitions
    • (exception: FBbt)
Other anatomy ontologies
• Developing human
  – EHDAA2
• Vectors
  – TGMA – mosquito
  – TADS – tick
• Upper ontologies
  – CARO
  – AEO
• Domain-specific anatomy ontologies
  – NIF_Anatomy, NIF_Cell – neuroscience
• Phylogenetic or multi-taxon AOs
  – HAO – Hymenoptera
  – PO – plant
  – TAO – teleost
  – AAO – amphibian
  – SPD – spider
  – …
  – we will return to these later
Problem
• These AOs are not developed in a coordinated fashion
  – use of a shared upper ontology does not buy us much
  – even the 3 mammalian AOs are massively different
• Data annotated using these ontologies effectively becomes siloed
• There is redundancy of effort in areas of shared biology
• Are there lessons from existing ontologies?
Building ontologies that are interoperable across species
• Case Studies
  – GO
  – Cell Ontology
Gene Ontology
• Covers all kingdoms of life
  – viruses, bacteria, archaea
  – fungi, metazoans, plants
• Covers biology at different scales
• Issues
  – terminological confusion (e.g. “blood”)
  – large, difficult to maintain
How does GO deal with taxonomic variation?
• What GO says:
  – every nucleus is part_of some cell
• What GO does not say:
  – every cell has_part some nucleus
    • wrong for bacteria (and mammalian erythrocytes)
• Take home:
  – Logical quantifiers are essential to understanding the ontology
  – Saying what something is part of is safer than saying what its parts are
Principle: avoidance of taxonomic differentia
• Not in GO:
  – vertebrate eye development
  – insect eye development
  – cephalopod eye development
• In GO:
  – eye development
    • camera-type eye development
    • compound eye development
• Exceptions for usability:
  – cell wall
    • fungal-type cell wall [differentia: cross-linked glycoproteins and carbohydrates, chitin / beta-glucan …]
    • plant-type cell wall [differentia: cellulose, pectin, …]
} no implication of homology
The problem of vagueness in GO
• “limb development”
• “wing development”
Adding taxonomic constraints to GO
• GO now includes two additional relations
  – only_in_taxon
  – never_in_taxon
  – See:
    • Kusnierczyk, W: Taxonomy-based partitioning of the Gene Ontology, JBI 2008
    • Deegan et al: Formalization of taxon-based constraints to detect inconsistencies in annotation and ontology development, BMC Bioinformatics 2010
Examples
• lactation only_in_taxon Mammalia (NCBITaxon:40674)
  – OWL: lactation in_taxon only Mammalia
• odontogenesis never_in_taxon Aves (NCBITaxon:8782)
  – OWL: odontogenesis in_taxon only not Aves
• chloroplast only_in_taxon (Viridiplantae or Euglenozoa) (NCBITaxon:33090 or NCBITaxon:33682)
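As a rough sketch of how such constraints can flag bad annotations, the check below walks a toy child-to-parent taxonomy. The taxonomy fragment, the constraint tables and the function names are assumptions for illustration; a real pipeline would load the NCBI taxonomy and the constraints shipped with GO.

```python
# Toy taxonomy: child -> parent (hypothetical fragment).
PARENT = {
    "Gallus gallus": "Aves",
    "Aves": "Vertebrata",
    "Mus musculus": "Mammalia",
    "Mammalia": "Vertebrata",
    "Vertebrata": "Metazoa",
}

# Constraints copied from the examples; each value is a set so that a
# union such as (Viridiplantae or Euglenozoa) fits the same structure.
ONLY_IN = {"lactation": {"Mammalia"}}
NEVER_IN = {"odontogenesis": {"Aves"}}


def lineage(taxon):
    """Return the taxon together with all of its ancestors."""
    result = {taxon}
    while taxon in PARENT:
        taxon = PARENT[taxon]
        result.add(taxon)
    return result


def check_annotation(term, taxon):
    """Return a list of constraint violations for annotating `term` in `taxon`."""
    ancestors = lineage(taxon)
    problems = []
    allowed = ONLY_IN.get(term)
    if allowed is not None and not (ancestors & allowed):
        problems.append(f"{term} only_in_taxon {sorted(allowed)}, got {taxon}")
    for excluded in sorted(NEVER_IN.get(term, set()) & ancestors):
        problems.append(f"{term} never_in_taxon {excluded}, got {taxon}")
    return problems


print(check_annotation("lactation", "Gallus gallus"))
print(check_annotation("odontogenesis", "Mus musculus"))
```

The first call reports a violation (a chicken gene annotated to lactation) and the second returns an empty list.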
Uses of taxon relationships
1. Clarifying meaning of GO terms
2. Detection of errors in electronic and manual annotation
   • Automated reasoners
   • GO previously had chicken genes involved in lactation, slime mold genes involved in fin regeneration…
3. Providing views over GO
   • e.g. subset of GO excluding terms that are never in Drosophila
Scalability of single-ontology approach: GO
• How does GO cope with wide taxonomic diversity?
  – conservation at molecular level, wide diversity of phenotypes at level of gross anatomical development, physiology, and organismal behavior
• GO Development
  – Focused on model systems
    • “beak development” added only recently
• GO Behavior
  – Very broad coverage
  – Some specific terms, e.g. Drosophila courtship
Proposal: outsource portions of the ontology
Ontology Views
• Ontologies, traditional
  – independent standalone resources
• Ontologies, new
  – interconnected resources
  – multiple views possible
    • Subsetting
    • Aggregation
    • Subsetting + Aggregation
  – views can be manually specified (e.g. GO slims) or automatically constructed
  – Limited re-writing possible
    • e.g. names
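A small sketch of subsetting and aggregation, assuming each ontology is held as a dictionary of term id to name and is_a parents. The TOY_ identifiers, the data layout and both function names are hypothetical, and the subsetting here is deliberately naive: links to dropped parents are discarded rather than re-routed, which a real slim generator would handle more carefully.

```python
def aggregate_view(*ontologies):
    """Union several ontologies into one view; ids must not collide."""
    merged = {}
    for ontology in ontologies:
        for term_id, term in ontology.items():
            if term_id in merged:
                raise ValueError(f"duplicate id in aggregate: {term_id}")
            merged[term_id] = term
    return merged


def subset_view(ontology, keep):
    """Keep only the listed ids, dropping is_a links that leave the subset."""
    return {
        term_id: {
            "name": term["name"],
            "is_a": [parent for parent in term["is_a"] if parent in keep],
        }
        for term_id, term in ontology.items()
        if term_id in keep
    }


# Toy inputs (hypothetical ids).
toy_cl = {
    "TOY_CL:cell": {"name": "cell", "is_a": []},
    "TOY_CL:muscle_cell": {"name": "muscle cell", "is_a": ["TOY_CL:cell"]},
}
toy_fly = {
    "TOY_FLY:muscle_cell": {"name": "muscle cell",
                            "is_a": ["TOY_CL:muscle_cell"]},
}

combined = aggregate_view(toy_cl, toy_fly)                           # aggregate
slim = subset_view(combined, {"TOY_CL:cell", "TOY_CL:muscle_cell"})  # aggregate + subset
print(sorted(combined), sorted(slim))
```

Note that the aggregate already contains two classes named "muscle cell", which is the re-writing problem the last bullet alludes to.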
Views
[Figure: kinds of view: subset (“slim”, domain/taxon-specific cut, scattered subset), aggregate, and aggregate+subset; examples: subset of GO, vertebrate subset]
Outline
• Case studies
  – GO: A unified cross-species ontology
  – Cell Ontology: Unifying multiple existing efforts
• Gross Anatomy
Cell types
• GO-Cell Component
  – cell parts
• CL – cell ontology
• Anatomical Ontologies
  – Includes cell types:
    • FBbt (Drosophila)
    • WBbt (C. elegans)
    • ZFA, TAO (Danio rerio, Teleost)
    • FMA (Human)
    • PO (Plant)
    • FAO (Fungi)
  – Excludes cell types:
    • MA (adult mouse)
    • EMAPA (developing mouse)
    • EHDAA2 (developing human)
Overlap (simplified view)
[Figure: overlapping coverage of CL, MA, FMA, PO, ZFA and NIFcell; example terms: neuron, alveolar macrophage, lung, brain, plant spore]
The Problem
• Duplicated work
• No unified view
• Confusion for users
• Confusion for annotators
Alternative proposals
1. LUMP: Combine into one monolithic CL ontology
2. SPLIT: Taxon-specific cell types in taxon-centric ontologies
   a) Obsolete generic cell types currently in tcAOs
   -vs-
   b) Taxon-specific subclasses of generic cell types
LUMP
[Figure: a single “all cells” ontology with regions for mouse, human, plants and fish; example terms: neuron, alveolar macrophage, plant spore]
CL Lumping proposal
• Advantages:
  – one stop shopping for CL
    • (but this can be done with aggregate views)
• Disadvantages
  – tcAO IDs well-established
  – Little advantage to lumping plant cells with animal cells
  – Harder to manage editorially
  – Cross-granular relationships
(Partial) Splitting proposal
• Advantages:
  – Easier to manage
  – Sensible subdivision of labor:
    • Common cell types in shared common cell ontology
      – e.g. shared definition of “neuron”
    • Taxon-specific subtypes in taxon-centric ontologies
• Disadvantages
  – Aggregate view is problematic
    • union of ontologies contains multiple classes labeled “neuron”
  – Can be solved by obsoleting existing generic cell classes in tcAOs and replacing by CL IDs
    • problem: cross-granular relationships
Current solution for CL: split and retain IDs
• Any cell type shared by two model taxa should be in CL
• tcAOs retain both generic and specific cell type classes
  – Formally connected to CL via subclass relationships
    • or even stronger: taxon-specific equivalent
Example aggregate view
[Figure: FMA, FBbt and CL fragments shown together as CL-metazoa. Node labels: muscle cell and cell in each ontology, muscle organ (FMA), frontal pulsatile organ muscle (FBbt); edges labeled i and p]
Example aggregate+subset view
[Figure: the same FMA, FBbt and CL fragments (CL-metazoa) after subsetting. Node labels: muscle cell and cell in each ontology, frontal pulsatile organ muscle (FBbt); edges labeled i]
Who maintains the connections and how?
• How:
  – maintained as xrefs for convenience
• Who:
  – either tcAO or CL
• Synchronization?
  – hard
  – reasoning over aggregate view
Who maintains the connections?

cl.obo (cl’s responsibility)

[Term]
id: CL:0000584
name: enterocyte
def: "An epithelial cell that has its apical plasma membrane folded into microvilli to provide ample surface for the absorption of nutrients from the intestinal lumen." [SANBI:mhl]
xref: FMA:62122
is_a: CL:0000239 ! brush border epithelial cell

zfa.obo (zfa’s responsibility)

[Term]
id: ZFA:0009269
name: enterocyte
namespace: zebrafish_anatomy
def: "An epithelial cell that has its apical plasma membrane folded into microvilli to provide ample surface for the absorption of nutrients from the intestinal lumen." [SANBI:curator]
synonym: "enterocytes" EXACT PLURAL []
xref: CL:0000584
xref: TAO:0009269
xref: ZFIN:ZDB-ANAT-070308-209
is_a: ZFA:0009143 ! brush border epithelial cell
relationship: end ZFS:0000044 ! Adult
relationship: part_of ZFA:0005124 ! intestinal epithelium
relationship: start ZFS:0000000 ! Unknown
Issues with aggregate view
[Figure: the FMA, FBbt and CL fragments from the aggregate view (muscle cell, cell, frontal pulsatile organ muscle; edges labeled i), annotated with two issues: duplicate names, and lattices = hairballs]
Duplicate names
• Searching for “muscle cell” returns
  – CL:0000187 ! muscle cell
  – FBbt:00005074 ! muscle cell
  – FMA:67328 ! muscle cell
  – ZFA:0009114 ! muscle cell
  – NIF_Cell:sao519252327 ! Muscle Cell
• Proposed solutions
  1. rename in source ontology
     • yuck
  2. make end-user applications smarter
     • not practical for n applications
  3. auto-rename in ontology view
     • best solution
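Option 3 might look roughly like the sketch below, which prefixes a taxon label onto any name that occurs more than once in the view. The prefix-to-label table and the function `auto_rename` are assumptions; the ids are the ones listed on the slide, and the generic CL class is left unrenamed.

```python
from collections import defaultdict

# Hypothetical mapping from id prefix to the label used when renaming.
PREFIX_LABEL = {"FBbt": "Drosophila", "FMA": "human", "ZFA": "zebrafish"}


def auto_rename(names_by_id):
    """Return a copy of {id: name} with duplicated names disambiguated."""
    ids_by_name = defaultdict(list)
    for term_id, name in names_by_id.items():
        ids_by_name[name.lower()].append(term_id)

    renamed = {}
    for term_id, name in names_by_id.items():
        prefix = term_id.split(":", 1)[0]
        label = PREFIX_LABEL.get(prefix)
        is_duplicate = len(ids_by_name[name.lower()]) > 1
        # Only taxon-centric classes are renamed; the generic class keeps its name.
        renamed[term_id] = f"{label} {name}" if (is_duplicate and label) else name
    return renamed


view_names = auto_rename({
    "CL:0000187": "muscle cell",
    "FBbt:00005074": "muscle cell",
    "FMA:67328": "muscle cell",
    "ZFA:0009114": "muscle cell",
})
print(view_names)
```

Because the rewrite happens only in the generated view, the source ontologies and their established ids stay untouched.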
Aggregate view

cl-metazoa.obo

[Term]
id: CL:0000584
name: enterocyte
def: "An epithelial cell that has its apical plasma membrane folded into microvilli to provide ample surface for the absorption of nutrients from the intestinal lumen." [SANBI:mhl]
xref: FMA:62122
is_a: CL:0000239 ! brush border epithelial cell

[Term]
id: ZFA:0009269
name: zebrafish enterocyte
def: "An epithelial cell that has its apical plasma membrane folded into microvilli to provide ample surface for the absorption of nutrients from the intestinal lumen." [SANBI:curator]
synonym: "enterocytes" EXACT PLURAL []
xref: CL:0000584
xref: TAO:0009269
xref: ZFIN:ZDB-ANAT-070308-209
is_a: CL:0000584 ! enterocyte
is_a: ZFA:0009143 ! brush border epithelial cell
relationship: end ZFS:0000044 ! Adult
relationship: part_of ZFA:0005124 ! intestinal epithelium
relationship: start ZFS:0000000 ! Unknown

Callouts:
• name: zebrafish enterocyte – rewritten name (or syn – TBD)
• is_a: CL:0000584 – generated from xref
• FMA class not shown, but it would also subclass
• the two is_a parents form a lattice
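The "generated from xref" step can be sketched as follows, assuming a deliberately tiny stanza reader that handles only simple `tag: value` lines. Both `parse_stanzas` and `add_bridging_is_a` are hypothetical helpers, not part of any OBO toolkit, and the stanza is abbreviated from the zfa.obo example.

```python
def parse_stanzas(text):
    """Read [Term] stanzas into dicts of tag -> list of values (minimal parser)."""
    terms, current = [], None
    for line in text.splitlines():
        line = line.strip()
        if line == "[Term]":
            current = {}
            terms.append(current)
        elif line and current is not None and ": " in line:
            tag, value = line.split(": ", 1)
            current.setdefault(tag, []).append(value)
    return terms


def add_bridging_is_a(term, target_prefix="CL"):
    """Return a copy of the term with an is_a added for each xref to the target ontology."""
    bridged = {tag: list(values) for tag, values in term.items()}
    existing_parents = {v.split(" ! ")[0] for v in bridged.get("is_a", [])}
    for xref in term.get("xref", []):
        if xref.startswith(target_prefix + ":") and xref not in existing_parents:
            bridged.setdefault("is_a", []).append(xref)
            existing_parents.add(xref)
    return bridged


ZFA_STANZA = """[Term]
id: ZFA:0009269
name: enterocyte
xref: CL:0000584
xref: TAO:0009269
is_a: ZFA:0009143 ! brush border epithelial cell
"""

zfa_term = parse_stanzas(ZFA_STANZA)[0]
view_term = add_bridging_is_a(zfa_term)
print(view_term["is_a"])
```

The source stanza is left unchanged; the extra parent exists only in the aggregate view, so each group keeps responsibility for its own file.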
Summary: taxon variation in CL
• Current solution is a compromise
  – Constraints
    • integrate with pre-existing tcAO ontologies
    • these ontologies have links to gross anatomy
  – tcAOs loosely integrated with CL
  – plant cell types should be left to PO
  – Synchronization remains a challenge
[Figure: ontologies layered by taxon (caro / all, metazoa, vertebrata, arthropoda, mollusca, teleost, amphibia, mammalia, cephalopod, zebrafish, drosophila, mouse, human). Example classes include cell, tissue, muscle tissue, nervous system, circulatory system, gut, gonad, gland, mesoderm, appendage, limb, fin, bone, skeletal tissue, skeleton, vertebra, vertebral column, parietal bone, tibia, tibiafibula, mesonephros, trachea, respiratory airway, antenna, cuticle, larva, weberian ossicle, mammary gland, foot, shell, tentacle, mantle, brachial lobe, mushroom body, neuron types XYZ, and “NO pons”. Layers are connected by import links; a sample of cross-ontology links is shown]
Lessons for gross anatomy
Conclusions
• Historically anatomy ontologies have been developed by different groups largely in isolation
• The Phenotype RCN should coordinate these efforts
• Dynamic Views
• Explicit taxonomic relationships
• end
• Melissa Here
Idealized model (M0)
• A single ontology for ontology editors and consumers
• Different editors have editing rights to different ontology partitions
  – by taxon
  – by domain (e.g. neuroscience, skeletal anatomy)
• No taxon-specific subtypes
  – use structure, function etc. as differentia
• Users obtain dynamic views according to their needs
Example M0
[Figure: a single ontology holding classes from all taxa, e.g. cell, tissue, muscle tissue, nervous system, circulatory system, gut, gonad, gland, mesoderm, appendage, limb, fin, bone, skeletal tissue, vertebra, vertebral column, parietal bone, tibia, tibiafibula, mesonephros, trachea, respiratory airway, antenna, larva, weberian ossicle, mammary gland, mollusc foot, tentacle, mantle, brachial lobe, mushroom body, pupal DN3 period neuron, pons, metencephalon, ventral nerve cord. A small sample of links is shown, with overlapping user/editor views: mollusc view, neuro view, skeletal view, mammalian view]
Slightly less idealized model (M1)
• Maintain series of ontologies at different taxonomic levels
  – euk, plant, metazoan, vertebrate, mollusc, arthropod, insect, mammal, human, drosophila
• Each ontology imports/MIREOTs relevant subset of ontology “above” it
  – this is recursive
• Subtypes are only introduced as needed
• Work together on commonalities at appropriate level above your ontology
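The recursive import idea can be sketched as a walk up an "imports from" map. The map below arranges the levels named on the slide in the obvious taxonomic order, but the arrangement, the level names as dictionary keys and the function `import_chain` are all assumptions made for illustration.

```python
# Hypothetical "imports from" map: each ontology names the one directly above it.
IMPORTS_FROM = {
    "drosophila": "insect",
    "insect": "arthropod",
    "arthropod": "metazoan",
    "human": "mammal",
    "mammal": "vertebrate",
    "vertebrate": "metazoan",
    "mollusc": "metazoan",
    "metazoan": "euk",
    "plant": "euk",
}


def import_chain(ontology):
    """Return the ontologies reached by following imports upward, nearest first."""
    chain = []
    seen = {ontology}
    while ontology in IMPORTS_FROM:
        ontology = IMPORTS_FROM[ontology]
        if ontology in seen:
            raise ValueError(f"import cycle at {ontology}")
        seen.add(ontology)
        chain.append(ontology)
    return chain


print(import_chain("drosophila"))
print(import_chain("human"))
```

Two ontologies can collaborate at the first level their chains share, which for drosophila and human in this toy map is "metazoan".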
Example M1
[Figure: ontologies layered by taxon (caro / all, metazoa, vertebrata, arthropoda, mollusca, teleost, amphibia, mammalia, cephalopod, zebrafish, drosophila, mouse, human). Example classes include cell, tissue, muscle tissue, nervous system, circulatory system, gut, gonad, gland, mesoderm, appendage, limb, fin, bone, skeletal tissue, skeleton, vertebra, vertebral column, parietal bone, tibia, tibiafibula, mesonephros, trachea, respiratory airway, antenna, cuticle, larva, weberian ossicle, mammary gland, foot, shell, tentacle, mantle, brachial lobe, mushroom body, neuron types XYZ, and “NO pons”. Layers are connected by import links; a sample of cross-ontology links is shown]
Objections to M1
• Biological
  – homology vs analogy
  – functional grouping classes
    • e.g. respiratory airway, eye
• Practical
  – tools
  – what about existing AOs?
    • new AOs should be designed for integration from the ground up
Protocol for new AOs
1. Collect draft list of terms
2. Subdivide roughly into applicability at taxonomic levels
3. Request new terms from existing AOs above you
4. Is a new mid-level AO required?
   • yes – collaborate and create, go to 1.
5. Import subset from next AO above
6. Build your ontology
Example: the octopus ontology
• Collect and subdivide terms:
  – cephalopod: tentacle, brachial lobe, subesophageal mass, beak, visceropericardial coelom, swim bladder
  – mollusc: mantle
  – metazoan: nervous system, muscle tissue
• Mollusc anatomy ontology does not exist
  – either: (i) find collaborators and create
  – or: (ii) keep mollusc terms in your ontology for now, but mark them as possibly migrating upwards
• Import terms from mollusc AO (i), or metazoan if (ii) no mollusc AO
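Option (ii) of the octopus example can be sketched as a simple triage of the draft term list. The set of existing AOs (a metazoan-level ontology but no mollusc one), the plan categories and the function `plan_terms` are assumptions for illustration; the terms are a subset of those on the slide.

```python
# Assumption for this sketch: a metazoan-level AO exists, a mollusc AO does not.
EXISTING_AOS = {"metazoan"}

draft_terms = {
    "tentacle": "cephalopod",
    "brachial lobe": "cephalopod",
    "mantle": "mollusc",
    "nervous system": "metazoan",
    "muscle tissue": "metazoan",
}


def plan_terms(terms, own_level, existing_aos):
    """Sort draft terms into what to define, import, or hold for later migration."""
    plan = {
        "define locally": [],
        "import or request from AO above": [],
        "keep locally, may migrate upwards": [],
    }
    for term, level in sorted(terms.items()):
        if level == own_level:
            plan["define locally"].append(term)
        elif level in existing_aos:
            plan["import or request from AO above"].append(term)
        else:
            plan["keep locally, may migrate upwards"].append(term)
    return plan


octopus_plan = plan_terms(draft_terms, "cephalopod", EXISTING_AOS)
for action, terms in octopus_plan.items():
    print(action, terms)
```

If collaborators later create a mollusc AO (option i), adding "mollusc" to the set of existing AOs moves "mantle" into the import category without touching the rest of the plan.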
How are things organized now?
• 3 examples:
  – PO
  – TAO/ZFA
  – Uberon
• In Melissa’s talk
Some AOs are cross-granular
[Figure: FMA fragment spanning granularity levels (subcellular, cell, tissue and gross anatomy). Node labels: muscle cell protoplasm, muscle cell, cell, muscle organ; edges labeled i and p]
Cross-granular relationships
[Figure: FMA fragment. Node labels: muscle cell, cell, muscle organ; edges labeled i and p]
Cross-granular relationships
[Figure: FMA fragment (muscle cell, cell, muscle organ; edges labeled i and p) next to CL fragment (muscle cell, cell; edges labeled i)]
Obsoleting generic classes in tcAOs
[Figure: FMA fragment (muscle cell, cell, muscle organ; edges labeled i and p) next to CL fragment (muscle cell, cell; edges labeled i)]
Migrating cross-granular relationships
[Figure: FMA fragment (muscle cell, cell, muscle organ; edges labeled i and p) next to CL fragment (muscle cell, cell; edges labeled i)]
“true path” violations
[Figure: FMA fragment (muscle cell, cell, muscle organ; edges labeled i and p), CL fragment (muscle cell, cell; edges labeled i) and FBbt class frontal pulsatile organ muscle (edge labeled i)]
fix
[Figure: the same FMA, CL and FBbt fragments, with the annotation “muscle cell AND part of some human”]
PO: Plants
• Single unified ontology for all plants
  – cell types and gross anatomy
• Generalized from ontology of flowering plants
TAO and ZFA
• Teleost and Zebrafish
Uberon
• Designed to unify existing tcAOs
• Uses modern ontology development techniques
  – heavily axiomatized = less work for humans, leave it to reasoners
    • automated QC
    • automated classification
• Current size: 5k+ classes
• Multiple relationship types
• Links to and from GO, CL
• Aggregate views possible using xrefs maintained in Uberon
Uberon lessons
• Original Design Goals
  – Unify metazoan tcAOs for cross-species phenotype queries
  – Seed initial version from text matching
• Was this a good idea?
  – metazoans are fairly diverse
    • many original dubious grouping classes have been eliminated or split
    • functional grouping classes remain
    • tissues, germ layers, etc. less controversial
    • Uberon is really a vertebrate AO in which we’ve added placeholder metazoan terms
      – labels are misleading
    • high false positive and false negative rates from text matching
    • starting from textbook comparative anatomy knowledge would have been better (given time)