Creating Biological Ontologies for Applications
Dr Andrew Gibson
Swammerdam Institute for Life Sciences, Universiteit van Amsterdam
Key Message of Today
An ontology in itself does not do anything until it is part of an application
There are many approaches to creating an ontology
so
Do not create an ontology until you know what it should do
or
You will have many philosophical arguments
You will find it difficult to make your application work
Audience Survey
Who has used an ontology?
What ontologies can you name?
Who has done the Pizza tutorial?
Who has written an ontology?
“An ontology is an explicit specification of a conceptualization”
– Tom Gruber, 1993
What is an ontology?
“Vocabularies or representational terms – classes, relations, functions, object
constants – with agreed-upon definitions, in the form of human readable text and
machine enforceable, declarative constraints on their well formed use”
– Tom Gruber, 1991
What is an ontology?
What’s in an ontology?
Components for humans
  Labels: Terms, Synonyms
  Definitions: Disambiguation
  Metadata: Contributors, Comments, Guidelines
Components for computers
  Format / Syntax
  Classes, properties, instances, datatypes
  Declarations, axioms, restrictions
  (and maybe some other stuff…)
What do these things allow?
Components for humans
  Searching, Browsing, Understanding, Evaluation
Components for computers
  Parsing, Integration, Reasoning
  Inferences, Consistency checking
What things are ontologies?
  Semantic Web: FOAF / Dublin Core
  BioHealth: MESH / NCI
  Controlled Vocabulary: Gene Ontology / OBO
  Medical Ontologies: FMA / Galen
  Tutorial / Example: Pizza
  Data Integration: TAMBIS
  Application: myGrid
  Philosophical: BFO
  Exchange Format: BioPAX
  >10,000 RDF Graphs
All of these ‘ontologies’ are useful…but have very different components to fit their purpose
OWL is ‘cool’
As you know by now, OWL is the W3C standard Web Ontology Language
Most of those ontologies are available as OWL
  Is this always appropriate?
OWL(-DL) is very powerful
  You can include many statements that a reasoner can use to make exciting inferences
  But you don’t have to for an OWL ontology to be useful…
OWL is very flexible
  Can be used to represent many different things
  Often, reasoning is never intended
  Is everything in OWL an ontology?
Pet hate (1); please avoid saying:
“We use OWL because it is a W3C standard”
“[something about interoperability]”
This is actually a valid statement… but
  Some advantages are not guaranteed
  If you say this, please be aware of the definite advantages
  E.g. Interoperable with what?
In Reality…
People sometimes use OWL because
  “we don’t know what else to use”
  “everyone expects us to”
  “everyone else does”
  “there are a lot of tools available for OWL”
  E.g. you can make nice things in RDF, but there isn’t much tool support
Maybe I’m just a cynical scientist
  Using OWL is good, but you should know why
Semantic Web Ontology Goals
http://www.w3.org/TR/webont-req/
Sharing
  “Ontologies should be publicly available and different data sources should be able to commit to the same ontology for shared meaning”
Re-Use
  “Interoperability requires agreements on the definitions of identifiers”
  … but also requires human components
Extension
  “Often, shared ontologies are not sufficient”
  OWL allows you to extend other ontologies to suit your requirements
  See OWL Imports*
Re-use is a general principle
Look around for existing ontologies
  It is common to be asked if you did this
  You can also learn by example
Try and evaluate your options properly
  The Gene Ontology may contain some terms that you want
  It’s fine to conclude that it wasn’t suitable
  Be clear about why you had to make your own ontology
An ontology can be seen as a theory
  Competing theories are not a bad thing
  There will never be any ‘one’ ontology
ALWAYS Re-use:
People: FOAF
Metadata: Dublin Core
Online Communities: SIOC
Vocabularies: SKOS
And more…
Some ‘ontologies’ are small and do not describe ‘domains’
Instead they provide the types and terms for common things in information systems
These are community driven pseudo-standards, and you should ALWAYS reuse these
Pet hate (2); please avoid saying:
“We use OWL-DL because it allows reasoning”
Reasoning is not magic
For reasoning to be effective
  A lot of statements have to be made explicit in the ontology
    And this is a lot of work!
  Instances need to have the appropriate statements
An ontology with no axioms will be consistent and no inferences will be made
Where to start?
Once you have checked for existing ontologies
Work out what you want your application to do
Common applications:
  Text mining
  Data mining
  Small structured data vocabulary
  Data Integration I: An exchange format
  Data Integration II: Annotating data
  Data Integration III: Querying multiple data sources
  Describing experiments
  Automatic classification of some Data
  A pure knowledge model to support an intelligent system (domain ontology)
What next?
Work out which components you need
  Will you be doing any reasoning?
    What is the main goal of the reasoning?
    Inferences? Classification? DL-Querying? Consistency?
  Will you have instance data?
    Do you just need Semantic Types?
    A little Semantics goes a long way
Will your ontology require community involvement?
This will help to show you where to focus your efforts
Community-based aspects
Things to consider before you start:
  Feedback, Argumentation, Versioning, Deprecation, Identifiers, Documentation, Metadata
Also consider SKOS
  A vocabulary in OWL for talking about vocabularies, thesauri etc.
Unilaterally developed ontologies rarely get reused
  But if you have an application to create then you might not want the world to tell you it’s wrong somehow
Knowledge Modelling
As biologists, we are good at:
  Being concise
  Using background knowledge
  Making assumptions
  Disambiguation
  Argumentation
Computers are not good at these things
  They only know what you tell them
  And this can be a lot of work
Identify the core concepts in your domain
  Try to identify areas where ambiguities may cause problems
  Work out the relationship between these things and what your application will do
Modelling ‘The Truth’
An ontology for an application may include assumptions
  For sharing, re-use and extension, assumptions should be explicit in the metadata
  Often this is not the case
It is tempting to start trying to encode everything known about a domain
  Work out what is relevant and useful
  Reasoning performance can be improved with pragmatic modelling
Creating Class Hierarchies in OWL
This is a core step of organising your domain concepts
  But you should consider the semantics!
For computers: Class A subclass of Class B means
  All instances of class A are also instances of class B
For humans:
  Enzyme subclass of Protein
    Every instance of Enzyme is an instance of Protein
    Probably OK, as long as no-one includes ribozymes
  DNA subclass of Molecule
    Not so clear: DNA could also be “stuff”
    If so, what’s an instance of DNA?
    Better to call this class ‘DNA Molecule’
*Philosophical warning*
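The subclass semantics can be pictured with plain sets. This is only a toy sketch: the class extensions and instance names are made up, and a real OWL reasoner works from asserted axioms rather than from enumerated sets.

```python
# Hypothetical toy model: each class is the set of its known instances.
protein = {"hexokinase", "trypsin", "haemoglobin"}
enzyme = {"hexokinase", "trypsin"}


def is_subclass_of(sub, sup):
    """A is a subclass of B when every instance of A is also an instance of B."""
    return sub <= sup


print(is_subclass_of(enzyme, protein))  # True

# Including a ribozyme (a catalytic RNA) as an enzyme breaks the claim.
enzyme_with_ribozyme = enzyme | {"hammerhead_ribozyme"}
print(is_subclass_of(enzyme_with_ribozyme, protein))  # False
```

One counter-example is enough to make "Enzyme subclass of Protein" false, which is why the choice of what counts as an Enzyme matters before the hierarchy is written down.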
More Understanding OWL Semantics
Genes encode proteins
  Something you might say
Gene encodes some Protein
  Something you might put in an ontology
All instances of class Gene encode at least one instance of class Protein
  What the computer understands
  How do I now describe a mutated gene, a silenced gene etc.?
What biologists say is not always what biologists mean
  There are usually exceptions to biological “rules”
  These usually cause disagreements with experts and are hard to resolve
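To see who the "encodes some Protein" statement makes a claim about, here is a small closed-world check over made-up individuals. Note the simplification: an OWL reasoner works under the open world assumption and would instead infer that some unnamed protein exists for every gene. The check only shows which individuals the axiom commits you to.

```python
# Hypothetical individuals; none of these names come from a real ontology.
genes = {"gene_A", "gene_B", "silenced_gene_C"}
encodes = {
    "gene_A": {"protein_X"},
    "gene_B": {"protein_Y", "protein_Z"},
}


def lacking_filler(individuals, relation):
    """Return the individuals with no known value for the relation."""
    return sorted(i for i in individuals if not relation.get(i))


print(lacking_filler(genes, encodes))  # ['silenced_gene_C']
```

If silenced_gene_C is typed as a Gene, the axiom says it encodes a protein too, which may not be what the biologist meant.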
The Balanced Ontologist
User Domain: e.g. Molecular Biology
Information Technology: Standards, Syntax, Databases, Software Engineering
Artificial Intelligence: Formalism, Logic, Reasoning
Knowledge Representation: Ontology tools, Design Patterns, Methodologies
Problems you may encounter
Ontology Comprehension
  Other people understanding your ontology
Knowledge acquisition
  Getting explicit knowledge so you can encode it
Inconsistencies
  Not necessarily bad things
Open World Assumption
Case Study: ComparaGRID
Aims
  Integrate data sources across species boundaries
    Genomic mapping, DNA sequence, Evolutionary relationships, Functional information
  Inform and support genomic mapping
    Particularly in non-model organisms
      Full genome sequencing is expensive and not a priority
      Microarrays not available
Biological Goal
  To map, identify and understand genes behind phenotypes
    Diseases
    Commercially important traits
Integration Strategy: Ontology Mediated
Specify an OWL ontology that captures the semantics of biological data in comparative genomics
The ontology acts as a ‘global schema’ for the heterogeneous data
  We take the data as individuals, assign Classes and properties
The ontology contains OWL Classes that specify the types of data, and relationships between those classes
  The extra encoded knowledge allows the computer to make extra inferences about the data
However, to integrate data, we have to retrieve it from the databases and convert ‘their meaning’ into ‘our meaning’
The ‘Rest’ of the Application
Now need to get the data out of the databases
Step 1: Put a Web Service on the DB
  Handles incoming user queries
  We only want data relevant to what the user is interested in
Step 2: Format conversion
  I.e. converts query results from native DB format to RDF/XML
Step 3: Semantic transformation
  Need a set of rules for each database
Step 4: Integration
  Simple once data is in OWL with the correct semantics
Step 5: Presentation to the user
  Navigation of results and generation of further queries
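Steps 2 to 4 can be sketched with plain tuples standing in for RDF triples. Everything named here is an assumption for illustration: the database prefixes (pigdb, humandb), the shared cg: vocabulary, the field names and the rule tables are invented, and this is not the ComparaGRID implementation.

```python
def rows_to_triples(rows, source):
    """Step 2: format conversion, one (subject, predicate, object) per field."""
    triples = []
    for row in rows:
        subject = f"{source}:{row['id']}"
        for field, value in row.items():
            if field != "id":
                triples.append((subject, f"{source}:{field}", value))
    return triples


def transform(triples, rules):
    """Step 3: semantic transformation, mapping source predicates to shared ones."""
    return [(s, rules.get(p, p), o) for s, p, o in triples]


def integrate(*triple_sets):
    """Step 4: once the semantics agree, integration is a set union."""
    merged = set()
    for triples in triple_sets:
        merged.update(triples)
    return merged


# Hypothetical query results from two databases with different schemas.
pig_rows = [{"id": "m1", "chrom": "4", "pos_cM": 150.0}]
human_rows = [{"id": "h9", "chromosome": "7", "cM": 42.0}]

# One rule set per database, as Step 3 requires.
pig_rules = {"pigdb:chrom": "cg:onChromosome", "pigdb:pos_cM": "cg:geneticPosition"}
human_rules = {"humandb:chromosome": "cg:onChromosome", "humandb:cM": "cg:geneticPosition"}

merged = integrate(
    transform(rows_to_triples(pig_rows, "pigdb"), pig_rules),
    transform(rows_to_triples(human_rows, "humandb"), human_rules),
)
for triple in sorted(merged, key=str):
    print(triple)
```

The per-database rule tables are where the real effort goes. The union in Step 4 is trivial only because Step 3 has already converted ‘their meaning’ into ‘our meaning’.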
Comparative Genomics Use Case
Agribusiness wants to identify the genetic basis of the ‘Tasty Bacon’ trait
Fundamentals of Genetics: Traits and Alleles
Diploid cells contain two copies of each chromosome (homologous pairs)
Alleles are alternative forms of genes
Phenotypic traits are determined by allele combination (genotype)
Complex phenotypic traits are controlled by more than one locus
Fundamentals of Genetics: Genetic Linkage and Distance
‘Genetic markers’ are identifiable features in a genome that exhibit genetic variation in a population
The observation of crossover frequency is used to determine the genetic distance between any two genetic markers
The co-inheritance of markers in a population is used to construct genetic maps
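As a worked example of turning crossover frequency into genetic distance: the slides do not say which mapping function is used, so the sketch below assumes Haldane's function, a common choice that assumes no crossover interference. The offspring counts are made up.

```python
import math


def recombination_fraction(recombinants, total_offspring):
    """Observed fraction of offspring in which two markers were separated."""
    return recombinants / total_offspring


def haldane_distance_cM(r):
    """Haldane's mapping function: distance in centimorgans from fraction r."""
    if not 0 <= r < 0.5:
        raise ValueError("recombination fraction must be in [0, 0.5)")
    return -50.0 * math.log(1.0 - 2.0 * r)


r = recombination_fraction(10, 100)  # hypothetical breeding data
print(round(haldane_distance_cM(r), 2))  # about 11.16 cM
```

For small r the distance is close to 100 * r centimorgans. The correction matters for markers that are further apart, where double crossovers hide some recombination.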
Genetic Maps
Genetic maps are probabilistic
Give an indication of the order and distance between markers
Adding more markers to the analysis can completely alter the map
[Figure: the ‘Actual’ Genotype is turned into a Genetic Map (markers A, C, B, D, E on a scale from 0 to 200) by studying inheritance and statistical analysis]
Fundamentals of Genetics: Conservation of Synteny
“Conservation of (blocks of) gene order throughout chromosomal evolution”
As species evolve and diverge, their chromosomes rearrange through duplications, inversions, translocations etc.
Blocks of genes can be traced through evolutionary history between even relatively divergent species (e.g. chicken and human)
Gene order in these blocks in one species can inform/predict the order and existence of evolutionarily related genes (orthologues) in other species
Technique: QTL Mapping
Quantitative Trait Locus (QTL)
  A region of DNA that is associated with a particular phenotypic trait
QTL Mapping
  Find genetic markers that are significantly more likely to co-occur with the trait than expected by chance
  Identify a region of DNA closely linked to a gene responsible for the trait
Technique: QTL Mapping
Breeding study
Test Population
Measurement of trait values
Genotyping (identify the markers)
Statistical Analysis
QTL Map
Back to the use case…
The ‘Tasty Bacon’ QTL has been genetically mapped
Some genetic factor in this chromosome region makes bacon tasty
[Figure: QTL (Genetic) Map; y-axis: LOD score; x-axis: Genetic Distance on Chromosome (cM), ticks at 150 and 200]
Integration with Pig Genetic Maps
The position of the QTL is correlated on various types of Pig genetic maps
[Figure: ‘Tasty Bacon’ QTL Map shown alongside a Linkage Map and a Radiation Hybrid Map]
Integration with a Human Map
There is a ‘known’ homology between a Pig Marker/Sequence in this region and the human genome
[Figure: panels labelled Human and Pig; maps shown: Linkage Map, Radiation Hybrid Map, Cytogenetic Map, QTL Map; link labelled DNA Sequence Similarity (Homology? Orthology?)]
Integration with Human Sequence Data
A physical map of BAC clones exists for this region of the Human genome
[Figure: as before (Human and Pig panels; Linkage Map, Radiation Hybrid Map, Cytogenetic Map, QTL Map), with Physical Mapping of BAC1, BAC2 and BAC3 added]
Integration with Chicken Data
There are known chicken expressed sequences homologous to Human Gene Sequences in this region
[Figure: as before (Human and Pig panels; Physical Mapping of BAC1, BAC2 and BAC3; Linkage Map, Radiation Hybrid Map, Cytogenetic Map, QTL Map), with a Chicken panel added: EST Library containing EST1 and EST2]
Integration of Gene Expression Data
Gene expression Data for these Chick ESTs might correlate with a trait similar to ‘Tastiness’
[Figure: as before (Human, Pig and Chicken panels; BAC1, BAC2, BAC3; EST1, EST2; Linkage Map, Radiation Hybrid Map, Cytogenetic Map, QTL Map), with Expression Analysis added for the Chicken ESTs]
Finding Relevant Publications
Literature may detail functions of Human genes in this region, and homologies to genes in other species
[Figure: as before (Human, Pig and Chicken panels; BAC1, BAC2, BAC3; EST1, EST2; Expression Analysis; Linkage Map, Radiation Hybrid Map, Cytogenetic Map, QTL Map), with Linked References added]
Problem Overview
Challenges:
  How to discover all this data?
  How to integrate all this data by capturing its meaning?
  How to use this information to discover ‘new’ information, make testable predictions etc?
Approach:
  Use Semantic Web technologies to query, retrieve and represent the data from different data sources
  Computational reasoning to integrate the data:
    If we have a set of facts in the form of data, the reasoner can deduce what else must be true, and fill in gaps in the data
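The idea of a reasoner filling in gaps can be shown with one hand-written rule applied until nothing new appears. This is a minimal forward-chaining sketch, far simpler than an OWL reasoner; the rule and the fact names are assumptions for illustration.

```python
def infer(facts):
    """Apply one rule to a fixpoint:
    (x located_in y) and (y part_of z)  =>  (x located_in z)
    """
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for (x, p, y) in list(facts):
            if p != "located_in":
                continue
            for (y2, q, z) in list(facts):
                if q == "part_of" and y2 == y:
                    new_fact = (x, "located_in", z)
                    if new_fact not in facts:
                        facts.add(new_fact)
                        changed = True
    return facts


# Hypothetical asserted data.
asserted = {
    ("marker_A", "located_in", "region_1"),
    ("region_1", "part_of", "chromosome_4"),
}
inferred = infer(asserted) - asserted
print(inferred)  # {('marker_A', 'located_in', 'chromosome_4')}
```

No data source stated that marker_A is on chromosome 4. The fact follows only because the rule was encoded explicitly.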
Comparative Genomic Data
THE DATA: FACTS AND ASSERTIONS
Maps
  Positions of 'markers' on representations of chromosomes
  Genetic linkage, radiation hybrid, linkage association, cytogenetic, QTLs etc.
Genomic Sequences from model organisms with annotations including gene positions and structures
Other DNA Sequence Databases
  Genes, clones, markers
Gene and Protein Function Databases
Gene Family and Homology Databases
ComparaGRID Ontology
Genetics: Markers and Maps
Genomics: Genomes and Sequences
Comparative Aspects: Evolutionary relationships, Sequence Similarities
Physical entities: Chromosomes, Organisms
Ontology: Models and Representations
We ‘know’ things about chromosomes
We also know different things about rendering information about chromosomes
What relates a genetic map and a DNA sequence of a chromosome is the chromosome itself
[Figure: Chromosome linked to Genetic Map by ‘has model’ / ‘is model of’, and to DNA Sequences by ‘has representation’ / ‘is representation of’]
Genetic Maps – Ontology Modelling
Meaning here is captured by linking lines and blobs up to physical things
[Figure: a genetic map (0, A, C, B, D, E). The Map is model of the (ordering of things on a) chromosome or region; the ‘Line’ is model of a chromosome or region; a ‘Blob’ is model of a (detectable) region]
Uncertainty in Data Integration
Conflicting data
  Almost certain that during integration, conflicting data will arise
  Important to be able to deal with this
Knowledge: A particular gene can only be contained by one chromosome
Conflicting Data:
[Figure: Map of Chromosome 4 (0, A, C, B, D, E) and Map of Chromosome 7 (0, Z, B, T, R); B appears on both]
Ontology Modelling for Uncertainty
We know that gene B cannot actually be on two chromosomes
  This is a fact we can encode in the ontology as knowledge
  Treated in the wrong way, this data would be inconsistent
We know that two experiments have been done that assert the location of gene B
  So… we treat the maps as models and as experimental outcomes
[Figure: Map of Chromosome 4 (0, A, C, B, D, E) and Map of Chromosome 7 (0, Z, B, T, R)]
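One way to picture "maps as experimental outcomes" is to store each map entry as an assertion with its source, and to report conflicts rather than reject the data. The class and the experiment names below are hypothetical. Only the marker letters and chromosome numbers come from the two maps shown.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class MapAssertion:
    """One experimental outcome: an experiment places a gene on a chromosome."""
    gene: str
    chromosome: str
    experiment: str


def conflicting_genes(assertions):
    """Genes asserted on more than one chromosome, with the chromosomes claimed."""
    by_gene = defaultdict(set)
    for a in assertions:
        by_gene[a.gene].add(a.chromosome)
    return {g: sorted(c) for g, c in by_gene.items() if len(c) > 1}


assertions = (
    [MapAssertion(g, "4", "experiment_1") for g in "ACBDE"]
    + [MapAssertion(g, "7", "experiment_2") for g in "ZBTR"]
)
print(conflicting_genes(assertions))  # {'B': ['4', '7']}
```

Both assertions about gene B stay in the data set with their provenance. The knowledge that a gene sits on one chromosome is used to flag the conflict, not to make the whole data set inconsistent.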
Issues in Data Integration
What to do with conflicting data
  Which ‘fact’ do I believe?
  Provenance & Metadata
    Where did it come from
    How was it generated
    How old is it
Disputed data
  Some ‘facts’ are really ‘assertions’
  Homology implies that two genes shared a common ancestor
  This can’t be directly proven, only evidence can be presented
  The ontology needs to reflect this
Similar and Homologous DNA
[Figure: ontology diagram. Classes: DNA_Sequence_Representation (two, each Is_representation_of one of our DNA entities), DNA_Sequence_Pair_Similarity_Measure, Experimental Outcome, Homologous_DNA_Pair_Assertion, Experiment, SequenceSimilarityExperiment, Agent. Relations: Is_a, has_member / Is_member_of, Is_outcome_of, Asserts. Datatypes of E-value etc.]
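The distinction between a measured similarity and an asserted homology can be sketched as two separate record types. This is one reading of the diagram, not the ComparaGRID ontology itself: the field names, the agent, the experiment label and the E-value threshold are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SimilarityMeasure:
    """Outcome of a sequence similarity experiment: something that was measured."""
    sequence_a: str
    sequence_b: str
    e_value: float
    experiment: str


@dataclass(frozen=True)
class HomologyAssertion:
    """A claim made by an agent, backed by evidence rather than proof."""
    sequence_a: str
    sequence_b: str
    asserted_by: str
    evidence: tuple  # the SimilarityMeasure objects supporting the claim


def assert_homology(measure, agent, max_e_value=1e-10):
    """Return an assertion if the evidence passes the agent's threshold, else None."""
    if measure.e_value <= max_e_value:
        return HomologyAssertion(
            measure.sequence_a, measure.sequence_b, agent, (measure,)
        )
    return None


measure = SimilarityMeasure("pig_marker_1", "human_BAC2", 3e-42, "hypothetical_run_1")
claim = assert_homology(measure, agent="curator_1")
print(claim.asserted_by, claim.evidence[0].e_value)
```

Keeping the assertion separate from the measurement means two agents can disagree about homology while sharing the same similarity data.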
Summary
Developing ontologies for applications requires:
  Planning and analysis of the problem
  Components for computers AND humans
  Awareness of Knowledge Representation
  Understanding of OWL semantics
But nothing beats experience
  See you in the practical session
  Go forth and ontologise!
Protein Classification
Expressive ontologies are good at classifying well described individuals
People classify things all the time
  This is time consuming
  They make mistakes
Wolstencroft et al. (2006) studied how well an ontology and a reasoner can classify members of the protein phosphatase family
Protein phosphatase domain structure
The differences in domain architecture of the receptor tyrosine phosphatase subfamily
Red = phosphatase catalytic domain
Blue bar = transmembrane region
Green = immunoglobulin domain
Blue circle = fibronectin domain
Purple = MAM domain
Yellow = carbonic anhydrase domain
Orange = adhesion recognition site
Black = glycosylation
White = cadherin-like domain
The Application
What you need to do this sort of reasoning
  An ontology
    Axioms describing the criteria for classifying each type of protein phosphatase based on protein domains
  Some instances
    The proteins
    Domains in those proteins
      Which domains
      And which not!
      How many
    Not readily available, so an application is needed
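Classification by domain composition can be sketched as rules over domain counts. The rules and class names below are illustrative placeholders, not the criteria used by Wolstencroft et al., and the domain keys are invented. The point is that each instance needs a complete description: which domains, which not, and how many.

```python
def classify_phosphatase(domains):
    """Classify a protein from a dict of domain name -> number of copies.

    A domain missing from the dict is treated as absent, so the description
    of each protein must be complete for the rules to be trustworthy.
    """
    catalytic = domains.get("phosphatase_catalytic", 0)
    transmembrane = domains.get("transmembrane", 0)
    if catalytic == 0:
        return "not a phosphatase"
    if transmembrane > 0:
        return "receptor-type phosphatase"
    return "non-receptor phosphatase"


# Hypothetical instance descriptions.
proteins = {
    "protein_1": {"phosphatase_catalytic": 2, "transmembrane": 1, "fibronectin": 3},
    "protein_2": {"phosphatase_catalytic": 1},
    "protein_3": {"immunoglobulin": 2},
}
for name, domains in proteins.items():
    print(name, "->", classify_phosphatase(domains))
```

In OWL the equivalent of "a missing domain is absent" has to be stated explicitly because of the open world assumption, which is part of why generating the instance descriptions needs an application.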