Creating Biological Ontologies for Applications

Dr Andrew Gibson

Swammerdam Institute for Life Sciences, Universiteit van Amsterdam

Creating Biological Ontologies for Applications

The important bit

Some specific considerations

Key Message of Today

An ontology in itself does not do anything until it is part of an application

There are many approaches to creating an ontology

so

Do not create an ontology until you know what it should do

or

You will have many philosophical arguments
You will find it difficult to make your application work

Audience Survey

Who has used an ontology?

What ontologies can you name?

Who has done the Pizza tutorial?

Who has written an ontology?

“An ontology is an explicit specification of a conceptualization”

– Tom Gruber, 1993

What is an ontology?

“Vocabularies or representational terms – classes, relations, functions, object constants – with agreed-upon definitions, in the form of human readable text and machine enforceable, declarative constraints on their well formed use”

– Tom Gruber, 1991

What is an ontology?

What’s in an ontology?

Components for humans
  Labels: Terms, Synonyms
  Definitions: Disambiguation
  Metadata: Contributors, Comments, Guidelines

Components for computers
  Format / Syntax
  Classes, properties, instances, datatypes
  Declarations, axioms, restrictions
  (and maybe some other stuff…)

What do these things allow?

Components for humans
  Searching, Browsing, Understanding, Evaluation

Components for computers
  Parsing
  Integration
  Reasoning: Inferences, Consistency checking

>10,000 RDF Graphs

BioHealth MESH / NCI

Controlled Vocabulary Gene Ontology / OBO

Medical Ontologies FMA / Galen

Tutorial / Example Pizza

Data Integration TAMBIS

Application myGrid

Philosophical BFO

Exchange Format BioPAX

What things are ontologies?

Semantic Web FOAF / Dublin Core

All of these ‘ontologies’ are useful…but have very different components to fit their purpose

OWL is ‘cool’

As you know by now, OWL is the W3C standard Web Ontology Language

Most of those ontologies are available as OWL
  Is this always appropriate?

OWL(-DL) is very powerful
  You can include many statements that a reasoner can use to make exciting inferences
  But you don’t have to for an OWL ontology to be useful…

OWL is very flexible
  Can be used to represent many different things
  Often, reasoning is never intended
  Is everything in OWL an ontology?

Pet hate (1); please avoid saying:

“We use OWL because it is a W3C standard”
“[something about interoperability]”
This is actually a valid statement… but
  Some advantages are not guaranteed
  If you say this, please be aware of the definite advantages
  E.g. Interoperable with what?

In Reality…

People sometimes use OWL because
  “we don’t know what else to use”
  “everyone expects us to”
  “everyone else does”
  “there are a lot of tools available for OWL”
    E.g. you can make nice things in RDF, but there isn’t much tool support

Maybe I’m just a cynical scientist
  Using OWL is good, but you should know why

Semantic Web Ontology Goals
http://www.w3.org/TR/webont-req/

Sharing
  “Ontologies should be publicly available and different data sources should be able to commit to the same ontology for shared meaning”

Re-Use
  “Interoperability requires agreements on the definitions of identifiers”
  … but also requires human components

Extension
  “Often, shared ontologies are not sufficient”
  OWL allows you to extend other ontologies to suit your requirements
  See OWL Imports*

Re-use is a general principle

Look around for existing ontologies
  It is common to be asked if you did this
  You can also learn by example

Try and evaluate your options properly
  The Gene Ontology may contain some terms that you want
  It’s fine to conclude that it wasn’t suitable
  Be clear about why you had to make your own ontology

An ontology can be seen as a theory
  Competing theories are not a bad thing
  There will never be any ‘one’ ontology

ALWAYS Re-use:

People FOAF

Metadata Dublin Core

Online Communities SIOC

Vocabularies SKOS

And more…

Some ‘ontologies’ are small and do not describe ‘domains’

Instead they provide the types and terms for common things in information systems

These are community driven pseudo-standards, and you should ALWAYS reuse these

Pet hate (2); please avoid saying:

“We use OWL-DL because it allows reasoning”
Reasoning is not magic
For reasoning to be effective
  A lot of statements have to be made explicit in the ontology
    And this is a lot of work!
  Instances need to have the appropriate statements
An ontology with no axioms will be consistent and no inferences will be made

Where to start?

Once you have checked for existing ontologies
Work out what you want your application to do
Common applications:
  Text mining
  Data mining
  Small structured data vocabulary
  Data Integration I – An exchange format
  Data Integration II – Annotating data
  Data Integration III – Querying multiple data sources
  Describing experiments
  Automatic classification of some data
  A pure knowledge model to support an intelligent system (domain ontology)

What next?

Work out which components you need
  Will you be doing any reasoning?
    What is the main goal of the reasoning?
    Inferences? Classification? DL-Querying? Consistency?
  Will you have instance data?
    Do you just need Semantic Types?
    A little Semantics goes a long way
  Will your ontology require community involvement?

This will help to show you where to focus your efforts

Community-based aspects

Things to consider before you start:
  Feedback
  Argumentation
  Versioning
  Deprecation
  Identifiers
  Documentation
  Metadata

Also consider SKOS
  A vocabulary in OWL for talking about vocabularies, thesauri etc.

Unilaterally developed ontologies rarely get reused
  But if you have an application to create then you might not want the world to tell you it’s wrong somehow

Knowledge Modelling

As biologists, we are good at:
  Being concise
  Using background knowledge
  Making assumptions
  Disambiguation
  Argumentation

Computers are not good at these things
  They only know what you tell them
  And this can be a lot of work

Identify the core concepts in your domain
  Try to identify areas where ambiguities may cause problems
  Work out the relationship between these things and what your application will do

Modelling ‘The Truth’

An ontology for an application may include assumptions
  For sharing, re-use and extension, assumptions should be explicit in the metadata
  Often this is not the case

It is tempting to start trying to encode everything known about a domain
  Work out what is relevant and useful
  Reasoning performance can be improved with pragmatic modelling

Creating Class Hierarchies in OWL

This is a core step of organising your domain concepts
  But you should consider the semantics!

For computers:
  Class A subclass of Class B means
    All instances of class A are also instances of class B

For humans:
  Enzyme subclass of Protein
    Every instance of Enzyme is an instance of Protein
    Probably OK, as long as no-one includes ribozymes
  DNA subclass of Molecule
    Not so clear - DNA could also be “stuff”
    If so, what’s an instance of DNA?
    Better to call this class ‘DNA Molecule’

*Philosophical warning*
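A minimal sketch of the "for computers" reading of subclass, in plain Python rather than OWL. The axiom set and the helper names (`SUBCLASS_OF`, `superclasses`, `inferred_types`) are assumptions made for illustration; a real reasoner does far more than this.

```python
# Hypothetical asserted axioms, each a (subclass, superclass) pair.
SUBCLASS_OF = {
    ("Enzyme", "Protein"),
    ("Protein", "Molecule"),
}


def superclasses(cls, axioms):
    """Return every class reachable from cls by following subclass axioms."""
    found = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        for sub, sup in axioms:
            if sub == current and sup not in found:
                found.add(sup)
                stack.append(sup)
    return found


def inferred_types(asserted_type, axioms):
    """All instances of class A are also instances of every superclass of A."""
    return {asserted_type} | superclasses(asserted_type, axioms)


# An individual asserted to be an Enzyme is inferred to be a Protein and a Molecule.
print(sorted(inferred_types("Enzyme", SUBCLASS_OF)))

# With no axioms at all, nothing beyond the asserted type can be inferred.
print(sorted(inferred_types("Enzyme", set())))
```

The second call shows why an axiom-free ontology gives a reasoner nothing to work with: the only type that comes back is the one that was asserted.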

More Understanding OWL Semantics

Genes encode proteins
  Something you might say

Gene encodes some Protein
  Something you might put in an ontology

All instances of class Gene encode at least one instance of class Protein
  What the computer understands

How do I now describe a mutated gene, a silenced gene etc.?

What biologists say is not always what biologists mean
  There are usually exceptions to biological “rules”
  These usually cause disagreements with experts and are hard to resolve
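For reference, "Gene encodes some Protein" can be written out in standard description-logic and first-order notation. The notation is not from the slides; it is the usual reading of an existential ("some") restriction used as a superclass.

```latex
\mathit{Gene} \sqsubseteq \exists\, \mathit{encodes}.\mathit{Protein}
\quad\Longleftrightarrow\quad
\forall x\, \bigl( \mathit{Gene}(x) \rightarrow
  \exists y\, ( \mathit{encodes}(x, y) \wedge \mathit{Protein}(y) ) \bigr)
```

The universal quantifier over x is what makes a silenced or non-coding gene awkward: every individual typed as Gene is committed to encoding at least one Protein.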

The Balanced Ontologist

User Domain
  E.g. Molecular Biology

Information Technology
  Standards, Syntax, Databases, Software Engineering

Artificial Intelligence
  Formalism, Logic, Reasoning

Knowledge Representation
  Ontology tools, Design Patterns, Methodologies

Problems you may encounter

Ontology Comprehension
  Other people understanding your ontology
Knowledge acquisition
  Getting explicit knowledge so you can encode it
Inconsistencies
  Not necessarily bad things
Open World Assumption

Case Study: ComparaGRID

Aims
  Integrate data sources across species boundaries
    Genomic mapping
    DNA sequence
    Evolutionary relationships
    Functional information
  Inform and support genomic mapping
    Particularly in non-model organisms
      Full genome sequencing is expensive and not a priority
      Microarrays not available

Biological Goal
  To map, identify and understand genes behind phenotypes
    Diseases
    Commercially important traits

Integration Strategy: Ontology Mediated

Specify an OWL ontology that captures the semantics of biological data in comparative genomics

The ontology acts as a ‘global schema’ for the heterogeneous data
  We take the data as individuals, assign Classes and properties

The ontology contains OWL Classes that specify the types of data, and relationships between those classes
  The extra encoded knowledge allows the computer to make extra inferences about the data

However, to integrate data, we have to retrieve it from the databases and convert ‘their meaning’ into ‘our meaning’

The ‘Rest’ of the Application

Now need to get the data out of the databases
Step 1: Put a Web Service on the DB
  Handles incoming user queries
  We only want data relevant to what the user is interested in
Step 2: Format conversion
  I.e. converts query results from native DB format to RDF/XML
Step 3: Semantic transformation
  Need a set of rules for each database
Step 4: Integration
  Simple once data is in OWL with the correct semantics
Step 5: Presentation to the user
  Navigation of results and generation of further queries
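Steps 2 to 4 can be sketched in a few lines of Python. This is an illustration only, not the ComparaGRID implementation: plain tuples stand in for RDF triples, and the database names, field names, marker identifiers and `onto:` properties are all hypothetical.

```python
# Hypothetical query results, as two databases might return them after Step 1.
PIG_ROWS = [{"marker_id": "marker_1", "chrom": "2", "pos_cm": 42.0}]
CHICK_ROWS = [{"name": "marker_9", "linkage_group": "5", "cM": 17.5}]


def to_triples(source, rows, key):
    """Step 2: format conversion.

    Each row becomes (subject, predicate, value) triples that still use
    the source database's own field names ('their meaning').
    """
    triples = []
    for row in rows:
        subject = f"{source}:{row[key]}"
        for field, value in row.items():
            if field != key:
                triples.append((subject, f"{source}:{field}", value))
    return triples


# Step 3: one set of rules per database, mapping its fields onto the
# shared ontology's properties ('our meaning'). All names are invented.
RULES = {
    "pigdb": {"pigdb:chrom": "onto:locatedOn", "pigdb:pos_cm": "onto:positionCM"},
    "chickdb": {"chickdb:linkage_group": "onto:locatedOn", "chickdb:cM": "onto:positionCM"},
}


def transform(source, triples):
    """Step 3: semantic transformation using the rules for this database."""
    mapping = RULES[source]
    return [(s, mapping[p], o) for s, p, o in triples if p in mapping]


def integrate(*graphs):
    """Step 4: once the vocabulary is shared, integration is a set union."""
    merged = set()
    for graph in graphs:
        merged.update(graph)
    return merged


pig = transform("pigdb", to_triples("pigdb", PIG_ROWS, "marker_id"))
chick = transform("chickdb", to_triples("chickdb", CHICK_ROWS, "name"))
graph = integrate(pig, chick)
for triple in sorted(graph, key=str):
    print(triple)
```

Most of the effort sits in `RULES`: each new database needs its own mapping, which is why Step 3 is the expensive one and Step 4 is simple.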

The ‘Rest’ of the Integration Strategy

Comparative Genomics Use Case

Agribusiness wants to identify the genetic basis of the ‘Tasty Bacon’ trait

Fundamentals of Genetics: Traits and Alleles

Diploid cells contain two copies of each chromosome (homologous pairs)

Alleles are alternative forms of genes

Phenotypic traits are determined by allele combination (genotype)

Complex phenotypic traits are controlled by more than one locus

Fundamentals of Genetics: Allelic Inheritance: Meiosis

Fundamentals of Genetics: Genetic Linkage and Distance

‘Genetic markers’ are identifiable features in a genome that exhibit genetic variation in a population

The observation of crossover frequency is used to determine the genetic distance between any two genetic markers

The co-inheritance of markers in a population is used to construct genetic maps
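As a worked example of turning crossover frequency into genetic distance, here is a small Python sketch. The slides do not name a mapping function; Haldane's function is used here as one common choice (it assumes crossovers occur independently), and the offspring counts are invented.

```python
import math


def recombination_fraction(recombinants, total_offspring):
    """Fraction of offspring in which a crossover separated the two markers."""
    return recombinants / total_offspring


def haldane_distance_cm(r):
    """Haldane's mapping function: genetic distance in centimorgans.

    r is the recombination fraction and must lie in [0, 0.5).
    """
    if not 0 <= r < 0.5:
        raise ValueError("recombination fraction must be in [0, 0.5)")
    return -50.0 * math.log(1.0 - 2.0 * r)


# Hypothetical breeding study: 12 recombinant offspring out of 100.
r = recombination_fraction(12, 100)
print(f"r = {r:.2f}, distance = {haldane_distance_cm(r):.1f} cM")
```

For small r the distance is close to 100 × r centimorgans; the correction grows as r approaches 0.5, where markers behave as if unlinked.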

Genetic Maps

Genetic maps are probabilistic

Give an indication of the order and distance between markers

Adding more markers to the analysis can completely alter the map

[Figure: studying inheritance, followed by statistical analysis, gives a genetic map (markers A, C, B, D, E on a scale from 0 to 200), shown alongside the ‘actual’ genotype]

Fundamentals of Genetics: Conservation of Synteny

“Conservation of (blocks of) gene order throughout chromosomal evolution”

As species evolve and diverge, their chromosomes rearrange through duplications, inversions, translocations etc.

Blocks of genes can be traced through evolutionary history between even relatively divergent species (e.g. chicken and human)

Gene order in these blocks in one species can inform/predict the order and existence of evolutionarily related genes (orthologues) in other species

Technique: QTL Mapping

Quantitative Trait Locus (QTL)
  A region of DNA that is associated with a particular phenotypic trait

QTL Mapping
  Find genetic markers that are significantly more likely to co-occur with the trait than expected by chance
  Identify a region of DNA closely linked to a gene responsible for the trait

Technique: QTL Mapping

Breeding study

Test Population

Measurement of trait values

Genotyping (Identify the markers)

Statistical Analysis

QTL Map

A QTL Result

Localization of a blood pressure quantitative trait locus (QTL) in rats

Back to the use case…

The ‘Tasty Bacon’ QTL has been genetically mapped

Some genetic factor in this chromosome region makes bacon tasty

[Figure: QTL (genetic) map plotting LOD score against genetic distance on the chromosome (cM), with axis ticks at 150 and 200]

Integration with Pig Genetic Maps

The position of the QTL is correlated on various types of Pig genetic maps

[Figure: the Tasty Bacon region aligned across the pig QTL map, linkage map and radiation hybrid map]

Integration with a Human Map

There is a ‘known’ homology between a Pig Marker/Sequence in this region and the human genome

[Figure: pig QTL map linked by DNA sequence similarity (homology? orthology?) to the human linkage, radiation hybrid and cytogenetic maps]

Integration with Human Sequence Data

A physical map of BAC clones exists for this region of the Human genome

[Figure: human BAC clones (BAC1, BAC2, BAC3) from physical mapping, placed against the human linkage, radiation hybrid and cytogenetic maps and the pig QTL map]

Integration with Chicken Data

There are known chicken expressed sequences homologous to Human Gene Sequences in this region

[Figure: chicken ESTs (EST1, EST2) from an EST library, added to the human BAC clones, the human maps and the pig QTL map]

Integration of Gene Expression Data

Gene expression Data for these Chick ESTs might correlate with a trait similar to ‘Tastiness’

[Figure: chicken expression analysis for EST1 and EST2, added to the human BAC clones, the human maps and the pig QTL map]

Finding Relevant Publications

Literature may detail functions of Human genes in this region, and homologies to genes in other species

[Figure: linked references, added to the chicken expression analysis, the human BAC clones, the human maps and the pig QTL map]

Problem Overview

Challenges:
  How to discover all this data?
  How to integrate all this data by capturing its meaning?
  How to use this information to discover ‘new’ information, make testable predictions etc?

Approach:
  Use Semantic Web technologies to query, retrieve and represent the data from different data sources
  Computational reasoning to integrate the data:
    If we have a set of facts in the form of data – the reasoner can deduce what else must be true, and fill in gaps in the data

Comparative Genomic Data

THE DATA: FACTS AND ASSERTIONS

Maps
  Positions of 'markers' on representations of chromosomes
  Genetic linkage, radiation hybrid, linkage association, cytogenetic, QTLs etc.

Genomic Sequences from model organisms with annotations including gene positions and structures

Other DNA Sequence Databases
  Genes, clones, markers

Gene and Protein Function Databases

Gene Family and Homology Databases

ComparaGRID Ontology

Genetics
  Markers and Maps
Genomics
  Genomes and Sequences
Comparative Aspects
  Evolutionary relationships, Sequence Similarities
Physical entities
  Chromosomes, Organisms

Ontology: Models and Representations

We ‘know’ things about chromosomes
We also know different things about rendering information about chromosomes
What relates a genetic map and a DNA sequence of a chromosome is the chromosome itself

[Figure: Chromosome ‘has model’ Genetic Map (Genetic Map ‘is model of’ Chromosome); Chromosome ‘has representation’ DNA Sequences (DNA Sequences ‘is representation of’ Chromosome)]
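The idea that the chromosome itself is what links a map to a sequence can be shown with a handful of triples. This is only a sketch: the property names follow the slide, but the individuals (`chr4`, `genetic_map_A` and so on) and the helper function are hypothetical.

```python
# Hypothetical individuals linked to chromosomes with the slide's properties.
TRIPLES = {
    ("chr4", "has_model", "genetic_map_A"),
    ("chr4", "has_representation", "dna_sequence_X"),
    ("chr7", "has_model", "genetic_map_B"),
}


def related_via_chromosome(item, triples):
    """Find other models or representations that share a chromosome with item."""
    # Chromosomes that the item is a model or representation of.
    chromosomes = {s for s, _, o in triples if o == item}
    # Everything else attached to those same chromosomes.
    return {o for s, _, o in triples if s in chromosomes and o != item}


# The map and the sequence are never linked directly, only through chr4.
print(related_via_chromosome("genetic_map_A", TRIPLES))
print(related_via_chromosome("genetic_map_B", TRIPLES))
```

Nothing states that `genetic_map_A` and `dna_sequence_X` belong together; the connection falls out of both pointing at the same physical thing.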

Genetic Maps – Ontology Modelling

Meaning here is captured by linking lines and blobs up to physical things

[Figure: a map of markers A, C, B, D, E is a model of (the ordering of things on) a chromosome or region; the ‘line’ is a model of the chromosome or region; each ‘blob’ is a model of a (detectable) region]

Uncertainty in Data Integration

Conflicting data
  Almost certain that during integration, conflicting data will arise
  Important to be able to deal with this
Knowledge: A particular gene can only be contained by one chromosome
Conflicting Data:
  Map of Chromosome 4: 0 A C B D E
  Map of Chromosome 7: 0 Z B T R

Ontology Modelling for Uncertainty

We know that gene B cannot actually be on two chromosomes
  This is a fact we can encode in the ontology as knowledge
  Treated in the wrong way, this data would be inconsistent

We know that two experiments have been done that assert the location of gene B
  So… we treat the maps as models and as experimental outcomes

Map of Chromosome 4: 0 A C B D E
Map of Chromosome 7: 0 Z B T R
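One way to picture "maps as experimental outcomes" is to store each gene location as an assertion that carries its source, and then look for genes whose assertions disagree. The sketch below is in Python rather than OWL, and the class and function names are assumptions; only the marker letters and the two maps come from the slide.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class MapAssertion:
    """A claim made by one experiment, not a fact about the chromosome."""
    gene: str
    chromosome: str
    experiment: str  # provenance: which map made the claim


# The two maps from the slide, recorded as experimental outcomes.
ASSERTIONS = (
    [MapAssertion(g, "4", "map_of_chromosome_4") for g in "ACBDE"]
    + [MapAssertion(g, "7", "map_of_chromosome_7") for g in "ZBTR"]
)


def conflicting_genes(assertions):
    """Return genes that different experiments place on different chromosomes."""
    by_gene = defaultdict(list)
    for claim in assertions:
        by_gene[claim.gene].append(claim)
    return {
        gene: claims
        for gene, claims in by_gene.items()
        if len({c.chromosome for c in claims}) > 1
    }


for gene, claims in conflicting_genes(ASSERTIONS).items():
    sources = sorted(c.experiment for c in claims)
    print(f"gene {gene} is placed on different chromosomes by: {sources}")
```

Because the data are held as claims, the rule "a gene is on only one chromosome" flags a disagreement between experiments instead of making the whole dataset inconsistent.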

Issues in Data Integration

What to do with conflicting data
  Which ‘fact’ do I believe?
  Provenance & Metadata
    Where did it come from
    How was it generated
    How old is it

Disputed data
  Some ‘facts’ are really ‘assertions’
  Homology implies that two genes shared a common ancestor
    This can’t be directly proven, only evidence can be presented
    The ontology needs to reflect this

[Figure: class diagram. Classes: DNA_Sequence_Pair_Similarity_Measure, Experimental Outcome, DNA_Sequence_Representation (two, each Is_representation_of [one of our DNA entities]), Homologous_DNA_Pair_Assertion, Experiment, SequenceSimilarityExperiment, Agent, and datatypes of E-value etc. Relations: Is_a, has_member / Is_member_of, Is_outcome_of, Asserts]

Similar and Homologous DNA

Summary

Developing ontologies for applications requires:
  Planning and analysis of the problem
  Components for computers AND humans
  Awareness of Knowledge Representation
  Understanding of OWL semantics

But nothing beats experience
  See you in the practical session
  Go forth and ontologise!

Thank you for your attention

Protein Classification

Expressive ontologies are good at classifying well described individuals

People classify things all the time
  This is time consuming
  They make mistakes

Wolstencroft et al. (2006) studied how well an ontology and a reasoner can classify members of the protein phosphatase family

Protein phosphatase domain structure

The differences in domain architecture of the receptor tyrosine phosphatase subfamily

Red = phosphatase catalytic domain

Blue bar = transmembrane region

Green = immunoglobulin domain

Blue circle = fibronectin domain

Purple = MAM domain

Yellow = carbonic anhydrase domain

Orange = adhesion recognition site

Black = glycosylation

White = cadherin-like domain

The Application

What you need to do this sort of reasoning
  An ontology
    Axioms describing the criteria for classifying each type of protein phosphatase based on protein domains
  Some instances
    The proteins
    Domains in those proteins
      Which domains
      And which not!
      How many
    Not readily available, so an application is needed
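A rough Python sketch of classification by domain composition, to show why "which domains", "which not" and "how many" all matter. The two class names and their criteria are invented for illustration and are not the published definitions; a DL reasoner would also work under the open world assumption, so the absence of a domain would have to be stated explicitly rather than read off a list as it is here.

```python
from collections import Counter

# Hypothetical criteria: minimum domain counts plus domains that must be absent.
CRITERIA = {
    "ExampleReceptorType": {
        "at_least": {"phosphatase_catalytic": 1, "transmembrane": 1},
        "none_of": set(),
    },
    "ExampleNonReceptorType": {
        "at_least": {"phosphatase_catalytic": 1},
        "none_of": {"transmembrane"},
    },
}


def classify(domains, criteria):
    """Return the names of all classes whose criteria the domain list satisfies."""
    counts = Counter(domains)
    matches = []
    for name, rule in criteria.items():
        has_required = all(counts[d] >= n for d, n in rule["at_least"].items())
        has_forbidden = any(counts[d] > 0 for d in rule["none_of"])
        if has_required and not has_forbidden:
            matches.append(name)
    return sorted(matches)


# Hypothetical proteins described by the domains found in them.
print(classify(["phosphatase_catalytic", "transmembrane", "immunoglobulin"], CRITERIA))
print(classify(["phosphatase_catalytic"], CRITERIA))
print(classify(["immunoglobulin"], CRITERIA))
```

Without the `none_of` condition the first protein would match both classes, which is the point of "And which not!".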