Prosdocimi ucb cdao

Francisco Prosdocimi, Brandon Chisham, Enrico Pontelli, Arlin Stoltzfus, Julie Thompson. Framework for a Comparative Data Analysis Ontology. IGBMC Department Seminar, February 2009, Strasbourg. Linking Evolution and Integrative Biology.

description

CDAO presentation. The idea of the comparative data analysis ontology has been presented worldwide, including at NESCent (USA), IGBMC (France) and UFRJ (Brazil). Providing a semantic framework for high-throughput evolutionary analysis, in the wake of next- and third-generation sequencing, is the way to bring evolution-based studies into genome-wide analysis. The Darwinian core of reasoning also allows CDAO to be used with other entities.

Transcript of Prosdocimi ucb cdao

Page 1: Prosdocimi ucb cdao

Francisco Prosdocimi, Brandon Chisham, Enrico Pontelli, Arlin Stoltzfus, Julie Thompson

Framework for a Comparative Data Analysis Ontology

IGBMC Department Seminar, February 2009, Strasbourg

Linking Evolution and Integrative Biology

Page 2: Prosdocimi ucb cdao

Background

Introduction/ Motivation

Development

Features

Evaluation

Application

Concluding remarks

An explosion in the quantity and quality of data to be analyzed

Nature, 4 September 2008

The Petabyte era (10^15 bytes): a new generation of DNA sequencers is up and running

Genome annotation, protein function and structure prediction, homolog searches, prediction of SNPs, etc.

New tools are needed for the emerging individual-based genomic sciences and medicine: population genomics, pharmacogenomics, evolutionary genomics

Large volumes of new data require large-scale automated analysis: interactome, gene expression, microRNA evolution, etc.

Integrative biology: data mining, analysis and integration

Page 3: Prosdocimi ucb cdao

Powerful tools for evolutionary analysis remain under-utilized and difficult to apply

Today these tools are mainly used in an expert-supervised way, which is time-consuming, difficult to document, error-prone, and not scalable

Need for better documentation of the whole pipeline used for evolutionary analysis

Other Challenges

[Figure: steps of an evolutionary analysis pipeline, each with example tools and settings]

DNA extraction: extraction kits, conditions

Sequencing and base-calling: PCR conditions, sequencer, PHRED

Ortholog searches: BLAST BBH, COGnitor, PSI-BLAST, phylogeny

Multiple alignment: Clustal, T-Coffee, MAFFT, MultAlign

Alignment refinement: manual, Leon, REFINER, HMM

Phylogenetic reconstruction: parsimony, maximum likelihood, PAUP, Phylip

Statistical analysis: bootstrap, jackknife, Bayesian, MCMC

New tools are needed for the automated treatment of high-throughput data

Page 4: Prosdocimi ucb cdao

NESCent

Evo-Info@NESCent: a dozen scientific experts in phylogenetic software development got together to discuss these problems

Need to lower the technology barrier to apply the full force of evolutionary analysis to emerging problem areas (systems biology)

An integrated solution would make use of a combination of technologies, including:

Clear workflow schemas

User-friendly software and web services

Promotion of new databases and data standards

Development of a standard vocabulary to represent evolutionary data: C-DAO

What to do?

http://evoinfo.nescent.org/

Page 5: Prosdocimi ucb cdao

Developing Standards

Standards for standards: formally approved standards are defined by a number of international bodies, such as the W3C

The modern way to standardize knowledge is to create ontologies, which have been successfully applied in a number of other biomedical applications

Standardization of knowledge is a crucial step forward to allow easy communication and data interoperability

Standardization does not remove diversity but does improve connection, documentation, annotation and scalability

Connecting data, connecting people, connecting algorithms

Page 6: Prosdocimi ucb cdao

What is an ontology?

Ontology from philosophy: the study of the nature of being, existence and reality

Ontology and Language: description of concepts (nouns) to describe events and entities in the real world and relations (actions or verbs) to relate these entities

Biomedical ontologies: positive heuristics, a fertile research program

“The positive heuristic of the programme saves the scientist from becoming confused by the ocean of anomalies.”

Imre Lakatos (1922-1974)

“the mathematician is said to speak not about numbers, functions and infinite classes but merely about meaningless symbols and formulas manipulated according to given formal rules”

Rudolf Carnap (1891-1970)

Page 7: Prosdocimi ucb cdao

Huh? What is it again?

A set of terms, and relations between terms, to be used to describe some natural phenomenon

The pizza ontology: defining terms

Relations (verbs) between terms: temMassa (has base), temBorda (has rim), temIngrediente (has ingredient), temTopo (has topping), éMassaDe (is base of), éTopoDe (is topping of)

Terms: Pan, Italiana, recheioCatupiry, recheioQueijo, molhoDeTomate, Calabresa, Presunto, QuatroQueijos, Pimentão, Cebola, Ovo, Frango...

Instantiating the ontology:

MinhaPizza temMassa Pan

MinhaPizza temBorda recheioQueijo

MinhaPizza temIngrediente molhoDeTomate

MinhaPizza temIngrediente Frango

MinhaPizza temTopo Catupiry

Generating new information: nutritional value, price
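The pizza example above can be sketched as a small set of subject-relation-object triples in Python. This is only an illustration, not part of the original slides: the pairing of temMassa with éMassaDe and of temTopo with éTopoDe as inverse relations is an assumption read off the relation names, and the helper names `infer_inverses` and `objects_of` are hypothetical.

```python
# Assumed inverse pairs, inferred from the relation names on the slide.
INVERSE = {"temMassa": "éMassaDe", "temTopo": "éTopoDe"}

# The instance statements from the slide, as (subject, relation, object).
triples = {
    ("MinhaPizza", "temMassa", "Pan"),
    ("MinhaPizza", "temBorda", "recheioQueijo"),
    ("MinhaPizza", "temIngrediente", "molhoDeTomate"),
    ("MinhaPizza", "temIngrediente", "Frango"),
    ("MinhaPizza", "temTopo", "Catupiry"),
}


def infer_inverses(facts, inverse_of):
    """Derive new triples by flipping every relation that has a declared inverse."""
    inferred = set()
    for subject, relation, obj in facts:
        if relation in inverse_of:
            inferred.add((obj, inverse_of[relation], subject))
    return inferred


def objects_of(facts, subject, relation):
    """Return, sorted, every object linked to `subject` by `relation`."""
    return sorted(o for s, r, o in facts if s == subject and r == relation)


new_facts = infer_inverses(triples, INVERSE)
ingredients = objects_of(triples, "MinhaPizza", "temIngrediente")
```

The derived triples were never stated explicitly, which is the sense in which a fixed vocabulary lets software generate new information from an instantiated ontology.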

Page 8: Prosdocimi ucb cdao

An ontology is the creation of a formal language, with terms and relations between terms, that can be instantiated to formally describe events in the real/natural world.

Page 9: Prosdocimi ucb cdao

Gene Ontology

First ontology created in molecular biology, 2000

Consortium for the standardization of gene annotation

Standard vocabulary for describing genes in three categories: biological process, molecular function, cellular localization

Page 10: Prosdocimi ucb cdao

The GO sub-ontologies

Annotation of genomes using the same terms

Effective comparison

Page 11: Prosdocimi ucb cdao

Page 12: Prosdocimi ucb cdao

Beyond the Gene Ontology

OBO Foundry: The Open Biomedical Ontologies

Anatomy ontologies

Page 13: Prosdocimi ucb cdao

GO vs. CDAO

Pre-CDAO ontologies (GO, anatomy, etc.)

Simple semantic relations (is_a, part_of) between the concepts created; descriptive ontology

Relation Ontology: limits the number of relations (verbs) to be used in descriptions

CDAO

Complex semantic relations

An attempt to create a true logical-formal language for describing events

Possibility of making new inferences

Knowledge discovery

Once the data have been annotated according to fixed terms and relations, programs known as reasoners can read the vocabulary and make automatic inferences → Petabyte era
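To make the idea of automatic inference concrete, here is a toy sketch of one thing a reasoner can do with a transitive relation such as part_of: derive statements that were never asserted. It is not how a real OWL reasoner is implemented, and the anatomical terms are illustrative placeholders, not taken from the slides or from any real ontology.

```python
def transitive_closure(pairs):
    """Repeatedly chain (a, b) and (b, d) into (a, d) until nothing new appears."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


# Asserted part_of statements (illustrative placeholders only).
asserted_part_of = {("finger", "hand"), ("hand", "arm"), ("arm", "body")}

# Everything in the closure that was not asserted is an inferred statement.
inferred = transitive_closure(asserted_part_of) - asserted_part_of
```

Production reasoners handle far richer logic (class restrictions, cardinality, consistency checking), but the principle is the same: fixed terms and relations make the data machine-interpretable.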

Page 14: Prosdocimi ucb cdao

MIAPA integration

MIAME - Minimum Information About a Microarray Experiment (Nat Genet. 2001): formal documentation of the minimum information needed to reproduce the experiment

MIAPA - Minimum Information About a Phylogenetic Analysis (OMICS, 2006)

Page 15: Prosdocimi ucb cdao

Algorithm for CDAO

IF Petabyte era, big data

AND

Non-scalability of modern evolutionary analysis

AND

Science as language creation

AND

We know the standards to create standards

AND

The biomedical community knows how to use ontologies (GO)

THEN

We are going to create this evolutionary ontology and help people to use and talk about evolution!

However...

Page 16: Prosdocimi ucb cdao

“Nothing in biology makes sense except in the light of evolution”

T. Dobzhansky (1900-1975)

The central role of Evolutionary biology

Every single data collection made in biology can be viewed from an evolutionary perspective

CDAO must be able to represent virtually any data collection in the whole field of biology under an evolutionary perspective! From biochemistry to zoology, genetics to botany, genomics to ecology, microbiology to development, physiology and medicine and so on…

And... there are controversies among scholars...

What is a species? What is an OTU? Should evolutionary characters be homologous? Darwin’s selectionism or Kimura’s neutralism? Gradualism or punctuated equilibrium? Phenetics or cladistics? Parsimony or likelihood?

Evolution as the core

Both phenetic and cladistic data are supported in C-DAO

Page 17: Prosdocimi ucb cdao

Aimed at formalizing the structure of knowledge about evolutionary analysis

1. To represent both the data and the objective classification (tree) of the compared entities, the methods used in the analysis and relevant information

2. To map the stepwise history of evolution, including a chronicle of character-modification events

3. To make biological inferences about the present (propagating knowledge)

4. To cope with the different views and paradigms applied in modern evolutionary biology

Page 18: Prosdocimi ucb cdao

1 Specification – Use cases: protein family alignment, modelling character evolution, functional inference, human variation, Bayesian supertrees, determining concordance between two or more phylogenies, estimating divergence times, determining the genome-wide distribution of Ks (silent site substitutions), tree reconciliation (orthology analysis), etc.

2 Representation

3 Conceptualization: define the concepts; define the relations between concepts (semantics); define numeric restrictions

4 Implementation

5 Evaluation

Back to step 3: Reconceptualization

Page 19: Prosdocimi ucb cdao

Page 20: Prosdocimi ucb cdao

Data integration

Data representation

Page 21: Prosdocimi ucb cdao

Page 22: Prosdocimi ucb cdao

Evaluation

Translation of real test-cases represented in NEXUS files into C-DAO instances

C-DAO internal format

<cdao:Node rdf:ID="inode15">
  <cdao:part_of rdf:resource="#Tree_con_50_majrule"/>
  <cdao:belongs_to_Edge rdf:resource="#edge_inode15_inode14"/>
  <cdao:belongs_to_Edge rdf:resource="#edge_Athaliana_CAB79970_inode15"/>
  <cdao:belongs_to_Edge rdf:resource="#edge_Athaliana_AAD31363_inode15"/>
  <cdao:belongs_to_Edge_as_Child rdf:resource="#edge_inode15_inode14"/>
  <cdao:belongs_to_Edge_as_Parent rdf:resource="#edge_Athaliana_CAB79970_inode15"/>
  <cdao:belongs_to_Edge_as_Parent rdf:resource="#edge_Athaliana_AAD31363_inode15"/>
  <cdao:nca_node_of rdf:resource="#set_nca_44"/>
</cdao:Node>

<cdao:Directed_Edge rdf:ID="edge_Athaliana_CAB79970_1_inode15">
  <cdao:part_of rdf:resource="#Tree"/>
  <cdao:has_Parent_Node rdf:resource="#node_inode15"/>
  <cdao:has_Child_Node rdf:resource="#node_Athaliana_CAB79970_1"/>
  <cdao:has_Annotation rdf:resource="#edge_Athaliana_CAB79970_1_inode15_length"/>
</cdao:Directed_Edge>
<cdao:Edge_Length rdf:ID="edge_Athaliana_CAB79970_1_inode15_length">
  <cdao:has_Value rdf:datatype="&xsd;float">0.009539</cdao:has_Value>
</cdao:Edge_Length>
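As a rough illustration of how such instances could be consumed by software, the sketch below reads the Directed_Edge and Edge_Length fragment with Python's standard `xml.etree.ElementTree`. Several things here are assumptions rather than part of the slides: the enclosing `rdf:RDF` element, the placeholder namespace URI used for the `cdao` prefix, the `&xsd;` entity written out in full (a standalone parser would otherwise need a DOCTYPE declaring it), and the helper name `read_edges`.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
CDAO = "http://example.org/cdao#"  # placeholder URI, NOT the real CDAO namespace

# The edge fragment from the slide, wrapped in an assumed rdf:RDF root element.
DOC = f"""<rdf:RDF xmlns:rdf="{RDF}" xmlns:cdao="{CDAO}">
  <cdao:Directed_Edge rdf:ID="edge_Athaliana_CAB79970_1_inode15">
    <cdao:part_of rdf:resource="#Tree"/>
    <cdao:has_Parent_Node rdf:resource="#node_inode15"/>
    <cdao:has_Child_Node rdf:resource="#node_Athaliana_CAB79970_1"/>
    <cdao:has_Annotation rdf:resource="#edge_Athaliana_CAB79970_1_inode15_length"/>
  </cdao:Directed_Edge>
  <cdao:Edge_Length rdf:ID="edge_Athaliana_CAB79970_1_inode15_length">
    <cdao:has_Value rdf:datatype="http://www.w3.org/2001/XMLSchema#float"> 0.009539 </cdao:has_Value>
  </cdao:Edge_Length>
</rdf:RDF>"""


def read_edges(xml_text):
    """Return a list of (parent, child, length) tuples, one per Directed_Edge."""
    ns = {"rdf": RDF, "cdao": CDAO}
    rdf_id = f"{{{RDF}}}ID"
    rdf_resource = f"{{{RDF}}}resource"
    root = ET.fromstring(xml_text)

    # Index every Edge_Length annotation by its rdf:ID.
    lengths = {}
    for el in root.findall("cdao:Edge_Length", ns):
        lengths[el.get(rdf_id)] = float(el.find("cdao:has_Value", ns).text)

    edges = []
    for edge in root.findall("cdao:Directed_Edge", ns):
        parent = edge.find("cdao:has_Parent_Node", ns).get(rdf_resource).lstrip("#")
        child = edge.find("cdao:has_Child_Node", ns).get(rdf_resource).lstrip("#")
        annotation = edge.find("cdao:has_Annotation", ns).get(rdf_resource).lstrip("#")
        edges.append((parent, child, lengths.get(annotation)))
    return edges


edges = read_edges(DOC)
```

A dedicated RDF library would be the more robust choice for real data; plain XML parsing is used here only to keep the sketch dependency-free.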

http://www.cs.nmsu.edu/~bchisham/ontology/test_results/

Page 23: Prosdocimi ucb cdao

And so far, CDAO...

Allows the representation of large datasets (syntax, data representation)

Allows different anomalous datasets to be combined (data integration)

Provides strict concepts so that researchers speak in a standard vocabulary (avoids a Tower of Babel problem)

Allows logical inferences and knowledge propagation to be made automatically (semantics)

1. If TU1 has_annotation == GO:0006260
2. If TU2 has_annotation == ""
3. If TU3 has_annotation == GO:0006260
4. If TU1, TU2 and TU3 form a monophyletic clade
THEN TU2 has_annotation = GO:0006260

[Figure: tree with TU1, TU3 and TU2 at the tips and nodes AN1 and AN2]
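The propagation rule on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not CDAO's actual reasoning machinery: the tree topology (AN1 joining TU1 and TU3, AN2 joining AN1 and TU2) is a guess, since the figure did not survive extraction, and the function names `leaves_under` and `propagate` are hypothetical. The rule implemented is: if every annotated tip in a clade carries the same term, copy that term to the unannotated tips.

```python
def leaves_under(children, node):
    """Collect the tips (nodes without children) descending from `node`."""
    kids = children.get(node, [])
    if not kids:
        return [node]
    tips = []
    for kid in kids:
        tips.extend(leaves_under(children, kid))
    return tips


def propagate(children, clade_root, annotations):
    """Copy a unanimous annotation to unannotated tips of the clade at `clade_root`."""
    tips = leaves_under(children, clade_root)
    known = {annotations[t] for t in tips if annotations.get(t)}
    result = dict(annotations)  # leave the input untouched
    if len(known) == 1:  # all annotated tips agree on a single term
        (term,) = known
        for tip in tips:
            if not result.get(tip):
                result[tip] = term
    return result


# Assumed topology: AN2 is the root of the clade containing all three TUs.
children = {"AN2": ["AN1", "TU2"], "AN1": ["TU1", "TU3"]}
annotations = {"TU1": "GO:0006260", "TU2": "", "TU3": "GO:0006260"}

inferred = propagate(children, "AN2", annotations)
```

If the annotated tips disagreed, the function would return the annotations unchanged; a more careful treatment would record the inferred term as a prediction rather than as an observed annotation.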

Page 24: Prosdocimi ucb cdao

Future Challenges

Verify the usability of the ontology by evolutionary biologists

Development of new tools for data format conversion

Integrate C-DAO into a generic workflow of evolutionary biology software (Arlin Stoltzfus)

Integrate CDAO with other ontologies (MAO, SO, AA, anatomy) for specific applications

Expand terms and concepts to allow a broader representation of evolutionary and comparative data

Page 25: Prosdocimi ucb cdao

Conclusions

C-DAO is a prototype for a well-annotated ontology providing representation of key concepts in evolutionary analysis, such as:

Phylogenetic trees of entities-to-be-compared

Character-state data representing the attributes of entities

Methodological annotation of the procedures used in the analysis (integration with MIAPA)

Evolutionary changes in characters over time

It aims to facilitate communication, annotation, program interoperability, data integration and automated analysis of large-scale evolutionary datasets

http://sourceforge.net/projects/cdao

Page 26: Prosdocimi ucb cdao

Publications

Page 27: Prosdocimi ucb cdao

Acknowledgements

Jonathan Eisen, Joe Felsenstein, Mark Holder, John Huelsenbeck, Sergei L. Kosakovsky Pond, Sudhir Kumar, Paul O. Lewis, Aaron Mackey, David Maddison, Wayne Maddison, Weigang Qiu, Andrew Rambaut, Arlin Stoltzfus, David L. Swofford, Rutger Vos, Xuhua Xia, Christian Zmasek

UC Davis Genome Center, UC Davis, CA
Department of Genome Sciences/ Biology, Seattle, WA
School of Computational Science, FSU, Tallahassee, FL
University of California, San Diego, CA
Antiviral Research Center, UC, San Diego, CA
Center for Evolutionary Functional Genomics, Tempe, AZ
University of Connecticut, Storrs, CT
GlaxoSmithKline, King of Prussia, PA
Department of Entomology, UA, Tucson, AZ
Departments of Zoology and Botany, UBC, Vancouver, BC
Department of Biological Sciences, HCCUNY, New York, NY
Zoology Department, University of Oxford, Oxford, UK
Institute of Evolutionary Biology, UE, Edinburgh, UK
School of Computational Science, FSU, Tallahassee, FL
University of British Columbia, Vancouver, BC (Canada)
Biology Department, University of Ottawa, Ottawa, ON
Burnham Institute for Medical Research, La Jolla, CA

https://www.nescent.org/wg_evoinfo/

Evo-info working group

EvolHHuPro/LBGI working group: Pierre Pontarotti, Elodie Darbo, Philippe Gouret, Olivier Poch and LBGI members

Graduate Program in Genomic Sciences and Biotechnology - UCB

Page 28: Prosdocimi ucb cdao

Julie Thompson, Enrico Pontelli, Brandon Chisham

Arlin Stoltzfus

Visit our web-page at

http://evolutionaryontology.org

Dr. Francisco Prosdocimi – [email protected]

Francisco Prosdocimi

Page 29: Prosdocimi ucb cdao

CDAO meeting

August, 2009

Las Cruces, New Mexico