An Introduction to Ensembl Presented By Hilary O. Pavlidis.


An Introduction to Ensembl

Presented By Hilary O. Pavlidis

Objectives

• What is Ensembl?
• Goals of Ensembl
• Genomes and Ensembl
• Ensembl Software
• Ensembl Data Files
• Get in and look at Ensembl
  – Do an example

What is Ensembl?

• Ensembl is one of 3 main systems currently available that annotate and display genomic information
  – Ensembl
    • http://www.ensembl.org
  – UCSC Genome Browser
    • http://genome.ucsc.edu
  – NCBI Genome Browser
    • http://www.ncbi.nlm.nih.gov

An Overview of Ensembl, 2004

What is Ensembl?

• Ensembl is a joint project between 3 organizations
  – EMBL (European Molecular Biology Laboratory)
  – EBI (European Bioinformatics Institute)
  – WTSI (Wellcome Trust Sanger Institute)
    • Provides the primary financial support for this endeavor

Information from http://www.ensembl.org

Ensembl Audience

• Ensembl’s target audience
  – Researchers
    • Want to download small datasets and do sequence similarity searches
  – “Power Users”
    • Doing research that spans classes of genes or certain genomic regions
  – Bioinformaticians
    • Doing bioinformatic research or supporting labs with large data sets

An Overview of Ensembl, 2004

Major Challenges with Genomes

• Scientific challenge of decoding a genome from its nucleotides to a set of functional elements

• Development of software which is capable of storing, manipulating, and evaluating genomes

• Challenge of providing comprehensive and informative access to a large amount of data in a user-friendly way

An Overview of Ensembl, 2004

Goals of Ensembl

• Goals are to provide (from website)
  – Accurate, automatic analysis of genome data
  – Analysis and annotation of current data
  – Presentation of the analysis to all via Web access
  – Distribution of the analysis to other bioinformatics laboratories

Information from http://www.ensembl.org

Genomes and Annotation

• Ensembl does not assemble any genome directly
  – Works with the sequencing centers that generate the genome assembly
• Ensembl provides high-quality annotation for genomes that do not have existing annotation
  – Works alongside the existing annotation for genomes that already have high-quality annotation

An Overview of Ensembl, 2004

Gene Building in Ensembl

• Two-step process
• Targeted Build
  – Aligns all species-specific protein and mRNA information to the genome sequence
• Similarity Build
  – Based on the information obtained from closely related species
  – Aims to further advance the transcript predictions

An Overview of Ensembl, 2004
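The two-step idea on this slide can be sketched in a few lines. This is only a toy under stated assumptions: the `Prediction` record, the `build_genes` function, the coordinates, and the rule that a similarity-based candidate is kept only where no targeted prediction overlaps it are all hypothetical simplifications, not the actual Ensembl gene build.

```python
from typing import List, NamedTuple


class Prediction(NamedTuple):
    start: int
    end: int
    source: str  # "targeted" or "similarity" (labels invented for this sketch)


def overlaps(a: Prediction, b: Prediction) -> bool:
    """True when the two closed intervals share at least one position."""
    return a.start <= b.end and b.start <= a.end


def build_genes(targeted: List[Prediction], similarity: List[Prediction]) -> List[Prediction]:
    # Step 1: accept every prediction backed by same-species evidence.
    genes = list(targeted)
    # Step 2 (assumed rule): add related-species predictions only where
    # nothing accepted so far overlaps them.
    for candidate in similarity:
        if not any(overlaps(candidate, gene) for gene in genes):
            genes.append(candidate)
    return sorted(genes)


targeted = [Prediction(100, 500, "targeted"), Prediction(2000, 2600, "targeted")]
similarity = [Prediction(400, 700, "similarity"), Prediction(1000, 1500, "similarity")]
genes = build_genes(targeted, similarity)
print(genes)
```

In this made-up input, the similarity candidate at 400-700 is dropped because it overlaps a targeted prediction, while the one at 1000-1500 fills a gap.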

Integration and Comparative Genomics

• Genomes are inherently related
• Ensembl provides resources that are capable of taking advantage of this
  – Alignment of sequences between genomes
  – Pairing of orthologous genes between genomes
  – Derivation of long-range blocks of synteny

An Overview of Ensembl, 2004
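To give a feel for how orthologous gene pairs can lead to blocks of synteny, here is a minimal sketch. It assumes each orthologue pair is given as a chromosome name and gene-order index in two genomes; the function name `synteny_blocks`, the `max_gap` parameter, and the example data are hypothetical, and this is not the method Ensembl itself uses.

```python
from typing import List, Tuple

# (chrom_a, index_a, chrom_b, index_b): chromosome and gene-order index of an
# orthologous gene in genome A and in genome B. All values are made up.
Pair = Tuple[str, int, str, int]


def synteny_blocks(pairs: List[Pair], max_gap: int = 2) -> List[List[Pair]]:
    """Group orthologue pairs into runs that stay on the same chromosome pair
    and move by at most max_gap genes in both genomes (either direction in B)."""
    blocks: List[List[Pair]] = []
    for pair in sorted(pairs, key=lambda p: (p[0], p[1])):
        if blocks:
            prev = blocks[-1][-1]
            same_chroms = prev[0] == pair[0] and prev[2] == pair[2]
            step_a = pair[1] - prev[1]
            step_b = abs(pair[3] - prev[3])
            if same_chroms and 0 < step_a <= max_gap and 0 < step_b <= max_gap:
                blocks[-1].append(pair)
                continue
        blocks.append([pair])
    return blocks


example = [
    ("1", 1, "4", 10), ("1", 2, "4", 11), ("1", 3, "4", 12),
    ("1", 4, "7", 50), ("1", 5, "7", 49),
    ("2", 1, "4", 30),
]
blocks = synteny_blocks(example)
print([len(block) for block in blocks])
```

Using `abs` on the step in genome B lets a block run in the reverse orientation, which is a design choice of this sketch rather than something stated on the slide.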

Mammalian Genomes

• Homo sapiens
  – Human
• Pan troglodytes
  – Chimpanzee
• Macaca mulatta
  – Rhesus monkey
• Mus musculus
  – Mouse
• Rattus norvegicus
  – Rat
• Canis familiaris
  – Dog
• Bos taurus
  – Cow
• Monodelphis domestica
  – Opossum
• Pre Ensembl Genomes
  – Dasypus novemcinctus
    • Nine-banded armadillo
  – Loxodonta africana
    • African elephant
  – Echinops telfairi
    • Madagascar hedgehog

Information from http://www.ensembl.org

Other Genomes

• Gallus gallus
  – Chicken
• Xenopus tropicalis
  – Pipid frog
• Danio rerio
  – Zebrafish
• Fugu rubripes
  – Pufferfish
• Tetraodon nigroviridis
  – Tetraodon fish
• Ciona intestinalis and C. savignyi
  – Sea squirt
• Drosophila melanogaster
  – Fruit fly
• Anopheles gambiae
  – Mosquito
• Apis mellifera
  – Honey bee
• Caenorhabditis elegans
  – Roundworm
• Saccharomyces cerevisiae
  – Yeast
• Aedes aegypti
  – Pre Ensembl Genome of Egyptian mosquito

Information from http://www.ensembl.org

Ensembl Software

Ensembl Website Construction

• Datasets and software are updated ten times per year
  – Each update yields a new version number that incorporates the month and year
    • Ensembl v37-Feb 2006 is the current version
• Ensembl now archives previous versions for up to 2 years
  – Started in November of 2004
• Website is written in Perl

Information from http://www.ensembl.org

Ensembl Databases

• Ensembl uses MySQL to store information in relational databases
• 4 Main Databases
  – Ensembl Core Database
  – Ensembl EST Database
  – Ensembl Compara Database
  – Ensembl Variation Database

Information from http://www.ensembl.org
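For readers new to relational storage, the sketch below shows the general idea of genes and transcripts held in linked tables. It is an illustration only: SQLite (in memory) stands in for MySQL so the example runs anywhere, and the table names, column names, and identifiers are invented and do not reproduce the real Ensembl schema.

```python
import sqlite3

# In-memory SQLite stands in for MySQL; tables and columns are simplified
# illustrations, not the real Ensembl core schema.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE gene (
        gene_id INTEGER PRIMARY KEY,
        stable_id TEXT,
        chromosome TEXT
    );
    CREATE TABLE transcript (
        transcript_id INTEGER PRIMARY KEY,
        gene_id INTEGER REFERENCES gene(gene_id),
        stable_id TEXT
    );
    """
)
conn.execute("INSERT INTO gene VALUES (1, 'GENE0001', '1')")
conn.executemany(
    "INSERT INTO transcript VALUES (?, ?, ?)",
    [(1, 1, "TRANS0001"), (2, 1, "TRANS0002")],
)

# One gene can have several transcripts; the join follows the gene_id link.
rows = conn.execute(
    """
    SELECT g.stable_id, t.stable_id
    FROM gene AS g
    JOIN transcript AS t ON t.gene_id = g.gene_id
    ORDER BY t.transcript_id
    """
).fetchall()
print(rows)
```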

Ensembl Databases

• Ensembl also provides APIs
• Application Programme Interfaces (APIs)
  – Serve as a connection between the databases and specific application programs
  – Ensembl has a Perl API and a Java API
    • The Perl API is more “complete” than the Java API

Information from http://www.ensembl.org
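The role an API plays between a database and an application can be sketched as follows. This is written in Python purely for illustration (the real Ensembl APIs are in Perl and Java); the `Gene` and `GeneAdaptor` classes, the table layout, and the identifiers are hypothetical and do not show the actual Ensembl API.

```python
import sqlite3
from dataclasses import dataclass
from typing import Optional


@dataclass
class Gene:
    stable_id: str
    chromosome: str
    start: int
    end: int


class GeneAdaptor:
    """Hypothetical adaptor: the application asks for a gene object and
    never writes SQL itself."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn

    def fetch_by_stable_id(self, stable_id: str) -> Optional[Gene]:
        row = self.conn.execute(
            "SELECT stable_id, chromosome, seq_start, seq_end "
            "FROM gene WHERE stable_id = ?",
            (stable_id,),
        ).fetchone()
        return Gene(*row) if row else None


# Toy database with an invented layout and a single made-up gene.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE gene (stable_id TEXT, chromosome TEXT, "
    "seq_start INTEGER, seq_end INTEGER)"
)
conn.execute("INSERT INTO gene VALUES ('GENE0001', '1', 1000, 5000)")

adaptor = GeneAdaptor(conn)
gene = adaptor.fetch_by_stable_id("GENE0001")
print(gene)
```

The point of the layer is that the database layout can change without the application code changing, as long as the adaptor methods keep their behaviour.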

Ensembl Databases

• Ensembl Core Databases
  – Species-specific Ensembl core databases that store genome sequence and annotation information
    • Gene, transcript, and protein models that are annotated by the Ensembl automated genome analysis
  – Databases also store information about cDNA and protein alignments, as well as external references
    • Ex. NCBI accession number AB012211

Information from http://www.ensembl.org

Ensembl Databases

• Ensembl Compara Database
  – A multi-species database that stores the results of genome-wide species comparisons
  – The comparative genomics dataset provides pairwise whole-genome alignments and synteny regions
  – The comparative proteomics dataset provides orthologue predictions and protein family clusters

Information from http://www.ensembl.org

Ensembl Tools

• 4 Main Tools
  – BioMart
  – Exonerate
  – SSAHA
  – Wise2

Information from http://www.ensembl.org

Ensembl Tools

• BioMart
  – Generic data management system built specifically for use in Ensembl
  – Ensembl will build a BioMart database to provide users with the ability to conduct fast and powerful searches
  – It simplifies the task of integrating external data sets (provided by the user) with the Ensembl databases

Information from http://www.ensembl.org
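The idea of searching annotation and combining it with a user-supplied data set can be sketched as below. Everything here is a stand-in: the records, the `query` function, the filter and attribute names, and the user scores are invented for illustration and are not the BioMart interface.

```python
from typing import Dict, List, Optional

# Made-up annotation records, standing in for what a mart-style table might hold.
genes: List[Dict[str, str]] = [
    {"gene_id": "GENE0001", "chromosome": "1", "biotype": "protein_coding"},
    {"gene_id": "GENE0002", "chromosome": "1", "biotype": "pseudogene"},
    {"gene_id": "GENE0003", "chromosome": "2", "biotype": "protein_coding"},
]

# Made-up user-supplied data set keyed by the same gene identifiers.
user_scores: Dict[str, float] = {"GENE0001": 2.5, "GENE0003": 0.4}


def query(
    records: List[Dict[str, str]],
    filters: Dict[str, str],
    attributes: List[str],
    external: Dict[str, float],
) -> List[Dict[str, Optional[object]]]:
    """Keep records matching every filter, attach the external value when one
    exists, and return only the requested attributes."""
    results = []
    for record in records:
        if all(record.get(key) == value for key, value in filters.items()):
            row: Dict[str, Optional[object]] = {name: record[name] for name in attributes}
            row["user_score"] = external.get(record["gene_id"])
            results.append(row)
    return results


hits = query(genes, {"biotype": "protein_coding"}, ["gene_id", "chromosome"], user_scores)
print(hits)
```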

Ensembl Tools

• Exonerate
  – A tool designed for pairwise sequence comparison
    • Allows for the alignment of sequences using many different models
      – Example: dynamic programming

Information from http://www.ensembl.org
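As an example of what dynamic programming means for alignment, the sketch below computes a global alignment score with the textbook Needleman-Wunsch recurrence. It is a named stand-in, not Exonerate's own code or models, and the scoring values (`match`, `mismatch`, `gap`) are arbitrary assumptions.

```python
def global_alignment_score(
    a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2
) -> int:
    """Best global alignment score of a and b (Needleman-Wunsch, linear gaps)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # Aligning a prefix against nothing costs one gap per character.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    # Each cell takes the best of: align two characters, or gap in either sequence.
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            gap_in_b = score[i - 1][j] + gap
            gap_in_a = score[i][j - 1] + gap
            score[i][j] = max(diagonal, gap_in_b, gap_in_a)

    return score[-1][-1]


print(global_alignment_score("ACGT", "ACT"))
```

The table is filled once, so the cost grows with the product of the two sequence lengths, which is why faster heuristics are used for whole-genome work.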

Ensembl Tools

• SSAHA
  – Sequence Search and Alignment by Hashing Algorithm
  – A tool that provides very fast matching and alignment of DNA sequences
  – Speed is gained from converting sequence information into a hash data structure

Information from http://www.ensembl.org
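The hashing idea can be illustrated with a small k-mer index. This is a simplified sketch, not the SSAHA implementation: it indexes every overlapping k-mer, and the function names, the sequences, and the choice of `k = 4` are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def build_index(sequences: Dict[str, str], k: int) -> Dict[str, List[Tuple[str, int]]]:
    """Map every k-mer to the (sequence name, position) places where it occurs."""
    index: Dict[str, List[Tuple[str, int]]] = defaultdict(list)
    for name, seq in sequences.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((name, pos))
    return index


def find_hits(
    index: Dict[str, List[Tuple[str, int]]], query: str, k: int
) -> List[Tuple[str, int, int]]:
    """Return (sequence name, subject position, query position) per shared k-mer."""
    hits = []
    for qpos in range(len(query) - k + 1):
        for name, spos in index.get(query[qpos:qpos + k], []):
            hits.append((name, spos, qpos))
    return hits


database = {"seqA": "ACGTACGTTT", "seqB": "GGGACGTA"}
index = build_index(database, k=4)
hits = find_hits(index, "TACGT", k=4)
print(hits)
```

Each lookup is a dictionary access instead of a scan of the database. Hits on the same sequence with the same offset (subject position minus query position) point to a longer shared stretch, which is where a full alignment would then be attempted.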

Ensembl Tools

• Wise2
  – Tool that is designed for the comparison of biopolymers
    • Biopolymer examples: DNA and protein sequences
  – Genewise and estwise are algorithms associated with the Wise2 package

Information from http://www.ensembl.org

Ensembl Data

Ensembl Data

• Data Exporting
  – All data generated is free for download via the ftp.ensembl.org site
    • Includes gene sequences and transcript and protein predictions
  – Ensembl provides a dedicated ExportView page
    • Data can be exported in HTML, text, or zipped format
• Data Importing
  – Able to import your own dataset for analysis
  – For “large” personal datasets
    • DAS (Distributed Annotation System) server to help stabilize these datasets

Information from http://www.ensembl.org
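To show what a text export and a zipped export of the same data might look like, here is a minimal sketch. It assumes FASTA as the text layout and gzip as the compression, neither of which is stated on the slide; the `to_fasta` function and the sequences are made up, and nothing is downloaded from Ensembl.

```python
import gzip
from typing import List, Tuple


def to_fasta(records: List[Tuple[str, str]], width: int = 60) -> str:
    """Format (name, sequence) pairs as FASTA text with wrapped sequence lines."""
    lines: List[str] = []
    for name, seq in records:
        lines.append(f">{name}")
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


# Invented identifiers and sequences, for illustration only.
records = [("GENE0001", "ATGGCC" * 15), ("GENE0002", "ATGTAA")]

text = to_fasta(records)                            # plain-text export
compressed = gzip.compress(text.encode("ascii"))    # zipped export of the same text
restored = gzip.decompress(compressed).decode("ascii")
print(text)
```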

Ensembl Data

• Searching Ensembl
  – Search for nucleotide or protein sequences
  – Many available search functions
    • General text search across all available species genomes
    • Search within a specific genome (Gallus gallus)
    • Use external identifiers
      – NCBI accession number: AB012211
    • BLAST search with a gene sequence
      – Your own data or from an external source

Information from http://www.ensembl.org

Ensembl Reference

• T. Hubbard, D. Andrews, M. Caccamo, G. Cameron, Y. Chen, M. Clamp, L. Clarke, G. Coates, T. Cox, F. Cunningham, V. Curwen, T. Cutts, T. Down, R. Durbin, X. M. Fernandez-Suarez, J. Gilbert, M. Hammond, J. Herrero, H. Hotz, K. Howe, V. Iyer, K. Jekosch, A. Kahari, A. Kasprzyk, D. Keefe, S. Keenan, F. Kokocinski, D. London, I. Longden, G. McVicker, C. Melsopp, P. Meidl, S. Potter, G. Proctor, M. Rae, D. Rios, M. Schuster, S. Searle, J. Severin, G. Slater, D. Smedley, J. Smith, W. Spooner, A. Stabenau, J. Stalker, R. Storey, S. Trevanion, A. Ureta-Vidal, J. Vogel, S. White, C. Woodwark and E. Birney. Ensembl 2005. Nucleic Acids Res. 2005 Jan 1;33(Database issue):D447-D453.
  – Available as free full text at: http://nar.oxfordjournals.org/cgi/content/full/33/suppl_1/D447
• When citing, include the current release of Ensembl
  – Current version is Ensembl v37-Feb 2006
• Free full-text access to other scientific publications concerning applications associated with Ensembl
  – http://www.ensembl.org/info/publications.html

Information from http://www.ensembl.org

Let's get familiar with Ensembl and view a chicken example

http://www.ensembl.org

Presentation References

• Ensembl Web Information
  – Available at http://www.ensembl.org
• Ensembl 2005
  – Hubbard et al., 2005. Nucleic Acids Research. January 2005
• An Overview of Ensembl
  – Birney et al., 2004. Genome Research. Available at: http://www.genome.org