An Introduction to Ensembl Presented By Hilary O. Pavlidis.
Objectives
• What is Ensembl?
• Goals of Ensembl
• Genomes and Ensembl
• Ensembl Software
• Ensembl Data Files
• Get in and look at Ensembl
  – Do an example
What is Ensembl?
• Ensembl is one of three main systems currently available that annotate and display genomic information
  – Ensembl
    • http://www.ensembl.org
  – UCSC Genome Browser
    • http://genome.ucsc.edu
  – NCBI Genome Browser
    • http://www.ncbi.nlm.nih.gov
An Overview of Ensembl, 2004
What is Ensembl?
• Ensembl is a joint project between three organizations
  – EMBL – European Molecular Biology Laboratory
  – EBI – European Bioinformatics Institute
  – WTSI – Wellcome Trust Sanger Institute
    • Provides the primary financial support for this endeavor
Information from http://www.ensembl.org
Ensembl Audience
• Ensembl’s target audience
  – Researchers
    • Want to download small datasets and run sequence similarity searches
  – “Power Users”
    • Doing research that spans classes of genes or certain genomic regions
  – Bioinformaticians
    • Doing bioinformatic research or supporting labs with large data sets
An Overview of Ensembl, 2004
Major Challenges with Genomes
• Scientific challenge of decoding a genome from its nucleotides to a set of functional elements
• Development of software which is capable of storing, manipulating, and evaluating genomes
• Challenge of providing comprehensive and informative access to a large amount of data in a user-friendly way
An Overview of Ensembl, 2004
Goals of Ensembl
• Goals are to provide (from website)
  – Accurate, automatic analysis of genome data
  – Analysis and annotation of current data
  – Presentation of the analysis to all via Web access
  – Distribution of the analysis to other bioinformatics laboratories
Information from http://www.ensembl.org
Genomes and Annotation
• Ensembl does not assemble any genome directly
  – Works with the sequencing centers that generate the genome assembly
• Ensembl provides high-quality annotation for genomes that do not have existing annotation
  – Works with the existing annotation for genomes that already have high-quality annotation
An Overview of Ensembl, 2004
Gene Building in Ensembl
• Two-step process
• Targeted Build
  – Aligns all species-specific protein and mRNA information to the genome sequence
• Similarity Build
  – Based on the information obtained from closely related species
  – Aims to further advance the transcript predictions
An Overview of Ensembl, 2004
Integration and Comparative Genomics
• Genomes are inherently related
• Ensembl provides resources that are capable of taking advantage of this
  – Alignment of sequences between genomes
  – Pairing of orthologous genes between genomes
  – Derivation of long-range blocks of synteny
An Overview of Ensembl, 2004
Mammalian Genomes
• Homo sapiens – Human
• Pan troglodytes – Chimpanzee
• Macaca mulatta – Rhesus monkey
• Mus musculus – Mouse
• Rattus norvegicus – Rat
• Canis familiaris – Dog
• Bos taurus – Cow
• Monodelphis domestica – Opossum
• Pre Ensembl Genomes
  – Dasypus novemcinctus
    • Nine-banded armadillo
  – Loxodonta africana
    • African elephant
  – Echinops telfairi
    • Madagascar hedgehog
Information from http://www.ensembl.org
Other Genomes
• Gallus gallus – Chicken
• Xenopus tropicalis – Pipid frog
• Danio rerio – Zebrafish
• Fugu rubripes – Puffer fish
• Tetraodon nigroviridis – Tetraodon fish
• Ciona intestinalis and C. savignyi – Sea squirt
• Drosophila melanogaster – Fruit fly
• Anopheles gambiae – Mosquito
• Apis mellifera – Honey bee
• Caenorhabditis elegans – Roundworm
• Saccharomyces cerevisiae – Yeast
• Aedes aegypti – Pre Ensembl Genome of Egyptian mosquito
Information from http://www.ensembl.org
Ensembl Website Construction
• Datasets and software are updated ten times per year
  – Each update yields a new version number that incorporates the month and year
    • Ensembl v37-Feb 2006 is the current version
• Ensembl now archives previous versions for up to 2 years
  – Started in November of 2004
• Website is written in Perl
Information from http://www.ensembl.org
Ensembl Databases
• Ensembl uses MySQL to store information in relational databases
• 4 Main Databases
  – Ensembl Core Database
  – Ensembl EST Database
  – Ensembl Compara Database
  – Ensembl Variation Database
Information from http://www.ensembl.org
Ensembl Databases
• Ensembl also utilizes APIs
• Application Programming Interfaces (APIs)
  – Serve as a connection between the databases and specific application programs
  – Ensembl has a Perl API and a Java API
    • Perl API is more “complete” than the Java API
Information from http://www.ensembl.org
Ensembl Databases
• Ensembl Core Databases
  – Species-specific Ensembl core databases that store genome sequence and annotation information
    • Gene, transcript, and protein models that are annotated by the Ensembl automated genome analysis
  – Databases also store information about cDNA and protein alignments, as well as external references
    • Ex. – NCBI accession number AB012211
Information from http://www.ensembl.org
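The idea of a relational core database, where gene models link to their transcripts and to external references, can be illustrated with a small sketch. Everything here is an assumption made for illustration: the table names, column names, and toy rows are hypothetical and are not Ensembl's actual schema, and Python's built-in sqlite3 stands in for MySQL so the example is self-contained. The accession AB012211 is the one quoted on the slide; its pairing with a toy gene is invented.

```python
import sqlite3


def build_toy_core_db():
    """Create an in-memory database with a hypothetical gene/transcript/xref layout."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE gene (
            gene_id INTEGER PRIMARY KEY,
            label   TEXT NOT NULL
        );
        CREATE TABLE transcript (
            transcript_id INTEGER PRIMARY KEY,
            gene_id       INTEGER NOT NULL REFERENCES gene(gene_id),
            label         TEXT NOT NULL
        );
        CREATE TABLE external_ref (
            gene_id   INTEGER NOT NULL REFERENCES gene(gene_id),
            source    TEXT NOT NULL,
            accession TEXT NOT NULL
        );
        """
    )
    # Toy rows only: these labels are placeholders, not real annotation.
    conn.executemany(
        "INSERT INTO gene VALUES (?, ?)",
        [(1, "toy_gene_1"), (2, "toy_gene_2")],
    )
    conn.executemany(
        "INSERT INTO transcript VALUES (?, ?, ?)",
        [
            (10, 1, "toy_transcript_1a"),
            (11, 1, "toy_transcript_1b"),
            (20, 2, "toy_transcript_2a"),
        ],
    )
    # Illustrative pairing of the slide's example accession with a toy gene.
    conn.execute(
        "INSERT INTO external_ref VALUES (?, ?, ?)", (1, "NCBI", "AB012211")
    )
    return conn


def transcripts_for_accession(conn, accession):
    """Follow an external accession to its gene, then to that gene's transcripts."""
    rows = conn.execute(
        """
        SELECT t.label
        FROM external_ref AS x
        JOIN transcript AS t ON t.gene_id = x.gene_id
        WHERE x.accession = ?
        ORDER BY t.label
        """,
        (accession,),
    )
    return [label for (label,) in rows]


if __name__ == "__main__":
    db = build_toy_core_db()
    print(transcripts_for_accession(db, "AB012211"))
```

The join is the point of the sketch: an external identifier becomes a way into the annotation because both tables share a gene key.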
Ensembl Databases
• Ensembl Compara Database
  – Is a multi-species database that stores the results of genome-wide species comparisons
  – The comparative genomic dataset allows for pairwise whole genome alignments and synteny regions
  – The comparative proteomics dataset allows for orthologue predictions and protein family clusters
Information from http://www.ensembl.org
Ensembl Tools
• 4 Main Tools
  – BioMart
  – Exonerate
  – SSAHA
  – Wise2
Information from http://www.ensembl.org
Ensembl Tools
• BioMart
  – Generic data management system built specifically for use in Ensembl
  – Ensembl will build a BioMart database to provide users with the ability to conduct fast and powerful searches
  – It simplifies the task of integrating external data sets (provided by the user) with the Ensembl databases
Information from http://www.ensembl.org
Ensembl Tools
• Exonerate
  – A tool designed for pairwise sequence comparison
    • Allows for the alignment of sequences using many different models
      – Example – Dynamic Programming
Information from http://www.ensembl.org
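Dynamic programming for pairwise alignment can be sketched with the textbook global alignment recurrence (Needleman-Wunsch). This is a generic illustration, not Exonerate's own code or any of its specific models; the function name and the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for the example.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the best global alignment score of strings a and b.

    score[i][j] holds the best score for aligning a[:i] with b[:j].
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # Aligning a prefix against the empty string costs one gap per character.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    for i in range(1, rows):
        for j in range(1, cols):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + substitution,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,               # gap in b
                score[i][j - 1] + gap,               # gap in a
            )
    return score[-1][-1]


if __name__ == "__main__":
    print(global_alignment_score("ACGT", "AGT"))
```

Each cell depends only on its three neighbours, so the table is filled in time proportional to the product of the two sequence lengths.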
Ensembl Tools
• SSAHA
  – Sequence Search and Alignment by Hashing Algorithm
  – A tool that provides very fast matching and alignment of DNA sequences
  – Speed is gained from converting sequence information into a hash data structure
Information from http://www.ensembl.org
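The speed-up from hashing can be sketched as follows: every short word (k-mer) of the reference sequences is stored in a hash table once, so that each word of a query is located by a single lookup instead of a scan. This is a simplified illustration of the general idea, not the SSAHA implementation; the function names, the choice of k, and the toy sequences are assumptions.

```python
from collections import Counter, defaultdict


def build_kmer_index(sequences, k):
    """Map each k-mer to the list of (sequence id, position) where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in sequences.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((seq_id, pos))
    return index


def find_hits(index, query, k):
    """Count k-mer hits per (sequence id, offset between reference and query).

    Hits that share the same offset lie on one diagonal, which is what a
    gap-free match between the query and the reference looks like.
    """
    diagonals = Counter()
    for qpos in range(len(query) - k + 1):
        for seq_id, pos in index.get(query[qpos:qpos + k], ()):
            diagonals[(seq_id, pos - qpos)] += 1
    return diagonals


if __name__ == "__main__":
    # Toy sequences invented for the example.
    reference = {"chrA": "ACGTACGTTTGACCA", "chrB": "GGGGCCCCAAAATTTT"}
    index = build_kmer_index(reference, k=3)
    print(find_hits(index, "TTGACC", k=3).most_common(1))
```

Building the index costs one pass over the reference, after which each query is processed in time roughly proportional to its own length plus the number of hits.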
Ensembl Tools
• Wise2
  – Tool that is designed for the comparison of biopolymers
    • Biopolymer Ex. – DNA and protein sequences
  – Genewise and estwise are algorithms associated with the Wise2 package
Information from http://www.ensembl.org
Ensembl Data
• Data Exporting
  – All data generated is free for download via the ftp.ensembl.org site
    • Includes gene sequences, transcript and protein predictions
  – Ensembl provides a dedicated ExportView page
    • Can be exported into HTML, text or zipped format
• Data Importing
  – Able to import your own dataset for analysis
  – For “large” personal datasets
    • DAS server to help stabilize these datasets
Information from http://www.ensembl.org
Ensembl Data
• Searching Ensembl
  – Search for nucleotide or protein sequences
  – Many available search functions
    • General text search across all available species genomes
    • Search within a specific genome (Gallus gallus)
    • Use external identifiers
      – NCBI accession number – AB012211
    • BLAST search with a gene sequence
      – Your own data or from an external source
Information from http://www.ensembl.org
Ensembl Reference
• T. Hubbard, D. Andrews, M. Caccamo, G. Cameron, Y. Chen, M. Clamp, L. Clarke, G. Coates, T. Cox, F. Cunningham, V. Curwen, T. Cutts, T. Down, R. Durbin, X. M. Fernandez-Suarez, J. Gilbert, M. Hammond, J. Herrero, H. Hotz, K. Howe, V. Iyer, K. Jekosch, A. Kahari, A. Kasprzyk, D. Keefe, S. Keenan, F. Kokocinsci, D. London, I. Longden, G. McVicker, C. Melsopp, P. Meidl, S. Potter, G. Proctor, M. Rae, D. Rios, M. Schuster, S. Searle, J. Severin, G. Slater, D. Smedley, J. Smith, W. Spooner, A. Stabenau, J. Stalker, R. Storey, S. Trevanion, A. Ureta-Vidal, J. Vogel, S. White, C. Woodwark and E. Birney. Ensembl 2005. Nucleic Acids Res. 2005 Jan 1;33 (Database issue):D447-D453.
  – Available as a free full text at: http://nar.oxfordjournals.org/cgi/content/full/33/suppl_1/D447
• Include the current version release of Ensembl when citing
  – Current version is Ensembl v37-Feb 2006
• Free full text access to other scientific publications concerning applications associated with Ensembl
  – http://www.ensembl.org/info/publications.html
Information from http://www.ensembl.org
Presentation References
• Ensembl Web Information
  – Available at http://www.ensembl.org
• Ensembl 2005
  – Hubbard et al., 2005. Nucleic Acids Research. January 2005
• An Overview of Ensembl
  – Birney et al., 2004. Genome Research. Available at: http://www.genome.org