Ensembl Tools - European Bioinformatics Institute · Annotation by Ensembl in collaboration with...
EBI is an Outstation of the European Molecular Biology Laboratory.
Ensembl Tools
Questions?
• We’ve muted all the mics
• Ask questions in the Chat box in the webinar interface
• I will check the Chat box periodically for questions
• There’s no threading so please respond with @name
Objectives
• What is Ensembl?
• What tools are available in Ensembl?
• How to use the online tools in Ensembl.
• Where to go for help and documentation.
Overview
• Introduction to Ensembl
• BLAST/BLAT
• Sequence searching
• Assembly Converter
• Convert files between genome assemblies
• Data Slicer
• Pull out sections of VCF and BAM files
• File Chameleon
• Custom download of reference files for NGS analysis
• Variant Effect Predictor (VEP)
• Analyse your own variants
Introduction
Why do we need genome browsers?
1977: 1st genome to be sequenced (5 kb)
2004: finished human sequence (3 Gb)
CGGCCTTTGGGCTCCGCCTTCAGCTCAAGACTTAACTTCCCTCCCAGCTGTCCCAGATGACGCCATCTGAAATTTCTTGGAAACACGATCACTTTAACGGAATATTGCTGTTTTGGGGAAGTGTTTTACAGCTGCTGGGCACGCTGTATTTGCCTTACTTAAGCCCCTGGTAATTGCTGTATTCCGAAGACATGCTGATGGGAATTACCAGGCGGCGTTGGTCTCTAACTGGAGCCCTCTGTCCCCACTAGCCACGCGTCACTGGTTAGCGTGATTGAAACTAAATCGTATGAAAATCCTCTTCTCTAGTCGCACTAGCCACGTTTCGAGTGCTTAATGTGGCTAGTGGCACCGGTTTGGACAGCACAGCTGTAAAATGTTCCCATCCTCACAGTAAGCTGTTACCGTTCCAGGAGATGGGACTGAATTAGAATTCAAACAAATTTTCCAGCGCTTCTGAGTTTTACCTCAGTCACATAATAAGGAATGCATCCCTGTGTAAGTGCATTTTGGTCTTCTGTTTTGCAGACTTATTTACCAAGCATTGGAGGAATATCGTAGGTAAAAATGCCTATTGGATCCAAAGAGAGGCCAACATTTTTTGAAATTTTTAAGACACGCTGCAACAAAGCAGGTATTGACAAATTTTATATAACTTTATAAATTACACCGAGAAAGTGTTTTCTAAAAAATGCTTGCTAAAAACCCAGTACGTCACAGTGTTGCTTAGAACCATAAACTGTTCCTTATGTGTGTATAAATCCAGTTAACAACATAATCATCGTTTGCAGGTTAACCACATGATAAATATAGAACGTCTAGTGGATAAAGAGGAAACTGGCCCCTTGACTAGCAGTAGGAACAATTACTAACAAATCAGAAGCATTAATGTTACTTTATGGCAGAAGTTGTCCAACTTTTTGGTTTCAGTACTCCTTATACTCTTAAAAATGATCTAGGACCCCCGGAGTGCTTTTGTTTATGTAGCTTACCATATTAGAAATTTAAAACTAAGAATTTAAGGCTGGGCGTGGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCGAGGTGGGCGGATCACTTGAGGCCAGAAGTTTGAGACCAGCCTGGCCAACATGGTGAAACCCTATCTCTACTAAAAATACAAAAAATGTGCTGCGTGTGGTGGTGCGTGCCTGTAATCCCAGCTACACGGGAGGTGGAGGCAGGAGAATCGCTTGAACCCTGGAGGCAGAGGTTGCAGTGAGCCAAGATCATGCCACTGCACTCTAGCCTGGGCCACATAGCATGACTCTGTCTCAAAACAAACAAACAAACAAAAAACTAAGAATTTAAAGTTAATTTACTTAAAAATAATGAAAGCTAACCCATTGCATATTATCACAACATTCTTAGGAAAAATAACTTTTTGAAAACAAGTGAGTGGAATAGTTTTTACATTTTTGCAGTTCTCTTTAATGTCTGGCTAAATAGAGATAGCTGGATTCACTTATCTGTGTCTAATCTGTTATTTTGGTAGAAGTATGTGAAAAAAAATTAACCTCACGTTGAAAAAAGGAATATTTTAATAGTTTTCAGTTACTTTTTGGTATTTTTCCTTGTACTTTGCATAGATTTTTCAAAGATCTAATAGATATACCATAGGTCTTTCCCATGTCGCAACATCATGCAGTGATTATTTGGAAGATAGTGGTGTTCTGAATTATACAAAGTTTCCAAATATTGATAAATTGCATTAAACTATTTTAAAAATCTCATTCATTAATACCACCATGGATGTCAGAAAAGTCTTTTAAGATTGGGTAGAAATGAGCCACTGGAAATTCTAATTTTCATTTGAAAGTTCACATTTTGTCATTGACAACAAACTGTTTTCCTTGCAGCAACAAGATCACTTCATTGATTTGTGAGAAAATGTCTACCAAATTATTTAAGTTGAAATAACTTTGTCAGCTGTTCTTTCAAGTAAAAATGACTTTTCATTGAAAAAATTGC
TTGTTCAGATCACAGCTCAACATGAGTGCTTTTCTAGGCAGTATTGTACTTCAGTATGCAGAAGTGCTTTATGTATGCTTCCTATTTTGTCAGAGATTATTAAAAGAAGTGCTAAAGCATTGAGCTTCGAAATTAATTTTTACTGCTTCATTAGGACATTCTTACATTAAACTGGCATTATTATTACTATTATTTTTAACAAGGACACTCAGTGGTAAGGAATATAATGGCTACTAGTATTAGTTTGGTGCCACTGCCATAACTCATGCAAATGTGCCAGCAGTTTTACCCAGCATCATCTTTGCACTGTTGATACAAATGTCAACATCATGAAAAAGGGTTGAAAAAAGGAATATTTTAATAGTTTTCAGTTACTTT
We need to make the data mean something…
http://www.ensembl.org
http://www.ncbi.nlm.nih.gov/mapview
http://genome.ucsc.edu
Ensembl Features
• Gene builds for ~70 species
• Gene trees
• Regulatory build (ENCODE)
• Variation display and VEP
• Display of user data
• BioMart (data export)
• Programmatic access via the APIs
• Completely Open Source
Access scales
Whole genome
Groups
One by one
Main browser
Mobile site
BioMart
REST API
VEP
Perl API
MySQL
FTP
Vertebrate species on Ensembl
Image obtained using Dendroscope:
Dendroscope 3: An interactive tool for rooted phylogenetic trees and networks
D.H. Huson and C. Scornavacca, Systematic Biology, 2012
Non-vertebrates on Ensembl genomes
Fungi
Bacteria
Plants
Protists
Metazoa
www.ensemblgenomes.org
Ensembl and Ensembl Genomes
Released: Ensembl 2000; Ensembl Genomes 2009
Species: Ensembl covers vertebrates (fly, worm and yeast as outgroups); Ensembl Genomes covers non-vertebrates (protists, plants, fungi, metazoa, bacteria)
Annotation: Ensembl by Ensembl; Ensembl Genomes in collaboration with the scientific communities
URL: Ensembl www.ensembl.org; Ensembl Genomes www.ensemblgenomes.org
Release cycle
Release 89: May 2017
Release 90: July 2017
2-3 months between releases
• New genome assemblies
• Updated variation data
• Updated regulation data
• New/updated interfaces
• Updated gene sets
• Compara on new genes and genomes
• Underlying software updates
Ensembl Tools
Tools allow:
• Interpretation and processing of your own data
• Custom download of Ensembl data for further analysis
BLAST/BLAT for sequence searching
• Find Ensembl sequences that match your sequence using BLAST/BLAT
• Search:
  • Nucleotide sequences
  • Protein sequences
  • Short sequences (e.g. primers, morpholinos, siRNAs)
• Search against:
  • Genomic sequences
  • cDNA sequences
  • Protein sequences
Hands on – BLAST/BLAT
• I’ve designed a pair of primers for RT-PCR against human BRCA2
• I want to make sure they don’t have any non-specific hits that will mess up my RT-PCR results
• The sequences are:
>fwd
GAGGACTCCTTATGTCCAAATTT
>rev
GAGAATCAGCTTCTGGGGTAATAA
Assembly converter
• You have data mapped to an old genome assembly
• You want to update your data to map it to a new one
What is a genome assembly?
Sequence reads
CGGCCTTTGGGCTCCGCCTTCAGCTCAAGA
TCCGCCTTCAGCTCAAGACTTAACTTC
GGGCTCCGCCTTCAGCTC
ACTTAACTTCCCTCCCAGCTGTCC
AACTTCCCTCCCAGCT
TCCCAGCTGTCCCAGATGACGCCATC
CAGATGACGCC
CAGCTGTCCCAGATGAC
CGGCCTTTGGGCTCC
Match up overlaps
Genome assembly
CGGCCTTTGGGCTCCGCCTTCAGCTCAAGACTTAACTTCCCTCCCAGCTGTCCCAGATGACGCCATC
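The "match up overlaps" step can be illustrated with a minimal Python sketch. It only joins two reads by their longest shared end; real assemblers also handle sequencing errors, repeats and reverse strands, so treat this as a toy under those assumptions. The function names are made up for illustration.

```python
def longest_overlap(left: str, right: str) -> int:
    """Length of the longest suffix of `left` that is also a prefix of `right`."""
    for size in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge_reads(left: str, right: str) -> str:
    """Join two reads, writing the shared overlap only once."""
    return left + right[longest_overlap(left, right):]


# Two of the sequence reads shown on the slide.
read_a = "CGGCCTTTGGGCTCC"
read_b = "GGGCTCCGCCTTCAGCTC"

merged = merge_reads(read_a, read_b)
print(longest_overlap(read_a, read_b), merged)
```

The merged string is the start of the assembled sequence on the slide. Repeating the merge across all reads is what produces a contig.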
Genome contigs
[Figure: genome contigs BL102, AL476, CM553 and IM768]
Reference alleles
[Figure: genome contigs BL102, AL476, CM553 and IM768, with the sequence of contig BL102 shown below]
BL102
AGTCGTAGCTAGCTAGGCCATAGGCGA
Frequency T = 0.05, frequency G = 0.95
G is the allele in all primates
T causes disease susceptibility
Perhaps G should be the reference allele?
We can replace the region with a new contig
Genome Gaps
[Figure: genome contigs BL102, AL476, CM553 and IM768, with a gap in the genome]
Gap in the genome caused by:
• Poor sequencing at this region
• No contig was ever cloned
We can fill in the gap with a new contig
Incorrectly assembled contigs
[Figure: genome contigs BL102, AL476, CM553 and IM768, shown in two different orders]
New genome assemblies
• Fixing errors in the genome produces a new genome assembly
• New genome assemblies mean re-mapping of all genome features
• Ensembl will stop updating the old assembly when a new one is brought in
• You’ve got data mapped to the old assembly and you want to compare to the up-to-date Ensembl annotation
Assembly converter
• Converts genome coordinates to a different genome assembly.
• Works with:
  • BED (simple coordinates)
  • GFF (gene, transcript and exon coordinates)
  • GTF (gene, transcript and exon coordinates)
  • WIG (values plotted against the genome)
  • VCF (variants)
Hands-on – Assembly converter
• We’re going to convert a small BED file from the human genome assembly GRCh37 to the more recent GRCh38
• BED is a simple features format which lists the start and end coordinate of the feature.
5 36821734 37091336 P1
5 36731578 36978408 P2
5 36908654 37108773 P3
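For anyone who prefers a script to the web form, the same conversion can be sketched against the Ensembl REST API's map endpoint. This is a hedged sketch, not the Assembly Converter itself: the helper names are made up, and it assumes the BED starts are 0-based (as the BED format defines them) while the REST endpoint takes 1-based coordinates. The network call is kept in its own function so the rest runs offline.

```python
import json
import urllib.request

REST_SERVER = "https://rest.ensembl.org"  # public Ensembl REST server


def parse_bed_line(line):
    """Split a 4-column BED line into (chrom, start, end, name)."""
    chrom, start, end, name = line.split()[:4]
    return chrom, int(start), int(end), name


def map_url(chrom, start, end, source="GRCh37", target="GRCh38"):
    """Build the REST map URL for one region (1-based, inclusive)."""
    return f"{REST_SERVER}/map/human/{source}/{chrom}:{start}..{end}/{target}"


def fetch_mappings(url):
    """Query the server (needs network access) and return the 'mappings' list."""
    request = urllib.request.Request(url, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request) as response:
        return json.load(response)["mappings"]


bed_lines = [
    "5 36821734 37091336 P1",
    "5 36731578 36978408 P2",
    "5 36908654 37108773 P3",
]

for line in bed_lines:
    chrom, start, end, name = parse_bed_line(line)
    # Assumption: BED start is 0-based, so add 1 for the REST request.
    print(name, map_url(chrom, start + 1, end))
```

Calling fetch_mappings on each URL should return the GRCh38 coordinates under the "mapped" key of each entry. A region can come back split into several pieces if the assemblies differ inside it.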
Data Slicer for variants
• Whole genome VCF files are unwieldy
  • They contain all variants in the genome
  • They contain all genotypes from all individuals studied
• Sometimes you just want to analyse a small region and one population
• The Data Slicer allows you to take a slice of a VCF and narrow down to only individuals and populations of interest
• Data Slicer currently only accesses the 1000 Genomes data
• It is only available for human and only on GRCh37
Hands on – Data Slicer
• I want to get a VCF of the region containing the MC1R gene for the British population
• MC1R is found at 16:89978527-89987385 in GRCh37
• The three-letter code for the British population in 1000 Genomes is GBR
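To make the idea of "slicing" concrete, here is a small Python sketch of the region filter only. It is not how the Data Slicer is implemented, it ignores the population filter, and the function names are made up. The sample rows are the VCF lines used in the VEP section of this deck, since no MC1R records are given here.

```python
def parse_region(region):
    """Turn 'chrom:start-end' into (chrom, start, end)."""
    chrom, span = region.split(":")
    start, end = span.split("-")
    return chrom, int(start), int(end)


def slice_vcf(lines, region):
    """Yield header lines plus the records whose POS lies inside the region."""
    chrom, start, end = parse_region(region)
    for line in lines:
        if line.startswith("#"):
            yield line
            continue
        fields = line.split()
        if fields[0] == chrom and start <= int(fields[1]) <= end:
            yield line


# The MC1R region from the exercise parses like this:
print(parse_region("16:89978527-89987385"))

# Sample rows taken from the VCF example in the VEP section.
rows = [
    "#CHROM POS ID REF ALT",
    "20 14370 rs6054257 G A",
    "20 17330 . T A",
    "20 1110696 rs6040355 A G,T",
    "20 1230237 . T .",
]
for row in slice_vcf(rows, "20:14000-18000"):
    print(row)
```

For whole-genome files an indexed lookup is far quicker than reading every line, which is part of what the web tool saves you from doing by hand.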
FTP
• Files of our complete database:
• Genomic, cDNA, CDS, ncRNA and protein sequence (FASTA)
• Annotated sequence (EMBL, GenBank)
• Gene sets (GTF, GFF)
• Whole-genome multiple and gene-based multiple alignments (MAF)
• Variants (VCF, GVF)
• Constrained elements (BED)
• Regulatory features (BED, BigWig)
• RNA-Seq files (BAM, BigWig)
• MySQL database
Access FTP
Your favourite FTP client
FTP downloads page
http://www.ensembl.org/info/data/ftp/index.html
FTP site
ftp://ftp.ensembl.org/pub/
FTP files are big
• Multiple Mb/Gb
• Lots of time to download/unzip
• Do you really need this data?
• Make sure it’s the right file before you download.
File chameleon for NGS analysis
• Although files on the Ensembl FTP site are in a standard format, different tools define the standards differently (sigh!)
• Your NGS analysis tool might need files that are slightly different to the Ensembl formats
• File chameleon allows you to download files with these adjustments
Hands on – File Chameleon
• I need a GFF3 file of cat for my RNA-seq analysis.
• My tool requires:
  • UCSC-style chromosome naming like chr1
  • Only genes shorter than 4 Mb
  • Transcript IDs in every line
• We will use File Chameleon to download this customised file.
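As an illustration of the kind of adjustment involved, the chromosome renaming can be sketched in a few lines of Python. This is not File Chameleon's code; the function names and the sample GFF3 line are made up, the MT to chrM rule is an assumption about UCSC naming, and the gene-length and transcript-ID adjustments are left out.

```python
def to_ucsc_name(seqid):
    """Prefix an Ensembl-style sequence name with 'chr' (MT becomes chrM)."""
    if seqid.startswith("chr"):
        return seqid
    return "chrM" if seqid == "MT" else "chr" + seqid


def rename_gff3(lines):
    """Rewrite column 1 of every feature line; pass comments through unchanged."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            yield line
            continue
        fields = line.split("\t")
        fields[0] = to_ucsc_name(fields[0])
        yield "\t".join(fields)


# Hypothetical feature line, for illustration only.
example = [
    "##gff-version 3",
    "A1\tensembl\tgene\t1000\t5000\t.\t+\t.\tID=gene:example",
]
for out in rename_gff3(example):
    print(out)
```

Directive lines such as ##sequence-region also carry sequence names and are not touched by this sketch, which is one reason to let the tool do the job.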
Analyse your own variants with the VEP
• Find out the effects of your own variants on Ensembl genes
• Analyse whole genome variant calls
• Filter variants to find those that might be interesting
Your own variant data
Variant coordinates
1 881907 881906 -/C +
5 140532 140532 T/C +
12 1017956 1017956 T/A +
2 946507 946507 G/C +
14 19584687 19584687 C/T -
HGVS notation
ENST00000285667.3:c.1047_1048insC
5:g.140532T>C
NM_153681.2:c.7C>T
ENSP00000439902.1:p.Ala2233Asp
NP_000050.2:p.Ile2285Val
VCF
#CHROM POS ID REF ALT
20 14370 rs6054257 G A
20 17330 . T A
20 1110696 rs6040355 A G,T
20 1230237 . T .
Variant IDs
rs41293501
COSM327779
rs146120136
FANCD1:c.475G>A
rs373400041
Variation types
1) Small scale in one or few nucleotides of a gene
• Small insertions and deletions (DIPs or indels)
• Single nucleotide polymorphism (SNP)
A G A C T T G A C C T G T C T - A A C T G G A
T G A C T T G A C - T G T C T G A A C G G G A
2) Large scale in chromosomal structure (structural variation)
• Copy number variations (CNV)
• Large deletions/duplications, insertions, translocations
deletion duplication insertion translocation
Variation consequences
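[Figure: a gene model from the ATG start to the poly-A tail, labelled with where variants can fall: Regulatory, 5’ Upstream, 5’ UTR, Coding (synonymous), Coding (non-synonymous), Splice site, Intronic, 3’ UTR, 3’ Downstream]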
http://www.ensembl.org/info/docs/variation/predicted_data.html
Consequence terms
Predicting missense effects – SIFT and PolyPhen
SIFT and PolyPhen score changes in amino acid sequence based on:
• How well conserved the protein is
• The chemical change in the amino acid
• 3D structure and domains (PolyPhen only)
• SIFT and PolyPhen are predictions, not facts
• A prediction will never be as good as experimental validation
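[Figure: SIFT and PolyPhen score scales, each running from 0 to 1. SIFT: Deleterious and Tolerated, with the cut-off marked at 0.05. PolyPhen: Probably damaging, Possibly damaging and Benign, with marks at 0.1 and 0.2.]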
Use the VEP
http://www.ensembl.org/info/docs/tools/vep/index.html
Species that work with the VEP
+ everything in Plants, Fungi, Metazoa, Protists and Bacteria
Set up a cache
- Speed up your VEP script with an offline cache.
- Use prebuilt caches for Ensembl species.
- Or make your own from GTF and FASTA files, even for genomes not in Ensembl.
http://www.ensembl.org/info/docs/tools/vep/script/vep_cache.html
VEP plugins
• Plugins add extra functionality to the VEP
• They may extend, filter or manipulate the output of the VEP.
• Plugins may make use of external data or code.
• Available on the web tool and with the script.
Hands on
• We’re going to look at a set of four variants to find out what genes they hit and what effect they have on them.
9 128328461 128328461 A/- + var1
9 128322349 128322349 C/A + var2
9 128323079 128323079 C/G + var3
9 128322917 128322917 G/A + var4
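The same variants could also be sent to the VEP through the Ensembl REST API. The Python sketch below turns one line of the coordinate format above into a request URL for the vep region endpoint. It is a hedged illustration: the helper names are made up, the server is assumed to be on the assembly these coordinates refer to, and passing "-" as the allele for the deletion in var1 is an assumption that has not been checked here. The network call sits in its own function so the URL building runs offline.

```python
import json
import urllib.request

REST_SERVER = "https://rest.ensembl.org"  # public Ensembl REST server
STRANDS = {"+": 1, "-": -1}


def vep_region_url(line, species="human"):
    """Build a VEP REST URL from 'chrom start end ref/alt strand [name]'."""
    chrom, start, end, alleles, strand = line.split()[:5]
    alt = alleles.split("/")[1]  # the variant allele follows the slash
    region = f"{chrom}:{start}-{end}:{STRANDS[strand]}"
    return f"{REST_SERVER}/vep/{species}/region/{region}/{alt}"


def fetch_consequences(url):
    """Query the server (needs network access) and return the parsed JSON."""
    request = urllib.request.Request(url, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request) as response:
        return json.load(response)


variants = [
    "9 128322349 128322349 C/A + var2",
    "9 128323079 128323079 C/G + var3",
    "9 128322917 128322917 G/A + var4",
]
for variant in variants:
    print(vep_region_url(variant))
```

For more than a handful of variants the web tool or the VEP script is the better fit, since the REST server limits how many requests it will accept.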
Questions?
• We’ve muted all the mics
• Ask questions in the Chat box in the webinar interface
• I will check the Chat interface
• There’s no threading so please respond with @name
Host an Ensembl course
Browser course
½-2 day course on the Ensembl browser, aimed at wet-lab scientists.
One trainer.
REST API course
1-2 day course on the Ensembl Perl API, aimed at bioinformaticians.
1-2 trainers.
http://training.ensembl.org/
We can teach an Ensembl course at your institute for free (except trainers’ expenses).
Email us: [email protected]
Help and documentation
Course online http://www.ebi.ac.uk/training/online/subjects/11
Tutorials www.ensembl.org/info/website/tutorials
Flash animations
www.youtube.com/user/EnsemblHelpdesk
http://u.youku.com/Ensemblhelpdesk
Email us [email protected]
Ensembl public mailing lists [email protected], [email protected]
Follow us
www.facebook.com/Ensembl.org
@Ensembl
www.ensembl.info
Publications
Aken, B. et al
Ensembl 2017
Nucleic Acids Research
http://europepmc.org/articles/PMC5210575
Xosé M. Fernández-Suárez and Michael K. Schuster
Using the Ensembl Genome Server to Browse Genomic Sequence Data
Current Protocols in Bioinformatics 1.15.1-1.15.48 (2010)
www.ncbi.nlm.nih.gov/pubmed/20521244
Giulietta M Spudich and Xosé M Fernández-Suárez
Touring Ensembl: A practical guide to genome browsing
BMC Genomics 11:295 (2010)
www.biomedcentral.com/1471-2164/11/295
http://www.ensembl.org/info/about/publications.html
Ensembl Acknowledgements
The Entire Ensembl Team
Funding
Co-funded by the European Union