Sequence Analysis with Artemis
and
Artemis Comparison Tool (ACT)
Caribbean Bioinformatics Workshop, 18th-29th January 2010
atcttttacttttttcatcatctatacaaaaaatcatagaatattcatcatgttgtttaaaataatgtattccattatgaactttattacaaccctcgtt
tttaattaattcacattttatatctttaagtataatatcatttaacattatgttatcttcctcagtgtttttcattattatttgcatgtacagtttatca
tttttatgtaccaaactatatcttatattaaatggatctctacttataaagttaaaatctttttttaattttttcttttcacttccaattttatattccg
cagtacatcgaattctaaaaaaaaaaataaataatatataatatataataaataatatataataaataatatataatatataataaataatatataatat
ataatatataataaataatatataatatataatatataataaataatatataataaataatatataatatataatatataatactttggaaagattattt
atatgaatatatacacctttaataggatacacacatcatatttatatatatacatataaatattccataaatatttatacaacctcaaataaaataaaca
tacatatatatatataaatatatacatatatgtatcattacgtaaaaacatcaaagaaatatactggaaaacatgtcacaaaactaaaaaaggtattagg
agatatatttactgattcctcatttttataaatgttaaaattattatccctagtccaaatatccacatttattaaattcacttgaatattgttttttaaa
ttgctagatatattaatttgagatttaaaattctgacctatataaacctttcgagaatttataggtagacttaaacttatttcatttgataaactaatat
tatcatttatgtccttatcaaaatttattttctccatttcagttattttaaacatattccaaatattgttattaaacaagggcggacttaaacgaagtaa
ttcaatcttaactccctccttcacttcactcattttatatattccttaatttttactatgtttattaaattaacatatatataaacaaatatgtcactaa
taatatatatatatatatatatatatatatatattataaatgttttactctattttcacatcttgtccttttttttttaaaaatcccaattcttattcat
taaataataatgtattttttttttttttttttttttttattaattattatgttactgttttattatatacactcttaatcatatatatatatttatatat
atatatatatatatatatatattattcccttttcatgttttaaacaagaaaaaaaactaaaaaaaaaaaaaataataaaatatatttttataacatatgt
attattaaaatgtatatataaaaatatatattccatttattattatttttttatatacattgttataagagtatcttctcccttctggtttatattacta
ccatttcactttgaacttttcataaaaattaatagaatatcaaatatgtataatatataacaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaata
tatatatatatatatatacatataatatatatttcatctaatcatttaaaattattattatatattttttaaaaaatatatttatgataacataaaaaga
atttaattttaattaaatatatataattacatacatctaatattattatatatatataataagttttccaaatagaatacttatatattatatatatata
tatatatatatatattcttccataaaaagaataaaataaaataaaaacaccttaaaagtatttgtaaaaaattccccacattgaatatatagttgtattt
ataaaattaaagaaaaagcataaagttaccatttaatagtggagattagtaacattttcttcattatcaaaaatatttatttcctaattttttttttttg
taaaatatatttaaaaatgtaatagattatgtattaaataatataaatatagcaaaatgttcaattttagaaatttgcctctttttgacaaggataattc
aaaagatacaggtaaaaaaaaaaaaataaagtaaaacaaaacaaaacaaaaaacaaaaaaaaaaaaaaaaaaaaaaatgacatgttataatataatataa
taaataaaaattatgtaatatatcataatcgaagaaacatatatgaaaccaaaaagaaacagatcttgatttattaatacatatataactaacattcata
tctttatttttgtagatgatataaaaaattttataaactcttatgaagggatatatttttcatcatccaataaatttataaatgtatttctagacaaaat
tctgatcattgatccgtcttccttaaatgttattacaataaatacagatctgtatgtagttgatttcctttttaatgagaaaaataagaatcttattgtt
ttagggtaatgaaatatatatagatttatatttttatttatttattatatattattttttaatttttcttttatatatttattttatttagtgtataaaa
tgatatcctttatatttatatttacatgggatattcaaataataacaaaaatgagtatacacatatatatatatatatatatatatatgtatattttttt
tttttttttatgttcctataggaaagggaagaattcactgatttgtagtgtttacaatattagggaatgcaactttacacttttgaaaaaaattcagtta
agcaaaaatattaataacattaaaaagacactgatagcaaaatgtaatgaatatataataacattagaaaataagaaaattactttttatttcttaaata
aagattatagtataaatcaaagtgaattaatagaagacggaaaagaacttattgaaaatatctatttgtcaaaaaatcatatcttgttagtaataaaaaa
ttcatatgtatatatataccaattagatattaaaaattcccatattagttatacacttattgatagtttcaatttaaatttatcctacctcagagaatct
ataaataataaaaaaaagcatataaataaaataaatgatgtatcaaataatgacccaaaaaaggataataatgaaaaaaatacttcatctaataatataa
atcttttacttttttcatcatctatacaaaaaatcatagaatattcatcatgttgtttaaaataatgtattccattatgaactttattacaaccctcgtt
tttaattaattcacattttatatctttaagtataatatcatttaacattatgttatcttcctcagtgtttttcattattatttgcatgtacagtttatca
tttttatgtaccaaactatatcttatattaaatggatctctacttataaagttaaaatctttttttaattttttcttttcacttccaattttatattccg
cagtacatcgaattctaaaaaaaaaaataaataatatataatatataataaataatatataataaataatatataatatataataaataatatataatat
ataatatataataaataatatataatatataatatataataaataatatataataaataatatataatatataatatataatactttggaaagattattt
atatgaatatatacacctttaataggatacacacatcatatttatatatatacatataaatattccataaatatttatacaacctcaaataaaataaaca
tacatatatatatataaatatatacatatatgtatcattacgtaaaaacatcaaagaaatatactggaaaacatgtcacaaaactaaaaaaggtattagg
agatatatttactgattcctcatttttataaatgttaaaattattatccctagtccaaatatccacatttattaaattcacttgaatattgttttttaaa
ttgctagatatattaatttgagatttaaaattctgacctatataaacctttcgagaatttataggtagacttaaacttatttcatttgataaactaatat
tatcatttatgtccttatcaaaatttattttctccatttcagttattttaaacatattccaaatattgttattaaacaagggcggacttaaacgaagtaa
ttcaatcttaactccctccttcacttcactcattttatatattccttaatttttactatgtttattaaattaacatatatataaacaaatatgtcactaa
taatatatatatatatatatatatatatatatattataaatgttttactctattttcacatcttgtccttttttttttaaaaatcccaattcttattcat
taaataataatgtattttttttttttttttttttttttattaattattatgttactgttttattatatacactcttaatcatatatatatatttatatat
atatatatatatatatatatattattcccttttcatgttttaaacaagaaaaaaaactaaaaaaaaaaaaaataataaaatatatttttataacagatgt
attattaaaatgtatatataaaaatatatattccatttattattatttttttatatacattgttataagagtatcttctcccttctggtttatattacta
ccatttcactttgaacttttcataaaaattaatagaatatcaaatatgtataatatataacaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaata
tatatatatatatatatacatataatatatatttcatctaatcatttaaaattattattatatattttttaaaaaatatatttatgataacataaaaaga
atttaattttaattaaatatatataattacatacatctaatattattatatatatataataagttttccaaatagaatacttatatattatatatatata
tatatatatatatattcttccataaaaagaataaaataaaataaaaacaccttaaaagtatttgtaaaaaattccccacattgaatatatagttgtattt
ataaaattaaagaaaaagcataaagttaccatttaatagtggagattagtaagtttttcttcattatcaaaaatatttatttcctaattttttttttttg
taaaatatatttaaaaatgtaatagattatgtattaaataatataaatatagcaaaatgttcaattttagaaatttgcctctttttgacaaggataattc
aaaagatacaggtaaaaaaaaaaaaataaagtaaaacaaaacaaaacaaaaaacaaaaaaaaaaaaaaaaaaaaaaatgacatgttataatataatataa
taaataaaaattatgtaatatatcataatcgaagaaacatatatgaaaccaaaaagaaacagatcttgatttattaatacatatataactaacattcata
tctttatttttgtagatgatataaaaaattttataaactcttatgaagggatatatttttcatcatccaataaatttataaatgtatttctagacaaaat
tctgatcattgatccgtcttccttaggtgttattacaataaatacagatctgtatgtagttgatttcctttttaatgagaaaaataagaatcttattgtt
ttagggtaatgaaatatatatagatttatatttttatttatttattatatattattttttaatttttcttttatatatttattttatttagtgtataaaa
tgatatcctttatatttatatttacatgggatattcaaataataacaaaaatgagtatacacatatatatatatatatatatatatatgtatattttttt
tttttttttatgttcctataggaaagggaagaattcactgatttgtagtgtttacaatattagggaatgcaactttacacttttgaaaaaaattcagtta
agcaaaaatattaataacattaaaaagacactgatagcaaaatgtaatgaatatataataacattagaaaataagaaaattactttttatttcttaaata
aagattatagtataaatcaaagtgaattaatagaagacggaaaagaacttattgaaaatatctatttgtcaaaaaatcatatcttgttagtaataaaaaa
ttcatatgtatatatataccaattagatattaaaaattcccatattagttatacacttattgatagtttcaatttaaatttatcctacctcagagaatct
ataaataataaaaaaaagcatataaataaaataaatgatgtatcaaataatgacccaaaaaaggataataatgaaaaaaatacttcatctaataatataa
Sequencing is just the beginning of the process
Extracting information & interpreting
• What's there?
• Where are the genes?
• Which genes?
• How to find them?
SEQUENCE ANNOTATION
Strategies for sequence annotation
• Predictive methods: interpretation of the DNA sequence into genes according to rules
• Comparative methods: interpretation of the DNA sequence into genes according to similarities with other sequences
• Experimental methods: interpretation of the DNA sequence into genes according to experimental results (e.g. cDNA)
EST Blast Hit
Gene prediction programs:
ORFs and CDSs
ORFs are not equivalent to CDSs
Not all open reading frames are coding sequences
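A minimal sketch of why an ORF is only a candidate CDS: the function below lists every ATG-to-stop stretch in the three forward frames. The name `find_orfs`, the `min_codons` threshold and the ATG-to-stop definition are assumptions made for illustration (some tools define an ORF as stop-to-stop instead). Every hit still needs other evidence before it can be called a coding sequence.

```python
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) for ATG...stop stretches on the forward strand.

    Coordinates are 0-based and end-exclusive; the stop codon is included.
    min_codons counts the codons before the stop (hypothetical threshold).
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first ATG opens a candidate
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                   # look for the next candidate
    return orfs

print(find_orfs("CCATGAAATTTGGGTAACC"))
```

Lowering `min_codons` on a real genome returns many more short ORFs, most of which are unlikely to be genes.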
Gene prediction
Genefinder
Glimmer
Orpheus
PHAT
GeneMark
Gene finding programs
• Genefinding software packages use Hidden
Markov Models.
• Predict coding, intergenic and intron
sequences
• Need to be trained on a specific organism.
• Never perfect!
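To give a feel for the Hidden Markov Model idea, here is a toy two-state model (intergenic vs. coding) decoded with the Viterbi algorithm. All probabilities are made-up illustrative values for an AT-rich genome, not parameters from any real gene finder, and real packages use far richer models trained on the organism of interest.

```python
import math

STATES = ("intergenic", "coding")
# Made-up parameters for illustration only.
START = {"intergenic": 0.5, "coding": 0.5}
TRANS = {
    "intergenic": {"intergenic": 0.9, "coding": 0.1},
    "coding": {"intergenic": 0.1, "coding": 0.9},
}
EMIT = {
    "intergenic": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},
    "coding": {"A": 0.2, "T": 0.2, "G": 0.3, "C": 0.3},
}

def viterbi(seq):
    """Return the most probable state for each base (A, C, G, T only)."""
    seq = seq.upper()
    if not seq:
        return []
    # Log-probability of the best path ending in each state at the first base.
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(EMIT[s][base]))
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

labels = viterbi("AT" * 5 + "GC" * 5 + "AT" * 5)
print(labels[8:12])
```

Changing the emission or transition values changes where the model places the boundaries, which is one reason a model trained on one organism performs poorly on another.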
Gene prediction programs: Problems
• ORFs are not equivalent to CDSs
• Gene prediction programs find new genes that share
properties with a given set of genes.
• They can be confounded by:
– Sequence constraints (ribosomal proteins etc.)
– Sequence biases
– Different sets of genes
– Horizontal gene transfer
– Non-coding DNA
Gene prediction programs: Problems
Different gene training sets: Plasmodium falciparum
Original annotation
Updated annotation
Gene prediction programs: Problems
Non-protein coding regions: S. typhi ribosomal RNA genes
[Figure: glimmer, genefinder and orpheus predictions shown alongside the final annotation]
Gene prediction programs: Problems
Non-protein coding regions: N. meningitidis DNA repeats
[Figure: glimmer and orpheus predictions shown alongside the final annotation]
Gene prediction programs: Problems
Pseudogenes: M. leprae
[Figure series: Glimmer; ORPHEUS; WUBLASTX vs. M. tuberculosis; Final annotation]
The Gene Prediction Process
[Diagram: DNA sequence; analysis software (AT content, gene finders, codon usage, BlastX, FASTA, ESTs); annotator; useful CDS prediction]
Eukaryotic gene
[Diagram: 5’UTR, ATG, Exon I, GT-AG intron, Exon II, GT-AG intron, Exon III, stop, 3’UTR; below it the mRNA (CAP to AAAAAAAAAA), cDNA and ESTs (TTTTTTTTT)]
AT content
• Coding regions have higher GC content in
AT rich genomes
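A minimal sketch of an AT-content scan along a sequence, assuming a simple fixed-size sliding window. The function name and the `window` and `step` defaults are hypothetical choices for illustration. In an AT-rich genome, windows with a noticeably lower AT fraction are candidates for coding regions.

```python
def at_content_windows(seq, window=120, step=60):
    """Return [(window start, AT fraction), ...] for each full window."""
    seq = seq.upper()
    result = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        at = chunk.count("A") + chunk.count("T")
        result.append((start, at / window))
    return result

for start, fraction in at_content_windows("AAAATTTTGGGGCCCC", window=8, step=4):
    print(start, fraction)
```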
CODON USAGE
• Codon bias is different for each organism.
• Base composition in coding regions is constrained
– but it is not constrained in non-coding regions.
• The codon usage for any particular gene can influence expression.
Codon usage
• All organisms have a preferred set of
codons.
Codon   Malaria   Trypanosoma
GUU     0.41      0.28
GUC     0.06      0.19
GUA     0.42      0.14
GUG     0.11      0.39
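The values for each organism above sum to 1, which suggests they are the fraction of each codon among the four valine codons. A minimal sketch of that calculation for a single in-frame CDS follows; the function name is hypothetical, and real codon usage tables are computed over many genes rather than one.

```python
from collections import Counter

VALINE_CODONS = ("GUU", "GUC", "GUA", "GUG")

def valine_codon_fractions(cds):
    """Fraction of each valine codon among all valine codons in an in-frame CDS."""
    rna = cds.upper().replace("T", "U")
    usable = len(rna) - len(rna) % 3          # ignore a trailing partial codon
    codons = [rna[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(c for c in codons if c in VALINE_CODONS)
    total = sum(counts.values())
    if total == 0:
        return {c: 0.0 for c in VALINE_CODONS}
    return {c: counts[c] / total for c in VALINE_CODONS}

print(valine_codon_fractions("ATGGTTGTTGTAGTGTAA"))
```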
Codon Usage
• http://www.kazusa.or.jp/codon/
Codon Usage in Artemis
[Screenshot: codon usage plots for the forward frames and the reverse frames]
Codon usage & gene finding in Leishmania
GC frame plot
• Plots the third position GC content of each
frame of a DNA sequence.
• In coding DNA the GC content of the 3rd
base is often higher.
• Good prediction of coding in malaria and
trypanosomes.
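The idea behind a GC frame plot can be sketched as follows: for each window, compute the GC fraction of the third codon position separately for the three forward frames. This is an illustrative sketch, not the Artemis implementation; the function name and the `window` and `step` defaults are assumptions.

```python
def gc3_by_frame(seq, window=120, step=60):
    """Return [(window start, [GC3 frame 1, GC3 frame 2, GC3 frame 3]), ...].

    Frame f (0, 1, 2) has its codons starting at positions f, f + 3, ...,
    so its third codon positions are those p with (p - f) % 3 == 2.
    """
    seq = seq.upper()
    result = []
    for start in range(0, len(seq) - window + 1, step):
        values = []
        for frame in range(3):
            third = [seq[p] for p in range(start, start + window)
                     if (p - frame) % 3 == 2]
            gc = sum(base in "GC" for base in third)
            values.append(gc / len(third) if third else 0.0)
        result.append((start, values))
    return result

print(gc3_by_frame("AAG" * 10, window=30, step=30))
```

A frame whose third-position GC stays clearly above the other two over a long stretch is a candidate coding frame.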
GC frame plot of tubulin gene cluster on T. brucei Chr 1
Homology Data
• Coding regions are more conserved than non
coding regions due to selective pressure.
• Comparing all possible translations against
all known proteins will give clues to known
genes.
• Blastx
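"All possible translations" means the six reading frames: three on the forward strand and three on the reverse complement. A minimal sketch of that step, using the standard genetic code, is shown below. The function names are hypothetical, and the comparison against a protein database that BLASTX then performs is not shown.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def translate(seq):
    """Translate complete codons; unknown codons (e.g. containing N) become X."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))

def six_frame_translations(seq):
    """Return translations keyed '+1', '+2', '+3', '-1', '-2', '-3'."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames

for name, protein in sorted(six_frame_translations("ATGGCCTAA").items()):
    print(name, protein)
```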
Gene finding: using ACT
TBLASTX comparisons
P. knowlesi
P. falciparum
P. yoelii
Gene finding by RNA-Seq (transcriptional landscape of Neospora caninum tachyzoites)
[Figure: Day 3 and Day 4 tachyzoite RNA-Seq tracks; N. caninum Chr08 and T. gondii Chr08; 5’ UTR and 3’ UTR marked]
TBLASTX matches visualised in ACT
Transcriptome sequencing in Neospora (RNA-Seq is useful for predicting/confirming UTR boundaries)
RNA-Seq: correcting gene models
[Figure: gene model before and after correction, each with a %GC plot; legend: 16hr, 32hr, 48hr]