Finding and Aligning Related Sequences (Martin Frith)
Finding and aligning related sequences
Martin C. Frith Computational Biology Research Center
AIST, Tokyo www.cbrc.jp/~martin
2012-12-09 @ BioinfoSummer, Adelaide
Finding and aligning related sequences
• Examples
• What are we really trying to do?
• Simple sequences
• Repeat masking
• Classic score-based alignment
• Alignment & probability models
• Alignment ambiguity
• Using sequence quality data
• Scaling to huge datasets
Compare human and mouse genomes
gctagtgtac
|||  || ||
gct--tgaac

aa-gtaca
|| |||||
aaggtaca
[Figure labels: Human, Mouse]
Compare DNA from a patient to a reference genome
Patient DNA Sequencer
ctatgctagtcgta
cctatagtctgtatg
atatatatattatta
ccctagtcgtatgg
tttaccagctgga
ctagtcgtagtgtgg
ctgattgcttatttacgttcgtatgctagctgatcgtagtcgtcgagcttatcgtgggc
Reference genome sequence
ctagcttatcgt
DNA reads
What kinds of microbial genes are there?
Water from e.g. a hot spring
DNA Sequencer
ctatgctagtcgta
cctatagtctgtatg
atatatatattatta
ccctagtcgtatgg
tttaccagctgga
ctagtcgtagtgtgg
DNA reads
ArgLysTyrProPheLeuLeuIsoArgLysPheAlaPro-ProGlyGlyAlaGly…
atatatatatattagccgt
|||...||| |||...|||
GlyGlyPhePheGlyAlaLeuCysCysTrpTrpAlaGlyAlaPro…
Database of all known proteins
More examples
• Compare ancient DNA to a reference genome
– Mammoth, neanderthal, Turin Shroud, …
• Align (potentially spliced) RNA sequences to a reference genome
– To see which genes are active
• Align short DNA reads to each other
– In order to assemble them
What are we really trying to do?
1. Find and align similar sequences?
2. Find and align homologous sequences?
3. Find and align orthologous sequences?
4. Find and align paralogous sequences?
Homology, orthology, paralogy
Homology: descent from a common ancestor
Orthology: descent from a common ancestor by genome division
Paralogy: descent from a common ancestor by duplication within a genome
Example
[Figure: gene tree, drawn from past to present, relating β-globin, β1-globin and β2-globin in human and mouse. Orthologs and paralogs are marked on the tree.]
• Orthology is not necessarily 1-to-1
• Orthology is not transitive
Not an equivalence relation
Compare human and mouse genomes
What is the aim?
• Find similar sequences
• Find homologs
• Find orthologs
• Find paralogs
Do we want to align mouse α-globin with human β-globin?
Probably not
Compare DNA from a patient to a reference genome
Patient DNA Sequencer
ctatgctagtcgta
cctatagtctgtatg
atatatatattatta
ccctagtcgtatgg
tttaccagctgga
ctagtcgtagtgtgg
ctgattgcttatttacgttcgtatgctagctgatcgtagtcgtcgagcttatcgtgggc
Reference genome sequence
DNA reads
What is the aim?
• Find similar sequences
• Find homologs
• Find orthologs
• Find paralogs
Do we want to align the patient’s α-globin to the reference’s β-globin?
Aims and algorithms
• Sequence comparison algorithms basically find similar sequences
• Finding homologs is harder
• Finding orthologs is even harder
Similarity versus homology
[Figure: two overlapping sets, similar sequences and homologous sequences]
• Similar but not homologous: convergent evolution
• Homologous but not similar: rapid evolution over a long time span
• The most frequent case of convergent evolution is simple sequences
Simple sequences
• DNA (and RNA and protein) frequently has simple sequences:
atgatcgattatcgtagtctaggtcgtatgctatgatt
cgataaaaaaaaaaaaaaaaaaacggtatgcgtagctg
cgatcgtagtgactatatgagagaggattcgatgctaa
gttctctaggagaggcttaggctgagcgcgtatcactg
gctcgcggctgtgtgtgtgtgtgtgtgtgtgtgtgtga
cgtatcgcacatcgtcgattttgagattcccgatggcc
How do simple sequences evolve?
• Strand slippage during DNA replication:
catcatcatcatcat
gtagtagtagtagtagta
[Animation frames omitted: the top strand is copied base by base, slips relative to the bottom strand, and copying continues.]
On the top strand, it has got longer
How do simple sequences evolve?
• An initial (short, mild) simple sequence occurs by chance
• Due to slippage, it gets longer…
• And longer…
Homology between human and banana?
Human:  atatatatatatatatatatatatatatatatatatatata
        |||||||||||||||||||||||||||||||||||||||||
Banana: atatatatatatatatatatatatatatatatatatatata
• Probably not.
Avoiding non-homologous alignments of simple sequences
• The standard way is to identify and “mask” them, before alignment
Repeat masking
• There are standard “repeat masking” tools
– RepeatMasker, DustMasker, SegMasker, TRF, …
• Most people just assume they work
Repeat confusion
Simple sequence:
atcttatgtctctctctctctctctctctggatgcttgaccac
Interspersed repeat:
cttgttattgctgatcgtcctctctgtaaattgttattgctgatcatgctttaac
They are both called “repeats”, but they are rather different. Don’t confuse them.
Test of avoiding non-homologous alignments
• Compare two sequences after reversing one of them
• Sequences never evolve by reversal, so there are no true homologs in this test
• But repeats may still cause strong similarities, if they are not suppressed
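The reversal test can be sketched in a few lines. This is only an illustration under stated assumptions: the helper names `reverse` and `shared_words` and both toy sequences are hypothetical, and counting shared exact words stands in for running a real aligner on whole genomes.

```python
def reverse(seq):
    """Reverse a sequence without complementing it (not a reverse complement)."""
    return seq[::-1]


def shared_words(a, b, k):
    """Return the set of length-k words that occur in both sequences."""
    words_a = {a[i:i + k] for i in range(len(a) - k + 1)}
    words_b = {b[i:i + k] for i in range(len(b) - k + 1)}
    return words_a & words_b


# Two made-up sequences that each contain an unmasked tandem repeat.
seq1 = "gctcgcggc" + "tg" * 12 + "acgtatcgc"
seq2 = "aaccgtc" + "gt" * 12 + "ttaggca"

# seq2 is reversed, so no shared word can reflect true homology,
# yet the tandem repeats still produce exact matches.
print(sorted(shared_words(seq1, reverse(seq2), 8)))
```

Any word reported here is spurious by construction, which is why the test is useful for judging how well a masking method suppresses repeats.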
Test result
The C. elegans genome versus the reversed P. pacificus genome, after masking both with DustMasker:
[Plot] Red: observed number of alignments. Black: expected number of alignments for random sequences (E-value).
A spurious alignment
Upper sequence: from C. elegans Lower sequence: from reversed P. pacificus
Conclusion: DustMasker fails to mask some tandem repeats
Other methods?
• SegMasker does not work either:
[Alignment figure] Upper sequence: part of an animal protein. Lower sequence: part of a reversed plant protein.
• Nor does RepeatMasker, TRF…
A new repeat-masking method enables specific detection of homologous sequences. Frith MC. Nucleic Acids Research 2011 39:e23.
Repeat masking
• There are standard “repeat masking” tools
– RepeatMasker, DustMasker, SegMasker, TRF, …
• Most people just assume they work
– Cargo cult science
• Genomic bioinformatics is riddled with it
New repeat-masking method
• tantan: http://www.cbrc.jp/tantan/
• It looks for slippery regions in sequences
• Slippery = similar to shifted versions of itself
• It integrates similarity at different slip distances, using a Forward-Backward algorithm
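To make "similar to shifted versions of itself" concrete, here is a crude sketch. It is not tantan's algorithm: it does not use a Forward-Backward calculation, and the function names `self_match_fraction` and `slipperiness` and the `max_shift` parameter are hypothetical. It simply measures, for each slip distance, how often a letter equals the letter that many positions earlier, and keeps the best distance.

```python
def self_match_fraction(seq, shift):
    """Fraction of positions i where seq[i] equals seq[i - shift]."""
    pairs = [(seq[i], seq[i - shift]) for i in range(shift, len(seq))]
    return sum(a == b for a, b in pairs) / len(pairs)


def slipperiness(seq, max_shift=10):
    """Best self-match fraction over slip distances 1..max_shift (crude)."""
    return max(self_match_fraction(seq, d) for d in range(1, max_shift + 1))


# Two of the example lines from the "Simple sequences" slide.
tandem = "gctcgcggctgtgtgtgtgtgtgtgtgtgtgtgtgtga"
plain = "atgatcgattatcgtagtctaggtcgtatgctatgatt"

print(round(slipperiness(tandem), 2))
print(round(slipperiness(plain), 2))
```

A whole-sequence score like this cannot say which region to mask; a real masker has to assign a value to each position, which is the job the Forward-Backward algorithm does in tantan.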
tantan test result
The C. elegans genome versus the reversed P. pacificus genome, after masking with tantan:
[Plot] Red: observed number of alignments. Black: expected number of alignments for random sequences (E-value).
Conclusion
• tantan prevents simple-sequence alignments
• Without masking an excessive amount
• It even works for extremely AT-rich DNA
– Plasmodium falciparum (malaria): 80% AT
– Dictyostelium discoideum (slime mould): 80% AT
Classic score-based alignment
1. Define a scoring scheme
2. Find alignments with high (maximum) scores
Alignment scoring scheme
Substitution score matrix:
     a   c   g   t
a    2  -3  -1  -3
c   -3   2  -3  -1
g   -1  -3   2  -3
t   -3  -1  -3   2
Gap scores:
Gap existence cost: 5
Gap extension cost: 1
Example:
 t  a  c  g  t  g  -  -  a  g  g  t
 |  |  |     |  |        |  |  |  |
 t  a  c  a  t  g  c  t  a  g  g  t
+2 +2 +2 -1 +2 +2   -7  +2 +2 +2 +2
Alignment score: 10
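The arithmetic of the example can be checked with a short script. This is a minimal sketch: the function name `alignment_score` is hypothetical, and it assumes that a gap of length k costs the existence cost plus k times the extension cost (5 + 2 × 1 = 7 for the two-letter gap above).

```python
SCORES = {
    "a": {"a": 2, "c": -3, "g": -1, "t": -3},
    "c": {"a": -3, "c": 2, "g": -3, "t": -1},
    "g": {"a": -1, "c": -3, "g": 2, "t": -3},
    "t": {"a": -3, "c": -1, "g": -3, "t": 2},
}
GAP_EXISTENCE_COST = 5
GAP_EXTENSION_COST = 1


def alignment_score(top, bottom):
    """Score a gapped alignment given as two equal-length rows.

    Simplification: a run of gap columns is charged one existence cost,
    even if it switches from one row to the other.
    """
    score = 0
    in_gap = False
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            if not in_gap:
                score -= GAP_EXISTENCE_COST
            score -= GAP_EXTENSION_COST
            in_gap = True
        else:
            score += SCORES[x][y]
            in_gap = False
    return score


print(alignment_score("tacgtg--aggt", "tacatgctaggt"))  # 10
```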
Classic score-based alignment
1. Define a scoring scheme
2. Find alignments with high (maximum) scores
ctatgctacgtgaggtgtggc
attacatgctaggtccac
tacgtg--aggt
||| ||  ||||
tacatgctaggt
How to find alignments with max score?
• Smith-Waterman algorithm
– Exact: guarantees to find the max score
– A bit slow
• BLAST, FASTA, etc
– Heuristic: no guarantee
– Faster
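For readers who have not seen it, the exact approach can be sketched as follows. This is a textbook Smith-Waterman recursion with affine gap costs (Gotoh's formulation), not code from any particular aligner; the function name `smith_waterman_score` is hypothetical, it assumes a gap of length k costs 5 + k, and it returns only the maximum score rather than the alignment itself.

```python
SCORES = {
    "a": {"a": 2, "c": -3, "g": -1, "t": -3},
    "c": {"a": -3, "c": 2, "g": -3, "t": -1},
    "g": {"a": -1, "c": -3, "g": 2, "t": -3},
    "t": {"a": -3, "c": -1, "g": -3, "t": 2},
}
GAP_EXISTENCE_COST = 5
GAP_EXTENSION_COST = 1


def smith_waterman_score(x, y):
    """Maximum local alignment score of x and y with affine gap costs."""
    neg_inf = float("-inf")
    first_gap = GAP_EXISTENCE_COST + GAP_EXTENSION_COST  # cost of a 1-letter gap
    best = 0
    h_prev = [0] * (len(y) + 1)        # best score of alignments ending at (i-1, j)
    f_prev = [neg_inf] * (len(y) + 1)  # same, but ending with x[i-1] opposite a gap
    for i in range(1, len(x) + 1):
        h_row = [0] * (len(y) + 1)
        f_row = [neg_inf] * (len(y) + 1)
        e = neg_inf                    # ending with y[j-1] opposite a gap
        for j in range(1, len(y) + 1):
            e = max(h_row[j - 1] - first_gap, e - GAP_EXTENSION_COST)
            f_row[j] = max(h_prev[j] - first_gap, f_prev[j] - GAP_EXTENSION_COST)
            match = h_prev[j - 1] + SCORES[x[i - 1]][y[j - 1]]
            h_row[j] = max(0, match, e, f_row[j])  # 0 lets an alignment start here
            best = max(best, h_row[j])
        h_prev, f_prev = h_row, f_row
    return best


print(smith_waterman_score("ctatgctacgtgaggtgtggc", "attacatgctaggtccac"))
```

The run time grows with the product of the two sequence lengths, which is the "a bit slow" part. Note that the maximum for these two sequences need not equal the score of the illustrative alignment shown above: under this scoring scheme, for instance, atgctacgt paired with atgctaggt scores 13 without any gap.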
Alignment scoring scheme
• Where do these scores come from?
• Why is this a good method anyway?
Scores are log likelihood ratios
$$S_{xy} = t \times \log\left(\frac{A_{xy}}{P_x \times Q_y}\right)$$
• $A_{xy}$: probability of x aligned to y in a true alignment (model of homologous sequences)
• $P_x$: probability of x in the first sequence
• $Q_y$: probability of y in the second sequence
• $P_x \times Q_y$: model of independent sequences
Different matrices for different tasks
Strong similarities (~99% identity):
     a   c   g   t
a    1  -3  -3  -3
c   -3   1  -3  -3
g   -3  -3   1  -3
t   -3  -3  -3   1
Bisulfite-converted DNA:
     a   c   g   t
a    2  -6  -6  -6
c   -6   2  -6   1
g   -6  -6   2  -6
t   -6  -6  -6   1
AT-rich DNA (e.g. malaria):
     a   c   g   t
a    2  -3  -2  -3
c   -3   5  -3  -2
g   -2  -3   5  -3
t   -3  -2  -3   2
Weak similarities (~75% identity):
     a   c   g   t
a    1  -1  -1  -1
c   -1   1  -1  -1
g   -1  -1   1  -1
t   -1  -1  -1   1
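One way to see how a target identity leads to a particular match:mismatch ratio is to plug numbers into the log likelihood ratio formula. The sketch below makes simplifying assumptions that are not stated on the slides: all four bases have frequency 0.25 in both sequences, all mismatches are equally likely, and the scale factor t is chosen so that a match scores 1. The function name `log_odds_scores` is hypothetical.

```python
import math

BASE_FREQ = 0.25  # assumed uniform base frequencies (P_x = Q_y = 0.25)


def log_odds_scores(identity):
    """Unscaled match and mismatch scores, log(A_xy / (P_x * Q_y))."""
    a_match = identity * BASE_FREQ               # A_xx, for each of the 4 bases
    a_mismatch = (1 - identity) * BASE_FREQ / 3  # A_xy, for each of the 12 mismatches
    independent = BASE_FREQ * BASE_FREQ          # P_x * Q_y
    return (math.log(a_match / independent),
            math.log(a_mismatch / independent))


for identity in (0.99, 0.75):
    match, mismatch = log_odds_scores(identity)
    t = 1 / match  # scale so that a match scores exactly 1
    print(identity, round(t * match, 2), round(t * mismatch, 2))
```

Under these assumptions the mismatch score comes out close to -3 for 99% identity and equal to -1 for 75% identity, in line with the two uniform matrices above.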
What about gap scores?
Pair hidden Markov model
The arrows describe probabilities for insertions and deletions. (It looks more complicated than it really is.)
A useful formula
$$\mathrm{Prob}(\text{alignment}) \propto \exp(\text{alignment score} / t)$$
Alignment ambiguity
ctagctaaccgtatcgtgggc
||||| |   |||||  | ||
ctagcca---gtatctagtgc
Or?
ctagctaaccgtatcgtgggc
|||||   | |||||  | ||
ctagc---cagtatctagtgc
Per-column probabilities
… g   c   a   t   c   c   t   t   g   g   g   t   c   t   c   g   a   c   a   t   …
… g   c   c   t   c   g   t   t   a   g   a   -   -   t   a   g   a   t   a   g   …
  .99 .99 .99 .95 .93 .92 .90 .79 .55 .33 .16 .22 .49 .55 .59 .71 .93 .97 .98 .99
Importance
• Column reliability is important for:
– Studying the evolution of binding sites
– Identifying polymorphisms
– Finding recombination breakpoints
– …
How to calculate ambiguity
ctagctaaccgtatcgtgggc
||||| |   |||||  | ||
ctagcca---gtatctagtgc
Prob(column) = sum of probs of all alignments that include the column
$$\mathrm{Prob}(\text{column}) = \frac{\text{sum of } \exp(\text{score}/t) \text{ for all alignments that include the column}}{\text{sum of } \exp(\text{score}/t) \text{ for all alignments}}$$
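The formula can be evaluated directly for very short sequences by listing every alignment. The sketch below does this by brute force; it is not how a real aligner works (which would use dynamic programming), and the scores, the per-letter gap score, the value of t, and all function names are assumptions chosen for illustration.

```python
import math
from collections import defaultdict

MATCH, MISMATCH, GAP = 1, -1, -2  # assumed toy scores; GAP is per gapped letter
T = 1.0                           # assumed scale factor t


def alignments(x, y, i=0, j=0):
    """Yield every global alignment of x[i:] with y[j:] as a list of columns.

    A column is (i, j) for aligned letters, (i, None) for a letter of x
    opposite a gap, and (None, j) for a letter of y opposite a gap.
    """
    if i == len(x) and j == len(y):
        yield []
        return
    if i < len(x) and j < len(y):
        for rest in alignments(x, y, i + 1, j + 1):
            yield [(i, j)] + rest
    if i < len(x):
        for rest in alignments(x, y, i + 1, j):
            yield [(i, None)] + rest
    if j < len(y):
        for rest in alignments(x, y, i, j + 1):
            yield [(None, j)] + rest


def score(x, y, columns):
    """Sum of substitution and gap scores over the columns of one alignment."""
    total = 0
    for i, j in columns:
        if i is None or j is None:
            total += GAP
        else:
            total += MATCH if x[i] == y[j] else MISMATCH
    return total


def column_probabilities(x, y):
    """Prob(column) = weight of alignments with the column / weight of all."""
    weight_with_column = defaultdict(float)
    weight_of_all = 0.0
    for columns in alignments(x, y):
        weight = math.exp(score(x, y, columns) / T)
        weight_of_all += weight
        for column in columns:
            weight_with_column[column] += weight
    return {c: w / weight_of_all for c, w in weight_with_column.items()}


probs = column_probabilities("ctagc", "ctgc")
for (i, j), p in sorted(probs.items(), key=lambda item: -item[1])[:5]:
    print(i, j, round(p, 3))
```

The number of alignments grows very quickly with sequence length, so this enumeration is only usable for a handful of letters.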
An aligner that indicates ambiguity
Since 2008
http://last.cbrc.jp/
Warning: LAST was made by me and colleagues
Sequence quality data
• Some DNA sequencers estimate the error probability of every base
• We ought to use this information when comparing sequences
t a g c t g a
0.01 0.02 0.07 0.24 0.32 0.75 0.75
General case: compare 2 sequences with error probabilities
a t g c c …
0.01 0.02 0.02 0.09 0.17
g t a c c …
0.03 0.01 0.08 0.12 0.44
Traditional sequence comparison:
$$S_{xy} = \log\left[\frac{A_{xy}}{B_x B_y}\right]$$
Score matrix (real substitutions: mutation / evolution):
     a   c   g   t
a    2  -3  -1  -3
c   -3   2  -3  -1
g   -1  -3   2  -3
t   -3  -1  -3   2
Giga-sequencers:
t    a    g    c    t
0.01 0.02 0.07 0.24 0.32
Error probabilities (erroneous substitutions)
Generalized log likelihood ratio:
$$S_{xpyq} = \log\left[\frac{pq\,A_{xy}}{B_x B_y} + (1 - pq)\right]$$
Incorporating sequence quality data into alignment improves DNA read mapping. Frith MC, Wan R, Horton P. Nucleic Acids Research 2010 38:e100.
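The effect of the generalized formula can be explored numerically. The sketch below rests on an interpretation that the slide does not spell out: p and q are taken to be the probabilities that base x and base y were read correctly (one minus their error probabilities), so that p = q = 1 recovers the traditional score. The function name `quality_aware_score` and the example likelihood ratios are hypothetical.

```python
import math


def quality_aware_score(likelihood_ratio, p, q):
    """log[p*q * A_xy / (B_x * B_y) + (1 - p*q)].

    likelihood_ratio is A_xy / (B_x * B_y); p and q are ASSUMED to be the
    probabilities that the two bases are correct (1 - error probability).
    """
    return math.log(p * q * likelihood_ratio + (1 - p * q))


MATCH_RATIO = 3.96        # example value for a match (assumption)
MISMATCH_RATIO = 1 / 75   # example value for a mismatch (assumption)

for error in (0.01, 0.07, 0.32, 0.75):
    p = 1 - error  # first base has this error probability
    q = 1.0        # second base assumed error-free, e.g. a reference genome
    print(error,
          round(quality_aware_score(MATCH_RATIO, p, q), 2),
          round(quality_aware_score(MISMATCH_RATIO, p, q), 2))
```

Under this reading, both match and mismatch scores shrink towards zero as the error probability rises, so unreliable bases contribute little evidence either way.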
An aligner that combines score matrix & quality data
Since 2008
http://last.cbrc.jp/
Warning: LAST was made by me and colleagues
Why is BLAST too slow?
1. Find “seeds” (initial matches) of a fixed length (e.g. 11)
2. Try extending an alignment from each seed
…atcgtatcgtatcgtactgctggcctagtggggga…
…ctcgtcgatgctagtcgtactgctgatgctatatatatattaatg…
Problem
Non-uniform composition: atatatatatatatatatata, Alu, LINEs, SINEs, isochores, CpG islands
→ too many seeds → too many extensions → too slow
Example
• Compare the human and chimp genomes
• Each genome has ~ 1 million Alu elements
• So we will get ~ 10^12 (1 million × 1 million) seed matches…
Solution: adaptive seeds
1. Find “seeds” (initial matches) of a fixed rareness (instead of a fixed length)
2. Try extending an alignment from each seed
…atcgtatcgtatcgtactgctggcctagtggggga…
…ctcgtcgatgctagtcgtactgctgatgctatatatatattaatg…
Adaptive seeds can be found efficiently by using a suffix array
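The idea of a seed with fixed rareness can be sketched with a naive suffix array. This is a toy illustration, not LAST's implementation: the function names, the `max_matches` parameter, and the reference sequence are all hypothetical, and the suffix array is built by plain sorting, which would be far too slow and memory-hungry for a genome. Starting from a query position, the match is lengthened until it occurs at most `max_matches` times in the reference.

```python
from bisect import bisect_left, bisect_right


def build_index(text):
    """Naive suffix array: start positions sorted by suffix, plus the suffixes."""
    order = sorted(range(len(text)), key=lambda i: text[i:])
    return order, [text[i:] for i in order]


def match_positions(pattern, index):
    """Start positions in the reference where pattern occurs (binary search)."""
    order, suffixes = index
    lo = bisect_left(suffixes, pattern)
    hi = bisect_right(suffixes, pattern + "\uffff")  # just past suffixes starting with pattern
    return sorted(order[lo:hi])


def adaptive_seed_length(query, start, index, max_matches):
    """Shortest match from query[start] that occurs at most max_matches times."""
    for end in range(start + 1, len(query) + 1):
        count = len(match_positions(query[start:end], index))
        if count == 0:
            return None  # no match of this length, so no seed here
        if count <= max_matches:
            return end - start
    return None  # even the longest match is too frequent


reference = "atatatatatatgctagtcgtactgctg"  # made-up toy reference
index = build_index(reference)

print(adaptive_seed_length("atatgctag", 0, index, 1))  # starts in a repeat
print(adaptive_seed_length("gtcgt", 0, index, 1))      # starts in unique sequence
```

Seeds that start in repetitive sequence come out longer than seeds in unique sequence, so repeats no longer generate a flood of seed matches.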
An aligner that uses adaptive seeds
Since 2008
http://last.cbrc.jp/
Warning: LAST was made by me and colleagues
LAST run times
• Compare the human and chicken genomes
– 3.5 hours
• Align 1 million length-87 DNA reads to the human genome
– 6 minutes
• (Using 1 CPU core)
Simulated DNA reads
[Plot: sensitivity (% of reads that are correctly aligned) versus error rate (% of aligned reads that are wrong)]
Run time (minutes):
Method              Time (min)
bwa                 16
bwa-n10             67
last                41
last
last
novoalign           518
shrimp2             ?
stampy              72
stampy (sensitive)  248
For more detail
• Adaptive seeds tame genomic sequence comparison. Kiełbasa SM, Wan R, Sato K, Horton P, Frith MC. Genome Research 2011 21:487.
• Incorporating sequence quality data into alignment improves DNA read mapping. Frith MC, Wan R, Horton P. Nucleic Acids Research 2010 38:e100.
• A mostly traditional approach improves alignment of bisulfite-converted DNA. Frith MC, Mori R, Asai K. Nucleic Acids Research 2012 40:e100.
Summary
• It is feasible to use classic, statistical alignment approaches with large modern sequence datasets
– This is beneficial for modeling: diverged sequences, biased base frequencies, etc.
• Alignment ambiguity should be used more often
• Try to avoid cargo cult science!
Main collaborators
Paul Horton CBRC
Michiaki Hamada U of Tokyo / CBRC
Szymon Kiełbasa Leiden University
Programming wisdom
• Measuring programming progress by lines of code is like measuring aircraft building progress by weight. – Bill Gates
• As you're about to add a comment, ask yourself, 'How can I improve the code so that this comment isn't needed?' – Steve McConnell
• The key to performance is elegance, not battalions of special cases. – Jon Bentley and M. Douglas McIlroy
• Weeks of programming can save you hours of planning. – Unknown
• Always code as if the guy who ends up maintaining your code will be a violent psychopath who knows where you live. – Unknown