Finding and Aligning Related Sequences (Martin Frith)

Finding and aligning related sequences

Martin C. Frith

Computational Biology Research Center, AIST, Tokyo

www.cbrc.jp/~martin

2012-12-09 @ BioinfoSummer, Adelaide



• Examples
• What are we really trying to do?
• Simple sequences
• Repeat masking
• Classic score-based alignment
• Alignment & probability models
• Alignment ambiguity
• Using sequence quality data
• Scaling to huge datasets

3

Compare human and mouse genomes

gctagtgtac

|||  || ||

gct--tgaac

aa-gtaca

|| |||||

aaggtaca

4

Human

Mouse

5

Compare DNA from a patient to a reference genome

Patient DNA Sequencer

ctatgctagtcgta

cctatagtctgtatg

atatatatattatta

ccctagtcgtatgg

tttaccagctgga

ctagtcgtagtgtgg

ctgattgcttatttacgttcgtatgctagctgatcgtagtcgtcgagcttatcgtgggc

Reference genome sequence

ctagcttatcgt

DNA reads

6

What kinds of microbial genes are there?

Water from e.g. a hot spring

DNA Sequencer

ctatgctagtcgta

cctatagtctgtatg

atatatatattatta

ccctagtcgtatgg

tttaccagctgga

ctagtcgtagtgtgg

DNA reads

ArgLysTyrProPheLeuLeuIsoArgLysPheAlaPro-ProGlyGlyAlaGly…

atatatatatattagccgt

|||...||| |||...|||

GlyGlyPhePheGlyAlaLeuCysCysTrpTrpAlaGlyAlaPro…

Database of all known proteins

7

More examples

• Compare ancient DNA to a reference genome

– Mammoth, neanderthal, Turin Shroud, …

• Align (potentially spliced) RNA sequences to a reference genome

– To see which genes are active

• Align short DNA reads to each other

– In order to assemble them

8



9

What are we really trying to do?

1. Find and align similar sequences?

2. Find and align homologous sequences?

3. Find and align orthologous sequences?

4. Find and align paralogous sequences?

10

Homology, orthology, paralogy

11

Homology: descent from a common ancestor

Orthology: descent from a common ancestor by genome division

Paralogy: descent from a common ancestor by duplication within a genome

Past

Present

Example

human mouse

β1-globin β2-globin

β-globin

12

Example

human


β-globin

Orthologs

Paralogs

mouse

13

Example

human


β-globin

• Orthology is not necessarily 1-to-1
• Orthology is not transitive

Not an equivalence relation

mouse

Orthologs

Orthologs

14






15


What is the aim?

• Find similar sequences
• Find homologs
• Find orthologs
• Find paralogs

16



Do we want to align mouse α-globin with

human β-globin?

Probably not

17



ctatgctagtcgta

cctatagtctgtatg

atatatatattatta

ccctagtcgtatgg

tttaccagctgga

ctagtcgtagtgtgg



DNA reads


18



ctatgctagtcgta

cctatagtctgtatg

atatatatattatta

ccctagtcgtatgg

tttaccagctgga

ctagtcgtagtgtgg



DNA reads


Do we want to align the patient’s α-globin to the

reference’s β-globin?

19






20

Aims and algorithms

• Sequence comparison algorithms basically find similar sequences

• Finding homologs is harder

• Finding orthologs is even harder

21

Similarity versus homology

Similar sequences

Homologous sequences

Convergent evolution

Rapid evolution over a long time span

22

• The most frequent case of convergent evolution is simple sequences

23



24

Simple sequences

• DNA (and RNA and protein) frequently has simple sequences:

atgatcgattatcgtagtctaggtcgtatgctatgatt

cgataaaaaaaaaaaaaaaaaaacggtatgcgtagctg

cgatcgtagtgactatatgagagaggattcgatgctaa

gttctctaggagaggcttaggctgagcgcgtatcactg

gctcgcggctgtgtgtgtgtgtgtgtgtgtgtgtgtga

cgtatcgcacatcgtcgattttgagattcccgatggcc

25

How do simple sequences evolve?

• Strand slippage during DNA replication:

catcatcatc

gtagtagtagtagtagta

26



catcatcatca

gtagtagtagtagtagta

27



catcatcatcat

gtagtagtagtagtagta

28



catcatcatcatc

gtagtagtagtagtagta

29



catcatcatcatca

gtagtagtagtagtagta

30



catcatcatcatcat

gtagtagtagtagtagta

31



32

catcat

gtagtagtagtagtagta

cat



33

catcat

gtagtagtagtagtagta

catc



34

catcat

gtagtagtagtagtagta

catca



35

catcat

gtagtagtagtagtagta

catcat



36

catcat

gtagtagtagtagtagta

catcat

On the top strand, it has got longer


• An initial (short, mild) simple sequence occurs by chance

• Due to slippage, it gets longer…

• And longer…

37

Homology between human and banana?

• Probably not.

38

atatatatatatatatatatatatatatatatatatatata

|||||||||||||||||||||||||||||||||||||||||


Human

Banana

Avoiding non-homologous alignments of simple sequences

• The standard way is to identify and “mask” them, before alignment

39



40

Repeat masking

• There are standard “repeat masking” tools

– RepeatMasker, DustMasker, SegMasker, TRF, …

• Most people just assume they work

41

Repeat confusion

atcttatgtctctctctctctctctctctggatgcttgaccac

cttgttattgctgatcgtcctctctgtaaattgttattgctgatcatgctttaac

Simple sequence:

Interspersed repeat:

They are both called “repeats”, but they are rather different. Don’t confuse them.

42

Test of avoiding non-homologous alignments

• Compare two sequences after reversing one of them

• Sequences never evolve by reversal, so there are no true homologs in this test

• But repeats may still cause strong similarities, if they are not suppressed

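The reversal test above can be sketched in a few lines. This is only an illustration, not the procedure used for the results on the following slides: the function names are hypothetical, and exact shared k-mers stand in for real alignments. Note that the sequence is reversed, not reverse-complemented.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def reversal_test(seq_a, seq_b, k):
    """Return the k-mers shared by seq_a and the REVERSED seq_b.

    Reversal (not reverse-complement) destroys true homology,
    so anything found here is a spurious similarity.
    """
    return kmers(seq_a, k) & kmers(seq_b[::-1], k)


# A simple sequence still matches its reversed counterpart...
print(reversal_test("atatatatatatatat", "atatatatatatatatatat", 8))
# ...whereas these two non-repetitive sequences share nothing.
print(reversal_test("acgtgcatgcta", "ttgacctgaagc", 6))
```

Any hits that survive masking in such a test point to repeats that the masking tool failed to suppress.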

43


|||||||||||||||||||||||||||||||||||||||||


Human

Banana

Test result

The C. elegans genome versus the reversed P. pacificus genome, after masking both with DustMasker:

Red: observed number of alignments
Black: expected number of alignments for random sequences (E-value)

44

A spurious alignment

Upper sequence: from C. elegans
Lower sequence: from reversed P. pacificus

Conclusion: DustMasker fails to mask some tandem repeats

45

Other methods?

46

• SegMasker does not work either:

Upper sequence: part of an animal protein
Lower sequence: part of a reversed plant protein

• Nor does RepeatMasker, TRF…

A new repeat-masking method enables specific detection of homologous sequences. Frith MC. Nucleic Acids Research 2011 39:e23.

Repeat masking

• There are standard “repeat masking” tools

– RepeatMasker, DustMasker, SegMasker, TRF, …

• Most people just assume they work

– Cargo cult science

• Genomic bioinformatics is riddled with it

47

New repeat-masking method

• tantan: http://www.cbrc.jp/tantan/

• It looks for slippery regions in sequences

• Slippery = similar to shifted versions of itself

• It integrates similarity at different slip distances, using a Forward-Backward algorithm
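To make "similar to shifted versions of itself" concrete, here is a minimal sketch that measures, for each slip distance, the fraction of letters identical to the letter that many positions earlier. It is not tantan's algorithm (which combines the slip distances in a probabilistic model with a Forward-Backward algorithm); the function name and the simple identity fraction are assumptions made for illustration.

```python
def shift_similarity(seq, max_shift):
    """For each shift d, return the fraction of positions i with seq[i] == seq[i - d]."""
    result = {}
    for d in range(1, max_shift + 1):
        pairs = len(seq) - d  # number of comparable position pairs
        if pairs <= 0:
            break
        same = sum(seq[i] == seq[i - d] for i in range(d, len(seq)))
        result[d] = same / pairs
    return result


# A "gt" repeat is perfectly similar to itself shifted by 2, but not by 1 or 3.
print(shift_similarity("gtgtgtgtgtgt", 3))
```

A region scoring highly at any small shift is a candidate slippery region.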

48


tantan test result

The C. elegans genome versus the reversed P. pacificus genome, after masking with tantan:

Red: observed number of alignments
Black: expected number of alignments for random sequences (E-value)

49

Conclusion

• tantan prevents simple-sequence alignments

• Without masking an excessive amount

• It even works for extremely AT-rich DNA

– Plasmodium falciparum (malaria): 80% AT

– Dictyostelium discoideum (slime mould): 80% AT

50



51

Classic score-based alignment

52

1. Define a scoring scheme

2. Find alignments with high (maximum) scores

Alignment scoring scheme

53

Substitution score matrix:

     a   c   g   t
a    2  -3  -1  -3
c   -3   2  -3  -1
g   -1  -3   2  -3
t   -3  -1  -3   2

Gap scores:

Gap existence cost: 5
Gap extension cost: 1


54

Example (same substitution score matrix and gap scores):

tacgtg--aggt
||| ||  ||||
tacatgctaggt

Column scores: 2 +2 +2 -1 +2 +2 -7 +2 +2 +2 +2

The gap of length 2 costs 5 + 2 × 1 = 7.

Alignment score: 10
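The score of a given alignment can be checked with a short sketch. It assumes the scoring scheme from the slides (the substitution matrix, gap existence cost 5, gap extension cost 1); the function and variable names are made up for illustration.

```python
# Substitution scores copied from the slide's matrix.
SCORES = {
    "a": {"a": 2, "c": -3, "g": -1, "t": -3},
    "c": {"a": -3, "c": 2, "g": -3, "t": -1},
    "g": {"a": -1, "c": -3, "g": 2, "t": -3},
    "t": {"a": -3, "c": -1, "g": -3, "t": 2},
}


def alignment_score(top, bottom, scores=SCORES, gap_exist=5, gap_extend=1):
    """Score a gapped alignment given as two equal-length rows ('-' marks a gap)."""
    if len(top) != len(bottom):
        raise ValueError("alignment rows must have equal length")
    total = 0
    open_gap = None  # which row currently holds an open gap: "top", "bottom" or None
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            row = "top" if x == "-" else "bottom"
            if row != open_gap:
                total -= gap_exist  # paid once per gap
                open_gap = row
            total -= gap_extend  # paid for every gapped column
        else:
            total += scores[x][y]
            open_gap = None
    return total


print(alignment_score("tacgtg--aggt", "tacatgctaggt"))  # 10
```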

Classic score-based alignment

55

1. Define a scoring scheme

2. Find alignments with high (maximum) scores

tacgtg--aggt

||| ||  ||||

tacatgctaggt

ctatgctacgtgaggtgtggc

attacatgctaggtccac

How to find alignments with max score?

• Smith-Waterman algorithm
  – Exact: guarantees to find the max score
  – A bit slow

• BLAST, FASTA, etc
  – Heuristic: no guarantee
  – Faster
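As a rough illustration of the exact approach, here is a minimal Smith-Waterman sketch with affine gap costs (the Gotoh formulation), returning only the maximum local alignment score. It assumes the slides' scoring scheme; the names are hypothetical, and it makes no attempt at speed, taking time proportional to the product of the sequence lengths.

```python
SCORES = {
    "a": {"a": 2, "c": -3, "g": -1, "t": -3},
    "c": {"a": -3, "c": 2, "g": -3, "t": -1},
    "g": {"a": -1, "c": -3, "g": 2, "t": -3},
    "t": {"a": -3, "c": -1, "g": -3, "t": 2},
}


def smith_waterman_score(x, y, scores=SCORES, gap_exist=5, gap_extend=1):
    """Return the maximum local alignment score of x and y (affine gap costs)."""
    neg_inf = float("-inf")
    n = len(y)
    h_prev = [0] * (n + 1)        # best score of an alignment ending at (i-1, j)
    f_prev = [neg_inf] * (n + 1)  # same, but ending with a letter of x opposite a gap
    best = 0
    for i in range(1, len(x) + 1):
        h_cur = [0] * (n + 1)
        f_cur = [neg_inf] * (n + 1)
        e = neg_inf               # ending with a letter of y opposite a gap
        for j in range(1, n + 1):
            # Open a new gap (existence + first extension) or extend an open one.
            e = max(h_cur[j - 1] - gap_exist - gap_extend, e - gap_extend)
            f_cur[j] = max(h_prev[j] - gap_exist - gap_extend, f_prev[j] - gap_extend)
            diagonal = h_prev[j - 1] + scores[x[i - 1]][y[j - 1]]
            # The 0 lets a local alignment start afresh anywhere.
            h_cur[j] = max(0, diagonal, e, f_cur[j])
            best = max(best, h_cur[j])
        h_prev, f_prev = h_cur, f_cur
    return best


print(smith_waterman_score("ctatgctacgtgaggtgtggc", "attacatgctaggtccac"))
```

Heuristic tools such as BLAST avoid filling in this whole table, which is why they are faster but offer no guarantee.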

56

tacgtg--aggt

||| ||  ||||

tacatgctaggt

ctatgctacgtgaggtgtggc

attacatgctaggtccac



57


• Where do these scores come from?

• Why is this a good method anyway?

58

a c g t

a 2 -3 -1 -3

c -3 2 -3 -1

g -1 -3 2 -3

t -3 -1 -3 2


Substitution score matrix Gap scores

Scores are log likelihood ratios

$S_{xy} = t \times \log\left(\frac{A_{xy}}{P_x \times Q_y}\right)$

$A_{xy}$: probability of x aligned to y in a true alignment (model of homologous sequences)

$P_x$: probability of x in the first sequence

$Q_y$: probability of y in the second sequence

$P_x \times Q_y$: model of independent sequences
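The log likelihood ratio formula can be tried out numerically. The sketch below is illustrative only: the helper `uniform_model`, which spreads a chosen percent identity evenly over matches and mismatches with uniform base frequencies, is an assumption and not something the slides specify.

```python
import math


def score_matrix(aligned_prob, prob_first, prob_second, t):
    """S[x][y] = t * log(A[x][y] / (P[x] * Q[y]))."""
    return {
        x: {
            y: t * math.log(aligned_prob[x][y] / (prob_first[x] * prob_second[y]))
            for y in prob_second
        }
        for x in prob_first
    }


def uniform_model(identity, letters="acgt"):
    """Hypothetical model of homologous sequences: all matches equally likely,
    all mismatches equally likely, probabilities summing to 1."""
    n = len(letters)
    match = identity / n
    mismatch = (1 - identity) / (n * (n - 1))
    return {x: {y: match if x == y else mismatch for y in letters} for x in letters}


background = {letter: 0.25 for letter in "acgt"}
scores = score_matrix(uniform_model(0.75), background, background, t=1 / math.log(3))
print(round(scores["a"]["a"], 6), round(scores["a"]["c"], 6))
```

Under these assumptions, 75% identity gives a likelihood ratio of 3 for matches and 1/3 for mismatches, so the scale t = 1/ln 3 yields a +1/-1 matrix.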

Different matrices for different tasks

60

Strong similarities (~99% identity):

     a   c   g   t
a    1  -3  -3  -3
c   -3   1  -3  -3
g   -3  -3   1  -3
t   -3  -3  -3   1

Weak similarities (~75% identity):

     a   c   g   t
a    1  -1  -1  -1
c   -1   1  -1  -1
g   -1  -1   1  -1
t   -1  -1  -1   1

AT-rich DNA (e.g. malaria):

     a   c   g   t
a    2  -3  -2  -3
c   -3   5  -3  -2
g   -2  -3   5  -3
t   -3  -2  -3   2

Bisulfite-converted DNA:

     a   c   g   t
a    2  -6  -6  -6
c   -6   2  -6   1
g   -6  -6   2  -6
t   -6  -6  -6   1

What about gap scores?

Pair hidden Markov model

The arrows describe probabilities for insertions and deletions. (It looks more complicated than it really is.)

A useful formula

$\mathrm{Prob}(\text{alignment}) \propto \exp(\text{alignment score} / t)$



63

Alignment ambiguity

ctagctaaccgtatcgtgggc
||||| |   |||||  | ||
ctagcca---gtatctagtgc

Or

ctagctaaccgtatcgtgggc
|||||   | |||||  | ||
ctagc---cagtatctagtgc

?

64

Per-column probabilities

…  g   c   a   t   c   c   t   t   g   g   g   t   c   t   c   g   a   c   a   t  …
…  g   c   c   t   c   g   t   t   a   g   a   -   -   t   a   g   a   t   a   g  …
  .99 .99 .99 .95 .93 .92 .90 .79 .55 .33 .16 .22 .49 .55 .59 .71 .93 .97 .98 .99

65

Importance

• Column reliability is important for:

– Studying the evolution of binding sites

– Identifying polymorphisms

– Finding recombination breakpoints

– …

66


How to calculate ambiguity


67

Prob(column) = sum of probs of all alignments that include the column

$\mathrm{Prob}(\text{column}) = \frac{\sum_{\text{alignments that include the column}} \exp(\text{score} / t)}{\sum_{\text{all alignments}} \exp(\text{score} / t)}$
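The two sums need not be enumerated alignment by alignment: a forward pass and a backward pass over the dynamic programming table give them directly. The sketch below is a simplified illustration, not LAST's implementation. It assumes global alignment, a single match/mismatch score pair, a linear gap score, and that each distinct path through the table counts as one alignment; all names and default values are hypothetical.

```python
import math


def column_probabilities(x, y, match=2, mismatch=-3, gap=-4, t=1.0):
    """Return prob[i][j]: probability that x[i] is aligned to y[j]."""
    n, m = len(x), len(y)
    gap_weight = math.exp(gap / t)

    def sub_weight(a, b):
        return math.exp((match if a == b else mismatch) / t)

    # fwd[i][j]: sum of exp(score / t) over all alignments of x[:i] with y[:j]
    fwd = [[0.0] * (m + 1) for _ in range(n + 1)]
    fwd[0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i and j:
                fwd[i][j] += fwd[i - 1][j - 1] * sub_weight(x[i - 1], y[j - 1])
            if i:
                fwd[i][j] += fwd[i - 1][j] * gap_weight
            if j:
                fwd[i][j] += fwd[i][j - 1] * gap_weight

    # bwd[i][j]: sum of exp(score / t) over all alignments of x[i:] with y[j:]
    bwd = [[0.0] * (m + 1) for _ in range(n + 1)]
    bwd[n][m] = 1.0
    for i in range(n, -1, -1):
        for j in range(m, -1, -1):
            if i < n and j < m:
                bwd[i][j] += bwd[i + 1][j + 1] * sub_weight(x[i], y[j])
            if i < n:
                bwd[i][j] += bwd[i + 1][j] * gap_weight
            if j < m:
                bwd[i][j] += bwd[i][j + 1] * gap_weight

    total = fwd[n][m]  # denominator: sum over all alignments
    return [
        [fwd[i][j] * sub_weight(x[i], y[j]) * bwd[i + 1][j + 1] / total for j in range(m)]
        for i in range(n)
    ]


probs = column_probabilities("ctagctaaccgtatc", "ctagccagtatc")
print([round(max(row), 2) for row in probs])
```

Each entry multiplies everything that can happen before the column, the column itself, and everything that can happen after it, then divides by the total.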

An aligner that indicates ambiguity

68

Since 2008

http://last.cbrc.jp/

Warning: LAST was made by me and colleagues




69

Sequence quality data

• Some DNA sequencers estimate the error probability of every base

• We ought to use this information when comparing sequences

t a g c t g a

0.01 0.02 0.07 0.24 0.32 0.75 0.75

70

General case: compare 2 sequences with error probabilities

a t g c c …

0.01 0.02 0.02 0.09 0.17

g t a c c …

0.03 0.01 0.08 0.12 0.44

71

Traditional sequence comparison: score matrix, modelling real substitutions (mutation / evolution)

$S_{xy} = \log\left[\frac{A_{xy}}{B_x B_y}\right]$

     a   c   g   t
a    2  -3  -1  -3
c   -3   2  -3  -1
g   -1  -3   2  -3
t   -3  -1  -3   2

Giga-sequencers: error probabilities, modelling erroneous substitutions

t    a    g    c    t
0.01 0.02 0.07 0.24 0.32

Generalized log likelihood ratio:

$S_{x_p y_q} = \log\left[pq \frac{A_{xy}}{B_x B_y} + (1 - pq)\right]$

72

Incorporating sequence quality data into alignment improves DNA read mapping. Frith MC, Wan R, Horton P. Nucleic Acids Research 2010 38:e100
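The generalized log likelihood ratio is easy to evaluate numerically. The sketch below rests on assumptions the slide does not spell out: p and q are taken to be the probabilities that letters x and y were read correctly (one minus the error probabilities), and the likelihood ratio is recovered from a matrix score as exp(S / t) with a hypothetical scale t. The function name is made up.

```python
import math


def quality_aware_score(s_xy, p, q, t=1.0):
    """Generalized score: t * log(p*q * ratio + (1 - p*q)), with ratio = exp(s_xy / t).

    p, q: ASSUMED to be the probabilities that x and y are correct base calls.
    """
    ratio = math.exp(s_xy / t)  # likelihood ratio A_xy / (B_x * B_y)
    pq = p * q
    return t * math.log(pq * ratio + (1.0 - pq))


# Error probabilities 0.01 and 0.07 give p = 0.99 and q = 0.93.
print(quality_aware_score(2, 0.99, 0.93))   # slightly below the matrix score 2
print(quality_aware_score(-3, 0.99, 0.93))  # a mismatch is penalised less than -3
```

With perfectly reliable letters the matrix score is recovered; as reliability drops, both matches and mismatches shrink towards 0, because an unreliable letter carries little evidence either way.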

An aligner that combines score matrix & quality data

73

Since 2008






74

Why is BLAST too slow?

75


1. Find “seeds” (initial matches) of a fixed length (e.g. 11)
2. Try extending an alignment from each seed

…atcgtatcgtatcgtactgctggcctagtggggga…

…ctcgtcgatgctagtcgtactgctgatgctatatatatattaatg…
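Step 1 can be sketched with a simple k-mer index. This is an illustration of fixed-length seeding in general, not BLAST's actual code; the function name is hypothetical and the two sequences are the fragments shown on the slide.

```python
from collections import defaultdict


def find_seeds(seq_a, seq_b, k=11):
    """Return (i, j) pairs where seq_a[i:i+k] == seq_b[j:j+k]."""
    index = defaultdict(list)  # k-mer -> start positions in seq_a
    for i in range(len(seq_a) - k + 1):
        index[seq_a[i:i + k]].append(i)
    seeds = []
    for j in range(len(seq_b) - k + 1):
        for i in index.get(seq_b[j:j + k], ()):
            seeds.append((i, j))
    return seeds


seq_a = "atcgtatcgtatcgtactgctggcctagtggggga"
seq_b = "ctcgtcgatgctagtcgtactgctgatgctatatatatattaatg"
for i, j in find_seeds(seq_a, seq_b):
    print(i, j, seq_a[i:i + 11])
```

Every seed triggers an extension attempt in step 2, so the number of seeds largely determines the run time.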

76





Problem

Non-uniform composition:

atatatatatatatatatata Alu

LINEs SINEs Isochores

CpG islands

too many seeds → too many extensions → too slow

77

Example

• Compare the human and chimp genomes

• Each genome has ~ 1 million Alu elements

• So we will get ~ 10^12 seed matches (10^6 × 10^6)…

78

Solution: adaptive seeds

1. Find “seeds” (initial matches) of a fixed rareness (instead of a fixed length)
2. Try extending an alignment from each seed



79

Adaptive seeds can be found efficiently by using a suffix array
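The idea can be sketched as follows. This is a simplified illustration, not LAST's implementation: it assumes that the adaptive seed at a query position is the shortest match starting there that occurs at most `max_hits` times in the text, and it builds the suffix array naively by sorting. All names are hypothetical.

```python
def build_suffix_array(text):
    """Start positions of all suffixes of text, in sorted order (naive build)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def narrow(text, sa, lo, hi, depth, letter):
    """Shrink interval [lo, hi) of suffixes sharing a length-depth prefix
    to those whose next letter equals `letter` (two binary searches)."""
    def letter_at(k):
        pos = sa[k] + depth
        return text[pos] if pos < len(text) else ""  # "" sorts before any letter

    a, b = lo, hi
    while a < b:  # first suffix whose next letter is >= letter
        mid = (a + b) // 2
        if letter_at(mid) < letter:
            a = mid + 1
        else:
            b = mid
    new_lo, b = a, hi
    while a < b:  # first suffix whose next letter is > letter
        mid = (a + b) // 2
        if letter_at(mid) <= letter:
            a = mid + 1
        else:
            b = mid
    return new_lo, a


def adaptive_seeds(text, query, max_hits=2):
    """Return (query_start, length, text_positions) for each query position
    whose match becomes rare enough before the query ends."""
    sa = build_suffix_array(text)
    seeds = []
    for start in range(len(query)):
        lo, hi, depth = 0, len(sa), 0
        # Lengthen the match until it is rare enough (or vanishes).
        while hi - lo > max_hits and start + depth < len(query):
            lo, hi = narrow(text, sa, lo, hi, depth, query[start + depth])
            depth += 1
        if 0 < hi - lo <= max_hits:
            seeds.append((start, depth, sorted(sa[k] for k in range(lo, hi))))
    return seeds


text = "atatatatatcgta"
for seed in adaptive_seeds(text, "tatcg"):
    print(seed)
```

In this toy example the match starting inside the "at" repeat has to grow to length 4 before it is rare, whereas the rare letters "c" and "g" already qualify at length 1.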

An aligner that uses adaptive seeds

80

Since 2008




LAST run times

• Compare the human and chicken genomes

– 3.5 hours

• Align 1 million length-87 DNA reads to the human genome

– 6 minutes

• (Using 1 CPU core)

81

82

[Figure: simulated DNA reads. Axes: sensitivity (% of reads that are correctly aligned) and error rate (% of aligned reads that are wrong). Run time (minutes) per method:]

Method               Time (min)
bwa                  16
bwa-n10              67
last                 41
last
last
novoalign            518
shrimp2              ?
stampy               72
stampy (sensitive)   248

For more detail

• Adaptive seeds tame genomic sequence comparison. Kiełbasa SM, Wan R, Sato K, Horton P, Frith MC. Genome Research 2011 21:487

• Incorporating sequence quality data into alignment improves DNA read mapping. Frith MC, Wan R, Horton P. Nucleic Acids Research 2010 38:e100

• A mostly traditional approach improves alignment of bisulfite-converted DNA. Frith MC, Mori R, Asai K. Nucleic Acids Research 2012 40:e100

84

Summary

• It is feasible to use classic, statistical alignment approaches with large modern sequence datasets

– This is beneficial for modeling: diverged sequences, biased base frequencies, etc.

• Alignment ambiguity should be used more often

• Try to avoid cargo cult science!

85

Main collaborators

Paul Horton CBRC

Michiaki Hamada U of Tokyo / CBRC

Szymon Kielbasa Leiden University

Programming wisdom

• Measuring programming progress by lines of code is like measuring aircraft building progress by weight. – Bill Gates

• As you're about to add a comment, ask yourself, 'How can I improve the code so that this comment isn't needed?' – Steve McConnell

• The key to performance is elegance, not battalions of special cases. – Jon Bentley and M. Douglas McIlroy

• Weeks of programming can save you hours of planning. – Unknown

• Always code as if the guy who ends up maintaining your code will be a violent psychopath who knows where you live. – Unknown

87
