Finding and Aligning Related Sequences (Martin Frith)

Finding and aligning related sequences Martin C. Frith Computational Biology Research Center AIST, Tokyo www.cbrc.jp/~martin 2012-12-09 @ BioinfoSummer, Adelaide


Page 1: Finding and Aligning Related Sequences (Martin Frith)

Finding and aligning related sequences

Martin C. Frith Computational Biology Research Center

AIST, Tokyo www.cbrc.jp/~martin

2012-12-09 @ BioinfoSummer, Adelaide

Page 2: Finding and Aligning Related Sequences (Martin Frith)

CBRC

www.cbrc.jp

Page 3: Finding and Aligning Related Sequences (Martin Frith)

Finding and aligning related sequences

• Examples
• What are we really trying to do?
• Simple sequences
• Repeat masking
• Classic score-based alignment
• Alignment & probability models
• Alignment ambiguity
• Using sequence quality data
• Scaling to huge datasets

3

Page 4: Finding and Aligning Related Sequences (Martin Frith)

Compare human and mouse genomes

gctagtgtac
|||  || ||
gct--tgaac

aa-gtaca
|| |||||
aaggtaca

4

Page 5: Finding and Aligning Related Sequences (Martin Frith)

Human

Mouse

5

Page 6: Finding and Aligning Related Sequences (Martin Frith)

Compare DNA from a patient to a reference genome

Patient DNA Sequencer

ctatgctagtcgta

cctatagtctgtatg

atatatatattatta

ccctagtcgtatgg

tttaccagctgga

ctagtcgtagtgtgg

ctgattgcttatttacgttcgtatgctagctgatcgtagtcgtcgagcttatcgtgggc

Reference genome sequence

ctagcttatcgt

DNA reads

6

Page 7: Finding and Aligning Related Sequences (Martin Frith)

What kinds of microbial genes are there?

Water from e.g. a hot spring

DNA Sequencer

ctatgctagtcgta

cctatagtctgtatg

atatatatattatta

ccctagtcgtatgg

tttaccagctgga

ctagtcgtagtgtgg

DNA reads

ArgLysTyrProPheLeuLeuIsoArgLysPheAlaPro-ProGlyGlyAlaGly…

atatatatatattagccgt

|||...||| |||...|||

GlyGlyPhePheGlyAlaLeuCysCysTrpTrpAlaGlyAlaPro…

Database of all known proteins

Page 8: Finding and Aligning Related Sequences (Martin Frith)

More examples

• Compare ancient DNA to a reference genome

– Mammoth, Neanderthal, Turin Shroud, …

• Align (potentially spliced) RNA sequences to a reference genome

– To see which genes are active

• Align short DNA reads to each other

– In order to assemble them

8

Page 9: Finding and Aligning Related Sequences (Martin Frith)

Finding and aligning related sequences

• Examples
• What are we really trying to do?
• Simple sequences
• Repeat masking
• Classic score-based alignment
• Alignment & probability models
• Alignment ambiguity
• Using sequence quality data
• Scaling to huge datasets

9

Page 10: Finding and Aligning Related Sequences (Martin Frith)

What are we really trying to do?

1. Find and align similar sequences?

2. Find and align homologous sequences?

3. Find and align orthologous sequences?

4. Find and align paralogous sequences?

10

Page 11: Finding and Aligning Related Sequences (Martin Frith)

Homology, orthology, paralogy

11

Homology: descent from a common ancestor

Orthology: descent from a common ancestor by genome division

Paralogy: descent from a common ancestor by duplication within a genome

Past

Present

Page 12: Finding and Aligning Related Sequences (Martin Frith)

Example

human mouse

β1-globin β2-globin

β-globin

12

Page 13: Finding and Aligning Related Sequences (Martin Frith)

Example

human

β1-globin β2-globin

β-globin

Orthologs

Paralogs

mouse

13

Page 14: Finding and Aligning Related Sequences (Martin Frith)

Example

human

β1-globin β2-globin

β-globin

• Orthology is not necessarily 1-to-1
• Orthology is not transitive

Not an equivalence relation

mouse

Orthologs

Orthologs

14
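The globin example can be sketched in code, as a rough illustration rather than a real orthology-detection method. The tiny gene tree below is an assumption read off the slide (one genome division separating human from mouse, then one duplication within the mouse lineage), and the names `parent`, `event`, and `relation` are hypothetical. Two genes are labelled orthologs if their last common ancestor is a genome division, and paralogs if it is a duplication.

```python
# Toy gene tree for the beta-globin example (tree shape is an assumption).
SPECIATION = "speciation"    # "genome division" on the slide
DUPLICATION = "duplication"

# child -> parent
parent = {
    "human_beta": "root",
    "mouse_ancestor": "root",
    "mouse_beta1": "mouse_ancestor",
    "mouse_beta2": "mouse_ancestor",
}

# internal node -> event that happened at that node
event = {"root": SPECIATION, "mouse_ancestor": DUPLICATION}


def ancestors(node):
    """Return the path from a node up to the root, starting with the node."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def relation(gene_a, gene_b):
    """Classify two different genes by the event at their last common ancestor."""
    seen = set(ancestors(gene_a))
    last_common = next(n for n in ancestors(gene_b) if n in seen)
    return "orthologs" if event[last_common] == SPECIATION else "paralogs"


print(relation("human_beta", "mouse_beta1"))
print(relation("human_beta", "mouse_beta2"))
print(relation("mouse_beta1", "mouse_beta2"))
```

In this toy tree, human β-globin is an ortholog of both mouse genes, yet the two mouse genes are paralogs of each other, which is why orthology is neither 1-to-1 nor transitive.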

Page 15: Finding and Aligning Related Sequences (Martin Frith)

What are we really trying to do?

1. Find and align similar sequences?

2. Find and align homologous sequences?

3. Find and align orthologous sequences?

4. Find and align paralogous sequences?

15

Page 16: Finding and Aligning Related Sequences (Martin Frith)

Compare human and mouse genomes

What is the aim?
• Find similar sequences
• Find homologs
• Find orthologs
• Find paralogs

16

Page 17: Finding and Aligning Related Sequences (Martin Frith)

Compare human and mouse genomes

What is the aim?
• Find similar sequences
• Find homologs
• Find orthologs
• Find paralogs

Do we want to align mouse α-globin with human β-globin?

Probably not

17

Page 18: Finding and Aligning Related Sequences (Martin Frith)

Compare DNA from a patient to a reference genome

Patient DNA Sequencer

ctatgctagtcgta

cctatagtctgtatg

atatatatattatta

ccctagtcgtatgg

tttaccagctgga

ctagtcgtagtgtgg

ctgattgcttatttacgttcgtatgctagctgatcgtagtcgtcgagcttatcgtgggc

Reference genome sequence

DNA reads

What is the aim?
• Find similar sequences
• Find homologs
• Find orthologs
• Find paralogs

18

Page 19: Finding and Aligning Related Sequences (Martin Frith)

Compare DNA from a patient to a reference genome

Patient DNA Sequencer

ctatgctagtcgta

cctatagtctgtatg

atatatatattatta

ccctagtcgtatgg

tttaccagctgga

ctagtcgtagtgtgg

ctgattgcttatttacgttcgtatgctagctgatcgtagtcgtcgagcttatcgtgggc

Reference genome sequence

DNA reads

What is the aim?
• Find similar sequences
• Find homologs
• Find orthologs
• Find paralogs

Do we want to align the patient’s α-globin to the reference’s β-globin?

19

Page 20: Finding and Aligning Related Sequences (Martin Frith)

What are we really trying to do?

1. Find and align similar sequences?

2. Find and align homologous sequences?

3. Find and align orthologous sequences?

4. Find and align paralogous sequences?

20

Page 21: Finding and Aligning Related Sequences (Martin Frith)

Aims and algorithms

• Sequence comparison algorithms basically find similar sequences

• Finding homologs is harder

• Finding orthologs is even harder

21

Page 22: Finding and Aligning Related Sequences (Martin Frith)

Similarity versus homology

Similar sequences

Homologous sequences

Convergent evolution

Rapid evolution over a long time span

22

Page 23: Finding and Aligning Related Sequences (Martin Frith)

• The most frequent case of convergent evolution is simple sequences

23

Page 24: Finding and Aligning Related Sequences (Martin Frith)

Finding and aligning related sequences

• Examples
• What are we really trying to do?
• Simple sequences
• Repeat masking
• Classic score-based alignment
• Alignment & probability models
• Alignment ambiguity
• Using sequence quality data
• Scaling to huge datasets

24

Page 25: Finding and Aligning Related Sequences (Martin Frith)

Simple sequences

• DNA (and RNA and protein) frequently has simple sequences:

atgatcgattatcgtagtctaggtcgtatgctatgatt

cgataaaaaaaaaaaaaaaaaaacggtatgcgtagctg

cgatcgtagtgactatatgagagaggattcgatgctaa

gttctctaggagaggcttaggctgagcgcgtatcactg

gctcgcggctgtgtgtgtgtgtgtgtgtgtgtgtgtga

cgtatcgcacatcgtcgattttgagattcccgatggcc

25

Page 26: Finding and Aligning Related Sequences (Martin Frith)

How do simple sequences evolve?

• Strand slippage during DNA replication:

catcatcatc

gtagtagtagtagtagta

26

Page 27: Finding and Aligning Related Sequences (Martin Frith)

How do simple sequences evolve?

• Strand slippage during DNA replication:

catcatcatca

gtagtagtagtagtagta

27

Page 28: Finding and Aligning Related Sequences (Martin Frith)

How do simple sequences evolve?

• Strand slippage during DNA replication:

catcatcatcat

gtagtagtagtagtagta

28

Page 29: Finding and Aligning Related Sequences (Martin Frith)

How do simple sequences evolve?

• Strand slippage during DNA replication:

catcatcatcatc

gtagtagtagtagtagta

29

Page 30: Finding and Aligning Related Sequences (Martin Frith)

How do simple sequences evolve?

• Strand slippage during DNA replication:

catcatcatcatca

gtagtagtagtagtagta

30

Page 31: Finding and Aligning Related Sequences (Martin Frith)

How do simple sequences evolve?

• Strand slippage during DNA replication:

catcatcatcatcat

gtagtagtagtagtagta

31

Page 32: Finding and Aligning Related Sequences (Martin Frith)

How do simple sequences evolve?

• Strand slippage during DNA replication:

32

catcat

gtagtagtagtagtagta

cat

Page 33: Finding and Aligning Related Sequences (Martin Frith)

How do simple sequences evolve?

• Strand slippage during DNA replication:

33

catcat

gtagtagtagtagtagta

catc

Page 34: Finding and Aligning Related Sequences (Martin Frith)

How do simple sequences evolve?

• Strand slippage during DNA replication:

34

catcat

gtagtagtagtagtagta

catca

Page 35: Finding and Aligning Related Sequences (Martin Frith)

How do simple sequences evolve?

• Strand slippage during DNA replication:

35

catcat

gtagtagtagtagtagta

catcat

Page 36: Finding and Aligning Related Sequences (Martin Frith)

How do simple sequences evolve?

• Strand slippage during DNA replication:

36

catcat

gtagtagtagtagtagta

catcat

On the top strand, it has got longer

Page 37: Finding and Aligning Related Sequences (Martin Frith)

How do simple sequences evolve?

• An initial (short, mild) simple sequence occurs by chance

• Due to slippage, it gets longer…

• And longer…

37

Page 38: Finding and Aligning Related Sequences (Martin Frith)

Homology between human and banana?

• Probably not.

Human
atatatatatatatatatatatatatatatatatatatata
|||||||||||||||||||||||||||||||||||||||||
atatatatatatatatatatatatatatatatatatatata
Banana

Page 39: Finding and Aligning Related Sequences (Martin Frith)

Avoiding non-homologous alignments of simple sequences

• The standard way is to identify and “mask” them, before alignment

39

Page 40: Finding and Aligning Related Sequences (Martin Frith)

Finding and aligning related sequences

• Examples
• What are we really trying to do?
• Simple sequences
• Repeat masking
• Classic score-based alignment
• Alignment & probability models
• Alignment ambiguity
• Using sequence quality data
• Scaling to huge datasets

40

Page 41: Finding and Aligning Related Sequences (Martin Frith)

Repeat masking

• There are standard “repeat masking” tools

– RepeatMasker, DustMasker, SegMasker, TRF, …

• Most people just assume they work

41

Page 42: Finding and Aligning Related Sequences (Martin Frith)

Repeat confusion

atcttatgtctctctctctctctctctctggatgcttgaccac

cttgttattgctgatcgtcctctctgtaaattgttattgctgatcatgctttaac

Simple sequence:

Interspersed repeat:

They are both called “repeats”, but they are rather different. Don’t confuse them.

42

Page 43: Finding and Aligning Related Sequences (Martin Frith)

Test of avoiding non-homologous alignments

• Compare two sequences after reversing one of them

• Sequences never evolve by reversal, so there are no true homologs in this test

• But repeats may still cause strong similarities, if they are not suppressed

Human
atatatatatatatatatatatatatatatatatatatata
|||||||||||||||||||||||||||||||||||||||||
atatatatatatatatatatatatatatatatatatatata
Banana
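The idea behind the reversal test can be sketched as follows. This is only an illustration under simple assumptions: similarity is measured crudely by counting windows of length k in one sequence that also occur in the other, and the function name `shared_kmers` is hypothetical. It shows that a simple repeat still matches after one sequence is reversed (not reverse-complemented), although reversal cannot produce true homology.

```python
def shared_kmers(a, b, k=8):
    """Count the length-k windows of b that also occur somewhere in a."""
    kmers_of_a = {a[i:i + k] for i in range(len(a) - k + 1)}
    return sum(b[i:i + k] in kmers_of_a for i in range(len(b) - k + 1))


simple_repeat = "at" * 20
reversed_repeat = simple_repeat[::-1]   # plain reversal, as in the test above

# Every window of the reversed repeat still matches the original repeat.
print(shared_kmers(simple_repeat, reversed_repeat))

# A sequence without this kind of periodicity shares nothing with its reversal.
blocky = "aaaaccccggggtttt"
print(shared_kmers(blocky, blocky[::-1]))
```

A real test would run a full aligner on masked genomes, as described on the following slides; the sketch only shows why unmasked simple sequences survive reversal.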

Page 44: Finding and Aligning Related Sequences (Martin Frith)

Test result

The C. elegans genome versus the reversed P. pacificus genome, after masking both with DustMasker:

Red: observed number of alignments
Black: expected number of alignments for random sequences (E-value)

Page 45: Finding and Aligning Related Sequences (Martin Frith)

A spurious alignment

Upper sequence: from C. elegans
Lower sequence: from reversed P. pacificus

Conclusion: DustMasker fails to mask some tandem repeats

45

Page 46: Finding and Aligning Related Sequences (Martin Frith)

Other methods?

• SegMasker does not work either:

Upper sequence: part of an animal protein
Lower sequence: part of a reversed plant protein

• Nor does RepeatMasker, TRF…

A new repeat-masking method enables specific detection of homologous sequences. Frith MC. Nucleic Acids Research 2011 39:e23.

Page 47: Finding and Aligning Related Sequences (Martin Frith)

Repeat masking

• There are standard “repeat masking” tools

– RepeatMasker, DustMasker, SegMasker, TRF, …

• Most people just assume they work

– Cargo cult science

• Genomic bioinformatics is riddled with it

47

Page 48: Finding and Aligning Related Sequences (Martin Frith)

New repeat-masking method

• tantan: http://www.cbrc.jp/tantan/

• It looks for slippery regions in sequences

• Slippery = similar to shifted versions of itself

• It integrates similarity at different slip distances, using a Forward-Backward algorithm

48

A new repeat-masking method enables specific detection of homologous sequences. Frith MC. Nucleic Acids Research 2011 39:e23.
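The notion of "slippery = similar to shifted versions of itself" can be illustrated with a much cruder measure than the one tantan uses. The sketch below is not tantan's algorithm (it has no probability model and no Forward-Backward step); it only computes, for one slip distance, the fraction of positions that equal the base `shift` positions further along. The function name `shift_identity` is hypothetical.

```python
def shift_identity(seq, shift):
    """Fraction of positions i where seq[i] equals seq[i + shift]."""
    pairs = len(seq) - shift
    if shift <= 0 or pairs <= 0:
        return 0.0
    matches = sum(seq[i] == seq[i + shift] for i in range(pairs))
    return matches / pairs


# A gt-repeat is identical to itself shifted by 2, and never by 1.
print(shift_identity("gt" * 10, 2))
print(shift_identity("gt" * 10, 1))

# A cat-repeat is identical to itself shifted by 3.
print(shift_identity("cat" * 6, 3))
```

A region that scores highly for some small shift is a candidate simple sequence; combining evidence across several slip distances is where a method like tantan goes beyond this sketch.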

Page 49: Finding and Aligning Related Sequences (Martin Frith)

tantan test result

The C. elegans genome versus the reversed P. pacificus genome, after masking with tantan:

Red: observed number of alignments
Black: expected number of alignments for random sequences (E-value)

Page 50: Finding and Aligning Related Sequences (Martin Frith)

Conclusion

• tantan prevents simple-sequence alignments

• Without masking an excessive amount

• It even works for extremely AT-rich DNA

– Plasmodium falciparum (malaria): 80% AT

– Dictyostelium discoideum (slime mould): 80% AT

50

Page 51: Finding and Aligning Related Sequences (Martin Frith)

Finding and aligning related sequences

• Examples
• What are we really trying to do?
• Simple sequences
• Repeat masking
• Classic score-based alignment
• Alignment & probability models
• Alignment ambiguity
• Using sequence quality data
• Scaling to huge datasets

51

Page 52: Finding and Aligning Related Sequences (Martin Frith)

Classic score-based alignment

52

1. Define a scoring scheme

2. Find alignments with high (maximum) scores

Page 53: Finding and Aligning Related Sequences (Martin Frith)

Alignment scoring scheme

Substitution score matrix:

    a   c   g   t
a   2  -3  -1  -3
c  -3   2  -3  -1
g  -1  -3   2  -3
t  -3  -1  -3   2

Gap scores:

Gap existence cost: 5
Gap extension cost: 1

Page 54: Finding and Aligning Related Sequences (Martin Frith)

Alignment scoring scheme

Substitution score matrix:

    a   c   g   t
a   2  -3  -1  -3
c  -3   2  -3  -1
g  -1  -3   2  -3
t  -3  -1  -3   2

Gap scores:

Gap existence cost: 5
Gap extension cost: 1

Example:

t  a  c  g  t  g  -  -  a  g  g  t
|  |  |     |  |        |  |  |  |
t  a  c  a  t  g  c  t  a  g  g  t
+2 +2 +2 -1 +2 +2   -7  +2 +2 +2 +2

Alignment score: 10
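The arithmetic of the example can be checked with a short sketch. It assumes the alignment is given as two equal-length strings with "-" for gaps, and that a gap of k letters costs the existence cost plus k times the extension cost (which matches the -7 for the two-letter gap above). The names `SCORES` and `alignment_score` are hypothetical.

```python
# Substitution scores from the slide.
SCORES = {
    "a": {"a": 2, "c": -3, "g": -1, "t": -3},
    "c": {"a": -3, "c": 2, "g": -3, "t": -1},
    "g": {"a": -1, "c": -3, "g": 2, "t": -3},
    "t": {"a": -3, "c": -1, "g": -3, "t": 2},
}
GAP_EXISTENCE_COST = 5
GAP_EXTENSION_COST = 1


def alignment_score(top, bottom):
    """Score a gapped pairwise alignment given as two equal-length rows."""
    if len(top) != len(bottom):
        raise ValueError("alignment rows must have equal length")
    score = 0
    open_gap_row = None  # which row the gap in the previous column was in
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            row = "top" if x == "-" else "bottom"
            if row != open_gap_row:
                score -= GAP_EXISTENCE_COST  # a new gap starts here
            score -= GAP_EXTENSION_COST      # paid for every gap column
            open_gap_row = row
        else:
            score += SCORES[x][y]
            open_gap_row = None
    return score


print(alignment_score("tacgtg--aggt", "tacatgctaggt"))  # 10, as on the slide
```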

Page 55: Finding and Aligning Related Sequences (Martin Frith)

Classic score-based alignment

55

1. Define a scoring scheme

2. Find alignments with high (maximum) scores

ctatgctacgtgaggtgtggc
attacatgctaggtccac

tacgtg--aggt
||| ||  ||||
tacatgctaggt

Page 56: Finding and Aligning Related Sequences (Martin Frith)

How to find alignments with max score?

• Smith-Waterman algorithm
– Exact: guarantees to find the max score
– A bit slow

• BLAST, FASTA, etc
– Heuristic: no guarantee
– Faster

ctatgctacgtgaggtgtggc
attacatgctaggtccac

tacgtg--aggt
||| ||  ||||
tacatgctaggt
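For reference, here is a minimal sketch of the exact approach: Smith-Waterman local alignment with the affine gap costs used earlier (Gotoh's formulation), returning only the maximum score and not the alignment itself. It assumes lowercase DNA and the scoring scheme from the previous slides; the function name and the row-by-row layout are illustrative choices, not taken from any particular tool.

```python
SCORES = {
    "a": {"a": 2, "c": -3, "g": -1, "t": -3},
    "c": {"a": -3, "c": 2, "g": -3, "t": -1},
    "g": {"a": -1, "c": -3, "g": 2, "t": -3},
    "t": {"a": -3, "c": -1, "g": -3, "t": 2},
}


def smith_waterman_score(a, b, gap_existence=5, gap_extension=1):
    """Return the maximum local alignment score of sequences a and b."""
    neg_inf = float("-inf")
    best = 0
    prev_h = [0] * (len(b) + 1)        # best scores for the previous row
    prev_f = [neg_inf] * (len(b) + 1)  # previous row, ending in a gap in b
    for i in range(1, len(a) + 1):
        h = [0] * (len(b) + 1)
        f = [neg_inf] * (len(b) + 1)
        e = neg_inf                    # current row, ending in a gap in a
        for j in range(1, len(b) + 1):
            # Either extend an existing gap or open a new one.
            e = max(e, h[j - 1] - gap_existence) - gap_extension
            f[j] = max(prev_f[j], prev_h[j] - gap_existence) - gap_extension
            diagonal = prev_h[j - 1] + SCORES[a[i - 1]][b[j - 1]]
            # The 0 lets a local alignment start afresh anywhere.
            h[j] = max(0, diagonal, e, f[j])
            best = max(best, h[j])
        prev_h, prev_f = h, f
    return best


print(smith_waterman_score("ctatgctacgtgaggtgtggc", "attacatgctaggtccac"))
```

The run time grows with the product of the two sequence lengths, which is the "a bit slow" on the slide. Note that the maximum-score alignment of two sequences need not be the one drawn for illustration.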

Page 57: Finding and Aligning Related Sequences (Martin Frith)

Finding and aligning related sequences

• Examples
• What are we really trying to do?
• Simple sequences
• Repeat masking
• Classic score-based alignment
• Alignment & probability models
• Alignment ambiguity
• Using sequence quality data
• Scaling to huge datasets

57

Page 58: Finding and Aligning Related Sequences (Martin Frith)

Alignment scoring scheme

• Where do these scores come from?

• Why is this a good method anyway?

Substitution score matrix, Sxy:

    a   c   g   t
a   2  -3  -1  -3
c  -3   2  -3  -1
g  -1  -3   2  -3
t  -3  -1  -3   2

Gap scores:

Gap existence cost: 5
Gap extension cost: 1

Page 59: Finding and Aligning Related Sequences (Martin Frith)

Scores are log likelihood ratios

$$S_{xy} = t \times \log\left(\frac{A_{xy}}{P_x \times Q_y}\right)$$

$A_{xy}$: probability of x aligned to y in a true alignment (model of homologous sequences)

$P_x$: probability of x in the first sequence

$Q_y$: probability of y in the second sequence

$P_x \times Q_y$: model of independent sequences
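The formula can be tried out numerically. The sketch below makes strong simplifying assumptions that are not on the slide: all four bases are equally frequent in both sequences, and all mismatches are equally likely, so the model is described by a single expected identity. The function name `score_matrix` and the choice of t are hypothetical.

```python
import math


def score_matrix(identity, t):
    """Log likelihood ratio scores for a uniform-composition DNA model."""
    bases = "acgt"
    background = 0.25  # assumed P_x and Q_y for every base
    scores = {}
    for x in bases:
        for y in bases:
            # A_xy: joint probability of x aligned to y in a true alignment.
            if x == y:
                joint = identity / 4
            else:
                joint = (1 - identity) / 12
            scores[x, y] = t * math.log(joint / (background * background))
    return scores


# Choose t so that a match scores exactly 1.
weak = score_matrix(0.75, t=1 / math.log(3))
strong = score_matrix(0.99, t=1 / math.log(3.96))
print(round(weak["a", "a"], 2), round(weak["a", "c"], 2))
print(round(strong["a", "a"], 2), round(strong["a", "c"], 2))
```

Under these assumptions, 75% identity gives match and mismatch scores of +1 and -1, and 99% identity gives roughly +1 and -3, which agrees with the simple matrices on the next slide.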

Page 60: Finding and Aligning Related Sequences (Martin Frith)

Different matrices for different tasks

Strong similarities (~99% identity):

    a   c   g   t
a   1  -3  -3  -3
c  -3   1  -3  -3
g  -3  -3   1  -3
t  -3  -3  -3   1

Weak similarities (~75% identity):

    a   c   g   t
a   1  -1  -1  -1
c  -1   1  -1  -1
g  -1  -1   1  -1
t  -1  -1  -1   1

AT-rich DNA (e.g. malaria):

    a   c   g   t
a   2  -3  -2  -3
c  -3   5  -3  -2
g  -2  -3   5  -3
t  -3  -2  -3   2

Bisulfite-converted DNA:

    a   c   g   t
a   2  -6  -6  -6
c  -6   2  -6   1
g  -6  -6   2  -6
t  -6  -6  -6   1

Page 61: Finding and Aligning Related Sequences (Martin Frith)

What about gap scores?

Pair hidden Markov model

The arrows describe probabilities for insertions and deletions. (It looks more complicated than it really is.)

Page 62: Finding and Aligning Related Sequences (Martin Frith)

A useful formula

$$\mathrm{Prob}(\text{alignment}) \propto \exp(\text{alignment score} / t)$$

Page 63: Finding and Aligning Related Sequences (Martin Frith)

Finding and aligning related sequences

• Examples
• What are we really trying to do?
• Simple sequences
• Repeat masking
• Classic score-based alignment
• Alignment & probability models
• Alignment ambiguity
• Using sequence quality data
• Scaling to huge datasets

63

Page 64: Finding and Aligning Related Sequences (Martin Frith)

Alignment ambiguity

ctagctaaccgtatcgtgggc
||||| |   |||||  | ||
ctagcca---gtatctagtgc

Or

ctagctaaccgtatcgtgggc
|||||   | |||||  | ||
ctagc---cagtatctagtgc

?

64

Page 65: Finding and Aligning Related Sequences (Martin Frith)

Per-column probabilities

…   g   c   a   t   c   c   t   t   g   g   g   t   c   t   c   g   a   c   a   t   …
…   g   c   c   t   c   g   t   t   a   g   a   -   -   t   a   g   a   t   a   g   …
   .99 .99 .99 .95 .93 .92 .90 .79 .55 .33 .16 .22 .49 .55 .59 .71 .93 .97 .98 .99

65

Page 66: Finding and Aligning Related Sequences (Martin Frith)

Importance

• Column reliability is important for:

– Studying the evolution of binding sites

– Identifying polymorphisms

– Finding recombination breakpoints

– …

66

…   g   c   a   t   c   c   t   t   g   g   g   t   c   t   c   g   a   c   a   t   …
…   g   c   c   t   c   g   t   t   a   g   a   -   -   t   a   g   a   t   a   g   …
   .99 .99 .99 .95 .93 .92 .90 .79 .55 .33 .16 .22 .49 .55 .59 .71 .93 .97 .98 .99

Page 67: Finding and Aligning Related Sequences (Martin Frith)

How to calculate ambiguity

ctagctaaccgtatcgtgggc
||||| |   |||||  | ||
ctagcca---gtatctagtgc

Prob(column) = sum of probs of all alignments that include the column

$$\mathrm{Prob}(\text{column}) = \frac{\text{sum of } \exp(\text{score} / t) \text{ for all alignments that include the column}}{\text{sum of } \exp(\text{score} / t) \text{ for all alignments}}$$
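The formula can be illustrated on a toy case where only a handful of candidate alignments are considered. This is a deliberate simplification: the formula sums over all possible alignments, which this sketch does not attempt. The function names are hypothetical, a column is represented as a pair of 0-based positions, and the scores and the value of t passed in below are illustrative inputs.

```python
import math


def columns(top, bottom):
    """Return the set of (i, j) position pairs aligned by a gapped alignment."""
    cols, i, j = set(), 0, 0
    for x, y in zip(top, bottom):
        if x != "-" and y != "-":
            cols.add((i, j))
        i += x != "-"   # advance in the top sequence unless this is a gap
        j += y != "-"   # advance in the bottom sequence unless this is a gap
    return cols


def column_probabilities(alignments, t):
    """alignments: list of (score, set of columns). Returns column -> prob."""
    weights = [math.exp(score / t) for score, _ in alignments]
    total = sum(weights)
    probs = {}
    for weight, (_, cols) in zip(weights, alignments):
        for col in cols:
            probs[col] = probs.get(col, 0.0) + weight / total
    return probs


top = "ctagctaaccgtatcgtgggc"
first = (10, columns(top, "ctagcca---gtatctagtgc"))
second = (8, columns(top, "ctagc---cagtatctagtgc"))
probs = column_probabilities([first, second], t=1.0)
print(probs[(0, 0)], probs[(6, 6)], probs[(8, 5)])
```

Columns shared by both candidates get probability 1, while columns near the gap split the probability according to the two scores.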

Page 68: Finding and Aligning Related Sequences (Martin Frith)

An aligner that indicates ambiguity

68

Since 2008

http://last.cbrc.jp/

Warning: LAST was made by me and colleagues

Page 69: Finding and Aligning Related Sequences (Martin Frith)

Finding and aligning related sequences

• Examples
• What are we really trying to do?
• Simple sequences
• Repeat masking
• Classic score-based alignment
• Alignment & probability models
• Alignment ambiguity
• Using sequence quality data
• Scaling to huge datasets

69

Page 70: Finding and Aligning Related Sequences (Martin Frith)

Sequence quality data

• Some DNA sequencers estimate the error probability of every base

• We ought to use this information when comparing sequences

t a g c t g a

0.01 0.02 0.07 0.24 0.32 0.75 0.75

70

Page 71: Finding and Aligning Related Sequences (Martin Frith)

General case: compare 2 sequences with error probabilities

a t g c c …

0.01 0.02 0.02 0.09 0.17

g t a c c …

0.03 0.01 0.08 0.12 0.44

71

Page 72: Finding and Aligning Related Sequences (Martin Frith)

 

Traditional sequence comparison

Score matrix (real substitutions: mutation / evolution):

    a   c   g   t
a   2  -3  -1  -3
c  -3   2  -3  -1
g  -1  -3   2  -3
t  -3  -1  -3   2

$$S_{xy} = \log\left[\frac{A_{xy}}{B_x B_y}\right]$$

Giga-sequencers

Error probabilities (erroneous substitutions):

t    a    g    c    t
0.01 0.02 0.07 0.24 0.32

Generalized log likelihood ratio:

$$S_{xpyq} = \log\left[\frac{pq A_{xy}}{B_x B_y} + (1 - pq)\right]$$

Incorporating sequence quality data into alignment improves DNA read mapping. Frith MC, Wan R, Horton P. Nucleic Acids Research 2010 38:e100.
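A numerical sketch of the generalized score follows. It rests on assumptions that the slide does not spell out: p and q are taken to be the probabilities that bases x and y were read correctly, and the ordinary score is taken to be scaled by t as in the earlier log likelihood ratio formula, so that the ratio can be recovered as exp(S / t). The function name and the value of t are hypothetical.

```python
import math


def quality_aware_score(s_xy, p, q, t):
    """Generalized score for bases x and y with reliabilities p and q."""
    ratio = math.exp(s_xy / t)  # A_xy / (B_x B_y), recovered from the score
    return t * math.log(p * q * ratio + (1 - p * q))


t = 1.0  # illustrative scale factor
print(quality_aware_score(2, 1.0, 1.0, t))   # fully reliable: ordinary score
print(quality_aware_score(2, 0.9, 0.9, t))   # a match counts for a bit less
print(quality_aware_score(-3, 0.9, 0.9, t))  # a mismatch is penalized less
print(quality_aware_score(-3, 0.0, 1.0, t))  # unreliable base: score 0
```

The effect is that unreliable bases pull every score towards zero, so a mismatch at a low-quality base says little about whether two sequences are related.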

Page 73: Finding and Aligning Related Sequences (Martin Frith)

An aligner that combines score matrix & quality data

73

Since 2008

http://last.cbrc.jp/

Warning: LAST was made by me and colleagues

Page 74: Finding and Aligning Related Sequences (Martin Frith)

Finding and aligning related sequences

• Examples
• What are we really trying to do?
• Simple sequences
• Repeat masking
• Classic score-based alignment
• Alignment & probability models
• Alignment ambiguity
• Using sequence quality data
• Scaling to huge datasets

74

Page 75: Finding and Aligning Related Sequences (Martin Frith)

Why is BLAST too slow?

75

Page 76: Finding and Aligning Related Sequences (Martin Frith)

Why is BLAST too slow?

1. Find “seeds” (initial matches) of a fixed length (e.g. 11)
2. Try extending an alignment from each seed

…atcgtatcgtatcgtactgctggcctagtggggga…

…ctcgtcgatgctagtcgtactgctgatgctatatatatattaatg…

76
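Step 1 can be sketched with a simple lookup table of fixed-length words. This is only an illustration of the seeding idea, not how BLAST is actually implemented; the function name `find_seeds` and the dictionary-based index are assumptions.

```python
from collections import defaultdict


def find_seeds(reference, query, k=11):
    """Return (reference position, query position) for every exact k-mer match."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    seeds = []
    for j in range(len(query) - k + 1):
        for i in index.get(query[j:j + k], ()):
            seeds.append((i, j))
    return seeds


top = "atcgtatcgtatcgtactgctggcctagtggggga"
bottom = "ctcgtcgatgctagtcgtactgctgatgctatatatatattaatg"
print(find_seeds(top, bottom))

# A simple repeat produces a seed for every pair of in-phase positions.
print(len(find_seeds("at" * 20, "at" * 20)))
```

The two slide sequences share the 11-mer tcgtactgctg, whereas two copies of a 40-base at-repeat already give 450 seeds, each of which would trigger an extension attempt in step 2.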

Page 77: Finding and Aligning Related Sequences (Martin Frith)

Why is BLAST too slow?

1. Find “seeds” (initial matches) of a fixed length (e.g. 11)
2. Try extending an alignment from each seed

…atcgtatcgtatcgtactgctggcctagtggggga…

…ctcgtcgatgctagtcgtactgctgatgctatatatatattaatg…

Problem

Non-uniform composition: atatatatatatatatatata, Alu, LINEs, SINEs, isochores, CpG islands

Too many seeds → too many extensions → too slow

Page 78: Finding and Aligning Related Sequences (Martin Frith)

Example

• Compare the human and chimp genomes

• Each genome has ~ 1 million Alu elements

• So we will get ~ 10^12 seed matches…

Problem

Non-uniform composition: atatatatatatatatatata, Alu, LINEs, SINEs, isochores, CpG islands

Too many seeds → too many extensions → too slow

Page 79: Finding and Aligning Related Sequences (Martin Frith)

Solution: adaptive seeds

1. Find “seeds” (initial matches) of a fixed rareness (instead of a fixed length)
2. Try extending an alignment from each seed

…atcgtatcgtatcgtactgctggcctagtggggga…

…ctcgtcgatgctagtcgtactgctgatgctatatatatattaatg…

79

Adaptive seeds can be found efficiently by using a suffix array
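One way to read "fixed rareness" is: starting at a query position, lengthen the match until it occurs at most some chosen number of times in the reference. The sketch below follows that reading with a deliberately naive suffix array (built by sorting whole suffixes, which is far too slow for real genomes). The function names and the `max_hits` parameter are hypothetical, and this is not a description of LAST's actual implementation.

```python
def build_suffix_array(text):
    """Naive suffix array: start positions of all suffixes, sorted."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def count_occurrences(text, suffix_array, pattern):
    """Count occurrences of pattern in text by binary search."""
    def first_index(include_equal):
        lo, hi = 0, len(suffix_array)
        while lo < hi:
            mid = (lo + hi) // 2
            start = suffix_array[mid]
            prefix = text[start:start + len(pattern)]
            if prefix < pattern or (include_equal and prefix == pattern):
                lo = mid + 1
            else:
                hi = mid
        return lo
    # Suffixes starting with the pattern form one contiguous block.
    return first_index(True) - first_index(False)


def adaptive_seed_length(text, suffix_array, query, start, max_hits):
    """Shortest match from query[start] occurring at most max_hits times."""
    for length in range(1, len(query) - start + 1):
        hits = count_occurrences(text, suffix_array, query[start:start + length])
        if hits == 0:
            return None          # no match of this length exists at all
        if hits <= max_hits:
            return length        # rare enough: this is the seed
    return None                  # query ended while the match was too common


reference = "atatatatatcgtacg"
sa = build_suffix_array(reference)
print(adaptive_seed_length(reference, sa, "atatcg", 0, max_hits=1))
print(adaptive_seed_length(reference, sa, "cgta", 0, max_hits=1))
```

Inside the repeat the seed has to grow to 5 letters before it becomes unique, while outside the repeat 3 letters are enough. This is how rareness-based seeds avoid the flood of matches that fixed-length seeds produce in repetitive regions.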

Page 80: Finding and Aligning Related Sequences (Martin Frith)

An aligner that uses adaptive seeds

80

Since 2008

http://last.cbrc.jp/

Warning: LAST was made by me and colleagues

Page 81: Finding and Aligning Related Sequences (Martin Frith)

LAST run times

• Compare the human and chicken genomes

– 3.5 hours

• Align 1 million length-87 DNA reads to the human genome

– 6 minutes

• (Using 1 CPU core)

81

Page 82: Finding and Aligning Related Sequences (Martin Frith)

[Figure: simulated DNA reads. Axes: sensitivity (% of reads that are correctly aligned), error rate (% of aligned reads that are wrong), run time (minutes)]

Page 83: Finding and Aligning Related Sequences (Martin Frith)

Method              Time (min)
bwa                 16
bwa-n10             67
last                41
last
last
novoalign           518
shrimp2             ?
stampy              72
stampy (sensitive)  248

Page 84: Finding and Aligning Related Sequences (Martin Frith)

For more detail

• Adaptive seeds tame genomic sequence comparison. Kiełbasa SM, Wan R, Sato K, Horton P, Frith MC. Genome Research 2011 21:487.

• Incorporating sequence quality data into alignment improves DNA read mapping. Frith MC, Wan R, Horton P. Nucleic Acids Research 2010 38:e100.

• A mostly traditional approach improves alignment of bisulfite-converted DNA. Frith MC, Mori R, Asai K. Nucleic Acids Research 2012 40:e100.

84

Page 85: Finding and Aligning Related Sequences (Martin Frith)

Summary

• It is feasible to use classic, statistical alignment approaches with large modern sequence datasets

– This is beneficial for modeling: diverged sequences, biased base frequencies, etc.

• Alignment ambiguity should be used more often

• Try to avoid cargo cult science!

85

Page 86: Finding and Aligning Related Sequences (Martin Frith)

Main collaborators

Paul Horton CBRC

Michiaki Hamada U of Tokyo / CBRC

Szymon Kiełbasa Leiden University

Page 87: Finding and Aligning Related Sequences (Martin Frith)

Programming wisdom

• Measuring programming progress by lines of code is like measuring aircraft building progress by weight. – Bill Gates

• As you're about to add a comment, ask yourself, 'How can I improve the code so that this comment isn't needed?' – Steve McConnell

• The key to performance is elegance, not battalions of special cases. – Jon Bentley and M. Douglas McIlroy

• Weeks of programming can save you hours of planning. – Unknown

• Always code as if the guy who ends up maintaining your code will be a violent psychopath who knows where you live. – Unknown

87