Lecture 3. Heuristic Sequence Alignmentkevinyip/csci3220/CSCI3220... · Numbers in real situations...

72
Lecture 3. Heuristic Sequence Alignment The Chinese University of Hong Kong CSCI3220 Algorithms for Bioinformatics

Transcript of Lecture 3. Heuristic Sequence Alignmentkevinyip/csci3220/CSCI3220... · Numbers in real situations...

Page 1: Lecture 3. Heuristic Sequence Alignmentkevinyip/csci3220/CSCI3220... · Numbers in real situations •Scenario 1: Whole-genome alignment between two species –m, n(e.g., C. elegansand

Lecture 3. Heuristic Sequence AlignmentThe Chinese University of Hong KongCSCI3220 Algorithms for Bioinformatics

Page 2: Lecture 3. Heuristic Sequence Alignmentkevinyip/csci3220/CSCI3220... · Numbers in real situations •Scenario 1: Whole-genome alignment between two species –m, n(e.g., C. elegansand

Lecture outline1. Computational complexity of optimal alignment

problems2. Heuristic methods– Dot plot– Pairwise sequence alignment:• FASTA

– The FASTA file format• BLAST

– Statistical significance• Variations

– Multiple sequence alignment:• Clustal

Last update: 10-Aug-2020 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2020 2

Page 3: Lecture 3. Heuristic Sequence Alignmentkevinyip/csci3220/CSCI3220... · Numbers in real situations •Scenario 1: Whole-genome alignment between two species –m, n(e.g., C. elegansand

COMPUTATIONAL COMPLEXITY OF OPTIMAL ALIGNMENT PROBLEMS

Part 1

Page 4: Lecture 3. Heuristic Sequence Alignmentkevinyip/csci3220/CSCI3220... · Numbers in real situations •Scenario 1: Whole-genome alignment between two species –m, n(e.g., C. elegansand

Computational complexity• To align two sequences of lengths m and n by

dynamic programming, how much time and space do we need?– O(mn)– Can do better than filling the whole table, but still

expensive• To find the sequence in a database with l length-n

sequences that is closest to a query sequence of length m, how much time and space do we need?– Suppose we consider the database sequences one by

one– O(lmn) time– O(mn) space

Last update: 18-Aug-2020 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2020 4

Page 5: Lecture 3. Heuristic Sequence Alignmentkevinyip/csci3220/CSCI3220... · Numbers in real situations •Scenario 1: Whole-genome alignment between two species –m, n(e.g., C. elegansand

Numbers in real situations• Scenario 1: Whole-genome alignment

between two species– m, n (e.g., C. elegans and C. briggsae): ~100 x 106

– mn » 1016

– If a computer can do 3 x 109 operations in a second, it would take ~3,000,000 seconds = 926 hours = 38.6 days

– Still ok if you can wait. But can we find a machine with 1016, i.e., 10PB RAM?

Last update: 10-Aug-2020 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2020 5

Image credit: wormbase.org

~1mm

Page 6: Lecture 3. Heuristic Sequence Alignmentkevinyip/csci3220/CSCI3220... · Numbers in real situations •Scenario 1: Whole-genome alignment between two species –m, n(e.g., C. elegansand

Numbers in real situations• Scenario 2: Searching for a gene from a database– m (e.g., A human gene): ~3,000 on average– l (e.g., GenBank v238): 217,122,233 sequences– n (e.g., GenBank v238): 427,823,258,901 bases /

217,122,233 sequences = ~2,000– mn = ~6,000,000 (manageable)– lmn = ~1.3 x 1015

– If a computer can do 3 x 109 operations in a second, it would take ~430,000 seconds = 120 hours

– If you go to GenBank, it will take only a minute• Don’t forget it serves many users at the same time

Last update: 10-Aug-2020 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2020 6

Page 7: Lecture 3. Heuristic Sequence Alignmentkevinyip/csci3220/CSCI3220... · Numbers in real situations •Scenario 1: Whole-genome alignment between two species –m, n(e.g., C. elegansand

Heuristics• How to perform alignment faster and with less

memory?1. Quickly identify regions with high similarity• By inspection• By considering short sub-sequences

2. Combine and refine these initial resultsThe results may not be optimal in terms of alignment score, but the process is usually much faster than dynamic programming– These methods are called heuristic methods

Last update: 10-Aug-2020 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2020 7

Page 8: Lecture 3. Heuristic Sequence Alignmentkevinyip/csci3220/CSCI3220... · Numbers in real situations •Scenario 1: Whole-genome alignment between two species –m, n(e.g., C. elegansand

HEURISTIC METHODSPart 2

Page 9: Lecture 3. Heuristic Sequence Alignmentkevinyip/csci3220/CSCI3220... · Numbers in real situations •Scenario 1: Whole-genome alignment between two species –m, n(e.g., C. elegansand

Dot plot• Consider an alignment that we studied before:

• If we remove the details but only light up the matched characters, what are we going to see?

Last update: 10-Aug-2020 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2020 9

s

r

A C G G C G T f

A 3 2 2 2 0 -2 -4 -6

T 1 2 3 3 1 -1 -3 -5

G 1 2 3 4 2 0 -2 -4

C -1 0 1 2 3 1 -1 -3

G -3 -2 -1 0 1 2 0 -2

T -5 -4 -3 -2 -1 0 1 -1

f -7 -6 -5 -4 -3 -2 -1 0

Page 10: Lecture 3. Heuristic Sequence Alignmentkevinyip/csci3220/CSCI3220... · Numbers in real situations •Scenario 1: Whole-genome alignment between two species –m, n(e.g., C. elegansand

s

r

A C G G C G T f

A

T

G

C

G

T

f

Dot plot• What do we see?

Last update: 18-Aug-2020 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2020 10

Page 11: Lecture 3. Heuristic Sequence Alignmentkevinyip/csci3220/CSCI3220... · Numbers in real situations •Scenario 1: Whole-genome alignment between two species –m, n(e.g., C. elegansand

Diagonals• We see “diagonals”

• Each diagonal marks a local exact match

Last update: 18-Aug-2020 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2020 11

s

r

A C G G C G T f

A

T

G

C

G

T

f

Page 12: Lecture 3. Heuristic Sequence Alignmentkevinyip/csci3220/CSCI3220... · Numbers in real situations •Scenario 1: Whole-genome alignment between two species –m, n(e.g., C. elegansand

Dot plot• In general, the dot plot gives us information

about:– Conserved regions– Non-conserved regions– Inversions– Insertions and deletions– Local repeats– Multiple matches– Translocations

Last update: 10-Aug-2020 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2020 12

Page 13: Lecture 3. Heuristic Sequence Alignmentkevinyip/csci3220/CSCI3220... · Numbers in real situations •Scenario 1: Whole-genome alignment between two species –m, n(e.g., C. elegansand

Alignment types• Information about the aligned sequences

based on a dot plot

Last update: 10-Aug-2020 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2020 13

s

r

A T G A A C G f

A

T

G

C

G

T

f

s

r

A C G T C G T f

G

G

C

G

T

A

f

s

r

A A G G C C T f

A

A

C

C

G

G

f

Insertion/deletion Duplication Rearrangement

Page 14: Lecture 3. Heuristic Sequence Alignmentkevinyip/csci3220/CSCI3220... · Numbers in real situations •Scenario 1: Whole-genome alignment between two species –m, n(e.g., C. elegansand

Resolution• What we will see if we only highlight diagonal

runs of length at least 1, 3, and 5:

• Which one is the best?

Last update: 18-Aug-2020 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2020 14

s

r

A C G G C G T f

A

T

G

C

G

T

f

s

r

A C G G C G T f

A

T

G

C

G

T

f

s

r

A C G G C G T f

A

T

G

C

G

T

f

Resolution: 1 character Resolution: 3 characters Resolution: 5 characters

Page 15: Lecture 3. Heuristic Sequence Alignmentkevinyip/csci3220/CSCI3220... · Numbers in real situations •Scenario 1: Whole-genome alignment between two species –m, n(e.g., C. elegansand

Real examples

• Notice the resolutions

Last update: 10-Aug-2020 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2020 15

Image credit: Tufarelli et al., Genome Research 14(4):623-630, (2004),http://cbcb.umd.edu/confcour/CMSC828H-materials/Lecture10_MUMmer_alignment.ppt

Page 16: Lecture 3. Heuristic Sequence Alignmentkevinyip/csci3220/CSCI3220... · Numbers in real situations •Scenario 1: Whole-genome alignment between two species –m, n(e.g., C. elegansand

Quick quiz• What is the resulting double-stranded DNA if

the underlined part is inverted?5’ ACGGA 3’3’ TGCCT 5’

• Is the inversion of the following double-stranded DNA the same as itself?5’ ACCA 3’3’ TGGT 5’

• Can you give an example double-stranded DNA that is the same as its inversion?

Last update: 18-Aug-2020 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2020 16

Page 17: Lecture 3. Heuristic Sequence Alignmentkevinyip/csci3220/CSCI3220... · Numbers in real situations •Scenario 1: Whole-genome alignment between two species –m, n(e.g., C. elegansand

Limitations of dot plot• Must be exact matches– Possible to allow some mismatches, but more

computations would be needed• We will come back to this topic when we study BLAST

• The whole plot takes a lot of space• Difficult to determine resolution– Show even one base match: Too many dots– Show only long matches: May miss important signals

• Mainly for visualization, not quantitativeWe now study how the ideas of dot plot can help us perform database search efficiently

Last update: 18-Aug-2020 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2020 17

Page 18: Lecture 3. Heuristic Sequence Alignmentkevinyip/csci3220/CSCI3220... · Numbers in real situations •Scenario 1: Whole-genome alignment between two species –m, n(e.g., C. elegansand

Database search

• Problem: Given a query sequence r and a database D of sequences, find sequences in Dthat are similar to r– For each identified sequence s, good to

return a match score, sim(r, s)–Good to list multiple sequences with high

match scores instead of the best one only– |D| is usually very large (dynamic

programming is not quite feasible)

Last update: 10-Aug-2020 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2020 18

Page 19: Lecture 3. Heuristic Sequence Alignmentkevinyip/csci3220/CSCI3220... · Numbers in real situations •Scenario 1: Whole-genome alignment between two species –m, n(e.g., C. elegansand

Heuristic database search• Two main steps, based on dot-plot ideas:

1. Instead of aligning the whole r with a whole sequence in D, we look for short sub-sequences of very high similarity using very quick methods

2. Combine and extend these initial results to get longer matches

• Notes:– We will start with one sequence s in D

• To search for high-similarity matches from the whole database, one may simply repeat the two steps for every sequence in the database

• There are ways to make it faster, by using index structures– The two steps do not guarantee to produce the best

matches (Can you come up with an example?)– We will only cover the high-level ideas

Last update: 10-Aug-2020 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2020 19

Page 20: Lecture 3. Heuristic Sequence Alignmentkevinyip/csci3220/CSCI3220... · Numbers in real situations •Scenario 1: Whole-genome alignment between two species –m, n(e.g., C. elegansand

Illustrating the ideas

Last update: 10-Aug-2020 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2020 20

s

r

A C G G C G T f

A 3 2 2 2 0 -2 -4 -6

T 1 2 3 3 1 -1 -3 -5

G 1 2 3 4 2 0 -2 -4

C -1 0 1 2 3 1 -1 -3

G -3 -2 -1 0 1 2 0 -2

T -5 -4 -3 -2 -1 0 1 -1

f -7 -6 -5 -4 -3 -2 -1 0

s

r

A C G G C G T f

A

T

G

C

G

T

f

Dot plot (resolution: 1 character)

Optimal alignmentss

r

A C G G C G T f

A 3 2 2 2 0 -2 -4 -6

T 1 2 3 3 1 -1 -3 -5

G 1 2 3 4 2 0 -2 -4

C -1 0 1 2 3 1 -1 -3

G -3 -2 -1 0 1 2 0 -2

T -5 -4 -3 -2 -1 0 1 -1

f -7 -6 -5 -4 -3 -2 -1 0

s

r

A C G G C G T f

A 3 2 2 2 0 -2 -4 -6

T 1 2 3 3 1 -1 -3 -5

G 1 2 3 4 2 0 -2 -4

C -1 0 1 2 3 1 -1 -3

G -3 -2 -1 0 1 2 0 -2

T -5 -4 -3 -2 -1 0 1 -1

f -7 -6 -5 -4 -3 -2 -1 0

A_TGCGTACGGCGT

AT_GCGTACGGCGT

ATG_CGTACGGCGT

s

r

A C G G C G T f

A

T

G

C

G

T

f

Dot plot (resolution: 3 character)

Page 21: Lecture 3. Heuristic Sequence Alignmentkevinyip/csci3220/CSCI3220... · Numbers in real situations •Scenario 1: Whole-genome alignment between two species –m, n(e.g., C. elegansand

Popular methods• We now study two popular methods– FASTA– BLAST

• Both use the mentioned ideas for performing local alignments, but in different ways

Last update: 10-Aug-2020 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2020 21

Page 22: Lecture 3. Heuristic Sequence Alignmentkevinyip/csci3220/CSCI3220... · Numbers in real situations •Scenario 1: Whole-genome alignment between two species –m, n(e.g., C. elegansand

FASTA• First version for protein sequences (FASTP)

proposed by David J. Lipman and William R. Pearson in 1985 (Lipman and Pearson, Rapid and Sensitive

Protein Similarity Searches, Science 227(4693):1435-1441, 1985) • Pronounced as “Fast-A” (i.e., fast for all kinds

of sequences)

Last update: 10-Aug-2020 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2020 22

Page 23: Lecture 3. Heuristic Sequence Alignmentkevinyip/csci3220/CSCI3220... · Numbers in real situations •Scenario 1: Whole-genome alignment between two species –m, n(e.g., C. elegansand

Overview of FASTA• Step 1:

a) Find matches with stretches of at least k consecutive exact matches. Find the best (e.g., 10) matches using a simple scoring method.

b) Refine and re-evaluate the best matches using formal substitution matrices.

• Step 2: c) Combine the best matches with

gaps allowed.d) Use dynamic programming (DP)

on the combined matches. Banded DP: Considering only a band in the DP table

Let’s study more details about a and c in the coming slides

Last update: 10-Aug-2020 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2020 23

Image credit: Wikipedia

Page 24: Lecture 3. Heuristic Sequence Alignmentkevinyip/csci3220/CSCI3220... · Numbers in real situations •Scenario 1: Whole-genome alignment between two species –m, n(e.g., C. elegansand

Finding local exact matches• Finding diagonals of fixed length k– Usually

• k=1-2 for protein sequences• k=4-6 for DNA sequences

• Key: Building a lookup table• Example (new):– Sequence r: ACGTTGCT– Sequence s in database D:0 1123456789012GCGTGACTTTCT

– Let’s use k=2 here

Last update: 10-Aug-2020 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2020 24

Length-2 subsequences of s PositionsAC 6CG 2CT 7, 11GA 5GC 1GT 3TC 10TG 4TT 8, 9

There are different types of lookup tables that can be used. Here we use one that includes every length-2 subsequence (the “2-mers”) of s sorted lexicographically.

Page 25: Lecture 3. Heuristic Sequence Alignmentkevinyip/csci3220/CSCI3220... · Numbers in real situations •Scenario 1: Whole-genome alignment between two species –m, n(e.g., C. elegansand

Finding local exact matches• Sequence r: ACGTTGCT• Sub-sequences of r in the

lookup table:

Last update: 18-Aug-2020 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2020 25

sr

G C G T G A C T T T C T

A

C

G

T

T

G

C

T

Length-2 subsequences of s PositionsAC 6CG 2CT 7, 11GA 5GC 1GT 3TC 10TG 4TT 8, 9

Note: The dot plot is for illustration only. The FASTA program does not need to construct it.• Quick recap: Why is it not a good idea to

construct the dot plot?• How can the long matches be found without it?

Page 26: Lecture 3. Heuristic Sequence Alignmentkevinyip/csci3220/CSCI3220... · Numbers in real situations •Scenario 1: Whole-genome alignment between two species –m, n(e.g., C. elegansand

Merging matches• Merge matches on same diagonal (e.g., r[2,3]=s[2,3]

and r[3,4]=s[3,4] imply r[2,4]=s[2,4])– More advanced methods also allow gaps

Last update: 10-Aug-2020 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2020 26

sr

G C G T G A C T T T C T

A

C

G

T

T

G

C

T

sr

G C G T G A C T T T C T

A

C

G

T

T

G

C

T

Page 27: Lecture 3. Heuristic Sequence Alignmentkevinyip/csci3220/CSCI3220... · Numbers in real situations •Scenario 1: Whole-genome alignment between two species –m, n(e.g., C. elegansand

Remaining steps• Keep some (e.g., 10)

high-scoring matches• Merge matches in

different diagonals by allowing indels

• Perform local alignment by dynamic programming

Last update: 10-Aug-2020 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2020 27

sr

G C G T G A C T T T C T

A

C

G

T

T

G

C

T

Possible final results:

Page 28: Lecture 3. Heuristic Sequence Alignmentkevinyip/csci3220/CSCI3220... · Numbers in real situations •Scenario 1: Whole-genome alignment between two species –m, n(e.g., C. elegansand

Missing the optimal alignment• When will FASTA miss an optimal alignment?– Good but not exact local matches (especially for

protein sequences) Þ Not included in the very first step• k too large• High-scored mismatches, especially for protein

sequences

– Too many local candidates Þ The algorithm keeps only a few “best” ones, but it happens that they are not involved in the optimal alignment

Last update: 10-Aug-2020 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2020 28

Page 29: Lecture 3. Heuristic Sequence Alignmentkevinyip/csci3220/CSCI3220... · Numbers in real situations •Scenario 1: Whole-genome alignment between two species –m, n(e.g., C. elegansand

Time and space requirements• How much space is needed?– One entry per length-k sub-sequence. (n-k+1) of them for

a sequence of length n• How much time is needed?– One table lookup per length-k sub-sequence– In the worst case, it can still take O(mn) time

• Consider matching AAAAA with AAAAAAA– In practice, usually it is much faster

• There are other types of lookup table that allows finding the correct table entries efficiently

• When there are multiple sequences in the database, the lookup tables for different sequences can be combined. Need to record the original sequence of each subsequence in that case.

Last update: 10-Aug-2020 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2020 29

Page 30: Lecture 3. Heuristic Sequence Alignmentkevinyip/csci3220/CSCI3220... · Numbers in real situations •Scenario 1: Whole-genome alignment between two species –m, n(e.g., C. elegansand

Indexing multiple sequences• Suppose we have the

following two sequences s1 and s2in the database D:– s1:0 1123456789012GCGTGACTTTCT

– s2:0 11234567890CTGGAGCTAC

Last update: 10-Aug-2020 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2020 30

Length-2 subsequences Sequences and positions

AC s1:6, s2:9AG s2:5CG s1:2CT s1:7, s1:11, s2:1, s2:7GA s1:5, s2:4GC s1:1, s2:6GG s2:3GT s1:3TA s2:8TC s1:10TG s1:4, s2:2TT s1:8, s1: 9

Lookup table:

Page 31: Lecture 3. Heuristic Sequence Alignmentkevinyip/csci3220/CSCI3220... · Numbers in real situations •Scenario 1: Whole-genome alignment between two species –m, n(e.g., C. elegansand

Lookup table for large k• The way we have been using to construct the lookup

table is to store only the k-mers that appear in the sequences– Need searching to find the entry corresponding to a k-mer

• An alternative way is to store all possible k-mers, one per entry– If k is large (e.g., 20), the table can be huge (420 » 1 trillion

entries)– Many of these k-mers do not appear in the indexed

sequence Þ Wasting space• What can we do if we really want to use a large k?– Use one single entry to represent multiple k-mers:

“Collisions”

Last update: 10-Aug-2020 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2020 31

Page 32: Lecture 3. Heuristic Sequence Alignmentkevinyip/csci3220/CSCI3220... · Numbers in real situations •Scenario 1: Whole-genome alignment between two species –m, n(e.g., C. elegansand

Hash table• A “hash” is a number computed from a certain

input• A hash table is used to map objects to table

entries• Example:– Let’s say we have a table with 10 entries. We

compute the hash of a k-mer by regarding it as a base-4 number and take its remainder when dividing by 10:• ACGC: 1´43 + 2´42 + 3´41 + 2´40 = 64 + 32 + 12 + 2 =

110 Þ Entry 0

Last update: 10-Aug-2020 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2020 32

Page 33: Lecture 3. Heuristic Sequence Alignmentkevinyip/csci3220/CSCI3220... · Numbers in real situations •Scenario 1: Whole-genome alignment between two species –m, n(e.g., C. elegansand

Hash table example• Let’s store the 2-mers of the following sequence in a

hash table based on the hash function just described:– Sequence s in database D:0 1123456789012GCGTGACTTTCT

Last update: 10-Aug-2020 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2020 33

Hash value

2-mers of s Occurrence positions of 2-mers

0 TT 8, 9

1 CG 2

2 CT 7, 11

3 GA 5

4 GC 1

5

6 ACGT

63

7

8 TC 10

9 TG 4

Page 34: Lecture 3. Heuristic Sequence Alignmentkevinyip/csci3220/CSCI3220... · Numbers in real situations •Scenario 1: Whole-genome alignment between two species –m, n(e.g., C. elegansand

Disadvantages• When looking for occurrence

positions of a k-mer, we need to identify the relevant positions of the entry– E.g., GT has a hash value of

(3´4+4) % 10 = 6, but AC also has the same hash value and they are stored in the same table entry

• This checking can be very time consuming if:– Many k-mers are hashed to

the same table entry– Many k-mers hashing to the

same table entry actually occur in the sequence

Last update: 10-Aug-2020 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2020 34

Hash value

2-mers of s Occurrence positions of 2-mers

0 TT 8, 9

1 CG 2

2 CT 7, 11

3 GA 5

4 GC 1

5

6 ACGT

63

7

8 TC 10

9 TG 4

Page 35: Lecture 3. Heuristic Sequence Alignmentkevinyip/csci3220/CSCI3220... · Numbers in real situations •Scenario 1: Whole-genome alignment between two species –m, n(e.g., C. elegansand

The FASTA file format• FASTA may not be the most frequently used

heuristic sequence alignment method, but the FASTA file format is probably the most frequently used format for sequence data

• The format (see http://en.wikipedia.org/wiki/FASTA_format):– Text-based– Can store multiple sequences, one after another– For each sequence:

• One line that starts with ‘>’, stating the metadata (e.g., ID) of the sequence

• One or more lines for the actual sequence. Usually each line contains no more than 80 characters to fit screen width

– Can add comment lines that start with ‘;’

Last update: 10-Aug-2020 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2020 35

Page 36: Lecture 3. Heuristic Sequence Alignmentkevinyip/csci3220/CSCI3220... · Numbers in real situations •Scenario 1: Whole-genome alignment between two species –m, n(e.g., C. elegansand

Example

Last update: 10-Aug-2020 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2020 36

Image credit: Wikipedia

>SEQUENCE_1 MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK IPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTL MGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL >SEQUENCE_2 SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI ATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH

(Are these DNA or protein sequences?)

Page 37: Lecture 3. Heuristic Sequence Alignmentkevinyip/csci3220/CSCI3220... · Numbers in real situations •Scenario 1: Whole-genome alignment between two species –m, n(e.g., C. elegansand

Quick quiz• What are the advantage and disadvantage of

having one separate table row for each possible k-mer, as compared to using a hash function to determine the row of each entry?

• Suppose h is a hash function and x is a k-mer, where h(x) = number of A’s in xIs h a good hash function or not? Why?

Last update: 18-Aug-2020 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2020 37

Page 38: Lecture 3. Heuristic Sequence Alignmentkevinyip/csci3220/CSCI3220... · Numbers in real situations •Scenario 1: Whole-genome alignment between two species –m, n(e.g., C. elegansand

BLAST• Basic Local Alignment Search Tool• Proposed by Altschul et al. in 1990 (Altschul et al.,

J. Mol. Biol. 215(3):403-410, 1990)• Probably the most frequently used (and most

well-known) algorithm in bioinformatics

Last update: 10-Aug-2020 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2020 38

Page 39: Lecture 3. Heuristic Sequence Alignmentkevinyip/csci3220/CSCI3220... · Numbers in real situations •Scenario 1: Whole-genome alignment between two species –m, n(e.g., C. elegansand

BLAST vs. FASTA• BLAST also uses the two main ideas (finding local

matches, then extending and combining them)• Main differences between the original ideas of BLAST

and FASTA:1. FASTA considers exact matches in the first step. BLAST

allows high-scoring inexact matches2. BLAST tries to extend local matches regardless of the

presence of local matches in the same diagonal3. BLAST contains a way to evaluate statistical significance

of matched sequences• In later versions the two share more common ideas• Let’s study these differences in more details

Last update: 10-Aug-2020 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2020 39

Page 40: Lecture 3. Heuristic Sequence Alignmentkevinyip/csci3220/CSCI3220... · Numbers in real situations •Scenario 1: Whole-genome alignment between two species –m, n(e.g., C. elegansand

1. Local matches• Again, consider the query sequence r: ACGTTGCT

• Suppose k=3, the first sub-sequence (“word”) is ACG

• FASTA looks for the locations of ACG in the sequences in the database

• BLAST looks for the locations of ACG and other similar length-3 sequences– If match has +1 score, mismatch has -1 score,

and we only consider sub-sequences with score ³ 1, we will consider these sub-sequences:

Last update: 10-Aug-2020 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2020 40

ACGCCGGCGTCGAAGAGGATGACAACCACT

Page 41: Lecture 3. Heuristic Sequence Alignmentkevinyip/csci3220/CSCI3220... · Numbers in real situations •Scenario 1: Whole-genome alignment between two species –m, n(e.g., C. elegansand

Faster or slower?• For the same word length, BLAST needs to

search for more related sub-sequences• However, BLAST is usually faster than FASTA

because– BLAST uses a larger k, and so there are fewer

matches (for DNA, usually BLAST uses 11 while FASTA uses 6-8)

– Couldn’t FASTA also use a large k? No, because it only considers exact matches. Many local matches would be missed if k is too large

Last update: 10-Aug-2020 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2020 41

Page 42: Lecture 3. Heuristic Sequence Alignmentkevinyip/csci3220/CSCI3220... · Numbers in real situations •Scenario 1: Whole-genome alignment between two species –m, n(e.g., C. elegansand

2. Extending and combining matches

• For each local match, BLAST extends it by including the adjacent characters in the two ends until the match score drops below a threshold

• Second version of BLAST (BLAST2) also tries to combine matches on the same diagonal

Last update: 10-Aug-2020 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2020 42

Page 43: Lecture 3. Heuristic Sequence Alignmentkevinyip/csci3220/CSCI3220... · Numbers in real situations •Scenario 1: Whole-genome alignment between two species –m, n(e.g., C. elegansand

3. Statistical significance• Besides being faster, another main contribution of

BLAST is evaluating the statistical significance of search results

• Statistical significance: the “E-value”– Suppose the query sequence r has length m, a sequence s

in the database has length n, and a match has score Q. What is the expected (i.e., mean) number of matches with score Q or larger for a pair of random sequences of lengths m and n respectively?

– What is the expected number in the whole database?• This expected number in the whole database is called the E-value

– A small E-value means it is unlikely to happen by chance, thus suggesting potential biological meaning• Logic: There must be a reason behind this high similarity. For

example, r and s may be evolutionarily or functional related.

Last update: 10-Aug-2020 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2020 43

Page 44: Lecture 3. Heuristic Sequence Alignmentkevinyip/csci3220/CSCI3220... · Numbers in real situations •Scenario 1: Whole-genome alignment between two species –m, n(e.g., C. elegansand

Statistical significance• An illustration:

– r=A– s=AC– Best match: exact match, match score = 1

• How many matches are there with score ³ 1 for random r and s of lengths 1 and 2, respectively?– 0 matches:

• r=A, s=CC, CG, CT, GC, GG, GT, TC, TG, TT (9 cases)– 1 match:

• r=A, s=AC, AG, AT, CA, GA, TA (6 cases)– 2 matches:

• r=A, s=AA (1 case)– Expected number assuming equal chance of all cases (due to

symmetry, no need to consider r=C, r=G and r=T):• (0x9 + 1x6 + 2x1) / 16 = 0.5 – statistically not quite significant (usually call it

significant if <0.05 or <0.01)• In reality, need to estimate chance of each case from some large databases

instead of assuming uniform distribution

Last update: 10-Aug-2020 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2020 44

Page 45: Lecture 3. Heuristic Sequence Alignmentkevinyip/csci3220/CSCI3220... · Numbers in real situations •Scenario 1: Whole-genome alignment between two species –m, n(e.g., C. elegansand

Computing statistical significance• For large m and n, we cannot list all cases to find the

expected number• Fortunately, the match score of two sub-sequences is

the maximum score among all possible local alignments– When m and n are large, the match score tends to follow

an extreme value distribution– There are known formulas to compute E-values

• For a match between r and s from database D, the size of D (and the length of its sequences) should be included in the calculation– See http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html for

some more details

Last update: 10-Aug-2020 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2020 45

Page 46: Lecture 3. Heuristic Sequence Alignmentkevinyip/csci3220/CSCI3220... · Numbers in real situations •Scenario 1: Whole-genome alignment between two species –m, n(e.g., C. elegansand

Quick quiz• What are the main differences between FASTA

and BLAST?1. First step• FASTA: exact match only• BLAST: also high-scoring inexact matches

2. Result interpretation• FASTA: alignment score• BLAST: E-value

• What is the meaning of E-value?– The expected number of local alignments with an

alignment score as high or even higher than the observed one if for random query and database sequences of the same lengths as the real ones

Last update: 10-Aug-2020 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2020 46

Page 47: Lecture 3. Heuristic Sequence Alignmentkevinyip/csci3220/CSCI3220... · Numbers in real situations •Scenario 1: Whole-genome alignment between two species –m, n(e.g., C. elegansand

Advanced database searching methods

• We have already seen that the data of different sequences in the database can be put into the same lookup table

• There are other methods to speed up local exact/inexact matches. For example (we will study later):– Suffix trees (see Further Readings)– Suffix arrays– Burrows-Wheeler transform tables

Last update: 10-Aug-2020 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2020 47

Page 48: Lecture 3. Heuristic Sequence Alignmentkevinyip/csci3220/CSCI3220... · Numbers in real situations •Scenario 1: Whole-genome alignment between two species –m, n(e.g., C. elegansand

Variations• Depending on the type of sequences (query-

database):– Nucleotide-nucleotide BLAST (blastn)– Protein-protein BLAST (blastp)– Nucleotide 6-frame translation-protein BLAST (blastx)

• Perform 6-frame translation of nucleotide query, then compare with protein sequences in database

– Protein-nucleotide 6-frame translation BLAST (tblastn)• Compare query protein sequence with 6-frame translation of

nucleotide sequences in database– Nucleotide 6-frame translation-nucleotide 6-frame

translation BLAST (tblastx)• Perform 6-frame translation of query and database

nucleotide sequences, then perform comparisons

Last update: 10-Aug-2020 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2020 48

Page 49: Lecture 3. Heuristic Sequence Alignmentkevinyip/csci3220/CSCI3220... · Numbers in real situations •Scenario 1: Whole-genome alignment between two species –m, n(e.g., C. elegansand

Six-frame translation revisited

+3 L V R T

+2 T C S Y+1 N L F V

5’-AACTTGTTCGTACA-3’3’-TTGAACAAGCATGT-5’

-1 K N T C

-2 S T R V-3 V Q E Y

Last update: 10-Aug-2020 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2020 49

Readingframe

Page 50: Lecture 3. Heuristic Sequence Alignmentkevinyip/csci3220/CSCI3220... · Numbers in real situations •Scenario 1: Whole-genome alignment between two species –m, n(e.g., C. elegansand

Variations

• When do you want to use blastn and when to use tblastx?– blastn: If conservation is expected at nucleotide level (e.g.,

ribosomal RNA)– tblastx: If conservation is expected at the protein level (e.g.,

coding exons)

Last update: 10-Aug-2020 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2020 50

Tool Query sequence Database sequences Comparison

blastn Nucleotide Nucleotide Nucleotide-nucleotide

blastp Protein Protein Protein-protein

blastx Nucleotide Protein 6FT-protein

tblastn Protein Nucleotide Protein-6FT

tblastx Nucleotide Nucleotide 6FT-6FT

Page 51: Lecture 3. Heuristic Sequence Alignmentkevinyip/csci3220/CSCI3220... · Numbers in real situations •Scenario 1: Whole-genome alignment between two species –m, n(e.g., C. elegansand

Iterative database search• Suppose you have a sequence and would want

to find a group of similar sequences in a database

• However, you are not sure whether your sequence has all the key properties of the group

• You can do an iterative database search

Last update: 10-Aug-2020 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2020 51

Page 52: Lecture 3. Heuristic Sequence Alignmentkevinyip/csci3220/CSCI3220... · Numbers in real situations •Scenario 1: Whole-genome alignment between two species –m, n(e.g., C. elegansand

Illustration• Suppose there is a group of related sequences with two

properties:1. Mainly C’s in the first half2. Mainly T’s in the second half

• You have one of the sequencesr: CCCCTATG– It has perfect signature for #1, but not very clear for #2

• Suppose a database contains the following sequences from the same group:s1: CCCCTTTTs2: CCGCATTTs3: GCTCTTTTs4: AACCTTTT– If we use r to query the database, probably we can only get the

first one or two

Last update: 10-Aug-2020 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2020 52

Page 53: Lecture 3. Heuristic Sequence Alignmentkevinyip/csci3220/CSCI3220... · Numbers in real situations •Scenario 1: Whole-genome alignment between two species –m, n(e.g., C. elegansand

How to get all four?1. First, use BLAST to get the highly similar

sequences (let’s say you get s1 and s2)2. Then, construct a profile of these sequences– E.g., CC[CG]C[AT][AT]T[GT] (can also be PWM or

other models – will study more later)3. Use the model to BLAST again– Probably can get s3 and/or s4 now as the profile

contains more T’s in the second half than r4. Repeat #2 and #3 above until no more new

sequences are returned• This is similar to the Position-Specific Iterative

BLAST (PSI-BLAST) algorithm

Last update: 10-Aug-2020 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2020 53

r: CCCCTATGs1: CCCCTTTTs2: CCGCATTTs3: GCTCTTTTs4: AACCTTTT

Page 54: Lecture 3. Heuristic Sequence Alignmentkevinyip/csci3220/CSCI3220... · Numbers in real situations •Scenario 1: Whole-genome alignment between two species –m, n(e.g., C. elegansand

Multiple sequence alignment (MSA)• In general, given a set of sequences, we want to

align them all at the same time so that related characters are put in the same column

• As mentioned before, with 3 or more sequences, it quickly becomes infeasible to get the optimal solution by dynamic programming– Again, we need heuristics

• Now let’s study these topics:– How to evaluate the goodness of a MSA (i.e.,

computing the alignment score of a MSA)– How to form a good MSA

Last update: 10-Aug-2020 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2020 54

Page 55: Lecture 3. Heuristic Sequence Alignmentkevinyip/csci3220/CSCI3220... · Numbers in real situations •Scenario 1: Whole-genome alignment between two species –m, n(e.g., C. elegansand

Alignment score• Suppose we have already got an alignment

with 3 or more sequences. We want to evaluate how good it is. How can we compute an alignment score?

• Two possible ideas:– All pairs (e.g. average of 1 vs. 2, 1 vs. 3 and 2 vs. 3)– Compare each with a profile• Consensus sequence• Position weight matrix (PWM)• ...

Last update: 10-Aug-2020 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2020 55

Page 56: Lecture 3. Heuristic Sequence Alignmentkevinyip/csci3220/CSCI3220... · Numbers in real situations •Scenario 1: Whole-genome alignment between two species –m, n(e.g., C. elegansand

Example• Suppose we have this alignment:r1:ACGGCTr2:GCGGTTr3:TGGG_Tr4:TCGG_T

• Match:+1 score, mismatch/indel:-1 score

Last update: 10-Aug-2020 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2020 56

Page 57: Lecture 3. Heuristic Sequence Alignmentkevinyip/csci3220/CSCI3220... · Numbers in real situations •Scenario 1: Whole-genome alignment between two species –m, n(e.g., C. elegansand

All pairs• Scoring matrix:

• Average alignment score = (2 + 0 + 0 + 2 + 2 + 3) / 6 = 9 / 6 = 1.5

• Note: the alignment between sequences r3 and r4involves a “gap only” column – we simply ignore it

Last update: 10-Aug-2020 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2020 57

r1 r2 r3 r4r1 6 2 0 2

r2 2 6 0 2

r3 0 0 5 3

r4 2 2 3 5

r1:ACGGCTr2:GCGGTTr3:TGGG_Tr4:TCGG_T

Page 58: Lecture 3. Heuristic Sequence Alignmentkevinyip/csci3220/CSCI3220... · Numbers in real situations •Scenario 1: Whole-genome alignment between two species –m, n(e.g., C. elegansand

Consensus sequence• Suppose we represent the alignment

by the consensus sequence TCGGCT• Alignment scores between each input

sequence and consensus:– r1: 4– r2: 2– r3: 2– r4: 4– Average = (4 + 2 + 2 + 4) / 4 = 12 / 4 = 3

• Better allow choices or use a probabilistic model such as PWM

Last update: 10-Aug-2020 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2020 58

r1:ACGGCTr2:GCGGTTr3:TGGG_Tr4:TCGG_T

Page 59: Lecture 3. Heuristic Sequence Alignmentkevinyip/csci3220/CSCI3220... · Numbers in real situations •Scenario 1: Whole-genome alignment between two species –m, n(e.g., C. elegansand

Performing multiple sequence alignment

• Many methods:– Clustal (ClustalW, ClustalX, Clustal Omega, etc.)– T-Coffee– MAFFT– MUSCLE– ...

• We will study the main ideas behind Clustal

Last update: 10-Aug-2020 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2020 59

Page 60: Lecture 3. Heuristic Sequence Alignmentkevinyip/csci3220/CSCI3220... · Numbers in real situations •Scenario 1: Whole-genome alignment between two species –m, n(e.g., C. elegansand

Clustal• First proposed by Giggins and Sharp in 1988• The popular version ClustalW (Clustal weighted)

was proposed by Thompson et al in 1994• Main steps:– Compute distance matrix between all pairs of

sequences– Construct a tree that captures the relationship

between the sequences according to the distance matrix

– Progressively align the sequences based on the tree

Last update: 10-Aug-2020 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2020 60

Page 61: Lecture 3. Heuristic Sequence Alignmentkevinyip/csci3220/CSCI3220... · Numbers in real situations •Scenario 1: Whole-genome alignment between two species –m, n(e.g., C. elegansand

Distance matrix• Distance matrix: similar to a

scoring matrix, but larger number means more dissimilar

• Let’s say we use Needleman-Wunsch to get optimal alignment and distance = length of alignment – alignment score– Here we only have the raw

sequences and don’t have the MSA yet

Last update: 10-Aug-2020 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2020 61

r1 r2 r3 r4r1 0 4 6 4

r2 4 0 6 4

r3 6 6 0 2

r4 4 4 2 0

r1:ACGGCTr2:GCGGTTr3:TGGGTr4:TCGGT

Page 62: Lecture 3. Heuristic Sequence Alignmentkevinyip/csci3220/CSCI3220... · Numbers in real situations •Scenario 1: Whole-genome alignment between two species –m, n(e.g., C. elegansand

From distance matrix to tree• Tree: Close sequences are put close to each

other in the tree, branch length indicates distance

• Forming a tree (one possible way): repeatedly group the two closest sequences together (hierarchical clustering)– We will study more about tree construction later

Last update: 10-Aug-2020 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2020 62

Page 63: Lecture 3. Heuristic Sequence Alignmentkevinyip/csci3220/CSCI3220... · Numbers in real situations •Scenario 1: Whole-genome alignment between two species –m, n(e.g., C. elegansand

From distance matrix to tree• A possible tree:

Last update: 10-Aug-2020 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2020 63

r1:ACGGCT

r2:GCGGTT

r3:TGGGT

r4:TCGGT

r1 r2 r3 r4r1 0 4 6 4

r2 4 0 6 4

r3 6 6 0 2

r4 4 4 2 0

r1:ACGGCTr2:GCGGTTr3:TGGGTr4:TCGGT

Page 64: Lecture 3. Heuristic Sequence Alignmentkevinyip/csci3220/CSCI3220... · Numbers in real situations •Scenario 1: Whole-genome alignment between two species –m, n(e.g., C. elegansand

From tree to MSA• Alignment order:

1. r3 vs. r4

2. r1 vs. r2

3. (r1, r2) vs. (r3, r4)• May align consensus

sequence between r1 and r2with consensus sequence between r3 and r4

• Or may use PWM

Last update: 10-Aug-2020 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2020 64

r1:ACGGCT

r2:GCGGTT

r3:TGGGT

r4:TCGGT

Page 65: Lecture 3. Heuristic Sequence Alignmentkevinyip/csci3220/CSCI3220... · Numbers in real situations •Scenario 1: Whole-genome alignment between two species –m, n(e.g., C. elegansand

A complete example• r1 = ATT, r2 = CGT, r3 = ATGT• Match: 1; mismatch: -1; indel: -2• Best alignment between r1 and r2:

ATTCGT(score = -1, distance = 4)

• Best alignment between r1 and r3:AT_TATGT(score = 1, distance = 3)

• Best alignments between r2 and r3:C_GT _CGTATGT and ATGT(score = -1, distance = 5)

• Consensus between r1 and r3:r13 = ATT or ATGT

• Best alignments between r13 and r2:ATT ATGT ATGTCGT or C_GT or _CGT(for ATT) (for ATGT)

Last update: 10-Aug-2020 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2020 65

r1:ATT

r3:ATGT

r2:CGT

Resulting tree:

AT_TATGT

ATTCGTorATGTC_GTorATGT_CGT

Multiple sequence alignments:AT_T AT_T AT_TCG_T C_GT _CGTATGT or ATGT or ATGT

Page 66: Lecture 3. Heuristic Sequence Alignmentkevinyip/csci3220/CSCI3220... · Numbers in real situations •Scenario 1: Whole-genome alignment between two species –m, n(e.g., C. elegansand

A complete example

Last update: 8-Oct-2020 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2020 66

r1:ATT

r3:ATGT

r2:CGT

AT_TATGT

ATTCGT

Multiple sequence alignment:AT_TCG_TATGT

r1:ATT

r3:ATGT

r2:CGT

AT_TATGT

ATGTC_GT

Multiple sequence alignment:AT_TC_GTATGT

r1:ATT

r3:ATGT

r2:CGT

AT_TATGT

ATGT_CGT

Multiple sequence alignment:AT_T_CGTATGT

Page 67: Lecture 3. Heuristic Sequence Alignmentkevinyip/csci3220/CSCI3220... · Numbers in real situations •Scenario 1: Whole-genome alignment between two species –m, n(e.g., C. elegansand

CASE STUDY, SUMMARY AND FURTHER READINGS

Epilogue

Page 68: Lecture 3. Heuristic Sequence Alignmentkevinyip/csci3220/CSCI3220... · Numbers in real situations •Scenario 1: Whole-genome alignment between two species –m, n(e.g., C. elegansand

Case study: High-impact work• The 1990 BLAST paper by Altschul et al. has been

cited 38,000 times, ranked 12th in the most highly-cited papers of all time by ISI Web of Science in 2014– PSI-BLAST was the 14th, with ~36,000 citations– (Check out more details about the list by yourself at

http://www.nature.com/news/the-top-100-papers-1.16224)• The method itself is one of the most used ones in

bioinformatics.– The work is not only well-received in academia, but

also heavily used in practice.• Why the success?

Last update: 10-Aug-2020 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2020 68

Page 69: Lecture 3. Heuristic Sequence Alignmentkevinyip/csci3220/CSCI3220... · Numbers in real situations •Scenario 1: Whole-genome alignment between two species –m, n(e.g., C. elegansand

Case study: High-impact work• Why the success?– Exponential growth in the amount of sequencing

data– Optimal methods are too slow• BLAST is much faster• Seldom necessary to find “optimal” solution –

mathematically optimal does not guarantee biological significance

– E-value• Interpretability: What cutoff score would we use to

define a “good” alignment?• Statistical basis

Last update: 10-Aug-2020 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2020 69

Page 70: Lecture 3. Heuristic Sequence Alignmentkevinyip/csci3220/CSCI3220... · Numbers in real situations •Scenario 1: Whole-genome alignment between two species –m, n(e.g., C. elegansand

Case study: High-impact work• Some ingredients of high-impact work:– Real needs• Not only now, but also future• No good solutions exist yet

– Balance between theoretical elegance and practicality

– User-friendliness• Easy-to-interpret inputs and outputs

– Availability, stability and scalability– An appropriate name

Last update: 10-Aug-2020 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2020 70

Page 71: Lecture 3. Heuristic Sequence Alignmentkevinyip/csci3220/CSCI3220... · Numbers in real situations •Scenario 1: Whole-genome alignment between two species –m, n(e.g., C. elegansand

Summary• We need heuristic alignment methods because

dynamic programming is infeasible for very long sequences and/or many sequences

• For pairwise alignment, FASTA and BLAST first find local matches, then extend/combine them to get longer matches– There are ways to evaluate statistical significance of

matches• For multiple sequence alignment, one way is to

perform a series of pairwise alignments in a greedy manner

Last update: 10-Aug-2020 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2020 71

Page 72: Lecture 3. Heuristic Sequence Alignmentkevinyip/csci3220/CSCI3220... · Numbers in real situations •Scenario 1: Whole-genome alignment between two species –m, n(e.g., C. elegansand

Further readings• Chapter 4 of Algorithms in Bioinformatics: A Practical

Introduction– Methods for aligning whole genomes– Free slides available

• Chapter 5 of Algorithms in Bioinformatics: A Practical Introduction– More about E-values of BLAST– Additional searching algorithms– Free slides available

• Chapter 6 of Algorithms in Bioinformatics: A Practical Introduction– More details and additional methods for MSA– Free slides available

Last update: 10-Aug-2020 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2020 72