Lecture 4. Short Read Alignment

The Chinese University of Hong Kong
CSCI3220 Algorithms for Bioinformatics

CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2015

Lecture outline
1. Massively parallel sequencing and short reads
  – The short read alignment problem
2. Suffix trie/tree/array
3. Burrows-Wheeler Transform (BWT)

Last update: 20-Sep-2015

MASSIVELY PARALLEL SEQUENCING AND SHORT READS

Part 1

DNA sequencing
• DNA sequencing is the experimental procedure for finding out the exact text string of a DNA sequence
  – Input: Multiple copies of an unknown DNA (biological) sequence
    • Blood sample of a patient
    • Some cultured bacteria
    • A worm
    • ...
  – Output: (Text) sequences of fragments of the DNA sequence

Illustration

[Figure: multiple copies of an unknown DNA (biological) sequence, here TACCAGCGGACCGCTGAC shown three times, are broken down into fragments and sequenced. The output is the text sequences of the fragments, e.g. TACCAG, CGGACCGCT and GAC.]

Sequencing by synthesis
• Use one strand as template, synthesize the other strand
• Different ways to detect what base is added:
  – Give a different color for each type of nucleotide
  – Supply only one type of nucleotide at a time, and see if some signals (e.g., light) can be detected
  – Stop whenever a certain nucleotide is added. Then deduce the nucleotide by DNA lengths
• Can only handle up to a certain length of DNA

Sequencing by synthesis

[Figure: sequencing by synthesis. Image credit: Illumina]

Massively parallel sequencing (MPS)
• Sequencing many short fragments in parallel
  – Also called “next-generation” or “deep” sequencing

Image credit: Metzker, Nature Reviews Genetics 11:31-46, (2010); Azco Biotech

Shotgun sequencing
• Breaking down the long DNA sequence into multiple fragments due to the experimental limitation

[Figure: whole genome shotgun versus the hierarchical approach. With the hierarchical approach it is slightly easier to get back the original sequence.]

Image credit: Jennifier et al., Biological Procedures Online 11(1):52-78, (2009)

Short reads
• The output of MPS is a list of short sequences
  – Each is called a read
  – Example: ACA, ATA, ATA, ATT, TAG, TAT, TTC
• Some properties of current MPS reads:
  – About 100-200 nucleotides long (very short as compared to the human genome)
  – May overlap, since multiple copies of the original DNA are sequenced
    • Millions or even billions of reads from one experiment
  – The DNA sample may contain variations due to heterozygosity, somatic mutations and mixed population of cells
  – May also have contamination and sequencing errors

Computational problems
• Two main computational problems
• Sequence alignment (this lecture):
  – Given a reference sequence s of length n, how to find out the position of each read r of length m in the reference?
  – Example situations:
    • Sequencing the DNA in a cancer sample – The sequence of normal human DNA can serve as a reference
    • Sequencing the DNA of a strain of a bacteria – The sequence of other strains of the bacteria can serve as a reference
• Sequence assembly (next lecture):
  – Is it possible to assemble the short reads back to the original DNA?

Short read alignment
• Example:
  – Original sequence: s=TATACATTAG
  – Short reads: ACA, ATA, ATA, ATT, TAG, TAT, TTC
  – Alignment:

    TATACATTAG
       ACA
     ATA
     ATA
         ATT
           TAG
    TAT
          TTC   (variation or error)

Image source: http://img1.etsystatic.com/000/0/6103070/il_fullxfull.203233493.jpg
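
The toy alignment above can be reproduced with a brute-force scan. The sketch below is only a baseline for comparison, not one of the indexed methods of this lecture: it takes O(nm) time per read. The function name `align_reads` and the `max_mismatches` parameter are made up for illustration.

```python
def align_reads(s, reads, max_mismatches=0):
    """For each read, list (1-based position, mismatches) of every window
    of s that matches the read with at most max_mismatches substitutions."""
    results = []
    for r in reads:
        hits = []
        for i in range(len(s) - len(r) + 1):
            window = s[i:i + len(r)]
            mismatches = sum(1 for a, b in zip(window, r) if a != b)
            if mismatches <= max_mismatches:
                hits.append((i + 1, mismatches))
        results.append((r, hits))
    return results


s = "TATACATTAG"
for read, hits in align_reads(s, ["ACA", "ATA", "ATT", "TAG", "TAT", "TTC"]):
    print(read, hits)
```

With `max_mismatches=0` the read TTC gets no hit. Allowing one mismatch places it at position 7, as on the slide, but also at position 3, which shows why inexact matching is harder than exact matching.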

Short read alignment
• Basically a local alignment problem, but need to align millions or billions of short sequences with a very long reference sequence, expecting almost exact matches
• Need to build indexes on reads or reference
  – Once the indexes are built, the searching time should depend only on the size of searching results (number of hits and their locations), not the length of the reference
  – We will mainly study methods for exact matches

Indices
• Main considerations:
  – Space requirement
  – Time requirement for building index
  – Time requirement for searching
• Main approaches:
  – Hash-table-based (Similar to FASTA and BLAST)
    • BFAST, ELAND, MAQ, MOSAIK, SHRiMP, SOAP, ZOOM, ...
  – Suffix-tree or Burrows-Wheeler-Transform-based
    • Bowtie, BWA, SOAP2, ...

Hashing (revision)
• Recall how k-mers of a sequence are put into a table
• s=TATACATTAG (positions 1 to 10)

    2-mer  Position
    AC     4
    AG     9
    AT     2, 6
    CA     5
    TA     1, 3, 8
    TT     7

• If we want to know whether a particular k-mer appears in this sequence, we need to know which row represents this k-mer
  – How?
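
As a small sketch of how such a table could be built, the code below collects the 1-based start positions of every k-mer in a Python dictionary. The function name `build_kmer_index` is an assumption for illustration; the dictionary plays the role of the table.

```python
from collections import defaultdict


def build_kmer_index(s, k):
    """Map each k-mer of s to the list of its 1-based start positions."""
    index = defaultdict(list)
    for i in range(len(s) - k + 1):
        index[s[i:i + k]].append(i + 1)
    return dict(index)


index = build_kmer_index("TATACATTAG", 2)
for kmer in sorted(index):
    print(kmer, index[kmer])
```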

Scheme 1
• Store only the k-mers that exist in the sequence. One k-mer per table entry.
• Finding if a k-mer is in the table: binary search

    2-mer  Position
    AC     4
    AG     9
    AT     2, 6
    CA     5
    TA     1, 3, 8
    TT     7

Scheme 2
• Have one entry for every possible k-mer, no matter whether it exists in the sequence or not
• Finding if a k-mer is in the table: Direct calculation of row number
  – E.g., for CT, with A=0, C=1, G=2, T=3, the row number is 1*4+3+1 = 8
• May need more space than Scheme 1, but constant searching time

Row num. 2-mer Position

1 AA

2 AC 4

3 AG 9

4 AT 2, 6

5 CA 5

6 CC

7 CG

8 CT

9 GA

10 GC

11 GG

12 GT

13 TA 1, 3, 8

14 TC

15 TG

16 TT 7
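
The direct row calculation can be sketched as reading the k-mer as a base-4 number. The encoding A=0, C=1, G=2, T=3 matches the CT example above; the names `CODE`, `row_number` and `build_direct_table` are hypothetical.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def row_number(kmer):
    """1-based row of kmer in a table listing all 4^k k-mers alphabetically."""
    value = 0
    for ch in kmer:
        value = value * 4 + CODE[ch]
    return value + 1


def build_direct_table(s, k):
    """Table with one (possibly empty) position list per possible k-mer."""
    table = [[] for _ in range(4 ** k)]
    for i in range(len(s) - k + 1):
        table[row_number(s[i:i + k]) - 1].append(i + 1)
    return table


table = build_direct_table("TATACATTAG", 2)
print(row_number("CT"), table[row_number("CT") - 1])
print(row_number("TA"), table[row_number("TA") - 1])
```

The table always has 4^k rows, which is the space cost mentioned above: for k=2 only 6 of the 16 rows are used.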

Scheme 3
• Use a function to map (“hash”) each k-mer into a small number
• Finding if a k-mer is in the table: Compute hash value, then check the k-mers in the entry
  – For example, row number equals the number of A’s in the k-mer plus one
• Index size and searching time depend on hash function

    Row num.  2-mer  Position
    1         TT     7
    2         AC     4
              AG     9
              AT     2, 6
              CA     5
              TA     1, 3, 8
    3

Hashing
• In general, these schemes all have to face the same two problems
  – Some allocated space would be wasted if the k-mers do not appear in the sequence
  – Collisions could occur
  – A typical tradeoff between space and time
• One cause of the problem: The hash function has no knowledge about the sequence
  – We now study some data structures that make use of some information about the sequence

SUFFIX TRIE/TREE/ARRAY

Part 2

Suffixes
• Given a sequence s[1..n], a suffix is either a sub-sequence s[i..n] for any i between 1 and n, or the empty string (which is sometimes represented by s[n+1..n])
• Example: s[1..10]=TATACATTAG
• Suffixes:
  – s[1..10] TATACATTAG
  – s[2..10] ATACATTAG
  – s[3..10] TACATTAG
  – s[4..10] ACATTAG
  – s[5..10] CATTAG
  – s[6..10] ATTAG
  – s[7..10] TTAG
  – s[8..10] TAG
  – s[9..10] AG
  – s[10..10] G
  – Empty string

Suffixes with end symbol
• To show the empty string and to mark where the sequence ends, we will use the symbol $ to indicate the end of a sequence, and define s[n+1] to be $
• Example: s[1..11]=TATACATTAG$
• Suffixes:
  – s[1..11] TATACATTAG$
  – s[2..11] ATACATTAG$
  – s[3..11] TACATTAG$
  – s[4..11] ACATTAG$
  – s[5..11] CATTAG$
  – s[6..11] ATTAG$
  – s[7..11] TTAG$
  – s[8..11] TAG$
  – s[9..11] AG$
  – s[10..11] G$
  – s[11..11] $

Subsequence and suffixes
• Important concept: Every subsequence of s (here meaning a contiguous stretch s[i..j]) is a prefix of a suffix of s (recall: optimal local alignment)
• Example:
  – s=TATACATTAG$
  – The subsequence s[4..7]=ACAT is a prefix of the suffix s[4..11]=ACATTAG$
• Therefore, finding whether a short read appears in a reference sequence is equivalent to checking whether the short read is a prefix of a suffix of the reference
  – To facilitate the searching of subsequences, we can put the suffixes into a tree

Suffixes: TATACATTAG$, ATACATTAG$, TACATTAG$, ACATTAG$, CATTAG$, ATTAG$, TTAG$, TAG$, AG$, G$, $

Suffix trie
• Tree:
  – A set of nodes
  – A set of edges, each connecting two nodes
  – No cycles
• Suffix trie of sequence s:
  – A rooted tree
  – Every edge is labeled with one character from s
    • Sibling nodes are ordered alphabetically, with the end-of-sequence character $ ordered before all other characters, i.e., $ < A < C < G < T
  – Every path from the root to a leaf represents a suffix of s
  – Every suffix of s is represented by a path from the root to a leaf
  – Suffixes can share edges for their common prefixes

Suffix trie
• s=TATACATTAG$
• Suffixes: TATACATTAG$, ATACATTAG$, TACATTAG$, ACATTAG$, CATTAG$, ATTAG$, TTAG$, TAG$, AG$, G$, $

[Figure: the suffix trie of s, with one character per edge and one root-to-leaf path per suffix]

Suffix trie
• s=TATACATTAG$
• To search for a length-m subsequence, simply follow the path from the root until
  – The subsequence is found (the subsequence appears in s), e.g., ACAT OR
  – You cannot go any further (the subsequence does not appear in s), e.g., CATC
• Both cases take O(m) time – independent of n
  – Since each node has no more than 5 children ($, A, C, G, T)
• A suffix trie can be constructed in time proportional to its size
  – Worst case O(n^2) nodes

[Figure: the suffix trie of s, with one character per edge and one root-to-leaf path per suffix]
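
The search procedure can be tried out with a minimal suffix trie stored as nested dictionaries. This is a sketch only: it inserts every suffix character by character, so it takes quadratic time and space, and the names `build_suffix_trie` and `trie_contains` are assumptions.

```python
def build_suffix_trie(s):
    """Suffix trie of s as nested dicts; s is assumed to end with a unique '$'."""
    root = {}
    for i in range(len(s)):
        node = root
        for ch in s[i:]:
            # Reuse the existing child if this prefix is already in the trie
            node = node.setdefault(ch, {})
    return root


def trie_contains(root, r):
    """Follow r from the root; O(len(r)) dictionary look-ups."""
    node = root
    for ch in r:
        if ch not in node:
            return False  # cannot go any further: r does not appear in s
        node = node[ch]
    return True


trie = build_suffix_trie("TATACATTAG$")
print(trie_contains(trie, "ACAT"))
print(trie_contains(trie, "CATC"))
```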

Suffix tree
• A suffix tree is a compact form of a suffix trie, where non-branching paths are collapsed to a single edge

[Figure: the suffix trie of TATACATTAG$ next to the corresponding suffix tree, whose edges carry multi-character labels such as CATTAG$, ACATTAG$, TAG$, G$ and TACATTAG$]

Suffix tree
• The tree has no more than 2n nodes. Why? Hint: How many leaf nodes are there?
• The tree can be constructed in O(n) time
  – We do not go into details
• How much space does each node require?
• No need to store the long edge labels as strings in the tree. Can use pointers to the original sequence s.
  – Constant space per node
  – O(n) space for the whole tree

[Figure: the suffix tree of TATACATTAG$ with each edge label stored as a pair of start and end positions in s: 2-2:A, 5-11:CATTAG$, 4-11:ACATTAG$, 1-1:T, 3-3:T, 8-11:TAG$, 10-11:G$, 3-11:TACATTAG$, 11-11:$]

Some limitations
• While suffix tree is already quite space- and time-efficient, it also has some drawbacks:
  – Some construction algorithms are quite complex
  – The total space needed is usually 20n bytes or more for a sequence of length n due to overheads originated from the tree structure and position indices
    • Think about the length of the human genome and the maximum amount of memory for a 32-bit machine
• An alternative data structure: suffix array

Suffix array
• An array storing the original locations of the suffixes when they are sorted in lexicographic order
• s=TATACATTAG$

Before sorting:

    Suffix        Location
    TATACATTAG$   1
    ATACATTAG$    2
    TACATTAG$     3
    ACATTAG$      4
    CATTAG$       5
    ATTAG$        6
    TTAG$         7
    TAG$          8
    AG$           9
    G$            10
    $             11

After sorting:

    Suffix        Location
    $             11
    ACATTAG$      4
    AG$           9
    ATACATTAG$    2
    ATTAG$        6
    CATTAG$       5
    G$            10
    TACATTAG$     3
    TAG$          8
    TATACATTAG$   1
    TTAG$         7

The Location column after sorting is the suffix array: 11, 4, 9, 2, 6, 5, 10, 3, 8, 1, 7
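
The definition translates almost directly into code. The sketch below sorts the suffixes explicitly, which is fine for the toy example but far slower than the linear-time constructions used in practice. The name `suffix_array` is an assumption, and the code relies on '$' sorting before the letters in Python's default string order.

```python
def suffix_array(s):
    """1-based start positions of the suffixes of s in lexicographic order.

    s is assumed to end with a unique '$'. Naive version: sorts the
    suffixes themselves instead of using a linear-time algorithm.
    """
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])


sa = suffix_array("TATACATTAG$")
print(sa)
```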

Using a suffix array
• Recall that a subsequence of s is also a prefix of a suffix of s, therefore
  – Finding a subsequence can be done by a binary search on the suffix array
  – All occurrences of the subsequence in s must be located on adjacent rows
  – Example: Searching for the subsequence AT

Suffix Location

$ 11

ACATTAG$ 4

AG$ 9

ATACATTAG$ 2

ATTAG$ 6

CATTAG$ 5

G$ 10

TACATTAG$ 3

TAG$ 8

TATACATTAG$ 1

TTAG$ 7

Binary search steps for AT:
  – Row 6 (location 5): s[5]=A? No. s[5]=C>A
  – Row 3 (location 9): s[9]=A? Yes. s[9+1]=T? No. s[9+1]=G<T
  – Row 4 (location 2): s[2]=A? Yes. s[2+1]=T? Yes. AT found in s!

• Straight-forward use of suffix array requires O(m log n) time
• Can improve to O(m) time by using extended suffix array – We don’t study here
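
A minimal sketch of the O(m log n) search follows. Instead of stopping at the first hit as in the walk-through above, it runs two binary searches to find the first and last adjacent rows whose suffix starts with r. The function names are hypothetical, and the suffix array is built naively for the demo.

```python
def suffix_array(s):
    """Naive suffix array (1-based positions); s ends with a unique '$'."""
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])


def find_occurrences(s, sa, r):
    """Sorted 1-based positions of r in s, using binary search over sa."""
    m = len(r)

    def prefix(row):
        # First m characters of the suffix stored in this row
        start = sa[row] - 1
        return s[start:start + m]

    lo, hi = 0, len(sa)
    while lo < hi:  # first row whose prefix is >= r
        mid = (lo + hi) // 2
        if prefix(mid) < r:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(sa)
    while lo < hi:  # first row whose prefix is > r
        mid = (lo + hi) // 2
        if prefix(mid) <= r:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])


s = "TATACATTAG$"
sa = suffix_array(s)
print(find_occurrences(s, sa, "AT"))
print(find_occurrences(s, sa, "TTC"))
```

Each comparison looks at up to m characters and each binary search needs O(log n) comparisons, which gives the O(m log n) bound.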

Time and space requirements
• A suffix array can be constructed in O(n) time
  – Can read off from suffix tree
• If each position is stored as an integer, and each integer takes 4 bytes, the whole suffix array needs 4n bytes
  – For large n, we cannot assume 4 bytes are sufficient. In general, each index takes log n bits. The total size is thus O(n log n) bits
• Already quite good. Can we do even better? Goal: O(n log |Σ|), where Σ is the alphabet
  – For DNA, |Σ| = |{A, C, G, T}| = 4 << n
• Methods:
  – Compressed suffix arrays (not discussed here)
  – Burrows-Wheeler Transform (our next topic)

BURROWS-WHEELER TRANSFORM

Part 3

Burrows-Wheeler Transform (BWT)
• Proposed by Michael Burrows and David Wheeler in 1994.
• A very compact structure that can be used for text search
• Input: sequence s
• Conceptual* method:
  – Find all rotations of s and put them in a matrix
  – Sort the rows of the matrix in lexicographic order
  – Output the sequence in the last column, b

*: In this lecture, whenever you see a method described as “conceptual”, it means it is used to illustrate some key ideas, but is usually too slow or too memory-demanding to be practical, and we will discuss better alternatives.
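
The three conceptual steps can be written down directly. This sketch materializes the whole rotation matrix, so it uses quadratic space and is meant only to illustrate the definition. The name `bwt_conceptual` is an assumption.

```python
def bwt_conceptual(s):
    """BWT of s via the full rotation matrix; s must end with a unique '$'."""
    rotations = [s[i:] + s[:i] for i in range(len(s))]  # all rotations of s
    rotations.sort()                                    # '$' sorts before A, C, G, T
    return "".join(row[-1] for row in rotations)        # last column


print(bwt_conceptual("TATACATTAG$"))
```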

Rotations and transformed string
• Input: s=TATACATTAG$
• Output: b=GTTTCAAAT$A

Rotations:
    TATACATTAG$
    ATACATTAG$T
    TACATTAG$TA
    ACATTAG$TAT
    CATTAG$TATA
    ATTAG$TATAC
    TTAG$TATACA
    TAG$TATACAT
    AG$TATACATT
    G$TATACATTA
    $TATACATTAG

Sorted rotations:
    $TATACATTAG
    ACATTAG$TAT
    AG$TATACATT
    ATACATTAG$T
    ATTAG$TATAC
    CATTAG$TATA
    G$TATACATTA
    TACATTAG$TA
    TAG$TATACAT
    TATACATTAG$
    TTAG$TATACA

Amazingly, we can use b to check whether an input string is a sub-sequence of s efficiently

Rotations and transformed string
• Input: s=TATACATTAG$
• Output: b=GTTTCAAAT$A
  – Did you notice the correspondence between the sorted rotations and the sorted suffixes (due to the unique $)?
  – Without the $ symbol, it may not be true. Consider CAA

Rotations:
    TATACATTAG$
    ATACATTAG$T
    TACATTAG$TA
    ACATTAG$TAT
    CATTAG$TATA
    ATTAG$TATAC
    TTAG$TATACA
    TAG$TATACAT
    AG$TATACATT
    G$TATACATTA
    $TATACATTAG

Sorted rotations:
    $TATACATTAG
    ACATTAG$TAT
    AG$TATACATT
    ATACATTAG$T
    ATTAG$TATAC
    CATTAG$TATA
    G$TATACATTA
    TACATTAG$TA
    TAG$TATACAT
    TATACATTAG$
    TTAG$TATACA

Suffixes Location

TATACATTAG$ 1

ATACATTAG$ 2

TACATTAG$ 3

ACATTAG$ 4

CATTAG$ 5

ATTAG$ 6

TTAG$ 7

TAG$ 8

AG$ 9

G$ 10

$ 11

Sorted suffixes Location

$ 11

ACATTAG$ 4

AG$ 9

ATACATTAG$ 2

ATTAG$ 6

CATTAG$ 5

G$ 10

TACATTAG$ 3

TAG$ 8

TATACATTAG$ 1

TTAG$ 7

Things to learn about BWT
1. How to construct b conceptually (i.e., slowly)
2. How to construct b efficiently
3. Basic properties of the sorted rotations
4. Getting s back from b conceptually
5. Getting s back from b efficiently
6. Using b to search for sub-sequences of s

Quick construction of b
• The procedure we have described for constructing the output b is very slow and memory demanding
• It can be quickly obtained from the suffix array
  – Recall that a suffix array can be constructed in linear time and space for a fixed alphabet
  – s=TATACATTAG$ (positions 1 to 11)
  – b=GTTTCAAAT$A
  – First character of b is the character before the first letter in the first row of the sorted rotations
    • b[1] = s[11-1] = s[10] = G
    • b[2] = s[4-1] = s[3] = T
    • b[3] = s[9-1] = s[8] = T
    • ...

Sorted suffixes Location

$ 11

ACATTAG$ 4

AG$ 9

ATACATTAG$ 2

ATTAG$ 6

CATTAG$ 5

G$ 10

TACATTAG$ 3

TAG$ 8

TATACATTAG$ 1

TTAG$ 7

Sorted rotations:
    $TATACATTAG
    ACATTAG$TAT
    AG$TATACATT
    ATACATTAG$T
    ATTAG$TATAC
    CATTAG$TATA
    G$TATACATTA
    TACATTAG$TA
    TAG$TATACAT
    TATACATTAG$
    TTAG$TATACA
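
The rule b[i] = s[SA[i]-1] can be sketched as below. The suffix array is built naively here just to keep the example self-contained, and the function names are assumptions. The one case the rule above does not spell out is the suffix at location 1, whose preceding character wraps around to the final $.

```python
def suffix_array(s):
    """Naive suffix array (1-based positions); s ends with a unique '$'."""
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])


def bwt_from_suffix_array(s, sa):
    """b[i] is the character just before the suffix in row i."""
    # For 1-based location p, the previous character is s[p - 2] in Python's
    # 0-based indexing; for p == 1 the index -1 wraps around to the final '$'.
    return "".join(s[p - 2] for p in sa)


s = "TATACATTAG$"
b = bwt_from_suffix_array(s, suffix_array(s))
print(b)
```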

Properties of the sorted rotations
• Simple properties for warm-up
• Property 1: All rows in the sorted rotation matrix are different
  – Due to the $ symbol
• Property 2: Every column in the matrix has the whole set of characters in s

Sorted rotations:
    $TATACATTAG
    ACATTAG$TAT
    AG$TATACATT
    ATACATTAG$T
    ATTAG$TATAC
    CATTAG$TATA
    G$TATACATTA
    TACATTAG$TA
    TAG$TATACAT
    TATACATTAG$
    TTAG$TATACA

s=TATACATTAG$

b=GTTTCAAAT$A

Properties of the sorted rotations
• Property 3: Different occurrences of the same character tend to cluster in b
  – E.g., three of the A’s are clustered, so are three of the T’s
  – Why? Because if a length-k pattern appears multiple times in s (e.g., TA), some rotations will have:
    • The length-(k-1) suffix of the pattern (A) at the beginning of the rotation → these rotations will be close in the matrix (though not always next to each other – check AT)
    • The first character of the pattern (T) in the last column
  – Significance? Easier to perform data compression

Sorted rotations:
    $TATACATTAG
    ACATTAG$TAT
    AG$TATACATT
    ATACATTAG$T
    ATTAG$TATAC
    CATTAG$TATA
    G$TATACATTA
    TACATTAG$TA
    TAG$TATACAT
    TATACATTAG$
    TTAG$TATACA

s=TATACATTAG$

b=GTTTCAAAT$A

Properties of the sorted rotations
• Property 4: The input s can be obtained back from the output b
• Conceptual method:
  1. Create an empty matrix
  2. Add b as the leftmost column of the matrix
  3. Sort the rows of the matrix
  4. Repeat 2 and 3 until the matrix has n+1 columns
  5. s can be read from the first row by moving the leading $ back to the tail
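
The five steps above can be sketched as follows. Like the slide's procedure, this is conceptual: it sorts an (n+1)-row matrix n+1 times. The name `inverse_bwt_conceptual` is an assumption.

```python
def inverse_bwt_conceptual(b):
    """Recover s (ending with '$') from b by repeated prepend-and-sort."""
    rows = [""] * len(b)                # step 1: empty matrix, one row per character
    for _ in range(len(b)):             # step 4: repeat until there are n+1 columns
        rows = [b[i] + rows[i] for i in range(len(b))]  # step 2: add b on the left
        rows.sort()                     # step 3: sort the rows
    first = rows[0]                     # the row that starts with '$'
    return first[1:] + first[0]         # step 5: move the leading '$' to the tail


print(inverse_bwt_conceptual("GTTTCAAAT$A"))
```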

Getting back the original sequence
• s=TATACATTAG$, b=GTTTCAAAT$A

[Figure: the reconstruction matrix after each add-and-sort round]
  – Round 1: the single column b (G, T, T, T, C, A, A, A, T, $, A) sorts to $, A, A, A, A, C, G, T, T, T, T
  – Round 2: adding b on the left gives G$, TA, TA, TA, CA, AC, AG, AT, TT, $T, AT, which sorts to $T, AC, AG, AT, AT, CA, G$, TA, TA, TA, TT
  – ...
  – Round 11: after sorting, the matrix is the full sorted rotation matrix:

    $TATACATTAG
    ACATTAG$TAT
    AG$TATACATT
    ATACATTAG$T
    ATTAG$TATAC
    CATTAG$TATA
    G$TATACATTA
    TACATTAG$TA
    TAG$TATACAT
    TATACATTAG$
    TTAG$TATACA

Getting back the original sequence
• Why does the procedure work?
  – Essentially we are reconstructing the sorted rotation matrix
  – When the reconstruction matrix contains only one column, after sorting it is exactly the first column of the sorted rotation matrix
  – When we add b as the new first column of the reconstruction matrix, it is like placing the last column of the sorted rotation matrix before the first column
  – When this matrix is sorted, we get the first two columns of the sorted rotation matrix
    • Every row contains a different subsequence of s
  – And so on

Getting back the original sequence
• s=TATACATTAG$, b=GTTTCAAAT$A
• Good, but the method seems very slow?
  – Yes, and we will see how to get back s from b faster

[Figure: the reconstruction rounds shown next to the sorted rotation matrix; the final reconstruction matrix is identical to the sorted rotations]

Properties of the sorted rotations
• Property 5: The i-th occurrence of a character x in the last column corresponds to the i-th occurrence of x in the first column
  – E.g., The second T in the last column is also the second T in the first column
    • Which is the one at position 8 in s
    • These T’s can have a different order in s
  – Why? Consider the following:
    • Order of the rotations starting with x
    • Order of the rotations ending with x
    Both depend on the remaining n characters

Sorted rotations:
    $TATACATTAG
    ACATTAG$TAT
    AG$TATACATT
    ATACATTAG$T
    ATTAG$TATAC
    CATTAG$TATA
    G$TATACATTA
    TACATTAG$TA
    TAG$TATACAT
    TATACATTAG$
    TTAG$TATACA

s=TATACATTAG$ (positions 1 to 11)

b=GTTTCAAAT$A

Applications of property 5
• First application: Getting back the original sequence fast
• b=GTTTCAAAT$A
• First column of sorted rotation matrix (by sorting characters in the last column or counting the number of occurrences of each character): $AAAACGTTTT
• Conceptual back-tracing:
  – Character before $: G
  – Character before G: second A (A)
  – Character before second A: second T (T)
  – Character before second T: fourth T (T)
  – Character before fourth T: fourth A (A)
  – Character before fourth A: C
  – Character before C: first A (A)
  – Character before first A: first T (T)
  – Character before first T: third A (A)
  – Character before third A: third T (T)
  – Character before third T: $
• Therefore the original sequence is s=TATACATTAG$
• If we have stored the location of the first occurrence of each character in the first column, back-tracing can be done very fast (without really storing the first column).

[Figure: the first column $AAAACGTTTT and the last column GTTTCAAAT$A of the sorted rotation matrix, with the back-tracing steps marked]
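
The back-tracing above can be sketched with two small tables: the row where each character first appears in the first column, and, for each row, how many times its last-column character has already occurred above it. The names `inverse_bwt`, `first_row` and `rank` are assumptions, and rows are 0-based in the code.

```python
def inverse_bwt(b):
    """Recover s (ending with '$') from its BWT b using property 5."""
    counts = {}
    for ch in b:
        counts[ch] = counts.get(ch, 0) + 1
    # first_row[c]: 0-based row of the first occurrence of c in the first column
    first_row, total = {}, 0
    for ch in sorted(counts):
        first_row[ch] = total
        total += counts[ch]
    # rank[i]: number of occurrences of b[i] in b[0..i-1]
    seen, rank = {}, []
    for ch in b:
        rank.append(seen.get(ch, 0))
        seen[ch] = rank[-1] + 1
    # Start from the row beginning with '$' and repeatedly step to the row
    # that begins with the character found in the last column.
    row, out = 0, []
    for _ in range(len(b) - 1):
        ch = b[row]
        out.append(ch)
        row = first_row[ch] + rank[row]
    return "".join(reversed(out)) + "$"


print(inverse_bwt("GTTTCAAAT$A"))
```

The characters are collected from the end of s towards its start, so the list is reversed before the closing $ is appended.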

Applications of property 5
• Second application: Text search
• Suppose we want to search for a sub-sequence r from s. All occurrences of r appear as prefixes of rows in the sorted rotation matrix, and these rows are adjacent.
  – For example: TA
• Therefore, we only need to find out the row numbers of the first and last rows that start with r
  – Now we study how we can find these numbers if we only have b without materializing the rotation matrix

Sorted rotations:
    $TATACATTAG
    ACATTAG$TAT
    AG$TATACATT
    ATACATTAG$T
    ATTAG$TATAC
    CATTAG$TATA
    G$TATACATTA
    TACATTAG$TA
    TAG$TATACAT
    TATACATTAG$
    TTAG$TATACA

Applications of property 5
• Say we want to search for TA. Conceptually:
  – From b, we can get back the first column of the rotation matrix
  – We know that A appears between the 2nd and 5th rows in the first column
  – We then check the corresponding entries in b (rows 2-5: T, T, T, C), and find that TA corresponds to the 1st to 3rd occurrences of T, which are on rows 8-10 of the first column
  – We can then find out their actual locations in s from the suffix array
    • We can either save the array on disk or save only a portion in memory, and compute the remaining on the fly

s=TATACATTAG$

Row  First column  Last column (b)  Sorted suffix  Location
1    $             G                $              11
2    A             T                ACATTAG$       4
3    A             T                AG$            9
4    A             T                ATACATTAG$     2
5    A             C                ATTAG$         6
6    C             A                CATTAG$        5
7    G             A                G$             10
8    T             A                TACATTAG$      3
9    T             T                TAG$           8
10   T             $                TATACATTAG$    1
11   T             A                TTAG$          7
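The table of sorted suffixes and locations can be reproduced with a naive suffix array construction. This sketch sorts suffixes directly, which is far slower than the linear-time constructions mentioned in the lecture; the function names are assumptions for illustration, and positions are 1-based to match the slides.

```python
def suffix_array(s):
    """1-based start positions of the suffixes of s, in sorted order (naive)."""
    return [i + 1 for i in sorted(range(len(s)), key=lambda i: s[i:])]

def bwt_from_suffix_array(s, sa):
    """Last column b: the character just before each sorted suffix.
    For the suffix starting at position 1 this wraps around to '$'."""
    return "".join(s[p - 2] for p in sa)

s = "TATACATTAG$"
sa = suffix_array(s)
b = bwt_from_suffix_array(s, sa)
print(sa)  # [11, 4, 9, 2, 6, 5, 10, 3, 8, 1, 7]
print(b)   # GTTTCAAAT$A

# Rows 8-10 start with TA, so the suffix array gives the locations of TA in s
ta_positions = sorted(sa[row - 1] for row in range(8, 11))
print(ta_positions)  # [1, 3, 8]
```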


Applications of property 5
• Another example: CAT
  – 1st to 4th occurrences of T → rows 8-11
  – 3rd to 4th occurrences of A → rows 4-5
  – 1st to 1st occurrences of C → rows 6-6
• How to make this conceptual procedure fast?
• From occurrence to row number: store the row number of the first occurrence of each character in the first column
  – $: 1, A: 2, C: 6, G: 7, T: 8
  – 3rd occurrence of A is on row 2+3-1 = 4
• From row number to occurrences: store the number of times a character appears up to the current row in the last column
  – A: 00000123334
  – Up to row 7, 2 A’s have occurred in b
  – Up to row 11, 4 A’s have occurred in b
  – Therefore rows 8-11 contain the 3rd and 4th occurrences of A
  – With these numbers, we do not need to store the first column
  – Again, may precompute only some of these numbers
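The two lookup structures described above can be built directly from b. The sketch below is illustrative only: the names first_occurrence and occurrence_table are hypothetical, F and O follow the notation of the later worked example, and O[c][0] = 0 is an added convenience entry so that "up to row i" can be read as O[c][i].

```python
def first_occurrence(b):
    """F[c]: 1-based row of the first occurrence of c in the first column."""
    F, row = {}, 1
    for c in sorted(set(b)):
        F[c] = row
        row += b.count(c)
    return F

def occurrence_table(b):
    """O[c][i]: number of times c occurs within the first i rows of b."""
    O = {c: [0] for c in set(b)}
    for ch in b:
        for c in O:
            O[c].append(O[c][-1] + (c == ch))
    return O

b = "GTTTCAAAT$A"
F = first_occurrence(b)
O = occurrence_table(b)
print(F)                # {'$': 1, 'A': 2, 'C': 6, 'G': 7, 'T': 8}
print(O["A"][1:])       # [0, 0, 0, 0, 0, 1, 2, 3, 3, 3, 4]
print(F["A"] + 3 - 1)   # 4: row of the 3rd occurrence of A in the first column
# Rows 8-11 of b hold the (O[A][7] + 1)-th to the O[A][11]-th occurrences of A
print(O["A"][7] + 1, O["A"][11])  # 3 4
```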


Summary of BWT
• What do we need to store?
  – The last column b of the sorted rotation matrix
    • O(n) construction time by using suffix array
    • O(n log |Σ|) space, where |Σ| is the size of the alphabet (4 for DNA sequences)
  – Location of the first occurrence of each character in the first column
    • O(|Σ| log n) construction time by using suffix array
    • O(|Σ|) space
  – Number of times each character occurs in the last column within the first i rows for all i
    • O(n) construction time
    • O(|Σ| n log n) space – Can be stored in a special way that requires much less space
  – The suffix array
    • O(n) construction time
    • O(n log n) space – No need to reside in memory
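The summary does not say what the space-saving storage of the occurrence counts looks like. One common idea, consistent with the earlier remark that only some of the numbers may be precomputed, is to keep the counts only at every step-th row and count the remaining few characters of b on the fly. The sketch below assumes this checkpointing scheme; the class name SampledOcc and the parameter step are hypothetical, and this is not necessarily the scheme the lecture has in mind.

```python
class SampledOcc:
    """Occurrence counts of b stored only at every `step`-th row."""

    def __init__(self, b, step=4):
        self.b, self.step = b, step
        running = {c: 0 for c in sorted(set(b))}
        self.checkpoints = [dict(running)]        # counts after 0 rows
        for i, ch in enumerate(b, start=1):
            running[ch] += 1
            if i % step == 0:
                self.checkpoints.append(dict(running))

    def occ(self, c, i):
        """Number of times c occurs within the first i rows of b."""
        k = i // self.step                        # nearest checkpoint at or before row i
        count = self.checkpoints[k][c]
        # Scan at most step-1 characters between the checkpoint and row i
        return count + self.b[k * self.step : i].count(c)

occ = SampledOcc("GTTTCAAAT$A", step=4)
print(len(occ.checkpoints))   # 3 stored rows (0, 4, 8) instead of 12
print(occ.occ("A", 7))        # 2
print(occ.occ("A", 11))       # 4
```

A larger step reduces memory and increases the scanning done per query, which is the usual time/space trade-off for this kind of sampling.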


Summary of BWT
• Getting back the original sequence s
  – Trace back in n steps, using either the suffix array or the array that stores the location of the first occurrence of each character
• Searching for a query sequence r
  – Iteratively compute the range of rows involved for different suffixes of r


Complete searching example

Searching for AT:
1. O[11, T]=4 occurrences of T in total
2. In the first column these T’s appear on row F[T]=8 to row F[T]+4-1=11
3. On row 8-1=7, O[8-1, A]=2 A’s have appeared in last column
4. On row 11, O[11, A]=4 A’s have appeared in last column
5. Therefore these rows cover the (2+1)=3rd to the 4th occurrences of A
6. In the first column these A’s appear on row F[A]+3-1=4 to row F[A]+4-1=5
7. These AT’s appear at position SA[4]=2 and position SA[5]=6 of the input string s

Position            1   2   3   4   5   6   7   8   9   10  11
s                   T   A   T   A   C   A   T   T   A   G   $
First column        $   A   A   A   A   C   G   T   T   T   T
Last column (b)     G   T   T   T   C   A   A   A   T   $   A
Suffix array (SA)   11  4   9   2   6   5   10  3   8   1   7

First occurrence position in the first column (F)
    A   C   G   T
    2   6   7   8

Occurrences within the first i rows in last column (O)
i   1   2   3   4   5   6   7   8   9   10  11
A   0   0   0   0   0   1   2   3   3   3   4
C   0   0   0   0   1   1   1   1   1   1   1
G   1   1   1   1   1   1   1   1   1   1   1
T   0   1   2   3   3   3   3   3   4   4   4

Legend labels in the original figure: Not stored; Fully stored; Stored on disk; Stored in a special way
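The seven steps of the AT example generalize to a loop over the query from its last character to its first. The following is a minimal sketch under the slide's 1-based conventions, with a naive index construction; build_index, backward_search and find are hypothetical names, and O[c][0] = 0 is added so that "rows before the range" needs no special case.

```python
def build_index(s):
    """Build (SA, b, F, O) for a string s that ends with '$' (naive construction)."""
    sa = [i + 1 for i in sorted(range(len(s)), key=lambda i: s[i:])]
    b = "".join(s[p - 2] for p in sa)           # character before each sorted suffix
    F, row = {}, 1
    for c in sorted(set(s)):                     # first row of c in the first column
        F[c] = row
        row += s.count(c)
    O = {c: [0] for c in F}                      # O[c][i]: c's within first i rows of b
    for ch in b:
        for c in O:
            O[c].append(O[c][-1] + (c == ch))
    return sa, b, F, O

def backward_search(r, b, F, O):
    """Return the 1-based (first, last) rows that start with r, or None."""
    lo, hi = 1, len(b)                           # initially every row is a candidate
    for c in reversed(r):
        if c not in F:
            return None
        first = O[c][lo - 1] + 1                 # first occurrence of c inside rows lo..hi
        last = O[c][hi]                          # last occurrence of c inside rows lo..hi
        if first > last:
            return None                          # c never precedes the current suffix
        lo, hi = F[c] + first - 1, F[c] + last - 1
    return lo, hi

def find(r, index):
    """Sorted 1-based positions of r in s."""
    sa, b, F, O = index
    rows = backward_search(r, b, F, O)
    if rows is None:
        return []
    return sorted(sa[row - 1] for row in range(rows[0], rows[1] + 1))

index = build_index("TATACATTAG$")
print(backward_search("AT", *index[1:]))  # (4, 5)
print(find("AT", index))                  # [2, 6]
print(find("CAT", index))                 # [5]
```

Each query character costs two table lookups, so the row range is found in time proportional to the length of r, independent of the length of s.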


Approximate matching
• So far we have been studying exact matching
• Some approximate matching strategies:
  – Search for exact matches of length-k subsequences of the query r, then combine the results
  – Search for exact matches of sequences that are within a certain distance from r
    • We have seen how to do that with a hash table
    • For a suffix tree, we need to traverse the tree with backtracking
    • For BWT, we do something equivalent to traversing the suffix tree, but some bounds can be calculated to reduce the amount of traversal needed
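As an illustration of the BWT-based strategy, the sketch below extends the exact backward search with backtracking: at each query position it tries every alphabet character and charges one unit of a mismatch budget when the character differs from the query. It handles substitutions only (Hamming distance), and it omits the bounds mentioned above, so it explores more branches than a practical aligner would. All function names are hypothetical.

```python
def build_index(s):
    """Build (SA, b, F, O) for a string s that ends with '$' (naive construction)."""
    sa = [i + 1 for i in sorted(range(len(s)), key=lambda i: s[i:])]
    b = "".join(s[p - 2] for p in sa)
    F, row = {}, 1
    for c in sorted(set(s)):
        F[c] = row
        row += s.count(c)
    O = {c: [0] for c in F}
    for ch in b:
        for c in O:
            O[c].append(O[c][-1] + (c == ch))
    return sa, b, F, O

def approx_search(r, index, max_mismatches):
    """1-based positions in s where r matches with at most max_mismatches substitutions."""
    sa, b, F, O = index
    alphabet = [c for c in F if c != "$"]        # never match the terminator
    hits = set()

    def extend(i, lo, hi, budget):
        if i < 0:                                # whole query consumed: report rows lo..hi
            hits.update(sa[row - 1] for row in range(lo, hi + 1))
            return
        for c in alphabet:
            cost = 0 if c == r[i] else 1
            if cost > budget:
                continue                         # out of mismatch budget on this branch
            first, last = O[c][lo - 1] + 1, O[c][hi]
            if first <= last:                    # c does precede the current suffix somewhere
                extend(i - 1, F[c] + first - 1, F[c] + last - 1, budget - cost)

    extend(len(r) - 1, 1, len(b), max_mismatches)
    return sorted(hits)

index = build_index("TATACATTAG$")
print(approx_search("TAG", index, 0))  # [8]
print(approx_search("TAG", index, 1))  # [1, 3, 8]
```

With one mismatch allowed, TAG also matches TAT at position 1 and TAC at position 3. Branches die as soon as the row range becomes empty, which is the same pruning a suffix-tree traversal gets from missing edges.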


CASE STUDY, SUMMARY AND FURTHER READINGS

Epilogue


Case study: Competitions
• Many different methods have been proposed for performing short read alignments
  – Which one is the best?
• In order to propose a new method, you need to show that the method has some advantages over the previous ones
  – Consumes less memory (theoretically or in practice)
  – Runs faster
  – Provides more features (e.g., inexact matches)
  – Is simpler
  – Has an efficient implementation
  – ...


Case study: Competitions
• One potential problem: Cherry picking
  – How can we know that authors are not showing only the best results of their method, based on a carefully chosen set of data and parameter values?
    • The method may perform well only in this setting
  – Benchmark datasets
    • Can still “overfit” if you know the answers
  – Public competitions


Case study: Competitions
• Some famous public competitions related to bioinformatics:
  – Assemblathon for sequence assembly
  – CAPRI (Critical Assessment of PRediction of Interactions)
  – CASP (Critical Assessment of protein Structure Prediction)
  – DREAM (Dialogue for Reverse Engineering Assessments and Methods)
  – RGASP (RNAseq Genome Annotation Assessment Project)
  – Some competitions on TopCoder
  – Some of the KDD Cup competitions associated with the yearly KDD (Knowledge Discovery and Data Mining) conference
• Some have attractive awards!


Summary
• Massively parallel sequencing allows the sequencing of many short DNA fragments in parallel, achieving high throughput
  – Many: millions or even billions
  – Short: On the order of a hundred nucleotides
• Different strategies to map the short reads to a reference:
  – Hash table: Tradeoff between space (unused slots) and time (resolving collisions)
  – Suffix trie/tree/array: Compact structures, proportional to the length of the indexed sequence
  – Burrows-Wheeler Transform (BWT): Similar to suffix array, but usually requires less space in main memory because it uses fewer bits per input character and can be compressed


Other practical issues
• We have only focused on methods for finding a short read in a long sequence efficiently. In real applications, there are many other issues:
  – Non-unique mapping: One read can map to multiple places of s
  – Incorporating quality scores from sequencing machines
  – Larger structural variants (e.g., indels) that cannot be handled by inexact matches
  – Parallelization using multiple cores/machines
  – ...


Further readings
• Chapter 3 of Algorithms in Bioinformatics: A Practical Introduction
  – More details about suffix trees (such as suffix links)
  – Additional applications (such as finding longest common prefix of two sequences)
  – More detailed complexity analyses
  – Free slides available
• A paper that describes a method called BWA for aligning short sequencing reads using BWT
  – Li and Durbin, Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25(14):1754-1760 (2009)
