BLAT – The BLAST-Like Alignment Tool


Page 1: BLAT – The BLAST-Like Alignment Tool

BLAT – The BLAST-Like Alignment Tool

Kent, W.J.

Genome Res. 2002 12: 656-664

Presenter: 巨彥霖 田知本

Page 2

BLAT overview

• Use an index to find regions in the genome homologous to the query.

• Do a detailed alignment between the query and the homologous regions.

• Use dynamic programming to stitch together the detailed alignments of the regions into a detailed alignment of the whole.

Page 3

Index

• Database: non-overlapping K-mers

• Query: overlapping K-mers

[Figure: the database split into adjacent, non-overlapping K-mers; the query covered by overlapping K-mers]

Page 4

Example

• Database: cacaattatcacgaccgc

3-mers: cac aat tat cac gac cgc

Index:
aat 3
cac 0,9
cgc 15
gac 12
tat 6

• Query: aattctcac

3-mers:    aat att ttc tct ctc tca cac

Positions:  0   1   2   3   4   5   6
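The example above can be reproduced with a short Python sketch. It assumes 0-based positions, as on the slide; the function names `build_index` and `find_hits` are illustrative and are not taken from the BLAT source.

```python
from collections import defaultdict

def build_index(database, k):
    """Index the NON-overlapping k-mers of the database by start position."""
    index = defaultdict(list)
    for pos in range(0, len(database) - k + 1, k):
        index[database[pos:pos + k]].append(pos)
    return dict(index)

def find_hits(query, index, k):
    """Look up every OVERLAPPING k-mer of the query in the index.

    Returns (query_position, database_position) pairs.
    """
    hits = []
    for qpos in range(len(query) - k + 1):
        for tpos in index.get(query[qpos:qpos + k], []):
            hits.append((qpos, tpos))
    return hits

index = build_index("cacaattatcacgaccgc", 3)
hits = find_hits("aattctcac", index, 3)
print(index)
print(hits)
```

Only two of the seven query 3-mers (aat and cac) occur in the index, so the query produces three hits in total.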

Page 5

Search Criteria

• Single Perfect Matches

• Single Near Perfect Matches

• Multiple Perfect Matches

Page 6

Notation

• K : K-mer size

• M : Match ratio between homologous areas

• H : Homologous region size

• G : Database (genome) size

• Q : Query sequence size

• A : Alphabet size

Page 7

Single Perfect Matches (1)

The probability that a specific K-mer in the homologous region matches perfectly:

$p_1 = M^K$

[Figure: one K-mer inside the homologous region with a perfect match]

Page 8

Single Perfect Matches (2)

The probability of at least one perfect K-mer match within the homologous region (Sensitivity):

$P = 1 - (1 - M^K)^{H/K}$

[Figure: a homologous region of length H divided into H/K non-overlapping K-mers]
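A minimal sketch of the sensitivity formula in Python. It assumes the number of non-overlapping K-mers in the homologous region is taken as the integer part of H/K; the function name is illustrative.

```python
def single_perfect_sensitivity(M, H, K):
    """Probability that at least one non-overlapping K-mer matches perfectly."""
    T = H // K           # assumed: whole K-mers that fit in the homologous region
    p1 = M ** K          # one specific K-mer matches perfectly
    return 1.0 - (1.0 - p1) ** T

# Mouse/human coding sequence figures used later in the deck: M = 0.86, H = 100.
for K in (6, 7, 8, 9):
    print(K, single_perfect_sensitivity(0.86, 100, K))
```

Sensitivity drops quickly as K grows, because a longer K-mer is less likely to survive without a mismatch and fewer K-mers fit into the region.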

Page 9

Single Perfect Matches (3)

• The number of K-mers in the database = G / K

• The number of K-mers in the query = Q – K + 1

The number of K-mer matches expected by chance (Specificity):

$F = (Q - K + 1) \cdot (G / K) \cdot (1/A)^K$
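The chance-match count can be evaluated directly. The sketch below assumes a nucleotide alphabet (A = 4) by default; the function name is illustrative.

```python
def expected_chance_matches(Q, G, K, A=4):
    """Expected number of perfect K-mer matches that occur by chance."""
    query_kmers = Q - K + 1        # overlapping K-mers in the query
    database_kmers = G / K         # non-overlapping K-mers in the database
    return query_kmers * database_kmers * (1.0 / A) ** K

# A 500-base query against a 3-billion-base database, for several K.
for K in (7, 11, 14):
    print(K, expected_chance_matches(500, 3e9, K))
```

Each extra letter in the K-mer divides the number of chance matches by roughly A, which is why specificity favors a large K while sensitivity favors a small one.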

Page 10

Single Perfect Nucleotide K-mer Matches as Search Criterion

Page 11

Case (perfect match)

• Comparing mouse and human coding sequences at the nucleotide level :

H = 100

M = 86%

Sensitivity = 0.99

max K = 7

chance matches = 13078962

(query = 500 , database = 3 billion)
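As a check on the figures above, here is a worked evaluation. It assumes the number of non-overlapping K-mers is taken as ⌊H/K⌋ and the alphabet size is A = 4.

```latex
\begin{align*}
K = 7:&\quad P = 1 - \left(1 - 0.86^{7}\right)^{\lfloor 100/7 \rfloor}
        = 1 - (1 - 0.3479)^{14} \approx 0.9975 \ge 0.99 \\
K = 8:&\quad P = 1 - \left(1 - 0.86^{8}\right)^{\lfloor 100/8 \rfloor}
        = 1 - (1 - 0.2992)^{12} \approx 0.986 < 0.99 \\
F &\approx 500 \cdot \frac{3 \times 10^{9}}{7} \cdot \left(\tfrac{1}{4}\right)^{7}
        \approx 13{,}078{,}962
\end{align*}
```

So K = 7 is the largest K-mer size that keeps the sensitivity at 0.99 or above. The quoted chance-match count is reproduced when the query term is taken as Q = 500; with Q – K + 1 = 494 the value is about 12.9 million.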

Page 12

Single Near Perfect Matches (1)

Near perfect: at most one letter of the K-mer may mismatch.

$p_1 = K \cdot M^{K-1} \cdot (1 - M) + M^K$

[Figure: one K-mer inside the homologous region with a near perfect match]

Page 13

Single Near Perfect Matches (2)

• Sensitivity

$P = 1 - (1 - p_1)^{H/K}$

• Specificity

$F = (Q - K + 1) \cdot (G / K) \cdot \left( K \cdot (1/A)^{K-1} \cdot (1 - 1/A) + (1/A)^K \right)$
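The near perfect formulas can be sketched in Python as follows. The sketch assumes ⌊H/K⌋ non-overlapping K-mers in the homologous region and a nucleotide alphabet (A = 4) by default; all function names are illustrative.

```python
def near_perfect_p1(M, K):
    """One specific K-mer matches with at most one mismatching letter."""
    exactly_one_mismatch = K * M ** (K - 1) * (1 - M)
    no_mismatch = M ** K
    return exactly_one_mismatch + no_mismatch

def near_perfect_sensitivity(M, H, K):
    """At least one near perfect K-mer match within the homologous region."""
    return 1.0 - (1.0 - near_perfect_p1(M, K)) ** (H // K)

def near_perfect_chance_matches(Q, G, K, A=4):
    """Expected number of near perfect K-mer matches that occur by chance."""
    per_pair = K * (1 / A) ** (K - 1) * (1 - 1 / A) + (1 / A) ** K
    return (Q - K + 1) * (G / K) * per_pair

for K in (11, 12, 13):
    print(K,
          near_perfect_sensitivity(0.86, 100, K),
          near_perfect_chance_matches(500, 3e9, K))
```

Allowing one mismatch lets a longer K-mer reach the same sensitivity, which in turn reduces the number of chance matches compared with the single perfect match criterion.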

Page 14

Case (near perfect match)

• Comparing mouse and human coding sequences at the nucleotide level :

H = 100

M = 86%

Sensitivity = 0.99

max K = 12

chance matches = 275671

(query = 500 , database = 3 billion)

Page 15

Single Near Perfect Nucleotide K-mer Matches as Search Criterion

Page 16

Multiple Perfect Matches

• A hit is triggered when:

– there are N perfect matches

– each is no further than W letters from the next in the database coordinate

– all have the same diagonal coordinate
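A minimal sketch of this trigger rule in Python. It makes two assumptions: the diagonal coordinate is taken as database position minus query position, and "no further than W letters from each other" is read as a limit on the distance between consecutive matches on the same diagonal. The function name `triggers_hit` is illustrative.

```python
from collections import defaultdict

def triggers_hit(matches, N, W):
    """matches: (query_pos, database_pos) pairs of perfect K-mer matches."""
    if N <= 1:
        return bool(matches)
    # Group database positions by diagonal (assumed: database minus query).
    by_diagonal = defaultdict(list)
    for q, t in matches:
        by_diagonal[t - q].append(t)
    for targets in by_diagonal.values():
        targets.sort()
        run = 1  # length of the current chain of nearby matches
        for prev, cur in zip(targets, targets[1:]):
            run = run + 1 if cur - prev <= W else 1
            if run >= N:
                return True
    return False

print(triggers_hit([(10, 100), (30, 120), (50, 300)], N=2, W=40))
print(triggers_hit([(10, 100), (30, 125)], N=2, W=40))
```

The first call triggers because two matches share diagonal 90 and lie 20 letters apart; the second does not, since the two matches fall on different diagonals.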

Page 17

Example

[Figure: hits a, b, c, and d plotted against the target coordinate and the query coordinate, with the window W marked]

The hits a, b, c, and d are all K letters long. Hits b and d have the same diagonal coordinate and are within W letters of each other. Therefore, they match the "two perfect K-mer" search criterion.

Page 18

Multiple Perfect Nucleotide K-mer Matches as Search Criterion

Page 19

Default

• Nucleotide

– two perfect 11-mers

• Protein

– single perfect 5-mer for the standalone version

– three perfect 4-mers for the client/server version

Page 20

BLAST

1) Build the hash table for Sequence A.

2) Scan Sequence B for hits.

3) Extend hits.

Page 21

BLAST
Step 1: Build the hash table for Sequence A. (3-tuple example)

For DNA sequences:

Seq. A = AGATCGAT
         12345678

AAA
AAC
..
AGA 1
..
ATC 3
..
CGA 5
..
GAT 2 6
..
TCG 4
..
TTT

For protein sequences:

Seq. A = ELVIS

Add xyz to the hash table if Score(xyz, ELV) ≥ T;
Add xyz to the hash table if Score(xyz, LVI) ≥ T;
Add xyz to the hash table if Score(xyz, VIS) ≥ T;
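The DNA case of Step 1 can be sketched in a few lines of Python. The sketch uses 1-based positions to match the slide and stores only the 3-tuples that actually occur; `build_word_table` is an illustrative name. The protein case is not covered, because it needs a scoring matrix and a threshold T that the slide does not specify.

```python
from collections import defaultdict

def build_word_table(seq, w=3):
    """Map every overlapping w-tuple of seq to its 1-based start positions."""
    table = defaultdict(list)
    for i in range(len(seq) - w + 1):
        table[seq[i:i + w]].append(i + 1)
    return dict(table)

table = build_word_table("AGATCGAT")
for word in sorted(table):
    print(word, table[word])
```

Unlike the BLAT database index, every overlapping tuple of Sequence A is entered here, so GAT is recorded at both position 2 and position 6.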

Page 22

BLAST
Step 2: Scan sequence B for hits.

Page 23

BLAST
Step 2: Scan sequence B for hits.

Step 3: Extend hits.

[Figure: a hit and its extension]

Terminate if the score of the extension fades away. (That is, when we reach a segment pair whose score falls a certain distance below the best score found for shorter extensions.)

BLAST 2.0 saves the time spent in extension, and considers gapped alignments.
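The termination rule for Step 3 can be illustrated with a small ungapped extension routine. This is a sketch under stated assumptions, not BLAST's implementation: scoring is a simple +1 for a match and -1 for a mismatch, only the rightward direction is shown, and the names `extend_right` and `x_drop` are illustrative.

```python
def extend_right(a, b, i, j, x_drop, match=1, mismatch=-1):
    """Extend an ungapped alignment rightward from a[i] and b[j].

    Stops once the running score has fallen x_drop or more below the
    best score seen so far. Returns (best_score, best_length).
    """
    score = best = 0
    length = best_length = 0
    while i + length < len(a) and j + length < len(b):
        score += match if a[i + length] == b[j + length] else mismatch
        length += 1
        if score > best:
            best, best_length = score, length
        elif best - score >= x_drop:
            break  # the extension has faded away
    return best, best_length

print(extend_right("ACGTACGTTTTT", "ACGTACGTAAAA", 0, 0, x_drop=3))
```

The routine reports the best-scoring prefix rather than the point where it stopped, so the trailing mismatches that triggered termination are not part of the returned segment.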

Page 24

Algorithm

1. Search Stage

– Use an index to find regions in the genome homologous to the query

2. Alignment Stage

– Do a detailed alignment between the query and the homologous regions

3. Stitching and Filling In

– Use dynamic programming to stitch together the detailed alignments of the regions into a detailed alignment of the whole

Page 25

Search Stage

• Build an index which contains positions of each K-mer in database.

• Step through each overlapping K-mer in query and look it up in index

• Get list of ‘hits’ - positions in query and in database that match for K bases

• Cluster hits to find homologous regions

Page 26

Search Stage

• Clump hits

Page 27

Search Stage

• Clump ‘clumps’

• Eliminate small clumps

[Figure: the remaining clumps define the homologous region]
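The clumping steps can be sketched roughly as follows. This is a simplified illustration, not BLAT's code: hits are first grouped by nearby diagonals, then split where neighbours are far apart in the database coordinate, and small clumps are dropped. The merging of nearby clumps is omitted, and the parameter names `diag_gap`, `window`, and `min_hits` are assumptions made for this example.

```python
def clump_hits(hits, diag_gap, window, min_hits):
    """hits: (query_pos, database_pos) pairs. Returns a list of clumps."""
    def diagonal(hit):
        return hit[1] - hit[0]

    # Step 1: group hits whose diagonals lie close together.
    protos, current = [], []
    for hit in sorted(hits, key=diagonal):
        if current and diagonal(hit) - diagonal(current[-1]) > diag_gap:
            protos.append(current)
            current = []
        current.append(hit)
    if current:
        protos.append(current)

    # Step 2: within each group, split where database positions are far apart.
    clumps = []
    for proto in protos:
        proto.sort(key=lambda hit: hit[1])
        current = [proto[0]]
        for hit in proto[1:]:
            if hit[1] - current[-1][1] > window:
                clumps.append(current)
                current = []
            current.append(hit)
        clumps.append(current)

    # Step 3: eliminate small clumps.
    return [clump for clump in clumps if len(clump) >= min_hits]

hits = [(0, 1000), (11, 1011), (22, 1022), (5, 5000), (40, 1045)]
print(clump_hits(hits, diag_gap=10, window=100, min_hits=2))
```

In this example the isolated hit at database position 5000 forms a clump of size one and is eliminated, while the other four hits form a single clump.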

Page 28

Alignment Stage (nucleotide)

• Start from scratch with regions defined with K-mers

• Index on smaller K-mers, but extend each K-mer until it becomes specific

• Extend in both directions without mismatches or gaps, and merge overlapping or contiguous alignments

• Recurse on gaps with a smaller K until the gaps are eliminated or no more hits are found
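The "extend in both directions without mismatches or gaps" step can be sketched as below. The sketch covers only the extension of a single hit; merging and the recursion on gaps are left out, and `extend_exact` is an illustrative name.

```python
def extend_exact(query, target, q, t, k):
    """Grow an exact k-letter match at query[q:] / target[t:] in both directions.

    Returns (query_start, target_start, length) of the maximal exact match.
    """
    start_q, start_t = q, t
    # Extend to the left while the preceding letters agree.
    while start_q > 0 and start_t > 0 and query[start_q - 1] == target[start_t - 1]:
        start_q -= 1
        start_t -= 1
    end_q, end_t = q + k, t + k
    # Extend to the right while the following letters agree.
    while end_q < len(query) and end_t < len(target) and query[end_q] == target[end_t]:
        end_q += 1
        end_t += 1
    return start_q, start_t, end_q - start_q

print(extend_exact("GGACGTACGTCC", "TTACGTACGTAA", 4, 4, 3))
```

Here a 3-letter seed grows into an 8-letter exact match, bounded on each side by the first mismatching letter.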

Page 29

Alignment Stage (nucleotide)

recursive

Page 30

Alignment Stage (protein)

• Extend hits into maximal scoring ungapped alignments (HSPs) with a +2/-1 scoring scheme

• Create a graph of all possible HSP merges

• Use dynamic programming to traverse the graph
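One way to picture the graph traversal is a chaining recurrence over HSPs. The sketch below is a simplification with loudly stated assumptions: an edge connects two HSPs only when the second starts at or after the end of the first in both sequences, no gap penalty is charged, and overlapping HSPs are not trimmed. The tuple layout and the name `best_chain` are illustrative.

```python
def best_chain(hsps):
    """hsps: (q_start, q_end, t_start, t_end, score) tuples, ends exclusive.

    Returns (total_score, chain) for the highest-scoring compatible chain.
    """
    if not hsps:
        return 0, []
    # Sorting by start guarantees every predecessor is processed first.
    hsps = sorted(hsps, key=lambda h: (h[0], h[2]))
    best = [h[4] for h in hsps]      # best chain score ending at each HSP
    prev = [None] * len(hsps)        # back-pointers for traceback
    for j, hj in enumerate(hsps):
        for i in range(j):
            hi = hsps[i]
            compatible = hi[1] <= hj[0] and hi[3] <= hj[2]
            if compatible and best[i] + hj[4] > best[j]:
                best[j] = best[i] + hj[4]
                prev[j] = i
    j = max(range(len(hsps)), key=best.__getitem__)
    total, chain = best[j], []
    while j is not None:
        chain.append(hsps[j])
        j = prev[j]
    return total, chain[::-1]

example = [(0, 10, 100, 110, 20), (12, 20, 115, 123, 16),
           (5, 15, 300, 310, 18), (25, 30, 130, 135, 10)]
print(best_chain(example))
```

The HSP at target position 300 is left out of the chain because it cannot be ordered consistently with the others in both sequences.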

Page 31

Alignment Stage (protein)

Page 32

Alignment Stage (protein)

query

homologous region

HSP

Page 33

Stitching and Filling In

• The alignment of a gene is often scattered across multiple homologous regions found in the search stage

[Figure: alignment blocks between the query and the database]

Page 34

Stitching and Filling In

query

database

homologous region

Page 35

Evaluation

• Comparison with Other Tools:

– mRNA/Genome Alignments

– Remapped 713 mRNAs corresponding to annotated chromosome 22

– BLAT took 26 sec while Sim4 took 17,468 sec (almost 5 h)

                 Est_genome   Sim4     BLAT
Relative speed   1            333      223,000
Base accuracy    N/A          99.66%   99.99%
Gene accuracy    77.7%        93.4%    99.5%

Page 36

Evaluation

• Comparison with Other Tools:

– Translated Mouse/Human Alignments

– 13 million mouse genomic reads vs. human chromosome 22

                   WU-TBLASTX   BLAT
Relative Speed     1x           73x
% RefSeq Covered   84.5%        86.7%
% Genome Covered   2.67%        2.89%

Page 37

BLAT vs. BLAST

• Index

– BLAT indexes the database; BLAST indexes the query

• Hits

– Perfect vs. Near Perfect

• Alignment

– BLAT stitches alignments together; BLAST reports them separately

Page 38

Magic Time !

Page 39

Magic

4 4 3 3 2 1 .5

Prediction! No mind! Great!