Sequence Comparison – Identification of remote homologues

Amir Harel, Moran Yassour

Page 1: Sequence Comparison – Identification of remote homologues

Sequence Comparison – Identification of remote homologues

Amir Harel, Moran Yassour

Page 2: Sequence Comparison – Identification of remote homologues

Overview

Homologous proteins

Protein Sequence comparison

BLAST and its improvements

PSI-BLAST

Page 3: Sequence Comparison – Identification of remote homologues

Homologous Proteins

Proteins that share a common ancestor are called homologous.

They typically share a common three-dimensional folding structure.

Page 4: Sequence Comparison – Identification of remote homologues

Homologous Proteins

In practice, homology is inferred from a similarity that spans an entire folding domain.

The difficulty in defining homology

Page 5: Sequence Comparison – Identification of remote homologues

Why is homology important?

Prediction of a protein’s properties

Classification of proteins to families

Evolution tree

Page 6: Sequence Comparison – Identification of remote homologues

How to identify homology?

Using sequence similarities

Aligning two proteins

Giving a score to the alignment

Page 7: Sequence Comparison – Identification of remote homologues

Global & Local Alignments

Global alignment – alignment of the entire sequence

Local alignment – alignment of a segment of the sequence

Page 8: Sequence Comparison – Identification of remote homologues

How to score an alignment

Substitution matrix – S_ij = a score reflecting how likely it is that amino acid i mutated into amino acid j (in practice a log-odds value rather than a raw probability)
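A minimal sketch of how one substitution score can be computed as a log-odds value. The frequencies below are hypothetical and chosen only to keep the arithmetic easy; the function name and the `scale` parameter are illustrative, not taken from the slides. With `scale=2.0` the result is in half-bit units, the unit used by BLOSUM62.

```python
import math

def log_odds_score(q_ij, p_i, p_j, scale=2.0):
    """Rounded log-odds score for aligning residues i and j.

    q_ij : observed frequency of the pair (i, j) in trusted alignments
    p_i, p_j : background frequencies of the two residues
    scale : score units per bit (2.0 gives half-bit units)
    """
    return round(scale * math.log2(q_ij / (p_i * p_j)))

# Hypothetical frequencies (assumptions, not real matrix data).
conserved = log_odds_score(q_ij=0.02, p_i=0.1, p_j=0.1)    # pair seen more often than chance
rare = log_odds_score(q_ij=0.0025, p_i=0.1, p_j=0.1)       # pair seen less often than chance
print(conserved, rare)
```

A positive score means the pair occurs in related sequences more often than expected by chance; a negative score means it occurs less often.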

Page 9: Sequence Comparison – Identification of remote homologues

Types of Substitution Matrices

PAM – derived from comparisons of closely related sequences

BLOSUM – derived from multiple alignments of more distantly related sequences

Page 10: Sequence Comparison – Identification of remote homologues

Substitution Matrices

Different matrices reflect different evolutionary distances:

1 PAM represents the evolutionary distance of 1 accepted amino acid substitution per 100 amino acids.

BLOSUM X: sequences more than X% identical were clustered and counted as a single sequence when the matrix was built.

Page 11: Sequence Comparison – Identification of remote homologues

Gap costs

The most widely used gap score is -(a + bk) for a gap of length k, where a is the cost of opening the gap and b is the cost of each gapped position.

Long gaps do not cost much more than short ones since a single mutation may cause a large gap.
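The affine gap score can be sketched in a few lines. The values a = 11 and b = 1 are illustrative assumptions, not values given in the slides.

```python
def affine_gap_score(k, a=11, b=1):
    """Score of a single gap of length k under the affine model -(a + b*k)."""
    if k <= 0:
        return 0  # no gap, no penalty
    return -(a + b * k)

# The opening cost a dominates, so a long gap is not much worse than a short one.
for k in (1, 2, 10):
    print(k, affine_gap_score(k))
```

With these values a gap of length 10 costs less than two separate gaps of length 1, which matches the idea that one mutational event may create one long gap.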

Page 12: Sequence Comparison – Identification of remote homologues

Basic Sequence Comparison

Smith & Waterman (1981) – dynamic programming for sequence comparison

O(mn), where m and n are the lengths of the two sequences (the two sides of the dynamic programming matrix)
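A minimal sketch of the Smith-Waterman recurrence, assuming a simple match/mismatch score and a linear gap penalty in place of a substitution matrix and affine gaps. It returns only the best local score and keeps two rows of the matrix, so time is O(mn) while memory is proportional to one sequence length.

```python
def smith_waterman_score(s, t, match=2, mismatch=-1, gap=-1):
    """Best local alignment score of s and t (linear gap penalty)."""
    prev = [0] * (len(t) + 1)   # previous row of the DP matrix
    best = 0
    for i in range(1, len(s) + 1):
        curr = [0] * (len(t) + 1)
        for j in range(1, len(t) + 1):
            diag = prev[j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

print(smith_waterman_score("ACACACTA", "AGCACACA"))
```

Every cell of the m-by-n matrix is visited once, which is the cost the heuristic methods try to avoid.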

Page 13: Sequence Comparison – Identification of remote homologues

Complexity issue

When DBs become larger, m grows

Time complexity

Space complexity

Page 14: Sequence Comparison – Identification of remote homologues

Intuition to Solution

Go over less than the whole matrix

Put the spotlight on segments that can be a part of the best path and extend them.

The best path is close to a diagonal

Less than O(mn)

Page 15: Sequence Comparison – Identification of remote homologues

Heuristic procedures

Heuristic: An algorithm that usually, but not always, works, or that gives nearly the right answer.

There is no guarantee of finding the best match.

Page 16: Sequence Comparison – Identification of remote homologues

BLAST – Basic Local Alignment Search Tool

BLAST first scans the DB for words that score at least T when aligned with some word within the query sequence; these are called hits. The scan is O(n).

Each hit is extended in both directions as long as the score hasn’t dropped too much.
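The two steps above can be sketched as follows. This is a toy version under stated assumptions: a match/mismatch score stands in for a real substitution matrix, and the names `find_hits`, `extend_hit`, and the drop-off parameter `X` are illustrative rather than BLAST's actual interface.

```python
from itertools import product

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def score(a, b, match=5, mismatch=-2):
    # Stand-in for a substitution matrix lookup (assumption).
    return match if a == b else mismatch

def word_score(u, v):
    return sum(score(a, b) for a, b in zip(u, v))

def neighborhood(query, w, T):
    """Map every word scoring >= T against a query word to its query offsets."""
    table = {}
    for i in range(len(query) - w + 1):
        qword = query[i:i + w]
        for letters in product(ALPHABET, repeat=w):
            cand = "".join(letters)
            if word_score(qword, cand) >= T:
                table.setdefault(cand, []).append(i)
    return table

def find_hits(query, subject, w=3, T=8):
    """Scan the subject once; each lookup in the word table is a potential hit."""
    table = neighborhood(query, w, T)
    hits = []
    for j in range(len(subject) - w + 1):
        for i in table.get(subject[j:j + w], []):
            hits.append((i, j))
    return hits

def extend_hit(query, subject, i, j, w=3, X=10):
    """Ungapped extension in both directions; stop once the running
    score falls more than X below the best score seen so far."""
    def one_direction(qi, sj, step):
        best = run = 0
        while 0 <= qi < len(query) and 0 <= sj < len(subject):
            run += score(query[qi], subject[sj])
            if run > best:
                best = run
            elif best - run > X:
                break
            qi += step
            sj += step
        return best
    seed = word_score(query[i:i + w], subject[j:j + w])
    return seed + one_direction(i + w, j + w, 1) + one_direction(i - 1, j - 1, -1)

query = "MKVLAAGIW"
print(find_hits(query, "QQMRVQQ", T=15))   # strict threshold: exact words only
print(find_hits(query, "QQMRVQQ", T=8))    # looser threshold: one mismatch allowed
print(extend_hit(query, "QQMKVLAAGIWQQ", 0, 2))
```

Lowering T enlarges the word table, so more subject words become hits and more extensions are attempted.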

Page 17: Sequence Comparison – Identification of remote homologues

[Figure: BLAST – dot matrix of word hits (marked x) between the query and a database sequence]

Page 18: Sequence Comparison – Identification of remote homologues

A word about the parameter T

Small T: greater sensitivity, more hits to extend

Large T: lower sensitivity, fewer hits to extend

Page 19: Sequence Comparison – Identification of remote homologues

Gapped BLAST

The original BLAST was ungapped

Soon after came gapped BLAST

Page 20: Sequence Comparison – Identification of remote homologues

BLAST - Results

P value – The probability that an alignment with score S or better occurs by chance.

E value – Expectation value. The number of different alignments with scores S or better that are expected to occur in this DB search by chance.

Lower E value –> more significant score.

Page 21: Sequence Comparison – Identification of remote homologues

E-value and Homology

A non-significant score does not necessarily imply non-homology.

Page 22: Sequence Comparison – Identification of remote homologues

E-value and Homology

Page 23: Sequence Comparison – Identification of remote homologues

Use it wisely

Choose your Substitution Matrix

Choose your DB

Page 24: Sequence Comparison – Identification of remote homologues

Example 1 – remote homology

Frequently, identification of a remote homology will require several database searches.

The glutathione transferase family

Page 25: Sequence Comparison – Identification of remote homologues

Remote homology

Page 26: Sequence Comparison – Identification of remote homologues

Remote homology

Testing the possibility that elongation factors share homology with glutathione S-transferases:

There is a clear relationship between this elongation factor and the class-theta glutathione transferases.

Page 27: Sequence Comparison – Identification of remote homologues

Example 2 - mapping

Three different families of G-protein coupled receptors:

the R family (the largest)

the C/S family

the G receptor family

Page 28: Sequence Comparison – Identification of remote homologues

Finding links between families

E-value    Score  Name
0          2347   OPSD_HUMAN RHODOPSIN.
0          1791   OPSG_CHICK GREEN-SENSITIVE OPSIN (GREEN CONE PHOTO
0          1002   OPSG_HUMAN GREEN-SENSITIVE OPSIN (GREEN CONE PHOTO

3.10E-30   527    OPS1_DROME OPSIN RH1 (OUTER R1-R6 PHOTORECEPTOR CE
1.10E-23   435    NK2R_MOUSE SUBSTANCE-K RECEPTOR (SKR) (NEUROKININ
1.50E-23   431    SSR5_HUMAN SOMATOSTATIN RECEPTOR TYPE 5.
3.50E-22   419    TXKR_HUMAN PUTATIVE TACHYKININ RECEPTOR.
6.40E-14   283    5H7_HUMAN 5-HYDROXYTRYPTAMINE 7 RECEPTOR (5-HT-7)
8.50E-14   280    CKR1_HUMAN C-C CHEMOKINE RECEPTOR TYPE 1 (C-C CKR-
1.50E-13   278    ETBR_RAT ENDOTHELIN B RECEPTOR PRECURSOR (ET-B) (E
1.60E-13   276    AA2B_RAT ADENOSINE A2B RECEPTOR.

0.007      133    MAS_MOUSE MAS PROTO-ONCOGENE.
0.007      130    PAFR_MACMU PLATELET ACTIVATING FACTOR RECEPTOR (PA
0.009      135    OLF2_RAT OLFACTORY RECEPTOR-LIKE PROTEIN F12.
0.01       131    MAS_RAT MAS PROTO-ONCOGENE.
0.01       130    CAR1_DICDI CYCLIC AMP RECEPTOR 1
0.02       129    OLF2_CHICK OLFACTORY RECEPTOR-LIKE PROTEIN COR2.
0.05       124    CAR3_DICDI CYCLIC AMP RECEPTOR 3.
0.06       120    MAS_HUMAN MAS PROTO-ONCOGENE.
0.17       117    OLF1_CHICK OLFACTORY RECEPTOR-LIKE PROTEIN COR1.
0.23       121    PER2_MOUSE PROSTAGLANDIN E RECEPTOR, EP2 SUBTYPE.

Page 29: Sequence Comparison – Identification of remote homologues

Finding links between families

E-value    Score  Family  Name
0          2678           CAR1_DICDI CYCLIC AMP RECEPTOR 1.
0          1524           CAR3_DICDI CYCLIC AMP RECEPTOR 3.
0          1497           CAR2_DICDI CYCLIC AMP RECEPTOR 2.

0.00042    167    C/S     CALR_HUMAN CALCITONIN RECEPTOR PRECURSOR (CT-R).
0.00073    161    R       IL8B_HUMAN HIGH AFFINITY INTERLEUKIN-8 RECEPTOR B
0.00087    162    C/S     CLRA_RAT CALCITONIN RECEPTOR A PRECURSOR (CT-R-A)
0.00095    162    C/S     CLRB_RAT CALCITONIN RECEPTOR B PRECURSOR (CT-R-B)

0.0045     150    C/S     DIHR_MANSE DIURETIC HORMONE RECEPTOR (DH-R).
0.012      145    C/S     CALR_PIG CALCITONIN RECEPTOR PRECURSOR (CT-R).
0.012      145    C/S     GLR_RAT GLUCAGON RECEPTOR PRECURSOR (GL-R).
0.016      141    R       IL8B_RABIT HIGH AFFINITY INTERLEUKIN-8 RECEPTOR B
0.022      139    R       RDC1_HUMAN G PROTEIN-COUPLED RECEPTOR RDC1 HOMOLOG
0.061      133    R       G10D_RAT PROBABLE G PROTEIN-COUPLED RECEPTOR G10D
0.085      130    R       OPSD_HUMAN RHODOPSIN.
0.098      131    C/S     VIPR_HUMAN VASOACTIVE INTESTINAL POLYPEPTIDE RECEP

0.11       129    R       OPSD_SPHSP OPSIN.
0.13       129    C/S     SCRC_RAT SECRETIN RECEPTOR PRECURSOR.
0.14       127    R       IL8A_HUMAN HIGH AFFINITY INTERLEUKIN-8 RECEPTOR A
0.16       143.1  C/S     GLPR_RAT GLUCAGON-LIKE PEPTIDE 1 RECEPTOR PRECURSO
0.16       126    R       AG2S_XENLA TYPE-1-LIKE ANGIOTENSIN II RECEPTOR 2

Page 30: Sequence Comparison – Identification of remote homologues

Building Proteins tree

Page 31: Sequence Comparison – Identification of remote homologues

Conclusions

Follow-up searches with high-scoring sequences, whether related or unrelated, are a very important tool.

Homology is a transitive relation…

Page 32: Sequence Comparison – Identification of remote homologues

BLAST – Pros & Cons

Pros: It works

Cons:

Statistical evaluation rather than a biological one.

Convergent evolution

Weak but biologically relevant similarities may be overlooked (PSI-BLAST improves on this)

Page 33: Sequence Comparison – Identification of remote homologues

BLAST improvements

Running time improvements:

Two-hit method

Seed extension

PSI-BLAST

Page 34: Sequence Comparison – Identification of remote homologues

The two-hit method

The extension step accounts for more than 90% of BLAST’s execution time

Invoke an extension only when two non-overlapping hits are found within a certain distance of one another
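A minimal sketch of the two-hit trigger, assuming hits are given as (query offset, subject offset) pairs and that "within a certain distance" means within a window A on the same diagonal. The function name and the default values w = 3 and A = 40 are illustrative assumptions.

```python
def two_hit_triggers(hits, w=3, A=40):
    """Return the hits that should trigger an extension.

    A hit triggers an extension only if an earlier, non-overlapping hit
    lies on the same diagonal (subject offset - query offset) at most
    A positions before it.
    """
    last_on_diag = {}   # diagonal -> query offset of the most recent hit
    triggers = []
    for i, j in sorted(hits):
        d = j - i
        prev = last_on_diag.get(d)
        if prev is not None:
            gap = i - prev
            if gap < w:
                continue            # overlaps the previous hit: ignore it
            if gap <= A:
                triggers.append((i, j))
        last_on_diag[d] = i
    return triggers

hits = [(0, 2), (1, 3), (10, 12), (5, 50)]
print(two_hit_triggers(hits))
```

Because most single hits never find a partner on their diagonal, far fewer extensions are invoked than with the one-hit rule.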

Page 35: Sequence Comparison – Identification of remote homologues

[Figure: The two-hit method – dot matrix of word hits (marked x), showing a first hit, a second hit, and the resulting two-hit extension]

Page 36: Sequence Comparison – Identification of remote homologues

Seed Extension

Page 37: Sequence Comparison – Identification of remote homologues

PSI-BLAST

Evolutionary pressure

Needle in a haystack

PSI-BLAST addresses this problem

Page 38: Sequence Comparison – Identification of remote homologues

Evolution reveals itself

Give more weight to the conserved regions and ignore the background noise

PSI-BLAST (Position-Specific Iterated BLAST) shifts our view to these regions using the Position-Specific Score Matrix (PSSM)

Page 39: Sequence Comparison – Identification of remote homologues

Position-Specific Score Matrix - PSSM

P_ij = a value proportional to the probability of finding the i-th amino acid in the j-th position in these sequences

Page 40: Sequence Comparison – Identification of remote homologues

PSSM

Represents the distribution of the amino acids in each position in a collection of sequences

Page 41: Sequence Comparison – Identification of remote homologues

Steps in PSI-BLAST

Initiation:

Run gapped BLAST on the query, outputting a collection of matching sequences

Iteration:

Construct the PSSM from the best sequences in this collection

Compare the PSSM to the protein DB, again seeking alignments
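The iteration can be sketched with a toy model. This is not the PSI-BLAST implementation: it assumes ungapped, fixed-length matches, a uniform background, a simple pseudocount, and a raw score threshold in place of an E-value cutoff. All names and sequences are hypothetical.

```python
import math

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def make_pssm(segments, pseudocount=1.0):
    """Log-odds column scores from equal-length aligned segments."""
    background = 1.0 / len(ALPHABET)          # uniform background (assumption)
    pssm = []
    for j in range(len(segments[0])):
        letters = [s[j] for s in segments]
        total = len(letters) + pseudocount * len(ALPHABET)
        pssm.append({
            aa: math.log2((letters.count(aa) + pseudocount) / total / background)
            for aa in ALPHABET
        })
    return pssm

def best_window(pssm, seq):
    """Best-scoring ungapped window of seq against the PSSM."""
    best, best_seg = None, None
    for start in range(len(seq) - len(pssm) + 1):
        seg = seq[start:start + len(pssm)]
        s = sum(col[aa] for col, aa in zip(pssm, seg))
        if best is None or s > best:
            best, best_seg = s, seg
    return best, best_seg

def toy_psi_search(query, database, threshold, max_rounds=5):
    """Rebuild the profile from accepted matches and search again until stable."""
    members = {query}        # segments that feed the PSSM
    included = set()         # database entries accepted so far
    for _ in range(max_rounds):
        pssm = make_pssm(sorted(members))
        found = set()
        for name, seq in database.items():
            s, seg = best_window(pssm, seq)
            if s is not None and s >= threshold:
                found.add(name)
                members.add(seg)
        if found == included:
            break            # converged: no change in the accepted set
        included = found
    return included

db = {"close": "MKVLG", "remote": "MRVIG", "unrelated": "WWWWW"}
print(sorted(toy_psi_search("MKVLA", db, threshold=2.5, max_rounds=1)))
print(sorted(toy_psi_search("MKVLA", db, threshold=2.5)))
```

In this toy data the single pass finds only the close relative; once that relative is folded into the profile, the second pass also accepts the remote one. The same mechanism means an unrelated sequence accepted by mistake would pull the profile toward itself.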

Page 42: Sequence Comparison – Identification of remote homologues

PSI-BLAST Example

We start with an uncharacterized protein – MJ0414

When submitting the query we set the E-value threshold to 0.01 (higher than usual)

Page 43: Sequence Comparison – Identification of remote homologues

Result of initial gapped BLAST

Page 44: Sequence Comparison – Identification of remote homologues

First iteration –

Iterating the search using the derived profile uncovers DNA ligase II with E-value of 0.005

Page 45: Sequence Comparison – Identification of remote homologues

Second iteration –

Page 46: Sequence Comparison – Identification of remote homologues

Interpretation of the results

Including a high-scoring but unrelated protein will shift the PSSM in its direction

E-values retrieved in later iterations should not be taken as automatic proof of homology

Page 47: Sequence Comparison – Identification of remote homologues

Was the ligase a right choice?

Page 48: Sequence Comparison – Identification of remote homologues

PSI-BLAST Conclusions

Uncovers protein relationships missed by single-pass database-search methods

Errors are easily amplified by iterations.

PSI-BLAST increases rather than removes the need for expertise, because there is more to interpret

Page 49: Sequence Comparison – Identification of remote homologues

Running time evaluation

Running time can be highly influenced by modifying parameters

Method            Normalized running time
Smith-Waterman    36
Original BLAST    1.0
Gapped BLAST      0.34
PSI-BLAST         0.87

Page 50: Sequence Comparison – Identification of remote homologues

Future Improvements

Accepting PSSM as input from other programs

Realignment – improve the alignment before going over the DB

Automatic domain recognition

Page 51: Sequence Comparison – Identification of remote homologues

Summary

With BLAST, use multiple searches to learn as much as possible

The BLAST improvements are considerably faster and significantly enhance the abilities of DB search

For many queries, PSI-BLAST can greatly increase sensitivity to weak but biologically relevant sequence relationships

Page 52: Sequence Comparison – Identification of remote homologues

Questions time

Thank You

Page 53: Sequence Comparison – Identification of remote homologues

References

Pearson WR. (1997) Identifying distantly related protein sequences. Comput Appl Biosci., 13, 325-332

Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389-3402

Altschul SF, Koonin EV. (1998) Iterated profile searches with PSI-BLAST – a tool for discovery in protein databases. Trends Biochem Sci., 23, 444-447

Page 54: Sequence Comparison – Identification of remote homologues

Sites

http://www.ncbi.nlm.nih.gov/BLAST

http://www.cs.huji.ac.il/~cbio

http://www.people.virginia.edu/~wrp/

http://www-lmmb.ncifcrf.gov/

Page 55: Sequence Comparison – Identification of remote homologues

Appendix - Statistics

$$S' = \frac{\lambda S - \ln K}{\ln 2}$$

$$E = \frac{N}{2^{S'}}$$

$$S' = \log_2 \frac{N}{E}$$

$$N = mn$$
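A small sketch of the standard bit-score and E-value relations, S' = (λS − ln K)/ln 2 and E = N/2^S' with N = mn, where S is the raw score, m the query length, and n the database length. The values of λ and K below are illustrative assumptions; real values depend on the scoring matrix and gap costs.

```python
import math

def bit_score(raw_score, lam, K):
    """Normalized score S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)

def e_value(raw_score, m, n, lam, K):
    """Expected number of chance alignments scoring >= S: E = m*n / 2**S'."""
    return m * n * 2.0 ** (-bit_score(raw_score, lam, K))

# Illustrative parameters (assumptions, not taken from the slides).
LAM, K = 0.318, 0.134
m, n = 300, 50_000_000      # query length, total database length

for S in (40, 80, 120):
    print(S, round(bit_score(S, LAM, K), 1), e_value(S, m, n, LAM, K))
```

Working in bit scores makes results comparable across scoring systems: each extra bit halves the expected number of chance hits.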