Sequence Comparison – Identification of remote homologues

Amir Harel, Moran Yassour

Overview

Homologous proteins

Protein Sequence comparison

BLAST and its improvements

PSI-BLAST

Homologous Proteins

Proteins that share a common ancestor are called homologous.

Common three-dimensional folding structure

Homologous Proteins

Homology refers to a similarity that spans an entire folding domain.

The difficulty in defining homology

Why is homology important?

Prediction of a protein’s properties

Classification of proteins to families

Evolution tree

How to identify homology?

Using sequence similarities

Aligning two proteins

Giving a score to the alignment

Global & Local Alignments

Global alignment – alignment of the entire sequence

Local alignment – alignment of a segment of the sequence

How to score an alignment

Substitution matrix – Sij = a value proportional to the probability that amino acid i mutated into amino acid j

Types of Substitution Matrices

PAM – comparison of closely related sequences

BLOSUM – multiple alignments of distantly related sequences

Substitution Matrices

Different matrices reflect different evolutionary distances:

1 PAM represents the evolutionary distance of 1 amino acid substitution per 100 amino acids.

BLOSUM X: all sequences with a similarity higher than X% are clustered and counted as one sequence

Gap costs

The most widely used gap score is -(a + bk) for a gap of length k.

Long gaps do not cost much more than short ones since a single mutation may cause a large gap.
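As a rough illustration of the two slides above, the sketch below scores a finished alignment with a substitution matrix and the affine gap score -(a + bk). The function names, the toy substitution values and the defaults a = 11, b = 1 are assumptions chosen for the example, not values taken from the slides.

```python
def gap_cost(k, a=11, b=1):
    """Affine gap score -(a + b*k) for a gap of length k (a, b are example values)."""
    return -(a + b * k)


def score_alignment(row1, row2, sub, a=11, b=1):
    """Score two equal-length aligned rows; '-' marks a gap position.

    sub maps a pair of residues (x, y) to its substitution score.
    """
    total = 0
    run_len, run_side = 0, None  # current gap run and the row it sits in
    for x, y in zip(row1, row2):
        side = 1 if x == "-" else 2 if y == "-" else None
        if side != run_side and run_len:
            # a gap run just ended (or switched rows): charge it once
            total += gap_cost(run_len, a, b)
            run_len = 0
        run_side = side
        if side is None:
            total += sub[(x, y)]
        else:
            run_len += 1
    if run_len:
        total += gap_cost(run_len, a, b)
    return total
```

With these example values a gap of length 1 costs 12 and a gap of length 10 costs 21, so a long gap is far cheaper than ten separate short ones.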

Basic Sequence Comparison

Smith & Waterman (1981) – dynamic programming of sequence comparison
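A minimal sketch of the Smith & Waterman recurrence follows. To keep it short it assumes a simple match/mismatch score and a linear gap penalty rather than a substitution matrix and affine gaps; the parameter values are illustrative only. It fills the whole m × n matrix, which is what makes the method expensive on large databases.

```python
def smith_waterman(s, t, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between s and t.

    Simplified variant: match/mismatch scoring and a linear gap penalty
    (assumptions made for brevity).
    """
    rows, cols = len(s) + 1, len(t) + 1
    H = [[0] * cols for _ in range(rows)]  # H[i][j]: best score ending at s[i-1], t[j-1]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            # the 0 lets a local alignment restart anywhere
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best
```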

Complexity issue

When DBs become larger, m grows

Time complexity

Space complexity

Intuition to Solution

Go over less than the whole matrix

Put the spotlight on segments that can be a part of the best path and extend them.

The best path is close to a diagonal

Less than O(mn) m

Heuristic procedures

Heuristic: an algorithm that usually, but not always, works, or that gives nearly the right answer.

There is no guarantee to find the best match.

BLAST – Basic Local Alignment Search Tool

BLAST first scans the DB for words that score at least T when aligned with some word within the query sequence; these are called hits. The scan is O(n).

Each hit is extended in both directions as long as the score hasn’t dropped too much.

[Figure: dot matrix of the query against a database sequence; each x marks a word hit]

A word about the parameter T

Small T: greater sensitivity, more hits to expand

Large T: lower sensitivity, fewer hits to expand
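The scan, the extension rule and the effect of T can be sketched as follows. This is a toy version under stated assumptions: identity scoring (+5 / -4) instead of a substitution matrix, a word length w = 3, a drop-off limit X for "dropped too much", and a naive double loop over all word pairs where real BLAST uses a lookup table to keep the scan linear in the DB size. All names are hypothetical.

```python
def pair_score(a, b, match=5, mismatch=-4):
    """Toy residue score (assumption: identity scoring, not a real matrix)."""
    return match if a == b else mismatch


def word_score(query, subject, i, j, w):
    return sum(pair_score(query[i + k], subject[j + k]) for k in range(w))


def word_hits(query, subject, w=3, T=15):
    """Return (query_pos, subject_pos) of every word pair scoring at least T."""
    hits = []
    for i in range(len(query) - w + 1):
        for j in range(len(subject) - w + 1):
            if word_score(query, subject, i, j, w) >= T:
                hits.append((i, j))
    return hits


def extend_hit(query, subject, i, j, w=3, X=10):
    """Ungapped extension in both directions; stop once the running score
    falls more than X below the best score seen so far."""
    best = word_score(query, subject, i, j, w)
    for step, (qi, sj) in ((1, (i + w, j + w)), (-1, (i - 1, j - 1))):
        run = best
        while 0 <= qi < len(query) and 0 <= sj < len(subject):
            run += pair_score(query[qi], subject[sj])
            best = max(best, run)
            if best - run > X:
                break
            qi += step
            sj += step
    return best
```

Lowering T in `word_hits` admits words with mismatches, so more hits are returned and more extensions have to be run.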

Gapped BLAST

The original BLAST was ungapped

Soon after came gapped BLAST

BLAST - Results

P value – The probability of an alignment occurring with score S or better.

E value – Expectation value. The number of different alignments with scores S or better that are expected to occur in this DB search by chance.

Lower E value –> more significant score.
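The two values are closely linked. Assuming the number of chance alignments scoring S or better follows a Poisson distribution (a standard assumption that the slide does not state), the P value is the probability of seeing at least one such alignment, P = 1 - e^(-E). A small sketch, with a hypothetical function name:

```python
import math


def p_from_e(e_value):
    """P value from an E value, assuming Poisson-distributed chance hits."""
    # expm1 keeps precision when e_value is very small
    return -math.expm1(-e_value)
```

For small E the two are nearly equal (E = 0.01 gives P of about 0.00995), while for large E the P value saturates at 1.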

E-value and Homology

A non-significant score does not necessarily imply non-homology:

E-value and Homology

Use it wisely

Choose your Substitution Matrix

Choose your DB

Example 1 – remote homology

Frequently, identification of a remote homology will require several database searches.

The glutathione transferase family

Remote homology

Testing the possibility that elongation factors share homology with glutathione S-transferases:

There is a clear relationship between this elongation factor and the class-theta glutathione transferases.

Example 2 - mapping

Three different families of G-protein coupled receptors:

the R family (the largest)

the C/S family

the G receptor family

Finding links between families

E-value    Score  Name
0          2347   OPSD_HUMAN RHODOPSIN.
0          1791   OPSG_CHICK GREEN-SENSITIVE OPSIN (GREEN CONE PHOTO
0          1002   OPSG_HUMAN GREEN-SENSITIVE OPSIN (GREEN CONE PHOTO

3.10E-30   527    OPS1_DROME OPSIN RH1 (OUTER R1-R6 PHOTORECEPTOR CE
1.10E-23   435    NK2R_MOUSE SUBSTANCE-K RECEPTOR (SKR) (NEUROKININ
1.50E-23   431    SSR5_HUMAN SOMATOSTATIN RECEPTOR TYPE 5.
3.50E-22   419    TXKR_HUMAN PUTATIVE TACHYKININ RECEPTOR.
6.40E-14   283    5H7_HUMAN 5-HYDROXYTRYPTAMINE 7 RECEPTOR (5-HT-7)
8.50E-14   280    CKR1_HUMAN C-C CHEMOKINE RECEPTOR TYPE 1 (C-C CKR-
1.50E-13   278    ETBR_RAT ENDOTHELIN B RECEPTOR PRECURSOR (ET-B) (E
1.60E-13   276    AA2B_RAT ADENOSINE A2B RECEPTOR.

0.007      133    MAS_MOUSE MAS PROTO-ONCOGENE.
0.007      130    PAFR_MACMU PLATELET ACTIVATING FACTOR RECEPTOR (PA
0.009      135    OLF2_RAT OLFACTORY RECEPTOR-LIKE PROTEIN F12.
0.01       131    MAS_RAT MAS PROTO-ONCOGENE.
0.01       130    CAR1_DICDI CYCLIC AMP RECEPTOR 1
0.02       129    OLF2_CHICK OLFACTORY RECEPTOR-LIKE PROTEIN COR2.
0.05       124    CAR3_DICDI CYCLIC AMP RECEPTOR 3.
0.06       120    MAS_HUMAN MAS PROTO-ONCOGENE.
0.17       117    OLF1_CHICK OLFACTORY RECEPTOR-LIKE PROTEIN COR1.
0.23       121    PER2_MOUSE PROSTAGLANDIN E RECEPTOR, EP2 SUBTYPE.

Finding links between families

E-value    Score  Family  Name
0          2678           CAR1_DICDI CYCLIC AMP RECEPTOR 1.
0          1524           CAR3_DICDI CYCLIC AMP RECEPTOR 3.
0          1497           CAR2_DICDI CYCLIC AMP RECEPTOR 2.

0.00042    167    C/S     CALR_HUMAN CALCITONIN RECEPTOR PRECURSOR (CT-R).
0.00073    161    R       IL8B_HUMAN HIGH AFFINITY INTERLEUKIN-8 RECEPTOR B
0.00087    162    C/S     CLRA_RAT CALCITONIN RECEPTOR A PRECURSOR (CT-R-A)
0.00095    162    C/S     CLRB_RAT CALCITONIN RECEPTOR B PRECURSOR (CT-R-B)

0.0045     150    C/S     DIHR_MANSE DIURETIC HORMONE RECEPTOR (DH-R).
0.012      145    C/S     CALR_PIG CALCITONIN RECEPTOR PRECURSOR (CT-R).
0.012      145    C/S     GLR_RAT GLUCAGON RECEPTOR PRECURSOR (GL-R).
0.016      141    R       IL8B_RABIT HIGH AFFINITY INTERLEUKIN-8 RECEPTOR B
0.022      139    R       RDC1_HUMAN G PROTEIN-COUPLED RECEPTOR RDC1 HOMOLOG
0.061      133    R       G10D_RAT PROBABLE G PROTEIN-COUPLED RECEPTOR G10D
0.085      130    R       OPSD_HUMAN RHODOPSIN.
0.098      131    C/S     VIPR_HUMAN VASOACTIVE INTESTINAL POLYPEPTIDE RECEP

0.11       129    R       OPSD_SPHSP OPSIN.
0.13       129    C/S     SCRC_RAT SECRETIN RECEPTOR PRECURSOR.
0.14       127    R       IL8A_HUMAN HIGH AFFINITY INTERLEUKIN-8 RECEPTOR A
0.16       143.1  C/S     GLPR_RAT GLUCAGON-LIKE PEPTIDE 1 RECEPTOR PRECURSO
0.16       126    R       AG2S_XENLA TYPE-1-LIKE ANGIOTENSIN II RECEPTOR 2

Building Proteins tree

Conclusions

Follow-up searches with high-scoring sequences, whether related or unrelated, are a very important tool.

Homology is a transitive relation…

BLAST – Pros & Cons

Pros: It works

Cons:

Statistical evaluation rather than a biological one.

Convergent evolution

Weak but biologically relevant similarities may be overlooked (PSI-BLAST improves on this issue)

BLAST improvements

Running time improvements:

Two-hit method

Seed extension

PSI-BLAST

The two-hit method

The extension step accounts for more than 90% of BLAST’s execution time

Invoke an extension only when two non-overlapping hits are found within a certain distance of one another
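The trigger rule can be sketched as below. The sketch assumes, following the published Gapped BLAST description, that the two hits must lie on the same diagonal (same subject position minus query position); the word length w = 3, the window A = 40 and the function name are example choices, not values from the slide.

```python
def two_hit_triggers(hits, w=3, A=40):
    """Return (first_hit, second_hit) pairs that would trigger an extension.

    hits: iterable of (query_pos, subject_pos) word hits.
    Assumption: both hits lie on the same diagonal, do not overlap,
    and are at most A positions apart.
    """
    last_on_diag = {}  # diagonal -> most recent usable hit
    triggers = []
    for i, j in sorted(hits, key=lambda h: h[1]):
        diag = j - i
        prev = last_on_diag.get(diag)
        if prev is not None:
            dist = j - prev[1]
            if dist < w:
                continue  # overlaps the earlier hit: ignore this one
            if dist <= A:
                triggers.append((prev, (i, j)))
        last_on_diag[diag] = (i, j)
    return triggers
```

Isolated hits never reach the costly extension step, which is where the running-time saving comes from.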

[Figure: dot matrix of word hits (x), with the first hit, the second hit and the resulting two-hit extension marked]

The two-hit method

Seed Extension

PSI-BLAST

Evolution pressure

Needle in a haystack

PSI-BLAST comes to solve this problem

Evolution reveals itself

Giving more significance to the conserved areas and ignoring the background noise

PSI-BLAST (Position-Specific Iterated BLAST) shifts our view to these areas using the Position-Specific Score Matrix (PSSM)

Position-Specific Score Matrix – PSSM

Pij = proportional to the probability of finding the ith amino acid in the jth position in these sequences

Represents the distribution of the amino acids in each position in a collection of sequences
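A minimal sketch of how such a matrix could be built from a collection of aligned sequences is given below. It assumes equal-length, ungapped sequences, a uniform background frequency and a simple pseudocount, and it stores log-odds scores; PSI-BLAST itself uses more elaborate sequence weighting and pseudocounts, so this only illustrates the idea. All names are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(seqs, pseudo=1.0):
    """Return one dict per position mapping amino acid -> log-odds score.

    Assumptions: ungapped sequences of equal length, uniform background,
    and a flat pseudocount added to every amino acid.
    """
    length = len(seqs[0])
    assert all(len(s) == length for s in seqs)
    background = 1 / len(AMINO_ACIDS)
    total = len(seqs) + pseudo * len(AMINO_ACIDS)
    pssm = []
    for j in range(length):
        counts = Counter(s[j] for s in seqs)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudo) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score_segment(pssm, segment):
    """Score a segment of the same length as the PSSM, position by position."""
    return sum(col[aa] for col, aa in zip(pssm, segment))
```

A fully conserved column scores its residue high and everything else low, so conserved positions dominate the total score of a candidate segment.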

Steps in PSI-BLAST

Initiation:

Running gapped BLAST on the query, outputting a collection of matching sequences

Iteration:

Constructing the PSSM based on the best sequences in this collection

The PSSM is compared to the protein DB, again, seeking alignments

PSI-BLAST Example

We start with an uncharacterized protein – MJ0414

When submitting the query we set the E-value threshold to 0.01 (higher than usual)

Result of initial gapped BLAST

First iteration –

Iterating the search using the derived profile uncovers DNA ligase II with E-value of 0.005

Second iteration –

Interpretation of the results

Including a high-scoring but unrelated protein will shift the PSSM in its direction

E-values retrieved in later iterations should not be taken as automatic proof of homology

Was the ligase a right choice?

PSI-BLAST Conclusions

Uncovers protein relationships missed by single-pass database-search methods

Errors are easily amplified by iterations.

PSI-BLAST increases rather than removes the need for expertise, because there is more to interpret

Running time evaluation

Running time can be highly influenced by modifying parameters

Program          Normalized running time
Smith-Waterman   36
Original BLAST   1.0
Gapped BLAST     0.34
PSI-BLAST        0.87

Future Improvements

Accepting PSSM as input from other programs

Realignment – improve the alignment before going over the DB

Automatic domain recognition

Summary

In BLAST use multiple searches for maximum knowledge

The improved BLAST versions are considerably faster and significantly enhance the abilities of DB search

For many queries, PSI-BLAST can greatly increase sensitivity to weak but biologically relevant sequence relationships

Questions time

Thank You

References

Pearson WR. (1997) Identifying distantly related protein sequences. Comput Appl Biosci., 13, 325-332

Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389-3402

Altschul SF, Koonin EV. (1998) Iterated profile searches with PSI-BLAST – a tool for discovery in protein databases. Trends Biochem Sci., 23, 444-447

http://www.ncbi.nlm.nih.gov/BLAST

http://www.cs.huji.ac.il/~cbio

http://www.people.virginia.edu/~wrp/

http://www-lmmb.ncifcrf.gov/

Appendix - Statistics

$S' = \log_2 (N / E)$
