SnapDRAGON: protein 3D prediction-based DOMAINATION: based on PSI-BLAST


Page 1: SnapDRAGON:    protein 3D prediction-based DOMAINATION: based on PSI-BLAST

SnapDRAGON: protein 3D prediction-based

DOMAINATION: based on PSI-BLAST

Two methods to predict domain boundary sequence positions from sequence information alone

An example of two different bioinformatics approaches to the same problem

Page 2: SnapDRAGON:    protein 3D prediction-based DOMAINATION: based on PSI-BLAST

SnapDRAGON

Richard A. George

Jaap Heringa

George, R.A. and Heringa, J. (2002) J. Mol. Biol., 316, 839-851.

Combining protein secondary and tertiary structure prediction to predict structural domains in sequence data

Page 3: SnapDRAGON:    protein 3D prediction-based DOMAINATION: based on PSI-BLAST

Protein structure evolution

Insertion/deletion of secondary structural elements can ‘easily’ be done at loop sites

Page 4: SnapDRAGON:    protein 3D prediction-based DOMAINATION: based on PSI-BLAST

Flavodoxin family - TOPS diagrams (Flores et al., 1994)


Page 5: SnapDRAGON:    protein 3D prediction-based DOMAINATION: based on PSI-BLAST

Protein structure evolution

Insertion/deletion of structural domains can ‘easily’ be done at loop sites

Page 6: SnapDRAGON:    protein 3D prediction-based DOMAINATION: based on PSI-BLAST

A domain is a:

• Compact, semi-independent unit (Richardson, 1981).

• Stable unit of a protein structure that can fold autonomously (Wetlaufer, 1973).

• Recurring functional and evolutionary module (Bork, 1992).

“Nature is a ‘tinkerer’ and not an inventor” (Jacob, 1977).

Page 7: SnapDRAGON:    protein 3D prediction-based DOMAINATION: based on PSI-BLAST

The DEATH Domain

• Present in a variety of eukaryotic proteins involved with cell death.

• Six helices enclose a tightly packed hydrophobic core.

• Some DEATH domains form homotypic and heterotypic dimers.

http://www.mshri.on.ca/pawson

Page 8: SnapDRAGON:    protein 3D prediction-based DOMAINATION: based on PSI-BLAST

Delineating domains is essential for:

• Obtaining high resolution structures (x-ray, NMR)

• Sequence analysis

• Multiple sequence alignment methods

• Prediction algorithms (SS, Class, secondary/tertiary structure)

• Fold recognition and threading

• Elucidating the evolution, structure and function of a protein family (e.g. ‘Rosetta Stone’ method)

• Structural/functional genomics

• Cross genome comparative analysis

Page 9: SnapDRAGON:    protein 3D prediction-based DOMAINATION: based on PSI-BLAST

Pyruvate kinase

Phosphotransferase

barrel regulatory domain

barrel catalytic substrate binding domain

nucleotide binding domain

1 continuous + 2 discontinuous domains

Structural domain organisation can be nasty…

Page 10: SnapDRAGON:    protein 3D prediction-based DOMAINATION: based on PSI-BLAST

Protein structure hierarchical levels

VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH

PRIMARY STRUCTURE (amino acid sequence)

QUATERNARY STRUCTURE

SECONDARY STRUCTURE (helices, strands)

TERTIARY STRUCTURE (fold)


Page 14: SnapDRAGON:    protein 3D prediction-based DOMAINATION: based on PSI-BLAST

Distance Regularisation Algorithm for Geometry OptimisatioN

(Aszodi & Taylor, 1994)

Domain prediction using DRAGON

• Folds proteins based on the requirement that (conserved) hydrophobic residues cluster together.

• First constructs a random high-dimensional Cα distance matrix.

• Distance geometry is used to find the 3D conformation corresponding to a prescribed target matrix of desired distances between residues.

Page 15: SnapDRAGON:    protein 3D prediction-based DOMAINATION: based on PSI-BLAST

The DRAGON target matrix is inferred from:

• A multiple sequence alignment of a protein (old)

– Conserved hydrophobicity

• Secondary structure information (SnapDRAGON)

– predicted by PREDATOR (Frishman & Argos, 1996).

– strands are entered as distance constraints from the N-terminal Cα to the C-terminal Cα

Page 16: SnapDRAGON:    protein 3D prediction-based DOMAINATION: based on PSI-BLAST

• The Cα distance matrix is divided into smaller clusters.

• Separately, each cluster is embedded into a local centroid.

• The final predicted structure is generated from full embedding of the multiple centroids and their corresponding local structures.

[Figure: from the input data (multiple alignment and predicted secondary structure, e.g. CCHHHCCEEE), a target matrix and 100 randomised initial Cα distance matrices yield 100 predictions]

Page 17: SnapDRAGON:    protein 3D prediction-based DOMAINATION: based on PSI-BLAST

SnapDRAGON

[Figure: multiple alignment and predicted secondary structure (e.g. CCHHHCCEEE); folds generated by DRAGON; boundary recognition; summed and smoothed boundaries]

Page 18: SnapDRAGON:    protein 3D prediction-based DOMAINATION: based on PSI-BLAST

SnapDRAGON

1. Domains in structures assigned using method by Taylor (1997)

2. Domain boundary positions of each model against sequence

3. Summed and Smoothed Boundaries (Biased window protocol)

Page 19: SnapDRAGON:    protein 3D prediction-based DOMAINATION: based on PSI-BLAST

Prediction assessment

• Test set of 414 multiple alignments; 183 single and 231 multiple domain proteins.

• Sequence searches using PSI-BLAST (Altschul et al., 1997) followed by redundancy filtering using OBSTRUCT (Heringa et al., 1992) and alignment by PRALINE (Heringa, 1999)

• Boundary predictions are compared to the region of the protein connecting two domains (min 10 residues)

Page 20: SnapDRAGON:    protein 3D prediction-based DOMAINATION: based on PSI-BLAST

Average prediction results per protein

                        Continuous set    Discontinuous set   Full set
SnapDRAGON  Coverage    63.9 (± 43.0)     35.4 (± 25.0)       51.8 (± 39.1)
            Success     46.8 (± 36.4)     44.4 (± 33.9)       45.8 (± 35.4)
Baseline 1  Coverage    43.6 (± 45.3)     20.5 (± 27.1)       34.7 (± 40.8)
            Success     34.3 (± 39.6)     22.2 (± 29.5)       29.6 (± 36.6)
Baseline 2  Coverage    45.3 (± 46.9)     22.7 (± 27.3)       35.7 (± 41.3)
            Success     37.1 (± 42.0)     23.1 (± 29.6)       31.2 (± 37.9)

Coverage is the % of linkers predicted: TP/(TP+FN)

Success is the % of correct predictions made: TP/(TP+FP)
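
The two measures can be computed directly from the counts of true positives (TP), false positives (FP) and false negatives (FN). The following is a minimal sketch; the function name `coverage_and_success` and the example counts are hypothetical and not taken from the slides.

```python
def coverage_and_success(tp, fp, fn):
    """Return (coverage %, success %) for one protein.

    coverage = TP / (TP + FN): share of true linkers that were predicted
    success  = TP / (TP + FP): share of predictions that hit a true linker
    Returning 0.0 when a denominator is zero is an assumption of this sketch.
    """
    coverage = 100.0 * tp / (tp + fn) if (tp + fn) else 0.0
    success = 100.0 * tp / (tp + fp) if (tp + fp) else 0.0
    return coverage, success


# Hypothetical protein: 2 linkers found, 1 spurious prediction, 1 linker missed
cov, suc = coverage_and_success(tp=2, fp=1, fn=1)
print(f"coverage = {cov:.1f}%, success = {suc:.1f}%")
```

Averaging these per-protein values over a test set would give figures of the kind shown in the table.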

Page 21: SnapDRAGON:    protein 3D prediction-based DOMAINATION: based on PSI-BLAST

SnapDRAGON

• Is very slow (can be hours for proteins > 400 aa) – cluster computing implementation

• Uses consistency in the absence of a standard of truth

• Goes from primary+secondary to tertiary structure to ‘just’ chop protein sequences

• SnapDRAGON webserver is underway

Page 22: SnapDRAGON:    protein 3D prediction-based DOMAINATION: based on PSI-BLAST

DOMAINATION

Richard A. George

Protein domain identification and improved sequence searching using PSI-BLAST

(George & Heringa, Prot. Struct. Func. Genet., in press; 2002)

Integrating protein sequence database searching and domain recognition

Page 23: SnapDRAGON:    protein 3D prediction-based DOMAINATION: based on PSI-BLAST

Domaination

• Current iterative homology search methods do not take into account that:

– Domains may have different ‘rates of evolution’.

– Common conserved domains, such as the tyrosine kinase domain, can obscure weak but relevant matches to other domain types

– Premature convergence (false negatives)

– Matrix migration / Profile wander (false positives).

Page 24: SnapDRAGON:    protein 3D prediction-based DOMAINATION: based on PSI-BLAST

PSI-BLAST

• Query sequence is first scanned for the presence of so-called low-complexity regions (Wootton and Federhen, 1996), i.e. regions with a biased composition (e.g. TM regions or coiled coils) likely to lead to spurious hits, which are excluded from alignment.

• Initially operates on a single query sequence by performing a gapped BLAST search.

• Then takes the significant local alignments found, constructs a ‘multiple alignment’ and abstracts a position-specific scoring matrix (PSSM) from this alignment.

• Rescans the database in a subsequent round to find more homologous sequences. Iteration continues until the user decides to stop or the search converges.
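
The step from a multiple alignment to a PSSM can be illustrated with a deliberately simplified sketch. The function `build_pssm`, the uniform background frequencies and the flat pseudocount are assumptions made for illustration only; PSI-BLAST itself applies sequence weighting and a more elaborate pseudocount scheme.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Build a toy log-odds PSSM from equal-length aligned sequences.

    Assumptions of this sketch: uniform background frequencies, a flat
    pseudocount per residue type, and gap characters simply ignored.
    """
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        scores = {
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        }
        pssm.append(scores)
    return pssm


# Hypothetical three-sequence alignment
pssm = build_pssm(["ACD", "ACE", "ACD"])
print(round(pssm[0]["A"], 2), round(pssm[0]["C"], 2))
```

Residues that are conserved in a column receive positive scores and unobserved residues negative ones, which is what lets the next search round pick up more distant homologues.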

Page 25: SnapDRAGON:    protein 3D prediction-based DOMAINATION: based on PSI-BLAST

PSI-BLAST iteration

[Figure: the query sequence (Q) is used in a gapped BLAST search; the database hits are converted into a PSSM, which drives the next gapped BLAST search to collect further database hits]

Page 26: SnapDRAGON:    protein 3D prediction-based DOMAINATION: based on PSI-BLAST

DOMAINATION

Chop and Join Domains

Page 27: SnapDRAGON:    protein 3D prediction-based DOMAINATION: based on PSI-BLAST

Post-processing low complexity

Remove local fragments with > 15% low complexity (LC)
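
As a rough illustration of this filter, the sketch below assumes that low-complexity residues have already been masked with `X` by an upstream tool; the masking convention and the function names are hypothetical, and only the 15% threshold comes from the slide.

```python
def low_complexity_fraction(fragment, mask_char="X"):
    """Fraction of residues in a fragment that are masked as low complexity."""
    return fragment.count(mask_char) / len(fragment) if fragment else 0.0


def filter_fragments(fragments, max_lc=0.15):
    """Keep only fragments whose low-complexity fraction is at most max_lc."""
    return [f for f in fragments if low_complexity_fraction(f) <= max_lc]


# Hypothetical fragments: the second one is 30% masked and is dropped
kept = filter_fragments(["ACDEFGHIKL", "ACXXXGHIKL"])
print(kept)
```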

Page 28: SnapDRAGON:    protein 3D prediction-based DOMAINATION: based on PSI-BLAST

Identifying domain boundaries

Sum N- and C-termini of gapped local alignments

True N- and C-termini are counted twice (within 10 residues)

Boundaries are smoothed using two windows (15 residues long)

Combine scores using biased protocol:

$$S_i = \begin{cases} N_i + C_i & \text{if } N_i \times C_i = 0 \\ N_i + C_i + \dfrac{N_i \times C_i}{N_i + C_i} & \text{otherwise} \end{cases}$$
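
The biased protocol rewards positions where N-terminal and C-terminal signals coincide. A minimal sketch follows, assuming per-residue lists of N- and C-terminus counts; the slide does not specify the window shape, so the centred sliding sum in `smooth` is an assumption, as are all function names.

```python
def smooth(counts, window=15):
    """Sum counts in a centred window at each position (assumed window shape)."""
    half = window // 2
    n = len(counts)
    return [sum(counts[max(0, i - half):min(n, i + half + 1)]) for i in range(n)]


def biased_score(n_i, c_i):
    """Combine smoothed N- and C-terminus scores at one position.

    S_i = N_i + C_i                              if N_i * C_i == 0
    S_i = N_i + C_i + (N_i * C_i) / (N_i + C_i)  otherwise
    """
    if n_i * c_i == 0:
        return n_i + c_i
    return n_i + c_i + (n_i * c_i) / (n_i + c_i)


def boundary_profile(n_counts, c_counts, window=15):
    """Smooth both count profiles and combine them position by position."""
    n_s = smooth(n_counts, window)
    c_s = smooth(c_counts, window)
    return [biased_score(a, b) for a, b in zip(n_s, c_s)]


# Hypothetical 5-residue example with a short window for readability
profile = boundary_profile([0, 0, 1, 0, 0], [0, 0, 1, 0, 0], window=3)
print(profile)
```

Peaks in the resulting profile would then be taken as candidate domain boundaries.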

Page 29: SnapDRAGON:    protein 3D prediction-based DOMAINATION: based on PSI-BLAST

Identifying domain deletions

• Deletions in the query (or insertions in the DB sequences) are identified by two adjacent segments in the query that align to the same DB sequences (>70% overlap), which have a region of >35 residues not aligned to the query. (remove N- and C- termini)

[Figure: DB sequence aligned against the query]
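
The deletion test can be sketched for a pair of local alignments to the same DB sequence. The `LocalAlignment` type, the `max_query_gap` tolerance used to decide that two query segments are adjacent, and the function name are hypothetical; the >70% overlap criterion is not modelled here, and only the 35-residue threshold comes from the slide.

```python
from typing import NamedTuple


class LocalAlignment(NamedTuple):
    """One local alignment between the query and a DB sequence (1-based, inclusive)."""
    q_start: int
    q_end: int
    db_start: int
    db_end: int


def is_query_deletion(first, second, min_unaligned=35, max_query_gap=10):
    """True if two query-adjacent segments flank a long unaligned DB region.

    max_query_gap is an assumed tolerance for calling query segments adjacent.
    """
    a, b = sorted((first, second), key=lambda hit: hit.q_start)
    query_gap = b.q_start - a.q_end - 1   # residues between the query segments
    db_gap = b.db_start - a.db_end - 1    # DB residues not aligned to the query
    return query_gap <= max_query_gap and db_gap > min_unaligned


# Hypothetical hits: adjacent in the query, 59 unaligned residues in the DB sequence
left = LocalAlignment(1, 100, 1, 100)
right = LocalAlignment(101, 200, 160, 260)
print(is_query_deletion(left, right))
```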

Page 30: SnapDRAGON:    protein 3D prediction-based DOMAINATION: based on PSI-BLAST

Identifying domain permutations

• A domain shuffling event is declared when two local alignments (>35 residues) within a single DB sequence match two separate segments in the query (>70% overlap), but have a different sequential order.

[Figure: segments a and b appear in the order b, a in the DB sequence and a, b in the query]
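
A check for a changed sequential order might look like the sketch below. Hits are represented as hypothetical `(q_start, q_end, db_start, db_end)` tuples on the same DB sequence; the >70% overlap criterion is left out, and the function name is an assumption.

```python
def is_domain_permutation(hit_a, hit_b, min_length=35):
    """True if two hits to one DB sequence occur in a different order than in the query.

    Each hit is (q_start, q_end, db_start, db_end), 1-based and inclusive.
    """
    # Both local alignments must be longer than min_length residues
    for q_start, q_end, _, _ in (hit_a, hit_b):
        if q_end - q_start + 1 <= min_length:
            return False
    first, second = sorted((hit_a, hit_b))  # order by position in the query
    # The two query segments must be separate
    if second[0] <= first[1]:
        return False
    # Permutation: the segment that comes later in the query comes earlier in the DB
    return second[2] < first[2]


# Hypothetical hits: a then b in the query, b then a in the DB sequence
print(is_domain_permutation((1, 100, 150, 250), (120, 220, 1, 100)))
```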

Page 31: SnapDRAGON:    protein 3D prediction-based DOMAINATION: based on PSI-BLAST

Identifying continuous and discontinuous domains

• Each segment is assigned an independence score (In). If In > 10% the segment is assigned as a continuous domain.

• An association score is calculated between non-adjacent fragments by assessing the shared sequence hits to the segments. If score > 50% then segments are considered as discontinuous domains and joined.
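
The slide does not say how the association score is normalised. The sketch below assumes one plausible choice, the number of shared database hits as a percentage of the smaller hit set; this normalisation and both function names are assumptions, and only the 50% threshold is taken from the slide.

```python
def association_score(hits_a, hits_b):
    """Percentage of shared DB hits, relative to the smaller hit set (assumed)."""
    a, b = set(hits_a), set(hits_b)
    if not a or not b:
        return 0.0
    return 100.0 * len(a & b) / min(len(a), len(b))


def should_join(hits_a, hits_b, threshold=50.0):
    """Join two non-adjacent segments if their association score exceeds the threshold."""
    return association_score(hits_a, hits_b) > threshold


# Hypothetical hit lists for two non-adjacent query segments
segment_1 = {"s1", "s2", "s3", "s4"}
segment_2 = {"s3", "s4", "s5"}
print(association_score(segment_1, segment_2), should_join(segment_1, segment_2))
```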

Page 32: SnapDRAGON:    protein 3D prediction-based DOMAINATION: based on PSI-BLAST

Create domain profiles

• A representative set of the database sequence fragments that overlap a putative domain is selected for alignment using OBSTRUCT (Heringa et al. 1992): > 20% and < 60% sequence identity (including the query seq).

• A multiple sequence alignment is generated using PRALINE (Heringa 1999).

• Each domain multiple alignment is used as a profile in further database searches using PSI-BLAST (Altschul et al 1997).

• The whole process is iterated until no new domains are identified.

Page 33: SnapDRAGON:    protein 3D prediction-based DOMAINATION: based on PSI-BLAST

Domain boundary prediction accuracy

• Set of 452 multidomain proteins

• 56% of proteins were correctly predicted to have more than one domain

• 42% of predictions are within 20 residues of a true boundary

• 49.9% (44.6%) correct boundary predictions per protein

Page 34: SnapDRAGON:    protein 3D prediction-based DOMAINATION: based on PSI-BLAST

• 23.3% of all linkers found in 452 multidomain proteins. Not a surprise since:

– Structural domain boundaries will not always coincide with sequence domain boundaries

– Proteins must have some domain shuffling

• For discontinuous proteins 34.2% of linkers were identified

• 30% of discontinuous domains were successfully joined

Page 35: SnapDRAGON:    protein 3D prediction-based DOMAINATION: based on PSI-BLAST

Change in domain prediction accuracy using various PSI-BLAST E-value cut-offs

Page 36: SnapDRAGON:    protein 3D prediction-based DOMAINATION: based on PSI-BLAST

Benchmarking versus PSI-BLAST

• A set of 452 non-homologous multidomain protein structures.

• Each protein was delineated into its structural domains. Database searches of the individual domains were used as a standard of truth.

• We then tested to what extent PSI-BLAST and DOMAINATION, when run on the full-length protein sequences, can capture the sequences found by the reference PSI-BLAST searches using the individual domains.

Page 37: SnapDRAGON:    protein 3D prediction-based DOMAINATION: based on PSI-BLAST

Two sets based on individual domain searches:

• Reference set 1: consists of database sequences for which PSI-BLAST finds all domains contained in the corresponding full length query.

• Reference set 2: consists of database sequences found by searching with one or more of the domain sequences

• Therefore set 2 contains many more sequences than set 1

[Figure: query and DB seqs for Ref set 1 and Ref set 2]

Page 38: SnapDRAGON:    protein 3D prediction-based DOMAINATION: based on PSI-BLAST

Sequences found over Reference sets 1 and 2

                PSI-BLAST       DOMAINATION     PSI-BLAST       DOMAINATION
                vs Ref set 1    vs Ref set 1    vs Ref set 2    vs Ref set 2
Seq's found     28581           28921           67300           73274
Seq's missed    618             278             13542           7568
% missed        2.12            0.95            16.8            9.36
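
The "% missed" row follows from the found and missed counts as missed / (found + missed). A short sketch that recomputes it from the numbers in the table; the function name `percent_missed` is hypothetical.

```python
def percent_missed(found, missed):
    """Percentage of reference sequences that a search failed to retrieve."""
    return 100.0 * missed / (found + missed)


# (found, missed) counts as given in the table
results = {
    "PSI-BLAST vs Ref set 1": (28581, 618),
    "DOMAINATION vs Ref set 1": (28921, 278),
    "PSI-BLAST vs Ref set 2": (67300, 13542),
    "DOMAINATION vs Ref set 2": (73274, 7568),
}

for label, (found, missed) in results.items():
    print(f"{label}: {percent_missed(found, missed):.2f}% missed")
```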

Page 39: SnapDRAGON:    protein 3D prediction-based DOMAINATION: based on PSI-BLAST

Reference 1

• PSI-BLAST finds 97.9% of sequences

• Domaination finds 99.1% of sequences

Reference 2

• PSI-BLAST finds 83.2% of sequences

• Domaination finds 90.6% of sequences

Page 40: SnapDRAGON:    protein 3D prediction-based DOMAINATION: based on PSI-BLAST

Sequences found over Reference sets 1 and 2 from 15 Smart sequences

                PSI-BLAST       DOMAINATION     PSI-BLAST       DOMAINATION
                vs Ref set 1    vs Ref set 1    vs Ref set 2    vs Ref set 2
Seq's found     323             347             3672            5902
Seq's missed    24              0               3438            1202
% missed        6.9             0               48.4            17.0

Page 41: SnapDRAGON:    protein 3D prediction-based DOMAINATION: based on PSI-BLAST

SSEARCH significance test

• Verify the statistical significance of database sequences found by relating them to the original query sequence.

• SSEARCH (Pearson & Lipman 1988). Calculates an E-value for each generated local alignment.

• This filter will lose distant homologies.

• Use the 452 proteins with known structure.

Page 42: SnapDRAGON:    protein 3D prediction-based DOMAINATION: based on PSI-BLAST

Significant sequences found in database searches

At an E-value cut-off of 0.1 the performance of DOMAINATION searches with the full-length proteins is 15% better than PSI-BLAST

Page 43: SnapDRAGON:    protein 3D prediction-based DOMAINATION: based on PSI-BLAST

Summary

Domains are recurring evolutionary units: by collecting the N- and C- termini of local alignments we can identify domain boundaries.

By finding domains we can significantly improve database search methods

SnapDRAGON is more sensitive than DOMAINATION but at high computational cost