BL5203: Molecular Recognition & Interaction

Lecture 6: Modeling Protein Structure and Protein-Protein Interaction

Y.Z. Chen
Department of Pharmacy
National University of Singapore
Tel: 65-6616-6877; Email: [email protected]; Web: http://bidd.nus.edu.sg

Content

• Protein fold and structure

• Homology modeling

• Protein-protein docking


Sizes of protein databases

[Figure: bar chart on a log scale (1 to 10,000,000,000). Protein residues: 500M; protein sequences: 1.6M; protein structures: 26K; protein complexes: 1K.]

Swiss-Prot database

Protein structure classification

[Figure: classification hierarchy: protein world, protein fold, protein superfamily, protein family.]

PDB New Fold Growth

• The number of unique folds in nature is fairly small (possibly a few thousand)

• 90% of new structures submitted to PDB in the past three years have similar structural folds in PDB

[Figure: new PDB structures, divided into new folds and old folds.]

Protein classification

• The number of protein sequences grows exponentially

• The number of solved structures grows exponentially

• The number of new folds identified is very small (and close to constant)

• Protein classification can
– Generate an overview of structure types
– Detect similarities (evolutionary relationships) between protein sequences

Problems in Protein Bioinformatics

• 20,000 entries of proteins in the PDB

• 1000 - 2000 distinct protein folds in nature

• Thought to be only several thousand unique folds in all

• Prediction of structure from sequence
– Fold recognition
– Fragment construction

• Proteome annotation

• Protein-protein docking

Protein folding code

[Figure: protein sequence → protein folding code → protein structure.]

Prediction of correct fold

[Figure: query sequence → fold recognition → matched fold (Eisenberg et al.; Jones, Taylor, Thornton).]

Match sequence against library of known folds

Computational Requirements

• 1 sequence search takes 12 mins (3 GHz)

• Benchmarking on 100 proteins with 100 runs for a simplex search of parameter space = 80 days

• 30 approaches explored = 7 years (on 1 CPU)
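The figures on this slide follow from simple multiplication. A minimal sketch of the arithmetic, assuming every search runs serially on a single CPU (the function and variable names are illustrative, not from the lecture):

```python
MINUTES_PER_DAY = 60 * 24


def benchmark_days(minutes_per_search, n_proteins, n_runs):
    """CPU days for a serial benchmark: one search per protein per run."""
    return minutes_per_search * n_proteins * n_runs / MINUTES_PER_DAY


# 12 minutes per search, 100 proteins, 100 simplex runs
one_benchmark = benchmark_days(12, 100, 100)

# 30 approaches, each needing one full benchmark
all_approaches_years = 30 * one_benchmark / 365

print(round(one_benchmark, 1), "days per benchmark")
print(round(all_approaches_years, 1), "years for 30 approaches")
```

The computed values (roughly 83 days and 6.8 years) agree with the rounded 80 days and 7 years quoted on the slide.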

Types of Structure Prediction

• De novo protein methods seek to build three-dimensional protein models "from scratch"
– Example: Rosetta

• Comparative protein modeling uses previously solved structures as starting points, or templates
– Example: protein threading

Factors that Make Protein Structure Prediction a Difficult Task

• The number of possible structures that proteins may possess is extremely large, as highlighted by the Levinthal paradox

• The physical basis of protein structural stability is not fully understood.

• The primary sequence may not fully specify the tertiary structure
– Example: some proteins need chaperones to fold

• Direct simulation of protein folding is not generally tractable for both practical and theoretical reasons.
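The scale behind the Levinthal paradox can be illustrated with a back-of-the-envelope calculation. The numbers below are assumptions chosen for illustration only (3 conformations per residue, a 100-residue chain, 10^13 conformations sampled per second); they do not come from the lecture.

```python
SECONDS_PER_YEAR = 3600 * 24 * 365


def levinthal_estimate(n_residues, states_per_residue=3, samples_per_second=1e13):
    """Return (number of conformations, years needed to enumerate them all)."""
    conformations = states_per_residue ** n_residues
    years = conformations / samples_per_second / SECONDS_PER_YEAR
    return conformations, years


conformations, years = levinthal_estimate(100)
print(f"{conformations:.1e} conformations, {years:.1e} years to enumerate")
```

Under these assumptions exhaustive enumeration would take on the order of 10^27 years, which is why direct search of conformational space is not tractable.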

Homology Modeling

• Homolog: a protein related to the target by divergent evolution from a common ancestor

• With 40% amino-acid identity to its homolog
– No large insertions or deletions
– Produces a predicted structure equivalent to that of a medium-resolution experimentally solved structure

• 25% of known protein sequences fall in a safe area, implying they can be modeled reliably
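The two criteria on this slide (sequence identity and the absence of large insertions or deletions) can be checked directly on a pairwise alignment. This is a minimal sketch: identity is computed over columns where both sequences have a residue, which is one of several conventions in use, and the function names are illustrative.

```python
MIN_IDENTITY = 40.0  # percent, the threshold quoted on the slide


def percent_identity(aligned_target, aligned_template):
    """Percent of identical residues over columns with no gap in either row."""
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(aligned_target, aligned_template)
             if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)


def longest_gap(aligned):
    """Length of the longest run of '-' in one row of the alignment."""
    longest = run = 0
    for ch in aligned:
        run = run + 1 if ch == "-" else 0
        longest = max(longest, run)
    return longest


identity = percent_identity("ACDEFGHIKL", "ACDEYGHLKV")
print(identity, identity >= MIN_IDENTITY)
```

What counts as a "large" insertion or deletion is not specified in the lecture, so `longest_gap` only reports the gap length and leaves the cut-off to the user.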

Homology Modeling Defined

• Homology modeling
– Based on the reasonable assumption that two homologous proteins will share very similar structures.
– Given the amino acid sequence of an unknown structure and the solved structure of a homologous protein, each amino acid in the solved structure is mutated computationally into the corresponding amino acid from the unknown structure.
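The "mutate each amino acid" step can be sketched as a walk along the target-template alignment that lists which template residues must be replaced. This is a simplified illustration with hypothetical names: it only handles substitutions, and columns containing a gap (which would need loop building) are skipped.

```python
def residue_substitutions(aligned_target, aligned_template):
    """List (template_position, template_residue, target_residue) for every
    aligned column where the two residues differ. Positions are 1-based and
    count template residues only; gapped columns are skipped."""
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have equal length")
    substitutions = []
    position = 0
    for target_res, template_res in zip(aligned_target, aligned_template):
        if template_res != "-":
            position += 1
        if "-" in (target_res, template_res):
            continue  # insertion or deletion: not handled in this sketch
        if target_res != template_res:
            substitutions.append((position, template_res, target_res))
    return substitutions


print(residue_substitutions("ACDE-G", "ACNEFG"))
```

In a real modeling program each listed substitution would be followed by rebuilding the side chain on the template backbone; that geometric step is outside the scope of this sketch.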

Homology Modeling Limitations

• Cannot study conformational changes

• Cannot find new catalytic/binding sites

• Cannot explain activity vs. lack of activity
– Chymotrypsinogen, trypsinogen and plasminogen
– 40% sequence identity
– 2 active, 1 with no activity; the model cannot explain why

• Large bias towards the structure of the template

• Models cannot be docked together

Why Homology Modeling?

• Value in structure-based drug design

• Find common catalytic sites/molecular recognition sites

• Use as a guide to planning and interpreting experiments

• 70-80% chance that the target protein has a similar fold to a protein whose structure has been solved by X-ray crystallography or NMR spectroscopy

• Sometimes it's the only option or best guess

Protein Threading

• A target sequence is threaded through the backbone structure of a collection of template proteins (fold library)

• Quantitative measure of how well the sequence fits the fold

• Based on assumptions
– 3-D structures of proteins have characteristics that are semi-quantitatively predictable
– These characteristics reflect the physical-chemical properties of amino acids
– Limited types of interactions are allowed within folding

Fold Recognition Methods

• Bowie, Lüthy and Eisenberg (1991)

• 2 approaches to recognition methods

• Derive a 1-D profile for each structure in the fold library and align the target sequence to these profiles
– Identify amino acids based on core or external positions
– Part of secondary structure

• Consider the full 3-D structure of the protein template
– Modeled as a set of inter-atomic distances
– NP-hard (if interactions of multiple residues are included)

• The word threading implies that one drags the sequence (ACDEFG...) step by step through each location on each template
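The 1-D profile approach and the "dragging" picture can be sketched as a gapless scan. This is a toy illustration, not the Bowie-Lüthy-Eisenberg method itself: the two environment classes ('B' buried, 'E' exposed), the hydrophobic residue set and the +1/-1 scores are all assumptions made up for the example.

```python
HYDROPHOBIC = set("AVLIMFWC")  # assumed residue classes for this toy example


def residue_score(residue, environment):
    """+1 if the residue suits the environment ('B' buried, 'E' exposed), else -1."""
    hydrophobic = residue in HYDROPHOBIC
    if environment == "B":
        return 1 if hydrophobic else -1
    return -1 if hydrophobic else 1


def thread(sequence, template_env):
    """Drag the sequence along the template without gaps.
    Returns (best_score, best_offset)."""
    best = None
    for offset in range(len(template_env) - len(sequence) + 1):
        score = sum(residue_score(res, template_env[offset + i])
                    for i, res in enumerate(sequence))
        if best is None or score > best[0]:
            best = (score, offset)
    return best


def recognize_fold(sequence, library):
    """Return (fold_name, (score, offset)) for the best-scoring template."""
    results = {name: thread(sequence, env)
               for name, env in library.items() if len(env) >= len(sequence)}
    return max(results.items(), key=lambda item: item[1][0])


library = {"foldA": "EEBEBEE", "foldB": "BBBBEEEE"}  # hypothetical fold library
print(recognize_fold("LKVE", library))
```

Real threading programs allow gaps and use many more environment classes; the gapless scan is kept here only because it mirrors the step-by-step dragging described above.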

Generalized Threading Score

• Want to correctly recognize arrangements of residues

• Building a score function
– potentials of mean force
– from an optimization calculation

• $G(r_{AB}) = -kT \ln(\rho_{AB} / \rho_{AB}^{\circ})$
– $G$: free energy
– $k$ and $T$: Boltzmann's constant and temperature, respectively
– $\rho_{AB}$: the observed frequency of AB pairs at distance $r$
– $\rho_{AB}^{\circ}$: the frequency of AB pairs at distance $r$ you would expect to see by chance

• Z-score $= (E_{\mathrm{Nat}} - \langle E_{\mathrm{alt}} \rangle) / \sigma_{E_{\mathrm{alt}}}$
– Energy of the native structure minus the mean energy of all the wrong structures, divided by the standard deviation of their energies
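Both quantities on this slide are short calculations. The sketch below assumes the usual inverse-Boltzmann sign convention, in which a pair seen more often than chance gets a negative (favorable) energy, and uses the Boltzmann constant in kcal/(mol·K); the function names and the example numbers are illustrative.

```python
import math
from statistics import mean, pstdev

K_B = 0.0019872  # Boltzmann constant in kcal/(mol*K)


def pmf(observed_freq, expected_freq, temperature=300.0):
    """Potential of mean force for one residue pair at one distance."""
    return -K_B * temperature * math.log(observed_freq / expected_freq)


def z_score(native_energy, alternative_energies):
    """How many standard deviations the native energy lies from the mean
    energy of the alternative (wrong) structures."""
    return ((native_energy - mean(alternative_energies))
            / pstdev(alternative_energies))


print(pmf(0.2, 0.1))               # pair seen twice as often as chance
print(z_score(-10.0, [-2.0, 0.0, 2.0]))
```

A strongly negative Z-score means the native arrangement scores much better than the decoys, which is the behavior a useful threading potential should show.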

Scoring Different Folds

• Goodness of fit score
– Based on empirical energy function
– Modify to take into account pairwise interactions and solvation terms
– High score means good fit
– Low score means nothing learned

Some Threading Programs

• 3D-PSSM (ICNET). Based on sequence profiles, solvation potentials and secondary structure.

• TOPITS (PredictProtein server) (EMBL). Based on coincidence of secondary structure and accessibility.

• UCLA-DOE Structure Prediction Server (UCLA). Executes various threading programs and reports a consensus.

• 123D+. Combines substitution matrix, secondary structure prediction, and contact capacity potentials.

• SAM/HMM (UCSC). Based on hidden Markov models of alignments of crystallized proteins.

• FFAS (Burnham Institute). Based on profile-profile matching algorithms of the query sequence with sequences from a clustered PDB database.

• PSIPRED-GenThreader (Brunel)

• THREADER2 (Warwick). Based on solvation potentials and contacts obtained from crystallized proteins.

• ProFIT CAME (Salzburg)

Process of 3D Structure Prediction by Threading

• Does this protein have sequence similarity to others with a known structure?

• Structure-related information in the databases

• Results from threading programs

• Predicted folding comparison

• Threading on the structure and mapping of the known data

• A comparison between the threading-predicted structure and the actual one

Protein Threading Based on Multiple Protein Structure Alignment

Tatsuya Akutsu and Kim Lan Sim
Human Genome Center, Institute of Medical Science, University of Tokyo

• NP-hard if interactions between 2 or more amino acids are included

• Determine multiple structural alignments based on pairwise structure alignments
– Center Star Method

Center Star Method

• Let $I_0$ be the maximum number of gap symbols placed before the first residue of $S_0$ in any of the alignments $A(S_0, S_1), \ldots, A(S_0, S_N)$. Let $I_{|S_0|}$ be the maximum number of gaps placed after the last character of $S_0$ in any of the alignments, and let $I_i$ be the maximum number of gaps placed between characters $S_{0,i}$ and $S_{0,i+1}$, where $S_{j,i}$ denotes the $i$-th letter of string $S_j$.

• Create a string $S_0'$ by inserting $I_0$ gaps before $S_0$, $I_{|S_0|}$ gaps after $S_0$, and $I_i$ gaps between $S_{0,i}$ and $S_{0,i+1}$.

• For each $S_j$ ($j > 0$), create a pairwise alignment $A'(S_0', S_j)$ between $S_0'$ and $S_j$ by inserting gaps into $S_j$ so that deletion of the columns consisting of gaps from $A'(S_0', S_j)$ results in the same alignment as $A(S_0, S_j)$.

• Simply arrange the $A'(S_0', S_j)$'s into a single matrix $A$ (note that all $A'(S_0', S_j)$'s have the same length).
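The merging step of the center star method can be sketched on plain strings. This is an illustrative sketch, not the authors' program: it takes the pairwise alignments as given (pairs of gapped strings, center sequence first), and it places inserted letters at the left of each gap slot, which is one of several equivalent choices.

```python
def merge_center_star(pairwise):
    """pairwise: list of (aligned_center, aligned_other) strings with '-' gaps.
    Returns the rows of the merged alignment, padded center sequence first."""
    center = pairwise[0][0].replace("-", "")
    n = len(center)
    # gaps[i]: max number of gaps before center residue i (gaps[n]: after the end)
    gaps = [0] * (n + 1)
    for aligned_center, _ in pairwise:
        if aligned_center.replace("-", "") != center:
            raise ValueError("all alignments must share the same center sequence")
        pos = run = 0
        for ch in aligned_center:
            if ch == "-":
                run += 1
            else:
                gaps[pos] = max(gaps[pos], run)
                pos, run = pos + 1, 0
        gaps[n] = max(gaps[n], run)

    # Padded center string
    rows = ["".join("-" * gaps[i] + center[i] for i in range(n)) + "-" * gaps[n]]

    # Re-gap every other sequence so it lines up with the padded center
    for aligned_center, aligned_other in pairwise:
        out, inserted, pos = [], [], 0
        for c0, cj in zip(aligned_center, aligned_other):
            if c0 == "-":
                inserted.append(cj)      # letter aligned against a center gap
            else:
                out.append("".join(inserted).ljust(gaps[pos], "-"))
                out.append(cj)
                inserted, pos = [], pos + 1
        out.append("".join(inserted).ljust(gaps[n], "-"))
        rows.append("".join(out))
    return rows


example = [("AC-GT", "ACTGT"), ("ACGT", "A-GT"), ("-ACGT", "TACGT")]
for row in merge_center_star(example):
    print(row)
```

Every row comes out with the same length, so the rows can be stacked directly into the matrix A described above.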

Simple Threading Algorithm

• Apply simple score function based on structure alignment algorithm
– Let $X = x_1 \ldots x_N$ (input amino acid sequence)
– $C_i$ ($i$-th column in $A$)

• Test and analyze results and/or apply constraints
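Threading an input sequence X against the columns of a multiple alignment A can be sketched with a standard global-alignment dynamic program. The score function below (the fraction of letters in a column that equal the residue) and the gap penalty are assumptions made for illustration; the lecture does not specify the actual score function.

```python
def column_score(residue, column):
    """Fraction of non-gap letters in the column equal to the residue
    (a hypothetical score, chosen for simplicity)."""
    letters = [c for c in column if c != "-"]
    return letters.count(residue) / len(letters) if letters else 0.0


def thread_to_alignment(x, rows, gap=-0.5):
    """Best global alignment score of sequence x against the columns of the
    alignment given as equal-length rows."""
    columns = list(zip(*rows))
    n, m = len(x), len(columns)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + column_score(x[i - 1], columns[j - 1]),
                dp[i - 1][j] + gap,   # residue of x left unaligned
                dp[i][j - 1] + gap,   # column of A skipped
            )
    return dp[n][m]


alignment_rows = ["ACGT", "ACGT", "A-GT"]
print(thread_to_alignment("ACGT", alignment_rows))
print(thread_to_alignment("AGT", alignment_rows))
```

A constraint of the kind "this segment of X must match these columns" could be added by fixing the corresponding diagonal cells of the table, but that extension is not shown here.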

Protein Threading with Constraints

• Assume part of the input sequence $x_i \ldots x_{i+k}$ must correspond to part of the structure alignment $c_j \ldots c_{j+k}$

• Apply constraints

Prediction Power

• Entered in CASP3 competition

• 17 predictions made

• 3 targets evaluated as similar to correct folds

• Only team to create a nearly correct model for structure T0043

• Best in competition
– 8 evaluated as similar to correct

Next time…

• In-depth detail of
– Multiple structural alignment program

• Multiprospector
– Global Optimum Protein Threading with Gapped Alignment

• Quality measures for protein threading models

• Improvements on threading-based models

Gapped Alignment

Fragment based method
1 - Predict structure of segment

Trial structures for a local sequence are taken from a database of segments of known 3D structure.

Fragment based method
2 - Construct trial model from segments

1 - Low resolution energy function used in initial search through conformational space

2 - Side chains represented by single “centroid” pseudoatom

3 - Major contributions from
– Hydrophobic burial
– Beta strand pairing
– Steric overlap
– Specific residue pair interactions

4 - Models then refined using explicit rotamer based side chain representation and potential from design method

Fragment based method
3 - Identify good trial structures

Fragment-based protein folding

[Figure: observed structure of Cro repressor (1orc).]


• Methodology performs numerous simulations and looks for clusters

• One simulation takes 3 mins (3 GHz)

• Requires 1,000 simulations per protein = 2 days

• Benchmark on 50 proteins = 100 days

3D-GENOMICS - proteome annotation

[Figure: flowchart with boxes: database sequences, database structures, proteome sequences, functional data, annotation procedure, MySQL database, WWW, new research.]

Types of annotation

[Figure: regions of a protein sequence labeled by annotation type: "Enzyme ABC, EC 1.2.3.4" (function suggested); "E. coli Protein 325" (homology but no function); membrane protein; no similar sequence (orphan); structure.]

3D-Genomics database - structural and functional annotation

[Figure: stacked bar chart, one bar per proteome, ordered by size: M. genitalium, H. pylori J99, A. aeolicus, M. jannaschii, P. horikoshii, H. influenzae, V. cholerae, M. tuberculosis H37Rv, B. subtilis, E. coli K12, S. cerevisiae, D. melanogaster, C. elegans, H. sapiens. x-axis: fraction of proteome (% of residues), 0% to 100%. Categories: structure, function, any homology, non globular, orphan.]

Computational requirements

• Today there are 800,000 protein sequences.

• Each sequence takes 15 mins to annotate on a 2.5 GHz CPU.

• Time today = 8,000 CPU days = 2.5 months with a 100-processor farm.

• Need to update every 6 months.

• The number of sequences will double in 2-3 years, and so will keep pace with the increase in compute power.

Modelling protein-protein docking

[Figure: docking workflow. Coordinates of mol 1 and coordinates of mol 2 → rigid body search → list of possible complexes → evaluate association energy → flexibility to refine → list of complexes. Experimental information is an additional input.]

Protein-protein docking

Step 1 - Generating Complexes

Shape complementarity

[Figure: the two molecules discretized on grids $A(i,j,k)$ and $B(l,m,n)$, with cell values +1 and -15.]

$C = A(i,j,k) \times B(l,m,n)$

Overlap: $+1 \times -15$

Match: $+1 \times +1$
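The grid product behind shape complementarity can be sketched on a tiny 2-D example. The grid values are assumptions that follow the common grid-correlation docking scheme (molecule A: +1 on the surface layer, -15 in the core, 0 outside; molecule B: +1 inside, 0 outside), and the direct summation over shifts stands in for the Fourier transform that production programs typically use for speed.

```python
def correlation_score(A, B, shift):
    """Sum of A[i][j] * B[i - di][j - dj] over all cells where both exist."""
    di, dj = shift
    total = 0
    for i, row in enumerate(A):
        for j, a in enumerate(row):
            bi, bj = i - di, j - dj
            if 0 <= bi < len(B) and 0 <= bj < len(B[0]):
                total += a * B[bi][bj]
    return total


def best_shift(A, B, max_shift):
    """Translation of B (within +/- max_shift cells) with the highest score."""
    shifts = [(di, dj)
              for di in range(-max_shift, max_shift + 1)
              for dj in range(-max_shift, max_shift + 1)]
    return max(shifts, key=lambda s: correlation_score(A, B, s))


# Molecule A: surface layer (+1) around a one-cell core (-15)
A = [[0, 0, 0, 0, 0],
     [0, 1, 1, 1, 0],
     [0, 1, -15, 1, 0],
     [0, 1, 1, 1, 0],
     [0, 0, 0, 0, 0]]
# Molecule B: two occupied cells (+1)
B = [[0, 0, 0, 0, 0],
     [0, 0, 0, 0, 0],
     [0, 0, 1, 1, 0],
     [0, 0, 0, 0, 0],
     [0, 0, 0, 0, 0]]

shift = best_shift(A, B, 2)
print(shift, correlation_score(A, B, shift))
```

Placing B on top of A's core is penalized by the -15 term, while placing it along the surface layer is rewarded, so the best shifts are those where both cells of B sit on A's surface.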

Electrostatic complementarity

[Figure: grid cells labeled with charges +1 and -1.]

Charge in 1 = $Q(i,j,k)$; potential outside 2 = $V(l,m,n)$

$E = Q(i,j,k) \times V(l,m,n)$

Step 2 - Modelling residue-residue interactions

Empirical residue pair potentials

Analyse residues packing across 90 hetero-protein interfaces

A pair of residues (a, b) pack if there is at least one atom-atom contact < distance cut-off (4.5 Å)

$\mathrm{Score}(a,b) = \log_{10}\left(\dfrac{\text{observed no. of } a/b \text{ pairs}}{\text{expected no. of } a/b \text{ pairs}}\right)$
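The residue pair score can be computed from a list of observed interface contacts. How the expected count is obtained is not stated in the lecture; the sketch below assumes it is the total number of contacts multiplied by the product of the residue frequencies among contacting residues (doubled for unlike pairs, since order does not matter). The contact list is made-up example data.

```python
import math
from collections import Counter


def pair_scores(contacts):
    """contacts: list of (a, b) residue-type pairs that pack across interfaces.
    Returns {(a, b): log10(observed / expected)} with a <= b alphabetically."""
    observed = Counter(tuple(sorted(pair)) for pair in contacts)
    total = len(contacts)
    freq = Counter(res for pair in contacts for res in pair)
    n_residues = 2 * total
    scores = {}
    for (a, b), obs in observed.items():
        prob = (freq[a] / n_residues) * (freq[b] / n_residues)
        if a != b:
            prob *= 2  # (a, b) and (b, a) are the same unordered pair
        scores[(a, b)] = math.log10(obs / (total * prob))
    return scores


# Hypothetical contact list, for illustration only
contacts = [("L", "V")] * 6 + [("K", "E")] * 3 + [("L", "K")]
for pair, score in sorted(pair_scores(contacts).items()):
    print(pair, round(score, 3))
```

Positive scores mark pairs that pack together more often than expected by chance; negative scores mark pairs that are under-represented.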

Step 3 - Including information about functional residues

From literature

Step 4 - Refinement by multicopy

Search for the optimal combination of side-chain rotamers by energy calculation, plus limited rigid-body shifts

CAPRI - blind test of docking

[Figure: unbound amylase; bound Ab (X-ray) compared with bound Ab (predicted).]

Prediction vs. actual: difference = 0.6 Å


• 1 run of the procedure takes 2 days on one 3 GHz processor

• Development tested on 30 protein complexes takes 60 days for one parameter set

• Applications
– Extension to predict which protein interacts with another requires 1000s of docking simulations

Application area

• Protein structure prediction
– fold recognition
– simulation

• Proteome annotation

• Protein-protein docking

Computing cost

• Running a modelling algorithm on one protein takes 10 mins to 2 days on one 3 GHz CPU

• But algorithm development requires consideration of several structures (50-100) with different parameter sets.

• Hence years of CPU time are required

Structure prediction & sequence space

[Figure: example sequences of varying length illustrating sequence space, with links to PDBsum entries:]

http://www.biochem.ucl.ac.uk/bsm/pdbsum/1ctx/tracel.html

http://www.biochem.ucl.ac.uk/bsm/pdbsum/1be7/tracel.html

http://www.biochem.ucl.ac.uk/bsm/pdbsum/1bpi/tracel.html

http://www.biochem.ucl.ac.uk/bsm/pdbsum/1ag6/tracel.html

Multiple sequence alignments aid comparative protein modeling

• 1 in 3 sequences are recognizably related to at least one protein structure.

• A significant fraction of the remaining 2/3 have solved structural homologues, but they are not recognized through sequence similarity searching techniques.

• Marti-Renom et al. (2000)

• Multiple sequence alignments greatly improve the efficacy and accuracy of almost all phases of comparative modeling.

• Venclovas (2001)

Computational protein design

[Figure: native structure → iterative refinement → new sequence.]

Large scale sequence generation

Total sequences generated: 200,000
Processors available: 4,000
Total time of data collection: 80 days
Total backbone variants: 26,400
Total structures: 264

“Reverse BLAST”: finding templates for comparative modeling

Larson SM, Garg A, Desjarlais JR, Pande VS. (2003) Proteins: Structure, Function, and Genetics

Experiment: Sequence quality

[Figure: workflow with steps labeled "Design" and "BLAST E<0.01", illustrated with example sequences.]

Results: Sequence quality

[Figure: x-axis: designed sequence profile (ranked by E-value), 0 to 225. Left y-axis: E-value of best PDB hit (log scale, 1E-17 to 10). Right y-axis: average identity to native sequence (%), 0 to 30.]

Method: “Reverse BLAST”

[Figure: designed sequences, hypothetical proteins and structural templates linked by BLAST searches (E<0.01), illustrated with an example hypothetical protein sequence.]

Do the designed sequences help?

[Figure: correctly identified structural templates. x-axis: E-value threshold (-log(E)), 2 to 10. Left y-axis: hits with sequence alignment : hits without, 0 to 5. Right y-axis: total unique hits, 0 to 160. Series: fold-increase in # of templates, fold-increase in # of genes, total hits.]

Remote homology detection

Optimizing structural diversity

[Figure: x-axis: RMSD of structural ensemble (Angstroms), 0 to 2. Left y-axis: (%), 0 to 80. Right y-axis: sequence entropy, 0 to 6. Series: sequence entropy, prediction accuracy, prediction coverage, mean pairwise %ID, mean native %ID.]
