Protein folding - Dalhousie University

What would happen?

+ -

Lecture Plan

Molecules spend more time in low energy conformations

Predicting a structure is approximated as the search for the lowest energy conformation on a protein model

Modeling requires data, a molecular model and a strategy to explore protein conformations

All practical methods are heuristics, which means that they aren't guaranteed to find THE lowest energy conformation

Bigger, faster computers will not get us out of this situation

What is a model?

Model: a hypothetical description of a complex entity or process.

wordnet.princeton.edu/perl/webwn

Sir John Kendrew and his model of myoglobin, 1958

To model a protein we need?

A description of the system.

(atom coordinates)

An abstraction of what is going on.

A way to know whether we are going anywhere.

(Force field)

A process to find the ‘best’ protein structure.

(optimization)

Force fields

A set of mathematical functions and parameters to model each relevant “force” (VdW, etc.).

The energy of 1 atom is the sum of all these terms.

The energy of a protein is the sum of energies of all atoms.

Force fields are mathematical expressions that include all known important factors for molecular interactions.

The natural choice is to use the chemical energy as a score and try to find the structure with the lowest free energy.

It doesn’t work with free energies

$G = H - TS$

Free energy is temperature-dependent.

Entropic contribution cannot be calculated from a snapshot.

Something else must be used.

Modeling all relevant energy terms.

Energy function

A sum of terms that approximate the contributions of known theoretical microscopic forces.

$E_{FF} = E_{str} + E_{bend} + E_{tors} + E_{VdW} + E_{el} + E_{cross}$

Modeling bond stretch/compression

In the case of bond stretching/compression, we need to measure the distance r between two atoms, and get from the force field what should be the optimal distance ro for a given pair of atoms.

Modeling bond stretch/compression

Realistic models of bond stretch-compression are computationally expensive.

This figure shows how simpler models fare at modeling bond stretching.

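P2: $E_{str}(R_{ab}) = k_2^{ab} (R_{ab} - R_0^{ab})^2$

P4: $E_{str}(R_{ab}) = k_2^{ab} (R_{ab} - R_0^{ab})^2 + k_3^{ab} (R_{ab} - R_0^{ab})^3 + k_4^{ab} (R_{ab} - R_0^{ab})^4$

Morse: $E_{str}(R_{ab}) = D \left[ 1 - e^{-\alpha (R_{ab} - R_0^{ab})} \right]^2$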
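As a rough illustration of how the simpler models fare, the sketch below compares a quadratic (P2), a quartic (P4) and a Morse stretch term. The values of `D`, `ALPHA` and `R0` are assumptions chosen for illustration, not parameters from any real force field. The polynomial constants are taken from the Taylor expansion of the Morse curve around `R0`, so that all three agree near the optimal distance.

```python
import math

# Illustrative (assumed) parameters, not taken from any real force field.
D = 400.0      # well depth, kJ/mol
ALPHA = 2.0    # Morse width parameter, 1/angstrom
R0 = 1.0       # optimal bond length, angstrom

# Polynomial constants from the Taylor expansion of the Morse curve at R0.
K2 = D * ALPHA**2
K3 = -D * ALPHA**3
K4 = 7.0 / 12.0 * D * ALPHA**4

def morse(r):
    """Morse stretch energy: levels off at D when the bond breaks."""
    return D * (1.0 - math.exp(-ALPHA * (r - R0))) ** 2

def p2(r):
    """Quadratic (harmonic) stretch energy."""
    dr = r - R0
    return K2 * dr**2

def p4(r):
    """Quartic polynomial stretch energy."""
    dr = r - R0
    return K2 * dr**2 + K3 * dr**3 + K4 * dr**4

for r in (0.9, 1.0, 1.05, 1.5, 3.0):
    print(f"R={r:4.2f}  P2={p2(r):9.2f}  P4={p4(r):9.2f}  Morse={morse(r):9.2f}")
```

Close to `R0` the cheap polynomials track the Morse curve well; far from it they keep rising while the Morse energy levels off at `D`.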

Three factors to consider

Parameterization nightmare

Can someone come up with all these numbers?

Generalization

How robust is the simulation over a range of conditions?

Efficiency

The longer it takes to perform a single task, the fewer iterations will be computed in the same amount of time.

Modeling VdW

Lennard-Jones

The Lennard-Jones form is actually a computational shortcut: only even powers of R appear, so there is no need to compute R itself (which requires a square root); R^2 can be used directly.

$R_{ij} = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2 + (z_i - z_j)^2}$

$E_{Exp6}(R) = A e^{-BR} - \frac{C}{R^6}$

$E_{VdW}(R) = \left[ \left( \frac{R_0}{R} \right)^{12} - \left( \frac{R_0}{R} \right)^{6} \right]$
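The shortcut can be sketched as follows: because the Lennard-Jones term only needs even powers of R, it can be evaluated from the squared distance without ever taking a square root. The value of `r0` and the coordinates are assumptions for illustration, and the energy is written without a well-depth prefactor, as in the expression above.

```python
def squared_distance(a, b):
    """Squared distance between two (x, y, z) points; no square root needed."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def lj_energy(a, b, r0=3.5):
    """Lennard-Jones term (R0/R)^12 - (R0/R)^6 computed from R^2 only.

    r0 is an assumed, illustrative distance parameter in angstroms.
    """
    r2 = squared_distance(a, b)
    s6 = (r0 * r0 / r2) ** 3      # (R0/R)^6 obtained from squared quantities
    return s6 * s6 - s6           # (R0/R)^12 - (R0/R)^6

origin = (0.0, 0.0, 0.0)
for x in (3.0, 3.5, 4.0, 8.0):
    print(x, lj_energy(origin, (x, 0.0, 0.0)))
```

The energy is repulsive (positive) at short range, zero at `r0`, attractive beyond it, and fades quickly with distance.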

Modeling Electrostatic interactions

Modeling electrostatic interaction is critical in many situations.

Why?

Electrostatic interaction energies decay as 1/distance, which makes them the longest-ranged interactions.

Coulomb’s Law

$E_{el}(R_{AB}) = \frac{Q_A Q_B}{\varepsilon R_{AB}}$

Non-bond interactions create a computational bottleneck

Computational cost of non-bonded energy (VdW, El)

~99.88% of computation in protein-sized models. Most of these pairwise terms are very small and do not contribute significantly to the total energy.

The number of non-bonded interactions increases with the square of the number of atoms, while the number of bonded interactions increases roughly linearly.
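The quadratic growth is easy to check with a small count. The sketch below counts every atom pair, which is an upper bound on the non-bonded pairs since a real force field excludes directly bonded neighbours. The 304-atom size is that of the TC5b miniprotein discussed later; the larger sizes are hypothetical.

```python
def atom_pairs(n_atoms):
    """Number of distinct atom pairs, N(N-1)/2 (upper bound on non-bonded pairs)."""
    return n_atoms * (n_atoms - 1) // 2

for n in (304, 3_040, 30_400):
    print(f"{n:6d} atoms -> {atom_pairs(n):12d} pairs")
```

Multiplying the number of atoms by 10 multiplies the number of pairs by roughly 100.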

Hydrophobic forces

This is not an explicit term.

Hydrophobic interactions are due to the difference in free energy between water molecules interacting with polar and with non-polar side chains.

The effect is thus intrinsic to the computations of electrostatic forces.

A process to find the best possible conformation

Force fields provide all the functions and parameters to compute the energy of 1 structure.

Finding the best structure comes down to finding the structure with the lowest energy.

Need a process to change the conformation toward a better structure: optimization.

The optimization is iterated until it is reasonable to believe that no better structure can be found.

Principle of optimization

You start with a protein for which you know all coordinates.

Evaluate the energy

Find a better structure, usually with small changes

Repeat until no better structure can be found.

This task is almost NEVER straightforward, unless the system is made of a small number of atoms.

Minimizing

The gradient method

For each atom:

1. Compute the force vectors for each term
2. Sum all vectors
3. Move the atom over a small distance along its resultant vector.

Repeat until all resultant force vectors are of length 0.
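The steps above can be sketched for a toy system. This is a minimal illustration, assuming a hypothetical three-atom chain held together only by quadratic bond-stretch terms; the constants `K` and `R0`, the step size and the stopping tolerance are all arbitrary choices. In practice the loop stops when the largest force falls below a tolerance rather than at exactly zero.

```python
import math

K, R0 = 100.0, 1.5           # assumed stretch constant and optimal bond length
BONDS = [(0, 1), (1, 2)]     # a hypothetical three-atom chain

def forces(coords):
    """Resultant force vector on each atom from E = K (r - R0)^2 per bond."""
    f = [[0.0, 0.0, 0.0] for _ in coords]
    for i, j in BONDS:
        r = math.dist(coords[i], coords[j])
        mag = -2.0 * K * (r - R0) / r          # -dE/dr, scaled for the unit vector
        for k in range(3):
            d = coords[i][k] - coords[j][k]
            f[i][k] += mag * d                 # force on atom i
            f[j][k] -= mag * d                 # equal and opposite on atom j
    return f

def minimize(coords, step=1e-3, tol=1e-6, max_iter=100_000):
    """Move every atom a small distance along its resultant force until forces vanish."""
    coords = [list(c) for c in coords]
    for _ in range(max_iter):
        f = forces(coords)
        if max(math.hypot(*fi) for fi in f) < tol:
            break
        for ci, fi in zip(coords, f):
            for k in range(3):
                ci[k] += step * fi[k]
    return coords

start = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 2.0, 0.0)]
final = minimize(start)
print([round(math.dist(final[i], final[j]), 4) for i, j in BONDS])
```

Both bonds relax to the optimal length. On a real protein the same loop only finds the nearest local minimum, which is why more elaborate search strategies are needed.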

Optimizing simple functions

[Figure: energy as a function of R, with a single minimum at Ro.]

Optimizing more complex functions

[Figure: energy as a function of conformation, with the minimum marked.]

Molecular Simulations

Time dependent methods (Molecular Dynamics)

Make use of classical mechanics equations such as:

$F = ma$

Each atom gets a random kinetic energy vector which is added to the resultant force vector. This simulates thermal motion.

Verlet Algorithm

Numerical solution to Newton’s equations

$r_{i+1} = r_i + v_i \Delta t + \frac{1}{2} a_i \Delta t^2$

$r_{i-1} = r_i - v_i \Delta t + \frac{1}{2} a_i \Delta t^2$

Adding the two expansions cancels the velocity term:

$r_{i+1} = 2 r_i - r_{i-1} + a_i \Delta t^2$

a can be computed from F = ma
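A minimal sketch of the Verlet update, assuming a one-dimensional particle on a spring with unit mass and unit force constant (so a = -r). The test system and the time step are illustrative choices; the first step is bootstrapped from the initial velocity.

```python
import math

def verlet(r0, v0, accel, dt, n_steps):
    """Integrate positions with r[i+1] = 2 r[i] - r[i-1] + a[i] dt^2."""
    # Bootstrap the second point from the initial position and velocity.
    traj = [r0, r0 + v0 * dt + 0.5 * accel(r0) * dt * dt]
    for _ in range(n_steps - 1):
        r_prev, r = traj[-2], traj[-1]
        traj.append(2.0 * r - r_prev + accel(r) * dt * dt)
    return traj

def spring(r):
    """Acceleration for an assumed unit mass on a unit spring: a = F/m = -r."""
    return -r

dt = 0.01
traj = verlet(r0=1.0, v0=0.0, accel=spring, dt=dt, n_steps=628)
t_end = dt * 628
print(traj[-1], math.cos(t_end))   # numerical versus exact position
```

Velocities never appear in the update loop, which is what makes the scheme cheap. The exact solution cos(t) is used only to check the result.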


Timesteps

Reasonable $\Delta t$: femtoseconds ($10^{-15}$ s)

Scope of simulation (ideal): milliseconds ($10^{-3}$ s)

Scope of simulation (practical): microseconds ($10^{-6}$ s)

To simulate a microsecond, it takes a very, very long time…

To simulate slow processes, the time scale isn’t realistic.
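The arithmetic behind this can be sketched in a few lines. The step counts follow directly from the time scales above; the cost of 10 ms of wall-clock time per step is purely an assumed figure for illustration, since the real cost depends on the system size and the hardware.

```python
SECONDS_PER_DAY = 86_400
MICROSECOND_FS = 10**9     # one microsecond expressed in femtoseconds
MILLISECOND_FS = 10**12    # one millisecond expressed in femtoseconds

def n_steps(simulated_fs, dt_fs=1):
    """Number of integration steps needed to cover simulated_fs femtoseconds."""
    return simulated_fs // dt_fs

def wall_days(steps, seconds_per_step):
    """Wall-clock days for a given number of steps at an assumed cost per step."""
    return steps * seconds_per_step / SECONDS_PER_DAY

for label, span in (("microsecond", MICROSECOND_FS), ("millisecond", MILLISECOND_FS)):
    steps = n_steps(span)
    days = wall_days(steps, seconds_per_step=0.01)   # assumed 10 ms per step
    print(f"{label}: {steps:.0e} steps, about {days:,.0f} days")
```

With a 1 fs time step, a microsecond already requires a billion force evaluations, and a millisecond a thousand times more.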

Other Optimization strategies

Simulated Annealing

Scaling down the energy landscape makes the crossing of barriers more probable.

Time to do so is in short supply, however!
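A minimal sketch of the idea, assuming a Metropolis acceptance rule and a geometric cooling schedule (neither is specified in the lecture) on a hypothetical one-dimensional landscape with two wells. Dividing the energy difference by a high temperature is what scales the landscape down: uphill moves over the barrier are accepted often at first and rarely once the system has cooled.

```python
import math
import random

def energy(x):
    """Hypothetical landscape: local minimum near x = +1, global minimum near x = -1."""
    return (x * x - 1.0) ** 2 + 0.3 * x

def anneal(x, t_start=2.0, t_end=0.01, n_steps=20_000, seed=0):
    """Simulated annealing with Metropolis acceptance; returns the best x visited."""
    rng = random.Random(seed)
    cooling = (t_end / t_start) ** (1.0 / n_steps)   # geometric schedule (assumed)
    t = t_start
    best = x
    for _ in range(n_steps):
        trial = x + rng.uniform(-0.5, 0.5)
        delta = energy(trial) - energy(x)
        # Always accept downhill moves; accept uphill moves with probability exp(-delta/t).
        if delta <= 0.0 or rng.random() < math.exp(-delta / t):
            x = trial
            if energy(x) < energy(best):
                best = x
        t *= cooling
    return best

best = anneal(1.0)    # start in the higher-energy well
print(best, energy(best))
```

Starting in the wrong well, the search crosses the barrier while the temperature is high and settles in the deeper well as it cools. How slowly to cool is the practical limit.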

Why model proteins?

Anticodon binding site on eRF3

2 possibilities.

From phylogenetic information, a few residues were identified as players.

Use molecular mechanics to “see” whether the surface of the protein can accommodate an anti-codon.

Inagaki, Y., Blouin, C., Doolittle, W.F., and Roger, A.J. 2002. Convergence and constraint in eukaryotic release factor 1 (eRF1) domain 1: the evolution of stop codon specificity. Nucleic Acids Res 30: 532-544.

Why model proteins?

Modeling a weird substrate into an active site.

Mandelate racemase can bind a substrate with two rings! Is there room for this in the wild type active site?

The answer is yes, although a bit counter-intuitive.

Siddiqi, F, Bourque, J., Jiang, H., Gardner, M., St. Maurice, M., Blouin, C., and Bearne S.L., Perturbing the Hydrophobic Pocket of Mandelate Racemase to Probe Phenyl Motion During Catalysis. Biochemistry 44(25):9013-21

Folding a polypeptide isn’t expected to work out as well because…

Empirical models are parameterized with pre-folded proteins.

The role of water in partially folded proteins is significant.

Time scale for folding a protein is still a bit out of range for simulation.

Assistance in folding, whether from chaperones or from other monomers, isn’t there.

Folding process is seeded by the chain extension during translation.

Folding of a peptide occurs on a time scale COMPLETELY beyond what we can get away with today.

Protein folding from Scratch

Must be restrained to a limited scope

Two miniproteins: TC5b and TC3b

Both have reference structures for validation.

Sequences

NLYIQWLKDGGPSSGRPPPS (TC5b; 304 atoms)

NLFIEWLKNGGPSSGAPPPS (TC3b; 289 atoms)

Software: AMBER 6.0

Model: AMBER

Solvation: Generalized Born/solvent-accessible surface area

This means that the water molecules are not explicitly defined in the simulation and the effect of the solvent is treated as a macro property.


Must be restrained to a limited scope

Understanding folding and design: Replica-exchange simulation of “Trp-Cage” miniproteins.

Pitera, JW., Swope, W. 2003. Proc. Natl. Acad. Sci. USA, 100: 7587-7592

The GIST

Run 23 simulations (4 ns each) in parallel, each at a different temperature from hot to cold. Every 5 ps, redistribute the best conformation to the coldest simulation.

This is much more effective at exploring solutions than a single simulation of 23 × 4 ns.


Impact

RED is the reference, GREEN is the computed model.

Energy barriers are not as high in small, isolated structures.

It is reasonable to limit the scope of these simulations to protein domains.


Validation

The root mean square deviation RMSD

$RMSD = \sqrt{ \frac{ \sum_{i=1}^{n} w_i \, \lVert v_i - v_{ref,i} \rVert^2 }{ \sum_{i=1}^{n} w_i } }$

≤ 2.0 RMSD from any of the snapshots

≤ 2.0 RMSD from the average snapshot.
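As a rough sketch, a weighted RMSD between a model and a reference can be computed as below. The function name and the coordinates are hypothetical, and the two structures are assumed to be already superimposed with atoms listed in the same order; a real comparison would first fit one structure onto the other.

```python
import math

def rmsd(coords, ref, weights=None):
    """Weighted root mean square deviation between two lists of (x, y, z) points."""
    if len(coords) != len(ref):
        raise ValueError("both structures must have the same number of atoms")
    if weights is None:
        weights = [1.0] * len(coords)          # unweighted by default
    total = sum(w * math.dist(a, b) ** 2 for w, a, b in zip(weights, coords, ref))
    return math.sqrt(total / sum(weights))

reference = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0)]
model = [(x + 3.0, y + 4.0, z) for x, y, z in reference]   # every atom displaced by 5

print(rmsd(reference, reference))
print(rmsd(model, reference))
```

Identical structures give 0, and displacing every atom by the same distance gives an RMSD equal to that distance.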

Why multiple temperature sampling works

Total energy remains constant:

$E_T = E_{FF} + E_k$

So, at higher temperatures, higher energy structures are more often observed.

This means more alternative conformations are sampled in the same amount of time.

Each of these conformational groups gets in turn refined at lower temperature.

Statistically, most of the simulation time of the coldest chain will be spent in the energy well with the lowest energy.

Building a large machine for protein folding

IBM Blue Gene project (65K processors, never enough, however)

High-performance achievement in MD: NAMD

Open source

University of Illinois, Dept. of theoretical physics

http://www.ks.uiuc.edu/Research/namd/

Benchmark system

(their big one)

http://www.ks.uiuc.edu/Research/namd/

Parallel computing and Molecular dynamics

Folding proteins from an extended conformation is a difficult problem because of the crossing of energy barriers.

The following slides describe the limitations of simulating the crossing of energy barriers using “massively” parallel techniques.

Limitations of Parallel computing

It takes 1500 days to complete a thesis for one student

If the student is helped by someone, the work may go 2X as fast: 750 days.

What if 1500 students are working on the same thesis?

Overhead

Communication

Load balancing

Parallel computing

Factors that complicate parallelization:

Some work has to be executed in sequence.

Communicating the task and the results takes up an increasingly large share of the time as the tasks become smaller.

Each individual process has to wait for the slowest one to finish, leading to a loss of efficiency.

It doesn't make sense to have many more CPUs than atoms in the system!
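The first factor can be quantified with Amdahl's law, which the lecture does not name but which captures the thesis analogy: if a fraction of the work must run in sequence, adding workers only shrinks the remainder. The 5% serial fraction below is an assumed figure for illustration, and the sketch ignores communication and load-balancing costs, which make things worse.

```python
def speedup(serial_fraction, n_workers):
    """Amdahl's law: speedup when only the non-serial part is shared out."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_workers)

THESIS_DAYS = 1500
SERIAL_FRACTION = 0.05      # assumed: 5% of the work cannot be shared

for workers in (1, 2, 10, 1500):
    days = THESIS_DAYS / speedup(SERIAL_FRACTION, workers)
    print(f"{workers:5d} workers -> {days:7.1f} days")
```

Under this assumption even 1500 students cannot finish in fewer than 75 days, a speedup of less than 20.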

Time scale in protein folding

On the order of microseconds to milliseconds

This is not achievable by modern computers.

~10 000 days for 1 experiment (~27 years)

folding@home

Using unspent cycles from idle hardware

(PS3, Xboxes, PC (Screensaver) )

Crossing energy barrier

Most of the time is spent waiting for the thermal motion to topple a structure over a barrier.

Principle of Ensemble dynamics

M CPUs should take M times less time to go over a barrier.

For breaking 3-10 H-bonds (~22.3 kJ/mol)

If an event occurs on average every 10,000 ns, the chance to witness this event during a 30 ns simulation is 0.3%.

If the same simulation runs on 10,000 machines, one expects to observe the event ~30 times over 30 ns.
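These numbers can be reproduced with a short calculation. The expected count is simple proportionality; the probability of seeing at least one crossing additionally assumes that crossings are independent rare events (a Poisson process), which is an assumption not stated in the lecture.

```python
import math

def expected_events(mean_wait_ns, run_ns, n_machines=1):
    """Expected number of barrier crossings across all machines."""
    return n_machines * run_ns / mean_wait_ns

def prob_at_least_one(mean_wait_ns, run_ns, n_machines=1):
    """Chance of at least one crossing, assuming independent rare events."""
    return 1.0 - math.exp(-expected_events(mean_wait_ns, run_ns, n_machines))

print(expected_events(10_000, 30))               # one machine, 30 ns
print(expected_events(10_000, 30, 10_000))       # 10,000 machines, 30 ns each
print(prob_at_least_one(10_000, 30))
print(prob_at_least_one(10_000, 30, 10_000))
```

A single 30 ns run almost never sees the event, while 10,000 runs in parallel are practically certain to see it many times.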

Ensemble Dynamics

Start M dynamic calculations with the same initial structure.

Once 1 thread finds a barrier and goes over it, copy the state of this thread into all other M-1 replicate processes.

The communication overhead is negligible if the crossings are rare events, which is the case.

Villin's headpiece

Note how most of the interactions in the partially folded protein are non-native.

This means that in order to resume folding, these must be broken.

The Villin headpiece is one of the fastest (known) folding peptides! What about simulating anything else?

Energy Landscape

Observe that in this figure there are:

• One folding pathway
• One intermediate
• Two energy barriers

Progress in last 2 years

Rates of protein folding appear to be correctly predicted using ensemble dynamics.

Progress on large systems

Multiple simulations of SNARE-mediated vesicle fusion have been published.

Ensemble molecular dynamics yields submillisecond kinetics and intermediates of membrane fusion.

Kasson, P.M., Kelley, N.W., Singhal, N., Vrljic, M., Brunger, A.T., and Pande, V.S. 2006. Proc. Natl. Acad. Sci. USA, 103(32): 11916-11921

Summary

Biological models are assumed to have the lowest energy.

Optimization is used to find the lowest energy structures, and thus the biologically relevant conformation.

Simulation time is the bottleneck. The more you sample, the more likely that the solution will be good.

There is some progress in solving protein folding using heuristics and parallel computing, but the solution depends on theoretical breakthroughs, not the addition of hardware.
