PROTEIN FOLDING - UC Homepageshomepages.uc.edu/~montazmd/Notes/ProteinFolding.pdf · Protein...

PROTEIN FOLDING: HP Lattice Model

By: Mohammad Montazeri

Supervisor: Professor Frank Pinski

Department of Physics, University of Cincinnati,

May 2007


Table of Contents:

1. Introduction
2. Protein Folding
3. Mechanism of Protein Folding
4. H-P Lattice Model
5. Our Problem
6. Conclusion
References


1. Introduction

The word protein comes from the Greek πρώτα ("prota"), meaning "of primary importance". These molecules were first described and named by Jöns Jakob Berzelius in 1838. However, the central role of proteins in living organisms was not fully appreciated until 1926, when James B. Sumner showed that the enzyme urease was a protein. The first protein to be sequenced was insulin, by Frederick Sanger, who won the Nobel Prize for this achievement in 1958. The first protein structures to be solved included hemoglobin and myoglobin, by Max Perutz and Sir John Cowdery Kendrew, respectively, in 1958. Both three-dimensional structures were first determined by X-ray diffraction analysis, and the structures of myoglobin and hemoglobin won the 1962 Nobel Prize in Chemistry for their discoverers.

Proteins are large organic compounds made of amino acids arranged in a linear chain and

joined together by peptide bonds between the carboxyl and amino groups of adjacent

amino acid residues. The sequence of amino acids in a protein is defined by a gene and

encoded in the genetic code. Although this genetic code specifies 20 "standard" amino

acids, the residues in a protein are often chemically altered in post-translational

modification: either before the protein can function in the cell, or as part of control

mechanisms. Proteins can also work together to achieve a particular function, and they

often associate to form stable complexes.


Proteins are essential parts of all living organisms and participate in every process within

cells. Many proteins are enzymes that catalyze biochemical reactions, and are vital to

metabolism. Other proteins have structural or mechanical functions, such as the proteins

in the cytoskeleton, which forms a system of scaffolding that maintains cell shape.

Proteins are also important in cell signaling, immune responses, cell adhesion, and the

cell cycle. Protein is also a necessary component in our diet, since animals cannot

synthesize all the amino acids and must obtain essential amino acids from food.

Functional proteins range in size from about 40 to 50 amino acids up to several thousand residues. The human body makes at least 50,000 different types of proteins, each with a different function.

Depending on the scale at which a protein is examined, its structure can be described at four levels:

I. Primary Structure: The primary structure is simply the sequence of amino acids in the protein (see figure 1). This structure forms the backbone of the protein's shape.

II. Secondary Structure: The secondary structure consists of regularly repeating local structures stabilized by hydrogen bonds. The most common examples are the alpha helix and the beta sheet (see figure 2). Because secondary structures are local, many regions of different secondary structure can be present in the same protein molecule.


III. Tertiary Structure or Folded State: The folded state is the overall shape of a single

protein molecule; the spatial relationship of the secondary structures to one another (see

figure 3). Tertiary structure is generally stabilized by non-local interactions, most

commonly the formation of a hydrophobic core, but also through salt bridges, hydrogen

bonds, disulfide bonds, and even post-translational modifications. The term "tertiary

structure" is often used as synonymous with the term fold.

IV. Quaternary Structure: Quaternary structure is the shape or structure that results from

the interaction of more than one protein molecule, usually called protein subunits in this

context, which function as part of the larger assembly or protein complex.

In the next section we focus on the process of formation of folded structure, namely

protein folding.


Figure 1: Primary Structure

Figure 2: Alpha Helices in Secondary Structure


Figure 3: Folded Structure (Simulation Results)


2. Protein Folding

Protein folding is the physical process by which a protein folds into its characteristic three-dimensional structure. Each protein begins as a polypeptide, translated from a sequence of mRNA as a linear chain of amino acids. This polypeptide lacks any developed three-dimensional structure. However, each amino acid in the chain can be thought of as having certain 'gross' chemical features; it may be hydrophobic, hydrophilic, or electrically charged, for example. The residues interact with each other and with their surroundings in the cell to produce a well-defined three-dimensional shape, the folded protein, known as the native state. The resulting three-dimensional structure is determined by the sequence of the amino acids.

Experimentally determining the three-dimensional structure of a protein is often very difficult and expensive, whereas the sequence of the protein is often known. Scientists have therefore tried to use different biophysical techniques to "manually" fold a protein, that is, to predict the complete structure of the protein from its sequence.

In certain solutions and under some conditions proteins will not fold into their biologically "functional" forms. Temperatures above the range that cells tend to live in will cause proteins to unfold or "denature" (this is why boiling makes the white of an egg opaque). High concentrations of solutes and extremes of pH can do the same. A fully denatured protein lacks both tertiary and secondary structure, and exists as a so-called random coil. Cells sometimes protect their proteins against the denaturing influence of heat with enzymes known as chaperones or heat shock proteins, which assist other proteins both in folding and in remaining folded. Some proteins never fold in cells at all except with the assistance of chaperone molecules, which either isolate individual proteins so that their folding is not interrupted by interactions with other proteins, or help to unfold misfolded proteins, giving them a second chance to refold properly.

For many proteins the correct three-dimensional structure is essential for the protein to function correctly. Thus, failure to fold usually produces inactive proteins with different properties. Several diseases, such as Alzheimer's disease, cystic fibrosis and BSE, are believed to result from the accumulation of misfolded proteins. These diseases are associated with the aggregation of misfolded proteins into insoluble plaques; it is not known whether the plaques are the cause or merely a symptom of illness.


3. Mechanism of Protein Folding

The mechanism of protein folding is not well understood. It is generally accepted that the

folding process is dominated by hydrophobic residues and their interaction with the

solvent and other residues.

Hydrophobic molecules tend to be non-polar and thus prefer other neutral molecules and

nonpolar solvents. Hydrophobic molecules in water often cluster together. Water on

hydrophobic surfaces will exhibit a high contact angle. Examples of hydrophobic

molecules include the alkanes, oils, fats, and greasy substances in general. Hydrophobic

materials are used for oil removal from water, the management of oil spills, and chemical

separation processes to remove non-polar from polar compounds.

On the other hand, a hydrophilic molecule or portion of a molecule is one that is typically

charge-polarized and capable of hydrogen bonding, enabling it to dissolve more readily

in water than in oil or other hydrophobic solvents. Hydrophilic and hydrophobic

molecules are also known as polar molecules and nonpolar molecules, respectively. Soap

has a hydrophilic head and a hydrophobic tail, which allows it to dissolve in both water

and oil and therefore to clean a surface.

In a protein sequence, some residues are hydrophobic and some are hydrophilic. When the protein is in a solvent (usually water, whose molecules are polar), it deforms in such a way as to minimize the contact between its hydrophobic residues and the water molecules or, equivalently, to maximize the contact among its own hydrophobic residues. This is understood to be the dominant interaction in the folding process.

One can define a free energy for the protein that takes into account the dominance of the hydrophobic interaction in the folding process. In the native state, where the free energy is minimized, the number of hydrophobic contacts is maximized. As an unfolded protein starts to fold, it passes through various geometrical conformations, each of which corresponds to an energy level. Hence one can define the energy landscape (see figure 4). Each point on the energy landscape corresponds to one geometrical conformation of the protein. In this view, the folding process is understood as a path on the energy landscape toward the native state, which is located at the lowest level of the landscape.

The energy landscape theory was formulated by Joseph Bryngelson and Peter Wolynes in

the late 1980's and early 1990's. This approach introduced the principle of minimal

frustration, which asserts that evolution has selected the amino acid sequences of natural

proteins so that interactions between side chains largely favor the molecule's acquisition

of the folded state. Interactions that do not favor folding are selected against, although

some residual frustration is expected to exist. A consequence of these evolutionarily

selected sequences is that proteins are generally thought to have globally "funneled

energy landscapes" (coined by José Onuchic) that are largely directed towards the native

state (see figure 5). This "folding funnel" landscape allows the protein to fold to the native

state through any of a large number of pathways and intermediates, rather than being

restricted to a single mechanism. The theory is supported by computational simulations


of model proteins and has been used to improve methods for protein structure prediction

and design.

For most sequences, the energy landscape does not have a funnel-like form but is rough (see figure 6). In this case, if the protein follows the path of decreasing energy, it may become trapped in a region that is not the native state but a local minimum. Such a region is called a kinetic trap. To escape, the protein must climb out of the local minimum; in other words, it must unfold enough to get past the kinetic trap.

These minima can be deep enough that even thermal fluctuations cannot help the protein continue along its path toward the native state. Kinetic traps could therefore be one of the reasons for protein misfolding or for delays in the folding process. It turns out that on its way toward the native state, the protein makes a large number of mistakes (follows wrong paths). These mistakes arise not only from kinetic traps but also from phenomena such as uphill steps, thermal motions, and revisiting of earlier conformations.
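As a rough illustration of why deep traps matter, the rate of thermally escaping a trap can be estimated with a standard Arrhenius-type expression. This is a textbook estimate and not a result of this report; the symbols are assumptions introduced here: $\Delta E$ is the height of the barrier around the trap, $k_0$ is an attempt frequency, $k_B$ is Boltzmann's constant and $T$ is the temperature.

```latex
k_{\text{escape}} \approx k_0 \, \exp\!\left(-\frac{\Delta E}{k_B T}\right)
```

When $\Delta E \gg k_B T$ the exponential factor becomes very small, which is the sense in which thermal fluctuations alone may fail to free the protein from the trap.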

In general, the folding process is understood as follows: in the very first steps, the unfolded protein collapses in a very short time to a compact conformation with a hydrophobic core, and it then proceeds toward the native state through internal rearrangements involving both folding and unfolding steps. Most of the mistakes mentioned above are made after the collapse, during this internal rearrangement.


Figure 4: Energy Landscape

Figure 5: Funnel-Like Energy Landscape


Figure 6: Roughness on Energy Landscape and Kinetic Traps


4. H-P Lattice Model

The hydrophobic-polar protein folding model is a highly simplified model for examining

protein folds in space. First proposed by Dill in 1985, it is motivated by the observation

that hydrophobic interactions between amino acid residues are the driving force for

proteins folding into their native state. All amino acid types are classified as either

hydrophobic (H) or polar (P), and the folding of a protein sequence is defined as a self-

avoiding walk in a 2D or 3D lattice. The HP model imitates the hydrophobic effect by

assigning a negative (favorable) weight to interactions between adjacent, non-covalently

bound H residues. Proteins that have minimum energy are assumed to be in their native

state (see figure 7).
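The energy assignment described above can be written compactly. The following is a sketch of the commonly used form of the HP energy; the symbols $\epsilon$, $h_i$, $\mathbf{r}_i$ and $\Delta$ are notation assumed here and do not appear in the original text.

```latex
E = -\epsilon \sum_{\substack{i < j \\ j - i > 1}} h_i \, h_j \, \Delta(\mathbf{r}_i, \mathbf{r}_j),
\qquad
h_i =
\begin{cases}
1 & \text{if residue } i \text{ is H} \\
0 & \text{if residue } i \text{ is P}
\end{cases}
```

Here $\epsilon > 0$ sets the energy scale, $\mathbf{r}_i$ is the lattice site of residue $i$, and $\Delta(\mathbf{r}_i, \mathbf{r}_j)$ is 1 when the two sites are nearest neighbors on the lattice and 0 otherwise. The condition $j - i > 1$ excludes covalently bonded neighbors along the chain. With this form, minimizing $E$ is equivalent to maximizing the number of HH contacts.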

The HP model can be expressed in both two and three dimensions, generally with square

lattices, although triangular lattices have been used as well.

Randomized search algorithms are often used to tackle the HP folding problem. This

includes stochastic, evolutionary algorithms like the Monte Carlo method, genetic

algorithms, and ant colony optimization. While no method has been able to calculate the

experimentally determined minimum energetic state for long protein sequences, the most

advanced methods today are able to come close.

Even though the HP model abstracts away many of the details of protein folding, it can still capture the general behavior of protein folding.


Figure 7: HP Lattice Model on a Square Lattice


5. Our Problem

In this project, we developed a code to find the native state of an arbitrary sequence using the HP lattice model. Given a specific sequence, the code generates a large number of conformations. These conformations are generated by a self-avoiding random walk algorithm, in which each walk visits as many lattice sites as the protein has residues. Afterward, the code counts the number of HH contacts in each generated conformation. Searching among the conformations, the code identifies the one with the maximum number of HH contacts as the native state of the sequence.
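The generation step can be sketched as follows. This is a minimal Python illustration, not the Mathematica code of this report; the names `MOVES` and `self_avoiding_walk` are hypothetical. It assumes a two-dimensional square lattice, picks uniformly among the free neighboring sites at each step, and restarts the walk from scratch when it reaches a dead end.

```python
import random

# The four nearest-neighbor moves on a square lattice: up, down, left, right.
MOVES = [(0, 1), (0, -1), (-1, 0), (1, 0)]

def self_avoiding_walk(n_residues, rng=random):
    """Return a list of n_residues distinct (x, y) lattice sites.

    The walk starts at the origin and restarts from scratch whenever
    all four neighbors of the current site are already occupied.
    """
    while True:
        path = [(0, 0)]
        occupied = {(0, 0)}
        while len(path) < n_residues:
            x, y = path[-1]
            free = [(x + dx, y + dy) for dx, dy in MOVES
                    if (x + dx, y + dy) not in occupied]
            if not free:  # dead end: discard this attempt
                break
            site = rng.choice(free)
            path.append(site)
            occupied.add(site)
        if len(path) == n_residues:
            return path
```

Calling `self_avoiding_walk(13)` returns one candidate conformation for a 13-residue chain. Restarting on a dead end keeps the procedure simple, at the cost of wasted work for long chains.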

The algorithm of our code is as follows:


Start
1. Given sequence.
2. Using a self-avoiding random walk, generate many conformations.
3. Count the number of HH contacts in each conformation.
4. Find the conformation with the greatest number of HH contacts.
5. Native state.
End
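The steps above can be sketched end to end in a few lines. The following is a hedged Python illustration rather than the Mathematica code of this report; all function names are hypothetical, the sequence is assumed to be a string of the letters H and P, and the lattice is assumed to be a two-dimensional square lattice.

```python
import random

MOVES = [(0, 1), (0, -1), (-1, 0), (1, 0)]

def random_conformation(n, rng):
    """Self-avoiding random walk of n sites; restarts on a dead end."""
    while True:
        path, occupied = [(0, 0)], {(0, 0)}
        while len(path) < n:
            x, y = path[-1]
            free = [(x + dx, y + dy) for dx, dy in MOVES
                    if (x + dx, y + dy) not in occupied]
            if not free:
                break
            path.append(rng.choice(free))
            occupied.add(path[-1])
        if len(path) == n:
            return path

def count_hh_contacts(seq, path):
    """Count pairs of H residues on adjacent sites that are not chain neighbors."""
    index_at = {site: i for i, site in enumerate(path)}
    contacts = 0
    for i, (x, y) in enumerate(path):
        if seq[i] != "H":
            continue
        # Looking only right and up visits each adjacent pair exactly once.
        for neighbor in ((x + 1, y), (x, y + 1)):
            j = index_at.get(neighbor)
            if j is not None and seq[j] == "H" and abs(i - j) > 1:
                contacts += 1
    return contacts

def search_native_state(seq, tries, seed=None):
    """Generate `tries` random conformations and keep the one with most HH contacts."""
    rng = random.Random(seed)
    best_path, best_contacts = None, -1
    for _ in range(tries):
        path = random_conformation(len(seq), rng)
        contacts = count_hh_contacts(seq, path)
        if contacts > best_contacts:
            best_path, best_contacts = path, contacts
    return best_path, best_contacts
```

A call such as `search_native_state("PHPHHPHHHHHHH", 10000)` mirrors the kind of run described in this section. The outcome depends on the random seed, so no particular value is quoted here.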


The specific sequence we used to check our code is a 13-residue sequence, “P-H-P-H-H-P-H-H-H-H-H-H-H”.

After examining 10,000 conformations for this sequence and finding the number of HH contacts for each conformation, we produced the graph shown in figure 8. In this graph, the horizontal axis corresponds to the serial number of the conformation, while the vertical axis represents the number of HH contacts in that conformation.

Among these 10,000 conformations, two have 6 HH contacts. Noting that the degeneracy drops dramatically as the number of HH contacts increases, we reasonably conclude that these two conformations represent the native state. It turns out that both of them are the same conformation, generated twice by the self-avoiding random walk code. This is expected since, after all, the code does not prevent the same conformation from being generated more than once.
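One way to detect such repeated conformations is to reduce each walk to a canonical form and compare those. The sketch below is in Python with hypothetical function names and is not part of the code of this report. It assumes that every walk starts at the origin and that conformations related by a rotation or reflection of the square lattice count as the same; the report does not state whether its two 6-contact conformations were identical site by site or only up to such a symmetry.

```python
def canonical_form(path):
    """Return a representative that is the same for all rotations and
    reflections of a walk starting at (0, 0)."""
    # The eight symmetries of the square lattice that keep the origin fixed.
    transforms = [
        lambda x, y: (x, y),
        lambda x, y: (-y, x),
        lambda x, y: (-x, -y),
        lambda x, y: (y, -x),
        lambda x, y: (-x, y),
        lambda x, y: (x, -y),
        lambda x, y: (y, x),
        lambda x, y: (-y, -x),
    ]
    # The smallest transformed coordinate tuple serves as the representative.
    return min(tuple(t(x, y) for x, y in path) for t in transforms)

def distinct_conformations(paths):
    """Return the set of canonical forms found among the given walks."""
    return {canonical_form(p) for p in paths}
```

Applying `distinct_conformations` to the conformations that share the maximum number of HH contacts gives the number of genuinely different candidates for the native state.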

The predicted conformation of the native state is illustrated in figure 9. The conformation is in agreement with the reference.


Figure 8: Number of HH Contacts vs Conformations for Our Test Sequence

Figure 9: The Native State with 6 HH Contacts


The Mathematica code we developed appears in figure 10. In this code, the number of conformations to be generated is set by ‘try’, the length of the sequence by ‘dim’, and the sequence of H and P residues by ‘seq’. After the run, two outputs are available: ‘pos[[conf, x, y]]’, which stores the positions of the residues for each conformation, and ‘h[[conf]]’, which stores the number of HH contacts for each conformation. Using ‘h[[conf]]’, one can find the conformation with the maximum number of HH contacts; returning to ‘pos[[conf, x, y]]’ then gives the conformation of the native state.


(* Number of conformations, chain length, and the sequence of H and P residues *)
try = 10000; dim = 13;
seq = {P, H, P, H, H, P, H, H, H, H, H, H, H};

(* --- Generating Conformation and Counting number of HH Contacts for Each Conformation --- *)
pos = Array[Null, {try, dim, 2}];
h = Table[0, {i, 1, try}];
dsk = Table[Null, {i, 1, dim}];
lns = Table[Null, {i, 1, dim - 1}];
For[conf = 1, conf < try + 1, conf++,
  Label[repeat];
  (* every walk starts at the origin *)
  pos[[conf, 1, 1]] = 0.; pos[[conf, 1, 2]] = 0.;
  For[i = 1, i < dim, i++,
    (* MU, MD, ML, MR are 1 while the up, down, left, right neighbor of residue i is free *)
    MU = 1; MD = 1; ML = 1; MR = 1;
    For[j = 1, j < i, j++,
      If[pos[[conf, j, 1]] == pos[[conf, i, 1]] && pos[[conf, j, 2]] == pos[[conf, i, 2]] + 1., MU = 0];
      If[pos[[conf, j, 1]] == pos[[conf, i, 1]] && pos[[conf, j, 2]] == pos[[conf, i, 2]] - 1., MD = 0];
      If[pos[[conf, j, 1]] == pos[[conf, i, 1]] - 1. && pos[[conf, j, 2]] == pos[[conf, i, 2]], ML = 0];
      If[pos[[conf, j, 1]] == pos[[conf, i, 1]] + 1. && pos[[conf, j, 2]] == pos[[conf, i, 2]], MR = 0];
    ];
    (* dead end: discard the walk and generate this conformation again *)
    If[MU == 0 && MD == 0 && ML == 0 && MR == 0, Goto[repeat]];
    (* draw a direction at random; draw again if that neighbor is occupied *)
    Label[again];
    p = Random[];
    If[p <= 0.25,
      If[MU != 0,
        pos[[conf, i + 1, 1]] = pos[[conf, i, 1]];
        pos[[conf, i + 1, 2]] = pos[[conf, i, 2]] + 1.; Goto[done],
        Goto[again]]];
    If[0.25 < p && p <= 0.5,
      If[MD != 0,
        pos[[conf, i + 1, 1]] = pos[[conf, i, 1]];
        pos[[conf, i + 1, 2]] = pos[[conf, i, 2]] - 1.; Goto[done],
        Goto[again]]];
    If[0.5 < p && p <= 0.75,
      If[ML != 0,
        pos[[conf, i + 1, 1]] = pos[[conf, i, 1]] - 1.;
        pos[[conf, i + 1, 2]] = pos[[conf, i, 2]]; Goto[done],
        Goto[again]]];
    If[0.75 < p,
      If[MR != 0,
        pos[[conf, i + 1, 1]] = pos[[conf, i, 1]] + 1.;
        pos[[conf, i + 1, 2]] = pos[[conf, i, 2]]; Goto[done],
        Goto[again]]];
    Label[done];
  ];
  Label[final];
  (* count H residues k and m on adjacent sites that are not chain neighbors *)
  For[k = 1, k < dim - 2, k++,
    xx = pos[[conf, k, 1]]; yy = pos[[conf, k, 2]];
    If[seq[[k]] == H,
      For[m = k + 2, m < dim + 1, m++,
        If[seq[[m]] == H,
          If[pos[[conf, m, 1]] == xx + 1 && pos[[conf, m, 2]] == yy, h[[conf]] = h[[conf]] + 1];
          If[pos[[conf, m, 1]] == xx - 1 && pos[[conf, m, 2]] == yy, h[[conf]] = h[[conf]] + 1];
          If[pos[[conf, m, 1]] == xx && pos[[conf, m, 2]] == yy + 1, h[[conf]] = h[[conf]] + 1];
          If[pos[[conf, m, 1]] == xx && pos[[conf, m, 2]] == yy - 1, h[[conf]] = h[[conf]] + 1];
        ];
      ];
    ];
  ];
];

(* --- Finding Conformation with Maximum HH Contacts --- *)
ListPlot[h];
(* the maximum is 6 for the test sequence *)
hmax = Max[h];
For[i = 1, i < try + 1, i++,
  If[h[[i]] == hmax, Print["Configuration Number: ", i]; conf = i]];

(* --- Plotting Native State --- *)
r = 0.2; thick = 0.02;
For[i = 1, i < dim + 1, i++,
  (* H residues are drawn in red, P residues in blue *)
  If[seq[[i]] == H, col1 = 1; col3 = 0]; If[seq[[i]] == P, col1 = 0; col3 = 1];
  dsk[[i]] = Graphics[{RGBColor[col1, 0, col3], Disk[{pos[[conf, i, 1]], pos[[conf, i, 2]]}, r]}];
  If[i < dim, lns[[i]] = Graphics[{Thickness[thick],
    Line[{{pos[[conf, i, 1]], pos[[conf, i, 2]]}, {pos[[conf, i + 1, 1]], pos[[conf, i + 1, 2]]}}]}]];
];
Show[{lns, dsk}, AspectRatio -> Automatic];

Figure 10: Mathematica Code for Finding Native State in HP Lattice Model


6. Conclusion

Protein folding is a complex phenomenon. The HP lattice model, although it sacrifices resolution, can describe many aspects of protein folding.

In this project we developed a code to find the native state of a sequence within the HP lattice model. Besides being simple, the code is designed to work for any sequence of any size. However, it is not the most efficient, and for long sequences it takes considerable time to run. Moreover, since the code generates conformations randomly, there is no guarantee that the native state is generated if the number of generated conformations is not large enough. Given these weaknesses, it seems that the code in its original form works well enough for sequences of up to 15 residues, and that it could be enhanced to handle sequences of up to 20 to 25 residues.
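For short sequences, one possible way around the lack of a guarantee is to enumerate every self-avoiding walk instead of sampling at random. This is a different technique from the one used in this project, shown here only as a Python sketch with hypothetical names; it assumes a two-dimensional square lattice and a sequence given as a string of the letters H and P. The number of walks grows roughly exponentially with chain length, so this is only practical for short chains.

```python
MOVES = [(0, 1), (0, -1), (-1, 0), (1, 0)]

def max_hh_contacts(seq):
    """Maximum number of HH contacts over all self-avoiding walks of len(seq) sites."""
    n = len(seq)
    if n < 2:
        return 0
    best = 0
    path = [(0, 0)]
    index_at = {(0, 0): 0}  # lattice site -> index of the residue placed there

    def extend(contacts):
        nonlocal best
        i = len(path)  # index of the residue to place next
        if i == n:
            best = max(best, contacts)
            return
        x, y = path[-1]
        for dx, dy in MOVES:
            site = (x + dx, y + dy)
            if site in index_at:  # self-avoidance
                continue
            gained = 0
            if seq[i] == "H":
                for ex, ey in MOVES:
                    j = index_at.get((site[0] + ex, site[1] + ey))
                    # j < i - 1 excludes the chain neighbor i - 1
                    if j is not None and j < i - 1 and seq[j] == "H":
                        gained += 1
            path.append(site)
            index_at[site] = i
            extend(contacts + gained)  # depth-first search with backtracking
            path.pop()
            del index_at[site]

    extend(0)
    return best
```

Because every walk is visited, the returned value is the true maximum for the given sequence on this lattice, which can serve as a check on the result of a random search.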


References

[1] K. F. Lau and Ken Dill, Macromolecules 22: 3986-3997 (1989)

[2] Ken Dill, S. Bromberg, K. Yue, K. M. Fiebig, D. P. Yee, P. D. Thomas and H. S.

Chan, Protein Science 4: 561-602 (1995)

[3] J. N. Onuchic and P. G. Wolynes, Current Opinion in Structural Biology 14: 70-75

(2004)

[4] J. A. McCammon, Rep. Prog. Phys. 47: 1-46 (1984)

[5] http://online.kitp.ucsb.edu/online/infobio01/dill/

[6] http://folding.stanford.edu/

[7] http://en.wikipedia.org/wiki/Protein

[8] http://en.wikipedia.org/wiki/Protein_folding
