PROTEIN FOLDING: HP Lattice Model
By: Mohammad Montazeri Supervisor: Professor Frank Pinski
Department of Physics, University of Cincinnati,
May 2007
Table of Contents:
1. Introduction
2. Protein Folding
3. Mechanism of Protein Folding
4. H-P Lattice Model
5. Our Problem
6. Conclusion
References
1. Introduction
The word protein comes from the Greek πρώτα ("prota"), meaning "of primary
importance" and these molecules were first described and named by Jöns Jakob Berzelius
in 1838. However, proteins' central role in living organisms was not fully appreciated
until 1926, when James B. Sumner showed that the enzyme urease was a protein. The
first protein to be sequenced was insulin, by Frederick Sanger, who won the Nobel Prize
for this achievement in 1958. The first protein structures to be solved included
hemoglobin and myoglobin, by Max Perutz and Sir John Cowdery Kendrew,
respectively, at the end of the 1950s. Both proteins' three-dimensional structures were
first determined by X-ray diffraction analysis; the structures of myoglobin and hemoglobin
won the 1962 Nobel Prize in Chemistry for their discoverers.
Proteins are large organic compounds made of amino acids arranged in a linear chain and
joined together by peptide bonds between the carboxyl and amino groups of adjacent
amino acid residues. The sequence of amino acids in a protein is defined by a gene and
encoded in the genetic code. Although this genetic code specifies 20 "standard" amino
acids, the residues in a protein are often chemically altered in post-translational
modification: either before the protein can function in the cell, or as part of control
mechanisms. Proteins can also work together to achieve a particular function, and they
often associate to form stable complexes.
Proteins are essential parts of all living organisms and participate in every process within
cells. Many proteins are enzymes that catalyze biochemical reactions, and are vital to
metabolism. Other proteins have structural or mechanical functions, such as the proteins
in the cytoskeleton, which forms a system of scaffolding that maintains cell shape.
Proteins are also important in cell signaling, immune responses, cell adhesion, and the
cell cycle. Protein is also a necessary component in our diet, since animals cannot
synthesize all the amino acids and must obtain essential amino acids from food.
Functional proteins range in size from about 40 to 50 amino acids up to several
thousand residues. The human body makes at least 50,000 different types of proteins,
each with a different function.
Depending on the scale at which a protein is viewed, its structure is described at
four levels:
I. Primary Structure: The primary structure is simply the sequence of amino acids in
the protein (see figure 1). This sequence forms the backbone of the protein's shape.
II. Secondary Structure: Secondary structure consists of regularly repeating local
structures stabilized by hydrogen bonds. The most common examples are the alpha
helix and beta sheet (see figure 2). Because secondary structures are local, many regions
of different secondary structure can be present in the same protein molecule.
III. Tertiary Structure or Folded State: The folded state is the overall shape of a single
protein molecule; the spatial relationship of the secondary structures to one another (see
figure 3). Tertiary structure is generally stabilized by non-local interactions, most
commonly the formation of a hydrophobic core, but also through salt bridges, hydrogen
bonds, disulfide bonds, and even post-translational modifications. The term "tertiary
structure" is often used as synonymous with the term fold.
IV. Quaternary Structure: Quaternary structure is the shape or structure that results from
the interaction of more than one protein molecule, usually called protein subunits in this
context, which function as part of the larger assembly or protein complex.
In the next section we focus on how the folded structure forms, namely
protein folding.
Figure 1: Primary Structure
Figure 2: Alpha Helices in Secondary Structure
Figure 3: Folded Structure (Simulation Results)
2. Protein Folding
Protein folding is the physical process by which a protein folds into its characteristic
three-dimensional structure. Each protein begins as a polypeptide, translated from a
sequence of mRNA as a linear chain of amino acids. This polypeptide lacks any
developed three-dimensional structure. However, each amino acid in the chain can be
thought of as having certain 'gross' chemical features. These may be hydrophobic,
hydrophilic, or electrically charged, for example. The residues interact with each other and their
surroundings in the cell to produce a well-defined, three dimensional shape, the folded
protein, known as the native state. The resulting three-dimensional structure is
determined by the sequence of the amino acids.
Experimentally determining the three-dimensional structure of a protein is often very
difficult and expensive. However, the sequence of that protein is often known. Therefore
scientists have tried to use different biophysical techniques to "manually" fold a protein,
that is, to predict the structure of the complete protein from its sequence.
In certain solutions and under some conditions proteins will not fold into their
biologically "functional" forms. Temperatures above the range that cells tend to live in
will cause proteins to unfold or "denature" (this is why boiling makes the white of an egg
opaque). High concentrations of solutes and extremes of pH can do the same. A fully
denatured protein lacks both tertiary and secondary structure, and exists as a so-called
random coil. Cells sometimes protect their proteins against the denaturing influence of
heat with enzymes known as chaperones or heat shock proteins, which assist other
proteins both in folding and in remaining folded. Some proteins never fold in cells at all
except with the assistance of chaperone molecules, which either isolate individual proteins
so that their folding is not interrupted by interactions with other proteins or help to unfold
misfolded proteins, giving them a second chance to refold properly.
For many proteins the correct three dimensional structure is essential for the protein to
function correctly. Thus, failure to fold usually produces inactive proteins with
different properties. Several diseases, such as Alzheimer's disease, cystic fibrosis and
BSE, are believed to result from the accumulation of misfolded proteins. These diseases are
associated with the aggregation of misfolded proteins into insoluble plaques; it is not
known whether the plaques are the cause or merely a symptom of illness.
3. Mechanism of Protein Folding
The mechanism of protein folding is not well understood. It is generally accepted that the
folding process is dominated by hydrophobic residues and their interaction with the
solvent and other residues.
Hydrophobic molecules tend to be non-polar and thus prefer other neutral molecules and
nonpolar solvents. Hydrophobic molecules in water often cluster together. Water on
hydrophobic surfaces will exhibit a high contact angle. Examples of hydrophobic
molecules include the alkanes, oils, fats, and greasy substances in general. Hydrophobic
materials are used for oil removal from water, the management of oil spills, and chemical
separation processes to remove non-polar from polar compounds.
On the other hand, a hydrophilic molecule or portion of a molecule is one that is typically
charge-polarized and capable of hydrogen bonding, enabling it to dissolve more readily
in water than in oil or other hydrophobic solvents. Hydrophilic and hydrophobic
molecules are also known as polar molecules and nonpolar molecules, respectively. Soap
has a hydrophilic head and a hydrophobic tail, which allows it to dissolve in both water
and oil and therefore to clean a surface.
In a protein sequence, some residues are hydrophobic and some are hydrophilic. When the
protein is in a solvent (usually water, whose molecules are polar), it deforms in such a
way as to minimize the contact between its hydrophobic residues and the water
molecules or, equivalently, to maximize the contact among its own hydrophobic
residues. This is understood to be the dominant interaction in the folding process.
One can define a free energy for the protein that takes into account the dominance of the
hydrophobic interaction in the folding process. In the native state, where the free
energy is minimized, the number of hydrophobic contacts is maximized. As an
unfolded protein folds, it passes through various geometrical conformations, each
corresponding to an energy level. Hence one can define an energy
landscape (see figure 4). Each point on the energy landscape corresponds to one
geometrical conformation of the protein. In this view, the folding process is a
path on the energy landscape toward the native state, which is located at the lowest point
of the landscape.
The energy landscape theory was formulated by Joseph Bryngelson and Peter Wolynes in
the late 1980s and early 1990s. This approach introduced the principle of minimal
frustration, which asserts that evolution has selected the amino acid sequences of natural
proteins so that interactions between side chains largely favor the molecule's acquisition
of the folded state. Interactions that do not favor folding are selected against, although
some residual frustration is expected to exist. A consequence of these evolutionarily
selected sequences is that proteins are generally thought to have globally "funneled
energy landscapes" (coined by José Onuchic) that are largely directed towards the native
state (see figure 5). This "folding funnel" landscape allows the protein to fold to the native
state through any of a large number of pathways and intermediates, rather than being
restricted to a single mechanism. The theory is supported by computational simulations
of model proteins and has been used to improve methods for protein structure prediction
and design.
For most sequences the energy landscape is not a smooth funnel but is
rough (see figure 6). In this case, if the protein follows the path of decreasing energy, it
may become trapped in a region that is not the native state but a local minimum. Such a
region is called a kinetic trap. To continue folding, the protein must climb out of the
local minimum; in other words, it must unfold enough to escape the kinetic
trap.
These minima can be deep enough that even thermal fluctuations cannot
help the protein continue along its path toward the native state. Kinetic traps may
therefore be one of the reasons for protein misfolding or for delays in the folding process.
It turns out that on its way to the native state the protein makes a large number of
mistakes (follows wrong paths). These mistakes arise not only from kinetic traps but also
from phenomena such as uphill steps, thermal motion, and the retrying of earlier conformations.
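As a rough guide to why deep traps matter, a standard Arrhenius-type estimate can be written down. This is an assumption added for illustration, not part of the original analysis; the symbols k_0 (an attempt rate), \Delta E (the depth of the trap), k_B (Boltzmann's constant) and T (temperature) are notation introduced here.

```latex
k_{\text{escape}} \approx k_0 \exp\!\left(-\frac{\Delta E}{k_B T}\right)
```

Under this estimate, a trap of depth \Delta E = 5 k_B T reduces the escape rate by a factor of e^{-5} \approx 0.0067, and a trap of depth 10 k_B T by e^{-10} \approx 4.5 \times 10^{-5}, so moderately deeper traps are exponentially harder to leave by thermal fluctuations alone.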
In general, the folding process is understood as follows: in the very first steps, the
unfolded protein collapses in a very short time into a compact conformation with a
hydrophobic core, and then it moves toward the native state through internal rearrangements
involving both folding and unfolding steps. Most of the mistakes mentioned above are made
after the collapse, during this internal rearrangement.
Figure 4: Energy Landscape
Figure 5: Funnel-Like Energy Landscape
Figure 6: Roughness on Energy Landscape and Kinetic Traps
4. H-P Lattice Model
The hydrophobic-polar protein folding model is a highly simplified model for examining
protein folds in space. First proposed by Dill in 1985, it is motivated by the observation
that hydrophobic interactions between amino acid residues are the driving force for
proteins folding into their native state. All amino acid types are classified as either
hydrophobic (H) or polar (P), and the folding of a protein sequence is defined as a self-
avoiding walk in a 2D or 3D lattice. The HP model imitates the hydrophobic effect by
assigning a negative (favorable) weight to interactions between adjacent, non-covalently
bound H residues. Proteins that have minimum energy are assumed to be in their native
state (see figure 7).
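The energy assignment described above can be written compactly. The following is one common way to express it, given as a sketch: the symbols \varepsilon (the strength of one contact), h_i (1 for an H residue, 0 for a P residue) and \mathbf{r}_i (the lattice position of residue i) are notation assumed here, not taken from the original text.

```latex
E(\mathbf{r}_1,\dots,\mathbf{r}_N)
  = -\varepsilon \sum_{i=1}^{N-2} \sum_{j=i+2}^{N} h_i \, h_j \, \Delta(\mathbf{r}_i,\mathbf{r}_j),
\qquad
\Delta(\mathbf{r}_i,\mathbf{r}_j) =
  \begin{cases}
    1 & \text{if } |\mathbf{r}_i - \mathbf{r}_j| = 1,\\
    0 & \text{otherwise.}
  \end{cases}
```

The inner sum starts at j = i + 2 so that residues bonded along the chain are not counted as contacts. With \varepsilon > 0, minimizing E is the same as maximizing the number of HH contacts.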
The HP model can be expressed in both two and three dimensions, generally with square
lattices, although triangular lattices have been used as well.
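To make the model concrete, here is a minimal Python sketch for a 2D square lattice. It is an illustration only and is not the code used in this project; the function names `is_self_avoiding`, `hh_contacts` and `energy`, and the contact strength `eps`, are hypothetical names chosen for this example.

```python
from typing import Dict, List, Tuple

Coord = Tuple[int, int]


def is_self_avoiding(path: List[Coord]) -> bool:
    """True if every site is distinct and consecutive residues are one lattice step apart."""
    if len(set(path)) != len(path):
        return False
    return all(abs(x1 - x2) + abs(y1 - y2) == 1
               for (x1, y1), (x2, y2) in zip(path, path[1:]))


def hh_contacts(seq: str, path: List[Coord]) -> int:
    """Count H-H pairs that are lattice neighbours but not bonded along the chain."""
    if len(seq) != len(path) or not is_self_avoiding(path):
        raise ValueError("path must be a self-avoiding walk with one site per residue")
    index_at: Dict[Coord, int] = {site: i for i, site in enumerate(path)}
    contacts = 0
    for i, (x, y) in enumerate(path):
        if seq[i] != "H":
            continue
        # Look only right and up so that each neighbouring pair is counted once.
        for neighbour in ((x + 1, y), (x, y + 1)):
            j = index_at.get(neighbour)
            if j is not None and seq[j] == "H" and abs(i - j) > 1:
                contacts += 1
    return contacts


def energy(seq: str, path: List[Coord], eps: float = 1.0) -> float:
    """HP energy: each H-H contact contributes -eps (eps is an assumed unit)."""
    return -eps * hh_contacts(seq, path)


# A four-residue chain folded around a unit square: residues 1 and 4 touch.
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
print(hh_contacts("HPPH", square), energy("HPPH", square))  # prints: 1 -1.0
```

Storing the sites in a dictionary keyed by coordinate makes each neighbour lookup a constant-time operation, so counting contacts takes time proportional to the chain length.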
Randomized search algorithms are often used to tackle the HP folding problem. These
include stochastic and evolutionary methods such as the Monte Carlo method, genetic
algorithms, and ant colony optimization. While no method has been able to calculate the
experimentally determined minimum energetic state for long protein sequences, the most
advanced methods today are able to come close.
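For very short chains the minimum can instead be found exactly, by enumerating every self-avoiding walk. The Python sketch below illustrates this under stated assumptions: it uses a 2D square lattice, and the names `enumerate_walks`, `hh_contacts` and `exact_max_contacts` are hypothetical. The number of self-avoiding walks on the square lattice grows quickly with the number of steps (4, 12, 36, 100, 284 for one to five steps), which is why enumeration is not practical for long sequences and randomized search is used there.

```python
from typing import Iterator, List, Set, Tuple

Coord = Tuple[int, int]
MOVES = ((1, 0), (-1, 0), (0, 1), (0, -1))


def enumerate_walks(n: int) -> Iterator[List[Coord]]:
    """Yield every self-avoiding walk with n sites (n >= 2) whose first step goes right.

    Fixing the first step removes the four rotated copies of each walk,
    which all have the same number of contacts.
    """
    if n < 2:
        raise ValueError("need at least two sites")
    path: List[Coord] = [(0, 0), (1, 0)]
    occupied: Set[Coord] = set(path)

    def extend() -> Iterator[List[Coord]]:
        if len(path) == n:
            yield list(path)
            return
        x, y = path[-1]
        for dx, dy in MOVES:
            nxt = (x + dx, y + dy)
            if nxt not in occupied:
                path.append(nxt)
                occupied.add(nxt)
                yield from extend()
                path.pop()           # backtrack
                occupied.remove(nxt)

    yield from extend()


def hh_contacts(seq: str, path: List[Coord]) -> int:
    """Count H-H lattice neighbours that are not adjacent along the chain."""
    index_at = {site: i for i, site in enumerate(path)}
    total = 0
    for i, (x, y) in enumerate(path):
        if seq[i] == "H":
            for nb in ((x + 1, y), (x, y + 1)):  # right and up: each pair once
                j = index_at.get(nb)
                if j is not None and seq[j] == "H" and abs(i - j) > 1:
                    total += 1
    return total


def exact_max_contacts(seq: str) -> int:
    """Largest number of HH contacts over all self-avoiding walks for seq."""
    return max(hh_contacts(seq, walk) for walk in enumerate_walks(len(seq)))


print(exact_max_contacts("HPPH"))  # prints: 1
```

With the first step fixed, a chain of 13 residues still requires on the order of 10^5 walks, which is feasible, but each additional residue multiplies the work by a factor between 2 and 3.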
Even though the HP model abstracts away many of the details of protein folding, it can
describe the general behavior of protein folding well.
Figure 7: HP Lattice Model on a Square Lattice
5. Our Problem
In this project, we developed a code to find the native state of an arbitrary sequence
using the HP lattice model. Given a specific sequence, the code generates a large number
of conformations with a self-avoiding random walk algorithm. Each walk visits as many
lattice sites as there are residues in the protein.
The code then counts the number of HH contacts in each generated conformation.
Searching among the conformations, it takes the one with the
maximum number of HH contacts as the native state of the sequence.
The algorithm of our code is as follows:
Start
1. Given Sequence.
2. Using Self Avoiding Random Walk, Generate Many Conformations.
3. Count Number of HH Contacts in Each Conformation.
4. Find the Conformation with the Greatest Number of HH Contacts.
5. Native State.
End
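The steps above can be sketched in Python as follows. This is an illustrative re-expression, not the Mathematica code developed in this project; the names `random_walk`, `hh_contacts` and `best_of_random`, and the use of a seeded `random.Random`, are assumptions made for the example. Each walk is grown one site at a time by choosing uniformly among the free neighbouring sites, and it is restarted from scratch whenever it reaches a dead end.

```python
import random
from typing import List, Tuple

Coord = Tuple[int, int]
MOVES = ((0, 1), (0, -1), (-1, 0), (1, 0))  # up, down, left, right


def random_walk(n: int, rng: random.Random) -> List[Coord]:
    """Grow one self-avoiding walk with n sites; restart on a dead end."""
    while True:
        path = [(0, 0)]
        occupied = {(0, 0)}
        while len(path) < n:
            x, y = path[-1]
            free = [(x + dx, y + dy) for dx, dy in MOVES
                    if (x + dx, y + dy) not in occupied]
            if not free:          # trapped: all four neighbours are taken
                break
            step = rng.choice(free)
            path.append(step)
            occupied.add(step)
        if len(path) == n:
            return path


def hh_contacts(seq: str, path: List[Coord]) -> int:
    """Count H-H lattice neighbours that are not adjacent along the chain."""
    index_at = {site: i for i, site in enumerate(path)}
    total = 0
    for i, (x, y) in enumerate(path):
        if seq[i] == "H":
            for nb in ((x + 1, y), (x, y + 1)):  # right and up: each pair once
                j = index_at.get(nb)
                if j is not None and seq[j] == "H" and abs(i - j) > 1:
                    total += 1
    return total


def best_of_random(seq: str, tries: int, seed: int = 0) -> Tuple[int, List[Coord]]:
    """Generate `tries` random conformations and keep the one with most HH contacts."""
    rng = random.Random(seed)
    best_count, best_path = -1, []
    for _ in range(tries):
        path = random_walk(len(seq), rng)
        count = hh_contacts(seq, path)
        if count > best_count:
            best_count, best_path = count, path
    return best_count, best_path


count, path = best_of_random("PHPHHPHHHHHHH", tries=2000)
print(count, path)
```

The result is the best conformation among those sampled, which is not guaranteed to be the true minimum; increasing `tries` improves the chance of finding it.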
The specific sequence we used to check our code has 13 residues:
“P-H-P-H-H-P-H-H-H-H-H-H-H”.
After examining 10,000 conformations for this sequence and finding the number of HH
contacts for each one, we produced the graph shown in figure 8. In this graph, the
horizontal axis corresponds to the serial number of the conformation, while the vertical
axis represents its number of HH contacts.
Among these 10,000 conformations, two have 6 HH contacts, the largest number found.
Noting how dramatically the degeneracy decreases as the number of HH contacts increases, we
reasonably conclude that these two conformations represent the native state. It
turns out that both are the same conformation, generated twice by
the self-avoiding random walk code. This is to be expected since the code does not
prevent the same conformation from being generated more than once.
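One way to detect repeated conformations is to reduce each one to a canonical key before comparing. The Python sketch below is an illustration under assumptions, not part of the original code: it treats two conformations as the same if one can be turned into the other by a rotation or reflection of the square lattice, and the names `canonical` and `unique_conformations` are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

Coord = Tuple[int, int]
Path = List[Coord]


def canonical(path: Path) -> Tuple[Coord, ...]:
    """Return a key that is identical for all rotated or reflected copies of a walk.

    The walk is shifted so that its first site is the origin, the eight
    symmetries of the square lattice are applied, and the smallest image
    (in tuple ordering) is used as the key.
    """
    x0, y0 = path[0]
    rel = [(x - x0, y - y0) for x, y in path]
    images = []
    for sx in (1, -1):
        for sy in (1, -1):
            images.append(tuple((sx * x, sy * y) for x, y in rel))  # flips
            images.append(tuple((sx * y, sy * x) for x, y in rel))  # flips with x, y swapped
    return min(images)


def unique_conformations(paths: Iterable[Path]) -> Dict[Tuple[Coord, ...], Path]:
    """Keep the first conformation seen for each canonical key."""
    unique: Dict[Tuple[Coord, ...], Path] = {}
    for path in paths:
        unique.setdefault(canonical(path), path)
    return unique


bend = [(0, 0), (1, 0), (1, 1)]
rotated_bend = [(0, 0), (0, 1), (-1, 1)]   # the same bend turned by 90 degrees
straight = [(0, 0), (1, 0), (2, 0)]
print(len(unique_conformations([bend, rotated_bend, straight, bend])))  # prints: 2
```

The key does not identify a walk with its reverse, because reading the chain backwards changes which residue sits on which site.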
The predicted conformation of the native state is illustrated in figure 9. The conformation
is in agreement with the reference.
Figure 8: Number of HH Contacts vs Conformations for Our Test Sequence
Figure 9: The Native State with 6 HH Contacts
Figure 10 shows the Mathematica code we developed. In this code, the number of
conformations to be generated is set by ‘try’, the length of the sequence is
set by ‘dim’, and the sequence of H and P residues is given by ‘seq’. At the end of
a run, two outputs are available: ‘pos[[conf, x, y]]’ and ‘h[[conf]]’. The
first stores the positions of the residues for each conformation, and the second stores
the number of HH contacts for each conformation. Using ‘h[[conf]]’, the conformation
with the maximum number of HH contacts can be found. Returning to
‘pos[[conf, x, y]]’, the conformation of the native state can then be obtained.
try = 10000; dim = 13;
seq = {P, H, P, H, H, P, H, H, H, H, H, H, H};

(* --- Generating Conformation and Counting number of HH Contacts for Each Conformation --- *)
pos = Array[Null, {try, dim, 2}];
h = Table[0, {i, 1, try}];
dsk = Table[Null, {i, 1, dim}];
lns = Table[Null, {i, 1, dim - 1}];
For[conf = 1, conf < try + 1, conf++,
  Label[repeat];
  (* every walk starts at the origin *)
  pos[[conf, 1, 1]] = 0.; pos[[conf, 1, 2]] = 0.;
  For[i = 1, i < dim, i++,
    (* MU, MD, ML, MR stay 1 while the site up, down, left, right of residue i is free *)
    MU = 1; MD = 1; ML = 1; MR = 1;
    For[j = 1, j < i, j++,
      If[pos[[conf, j, 1]] == pos[[conf, i, 1]] && pos[[conf, j, 2]] == pos[[conf, i, 2]] + 1., MU = 0];
      If[pos[[conf, j, 1]] == pos[[conf, i, 1]] && pos[[conf, j, 2]] == pos[[conf, i, 2]] - 1., MD = 0];
      If[pos[[conf, j, 1]] == pos[[conf, i, 1]] - 1. && pos[[conf, j, 2]] == pos[[conf, i, 2]], ML = 0];
      If[pos[[conf, j, 1]] == pos[[conf, i, 1]] + 1. && pos[[conf, j, 2]] == pos[[conf, i, 2]], MR = 0];
    ];
    (* dead end: all four neighbours are occupied, so restart this conformation *)
    If[MU == 0 && MD == 0 && ML == 0 && MR == 0, Goto[repeat]];
    Label[again];
    (* pick a direction at random; draw again if that site is occupied *)
    p = Random[];
    If[p <= 0.25,
      If[MU != 0, pos[[conf, i + 1, 1]] = pos[[conf, i, 1]];
        pos[[conf, i + 1, 2]] = pos[[conf, i, 2]] + 1.; Goto[done], Goto[again]]];
    If[0.25 < p && p <= 0.5,
      If[MD != 0, pos[[conf, i + 1, 1]] = pos[[conf, i, 1]];
        pos[[conf, i + 1, 2]] = pos[[conf, i, 2]] - 1.; Goto[done], Goto[again]]];
    If[0.5 < p && p <= 0.75,
      If[ML != 0, pos[[conf, i + 1, 1]] = pos[[conf, i, 1]] - 1.;
        pos[[conf, i + 1, 2]] = pos[[conf, i, 2]]; Goto[done], Goto[again]]];
    If[0.75 < p,
      If[MR != 0, pos[[conf, i + 1, 1]] = pos[[conf, i, 1]] + 1.;
        pos[[conf, i + 1, 2]] = pos[[conf, i, 2]]; Goto[done], Goto[again]]];
    Label[done];
  ];
  Label[final];
  (* count H residues that are lattice neighbours; m starts at k + 2 to skip bonded neighbours *)
  For[k = 1, k < dim - 2, k++,
    xx = pos[[conf, k, 1]]; yy = pos[[conf, k, 2]];
    (* === (SameQ) gives a definite True or False when comparing the symbols H and P *)
    If[seq[[k]] === H,
      For[m = k + 2, m < dim + 1, m++,
        If[seq[[m]] === H,
          If[pos[[conf, m, 1]] == xx + 1 && pos[[conf, m, 2]] == yy, h[[conf]] = h[[conf]] + 1];
          If[pos[[conf, m, 1]] == xx - 1 && pos[[conf, m, 2]] == yy, h[[conf]] = h[[conf]] + 1];
          If[pos[[conf, m, 1]] == xx && pos[[conf, m, 2]] == yy + 1, h[[conf]] = h[[conf]] + 1];
          If[pos[[conf, m, 1]] == xx && pos[[conf, m, 2]] == yy - 1, h[[conf]] = h[[conf]] + 1];
        ];
      ];
    ];
  ];
];

(* --- Finding Conformation with Maximum HH Contacts --- *)
ListPlot[h];
(* 6 is the maximum number of HH contacts read off the plot for this sequence *)
For[i = 1, i < try + 1, i++, If[h[[i]] == 6, Print["Configuration Number: ", i]; conf = i]];

(* --- Plotting Native State --- *)
r = 0.2; thick = 0.02;
For[i = 1, i < dim + 1, i++,
  (* H residues are drawn in red, P residues in blue *)
  If[seq[[i]] === H, col1 = 1; col3 = 0]; If[seq[[i]] === P, col1 = 0; col3 = 1];
  dsk[[i]] = Graphics[{RGBColor[col1, 0, col3], Disk[{pos[[conf, i, 1]], pos[[conf, i, 2]]}, r]}];
  If[i < dim, lns[[i]] = Graphics[{Thickness[thick],
    Line[{{pos[[conf, i, 1]], pos[[conf, i, 2]]}, {pos[[conf, i + 1, 1]], pos[[conf, i + 1, 2]]}}]}]];
];
Show[{lns, dsk}, AspectRatio -> Automatic];
Figure 10: Mathematica Code for Finding Native State in HP Lattice Model
6. Conclusion
Protein folding is a complex phenomenon. The HP lattice model, though it sacrifices
resolution, can describe many aspects of protein folding.
In this project we developed a code to find the native state of a sequence within the HP
lattice model. Besides being simple, the code is designed to work for a
sequence of any length. However, the code is not the most efficient, and for long
sequences it takes considerable time to run. Moreover, since the code
generates conformations randomly, there is no guarantee that the native state is generated if
the number of generated conformations is not large enough. Given these weaknesses, it
seems that the code in its original form works well enough for sequences of up to 15
residues, and that it could be enhanced to handle sequences of up to 20
to 25 residues.
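The dependence on the number of generated conformations can be estimated with a simple calculation. This is a sketch under the assumption that the tries are independent and that each one produces the native conformation with the same probability p; the symbols p, N (number of tries) and \delta (acceptable probability of missing the native state) are notation introduced here.

```latex
P_{\text{miss}} = (1 - p)^{N},
\qquad
N \ge \frac{\ln \delta}{\ln (1 - p)} \approx \frac{\ln (1/\delta)}{p}
\quad \text{for small } p .
```

As a rough worked example, finding the best conformation twice in 10,000 tries suggests p \approx 2 \times 10^{-4} for the 13-residue test sequence, although an estimate from only two hits is very uncertain. With that value, P_{\text{miss}} \approx e^{-2} \approx 0.14 for N = 10{,}000, and reducing the chance of a miss to \delta = 0.01 would take about N \approx 23{,}000 tries.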
References
[1] K. F. Lau and Ken Dill, Macromolecules 22: 3986-3997 (1989)
[2] Ken Dill, S. Bromberg, K. Yue, K. M. Fiebig, D. P. Yee, P. D. Thomas and H. S.
Chan, Protein Science 4: 561-602 (1995)
[3] J. N. Onuchic and P. G. Wolynes, Current Opinion in Structural Biology 14: 70-75
(2004)
[4] J. A. McCammon, Rep. Prog. Phys. 47: 1-46 (1984)
[5] http://online.kitp.ucsb.edu/online/infobio01/dill/
[6] http://folding.stanford.edu/
[7] http://en.wikipedia.org/wiki/Protein
[8] http://en.wikipedia.org/wiki/Protein_folding