Post on 22-May-2020
Machine Learning For
Protein Modeling
Presented By
Ellen Huynh
What are Proteins?
• Complex, high molecular
mass, organic compounds
• Consist of a specific sequence of amino acids (aa’s) joined together by peptide bonds
• The order of the aa’s is determined by the base sequence of nucleotides in the gene that codes for the protein
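As a rough illustration of how a gene's base sequence fixes the amino acid order, the sketch below reads a DNA coding strand three bases at a time. The codon table is only a small excerpt of the standard genetic code, and the function name `translate` and the example DNA string are made up for illustration.

```python
# Excerpt of the standard genetic code (DNA codons); the full table has 64 entries.
CODON_TABLE = {
    "ATG": "M",  # methionine (start codon)
    "GCC": "A",  # alanine
    "AAA": "K",  # lysine
    "GAA": "E",  # glutamate
    "TGG": "W",  # tryptophan
    "TAA": "*",  # stop codon
}


def translate(dna):
    """Translate a coding strand codon by codon until a stop codon is reached."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]  # raises KeyError for codons outside the excerpt
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


print(translate("ATGGCCAAAGAATGGTAA"))  # MAKEW
```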
Why Study Proteins?
• Proteins are required for the structure, function and regulation of the body’s cells, tissues and organs
• Each protein has a unique function, determined by its structure
• Examples of proteins are enzymes, hormones and antibodies
Protein Structure (1/3)
• Primary
– Amino acid sequence of polypeptide chain (linear)
– Determined by the gene that encodes it
• Secondary
– Three types: α-helix, β-sheets, coils
– Local ordered structure brought about via hydrogen bonding, mainly within the peptide backbone
– α-helix: backbone H-bonds link residues i and i+4
– β-sheets: H-bonds link two sequence segments
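The i to i+4 hydrogen-bond pattern of a helix can be enumerated with a few lines of Python. This is a sketch for an idealized helix only; the function name `helix_hbond_pairs` and the residue numbering are assumptions chosen for illustration.

```python
def helix_hbond_pairs(first, last):
    """Return the (i, i+4) residue pairs with both ends inside the helix first..last (inclusive)."""
    return [(i, i + 4) for i in range(first, last - 3)]


print(helix_hbond_pairs(1, 8))  # [(1, 5), (2, 6), (3, 7), (4, 8)]
```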
Protein Structure (2/3)
• Tertiary
– "Global" folding of a single polypeptide chain
– The driving force in determining the tertiary structure of globular proteins is the hydrophobic effect
– Folding occurs so that the side chains of the nonpolar amino acids are "hidden" within the structure and the side chains of the polar residues are exposed on the outer surface
Protein Structure (3/3)
• Quaternary
– Involves two or more polypeptide chains that form a multi-subunit structure
Pre-Machine Learning Methods
• X-ray crystallography and NMR spectroscopy were used to determine the structure and function of proteins
• These methods were costly and time-consuming
Goal
• Increase the accuracy of protein structure prediction, mainly at the secondary level, in an efficient manner, to help improve the understanding of protein functions
Machine Learning Methods (1/2)
• Neural Networks
– Trained pairwise neural networks
– Networks are initialized with random uniform weights and subsequently trained through backpropagation
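The initialization and training steps just described can be sketched in plain Python. This is a minimal illustration, not the project's network: the layer sizes, the weight range of ±0.5, the learning rate, and the toy data set (a logical OR on two inputs, not real protein data) are all assumptions.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def init_layer(n_in, n_out, rng, scale=0.5):
    # Random uniform weights in [-scale, scale]; the last entry of each row is the bias.
    return [[rng.uniform(-scale, scale) for _ in range(n_in + 1)] for _ in range(n_out)]


def forward(x, w_hidden, w_out):
    h = [sigmoid(sum(w * v for w, v in zip(row, x + [1.0]))) for row in w_hidden]
    y = sigmoid(sum(w * v for w, v in zip(w_out[0], h + [1.0])))
    return h, y


def train(data, n_hidden=3, epochs=2000, lr=0.5, seed=0):
    rng = random.Random(seed)
    w_hidden = init_layer(len(data[0][0]), n_hidden, rng)
    w_out = init_layer(n_hidden, 1, rng)
    losses = []
    for _ in range(epochs):
        total = 0.0
        for x, target in data:
            h, y = forward(x, w_hidden, w_out)
            total += (y - target) ** 2
            # Backpropagation: error signal at the output, then at each hidden unit.
            delta_out = (y - target) * y * (1.0 - y)
            delta_hidden = [delta_out * w_out[0][j] * h[j] * (1.0 - h[j])
                            for j in range(n_hidden)]
            # Gradient-descent updates (hidden deltas were computed before this step).
            for j, v in enumerate(h + [1.0]):
                w_out[0][j] -= lr * delta_out * v
            for j in range(n_hidden):
                for i, v in enumerate(x + [1.0]):
                    w_hidden[j][i] -= lr * delta_hidden[j] * v
        losses.append(total / len(data))
    return w_hidden, w_out, losses


# Toy data (assumption): two binary inputs, target is their logical OR.
TOY_DATA = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]

_, _, losses = train(TOY_DATA)
print("first epoch loss:", losses[0], "last epoch loss:", losses[-1])
```

For secondary structure prediction, the inputs would instead be an encoded window of residues and the output would cover the structure classes; the update rule stays the same.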
• Hidden Markov Model (HMM)
– Models stochastic sequences with a probabilistic finite-state machine
– Character in position t depends only on the k preceding characters, where k is the order of the Markov chain
– Hidden process: secondary structure of protein
– Observed process: amino acid sequence
– Prediction achieved with forward/backward algorithm
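The hidden/observed split and the forward/backward step can be sketched as follows for a first-order model. All probabilities below are invented for illustration, the observations use a two-letter alphabet ("h" for hydrophobic, "p" for polar) rather than the full 20 amino acids, and the names `forward_backward` and `predict` are hypothetical. The sketch assumes a short, non-empty sequence, so no numerical scaling is applied.

```python
STATES = ["H", "B", "C"]  # hidden states: helix, sheet, coil

# Made-up parameters for illustration only; a real model would estimate them from data.
START = {"H": 0.3, "B": 0.3, "C": 0.4}
TRANS = {
    "H": {"H": 0.8, "B": 0.05, "C": 0.15},
    "B": {"H": 0.05, "B": 0.8, "C": 0.15},
    "C": {"H": 0.2, "B": 0.2, "C": 0.6},
}
EMIT = {
    "H": {"h": 0.5, "p": 0.5},
    "B": {"h": 0.7, "p": 0.3},
    "C": {"h": 0.3, "p": 0.7},
}


def forward_backward(obs, start, trans, emit):
    """Return, for each position, the posterior probability of every hidden state."""
    n = len(obs)
    # Forward pass: alpha[t][s] = P(obs[0..t], state at t is s)
    alpha = [{s: start[s] * emit[s][obs[0]] for s in STATES}]
    for t in range(1, n):
        alpha.append({
            s: emit[s][obs[t]] * sum(alpha[t - 1][r] * trans[r][s] for r in STATES)
            for s in STATES
        })
    # Backward pass: beta[t][s] = P(obs[t+1..n-1] | state at t is s)
    beta = [dict.fromkeys(STATES, 1.0) for _ in range(n)]
    for t in range(n - 2, -1, -1):
        beta[t] = {
            s: sum(trans[s][r] * emit[r][obs[t + 1]] * beta[t + 1][r] for r in STATES)
            for s in STATES
        }
    likelihood = sum(alpha[-1][s] for s in STATES)
    return [{s: alpha[t][s] * beta[t][s] / likelihood for s in STATES} for t in range(n)]


def predict(obs, start=START, trans=TRANS, emit=EMIT):
    """Pick the most probable hidden state at each position (posterior decoding)."""
    posteriors = forward_backward(obs, start, trans, emit)
    return "".join(max(STATES, key=p.get) for p in posteriors)


print(predict("hhhhpppphh"))
```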
Protein Alphabets
• Structural Alphabet: 20 amino acids
• Chemical Alphabet: acidic, aliphatic, amide, aromatic, basic, hydroxyl, etc.
• Functional Alphabet: acidic, basic, hydrophobic nonpolar, polar uncharged
• Charge Alphabet: acidic, basic, neutral
• Hydrophobic Alphabet: hydrophobic, hydrophilic
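Re-encoding a sequence in one of these reduced alphabets is a simple lookup. The sketch below does this for the charge alphabet, assuming the conventional grouping (D and E acidic; K, R and H basic; everything else neutral). The one-letter output codes "a", "b", "n" and the function name are choices made for this example.

```python
ACIDIC = set("DE")   # aspartate, glutamate
BASIC = set("KRH")   # lysine, arginine, histidine (histidine grouped as basic by convention)


def to_charge_alphabet(seq):
    """Map each residue to 'a' (acidic), 'b' (basic) or 'n' (neutral)."""
    out = []
    for aa in seq.upper():
        if aa in ACIDIC:
            out.append("a")
        elif aa in BASIC:
            out.append("b")
        else:
            out.append("n")
    return "".join(out)


print(to_charge_alphabet("MKDELH"))  # nbaanb
```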
Attribute
• Window size (W) that covers a relevant sequence
• Input: Protein sequence: p = p1p2…pn
• Output: α-helix (H), β-sheets (B), coils (C)
• Data Set: www.pdb.org
• Trained weights: determined by the preceding set of alphabets and the data set
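One possible way to turn a sequence into window-based inputs is sketched below. It assumes an odd window size W centered on the residue being classified, a "-" padding symbol at the chain ends, and a one-hot encoding over the 20 amino acids plus padding; none of these choices is specified in the slides.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
PAD = "-"                             # assumed padding symbol for chain ends
SYMBOLS = AMINO_ACIDS + PAD


def windows(seq, w):
    """Return one window of width w centered on each residue of seq."""
    if w % 2 == 0:
        raise ValueError("window size must be odd so that it has a center residue")
    half = w // 2
    padded = PAD * half + seq + PAD * half
    return [padded[i:i + w] for i in range(len(seq))]


def one_hot(window):
    """Encode a window as a flat vector with len(SYMBOLS) entries per position."""
    vec = []
    for ch in window:
        vec.extend(1.0 if ch == s else 0.0 for s in SYMBOLS)
    return vec


print(windows("ACDE", 3))     # ['-AC', 'ACD', 'CDE', 'DE-']
print(len(one_hot("-AC")))    # 63
```

Each encoded window would be paired with the H, B or C label of its center residue to form one training example.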
Additional Information
• How far into Project?
– Have researched possible algorithms that can be implemented
• Risk?
References
• Baldi, P., Brunak, S. (1998). “Bioinformatics: The Machine Learning Approach.” The MIT Press.
• Gorga, F.R. (2001). “Introduction to Protein Structure” http://webhost.bridgew.edu/fgorga/proteins/default.htm
• Martin, J., Gibrat, J., Rodolphe, F. “Hidden Markov Model for Protein Secondary Structure.”
• Won, K., Hamelryck, T., Prugel-Bennett, A., Krogh, A. “Evolving Hidden Markov Models for Protein Secondary Structure Prediction.”
• Zhang, B., Zhihang, C., Murphey, Y.L. (2005). “Protein Secondary Structure Prediction Using Machine Learning.”