Data Mining for Protein Structure Prediction
description
Transcript of Data Mining for Protein Structure Prediction
![Page 1: Data Mining for Protein Structure Prediction](https://reader030.fdocuments.net/reader030/viewer/2022013004/56813f47550346895da9fc0a/html5/thumbnails/1.jpg)
Data Mining for Protein Structure Prediction
Mohammed J. Zaki
SPIDER Data Mining Project: Scalable, Parallel and Interactive
Data Mining and Exploration at RPI
http://www.cs.rpi.edu/~zaki
![Page 2: Data Mining for Protein Structure Prediction](https://reader030.fdocuments.net/reader030/viewer/2022013004/56813f47550346895da9fc0a/html5/thumbnails/2.jpg)
Outline of the Talk
How do proteins form? Protein folding problem Contact map mining Using HMMs based on local motifs Mining “physical” dense frequent patterns
(non-local motifs) Future directions
Heuristic rules Folding pathways
![Page 3: Data Mining for Protein Structure Prediction](https://reader030.fdocuments.net/reader030/viewer/2022013004/56813f47550346895da9fc0a/html5/thumbnails/3.jpg)
How do Proteins Form?
![Page 4: Data Mining for Protein Structure Prediction](https://reader030.fdocuments.net/reader030/viewer/2022013004/56813f47550346895da9fc0a/html5/thumbnails/4.jpg)
Building Blocks of Biological Systems DNA (nucleotides, 4 types): information carrier/encoder RNA: bridge from DNA to protein Protein (amino acids, 20 types): action molecules.
Processes Replication of DNA Transcription of gene (DNA) to messenger RNA (mRNA)
Splicing of non-coding regions of the genes (introns) Translation of mRNA into proteins Folding of proteins into 3D structure Biochemical or structural functions of proteins
How do Proteins Form?
![Page 5: Data Mining for Protein Structure Prediction](https://reader030.fdocuments.net/reader030/viewer/2022013004/56813f47550346895da9fc0a/html5/thumbnails/5.jpg)
![Page 6: Data Mining for Protein Structure Prediction](https://reader030.fdocuments.net/reader030/viewer/2022013004/56813f47550346895da9fc0a/html5/thumbnails/6.jpg)
![Page 7: Data Mining for Protein Structure Prediction](https://reader030.fdocuments.net/reader030/viewer/2022013004/56813f47550346895da9fc0a/html5/thumbnails/7.jpg)
![Page 8: Data Mining for Protein Structure Prediction](https://reader030.fdocuments.net/reader030/viewer/2022013004/56813f47550346895da9fc0a/html5/thumbnails/8.jpg)
![Page 9: Data Mining for Protein Structure Prediction](https://reader030.fdocuments.net/reader030/viewer/2022013004/56813f47550346895da9fc0a/html5/thumbnails/9.jpg)
Protein Folding Problem
![Page 10: Data Mining for Protein Structure Prediction](https://reader030.fdocuments.net/reader030/viewer/2022013004/56813f47550346895da9fc0a/html5/thumbnails/10.jpg)
Protein Structures
Primary structure Un-branched polymer 20 side chains (residues or amino acids)
Higher order structures Secondary: local (consecutive) in sequence Tertiary: 3D fold of one polypeptide chain Quaternary: Chains packing together
![Page 11: Data Mining for Protein Structure Prediction](https://reader030.fdocuments.net/reader030/viewer/2022013004/56813f47550346895da9fc0a/html5/thumbnails/11.jpg)
Amino Acid
![Page 12: Data Mining for Protein Structure Prediction](https://reader030.fdocuments.net/reader030/viewer/2022013004/56813f47550346895da9fc0a/html5/thumbnails/12.jpg)
Polypeptide Chain
![Page 13: Data Mining for Protein Structure Prediction](https://reader030.fdocuments.net/reader030/viewer/2022013004/56813f47550346895da9fc0a/html5/thumbnails/13.jpg)
Torsion Angles
![Page 14: Data Mining for Protein Structure Prediction](https://reader030.fdocuments.net/reader030/viewer/2022013004/56813f47550346895da9fc0a/html5/thumbnails/14.jpg)
The Protein Folding Problem
![Page 15: Data Mining for Protein Structure Prediction](https://reader030.fdocuments.net/reader030/viewer/2022013004/56813f47550346895da9fc0a/html5/thumbnails/15.jpg)
Contact Map Mining
![Page 16: Data Mining for Protein Structure Prediction](https://reader030.fdocuments.net/reader030/viewer/2022013004/56813f47550346895da9fc0a/html5/thumbnails/16.jpg)
Contact Map
Amino acids Ai and Aj are in contact if their 3D distance is less than threshold (7å)
Sequence separation is given as |i-j| Contact map C is an N x N matrix, where
C(i,j) = 1 if Ai and Aj are in contact C(i,j) = 0 otherwise
Consider all pairs with |i-j| >= 4
![Page 17: Data Mining for Protein Structure Prediction](https://reader030.fdocuments.net/reader030/viewer/2022013004/56813f47550346895da9fc0a/html5/thumbnails/17.jpg)
Protein 2igd: 3D Structure
Anti-parallel Beta Sheets
Parallel Beta Sheets
Alpha Helix
![Page 18: Data Mining for Protein Structure Prediction](https://reader030.fdocuments.net/reader030/viewer/2022013004/56813f47550346895da9fc0a/html5/thumbnails/18.jpg)
Contact Map (2igd PDB)Anti-parallel Beta Sheets
Alpha Helix
Parallel Beta Sheets
Amino Acid Ai
Am
ino
Aci
d A
j
![Page 19: Data Mining for Protein Structure Prediction](https://reader030.fdocuments.net/reader030/viewer/2022013004/56813f47550346895da9fc0a/html5/thumbnails/19.jpg)
How much information in Amino Acids Alone: Classification Problem
A pair of amino acids (Ai,Aj) is an instance The class: C (1) or NC (0), i.e., contact or
non-contact Highly skewed class distribution
1.7% C and 98.3% NC; 300K C vs 17,3M NC Features for each instance
Ai and Aj Class: C or NC
![Page 20: Data Mining for Protein Structure Prediction](https://reader030.fdocuments.net/reader030/viewer/2022013004/56813f47550346895da9fc0a/html5/thumbnails/20.jpg)
Predicting Protein Contacts
Predict contacts for new sequence
A D T S K C PA 0 1 0 0 1 0D 0 1 0 0 1T 1 0 0 0S 0 1 0K 0 1C 0P
Ai Aj F1 FnA T .. ..A C .. ..D S .. ..D P .. ..T S .. ..S C .. ..K P .. ..
![Page 21: Data Mining for Protein Structure Prediction](https://reader030.fdocuments.net/reader030/viewer/2022013004/56813f47550346895da9fc0a/html5/thumbnails/21.jpg)
Classification via Association Mining Association mining good for skewed data Mining: Mine frequent itemsets in C data (Dc)
P(X | Dc) = Frequency(X | Dc) / |Dc| Counting: find P(X | Dnc) Pruning
Likelihood of a contact: r = P(X|Dc) / P(X|Dnc) Prune pattern X if ratio r of contact to non-contact
probability is less then some threshold i.e., keep only the patterns highly predictive of
contacts
![Page 22: Data Mining for Protein Structure Prediction](https://reader030.fdocuments.net/reader030/viewer/2022013004/56813f47550346895da9fc0a/html5/thumbnails/22.jpg)
Testing Phase 90-10 split into training and testing 2.4 million pairs, with 36K contacts (1.5%) Evidence calculation:
Find matching patterns P for each instance Compute cumulative frequency in C and NC
Sc = Sum of frequency (X | Dc) where X in P Snc = Sum of frequency (X | Dnc) where X in P
Compute evidence: ratio of Sc / Snc Prediction: Sort instances on evidence
Predict top PR fraction as contacts
![Page 23: Data Mining for Protein Structure Prediction](https://reader030.fdocuments.net/reader030/viewer/2022013004/56813f47550346895da9fc0a/html5/thumbnails/23.jpg)
Experiments
794 Proteins from Protein Data Bank Distinct structures (< 25% similarity) Longest: 907, Smallest: 35 amino acids 90-10 split for training-testing Total pairs: 20 million (> 2.5 GB) Contacts: 330 thousand (1.6%) Highly uneven class distribution
![Page 24: Data Mining for Protein Structure Prediction](https://reader030.fdocuments.net/reader030/viewer/2022013004/56813f47550346895da9fc0a/html5/thumbnails/24.jpg)
Evaluation Metrics Na: set of all pairs Na*: all pairs with positive evidence Ntc: true contacts in test data Ntc*: true contacts with positive evidence Npc: predicted contacts Ntpc: correctly predicted contacts Accuracy = Ntpc / Npc Coverage = Ntpc / Ntc Prediction Ratio (PR): Ntc*/Na* Random Predictor Accuracy: Ntc/Na
![Page 25: Data Mining for Protein Structure Prediction](https://reader030.fdocuments.net/reader030/viewer/2022013004/56813f47550346895da9fc0a/html5/thumbnails/25.jpg)
Results (Amino Acids; All Lengths)
Crossover: 7% accuracy and 7% coverage; 2 times over Random
![Page 26: Data Mining for Protein Structure Prediction](https://reader030.fdocuments.net/reader030/viewer/2022013004/56813f47550346895da9fc0a/html5/thumbnails/26.jpg)
Results (Amino Acids; by length)
1-100: 12% accuracy(A) and coverage (C); 100-170: 6% A and C
170-300: 4.5% A and C; 300+: 2% A and C
![Page 27: Data Mining for Protein Structure Prediction](https://reader030.fdocuments.net/reader030/viewer/2022013004/56813f47550346895da9fc0a/html5/thumbnails/27.jpg)
Using HMMs based on Local Motifs to Improve Classification
![Page 28: Data Mining for Protein Structure Prediction](https://reader030.fdocuments.net/reader030/viewer/2022013004/56813f47550346895da9fc0a/html5/thumbnails/28.jpg)
An HMM for Local Predictions HMMSTR (Chris Bystroff, Biology, RPI) Build a library of short sequences that tend to
fold uniquely across protein families: the I-Sites Library
Treat each motif as a Markov chain Merge the motifs into a global HMM for local
structure prediction
![Page 29: Data Mining for Protein Structure Prediction](https://reader030.fdocuments.net/reader030/viewer/2022013004/56813f47550346895da9fc0a/html5/thumbnails/29.jpg)
Training the HMM Build I-sites Library
Short sequence motifs (3 to 19) Exhaustive clustering of sequences Non-redundant PDB dataset (< 25% similarity)
Build an HMM Each of 262 motifs is a chain of Markov states Each state has sequence and structure for one
position Merge I-sites motifs hierarchically to get one global
HMM for all the motifs
![Page 30: Data Mining for Protein Structure Prediction](https://reader030.fdocuments.net/reader030/viewer/2022013004/56813f47550346895da9fc0a/html5/thumbnails/30.jpg)
HMM Output
Total of 282 States in the HMM Each state produces or “emits”:
Amino acid profile (20 probability values) Secondary structure (D) (helix, strand or loop) Backbone angles (R) (11 dihedral angle symbols) Finer structural context (C) (10 context symbols)
![Page 31: Data Mining for Protein Structure Prediction](https://reader030.fdocuments.net/reader030/viewer/2022013004/56813f47550346895da9fc0a/html5/thumbnails/31.jpg)
I-Sites Motifs (Initiation Sites)Beta Hairpin Beta to Alpha Helix C-Cap
![Page 32: Data Mining for Protein Structure Prediction](https://reader030.fdocuments.net/reader030/viewer/2022013004/56813f47550346895da9fc0a/html5/thumbnails/32.jpg)
![Page 33: Data Mining for Protein Structure Prediction](https://reader030.fdocuments.net/reader030/viewer/2022013004/56813f47550346895da9fc0a/html5/thumbnails/33.jpg)
Data Format and Preparation Take the 794 PDB proteins Compute optimal alignment to HMM
Find best state sequence for the observed acids Output probability distribution of a residue over all
the 282 HMM states Integrate the 3 datasets
Alignment probability distribution (Nx282) Amino acid and context information (D, R, C) Contact map (NxN)
![Page 34: Data Mining for Protein Structure Prediction](https://reader030.fdocuments.net/reader030/viewer/2022013004/56813f47550346895da9fc0a/html5/thumbnails/34.jpg)
HMMSTR Output (per Protein)PDB Name: 153l_
Sequence Length: 185
Position: 1Residue: RCoordinates: 0.0, -73.2, 177.6AA Profile (20 values): 0.0 1.0 0.0HMMSTR State Probabilities (282 values):0.0 0.7 0.3 0.0Distances (185 values): 0 3 5 18 15 13...Position: 185Residue: YCoordinates: -88.7 , 0.0, 0.0AA Profile (20 values): 0.0 0.4 0.6 0.0HMMSTR State Probabilities (282 values):0.0 0.2 0.5 0.3 0.0Distances (185 values): 15 13 10 5 3 0
![Page 35: Data Mining for Protein Structure Prediction](https://reader030.fdocuments.net/reader030/viewer/2022013004/56813f47550346895da9fc0a/html5/thumbnails/35.jpg)
Adding features from HMMSTR The class: C (1) or NC (0)
Highly skewed class distribution Approx 1.5% C and 98.5% NC
Features for each instance Ai Aj Di Dj Ri Rj Ci Cj Profile: pi1 pi2 … pi20 pj1 pj2 … pj20 HMM States: qi1 qi2 .. qi282 qj1 qj2 .. qj282 Class: C or NC
![Page 36: Data Mining for Protein Structure Prediction](https://reader030.fdocuments.net/reader030/viewer/2022013004/56813f47550346895da9fc0a/html5/thumbnails/36.jpg)
HMM and AA + (R,D,C) ; All Lengths
Left Crossover: 19% accuracy and coverage; 5.3 times over Random
Right Crossover (+RDC): 17% accuracy and coverage; 5 times over Random
![Page 37: Data Mining for Protein Structure Prediction](https://reader030.fdocuments.net/reader030/viewer/2022013004/56813f47550346895da9fc0a/html5/thumbnails/37.jpg)
HMM + AA + R,D,C (by length)
1-100: 30% accuracy(A) and coverage (C); 100-170: 17% A and C
170-300: 10% A and C; 300+: 6% A and C
![Page 38: Data Mining for Protein Structure Prediction](https://reader030.fdocuments.net/reader030/viewer/2022013004/56813f47550346895da9fc0a/html5/thumbnails/38.jpg)
Predicted Contact Map (2igd)
![Page 39: Data Mining for Protein Structure Prediction](https://reader030.fdocuments.net/reader030/viewer/2022013004/56813f47550346895da9fc0a/html5/thumbnails/39.jpg)
Summary of Classification Results Challenging prediction problem In essence, we have to predict a contact
matrix for a new protein Hybrid HMM/Associations approach Best results to-date: 19% overall
accuracy/coverage, 30% for short proteins 14.4% Accuracy (Fariselli, Casadio ‘99; NN) 13% Accuracy (Thomas et al ‘96) Short proteins: 26% (Olmea, Valencia, ‘97)
![Page 40: Data Mining for Protein Structure Prediction](https://reader030.fdocuments.net/reader030/viewer/2022013004/56813f47550346895da9fc0a/html5/thumbnails/40.jpg)
Mining “Physical” Dense Frequent Patterns (non-local motifs)
![Page 41: Data Mining for Protein Structure Prediction](https://reader030.fdocuments.net/reader030/viewer/2022013004/56813f47550346895da9fc0a/html5/thumbnails/41.jpg)
Characterizing Physical, Protein-like Contact Maps A very small subset of all contact maps code
for physically possible proteins (self-avoiding, globular chains)
A contact map must: Satisfy geometric constraints Represent low-energy structure
What are the typical non-local interactions? Frequent dense 0/1 submatrices in contact maps 3-step approach: 1) data generation, 2) dense
pattern mining, and 3) mapping to structure space
![Page 42: Data Mining for Protein Structure Prediction](https://reader030.fdocuments.net/reader030/viewer/2022013004/56813f47550346895da9fc0a/html5/thumbnails/42.jpg)
Dense Pattern Mining 12,524 protein-like 60 residue structures
Use HMMSTR to generate protein-like sequences Use ROSETTA to generate their structures
Monte Carlo fragment insertion (from I-sites library) Up to 5 possible low-energy structures retained
Frequent 2D Pattern Mining Use WxW sliding window; W window size Measure density under each window (N-W)^2 / 2 possible windows per N length protein Look for “minimum density”; scale away from diag Try different window sizes
![Page 43: Data Mining for Protein Structure Prediction](https://reader030.fdocuments.net/reader030/viewer/2022013004/56813f47550346895da9fc0a/html5/thumbnails/43.jpg)
Counting Dense Patterns
Naïve Approach:for W=5, N=60 there are 1485 windows per protein. Total 15 Million possible windows for 12,524 proteins Test if two submatrices are equal
Linear search: O(P x W^2) with P current dense patterns Hash based: O(W^2)
Our Approach: 2-level Hashing O(W) time
![Page 44: Data Mining for Protein Structure Prediction](https://reader030.fdocuments.net/reader030/viewer/2022013004/56813f47550346895da9fc0a/html5/thumbnails/44.jpg)
Pattern (WxW Submatrix) Encoding Encode submatrix as string (W integers)
Submatrix Integer Value
00000 0
01100 12
01000 8
01000 8
00000 0
Concatenated String: 0.12.8.8.0
![Page 45: Data Mining for Protein Structure Prediction](https://reader030.fdocuments.net/reader030/viewer/2022013004/56813f47550346895da9fc0a/html5/thumbnails/45.jpg)
String ID (M) =
Level 1 (approximate):
Level2 (exact) : h2 (M) = StringID (M)
Two-level Hashing
W
iivMh
1)(1
Wvvv ...... 21
![Page 46: Data Mining for Protein Structure Prediction](https://reader030.fdocuments.net/reader030/viewer/2022013004/56813f47550346895da9fc0a/html5/thumbnails/46.jpg)
Binding Patterns to Proteins Sequence and Structure Using window size, W=5 StringID:0.12.8.8.0, Support = 170
0000001100010000100000000
Occurrences:pdb-name (X,Y) X_sequence Y_sequence Interaction1070.0 52,30 ILLKN TFVRI alpha::beta1145.0 51,13 VFALH GFHIA alpha::strand1251.2 42,6 EVCLR GSKFG alpha::strand1312.0 54,11 HGYDE ATFAK alpha::beta1732.0 49,6 HRFAK KELAG alpha::beta2895.0 49,7 SRCLD DTIYY alpha::beta...
![Page 47: Data Mining for Protein Structure Prediction](https://reader030.fdocuments.net/reader030/viewer/2022013004/56813f47550346895da9fc0a/html5/thumbnails/47.jpg)
Frequent Dense Local PatternsSubmatrix 0 0 0 0 0 0
0 00 0 0 0 0 0
0 00 0 0 0 0 0
0 00 0 0 0 0 0
0 00 0 0 0 0 0
0 10 0 0 0 0 0
1 10 0 0 0 0 1
1 10 0 0 0 1 1
1 0
0 1 1 1 0 0 0 0
1 1 1 0 0 0 0 0
1 1 0 0 0 0 0 0
1 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0
0 1 0 0 0 0 0 0
0 0 1 0 0 0 0 0
0 0 0 1 0 0 0 0
0 0 0 0 1 0 0 0
0 0 0 0 0 1 0 0
0 0 0 0 0 0 1 0
0 0 0 0 0 0 0 1
Frequency 2.0% 2.2% 1.9%
Physical Phenomenon
antiparallel beta sheet secondary structure
antiparallel beta sheet secondary structure
parallel beta sheet secondary structure
![Page 48: Data Mining for Protein Structure Prediction](https://reader030.fdocuments.net/reader030/viewer/2022013004/56813f47550346895da9fc0a/html5/thumbnails/48.jpg)
Frequent Dense Non-Local Patterns
Alpha – Alpha Alpha – Beta Sheet
![Page 49: Data Mining for Protein Structure Prediction](https://reader030.fdocuments.net/reader030/viewer/2022013004/56813f47550346895da9fc0a/html5/thumbnails/49.jpg)
Frequent Dense Non-Local Patterns
Alpha – Beta Turn Beta Sheet – Beta Turn
![Page 50: Data Mining for Protein Structure Prediction](https://reader030.fdocuments.net/reader030/viewer/2022013004/56813f47550346895da9fc0a/html5/thumbnails/50.jpg)
Future Directions
![Page 51: Data Mining for Protein Structure Prediction](https://reader030.fdocuments.net/reader030/viewer/2022013004/56813f47550346895da9fc0a/html5/thumbnails/51.jpg)
Mining Physicality Rules
Comprehensive list of non-local motifs I-sites library catalogs local motifs
Mining heuristic rules for “physicality” Based on simple geometric constraints Rules governing contacts and non-contacts
Parallel Beta Sheets: If C(i,j) = 1 and C(i+2,j+2) = 1, then C(i,j+2) = 0 and C(i+2,j) = 0
Anti-parallel Beta Sheets: If C(i,j+2) = 1 and C(i+2,j) = 1, then C(i,j) = 0 and C(i+2,j+2) = 0
Alpha Helices: If C(i,i+4) = 1, C(i,j) = 1, and C(i+4,j) = 1, then C(i+2,j) = 0
![Page 52: Data Mining for Protein Structure Prediction](https://reader030.fdocuments.net/reader030/viewer/2022013004/56813f47550346895da9fc0a/html5/thumbnails/52.jpg)
Heuristic Rules of Physicality
i
i+2 j
j+2
If C(i,j+2) = 1 and C(i+2,j) = 1, then C(i,j) = 0 and C(i+2,j+2) = 0
If C(i,j) = 1 and C(i+2,j+2) = 1, then C(i,j+2) = 0 and C(i+2,j) = 0
Parallel Beta SheetsAnti-parallel Beta Sheets
i
i+2
j
j+2
![Page 53: Data Mining for Protein Structure Prediction](https://reader030.fdocuments.net/reader030/viewer/2022013004/56813f47550346895da9fc0a/html5/thumbnails/53.jpg)
Protein Folding Pathways
Rules for Pathways in Contact Map Space Pathway is time-ordered sequence of contacts Condensation rule: New contact within Smax
U(i,j) <= Smax; U(i,j) is unfolded residues from i to j Pathway prediction is complementary to structure
prediction
![Page 54: Data Mining for Protein Structure Prediction](https://reader030.fdocuments.net/reader030/viewer/2022013004/56813f47550346895da9fc0a/html5/thumbnails/54.jpg)
Contact Map Folding Pathways