Zhiyong Wang and Jinbo Xu Toyota Technological Institute at Chicago
PhyCMAP: Predicting protein contact map using evolutionary and physical constraints by integer programming
Zhiyong Wang and Jinbo Xu
Toyota Technological Institute at Chicago
Web server at http://raptorx.uchicago.edu
See http://arxiv.org/abs/1308.1975 for an extended version
Problem Definition
Contact: distance between two Cα or Cβ atoms < 8 Å
• short range: 6-12 AAs apart
• medium range: 12-24 AAs apart
• long range: >24 AAs apart
[Figure: example protein 1J8B]
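As a minimal sketch of the definition above, the following Python snippet marks two residues as in contact when their representative atoms are closer than 8 Å and labels a pair by its sequence separation. The function names and the coordinate format are hypothetical, and the handling of the boundary values (12 and 24 are both assigned to "medium") is an assumption, since the slide lists overlapping ranges.

```python
import math

CONTACT_CUTOFF = 8.0  # Angstroms, as in the definition above


def is_contact(a, b, cutoff=CONTACT_CUTOFF):
    """Return True if two atoms, given as (x, y, z) tuples, are closer than the cutoff."""
    return math.dist(a, b) < cutoff


def range_class(i, j):
    """Classify a residue pair by sequence separation |i - j|.

    Assumption: 6-11 is short, 12-24 is medium, >24 is long.
    """
    sep = abs(i - j)
    if sep < 6:
        return None  # closer in sequence than any listed range
    if sep < 12:
        return "short"
    if sep <= 24:
        return "medium"
    return "long"


def contact_map(coords):
    """Return the set of residue pairs (i, j), i < j, that are in contact."""
    n = len(coords)
    return {(i, j) for i in range(n) for j in range(i + 1, n)
            if is_contact(coords[i], coords[j])}
```

Here `coords` would hold one Cα or Cβ position per residue.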
Existing Work
Residue co-evolution methods: mutual information (MI), PSICOV, Evfold
• Need a large number of homologous sequences
• PSICOV and Evfold perform better than MI because they differentiate direct and indirect residue couplings (residues A and C are indirectly coupled if their coupling is due to direct A-B and B-C couplings)
• PSICOV and Evfold also enforce sparsity
Supervised learning methods: NNcon, SVMcon, CMAPpro
• Use mutual information, sequence profile and other features
• Predict contacts one by one, ignoring their correlation
• Do not differentiate direct and indirect residue couplings
First-principle method: Astro-Fold
• No evolutionary information
• Minimize contact potential
• Enforce physical feasibility, including sparsity
Our Method: PhyCMAP
1. Focus on proteins with few sequence homologs
– proteins with many sequence homologs very likely have similar templates in PDB
2. Integrate sequence profile, residue co-evolution and non-evolutionary info by machine learning
– (implicitly) differentiate direct and indirect residue couplings through feature engineering
3. Enforce physical constraints, which imply sparsity
Info used by Random Forests
• Evolution info from a single protein family
– sequence profile
– co-evolution: 2 types of mutual information (MI)
• Non-evolution info from the whole structure space: residue contact potential
• Mixed info from the above 2 sources
– homologous pairwise contact score
– EPAD: context-specific, evolutionary-based, distance-dependent statistical potential
• Amino acid physicochemical properties
Mutual Information
1. Contrastive Mutual Information (CMI): remove local background by measuring the MI difference of one pair with its neighbors.
2. Chaining effect of residue couplings: MI, MI^2, MI^3, MI^4, equivalent to (1-MI), (1-MI)^2, (1-MI)^3, (1-MI)^4 (see http://arxiv.org/abs/1308.1975 for more details)
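To make the first item concrete, here is a small Python sketch that computes plain MI between alignment columns and then a contrastive variant. The MI formula is the standard one; the exact form of CMI is not given on the slide, so the version below (sum of squared differences between a pair and its four neighboring pairs) is an assumption, and all function names are hypothetical.

```python
import math
from collections import Counter


def mutual_information(col_i, col_j):
    """MI (in nats) between two alignment columns given as equal-length strings."""
    n = len(col_i)
    fi, fj = Counter(col_i), Counter(col_j)
    fij = Counter(zip(col_i, col_j))
    # sum over observed pairs of f(a,b) * log(f(a,b) / (f(a) * f(b)))
    return sum((c / n) * math.log(c * n / (fi[a] * fj[b]))
               for (a, b), c in fij.items())


def mi_matrix(msa):
    """MI for every pair of columns of an MSA (list of equal-length strings)."""
    cols = ["".join(col) for col in zip(*msa)]
    size = len(cols)
    return [[mutual_information(cols[i], cols[j]) for j in range(size)]
            for i in range(size)]


def contrastive_mi(mi):
    """Assumed CMI: squared MI differences between a pair and its 4 grid neighbors."""
    size = len(mi)
    cmi = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ni, nj = i + di, j + dj
                if 0 <= ni < size and 0 <= nj < size:
                    cmi[i][j] += (mi[i][j] - mi[ni][nj]) ** 2
    return cmi
```

A pair whose MI equals that of all its neighbors gets a CMI of zero under this form, which is the sense in which the local background is removed.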
CMI Example: 1J8B
• Upper triangle: mutual information
• Lower triangle: contrastive mutual information
• Blue boxes: native contacts
Homologous Pairwise Contact Score
Probability of a residue pair forming a contact between 2 secondary structures.
PS_beta(a, b): probability of two AAs a and b forming a beta contact
PS_helix(a, b): probability of two AAs a and b forming a helix contact
H: the set of sequence homologs in a multiple sequence alignment
$$PS(i, j) = \frac{1}{|H|} \left( \sum_{h \in H} PS_{beta}(h_i, h_j) \ \text{or} \ \sum_{h \in H} PS_{helix}(h_i, h_j) \right)$$
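The score above is an average over the homologs in the alignment, which can be sketched in a few lines of Python. The function name, the dictionary layout of the probability table and the treatment of missing pairs (such as gaps) are assumptions; the table values in the usage example are made up.

```python
def homologous_pairwise_score(msa, i, j, ps_table):
    """Average pairwise contact probability of columns i and j over all homologs.

    msa      -- list of aligned sequences (the set H)
    ps_table -- dict mapping an amino-acid pair (a, b) to PS_beta or PS_helix
    Assumption: pairs missing from the table (e.g. gaps) contribute 0.
    """
    total = sum(ps_table.get((h[i], h[j]), 0.0) for h in msa)
    return total / len(msa)


# Toy usage with made-up probabilities
toy_table = {("V", "I"): 0.4, ("A", "L"): 0.2}
toy_msa = ["VAI", "AAL", "-AI"]
score = homologous_pairwise_score(toy_msa, 0, 2, toy_table)
```

Whether the beta or the helix table is passed in would depend on the predicted secondary structure of the two positions.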
Training Random Forests
• Training dataset
– Chosen before CASP10 started
– 900 non-redundant protein structures
– <25% sequence identity
– All contacts and 20% of non-contacts
• Model parameters
– Number of features: 300
– Number of trees: 500
– 5-fold cross validation
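The "all contacts and 20% of non-contacts" choice can be illustrated with a short sampling routine. This is only a sketch: the function name, the label encoding (1 for contact, 0 for non-contact) and the use of independent random draws per non-contact are assumptions, not a description of how the training set was actually built.

```python
import random


def subsample_training_pairs(pairs, labels, neg_fraction=0.2, seed=0):
    """Keep every contact (label 1) and a random fraction of non-contacts (label 0)."""
    rng = random.Random(seed)
    kept = []
    for pair, label in zip(pairs, labels):
        if label == 1 or rng.random() < neg_fraction:
            kept.append((pair, label))
    return kept
```

Because non-contacts vastly outnumber contacts in a protein, down-sampling them keeps the two classes closer to balanced during training.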
Select Physically Feasible Contacts by Integer Linear Programming
Maximize the accumulated contact probability while minimizing the violation of physical constraints:
$$\max_{X,R} \sum_{i,j:\ j-i \ge 6} X_{i,j} P_{i,j} - g(R)$$
• X_{i,j}: indicates a contact between two residues i and j
• P_{i,j}: contact probability of residues i and j
• R_r: a relaxation variable of the r-th soft constraint
• g(R): penalty for violation of physical constraints
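Soft Constraints 1
The number of contacts between two secondary structure segments is limited:
$$\sum_{i:\ SStype(i)=s_1} \ \sum_{j:\ SStype(j)=s_2} X_{i,j} - R_1 \le b_{s_1,s_2}$$
Bounds by secondary structure type pair (H: helix, E: strand, C: coil):
s1,s2   95%   Max
H,H     5     12
H,E     3     10
H,C     4     11
E,H     4     12
E,E     9     13
E,C     6     15
C,H     3     12
C,E     5     12
C,C     6     20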
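The bounds listed above can be turned into a small lookup that reports how much relaxation a given contact count would need. The numbers are copied from the slide; treating the 95% column as the soft bound and the Max column as a hard cap is an assumption, and the names are hypothetical.

```python
# (s1, s2) -> (95% bound, maximum), copied from the slide
SEGMENT_BOUNDS = {
    ("H", "H"): (5, 12), ("H", "E"): (3, 10), ("H", "C"): (4, 11),
    ("E", "H"): (4, 12), ("E", "E"): (9, 13), ("E", "C"): (6, 15),
    ("C", "H"): (3, 12), ("C", "E"): (5, 12), ("C", "C"): (6, 20),
}


def relaxation_needed(n_contacts, s1, s2):
    """Smallest non-negative R such that n_contacts - R <= the 95% bound.

    Returns None if n_contacts exceeds the maximum
    (assumption: the maximum acts as a hard cap).
    """
    b95, bmax = SEGMENT_BOUNDS[(s1, s2)]
    if n_contacts > bmax:
        return None
    return max(0, n_contacts - b95)
```

For example, 7 contacts between two helices would need a relaxation of 2 under this reading, and that relaxation would then be charged through the penalty g(R).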
Soft Constraints 2
Upper and lower bounds for the number of contacts between two beta strands u and v:
$$\sum_{i \in SSeg(u),\ j \in SSeg(v)} X_{i,j} + R_2 \ge 3 \, S_{u,v} \min(Len(u), Len(v))$$
$$\sum_{i \in SSeg(u),\ j \in SSeg(v)} X_{i,j} - R_3 \le 3.3 \max(Len(u), Len(v))$$
Soft Constraints 3
Statistics show that only 3.4% of loop segments have a contact between their start and end residues.
Hard Constraints 1
• For parallel contacts between two β strands, the contacts of neighboring residue pairs should satisfy the following constraint
$$X_{i,j} \ge X_{i-1,j-1} + X_{i+1,j+1} - 1$$
• For anti-parallel contacts
$$X_{i,j} \ge X_{i-1,j+1} + X_{i+1,j-1} - 1$$
Hard Constraints 2
1) A residue i cannot form contacts with both j and j+2 when j and j+2 are in the same alpha helix
$$X_{i,j} + X_{i,j+2} \le 1$$
2) One beta-strand can form beta-sheets with up to 2 other beta-strands.
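To show how a hard constraint shapes the selection, the sketch below solves a tiny instance by exhaustive search instead of an integer programming solver. It maximizes the summed probability of the selected pairs subject only to the helix rule (no residue contacts both j and j+2 of the same helix); soft constraints and the penalty g(R) are left out. The function names, the probabilities and the helix position are all made up for illustration.

```python
from itertools import product


def select_contacts(probs, same_helix):
    """Exhaustively solve a tiny contact-selection instance.

    probs      -- dict {(i, j): contact probability}
    same_helix -- function (a, b) -> True if residues a and b lie in one helix
    Enforces X[i,j] + X[i,j+2] <= 1 when j and j+2 share a helix.
    """
    pairs = sorted(probs)
    best_score, best = float("-inf"), set()
    for bits in product((0, 1), repeat=len(pairs)):
        chosen = {p for p, b in zip(pairs, bits) if b}
        feasible = all(
            not ((i, j + 2) in chosen and same_helix(j, j + 2))
            for (i, j) in chosen
        )
        if not feasible:
            continue
        score = sum(probs[p] for p in chosen)
        if score > best_score:
            best_score, best = score, chosen
    return best, best_score


probs = {(0, 10): 0.9, (0, 12): 0.6, (3, 20): 0.3}      # made-up probabilities
in_helix = lambda a, b: 8 <= a <= 14 and 8 <= b <= 14   # hypothetical helix at residues 8-14
selected, score = select_contacts(probs, in_helix)
```

In this toy case residue 0 cannot contact both 10 and 12, so the lower-probability pair (0, 12) is dropped. Exhaustive search is exponential in the number of candidate pairs, which is why a real instance needs an integer programming solver.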
Test Datasets
• CASP10: 123 proteins
– 36 are "hard", i.e., no similar templates in PDB
– low sequence identity (<25%) among them
– low sequence identity with the training data, which were chosen before CASP10 started
• Set600: 601 proteins
– share <25% sequence identity with the training proteins
– each has ≥50 AAs and an X-ray structure with resolution <1.9 Å
– each has ≥5 AAs with predicted secondary structure being alpha-helix or beta-strand
Accuracy w.r.t. #sequence homologs
1. Meff: the number of non-redundant sequence homologs of a protein
2. Divide the CASP10 targets into groups by Meff
3. Top L/10 predicted medium- and long-range contacts
[Figure: accuracy vs. log Meff]
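The "top L/k" accuracy used throughout the results can be computed as in the sketch below: rank the predicted pairs by probability, keep the top L/k, and report the fraction that are native contacts. Rounding L/k down (with a minimum of one pair), the dictionary input format and the function name are assumptions.

```python
def top_accuracy(probs, native, seq_len, k=10, min_sep=12):
    """Fraction of the top seq_len/k predicted pairs that are native contacts.

    probs   -- dict {(i, j): predicted probability}
    native  -- set of native contact pairs (i, j)
    min_sep -- minimum sequence separation (12 keeps medium- and long-range pairs)
    """
    candidates = [(p, pair) for pair, p in probs.items()
                  if abs(pair[0] - pair[1]) >= min_sep]
    candidates.sort(reverse=True)  # highest probability first
    n_top = max(1, seq_len // k)
    top = [pair for _, pair in candidates[:n_top]]
    if not top:
        return 0.0
    return sum(pair in native for pair in top) / len(top)
```

Passing `k=5` gives the top L/5 accuracy reported on the following slides.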
Results on CASP10 – Medium Range
Overall accuracy on top L/5 predicted Cβ contacts: PhyCMAP 0.465, CMAPpro 0.370, PSICOV 0.316
[Figure: PhyCMAP compared with CMAPpro and with PSICOV]
Results on CASP10 – Long Range
Overall accuracy on top L/5 predicted Cβ contacts: PhyCMAP 0.373, CMAPpro 0.313, PSICOV 0.315
[Figure: PhyCMAP compared with CMAPpro and with PSICOV]
Results on 36 hard CASP10 targets
Accuracy on top L/5 predicted medium- and long-range Cβ contacts: PhyCMAP 0.363, CMAPpro 0.308, PSICOV 0.180
[Figure: PhyCMAP compared with CMAPpro and with PSICOV]
Results on Set600 with few homologs (Meff ≤ 100)
Accuracy on top L/5 predicted medium- and long-range Cβ contacts: PhyCMAP 0.345, CMAPpro 0.287, PSICOV 0.059
[Figure: PhyCMAP compared with CMAPpro and with PSICOV]
Example: T0677-D2
Dozens of sequence homologs (Meff = 31)
Upper triangle: native Cβ contacts
Left lower triangle: PhyCMAP, accuracy 0.357
Right lower triangle: Evfold, accuracy ~0
Note that contacts between alpha helices are not continuous
Example: T0693-D2
Many sequence homologs (Meff = 2208)
Upper triangle: native Cβ contacts
Left lower triangle: PhyCMAP, accuracy 0.744
Right lower triangle: Evfold, accuracy 0.419
Example: T0701-D1
Many sequence homologs (Meff = 3300)
Upper triangle: native Cβ contacts
Left lower triangle: PhyCMAP, accuracy 0.794
Right lower triangle: Evfold, accuracy 0.444
Example: T0756-D1
Many sequence homologs (Meff = 1824)
Upper triangle: native Cβ contacts
Left lower triangle: PhyCMAP, accuracy 0.944
Right lower triangle: Evfold, accuracy 0.500
Summary
Combining sequence profile, residue co-evolution and non-evolutionary info can yield good accuracy even for proteins with only 10-100 non-redundant sequence homologs
Physical constraints are helpful for proteins with few sequence homologs
[Figure: Cβ accuracy on 130 proteins with Meff ≤ 100, with vs. without physical constraints, for top L/10 and L/5 short-range and medium- and long-range contacts]
Acknowledgements
• Student: Zhiyong Wang
• Funding
– NIH R01GM0897532
– NSF CAREER award
– Alfred P. Sloan Research Fellowship
• Computational resources
– University of Chicago Beagle team
– TeraGrid
Web server at http://raptorx.uchicago.edu
Protein contact
Contact: distance between two Cα or Cβ atoms < 8 Å, or distance between the closest atoms of 2 residues.
• short range: 6-12 AAs apart
• medium range: 12-24 AAs apart
• long range: >24 AAs apart
[Figure: example protein 1J8B]
Why contact prediction?
• Contacts describe spatial and functional relationships of residues
• Contain key information for 3D structure
• Useful for protein structure prediction
• Used for protein structure alignment and classification
Contrastive Mutual Information
Contrastive Mutual Information (CMI) removes local background, by measuring the MI difference between one pair of residues and neighboring pairs.
Integer Linear Programming
• Objective function:
$$\max_{X,R} \sum_{i,j:\ j-i \ge 6} X_{i,j} P_{i,j} - g(R)$$
• g(R): penalty for violation of physical constraints
Variable   Explanation
X_{i,j}    equal to 1 if there is a contact between two residues i and j.
AP_{u,v}   equal to 1 if two beta-strands u and v form an anti-parallel beta-sheet.
P_{u,v}    equal to 1 if two beta-strands u and v form a parallel beta-sheet.
S_{u,v}    equal to 1 if two beta-strands u and v form a beta-sheet.
T_{u,v}    equal to 1 if there is an alpha-bridge between two helices u and v.
R_r        a non-negative integral relaxation variable of the r-th soft constraint.
Hard Constraints 3
One beta-strand can form beta-sheets with up to 2 other beta-strands:
$$\sum_{v:\ SStype(v)=beta} S_{u,v} \le 2$$
Global constraints
• Antiparallel and parallel contacts
$$AP_{u,v} + P_{u,v} = S_{u,v}$$
• A residue contact implies a segment-wise contact
$$X_{i,j} \le S_{u,v}, \quad i \in SSeg(u),\ j \in SSeg(v)$$
• Put a limit on the total number of contacts
$$\sum_{1 \le i < j \le L,\ j-i \ge 6} X_{i,j} \le k$$
– k is the number of top contacts we want to predict.
Results on Set600 with many sequence homologs (Meff > 100)
Accuracy on top L/5 predicted medium- and long-range Cβ contacts: PhyCMAP 0.611, CMAPpro 0.515, PSICOV 0.569
[Figure: PhyCMAP compared with CMAPpro and with PSICOV]
Contribution of HPS and CMI features
Average Cβ accuracy on the 471 proteins with Meff > 100
[Figure: accuracy with vs. without CMI and HPS, for top L/10 and L/5 short-range and medium- and long-range contacts]
Contribution of physical constraints
Average Cβ accuracy on 130 proteins with Meff ≤ 100
[Figure: accuracy with vs. without physical constraints, for top L/10 and L/5 short-range and medium- and long-range contacts]