Identification of amino acid residues in protein-protein interaction interfaces using machine learning and a comparative analysis of the generalized sequence- and structure-based features employed

Angshuman Bagchi, Ph.D

Assistant Professor of Biochemistry

Department of Biochemistry and Biophysics, University of Kalyani

Formerly postdoctoral fellow in Buck Institute, Stanford University, California, USA

Purdue University, Indianapolis, USA

Email: [email protected]

Importance of protein-protein interactions (PPIs)

• Crucial for understanding biological pathways such as cell signalling

• Dysfunctional PPIs may lead to disease

• Important targets for therapy

http://nrc.bu.edu/cluster/

Aim of the Present Research

• To extract features of PPIs from known protein-protein hetero-complex structures and to use those features to predict PPI residues with machine learning tools

• To build machine learning (Support Vector Machine and Random Forest) classifiers with the help of the training dataset

• To set up an online server to predict PPI residues from protein sequence and structural information

• To build a web service plug-in for UCSF Chimera to visualize the PPI residues

Overview of Support Vector Machine (SVM)

• A support vector machine (SVM) is one of a set of related supervised learning methods that analyze data and recognize patterns; SVMs are used for classification and regression analysis.

• Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other.

• An SVM model represents the examples as points in space, mapped so that the examples of the two categories are divided by a clear gap that is as wide as possible.

• New examples are then mapped into the same space and predicted to belong to a category according to the side of the gap on which they fall.
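The decision rule described above can be sketched in a few lines. This is only an illustration, assuming a linear model that has already been trained: the weight vector `w`, the offset `b`, and the two example points are hypothetical and do not come from the study.

```python
import math

def decision_value(w, b, x):
    """Signed score w.x + b; its sign tells which side of the hyperplane x lies on."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def predict(w, b, x):
    """Assign class +1 or -1 according to the side of the separating hyperplane."""
    return 1 if decision_value(w, b, x) >= 0 else -1

def margin_width(w):
    """Width of the gap between the margin hyperplanes w.x + b = +1 and w.x + b = -1."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

# Hypothetical, already-trained linear model: the hyperplane x1 + x2 = 3.
w, b = (1.0, 1.0), -3.0

# Two hypothetical new examples, one on each side of the gap.
new_examples = [(2.5, 2.5), (0.5, 0.5)]
labels = [predict(w, b, x) for x in new_examples]
print(labels, round(margin_width(w), 3))
```

Training an SVM amounts to choosing `w` and `b` so that `margin_width(w)` is as large as possible while the training examples stay on their correct sides.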

Overview of Random Forest (RF)

• A Random Forest (RF) is an ensemble classifier that consists of many decision trees.

• Given a set of training examples, it generates randomized decision trees. The output of the forest is the class that receives the most votes from the individual trees.

• RF can estimate the importance of the variables.

• It handles missing data efficiently.
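The voting step can be illustrated as follows. This is a minimal sketch and not the R implementation used in the study: each "tree" is reduced to a hypothetical one-split stump, and the feature indices and thresholds are made up for the example.

```python
from collections import Counter

def stump(feature_index, threshold):
    """Return a one-split 'tree' that votes 1 if the chosen feature exceeds the threshold."""
    def tree(x):
        return 1 if x[feature_index] > threshold else 0
    return tree

def forest_predict(trees, x):
    """Majority vote: the predicted class is the one chosen by the most trees."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]

# Three hypothetical trees (an odd number avoids tied votes for two classes).
trees = [stump(0, 0.5), stump(1, 0.3), stump(0, 0.7)]

residue = (0.6, 0.4)                        # hypothetical feature vector
votes = [tree(residue) for tree in trees]   # individual tree votes
prediction = forest_predict(trees, residue)
print(votes, prediction)
```

In a real forest each tree is grown on a bootstrap sample of the training data with a random subset of features considered at each split; only the final vote is shown here.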

Assumptions employed

• Surface residue: An amino acid with its accessible surface area (ASA) > 15% of its total area

• Interface residue: A surface residue with at least one heavy atom located within 5 Å of any heavy atom of its interacting partner

• Dataset: 274 high resolution X-ray hetero-complex structure files with 10597 interface residues (+ve) and 27333 non-interface surface residues (-ve) (Jo-Lan et al., Proteins, 2006)
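The two residue definitions above translate directly into code. The sketch below assumes that heavy-atom coordinates (in Å) and ASA values have already been extracted from the structure files; the function names, the coordinates, and the area values are hypothetical, and "within 5 Å" is read here as a distance of at most 5 Å.

```python
import math

SURFACE_RELATIVE_ASA = 0.15   # surface residue: ASA > 15% of total area
INTERFACE_CUTOFF = 5.0        # interface residue: heavy atom within 5 angstroms

def is_surface(asa, total_area):
    """True if the accessible surface area exceeds 15% of the residue's total area."""
    return asa / total_area > SURFACE_RELATIVE_ASA

def is_interface(residue_atoms, partner_atoms, asa, total_area):
    """True for a surface residue with any heavy atom within the cutoff of a partner heavy atom."""
    if not is_surface(asa, total_area):
        return False
    return any(
        math.dist(a, p) <= INTERFACE_CUTOFF
        for a in residue_atoms
        for p in partner_atoms
    )

# Hypothetical heavy-atom coordinates (hydrogens assumed to be already excluded).
partner = [(6.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
exposed_near = is_interface([(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)], partner, asa=40.0, total_area=150.0)
exposed_far = is_interface([(0.0, 0.0, 0.0)], partner, asa=40.0, total_area=150.0)
buried_near = is_interface([(1.5, 0.0, 0.0)], partner, asa=10.0, total_area=150.0)
print(exposed_near, exposed_far, buried_near)
```

The buried residue is rejected even though it is close to the partner chain, because the interface definition applies only to surface residues.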

Features

• Sequence based: Obtained from sequence conservation using PSI-BLAST

• Structure based (secondary structure, charge, solvent accessibility, B-factor, etc.): Obtained using S-BLEST (Mooney et al., Proteins, 2005), DSSP (Kabsch & Sander, Biopolymers, 1983) and PDB files

Development of PPI predictor

The dataset was divided into the following two categories, each with an equal number of PPI (positive) and non-PPI (negative) examples. This balanced dataset was used for training.
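One common way to obtain equal numbers of positive and negative examples is to undersample the larger class at random. The slides do not say how the balancing was done, so the sketch below is an assumption; the function name and the placeholder items are hypothetical, and only the class sizes are taken from the dataset described earlier.

```python
import random

def balance_by_undersampling(positives, negatives, seed=0):
    """Randomly keep as many examples of each class as the smaller class contains."""
    rng = random.Random(seed)          # fixed seed so the sample is reproducible
    n = min(len(positives), len(negatives))
    return rng.sample(positives, n), rng.sample(negatives, n)

# Placeholder items standing in for residues; only the class sizes come from the slides.
positives = [("interface", i) for i in range(10597)]
negatives = [("non-interface", i) for i in range(27333)]

pos_sample, neg_sample = balance_by_undersampling(positives, negatives)
print(len(pos_sample), len(neg_sample))
```

Sampling without replacement guarantees that no negative example is used twice in the balanced set.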

Development of PPI predictor (continued)

• The RF package in R and the LibSVM package were used to implement separate RF and SVM predictors using each of the aforementioned datasets with 10-fold cross-validation.

• Two SVM predictors, one using a linear kernel and the other using a Radial Basis Function (RBF) kernel, were created from each dataset.

• Throughout the experiments, the default values of the regularization parameter (C) and of γ were used for the linear and RBF kernel SVMs.

• For RF, we generated 1000 trees, keeping the other parameters at their default values.
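The 10-fold cross-validation scheme mentioned above can be sketched independently of any classifier. This is a generic illustration and not the procedure built into the R or LibSVM tools; the function name, the seed, and the sample count are hypothetical.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists; every sample is in the test set of exactly one fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)            # shuffle once, reproducibly
    folds = [indices[i::k] for i in range(k)]       # k nearly equal-sized folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# Hypothetical dataset of 25 examples split into 10 folds.
splits = list(k_fold_indices(25, k=10))
for train, test in splits:
    # A classifier would be trained on `train` and evaluated on `test` here.
    print(len(train), len(test))
```

Performance figures are then averaged over the ten held-out folds, so each example contributes to the evaluation exactly once.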

Best features ranked on the basis of their AUC

Description (in rank order)                                     AUC
B-factor                                                        0.91
PSSM                                                            0.85
Frequency of Lys residues in a 20 amino acid sequence window    0.83
Solvent accessibility                                           0.80
Number of neighboring charged residues (Arg, Asp, Glu, Lys)     0.78
Acidic residue                                                  0.75
Atomic charge                                                   0.71
Hydrophobicity                                                  0.70

AUC: Area under the Receiver Operating Characteristic (ROC) curve
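For a single feature, the AUC equals the probability that a randomly chosen positive example has a higher feature value than a randomly chosen negative one, with ties counted as half. The sketch below computes it that way; the feature values are hypothetical and the slides do not state how the authors computed their AUCs.

```python
def auc(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical values of one feature for interface (+) and non-interface (-) residues.
interface_values = [0.9, 0.8, 0.4]
non_interface_values = [0.7, 0.3, 0.2]

feature_auc = auc(interface_values, non_interface_values)
print(round(feature_auc, 3))
```

An AUC of 0.5 corresponds to a feature with no discriminating power, and 1.0 to a feature that separates the two classes perfectly. The pairwise loop is quadratic in the number of examples; a rank-based formula would be preferable for a dataset of the size used here.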

Machine learning results

Method       Accuracy (%)   Sensitivity (%)   Specificity (%)   AUC
SVM linear   60.5           57.9              63.1              0.63
SVM RBF      58.9           51.6              66.3              0.59
RF           76.7           74.8              78.7              0.77

TPR = True Positive Rate, FPR = False Positive Rate

Dataset used: sequence (interface residues as positives; all non-interface surface and core residues as negatives)
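The three percentage columns follow from the confusion-matrix counts in the standard way. In the sketch below the counts are hypothetical, chosen only so that the resulting percentages resemble the RF row; they are not the study's actual counts.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return (accuracy, sensitivity, specificity) as percentages."""
    sensitivity = 100.0 * tp / (tp + fn)              # true positive rate
    specificity = 100.0 * tn / (tn + fp)              # true negative rate
    accuracy = 100.0 * (tp + tn) / (tp + fn + tn + fp)
    return accuracy, sensitivity, specificity

# Hypothetical counts for a balanced test set of 1000 positives and 1000 negatives.
acc, sens, spec = classification_metrics(tp=748, fn=252, tn=787, fp=213)
print(round(acc, 2), round(sens, 1), round(spec, 1))
```

When the numbers of positives and negatives are equal, accuracy is the mean of sensitivity and specificity, which gives a quick consistency check on each row of the table.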

Machine learning results (continued)

Method       Accuracy (%)   Sensitivity (%)   Specificity (%)   AUC
SVM linear   53.3           22.7              83.9              0.53
SVM RBF      50.2           70.7              29.6              0.50
RF           69.3           67.3              71.3              0.70

Method       Accuracy (%)   Sensitivity (%)   Specificity (%)   AUC
SVM linear   57             47.1              66.6              0.57
SVM RBF      57.4           49.3              65.5              0.57
RF           70.7           66.3              75.1              0.71

Dataset used: structure (interface residues as positives; non-interface surface residues as negatives)

Dataset used: sequence (interface residues as positives; non-interface surface residues as negatives)

Case Study

Top-scoring amino acid residues from the crystal structure of the antibody N10-staphylococcal nuclease complex (PDB ID: 1NSN). The backbone of antibody N10 is shown in black, and the staphylococcal nuclease is shown as a cyan surface. The top-scoring amino acid residues are highlighted.

Conclusion

• We have developed and evaluated several classification models (RF, SVM-linear and SVM-RBF) for identifying PPI interfaces, using either a combination of sequence- and structure-based features or sequence-based features alone.

• The wider application of our classifier could have important consequences for the prediction, prognosis and treatment of inherited disease states brought about by disruption of PPI sites.

• Since we have developed a sequence-only predictor for PPI interface prediction, researchers can use our method to get a quick idea of the probable function of a protein for which no structure is available.

• Finally, we have constructed a web resource that can be used for the prediction of PPI sites using either sequence alone, or structure and sequence together. This resource can be found at http://www.sblest.org/ppi

Acknowledgement