Protein-Protein Interaction using SVM based kernel,Jacob Coefficient and Gene Ontology

SVM based approach to assess the reliability of protein-protein interactions. Meher Preethi Boorgula, Ronak Shah, Neerja Katiyar, George Mason University

description

Protein-protein interactions discovered by existing high-throughput techniques contain a very high number of false positives. Here we present an SVM based approach that builds a model from sequence and non-sequence based information about the interacting proteins. This model is used to assess the reliability of given protein-protein interactions. It was run on interaction data for a pathogenic bacterium, Treponema pallidum (which causes syphilis in humans), obtained from yeast two-hybrid experiments. Various kernels were used to build the model; of these, the sigmoid kernel performed best when used with all the features combined, with an area under the receiver operating characteristic (ROC) curve of 0.53.

Transcript of Protein-Protein Interaction using SVM based kernel,Jacob Coefficient and Gene Ontology

Page 1: Protein-Protein Interaction using SVM based kernel,Jacob Coefficient and Gene Ontology

SVM based approach to assess the reliability of protein-protein interactions

Meher Preethi Boorgula, Ronak Shah, Neerja Katiyar

George Mason University

Page 2: Protein-Protein Interaction using SVM based kernel,Jacob Coefficient and Gene Ontology

Motivation:

• Protein interactions play a key role in many cellular processes.

• Distortion of protein interfaces may lead to the development of many diseases.

• Reliable protein-protein interactions (PPIs) that are conserved among different species and involved in diseases would be very helpful for researchers.

Page 3: Protein-Protein Interaction using SVM based kernel,Jacob Coefficient and Gene Ontology

Problem Statement:

• Protein-protein interactions (PPIs) are very helpful in the functional annotation of proteins. It is important that the PPI data are reliable.

• Thus, we try to predict the reliability of PPIs with respect to a disease-causing bacterium.

Page 4: Protein-Protein Interaction using SVM based kernel,Jacob Coefficient and Gene Ontology

Objective:

• To create a prediction model based on a kernel method (SVM) to assess the reliability of PPIs in Treponema pallidum obtained from the yeast two-hybrid (Y2H) system.

• To classify the interactions as reliable or not reliable.

Page 5: Protein-Protein Interaction using SVM based kernel,Jacob Coefficient and Gene Ontology

Introduction:

• Protein-protein interactions can be identified with high-throughput techniques such as yeast two-hybrid (Y2H) and mass spectrometry (MS).

• The main disadvantage of these existing techniques is the number of false positives in the data obtained.

• So, assessing the reliability of PPIs is necessary.

Page 6: Protein-Protein Interaction using SVM based kernel,Jacob Coefficient and Gene Ontology

Methodology:

1) Prepare the data sets

2) Extract the attributes

3) Create and test the model using SVM Light

4) Evaluate the performance of the model

5) Analyze the reliability of the PPI data sets

Page 7: Protein-Protein Interaction using SVM based kernel,Jacob Coefficient and Gene Ontology

Datasets:

• Raw interaction data were obtained from Y2H experiments performed at the J. Craig Venter Institute.

• These data were then organized into train and test sets with an equal number of positive and negative examples.

• Positive – high-confidence data

• Negative – low-confidence data
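The slides do not say how the balanced train and test sets were drawn. As a minimal sketch, assuming random downsampling of the larger class and an even split of each class (both assumptions, along with the function name and the +1/-1 labels), the balancing step might look like this:

```python
import random

def balanced_split(positives, negatives, seed=0):
    """Downsample the larger class, then split each class in half.

    Hypothetical helper: the sampling scheme and the 50/50 split are
    assumptions, not the procedure described in the slides.
    """
    rng = random.Random(seed)
    n = min(len(positives), len(negatives))
    pos = rng.sample(list(positives), n)  # n positives, random order
    neg = rng.sample(list(negatives), n)  # n negatives, random order
    half = n // 2
    train = [(pair, +1) for pair in pos[:half]] + [(pair, -1) for pair in neg[:half]]
    test = [(pair, +1) for pair in pos[half:]] + [(pair, -1) for pair in neg[half:]]
    return train, test
```

Fixing the seed keeps the split reproducible, which matters when several kernels are compared on the same data.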

Page 8: Protein-Protein Interaction using SVM based kernel,Jacob Coefficient and Gene Ontology

Dataset (Contd…)

• All interactions = 2993

• High confidence = 721

• Common interactions = 66

• Total (excluding common) = 2993 + 721 − 66 = 3648

• Train and test datasets were made by taking 1824 interactions.

Page 9: Protein-Protein Interaction using SVM based kernel,Jacob Coefficient and Gene Ontology

Extracting Attributes:

• Attributes chosen include:

- Sequence based:

  i. Occurrence of 5-mers in the sequence data

  ii. Hydrophobicity

- Non-sequence based:

  i. Jaccard coefficient

  ii. GO annotation

Page 10: Protein-Protein Interaction using SVM based kernel,Jacob Coefficient and Gene Ontology

Hydrophobicity:

• Protein interaction depends on the nature of the active/binding site.

• A hydrophobicity profile was used to extract this feature.

• The average hydropathy of a sequence was calculated from the hydrophobicity of each amino acid residue.

• This was obtained using the tool "Protein GRAVY".
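The average hydropathy described above can be sketched in a few lines. This sketch assumes the Kyte-Doolittle hydropathy scale and skips non-standard residue codes; the function name is hypothetical and this is not the Protein GRAVY tool's own code.

```python
# Kyte-Doolittle hydropathy values per amino acid (assumed scale).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def gravy(sequence):
    """Average hydropathy: sum of residue values divided by residue count."""
    values = [KYTE_DOOLITTLE[aa] for aa in sequence.upper() if aa in KYTE_DOOLITTLE]
    if not values:
        raise ValueError("sequence contains no standard amino acid residues")
    return sum(values) / len(values)
```

A positive value indicates a more hydrophobic sequence on average, a negative value a more hydrophilic one.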

Page 11: Protein-Protein Interaction using SVM based kernel,Jacob Coefficient and Gene Ontology

Jaccard coefficient:

• In a PPI network, the neighbors of interacting proteins also tend to interact.

Jaccard coefficient:

J(u, v) = |N(u) ∩ N(v)| / |N(u) ∪ N(v)|

where u and v are the interacting proteins and N(X) is the set of neighbors of protein X in the PPI network.
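As an illustration, the Jaccard coefficient (shared neighbors divided by all neighbors of either protein) could be computed from a list of interacting pairs as follows. The function names are hypothetical, and the sketch assumes an undirected network in which u and v themselves are not removed from each other's neighbor sets; the slides do not say how that case was handled.

```python
from collections import defaultdict

def build_neighbors(interactions):
    """Map each protein to the set of its interaction partners."""
    neighbors = defaultdict(set)
    for u, v in interactions:
        neighbors[u].add(v)
        neighbors[v].add(u)
    return neighbors

def jaccard(neighbors, u, v):
    """|N(u) & N(v)| / |N(u) | N(v)|, or 0.0 when both sets are empty."""
    nu = neighbors.get(u, set())
    nv = neighbors.get(v, set())
    union = nu | nv
    return len(nu & nv) / len(union) if union else 0.0

# Toy network (not the study's data).
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D")]
network = build_neighbors(edges)
print(jaccard(network, "A", "B"))  # N(A)={B,C}, N(B)={A,C,D} -> 1/4 = 0.25
```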

Page 12: Protein-Protein Interaction using SVM based kernel,Jacob Coefficient and Gene Ontology

GO Annotations:

• Proteins that are present in the same cellular component or that participate in the same biological processes are more likely to interact.

• This was captured by extracting identical GO IDs for the interacting proteins.

• Interacting proteins with at least one common GO ID were considered reliable.
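A minimal sketch of the shared-GO-ID check, assuming the annotations are available as a dictionary from protein ID to GO IDs. The protein IDs and their annotations below are made-up placeholders, and the binary encoding of the feature is an assumption.

```python
def shared_go_ids(go_annotations, bait, prey):
    """GO IDs annotated to both proteins (empty set if either is unannotated)."""
    return set(go_annotations.get(bait, ())) & set(go_annotations.get(prey, ()))

def go_feature(go_annotations, bait, prey):
    """1 if the pair shares at least one GO ID, else 0 (assumed encoding)."""
    return 1 if shared_go_ids(go_annotations, bait, prey) else 0

# Placeholder annotations for illustration only.
annotations = {
    "protA": {"GO:0005737", "GO:0006412"},
    "protB": {"GO:0006412"},
    "protC": {"GO:0005886"},
}
print(go_feature(annotations, "protA", "protB"))  # share GO:0006412 -> 1
print(go_feature(annotations, "protA", "protC"))  # nothing in common -> 0
```

Unannotated proteins always yield 0 here, so the feature becomes less informative for a sparsely annotated genome.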

Page 13: Protein-Protein Interaction using SVM based kernel,Jacob Coefficient and Gene Ontology

Occurrence of 5-mers

• The spectrum kernel models a sequence in the space of all k-mers (here, 5-mers).

• All possible 5-mers in the protein sequences were obtained for the data.

• The number of times each 5-mer appears in the sequence data was extracted for both bait and prey proteins.
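The 5-mer counting step can be sketched with a sliding window. How the bait and prey counts were combined into one feature vector is not stated in the slides; prefixing each 5-mer with its role, as below, is an assumption, and the function names are hypothetical.

```python
from collections import Counter

def kmer_counts(sequence, k=5):
    """Count every overlapping k-mer in a sequence."""
    seq = sequence.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def pair_features(bait_seq, prey_seq, k=5):
    """Combine bait and prey k-mer counts, keyed as 'bait:KMER' / 'prey:KMER'."""
    features = Counter()
    for role, seq in (("bait", bait_seq), ("prey", prey_seq)):
        for kmer, count in kmer_counts(seq, k).items():
            features[f"{role}:{kmer}"] = count
    return features
```

Storing only the 5-mers that actually occur keeps the representation sparse, since the full space has 20^5 possible 5-mers per protein.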

Page 14: Protein-Protein Interaction using SVM based kernel,Jacob Coefficient and Gene Ontology

Creating & Testing Model:

• SVM Light was used to create classification models based on linear and sigmoid kernels.

• Test data was applied to the model in order to classify it.

• The performance of the model was evaluated based on accuracy, precision and recall values.
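SVM Light reads training and test examples from a sparse text file, one example per line, in the form `<target> <index>:<value> ...` with feature indices in increasing order. A minimal sketch of writing such lines follows; the function name and the mapping of attributes to indices are hypothetical, since the slides do not show the actual files.

```python
def to_svmlight_line(label, features):
    """Format one example for SVM Light.

    label: +1 (reliable) or -1 (not reliable).
    features: dict of 1-based integer feature index -> numeric value.
    """
    parts = [f"{label:+d}"]
    for index in sorted(features):       # indices must be increasing
        value = features[index]
        if value != 0:                   # zero entries can be omitted
            parts.append(f"{index}:{value:g}")
    return " ".join(parts)

# Assumed layout: feature 1 = hydrophobicity, feature 2 = Jaccard coefficient.
print(to_svmlight_line(+1, {1: -0.4, 2: 0.25}))  # +1 1:-0.4 2:0.25

# Typical command-line usage once the files exist (file names are placeholders):
#   svm_learn -t 0 train.dat model      # linear kernel
#   svm_learn -t 3 train.dat model      # sigmoid kernel
#   svm_classify test.dat model predictions
```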

Page 15: Protein-Protein Interaction using SVM based kernel,Jacob Coefficient and Gene Ontology

Experiments Performed:

1) Model generated using the hydrophobicity attribute.

2) Model generated using the Jaccard coefficient (JC) attribute.

3) Model generated using both of these attributes.

4) Model generated using both of these attributes on a different data set (equal number of positive and negative examples).

Page 16: Protein-Protein Interaction using SVM based kernel,Jacob Coefficient and Gene Ontology

Results for Linear Kernel:

Metric          Exp-1   Exp-2   Exp-3   Exp-4
Accuracy (%)    79.99   79.99   79.88   51.23
Precision (%)   -       -       -       -
Recall (%)      0.0     0.0     0.0     0.0
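One possible reading of the linear-kernel results is that the model labeled every test interaction as negative: recall is then 0, precision is undefined (no positive predictions), and accuracy equals the share of negatives in the test set. The sketch below illustrates that behavior on toy labels; it is an interpretation, not something the slides confirm, and the function name is hypothetical.

```python
def evaluate(y_true, y_pred):
    """Return (accuracy, precision, recall) in percent; None where undefined."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == -1 and p == -1)
    fp = sum(1 for t, p in pairs if t == -1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == -1)
    accuracy = 100.0 * (tp + tn) / len(pairs)
    precision = 100.0 * tp / (tp + fp) if (tp + fp) else None  # no positive predictions
    recall = 100.0 * tp / (tp + fn) if (tp + fn) else None     # no true positives in data
    return accuracy, precision, recall

# Toy labels (not the study's data): 1 positive, 4 negatives, all predicted negative.
y_true = [1, -1, -1, -1, -1]
y_pred = [-1, -1, -1, -1, -1]
print(evaluate(y_true, y_pred))  # (80.0, None, 0.0)
```

Reporting precision and recall alongside accuracy is what exposes this failure mode; accuracy alone would look acceptable on an imbalanced test set.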

Page 17: Protein-Protein Interaction using SVM based kernel,Jacob Coefficient and Gene Ontology

Results for Sigmoid Kernel:

Metric          Exp-1   Exp-2   Exp-3   Exp-4
Accuracy (%)    -       -       79.88   57.26
Precision (%)   -       -       0.0     57.80
Recall (%)      -       -       0.0     45.79

Page 18: Protein-Protein Interaction using SVM based kernel,Jacob Coefficient and Gene Ontology

Observation:

• The results obtained were not reliable because the model was built using only two attributes.

• Two attributes alone would not be effective in discriminating between the positive and negative examples.

• Also, it was observed that the positive examples had no significant influence when creating the model.

Page 19: Protein-Protein Interaction using SVM based kernel,Jacob Coefficient and Gene Ontology

To Be done:

• Extract the attribute "occurrence of 5-mers" for the protein pairs and perform all the experiments.

• Obtain data from the IntAct database to increase the number of positive examples and to offset the false positives in the data, since it comes from Y2H experiments.

• Compare the performance with the existing model based on logistic regression.

Page 20: Protein-Protein Interaction using SVM based kernel,Jacob Coefficient and Gene Ontology

Problems:

• The major problem in extracting annotation-dependent attributes was that Treponema is not fully annotated.

• The interaction data for Treponema are also not reliable.

Page 21: Protein-Protein Interaction using SVM based kernel,Jacob Coefficient and Gene Ontology

Future Work:

• We would like to apply this model to Streptococcus pneumoniae.

• Using PSSM scores obtained by running PSI-BLAST would be helpful.

• Analyze the biological relevance of our predictions and then experimentally test the interactions predicted to be reliable by the model.

Page 22: Protein-Protein Interaction using SVM based kernel,Jacob Coefficient and Gene Ontology

References:

• Dr. Peter Uetz et al. (J. Craig Venter Institute)

• Asa Ben-Hur and William Stafford Noble, "Kernel methods for predicting protein–protein interactions"

• SVM Light: http://svmlight.joachims.org/

• Protein GRAVY: http://www.bioinformatics.org/sms2/protein_gravy.html

• PIR: http://pir.georgetown.edu/