Protein-Protein Interaction using SVM based kernel, Jaccard Coefficient and Gene Ontology
SVM based approach to assess the reliability of
protein-protein interactions
Meher Preethi Boorgula, Ronak Shah,
Neerja Katiyar
George Mason University
Motivation:
• Protein interactions play a key role in many cellular processes.
• Distortion of protein interfaces may lead to the development of many diseases.
• Reliable protein-protein interactions (PPIs) that are conserved among different species and involved in diseases would be very helpful to researchers.
Problem Statement:
• Protein-protein interactions (PPIs) are very helpful in the functional annotation of proteins, so it is important that the PPI data is reliable.
• Thus, we try to predict the reliability of PPIs for a disease-causing bacterium.
Objective:
• To create a prediction model based on a kernel method (SVM) to assess the reliability of PPIs in Treponema pallidum obtained from the yeast two-hybrid (Y2H) system.
• To classify the interactions as reliable or not reliable.
Introduction:
• Protein-protein interactions can be identified with high-throughput techniques such as yeast two-hybrid (Y2H) and mass spectrometry (MS).
• The main disadvantage of these existing techniques is the number of false positives in the data obtained.
• So, assessing the reliability of PPIs is necessary.
Methodology:
1) Prepare the data sets
2) Extract the attributes
3) Create and test the model using SVM Light
4) Evaluate the performance of the model
5) Analyze the reliability of the PPI data sets
Datasets:
• Raw interaction data was obtained from Y2H experiments performed at the J. Craig Venter Institute.
• This data was then organized into train and test sets with equal numbers of positive and negative examples.
• Positive – High Confidence data
• Negative – Low Confidence data
Datasets (contd.):
• All Interactions = 2993
• High Confidence = 721
• Common Interactions = 66
• Total (excluding common) = 3648
• Train and test datasets were made by taking 1824 interactions.
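The total above is consistent with merging the two lists and counting the shared interactions once: 2993 + 721 - 66 = 3648. A minimal Python sketch of such a merge follows. It assumes each interaction is an unordered bait/prey pair; the function name and protein IDs are hypothetical and not taken from the project.

```python
def merge_interactions(all_pairs, high_conf_pairs):
    """Merge two interaction lists, counting shared interactions once.

    Each pair is stored as a frozenset so that (A, B) and (B, A) are
    treated as the same interaction (an assumption of this sketch).
    """
    all_set = {frozenset(p) for p in all_pairs}
    high_set = {frozenset(p) for p in high_conf_pairs}
    common = all_set & high_set   # interactions present in both lists
    merged = all_set | high_set   # union: each interaction kept once
    return merged, common


# Toy example with made-up protein IDs
all_pairs = [("P1", "P2"), ("P2", "P3"), ("P3", "P4")]
high_conf = [("P2", "P1"), ("P5", "P6")]
merged, common = merge_interactions(all_pairs, high_conf)
print(len(merged), len(common))
```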
Extracting Attributes:
• Attributes chosen include:
  - Sequence based:
    i. Occurrence of 5-mers in the sequence data
    ii. Hydrophobicity
  - Non-sequence based:
    i. Jaccard coefficient
    ii. GO annotation
Hydrophobicity:
• Protein interaction depends on the nature of the active/binding site.
• A hydrophobicity profile was used to capture this feature.
• The average hydropathy of a sequence was calculated from the hydrophobicity of each amino acid residue.
• This was obtained using the tool “Protein GRAVY”.
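As an illustration of the average hydropathy calculation, here is a minimal Python sketch. It assumes the Kyte-Doolittle hydropathy scale and skips any non-standard residue characters; both choices are assumptions of this sketch, and the function name `gravy` is hypothetical rather than part of the tool used in the project.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence):
    """Average hydropathy: mean Kyte-Doolittle value over the residues."""
    # Non-standard characters (e.g. X, *) are ignored in this sketch
    residues = [r for r in sequence.upper() if r in KYTE_DOOLITTLE]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    return sum(KYTE_DOOLITTLE[r] for r in residues) / len(residues)


# Made-up tripeptide: (1.9 - 3.9 + 4.2) / 3
print(round(gravy("MKV"), 3))
```

A positive value indicates a more hydrophobic sequence on average, a negative value a more hydrophilic one.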
Jaccard coefficient:
• In a PPI network, the neighbors of interacting proteins also tend to interact.
The Jaccard coefficient of two interacting proteins is
J(u, v) = |N(u) ∩ N(v)| / |N(u) ∪ N(v)|
where u, v are the interacting proteins and
N(X) = set of neighbors of protein X in the PPI network
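The Jaccard coefficient of two neighbor sets is the size of their intersection divided by the size of their union, so it ranges from 0 (no shared neighbors) to 1 (identical neighborhoods). A minimal Python sketch follows. The function names and the toy network are hypothetical, and returning 0.0 when both proteins have no neighbors is an assumption of this sketch.

```python
from collections import defaultdict


def build_neighbors(interactions):
    """Map each protein to the set of its neighbors in the PPI network."""
    neighbors = defaultdict(set)
    for a, b in interactions:
        neighbors[a].add(b)
        neighbors[b].add(a)
    return neighbors


def jaccard(neighbors, u, v):
    """|N(u) & N(v)| / |N(u) | N(v)|, or 0.0 if both sets are empty."""
    nu = neighbors.get(u, set())
    nv = neighbors.get(v, set())
    union = nu | nv
    if not union:
        return 0.0
    return len(nu & nv) / len(union)


# Toy network: N(A) = {B, C}, N(B) = {A, C, D}
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D")]
nbrs = build_neighbors(edges)
print(jaccard(nbrs, "A", "B"))  # 1 shared neighbor out of 4 -> 0.25
```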
GO Annotations:
• Proteins that are present in the same cellular component or that participate in the same biological processes are more likely to interact.
• This was captured by extracting the GO IDs that are identical for the interacting proteins.
• Interacting proteins with at least one common GO ID were considered reliable.
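The shared-GO-ID check can be sketched in Python as a set intersection. The annotation dictionary, protein names and function names below are hypothetical, and encoding the attribute as 1/0 is an assumption of this sketch.

```python
def shared_go_ids(go_annotations, bait, prey):
    """Return the GO IDs annotated to both proteins (possibly empty)."""
    return go_annotations.get(bait, set()) & go_annotations.get(prey, set())


def go_feature(go_annotations, bait, prey):
    """1 if the pair shares at least one GO ID, otherwise 0."""
    return 1 if shared_go_ids(go_annotations, bait, prey) else 0


# Toy annotations for made-up proteins
go_annotations = {
    "protA": {"GO:0005737", "GO:0006412"},
    "protB": {"GO:0006412"},
    "protC": {"GO:0005886"},
}
print(go_feature(go_annotations, "protA", "protB"))  # share GO:0006412
print(go_feature(go_annotations, "protA", "protC"))  # nothing in common
```

A protein with no annotation at all yields 0 here, which cannot be told apart from a genuine lack of shared terms.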
Occurrence of 5-mers:
• The spectrum kernel represents a sequence in the space of all k-mers (here, k = 5).
• All possible 5-mers in the protein sequences were obtained for the data.
• The number of times each 5-mer appears in the sequence data was extracted for both the bait and the prey proteins.
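A minimal Python sketch of 5-mer counting and of the spectrum kernel value between two sequences (the dot product of their k-mer count vectors). The function names and the toy sequences are hypothetical. How bait and prey counts are combined into one feature vector per pair is not specified here, so the sketch stops at per-sequence counts.

```python
from collections import Counter


def kmer_counts(sequence, k=5):
    """Count every overlapping k-mer in a sequence."""
    seq = sequence.upper()
    # A sequence shorter than k yields an empty Counter
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(seq_a, seq_b, k=5):
    """Dot product of the two k-mer count vectors."""
    counts_a = kmer_counts(seq_a, k)
    counts_b = kmer_counts(seq_b, k)
    # Counter returns 0 for k-mers missing from counts_b
    return sum(n * counts_b[kmer] for kmer, n in counts_a.items())


# Made-up sequences: "MKVLA" occurs twice in the first, once in the second
print(kmer_counts("MKVLAMKVLA")["MKVLA"])
print(spectrum_kernel("MKVLAMKVLA", "MKVLA"))
```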
Creating & Testing Model:
• SVM Light was used to create classification models based on linear and sigmoid kernels.
• The test data was then classified with each model.
• The performance of each model was evaluated based on Accuracy, Precision and Recall values.
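SVM Light reads one example per line in the form `<target> <index>:<value> ...`, with feature indices in increasing order. The Python sketch below writes such lines. The mapping of index 1 to hydrophobicity and index 2 to the Jaccard coefficient, the feature values and the file names in the comments are all hypothetical.

```python
def to_svmlight_line(label, features):
    """Format one example for SVM Light.

    label    -- +1 (reliable) or -1 (not reliable)
    features -- dict mapping feature index (int >= 1) to its value
    """
    if label not in (1, -1):
        raise ValueError("label must be +1 or -1")
    parts = ["+1" if label == 1 else "-1"]
    for index in sorted(features):      # indices must be ascending
        value = features[index]
        if value != 0:                  # zero-valued features can be omitted
            parts.append(f"{index}:{value:g}")
    return " ".join(parts)


# Hypothetical feature layout: 1 = hydrophobicity, 2 = Jaccard coefficient
examples = [(1, {1: -0.4, 2: 0.25}), (-1, {1: 0.7, 2: 0.0})]
lines = [to_svmlight_line(label, feats) for label, feats in examples]
print("\n".join(lines))

# Training and classification are then run from the shell, for example:
#   svm_learn -t 0 train.dat model_linear      (-t 0: linear kernel)
#   svm_learn -t 3 train.dat model_sigmoid     (-t 3: sigmoid kernel)
#   svm_classify test.dat model_linear predictions
```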
Experiments Performed:
1) Model generated using the attribute Hydrophobicity.
2) Model generated using the attribute Jaccard coefficient (JC).
3) Model generated using both of these attributes.
4) Model generated using both of these attributes on a different data set (equal number of positive and negative examples).
Results for Linear Kernel:
                Exp-1   Exp-2   Exp-3   Exp-4
Accuracy (%)    79.99   79.99   79.88   51.23
Precision (%)     -       -       -       -
Recall (%)       0.0     0.0     0.0     0.0
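A recall of 0.0 together with a missing precision value is the pattern produced when a classifier labels every test example as negative: no positives are predicted, so precision is undefined, and accuracy equals the fraction of negative examples. The Python sketch below shows this on made-up labels. The counts are illustrative only and are not the project's data.

```python
def evaluate(true_labels, predicted_labels):
    """Return (accuracy, precision, recall); None where undefined."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if pred == 1 and truth == 1:
            tp += 1
        elif pred == 1 and truth == -1:
            fp += 1
        elif pred == -1 and truth == -1:
            tn += 1
        else:
            fn += 1
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else None  # no predicted positives
    recall = tp / (tp + fn) if (tp + fn) else None     # no actual positives
    return accuracy, precision, recall


# Illustrative test set: 2 positives, 8 negatives, everything predicted negative
true_labels = [1] * 2 + [-1] * 8
predicted = [-1] * 10
print(evaluate(true_labels, predicted))  # (0.8, None, 0.0)
```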
Results for Sigmoid Kernel:
                Exp-1   Exp-2   Exp-3   Exp-4
Accuracy (%)      -       -     79.88   57.26
Precision (%)     -       -      0.0    57.80
Recall (%)        -       -      0.0    45.79
Observation:
• The results obtained were not reliable, as the model was built using only two attributes.
• Two attributes are not enough to discriminate between the positive and negative examples.
• It was also observed that the positive examples had no significant effect on the model when it was created.
To Be Done:
• Extract the attribute “occurrence of 5-mers” for the protein pairs and repeat all the experiments.
• Obtain data from the IntAct database to increase the number of positive examples and to offset the false positives in the current data, which comes from Y2H experiments.
• Compare the performance with the existing model based on “Logistic Regression”.
Problems:
• The major problem in extracting annotation-dependent attributes was that Treponema is not fully annotated.
• The interaction data for Treponema is also not reliable.
Future Work:
• We would like to apply this model to Streptococcus pneumoniae.
• Using PSSM scores obtained by running PSI-BLAST would be helpful.
• Analyze the biological relevance of our predictions and then experimentally test the interactions predicted to be reliable by the model.
References:
• Dr. Peter Uetz et al. (J. Craig Venter Institute)
• Asa Ben-Hur & William Stafford Noble, “Kernel methods for predicting protein–protein interactions”
• SVM Light: http://svmlight.joachims.org/
• Protein GRAVY: http://www.bioinformatics.org/sms2/protein_gravy.html
• PIR: http://pir.georgetown.edu/