Support vector machine approach for protein subcellular localization prediction (SubLoc)
Kim Hye Jin
Intelligent Multimedia Lab.
2001.09.07.
Contents
• Introduction
• Materials and Methods
  – Support vector machine
  – Design and implementation of the prediction system
  – Prediction system assessment
• Results
• Discussion and Conclusion
Introduction (1)
• Motivation
  – Subcellular localization is a key functional characteristic of potential gene products such as proteins
• Traditional methods
  – Protein N-terminal sorting signals
    • Nielsen et al. (1999), von Heijne et al. (1997)
  – Amino acid composition
    • Nakashima and Nishikawa (1994), Nakai (2000), Andrade et al. (1998), Cedano et al. (1997), Reinhardt and Hubbard (1998)
Materials and Methods(1)
• Dataset
  - SWISSPROT release 33.0
  - Sequences with complete and reliable localization annotations
  - No transmembrane proteins (Rost et al., 1996; Hirokawa et al., 1998; Lio and Vannucci, 2000)
  - Redundancy reduction
  - Effectiveness test
  - Compiled by Reinhardt and Hubbard (1998)
Support vector machine(1)
• Training is a quadratic optimization problem with bound constraints and one linear equality constraint
• Basically for two-class classification problems
  – Input vector x = (x1, ..., x20), where xi is the composition of amino acid i
  – Output y ∈ {-1, 1}
• Idea
  – Map input vectors into a high-dimensional feature space H
  – Construct the optimal separating hyperplane (OSH)
  – Maximize the margin: the distance between the hyperplane and the nearest data points of each class in the space H
  – Mapping is done by a kernel function K(xi, xj)
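The 20-dimensional input vector x = (x1, ..., x20) can be illustrated with a minimal Python sketch, assuming each xi is the fraction of one standard amino acid in the sequence. The function name `composition`, the residue ordering and the handling of non-standard symbols are assumptions made for illustration, not details taken from the slides.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code.
# The ordering is an arbitrary choice for this sketch.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(sequence):
    """Return the 20-dimensional amino acid composition of a protein sequence."""
    counts = Counter(sequence.upper())
    # Only the 20 standard residues are counted; other symbols are ignored.
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Toy sequence, not taken from the dataset.
x = composition("MKKLLA")
```

Each component lies in [0, 1] and the components sum to 1, so sequences of different lengths map to vectors of the same size.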
Support vector machine(2)
• Decision function
• The coefficients are obtained by solving a convex quadratic programming problem
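The equations for this slide are not reproduced in the text. In the standard soft-margin SVM formulation, which matches the bound constraints and single equality constraint described earlier, the decision function (eq 1) and the quadratic program (eq 2) usually take the form below. The symbols α_i (coefficients), b (bias) and N (number of training vectors) are notation assumed here.

```latex
% Eq (1): decision function for a test vector x
f(\mathbf{x}) = \mathrm{sgn}\left( \sum_{i=1}^{N} y_i \alpha_i K(\mathbf{x}_i, \mathbf{x}) + b \right)

% Eq (2): dual quadratic program solved for the coefficients alpha_i
\max_{\alpha} \; \sum_{i=1}^{N} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j)
\quad \text{subject to} \quad
0 \le \alpha_i \le C, \qquad \sum_{i=1}^{N} \alpha_i y_i = 0
```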
Support vector machine(3)
• Constraints
  – In eq(2), C is a regularization parameter that controls the trade-off between margin and misclassification error
• Typical kernel functions
  – Eq(3): polynomial kernel with parameter d
  – Eq(4): radial basis function (RBF) kernel with parameter γ
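The two kernel families can be sketched in Python as follows. The forms used, (x·y + 1)^d for the polynomial kernel and exp(-gamma·||x - y||²) for the RBF kernel, are the common textbook definitions and are assumed here; the exact expressions of eq(3) and eq(4) are not given in the text. The name `gamma` for the RBF parameter is also an assumption.

```python
import math


def polynomial_kernel(x, y, d):
    """Polynomial kernel of degree d: (x . y + 1) ** d (assumed form)."""
    dot = sum(a * b for a, b in zip(x, y))
    return (dot + 1.0) ** d


def rbf_kernel(x, y, gamma):
    """RBF kernel: exp(-gamma * ||x - y||^2) (assumed form)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)
```

With d = 1 the polynomial kernel reduces to a linear kernel plus a constant; larger gamma makes the RBF kernel fall off faster with distance.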
Support vector machine(4)
• Benefits of SVM
  – Global optimization
  – Handles large feature spaces
  – Effectively avoids over-fitting by controlling the margin
  – Automatically identifies a small subset made up of informative points
Design and implementation of the prediction system
• Problem: multi-class classification
  – Prokaryotic sequences: 3 classes
  – Eukaryotic sequences: 4 classes
• Solution
  – Reduce the multi-class classification to binary classifications
  – 1-v-r SVM (one versus rest)
• QP problem
  – LOQO algorithm (Vanderbei, 1994)
• SVMlight
• Speed
  – Less than 10 min on a PC running at 500 MHz
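The one-versus-rest reduction can be sketched as follows: one binary SVM is trained per class, and the class whose SVM gives the highest output is predicted. This is a minimal Python illustration of the combination step only; the function name, the dictionary layout and the class labels are hypothetical, and the output values are made up.

```python
def predict_one_vs_rest(decision_values):
    """Pick the class whose binary (class-versus-rest) SVM output is highest.

    decision_values maps a class label to the real-valued output of the
    binary SVM trained to separate that class from all others.
    """
    if not decision_values:
        raise ValueError("no decision values given")
    return max(decision_values, key=decision_values.get)


# Hypothetical outputs of three binary SVMs for one protein.
outputs = {"class_a": -0.3, "class_b": 1.2, "class_c": 0.1}
predicted = predict_one_vs_rest(outputs)
```

For k classes this needs k binary SVMs, so the prokaryotic and eukaryotic problems would use 3 and 4 SVMs respectively.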
Prediction system assessment
• Prediction quality assessed by jackknife test
  – Each protein is singled out in turn as a test protein, with the remaining proteins used to train the SVM
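The jackknife (leave-one-out) procedure can be sketched in Python as below. The `train` and `predict` callables are placeholders for any classifier; the 1-nearest-neighbour stand-in and the toy data are assumptions used only to make the sketch runnable, not the SVM used in SubLoc.

```python
def jackknife_accuracy(samples, labels, train, predict):
    """Leave-one-out accuracy: hold out each sample once, train on the rest."""
    if not samples:
        raise ValueError("no samples given")
    correct = 0
    for i in range(len(samples)):
        # Training set is everything except sample i.
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = train(train_x, train_y)
        if predict(model, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)


# Stand-in classifier (1-nearest neighbour on scalars), for illustration only.
def train_1nn(xs, ys):
    return list(zip(xs, ys))


def predict_1nn(model, x):
    return min(model, key=lambda pair: abs(pair[0] - x))[1]


toy_x = [0.0, 0.1, 0.2, 1.0, 1.1, 1.2]
toy_y = [-1, -1, -1, 1, 1, 1]
accuracy = jackknife_accuracy(toy_x, toy_y, train_1nn, predict_1nn)
```

The model is retrained once per sample, so the cost grows linearly with the dataset size on top of the cost of a single training run.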
Results (1)
• SubLoc prediction accuracy by jackknife test
  – Prokaryotic sequence case
    • d = 1 and d = 9 for polynomial kernel
    • γ = 5.0 for RBF kernel
    • C = 1000 for SVM constraints
  – Eukaryotic sequence case
    • d = 9 for polynomial kernel
    • γ = 16.0 for RBF kernel
    • C = 500 for each SVM
• Test: 5-fold cross validation (because of limited computational power)
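A 5-fold split can be sketched in Python as follows. The data are divided into five disjoint folds, and each fold is used once for testing while the other four are used for training, so only five training runs are needed. The function name and the interleaved, unshuffled fold assignment are assumptions for illustration; how the folds were actually formed is not stated.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation."""
    if k < 2 or k > n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # Interleaved assignment: sample j goes to fold j % k.
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [j for j in range(n_samples) if j not in held_out]
        yield train_idx, test_idx


splits = list(k_fold_indices(10, k=5))
```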
Comparison
• Based on amino acid composition
  – Neural network
    • Reinhardt and Hubbard, 1998
  – Covariant discriminant algorithm
    • Chou and Elrod, 1999
• Based on the full sequence information in genome sequences
  – Markov model (Yuan, 1999)
Assigning a reliability index
• RI (reliability index)
  – Based on the difference between the highest and the second-highest output values of the 1-v-r SVMs
• 78% of all sequences have RI ≥ 3, and 95.9% of these are correctly predicted
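The quantity underlying the reliability index can be sketched in Python as the gap between the two largest 1-v-r outputs. The function name and the example values are hypothetical. How the gap is binned into an integer RI is not given in the text, so that step is left out.

```python
def output_gap(decision_values):
    """Difference between the highest and second-highest 1-v-r SVM outputs."""
    ranked = sorted(decision_values.values(), reverse=True)
    if len(ranked) < 2:
        raise ValueError("need outputs from at least two binary SVMs")
    return ranked[0] - ranked[1]


# Hypothetical outputs of three binary SVMs for one protein.
gap = output_gap({"class_a": -0.3, "class_b": 1.2, "class_c": 0.1})
```

A large gap means one binary SVM clearly dominates the others, which is why it serves as a confidence measure for the prediction.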
Robustness to errors in the N-terminal sequence
Discussion and Conclusion
• SVM information condensation
  – The number of SVs is quite small
  – The ratio of SVs to all training samples is 13-30%
SVM parameter selection
• Little influence on the classification performance
  – Table 8 shows little difference between kernel functions
  – Robust characteristic of the dataset (Vapnik, 1995)
Improvement of the performance
• Combining with other methods
  – Sorting-signal-based method and amino acid composition
    • Signal: sensitive to errors in the N-terminal sequence
    • Composition: weak for proteins with similar amino acid composition
• Incorporate other informative features
  – Bayesian system integrating whole-genome expression data
  – Fluorescence microscope images