Remote Homology Detection: A Motif Based Approach
CS 6890: Bioinformatics - Dr. Yan
Swati Adhau, 04/14/06

Page 1

Remote Homology Detection: A Motif Based Approach

CS 6890: Bioinformatics - Dr. Yan

Swati Adhau

04/14/06

Page 2

Outline of the Presentation

• Motivation

• Introduction

• Description (Remote Homology Detection)

• Methods

• Results & Discussion

• Q and A

Page 3

Motivation

• Remote homology detection is the problem of detecting homology when sequence similarity is low.

• The paper presents a method for detecting remote homology that is based on the presence of discrete sequence motifs.

• The motif content of a pair of sequences is used to define a similarity measure that serves as the kernel of a support vector machine (SVM) classifier.

Page 4

• The method is tested on two remote homology detection tasks:

1) Prediction of a previously unseen SCOP (Structural Classification of Proteins) family.

2) Prediction of an enzyme class given other enzymes that have a similar function on other substrates.

Page 5

Introduction

• Protein homology detection is one of the most important problems in computational biology.

• Homology is generally established by sequence similarity.

• Two established methods:
  1) Smith-Waterman algorithm
  2) BLAST

• Protein sequence motifs are an alternative method of detecting sequence similarity.

Page 6

Intro (continued)

• By focusing on limited, highly conserved regions of proteins, motifs can often reveal important clues to a protein's role.

• Motifs often represent functionally important regions such as catalytic sites, binding sites and structural motifs.

• The Blocks+ database combines various databases such as Pfam, PRINTS, ProDom, DOMO and InterPro. The eMOTIF database contains discrete sequence motifs constructed from the blocks of BLOCKS+.

• This paper uses discrete sequence motifs extracted from the eBLOCKS database using the eMOTIF method.

Page 7

Intro (continued)

• Based on the motif content of a pair of sequences, we introduce a sequence similarity measure.

• This paper uses an SVM method.

• The SVM method is shown to perform better than the Fisher-kernel method, SAM-T98 and PSI-BLAST.

• When a sequence similarity measure can be shown to be a dot product in some space, it is called a kernel.

• In this paper we use protein motifs to construct a kernel that can be computed efficiently and that performs better than a kernel based on BLAST or Smith-Waterman scores.

Page 8

Remote Homology Detection

This method was tested on the following two tasks:

1) Prediction of a SCOP family when trained on other families in that family's fold.

2) Prediction of the function of an enzyme when the training set contains enzymes that have the same general function but different substrates.

Page 9

Background of the first dataset

• The first dataset is composed of sequences of domains from the SCOP database.

Objective: To detect homology at the SCOP superfamily level, that is, recognizing a SCOP family when the training set contains other families in the family's superfamily.

Page 10

Page 11

…contd

• This specifies the positive examples in the test set and training set.

• The negative examples are taken from outside of the family's fold.

• A random family is chosen to belong to the negative test set, and the rest of the families in its superfamily are added to the negative training set.

Page 12

The second dataset

• We use the classification of enzymes to simulate remote homology.

• The function of an enzyme is given by the EC number assigned to it by the Enzyme Commission.

• An EC number has the form n1.n2.n3.n4, e.g. 1.1.3.13 for alcohol oxidase.
  n1 (1 to 6): indicates the type of chemical reaction catalyzed by the enzyme.
  n2: specifies the donor molecule.
  n3: specifies the acceptor.
  n4: specifies the substrate.

Page 13

…contd

• In this paper the authors concentrate on oxidoreductases (n1 = 1).

• A classifier is trained to predict oxidoreductases with a certain function (n2 & n3).

• The classifier is tested on oxidoreductases with a different substrate (n4) than those it was trained on.

Page 14

For example:

EC class 1.14.13.8: positive examples of the training set.

EC class 1.14.13.39: positive examples of the test set.

• So the similarity between the positive training and test examples may not be very high.

• The negative test and training sets are defined analogously.

Page 15

Methods

The Motif kernel

• When the similarity is a dot product it is called a kernel.

The method is as follows: each position in the motif represents the variability in the corresponding column of a block from a multiple sequence alignment.

For example, in the motif

[as].dkf[filmv]..[filmv]…l[ast].

[filmv] is a substitution group.

Page 16

…contd

• If the pattern of amino acids that appear in a column of a block does not match any substitution group, then the motif contains the wildcard symbol '.'.

• A sequence matches the above motif if it has either an a or an s in some position, then any character, then d, k, f and so on, matching until the end of the motif.
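The matching rule above can be sketched in a few lines of Python. The motif notation on the slides (bracketed substitution groups and '.' as wildcard) happens to coincide with regular-expression syntax, so this sketch simply hands the motif to the standard `re` module. That shortcut, the function name `motif_positions` and the example sequence are assumptions made for illustration; only the unambiguous leading part of the slide's motif, `[as].dkf[filmv]`, is used.

```python
import re
from typing import List


def motif_positions(motif: str, sequence: str) -> List[int]:
    """Return every start index at which the motif matches the sequence.

    The motif is treated as a regular expression; a lookahead is used so
    that overlapping occurrences are all reported.
    """
    pattern = re.compile(f"(?=({motif}))")
    return [m.start() for m in pattern.finditer(sequence.lower())]


# Leading part of the motif shown on the slide.
motif = "[as].dkf[filmv]"

# Hypothetical sequence: 'a' at index 2, any residue, then d, k, f, then 'i'.
hits = motif_positions(motif, "MKAQDKFIGG")
```

A sequence "contains" the motif whenever the returned list is non-empty.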

Page 17

• A sequence x contains a motif m if x contains m at some position.

• A sequence x can be represented in a vector space indexed by a set of motifs M as follows:

$\Phi(x) = (\phi_m(x))_{m \in M}$

where $\phi_m(x)$ is the number of occurrences of the motif m in x.

We can define the motif kernel as $K(x, x') = \Phi(x) \cdot \Phi(x')$
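As a rough illustration of the definition above, the feature map and the kernel can be written directly in Python. This is a minimal sketch, not the authors' implementation: motif occurrences are counted with the standard `re` module (assuming the motif notation can be read as a regular expression), and the three motifs and two sequences are made up for the example.

```python
import re
from collections import Counter


def motif_counts(sequence, motifs):
    """Sparse feature map: motif -> number of occurrences in the sequence."""
    counts = Counter()
    for motif in motifs:
        # Lookahead so that overlapping occurrences are counted too.
        n = len(re.findall(f"(?=({motif}))", sequence))
        if n:
            counts[motif] = n
    return counts


def motif_kernel(x, x_prime, motifs):
    """Dot product of the two motif-count vectors."""
    phi_x = motif_counts(x, motifs)
    phi_xp = motif_counts(x_prime, motifs)
    # A Counter returns 0 for motifs that are absent from x_prime.
    return sum(count * phi_xp[m] for m, count in phi_x.items())


# Hypothetical motif set and sequences.
motifs = ["[as].dkf", "g[filmv]g", "w..w"]
x = "aqdkfgig"         # contains the first and second motif once each
x_prime = "stdkfwaaw"  # contains the first and third motif once each

shared = motif_kernel(x, x_prime, motifs)
```

Since each motif occurs at most once here, the kernel value is simply the number of motifs the two sequences share.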

Page 18

• Since in most cases a motif appears only once in a sequence, this kernel counts the number of motifs that are common to both sequences.

Q: Why are we using the eBLOCKS database over other motif databases to define a motif kernel?

Ans:

1) Using databases such as PROSITE and eMOTIF presents a problem in evaluating the performance of the kernel.

2) The eBLOCKS database is generated in an unsupervised way.

3) Increased coverage of the eBLOCKS set of blocks.

Page 19

Computing the Motif Kernel

Page 20

Computing the Motif kernel

• First compute the motif content of each sequence; the subsequent computation of the kernel is simply a dot product between the vectors.

• To facilitate efficient computation of the motif content of a sequence, the motif database is stored in a trie, which is defined as follows.

Let m be a motif over the alphabet A ∪ S ∪ {.}
Every prefix of m has a node.
Let m1 and m2 be prefixes of m; there is an edge from m1 to m2 if |m2| = |m1| + 1.

To compute all the motifs that are contained in x at any position, this search is started at each position of x.
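The trie described on this slide can be sketched as follows. This is an illustrative reading, not the authors' code: it assumes A is the set of amino-acid letters and S the set of bracketed substitution groups, so each trie edge is labelled with one letter, one group or the wildcard. The names `tokenize`, `Node`, `build_trie`, `token_matches` and `motif_content`, and the example motifs, are hypothetical.

```python
import re
from collections import Counter

# One motif position: a substitution group, a single residue, or the wildcard.
TOKEN = re.compile(r"\[[a-z]+\]|[a-z.]")


def tokenize(motif):
    return TOKEN.findall(motif)


class Node:
    def __init__(self):
        self.children = {}  # token -> Node (one node per motif prefix)
        self.motifs = []    # motifs that end exactly at this node


def build_trie(motifs):
    root = Node()
    for motif in motifs:
        node = root
        for tok in tokenize(motif):
            node = node.children.setdefault(tok, Node())
        node.motifs.append(motif)
    return root


def token_matches(tok, residue):
    if tok == ".":
        return True
    if tok.startswith("["):
        return residue in tok[1:-1]
    return tok == residue


def motif_content(sequence, root):
    """Count occurrences of every stored motif, searching from each position."""
    sequence = sequence.lower()
    counts = Counter()
    for start in range(len(sequence)):
        stack = [(root, start)]
        while stack:
            node, pos = stack.pop()
            counts.update(node.motifs)  # every motif ending here matched
            if pos == len(sequence):
                continue
            # A residue may match several edges (letter, group, wildcard).
            for tok, child in node.children.items():
                if token_matches(tok, sequence[pos]):
                    stack.append((child, pos + 1))
    return counts


trie = build_trie(["[as].dkf", "g[filmv]g", "a.d"])
content = motif_content("AQDKFGIG", trie)
```

Motifs that share a prefix share the corresponding trie path, so the prefix is compared against the sequence only once per start position.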

Page 21

The BLAST kernel

• A query sequence is represented by its BLAST scores against the training set.

• This representation, in conjunction with SVMs, was used to address the problem of remote homology detection.

• Results were better than with the Fisher-kernel method.

Page 22

Classification Methods

• We report results using two classification methods:

1) SVMs

2) K-Nearest-Neighbour

SVM: $f(x) = w \cdot x + b$

w: weight vector

b: constant bias

A query is classified according to the sign of f.

Page 23

• As a consequence of the optimization process, the weight vector can be expressed as a weighted sum of the support vectors (SV):

$w = \sum_{i \in SV} \alpha_i x_i$

• The decision function is now written as

$f(x) = \sum_{i \in SV} \alpha_i \, x_i \cdot x + b$

• In terms of the kernel function, the decision function is expressed as:

$f(x) = \sum_{i \in SV} \alpha_i K(x_i, x) + b$
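The kernel form of the decision function can be evaluated with a few lines of Python. This sketch only shows prediction with coefficients that are assumed to be already known; training the SVM is out of scope. The support vectors, coefficients and bias are made-up numbers, and the coefficients are taken to absorb the class labels, as in the weighted sum on the slide.

```python
def linear_kernel(u, v):
    return sum(a * b for a, b in zip(u, v))


def svm_decision(x, support_vectors, alphas, bias, kernel):
    """Evaluate f(x) = sum_i alpha_i * K(x_i, x) + b over the support vectors."""
    return sum(a * kernel(sv, x) for sv, a in zip(support_vectors, alphas)) + bias


# Hypothetical trained model.
support_vectors = [(1.0, 0.0), (0.0, 1.0)]
alphas = [0.5, -0.5]
bias = 0.1

score = svm_decision((2.0, 1.0), support_vectors, alphas, bias, linear_kernel)
label = 1 if score > 0 else -1  # the query is classified by the sign of f
```

Any kernel with the same call signature, such as a motif kernel or a BLAST-score kernel, could be passed in place of `linear_kernel`.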

Page 24

KNN classifier

• We use a KNN classifier with a continuous-valued decision function.

• A score for class j is defined as

$f_j(x) = \sum_{i \in kNN_j(x)} K(x_i, x)$

• $kNN_j(x)$ is the set of k nearest neighbors of x in class j.
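A small Python sketch of this class score is given below. The slide does not say how "nearest" is measured, so this sketch assumes that the k nearest neighbors in a class are the k examples with the largest kernel value to the query; that choice, the function name `knn_class_score` and the toy vectors are assumptions.

```python
import heapq


def linear_kernel(u, v):
    return sum(a * b for a, b in zip(u, v))


def knn_class_score(x, class_examples, kernel, k):
    """Sum of kernel values between x and its k most similar examples in one class."""
    similarities = [kernel(xi, x) for xi in class_examples]
    return sum(heapq.nlargest(k, similarities))


# Hypothetical examples of a single class j.
class_j = [(1, 0), (0, 1), (1, 1)]
score_j = knn_class_score((2, 1), class_j, linear_kernel, k=2)
```

Computing one such score per class and comparing them gives a continuous-valued decision rather than a simple majority vote.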

Page 25

Metrics

• We consider two metrics for assessing the performance of a classifier:
1) ROC (area under the receiver operating characteristic curve).
2) RFP (the median rate of false positives).

• The ROC curve describes the tradeoff between sensitivity and specificity.

• More specifically we use the ROC50 curve, which counts true positives only up to the first 50 false positives.

• The RFP score of a positive test sequence x is the fraction of negative test sequences that have a value of the decision function that is at least as high as the value of the decision function of x.

Page 26

Results

• The ASTRAL database was used to obtain protein domain sequences of the SCOP database.

• Only superfamilies having at least two families with at least 10 members each were retained.

• This yielded a dataset with 1639 domains in 23 superfamilies and 56 families.

• Protein sequences annotated with EC numbers were extracted from the SwissProt database.

• The extracted dataset has 2187 enzymes in 65 classes.

• To generate the BLAST kernel, the authors ran an all-vs-all BLAST on the two datasets using default parameters and an E-value cutoff of 0.1.

• To generate the motif kernel, the motif content of the datasets was computed with eBLOCKS sequence motifs using the trie method.

Page 27

Results …contd

• A family-by-family comparison of the classification performance of the motif-SVM and BLAST-SVM methods is provided in the figure on the next slide.

• On the SCOP task the motif-SVM method performs significantly better than the BLAST-SVM method, with a p-value of $3.9 \times 10^{-9}$ in a Wilcoxon signed-rank test for the ROC50 score.

• In the enzyme classification task there is no significant difference in ROC50 scores.

• Similar behavior is observed in the median RFP and RFP50.

• The results were similar when the Smith-Waterman algorithm was used instead of BLAST.

Page 28

Results …contd

Page 29

Results …contd

Page 30

Results …contd

• The motif kernel in figure 4 shows the similarity between the families in the superfamily, whereas none is detected by the BLAST kernel.

• Increased sensitivity of the motif kernel.

Page 31

Results …contd

Page 32

Results …contd

• Figure 5 shows the comparison of the SVM-based method to the one that uses KNN as a classifier.

• For both the motif and BLAST kernels, the SVM-based classifier performs significantly better than the corresponding KNN classifier.

Page 33

Discussion

• This paper showed that an SVM classifier that uses the motif kernel performs significantly better than an SVM that uses a BLAST/Smith-Waterman kernel on a remote homology detection problem derived from the SCOP database.

• Both methods performed equally well on the task of enzyme detection.

• The BLAST kernel and the motif kernel worked significantly better when used in conjunction with an SVM rather than a nearest-neighbor classifier.

• Despite the relative success of the motif method, there were many SCOP families and EC classes that were not detected using this method.

Page 34

Questions?? Comments!!

Page 35

Thank you !!!