CISC 841 Bioinformatics Combining HMMs with SVMs

Li Liao, CISC841, F07


Page 1: CISC 841 Bioinformatics Combining HMMs with SVMs

CISC 841 Bioinformatics

Combining HMMs with SVMs


Page 2: CISC 841 Bioinformatics Combining HMMs with SVMs

HMM gradients

• Fisher score: $U_X = \nabla_\theta \log P(X \mid H, \theta)$

• The gradient of a sequence X with respect to a given model is computed using the forward-backward algorithm.

• Each dimension corresponds to one parameter of the model.

• The feature space is tailored to the sequences from which the model was trained.
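As a rough illustration of the bullets above, the following Python sketch computes the gradient of log P(X | H, θ) with respect to the emission parameters of a toy two-state HMM, using one forward and one backward pass. The model values, the integer-coded sequence, and all function names are hypothetical. The emissions are treated as free (unnormalized) parameters, which is one possible convention and not necessarily the one used in the lecture.

```python
# Toy HMM (all values hypothetical): 2 states, 2 symbols coded as 0 and 1.
pi = [0.6, 0.4]                    # initial state probabilities
a = [[0.7, 0.3], [0.4, 0.6]]       # a[i][j]: transition i -> j
e = [[0.9, 0.1], [0.2, 0.8]]       # e[j][c]: state j emits symbol c
x = [0, 1, 1, 0]                   # observed sequence


def forward(x, pi, a, e):
    """f[t][j] = P(x_0..x_t, state at t is j)."""
    n = len(pi)
    f = [[pi[j] * e[j][x[0]] for j in range(n)]]
    for t in range(1, len(x)):
        prev = f[-1]
        f.append([e[j][x[t]] * sum(prev[i] * a[i][j] for i in range(n))
                  for j in range(n)])
    return f


def backward(x, a, e):
    """b[t][i] = P(x_{t+1}..x_{T-1} | state at t is i)."""
    n = len(a)
    b = [[1.0] * n]
    for t in range(len(x) - 1, 0, -1):
        nxt = b[0]
        b.insert(0, [sum(a[i][j] * e[j][x[t]] * nxt[j] for j in range(n))
                     for i in range(n)])
    return b


def likelihood(x, pi, a, e):
    """P(x) summed over all state paths."""
    return sum(forward(x, pi, a, e)[-1])


def emission_fisher_scores(x, pi, a, e):
    """U[j][c] = d log P(x) / d e[j][c] = E_j(c) / e_j(c),
    where E_j(c) is the expected number of times state j emits c."""
    f, b = forward(x, pi, a, e), backward(x, a, e)
    px = sum(f[-1])
    n, m = len(pi), len(e[0])
    expected = [[0.0] * m for _ in range(n)]
    for t, sym in enumerate(x):
        for j in range(n):
            expected[j][sym] += f[t][j] * b[t][j] / px
    return [[expected[j][c] / e[j][c] for c in range(m)] for j in range(n)]


U = emission_fisher_scores(x, pi, a, e)   # one dimension per emission parameter
```

Each entry of `U` corresponds to one model parameter, so a sequence of any length maps to a vector of fixed dimension.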


Page 3: CISC 841 Bioinformatics Combining HMMs with SVMs

SVM-Fisher discrimination

A probabilistic hidden Markov model is trained from some example sequences x1 x2 x3 … xN

Usually the probability model $P(x_i \mid \theta)$ (or a function of $P(x_i \mid \theta)$) is used as a measure of sequence-model membership, and a threshold on this measure decides membership.

The Fisher vector is the vector of gradients of $P(x_i \mid \theta)$ (or of a function of $P(x_i \mid \theta)$) with respect to the parameters of the model: $U_{x_i} = \nabla_\theta P(x_i \mid \theta)$

One can take the training example sequences (positive set) and other sequences that are known to be non-members (negative set), and transform them into Fisher vectors.

A Support Vector Machine (SVM) can be trained using the positive and negative Fisher vectors, and can be used to classify other sequences.
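The last two steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two-dimensional "Fisher vectors" are made-up numbers, the function names are hypothetical, and the classifier is a plain linear SVM trained by subgradient descent on the hinge loss, standing in for whatever SVM package is actually used.

```python
def train_linear_svm(vectors, labels, lam=0.01, eta=0.1, epochs=100):
    """Linear SVM via subgradient descent on the regularized hinge loss.
    labels are +1 (positive set) or -1 (negative set)."""
    w = [0.0] * len(vectors[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(vectors, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            w = [(1 - eta * lam) * wi for wi in w]      # regularization shrink
            if margin < 1:                               # hinge loss is active
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
                b += eta * y
    return w, b


def classify(w, b, x):
    """Return +1 (member) or -1 (non-member) for a Fisher vector x."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


# Hypothetical Fisher vectors for the positive and negative training sets.
positive = [[2.0, 1.5], [1.8, 2.2], [2.5, 2.0]]
negative = [[-1.0, -0.5], [-1.5, -1.2], [-0.8, -1.6]]
vectors = positive + negative
labels = [1] * len(positive) + [-1] * len(negative)

w, b = train_linear_svm(vectors, labels)
prediction = classify(w, b, [2.2, 1.9])   # Fisher vector of an unseen sequence
```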


Page 4: CISC 841 Bioinformatics Combining HMMs with SVMs


Application: Protein remote homology detection

Page 5: CISC 841 Bioinformatics Combining HMMs with SVMs

SVM-Pairwise method

[Figure: SVM-pairwise workflow in three numbered steps. Protein homologs (positive train) and protein non-homologs (negative train) are converted into positive and negative pairwise score vectors; these train a support vector machine, which performs binary classification of a target protein of unknown function (testing data).]


Page 6: CISC 841 Bioinformatics Combining HMMs with SVMs

Experiment: known protein families

Jaakkola, Diekhans and Haussler 1999

Page 7: CISC 841 Bioinformatics Combining HMMs with SVMs

Sample family sizes

Family ID   Positive train   Positive test   Negative train   Negative test
1.27.1.1    12               6               2890             1444
1.27.1.2    10               8               2408             1926
1.36.1.1    29               7               3477             839
1.4.1.1     26               23              2256             1994
2.1.1.3     113              8               3895             275
3.1.8.3     17               10              2686             1579
3.2.1.5     46               7               3732             567
2.44.1.2    11               140             307              3894

Page 8: CISC 841 Bioinformatics Combining HMMs with SVMs

A measure of sensitivity and specificity

ROC: the receiver operating characteristic score is the normalized area under a curve that plots true positives as a function of false positives.

[Figure: examples with ROC = 1, ROC = 0, and ROC = 0.67.]
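To make the definition concrete, here is a minimal Python sketch that computes the ROC score from a ranked list of predictions. The function name and the example scores are hypothetical, and the sketch assumes all scores are distinct (no tie handling).

```python
def roc_score(scores, labels):
    """Normalized area under the curve of true positives vs. false positives.
    labels are +1 for true members and -1 for non-members."""
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    n_pos = sum(1 for _, y in ranked if y == 1)
    n_neg = len(ranked) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative example")
    true_pos = 0
    area = 0
    for _, y in ranked:
        if y == 1:
            true_pos += 1        # curve steps up
        else:
            area += true_pos     # curve steps right; add the column under it
    return area / (n_pos * n_neg)


perfect = roc_score([0.9, 0.8, 0.2, 0.1], [1, 1, -1, -1])   # all positives first
worst = roc_score([0.9, 0.8, 0.2, 0.1], [-1, -1, 1, 1])     # all negatives first
mixed = roc_score([0.9, 0.8, 0.7, 0.6], [1, 1, -1, 1])      # one positive ranked last
```

Here `perfect` is 1.0, `worst` is 0.0, and `mixed` is 2/3 (about 0.67), because the single negative is ranked above one of the three positives.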

Page 9: CISC 841 Bioinformatics Combining HMMs with SVMs


Application: Discriminating signal peptide from transmembrane proteins

Page 10: CISC 841 Bioinformatics Combining HMMs with SVMs

Feature selection

We expect gradients w.r.t. transition parameters to be better discrimination features.

We look for those transitions that are used differentially by TM proteins and SP proteins:

- transform each signal peptide sequence (1275) into a Fisher vector w.r.t. transition parameters and find the resultant vector
- transform each TM sequence into a Fisher vector w.r.t. transition parameters and find the resultant vector
- compare the two resultant vectors
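The comparison of resultant vectors can be sketched as follows. The three-dimensional Fisher vectors are invented for illustration and the function names are hypothetical. Dividing each resultant by the number of sequences is an added assumption, made so that sets of different sizes (1275 versus 247) stay comparable.

```python
def resultant(vectors):
    """Component-wise sum of a list of equal-length vectors."""
    return [sum(column) for column in zip(*vectors)]


def rank_transitions(sp_vectors, tm_vectors):
    """Rank transition parameters by how differently the two sets use them.
    Returns (indices sorted by decreasing |difference|, per-dimension differences)."""
    sp_mean = [s / len(sp_vectors) for s in resultant(sp_vectors)]
    tm_mean = [t / len(tm_vectors) for t in resultant(tm_vectors)]
    diffs = [t - s for s, t in zip(sp_mean, tm_mean)]
    order = sorted(range(len(diffs)), key=lambda k: abs(diffs[k]), reverse=True)
    return order, diffs


# Hypothetical Fisher vectors over three transition parameters.
sp_vectors = [[1.0, 0.2, -0.5], [0.8, 0.0, -0.3]]
tm_vectors = [[-0.9, 0.1, -0.4], [-1.1, 0.3, -0.2], [-1.0, 0.2, -0.6]]

order, diffs = rank_transitions(sp_vectors, tm_vectors)
# order[0] is the transition whose gradient differs most between the two sets.
```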

[Figure panels: SignalP; TM protein]

Page 11: CISC 841 Bioinformatics Combining HMMs with SVMs

Gradients of P(s|x)

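In pattern recognition problems, we are interested in $P(s \mid x, \theta)$ rather than $P(x \mid \theta)$.

$$U_{s|x} = \nabla_\theta \log P(s \mid x, \theta) = \nabla_\theta \log P(s, x \mid \theta) - \nabla_\theta \log P(x \mid \theta)$$

First term:

$$P(s, x) = a_{B s_1} e_{s_1}(x_1) \cdot a_{s_1 s_2} e_{s_2}(x_2) \cdot a_{s_2 s_3} e_{s_3}(x_3) \cdots = \prod_i \left( \frac{\theta_i}{\sum_a \theta_a} \right)^{n_i(s,x)}$$

where $n_i(s,x)$ is the number of times $\theta_i$ is used, and $\sum_a \theta_a = 1$.

$$\frac{\partial P(x, s)}{\partial \theta_k} = \frac{1 - \theta_k}{\theta_k} \, n_k(s,x) \, P(s,x)$$

$$\frac{\partial \log P(s, x)}{\partial \theta_k} = \frac{m_k(x)}{\theta_k} - m_k(x)$$

where $m_k(x) = n_k(s,x)$ is the number of times $\theta_k$ is used in x following the given path s.

Second term:

$$P(x) = \sum_\pi P(x, \pi)$$

$$P(x, \pi) = a_{0 \pi_1} e_{\pi_1}(x_1) \cdot a_{\pi_1 \pi_2} e_{\pi_2}(x_2) \cdot a_{\pi_2 \pi_3} e_{\pi_3}(x_3) \cdots = \prod_i \left( \frac{\theta_i}{\sum_a \theta_a} \right)^{n_i(\pi,x)}$$

where $n_i(\pi,x)$ is the number of times $\theta_i$ is used, and $\sum_a \theta_a = 1$.

$$\frac{\partial \log P(x)}{\partial \theta_k} = \frac{1}{P(x)} \sum_\pi \frac{\partial P(x, \pi)}{\partial \theta_k}$$

But

$$\frac{\partial P(x, \pi)}{\partial \theta_k} = \frac{1 - \theta_k}{\theta_k} \, n_k(\pi,x) \, P(x,\pi)$$

Thus

$$\frac{\partial \log P(x)}{\partial \theta_k} = \sum_\pi \frac{1 - \theta_k}{\theta_k} \, n_k(\pi,x) \, \frac{P(x,\pi)}{P(x)} = \sum_\pi \frac{1 - \theta_k}{\theta_k} \, n_k(\pi,x) \, P(\pi \mid x) = \frac{n_k(x)}{\theta_k} - n_k(x)$$

where $n_k(x) = \sum_\pi n_k(\pi,x) \, P(\pi \mid x)$ is the expected number of times $\theta_k$ is used in x following any path.

Finally, for each parameter $\theta_k$:

$$U_{s|x} = \frac{m_k(x)}{\theta_k} - m_k(x) - \frac{n_k(x)}{\theta_k} + n_k(x)$$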
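The final expression can be checked on a toy model. The Python sketch below evaluates it for the transition parameters of a hypothetical two-state HMM, obtaining the expected counts by brute-force enumeration of all paths in place of the forward-backward algorithm. The model values, the sequence, the path, and the function names are all illustrative assumptions, and the begin-state transition is left out for brevity.

```python
from itertools import product

# Toy HMM (all values hypothetical): 2 states, 2 symbols coded as 0 and 1.
pi = [0.6, 0.4]                    # begin -> state probabilities
a = [[0.7, 0.3], [0.4, 0.6]]       # transition parameters theta_k
e = [[0.9, 0.1], [0.2, 0.8]]       # emission probabilities
x = [0, 1, 1, 0]                   # observed sequence
s = [0, 1, 1, 0]                   # the given state path


def path_prob(x, path, pi, a, e):
    """Joint probability P(x, path)."""
    p = pi[path[0]] * e[path[0]][x[0]]
    for t in range(1, len(x)):
        p *= a[path[t - 1]][path[t]] * e[path[t]][x[t]]
    return p


def transition_counts(path, n):
    """counts[i][j] = number of times transition i -> j is used on the path."""
    counts = [[0] * n for _ in range(n)]
    for i, j in zip(path, path[1:]):
        counts[i][j] += 1
    return counts


def u_s_given_x(x, s, pi, a, e):
    """U[i][j] = m/theta - m - n/theta + n for theta = a[i][j], where m is the
    count on the given path s and n is the expected count over all paths."""
    n_states = len(pi)
    paths = list(product(range(n_states), repeat=len(x)))
    probs = [path_prob(x, p, pi, a, e) for p in paths]
    px = sum(probs)
    expected = [[0.0] * n_states for _ in range(n_states)]
    for p, pr in zip(paths, probs):
        c = transition_counts(p, n_states)
        for i in range(n_states):
            for j in range(n_states):
                expected[i][j] += c[i][j] * pr / px   # weight by P(path | x)
    m = transition_counts(s, n_states)
    return [[(m[i][j] - expected[i][j]) * (1.0 / a[i][j] - 1.0)
             for j in range(n_states)] for i in range(n_states)]


U = u_s_given_x(x, s, pi, a, e)
```

Since the four terms factor as $(m_k - n_k)(1/\theta_k - 1)$, a component is positive when the given path uses a transition more often than expected over all paths, and negative when it uses it less often.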

Page 12: CISC 841 Bioinformatics Combining HMMs with SVMs

Classification experiment

10-fold cross-validation experiment using:
- positive set (247 TM proteins)
- negative set (1275 signal peptide containing proteins)

SVM-light package is used.
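The fold construction for such an experiment might look like the sketch below. It only builds index splits for 247 positives and 1275 negatives. Stratifying the folds by class, the fixed random seed, and the function names are assumptions made for the example and are not details taken from the experiment itself.

```python
import random


def stratified_folds(n_pos, n_neg, k=10, seed=0):
    """Split positive and negative example indices into k folds,
    keeping the class ratio roughly equal in every fold."""
    rng = random.Random(seed)
    pos = list(range(n_pos))
    neg = list(range(n_neg))
    rng.shuffle(pos)
    rng.shuffle(neg)
    return [(pos[i::k], neg[i::k]) for i in range(k)]


def train_test_split(folds, held_out):
    """Use fold `held_out` for testing and the remaining folds for training."""
    test_pos, test_neg = folds[held_out]
    train_pos = [p for i, (fp, _) in enumerate(folds) if i != held_out for p in fp]
    train_neg = [q for i, (_, fn) in enumerate(folds) if i != held_out for q in fn]
    return (train_pos, train_neg), (test_pos, test_neg)


folds = stratified_folds(247, 1275)          # 247 TM proteins, 1275 SP proteins
for held_out in range(len(folds)):
    (train_pos, train_neg), (test_pos, test_neg) = train_test_split(folds, held_out)
    # Train the SVM on the training indices, then classify the held-out fold.
```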

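[Figure: classification pipeline. Subsets of the 247 TM proteins and subsets of the 1275 SP proteins are converted by TMMOD from sequence to vector (x to U_{s|x}); the vectors go to SVM Learn, and the resulting SVM classifier labels the unknown (?) sequences.]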

Page 13: CISC 841 Bioinformatics Combining HMMs with SVMs

Discrimination results

Results

Method               TM proteins incorrectly      SP proteins incorrectly
                     classified as SP proteins    classified as TM proteins
Phobius              7.7% (19/247)                3.5% (45/1275)
SignalP-NN           42.9%                        2.3%
SignalP-HMM          19.0%                        1.4%
TMMOD                6.1% (15/247)                14.5% (185/1275)
TMMOD + SVM-Fisher   6.1% (15/247)                9.2% (117/1275)

About a third (68 of 185) of the SP proteins that TMMOD incorrectly classified as TM proteins are identified correctly once SVM-Fisher is added.
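The reported percentages and the "a third (68)" figure follow from the raw counts by simple arithmetic; a short Python check, with hypothetical variable names, is:

```python
def pct(count, total):
    """Percentage rounded to one decimal place."""
    return round(100 * count / total, 1)


n_tm, n_sp = 247, 1275            # sizes of the TM and SP sets

# TM proteins incorrectly classified as SP proteins
phobius_tm_as_sp = pct(19, n_tm)
tmmod_tm_as_sp = pct(15, n_tm)

# SP proteins incorrectly classified as TM proteins
phobius_sp_as_tm = pct(45, n_sp)
tmmod_sp_as_tm = pct(185, n_sp)
svm_fisher_sp_as_tm = pct(117, n_sp)

# SP proteins recovered when SVM-Fisher is added to TMMOD
recovered = 185 - 117
recovered_fraction = recovered / 185
```

The recovered count is 68, which is roughly 37% of the 185 SP proteins that TMMOD alone misclassified.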


Page 14: CISC 841 Bioinformatics Combining HMMs with SVMs


Application: Protein-Protein Interaction Prediction

Page 15: CISC 841 Bioinformatics Combining HMMs with SVMs


Interaction Profile Hidden Markov Model (ipHMM)

Friedrich et al. (2006)

Page 16: CISC 841 Bioinformatics Combining HMMs with SVMs

Fisher Score Vector

$U(x) = \nabla_\theta \log P(x \mid \theta)$

$U_{ij} = \frac{E_j(i)}{e_j(i)} - \sum_k E_j(k)$

Likelihood Score Vector

$\langle LS_{a_i,A},\ LS_{a_i,B},\ LS_{b_j,A},\ LS_{b_j,B} \rangle$

Knowledge transfer:
• Build ipHMM from proteins whose structural information is available.
• Align the sequences of proteins whose structural information is not available to the model.


Page 19: CISC 841 Bioinformatics Combining HMMs with SVMs


Scheme        Mean ROC score
FS_NM         0.7487
LS            0.7997
FS_IM         0.8202
FS_IM + LS    0.8626

Data set: Friedrich et al. (2006), 2018 proteins in 36 domain families

Page 20: CISC 841 Bioinformatics Combining HMMs with SVMs

Conclusions

• Structural information at binding sites enhances protein-protein interaction prediction.

• Interaction profile HMMs can transfer structural information.

• Fisher scores extracted from domain profiles further enhance protein-protein interaction prediction for proteins with no available structural information.
