Effective hidden Markov models for detecting splicing junction sites in DNA sequences Authors: Michael M. Yin and Jason T. L. Wang Sources: Information Sciences, 139(1-2), pp. 139-163, 2001. Advisor: Min-Shiang Hwang Speaker: Chun-Ta Li


Effective hidden Markov models for detecting splicing junction sites in DNA sequences

Authors: Michael M. Yin and Jason T. L. Wang

Sources: Information Sciences, 139(1-2), pp. 139-163, 2001.

Advisor: Min-Shiang Hwang

Speaker: Chun-Ta Li


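Introduction (1/1)

• codon: a triplet of nucleotides that codes for one amino acid
• introns: non-coding segments that are spliced out of the transcript
• exons: coding sequences
• donor: the splice site at the exon-to-intron boundary (the 5' end of an intron)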


Using HMMs to model splicing junction sites (1/3)

• The Donor Model


Using HMMs to model splicing junction sites (2/3)

• The Acceptor Model


Using HMMs to model splicing junction sites (3/3)

• Two modules for each model
  – The Donor Model:
    • true site modules (true sites in the training data set)
    • false site modules (false sites in the training data set)
  – The Acceptor Model:
    • true site modules (true sites in the training data set)
    • false site modules (false sites in the training data set)


Algorithm (1/3)

• Training algorithm (Donor Model)
  – Two training data sets
    • positive training data set, E_t
    • negative training data set, E_f
  – Probability of a transition from base b_i to base b_{i+1}
  – True Donor Module
    • P(True | S, M(t)), where S is a sequence in set M
  – False Donor Module
    • P(False | S, M(f)), where S is a sequence in set M
• Testing data set, M: 200 true donor sites and 14000 false donor sites
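The transition-probability step can be illustrated with a minimal sketch. It assumes that the probability of moving from base b_i to base b_{i+1} is estimated separately for each position, as a conditional relative frequency over aligned, equal-length training sequences. The function name `estimate_transitions`, the optional `pseudocount` parameter, and the toy sequences are hypothetical and do not come from the slides or the paper.

```python
from collections import Counter

BASES = "AGCT"

def estimate_transitions(seqs, pseudocount=0.0):
    """Return a list tr where tr[i][(a, b)] estimates the probability that
    base b at position i+1 follows base a at position i.

    Assumption of this sketch: conditional relative frequencies, estimated
    per position from aligned sequences of equal length.
    """
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to the same length")
    tables = []
    for i in range(length - 1):
        pair_counts = Counter((s[i], s[i + 1]) for s in seqs)
        from_counts = Counter(s[i] for s in seqs)
        table = {}
        for a in BASES:
            denom = from_counts[a] + pseudocount * len(BASES)
            for b in BASES:
                num = pair_counts[(a, b)] + pseudocount
                # A base never seen at position i gets probability 0 here.
                table[(a, b)] = num / denom if denom > 0 else 0.0
        tables.append(table)
    return tables

# Toy, made-up aligned sequences (not data from the paper).
toy_true_sites = ["AGGTA", "AGGTG", "CGGTA", "AGGTA"]
tr_true = estimate_transitions(toy_true_sites)
print(tr_true[0][("A", "G")])  # 1.0: every A at position 0 is followed by G
print(tr_true[3][("T", "A")])  # 0.75: three of the four T's are followed by A
```

Running the same function on the negative set E_f would give the tables for the false donor module under the same assumption.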


Algorithm (2/3)

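• Bayes’ rule

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

• Probability of S being a donor sequence

$$P(S \mid \mathrm{True}, M^{(t)}) = \prod_{i=1}^{T_{states}-1} tr_i^{(t)}(b_i, b_{i+1}), \quad b_i \in \{A, G, C, T\}$$

$$P(\mathrm{True} \mid S, M^{(t)}) = \frac{P(S \mid \mathrm{True}, M^{(t)})\,P(\mathrm{True})}{P(S)}, \quad P(S) = \prod_{i=1}^{T_{states}} P(b_i \mid \mathrm{True}, M^{(t)})$$

• Probability of S being a nondonor sequence

$$P(S \mid \mathrm{False}, M^{(f)}) = \prod_{i=1}^{T_{states}-1} tr_i^{(f)}(b_i, b_{i+1}), \quad b_i \in \{A, G, C, T\}$$

$$P(\mathrm{False} \mid S, M^{(f)}) = \frac{P(S \mid \mathrm{False}, M^{(f)})\,P(\mathrm{False})}{P(S)}, \quad P(S) = \prod_{i=1}^{T_{states}} P(b_i \mid \mathrm{False}, M^{(f)})$$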
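The Bayes' rule step on this slide can be sketched as follows, assuming the likelihood is a product of per-position transition probabilities and P(S) is a product of per-position base probabilities. The function names, the dictionary layout, and the three-base toy module are hypothetical and serve only to illustrate the calculation.

```python
import math

def likelihood(seq, tr):
    """P(S | class, M): product of the transition probabilities along seq."""
    return math.prod(tr[i][(seq[i], seq[i + 1])] for i in range(len(seq) - 1))

def evidence(seq, base_prob):
    """P(S): product of the per-position base probabilities of the module."""
    return math.prod(base_prob[i][seq[i]] for i in range(len(seq)))

def posterior(seq, tr, base_prob, prior):
    """P(class | S, M) by Bayes' rule."""
    return likelihood(seq, tr) * prior / evidence(seq, base_prob)

# Made-up three-base module: only the entries needed for "AGT" are listed.
seq = "AGT"
tr = [{("A", "G"): 0.5}, {("G", "T"): 1.0}]
base_prob = [{"A": 0.5}, {"G": 0.8}, {"T": 1.0}]
prior = 0.1

print(f"{likelihood(seq, tr):.3f}")                   # 0.500
print(f"{evidence(seq, base_prob):.3f}")              # 0.400
print(f"{posterior(seq, tr, base_prob, prior):.3f}")  # 0.125
```

The same `posterior` function serves both modules: pass the true-site tables with P(True), or the false-site tables with P(False).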


Algorithm (3/3)

• The pratio is calculated for each sequence in set M

$$pratio = \frac{P(\mathrm{True} \mid S, M^{(t)})}{P(\mathrm{False} \mid S, M^{(f)})}$$

• Sort the pratio values in descending order

• Calculate the positive lower bound, denoted L_p

$$S_{emp} = \frac{T(TP)}{T(PP)} = \frac{180}{180 + X}$$

$$S_{emn} = \frac{T(TP)}{T(P+N)} = \frac{180}{200}$$

$$L_p = N \times S_{emn} = 200 \times 0.9 = 180\text{th pratio value of the positive sequences in set } M$$

  – T(TP): positive sequences that belong to set P
  – T(P+N): positive sequences in set M
  – T(PP): sequences in set P

• A sequence S whose pratio is greater than L_p is assigned to set P
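The threshold selection on this slide can be sketched as below: sort the pratio values of the positive sequences in descending order and take the value at rank N × S_emn. The function name `positive_lower_bound` and the ten toy pratio values are hypothetical. The slide uses N = 200 and S_emn = 0.9, which gives the 180th value.

```python
def positive_lower_bound(positive_pratios, s_emn=0.9):
    """Return L_p: the pratio value at rank N * S_emn (1-based) after sorting
    the positive sequences' pratio values in descending order."""
    ranked = sorted(positive_pratios, reverse=True)
    rank = int(round(len(ranked) * s_emn))  # e.g. 200 * 0.9 = 180
    if not 1 <= rank <= len(ranked):
        raise ValueError("N * S_emn must select a rank between 1 and N")
    return ranked[rank - 1]

# Ten made-up pratio values; with S_emn = 0.9 the 9th largest is selected.
toy_pratios = [5.0, 0.2, 3.1, 7.4, 1.1, 0.9, 2.6, 4.8, 6.3, 0.05]
print(positive_lower_bound(toy_pratios))  # 0.2
```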


Algorithm for classifying splicing junction donor sequences

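$$P(Y=1 \mid S_{cand}, M^{(t)}) = \frac{P(S_{cand} \mid Y=1, M^{(t)})\,P(Y=1)}{P(S_{cand})}$$

$$P(Y=0 \mid S_{cand}, M^{(f)}) = \frac{P(S_{cand} \mid Y=0, M^{(f)})\,P(Y=0)}{P(S_{cand})}$$

$$sratio = \frac{P(Y=1 \mid S_{cand}, M^{(t)})}{P(Y=0 \mid S_{cand}, M^{(f)})}$$

$$KIND_i = \begin{cases} 1, & \text{if } sratio > L_p \\ 0, & \text{otherwise} \end{cases}$$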
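The decision rule for a candidate sequence can be sketched as follows, assuming the two posteriors have already been computed and that a candidate is labelled a true donor when its sratio is strictly greater than L_p. The function names, the zero-denominator guard, the candidate values, and the threshold are all made up for illustration.

```python
import math

def sratio(post_true, post_false):
    """Ratio of the two posteriors for one candidate sequence."""
    if post_false == 0.0:
        return math.inf  # guard added for this sketch only
    return post_true / post_false

def kind(sratio_value, lp):
    """Return 1 (true donor) if sratio exceeds L_p, otherwise 0 (false donor)."""
    return 1 if sratio_value > lp else 0

# Made-up posterior pairs (P(Y=1 | S), P(Y=0 | S)) and a made-up threshold.
candidates = {"cand_a": (4.0e-4, 1.0e-4), "cand_b": (2.0e-5, 1.0e-3)}
lp = 0.5
labels = {name: kind(sratio(pt, pf), lp) for name, (pt, pf) in candidates.items()}
print(labels)  # {'cand_a': 1, 'cand_b': 0}
```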


Example (1/4)

• Training data set M
  – 200 true donor sites
  – 14000 false donor sites


Example (2/4)

• A sequence S (AGGGTCAGT)

$$P(S \mid \mathrm{True}, M^{(t)}) = \prod_{i=1}^{T_{states}-1} tr_i^{(t)}(b_i, b_{i+1}), \quad b_i \in \{A, G, C, T\}$$

$$P(\mathrm{True} \mid S, M^{(t)}) = \frac{P(S \mid \mathrm{True}, M^{(t)})\,P(\mathrm{True})}{P(S)}, \quad P(S) = \prod_{i=1}^{T_{states}} P(b_i \mid \mathrm{True}, M^{(t)})$$

P(S|True,M(t)) = 0.05*0.11*0.81*1*0.03*0.02*0.63*0.46 = 0.0000007746354

P(True) = 200/14200 = 0.014

P(S) = 0.32*0.13*0.81*1*1*0.03*0.72*0.83*0.51 = 0.0003081

P(True|S,M(t)) = (0.0000007746354*0.014)/0.0003081 = 0.0000352


Example (3/4)

• A sequence S (AGGGTCAGT)

$$P(S \mid \mathrm{False}, M^{(f)}) = \prod_{i=1}^{T_{states}-1} tr_i^{(f)}(b_i, b_{i+1}), \quad b_i \in \{A, G, C, T\}$$

$$P(\mathrm{False} \mid S, M^{(f)}) = \frac{P(S \mid \mathrm{False}, M^{(f)})\,P(\mathrm{False})}{P(S)}, \quad P(S) = \prod_{i=1}^{T_{states}} P(b_i \mid \mathrm{False}, M^{(f)})$$

P(S|False,M(f)) = 0.07*0.08*0.27*1*0.22*0.06*0.07*0.07 = 0.00000009779616

P(False) = 0.986

P(S) = 0.25*0.25*0.27*1*1*0.22*0.24*0.25*0.3 = 0.00006683

P(False|S,M(f)) = (0.00000009779616*0.986)/0.00006683 = 0.001443


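Example (4/4)

• pratio = 0.0000352 / 0.001443 = 0.0244

$$pratio = \frac{P(\mathrm{True} \mid S, M^{(t)})}{P(\mathrm{False} \mid S, M^{(f)})}$$

• 200 pratio values → descending order → L_p = 180th value

• Testing data, S_cand → Table 1 & 2 → sratio → KIND_i (True donor or False donor)

$$sratio = \frac{P(Y=1 \mid S_{cand}, M^{(t)})}{P(Y=0 \mid S_{cand}, M^{(f)})}$$

$$KIND_i = \begin{cases} 1 \ (\text{True donor}), & \text{if } sratio > L_p \\ 0 \ (\text{False donor}), & \text{otherwise} \end{cases}$$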
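The arithmetic of Examples (2/4) to (4/4) can be reproduced with a few lines of Python. The factors and priors are copied from the slides; the variable names are chosen here for illustration. Because the slides round intermediate values, the unrounded results agree with them only to the digits shown.

```python
import math

# Example (2/4): true donor module, sequence AGGGTCAGT.
lik_true = math.prod([0.05, 0.11, 0.81, 1, 0.03, 0.02, 0.63, 0.46])
ev_true = math.prod([0.32, 0.13, 0.81, 1, 1, 0.03, 0.72, 0.83, 0.51])
post_true = lik_true * 0.014 / ev_true      # P(True) = 200/14200, rounded to 0.014

# Example (3/4): false donor module, same sequence.
lik_false = math.prod([0.07, 0.08, 0.27, 1, 0.22, 0.06, 0.07, 0.07])
ev_false = math.prod([0.25, 0.25, 0.27, 1, 1, 0.22, 0.24, 0.25, 0.3])
post_false = lik_false * 0.986 / ev_false   # P(False) = 0.986

# Example (4/4): ratio of the two posteriors.
pratio = post_true / post_false

print(f"{post_true:.7f}")   # 0.0000352
print(f"{post_false:.6f}")  # 0.001443
print(f"{pratio:.4f}")      # 0.0244
```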