(c) M Gerstein '06, 1 CS/CBB 545 - Data Mining Predicting Networks through Bayesian Integration #1...

50
(c) M Gerstein '06, gerstein.info/talks CS/CBB 545 - Data Mining Predicting Networks through Bayesian Integration #1 - Theory Mark Gerstein, Yale University gersteinlab.org/courses/545 (class 2007,02.27 14:30-15:45)

description

(c) M Gerstein '06, gerstein.info/talks 3 Combining networks forms an ideal way of integrating diverse information Metabolic pathway Transcriptional regulatory network Physical protein- protein Interaction Co-expression Relationship Part of the TCA cycle Genetic interaction (synthetic lethal) Signaling pathways

Transcript of (c) M Gerstein '06, 1 CS/CBB 545 - Data Mining Predicting Networks through Bayesian Integration #1...

Page 1: (c) M Gerstein '06,   1 CS/CBB 545 - Data Mining Predicting Networks through Bayesian Integration #1 - Theory Mark Gerstein, Yale University.

(

c) M

Ger

stei

n '0

6, g

erst

ein.

info

/talk

s

1

CS/CBB 545 - Data MiningPredicting Networks through

Bayesian Integration #1 - Theory

Mark Gerstein, Yale Universitygersteinlab.org/courses/545

(class 2007,02.27 14:30-15:45)

Page 2: (c) M Gerstein '06,   1 CS/CBB 545 - Data Mining Predicting Networks through Bayesian Integration #1 - Theory Mark Gerstein, Yale University.

(

c) M

Ger

stei

n '0

6, g

erst

ein.

info

/talk

s

2

Predicting Networks via Bayesian Integration: Problem Motivation

Page 3: (c) M Gerstein '06,   1 CS/CBB 545 - Data Mining Predicting Networks through Bayesian Integration #1 - Theory Mark Gerstein, Yale University.

(

c) M

Ger

stei

n '0

6, g

erst

ein.

info

/talk

s

3

Combining networks forms an ideal way of integrating diverse information

Metabolic pathway

Transcriptional regulatory network

Physical protein-protein Interaction

Co-expression Relationship

Part of the TCA cycle

Genetic interaction (synthetic lethal)Signaling pathways

Page 4: (c) M Gerstein '06,   1 CS/CBB 545 - Data Mining Predicting Networks through Bayesian Integration #1 - Theory Mark Gerstein, Yale University.

(

c) M

Ger

stei

n '0

6, g

erst

ein.

info

/talk

s

4

Binding Experiments on Subunit Pairs

1 1 1 1 0 1 0Pull-down 2 1 1 0 1 0 1 0 1 1 0 1 11 0 0 1 01 0 0 0 0 0 0 0 0 0 0

1 1 0 1 0 1 0Pull-down 1 1 1 0 1 0 1 0 1 1 0 1 11 1 0 1 01 0 0 0 0 0 1 0 0 0 0

0 0 1 1 1 1 1 1 0 1Cross-linking 1 11 1 1 11111

1Pull-down 3 1 1 0 0 0 11 0

Far Western 3 1 0 0 10 0

Far Western 2 11 11 1 11 11 1 0 0 0 0 010 0 0 0 0 0 0 0 0 0 0 0 0 0

1 1 1 1 1 0 0 1Far Western 1 0 0 0 0 0 0 0

Interaction experiments before structure was known

1 1 1 1 1 1 1 1 1 2 3 5 6 6 6 6 6 8 8 8 8 9 9 9 10 10 12

2 3 5 6 8 9 10 11 12 3 5 6 8 9 10 11 12 5 6 8 9 10 11 12 6 8 9 10 11 12 8 9 10 11 12 9 10 11 12 10 11 12 11 11 12

2 2 2 2 2 22 3 3 3 3 33 5 5 5 55SubunitsSubunits

Page 5: (c) M Gerstein '06,   1 CS/CBB 545 - Data Mining Predicting Networks through Bayesian Integration #1 - Theory Mark Gerstein, Yale University.

(

c) M

Ger

stei

n '0

6, g

erst

ein.

info

/talk

s

5

RNA polymerase II: Structure

Which subunits interact?Based on Binding

experiments

Source: Cramer et al., 2000, Science, 288:632-633

Compare with Gold Std. Structure

5

6

8

9

10 11

12

1

2

3

Source: Edwards et al., 2002, Trends in Genetics

Page 6: (c) M Gerstein '06,   1 CS/CBB 545 - Data Mining Predicting Networks through Bayesian Integration #1 - Theory Mark Gerstein, Yale University.

(

c) M

Ger

stei

n '0

6, g

erst

ein.

info

/talk

s

6

Gold-Standard Positives

5

6

8

9

10 11

12

1

2

3

Gold-Standard Positive (GSTD+): 13

1 1 1 1 1 1 1 1 1 2 3 5 6 6 6 6 6 8 8 8 8 9 9 9 10 10 122 2 2 2 2 22 3 3 3 3 33 5 5 5 55Subunits2 3 5 6 8 9 10 11 12 3 5 6 8 9 10 11 12 5 6 8 9 10 11 12 6 8 9 10 11 12 8 9 10 11 12 9 10 11 12 10 11 12 11 11 12Subunits

Page 7: (c) M Gerstein '06,   1 CS/CBB 545 - Data Mining Predicting Networks through Bayesian Integration #1 - Theory Mark Gerstein, Yale University.

(

c) M

Ger

stei

n '0

6, g

erst

ein.

info

/talk

s

7

Gold-Standard Negatives

Gold-Standard Negative (GSTD-): 32

5

6

8

9

10 11

12

1

2

3

1 1 1 1 1 1 1 1 1 2 3 5 6 6 6 6 6 8 8 8 8 9 9 9 10 10 122 2 2 2 2 22 3 3 3 3 33 5 5 5 55Subunits2 3 5 6 8 9 10 11 12 3 5 6 8 9 10 11 12 5 6 8 9 10 11 12 6 8 9 10 11 12 8 9 10 11 12 9 10 11 12 10 11 12 11 11 12Subunits

Page 8: (c) M Gerstein '06,   1 CS/CBB 545 - Data Mining Predicting Networks through Bayesian Integration #1 - Theory Mark Gerstein, Yale University.

(

c) M

Ger

stei

n '0

6, g

erst

ein.

info

/talk

s

8

RNA Polymerase II: Gold-Standards

Gold-Standard Negative (GSTD-): 32

1 1 1 1 1 1 1 1 1 2 3 5 6 6 6 6 6 8 8 8 8 9 9 9 10 10 122 2 2 2 2 22 3 3 3 3 33 5 5 5 55Subunits2 3 5 6 8 9 10 11 12 3 5 6 8 9 10 11 12 5 6 8 9 10 11 12 6 8 9 10 11 12 8 9 10 11 12 9 10 11 12 10 11 12 11 11 12Subunits

Gold-Standard Positive (GSTD+): 13

5

6

8

9

10 11

12

1

2

3

Page 9: (c) M Gerstein '06,   1 CS/CBB 545 - Data Mining Predicting Networks through Bayesian Integration #1 - Theory Mark Gerstein, Yale University.

(

c) M

Ger

stei

n '0

6, g

erst

ein.

info

/talk

s

9

Assess Quality and Coverage of PPints

1 1 1 1 0 1 0Pull-down 2 1 1 0 1 0 1 0 1 1 0 1 11 0 0 1 01 0 0 0 0 0 0 0 0 0 0

1 1 0 1 0 1 0Pull-down 1 1 1 0 1 0 1 0 1 1 0 1 11 1 0 1 01 0 0 0 0 0 1 0 0 0 0

0 0 1 1 1 1 1 1 0 1Cross-linking 1 11 1 1 11111

1Pull-down 3 1 1 0 0 0 11 0

Far Western 3 1 0 0 10 0

Far Western 2 11 11 1 11 11 1 0 0 0 0 010 0 0 0 0 0 0 0 0 0 0 0 0 0

1 1 1 1 1 0 0 1Far Western 1 0 0 0 0 0 0 0

1 1 1 1 1 1 1 1 1 2 3 5 6 6 6 6 6 8 8 8 8 9 9 9 10 10 122 2 2 2 2 22 3 3 3 3 33 5 5 5 55Subunits2 3 5 6 8 9 10 11 12 3 5 6 8 9 10 11 12 5 6 8 9 10 11 12 6 8 9 10 11 12 8 9 10 11 12 9 10 11 12 10 11 12 11 11 12Subunits

TrueFalse

GSTD+GSTD-

Likelihood Ratio for Feature f:

|

|f

ff

p x GSTDL

p x GSTD

“Quality Score”

Page 10: (c) M Gerstein '06,   1 CS/CBB 545 - Data Mining Predicting Networks through Bayesian Integration #1 - Theory Mark Gerstein, Yale University.

(

c) M

Ger

stei

n '0

6, g

erst

ein.

info

/talk

s

10

Data integration: RNA polymerase II

Subunit A 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 5 5 5 5 5 5 6 6 6 6 6 8 8 8 8 9 9 9 10 10 11

Subunit B 2 3 5 6 8 9 10 11 12 3 5 6 8 9 10 11 12 5 6 8 9 10 11 12 6 8 9 10 11 12 8 9 10 11 12 9 10 11 12 10 11 12 11 12 12

structural contact 1 0 1 1 1 1 0 1 0 1 0 0 0 1 1 0 1 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Far western 1 1 1 1 1 0 0 0 0 0 0 1 0 0 0

Cross-linking 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 1 1 1 0 1

Far western 1 1 1 1 1 1 1 1 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Pull-down 1 1 0 1 0 1 0 1 1 0 1 0 1 0 1 1 1 0 1 1 1 1 0 1 0 0 0 0 0 0 1 0 0 0 0

Pull-down 1 1 1 1 0 1 0 1 1 0 1 0 1 0 1 1 1 0 1 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0

Pull-down 1 1 1 0 1 0 0 1 0

Far western 1 0 0 0 1 0

= false

= true

Page 11: (c) M Gerstein '06,   1 CS/CBB 545 - Data Mining Predicting Networks through Bayesian Integration #1 - Theory Mark Gerstein, Yale University.

(

c) M

Ger

stei

n '0

6, g

erst

ein.

info

/talk

s

11

Integrate using naive Bayes classifier

Majority 1 1 1 1 1 0 1 0 1 1 1 0 1 0 1 0 1 1 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Intersection 1 1 1 0 1 0 0 0 1 1 1 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Union 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 0 1 1 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0

Data integration: RNA polymerase II

Page 12: (c) M Gerstein '06,   1 CS/CBB 545 - Data Mining Predicting Networks through Bayesian Integration #1 - Theory Mark Gerstein, Yale University.

(

c) M

Ger

stei

n '0

6, g

erst

ein.

info

/talk

s

12

Data integration: RNA polymerase II

Integrate using naive Bayes classifier

Majority 1 1 1 1 1 0 1 0 1 1 1 0 1 0 1 0 1 1 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Intersection 1 1 1 0 1 0 0 0 1 1 1 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Union 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 0 1 1 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0

Page 13: (c) M Gerstein '06,   1 CS/CBB 545 - Data Mining Predicting Networks through Bayesian Integration #1 - Theory Mark Gerstein, Yale University.

(

c) M

Ger

stei

n '0

6, g

erst

ein.

info

/talk

s

13

Data integration: RNA polymerase II

Integrate using naive Bayes classifier

Majority 1 1 1 1 1 0 1 0 1 1 1 0 1 0 1 0 1 1 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Intersection 1 1 1 0 1 0 0 0 1 1 1 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Union 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 0 1 1 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0

(Cross validate)

Page 14: (c) M Gerstein '06,   1 CS/CBB 545 - Data Mining Predicting Networks through Bayesian Integration #1 - Theory Mark Gerstein, Yale University.

(

c) M

Ger

stei

n '0

6, g

erst

ein.

info

/talk

s

14

Weighted Voting: the Likelihood Ratio

structural contact 1 0 1 1 1 1 0 1 0 1 0 0 0 1 1 0 1 0 0 0 0 1 1 1 0 0 0 0

Far western 1 1 1 1 1 0 0 0 0 0 0 1 0

Far western (dup) 1 1 1 1 1 0 0 0 0 0 0 1 0

Cross-linking 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 1 1 1 0

Far western 1 1 1 1 1 1 1 1 1 1 0 0 0 1 0 0 0 0

Pull-down 1 1 0 1 0 1 0 1 1 0 1 0 1 0 1 1 1 0 1 1 1 1 0 1

Pull-down 1 1 1 1 0 1 0 1 1 0 1 0 1 0 1 1 1 0 1 1 0 1 0 1

Pull-down 1 1 1 0 1 0 0 1 0

Far western 1 0 0 0 1 0

Combined 0 1 1 1 1 0 0 0 1 1 1 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0 0 0

Maj. Vote: 0 = round(avg( 0 + 0 + 0 + 1 + 1 + 0 + 0 )With weights: likelihood ratio L = L1 + L2 + L3 …

Page 15: (c) M Gerstein '06,   1 CS/CBB 545 - Data Mining Predicting Networks through Bayesian Integration #1 - Theory Mark Gerstein, Yale University.

(

c) M

Ger

stei

n '0

6, g

erst

ein.

info

/talk

s

15

Predicting Networks via Bayesian Integration: Intuition & Formalism

Page 16: (c) M Gerstein '06,   1 CS/CBB 545 - Data Mining Predicting Networks through Bayesian Integration #1 - Theory Mark Gerstein, Yale University.

(

c) M

Ger

stei

n '0

6, g

erst

ein.

info

/talk

s

16

Classification by Voting (N1)

Derived from "perceptron model" in earlier class R = <w,f> + b

Page 17: (c) M Gerstein '06,   1 CS/CBB 545 - Data Mining Predicting Networks through Bayesian Integration #1 - Theory Mark Gerstein, Yale University.

(

c) M

Ger

stei

n '0

6, g

erst

ein.

info

/talk

s

17

Classification by Voting (N2)

Page 18: (c) M Gerstein '06,   1 CS/CBB 545 - Data Mining Predicting Networks through Bayesian Integration #1 - Theory Mark Gerstein, Yale University.

(

c) M

Ger

stei

n '0

6, g

erst

ein.

info

/talk

s

18

Bayes Rule

Which is shorthand for:

[From Mitchell, Machine Learning]

Page 19: (c) M Gerstein '06,   1 CS/CBB 545 - Data Mining Predicting Networks through Bayesian Integration #1 - Theory Mark Gerstein, Yale University.

(

c) M

Ger

stei

n '0

6, g

erst

ein.

info

/talk

s

19

More Bayes Rule

Page 20: (c) M Gerstein '06,   1 CS/CBB 545 - Data Mining Predicting Networks through Bayesian Integration #1 - Theory Mark Gerstein, Yale University.

(

c) M

Ger

stei

n '0

6, g

erst

ein.

info

/talk

s

20

More Bayes Rule

Page 21: (c) M Gerstein '06,   1 CS/CBB 545 - Data Mining Predicting Networks through Bayesian Integration #1 - Theory Mark Gerstein, Yale University.

(

c) M

Ger

stei

n '0

6, g

erst

ein.

info

/talk

s

21

Feature Correlation and Fully

Connected Bayes

Page 22: (c) M Gerstein '06,   1 CS/CBB 545 - Data Mining Predicting Networks through Bayesian Integration #1 - Theory Mark Gerstein, Yale University.

(

c) M

Ger

stei

n '0

6, g

erst

ein.

info

/talk

s

22

Estimating Probabilities

• We have so far estimated P(X=x | Y=y) by the fraction nx|y/ny, where ny is the number of instances for which Y=y and nx|y is the number of these for which X=x

• This is a problem when nx is small E.g., assume P(X=x | Y=y)=0.05 and the training set is s.t. that ny=5. Then it is

highly probable that nx|y=0 The fraction is thus an underestimate of the actual probability It will dominate the Bayes classifier for all new queries with X=x

Page 23: (c) M Gerstein '06,   1 CS/CBB 545 - Data Mining Predicting Networks through Bayesian Integration #1 - Theory Mark Gerstein, Yale University.

(

c) M

Ger

stei

n '0

6, g

erst

ein.

info

/talk

s

23

m-estimate

• Replace nx|y/ny by:

mnmpn

y

yx

|

• Where p is our prior estimate of the probability we wish to determine and m is a constant Typically, p = 1/k (where k is the number of possible values of X) m acts as a weight (similar to adding m virtual instances distributed according to

p)

Page 24: (c) M Gerstein '06,   1 CS/CBB 545 - Data Mining Predicting Networks through Bayesian Integration #1 - Theory Mark Gerstein, Yale University.

(

c) M

Ger

stei

n '0

6, g

erst

ein.

info

/talk

s

24

Training and Testing Set

• Cross Validation: Leave one out, seven-fold

Credits: Munson, 1995; Garnier et al., 1996

4-fold

Cross Validation

Page 25: (c) M Gerstein '06,   1 CS/CBB 545 - Data Mining Predicting Networks through Bayesian Integration #1 - Theory Mark Gerstein, Yale University.

(

c) M

Ger

stei

n '0

6, g

erst

ein.

info

/talk

s

25

Intuition N2.7 : ROC Curve & the prior

Page 26: (c) M Gerstein '06,   1 CS/CBB 545 - Data Mining Predicting Networks through Bayesian Integration #1 - Theory Mark Gerstein, Yale University.

(

c) M

Ger

stei

n '0

6, g

erst

ein.

info

/talk

s

26

Predicting Networks via Bayesian Integration:

Worked Examples

Page 27: (c) M Gerstein '06,   1 CS/CBB 545 - Data Mining Predicting Networks through Bayesian Integration #1 - Theory Mark Gerstein, Yale University.

(

c) M

Ger

stei

n '0

6, g

erst

ein.

info

/talk

s

27

Network Gold-Standards

Prediction of protein interactions: Bayesian integration

[Jansen, Yu, et al., Science; Yu, et al., Genome Res.]

Feature 2, e.g. Y2HFeature 1, e.g. co-expression

Gold-standard +Gold-standard –

Page 28: (c) M Gerstein '06,   1 CS/CBB 545 - Data Mining Predicting Networks through Bayesian Integration #1 - Theory Mark Gerstein, Yale University.

(

c) M

Ger

stei

n '0

6, g

erst

ein.

info

/talk

s

28

Network Gold-Standards

[Jansen, Yu, et al., Science; Yu, et al., Genome Res.]

Feature 2, e.g. Y2HFeature 1, e.g. co-expression

Gold-standard +Gold-standard –

Prediction of protein interactions: Bayesian integration

Frac. of Gold-Std Negatives with Feature

Frac. of Gold-Std Positives with Feature

( | )( | )i

ii

p xLp x

Likelihood Ratio for Feature i:

= "Quality Score" =

Page 29: (c) M Gerstein '06,   1 CS/CBB 545 - Data Mining Predicting Networks through Bayesian Integration #1 - Theory Mark Gerstein, Yale University.

(

c) M

Ger

stei

n '0

6, g

erst

ein.

info

/talk

s

29

Feature 2, e.g. Y2HFeature 1, e.g. co-expression

Gold-standard +Gold-standard –

Network Gold-Standards

L1 = (4/4)/(3/6) =2

[Jansen, Yu, et al., Science; Yu, et al., Genome Res.]

( | )( | )i

ii

p xLp x

Likelihood Ratio for Feature i:

Prediction of protein interactions: Bayesian integration

Page 30: (c) M Gerstein '06,   1 CS/CBB 545 - Data Mining Predicting Networks through Bayesian Integration #1 - Theory Mark Gerstein, Yale University.

(

c) M

Ger

stei

n '0

6, g

erst

ein.

info

/talk

s

30

Feature 2, e.g. Y2HFeature 1, e.g. co-expression

Gold-standard +Gold-standard –

Network Gold-Standards

L1 = (4/4)/(3/6) =2

[Jansen, Yu, et al., Science; Yu, et al., Genome Res.]

( | )( | )i

ii

p xLp x

Likelihood Ratio for Feature i:

Prediction of protein interactions: Bayesian integration

Page 31: (c) M Gerstein '06,   1 CS/CBB 545 - Data Mining Predicting Networks through Bayesian Integration #1 - Theory Mark Gerstein, Yale University.

(

c) M

Ger

stei

n '0

6, g

erst

ein.

info

/talk

s

31

Feature 2, e.g. Y2HFeature 1, e.g. co-expression

Gold-standard +Gold-standard –

Network Gold-Standards

L1 = (4/4)/(3/6) =2

[Jansen, Yu, et al., Science; Yu, et al., Genome Res.]

( | )( | )i

ii

p xLp x

Likelihood Ratio for Feature i:

Prediction of protein interactions: Bayesian integration

Page 32: (c) M Gerstein '06,   1 CS/CBB 545 - Data Mining Predicting Networks through Bayesian Integration #1 - Theory Mark Gerstein, Yale University.

(

c) M

Ger

stei

n '0

6, g

erst

ein.

info

/talk

s

32

Feature 2, e.g. Y2HFeature 1, e.g. co-expression

Gold-standard +Gold-standard –

Network Gold-Standards

L1 = (4/4)/(3/6) =2

[Jansen, Yu, et al., Science; Yu, et al., Genome Res.]

( | )( | )i

ii

p xLp x

Likelihood Ratio for Feature i:

Prediction of protein interactions: Bayesian integration

Page 33: (c) M Gerstein '06,   1 CS/CBB 545 - Data Mining Predicting Networks through Bayesian Integration #1 - Theory Mark Gerstein, Yale University.

(

c) M

Ger

stei

n '0

6, g

erst

ein.

info

/talk

s

33

Feature 2, e.g. Y2HFeature 1, e.g. co-expression

Gold-standard +Gold-standard –

Network Gold-Standards

[Jansen, Yu, et al., Science; Yu, et al., Genome Res.]

( | )( | )i

ii

p xLp x

Likelihood Ratio for Feature i:

Prediction of protein interactions: Bayesian integration

L1 = (4/4)/(3/6) =2

Page 34: (c) M Gerstein '06,   1 CS/CBB 545 - Data Mining Predicting Networks through Bayesian Integration #1 - Theory Mark Gerstein, Yale University.

(

c) M

Ger

stei

n '0

6, g

erst

ein.

info

/talk

s

34

Feature 2, e.g. Y2HFeature 1, e.g. co-expression

Gold-standard +Gold-standard –

Network Gold-StandardsL1 = (4/4)/(3/6) =2L2 = (3/4)/(3/6) =1.5

For each protein pair:LR = L1 L2log(LR) = log(L1) + log(L2)

[Jansen, Yu, et al., Science; Yu, et al., Genome Res.]

( | )( | )i

ii

p xLp x

Likelihood Ratio for Feature i:

Prediction of protein interactions: Bayesian integration

Weighted Voting(Assuming uncorrelated features and Naïve Bayes)

Page 35: (c) M Gerstein '06,   1 CS/CBB 545 - Data Mining Predicting Networks through Bayesian Integration #1 - Theory Mark Gerstein, Yale University.

(

c) M

Ger

stei

n '0

6, g

erst

ein.

info

/talk

s

35

Feature 2, e.g. Y2HFeature 1, e.g. co-expression

Gold-standard +Gold-standard –

Network Gold-StandardsL1 = (4/4)/(3/6) =2L2 = (3/4)/(3/6) =1.5

For each protein pair:LR = L1 L2log(LR) = log(L1) + log(L2)

[Jansen, Yu, et al., Science; Yu, et al., Genome Res.]

( | )( | )i

ii

p xLp x

Likelihood Ratio for Feature i:

Prediction of protein interactions: Bayesian integration

Page 36: (c) M Gerstein '06,   1 CS/CBB 545 - Data Mining Predicting Networks through Bayesian Integration #1 - Theory Mark Gerstein, Yale University.

(

c) M

Ger

stei

n '0

6, g

erst

ein.

info

/talk

s

36

4

4

3

0

TPLR Cutoffs

4

3

2

1

Feature 1; L1 = 2Feature 2; L2 = 1.5Gold-standard +Gold-standard –Predicted PositivePredicted Negative

LR cutoffs1No Prediction made

Protein

Bayesian Integration and ROC Curves

2 <32

13<

3 <23/2<

4 3/2

Page 37: (c) M Gerstein '06,   1 CS/CBB 545 - Data Mining Predicting Networks through Bayesian Integration #1 - Theory Mark Gerstein, Yale University.

(

c) M

Ger

stei

n '0

6, g

erst

ein.

info

/talk

s

37

6

3

0

0

FPLR Cutoffs

4

3

2

1

Feature 1; L1 = 2Feature 2; L2 = 1.5Gold-standard +Gold-standard –Predicted PositivePredicted Negative

LR cutoffs1No Prediction made

Protein

Bayesian Integration and ROC Curves

2 <32

13<

3 <23/2<

4 3/2

Page 38: (c) M Gerstein '06,   1 CS/CBB 545 - Data Mining Predicting Networks through Bayesian Integration #1 - Theory Mark Gerstein, Yale University.

(

c) M

Ger

stei

n '0

6, g

erst

ein.

info

/talk

s

38

64

34

03

00

FPTPLR Cutoffs

4

3

2

1

Feature 1; L1 = 2Feature 2; L2 = 1.5Gold-standard +Gold-standard –Predicted PositivePredicted Negative

LR cutoffs1No Prediction made

Protein

43

2

1

FPR=FP/N"Error Rate"

TPR

=TP/

P"C

over

age"

ROC Curve

1/2 1

1

Bayesian Integration and ROC Curves

2 <32

13<

3 <23/2<

4 3/2

Page 39: (c) M Gerstein '06,   1 CS/CBB 545 - Data Mining Predicting Networks through Bayesian Integration #1 - Theory Mark Gerstein, Yale University.

(

c) M

Ger

stei

n '0

6, g

erst

ein.

info

/talk

s

39

Likelihood Ratios

1 1 0 1 0 1 0Pull-down 1 1 1 0 1 0 1 0 1 1 0 1 11 1 0 1 01 0 0 0 0 0 1 0 0 0 0

1 1 1 1 1 1 1 1 1 2 3 5 6 6 6 6 6 8 8 8 8 9 9 9 10 10 122 2 2 2 2 22 3 3 3 3 33 5 5 5 55Subunits2 3 5 6 8 9 10 11 12 3 5 6 8 9 10 11 12 5 6 8 9 10 11 12 6 8 9 10 11 12 8 9 10 11 12 9 10 11 12 10 11 12 11 11 12Subunits

TrueFalse

GSTD+GSTD-

00

0

||

p x GSTDL

p x GSTD

11

1

||

p x GSTDL

p x GSTD

Likelihood Ratio for Feature f:

|

|f

ff

p x GSTDL

p x GSTD

Page 40: (c) M Gerstein '06,   1 CS/CBB 545 - Data Mining Predicting Networks through Bayesian Integration #1 - Theory Mark Gerstein, Yale University.

(

c) M

Ger

stei

n '0

6, g

erst

ein.

info

/talk

s

40

Calculating Likelihood Ratios

1 0 1 0 0Pull-down 1 1 0 1 1 1

6 /13

4 /13

1 1 1 1 1 1 1 1 1 2 3 5 6 6 6 6 6 8 8 8 8 9 9 9 10 10 122 2 2 2 2 22 3 3 3 3 33 5 5 5 55Subunits2 3 5 6 8 9 10 11 12 3 5 6 8 9 10 11 12 5 6 8 9 10 11 12 6 8 9 10 11 12 8 9 10 11 12 9 10 11 12 10 11 12 11 11 12Subunits

TrueFalse

GSTD+GSTD-

00

0

||

p x GSTDL

p x GSTD

11

1

||

p x GSTDL

p x GSTD

Page 41: (c) M Gerstein '06,   1 CS/CBB 545 - Data Mining Predicting Networks through Bayesian Integration #1 - Theory Mark Gerstein, Yale University.

(

c) M

Ger

stei

n '0

6, g

erst

ein.

info

/talk

s

41

Calculating Likelihood Ratios

1 1Pull-down 1 1 0 1 0 1 1 01 1 0 1 01 0 0 0 0 0 1 0 0 0 0

6 /1311/ 32

4 /1314 / 32

1 1 1 1 1 1 1 1 1 2 3 5 6 6 6 6 6 8 8 8 8 9 9 9 10 10 122 2 2 2 2 22 3 3 3 3 33 5 5 5 55Subunits2 3 5 6 8 9 10 11 12 3 5 6 8 9 10 11 12 5 6 8 9 10 11 12 6 8 9 10 11 12 8 9 10 11 12 9 10 11 12 10 11 12 11 11 12Subunits

TrueFalse

GSTD+GSTD-

=1.34

1 1 0 1 0 1 0Pull-down 1 1 1 0 1 0 1 0 1 1 0 1 11 1 0 1 01 0 0 0 0 0 1 0 0 0 0

=0.70

00

0

||

p x GSTDL

p x GSTD

11

1

||

p x GSTDL

p x GSTD

Page 42: (c) M Gerstein '06,   1 CS/CBB 545 - Data Mining Predicting Networks through Bayesian Integration #1 - Theory Mark Gerstein, Yale University.

(

c) M

Ger

stei

n '0

6, g

erst

ein.

info

/talk

s

42

Calculating Likelihood Ratios1 1 1 1 1 1 1 1 1 2 3 5 6 6 6 6 6 8 8 8 8 9 9 9 10 10 122 2 2 2 2 22 3 3 3 3 33 5 5 5 55Subunits2 3 5 6 8 9 10 11 12 3 5 6 8 9 10 11 12 5 6 8 9 10 11 12 6 8 9 10 11 12 8 9 10 11 12 9 10 11 12 10 11 12 11 11 12Subunits

TrueFalse

GSTD+GSTD-

Pull-down 1 L1 = (6/13) / (11/32) = 1.34 L0 = (4/13) / (14/32) = 0.70Pull-down 2 L1 = (7/13) / (9/32) = 1.91 L0 = (2/13) / (16/32) = 0.31Pull-down 3 L1 = (2/13) / (3/32) = 1.64 L0 = (2/13) / (2/32) = 2.46Cross-linking L1 = (10/13) / (7/32) = 3.52 L0 = (0/13) / (3/32) = 0Far Western 1 L1 = (2/13) / (4/32) = 1.23 L0 = (3/13) / (6/32) = 1.23Far Western 2 L1 = (6/13) / (5/32) = 2.95 L0 = (2/13) / (17/32) = 0.29Far Western 3 L1 = (1/13) / (1/32) = 2.46 L0 = (2/13) / (2/32) = 2.46

1 1 1 1 0 1 0Pull-down 2 1 1 0 1 0 1 0 1 1 0 1 11 0 0 1 01 0 0 0 0 0 0 0 0 0 0

1 1 0 1 0 1 0Pull-down 1 1 1 0 1 0 1 0 1 1 0 1 11 1 0 1 01 0 0 0 0 0 1 0 0 0 0

0 0 1 1 1 1 1 1 0 1Cross-linking 1 11 1 1 11111

1Pull-down 3 1 1 0 0 0 11 0

Far Western 3 1 0 0 10 0

Far Western 2 11 11 1 11 11 1 0 0 0 0 010 0 0 0 0 0 0 0 0 0 0 0 0 0

1 1 1 1 1 0 0 1Far Western 1 0 0 0 0 0 0 0

Page 43: (c) M Gerstein '06,   1 CS/CBB 545 - Data Mining Predicting Networks through Bayesian Integration #1 - Theory Mark Gerstein, Yale University.

(

c) M

Ger

stei

n '0

6, g

erst

ein.

info

/talk

s

43

Data Integration: ROC-Curve

0 0 1 0 1 1 1 0 0Combined (Bayes) 1 11 1 1 01110 00 01 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

1 1 1 1 0 1 0Pull-down 2 1 1 0 1 0 1 0 1 1 0 1 11 0 0 1 01 0 0 0 0 0 0 0 0 0 0

1 1 0 1 0 1 0Pull-down 1 1 1 0 1 0 1 0 1 1 0 1 11 1 0 1 01 0 0 0 0 0 1 0 0 0 0

0 0 1 1 1 1 1 1 0 1Cross-linking 1 11 1 1 11111

1Pull-down 3 1 1 0 0 0 11 0

Far Western 3 1 0 0 10 0

Far Western 2 11 11 1 11 11 1 0 0 0 0 010 0 0 0 0 0 0 0 0 0 0 0 0 0

1 1 1 1 1 0 0 1Far Western 1 0 0 0 0 0 0 0

3.52

18.2

11.1

13.9

26.6

0.22 0

0.22

0.22

0.29

0.06

0.12

0.22

0.29

0.06

0.06

0.22

0.22

0.36

0.08

0.91

0.220

0.52

2.16

132

19.5

0.53

3.68

5.52

12.7

10.4

2.25

26.6

0.220

2.25

11.1

18.2

10.4

2.25

0.06

0.29

0.290

1 1 1 1 1 1 1 1 1 2 3 5 6 6 6 6 6 8 8 8 8 9 9 9 10 10 122 2 2 2 2 22 3 3 3 3 33 5 5 5 55Subunits2 3 5 6 8 9 10 11 12 3 5 6 8 9 10 11 12 5 6 8 9 10 11 12 6 8 9 10 11 12 8 9 10 11 12 9 10 11 12 10 11 12 11 11 12Subunits

TrueFalse

GSTD+GSTD- “Weighted Voting”

1 1,..., ...n nL f f L f L f

Page 44: (c) M Gerstein '06,   1 CS/CBB 545 - Data Mining Predicting Networks through Bayesian Integration #1 - Theory Mark Gerstein, Yale University.

(

c) M

Ger

stei

n '0

6, g

erst

ein.

info

/talk

s

44

Data Integration: ROC Curve

0 0 1 0 1 1 1 0 0Combined (Bayes) 1 11 1 1 01110 00 01 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

3.52

18.2

11.1

13.9

26.6

0.22 0

0.22

0.22

0.29

0.06

0.12

0.22

0.29

0.06

0.06

0.22

0.22

0.36

0.08

0.91

0.220

0.52

2.16

132

19.5

0.53

3.68

5.52

12.7

10.4

2.25

26.6

0.220

2.25

11.1

18.2

10.4

2.25

0.06

0.29

0.290

1 1 1 0 1 0 1 0 1Majority 1 11 1 1 01110 00 01 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 1 0 1 0 0 0 0Intersection 1 11 1 0 01110 00 01 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

1 1 1 1 1 1 1 1 1Union 1 11 1 1 11111 00 01 1 1 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0

0.2

0.4

0.6

0.8

1.0

0.2 0.4 0.6 0.8 1.0

FPR=FP/N=1-Specificity

TPR=

TP/P

=Se

nsit

ivit

y

ROC Curves of RNA Polymerase II

1 1 1 1 1 1 1 1 1 2 3 5 6 6 6 6 6 8 8 8 8 9 9 9 10 10 122 2 2 2 2 22 3 3 3 3 33 5 5 5 55Subunits2 3 5 6 8 9 10 11 12 3 5 6 8 9 10 11 12 5 6 8 9 10 11 12 6 8 9 10 11 12 8 9 10 11 12 9 10 11 12 10 11 12 11 11 12Subunits

TrueFalse

GSTD+GSTD-

Bayesian IntegrationMajorityIntersectionUnion

Page 45: (c) M Gerstein '06,   1 CS/CBB 545 - Data Mining Predicting Networks through Bayesian Integration #1 - Theory Mark Gerstein, Yale University.

(

c) M

Ger

stei

n '0

6, g

erst

ein.

info

/talk

s

45

Predicting Networks via Bayesian Integration: Features Correlation

Page 46: (c) M Gerstein '06,   1 CS/CBB 545 - Data Mining Predicting Networks through Bayesian Integration #1 - Theory Mark Gerstein, Yale University.

(

c) M

Ger

stei

n '0

6, g

erst

ein.

info

/talk

s

46

Correlations between similar features

structural contact 1 0 1 1 1 1 0 1 0 1 0 0 0 1 1 0 1 0 0 0 0 1 1 1 0 0 0 0

Far western 1 1 1 1 1 0 0 0 0 0 0 1 0

Far western (dup) 1 1 1 1 1 0 0 0 0 0 0 1 0

Cross-linking 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 1 1 1 0

Far western 1 1 1 1 1 1 1 1 1 1 0 0 0 1 0 0 0 0

Pull-down 1 1 0 1 0 1 0 1 1 0 1 0 1 0 1 1 1 0 1 1 1 1 0 1

Pull-down 1 1 1 1 0 1 0 1 1 0 1 0 1 0 1 1 1 0 1 1 0 1 0 1

Pull-down 1 1 1 0 1 0 0 1 0

Far western 1 0 0 0 1 0

Combined 0 1 1 1 1 0 0 0 1 1 1 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0 0 0

Page 47: (c) M Gerstein '06,   1 CS/CBB 545 - Data Mining Predicting Networks through Bayesian Integration #1 - Theory Mark Gerstein, Yale University.

(

c) M

Ger

stei

n '0

6, g

erst

ein.

info

/talk

s

47

Intuition N3 : Feature Correlation

Page 48: (c) M Gerstein '06,   1 CS/CBB 545 - Data Mining Predicting Networks through Bayesian Integration #1 - Theory Mark Gerstein, Yale University.

(

c) M

Ger

stei

n '0

6, g

erst

ein.

info

/talk

s

48

Naive Bayes

APABPACPC,B,AP

B C

A

Page 49: (c) M Gerstein '06,   1 CS/CBB 545 - Data Mining Predicting Networks through Bayesian Integration #1 - Theory Mark Gerstein, Yale University.

(

c) M

Ger

stei

n '0

6, g

erst

ein.

info

/talk

s

49

A ‘correct’ factorisation

APABPB,ACPC,B,AP

B C

A

Page 50: (c) M Gerstein '06,   1 CS/CBB 545 - Data Mining Predicting Networks through Bayesian Integration #1 - Theory Mark Gerstein, Yale University.

(

c) M

Ger

stei

n '0

6, g

erst

ein.

info

/talk

s

50

A Typical BBN