Post on 31-Mar-2015

Modeling Dependencies in Protein-DNA Binding Sites

1 School of Computer Science & Engineering
2 Hadassah Medical School

The Hebrew University, Jerusalem, Israel

Yoseph Barash 1

Gal Elidan 1

Nir Friedman 1

Tommy Kaplan 1,2

[Figure: a gene with its promoter and a binding site]

Dependent positions in binding sites

Pros: biology suggests dependencies
  A single amino acid interacts with two nucleotides
  Change in conformation of the protein or DNA

Cons: modeling dependencies is harder
  Additional parameters
  Requires more data, not as robust

A?C?T

To model or not to model dependencies? [Man & Stormo, 2001; Bulyk et al, 2002; Benos et al, 2002]

Most approaches assume position independence

Data-driven approach

Can we learn dependencies from available genomic data? Yes

Do dependency models perform better? Yes

Outline
  Flexible models of dependencies
  Learning from (un)aligned sequences
  Systematic evaluation
  Biological insights

How to model binding sites?

How can we represent a distribution of binding sites, $P(X_1, X_2, X_3, X_4, X_5)$?

Profile: independence model
$P(X_1 \ldots X_5) = P(X_1)\,P(X_2)\,P(X_3)\,P(X_4)\,P(X_5)$

Tree: direct dependencies
$P(X_1 \ldots X_5) = P(X_1)\,P(X_2 \mid X_3)\,P(X_3 \mid X_1)\,P(X_4)\,P(X_5 \mid X_3)$

Mixture of Profiles: global dependencies (hidden variable $T$)
$P(X_1 \ldots X_5) = \sum_T P(T)\,P(X_1 \mid T)\,P(X_2 \mid T)\,P(X_3 \mid T)\,P(X_4 \mid T)\,P(X_5 \mid T)$

Mixture of Trees: both types of dependencies
$P(X_1 \ldots X_5) = \sum_T P(T)\,P(X_1 \mid T)\,P(X_2 \mid T, X_3)\,P(X_3 \mid T, X_1)\,P(X_4 \mid T)\,P(X_5 \mid T, X_3)$
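The model families named on this slide differ only in how the joint probability of a site factorizes. The following is a minimal sketch of scoring a site under a profile, a tree, and a mixture; the function names, the table layout, and the toy 3-position parameters are all illustrative assumptions and do not come from the slides.

```python
import itertools
import math

ALPHABET = "ACGT"

def profile_logprob(site, profile):
    """log2 P(site) under an independence (profile) model.

    profile[i] maps each nucleotide to its probability at position i.
    """
    return sum(math.log2(profile[i][x]) for i, x in enumerate(site))

def tree_logprob(site, parents, root_tables, cond_tables):
    """log2 P(site) under a tree (or forest) model.

    parents[i] is the parent position of i, or None if i is a root.
    root_tables[i][x]      = P(X_i = x) for a root position i
    cond_tables[i][(x, y)] = P(X_i = x | X_parent = y) otherwise
    """
    total = 0.0
    for i, x in enumerate(site):
        p = parents[i]
        if p is None:
            total += math.log2(root_tables[i][x])
        else:
            total += math.log2(cond_tables[i][(x, site[p])])
    return total

def mixture_logprob(site, weights, component_logprobs):
    """log2 of sum_T P(T) P(site | T), computed stably in log space.

    component_logprobs is a list of functions mapping a site to log2 P(site | T).
    """
    terms = [math.log2(w) + f(site) for w, f in zip(weights, component_logprobs)]
    m = max(terms)
    return m + math.log2(sum(2 ** (t - m) for t in terms))

# Toy 3-position example (hypothetical numbers): X1 is a root, X2 depends on X1,
# X3 is an independent root.
uniform = {x: 0.25 for x in ALPHABET}
toy_parents = [None, 0, None]
toy_roots = {0: uniform, 2: {"A": 0.4, "C": 0.1, "G": 0.4, "T": 0.1}}
toy_cond = {1: {(x, y): (0.7 if x == y else 0.1) for x in ALPHABET for y in ALPHABET}}

# Sanity check: the tree model defines a proper distribution over all 64 sites.
total_mass = sum(
    2 ** tree_logprob("".join(s), toy_parents, toy_roots, toy_cond)
    for s in itertools.product(ALPHABET, repeat=3)
)
```

A mixture of trees is obtained by passing tree-scoring functions as the components of `mixture_logprob`, so the same three helpers cover all four model types.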

Learning models: Aligned binding sites

Learning based on methods for probabilistic graphical models (Bayesian networks)

Aligned binding sites:
GCGGGGCCGGGC
TGGGGGCGGGGT
AGGGGGCGGGGG
TAGGGGCCGGGC
TGGGGGCGGGGT
AAAGGGCCGGGC
GGGAGGCCGGGA
GCGGGGCGGGGC
GAGGGGACGAGT
CCGGGGCGGTCC
ATGGGGCGGGGC

[Diagram: aligned binding sites -> learning machinery -> models (Profile, Tree, Mixture of Profiles, Mixture of Trees)]

Select the maximum likelihood model
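The slides say only that learning uses Bayesian-network methods and selects the maximum likelihood model. As one hedged illustration, the sketch below estimates a profile from aligned sites and picks tree edges with the Chow-Liu procedure (a maximum-weight spanning tree over pairwise mutual information), which is the standard way to find a maximum likelihood tree. Whether the authors used exactly this procedure is an assumption, as are the function names and the pseudocount.

```python
import math
from collections import Counter
from itertools import combinations

ALPHABET = "ACGT"

# The aligned sites shown on the slide.
SLIDE_SITES = [
    "GCGGGGCCGGGC", "TGGGGGCGGGGT", "AGGGGGCGGGGG", "TAGGGGCCGGGC",
    "TGGGGGCGGGGT", "AAAGGGCCGGGC", "GGGAGGCCGGGA", "GCGGGGCGGGGC",
    "GAGGGGACGAGT", "CCGGGGCGGTCC", "ATGGGGCGGGGC",
]

def learn_profile(sites, pseudocount=1.0):
    """Per-position nucleotide frequencies (pure maximum likelihood if pseudocount=0)."""
    total = len(sites) + pseudocount * len(ALPHABET)
    profile = []
    for i in range(len(sites[0])):
        counts = Counter(s[i] for s in sites)
        profile.append({x: (counts[x] + pseudocount) / total for x in ALPHABET})
    return profile

def mutual_information(sites, i, j):
    """Empirical mutual information, in bits, between positions i and j."""
    n = len(sites)
    pi = Counter(s[i] for s in sites)
    pj = Counter(s[j] for s in sites)
    pij = Counter((s[i], s[j]) for s in sites)
    return sum((c / n) * math.log2(c * n / (pi[a] * pj[b]))
               for (a, b), c in pij.items())

def chow_liu_edges(sites):
    """Edges (i, j, weight) of a maximum-weight spanning tree (Kruskal's algorithm)."""
    length = len(sites[0])
    weighted = sorted(((mutual_information(sites, i, j), i, j)
                       for i, j in combinations(range(length), 2)), reverse=True)
    root = list(range(length))  # union-find forest over positions

    def find(u):
        while root[u] != u:
            root[u] = root[root[u]]  # path halving
            u = root[u]
        return u

    edges = []
    for w, i, j in weighted:
        ri, rj = find(i), find(j)
        if ri != rj:  # adding this edge does not create a cycle
            root[ri] = rj
            edges.append((i, j, w))
    return edges

slide_profile = learn_profile(SLIDE_SITES)
slide_edges = chow_liu_edges(SLIDE_SITES)
```

With only eleven sites the mutual information estimates are noisy, which is the robustness concern raised earlier in the talk; in practice weak edges would be pruned or penalized.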

Evaluation using aligned data

Data: 95 TFs with ≥ 20 binding sites from the TRANSFAC database [Wingender et al, 2001]

Estimate generalization of each model by cross-validation: split the data set into a training set and a test set.

Test: how probable is the site given the model?

[Figure: example data set of aligned 12-bp sites split into a training set and a test set, with a test log-likelihood for each held-out site]

Test log-likelihoods: -20.34, -23.03, -21.31, -19.10, -18.42, -19.70, -22.39, -23.54, -22.39, -23.54, -18.07, -19.18, -18.31, -21.43

Test avg. LL = -20.77
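The held-out evaluation described on this slide can be sketched as follows for the profile model. The number of folds, the pseudocount, the interleaved fold assignment, and the helper names are assumptions for illustration; the slides do not state them for the aligned data.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

def learn_profile(sites, pseudocount=1.0):
    """Per-position nucleotide frequencies with a pseudocount (an assumed smoothing)."""
    total = len(sites) + pseudocount * len(ALPHABET)
    profile = []
    for i in range(len(sites[0])):
        counts = Counter(s[i] for s in sites)
        profile.append({x: (counts[x] + pseudocount) / total for x in ALPHABET})
    return profile

def log_likelihood(site, profile):
    """log2 P(site | profile)."""
    return sum(math.log2(profile[i][x]) for i, x in enumerate(site))

def cross_validated_avg_ll(sites, k=5):
    """Average held-out log2-likelihood per site over k interleaved folds."""
    scores = []
    for fold in range(k):
        test = sites[fold::k]
        train = [s for idx, s in enumerate(sites) if idx % k != fold]
        profile = learn_profile(train)  # the model never sees the test sites
        scores.extend(log_likelihood(s, profile) for s in test)
    return sum(scores) / len(scores)
```

Swapping `learn_profile` and `log_likelihood` for the corresponding tree or mixture routines gives the per-model averages that the talk compares.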

Example: Arabidopsis ABA binding factor 1

Profile
  Test LL per instance: -19.93

Mixture of Profiles (component weights 76% and 24%)
  Test LL per instance: -18.70 (+1.23; improvement in likelihood > 2-fold)

Tree (diagram over positions X4 to X12)
  Test LL per instance: -18.47 (+1.46; improvement in likelihood > 2.5-fold)
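The fold improvements quoted for this example follow from the log-likelihood gains if the log-likelihoods are measured in bits (base 2). That base is an assumption: the slides do not state it, but it is the reading consistent with the quoted figures. A short worked check:

```python
def fold_change(delta_ll_bits):
    """Likelihood ratio implied by a gain of delta_ll_bits in log2-likelihood."""
    return 2.0 ** delta_ll_bits

PROFILE_LL = -19.93  # test LL per instance for the profile model (from the slide)

gains = {}
for name, ll in [("Mixture of Profiles", -18.70), ("Tree", -18.47)]:
    gain = ll - PROFILE_LL
    gains[name] = fold_change(gain)
    print(f"{name}: +{gain:.2f} bits, {gains[name]:.2f}-fold")
```

Both values land above the thresholds quoted on the slide (more than 2-fold for the mixture, more than 2.5-fold for the tree).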

Likelihood improvement over profiles

[Figure: fold-change in likelihood (log scale, 0.5 to 128) for the 95 TRANSFAC aligned data sets, each marked as significant or not significant by a paired t-test]

Significant improvement in generalization

Data often exhibits dependencies

Motif finding problem

Input: a set of potentially co-regulated genes
  Sources of data:
    Gene annotation (e.g. Hughes et al, 2000)
    Gene expression (e.g. Spellman et al, 1998; Tavazoie et al, 2000)
    ChIP (e.g. Simon et al, 2001; Lee et al, 2002)

Output: a common motif in their promoters

Evaluation for unaligned data

EM algorithm

Learning models: unaligned data

Use the EM algorithm to simultaneously:
  Identify binding site positions
  Learn a dependency model

[Diagram: unaligned data feeds a loop that alternates between "identify binding sites" and "learn a model", producing the models (Profile, Tree, Mixture of Profiles, Mixture of Trees)]
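The alternation between identifying site positions and learning a model can be sketched with a deliberately simplified EM loop. The version below uses a profile rather than a dependency model, and it assumes exactly one site per sequence, a uniform background, and a random initialization. None of these choices are taken from the slides; they only keep the example short.

```python
import math
import random

ALPHABET = "ACGT"

def em_motif(seqs, width, iterations=50, pseudocount=0.5, seed=0):
    """EM for a profile motif with one site per sequence (simplifying assumption).

    Returns (profile, posteriors), where posteriors[n][p] is the probability
    that the site in sequence n starts at position p.
    """
    rng = random.Random(seed)
    # Initialise by putting all posterior mass on one random window per sequence.
    post = []
    for s in seqs:
        n_windows = len(s) - width + 1
        start = rng.randrange(n_windows)
        post.append([1.0 if p == start else 0.0 for p in range(n_windows)])

    profile = None
    for _ in range(iterations):
        # M-step ("learn a model"): expected nucleotide counts per motif column.
        counts = [{x: pseudocount for x in ALPHABET} for _ in range(width)]
        for s, weights in zip(seqs, post):
            for p, w in enumerate(weights):
                if w > 0.0:
                    for i in range(width):
                        counts[i][s[p + i]] += w
        profile = [{x: c[x] / sum(c.values()) for x in ALPHABET} for c in counts]

        # E-step ("identify binding sites"): posterior over the start position,
        # using the motif-versus-uniform-background likelihood ratio per window.
        post = []
        for s in seqs:
            scores = [math.prod(profile[i][s[p + i]] / 0.25 for i in range(width))
                      for p in range(len(s) - width + 1)]
            z = sum(scores)
            post.append([sc / z for sc in scores])
    return profile, post
```

Replacing the M-step with tree or mixture learning, and the window score with the corresponding model likelihood, gives the dependency-model variant the slide describes. Like any EM procedure, the result depends on the starting point, so several restarts are usually needed.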

ChIP location analysis [Lee et al, 2002]

Yeast genome-wide location experiments
Target genes for 106 TFs in 146 experiments

[Table: for each of the ~6000 yeast genes (YAL001C, YAL002W, YAL003W, YAL005C, ..., YPR201W), whether it is a target (+) or not (-) of each TF, e.g. ABF1 targets, ZAP1 targets]

Example: models learned for ABF1 (YPD), autonomously replicating sequence-binding factor 1

[Figure: known profile (from TRANSFAC), learned profile, and learned Mixture of Profiles (labels: 43, 492)]

Evaluating performance

Detect target genes on a genomic scale:

[Figure: promoter sequence ACGTAT…AGGGATGCGAGC spanning positions -1000 to 0, with position -473 marked]

Example: Gal4 regulates Gal80

[Figure: p-value (log scale, 10^-1 to 10^-8) along promoter positions -180 to -60 for the Profile and Mixture of Trees models; threshold: Bonferroni corrected p-value ≤ 0.01; the biologically verified site is marked]

Evaluation using ChIP location data [Lee et al, 2002]

Evaluate using a 5-fold cross-validation test:

[Figure: the data set of genes (YAL001C, YAL002W, YAL003W, YAL005C, YAL007C, YAL008W, YAL009W, YAL010C, YAL012C, YAL013W, ..., YPR201W) with true target labels (+/-); genes in the test set receive a prediction, which is compared with the true label and marked correct (√), false negative (FN), or false positive (FP)]

Example: ROC curve of HSF1

[Figure: true positive rate (sensitivity, 0% to 90%) versus false positive rate (0% to 5%) for the Profile, Tree, Mixture of Profiles, and Mixture of Trees models; an annotation marks ~60 FP]
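An ROC curve like the one for HSF1 is traced by lowering a score threshold over all genes and recording the false and true positive rates at each step. The sketch below assumes one score per gene and a boolean target label; how the models turn promoter scans into a per-gene score is not specified here, and score ties are not treated specially.

```python
def roc_points(scores, labels):
    """(false positive rate, true positive rate) pairs as the threshold is lowered.

    scores: one number per gene, higher meaning more likely to be a target.
    labels: True/1 for genes that are true targets, False/0 otherwise.
    """
    positives = sum(1 for is_target in labels if is_target)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points, tp, fp = [(0.0, 0.0)], 0, 0
    for _, is_target in ranked:
        if is_target:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points

example = roc_points([0.9, 0.8, 0.4, 0.2], [1, 0, 1, 0])
```

With roughly 6000 yeast genes, a false positive rate of 1% corresponds to about 60 false positives, which may be what the "~60 FP" annotation on the plot marks.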

Improvement in sensitivity & specificity: Tree vs. Profile

Sensitivity = TP / True
Specificity = TP / Predicted

[Figure: Δ specificity (-25 to 20) versus Δ sensitivity (-20 to 60) for the 105 unaligned data sets from Lee et al.; an inset shows TP as the overlap of the True and Predicted sets; plot annotations: 30, 615, 3]
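The two measures used in these comparisons are simple set ratios. The sketch below computes them from sets of gene names; the function name and the example gene assignments are hypothetical. Note that TP / Predicted, called specificity in the slides, is the quantity more commonly named precision or positive predictive value.

```python
def sensitivity_specificity(true_targets, predicted_targets):
    """Sensitivity = TP / |True| and specificity = TP / |Predicted| (slide definitions)."""
    true_targets = set(true_targets)
    predicted_targets = set(predicted_targets)
    tp = len(true_targets & predicted_targets)  # genes both true and predicted
    sensitivity = tp / len(true_targets) if true_targets else 0.0
    specificity = tp / len(predicted_targets) if predicted_targets else 0.0
    return sensitivity, specificity

# Hypothetical labels for a handful of yeast genes.
true_set = {"YAL001C", "YAL003W", "YAL005C"}
predicted_set = {"YAL001C", "YAL005C", "YAL010C", "YAL012C"}
sens, spec = sensitivity_specificity(true_set, predicted_set)
```

The Δ values in the scatter plots are then the differences between a dependency model's measures and the profile's measures on the same data set.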

Improvement in sensitivity & specificity: Mixture of Profiles vs. Profile

[Figure: Δ specificity (-25 to 20) versus Δ sensitivity (-20 to 60) for the 105 unaligned data sets from Lee et al., with sensitivity = TP / True and specificity = TP / Predicted; plot annotations: 52, 1718, 0]

Improvement in sensitivity & specificity: Mixture of Trees vs. Profile

[Figure: Δ specificity (-25 to 20) versus Δ sensitivity (-20 to 60) for the 105 unaligned data sets from Lee et al., with sensitivity = TP / True and specificity = TP / Predicted; plot annotations: 84, 162, 1]

“Is it worthwhile to model dependencies?” Evaluation clearly supports this.

What about the underlying biology? (with Prof. Hanah Margalit, Hadassah Medical School)

Distance between dependent positions

[Figure: number of dependencies (0 to 50) by distance between the dependent positions (1 to 11), split by strength: weak (< 0.3 bits), medium (< 0.7 bits), strong; based on Tree models learned from the aligned data sets; an annotation reads "< 1/3 of the dependencies"]

Structural families

Dependency models vs. Profile on aligned data sets

[Figure: fold-change in likelihood (log scale, 0.5 to 128) across the aligned data sets (10 to 90 on the horizontal axis), each marked as significant or not significant by a paired t-test and grouped by structural family: zinc finger, bZIP, bHLH, helix-turn-helix, β sheet, others ???]

Conclusions
  Flexible framework for learning dependencies
  Dependencies are found in many cases
  It is worthwhile to model them: better learning and binding site prediction

http://compbio.cs.huji.ac.il/TFBN

Future work
  Link to the underlying structural biology
  Incorporate as part of other regulatory mechanism models