Transcription factor binding site prediction in vivo using DNA sequence and shape features

Anthony Mathelier, Lin Yang, Tsu-Pei Chiu, Remo Rohs, and Wyeth Wasserman

[email protected] @AMathelier

REGSYSGEN 2015, Nov. 17th

Centre for Molecular Medicine and Therapeutics



Transcriptional regulation of gene expression

Figure: schematic of transcriptional regulation, with labels for DNA, nucleosome, histone octamer, TFs, enhancer, promoters, cohesin, regulatory proteins, RNA Pol II, TSS, and RNA transcripts.

A. Mathelier, W. Shi, and W.W. Wasserman, Trends in Genetics, 2015.

- Transcription of genes is turned on/off thanks to transcription factors (TFs).
- TFs bind to DNA at transcription factor binding sites (TFBSs).


Modeling TFBS using position frequency matrices (PFMs)

Known binding sites:

GTAACAAT GTAAACAT GTAAACAA GTAAACAA GTAAACAT
GTAAACAA GTAAACAC GTCAACAG GTAAACAT GTAAACAA
GTAAACAT TTAAGTAA ATAAACAA CTAAACAG GTAAACAT
GTAAACAA GTAAACAT GTAAACAC GTAAACAT GTAAACAG

Position Frequency Matrix:

A [ 10   0 190 210 180  15 210  70]
C [ 10   0  20   0  15 180   0  25]
G [175   0   0   0  15   0   0  35]
T [ 15 210   0   0   0  15   0  80]
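
As a minimal sketch of how a PFM is tallied from aligned sites, the Python snippet below counts each nucleotide per column over the 20 sites listed on this slide, read as consecutive 8-nt sites. The function name `build_pfm` is made up for illustration. The slide's matrix sums to 210 per column, so it was presumably built from a larger collection than the 20 sites shown, and the counts computed here will not match it.

```python
# Aligned binding sites as listed on the slide (8 nt each).
SITES = [
    "GTAACAAT", "GTAAACAT", "GTAAACAA", "GTAAACAA", "GTAAACAT",
    "GTAAACAA", "GTAAACAC", "GTCAACAG", "GTAAACAT", "GTAAACAA",
    "GTAAACAT", "TTAAGTAA", "ATAAACAA", "CTAAACAG", "GTAAACAT",
    "GTAAACAA", "GTAAACAT", "GTAAACAC", "GTAAACAT", "GTAAACAG",
]


def build_pfm(sites):
    """Count occurrences of each nucleotide at each aligned position."""
    width = len(sites[0])
    pfm = {base: [0] * width for base in "ACGT"}
    for site in sites:
        for position, base in enumerate(site):
            pfm[base][position] += 1
    return pfm


pfm = build_pfm(SITES)
for base in "ACGT":
    print(base, pfm[base])
```

Each column of the result sums to the number of input sites, which is a quick sanity check on any PFM.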

PFMs - PWMs

Classically, position weight matrices (PWMs) are derived from PFMs to model TFBSs, assuming nucleotide independence within TFBSs.
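
One common way to derive a PWM from a PFM is a per-position log-odds transform; the sketch below applies it to the matrix from this slide. The pseudocount of 1 and the uniform background of 0.25 are assumptions chosen for illustration, and the function names are hypothetical. The slides do not state which convention the authors use.

```python
import math

# PFM values copied from the slide.
PFM = {
    "A": [10, 0, 190, 210, 180, 15, 210, 70],
    "C": [10, 0, 20, 0, 15, 180, 0, 25],
    "G": [175, 0, 0, 0, 15, 0, 0, 35],
    "T": [15, 210, 0, 0, 0, 15, 0, 80],
}


def pfm_to_pwm(pfm, pseudocount=1.0, background=0.25):
    """Convert counts to log2 odds against a uniform background (assumed)."""
    width = len(pfm["A"])
    pwm = {base: [] for base in "ACGT"}
    for i in range(width):
        total = sum(pfm[b][i] for b in "ACGT") + 4 * pseudocount
        for b in "ACGT":
            freq = (pfm[b][i] + pseudocount) / total
            pwm[b].append(math.log2(freq / background))
    return pwm


def score(pwm, site):
    """Sum per-position weights; this is where independence is assumed."""
    return sum(pwm[base][i] for i, base in enumerate(site))


pwm = pfm_to_pwm(PFM)
print(score(pwm, "GTAAACAT"), score(pwm, "GTAACAAT"))
```

Because the score is a plain sum over positions, the model cannot express that a nucleotide at one position changes the preference at the next one.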


Modeling TFBS using Transcription Factor Flexible Models

ChIP-seq sequences:

>HNF4A 1
...AGTTCAAAGTTCA...
>HNF4A 2
...AGTCCAAAGTTCA...
...
>HNF4A 73554
...CTTGGAACCGGGG...
>HNF4A 73555
...GGCAAGGTTCATA...

Figure: TFFM state diagram (labels: BG, bg/bg, bg/fg, E0, E1, E2, ..., En, position 1, position 2, ..., position n), followed by sequence logos (positions 1 to 14, y-axis in bits).

A. Mathelier and W.W. Wasserman, PLoS Computational Biology, 2013.

TFFMs

TFFMs model the sequence property of TFBSs from ChIP-seq data by capturing successive dinucleotide dependencies.
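
To illustrate what a successive dinucleotide dependency means, the sketch below trains a position-specific first-order Markov chain, in which the probability of each nucleotide depends on the one before it. This is a simplified stand-in, not the TFFM implementation. The training sites are toy sequences made up for this example, and the function names and pseudocount are assumptions.

```python
import math

BASES = "ACGT"


def train_first_order(sites, pseudocount=1.0):
    """Estimate P(first base) and, per position, P(current | previous)."""
    width = len(sites[0])
    start = {b: pseudocount for b in BASES}
    for site in sites:
        start[site[0]] += 1
    total = sum(start.values())
    start = {b: count / total for b, count in start.items()}

    transitions = []
    for i in range(1, width):
        counts = {p: {c: pseudocount for c in BASES} for p in BASES}
        for site in sites:
            counts[site[i - 1]][site[i]] += 1
        transitions.append({
            p: {c: counts[p][c] / sum(counts[p].values()) for c in BASES}
            for p in BASES
        })
    return start, transitions


def log_likelihood(model, seq):
    """Log-probability of seq; each term conditions on the previous base."""
    start, transitions = model
    total = math.log(start[seq[0]])
    for i in range(1, len(seq)):
        total += math.log(transitions[i - 1][seq[i - 1]][seq[i]])
    return total


# Toy aligned sites (hypothetical, for illustration only).
toy_sites = ["AGTTCA", "AGTCCA", "AGTTCA", "GGTTCA"]
model = train_first_order(toy_sites)
print(log_likelihood(model, "AGTTCA"), log_likelihood(model, "CCCCCC"))
```

Unlike a PWM score, the terms here are conditional on the preceding base, so two sites with the same nucleotide composition per position can receive different scores.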


DNA shape features

The DNAshape tool predicts DNA shape features of a DNA sequence. Genome-wide DNA shape features available on GBshape are:

- Minor Groove Width (MGW)
- Roll
- Propeller Twist (ProT)
- Helix Twist (HelT)

T. Zhou et al., Nucl. Acids Res., 2013.

T.P. Chiu et al., Nucl. Acids Res., 2015.


Using DNA shape to model TFBSs

Studies have shown the importance of DNA shape for modeling TFBSs from:

- SELEX-seq experiments.
- Protein-binding microarray experiments.
- BunDLE-seq experiments.

N. Abe et al., Cell, 2015. T. Zhou et al., PNAS, 2015. M. Levo et al., Genome Res., 2015.

Aims of our study:

- Construct computational models from large-scale in vivo data (ChIP-seq) by combining DNA sequence and shape features.
- Show TFBS prediction improvements on in vivo data.
- Analyze whether DNA shape-induced improvements are TF family specific.
- Analyze position-specific DNA shape importance at TFBSs.


Combining TFFMs and DNA shapes at TFBSs

Figure: feature vector assembled from the hit score and the MGW, ProT, Roll, and HelT values.

We used an ensemble machine learning approach to combine DNA sequence and shape features.
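
The feature vector in the figure can be sketched as a simple concatenation of the hit score with the four shape tracks. The function name `build_feature_vector` is hypothetical, and the numbers in the example are placeholders, not real DNAshape predictions. The slide does not name the ensemble method, so the classifier itself is left out.

```python
def build_feature_vector(hit_score, mgw, prot, roll, helt):
    """Concatenate the hit score with the per-position shape values.

    Track lengths are not forced to match: Roll and HelT are defined
    between adjacent base pairs, so they may be shorter than MGW and ProT.
    """
    return [hit_score, *mgw, *prot, *roll, *helt]


# Placeholder values for a 3-bp site (illustrative only).
vector = build_feature_vector(
    hit_score=0.92,
    mgw=[5.1, 4.8, 5.0],
    prot=[-7.2, -9.5, -6.1],
    roll=[-1.3, 2.4],
    helt=[34.0, 35.2],
)
print(len(vector), vector)
```

One vector of this kind per candidate site, labeled as bound or unbound, is the sort of input a supervised classifier would be trained on.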


DNA shape features improve TFBS prediction in vivo

Results on 400 human ENCODE ChIP-seq data sets (figure panels A and B).

Combining TFFM scores and DNA shape features improves the discriminative power. The AUROC difference is > 0.05 in 107 cases.
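
For reference, the AUROC used to compare models can be computed as the probability that a randomly chosen positive example outscores a randomly chosen negative one. The sketch below implements that definition directly and compares two hypothetical classifiers. The scores and labels are invented for illustration and are not results from the study.

```python
def auroc(scores, labels):
    """Fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Invented scores: 1 = bound region, 0 = background region.
labels = [1, 1, 1, 0, 0, 0]
sequence_only = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2]
sequence_and_shape = [0.9, 0.8, 0.6, 0.5, 0.3, 0.2]

difference = auroc(sequence_and_shape, labels) - auroc(sequence_only, labels)
print(difference > 0.05)
```

The pairwise loop is quadratic in the number of examples; it is written for clarity, not for data sets of ChIP-seq size.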


DNA shape features are important for specific TF families

Figure: panels B and C.

Data sets from E2F and MADS-domain TF families are enriched for strong improvements when considering DNA shape features.


Validation on independent plant MADS-domain TFs

Incorporating DNA shape features significantly improves TFBS prediction for plant MADS-domain TFs.


ProT position-specific importance for MADS-domain TFs

Figure (panels A and B): AGL15 logo, y-axis in bits.

ProT is of critical importance for predicting TFBSs associated with plant MADS-domain TFs in a position-specific manner.


Conclusions

- Our analyses of ChIP-seq data represent the in vivo counterpart of the published in vitro studies.
- We can construct computational models combining DNA sequence and shape features from ChIP-seq data to improve TFBS prediction in vivo.
- Incorporating DNA shape information is most beneficial when applied to the E2F and MADS-domain TF families.
- ProT is critical for MADS-domain TF binding specificity in a position-specific manner.


Acknowledgements

- Wyeth Wasserman
- Remo Rohs
- Lin Yang
- Tsu-Pei Chiu
- Francois Parcy
- Oriol Fornes
- Chih-Yu Chen

Centre for Molecular Medicine and Therapeutics



Thank you
