Transcription factor binding site prediction in vivo using DNA sequence and shape features
Anthony Mathelier, Lin Yang, Tsu-Pei Chiu, Remo Rohs, and Wyeth Wasserman
[email protected] @AMathelier
REGSYSGEN2015 Nov. 17th
Centre for Molecular Medicine and Therapeutics
1
Transcriptional regulation of gene expression
[Figure: schematic of transcriptional regulation. Labels: DNA, nucleosome, histone octamer, TFs, regulatory proteins, cohesin, enhancer, promoters, TSS, RNA Pol II, RNA transcripts.]
A. Mathelier, W. Shi, and W.W. Wasserman, Trends in Genetics, 2015.
- Transcription of genes is turned on or off by transcription factors (TFs).
- TFs bind to DNA at transcription factor binding sites (TFBSs).
2
Modeling TFBS using position frequency matrices (PFMs)
Known binding sites:
GTAACAAT GTAAACAT GTAAACAA GTAAACAA GTAAACAT
GTAAACAA GTAAACAC GTCAACAG GTAAACAT GTAAACAA
GTAAACAT TTAAGTAA ATAAACAA CTAAACAG GTAAACAT
GTAAACAA GTAAACAT GTAAACAC GTAAACAT GTAAACAG
Position Frequency Matrix:
A [ 10   0 190 210 180  15 210  70]
C [ 10   0  20   0  15 180   0  25]
G [175   0   0   0  15   0   0  35]
T [ 15 210   0   0   0  15   0  80]
From PFMs to PWMs
Classically, position weight matrices (PWMs) are derived from PFMs to model TFBSs, assuming that the nucleotides within a TFBS are independent of each other.
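The PFM-to-PWM step can be illustrated with a minimal sketch that uses the counts from the matrix on this slide. The pseudocount of 1 and the uniform background of 0.25 per nucleotide are assumptions made for illustration; the slides do not state which values were used.

```python
import math

# Counts copied from the PFM on this slide (rows A, C, G, T; 8 positions).
PFM = {
    "A": [10, 0, 190, 210, 180, 15, 210, 70],
    "C": [10, 0, 20, 0, 15, 180, 0, 25],
    "G": [175, 0, 0, 0, 15, 0, 0, 35],
    "T": [15, 210, 0, 0, 0, 15, 0, 80],
}


def pfm_to_pwm(pfm, pseudocount=1.0, background=0.25):
    """Convert counts to log2-odds weights, one column at a time.

    pseudocount and background are illustrative assumptions.
    """
    length = len(pfm["A"])
    pwm = {base: [] for base in "ACGT"}
    for i in range(length):
        total = sum(pfm[b][i] for b in "ACGT") + 4 * pseudocount
        for b in "ACGT":
            freq = (pfm[b][i] + pseudocount) / total
            pwm[b].append(math.log2(freq / background))
    return pwm


def score(pwm, site):
    """Sum per-position weights; this is the independence assumption."""
    return sum(pwm[base][i] for i, base in enumerate(site))


PWM = pfm_to_pwm(PFM)
# Most frequent nucleotide at each position of the PFM.
consensus = "".join(max("ACGT", key=lambda b: PFM[b][i]) for i in range(8))
print(consensus, round(score(PWM, consensus), 2))
```

Because the score is a plain sum over positions, a PWM cannot express that the nucleotide at one position influences the preference at the next.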
3
Modeling TFBS using Transcription Factor Flexible Models
ChIP-seq sequences:
>HNF4A 1
...AGTTCAAAGTTCA...
>HNF4A 2
...AGTCCAAAGTTCA...
...
>HNF4A 73554
...CTTGGAACCGGGG...
>HNF4A 73555
...GGCAAGGTTCATA...
[Figure: TFFMs trained from the ChIP-seq sequences. Hidden Markov model with a background state (BG, emissions E0, transitions bg/bg and bg/fg) and one state per motif position (position 1, position 2, ..., position n, with emissions E1, E2, ..., En), shown with the corresponding logos (positions 1 to 14, y-axis in bits).]
A. Mathelier and W.W. Wasserman, PLoS Computational Biology, 2013.
TFFMs model the sequence properties of TFBSs from ChIP-seq data by capturing successive dinucleotide dependencies.
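To make "successive dinucleotide dependencies" concrete, the sketch below trains a position-specific first-order Markov model in which each nucleotide is conditioned on the one before it. It is a simplified stand-in, not the TFFM hidden Markov model itself (there is no background state and no training from unaligned ChIP-seq sequences). The four aligned toy sites, the function names, and the pseudocount are hypothetical.

```python
import math


def train_first_order(sites, pseudocount=1.0):
    """Estimate P(first base) and P(base at i | base at i-1) per position."""
    length = len(sites[0])
    start = {b: pseudocount for b in "ACGT"}
    # trans[i][prev][cur] counts pairs spanning positions i and i+1.
    trans = [
        {p: {c: pseudocount for c in "ACGT"} for p in "ACGT"}
        for _ in range(length - 1)
    ]
    for site in sites:
        start[site[0]] += 1
        for i in range(1, length):
            trans[i - 1][site[i - 1]][site[i]] += 1
    start_total = sum(start.values())
    start_p = {b: n / start_total for b, n in start.items()}
    trans_p = [
        {p: {c: n / sum(row.values()) for c, n in row.items()}
         for p, row in table.items()}
        for table in trans
    ]
    return start_p, trans_p


def log_likelihood(model, seq):
    """Log-probability of seq under the first-order model."""
    start_p, trans_p = model
    total = math.log(start_p[seq[0]])
    for i in range(1, len(seq)):
        total += math.log(trans_p[i - 1][seq[i - 1]][seq[i]])
    return total


# Hypothetical toy sites: A is always followed by C, T is always followed by G.
toy_sites = ["ACGT", "ACGT", "TGGT", "TGGT"]
model = train_first_order(toy_sites)
print(log_likelihood(model, "ACGT") > log_likelihood(model, "AGGT"))
```

In the toy data, C and G are equally frequent at the second position, so a model that treats positions independently would score ACGT and AGGT identically; the first-order model prefers ACGT because A was never followed by G in the training sites.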
4
DNA shape features
The DNAshape tool predicts DNA shape features of a DNA sequence.
Genome-wide DNA shape features available on GBshape are:
- Minor Groove Width (MGW)
- Roll
- Propeller Twist (ProT)
- Helix Twist (HelT)
T. Zhou et al., Nucl. Acids Res., 2013.
T.P. Chiu et al., Nucl. Acids Res., 2015.
5
Using DNA shape to model TFBSs
Studies have shown the importance of DNA shape for modeling TFBSs from:
- SELEX-seq experiments.
- Protein-binding microarray experiments.
- BunDLE-seq experiments.
N. Abe et al., Cell, 2015. T. Zhou et al., PNAS, 2015. M. Levo et al., Genome Res., 2015.
Aims of our study:
- Construct computational models from large-scale in vivo data (ChIP-seq) by combining DNA sequence and shape features.
- Show TFBS prediction improvements on in vivo data.
- Analyze whether DNA shape-induced improvements are TF family specific.
- Analyze position-specific DNA shape importance at TFBSs.
6
Combining TFFMs and DNA shapes at TFBSs
[Figure: the TFFM hit score and the MGW, ProT, Roll, and HelT values at a TFBS are combined into a feature vector.]
We used an ensemble machine learning approach to combine DNA sequence and shape features.
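As a rough illustration of the feature vector, the sketch below concatenates a hit score with per-position shape values in the order shown on the slide. The function name, the example numbers, and the absence of any normalization are assumptions; the slides do not specify how the values were scaled or which ensemble method consumed them.

```python
# Order of the shape features as listed on the slide.
SHAPE_ORDER = ("MGW", "ProT", "Roll", "HelT")


def build_feature_vector(hit_score, shape_values):
    """Concatenate the sequence-based hit score with shape values.

    shape_values maps each feature name to a list of per-position values
    over the predicted binding site (hypothetical layout).
    """
    vector = [hit_score]
    for name in SHAPE_ORDER:
        vector.extend(shape_values[name])
    return vector


# Placeholder values for a 4-bp site, not real DNAshape predictions.
# Roll and HelT describe steps between adjacent base pairs, so they
# have one value fewer than MGW and ProT here.
example_shape = {
    "MGW": [5.1, 4.8, 4.2, 4.9],
    "ProT": [-6.0, -9.5, -11.2, -7.3],
    "Roll": [-1.2, 3.4, -0.8],
    "HelT": [34.1, 32.7, 35.5],
}
features = build_feature_vector(0.93, example_shape)
print(len(features))
```

One such vector per candidate site, labeled as bound or unbound, is the kind of input an ensemble classifier can be trained on.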
7
DNA shape features improve TFBS prediction in vivo
[Figure, panels A and B: results on 400 human ENCODE ChIP-seq data sets.]
Combining TFFM scores and DNA shape features improves the discriminative power. The AUROC difference exceeds 0.05 in 107 cases.
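For readers unfamiliar with the metric, the AUROC (area under the ROC curve) equals the probability that a randomly chosen bound site scores higher than a randomly chosen unbound one. The sketch below computes it by direct pairwise comparison and applies the 0.05 threshold to two sets of made-up scores; the numbers are hypothetical and are not results from the study.

```python
def auroc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical classifier scores for one data set.
seq_only = auroc([0.9, 0.6, 0.4], [0.7, 0.5, 0.2])
seq_and_shape = auroc([0.9, 0.8, 0.6], [0.7, 0.3, 0.2])

difference = seq_and_shape - seq_only
print(round(seq_only, 3), round(seq_and_shape, 3), difference > 0.05)
```

The pairwise loop is quadratic in the number of sites, which is fine for a small illustration; rank-based implementations are preferable for large data sets.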
8
DNA shape features are important for specific TF families
[Figure, panels B and C.]
Data sets from E2F and MADS-domain TF families are enriched for strong improvements when considering DNA shape features.
9
Validation on independent plant MADS-domain TFs
Incorporating DNA shape features significantly improves TFBS prediction for plant MADS-domain TFs.
10
ProT position-specific importance for MADS-domain TFs
[Figure, panels A and B: AGL15 sequence logo (y-axis in bits).]
ProT is of critical importance for predicting TFBSs associated with plant MADS-domain TFs in a position-specific manner.
11
Conclusions
- Our analyses of ChIP-seq data represent the in vivo counterpart of the published in vitro studies.
- We can construct computational models combining DNA sequence and shape features from ChIP-seq data to improve TFBS prediction in vivo.
- Incorporating DNA shape information is most beneficial when applied to the E2F and MADS-domain TF families.
- ProT is critical for MADS-domain TF binding specificity in a position-specific manner.
12
Acknowledgements
- Wyeth Wasserman
- Remo Rohs
- Lin Yang
- Tsu-Pei Chiu
- Francois Parcy
- Oriol Fornes
- Chih-Yu Chen
Centre for Molecular Medicine and Therapeutics
13
Thank you
14