Inference of 3D regulatory interactions from 2D genomic data
Katie Pollard
Gladstone Institutes, Institute for Human Genetics, Division of Biostatistics - UCSF
ENCODE Users Meeting
Bolger Center - June 30, 2015
Eukaryotic gene regulation is 3D and complex
Eric Keller Drew Berry
Most phenotype-associated mutations are outside coding regions
[Figure: DNA schematic with Gene A, Gene B, and Gene C; noncoding variants (A/C, T/C, G/A) each marked as a cardiomyopathy-associated variant or human-chimp difference]
• Enhancers: individual ChIP-seq data sets identify <50% of known enhancers, plus many false positives
• Gene targets: closest gene is right ~10% of the time
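The closest-gene heuristic referred to above can be sketched in a few lines. This is only an illustration: the function name `closest_gene`, the gene names, and all coordinates are invented for the example and are not taken from the talk.

```python
# Hypothetical illustration of the "closest gene" baseline: assign an enhancer
# to the gene whose transcription start site (TSS) is nearest on the same
# chromosome. All names and coordinates below are invented.

def closest_gene(enhancer_mid, tss_by_gene):
    """Return the gene whose TSS is nearest to the enhancer midpoint."""
    return min(tss_by_gene, key=lambda gene: abs(tss_by_gene[gene] - enhancer_mid))


tss_by_gene = {"GeneA": 120_000, "GeneB": 180_000, "GeneC": 450_000}
print(closest_gene(200_000, tss_by_gene))  # GeneB
```

The heuristic uses linear distance only, so it cannot recover an enhancer that skips over its nearest gene to contact a more distant promoter.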
Can we reconstruct 3D interactions between enhancers and promoters from 2D genomic data?
TargetFinder Training
Teach a machine learning algorithm to discriminate true versus false enhancer-promoter interactions based on their features
Training Data: active enhancer, expressed gene; Positives = Hi-C +, Negatives = Hi-C − (Rao et al. 2014, 1-kb resolution)
Features
Evolutionary Conservation (human, chimp, mouse, rat)
• Conserved synteny of enhancer and promoter
Functional Genomics
• ChIA-PET
• RNA-seq
• ChIP-seq (TFs, histones)
• enhancer / promoter / window
• ChromHMM & Segway
Sequence Annotations
• K-mer correlation
• Annotated functions and pathways of gene and enhancer-bound TFs
Computational Algorithm
• Decision trees: good for interacting features
• Ensemble learning: build many imperfect classifiers and combine them to improve prediction accuracy
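One way to picture the enhancer / promoter / window features listed above is to count peaks from each assay in the three regions of a candidate pair. The sketch below is an assumption about how such features could be built, not TargetFinder's actual code: `pair_features`, the half-open interval convention, and every coordinate are hypothetical.

```python
# Sketch of building per-pair features from peak calls. Intervals are
# half-open (start, end) on a single chromosome; all values are invented.

def overlaps(a, b):
    """True if half-open intervals a and b share at least one base."""
    return a[0] < b[1] and b[0] < a[1]


def count_overlaps(region, peaks):
    """Number of peaks overlapping the region (0 for an empty region)."""
    if region[0] >= region[1]:
        return 0
    return sum(1 for peak in peaks if overlaps(region, peak))


def pair_features(enhancer, promoter, peaks_by_assay):
    """Count peaks per assay in the enhancer, the promoter, and the window between them."""
    left, right = sorted([enhancer, promoter])
    window = (left[1], right[0])  # empty when the two regions touch or overlap
    features = {}
    for assay, peaks in peaks_by_assay.items():
        features[f"{assay}_enhancer"] = count_overlaps(enhancer, peaks)
        features[f"{assay}_promoter"] = count_overlaps(promoter, peaks)
        features[f"{assay}_window"] = count_overlaps(window, peaks)
    return features


peaks_by_assay = {
    "CTCF": [(1200, 1300), (4000, 4100), (5000, 5200)],
    "P300": [(1400, 1480), (9400, 9600)],
}
print(pair_features((1000, 1500), (9000, 9500), peaks_by_assay))
```

Each candidate pair then becomes one row of counts that a tree ensemble can split on, which is where interactions between features (for example a mark that is informative only in the window) can be picked up.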
TargetFinder Performance
AUC = 0.94-0.96, Precision = 90-95%, Recall = 76-83%, Power = 85-89% at 10% FPR
Significantly better than random and logistic regression
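For reference, the quantities quoted above follow from the confusion counts at a chosen score threshold. Power at a given FPR is the recall (true positive rate) measured at the threshold that yields that FPR. The sketch below uses invented labels and scores, and `classification_metrics` is a hypothetical helper name.

```python
# Precision, recall, and false positive rate at one score threshold.
# Labels are 1 (true interaction) or 0 (non-interaction); data are invented.

def classification_metrics(labels, scores, threshold):
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for y, p in zip(labels, predicted) if y == 1 and p)
    fp = sum(1 for y, p in zip(labels, predicted) if y == 0 and p)
    fn = sum(1 for y, p in zip(labels, predicted) if y == 1 and not p)
    tn = sum(1 for y, p in zip(labels, predicted) if y == 0 and not p)
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "fpr": fp / (fp + tn) if fp + tn else 0.0,
    }


labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.4, 0.2, 0.2, 0.1, 0.1]
print(classification_metrics(labels, scores, threshold=0.5))
```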
TargetFinder performs well at very long distances
TargetFinder Feature Importance
Most predictive features mark the window between the enhancer and the promoter
Most useful features for prediction are TF and histone marks in the window between the enhancer and the promoter
• True interactions
- Enhancer-associated proteins: P300, JUN, TFs
- Marks of heterochromatin, lack of DNA methylation
- Marks of paused or poised RNA polymerase
• False interactions
- Cohesin complex: CTCF, RAD21, SMC3, ZNF143
- Histone marks of open chromatin and elongation
- Marks of active promoters and gene bodies
Many "window" features have a different meaning when marking promoters and enhancers (e.g., cohesin)
Predictive features colocate and form complexes
K562
Can TargetFinder work outside ENCODE cell lines?
What is a minimal set of experiments for accurate prediction? Optimal: 16+; minimal: 8
Test if models generalize across cell types
Fmax values (rows: TRAIN; columns: EVALUATE)
TRAIN \ EVALUATE   GM12878   K562   HeLa-S3   HUVEC
GM12878            0.83      0.40   0.43      0.39
K562               0.46      0.85   0.45      0.44
HeLa-S3            0.43      0.38   0.88      0.41
HUVEC              0.39      0.40   0.38      -
Expect ~35% precision and 55% recall on a new cell type with ~10 ChIP-seq datasets
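The Fmax values above can be read as the best F-measure achievable over all score thresholds. A minimal sketch follows, assuming Fmax denotes the maximum over thresholds of F1 = 2PR / (P + R); the helper name `f_max` and the labels and scores are invented for illustration.

```python
# Fmax: the maximum F1 score over all candidate score thresholds.
# Assumes labels are 1/0 and higher scores mean "more likely to interact".

def f_max(labels, scores):
    best = 0.0
    for threshold in sorted(set(scores)):
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
        fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
        fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
        if tp == 0:
            continue  # F1 is zero (or undefined) without true positives
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        best = max(best, 2 * precision * recall / (precision + recall))
    return best


labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
print(round(f_max(labels, scores), 3))  # 0.857
```

Scoring one cell type's pairs with a model trained on another and computing this value would fill one off-diagonal cell of such a table.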
TargetFinder accurately annotates enhancer-promoter pairs
[Figure: DNA schematic with a cardiomyopathy-associated variant and other noncoding variants (A/C, T/C, G/A) linked to Gene A, Gene B, and Gene C]
Massive data integration improves prediction
• Closest gene
- Usually fails to identify the right promoter
- Many false positives
• TargetFinder
- Identifies 95-90% of known pairs (55% with less data)
- Few false positives
Which human genome sequences function as long-range enhancers?
EnhancerFinder Training
Teach a machine learning algorithm to identify developmental enhancers active in different tissues based on their features
Training Data: VISTA Browser (+ and − examples)
Features
Evolutionary Conservation (human, chimp, mouse, rat)
Functional Genomics
• ChIP-seq (TFs, histones)
• DNase hypersensitivity
• ENCODE
• Epigenomics Roadmap
• Bench-to-Bassinet
DNA Sequence Motifs
• short k-mers
• known TF motifs
Computational Algorithm
• Support vector machine: separates 2 groups
• Multi-kernel: good for combining heterogeneous data types with different weights
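The multi-kernel idea above amounts to computing one similarity (kernel) per data type and combining them as a weighted sum, which remains a valid kernel when the weights are non-negative. The sketch below is illustrative only: it uses a k-mer spectrum kernel for sequence and a plain dot product for functional genomics signal, with hand-picked weights where a multiple kernel learning method would learn them. All function names and example values are hypothetical, not EnhancerFinder's implementation.

```python
from collections import Counter

# One kernel per data type, combined as a weighted sum.
# The weights are fixed by hand here; multiple kernel learning would fit them.

def kmer_counts(seq, k):
    """Count every overlapping k-mer in the sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(seq_a, seq_b, k=2):
    """Dot product of the two k-mer count vectors."""
    a, b = kmer_counts(seq_a, k), kmer_counts(seq_b, k)
    return sum(count * b[kmer] for kmer, count in a.items())


def linear_kernel(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))


def combined_kernel(ex_a, ex_b, weights):
    """Weighted sum of the sequence kernel and the functional-signal kernel."""
    return (weights["sequence"] * spectrum_kernel(ex_a["seq"], ex_b["seq"])
            + weights["functional"] * linear_kernel(ex_a["signal"], ex_b["signal"]))


region_a = {"seq": "ACGT", "signal": [1.0, 0.0]}
region_b = {"seq": "ACGA", "signal": [0.5, 2.0]}
weights = {"sequence": 0.5, "functional": 2.0}
print(combined_kernel(region_a, region_b, weights))  # 2.0
```

The resulting kernel matrix over all training regions is what a support vector machine would consume in place of raw feature vectors.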
EnhancerFinder Performance
AUC = 0.96, Power = 85% at 10% FPR, Recall = 85% at 93% Precision
FDR ~10-50%; significantly better than other methods
>80% in vivo validation rate
Erwin et al (2014) PLoS Comp Bio
EnhancerFinder Predictions
• 84,301 developmental enhancer predictions
- Cover 2% of the human genome
- Nearby genes have high expression and annotated functions in the relevant fetal tissue
- Significant overlap with disease mutations
- Cluster around developmental transcription factors and signaling genes
• 239 predictions overlap a Human Accelerated Region (33% of HARs); 25/30 validated in vivo
• Identify sites with fitness effects (Gulko et al. 2015)
[Figure: 239 predicted HAR enhancers: Heart (28), Limb (41), Brain (96)]
Erwin et al. (2014) PLoS Comp Bio; Capra et al. (2014) PTRSB
Massively Parallel Reporter Assays and capture Hi-C for validation
• Test >12,000 170-bp enhancers in parallel
• Quantitative activity assayed via RNA-seq
• Compare genotypes
- Human vs. chimp
- Disease SNPs
[Figure labels: HARs; human and chimp enhancers; vector library; enhancer vector (MinP, luciferase); lentivirus; iPS-derived cells; neuronal differentiated cells; transcribed barcodes; RNA-seq]
Figure 2. Massively parallel functional testing of HARs. For each HAR, the chimp version and versions with all combinations of human mutations will be shotgun cloned, along with synthetic 20-bp barcodes, into an enhancer assay vector. This vector library will be introduced to human and chimp neuronal differentiated cells via lentivirus. HAR sequences that function as enhancers in these cells will produce barcoded RNA. Each unique HAR sequence will have an average of 90 tags to allow for a quantitative readout. RNA sequencing (RNA-seq) will allow for the measurement of the relative activity of each mutation in the encoded HAR. MinP: minimal promoter; mut: mutations.
Hane Ryu Nadav Ahituv Jay Shendure Yin Shen
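The barcode readout described in the figure caption can be sketched as a simple aggregation: sum the RNA counts of all barcodes attached to one enhancer variant and compare variants. Normalizing by barcode counts in the DNA library is an assumption added here for illustration (the slides do not state how activity is normalized), and the barcodes, variant names, and counts are invented.

```python
import math
from collections import defaultdict

# Aggregate barcode counts per enhancer variant and compare two variants.
# Dividing RNA by DNA (library) counts is an assumed normalization.

def activity_by_variant(barcode_to_variant, rna_counts, dna_counts):
    rna = defaultdict(int)
    dna = defaultdict(int)
    for barcode, variant in barcode_to_variant.items():
        rna[variant] += rna_counts.get(barcode, 0)
        dna[variant] += dna_counts.get(barcode, 0)
    # Skip variants with no DNA reads to avoid dividing by zero.
    return {variant: rna[variant] / dna[variant] for variant in dna if dna[variant] > 0}


barcode_to_variant = {
    "AGGT": "HAR_x_human", "ACGT": "HAR_x_human",
    "TCGT": "HAR_x_chimp", "ACGC": "HAR_x_chimp",
}
rna_counts = {"AGGT": 40, "ACGT": 20, "TCGT": 10, "ACGC": 5}
dna_counts = {"AGGT": 10, "ACGT": 10, "TCGT": 10, "ACGC": 5}

activity = activity_by_variant(barcode_to_variant, rna_counts, dna_counts)
log2_ratio = math.log2(activity["HAR_x_human"] / activity["HAR_x_chimp"])
print(activity, round(log2_ratio, 3))
```

With many barcodes per sequence (the caption mentions an average of 90 tags), the per-variant ratio averages out barcode-specific noise.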
Induced pluripotent stem cell derived neuronal and cardiac lines
[Figure panels: Chimp; Human; human iPSC-derived cardiomyocytes]
Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin
Discovering the role of enhancers in human biology and evolution
[Figure: noncoding variants (G/A, T/C) near Gene B and Gene C]
• Computational characterization
• iPSC-based screening
• In vivo molecular studies
Collaborators
EnhancerFinder: Gen Haliburton, Tony Capra, Dennis Kostka, John Rubenstein
TargetFinder: Sean Whalen, Rebecca Truty, Benoit Bruneau
MotifDiverge: Dennis Kostka, Tara Friedrich, Deb Ritter, Jeff Chuang
Funding from NHLBI, PhRMA Foundation, Gladstone Institutes
Eukaryotic gene regulation is 3D and complex
Eric Keller Drew Berry
Most phenotype associatedmutations are outside coding regions
Most phenotype associatedmutations are outside coding regions
DNA
A C
Cardiomyopathy associated variant or human-chimp
difference
Most phenotype associatedmutations are outside coding regions
DNA
A C
T C
G A
Cardiomyopathy associated variant or human-chimp
difference
Most phenotype associatedmutations are outside coding regions
DNA
A C
T C
G A
Cardiomyopathy Gene A Gene B Gene C associated variant or human-chimp
difference
Most phenotype associatedmutations are outside coding regions
DNA
A C
T C
G A
Cardiomyopathy Gene A Gene B Gene C associated variant or human-chimp
difference
Cardiomyopathy associated variant or human-chimp
difference
Gene A Gene B
G A
Most phenotype associatedmutations are outside coding regions
A T C C
DNA Gene C
Cardiomyopathy associated variant or human-chimp
difference
Gene A Gene B
G A
Most phenotype associatedmutations are outside coding regions
A T C C
DNA Gene C
bull Enhancers individual ChIP-seq data sets identify lt50 ofknown enhancers plus many false positives13
bull Gene targets closest gene is right ~10 of time
Can we reconstruct 3D interactions between enhancers and promoters
from 2D genomic data 13
13
13
TargetFinder Feature Importance
TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features
Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution
13
13 13 13 13
TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features
Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution
EvoluDonary ConservaDon Human
ChimpFeatures Mouse
Rat
Conserved13 synteny of13 enhancer and promoter
13 13
13 13 13 13 13
13 13 13 13
TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features
Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution
EvoluDonary ConservaDon FuncDonal Genomics Human
ChimpFeatures Mouse
Rat
Conserved13 synteny of13 bullChIA-shy‐PET enhancer and promoter bullRNA-shy‐seq
bullChIP-shy‐seq (TFs histones) bullenhancerpromoterwindowbullChromHMM13 amp Segway
13 13 13
13 13 13 13 13
13 13 13 13 13 13 13 13 13 13 13
13 13 13 13
TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features
Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution
EvoluDonary ConservaDon FuncDonal Genomics Sequence AnnotaDons Human AAAAAAACAAAGAAAT13
AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13
ACAAACACACAGACAT13Rat
Conserved13 synteny of13 bullChIA-shy‐PET K-shy‐mer correlaDon enhancer and promoter bullRNA-shy‐seq Annotated funcDons and
bullChIP-shy‐seq (TFs histones) pathways of gene and bullenhancerpromoterwindow enhancer bound TFs bullChromHMM13 amp Segway
13 13 13
13 13 13 13 13
13 13 13 13 13 13 13 13 13 13 13
13 13 13 13
TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features
Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution
Computa6onal Algorithm
Decision trees good for interacting features 13 Ensemble learning build many imperfect classifiers and combine them to improve prediction accuracy
EvoluDonary ConservaDon FuncDonal Genomics Sequence AnnotaDons Human AAAAAAACAAAGAAAT13
AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13
ACAAACACACAGACAT13Rat
Conserved13 synteny of13 bullChIA-shy‐PET K-shy‐mer correlaDon enhancer and promoter bullRNA-shy‐seq Annotated funcDons and
bullChIP-shy‐seq (TFs histones) pathways of gene and bullenhancerpromoterwindow enhancer bound TFs bullChromHMM13 amp Segway
TargetFinder Performance AUC=094-096 Precision
=90-95 Recall=76-83 Power=85-89 at 10 FPR 13 Significantly better than random and logistic regression
TargetFinder performs well at verylong distances
TargetFinder Feature Importance
TargetFinder Feature Importance Most predictive features mark the window between the enhancer and the promoter
TargetFinder Feature Importance Most predictive features mark the window between the enhancer and the promoter
TargetFinder Feature Importance Most predictive features mark the window between the enhancer and the promoter
TargetFinder Feature Importance
TargetFinder Feature Importance Most useful features for prediction are TF andhistone marks in the window between the enhancer and the promoter
bull True interactions - Enhancer-associated proteins P300 JUN TFs - Marks of heterochromatin lack of DNA methylation - Marks of paused or poised RNA polymerase
TargetFinder Feature Importance Most useful features for prediction are TF andhistone marks in the window between the enhancer and the promoter
bull True interactions - Enhancer-associated proteins P300 JUN TFs - Marks of heterochromatin lack of DNA methylation - Marks of paused or poised RNA polymerase
bull False interactions - Cohesin complex CTCF RAD21 SMC3 ZNF143 - Histone marks of open chromatin and elongation - Marks of active promoters and gene bodies
TargetFinder Feature Importance Most useful features for prediction are TF andhistone marks in the window between the enhancer and the promoter
bull True interactions - Enhancer-associated proteins P300 JUN TFs - Marks of heterochromatin lack of DNA methylation - Marks of paused or poised RNA polymerase
bull False interactions - Cohesin complex CTCF RAD21 SMC3 ZNF143 - Histone marks of open chromatin and elongation - Marks of active promoters and gene bodies
Many ldquowindowrdquo features have a different meaning when marking promoters and enhancers (eg cohesin)
Predictive features colocate and form complexes
K562
Predictive features colocate and form complexes
K562
Can TargetFinder work outsideENCODE cell lines
Can TargetFinder work outsideENCODE cell lines
What is a minimal set of experimentsfor accurate prediction
Can TargetFinder work outsideENCODE cell lines
What is a minimal set of experimentsfor accurate prediction optimal 16+
minimal 8
Can TargetFinder work outsideENCODE cell lines
Can TargetFinder work outsideENCODE cell lines
Test if models generalize across cell types
Can TargetFinder work outsideENCODE cell lines
Test if models generalize across cell types EVALUATE
GM12878 K562 HeLa-S3 HUVEC
GM12878 083 040 043 039
K562 046 085 045 044
HeLa-S3 043 038 088 041
HUVEC 039 040 038 -
TRAIN
Fmax13 values
Can TargetFinder work outsideENCODE cell lines
Test if models generalize across cell types EVALUATE
GM12878 K562 HeLa-S3 HUVEC
GM12878 083 040 043 039
K562 046 085 045 044
HeLa-S3 043 038 088 041
HUVEC 039 040 038 -
TRAIN
Fmax13 values
Expect ~35 precision and 55 recall on anew cell type with ~10 ChIP-seq datasets
Cardiomyopathy associated variant
Gene A Gene B
G A
TargetFinder accurately annotatesenhancer-promoter pairs
A T C C
DNA Gene C
Cardiomyopathy associated variant
Gene A Gene B
G A
TargetFinder accurately annotatesenhancer-promoter pairs
A T C C
DNA Gene C
Massive data integration improves predictionbull Closest gene
- Usually fails to identify the right promoter - Many false positives
Cardiomyopathy associated variant
Gene A Gene B
G A
TargetFinder accurately annotatesenhancer-promoter pairs
A T C C
DNA Gene C
Massive data integration improves predictionbull Closest gene
- Usually fails to identify the right promoter - Many false positives
bull TargetFinder - Identifies 95-90 of known pairs (55 with less data) - Few false positives
Which human genome sequences function as long-range enhancers
13
13
13
EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features
Training Data
VISTA 13 Browser
minus
+
13
Training Data
+VISTA 13 Browser
EvoluDonary ConservaDon Human
ChimpFeatures Mouse
Rat
EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features
minus
13 13
13 13 13 13 13
EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features
Training Data
+VISTA 13 Browser
minus
EvoluDonary ConservaDon FuncDonal Genomics Human
ChimpFeatures Mouse
Rat
bullChIP-shy‐seq (TFs histones) bullDNase HypersensiDvity bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13
13 13 13 13
13 13 13 13 13
13 13 13
EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features
Training Data
+VISTA 13 Browser
minus
EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13
AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13
ACAAACACACAGACAT13Rat
bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13
13 13 13 13
13 13 13 13 13
13 13 13
Training Data
+VISTA 13 Browser
minus
Computa6onal Algorithm
Support vector machine separates 2 groups 13 Multi-kernel good for combining heterogeneous data types with different weights
EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features
EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13
AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13
ACAAACACACAGACAT13Rat
bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13
EnhancerFinder Performance
AUC=096 Power=85 at 10 FPR Recall=85 at 93 Precision 13 FDR ~10-50 Significantly better than other methods 13 gt80 in vivo validation rate
Erwin et al (2014) PLoS Comp Bio
EnhancerFinder Predictions
239 Predicted HAR13 Enhancers
Heart13 (28)
Limb13 (41) Brain (96)
Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB
EnhancerFinder Predictions
bull 84301 developmental enhancer predictions - Cover 2 of the human genome
239 Predicted HAR13 Enhancers
Heart13 (28)
- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes
Limb13 (41) Brain (96)
Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB
EnhancerFinder Predictions
bull 84301 developmental enhancer predictions - Cover 2 of the human genome
239 Predicted HAR13 Enhancers
Heart13 (28)
- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes
Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo
Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB
EnhancerFinder Predictions
bull 84301 developmental enhancer predictions - Cover 2 of the human genome
239 Predicted HAR13 Enhancers
Heart13 (28)
- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes
Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo
bull Identify sites with fitness effects (Gulko et al 2015) Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB
HARs
Figure 2 Massively parallel functional testing of HARs For each HAR the chimp version and versions with all combinations of human mutations will be shotgun cloned along with synthetic 20bp barcodes into an enhancer assay vector This vector library will be introduced to human and chimp neuronal differentiated cells via lentivirus HAR sequences that function as enhancers in these cells will produce barcoded RNA Each unique HAR sequence will have an average of 90 tags to allow for a quantitative readout RNA sequencing (RNA-seq) will allow for the measurement of the relative activity of each mutation in the encoded HAR MinP (minimal promoter) mut (mutations)
Neuronal differentiated cells
Massively Parallel Reporter Assaysand capture Hi-C for validation
AGGT
MinP Luciferase
ACGT ACGT
TCGT ACGC
Enhancer vector
Transcribed barcodes
RNA-seq
Lentivirus
Text
bullTest gt12000 170bp enhancers in parallel bull Quantitative activity assayed via RNA-seq bull Compare genotypes bull Human vs chimp bull Disease SNPs
iPS derived cells
Enhancers Human
Chimp
Vector library
Hane Ryu Nadav Ahituv Jay Shendure Yin Shen
Induced pluripotent stem cell derivedneuronal and cardiac lines
Chi
mp
Hum
an
Human iPSC derived cardiomyocytes
Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin
Induced pluripotent stem cell derivedneuronal and cardiac lines
Chi
mp
Hum
an
Human iPSC derived cardiomyocytes
Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin
Discovering the role of enhancers inhuman biology and evolution
G A
Discovering the role of enhancers inhuman biology and evolution
T Computational13 C Characterization
Gene B Gene C
G A
Discovering the role of enhancers inhuman biology and evolution
T Computational13 C Characterization
Gene B Gene C
iPSC based screening
G A
Discovering the role of enhancers inhuman biology and evolution
T Computational13 C Characterization
Gene B Gene C
iPSC based screening In vivo molecular studies
Collaborators
EnhancerFinder Gen Haliburton Tony Capra Dennis Kostka John Rubenstein
TargetFinder Sean Whalen Rebecca Truty Benoit Bruneau
MotifDiverge Dennis Kostka Tara Friedrich Deb Ritter Jeff Chuang
Funding from NHLBI PhRMA Foundation Gladstone Institutes
Most phenotype associatedmutations are outside coding regions
Most phenotype associatedmutations are outside coding regions
DNA
A C
Cardiomyopathy associated variant or human-chimp
difference
Most phenotype associatedmutations are outside coding regions
DNA
A C
T C
G A
Cardiomyopathy associated variant or human-chimp
difference
Most phenotype associatedmutations are outside coding regions
DNA
A C
T C
G A
Cardiomyopathy Gene A Gene B Gene C associated variant or human-chimp
difference
Most phenotype associatedmutations are outside coding regions
DNA
A C
T C
G A
Cardiomyopathy Gene A Gene B Gene C associated variant or human-chimp
difference
Cardiomyopathy associated variant or human-chimp
difference
Gene A Gene B
G A
Most phenotype associatedmutations are outside coding regions
A T C C
DNA Gene C
Cardiomyopathy associated variant or human-chimp
difference
Gene A Gene B
G A
Most phenotype associatedmutations are outside coding regions
A T C C
DNA Gene C
bull Enhancers individual ChIP-seq data sets identify lt50 ofknown enhancers plus many false positives13
bull Gene targets closest gene is right ~10 of time
Can we reconstruct 3D interactions between enhancers and promoters
from 2D genomic data 13
13
13
TargetFinder Training
Teach a machine learning algorithm to discriminate true versus false enhancer-promoter interactions based on their features.
Training Data
• Active enhancer, expressed gene
• Positives = Hi-C +, Negatives = Hi-C - (Rao et al. 2014, 1-kb resolution)
Features
Evolutionary Conservation (human, chimp, mouse, rat)
• Conserved synteny of enhancer and promoter
Functional Genomics
• ChIA-PET
• RNA-seq
• ChIP-seq (TFs, histones)
• enhancer / promoter / window
• ChromHMM & Segway
Sequence Annotations
• K-mer correlation
• Annotated functions and pathways of gene and enhancer-bound TFs
[Figure: table of 4-mers (AAAA, AAAC, AAAG, AAAT, AACA, ...)]
Computational Algorithm
• Decision trees: good for interacting features
• Ensemble learning: build many imperfect classifiers and combine them to improve prediction accuracy
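The ensemble idea on this slide (many imperfect classifiers, combined) can be illustrated with a minimal sketch. It is not the TargetFinder implementation: the slides do not name the specific ensemble method, so bagged one-split decision trees ("stumps") with vote averaging are assumed here, and the two features and all values are invented toy data.

```python
import random

def fit_stump(rows, labels):
    """Fit a one-split decision tree: the (feature, threshold, polarity)
    with the fewest training errors. Ties keep the first candidate found."""
    best = None
    for j in range(len(rows[0])):
        for threshold in sorted({r[j] for r in rows}):
            for polarity in (1, -1):
                errors = sum(
                    (1 if polarity * (r[j] - threshold) >= 0 else 0) != y
                    for r, y in zip(rows, labels)
                )
                if best is None or errors < best[0]:
                    best = (errors, j, threshold, polarity)
    return best[1:]

def predict_stump(stump, row):
    """Return 1 (interacting) or 0 for one feature row."""
    j, threshold, polarity = stump
    return 1 if polarity * (row[j] - threshold) >= 0 else 0

def fit_ensemble(rows, labels, n_trees=25, seed=0):
    """Train each stump on a bootstrap resample of the training pairs."""
    rng = random.Random(seed)
    n = len(rows)
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        stumps.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx]))
    return stumps

def predict_ensemble(stumps, row):
    """Fraction of stumps voting 'interacting' (a score between 0 and 1)."""
    return sum(predict_stump(s, row) for s in stumps) / len(stumps)

# Hypothetical toy pairs: two made-up feature values per enhancer-promoter pair;
# label 1 = pair supported by Hi-C, label 0 = not supported.
rows = [[0.1, 0.5], [0.2, 0.9], [0.3, 0.4], [0.4, 0.8],
        [0.6, 0.6], [0.7, 0.3], [0.8, 0.7], [0.9, 0.2]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

stumps = fit_ensemble(rows, labels)
high = predict_ensemble(stumps, [0.95, 0.5])  # resembles the positive pairs
low = predict_ensemble(stumps, [0.05, 0.5])   # resembles the negative pairs
```

Each stump alone is a weak classifier; averaging the votes gives a graded score that can be thresholded to trade precision against recall.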
TargetFinder Performance
• AUC = 0.94-0.96
• Precision = 90-95%
• Recall = 76-83%
• Power = 85-89% at 10% FPR
• Significantly better than random and logistic regression
TargetFinder performs well at very long distances
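As a reading aid for "power at 10% FPR", the sketch below computes the highest true-positive rate reachable while the false-positive rate stays at or below a cap. This interpretation of the slide's metric is an assumption, and the scores and labels are invented toy values, not TargetFinder output.

```python
def power_at_fpr(scores, labels, max_fpr=0.10):
    """Highest true-positive rate over all score thresholds whose
    false-positive rate does not exceed max_fpr."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best_tpr = 0.0
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        if fp / n_neg <= max_fpr:
            best_tpr = max(best_tpr, tp / n_pos)
    return best_tpr

# Toy classifier scores: 4 true pairs followed by 10 false pairs.
scores = [0.9, 0.8, 0.7, 0.3,
          0.75, 0.6, 0.5, 0.4, 0.35, 0.2, 0.2, 0.1, 0.1, 0.05]
labels = [1, 1, 1, 1,
          0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

power = power_at_fpr(scores, labels)  # 3 of 4 positives found with 1 of 10 negatives
```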
TargetFinder Feature Importance
Most predictive features mark the window between the enhancer and the promoter.
Most useful features for prediction are TF and histone marks in the window between the enhancer and the promoter.
• True interactions
  - Enhancer-associated proteins: P300, JUN, TFs
  - Marks of heterochromatin, lack of DNA methylation
  - Marks of paused or poised RNA polymerase
• False interactions
  - Cohesin complex and associated factors: CTCF, RAD21, SMC3, ZNF143
  - Histone marks of open chromatin and elongation
  - Marks of active promoters and gene bodies
Many "window" features have a different meaning when marking promoters and enhancers (e.g., cohesin).
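To make the enhancer / promoter / window distinction concrete, here is a minimal sketch that derives the window interval for a pair and counts peaks of one dataset in each region. Counting overlapping peaks is a simplification chosen for illustration; the slides do not say how the signal is summarized per region, and all coordinates are made up.

```python
def window_between(enhancer, promoter):
    """Interval between two non-overlapping regions.
    Regions are half-open (start, end) tuples on the same chromosome."""
    left, right = sorted([enhancer, promoter])
    return (left[1], right[0])

def overlaps(a, b):
    """True if half-open intervals a and b share at least one base."""
    return a[0] < b[1] and b[0] < a[1]

def region_features(enhancer, promoter, peaks):
    """Count peaks overlapping the enhancer, the promoter, and the window."""
    regions = {
        "enhancer": enhancer,
        "promoter": promoter,
        "window": window_between(enhancer, promoter),
    }
    return {name: sum(overlaps(region, p) for p in peaks)
            for name, region in regions.items()}

# Hypothetical pair and ChIP-seq peaks for a single factor.
enhancer = (1000, 1500)
promoter = (9000, 9500)
peaks = [(1200, 1300), (4000, 4200), (6000, 6100), (9400, 9600), (20000, 20100)]

features = region_features(enhancer, promoter, peaks)
```

Repeating this per dataset yields three features per experiment, which is how the same factor (for example cohesin) can carry a different meaning in the window than at the enhancer or promoter.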
Predictive features colocate and form complexes
[Figure: K562]
Can TargetFinder work outside ENCODE cell lines?
What is a minimal set of experiments for accurate prediction?
• Optimal: 16+
• Minimal: 8
Test if models generalize across cell types
Fmax values (rows: TRAIN, columns: EVALUATE)
          GM12878  K562  HeLa-S3  HUVEC
GM12878   0.83     0.40  0.43     0.39
K562      0.46     0.85  0.45     0.44
HeLa-S3   0.43     0.38  0.88     0.41
HUVEC     0.39     0.40  0.38     -
Expect ~35% precision and 55% recall on a new cell type with ~10 ChIP-seq datasets
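The table can be summarized numerically. The sketch below averages the same-cell-line entries and the cross-cell-line entries, and checks that the quoted ~35% precision and 55% recall correspond to an F-score in the same range as the cross-cell-line values. It assumes Fmax is the maximum F1 score over score thresholds; the slide does not define it.

```python
cell_lines = ["GM12878", "K562", "HeLa-S3", "HUVEC"]

# Fmax values as shown on the slide; None marks the entry shown as "-".
fmax = [
    [0.83, 0.40, 0.43, 0.39],
    [0.46, 0.85, 0.45, 0.44],
    [0.43, 0.38, 0.88, 0.41],
    [0.39, 0.40, 0.38, None],
]

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

n = len(cell_lines)
# Same cell line for training and evaluation (diagonal).
within = mean(fmax[i][i] for i in range(n) if fmax[i][i] is not None)
# Different cell lines for training and evaluation (off-diagonal).
across = mean(fmax[i][j] for i in range(n) for j in range(n) if i != j)
# F1 implied by the expected precision and recall on a new cell type.
expected_new_cell_type = f1(0.35, 0.55)
```

The within-cell-line mean is about 0.85 and the cross-cell-line mean about 0.41, close to the F1 of roughly 0.43 implied by 35% precision and 55% recall.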
TargetFinder accurately annotates enhancer-promoter pairs
[Diagram: cardiomyopathy-associated variant on DNA near Gene A, Gene B, and Gene C]
Massive data integration improves prediction
• Closest gene
  - Usually fails to identify the right promoter
  - Many false positives
• TargetFinder
  - Identifies 90-95% of known pairs (55% with less data)
  - Few false positives
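For comparison, the closest-gene baseline amounts to a one-line rule. The sketch below is illustrative only: gene names follow the diagram, and the coordinates are invented.

```python
def closest_gene(enhancer_mid, tss_by_gene):
    """Assign an enhancer to the gene whose transcription start site (TSS)
    is nearest in linear genomic distance."""
    return min(tss_by_gene, key=lambda gene: abs(tss_by_gene[gene] - enhancer_mid))

# Hypothetical TSS coordinates on one chromosome.
tss_by_gene = {"Gene A": 10_000, "Gene B": 60_000, "Gene C": 250_000}

# An enhancer at position 70,000 is assigned to Gene B, even if it actually
# loops over Gene B to contact Gene C in 3D.
predicted = closest_gene(70_000, tss_by_gene)
```

The rule uses only 2D distance, so it cannot represent an enhancer that skips over intervening genes, which is the case the slide highlights.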
Which human genome sequences function as long-range enhancers?
EnhancerFinder Training
Teach a machine learning algorithm to identify developmental enhancers active in different tissues based on their features.
Training Data
• VISTA Browser (+ and - examples)
Features
Evolutionary Conservation (human, chimp, mouse, rat)
Functional Genomics
• ChIP-seq (TFs, histones)
• DNase hypersensitivity
• ENCODE
• Epigenomics Roadmap
• Bench-to-Bassinet
DNA Sequence Motifs
• Short k-mers
• Known TF motifs
[Figure: table of 4-mers (AAAA, AAAC, AAAG, AAAT, AACA, ...)]
Computational Algorithm
• Support vector machine: separates 2 groups
• Multi-kernel: good for combining heterogeneous data types with different weights
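The multi-kernel idea can be sketched as one similarity function per data type, combined with weights. The example below builds a k-mer count profile for the sequence view and combines it with a toy functional-genomics view. It is an illustration under assumptions, not the EnhancerFinder code: the weights here are fixed by hand, whereas multiple kernel learning would fit them, and all sequences and signal values are invented.

```python
import math
from collections import Counter
from itertools import product

def kmer_profile(seq, k=4):
    """Counts of every possible DNA k-mer in seq, in a fixed order."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return [counts["".join(p)] for p in product("ACGT", repeat=k)]

def linear_kernel(u, v):
    """Dot product of two feature vectors."""
    return sum(a * b for a, b in zip(u, v))

def normalized(kernel, u, v):
    """Scale a kernel so that every element has self-similarity 1."""
    return kernel(u, v) / math.sqrt(kernel(u, u) * kernel(v, v))

def combined_kernel(views_a, views_b, weights):
    """Weighted sum of per-data-type kernels. A sum of kernels with
    non-negative weights is itself a valid kernel."""
    return sum(w * normalized(linear_kernel, views_a[name], views_b[name])
               for name, w in weights.items())

# Two hypothetical candidate regions, each described by two data types.
region_a = {"kmers": kmer_profile("ACGTACGTAC", k=2), "chip": [1.0, 0.0, 3.0]}
region_b = {"kmers": kmer_profile("ACGTTTGTAC", k=2), "chip": [0.5, 0.2, 2.0]}
weights = {"kmers": 0.3, "chip": 0.7}  # hand-picked for illustration

similarity = combined_kernel(region_a, region_b, weights)
self_similarity = combined_kernel(region_a, region_a, weights)
```

A support vector machine can then be trained on the combined kernel matrix, so that data types with different scales and dimensions contribute according to their weights.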
EnhancerFinder Performance
• AUC = 0.96
• Power = 85% at 10% FPR
• Recall = 85% at 93% Precision
• FDR ~10-50%
• Significantly better than other methods
• >80% in vivo validation rate
Erwin et al. (2014) PLoS Comp Bio
EnhancerFinder Predictions
• 84,301 developmental enhancer predictions
  - Cover 2% of the human genome
  - Nearby genes have high expression and annotated functions in the relevant fetal tissue
  - Significant overlap with disease mutations
  - Cluster around developmental transcription factors and signaling genes
• 239 predictions overlap a Human Accelerated Region (33% of HARs); 25/30 validated in vivo
• Identify sites with fitness effects (Gulko et al. 2015)
[Figure: 239 predicted HAR enhancers: Brain (96), Limb (41), Heart (28)]
Erwin et al. (2014) PLoS Comp Bio; Capra et al. (2014) PTRSB
Massively Parallel Reporter Assays and capture Hi-C for validation
• Test >12,000 170-bp enhancers in parallel
• Quantitative activity assayed via RNA-seq
• Compare genotypes
  - Human vs chimp
  - Disease SNPs
Figure 2: Massively parallel functional testing of HARs. For each HAR, the chimp version and versions with all combinations of human mutations will be shotgun cloned, along with synthetic 20-bp barcodes, into an enhancer assay vector. This vector library will be introduced to human and chimp neuronal differentiated cells via lentivirus. HAR sequences that function as enhancers in these cells will produce barcoded RNA. Each unique HAR sequence will have an average of 90 tags to allow for a quantitative readout. RNA sequencing (RNA-seq) will allow for the measurement of the relative activity of each mutation in the encoded HAR. MinP: minimal promoter; mut: mutations.
[Diagram labels: HARs; enhancers (human, chimp); vector library; enhancer vector (barcode, MinP, luciferase); lentivirus; iPS-derived neuronal differentiated cells; transcribed barcodes; RNA-seq]
Hane Ryu, Nadav Ahituv, Jay Shendure, Yin Shen
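The quantitative readout described in the figure caption can be sketched as follows: barcode counts from RNA-seq are pooled per tested sequence and compared between the human and chimp versions. Normalizing by barcode counts in the DNA library is a common convention that is assumed here, not something the slides state. The barcodes are shortened to 4 nt (as in the diagram) and all counts are invented.

```python
import math
from collections import defaultdict

def sequence_activity(barcode_to_seq, rna_counts, dna_counts):
    """Pool barcode counts per tested sequence and return RNA/DNA ratios."""
    rna = defaultdict(int)
    dna = defaultdict(int)
    for barcode, seq in barcode_to_seq.items():
        rna[seq] += rna_counts.get(barcode, 0)
        dna[seq] += dna_counts.get(barcode, 0)
    # Skip sequences with no DNA reads to avoid dividing by zero.
    return {seq: rna[seq] / dna[seq] for seq in dna if dna[seq] > 0}

# Hypothetical library: two barcodes per version of one HAR.
barcode_to_seq = {
    "AGGT": "HAR1_human", "ACGT": "HAR1_human",
    "TCGT": "HAR1_chimp", "ACGC": "HAR1_chimp",
}
rna_counts = {"AGGT": 40, "ACGT": 40, "TCGT": 10, "ACGC": 10}
dna_counts = {"AGGT": 10, "ACGT": 10, "TCGT": 10, "ACGC": 10}

activity = sequence_activity(barcode_to_seq, rna_counts, dna_counts)
# Positive values mean the human version drives more expression than chimp.
log2_human_vs_chimp = math.log2(activity["HAR1_human"] / activity["HAR1_chimp"])
```

Pooling many barcodes per sequence (about 90 on average in the proposed design) reduces the influence of any single barcode on the estimate.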
Induced pluripotent stem cell derived neuronal and cardiac lines
[Images: chimp and human lines; human iPSC-derived cardiomyocytes]
Hane Ryu, Alex Pollen, Nadav Ahituv, Arnold Kriegstein, Bruce Conklin
Discovering the role of enhancers in human biology and evolution
• Computational characterization
• iPSC-based screening
• In vivo molecular studies
Collaborators
EnhancerFinder: Gen Haliburton, Tony Capra, Dennis Kostka, John Rubenstein
TargetFinder: Sean Whalen, Rebecca Truty, Benoit Bruneau
MotifDiverge: Dennis Kostka, Tara Friedrich, Deb Ritter, Jeff Chuang
Funding from NHLBI, PhRMA Foundation, Gladstone Institutes
Most phenotype associatedmutations are outside coding regions
DNA
A C
T C
G A
Cardiomyopathy associated variant or human-chimp
difference
Most phenotype associatedmutations are outside coding regions
DNA
A C
T C
G A
Cardiomyopathy Gene A Gene B Gene C associated variant or human-chimp
difference
Most phenotype associatedmutations are outside coding regions
DNA
A C
T C
G A
Cardiomyopathy Gene A Gene B Gene C associated variant or human-chimp
difference
Cardiomyopathy associated variant or human-chimp
difference
Gene A Gene B
G A
Most phenotype associatedmutations are outside coding regions
A T C C
DNA Gene C
Cardiomyopathy associated variant or human-chimp
difference
Gene A Gene B
G A
Most phenotype associatedmutations are outside coding regions
A T C C
DNA Gene C
bull Enhancers individual ChIP-seq data sets identify lt50 ofknown enhancers plus many false positives13
bull Gene targets closest gene is right ~10 of time
Can we reconstruct 3D interactions between enhancers and promoters
from 2D genomic data 13
13
13
TargetFinder Feature Importance
TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features
Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution
13
13 13 13 13
TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features
Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution
EvoluDonary ConservaDon Human
ChimpFeatures Mouse
Rat
Conserved13 synteny of13 enhancer and promoter
13 13
13 13 13 13 13
13 13 13 13
TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features
Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution
EvoluDonary ConservaDon FuncDonal Genomics Human
ChimpFeatures Mouse
Rat
Conserved13 synteny of13 bullChIA-shy‐PET enhancer and promoter bullRNA-shy‐seq
bullChIP-shy‐seq (TFs histones) bullenhancerpromoterwindowbullChromHMM13 amp Segway
13 13 13
13 13 13 13 13
13 13 13 13 13 13 13 13 13 13 13
13 13 13 13
TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features
Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution
EvoluDonary ConservaDon FuncDonal Genomics Sequence AnnotaDons Human AAAAAAACAAAGAAAT13
AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13
ACAAACACACAGACAT13Rat
Conserved13 synteny of13 bullChIA-shy‐PET K-shy‐mer correlaDon enhancer and promoter bullRNA-shy‐seq Annotated funcDons and
bullChIP-shy‐seq (TFs histones) pathways of gene and bullenhancerpromoterwindow enhancer bound TFs bullChromHMM13 amp Segway
13 13 13
13 13 13 13 13
13 13 13 13 13 13 13 13 13 13 13
13 13 13 13
TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features
Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution
Computa6onal Algorithm
Decision trees good for interacting features 13 Ensemble learning build many imperfect classifiers and combine them to improve prediction accuracy
EvoluDonary ConservaDon FuncDonal Genomics Sequence AnnotaDons Human AAAAAAACAAAGAAAT13
AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13
ACAAACACACAGACAT13Rat
Conserved13 synteny of13 bullChIA-shy‐PET K-shy‐mer correlaDon enhancer and promoter bullRNA-shy‐seq Annotated funcDons and
bullChIP-shy‐seq (TFs histones) pathways of gene and bullenhancerpromoterwindow enhancer bound TFs bullChromHMM13 amp Segway
TargetFinder Performance AUC=094-096 Precision
=90-95 Recall=76-83 Power=85-89 at 10 FPR 13 Significantly better than random and logistic regression
TargetFinder performs well at verylong distances
TargetFinder Feature Importance
TargetFinder Feature Importance Most predictive features mark the window between the enhancer and the promoter
TargetFinder Feature Importance Most predictive features mark the window between the enhancer and the promoter
TargetFinder Feature Importance Most predictive features mark the window between the enhancer and the promoter
TargetFinder Feature Importance
TargetFinder Feature Importance Most useful features for prediction are TF andhistone marks in the window between the enhancer and the promoter
bull True interactions - Enhancer-associated proteins P300 JUN TFs - Marks of heterochromatin lack of DNA methylation - Marks of paused or poised RNA polymerase
TargetFinder Feature Importance Most useful features for prediction are TF andhistone marks in the window between the enhancer and the promoter
bull True interactions - Enhancer-associated proteins P300 JUN TFs - Marks of heterochromatin lack of DNA methylation - Marks of paused or poised RNA polymerase
bull False interactions - Cohesin complex CTCF RAD21 SMC3 ZNF143 - Histone marks of open chromatin and elongation - Marks of active promoters and gene bodies
TargetFinder Feature Importance Most useful features for prediction are TF andhistone marks in the window between the enhancer and the promoter
bull True interactions - Enhancer-associated proteins P300 JUN TFs - Marks of heterochromatin lack of DNA methylation - Marks of paused or poised RNA polymerase
bull False interactions - Cohesin complex CTCF RAD21 SMC3 ZNF143 - Histone marks of open chromatin and elongation - Marks of active promoters and gene bodies
Many ldquowindowrdquo features have a different meaning when marking promoters and enhancers (eg cohesin)
Predictive features colocate and form complexes
K562
Predictive features colocate and form complexes
K562
Can TargetFinder work outside ENCODE cell lines?
What is a minimal set of experiments for accurate prediction?
  optimal: 16+
  minimal: 8
Test if models generalize across cell types
Fmax values (rows: TRAIN, columns: EVALUATE)
          GM12878  K562  HeLa-S3  HUVEC
GM12878   0.83     0.40  0.43     0.39
K562      0.46     0.85  0.45     0.44
HeLa-S3   0.43     0.38  0.88     0.41
HUVEC     0.39     0.40  0.38     -
Expect ~35% precision and 55% recall on a new cell type with ~10 ChIP-seq datasets
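For readers unfamiliar with the metric, the sketch below shows how an F-measure (the harmonic mean of precision and recall) and its maximum over score thresholds, Fmax, can be computed. The function names and the example scores are hypothetical; only the 35% precision and 55% recall figures come from the slide.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F1); 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def f_max(scores, labels):
    """Maximum F-measure over all score thresholds.

    scores: predicted scores; labels: 1 for a true interaction, 0 otherwise.
    A pair is called positive when its score is >= the threshold.
    """
    total_pos = sum(labels)
    best = 0.0
    for threshold in sorted(set(scores)):
        called = [l for s, l in zip(scores, labels) if s >= threshold]
        tp = sum(called)
        if tp == 0:
            continue
        best = max(best, f_measure(tp / len(called), tp / total_pos))
    return best

# The precision and recall quoted above for a new cell type.
print(round(f_measure(0.35, 0.55), 2))

# Hypothetical scores and labels, for illustration only.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 0, 1, 1, 0, 0]
print(round(f_max(scores, labels), 2))
```

An F-measure of about 0.43 at 35% precision and 55% recall is in line with the cross-cell-type values in the table above.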
TargetFinder accurately annotates enhancer-promoter pairs
[Diagram: DNA with a cardiomyopathy associated variant and Gene A, Gene B, Gene C]
Massive data integration improves prediction
• Closest gene
  - Usually fails to identify the right promoter
  - Many false positives
• TargetFinder
  - Identifies 95-90% of known pairs (55% with less data)
  - Few false positives
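The closest-gene baseline named above can be sketched in a few lines. The coordinates and the function name are hypothetical, and the sketch assumes a non-empty, sorted list of transcription start sites (TSS) on a single chromosome.

```python
from bisect import bisect_left

def closest_gene(enhancer_mid, tss_positions, gene_names):
    """Return the gene whose TSS is nearest to the enhancer midpoint.

    tss_positions must be sorted ascending and aligned with gene_names.
    """
    i = bisect_left(tss_positions, enhancer_mid)
    candidates = [j for j in (i - 1, i) if 0 <= j < len(tss_positions)]
    best = min(candidates, key=lambda j: abs(tss_positions[j] - enhancer_mid))
    return gene_names[best]

# Hypothetical TSS coordinates (bp) for the three genes in the diagram.
tss_positions = [50_000, 120_000, 400_000]
gene_names = ["Gene A", "Gene B", "Gene C"]

# A hypothetical enhancer at 150 kb is assigned to Gene B by proximity,
# regardless of which gene it actually regulates.
print(closest_gene(150_000, tss_positions, gene_names))
```

Because the rule looks only at linear distance, it cannot recover a long-range target that skips over a nearer gene, which is the failure mode the slide describes.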
Which human genome sequences function as long-range enhancers?
EnhancerFinder Training
Teach a machine learning algorithm to identify developmental enhancers active in different tissues based on their features
Training Data: + and - examples (VISTA Browser)
Features
• Evolutionary Conservation: Human, Chimp, Mouse, Rat
• Functional Genomics
  - ChIP-seq (TFs, histones)
  - DNase Hypersensitivity
  - ENCODE
  - Epigenomics Roadmap
  - Bench-to-Bassinet
• DNA Sequence Motifs
  - short k-mers (AAAA, AAAC, AAAG, AAAT, AACA, AACC, AACG, AACT, AAGA, AAGC, AAGG, AAGT, AATA, AATC, AATG, AATT, ACAA, ACAC, ACAG, ACAT, ...)
  - known TF motifs
Computational Algorithm
Support vector machine: separates 2 groups
Multi-kernel: good for combining heterogeneous data types with different weights
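To make the multi-kernel idea concrete, the sketch below builds one kernel per data type (a k-mer spectrum kernel for sequence, a dot product for numeric scores) and combines them as a weighted sum. The kernels, weights, and region records are assumptions chosen for illustration; the slides do not specify which kernels or weights EnhancerFinder uses, and the SVM training step is not shown.

```python
from collections import Counter
from math import sqrt

def kmer_counts(seq, k=4):
    """Count every overlapping k-mer in a DNA sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def spectrum_kernel(seq_a, seq_b, k=4):
    """Cosine-normalized dot product of k-mer count vectors (a spectrum kernel)."""
    a, b = kmer_counts(seq_a, k), kmer_counts(seq_b, k)
    dot = sum(a[m] * b[m] for m in a if m in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def linear_kernel(x, y):
    """Plain dot product for numeric feature vectors (e.g. conservation scores)."""
    return sum(xi * yi for xi, yi in zip(x, y))

def combined_kernel(region_a, region_b, weights):
    """Weighted sum of per-data-type kernels; non-negative weights keep it a valid kernel."""
    return (weights["sequence"] * spectrum_kernel(region_a["seq"], region_b["seq"])
            + weights["conservation"] * linear_kernel(region_a["cons"], region_b["cons"]))

# Hypothetical regions and weights, for illustration only.
r1 = {"seq": "ACGTACGTAC", "cons": [0.9, 0.8]}
r2 = {"seq": "ACGTACGTTT", "cons": [0.1, 0.2]}
weights = {"sequence": 0.7, "conservation": 0.3}

print(combined_kernel(r1, r1, weights), combined_kernel(r1, r2, weights))
```

A matrix of such combined kernel values over all training regions could then be handed to any SVM implementation that accepts a precomputed kernel.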
EnhancerFinder Performance
• AUC = 0.96
• Power = 85% at 10% FPR
• Recall = 85% at 93% precision
• FDR ~10-50
• Significantly better than other methods
• >80% in vivo validation rate
Erwin et al. (2014) PLoS Comp Bio
EnhancerFinder Predictions
• 84,301 developmental enhancer predictions
  - Cover 2% of the human genome
  - Nearby genes have high expression and annotated functions in the relevant fetal tissue
  - Significant overlap with disease mutations
  - Cluster around developmental transcription factors and signaling genes
• 239 predictions overlap a Human Accelerated Region (33% of HARs); 25/30 validated in vivo
  - 239 Predicted HAR Enhancers: Heart (28), Limb (41), Brain (96)
• Identify sites with fitness effects (Gulko et al. 2015)
Erwin et al. (2014) PLoS Comp Bio; Capra et al. (2014) PTRSB
Massively Parallel Reporter Assays and capture Hi-C for validation
• Test >12,000 170bp enhancers in parallel
• Quantitative activity assayed via RNA-seq
• Compare genotypes
  - Human vs chimp
  - Disease SNPs
[Diagram: human and chimp enhancers (HARs), vector library, enhancer vector (MinP, Luciferase, barcodes), lentivirus, iPS derived cells / neuronal differentiated cells, transcribed barcodes, RNA-seq]
Figure 2. Massively parallel functional testing of HARs. For each HAR, the chimp version and versions with all combinations of human mutations will be shotgun cloned, along with synthetic 20bp barcodes, into an enhancer assay vector. This vector library will be introduced to human and chimp neuronal differentiated cells via lentivirus. HAR sequences that function as enhancers in these cells will produce barcoded RNA. Each unique HAR sequence will have an average of 90 tags to allow for a quantitative readout. RNA sequencing (RNA-seq) will allow for the measurement of the relative activity of each mutation in the encoded HAR. MinP: minimal promoter; mut: mutations.
Hane Ryu, Nadav Ahituv, Jay Shendure, Yin Shen
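One way the barcode readout described above might be turned into a per-sequence activity score is sketched below. Normalizing RNA barcode counts to DNA (library) counts and adding a pseudocount are assumptions borrowed from common reporter-assay practice; the slide mentions only RNA-seq of transcribed barcodes. All identifiers, barcodes, and counts are hypothetical.

```python
from collections import defaultdict
from math import log2

def activity_per_sequence(barcode_to_seq, rna_counts, dna_counts, pseudocount=1):
    """Sum barcode counts per tested sequence and return log2(RNA/DNA) activity.

    barcode_to_seq maps each barcode to the ID of the enhancer sequence it tags.
    """
    rna = defaultdict(int)
    dna = defaultdict(int)
    for barcode, seq_id in barcode_to_seq.items():
        rna[seq_id] += rna_counts.get(barcode, 0)
        dna[seq_id] += dna_counts.get(barcode, 0)
    return {s: log2((rna[s] + pseudocount) / (dna[s] + pseudocount)) for s in dna}

# Hypothetical barcodes tagging the human and chimp versions of one region.
barcode_to_seq = {
    "AGGT": "HARx_human", "ACGT": "HARx_human",
    "TCGT": "HARx_chimp", "ACGC": "HARx_chimp",
}
rna_counts = {"AGGT": 70, "ACGT": 57, "TCGT": 20, "ACGC": 11}
dna_counts = {"AGGT": 15, "ACGT": 16, "TCGT": 14, "ACGC": 17}

activity = activity_per_sequence(barcode_to_seq, rna_counts, dna_counts)
# Difference in log2 activity between the human and chimp versions.
print(activity["HARx_human"] - activity["HARx_chimp"])
```

Pooling many barcodes per sequence (the caption mentions an average of 90 tags) is what makes such a ratio a stable quantitative readout rather than a single noisy count.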
Induced pluripotent stem cell derived neuronal and cardiac lines
[Images: chimp and human lines; human iPSC derived cardiomyocytes]
Hane Ryu, Alex Pollen, Nadav Ahituv, Arnold Kriegstein, Bruce Conklin
Discovering the role of enhancers in human biology and evolution
• Computational characterization
• iPSC based screening
• In vivo molecular studies
Collaborators
EnhancerFinder: Gen Haliburton, Tony Capra, Dennis Kostka, John Rubenstein
TargetFinder: Sean Whalen, Rebecca Truty, Benoit Bruneau
MotifDiverge: Dennis Kostka, Tara Friedrich, Deb Ritter, Jeff Chuang
Funding from NHLBI, PhRMA Foundation, Gladstone Institutes
Most phenotype associatedmutations are outside coding regions
DNA
A C
T C
G A
Cardiomyopathy Gene A Gene B Gene C associated variant or human-chimp
difference
Cardiomyopathy associated variant or human-chimp
difference
Gene A Gene B
G A
Most phenotype associatedmutations are outside coding regions
A T C C
DNA Gene C
Cardiomyopathy associated variant or human-chimp
difference
Gene A Gene B
G A
Most phenotype associatedmutations are outside coding regions
A T C C
DNA Gene C
bull Enhancers individual ChIP-seq data sets identify lt50 ofknown enhancers plus many false positives13
bull Gene targets closest gene is right ~10 of time
Can we reconstruct 3D interactions between enhancers and promoters
from 2D genomic data 13
13
13
TargetFinder Feature Importance
TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features
Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution
13
13 13 13 13
TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features
Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution
EvoluDonary ConservaDon Human
ChimpFeatures Mouse
Rat
Conserved13 synteny of13 enhancer and promoter
13 13
13 13 13 13 13
13 13 13 13
TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features
Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution
EvoluDonary ConservaDon FuncDonal Genomics Human
ChimpFeatures Mouse
Rat
Conserved13 synteny of13 bullChIA-shy‐PET enhancer and promoter bullRNA-shy‐seq
bullChIP-shy‐seq (TFs histones) bullenhancerpromoterwindowbullChromHMM13 amp Segway
13 13 13
13 13 13 13 13
13 13 13 13 13 13 13 13 13 13 13
13 13 13 13
TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features
Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution
EvoluDonary ConservaDon FuncDonal Genomics Sequence AnnotaDons Human AAAAAAACAAAGAAAT13
AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13
ACAAACACACAGACAT13Rat
Conserved13 synteny of13 bullChIA-shy‐PET K-shy‐mer correlaDon enhancer and promoter bullRNA-shy‐seq Annotated funcDons and
bullChIP-shy‐seq (TFs histones) pathways of gene and bullenhancerpromoterwindow enhancer bound TFs bullChromHMM13 amp Segway
13 13 13
13 13 13 13 13
13 13 13 13 13 13 13 13 13 13 13
13 13 13 13
TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features
Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution
Computa6onal Algorithm
Decision trees good for interacting features 13 Ensemble learning build many imperfect classifiers and combine them to improve prediction accuracy
EvoluDonary ConservaDon FuncDonal Genomics Sequence AnnotaDons Human AAAAAAACAAAGAAAT13
AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13
ACAAACACACAGACAT13Rat
Conserved13 synteny of13 bullChIA-shy‐PET K-shy‐mer correlaDon enhancer and promoter bullRNA-shy‐seq Annotated funcDons and
bullChIP-shy‐seq (TFs histones) pathways of gene and bullenhancerpromoterwindow enhancer bound TFs bullChromHMM13 amp Segway
TargetFinder Performance AUC=094-096 Precision
=90-95 Recall=76-83 Power=85-89 at 10 FPR 13 Significantly better than random and logistic regression
TargetFinder performs well at verylong distances
TargetFinder Feature Importance
TargetFinder Feature Importance Most predictive features mark the window between the enhancer and the promoter
TargetFinder Feature Importance Most predictive features mark the window between the enhancer and the promoter
TargetFinder Feature Importance Most predictive features mark the window between the enhancer and the promoter
TargetFinder Feature Importance
TargetFinder Feature Importance Most useful features for prediction are TF andhistone marks in the window between the enhancer and the promoter
bull True interactions - Enhancer-associated proteins P300 JUN TFs - Marks of heterochromatin lack of DNA methylation - Marks of paused or poised RNA polymerase
TargetFinder Feature Importance Most useful features for prediction are TF andhistone marks in the window between the enhancer and the promoter
bull True interactions - Enhancer-associated proteins P300 JUN TFs - Marks of heterochromatin lack of DNA methylation - Marks of paused or poised RNA polymerase
bull False interactions - Cohesin complex CTCF RAD21 SMC3 ZNF143 - Histone marks of open chromatin and elongation - Marks of active promoters and gene bodies
TargetFinder Feature Importance Most useful features for prediction are TF andhistone marks in the window between the enhancer and the promoter
bull True interactions - Enhancer-associated proteins P300 JUN TFs - Marks of heterochromatin lack of DNA methylation - Marks of paused or poised RNA polymerase
bull False interactions - Cohesin complex CTCF RAD21 SMC3 ZNF143 - Histone marks of open chromatin and elongation - Marks of active promoters and gene bodies
Many ldquowindowrdquo features have a different meaning when marking promoters and enhancers (eg cohesin)
Predictive features colocate and form complexes
K562
Predictive features colocate and form complexes
K562
Can TargetFinder work outsideENCODE cell lines
Can TargetFinder work outsideENCODE cell lines
What is a minimal set of experimentsfor accurate prediction
Can TargetFinder work outsideENCODE cell lines
What is a minimal set of experimentsfor accurate prediction optimal 16+
minimal 8
Can TargetFinder work outsideENCODE cell lines
Can TargetFinder work outsideENCODE cell lines
Test if models generalize across cell types
Can TargetFinder work outsideENCODE cell lines
Test if models generalize across cell types EVALUATE
GM12878 K562 HeLa-S3 HUVEC
GM12878 083 040 043 039
K562 046 085 045 044
HeLa-S3 043 038 088 041
HUVEC 039 040 038 -
TRAIN
Fmax13 values
Can TargetFinder work outsideENCODE cell lines
Test if models generalize across cell types EVALUATE
GM12878 K562 HeLa-S3 HUVEC
GM12878 083 040 043 039
K562 046 085 045 044
HeLa-S3 043 038 088 041
HUVEC 039 040 038 -
TRAIN
Fmax13 values
Expect ~35 precision and 55 recall on anew cell type with ~10 ChIP-seq datasets
Cardiomyopathy associated variant
Gene A Gene B
G A
TargetFinder accurately annotatesenhancer-promoter pairs
A T C C
DNA Gene C
Cardiomyopathy associated variant
Gene A Gene B
G A
TargetFinder accurately annotatesenhancer-promoter pairs
A T C C
DNA Gene C
Massive data integration improves predictionbull Closest gene
- Usually fails to identify the right promoter - Many false positives
Cardiomyopathy associated variant
Gene A Gene B
G A
TargetFinder accurately annotatesenhancer-promoter pairs
A T C C
DNA Gene C
Massive data integration improves predictionbull Closest gene
- Usually fails to identify the right promoter - Many false positives
bull TargetFinder - Identifies 95-90 of known pairs (55 with less data) - Few false positives
Which human genome sequences function as long-range enhancers
13
13
13
EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features
Training Data
VISTA 13 Browser
minus
+
13
Training Data
+VISTA 13 Browser
EvoluDonary ConservaDon Human
ChimpFeatures Mouse
Rat
EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features
minus
13 13
13 13 13 13 13
EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features
Training Data
+VISTA 13 Browser
minus
EvoluDonary ConservaDon FuncDonal Genomics Human
ChimpFeatures Mouse
Rat
bullChIP-shy‐seq (TFs histones) bullDNase HypersensiDvity bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13
13 13 13 13
13 13 13 13 13
13 13 13
EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features
Training Data
+VISTA 13 Browser
minus
EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13
AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13
ACAAACACACAGACAT13Rat
bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13
13 13 13 13
13 13 13 13 13
13 13 13
Training Data
+VISTA 13 Browser
minus
Computa6onal Algorithm
Support vector machine separates 2 groups 13 Multi-kernel good for combining heterogeneous data types with different weights
EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features
EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13
AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13
ACAAACACACAGACAT13Rat
bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13
EnhancerFinder Performance
AUC=096 Power=85 at 10 FPR Recall=85 at 93 Precision 13 FDR ~10-50 Significantly better than other methods 13 gt80 in vivo validation rate
Erwin et al (2014) PLoS Comp Bio
EnhancerFinder Predictions
• 84,301 developmental enhancer predictions
  - Cover 2% of the human genome
  - Nearby genes have high expression and annotated functions in the relevant fetal tissue
  - Significant overlap with disease mutations
  - Cluster around developmental transcription factors and signaling genes
• 239 predictions overlap a Human Accelerated Region (33% of HARs); 25/30 validated in vivo
  - Predicted HAR enhancers by tissue: Heart (28), Limb (41), Brain (96)
• Identify sites with fitness effects (Gulko et al. 2015)
Erwin et al. (2014) PLoS Comp Bio; Capra et al. (2014) PTRSB
Massively Parallel Reporter Assays and capture Hi-C for validation
• Test >12,000 170-bp enhancers in parallel
• Quantitative activity assayed via RNA-seq
• Compare genotypes
  - Human vs. chimp
  - Disease SNPs
Figure 2. Massively parallel functional testing of HARs. For each HAR, the chimp version and versions with all combinations of human mutations will be shotgun cloned, along with synthetic 20-bp barcodes, into an enhancer assay vector. This vector library will be introduced to human and chimp neuronal differentiated cells via lentivirus. HAR sequences that function as enhancers in these cells will produce barcoded RNA. Each unique HAR sequence will have an average of 90 tags to allow for a quantitative readout. RNA sequencing (RNA-seq) will allow for the measurement of the relative activity of each mutation in the encoded HAR. MinP: minimal promoter; mut: mutations.
[Figure labels: HARs (human and chimp enhancers), vector library, enhancer vector (MinP, luciferase, barcodes), lentivirus, iPS-derived neuronal differentiated cells, transcribed barcodes, RNA-seq]
Hane Ryu, Nadav Ahituv, Jay Shendure, Yin Shen
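The quantitative readout described above can be sketched as a per-sequence activity score computed from barcode counts. This is a minimal sketch under assumptions the slide does not state: activity is taken as log2 of pooled RNA counts over pooled DNA (plasmid) counts, and the barcodes, counts, and names such as `sequence_activity` are hypothetical. Real barcodes would be 20 bp; 4-bp strings are used here only for readability.

```python
import math
from collections import defaultdict

def sequence_activity(barcode_to_seq, rna_counts, dna_counts):
    """Pool barcode counts per tested sequence and return log2(RNA / DNA)."""
    rna = defaultdict(int)
    dna = defaultdict(int)
    for barcode, seq in barcode_to_seq.items():
        rna[seq] += rna_counts.get(barcode, 0)
        dna[seq] += dna_counts.get(barcode, 0)
    # Skip sequences with zero counts so the log ratio is defined.
    return {seq: math.log2(rna[seq] / dna[seq])
            for seq in dna if dna[seq] > 0 and rna[seq] > 0}

# Hypothetical barcodes tagging the human and chimp versions of one HAR.
barcode_to_seq = {"ACGT": "HAR_human", "TCGT": "HAR_human",
                  "AGGT": "HAR_chimp", "ACGC": "HAR_chimp"}
rna_counts = {"ACGT": 120, "TCGT": 40, "AGGT": 30, "ACGC": 10}
dna_counts = {"ACGT": 25, "TCGT": 15, "AGGT": 20, "ACGC": 20}

activity = sequence_activity(barcode_to_seq, rna_counts, dna_counts)
# Positive values mean the human version is more active than the chimp version.
human_vs_chimp = activity["HAR_human"] - activity["HAR_chimp"]
```

Pooling many barcodes per sequence is what makes the readout quantitative: noise from any single barcode averages out.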
Induced pluripotent stem cell derived neuronal and cardiac lines
[Figure panels: chimp and human lines; human iPSC-derived cardiomyocytes]
Hane Ryu, Alex Pollen, Nadav Ahituv, Arnold Kriegstein, Bruce Conklin
Discovering the role of enhancers in human biology and evolution
• Computational characterization
• iPSC-based screening
• In vivo molecular studies
Collaborators
EnhancerFinder: Gen Haliburton, Tony Capra, Dennis Kostka, John Rubenstein
TargetFinder: Sean Whalen, Rebecca Truty, Benoit Bruneau
MotifDiverge: Dennis Kostka, Tara Friedrich, Deb Ritter, Jeff Chuang
Funding from NHLBI, PhRMA Foundation, Gladstone Institutes
Most phenotype-associated mutations are outside coding regions
[Figure: DNA with Gene A, Gene B, Gene C and noncoding variants (A/C, T/C, G/A), each a cardiomyopathy-associated variant or human-chimp difference]
• Enhancers: individual ChIP-seq data sets identify <50% of known enhancers, plus many false positives
• Gene targets: closest gene is right ~10% of the time
Can we reconstruct 3D interactions between enhancers and promoters from 2D genomic data?
TargetFinder Training
Teach a machine learning algorithm to discriminate true versus false enhancer-promoter interactions based on their features.
Training Data
• Active enhancer / expressed gene pairs
• Positives = Hi-C +
• Negatives = Hi-C -
• Rao et al. 2014, 1-kb resolution
Features
Evolutionary Conservation
• Conserved synteny of enhancer and promoter (human, chimp, mouse, rat)
Functional Genomics
• ChIA-PET
• RNA-seq
• ChIP-seq (TFs, histones)
• Enhancer / promoter / window
• ChromHMM & Segway
Sequence Annotations
• K-mer correlation
• Annotated functions and pathways of gene and enhancer-bound TFs
Computational Algorithm
• Decision trees: good for interacting features
• Ensemble learning: build many imperfect classifiers and combine them to improve prediction accuracy
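The ensemble principle named on this slide (many imperfect classifiers, combined) can be illustrated with a deliberately tiny example. This is a simplified sketch, not TargetFinder's algorithm: each weak learner is a one-split decision tree (a stump) on a single feature, the ensemble is a plain majority vote, and the data and function names are hypothetical.

```python
def fit_stump(values, labels):
    """Fit a one-split decision tree (stump) on a single feature.

    The split is the midpoint between the two class means; `sign` records
    which side of the split is called positive.
    """
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    mean_pos, mean_neg = sum(pos) / len(pos), sum(neg) / len(neg)
    threshold = (mean_pos + mean_neg) / 2
    sign = 1 if mean_pos >= mean_neg else -1
    return threshold, sign

def stump_predict(stump, value):
    threshold, sign = stump
    return 1 if sign * (value - threshold) > 0 else 0

def ensemble_predict(stumps, row):
    """Majority vote over stumps, one stump per feature column."""
    votes = sum(stump_predict(s, v) for s, v in zip(stumps, row))
    return 1 if 2 * votes > len(stumps) else 0

def accuracy(predictions, labels):
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Toy enhancer-promoter pairs (hypothetical): three binary features each,
# e.g. presence of some mark in the enhancer, window, and promoter.
X = [[0, 1, 1], [1, 0, 1], [1, 1, 1], [0, 0, 1], [0, 0, 0], [0, 0, 0]]
y = [1, 1, 1, 0, 0, 0]

columns = list(zip(*X))
stumps = [fit_stump(col, y) for col in columns]
single_accuracies = [accuracy([stump_predict(s, v) for v in col], y)
                     for s, col in zip(stumps, columns)]
ensemble_accuracy = accuracy([ensemble_predict(stumps, row) for row in X], y)
```

In this toy set each stump misclassifies a different pair, so every individual stump is imperfect while the vote gets all six pairs right.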
TargetFinder Performance
• AUC = 0.94-0.96
• Precision = 90-95%
• Recall = 76-83%
• Power = 85-89% at 10% FPR
• Significantly better than random and logistic regression
TargetFinder performs well at very long distances
TargetFinder Feature Importance
Most useful features for prediction are TF and histone marks in the window between the enhancer and the promoter.
• True interactions
  - Enhancer-associated proteins: P300, JUN, TFs
  - Marks of heterochromatin, lack of DNA methylation
  - Marks of paused or poised RNA polymerase
• False interactions
  - Cohesin complex, CTCF, RAD21, SMC3, ZNF143
  - Histone marks of open chromatin and elongation
  - Marks of active promoters and gene bodies
Many "window" features have a different meaning when marking promoters and enhancers (e.g. cohesin).
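Because the same mark can mean different things in different places, each mark is best summarized separately for the enhancer, the promoter, and the window between them. A minimal sketch of such region-specific features, assuming made-up coordinates and peak calls; `pair_features` and the count-based summary are illustrative choices, not the tool's actual feature definitions:

```python
def count_overlaps(peaks, start, end):
    """Number of (start, end) peaks overlapping the half-open interval [start, end)."""
    return sum(1 for p_start, p_end in peaks if p_start < end and p_end > start)

def pair_features(enhancer, promoter, peaks_by_mark):
    """Per-mark peak counts in the enhancer, the promoter, and the window between them."""
    left, right = sorted([enhancer, promoter])
    window = (left[1], right[0])  # region strictly between the two elements
    features = {}
    for mark, peaks in peaks_by_mark.items():
        features[f"{mark}_enhancer"] = count_overlaps(peaks, *enhancer)
        features[f"{mark}_promoter"] = count_overlaps(peaks, *promoter)
        features[f"{mark}_window"] = count_overlaps(peaks, *window)
    return features

# Hypothetical coordinates on one chromosome (bp) and made-up peak calls.
enhancer = (10_000, 10_500)
promoter = (60_000, 61_000)
peaks_by_mark = {
    "CTCF": [(20_000, 20_200), (45_000, 45_300), (60_100, 60_400)],
    "P300": [(10_100, 10_300)],
}

features = pair_features(enhancer, promoter, peaks_by_mark)
```

Here `CTCF_window` and `CTCF_promoter` end up as separate columns, so a classifier can weight a CTCF peak in the window differently from one at the promoter.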
Predictive features colocate and form complexes
K562
Can TargetFinder work outside ENCODE cell lines?
What is a minimal set of experiments for accurate prediction?
• Optimal: 16+
• Minimal: 8
Test if models generalize across cell types
Fmax values (rows = TRAIN, columns = EVALUATE)

TRAIN \ EVALUATE   GM12878   K562   HeLa-S3   HUVEC
GM12878            0.83      0.40   0.43      0.39
K562               0.46      0.85   0.45      0.44
HeLa-S3            0.43      0.38   0.88      0.41
HUVEC              0.39      0.40   0.38      -

Expect ~35% precision and 55% recall on a new cell type with ~10 ChIP-seq datasets
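The slide does not define Fmax; a common reading, assumed here, is the maximum F-measure (harmonic mean of precision and recall) over all score thresholds. A minimal sketch with made-up scores standing in for a model trained on one cell type and evaluated on another:

```python
def f1_at(scores, labels, threshold):
    """F1 when score >= threshold is called positive (0.0 if no true positives)."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def f_max(scores, labels):
    """Best F1 over all thresholds taken from the observed scores."""
    return max(f1_at(scores, labels, t) for t in set(scores))

# Hypothetical cross-cell-type predictions (1 = true interaction).
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1, 0, 1, 0, 0, 1]

best_f1 = f_max(scores, labels)
```

Computing this for every (train, evaluate) pair of cell types would fill a matrix like the one above, with same-cell-type performance on the diagonal.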
TargetFinder accurately annotates enhancer-promoter pairs
[Figure: DNA with Gene A, Gene B, Gene C and a cardiomyopathy-associated variant]
Massive data integration improves prediction
• Closest gene
  - Usually fails to identify the right promoter
  - Many false positives
• TargetFinder
  - Identifies 90-95% of known pairs (55% with less data)
  - Few false positives
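For comparison, the closest-gene baseline criticized above is simple to state in code. A minimal sketch, assuming hypothetical transcription start site (TSS) coordinates for the three genes in the figure; `closest_gene` is an illustrative helper, not part of any tool named in the talk:

```python
import bisect

def closest_gene(position, tss_sorted):
    """Return the gene whose transcription start site is nearest to `position`.

    `tss_sorted` is a list of (tss_position, gene_name) sorted by position.
    """
    positions = [p for p, _ in tss_sorted]
    i = bisect.bisect_left(positions, position)
    # Only the TSS just before and just after the position can be nearest.
    candidates = tss_sorted[max(i - 1, 0):i + 1]
    return min(candidates, key=lambda item: abs(item[0] - position))[1]

# Hypothetical TSS coordinates (bp) for the three genes in the figure.
tss = [(100_000, "Gene A"), (180_000, "Gene B"), (400_000, "Gene C")]
variant_position = 200_000

baseline_target = closest_gene(variant_position, tss)
```

In this toy layout the baseline assigns the variant to Gene B; if the enhancer actually loops to Gene C, the baseline both misses the true target and adds a false positive.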
Which human genome sequences function as long-range enhancers?
Can we reconstruct 3D interactions between enhancers and promoters
from 2D genomic data 13
13
13
TargetFinder Feature Importance
TargetFinder Training
Teach a machine learning algorithm to discriminate true versus false enhancer-promoter interactions based on their features

Training Data
Active enhancer, expressed gene
Positives = Hi-C +
Negatives = Hi-C -
Rao et al 2014, 1-Kb resolution

Computational Algorithm
Decision trees: good for interacting features
Ensemble learning: build many imperfect classifiers and combine them to improve prediction accuracy

Features
Evolutionary Conservation
• Conserved synteny of enhancer and promoter (Human, Chimp, Mouse, Rat)
Functional Genomics
• ChIA-PET
• RNA-seq
• ChIP-seq (TFs, histones)
• enhancer / promoter / window
• ChromHMM & Segway
Sequence Annotations
• K-mer correlation
• Annotated functions and pathways of gene and enhancer bound TFs
[Figure: table of 4-mers, e.g. AAAA, AAAC, AAAG, AAAT, ...]
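The "many imperfect classifiers, combined" idea can be illustrated with a toy sketch. This is not TargetFinder's implementation: it assumes one-split decision trees (stumps), bootstrap resampling, and a majority vote, and the two feature columns and all values are made up for illustration.

```python
import random

def fit_stump(rows, labels):
    """Return the (feature, threshold, polarity) split with the fewest training errors."""
    best = None
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            for polarity in (1, -1):
                errors = sum(
                    (1 if polarity * (r[f] - t) >= 0 else 0) != y
                    for r, y in zip(rows, labels)
                )
                if best is None or errors < best[0]:
                    best = (errors, f, t, polarity)
    return best[1:]

def predict_stump(stump, row):
    f, t, polarity = stump
    return 1 if polarity * (row[f] - t) >= 0 else 0

def fit_ensemble(rows, labels, n_trees=25, seed=0):
    """Train each stump on a bootstrap resample of the training pairs."""
    rng = random.Random(seed)
    n = len(rows)
    ensemble = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        ensemble.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx]))
    return ensemble

def predict_ensemble(ensemble, row):
    """Majority vote over all stumps."""
    votes = sum(predict_stump(s, row) for s in ensemble)
    return 1 if 2 * votes > len(ensemble) else 0

# Hypothetical features per enhancer-promoter pair: [window signal, enhancer signal].
rows = [[0.80, 0.4], [0.90, 0.6], [0.70, 0.5], [0.85, 0.3],
        [0.10, 0.5], [0.20, 0.4], [0.15, 0.6], [0.30, 0.3]]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = Hi-C positive, 0 = Hi-C negative

ensemble = fit_ensemble(rows, labels)
print(predict_ensemble(ensemble, [0.95, 0.5]), predict_ensemble(ensemble, [0.05, 0.5]))
```

Stumps are used here only to keep the sketch short; deeper trees are what let an ensemble capture the interacting features mentioned on the slide.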
TargetFinder Performance
AUC = 0.94-0.96, Precision = 90-95%, Recall = 76-83%, Power = 85-89% at 10% FPR
Significantly better than random and logistic regression
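For readers less familiar with these metrics, the following sketch shows how precision, recall, and false positive rate relate to a score threshold, and how "power at 10% FPR" can be read as the best recall achievable while keeping FPR at or below 10%. The labels and scores are invented toy values, not TargetFinder output, and the function names are hypothetical.

```python
def precision_recall_fpr(labels, scores, threshold):
    """Classify score >= threshold as positive and summarize the confusion matrix."""
    tp = fp = tn = fn = 0
    for y, s in zip(labels, scores):
        predicted = s >= threshold
        if predicted and y:
            tp += 1
        elif predicted and not y:
            fp += 1
        elif y:
            fn += 1
        else:
            tn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return precision, recall, fpr

def power_at_fpr(labels, scores, max_fpr=0.10):
    """Highest recall over all thresholds whose false positive rate is <= max_fpr."""
    best = 0.0
    for t in sorted(set(scores)):
        _, recall, fpr = precision_recall_fpr(labels, scores, t)
        if fpr <= max_fpr:
            best = max(best, recall)
    return best

# Toy data: 4 true interactions and 10 non-interactions.
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.75, 0.4, 0.2, 0.2, 0.1, 0.1, 0.1, 0.05, 0.05, 0.05]

print(precision_recall_fpr(labels, scores, 0.7))
print(power_at_fpr(labels, scores, max_fpr=0.10))
```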
TargetFinder performs well at very long distances
TargetFinder Feature Importance
Most useful features for prediction are TF and histone marks in the window between the enhancer and the promoter
• True interactions
 - Enhancer-associated proteins: P300, JUN, TFs
 - Marks of heterochromatin, lack of DNA methylation
 - Marks of paused or poised RNA polymerase
• False interactions
 - Cohesin complex: CTCF, RAD21, SMC3, ZNF143
 - Histone marks of open chromatin and elongation
 - Marks of active promoters and gene bodies
Many "window" features have a different meaning when marking promoters and enhancers (e.g. cohesin)
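To make the "window" idea concrete, here is a minimal sketch of how per-region features could be derived for one enhancer-promoter pair. It assumes half-open (start, end) coordinates on a single chromosome, non-overlapping enhancer and promoter, and a simple peak count as the feature value. The coordinates, peak lists, and function names are hypothetical, and the real pipeline may summarize signal differently.

```python
def window_between(enhancer, promoter):
    """Interval between two non-overlapping regions given as (start, end)."""
    left, right = sorted([enhancer, promoter])
    return (left[1], right[0])

def count_overlaps(region, peaks):
    """Number of peaks that overlap the region."""
    start, end = region
    return sum(1 for p_start, p_end in peaks if p_start < end and p_end > start)

def pair_features(enhancer, promoter, peaks_by_assay):
    """One feature per (assay, region), where region is enhancer, promoter, or window."""
    regions = (
        ("enhancer", enhancer),
        ("promoter", promoter),
        ("window", window_between(enhancer, promoter)),
    )
    features = {}
    for assay, peaks in peaks_by_assay.items():
        for name, region in regions:
            features[f"{assay}_{name}"] = count_overlaps(region, peaks)
    return features

# Hypothetical pair and ChIP-seq peaks.
enhancer = (1000, 1500)
promoter = (50000, 52000)
peaks_by_assay = {
    "CTCF": [(1200, 1300), (20000, 20200), (30000, 30100)],
    "H3K27ac": [(900, 1100), (51000, 51500)],
}

features = pair_features(enhancer, promoter, peaks_by_assay)
print(features)
```

Keeping the three regions as separate columns is what allows the same mark (for example cohesin) to carry a different meaning in the window than at the promoter or enhancer.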
Predictive features colocate and form complexes
K562
Can TargetFinder work outside ENCODE cell lines?

What is a minimal set of experiments for accurate prediction?
optimal: 16+
minimal: 8

Test if models generalize across cell types
Fmax values (rows: TRAIN, columns: EVALUATE)

TRAIN \ EVALUATE   GM12878   K562   HeLa-S3   HUVEC
GM12878            0.83      0.40   0.43      0.39
K562               0.46      0.85   0.45      0.44
HeLa-S3            0.43      0.38   0.88      0.41
HUVEC              0.39      0.40   0.38      -

Expect ~35% precision and 55% recall on a new cell type with ~10 ChIP-seq datasets
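The slide does not define Fmax. A common reading, assumed here, is the maximum F1 score (harmonic mean of precision and recall) over all score thresholds. Under that assumption, the quoted ~35% precision and 55% recall correspond to an F1 of about 0.43, which sits inside the range of the off-diagonal values in the table. The sketch below shows both calculations; the labels and scores are toy values.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def fmax(labels, scores):
    """Maximum F1 over every threshold that appears among the scores."""
    best = 0.0
    for t in sorted(set(scores)):
        tp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 0)
        fn = sum(1 for y, s in zip(labels, scores) if s < t and y == 1)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        best = max(best, f1(precision, recall))
    return best

# F1 implied by the precision and recall quoted on the slide.
print(round(f1(0.35, 0.55), 2))

# Toy example of Fmax on five scored pairs.
toy_labels = [1, 1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.1]
print(fmax(toy_labels, toy_scores))
```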
TargetFinder accurately annotates enhancer-promoter pairs
[Diagram: a cardiomyopathy-associated variant in non-coding DNA, with candidate target genes Gene A, Gene B, and Gene C]
Massive data integration improves prediction
• Closest gene
 - Usually fails to identify the right promoter
 - Many false positives
• TargetFinder
 - Identifies 90-95% of known pairs (55% with less data)
 - Few false positives
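For comparison, the closest-gene baseline amounts to a one-line distance rule. The sketch below assumes each gene is represented by a single transcription start site (TSS) coordinate on the same chromosome as the enhancer; the gene names mirror the diagram and the coordinates are invented.

```python
def closest_gene(enhancer_mid, tss_by_gene):
    """Assign the enhancer to the gene whose TSS is nearest in linear (2D) distance."""
    return min(tss_by_gene, key=lambda gene: abs(tss_by_gene[gene] - enhancer_mid))

# Hypothetical TSS positions for the three genes in the diagram.
tss_by_gene = {"Gene A": 10_000, "Gene B": 60_000, "Gene C": 250_000}

# An enhancer at 50 kb is assigned to Gene B, even if it actually loops to Gene C.
print(closest_gene(50_000, tss_by_gene))
```

The rule ignores everything except linear distance, which is why it cannot recover a long-range target that skips over nearer genes.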
Which human genome sequences function as long-range enhancers?
EnhancerFinder Training
Teach a machine learning algorithm to identify developmental enhancers active in different tissues based on their features

Training Data
VISTA Browser: positive (+) and negative (-) examples

Computational Algorithm
Support vector machine: separates 2 groups
Multi-kernel: good for combining heterogeneous data types with different weights

Features
Evolutionary Conservation
• Human, Chimp, Mouse, Rat
Functional Genomics
• ChIP-seq (TFs, histones)
• DNase Hypersensitivity
• ENCODE
• Epigenomics Roadmap
• Bench-to-Bassinet
DNA Sequence Motifs
• short k-mers
• known TF motifs
[Figure: table of 4-mers, e.g. AAAA, AAAC, AAAG, AAAT, ...]
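The multi-kernel idea can be sketched as one similarity function per data type, combined as a weighted sum. This is an illustration under stated assumptions, not EnhancerFinder's code: it uses a k-mer spectrum kernel for sequence and plain dot products for the other two feature groups, and the weights are fixed by hand, whereas a multiple kernel learning method would learn them. The resulting kernel values would then be handed to an SVM solver, which is not shown. All sample values and names are hypothetical.

```python
from collections import Counter

def kmer_counts(seq, k=4):
    """Count every overlapping k-mer in a DNA sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def spectrum_kernel(seq_a, seq_b, k=4):
    """Dot product of the two k-mer count vectors."""
    counts_a, counts_b = kmer_counts(seq_a, k), kmer_counts(seq_b, k)
    return sum(counts_a[m] * counts_b[m] for m in counts_a if m in counts_b)

def linear_kernel(x, y):
    """Dot product of two numeric feature vectors."""
    return sum(xi * yi for xi, yi in zip(x, y))

def combined_kernel(a, b, weights):
    """Weighted sum of one kernel per data type."""
    parts = {
        "sequence": spectrum_kernel(a["sequence"], b["sequence"]),
        "conservation": linear_kernel(a["conservation"], b["conservation"]),
        "functional": linear_kernel(a["functional"], b["functional"]),
    }
    return sum(weights[name] * value for name, value in parts.items())

# Two hypothetical candidate regions.
region_a = {"sequence": "AAAAC", "conservation": [1.0, 0.5], "functional": [2.0, 0.0, 1.0]}
region_b = {"sequence": "AAAAG", "conservation": [0.5, 0.5], "functional": [1.0, 3.0, 1.0]}

# Hand-set weights, for illustration only.
weights = {"sequence": 0.5, "conservation": 1.0, "functional": 0.25}

print(combined_kernel(region_a, region_b, weights))
```

Giving each data type its own kernel keeps very different scales (k-mer counts, conservation scores, ChIP-seq signal) from swamping one another, since each contribution is scaled by its own weight.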
EnhancerFinder Performance
AUC = 0.96, Power = 85% at 10% FPR, Recall = 85% at 93% Precision
FDR ~10-50%
Significantly better than other methods
>80% in vivo validation rate
Erwin et al. (2014) PLoS Comp Bio
EnhancerFinder Predictions
• 84,301 developmental enhancer predictions
 - Cover 2% of the human genome
 - Nearby genes have high expression and annotated functions in the relevant fetal tissue
 - Significant overlap with disease mutations
 - Cluster around developmental transcription factors and signaling genes
• 239 predictions overlap a Human Accelerated Region (33% of HARs); 25/30 validated in vivo
• Identify sites with fitness effects (Gulko et al. 2015)
[Figure: 239 predicted HAR enhancers, including Heart (28), Limb (41), Brain (96)]
Erwin et al. (2014) PLoS Comp Bio; Capra et al. (2014) PTRSB
Massively Parallel Reporter Assays and capture Hi-C for validation
• Test >12,000 170bp enhancers in parallel
• Quantitative activity assayed via RNA-seq
• Compare genotypes
• Human vs chimp
• Disease SNPs
Figure 2: Massively parallel functional testing of HARs. For each HAR, the chimp version and versions with all combinations of human mutations will be shotgun cloned, along with synthetic 20bp barcodes, into an enhancer assay vector. This vector library will be introduced to human and chimp neuronal differentiated cells via lentivirus. HAR sequences that function as enhancers in these cells will produce barcoded RNA. Each unique HAR sequence will have an average of 90 tags to allow for a quantitative readout. RNA sequencing (RNA-seq) will allow for the measurement of the relative activity of each mutation in the encoded HAR. MinP (minimal promoter), mut (mutations).
[Figure labels: HARs; Enhancers (Human, Chimp); Vector library; Enhancer vector (MinP, Luciferase); barcodes AGGT, ACGT, TCGT, ACGC; Lentivirus; iPS derived cells; Neuronal differentiated cells; Transcribed barcodes; RNA-seq]
Hane Ryu, Nadav Ahituv, Jay Shendure, Yin Shen
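A quantitative readout from barcoded RNA could be computed roughly as follows. This is a sketch under assumptions that go beyond the slide: RNA barcode counts are normalized by the matching DNA (library) barcode counts, a pseudocount of 1 avoids division by zero, and activity is reported as a log2 ratio. The barcodes are the four shown in the figure (real barcodes are 20bp), and the sequence names and counts are invented.

```python
import math
from collections import defaultdict

def enhancer_activity(barcode_to_enhancer, rna_counts, dna_counts, pseudocount=1.0):
    """Sum counts over all barcodes of each enhancer, then take log2(RNA / DNA)."""
    rna = defaultdict(float)
    dna = defaultdict(float)
    for barcode, enhancer in barcode_to_enhancer.items():
        rna[enhancer] += rna_counts.get(barcode, 0)
        dna[enhancer] += dna_counts.get(barcode, 0)
    return {
        enhancer: math.log2((rna[enhancer] + pseudocount) / (dna[enhancer] + pseudocount))
        for enhancer in dna
    }

# Hypothetical barcode assignments and read counts.
barcode_to_enhancer = {
    "AGGT": "HAR1_human", "ACGT": "HAR1_human",
    "TCGT": "HAR1_chimp", "ACGC": "HAR1_chimp",
}
rna_counts = {"AGGT": 40, "ACGT": 23, "TCGT": 7, "ACGC": 8}
dna_counts = {"AGGT": 10, "ACGT": 5, "TCGT": 9, "ACGC": 6}

activity = enhancer_activity(barcode_to_enhancer, rna_counts, dna_counts)
print(activity)
print(activity["HAR1_human"] - activity["HAR1_chimp"])  # human vs chimp difference
```

Pooling many barcodes per sequence (an average of 90 in the figure caption) is what makes the per-sequence estimate robust to barcode-specific noise.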
Induced pluripotent stem cell derived neuronal and cardiac lines
[Images: Chimp; Human; Human iPSC derived cardiomyocytes]
Hane Ryu, Alex Pollen, Nadav Ahituv, Arnold Kriegstein, Bruce Conklin
Discovering the role of enhancers in human biology and evolution
Computational Characterization
iPSC based screening
In vivo molecular studies
[Diagram: non-coding variants near Gene B and Gene C]
Collaborators
EnhancerFinder: Gen Haliburton, Tony Capra, Dennis Kostka, John Rubenstein
TargetFinder: Sean Whalen, Rebecca Truty, Benoit Bruneau
MotifDiverge: Dennis Kostka, Tara Friedrich, Deb Ritter, Jeff Chuang
Funding from NHLBI, PhRMA Foundation, Gladstone Institutes
TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features
Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution
13
13 13 13 13
TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features
Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution
EvoluDonary ConservaDon Human
ChimpFeatures Mouse
Rat
Conserved13 synteny of13 enhancer and promoter
13 13
13 13 13 13 13
13 13 13 13
TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features
Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution
EvoluDonary ConservaDon FuncDonal Genomics Human
ChimpFeatures Mouse
Rat
Conserved13 synteny of13 bullChIA-shy‐PET enhancer and promoter bullRNA-shy‐seq
bullChIP-shy‐seq (TFs histones) bullenhancerpromoterwindowbullChromHMM13 amp Segway
13 13 13
13 13 13 13 13
13 13 13 13 13 13 13 13 13 13 13
13 13 13 13
TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features
Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution
EvoluDonary ConservaDon FuncDonal Genomics Sequence AnnotaDons Human AAAAAAACAAAGAAAT13
AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13
ACAAACACACAGACAT13Rat
Conserved13 synteny of13 bullChIA-shy‐PET K-shy‐mer correlaDon enhancer and promoter bullRNA-shy‐seq Annotated funcDons and
bullChIP-shy‐seq (TFs histones) pathways of gene and bullenhancerpromoterwindow enhancer bound TFs bullChromHMM13 amp Segway
13 13 13
13 13 13 13 13
13 13 13 13 13 13 13 13 13 13 13
13 13 13 13
TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features
Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution
Computa6onal Algorithm
Decision trees good for interacting features 13 Ensemble learning build many imperfect classifiers and combine them to improve prediction accuracy
EvoluDonary ConservaDon FuncDonal Genomics Sequence AnnotaDons Human AAAAAAACAAAGAAAT13
AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13
ACAAACACACAGACAT13Rat
Conserved13 synteny of13 bullChIA-shy‐PET K-shy‐mer correlaDon enhancer and promoter bullRNA-shy‐seq Annotated funcDons and
bullChIP-shy‐seq (TFs histones) pathways of gene and bullenhancerpromoterwindow enhancer bound TFs bullChromHMM13 amp Segway
TargetFinder Performance AUC=094-096 Precision
=90-95 Recall=76-83 Power=85-89 at 10 FPR 13 Significantly better than random and logistic regression
TargetFinder performs well at verylong distances
TargetFinder Feature Importance
TargetFinder Feature Importance Most predictive features mark the window between the enhancer and the promoter
TargetFinder Feature Importance Most predictive features mark the window between the enhancer and the promoter
TargetFinder Feature Importance Most predictive features mark the window between the enhancer and the promoter
TargetFinder Feature Importance
TargetFinder Feature Importance Most useful features for prediction are TF andhistone marks in the window between the enhancer and the promoter
bull True interactions - Enhancer-associated proteins P300 JUN TFs - Marks of heterochromatin lack of DNA methylation - Marks of paused or poised RNA polymerase
TargetFinder Feature Importance Most useful features for prediction are TF andhistone marks in the window between the enhancer and the promoter
bull True interactions - Enhancer-associated proteins P300 JUN TFs - Marks of heterochromatin lack of DNA methylation - Marks of paused or poised RNA polymerase
bull False interactions - Cohesin complex CTCF RAD21 SMC3 ZNF143 - Histone marks of open chromatin and elongation - Marks of active promoters and gene bodies
TargetFinder Feature Importance Most useful features for prediction are TF andhistone marks in the window between the enhancer and the promoter
bull True interactions - Enhancer-associated proteins P300 JUN TFs - Marks of heterochromatin lack of DNA methylation - Marks of paused or poised RNA polymerase
bull False interactions - Cohesin complex CTCF RAD21 SMC3 ZNF143 - Histone marks of open chromatin and elongation - Marks of active promoters and gene bodies
Many ldquowindowrdquo features have a different meaning when marking promoters and enhancers (eg cohesin)
Predictive features colocate and form complexes
K562
Predictive features colocate and form complexes
K562
Can TargetFinder work outsideENCODE cell lines
Can TargetFinder work outsideENCODE cell lines
What is a minimal set of experimentsfor accurate prediction
Can TargetFinder work outsideENCODE cell lines
What is a minimal set of experimentsfor accurate prediction optimal 16+
minimal 8
Can TargetFinder work outsideENCODE cell lines
Can TargetFinder work outsideENCODE cell lines
Test if models generalize across cell types
Can TargetFinder work outsideENCODE cell lines
Test if models generalize across cell types EVALUATE
GM12878 K562 HeLa-S3 HUVEC
GM12878 083 040 043 039
K562 046 085 045 044
HeLa-S3 043 038 088 041
HUVEC 039 040 038 -
TRAIN
Fmax13 values
Can TargetFinder work outsideENCODE cell lines
Test if models generalize across cell types EVALUATE
GM12878 K562 HeLa-S3 HUVEC
GM12878 083 040 043 039
K562 046 085 045 044
HeLa-S3 043 038 088 041
HUVEC 039 040 038 -
TRAIN
Fmax13 values
Expect ~35 precision and 55 recall on anew cell type with ~10 ChIP-seq datasets
Cardiomyopathy associated variant
Gene A Gene B
G A
TargetFinder accurately annotatesenhancer-promoter pairs
A T C C
DNA Gene C
Cardiomyopathy associated variant
Gene A Gene B
G A
TargetFinder accurately annotatesenhancer-promoter pairs
A T C C
DNA Gene C
Massive data integration improves predictionbull Closest gene
- Usually fails to identify the right promoter - Many false positives
Cardiomyopathy associated variant
Gene A Gene B
G A
TargetFinder accurately annotatesenhancer-promoter pairs
A T C C
DNA Gene C
Massive data integration improves predictionbull Closest gene
- Usually fails to identify the right promoter - Many false positives
bull TargetFinder - Identifies 95-90 of known pairs (55 with less data) - Few false positives
Which human genome sequences function as long-range enhancers
13
13
13
EnhancerFinder Training
Teach a machine learning algorithm to identify developmental enhancers active in different tissues based on their features

Training Data
• Positive (+) and negative (−) examples (VISTA Browser)

Features
• Evolutionary Conservation (Human, Chimp, Mouse, Rat)
• Functional Genomics
  - ChIP-seq (TFs, histones)
  - DNase hypersensitivity
  - ENCODE
  - Epigenomics Roadmap
  - Bench-to-Bassinet
• DNA Sequence Motifs
  - Short k-mers (e.g., AAAA, AAAC, AAAG, AAAT, AACA, ...)
  - Known TF motifs

Computational Algorithm
• Support vector machine: separates 2 groups
• Multi-kernel: good for combining heterogeneous data types with different weights
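As a rough illustration of the sequence features and the multi-kernel idea, the sketch below counts overlapping k-mers and combines one linear kernel per data type into a weighted sum. It is a minimal sketch, not EnhancerFinder itself: the function names, the toy feature values, and the fixed weights are all assumptions. In multiple kernel learning the weights would be fitted during training, and the combined kernel would then be handed to an SVM solver.

```python
from itertools import product


def kmer_counts(seq, k=4):
    """Count overlapping occurrences of every DNA k-mer in seq."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # skip words containing N or other symbols
            counts[word] += 1
    return [counts[m] for m in kmers]


def linear_kernel(x, y):
    """Dot product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(x, y))


def combined_kernel(x_views, y_views, weights):
    """Weighted sum of one linear kernel per data type ("view")."""
    return sum(w * linear_kernel(x, y)
               for w, x, y in zip(weights, x_views, y_views))


# Two hypothetical regions, each described by three views:
# [conservation score], [functional genomics signals], k-mer counts.
region_a = [[0.9], [1.0, 0.0], kmer_counts("AAAAC", k=2)]
region_b = [[0.5], [1.0, 1.0], kmer_counts("AAC", k=2)]
weights = [1.0, 0.5, 0.1]  # made-up, non-negative kernel weights

similarity = combined_kernel(region_a, region_b, weights)
```

Keeping the weights non-negative matters: a non-negative weighted sum of valid kernels is itself a valid kernel, which is what lets heterogeneous data types share one classifier.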
EnhancerFinder Performance
• AUC = 0.96
• Power = 85% at 10% FPR
• Recall = 85% at 93% precision
• FDR ~10-50%
• Significantly better than other methods
• >80% in vivo validation rate
Erwin et al. (2014) PLoS Comp Bio
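The quantities on this slide are all derived from a classifier's scores on labeled examples. The sketch below shows how they are computed on a made-up set of labels and scores. None of the numbers come from EnhancerFinder, and the function names are illustrative.

```python
def confusion_counts(labels, scores, threshold):
    """Count TP, FP, TN, FN when calling score >= threshold positive."""
    tp = fp = tn = fn = 0
    for y, s in zip(labels, scores):
        predicted = s >= threshold
        if predicted and y:
            tp += 1
        elif predicted:
            fp += 1
        elif y:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def metrics(tp, fp, tn, fn):
    return {
        "recall": tp / (tp + fn),     # also called power or sensitivity
        "precision": tp / (tp + fp),
        "fpr": fp / (fp + tn),        # false positive rate
        "fdr": fp / (tp + fp),        # false discovery rate = 1 - precision
    }


def auc(labels, scores):
    """Probability that a random positive outscores a random negative."""
    pos = [s for y, s in zip(labels, scores) if y]
    neg = [s for y, s in zip(labels, scores) if not y]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data: 4 true enhancers, 6 non-enhancers.
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2, 0.1, 0.1, 0.05, 0.0]

m = metrics(*confusion_counts(labels, scores, threshold=0.5))
area = auc(labels, scores)
```

Precision and FDR depend on the ratio of positives to negatives in the evaluated set, whereas recall, FPR, and AUC do not.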
EnhancerFinder Predictions
• 84,301 developmental enhancer predictions
  - Cover 2% of the human genome
  - Nearby genes have high expression and annotated functions in the relevant fetal tissue
  - Significant overlap with disease mutations
  - Cluster around developmental transcription factors and signaling genes
• 239 predictions overlap a Human Accelerated Region (33% of HARs); 25/30 validated in vivo
• Identify sites with fitness effects (Gulko et al. 2015)
[Figure: 239 predicted HAR enhancers; Heart (28), Limb (41), Brain (96)]
Erwin et al. (2014) PLoS Comp Bio; Capra et al. (2014) PTRSB
Massively Parallel Reporter Assays and capture Hi-C for validation
• Test >12,000 170-bp enhancers in parallel
• Quantitative activity assayed via RNA-seq
• Compare genotypes
  - Human vs chimp
  - Disease SNPs
[Diagram: human and chimp enhancers (HARs), vector library (enhancer vector with barcode, MinP, luciferase), lentivirus, iPS-derived neuronal differentiated cells, transcribed barcodes, RNA-seq]
Figure 2. Massively parallel functional testing of HARs. For each HAR, the chimp version and versions with all combinations of human mutations will be shotgun cloned, along with synthetic 20-bp barcodes, into an enhancer assay vector. This vector library will be introduced to human and chimp neuronal differentiated cells via lentivirus. HAR sequences that function as enhancers in these cells will produce barcoded RNA. Each unique HAR sequence will have an average of 90 tags to allow for a quantitative readout. RNA sequencing (RNA-seq) will allow for the measurement of the relative activity of each mutation in the encoded HAR. MinP: minimal promoter; mut: mutations.
Hane Ryu, Nadav Ahituv, Jay Shendure, Yin Shen
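The barcode readout described for the reporter assay can be sketched as a small calculation: sum the barcode counts belonging to each enhancer sequence, then compare RNA to a reference. Normalizing to DNA (plasmid) barcode counts and adding a pseudocount are assumptions made here for illustration; the slides only state that activity is read out by RNA-seq. The sequence names, 4-bp barcodes, and counts are made up.

```python
from collections import defaultdict
from math import log2


def enhancer_activity(barcode_to_enhancer, rna_counts, dna_counts, pseudocount=1.0):
    """Aggregate barcode counts per enhancer and return log2(RNA / DNA)."""
    rna = defaultdict(float)
    dna = defaultdict(float)
    for barcode, enhancer in barcode_to_enhancer.items():
        rna[enhancer] += rna_counts.get(barcode, 0)
        dna[enhancer] += dna_counts.get(barcode, 0)
    return {
        enhancer: log2((rna[enhancer] + pseudocount) / (dna[enhancer] + pseudocount))
        for enhancer in dna
    }


# Toy library: two barcodes per sequence (the figure caption describes ~90 per sequence).
barcode_to_enhancer = {
    "ACGT": "HARx_human", "ACGC": "HARx_human",
    "TCGT": "HARx_chimp", "AGGT": "HARx_chimp",
}
rna_counts = {"ACGT": 30, "ACGC": 33, "TCGT": 7, "AGGT": 8}
dna_counts = {"ACGT": 15, "ACGC": 16, "TCGT": 16, "AGGT": 15}

activity = enhancer_activity(barcode_to_enhancer, rna_counts, dna_counts)
# Human-vs-chimp comparison for the same (hypothetical) HAR.
human_vs_chimp = activity["HARx_human"] - activity["HARx_chimp"]
```

Pooling many barcodes per sequence is what makes the readout quantitative: a single barcode's count is noisy, while the sum over tags is much more stable.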
Induced pluripotent stem cell derived neuronal and cardiac lines
[Images: chimp and human lines; human iPSC-derived cardiomyocytes]
Hane Ryu, Alex Pollen, Nadav Ahituv, Arnold Kriegstein, Bruce Conklin
Discovering the role of enhancers in human biology and evolution
• Computational characterization
• iPSC-based screening
• In vivo molecular studies
[Diagram: non-coding variants near Gene B and Gene C]
Collaborators
• EnhancerFinder: Gen Haliburton, Tony Capra, Dennis Kostka, John Rubenstein
• TargetFinder: Sean Whalen, Rebecca Truty, Benoit Bruneau
• MotifDiverge: Dennis Kostka, Tara Friedrich, Deb Ritter, Jeff Chuang
Funding from NHLBI, PhRMA Foundation, Gladstone Institutes
TargetFinder Training
Teach a machine learning algorithm to discriminate true versus false enhancer-promoter interactions based on their features

Training Data
• Active enhancer, expressed gene
• Positives = Hi-C +
• Negatives = Hi-C −
• Rao et al. 2014, 1-kb resolution

Features
• Evolutionary Conservation (Human, Chimp, Mouse, Rat)
  - Conserved synteny of enhancer and promoter
• Functional Genomics
  - ChIA-PET
  - RNA-seq
  - ChIP-seq (TFs, histones)
  - Enhancer / promoter / window
  - ChromHMM & Segway
• Sequence Annotations
  - K-mer correlation
  - Annotated functions and pathways of gene and enhancer-bound TFs

Computational Algorithm
• Decision trees: good for interacting features
• Ensemble learning: build many imperfect classifiers and combine them to improve prediction accuracy
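The ensemble idea on this slide ("build many imperfect classifiers and combine them") can be illustrated with a deliberately small sketch. It uses one-split decision trees (stumps) trained on bootstrap samples and combined by majority vote. These are assumptions chosen for brevity: the slides do not say which ensemble method or tree depth TargetFinder uses, and the data below are made up.

```python
import random


def fit_stump(X, y):
    """Find the single feature/threshold split with the fewest training errors."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            for sign in (1, -1):
                # sign=1 predicts positive when feature j > t; sign=-1 when < t.
                errors = sum(
                    (1 if sign * (row[j] - t) > 0 else 0) != label
                    for row, label in zip(X, y)
                )
                if best is None or errors < best[0]:
                    best = (errors, j, t, sign)
    return best[1:]


def predict_stump(stump, row):
    j, t, sign = stump
    return 1 if sign * (row[j] - t) > 0 else 0


def fit_bagged_stumps(X, y, n_estimators=25, seed=0):
    """Train each stump on a bootstrap resample of the training set."""
    rng = random.Random(seed)
    n = len(X)
    stumps = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]
        stumps.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return stumps


def predict_ensemble(stumps, row):
    """Majority vote over all stumps."""
    votes = sum(predict_stump(s, row) for s in stumps)
    return 1 if 2 * votes > len(stumps) else 0


# Toy pairs: feature 0 separates the classes, feature 1 is noise.
X = [[0.1, 5], [0.2, 3], [0.3, 4], [0.4, 1], [0.6, 2], [0.7, 5], [0.8, 1], [0.9, 3]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

model = fit_bagged_stumps(X, y)
```

A single stump cannot model interacting features; deeper trees can, because each split is conditional on the splits above it. That is the property the slide highlights.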
TargetFinder Performance
• AUC = 0.94-0.96
• Precision = 90-95%
• Recall = 76-83%
• Power = 85-89% at 10% FPR
• Significantly better than random and logistic regression
TargetFinder performs well at very long distances
TargetFinder Feature Importance
Most useful features for prediction are TF and histone marks in the window between the enhancer and the promoter
• True interactions
  - Enhancer-associated proteins: P300, JUN, TFs
  - Marks of heterochromatin, lack of DNA methylation
  - Marks of paused or poised RNA polymerase
• False interactions
  - Cohesin complex, CTCF, RAD21, SMC3, ZNF143
  - Histone marks of open chromatin and elongation
  - Marks of active promoters and gene bodies
Many "window" features have a different meaning when marking promoters and enhancers (e.g., cohesin)
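To make the enhancer / promoter / window distinction concrete, the sketch below builds one feature per assay and region for a single enhancer-promoter pair. Counting overlapping peaks is an assumption chosen for simplicity (a signal average would be another option), and all coordinates, assay names, and function names are illustrative.

```python
def pair_features(enhancer, promoter, peaks_by_assay):
    """Count peaks overlapping the enhancer, the promoter, and the window between them.

    enhancer, promoter: (start, end) half-open intervals on the same chromosome.
    peaks_by_assay: {assay_name: [(start, end), ...]}.
    """
    left, right = sorted([enhancer, promoter])
    window = (left[1], right[0])  # empty if the two regions touch or overlap
    regions = {"enhancer": enhancer, "promoter": promoter, "window": window}
    features = {}
    for assay, peaks in peaks_by_assay.items():
        for name, (start, end) in regions.items():
            features[f"{assay}_{name}"] = sum(
                1 for p_start, p_end in peaks
                if start < end and p_start < end and start < p_end
            )
    return features


# Hypothetical pair and peak calls.
enhancer = (1_000, 1_500)
promoter = (50_000, 51_000)
peaks = {
    "CTCF": [(1_200, 1_300), (20_000, 20_200), (30_000, 30_400)],
    "H3K27ac": [(900, 1_100), (50_500, 50_700)],
}

feats = pair_features(enhancer, promoter, peaks)
```

Keeping the same assay as three separate features lets a classifier learn, for example, that a factor bound in the window carries a different meaning from the same factor bound at the promoter.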
Predictive features colocate and form complexes
[Figure: K562]
Can TargetFinder work outside ENCODE cell lines?

What is a minimal set of experiments for accurate prediction?
• Optimal: 16+
• Minimal: 8

Test if models generalize across cell types
Fmax values (rows: TRAIN cell line; columns: EVALUATE cell line)

TRAIN \ EVALUATE   GM12878   K562   HeLa-S3   HUVEC
GM12878            0.83      0.40   0.43      0.39
K562               0.46      0.85   0.45      0.44
HeLa-S3            0.43      0.38   0.88      0.41
HUVEC              0.39      0.40   0.38      -

Expect ~35% precision and 55% recall on a new cell type with ~10 ChIP-seq datasets
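The Fmax values in the table are read here as the largest F-measure (harmonic mean of precision and recall) reached over all score thresholds, which is the usual definition; the slides do not spell it out. The sketch below computes it on made-up scores and also evaluates the F-measure at the precision and recall quoted for a new cell type.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def f_max(labels, scores):
    """Largest F-measure over all thresholds (score >= threshold is positive)."""
    n_pos = sum(labels)
    best = 0.0
    for t in sorted(set(scores)):
        tp = sum(1 for y, s in zip(labels, scores) if s >= t and y)
        n_pred = sum(1 for s in scores if s >= t)
        if tp:
            best = max(best, f_measure(tp / n_pred, tp / n_pos))
    return best


# Toy predictions for six enhancer-promoter pairs.
labels = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
best_f = f_max(labels, scores)

# F-measure at ~35% precision and ~55% recall, as quoted on the slide.
print(round(f_measure(0.35, 0.55), 2))  # 0.43
```

The result, about 0.43, falls in the same range as the off-diagonal entries of the table (0.38 to 0.46), so the quoted precision and recall are consistent with the cross-cell-type Fmax values.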
13 13 13
13 13 13 13 13
13 13 13 13 13 13 13 13 13 13 13
13 13 13 13
TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features
Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution
EvoluDonary ConservaDon FuncDonal Genomics Sequence AnnotaDons Human AAAAAAACAAAGAAAT13
AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13
ACAAACACACAGACAT13Rat
Conserved13 synteny of13 bullChIA-shy‐PET K-shy‐mer correlaDon enhancer and promoter bullRNA-shy‐seq Annotated funcDons and
bullChIP-shy‐seq (TFs histones) pathways of gene and bullenhancerpromoterwindow enhancer bound TFs bullChromHMM13 amp Segway
13 13 13
13 13 13 13 13
13 13 13 13 13 13 13 13 13 13 13
13 13 13 13
TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features
Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution
Computa6onal Algorithm
Decision trees good for interacting features 13 Ensemble learning build many imperfect classifiers and combine them to improve prediction accuracy
EvoluDonary ConservaDon FuncDonal Genomics Sequence AnnotaDons Human AAAAAAACAAAGAAAT13
AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13
ACAAACACACAGACAT13Rat
Conserved13 synteny of13 bullChIA-shy‐PET K-shy‐mer correlaDon enhancer and promoter bullRNA-shy‐seq Annotated funcDons and
bullChIP-shy‐seq (TFs histones) pathways of gene and bullenhancerpromoterwindow enhancer bound TFs bullChromHMM13 amp Segway
TargetFinder Performance AUC=094-096 Precision
=90-95 Recall=76-83 Power=85-89 at 10 FPR 13 Significantly better than random and logistic regression
TargetFinder performs well at verylong distances
TargetFinder Feature Importance
Most useful features for prediction are TF and histone marks in the window between the enhancer and the promoter.
• True interactions
 - Enhancer-associated proteins: P300, JUN, TFs
 - Marks of heterochromatin, lack of DNA methylation
 - Marks of paused or poised RNA polymerase
• False interactions
 - Cohesin complex: CTCF, RAD21, SMC3, ZNF143
 - Histone marks of open chromatin and elongation
 - Marks of active promoters and gene bodies
Many "window" features have a different meaning when marking promoters and enhancers (e.g., cohesin).
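To make the enhancer / promoter / window distinction concrete, here is a sketch that turns peak calls for one mark into three region-specific features per candidate pair. It assumes half-open (start, end) coordinates on a single chromosome, non-overlapping peaks, and peak coverage as the summary statistic; all coordinates and the function names are hypothetical.

```python
def overlap_bp(region, peaks):
    """Number of bases of `region` covered by the (non-overlapping) peaks."""
    start, end = region
    return sum(max(0, min(end, p_end) - max(start, p_start))
               for p_start, p_end in peaks)

def pair_features(enhancer, promoter, peaks_by_mark):
    """Fraction of the enhancer, promoter, and intervening window covered by each mark."""
    left, right = sorted([enhancer, promoter])
    window = (left[1], right[0])  # sequence strictly between the two elements
    features = {}
    for mark, peaks in peaks_by_mark.items():
        for name, region in (("enhancer", enhancer),
                             ("promoter", promoter),
                             ("window", window)):
            length = max(1, region[1] - region[0])
            features[f"{mark}_{name}"] = overlap_bp(region, peaks) / length
    return features

# Hypothetical coordinates for one enhancer-promoter pair and one mark.
features = pair_features(
    enhancer=(1000, 1200),
    promoter=(5000, 5500),
    peaks_by_mark={"RAD21": [(1100, 1300), (3000, 3100)]},
)
```

Keeping the same mark as three separate features lets a classifier give it a different weight, or even a different sign, depending on where it falls.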
Predictive features colocate and form complexes
K562
Can TargetFinder work outside ENCODE cell lines?

What is a minimal set of experiments for accurate prediction?
 - optimal: 16+
 - minimal: 8

Test if models generalize across cell types.

Fmax values (rows: TRAIN, columns: EVALUATE)

TRAIN \ EVALUATE   GM12878   K562   HeLa-S3   HUVEC
GM12878            0.83      0.40   0.43      0.39
K562               0.46      0.85   0.45      0.44
HeLa-S3            0.43      0.38   0.88      0.41
HUVEC              0.39      0.40   0.38      -

Expect ~35% precision and 55% recall on a new cell type with ~10 ChIP-seq datasets.
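For reference, Fmax is the largest F1 score (harmonic mean of precision and recall) over all score thresholds. The sketch below computes it and also checks that ~35% precision with 55% recall corresponds to an F1 of about 0.43, in line with the off-diagonal values reported for cross-cell-type evaluation. The toy labels and scores are hypothetical.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def fmax(labels, scores):
    """Maximum F1 over every threshold that occurs among the scores."""
    n_pos = sum(labels)  # assumed to be at least 1
    best = 0.0
    for threshold in sorted(set(scores)):
        tp = sum(1 for l, s in zip(labels, scores) if l and s >= threshold)
        n_pred = sum(1 for s in scores if s >= threshold)
        best = max(best, f1(tp / n_pred, tp / n_pos))
    return best

# F1 implied by the precision and recall quoted for a new cell type.
new_cell_type_f1 = f1(0.35, 0.55)

# Hypothetical toy scores to exercise fmax.
toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
toy_fmax = fmax(toy_labels, toy_scores)
```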
TargetFinder accurately annotates enhancer-promoter pairs
[Diagram: DNA with a cardiomyopathy-associated variant and candidate target genes Gene A, Gene B, Gene C]
Massive data integration improves prediction
• Closest gene
 - Usually fails to identify the right promoter
 - Many false positives
• TargetFinder
 - Identifies 95-90% of known pairs (55% with less data)
 - Few false positives
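The closest-gene baseline being compared against can be sketched in a few lines: assign each enhancer to the gene with the nearest transcription start site (TSS) on the same chromosome. The gene names, positions, and "known" pairs below are hypothetical and only illustrate how such a baseline would be scored.

```python
from bisect import bisect_left

def closest_gene(enhancer_midpoint, tss_list):
    """tss_list: (tss_position, gene_name) tuples sorted by position, one chromosome."""
    positions = [pos for pos, _ in tss_list]
    i = bisect_left(positions, enhancer_midpoint)
    # Only the TSS just left and just right of the enhancer can be the nearest.
    candidates = tss_list[max(0, i - 1): i + 1]
    return min(candidates, key=lambda item: abs(item[0] - enhancer_midpoint))[1]

def baseline_accuracy(known_pairs, tss_list):
    """Fraction of (enhancer_midpoint, true_gene) pairs the baseline gets right."""
    hits = sum(closest_gene(mid, tss_list) == gene for mid, gene in known_pairs)
    return hits / len(known_pairs)

# Hypothetical locus with three genes.
tss_list = [(10_000, "GeneA"), (50_000, "GeneB"), (200_000, "GeneC")]
# Hypothetical validated pairs: the first enhancer skips GeneB and targets GeneC.
known_pairs = [(60_000, "GeneC"), (5_000, "GeneA")]
accuracy = baseline_accuracy(known_pairs, tss_list)
```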
Which human genome sequences function as long-range enhancers?
EnhancerFinder Training
Teach a machine learning algorithm to identify developmental enhancers active in different tissues based on their features.

Training Data
• Positive (+) and negative (-) examples, VISTA Browser

Features
• Evolutionary Conservation (Human, Chimp, Mouse, Rat)
• Functional Genomics
 - ChIP-seq (TFs, histones)
 - DNase Hypersensitivity
 - ENCODE
 - Epigenomics Roadmap
 - Bench-to-Bassinet
• DNA Sequence Motifs
 - short k-mers
 - known TF motifs

Computational Algorithm
• Support vector machine: separates 2 groups
• Multi-kernel: good for combining heterogeneous data types with different weights
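A rough sketch of the multi-kernel idea: compute one similarity (kernel) per data type and combine them as a weighted sum, which is again a valid kernel when the weights are non-negative. This assumes a k-mer spectrum kernel for sequence and linear kernels for the numeric features; in real multiple kernel learning the weights are learned together with the SVM, whereas the fixed weights and the example regions here are hypothetical.

```python
from collections import Counter

def kmer_counts(seq, k=4):
    """Count every overlapping k-mer in a DNA sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def spectrum_kernel(a, b, k=4):
    """Similarity of two sequences as the dot product of their k-mer counts."""
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    return sum(count * cb[kmer] for kmer, count in ca.items())

def linear_kernel(u, v):
    return sum(x * y for x, y in zip(u, v))

def combined_kernel(x, y, weights):
    """Weighted sum of one kernel per data type."""
    return (weights["sequence"] * spectrum_kernel(x["sequence"], y["sequence"])
            + weights["conservation"] * linear_kernel(x["conservation"], y["conservation"])
            + weights["functional"] * linear_kernel(x["functional"], y["functional"]))

# Hypothetical candidate regions described by three data types.
region_a = {"sequence": "AAAAC", "conservation": [0.9], "functional": [1.0, 0.0]}
region_b = {"sequence": "AAAAG", "conservation": [0.5], "functional": [1.0, 1.0]}
weights = {"sequence": 0.2, "conservation": 0.3, "functional": 0.5}
similarity = combined_kernel(region_a, region_b, weights)
```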
EnhancerFinder Performance
AUC = 0.96, Power = 85% at 10% FPR, Recall = 85% at 93% Precision
FDR ~10-50%
Significantly better than other methods
>80% in vivo validation rate
Erwin et al. (2014) PLoS Comp Bio
EnhancerFinder Predictions
• 84,301 developmental enhancer predictions
 - Cover 2% of the human genome
 - Nearby genes have high expression and annotated functions in the relevant fetal tissue
 - Significant overlap with disease mutations
 - Cluster around developmental transcription factors and signaling genes
• 239 predictions overlap a Human Accelerated Region (33% of HARs)
 - 25/30 validated in vivo
• Identify sites with fitness effects (Gulko et al. 2015)
[Figure: 239 Predicted HAR Enhancers: Heart (28), Limb (41), Brain (96)]
Erwin et al. (2014) PLoS Comp Bio; Capra et al. (2014) PTRSB
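Counting how many predictions overlap a HAR is an interval-overlap problem. The sketch below assumes half-open (chrom, start, end) intervals and uses a simple all-against-all comparison, which is fine for illustration but too slow for genome-scale sets; the intervals are hypothetical.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def count_overlapping(queries, targets):
    """Number of query intervals that overlap at least one target interval."""
    return sum(1 for q in queries if any(overlaps(q, t) for t in targets))

# Hypothetical enhancer predictions and accelerated regions.
predictions = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200)]
hars = [("chr1", 150, 160), ("chr2", 300, 400)]

n_predictions_in_hars = count_overlapping(predictions, hars)
fraction_of_hars_covered = count_overlapping(hars, predictions) / len(hars)
```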
Massively Parallel Reporter Assays and capture Hi-C for validation
• Test >12,000 170 bp enhancers in parallel
• Quantitative activity assayed via RNA-seq
• Compare genotypes
 - Human vs chimp
 - Disease SNPs
Figure 2. Massively parallel functional testing of HARs. For each HAR, the chimp version and versions with all combinations of human mutations will be shotgun cloned, along with synthetic 20 bp barcodes, into an enhancer assay vector. This vector library will be introduced to human and chimp neuronal differentiated cells via lentivirus. HAR sequences that function as enhancers in these cells will produce barcoded RNA. Each unique HAR sequence will have an average of 90 tags to allow for a quantitative readout. RNA sequencing (RNA-seq) will allow for the measurement of the relative activity of each mutation in the encoded HAR. MinP: minimal promoter; mut: mutations.
[Figure labels: HARs; enhancers (human, chimp); vector library; enhancer vector (MinP, luciferase); lentivirus; iPS derived cells; neuronal differentiated cells; transcribed barcodes; RNA-seq]
Hane Ryu, Nadav Ahituv, Jay Shendure, Yin Shen
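One common way to turn barcode counts into a quantitative readout is sketched below: take the log2 ratio of RNA to DNA counts for each barcode and summarize the barcodes tagging one sequence by their median. Normalizing by DNA (plasmid) counts and adding a pseudocount are assumptions not stated on the slide, and all barcode names and counts are hypothetical.

```python
from math import log2
from statistics import median

def enhancer_activity(rna_counts, dna_counts, pseudocount=1):
    """Median log2(RNA/DNA) across the barcodes that tag one enhancer sequence."""
    ratios = [
        log2((rna_counts.get(barcode, 0) + pseudocount) / (dna + pseudocount))
        for barcode, dna in dna_counts.items()
    ]
    return median(ratios)

# Hypothetical counts for three barcodes tagging each version of one HAR.
dna = {"bc1": 10, "bc2": 20, "bc3": 5}
human_rna = {"bc1": 43, "bc2": 83, "bc3": 23}
chimp_rna = {"bc1": 10, "bc2": 20, "bc3": 5}

human_activity = enhancer_activity(human_rna, dna)
chimp_activity = enhancer_activity(chimp_rna, dna)
# Positive values mean the human version drives more expression than the chimp version.
human_vs_chimp = human_activity - chimp_activity
```

Using many barcodes per sequence (the figure mentions an average of 90 tags) makes the median robust to individual barcodes with unusual counts.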
Induced pluripotent stem cell derived neuronal and cardiac lines
[Images: chimp and human cell lines; human iPSC derived cardiomyocytes]
Hane Ryu, Alex Pollen, Nadav Ahituv, Arnold Kriegstein, Bruce Conklin
Discovering the role of enhancers in human biology and evolution
• Computational characterization
• iPSC based screening
• In vivo molecular studies
Collaborators
EnhancerFinder: Gen Haliburton, Tony Capra, Dennis Kostka, John Rubenstein
TargetFinder: Sean Whalen, Rebecca Truty, Benoit Bruneau
MotifDiverge: Dennis Kostka, Tara Friedrich, Deb Ritter, Jeff Chuang
Funding from NHLBI, PhRMA Foundation, Gladstone Institutes
TargetFinder performs well at verylong distances
TargetFinder Feature Importance
Most useful features for prediction are TF and histone marks in the window between the enhancer and the promoter
• True interactions
  - Enhancer-associated proteins: P300, JUN, TFs
  - Marks of heterochromatin, lack of DNA methylation
  - Marks of paused or poised RNA polymerase
• False interactions
  - Cohesin complex: CTCF, RAD21, SMC3, ZNF143
  - Histone marks of open chromatin and elongation
  - Marks of active promoters and gene bodies
Many "window" features have a different meaning when marking promoters and enhancers (e.g., cohesin)
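To make the "window" idea concrete, here is a minimal sketch of how one ChIP-seq track could be summarized separately over the enhancer, the promoter, and the region between them. The interval representation, the function names, and the coverage summary are assumptions made for illustration; the slides do not state how TargetFinder actually encodes its features.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on a single chromosome


def window_between(enhancer: Interval, promoter: Interval) -> Interval:
    """Return the region lying between two non-overlapping intervals."""
    left, right = sorted([enhancer, promoter])
    if left[1] > right[0]:
        raise ValueError("enhancer and promoter overlap; no window exists")
    return (left[1], right[0])


def overlap_bp(a: Interval, b: Interval) -> int:
    """Number of base pairs shared by two half-open intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def peak_coverage(region: Interval, peaks: List[Interval]) -> float:
    """Fraction of the region covered by peaks (peaks assumed non-overlapping)."""
    length = region[1] - region[0]
    if length <= 0:
        return 0.0
    return sum(overlap_bp(region, p) for p in peaks) / length


# Hypothetical coordinates and peaks, invented for illustration only.
enhancer = (1_000, 1_500)
promoter = (9_000, 9_600)
ctcf_peaks = [(1_200, 1_400), (3_000, 3_400), (5_000, 5_200), (9_100, 9_300)]

window = window_between(enhancer, promoter)
features = {
    "enhancer": peak_coverage(enhancer, ctcf_peaks),
    "window": peak_coverage(window, ctcf_peaks),
    "promoter": peak_coverage(promoter, ctcf_peaks),
}
```

Keeping the three regions as separate features is what would let a classifier assign the same mark (for example cohesin) a different meaning in the window than at the promoter or enhancer.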
Predictive features colocate and form complexes (K562)
Can TargetFinder work outside ENCODE cell lines?
What is a minimal set of experiments for accurate prediction? Optimal: 16+; minimal: 8
Test if models generalize across cell types
Fmax values (rows: TRAIN cell line; columns: EVALUATE cell line)
TRAIN \ EVALUATE   GM12878   K562   HeLa-S3   HUVEC
GM12878            0.83      0.40   0.43      0.39
K562               0.46      0.85   0.45      0.44
HeLa-S3            0.43      0.38   0.88      0.41
HUVEC              0.39      0.40   0.38      -
Expect ~35% precision and 55% recall on a new cell type with ~10 ChIP-seq datasets
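The slide reports Fmax without defining it. A common reading, assumed here, is the maximum over score thresholds of the F-measure (harmonic mean of precision and recall). Under that assumption, the sketch below shows that roughly 35% precision with 55% recall corresponds to an F-measure near 0.43, in line with the off-diagonal entries of the table. The toy scores and labels are invented.

```python
from typing import Sequence


def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (F1); 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def fmax(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Maximum F1 over all score thresholds (predict positive if score >= t)."""
    total_pos = sum(labels)
    best = 0.0
    for t in sorted(set(scores)):
        predicted = [s >= t for s in scores]
        tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
        fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / total_pos if total_pos else 0.0
        best = max(best, f_measure(precision, recall))
    return best


# Cross-cell-type expectation quoted on the slide: ~35% precision, ~55% recall.
expected_f = f_measure(0.35, 0.55)  # about 0.43

# Toy classifier scores and true labels, invented for illustration.
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
toy_labels = [1, 1, 0, 1, 0, 0]
toy_fmax = fmax(toy_scores, toy_labels)
```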
TargetFinder accurately annotates enhancer-promoter pairs
[Figure: cardiomyopathy-associated variant on DNA near Gene A, Gene B, and Gene C]
Massive data integration improves prediction
• Closest gene
  - Usually fails to identify the right promoter
  - Many false positives
• TargetFinder
  - Identifies 95-90% of known pairs (55% with less data)
  - Few false positives
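For reference, the closest-gene baseline that the slide argues against can be sketched in a few lines. The gene names echo the figure, but the coordinates and the function name are hypothetical.

```python
from typing import Dict, Optional


def closest_gene(enhancer_mid: int, tss_by_gene: Dict[str, int]) -> Optional[str]:
    """Return the gene whose TSS is nearest to the enhancer midpoint."""
    if not tss_by_gene:
        return None
    return min(tss_by_gene, key=lambda gene: abs(tss_by_gene[gene] - enhancer_mid))


# Hypothetical single-chromosome coordinates, invented for illustration only.
tss = {"GeneA": 120_000, "GeneB": 310_000, "GeneC": 905_000}
variant_enhancer_mid = 350_000

baseline_call = closest_gene(variant_enhancer_mid, tss)
```

The heuristic always reports the linearly nearest promoter (Gene B in this toy case), so it cannot recover a long-range target such as Gene C that is brought close only by 3D looping.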
Which human genome sequences function as long-range enhancers?
EnhancerFinder Training
Teach a machine learning algorithm to identify developmental enhancers active in different tissues based on their features
Training Data
VISTA Browser: positive (+) and negative (−) examples
Features
Evolutionary Conservation: Human, Chimp, Mouse, Rat
Functional Genomics
• ChIP-seq (TFs, histones)
• DNase hypersensitivity
• ENCODE
• Epigenomics Roadmap
• Bench-to-Bassinet
DNA Sequence Motifs
• short k-mers (AAAA, AAAC, AAAG, AAAT, AACA, AACC, AACG, AACT, AAGA, AAGC, AAGG, AAGT, AATA, AATC, AATG, AATT, ACAA, ACAC, ACAG, ACAT, ...)
• known TF motifs
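The "short k-mers" feature can be illustrated with a small sketch that turns a DNA sequence into a fixed-length vector of overlapping 4-mer counts, ordered lexicographically like the list on the slide (AAAA, AAAC, AAAG, AAAT, ...). The choice of k = 4, the function name, and the example sequence are assumptions for illustration.

```python
from collections import Counter
from itertools import product
from typing import Dict


def kmer_counts(sequence: str, k: int = 4) -> Dict[str, int]:
    """Count every overlapping k-mer over the alphabet ACGT in a sequence."""
    sequence = sequence.upper()
    counts = Counter(
        sequence[i:i + k] for i in range(len(sequence) - k + 1)
    )
    # Fixed-length feature vector: one entry per possible k-mer,
    # in lexicographic order; k-mers with other letters (e.g. N) are ignored.
    return {"".join(kmer): counts.get("".join(kmer), 0)
            for kmer in product("ACGT", repeat=k)}


# Invented example sequence.
features = kmer_counts("AAAACAAAG", k=4)
```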
Computational Algorithm
Support vector machine separates 2 groups
Multi-kernel: good for combining heterogeneous data types with different weights
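One way to picture the multi-kernel idea is as a weighted sum of one kernel per data type. The sketch below builds such a combined Gram matrix from linear kernels in plain Python. It covers only the kernel combination step: the weights are fixed by hand, whereas a multiple kernel learning method would fit them, and no SVM is trained. All feature values, group names, and weights are invented.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def linear_kernel(x: Vector, y: Vector) -> float:
    """Dot product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(x, y))


def combined_gram(
    groups: Dict[str, List[Vector]], weights: Dict[str, float]
) -> List[List[float]]:
    """Weighted sum of per-group linear Gram matrices.

    groups maps a data type to one feature vector per training example.
    A non-negative weighted sum of valid kernels is itself a valid kernel.
    """
    n = len(next(iter(groups.values())))
    gram = [[0.0] * n for _ in range(n)]
    for name, vectors in groups.items():
        w = weights[name]
        if w < 0:
            raise ValueError("kernel weights must be non-negative")
        for i in range(n):
            for j in range(n):
                gram[i][j] += w * linear_kernel(vectors[i], vectors[j])
    return gram


# Toy features for three candidate regions; all values are invented.
groups = {
    "conservation": [[0.9], [0.2], [0.8]],
    "functional_genomics": [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    "sequence_motifs": [[2.0, 1.0], [0.0, 3.0], [1.0, 1.0]],
}
weights = {"conservation": 0.5, "functional_genomics": 0.3, "sequence_motifs": 0.2}
gram = combined_gram(groups, weights)
```

The resulting matrix could be handed to any SVM implementation that accepts a precomputed kernel, which is how heterogeneous data types end up contributing with different weights.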
EnhancerFinder Performance
AUC = 0.96
Power = 85% at 10% FPR
Recall = 85% at 93% precision
FDR ~10-50%
Significantly better than other methods
>80% in vivo validation rate
Erwin et al. (2014) PLoS Comp Bio
EnhancerFinder Predictions
• 84,301 developmental enhancer predictions
  - Cover 2% of the human genome
  - Nearby genes have high expression and annotated functions in the relevant fetal tissue
  - Significant overlap with disease mutations
  - Cluster around developmental transcription factors and signaling genes
• 239 predictions overlap a Human Accelerated Region (33% of HARs); 25/30 validated in vivo
• Identify sites with fitness effects (Gulko et al. 2015)
[Figure: 239 predicted HAR enhancers: Heart (28), Limb (41), Brain (96)]
Erwin et al. (2014) PLoS Comp Bio; Capra et al. (2014) PTRSB
Massively Parallel Reporter Assays and capture Hi-C for validation
• Test >12,000 170 bp enhancers in parallel
• Quantitative activity assayed via RNA-seq
• Compare genotypes
• Human vs. chimp
• Disease SNPs
[Figure: human and chimp enhancers cloned into a barcoded vector library (enhancer vector with MinP and luciferase), delivered by lentivirus to iPS-derived cells; transcribed barcodes are read out by RNA-seq]
Figure 2. Massively parallel functional testing of HARs. For each HAR, the chimp version and versions with all combinations of human mutations will be shotgun cloned, along with synthetic 20 bp barcodes, into an enhancer assay vector. This vector library will be introduced to human and chimp neuronal differentiated cells via lentivirus. HAR sequences that function as enhancers in these cells will produce barcoded RNA. Each unique HAR sequence will have an average of 90 tags to allow for a quantitative readout. RNA sequencing (RNA-seq) will allow for the measurement of the relative activity of each mutation in the encoded HAR. MinP: minimal promoter; mut: mutations.
Hane Ryu, Nadav Ahituv, Jay Shendure, Yin Shen
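As a rough sketch of the quantitative readout, barcode counts can be pooled per tested sequence and compared between the human and chimp versions. Normalizing RNA barcode counts by DNA (library) barcode counts and adding a pseudocount are assumptions commonly made in reporter-assay analysis; the slides mention only RNA-seq of transcribed barcodes. The sequence IDs, the shortened barcodes, and all counts are invented.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# (sequence_id, barcode, rna_count, dna_count); all values below are invented.
Record = Tuple[str, str, int, int]


def activity_by_sequence(
    records: Iterable[Record], pseudocount: float = 1.0
) -> Dict[str, float]:
    """log2(RNA/DNA) per sequence after summing counts over its barcodes."""
    rna: Dict[str, int] = defaultdict(int)
    dna: Dict[str, int] = defaultdict(int)
    for seq_id, _barcode, rna_count, dna_count in records:
        rna[seq_id] += rna_count
        dna[seq_id] += dna_count
    return {
        seq_id: math.log2((rna[seq_id] + pseudocount) / (dna[seq_id] + pseudocount))
        for seq_id in rna
    }


# Hypothetical element "HARx" with two barcodes per allele (real designs
# use 20 bp barcodes and many more tags per sequence).
records = [
    ("HARx_human", "ACGTACGT", 30, 10),
    ("HARx_human", "TCGTACGC", 33, 5),
    ("HARx_chimp", "AGGTACGT", 7, 8),
    ("HARx_chimp", "ACGTAGGT", 8, 7),
]
activity = activity_by_sequence(records)
human_vs_chimp = activity["HARx_human"] - activity["HARx_chimp"]
```

A positive difference would indicate that the human version drives more reporter expression than the chimp version in the tested cells.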
Induced pluripotent stem cell derived neuronal and cardiac lines
[Figure: chimp and human iPSC-derived lines; human iPSC-derived cardiomyocytes]
Hane Ryu, Alex Pollen, Nadav Ahituv, Arnold Kriegstein, Bruce Conklin
TargetFinder Feature Importance Most predictive features mark the window between the enhancer and the promoter
TargetFinder Feature Importance Most predictive features mark the window between the enhancer and the promoter
TargetFinder Feature Importance
TargetFinder Feature Importance Most useful features for prediction are TF andhistone marks in the window between the enhancer and the promoter
bull True interactions - Enhancer-associated proteins P300 JUN TFs - Marks of heterochromatin lack of DNA methylation - Marks of paused or poised RNA polymerase
TargetFinder Feature Importance Most useful features for prediction are TF andhistone marks in the window between the enhancer and the promoter
bull True interactions - Enhancer-associated proteins P300 JUN TFs - Marks of heterochromatin lack of DNA methylation - Marks of paused or poised RNA polymerase
bull False interactions - Cohesin complex CTCF RAD21 SMC3 ZNF143 - Histone marks of open chromatin and elongation - Marks of active promoters and gene bodies
TargetFinder Feature Importance Most useful features for prediction are TF andhistone marks in the window between the enhancer and the promoter
bull True interactions - Enhancer-associated proteins P300 JUN TFs - Marks of heterochromatin lack of DNA methylation - Marks of paused or poised RNA polymerase
bull False interactions - Cohesin complex CTCF RAD21 SMC3 ZNF143 - Histone marks of open chromatin and elongation - Marks of active promoters and gene bodies
Many ldquowindowrdquo features have a different meaning when marking promoters and enhancers (eg cohesin)
Predictive features colocate and form complexes
K562
Predictive features colocate and form complexes
K562
Can TargetFinder work outsideENCODE cell lines
Can TargetFinder work outsideENCODE cell lines
What is a minimal set of experimentsfor accurate prediction
Can TargetFinder work outsideENCODE cell lines
What is a minimal set of experimentsfor accurate prediction optimal 16+
minimal 8
Can TargetFinder work outsideENCODE cell lines
Can TargetFinder work outsideENCODE cell lines
Test if models generalize across cell types
Can TargetFinder work outsideENCODE cell lines
Test if models generalize across cell types EVALUATE
GM12878 K562 HeLa-S3 HUVEC
GM12878 083 040 043 039
K562 046 085 045 044
HeLa-S3 043 038 088 041
HUVEC 039 040 038 -
TRAIN
Fmax13 values
Can TargetFinder work outsideENCODE cell lines
Test if models generalize across cell types EVALUATE
GM12878 K562 HeLa-S3 HUVEC
GM12878 083 040 043 039
K562 046 085 045 044
HeLa-S3 043 038 088 041
HUVEC 039 040 038 -
TRAIN
Fmax13 values
Expect ~35 precision and 55 recall on anew cell type with ~10 ChIP-seq datasets
Cardiomyopathy associated variant
Gene A Gene B
G A
TargetFinder accurately annotatesenhancer-promoter pairs
A T C C
DNA Gene C
Cardiomyopathy associated variant
Gene A Gene B
G A
TargetFinder accurately annotatesenhancer-promoter pairs
A T C C
DNA Gene C
Massive data integration improves predictionbull Closest gene
- Usually fails to identify the right promoter - Many false positives
Cardiomyopathy associated variant
Gene A Gene B
G A
TargetFinder accurately annotatesenhancer-promoter pairs
A T C C
DNA Gene C
Massive data integration improves predictionbull Closest gene
- Usually fails to identify the right promoter - Many false positives
bull TargetFinder - Identifies 95-90 of known pairs (55 with less data) - Few false positives
Which human genome sequences function as long-range enhancers
13
13
13
EnhancerFinder Training
Teach a machine learning algorithm to identify developmental enhancers active in different tissues based on their features
Training Data
• VISTA Browser: positives (+) and negatives (−)
Features
• Evolutionary Conservation
 - Human, Chimp, Mouse, Rat
• Functional Genomics
 - ChIP-seq (TFs, histones)
 - DNase Hypersensitivity
 - ENCODE
 - Epigenomics Roadmap
 - Bench-to-Bassinet
• DNA Sequence Motifs
 - Short k-mers (AAAA, AAAC, AAAG, AAAT, AACA, AACC, AACG, AACT, AAGA, AAGC, AAGG, AAGT, AATA, AATC, AATG, AATT, ACAA, ACAC, ACAG, ACAT, ...)
 - Known TF motifs
Computational Algorithm
• Support vector machine separates 2 groups
• Multi-kernel: good for combining heterogeneous data types with different weights
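The multi-kernel idea can be illustrated with a small sketch: each data type gets its own kernel, and the combined similarity is a weighted sum. Everything here is an assumption for illustration: the linear kernels, the weights (which a multiple kernel learning method would fit rather than fix by hand), and the toy feature values are not taken from EnhancerFinder itself. The 4-mer ordering matches the list on the slide.

```python
from collections import Counter
from itertools import product


def kmer_counts(seq, k=4):
    """Count every DNA k-mer in seq, ordered AAAA, AAAC, AAAG, AAAT, ... for k=4."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return [counts["".join(p)] for p in product("ACGT", repeat=k)]


def linear_kernel(x, y):
    """Dot product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(x, y))


def combined_kernel(sample_a, sample_b, weights):
    """Weighted sum of one linear kernel per data type."""
    return sum(w * linear_kernel(sample_a[name], sample_b[name])
               for name, w in weights.items())


# Two hypothetical regions with toy feature values.
region_a = {
    "conservation": [0.9, 0.8],
    "functional": [1.0, 0.0, 1.0],
    "kmers": kmer_counts("AAAAC"),
}
region_b = {
    "conservation": [0.5, 0.5],
    "functional": [1.0, 1.0, 0.0],
    "kmers": kmer_counts("AAAAA"),
}
# Hypothetical kernel weights; in practice these would be learned.
weights = {"conservation": 0.5, "functional": 0.3, "kmers": 0.2}

similarity = combined_kernel(region_a, region_b, weights)
print(similarity)
```

A kernel matrix built this way for all training pairs could then be handed to any SVM solver that accepts precomputed kernels; that step is not shown here.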
EnhancerFinder Performance
• AUC = 0.96
• Power = 85% at 10% FPR
• Recall = 85% at 93% precision
• FDR ~10-50%
• Significantly better than other methods
• >80% in vivo validation rate
Erwin et al. (2014) PLoS Comp Bio
EnhancerFinder Predictions
• 84,301 developmental enhancer predictions
 - Cover 2% of the human genome
 - Nearby genes have high expression and annotated functions in the relevant fetal tissue
 - Significant overlap with disease mutations
 - Cluster around developmental transcription factors and signaling genes
• 239 predictions overlap a Human Accelerated Region (33% of HARs); 25/30 validated in vivo
• Identify sites with fitness effects (Gulko et al. 2015)
[Figure: 239 predicted HAR enhancers: Heart (28), Limb (41), Brain (96)]
Erwin et al. (2014) PLoS Comp Bio; Capra et al. (2014) PTRSB
Massively Parallel Reporter Assays and capture Hi-C for validation
• Test >12,000 170 bp enhancers in parallel
• Quantitative activity assayed via RNA-seq
• Compare genotypes
 - Human vs chimp
 - Disease SNPs
Figure 2. Massively parallel functional testing of HARs. For each HAR, the chimp version and versions with all combinations of human mutations will be shotgun cloned, along with synthetic 20 bp barcodes, into an enhancer assay vector. This vector library will be introduced to human and chimp neuronal differentiated cells via lentivirus. HAR sequences that function as enhancers in these cells will produce barcoded RNA. Each unique HAR sequence will have an average of 90 tags to allow for a quantitative readout. RNA sequencing (RNA-seq) will allow for the measurement of the relative activity of each mutation in the encoded HAR. MinP: minimal promoter; mut: mutations.
[Figure labels: HARs and enhancers (human, chimp); enhancer vector with MinP and luciferase; vector library; lentivirus; iPS derived neuronal differentiated cells; transcribed barcodes; RNA-seq]
Hane Ryu, Nadav Ahituv, Jay Shendure, Yin Shen
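The quantitative readout described in the caption can be sketched as follows: barcode counts are pooled per tested sequence and turned into an activity score. This is a hedged illustration, not the lab's pipeline. Normalizing RNA counts by DNA (plasmid) counts and adding a pseudocount are common MPRA conventions assumed here, and the barcodes, sequence names, and counts are invented.

```python
from collections import defaultdict
from math import log2


def activity_per_sequence(barcode_to_seq, rna_counts, dna_counts, pseudocount=1.0):
    """Pool barcode counts per sequence and return log2(RNA / DNA) activity."""
    rna = defaultdict(float)
    dna = defaultdict(float)
    for barcode, seq in barcode_to_seq.items():
        rna[seq] += rna_counts.get(barcode, 0)
        dna[seq] += dna_counts.get(barcode, 0)
    return {seq: log2((rna[seq] + pseudocount) / (dna[seq] + pseudocount))
            for seq in dna}


# Hypothetical barcodes and counts (real libraries use ~90 barcodes per sequence).
barcode_to_seq = {
    "AGGT": "HAR_human", "ACGT": "HAR_human",
    "TCGT": "HAR_chimp", "ACGC": "HAR_chimp",
}
rna_counts = {"AGGT": 30, "ACGT": 33, "TCGT": 10, "ACGC": 5}
dna_counts = {"AGGT": 15, "ACGT": 16, "TCGT": 15, "ACGC": 16}

activity = activity_per_sequence(barcode_to_seq, rna_counts, dna_counts)
human_vs_chimp = activity["HAR_human"] - activity["HAR_chimp"]
print(activity, human_vs_chimp)
```

Pooling many barcodes per sequence is what makes the comparison between human and chimp versions quantitative rather than dependent on a single tag.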
Induced pluripotent stem cell derived neuronal and cardiac lines
[Figure: chimp and human cell images; human iPSC derived cardiomyocytes]
Hane Ryu, Alex Pollen, Nadav Ahituv, Arnold Kriegstein, Bruce Conklin
Discovering the role of enhancers in human biology and evolution
Computational characterization
iPSC based screening
In vivo molecular studies
Collaborators
EnhancerFinder: Gen Haliburton, Tony Capra, Dennis Kostka, John Rubenstein
TargetFinder: Sean Whalen, Rebecca Truty, Benoit Bruneau
MotifDiverge: Dennis Kostka, Tara Friedrich, Deb Ritter, Jeff Chuang
Funding from NHLBI, PhRMA Foundation, Gladstone Institutes
TargetFinder Feature Importance
Most useful features for prediction are TF and histone marks in the window between the enhancer and the promoter
• True interactions
 - Enhancer-associated proteins: P300, JUN, TFs
 - Marks of heterochromatin, lack of DNA methylation
 - Marks of paused or poised RNA polymerase
• False interactions
 - Cohesin complex: CTCF, RAD21, SMC3, ZNF143
 - Histone marks of open chromatin and elongation
 - Marks of active promoters and gene bodies
Many "window" features have a different meaning when marking promoters and enhancers (e.g. cohesin)
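To make the "window" notion concrete, the sketch below splits a candidate pair into three regions (enhancer, promoter, and the window between them) and counts overlapping peaks per assay in each. This is an assumption-laden illustration: peak counts are only one plausible way to summarize a region, intervals are half-open (start, end) on one chromosome, and the coordinates and peak lists are invented.

```python
def window_between(enhancer, promoter):
    """Interval strictly between two regions, or None if they touch or overlap."""
    left, right = sorted([enhancer, promoter])
    if left[1] >= right[0]:
        return None
    return (left[1], right[0])


def count_overlaps(region, peaks):
    """Number of peaks overlapping a half-open (start, end) region."""
    start, end = region
    return sum(1 for p_start, p_end in peaks if p_start < end and p_end > start)


def pair_features(enhancer, promoter, peaks_by_assay):
    """Per-assay peak counts for the enhancer, the window, and the promoter."""
    window = window_between(enhancer, promoter)
    features = {}
    for assay, peaks in peaks_by_assay.items():
        features[assay + "_enhancer"] = count_overlaps(enhancer, peaks)
        features[assay + "_window"] = count_overlaps(window, peaks) if window else 0
        features[assay + "_promoter"] = count_overlaps(promoter, peaks)
    return features


# Hypothetical coordinates and peaks, for illustration only.
enhancer = (1000, 1500)
promoter = (9000, 9500)
peaks_by_assay = {
    "P300": [(1100, 1200), (4000, 4100)],
    "RAD21": [(5000, 5200), (9100, 9200)],
}
features = pair_features(enhancer, promoter, peaks_by_assay)
print(features)
```

Keeping the same assay as three separate features is what lets a model assign it a different meaning in the window than at the promoter or enhancer, as the slide notes for cohesin.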
Predictive features colocate and form complexes
K562
Can TargetFinder work outsideENCODE cell lines
Can TargetFinder work outsideENCODE cell lines
What is a minimal set of experimentsfor accurate prediction
Can TargetFinder work outsideENCODE cell lines
What is a minimal set of experimentsfor accurate prediction optimal 16+
minimal 8
Can TargetFinder work outsideENCODE cell lines
Can TargetFinder work outsideENCODE cell lines
Test if models generalize across cell types
Can TargetFinder work outsideENCODE cell lines
Test if models generalize across cell types EVALUATE
GM12878 K562 HeLa-S3 HUVEC
GM12878 083 040 043 039
K562 046 085 045 044
HeLa-S3 043 038 088 041
HUVEC 039 040 038 -
TRAIN
Fmax13 values
Can TargetFinder work outsideENCODE cell lines
Test if models generalize across cell types EVALUATE
GM12878 K562 HeLa-S3 HUVEC
GM12878 083 040 043 039
K562 046 085 045 044
HeLa-S3 043 038 088 041
HUVEC 039 040 038 -
TRAIN
Fmax13 values
Expect ~35 precision and 55 recall on anew cell type with ~10 ChIP-seq datasets
Cardiomyopathy associated variant
Gene A Gene B
G A
TargetFinder accurately annotatesenhancer-promoter pairs
A T C C
DNA Gene C
Cardiomyopathy associated variant
Gene A Gene B
G A
TargetFinder accurately annotatesenhancer-promoter pairs
A T C C
DNA Gene C
Massive data integration improves predictionbull Closest gene
- Usually fails to identify the right promoter - Many false positives
Cardiomyopathy associated variant
Gene A Gene B
G A
TargetFinder accurately annotatesenhancer-promoter pairs
A T C C
DNA Gene C
Massive data integration improves predictionbull Closest gene
- Usually fails to identify the right promoter - Many false positives
bull TargetFinder - Identifies 95-90 of known pairs (55 with less data) - Few false positives
Which human genome sequences function as long-range enhancers
13
13
13
EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features
Training Data
VISTA 13 Browser
minus
+
13
Training Data
+VISTA 13 Browser
EvoluDonary ConservaDon Human
ChimpFeatures Mouse
Rat
EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features
minus
13 13
13 13 13 13 13
EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features
Training Data
+VISTA 13 Browser
minus
EvoluDonary ConservaDon FuncDonal Genomics Human
ChimpFeatures Mouse
Rat
bullChIP-shy‐seq (TFs histones) bullDNase HypersensiDvity bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13
13 13 13 13
13 13 13 13 13
13 13 13
EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features
Training Data
+VISTA 13 Browser
minus
EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13
AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13
ACAAACACACAGACAT13Rat
bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13
13 13 13 13
13 13 13 13 13
13 13 13
Training Data
+VISTA 13 Browser
minus
Computa6onal Algorithm
Support vector machine separates 2 groups 13 Multi-kernel good for combining heterogeneous data types with different weights
EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features
EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13
AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13
ACAAACACACAGACAT13Rat
bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13
EnhancerFinder Performance
AUC=096 Power=85 at 10 FPR Recall=85 at 93 Precision 13 FDR ~10-50 Significantly better than other methods 13 gt80 in vivo validation rate
Erwin et al (2014) PLoS Comp Bio
EnhancerFinder Predictions
239 Predicted HAR13 Enhancers
Heart13 (28)
Limb13 (41) Brain (96)
Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB
EnhancerFinder Predictions
bull 84301 developmental enhancer predictions - Cover 2 of the human genome
239 Predicted HAR13 Enhancers
Heart13 (28)
- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes
Limb13 (41) Brain (96)
Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB
EnhancerFinder Predictions
bull 84301 developmental enhancer predictions - Cover 2 of the human genome
239 Predicted HAR13 Enhancers
Heart13 (28)
- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes
Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo
Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB
EnhancerFinder Predictions
• 84,301 developmental enhancer predictions
  - Cover 2% of the human genome
  - Nearby genes have high expression and annotated functions in the relevant fetal tissue
  - Significant overlap with disease mutations
  - Cluster around developmental transcription factors and signaling genes
• 239 predictions overlap a Human Accelerated Region (33% of HARs); 25/30 validated in vivo
• Identify sites with fitness effects (Gulko et al. 2015)
[Figure: 239 predicted HAR enhancers: Heart (28), Limb (41), Brain (96)]
Erwin et al. (2014) PLoS Comp Bio; Capra et al. (2014) PTRSB
HARs
Figure 2. Massively parallel functional testing of HARs. For each HAR, the chimp version and versions with all combinations of human mutations will be shotgun cloned, along with synthetic 20-bp barcodes, into an enhancer assay vector. This vector library will be introduced to human and chimp neuronal differentiated cells via lentivirus. HAR sequences that function as enhancers in these cells will produce barcoded RNA. Each unique HAR sequence will have an average of 90 tags to allow for a quantitative readout. RNA sequencing (RNA-seq) will allow for the measurement of the relative activity of each mutation in the encoded HAR. MinP, minimal promoter; mut, mutations.
Massively Parallel Reporter Assays and capture Hi-C for validation
• Test >12,000 170-bp enhancers in parallel
• Quantitative activity assayed via RNA-seq
• Compare genotypes
• Human vs. chimp
• Disease SNPs
[Figure: human and chimp enhancers cloned into a vector library (enhancer, MinP, luciferase, barcode), delivered by lentivirus into iPS-derived neuronal differentiated cells; transcribed barcodes are read out by RNA-seq]
Hane Ryu, Nadav Ahituv, Jay Shendure, Yin Shen
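One common way to turn barcode counts into a quantitative activity score is sketched below. It assumes that each sequence's RNA barcode counts are normalized by matching DNA (library) barcode counts and that a pseudocount of 1 is added; both are assumptions for illustration, since the slide only states that activity is assayed via RNA-seq. All counts and the name `enhancer_activity` are hypothetical.

```python
import math


def enhancer_activity(rna_counts, dna_counts, pseudocount=1.0):
    """Summarize one sequence as log2(total RNA barcodes / total DNA barcodes)."""
    rna_total = sum(rna_counts) + pseudocount
    dna_total = sum(dna_counts) + pseudocount
    return math.log2(rna_total / dna_total)


# Hypothetical per-barcode counts for the human and chimp versions of one enhancer.
human_activity = enhancer_activity(rna_counts=[30, 33], dna_counts=[15, 16])
chimp_activity = enhancer_activity(rna_counts=[15, 16], dna_counts=[15, 16])

# Positive values mean the human version drives more barcoded RNA than the chimp version.
print(human_activity - chimp_activity)
```

Comparing genotypes (human vs. chimp, or reference vs. disease SNP) then reduces to a difference of two log ratios per enhancer.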
Induced pluripotent stem cell derived neuronal and cardiac lines
[Figure: chimp and human cell lines; human iPSC-derived cardiomyocytes]
Hane Ryu, Alex Pollen, Nadav Ahituv, Arnold Kriegstein, Bruce Conklin
Discovering the role of enhancers in human biology and evolution
• Computational characterization
• iPSC-based screening
• In vivo molecular studies
[Figure: sequence variants (G/A, T/C) near Gene B and Gene C]
Collaborators
EnhancerFinder: Gen Haliburton, Tony Capra, Dennis Kostka, John Rubenstein
TargetFinder: Sean Whalen, Rebecca Truty, Benoit Bruneau
MotifDiverge: Dennis Kostka, Tara Friedrich, Deb Ritter, Jeff Chuang
Funding from NHLBI, PhRMA Foundation, Gladstone Institutes
TargetFinder Feature Importance
Most useful features for prediction are TF and histone marks in the window between the enhancer and the promoter
• True interactions
  - Enhancer-associated proteins: P300, JUN, TFs
  - Marks of heterochromatin, lack of DNA methylation
  - Marks of paused or poised RNA polymerase
• False interactions
  - Cohesin complex, CTCF, RAD21, SMC3, ZNF143
  - Histone marks of open chromatin and elongation
  - Marks of active promoters and gene bodies
Many "window" features have a different meaning when marking promoters and enhancers (e.g., cohesin)
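The idea of scoring the same mark separately at the enhancer, at the promoter, and in the window between them can be sketched as follows. This is a simplified illustration that counts overlapping peaks per region; the interval conventions (half-open coordinates on a single chromosome), the function names, and the coordinates are assumptions, not the TargetFinder implementation.

```python
def count_overlaps(peaks, start, end):
    """Count peaks (start, end) that overlap the half-open interval [start, end)."""
    return sum(1 for peak_start, peak_end in peaks
               if peak_start < end and peak_end > start)


def region_features(enhancer, promoter, peaks):
    """Count peaks of one mark at the enhancer, the promoter, and the window between."""
    first, second = sorted([enhancer, promoter])
    window_start, window_end = first[1], second[0]
    window_count = 0
    if window_start < window_end:  # no window if the two regions touch or overlap
        window_count = count_overlaps(peaks, window_start, window_end)
    return {
        "enhancer": count_overlaps(peaks, *enhancer),
        "promoter": count_overlaps(peaks, *promoter),
        "window": window_count,
    }


# Hypothetical coordinates for one enhancer-promoter pair and one ChIP-seq peak set.
example_peaks = [(1100, 1200), (4000, 4100), (5000, 5200), (9400, 9600)]
pair = region_features(enhancer=(1000, 1500), promoter=(9000, 9500), peaks=example_peaks)
print(pair)
```

Keeping the three counts as separate features lets a classifier assign the same mark a different weight, or even a different sign, depending on where it occurs.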
Predictive features colocate and form complexes
K562
Can TargetFinder work outside ENCODE cell lines?
What is a minimal set of experiments for accurate prediction?
Optimal: 16+
Minimal: 8
Can TargetFinder work outside ENCODE cell lines?
Test if models generalize across cell types
Fmax values (rows: TRAIN; columns: EVALUATE)
            GM12878   K562   HeLa-S3   HUVEC
GM12878      0.83     0.40    0.43     0.39
K562         0.46     0.85    0.45     0.44
HeLa-S3      0.43     0.38    0.88     0.41
HUVEC        0.39     0.40    0.38      -
Expect ~35% precision and 55% recall on a new cell type with ~10 ChIP-seq datasets
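To connect the precision and recall quoted here with the Fmax values in the table, the sketch below computes the F-measure (harmonic mean of precision and recall) and its maximum over score thresholds. It assumes that Fmax means the largest F-measure over all thresholds, which the slide does not spell out; the scores and labels are invented.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def f_max(scores, labels):
    """Largest F-measure over all thresholds of the form score >= t."""
    best = 0.0
    for threshold in sorted(set(scores)):
        predicted = [score >= threshold for score in scores]
        tp = sum(1 for p, y in zip(predicted, labels) if p and y)
        fp = sum(1 for p, y in zip(predicted, labels) if p and not y)
        fn = sum(1 for p, y in zip(predicted, labels) if not p and y)
        if tp == 0:
            continue  # precision and recall are both zero at this threshold
        best = max(best, f_measure(tp / (tp + fp), tp / (tp + fn)))
    return best


# F-measure implied by ~35% precision and 55% recall.
print(round(f_measure(0.35, 0.55), 3))  # 0.428

# Toy classifier scores with true (1) and false (0) interactions.
toy_fmax = f_max(scores=[0.9, 0.8, 0.4, 0.3], labels=[1, 0, 1, 0])
print(toy_fmax)
```

An F-measure of about 0.43 is in line with the off-diagonal (cross-cell-type) entries of the table.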
TargetFinder accurately annotates enhancer-promoter pairs
[Figure: DNA with a cardiomyopathy-associated variant and candidate target genes A, B, and C]
Massive data integration improves prediction
• Closest gene
  - Usually fails to identify the right promoter
  - Many false positives
• TargetFinder
  - Identifies 95-90% of known pairs (55% with less data)
  - Few false positives
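For reference, the closest-gene baseline being compared against can be written in a few lines. This sketch assigns an enhancer to the gene with the nearest transcription start site (TSS) on the same chromosome; the coordinates and the helper name `closest_gene` are hypothetical.

```python
def closest_gene(enhancer_midpoint, tss_by_gene):
    """Return the gene whose TSS is nearest to the enhancer midpoint."""
    return min(tss_by_gene, key=lambda gene: abs(tss_by_gene[gene] - enhancer_midpoint))


# Hypothetical TSS positions for the three genes drawn on the slide.
tss_positions = {"Gene A": 120_000, "Gene B": 180_000, "Gene C": 400_000}

# An enhancer at 200 kb is assigned to Gene B purely by linear distance,
# even if its 3D contact is with a more distant promoter such as Gene C.
print(closest_gene(200_000, tss_positions))
```

The baseline uses no functional genomics data at all, which is why it cannot recover long-range targets that skip over nearer genes.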
Which human genome sequences function as long-range enhancers?
EnhancerFinder Training
Teach a machine learning algorithm to identify developmental enhancers active in different tissues based on their features
Training Data
• VISTA Browser: positive (+) and negative (−) examples
Features
• Evolutionary Conservation: Human, Chimp, Mouse, Rat
• Functional Genomics: ChIP-seq (TFs, histones), DNase hypersensitivity (ENCODE, Epigenomics Roadmap, Bench-to-Bassinet)
• DNA Sequence Motifs: short k-mers (e.g., AAAA, AAAC, AAAG, AAAT, ..., ACAT), known TF motifs
Computational Algorithm
• Support vector machine separates 2 groups
• Multi-kernel: good for combining heterogeneous data types with different weights
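The multi-kernel idea can be sketched as a weighted sum of one kernel per data type. The example below shows only the kernel-combination step with linear kernels and hand-picked weights; in multiple kernel learning the weights are fitted together with the SVM, and the feature values, weights, and function names here are illustrative assumptions rather than EnhancerFinder's settings.

```python
def linear_kernel(x, y):
    """Dot product between two equal-length feature vectors."""
    return sum(a * b for a, b in zip(x, y))


def combined_kernel(sample_a, sample_b, weights):
    """Weighted sum of per-data-type linear kernels.

    With non-negative weights, a sum of valid kernels is itself a valid kernel,
    so the result can be passed to any kernel-based classifier such as an SVM.
    """
    return sum(weight * linear_kernel(sample_a[name], sample_b[name])
               for name, weight in weights.items())


# Two hypothetical candidate regions, each described by three feature groups.
region_1 = {"conservation": [0.9], "functional": [1, 0, 1], "motifs": [2, 0]}
region_2 = {"conservation": [0.5], "functional": [1, 1, 0], "motifs": [1, 3]}

# Hand-picked weights standing in for learned kernel weights.
kernel_weights = {"conservation": 0.5, "functional": 0.3, "motifs": 0.2}

similarity = combined_kernel(region_1, region_2, kernel_weights)
print(similarity)
```

Because each data type keeps its own kernel, conservation scores, ChIP-seq signals, and k-mer counts never need to be forced onto a common scale before they are combined.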
EnhancerFinder Performance
AUC=096 Power=85 at 10 FPR Recall=85 at 93 Precision 13 FDR ~10-50 Significantly better than other methods 13 gt80 in vivo validation rate
Erwin et al (2014) PLoS Comp Bio
EnhancerFinder Predictions
239 Predicted HAR13 Enhancers
Heart13 (28)
Limb13 (41) Brain (96)
Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB
EnhancerFinder Predictions
bull 84301 developmental enhancer predictions - Cover 2 of the human genome
239 Predicted HAR13 Enhancers
Heart13 (28)
- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes
Limb13 (41) Brain (96)
Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB
EnhancerFinder Predictions
bull 84301 developmental enhancer predictions - Cover 2 of the human genome
239 Predicted HAR13 Enhancers
Heart13 (28)
- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes
Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo
Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB
EnhancerFinder Predictions
bull 84301 developmental enhancer predictions - Cover 2 of the human genome
239 Predicted HAR13 Enhancers
Heart13 (28)
- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes
Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo
bull Identify sites with fitness effects (Gulko et al 2015) Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB
HARs
Figure 2 Massively parallel functional testing of HARs For each HAR the chimp version and versions with all combinations of human mutations will be shotgun cloned along with synthetic 20bp barcodes into an enhancer assay vector This vector library will be introduced to human and chimp neuronal differentiated cells via lentivirus HAR sequences that function as enhancers in these cells will produce barcoded RNA Each unique HAR sequence will have an average of 90 tags to allow for a quantitative readout RNA sequencing (RNA-seq) will allow for the measurement of the relative activity of each mutation in the encoded HAR MinP (minimal promoter) mut (mutations)
Neuronal differentiated cells
Massively Parallel Reporter Assaysand capture Hi-C for validation
AGGT
MinP Luciferase
ACGT ACGT
TCGT ACGC
Enhancer vector
Transcribed barcodes
RNA-seq
Lentivirus
Text
bullTest gt12000 170bp enhancers in parallel bull Quantitative activity assayed via RNA-seq bull Compare genotypes bull Human vs chimp bull Disease SNPs
iPS derived cells
Enhancers Human
Chimp
Vector library
Hane Ryu Nadav Ahituv Jay Shendure Yin Shen
Induced pluripotent stem cell derivedneuronal and cardiac lines
Chi
mp
Hum
an
Human iPSC derived cardiomyocytes
Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin
Induced pluripotent stem cell derivedneuronal and cardiac lines
Chi
mp
Hum
an
Human iPSC derived cardiomyocytes
Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin
Discovering the role of enhancers inhuman biology and evolution
G A
Discovering the role of enhancers inhuman biology and evolution
T Computational13 C Characterization
Gene B Gene C
G A
Discovering the role of enhancers inhuman biology and evolution
T Computational13 C Characterization
Gene B Gene C
iPSC based screening
G A
Discovering the role of enhancers inhuman biology and evolution
T Computational13 C Characterization
Gene B Gene C
iPSC based screening In vivo molecular studies
Collaborators
EnhancerFinder Gen Haliburton Tony Capra Dennis Kostka John Rubenstein
TargetFinder Sean Whalen Rebecca Truty Benoit Bruneau
MotifDiverge Dennis Kostka Tara Friedrich Deb Ritter Jeff Chuang
Funding from NHLBI PhRMA Foundation Gladstone Institutes
Predictive features colocate and form complexes
K562
Can TargetFinder work outsideENCODE cell lines
Can TargetFinder work outsideENCODE cell lines
What is a minimal set of experimentsfor accurate prediction
Can TargetFinder work outsideENCODE cell lines
What is a minimal set of experimentsfor accurate prediction optimal 16+
minimal 8
Can TargetFinder work outsideENCODE cell lines
Can TargetFinder work outsideENCODE cell lines
Test if models generalize across cell types
Can TargetFinder work outsideENCODE cell lines
Test if models generalize across cell types EVALUATE
GM12878 K562 HeLa-S3 HUVEC
GM12878 083 040 043 039
K562 046 085 045 044
HeLa-S3 043 038 088 041
HUVEC 039 040 038 -
TRAIN
Fmax13 values
Can TargetFinder work outsideENCODE cell lines
Test if models generalize across cell types EVALUATE
GM12878 K562 HeLa-S3 HUVEC
GM12878 083 040 043 039
K562 046 085 045 044
HeLa-S3 043 038 088 041
HUVEC 039 040 038 -
TRAIN
Fmax13 values
Expect ~35 precision and 55 recall on anew cell type with ~10 ChIP-seq datasets
Cardiomyopathy associated variant
Gene A Gene B
G A
TargetFinder accurately annotatesenhancer-promoter pairs
A T C C
DNA Gene C
Cardiomyopathy associated variant
Gene A Gene B
G A
TargetFinder accurately annotatesenhancer-promoter pairs
A T C C
DNA Gene C
Massive data integration improves predictionbull Closest gene
- Usually fails to identify the right promoter - Many false positives
Cardiomyopathy associated variant
Gene A Gene B
G A
TargetFinder accurately annotatesenhancer-promoter pairs
A T C C
DNA Gene C
Massive data integration improves predictionbull Closest gene
- Usually fails to identify the right promoter - Many false positives
bull TargetFinder - Identifies 95-90 of known pairs (55 with less data) - Few false positives
Which human genome sequences function as long-range enhancers?

EnhancerFinder Training
Teach a machine learning algorithm to identify developmental enhancers active in different tissues based on their features
Training Data
VISTA Browser: positive (+) and negative (-) examples
Features
Evolutionary Conservation: Human, Chimp, Mouse, Rat
Functional Genomics
• ChIP-seq (TFs, histones)
• DNase Hypersensitivity
• ENCODE
• Epigenomics Roadmap
• Bench-to-Bassinet
DNA Sequence Motifs
• short k-mers (AAAA, AAAC, AAAG, AAAT, AACA, AACC, AACG, AACT, ...)
• known TF motifs
Computational Algorithm
• Support vector machine: separates 2 groups
• Multi-kernel: good for combining heterogeneous data types with different weights
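The k-mer features and the multi-kernel idea can be illustrated with a small sketch. This is not EnhancerFinder's code: the function names (`kmer_counts`, `linear_kernel`, `combine_kernels`), the choice of k = 4, the linear kernel, the toy inputs, and the fixed weights are all assumptions made for illustration. In multiple kernel learning the weights would be learned from the training data rather than set by hand.

```python
from itertools import product

def kmer_counts(seq, k=4):
    """Count every DNA k-mer in seq; the vector is in lexicographic k-mer order."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = [0] * len(kmers)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])  # windows with non-ACGT letters are skipped
        if j is not None:
            counts[j] += 1
    return counts

def linear_kernel(x, y):
    """Dot product of two feature vectors."""
    return sum(a * b for a, b in zip(x, y))

def combine_kernels(kernel_values, weights):
    """Weighted sum of kernel values, one per data type."""
    return sum(w * kv for w, kv in zip(weights, kernel_values))

# Hypothetical inputs for two genomic regions (illustrative only)
a_seq, b_seq = "AAAAC", "AAAAG"
a_cons, b_cons = [0.9, 0.8], [0.7, 0.6]   # made-up conservation scores

k_seq = linear_kernel(kmer_counts(a_seq), kmer_counts(b_seq))
k_cons = linear_kernel(a_cons, b_cons)
k_total = combine_kernels([k_seq, k_cons], weights=[0.5, 0.5])
```

Computing one kernel per data type keeps sequence counts and conservation scores on separate scales until they are combined, so a single weight controls how much each data type contributes.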
EnhancerFinder Performance
• AUC = 0.96
• Power = 85% at 10% FPR
• Recall = 85% at 93% Precision
• FDR ~10-50%
• Significantly better than other methods
• >80% in vivo validation rate
Erwin et al. (2014) PLoS Comp Bio
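"Power at 10% FPR" is the fraction of true enhancers recovered when the score threshold is chosen so that at most 10% of negatives are called positive. A minimal sketch with made-up scores follows; the function name `power_at_fpr` and the toy data are hypothetical and are not taken from Erwin et al.

```python
def power_at_fpr(scores, labels, max_fpr=0.10):
    """True positive rate at the most permissive threshold with FPR <= max_fpr.

    scores: classifier scores (higher = more enhancer-like)
    labels: 1 for known enhancers, 0 for negatives
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best_tpr = 0.0
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        if fp / n_neg <= max_fpr:
            best_tpr = max(best_tpr, tp / n_pos)
    return best_tpr

# Toy data: 4 positives followed by 10 negatives
scores = [0.9, 0.8, 0.7, 0.15, 0.75, 0.2, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
power = power_at_fpr(scores, labels, max_fpr=0.10)
```

On this toy data three of the four positives score above all but one negative, so the power at 10% FPR is 0.75.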
EnhancerFinder Predictions
• 84,301 developmental enhancer predictions
- Cover 2% of the human genome
- Nearby genes have high expression and annotated functions in the relevant fetal tissue
- Significant overlap with disease mutations
- Cluster around developmental transcription factors and signaling genes
• 239 predictions overlap a Human Accelerated Region (33% of HARs); 25/30 validated in vivo
• Identify sites with fitness effects (Gulko et al. 2015)
[Figure: 239 predicted HAR enhancers; Heart (28), Limb (41), Brain (96)]
Erwin et al. (2014) PLoS Comp Bio; Capra et al. (2014) PTRSB
Figure 2. Massively parallel functional testing of HARs. For each HAR, the chimp version and versions with all combinations of human mutations will be shotgun cloned, along with synthetic 20 bp barcodes, into an enhancer assay vector. This vector library will be introduced to human and chimp neuronal differentiated cells via lentivirus. HAR sequences that function as enhancers in these cells will produce barcoded RNA. Each unique HAR sequence will have an average of 90 tags to allow for a quantitative readout. RNA sequencing (RNA-seq) will allow for the measurement of the relative activity of each mutation in the encoded HAR. MinP: minimal promoter; mut: mutations.
Massively Parallel Reporter Assays and capture Hi-C for validation
• Test >12,000 170 bp enhancers in parallel
• Quantitative activity assayed via RNA-seq
• Compare genotypes
- Human vs chimp
- Disease SNPs
[Figure labels: Enhancers (Human, Chimp), Vector library, Enhancer vector (MinP, Luciferase), Lentivirus, iPS derived cells, Neuronal differentiated cells, Transcribed barcodes, RNA-seq]
Hane Ryu, Nadav Ahituv, Jay Shendure, Yin Shen
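The quantitative readout can be sketched as barcode counting. The sketch below assumes a barcode-to-enhancer lookup table and normalizes RNA barcode counts by DNA (library) counts. That normalization is a common MPRA convention but is an assumption here, since the slide only says that activity is assayed via RNA-seq. The enhancer names, the 4-letter toy barcodes (the figure caption describes 20 bp barcodes), and all counts are hypothetical.

```python
import math
from collections import defaultdict

def enhancer_activity(barcode_to_enhancer, rna_counts, dna_counts, pseudocount=1.0):
    """Sum counts over all barcodes of each enhancer and return log2(RNA/DNA)."""
    rna = defaultdict(float)
    dna = defaultdict(float)
    for barcode, enhancer in barcode_to_enhancer.items():
        rna[enhancer] += rna_counts.get(barcode, 0)
        dna[enhancer] += dna_counts.get(barcode, 0)
    # The pseudocount avoids division by zero for unobserved barcodes
    return {
        e: math.log2((rna[e] + pseudocount) / (dna[e] + pseudocount))
        for e in dna
    }

# Hypothetical library: two barcodes per enhancer variant
barcode_to_enhancer = {
    "ACGT": "HARx_human", "TCGT": "HARx_human",
    "AGGT": "HARx_chimp", "ACGC": "HARx_chimp",
}
rna_counts = {"ACGT": 40, "TCGT": 23, "AGGT": 10, "ACGC": 5}
dna_counts = {"ACGT": 8, "TCGT": 7, "AGGT": 9, "ACGC": 6}

activity = enhancer_activity(barcode_to_enhancer, rna_counts, dna_counts)
human_vs_chimp = activity["HARx_human"] - activity["HARx_chimp"]
```

Pooling many barcodes per sequence is what makes the readout quantitative: noise in any single barcode is averaged out before the human and chimp versions are compared.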
Induced pluripotent stem cell derived neuronal and cardiac lines
[Figure panels: Chimp, Human; Human iPSC derived cardiomyocytes]
Hane Ryu, Alex Pollen, Nadav Ahituv, Arnold Kriegstein, Bruce Conklin
Discovering the role of enhancers in human biology and evolution
Computational Characterization
iPSC based screening
In vivo molecular studies
Collaborators
EnhancerFinder: Gen Haliburton, Tony Capra, Dennis Kostka, John Rubenstein
TargetFinder: Sean Whalen, Rebecca Truty, Benoit Bruneau
MotifDiverge: Dennis Kostka, Tara Friedrich, Deb Ritter, Jeff Chuang
Funding from: NHLBI, PhRMA Foundation, Gladstone Institutes
Can TargetFinder work outside ENCODE cell lines?
What is a minimal set of experiments for accurate prediction?
• optimal: 16+
• minimal: 8
Test if models generalize across cell types
Fmax values (rows: TRAIN, columns: EVALUATE)
            GM12878   K562   HeLa-S3   HUVEC
GM12878     0.83      0.40   0.43      0.39
K562        0.46      0.85   0.45      0.44
HeLa-S3     0.43      0.38   0.88      0.41
HUVEC       0.39      0.40   0.38      -
Expect ~35% precision and 55% recall on a new cell type with ~10 ChIP-seq datasets
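Fmax is the largest F-measure obtained over all score thresholds, where the F-measure is the harmonic mean of precision and recall, F = 2PR / (P + R). The sketch below assumes the balanced (F1) form. It checks that the quoted ~35% precision and 55% recall give an F of about 0.43, in line with the cross-cell-type values in the table. The function names are illustrative.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def f_max(scores, labels):
    """Largest F1 over all thresholds (predict positive if score >= t).

    Assumes labels contains at least one positive (1).
    """
    n_pos = sum(labels)
    best = 0.0
    for t in set(scores):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        predicted = sum(1 for s in scores if s >= t)  # always >= 1
        best = max(best, f_measure(tp / predicted, tp / n_pos))
    return best

expected_f = f_measure(0.35, 0.55)   # about 0.43
```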
TargetFinder accurately annotates enhancer-promoter pairs
[Figure: DNA with Gene A, Gene B, Gene C and a cardiomyopathy associated variant]
Massive data integration improves prediction
• Closest gene
- Usually fails to identify the right promoter
- Many false positives
• TargetFinder
- Identifies 95-90% of known pairs (55% with less data)
- Few false positives
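For comparison, the closest-gene heuristic can be written in a few lines. The coordinates and gene names below are made up, and distance is measured from the enhancer midpoint to the transcription start site (TSS), which is one common convention rather than anything specified in the talk.

```python
def closest_gene(enhancer_start, enhancer_end, tss_by_gene):
    """Return the gene whose TSS is nearest to the enhancer midpoint."""
    midpoint = (enhancer_start + enhancer_end) // 2
    return min(tss_by_gene, key=lambda gene: abs(tss_by_gene[gene] - midpoint))

# Hypothetical TSS positions on one chromosome
tss_by_gene = {"GeneA": 100_000, "GeneB": 180_000, "GeneC": 400_000}

# Enhancer at 150,000-151,000: the heuristic picks GeneB purely by
# linear distance, even if the true 3D contact were with GeneC.
assigned = closest_gene(150_000, 151_000, tss_by_gene)
```

The heuristic uses only position along the chromosome, so it cannot recover an interaction that loops past a nearer gene.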
Which human genome sequences function as long-range enhancers
13
13
13
EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features
Training Data
VISTA 13 Browser
minus
+
13
Training Data
+VISTA 13 Browser
EvoluDonary ConservaDon Human
ChimpFeatures Mouse
Rat
EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features
minus
13 13
13 13 13 13 13
EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features
Training Data
+VISTA 13 Browser
minus
EvoluDonary ConservaDon FuncDonal Genomics Human
ChimpFeatures Mouse
Rat
bullChIP-shy‐seq (TFs histones) bullDNase HypersensiDvity bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13
13 13 13 13
13 13 13 13 13
13 13 13
EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features
Training Data
+VISTA 13 Browser
minus
EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13
AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13
ACAAACACACAGACAT13Rat
bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13
13 13 13 13
13 13 13 13 13
13 13 13
Training Data
+VISTA 13 Browser
minus
Computa6onal Algorithm
Support vector machine separates 2 groups 13 Multi-kernel good for combining heterogeneous data types with different weights
EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features
EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13
AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13
ACAAACACACAGACAT13Rat
bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13
EnhancerFinder Performance
AUC = 0.96
Power = 85% at 10% FPR
Recall = 85% at 93% Precision
FDR ~10-50%
Significantly better than other methods
>80% in vivo validation rate
Erwin et al. (2014) PLoS Comp Bio
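For readers unfamiliar with the metrics on this slide, the sketch below shows how AUC and power at a fixed false positive rate can be computed from classifier scores. The labels and scores are made up for illustration and are not data from Erwin et al.; the function names are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the chance that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def power_at_fpr(labels, scores, max_fpr=0.10):
    """Highest true positive rate reachable while the false positive rate stays <= max_fpr."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    best = 0.0
    for threshold in sorted(set(scores)):
        fpr = sum(n >= threshold for n in neg) / len(neg)
        tpr = sum(p >= threshold for p in pos) / len(pos)
        if fpr <= max_fpr:
            best = max(best, tpr)
    return best


# Toy example: 4 validated enhancers (1) and 10 negatives (0).
labels = [1, 1, 1, 1] + [0] * 10
scores = [0.9, 0.8, 0.7, 0.11,
          0.75, 0.2, 0.1, 0.15, 0.05, 0.25, 0.12, 0.18, 0.22, 0.08]
auc = roc_auc(labels, scores)
power = power_at_fpr(labels, scores, max_fpr=0.10)
```

Power here is the same quantity as recall or sensitivity; the slide reports it at a 10% false positive rate.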
EnhancerFinder Predictions
• 84,301 developmental enhancer predictions
- Cover 2% of the human genome
- Nearby genes have high expression and annotated functions in the relevant fetal tissue
- Significant overlap with disease mutations
- Cluster around developmental transcription factors and signaling genes
• 239 predictions overlap a Human Accelerated Region (33% of HARs); 25/30 validated in vivo
• Identify sites with fitness effects (Gulko et al. 2015)
[Figure: 239 Predicted HAR Enhancers: Heart (28), Limb (41), Brain (96)]
Erwin et al. (2014) PLoS Comp Bio; Capra et al. (2014) PTRSB
Massively Parallel Reporter Assays and capture Hi-C for validation
• Test >12,000 170 bp enhancers in parallel
• Quantitative activity assayed via RNA-seq
• Compare genotypes
- Human vs chimp
- Disease SNPs
[Figure labels: HARs; Enhancers (Human, Chimp); Vector library; Enhancer vector (MinP, Luciferase, barcodes); Lentivirus; iPS derived cells; Neuronal differentiated cells; Transcribed barcodes; RNA-seq]
Figure 2. Massively parallel functional testing of HARs. For each HAR, the chimp version and versions with all combinations of human mutations will be shotgun cloned, along with synthetic 20 bp barcodes, into an enhancer assay vector. This vector library will be introduced to human and chimp neuronal differentiated cells via lentivirus. HAR sequences that function as enhancers in these cells will produce barcoded RNA. Each unique HAR sequence will have an average of 90 tags to allow for a quantitative readout. RNA sequencing (RNA-seq) will allow for the measurement of the relative activity of each mutation in the encoded HAR. MinP: minimal promoter; mut: mutations.
Hane Ryu, Nadav Ahituv, Jay Shendure, Yin Shen
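The barcode readout described in the figure caption can be sketched as follows. This is an illustration under stated assumptions, not the analysis used in the study: normalising RNA barcode counts by DNA (vector library) barcode counts and taking a log2 ratio with a pseudocount is a common MPRA convention that the slide does not spell out, and the barcodes, counts, and sequence names are hypothetical.

```python
import math
from collections import defaultdict


def activity_by_sequence(barcode_to_seq, rna_counts, dna_counts, pseudocount=1.0):
    """Pool barcode counts per enhancer sequence and return log2(RNA / DNA) activity."""
    rna = defaultdict(float)
    dna = defaultdict(float)
    for barcode, seq in barcode_to_seq.items():
        rna[seq] += rna_counts.get(barcode, 0)
        dna[seq] += dna_counts.get(barcode, 0)
    return {
        seq: math.log2((rna[seq] + pseudocount) / (dna[seq] + pseudocount))
        for seq in dna
    }


# Hypothetical 4-letter barcodes (real barcodes would be 20 bp, ~90 per sequence).
barcode_to_seq = {
    "AGGT": "HAR_human", "ACGT": "HAR_human",
    "TCGT": "HAR_chimp", "ACGC": "HAR_chimp",
}
rna_counts = {"AGGT": 30, "ACGT": 33, "TCGT": 7, "ACGC": 8}
dna_counts = {"AGGT": 7, "ACGT": 8, "TCGT": 7, "ACGC": 8}

activity = activity_by_sequence(barcode_to_seq, rna_counts, dna_counts)
# Positive difference: the human version drives more barcoded RNA than the chimp version.
human_vs_chimp = activity["HAR_human"] - activity["HAR_chimp"]
```

Pooling many barcodes per sequence is what makes the readout quantitative: noise in any single tag is averaged out before genotypes are compared.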
Induced pluripotent stem cell derived neuronal and cardiac lines
[Figure panels: Chimp; Human; Human iPSC derived cardiomyocytes]
Hane Ryu, Alex Pollen, Nadav Ahituv, Arnold Kriegstein, Bruce Conklin
Discovering the role of enhancers in human biology and evolution
• Computational Characterization
• iPSC based screening
• In vivo molecular studies
Collaborators
EnhancerFinder: Gen Haliburton, Tony Capra, Dennis Kostka, John Rubenstein
TargetFinder: Sean Whalen, Rebecca Truty, Benoit Bruneau
MotifDiverge: Dennis Kostka, Tara Friedrich, Deb Ritter, Jeff Chuang
Funding from NHLBI, PhRMA Foundation, Gladstone Institutes
Can TargetFinder work outside ENCODE cell lines?
Test if models generalize across cell types
Fmax values
EVALUATE
           GM12878   K562   HeLa-S3   HUVEC
GM12878    0.83      0.40   0.43      0.39
K562       0.46      0.85   0.45      0.44
HeLa-S3    0.43      0.38   0.88      0.41
HUVEC      0.39      0.40   0.38      -
TRAIN
Expect ~35% precision and 55% recall on a new cell type with ~10 ChIP-seq datasets
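The numbers on this slide can be related to each other with a short worked example. The sketch assumes that Fmax means the maximum F1 score over all score thresholds (the slide does not define it), and the toy labels and scores are hypothetical. With 35% precision and 55% recall, F1 is about 0.43, which is in line with the cross-cell-type values in the table.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def fmax(labels, scores):
    """Maximum F1 over all thresholds taken from the observed scores."""
    n_pos = sum(labels)
    best = 0.0
    for threshold in sorted(set(scores)):
        called = [y for y, s in zip(labels, scores) if s >= threshold]
        tp = sum(called)
        if tp == 0:
            continue  # no true positives called, so F1 is 0 at this threshold
        best = max(best, f1(tp / len(called), tp / n_pos))
    return best


# F1 implied by the expected precision and recall on a new cell type.
new_cell_type_f1 = f1(0.35, 0.55)

# Toy enhancer-promoter pairs: 1 = interacting, 0 = not interacting.
toy_fmax = fmax([1, 0, 1, 0, 0], [0.9, 0.8, 0.7, 0.2, 0.1])
```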
TargetFinder accurately annotates enhancer-promoter pairs
[Figure: DNA with a cardiomyopathy associated variant and Gene A, Gene B, Gene C]
Massive data integration improves prediction
• Closest gene
- Usually fails to identify the right promoter
- Many false positives
• TargetFinder
- Identifies 95-90% of known pairs (55% with less data)
- Few false positives
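The closest-gene baseline that TargetFinder is compared against can be sketched in a few lines. The coordinates, gene names, and the "known" target below are hypothetical and mirror the slide's cartoon, in which a variant sits near Gene A and Gene B but regulates Gene C.

```python
def closest_gene(enhancer_position, tss_by_gene):
    """Assign an enhancer to the gene whose transcription start site is nearest."""
    return min(tss_by_gene, key=lambda gene: abs(tss_by_gene[gene] - enhancer_position))


def baseline_accuracy(enhancers, known_targets, tss_by_gene):
    """Fraction of enhancers whose closest gene matches the known target."""
    hits = sum(
        closest_gene(position, tss_by_gene) == known_targets[name]
        for name, position in enhancers.items()
    )
    return hits / len(enhancers)


# Hypothetical 1D coordinates on one chromosome.
tss_by_gene = {"GeneA": 1_000, "GeneB": 5_000, "GeneC": 9_000}
enhancers = {"variant_enhancer": 4_500}
known_targets = {"variant_enhancer": "GeneC"}  # e.g. a long-range contact seen in Hi-C

predicted = closest_gene(4_500, tss_by_gene)
accuracy = baseline_accuracy(enhancers, known_targets, tss_by_gene)
```

The baseline uses only linear distance, which is why it cannot recover a looping interaction that skips over a nearer gene.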
Which human genome sequences function as long-range enhancers?
Which human genome sequences function as long-range enhancers
13
13
13
EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features
Training Data
VISTA 13 Browser
minus
+
13
Training Data
+VISTA 13 Browser
EvoluDonary ConservaDon Human
ChimpFeatures Mouse
Rat
EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features
minus
13 13
13 13 13 13 13
EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features
Training Data
+VISTA 13 Browser
minus
EvoluDonary ConservaDon FuncDonal Genomics Human
ChimpFeatures Mouse
Rat
bullChIP-shy‐seq (TFs histones) bullDNase HypersensiDvity bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13
13 13 13 13
13 13 13 13 13
13 13 13
EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features
Training Data
+VISTA 13 Browser
minus
EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13
AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13
ACAAACACACAGACAT13Rat
bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13
13 13 13 13
13 13 13 13 13
13 13 13
Training Data
+VISTA 13 Browser
minus
Computa6onal Algorithm
Support vector machine separates 2 groups 13 Multi-kernel good for combining heterogeneous data types with different weights
EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features
EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13
AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13
ACAAACACACAGACAT13Rat
bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13
EnhancerFinder Performance
AUC=096 Power=85 at 10 FPR Recall=85 at 93 Precision 13 FDR ~10-50 Significantly better than other methods 13 gt80 in vivo validation rate
Erwin et al (2014) PLoS Comp Bio
EnhancerFinder Predictions
239 Predicted HAR13 Enhancers
Heart13 (28)
Limb13 (41) Brain (96)
Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB
EnhancerFinder Predictions
bull 84301 developmental enhancer predictions - Cover 2 of the human genome
239 Predicted HAR13 Enhancers
Heart13 (28)
- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes
Limb13 (41) Brain (96)
Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB
EnhancerFinder Predictions
bull 84301 developmental enhancer predictions - Cover 2 of the human genome
239 Predicted HAR13 Enhancers
Heart13 (28)
- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes
Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo
Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB
EnhancerFinder Predictions
bull 84301 developmental enhancer predictions - Cover 2 of the human genome
239 Predicted HAR13 Enhancers
Heart13 (28)
- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes
Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo
bull Identify sites with fitness effects (Gulko et al 2015) Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB
HARs
Figure 2 Massively parallel functional testing of HARs For each HAR the chimp version and versions with all combinations of human mutations will be shotgun cloned along with synthetic 20bp barcodes into an enhancer assay vector This vector library will be introduced to human and chimp neuronal differentiated cells via lentivirus HAR sequences that function as enhancers in these cells will produce barcoded RNA Each unique HAR sequence will have an average of 90 tags to allow for a quantitative readout RNA sequencing (RNA-seq) will allow for the measurement of the relative activity of each mutation in the encoded HAR MinP (minimal promoter) mut (mutations)
Massively Parallel Reporter Assays and capture Hi-C for validation
[Figure labels: Enhancers (Human, Chimp); Vector library; Enhancer vector (MinP, Luciferase, barcodes); Lentivirus; iPS derived cells; Neuronal differentiated cells; Transcribed barcodes; RNA-seq]
• Test >12,000 170 bp enhancers in parallel
• Quantitative activity assayed via RNA-seq
• Compare genotypes
- Human vs chimp
- Disease SNPs
Hane Ryu, Nadav Ahituv, Jay Shendure, Yin Shen
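One common way to turn barcode read counts into an activity score is to normalize RNA barcode counts by the matching DNA counts and summarize across the barcodes that tag one sequence. The talk does not state its normalization, so the RNA/DNA ratio, the median summary, the pseudocount, and all counts below are assumptions for illustration.

```python
from math import log2
from statistics import median


def activity(rna_counts, dna_counts, pseudocount=1.0):
    """Median log2(RNA/DNA) across the barcodes tagging one enhancer sequence."""
    ratios = [
        log2((r + pseudocount) / (d + pseudocount))
        for r, d in zip(rna_counts, dna_counts)
    ]
    return median(ratios)


# Hypothetical barcode counts for one enhancer in its human and chimp versions.
human = activity(rna_counts=[31, 63, 15], dna_counts=[7, 15, 3])
chimp = activity(rna_counts=[7, 15, 3], dna_counts=[7, 15, 3])

# A positive difference means the human version drives more reporter RNA.
print(human, chimp, human - chimp)
```

Summarizing over many barcodes per sequence is what makes the readout quantitative: a single barcode's count is noisy, while the median over many tags is more stable.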
Induced pluripotent stem cell derived neuronal and cardiac lines
[Figure panels: Chimp; Human; Human iPSC derived cardiomyocytes]
Hane Ryu, Alex Pollen, Nadav Ahituv, Arnold Kriegstein, Bruce Conklin
Discovering the role of enhancers in human biology and evolution
[Figure: sequence variants (G/A, T/C) near Gene B and Gene C]
• Computational Characterization
• iPSC based screening
• In vivo molecular studies
Collaborators
EnhancerFinder: Gen Haliburton, Tony Capra, Dennis Kostka, John Rubenstein
TargetFinder: Sean Whalen, Rebecca Truty, Benoit Bruneau
MotifDiverge: Dennis Kostka, Tara Friedrich, Deb Ritter, Jeff Chuang
Funding from NHLBI, PhRMA Foundation, Gladstone Institutes
TargetFinder accurately annotates enhancer-promoter pairs
[Figure: cardiomyopathy associated variants (A/C, T/C, G/A) on DNA near Gene A, Gene B, and Gene C]
Massive data integration improves prediction
• Closest gene
- Usually fails to identify the right promoter
- Many false positives
• TargetFinder
- Identifies 95-90% of known pairs (55% with less data)
- Few false positives
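The "closest gene" baseline on this slide is simple to state in code: assign each enhancer to the gene whose transcription start site (TSS) is nearest on the linear genome. A minimal sketch with made-up coordinates and gene names; it illustrates the baseline only, not TargetFinder.

```python
def closest_gene(enhancer_mid, tss_by_gene):
    """Return the gene whose TSS is nearest to the enhancer midpoint (1D distance)."""
    return min(tss_by_gene, key=lambda g: abs(tss_by_gene[g] - enhancer_mid))


# Hypothetical TSS positions (bp) on one chromosome.
tss = {"GeneA": 10_000, "GeneB": 55_000, "GeneC": 200_000}

# An enhancer at 60 kb is assigned to GeneB, whichever gene it contacts in 3D.
print(closest_gene(60_000, tss))
```

Because the rule looks only at linear distance, it cannot recover an enhancer that loops past a nearer gene to a more distant promoter.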
Which human genome sequences function as long-range enhancers?
EnhancerFinder Training
Teach a machine learning algorithm to identify developmental enhancers active in different tissues based on their features
Training Data
• + and − examples (VISTA Browser)
Features
• Evolutionary Conservation
- Human, Chimp, Mouse, Rat
• Functional Genomics
- ChIP-seq (TFs, histones)
- DNase Hypersensitivity
- ENCODE
- Epigenomics Roadmap
- Bench-to-Bassinet
• DNA Sequence Motifs
- short k-mers (AAAA, AAAC, AAAG, AAAT, AACA, AACC, AACG, AACT, AAGA, AAGC, AAGG, AAGT, AATA, AATC, AATG, AATT, ACAA, ACAC, ACAG, ACAT)
- known TF motifs
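The "short k-mers" feature can be sketched as a count vector over all 4^k possible k-mers. A minimal sketch, assuming k = 4 as the slide's example strings suggest; the function name `kmer_counts` and the example sequence are illustrative, not taken from EnhancerFinder.

```python
from itertools import product


def kmer_counts(seq, k=4):
    """Count overlapping k-mers in seq, returned in lexicographic A/C/G/T order."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows containing N or other symbols
            counts[window] += 1
    return [counts[km] for km in kmers]


# Hypothetical 8 bp sequence: windows are AAAA, AAAA, AAAC, AACG, ACGT.
features = kmer_counts("AAAAACGT")
print(len(features), features[0], sum(features))
```

Each region then becomes a fixed-length vector (256 entries for k = 4), regardless of the region's length, which is what a classifier needs as input.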
Computational Algorithm
• Support vector machine separates 2 groups
• Multi-kernel: good for combining heterogeneous data types with different weights
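The multi-kernel idea can be sketched as a weighted sum of one kernel matrix per data type. The sketch below builds linear kernels for two hypothetical feature groups and combines them with fixed weights; in a real multiple kernel learning setup the weights are learned, and the combined matrix would be handed to an SVM that accepts a precomputed kernel. All feature values, weights, and function names are assumptions, not EnhancerFinder's actual configuration.

```python
def linear_kernel(rows):
    """Gram matrix K[i][j] = dot(rows[i], rows[j]) for one feature group."""
    return [[sum(a * b for a, b in zip(x, y)) for y in rows] for x in rows]


def combine_kernels(kernels, weights):
    """Weighted sum of same-sized kernel matrices (weights should be non-negative)."""
    n = len(kernels[0])
    return [
        [sum(w * k[i][j] for k, w in zip(kernels, weights)) for j in range(n)]
        for i in range(n)
    ]


# Three hypothetical regions described by two feature groups.
conservation = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
chip_signal = [[2.0], [1.0], [0.0]]

combined = combine_kernels(
    [linear_kernel(conservation), linear_kernel(chip_signal)],
    weights=[0.5, 0.5],
)
print(combined)
```

A non-negative weighted sum of valid kernels is itself a valid kernel, which is what lets heterogeneous data types share one classifier while contributing with different weights.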
EnhancerFinder Performance
AUC=0.96; Power=85% at 10% FPR; Recall=85% at 93% Precision
FDR ~10-50%; significantly better than other methods
>80% in vivo validation rate
Erwin et al. (2014) PLoS Comp Bio
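The metrics on this slide can be computed from classifier scores and true labels. A minimal sketch on invented data: AUC via the rank (Mann-Whitney) formulation, plus recall, false positive rate, and precision at a chosen threshold. The scores, labels, threshold, and function names are hypothetical.

```python
def auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def rates_at(scores, labels, threshold):
    """Return (recall, false positive rate, precision) for calls with score >= threshold."""
    tp = sum(s >= threshold and y == 1 for s, y in zip(scores, labels))
    fp = sum(s >= threshold and y == 0 for s, y in zip(scores, labels))
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return tp / n_pos, fp / n_neg, tp / (tp + fp)


# Hypothetical classifier scores with true labels (1 = enhancer, 0 = not).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 1, 0, 0, 0]

print(auc(scores, labels))
print(rates_at(scores, labels, threshold=0.5))
```

Power is another name for recall (the true positive rate), so "Power at 10% FPR" is a single point on the ROC curve whose area is the AUC.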
13
Training Data
+VISTA 13 Browser
EvoluDonary ConservaDon Human
ChimpFeatures Mouse
Rat
EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features
minus
13 13
13 13 13 13 13
EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features
Training Data
+VISTA 13 Browser
minus
EvoluDonary ConservaDon FuncDonal Genomics Human
ChimpFeatures Mouse
Rat
bullChIP-shy‐seq (TFs histones) bullDNase HypersensiDvity bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13
13 13 13 13
13 13 13 13 13
13 13 13
EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features
Training Data
+VISTA 13 Browser
minus
EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13
AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13
ACAAACACACAGACAT13Rat
bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13
13 13 13 13
13 13 13 13 13
13 13 13
Training Data
+VISTA 13 Browser
minus
Computa6onal Algorithm
Support vector machine separates 2 groups 13 Multi-kernel good for combining heterogeneous data types with different weights
EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features
EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13
AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13
ACAAACACACAGACAT13Rat
bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13
EnhancerFinder Performance
AUC=096 Power=85 at 10 FPR Recall=85 at 93 Precision 13 FDR ~10-50 Significantly better than other methods 13 gt80 in vivo validation rate
Erwin et al (2014) PLoS Comp Bio
EnhancerFinder Predictions
239 Predicted HAR13 Enhancers
Heart13 (28)
Limb13 (41) Brain (96)
Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB
EnhancerFinder Predictions
bull 84301 developmental enhancer predictions - Cover 2 of the human genome
239 Predicted HAR13 Enhancers
Heart13 (28)
- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes
Limb13 (41) Brain (96)
Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB
EnhancerFinder Predictions
bull 84301 developmental enhancer predictions - Cover 2 of the human genome
239 Predicted HAR13 Enhancers
Heart13 (28)
- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes
Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo
Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB
EnhancerFinder Predictions
bull 84301 developmental enhancer predictions - Cover 2 of the human genome
239 Predicted HAR13 Enhancers
Heart13 (28)
- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes
Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo
bull Identify sites with fitness effects (Gulko et al 2015) Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB
HARs
Figure 2 Massively parallel functional testing of HARs For each HAR the chimp version and versions with all combinations of human mutations will be shotgun cloned along with synthetic 20bp barcodes into an enhancer assay vector This vector library will be introduced to human and chimp neuronal differentiated cells via lentivirus HAR sequences that function as enhancers in these cells will produce barcoded RNA Each unique HAR sequence will have an average of 90 tags to allow for a quantitative readout RNA sequencing (RNA-seq) will allow for the measurement of the relative activity of each mutation in the encoded HAR MinP (minimal promoter) mut (mutations)
Massively Parallel Reporter Assays and capture Hi-C for validation
• Test >12,000 170-bp enhancers in parallel
• Quantitative activity assayed via RNA-seq
• Compare genotypes
  - Human vs. chimp
  - Disease SNPs
[Figure: MPRA workflow. Panel labels: Enhancers (Human, Chimp); Vector library; Enhancer vector (MinP, Luciferase, barcodes); Lentivirus; iPS derived cells; Neuronal differentiated cells; Transcribed barcodes; RNA-seq]
Hane Ryu, Nadav Ahituv, Jay Shendure, Yin Shen
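A quantitative readout from barcoded RNA is often summarized as one activity score per enhancer, pooled over that enhancer's barcodes. The sketch below assumes that RNA barcode counts are normalized by DNA (library) counts and reported as log2(RNA/DNA); this normalization is a common convention and an assumption here, not something stated on the slide, and all names and counts are toy values.

```python
import math
from collections import defaultdict


def enhancer_activity(barcode_counts, pseudocount=1.0):
    """Pool barcode counts into one log2(RNA/DNA) score per enhancer.

    barcode_counts: iterable of (enhancer_id, rna_count, dna_count),
    one tuple per barcode. The pseudocount avoids division by zero.
    """
    rna = defaultdict(float)
    dna = defaultdict(float)
    for enhancer_id, rna_count, dna_count in barcode_counts:
        rna[enhancer_id] += rna_count
        dna[enhancer_id] += dna_count
    return {
        e: math.log2((rna[e] + pseudocount) / (dna[e] + pseudocount))
        for e in rna
    }


# Toy counts for one hypothetical HAR in its human and chimp versions.
counts = [
    ("harX_human", 30, 10), ("harX_human", 33, 5),
    ("harX_chimp", 8, 9), ("harX_chimp", 7, 6),
]
activity = enhancer_activity(counts)
# Positive values mean the human allele is more active than the chimp allele.
human_vs_chimp = activity["harX_human"] - activity["harX_chimp"]
print(activity, human_vs_chimp)
```

Pooling counts before taking the ratio is one choice; averaging per-barcode ratios is another and weights low-count barcodes more heavily.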
Induced pluripotent stem cell derived neuronal and cardiac lines
[Figure: chimp and human iPSC derived lines; human iPSC derived cardiomyocytes]
Hane Ryu, Alex Pollen, Nadav Ahituv, Arnold Kriegstein, Bruce Conklin
Discovering the role of enhancers in human biology and evolution
• Computational characterization
• iPSC based screening
• In vivo molecular studies
[Figure: DNA variants (G/A, T/C) with Gene B and Gene C]
Collaborators
EnhancerFinder: Gen Haliburton, Tony Capra, Dennis Kostka, John Rubenstein
TargetFinder: Sean Whalen, Rebecca Truty, Benoit Bruneau
MotifDiverge: Dennis Kostka, Tara Friedrich, Deb Ritter, Jeff Chuang
Funding from NHLBI, PhRMA Foundation, Gladstone Institutes
EnhancerFinder Training
Teach a machine learning algorithm to identify developmental enhancers active in different tissues based on their features

Training Data
• VISTA Browser (+ / −)

Features
• Evolutionary Conservation: Human, Chimp, Mouse, Rat
• Functional Genomics
  - ChIP-seq (TFs, histones)
  - DNase hypersensitivity
  - ENCODE
  - Epigenomics Roadmap
  - Bench-to-Bassinet
• DNA Sequence Motifs
  - Short k-mers (AAAA, AAAC, AAAG, AAAT, AACA, AACC, AACG, AACT, AAGA, AAGC, AAGG, AAGT, AATA, AATC, AATG, AATT, ACAA, ACAC, ACAG, ACAT)
  - Known TF motifs

Computational Algorithm
• Support vector machine: separates 2 groups
• Multi-kernel: good for combining heterogeneous data types with different weights
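The multi-kernel idea on this slide can be sketched as one kernel per data type, combined as a weighted sum. The example below assumes a k-mer spectrum kernel for DNA sequence and a linear kernel for functional genomics signal, with weights fixed by hand; in multiple kernel learning the weights are learned from the training data, and none of the names, weights, or values here come from EnhancerFinder.

```python
from collections import Counter


def kmer_counts(seq, k=4):
    """Count overlapping k-mers in a DNA sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(seq_a, seq_b, k=4):
    """Dot product of the two sequences' k-mer count vectors."""
    a, b = kmer_counts(seq_a, k), kmer_counts(seq_b, k)
    return sum(a[kmer] * b[kmer] for kmer in a if kmer in b)


def linear_kernel(x, y):
    """Dot product of two numeric feature vectors."""
    return sum(xi * yi for xi, yi in zip(x, y))


def combined_gram(regions, weights):
    """Weighted sum of the per-data-type kernels over all pairs of regions."""
    n = len(regions)
    gram = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            ri, rj = regions[i], regions[j]
            gram[i][j] = (
                weights["sequence"] * spectrum_kernel(ri["seq"], rj["seq"])
                + weights["functional"] * linear_kernel(ri["signal"], rj["signal"])
            )
    return gram


# Toy regions: a short sequence plus two made-up functional genomics signals.
regions = [
    {"seq": "AAAACAAAG", "signal": [1.0, 0.0]},
    {"seq": "AAAACTTTT", "signal": [0.5, 2.0]},
]
weights = {"sequence": 0.25, "functional": 0.75}  # hand-picked, not learned
gram = combined_gram(regions, weights)
print(gram)
```

A Gram matrix like this can be handed to any SVM implementation that accepts precomputed kernels; a non-negative weighted sum of valid kernels is itself a valid kernel, which is what makes the combination legitimate.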
EnhancerFinder Performance
• AUC = 0.96
• Power = 85% at 10% FPR
• Recall = 85% at 93% precision
• FDR ~10-50%
• Significantly better than other methods
• >80% in vivo validation rate
Erwin et al. (2014) PLoS Comp Bio
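For readers less familiar with these metrics, AUC and power at a fixed false positive rate can both be read off the ROC curve built from classifier scores and true labels. A minimal sketch with toy scores and labels (assumed values, unrelated to the reported results):

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points, sweeping the threshold from high to low."""
    pairs = sorted(zip(scores, labels), reverse=True)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    i = 0
    while i < len(pairs):
        # Process tied scores together so ties add a single point.
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            if pairs[j][1]:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / n_neg, tp / n_pos))
        i = j
    return points


def auc(points):
    """Area under the ROC curve by the trapezoid rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def power_at_fpr(points, max_fpr):
    """Highest TPR (power) reachable without exceeding max_fpr."""
    return max(tpr for fpr, tpr in points if fpr <= max_fpr)


# Toy data: 1 = validated enhancer, 0 = non-enhancer.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 1, 0, 0, 0]
points = roc_points(scores, labels)
print(auc(points), power_at_fpr(points, 0.25))
```

Power at a given FPR describes one operating point, while AUC summarizes the whole curve, which is why slides like this one tend to report both.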
EnhancerFinder Predictions
bull 84301 developmental enhancer predictions - Cover 2 of the human genome
239 Predicted HAR13 Enhancers
Heart13 (28)
- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes
Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo
Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB
EnhancerFinder Predictions
bull 84301 developmental enhancer predictions - Cover 2 of the human genome
239 Predicted HAR13 Enhancers
Heart13 (28)
- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes
Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo
bull Identify sites with fitness effects (Gulko et al 2015) Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB
HARs
Figure 2 Massively parallel functional testing of HARs For each HAR the chimp version and versions with all combinations of human mutations will be shotgun cloned along with synthetic 20bp barcodes into an enhancer assay vector This vector library will be introduced to human and chimp neuronal differentiated cells via lentivirus HAR sequences that function as enhancers in these cells will produce barcoded RNA Each unique HAR sequence will have an average of 90 tags to allow for a quantitative readout RNA sequencing (RNA-seq) will allow for the measurement of the relative activity of each mutation in the encoded HAR MinP (minimal promoter) mut (mutations)
Neuronal differentiated cells
Massively Parallel Reporter Assaysand capture Hi-C for validation
AGGT
MinP Luciferase
ACGT ACGT
TCGT ACGC
Enhancer vector
Transcribed barcodes
RNA-seq
Lentivirus
Text
bullTest gt12000 170bp enhancers in parallel bull Quantitative activity assayed via RNA-seq bull Compare genotypes bull Human vs chimp bull Disease SNPs
iPS derived cells
Enhancers Human
Chimp
Vector library
Hane Ryu Nadav Ahituv Jay Shendure Yin Shen
Induced pluripotent stem cell derivedneuronal and cardiac lines
Chi
mp
Hum
an
Human iPSC derived cardiomyocytes
Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin
Induced pluripotent stem cell derivedneuronal and cardiac lines
Chi
mp
Hum
an
Human iPSC derived cardiomyocytes
Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin
Discovering the role of enhancers in human biology and evolution
[Figure: DNA schematic with variant labels (G A, T C) near Gene B and Gene C]
• Computational characterization
• iPSC based screening
• In vivo molecular studies
Collaborators
EnhancerFinder: Gen Haliburton, Tony Capra, Dennis Kostka, John Rubenstein
TargetFinder: Sean Whalen, Rebecca Truty, Benoit Bruneau
MotifDiverge: Dennis Kostka, Tara Friedrich, Deb Ritter, Jeff Chuang
Funding from NHLBI, PhRMA Foundation, Gladstone Institutes