Interference of 3D Regulatory Interactions from 2D Genomic ... · Erwin et al. (2014) PLoS Comp...

57
Inference of 3D regulatory interactions from 2D genomic data Katie Pollard Gladstone Institutes, Institute for Human Genetics, Division of Biostatistics - UCSF ENCODE Users Meeting Bolger Center - June 30, 2015

Transcript of Interference of 3D Regulatory Interactions from 2D Genomic ... · Erwin et al. (2014) PLoS Comp...

Page 1: Interference of 3D Regulatory Interactions from 2D Genomic ... · Erwin et al. (2014) PLoS Comp Bio, Capra et al. (2014) PTRSB. EnhancerFinder: Predictions • 84,301 developmental

Inference of 3D regulatory interactionsfrom 2D genomic data 13

Katie Pollard 13

Gladstone Institutes Institute for Human Genetics Division of Biostatistics - UCSF

ENCODE Users Meeting

Bolger Center - June 30 2015

Eukaryotic gene regulation is 3D and complex

Eric Keller Drew Berry

Most phenotype associatedmutations are outside coding regions

Most phenotype associatedmutations are outside coding regions

DNA

A C

Cardiomyopathy associated variant or human-chimp

difference

Most phenotype associatedmutations are outside coding regions

DNA

A C

T C

G A

Cardiomyopathy associated variant or human-chimp

difference

Most phenotype associatedmutations are outside coding regions

DNA

A C

T C

G A

Cardiomyopathy Gene A Gene B Gene C associated variant or human-chimp

difference

Most phenotype associatedmutations are outside coding regions

DNA

A C

T C

G A

Cardiomyopathy Gene A Gene B Gene C associated variant or human-chimp

difference

Cardiomyopathy associated variant or human-chimp

difference

Gene A Gene B

G A

Most phenotype associatedmutations are outside coding regions

A T C C

DNA Gene C

Cardiomyopathy associated variant or human-chimp

difference

Gene A Gene B

G A

Most phenotype associatedmutations are outside coding regions

A T C C

DNA Gene C

bull Enhancers individual ChIP-seq data sets identify lt50 ofknown enhancers plus many false positives13

bull Gene targets closest gene is right ~10 of time

Can we reconstruct 3D interactions between enhancers and promoters

from 2D genomic data 13

13

13

TargetFinder Feature Importance

TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features

Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution

13

13 13 13 13

TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features

Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution

EvoluDonary ConservaDon Human

ChimpFeatures Mouse

Rat

Conserved13 synteny of13 enhancer and promoter

13 13

13 13 13 13 13

13 13 13 13

TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features

Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution

EvoluDonary ConservaDon FuncDonal Genomics Human

ChimpFeatures Mouse

Rat

Conserved13 synteny of13 bullChIA-shy‐PET enhancer and promoter bullRNA-shy‐seq

bullChIP-shy‐seq (TFs histones) bullenhancerpromoterwindowbullChromHMM13 amp Segway

13 13 13

13 13 13 13 13

13 13 13 13 13 13 13 13 13 13 13

13 13 13 13

TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features

Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution

EvoluDonary ConservaDon FuncDonal Genomics Sequence AnnotaDons Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

Conserved13 synteny of13 bullChIA-shy‐PET K-shy‐mer correlaDon enhancer and promoter bullRNA-shy‐seq Annotated funcDons and

bullChIP-shy‐seq (TFs histones) pathways of gene and bullenhancerpromoterwindow enhancer bound TFs bullChromHMM13 amp Segway

13 13 13

13 13 13 13 13

13 13 13 13 13 13 13 13 13 13 13

13 13 13 13

TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features

Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution

Computa6onal Algorithm

Decision trees good for interacting features 13 Ensemble learning build many imperfect classifiers and combine them to improve prediction accuracy

EvoluDonary ConservaDon FuncDonal Genomics Sequence AnnotaDons Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

Conserved13 synteny of13 bullChIA-shy‐PET K-shy‐mer correlaDon enhancer and promoter bullRNA-shy‐seq Annotated funcDons and

bullChIP-shy‐seq (TFs histones) pathways of gene and bullenhancerpromoterwindow enhancer bound TFs bullChromHMM13 amp Segway

TargetFinder Performance AUC=094-096 Precision

=90-95 Recall=76-83 Power=85-89 at 10 FPR 13 Significantly better than random and logistic regression

TargetFinder performs well at verylong distances

TargetFinder Feature Importance

TargetFinder Feature Importance Most predictive features mark the window between the enhancer and the promoter

TargetFinder Feature Importance Most predictive features mark the window between the enhancer and the promoter

TargetFinder Feature Importance Most predictive features mark the window between the enhancer and the promoter

TargetFinder Feature Importance

TargetFinder Feature Importance Most useful features for prediction are TF andhistone marks in the window between the enhancer and the promoter

bull True interactions - Enhancer-associated proteins P300 JUN TFs - Marks of heterochromatin lack of DNA methylation - Marks of paused or poised RNA polymerase

TargetFinder Feature Importance Most useful features for prediction are TF andhistone marks in the window between the enhancer and the promoter

bull True interactions - Enhancer-associated proteins P300 JUN TFs - Marks of heterochromatin lack of DNA methylation - Marks of paused or poised RNA polymerase

bull False interactions - Cohesin complex CTCF RAD21 SMC3 ZNF143 - Histone marks of open chromatin and elongation - Marks of active promoters and gene bodies

TargetFinder Feature Importance Most useful features for prediction are TF andhistone marks in the window between the enhancer and the promoter

bull True interactions - Enhancer-associated proteins P300 JUN TFs - Marks of heterochromatin lack of DNA methylation - Marks of paused or poised RNA polymerase

bull False interactions - Cohesin complex CTCF RAD21 SMC3 ZNF143 - Histone marks of open chromatin and elongation - Marks of active promoters and gene bodies

Many ldquowindowrdquo features have a different meaning when marking promoters and enhancers (eg cohesin)

Predictive features colocate and form complexes

K562

Predictive features colocate and form complexes

K562

Can TargetFinder work outsideENCODE cell lines

Can TargetFinder work outsideENCODE cell lines

What is a minimal set of experimentsfor accurate prediction

Can TargetFinder work outsideENCODE cell lines

What is a minimal set of experimentsfor accurate prediction optimal 16+

minimal 8

Can TargetFinder work outsideENCODE cell lines

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types EVALUATE

GM12878 K562 HeLa-S3 HUVEC

GM12878 083 040 043 039

K562 046 085 045 044

HeLa-S3 043 038 088 041

HUVEC 039 040 038 -

TRAIN

Fmax13 values

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types EVALUATE

GM12878 K562 HeLa-S3 HUVEC

GM12878 083 040 043 039

K562 046 085 045 044

HeLa-S3 043 038 088 041

HUVEC 039 040 038 -

TRAIN

Fmax13 values

Expect ~35 precision and 55 recall on anew cell type with ~10 ChIP-seq datasets

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Massive data integration improves predictionbull Closest gene

- Usually fails to identify the right promoter - Many false positives

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Massive data integration improves predictionbull Closest gene

- Usually fails to identify the right promoter - Many false positives

bull TargetFinder - Identifies 95-90 of known pairs (55 with less data) - Few false positives

Which human genome sequences function as long-range enhancers

13

13

13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

VISTA 13 Browser

minus

+

13

Training Data

+VISTA 13 Browser

EvoluDonary ConservaDon Human

ChimpFeatures Mouse

Rat

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

minus

13 13

13 13 13 13 13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

+VISTA 13 Browser

minus

EvoluDonary ConservaDon FuncDonal Genomics Human

ChimpFeatures Mouse

Rat

bullChIP-shy‐seq (TFs histones) bullDNase HypersensiDvity bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

13 13 13 13

13 13 13 13 13

13 13 13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

+VISTA 13 Browser

minus

EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

13 13 13 13

13 13 13 13 13

13 13 13

Training Data

+VISTA 13 Browser

minus

Computa6onal Algorithm

Support vector machine separates 2 groups 13 Multi-kernel good for combining heterogeneous data types with different weights

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

EnhancerFinder Performance

AUC=096 Power=85 at 10 FPR Recall=85 at 93 Precision 13 FDR ~10-50 Significantly better than other methods 13 gt80 in vivo validation rate

Erwin et al (2014) PLoS Comp Bio

EnhancerFinder Predictions

239 Predicted HAR13 Enhancers

Heart13 (28)

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

bull Identify sites with fitness effects (Gulko et al 2015) Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

HARs

Figure 2 Massively parallel functional testing of HARs For each HAR the chimp version and versions with all combinations of human mutations will be shotgun cloned along with synthetic 20bp barcodes into an enhancer assay vector This vector library will be introduced to human and chimp neuronal differentiated cells via lentivirus HAR sequences that function as enhancers in these cells will produce barcoded RNA Each unique HAR sequence will have an average of 90 tags to allow for a quantitative readout RNA sequencing (RNA-seq) will allow for the measurement of the relative activity of each mutation in the encoded HAR MinP (minimal promoter) mut (mutations)

Neuronal differentiated cells

Massively Parallel Reporter Assaysand capture Hi-C for validation

AGGT

MinP Luciferase

ACGT ACGT

TCGT ACGC

Enhancer vector

Transcribed barcodes

RNA-seq

Lentivirus

Text

bullTest gt12000 170bp enhancers in parallel bull Quantitative activity assayed via RNA-seq bull Compare genotypes bull Human vs chimp bull Disease SNPs

iPS derived cells

Enhancers Human

Chimp

Vector library

Hane Ryu Nadav Ahituv Jay Shendure Yin Shen

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Discovering the role of enhancers inhuman biology and evolution

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening In vivo molecular studies

Collaborators

EnhancerFinder Gen Haliburton Tony Capra Dennis Kostka John Rubenstein

TargetFinder Sean Whalen Rebecca Truty Benoit Bruneau

MotifDiverge Dennis Kostka Tara Friedrich Deb Ritter Jeff Chuang

Funding from NHLBI PhRMA Foundation Gladstone Institutes

Page 2: Interference of 3D Regulatory Interactions from 2D Genomic ... · Erwin et al. (2014) PLoS Comp Bio, Capra et al. (2014) PTRSB. EnhancerFinder: Predictions • 84,301 developmental

Eukaryotic gene regulation is 3D and complex

Eric Keller Drew Berry

Most phenotype associatedmutations are outside coding regions

Most phenotype associatedmutations are outside coding regions

DNA

A C

Cardiomyopathy associated variant or human-chimp

difference

Most phenotype associatedmutations are outside coding regions

DNA

A C

T C

G A

Cardiomyopathy associated variant or human-chimp

difference

Most phenotype associatedmutations are outside coding regions

DNA

A C

T C

G A

Cardiomyopathy Gene A Gene B Gene C associated variant or human-chimp

difference

Most phenotype associatedmutations are outside coding regions

DNA

A C

T C

G A

Cardiomyopathy Gene A Gene B Gene C associated variant or human-chimp

difference

Cardiomyopathy associated variant or human-chimp

difference

Gene A Gene B

G A

Most phenotype associatedmutations are outside coding regions

A T C C

DNA Gene C

Cardiomyopathy associated variant or human-chimp

difference

Gene A Gene B

G A

Most phenotype associatedmutations are outside coding regions

A T C C

DNA Gene C

bull Enhancers individual ChIP-seq data sets identify lt50 ofknown enhancers plus many false positives13

bull Gene targets closest gene is right ~10 of time

Can we reconstruct 3D interactions between enhancers and promoters

from 2D genomic data 13

13

13

TargetFinder Feature Importance

TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features

Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution

13

13 13 13 13

TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features

Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution

EvoluDonary ConservaDon Human

ChimpFeatures Mouse

Rat

Conserved13 synteny of13 enhancer and promoter

13 13

13 13 13 13 13

13 13 13 13

TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features

Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution

EvoluDonary ConservaDon FuncDonal Genomics Human

ChimpFeatures Mouse

Rat

Conserved13 synteny of13 bullChIA-shy‐PET enhancer and promoter bullRNA-shy‐seq

bullChIP-shy‐seq (TFs histones) bullenhancerpromoterwindowbullChromHMM13 amp Segway

13 13 13

13 13 13 13 13

13 13 13 13 13 13 13 13 13 13 13

13 13 13 13

TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features

Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution

EvoluDonary ConservaDon FuncDonal Genomics Sequence AnnotaDons Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

Conserved13 synteny of13 bullChIA-shy‐PET K-shy‐mer correlaDon enhancer and promoter bullRNA-shy‐seq Annotated funcDons and

bullChIP-shy‐seq (TFs histones) pathways of gene and bullenhancerpromoterwindow enhancer bound TFs bullChromHMM13 amp Segway

13 13 13

13 13 13 13 13

13 13 13 13 13 13 13 13 13 13 13

13 13 13 13

TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features

Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution

Computa6onal Algorithm

Decision trees good for interacting features 13 Ensemble learning build many imperfect classifiers and combine them to improve prediction accuracy

EvoluDonary ConservaDon FuncDonal Genomics Sequence AnnotaDons Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

Conserved13 synteny of13 bullChIA-shy‐PET K-shy‐mer correlaDon enhancer and promoter bullRNA-shy‐seq Annotated funcDons and

bullChIP-shy‐seq (TFs histones) pathways of gene and bullenhancerpromoterwindow enhancer bound TFs bullChromHMM13 amp Segway

TargetFinder Performance AUC=094-096 Precision

=90-95 Recall=76-83 Power=85-89 at 10 FPR 13 Significantly better than random and logistic regression

TargetFinder performs well at verylong distances

TargetFinder Feature Importance

TargetFinder Feature Importance Most predictive features mark the window between the enhancer and the promoter

TargetFinder Feature Importance Most predictive features mark the window between the enhancer and the promoter

TargetFinder Feature Importance Most predictive features mark the window between the enhancer and the promoter

TargetFinder Feature Importance

TargetFinder Feature Importance Most useful features for prediction are TF andhistone marks in the window between the enhancer and the promoter

bull True interactions - Enhancer-associated proteins P300 JUN TFs - Marks of heterochromatin lack of DNA methylation - Marks of paused or poised RNA polymerase

TargetFinder Feature Importance Most useful features for prediction are TF andhistone marks in the window between the enhancer and the promoter

bull True interactions - Enhancer-associated proteins P300 JUN TFs - Marks of heterochromatin lack of DNA methylation - Marks of paused or poised RNA polymerase

bull False interactions - Cohesin complex CTCF RAD21 SMC3 ZNF143 - Histone marks of open chromatin and elongation - Marks of active promoters and gene bodies

TargetFinder Feature Importance Most useful features for prediction are TF andhistone marks in the window between the enhancer and the promoter

bull True interactions - Enhancer-associated proteins P300 JUN TFs - Marks of heterochromatin lack of DNA methylation - Marks of paused or poised RNA polymerase

bull False interactions - Cohesin complex CTCF RAD21 SMC3 ZNF143 - Histone marks of open chromatin and elongation - Marks of active promoters and gene bodies

Many ldquowindowrdquo features have a different meaning when marking promoters and enhancers (eg cohesin)

Predictive features colocate and form complexes

K562

Predictive features colocate and form complexes

K562

Can TargetFinder work outsideENCODE cell lines

Can TargetFinder work outsideENCODE cell lines

What is a minimal set of experimentsfor accurate prediction

Can TargetFinder work outsideENCODE cell lines

What is a minimal set of experimentsfor accurate prediction optimal 16+

minimal 8

Can TargetFinder work outsideENCODE cell lines

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types EVALUATE

GM12878 K562 HeLa-S3 HUVEC

GM12878 083 040 043 039

K562 046 085 045 044

HeLa-S3 043 038 088 041

HUVEC 039 040 038 -

TRAIN

Fmax13 values

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types EVALUATE

GM12878 K562 HeLa-S3 HUVEC

GM12878 083 040 043 039

K562 046 085 045 044

HeLa-S3 043 038 088 041

HUVEC 039 040 038 -

TRAIN

Fmax13 values

Expect ~35 precision and 55 recall on anew cell type with ~10 ChIP-seq datasets

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Massive data integration improves predictionbull Closest gene

- Usually fails to identify the right promoter - Many false positives

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Massive data integration improves predictionbull Closest gene

- Usually fails to identify the right promoter - Many false positives

bull TargetFinder - Identifies 95-90 of known pairs (55 with less data) - Few false positives

Which human genome sequences function as long-range enhancers

13

13

13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

VISTA 13 Browser

minus

+

13

Training Data

+VISTA 13 Browser

EvoluDonary ConservaDon Human

ChimpFeatures Mouse

Rat

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

minus

13 13

13 13 13 13 13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

+VISTA 13 Browser

minus

EvoluDonary ConservaDon FuncDonal Genomics Human

ChimpFeatures Mouse

Rat

bullChIP-shy‐seq (TFs histones) bullDNase HypersensiDvity bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

13 13 13 13

13 13 13 13 13

13 13 13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

+VISTA 13 Browser

minus

EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

13 13 13 13

13 13 13 13 13

13 13 13

Training Data

+VISTA 13 Browser

minus

Computa6onal Algorithm

Support vector machine separates 2 groups 13 Multi-kernel good for combining heterogeneous data types with different weights

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

EnhancerFinder Performance

AUC=096 Power=85 at 10 FPR Recall=85 at 93 Precision 13 FDR ~10-50 Significantly better than other methods 13 gt80 in vivo validation rate

Erwin et al (2014) PLoS Comp Bio

EnhancerFinder Predictions

239 Predicted HAR13 Enhancers

Heart13 (28)

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

bull Identify sites with fitness effects (Gulko et al 2015) Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

HARs

Figure 2 Massively parallel functional testing of HARs For each HAR the chimp version and versions with all combinations of human mutations will be shotgun cloned along with synthetic 20bp barcodes into an enhancer assay vector This vector library will be introduced to human and chimp neuronal differentiated cells via lentivirus HAR sequences that function as enhancers in these cells will produce barcoded RNA Each unique HAR sequence will have an average of 90 tags to allow for a quantitative readout RNA sequencing (RNA-seq) will allow for the measurement of the relative activity of each mutation in the encoded HAR MinP (minimal promoter) mut (mutations)

Neuronal differentiated cells

Massively Parallel Reporter Assaysand capture Hi-C for validation

AGGT

MinP Luciferase

ACGT ACGT

TCGT ACGC

Enhancer vector

Transcribed barcodes

RNA-seq

Lentivirus

Text

bullTest gt12000 170bp enhancers in parallel bull Quantitative activity assayed via RNA-seq bull Compare genotypes bull Human vs chimp bull Disease SNPs

iPS derived cells

Enhancers Human

Chimp

Vector library

Hane Ryu Nadav Ahituv Jay Shendure Yin Shen

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Discovering the role of enhancers inhuman biology and evolution

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening In vivo molecular studies

Collaborators

EnhancerFinder Gen Haliburton Tony Capra Dennis Kostka John Rubenstein

TargetFinder Sean Whalen Rebecca Truty Benoit Bruneau

MotifDiverge Dennis Kostka Tara Friedrich Deb Ritter Jeff Chuang

Funding from NHLBI PhRMA Foundation Gladstone Institutes

Page 3: Interference of 3D Regulatory Interactions from 2D Genomic ... · Erwin et al. (2014) PLoS Comp Bio, Capra et al. (2014) PTRSB. EnhancerFinder: Predictions • 84,301 developmental

Most phenotype associatedmutations are outside coding regions

Most phenotype associatedmutations are outside coding regions

DNA

A C

Cardiomyopathy associated variant or human-chimp

difference

Most phenotype associatedmutations are outside coding regions

DNA

A C

T C

G A

Cardiomyopathy associated variant or human-chimp

difference

Most phenotype associatedmutations are outside coding regions

DNA

A C

T C

G A

Cardiomyopathy Gene A Gene B Gene C associated variant or human-chimp

difference

Most phenotype associatedmutations are outside coding regions

DNA

A C

T C

G A

Cardiomyopathy Gene A Gene B Gene C associated variant or human-chimp

difference

Cardiomyopathy associated variant or human-chimp

difference

Gene A Gene B

G A

Most phenotype associatedmutations are outside coding regions

A T C C

DNA Gene C

Cardiomyopathy associated variant or human-chimp

difference

Gene A Gene B

G A

Most phenotype associatedmutations are outside coding regions

A T C C

DNA Gene C

bull Enhancers individual ChIP-seq data sets identify lt50 ofknown enhancers plus many false positives13

bull Gene targets closest gene is right ~10 of time

Can we reconstruct 3D interactions between enhancers and promoters

from 2D genomic data 13

13

13

TargetFinder Feature Importance

TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features

Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution

13

13 13 13 13

TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features

Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution

EvoluDonary ConservaDon Human

ChimpFeatures Mouse

Rat

Conserved13 synteny of13 enhancer and promoter

13 13

13 13 13 13 13

13 13 13 13

TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features

Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution

EvoluDonary ConservaDon FuncDonal Genomics Human

ChimpFeatures Mouse

Rat

Conserved13 synteny of13 bullChIA-shy‐PET enhancer and promoter bullRNA-shy‐seq

bullChIP-shy‐seq (TFs histones) bullenhancerpromoterwindowbullChromHMM13 amp Segway

13 13 13

13 13 13 13 13

13 13 13 13 13 13 13 13 13 13 13

13 13 13 13

TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features

Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution

EvoluDonary ConservaDon FuncDonal Genomics Sequence AnnotaDons Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

Conserved13 synteny of13 bullChIA-shy‐PET K-shy‐mer correlaDon enhancer and promoter bullRNA-shy‐seq Annotated funcDons and

bullChIP-shy‐seq (TFs histones) pathways of gene and bullenhancerpromoterwindow enhancer bound TFs bullChromHMM13 amp Segway

13 13 13

13 13 13 13 13

13 13 13 13 13 13 13 13 13 13 13

13 13 13 13

TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features

Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution

Computa6onal Algorithm

Decision trees good for interacting features 13 Ensemble learning build many imperfect classifiers and combine them to improve prediction accuracy

EvoluDonary ConservaDon FuncDonal Genomics Sequence AnnotaDons Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

Conserved13 synteny of13 bullChIA-shy‐PET K-shy‐mer correlaDon enhancer and promoter bullRNA-shy‐seq Annotated funcDons and

bullChIP-shy‐seq (TFs histones) pathways of gene and bullenhancerpromoterwindow enhancer bound TFs bullChromHMM13 amp Segway

TargetFinder Performance AUC=094-096 Precision

=90-95 Recall=76-83 Power=85-89 at 10 FPR 13 Significantly better than random and logistic regression

TargetFinder performs well at verylong distances

TargetFinder Feature Importance

TargetFinder Feature Importance Most predictive features mark the window between the enhancer and the promoter

TargetFinder Feature Importance Most predictive features mark the window between the enhancer and the promoter

TargetFinder Feature Importance Most predictive features mark the window between the enhancer and the promoter

TargetFinder Feature Importance

TargetFinder Feature Importance Most useful features for prediction are TF andhistone marks in the window between the enhancer and the promoter

bull True interactions - Enhancer-associated proteins P300 JUN TFs - Marks of heterochromatin lack of DNA methylation - Marks of paused or poised RNA polymerase

TargetFinder Feature Importance Most useful features for prediction are TF andhistone marks in the window between the enhancer and the promoter

bull True interactions - Enhancer-associated proteins P300 JUN TFs - Marks of heterochromatin lack of DNA methylation - Marks of paused or poised RNA polymerase

bull False interactions - Cohesin complex CTCF RAD21 SMC3 ZNF143 - Histone marks of open chromatin and elongation - Marks of active promoters and gene bodies

TargetFinder Feature Importance Most useful features for prediction are TF andhistone marks in the window between the enhancer and the promoter

bull True interactions - Enhancer-associated proteins P300 JUN TFs - Marks of heterochromatin lack of DNA methylation - Marks of paused or poised RNA polymerase

bull False interactions - Cohesin complex CTCF RAD21 SMC3 ZNF143 - Histone marks of open chromatin and elongation - Marks of active promoters and gene bodies

Many ldquowindowrdquo features have a different meaning when marking promoters and enhancers (eg cohesin)

Predictive features colocate and form complexes

K562

Predictive features colocate and form complexes

K562

Can TargetFinder work outsideENCODE cell lines

Can TargetFinder work outsideENCODE cell lines

What is a minimal set of experimentsfor accurate prediction

Can TargetFinder work outsideENCODE cell lines

What is a minimal set of experimentsfor accurate prediction optimal 16+

minimal 8

Can TargetFinder work outsideENCODE cell lines

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types EVALUATE

GM12878 K562 HeLa-S3 HUVEC

GM12878 083 040 043 039

K562 046 085 045 044

HeLa-S3 043 038 088 041

HUVEC 039 040 038 -

TRAIN

Fmax13 values

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types EVALUATE

GM12878 K562 HeLa-S3 HUVEC

GM12878 083 040 043 039

K562 046 085 045 044

HeLa-S3 043 038 088 041

HUVEC 039 040 038 -

TRAIN

Fmax13 values

Expect ~35 precision and 55 recall on anew cell type with ~10 ChIP-seq datasets

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Massive data integration improves predictionbull Closest gene

- Usually fails to identify the right promoter - Many false positives

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Massive data integration improves predictionbull Closest gene

- Usually fails to identify the right promoter - Many false positives

bull TargetFinder - Identifies 95-90 of known pairs (55 with less data) - Few false positives

Which human genome sequences function as long-range enhancers

13

13

13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

VISTA 13 Browser

minus

+

13

Training Data

+VISTA 13 Browser

EvoluDonary ConservaDon Human

ChimpFeatures Mouse

Rat

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

minus

13 13

13 13 13 13 13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

+VISTA 13 Browser

minus

EvoluDonary ConservaDon FuncDonal Genomics Human

ChimpFeatures Mouse

Rat

bullChIP-shy‐seq (TFs histones) bullDNase HypersensiDvity bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

13 13 13 13

13 13 13 13 13

13 13 13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

+VISTA 13 Browser

minus

EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

13 13 13 13

13 13 13 13 13

13 13 13

Training Data

+VISTA 13 Browser

minus

Computa6onal Algorithm

Support vector machine separates 2 groups 13 Multi-kernel good for combining heterogeneous data types with different weights

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

EnhancerFinder Performance

AUC=096 Power=85 at 10 FPR Recall=85 at 93 Precision 13 FDR ~10-50 Significantly better than other methods 13 gt80 in vivo validation rate

Erwin et al (2014) PLoS Comp Bio

EnhancerFinder Predictions

239 Predicted HAR13 Enhancers

Heart13 (28)

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

bull Identify sites with fitness effects (Gulko et al 2015) Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

HARs

Figure 2 Massively parallel functional testing of HARs For each HAR the chimp version and versions with all combinations of human mutations will be shotgun cloned along with synthetic 20bp barcodes into an enhancer assay vector This vector library will be introduced to human and chimp neuronal differentiated cells via lentivirus HAR sequences that function as enhancers in these cells will produce barcoded RNA Each unique HAR sequence will have an average of 90 tags to allow for a quantitative readout RNA sequencing (RNA-seq) will allow for the measurement of the relative activity of each mutation in the encoded HAR MinP (minimal promoter) mut (mutations)

Neuronal differentiated cells

Massively Parallel Reporter Assaysand capture Hi-C for validation

AGGT

MinP Luciferase

ACGT ACGT

TCGT ACGC

Enhancer vector

Transcribed barcodes

RNA-seq

Lentivirus

Text

bullTest gt12000 170bp enhancers in parallel bull Quantitative activity assayed via RNA-seq bull Compare genotypes bull Human vs chimp bull Disease SNPs

iPS derived cells

Enhancers Human

Chimp

Vector library

Hane Ryu Nadav Ahituv Jay Shendure Yin Shen

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Discovering the role of enhancers inhuman biology and evolution

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening In vivo molecular studies

Collaborators

EnhancerFinder Gen Haliburton Tony Capra Dennis Kostka John Rubenstein

TargetFinder Sean Whalen Rebecca Truty Benoit Bruneau

MotifDiverge Dennis Kostka Tara Friedrich Deb Ritter Jeff Chuang

Funding from NHLBI PhRMA Foundation Gladstone Institutes

Page 4: Interference of 3D Regulatory Interactions from 2D Genomic ... · Erwin et al. (2014) PLoS Comp Bio, Capra et al. (2014) PTRSB. EnhancerFinder: Predictions • 84,301 developmental

Most phenotype associatedmutations are outside coding regions

DNA

A C

Cardiomyopathy associated variant or human-chimp

difference

Most phenotype associatedmutations are outside coding regions

DNA

A C

T C

G A

Cardiomyopathy associated variant or human-chimp

difference

Most phenotype associatedmutations are outside coding regions

DNA

A C

T C

G A

Cardiomyopathy Gene A Gene B Gene C associated variant or human-chimp

difference

Most phenotype associatedmutations are outside coding regions

DNA

A C

T C

G A

Cardiomyopathy Gene A Gene B Gene C associated variant or human-chimp

difference

Cardiomyopathy associated variant or human-chimp

difference

Gene A Gene B

G A

Most phenotype associatedmutations are outside coding regions

A T C C

DNA Gene C

Cardiomyopathy associated variant or human-chimp

difference

Gene A Gene B

G A

Most phenotype associatedmutations are outside coding regions

A T C C

DNA Gene C

bull Enhancers individual ChIP-seq data sets identify lt50 ofknown enhancers plus many false positives13

bull Gene targets closest gene is right ~10 of time

Can we reconstruct 3D interactions between enhancers and promoters

from 2D genomic data 13

13

13

TargetFinder Feature Importance

TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features

Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution

13

13 13 13 13

TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features

Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution

EvoluDonary ConservaDon Human

ChimpFeatures Mouse

Rat

Conserved13 synteny of13 enhancer and promoter

13 13

13 13 13 13 13

13 13 13 13

TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features

Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution

EvoluDonary ConservaDon FuncDonal Genomics Human

ChimpFeatures Mouse

Rat

Conserved13 synteny of13 bullChIA-shy‐PET enhancer and promoter bullRNA-shy‐seq

bullChIP-shy‐seq (TFs histones) bullenhancerpromoterwindowbullChromHMM13 amp Segway

13 13 13

13 13 13 13 13

13 13 13 13 13 13 13 13 13 13 13

13 13 13 13

TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features

Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution

EvoluDonary ConservaDon FuncDonal Genomics Sequence AnnotaDons Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

Conserved13 synteny of13 bullChIA-shy‐PET K-shy‐mer correlaDon enhancer and promoter bullRNA-shy‐seq Annotated funcDons and

bullChIP-shy‐seq (TFs histones) pathways of gene and bullenhancerpromoterwindow enhancer bound TFs bullChromHMM13 amp Segway

13 13 13

13 13 13 13 13

13 13 13 13 13 13 13 13 13 13 13

13 13 13 13

TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features

Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution

Computa6onal Algorithm

Decision trees good for interacting features 13 Ensemble learning build many imperfect classifiers and combine them to improve prediction accuracy

EvoluDonary ConservaDon FuncDonal Genomics Sequence AnnotaDons Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

Conserved13 synteny of13 bullChIA-shy‐PET K-shy‐mer correlaDon enhancer and promoter bullRNA-shy‐seq Annotated funcDons and

bullChIP-shy‐seq (TFs histones) pathways of gene and bullenhancerpromoterwindow enhancer bound TFs bullChromHMM13 amp Segway

TargetFinder Performance AUC=094-096 Precision

=90-95 Recall=76-83 Power=85-89 at 10 FPR 13 Significantly better than random and logistic regression

TargetFinder performs well at verylong distances

TargetFinder Feature Importance

TargetFinder Feature Importance Most predictive features mark the window between the enhancer and the promoter

TargetFinder Feature Importance Most predictive features mark the window between the enhancer and the promoter

TargetFinder Feature Importance Most predictive features mark the window between the enhancer and the promoter

TargetFinder Feature Importance

TargetFinder Feature Importance Most useful features for prediction are TF andhistone marks in the window between the enhancer and the promoter

bull True interactions - Enhancer-associated proteins P300 JUN TFs - Marks of heterochromatin lack of DNA methylation - Marks of paused or poised RNA polymerase

TargetFinder Feature Importance Most useful features for prediction are TF andhistone marks in the window between the enhancer and the promoter

bull True interactions - Enhancer-associated proteins P300 JUN TFs - Marks of heterochromatin lack of DNA methylation - Marks of paused or poised RNA polymerase

bull False interactions - Cohesin complex CTCF RAD21 SMC3 ZNF143 - Histone marks of open chromatin and elongation - Marks of active promoters and gene bodies

TargetFinder Feature Importance Most useful features for prediction are TF andhistone marks in the window between the enhancer and the promoter

bull True interactions - Enhancer-associated proteins P300 JUN TFs - Marks of heterochromatin lack of DNA methylation - Marks of paused or poised RNA polymerase

bull False interactions - Cohesin complex CTCF RAD21 SMC3 ZNF143 - Histone marks of open chromatin and elongation - Marks of active promoters and gene bodies

Many ldquowindowrdquo features have a different meaning when marking promoters and enhancers (eg cohesin)

Predictive features colocate and form complexes

K562

Predictive features colocate and form complexes

K562

Can TargetFinder work outsideENCODE cell lines

Can TargetFinder work outsideENCODE cell lines

What is a minimal set of experimentsfor accurate prediction

Can TargetFinder work outsideENCODE cell lines

What is a minimal set of experimentsfor accurate prediction optimal 16+

minimal 8

Can TargetFinder work outsideENCODE cell lines

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types EVALUATE

GM12878 K562 HeLa-S3 HUVEC

GM12878 083 040 043 039

K562 046 085 045 044

HeLa-S3 043 038 088 041

HUVEC 039 040 038 -

TRAIN

Fmax13 values

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types EVALUATE

GM12878 K562 HeLa-S3 HUVEC

GM12878 083 040 043 039

K562 046 085 045 044

HeLa-S3 043 038 088 041

HUVEC 039 040 038 -

TRAIN

Fmax13 values

Expect ~35 precision and 55 recall on anew cell type with ~10 ChIP-seq datasets

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Massive data integration improves predictionbull Closest gene

- Usually fails to identify the right promoter - Many false positives

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Massive data integration improves predictionbull Closest gene

- Usually fails to identify the right promoter - Many false positives

bull TargetFinder - Identifies 95-90 of known pairs (55 with less data) - Few false positives

Which human genome sequences function as long-range enhancers

13

13

13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

VISTA 13 Browser

minus

+

13

Training Data

+VISTA 13 Browser

EvoluDonary ConservaDon Human

ChimpFeatures Mouse

Rat

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

minus

13 13

13 13 13 13 13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

+VISTA 13 Browser

minus

EvoluDonary ConservaDon FuncDonal Genomics Human

ChimpFeatures Mouse

Rat

bullChIP-shy‐seq (TFs histones) bullDNase HypersensiDvity bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

13 13 13 13

13 13 13 13 13

13 13 13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

+VISTA 13 Browser

minus

EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

13 13 13 13

13 13 13 13 13

13 13 13

Training Data

+VISTA 13 Browser

minus

Computa6onal Algorithm

Support vector machine separates 2 groups 13 Multi-kernel good for combining heterogeneous data types with different weights

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

EnhancerFinder Performance

AUC=096 Power=85 at 10 FPR Recall=85 at 93 Precision 13 FDR ~10-50 Significantly better than other methods 13 gt80 in vivo validation rate

Erwin et al (2014) PLoS Comp Bio

EnhancerFinder Predictions

239 Predicted HAR13 Enhancers

Heart13 (28)

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

bull Identify sites with fitness effects (Gulko et al 2015) Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

HARs

Figure 2 Massively parallel functional testing of HARs For each HAR the chimp version and versions with all combinations of human mutations will be shotgun cloned along with synthetic 20bp barcodes into an enhancer assay vector This vector library will be introduced to human and chimp neuronal differentiated cells via lentivirus HAR sequences that function as enhancers in these cells will produce barcoded RNA Each unique HAR sequence will have an average of 90 tags to allow for a quantitative readout RNA sequencing (RNA-seq) will allow for the measurement of the relative activity of each mutation in the encoded HAR MinP (minimal promoter) mut (mutations)

Neuronal differentiated cells

Massively Parallel Reporter Assaysand capture Hi-C for validation

AGGT

MinP Luciferase

ACGT ACGT

TCGT ACGC

Enhancer vector

Transcribed barcodes

RNA-seq

Lentivirus

Text

bullTest gt12000 170bp enhancers in parallel bull Quantitative activity assayed via RNA-seq bull Compare genotypes bull Human vs chimp bull Disease SNPs

iPS derived cells

Enhancers Human

Chimp

Vector library

Hane Ryu Nadav Ahituv Jay Shendure Yin Shen

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Discovering the role of enhancers inhuman biology and evolution

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening In vivo molecular studies

Collaborators

EnhancerFinder Gen Haliburton Tony Capra Dennis Kostka John Rubenstein

TargetFinder Sean Whalen Rebecca Truty Benoit Bruneau

MotifDiverge Dennis Kostka Tara Friedrich Deb Ritter Jeff Chuang

Funding from NHLBI PhRMA Foundation Gladstone Institutes

Page 5: Interference of 3D Regulatory Interactions from 2D Genomic ... · Erwin et al. (2014) PLoS Comp Bio, Capra et al. (2014) PTRSB. EnhancerFinder: Predictions • 84,301 developmental

Most phenotype associatedmutations are outside coding regions

DNA

A C

T C

G A

Cardiomyopathy associated variant or human-chimp

difference

Most phenotype associatedmutations are outside coding regions

DNA

A C

T C

G A

Cardiomyopathy Gene A Gene B Gene C associated variant or human-chimp

difference

Most phenotype associatedmutations are outside coding regions

DNA

A C

T C

G A

Cardiomyopathy Gene A Gene B Gene C associated variant or human-chimp

difference

Cardiomyopathy associated variant or human-chimp

difference

Gene A Gene B

G A

Most phenotype associatedmutations are outside coding regions

A T C C

DNA Gene C

Cardiomyopathy associated variant or human-chimp

difference

Gene A Gene B

G A

Most phenotype associatedmutations are outside coding regions

A T C C

DNA Gene C

bull Enhancers individual ChIP-seq data sets identify lt50 ofknown enhancers plus many false positives13

bull Gene targets closest gene is right ~10 of time

Can we reconstruct 3D interactions between enhancers and promoters

from 2D genomic data 13

13

13

TargetFinder Feature Importance

TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features

Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution

13

13 13 13 13

TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features

Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution

EvoluDonary ConservaDon Human

ChimpFeatures Mouse

Rat

Conserved13 synteny of13 enhancer and promoter

13 13

13 13 13 13 13

13 13 13 13

TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features

Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution

EvoluDonary ConservaDon FuncDonal Genomics Human

ChimpFeatures Mouse

Rat

Conserved13 synteny of13 bullChIA-shy‐PET enhancer and promoter bullRNA-shy‐seq

bullChIP-shy‐seq (TFs histones) bullenhancerpromoterwindowbullChromHMM13 amp Segway

13 13 13

13 13 13 13 13

13 13 13 13 13 13 13 13 13 13 13

13 13 13 13

TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features

Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution

EvoluDonary ConservaDon FuncDonal Genomics Sequence AnnotaDons Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

Conserved13 synteny of13 bullChIA-shy‐PET K-shy‐mer correlaDon enhancer and promoter bullRNA-shy‐seq Annotated funcDons and

bullChIP-shy‐seq (TFs histones) pathways of gene and bullenhancerpromoterwindow enhancer bound TFs bullChromHMM13 amp Segway

13 13 13

13 13 13 13 13

13 13 13 13 13 13 13 13 13 13 13

13 13 13 13

TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features

Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution

Computa6onal Algorithm

Decision trees good for interacting features 13 Ensemble learning build many imperfect classifiers and combine them to improve prediction accuracy

EvoluDonary ConservaDon FuncDonal Genomics Sequence AnnotaDons Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

Conserved13 synteny of13 bullChIA-shy‐PET K-shy‐mer correlaDon enhancer and promoter bullRNA-shy‐seq Annotated funcDons and

bullChIP-shy‐seq (TFs histones) pathways of gene and bullenhancerpromoterwindow enhancer bound TFs bullChromHMM13 amp Segway

TargetFinder Performance AUC=094-096 Precision

=90-95 Recall=76-83 Power=85-89 at 10 FPR 13 Significantly better than random and logistic regression

TargetFinder performs well at verylong distances

TargetFinder Feature Importance

TargetFinder Feature Importance Most predictive features mark the window between the enhancer and the promoter

TargetFinder Feature Importance Most predictive features mark the window between the enhancer and the promoter

TargetFinder Feature Importance Most predictive features mark the window between the enhancer and the promoter

TargetFinder Feature Importance

TargetFinder Feature Importance Most useful features for prediction are TF andhistone marks in the window between the enhancer and the promoter

bull True interactions - Enhancer-associated proteins P300 JUN TFs - Marks of heterochromatin lack of DNA methylation - Marks of paused or poised RNA polymerase

TargetFinder Feature Importance Most useful features for prediction are TF andhistone marks in the window between the enhancer and the promoter

bull True interactions - Enhancer-associated proteins P300 JUN TFs - Marks of heterochromatin lack of DNA methylation - Marks of paused or poised RNA polymerase

bull False interactions - Cohesin complex CTCF RAD21 SMC3 ZNF143 - Histone marks of open chromatin and elongation - Marks of active promoters and gene bodies

TargetFinder Feature Importance Most useful features for prediction are TF andhistone marks in the window between the enhancer and the promoter

bull True interactions - Enhancer-associated proteins P300 JUN TFs - Marks of heterochromatin lack of DNA methylation - Marks of paused or poised RNA polymerase

bull False interactions - Cohesin complex CTCF RAD21 SMC3 ZNF143 - Histone marks of open chromatin and elongation - Marks of active promoters and gene bodies

Many ldquowindowrdquo features have a different meaning when marking promoters and enhancers (eg cohesin)

Predictive features colocate and form complexes

K562

Predictive features colocate and form complexes

K562

Can TargetFinder work outsideENCODE cell lines

Can TargetFinder work outsideENCODE cell lines

What is a minimal set of experimentsfor accurate prediction

Can TargetFinder work outsideENCODE cell lines

What is a minimal set of experimentsfor accurate prediction optimal 16+

minimal 8

Can TargetFinder work outsideENCODE cell lines

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types EVALUATE

GM12878 K562 HeLa-S3 HUVEC

GM12878 083 040 043 039

K562 046 085 045 044

HeLa-S3 043 038 088 041

HUVEC 039 040 038 -

TRAIN

Fmax13 values

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types EVALUATE

GM12878 K562 HeLa-S3 HUVEC

GM12878 083 040 043 039

K562 046 085 045 044

HeLa-S3 043 038 088 041

HUVEC 039 040 038 -

TRAIN

Fmax13 values

Expect ~35 precision and 55 recall on anew cell type with ~10 ChIP-seq datasets

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Massive data integration improves predictionbull Closest gene

- Usually fails to identify the right promoter - Many false positives

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Massive data integration improves predictionbull Closest gene

- Usually fails to identify the right promoter - Many false positives

bull TargetFinder - Identifies 95-90 of known pairs (55 with less data) - Few false positives

Which human genome sequences function as long-range enhancers

13

13

13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

VISTA 13 Browser

minus

+

13

Training Data

+VISTA 13 Browser

EvoluDonary ConservaDon Human

ChimpFeatures Mouse

Rat

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

minus

13 13

13 13 13 13 13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

+VISTA 13 Browser

minus

EvoluDonary ConservaDon FuncDonal Genomics Human

ChimpFeatures Mouse

Rat

bullChIP-shy‐seq (TFs histones) bullDNase HypersensiDvity bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

13 13 13 13

13 13 13 13 13

13 13 13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

+VISTA 13 Browser

minus

EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

13 13 13 13

13 13 13 13 13

13 13 13

Training Data

+VISTA 13 Browser

minus

Computa6onal Algorithm

Support vector machine separates 2 groups 13 Multi-kernel good for combining heterogeneous data types with different weights

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

EnhancerFinder Performance

AUC=096 Power=85 at 10 FPR Recall=85 at 93 Precision 13 FDR ~10-50 Significantly better than other methods 13 gt80 in vivo validation rate

Erwin et al (2014) PLoS Comp Bio

EnhancerFinder Predictions

239 Predicted HAR13 Enhancers

Heart13 (28)

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

bull Identify sites with fitness effects (Gulko et al 2015) Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

HARs

Figure 2 Massively parallel functional testing of HARs For each HAR the chimp version and versions with all combinations of human mutations will be shotgun cloned along with synthetic 20bp barcodes into an enhancer assay vector This vector library will be introduced to human and chimp neuronal differentiated cells via lentivirus HAR sequences that function as enhancers in these cells will produce barcoded RNA Each unique HAR sequence will have an average of 90 tags to allow for a quantitative readout RNA sequencing (RNA-seq) will allow for the measurement of the relative activity of each mutation in the encoded HAR MinP (minimal promoter) mut (mutations)

Neuronal differentiated cells

Massively Parallel Reporter Assaysand capture Hi-C for validation

AGGT

MinP Luciferase

ACGT ACGT

TCGT ACGC

Enhancer vector

Transcribed barcodes

RNA-seq

Lentivirus

Text

bullTest gt12000 170bp enhancers in parallel bull Quantitative activity assayed via RNA-seq bull Compare genotypes bull Human vs chimp bull Disease SNPs

iPS derived cells

Enhancers Human

Chimp

Vector library

Hane Ryu Nadav Ahituv Jay Shendure Yin Shen

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Discovering the role of enhancers inhuman biology and evolution

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening In vivo molecular studies

Collaborators

EnhancerFinder Gen Haliburton Tony Capra Dennis Kostka John Rubenstein

TargetFinder Sean Whalen Rebecca Truty Benoit Bruneau

MotifDiverge Dennis Kostka Tara Friedrich Deb Ritter Jeff Chuang

Funding from NHLBI PhRMA Foundation Gladstone Institutes

Page 6: Interference of 3D Regulatory Interactions from 2D Genomic ... · Erwin et al. (2014) PLoS Comp Bio, Capra et al. (2014) PTRSB. EnhancerFinder: Predictions • 84,301 developmental

Most phenotype associatedmutations are outside coding regions

DNA

A C

T C

G A

Cardiomyopathy Gene A Gene B Gene C associated variant or human-chimp

difference

Most phenotype associatedmutations are outside coding regions

DNA

A C

T C

G A

Cardiomyopathy Gene A Gene B Gene C associated variant or human-chimp

difference

Cardiomyopathy associated variant or human-chimp

difference

Gene A Gene B

G A

Most phenotype associatedmutations are outside coding regions

A T C C

DNA Gene C

Cardiomyopathy associated variant or human-chimp

difference

Gene A Gene B

G A

Most phenotype associatedmutations are outside coding regions

A T C C

DNA Gene C

bull Enhancers individual ChIP-seq data sets identify lt50 ofknown enhancers plus many false positives13

bull Gene targets closest gene is right ~10 of time

Can we reconstruct 3D interactions between enhancers and promoters

from 2D genomic data 13

13

13

TargetFinder Feature Importance

TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features

Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution

13

13 13 13 13

TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features

Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution

EvoluDonary ConservaDon Human

ChimpFeatures Mouse

Rat

Conserved13 synteny of13 enhancer and promoter

13 13

13 13 13 13 13

13 13 13 13

TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features

Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution

EvoluDonary ConservaDon FuncDonal Genomics Human

ChimpFeatures Mouse

Rat

Conserved13 synteny of13 bullChIA-shy‐PET enhancer and promoter bullRNA-shy‐seq

bullChIP-shy‐seq (TFs histones) bullenhancerpromoterwindowbullChromHMM13 amp Segway

13 13 13

13 13 13 13 13

13 13 13 13 13 13 13 13 13 13 13

13 13 13 13

TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features

Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution

EvoluDonary ConservaDon FuncDonal Genomics Sequence AnnotaDons Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

Conserved13 synteny of13 bullChIA-shy‐PET K-shy‐mer correlaDon enhancer and promoter bullRNA-shy‐seq Annotated funcDons and

bullChIP-shy‐seq (TFs histones) pathways of gene and bullenhancerpromoterwindow enhancer bound TFs bullChromHMM13 amp Segway

13 13 13

13 13 13 13 13

13 13 13 13 13 13 13 13 13 13 13

13 13 13 13

TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features

Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution

Computa6onal Algorithm

Decision trees good for interacting features 13 Ensemble learning build many imperfect classifiers and combine them to improve prediction accuracy

EvoluDonary ConservaDon FuncDonal Genomics Sequence AnnotaDons Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

Conserved13 synteny of13 bullChIA-shy‐PET K-shy‐mer correlaDon enhancer and promoter bullRNA-shy‐seq Annotated funcDons and

bullChIP-shy‐seq (TFs histones) pathways of gene and bullenhancerpromoterwindow enhancer bound TFs bullChromHMM13 amp Segway

TargetFinder Performance AUC=094-096 Precision

=90-95 Recall=76-83 Power=85-89 at 10 FPR 13 Significantly better than random and logistic regression

TargetFinder performs well at verylong distances

TargetFinder Feature Importance

TargetFinder Feature Importance Most predictive features mark the window between the enhancer and the promoter

TargetFinder Feature Importance Most predictive features mark the window between the enhancer and the promoter

TargetFinder Feature Importance Most predictive features mark the window between the enhancer and the promoter

TargetFinder Feature Importance

TargetFinder Feature Importance Most useful features for prediction are TF andhistone marks in the window between the enhancer and the promoter

bull True interactions - Enhancer-associated proteins P300 JUN TFs - Marks of heterochromatin lack of DNA methylation - Marks of paused or poised RNA polymerase

TargetFinder Feature Importance Most useful features for prediction are TF andhistone marks in the window between the enhancer and the promoter

bull True interactions - Enhancer-associated proteins P300 JUN TFs - Marks of heterochromatin lack of DNA methylation - Marks of paused or poised RNA polymerase

bull False interactions - Cohesin complex CTCF RAD21 SMC3 ZNF143 - Histone marks of open chromatin and elongation - Marks of active promoters and gene bodies

TargetFinder Feature Importance Most useful features for prediction are TF andhistone marks in the window between the enhancer and the promoter

bull True interactions - Enhancer-associated proteins P300 JUN TFs - Marks of heterochromatin lack of DNA methylation - Marks of paused or poised RNA polymerase

bull False interactions - Cohesin complex CTCF RAD21 SMC3 ZNF143 - Histone marks of open chromatin and elongation - Marks of active promoters and gene bodies

Many ldquowindowrdquo features have a different meaning when marking promoters and enhancers (eg cohesin)

Predictive features colocate and form complexes

K562

Predictive features colocate and form complexes

K562

Can TargetFinder work outsideENCODE cell lines

Can TargetFinder work outsideENCODE cell lines

What is a minimal set of experimentsfor accurate prediction

Can TargetFinder work outsideENCODE cell lines

What is a minimal set of experimentsfor accurate prediction optimal 16+

minimal 8

Can TargetFinder work outsideENCODE cell lines

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types EVALUATE

GM12878 K562 HeLa-S3 HUVEC

GM12878 083 040 043 039

K562 046 085 045 044

HeLa-S3 043 038 088 041

HUVEC 039 040 038 -

TRAIN

Fmax13 values

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types EVALUATE

GM12878 K562 HeLa-S3 HUVEC

GM12878 083 040 043 039

K562 046 085 045 044

HeLa-S3 043 038 088 041

HUVEC 039 040 038 -

TRAIN

Fmax13 values

Expect ~35 precision and 55 recall on anew cell type with ~10 ChIP-seq datasets

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Massive data integration improves predictionbull Closest gene

- Usually fails to identify the right promoter - Many false positives

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Massive data integration improves predictionbull Closest gene

- Usually fails to identify the right promoter - Many false positives

bull TargetFinder - Identifies 95-90 of known pairs (55 with less data) - Few false positives

Which human genome sequences function as long-range enhancers

13

13

13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

VISTA 13 Browser

minus

+

13

Training Data

+VISTA 13 Browser

EvoluDonary ConservaDon Human

ChimpFeatures Mouse

Rat

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

minus

13 13

13 13 13 13 13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

+VISTA 13 Browser

minus

EvoluDonary ConservaDon FuncDonal Genomics Human

ChimpFeatures Mouse

Rat

bullChIP-shy‐seq (TFs histones) bullDNase HypersensiDvity bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

13 13 13 13

13 13 13 13 13

13 13 13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

+VISTA 13 Browser

minus

EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

13 13 13 13

13 13 13 13 13

13 13 13

Training Data

+VISTA 13 Browser

minus

Computa6onal Algorithm

Support vector machine separates 2 groups 13 Multi-kernel good for combining heterogeneous data types with different weights

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

EnhancerFinder Performance

AUC=096 Power=85 at 10 FPR Recall=85 at 93 Precision 13 FDR ~10-50 Significantly better than other methods 13 gt80 in vivo validation rate

Erwin et al (2014) PLoS Comp Bio

EnhancerFinder Predictions

239 Predicted HAR13 Enhancers

Heart13 (28)

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

bull Identify sites with fitness effects (Gulko et al 2015) Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

HARs

Figure 2 Massively parallel functional testing of HARs For each HAR the chimp version and versions with all combinations of human mutations will be shotgun cloned along with synthetic 20bp barcodes into an enhancer assay vector This vector library will be introduced to human and chimp neuronal differentiated cells via lentivirus HAR sequences that function as enhancers in these cells will produce barcoded RNA Each unique HAR sequence will have an average of 90 tags to allow for a quantitative readout RNA sequencing (RNA-seq) will allow for the measurement of the relative activity of each mutation in the encoded HAR MinP (minimal promoter) mut (mutations)

Neuronal differentiated cells

Massively Parallel Reporter Assaysand capture Hi-C for validation

AGGT

MinP Luciferase

ACGT ACGT

TCGT ACGC

Enhancer vector

Transcribed barcodes

RNA-seq

Lentivirus

Text

bullTest gt12000 170bp enhancers in parallel bull Quantitative activity assayed via RNA-seq bull Compare genotypes bull Human vs chimp bull Disease SNPs

iPS derived cells

Enhancers Human

Chimp

Vector library

Hane Ryu Nadav Ahituv Jay Shendure Yin Shen

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Discovering the role of enhancers inhuman biology and evolution

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening In vivo molecular studies

Collaborators

EnhancerFinder Gen Haliburton Tony Capra Dennis Kostka John Rubenstein

TargetFinder Sean Whalen Rebecca Truty Benoit Bruneau

MotifDiverge Dennis Kostka Tara Friedrich Deb Ritter Jeff Chuang

Funding from NHLBI PhRMA Foundation Gladstone Institutes

Page 7: Interference of 3D Regulatory Interactions from 2D Genomic ... · Erwin et al. (2014) PLoS Comp Bio, Capra et al. (2014) PTRSB. EnhancerFinder: Predictions • 84,301 developmental

Most phenotype associatedmutations are outside coding regions

DNA

A C

T C

G A

Cardiomyopathy Gene A Gene B Gene C associated variant or human-chimp

difference

Cardiomyopathy associated variant or human-chimp

difference

Gene A Gene B

G A

Most phenotype associatedmutations are outside coding regions

A T C C

DNA Gene C

Cardiomyopathy associated variant or human-chimp

difference

Gene A Gene B

G A

Most phenotype associatedmutations are outside coding regions

A T C C

DNA Gene C

bull Enhancers individual ChIP-seq data sets identify lt50 ofknown enhancers plus many false positives13

bull Gene targets closest gene is right ~10 of time

Can we reconstruct 3D interactions between enhancers and promoters

from 2D genomic data 13

13

13

TargetFinder Feature Importance

TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features

Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution

13

13 13 13 13

TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features

Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution

EvoluDonary ConservaDon Human

ChimpFeatures Mouse

Rat

Conserved13 synteny of13 enhancer and promoter

13 13

13 13 13 13 13

13 13 13 13

TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features

Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution

EvoluDonary ConservaDon FuncDonal Genomics Human

ChimpFeatures Mouse

Rat

Conserved13 synteny of13 bullChIA-shy‐PET enhancer and promoter bullRNA-shy‐seq

bullChIP-shy‐seq (TFs histones) bullenhancerpromoterwindowbullChromHMM13 amp Segway

13 13 13

13 13 13 13 13

13 13 13 13 13 13 13 13 13 13 13

13 13 13 13

TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features

Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution

EvoluDonary ConservaDon FuncDonal Genomics Sequence AnnotaDons Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

Conserved13 synteny of13 bullChIA-shy‐PET K-shy‐mer correlaDon enhancer and promoter bullRNA-shy‐seq Annotated funcDons and

bullChIP-shy‐seq (TFs histones) pathways of gene and bullenhancerpromoterwindow enhancer bound TFs bullChromHMM13 amp Segway

13 13 13

13 13 13 13 13

13 13 13 13 13 13 13 13 13 13 13

13 13 13 13

TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features

Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution

Computa6onal Algorithm

Decision trees good for interacting features 13 Ensemble learning build many imperfect classifiers and combine them to improve prediction accuracy

EvoluDonary ConservaDon FuncDonal Genomics Sequence AnnotaDons Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

Conserved13 synteny of13 bullChIA-shy‐PET K-shy‐mer correlaDon enhancer and promoter bullRNA-shy‐seq Annotated funcDons and

bullChIP-shy‐seq (TFs histones) pathways of gene and bullenhancerpromoterwindow enhancer bound TFs bullChromHMM13 amp Segway

TargetFinder Performance AUC=094-096 Precision

=90-95 Recall=76-83 Power=85-89 at 10 FPR 13 Significantly better than random and logistic regression

TargetFinder performs well at verylong distances

TargetFinder Feature Importance

TargetFinder Feature Importance Most predictive features mark the window between the enhancer and the promoter

TargetFinder Feature Importance Most predictive features mark the window between the enhancer and the promoter

TargetFinder Feature Importance Most predictive features mark the window between the enhancer and the promoter

TargetFinder Feature Importance

TargetFinder Feature Importance Most useful features for prediction are TF andhistone marks in the window between the enhancer and the promoter

bull True interactions - Enhancer-associated proteins P300 JUN TFs - Marks of heterochromatin lack of DNA methylation - Marks of paused or poised RNA polymerase

TargetFinder Feature Importance Most useful features for prediction are TF andhistone marks in the window between the enhancer and the promoter

bull True interactions - Enhancer-associated proteins P300 JUN TFs - Marks of heterochromatin lack of DNA methylation - Marks of paused or poised RNA polymerase

bull False interactions - Cohesin complex CTCF RAD21 SMC3 ZNF143 - Histone marks of open chromatin and elongation - Marks of active promoters and gene bodies

TargetFinder Feature Importance Most useful features for prediction are TF andhistone marks in the window between the enhancer and the promoter

bull True interactions - Enhancer-associated proteins P300 JUN TFs - Marks of heterochromatin lack of DNA methylation - Marks of paused or poised RNA polymerase

bull False interactions - Cohesin complex CTCF RAD21 SMC3 ZNF143 - Histone marks of open chromatin and elongation - Marks of active promoters and gene bodies

Many ldquowindowrdquo features have a different meaning when marking promoters and enhancers (eg cohesin)

Predictive features colocate and form complexes

K562

Predictive features colocate and form complexes

K562

Can TargetFinder work outsideENCODE cell lines

Can TargetFinder work outsideENCODE cell lines

What is a minimal set of experimentsfor accurate prediction

Can TargetFinder work outsideENCODE cell lines

What is a minimal set of experimentsfor accurate prediction optimal 16+

minimal 8

Can TargetFinder work outsideENCODE cell lines

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types EVALUATE

GM12878 K562 HeLa-S3 HUVEC

GM12878 083 040 043 039

K562 046 085 045 044

HeLa-S3 043 038 088 041

HUVEC 039 040 038 -

TRAIN

Fmax13 values

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types EVALUATE

GM12878 K562 HeLa-S3 HUVEC

GM12878 083 040 043 039

K562 046 085 045 044

HeLa-S3 043 038 088 041

HUVEC 039 040 038 -

TRAIN

Fmax13 values

Expect ~35 precision and 55 recall on anew cell type with ~10 ChIP-seq datasets

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Massive data integration improves predictionbull Closest gene

- Usually fails to identify the right promoter - Many false positives

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Massive data integration improves predictionbull Closest gene

- Usually fails to identify the right promoter - Many false positives

bull TargetFinder - Identifies 95-90 of known pairs (55 with less data) - Few false positives

Which human genome sequences function as long-range enhancers

13

13

13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

VISTA 13 Browser

minus

+

13

Training Data

+VISTA 13 Browser

EvoluDonary ConservaDon Human

ChimpFeatures Mouse

Rat

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

minus

13 13

13 13 13 13 13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

+VISTA 13 Browser

minus

EvoluDonary ConservaDon FuncDonal Genomics Human

ChimpFeatures Mouse

Rat

bullChIP-shy‐seq (TFs histones) bullDNase HypersensiDvity bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

13 13 13 13

13 13 13 13 13

13 13 13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

+VISTA 13 Browser

minus

EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

13 13 13 13

13 13 13 13 13

13 13 13

Training Data

+VISTA 13 Browser

minus

Computa6onal Algorithm

Support vector machine separates 2 groups 13 Multi-kernel good for combining heterogeneous data types with different weights

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

EnhancerFinder Performance

AUC=096 Power=85 at 10 FPR Recall=85 at 93 Precision 13 FDR ~10-50 Significantly better than other methods 13 gt80 in vivo validation rate

Erwin et al (2014) PLoS Comp Bio

EnhancerFinder Predictions

239 Predicted HAR13 Enhancers

Heart13 (28)

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

bull Identify sites with fitness effects (Gulko et al 2015) Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

HARs

Figure 2 Massively parallel functional testing of HARs For each HAR the chimp version and versions with all combinations of human mutations will be shotgun cloned along with synthetic 20bp barcodes into an enhancer assay vector This vector library will be introduced to human and chimp neuronal differentiated cells via lentivirus HAR sequences that function as enhancers in these cells will produce barcoded RNA Each unique HAR sequence will have an average of 90 tags to allow for a quantitative readout RNA sequencing (RNA-seq) will allow for the measurement of the relative activity of each mutation in the encoded HAR MinP (minimal promoter) mut (mutations)

Neuronal differentiated cells

Massively Parallel Reporter Assaysand capture Hi-C for validation

AGGT

MinP Luciferase

ACGT ACGT

TCGT ACGC

Enhancer vector

Transcribed barcodes

RNA-seq

Lentivirus

Text

bullTest gt12000 170bp enhancers in parallel bull Quantitative activity assayed via RNA-seq bull Compare genotypes bull Human vs chimp bull Disease SNPs

iPS derived cells

Enhancers Human

Chimp

Vector library

Hane Ryu Nadav Ahituv Jay Shendure Yin Shen

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Discovering the role of enhancers inhuman biology and evolution

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening In vivo molecular studies

Collaborators

EnhancerFinder Gen Haliburton Tony Capra Dennis Kostka John Rubenstein

TargetFinder Sean Whalen Rebecca Truty Benoit Bruneau

MotifDiverge Dennis Kostka Tara Friedrich Deb Ritter Jeff Chuang

Funding from NHLBI PhRMA Foundation Gladstone Institutes

Page 8: Interference of 3D Regulatory Interactions from 2D Genomic ... · Erwin et al. (2014) PLoS Comp Bio, Capra et al. (2014) PTRSB. EnhancerFinder: Predictions • 84,301 developmental

Cardiomyopathy associated variant or human-chimp

difference

Gene A Gene B

G A

Most phenotype associatedmutations are outside coding regions

A T C C

DNA Gene C

Cardiomyopathy associated variant or human-chimp

difference

Gene A Gene B

G A

Most phenotype associatedmutations are outside coding regions

A T C C

DNA Gene C

bull Enhancers individual ChIP-seq data sets identify lt50 ofknown enhancers plus many false positives13

bull Gene targets closest gene is right ~10 of time

Can we reconstruct 3D interactions between enhancers and promoters

from 2D genomic data 13

13

13

TargetFinder Feature Importance

TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features

Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution

13

13 13 13 13

TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features

Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution

EvoluDonary ConservaDon Human

ChimpFeatures Mouse

Rat

Conserved13 synteny of13 enhancer and promoter

13 13

13 13 13 13 13

13 13 13 13

TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features

Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution

EvoluDonary ConservaDon FuncDonal Genomics Human

ChimpFeatures Mouse

Rat

Conserved13 synteny of13 bullChIA-shy‐PET enhancer and promoter bullRNA-shy‐seq

bullChIP-shy‐seq (TFs histones) bullenhancerpromoterwindowbullChromHMM13 amp Segway

13 13 13

13 13 13 13 13

13 13 13 13 13 13 13 13 13 13 13

13 13 13 13

TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features

Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution

EvoluDonary ConservaDon FuncDonal Genomics Sequence AnnotaDons Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

Conserved13 synteny of13 bullChIA-shy‐PET K-shy‐mer correlaDon enhancer and promoter bullRNA-shy‐seq Annotated funcDons and

bullChIP-shy‐seq (TFs histones) pathways of gene and bullenhancerpromoterwindow enhancer bound TFs bullChromHMM13 amp Segway

13 13 13

13 13 13 13 13

13 13 13 13 13 13 13 13 13 13 13

13 13 13 13

TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features

Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution

Computa6onal Algorithm

Decision trees good for interacting features 13 Ensemble learning build many imperfect classifiers and combine them to improve prediction accuracy

EvoluDonary ConservaDon FuncDonal Genomics Sequence AnnotaDons Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

Conserved13 synteny of13 bullChIA-shy‐PET K-shy‐mer correlaDon enhancer and promoter bullRNA-shy‐seq Annotated funcDons and

bullChIP-shy‐seq (TFs histones) pathways of gene and bullenhancerpromoterwindow enhancer bound TFs bullChromHMM13 amp Segway

TargetFinder Performance AUC=094-096 Precision

=90-95 Recall=76-83 Power=85-89 at 10 FPR 13 Significantly better than random and logistic regression

TargetFinder performs well at verylong distances

TargetFinder Feature Importance

TargetFinder Feature Importance Most predictive features mark the window between the enhancer and the promoter

TargetFinder Feature Importance Most predictive features mark the window between the enhancer and the promoter

TargetFinder Feature Importance Most predictive features mark the window between the enhancer and the promoter

TargetFinder Feature Importance

TargetFinder Feature Importance Most useful features for prediction are TF andhistone marks in the window between the enhancer and the promoter

bull True interactions - Enhancer-associated proteins P300 JUN TFs - Marks of heterochromatin lack of DNA methylation - Marks of paused or poised RNA polymerase

TargetFinder Feature Importance Most useful features for prediction are TF andhistone marks in the window between the enhancer and the promoter

bull True interactions - Enhancer-associated proteins P300 JUN TFs - Marks of heterochromatin lack of DNA methylation - Marks of paused or poised RNA polymerase

bull False interactions - Cohesin complex CTCF RAD21 SMC3 ZNF143 - Histone marks of open chromatin and elongation - Marks of active promoters and gene bodies

TargetFinder Feature Importance Most useful features for prediction are TF andhistone marks in the window between the enhancer and the promoter

bull True interactions - Enhancer-associated proteins P300 JUN TFs - Marks of heterochromatin lack of DNA methylation - Marks of paused or poised RNA polymerase

bull False interactions - Cohesin complex CTCF RAD21 SMC3 ZNF143 - Histone marks of open chromatin and elongation - Marks of active promoters and gene bodies

Many ldquowindowrdquo features have a different meaning when marking promoters and enhancers (eg cohesin)

Predictive features colocate and form complexes

K562

Predictive features colocate and form complexes

K562

Can TargetFinder work outsideENCODE cell lines

Can TargetFinder work outsideENCODE cell lines

What is a minimal set of experimentsfor accurate prediction

Can TargetFinder work outsideENCODE cell lines

What is a minimal set of experimentsfor accurate prediction optimal 16+

minimal 8

Can TargetFinder work outsideENCODE cell lines

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types EVALUATE

GM12878 K562 HeLa-S3 HUVEC

GM12878 083 040 043 039

K562 046 085 045 044

HeLa-S3 043 038 088 041

HUVEC 039 040 038 -

TRAIN

Fmax13 values

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types EVALUATE

GM12878 K562 HeLa-S3 HUVEC

GM12878 083 040 043 039

K562 046 085 045 044

HeLa-S3 043 038 088 041

HUVEC 039 040 038 -

TRAIN

Fmax13 values

Expect ~35 precision and 55 recall on anew cell type with ~10 ChIP-seq datasets

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Massive data integration improves predictionbull Closest gene

- Usually fails to identify the right promoter - Many false positives

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Massive data integration improves predictionbull Closest gene

- Usually fails to identify the right promoter - Many false positives

bull TargetFinder - Identifies 95-90 of known pairs (55 with less data) - Few false positives

Which human genome sequences function as long-range enhancers

13

13

13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

VISTA 13 Browser

minus

+

13

Training Data

+VISTA 13 Browser

EvoluDonary ConservaDon Human

ChimpFeatures Mouse

Rat

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

minus

13 13

13 13 13 13 13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

+VISTA 13 Browser

minus

EvoluDonary ConservaDon FuncDonal Genomics Human

ChimpFeatures Mouse

Rat

bullChIP-shy‐seq (TFs histones) bullDNase HypersensiDvity bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

13 13 13 13

13 13 13 13 13

13 13 13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

+VISTA 13 Browser

minus

EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

13 13 13 13

13 13 13 13 13

13 13 13

Training Data

+VISTA 13 Browser

minus

Computa6onal Algorithm

Support vector machine separates 2 groups 13 Multi-kernel good for combining heterogeneous data types with different weights

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

EnhancerFinder Performance

AUC=096 Power=85 at 10 FPR Recall=85 at 93 Precision 13 FDR ~10-50 Significantly better than other methods 13 gt80 in vivo validation rate

Erwin et al (2014) PLoS Comp Bio

EnhancerFinder Predictions

239 Predicted HAR13 Enhancers

Heart13 (28)

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

bull Identify sites with fitness effects (Gulko et al 2015) Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

HARs

Figure 2 Massively parallel functional testing of HARs For each HAR the chimp version and versions with all combinations of human mutations will be shotgun cloned along with synthetic 20bp barcodes into an enhancer assay vector This vector library will be introduced to human and chimp neuronal differentiated cells via lentivirus HAR sequences that function as enhancers in these cells will produce barcoded RNA Each unique HAR sequence will have an average of 90 tags to allow for a quantitative readout RNA sequencing (RNA-seq) will allow for the measurement of the relative activity of each mutation in the encoded HAR MinP (minimal promoter) mut (mutations)

Neuronal differentiated cells

Massively Parallel Reporter Assaysand capture Hi-C for validation

AGGT

MinP Luciferase

ACGT ACGT

TCGT ACGC

Enhancer vector

Transcribed barcodes

RNA-seq

Lentivirus

Text

bullTest gt12000 170bp enhancers in parallel bull Quantitative activity assayed via RNA-seq bull Compare genotypes bull Human vs chimp bull Disease SNPs

iPS derived cells

Enhancers Human

Chimp

Vector library

Hane Ryu Nadav Ahituv Jay Shendure Yin Shen

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Discovering the role of enhancers inhuman biology and evolution

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening In vivo molecular studies

Collaborators

EnhancerFinder Gen Haliburton Tony Capra Dennis Kostka John Rubenstein

TargetFinder Sean Whalen Rebecca Truty Benoit Bruneau

MotifDiverge Dennis Kostka Tara Friedrich Deb Ritter Jeff Chuang

Funding from NHLBI PhRMA Foundation Gladstone Institutes

Page 9: Interference of 3D Regulatory Interactions from 2D Genomic ... · Erwin et al. (2014) PLoS Comp Bio, Capra et al. (2014) PTRSB. EnhancerFinder: Predictions • 84,301 developmental

Cardiomyopathy associated variant or human-chimp

difference

Gene A Gene B

G A

Most phenotype associatedmutations are outside coding regions

A T C C

DNA Gene C

bull Enhancers individual ChIP-seq data sets identify lt50 ofknown enhancers plus many false positives13

bull Gene targets closest gene is right ~10 of time

Can we reconstruct 3D interactions between enhancers and promoters

from 2D genomic data 13

13

13

TargetFinder Feature Importance

TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features

Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution

13

13 13 13 13

TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features

Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution

EvoluDonary ConservaDon Human

ChimpFeatures Mouse

Rat

Conserved13 synteny of13 enhancer and promoter

13 13

13 13 13 13 13

13 13 13 13

TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features

Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution

EvoluDonary ConservaDon FuncDonal Genomics Human

ChimpFeatures Mouse

Rat

Conserved13 synteny of13 bullChIA-shy‐PET enhancer and promoter bullRNA-shy‐seq

bullChIP-shy‐seq (TFs histones) bullenhancerpromoterwindowbullChromHMM13 amp Segway

13 13 13

13 13 13 13 13

13 13 13 13 13 13 13 13 13 13 13

13 13 13 13

TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features

Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution

EvoluDonary ConservaDon FuncDonal Genomics Sequence AnnotaDons Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

Conserved13 synteny of13 bullChIA-shy‐PET K-shy‐mer correlaDon enhancer and promoter bullRNA-shy‐seq Annotated funcDons and

bullChIP-shy‐seq (TFs histones) pathways of gene and bullenhancerpromoterwindow enhancer bound TFs bullChromHMM13 amp Segway

13 13 13

13 13 13 13 13

13 13 13 13 13 13 13 13 13 13 13

13 13 13 13

TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features

Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution

Computa6onal Algorithm

Decision trees good for interacting features 13 Ensemble learning build many imperfect classifiers and combine them to improve prediction accuracy

EvoluDonary ConservaDon FuncDonal Genomics Sequence AnnotaDons Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

Conserved13 synteny of13 bullChIA-shy‐PET K-shy‐mer correlaDon enhancer and promoter bullRNA-shy‐seq Annotated funcDons and

bullChIP-shy‐seq (TFs histones) pathways of gene and bullenhancerpromoterwindow enhancer bound TFs bullChromHMM13 amp Segway

TargetFinder Performance AUC=094-096 Precision

=90-95 Recall=76-83 Power=85-89 at 10 FPR 13 Significantly better than random and logistic regression

TargetFinder performs well at verylong distances

TargetFinder Feature Importance

TargetFinder Feature Importance Most predictive features mark the window between the enhancer and the promoter

TargetFinder Feature Importance Most predictive features mark the window between the enhancer and the promoter

TargetFinder Feature Importance Most predictive features mark the window between the enhancer and the promoter

TargetFinder Feature Importance

TargetFinder Feature Importance Most useful features for prediction are TF andhistone marks in the window between the enhancer and the promoter

bull True interactions - Enhancer-associated proteins P300 JUN TFs - Marks of heterochromatin lack of DNA methylation - Marks of paused or poised RNA polymerase

TargetFinder Feature Importance Most useful features for prediction are TF andhistone marks in the window between the enhancer and the promoter

bull True interactions - Enhancer-associated proteins P300 JUN TFs - Marks of heterochromatin lack of DNA methylation - Marks of paused or poised RNA polymerase

bull False interactions - Cohesin complex CTCF RAD21 SMC3 ZNF143 - Histone marks of open chromatin and elongation - Marks of active promoters and gene bodies

TargetFinder Feature Importance Most useful features for prediction are TF andhistone marks in the window between the enhancer and the promoter

bull True interactions - Enhancer-associated proteins P300 JUN TFs - Marks of heterochromatin lack of DNA methylation - Marks of paused or poised RNA polymerase

bull False interactions - Cohesin complex CTCF RAD21 SMC3 ZNF143 - Histone marks of open chromatin and elongation - Marks of active promoters and gene bodies

Many ldquowindowrdquo features have a different meaning when marking promoters and enhancers (eg cohesin)

Predictive features colocate and form complexes

K562

Predictive features colocate and form complexes

K562

Can TargetFinder work outsideENCODE cell lines

Can TargetFinder work outsideENCODE cell lines

What is a minimal set of experimentsfor accurate prediction

Can TargetFinder work outsideENCODE cell lines

What is a minimal set of experimentsfor accurate prediction optimal 16+

minimal 8

Can TargetFinder work outsideENCODE cell lines

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types EVALUATE

GM12878 K562 HeLa-S3 HUVEC

GM12878 083 040 043 039

K562 046 085 045 044

HeLa-S3 043 038 088 041

HUVEC 039 040 038 -

TRAIN

Fmax13 values

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types EVALUATE

GM12878 K562 HeLa-S3 HUVEC

GM12878 083 040 043 039

K562 046 085 045 044

HeLa-S3 043 038 088 041

HUVEC 039 040 038 -

TRAIN

Fmax13 values

Expect ~35 precision and 55 recall on anew cell type with ~10 ChIP-seq datasets

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Massive data integration improves predictionbull Closest gene

- Usually fails to identify the right promoter - Many false positives

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Massive data integration improves predictionbull Closest gene

- Usually fails to identify the right promoter - Many false positives

bull TargetFinder - Identifies 95-90 of known pairs (55 with less data) - Few false positives

Which human genome sequences function as long-range enhancers

13

13

13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

VISTA 13 Browser

minus

+

13

Training Data

+VISTA 13 Browser

EvoluDonary ConservaDon Human

ChimpFeatures Mouse

Rat

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

minus

13 13

13 13 13 13 13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

+VISTA 13 Browser

minus

EvoluDonary ConservaDon FuncDonal Genomics Human

ChimpFeatures Mouse

Rat

bullChIP-shy‐seq (TFs histones) bullDNase HypersensiDvity bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

13 13 13 13

13 13 13 13 13

13 13 13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

+VISTA 13 Browser

minus

EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

13 13 13 13

13 13 13 13 13

13 13 13

Training Data

+VISTA 13 Browser

minus

Computa6onal Algorithm

Support vector machine separates 2 groups 13 Multi-kernel good for combining heterogeneous data types with different weights

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

EnhancerFinder Performance

AUC=096 Power=85 at 10 FPR Recall=85 at 93 Precision 13 FDR ~10-50 Significantly better than other methods 13 gt80 in vivo validation rate

Erwin et al (2014) PLoS Comp Bio

EnhancerFinder Predictions

239 Predicted HAR13 Enhancers

Heart13 (28)

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

bull Identify sites with fitness effects (Gulko et al 2015) Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

HARs

Figure 2 Massively parallel functional testing of HARs For each HAR the chimp version and versions with all combinations of human mutations will be shotgun cloned along with synthetic 20bp barcodes into an enhancer assay vector This vector library will be introduced to human and chimp neuronal differentiated cells via lentivirus HAR sequences that function as enhancers in these cells will produce barcoded RNA Each unique HAR sequence will have an average of 90 tags to allow for a quantitative readout RNA sequencing (RNA-seq) will allow for the measurement of the relative activity of each mutation in the encoded HAR MinP (minimal promoter) mut (mutations)

Neuronal differentiated cells

Massively Parallel Reporter Assaysand capture Hi-C for validation

AGGT

MinP Luciferase

ACGT ACGT

TCGT ACGC

Enhancer vector

Transcribed barcodes

RNA-seq

Lentivirus

Text

bullTest gt12000 170bp enhancers in parallel bull Quantitative activity assayed via RNA-seq bull Compare genotypes bull Human vs chimp bull Disease SNPs

iPS derived cells

Enhancers Human

Chimp

Vector library

Hane Ryu Nadav Ahituv Jay Shendure Yin Shen

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Discovering the role of enhancers inhuman biology and evolution

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening In vivo molecular studies

Collaborators

EnhancerFinder Gen Haliburton Tony Capra Dennis Kostka John Rubenstein

TargetFinder Sean Whalen Rebecca Truty Benoit Bruneau

MotifDiverge Dennis Kostka Tara Friedrich Deb Ritter Jeff Chuang

Funding from NHLBI PhRMA Foundation Gladstone Institutes

Page 10: Interference of 3D Regulatory Interactions from 2D Genomic ... · Erwin et al. (2014) PLoS Comp Bio, Capra et al. (2014) PTRSB. EnhancerFinder: Predictions • 84,301 developmental

Can we reconstruct 3D interactions between enhancers and promoters

from 2D genomic data 13

13

13

TargetFinder Feature Importance

TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features

Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution

13

13 13 13 13

TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features

Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution

EvoluDonary ConservaDon Human

ChimpFeatures Mouse

Rat

Conserved13 synteny of13 enhancer and promoter

13 13

13 13 13 13 13

13 13 13 13

TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features

Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution

EvoluDonary ConservaDon FuncDonal Genomics Human

ChimpFeatures Mouse

Rat

Conserved13 synteny of13 bullChIA-shy‐PET enhancer and promoter bullRNA-shy‐seq

bullChIP-shy‐seq (TFs histones) bullenhancerpromoterwindowbullChromHMM13 amp Segway

13 13 13

13 13 13 13 13

13 13 13 13 13 13 13 13 13 13 13

13 13 13 13

TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features

Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution

EvoluDonary ConservaDon FuncDonal Genomics Sequence AnnotaDons Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

Conserved13 synteny of13 bullChIA-shy‐PET K-shy‐mer correlaDon enhancer and promoter bullRNA-shy‐seq Annotated funcDons and

bullChIP-shy‐seq (TFs histones) pathways of gene and bullenhancerpromoterwindow enhancer bound TFs bullChromHMM13 amp Segway

13 13 13

13 13 13 13 13

13 13 13 13 13 13 13 13 13 13 13

13 13 13 13

TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features

Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution

Computa6onal Algorithm

Decision trees good for interacting features 13 Ensemble learning build many imperfect classifiers and combine them to improve prediction accuracy

EvoluDonary ConservaDon FuncDonal Genomics Sequence AnnotaDons Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

Conserved13 synteny of13 bullChIA-shy‐PET K-shy‐mer correlaDon enhancer and promoter bullRNA-shy‐seq Annotated funcDons and

bullChIP-shy‐seq (TFs histones) pathways of gene and bullenhancerpromoterwindow enhancer bound TFs bullChromHMM13 amp Segway

TargetFinder Performance AUC=094-096 Precision

=90-95 Recall=76-83 Power=85-89 at 10 FPR 13 Significantly better than random and logistic regression

TargetFinder performs well at verylong distances

TargetFinder Feature Importance

TargetFinder Feature Importance Most predictive features mark the window between the enhancer and the promoter

TargetFinder Feature Importance Most predictive features mark the window between the enhancer and the promoter

TargetFinder Feature Importance Most predictive features mark the window between the enhancer and the promoter

TargetFinder Feature Importance

TargetFinder Feature Importance Most useful features for prediction are TF andhistone marks in the window between the enhancer and the promoter

bull True interactions - Enhancer-associated proteins P300 JUN TFs - Marks of heterochromatin lack of DNA methylation - Marks of paused or poised RNA polymerase

TargetFinder Feature Importance Most useful features for prediction are TF andhistone marks in the window between the enhancer and the promoter

bull True interactions - Enhancer-associated proteins P300 JUN TFs - Marks of heterochromatin lack of DNA methylation - Marks of paused or poised RNA polymerase

bull False interactions - Cohesin complex CTCF RAD21 SMC3 ZNF143 - Histone marks of open chromatin and elongation - Marks of active promoters and gene bodies

TargetFinder Feature Importance Most useful features for prediction are TF andhistone marks in the window between the enhancer and the promoter

bull True interactions - Enhancer-associated proteins P300 JUN TFs - Marks of heterochromatin lack of DNA methylation - Marks of paused or poised RNA polymerase

bull False interactions - Cohesin complex CTCF RAD21 SMC3 ZNF143 - Histone marks of open chromatin and elongation - Marks of active promoters and gene bodies

Many ldquowindowrdquo features have a different meaning when marking promoters and enhancers (eg cohesin)

Predictive features colocate and form complexes

K562

Predictive features colocate and form complexes

K562

Can TargetFinder work outsideENCODE cell lines

Can TargetFinder work outsideENCODE cell lines

What is a minimal set of experimentsfor accurate prediction

Can TargetFinder work outsideENCODE cell lines

What is a minimal set of experimentsfor accurate prediction optimal 16+

minimal 8

Can TargetFinder work outsideENCODE cell lines

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types EVALUATE

GM12878 K562 HeLa-S3 HUVEC

GM12878 083 040 043 039

K562 046 085 045 044

HeLa-S3 043 038 088 041

HUVEC 039 040 038 -

TRAIN

Fmax13 values

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types EVALUATE

GM12878 K562 HeLa-S3 HUVEC

GM12878 083 040 043 039

K562 046 085 045 044

HeLa-S3 043 038 088 041

HUVEC 039 040 038 -

TRAIN

Fmax13 values

Expect ~35 precision and 55 recall on anew cell type with ~10 ChIP-seq datasets

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Massive data integration improves predictionbull Closest gene

- Usually fails to identify the right promoter - Many false positives

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Massive data integration improves predictionbull Closest gene

- Usually fails to identify the right promoter - Many false positives

bull TargetFinder - Identifies 95-90 of known pairs (55 with less data) - Few false positives

Which human genome sequences function as long-range enhancers

13

13

13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

VISTA 13 Browser

minus

+

13

Training Data

+VISTA 13 Browser

EvoluDonary ConservaDon Human

ChimpFeatures Mouse

Rat

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

minus

13 13

13 13 13 13 13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

+VISTA 13 Browser

minus

EvoluDonary ConservaDon FuncDonal Genomics Human

ChimpFeatures Mouse

Rat

bullChIP-shy‐seq (TFs histones) bullDNase HypersensiDvity bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

13 13 13 13

13 13 13 13 13

13 13 13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

+VISTA 13 Browser

minus

EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

13 13 13 13

13 13 13 13 13

13 13 13

Training Data

+VISTA 13 Browser

minus

Computa6onal Algorithm

Support vector machine separates 2 groups 13 Multi-kernel good for combining heterogeneous data types with different weights

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

EnhancerFinder Performance

AUC=096 Power=85 at 10 FPR Recall=85 at 93 Precision 13 FDR ~10-50 Significantly better than other methods 13 gt80 in vivo validation rate

Erwin et al (2014) PLoS Comp Bio

EnhancerFinder Predictions

239 Predicted HAR13 Enhancers

Heart13 (28)

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

bull Identify sites with fitness effects (Gulko et al 2015) Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

HARs

Figure 2 Massively parallel functional testing of HARs For each HAR the chimp version and versions with all combinations of human mutations will be shotgun cloned along with synthetic 20bp barcodes into an enhancer assay vector This vector library will be introduced to human and chimp neuronal differentiated cells via lentivirus HAR sequences that function as enhancers in these cells will produce barcoded RNA Each unique HAR sequence will have an average of 90 tags to allow for a quantitative readout RNA sequencing (RNA-seq) will allow for the measurement of the relative activity of each mutation in the encoded HAR MinP (minimal promoter) mut (mutations)

Neuronal differentiated cells

Massively Parallel Reporter Assaysand capture Hi-C for validation

AGGT

MinP Luciferase

ACGT ACGT

TCGT ACGC

Enhancer vector

Transcribed barcodes

RNA-seq

Lentivirus

Text

bullTest gt12000 170bp enhancers in parallel bull Quantitative activity assayed via RNA-seq bull Compare genotypes bull Human vs chimp bull Disease SNPs

iPS derived cells

Enhancers Human

Chimp

Vector library

Hane Ryu Nadav Ahituv Jay Shendure Yin Shen

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Discovering the role of enhancers inhuman biology and evolution

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening In vivo molecular studies

Collaborators

EnhancerFinder Gen Haliburton Tony Capra Dennis Kostka John Rubenstein

TargetFinder Sean Whalen Rebecca Truty Benoit Bruneau

MotifDiverge Dennis Kostka Tara Friedrich Deb Ritter Jeff Chuang

Funding from NHLBI PhRMA Foundation Gladstone Institutes

Page 11: Interference of 3D Regulatory Interactions from 2D Genomic ... · Erwin et al. (2014) PLoS Comp Bio, Capra et al. (2014) PTRSB. EnhancerFinder: Predictions • 84,301 developmental

TargetFinder Feature Importance

TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features

Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution

13

13 13 13 13

TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features

Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution

EvoluDonary ConservaDon Human

ChimpFeatures Mouse

Rat

Conserved13 synteny of13 enhancer and promoter

13 13

13 13 13 13 13

13 13 13 13

TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features

Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution

EvoluDonary ConservaDon FuncDonal Genomics Human

ChimpFeatures Mouse

Rat

Conserved13 synteny of13 bullChIA-shy‐PET enhancer and promoter bullRNA-shy‐seq

bullChIP-shy‐seq (TFs histones) bullenhancerpromoterwindowbullChromHMM13 amp Segway

13 13 13

13 13 13 13 13

13 13 13 13 13 13 13 13 13 13 13

13 13 13 13

TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features

Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution

EvoluDonary ConservaDon FuncDonal Genomics Sequence AnnotaDons Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

Conserved13 synteny of13 bullChIA-shy‐PET K-shy‐mer correlaDon enhancer and promoter bullRNA-shy‐seq Annotated funcDons and

bullChIP-shy‐seq (TFs histones) pathways of gene and bullenhancerpromoterwindow enhancer bound TFs bullChromHMM13 amp Segway

13 13 13

13 13 13 13 13

13 13 13 13 13 13 13 13 13 13 13

13 13 13 13

TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features

Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution

Computa6onal Algorithm

Decision trees good for interacting features 13 Ensemble learning build many imperfect classifiers and combine them to improve prediction accuracy

EvoluDonary ConservaDon FuncDonal Genomics Sequence AnnotaDons Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

Conserved13 synteny of13 bullChIA-shy‐PET K-shy‐mer correlaDon enhancer and promoter bullRNA-shy‐seq Annotated funcDons and

bullChIP-shy‐seq (TFs histones) pathways of gene and bullenhancerpromoterwindow enhancer bound TFs bullChromHMM13 amp Segway

TargetFinder Performance AUC=094-096 Precision

=90-95 Recall=76-83 Power=85-89 at 10 FPR 13 Significantly better than random and logistic regression

TargetFinder performs well at verylong distances

TargetFinder Feature Importance

TargetFinder Feature Importance Most predictive features mark the window between the enhancer and the promoter

TargetFinder Feature Importance Most predictive features mark the window between the enhancer and the promoter

TargetFinder Feature Importance Most predictive features mark the window between the enhancer and the promoter

TargetFinder Feature Importance

TargetFinder Feature Importance Most useful features for prediction are TF andhistone marks in the window between the enhancer and the promoter

bull True interactions - Enhancer-associated proteins P300 JUN TFs - Marks of heterochromatin lack of DNA methylation - Marks of paused or poised RNA polymerase

TargetFinder Feature Importance Most useful features for prediction are TF andhistone marks in the window between the enhancer and the promoter

bull True interactions - Enhancer-associated proteins P300 JUN TFs - Marks of heterochromatin lack of DNA methylation - Marks of paused or poised RNA polymerase

bull False interactions - Cohesin complex CTCF RAD21 SMC3 ZNF143 - Histone marks of open chromatin and elongation - Marks of active promoters and gene bodies

TargetFinder Feature Importance Most useful features for prediction are TF andhistone marks in the window between the enhancer and the promoter

bull True interactions - Enhancer-associated proteins P300 JUN TFs - Marks of heterochromatin lack of DNA methylation - Marks of paused or poised RNA polymerase

bull False interactions - Cohesin complex CTCF RAD21 SMC3 ZNF143 - Histone marks of open chromatin and elongation - Marks of active promoters and gene bodies

Many ldquowindowrdquo features have a different meaning when marking promoters and enhancers (eg cohesin)

Predictive features colocate and form complexes

K562

Predictive features colocate and form complexes

K562

Can TargetFinder work outsideENCODE cell lines

Can TargetFinder work outsideENCODE cell lines

What is a minimal set of experimentsfor accurate prediction

Can TargetFinder work outsideENCODE cell lines

What is a minimal set of experimentsfor accurate prediction optimal 16+

minimal 8

Can TargetFinder work outsideENCODE cell lines

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types EVALUATE

GM12878 K562 HeLa-S3 HUVEC

GM12878 083 040 043 039

K562 046 085 045 044

HeLa-S3 043 038 088 041

HUVEC 039 040 038 -

TRAIN

Fmax13 values

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types EVALUATE

GM12878 K562 HeLa-S3 HUVEC

GM12878 083 040 043 039

K562 046 085 045 044

HeLa-S3 043 038 088 041

HUVEC 039 040 038 -

TRAIN

Fmax13 values

Expect ~35 precision and 55 recall on anew cell type with ~10 ChIP-seq datasets

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Massive data integration improves predictionbull Closest gene

- Usually fails to identify the right promoter - Many false positives

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Massive data integration improves predictionbull Closest gene

- Usually fails to identify the right promoter - Many false positives

bull TargetFinder - Identifies 95-90 of known pairs (55 with less data) - Few false positives

Which human genome sequences function as long-range enhancers

13

13

13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

VISTA 13 Browser

minus

+

13

Training Data

+VISTA 13 Browser

EvoluDonary ConservaDon Human

ChimpFeatures Mouse

Rat

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

minus

13 13

13 13 13 13 13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

+VISTA 13 Browser

minus

EvoluDonary ConservaDon FuncDonal Genomics Human

ChimpFeatures Mouse

Rat

bullChIP-shy‐seq (TFs histones) bullDNase HypersensiDvity bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

13 13 13 13

13 13 13 13 13

13 13 13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

+VISTA 13 Browser

minus

EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

13 13 13 13

13 13 13 13 13

13 13 13

Training Data

+VISTA 13 Browser

minus

Computa6onal Algorithm

Support vector machine separates 2 groups 13 Multi-kernel good for combining heterogeneous data types with different weights

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

EnhancerFinder Performance

AUC=096 Power=85 at 10 FPR Recall=85 at 93 Precision 13 FDR ~10-50 Significantly better than other methods 13 gt80 in vivo validation rate

Erwin et al (2014) PLoS Comp Bio

EnhancerFinder Predictions

239 Predicted HAR13 Enhancers

Heart13 (28)

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

bull Identify sites with fitness effects (Gulko et al 2015) Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

HARs

Figure 2 Massively parallel functional testing of HARs For each HAR the chimp version and versions with all combinations of human mutations will be shotgun cloned along with synthetic 20bp barcodes into an enhancer assay vector This vector library will be introduced to human and chimp neuronal differentiated cells via lentivirus HAR sequences that function as enhancers in these cells will produce barcoded RNA Each unique HAR sequence will have an average of 90 tags to allow for a quantitative readout RNA sequencing (RNA-seq) will allow for the measurement of the relative activity of each mutation in the encoded HAR MinP (minimal promoter) mut (mutations)

Neuronal differentiated cells

Massively Parallel Reporter Assaysand capture Hi-C for validation

AGGT

MinP Luciferase

ACGT ACGT

TCGT ACGC

Enhancer vector

Transcribed barcodes

RNA-seq

Lentivirus

Text

bullTest gt12000 170bp enhancers in parallel bull Quantitative activity assayed via RNA-seq bull Compare genotypes bull Human vs chimp bull Disease SNPs

iPS derived cells

Enhancers Human

Chimp

Vector library

Hane Ryu Nadav Ahituv Jay Shendure Yin Shen

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Discovering the role of enhancers inhuman biology and evolution

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening In vivo molecular studies

Collaborators

EnhancerFinder Gen Haliburton Tony Capra Dennis Kostka John Rubenstein

TargetFinder Sean Whalen Rebecca Truty Benoit Bruneau

MotifDiverge Dennis Kostka Tara Friedrich Deb Ritter Jeff Chuang

Funding from NHLBI PhRMA Foundation Gladstone Institutes

Page 12: Interference of 3D Regulatory Interactions from 2D Genomic ... · Erwin et al. (2014) PLoS Comp Bio, Capra et al. (2014) PTRSB. EnhancerFinder: Predictions • 84,301 developmental

TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features

Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution

13

13 13 13 13

TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features

Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution

EvoluDonary ConservaDon Human

ChimpFeatures Mouse

Rat

Conserved13 synteny of13 enhancer and promoter

13 13

13 13 13 13 13

13 13 13 13

TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features

Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution

EvoluDonary ConservaDon FuncDonal Genomics Human

ChimpFeatures Mouse

Rat

Conserved13 synteny of13 bullChIA-shy‐PET enhancer and promoter bullRNA-shy‐seq

bullChIP-shy‐seq (TFs histones) bullenhancerpromoterwindowbullChromHMM13 amp Segway

13 13 13

13 13 13 13 13

13 13 13 13 13 13 13 13 13 13 13

13 13 13 13

TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features

Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution

EvoluDonary ConservaDon FuncDonal Genomics Sequence AnnotaDons Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

Conserved13 synteny of13 bullChIA-shy‐PET K-shy‐mer correlaDon enhancer and promoter bullRNA-shy‐seq Annotated funcDons and

bullChIP-shy‐seq (TFs histones) pathways of gene and bullenhancerpromoterwindow enhancer bound TFs bullChromHMM13 amp Segway

13 13 13

13 13 13 13 13

13 13 13 13 13 13 13 13 13 13 13

13 13 13 13

TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features

Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution

Computa6onal Algorithm

Decision trees good for interacting features 13 Ensemble learning build many imperfect classifiers and combine them to improve prediction accuracy

EvoluDonary ConservaDon FuncDonal Genomics Sequence AnnotaDons Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

Conserved13 synteny of13 bullChIA-shy‐PET K-shy‐mer correlaDon enhancer and promoter bullRNA-shy‐seq Annotated funcDons and

bullChIP-shy‐seq (TFs histones) pathways of gene and bullenhancerpromoterwindow enhancer bound TFs bullChromHMM13 amp Segway

TargetFinder Performance AUC=094-096 Precision

=90-95 Recall=76-83 Power=85-89 at 10 FPR 13 Significantly better than random and logistic regression

TargetFinder performs well at verylong distances

TargetFinder Feature Importance

TargetFinder Feature Importance Most predictive features mark the window between the enhancer and the promoter

TargetFinder Feature Importance Most predictive features mark the window between the enhancer and the promoter

TargetFinder Feature Importance Most predictive features mark the window between the enhancer and the promoter

TargetFinder Feature Importance

TargetFinder Feature Importance Most useful features for prediction are TF andhistone marks in the window between the enhancer and the promoter

bull True interactions - Enhancer-associated proteins P300 JUN TFs - Marks of heterochromatin lack of DNA methylation - Marks of paused or poised RNA polymerase

TargetFinder Feature Importance Most useful features for prediction are TF andhistone marks in the window between the enhancer and the promoter

bull True interactions - Enhancer-associated proteins P300 JUN TFs - Marks of heterochromatin lack of DNA methylation - Marks of paused or poised RNA polymerase

bull False interactions - Cohesin complex CTCF RAD21 SMC3 ZNF143 - Histone marks of open chromatin and elongation - Marks of active promoters and gene bodies

TargetFinder Feature Importance Most useful features for prediction are TF andhistone marks in the window between the enhancer and the promoter

bull True interactions - Enhancer-associated proteins P300 JUN TFs - Marks of heterochromatin lack of DNA methylation - Marks of paused or poised RNA polymerase

bull False interactions - Cohesin complex CTCF RAD21 SMC3 ZNF143 - Histone marks of open chromatin and elongation - Marks of active promoters and gene bodies

Many ldquowindowrdquo features have a different meaning when marking promoters and enhancers (eg cohesin)

Predictive features colocate and form complexes

K562

Predictive features colocate and form complexes

K562

Can TargetFinder work outsideENCODE cell lines

Can TargetFinder work outsideENCODE cell lines

What is a minimal set of experimentsfor accurate prediction

Can TargetFinder work outsideENCODE cell lines

What is a minimal set of experimentsfor accurate prediction optimal 16+

minimal 8

Can TargetFinder work outsideENCODE cell lines

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types EVALUATE

GM12878 K562 HeLa-S3 HUVEC

GM12878 083 040 043 039

K562 046 085 045 044

HeLa-S3 043 038 088 041

HUVEC 039 040 038 -

TRAIN

Fmax13 values

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types EVALUATE

GM12878 K562 HeLa-S3 HUVEC

GM12878 083 040 043 039

K562 046 085 045 044

HeLa-S3 043 038 088 041

HUVEC 039 040 038 -

TRAIN

Fmax13 values

Expect ~35 precision and 55 recall on anew cell type with ~10 ChIP-seq datasets

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Massive data integration improves predictionbull Closest gene

- Usually fails to identify the right promoter - Many false positives

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Massive data integration improves predictionbull Closest gene

- Usually fails to identify the right promoter - Many false positives

bull TargetFinder - Identifies 95-90 of known pairs (55 with less data) - Few false positives

Which human genome sequences function as long-range enhancers

13

13

13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

VISTA 13 Browser

minus

+

13

Training Data

+VISTA 13 Browser

EvoluDonary ConservaDon Human

ChimpFeatures Mouse

Rat

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

minus

13 13

13 13 13 13 13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

+VISTA 13 Browser

minus

EvoluDonary ConservaDon FuncDonal Genomics Human

ChimpFeatures Mouse

Rat

bullChIP-shy‐seq (TFs histones) bullDNase HypersensiDvity bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

13 13 13 13

13 13 13 13 13

13 13 13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

+VISTA 13 Browser

minus

EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

13 13 13 13

13 13 13 13 13

13 13 13

Training Data

+VISTA 13 Browser

minus

Computa6onal Algorithm

Support vector machine separates 2 groups 13 Multi-kernel good for combining heterogeneous data types with different weights

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

EnhancerFinder Performance

AUC=096 Power=85 at 10 FPR Recall=85 at 93 Precision 13 FDR ~10-50 Significantly better than other methods 13 gt80 in vivo validation rate

Erwin et al (2014) PLoS Comp Bio

EnhancerFinder Predictions

239 Predicted HAR13 Enhancers

Heart13 (28)

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

bull Identify sites with fitness effects (Gulko et al 2015) Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

HARs

Figure 2 Massively parallel functional testing of HARs For each HAR the chimp version and versions with all combinations of human mutations will be shotgun cloned along with synthetic 20bp barcodes into an enhancer assay vector This vector library will be introduced to human and chimp neuronal differentiated cells via lentivirus HAR sequences that function as enhancers in these cells will produce barcoded RNA Each unique HAR sequence will have an average of 90 tags to allow for a quantitative readout RNA sequencing (RNA-seq) will allow for the measurement of the relative activity of each mutation in the encoded HAR MinP (minimal promoter) mut (mutations)

Neuronal differentiated cells

Massively Parallel Reporter Assaysand capture Hi-C for validation

AGGT

MinP Luciferase

ACGT ACGT

TCGT ACGC

Enhancer vector

Transcribed barcodes

RNA-seq

Lentivirus

Text

bullTest gt12000 170bp enhancers in parallel bull Quantitative activity assayed via RNA-seq bull Compare genotypes bull Human vs chimp bull Disease SNPs

iPS derived cells

Enhancers Human

Chimp

Vector library

Hane Ryu Nadav Ahituv Jay Shendure Yin Shen

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Discovering the role of enhancers inhuman biology and evolution

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening In vivo molecular studies

Collaborators

EnhancerFinder Gen Haliburton Tony Capra Dennis Kostka John Rubenstein

TargetFinder Sean Whalen Rebecca Truty Benoit Bruneau

MotifDiverge Dennis Kostka Tara Friedrich Deb Ritter Jeff Chuang

Funding from NHLBI PhRMA Foundation Gladstone Institutes

Page 13: Interference of 3D Regulatory Interactions from 2D Genomic ... · Erwin et al. (2014) PLoS Comp Bio, Capra et al. (2014) PTRSB. EnhancerFinder: Predictions • 84,301 developmental

13

13 13 13 13

TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features

Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution

EvoluDonary ConservaDon Human

ChimpFeatures Mouse

Rat

Conserved13 synteny of13 enhancer and promoter

13 13

13 13 13 13 13

13 13 13 13

TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features

Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution

EvoluDonary ConservaDon FuncDonal Genomics Human

ChimpFeatures Mouse

Rat

Conserved13 synteny of13 bullChIA-shy‐PET enhancer and promoter bullRNA-shy‐seq

bullChIP-shy‐seq (TFs histones) bullenhancerpromoterwindowbullChromHMM13 amp Segway

13 13 13

13 13 13 13 13

13 13 13 13 13 13 13 13 13 13 13

13 13 13 13

TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features

Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution

EvoluDonary ConservaDon FuncDonal Genomics Sequence AnnotaDons Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

Conserved13 synteny of13 bullChIA-shy‐PET K-shy‐mer correlaDon enhancer and promoter bullRNA-shy‐seq Annotated funcDons and

bullChIP-shy‐seq (TFs histones) pathways of gene and bullenhancerpromoterwindow enhancer bound TFs bullChromHMM13 amp Segway

13 13 13

13 13 13 13 13

13 13 13 13 13 13 13 13 13 13 13

13 13 13 13

TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features

Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution

Computa6onal Algorithm

Decision trees good for interacting features 13 Ensemble learning build many imperfect classifiers and combine them to improve prediction accuracy

EvoluDonary ConservaDon FuncDonal Genomics Sequence AnnotaDons Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

Conserved13 synteny of13 bullChIA-shy‐PET K-shy‐mer correlaDon enhancer and promoter bullRNA-shy‐seq Annotated funcDons and

bullChIP-shy‐seq (TFs histones) pathways of gene and bullenhancerpromoterwindow enhancer bound TFs bullChromHMM13 amp Segway

TargetFinder Performance AUC=094-096 Precision

=90-95 Recall=76-83 Power=85-89 at 10 FPR 13 Significantly better than random and logistic regression

TargetFinder performs well at verylong distances

TargetFinder Feature Importance

TargetFinder Feature Importance Most predictive features mark the window between the enhancer and the promoter

TargetFinder Feature Importance Most predictive features mark the window between the enhancer and the promoter

TargetFinder Feature Importance Most predictive features mark the window between the enhancer and the promoter

TargetFinder Feature Importance

TargetFinder Feature Importance Most useful features for prediction are TF andhistone marks in the window between the enhancer and the promoter

bull True interactions - Enhancer-associated proteins P300 JUN TFs - Marks of heterochromatin lack of DNA methylation - Marks of paused or poised RNA polymerase

TargetFinder Feature Importance Most useful features for prediction are TF andhistone marks in the window between the enhancer and the promoter

bull True interactions - Enhancer-associated proteins P300 JUN TFs - Marks of heterochromatin lack of DNA methylation - Marks of paused or poised RNA polymerase

bull False interactions - Cohesin complex CTCF RAD21 SMC3 ZNF143 - Histone marks of open chromatin and elongation - Marks of active promoters and gene bodies

TargetFinder Feature Importance Most useful features for prediction are TF andhistone marks in the window between the enhancer and the promoter

bull True interactions - Enhancer-associated proteins P300 JUN TFs - Marks of heterochromatin lack of DNA methylation - Marks of paused or poised RNA polymerase

bull False interactions - Cohesin complex CTCF RAD21 SMC3 ZNF143 - Histone marks of open chromatin and elongation - Marks of active promoters and gene bodies

Many ldquowindowrdquo features have a different meaning when marking promoters and enhancers (eg cohesin)

Predictive features colocate and form complexes

K562

Predictive features colocate and form complexes

K562

Can TargetFinder work outsideENCODE cell lines

Can TargetFinder work outsideENCODE cell lines

What is a minimal set of experimentsfor accurate prediction

Can TargetFinder work outsideENCODE cell lines

What is a minimal set of experimentsfor accurate prediction optimal 16+

minimal 8

Can TargetFinder work outsideENCODE cell lines

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types EVALUATE

GM12878 K562 HeLa-S3 HUVEC

GM12878 083 040 043 039

K562 046 085 045 044

HeLa-S3 043 038 088 041

HUVEC 039 040 038 -

TRAIN

Fmax13 values

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types EVALUATE

GM12878 K562 HeLa-S3 HUVEC

GM12878 083 040 043 039

K562 046 085 045 044

HeLa-S3 043 038 088 041

HUVEC 039 040 038 -

TRAIN

Fmax13 values

Expect ~35 precision and 55 recall on anew cell type with ~10 ChIP-seq datasets

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Massive data integration improves predictionbull Closest gene

- Usually fails to identify the right promoter - Many false positives

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Massive data integration improves predictionbull Closest gene

- Usually fails to identify the right promoter - Many false positives

bull TargetFinder - Identifies 95-90 of known pairs (55 with less data) - Few false positives

Which human genome sequences function as long-range enhancers

13

13

13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

VISTA 13 Browser

minus

+

13

Training Data

+VISTA 13 Browser

EvoluDonary ConservaDon Human

ChimpFeatures Mouse

Rat

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

minus

13 13

13 13 13 13 13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

+VISTA 13 Browser

minus

EvoluDonary ConservaDon FuncDonal Genomics Human

ChimpFeatures Mouse

Rat

bullChIP-shy‐seq (TFs histones) bullDNase HypersensiDvity bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

13 13 13 13

13 13 13 13 13

13 13 13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

+VISTA 13 Browser

minus

EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

13 13 13 13

13 13 13 13 13

13 13 13

Training Data

+VISTA 13 Browser

minus

Computa6onal Algorithm

Support vector machine separates 2 groups 13 Multi-kernel good for combining heterogeneous data types with different weights

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

EnhancerFinder Performance

AUC=096 Power=85 at 10 FPR Recall=85 at 93 Precision 13 FDR ~10-50 Significantly better than other methods 13 gt80 in vivo validation rate

Erwin et al (2014) PLoS Comp Bio

EnhancerFinder Predictions

239 Predicted HAR13 Enhancers

Heart13 (28)

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

bull Identify sites with fitness effects (Gulko et al 2015) Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

HARs

Figure 2 Massively parallel functional testing of HARs For each HAR the chimp version and versions with all combinations of human mutations will be shotgun cloned along with synthetic 20bp barcodes into an enhancer assay vector This vector library will be introduced to human and chimp neuronal differentiated cells via lentivirus HAR sequences that function as enhancers in these cells will produce barcoded RNA Each unique HAR sequence will have an average of 90 tags to allow for a quantitative readout RNA sequencing (RNA-seq) will allow for the measurement of the relative activity of each mutation in the encoded HAR MinP (minimal promoter) mut (mutations)

Neuronal differentiated cells

Massively Parallel Reporter Assaysand capture Hi-C for validation

AGGT

MinP Luciferase

ACGT ACGT

TCGT ACGC

Enhancer vector

Transcribed barcodes

RNA-seq

Lentivirus

Text

bullTest gt12000 170bp enhancers in parallel bull Quantitative activity assayed via RNA-seq bull Compare genotypes bull Human vs chimp bull Disease SNPs

iPS derived cells

Enhancers Human

Chimp

Vector library

Hane Ryu Nadav Ahituv Jay Shendure Yin Shen

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Discovering the role of enhancers inhuman biology and evolution

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening In vivo molecular studies

Collaborators

EnhancerFinder Gen Haliburton Tony Capra Dennis Kostka John Rubenstein

TargetFinder Sean Whalen Rebecca Truty Benoit Bruneau

MotifDiverge Dennis Kostka Tara Friedrich Deb Ritter Jeff Chuang

Funding from NHLBI PhRMA Foundation Gladstone Institutes

Page 14: Interference of 3D Regulatory Interactions from 2D Genomic ... · Erwin et al. (2014) PLoS Comp Bio, Capra et al. (2014) PTRSB. EnhancerFinder: Predictions • 84,301 developmental

13 13

13 13 13 13 13

13 13 13 13

TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features

Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution

EvoluDonary ConservaDon FuncDonal Genomics Human

ChimpFeatures Mouse

Rat

Conserved13 synteny of13 bullChIA-shy‐PET enhancer and promoter bullRNA-shy‐seq

bullChIP-shy‐seq (TFs histones) bullenhancerpromoterwindowbullChromHMM13 amp Segway

13 13 13

13 13 13 13 13

13 13 13 13 13 13 13 13 13 13 13

13 13 13 13

TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features

Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution

EvoluDonary ConservaDon FuncDonal Genomics Sequence AnnotaDons Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

Conserved13 synteny of13 bullChIA-shy‐PET K-shy‐mer correlaDon enhancer and promoter bullRNA-shy‐seq Annotated funcDons and

bullChIP-shy‐seq (TFs histones) pathways of gene and bullenhancerpromoterwindow enhancer bound TFs bullChromHMM13 amp Segway

13 13 13

13 13 13 13 13

13 13 13 13 13 13 13 13 13 13 13

13 13 13 13

TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features

Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution

Computa6onal Algorithm

Decision trees good for interacting features 13 Ensemble learning build many imperfect classifiers and combine them to improve prediction accuracy

EvoluDonary ConservaDon FuncDonal Genomics Sequence AnnotaDons Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

Conserved13 synteny of13 bullChIA-shy‐PET K-shy‐mer correlaDon enhancer and promoter bullRNA-shy‐seq Annotated funcDons and

bullChIP-shy‐seq (TFs histones) pathways of gene and bullenhancerpromoterwindow enhancer bound TFs bullChromHMM13 amp Segway

TargetFinder Performance AUC=094-096 Precision

=90-95 Recall=76-83 Power=85-89 at 10 FPR 13 Significantly better than random and logistic regression

TargetFinder performs well at verylong distances

TargetFinder Feature Importance

TargetFinder Feature Importance Most predictive features mark the window between the enhancer and the promoter

TargetFinder Feature Importance Most predictive features mark the window between the enhancer and the promoter

TargetFinder Feature Importance Most predictive features mark the window between the enhancer and the promoter

TargetFinder Feature Importance

TargetFinder Feature Importance Most useful features for prediction are TF andhistone marks in the window between the enhancer and the promoter

bull True interactions - Enhancer-associated proteins P300 JUN TFs - Marks of heterochromatin lack of DNA methylation - Marks of paused or poised RNA polymerase

TargetFinder Feature Importance Most useful features for prediction are TF andhistone marks in the window between the enhancer and the promoter

bull True interactions - Enhancer-associated proteins P300 JUN TFs - Marks of heterochromatin lack of DNA methylation - Marks of paused or poised RNA polymerase

bull False interactions - Cohesin complex CTCF RAD21 SMC3 ZNF143 - Histone marks of open chromatin and elongation - Marks of active promoters and gene bodies

TargetFinder Feature Importance Most useful features for prediction are TF andhistone marks in the window between the enhancer and the promoter

bull True interactions - Enhancer-associated proteins P300 JUN TFs - Marks of heterochromatin lack of DNA methylation - Marks of paused or poised RNA polymerase

bull False interactions - Cohesin complex CTCF RAD21 SMC3 ZNF143 - Histone marks of open chromatin and elongation - Marks of active promoters and gene bodies

Many ldquowindowrdquo features have a different meaning when marking promoters and enhancers (eg cohesin)

Predictive features colocate and form complexes

K562

Predictive features colocate and form complexes

K562

Can TargetFinder work outsideENCODE cell lines

Can TargetFinder work outsideENCODE cell lines

What is a minimal set of experimentsfor accurate prediction

Can TargetFinder work outsideENCODE cell lines

What is a minimal set of experimentsfor accurate prediction optimal 16+

minimal 8

Can TargetFinder work outsideENCODE cell lines

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types EVALUATE

GM12878 K562 HeLa-S3 HUVEC

GM12878 083 040 043 039

K562 046 085 045 044

HeLa-S3 043 038 088 041

HUVEC 039 040 038 -

TRAIN

Fmax13 values

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types EVALUATE

GM12878 K562 HeLa-S3 HUVEC

GM12878 083 040 043 039

K562 046 085 045 044

HeLa-S3 043 038 088 041

HUVEC 039 040 038 -

TRAIN

Fmax13 values

Expect ~35 precision and 55 recall on anew cell type with ~10 ChIP-seq datasets

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Massive data integration improves predictionbull Closest gene

- Usually fails to identify the right promoter - Many false positives

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Massive data integration improves predictionbull Closest gene

- Usually fails to identify the right promoter - Many false positives

bull TargetFinder - Identifies 95-90 of known pairs (55 with less data) - Few false positives

Which human genome sequences function as long-range enhancers

13

13

13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

VISTA 13 Browser

minus

+

13

Training Data

+VISTA 13 Browser

EvoluDonary ConservaDon Human

ChimpFeatures Mouse

Rat

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

minus

13 13

13 13 13 13 13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

+VISTA 13 Browser

minus

EvoluDonary ConservaDon FuncDonal Genomics Human

ChimpFeatures Mouse

Rat

bullChIP-shy‐seq (TFs histones) bullDNase HypersensiDvity bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

13 13 13 13

13 13 13 13 13

13 13 13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

+VISTA 13 Browser

minus

EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

13 13 13 13

13 13 13 13 13

13 13 13

Training Data

+VISTA 13 Browser

minus

Computa6onal Algorithm

Support vector machine separates 2 groups 13 Multi-kernel good for combining heterogeneous data types with different weights

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

EnhancerFinder Performance

AUC=096 Power=85 at 10 FPR Recall=85 at 93 Precision 13 FDR ~10-50 Significantly better than other methods 13 gt80 in vivo validation rate

Erwin et al (2014) PLoS Comp Bio

EnhancerFinder Predictions

239 Predicted HAR13 Enhancers

Heart13 (28)

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

bull Identify sites with fitness effects (Gulko et al 2015) Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

HARs

Figure 2 Massively parallel functional testing of HARs For each HAR the chimp version and versions with all combinations of human mutations will be shotgun cloned along with synthetic 20bp barcodes into an enhancer assay vector This vector library will be introduced to human and chimp neuronal differentiated cells via lentivirus HAR sequences that function as enhancers in these cells will produce barcoded RNA Each unique HAR sequence will have an average of 90 tags to allow for a quantitative readout RNA sequencing (RNA-seq) will allow for the measurement of the relative activity of each mutation in the encoded HAR MinP (minimal promoter) mut (mutations)

Neuronal differentiated cells

Massively Parallel Reporter Assaysand capture Hi-C for validation

AGGT

MinP Luciferase

ACGT ACGT

TCGT ACGC

Enhancer vector

Transcribed barcodes

RNA-seq

Lentivirus

Text

bullTest gt12000 170bp enhancers in parallel bull Quantitative activity assayed via RNA-seq bull Compare genotypes bull Human vs chimp bull Disease SNPs

iPS derived cells

Enhancers Human

Chimp

Vector library

Hane Ryu Nadav Ahituv Jay Shendure Yin Shen

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Discovering the role of enhancers inhuman biology and evolution

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening In vivo molecular studies

Collaborators

EnhancerFinder Gen Haliburton Tony Capra Dennis Kostka John Rubenstein

TargetFinder Sean Whalen Rebecca Truty Benoit Bruneau

MotifDiverge Dennis Kostka Tara Friedrich Deb Ritter Jeff Chuang

Funding from NHLBI PhRMA Foundation Gladstone Institutes

Page 15: Interference of 3D Regulatory Interactions from 2D Genomic ... · Erwin et al. (2014) PLoS Comp Bio, Capra et al. (2014) PTRSB. EnhancerFinder: Predictions • 84,301 developmental

13 13 13

13 13 13 13 13

13 13 13 13 13 13 13 13 13 13 13

13 13 13 13

TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features

Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution

EvoluDonary ConservaDon FuncDonal Genomics Sequence AnnotaDons Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

Conserved13 synteny of13 bullChIA-shy‐PET K-shy‐mer correlaDon enhancer and promoter bullRNA-shy‐seq Annotated funcDons and

bullChIP-shy‐seq (TFs histones) pathways of gene and bullenhancerpromoterwindow enhancer bound TFs bullChromHMM13 amp Segway

13 13 13

13 13 13 13 13

13 13 13 13 13 13 13 13 13 13 13

13 13 13 13

TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features

Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution

Computa6onal Algorithm

Decision trees good for interacting features 13 Ensemble learning build many imperfect classifiers and combine them to improve prediction accuracy

EvoluDonary ConservaDon FuncDonal Genomics Sequence AnnotaDons Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

Conserved13 synteny of13 bullChIA-shy‐PET K-shy‐mer correlaDon enhancer and promoter bullRNA-shy‐seq Annotated funcDons and

bullChIP-shy‐seq (TFs histones) pathways of gene and bullenhancerpromoterwindow enhancer bound TFs bullChromHMM13 amp Segway

TargetFinder Performance AUC=094-096 Precision

=90-95 Recall=76-83 Power=85-89 at 10 FPR 13 Significantly better than random and logistic regression

TargetFinder performs well at verylong distances

TargetFinder Feature Importance

TargetFinder Feature Importance Most predictive features mark the window between the enhancer and the promoter

TargetFinder Feature Importance Most predictive features mark the window between the enhancer and the promoter

TargetFinder Feature Importance Most predictive features mark the window between the enhancer and the promoter

TargetFinder Feature Importance

TargetFinder Feature Importance Most useful features for prediction are TF andhistone marks in the window between the enhancer and the promoter

bull True interactions - Enhancer-associated proteins P300 JUN TFs - Marks of heterochromatin lack of DNA methylation - Marks of paused or poised RNA polymerase

TargetFinder Feature Importance Most useful features for prediction are TF andhistone marks in the window between the enhancer and the promoter

bull True interactions - Enhancer-associated proteins P300 JUN TFs - Marks of heterochromatin lack of DNA methylation - Marks of paused or poised RNA polymerase

bull False interactions - Cohesin complex CTCF RAD21 SMC3 ZNF143 - Histone marks of open chromatin and elongation - Marks of active promoters and gene bodies

TargetFinder Feature Importance Most useful features for prediction are TF andhistone marks in the window between the enhancer and the promoter

bull True interactions - Enhancer-associated proteins P300 JUN TFs - Marks of heterochromatin lack of DNA methylation - Marks of paused or poised RNA polymerase

bull False interactions - Cohesin complex CTCF RAD21 SMC3 ZNF143 - Histone marks of open chromatin and elongation - Marks of active promoters and gene bodies

Many ldquowindowrdquo features have a different meaning when marking promoters and enhancers (eg cohesin)

Predictive features colocate and form complexes

K562

Predictive features colocate and form complexes

K562

Can TargetFinder work outsideENCODE cell lines

Can TargetFinder work outsideENCODE cell lines

What is a minimal set of experimentsfor accurate prediction

Can TargetFinder work outsideENCODE cell lines

What is a minimal set of experimentsfor accurate prediction optimal 16+

minimal 8

Can TargetFinder work outsideENCODE cell lines

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types EVALUATE

GM12878 K562 HeLa-S3 HUVEC

GM12878 083 040 043 039

K562 046 085 045 044

HeLa-S3 043 038 088 041

HUVEC 039 040 038 -

TRAIN

Fmax13 values

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types EVALUATE

GM12878 K562 HeLa-S3 HUVEC

GM12878 083 040 043 039

K562 046 085 045 044

HeLa-S3 043 038 088 041

HUVEC 039 040 038 -

TRAIN

Fmax13 values

Expect ~35 precision and 55 recall on anew cell type with ~10 ChIP-seq datasets

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Massive data integration improves predictionbull Closest gene

- Usually fails to identify the right promoter - Many false positives

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Massive data integration improves predictionbull Closest gene

- Usually fails to identify the right promoter - Many false positives

bull TargetFinder - Identifies 95-90 of known pairs (55 with less data) - Few false positives

Which human genome sequences function as long-range enhancers

13

13

13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

VISTA 13 Browser

minus

+

13

Training Data

+VISTA 13 Browser

EvoluDonary ConservaDon Human

ChimpFeatures Mouse

Rat

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

minus

13 13

13 13 13 13 13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

+VISTA 13 Browser

minus

EvoluDonary ConservaDon FuncDonal Genomics Human

ChimpFeatures Mouse

Rat

bullChIP-shy‐seq (TFs histones) bullDNase HypersensiDvity bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

13 13 13 13

13 13 13 13 13

13 13 13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

+VISTA 13 Browser

minus

EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

13 13 13 13

13 13 13 13 13

13 13 13

Training Data

+VISTA 13 Browser

minus

Computa6onal Algorithm

Support vector machine separates 2 groups 13 Multi-kernel good for combining heterogeneous data types with different weights

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

EnhancerFinder Performance

AUC=096 Power=85 at 10 FPR Recall=85 at 93 Precision 13 FDR ~10-50 Significantly better than other methods 13 gt80 in vivo validation rate

Erwin et al (2014) PLoS Comp Bio

EnhancerFinder Predictions

239 Predicted HAR13 Enhancers

Heart13 (28)

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

bull Identify sites with fitness effects (Gulko et al 2015) Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

HARs

Figure 2 Massively parallel functional testing of HARs For each HAR the chimp version and versions with all combinations of human mutations will be shotgun cloned along with synthetic 20bp barcodes into an enhancer assay vector This vector library will be introduced to human and chimp neuronal differentiated cells via lentivirus HAR sequences that function as enhancers in these cells will produce barcoded RNA Each unique HAR sequence will have an average of 90 tags to allow for a quantitative readout RNA sequencing (RNA-seq) will allow for the measurement of the relative activity of each mutation in the encoded HAR MinP (minimal promoter) mut (mutations)

Neuronal differentiated cells

Massively Parallel Reporter Assaysand capture Hi-C for validation

AGGT

MinP Luciferase

ACGT ACGT

TCGT ACGC

Enhancer vector

Transcribed barcodes

RNA-seq

Lentivirus

Text

bullTest gt12000 170bp enhancers in parallel bull Quantitative activity assayed via RNA-seq bull Compare genotypes bull Human vs chimp bull Disease SNPs

iPS derived cells

Enhancers Human

Chimp

Vector library

Hane Ryu Nadav Ahituv Jay Shendure Yin Shen

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Discovering the role of enhancers inhuman biology and evolution

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening In vivo molecular studies

Collaborators

EnhancerFinder Gen Haliburton Tony Capra Dennis Kostka John Rubenstein

TargetFinder Sean Whalen Rebecca Truty Benoit Bruneau

MotifDiverge Dennis Kostka Tara Friedrich Deb Ritter Jeff Chuang

Funding from NHLBI PhRMA Foundation Gladstone Institutes

Page 16: Interference of 3D Regulatory Interactions from 2D Genomic ... · Erwin et al. (2014) PLoS Comp Bio, Capra et al. (2014) PTRSB. EnhancerFinder: Predictions • 84,301 developmental

13 13 13

13 13 13 13 13

13 13 13 13 13 13 13 13 13 13 13

13 13 13 13

TargetFinder Training Teach a machine learning algorithm to discriminate true versusfalse enhancer-shy‐promoter interac6ons based on their features

Training DataAcDve enhancerexpressed genePosiDves = Hi-shy‐C +NegaDves = Hi-shy‐C13 -shy‐Rao et al 2014 1-Kb resolution

Computa6onal Algorithm

Decision trees good for interacting features 13 Ensemble learning build many imperfect classifiers and combine them to improve prediction accuracy

EvoluDonary ConservaDon FuncDonal Genomics Sequence AnnotaDons Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

Conserved13 synteny of13 bullChIA-shy‐PET K-shy‐mer correlaDon enhancer and promoter bullRNA-shy‐seq Annotated funcDons and

bullChIP-shy‐seq (TFs histones) pathways of gene and bullenhancerpromoterwindow enhancer bound TFs bullChromHMM13 amp Segway

TargetFinder Performance AUC=094-096 Precision

=90-95 Recall=76-83 Power=85-89 at 10 FPR 13 Significantly better than random and logistic regression

TargetFinder performs well at verylong distances

TargetFinder Feature Importance

TargetFinder Feature Importance Most predictive features mark the window between the enhancer and the promoter

TargetFinder Feature Importance Most predictive features mark the window between the enhancer and the promoter

TargetFinder Feature Importance Most predictive features mark the window between the enhancer and the promoter

TargetFinder Feature Importance

TargetFinder Feature Importance Most useful features for prediction are TF andhistone marks in the window between the enhancer and the promoter

bull True interactions - Enhancer-associated proteins P300 JUN TFs - Marks of heterochromatin lack of DNA methylation - Marks of paused or poised RNA polymerase

TargetFinder Feature Importance Most useful features for prediction are TF andhistone marks in the window between the enhancer and the promoter

bull True interactions - Enhancer-associated proteins P300 JUN TFs - Marks of heterochromatin lack of DNA methylation - Marks of paused or poised RNA polymerase

bull False interactions - Cohesin complex CTCF RAD21 SMC3 ZNF143 - Histone marks of open chromatin and elongation - Marks of active promoters and gene bodies

TargetFinder Feature Importance Most useful features for prediction are TF andhistone marks in the window between the enhancer and the promoter

bull True interactions - Enhancer-associated proteins P300 JUN TFs - Marks of heterochromatin lack of DNA methylation - Marks of paused or poised RNA polymerase

bull False interactions - Cohesin complex CTCF RAD21 SMC3 ZNF143 - Histone marks of open chromatin and elongation - Marks of active promoters and gene bodies

Many ldquowindowrdquo features have a different meaning when marking promoters and enhancers (eg cohesin)

Predictive features colocate and form complexes

K562

Predictive features colocate and form complexes

K562

Can TargetFinder work outsideENCODE cell lines

Can TargetFinder work outsideENCODE cell lines

What is a minimal set of experimentsfor accurate prediction

Can TargetFinder work outsideENCODE cell lines

What is a minimal set of experimentsfor accurate prediction optimal 16+

minimal 8

Can TargetFinder work outsideENCODE cell lines

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types EVALUATE

GM12878 K562 HeLa-S3 HUVEC

GM12878 083 040 043 039

K562 046 085 045 044

HeLa-S3 043 038 088 041

HUVEC 039 040 038 -

TRAIN

Fmax13 values

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types EVALUATE

GM12878 K562 HeLa-S3 HUVEC

GM12878 083 040 043 039

K562 046 085 045 044

HeLa-S3 043 038 088 041

HUVEC 039 040 038 -

TRAIN

Fmax13 values

Expect ~35 precision and 55 recall on anew cell type with ~10 ChIP-seq datasets

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Massive data integration improves predictionbull Closest gene

- Usually fails to identify the right promoter - Many false positives

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Massive data integration improves predictionbull Closest gene

- Usually fails to identify the right promoter - Many false positives

bull TargetFinder - Identifies 95-90 of known pairs (55 with less data) - Few false positives

Which human genome sequences function as long-range enhancers

13

13

13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

VISTA 13 Browser

minus

+

13

Training Data

+VISTA 13 Browser

EvoluDonary ConservaDon Human

ChimpFeatures Mouse

Rat

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

minus

13 13

13 13 13 13 13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

+VISTA 13 Browser

minus

EvoluDonary ConservaDon FuncDonal Genomics Human

ChimpFeatures Mouse

Rat

bullChIP-shy‐seq (TFs histones) bullDNase HypersensiDvity bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

13 13 13 13

13 13 13 13 13

13 13 13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

+VISTA 13 Browser

minus

EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

13 13 13 13

13 13 13 13 13

13 13 13

Training Data

+VISTA 13 Browser

minus

Computa6onal Algorithm

Support vector machine separates 2 groups 13 Multi-kernel good for combining heterogeneous data types with different weights

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

EnhancerFinder Performance

AUC=096 Power=85 at 10 FPR Recall=85 at 93 Precision 13 FDR ~10-50 Significantly better than other methods 13 gt80 in vivo validation rate

Erwin et al (2014) PLoS Comp Bio

EnhancerFinder Predictions

239 Predicted HAR13 Enhancers

Heart13 (28)

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

bull Identify sites with fitness effects (Gulko et al 2015) Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

HARs

Figure 2 Massively parallel functional testing of HARs For each HAR the chimp version and versions with all combinations of human mutations will be shotgun cloned along with synthetic 20bp barcodes into an enhancer assay vector This vector library will be introduced to human and chimp neuronal differentiated cells via lentivirus HAR sequences that function as enhancers in these cells will produce barcoded RNA Each unique HAR sequence will have an average of 90 tags to allow for a quantitative readout RNA sequencing (RNA-seq) will allow for the measurement of the relative activity of each mutation in the encoded HAR MinP (minimal promoter) mut (mutations)

Neuronal differentiated cells

Massively Parallel Reporter Assaysand capture Hi-C for validation

AGGT

MinP Luciferase

ACGT ACGT

TCGT ACGC

Enhancer vector

Transcribed barcodes

RNA-seq

Lentivirus

Text

bullTest gt12000 170bp enhancers in parallel bull Quantitative activity assayed via RNA-seq bull Compare genotypes bull Human vs chimp bull Disease SNPs

iPS derived cells

Enhancers Human

Chimp

Vector library

Hane Ryu Nadav Ahituv Jay Shendure Yin Shen

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Discovering the role of enhancers inhuman biology and evolution

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening In vivo molecular studies

Collaborators

EnhancerFinder Gen Haliburton Tony Capra Dennis Kostka John Rubenstein

TargetFinder Sean Whalen Rebecca Truty Benoit Bruneau

MotifDiverge Dennis Kostka Tara Friedrich Deb Ritter Jeff Chuang

Funding from NHLBI PhRMA Foundation Gladstone Institutes

Page 17: Interference of 3D Regulatory Interactions from 2D Genomic ... · Erwin et al. (2014) PLoS Comp Bio, Capra et al. (2014) PTRSB. EnhancerFinder: Predictions • 84,301 developmental

TargetFinder Performance AUC=094-096 Precision

=90-95 Recall=76-83 Power=85-89 at 10 FPR 13 Significantly better than random and logistic regression

TargetFinder performs well at verylong distances

TargetFinder Feature Importance

TargetFinder Feature Importance Most predictive features mark the window between the enhancer and the promoter

TargetFinder Feature Importance Most predictive features mark the window between the enhancer and the promoter

TargetFinder Feature Importance Most predictive features mark the window between the enhancer and the promoter

TargetFinder Feature Importance

TargetFinder Feature Importance Most useful features for prediction are TF andhistone marks in the window between the enhancer and the promoter

bull True interactions - Enhancer-associated proteins P300 JUN TFs - Marks of heterochromatin lack of DNA methylation - Marks of paused or poised RNA polymerase

TargetFinder Feature Importance Most useful features for prediction are TF andhistone marks in the window between the enhancer and the promoter

bull True interactions - Enhancer-associated proteins P300 JUN TFs - Marks of heterochromatin lack of DNA methylation - Marks of paused or poised RNA polymerase

bull False interactions - Cohesin complex CTCF RAD21 SMC3 ZNF143 - Histone marks of open chromatin and elongation - Marks of active promoters and gene bodies

TargetFinder Feature Importance Most useful features for prediction are TF andhistone marks in the window between the enhancer and the promoter

bull True interactions - Enhancer-associated proteins P300 JUN TFs - Marks of heterochromatin lack of DNA methylation - Marks of paused or poised RNA polymerase

bull False interactions - Cohesin complex CTCF RAD21 SMC3 ZNF143 - Histone marks of open chromatin and elongation - Marks of active promoters and gene bodies

Many ldquowindowrdquo features have a different meaning when marking promoters and enhancers (eg cohesin)

Predictive features colocate and form complexes

K562

Predictive features colocate and form complexes

K562

Can TargetFinder work outsideENCODE cell lines

Can TargetFinder work outsideENCODE cell lines

What is a minimal set of experimentsfor accurate prediction

Can TargetFinder work outsideENCODE cell lines

What is a minimal set of experimentsfor accurate prediction optimal 16+

minimal 8

Can TargetFinder work outsideENCODE cell lines

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types EVALUATE

GM12878 K562 HeLa-S3 HUVEC

GM12878 083 040 043 039

K562 046 085 045 044

HeLa-S3 043 038 088 041

HUVEC 039 040 038 -

TRAIN

Fmax13 values

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types EVALUATE

GM12878 K562 HeLa-S3 HUVEC

GM12878 083 040 043 039

K562 046 085 045 044

HeLa-S3 043 038 088 041

HUVEC 039 040 038 -

TRAIN

Fmax13 values

Expect ~35 precision and 55 recall on anew cell type with ~10 ChIP-seq datasets

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Massive data integration improves predictionbull Closest gene

- Usually fails to identify the right promoter - Many false positives

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Massive data integration improves predictionbull Closest gene

- Usually fails to identify the right promoter - Many false positives

bull TargetFinder - Identifies 95-90 of known pairs (55 with less data) - Few false positives

Which human genome sequences function as long-range enhancers

13

13

13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

VISTA 13 Browser

minus

+

13

Training Data

+VISTA 13 Browser

EvoluDonary ConservaDon Human

ChimpFeatures Mouse

Rat

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

minus

13 13

13 13 13 13 13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

+VISTA 13 Browser

minus

EvoluDonary ConservaDon FuncDonal Genomics Human

ChimpFeatures Mouse

Rat

bullChIP-shy‐seq (TFs histones) bullDNase HypersensiDvity bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

13 13 13 13

13 13 13 13 13

13 13 13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

+VISTA 13 Browser

minus

EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

13 13 13 13

13 13 13 13 13

13 13 13

Training Data

+VISTA 13 Browser

minus

Computa6onal Algorithm

Support vector machine separates 2 groups 13 Multi-kernel good for combining heterogeneous data types with different weights

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

EnhancerFinder Performance

AUC=096 Power=85 at 10 FPR Recall=85 at 93 Precision 13 FDR ~10-50 Significantly better than other methods 13 gt80 in vivo validation rate

Erwin et al (2014) PLoS Comp Bio

EnhancerFinder Predictions

239 Predicted HAR13 Enhancers

Heart13 (28)

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

bull Identify sites with fitness effects (Gulko et al 2015) Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

HARs

Figure 2 Massively parallel functional testing of HARs For each HAR the chimp version and versions with all combinations of human mutations will be shotgun cloned along with synthetic 20bp barcodes into an enhancer assay vector This vector library will be introduced to human and chimp neuronal differentiated cells via lentivirus HAR sequences that function as enhancers in these cells will produce barcoded RNA Each unique HAR sequence will have an average of 90 tags to allow for a quantitative readout RNA sequencing (RNA-seq) will allow for the measurement of the relative activity of each mutation in the encoded HAR MinP (minimal promoter) mut (mutations)

Neuronal differentiated cells

Massively Parallel Reporter Assaysand capture Hi-C for validation

AGGT

MinP Luciferase

ACGT ACGT

TCGT ACGC

Enhancer vector

Transcribed barcodes

RNA-seq

Lentivirus

Text

bullTest gt12000 170bp enhancers in parallel bull Quantitative activity assayed via RNA-seq bull Compare genotypes bull Human vs chimp bull Disease SNPs

iPS derived cells

Enhancers Human

Chimp

Vector library

Hane Ryu Nadav Ahituv Jay Shendure Yin Shen

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Discovering the role of enhancers inhuman biology and evolution

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening In vivo molecular studies

Collaborators

EnhancerFinder Gen Haliburton Tony Capra Dennis Kostka John Rubenstein

TargetFinder Sean Whalen Rebecca Truty Benoit Bruneau

MotifDiverge Dennis Kostka Tara Friedrich Deb Ritter Jeff Chuang

Funding from NHLBI PhRMA Foundation Gladstone Institutes

Page 18: Interference of 3D Regulatory Interactions from 2D Genomic ... · Erwin et al. (2014) PLoS Comp Bio, Capra et al. (2014) PTRSB. EnhancerFinder: Predictions • 84,301 developmental

TargetFinder performs well at verylong distances

TargetFinder Feature Importance

TargetFinder Feature Importance Most predictive features mark the window between the enhancer and the promoter

TargetFinder Feature Importance Most predictive features mark the window between the enhancer and the promoter

TargetFinder Feature Importance Most predictive features mark the window between the enhancer and the promoter

TargetFinder Feature Importance

TargetFinder Feature Importance Most useful features for prediction are TF andhistone marks in the window between the enhancer and the promoter

bull True interactions - Enhancer-associated proteins P300 JUN TFs - Marks of heterochromatin lack of DNA methylation - Marks of paused or poised RNA polymerase

TargetFinder Feature Importance Most useful features for prediction are TF andhistone marks in the window between the enhancer and the promoter

bull True interactions - Enhancer-associated proteins P300 JUN TFs - Marks of heterochromatin lack of DNA methylation - Marks of paused or poised RNA polymerase

bull False interactions - Cohesin complex CTCF RAD21 SMC3 ZNF143 - Histone marks of open chromatin and elongation - Marks of active promoters and gene bodies

TargetFinder Feature Importance Most useful features for prediction are TF andhistone marks in the window between the enhancer and the promoter

bull True interactions - Enhancer-associated proteins P300 JUN TFs - Marks of heterochromatin lack of DNA methylation - Marks of paused or poised RNA polymerase

bull False interactions - Cohesin complex CTCF RAD21 SMC3 ZNF143 - Histone marks of open chromatin and elongation - Marks of active promoters and gene bodies

Many ldquowindowrdquo features have a different meaning when marking promoters and enhancers (eg cohesin)

Predictive features colocate and form complexes

K562

Predictive features colocate and form complexes

K562

Can TargetFinder work outsideENCODE cell lines

Can TargetFinder work outsideENCODE cell lines

What is a minimal set of experimentsfor accurate prediction

Can TargetFinder work outsideENCODE cell lines

What is a minimal set of experimentsfor accurate prediction optimal 16+

minimal 8

Can TargetFinder work outsideENCODE cell lines

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types EVALUATE

GM12878 K562 HeLa-S3 HUVEC

GM12878 083 040 043 039

K562 046 085 045 044

HeLa-S3 043 038 088 041

HUVEC 039 040 038 -

TRAIN

Fmax13 values

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types EVALUATE

GM12878 K562 HeLa-S3 HUVEC

GM12878 083 040 043 039

K562 046 085 045 044

HeLa-S3 043 038 088 041

HUVEC 039 040 038 -

TRAIN

Fmax13 values

Expect ~35 precision and 55 recall on anew cell type with ~10 ChIP-seq datasets

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Massive data integration improves predictionbull Closest gene

- Usually fails to identify the right promoter - Many false positives

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Massive data integration improves predictionbull Closest gene

- Usually fails to identify the right promoter - Many false positives

bull TargetFinder - Identifies 95-90 of known pairs (55 with less data) - Few false positives

Which human genome sequences function as long-range enhancers

13

13

13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

VISTA 13 Browser

minus

+

13

Training Data

+VISTA 13 Browser

EvoluDonary ConservaDon Human

ChimpFeatures Mouse

Rat

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

minus

13 13

13 13 13 13 13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

+VISTA 13 Browser

minus

EvoluDonary ConservaDon FuncDonal Genomics Human

ChimpFeatures Mouse

Rat

bullChIP-shy‐seq (TFs histones) bullDNase HypersensiDvity bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

13 13 13 13

13 13 13 13 13

13 13 13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

+VISTA 13 Browser

minus

EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

13 13 13 13

13 13 13 13 13

13 13 13

Training Data

+VISTA 13 Browser

minus

Computa6onal Algorithm

Support vector machine separates 2 groups 13 Multi-kernel good for combining heterogeneous data types with different weights

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

EnhancerFinder Performance

AUC=096 Power=85 at 10 FPR Recall=85 at 93 Precision 13 FDR ~10-50 Significantly better than other methods 13 gt80 in vivo validation rate

Erwin et al (2014) PLoS Comp Bio

EnhancerFinder Predictions

239 Predicted HAR13 Enhancers

Heart13 (28)

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

bull Identify sites with fitness effects (Gulko et al 2015) Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

HARs

Figure 2 Massively parallel functional testing of HARs For each HAR the chimp version and versions with all combinations of human mutations will be shotgun cloned along with synthetic 20bp barcodes into an enhancer assay vector This vector library will be introduced to human and chimp neuronal differentiated cells via lentivirus HAR sequences that function as enhancers in these cells will produce barcoded RNA Each unique HAR sequence will have an average of 90 tags to allow for a quantitative readout RNA sequencing (RNA-seq) will allow for the measurement of the relative activity of each mutation in the encoded HAR MinP (minimal promoter) mut (mutations)

Neuronal differentiated cells

Massively Parallel Reporter Assaysand capture Hi-C for validation

AGGT

MinP Luciferase

ACGT ACGT

TCGT ACGC

Enhancer vector

Transcribed barcodes

RNA-seq

Lentivirus

Text

bullTest gt12000 170bp enhancers in parallel bull Quantitative activity assayed via RNA-seq bull Compare genotypes bull Human vs chimp bull Disease SNPs

iPS derived cells

Enhancers Human

Chimp

Vector library

Hane Ryu Nadav Ahituv Jay Shendure Yin Shen

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Discovering the role of enhancers inhuman biology and evolution

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening In vivo molecular studies

Collaborators

EnhancerFinder Gen Haliburton Tony Capra Dennis Kostka John Rubenstein

TargetFinder Sean Whalen Rebecca Truty Benoit Bruneau

MotifDiverge Dennis Kostka Tara Friedrich Deb Ritter Jeff Chuang

Funding from NHLBI PhRMA Foundation Gladstone Institutes

Page 19: Interference of 3D Regulatory Interactions from 2D Genomic ... · Erwin et al. (2014) PLoS Comp Bio, Capra et al. (2014) PTRSB. EnhancerFinder: Predictions • 84,301 developmental

TargetFinder Feature Importance

TargetFinder Feature Importance Most predictive features mark the window between the enhancer and the promoter

TargetFinder Feature Importance Most predictive features mark the window between the enhancer and the promoter

TargetFinder Feature Importance Most predictive features mark the window between the enhancer and the promoter

TargetFinder Feature Importance

TargetFinder Feature Importance Most useful features for prediction are TF andhistone marks in the window between the enhancer and the promoter

bull True interactions - Enhancer-associated proteins P300 JUN TFs - Marks of heterochromatin lack of DNA methylation - Marks of paused or poised RNA polymerase

TargetFinder Feature Importance Most useful features for prediction are TF andhistone marks in the window between the enhancer and the promoter

bull True interactions - Enhancer-associated proteins P300 JUN TFs - Marks of heterochromatin lack of DNA methylation - Marks of paused or poised RNA polymerase

bull False interactions - Cohesin complex CTCF RAD21 SMC3 ZNF143 - Histone marks of open chromatin and elongation - Marks of active promoters and gene bodies

TargetFinder Feature Importance Most useful features for prediction are TF andhistone marks in the window between the enhancer and the promoter

bull True interactions - Enhancer-associated proteins P300 JUN TFs - Marks of heterochromatin lack of DNA methylation - Marks of paused or poised RNA polymerase

bull False interactions - Cohesin complex CTCF RAD21 SMC3 ZNF143 - Histone marks of open chromatin and elongation - Marks of active promoters and gene bodies

Many ldquowindowrdquo features have a different meaning when marking promoters and enhancers (eg cohesin)

Predictive features colocate and form complexes

K562

Predictive features colocate and form complexes

K562

Can TargetFinder work outsideENCODE cell lines

Can TargetFinder work outsideENCODE cell lines

What is a minimal set of experimentsfor accurate prediction

Can TargetFinder work outsideENCODE cell lines

What is a minimal set of experimentsfor accurate prediction optimal 16+

minimal 8

Can TargetFinder work outsideENCODE cell lines

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types EVALUATE

GM12878 K562 HeLa-S3 HUVEC

GM12878 083 040 043 039

K562 046 085 045 044

HeLa-S3 043 038 088 041

HUVEC 039 040 038 -

TRAIN

Fmax13 values

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types EVALUATE

GM12878 K562 HeLa-S3 HUVEC

GM12878 083 040 043 039

K562 046 085 045 044

HeLa-S3 043 038 088 041

HUVEC 039 040 038 -

TRAIN

Fmax13 values

Expect ~35 precision and 55 recall on anew cell type with ~10 ChIP-seq datasets

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Massive data integration improves predictionbull Closest gene

- Usually fails to identify the right promoter - Many false positives

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Massive data integration improves predictionbull Closest gene

- Usually fails to identify the right promoter - Many false positives

bull TargetFinder - Identifies 95-90 of known pairs (55 with less data) - Few false positives

Which human genome sequences function as long-range enhancers

13

13

13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

VISTA 13 Browser

minus

+

13

Training Data

+VISTA 13 Browser

EvoluDonary ConservaDon Human

ChimpFeatures Mouse

Rat

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

minus

13 13

13 13 13 13 13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

+VISTA 13 Browser

minus

EvoluDonary ConservaDon FuncDonal Genomics Human

ChimpFeatures Mouse

Rat

bullChIP-shy‐seq (TFs histones) bullDNase HypersensiDvity bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

13 13 13 13

13 13 13 13 13

13 13 13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

+VISTA 13 Browser

minus

EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

13 13 13 13

13 13 13 13 13

13 13 13

Training Data

+VISTA 13 Browser

minus

Computa6onal Algorithm

Support vector machine separates 2 groups 13 Multi-kernel good for combining heterogeneous data types with different weights

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

EnhancerFinder Performance

AUC=096 Power=85 at 10 FPR Recall=85 at 93 Precision 13 FDR ~10-50 Significantly better than other methods 13 gt80 in vivo validation rate

Erwin et al (2014) PLoS Comp Bio

EnhancerFinder Predictions

239 Predicted HAR13 Enhancers

Heart13 (28)

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

bull Identify sites with fitness effects (Gulko et al 2015) Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

HARs

Figure 2 Massively parallel functional testing of HARs For each HAR the chimp version and versions with all combinations of human mutations will be shotgun cloned along with synthetic 20bp barcodes into an enhancer assay vector This vector library will be introduced to human and chimp neuronal differentiated cells via lentivirus HAR sequences that function as enhancers in these cells will produce barcoded RNA Each unique HAR sequence will have an average of 90 tags to allow for a quantitative readout RNA sequencing (RNA-seq) will allow for the measurement of the relative activity of each mutation in the encoded HAR MinP (minimal promoter) mut (mutations)

Neuronal differentiated cells

Massively Parallel Reporter Assaysand capture Hi-C for validation

AGGT

MinP Luciferase

ACGT ACGT

TCGT ACGC

Enhancer vector

Transcribed barcodes

RNA-seq

Lentivirus

Text

bullTest gt12000 170bp enhancers in parallel bull Quantitative activity assayed via RNA-seq bull Compare genotypes bull Human vs chimp bull Disease SNPs

iPS derived cells

Enhancers Human

Chimp

Vector library

Hane Ryu Nadav Ahituv Jay Shendure Yin Shen

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Discovering the role of enhancers inhuman biology and evolution

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening In vivo molecular studies

Collaborators

EnhancerFinder Gen Haliburton Tony Capra Dennis Kostka John Rubenstein

TargetFinder Sean Whalen Rebecca Truty Benoit Bruneau

MotifDiverge Dennis Kostka Tara Friedrich Deb Ritter Jeff Chuang

Funding from NHLBI PhRMA Foundation Gladstone Institutes

Page 20: Interference of 3D Regulatory Interactions from 2D Genomic ... · Erwin et al. (2014) PLoS Comp Bio, Capra et al. (2014) PTRSB. EnhancerFinder: Predictions • 84,301 developmental

TargetFinder Feature Importance Most predictive features mark the window between the enhancer and the promoter

TargetFinder Feature Importance Most predictive features mark the window between the enhancer and the promoter

TargetFinder Feature Importance Most predictive features mark the window between the enhancer and the promoter

TargetFinder Feature Importance

TargetFinder Feature Importance Most useful features for prediction are TF andhistone marks in the window between the enhancer and the promoter

bull True interactions - Enhancer-associated proteins P300 JUN TFs - Marks of heterochromatin lack of DNA methylation - Marks of paused or poised RNA polymerase

TargetFinder Feature Importance Most useful features for prediction are TF andhistone marks in the window between the enhancer and the promoter

bull True interactions - Enhancer-associated proteins P300 JUN TFs - Marks of heterochromatin lack of DNA methylation - Marks of paused or poised RNA polymerase

bull False interactions - Cohesin complex CTCF RAD21 SMC3 ZNF143 - Histone marks of open chromatin and elongation - Marks of active promoters and gene bodies

TargetFinder Feature Importance Most useful features for prediction are TF andhistone marks in the window between the enhancer and the promoter

bull True interactions - Enhancer-associated proteins P300 JUN TFs - Marks of heterochromatin lack of DNA methylation - Marks of paused or poised RNA polymerase

bull False interactions - Cohesin complex CTCF RAD21 SMC3 ZNF143 - Histone marks of open chromatin and elongation - Marks of active promoters and gene bodies

Many ldquowindowrdquo features have a different meaning when marking promoters and enhancers (eg cohesin)

Predictive features colocate and form complexes

K562

Predictive features colocate and form complexes

K562

Can TargetFinder work outsideENCODE cell lines

Can TargetFinder work outsideENCODE cell lines

What is a minimal set of experimentsfor accurate prediction

Can TargetFinder work outsideENCODE cell lines

What is a minimal set of experimentsfor accurate prediction optimal 16+

minimal 8

Can TargetFinder work outsideENCODE cell lines

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types EVALUATE

GM12878 K562 HeLa-S3 HUVEC

GM12878 083 040 043 039

K562 046 085 045 044

HeLa-S3 043 038 088 041

HUVEC 039 040 038 -

TRAIN

Fmax13 values

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types EVALUATE

GM12878 K562 HeLa-S3 HUVEC

GM12878 083 040 043 039

K562 046 085 045 044

HeLa-S3 043 038 088 041

HUVEC 039 040 038 -

TRAIN

Fmax13 values

Expect ~35 precision and 55 recall on anew cell type with ~10 ChIP-seq datasets

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Massive data integration improves predictionbull Closest gene

- Usually fails to identify the right promoter - Many false positives

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Massive data integration improves predictionbull Closest gene

- Usually fails to identify the right promoter - Many false positives

bull TargetFinder - Identifies 95-90 of known pairs (55 with less data) - Few false positives

Which human genome sequences function as long-range enhancers

13

13

13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

VISTA 13 Browser

minus

+

13

Training Data

+VISTA 13 Browser

EvoluDonary ConservaDon Human

ChimpFeatures Mouse

Rat

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

minus

13 13

13 13 13 13 13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

+VISTA 13 Browser

minus

EvoluDonary ConservaDon FuncDonal Genomics Human

ChimpFeatures Mouse

Rat

bullChIP-shy‐seq (TFs histones) bullDNase HypersensiDvity bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

13 13 13 13

13 13 13 13 13

13 13 13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

+VISTA 13 Browser

minus

EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

13 13 13 13

13 13 13 13 13

13 13 13

Training Data

+VISTA 13 Browser

minus

Computa6onal Algorithm

Support vector machine separates 2 groups 13 Multi-kernel good for combining heterogeneous data types with different weights

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

EnhancerFinder Performance

AUC=096 Power=85 at 10 FPR Recall=85 at 93 Precision 13 FDR ~10-50 Significantly better than other methods 13 gt80 in vivo validation rate

Erwin et al (2014) PLoS Comp Bio

EnhancerFinder Predictions

239 Predicted HAR13 Enhancers

Heart13 (28)

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

bull Identify sites with fitness effects (Gulko et al 2015) Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

HARs

Figure 2 Massively parallel functional testing of HARs For each HAR the chimp version and versions with all combinations of human mutations will be shotgun cloned along with synthetic 20bp barcodes into an enhancer assay vector This vector library will be introduced to human and chimp neuronal differentiated cells via lentivirus HAR sequences that function as enhancers in these cells will produce barcoded RNA Each unique HAR sequence will have an average of 90 tags to allow for a quantitative readout RNA sequencing (RNA-seq) will allow for the measurement of the relative activity of each mutation in the encoded HAR MinP (minimal promoter) mut (mutations)

Neuronal differentiated cells

Massively Parallel Reporter Assaysand capture Hi-C for validation

AGGT

MinP Luciferase

ACGT ACGT

TCGT ACGC

Enhancer vector

Transcribed barcodes

RNA-seq

Lentivirus

Text

bullTest gt12000 170bp enhancers in parallel bull Quantitative activity assayed via RNA-seq bull Compare genotypes bull Human vs chimp bull Disease SNPs

iPS derived cells

Enhancers Human

Chimp

Vector library

Hane Ryu Nadav Ahituv Jay Shendure Yin Shen

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Discovering the role of enhancers inhuman biology and evolution

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening In vivo molecular studies

Collaborators

EnhancerFinder Gen Haliburton Tony Capra Dennis Kostka John Rubenstein

TargetFinder Sean Whalen Rebecca Truty Benoit Bruneau

MotifDiverge Dennis Kostka Tara Friedrich Deb Ritter Jeff Chuang

Funding from NHLBI PhRMA Foundation Gladstone Institutes

Page 21: Interference of 3D Regulatory Interactions from 2D Genomic ... · Erwin et al. (2014) PLoS Comp Bio, Capra et al. (2014) PTRSB. EnhancerFinder: Predictions • 84,301 developmental

TargetFinder Feature Importance Most predictive features mark the window between the enhancer and the promoter

TargetFinder Feature Importance Most predictive features mark the window between the enhancer and the promoter

TargetFinder Feature Importance

TargetFinder Feature Importance Most useful features for prediction are TF andhistone marks in the window between the enhancer and the promoter

bull True interactions - Enhancer-associated proteins P300 JUN TFs - Marks of heterochromatin lack of DNA methylation - Marks of paused or poised RNA polymerase

TargetFinder Feature Importance Most useful features for prediction are TF andhistone marks in the window between the enhancer and the promoter

bull True interactions - Enhancer-associated proteins P300 JUN TFs - Marks of heterochromatin lack of DNA methylation - Marks of paused or poised RNA polymerase

bull False interactions - Cohesin complex CTCF RAD21 SMC3 ZNF143 - Histone marks of open chromatin and elongation - Marks of active promoters and gene bodies

TargetFinder Feature Importance Most useful features for prediction are TF andhistone marks in the window between the enhancer and the promoter

bull True interactions - Enhancer-associated proteins P300 JUN TFs - Marks of heterochromatin lack of DNA methylation - Marks of paused or poised RNA polymerase

bull False interactions - Cohesin complex CTCF RAD21 SMC3 ZNF143 - Histone marks of open chromatin and elongation - Marks of active promoters and gene bodies

Many ldquowindowrdquo features have a different meaning when marking promoters and enhancers (eg cohesin)

Predictive features colocate and form complexes

K562

Predictive features colocate and form complexes

K562

Can TargetFinder work outsideENCODE cell lines

Can TargetFinder work outsideENCODE cell lines

What is a minimal set of experimentsfor accurate prediction

Can TargetFinder work outsideENCODE cell lines

What is a minimal set of experimentsfor accurate prediction optimal 16+

minimal 8

Can TargetFinder work outsideENCODE cell lines

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types EVALUATE

GM12878 K562 HeLa-S3 HUVEC

GM12878 083 040 043 039

K562 046 085 045 044

HeLa-S3 043 038 088 041

HUVEC 039 040 038 -

TRAIN

Fmax13 values

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types EVALUATE

GM12878 K562 HeLa-S3 HUVEC

GM12878 083 040 043 039

K562 046 085 045 044

HeLa-S3 043 038 088 041

HUVEC 039 040 038 -

TRAIN

Fmax13 values

Expect ~35 precision and 55 recall on anew cell type with ~10 ChIP-seq datasets

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Massive data integration improves predictionbull Closest gene

- Usually fails to identify the right promoter - Many false positives

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Massive data integration improves predictionbull Closest gene

- Usually fails to identify the right promoter - Many false positives

bull TargetFinder - Identifies 95-90 of known pairs (55 with less data) - Few false positives

Which human genome sequences function as long-range enhancers

13

13

13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

VISTA 13 Browser

minus

+

13

Training Data

+VISTA 13 Browser

EvoluDonary ConservaDon Human

ChimpFeatures Mouse

Rat

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

minus

13 13

13 13 13 13 13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

+VISTA 13 Browser

minus

EvoluDonary ConservaDon FuncDonal Genomics Human

ChimpFeatures Mouse

Rat

bullChIP-shy‐seq (TFs histones) bullDNase HypersensiDvity bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

13 13 13 13

13 13 13 13 13

13 13 13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

+VISTA 13 Browser

minus

EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

13 13 13 13

13 13 13 13 13

13 13 13

Training Data

+VISTA 13 Browser

minus

Computa6onal Algorithm

Support vector machine separates 2 groups 13 Multi-kernel good for combining heterogeneous data types with different weights

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

EnhancerFinder Performance

AUC=096 Power=85 at 10 FPR Recall=85 at 93 Precision 13 FDR ~10-50 Significantly better than other methods 13 gt80 in vivo validation rate

Erwin et al (2014) PLoS Comp Bio

EnhancerFinder Predictions

239 Predicted HAR13 Enhancers

Heart13 (28)

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

bull Identify sites with fitness effects (Gulko et al 2015) Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

HARs

Figure 2 Massively parallel functional testing of HARs For each HAR the chimp version and versions with all combinations of human mutations will be shotgun cloned along with synthetic 20bp barcodes into an enhancer assay vector This vector library will be introduced to human and chimp neuronal differentiated cells via lentivirus HAR sequences that function as enhancers in these cells will produce barcoded RNA Each unique HAR sequence will have an average of 90 tags to allow for a quantitative readout RNA sequencing (RNA-seq) will allow for the measurement of the relative activity of each mutation in the encoded HAR MinP (minimal promoter) mut (mutations)

Neuronal differentiated cells

Massively Parallel Reporter Assaysand capture Hi-C for validation

AGGT

MinP Luciferase

ACGT ACGT

TCGT ACGC

Enhancer vector

Transcribed barcodes

RNA-seq

Lentivirus

Text

bullTest gt12000 170bp enhancers in parallel bull Quantitative activity assayed via RNA-seq bull Compare genotypes bull Human vs chimp bull Disease SNPs

iPS derived cells

Enhancers Human

Chimp

Vector library

Hane Ryu Nadav Ahituv Jay Shendure Yin Shen

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Discovering the role of enhancers inhuman biology and evolution

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening In vivo molecular studies

Collaborators

EnhancerFinder Gen Haliburton Tony Capra Dennis Kostka John Rubenstein

TargetFinder Sean Whalen Rebecca Truty Benoit Bruneau

MotifDiverge Dennis Kostka Tara Friedrich Deb Ritter Jeff Chuang

Funding from NHLBI PhRMA Foundation Gladstone Institutes

Page 22: Interference of 3D Regulatory Interactions from 2D Genomic ... · Erwin et al. (2014) PLoS Comp Bio, Capra et al. (2014) PTRSB. EnhancerFinder: Predictions • 84,301 developmental

TargetFinder Feature Importance Most predictive features mark the window between the enhancer and the promoter

TargetFinder Feature Importance

TargetFinder Feature Importance Most useful features for prediction are TF andhistone marks in the window between the enhancer and the promoter

bull True interactions - Enhancer-associated proteins P300 JUN TFs - Marks of heterochromatin lack of DNA methylation - Marks of paused or poised RNA polymerase

TargetFinder Feature Importance Most useful features for prediction are TF andhistone marks in the window between the enhancer and the promoter

bull True interactions - Enhancer-associated proteins P300 JUN TFs - Marks of heterochromatin lack of DNA methylation - Marks of paused or poised RNA polymerase

bull False interactions - Cohesin complex CTCF RAD21 SMC3 ZNF143 - Histone marks of open chromatin and elongation - Marks of active promoters and gene bodies

TargetFinder Feature Importance Most useful features for prediction are TF andhistone marks in the window between the enhancer and the promoter

bull True interactions - Enhancer-associated proteins P300 JUN TFs - Marks of heterochromatin lack of DNA methylation - Marks of paused or poised RNA polymerase

bull False interactions - Cohesin complex CTCF RAD21 SMC3 ZNF143 - Histone marks of open chromatin and elongation - Marks of active promoters and gene bodies

Many ldquowindowrdquo features have a different meaning when marking promoters and enhancers (eg cohesin)

Predictive features colocate and form complexes

K562

Predictive features colocate and form complexes

K562

Can TargetFinder work outsideENCODE cell lines

Can TargetFinder work outsideENCODE cell lines

What is a minimal set of experimentsfor accurate prediction

Can TargetFinder work outsideENCODE cell lines

What is a minimal set of experimentsfor accurate prediction optimal 16+

minimal 8

Can TargetFinder work outsideENCODE cell lines

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types EVALUATE

GM12878 K562 HeLa-S3 HUVEC

GM12878 083 040 043 039

K562 046 085 045 044

HeLa-S3 043 038 088 041

HUVEC 039 040 038 -

TRAIN

Fmax13 values

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types EVALUATE

GM12878 K562 HeLa-S3 HUVEC

GM12878 083 040 043 039

K562 046 085 045 044

HeLa-S3 043 038 088 041

HUVEC 039 040 038 -

TRAIN

Fmax13 values

Expect ~35 precision and 55 recall on anew cell type with ~10 ChIP-seq datasets

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Massive data integration improves predictionbull Closest gene

- Usually fails to identify the right promoter - Many false positives

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Massive data integration improves predictionbull Closest gene

- Usually fails to identify the right promoter - Many false positives

bull TargetFinder - Identifies 95-90 of known pairs (55 with less data) - Few false positives

Which human genome sequences function as long-range enhancers

13

13

13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

VISTA 13 Browser

minus

+

13

Training Data

+VISTA 13 Browser

EvoluDonary ConservaDon Human

ChimpFeatures Mouse

Rat

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

minus

13 13

13 13 13 13 13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

+VISTA 13 Browser

minus

EvoluDonary ConservaDon FuncDonal Genomics Human

ChimpFeatures Mouse

Rat

bullChIP-shy‐seq (TFs histones) bullDNase HypersensiDvity bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

13 13 13 13

13 13 13 13 13

13 13 13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

+VISTA 13 Browser

minus

EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

13 13 13 13

13 13 13 13 13

13 13 13

Training Data

+VISTA 13 Browser

minus

Computa6onal Algorithm

Support vector machine separates 2 groups 13 Multi-kernel good for combining heterogeneous data types with different weights

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

EnhancerFinder Performance

AUC=096 Power=85 at 10 FPR Recall=85 at 93 Precision 13 FDR ~10-50 Significantly better than other methods 13 gt80 in vivo validation rate

Erwin et al (2014) PLoS Comp Bio

EnhancerFinder Predictions

239 Predicted HAR13 Enhancers

Heart13 (28)

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

bull Identify sites with fitness effects (Gulko et al 2015) Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

HARs

Figure 2 Massively parallel functional testing of HARs For each HAR the chimp version and versions with all combinations of human mutations will be shotgun cloned along with synthetic 20bp barcodes into an enhancer assay vector This vector library will be introduced to human and chimp neuronal differentiated cells via lentivirus HAR sequences that function as enhancers in these cells will produce barcoded RNA Each unique HAR sequence will have an average of 90 tags to allow for a quantitative readout RNA sequencing (RNA-seq) will allow for the measurement of the relative activity of each mutation in the encoded HAR MinP (minimal promoter) mut (mutations)

Neuronal differentiated cells

Massively Parallel Reporter Assaysand capture Hi-C for validation

AGGT

MinP Luciferase

ACGT ACGT

TCGT ACGC

Enhancer vector

Transcribed barcodes

RNA-seq

Lentivirus

Text

bullTest gt12000 170bp enhancers in parallel bull Quantitative activity assayed via RNA-seq bull Compare genotypes bull Human vs chimp bull Disease SNPs

iPS derived cells

Enhancers Human

Chimp

Vector library

Hane Ryu Nadav Ahituv Jay Shendure Yin Shen

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Discovering the role of enhancers inhuman biology and evolution

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening In vivo molecular studies

Collaborators

EnhancerFinder Gen Haliburton Tony Capra Dennis Kostka John Rubenstein

TargetFinder Sean Whalen Rebecca Truty Benoit Bruneau

MotifDiverge Dennis Kostka Tara Friedrich Deb Ritter Jeff Chuang

Funding from NHLBI PhRMA Foundation Gladstone Institutes

Page 23: Interference of 3D Regulatory Interactions from 2D Genomic ... · Erwin et al. (2014) PLoS Comp Bio, Capra et al. (2014) PTRSB. EnhancerFinder: Predictions • 84,301 developmental

TargetFinder Feature Importance

TargetFinder Feature Importance Most useful features for prediction are TF andhistone marks in the window between the enhancer and the promoter

bull True interactions - Enhancer-associated proteins P300 JUN TFs - Marks of heterochromatin lack of DNA methylation - Marks of paused or poised RNA polymerase

TargetFinder Feature Importance Most useful features for prediction are TF andhistone marks in the window between the enhancer and the promoter

bull True interactions - Enhancer-associated proteins P300 JUN TFs - Marks of heterochromatin lack of DNA methylation - Marks of paused or poised RNA polymerase

bull False interactions - Cohesin complex CTCF RAD21 SMC3 ZNF143 - Histone marks of open chromatin and elongation - Marks of active promoters and gene bodies

TargetFinder Feature Importance Most useful features for prediction are TF andhistone marks in the window between the enhancer and the promoter

bull True interactions - Enhancer-associated proteins P300 JUN TFs - Marks of heterochromatin lack of DNA methylation - Marks of paused or poised RNA polymerase

bull False interactions - Cohesin complex CTCF RAD21 SMC3 ZNF143 - Histone marks of open chromatin and elongation - Marks of active promoters and gene bodies

Many ldquowindowrdquo features have a different meaning when marking promoters and enhancers (eg cohesin)

Predictive features colocate and form complexes

K562

Predictive features colocate and form complexes

K562

Can TargetFinder work outsideENCODE cell lines

Can TargetFinder work outsideENCODE cell lines

What is a minimal set of experimentsfor accurate prediction

Can TargetFinder work outsideENCODE cell lines

What is a minimal set of experimentsfor accurate prediction optimal 16+

minimal 8

Can TargetFinder work outsideENCODE cell lines

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types EVALUATE

GM12878 K562 HeLa-S3 HUVEC

GM12878 083 040 043 039

K562 046 085 045 044

HeLa-S3 043 038 088 041

HUVEC 039 040 038 -

TRAIN

Fmax13 values

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types EVALUATE

GM12878 K562 HeLa-S3 HUVEC

GM12878 083 040 043 039

K562 046 085 045 044

HeLa-S3 043 038 088 041

HUVEC 039 040 038 -

TRAIN

Fmax13 values

Expect ~35 precision and 55 recall on anew cell type with ~10 ChIP-seq datasets

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Massive data integration improves predictionbull Closest gene

- Usually fails to identify the right promoter - Many false positives

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Massive data integration improves predictionbull Closest gene

- Usually fails to identify the right promoter - Many false positives

bull TargetFinder - Identifies 95-90 of known pairs (55 with less data) - Few false positives

Which human genome sequences function as long-range enhancers

13

13

13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

VISTA 13 Browser

minus

+

13

Training Data

+VISTA 13 Browser

EvoluDonary ConservaDon Human

ChimpFeatures Mouse

Rat

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

minus

13 13

13 13 13 13 13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

+VISTA 13 Browser

minus

EvoluDonary ConservaDon FuncDonal Genomics Human

ChimpFeatures Mouse

Rat

bullChIP-shy‐seq (TFs histones) bullDNase HypersensiDvity bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

13 13 13 13

13 13 13 13 13

13 13 13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

+VISTA 13 Browser

minus

EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

13 13 13 13

13 13 13 13 13

13 13 13

Training Data

+VISTA 13 Browser

minus

Computa6onal Algorithm

Support vector machine separates 2 groups 13 Multi-kernel good for combining heterogeneous data types with different weights

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

EnhancerFinder Performance

AUC=096 Power=85 at 10 FPR Recall=85 at 93 Precision 13 FDR ~10-50 Significantly better than other methods 13 gt80 in vivo validation rate

Erwin et al (2014) PLoS Comp Bio

EnhancerFinder Predictions

239 Predicted HAR13 Enhancers

Heart13 (28)

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

bull Identify sites with fitness effects (Gulko et al 2015) Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

HARs

Figure 2 Massively parallel functional testing of HARs For each HAR the chimp version and versions with all combinations of human mutations will be shotgun cloned along with synthetic 20bp barcodes into an enhancer assay vector This vector library will be introduced to human and chimp neuronal differentiated cells via lentivirus HAR sequences that function as enhancers in these cells will produce barcoded RNA Each unique HAR sequence will have an average of 90 tags to allow for a quantitative readout RNA sequencing (RNA-seq) will allow for the measurement of the relative activity of each mutation in the encoded HAR MinP (minimal promoter) mut (mutations)

Neuronal differentiated cells

Massively Parallel Reporter Assaysand capture Hi-C for validation

AGGT

MinP Luciferase

ACGT ACGT

TCGT ACGC

Enhancer vector

Transcribed barcodes

RNA-seq

Lentivirus

Text

bullTest gt12000 170bp enhancers in parallel bull Quantitative activity assayed via RNA-seq bull Compare genotypes bull Human vs chimp bull Disease SNPs

iPS derived cells

Enhancers Human

Chimp

Vector library

Hane Ryu Nadav Ahituv Jay Shendure Yin Shen

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Discovering the role of enhancers inhuman biology and evolution

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening In vivo molecular studies

Collaborators

EnhancerFinder Gen Haliburton Tony Capra Dennis Kostka John Rubenstein

TargetFinder Sean Whalen Rebecca Truty Benoit Bruneau

MotifDiverge Dennis Kostka Tara Friedrich Deb Ritter Jeff Chuang

Funding from NHLBI PhRMA Foundation Gladstone Institutes

Page 24: Interference of 3D Regulatory Interactions from 2D Genomic ... · Erwin et al. (2014) PLoS Comp Bio, Capra et al. (2014) PTRSB. EnhancerFinder: Predictions • 84,301 developmental

TargetFinder Feature Importance Most useful features for prediction are TF andhistone marks in the window between the enhancer and the promoter

bull True interactions - Enhancer-associated proteins P300 JUN TFs - Marks of heterochromatin lack of DNA methylation - Marks of paused or poised RNA polymerase

TargetFinder Feature Importance Most useful features for prediction are TF andhistone marks in the window between the enhancer and the promoter

bull True interactions - Enhancer-associated proteins P300 JUN TFs - Marks of heterochromatin lack of DNA methylation - Marks of paused or poised RNA polymerase

bull False interactions - Cohesin complex CTCF RAD21 SMC3 ZNF143 - Histone marks of open chromatin and elongation - Marks of active promoters and gene bodies

TargetFinder Feature Importance Most useful features for prediction are TF andhistone marks in the window between the enhancer and the promoter

bull True interactions - Enhancer-associated proteins P300 JUN TFs - Marks of heterochromatin lack of DNA methylation - Marks of paused or poised RNA polymerase

bull False interactions - Cohesin complex CTCF RAD21 SMC3 ZNF143 - Histone marks of open chromatin and elongation - Marks of active promoters and gene bodies

Many ldquowindowrdquo features have a different meaning when marking promoters and enhancers (eg cohesin)

Predictive features colocate and form complexes

K562

Predictive features colocate and form complexes

K562

Can TargetFinder work outsideENCODE cell lines

Can TargetFinder work outsideENCODE cell lines

What is a minimal set of experimentsfor accurate prediction

Can TargetFinder work outsideENCODE cell lines

What is a minimal set of experimentsfor accurate prediction optimal 16+

minimal 8

Can TargetFinder work outsideENCODE cell lines

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types EVALUATE

GM12878 K562 HeLa-S3 HUVEC

GM12878 083 040 043 039

K562 046 085 045 044

HeLa-S3 043 038 088 041

HUVEC 039 040 038 -

TRAIN

Fmax13 values

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types EVALUATE

GM12878 K562 HeLa-S3 HUVEC

GM12878 083 040 043 039

K562 046 085 045 044

HeLa-S3 043 038 088 041

HUVEC 039 040 038 -

TRAIN

Fmax13 values

Expect ~35 precision and 55 recall on anew cell type with ~10 ChIP-seq datasets

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Massive data integration improves predictionbull Closest gene

- Usually fails to identify the right promoter - Many false positives

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Massive data integration improves predictionbull Closest gene

- Usually fails to identify the right promoter - Many false positives

bull TargetFinder - Identifies 95-90 of known pairs (55 with less data) - Few false positives

Which human genome sequences function as long-range enhancers

13

13

13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

VISTA 13 Browser

minus

+

13

Training Data

+VISTA 13 Browser

EvoluDonary ConservaDon Human

ChimpFeatures Mouse

Rat

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

minus

13 13

13 13 13 13 13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

+VISTA 13 Browser

minus

EvoluDonary ConservaDon FuncDonal Genomics Human

ChimpFeatures Mouse

Rat

bullChIP-shy‐seq (TFs histones) bullDNase HypersensiDvity bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

13 13 13 13

13 13 13 13 13

13 13 13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

+VISTA 13 Browser

minus

EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

13 13 13 13

13 13 13 13 13

13 13 13

Training Data

+VISTA 13 Browser

minus

Computa6onal Algorithm

Support vector machine separates 2 groups 13 Multi-kernel good for combining heterogeneous data types with different weights

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

EnhancerFinder Performance

AUC=096 Power=85 at 10 FPR Recall=85 at 93 Precision 13 FDR ~10-50 Significantly better than other methods 13 gt80 in vivo validation rate

Erwin et al (2014) PLoS Comp Bio

EnhancerFinder Predictions

239 Predicted HAR13 Enhancers

Heart13 (28)

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

bull Identify sites with fitness effects (Gulko et al 2015) Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

HARs

Figure 2 Massively parallel functional testing of HARs For each HAR the chimp version and versions with all combinations of human mutations will be shotgun cloned along with synthetic 20bp barcodes into an enhancer assay vector This vector library will be introduced to human and chimp neuronal differentiated cells via lentivirus HAR sequences that function as enhancers in these cells will produce barcoded RNA Each unique HAR sequence will have an average of 90 tags to allow for a quantitative readout RNA sequencing (RNA-seq) will allow for the measurement of the relative activity of each mutation in the encoded HAR MinP (minimal promoter) mut (mutations)

Neuronal differentiated cells

Massively Parallel Reporter Assaysand capture Hi-C for validation

AGGT

MinP Luciferase

ACGT ACGT

TCGT ACGC

Enhancer vector

Transcribed barcodes

RNA-seq

Lentivirus

Text

bullTest gt12000 170bp enhancers in parallel bull Quantitative activity assayed via RNA-seq bull Compare genotypes bull Human vs chimp bull Disease SNPs

iPS derived cells

Enhancers Human

Chimp

Vector library

Hane Ryu Nadav Ahituv Jay Shendure Yin Shen

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Discovering the role of enhancers inhuman biology and evolution

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening In vivo molecular studies

Collaborators

EnhancerFinder Gen Haliburton Tony Capra Dennis Kostka John Rubenstein

TargetFinder Sean Whalen Rebecca Truty Benoit Bruneau

MotifDiverge Dennis Kostka Tara Friedrich Deb Ritter Jeff Chuang

Funding from NHLBI PhRMA Foundation Gladstone Institutes

Page 25: Interference of 3D Regulatory Interactions from 2D Genomic ... · Erwin et al. (2014) PLoS Comp Bio, Capra et al. (2014) PTRSB. EnhancerFinder: Predictions • 84,301 developmental

TargetFinder Feature Importance Most useful features for prediction are TF andhistone marks in the window between the enhancer and the promoter

bull True interactions - Enhancer-associated proteins P300 JUN TFs - Marks of heterochromatin lack of DNA methylation - Marks of paused or poised RNA polymerase

bull False interactions - Cohesin complex CTCF RAD21 SMC3 ZNF143 - Histone marks of open chromatin and elongation - Marks of active promoters and gene bodies

TargetFinder Feature Importance Most useful features for prediction are TF andhistone marks in the window between the enhancer and the promoter

bull True interactions - Enhancer-associated proteins P300 JUN TFs - Marks of heterochromatin lack of DNA methylation - Marks of paused or poised RNA polymerase

bull False interactions - Cohesin complex CTCF RAD21 SMC3 ZNF143 - Histone marks of open chromatin and elongation - Marks of active promoters and gene bodies

Many ldquowindowrdquo features have a different meaning when marking promoters and enhancers (eg cohesin)

Predictive features colocate and form complexes

K562

Predictive features colocate and form complexes

K562

Can TargetFinder work outsideENCODE cell lines

Can TargetFinder work outsideENCODE cell lines

What is a minimal set of experimentsfor accurate prediction

Can TargetFinder work outsideENCODE cell lines

What is a minimal set of experimentsfor accurate prediction optimal 16+

minimal 8

Can TargetFinder work outsideENCODE cell lines

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types EVALUATE

GM12878 K562 HeLa-S3 HUVEC

GM12878 083 040 043 039

K562 046 085 045 044

HeLa-S3 043 038 088 041

HUVEC 039 040 038 -

TRAIN

Fmax13 values

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types EVALUATE

GM12878 K562 HeLa-S3 HUVEC

GM12878 083 040 043 039

K562 046 085 045 044

HeLa-S3 043 038 088 041

HUVEC 039 040 038 -

TRAIN

Fmax13 values

Expect ~35 precision and 55 recall on anew cell type with ~10 ChIP-seq datasets

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Massive data integration improves predictionbull Closest gene

- Usually fails to identify the right promoter - Many false positives

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Massive data integration improves predictionbull Closest gene

- Usually fails to identify the right promoter - Many false positives

bull TargetFinder - Identifies 95-90 of known pairs (55 with less data) - Few false positives

Which human genome sequences function as long-range enhancers

13

13

13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

VISTA 13 Browser

minus

+

13

Training Data

+VISTA 13 Browser

EvoluDonary ConservaDon Human

ChimpFeatures Mouse

Rat

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

minus

13 13

13 13 13 13 13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

+VISTA 13 Browser

minus

EvoluDonary ConservaDon FuncDonal Genomics Human

ChimpFeatures Mouse

Rat

bullChIP-shy‐seq (TFs histones) bullDNase HypersensiDvity bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

13 13 13 13

13 13 13 13 13

13 13 13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

+VISTA 13 Browser

minus

EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

13 13 13 13

13 13 13 13 13

13 13 13

Training Data

+VISTA 13 Browser

minus

Computa6onal Algorithm

Support vector machine separates 2 groups 13 Multi-kernel good for combining heterogeneous data types with different weights

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

EnhancerFinder Performance

AUC=096 Power=85 at 10 FPR Recall=85 at 93 Precision 13 FDR ~10-50 Significantly better than other methods 13 gt80 in vivo validation rate

Erwin et al (2014) PLoS Comp Bio

EnhancerFinder Predictions

239 Predicted HAR13 Enhancers

Heart13 (28)

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

bull Identify sites with fitness effects (Gulko et al 2015) Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

HARs

Figure 2 Massively parallel functional testing of HARs For each HAR the chimp version and versions with all combinations of human mutations will be shotgun cloned along with synthetic 20bp barcodes into an enhancer assay vector This vector library will be introduced to human and chimp neuronal differentiated cells via lentivirus HAR sequences that function as enhancers in these cells will produce barcoded RNA Each unique HAR sequence will have an average of 90 tags to allow for a quantitative readout RNA sequencing (RNA-seq) will allow for the measurement of the relative activity of each mutation in the encoded HAR MinP (minimal promoter) mut (mutations)

Neuronal differentiated cells

Massively Parallel Reporter Assaysand capture Hi-C for validation

AGGT

MinP Luciferase

ACGT ACGT

TCGT ACGC

Enhancer vector

Transcribed barcodes

RNA-seq

Lentivirus

Text

bullTest gt12000 170bp enhancers in parallel bull Quantitative activity assayed via RNA-seq bull Compare genotypes bull Human vs chimp bull Disease SNPs

iPS derived cells

Enhancers Human

Chimp

Vector library

Hane Ryu Nadav Ahituv Jay Shendure Yin Shen

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Discovering the role of enhancers inhuman biology and evolution

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening In vivo molecular studies

Collaborators

EnhancerFinder Gen Haliburton Tony Capra Dennis Kostka John Rubenstein

TargetFinder Sean Whalen Rebecca Truty Benoit Bruneau

MotifDiverge Dennis Kostka Tara Friedrich Deb Ritter Jeff Chuang

Funding from NHLBI PhRMA Foundation Gladstone Institutes

Page 26: Interference of 3D Regulatory Interactions from 2D Genomic ... · Erwin et al. (2014) PLoS Comp Bio, Capra et al. (2014) PTRSB. EnhancerFinder: Predictions • 84,301 developmental

TargetFinder Feature Importance Most useful features for prediction are TF andhistone marks in the window between the enhancer and the promoter

bull True interactions - Enhancer-associated proteins P300 JUN TFs - Marks of heterochromatin lack of DNA methylation - Marks of paused or poised RNA polymerase

bull False interactions - Cohesin complex CTCF RAD21 SMC3 ZNF143 - Histone marks of open chromatin and elongation - Marks of active promoters and gene bodies

Many ldquowindowrdquo features have a different meaning when marking promoters and enhancers (eg cohesin)

Predictive features colocate and form complexes

K562

Predictive features colocate and form complexes

K562

Can TargetFinder work outsideENCODE cell lines

Can TargetFinder work outsideENCODE cell lines

What is a minimal set of experimentsfor accurate prediction

Can TargetFinder work outsideENCODE cell lines

What is a minimal set of experimentsfor accurate prediction optimal 16+

minimal 8

Can TargetFinder work outsideENCODE cell lines

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types EVALUATE

GM12878 K562 HeLa-S3 HUVEC

GM12878 083 040 043 039

K562 046 085 045 044

HeLa-S3 043 038 088 041

HUVEC 039 040 038 -

TRAIN

Fmax13 values

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types EVALUATE

GM12878 K562 HeLa-S3 HUVEC

GM12878 083 040 043 039

K562 046 085 045 044

HeLa-S3 043 038 088 041

HUVEC 039 040 038 -

TRAIN

Fmax13 values

Expect ~35 precision and 55 recall on anew cell type with ~10 ChIP-seq datasets

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Massive data integration improves predictionbull Closest gene

- Usually fails to identify the right promoter - Many false positives

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Massive data integration improves predictionbull Closest gene

- Usually fails to identify the right promoter - Many false positives

bull TargetFinder - Identifies 95-90 of known pairs (55 with less data) - Few false positives

Which human genome sequences function as long-range enhancers

13

13

13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

VISTA 13 Browser

minus

+

13

Training Data

+VISTA 13 Browser

EvoluDonary ConservaDon Human

ChimpFeatures Mouse

Rat

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

minus

13 13

13 13 13 13 13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

+VISTA 13 Browser

minus

EvoluDonary ConservaDon FuncDonal Genomics Human

ChimpFeatures Mouse

Rat

bullChIP-shy‐seq (TFs histones) bullDNase HypersensiDvity bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

13 13 13 13

13 13 13 13 13

13 13 13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

+VISTA 13 Browser

minus

EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

13 13 13 13

13 13 13 13 13

13 13 13

Training Data

+VISTA 13 Browser

minus

Computa6onal Algorithm

Support vector machine separates 2 groups 13 Multi-kernel good for combining heterogeneous data types with different weights

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

EnhancerFinder Performance

AUC=096 Power=85 at 10 FPR Recall=85 at 93 Precision 13 FDR ~10-50 Significantly better than other methods 13 gt80 in vivo validation rate

Erwin et al (2014) PLoS Comp Bio

EnhancerFinder Predictions

239 Predicted HAR13 Enhancers

Heart13 (28)

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

bull Identify sites with fitness effects (Gulko et al 2015) Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

HARs

Figure 2 Massively parallel functional testing of HARs For each HAR the chimp version and versions with all combinations of human mutations will be shotgun cloned along with synthetic 20bp barcodes into an enhancer assay vector This vector library will be introduced to human and chimp neuronal differentiated cells via lentivirus HAR sequences that function as enhancers in these cells will produce barcoded RNA Each unique HAR sequence will have an average of 90 tags to allow for a quantitative readout RNA sequencing (RNA-seq) will allow for the measurement of the relative activity of each mutation in the encoded HAR MinP (minimal promoter) mut (mutations)

Neuronal differentiated cells

Massively Parallel Reporter Assaysand capture Hi-C for validation

AGGT

MinP Luciferase

ACGT ACGT

TCGT ACGC

Enhancer vector

Transcribed barcodes

RNA-seq

Lentivirus

Text

bullTest gt12000 170bp enhancers in parallel bull Quantitative activity assayed via RNA-seq bull Compare genotypes bull Human vs chimp bull Disease SNPs

iPS derived cells

Enhancers Human

Chimp

Vector library

Hane Ryu Nadav Ahituv Jay Shendure Yin Shen

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Discovering the role of enhancers inhuman biology and evolution

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening In vivo molecular studies

Collaborators

EnhancerFinder Gen Haliburton Tony Capra Dennis Kostka John Rubenstein

TargetFinder Sean Whalen Rebecca Truty Benoit Bruneau

MotifDiverge Dennis Kostka Tara Friedrich Deb Ritter Jeff Chuang

Funding from NHLBI PhRMA Foundation Gladstone Institutes

Page 27: Interference of 3D Regulatory Interactions from 2D Genomic ... · Erwin et al. (2014) PLoS Comp Bio, Capra et al. (2014) PTRSB. EnhancerFinder: Predictions • 84,301 developmental

Predictive features colocate and form complexes

K562

Predictive features colocate and form complexes

K562

Can TargetFinder work outsideENCODE cell lines

Can TargetFinder work outsideENCODE cell lines

What is a minimal set of experimentsfor accurate prediction

Can TargetFinder work outsideENCODE cell lines

What is a minimal set of experimentsfor accurate prediction optimal 16+

minimal 8

Can TargetFinder work outsideENCODE cell lines

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types EVALUATE

GM12878 K562 HeLa-S3 HUVEC

GM12878 083 040 043 039

K562 046 085 045 044

HeLa-S3 043 038 088 041

HUVEC 039 040 038 -

TRAIN

Fmax13 values

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types EVALUATE

GM12878 K562 HeLa-S3 HUVEC

GM12878 083 040 043 039

K562 046 085 045 044

HeLa-S3 043 038 088 041

HUVEC 039 040 038 -

TRAIN

Fmax13 values

Expect ~35 precision and 55 recall on anew cell type with ~10 ChIP-seq datasets

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Massive data integration improves predictionbull Closest gene

- Usually fails to identify the right promoter - Many false positives

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Massive data integration improves predictionbull Closest gene

- Usually fails to identify the right promoter - Many false positives

bull TargetFinder - Identifies 95-90 of known pairs (55 with less data) - Few false positives

Which human genome sequences function as long-range enhancers

13

13

13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

VISTA 13 Browser

minus

+

13

Training Data

+VISTA 13 Browser

EvoluDonary ConservaDon Human

ChimpFeatures Mouse

Rat

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

minus

13 13

13 13 13 13 13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

+VISTA 13 Browser

minus

EvoluDonary ConservaDon FuncDonal Genomics Human

ChimpFeatures Mouse

Rat

bullChIP-shy‐seq (TFs histones) bullDNase HypersensiDvity bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

13 13 13 13

13 13 13 13 13

13 13 13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

+VISTA 13 Browser

minus

EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

13 13 13 13

13 13 13 13 13

13 13 13

Training Data

+VISTA 13 Browser

minus

Computa6onal Algorithm

Support vector machine separates 2 groups 13 Multi-kernel good for combining heterogeneous data types with different weights

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

EnhancerFinder Performance

AUC=096 Power=85 at 10 FPR Recall=85 at 93 Precision 13 FDR ~10-50 Significantly better than other methods 13 gt80 in vivo validation rate

Erwin et al (2014) PLoS Comp Bio

EnhancerFinder Predictions

239 Predicted HAR13 Enhancers

Heart13 (28)

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

bull Identify sites with fitness effects (Gulko et al 2015) Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

HARs

Figure 2 Massively parallel functional testing of HARs For each HAR the chimp version and versions with all combinations of human mutations will be shotgun cloned along with synthetic 20bp barcodes into an enhancer assay vector This vector library will be introduced to human and chimp neuronal differentiated cells via lentivirus HAR sequences that function as enhancers in these cells will produce barcoded RNA Each unique HAR sequence will have an average of 90 tags to allow for a quantitative readout RNA sequencing (RNA-seq) will allow for the measurement of the relative activity of each mutation in the encoded HAR MinP (minimal promoter) mut (mutations)

Neuronal differentiated cells

Massively Parallel Reporter Assaysand capture Hi-C for validation

AGGT

MinP Luciferase

ACGT ACGT

TCGT ACGC

Enhancer vector

Transcribed barcodes

RNA-seq

Lentivirus

Text

bullTest gt12000 170bp enhancers in parallel bull Quantitative activity assayed via RNA-seq bull Compare genotypes bull Human vs chimp bull Disease SNPs

iPS derived cells

Enhancers Human

Chimp

Vector library

Hane Ryu Nadav Ahituv Jay Shendure Yin Shen

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Discovering the role of enhancers inhuman biology and evolution

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening In vivo molecular studies

Collaborators

EnhancerFinder Gen Haliburton Tony Capra Dennis Kostka John Rubenstein

TargetFinder Sean Whalen Rebecca Truty Benoit Bruneau

MotifDiverge Dennis Kostka Tara Friedrich Deb Ritter Jeff Chuang

Funding from NHLBI PhRMA Foundation Gladstone Institutes

Page 28: Interference of 3D Regulatory Interactions from 2D Genomic ... · Erwin et al. (2014) PLoS Comp Bio, Capra et al. (2014) PTRSB. EnhancerFinder: Predictions • 84,301 developmental

Predictive features colocate and form complexes

K562

Can TargetFinder work outsideENCODE cell lines

Can TargetFinder work outsideENCODE cell lines

What is a minimal set of experimentsfor accurate prediction

Can TargetFinder work outsideENCODE cell lines

What is a minimal set of experimentsfor accurate prediction optimal 16+

minimal 8

Can TargetFinder work outsideENCODE cell lines

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types EVALUATE

GM12878 K562 HeLa-S3 HUVEC

GM12878 083 040 043 039

K562 046 085 045 044

HeLa-S3 043 038 088 041

HUVEC 039 040 038 -

TRAIN

Fmax13 values

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types EVALUATE

GM12878 K562 HeLa-S3 HUVEC

GM12878 083 040 043 039

K562 046 085 045 044

HeLa-S3 043 038 088 041

HUVEC 039 040 038 -

TRAIN

Fmax13 values

Expect ~35 precision and 55 recall on anew cell type with ~10 ChIP-seq datasets

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Massive data integration improves predictionbull Closest gene

- Usually fails to identify the right promoter - Many false positives

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Massive data integration improves predictionbull Closest gene

- Usually fails to identify the right promoter - Many false positives

bull TargetFinder - Identifies 95-90 of known pairs (55 with less data) - Few false positives

Which human genome sequences function as long-range enhancers

13

13

13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

VISTA 13 Browser

minus

+

13

Training Data

+VISTA 13 Browser

EvoluDonary ConservaDon Human

ChimpFeatures Mouse

Rat

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

minus

13 13

13 13 13 13 13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

+VISTA 13 Browser

minus

EvoluDonary ConservaDon FuncDonal Genomics Human

ChimpFeatures Mouse

Rat

bullChIP-shy‐seq (TFs histones) bullDNase HypersensiDvity bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

13 13 13 13

13 13 13 13 13

13 13 13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

+VISTA 13 Browser

minus

EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

13 13 13 13

13 13 13 13 13

13 13 13

Training Data

+VISTA 13 Browser

minus

Computa6onal Algorithm

Support vector machine separates 2 groups 13 Multi-kernel good for combining heterogeneous data types with different weights

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

EnhancerFinder Performance

AUC=096 Power=85 at 10 FPR Recall=85 at 93 Precision 13 FDR ~10-50 Significantly better than other methods 13 gt80 in vivo validation rate

Erwin et al (2014) PLoS Comp Bio

EnhancerFinder Predictions

239 Predicted HAR13 Enhancers

Heart13 (28)

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

bull Identify sites with fitness effects (Gulko et al 2015) Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

HARs

Figure 2 Massively parallel functional testing of HARs For each HAR the chimp version and versions with all combinations of human mutations will be shotgun cloned along with synthetic 20bp barcodes into an enhancer assay vector This vector library will be introduced to human and chimp neuronal differentiated cells via lentivirus HAR sequences that function as enhancers in these cells will produce barcoded RNA Each unique HAR sequence will have an average of 90 tags to allow for a quantitative readout RNA sequencing (RNA-seq) will allow for the measurement of the relative activity of each mutation in the encoded HAR MinP (minimal promoter) mut (mutations)

Neuronal differentiated cells

Massively Parallel Reporter Assaysand capture Hi-C for validation

AGGT

MinP Luciferase

ACGT ACGT

TCGT ACGC

Enhancer vector

Transcribed barcodes

RNA-seq

Lentivirus

Text

bullTest gt12000 170bp enhancers in parallel bull Quantitative activity assayed via RNA-seq bull Compare genotypes bull Human vs chimp bull Disease SNPs

iPS derived cells

Enhancers Human

Chimp

Vector library

Hane Ryu Nadav Ahituv Jay Shendure Yin Shen

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Discovering the role of enhancers inhuman biology and evolution

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening In vivo molecular studies

Collaborators

EnhancerFinder Gen Haliburton Tony Capra Dennis Kostka John Rubenstein

TargetFinder Sean Whalen Rebecca Truty Benoit Bruneau

MotifDiverge Dennis Kostka Tara Friedrich Deb Ritter Jeff Chuang

Funding from NHLBI PhRMA Foundation Gladstone Institutes

Page 29: Interference of 3D Regulatory Interactions from 2D Genomic ... · Erwin et al. (2014) PLoS Comp Bio, Capra et al. (2014) PTRSB. EnhancerFinder: Predictions • 84,301 developmental

Can TargetFinder work outsideENCODE cell lines

Can TargetFinder work outsideENCODE cell lines

What is a minimal set of experimentsfor accurate prediction

Can TargetFinder work outsideENCODE cell lines

What is a minimal set of experimentsfor accurate prediction optimal 16+

minimal 8

Can TargetFinder work outsideENCODE cell lines

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types EVALUATE

GM12878 K562 HeLa-S3 HUVEC

GM12878 083 040 043 039

K562 046 085 045 044

HeLa-S3 043 038 088 041

HUVEC 039 040 038 -

TRAIN

Fmax13 values

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types EVALUATE

GM12878 K562 HeLa-S3 HUVEC

GM12878 083 040 043 039

K562 046 085 045 044

HeLa-S3 043 038 088 041

HUVEC 039 040 038 -

TRAIN

Fmax13 values

Expect ~35 precision and 55 recall on anew cell type with ~10 ChIP-seq datasets

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Massive data integration improves predictionbull Closest gene

- Usually fails to identify the right promoter - Many false positives

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Massive data integration improves predictionbull Closest gene

- Usually fails to identify the right promoter - Many false positives

bull TargetFinder - Identifies 95-90 of known pairs (55 with less data) - Few false positives

Which human genome sequences function as long-range enhancers

13

13

13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

VISTA 13 Browser

minus

+

13

Training Data

+VISTA 13 Browser

EvoluDonary ConservaDon Human

ChimpFeatures Mouse

Rat

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

minus

13 13

13 13 13 13 13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

+VISTA 13 Browser

minus

EvoluDonary ConservaDon FuncDonal Genomics Human

ChimpFeatures Mouse

Rat

bullChIP-shy‐seq (TFs histones) bullDNase HypersensiDvity bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

13 13 13 13

13 13 13 13 13

13 13 13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

+VISTA 13 Browser

minus

EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

13 13 13 13

13 13 13 13 13

13 13 13

Training Data

+VISTA 13 Browser

minus

Computa6onal Algorithm

Support vector machine separates 2 groups 13 Multi-kernel good for combining heterogeneous data types with different weights

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

EnhancerFinder Performance

AUC=096 Power=85 at 10 FPR Recall=85 at 93 Precision 13 FDR ~10-50 Significantly better than other methods 13 gt80 in vivo validation rate

Erwin et al (2014) PLoS Comp Bio

EnhancerFinder Predictions

239 Predicted HAR13 Enhancers

Heart13 (28)

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

bull Identify sites with fitness effects (Gulko et al 2015) Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

HARs

Figure 2 Massively parallel functional testing of HARs For each HAR the chimp version and versions with all combinations of human mutations will be shotgun cloned along with synthetic 20bp barcodes into an enhancer assay vector This vector library will be introduced to human and chimp neuronal differentiated cells via lentivirus HAR sequences that function as enhancers in these cells will produce barcoded RNA Each unique HAR sequence will have an average of 90 tags to allow for a quantitative readout RNA sequencing (RNA-seq) will allow for the measurement of the relative activity of each mutation in the encoded HAR MinP (minimal promoter) mut (mutations)

Neuronal differentiated cells

Massively Parallel Reporter Assaysand capture Hi-C for validation

AGGT

MinP Luciferase

ACGT ACGT

TCGT ACGC

Enhancer vector

Transcribed barcodes

RNA-seq

Lentivirus

Text

bullTest gt12000 170bp enhancers in parallel bull Quantitative activity assayed via RNA-seq bull Compare genotypes bull Human vs chimp bull Disease SNPs

iPS derived cells

Enhancers Human

Chimp

Vector library

Hane Ryu Nadav Ahituv Jay Shendure Yin Shen

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Discovering the role of enhancers inhuman biology and evolution

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening In vivo molecular studies

Collaborators

EnhancerFinder Gen Haliburton Tony Capra Dennis Kostka John Rubenstein

TargetFinder Sean Whalen Rebecca Truty Benoit Bruneau

MotifDiverge Dennis Kostka Tara Friedrich Deb Ritter Jeff Chuang

Funding from NHLBI PhRMA Foundation Gladstone Institutes

Page 30: Interference of 3D Regulatory Interactions from 2D Genomic ... · Erwin et al. (2014) PLoS Comp Bio, Capra et al. (2014) PTRSB. EnhancerFinder: Predictions • 84,301 developmental

Can TargetFinder work outsideENCODE cell lines

What is a minimal set of experimentsfor accurate prediction

Can TargetFinder work outsideENCODE cell lines

What is a minimal set of experimentsfor accurate prediction optimal 16+

minimal 8

Can TargetFinder work outsideENCODE cell lines

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types EVALUATE

GM12878 K562 HeLa-S3 HUVEC

GM12878 083 040 043 039

K562 046 085 045 044

HeLa-S3 043 038 088 041

HUVEC 039 040 038 -

TRAIN

Fmax13 values

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types EVALUATE

GM12878 K562 HeLa-S3 HUVEC

GM12878 083 040 043 039

K562 046 085 045 044

HeLa-S3 043 038 088 041

HUVEC 039 040 038 -

TRAIN

Fmax13 values

Expect ~35 precision and 55 recall on anew cell type with ~10 ChIP-seq datasets

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Massive data integration improves predictionbull Closest gene

- Usually fails to identify the right promoter - Many false positives

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Massive data integration improves predictionbull Closest gene

- Usually fails to identify the right promoter - Many false positives

bull TargetFinder - Identifies 95-90 of known pairs (55 with less data) - Few false positives

Which human genome sequences function as long-range enhancers

13

13

13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

VISTA 13 Browser

minus

+

13

Training Data

+VISTA 13 Browser

EvoluDonary ConservaDon Human

ChimpFeatures Mouse

Rat

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

minus

13 13

13 13 13 13 13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

+VISTA 13 Browser

minus

EvoluDonary ConservaDon FuncDonal Genomics Human

ChimpFeatures Mouse

Rat

bullChIP-shy‐seq (TFs histones) bullDNase HypersensiDvity bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

13 13 13 13

13 13 13 13 13

13 13 13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

+VISTA 13 Browser

minus

EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

13 13 13 13

13 13 13 13 13

13 13 13

Training Data

+VISTA 13 Browser

minus

Computa6onal Algorithm

Support vector machine separates 2 groups 13 Multi-kernel good for combining heterogeneous data types with different weights

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

EnhancerFinder Performance

AUC=096 Power=85 at 10 FPR Recall=85 at 93 Precision 13 FDR ~10-50 Significantly better than other methods 13 gt80 in vivo validation rate

Erwin et al (2014) PLoS Comp Bio

EnhancerFinder Predictions

239 Predicted HAR13 Enhancers

Heart13 (28)

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

bull Identify sites with fitness effects (Gulko et al 2015) Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

HARs

Figure 2 Massively parallel functional testing of HARs For each HAR the chimp version and versions with all combinations of human mutations will be shotgun cloned along with synthetic 20bp barcodes into an enhancer assay vector This vector library will be introduced to human and chimp neuronal differentiated cells via lentivirus HAR sequences that function as enhancers in these cells will produce barcoded RNA Each unique HAR sequence will have an average of 90 tags to allow for a quantitative readout RNA sequencing (RNA-seq) will allow for the measurement of the relative activity of each mutation in the encoded HAR MinP (minimal promoter) mut (mutations)

Neuronal differentiated cells

Massively Parallel Reporter Assaysand capture Hi-C for validation

AGGT

MinP Luciferase

ACGT ACGT

TCGT ACGC

Enhancer vector

Transcribed barcodes

RNA-seq

Lentivirus

Text

bullTest gt12000 170bp enhancers in parallel bull Quantitative activity assayed via RNA-seq bull Compare genotypes bull Human vs chimp bull Disease SNPs

iPS derived cells

Enhancers Human

Chimp

Vector library

Hane Ryu Nadav Ahituv Jay Shendure Yin Shen

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Discovering the role of enhancers inhuman biology and evolution

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening In vivo molecular studies

Collaborators

EnhancerFinder Gen Haliburton Tony Capra Dennis Kostka John Rubenstein

TargetFinder Sean Whalen Rebecca Truty Benoit Bruneau

MotifDiverge Dennis Kostka Tara Friedrich Deb Ritter Jeff Chuang

Funding from NHLBI PhRMA Foundation Gladstone Institutes

Page 31: Interference of 3D Regulatory Interactions from 2D Genomic ... · Erwin et al. (2014) PLoS Comp Bio, Capra et al. (2014) PTRSB. EnhancerFinder: Predictions • 84,301 developmental

Can TargetFinder work outsideENCODE cell lines

What is a minimal set of experimentsfor accurate prediction optimal 16+

minimal 8

Can TargetFinder work outsideENCODE cell lines

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types EVALUATE

GM12878 K562 HeLa-S3 HUVEC

GM12878 083 040 043 039

K562 046 085 045 044

HeLa-S3 043 038 088 041

HUVEC 039 040 038 -

TRAIN

Fmax13 values

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types EVALUATE

GM12878 K562 HeLa-S3 HUVEC

GM12878 083 040 043 039

K562 046 085 045 044

HeLa-S3 043 038 088 041

HUVEC 039 040 038 -

TRAIN

Fmax13 values

Expect ~35 precision and 55 recall on anew cell type with ~10 ChIP-seq datasets

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Massive data integration improves predictionbull Closest gene

- Usually fails to identify the right promoter - Many false positives

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Massive data integration improves predictionbull Closest gene

- Usually fails to identify the right promoter - Many false positives

bull TargetFinder - Identifies 95-90 of known pairs (55 with less data) - Few false positives

Which human genome sequences function as long-range enhancers

13

13

13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

VISTA 13 Browser

minus

+

13

Training Data

+VISTA 13 Browser

EvoluDonary ConservaDon Human

ChimpFeatures Mouse

Rat

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

minus

13 13

13 13 13 13 13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

+VISTA 13 Browser

minus

EvoluDonary ConservaDon FuncDonal Genomics Human

ChimpFeatures Mouse

Rat

bullChIP-shy‐seq (TFs histones) bullDNase HypersensiDvity bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

13 13 13 13

13 13 13 13 13

13 13 13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

+VISTA 13 Browser

minus

EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

13 13 13 13

13 13 13 13 13

13 13 13

Training Data

+VISTA 13 Browser

minus

Computa6onal Algorithm

Support vector machine separates 2 groups 13 Multi-kernel good for combining heterogeneous data types with different weights

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

EnhancerFinder Performance

AUC=096 Power=85 at 10 FPR Recall=85 at 93 Precision 13 FDR ~10-50 Significantly better than other methods 13 gt80 in vivo validation rate

Erwin et al (2014) PLoS Comp Bio

EnhancerFinder Predictions

239 Predicted HAR13 Enhancers

Heart13 (28)

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

bull Identify sites with fitness effects (Gulko et al 2015) Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

HARs

Figure 2 Massively parallel functional testing of HARs For each HAR the chimp version and versions with all combinations of human mutations will be shotgun cloned along with synthetic 20bp barcodes into an enhancer assay vector This vector library will be introduced to human and chimp neuronal differentiated cells via lentivirus HAR sequences that function as enhancers in these cells will produce barcoded RNA Each unique HAR sequence will have an average of 90 tags to allow for a quantitative readout RNA sequencing (RNA-seq) will allow for the measurement of the relative activity of each mutation in the encoded HAR MinP (minimal promoter) mut (mutations)

Neuronal differentiated cells

Massively Parallel Reporter Assaysand capture Hi-C for validation

AGGT

MinP Luciferase

ACGT ACGT

TCGT ACGC

Enhancer vector

Transcribed barcodes

RNA-seq

Lentivirus

Text

bullTest gt12000 170bp enhancers in parallel bull Quantitative activity assayed via RNA-seq bull Compare genotypes bull Human vs chimp bull Disease SNPs

iPS derived cells

Enhancers Human

Chimp

Vector library

Hane Ryu Nadav Ahituv Jay Shendure Yin Shen

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Discovering the role of enhancers inhuman biology and evolution

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening In vivo molecular studies

Collaborators

EnhancerFinder Gen Haliburton Tony Capra Dennis Kostka John Rubenstein

TargetFinder Sean Whalen Rebecca Truty Benoit Bruneau

MotifDiverge Dennis Kostka Tara Friedrich Deb Ritter Jeff Chuang

Funding from NHLBI PhRMA Foundation Gladstone Institutes

Page 32: Interference of 3D Regulatory Interactions from 2D Genomic ... · Erwin et al. (2014) PLoS Comp Bio, Capra et al. (2014) PTRSB. EnhancerFinder: Predictions • 84,301 developmental

Can TargetFinder work outsideENCODE cell lines

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types EVALUATE

GM12878 K562 HeLa-S3 HUVEC

GM12878 083 040 043 039

K562 046 085 045 044

HeLa-S3 043 038 088 041

HUVEC 039 040 038 -

TRAIN

Fmax13 values

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types EVALUATE

GM12878 K562 HeLa-S3 HUVEC

GM12878 083 040 043 039

K562 046 085 045 044

HeLa-S3 043 038 088 041

HUVEC 039 040 038 -

TRAIN

Fmax13 values

Expect ~35 precision and 55 recall on anew cell type with ~10 ChIP-seq datasets

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Massive data integration improves predictionbull Closest gene

- Usually fails to identify the right promoter - Many false positives

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Massive data integration improves predictionbull Closest gene

- Usually fails to identify the right promoter - Many false positives

bull TargetFinder - Identifies 95-90 of known pairs (55 with less data) - Few false positives

Which human genome sequences function as long-range enhancers

13

13

13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

VISTA 13 Browser

minus

+

13

Training Data

+VISTA 13 Browser

EvoluDonary ConservaDon Human

ChimpFeatures Mouse

Rat

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

minus

13 13

13 13 13 13 13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

+VISTA 13 Browser

minus

EvoluDonary ConservaDon FuncDonal Genomics Human

ChimpFeatures Mouse

Rat

bullChIP-shy‐seq (TFs histones) bullDNase HypersensiDvity bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

13 13 13 13

13 13 13 13 13

13 13 13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

+VISTA 13 Browser

minus

EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

13 13 13 13

13 13 13 13 13

13 13 13

Training Data

+VISTA 13 Browser

minus

Computa6onal Algorithm

Support vector machine separates 2 groups 13 Multi-kernel good for combining heterogeneous data types with different weights

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

EnhancerFinder Performance

AUC=096 Power=85 at 10 FPR Recall=85 at 93 Precision 13 FDR ~10-50 Significantly better than other methods 13 gt80 in vivo validation rate

Erwin et al (2014) PLoS Comp Bio

EnhancerFinder Predictions

239 Predicted HAR13 Enhancers

Heart13 (28)

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

bull Identify sites with fitness effects (Gulko et al 2015) Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

HARs

Figure 2 Massively parallel functional testing of HARs For each HAR the chimp version and versions with all combinations of human mutations will be shotgun cloned along with synthetic 20bp barcodes into an enhancer assay vector This vector library will be introduced to human and chimp neuronal differentiated cells via lentivirus HAR sequences that function as enhancers in these cells will produce barcoded RNA Each unique HAR sequence will have an average of 90 tags to allow for a quantitative readout RNA sequencing (RNA-seq) will allow for the measurement of the relative activity of each mutation in the encoded HAR MinP (minimal promoter) mut (mutations)

Neuronal differentiated cells

Massively Parallel Reporter Assaysand capture Hi-C for validation

AGGT

MinP Luciferase

ACGT ACGT

TCGT ACGC

Enhancer vector

Transcribed barcodes

RNA-seq

Lentivirus

Text

bullTest gt12000 170bp enhancers in parallel bull Quantitative activity assayed via RNA-seq bull Compare genotypes bull Human vs chimp bull Disease SNPs

iPS derived cells

Enhancers Human

Chimp

Vector library

Hane Ryu Nadav Ahituv Jay Shendure Yin Shen

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Discovering the role of enhancers inhuman biology and evolution

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening In vivo molecular studies

Collaborators

EnhancerFinder Gen Haliburton Tony Capra Dennis Kostka John Rubenstein

TargetFinder Sean Whalen Rebecca Truty Benoit Bruneau

MotifDiverge Dennis Kostka Tara Friedrich Deb Ritter Jeff Chuang

Funding from NHLBI PhRMA Foundation Gladstone Institutes

Page 33: Interference of 3D Regulatory Interactions from 2D Genomic ... · Erwin et al. (2014) PLoS Comp Bio, Capra et al. (2014) PTRSB. EnhancerFinder: Predictions • 84,301 developmental

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types EVALUATE

GM12878 K562 HeLa-S3 HUVEC

GM12878 083 040 043 039

K562 046 085 045 044

HeLa-S3 043 038 088 041

HUVEC 039 040 038 -

TRAIN

Fmax13 values

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types EVALUATE

GM12878 K562 HeLa-S3 HUVEC

GM12878 083 040 043 039

K562 046 085 045 044

HeLa-S3 043 038 088 041

HUVEC 039 040 038 -

TRAIN

Fmax13 values

Expect ~35 precision and 55 recall on anew cell type with ~10 ChIP-seq datasets

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Massive data integration improves predictionbull Closest gene

- Usually fails to identify the right promoter - Many false positives

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Massive data integration improves predictionbull Closest gene

- Usually fails to identify the right promoter - Many false positives

bull TargetFinder - Identifies 95-90 of known pairs (55 with less data) - Few false positives

Which human genome sequences function as long-range enhancers

13

13

13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

VISTA 13 Browser

minus

+

13

Training Data

+VISTA 13 Browser

EvoluDonary ConservaDon Human

ChimpFeatures Mouse

Rat

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

minus

13 13

13 13 13 13 13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

+VISTA 13 Browser

minus

EvoluDonary ConservaDon FuncDonal Genomics Human

ChimpFeatures Mouse

Rat

bullChIP-shy‐seq (TFs histones) bullDNase HypersensiDvity bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

13 13 13 13

13 13 13 13 13

13 13 13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

+VISTA 13 Browser

minus

EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

13 13 13 13

13 13 13 13 13

13 13 13

Training Data

+VISTA 13 Browser

minus

Computa6onal Algorithm

Support vector machine separates 2 groups 13 Multi-kernel good for combining heterogeneous data types with different weights

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

EnhancerFinder Performance

AUC=096 Power=85 at 10 FPR Recall=85 at 93 Precision 13 FDR ~10-50 Significantly better than other methods 13 gt80 in vivo validation rate

Erwin et al (2014) PLoS Comp Bio

EnhancerFinder Predictions

239 Predicted HAR13 Enhancers

Heart13 (28)

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

bull Identify sites with fitness effects (Gulko et al 2015) Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

HARs

Figure 2 Massively parallel functional testing of HARs For each HAR the chimp version and versions with all combinations of human mutations will be shotgun cloned along with synthetic 20bp barcodes into an enhancer assay vector This vector library will be introduced to human and chimp neuronal differentiated cells via lentivirus HAR sequences that function as enhancers in these cells will produce barcoded RNA Each unique HAR sequence will have an average of 90 tags to allow for a quantitative readout RNA sequencing (RNA-seq) will allow for the measurement of the relative activity of each mutation in the encoded HAR MinP (minimal promoter) mut (mutations)

Neuronal differentiated cells

Massively Parallel Reporter Assaysand capture Hi-C for validation

AGGT

MinP Luciferase

ACGT ACGT

TCGT ACGC

Enhancer vector

Transcribed barcodes

RNA-seq

Lentivirus

Text

bullTest gt12000 170bp enhancers in parallel bull Quantitative activity assayed via RNA-seq bull Compare genotypes bull Human vs chimp bull Disease SNPs

iPS derived cells

Enhancers Human

Chimp

Vector library

Hane Ryu Nadav Ahituv Jay Shendure Yin Shen

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Discovering the role of enhancers inhuman biology and evolution

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening In vivo molecular studies

Collaborators

EnhancerFinder Gen Haliburton Tony Capra Dennis Kostka John Rubenstein

TargetFinder Sean Whalen Rebecca Truty Benoit Bruneau

MotifDiverge Dennis Kostka Tara Friedrich Deb Ritter Jeff Chuang

Funding from NHLBI PhRMA Foundation Gladstone Institutes

Page 34: Interference of 3D Regulatory Interactions from 2D Genomic ... · Erwin et al. (2014) PLoS Comp Bio, Capra et al. (2014) PTRSB. EnhancerFinder: Predictions • 84,301 developmental

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types EVALUATE

GM12878 K562 HeLa-S3 HUVEC

GM12878 083 040 043 039

K562 046 085 045 044

HeLa-S3 043 038 088 041

HUVEC 039 040 038 -

TRAIN

Fmax13 values

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types EVALUATE

GM12878 K562 HeLa-S3 HUVEC

GM12878 083 040 043 039

K562 046 085 045 044

HeLa-S3 043 038 088 041

HUVEC 039 040 038 -

TRAIN

Fmax13 values

Expect ~35 precision and 55 recall on anew cell type with ~10 ChIP-seq datasets

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Massive data integration improves predictionbull Closest gene

- Usually fails to identify the right promoter - Many false positives

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Massive data integration improves predictionbull Closest gene

- Usually fails to identify the right promoter - Many false positives

bull TargetFinder - Identifies 95-90 of known pairs (55 with less data) - Few false positives

Which human genome sequences function as long-range enhancers

13

13

13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

VISTA 13 Browser

minus

+

13

Training Data

+VISTA 13 Browser

EvoluDonary ConservaDon Human

ChimpFeatures Mouse

Rat

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

minus

13 13

13 13 13 13 13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

+VISTA 13 Browser

minus

EvoluDonary ConservaDon FuncDonal Genomics Human

ChimpFeatures Mouse

Rat

bullChIP-shy‐seq (TFs histones) bullDNase HypersensiDvity bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

13 13 13 13

13 13 13 13 13

13 13 13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

+VISTA 13 Browser

minus

EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

13 13 13 13

13 13 13 13 13

13 13 13

Training Data

+VISTA 13 Browser

minus

Computa6onal Algorithm

Support vector machine separates 2 groups 13 Multi-kernel good for combining heterogeneous data types with different weights

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

EnhancerFinder Performance

AUC=096 Power=85 at 10 FPR Recall=85 at 93 Precision 13 FDR ~10-50 Significantly better than other methods 13 gt80 in vivo validation rate

Erwin et al (2014) PLoS Comp Bio

EnhancerFinder Predictions

239 Predicted HAR13 Enhancers

Heart13 (28)

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

bull Identify sites with fitness effects (Gulko et al 2015) Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

HARs

Figure 2 Massively parallel functional testing of HARs For each HAR the chimp version and versions with all combinations of human mutations will be shotgun cloned along with synthetic 20bp barcodes into an enhancer assay vector This vector library will be introduced to human and chimp neuronal differentiated cells via lentivirus HAR sequences that function as enhancers in these cells will produce barcoded RNA Each unique HAR sequence will have an average of 90 tags to allow for a quantitative readout RNA sequencing (RNA-seq) will allow for the measurement of the relative activity of each mutation in the encoded HAR MinP (minimal promoter) mut (mutations)

Neuronal differentiated cells

Massively Parallel Reporter Assaysand capture Hi-C for validation

AGGT

MinP Luciferase

ACGT ACGT

TCGT ACGC

Enhancer vector

Transcribed barcodes

RNA-seq

Lentivirus

Text

bullTest gt12000 170bp enhancers in parallel bull Quantitative activity assayed via RNA-seq bull Compare genotypes bull Human vs chimp bull Disease SNPs

iPS derived cells

Enhancers Human

Chimp

Vector library

Hane Ryu Nadav Ahituv Jay Shendure Yin Shen

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Discovering the role of enhancers inhuman biology and evolution

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening In vivo molecular studies

Collaborators

EnhancerFinder Gen Haliburton Tony Capra Dennis Kostka John Rubenstein

TargetFinder Sean Whalen Rebecca Truty Benoit Bruneau

MotifDiverge Dennis Kostka Tara Friedrich Deb Ritter Jeff Chuang

Funding from NHLBI PhRMA Foundation Gladstone Institutes

Page 35: Interference of 3D Regulatory Interactions from 2D Genomic ... · Erwin et al. (2014) PLoS Comp Bio, Capra et al. (2014) PTRSB. EnhancerFinder: Predictions • 84,301 developmental

Can TargetFinder work outsideENCODE cell lines

Test if models generalize across cell types EVALUATE

GM12878 K562 HeLa-S3 HUVEC

GM12878 083 040 043 039

K562 046 085 045 044

HeLa-S3 043 038 088 041

HUVEC 039 040 038 -

TRAIN

Fmax13 values

Expect ~35 precision and 55 recall on anew cell type with ~10 ChIP-seq datasets

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Massive data integration improves predictionbull Closest gene

- Usually fails to identify the right promoter - Many false positives

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Massive data integration improves predictionbull Closest gene

- Usually fails to identify the right promoter - Many false positives

bull TargetFinder - Identifies 95-90 of known pairs (55 with less data) - Few false positives

Which human genome sequences function as long-range enhancers

13

13

13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

VISTA 13 Browser

minus

+

13

Training Data

+VISTA 13 Browser

EvoluDonary ConservaDon Human

ChimpFeatures Mouse

Rat

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

minus

13 13

13 13 13 13 13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

+VISTA 13 Browser

minus

EvoluDonary ConservaDon FuncDonal Genomics Human

ChimpFeatures Mouse

Rat

bullChIP-shy‐seq (TFs histones) bullDNase HypersensiDvity bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

13 13 13 13

13 13 13 13 13

13 13 13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

+VISTA 13 Browser

minus

EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

13 13 13 13

13 13 13 13 13

13 13 13

Training Data

+VISTA 13 Browser

minus

Computa6onal Algorithm

Support vector machine separates 2 groups 13 Multi-kernel good for combining heterogeneous data types with different weights

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

EnhancerFinder Performance

AUC=096 Power=85 at 10 FPR Recall=85 at 93 Precision 13 FDR ~10-50 Significantly better than other methods 13 gt80 in vivo validation rate

Erwin et al (2014) PLoS Comp Bio

EnhancerFinder Predictions

239 Predicted HAR13 Enhancers

Heart13 (28)

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

bull Identify sites with fitness effects (Gulko et al 2015) Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

HARs

Figure 2 Massively parallel functional testing of HARs For each HAR the chimp version and versions with all combinations of human mutations will be shotgun cloned along with synthetic 20bp barcodes into an enhancer assay vector This vector library will be introduced to human and chimp neuronal differentiated cells via lentivirus HAR sequences that function as enhancers in these cells will produce barcoded RNA Each unique HAR sequence will have an average of 90 tags to allow for a quantitative readout RNA sequencing (RNA-seq) will allow for the measurement of the relative activity of each mutation in the encoded HAR MinP (minimal promoter) mut (mutations)

Neuronal differentiated cells

Massively Parallel Reporter Assaysand capture Hi-C for validation

AGGT

MinP Luciferase

ACGT ACGT

TCGT ACGC

Enhancer vector

Transcribed barcodes

RNA-seq

Lentivirus

Text

bullTest gt12000 170bp enhancers in parallel bull Quantitative activity assayed via RNA-seq bull Compare genotypes bull Human vs chimp bull Disease SNPs

iPS derived cells

Enhancers Human

Chimp

Vector library

Hane Ryu Nadav Ahituv Jay Shendure Yin Shen

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Discovering the role of enhancers inhuman biology and evolution

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening In vivo molecular studies

Collaborators

EnhancerFinder Gen Haliburton Tony Capra Dennis Kostka John Rubenstein

TargetFinder Sean Whalen Rebecca Truty Benoit Bruneau

MotifDiverge Dennis Kostka Tara Friedrich Deb Ritter Jeff Chuang

Funding from NHLBI PhRMA Foundation Gladstone Institutes

Page 36: Interference of 3D Regulatory Interactions from 2D Genomic ... · Erwin et al. (2014) PLoS Comp Bio, Capra et al. (2014) PTRSB. EnhancerFinder: Predictions • 84,301 developmental

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Massive data integration improves predictionbull Closest gene

- Usually fails to identify the right promoter - Many false positives

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Massive data integration improves predictionbull Closest gene

- Usually fails to identify the right promoter - Many false positives

bull TargetFinder - Identifies 95-90 of known pairs (55 with less data) - Few false positives

Which human genome sequences function as long-range enhancers

13

13

13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

VISTA 13 Browser

minus

+

13

Training Data

+VISTA 13 Browser

EvoluDonary ConservaDon Human

ChimpFeatures Mouse

Rat

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

minus

13 13

13 13 13 13 13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

+VISTA 13 Browser

minus

EvoluDonary ConservaDon FuncDonal Genomics Human

ChimpFeatures Mouse

Rat

bullChIP-shy‐seq (TFs histones) bullDNase HypersensiDvity bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

13 13 13 13

13 13 13 13 13

13 13 13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

+VISTA 13 Browser

minus

EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

13 13 13 13

13 13 13 13 13

13 13 13

Training Data

+VISTA 13 Browser

minus

Computa6onal Algorithm

Support vector machine separates 2 groups 13 Multi-kernel good for combining heterogeneous data types with different weights

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

EnhancerFinder Performance

AUC=096 Power=85 at 10 FPR Recall=85 at 93 Precision 13 FDR ~10-50 Significantly better than other methods 13 gt80 in vivo validation rate

Erwin et al (2014) PLoS Comp Bio

EnhancerFinder Predictions

239 Predicted HAR13 Enhancers

Heart13 (28)

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

bull Identify sites with fitness effects (Gulko et al 2015) Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

HARs

Figure 2 Massively parallel functional testing of HARs For each HAR the chimp version and versions with all combinations of human mutations will be shotgun cloned along with synthetic 20bp barcodes into an enhancer assay vector This vector library will be introduced to human and chimp neuronal differentiated cells via lentivirus HAR sequences that function as enhancers in these cells will produce barcoded RNA Each unique HAR sequence will have an average of 90 tags to allow for a quantitative readout RNA sequencing (RNA-seq) will allow for the measurement of the relative activity of each mutation in the encoded HAR MinP (minimal promoter) mut (mutations)

Neuronal differentiated cells

Massively Parallel Reporter Assaysand capture Hi-C for validation

AGGT

MinP Luciferase

ACGT ACGT

TCGT ACGC

Enhancer vector

Transcribed barcodes

RNA-seq

Lentivirus

Text

bullTest gt12000 170bp enhancers in parallel bull Quantitative activity assayed via RNA-seq bull Compare genotypes bull Human vs chimp bull Disease SNPs

iPS derived cells

Enhancers Human

Chimp

Vector library

Hane Ryu Nadav Ahituv Jay Shendure Yin Shen

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Discovering the role of enhancers inhuman biology and evolution

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening In vivo molecular studies

Collaborators

EnhancerFinder Gen Haliburton Tony Capra Dennis Kostka John Rubenstein

TargetFinder Sean Whalen Rebecca Truty Benoit Bruneau

MotifDiverge Dennis Kostka Tara Friedrich Deb Ritter Jeff Chuang

Funding from NHLBI PhRMA Foundation Gladstone Institutes

Page 37: Interference of 3D Regulatory Interactions from 2D Genomic ... · Erwin et al. (2014) PLoS Comp Bio, Capra et al. (2014) PTRSB. EnhancerFinder: Predictions • 84,301 developmental

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Massive data integration improves predictionbull Closest gene

- Usually fails to identify the right promoter - Many false positives

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Massive data integration improves predictionbull Closest gene

- Usually fails to identify the right promoter - Many false positives

bull TargetFinder - Identifies 95-90 of known pairs (55 with less data) - Few false positives

Which human genome sequences function as long-range enhancers

13

13

13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

VISTA 13 Browser

minus

+

13

Training Data

+VISTA 13 Browser

EvoluDonary ConservaDon Human

ChimpFeatures Mouse

Rat

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

minus

13 13

13 13 13 13 13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

+VISTA 13 Browser

minus

EvoluDonary ConservaDon FuncDonal Genomics Human

ChimpFeatures Mouse

Rat

bullChIP-shy‐seq (TFs histones) bullDNase HypersensiDvity bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

13 13 13 13

13 13 13 13 13

13 13 13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

+VISTA 13 Browser

minus

EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

13 13 13 13

13 13 13 13 13

13 13 13

Training Data

+VISTA 13 Browser

minus

Computa6onal Algorithm

Support vector machine separates 2 groups 13 Multi-kernel good for combining heterogeneous data types with different weights

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

EnhancerFinder Performance

AUC=096 Power=85 at 10 FPR Recall=85 at 93 Precision 13 FDR ~10-50 Significantly better than other methods 13 gt80 in vivo validation rate

Erwin et al (2014) PLoS Comp Bio

EnhancerFinder Predictions

239 Predicted HAR13 Enhancers

Heart13 (28)

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

bull Identify sites with fitness effects (Gulko et al 2015) Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

HARs

Figure 2 Massively parallel functional testing of HARs For each HAR the chimp version and versions with all combinations of human mutations will be shotgun cloned along with synthetic 20bp barcodes into an enhancer assay vector This vector library will be introduced to human and chimp neuronal differentiated cells via lentivirus HAR sequences that function as enhancers in these cells will produce barcoded RNA Each unique HAR sequence will have an average of 90 tags to allow for a quantitative readout RNA sequencing (RNA-seq) will allow for the measurement of the relative activity of each mutation in the encoded HAR MinP (minimal promoter) mut (mutations)

Neuronal differentiated cells

Massively Parallel Reporter Assaysand capture Hi-C for validation

AGGT

MinP Luciferase

ACGT ACGT

TCGT ACGC

Enhancer vector

Transcribed barcodes

RNA-seq

Lentivirus

Text

bullTest gt12000 170bp enhancers in parallel bull Quantitative activity assayed via RNA-seq bull Compare genotypes bull Human vs chimp bull Disease SNPs

iPS derived cells

Enhancers Human

Chimp

Vector library

Hane Ryu Nadav Ahituv Jay Shendure Yin Shen

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Discovering the role of enhancers inhuman biology and evolution

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening In vivo molecular studies

Collaborators

EnhancerFinder Gen Haliburton Tony Capra Dennis Kostka John Rubenstein

TargetFinder Sean Whalen Rebecca Truty Benoit Bruneau

MotifDiverge Dennis Kostka Tara Friedrich Deb Ritter Jeff Chuang

Funding from NHLBI PhRMA Foundation Gladstone Institutes

Page 38: Interference of 3D Regulatory Interactions from 2D Genomic ... · Erwin et al. (2014) PLoS Comp Bio, Capra et al. (2014) PTRSB. EnhancerFinder: Predictions • 84,301 developmental

Cardiomyopathy associated variant

Gene A Gene B

G A

TargetFinder accurately annotatesenhancer-promoter pairs

A T C C

DNA Gene C

Massive data integration improves predictionbull Closest gene

- Usually fails to identify the right promoter - Many false positives

bull TargetFinder - Identifies 95-90 of known pairs (55 with less data) - Few false positives

Which human genome sequences function as long-range enhancers

13

13

13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

VISTA 13 Browser

minus

+

13

Training Data

+VISTA 13 Browser

EvoluDonary ConservaDon Human

ChimpFeatures Mouse

Rat

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

minus

13 13

13 13 13 13 13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

+VISTA 13 Browser

minus

EvoluDonary ConservaDon FuncDonal Genomics Human

ChimpFeatures Mouse

Rat

bullChIP-shy‐seq (TFs histones) bullDNase HypersensiDvity bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

13 13 13 13

13 13 13 13 13

13 13 13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

+VISTA 13 Browser

minus

EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

13 13 13 13

13 13 13 13 13

13 13 13

Training Data

+VISTA 13 Browser

minus

Computa6onal Algorithm

Support vector machine separates 2 groups 13 Multi-kernel good for combining heterogeneous data types with different weights

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

EnhancerFinder Performance

AUC=096 Power=85 at 10 FPR Recall=85 at 93 Precision 13 FDR ~10-50 Significantly better than other methods 13 gt80 in vivo validation rate

Erwin et al (2014) PLoS Comp Bio

EnhancerFinder Predictions

239 Predicted HAR13 Enhancers

Heart13 (28)

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

bull Identify sites with fitness effects (Gulko et al 2015) Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

HARs

Figure 2 Massively parallel functional testing of HARs For each HAR the chimp version and versions with all combinations of human mutations will be shotgun cloned along with synthetic 20bp barcodes into an enhancer assay vector This vector library will be introduced to human and chimp neuronal differentiated cells via lentivirus HAR sequences that function as enhancers in these cells will produce barcoded RNA Each unique HAR sequence will have an average of 90 tags to allow for a quantitative readout RNA sequencing (RNA-seq) will allow for the measurement of the relative activity of each mutation in the encoded HAR MinP (minimal promoter) mut (mutations)

Neuronal differentiated cells

Massively Parallel Reporter Assaysand capture Hi-C for validation

AGGT

MinP Luciferase

ACGT ACGT

TCGT ACGC

Enhancer vector

Transcribed barcodes

RNA-seq

Lentivirus

Text

bullTest gt12000 170bp enhancers in parallel bull Quantitative activity assayed via RNA-seq bull Compare genotypes bull Human vs chimp bull Disease SNPs

iPS derived cells

Enhancers Human

Chimp

Vector library

Hane Ryu Nadav Ahituv Jay Shendure Yin Shen

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Discovering the role of enhancers inhuman biology and evolution

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening In vivo molecular studies

Collaborators

EnhancerFinder Gen Haliburton Tony Capra Dennis Kostka John Rubenstein

TargetFinder Sean Whalen Rebecca Truty Benoit Bruneau

MotifDiverge Dennis Kostka Tara Friedrich Deb Ritter Jeff Chuang

Funding from NHLBI PhRMA Foundation Gladstone Institutes

Page 39: Interference of 3D Regulatory Interactions from 2D Genomic ... · Erwin et al. (2014) PLoS Comp Bio, Capra et al. (2014) PTRSB. EnhancerFinder: Predictions • 84,301 developmental

Which human genome sequences function as long-range enhancers

13

13

13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

VISTA 13 Browser

minus

+

13

Training Data

+VISTA 13 Browser

EvoluDonary ConservaDon Human

ChimpFeatures Mouse

Rat

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

minus

13 13

13 13 13 13 13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

+VISTA 13 Browser

minus

EvoluDonary ConservaDon FuncDonal Genomics Human

ChimpFeatures Mouse

Rat

bullChIP-shy‐seq (TFs histones) bullDNase HypersensiDvity bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

13 13 13 13

13 13 13 13 13

13 13 13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

+VISTA 13 Browser

minus

EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

13 13 13 13

13 13 13 13 13

13 13 13

Training Data

+VISTA 13 Browser

minus

Computa6onal Algorithm

Support vector machine separates 2 groups 13 Multi-kernel good for combining heterogeneous data types with different weights

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

EnhancerFinder Performance

AUC=096 Power=85 at 10 FPR Recall=85 at 93 Precision 13 FDR ~10-50 Significantly better than other methods 13 gt80 in vivo validation rate

Erwin et al (2014) PLoS Comp Bio

EnhancerFinder Predictions

239 Predicted HAR13 Enhancers

Heart13 (28)

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

bull Identify sites with fitness effects (Gulko et al 2015) Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

HARs

Figure 2 Massively parallel functional testing of HARs For each HAR the chimp version and versions with all combinations of human mutations will be shotgun cloned along with synthetic 20bp barcodes into an enhancer assay vector This vector library will be introduced to human and chimp neuronal differentiated cells via lentivirus HAR sequences that function as enhancers in these cells will produce barcoded RNA Each unique HAR sequence will have an average of 90 tags to allow for a quantitative readout RNA sequencing (RNA-seq) will allow for the measurement of the relative activity of each mutation in the encoded HAR MinP (minimal promoter) mut (mutations)

Neuronal differentiated cells

Massively Parallel Reporter Assaysand capture Hi-C for validation

AGGT

MinP Luciferase

ACGT ACGT

TCGT ACGC

Enhancer vector

Transcribed barcodes

RNA-seq

Lentivirus

Text

bullTest gt12000 170bp enhancers in parallel bull Quantitative activity assayed via RNA-seq bull Compare genotypes bull Human vs chimp bull Disease SNPs

iPS derived cells

Enhancers Human

Chimp

Vector library

Hane Ryu Nadav Ahituv Jay Shendure Yin Shen

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Discovering the role of enhancers inhuman biology and evolution

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening In vivo molecular studies

Collaborators

EnhancerFinder Gen Haliburton Tony Capra Dennis Kostka John Rubenstein

TargetFinder Sean Whalen Rebecca Truty Benoit Bruneau

MotifDiverge Dennis Kostka Tara Friedrich Deb Ritter Jeff Chuang

Funding from NHLBI PhRMA Foundation Gladstone Institutes

Page 40: Interference of 3D Regulatory Interactions from 2D Genomic ... · Erwin et al. (2014) PLoS Comp Bio, Capra et al. (2014) PTRSB. EnhancerFinder: Predictions • 84,301 developmental

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

VISTA 13 Browser

minus

+

13

Training Data

+VISTA 13 Browser

EvoluDonary ConservaDon Human

ChimpFeatures Mouse

Rat

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

minus

13 13

13 13 13 13 13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

+VISTA 13 Browser

minus

EvoluDonary ConservaDon FuncDonal Genomics Human

ChimpFeatures Mouse

Rat

bullChIP-shy‐seq (TFs histones) bullDNase HypersensiDvity bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

13 13 13 13

13 13 13 13 13

13 13 13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

+VISTA 13 Browser

minus

EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

13 13 13 13

13 13 13 13 13

13 13 13

Training Data

+VISTA 13 Browser

minus

Computa6onal Algorithm

Support vector machine separates 2 groups 13 Multi-kernel good for combining heterogeneous data types with different weights

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

EnhancerFinder Performance

AUC=096 Power=85 at 10 FPR Recall=85 at 93 Precision 13 FDR ~10-50 Significantly better than other methods 13 gt80 in vivo validation rate

Erwin et al (2014) PLoS Comp Bio

EnhancerFinder Predictions

239 Predicted HAR13 Enhancers

Heart13 (28)

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

bull Identify sites with fitness effects (Gulko et al 2015) Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

HARs

Figure 2 Massively parallel functional testing of HARs For each HAR the chimp version and versions with all combinations of human mutations will be shotgun cloned along with synthetic 20bp barcodes into an enhancer assay vector This vector library will be introduced to human and chimp neuronal differentiated cells via lentivirus HAR sequences that function as enhancers in these cells will produce barcoded RNA Each unique HAR sequence will have an average of 90 tags to allow for a quantitative readout RNA sequencing (RNA-seq) will allow for the measurement of the relative activity of each mutation in the encoded HAR MinP (minimal promoter) mut (mutations)

Neuronal differentiated cells

Massively Parallel Reporter Assaysand capture Hi-C for validation

AGGT

MinP Luciferase

ACGT ACGT

TCGT ACGC

Enhancer vector

Transcribed barcodes

RNA-seq

Lentivirus

Text

bullTest gt12000 170bp enhancers in parallel bull Quantitative activity assayed via RNA-seq bull Compare genotypes bull Human vs chimp bull Disease SNPs

iPS derived cells

Enhancers Human

Chimp

Vector library

Hane Ryu Nadav Ahituv Jay Shendure Yin Shen

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Discovering the role of enhancers inhuman biology and evolution

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening In vivo molecular studies

Collaborators

EnhancerFinder Gen Haliburton Tony Capra Dennis Kostka John Rubenstein

TargetFinder Sean Whalen Rebecca Truty Benoit Bruneau

MotifDiverge Dennis Kostka Tara Friedrich Deb Ritter Jeff Chuang

Funding from NHLBI PhRMA Foundation Gladstone Institutes

Page 41: Interference of 3D Regulatory Interactions from 2D Genomic ... · Erwin et al. (2014) PLoS Comp Bio, Capra et al. (2014) PTRSB. EnhancerFinder: Predictions • 84,301 developmental

13

Training Data

+VISTA 13 Browser

EvoluDonary ConservaDon Human

ChimpFeatures Mouse

Rat

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

minus

13 13

13 13 13 13 13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

+VISTA 13 Browser

minus

EvoluDonary ConservaDon FuncDonal Genomics Human

ChimpFeatures Mouse

Rat

bullChIP-shy‐seq (TFs histones) bullDNase HypersensiDvity bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

13 13 13 13

13 13 13 13 13

13 13 13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

+VISTA 13 Browser

minus

EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

13 13 13 13

13 13 13 13 13

13 13 13

Training Data

+VISTA 13 Browser

minus

Computa6onal Algorithm

Support vector machine separates 2 groups 13 Multi-kernel good for combining heterogeneous data types with different weights

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

EnhancerFinder Performance

AUC=096 Power=85 at 10 FPR Recall=85 at 93 Precision 13 FDR ~10-50 Significantly better than other methods 13 gt80 in vivo validation rate

Erwin et al (2014) PLoS Comp Bio

EnhancerFinder Predictions

239 Predicted HAR13 Enhancers

Heart13 (28)

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

bull Identify sites with fitness effects (Gulko et al 2015) Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

HARs

Figure 2 Massively parallel functional testing of HARs For each HAR the chimp version and versions with all combinations of human mutations will be shotgun cloned along with synthetic 20bp barcodes into an enhancer assay vector This vector library will be introduced to human and chimp neuronal differentiated cells via lentivirus HAR sequences that function as enhancers in these cells will produce barcoded RNA Each unique HAR sequence will have an average of 90 tags to allow for a quantitative readout RNA sequencing (RNA-seq) will allow for the measurement of the relative activity of each mutation in the encoded HAR MinP (minimal promoter) mut (mutations)

Neuronal differentiated cells

Massively Parallel Reporter Assaysand capture Hi-C for validation

AGGT

MinP Luciferase

ACGT ACGT

TCGT ACGC

Enhancer vector

Transcribed barcodes

RNA-seq

Lentivirus

Text

bullTest gt12000 170bp enhancers in parallel bull Quantitative activity assayed via RNA-seq bull Compare genotypes bull Human vs chimp bull Disease SNPs

iPS derived cells

Enhancers Human

Chimp

Vector library

Hane Ryu Nadav Ahituv Jay Shendure Yin Shen

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Discovering the role of enhancers inhuman biology and evolution

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening In vivo molecular studies

Collaborators

EnhancerFinder Gen Haliburton Tony Capra Dennis Kostka John Rubenstein

TargetFinder Sean Whalen Rebecca Truty Benoit Bruneau

MotifDiverge Dennis Kostka Tara Friedrich Deb Ritter Jeff Chuang

Funding from NHLBI PhRMA Foundation Gladstone Institutes

Page 42: Interference of 3D Regulatory Interactions from 2D Genomic ... · Erwin et al. (2014) PLoS Comp Bio, Capra et al. (2014) PTRSB. EnhancerFinder: Predictions • 84,301 developmental

13 13

13 13 13 13 13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

+VISTA 13 Browser

minus

EvoluDonary ConservaDon FuncDonal Genomics Human

ChimpFeatures Mouse

Rat

bullChIP-shy‐seq (TFs histones) bullDNase HypersensiDvity bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

13 13 13 13

13 13 13 13 13

13 13 13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

+VISTA 13 Browser

minus

EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

13 13 13 13

13 13 13 13 13

13 13 13

Training Data

+VISTA 13 Browser

minus

Computa6onal Algorithm

Support vector machine separates 2 groups 13 Multi-kernel good for combining heterogeneous data types with different weights

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

EnhancerFinder Performance

AUC=096 Power=85 at 10 FPR Recall=85 at 93 Precision 13 FDR ~10-50 Significantly better than other methods 13 gt80 in vivo validation rate

Erwin et al (2014) PLoS Comp Bio

EnhancerFinder Predictions

239 Predicted HAR13 Enhancers

Heart13 (28)

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

bull Identify sites with fitness effects (Gulko et al 2015) Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

HARs

Figure 2 Massively parallel functional testing of HARs For each HAR the chimp version and versions with all combinations of human mutations will be shotgun cloned along with synthetic 20bp barcodes into an enhancer assay vector This vector library will be introduced to human and chimp neuronal differentiated cells via lentivirus HAR sequences that function as enhancers in these cells will produce barcoded RNA Each unique HAR sequence will have an average of 90 tags to allow for a quantitative readout RNA sequencing (RNA-seq) will allow for the measurement of the relative activity of each mutation in the encoded HAR MinP (minimal promoter) mut (mutations)

Neuronal differentiated cells

Massively Parallel Reporter Assaysand capture Hi-C for validation

AGGT

MinP Luciferase

ACGT ACGT

TCGT ACGC

Enhancer vector

Transcribed barcodes

RNA-seq

Lentivirus

Text

bullTest gt12000 170bp enhancers in parallel bull Quantitative activity assayed via RNA-seq bull Compare genotypes bull Human vs chimp bull Disease SNPs

iPS derived cells

Enhancers Human

Chimp

Vector library

Hane Ryu Nadav Ahituv Jay Shendure Yin Shen

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Discovering the role of enhancers inhuman biology and evolution

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening In vivo molecular studies

Collaborators

EnhancerFinder Gen Haliburton Tony Capra Dennis Kostka John Rubenstein

TargetFinder Sean Whalen Rebecca Truty Benoit Bruneau

MotifDiverge Dennis Kostka Tara Friedrich Deb Ritter Jeff Chuang

Funding from NHLBI PhRMA Foundation Gladstone Institutes

Page 43: Interference of 3D Regulatory Interactions from 2D Genomic ... · Erwin et al. (2014) PLoS Comp Bio, Capra et al. (2014) PTRSB. EnhancerFinder: Predictions • 84,301 developmental

13 13 13 13

13 13 13 13 13

13 13 13

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

Training Data

+VISTA 13 Browser

minus

EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

13 13 13 13

13 13 13 13 13

13 13 13

Training Data

+VISTA 13 Browser

minus

Computa6onal Algorithm

Support vector machine separates 2 groups 13 Multi-kernel good for combining heterogeneous data types with different weights

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

EnhancerFinder Performance

AUC=096 Power=85 at 10 FPR Recall=85 at 93 Precision 13 FDR ~10-50 Significantly better than other methods 13 gt80 in vivo validation rate

Erwin et al (2014) PLoS Comp Bio

EnhancerFinder Predictions

239 Predicted HAR13 Enhancers

Heart13 (28)

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

bull Identify sites with fitness effects (Gulko et al 2015) Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

HARs

Figure 2 Massively parallel functional testing of HARs For each HAR the chimp version and versions with all combinations of human mutations will be shotgun cloned along with synthetic 20bp barcodes into an enhancer assay vector This vector library will be introduced to human and chimp neuronal differentiated cells via lentivirus HAR sequences that function as enhancers in these cells will produce barcoded RNA Each unique HAR sequence will have an average of 90 tags to allow for a quantitative readout RNA sequencing (RNA-seq) will allow for the measurement of the relative activity of each mutation in the encoded HAR MinP (minimal promoter) mut (mutations)

Neuronal differentiated cells

Massively Parallel Reporter Assaysand capture Hi-C for validation

AGGT

MinP Luciferase

ACGT ACGT

TCGT ACGC

Enhancer vector

Transcribed barcodes

RNA-seq

Lentivirus

Text

bullTest gt12000 170bp enhancers in parallel bull Quantitative activity assayed via RNA-seq bull Compare genotypes bull Human vs chimp bull Disease SNPs

iPS derived cells

Enhancers Human

Chimp

Vector library

Hane Ryu Nadav Ahituv Jay Shendure Yin Shen

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Discovering the role of enhancers inhuman biology and evolution

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening In vivo molecular studies

Collaborators

EnhancerFinder Gen Haliburton Tony Capra Dennis Kostka John Rubenstein

TargetFinder Sean Whalen Rebecca Truty Benoit Bruneau

MotifDiverge Dennis Kostka Tara Friedrich Deb Ritter Jeff Chuang

Funding from NHLBI PhRMA Foundation Gladstone Institutes

Page 44: Interference of 3D Regulatory Interactions from 2D Genomic ... · Erwin et al. (2014) PLoS Comp Bio, Capra et al. (2014) PTRSB. EnhancerFinder: Predictions • 84,301 developmental

13 13 13 13

13 13 13 13 13

13 13 13

Training Data

+VISTA 13 Browser

minus

Computa6onal Algorithm

Support vector machine separates 2 groups 13 Multi-kernel good for combining heterogeneous data types with different weights

EnhancerFinder Training Teach a machine learning algorithm to iden6fy developmentalenhancers ac6ve13 in different 6ssues based on their13 features

EvoluDonary ConservaDon FuncDonal Genomics DNA Sequence MoDfs Human AAAAAAACAAAGAAAT13

AACAAACCAACGAACT13ChimpFeatures AAGAAAGCAAGGAAGT13Mouse AATAAATCAATGAATT13

ACAAACACACAGACAT13Rat

bullChIP-shy‐seq (TFs histones) bull short13 k-shy‐mers bullDNase HypersensiDvity bull known13 TF moDfs bullENCODE13 bullEpigenomics RoadmapbullBench-shy‐to-shy‐Bassinet13

EnhancerFinder Performance

AUC=096 Power=85 at 10 FPR Recall=85 at 93 Precision 13 FDR ~10-50 Significantly better than other methods 13 gt80 in vivo validation rate

Erwin et al (2014) PLoS Comp Bio

EnhancerFinder Predictions

239 Predicted HAR13 Enhancers

Heart13 (28)

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

bull Identify sites with fitness effects (Gulko et al 2015) Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

HARs

Figure 2 Massively parallel functional testing of HARs For each HAR the chimp version and versions with all combinations of human mutations will be shotgun cloned along with synthetic 20bp barcodes into an enhancer assay vector This vector library will be introduced to human and chimp neuronal differentiated cells via lentivirus HAR sequences that function as enhancers in these cells will produce barcoded RNA Each unique HAR sequence will have an average of 90 tags to allow for a quantitative readout RNA sequencing (RNA-seq) will allow for the measurement of the relative activity of each mutation in the encoded HAR MinP (minimal promoter) mut (mutations)

Neuronal differentiated cells

Massively Parallel Reporter Assaysand capture Hi-C for validation

AGGT

MinP Luciferase

ACGT ACGT

TCGT ACGC

Enhancer vector

Transcribed barcodes

RNA-seq

Lentivirus

Text

bullTest gt12000 170bp enhancers in parallel bull Quantitative activity assayed via RNA-seq bull Compare genotypes bull Human vs chimp bull Disease SNPs

iPS derived cells

Enhancers Human

Chimp

Vector library

Hane Ryu Nadav Ahituv Jay Shendure Yin Shen

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Discovering the role of enhancers inhuman biology and evolution

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening In vivo molecular studies

Collaborators

EnhancerFinder Gen Haliburton Tony Capra Dennis Kostka John Rubenstein

TargetFinder Sean Whalen Rebecca Truty Benoit Bruneau

MotifDiverge Dennis Kostka Tara Friedrich Deb Ritter Jeff Chuang

Funding from NHLBI PhRMA Foundation Gladstone Institutes

Page 45: Interference of 3D Regulatory Interactions from 2D Genomic ... · Erwin et al. (2014) PLoS Comp Bio, Capra et al. (2014) PTRSB. EnhancerFinder: Predictions • 84,301 developmental

EnhancerFinder Performance

AUC=096 Power=85 at 10 FPR Recall=85 at 93 Precision 13 FDR ~10-50 Significantly better than other methods 13 gt80 in vivo validation rate

Erwin et al (2014) PLoS Comp Bio

EnhancerFinder Predictions

239 Predicted HAR13 Enhancers

Heart13 (28)

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

bull Identify sites with fitness effects (Gulko et al 2015) Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

HARs

Figure 2 Massively parallel functional testing of HARs For each HAR the chimp version and versions with all combinations of human mutations will be shotgun cloned along with synthetic 20bp barcodes into an enhancer assay vector This vector library will be introduced to human and chimp neuronal differentiated cells via lentivirus HAR sequences that function as enhancers in these cells will produce barcoded RNA Each unique HAR sequence will have an average of 90 tags to allow for a quantitative readout RNA sequencing (RNA-seq) will allow for the measurement of the relative activity of each mutation in the encoded HAR MinP (minimal promoter) mut (mutations)

Neuronal differentiated cells

Massively Parallel Reporter Assaysand capture Hi-C for validation

AGGT

MinP Luciferase

ACGT ACGT

TCGT ACGC

Enhancer vector

Transcribed barcodes

RNA-seq

Lentivirus

Text

bullTest gt12000 170bp enhancers in parallel bull Quantitative activity assayed via RNA-seq bull Compare genotypes bull Human vs chimp bull Disease SNPs

iPS derived cells

Enhancers Human

Chimp

Vector library

Hane Ryu Nadav Ahituv Jay Shendure Yin Shen

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Discovering the role of enhancers inhuman biology and evolution

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening In vivo molecular studies

Collaborators

EnhancerFinder Gen Haliburton Tony Capra Dennis Kostka John Rubenstein

TargetFinder Sean Whalen Rebecca Truty Benoit Bruneau

MotifDiverge Dennis Kostka Tara Friedrich Deb Ritter Jeff Chuang

Funding from NHLBI PhRMA Foundation Gladstone Institutes

Page 46: Interference of 3D Regulatory Interactions from 2D Genomic ... · Erwin et al. (2014) PLoS Comp Bio, Capra et al. (2014) PTRSB. EnhancerFinder: Predictions • 84,301 developmental

EnhancerFinder Predictions

239 Predicted HAR13 Enhancers

Heart13 (28)

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

bull Identify sites with fitness effects (Gulko et al 2015) Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

HARs

Figure 2 Massively parallel functional testing of HARs For each HAR the chimp version and versions with all combinations of human mutations will be shotgun cloned along with synthetic 20bp barcodes into an enhancer assay vector This vector library will be introduced to human and chimp neuronal differentiated cells via lentivirus HAR sequences that function as enhancers in these cells will produce barcoded RNA Each unique HAR sequence will have an average of 90 tags to allow for a quantitative readout RNA sequencing (RNA-seq) will allow for the measurement of the relative activity of each mutation in the encoded HAR MinP (minimal promoter) mut (mutations)

Neuronal differentiated cells

Massively Parallel Reporter Assaysand capture Hi-C for validation

AGGT

MinP Luciferase

ACGT ACGT

TCGT ACGC

Enhancer vector

Transcribed barcodes

RNA-seq

Lentivirus

Text

bullTest gt12000 170bp enhancers in parallel bull Quantitative activity assayed via RNA-seq bull Compare genotypes bull Human vs chimp bull Disease SNPs

iPS derived cells

Enhancers Human

Chimp

Vector library

Hane Ryu Nadav Ahituv Jay Shendure Yin Shen

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Discovering the role of enhancers inhuman biology and evolution

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening In vivo molecular studies

Collaborators

EnhancerFinder Gen Haliburton Tony Capra Dennis Kostka John Rubenstein

TargetFinder Sean Whalen Rebecca Truty Benoit Bruneau

MotifDiverge Dennis Kostka Tara Friedrich Deb Ritter Jeff Chuang

Funding from NHLBI PhRMA Foundation Gladstone Institutes

Page 47: Interference of 3D Regulatory Interactions from 2D Genomic ... · Erwin et al. (2014) PLoS Comp Bio, Capra et al. (2014) PTRSB. EnhancerFinder: Predictions • 84,301 developmental

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96)

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

bull Identify sites with fitness effects (Gulko et al 2015) Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

HARs

Figure 2 Massively parallel functional testing of HARs For each HAR the chimp version and versions with all combinations of human mutations will be shotgun cloned along with synthetic 20bp barcodes into an enhancer assay vector This vector library will be introduced to human and chimp neuronal differentiated cells via lentivirus HAR sequences that function as enhancers in these cells will produce barcoded RNA Each unique HAR sequence will have an average of 90 tags to allow for a quantitative readout RNA sequencing (RNA-seq) will allow for the measurement of the relative activity of each mutation in the encoded HAR MinP (minimal promoter) mut (mutations)

Neuronal differentiated cells

Massively Parallel Reporter Assaysand capture Hi-C for validation

AGGT

MinP Luciferase

ACGT ACGT

TCGT ACGC

Enhancer vector

Transcribed barcodes

RNA-seq

Lentivirus

Text

bullTest gt12000 170bp enhancers in parallel bull Quantitative activity assayed via RNA-seq bull Compare genotypes bull Human vs chimp bull Disease SNPs

iPS derived cells

Enhancers Human

Chimp

Vector library

Hane Ryu Nadav Ahituv Jay Shendure Yin Shen

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Discovering the role of enhancers inhuman biology and evolution

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening In vivo molecular studies

Collaborators

EnhancerFinder Gen Haliburton Tony Capra Dennis Kostka John Rubenstein

TargetFinder Sean Whalen Rebecca Truty Benoit Bruneau

MotifDiverge Dennis Kostka Tara Friedrich Deb Ritter Jeff Chuang

Funding from NHLBI PhRMA Foundation Gladstone Institutes

Page 48: Interference of 3D Regulatory Interactions from 2D Genomic ... · Erwin et al. (2014) PLoS Comp Bio, Capra et al. (2014) PTRSB. EnhancerFinder: Predictions • 84,301 developmental

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

bull Identify sites with fitness effects (Gulko et al 2015) Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

HARs

Figure 2 Massively parallel functional testing of HARs For each HAR the chimp version and versions with all combinations of human mutations will be shotgun cloned along with synthetic 20bp barcodes into an enhancer assay vector This vector library will be introduced to human and chimp neuronal differentiated cells via lentivirus HAR sequences that function as enhancers in these cells will produce barcoded RNA Each unique HAR sequence will have an average of 90 tags to allow for a quantitative readout RNA sequencing (RNA-seq) will allow for the measurement of the relative activity of each mutation in the encoded HAR MinP (minimal promoter) mut (mutations)

Neuronal differentiated cells

Massively Parallel Reporter Assaysand capture Hi-C for validation

AGGT

MinP Luciferase

ACGT ACGT

TCGT ACGC

Enhancer vector

Transcribed barcodes

RNA-seq

Lentivirus

Text

bullTest gt12000 170bp enhancers in parallel bull Quantitative activity assayed via RNA-seq bull Compare genotypes bull Human vs chimp bull Disease SNPs

iPS derived cells

Enhancers Human

Chimp

Vector library

Hane Ryu Nadav Ahituv Jay Shendure Yin Shen

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Discovering the role of enhancers inhuman biology and evolution

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening In vivo molecular studies

Collaborators

EnhancerFinder Gen Haliburton Tony Capra Dennis Kostka John Rubenstein

TargetFinder Sean Whalen Rebecca Truty Benoit Bruneau

MotifDiverge Dennis Kostka Tara Friedrich Deb Ritter Jeff Chuang

Funding from NHLBI PhRMA Foundation Gladstone Institutes

Page 49: Interference of 3D Regulatory Interactions from 2D Genomic ... · Erwin et al. (2014) PLoS Comp Bio, Capra et al. (2014) PTRSB. EnhancerFinder: Predictions • 84,301 developmental

EnhancerFinder Predictions

bull 84301 developmental enhancer predictions - Cover 2 of the human genome

239 Predicted HAR13 Enhancers

Heart13 (28)

- Nearby genes have high expression and annotated functions in the relevant fetal tissue - Significant overlap with disease mutations - Cluster around developmental transcription factors and signaling genes

Limb13 (41) Brain (96) bull 239 predictions overlap a Human Accelerated Region (33 of HARs) 2530 validated in vivo

bull Identify sites with fitness effects (Gulko et al 2015) Erwin et al (2014) PLoS Comp Bio Capra et al (2014) PTRSB

HARs

Figure 2 Massively parallel functional testing of HARs For each HAR the chimp version and versions with all combinations of human mutations will be shotgun cloned along with synthetic 20bp barcodes into an enhancer assay vector This vector library will be introduced to human and chimp neuronal differentiated cells via lentivirus HAR sequences that function as enhancers in these cells will produce barcoded RNA Each unique HAR sequence will have an average of 90 tags to allow for a quantitative readout RNA sequencing (RNA-seq) will allow for the measurement of the relative activity of each mutation in the encoded HAR MinP (minimal promoter) mut (mutations)

Neuronal differentiated cells

Massively Parallel Reporter Assaysand capture Hi-C for validation

AGGT

MinP Luciferase

ACGT ACGT

TCGT ACGC

Enhancer vector

Transcribed barcodes

RNA-seq

Lentivirus

Text

bullTest gt12000 170bp enhancers in parallel bull Quantitative activity assayed via RNA-seq bull Compare genotypes bull Human vs chimp bull Disease SNPs

iPS derived cells

Enhancers Human

Chimp

Vector library

Hane Ryu Nadav Ahituv Jay Shendure Yin Shen

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Discovering the role of enhancers inhuman biology and evolution

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening In vivo molecular studies

Collaborators

EnhancerFinder Gen Haliburton Tony Capra Dennis Kostka John Rubenstein

TargetFinder Sean Whalen Rebecca Truty Benoit Bruneau

MotifDiverge Dennis Kostka Tara Friedrich Deb Ritter Jeff Chuang

Funding from NHLBI PhRMA Foundation Gladstone Institutes

Page 50: Interference of 3D Regulatory Interactions from 2D Genomic ... · Erwin et al. (2014) PLoS Comp Bio, Capra et al. (2014) PTRSB. EnhancerFinder: Predictions • 84,301 developmental

HARs

Figure 2 Massively parallel functional testing of HARs For each HAR the chimp version and versions with all combinations of human mutations will be shotgun cloned along with synthetic 20bp barcodes into an enhancer assay vector This vector library will be introduced to human and chimp neuronal differentiated cells via lentivirus HAR sequences that function as enhancers in these cells will produce barcoded RNA Each unique HAR sequence will have an average of 90 tags to allow for a quantitative readout RNA sequencing (RNA-seq) will allow for the measurement of the relative activity of each mutation in the encoded HAR MinP (minimal promoter) mut (mutations)

Neuronal differentiated cells

Massively Parallel Reporter Assaysand capture Hi-C for validation

AGGT

MinP Luciferase

ACGT ACGT

TCGT ACGC

Enhancer vector

Transcribed barcodes

RNA-seq

Lentivirus

Text

bullTest gt12000 170bp enhancers in parallel bull Quantitative activity assayed via RNA-seq bull Compare genotypes bull Human vs chimp bull Disease SNPs

iPS derived cells

Enhancers Human

Chimp

Vector library

Hane Ryu Nadav Ahituv Jay Shendure Yin Shen

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Discovering the role of enhancers inhuman biology and evolution

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening In vivo molecular studies

Collaborators

EnhancerFinder Gen Haliburton Tony Capra Dennis Kostka John Rubenstein

TargetFinder Sean Whalen Rebecca Truty Benoit Bruneau

MotifDiverge Dennis Kostka Tara Friedrich Deb Ritter Jeff Chuang

Funding from NHLBI PhRMA Foundation Gladstone Institutes

Page 51: Interference of 3D Regulatory Interactions from 2D Genomic ... · Erwin et al. (2014) PLoS Comp Bio, Capra et al. (2014) PTRSB. EnhancerFinder: Predictions • 84,301 developmental

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Discovering the role of enhancers inhuman biology and evolution

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening In vivo molecular studies

Collaborators

EnhancerFinder Gen Haliburton Tony Capra Dennis Kostka John Rubenstein

TargetFinder Sean Whalen Rebecca Truty Benoit Bruneau

MotifDiverge Dennis Kostka Tara Friedrich Deb Ritter Jeff Chuang

Funding from NHLBI PhRMA Foundation Gladstone Institutes

Page 52: Interference of 3D Regulatory Interactions from 2D Genomic ... · Erwin et al. (2014) PLoS Comp Bio, Capra et al. (2014) PTRSB. EnhancerFinder: Predictions • 84,301 developmental

Induced pluripotent stem cell derivedneuronal and cardiac lines

Chi

mp

Hum

an

Human iPSC derived cardiomyocytes

Hane Ryu Alex Pollen Nadav Ahituv Arnold Kriegstein Bruce Conklin

Discovering the role of enhancers inhuman biology and evolution

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening In vivo molecular studies

Collaborators

EnhancerFinder Gen Haliburton Tony Capra Dennis Kostka John Rubenstein

TargetFinder Sean Whalen Rebecca Truty Benoit Bruneau

MotifDiverge Dennis Kostka Tara Friedrich Deb Ritter Jeff Chuang

Funding from NHLBI PhRMA Foundation Gladstone Institutes

Page 53: Interference of 3D Regulatory Interactions from 2D Genomic ... · Erwin et al. (2014) PLoS Comp Bio, Capra et al. (2014) PTRSB. EnhancerFinder: Predictions • 84,301 developmental

Discovering the role of enhancers inhuman biology and evolution

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening In vivo molecular studies

Collaborators

EnhancerFinder Gen Haliburton Tony Capra Dennis Kostka John Rubenstein

TargetFinder Sean Whalen Rebecca Truty Benoit Bruneau

MotifDiverge Dennis Kostka Tara Friedrich Deb Ritter Jeff Chuang

Funding from NHLBI PhRMA Foundation Gladstone Institutes

Page 54: Interference of 3D Regulatory Interactions from 2D Genomic ... · Erwin et al. (2014) PLoS Comp Bio, Capra et al. (2014) PTRSB. EnhancerFinder: Predictions • 84,301 developmental

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening In vivo molecular studies

Collaborators

EnhancerFinder Gen Haliburton Tony Capra Dennis Kostka John Rubenstein

TargetFinder Sean Whalen Rebecca Truty Benoit Bruneau

MotifDiverge Dennis Kostka Tara Friedrich Deb Ritter Jeff Chuang

Funding from NHLBI PhRMA Foundation Gladstone Institutes

Page 55: Interference of 3D Regulatory Interactions from 2D Genomic ... · Erwin et al. (2014) PLoS Comp Bio, Capra et al. (2014) PTRSB. EnhancerFinder: Predictions • 84,301 developmental

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening In vivo molecular studies

Collaborators

EnhancerFinder Gen Haliburton Tony Capra Dennis Kostka John Rubenstein

TargetFinder Sean Whalen Rebecca Truty Benoit Bruneau

MotifDiverge Dennis Kostka Tara Friedrich Deb Ritter Jeff Chuang

Funding from NHLBI PhRMA Foundation Gladstone Institutes

Page 56: Interference of 3D Regulatory Interactions from 2D Genomic ... · Erwin et al. (2014) PLoS Comp Bio, Capra et al. (2014) PTRSB. EnhancerFinder: Predictions • 84,301 developmental

G A

Discovering the role of enhancers inhuman biology and evolution

T Computational13 C Characterization

Gene B Gene C

iPSC based screening In vivo molecular studies

Collaborators

EnhancerFinder Gen Haliburton Tony Capra Dennis Kostka John Rubenstein

TargetFinder Sean Whalen Rebecca Truty Benoit Bruneau

MotifDiverge Dennis Kostka Tara Friedrich Deb Ritter Jeff Chuang

Funding from NHLBI PhRMA Foundation Gladstone Institutes

Page 57: Interference of 3D Regulatory Interactions from 2D Genomic ... · Erwin et al. (2014) PLoS Comp Bio, Capra et al. (2014) PTRSB. EnhancerFinder: Predictions • 84,301 developmental

Collaborators

EnhancerFinder Gen Haliburton Tony Capra Dennis Kostka John Rubenstein

TargetFinder Sean Whalen Rebecca Truty Benoit Bruneau

MotifDiverge Dennis Kostka Tara Friedrich Deb Ritter Jeff Chuang

Funding from NHLBI PhRMA Foundation Gladstone Institutes