Identifying and mapping cell-type-specific chromatin ...Identifying and mapping cell-type-specific...

10
Identifying and mapping cell-type-specific chromatin programming of gene expression Troels T. Marstrand a and John D. Storey a,b,1 a Lewis-Sigler Institute for Integrative Genomics, and b Department of Molecular Biology, Princeton University, Princeton, NJ 08544 Edited by Wing Hung Wong, Stanford University, Stanford, CA, and approved January 2, 2014 (received for review July 2, 2013) A problem of substantial interest is to systematically map variation in chromatin structure to gene-expression regulation across con- ditions, environments, or differentiated cell types. We developed and applied a quantitative framework for determining the exis- tence, strength, and type of relationship between high-resolution chromatin structure in terms of DNaseI hypersensitivity and genome- wide gene-expression levels in 20 diverse human cell types. We show that 25% of genes show cell-type-specific expression ex- plained by alterations in chromatin structure. We find that distal regions of chromatin structure (e.g., ±200 kb) capture more genes with this relationship than local regions (e.g., ±2.5 kb), yet the local regions show a more pronounced effect. By exploiting variation across cell types, we were capable of pinpointing the most likely hypersensitive sites related to cell-type-specific expression, which we show have a range of contextual uses. This quantitative frame- work is likely applicable to other settings aimed at relating continu- ous genomic measurements to gene-expression variation. epigenetics | gene regulation | computational biology | association | encode H umans, like all other multicellular organisms, possess a large number of distinct cell types, each of which is specialized for a particular function within the body. Cells from a variety of tissue types exhibit different gene-expression profiles relating to their function, where typically only a fraction of the genome is expressed. As all somatic cells share the same genome, special- ization is in part achieved by physically sequestering regions containing nonessential genes into heterochromatin structures. Genes that are needed for the particular task of the cell type display an accessible chromatin structure allowing for the bind- ing of transcription factors and other related DNA machinery and subsequent gene expression. To date, most studies have been limited to considering the chromatin accessibility surrounding the promoter region of genes, which is typically proximal (<10 kb) to the transcription region in just one or very few cell types or experimental conditions (13). However, it is also of interest to understand how larger regions (10 kb) of chromatin structure relate to a genes expression var- iation across multiple cell types, disease states, or environmental conditions. Recently, several large-scale international collabo- rations have started to generate data that can be used for this purpose (4, 5), although doing so requires new developments in computational methods (68). A collection of landmark papers from the Encyclopedia of DNA Elements (ENCODE) project were recently published that summarize their most recent efforts to comprehensively un- derstand functional elements in the human genome (e.g., refs. 5, 9, 10). Using ENCODE data, we undertook a well-targeted ge- nome-wide investigation to characterize the relationship between variations in chromatin structure and gene-expression levels across 20 diverse human cell lines (SI Appendix, Table S1). We used data on chromatin structure as ascertained through DNaseI hypersen- sitivity (DHS) measured by next-generation deep-sequencing tech- nology and gene-expression data measured by Affymetrix exon arrays. Replicated data on 10 cell lines were also used to assess the robustness of our method. Relating DHS to gene-expression levels across multiple cell types is challenging because the DHS represents a continuous variable along the genome not bound to any specific region, and the relationship between DHS and gene expression is largely uncharacterized. To exploit variation across cell types and test for cell-type-specific relationships between DHS and gene expres- sion, the measurement units must be placed on a common scale, the continuous DHS measure associated to each gene in a well- defined manner, and all measurements considered simultaneously. Moreover, the chromatin and gene-expression relationship may only manifest in a single cell type, making standard measures of correlation between the two uninformative because their relation- ship is not linear over a continuous range, as shown in Fig. 1 (fur- ther details in SI Appendix and Figs. S1S5). The computational approach developed here provides a pow- erful, tractable, and intuitive way of representing these data and capturing biologically informative relationships. We were able to characterize the level to which variation of chromatin accessibility is associated with gene-expression variation in a cell-type-specific manner. Within genomic segments of significant chromatin gene- expression concordance, our methodology is further capable of pinpointing the most likely local sites related to the detected as- sociation. We show that such sites are context specific and can be shared across genes within a single cell type or across several cell types. Our quantitative framework has some generality in that it may be readily applied to associate any quantitative measure along the genome to gene-expression variation. Results Genome-Wide Profiling of Chromatin Accessibility and Gene Expression. We used data on genome-wide, high-resolution chromatin acces- sibility measurements for 20 distinct human primary and culture cell lines that were obtained by an established sequencing-based method (11). In principle, accessible openchromatin is cleaved Significance In order for genes to be expressed in humans, the DNA corre- sponding to a gene and its regulatory elements must be ac- cessible. It is hypothesized that this accessibility and its effect on gene expression plays a major role in defining the different cell types that make up a human. We have only recently been able to make the measurements necessary to model DNA acces- sibility and gene-expression variation in multiple human cell types at the genome-wide level. We develop and apply a new quanti- tative framework for identifying locations in the human genome whose DNA accessibility drives cell-type-specific gene expression. Author contributions: J.D.S. designed research; T.T.M. and J.D.S. performed research; T.T.M. and J.D.S. contributed new reagents/analytic tools; T.T.M. analyzed data; and T.T.M. and J.D.S. wrote the paper. The authors declare no conflict of interest. This article is a PNAS Direct Submission. Freely available online through the PNAS open access option. 1 To whom correspondence should be addressed. E-mail: [email protected]. This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10. 1073/pnas.1312523111/-/DCSupplemental. www.pnas.org/cgi/doi/10.1073/pnas.1312523111 PNAS | Published online January 27, 2014 | E645E654 STATISTICS BIOPHYSICS AND COMPUTATIONAL BIOLOGY PNAS PLUS Downloaded by guest on June 24, 2021

Transcript of Identifying and mapping cell-type-specific chromatin ...Identifying and mapping cell-type-specific...

  • Identifying and mapping cell-type-specific chromatinprogramming of gene expressionTroels T. Marstranda and John D. Storeya,b,1

    aLewis-Sigler Institute for Integrative Genomics, and bDepartment of Molecular Biology, Princeton University, Princeton, NJ 08544

    Edited by Wing Hung Wong, Stanford University, Stanford, CA, and approved January 2, 2014 (received for review July 2, 2013)

    A problem of substantial interest is to systematically map variationin chromatin structure to gene-expression regulation across con-ditions, environments, or differentiated cell types. We developedand applied a quantitative framework for determining the exis-tence, strength, and type of relationship between high-resolutionchromatin structure in terms of DNaseI hypersensitivity and genome-wide gene-expression levels in 20 diverse human cell types. Weshow that ∼25% of genes show cell-type-specific expression ex-plained by alterations in chromatin structure. We find that distalregions of chromatin structure (e.g., ±200 kb) capture more geneswith this relationship than local regions (e.g., ±2.5 kb), yet the localregions show a more pronounced effect. By exploiting variationacross cell types, we were capable of pinpointing the most likelyhypersensitive sites related to cell-type-specific expression, whichwe show have a range of contextual uses. This quantitative frame-work is likely applicable to other settings aimed at relating continu-ous genomic measurements to gene-expression variation.

    epigenetics | gene regulation | computational biology | association |encode

    Humans, like all other multicellular organisms, possess a largenumber of distinct cell types, each of which is specialized fora particular function within the body. Cells from a variety oftissue types exhibit different gene-expression profiles relating totheir function, where typically only a fraction of the genome isexpressed. As all somatic cells share the same genome, special-ization is in part achieved by physically sequestering regionscontaining nonessential genes into heterochromatin structures.Genes that are needed for the particular task of the cell typedisplay an accessible chromatin structure allowing for the bind-ing of transcription factors and other related DNA machineryand subsequent gene expression.To date, most studies have been limited to considering the

    chromatin accessibility surrounding the promoter region of genes,which is typically proximal (

  • by the nonspecific endonuclease DNaseI, and the cleaved fragmentsare sequenced to provide a high-resolution, genome-wide map ofDHS for every cell type (SI Appendix, Table S2). The interpre-tation of these data is that increased fragment counts within aregion are indicative of greater chromatin accessibility. To investi-gate the impact of regional chromatin accessibility on gene-expressionvariation, we likewise used genome-wide gene-expression meas-urements in each cell line from Affymetrix exon arrays (SI Appendix,

    Table S4). A total of 19,215 genes were analyzed after preprocessing(Materials and Methods).With these quantifications, we sought to characterize the re-

    lationship between chromatin accessibility and gene expressionin a cell-type-specific manner, summarized in Figs. 1 and 2. Whatwe mean by “cell-type specific” is that, when pairing gene-expressionand chromatin-accessibility measures according to cell type, weobserve a relationship between the two measures, typically for one

    AG

    0445

    0

    BJ

    CA

    CO

    2

    GM

    0699

    0

    H7E

    SC

    HA

    Epi

    C

    HC

    F

    HC

    PE

    piC

    HL6

    0

    HM

    EC

    HR

    CE

    HU

    VE

    C

    Hel

    a

    Hep

    G2

    K56

    2

    Pan

    c

    SA

    EC

    SK

    MC

    SK

    NS

    H

    TH

    1

    0

    500

    Scalechr20:

    HNF4A

    HNF4A

    Txn Factor ChIP

    50 kb42350000 42400000 42450000 42500000

    RefSeq Genes

    Placental Mammal Basewise Conservation by PhyloP

    ENCODE Transcription Factor ChIP-seq

    BJ 100 _

    1 _

    CACO2 100 _

    1 _

    HL-60 100 _

    1 _

    HRCE 100 _

    1 _

    Hela 100 _

    1 _

    HepG2 100 _

    1 _

    Th1 100 _

    1 _

    Mammal Cons

    3 _

    -0.5 _

    AG

    0445

    0

    BJ

    CA

    CO

    2

    H7E

    SC

    HA

    Epi

    C

    HC

    F

    HC

    PE

    piC

    HL6

    0

    HM

    EC

    HR

    CE

    HU

    VE

    C

    Hel

    a

    Hep

    G2

    K56

    2

    Pan

    c

    SA

    EC

    SK

    MC

    SK

    NS

    H

    TH

    1

    0

    2

    4

    6

    8

    Scalechr20:

    HNF4A

    HNF4A

    Txn Factor ChIP

    50 kb42350000 42400000 42450000 42500000

    RefSeq Genes

    Placental Mammal Basewise Conservation by PhyloP

    ENCODE Transcription Factor ChIP-seq

    Hela 100 _

    1 _

    HepG2 100 _

    1 _

    CACO2 100 _

    1 _

    HNF4A-Hela

    1374 _

    0 _

    HNF4A-HepG2

    1374 _

    0 _

    HNF4A-CACO2

    1374 _

    0 _

    Mammal Cons

    3 _

    -0.5 _

    DH

    S d

    ata

    per cell-type for HNF4A selected cell-types

    sepyt-llecdetcelesrofseliforpSRAlacoLA4FNHrofepyt-llecrepSRAgnitluseR

    AR

    S p

    rofil

    esD

    HS

    dat

    a

    Correlation: 0.592 (with Hela) q-value: 0.107Correlation: 0.715 (without Hela) q-value: 0.024

    ARS: 13.0 (with Hela)q-value: < 10e-07ARS: 13.5 (without Hela)q-value: < 10e-07

    0.0 0.2 0.4 0.6 0.8

    −0.

    20.

    00.

    20.

    40.

    6

    Scaled and centered data

    Gene expression

    DH

    S v

    olum

    e

    Hela

    HepG2

    CACO2

    87°

    44°

    14°

    1000

    1500

    GM

    0699

    0

    10

    12

    Gene expression values DNase I hypersensitive sites for HNF4A,A B

    C

    D E

    Fig. 1. Overview of data and proposed approach. (A) Gene-expression measurements for 20 cell lines on an example gene, HNF4A. (B) DHS fragment se-quencing counts in a region about the gene. (C) The DHS signal is captured by summing the overall number of fragments over a given segment size (e.g., ±100 kb)about the gene’s TSS to obtain a DHS volume. After global normalization, the gene-expression data and DHS volume measures are scaled to lie on the unitinterval [0,1] and the data are centered about the origin according to the 2D medoid. For the HNF4A example, three outliers are clearly visible; for example,HepG2 displays both chromatin accessibility and active gene expression, whereas HeLa displays only chromatin accessibility. The goal is to quantitatively capturethe isolated relationship seen in HepG2 and assess whether this relationship is statistically significant. Traditional measures of linear correlation are not suitable foridentifying this type of signal, as shown by the substantial change seen after removal of a single cell line, HeLa, even though the data for HeLa are expected toexist for many genes and cell lines. The proposed ARS is robust to HeLa because the measure is based on angular placement and the median distance to themedoid of the data (dashed circle). (D) The ARS is calculating by first quantifying the relative distance to the origin for each cell line in a robust manner. Anangular penalty for each cell line is then calculated to quantify cell types concordant in both expression and DHS measured. This quantity is measured in terms ofangular distance from the 45° line, and it is then multiplied times its respective relative distance to give and overall score for each cell line. The maximum score istaken as the statistic for the given gene, allowing a comparison across all genes. (E) A local version of the ARS we introduce can pinpoint DHS “peaks” con-tributing the most to the detected association. See main text for details on the proposed methods.

    E646 | www.pnas.org/cgi/doi/10.1073/pnas.1312523111 Marstrand and Storey

    Dow

    nloa

    ded

    by g

    uest

    on

    June

    24,

    202

    1

    http://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1312523111/-/DCSupplemental/sapp.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1312523111/-/DCSupplemental/sapp.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1312523111/-/DCSupplemental/sapp.pdfwww.pnas.org/cgi/doi/10.1073/pnas.1312523111

  • or very few cell types. To this end, the cell-type specific chromatinprofiles were quantified by integrating the DHS fragment countsover increasingly larger genomic segments relative to the gene ofinterest (SI Appendix, Fig. S6) to obtain a cell-type-specific re-gional DHS volume. We selected a range of segments that werelikely to encompass all proximal [transcriptional start site (TSS) ±2.5 kb] and most distal regulatory elements (TSS ± 50, ± 100, ±150, ± 200, and ± 100 kb minus proximal 2.5 kb, and ± 200 kbminus proximal 2.5 kb). In addition, to account for copy numbervariation and chromosome-arm-related effects, the obtained DHSvolumes were scaled on either side of the centromere to arrive at

    equilibrium across samples (SI Appendix, Fig. S7). Alternativerepresentations of DHS signal (8, 12) could be used at this step,although we did not identify any advantages in doing so. Gene-expression values were summarized as the mean intensity acrossall probe sets linked to a given RefSeq-gene.

    Detecting Cell-Type-Specific Chromatin Accessibility and Gene-Expression Concordance. Due to the “on–off” nature of DHSand subsequent transcription, there will not necessarily be a lin-ear relationship between DHS and gene-expression measures.Using correlation or correlation-like statistics to associate the

    0 500 1000 1500 2000

    020

    000

    4000

    060

    000

    8000

    0

    GM06990

    K562

    TH1

    0.0 0.2 0.4 0.6 0.8 1.0

    0.0

    0.2

    0.4

    0.6

    GM06990

    K562

    TH1

    05

    1015

    Pan

    c

    HC

    PE

    piC

    SA

    EC

    HA

    Epi

    C

    H7E

    SC

    HM

    EC

    HR

    CE BJ

    AG

    0445

    0

    Hel

    a

    HU

    VE

    C

    HC

    F

    SK

    NS

    H

    Hep

    G2

    SK

    MC

    HL6

    0

    CA

    CO

    2

    GM

    0699

    0

    K56

    2

    TH

    1

    0.0

    0.1

    0.2

    0.3

    0.4

    0.5

    0.6

    0.7

    HA

    Epi

    C

    HC

    PE

    piC

    AG

    0445

    0

    Pan

    c

    SA

    EC BJ

    SK

    MC

    HL6

    0

    Hel

    a

    CA

    CO

    2

    Hep

    G2

    SK

    NS

    H

    HC

    F

    HM

    EC

    H7E

    SC

    HU

    VE

    C

    HR

    CE

    K56

    2

    GM

    0699

    0

    TH

    1

    02

    46

    810

    12

    HA

    Epi

    C

    HC

    PE

    piC

    Pan

    c

    SA

    EC BJ

    AG

    0445

    0

    HM

    EC

    H7E

    SC

    Hel

    a

    HR

    CE

    HC

    F

    SK

    NS

    H

    SK

    MC

    HL6

    0

    Hep

    G2

    HU

    VE

    C

    CA

    CO

    2

    GM

    0699

    0

    K56

    2

    TH

    1

    0.4 0.2 0.0 0.2 0.4

    0.2

    0.0

    0.2

    0.4

    RAW DATA NORMALIZED DATA

    RELATIVE DISTANCE TO ORIGIN PENALIZED ANGULAR FROM

    ANGLE-RATIO STATISTICS (ARS) RANDOMIZED DATA

    COMPARE ARSmax

    IDENTIFY STATISTICALLYSIGNIFICANT GENES WITH DHS & EXPRESSION CONCORDANCE

    Gene Expression

    DH

    S V

    olum

    e

    Step 1

    Step 2

    Step 3

    Step 4

    Fig. 2. Overview of ARS method, applied to example gene CD69. (Step 1) For a given gene, DHS volume and gene expression are calculated for all 20 celllines as described in the text. DHS volume and gene expression are respectively scaled to lie on the unit interval [0,1] and then median centered beforeconsidering their joint distribution. Each cell type corresponds to a single point. (Step 2) To form the “ratio” component of the ARS, ri, the distance from theorigin to each point is calculated and then scaled by the median distance (Left). The angular distance between each point and the identity line is calculatedand evaluated in an exponential function to determine an angular penalty ai for each cell type (Right). (Step 3) The final ARS is computed as the product ofthe normalized distances and the angular penalties, ARSi ¼ ai × ri . The maximal statistic ARSmax is calculated for each gene and the corresponding cell typerecorded, in this case the TH1 cell line. (Step 4) A randomization method is performed to generate null data, upon which null ARSmax are calculated. These arecompared with the observed ARSmax values to calculate the statistical significance of each gene.

    Marstrand and Storey PNAS | Published online January 27, 2014 | E647

    STATIST

    ICS

    BIOPH

    YSICSAND

    COMPU

    TATIONALBIOLO

    GY

    PNASPL

    US

    Dow

    nloa

    ded

    by g

    uest

    on

    June

    24,

    202

    1

    http://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1312523111/-/DCSupplemental/sapp.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1312523111/-/DCSupplemental/sapp.pdf

  • two measurements across all cell types proved to be unreliableand uninformative (further details in SI Appendix and Figs. S1–S5). One of the key types of relationships we sought to detect isof the type shown in Fig. 1, where one or very few cell types areoutliers from the others. The standard Pearson correlation sta-tistic is not well suited for this scenario. First, it requires the datato be jointly normal to obtain parametric P values, but the nor-mal assumption does not hold for these data (SI Appendix, Fig.S2). Second, this correlation statistic is unstable when there areoutliers, even when using permutation-based P values, demon-strated directly on these data (SI Appendix, Figs. S1 and S3). Therank-based Spearman correlation statistic is a potential alterna-tive, but it shows very poor power relative to the method pro-posed here as shown in Fig. 3 (see also SI Appendix, Figs. S4 andS5). For example, at a false discovery rate (FDR) ≤ 0.05, theproposed method identifies 2,538 genes with a cell-type-specificDHS and gene-expression relationship, whereas the Spearmanstatistic identifies only 286 (Fig. 3 and SI Appendix, Figs. S4and S5).The statistic proposed here is designed to be appropriate for

    scenarios when both measurements are restricted to a narrowrelative range, with one or very few cell types appearing as dis-tinct outliers. To evaluate the relationship between the DHSvolume of a genomic segment and gene expression, we took intoaccount the compactness of the measurements versus any dis-tinct outliers in both dimensions and whether the outliers wereconcordant in both measurements (i.e., a simultaneous increaseor decrease) to form an overall composite measure called anangle ratio statistic (ARS; detailed in Figs. 1 and 2, Materials andMethods, and SI Appendix). To summarize, we first scale andmedian center the DHS volume and expression data, respectively,for a given gene. We then calculate the relative distance of eachcell type to the overall center of the data, which serves as a way tomeasure the degree to which each cell type is an outlier. Tomeasure concordance of DHS volume and gene expression, wecalculate the angular distance between each point and the 45° lineof identity, penalizing points farther away from the line of identityaccording to a data-derived exponential function. These twoquantities are then multiplied to form an ARSi value for each celltype ði ¼ 1;   2;  . . . ; 20Þ, and the maximal value ARSmax is theoverall statistic that quantifies cell-type-specific DHS volume andgene-expression concordance for a given gene.To identify statistically significant genes from ARSmax, we con-

    structed a null distribution based on randomization of the observedexperimental data (Materials and Methods, SI Appendix and Fig.S8). ARSmax values obtained from the randomized data were usedas a basis for determining a P value of the observed ARSmax foreach gene. FDR-based statistical significance and the proportion ofgenes with a true chromatin accessibility and gene-expression re-lationship were estimated from the P values (ref. 13; Fig. 3B).We estimate that ∼25% of genes show concordance between

    chromatin accessibility and gene-expression variation in a cell-type-specific manner. Although our strategy is capable of detectingoutliers showing negative concordance (decreased chromatin ac-cessibility and decreased gene expression), none were found to besignificant at FDR ≤ 0.05. The number of significant genes in-creased by inclusion of distal DHS volume (Fig. 3B, column 2),indicating that distal chromatin programming effects are morewidespread in a genome-wide sense (14). On the other hand, usingthe proximal DHS volume we observe a greater empirical effectsize compared with the distal DHS volumes (Fig. 3B, columns 3–5).This observation is explained by the aggregation of genes

    significant for the same cell type along the genome (15). Testingwhether one or more significant genes within a ±100-kb regionwere associated with the same cell type we found that 481 out of668 significant genes within the specified boundary stem fromthe same cell type (Fishers exact test P < 2.2e-16; SI Appendixand Fig. S14). It is however important to note that inclusion of

    increasingly distal regions also increases the noise in the DHSvolume, wherefore the effect size and ultimately the number oftrue associations starts to decline (Fig. 3A).

    Experimental Replication. To assess reproducibility, we tested theconcordance of significant results among replicated data for 10cell types. Based on two independent measurements of DHS and

    010

    0020

    0030

    0040

    00

    No. Significant Genes at Different FDRs

    DHS segment size

    # si

    gnifi

    cant

    gen

    es

    2.5k

    b

    50kb

    100k

    b

    150k

    b

    200k

    b

    500k

    b

    1mb

    2mb

    FDR

  • gene expression, respectively, we calculated the fraction of pre-dictions preserved in all four-way comparisons (SI Appendix). Wefound that between 86 and 91% of significant genes (FDR ≤ 0.05)were identical (SI Appendix, Fig. S15).

    Gene Ontology and Pathway Analysis. To determine the biologicalcoherence of the set of genes found to be significant for eachcell-type, we performed a gene ontology (GO) enrichment analysis(16). The method computes enrichment within the process andfunction components of GO categories and assigns a numericalsignificance to the findings. In nearly all cases the results were inagreement with the actual biology; see http://encode.princeton.edu/for results on all DHS segment sizes. For example, human T cellsshowed a strong enrichment of T-cell receptor related genes,whereas hepatic cells showed enrichment of lipid metabolism-related genes. KEGG pathways (17, 18) were likewise enrichedin a cell-type-specific manner. For example, HepG2 (a hepato-cellular carcinoma cell line) showed significant enrichment forgenes within the bile-acid synthesis and drug metabolism, andHL60 (a human promyelocytic leukemia cell line) showed sig-nificant enrichment within the hematopoietic cell lineage.Furthermore, all genes detected within each cell type at

    FDR < 0.05 (±100kb DHS volume) were analyzed through theuse of Ingenuity Pathways Analysis (Ingenuity Systems, www.ingenuity.com). For all but 3 cases out of 20 (two cell types likelyhad too few significant genes detected to get reliable annota-tions), the category “physiological system development and func-tion” was in clear correspondence with that expected given the celltype (SI Appendix, Fig. S13). For instance, TH1 was enriched forcell-mediated immune response, K562 for hematological systemdevelopment and function, and H7ESC for embryonic devel-opment. For each gene, there tended to be low relative ARSiacross the remaining cell types, indicating that we detected trulycell-type-specific genes as clear outliers on a genome-wide scale.However, some cases showed large relative ARSi in a few tissues,which prompted us to investigate these instances further.Among genes with a statistically significant ARSmax statistic,

    additional inspection of the remaining ARSi were explored fordetection of possible substructures. We calculated relative ARSvalues within each gene dividing all ARSi by ARSmax. In additionto many instances of singular outliers, we detected a gradientbehavior among significant genes, where a few cell types wereevident as outliers (SI Appendix, Fig. S16).

    Local ARS Profiles. The DHS data itself provides a rich source ofinformation about regulatory elements in the genome. However,when used in conjunction with gene-expression data acrossdiffering cell types, there is an opportunity to discover whichlocations of chromatin accessibility drive gene expression in a cell-type-specific manner. This goal prompted us to develop a tech-nique to model the relationship for fine-scale segments of DHSvolume across the larger segments. As the above strategy focusedon examination of chromatin gene-expression interactions overgenomic segments, investigation of fine-scale patterns allowed usto: (i) validate that distal regulatory regions were indeed presentas peaks in chromatin accessibility concordant with gene expres-sion in a cell-type-specific manner, (ii) perform sequence analysesof these chromatin accessibility peaks, (iii) compare localizedassociations across cell types or within a single gene, and (iv)provide a framework for quantifying regions of interest on a con-tinuous scale for investigation of regulatory elements.We therefore extended our approach to allow one to identify

    and map DHS sites to genes on which they show strong evidencefor playing a regulatory role in a cell-type-specific manner. Thiswas carried out by providing a fine-scale version of the ARSquantification, called a local ARS profile for genes with a sta-tistically significant ARSmax statistic over a larger segment. Thepeaks of the local ARS profiles pinpoint which DHS are most

    influential in explaining the cell-type-specific gene-expressionvariation, thereby indicating that they have the most regulatorypotential. We retained the gene-expression values for a givensignificant gene, and now considered the DHS volume withinnonoverlapping consecutive regions at a high-resolution 60-base-pair windows. The ARS was calculated for each 60-bp window,which can then be plotted over the entire region used in iden-tifying the gene as statistically significant. (The original sequencealignments for the DHS data were mapped into 20-bp windowsspanning the entire genome, so we chose windows of size 60 bpto smooth the local DHS measure across neighboring sites.) Forexample, for a gene significant with respect to a ±200-kb DHSvolume, we calculated ∼6,700 local ARS values for each cell type.These can then be plotted in such a way that the signal emanatingfrom that location is visible, loosely analogous to a logarithm ofodds (LOD) score profile in linkage analysis. Additional stepswere taken, involving scaling across the 60-bp windows to preservea valid interpretation of their relative magnitudes (SI Appendix).We first selected the subset of local ARS profile “peaks” by

    thresholding the local ARS profiles in a principled manner(Materials and Methods), and we analyzed both positional biasesand sequence compositions as they relate to function. We thenanalyzed the entire trajectories of local ARS profiles at specificloci, showing that they identify both known and putative regu-latory DHS for given genes.

    Positional Bias of Putative Regulatory DHS. Because the overallstatistical significance increases when calculating DHS volumeover more distal regions up to 200 kb (Fig. 3), we investigated thepositional bias of local ARS peaks in a cell-type-specific manner.Fig. 4A shows smoothed densities of positional local ARS peakcounts by cell type, which exhibit high cell-type-specific differ-ences, specifically the density around the TSS. Random densitieswere generated by randomly assigning positional counts to tissuesin equal proportions to the observed counts, where it can be seenthat the cell-type differences are no longer present (SI Appendix,Fig. S17). This points to the existence of cell-type-specific dif-ferences in the base-pair distance of regulatory DHS to TSS.

    Sequence Analysis of Peaks in Local ARS Profiles. We next sought tocharacterize the functional significance of sequences correspond-ing to local ARS peaks. Because a general indicator of function-ality is conservation, we extracted the conservation track values(phastCons44wayPrimate, hg18) (19) corresponding to the localARS peaks and to the negative control set (Materials andMethods).Values range from 0 to 1, with 1 indicating the most conserved.The regions with local ARS peaks were significantly more con-served than regions from the negative control set (Kolmogorov–Smirnov P < 2.2e-16; SI Appendix, Fig. S18), indicating substantialconservation of sequences corresponding to local ARS peaks.DNase-I hypersensitive sites are well established markers

    of regulatory and other DNA-binding proteins. We thereforesought to establish whether known cell-type-specific transcrip-tion-factors binding sites (TFBSs) are overrepresented in thelocal ARS peaks relative to the negative control set (Materialsand Methods). Because regions distal to the TSS are rarelystudied in this context, we eliminated all local ARS peaks andnegative controls that fell within ±10 kb of the TSS. This stepwas taken to demonstrate that the proposed approach is capableof detecting distal TFBS, up to 200 kb from the TSS.We used the JASPAR database (20) to identify TFBS that

    are differentially represented in the local ARS peaks relative tothe negative control set (Materials and Methods). The over- andunderrepresented TFBS show distinct cell-type-specific patterns(21) and provide a rich insight into cell-type-specific gene reg-ulation (Fig. 4B), several of which are listed here:

    Marstrand and Storey PNAS | Published online January 27, 2014 | E649

    STATIST

    ICS

    BIOPH

    YSICSAND

    COMPU

    TATIONALBIOLO

    GY

    PNASPL

    US

    Dow

    nloa

    ded

    by g

    uest

    on

    June

    24,

    202

    1

    http://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1312523111/-/DCSupplemental/sapp.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1312523111/-/DCSupplemental/sapp.pdfhttp://encode.princeton.edu/http://www.ingenuity.comhttp://www.ingenuity.comhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1312523111/-/DCSupplemental/sapp.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1312523111/-/DCSupplemental/sapp.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1312523111/-/DCSupplemental/sapp.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1312523111/-/DCSupplemental/sapp.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1312523111/-/DCSupplemental/sapp.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1312523111/-/DCSupplemental/sapp.pdf

  • • Among the hepatocyte nuclear factors we found HNF1B (tran-scription factor 2, TCF-2) and HNF4A to have significant chro-matin gene expression concordance in HRCE (a human renalcortical epithelial cell line) and HepG2, respectively (SI Ap-pendix, Figs. S19 and S20). Furthermore we found the localARS profiles in the respective tissues to display a marked over-representation of the factor in question, HNF1B in HRCE andHNF4A in HepG2. Mutations in HNF1B have been associatedwith a broad range of renal diseases (22), and HNF4A is essen-tial for hepatocyte differentiation and morphology (23).

    • H7ESC (a human embryonic stem cell line) was found to showoverrepresentation of SOX2 and POU5F1 (Oct-4) both essen-tial for self-renewal in undifferentiated stem cells.

    • NFYA (a CCAAT-binding protein) was found overrepre-sented in almost all tissues. This factor is essential for en-hancer function by recruiting distal transcription factors to

    the proximal promoter region (24). The ubiquitous CCAAT-binding-factor family is linked to cellular differentiation ina variety of tissues (25).

    • Retinoid X receptors (RXR)–retinoic acid receptors (RAR)were found in human amniotic epithelial cells (HAEpiC). Thecoexpression of RAR and RXR (26) is essential for properplacental development, and RXR null mouse mutants arelethal after 10 d due to placental defects (27).

    • Forkhead binding sites were found to be primarily underrep-resented, specifically FOXD3 was under-represented in, amongothers, the leukemic cell types. Silencing of FOXD3 by aber-rant chromatin modification has been implicated in leukemo-genesis (26). Overexpression of FOXD3 prevents neural crestformation (28). It is interesting to note that binding sites forthe factor were underrepresented in SKNSH, a neuroblastomaderived from neural crest cells.

    Distance from TSS

    dens

    ity

    0.00000

    0.00005

    0.00010

    0.00015

    0.00020

    0.00025

    0.00030

    -200kb -120kb -40kb +40kb +120kb

    tissue

    CACO2

    GM06990

    H7ESC

    HAEpiC

    Hela

    HepG2

    HL60

    HRCE

    K562

    SKNSH

    TH1

    H7E

    SC

    HA

    Epi

    C

    Hel

    a

    HR

    CE

    SK

    NS

    H

    K56

    2

    CA

    CO

    2

    HL6

    0

    Hep

    G2

    TH

    1

    GM

    0699

    0Prrx2Pdx1ARID3AHltfNKX3−1FOXL1Foxd3FOXI1SOX9RESTTal1::Gata1TEAD1NR3C1RXR::RAR_DR5Zfxznf143Sox2Pou5f1MIZFLhx3PPARG::RXRAHNF4AHNF1BPax6HNF1ACTCFE2F1TFAP2ASP1Klf4ELK1NFYAELK4Egr1GABPA

    −3.48 −2.8 −2.25−3.1 −2.49 −2.07

    −3.82 −2.18−3.02−3.43

    −2.49 −4.84 −2.18 −3.06 −2.2−4.31 −3.3 −3.07 −2.89 −2.47 −3.13 −4.49−3.52 −3.26 −3.91 −2.61 −2.36 −2.76 −4.11−2.84 −3.66 −2.43

    7.2580.635.273.3

    2.48 2.75 2.44 3.155.086.04

    −2.34 − 20.368.29.2 2.45.52

    7.017.08

    2.64 4.71−4.1

    17.332.33.27

    −3.36 3.763.674.79

    15.539.333.536.4 4.742.45 3.48 2.38 3.3 2.69 3.65

    2.34 3.11 3.04 2.43 2.2643.256.2 4.14 4.11 3.45 2.92

    2.33 11.21.2 4.02 4.12 3.54 2.912.2 3.98 3.08 3.65 3.34 3.662.71 78.251.545.390.2 4.21 3.84 4.11

    2.45 2.73 2.97 4.47 3.86 4.26 4.2 4.6197.512.331.491.3 2.24 4.97 4.65 4.72

    92.532.536.583.469.548.372.320.2

    −4 −2 0 2 4 6

    Log2-fold change

    050

    150

    Cou

    nt

    Cell-types

    Transcription factor models

    A

    B

    Fig. 4. Analysis of local ARS profiles. (A) Distribution of local ARS peaks relative to the TSS according to cell type. The positional bias of cell-type-specific localARS peaks as measured by the density of local ARS peaks within cell lines with respect to position from to the TSS. Clear differences in the amount of distalregulation are seen across the cell types and the density around the TSS differ markedly among cell types. For example, HL60 shows a more proximal signalrelative to that of HAEpiC. (B) Transcription-factor binding site analysis among local ARS peaks occurring 10 to 200 kb from the TSS. Sequences correspondingto local ARS peaks within significant cell-type-specific genes were searched with known transcription-factor binding site models, and the relative over- andunderrepresentation was assessed based on a negative control set. Instances of absolute log2 fold-change ≥2 are displayed within the relevant cell types.Overrepresentation is indicative of a preferential transcription factor binding site, and is therefore a likely regulatory candidate for the observed geneexpression. Underrepresented sites indicate factors that should be avoided to maintain proper cell-type-specific expression profiles. For instance, Sox2 andPou5f1 (Oct4) were observed solely overrepresented in the embryonic cells, H7ESC.

    E650 | www.pnas.org/cgi/doi/10.1073/pnas.1312523111 Marstrand and Storey

    Dow

    nloa

    ded

    by g

    uest

    on

    June

    24,

    202

    1

    http://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1312523111/-/DCSupplemental/sapp.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1312523111/-/DCSupplemental/sapp.pdfwww.pnas.org/cgi/doi/10.1073/pnas.1312523111

  • • NF-κB was found over-represented in TH1, where it promotesthe expression of, among others, interleukin 12 (IL-12) essen-tial for TH1 development (29).

    The differentially represented TFBSs were distributed largelydistal. For all cell types, from 68 to 79% were located more than ±50 kb away from the TSS. We repeated the analysis with onlythe proximal regions (±10 kb from the TSS), and we found thatseveral important known cell-type-specific motifs were no longerdetected (SI Appendix, Fig. S21).To validate our in silico TFBS predictions with an indepen-

    dent data source, we downloaded the UCSC genome tracks forPOU5F1 (Oct-4) in HESC and NFYA in K562 (http://genome.ucsc.edu/ENCODE/). (Note that HESC is not completely iden-tical to H7ESC, but both are embryonic stem cells.) We nextcalculated the overlap between the POU5F1 binding sites deter-mined from the ChIP-seq data and the positive local ARS peaksthat we identified in H7ESC. Of the 1,614 peaks used for TFBSanalysis in H7ESC, 200 overlapped with the POU5F1 ChIP-seqcalls; of the remaining 37,205 positive local ARS peaks in othercell types, where POU5F1 was not predicted to be enriched, only119 overlapped. Similarly for K562, of the 4,861 positive localARS peaks 439 overlapped ChIP-seq calls for NFYA. For celltypes where NFYA was not detected [HRCE, SNKSH (a neuro-blastoma cell line), and HeLa], of 8,176 positive local ARS peaks,only 72 had overlapping ChIP-seq calls. Hence, we obtain an en-richment of 39-fold for POU5F1 (P < 10−16) and 10-fold for NFYA(P < 10−16), respectively. Overall, this supports the method’sability to map relevant regulatory regions on a fine scale.

    Single-Nucleotide Polymorphisms. The local ARS peaks were alsoinvestigated with respect to SNPs found to be significant in genome-wide association studies (GWAS). The database compiled byJohnson and O’Donnell (30) containing a total of 52,622 uniqueSNPs associated with disease phenotypes (56,411 total associa-tions) was mapped against the genomic coordinates of the localARS peaks. We found that only 42 SNPs fell within local ARSpeaks, and the expected value under completely random place-ment of SNPs is 68. A statistical test indicates that there is sig-nificant underrepresentation of GWAS SNPs (two-sided exacttest P < 0:001). This is likely linked to the conserved state of thesequences corresponding to local ARS peaks (shown above) andthat these regions harbor cell-type-specific regulatory elements,as indicated by the above TFBS analysis.This does not preclude GWAS SNPs from appearing in local

    ARS peaks. Indeed, such an occurrence may be particularlynoteworthy. An interesting example is that of rs1890131 (31),which lies in a local ARS peak driving the ARS-based statisticalsignificance of three different genes (SI Appendix, Fig. S22):CREG1 in HL60, RCSD1 in TH1, and CD247 in TH1 (SI Ap-pendix, Fig. S23). The SNP rs1890131 has been associated withcoronary spasms via CREG1, where cellular repressor of E1A-stimulated genes (CREG) attenuates cardiac hypertrophy (32)by blocking mitogen-activated protein kinases/extracellular signal-regulated kinase (MEK-ERK)1/2-dependent signaling. RCSD1(also termed capZ-interacting protein) is suggested to regulate theability of CapZ to remodel actin filament assembly in vivo (33),and thereby affects cardiac isometric tension generation (34). Last,CD247 (TCRζ) is a master regulator of adaptive immune re-sponses, where loss of TCRζ-chain expression may contribute tothe inflammatory process that triggers coronary instability by ac-cumulating in coronary plaques (35). The SNP rs1890131 is cen-tered in a hub of transcription-factor binding sites as detected bythe ENCODE transcription-factor track (SI Appendix, Fig. S24).The local ARS peak containing this SNP contributed significantlyto the association detected for all three genes, which are distal andin different orientation from the site of the SNP (SI Appendix,Fig. S22).

    Mapping Putative Regulatory DHS to Genes. We also investigatedthe utility of considering the entire trajectory of local ARS pro-files at a locus to characterize the regulatory architecture ofcell-type-specific expression. We investigated in detail two well-characterized examples of regulatory interactions at the β-globin(HBB) locus control region and at the stem cell leukemia (SCL)gene, also known as TAL1, with several more appearing in SIAppendix, Figs. S25–S37. It can be seen from these analyses thatthe local ARS profiles provide a means to map DHS sites togenes in a cell-type-specific manner.The HBB (β-globin) locus control region (LCR) comprises an

    array of functional elements that in vivo gives rise to five majorDNase I hypersensitive sites (HS1–HS5; refs. 36–38; Fig. 5)upstream of HBE1 (e-globin) on the short arm of chromosome

    Scalechr11:

    20 kb hg185210000 5220000 5230000 5240000 5250000 5260000

    RefSeq GenesHBB HBD HBBP1 HBG1

    HBG2HBE1

    HBB-K562510 _

    0 _

    HBB-HL60510 _

    0 _

    HBE1-K5627100 _

    0 _

    HBE1-HL607100 _

    0 _

    K562 DHS 100 _

    1 _

    HL60 DHS 100 _

    1 _

    BJ DHS100 _

    1 _

    CACO2 DHS 100 _

    1 _

    Th1 DHS 100 _

    1 _

    Loca

    l AR

    SR

    aw D

    HS

    sig

    nal

    Scalechr1:

    20 kb hg1847410000 47420000 47430000 47440000 47450000 47460000 47470000

    RefSeq GenesPDZK1IP1 TAL1

    TAL1-CACO2

    1300 _

    0 _

    TAL1-HL60

    1300 _

    0 _

    TAL1-HRCE

    1300 _

    0 _

    TAL1-K562

    1300 _

    0 _

    PDZK1IP1-CACO2

    3800 _

    0 _

    PDZK1IP1-HL60

    3800 _

    0 _

    PDZK1IP1-HRCE

    3800 _

    0 _

    PDZK1IP1-K562

    3800 _

    0 _

    CACO2 DHS

    100 _

    1 _

    HL60 DHS

    100 _

    1 _

    HRCE DHS

    100 _

    1 _

    K562 DHS

    100 _

    1 _

    Loca

    l AR

    S T

    AL1

    Loca

    l AR

    S P

    DZ

    K1I

    P1

    Raw

    DH

    S S

    igna

    l

    A

    B

    Fig. 5. Mapping putative regulatory DHS with local ARS profiles at two loci.(A) The β-globin locus control region. The DHS data for five cell lines (out of20) are shown, as well as the local ARS profiles for HBB and HBE1 in the K562and HL60 cell lines. The transparent yellow boxes indicate regulatoryregions, specifically hypersensitive regions 1–5 (HS1–5), together with a lesscharacterized site upstream of HBD. It can be seen that HBB and HBE1 showdifferent local ARS profiles, indicative of differences in use of regulatoryelements. The local ARS profile shows no peak in HL60 despite the existenceof a hypersensitive site when considering DNaseI profile alone. The full dataand local ARS profiles for all 20 cell lines and both genes are displayed in SIAppendix, Figs. S25–S29. (B) TAL1 locus. We identified TAL1 as statisticallysignificant with its maximal ARS in the K562 cell line across all tested ge-nomic segments. Local ARS profiles show a dominant effect from the +40enhancer region (green box), spanning PDZK1IP1. DHS signals across multi-ple cell types were correctly not detected to be associated with the expres-sion of TAL1. Furthermore, note that even though the DHS data for TAL1and PDZK1IP1 are largely overlapping, they nevertheless have distinct localARS profiles due to their different patterns of gene expression. This dem-onstrates that ARS is capable of separating interwoven signals across celltypes for neighboring genes, and that there is information to be gained bycombining DHS and gene-expression profiling. The full data for all 20 celllines and local ARS profiles are displayed in SI Appendix, Figs. S31–S33.

    Marstrand and Storey PNAS | Published online January 27, 2014 | E651

    STATIST

    ICS

    BIOPH

    YSICSAND

    COMPU

    TATIONALBIOLO

    GY

    PNASPL

    US

    Dow

    nloa

    ded

    by g

    uest

    on

    June

    24,

    202

    1

    http://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1312523111/-/DCSupplemental/sapp.pdfhttp://genome.ucsc.edu/ENCODE/http://genome.ucsc.edu/ENCODE/http://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1312523111/-/DCSupplemental/sapp.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1312523111/-/DCSupplemental/sapp.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1312523111/-/DCSupplemental/sapp.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1312523111/-/DCSupplemental/sapp.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1312523111/-/DCSupplemental/sapp.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1312523111/-/DCSupplemental/sapp.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1312523111/-/DCSupplemental/sapp.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1312523111/-/DCSupplemental/sapp.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1312523111/-/DCSupplemental/sapp.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1312523111/-/DCSupplemental/sapp.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1312523111/-/DCSupplemental/sapp.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1312523111/-/DCSupplemental/sapp.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1312523111/-/DCSupplemental/sapp.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1312523111/-/DCSupplemental/sapp.pdf

  • 11. All five sites were present in cell line K562 according to ourDHS data (see SI Appendix, Figs. S25–S29 for complete dataacross all 20 cell types, and SI Appendix, Fig. S30 for local ARSprofiles across all genes at this locus control region). Althoughthe DHS volume at these sites contributed to both HBE1 andHBB yielding statistically significant ARS values, the relativeimportance of HS1-5 differs significantly between these twogenes, clearly detected by the local ARS profiles (Fig. 5A).In the case of HBE1, we observed local ARS peaks for HS1

    at −6.1 kb and to a lesser extent HS3 and HS4 (−14.7 and −18kb, respectively). For HBB we observed similar local ARS pro-files for HS1, HS3, and HS4, and smaller local ARS values forHS2 (−10.9 kb). It has previously been shown that HS1 is a stablechromatin structure (37) throughout development and essentialfor HBE1 expression (39) due to a GATA-1 binding site, andHS2, 3, and 4 show synergistic enhancement of expression ofHBB by formation of the LCR holocomplex (40–42). Finally, theelement upstream of HBD has also been reported to specificallyenhance transcription of HBB (43). Although HS5 is present inthe DHS data for K562, similar open chromatin structures weredetected in other tissues. HS5 (−21 kb) is not in concordancewith tissue-specific gene expression of either HBE1 or HBB, anobservation in line with this site’s function as an insulator andCTCF-binding site (44).T-cell acute lymphocytic leukemia protein 1 (TAL1) encodes

    a basic helix–loop–helix protein, which is essential for the for-mation of all hematopoietic lineages (SI Appendix, Figs. S31–S33for all data across the 20 cell types). Previous studies usingchromatin structure, comparative sequence analysis, transfectionassays (45, 46), and transgenic mice (47–51) have identified atotal of five enhancers modulating the expression of TAL1. Wedetect TAL1 as significant with maximal cell line K562 across alltested genomic segments (from ±2.5 to ±200 kb) with the mostsignificant ARSmax occurring for ±50 kb. Further investigation bythe local ARS profile (Fig. 5B) showed that although proximalregulatory sites were correctly identified, the most dominant signalis by far confined to the +40 enhancer region and is an order ofmagnitude greater than other signals. Although the TAL1 +40region is downstream of PDZK1IP1, it was not linked to the ex-pression of this gene, which was detected as significant in HRCE.The +40 enhancer region has been shown to direct expressionin transgenic mice to primitive, but not definitive erythroblasts,such as the phenotype displayed by K562. This example dem-onstrates that our methodology is capable of identifying regionsof regulatory potential, which otherwise requires laborious ef-fort to annotate.Local ARS profiles showed both differences and similarities

    across genes as well as cell types. A few examples included:

    • CCR2 and CCR5 were significant for two different cell-types,HL60 and TH1, respectively (SI Appendix, Fig. S34).

    • Part of the HOX-cluster crucial for kidney development inmammals (HOXD8, HOXD4, and HOXD3) showed identicallocal ARS profiles (SI Appendix, Fig. S35), and all were sig-nificant genes in HRCE (52).

    • Another example of shared profiles, but across several celltypes instead of across several genes, was seen with LOXL2,a gene essential for biogenesis of connective tissue, which isdetected as an outlier in human skeletal muscle cells (SKMC)and has high relative ARS values in HAEpiC and BJ (skinfibroblast) (SI Appendix, Fig. S36). Further fine-scale investi-gation showed a solid overlap in the local ARS profiles (SIAppendix, Fig. S37).

    These observations point to a potentially widespread sharingof regulatory mechanisms both across genes and cell types.

    Web ResourceTo provide an interface for the community to use the resultsfrom this work, the local ARS profiles across any given gene inany of the 20 cell types can be calculated via our web service athttp://encode.princeton.edu/, where all results encompassing thelarger DHS regions are also searchable.

    DiscussionAs the epigenome in multicellular organisms is a dynamic entitywhose variation leads to reprogramming of gene expression (53),it is a likely candidate in the etiology of disease complementaryto that of mutations in DNA (54, 55). It is therefore of consid-erable interest to identify and characterize the regulatory regionscontributing to gene-expression variation with respect to a giv-en disease.We have presented a framework for quantifying relationships

    between chromatin structure and gene expression across multi-ple conditions (here, cell types), facilitating avenues for un-derstanding cellular responses by localizing and characterizingregions of regulatory potential. The local ARS profiles we in-troduced allow specific hypersensitive regions to be associatedwith condition-specific gene expression, thereby conferring con-textual regulatory information not obtainable using DHS dataalone. This effectively pinpoints a short list of primary candidatesfor further functional studies. We found the peaks from the localARS profiles in statistically significant segments to be both highlyconserved and enriched for known transcription-factor bindingsites as far as 200 kb from transcription start sites. Although be-yond the scope of the current work, we believe our approach couldbe used in conjunction with quantitative trait analyses to increasethe power for detecting true cis- and trans-acting SNPs by in-terfering with transcription factor binding sites, which in turn leadsto altered DHS signals in a similar manner as Degner et al. (56).As measurements from high-throughput sequencing platforms

    become commonplace in molecular biology, there will be anincreasing demand for the development of new statisticalapproaches for these data. A major challenge is that sequencingmeasurements are rarely in units directly relatable to one another;e.g., DHS measures chromatin accessibility, ChIP-seq measuresbinding affinity, RNA-seq measures RNA molecule abundance,etc. Our framework provides the initial development of a statisticthat captures relationships among these measurements and ena-bles statistical testing of associations among them. Moreover, byexploiting variation across multiple conditions, the sensitivity ofour approach should only increase with additional data andsources of variation. Hence, the presented framework can likely beapplied to test for associations between appropriate continuousquantitative genomic measurements and gene expression, therebyfacilitating a comparable basis for metaanalyses on the interplay ofepigenetic features.

    Materials and MethodsDHS and Gene-Expression Data. The data used in this study were generatedthrough the ENCODE consortium and are publicly available. Established celllines and primary cells used in this study were procured from commercial orother sources as listed in SI Appendix, Table S1. The cells were cultured as perthe vendor recommendations, and individual cell growth protocols areavailable in the University of California, Santa Cruz (UCSC) human genomebrowser. The DHS data are available at the UCSC genome browser bydownloading the track IDs listed in SI Appendix, Table S2 and the web ad-dress shown therein. Normalized probe-level expression data were obtainedfrom the Gene Expression Omnibus; the accession numbers for all arrays areshown in SI Appendix, Table S4. Probes were mapped to genes according toHG18 using bowtie (57) allowing for two mismatches and up to 10 maps tothe genome, including the best match. Only probe sets for which all probeshad a unique best match and fully corresponded to exon boundaries foundin RefSeq annotations (HG18) were retained for further analysis. If a RefSeqgene had multiple splice variants, these were aggregated to a metagenestructure. In the rare event that a gene mapped to currently ambiguous

    E652 | www.pnas.org/cgi/doi/10.1073/pnas.1312523111 Marstrand and Storey

    Dow

    nloa

    ded

    by g

    uest

    on

    June

    24,

    202

    1

    http://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1312523111/-/DCSupplemental/sapp.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1312523111/-/DCSupplemental/sapp.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1312523111/-/DCSupplemental/sapp.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1312523111/-/DCSupplemental/sapp.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1312523111/-/DCSupplemental/sapp.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1312523111/-/DCSupplemental/sapp.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1312523111/-/DCSupplemental/sapp.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1312523111/-/DCSupplemental/sapp.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1312523111/-/DCSupplemental/sapp.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1312523111/-/DCSupplemental/sapp.pdfhttp://encode.princeton.edu/http://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1312523111/-/DCSupplemental/sapp.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1312523111/-/DCSupplemental/sapp.pdfhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1312523111/-/DCSupplemental/sapp.pdfwww.pnas.org/cgi/doi/10.1073/pnas.1312523111

  • regions (e.g., chr6_random) such regions were not included. To arrive ata gene-specific expression value, the mean expression across all probe setswithin the exon boundaries of the gene model was calculated. This yieldedexpression measures for 19,215 genes on 20 cell lines.

    Statistical Methods. The ARS algorithm and statistical analyses were written inthe R programming language (58). The main ARS algorithm, results, GOanalyses, and preprocessed data are available at http://encode.princeton.edu/. Complete details of the ARS algorithm, including the null randomi-zation strategy and estimation of the angular penalty, are provided inSI Appendix.

    A schematic of the method is shown in Fig. 1. We represented themeasurements of a single gene by two paired vectors x ¼ ðx1, . . . ,xmÞ forgene expression and y ¼ ðy1, . . . ,ymÞ for DHS volume, wherem is the numberof cell types under consideration (here, m ¼ 20). To place the two variableson a common scale, each vector was scaled by its maximum observationxs ¼ xmaxfx1 ,...,xmg and ys ¼

    ymaxfy1 ,...,ymg so that all values are now in ½0,1�. Each

    vector was then centered by its median medðxsÞ and medðysÞ to formx* ¼ xs −medðxsÞ and y* ¼ ys −medðysÞ. Hence the data for a given geneand segment are now centered around the 2D medoid where the center ofmass of the data lies at the origin. If there is little variation across themultiple cell types, all points would cluster around the medoid, and singularcell-types displaying greater variation would be present as distinct outliers(SI Appendix, Figs. S8 and S9). To gauge potential outliers the Euclidean

    distances di ¼ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffix*2i þ y*2i

    qwere calculated for every cell type i ¼ 1, . . . ,m to

    produce the distance vector d ¼ ðd1, . . . ,dmÞ. We formed a ratio statisticaccording to ri ¼ dimedðdÞ, thereby quantifying the relative distance of eachpoint to the medoid.

    Although the ratios ri describe the dispersion of the data, it does notaccount for any concordance between the measurements. A perfectly con-cordant relationship between the two measurements would result in pointslying along the 45° diagonal identity line. We therefore calculated the angleθi for each data point ðx*i ,y*i Þ relative to the unit vector ð1,0Þ for i ¼ 1, . . . ,m,where 0≤ θi ≤ 360. The angular penalty involves first calculating the smallerof the two angular distances between θi and the identity line, denoted as Δi .For example, Δi ¼ j45− θi j for 0≤ θi < 135. The angular penalty is calculatedas ai ¼ expðc×ΔiÞ, where c≤ 0 and is determined empirically to satisfya correct null distribution (SI Appendix). Therefore, the value ai measures thepenalized angular distance of ðx*i ,y*i Þ from the identity line in a symmetricfashion (SI Appendix, Fig. S10). The statistic applied to each ðx*i ,y*i Þ pair isthen ARSi ¼ ai × ri , with the gene’s overall statistic being the maximum,ARSmax ¼ maxðARS1, ARS2,  . . . , ARSmÞ. In addition to calculating thesequantities for each gene, we also recorded the ordering of the cell types asdetermined by their relative ARSi values.

    Inclusion of the angular penalty had a twofold purpose. First, it correctlyeliminated points that were outliers in only one dimension, gene expressionor DHS alone, and therefore not of interest here because there is no directrelationship between the two measurements. Second, penalizing such pointsacted as a tuning parameter adjusting for the degree of off-diagonal noise inthe data, and thereby ensured a correct null distribution and P values (SIAppendix, Fig. S11). The specific value of c was determined such that ob-

    served null P values over ðp≥ 0:5Þ had a Uniform(0,1) distribution accordingto a Kolmogorov–Smirnov test (SI Appendix). Reassuringly, this lead tonearly identical values for c across all genomic segment sizes of DHS volumeconsidered (SI Appendix, Fig. S12).

    The scaled data xs and ys for all genes were aggregated into a singledistribution in the unit square ½0,1� × ½0,1�. From this, randomized data setswere created by sampling 20 points that preserves the fact that either onepoint must lie on ð1,1Þ or two points lie on ðx,1Þ and ð1,yÞ, respectively. The20 sampled points are then median centered and the ARSmax statistic iscalculated. We performed this 100 times to generate 100 sets of null ARSmaxstatistics for every gene (for a total of 100 × 19,215 null statistics). A P valuewas then formed for each gene by calculating the frequency by which nullstatistics exceed the observed statistic. The P values were then used to cal-culate FDR q values for the genes, as previously described (13). See SI Ap-pendix for full details on this randomization method.

    Selecting Local ARS Profile Peaks for Further Analysis.We first identified genescalled significant at FDR < 0.10 for the ARS analysis performed on the seg-ment size of ±200 kb about the TSS. We recorded the maximal cell type foreach of these genes (i.e., the cell type yielding the ARSmax value), producinga list of significant gene/cell-type pairs. We limited our selection of gene/cell-type pairs to those cell types that were maximal at this threshold for at least100 genes. For each of these selected gene/cell-type pairs, we scaled its localARS profile by the maximal value in the ±200 kb segment about the TSS. AllDNA sequences ±50 bp with scaled local ARS profile value >0.5 were thenselected as local ARS peaks. Likewise, all DNA sequences ±50 bp with scaledlocal ARS profile value

  • 24. Wright KL, et al. (1994) CCAAT box binding protein NF-Y facilitates in vivo re-cruitment of upstream DNA binding transcription factors. EMBO J 13(17):4042–4053.

    25. Lekstrom-Himes J, Xanthopoulos KG (1998) Biological role of the CCAAT/enhancer-binding protein family of transcription factors. J Biol Chem 273(44):28545–28548.

    26. Sapin V, Ward SJ, Bronner S, Chambon P, Dollé P (1997) Differential expression oftranscripts encoding retinoid binding proteins and retinoic acid receptors duringplacentation of the mouse. Dev Dyn 208(2):199–210.

    27. Sapin V, Dollé P, Hindelang C, Kastner P, Chambon P (1997) Defects of the chorio-allantoic placenta in mouse RXRalpha null fetuses. Dev Biol 191(1):29–41.

    28. Pohl BS, Knöchel W (2001) Overexpression of the transcriptional repressor FoxD3prevents neural crest formation in Xenopus embryos. Mech Dev 103(1-2):93–106.

    29. Murphy TL, Cleveland MG, Kulesza P, Magram J, Murphy KM (1995) Regulation ofinterleukin 12 p40 expression through an NF-kappa B half-site. Mol Cell Biol 15(10):5258–5267.

    30. Johnson AD, O’Donnell CJ (2009) An open access database of genome-wide associa-tion results. BMC Med Genet 10:6.

    31. Suzuki S, et al. (2007) A novel genetic marker for coronary spasm in women froma genome-wide single nucleotide polymorphism analysis. Pharmacogenet Genomics17(11):919–930.

    32. Bian Z, et al. (2009) Cellular repressor of E1A-stimulated genes attenuates cardiachypertrophy and fibrosis. J Cell Mol Med 13(7):1302–1313.

    33. Eyers CE, et al. (2005) The phosphorylation of CapZ-interacting protein (CapZIP)by stress-activated protein kinases triggers its dissociation from CapZ. Biochem J389(Pt 1):127–135.

    34. Pyle WG, La Rotta G, de Tombe PP, Sumandea MP, Solaro RJ (2006) Control of cardiacmyofilament activation and PKC-betaII signaling through the actin capping protein,CapZ. J Mol Cell Cardiol 41(3):537–543.

    35. Ammirati E, et al. (2008) Expansion of T-cell receptor zeta dim effector T cells in acutecoronary syndromes. Arterioscler Thromb Vasc Biol 28(12):2305–2311.

    36. Tuan D, Solomon W, Li Q, London IM (1985) The “beta-like-globin” gene domain inhuman erythroid cells. Proc Natl Acad Sci USA 82(19):6384–6388.

    37. Forrester WC, Thompson C, Elder JT, Groudine M (1986) A developmentally stablechromatin structure in the human beta-globin gene cluster. Proc Natl Acad Sci USA83(5):1359–1363.

    38. Grosveld F, et al. (1990) The dominant control region of the human beta-globin do-main. Ann N Y Acad Sci 612:152–159.

    39. Shimotsuma M, Okamura E, Matsuzaki H, Fukamizu A, Tanimoto K (2010) DNase Ihypersensitivity and epsilon-globin transcriptional enhancement are separable in lo-cus control region (LCR) HS1 mutant human beta-globin YAC transgenic mice. J BiolChem 285(19):14495–14503.

    40. Fraser P, Hurst J, Collis P, Grosveld F (1990) DNaseI hypersensitive sites 1, 2 and 3 of thehuman beta-globin dominant control region direct position-independent expression.Nucleic Acids Res 18(12):3503–3508.

    41. Molete JM, et al. (2001) Sequences flanking hypersensitive sites of the beta-globinlocus control region are required for synergistic enhancement. Mol Cell Biol 21(9):2969–2980.

    42. Tolhuis B, Palstra RJ, Splinter E, Grosveld F, de Laat W (2002) Looping and interactionbetween hypersensitive sites in the active beta-globin locus. Mol Cell 10(6):1453–1465.

    43. Acuto S, et al. (1996) An element upstream from the human delta-globin-encodinggene specifically enhances beta-globin reporter gene expression in murine eryth-roleukemia cells. Gene 168(2):237–241.

    44. Tanimoto K, et al. (2003) Human beta-globin locus control region HS5 contains CTCF-and developmental stage-dependent enhancer-blocking activity in erythroid cells.Mol Cell Biol 23(24):8946–8952.

    45. Fordham JL, Göttgens B, McLaughlin F, Green AR (1999) Chromatin structure andtranscriptional regulation of the stem cell leukaemia (SCL) gene in mast cells. Leu-kemia 13(5):750–759.

    46. Göttgens B, et al. (2002) Establishing the transcriptional programme for blood: TheSCL stem cell enhancer is regulated by a multiprotein complex containing Ets andGATA factors. EMBO J 21(12):3039–3050.

    47. Chapman MA, et al. (2003) Comparative and functional analyses of LYL1 loci establishmarsupial sequences as a model for phylogenetic footprinting. Genomics 81(3):249–259.

    48. Göttgens B, et al. (2001) Long-range comparison of human and mouse SCL loci: Lo-calized regions of sensitivity to restriction endonucleases correspond precisely withpeaks of conserved noncoding sequences. Genome Res 11(1):87–97.

    49. Göttgens B, et al. (2002) Transcriptional regulation of the stem cell leukemia gene(SCL)—Comparative analysis of five vertebrate SCL loci. Genome Res 12(5):749–759.

    50. Sánchez M, et al. (1999) An SCL 3′ enhancer targets developing endothelium togetherwith embryonic and adult haematopoietic progenitors. Development 126(17):3891–3904.

    51. Sinclair AM, et al. (1999) Distinct 5′ SCL enhancers direct transcription to developingbrain, spinal cord, and endothelium: Neural expression is mediated by GATA factorbinding sites. Dev Biol 209(1):128–142.

    52. Di-Poï N, Zákány J, Duboule D (2007) Distinct roles and regulations for HoxD genes inmetanephric kidney development. PLoS Genet 3(12):e232.

    53. Wong AH, Gottesman II, Petronis A (2005) Phenotypic differences in geneticallyidentical organisms: The epigenetic perspective. Hum Mol Genet 14(Spec No 1):R11–R18.

    54. Schneider E, et al. (2010) Spatial, temporal and interindividual epigenetic variationof functionally important DNA methylation patterns. Nucleic Acids Res 38(12):3880–3890.

    55. Hatchwell E, Greally JM (2007) The potential role of epigenomic dysregulation incomplex human disease. Trends Genet 23(11):588–595.

    56. Degner JF, et al. (2012) DNase I sensitivity QTLs are a major determinant of humanexpression variation. Nature 482(7385):390–394.

    57. Langmead B, Trapnell C, Pop M, Salzberg SL (2009) Ultrafast and memory-efficientalignment of short DNA sequences to the human genome. Genome Biol 10(3):R25.

    58. R Development Core Team (2009) R: A Language and Environment for StatisticalComputing (R Foundation for Statistical Computing, Vienna).

    E654 | www.pnas.org/cgi/doi/10.1073/pnas.1312523111 Marstrand and Storey

    Dow

    nloa

    ded

    by g

    uest

    on

    June

    24,

    202

    1

    www.pnas.org/cgi/doi/10.1073/pnas.1312523111