Supplementary Materials For - Cancer Discovery



Supplementary Materials For

Frequent alterations and epigenetic silencing of differentiation pathway genes in structurally rearranged liposarcomas

Barry S. Taylor, Penelope L. DeCarolis, Christina V. Angeles, Fabienne Brenet, Nikolaus Schultz, Cristina R. Antonescu, Joseph M. Scandura, Chris Sander, Agnes J. Viale, Nicholas D. Socci, Samuel Singer

Supplementary Methods

Alignment. All reads were aligned to the reference human genome (NCBI build 36.1, hg18). Mate-paired and methylation sequence reads were aligned with the ABI Bioscope seeded extension mapper (ver. 1.2). Exome reads were aligned either with Bioscope or with BWA (1) to allow gaps for small indel detection, as described below. RNA sequencing reads were aligned with the Bioscope whole-transcriptome pipeline. For all experiments on each of the four samples, both fragment and mate-pair reads mapping to the reference genome with sufficient quality were converted to SAM format (2) for subsequent analyses and for visualization in the Integrative Genomics Viewer (3).

DNA copy number from whole-genome sequence. Copy-number alterations were assessed in the whole-genome sequence with the SegSeq algorithm (4) (w=400, a=100, b=10) using only the forward reads of mate pairs that mapped uniquely to the genome in each of the tumor/normal pairs. Duplicate reads aligned to unique genome positions were not excluded. Previous simulations with fragment libraries of 50bp reads indicate that ~2.55Gb (~82.8% of the genome) is mappable assuming an edit distance of two. Therefore, this value was used as the alignable portion of the genome for this analysis; it is likely a conservative estimate for empirical data aligned with progressive mapping (see alignment details in Methods, main text). We adapted RAE (5), an algorithm originally developed to detect copy number alterations from array data (aCGH or SNP arrays), to the analysis of whole-genome sequencing data (Fig. S9). The two samples were individually parameterized, and predicted alterations were identified with the adapted multi-component model as low-level gain (A0 ≥ 0.9), high-level amplification (A0 ≥ 0.9 and A1 > 0.25), heterozygous loss (D0 ≥ 0.9), and homozygous deletion (D0 ≥ 0.9 and D1 ≥ 0.9).

We additionally co-hybridized source tumor DNA from each sample to Agilent 244K array comparative genomic hybridization (aCGH) microarrays with a pool of reference normal DNA according to the manufacturer’s instructions (Agilent Technologies, Wilmington, DE). Raw data were obtained and normalized as previously described (6). Probe-level data were segmented with Circular Binary Segmentation and analyzed with the original implementation of RAE, both as previously described (5, 7).
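As an illustration of the copy-number call thresholds quoted above, the following minimal sketch labels a single segment from its four RAE scores. The function name, the order in which the rules are tested, and the "no call" label are assumptions made for this example; they are not taken from the RAE implementation.

```python
def classify_segment(a0, a1, d0, d1):
    """Label one segment from its RAE scores using the thresholds in the text.

    a0/a1 are the gain and amplification scores, d0/d1 the loss and
    deletion scores. Rule order and the fallback label are assumptions.
    """
    # Test the two-score calls first so that they are not absorbed
    # by the corresponding single-score call.
    if d0 >= 0.9 and d1 >= 0.9:
        return "homozygous deletion"
    if a0 >= 0.9 and a1 > 0.25:
        return "high-level amplification"
    if d0 >= 0.9:
        return "heterozygous loss"
    if a0 >= 0.9:
        return "low-level gain"
    return "no call"
```

For example, `classify_segment(0.95, 0.10, 0.0, 0.0)` yields a low-level gain because A1 does not exceed 0.25.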


Unsupervised hierarchical clustering of copy number alterations in the larger set of DLPS used for sample selection was performed as previously described (8).

Structural rearrangement detection. The non-redundant mate pairs (excluding duplicate mate pairs) aligned to the reference genome for each sample were first classified into groups based on the alignment position (strand and orientation) and distance separating paired reads. This distance was based on the empirical distribution of the insert sizes estimated from the alignment of each sample. A putative rearrangement, focusing here on intra- and inter-chromosomal rearrangements and associated aberrations, was defined as an event supported by a cluster of multiple atypically paired reads in a tumor sample that were lacking from its corresponding matched normal. We excluded both singleton non-overlapping atypical mates and clusters of overlapping atypical mates where the mate chromosome and position were inconsistent. All remaining structurally atypical mates (inter-chromosomal or intra-chromosomal indicating inversion or not) were processed with the GASV algorithm (9), pairing matched normal and tumor data to filter mates indicative of a breakpoint in the normal sample and therefore to determine only somatic rearrangements. Candidate rearrangements were excluded if either breakpoint (a) overlapped a previously characterized structural variant in normal populations (as described in ref. (6)) or in previous individual genome sequencing (10-17), (b) appeared within 1Mb of a sequence gap, or (c) was supported by fewer than 5 atypical reads. The most common repetitive elements adjacent to rearrangement breakpoints were Alu, L1, L2, and low-complexity sequence (data not shown). The copy number at each breakpoint was determined to be the extreme copy number segmentation value (calculated as described above) that overlapped the breakpoint or adjacent sequence (5’ and 3’ sequence equal in length to the breakpoint). Rearrangements were encoded as mixed-type complex (Table S2) if their origin was either ambiguous or combinatorial, or if major and minor annotated alteration types both exceeded 25% of the supporting reads. The approach described here is fairly stringent, designed to minimize the number of false positive events.
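The three exclusion criteria for candidate rearrangements can be sketched as a simple filter. This is only an illustration under stated assumptions: the `Breakpoint` and `Rearrangement` records, the precomputed `overlaps_known_sv` flag, the gap list format, and the choice to count supporting reads per event are all hypothetical and are not part of the GASV-based pipeline described above.

```python
from dataclasses import dataclass

GAP_MARGIN_BP = 1_000_000   # "within 1Mb of a sequence gap"
MIN_ATYPICAL_READS = 5      # events with fewer supporting reads are excluded


@dataclass
class Breakpoint:
    chrom: str
    pos: int
    overlaps_known_sv: bool  # hypothetical flag, computed elsewhere


@dataclass
class Rearrangement:
    bp1: Breakpoint
    bp2: Breakpoint
    supporting_reads: int    # assumed to be counted per event


def near_gap(bp, gaps):
    """True if bp lies within GAP_MARGIN_BP of any (chrom, start, end) gap."""
    return any(
        chrom == bp.chrom
        and start - GAP_MARGIN_BP <= bp.pos <= end + GAP_MARGIN_BP
        for chrom, start, end in gaps
    )


def keep(rearr, gaps):
    """Apply criteria (a)-(c); reject if either breakpoint fails."""
    if rearr.supporting_reads < MIN_ATYPICAL_READS:
        return False
    for bp in (rearr.bp1, rearr.bp2):
        if bp.overlaps_known_sv or near_gap(bp, gaps):
            return False
    return True
```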

To assess the accuracy of rearrangement discovery, we performed validation with array-based copy number data generated on the same samples, an approach similar to that of the 1000 Genomes Consortium (18). Here, we focused on rearrangements associated with a copy number alteration (CNA; 97.6% of total) and tested for the presence of the breakpoints in segmentation of aCGH data. This is predicated on the fact that CNAs arise from double-stranded DNA breaks and should therefore harbor an associated rearrangement (cluster of structurally atypical mates at its breakpoints) from mate-paired sequencing. Nevertheless, the resolution of the low-pass whole-genome sequencing data is much greater than that of the accompanying aCGH platform (near-base resolution versus 235,829 probes separated by a median of ~9kb of intervening genome sequence), so we limited this analysis to those rearrangements whose associated segment of CNA (sequence-based, see above) spanned 3 or more probes of the corresponding array design. The rationale for this criterion is that a CNA smaller than can be reasonably detected with fixed-resolution aCGH would lack sufficient data for validation. Additionally, for rearrangement breakpoints that fell in a gap of aCGH segmentation (a region of breakpoint ambiguity between two adjacent probes that marks the end and start of the 5’ and 3’ adjacent segments respectively), the rearrangement was assigned to the adjacent segment of extreme copy number. We considered a rearrangement partially or completely concordant if one or both breakpoints, respectively, agreed with the breakpoints called from aCGH and the rearrangement was associated with the same CNA. Otherwise, the event was considered discordant and a likely false positive. In total, 91.3% of rearrangements had complete concordance (both breakpoints) with array-based data, which corresponds to an estimated FDR of 8.7%. For balanced rearrangements (~2.4% of those detected here), we assume the false positive and negative rates were higher because of the low depth-of-coverage of the genome sequencing.

To determine if rearrangement breakpoints were over-represented in genic regions, we compared the number of observed genic breakpoints to a distribution of random rearrangements. We generated a set of random rearrangement breakpoints conditioned on both the size distribution of the observed breakpoints and their chromosome of origin. Because nearly all of the somatic rearrangements were associated with a CNA, this distribution is likely to reflect the overall distribution of rearrangements. In total, we performed 10,000 permutations, each producing an expected count of rearrangements in which one or both randomized breakpoints fall within the genomic footprint of a gene. We calculated an empirical p-value for the enrichment of breakpoints in genes by comparing these counts to the observed number.
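The permutation scheme described above can be sketched as follows. This is a simplified illustration, not the analysis code: it handles only intra-chromosomal events given as `(chrom, start, end)`, places each randomized event uniformly along its chromosome, and uses the common add-one form of the empirical p-value. All function and parameter names are hypothetical.

```python
import random


def in_gene(pos, genes):
    """True if pos falls inside any (start, end) gene footprint."""
    return any(start <= pos <= end for start, end in genes)


def genic_count(rearrangements, genes_by_chrom):
    """Number of rearrangements with one or both breakpoints inside a gene."""
    total = 0
    for chrom, a, b in rearrangements:
        genes = genes_by_chrom.get(chrom, [])
        if in_gene(a, genes) or in_gene(b, genes):
            total += 1
    return total


def randomize(rearrangements, chrom_len, rng):
    """Move each event to a random position, keeping its chromosome and size."""
    out = []
    for chrom, a, b in rearrangements:
        size = b - a
        start = rng.randrange(0, chrom_len[chrom] - size)
        out.append((chrom, start, start + size))
    return out


def empirical_p(rearrangements, genes_by_chrom, chrom_len, n_perm=10_000, seed=0):
    """Fraction of permutations with at least as many genic events as observed."""
    rng = random.Random(seed)
    observed = genic_count(rearrangements, genes_by_chrom)
    hits = sum(
        genic_count(randomize(rearrangements, chrom_len, rng), genes_by_chrom) >= observed
        for _ in range(n_perm)
    )
    # Add-one correction (an assumption here) keeps the estimate above zero.
    return (hits + 1) / (n_perm + 1)
```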

Mutation detection. Single-nucleotide variants were determined from exome and RNA sequencing reads in regions of sufficient coverage. We first excluded all degenerate reads from further analysis, defined here as any read from either experiment whose chromosome, start position, strand, and color-space sequence matched another aligned read. For exome and RNA data, base quality recalibration, variant detection, and variant annotation were performed with the GATK framework (19, 20). Specifically, after base quality recalibration for color-space reads, variant detection in exome data was performed with the UnifiedGenotyper. For high-coverage exome experiments, variants were excluded if their variant quality was <30, their genotype quality was <5, or if they were associated with either homopolymer runs or excessive strand bias. Novel variants, those not previously identified in either dbSNP ver. 130 (excluding overlap with COSMIC ver. 48) or 1000genomes (18), were required to be derived from base-space reads not duplicated from non-duplicate color-space reads, to not reside exclusively in higher-error base positions (positions 38-50), and to have evidence of the variant allele in reads mapping to both strands. Candidate somatic mutations were those with a variant genotype in the tumor and a reference genotype in the normal sample, with minimum coverage of 10 and 6 reads, respectively. Additionally, we required that the tumor variant frequency be ≥10% and that each variant be detected in 4 or more tumor reads.
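The final coverage and frequency requirements for candidate somatic mutations can be written as a small predicate. This is a minimal sketch of the last filtering step only; the function name and argument names are hypothetical, and the upstream GATK quality filters are not reproduced.

```python
MIN_TUMOR_DEPTH = 10      # minimum tumor coverage
MIN_NORMAL_DEPTH = 6      # minimum normal coverage
MIN_TUMOR_ALT_READS = 4   # variant seen in 4 or more tumor reads
MIN_TUMOR_ALT_FRAC = 0.10 # tumor variant frequency of at least 10%


def is_candidate_somatic(tumor_depth, tumor_alt_reads, normal_depth, normal_is_reference):
    """True if a site passes the coverage and frequency thresholds in the text."""
    if tumor_depth < MIN_TUMOR_DEPTH or normal_depth < MIN_NORMAL_DEPTH:
        return False
    if not normal_is_reference:
        # The matched normal must carry the reference genotype.
        return False
    if tumor_alt_reads < MIN_TUMOR_ALT_READS:
        return False
    return tumor_alt_reads / tumor_depth >= MIN_TUMOR_ALT_FRAC
```

Note that both the read-count and the frequency conditions are needed: at 100x tumor coverage, 4 variant reads satisfy the count threshold but give a frequency of only 4%.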

Our pipeline for small insertion and deletion (indel) detection was as follows. Gapped realignment of exome sequencing reads was performed with BWA. The alignment output was sorted and duplicate reads removed with the Picard pipeline and BAM files created and indexed with Samtools. Interval detection, local realignment, indel genotyping, and post-processing were performed with the GATK framework after base quality recalibration, as described above. Retained indels were those with sufficient quality and coverage and not associated with homopolymer runs of 5bp or greater.


In the case of non-uniform coverage RNA experiments (the consequence of variable transcript expression levels), the UnifiedGenotyper was run in low-pass mode with more permissive quality thresholds and without a strand bias requirement (as the SOLiD RNA sequencing protocol is strand-specific). Final candidate variants were determined similarly to above. We also determined candidate sites of A>I(G) RNA editing from the high-coverage and high-quality A>G variants detected from RNA sequence data. Specifically, candidates were stringently filtered to require, from cDNA data, normal and tumor variant frequencies of <10% and ≥10% respectively at coverage levels in DNA data of 10x or greater in both matched normal and tumor exomes and were identified as either homozygous A or called as G at exome frequencies <5%.

All novel candidate somatic and loss-of-heterozygosity variants were annotated for their coding sequence context and effect. The effect of individual mutations on protein function was assessed computationally using a combination of evolutionary information from protein-family sequence alignment and residue position in known or homology-deduced three-dimensional protein and complex structures (B. Reva, Y. Antipin, C. Sander, submitted; http://mutationassessor.org/). For the purposes of calculating exome-wide somatic mutation rates, we included only those bases considered covered, that is, those with at least 10 and 6 reads overlapping the position in the tumor and normal samples, respectively (~31.6Mb per sample), after pre-processing and de-duplication.

Experimental validation of candidate somatic mutations. Genomic DNA was extracted from tumor samples using standard procedures and used for PCR amplification with 454 tails (F:GCCTCCCTCGCGCCATCAG and R:GCCTTGCCAGCCCGCTCAG, primers listed in Table S7). For selected regions, PCR was performed using the KAPA HiFi HotStart DNA polymerase. Between 1500 and 2000 reads per amplicon were generated using a 454 FLX platform (Roche). Alignment and variant detection were carried out both with ssahaSNP (Sanger Institute; http://www.sanger.ac.uk/resources/software/ssahaSNP/) and with a pipeline that combined BWA alignment, Picard tools, and VarScan (21) variant detection. Confirmed mutations were those called by both pipelines. The recurrence rate of mutations in HDAC1, MAPKAP1, PTPN9, and DAZAP2 was determined with pooled 454 sequencing. In total, 4224 amplicons, covering all exons in each gene in each of 96 samples, were multiplexed (12 samples pooled and barcoded in each of an 8-region plate) and sequenced according to the manufacturer’s instructions. Detected mutations (see pipeline above) were manually reviewed to exclude sequence artifacts and were confirmed somatic with Sanger sequencing of matched normal blood or fat (primers; Table S7).

Transcript and allele-specific expression. Gene expression was estimated from read densities per sample as reads per kilobase of exon model per million mapped reads (rpkM) excluding 3’ UTRs (22, 23). These expression estimates are based on scaled library sizes according to TMM normalization (24) in which the reference was the average of the two normal samples.
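For reference, the rpkM measure named above reduces to a one-line calculation. The sketch below uses the raw number of mapped reads as the library size; the TMM scaling of library sizes mentioned in the text is not implemented, and the function and argument names are illustrative only.

```python
def rpkm(read_count, exon_length_bp, total_mapped_reads):
    """Reads per kilobase of exon model per million mapped reads.

    exon_length_bp is the summed exon length of the gene model (here it
    would exclude 3' UTRs); total_mapped_reads is the library size, which
    the text additionally scales by TMM normalization (not shown).
    """
    return read_count * 1e9 / (exon_length_bp * total_mapped_reads)
```

For example, 500 reads on a 2 kb exon model in a library of 10 million mapped reads corresponds to an rpkM of 25.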

We determined allele-specific expression, the preferential expression of one allele over the other in a tumor-specific manner, for transcripts in which we detected heterozygous SNPs either genotyped by HapMap3 (25) or characterized in the 1000genomes phase I release (18). For these known exonic heterozygous SNPs with 10x coverage or greater in both the tumor and matched normal samples, we assessed the significance of allelic ratio differences between tumor and normal by Fisher exact test. After multiple hypothesis correction (26), genes with FDR < 25% in one of the two tumors studied here were considered to have significant allele-specific expression.
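The per-SNP test of allelic ratio differences can be illustrated with a self-contained two-sided Fisher exact test on a 2x2 table of read counts (tumor versus normal, reference versus alternative allele). This is a sketch under assumptions: the table layout and the two-sided definition (summing all tables no more probable than the observed one) are choices made here, and the multiple hypothesis correction step is not shown.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Assumed layout: rows are tumor and normal, columns are reads
    supporting the reference and the alternative allele.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x reads in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed
    # one; the small tolerance guards against floating-point ties.
    total = sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)
```

A tumor with 8 reference and 2 alternative reads against a normal with 1 reference and 5 alternative reads would be called as `fisher_exact_two_sided(8, 2, 1, 5)`.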

For small RNA analysis (27) we considered only those microRNAs with read coverage in ≥50% of samples sequenced. Log-clone counts across microRNAs and samples were quantile normalized for expression analysis. For the analysis of global expression changes in predicted targets of miR-193b in samples with concurrent small RNA sequencing and methylation status (n=23), RNA was hybridized to HG-U133A oligonucleotide arrays (Affymetrix) as previously described (28). Expression probe-sets were estimated with robust multi-array average (RMA) and expression differences between DLPS or WLPS tumors and normal adipose tissues were determined with an empirical Bayes approach (29). Log fold-change differences were converted to empirical cumulative distribution functions and their differences were tested with a one-sided Kolmogorov-Smirnov test.

Identifying and validating fusion transcripts. For structural rearrangements detected by mate-paired sequencing data (see above), we required that fusion candidates be validated by RNA sequencing. A fusion candidate was defined as a DNA rearrangement in which both breakpoints fell within the protein-coding locus of a RefSeq transcript. A reference sequence database was generated to include all possible fusion junctions between adjacent upstream and downstream exons of each of a pair of breakpoints in the two affected transcripts in all possible orientations. Alignment to the fusion junction reference database was performed using SHRiMP (30) with a spaced seed (1111001111), only a single tolerated seed match per window, and default Smith-Waterman scoring parameters. Alignments were converted to SAM format, duplicate reads were removed, and both in- and out-of-frame candidate chimeras were reported as expressed if the orientation was consistent with the DNA rearrangement and supported by three or more RNA fusion junction reads that did not also align to a wildtype sequence of the human transcriptome.
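To make the spaced seed (1111001111) concrete, the sketch below shows how such a seed tolerates mismatches at its two "0" positions while requiring exact matches at the eight "1" positions. It is a base-space toy and not SHRiMP itself: the actual reads here are color-space, and SHRiMP follows seed hits with Smith-Waterman alignment. The function name and inputs are hypothetical.

```python
SEED = "1111001111"  # spaced seed quoted in the text


def seed_hits(read, reference, seed=SEED):
    """Offsets in reference where the start of read matches under the seed.

    Only positions marked '1' in the seed must match; positions marked
    '0' are ignored. The read must be at least as long as the seed.
    """
    care = [i for i, flag in enumerate(seed) if flag == "1"]
    span = len(seed)
    window = read[:span]
    hits = []
    for offset in range(len(reference) - span + 1):
        if all(reference[offset + i] == window[i] for i in care):
            hits.append(offset)
    return hits
```

For instance, a read beginning `ACGTGGGTAC` still seeds at a reference site reading `ACGTACGTAC`, because the two differing bases fall on the seed's "0" positions.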
To validate candidate fusion transcripts, reverse transcription PCR (RT-PCR) was performed with primers (Table S7) targeting the novel exon-exon junctions of each using the Qiagen One-Step RT-PCR kit according to the manufacturer’s instructions.

Methylation analysis. Peaks of methylation were computed (from Bioscope mapping, reads uniquely aligning at lengths ≥45bp) by a convolution function based on the distribution of fragment sizes, described in detail elsewhere (31). For the analysis of differential methylation, we first normalized methylation read counts in the tumors by their intrinsic copy number calculated from whole-genome sequencing (see above). To do so, we defined a disjoint set of breakpoints (adaptive bins) generated from the boundaries of methylation peaks (methylation signal >2) determined in each sample (autosomal regions number ~249K, median size of 266bp; 34-904bp, 10-90th percentile). In these regions we mapped segmental copy number determined from SegSeq analysis of whole-genome sequencing data (described above), and assigned methylation read counts as the number of overlapping reads extended by a fragment size sampled randomly from the known fragment size distribution (31). To these we fit a robust linear model of the log segment mean (Cr, ratio of tumor to normal sequencing reads) to the log ratio of methylation read counts (Mr) in the tumor and normal sample (mt and mn respectively) plus a pseudo count (ρ = 0.5). The resulting fit was used to estimate the methylation read count in the tumor with the intrinsic copy number removed:

$$n(m_t) = e^{M_r - (\hat{\alpha} + \hat{\beta} C_r)} \times m_n + \rho$$

Differential methylation between a given sample pair or between normal and tumor samples was determined for all regions from these normalized methylation read counts using an over-dispersed negative binomial model (32). Regions of significant differential methylation were those with a false discovery rate (FDR) of <1%.

Methylation and CEBPA validation. The validation of specific methylation events was performed with bisulfite conversion and pyrosequencing. Specifically, bisulfite conversion and cleanup of genomic DNA was performed using the Epitect Bisulfite Kit (Qiagen). Selected regions were amplified by Pyromark PCR using primers (Table S7) designed using Pyromark Assay Design 2.0. PCR products were then analyzed on a PSQ HS 96A using the standard protocol.

The DDLS8817 cell line was established as described in the main text. Primary human adipocyte-derived stem cells (ASCs) were isolated as previously described using subcutaneous fat tissue samples from consenting patients (33). Cell lines were maintained in DMEM HG:F12 supplemented with 10% heat-inactivated fetal bovine serum (FBS), 1% penicillin/streptomycin (Invitrogen, Carlsbad, CA), and kept at 37 °C in 5% carbon dioxide. Cell lines were treated with decitabine (5-aza-2'-deoxycytidine) for 48 hours followed by 24 hours of treatment with the histone deacetylase inhibitor SAHA. DNA content in cell lines was estimated using the CyQuant Cell Proliferation Kit (Molecular Probes, Eugene, OR) and the Spectramax M2 fluorescence microplate reader (Molecular Devices, Sunnyvale, CA) at 480/520nm excitation/emission, as per the manufacturer's instructions. Evaluation of apoptosis was carried out using the Guava PCA (Guava Technologies) on cells stained with Annexin V: PE Apoptosis Kit (BD Biosciences) per manufacturer's instructions. Reverse transcription was performed using random hexamer priming and TaqMan reverse transcription reagents (Applied Biosystems, Foster City, CA). Quantitative real-time PCR (qPCR) using TaqMan Gene Expression Assays (Applied Biosystems) was done on the ABI Prism 7900HT Sequence Detection System and analyzed using SDS version 2.1 software (Applied Biosystems). The expression of genes was determined relative to the endogenous control 18S rRNA using ΔΔCT analysis. DDLS8817 tumor cells were subcutaneously implanted into SCID mice and treated with decitabine 4 mg/kg daily for 5 days with or without SAHA 50 mg/kg biweekly.

Genome annotation. Unless otherwise specified, all annotations correspond to the hg18 build of the human genome. Promoter classification was based on the transcription start sites inferred from all isoforms of RefSeq genes. Human promoters were considered to span -0.5kb to +2kb of transcription start sites and were classified as high, intermediate, or low-CpG content (HCP, ICP, or LCP respectively) based on their combination of G+C fraction and their observed to expected ratio of CpGs as described in ref. (34).
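The two quantities behind the promoter classes, G+C fraction and the observed-to-expected CpG ratio, can be computed as in the sketch below. The cut-offs are left as parameters; their default values are illustrative placeholders that are not stated in this document and should be checked against ref. (34). Applying the rule to a whole sequence rather than to sliding windows is also a simplification assumed here.

```python
def cpg_stats(seq):
    """Return (G+C fraction, observed/expected CpG ratio) for a DNA sequence."""
    seq = seq.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    cpg = seq.count("CG")
    gc_fraction = (c + g) / n
    # Expected CpG count under independence is (C count * G count) / length.
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_fraction, obs_exp


def classify_promoter(seq, hcp_oe=0.75, hcp_gc=0.55, lcp_oe=0.48):
    """Assign HCP, ICP, or LCP; threshold defaults are assumptions."""
    gc_fraction, obs_exp = cpg_stats(seq)
    if obs_exp > hcp_oe and gc_fraction > hcp_gc:
        return "HCP"
    if obs_exp < lcp_oe:
        return "LCP"
    return "ICP"
```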


Supplementary Results

Tumor sample selection and histology. Tumor samples were selected on the basis of several criteria including the features of their genome inferred from aCGH analysis (see main text). Histology review indicated that both tumors had a dedifferentiated component of >50%. Both also had a heterogeneous appearance of dedifferentiated liposarcoma composed mainly of areas resembling high-grade pleomorphic malignant fibrous histiocytoma. DLPS1 had areas of low-grade fibrosarcoma-like cellularity with areas resembling desmoid-like and other areas resembling keloid collagen solitary fibrous tumor (SFT). Additional features of aggressive behavior included direct invasion into the renal parenchyma and areas of necrosis. Areas of well-differentiated liposarcoma, both sclerosing and lipoma-like, were also observed. DLPS2 had areas of focally high-grade myxofibrosarcoma as well as areas of sclerosing well-differentiated liposarcoma. Only high-grade areas were analyzed genomically.

Mutant genes. Among genes found mutated here (Table 1, main text) and in recent large-scale efforts or in whole-genome sequencing of individual tumors are CADM2, which is structurally aberrant in a subset of prostate cancers (35) and mutated in renal carcinomas (36); MAPKAP1 in a small-cell lung cancer genome (37); XIRP2, mutated in a metastatic melanoma genome (38); and both SACS and PTPN9, mutated in multiple cancer types (39). Other genes found mutant here are part of previously described translocations, including PDE4DIP, which is the fusion partner of PDGFRB in a subset of myeloproliferative disorders (40). Finally, specific mutations such as those in HDAC1, MLK4, and DAZAP2 appear to reflect dysregulation in key pathways in DLPS. MLK4 encodes a MAP kinase that may activate JNK signaling upstream of JUN, whose gene amplification mediates an adipogenic block through C/EBPβ repression (41).
DAZAP2 functions in both Wnt and TGFβ signaling, and HDAC1 is described in more detail in the main text.

Regulatory mutations. RNA sequencing data, in addition to cross-validating abnormalities also visible genomically (e.g. coding mutations and rearrangements producing transcriptional chimeras), were used to explore substitutions in non-coding regions of genes that may affect their expression. Among these, we found and confirmed two somatic mutations in the 3’ UTRs of MAP3K4 (373T>C) and RAB11FIP2 (3879A>G) respectively (Table 1, main text). Neither mutation created a premature cleavage or polyadenylation signal. However, because endogenous small non-coding RNAs direct the repression of target mRNAs through their pairing with small cis-regulatory microRNA seed sequences (42), we investigated whether these mutations affected the binding of a microRNA to the 3’UTR of the respective genes. In fact, both mutations altered a seed sequence (for miR-495 in MAP3K4 and miR-155 in RAB11FIP2; Fig. 2C, main text). Interestingly, from small RNA sequencing we found that both microRNAs are expressed in an independent cohort of normal adipose tissue, well-differentiated liposarcomas (WLPS), and DLPS (see Supplementary Methods). Consistent with the concept that microRNA target site mutations may allow escape from microRNA-mediated repression, the expression of both MAP3K4 and RAB11FIP2 was elevated in tumor relative to the matched normal adipose tissue, as measured from RNA sequencing data (Fig. 2C, main text). Neither of these specific sites was mutated in 96 additional tumors, indicating these are rare mutations in liposarcoma. Nevertheless, this does not preclude the possibility of mutations affecting another of the conserved microRNA target sites complementary to either miR-495 or miR-155 in the respective 3’ UTRs. Our findings indicate a possible functional role for somatic target site mutations releasing MAP3K4 and RAB11FIP2 from repression, and the resulting over-expression may be oncogenic. Nevertheless, additional studies will be necessary to confirm and extend these data.

Allele-specific expression. We investigated allele-specific expression (ASE), focusing here on identifying sites of differential ASE between normal and tumor transcriptomes. Using heterozygous variants among known coding SNPs (see Supplementary Methods), we identified 32 genes with a significant change in ASE in the tumor transcriptomes (Table S5). We explored the origins of ASE in these tumors by integrating results from concurrent DNA sequencing. Previously, we showed that DLPS genomes have few regions of recurrent copy-neutral loss-of-heterozygosity (LOH) (43), so this is likely an uncommon source of ASE in DLPS. Instead, 37% of all ASE events identified here appear to result from allele-specific copy number amplification. While most of these events amplified the alternative allele as opposed to the reference allele (71% of those attributed to allele-specific CNA), this was not significant in a binomial model that assumes the background probability a given allele is amplified is fundamentally random (pr = 0.5) (Table S5). Interestingly, ASE resulting from a single allele-specific genomic amplification was found for three genes within or directly adjacent to the CDK4 amplicon on 12q13.3-14.1 (Fig. 2D, main text). These were the Rho/RAC exchange factor GEFT, the endoplasmic reticulum lectin OS9, and the methyltransferase-like METTL1.
We have previously shown that METTL1, when amplified, is essential for the proliferation of dedifferentiated liposarcoma cells (43). An additional methyltransferase-like gene, METTL3, was identified as having ASE, though unlike METTL1, expression was homozygous for the alternative allele in the normal and heterozygous in the tumor. Copy-neutral examples of tumor-specific ASE appeared in imprinted genes, including the maternally expressed GNAS, and in genes otherwise subject to chromosome X inactivation in the female patient, such as BGN and TIMP1; these predominantly expressed only the reference allele in the tumor, whereas both alleles were expressed in the matched normal. These observations indicate the clonality of the tumor’s genotype in the female patient.

RNA editing. We explored ADAR-catalyzed adenosine-to-inosine [A>I(G)] RNA editing in these patients by mining the cDNA data for A>G transitions present as homozygous A alleles in exome data (see Supplementary Methods). We identified evidence of editing in 51 genes across the tumor and normal samples (Table S6), but few examples of somatic editing. Other edit sites that appeared in one of the two samples in a pair (typically the normal adipose sample) but were absent from the other could be attributed to insufficient coverage in the other sample. These preliminary data indicate that A>I editing contributes little somatic transcriptome diversity in liposarcomas. If RNA editing does contribute to somatic mutations in tumor transcriptomes, this is likely lineage- or tumor-type-specific. We noted, however, that ADAR was expressed at only modest levels in the normal samples and reduced in the tumors (average of ~1.4-fold). Lower catalytic activity of the deaminase may therefore reduce A>I editing levels transcriptome-wide in these liposarcomas.

Global patterns of DNA methylation in liposarcoma genomes. We found that the MBD-based method for methylation sequencing used here was highly specific, sequencing ~24% ± 0.7 of all unique CpG dinucleotides in the genome at an average ~16x coverage in all samples with modest sequencing output. Conversely, this approach is unlikely to detect methylation in the non-CpG context or significant methylation in regions of particularly low CpG density (31).

Having generated peaks of methylation from read distributions (Supplementary Methods), we found that the pattern of methylation across sequence features was stable between cell types and between patients (Fig. S5). Globally, we did not observe wholesale re-patterning of DNA methylation between adipose and liposarcoma genomes, which appear highly concordant macroscopically (Fig. S6). Cytosine methylation was highly correlated between individual patients (Spearman ρ = 0.84; methylation signal in 10,000 randomly selected regions of normal adipose tissue genomes), and between normal and tumor tissue in a given patient (ρ = 0.77). Nevertheless, clear differences exist. We therefore investigated differences in local methylation between the tumor and matched normal genomes. As expected from an affinity/enrichment-based assay, methylation signal was affected by genomic amplifications in these tumors. Within a copy number amplicon, however, considerable heterogeneity of methylation existed and was not entirely attributed to the tumor’s intrinsic copy number (Fig. S8). Therefore, we normalized methylation signal in the tumors by their copy number using an adaptation of RAE (5) to support the analysis of copy number from whole-genome sequencing (Fig. S9). Using the normalized data, we identified persistently differentially methylated regions (DMRs) between tumor and normal genomes (FDR < 1%, see Supplementary Methods).
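The copy-number normalization of methylation read counts referred to above (and defined in the Supplementary Methods) can be sketched as follows. Two points are assumptions made for this illustration: the log ratio Mr is taken as log((mt + ρ)/(mn + ρ)), and an ordinary least-squares line stands in for the robust linear fit used in the analysis. Function names are hypothetical.

```python
import math

RHO = 0.5  # pseudo count given in the Supplementary Methods


def fit_line(x, y):
    """Ordinary least-squares fit y = alpha + beta * x.

    Stand-in for the robust linear model described in the Methods.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    beta = sxy / sxx
    return mean_y - beta * mean_x, beta


def normalize(m_t, m_n, c_r):
    """Copy-number-adjusted tumor methylation counts, one per region.

    m_t, m_n: tumor and normal methylation read counts per region.
    c_r: log segment mean (tumor/normal copy-number ratio) per region.
    """
    # Assumed form of M_r: log ratio of pseudo-counted read counts.
    m_r = [math.log((t + RHO) / (n + RHO)) for t, n in zip(m_t, m_n)]
    alpha, beta = fit_line(c_r, m_r)
    # n(m_t) = exp(M_r - (alpha + beta * C_r)) * m_n + rho
    return [
        math.exp(mr - (alpha + beta * cr)) * n + RHO
        for mr, cr, n in zip(m_r, c_r, m_n)
    ]
```

When the tumor/normal methylation log ratio is explained entirely by copy number, the residual term vanishes and each adjusted count reduces to the normal count plus the pseudo count.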

Hypermethylated DMRs constituted 70% of methylation changes between each individual tumor and its matched normal, and fully 97% of changes that were recurrent across the two tumors. Hypomethylated DMRs therefore accounted for only 3% of recurrent DMRs, which may indicate that hypomethylation is a more transient event than hypermethylation. Additionally, regions of statistically significant somatic gain or loss of methylation appeared to arise across diverse sequence features in a highly non-random fashion (Fig. 3A, main text). We identified a significant loss of methylation in satellite repeat regions in these liposarcomas relative to their normal adipose genomes, a pattern similar to that observed in the methylomes of another sarcoma, malignant peripheral nerve sheath tumors (44). In light of recent observations of high satellite repeat expression in human cancers (45), it is tempting to hypothesize that, across a broad range of epithelial and mesenchymal malignancies, this overexpression is due to a systematic loss of methylation like that observed here.
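
As a rough illustration of how recurrent DMRs and their hypermethylated fraction might be tallied, the sketch below treats a DMR as recurrent when it overlaps a DMR of the same direction in the other tumor. This overlap criterion, the tuple layout, and the function names are assumptions made for the example; the text above does not specify how recurrence was defined.

```python
# Hypothetical DMR record: (chromosome, start, end, direction), half-open
# coordinates, direction is "hyper" or "hypo" relative to the matched normal.


def overlaps(a, b):
    """True if two DMRs share a chromosome and at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def recurrent_dmrs(dmrs_a, dmrs_b):
    """DMRs from tumor A that overlap a same-direction DMR in tumor B."""
    return [
        d for d in dmrs_a
        if any(overlaps(d, e) and d[3] == e[3] for e in dmrs_b)
    ]


def hyper_fraction(dmrs):
    """Fraction of DMRs that are hypermethylated (0.0 for an empty list)."""
    if not dmrs:
        return 0.0
    return sum(1 for d in dmrs if d[3] == "hyper") / len(dmrs)


# Toy data, invented for the example.
tumor1 = [
    ("chr1", 100, 200, "hyper"),
    ("chr1", 500, 600, "hypo"),
    ("chr2", 100, 200, "hyper"),
    ("chr3", 0, 50, "hypo"),
]
tumor2 = [
    ("chr1", 150, 250, "hyper"),
    ("chr1", 550, 650, "hyper"),  # overlaps a tumor1 DMR, opposite direction
    ("chr2", 90, 110, "hyper"),
]

shared = recurrent_dmrs(tumor1, tumor2)
per_tumor_fraction = hyper_fraction(tumor1)
recurrent_fraction = hyper_fraction(shared)
```

A quadratic scan like this is adequate for small lists; genome-scale DMR sets would normally call for sorted intervals or an interval tree.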

In core promoters, both normal and tumor samples had greater methylation levels in intermediate- and low-content CpG promoters than in those high in CpG content, and methylation peaked downstream of the transcription start site in the first exon (Fig. S7). This pattern of increased intragenic methylation is consistent with prior reports and has been linked to transcriptional silencing (31, 46).
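
For readers unfamiliar with promoter CpG classes, the sketch below shows one way to assign a promoter sequence to a high-, intermediate-, or low-CpG class from its observed/expected CpG ratio and GC content. The thresholds, the whole-sequence simplification (rather than sliding windows), and the function names are illustrative assumptions and are not taken from this study.

```python
# Illustrative thresholds only (assumptions, not values from the source).
HCP_MIN_OE = 0.75   # minimum observed/expected CpG ratio for HCP
HCP_MIN_GC = 0.55   # minimum GC fraction for HCP
LCP_MAX_OE = 0.48   # observed/expected CpG ratio below which a promoter is LCP


def cpg_obs_exp(seq):
    """Observed/expected CpG ratio: (#CpG * length) / (#C * #G)."""
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c * g)


def gc_content(seq):
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("C") + seq.count("G")) / len(seq)


def classify_promoter(seq):
    """Return 'HCP', 'ICP', or 'LCP' for a promoter sequence."""
    oe = cpg_obs_exp(seq)
    if oe > HCP_MIN_OE and gc_content(seq) > HCP_MIN_GC:
        return "HCP"
    if oe < LCP_MAX_OE:
        return "LCP"
    return "ICP"


# Toy sequences, far shorter than a real promoter.
classes = [classify_promoter(s) for s in ("CGCGCGCG", "ACGTATAT", "ATATATAT")]
```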

Supplementary Figures

Figure S1: Sample selection and histology. A. Thirty-seven dedifferentiated liposarcomas were classified by unsupervised clustering of DNA copy-number alterations inferred from segmentation of Agilent 244K array comparative genomic hybridization (aCGH) data. This indicates that the samples sequenced here had profiles representative of the bulk of DLPS. H&E images from both DLPS1 and DLPS2 indicate the dedifferentiated component predominated (>50% with typical heterogeneous appearance). DLPS1 had regions resembling high-grade pleomorphic malignant fibrous histiocytoma (MFH) (B) and regions resembling low-grade fibrosarcoma, including desmoid-like (C) and keloid collagen, SFT-like (D). H&E images of DLPS2 indicated regions resembling high-grade pleomorphic MFH (E), focally high-grade myxofibrosarcoma (F-G), and areas of well-differentiated liposarcoma, sclerosing type (H).

Figure S2: Sequencing output. Summarized here is the output from sequencing of the (A) genome, (B) transcriptome, (C) methylome, and (D) exome in each sample.

Figure S3: RT-PCR validation of candidate fusion transcripts. RT-PCR of RNA across the rearranged exon-exon junction implied by the DNA rearrangement and supported by RNA sequencing reads confirmed 5 of 7 candidate fusion transcripts.

Figure S4: Copy number alterations in exome and genome sequencing. Coverage differences between tumor and matched normal samples in exons captured during exome sequencing indicate their underlying copy number. While most exons (gray) had similar coverage between paired samples, reflecting their diploid copy number (e.g., JUN, as indicated), many had coverage increases in the tumor that reflect copy number amplification (e.g., other indicated genes). It is also possible to resolve intragenic amplicons or partial amplification, as with exons 10-11 and 14 of MDM1 (gold), which are amplified by a complex discontinuous amplicon, while the remaining MDM1 exons had variable coverage but were diploid in copy number. On the right, copy number inferred from segmentation of the ratio of tumor-to-normal read counts from whole-genome sequencing (x-axis, see Methods) is compared with the corresponding ratio of exon coverage levels from exome capture. Most exons are diploid (gray), but a subset are amplified, and for these the relationship between exome and genome data is linear (blue). A fraction of exons are not linear in copy number between genome and exome sequencing (green), perhaps reflecting capture variability due to underlying sequence complexity.

Figure S5: Stable patterns of methylation among sequence features. Here, the fraction of peaks of methylation found in each of a set of sequence contexts indicates a stable profile between patients and between tumor and normal samples.

Figure S6: Tumor and normal methylomes. The density of methylation between tumor and normal pairs (blue and green, respectively) across the genome (chromosomes labeled, centromeres in red) in (A) the primary tumor DLPS1 and (B) the recurrent tumor DLPS2.

Figure S7: Methylation in the core human promoter. The density of methylation reads (y-axis) is indicated in each of the tumor and normal samples across the core human promoter (x-axis; TSS, transcription start site). Inset: the fraction of all human promoters with high, intermediate, or low CpG content (HCP, ICP, and LCP, respectively) and those in which differentially methylated regions were identified.

Figure S8: Heterogeneous methylation levels within a copy number amplicon. A, The relationship between segmented copy number and methylation, both ratios of tumor-to-normal read counts, in regions of overt genomic amplification (copy number ratio >2) is dosage-dependent, yet shows substantial variability. A randomly chosen amplicon is highlighted (12q13.3-q14.1; black, boxed in green) in which heterogeneous methylation levels (inset) were observed. B, The highlighted amplicon (panel A) is shown in greater detail, indicating that while this locus was amplified at a constant level, methylation in the tumor (blue; matched normal in yellow) is variably elevated.

Figure S9: Copy number analysis of whole-genome sequence data. The parameterization of copy number states (parameters as indicated) was performed based on the frequency distribution of autosomal segmentation of whole-genome sequence data in tumor and normal samples (black). Dotted and solid blue lines represent deletions and losses respectively, while red solid and dashed lines are low- and high-level gains respectively. Parameterization was performed as described in ref. (5), adapted to non-log genome sequence data with noise values as indicated [derivative noise (DN) and full-width at half-maximum indicated as solid black bars, wide and narrow respectively]. The accompanying standard parameterization of Agilent 244K aCGH data for the same samples is provided (inset).

Figure S10: Methylation and expression of CEBPA in DDLS8817 cells. A. Pyrosequencing of bisulfite-converted DNA indicates significant CEBPA promoter methylation in DDLS8817 cells, which is reduced by 5-aza but not reduced further upon the addition of SAHA. B. CEBPA expression levels were determined by RT-PCR and normalized to the corresponding level in untreated cells. CEBPA expression levels were low in the fully methylated cell line, induced 3-fold in the presence of 5-aza alone, but induced 19-fold after the addition of SAHA, attaining a level similar to that in undifferentiated adipose stem cells (ASC, gray).

Supplementary Tables

Table S1: Clinical characteristics of sequenced patient tumors.
Table S2: Somatic rearrangements in liposarcoma genomes.
Table S3: Expressed gene fusions.
Table S4: Somatic mutations in liposarcoma.
Table S5: Allele-specific expression in liposarcoma transcriptomes.
Table S6: Adenosine-to-inosine RNA editing events.
Table S7: Primers used for mutation and methylation validation.

Supplied as separate files.

Supplementary References

1. Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009;25:1754-60.

2. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009;25:2078-9.

3. Robinson JT, Thorvaldsdottir H, Winckler W, Guttman M, Lander ES, Getz G, et al. Integrative genomics viewer. Nat Biotechnol. 2011;29:24-6.

4. Chiang DY, Getz G, Jaffe DB, O'Kelly MJ, Zhao X, Carter SL, et al. High-resolution mapping of copy-number alterations with massively parallel sequencing. Nat Methods. 2009;6:99-103.

5. Taylor BS, Barretina J, Socci ND, Decarolis P, Ladanyi M, Meyerson M, et al. Functional copy-number alterations in cancer. PLoS One. 2008;3:e3179.

6. Cancer Genome Atlas Research Network. Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature. 2008;455:1061-8.

7. Venkatraman ES, Olshen AB. A faster circular binary segmentation algorithm for the analysis of array CGH data. Bioinformatics. 2007;23:657-63.

8. Taylor BS, Schultz N, Hieronymus H, Gopalan A, Xiao Y, Carver BS, et al. Integrative genomic profiling of human prostate cancer. Cancer Cell. 2010;18:11-22.

9. Sindi S, Helman E, Bashir A, Raphael BJ. A geometric approach for classification and comparison of structural variants. Bioinformatics. 2009;25:i222-30.

10. Ahn SM, Kim TH, Lee S, Kim D, Ghang H, Kim DS, et al. The first Korean genome sequence and analysis: full genome sequencing for a socio-ethnic group. Genome Res. 2009;19:1622-9.

11. Bentley DR, Balasubramanian S, Swerdlow HP, Smith GP, Milton J, Brown CG, et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature. 2008;456:53-9.

12. Kidd JM, Cooper GM, Donahue WF, Hayden HS, Sampas N, Graves T, et al. Mapping and sequencing of structural variation from eight human genomes. Nature. 2008;453:56-64.

13. Kim JI, Ju YS, Park H, Kim S, Lee S, Yi JH, et al. A highly annotated whole-genome sequence of a Korean individual. Nature. 2009;460:1011-5.

14. Korbel JO, Urban AE, Affourtit JP, Godwin B, Grubert F, Simons JF, et al. Paired-end mapping reveals extensive structural variation in the human genome. Science. 2007;318:420-6.

15. Levy S, Sutton G, Ng PC, Feuk L, Halpern AL, Walenz BP, et al. The diploid genome sequence of an individual human. PLoS Biol. 2007;5:e254.

16. Wang J, Wang W, Li R, Li Y, Tian G, Goodman L, et al. The diploid genome sequence of an Asian individual. Nature. 2008;456:60-5.

17. Wheeler DA, Srinivasan M, Egholm M, Shen Y, Chen L, McGuire A, et al. The complete genome of an individual by massively parallel DNA sequencing. Nature. 2008;452:872-6.

18. Durbin RM, Abecasis GR, Altshuler DL, Auton A, Brooks LD, Gibbs RA, et al. A map of human genome variation from population-scale sequencing. Nature. 2010;467:1061-73.

19. Depristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011;43:491-8.

20. McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010;20:1297-303.

21. Koboldt DC, Chen K, Wylie T, Larson DE, McLellan MD, Mardis ER, et al. VarScan: variant detection in massively parallel sequencing of individual and pooled samples. Bioinformatics. 2009;25:2283-5.

22. Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods. 2008;5:621-8.

23. Ramskold D, Wang ET, Burge CB, Sandberg R. An abundance of ubiquitously expressed genes revealed by tissue transcriptome sequence data. PLoS Comput Biol. 2009;5:e1000598.

24. Robinson MD, Oshlack A. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol. 2010;11:R25.

25. Altshuler DM, Gibbs RA, Peltonen L, Dermitzakis E, Schaffner SF, Yu F, et al. Integrating common and rare genetic variation in diverse human populations. Nature. 2010;467:52-8.

26. Benjamini Y, Hochberg Y. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society Series B (Methodological). 1995;57:289-300.

27. Ugras S, Brill ER, Jacobsen A, Hafner M, Socci N, Decarolis PL, et al. Small RNA sequencing and functional characterization reveals microRNA-143 tumor suppressor activity in liposarcoma. Cancer Res. 2011.

28. Singer S, Socci ND, Ambrosini G, Sambol E, Decarolis P, Wu Y, et al. Gene expression profiling of liposarcoma identifies distinct biological types/subtypes and potential therapeutic targets in well-differentiated and dedifferentiated liposarcoma. Cancer Res. 2007;67:6626-36.

29. Smyth GK. Linear models and empirical bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol. 2004;3:Article3.

30. Rumble SM, Lacroute P, Dalca AV, Fiume M, Sidow A, Brudno M. SHRiMP: accurate mapping of short color-space reads. PLoS Comput Biol. 2009;5:e1000386.

31. Brenet F, Moh M, Funk P, Feierstein E, Viale AJ, Socci ND, et al. DNA methylation of the first exon is tightly linked to transcriptional silencing. PLoS One. 2011;6:e14524.

32. Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2009.

33. Gimble J, Guilak F. Adipose-derived adult stem cells: isolation, characterization, and differentiation potential. Cytotherapy. 2003;5:362-9.

34. Mikkelsen TS, Ku M, Jaffe DB, Issac B, Lieberman E, Giannoukos G, et al. Genome-wide maps of chromatin state in pluripotent and lineage-committed cells. Nature. 2007;448:553-60.

35. Berger MF, Lawrence MS, Demichelis F, Drier Y, Cibulskis K, Sivachenko AY, et al. The genomic complexity of primary human prostate cancer. Nature. 2011;470:214-20.

36. Dalgliesh GL, Furge K, Greenman C, Chen L, Bignell G, Butler A, et al. Systematic sequencing of renal carcinoma reveals inactivation of histone modifying genes. Nature. 2010;463:360-3.

37. Pleasance ED, Stephens PJ, O'Meara S, McBride DJ, Meynert A, Jones D, et al. A small-cell lung cancer genome with complex signatures of tobacco exposure. Nature. 2010;463:184-90.

38. Pleasance ED, Cheetham RK, Stephens PJ, McBride DJ, Humphray SJ, Greenman CD, et al. A comprehensive catalogue of somatic mutations from a human cancer genome. Nature. 2010;463:191-6.

39. Forbes SA, Bhamra G, Bamford S, Dawson E, Kok C, Clements J, et al. The Catalogue of Somatic Mutations in Cancer (COSMIC). Curr Protoc Hum Genet. 2008;Chapter 10:Unit 10 1.

40. Wilkinson K, Velloso ER, Lopes LF, Lee C, Aster JC, Shipp MA, et al. Cloning of the t(1;5)(q23;q33) in a myeloproliferative disorder associated with eosinophilia: involvement of PDGFRB and response to imatinib. Blood. 2003;102:4187-90.

41. Mariani O, Brennetot C, Coindre JM, Gruel N, Ganem C, Delattre O, et al. JUN oncogene amplification and overexpression block adipocytic differentiation in highly aggressive sarcomas. Cancer Cell. 2007;11:361-74.

42. Bartel DP. MicroRNAs: target recognition and regulatory functions. Cell. 2009;136:215-33.

43. Barretina J, Taylor BS, Banerji S, Ramos AH, Lagos-Quintana M, Decarolis PL, et al. Subtype-specific genomic alterations define new targets for soft-tissue sarcoma therapy. Nat Genet. 2010;42:715-21.

44. Feber A, Wilson GA, Zhang L, Presneau N, Idowu B, Down TA, et al. Comparative methylome analysis of benign and malignant peripheral nerve sheath tumors. Genome Res. 2011;21:515-24.

45. Ting DT, Lipson D, Paul S, Brannigan BW, Akhavanfard S, Coffman EJ, et al. Aberrant overexpression of satellite repeats in pancreatic and other epithelial cancers. Science. 2011;331:593-6.

46. Maunakea AK, Nagarajan RP, Bilenky M, Ballinger TJ, D'Souza C, Fouse SD, et al. Conserved role of intragenic DNA methylation in regulating alternative promoters. Nature. 2010;466:253-7.