STAC: A multi-experiment method for analyzing array-based genomic copy number data
Sharon J. Diskin, Thomas Eck, Joel P. Greshock, Yael P. Mosse, Tara Naylor, Christian J. Stoeckert, Jr., Barbara L. Weber, John M. Maris, Gregory R. Grant
University of Pennsylvania
Children’s Hospital of Philadelphia
MGED 8 Meeting
Bergen, Norway
September 11-13, 2005
Background
Gain and loss of chromosomal DNA occurs in many cancers
Regions of recurrent gain or loss contain genes critical to the genesis and/or progression of cancer
Accurate identification of such regions is essential for prioritizing follow-up efforts
Array Comparative Genomic Hybridization (aCGH) is a method for detecting genomic copy number variation on a genome-wide scale with high resolution
Platforms: BAC, cDNA, ROMA, Affymetrix SNP chips, Agilent technology
[Figure: samples (rows) plotted along chromosome 8]
Researchers traditionally rely on a simple frequency threshold to identify “significant” regions of gain/loss
This is followed by tedious manual review of the regions to define boundaries
This process is time-consuming at best, lacks statistical control, is subject to investigator bias, and may miss essential regions
Selecting significant aberrations across samples
Research Goal
Develop a statistical method for assessing the significance of consistent copy number aberrations across multiple samples
Validate this method using known biology and comparison to traditional methods
Example Data and Terminology
A location is a fixed-width stretch of genomic DNA (e.g., 1 Mb)
Experiments/samples are plotted along the vertical axis; one per row
A sequence of one or more aberrant locations is called an aberrant interval
We call a set of intervals for a given sample a profile for that sample
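The terminology above can be made concrete with a minimal Python sketch. It assumes each sample is given as a list of binary calls, one per location (1 = aberrant, 0 = no change); the function name `aberrant_intervals` and the 0-based, end-inclusive index convention are illustrative choices, not part of STAC.

```python
def aberrant_intervals(calls):
    """Return the profile of one sample as a list of (start, end) pairs.

    Each pair is a maximal run of aberrant locations (value 1),
    using 0-based indices with an inclusive end.
    """
    intervals = []
    start = None
    for i, call in enumerate(calls):
        if call == 1 and start is None:
            start = i                      # a new aberrant interval opens here
        elif call != 1 and start is not None:
            intervals.append((start, i - 1))  # the interval closed at the previous location
            start = None
    if start is not None:                  # interval runs to the end of the stretch
        intervals.append((start, len(calls) - 1))
    return intervals


# One sample with two aberrant intervals: locations 1-2 and location 5.
profile = aberrant_intervals([0, 1, 1, 0, 0, 1, 0])
print(profile)
```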
The Problem
Find locations which have more intervals (gains/losses) covering them than would be expected by chance
True underlying aberration rate is unknown
Take the observed aberrations as given and test for the significance of consistent aberrations across samples
Statistical Approach
Null Model: observed intervals of aberration are equally likely to occur anywhere in the stretch of the genome being considered
General Approach:
(1) Choose an appropriate statistic
(2) Apply a permutation procedure under the null model to estimate a null distribution of the statistic
(3) Assess the (multiple testing corrected) significance of observed values of the statistic by comparing to the null distribution
Permutation: random rearrangement of intervals within each profile
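A rough Python sketch of the permutation step, using the frequency statistic as the example. Several things here are assumptions rather than details from the slides: intervals are rearranged by shuffling them among the non-aberrant locations of the same profile (so rearranged intervals never overlap but may end up adjacent), and multiple testing is handled by comparing each observed frequency to the null distribution of the maximum frequency over all locations. All function names are hypothetical.

```python
import random


def permute_profile(lengths, n_locations, rng):
    """Randomly rearrange one profile's intervals (given by their lengths).

    Assumption: intervals and single non-aberrant locations are shuffled
    as blocks, so interval lengths are preserved and intervals never overlap.
    """
    n_normal = n_locations - sum(lengths)
    if n_normal < 0:
        raise ValueError("intervals do not fit in the stretch")
    blocks = list(lengths) + [0] * n_normal   # 0 marks one non-aberrant location
    rng.shuffle(blocks)
    calls = []
    for block in blocks:
        calls.extend([1] * block if block > 0 else [0])
    return calls


def max_frequency(matrix):
    """Largest number of samples that are aberrant at any single location."""
    return max(sum(column) for column in zip(*matrix))


def null_max_frequencies(profiles, n_locations, n_perm, seed=0):
    """Null distribution of the maximum frequency over n_perm permutations."""
    rng = random.Random(seed)
    null = []
    for _ in range(n_perm):
        permuted = [permute_profile(lengths, n_locations, rng) for lengths in profiles]
        null.append(max_frequency(permuted))
    return null


def p_value(observed, null):
    """Fraction of null values at least as large as the observed value (add-one corrected)."""
    return (1 + sum(v >= observed for v in null)) / (1 + len(null))


# Three samples, each with interval lengths listed, over 20 locations.
profiles = [[3], [2, 4], [5]]
null = null_max_frequencies(profiles, n_locations=20, n_perm=1000)
print(p_value(3, null))   # significance of a location covered in all 3 samples
```

Using the maximum over locations is one common way to obtain a corrected p-value from a permutation test; the slides do not say which correction STAC applies.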
Frequency statistic results
freq = 9
Need statistic sensitive to tight alignment, even if the aberration is not significantly frequent
The footprint statistic
Stack: set S of aligned intervals containing at most one interval per profile and with at least one location common to all intervals
Footprint:
F(S) = the number of locations c such that c is contained in some interval of stack S
• In practice, F(S) is normalized by its expected value:
NF(S) = F(S)/E(F(S))
Null Distributions: Find the minimal NF(S) for each (sample) subset size using a heuristic search
• use distributions to assign (multiple testing corrected) p-values to locations (details omitted)
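The stack and footprint definitions above can be illustrated with a short Python sketch. It assumes intervals are (start, end) pairs with an inclusive end, one per selected profile; the function names are hypothetical, and the expected footprint E(F(S)) is simply passed in as a number because the slides omit how it is computed.

```python
def is_stack(intervals):
    """True if the intervals share at least one common location.

    Assumes the caller supplies at most one interval per profile.
    """
    latest_start = max(start for start, _ in intervals)
    earliest_end = min(end for _, end in intervals)
    return latest_start <= earliest_end


def footprint(intervals):
    """F(S): number of locations contained in some interval of the stack."""
    covered = set()
    for start, end in intervals:
        covered.update(range(start, end + 1))
    return len(covered)


def normalized_footprint(intervals, expected):
    """NF(S) = F(S) / E(F(S)); `expected` is supplied by the caller (assumption)."""
    return footprint(intervals) / expected


tight = [(4, 6), (5, 7), (5, 6)]     # tightly aligned intervals
loose = [(0, 6), (5, 12), (3, 9)]    # same common location, spread-out ends
print(is_stack(tight), footprint(tight))
print(is_stack(loose), footprint(loose))
```

Both example stacks contain three intervals, but the tightly aligned one has the much smaller footprint, which is the property the statistic is meant to reward.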
Footprint statistic results
footprint statistic coupled with search strategy reveals locations significantly consistent within subsets
p-value = 0.0001 p-value = 0.0050
INPUT:
matrix of binary gain/no change (or loss/no change) calls for each location along a chromosome arm
OUTPUT: for each location along the chromosome arm:
a) the best stack covering that location
b) two p-values for that location (one for each statistic)
STAC Algorithm Specification
Sample        chr1:1-1000000    chr1:1000001-2000000
1712DZ1T10    0                 1
1714DZ1T10    0                 1
. . .         . . .             . . .
Validation Data
UPenn BAC Array (Greshock et al. 2004, Gen. Res.)
• ~4,200 BAC Clones
1. 69% BAC end sequenced
2. 28% STS Mapping
3. 3% Full BAC Sequence
• Spacing: ~0.91 Mb (chrs 1-X)
aCGH BAC Coverage (chr13)
Publicly available data sets:
42 Neuroblastoma cell lines (Mosse et al. 2005, Genes Chr Cancer)
47 Primary sporadic breast tumors (Naylor et al. 2005, submitted)
Traditional Processing – Many Samples
1. Define regions of aberration for each sample
2. Determine frequency of aberration at each location
3. Threshold frequency (e.g., NBL = 25%, Breast = 30%)
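The traditional frequency-threshold procedure can be sketched in a few lines of Python. This assumes a binary samples-by-locations matrix of calls and treats a location as passing when its frequency is greater than or equal to the threshold; the function name and the inclusive comparison are illustrative assumptions.

```python
def frequent_locations(matrix, threshold):
    """Return indices of locations whose aberration frequency meets the threshold.

    matrix: list of rows, one per sample, each a list of 0/1 calls per location.
    threshold: fraction of samples, e.g. 0.25 for 25%.
    """
    n_samples = len(matrix)
    frequencies = [sum(column) / n_samples for column in zip(*matrix)]
    return [i for i, freq in enumerate(frequencies) if freq >= threshold]


calls = [
    [0, 1, 1, 0, 0],
    [0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 0, 1],
]
# Frequencies per location are 0.25, 0.75, 0.5, 0.0 and 0.25.
print(frequent_locations(calls, 0.5))
```

Note that this procedure only ranks locations by count; it does not attach a significance level to any of them.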
[Figure: Example Common Regions of Aberration, with frequency labels 90%, 70%, 90%, 60%]
Breast Cancer
92% (11/12) gain regions
85% (11/13) loss regions
also
86% (47/55) of the gains (suppl. data)
Average p-value, gain: 0.00549
Average p-value, loss: 0.00899
Boundaries differ by < 1 Mb on average and in several cases are narrowed by STAC
Validation
Neuroblastoma
83% (19/22) gain regions
100% (12/12) loss regions
Average p-value, gain: 0.00447
Average p-value, loss: 0.00719
STAC identifies prognostically relevant regions in neuroblastoma. Shown: MYCN amplification at 2p24.
2p gain
Additional Regions Identified
Neuroblastoma
94 Gains covering 341 Mb
80 Losses covering 305 Mb
Breast Cancer
149 Gains covering 525 Mb
124 Losses covering 384 Mb
Regions segregate with known biology
Neuroblastoma
Cell Lines
646 Mb of significant locations scored (gain, loss, no change)
Agglomerative hierarchical, Pearson correlation, complete linkage
Evidence for 2 sample clusters
- Cluster 1 characterized by pattern of loss
- Cluster 2 characterized by pattern of gain
* missed by traditional method
Future Plans
Release stand alone Java version of STAC
Extend STAC to account for high-level gains and homozygous deletions
Extend STAC to handle stacks with 2 or more intervals per profile (co-occurring aberrations)
http://www.cbil.upenn.edu/STAC