STAC: A multi-experiment method for analyzing array-based genomic copy number data

Sharon J. Diskin, Thomas Eck, Joel P. Greshock, Yael P. Mosse, Tara Naylor, Christian J. Stoeckert, Jr., Barbara L. Weber, John M. Maris, Gregory R. Grant

University of Pennsylvania

Children’s Hospital of Philadelphia

MGED 8 Meeting

Bergen, Norway

September 11-13, 2005

Background

Gain and loss of chromosomal DNA occur in many cancers

Regions of recurrent gain or loss contain genes critical to the genesis and/or progression of cancer

Accurate identification of such regions is essential for prioritizing follow-up efforts

Array Comparative Genomic Hybridization (aCGH) is a method for detecting genomic copy number variation on a genome-wide scale with high resolution

Platforms: BAC, cDNA, ROMA, Affymetrix SNP chips, Agilent technology

[Figure: Samples vs. Chromosome 8]

Researchers traditionally rely on a simple frequency threshold to identify “significant” regions of gain/loss

This is followed by tedious manual review of the regions to define boundaries

This process is time-consuming at best, lacks statistical control, is subject to investigator bias, and may miss essential regions

Selecting significant aberrations across samples

Research Goal

Develop a statistical method for assessing the significance of consistent copy number aberrations across multiple samples

Validate this method using known biology and comparison to traditional methods

Example Data and Terminology

A location is a fixed-width stretch of genomic DNA (e.g., 1 Mb)

Experiments/samples are plotted along the vertical axis; one per row

A sequence of one or more aberrant locations is called an aberrant interval

We call a set of intervals for a given sample a profile for that sample
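The terminology above can be made concrete with a minimal sketch. It assumes each sample is stored as a list of binary calls, one per location; the names `aberrant_intervals`, `samples`, and `profiles` and the example data are hypothetical, not part of STAC.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end) location indices, end exclusive


def aberrant_intervals(calls: List[int]) -> List[Interval]:
    """Return maximal runs of aberrant (1) locations as (start, end) pairs."""
    intervals = []
    start = None
    for i, call in enumerate(calls):
        if call and start is None:
            start = i  # an aberrant interval opens here
        elif not call and start is not None:
            intervals.append((start, i))  # the interval closed at i - 1
            start = None
    if start is not None:
        intervals.append((start, len(calls)))
    return intervals


# One row per sample; each entry is one fixed-width location (made-up data).
samples = {
    "sample_A": [0, 1, 1, 0, 0, 1, 0],
    "sample_B": [0, 0, 1, 1, 1, 0, 0],
}

# A profile is the set of aberrant intervals for one sample.
profiles = {name: aberrant_intervals(calls) for name, calls in samples.items()}
```

Here `sample_A` has two aberrant intervals and `sample_B` has one.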

The Problem

Find locations which have more intervals (gains/losses) covering them than would be expected by chance

True underlying aberration rate is unknown

Take the observed aberrations as given and test for the significance of consistent aberrations across samples

Statistical Approach

Null Model: observed intervals of aberration are equally likely to occur anywhere in the stretch of the genome being considered

General Approach:

(1) Choose an appropriate statistic

(2) Apply a permutation procedure under the null model to estimate a null distribution of the statistic

(3) Assess the (multiple testing corrected) significance of observed values of the statistic by comparing to the null distribution

Permutation: random rearrangement of intervals within each profile
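A minimal sketch of the three steps, using the per-location frequency as the statistic. The slides do not give the exact permutation or correction scheme, so the following are assumptions: intervals are re-placed uniformly at random without overlap within each profile, and multiple testing is handled by comparing each observed frequency to the null distribution of the maximum frequency over all locations. All function names are hypothetical.

```python
import random
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), end exclusive


def permute_profile(lengths: List[int], n_locations: int,
                    rng: random.Random) -> List[Interval]:
    """Place intervals of the given lengths at random, without overlap."""
    lengths = lengths[:]
    rng.shuffle(lengths)  # random order of the intervals
    k = len(lengths)
    free = n_locations - sum(lengths)  # non-aberrant locations to distribute
    # Stars and bars: choose which of the free + k slots hold an interval.
    slots = sorted(rng.sample(range(free + k), k))
    placed, offset = [], 0
    for i, (slot, length) in enumerate(zip(slots, lengths)):
        start = (slot - i) + offset  # free locations before it + earlier intervals
        placed.append((start, start + length))
        offset += length
    return placed


def frequency(profiles: List[List[Interval]], n_locations: int) -> List[int]:
    """Number of profiles with an interval covering each location."""
    freq = [0] * n_locations
    for intervals in profiles:
        for start, end in intervals:
            for loc in range(start, end):
                freq[loc] += 1
    return freq


def frequency_p_values(profiles: List[List[Interval]], n_locations: int,
                       n_perm: int = 1000, seed: int = 0) -> List[float]:
    """Per-location p-values against the null distribution of the maximum frequency."""
    rng = random.Random(seed)
    observed = frequency(profiles, n_locations)
    lengths = [[end - start for start, end in p] for p in profiles]
    null_max = []
    for _ in range(n_perm):
        permuted = [permute_profile(l, n_locations, rng) for l in lengths]
        null_max.append(max(frequency(permuted, n_locations)))
    # Fraction of permutations whose maximum is at least the observed value.
    return [sum(m >= f for m in null_max) / n_perm for f in observed]
```

Taking the maximum over locations in each permutation is one common way to control for testing many locations at once; the published method may differ.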

Frequency statistic results

freq = 9

Need a statistic sensitive to tight alignment, even if the aberration is not significantly frequent

The footprint statistic

Stack: set S of aligned intervals containing at most one interval per profile and with at least one location common to all intervals

Footprint:

F(S) = the number of locations c such that c is contained in some interval of stack S

• In practice, F(S) is normalized:

NF(S) = F(S)/E(F(S))
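A minimal sketch of the definitions above. A stack is represented as a mapping from profile name to a single interval, which enforces "at most one interval per profile". How E(F(S)) is obtained is not given in the slides, so it is passed in as a number here; all names are hypothetical.

```python
from typing import Dict, Tuple

Interval = Tuple[int, int]  # (start, end), end exclusive


def is_stack(stack: Dict[str, Interval]) -> bool:
    """True if all intervals share at least one common location."""
    if not stack:
        return False
    latest_start = max(start for start, _ in stack.values())
    earliest_end = min(end for _, end in stack.values())
    return latest_start < earliest_end


def footprint(stack: Dict[str, Interval]) -> int:
    """F(S): number of locations contained in some interval of the stack."""
    covered = set()
    for start, end in stack.values():
        covered.update(range(start, end))
    return len(covered)


def normalized_footprint(stack: Dict[str, Interval],
                         expected_footprint: float) -> float:
    """NF(S) = F(S) / E(F(S)), with E(F(S)) supplied by the caller."""
    return footprint(stack) / expected_footprint


# Tightly aligned intervals give a small footprint; loosely aligned ones a large one.
tight = {"A": (10, 13), "B": (11, 14), "C": (10, 14)}
loose = {"A": (2, 12), "B": (10, 20), "C": (11, 30)}
```

Both example stacks contain three intervals, but `tight` covers far fewer locations than `loose`, which is the property the footprint is meant to capture.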

Null Distributions: Find the minimal NF(S) for each (sample) subset size using a heuristic search

• use distributions to assign (multiple testing corrected) p-values to locations (details omitted)

Footprint statistic results

The footprint statistic, coupled with the search strategy, reveals locations that are significantly consistent within subsets of samples

p-value = 0.0001 p-value = 0.0050

INPUT:

matrix of binary gain/no change (or loss/no change) calls for each location along a chromosome arm

OUTPUT: for each location along the chromosome arm:

a) the best stack covering that location

b) two p-values for that location (one for each statistic)

STAC Algorithm Specification

Sample        chr1:1-1000000   chr1:1000001-2000000   . . .
1712DZ1T10    0                1
1714DZ1T10    0                1
. . .
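A minimal sketch of reading an input matrix like the one above. The slides do not specify the file format STAC accepts; whitespace-delimited text with a header row is an assumption, and `read_call_matrix` is a hypothetical helper.

```python
import io
from typing import Dict, List, Tuple


def read_call_matrix(handle) -> Tuple[List[str], Dict[str, List[int]]]:
    """Parse a header of locations followed by one row of 0/1 calls per sample."""
    header = handle.readline().split()
    locations = header[1:]  # the first column holds the sample name
    calls = {}
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        name, values = fields[0], [int(v) for v in fields[1:]]
        if len(values) != len(locations):
            raise ValueError(
                f"{name}: expected {len(locations)} calls, got {len(values)}")
        if any(v not in (0, 1) for v in values):
            raise ValueError(f"{name}: calls must be 0 or 1")
        calls[name] = values
    return locations, calls


# The two rows shown on the slide, as an in-memory file.
example = io.StringIO(
    "Sample chr1:1-1000000 chr1:1000001-2000000\n"
    "1712DZ1T10 0 1\n"
    "1714DZ1T10 0 1\n"
)
locations, calls = read_call_matrix(example)
```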

Validation Data

UPenn BAC Array (Greshock et al. 2004, Gen. Res.)

• ~4,200 BAC Clones
  1. 69% BAC end sequenced
  2. 28% STS Mapping
  3. 3% Full BAC Sequence

• Spacing: ~0.91 Mb (chrs 1-X)

aCGH BAC Coverage (chr13)

Publicly available data sets:

42 Neuroblastoma cell lines (Mosse et al. 2005, Genes Chr Cancer)

47 Primary sporadic breast tumors (Naylor et al. 2005, submitted)

Traditional Processing – Many Samples

1. Define regions of aberration for each sample
2. Determine frequency of aberration at each location
3. Threshold frequency (e.g., NBL = 25%, Breast = 30%)
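The three steps above can be sketched as follows, assuming step 1 has already produced a binary matrix (rows are samples, columns are locations). Whether the cutoff is inclusive is an assumption, and `frequent_regions` is a hypothetical name.

```python
from typing import List, Tuple


def frequent_regions(matrix: List[List[int]],
                     threshold: float) -> List[Tuple[int, int]]:
    """Return (start, end) runs of locations whose frequency meets the threshold."""
    n_samples = len(matrix)
    n_locations = len(matrix[0])
    # Step 2: fraction of samples that are aberrant at each location.
    freq = [sum(row[j] for row in matrix) / n_samples for j in range(n_locations)]
    # Step 3: keep contiguous runs at or above the threshold (end exclusive).
    regions, start = [], None
    for j, f in enumerate(freq):
        if f >= threshold and start is None:
            start = j
        elif f < threshold and start is not None:
            regions.append((start, j))
            start = None
    if start is not None:
        regions.append((start, n_locations))
    return regions


# Four made-up samples over six locations.
calls = [
    [0, 1, 1, 0, 0, 1],
    [0, 1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0, 1],
    [0, 0, 1, 1, 0, 0],
]
regions_50 = frequent_regions(calls, 0.50)
regions_25 = frequent_regions(calls, 0.25)
```

Lowering the threshold widens the first region, which illustrates how region boundaries under this approach depend on the chosen cutoff.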

[Figure: Example Common Regions of Aberration; labeled frequencies 90%, 70%, 90%, 60%]

Breast Cancer

92% (11/12) gain regions

85% (11/13) loss regions

Also 86% (47/55) of the gains (suppl. data)

Avg p-value: gain 0.00549, loss 0.00899

Boundaries differ by < 1 Mb on average and in several cases are narrowed by STAC

Validation

Neuroblastoma

83% (19/22) gain regions

100% (12/12) loss regions

Avg p-value: gain 0.00447, loss 0.00719

STAC identifies prognostically relevant regions in neuroblastoma. Shown: MYCN amplification at 2p24.

2p gain

Additional Regions Identified

Neuroblastoma

94 Gains covering 341 Mb

80 Losses covering 305 Mb


Breast Cancer

149 Gains covering 525 Mb

124 Losses covering 384 Mb

Regions segregate with known biology

Neuroblastoma Cell Lines

646 Mb of significant locations scored (gain, loss, no change)

Agglomerative hierarchical, Pearson correlation, complete linkage

Evidence for 2 sample clusters
- Cluster 1 characterized by pattern of loss
- Cluster 2 characterized by pattern of gain
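A minimal pure-Python sketch of agglomerative clustering with Pearson correlation distance and complete linkage, as named above. It is an illustration only, not the software used for the analysis; scoring loss, no change, and gain as -1, 0, and +1 is an assumption, as are the function names and the toy data.

```python
from math import sqrt
from typing import List


def pearson_distance(x: List[float], y: List[float]) -> float:
    """1 - Pearson correlation; constant vectors are treated as uncorrelated."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 1.0
    return 1.0 - sxy / sqrt(sxx * syy)


def complete_linkage(rows: List[List[float]], n_clusters: int) -> List[List[int]]:
    """Merge the two closest clusters until n_clusters remain."""
    dist = [[pearson_distance(a, b) for b in rows] for a in rows]
    clusters = [[i] for i in range(len(rows))]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Complete linkage: cluster distance is the largest pairwise distance.
                d = max(dist[a][b] for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


# Toy samples scored -1 (loss), 0 (no change), +1 (gain) at six locations.
scores = [
    [-1, -1, 0, 0, 0, 0],
    [-1, -1, -1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
]
groups = complete_linkage(scores, 2)
```

In this toy example the two loss-dominated samples fall into one cluster and the two gain-dominated samples into the other.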

* missed by traditional method

Future Plans

Release standalone Java version of STAC

Extend STAC to account for high-level gains and homozygous deletions

Extend STAC to handle stacks with 2 or more intervals per profile (co-occurring aberrations)

http://www.cbil.upenn.edu/STAC
