
DNA Copy Number Analysis

Qunyuan Zhang

Division of Statistical Genomics

Department of Genetics & Center for Genome Sciences

Washington University School of Medicine

04-23-2010

GEMS Course: M 21-621 Computational Statistical Genetics

What is Copy Number?

Gene Copy Number

The gene copy number (also "copy number variants" or CNVs) is the amount of copies of a particular gene in the genotype of an individual. Recent evidence shows that the gene copy number can be elevated in cancer cells. For instance, the EGFR copy number can be higher than normal in Non-small cell lung cancer. …Elevating the gene copy number of a particular gene can increase the expression of the protein that it encodes.

From Wikipedia www.wikipedia.org

DNA Copy Number

A Copy Number Variant (CNV) represents a copy number change involving a DNA fragment that is ~1 kilobase or larger. (From Nature Reviews Genetics, Feuk et al. 2006)

DNA Copy Number ≠ DNA Tandem Repeat Number (e.g. microsatellites, repeat unit <10 bases)

DNA Copy Number ≠ RNA Copy Number

RNA Copy Number = Gene Expression Level

DNA → (transcription) → mRNA

Copy number is the number of copies of a particular fragment of a nucleic acid molecule. In most publications it refers to DNA copy number.

Why study Copy Number?

Motive 1: Genetic Polymorphisms

- restriction fragment length polymorphism (RFLP)
- amplified fragment length polymorphism (AFLP)
- random amplification of polymorphic DNA (RAPD)
- variable number of tandem repeat (VNTR; e.g., mini- and microsatellite)
- single nucleotide polymorphism (SNP)
- presence/absence of transposable elements …
- structural alterations (deletions, duplications, insertions, inversions …)
- DNA copy number variant (CNV)

Association with phenotypes/diseases → genes/genetic factors

Motive 2: Genetic Aberrations in Tumor Cells

Mutation, LOH, Copy Number Aberration (CNA)

- Homologous repeats
- Segmental duplications
- Chromosomal rearrangements
- Duplicative transpositions
- Non-allelic recombinations
- ……

[Figure: normal cell (CN=2) versus tumor cells, ranging from deletion (CN=0, CN=1) through CN=2 to amplification (CN=3, CN=4)]

How to measure/quantify Copy Number?

Quantitative Polymerase Chain Reaction (Q-PCR): DNA Amplification

PCR (dNTPs, primers, Taq polymerase, fluorescent dye)

less CN → amplification → less DNA → low fluorescent intensity

more CN → amplification → more DNA → high fluorescent intensity

(one fragment each time)

Microarray: DNA Hybridization

PCR (dNTPs, primers, Taq polymerase, fluorescent dye)

less CN → amplification → less DNA → hybridization to arrayed probes → low intensities

more CN → amplification → more DNA → hybridization to arrayed probes → high intensities

(multiple/different fragments, mixed pool)

Array Comparative Genomic Hybridization (CGH)

Tumor: red intensity

Normal: green intensity

Red < Green: Deletion (CN<2)

Red > Green: Amplification (CN>2)

Red = Green: No Alteration (CN=2)

more DNA copy number → more DNA hybridization → higher intensity

SNP Array

Affymetrix Mapping 250K Sty-I chip: ~250K probe sets, ~250K SNPs; probe set (24 probes)

[Figure: tumor versus normal probe-set intensities; normal shows CN=2 throughout, tumor shows deletions (CN=1, CN=0) and amplification (CN>2)]

more DNA copy number → more DNA hybridization → higher intensity

Genotyping & Copy Number Calling

CN=0: 2 copy deletion, genotype (_//_)

CN=1: 1 copy deletion, genotype (_//B)

CN=2: Normal, genotype (A//B)

CN=3: 1 copy amplification, genotype (AA//B)

CN=4: 2 copy amplification, genotype (AA//BB)

[Figure: genotype clusters labeled BB, BBBB, AB, AABB, AAA_]

Copy Number Analysis

Data Pre-processing

Individual Sample Analysis

Population Analysis


An Example

Finished chips → (scanner) → Raw image data [.DAT files] (experiment info [.EXP])

→ (image processing software) → Probe level raw intensity data [.CEL files]

→ Preprocessing (with chip description file [.CDF]): Background adjustment, Normalization, Summarization

→ Summarized intensity data

→ Raw copy number (CN) data [log ratio of tumor/normal intensities]

→ Significance test of CN changes; Estimation of CN

→ Smoothing and boundary determination; Concurrent regions among population

→ Amplification and deletion frequencies among populations; Association analysis

Background Adjustment/Correction

Reduces unevenness of a single chip

Makes intensities of different positions on a chip comparable

[Figure: chip image before adjustment and after adjustment]

Corrected Intensity (S’) = Observed Intensity (S) – Background Intensity (B)

For each region i, B(i) = Mean of the lowest 2% intensities in region i

(Affymetrix MAS 5.0)
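The region-wise rule above (B(i) = mean of the lowest 2% of intensities in region i, then S’ = S – B) can be sketched in a few lines. This is only an illustration under stated assumptions: the chip is already split into regions given as plain lists, the function names `region_background` and `background_correct` are hypothetical, and the additional smoothing between regions that MAS 5.0 performs is not reproduced.

```python
import math

def region_background(intensities, fraction=0.02):
    """Mean of the lowest `fraction` of intensities in one chip region."""
    ordered = sorted(intensities)
    # Use at least one value so that very small regions still get an estimate.
    k = max(1, math.floor(len(ordered) * fraction))
    return sum(ordered[:k]) / k

def background_correct(regions, fraction=0.02):
    """Apply S' = S - B(i) to every intensity S in every region i."""
    corrected = []
    for intensities in regions:
        b = region_background(intensities, fraction)
        corrected.append([s - b for s in intensities])
    return corrected

# Toy chip (assumed data): two regions of 100 probes; region 2 is uniformly brighter.
region1 = [100 + i for i in range(100)]   # lowest 2%: 100, 101
region2 = [300 + i for i in range(100)]   # lowest 2%: 300, 301
corrected = background_correct([region1, region2])
```

After correction the two toy regions have identical values, which is the sense in which positions on one chip become comparable.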

Background Adjustment/Correction

Eliminates non-specific hybridization signal

Obtains accurate intensity values for specific hybridization

PM only, PM-MM, Ideal MM, etc.

quartet probe set

sense or antisense strands

25 oligonucleotide probes


Normalization

Reduces technical variation between chips

Makes intensities from different chips comparable

[Figure: intensities before normalization and after normalization]

Base Line Array (linear); Quantile Normalization etc.

S’ = (S – Mean of S) / (STD of S)

S’ ~ N(0, 1)
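Two of the normalization options named on this slide can be illustrated as follows. This is a minimal sketch, not the implementation used by any particular package. The function names are hypothetical, the population standard deviation is an assumed choice for "STD of S", and the quantile normalization breaks ties by probe order for simplicity.

```python
import statistics

def zscore_normalize(chip):
    """S' = (S - mean of S) / (STD of S), computed within one chip."""
    mean = statistics.fmean(chip)
    std = statistics.pstdev(chip)          # population SD (assumed choice)
    return [(s - mean) / std for s in chip]

def quantile_normalize(chips):
    """Give every chip the same intensity distribution.

    The value of rank r on each chip is replaced by the mean, across chips,
    of the rank-r values.
    """
    n = len(chips[0])
    sorted_chips = [sorted(chip) for chip in chips]
    rank_means = [statistics.fmean(col) for col in zip(*sorted_chips)]
    normalized = []
    for chip in chips:
        order = sorted(range(n), key=lambda idx: chip[idx])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        normalized.append(out)
    return normalized

# Toy data (assumed): chip_b is brighter and has a different probe ordering.
chip_a = [2.0, 4.0, 6.0, 8.0]
chip_b = [30.0, 10.0, 40.0, 20.0]
z_a = zscore_normalize(chip_a)
qn_a, qn_b = quantile_normalize([chip_a, chip_b])
```

The z-score rescales each chip separately, while quantile normalization uses all chips together. Both leave the ranking of probes within a chip unchanged.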


Summarization

Combines the multiple probe intensities for each probe set to produce a summarized value for subsequent analyses.

Average methods: PM only or PM-MM, allele specific or non-specific

Model based method: Li & Wong, 2001 (Gene Expression Index)
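The average-based summaries (PM only, or PM-MM) can be illustrated with a short sketch. The function name `summarize_probe_set` and the toy intensities are assumptions made for illustration, and the model-based method of Li & Wong is not reproduced here.

```python
def summarize_probe_set(pm, mm=None):
    """Average-based summary of one probe set.

    pm: perfect-match probe intensities.
    mm: mismatch probe intensities, or None for the PM-only average.
    """
    if mm is None:
        values = list(pm)
    else:
        if len(pm) != len(mm):
            raise ValueError("pm and mm must have the same length")
        values = [p - m for p, m in zip(pm, mm)]
    return sum(values) / len(values)

# Toy probe set with four PM/MM pairs (assumed data).
pm = [1200.0, 1100.0, 1300.0, 1000.0]
mm = [200.0, 300.0, 100.0, 400.0]
pm_only = summarize_probe_set(pm)
pm_minus_mm = summarize_probe_set(pm, mm)
```

An allele-specific summary would apply the same function separately to the probes of allele A and allele B.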


Raw Copy Number Data

S: Summarized raw intensity

S’: Log transformation, S’ = log2(S)

Log ratio of sample i / sample ref.: CN_log2 = log2(Si/Sref)

CN = 2 × (Si/Sref), where the reference sample is assumed to carry 2 copies

[Figure: distribution of S before log transformation, of log(S) after log transformation, and of raw CN]
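As a small worked example of the two formulas, the sketch below computes the log2 ratio and the copy number estimate for three SNPs. It assumes a diploid reference (CN = 2); the function name `raw_copy_number` and the intensities are made up for illustration.

```python
import math

def raw_copy_number(s_sample, s_ref):
    """Return (log2 ratio, copy number estimate) for one SNP.

    Assumes the reference sample is diploid, so CN = 2 * (Si / Sref).
    """
    ratio = s_sample / s_ref
    return math.log2(ratio), 2.0 * ratio

# Toy summarized intensities (assumed): tumor versus matched normal.
tumor = [500.0, 1000.0, 2000.0]
normal = [1000.0, 1000.0, 1000.0]
raw = [raw_copy_number(t, n) for t, n in zip(tumor, normal)]
```

A log2 ratio of -1, 0 and +1 corresponds to CN = 1, 2 and 4 respectively under this assumption.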

Individual Level Analysis

Smoothing

Significance test of amplification and deletion

Segmentation

CN estimation

Sliding Window

[Figure: overlapping windows 1, 2, 3, …, k, …, N sliding along the sequence of SNPs]

Each window (k) contains n consecutive SNPs (k, k+1, k+2, k+3, …, k+n-1)
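The window definition can be turned into a simple smoother that averages the raw CN over each window. This is a minimal sketch with 0-based indexing; the function name `sliding_window_means` and the toy values are assumptions.

```python
def sliding_window_means(values, n):
    """Mean of each window k, which holds SNPs k, k+1, ..., k+n-1 (0-based k)."""
    if n < 1 or n > len(values):
        raise ValueError("window size must be between 1 and len(values)")
    window_sum = sum(values[:n])
    means = [window_sum / n]
    for k in range(1, len(values) - n + 1):
        # Slide by one SNP: add the SNP entering the window, drop the one leaving.
        window_sum += values[k + n - 1] - values[k - 1]
        means.append(window_sum / n)
    return means

# Toy raw CN values (assumed) with an amplified stretch in the middle.
raw_cn = [2.0, 2.0, 2.0, 4.0, 4.0, 4.0, 2.0, 2.0]
smoothed = sliding_window_means(raw_cn, n=3)
```

A sequence of m SNPs yields m - n + 1 windows, so larger windows give smoother but shorter and less sharply bounded profiles.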

Smoothing (sliding window = 30 SNPs)

[Figure: smoothed CN along chromosome 7 for Affymetrix and Illumina data; x-axis: position (Mbp), y-axis: CN]

Significance Test of CN Changes

[Figure: panels of CN, SD, -log P and -log FDR plotted against position (Mbp)]

Window Selection (FDR < 0.05)

[Figure: CN and -log FDR plotted against position (Mbp); annotated gene: epidermal growth factor receptor (EGFR)]
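Window selection at FDR < 0.05 can be sketched as below. The slides do not state which test produces the per-window P values, so the normal-approximation test in `two_sided_p` is purely an assumption. The Benjamini-Hochberg adjustment is one standard way to obtain FDR values and may differ from what was used for the plots.

```python
import math

def two_sided_p(window_mean, sd, n):
    """P value for H0: mean log ratio = 0, by normal approximation (assumed test)."""
    z = window_mean / (sd / math.sqrt(n))
    return math.erfc(abs(z) / math.sqrt(2.0))

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P values (FDR), returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Toy per-window P values (assumed data).
pvals = [0.001, 0.008, 0.039, 0.041, 0.6]
fdr = benjamini_hochberg(pvals)
selected = [k for k, q in enumerate(fdr) if q < 0.05]
```

In this toy example only the first two windows pass the threshold, although four of the five raw P values are below 0.05.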

Segmentation

(Break chrom. into CN-homogeneous pieces)

BioConductor R Packages (www.bioconductor.org)

- DNAcopy package, circular binary segmentation (CBS)
- GLAD package, adaptive weights smoothing (AWS)

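CBS Algorithm

Markers along a chromosome: 1, 2, 3, …, i-1, i, i+1, …, j-1, j, j+1, …, n

Given any i, j with 1 ≤ i < j ≤ n, where x_k is the raw CN of marker k:

$$S_i = \sum_{k=1}^{i} x_k, \qquad S_j = \sum_{k=1}^{j} x_k, \qquad S_n = \sum_{k=1}^{n} x_k$$

$$Z_{ij} = \frac{(S_j - S_i)/(j-i) \; - \; (S_n - S_j + S_i)/(n-j+i)}{\sqrt{1/(j-i) + 1/(n-j+i)}}$$

$$Z_C = \max_{1 \le i < j \le n} |Z_{ij}|$$

Iterate until Z_C is not significant.

Olshen et al. Biostatistics. 2004 Oct;5(4):557-72.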
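One step of the procedure can be sketched as a brute-force search for the pair (i, j) that maximizes |Z_ij|, followed by a permutation test of Z_C. This is an illustrative O(n²) sketch and not the DNAcopy implementation. The function names, the permutation count and the toy data are assumptions, and the variance scaling and recursion over sub-segments of the full method are omitted.

```python
import math
import random

def max_cbs_statistic(x):
    """Return (Z_C, i, j), where segment x_{i+1}..x_j maximizes |Z_ij| (1-based)."""
    n = len(x)
    s = [0.0]
    for v in x:
        s.append(s[-1] + v)                  # s[k] = x_1 + ... + x_k
    best = (0.0, None, None)
    for i in range(1, n):
        for j in range(i + 1, n + 1):
            inside = j - i                   # markers i+1 .. j
            outside = n - j + i              # all remaining markers
            diff = (s[j] - s[i]) / inside - (s[n] - s[j] + s[i]) / outside
            z = abs(diff) / math.sqrt(1.0 / inside + 1.0 / outside)
            if z > best[0]:
                best = (z, i, j)
    return best

def permutation_p_value(x, n_perm=200, seed=0):
    """Share of random marker orderings whose Z_C reaches the observed Z_C."""
    rng = random.Random(seed)
    observed = max_cbs_statistic(x)[0]
    shuffled = list(x)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if max_cbs_statistic(shuffled)[0] >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Toy profile (assumed): markers 9..14 carry an elevated log ratio.
x = [0.0] * 8 + [1.0] * 6 + [0.0] * 6
z_c, i, j = max_cbs_statistic(x)
p = permutation_p_value(x)
```

If Z_C is significant, the chromosome is cut at i and j and the same search is repeated within each resulting piece, which corresponds to the "iterate" line of the slide.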

CN Estimation: Hidden Markov Model (HMM)

CNAT (www.affymetrix.com); dChip (www.dchip.org); CNAG (www.genome.umin.jp)

[Figure: HMM along positions … SNP_i, SNP_i+1, SNP_i+2, SNP_i+3, SNP_i+4 …; hidden status: unknown CN (CN=?) at each SNP; observed status: raw CN (log ratio of intensities) at each SNP]

CN estimation: finding a sequence of CN values which maximizes the likelihood of observed raw CN.

Algorithm: Viterbi algorithm (can be Iterative)

Information/assumptions below are needed

Background probabilities: Overall probabilities of possible CN values.

P(CN=x); x=0,1,2,3,4,…, n (usually,n<10)

Transition probabilities: Probabilities of CN values of each SNP conditional on the previous one.

P(CN_i+1=xi | CN_i=xj); x=0,1,2,3,4,…, or n

Emission probabilities: Probabilities of observed raw CN values of each SNP conditional on the hidden/unknown/true CN status.

P(log ratio<x|CN=y)=f(x|CN=y); x=one of real numbers; y=0,1,2,3,4, …, or n
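The three ingredients above (background, transition and emission probabilities) are enough to run the Viterbi algorithm. The sketch below is a minimal illustration and not the model used by CNAT, dChip or CNAG. It assumes states CN = 1 to 4, a Gaussian emission centered at log2(CN/2) with a common standard deviation, a fixed probability of staying in the same state, and a background that favors CN = 2. All parameter values are made up.

```python
import math

def viterbi_copy_number(log_ratios, states=(1, 2, 3, 4), stay=0.99, sigma=0.25):
    """Most likely CN sequence given raw log ratios (all parameters assumed)."""
    n_states = len(states)
    # Background probabilities: CN=2 is assumed to be the most common state.
    log_prior = [math.log(0.9 if cn == 2 else 0.1 / (n_states - 1)) for cn in states]
    # Transition probabilities: stay in the same CN with probability `stay`.
    log_stay = math.log(stay)
    log_move = math.log((1.0 - stay) / (n_states - 1))

    def log_emission(x, cn):
        # Gaussian density of the log ratio around log2(cn / 2).
        mu = math.log2(cn / 2.0)
        return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))

    score = [log_prior[s] + log_emission(log_ratios[0], states[s])
             for s in range(n_states)]
    back = []
    for x in log_ratios[1:]:
        new_score, pointers = [], []
        for s in range(n_states):
            best_prev = max(range(n_states),
                            key=lambda p: score[p] + (log_stay if p == s else log_move))
            trans = log_stay if best_prev == s else log_move
            new_score.append(score[best_prev] + trans + log_emission(x, states[s]))
            pointers.append(best_prev)
        score = new_score
        back.append(pointers)

    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return [states[s] for s in path]

# Toy log ratios (assumed): a four-SNP gain flanked by normal SNPs.
log_ratios = [0.05, -0.1, 0.0, 0.9, 1.1, 1.0, 0.95, 0.1, -0.05]
cn_path = viterbi_copy_number(log_ratios)
```

Because changing state is penalized, small fluctuations around 0 stay at CN = 2, while the sustained shift near a log ratio of 1 is called as CN = 4.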

HMM Results (An Example)

Black: Normal Intensities, Red: Tumor Intensities, Green: Tumor - Normal, Blue: HMM estimated CNs in Tumor Tissue

[Figure: segments labeled CN=2, CN=1, CN=4, CN=3]

References for Single Sample Analysis

•Hsu et al. 2005. Denoising array-based comparative genomic hybridization data using wavelets. Biostatistics 6: 211-226.

•Hupe et al. 2004. Analysis of array CGH data: from signal ratio to gain and loss of DNA regions. Bioinformatics 20: 3413-3422.

•Jong et al. 2004. Breakpoint identification and smoothing of array comparative genomic hybridization data. Bioinformatics 20: 3636-3637.

•Lai et al. 2005. Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics 21: 3763-3770.

•Lai et al. 2005. A statistical method to detect chromosomal regions with DNA copy number alterations using SNP-array-based CGH data. Comput Biol Chem 29: 47-54.

•Olshen et al. 2004. Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics 5: 557-572.

•Picard et al. 2005. A statistical approach for array CGH data analysis. BMC Bioinformatics 6: 27.

•Shah et al. 2007. Modeling recurrent DNA copy number alterations in array CGH data. Bioinformatics 23: i450-458.

•Nilsson et al. 2009. Bioinformatics 25(8): 1078-9.

Population Level Analysis

Common/Recurrent Region Identification

[Figure; axis label: samples]

Nature 2007, 450, 893-898

Genome-wide Raw Copy Number Changes (sliding window plot, averaged over ~400 pairs)

Frequency Test

Diskin et al. 2006. STAC. Genome Res 16: 1149-1158. (Permutation test)
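The idea of a permutation-based frequency test can be sketched as follows. This is a simplified illustration and not the STAC algorithm itself. It assumes binary aberration calls (1 = aberrant) per sample and marker, builds the null distribution by circularly shifting each sample's profile by a random offset, and compares each marker's observed frequency with the maximum frequency seen in each permutation. All names and data are hypothetical.

```python
import random

def marker_frequencies(calls):
    """Number of samples with an aberration at each marker."""
    return [sum(column) for column in zip(*calls)]

def frequency_permutation_test(calls, n_perm=500, seed=0):
    """Return (observed frequencies, permutation P value per marker)."""
    rng = random.Random(seed)
    n_markers = len(calls[0])
    observed = marker_frequencies(calls)
    null_max = []
    for _ in range(n_perm):
        shifted = []
        for row in calls:
            # A random circular shift keeps each sample's aberration lengths intact.
            offset = rng.randrange(n_markers)
            shifted.append(row[offset:] + row[:offset])
        null_max.append(max(marker_frequencies(shifted)))
    pvalues = [(sum(m >= f for m in null_max) + 1) / (n_perm + 1) for f in observed]
    return observed, pvalues

# Toy calls (assumed): 8 samples, 40 markers. Every sample is aberrant at
# markers 10-12 and at one private marker elsewhere.
calls = []
for sample in range(8):
    row = [0] * 40
    for marker in (10, 11, 12):
        row[marker] = 1
    row[20 + 2 * sample] = 1
    calls.append(row)

observed, pvalues = frequency_permutation_test(calls)
```

Markers 10 to 12 are aberrant in all samples and receive small P values, whereas the private aberrations are no more frequent than expected by chance.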

Amplitude Test

GISTIC: Beroukhim et al. 2007. Proc Natl Acad Sci U S A 104: 20007-20012

Weir et al. Nature 2007, 450, 893-898

Population-based One-step Analysis

CMDS Method

Q Zhang et al. Bioinformatics, 2009 doi:10.1093/bioinformatics/btp708

References for Multiple Sample Analysis

•(GISTIC ) Beroukhim et al. 2007. Assessing the significance of chromosomal aberrations in cancer: methodology and application to glioma. Proc Natl Acad Sci U S A 104: 20007-20012.

•(STAC) Diskin et al. 2006. STAC: A method for testing the significance of DNA copy number aberrations across multiple array-CGH experiments. Genome Res 16: 1149-1158.

•(MSA) Guttman et al. 2007. Assessing the significance of conserved genomic aberrations using high resolution genomic microarrays. PLoS Genet 3: e143.

•(GFA) Lipson et al. 2006. Efficient calculation of interval scores for DNA copy number data analysis. J Comput Biol 13: 215-228.

•(MAR) Rouveirol et al. 2006. Computation of recurrent minimal genomic alterations from array-CGH data. Bioinformatics 22: 849-856.

•(CMDS) Zhang et al. CMDS: a population-based method for identifying recurrent DNA copy number aberrations in cancer from high-resolution data. Bioinformatics, 2009 doi:10.1093/bioinformatics/btp708

Sequencing Data: coverage/depth based analysis

Nature Genetics 41, 1061 - 1067 (2009)

Sequencing Data: paired-end data based analysis

Science 2007: Vol. 318, pp. 420 - 426. DOI: 10.1126/science.1149504

Homework

Download the data file

dsgweb.wustl.edu/qunyuan/data/cn_data.csv

Use any published or self-developed method/software to analyze/present the data

Write a report of your analysis

Send to [email protected] within two weeks
