First analysis steps

o quality control and optimization
o calibration and error modeling
o data transformations

Wolfgang Huber

Dep. of Molecular Genome Analysis (A. Poustka)

DKFZ Heidelberg

Acknowledgements

Anja von Heydebreck, Günther Sawitzki

Holger Sültmann, Andreas Buness, Markus Ruschhaupt, Klaus Steiner, Jörg Schneider, Katharina Finis, Stephanie Süß, Anke Schroth, Friederike Wilmer, Judith Boer, Martin Vingron, Annemarie Poustka

Sandrine Dudoit, Robert Gentleman, Rafael Irizarry and Yee Hwa Yang: Bioconductor short course, summer 2002

and many others

A microarray slide

Slide: 25x75 mm

4 x 4 or 8 x 4 sectors

17...38 rows and columns per sector

ca. 4600…46000 probes/array

sector: corresponds to one print-tip

Spot-to-spot: ca. 150-350 µm

Terminology

sample: RNA (cDNA) hybridized to the array, aka target, mobile substrate.

probe: DNA spotted on the array, aka spot, immobile substrate.

sector: rectangular matrix of spots printed using the same print-tip (or pin), aka print-tip-group

plate: set of 384 (768) spots printed with DNA from the same microtitre plate of clones

slide, array

channel: data from one color (Cy3 = cyanine 3 = green, Cy5 = cyanine 5 = red).

batch: collection of microarrays with the same probe layout.

Raw data

scanner signal

resolution: 5 or 10 µm spatial, 16 bit (65536 levels) dynamic range per channel

ca. 30-50 pixels per probe (60 µm spot size)

40 MB per array

Image Analysis

spot intensities

2 numbers per probe (~100-300 kB)

… auxiliaries: background, area, std dev, …

Image analysis

1. Addressing. Estimate location of spot centers.

2. Segmentation. Classify pixels as foreground (signal) or background.

3. Information extraction. For each spot on the array and each dye:

• foreground intensities;
• background intensities;
• quality measures.

R and G for each spot on the array.

Segmentation

adaptive segmentation

seeded region growing

fixed circle segmentation

Spots may vary in size and shape.
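As an illustration of the simplest option, fixed circle segmentation, here is a minimal sketch. The function names, the toy 5x5 pixel patch, and the choice of mean for foreground and median for background are assumptions made for this example; they are not taken from any particular image analysis package.

```python
import statistics

def fixed_circle_segment(image, cx, cy, radius):
    """Classify pixels as foreground (inside the circle) or background (outside)."""
    fg, bg = [], []
    for y, row in enumerate(image):
        for x, value in enumerate(row):
            inside = (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2
            (fg if inside else bg).append(value)
    return fg, bg

def spot_summary(image, cx, cy, radius):
    """Information extraction for one spot and one channel."""
    fg, bg = fixed_circle_segment(image, cx, cy, radius)
    return {
        "foreground": statistics.mean(fg),    # assumed summary: mean
        "background": statistics.median(bg),  # assumed summary: median
        "area": len(fg),                      # number of foreground pixels
    }

# Toy patch: background level 100, a small bright spot around pixel (2, 2).
patch = [[100] * 5 for _ in range(5)]
for x, y in [(2, 2), (1, 2), (3, 2), (2, 1), (2, 3)]:
    patch[y][x] = 1000

summary = spot_summary(patch, cx=2, cy=2, radius=1)
```

A fixed circle cannot follow spots that vary in size and shape, which is the motivation for the adaptive and seeded region growing methods.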

spot intensity data

[Figure: data matrix with probes (genes) as rows and conditions (samples) as columns, obtained from two-color spotted arrays or from n one-color arrays (Affymetrix, nylon)]

Which genes are differentially transcribed?

[Figure: log-ratio, panels: same-same and tumor-normal]

ratios and fold changes

Fold changes are useful to describe continuous changes in expression

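[Figure: expression levels in conditions A, B, C: 1000, 1500, 3000; fold changes x1.5 and x3]

But what if the gene is “off” (below detection limit) in one condition?

[Figure: expression levels in conditions A, B, C: 0, 200, 3000; fold changes: ?, ?]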
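The two examples can be worked through numerically. This is a small sketch using the values 1000, 1500, 3000 and 0, 200, 3000 from the slides; the helper `log2_ratio` and the choice to return `None` for undefined ratios are assumptions of this example.

```python
import math

def log2_ratio(a, b):
    """Return log2(b / a), or None when either level is not positive."""
    if a <= 0 or b <= 0:
        return None
    return math.log2(b / a)

on_gene = {"A": 1000, "B": 1500, "C": 3000}   # expressed in all conditions
off_gene = {"A": 0, "B": 200, "C": 3000}      # "off" in condition A

# Fold changes relative to condition A are well defined for the first gene ...
fold_on = {c: on_gene[c] / on_gene["A"] for c in ("B", "C")}

# ... but the (log-)ratio is undefined for the second one.
lr_off = {c: log2_ratio(off_gene["A"], off_gene[c]) for c in ("B", "C")}
```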

ratios and fold changes

Many interesting genes will be off in some of the conditions of interest

1. If you want the expression measure (“net normalized spot intensity”) to be an unbiased estimator of abundance

⇒ many values ≤ 0 ⇒ need something more than the (log-)ratio

2. If you let the expression measure be biased

⇒ can keep ratios. But how do you choose the bias?

Raw data are not mRNA concentrations

o tissue contamination

o clone identification and mapping

o image segmentation

o RNA degradation

o PCR yield, contamination

o signal quantification

o amplification efficiency

o spotting efficiency

o ‘background’ correction

o reverse transcription efficiency

o DNA-support binding

o hybridization efficiency and specificity

o other array manufacturing-related issues

The problem is less that these steps are ‘not perfect’; it is that they may vary from gene to gene, array to array, experiment to experiment.

Sources of variation

amount of RNA in the biopsy
efficiencies of
- RNA extraction
- reverse transcription
- labeling
- photodetection

PCR yield
DNA quality
spotting efficiency, spot size
cross-/unspecific hybridization
stray signal

Systematic ⇒ Calibration
o similar effect on many measurements
o corrections can be estimated from data

Stochastic ⇒ Error model
o too random to be explicitly accounted for
o “noise”

modeling ansatz

measured intensity = offset + gain × true abundance

$y_{ik} = a_{ik} + b_{ik} x_{ik}$

$a_{ik} = a_i + \varepsilon_{ik}$

$a_i$: per-sample offset

$\varepsilon_{ik} \sim N(0, b_i^2 s_1^2)$: “additive noise”

$b_{ik} = b_i b_k \exp(\eta_{ik})$

$b_i$: per-sample normalization factor

$b_k$: sequence-wise probe efficiency

$\eta_{ik} \sim N(0, s_2^2)$: “multiplicative noise”
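To see what the ansatz implies, one can simulate it. The sketch below draws intensities for one sample and one probe at several true abundances; all parameter values (offset 50, s1 = 20, s2 = 0.2, the seed) are arbitrary assumptions chosen for illustration, not estimates from real data.

```python
import math
import random
import statistics

def simulate_intensity(x, rng, a_i=50.0, b_i=1.0, b_k=1.0, s1=20.0, s2=0.2):
    """Draw one measured intensity y = offset + gain * x for true abundance x."""
    eps = rng.gauss(0.0, b_i * s1)   # additive noise, sd = b_i * s1
    eta = rng.gauss(0.0, s2)         # multiplicative noise, sd = s2
    offset = a_i + eps
    gain = b_i * b_k * math.exp(eta)
    return offset + gain * x

rng = random.Random(1)  # fixed seed so the sketch is reproducible
sd_by_abundance = {}
for x in (0.0, 100.0, 10000.0):
    draws = [simulate_intensity(x, rng) for _ in range(2000)]
    sd_by_abundance[x] = statistics.stdev(draws)
```

At x = 0 the spread is set by the additive term alone; at large x it grows roughly in proportion to x, which is the behaviour the error model is meant to capture.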

The two-component model

[Figure: panels on the raw scale and on the log scale, annotated with “additive” noise and “multiplicative” noise]

B. Durbin, D. Rocke, JCB 2001

Calibration ("normalization")

Correct for systematic variations.

To do: fit appropriate "correction parameters" a_i, b_i (and possibly more…) and apply to the data.

"Heteroskedasticity" (unequal variances) ⇒ weighted regression or variance stabilizing transformation

Outliers ⇒ use a robust method

the variance-mean dependence

data (cDNA slide):

relation between mean $u = E(Y_{ik})$ and variance $v = \mathrm{Var}(Y_{ik})$:

$v(u) = c^2 (u + u_0)^2 + s^2$

variance stabilization

$X_u$ a family of random variables with $E X_u = u$, $\mathrm{Var}\, X_u = v(u)$.

Define

$f(x) = \int^x \frac{du}{\sqrt{v(u)}}$

⇒ $\mathrm{Var}\, f(X_u) \approx$ independent of u

derivation: linear approximation
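The linear approximation can be spelled out in three lines. This is the standard delta-method argument, written here as a sketch; it assumes f is differentiable and that X_u stays close enough to u for the first-order expansion to be adequate.

```latex
\begin{align*}
f(X_u) &\approx f(u) + f'(u)\,(X_u - u) \\
\operatorname{Var} f(X_u) &\approx f'(u)^2 \, v(u) \\
f'(u) = \frac{1}{\sqrt{v(u)}} &\;\Rightarrow\; \operatorname{Var} f(X_u) \approx 1
\end{align*}
```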

[Figure: the transformation f(x); x-axis: raw scale (0 to 60000), y-axis: transformed scale]

variance stabilizing transformations

$f(x) = \int^x \frac{du}{\sqrt{v(u)}}$

1.) constant variance: $v(u) = \text{const} \Rightarrow f \propto u$

2.) const. coeff. of variation: $v(u) \propto u^2 \Rightarrow f \propto \log u$

3.) offset: $v(u) \propto (u + u_0)^2 \Rightarrow f \propto \log(u + u_0)$

4.) microarray: $v(u) \propto (u + u_0)^2 + s^2 \Rightarrow f \propto \operatorname{arsinh}\frac{u + u_0}{s}$
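A quick numerical check of the four cases: for a variance stabilizing f, the first-order variance f'(u)^2 v(u) should not depend on u. The sketch below verifies this with central finite differences; the constants `U0` and `S`, the step size and the test abundances are arbitrary assumptions.

```python
import math

U0, S = 100.0, 50.0  # assumed offset u_0 and additive scale s

# Each case: (variance function v, candidate transformation f)
CASES = {
    "constant variance": (lambda u: 1.0, lambda u: u),
    "constant CV": (lambda u: u ** 2, lambda u: math.log(u)),
    "offset": (lambda u: (u + U0) ** 2, lambda u: math.log(u + U0)),
    "microarray": (lambda u: (u + U0) ** 2 + S ** 2,
                   lambda u: math.asinh((u + U0) / S)),
}

def stabilized_variance(v, f, u, h=1e-3):
    """First-order variance of f(X_u): f'(u)^2 * v(u), with a numerical derivative."""
    deriv = (f(u + h) - f(u - h)) / (2 * h)
    return deriv ** 2 * v(u)

results = {
    name: [stabilized_variance(v, f, u) for u in (10.0, 1000.0, 50000.0)]
    for name, (v, f) in CASES.items()
}
```

In each case the three values agree (here they are all close to 1), whereas applying, say, the plain log to the "microarray" variance function would not give a constant.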

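the arsinh transformation

[Figure: log u (dashed) and arsinh((u+uo)/c) (solid) plotted against intensity, from -200 to 1000]

$\operatorname{arsinh}(x) = \log\left(x + \sqrt{x^2 + 1}\right)$

$\lim_{x \to \infty} \left(\operatorname{arsinh} x - \log x - \log 2\right) = 0$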
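Both properties are easy to check numerically. A minimal sketch follows; the evaluation points are arbitrary, and `arsinh` is written out from the log formula only for comparison with the standard library's `math.asinh`.

```python
import math

def arsinh(x):
    """arsinh via the closed form log(x + sqrt(x^2 + 1))."""
    return math.log(x + math.sqrt(x * x + 1.0))

# Gap between arsinh(x) and log(x) + log(2) shrinks as x grows.
gaps = {x: arsinh(x) - math.log(x) - math.log(2.0)
        for x in (1.0, 10.0, 100.0, 1000.0)}

# Unlike the logarithm, arsinh is defined at zero and for negative intensities.
at_zero = math.asinh(0.0)
at_negative = math.asinh(-200.0)
```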

parameter estimation

$\operatorname{arsinh}\left(\frac{Y_{ik} - a_i}{b_i}\right) = \mu_k + \varepsilon_{ik}, \quad \varepsilon_{ik} \sim N(0, c^2)$

o maximum likelihood estimator: straightforward – but sensitive to deviations from normality

o model holds for genes that are unchanged; differentially transcribed genes act as outliers.

o robust variant of ML estimator, à la Least Trimmed Sum of Squares regression.

o works as long as <50% of genes are differentially transcribed

measured intensity = offset + gain × true abundance

$y_{ik} = a_{ik} + b_{ik} x_{ik}$

$a_{ik} = a_i + L_{ik} + \varepsilon_{ik}$

$a_i$: per-sample offset

$L_{ik}$: local background provided by image analysis

$\varepsilon_{ik} \sim N(0, b_i^2 s_1^2)$: “additive noise”

$b_{ik} = b_i b_k \exp(\eta_{ik})$

$b_i$: per-sample normalization factor

$b_k$: sequence-wise labeling efficiency

$\eta_{ik} \sim N(0, s_2^2)$: “multiplicative noise”

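Least trimmed sum of squares regression

[Figure: scatterplot of y against x (both 0 to 8) with two fitted lines: least sum of squares and least trimmed sum of squares]

minimize $\sum_{i=1}^{n/2} \left(y_{(i)} - f(x_{(i)})\right)^2$

where $(i)$ indexes the observations in order of increasing squared residual, so that only the n/2 best-fitting points enter the sum.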
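A minimal sketch of least trimmed squares for a straight line, to contrast with ordinary least squares. The search strategy used here (start from the line through every pair of points, then refit repeatedly on the n/2 best-fitting points) is one common heuristic and an assumption of this example, not the procedure used in vsn; the toy data and function names are also made up, and the x values are assumed distinct.

```python
from itertools import combinations

def fit_line(points):
    """Ordinary least squares fit y = intercept + slope * x."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    slope = sxy / sxx
    return my - slope * mx, slope

def trimmed_ss(points, intercept, slope, h):
    """Sum of the h smallest squared residuals."""
    sq = sorted((y - intercept - slope * x) ** 2 for x, y in points)
    return sum(sq[:h])

def lts_line(points, c_steps=10):
    """Approximate LTS fit: best of many starts, each refined by refitting."""
    h = len(points) // 2
    best = None
    for (x1, y1), (x2, y2) in combinations(points, 2):
        if x1 == x2:
            continue
        slope = (y2 - y1) / (x2 - x1)
        intercept = y1 - slope * x1
        for _ in range(c_steps):
            ranked = sorted(points,
                            key=lambda p: (p[1] - intercept - slope * p[0]) ** 2)
            intercept, slope = fit_line(ranked[:h])  # refit on best-fitting half
        objective = trimmed_ss(points, intercept, slope, h)
        if best is None or objective < best[0]:
            best = (objective, intercept, slope)
    return best[1], best[2]

# Seven points on the line y = x, three outliers at y = 0.
data = [(float(x), float(x)) for x in range(7)] + [(7.0, 0.0), (8.0, 0.0), (9.0, 0.0)]
ols_intercept, ols_slope = fit_line(data)
lts_intercept, lts_slope = lts_line(data)
```

The least squares slope is pulled close to zero by the three outliers, while the trimmed fit recovers the line through the majority of points.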

evaluation: effects of different data transformations

[Figure: difference red-green against rank(average) for different data transformations; panel label: Coefficient of variation. cDNA slide: H. Sueltmann]

evaluation: a benchmark for Affymetrix genechip expression measures

o Data:
  Spike-in series: from Affymetrix, 59 x HGU95A, 16 genes, 14 concentrations, complex background
  Dilution series: from GeneLogic, 60 x HGU95Av2, liver & CNS cRNA in different proportions and amounts

o Benchmark: 15 quality measures regarding
  - reproducibility
  - sensitivity
  - specificity
  Put together by Rafael Irizarry (Johns Hopkins) http://affycomp.biostat.jhsph.edu

ROC curves

[Figure: affycomp results (28 Sep 2003), on a scale labeled from "good" to "bad"]

Summary

log-ratio: $\log \frac{x_i}{x_j}$

'glog' (generalized log-ratio): $\log \frac{x_i + \sqrt{x_i^2 + c_i^2}}{x_j + \sqrt{x_j^2 + c_j^2}}$

- interpretation as "fold change"

+ interpretation even in cases where genes are off in some conditions

+ visualization

+ can use standard statistical methods (hypothesis testing, ANOVA, clustering, classification…) without the worries about low-level variability that are often warranted on the log-scale
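The two measures can be compared directly. In this sketch the function names and the value c = 50 are assumptions for illustration; in practice the constants would come from the calibration fit.

```python
import math

def glog(x, c):
    """Generalized log: log(x + sqrt(x^2 + c^2))."""
    return math.log(x + math.sqrt(x * x + c * c))

def glog_ratio(xi, xj, ci, cj):
    """Generalized log-ratio of two net intensities."""
    return glog(xi, ci) - glog(xj, cj)

# For intensities well above c, the glog-ratio is close to the log-ratio ...
high = glog_ratio(3000.0, 1000.0, 50.0, 50.0)
log_ratio_high = math.log(3000.0 / 1000.0)

# ... and it stays finite when one intensity is zero or negative.
off = glog_ratio(200.0, 0.0, 50.0, 50.0)
negative = glog_ratio(200.0, -20.0, 50.0, 50.0)
```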

Availability

o implementation in R

o open source package vsn on www.bioconductor.org

o Bioconductor is an international collaboration on open source software for bioinformatics and statistical omics

Quality control:

diagnostic plots and artifacts

Scatterplot, colored by PCR-plate

Two RZPD Unigene II filters (cDNA nylon membranes)

PCR plates

PCR plates: boxplots

array batches

print-tip effects

[Figure: empirical distribution functions F̂(q) of the log-ratio q = log(fg.green/fg.red), array 41 (a42-u07639vene.txt), by spotting pin (1:1 … 4:4)]

spotting pin quality decline

[Figure panels: after delivery of 3x10^5 spots; after delivery of 5x10^5 spots]

H. Sueltmann DKFZ/MGA

spatial effects

[Figure: spatial images of R, Rb and R-Rb, color scale by rank; spotted cDNA arrays, Stanford-type]

[Figure: another array, print-tip layout; color scale ~ log(G) and color scale ~ rank(G)]

Batches: array to array differences $d_{ij} = \mathrm{mad}_k(h_{ik} - h_{jk})$

arrays i=1…63; roughly sorted by time

[Figure: image of d_ij over all pairs of arrays; axes 1:nrhyb]
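The distance d_ij can be computed as follows. This is a sketch assuming h_ik are the transformed intensities of probe k on array i; the three toy arrays are made up, and no scaling constant (such as the 1.4826 used by some MAD implementations) is applied.

```python
import statistics

def mad(values):
    """Median absolute deviation from the median (unscaled)."""
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values])

def array_distance(h_i, h_j):
    """d_ij = mad over probes k of (h_ik - h_jk)."""
    return mad([a - b for a, b in zip(h_i, h_j)])

arrays = {
    "array1": [7.0, 8.0, 9.0, 10.0, 11.0],
    "array2": [7.1, 8.0, 8.9, 10.1, 11.0],   # close replicate of array1
    "array3": [6.0, 9.5, 8.0, 12.0, 10.5],   # e.g. from a different batch
}

d12 = array_distance(arrays["array1"], arrays["array2"])
d13 = array_distance(arrays["array1"], arrays["array3"])
```

Because the median of the differences is subtracted first, a constant shift between two arrays does not contribute to d_ij; only the spread of the differences does.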

Density representation of the scatterplot (76,000 clones, RZPD Unigene-II filters)

Oligonucleotide chips

Affymetrix files

Main software from Affymetrix: MAS - MicroArray Suite.

DAT file: Image file, ~10^7 pixels, ~50 MB.

CEL file: probe intensities, ~400000 numbers

CDF file: Chip Description File. Describes which probes go in which probe sets (genes, gene fragments, ESTs).

Image analysis

DAT image files → CEL files

Each probe cell: 10x10 pixels.

Gridding: estimate location of probe cell centers.

Signal:
– Remove outer 36 pixels → 8x8 pixels.
– The probe cell signal, PM or MM, is the 75th percentile of the 8x8 pixel values.

Background: Average of the lowest 2% probe cells is taken as the background value and subtracted.

Compute also quality values.

Data and notation

PM_ijg, MM_ijg = intensities for perfect match and mismatch probe j for gene g in chip i

i = 1,…, n: one to hundreds of chips

j = 1,…, J: usually 11 or 16 probe pairs

g = 1,…, G: 6…30,000 probe sets.

Tasks:
o calibrate (normalize) the measurements from different chips (samples)
o summarize for each probe set the probe level data, i.e., 16 PM and MM pairs, into a single expression measure
o compare between chips (samples) for detecting differential expression

expression measures: MAS 4.0

Affymetrix GeneChip MAS 4.0 software uses AvDiff, a trimmed mean:

$\mathrm{AvDiff} = \frac{1}{\#J} \sum_{j \in J} (PM_j - MM_j)$

o sort $d_j = PM_j - MM_j$
o exclude highest and lowest value
o J := those pairs within 3 standard deviations of the average
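The three steps can be sketched as below. The slide leaves some details open; this example assumes that the average and standard deviation are computed after excluding the highest and lowest difference, and that J is then selected from all pairs. It is an illustration of the idea, not the MAS 4.0 implementation, and it needs at least four probe pairs.

```python
import statistics

def avdiff(pm, mm):
    """AvDiff-like trimmed mean of PM - MM differences for one probe set."""
    d = sorted(p - m for p, m in zip(pm, mm))
    trimmed = d[1:-1]                      # exclude highest and lowest value
    center = statistics.mean(trimmed)
    spread = statistics.stdev(trimmed)
    # J: pairs within 3 standard deviations of the average (assumed: of all pairs)
    kept = [v for v in d if abs(v - center) <= 3 * spread]
    return statistics.mean(kept)

# Toy probe set with one aberrant probe pair (difference 900).
pm = [200, 220, 210, 205, 215, 1000]
mm = [100, 100, 100, 100, 100, 100]
value = avdiff(pm, mm)
plain_mean = statistics.mean(p - m for p, m in zip(pm, mm))
```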

Expression measures MAS 5.0

Instead of MM, use "repaired" version CT:

CT = MM if MM < PM
CT = PM / "typical log-ratio" if MM >= PM

"Signal" = Tukey.Biweight(log(PM-CT))   (… median)

Tukey Biweight: B(x) = (1 – (x/c)^2)^2 if |x|<c, 0 otherwise
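A one-step Tukey biweight location estimate built on the weight function B(x) can be sketched as follows. Standardizing by the median absolute deviation and the values c = 5 and eps = 1e-4 are assumptions of this example; they are not stated on the slide.

```python
import statistics

def biweight(x, c=5.0):
    """B(x) = (1 - (x/c)^2)^2 if |x| < c, 0 otherwise."""
    return (1 - (x / c) ** 2) ** 2 if abs(x) < c else 0.0

def tukey_biweight_location(values, c=5.0, eps=1e-4):
    """Weighted mean with biweight weights on MAD-standardized distances."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    # eps avoids division by zero when more than half the values coincide
    weights = [biweight((v - med) / (mad + eps), c) for v in values]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

# log-scale probe values with one outlier
log_values = [7.0, 7.1, 6.9, 7.05, 6.95, 12.0]
signal = tukey_biweight_location(log_values)
plain_mean = statistics.mean(log_values)
```

The outlier at 12.0 receives weight zero, so the estimate stays near the median, as the "(… median)" remark suggests.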

Expression measures: Li & Wong

dChip fits a model for each gene

$PM_{ij} - MM_{ij} = \theta_i \phi_j + \varepsilon_{ij}, \quad \varepsilon_{ij} \sim N(0, \sigma^2)$

where

– $\theta_i$: expression index for the gene in chip i

– $\phi_j$: probe sensitivity

Maximum likelihood estimate of MBEI (model-based expression index) is used as expression measure of the gene in chip i.

Need at least 10 or 20 chips.

Current version works with PMs only.

Affymetrix: $I_{PM} = I_{MM} + I_{specific}$ ?

[Figure: log(PM/MM), with reference value 0.] From: R. Irizarry et al., Biostatistics 2002

Chemistry

$\log Y = \log x + \sum_{i=1}^{25} w_i(s_i)$

position- and sequence-specific effects $w_i(s)$: Naef et al., Phys Rev E 68 (2003)

Expression measures RMA: Irizarry et al. (2002)

o Estimate one global background value b=mode(MM). No probe-specific background!

o Assume: PM = s_true + b. Estimate s ≥ 0 from PM and b as a conditional expectation E[s_true|PM, b].

o Use log2(s).

o Nonparametric nonlinear calibration ('quantile normalization') across a set of chips.
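Quantile normalization forces all chips to share the same empirical distribution. A minimal sketch: sort each chip, average across chips at each rank, then give every probe the average for its rank. The function name and toy data are assumptions, and ties are broken by position rather than averaged.

```python
def quantile_normalize(chips):
    """chips: list of equal-length lists, one list of intensities per chip."""
    n = len(chips[0])
    sorted_chips = [sorted(chip) for chip in chips]
    # Reference distribution: mean across chips at each rank.
    reference = [sum(col) / len(chips) for col in zip(*sorted_chips)]
    normalized = []
    for chip in chips:
        order = sorted(range(n), key=lambda i: chip[i])  # probe indices by rank
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]
        normalized.append(out)
    return normalized

chips = [[5.0, 2.0, 3.0],
         [4.0, 1.0, 6.0]]
normalized = quantile_normalize(chips)
```

After normalization every chip contains the same set of values; only the assignment of values to probes differs between chips.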

Robust expression measures RMA: Irizarry et al. (2002)

AvDiff-like:

$\mathrm{RMA} = \frac{1}{\#A} \sum_{j \in A} \log_2(PM_j - BG_j)$

with A a set of “suitable” pairs.

Li-Wong-like: additive model

$\log_2(PM_{ij} - BG) = a_i + b_j + \varepsilon_{ij}$

Estimate RMA = a_i for chip i using robust method median polish (successively remove row and column medians, accumulate terms, until convergence). Works with d>=2
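Median polish for the additive model can be sketched as below, with rows as chips i and columns as probes j. This follows Tukey's textbook procedure; the iteration limit, tolerance and toy table are assumptions of this example, and it is not the code of the affy package. The chip-level expression value is taken as overall term plus row effect.

```python
import statistics

def median_polish(table, max_iter=10, tol=1e-9):
    """Fit table[i][j] ~ overall + row_eff[i] + col_eff[j] by sweeping medians."""
    resid = [row[:] for row in table]
    n_rows, n_cols = len(resid), len(resid[0])
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols
    for _ in range(max_iter):
        # Remove row medians and accumulate them in the row effects.
        row_med = [statistics.median(r) for r in resid]
        for i in range(n_rows):
            row_eff[i] += row_med[i]
            resid[i] = [v - row_med[i] for v in resid[i]]
        shift = statistics.median(col_eff)
        col_eff = [c - shift for c in col_eff]
        overall += shift
        # Remove column medians and accumulate them in the column effects.
        col_med = [statistics.median(col) for col in zip(*resid)]
        for j in range(n_cols):
            col_eff[j] += col_med[j]
            for i in range(n_rows):
                resid[i][j] -= col_med[j]
        shift = statistics.median(row_eff)
        row_eff = [r - shift for r in row_eff]
        overall += shift
        if max(abs(m) for m in row_med + col_med) < tol:
            break
    return overall, row_eff, col_eff, resid

# Additive toy data: chip effects 8, 9, 10 and probe effects -1, 0, 1,
# plus one outlying probe value on the first chip.
chip_effects = [8.0, 9.0, 10.0]
probe_effects = [-1.0, 0.0, 1.0]
table = [[a + b for b in probe_effects] for a in chip_effects]
table[0][2] += 5.0

overall, row_eff, col_eff, resid = median_polish(table)
expression = [overall + r for r in row_eff]
```

The outlier ends up in the residuals and does not change the chip-level estimates, which is the point of using medians rather than means.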

Software for pre-processing of Affymetrix data

• Bioconductor R package affy.
• Background estimation.
• Probe-level normalization.
• Expression measures
• Two main functions: ReadAffy, expresso.
• Can use vsn as a normalization method for expresso.

References

Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. YH Yang, S Dudoit, P Luu, DM Lin, V Peng, J Ngai and TP Speed. Nucl. Acids Res. 30(4):e15, 2002.

Variance Stabilization Applied to Microarray Data Calibration and to the Quantification of Differential Expression. W. Huber, A. v. Heydebreck, H. Sültmann, A. Poustka, M. Vingron. Bioinformatics, Vol. 18, Suppl. 1, S96-S104, 2002.

A Variance-Stabilizing Transformation for Gene Expression Microarray Data. BP Durbin, JS Hardin, DM Hawkins, DM Rocke. Bioinformatics, Vol. 18, Suppl. 1, S105-S110.

Exploration, Normalization, and Summaries of High Density Oligonucleotide Array Probe Level Data. RA Irizarry, B Hobbs, F Collin, YD Beazer-Barclay, KJ Antonellis, U Scherf, TP Speed (2002). Accepted for publication in Biostatistics. http://biosun01.biostat.jhsph.edu/~ririzarr/papers/index.html

A more complete list of references is in: Elementary analysis of microarray gene expression data. W. Huber, A. von Heydebreck, M. Vingron, manuscript. http://www.dkfz-heidelberg.de/abt0840/whuber/
