1

Pre-processing - Normalization
Databases

Statistics for Microarray Data Analysis – Lecture 2
The Fields Institute for Research in Mathematical Sciences
May 25, 2002

2

Biological question

Biological verification and interpretation

Microarray experiment

Experimental design

Image analysis

Normalization

Testing, Estimation, Clustering, Discrimination

Analysis

3

Diagnostic Plots

4

• Was the experiment a success?

• What analysis tools should be used?

• Are there any specific problems?

Begin by looking at the data

5

Red/Green overlay images

Good: low background, lots of differential expression (d.e.)

Bad: high background, ghost spots, little d.e.

Co-registration and overlay offers a quick visualization, revealing information on colour balance, uniformity of hybridization, spot uniformity, background, and artifacts such as dust or scratches.

6

Always log, always rotate

log2 R vs log2 G

M = log2 (R/G) vs A = log2 √(RG)
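The "log and rotate" step can be sketched in a few lines. This is a minimal illustration, assuming strictly positive red and green intensities; the function name `ma_transform` and the sample values are hypothetical and not part of the lecture.

```python
import math

def ma_transform(red, green):
    """Return (M, A) lists for paired red/green spot intensities.

    M = log2(R/G) is the log ratio; A = log2(sqrt(R*G)) is the average
    log intensity, i.e. (log2 R + log2 G) / 2.
    """
    m_values, a_values = [], []
    for r, g in zip(red, green):
        # Assumption: intensities are strictly positive.
        log_r, log_g = math.log2(r), math.log2(g)
        m_values.append(log_r - log_g)
        a_values.append(0.5 * (log_r + log_g))
    return m_values, a_values

# Hypothetical intensities, for illustration only.
M, A = ma_transform([1024.0, 256.0, 512.0], [256.0, 256.0, 2048.0])
```

Plotting M against A is the 45-degree rotation (with rescaling) of the log2 R vs log2 G scatter plot, so intensity-dependent trends appear as departures from the horizontal line M = 0.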

7

Histograms

Signal/Noise = log2(spot intensity/background intensity)

8

Slide 3 of the swirl data: used in all that follows.

9

MA-plot

10

Spatial plots: background

11

Spatial plots: log ratios (M)

12

Boxplots: print-tip effects

13

Boxplots: plate effects

14

Time of printing effects

15

Serious time of printing effects

Another data set: green channel intensities (log2 G) against spot number. Printed over 4.5 days.

16

Normalization

17

Normalization

Why? To correct for systematic differences between samples on the same slide, or between slides, which do not represent true biological variation between samples.

How do we know it is necessary? By examining self-self hybridizations, where no true differential expression is occurring.

We find dye biases which vary with overall spot intensity, location on the array, plate origin, pins, scanning parameters, and so on.

18

Self-self hybridizations

False color overlay

Boxplots within pin-groups

Scatter (MA-)plots

19

A series of non self-self hybridizations

From the NCI60 data set

Early Ngai lab, UC Berkeley

Early PMCRI, Melbourne, Australia

Early Goodman lab, UC Berkeley

20

Normalization

• Within-slides

- Which genes to use?
- Location normalization
- Scale normalization

• Paired-slides (dye-swap)

- Self-normalization

• Between-slides

21

Within-slide normalization

22

Which spots to use?

• All genes on the array.

• Constantly expressed genes (housekeeping).

• Controls
- Spiked controls (e.g. plant genes);
- Genomic DNA titration series.

• Rank invariant set.

23

Which spots to use? (cont.)

The LOWESS lines can be run through many different sets of points, and each strategy has its own implicit set of assumptions justifying its applicability. For example, we can justify the use of a global LOWESS approach by supposing that, when stratified by mRNA abundance, a) only a minority of genes are expected to be differentially expressed, or b) any differential expression is as likely to be up-regulation as down-regulation. Print-tip-group LOWESS requires stronger assumptions: that one of the above applies within each print-tip group. The use of other sets of genes, e.g. control or housekeeping genes, involves similar assumptions.

24

Location normalization

log2 R/G -> log2 R/G - c = log2 R/(kG)

Standard practice (in most software):

c is a constant such as the mean or median log ratio.

Our preference:

c is a function of overall spot intensity A and print-tip group, and possibly other variables.

Compute c by robust locally weighted regression of M on these variables, e.g. lowess.
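The constant-shift ("standard practice") version of location normalization can be sketched as follows. This is a minimal illustration, assuming the median log ratio over all spots is used for c; the function name and sample values are hypothetical.

```python
import math
from statistics import median

def global_median_normalize(m_values):
    """Shift all log ratios by one constant c, the median log ratio."""
    c = median(m_values)
    return c, [m - c for m in m_values]

# Hypothetical log ratios M = log2(R/G), for illustration only.
M = [0.5, 1.5, 1.0, 0.0, 2.0]
c, M_norm = global_median_normalize(M)

# Subtracting c on the log scale equals rescaling the green channel by k = 2**c.
k = 2.0 ** c
R, G = 6.0, 3.0
same_shift = math.isclose(math.log2(R / G) - c, math.log2(R / (k * G)))
```

After the shift the median of the normalized log ratios is zero, but any dependence of M on intensity A or on print-tip group is left untouched, which is why a function c(A) is preferred.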

25

Location normalization: details

a) Normalization based on a global adjustment

log2 R/G -> log2 R/G - c = log2 R/(kG)

Choices for k or c = log2 k are: c = median or mean of the log ratios for a particular gene set (e.g. housekeeping genes); or total intensity normalization, where k = ∑Ri / ∑Gi.

b) Intensity-dependent normalization.

Here we run a line through the middle of the MA plot, shifting the M value of the pair (A,M) by c=c(A), i.e.

log2 R/G -> log2 R/G - c (A) = log2 R/(k(A)G).

One estimate of c(A) is made using the LOWESS function of Cleveland (1979): LOcally WEighted Scatterplot Smoothing.
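Intensity-dependent normalization can be illustrated with a small smoother. The sketch below is a simplified stand-in for LOWESS: it fits a tricube-weighted local line around each spot but omits the robustness iterations of Cleveland's procedure. The function name `local_linear_smooth`, the `frac` parameter, and the data are assumptions made for illustration.

```python
def local_linear_smooth(a_values, m_values, frac=0.5):
    """Tricube-weighted local linear fit of M on A, evaluated at each A.

    Simplified sketch: one pass only, no robustness re-weighting.
    """
    n = len(a_values)
    k = min(n, max(2, int(round(frac * n))))  # neighbours used per local fit
    fitted = []
    for a0 in a_values:
        dists = sorted(abs(a - a0) for a in a_values)
        h = dists[k - 1] or 1e-12  # bandwidth: distance to the k-th nearest spot
        sw = swa = swm = swaa = swam = 0.0
        for a, m in zip(a_values, m_values):
            u = abs(a - a0) / h
            w = (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0  # tricube weight
            sw += w
            swa += w * a
            swm += w * m
            swaa += w * a * a
            swam += w * a * m
        denom = sw * swaa - swa * swa
        if abs(denom) < 1e-12:
            fitted.append(swm / sw)  # degenerate case: weighted mean
        else:
            slope = (sw * swam - swa * swm) / denom
            fitted.append((swm - slope * swa) / sw + slope * a0)
    return fitted

# Hypothetical spots with a purely intensity-dependent dye bias.
A = [7.0 + 0.5 * i for i in range(12)]
M = [0.2 * (a - 9.0) for a in A]
c_of_A = local_linear_smooth(A, M)
M_norm = [m - c for m, c in zip(M, c_of_A)]  # M - c(A)
```

Because the hypothetical bias here is exactly linear in A, the fitted curve reproduces it and the normalized values are zero up to rounding. On real data a tested LOWESS implementation with robustness iterations would be the safer choice.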

26

MA-plot

27

MA-plot

28

Boxplot: print-tip effects remain after global normalization

29

Normalization: details (cont.)

c) Within print-tip group normalization.

In addition to intensity-dependent variation in log ratios, spatial bias can also be a significant source of systematic error. Most normalization methods do not correct for spatial effects produced by hybridization artifacts or print-tip or plate effects during the construction of the microarrays.

It is possible to correct for both print-tip and intensity-dependent bias by performing LOWESS fits to the data within print-tip groups, i.e.

log2 R/G -> log2 R/G - ci(A) = log2 R/(ki(A)G),

where ci(A) is the LOWESS fit to the MA-plot for the ith grid only.
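The grouping logic of within-print-tip-group normalization can be sketched as below. The smoother is passed in as a function so that a LOWESS fit can be substituted; the `median_fit` used here is a deliberately crude stand-in (a constant per group), and all names and data are hypothetical.

```python
from collections import defaultdict
from statistics import median

def normalize_within_groups(a_values, m_values, groups, fit):
    """Subtract a separately fitted curve c_i(A) within each print-tip group i.

    `fit(a_subset, m_subset)` must return one fitted value per spot.
    """
    members = defaultdict(list)
    for index, group in enumerate(groups):
        members[group].append(index)
    normalized = [0.0] * len(m_values)
    for indices in members.values():
        a_sub = [a_values[i] for i in indices]
        m_sub = [m_values[i] for i in indices]
        for i, c in zip(indices, fit(a_sub, m_sub)):
            normalized[i] = m_values[i] - c
    return normalized

def median_fit(a_subset, m_subset):
    """Stand-in smoother (assumption): a constant equal to the group median."""
    c = median(m_subset)
    return [c] * len(m_subset)

# Hypothetical data: group 1 is shifted up by 1, group 2 down by 1.
A = [8.0, 9.0, 10.0, 8.0, 9.0, 10.0]
M = [0.8, 1.0, 1.2, -1.2, -1.0, -0.8]
tips = [1, 1, 1, 2, 2, 2]
M_norm = normalize_within_groups(A, M, tips, median_fit)
```

Passing a LOWESS-type smoother as `fit` gives the intensity-dependent, per-grid correction described on the slide; the per-group median only removes a constant offset for each print tip.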

30

Print-tip normalized data: M vs A plot

31

Print-tip normalized data: spatial and box plots

32

Smoothed histograms of M values

Black: unnormalized; red: global median; green: global lowess; blue: print-tip lowess

33

Within-slide

Paired slide

34

Follow-up experiment

On each slide, half the spots (8) are differentially expressed, the other half are not.

35

Paired-slides: dye-swap

• Slide 1, M = log2 (R/G) - c

• Slide 2, M’ = log2 (R’/G’) - c’

Combine by subtracting the normalized log-ratios:

[ (log2 (R/G) - c) - (log2 (R’/G’) - c’) ] / 2

= [ log2 (R/G) + log2 (G’/R’) ] / 2

= [ log2 (RG’/(GR’)) ] / 2,

provided c = c’.

Assumption: the normalization functions are the same for the two slides.
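A small numerical check of self-normalization: if both slides of a dye-swap pair carry the same bias c, it cancels in (M - M')/2. The function name and the intensities below are hypothetical and chosen only to make the arithmetic visible.

```python
import math

def dye_swap_combine(r1, g1, r2, g2):
    """Return (M - M') / 2 per spot for a dye-swap pair of slides.

    Slide 1 gives M = log2(R/G); slide 2, with dyes swapped, gives
    M' = log2(R'/G'). No explicit normalization constant is subtracted.
    """
    m = [math.log2(r / g) for r, g in zip(r1, g1)]
    m_swapped = [math.log2(r / g) for r, g in zip(r2, g2)]
    return [(a - b) / 2 for a, b in zip(m, m_swapped)]

# Hypothetical pair with a shared dye bias of c = 1 (red reads 2x too bright).
# Spot 1: true fold change 4, so R/G = 8 on slide 1 and R'/G' = 0.5 on slide 2.
# Spot 2: no differential expression, so R/G = R'/G' = 2 on both slides.
combined = dye_swap_combine([8.0, 2.0], [1.0, 1.0], [1.0, 2.0], [2.0, 1.0])
```

The combined values recover log2(4) = 2 for the first spot and 0 for the second, without ever estimating c. If the two slides had different biases (c not equal to c'), the difference (c' - c)/2 would remain in the result.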

36

Checking the assumption

MA plot for slides 1 and 2: it isn’t always like this.

37

Result of self-normalization

(M - M’)/2 vs. (A + A’)/2

38

MSP titration series (Microarray Sample Pool)

Control set to aid intensity-dependent normalization

Different concentrations

Spotted evenly spread across the slide

Pool the whole library

39

MSP normalization compared to other methods

Yellow: GAPDH, tubulin

Light blue: MSP pool / titration

Orange: Schadt-Wong rank invariant set

Red line: lowess smooth

40

Composite normalization

Before and after composite normalization

- MSP lowess curve
- Global lowess curve
- Composite lowess curve
(Other colours: control spots)

ci(A) = αA g(A) + (1 - αA) fi(A)
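The composite curve is a pointwise weighted average of two fitted curves, with a weight that depends on intensity A. The sketch below assumes that g(A) and f_i(A) have already been evaluated at each spot, and it takes the weight to be the fraction of spots with intensity below A. That choice of weight, the function names, and the numbers are assumptions of this sketch and are not stated on the slide.

```python
from bisect import bisect_left

def intensity_weights(a_values):
    """Hypothetical weight per spot: fraction of spots with intensity below A."""
    ordered = sorted(a_values)
    n = len(ordered)
    return [bisect_left(ordered, a) / n for a in a_values]

def composite_fit(alpha, g_fit, f_fit):
    """c_i(A) = alpha_A * g(A) + (1 - alpha_A) * f_i(A), spot by spot."""
    return [w * g + (1.0 - w) * f for w, g, f in zip(alpha, g_fit, f_fit)]

# Hypothetical fitted values at four spots, for illustration only.
A = [8.0, 10.0, 12.0, 14.0]
g_fit = [0.4, 0.4, 0.4, 0.4]   # first curve evaluated at each A
f_fit = [0.0, 0.0, 0.0, 0.0]   # second curve evaluated at each A
alpha = intensity_weights(A)
c_composite = composite_fit(alpha, g_fit, f_fit)
```

With this weight the composite follows f_i at low intensity and moves toward g as intensity increases; a different weight function would change where the hand-over happens.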

41

Comparison of Normalization Schemes (courtesy of Jason Goncalves)

• No consensus on best segmentation or normalization method

• Scheme was applied to assess the common normalization methods

• Based on reciprocal labeling experiment data for a series of 140 replicate experiments on two different arrays each with 19,200 spots

42

DESIGN OF RECIPROCAL LABELING EXPERIMENT

Replicate experiment in which we assess the same mRNA pools but invert the fluors used.

The replicates are independent experiments and are scanned, quantified and normalized as usual.

43

The following relationship would be observed for reciprocal microarray experiments in which the slides are free of defects and the normalization scheme performed ideally:

$$\log_2\left(\mathrm{Ratio}_{\mathrm{GeneA}}^{\mathrm{Exp.1},\,\mathrm{Ch2/Ch1}}\right) = -\log_2\left(\mathrm{Ratio}_{\mathrm{GeneA}}^{\mathrm{Exp.2},\,\mathrm{Ch2/Ch1}}\right)$$

We can measure, using real data sets, how well each microarray normalization scheme approaches this ideal.

44

Deviation metric to assess normalization schemes

We now use the mean array average deviation to compare the normalization methods. Note that this comparison addresses only variance (precision) and not bias (accuracy) aspects of normalization.

$$\mathrm{Deviation}_{\mathrm{Spot}} = \left|\log_2\left(\mathrm{Ratio}_{\mathrm{GeneA}}^{\mathrm{Exp.1},\,\mathrm{Ch2/Ch1}}\right) + \log_2\left(\mathrm{Ratio}_{\mathrm{GeneA}}^{\mathrm{Exp.2},\,\mathrm{Ch2/Ch1}}\right)\right|$$

$$\mathrm{Deviation}_{\mathrm{ArrayAverage}} = \frac{1}{n}\sum_{N=1}^{n}\left|\log_2\left(\mathrm{Ratio}_{\mathrm{GeneN}}^{\mathrm{Exp.1},\,\mathrm{Ch2/Ch1}}\right) + \log_2\left(\mathrm{Ratio}_{\mathrm{GeneN}}^{\mathrm{Exp.2},\,\mathrm{Ch2/Ch1}}\right)\right|$$
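A deviation metric of this kind can be sketched as below. The sketch assumes that the per-spot deviation is the absolute value of the sum of the two log2 ratios from a reciprocal-labeling pair (zero when the ideal relationship holds) and that the array average is the mean over the n spots. The absolute value, the function names, and the sample ratios are assumptions for illustration.

```python
import math

def spot_deviation(ratio_exp1, ratio_exp2):
    """|log2(ratio in experiment 1) + log2(ratio in experiment 2)| for one spot."""
    return abs(math.log2(ratio_exp1) + math.log2(ratio_exp2))

def array_average_deviation(ratios_exp1, ratios_exp2):
    """Mean of the per-spot deviations over all n spots on the array."""
    deviations = [spot_deviation(a, b) for a, b in zip(ratios_exp1, ratios_exp2)]
    return sum(deviations) / len(deviations)

# Hypothetical normalized ratios for three spots on a reciprocal pair.
exp1 = [2.0, 4.0, 0.5]
exp2 = [0.5, 0.5, 2.0]   # spot 2 is not perfectly reciprocal (ideal would be 0.25)
average = array_average_deviation(exp1, exp2)
```

Spots 1 and 3 are perfectly reciprocal and contribute 0; spot 2 contributes |2 - 1| = 1, so the array average is 1/3. A lower average indicates that a normalization scheme brings reciprocal pairs closer to the ideal.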

45

Comparison of Normalization Methods - Using 140 19K Microarrays

[Bar chart. x-axis: Normalization Method (Pre Normalized, Global Intensity, Subarray Intensity, Global Ratio, Sub-Array Ratio, Global LOWESS, Subarray LOWESS). y-axis: Average Mean Deviation Value, ticks from 0.30 to 0.46.]

46

Multiple-slide

47

Scale normalization: between slides

Boxplots of log ratios from 3 replicate self-self hybridizations.

Before normalization After location normalization After scale normalization

48

Before normalization After location normalization After scale normalization

49

One way of taking scale into account

Assumption: All slides have the same spread in M.

True log ratio is mij, where i represents different slides and j represents different spots.

Observed is Mij, where Mij = ai mij.

Robust estimate of ai is

$$\hat{a}_i = \frac{\mathrm{MAD}_i}{\sqrt[I]{\prod_{i=1}^{I} \mathrm{MAD}_i}}$$

where

MADi = medianj { |Mij - medianj(Mij)| }
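The MAD-based scale adjustment can be sketched as follows, assuming the scale factor for slide i is its MAD divided by the geometric mean of the MADs over all I slides, and that log ratios have already been location-normalized. Function names and data are hypothetical, and every slide is assumed to have a non-zero MAD.

```python
import math
from statistics import median

def mad(values):
    """Median absolute deviation from the median."""
    center = median(values)
    return median([abs(v - center) for v in values])

def scale_factors(slides):
    """a_i = MAD_i / (geometric mean of MAD_1..MAD_I); assumes every MAD > 0."""
    mads = [mad(m_values) for m_values in slides]
    geometric_mean = math.exp(sum(math.log(x) for x in mads) / len(mads))
    return [x / geometric_mean for x in mads]

def scale_normalize(slides):
    """Divide each slide's log ratios by its scale factor: m_ij = M_ij / a_i."""
    factors = scale_factors(slides)
    return [[m / a for m in m_values] for m_values, a in zip(slides, factors)]

# Hypothetical log ratios from two slides with different spread.
slides = [[-1.0, 0.0, 1.0], [-4.0, 0.0, 4.0]]
factors = scale_factors(slides)       # MADs are 1 and 4, geometric mean 2
rescaled = scale_normalize(slides)
```

After rescaling, both hypothetical slides have the same MAD. Dividing by the geometric mean keeps the product of the scale factors equal to 1, so the overall scale of the log ratios is preserved.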

50

Summary of normalization

• Reduces systematic (not random) effects
• Makes it possible to compare several arrays

• Use log ratios (M vs A plots)
• Lowess normalization (dye bias)
• MSP titration series – composite normalization
• Pin-group location normalization
• Pin-group scale normalization
• Between slide scale normalization

• More? Use controls!
• Normalization introduces more variability
• Outliers (bad spots) are handled with replication

51

What is missing?

Principally, a discussion of data quality issues. Most image analysis programs collect a wide range of measurements associated with each spot: morphological measures such as area and perimeter (in pixels); uniformity measures such as the SD of foreground and background intensities in each channel, and of ratios of intensities (with and without background) across the pixels in a spot; and spot brightness indicators such as the ratio of spot foreground to spot background, and the fraction of pixels in the foreground with intensity greater than background intensity (or a given multiple thereof). From these, further derived measures can be calculated, such as coefficients of variation, and so on.

How should we make use of the various quality indicators? Most programs include procedures for flagging spots on the basis of one or more indicators, and users typically omit flagged spots from their primary analyses. “Data filtering” of this kind clearly improves the appearance of the data, but... can we do more? That is a longer story, for another time.

52

Acknowledgments

Natalie Thorne (WEHI)

Ingrid Lönnstedt (Uppsala)

The swirl data were provided by Katrin Wuennenberg-Stapleton from the Ngai Lab at UC Berkeley. The swirl embryos for this experiment were provided by David Kimelman and David Raible at the University of Washington.

The self-self data were provided by Vivian Peng from the Ngai Lab at UC Berkeley.

Reference:

Yang et al (2002) Nucleic Acids Research 30, e15.