Corrections and Normalization in microarrays data analysis Mauro Delorenzi.

  • date post

    22-Dec-2015

Page 1: Corrections and Normalization in microarrays data analysis Mauro Delorenzi.

Corrections and Normalization in microarrays data analysis

Mauro Delorenzi

Page 2: Corrections and Normalization in microarrays data analysis Mauro Delorenzi.

Acknowledgments

Uni. Cal. Statistics Berkeley / WEHI Bioinformatics

Terry Speed (Berkeley / WEHI)

Yee Hwa Yang (Berkeley)
Sandrine Dudoit (Stanford)
Ingrid Lönnstedt (Uppsala)
Yongchao Ge (Berkeley)
Natalie Thorne (WEHI)
Mauro Delorenzi (WEHI)

Collaborations with:

Peter Mac CI, Melb.
Brown-Botstein lab, Stanford
Matt Callow (LBNL)
CSIRO Image Analysis Group

Most slides were taken from our collection

Page 3: Corrections and Normalization in microarrays data analysis Mauro Delorenzi.

Workflow diagram:

Biological question (gene regulation, class prediction)
→ Experimental design
→ Microarray experiment
→ 16-bit TIFF files
→ Image analysis
→ (Rfg, Rbg), (Gfg, Gbg)
→ Normalization
→ R, G
→ Estimation, Testing, Clustering, Discrimination
→ Biological verification and interpretation

Page 4: Corrections and Normalization in microarrays data analysis Mauro Delorenzi.

cDNA clones (probes)

PCR product amplification, purification

printing

microarray

Hybridise target to microarray

mRNA (target)

excitation

laser 1, laser 2

emission

scanning

analysis

0.1 nl/spot

overlay images and normalise

Page 5: Corrections and Normalization in microarrays data analysis Mauro Delorenzi.

Scanner's Spots

Part of the image of one channel, false-coloured on a scale running from white (very high) and red (high) through yellow and green (medium) to blue (low) and black.

Page 6: Corrections and Normalization in microarrays data analysis Mauro Delorenzi.

Gene Expression Data

Gene expression data on p genes for n samples (rows = genes, columns = slides)

Genes   slide 1   slide 2   slide 3   slide 4   slide 5   …
1         0.46      0.30      0.80      1.51      0.90    ...
2        -0.10      0.49      0.24      0.06      0.46    ...
3         0.15      0.74      0.04      0.10      0.20    ...
4        -0.45     -1.03     -0.79     -0.56     -0.32    ...
5        -0.06      1.06      1.35      1.09     -1.09    ...

Gene expression level of gene 5 in slide 4 (here 1.09) = Log2( Red intensity / Green intensity)

These values are conventionally displayed on a red (>0) yellow (0) green (<0) scale.

Page 7: Corrections and Normalization in microarrays data analysis Mauro Delorenzi.

Some statistical questions

• Image analysis: addressing, segmenting, quantifying
• Normalisation: within and between slides
• Quality: of images, of spots, of (log) ratios
• Which genes are (relatively) up/down regulated?
• Assigning p-values to tests / confidence to results
• Planning of experiments: design, sample size
• Discrimination and allocation of samples
• Clustering, classification: of samples, of genes
• Selection of genes relevant to any given analysis
• Analysis of time course, factorial and other special experiments
• ... and more

Page 8: Corrections and Normalization in microarrays data analysis Mauro Delorenzi.

I. The simplest problem is identifying differentially expressed genes using one slide

• This is a common enough hope

• Efforts are frequently successful

• It is not hard to do by eye

• The problem is probably beyond formal statistical inference (valid p-values, etc) for the foreseeable future.

Page 9: Corrections and Normalization in microarrays data analysis Mauro Delorenzi.

Objectives

Important aspects of a statistical analysis include:

• Tentatively separating systematic sources of variation ("artefacts"), that bias the results, from random sources of variation ("noise"), that hide the truth.
• Removing the former and quantifying the latter
• Identifying and dealing with the most relevant source of variation in subsequent analyses

Only if this is done can we hope to make more or less valid probability statements about the confidence in the results.

Every correction is a new source of variability. There is a trade-off between gains and losses. The best method depends on the characteristics of the data, and these can vary.

Page 10: Corrections and Normalization in microarrays data analysis Mauro Delorenzi.

Typical Statistical Approach

Measured value = real value + systematic errors + noise

Corrected value = real value + noise

• Analysis of corrected value => (unbiased) CONCLUSIONS

• Estimation of noise => quality of CONCLUSIONS, statistical significance (level of confidence) of the conclusions

Page 11: Corrections and Normalization in microarrays data analysis Mauro Delorenzi.

Step 1: Background Correction

Image Analysis => Rfg ; Rbg ; Gfg ; Gbg (fg = foreground, bg = background). For each spot on the slide we calculate

Red intensity = R = Rfg - Rbg
Green intensity = G = Gfg - Gbg

M = Log2( Red intensity / Green intensity)

Subtraction of background values (additive background model, with the background assumed to be locally constant …)

Sources of background: probe sticking non-specifically to the slide, irregular / dirty slide surface, dust, noise in the scanner measurement

Not included: real cross-hybridisation and non-specific hybridisation to the probe
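The background-correction step can be sketched in a few lines of Python. This is only an illustration of the additive model described on the slide: the function name `log_ratio` and the spot intensities are invented for the example, and returning `None` for non-positive corrected intensities is one possible convention, not something the slides prescribe.

```python
import math

def log_ratio(rfg, rbg, gfg, gbg):
    """Return M = log2(R/G) for one spot after background subtraction.

    Returns None when a corrected intensity is not positive,
    because the log-ratio is undefined in that case (an assumed convention).
    """
    r = rfg - rbg  # background-corrected red intensity
    g = gfg - gbg  # background-corrected green intensity
    if r <= 0 or g <= 0:
        return None
    return math.log2(r / g)

# Hypothetical spots given as (Rfg, Rbg, Gfg, Gbg)
spots = [
    (2148, 100, 612, 100),  # R = 2048, G = 512
    (900, 120, 900, 120),   # R = G = 780
    (110, 130, 400, 90),    # red foreground below its background
]
m_values = [log_ratio(*spot) for spot in spots]
```

Spots whose foreground falls below the local background are a practical weakness of plain subtraction, which is why the sketch makes that case explicit rather than hiding it.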

Page 12: Corrections and Normalization in microarrays data analysis Mauro Delorenzi.

The intensity pairs (R, G) are highly processed data, and the methods used for image processing and background correction of the laser scan images can have a large impact. Before applying normalisation, inference, cluster analysis and the like, it is important to identify and remove systematic sources of variation, such as those due to different labeling efficiencies and scanning properties of the two dyes, or spatial inhomogeneities.

With many different users and protocols, the portion of the variation due to systematic effects can vary substantially.

There are many sources of systematic variation which affect the measured gene expression levels. Normalisation is the term used to describe the process of removing such variation.

Until the variation is properly accounted for or modelled, there is no question of the system being in statistical control and hence no basis for a statistical model to describe chance variation.

Page 13: Corrections and Normalization in microarrays data analysis Mauro Delorenzi.

Step 2: An M vs A (MVA) Plot

M = log R/G = log R - log G

A = (log R + log G) / 2

Plot legend: positive controls (spotted in varying concentrations), negative controls, blanks, lowess curve

Page 14: Corrections and Normalization in microarrays data analysis Mauro Delorenzi.

A reminder on logarithms

Page 15: Corrections and Normalization in microarrays data analysis Mauro Delorenzi.

A numerical example
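As a stand-in illustration with made-up intensities (these are not the numbers from the slide), the following sketch converts a pair (R, G) to (M, A) with base-2 logarithms and back again. The function names are hypothetical.

```python
import math

def ma_from_rg(r, g):
    """Convert intensities to M = log2 R - log2 G and A = (log2 R + log2 G) / 2."""
    log_r, log_g = math.log2(r), math.log2(g)
    return log_r - log_g, (log_r + log_g) / 2

def rg_from_ma(m, a):
    """Invert the transform: R = 2^(A + M/2), G = 2^(A - M/2)."""
    return 2 ** (a + m / 2), 2 ** (a - m / 2)

# R = 2048 = 2^11 and G = 512 = 2^9: a four-fold ratio gives M = 2, and A = 10
m, a = ma_from_rg(2048, 512)
r_back, g_back = rg_from_ma(m, a)
```

Swapping the two intensities flips the sign of M and leaves A unchanged, which is what makes the plot symmetric about M = 0.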

Page 16: Corrections and Normalization in microarrays data analysis Mauro Delorenzi.

Why use an M vs A plot?

1. Logs stretch out the region we are most interested in.
2. Can more clearly see features of the data such as intensity-dependent variation and dye-bias.
3. Differentially expressed genes are more easily identified.
4. Intuitive interpretation

Page 17: Corrections and Normalization in microarrays data analysis Mauro Delorenzi.

MVA plot: looking at data 1

S1.n. Control Slide: Dye Effect, Spread.

Plot labels: lowess curve, spot identifier

Page 18: Corrections and Normalization in microarrays data analysis Mauro Delorenzi.

MVA plot: looking at data 2

S1.p. Normalised data. Spread.

Page 19: Corrections and Normalization in microarrays data analysis Mauro Delorenzi.

MVA plot: looking at data 3

S4. A-dependent variability.

Page 20: Corrections and Normalization in microarrays data analysis Mauro Delorenzi.

MVA plot: analysing data 4

S17. Saturation

Page 21: Corrections and Normalization in microarrays data analysis Mauro Delorenzi.

MVA plot: looking at data 5: Unique effects of different scanners

Page 22: Corrections and Normalization in microarrays data analysis Mauro Delorenzi.

Step 3: Normalisation - median

• Assumption: Changes roughly symmetric

• First panel: smooth density of log2G and log2R.

• Second panel: M vs A plot with median put to zero
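Median normalisation amounts to subtracting one constant from every log-ratio on the slide. A minimal sketch, with a hypothetical function name and invented M values:

```python
from statistics import median

def median_normalise(m_values):
    """Shift all log-ratios so that their median becomes zero."""
    shift = median(m_values)
    return [m - shift for m in m_values]

# Hypothetical log-ratios with an overall bias towards the red channel
m_raw = [0.75, 0.5, 1.5, 0.25, 1.0]
m_norm = median_normalise(m_raw)
```

Because a single constant is removed, this corrects an overall dye imbalance but cannot remove a bias that changes with intensity A.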

Page 23: Corrections and Normalization in microarrays data analysis Mauro Delorenzi.

Step 4: Normalisation - lowess

• Assumption: changes roughly symmetric at all intensities.
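Intensity-dependent normalisation subtracts a smooth curve c(A) fitted to the MVA plot, so that the corrected value is M - c(A). The sketch below is a bare-bones stand-in for lowess: a tricube-weighted local straight-line fit without the robustness iterations of the full procedure. The function names, the `frac` default and the data are assumptions made for the illustration; in practice an established implementation would be used.

```python
def local_linear_fit(x, y, x0, frac=0.4):
    """Tricube-weighted straight-line fit of y on x, evaluated at x0.

    x0 is expected to be one of the x values.
    """
    n = len(x)
    k = min(n, max(3, round(frac * n)))          # neighbours used in each local fit
    h = sorted(abs(xi - x0) for xi in x)[k - 1]  # distance to the k-th nearest point
    if h == 0:
        h = 1e-12                                # all neighbours coincide with x0
    sw = swx = swy = swxx = swxy = 0.0
    for xi, yi in zip(x, y):
        dx = xi - x0                             # centring keeps the sums well scaled
        u = abs(dx) / h
        if u >= 1:
            continue                             # outside the local window
        w = (1 - u ** 3) ** 3                    # tricube weight
        sw += w
        swx += w * dx
        swy += w * yi
        swxx += w * dx * dx
        swxy += w * dx * yi
    denom = sw * swxx - swx * swx
    if abs(denom) < 1e-12:
        return swy / sw                          # degenerate window: weighted mean
    # Intercept of the weighted least-squares line, i.e. the fitted value at x0
    return (swxx * swy - swx * swxy) / denom

def lowess_normalise(m_values, a_values, frac=0.4):
    """Return M - c(A), with c estimated by the local linear smoother."""
    return [m - local_linear_fit(a_values, m_values, a, frac)
            for m, a in zip(m_values, a_values)]

# Hypothetical slide: a curved dye bias plus one spot with a real change at A = 10
a_values = [6 + 0.5 * i for i in range(17)]
m_values = [0.05 * (a - 10) ** 2 - 0.4 for a in a_values]
m_values[8] += 2.0
m_norm = lowess_normalise(m_values, a_values, frac=0.5)
```

Without robustness iterations the changed spot pulls the fitted curve towards itself, so its corrected M is somewhat shrunk. Full lowess downweights such points, which is why it relies on the assumption that most spots are unchanged.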

Page 24: Corrections and Normalization in microarrays data analysis Mauro Delorenzi.

A hypothetical quantitative model

a. linear response

Page 25: Corrections and Normalization in microarrays data analysis Mauro Delorenzi.

A realistic hypothetical quantitative model

b. power function-response

Figure labels: Median Effect, Scale Effect, Dye-Intensity Effect

Page 26: Corrections and Normalization in microarrays data analysis Mauro Delorenzi.

Step 5: Normalisation - between groups

• After within-slide global lowess normalization.
• Likely to be a spatial effect.

Plot axes: log-ratios, print-tip groups
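A location correction per print-tip group can be sketched as follows. For brevity this uses the group median as the location estimate, whereas a per-group smooth curve in A would follow the same pattern; the function name, the group labels and the data are invented for the example.

```python
from collections import defaultdict
from statistics import median

def group_median_normalise(m_values, groups):
    """Subtract from each log-ratio the median of its own print-tip group."""
    by_group = defaultdict(list)
    for m, g in zip(m_values, groups):
        by_group[g].append(m)
    centre = {g: median(values) for g, values in by_group.items()}
    return [m - centre[g] for m, g in zip(m_values, groups)]

# Hypothetical slide with two print-tip groups offset in opposite directions
m = [0.5, 0.75, 1.0, -0.5, -0.25, 0.0]
tips = [1, 1, 1, 2, 2, 2]
normalised = group_median_normalise(m, tips)
```

After the correction both groups are centred on zero, so a spatial offset between print tips no longer looks like differential expression.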

Page 27: Corrections and Normalization in microarrays data analysis Mauro Delorenzi.

Normalization between groups (ctd)

• After print-tip location- and scale-normalization.

Plot axes: log-ratios, print-tip groups

Page 28: Corrections and Normalization in microarrays data analysis Mauro Delorenzi.

Effects of Location Normalisation (example)

Panels: Before, After

Page 29: Corrections and Normalization in microarrays data analysis Mauro Delorenzi.

Step 6: Rescaling (Spread-Normalisation)

Assumption:

All (print-tip-)groups should have the same spread in M

True ratio is $\mu_{ij}$, where i represents different (print-tip-)groups and j represents different spots. Observed is $M_{ij}$, where $M_{ij} = a_i \log(\mu_{ij})$

Robust estimate of $a_i$ is

Corrected values are calculated as:
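The two formulas for this step are not preserved in the transcript. One common choice, taken here as an assumption in the spirit of the Yang et al. normalization report cited at the end, estimates each scale factor as the group's median absolute deviation (MAD) divided by the geometric mean of all group MADs, and then divides each log-ratio by its group's factor. The function names and data below are hypothetical.

```python
import math
from collections import defaultdict
from statistics import median

def mad(values):
    """Median absolute deviation from the median (a robust measure of spread)."""
    centre = median(values)
    return median([abs(v - centre) for v in values])

def spread_normalise(m_values, groups):
    """Rescale log-ratios so that every group has the same MAD.

    Assumed estimator: a_i = MAD_i / geometric mean of all MADs.
    Assumes every group has a non-zero MAD.
    """
    by_group = defaultdict(list)
    for m, g in zip(m_values, groups):
        by_group[g].append(m)
    mads = {g: mad(values) for g, values in by_group.items()}
    geo_mean = math.exp(sum(math.log(s) for s in mads.values()) / len(mads))
    a_hat = {g: s / geo_mean for g, s in mads.items()}
    rescaled = [m / a_hat[g] for m, g in zip(m_values, groups)]
    return rescaled, a_hat

# Hypothetical data: group 2 is four times as spread out as group 1
m = [-1.0, -0.5, 0.0, 0.5, 1.0, -4.0, -2.0, 0.0, 2.0, 4.0]
tips = [1] * 5 + [2] * 5
rescaled, a_hat = spread_normalise(m, tips)
```

Dividing by the geometric mean makes the factors multiply to one, so the overall scale of the log-ratios on the slide is left unchanged.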

Page 30: Corrections and Normalization in microarrays data analysis Mauro Delorenzi.

Illustration: print-tip-group normalisation

Assumption: for every print-tip group, changes are roughly symmetric at all intensities.

Glass Slide

Array of bound cDNA probes

4x4 blocks = 16 pin groups

Page 31: Corrections and Normalization in microarrays data analysis Mauro Delorenzi.

Step 7: Assessing Significance

MVA-plot and critical curves

Newton’s, Sapir & Churchill’s and Chen’s single-slide methods

Page 32: Corrections and Normalization in microarrays data analysis Mauro Delorenzi.

Other Approaches

These normalisation procedures are based on the assumption that a spot is as likely to be higher in the first dye as in the second. They work well with a high number of independent spots.

If (a few) genes were selected, another approach might be needed.

For the correction of dye-effects we recommend using either:
1. Paired dye-swapped slides and/or
2. Internal controls as spikes or a dilution series

In the second case, only the control spots, instead of all genes, are used to compute the corrections.

In the first case, the data from the two slides can be combined. Assuming identical dye-intensity interactions in the two slides, the effect is corrected by taking:

A = 0.5 (A1 + A2)

M = 0.5 (M1 – M2)

This procedure is called self-normalisation, as it is done spot-by-spot. A number of controls indicate whether it is working well. It also deals with some artifacts that cause some genes to be always higher in one dye than in the other.
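A small sketch shows why the dye-swap combination cancels a dye bias that is identical on both slides. The function name and the numbers are invented for the illustration.

```python
def self_normalise(m1, a1, m2, a2):
    """Combine a dye-swap pair spot by spot.

    M = 0.5 * (M1 - M2) and A = 0.5 * (A1 + A2). A bias that is added to
    both M1 and M2 cancels in the difference.
    """
    m = [0.5 * (x - y) for x, y in zip(m1, m2)]
    a = [0.5 * (x + y) for x, y in zip(a1, a2)]
    return m, a

# Hypothetical pair of spots with true log-ratios 1.0 and 0.0 and a dye bias of 0.25
m_slide1 = [1.0 + 0.25, 0.0 + 0.25]    # sample labelled red
m_slide2 = [-1.0 + 0.25, 0.0 + 0.25]   # dyes swapped: the true ratio changes sign
a_slide1 = [10.0, 8.0]
a_slide2 = [10.5, 8.5]
m, a = self_normalise(m_slide1, a_slide1, m_slide2, a_slide2)
```

The true signal changes sign between the two slides while the bias does not, so the half-difference keeps the signal and removes the bias.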

Page 33: Corrections and Normalization in microarrays data analysis Mauro Delorenzi.

II. The second simplest problem is identifying differentially expressed genes using replicated slides

There are a number of different aspects:
• First, between-slide normalization; then
• What should we look at: averages, SDs, t-statistics, other summaries?
• How should we look at them?
• Can we make valid probability statements?

Page 34: Corrections and Normalization in microarrays data analysis Mauro Delorenzi.

Selecting genes up/down regulated 1

Results from the Apo AI ko experiment

• M
• t
• t M

Page 35: Corrections and Normalization in microarrays data analysis Mauro Delorenzi.

Which genes are (relatively) up/down regulated?

Two samples, e.g. KO vs. WT or mutant vs. WT

Design: T vs. C, n slides

For each gene form the t statistic:

$$t = \frac{\text{average of } n \text{ trt } M\text{s}}{\sqrt{\tfrac{1}{n}\,(\text{SD of } n \text{ trt } M\text{s})^2}}$$

Selecting genes up/down regulated

Two samples with a reference (e.g. pooled control)

Design: T vs. C* (n slides), C vs. C* (n slides)

• For each gene form the t statistic:

$$t = \frac{\text{average of } n \text{ trt } M\text{s} - \text{average of } n \text{ ctl } M\text{s}}{\sqrt{\tfrac{1}{n}\left[(\text{SD of } n \text{ trt } M\text{s})^2 + (\text{SD of } n \text{ ctl } M\text{s})^2\right]}}$$
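The two t statistics can be computed per gene as in the sketch below. It assumes the usual sample standard deviation (divisor n - 1) and equal numbers of treatment and control slides; the function names and data are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

def t_direct(ms):
    """t statistic for n replicate log-ratios from direct T vs. C slides."""
    n = len(ms)
    return mean(ms) / sqrt(stdev(ms) ** 2 / n)

def t_reference(trt, ctl):
    """t statistic for n treatment and n control slides sharing a reference."""
    n = len(trt)  # assumes len(trt) == len(ctl)
    return (mean(trt) - mean(ctl)) / sqrt((stdev(trt) ** 2 + stdev(ctl) ** 2) / n)

# Hypothetical log-ratios for one gene
t1 = t_direct([1.0, 2.0, 3.0])
t2 = t_reference([1.0, 2.0, 3.0], [-1.0, 0.0, 1.0])
```

Both statistics divide an average effect by its estimated standard error, so a gene with a very small SD can obtain a large t even when its average M is close to zero.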

Page 36: Corrections and Normalization in microarrays data analysis Mauro Delorenzi.

Which genes have changed? When permutation testing is possible

1. For each gene and each hybridisation (8 ko + 8 ctl), use M=log2(R/G).

2. For each gene form the t statistic:

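$$t = \frac{\text{average of 8 ko } M\text{s} - \text{average of 8 ctl } M\text{s}}{\sqrt{\tfrac{1}{8}\left[(\text{SD of 8 ko } M\text{s})^2 + (\text{SD of 8 ctl } M\text{s})^2\right]}}$$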

3. Form a histogram of 6,000 t values.

4. Do a normal Q-Q plot; look for values “off the line”.

5. Permutation testing.

6. Adjust for multiple testing.
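Steps 2, 5 and 6 can be sketched together as follows. The slides do not say which multiple-testing adjustment is used, so the single-step max-|t| adjustment below is an assumption chosen because it reuses the same permutations; the function names and the small 4 + 4 data set are invented to keep the example fast (70 relabellings instead of the 12,870 possible with 8 + 8).

```python
from itertools import combinations
from math import sqrt
from statistics import mean, stdev

def t_stat(trt, ctl):
    """Two-sample t statistic for equal group sizes."""
    n = len(trt)
    return (mean(trt) - mean(ctl)) / sqrt((stdev(trt) ** 2 + stdev(ctl) ** 2) / n)

def permutation_pvalues(data, n_trt):
    """Return (unadjusted, adjusted) permutation p-values, one per gene.

    data: one list of M values per gene, treatment slides first.
    Every relabelling of the slides into two groups is enumerated.
    """
    n_slides = len(data[0])
    observed = [abs(t_stat(row[:n_trt], row[n_trt:])) for row in data]
    raw_hits = [0] * len(data)
    adj_hits = [0] * len(data)
    n_perm = 0
    for trt_idx in combinations(range(n_slides), n_trt):
        chosen = set(trt_idx)
        perm_t = []
        for row in data:
            trt = [row[i] for i in range(n_slides) if i in chosen]
            ctl = [row[i] for i in range(n_slides) if i not in chosen]
            perm_t.append(abs(t_stat(trt, ctl)))
        max_t = max(perm_t)  # largest |t| over all genes for this relabelling
        for g, obs in enumerate(observed):
            if perm_t[g] >= obs - 1e-12:  # small tolerance for rounding
                raw_hits[g] += 1
            if max_t >= obs - 1e-12:
                adj_hits[g] += 1
        n_perm += 1
    return [h / n_perm for h in raw_hits], [h / n_perm for h in adj_hits]

# Hypothetical data: 3 genes, 4 ko slides followed by 4 ctl slides
data = [
    [2.1, 1.9, 2.3, 2.0, 0.1, -0.2, 0.05, -0.1],    # clearly changed
    [0.3, -0.1, 0.2, -0.4, 0.1, -0.3, 0.25, -0.15],
    [0.05, 0.4, -0.2, 0.1, 0.3, -0.35, 0.15, -0.05],
]
raw_p, adj_p = permutation_pvalues(data, n_trt=4)
```

With only 4 + 4 slides the smallest attainable two-sided p-value is 2/70, which illustrates why a reasonable number of replicates is needed before permutation testing becomes informative.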

Page 37: Corrections and Normalization in microarrays data analysis Mauro Delorenzi.

Histogram & qq plot

ApoA1

Page 38: Corrections and Normalization in microarrays data analysis Mauro Delorenzi.

Adjusted and Unadjusted p-values for the 50 genes with the largest absolute t-statistics.

Page 39: Corrections and Normalization in microarrays data analysis Mauro Delorenzi.

Which genes have changed? When permutation testing is not possible

Our current approach is to use M-averages, SDs, t-statistics and a new statistic we call B, inspired by empirical Bayes.

We hope in due course to calibrate B and use that as our main tool.

$$B = \text{const} + \log \frac{\dfrac{2a}{n} + s^2 + M_\bullet^2}{\dfrac{2a}{n} + s^2 + \dfrac{M_\bullet^2}{1 + nc}}$$

Empirical Bayes log posterior odds ratio
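The behaviour of B can be explored with a short sketch. It drops the constant, takes s² to be the sample variance of the replicate M values (the slide does not define it), and uses arbitrary illustrative values for the hyperparameters a and c, which in practice are estimated from all genes. The function name and data are hypothetical.

```python
from math import log
from statistics import mean, variance

def b_statistic(ms, a, c):
    """Log-odds style statistic B for one gene, without the additive constant.

    ms: replicate log-ratios M for the gene.
    a, c: hyperparameters (illustrative values here, not estimates).
    """
    n = len(ms)
    m_bar = mean(ms)    # average log-ratio of the gene
    s2 = variance(ms)   # sample variance, an assumed definition of s^2
    num = 2 * a / n + s2 + m_bar ** 2
    den = 2 * a / n + s2 + m_bar ** 2 / (1 + n * c)
    return log(num / den)

# Two hypothetical genes with the same very small spread
b_large_effect = b_statistic([2.0, 2.1, 1.9, 2.0], a=1.0, c=1.0)
b_tiny_effect = b_statistic([0.05, 0.06, 0.04, 0.05], a=1.0, c=1.0)
```

The term 2a/n keeps the denominator away from zero, so a gene with a tiny variance and a tiny average M receives a small B even though its t statistic is large.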

Page 40: Corrections and Normalization in microarrays data analysis Mauro Delorenzi.

• T
• B
• t M B
• t B

Page 41: Corrections and Normalization in microarrays data analysis Mauro Delorenzi.

Remarks for multiarray experiments

• Microarray experiments typically have thousands of genes, but only a few (1-10) replicates for each gene.

• Averages can be driven by outliers.

• Ts can be driven by tiny variances.

• B = LOR will, we hope
– use information from all the genes
– combine the best of M. and T
– avoid the problems of M. and T

Page 42: Corrections and Normalization in microarrays data analysis Mauro Delorenzi.

Some web sites:

Technical reports, talks, software etc.

http://www.stat.berkeley.edu/users/terry/zarray/Html/

Especially:

Dudoit et al: “Statistical methods for …”

Yee Hwa Yang et al. “Normalization for cDNA Microarray Data”

Statistical software R “GNU’s S” http://lib.stat.cmu.edu/R/CRAN/

Packages within R environment:

-- Spot http://www.cmis.csiro.au/iap/spot.htm

-- SMA (statistics for microarray analysis) http://www.stat.berkeley.edu/users/terry/zarray/Software/smacode.html