12/5/20151 Microarray Data Pre-Processing. 12/5/20152 Copyright notice Many of the images in this...


Microarray Data Pre-Processing

Copyright notice

• Many of the images in this PowerPoint presentation are the work of other people. The copyright belongs to the original authors. Thanks!

Microarray data analysis: preprocessing

The main goal of data preprocessing is to remove the systematic bias in the data as completely as possible, while preserving the variation in gene expression that occurs because of biologically relevant changes in transcription.

Microarray data analysis: preprocessing

Observed differences in gene expression could be due to transcriptional changes, or they could be caused by artifacts such as:

• different labeling efficiencies of Cy3, Cy5

• uneven spotting of DNA onto an array surface

• variations in RNA purity or quantity

• variations in washing efficiency

• variations in scanning efficiency


Microarray data analysis: preprocessing

• Image analysis

• Background correction

• Normalization

• Summarization


Image analysis

• The raw data from a cDNA microarray experiment consist of pairs of image files, 16-bit TIFFs, one for each of the dyes.

• Image analysis is required to extract measures of the red and green fluorescence intensities for each spot on the array.

Steps in Image Processing

1. Addressing: locate the spot centers.

2. Segmentation: classify pixels as either signal or background (e.g., using seeded region growing).

3. Information extraction: for each spot of the array, calculate signal intensity pairs, background, and quality measures.


Addressing

This is the process of assigning coordinates to each of the spots.

Automating this part of the procedure permits high throughput analysis.

4 by 4 grids; 19 by 21 spots per grid

Addressing

Registration


Problems in automatic addressing

Misregistration of the red and green channels

Rotation of the array in the image

Skew in the array


Segmentation methods

• Fixed circles

• Adaptive circle

• Adaptive shape

– Edge detection.

– Seeded Region Growing (R. Adams and L. Bischof, 1994): regions grow outwards from the seed points preferentially according to the difference between a pixel’s value and the running mean of values in an adjoining region.

• Histogram methods

– Adaptive threshold.

Examples of algorithms and software implementation

Methods            Software / algorithms
Fixed Circle       ScanAlyze, GenePix, QuantArray
Adaptive Circle    GenePix
Adaptive Shape     Edging and region growing
Histogram Method   QuantArray and adaptive thresholding

Limitation of fixed circle method

[Figure: segmentation results, SRG vs. fixed circle]

Limitation of circular segmentation

– Small spot

– Not circular

Results from SRG


Information Extraction

• Spot Intensities

– mean (pixel intensities).

– median (pixel intensities).

– Pixel variation (IQR of log (pixel intensities)).

• Background values

– Local

– Morphological opening

– Constant (global)

– None

• Quality Information

Signal

Background
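The spot intensity summaries listed above can be sketched in a few lines. This is only an illustration: the function name `spot_summary` is hypothetical, base-2 logs are an assumption (the slide only says "log"), and the pixel values are made up.

```python
import numpy as np

def spot_summary(pixels):
    """Summarize the pixel intensities inside one spot mask (all values > 0)."""
    pixels = np.asarray(pixels, dtype=float)
    # Base 2 is an assumption; the slide does not name the log base.
    log_pixels = np.log2(pixels)
    q1, q3 = np.percentile(log_pixels, [25, 75])
    return {
        "mean": float(pixels.mean()),
        "median": float(np.median(pixels)),
        "log_iqr": float(q3 - q1),  # pixel variation: IQR of log intensities
    }

# Made-up pixel values for one spot.
summary = spot_summary([1, 2, 4, 8, 16])
```

The same function could be applied to the background pixels of a spot to obtain the background summaries.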

Background Correction

• Recall that spot signal, or simply signal, is fluorescence intensity due to target molecules hybridized to probe sequences contained in a spot (what we would like to measure) plus background fluorescence (what we would rather not measure).

• Background is fluorescence that may contribute to spot pixel intensities but is not due to fluorescence from target molecules hybridized to spot probe sequences.

• The idea is to remove background fluorescence from the spot signal fluorescence because the spot signal is believed to be a sum of fluorescence due to background and fluorescence due to hybridized target cDNA.


Local background

• Focusing on small regions surrounding the spot mask.

• Median of pixel values in this region

• Most software packages implement such an approach: ScanAlyze, ImaGene, Spot, GenePix

• By not considering the pixels immediately surrounding the spots, the background estimate is less sensitive to the performance of the segmentation procedure

Global background

• Global method which subtracts a constant background for all spots

• Some findings suggest that the binding of fluorescent dyes to ‘negative control spots’ is lower than the binding to the glass slide

– More meaningful to estimate background based on a set of negative control spots

– If no negative control spots: approximation of the average background = third percentile of all the spot foreground values
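A minimal sketch of the global background estimate follows. The function name `global_background` is hypothetical, and summarizing the negative control spots by their median is an assumption; the slide only says to base the estimate on them.

```python
import numpy as np

def global_background(foreground, negative_controls=None):
    """Return one constant background value for a whole array."""
    if negative_controls is not None and len(negative_controls) > 0:
        # Assumption: use the median of the negative control spot signals.
        return float(np.median(negative_controls))
    # Fallback from the slide: third percentile of all spot foreground values.
    return float(np.percentile(foreground, 3))

# Made-up foreground values 0, 1, ..., 100.
bg_from_percentile = global_background(np.arange(101))
bg_from_controls = global_background(np.arange(101), negative_controls=[10, 12, 50])
```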

Background Correction Strategies (applied prior to logging signal intensity)

1. Subtract local background, e.g., signal mean – background mean or signal mean – background median

This can increase variation in measurements, especially for low expressing genes. Some believe that local background will overestimate the background contribution to spot fluorescence. Background fluorescence where cDNA has been spotted may be different than background where no cDNA has been spotted.

Background Correction Strategies (applied prior to logging signal intensity)

2. For each spot, find the local background of the spot as well as the local backgrounds of all neighboring spots. Compute the median or mean of these local backgrounds. Subtract that summary of local backgrounds from the spot’s signal.

This is similar to option 1 but can reduce some variation in background estimation.

Background Correction Strategies (applied prior to logging signal intensity)

3. Find the median or mean of local backgrounds in a sector. Subtract the sector summary of local backgrounds from each signal in the sector.

4. Subtract the median or mean of blank spot signals or negative control signals in a sector from all other signals in a sector.

5. Estimate the background for each spot by fitting a row and column model to the local background values in a sector. (See next slide.)

Modeling local backgrounds within each sector (Kafadar and Phang. (2003). CSDA 44 313-338)

$$b_{ij} = m + r_i + c_j + e_{ij}$$

where

• $b_{ij}$ is the background for the spot in the ith row and jth column of the sector

• $m$ is the baseline background for the sector

• $r_i$ is the row effect for the sector

• $c_j$ is the column effect for the sector

• $e_{ij}$ is the residual

An estimated background $\hat{b}_{ij}$ for each spot is obtained via median polish.
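A sketch of median polish for the row and column model is given here. It follows the standard Tukey procedure of alternately sweeping out row and column medians; the exact variant used by Kafadar and Phang is not shown on the slide, so treat the function name `median_polish`, the iteration limit, and the tolerance as assumptions.

```python
import numpy as np

def median_polish(b, max_iter=10, tol=1e-6):
    """Fit b[i, j] = m + r[i] + c[j] + residual by alternating median sweeps."""
    resid = np.asarray(b, dtype=float).copy()
    m = 0.0
    r = np.zeros(resid.shape[0])
    c = np.zeros(resid.shape[1])
    for _ in range(max_iter):
        # Sweep row medians out of the residuals into the row effects.
        row_med = np.median(resid, axis=1)
        resid -= row_med[:, None]
        r += row_med
        # Move the median of the column effects into the baseline.
        shift = np.median(c)
        c -= shift
        m += shift
        # Sweep column medians out of the residuals into the column effects.
        col_med = np.median(resid, axis=0)
        resid -= col_med[None, :]
        c += col_med
        # Move the median of the row effects into the baseline.
        shift = np.median(r)
        r -= shift
        m += shift
        if np.abs(row_med).max() < tol and np.abs(col_med).max() < tol:
            break
    fitted = m + r[:, None] + c[None, :]  # estimated background per spot
    return m, r, c, fitted

# Made-up, exactly additive local backgrounds for a 3 x 3 sector.
local_bg = 10 + np.array([-1.0, 0.0, 1.0])[:, None] + np.array([-2.0, 0.0, 2.0])[None, :]
m, r, c, fitted = median_polish(local_bg)
```

Because medians are used, a few spots with unusually high local background have little influence on the fitted values.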

Comments on Background Correction

• Subtracting background may result in negative or zero adjusted signal values. Such values cannot be logged. One simple approach is to replace all negative values by zero, add one to all values (whether zero or not), and log the resulting values.
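The simple approach described above could look like this in code. The function name `log_adjusted_signal` and the choice of base-2 logs are assumptions; the input numbers are made up.

```python
import numpy as np

def log_adjusted_signal(signal, background):
    """Background-correct, floor at zero, add one, then take logs."""
    adjusted = np.asarray(signal, dtype=float) - np.asarray(background, dtype=float)
    adjusted = np.maximum(adjusted, 0.0)  # replace negative values by zero
    return np.log2(adjusted + 1.0)        # add one to every value, then log (base 2 assumed)

# Made-up signals and backgrounds: one positive, one negative, one zero difference.
logged = log_adjusted_signal([100, 50, 7], [36, 60, 7])
```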

Data Normalization

• Large sets of experiments involve dozens to hundreds of arrays

• To make the arrays comparable, the data need to be normalized

• Because equal amounts of mRNA are used in all arrays, the spot intensities of an array should sum to a fixed number

What is Normalization?

• Normalization describes the process of removing (or minimizing) non-biological variation in the measured gene expression levels of hybridized mRNA so that biological differences can be more easily detected.

• Typically normalization is attempting to remove global effects, i.e., effects that can be seen by examining plots that show all the data for a slide or slides.

• Normalization does not necessarily have anything to do with the normal distribution that plays a prominent role in statistics.

Sources of Non-Biological Variation

• Dye bias: differences in heat and light sensitivity, efficiency of dye incorporation

• Differences in the amount of labeled cDNA hybridized to each channel in a microarray experiment

– Channel is used to refer to a combination of a dye and a slide.

• Variation across replicate slides

• Variation across hybridization conditions

• Variation in scanning conditions

• Variation among technicians doing the lab work

• etc.

Normalization Methods for Two-Color Microarray Data


Side-by-side boxplots show examples of variation across channels.

[Figure: side-by-side boxplots for Slide 1 Cy3, Slide 1 Cy5, Slide 2 Cy3, and Slide 2 Cy5, annotated with minimum, Q1 = 25th percentile, median, Q3 = 75th percentile, and maximum]

Interquartile range (IQR) is Q3-Q1. Points more than 1.5*IQR above Q3 or more than 1.5*IQR below Q1 are displayed individually.

[Figure: boxplot annotated with minimum, Q1 = 25th percentile, median, Q3 = 75th percentile, and maximum]
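The 1.5*IQR rule can be sketched as follows. The function name `boxplot_outliers` is hypothetical, and `np.percentile` uses linear interpolation by default, which may differ slightly from the quartile definition used by a particular plotting package.

```python
import numpy as np

def boxplot_outliers(values):
    """Return the points a boxplot would display individually."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return values[(values < lower) | (values > upper)]

# Made-up log signals with one extreme value.
outliers = boxplot_outliers([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
```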


One of the simplest normalization strategies is to align the log signals so that all channels have the same median.

• The value of the common median is not important for subsequent analyses.

• A convenient choice is zero so that positive or negative values reflect signals above or below the median for a particular channel.

• If negative normalized signal values seem confusing, any positive constant may be added to all values after normalization to zero medians.

[Figure: boxplots of log mean signal centered at 0 for each channel]

Note that medians match but variation seems to differ greatly across channels.

[Figure: boxplots of log mean signal centered at 0 for each channel]

Scale normalization (Yang, et al. 2002. Nucleic Acids Research, 30, 4 e15)

Consider a matrix X with i=1,...,I rows and j=1,...,J columns.

Let xij denote the entry in row i and column j.

We will apply scale normalization to the matrix of log signal mean values that have already been median centered (each row corresponds to a gene and each column corresponds to a channel).

For each column j, let mj=median(x1j, x2j, ..., xIj).

For each column j, let MADj=median(|x1j-mj|,|x2j-mj|,...,|xIj-mj|).

MAD: median absolute deviation

To scale normalize the columns of X to a constant value C, multiply all the entries in the jth column by C/MADj for all j=1,...,J.

A common choice for C is the geometric mean of MAD1,...,MADJ:

$$C = \left( \prod_{j=1}^{J} \mathrm{MAD}_j \right)^{1/J}$$

The choice of C will not affect subsequent tests or p-values but will affect fold change calculations.

*Yang et al. recommended scale normalization for log R/G values.

Data after Median Centering and Scale Normalizing

[Figure: boxplots of log mean signal (centered and scaled) for each channel]

A Simple Example

Gene  Slide1Cy3  Slide1Cy5  Slide2Cy3  Slide2Cy5
1     8          15         9          13
2     7          2          7          15
3     3          6          5          8
4     1          5          2          9
5     9          13         6          11

Determine Channel Medians

Gene     Slide1Cy3  Slide1Cy5  Slide2Cy3  Slide2Cy5
1        8          15         9          13
2        7          2          7          15
3        3          6          5          8
4        1          5          2          9
5        9          13         6          11
medians  7          6          6          11

Subtract Channel Medians

Gene  Slide1Cy3  Slide1Cy5  Slide2Cy3  Slide2Cy5
1     1          9          3          2
2     0          -4         1          4
3     -4         0          -1         -3
4     -6         -1         -4         -2
5     2          7          0          0

This is the data after median centering.

Find Median Absolute Deviations

Gene  Slide1Cy3  Slide1Cy5  Slide2Cy3  Slide2Cy5
1     1          9          3          2
2     0          -4         1          4
3     -4         0          -1         -3
4     -6         -1         -4         -2
5     2          7          0          0
MAD   2          4          1          2

Find Scaling Constant

Gene  Slide1Cy3  Slide1Cy5  Slide2Cy3  Slide2Cy5
1     1          9          3          2
2     0          -4         1          4
3     -4         0          -1         -3
4     -6         -1         -4         -2
5     2          7          0          0
MAD   2          4          1          2

C = (2*4*1*2)^(1/4) = 2

Find Scaling Factors

Gene             Slide1Cy3  Slide1Cy5  Slide2Cy3  Slide2Cy5
1                1          9          3          2
2                0          -4         1          4
3                -4         0          -1         -3
4                -6         -1         -4         -2
5                2          7          0          0
Scaling factors  2/2        2/4        2/1        2/2

Scale Normalize the Median Centered Data

Gene  Slide1Cy3  Slide1Cy5  Slide2Cy3  Slide2Cy5
1     1          4.5        6          2
2     0          -2.0       2          4
3     -4         0.0        -2         -3
4     -6         -0.5       -8         -2
5     2          3.5        0          0

This is the data after median centering and scale normalizing.
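The worked example can be reproduced with a short NumPy sketch. The function names `median_center` and `scale_normalize` are hypothetical. The geometric mean is computed through logs, which assumes every channel has a nonzero MAD.

```python
import numpy as np

# Rows are genes 1-5; columns are Slide1Cy3, Slide1Cy5, Slide2Cy3, Slide2Cy5.
log_signal = np.array([
    [8, 15, 9, 13],
    [7, 2, 7, 15],
    [3, 6, 5, 8],
    [1, 5, 2, 9],
    [9, 13, 6, 11],
], dtype=float)

def median_center(x):
    """Subtract each channel's median so that every column has median zero."""
    return x - np.median(x, axis=0)

def scale_normalize(x):
    """Rescale each column so that its MAD equals the geometric mean of all MADs."""
    m = np.median(x, axis=0)
    mad = np.median(np.abs(x - m), axis=0)
    c = np.exp(np.mean(np.log(mad)))  # geometric mean; assumes all MADs > 0
    return x * (c / mad), mad, c

centered = median_center(log_signal)
scaled, mad, c = scale_normalize(centered)
```

With the example data, `mad` comes out as 2, 4, 1, 2 and `c` as 2, so only the Slide1Cy5 and Slide2Cy3 columns change.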

Slide 1 Log Signal Means after Median Centering and Scaling All Channels

[Figure: log red plotted against log green]

Evidence of intensity-dependent dye bias

M vs. A Plot of the Logged, Centered, and Scaled Slide 1 Data

[Figure: M = Log Red - Log Green plotted against A = (Log Green + Log Red) / 2]

To handle intensity-dependent dye bias, Yang, et al. (2002. Nucleic Acids Research, 30, 4 e15) recommend “lowess” normalization prior to median centering and scale normalizing.

“lowess” stands for LOcally WEighted polynomial regreSSion.

The original reference for lowess is

Cleveland, W. S. (1979). Robust locally weighted regression and smoothing scatterplots. JASA 74 829-836.


LOESS

• At each point in the data set a low-degree polynomial is fit to a subset of the data, with explanatory variable values near the point whose response is being estimated.

• The polynomial is fit using weighted least squares, giving more weight to points near the point whose response is being estimated and less weight to points further away.

• The value of the regression function for the point is then obtained by evaluating the local polynomial using the explanatory variable values for that data point.

• The LOESS fit is complete after regression function values have been computed for each of the n data points.

From Wikipedia, the free encyclopedia

Slide 1 Log Signal Means

[Figure: log red plotted against log green]

M vs. A Plot for Slide 1 Log Signal Means

[Figure: M = Log Red - Log Green plotted against A = (Log Green + Log Red) / 2]

M vs. A Plot for Slide 1 Log Signal Means with lowess fit (f=0.40)

[Figure: M = Log Red - Log Green plotted against A = (Log Green + Log Red) / 2]

Adjust M Values

[Figure: M = Log Red - Log Green plotted against A = (Log Green + Log Red) / 2]

M vs. A Plot after Adjustment

[Figure: M = Adjusted Log Red – Adjusted Log Green plotted against A = (Adjusted Log Green + Adjusted Log Red) / 2]

M vs. A Plot for Slide 1 Log Signal Means

[Figure: adjusted log red plotted against adjusted log green]

adjusted log red = log red – adj/2

adjusted log green = log green + adj/2

where adj = lowess fitted value

M vs. A Plot for Slide 1 Log Signal Means with lowess fit (f=0.40)

[Figure: M = Log Red - Log Green plotted against A = (Log Green + Log Red) / 2, with the lowess fitted value 0.883 marked at A=7]

For spots with A=7, the lowess fitted value is 0.883. Thus the value of adj discussed on the previous slide is 0.883 for spots with A=7.

The M value for such spots would be moved down by 0.883. The log red value would be decreased by 0.883/2 and the log green value would be increased by 0.883/2 to obtain adjusted log red and adjusted log green values, respectively.
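The adjustment can be checked numerically with a small sketch. The function names and the log red and log green values (8 and 6, chosen so that A=7) are hypothetical; only adj=0.883 comes from the slide.

```python
def ma_values(log_red, log_green):
    """Return (M, A) for one spot."""
    m = log_red - log_green
    a = (log_green + log_red) / 2
    return m, a

def adjust_channels(log_red, log_green, adj):
    """Split the lowess fitted value adj evenly between the two channels."""
    return log_red - adj / 2, log_green + adj / 2

# Hypothetical spot with A = 7.
log_red, log_green = 8.0, 6.0
m, a = ma_values(log_red, log_green)
adj_red, adj_green = adjust_channels(log_red, log_green, adj=0.883)
m_adj, a_adj = ma_values(adj_red, adj_green)
```

Splitting adj evenly between the channels lowers M by exactly adj while leaving A unchanged.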

How is the lowess curve determined? Weight function

Suppose we have data points (x1,y1), (x2,y2),...(xn,yn).

Let 0 < f ≤ 1 denote a fraction that will determine the smoothness of the curve.

Let r = n*f rounded to the nearest integer.

For i=1, ..., n; let hi be the rth smallest number among |xi-x1|, |xi-x2|, ..., |xi-xn|.

Consider the tricube weight function defined as

$$T(t) = \begin{cases} (1 - |t|^3)^3 & \text{for } |t| < 1 \\ 0 & \text{for } |t| \ge 1. \end{cases}$$

[Figure: tricube weight function T(t) plotted against t]

For k=1, 2, ..., n; let $w_k(x_i) = T\big((x_k - x_i)/h_i\big)$.

An Example

i    1   2   3   4   5   6   7   8   9   10
xi   1   2   5   7   12  13  15  25  27  30
yi   1   8   4   5   3   9   16  15  23  29

[Figure: scatterplot of y against x]

Suppose a lowess curve will be fit to this data with f=0.4.

Table Containing |xi-xj| Values

      x1  x2  x3  x4  x5  x6  x7  x8  x9  x10
x1     0   1   4   6  11  12  14  24  26  29
x2     1   0   3   5  10  11  13  23  25  28
x3     4   3   0   2   7   8  10  20  22  25
x4     6   5   2   0   5   6   8  18  20  23
x5    11  10   7   5   0   1   3  13  15  18
x6    12  11   8   6   1   0   2  12  14  17
x7    14  13  10   8   3   2   0  10  12  15
x8    24  23  20  18  13  12  10   0   2   5
x9    26  25  22  20  15  14  12   2   0   3
x10   29  28  25  23  18  17  15   5   3   0

Calculation of hi from |xi-xj| Values

n=10, f=0.4, r=4

      x1  x2  x3  x4  x5  x6  x7  x8  x9  x10   hi
x1     0   1   4   6  11  12  14  24  26  29   h1 = 6
x2     1   0   3   5  10  11  13  23  25  28   h2 = 5
x3     4   3   0   2   7   8  10  20  22  25   h3 = 4
x4     6   5   2   0   5   6   8  18  20  23   h4 = 5
x5    11  10   7   5   0   1   3  13  15  18   h5 = 5
x6    12  11   8   6   1   0   2  12  14  17   h6 = 6
x7    14  13  10   8   3   2   0  10  12  15   h7 = 8
x8    24  23  20  18  13  12  10   0   2   5   h8 = 10
x9    26  25  22  20  15  14  12   2   0   3   h9 = 12
x10   29  28  25  23  18  17  15   5   3   0   h10 = 15

Weights wk(xi) Rounded to Nearest 0.001 (rows: i, columns: k)

i \ k  1      2      3      4      5      6      7      8      9      10
1      1.000  0.986  0.348  0.000  0.000  0.000  0.000  0.000  0.000  0.000
2      0.976  1.000  0.482  0.000  0.000  0.000  0.000  0.000  0.000  0.000
3      0.000  0.193  1.000  0.670  0.000  0.000  0.000  0.000  0.000  0.000
4      0.000  0.000  0.820  1.000  0.000  0.000  0.000  0.000  0.000  0.000
5      0.000  0.000  0.000  0.000  1.000  0.976  0.482  0.000  0.000  0.000
6      0.000  0.000  0.000  0.000  0.986  1.000  0.893  0.000  0.000  0.000
7      0.000  0.000  0.000  0.000  0.850  0.954  1.000  0.000  0.000  0.000
8      0.000  0.000  0.000  0.000  0.000  0.000  0.000  1.000  0.976  0.670
9      0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.986  1.000  0.954
10     0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.893  0.976  1.000

$$w_6(x_5) = \left(1 - \left(\frac{|x_6 - x_5|}{h_5}\right)^3\right)^3 = \left(1 - \left(\frac{|13 - 12|}{5}\right)^3\right)^3 = \left(1 - \frac{1}{125}\right)^3 \approx 0.976$$
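The bandwidths hi and the tricube weights for the ten example x values can be recomputed with a short NumPy sketch. The array names are hypothetical, and Python's `round` is assumed to be an acceptable stand-in for "rounded to the nearest integer" (it differs only for exact halves).

```python
import numpy as np

x = np.array([1, 2, 5, 7, 12, 13, 15, 25, 27, 30], dtype=float)
f = 0.4

def tricube(t):
    """T(t) = (1 - |t|^3)^3 for |t| < 1, and 0 otherwise."""
    t = np.abs(t)
    return np.where(t < 1, (1 - t**3) ** 3, 0.0)

n = len(x)
r = int(round(n * f))                       # r = 4 for n = 10, f = 0.4
dist = np.abs(x[:, None] - x[None, :])      # dist[i, k] = |x_i - x_k|
h = np.sort(dist, axis=1)[:, r - 1]         # h[i] = r-th smallest distance in row i
w = tricube((x[None, :] - x[:, None]) / h[:, None])  # w[i, k] = w_k(x_i)
```

Indices are zero-based in the code, so the slide's w6(x5) corresponds to `w[4, 5]`.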

How is the lowess curve determined? Regression

For each i=1, 2, ..., n; let $\hat{\beta}_0^*(x_i)$ and $\hat{\beta}_1^*(x_i)$ denote the values of $\beta_0$ and $\beta_1$ that minimize

$$\sum_{k=1}^{n} w_k(x_i)\,(y_k - \beta_0 - \beta_1 x_k)^2 .$$

For i=1, 2, ..., n; let $y_i^* = \hat{\beta}_0^*(x_i) + \hat{\beta}_1^*(x_i)\,x_i$ and $e_i = y_i - y_i^*$.

Consider the bisquare weight function defined as

$$B(t) = \begin{cases} (1 - t^2)^2 & \text{for } |t| < 1 \\ 0 & \text{for } |t| \ge 1. \end{cases}$$

[Figure: bisquare weight function B(t) plotted against t]

For k=1,2,...,n; let $\delta_k = B\big(e_k/(6s)\big)$, where s is the median of |e1|, |e2|, ..., |en|.

How is the lowess curve determined?

For each i=1, 2, ..., n; let $\hat{\beta}_0(x_i)$ and $\hat{\beta}_1(x_i)$ denote the values of $\beta_0$ and $\beta_1$ that minimize

$$\sum_{k=1}^{n} \delta_k\, w_k(x_i)\,(y_k - \beta_0 - \beta_1 x_k)^2 .$$

For i=1, 2, ..., n; let $\hat{y}_i = \hat{\beta}_0(x_i) + \hat{\beta}_1(x_i)\,x_i$.

Now use the new fitted values $\hat{y}_i$ to compute new $\delta_k$ as on the previous slide.

Substitute the new $\delta_k$ for the old $\delta_k$ in the expression above and repeat the minimization described above to obtain new $\hat{y}_i$ values. These resulting values are the lowess fitted values. Plot these values versus x1, x2, ..., xn and connect with straight lines to obtain the lowess curve.
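The whole procedure (tricube weights, local weighted linear fits, and two rounds of bisquare reweighting) can be sketched as below. This is an illustrative from-scratch version, not a library implementation: the names `lowess` and `robust_iters` are hypothetical, and the sketch assumes r is at least 2 so that every bandwidth is positive. The guard for s = 0 is an addition not discussed on the slides.

```python
import numpy as np

def tricube(t):
    t = np.abs(t)
    return np.where(t < 1, (1 - t**3) ** 3, 0.0)

def bisquare(t):
    t = np.abs(t)
    return np.where(t < 1, (1 - t**2) ** 2, 0.0)

def weighted_line(x, y, weights):
    """Return (beta0, beta1) minimizing sum(weights * (y - beta0 - beta1*x)**2)."""
    design = np.column_stack([np.ones_like(x), x])
    sw = np.sqrt(weights)
    beta, *_ = np.linalg.lstsq(design * sw[:, None], y * sw, rcond=None)
    return beta

def lowess(x, y, f=0.4, robust_iters=2):
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    r = int(round(n * f))
    dist = np.abs(x[:, None] - x[None, :])
    h = np.sort(dist, axis=1)[:, r - 1]   # r-th smallest distance for each x_i
    w = tricube(dist / h[:, None])        # w[i, k] = w_k(x_i)
    delta = np.ones(n)                    # robustness weights; all 1 for the first fit
    for iteration in range(robust_iters + 1):
        fitted = np.empty(n)
        for i in range(n):
            b0, b1 = weighted_line(x, y, delta * w[i])
            fitted[i] = b0 + b1 * x[i]
        if iteration == robust_iters:
            break
        e = y - fitted
        s = np.median(np.abs(e))
        if s == 0:                        # perfect fit: nothing left to downweight
            break
        delta = bisquare(e / (6 * s))
    return fitted

x_example = [1, 2, 5, 7, 12, 13, 15, 25, 27, 30]
y_example = [1, 8, 4, 5, 3, 9, 16, 15, 23, 29]
curve = lowess(x_example, y_example, f=0.4)
```

Plotting `curve` against `x_example` and joining the points with straight lines gives the lowess curve described above.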


Plot Showing All 10 Lines and Predicted Values after One More Iteration


The Lowess Curve

After a separate lowess normalization for each slide, the adjusted values can be median centered and scale normalized across all channels using the lowess-normalized data for each channel.

A sector represents the set of points spotted by a single pin on a single slide. The entire normalization process described above can be carried out separately for each sector on each channel.

It may be necessary to normalize by sector/channel combinations if spatial variability is apparent.

Boxplots of Mean Signal after Logging, Lowess Normalization, Median Centering, and Scaling

[Figure: boxplots of normalized signal for each channel]

Bolstad, et al. (2003, Bioinformatics 19 2:185-193) propose quantile normalization for microarray data.

• Quantile normalization is most commonly used in normalization of Affymetrix data.

• It can be used for two-color data as well.

• Quantile normalization can force each channel to have the same quantiles.

• xq (for q between 0 and 1) is the q quantile of a data set if the fraction of the data points less than or equal to xq is at least q, and the fraction of the data points greater than or equal to xq is at least 1-q.

• median=x0.5, Q1=x0.25, Q3=x0.75


Boxplots of Log Signal Means after Quantile Normalization

Original Slide 1 Log Signal Means

[Figure: log red plotted against log green]


Comparison of Slide 1 Log Signal Means after Quantile Normalization

[Figure: horizontal axis: Log Green; vertical axis: Log Red]


Details of Quantile Normalization

1. Find the smallest log signal on each channel.

2. Average the values from step 1.

3. Replace each value in step 1 with the average computed in step 2.

4. Repeat steps 1 through 3 for the second smallest values, third smallest values,..., largest values.


A Simple Example

Gene  Slide1Cy3  Slide1Cy5  Slide2Cy3  Slide2Cy5
1     8          15         9          13
2     7          2          7          15
3     3          6          5          8
4     1          5          2          9
5     9          13         6          11


Find the Smallest Value for Each Channel

Gene  Slide1Cy3  Slide1Cy5  Slide2Cy3  Slide2Cy5
1     8          15         9          13
2     7          2          7          15
3     3          6          5          8
4     1          5          2          9
5     9          13         6          11

The smallest values are 1 (Slide1Cy3), 2 (Slide1Cy5), 2 (Slide2Cy3), and 8 (Slide2Cy5).


Average These Values

(1+2+2+8)/4 = 3.25

Gene  Slide1Cy3  Slide1Cy5  Slide2Cy3  Slide2Cy5
1     8          15         9          13
2     7          2          7          15
3     3          6          5          8
4     1          5          2          9
5     9          13         6          11


Replace Each Value by the Average

(1+2+2+8)/4 = 3.25

Gene  Slide1Cy3  Slide1Cy5  Slide2Cy3  Slide2Cy5
1     8          15         9          13
2     7          3.25       7          15
3     3          6          5          3.25
4     3.25       5          3.25       9
5     9          13         6          11


Find the Next Smallest Values

Gene  Slide1Cy3  Slide1Cy5  Slide2Cy3  Slide2Cy5
1     8          15         9          13
2     7          3.25       7          15
3     3          6          5          3.25
4     3.25       5          3.25       9
5     9          13         6          11

The next smallest values are 3 (Slide1Cy3), 5 (Slide1Cy5), 5 (Slide2Cy3), and 9 (Slide2Cy5).


Average These Values

Gene  Slide1Cy3  Slide1Cy5  Slide2Cy3  Slide2Cy5
1     8          15         9          13
2     7          3.25       7          15
3     3          6          5          3.25
4     3.25       5          3.25       9
5     9          13         6          11

(3+5+5+9)/4 = 5.5


Replace Each Value by the Average

Gene  Slide1Cy3  Slide1Cy5  Slide2Cy3  Slide2Cy5
1     8          15         9          13
2     7          3.25       7          15
3     5.50       6          5.50       3.25
4     3.25       5.50       3.25       5.50
5     9          13         6          11


Find the Average of the Next Smallest Values

Gene  Slide1Cy3  Slide1Cy5  Slide2Cy3  Slide2Cy5
1     8          15         9          13
2     7          3.25       7          15
3     5.50       6          5.50       3.25
4     3.25       5.50       3.25       5.50
5     9          13         6          11

(7+6+6+11)/4 = 7.5


Replace Each Value by the Average

Gene  Slide1Cy3  Slide1Cy5  Slide2Cy3  Slide2Cy5
1     8          15         9          13
2     7.50       3.25       7          15
3     5.50       7.50       5.50       3.25
4     3.25       5.50       3.25       5.50
5     9          13         7.50       7.50


Find the Average of the Next Smallest Values

Gene  Slide1Cy3  Slide1Cy5  Slide2Cy3  Slide2Cy5
1     8          15         9          13
2     7.50       3.25       7          15
3     5.50       7.50       5.50       3.25
4     3.25       5.50       3.25       5.50
5     9          13         7.50       7.50

(8+13+7+13)/4 = 10.25


Replace Each Value by the Average

Gene  Slide1Cy3  Slide1Cy5  Slide2Cy3  Slide2Cy5
1     10.25      15         9          10.25
2     7.50       3.25       10.25      15
3     5.50       7.50       5.50       3.25
4     3.25       5.50       3.25       5.50
5     9          10.25      7.50       7.50


Find the Average of the Next Smallest Values

Gene  Slide1Cy3  Slide1Cy5  Slide2Cy3  Slide2Cy5
1     10.25      15         9          10.25
2     7.50       3.25       10.25      15
3     5.50       7.50       5.50       3.25
4     3.25       5.50       3.25       5.50
5     9          10.25      7.50       7.50

(9+15+9+15)/4 = 12.00


Replace Each Value by the Average

Gene  Slide1Cy3  Slide1Cy5  Slide2Cy3  Slide2Cy5
1     10.25      12.00      12.00      10.25
2     7.50       3.25       10.25      12.00
3     5.50       7.50       5.50       3.25
4     3.25       5.50       3.25       5.50
5     12.00      10.25      7.50       7.50

This is the data matrix after quantile normalization.
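The walk-through above can be reproduced in a few lines. This is a minimal sketch using NumPy, not the Bioconductor implementation; the function name `quantile_normalize` is illustrative, and ties within a channel are broken arbitrarily by the sort rather than averaged.

```python
import numpy as np


def quantile_normalize(x):
    """Replace the k-th smallest value of every column by the mean of the
    k-th smallest values across columns (rows = genes, columns = channels)."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, axis=0, kind="stable")         # row positions, sorted per column
    sorted_vals = np.take_along_axis(x, order, axis=0)   # each column sorted ascending
    rank_means = sorted_vals.mean(axis=1)                # average of k-th smallest values
    out = np.empty_like(x)
    # Write the rank means back to the positions the sorted values came from.
    np.put_along_axis(out, order, rank_means[:, None], axis=0)
    return out


# Columns: Slide1Cy3, Slide1Cy5, Slide2Cy3, Slide2Cy5; rows: genes 1 to 5.
data = np.array([
    [8, 15, 9, 13],
    [7, 2, 7, 15],
    [3, 6, 5, 8],
    [1, 5, 2, 9],
    [9, 13, 6, 11],
])

normalized = quantile_normalize(data)
print(normalized)
```

Sorting once per column replaces the repeated "find the next smallest value" passes of the slides; the result is the same matrix because each pass only ever touches values of one rank.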


Background Correction and Normalization of Affymetrix GeneChip Data


Affymetrix .CEL Files

• A .CEL file contains one number representing signal intensity for each probe cell on a single GeneChip.

• .CEL files can be read with Affymetrix software or in R using the Bioconductor package affy.

• We will discuss two methods for normalizing and obtaining expression measures using data from Affymetrix .CEL files.


Methods

1. Microarray Analysis Suite (MAS) 5.0 Signal proposed by Affymetrix. Statistical Algorithms Description Document (2002) Affymetrix Inc.

2. Robust Multi-array Average (RMA) proposed by Irizarry et al. (2003) Biostatistics 4, 249-264.

These are perhaps the two most popular of many methods for normalizing and computing expression measures using Affymetrix data. Currently over 50 methods are described and compared at http://affycomp.biostat.jhsph.edu/.


MAS 5.0 Signal: Background Adjustment

• Each chip is divided into 16 rectangular zones.

• The lowest 2% of intensities in each zone are averaged to form a zone-specific background value denoted $bZ_k$ for zones k=1, 2, ..., 16.

• The standard deviation of the lowest 2% of intensities in each zone is calculated and denoted $nZ_k$ for zones k=1, 2, ..., 16.

• Let $d_k(x,y)$ denote the distance from the center of zone k to a probe cell located at coordinates (x,y) on the chip.


GeneChip Divided into 16 Zones

 1   2   3   4
 5   6   7   8
 9  10  11  12
13  14  15  16

[Figure: the chip drawn as a 4 x 4 grid of zones numbered as above, with axes x and y and a probe cell marked at coordinates (x,y)]


16 Distances to Zone Centers for Each Probe Cell

[Figure: distances from a probe cell to the zone centers, with $d_1(x,y)$, $d_4(x,y)$, and $d_{16}(x,y)$ labeled]


MAS 5.0 Signal: Background Adjustment (continued)

• Let $w_k(x,y) = 1 / (d_k^2(x,y) + 100)$.

• Denote the background for the cell located at coordinates (x,y) by

$b(x,y) = \sum_{k=1}^{16} w_k(x,y)\, bZ_k \Big/ \sum_{k=1}^{16} w_k(x,y)$.

• Denote the “noise” for the cell located at coordinates (x,y) by

$n(x,y) = \sum_{k=1}^{16} w_k(x,y)\, nZ_k \Big/ \sum_{k=1}^{16} w_k(x,y)$.


MAS 5.0 Signal: Background Adjustment (continued)

• Let I(x,y) denote the original intensity of the cell located at coordinates (x,y) on the chip. (75th percentile of 36 pixel intensities in the center of the cell.)

• Let I’(x,y)=max ( I(x,y) , 0.5 ).

• Define the background-adjusted intensity for the cell at coordinates (x,y) by

A(x,y)=max { I’(x,y)-b(x,y) , 0.5n(x,y) }.

• Henceforth these background-adjusted intensities will be referred to as either PM or MM for perfect match or mismatch cells, respectively.
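The background adjustment steps can be sketched as follows. This is a hedged illustration, not Affymetrix's code: the function and parameter names are invented here, the chip is treated as a plain 2-D array of cell intensities, zones are formed by splitting rows and columns into four near-equal bands, the weight uses the squared distance to each zone center, and the standard deviation is the population form. Each of these choices is an assumption.

```python
import numpy as np


def mas5_background_adjust(intensity, zones_per_side=4, low_frac=0.02,
                           smooth=100.0, noise_frac=0.5):
    """Zone-based background adjustment in the style of MAS 5.0 (sketch)."""
    I = np.asarray(intensity, dtype=float)
    n_rows, n_cols = I.shape
    row_edges = np.linspace(0, n_rows, zones_per_side + 1).astype(int)
    col_edges = np.linspace(0, n_cols, zones_per_side + 1).astype(int)

    centers, bZ, nZ = [], [], []
    for r0, r1 in zip(row_edges[:-1], row_edges[1:]):
        for c0, c1 in zip(col_edges[:-1], col_edges[1:]):
            zone = np.sort(I[r0:r1, c0:c1], axis=None)
            k = max(1, int(low_frac * zone.size))   # number of cells in the lowest 2%
            lowest = zone[:k]
            bZ.append(lowest.mean())                # zone background bZ_k
            nZ.append(lowest.std())                 # zone noise nZ_k
            centers.append(((r0 + r1 - 1) / 2.0, (c0 + c1 - 1) / 2.0))
    centers = np.array(centers)
    bZ = np.array(bZ)
    nZ = np.array(nZ)

    # Squared distance from every cell to every zone center: shape (rows, cols, 16).
    yy, xx = np.mgrid[0:n_rows, 0:n_cols]
    d2 = (yy[..., None] - centers[:, 0]) ** 2 + (xx[..., None] - centers[:, 1]) ** 2
    w = 1.0 / (d2 + smooth)
    b = (w * bZ).sum(axis=-1) / w.sum(axis=-1)      # b(x,y)
    n = (w * nZ).sum(axis=-1) / w.sum(axis=-1)      # n(x,y)

    # A(x,y) = max{ I'(x,y) - b(x,y), 0.5 n(x,y) } with I' = max(I, 0.5).
    adjusted = np.maximum(np.maximum(I, 0.5) - b, noise_frac * n)
    return adjusted, b, n


rng = np.random.default_rng(0)
chip = rng.gamma(shape=2.0, scale=100.0, size=(40, 40)) + 50.0  # simulated intensities
A, b, n = mas5_background_adjust(chip)
print(A.shape, float(b.min()), float(b.max()))
```

Because the weights decay with distance, a cell's background is dominated by the zone it sits in but changes smoothly across zone boundaries, which avoids visible seams in the adjusted image.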


MAS 5.0 Signal: Ideal Mismatch Computation

• MM values are supposed to provide measures of cross-hybridization and stray signal intensity that inflate the value of PM.

• In the simplest case, a PM value would be corrected simply by subtracting its corresponding MM value.

• However, some MM values are bigger than their corresponding PM values so that PM-MM would become negative.

• Because negative values do not make sense and would pose problems with subsequent steps in analysis, Affymetrix determines an Ideal Mismatch (IM) value for each probe pair that is guaranteed to be less than PM.


MAS 5.0 Signal: Ideal Mismatch Computation (continued)

For a given probe set containing n probe pairs, let $PM_j$ and $MM_j$ denote the perfect match and mismatch values of the jth probe pair. The IM value from the jth probe pair ($IM_j$) is determined as follows:

• If $PM_j > MM_j$, then $IM_j = MM_j$ and no further computation is needed.

• If $PM_j \le MM_j$, compute

$M = TBW\{ \log_2(PM_1/MM_1), ..., \log_2(PM_n/MM_n) \}$

where TBW denotes a one-step Tukey BiWeight (a special weighted average described later).


MAS 5.0 Signal: Ideal Mismatch Computation (continued)

• If M > 0.03, then $IM_j = PM_j / 2^M$.

• If M ≤ 0.03, then compute $P = \dfrac{0.03}{1 + (0.03 - M)/10}$ and let $IM_j = PM_j / 2^P$.

• Note that at M = 0.03, $IM_j = PM_j / 1.021012$ so that $PM_j$ will be slightly larger than $IM_j$.

• As M gets larger, $IM_j$ decreases. As M decreases to 0, $IM_j$ increases to $PM_j / 1.020949$; for negative M, P shrinks further toward 0 and $IM_j$ moves closer to $PM_j$ while remaining below it.
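The three cases of the Ideal Mismatch rule can be written as one small function. This is a sketch under the formulas stated on these slides; the function name and the parameter names `contrast_tau` and `scale_tau` (for the constants 0.03 and 10) are labels chosen here, and M is assumed to have been computed already as the Tukey biweight of the probe set's log ratios.

```python
def ideal_mismatch(pm, mm, M, contrast_tau=0.03, scale_tau=10.0):
    """Ideal Mismatch for one probe pair, given the probe-set value M."""
    if pm > mm:
        return mm                     # MM is usable as it is
    if M > contrast_tau:
        return pm / 2 ** M            # shrink PM by the typical log ratio
    P = contrast_tau / (1 + (contrast_tau - M) / scale_tau)
    return pm / 2 ** P                # P is always positive, so IM < PM


print(ideal_mismatch(100.0, 60.0, M=0.5))          # PM > MM
print(ideal_mismatch(100.0, 150.0, M=1.0))         # PM <= MM and M > 0.03
print(100.0 / ideal_mismatch(100.0, 150.0, 0.03))  # ratio PM/IM at M = 0.03
print(100.0 / ideal_mismatch(100.0, 150.0, 0.0))   # ratio PM/IM at M = 0
```

The last two lines print the divisors quoted on the slide (about 1.021012 and 1.020949), which is a quick way to confirm that the reconstructed expression for P matches the slide's numbers.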


MAS 5.0 Signal: Signal Log Value Computation

• Let $V_j = \max( PM_j - IM_j ,\ 2^{-20} )$.

• Define the probe value for the jth probe pair by $PV_j = \log_2(V_j)$.

• The signal log value for a given probe set is defined by

$SLV = TBW( PV_1 , PV_2 , ... , PV_n )$

where TBW denotes a one-step Tukey BiWeight (a special weighted average to be discussed later).


MAS 5.0 Signal: Scaling and Signal Calculation

• Let $SLV_i$ denote the signal log value for the ith probe set on a single chip.

• Let I denote the number of probe sets on the chip.

• Let $SF = 500 / \mathrm{TrimMean}( 2^{SLV_1} , 2^{SLV_2} , ..., 2^{SLV_I} ;\ 0.02, 0.98 )$. TrimMean is the average of the values in parentheses that are strictly between the 0.02 and 0.98 quantiles of the values in parentheses.

• MAS 5.0 Signal for the ith probe set is $Signal_i = SF \cdot 2^{SLV_i}$.

• All computations are done separately for each chip to obtain a Signal value for each chip and probe set.
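The scaling step can be sketched as below. The function name `mas5_scale` and its parameters are illustrative, and `np.quantile` with its default linear interpolation is an assumption: the slides do not say how Affymetrix computes the 0.02 and 0.98 quantiles, so values very close to the cut points may be treated differently by the real software.

```python
import numpy as np


def mas5_scale(slv, target=500.0, lo=0.02, hi=0.98):
    """Scale back-transformed signal log values so their trimmed mean is `target`."""
    raw = 2.0 ** np.asarray(slv, dtype=float)        # 2^SLV_i for every probe set
    q_lo, q_hi = np.quantile(raw, [lo, hi])
    kept = raw[(raw > q_lo) & (raw < q_hi)]          # strictly between the quantiles
    sf = target / kept.mean()                        # scaling factor SF
    return sf * raw, sf


# Hypothetical chip with 100 probe sets whose raw values are 1, 2, ..., 100.
slv = np.log2(np.arange(1, 101))
signal, sf = mas5_scale(slv)
print(sf)
```

Trimming before averaging keeps a handful of saturated or near-zero probe sets from driving the scaling factor, so chips with a few extreme values are still brought to a comparable overall level.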


The One-Step Tukey BiWeight Estimator Used by Affymetrix

• Let $x_1, x_2, ..., x_n$ denote observations.

• Let $m = \mathrm{median}( x_1, x_2, ..., x_n )$.

• Let $MAD = \mathrm{median}( |x_1 - m|, |x_2 - m|, ..., |x_n - m| )$.

• For each i = 1, 2, ..., n, let $t_i = \dfrac{x_i - m}{5 \cdot MAD + 0.0001}$. The 0.0001 term is the factor Affymetrix uses to avoid division by 0.


The One-Step Tukey BiWeight Estimator Used by Affymetrix (ctd.)

Recall the bisquare weight function defined as

$B(t) = ( 1 - t^2 )^2$ for | t | < 1

$B(t) = 0$ for | t | ≥ 1.

[Figure: Bisquare Weight Function; horizontal axis: t; vertical axis: B(t)]

$TBW( x_1, x_2, ..., x_n ) = \sum_{i=1}^{n} B(t_i)\, x_i \Big/ \sum_{i=1}^{n} B(t_i)$


An Example

Compute TBW ( 1, 7, 13, 15, 28, 1075 ).

m = ( 13 + 15 ) / 2 = 14.

MAD = median ( |1-14|,|7-14|,|13-14|,|15-14|,|28-14|,|1075-14| )

= median ( 13, 7, 1, 1, 14, 1061 )

= median ( 1, 1, 7, 13, 14, 1061 )

= ( 7 + 13 ) / 2 = 10.

t1 = -13 / 50 t2 = -7 / 50 t3 = -1 / 50

t4 = 1 / 50 t5 = 14 / 50 t6 = 1061 / 50

Ignore the 0.0001 factor to make calculations easier.


An Example (continued)

t1 = -13 / 50 t2 = -7 / 50 t3 = -1 / 50

t4 = 1 / 50 t5 = 14 / 50 t6 = 1061 / 50

B(t1) = B(-0.26) = ( 1 - 0.26^2 )^2 = 0.8693698
B(t2) = B(-0.14) = ( 1 - 0.14^2 )^2 = 0.9611842
B(t3) = B(-0.02) = ( 1 - 0.02^2 )^2 = 0.9992002
B(t4) = B(0.02) = ( 1 - 0.02^2 )^2 = 0.9992002
B(t5) = B(0.28) = ( 1 - 0.28^2 )^2 = 0.8493466
B(t6) = 0 because |t6| = 21.22 ≥ 1

TBW = ( 0.8693698*1 + 0.9611842*7 + 0.9992002*13 + 0.9992002*15 + 0.8493466*28 + 0*1075 ) / ( 0.8693698 + 0.9611842 + 0.9992002 + 0.9992002 + 0.8493466 + 0 )

= 12.68772.
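The worked example can be checked with a short function. This is a minimal sketch of the estimator as defined on these slides, using only the Python standard library; the function name and the parameter names `c` and `eps` (for the constants 5 and 0.0001) are chosen here for illustration.

```python
import statistics


def tukey_biweight(xs, c=5.0, eps=0.0001):
    """One-step Tukey biweight: a weighted mean with bisquare weights."""
    xs = list(xs)
    m = statistics.median(xs)
    mad = statistics.median([abs(x - m) for x in xs])
    weights = []
    for x in xs:
        t = (x - m) / (c * mad + eps)
        weights.append((1 - t * t) ** 2 if abs(t) < 1 else 0.0)
    return sum(w * x for w, x in zip(weights, xs)) / sum(weights)


values = [1, 7, 13, 15, 28, 1075]
print(round(tukey_biweight(values, eps=0.0), 5))  # the slide's calculation, eps ignored
print(round(tukey_biweight(values), 5))           # with the 0.0001 term included
```

The outlier 1075 receives weight 0, so the estimate stays near the bulk of the data, whereas the ordinary mean of these six values is about 190.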


Obtaining MAS5.0 Signal Values from Affymetrix .CEL Files

• MAS5.0 Signal values can be obtained from Affymetrix software.

• Approximate MAS5.0 Signal values can be computed with the mas5 function that is part of the Bioconductor package affy.


Robust Multi-array Average (RMA)

1. Background adjust PM values from .CEL files.

2. Take the base-2 log of each background-adjusted PM intensity.

3. Quantile normalize values from step 2 across all GeneChips.

4. Perform median polish separately for each probe set with rows indexed by GeneChip and columns indexed by probe.

5. For each row, find the average of the fitted values from step 4 to use as probe-set-specific expression measures for each GeneChip.


RMA: Background Adjustment

Assume PM = S + B where signal S ~ Exp(λ) is independent of background B ~ $N^+(\mu,\sigma^2)$.

$N^+(\mu,\sigma^2)$ denotes $N(\mu,\sigma^2)$ truncated on the left at 0.


The Probability Density Function of the Exponential Distribution with Mean 1/λ = 10000

[Figure: density curve; horizontal axis: s; vertical axis: $\lambda e^{-\lambda s}$]


The Probability Density Function of the Normal Distribution with Mean μ = 1000 and Variance $\sigma^2 = 300^2$

[Figure: density curve; horizontal axis: b; vertical axis: $e^{-(b-\mu)^2/(2\sigma^2)} / (2\pi\sigma^2)^{0.5}$]


The Probability Density Function of s + b where s ~ Exp(λ = 1/10000) and b ~ $N^+(\mu = 1000, \sigma^2 = 300^2)$

[Figure: density curve; horizontal axis: s+b; vertical axis: Density of s+b]


RMA: Background Adjustment (continued)

[Formula for E(S|PM) shown as an image on the slide; its annotations identify the N(0,1) density function and the N(0,1) distribution function]

Separately for each chip, estimate μ, σ, and λ from the observed PM distribution. Plug those estimates into the formula above to obtain an estimate of E(S|PM) for each PM value. These serve as background-adjusted PM values.
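The expression for E(S|PM) is not reproduced in the text of this transcript. Under the stated model (exponential signal with rate λ plus normal background truncated at 0), conditioning on PM gives a normal distribution for S truncated to the interval [0, PM], whose mean has the closed form

$E(S \mid PM) = a + b\, \dfrac{\phi(a/b) - \phi((PM - a)/b)}{\Phi(a/b) + \Phi((PM - a)/b) - 1}$, with $a = PM - \mu - \sigma^2 \lambda$ and $b = \sigma$,

where φ and Φ are the N(0,1) density and distribution functions. This is the form usually quoted for RMA, but it is offered here as a reconstruction from the model, not as the slide's exact expression. A minimal Python sketch, with function names and parameter values chosen for illustration:

```python
import math


def norm_pdf(z):
    """N(0,1) density function."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z):
    """N(0,1) distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_signal(pm, mu, sigma, lam):
    """E(S | PM = pm) for S ~ Exp(lam) and B ~ N(mu, sigma^2) truncated at 0."""
    a = pm - mu - sigma ** 2 * lam
    b = sigma
    numerator = norm_pdf(a / b) - norm_pdf((pm - a) / b)
    denominator = norm_cdf(a / b) + norm_cdf((pm - a) / b) - 1.0
    return a + b * numerator / denominator


# Parameter values taken from the density plots above: mu = 1000, sigma = 300, 1/lam = 10000.
for pm in (800.0, 1500.0, 5000.0):
    print(pm, expected_signal(pm, mu=1000.0, sigma=300.0, lam=1.0 / 10000.0))
```

The adjusted value always lies between 0 and PM, so unlike plain subtraction of an estimated background it can never go negative, which matters because the next RMA step takes logarithms.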


RMA: Background Adjustment (continued)

Obtaining Estimates of μ, σ, and λ

(unpublished description of the procedure)

• Estimate the mode of the PM distribution using a kernel density estimate of the PM density.

• Estimate the density of the PM values less than the mode. The mode of this distribution serves as an estimate of μ.

• Assume the data to the left of the estimate of μ are the background observations that fell below their mean. Use those observations to estimate σ.

• Subtract the estimate of μ from all observations larger than the estimate. The mode of this distribution estimates 1/λ.


PM Density Estimate Based on Simulated Data

[Figure: density curve; vertical axis: Density. Annotation: data below the estimated mode are used to estimate the background parameters μ and σ.]


Density Estimate of PM Data below the Estimated Mode of the PM Distribution

[Figure: density curve; vertical axis: Density. Annotations: the estimate of μ is 1612; these data are used to estimate σ as 642.3.]


Estimate of σ

According to the RMA R code, σ is estimated as follows:

The purpose of the factor of 2 in the numerator is not clear.


Density Estimate of $PM - \hat{\mu}$ Values Greater than Zero

[Figure: density curve; vertical axis: Density. Annotation: estimate of 1/λ = 2019.]

The mean of these values would be a much better estimate of 1/λ in this case. (Mean is 9848 and 1/λ = 10000.)


RMA: Quantile Normalization

1. After background adjustment, find the smallest log2(PM) on each chip.

2. Average the values from step 1.

3. Replace each value in step 1 with the average computed in step 2.

4. Repeat steps 1 through 3 for the second smallest values, third smallest values,..., largest values.


RMA: Median Polish

• For a given probe set with J probe pairs, let $y_{ij}$ denote the background-adjusted, base-2-logged, and quantile-normalized value for GeneChip i and probe j.

• Assume $y_{ij} = \mu_i + \alpha_j + e_{ij}$ where $\alpha_1 + \alpha_2 + ... + \alpha_J = 0$. Here $\mu_i$ is the gene expression of the probe set on GeneChip i, $\alpha_j$ is the probe affinity effect for the jth probe in the probe set, and $e_{ij}$ is the residual for the jth probe on the ith GeneChip.

• Perform Tukey’s Median Polish on the matrix of $y_{ij}$ values with $y_{ij}$ in the ith row and jth column.


RMA: Median Polish (continued)

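• Let $\hat{y}_{ij}$ denote the fitted value for $y_{ij}$ that results from the median polish procedure.

• Let $\hat{\alpha}_j = \hat{y}_{\cdot j} - \hat{y}_{\cdot\cdot}$ where $\hat{y}_{\cdot j} = \sum_{i=1}^{I} \hat{y}_{ij} / I$ and $\hat{y}_{\cdot\cdot} = \sum_{i=1}^{I} \sum_{j=1}^{J} \hat{y}_{ij} / (IJ)$, and I denotes the number of GeneChips.

• Let $\hat{\mu}_i = \hat{y}_{i\cdot} = \sum_{j=1}^{J} \hat{y}_{ij} / J$.

• $\hat{\mu}_i$ is the probe-set-specific measure of expression for GeneChip i.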


An Example

Suppose the following are background-adjusted, log2-transformed, quantile-normalized PM intensities for a single probe set. Determine the final RMA expression measures for this probe set.

                Probe
GeneChip    1    2    3    4    5
1           4    3    6    4    7
2           8    1   10    5   11
3           6    2    7    8    8
4           9    4   12    9   12
5           7    5    9    6   10


An Example (continued)

Original matrix with row medians:

 4    3    6    4    7   |  4
 8    1   10    5   11   |  8
 6    2    7    8    8   |  7
 9    4   12    9   12   |  9
 7    5    9    6   10   |  7

Matrix after removing row medians:

 0   -1    2    0    3
 0   -7    2   -3    3
-1   -5    0    1    1
 0   -5    3    0    3
 0   -2    2   -1    3


An Example (continued)

Matrix after removing row medians:

 0   -1    2    0    3
 0   -7    2   -3    3
-1   -5    0    1    1
 0   -5    3    0    3
 0   -2    2   -1    3

Column medians:

 0   -5    2    0    3

Matrix after subtracting column medians:

 0    4    0    0    0
 0   -2    0   -3    0
-1    0   -2    1   -2
 0    0    1    0    0
 0    3    0   -1    0


An Example (continued)

Current matrix with row medians:

 0    4    0    0    0   |  0
 0   -2    0   -3    0   |  0
-1    0   -2    1   -2   | -1
 0    0    1    0    0   |  0
 0    3    0   -1    0   |  0

Matrix after removing row medians:

 0    4    0    0    0
 0   -2    0   -3    0
 0    1   -1    2   -1
 0    0    1    0    0
 0    3    0   -1    0


An Example (continued)

Matrix after removing row medians:

 0    4    0    0    0
 0   -2    0   -3    0
 0    1   -1    2   -1
 0    0    1    0    0
 0    3    0   -1    0

Column medians:

 0    1    0    0    0

Matrix after subtracting column medians:

 0    3    0    0    0
 0   -3    0   -3    0
 0    0   -1    2   -1
 0   -1    1    0    0
 0    2    0   -1    0


An Example (continued)

 0    3    0    0    0
 0   -3    0   -3    0
 0    0   -1    2   -1
 0   -1    1    0    0
 0    2    0   -1    0

All row medians and column medians are 0. Thus the median polish procedure has converged. The above is the residual matrix that we will subtract from the original matrix to obtain the fitted values.


An Example (continued)

Original matrix:

 4    3    6    4    7
 8    1   10    5   11
 6    2    7    8    8
 9    4   12    9   12
 7    5    9    6   10

Residuals from median polish:

 0    3    0    0    0
 0   -3    0   -3    0
 0    0   -1    2   -1
 0   -1    1    0    0
 0    2    0   -1    0

Matrix of fitted values (original matrix minus residuals), with row means:

 4    0    6    4    7   |  4.2 = $\hat{\mu}_1$
 8    4   10    8   11   |  8.2 = $\hat{\mu}_2$
 6    2    8    6    9   |  6.2 = $\hat{\mu}_3$
 9    5   11    9   12   |  9.2 = $\hat{\mu}_4$
 7    3    9    7   10   |  7.2 = $\hat{\mu}_5$

The row means are the RMA expression measures for the 5 GeneChips.
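The median polish example can be reproduced with NumPy. This is a minimal sketch of the procedure as carried out on these slides (row medians first, then column medians, repeated until all medians are 0); the function name, the iteration cap, and the tolerance are choices made here, not part of the RMA software.

```python
import numpy as np


def median_polish_residuals(y, max_iter=20, tol=1e-8):
    """Alternately sweep out row and column medians; return the residual matrix."""
    resid = np.array(y, dtype=float)
    for _ in range(max_iter):
        row_med = np.median(resid, axis=1, keepdims=True)
        resid -= row_med
        col_med = np.median(resid, axis=0, keepdims=True)
        resid -= col_med
        # Converged once a full sweep removes nothing.
        if np.all(np.abs(row_med) < tol) and np.all(np.abs(col_med) < tol):
            break
    return resid


# Rows: GeneChips 1 to 5; columns: probes 1 to 5.
y = np.array([
    [4, 3, 6, 4, 7],
    [8, 1, 10, 5, 11],
    [6, 2, 7, 8, 8],
    [9, 4, 12, 9, 12],
    [7, 5, 9, 6, 10],
])

residuals = median_polish_residuals(y)
fitted = y - residuals           # additive fit: chip effect plus probe effect
mu_hat = fitted.mean(axis=1)     # one expression measure per GeneChip
print(residuals)
print(mu_hat)
```

Because medians are used instead of means, the large residuals for probe 2 and probe 4 on some chips are left in the residual matrix and do not pull the chip-level expression measures.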


Miscellaneous Comments on Normalization

• We have only scratched the surface in terms of normalization methods. There are many variations on the techniques that were described previously as well as other approaches that we won’t discuss at this point in the course.

• Normalization affects the final results, but it is often not clear what normalization strategy is best.

• It would be good to integrate normalization and statistical analysis, but it is difficult to do so. The most common approach is to normalize data and then perform statistical analysis of the normalized data as a separate step in the microarray analysis process.