
04/21/23 1

Microarray Data Pre-Processing

Copyright notice

• Many of the images in this PowerPoint presentation are the work of other people. The copyright belongs to the original authors. Thanks!

Microarray data analysis: preprocessing

The main goal of data preprocessing is to remove the systematic bias in the data as completely as possible, while preserving the variation in gene expression that occurs because of biologically relevant changes in transcription.

Observed differences in gene expression could be due to transcriptional changes, or they could be caused by artifacts such as:

• different labeling efficiencies of Cy3, Cy5
• uneven spotting of DNA onto an array surface
• variations in RNA purity or quantity
• variations in washing efficiency
• variations in scanning efficiency


• Image analysis

• Background correction

• Normalization

• Summarization


Image analysis

• The raw data from a cDNA microarray experiment consist of pairs of image files, 16-bit TIFFs, one for each of the dyes.

• Image analysis is required to extract measures of the red and green fluorescence intensities for each spot on the array.

Steps in Image Processing

1. Addressing: locate spot centers.

2. Segmentation: classify pixels as either signal or background (e.g., using seeded region growing).

3. Information extraction: for each spot of the array, calculate signal intensity pairs, background, and quality measures.

Addressing

This is the process of assigning coordinates to each of the spots.

Automating this part of the procedure permits high throughput analysis.

4 by 4 grids; 19 by 21 spots per grid


Registration

Problems in automatic addressing

Misregistration of the red and green channels

Rotation of the array in the image

Skew in the array


Segmentation methods

• Fixed circles
• Adaptive circle
• Adaptive shape
– Edge detection.
– Seeded region growing (R. Adams and L. Bischof, 1994): regions grow outwards from the seed points preferentially according to the difference between a pixel’s value and the running mean of values in an adjoining region.
• Histogram methods
– Adaptive threshold.

Examples of algorithms and software implementation

Methods            Software / algorithms
Fixed Circle       ScanAlyze, GenePix, QuantArray
Adaptive Circle    GenePix
Adaptive Shape     Edging and region growing
Histogram Method   QuantArray and adaptive thresholding

Limitation of fixed circle method

[Figure: panels labeled SRG and Fixed Circle.]

Limitation of circular segmentation

– Small spot
– Not circular

Results from SRG

Information Extraction

• Spot Intensities

– mean (pixel intensities).

– median (pixel intensities).

– Pixel variation (IQR of log(pixel intensities)).

• Background values

– Local

– Morphological opening

– Constant (global)

– None

• Quality Information

[Figure labels: Signal; Background.]

Background Correction

• Recall that spot signal, or simply signal, is fluorescence intensity due to target molecules hybridized to probe sequences contained in a spot (what we would like to measure) plus background fluorescence (what we would rather not measure).

• Background is fluorescence that may contribute to spot pixel intensities but is not due to fluorescence from target molecules hybridized to spot probe sequences.

• The idea is to remove background fluorescence from the spot signal fluorescence because the spot signal is believed to be a sum of fluorescence due to background and fluorescence due to hybridized target cDNA.

Local background

• Focusing on small regions surrounding the spot mask.

• Median of pixel values in this region

• Most software packages implement such an approach (e.g., ScanAlyze, ImaGene, Spot, GenePix).

• By not considering the pixels immediately surrounding the spots, the background estimate is less sensitive to the performance of the segmentation procedure

Global background

• Global method which subtracts a constant background for all spots

• Some findings suggest that the binding of fluorescent dyes to ‘negative control spots’ is lower than the binding to the glass slide
– More meaningful to estimate background based on a set of negative control spots
– If no negative control spots: approximation of the average background = third percentile of all the spot foreground values

Background Correction Strategies (applied prior to logging signal intensity)

1. Subtract local background, e.g., signal mean – background mean or signal mean – background median.

This can increase variation in measurements, especially for low expressing genes. Some believe that local background will overestimate the background contribution to spot fluorescence. Background fluorescence where cDNA has been spotted may be different than background where no cDNA has been spotted.

2. For each spot, find the local background of the spot as well as the local backgrounds of all neighboring spots. Compute the median or mean of these local backgrounds. Subtract that summary of local backgrounds from the spot’s signal.

This is similar to option 1 but can reduce some variation in background estimation.

3. Find the median or mean of local backgrounds in a sector. Subtract the sector summary of local backgrounds from each signal in the sector.

4. Subtract the median or mean of blank spot signals or negative control signals in a sector from all other signals in a sector.

5. Estimate the background for each spot by fitting a row and column model to the local background values in a sector. (See next slide.)

Modeling local backgrounds within each sector (Kafadar and Phang. (2003). CSDA 44 313-338)

$$b_{ij} = m + r_i + c_j + e_{ij}$$

where
• $b_{ij}$ is the background for the spot in the ith row and jth column of the sector,
• $m$ is the baseline background for the sector,
• $r_i$ is the row effect for the sector,
• $c_j$ is the column effect for the sector,
• $e_{ij}$ is the residual.

An estimated background $\hat{b}_{ij}$ for each spot is obtained via median polish.

Comments on Background Correction

• Subtracting background may result in negative or zero adjusted signal values. Such values cannot be logged. One simple approach is to replace all negative values by zero, add one to all values (whether zero or not), and log the resulting values.
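The simple approach described above can be sketched in a few lines of Python. The function name, the example intensities, and the choice of base-2 logs are assumptions made for illustration; the slides do not fix a log base.

```python
import math

def background_correct_and_log(signal, background):
    """Subtract background, replace negatives by zero, add one, then log."""
    adjusted = [s - b for s, b in zip(signal, background)]
    floored = [max(a, 0.0) for a in adjusted]       # negative values become 0
    return [math.log2(v + 1.0) for v in floored]    # add one so zero can be logged

# Hypothetical spot signal means and local background medians.
signal_means = [520.0, 130.0, 95.0, 2048.0]
background_medians = [100.0, 130.0, 110.0, 1.0]
logged = background_correct_and_log(signal_means, background_medians)
```

Spots whose signal does not exceed background all map to the same logged value of zero, which is one reason this floor-and-shift rule is only one of several possible choices.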

Data Normalization

• Large sets of experiments involve dozens to hundreds of arrays

• To make the arrays comparable, the data need to be normalized

• Because equal amounts of mRNA are used in all arrays, the spot intensities of an array should sum to a fixed number

What is Normalization?

• Normalization describes the process of removing (or minimizing) non-biological variation in the measured gene expression levels of hybridized mRNA so that biological differences can be more easily detected.

• Typically normalization is attempting to remove global effects, i.e., effects that can be seen by examining plots that show all the data for a slide or slides.

• Normalization does not necessarily have anything to do with the normal distribution that plays a prominent role in statistics.

Sources of Non-Biological Variation

• Dye bias: differences in heat and light sensitivity, efficiency of dye incorporation
• Differences in the amount of labeled cDNA hybridized to each channel in a microarray experiment
– Channel is used to refer to a combination of a dye and a slide.
• Variation across replicate slides
• Variation across hybridization conditions
• Variation in scanning conditions
• Variation among technicians doing the lab work
• etc.

Normalization Methods for Two-Color Microarray Data


Side-by-side boxplots show examples of variation across channels.

[Figure: side-by-side boxplots for Slide 1 (Cy3, Cy5) and Slide 2 (Cy3, Cy5), annotated with minimum, Q1 = 25th percentile, median, Q3 = 75th percentile, and maximum.]

Interquartile range (IQR) is Q3 - Q1. Points more than 1.5*IQR above Q3 or more than 1.5*IQR below Q1 are displayed individually.


One of the simplest normalization strategies is to align the log signals so that all channels have the same median.

• The value of the common median is not important for subsequent analyses.

• A convenient choice is zero so that positive or negative values reflect signals above or below the median for a particular channel.

• If negative normalized signal values seem confusing, any positive constant may be added to all values after normalization to zero medians.

[Figure: vertical axis "Log Mean Signal Centered at 0", one plot per channel.]

Note that medians match but variation seems to differ greatly across channels.

Scale normalization (Yang, et al. 2002. Nucleic Acids Research, 30, 4 e15)

Consider a matrix X with i=1,...,I rows and j=1,...,J columns.

Let xij denote the entry in row i and column j.

We will apply scale normalization to the matrix of log signal mean values that have already been median centered (each row corresponds to a gene and each column corresponds to a channel).

For each column j, let mj=median(x1j, x2j, ..., xIj).

For each column j, let MADj=median(|x1j-mj|,|x2j-mj|,...,|xIj-mj|).

MAD: median absolute deviation

To scale normalize the columns of X to a constant value C, multiply all the entries in the jth column by C/MADj for all j=1,...,J.

A common choice for C is the geometric mean of MAD1,...,MADJ:

$$C = \left( \prod_{j=1}^{J} \mathrm{MAD}_j \right)^{1/J}$$

The choice of C will not affect subsequent tests or p-values but will affect fold change calculations.

*Yang et al. recommended scale normalization for log R/G values.

[Figure: data after median centering and scale normalizing; vertical axis "Log Mean Signal (centered and scaled)".]

A Simple Example

Gene  Slide1Cy3  Slide1Cy5  Slide2Cy3  Slide2Cy5
1     8          15         9          13
2     7          2          7          15
3     3          6          5          8
4     1          5          2          9
5     9          13         6          11

Determine Channel Medians

medians  7  6  6  11

Subtract Channel Medians

Gene  Slide1Cy3  Slide1Cy5  Slide2Cy3  Slide2Cy5
1     1          9          3          2
2     0          -4         1          4
3     -4         0          -1         -3
4     -6         -1         -4         -2
5     2          7          0          0

This is the data after median centering.

Find Median Absolute Deviations

MAD  2  4  1  2

Find Scaling Constant

C = (2*4*1*2)^(1/4) = 2

Find Scaling Factors

Scaling Factors  2/2  2/4  2/1  2/2

Scale Normalize the Median Centered Data

Gene  Slide1Cy3  Slide1Cy5  Slide2Cy3  Slide2Cy5
1     1          4.5        6          2
2     0          -2.0       2          4
3     -4         0.0        -2         -3
4     -6         -0.5       -8         -2
5     2          3.5        0          0

This is the data after median centering and scale normalizing.
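The worked example above can be reproduced with a short Python sketch. The function names and the column-per-channel data layout are choices made here for illustration, not part of the original slides.

```python
import math
from statistics import median

def median_center(channels):
    """Subtract each channel's median from every value in that channel."""
    return [[v - median(col) for v in col] for col in channels]

def mad(col):
    """Median absolute deviation of one channel."""
    m = median(col)
    return median([abs(v - m) for v in col])

def scale_normalize(centered):
    """Multiply channel j by C / MAD_j, with C the geometric mean of the MADs."""
    mads = [mad(col) for col in centered]
    c = math.prod(mads) ** (1.0 / len(mads))
    scaled = [[v * c / d for v in col] for col, d in zip(centered, mads)]
    return scaled, mads, c

# One list per channel: Slide1Cy3, Slide1Cy5, Slide2Cy3, Slide2Cy5 (genes 1 to 5).
channels = [
    [8, 7, 3, 1, 9],
    [15, 2, 6, 5, 13],
    [9, 7, 5, 2, 6],
    [13, 15, 8, 9, 11],
]
centered = median_center(channels)
scaled, mads, c = scale_normalize(centered)
```

A channel whose MAD is zero would cause a division by zero in this sketch; real data would need a guard for that case.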

[Figure: Slide 1 log signal means after median centering and scaling all channels; Log Red versus Log Green.]

Evidence of intensity-dependent dye bias

[Figure: M vs. A plot of the logged, centered, and scaled Slide 1 data, where A = (Log Green + Log Red) / 2 and M = Log Red - Log Green.]

To handle intensity-dependent dye bias, Yang, et al. (2002. Nucleic Acids Research, 30, 4 e15) recommend “lowess” normalization prior to median centering and scale normalizing.

“lowess” stands for LOcally WEighted polynomial regreSSion.

The original reference for lowess is Cleveland, W. S. (1979). Robust locally weighted regression and smoothing scatterplots. JASA 74 829-836.

LOESS

• At each point in the data set a low-degree polynomial is fit to a subset of the data, with explanatory variable values near the point whose response is being estimated.

• The polynomial is fit using weighted least squares, giving more weight to points near the point whose response is being estimated and less weight to points further away.

• The value of the regression function for the point is then obtained by evaluating the local polynomial using the explanatory variable values for that data point.

• The LOESS fit is complete after regression function values have been computed for each of the n data points.

From Wikipedia, the free encyclopedia


[Figure: Slide 1 log signal means; vertical axis Log Red.]

[Figure: M vs. A plot for Slide 1 log signal means; M = Log Red - Log Green.]

[Figure: M vs. A plot for Slide 1 log signal means with lowess fit (f=0.40).]

[Figure: Adjust M values.]

[Figure: M vs. A plot after adjustment; A = (Adjusted Log Green + Adjusted Log Red) / 2 and M = Adjusted Log Red - Adjusted Log Green.]

[Figure: Adjusted Log Red versus Adjusted Log Green for Slide 1 log signal means.]

adjusted log red = log red - adj/2

adjusted log green = log green + adj/2

where adj = lowess fitted value


[Figure: M vs. A plot for Slide 1 log signal means with lowess fit (f=0.40); the fitted value 0.883 at A=7 is marked.]

For spots with A=7, the lowess fitted value is 0.883. Thus the value of adj discussed on the previous slide is 0.883 for spots with A=7.

The M value for such spots would be moved down by 0.883. The log red value would be decreased by 0.883/2 and the log green value would be increased by 0.883/2 to obtain adjusted log red and adjusted log green values, respectively.
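As a small illustration of the adjustment, the sketch below applies the fitted value 0.883 to one spot. The log red and log green values are hypothetical, chosen only so that A = 7; the function name is also an assumption.

```python
def adjust_for_dye_bias(log_red, log_green, adj):
    """Split the lowess fitted value adj evenly between the two channels."""
    adjusted_red = log_red - adj / 2.0
    adjusted_green = log_green + adj / 2.0
    return adjusted_red, adjusted_green

log_red, log_green = 7.5, 6.5              # hypothetical spot with A = 7
a_before = (log_green + log_red) / 2.0     # A is the average log intensity
m_before = log_red - log_green             # M is the log ratio
adj_red, adj_green = adjust_for_dye_bias(log_red, log_green, 0.883)
a_after = (adj_green + adj_red) / 2.0
m_after = adj_red - adj_green
```

Because the correction is split evenly between the channels, A is unchanged while M drops by exactly adj.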

How is the lowess curve determined? Weight function

Suppose we have data points (x1,y1), (x2,y2),...(xn,yn).

Let 0 < f ≤ 1 denote a fraction that will determine the smoothness of the curve.

Let r = n*f rounded to the nearest integer.

For i=1, ..., n; let $h_i$ be the rth smallest number among |xi-x1|, |xi-x2|, ..., |xi-xn|.

Consider the tricube weight function defined as

$$T(t) = \begin{cases} (1 - |t|^3)^3 & \text{for } |t| < 1 \\ 0 & \text{for } |t| \ge 1. \end{cases}$$

[Figure: tricube weight function T(t) plotted against t.]

For k=1, 2, ..., n; let $w_k(x_i) = T\big((x_k - x_i)/h_i\big)$.

An Example

i    1  2  3  4  5   6   7   8   9   10
xi   1  2  5  7  12  13  15  25  27  30
yi   1  8  4  5  3   9   16  15  23  29

[Figure: scatterplot of y versus x.]

Suppose a lowess curve will be fit to this data with f=0.4.

Table Containing |xi-xj| Values

      x1  x2  x3  x4  x5  x6  x7  x8  x9  x10
x1     0   1   4   6  11  12  14  24  26  29
x2     1   0   3   5  10  11  13  23  25  28
x3     4   3   0   2   7   8  10  20  22  25
x4     6   5   2   0   5   6   8  18  20  23
x5    11  10   7   5   0   1   3  13  15  18
x6    12  11   8   6   1   0   2  12  14  17
x7    14  13  10   8   3   2   0  10  12  15
x8    24  23  20  18  13  12  10   0   2   5
x9    26  25  22  20  15  14  12   2   0   3
x10   29  28  25  23  18  17  15   5   3   0

Calculation of hi from |xi-xj| Values

n=10, f=0.4, r=4

      x1  x2  x3  x4  x5  x6  x7  x8  x9  x10
x1     0   1   4   6  11  12  14  24  26  29   h1 = 6
x2     1   0   3   5  10  11  13  23  25  28   h2 = 5
x3     4   3   0   2   7   8  10  20  22  25   h3 = 4
x4     6   5   2   0   5   6   8  18  20  23   h4 = 5
x5    11  10   7   5   0   1   3  13  15  18   h5 = 5
x6    12  11   8   6   1   0   2  12  14  17   h6 = 6
x7    14  13  10   8   3   2   0  10  12  15   h7 = 8
x8    24  23  20  18  13  12  10   0   2   5   h8 = 10
x9    26  25  22  20  15  14  12   2   0   3   h9 = 12
x10   29  28  25  23  18  17  15   5   3   0   h10 = 15

Weights wk(xi) Rounded to Nearest 0.001 (rows indexed by i, columns by k)

i\k   1      2      3      4      5      6      7      8      9      10
1     1.000  0.986  0.348  0.000  0.000  0.000  0.000  0.000  0.000  0.000
2     0.976  1.000  0.482  0.000  0.000  0.000  0.000  0.000  0.000  0.000
3     0.000  0.193  1.000  0.670  0.000  0.000  0.000  0.000  0.000  0.000
4     0.000  0.000  0.820  1.000  0.000  0.000  0.000  0.000  0.000  0.000
5     0.000  0.000  0.000  0.000  1.000  0.976  0.482  0.000  0.000  0.000
6     0.000  0.000  0.000  0.000  0.986  1.000  0.893  0.000  0.000  0.000
7     0.000  0.000  0.000  0.000  0.850  0.954  1.000  0.000  0.000  0.000
8     0.000  0.000  0.000  0.000  0.000  0.000  0.000  1.000  0.976  0.670
9     0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.986  1.000  0.954
10    0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.893  0.976  1.000

$$w_6(x_5) = \left(1 - \left(\frac{|x_6 - x_5|}{h_5}\right)^3\right)^3 = \left(1 - \left|\frac{13 - 12}{5}\right|^3\right)^3 = \left(1 - \frac{1}{125}\right)^3 \approx 0.976$$
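The bandwidths and weights in the tables above can be recomputed with the following Python sketch. The function names are illustrative. Note that Python's `round` resolves exact halves to the nearest even integer, which may differ from the rounding rule intended on the slides when n*f ends in .5.

```python
def tricube(t):
    """Tricube weight: (1 - |t|^3)^3 inside (-1, 1), zero outside."""
    return (1.0 - abs(t) ** 3) ** 3 if abs(t) < 1.0 else 0.0

def bandwidths(x, f):
    """h_i is the r-th smallest distance from x_i to the x values (itself included)."""
    r = round(len(x) * f)
    return [sorted(abs(xi - xk) for xk in x)[r - 1] for xi in x]

def tricube_weights(x, f):
    """Row i holds w_k(x_i) = T((x_k - x_i) / h_i) for k = 1..n."""
    h = bandwidths(x, f)
    return [[tricube((xk - xi) / hi) for xk in x] for xi, hi in zip(x, h)]

x = [1, 2, 5, 7, 12, 13, 15, 25, 27, 30]
h = bandwidths(x, 0.4)
w = tricube_weights(x, 0.4)
```

With heavily tied x values, h_i can be zero and the division would fail; this sketch assumes the x values are distinct enough to avoid that.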

How is the lowess curve determined? Regression

For each i=1, 2, ..., n; let $\hat\beta_0^*(x_i)$ and $\hat\beta_1^*(x_i)$ denote the values of $\beta_0$ and $\beta_1$ that minimize

$$\sum_{k=1}^{n} w_k(x_i)\,(y_k - \beta_0 - \beta_1 x_k)^2.$$

For i=1, 2, ..., n; let $\hat y_i^* = \hat\beta_0^*(x_i) + \hat\beta_1^*(x_i)\,x_i$ and $e_i = y_i - \hat y_i^*$.

Consider the bisquare weight function defined as

$$B(t) = \begin{cases} (1 - t^2)^2 & \text{for } |t| < 1 \\ 0 & \text{for } |t| \ge 1. \end{cases}$$

[Figure: bisquare weight function B(t) plotted against t.]

For k=1,2,...,n; let $\delta_k = B\big(e_k/(6s)\big)$ where s is the median of |e1|, |e2|, ..., |en|.

How is the lowess curve determined?

For each i=1, 2, ..., n; let $\hat\beta_0(x_i)$ and $\hat\beta_1(x_i)$ denote the values of $\beta_0$ and $\beta_1$ that minimize

$$\sum_{k=1}^{n} \delta_k\, w_k(x_i)\,(y_k - \beta_0 - \beta_1 x_k)^2.$$

For i=1, 2, ..., n; let $\hat y_i = \hat\beta_0(x_i) + \hat\beta_1(x_i)\,x_i$.

Now use the new fitted values $\hat y_i$ to compute new $\delta_k$ as on the previous slide. Substitute the new $\delta_k$ for the old $\delta_k$ in the expression above and repeat the minimization described above to obtain new $\hat y_i$ values. These resulting $\hat y_i$ values are the lowess fitted values. Plot these values versus x1, x2, ..., xn and connect with straight lines to obtain the lowess curve.
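The procedure on the last two slides can be sketched as follows: one tricube-weighted local line fit per point, followed by two refits using bisquare robustness weights. This is a minimal illustration under stated assumptions, not a reference implementation: the function names are made up here, the fallback used when every neighbor receives a robustness weight of zero is an assumption, and the x values are assumed distinct enough that every bandwidth is positive.

```python
from statistics import median

def tricube(t):
    return (1.0 - abs(t) ** 3) ** 3 if abs(t) < 1.0 else 0.0

def bisquare(t):
    return (1.0 - t * t) ** 2 if abs(t) < 1.0 else 0.0

def weighted_line(x, y, w):
    """Return (b0, b1) minimizing sum_k w[k] * (y[k] - b0 - b1 * x[k])**2."""
    sw = sum(w)
    xbar = sum(wk * xk for wk, xk in zip(w, x)) / sw
    ybar = sum(wk * yk for wk, yk in zip(w, y)) / sw
    sxx = sum(wk * (xk - xbar) ** 2 for wk, xk in zip(w, x))
    sxy = sum(wk * (xk - xbar) * (yk - ybar) for wk, xk, yk in zip(w, x, y))
    b1 = sxy / sxx if sxx > 0 else 0.0   # flat line if only one point carries weight
    return ybar - b1 * xbar, b1

def lowess(x, y, f=0.4, robust_iterations=2):
    n = len(x)
    r = round(n * f)
    delta = [1.0] * n                    # robustness weights; all 1 for the first fit
    fitted = []
    for _ in range(robust_iterations + 1):
        fitted = []
        for xi in x:
            h = sorted(abs(xi - xk) for xk in x)[r - 1]
            tri = [tricube((xk - xi) / h) for xk in x]
            w = [d * t for d, t in zip(delta, tri)]
            if sum(w) == 0.0:            # assumed fallback: ignore robustness weights
                w = tri
            b0, b1 = weighted_line(x, y, w)
            fitted.append(b0 + b1 * xi)
        resid = [yk - fk for yk, fk in zip(y, fitted)]
        s = median([abs(e) for e in resid])
        if s == 0:                       # fit is already exact for most points
            break
        delta = [bisquare(e / (6.0 * s)) for e in resid]
    return fitted

x = [1, 2, 5, 7, 12, 13, 15, 25, 27, 30]
y = [1, 8, 4, 5, 3, 9, 16, 15, 23, 29]
lowess_values = lowess(x, y, f=0.4)
```

With `robust_iterations=0` the function returns the initial fitted values from the first regression slide; the default of 2 matches the two refits described above.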


[Figure: plot showing all 10 lines and predicted values after one more iteration.]

[Figure: the lowess curve.]

After a separate lowess normalization for each slide, the adjusted values can be median centered and scale normalized across all channels using the lowess-normalized data for each channel.

A sector represents the set of points spotted by a single pin on a single slide. The entire normalization process described above can be carried out separately for each sector on each channel.

It may be necessary to normalize by sector/channel combinations if spatial variability is apparent.

[Figure: boxplots of mean signal after logging, lowess normalization, median centering, and scaling; vertical axis "Normalized Signal".]

Bolstad, et al. (2003, Bioinformatics 19 2:185-193) propose quantile normalization for microarray data.

• Quantile normalization is most commonly used in normalization of Affymetrix data.

• It can be used for two-color data as well.

• Quantile normalization can force each channel to have the same quantiles.

• xq (for q between 0 and 1) is the q quantile of a data set if the fraction of the data points less than or equal to xq is at least q, and the fraction of the data points greater than or equal to xq is at least 1-q.

• median = x0.5, Q1 = x0.25, Q3 = x0.75

[Figure: boxplots of log signal means after quantile normalization.]

[Figure: original Slide 1 log signal means; vertical axis Log Red.]

[Figure: comparison of Slide 1 log signal means after quantile normalization; Log Red versus Log Green.]

Details of Quantile Normalization

1. Find the smallest log signal on each channel.

2. Average the values from step 1.

3. Replace each value in step 1 with the average computed in step 2.

4. Repeat steps 1 through 3 for the second smallest values, third smallest values,..., largest values.

A Simple Example

Gene  Slide1Cy3  Slide1Cy5  Slide2Cy3  Slide2Cy5
1     8          15         9          13
2     7          2          7          15
3     3          6          5          8
4     1          5          2          9
5     9          13         6          11

Find the Smallest Value for Each Channel: 1, 2, 2, 8.

Average These Values: (1+2+2+8)/4=3.25

Replace Each Value by the Average

Gene  Slide1Cy3  Slide1Cy5  Slide2Cy3  Slide2Cy5
1     8          15         9          13
2     7          3.25       7          15
3     3          6          5          3.25
4     3.25       5          3.25       9
5     9          13         6          11

Find the Next Smallest Values: 3, 5, 5, 9.

Average These Values: (3+5+5+9)/4=5.5

Gene  Slide1Cy3  Slide1Cy5  Slide2Cy3  Slide2Cy5
1     8          15         9          13
2     7          3.25       7          15
3     5.50       6          5.50       3.25
4     3.25       5.50       3.25       5.50
5     9          13         6          11

Find the Average of the Next Smallest Values: (7+6+6+11)/4=7.5

Gene  Slide1Cy3  Slide1Cy5  Slide2Cy3  Slide2Cy5
1     8          15         9          13
2     7.50       3.25       7          15
3     5.50       7.50       5.50       3.25
4     3.25       5.50       3.25       5.50
5     9          13         7.50       7.50

(8+13+7+13)/4=10.25

Gene  Slide1Cy3  Slide1Cy5  Slide2Cy3  Slide2Cy5
1     10.25      15         9          10.25
2     7.50       3.25       10.25      15
3     5.50       7.50       5.50       3.25
4     3.25       5.50       3.25       5.50
5     9          10.25      7.50       7.50

(9+15+9+15)/4=12.00

Gene  Slide1Cy3  Slide1Cy5  Slide2Cy3  Slide2Cy5
1     10.25      12.00      12.00      10.25
2     7.50       3.25       10.25      12.00
3     5.50       7.50       5.50       3.25
4     3.25       5.50       3.25       5.50
5     12.00      10.25      7.50       7.50

This is the data matrix after quantile normalization.
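The quantile normalization steps can be written compactly in Python. The sketch below uses the example data with one list per channel; the function name is illustrative, and ties within a channel are broken by position, which is an assumption (the example data has no ties).

```python
def quantile_normalize(channels):
    """Replace the k-th smallest value in each channel by the mean of the k-th smallest values."""
    n = len(channels[0])
    sorted_channels = [sorted(col) for col in channels]
    rank_means = [sum(sc[k] for sc in sorted_channels) / len(channels) for k in range(n)]
    normalized = []
    for col in channels:
        order = sorted(range(n), key=lambda i: col[i])   # gene indices from smallest to largest
        out = [0.0] * n
        for rank, gene in enumerate(order):
            out[gene] = rank_means[rank]
        normalized.append(out)
    return normalized

# One list per channel: Slide1Cy3, Slide1Cy5, Slide2Cy3, Slide2Cy5 (genes 1 to 5).
channels = [
    [8, 7, 3, 1, 9],
    [15, 2, 6, 5, 13],
    [9, 7, 5, 2, 6],
    [13, 15, 8, 9, 11],
]
normalized = quantile_normalize(channels)
```

After normalization every channel contains the same set of values (3.25, 5.5, 7.5, 10.25, 12), only assigned to different genes.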

Background Correction and Normalization of Affymetrix GeneChip Data

Affymetrix .CEL Files

• A .CEL file contains one number representing signal intensity for each probe cell on a single GeneChip.

• .CEL files can be read with Affymetrix software or in R using the Bioconductor package affy.

• We will discuss two methods for normalizing and obtaining expression measures using data from Affymetrix .CEL files.

Methods

1. Microarray Analysis Suite (MAS) 5.0 Signal proposed by Affymetrix. Statistical Algorithms Description Document (2002) Affymetrix Inc.

2. Robust Multi-array Average (RMA) proposed by Irizarry et al. (2003) Biostatistics 4, 249-264.

These are perhaps the two most popular of many methods for normalizing and computing expression measures using Affymetrix data. Currently over 50 methods are described and compared at http://affycomp.biostat.jhsph.edu/.


MAS 5.0 Signal: Background Adjustment

• Each chip is divided into 16 rectangular zones.

• The lowest 2% of intensities in each zone are averaged to form a zone-specific background value denoted bZk for zones k=1, 2, ..., 16.

• The standard deviation of the lowest 2% of intensities in each zone is calculated and denoted nZk for zones k=1, 2, ..., 16.

• Let dk(x,y) denote the distance from the center of zone k to a probe cell located at coordinates (x,y) on the chip.

GeneChip Divided into 16 Zones

 1   2   3   4
 5   6   7   8
 9  10  11  12
13  14  15  16

[Figure: probe cell at coordinates (x,y) within the 4 by 4 grid of zones.]

16 Distances to Zone Centers for Each Probe Cell

[Figure: distances d1(x,y), d4(x,y), ..., d16(x,y) from a probe cell to the zone centers.]

MAS 5.0 Signal: Background Adjustment (continued)

• Let $w_k(x,y) = 1/\big(d_k^2(x,y) + 100\big)$.

• Denote the background for the cell located at coordinates (x,y) by

$$b(x,y) = \frac{\sum_{k=1}^{16} w_k(x,y)\, b_{Zk}}{\sum_{k=1}^{16} w_k(x,y)}.$$

• Denote the “noise” for the cell located at coordinates (x,y) by

$$n(x,y) = \frac{\sum_{k=1}^{16} w_k(x,y)\, n_{Zk}}{\sum_{k=1}^{16} w_k(x,y)}.$$

MAS 5.0 Signal: Background Adjustment (continued)

• Let I(x,y) denote the original intensity of the cell located at coordinates (x,y) on the chip. (75th percentile of 36 pixel intensities in the center of the cell.)

• Let I’(x,y)=max ( I(x,y) , 0.5 ).

• Define the background-adjusted intensity for the cell at coordinates (x,y) by

A(x,y)=max { I’(x,y)-b(x,y) , 0.5n(x,y) }.

• Henceforth these background-adjusted intensities will be referred to as either PM or MM for perfect match or mismatch cells, respectively.
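A rough sketch of the background adjustment for a single probe cell follows. It reads the zone weight as 1/(d_k² + 100), which is an interpretation of the slide's garbled formula; the function names, parameter names, and example numbers are hypothetical.

```python
def zone_weighted_average(distances, zone_values, smooth=100.0):
    """Weighted average of zone values with weights 1 / (d_k^2 + smooth)."""
    weights = [1.0 / (d * d + smooth) for d in distances]
    return sum(w * v for w, v in zip(weights, zone_values)) / sum(weights)

def background_adjust(intensity, distances, zone_backgrounds, zone_noises):
    """A(x,y) = max(I'(x,y) - b(x,y), 0.5 * n(x,y)) with I' = max(I, 0.5)."""
    b = zone_weighted_average(distances, zone_backgrounds)
    n = zone_weighted_average(distances, zone_noises)
    floored = max(intensity, 0.5)
    return max(floored - b, 0.5 * n)

# Hypothetical cell: distances to the 16 zone centers and per-zone summaries.
distances = [10.0 * k for k in range(1, 17)]
zone_backgrounds = [50.0] * 16
zone_noises = [4.0] * 16
bright_cell = background_adjust(200.0, distances, zone_backgrounds, zone_noises)
dim_cell = background_adjust(51.0, distances, zone_backgrounds, zone_noises)
```

The second call shows the role of the noise floor: subtracting background would leave only 1, so the adjusted intensity is set to half the local noise instead.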


MAS 5.0 Signal: Ideal Mismatch Computation

• MM values are supposed to provide measures of cross-hybridization and stray signal intensity that inflate the value of PM.

• In the simplest case, a PM value would be corrected simply by subtracting its corresponding MM value.

• However, some MM values are bigger than their corresponding PM values so that PM-MM would become negative.

• Because negative values do not make sense and would pose problems with subsequent steps in analysis, Affymetrix determines an Ideal Mismatch (IM) value for each probe pair that is guaranteed to be less than PM.


MAS 5.0 Signal: Ideal Mismatch Computation (continued)

For a given probe set containing n probe pairs, let PMj and MMj denote the perfect match and mismatch values of the jth probe pair. The IM value from the jth probe pair (IMj) is determined as follows:

• If PMj > MMj, then IMj = MMj and no further computation is needed.

• If PMj ≤ MMj, compute

M = TBW { log2(PM1/MM1),...,log2(PMn/MMn) }

where TBW denotes a one-step Tukey BiWeight (a special weighted average described later).

MAS 5.0 Signal: Ideal Mismatch Computation (continued)

• If M > 0.03, then $IM_j = PM_j / 2^{M}$.

• If M ≤ 0.03, then compute $P = \dfrac{0.03}{1 + (0.03 - M)/10}$ and let $IM_j = PM_j / 2^{P}$.

• Note that at M = 0.03, IMj = PMj / 1.021012 so that PMj will be slightly larger than IMj.

• As M gets larger, IMj decreases. As M decreases from 0.03 toward 0, IMj increases towards PMj / 1.020949.
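The three cases of the Ideal Mismatch rule can be collected into one function. This is a sketch that reads P as 0.03 / (1 + (0.03 - M)/10); the function and parameter names are illustrative, and M is assumed to have been computed beforehand as the one-step Tukey biweight of the log2(PM/MM) values for the probe set.

```python
def ideal_mismatch(pm, mm, m, contrast=0.03, scale=10.0):
    """Return IM_j for one probe pair given PM_j, MM_j and the probe-set value M."""
    if pm > mm:
        return mm                        # MM is usable as it is
    if m > contrast:
        return pm / 2.0 ** m             # shrink PM by the typical PM/MM ratio
    p = contrast / (1.0 + (contrast - m) / scale)
    return pm / 2.0 ** p                 # IM stays slightly below PM

im_usable = ideal_mismatch(100.0, 60.0, 1.0)
im_large_m = ideal_mismatch(100.0, 120.0, 1.0)
im_boundary = ideal_mismatch(100.0, 120.0, 0.03)
im_zero_m = ideal_mismatch(100.0, 120.0, 0.0)
```

In every branch the returned value is strictly less than PM, which is the property the Ideal Mismatch is designed to guarantee.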

MAS 5.0 Signal: Signal Log Value Computation

• Let $V_j = \max(PM_j - IM_j,\ 2^{-20})$.

• Define the probe value for the jth probe pair by $PV_j = \log_2(V_j)$.

• The signal log value for a given probe set is defined by

SLV = TBW ( PV1 , PV2 , ... , PVn )

where TBW denotes a one-step Tukey BiWeight (a special weighted average to be discussed later).

MAS 5.0 Signal: Scaling and Signal Calculation

• Let $SLV_i$ denote the signal log value for the ith probe set on a single chip.

• Let I denote the number of probe sets on the chip.

• Let $SF = 500 / \mathrm{TrimMean}(2^{SLV_1}, 2^{SLV_2}, \ldots, 2^{SLV_I};\ 0.02, 0.98)$, where TrimMean is the average of the values in parentheses that are strictly between the 0.02 and 0.98 quantiles of the values in parentheses.

• MAS 5.0 Signal for the ith probe set is $\mathrm{Signal}_i = SF \cdot 2^{SLV_i}$.

• All computations are done separately for each chip to obtain a Signal value for each chip and probe set.

The One-Step Tukey BiWeight Estimator Used by Affymetrix

• Let x1, x2, ..., xn denote observations.

• Let m = median ( x1, x2, ..., xn ).

• Let MAD = median ( |x1 – m|, |x2 – m|, ..., |xn – m| ).

• For each i = 1, 2, ..., n; let $t_i = \dfrac{x_i - m}{5 \cdot \mathrm{MAD} + 0.0001}$, where 0.0001 is the factor Affymetrix uses to avoid division by 0.

The One-Step Tukey BiWeight Estimator Used by Affymetrix (ctd.)

Recall the bisquare weight function defined as

$$B(t) = \begin{cases} (1 - t^2)^2 & \text{for } |t| < 1 \\ 0 & \text{for } |t| \ge 1. \end{cases}$$

[Figure: bisquare weight function B(t) plotted against t.]

$$\mathrm{TBW}(x_1, x_2, \ldots, x_n) = \frac{\sum_{i=1}^{n} B(t_i)\, x_i}{\sum_{i=1}^{n} B(t_i)}$$

An Example

Compute TBW ( 1, 7, 13, 15, 28, 1075 ).

m = ( 13 + 15 ) / 2 = 14.

MAD = median ( |1-14|,|7-14|,|13-14|,|15-14|,|28-14|,|1075-14| )
= median ( 13, 7, 1, 1, 14, 1061 )
= median ( 1, 1, 7, 13, 14, 1061 )
= ( 7 + 13 ) / 2 = 10.

Ignore the 0.0001 factor to make calculations easier.

t1 = -13 / 50, t2 = -7 / 50, t3 = -1 / 50, t4 = 1 / 50, t5 = 14 / 50, t6 = 1061 / 50

An Example (continued)

B(t1) = B(-0.26) = ( 1 - 0.26^2 )^2 = 0.8693698
B(t2) = B(-0.14) = ( 1 - 0.14^2 )^2 = 0.9611842
B(t3) = B(-0.02) = ( 1 - 0.02^2 )^2 = 0.9992002
B(t4) = B(0.02) = ( 1 - 0.02^2 )^2 = 0.9992002
B(t5) = B(0.28) = ( 1 - 0.28^2 )^2 = 0.8493466
B(t6) = 0

$$\mathrm{TBW} = \frac{0.8693698 \cdot 1 + 0.9611842 \cdot 7 + 0.9992002 \cdot 13 + 0.9992002 \cdot 15 + 0.8493466 \cdot 28 + 0 \cdot 1075}{0.8693698 + 0.9611842 + 0.9992002 + 0.9992002 + 0.8493466 + 0} = 12.68772.$$
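The example can be checked with a short Python sketch of the one-step Tukey biweight. The function and parameter names are illustrative; setting `epsilon=0.0` reproduces the simplification used in the hand calculation.

```python
from statistics import median

def bisquare(t):
    return (1.0 - t * t) ** 2 if abs(t) < 1.0 else 0.0

def tukey_biweight(values, c=5.0, epsilon=0.0001):
    """One-step Tukey biweight: a weighted mean that downweights points far from the median."""
    m = median(values)
    mad = median([abs(v - m) for v in values])
    t = [(v - m) / (c * mad + epsilon) for v in values]
    weights = [bisquare(ti) for ti in t]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

observations = [1, 7, 13, 15, 28, 1075]
tbw_simplified = tukey_biweight(observations, epsilon=0.0)
tbw_default = tukey_biweight(observations)
```

The outlying value 1075 receives weight zero, so the estimate stays near the bulk of the data, whereas the ordinary mean of these six values is about 190.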

Obtaining MAS5.0 Signal Values from Affymetrix .CEL Files

• MAS5.0 Signal values can be obtained from Affymetrix software.

• Approximate MAS5.0 Signal values can be computed with the mas5 function that is part of the Bioconductor package affy.

Robust Multi-array Average (RMA)

1. Background adjust PM values from .CEL files.

2. Take the base-2 log of each background-adjusted PM intensity.

3. Quantile normalize values from step 2 across all GeneChips.

4. Perform median polish separately for each probe set with rows indexed by GeneChip and columns indexed by probe.

5. For each row, find the average of the fitted values from step 4 to use as probe-set-specific expression measures for each GeneChip.

RMA: Background Adjustment

Assume PM = S + B where signal S ~ Exp(λ) independent of background B ~ N+(μ,σ²).

N+(μ,σ²) denotes N(μ,σ²) truncated on the left at 0.

[Figure: the probability density function $\lambda e^{-\lambda s}$ of the exponential distribution with mean 1/λ = 10000.]

[Figure: the probability density function $e^{-(b-\mu)^2/(2\sigma^2)} / (2\pi\sigma^2)^{0.5}$ of the normal distribution with mean μ = 1000 and variance σ² = 300².]

[Figure: the probability density function of s + b, where s ~ Exp(λ = 1/10000) and b ~ N+(μ = 1000, σ² = 300²).]

RMA: Background Adjustment (continued)

[Formula for E(S|PM) shown on the slide as an image; its annotations identify the N(0,1) density function and the N(0,1) distribution function.]

Separately for each chip, estimate μ, σ, and λ from the observed PM distribution. Plug those estimates into the formula above to obtain an estimate of E(S|PM) for each PM value. These serve as background-adjusted PM values.
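Because the slide's formula did not survive extraction, the sketch below uses the closed form that follows from the stated model: given PM = o, the signal S is a normal variable with mean a = o - μ - σ²λ and standard deviation σ, truncated to the interval [0, o]. This is believed to match the expression usually quoted for RMA, but treat it as an assumption rather than a transcription of the slide. The function names and parameter values are illustrative.

```python
import math

def normal_pdf(z):
    """N(0,1) density function."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    """N(0,1) distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_signal(pm, mu, sigma, lam):
    """E(S | PM = pm) for S ~ Exp(lam) and B ~ N(mu, sigma^2) truncated at 0."""
    a = pm - mu - sigma ** 2 * lam
    upper = (pm - a) / sigma
    numerator = normal_pdf(a / sigma) - normal_pdf(upper)
    denominator = normal_cdf(a / sigma) + normal_cdf(upper) - 1.0
    return a + sigma * numerator / denominator

# Parameter values taken from the density figures: mu = 1000, sigma = 300, 1/lam = 10000.
bright = expected_signal(5000.0, 1000.0, 300.0, 1.0 / 10000.0)
dim = expected_signal(800.0, 1000.0, 300.0, 1.0 / 10000.0)
```

For a bright probe the adjustment is close to subtracting μ + σ²λ, while for a dim probe the adjusted value stays positive even though PM is below μ.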

RMA: Background Adjustment (continued)

Obtaining Estimates of μ, σ, and λ

(unpublished description of the procedure)

• Estimate the mode of the PM distribution using a kernel density estimate of the PM density.

• Estimate the density of the PM values less than the mode. The mode of this distribution serves as an estimate of μ.

• Assume the data to the left of the estimate of μ are the background observations that fell below their mean. Use those observations to estimate σ.

• Subtract the estimate of μ from all observations larger than the estimate. The mode of this distribution estimates 1/λ.

[Figure: PM density estimate based on simulated data. Data below the estimated mode is used to estimate background parameters μ and σ.]

[Figure: density estimate of PM data below the estimated mode of the PM distribution. Estimate of μ = 1612. This data is used to estimate σ as 642.3.]

Estimate of σ

According to the RMA R code, σ is estimated as follows:

The purpose of the factor of 2 in the numerator is not clear.

[Figure: density estimate of PM – μ̂ values greater than zero. Estimate of 1/λ = 2019.]

The mean of these values would be a much better estimate of 1/λ in this case. (Mean is 9848 and 1/λ=10000.)

RMA: Quantile Normalization

1. After background adjustment, find the smallest log2(PM) on each chip.

2. Average the values from step 1.

3. Replace each value in step 1 with the average computed in step 2.

4. Repeat steps 1 through 3 for the second smallest values, third smallest values,..., largest values.

RMA: Median Polish

• For a given probe set with J probe pairs, let $y_{ij}$ denote the background-adjusted, base-2-logged, and quantile-normalized value for GeneChip i and probe j.

• Assume $y_{ij} = \mu_i + \alpha_j + e_{ij}$ where $\alpha_1 + \alpha_2 + \cdots + \alpha_J = 0$. Here $\mu_i$ is the gene expression of the probe set on GeneChip i, $\alpha_j$ is the probe affinity effect for the jth probe in the probe set, and $e_{ij}$ is the residual for the jth probe on the ith GeneChip.

• Perform Tukey’s Median Polish on the matrix of $y_{ij}$ values with $y_{ij}$ in the ith row and jth column.

RMA: Median Polish (continued)

• Let $\hat y_{ij}$ denote the fitted value for $y_{ij}$ that results from the median polish procedure.

• Let $\hat\alpha_j = \hat y_{\cdot j} - \hat y_{\cdot\cdot}$ where $\hat y_{\cdot j} = \sum_{i=1}^{I} \hat y_{ij} / I$ and $\hat y_{\cdot\cdot} = \sum_{i=1}^{I} \sum_{j=1}^{J} \hat y_{ij} / (IJ)$, and I denotes the number of GeneChips.

• Let $\hat\mu_i = \hat y_{i\cdot} = \sum_{j=1}^{J} \hat y_{ij} / J$.

• $\hat\mu_i$ is the probe-set-specific measure of expression for GeneChip i.

An Example

Suppose the following are background-adjusted, log2-transformed, quantile-normalized PM intensities for a single probe set. Determine the final RMA expression measures for this probe set.

                 Probe
GeneChip    1    2    3    4    5
1           4    3    6    4    7
2           8    1   10    5   11
3           6    2    7    8    8
4           9    4   12    9   12
5           7    5    9    6   10

An Example (continued)

Original matrix and row medians:

 4   3   6   4   7      row median 4
 8   1  10   5  11      row median 8
 6   2   7   8   8      row median 7
 9   4  12   9  12      row median 9
 7   5   9   6  10      row median 7

Matrix after removing row medians:

 0  -1   2   0   3
 0  -7   2  -3   3
-1  -5   0   1   1
 0  -5   3   0   3
 0  -2   2  -1   3

Column medians: 0 -5 2 0 3

Matrix after subtracting column medians:

 0   4   0   0   0      row median 0
 0  -2   0  -3   0      row median 0
-1   0  -2   1  -2      row median -1
 0   0   1   0   0      row median 0
 0   3   0  -1   0      row median 0

Matrix after removing row medians:

 0   4   0   0   0
 0  -2   0  -3   0
 0   1  -1   2  -1
 0   0   1   0   0
 0   3   0  -1   0

Column medians: 0 1 0 0 0

Matrix after subtracting column medians:

 0   3   0   0   0
 0  -3   0  -3   0
 0   0  -1   2  -1
 0  -1   1   0   0
 0   2   0  -1   0

All row medians and column medians are 0. Thus the median polish procedure has converged. The above is the residual matrix that we will subtract from the original matrix to obtain the fitted values.

Original matrix:

 4   3   6   4   7
 8   1  10   5  11
 6   2   7   8   8
 9   4  12   9  12
 7   5   9   6  10

Residuals from median polish:

 0   3   0   0   0
 0  -3   0  -3   0
 0   0  -1   2  -1
 0  -1   1   0   0
 0   2   0  -1   0

Matrix of fitted values and row means:

 4   0   6   4   7      row mean 4.2 = μ̂1
 8   4  10   8  11      row mean 8.2 = μ̂2
 6   2   8   6   9      row mean 6.2 = μ̂3
 9   5  11   9  12      row mean 9.2 = μ̂4
 7   3   9   7  10      row mean 7.2 = μ̂5

The row means are the RMA expression measures for the 5 GeneChips.
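The whole worked example can be reproduced with a short median polish sketch in Python. The function name and the stopping rule (stop once a full sweep finds all row and column medians equal to zero, with an iteration cap) are choices made here for illustration.

```python
from statistics import median

def median_polish(matrix, max_iterations=20):
    """Alternately remove row and column medians; return (fitted values, residuals)."""
    resid = [[float(v) for v in row] for row in matrix]
    for _ in range(max_iterations):
        row_medians = [median(row) for row in resid]
        resid = [[v - m for v in row] for row, m in zip(resid, row_medians)]
        col_medians = [median(col) for col in zip(*resid)]
        resid = [[v - m for v, m in zip(row, col_medians)] for row in resid]
        if all(m == 0 for m in row_medians) and all(m == 0 for m in col_medians):
            break   # nothing was removed in this sweep, so the procedure has converged
    fitted = [[o - r for o, r in zip(orig_row, res_row)]
              for orig_row, res_row in zip(matrix, resid)]
    return fitted, resid

# Rows are GeneChips, columns are probes, as in the example above.
y = [
    [4, 3, 6, 4, 7],
    [8, 1, 10, 5, 11],
    [6, 2, 7, 8, 8],
    [9, 4, 12, 9, 12],
    [7, 5, 9, 6, 10],
]
fitted, residuals = median_polish(y)
expression = [sum(row) / len(row) for row in fitted]   # one RMA measure per GeneChip
```

The exact-zero convergence test works here because the example uses small integers; with real log2 intensities a tolerance on the medians would be the safer choice.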


Miscellaneous Comments on Normalization

• We have only scratched the surface in terms of normalization methods. There are many variations on the techniques that were described previously as well as other approaches that we won’t discuss at this point in the course.

• Normalization affects the final results, but it is often not clear what normalization strategy is best.

• It would be good to integrate normalization and statistical analysis, but it is difficult to do so. The most common approach is to normalize data and then perform statistical analysis of the normalized data as a separate step in the microarray analysis process.
