Statistical Applications in Genetics and Molecular Biology
Volume 4, Issue 1 2005 Article 16

Error Distribution for Gene Expression Data
Elizabeth Purdom* Susan P. Holmes*
*Stanford University, [email protected] Stanford University, [email protected]

Copyright © 2005 by the authors. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior written permission of the publisher, bepress, which has been given certain exclusive rights by the author. Statistical Applications in Genetics and Molecular Biology is produced by Berkeley Electronic Press (bepress). http://www.bepress.com/sagmb




Error Distribution for Gene Expression Data∗

Elizabeth Purdom and Susan P. Holmes

Abstract

We present a new instance of Laplace’s second Law of Errors and show how it can be used in the analysis of data from microarray experiments. This error distribution is shown to fit microarray expression data much better than a normal distribution. The use of this distribution in a parametric bootstrap leads to more powerful tests, as we show that the t-test is conservative in this setting. We propose a biological explanation for this distribution based on the Pareto distribution of the variables used to compute the log ratios.

KEYWORDS: Asymmetric Laplace, Gene Expression, Error Distribution, Laplace

∗ Work supported by the NSF grant DMS 02-41246 and a Gabilan Stanford Graduate fellowship. We would like to thank Persi Diaconis, David Siegmund and Noureddine El Karoui for numerous references and helpful discussions, Blythe Durbin and David Rocke for useful correspondence regarding their normalization method, and Sandrine Dudoit, Yee Yang, and Wolfgang Huber for making their Bioconductor packages freely available.


1 Introduction

Microarrays allow the researcher to investigate the behavior of thousands of genes simultaneously under various conditions. The insights possible from such experiments are vast, but the number of errors due to the complexity of the experiments is also substantial. A great deal of statistical research has focused on eliminating known biases introduced at different stages of the process and then discriminating which genes are differentially expressed across the conditions of interest (for example, observations from cancer patients versus healthy patients).

We propose the use of a known parametric model as an approximation of the distribution of the log-ratios of measured gene expression across genes: the Asymmetric Laplace Distribution (Kotz et al., 2001). Namely, if all the genes on one array are considered as separate independent observations, the distribution of the log-ratio of the expression values is well approximated by the Asymmetric Laplace Distribution. Figure 1 gives an example of the fit of the Asymmetric Laplace Distribution as compared to the Normal distribution. The Asymmetric Laplace captures the peak at the center of the data as well as the asymmetry in the distribution. Gene expression ratios, of course, are not independent. However, there are many instances, particularly in normalization of the arrays, where the statistical analysis does assume independence among the genes.

In the two-color microarray datasets for which we fit the Asymmetric Laplace distribution (described in Section 4.1), the model usually gave a reasonable fit to the gene expression data and greatly improved upon the Normal distribution. As an approximating distribution, it provides an alternative parametric model to the normal distribution with which to explore the effects of statistical procedures. The Laplace distribution has an appealing representation as a mixture of normals with differing variances. Furthermore, the Laplace distribution gives a conceptually simple adjustment to existing normalization methods, which yields procedures that are robust as well as parametrically justified.

2 Brief Background

A two-color microarray experiment takes two different samples of cDNA tagged with different dyes, red (Cy5) and green (Cy3). The two samples are hybridized to known DNA sequences that are spotted on a glass slide. After hybridization, the slide or array is scanned to measure the dye intensities. Higher dye intensities imply greater presence of the mRNA in the sample corresponding to that dye. Typically, one of the samples is a standard reference made of mRNA from pooled cell lines and the other sample is from the observation. A microarray experiment will repeat this for each observation. Each array will give information on the relative gene expression of the observation compared with the standard reference; these relative gene expression patterns can be compared among patients. When done for classification of conditions, the experiment is designed so that the observations come from different known conditions, and the relative expression can be compared between these groups. Differentially expressed genes are genes that are expressed differently (relative to the reference) between the conditions of interest.

Figure 1: Histogram of gene expression of all genes on a single cDNA microarray from the T-Cell data (described below in Section 4.1). The corresponding Asymmetric Laplace and Normal distributions are overlaid, with parameters estimated by maximum likelihood.

Statistical Applications in Genetics and Molecular Biology, Vol. 4 [2005], Iss. 1, Art. 16
http://www.bepress.com/sagmb/vol4/iss1/art16 DOI: 10.2202/1544-6115.1070

However, technical aspects of the experiment, such as the position of the spot on the chip or different levels of incorporation of the Cy5 and Cy3 dyes, mean that the measured expression levels have built-in biases. Indeed, even for “self-self” hybridization experiments, where both samples on the array come from the same original sample, the error in measurement of relative gene expression is biased. Normalization methods try to correct the expression levels within an array to counteract this bias.

Generally such methods assume that for each array, only a small proportion of the total genes should be expressed differently from the reference sample. Thus, we expect the difference in gene expression between the red and the green channel to be due to just random fluctuation. In other words, the resulting gene expression represents observations from some error distribution. Normalization techniques transform the data so that the distribution of gene expression across the array reflects this assumption. One common technique in cDNA arrays is to look at the plot of the difference of the red and green against the average expression (on a log scale) and then transform the data so that the log difference is centered at zero, using for instance LOWESS regression (Dudoit and Yang, 2003). This can be done globally for all of the genes at once, or separately based on the layout information of the genes on the slide. Another approach, variance stabilizing normalization (VSN), further transforms the data to stabilize the variance across expression level (see Durbin et al., 2002; Huber et al., 2003).
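The intensity-dependent centering described above can be sketched in a few lines. This is only an illustration under stated assumptions: it replaces LOWESS with a cruder binned-median correction, and the function names (`ma_values`, `binned_median_center`) and the default of 20 bins are hypothetical choices, not taken from the paper or from any Bioconductor package.

```python
import numpy as np

def ma_values(red, green):
    """Return (M, A): log2 ratio and average log2 intensity of the two channels."""
    log_r = np.log2(np.asarray(red, dtype=float))
    log_g = np.log2(np.asarray(green, dtype=float))
    return log_r - log_g, 0.5 * (log_r + log_g)

def binned_median_center(m, a, n_bins=20):
    """Subtract the median of M within quantile bins of A.

    A crude stand-in for LOWESS (an assumption of this sketch): both aim to
    remove an intensity-dependent trend so the log differences center at zero.
    """
    m = np.asarray(m, dtype=float)
    a = np.asarray(a, dtype=float)
    edges = np.quantile(a, np.linspace(0.0, 1.0, n_bins + 1))
    # Map each spot to a bin index in 0..n_bins-1.
    bin_idx = np.clip(np.searchsorted(edges, a, side="right") - 1, 0, n_bins - 1)
    centered = m.copy()
    for b in range(n_bins):
        in_bin = bin_idx == b
        if in_bin.any():
            centered[in_bin] -= np.median(m[in_bin])
    return centered
```

Applying the same function separately to each print-tip group would correspond to the layout-based variant mentioned in the text.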

The distribution of the normalized gene expressions, while similar across arrays, is often far from normal, regardless of the normalization method. Rather, the distribution tends toward heavy tails and asymmetry of varying degrees (see, for example, Figure 2). Traditional centering and scaling with the mean and the standard deviation, as suggested by a normal approximation, are sensitive to outlying points. Because of the heavy tails and non-normality of the data, many authors suggest recentering and rescaling microarray data with more robust estimates of location and variance, such as the median and the mean or median absolute deviation, respectively (Yang et al., 2001). This suggests an error distribution that estimates the location parameter with the median and the scale parameter with the mean absolute deviation (MeanAD¹). Such a distribution exists and is called Laplace’s First Error Distribution, a Laplace distribution, or a double exponential distribution. The classical Laplace distribution is symmetric around its location parameter; however, gene expression data often display signs of asymmetry. A known generalization of the Laplace distribution, the Asymmetric Laplace, allows for asymmetry if necessary (see Figure 5).

¹ MAD often refers to the median absolute deviation rather than the mean absolute deviation; thus we use the abbreviation MeanAD.


Figure 2: Histogram of gene expression of all genes on a single array from different microarray datasets (after normalizations as described below in Section 4.1). Panels: (a) T-Cell, (b) Self-Self, (c) Swirl Zebrafish, (d) Yeast.


Figure 3: Density plot of a standard Laplace L(0, 1) and a standard normal N(0, 1).

3 Overview of the Laplace Distribution

The Laplace distribution (L(θ, σ)) has two parameters, a location parameter θ and a scale parameter σ. The density function is

$$f_Y(y) = \frac{1}{\sqrt{2}\,\sigma}\exp\left(-\sqrt{2}\,|y-\theta|/\sigma\right), \qquad \sigma > 0$$

See Figure 3 for a plot of the density. The maximum likelihood estimates of θ and s = σ/√2 are the median and the MeanAD, respectively. An L(θ, σ) distribution has expected value θ and variance σ² = 2s². The L(θ, σ) distribution has “heavier tails” than the normal, meaning that there is more probability of extreme values than under a normal distribution. In addition, the L(θ, σ) distribution concentrates more probability in the center than a normal distribution.
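As a small illustration of these estimates, the sketch below computes the median and the MeanAD and converts the latter to σ = √2·s. The function name `laplace_mle` is a hypothetical label chosen for this example.

```python
import numpy as np

def laplace_mle(y):
    """Maximum likelihood estimates for the symmetric Laplace L(theta, sigma).

    theta_hat is the sample median, s_hat is the mean absolute deviation
    around the median (MeanAD), and sigma_hat = sqrt(2) * s_hat so that
    sigma_hat**2 estimates the variance.
    """
    y = np.asarray(y, dtype=float)
    theta_hat = np.median(y)
    s_hat = np.mean(np.abs(y - theta_hat))
    sigma_hat = np.sqrt(2.0) * s_hat
    return theta_hat, s_hat, sigma_hat
```

Note that NumPy's own Laplace sampler, `Generator.laplace(loc, scale)`, is parameterized by s rather than σ, so simulated data with `scale=0.2` should return `s_hat` near 0.2 and `sigma_hat` near 0.283.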

Distributions have been proposed in other contexts that adjust the Laplace distribution so as to admit a skewness parameter. In particular, a family of distributions proposed by Hinkley and Revankar (1977), the Asymmetric Laplace Distribution (AL(θ, µ, σ)), introduces a skew parameter, µ (or κ in a different parameterization), to the classical Laplace distribution, while maintaining basic properties of the Laplace distribution. The density of the AL(θ, µ, σ) can be explicitly written (see Figure 4 for illustrations of the distribution):


Figure 4: Density plot of the Asymmetric Laplace AL(θ, µ, σ), with θ = 0 and σ = 1, for µ = 0, 0.5, 1, 1.5 and 2.

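$$f(y) = \frac{\sqrt{2}}{\sigma}\,\frac{\kappa}{1+\kappa^2}
\begin{cases}
\exp\left(-\dfrac{\sqrt{2}\,\kappa}{\sigma}\,|y-\theta|\right) & \text{if } y \ge \theta \\[2ex]
\exp\left(-\dfrac{\sqrt{2}}{\sigma\kappa}\,|y-\theta|\right) & \text{if } y < \theta
\end{cases} \tag{1}$$

where µ = σ(1/κ − κ)/√2, κ > 0.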

As would be expected, the traditional, symmetric Laplace distribution with no skew is the special case µ = 0 (or κ = 1). θ and σ remain location and scale parameters, so that if Y ∼ AL(θ, µ, σ) then (Y − θ)/σ ∼ AL(0, µ/σ, 1). The distribution can also be parameterized in terms of κ, as in Equation (1). If Y ∼ AL(θ, κ, σ) then (Y − θ)/σ ∼ AL(0, κ, 1), so κ does not change with shifts or (positive) scalings of the random variable Y. The expectation and variance of an AL(θ, µ, σ) are

$$E(Y) = \theta + \mu, \qquad \operatorname{var}(Y) = \sigma^2 + \mu^2$$

Note that the variance is not independent of the mean unless µ = 0, the case of the symmetric Laplace distribution.
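The density in Equation (1) and the stated moments can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, `kappa_from_mu` is simply the positive root obtained by inverting µ = σ(1/κ − κ)/√2, and the moments are approximated with a plain Riemann sum on a fine grid rather than computed in closed form.

```python
import numpy as np

def mu_from_kappa(kappa, sigma):
    """Skew parameter mu implied by kappa, from mu = sigma*(1/kappa - kappa)/sqrt(2)."""
    return sigma * (1.0 / kappa - kappa) / np.sqrt(2.0)

def kappa_from_mu(mu, sigma):
    """Positive root of the same relation solved for kappa."""
    return (np.sqrt(2.0 * sigma**2 + mu**2) - mu) / (np.sqrt(2.0) * sigma)

def al_pdf(y, theta, kappa, sigma):
    """Asymmetric Laplace density in the kappa parameterization (Equation 1)."""
    y = np.asarray(y, dtype=float)
    rate = np.where(y >= theta,
                    np.sqrt(2.0) * kappa / sigma,
                    np.sqrt(2.0) / (sigma * kappa))
    const = np.sqrt(2.0) / sigma * kappa / (1.0 + kappa**2)
    return const * np.exp(-rate * np.abs(y - theta))

def numeric_moments(theta, kappa, sigma, half_width=40.0, step=1e-3):
    """Total mass, mean and variance of the density via a Riemann sum."""
    h = step * sigma
    grid = np.arange(theta - half_width * sigma, theta + half_width * sigma, h)
    w = al_pdf(grid, theta, kappa, sigma) * h
    total = w.sum()
    mean = (grid * w).sum()
    var = ((grid - mean) ** 2 * w).sum()
    return total, mean, var
```

For example, with κ = 1.174 and σ = 0.304 (the T-Cell estimates reported in Table 1), `mu_from_kappa` gives a value that rounds to the µ = −0.069 shown in that table, and `numeric_moments` should return a mean close to θ + µ and a variance close to σ² + µ².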

The maximum likelihood estimates of θ, σ and µ can be determined and are given in Kotz et al. (2001, p. 173-174).² The median is the θ that minimizes

$$\frac{1}{n}\sum |X_i - \theta| = \frac{1}{n}\sum (X_i - \theta)^+ + \frac{1}{n}\sum (X_i - \theta)^- = \alpha(\theta) + \beta(\theta)$$

where

$$\alpha(\theta) = \frac{1}{n}\sum (X_i - \theta)^+, \qquad \beta(\theta) = \frac{1}{n}\sum (X_i - \theta)^-$$

α(θ) is the average amount by which the data points exceed θ, while β(θ) is the average amount by which the data points fall below θ. Then the MLE θ̂ for θ in the asymmetric distribution minimizes

$$\frac{1}{n}\sum |X_i - \theta| + 2\sqrt{\alpha(\theta)\beta(\theta)} \tag{2}$$

The difference is the second term, involving α(θ) and β(θ) again, which pushes the estimate of θ toward the mode of the distribution. If the distribution is symmetric then these will be the same, and the MLE will still be the median. But in the non-symmetric case, the MLE of θ, the location parameter, is no longer exactly the median, but is a different order statistic that depends on the skewness of the data. Once θ̂ is found, the MLEs for µ and σ are:

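$$\hat{\mu} = \bar{X} - \hat{\theta}$$

$$\hat{\sigma} = \sqrt{2}\,\sqrt[4]{\alpha(\hat{\theta})\,\beta(\hat{\theta})}\left(\sqrt{\alpha(\hat{\theta})} + \sqrt{\beta(\hat{\theta})}\right)$$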
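The estimation steps above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `al_mle` is hypothetical, the search for θ̂ is restricted to the observed data points (relying on the statement that the MLE is an order statistic), and κ̂ is obtained by inverting the relation µ = σ(1/κ − κ)/√2 given under Equation (1).

```python
import numpy as np

def al_mle(x):
    """Sketch of the Asymmetric Laplace MLE (theta, mu, sigma, kappa).

    theta_hat minimizes alpha + beta + 2*sqrt(alpha*beta) (Equation 2),
    evaluated at every observed data point using cumulative sums.
    Degenerate samples (all values equal) would give sigma_hat = 0 and
    are not handled here.
    """
    xs = np.sort(np.asarray(x, dtype=float))
    n = xs.size
    cs = np.cumsum(xs)
    count_le = np.arange(1, n + 1)  # number of points <= each candidate
    # beta: average shortfall below the candidate; alpha: average excess above it.
    beta = np.maximum((count_le * xs - cs) / n, 0.0)
    alpha = np.maximum(((cs[-1] - cs) - (n - count_le) * xs) / n, 0.0)
    objective = alpha + beta + 2.0 * np.sqrt(alpha * beta)
    j = int(np.argmin(objective))
    theta_hat = xs[j]
    a, b = alpha[j], beta[j]
    mu_hat = xs.mean() - theta_hat
    sigma_hat = np.sqrt(2.0) * (a * b) ** 0.25 * (np.sqrt(a) + np.sqrt(b))
    kappa_hat = (np.sqrt(2.0 * sigma_hat**2 + mu_hat**2) - mu_hat) / (np.sqrt(2.0) * sigma_hat)
    return theta_hat, mu_hat, sigma_hat, kappa_hat
```

Sorting once and using cumulative sums keeps the search at O(n log n), which matters for arrays with roughly 10,000 genes; a direct evaluation of the objective at every data point would be quadratic in n.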

When the data is roughly symmetric, σ̂/√2 will be close to the MeanAD and θ̂ will be close to the median. The maximum likelihood estimates are asymptotically normal and efficient (Kotz et al., 2001), with asymptotic covariance matrix:

² Note that the MLE given here is different from the result given in Kotz et al. (2001, p. 173). There the authors minimize

$$h(\theta) = 2\log\left(\sqrt{\sum (X_i - \theta)^+} + \sqrt{\sum (X_i - \theta)^-}\right) + \sqrt{\sum (X_i - \theta)^+ \sum (X_i - \theta)^-}$$

(3.5.118). However, there seems to be an error in equation (3.5.110), from which h(θ) is derived, and the second term in h(θ) should not be included.


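$$\frac{1}{n}
\begin{pmatrix}
\sigma^2 & \frac{\sqrt{2}}{4}\sigma(1+\kappa^2) & \frac{\sqrt{2}}{4}\sigma^2\left(\frac{1}{\kappa}-\kappa\right) \\[1ex]
\frac{\sqrt{2}}{4}\sigma(1+\kappa^2) & \left(\frac{1+\kappa^2}{2}\right)^2 & \frac{\sigma}{4}\left(\frac{1}{\kappa}-\kappa\right)(1+\kappa^2) \\[1ex]
\frac{\sqrt{2}}{4}\sigma^2\left(\frac{1}{\kappa}-\kappa\right) & \frac{\sigma}{4}\left(\frac{1}{\kappa}-\kappa\right)(1+\kappa^2) & \left(\frac{\sigma\left(\kappa+\frac{1}{\kappa}\right)}{2}\right)^2
\end{pmatrix} \tag{3}$$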

$$\psi(t) = \left(\frac{1}{1 + \frac{1}{2}\sigma^2 t^2 - i\mu t}\right)^{\tau} \tag{4}$$

The AL(θ, µ, σ) distribution has τ = 1, and generally the sum of n independent and identically distributed (i.i.d.) AL(θ, µ, σ) random variables is distributed as a generalized Laplace distribution with τ = n. This means that the sum of Generalized Asymmetric Laplace random variables is still distributed as Generalized Asymmetric Laplace but with a different τ parameter.

4 Applications to Microarray Data

4.1 Fitting AL(θ, µ, σ) to Two-Color Microarray Data

We examined several microarrays from published microarray experiments. The first dataset (A) was a set of 70 arrays from sorted T-cells compared to classical human reference cell lines by competitive hybridization on Agilent cDNA chips (Xu et al., 2004). The data was normalized as described in Xu et al. (2004) using the vsn package in R (Huber and Heydebreck, 2003), which applies a generalized-log transformation to stabilize the variance across expression values. The difference of the two channels gave the gene measurement. The second dataset (B) consists of self-self hybridizations of 19 different cell lines, as well as the Stratagene universal reference RNA (Yang et al., 2002). The self-self arrays were normalized using loess smoothing, as described in the paper, though we applied the loess smoothing separately per print-tip group. Log-differences of the two channels were then used for the measurement per gene. Dataset (C) is two sets of dye-swap experiments comparing a swirl mutant zebrafish with wildtype. It is available as a dataset with the R package marrayClasses (Dudoit and Yang, 2002; Wuennenberg-Stapleton and Ngai, 2001). The zebrafish arrays were also normalized using print-tip group loess smoothing, and log-differences were then used as the gene measurement, as described in the marrayNorm package. The last dataset (D) is of 86 haploid segregants from a cross between laboratory and wild strain yeast (Saccharomyces cerevisiae) from Yvert et al. (2003). The progeny were measured with two arrays each, with dyes swapped, and the parent strains were measured with four arrays each, also with the dyes swapped. The reference sample for all arrays was an independent sample of the laboratory strain. This data is available on NCBI’s Gene Expression Omnibus (GEO) database. Yvert et al. (2003) normalized the data by subtracting off the mean of the log-ratio. In what follows, we normalized the data using the vsn package in R. In the following exposition, we show results from a single array from each of these datasets (see Supplementary Figures for all of the arrays).³ We also examined another cDNA dataset, included in the supplementary figures, of tumor samples from diffuse large B-cell lymphoma patients (Alizadeh et al., 2002). This data was normalized using the vsn package as well.

We computed maximum likelihood estimates and asymptotic standard errors of the parameters of an AL(θ, µ, σ) distribution for all of the arrays (see Table 1 and Supplementary Figures 13). Notice that while the parameter µ depends on the scale of measurement, the parameter κ is a comparable measure of skewness across the datasets regardless of the scale. We see in the cDNA arrays that κ is close to one across the arrays, indicating small levels of skewness, and is often not significantly different from one.

| Dataset | θ | κ | σ | µ | Median | MeanAD |
|---|---|---|---|---|---|---|
| T-Cell | 0.039 (0.003) | 1.174 (0.011) | 0.304 (0.005) | −0.069 | 0.006 | 0.221 |
| Self-Self | −0.001 (0.002) | 1.002 (0.007) | 0.243 (0.009) | −0.001 | −0.001 | 0.172 |
| Zebrafish | −0.104 (0.005) | 0.792 (0.002) | 0.430 (0.005) | 0.143 | −0.008 | 0.318 |
| Yeast | −0.002 (0.004) | 0.930 (0.013) | 0.283 (0.004) | 0.029 | −0.002 | 0.193 |

Table 1: Maximum likelihood estimates of the parameters for microarray data, with standard error estimates for θ, κ, σ in parentheses.

In Figure 5 we overlay the estimated AL(θ, µ, σ) density on the observed histograms, where θ, σ and µ are estimated with their respective maximum likelihood estimates. For comparison, we also overlaid the estimated normal density. From these density plots we can see that the AL(θ, µ, σ) distribution captures something of the “spirit” of the density, with peaked concentration in the center and heavy tails. Using Quantile-Quantile plots (Q-Q plots), which better emphasize the fit of the distribution in the tails, we see in Figure 5 that the Q-Q plots are linear. Only a few points out of the thousands of observations deviate significantly from a straight line. This indicates a reasonable fit to the Asymmetric Laplace distribution, which is much preferred to the Normal distribution (also in Figure 5). For the arrays not shown, the fit is often comparable to those shown here, though some have greater deviations in the tails and others would have to be classified as misfits. In general, the Asymmetric Laplace performs better than the corresponding normal (see Supplementary Figures 9, 10, 11, 12).⁴ In particular, the tails of the distribution, as best demonstrated in Q-Q plots, are in good agreement with the Asymmetric Laplace, even in those cases where the center of the distribution seems better described by a Normal distribution.

³ From the T-Cell dataset, we used array 15, the naive T-cells of a healthy patient. With the self-self hybridizations, we used array 5, the KM12L4A cell line RNA as shown in Figure 2 in Yang et al. (2002). From the Zebrafish dataset, we used array 3, “swirl.3.spot”, where Cy3 was the mutant and Cy5 the wildtype. For the yeast data, we used 5-3-dCy5, where the reference sample was in Cy3 (array 45).

Since graphical methods lack rigor, we would like to perform tests to determine how well the AL(θ, µ, σ) distribution fits this data. We examined two standard tests: the Kolmogorov-Smirnov (K-S) test and the Anderson-Darling (A-D) test. The Kolmogorov-Smirnov test takes as the test statistic the maximum absolute distance between the empirical and the theoretical CDF, while the Anderson-Darling test uses a weighted integral of the squared distance between the empirical and theoretical CDFs.⁵ Almost all of the arrays result in an extremely significant difference from the AL(θ, µ, σ) at standard testing levels (e.g. α = .05, .01). The reason for this, however, probably comes from the enormous sample size involved (n ≈ 10,000), so that the test is highly sensitive to any deviations from the null. This is a well-known statistical paradox: with large amounts of real data, every hypothesis test will reject point nulls (see, for example, Efron and Gous, 2001; Lindsey, 1999). In general the statistics are less extreme using the Laplace as a null distribution than for the normal distribution, but this is not in itself a statistically sound indicator of fit.
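To make the K-S statistic concrete, the sketch below evaluates it against the AL CDF obtained by integrating the density in Equation (1). It is an illustration under simplifying assumptions: the function names are hypothetical, the parameters are treated as known rather than estimated (so the half-sample adjustment is not shown), and 1.36/√n is used only as the usual approximate 5% critical value for a fully specified null.

```python
import numpy as np

def al_cdf(y, theta, kappa, sigma):
    """CDF of the Asymmetric Laplace, integrating the density piecewise."""
    y = np.asarray(y, dtype=float)
    z = np.sqrt(2.0) * np.abs(y - theta) / sigma
    lower = kappa**2 / (1.0 + kappa**2) * np.exp(-z / kappa)   # y < theta
    upper = 1.0 - np.exp(-z * kappa) / (1.0 + kappa**2)        # y >= theta
    return np.where(y < theta, lower, upper)

def ks_statistic(sample, cdf):
    """Maximum absolute distance between the empirical CDF and `cdf`."""
    x = np.sort(np.asarray(sample, dtype=float))
    n = x.size
    f = cdf(x)
    d_plus = np.max(np.arange(1, n + 1) / n - f)
    d_minus = np.max(f - np.arange(0, n) / n)
    return max(d_plus, d_minus)

def ks_critical_value_5pct(n):
    """Approximate large-sample 5% critical value for a fully specified null."""
    return 1.36 / np.sqrt(n)
```

For a slightly misspecified κ, the statistic settles near a fixed positive value as n grows while the critical value shrinks like 1/√n, which is the large-sample sensitivity described in the paragraph above.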

⁴ In the T-cell data the data is in 5 batches, and batch two seems to have severe problems with the AL(θ, µ, σ) fit.

⁵ Since we must estimate the unknown parameters of the AL(θ, µ, σ) distribution, asymptotic estimates of the distribution of the test statistics do not exist. One can use the half-sample method (Stephens, 1986), where the unknown components are estimated with half of the data and then the test is run, using these estimates, on the entire dataset. Given the size of the sample, the MLE estimates are stable, so the effect of the half-sample method is negligible.

Figure 5: Histograms from Figure 2 overlaid with the estimated Asymmetric Laplace and Normal distributions. Parameters of both distributions were estimated by maximum likelihood. Q-Q plots for the Normal and Asymmetric Laplace distributions are shown as well, with the outer 0.5% of the data on each tail colored in grey. Some extreme points, as indicated in the panel titles, were not displayed in the Q-Q plots so as to better examine the rest of the plot. Panel (d) also shows the fit of the distribution with an Inverse Gamma prior for the variance (see Section 4.4.2). Panels: (a) T-Cell (one point excluded on each tail), (b) Self-Self, (c) Swirl Zebrafish, (d) Yeast (three points excluded on each tail).

Instead we used Akaike’s Information Criterion (AIC) (Akaike, 1973; Burnham and Anderson, 1998) to evaluate the comparative appropriateness of a model. Namely, if g(θ) is our model,

$$\mathrm{AIC} = -2\log\left(L_g(\hat{\theta} \mid y_1, \ldots, y_n)\right) + 2K$$

where K is the number of parameters being estimated, L is the likelihood function of the model g, and θ̂ is the maximum likelihood estimate of the parameters of g. The AIC is an estimate of

$$d_{K\text{-}L}(f, g) - C(f)$$

where d_{K-L}(f, g) is the Kullback-Leibler distance between the true model f and the proposed model g, and C(f) is a constant that depends only on f. Thus, proposed models g_i for a given dataset can be compared by their corresponding AIC_i, the relative K-L distance to the true f. However, the size of AIC should not be compared across separate experiments; unlike the true K-L distance, the AIC values for different datasets are not on equivalent scales, since the term C(f) will vary with datasets that come from different underlying models f. Figure 6 shows the difference of the AIC statistic for the AL(θ, µ, σ) and the Normal distribution across all of the arrays in the datasets examined.⁶ The AL(θ, µ, σ) distribution had a lower AIC for all of the sample arrays plotted in Figure 2 and for most of the arrays not shown.
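A small worked example of the AIC comparison follows. To keep it short, this sketch compares the Normal model with the symmetric Laplace (κ = 1, so K = 2 for both) rather than the three-parameter AL fit used in the paper; the function names are hypothetical, and the log-likelihoods are evaluated at their closed-form maxima.

```python
import numpy as np

def aic_normal(y):
    """AIC of a Normal model fitted by maximum likelihood (K = 2)."""
    y = np.asarray(y, dtype=float)
    n = y.size
    var_hat = np.mean((y - y.mean()) ** 2)
    # Maximized log-likelihood: -n/2 * (log(2*pi*var_hat) + 1).
    loglik = -0.5 * n * (np.log(2.0 * np.pi * var_hat) + 1.0)
    return -2.0 * loglik + 2 * 2

def aic_laplace(y):
    """AIC of a symmetric Laplace model fitted by maximum likelihood (K = 2)."""
    y = np.asarray(y, dtype=float)
    n = y.size
    s_hat = np.mean(np.abs(y - np.median(y)))  # MeanAD
    # Maximized log-likelihood: -n * (log(2*s_hat) + 1).
    loglik = -n * (np.log(2.0 * s_hat) + 1.0)
    return -2.0 * loglik + 2 * 2

def aic_difference(y):
    """AIC(Laplace) - AIC(Normal); negative values favor the Laplace model."""
    return aic_laplace(y) - aic_normal(y)
```

As in the text, only the sign and size of the difference within one dataset are meaningful; the raw AIC values should not be compared across datasets.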

4.2 Affymetrix Data

The AL(θ, µ, σ) distribution is difficult to apply to Affymetrix microarrays, however. The perfect-match (PM) and mismatch (MM) probes used in the arrays we examined have distributions across genes that are roughly similar to those of the single channels in two-color microarrays. But the standard measurements of gene expression are transformations of the PM-MM measurement (or of just the PM). This measurement results from viewing the PM as the result of the biological signal plus a probe-specific additive effect due to unspecific binding. Two-color arrays, however, model various probe effects as multiplicative effects, which results in the ratio. PM-MM does not follow the AL(θ, µ, σ) distribution well in the data we examined. The PM-MM did have heavy tails, a peaked concentration at zero, and asymmetry, reminiscent of the AL(θ, µ, σ) distribution. However, the observed distribution had even heavier tails than the Laplace distribution. The log-ratio log(PM/MM), which is not used in Affymetrix array analysis for the reasons explained above, was however a much better fit to the AL(θ, µ, σ), probably reflecting the similarity in technical error and gene expression that underlies both array techniques. Another option is to evaluate the ratio of PM for two samples, reminiscent of the two-color arrays. However, this is also not commonly done in Affymetrix arrays because it does not allow for comparisons of many samples unless some reference sample was included in the experiment.

Furthermore, Affymetrix pre-processing must compile a single gene expression value from several different probes. This leads to different kinds of normalization procedures, which in turn produce very different distributions of the gene expression.

Figure 6: AIC_Lap − AIC_Norm plotted for each array in the datasets (x-axis: array index; y-axis: AIC(Laplace) − AIC(Normal)). A smaller value of AIC indicates a better fit, thus AIC_Lap − AIC_Norm < 0 implies a better fit for the Laplace model. Data as described in Section 4.1. Note that the Swirl Zebrafish and Yeast datasets include dye-swap arrays. Panels: (a) T-Cell, (b) Self-Self, (c) Swirl Zebrafish, (d) Yeast.

⁶ In Figure 6 we jointly plot the difference in AIC for different samples coming from the same experiment. Thus we are implicitly assuming that the underlying model f is the same across samples.

4.3 Interpretation

The Asymmetric Laplace distribution can be equivalently represented as a function of other random variables, which can provide insight into possible reasons for its good fit. None of these representations is the “truth,” but they do perhaps give ideas as to why the arrays show a good fit to the distribution.

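If Y is a random variable with distribution AL(θ, µ, σ), then two representations are of possible interest.

$$Y \overset{d}{=} \theta + \frac{\sigma}{\sqrt{2}}\log\left(\frac{P_1}{P_2}\right), \quad \text{where } P_1 \sim \text{Pareto I}(\kappa, 1),\ P_2 \sim \text{Pareto I}(1/\kappa, 1) \tag{5}$$

$$Y \overset{d}{=} \theta + \mu W + \sigma\sqrt{W}\,Z, \quad \text{where } W \sim \text{Exp}(1),\ Z \sim N(0, 1) \tag{6}$$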

Thus, Equation (5) means that the Asymmetric Laplace distribution can also be represented as the log-ratio of two independent random variables with Pareto I distributions.7 Here κ is the same as in the parameterization given above in the equation for the density (Equation (1)). Equation (6) says that Y can be viewed as a continuous mixture of normal random variables whose scale and mean parameters are dependent and vary according to an exponential distribution:

$$Y_i \mid W_i \sim N(\theta + \mu W_i,\ \sigma^2 W_i), \quad \text{where } W_i \sim \text{Exp}(1).$$

The dependence between the mean and the variance is reflected in the same $W_i$ appearing in both the mean and the variance of the mixture.
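As an illustration of the mixture representation in Equation (6), the following sketch draws from the normal-exponential mixture and checks the first two moments. It assumes W and Z are independent; the moments E[Y] = θ + µ and Var(Y) = σ² + µ² follow from conditioning on W (since E[W] = Var(W) = 1). The function name and parameter values are hypothetical.

```python
import math
import random
import statistics


def sample_al_mixture(theta, mu, sigma, n, rng):
    """Draw n values of Y = theta + mu*W + sigma*sqrt(W)*Z, W ~ Exp(1), Z ~ N(0,1)."""
    out = []
    for _ in range(n):
        w = rng.expovariate(1.0)   # shared by the mean and the variance
        z = rng.gauss(0.0, 1.0)
        out.append(theta + mu * w + sigma * math.sqrt(w) * z)
    return out


if __name__ == "__main__":
    rng = random.Random(42)
    theta, mu, sigma = 0.1, 0.3, 0.5   # hypothetical parameter values
    y = sample_al_mixture(theta, mu, sigma, 200_000, rng)
    m = statistics.fmean(y)
    v = sum((t - m) ** 2 for t in y) / len(y)
    print("sample mean    ", m, " vs theta + mu      =", theta + mu)
    print("sample variance", v, " vs sigma^2 + mu^2 =", sigma ** 2 + mu ** 2)
```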

7 The density of a Pareto I(α, β) distribution is

$$f(x) = \frac{\alpha \beta^{\alpha}}{x^{\alpha+1}}, \quad x > \beta$$


Purdom and Holmes: Gene Expression Error Distribution

Published by Berkeley Electronic Press, 2005


Since arrays are often measured as log-ratios of the red and green channel (or approximately so for the variance stabilizing transformation), the representation in (5) seems a particularly simple explanation of the good fit of the AL(θ, µ, σ) to the data: namely, that the red and green channels each follow independent Pareto distributions, but with related parameters. If κ = 1, then both channels would have the same distribution, linked by the σ parameter: Pareto I(√2/σ, 1). Clearly the two channels are not actually independent, as this model requires, though many normalization models of gene expression effectively treat them as such.8 Equation (5) also would imply that skewness in the data (the κ term) arises from a difference in distribution between the red and green channel, since κ ≠ 1 (µ ≠ 0) only if P1 and P2 come from Pareto distributions with different parameters. Kuznetsov (2001) finds mRNA expression in SAGE libraries following a "Pareto-like" distribution, that is, a Pareto II distribution with a location parameter.9 Similarly, Wu et al. (2003) find that the distribution of the expression intensities (PM) for Affymetrix oligonucleotide arrays resembles a power law, which is equivalent to a Pareto distribution. We examined the customary log-frequency versus log-expression plots for the red and green channels. Some arrays were fairly linear, thus indicating a Pareto fit. But many arrays were only linear in the tails and perhaps were more of a quadratic curve, which indicates a log-normal curve (see Figure 7).
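The log-ratio representation in Equation (5) can be checked by simulation. The sketch below is illustrative only: it draws Pareto I variables by inversion and uses the fact that the log of a Pareto I(α, 1) variable is exponential with rate α, so that Equation (5) implies E[Y] = θ + (σ/√2)(1/κ − κ) and Var(Y) = (σ²/2)(1/κ² + κ²). The helper names and parameter values are hypothetical.

```python
import math
import random
import statistics


def sample_pareto1(alpha, beta, rng):
    """Pareto I(alpha, beta) by inversion: P(X > x) = (beta / x)**alpha for x > beta."""
    u = 1.0 - rng.random()  # uniform on (0, 1], avoids division by zero
    return beta * u ** (-1.0 / alpha)


def sample_al_pareto(theta, sigma, kappa, n, rng):
    """Draw n values of theta + (sigma/sqrt(2)) * log(P1/P2) as in Equation (5)."""
    c = sigma / math.sqrt(2.0)
    out = []
    for _ in range(n):
        p1 = sample_pareto1(kappa, 1.0, rng)
        p2 = sample_pareto1(1.0 / kappa, 1.0, rng)
        out.append(theta + c * math.log(p1 / p2))
    return out


def implied_mean(theta, sigma, kappa):
    """Mean implied by (5): log P1 ~ Exp(rate kappa), log P2 ~ Exp(rate 1/kappa)."""
    return theta + sigma / math.sqrt(2.0) * (1.0 / kappa - kappa)


def implied_variance(sigma, kappa):
    """Variance implied by (5) for independent P1 and P2."""
    return 0.5 * sigma ** 2 * (1.0 / kappa ** 2 + kappa ** 2)


if __name__ == "__main__":
    rng = random.Random(3)
    y = sample_al_pareto(0.0, 0.5, 0.8, 200_000, rng)  # hypothetical values
    print(statistics.fmean(y), implied_mean(0.0, 0.5, 0.8))
```

With κ = 1 the implied mean reduces to θ, in line with the statement that skewness arises only when the two channels have different Pareto parameters.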

Equation (6) gives another possible intuition: the intensity of every probe/gene on the array follows a normal distribution, but with a random standard deviation and mean governed by an exponential distribution. Due to the nature of microarray experiments, we would expect the measured intensities to have different variation across genes, and a mixture of normals is a convenient representation.10 The AL(θ, µ, σ) is, of course, only one such mixture, namely the one whose variance follows an exponential distribution.

We can imagine that for each gene g there is an underlying difference in biological log-expression level between the two channels, ξg, and some noise ηg due to technical aspects of the experiment. A simple assumption would make these additive effects on the log-scale (and thus multiplicative on the untransformed data). How could the parameters of the AL(θ, µ, σ), as used in equation (6),

8 For example, the variance stabilizing transformation model results in generalized log difference measurements that are the difference of independent residuals of the red and green channels (Huber et al., 2003). Similarly, Newton et al. (2001) use a Bayesian model with independent red and green channels.

9 Namely, the density $\frac{a C^{a}}{(C + x - \mu)^{a+1}}$, where for Kuznetsov (2001), C = 1 and µ = b + 1. See Johnson et al. (1994) for more information.

10 The variance stabilizing procedures keep the variance across intensity levels the same within an array or batches of arrays, but do not do so gene by gene.


[Figure 7 panels: (a) Self-Self Red Channel; (b) Yeast Red Channel. Horizontal axis: Gene Expression; vertical axis: Frequency.]

Figure 7: Histograms of the red channels, with both axes on the log scale. The green channel gave equivalent graphs.

compare with these values?

One way is to think that biologically the difference in the gene log-expression levels between the two channels is some constant fixed effect, except for perhaps a few genes, and the observed variability is due to technical noise. Then θ would correspond to ξg. Since normalization often assumes that the biological gene expression is the same in the red and green channel for most genes, the remaining expression level, ξg, is often assumed to be 0 for most genes. This would leave the technical noise as having an AL(0, µ, σ) distribution given by $\eta_g = \mu W + \sigma \sqrt{W} Z$. Then µ describes the mean of the technical noise. In this scenario, µ ≠ 0 implies some technical bias in measuring the two channels (just as for κ ≠ 1 mentioned in the Pareto interpretation).

Another interpretation is that the biological difference between the red and green channels varies from gene to gene, as does the technical noise. If we try to fit equation (6) into this framework, then the difference in gene expression would be exponential ($\xi_g = \theta + \mu W$) and the technical noise would be symmetric Laplace ($\eta_g = \sigma \sqrt{W} Z$); ξg and ηg would also not be independent. This would imply that if there was no skewness in the data, ξg = θ. However, this is clearly not a very good model for the biological difference between the red and green channel because we would not expect the biological difference


for every gene to be positive. (Note that we would not want to assign the exponential distribution to ηg, since µ = 0, a symmetric distribution for gene expression, would imply zero noise.)

As is clear from these descriptions, there is no way to distinguish the biological versus the technical elements in these models, and ultimately one posits one or the other. Another likely interpretation is that the normal part of the expression in equation (6) is divided between the biological and technical noise in an unidentifiable manner, with exponential technical (or biological) noise at times also resulting in a skewed distribution. Ultimately, assumptions such as ξg = 0 for most genes (the common normalization assumption) help to further specify the interpretation.

If gene expression across genes can be well described by the representation in Equation (6), what does this imply for the distribution per gene? The random component W could be a spot or gene effect: the same Wg component for each measurement of gene g expression. This implies a normal distribution for a given gene measured across arrays. Otherwise, the W component could be random across both genes and arrays, which would imply an AL(θ, µ, σ) distribution for a given gene across arrays.

Gene/spot-specific $W_g$:

$$Y_{gi} = \theta + \mu W_g + \sigma \sqrt{W_g}\, Z_{gi} \tag{7}$$

Different $W_{gi}$ for each gene (g) and array (i):

$$Y_{gi} = \theta + \mu W_{gi} + \sigma \sqrt{W_{gi}}\, Z_{gi} \tag{8}$$

Giles and Kipling (2003) examine the distribution of genes across arrays for oligonucleotide microarrays using 59 replicated arrays of the same sample. They find the distribution of a gene's expression to be roughly normal across arrays (PM-MM after normalization, but without a log transform).11 Similarly, on the large dataset of yeast microarrays described above, which were not technical replicates but biological replicates, we find that the distribution per gene across arrays seems to be roughly normal. This implies that if Equation (6) were plausible, it is likely that the W variance term is constant for a given gene across arrays and only varies between genes, i.e. Equation (7).
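The distinction between Equations (7) and (8) can be made concrete with a small simulation. The sketch below is a hypothetical illustration, not the analysis performed on the yeast data: it generates a genes-by-arrays matrix under each model, standardizes each gene by its own mean and standard deviation, and reports the pooled excess kurtosis, which should be near zero when each gene is normal across arrays and clearly positive when each gene is itself Laplace distributed. All names and parameter values are assumptions.

```python
import math
import random


def simulate(n_genes, n_arrays, theta, mu, sigma, shared_w, rng):
    """Simulate Y_gi under Equation (7) (shared_w=True) or Equation (8) (shared_w=False)."""
    data = []
    for _ in range(n_genes):
        w_gene = rng.expovariate(1.0)  # W_g, used only when shared across arrays
        row = []
        for _ in range(n_arrays):
            w = w_gene if shared_w else rng.expovariate(1.0)
            row.append(theta + mu * w + sigma * math.sqrt(w) * rng.gauss(0.0, 1.0))
        data.append(row)
    return data


def pooled_excess_kurtosis(data):
    """Standardize each gene across arrays, pool the values, return excess kurtosis."""
    z = []
    for row in data:
        n = len(row)
        m = sum(row) / n
        s = math.sqrt(sum((v - m) ** 2 for v in row) / (n - 1))
        z.extend((v - m) / s for v in row)
    m2 = sum(v * v for v in z) / len(z)
    m4 = sum(v ** 4 for v in z) / len(z)
    return m4 / (m2 * m2) - 3.0


if __name__ == "__main__":
    rng = random.Random(11)
    per_gene_w = simulate(2000, 50, 0.0, 0.1, 0.5, True, rng)
    per_spot_w = simulate(2000, 50, 0.0, 0.1, 0.5, False, rng)
    print("Equation (7):", pooled_excess_kurtosis(per_gene_w))
    print("Equation (8):", pooled_excess_kurtosis(per_spot_w))
```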

11 Giles and Kipling (2003) did not address the question of the distribution for a given array across genes, which is the focus here. They used the Shapiro-Wilk test as a formal test, which found 18%-46% of genes non-normal, depending on the normalization method. They then used the slope of the normal Q-Q plot to gauge the magnitude of the deviation from normality.


Under Equation (7), where each gene has a fixed Wg variance term, the sample mean is asymmetric Laplace distributed as well. The sample variance term has a more complicated expression for its density, involving modified Bessel functions. Figure 8 shows the distribution of the sample mean and sample variance across genes, compared with the theoretical distributions expected for a fixed σ²Wg variance term. The sample mean follows a Laplace distribution reasonably well (and does not conform with the distribution suggested by a varying Wgi in (8)), but the sample variance does not follow its expected distribution very well.
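The claim about the sample mean can be seen from a short derivation, sketched here under the assumption that the $Z_{gi}$ are independent across the n arrays for a given gene. Averaging Equation (7) over arrays leaves $W_g$ untouched and only averages the normal terms:

```latex
\begin{align*}
\bar{Y}_g &= \frac{1}{n}\sum_{i=1}^{n} Y_{gi}
  = \theta + \mu W_g + \sigma\sqrt{W_g}\,\bar{Z}_g,
  \qquad \bar{Z}_g = \frac{1}{n}\sum_{i=1}^{n} Z_{gi} \sim N\!\left(0, \tfrac{1}{n}\right) \\
\bar{Y}_g &\overset{d}{=} \theta + \mu W_g + \frac{\sigma}{\sqrt{n}}\sqrt{W_g}\,Z,
  \qquad Z \sim N(0, 1)
\end{align*}
```

This has the form of Equation (6) with the same θ and µ and with σ replaced by σ/√n, so across genes the sample mean is again asymmetric Laplace. Under Equation (8) there is no shared $W_g$, and the sample mean is an average of n independent AL draws, which is not of this form.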

4.4 Uses of the Error Distribution

4.4.1 Normalization

Using the AL(θ, µ, σ) distribution gives parametric insight into normalization across arrays. For fairly symmetric distributions (µ ≈ 0), the AL(θ, µ, σ) gives a parametric rationale for the common use of robust measures like the median and MeanAD to center and scale the arrays. Use of the maximum likelihood estimates of θ and σ, though, allows for easier comparison amongst the arrays because these estimates account for the different skewness of different arrays in evaluating proper measures of center and scale. In the context of the gene expression data, if we expect most genes not to be differentially expressed in comparison with the reference sample, the representation in Equation (6) implies that there is a bias in the direction of µ, which one might want to account for in normalization. However, the skew values found in the datasets we examined were not large, and thus the median and MeanAD are still reasonable values for the centering and scaling of the arrays.

The variance stabilizing methods of Durbin et al. (2003) and Huber et al. (2003) use different maximum likelihood methods to fit a transformation h(y) to the data that stabilizes the variance across intensity levels. The transformation can be written as:

$$h(y) = \log\left(y - a + \sqrt{(y - a)^2 + \lambda^2}\right) = \sinh^{-1}\left(\frac{y - a}{\lambda}\right) + \log \lambda \tag{9}$$
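A minimal sketch of the transformation in Equation (9) follows. It only evaluates h for given a and λ (it does not estimate them), and the function names are hypothetical. The logarithmic and inverse hyperbolic sine forms agree up to the additive constant log λ, which does not depend on y, and for large y the transformation behaves like log(2(y − a)).

```python
import math


def glog(y, a, lam):
    """Generalized log: h(y) = log(y - a + sqrt((y - a)^2 + lam^2))."""
    z = y - a
    return math.log(z + math.sqrt(z * z + lam * lam))


def glog_asinh(y, a, lam):
    """The same transformation written with the inverse hyperbolic sine."""
    return math.asinh((y - a) / lam) + math.log(lam)


if __name__ == "__main__":
    a, lam = 10.0, 20.0  # hypothetical calibration parameters
    for y in (-50.0, 0.0, 10.0, 1000.0, 100000.0):
        print(y, glog(y, a, lam), glog_asinh(y, a, lam))
```

Unlike the plain logarithm, h is defined for intensities at or below the background offset a, which is what makes it usable on background-subtracted data.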

The parameters λ and a of the function h are fit using versions of the model

$$h_{\lambda,a}(y) = X\beta + \varepsilon \tag{10}$$

where X is a design matrix. For cDNA arrays, h(y) either is evaluated for each spot with the channels treated as separate observations (Huber et al., 2003)


[Figure 8 panels: (a) Distribution of sample mean compared with AL(θ, µ, σ) distribution and Normal distribution. (b) Distribution of sample variance compared with using exponential prior and traditional Inverse Gamma prior (see section 4.4.2). Left: histogram with densities overlaid; Right: histogram and densities on the log-scale. Horizontal axis: Sg; vertical axis: Density; legend (priors of σ²): Exponential, Inverse Gamma.]

Figure 8: Sample mean and sample variance computed for each gene in the yeast dataset (Yvert et al., 2003) and the distribution across genes compared to the corresponding density determined by Equation (7) in red.


or is replaced with $\Delta h(y) = h(y_{\text{channel1}}) - h(y_{\text{channel2}})$ (Durbin et al., 2003).12 Both use ε ∼ N(0, σ), which gives standard least-squares regression estimates of β, σ in terms of λ, a. Then the remaining profile likelihood is maximized (or approximately so) numerically with respect to λ, a:

$$C \log\left(\sum_i \left(h_{\lambda,a}(y_i) - x_i \hat{\beta}_{\lambda,a}\right)^2\right) + \sum_i \log h'_{\lambda,a}(y_i) \tag{11}$$

Both Huber et al. (2003) and Durbin et al. (2003) remark on the heavy tails of the residuals resulting from the fit using normal error. Huber et al. (2003) iteratively use least trimmed sum of squares regression in optimizing (11) to give parameters λ, a that are robust to the assumption of normality and to outliers.

The parametric model of the Laplace distribution is a logical error distribution for the log-ratios, as seen in section 4.1. Using this distribution for normalization is thus a logical adjustment when normalizing based on log-ratios of the two channels, or ∆h(y), as suggested by Durbin et al. (2003).13

When normalizing on the channels separately, as is the common implementation, the Laplace distribution is still useful as a more robust estimation technique. Indeed, if the model uses a symmetric Laplace error term for ε, then the likelihood involves absolute deviations rather than squared deviations. The estimates of β, σ are then those from a least absolute deviation (LAD) regression; in other words, they come from minimizing $\sum_i |h(y_i) - x_i\beta|$ instead of $\sum_i (h(y_i) - x_i\beta)^2$. The least-squares term in the profile likelihood (11) is also changed to a least absolute deviation term. Thus, the maximum likelihood estimates under the Laplace error term are automatically more robust to outlying terms than sums of squares estimates.
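The change from a squared to an absolute deviation term can be written out explicitly. The derivation below is a sketch that assumes the symmetric Laplace error is parameterized so that σ is its standard deviation (an assumption about the parameterization; a different scaling only changes the constants), with n the total number of observations:

```latex
\begin{align*}
f(\varepsilon) &= \frac{1}{\sqrt{2}\,\sigma}
  \exp\!\left(-\frac{\sqrt{2}\,|\varepsilon|}{\sigma}\right)
  && \text{(assumed symmetric Laplace density)} \\
\ell(\beta, \sigma; \lambda, a) &= -n \log(\sqrt{2}\,\sigma)
  - \frac{\sqrt{2}}{\sigma} \sum_{i=1}^{n} \bigl| h_{\lambda,a}(y_i) - x_i \beta \bigr|
  + \sum_{i=1}^{n} \log h'_{\lambda,a}(y_i) \\
\hat{\beta}_{\lambda,a} &= \arg\min_{\beta} \sum_{i=1}^{n} \bigl| h_{\lambda,a}(y_i) - x_i \beta \bigr|,
  \qquad
  \hat{\sigma}_{\lambda,a} = \frac{\sqrt{2}}{n} \sum_{i=1}^{n}
  \bigl| h_{\lambda,a}(y_i) - x_i \hat{\beta}_{\lambda,a} \bigr| \\
\ell_{\text{profile}}(\lambda, a) &= -n \log\!\left( \sum_{i=1}^{n}
  \bigl| h_{\lambda,a}(y_i) - x_i \hat{\beta}_{\lambda,a} \bigr| \right)
  + \sum_{i=1}^{n} \log h'_{\lambda,a}(y_i) + \text{const}
\end{align*}
```

The profile log-likelihood thus has the same structure as (11), with the residual sum of squares replaced by the sum of absolute residuals.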

Closed-form estimates do not generally exist for LAD regression, but this is a convex optimization problem, so good minimization algorithms exist (Portnoy and Koenker, 1997). For small datasets, the computational difference between LAD and least-squares regression is negligible; however, given the number of observations in these models (equal to the number of all the spots in

12 The model in Huber et al. (2003) is $h_{\lambda_i,a_i}(y_{ig}) = \mu_g + \varepsilon$, which accounts for gene g effects but not other aspects of the design. Huber et al. (2003) then go on to maximize the profile log likelihood iteratively to find both $a_i$ and $\lambda_i$. Durbin et al. (2003) assume that a has been estimated using negative controls or replicated spots and allow for a more complicated design matrix. They then use a variant of the Box-Cox method to find the λ that approximately maximizes the profile likelihood with lower computational requirements.

13 However, it is not clear that the method they used can actually be extended to ∆h(y), as they suggested, given problems with the Jacobian.


all arrays under consideration), absolute deviation minimization is more computationally intensive than least squares. For the simplified one-way ANOVA design model in Huber et al. (2003), the mean and standard error per gene of h(y) that give β, σ would simply be replaced with the median and MeanAD, respectively. The profile likelihood would then need to be numerically maximized as before, again with an absolute deviation term instead of a quadratic.
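For the one-way design, the replacement of means by medians can be sketched as follows. This is an illustration under stated assumptions rather than the procedure of Huber et al. (2003): the transformed values h(y) are taken as given, MeanAD is taken to be the mean absolute deviation about the per-gene median pooled over all observations, and the function names and toy data are hypothetical.

```python
import math
import statistics


def fit_one_way_laplace(values_by_gene):
    """Laplace-error fit: per-gene median and pooled mean absolute deviation."""
    centers = {g: statistics.median(v) for g, v in values_by_gene.items()}
    n_total = sum(len(v) for v in values_by_gene.values())
    abs_dev = sum(abs(x - centers[g])
                  for g, v in values_by_gene.items() for x in v)
    return centers, abs_dev / n_total


def fit_one_way_normal(values_by_gene):
    """Normal-error fit: per-gene mean and pooled root mean squared residual."""
    centers = {g: statistics.fmean(v) for g, v in values_by_gene.items()}
    n_total = sum(len(v) for v in values_by_gene.values())
    sq_dev = sum((x - centers[g]) ** 2
                 for g, v in values_by_gene.items() for x in v)
    return centers, math.sqrt(sq_dev / n_total)


if __name__ == "__main__":
    # Hypothetical transformed intensities; gene "g1" has one outlying spot.
    h = {"g1": [1.0, 2.0, 9.0], "g2": [0.0, 0.0, 3.0]}
    print(fit_one_way_laplace(h))
    print(fit_one_way_normal(h))
```

In the toy data the outlying value pulls the per-gene mean of "g1" upward while leaving its median unchanged, which is the robustness property referred to above.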

The Box-Cox method suggested by Durbin et al. (2003) tries to ease computational burdens by avoiding the $\sum \log h'(y)$ term and by estimating one global λ parameter for all arrays, using a larger design matrix X to account for array effects. Using the Laplace error distribution here would not give simple closed-form estimates for β, σ. The full minimization algorithm for an extremely large LAD problem would be needed, undermining the effort toward a computationally easier normalization.

The model in Equation (10) has also been proposed for finding differentially expressed genes, as done by Kerr et al. (2000) (using the log transform) and Durbin (2004). When the model in (10) is expressed in terms of ∆h(yg), a Laplace error term is particularly appealing given the good fit of the AL(θ, µ, σ) to data in section 4.1. Again, the computational expense would depend on the extent of the design matrix. Using the Laplace error distribution allows for testing of effects in the model through parametric bootstrapping. One can easily generate data under various null hypotheses that have many features similar to the original data. For example, using the model in Equation (10), one could vary or eliminate a term of the model, and use the Laplace distribution to generate new residuals (and thus new data) under the new model. Thus, the residual distribution allows for significance testing for parameters in the expression model. However, this model will have severe problems if the within-gene variability changes from gene to gene (see 4.4.2 below).
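A parametric bootstrap of this kind might look like the following sketch. It is a deliberately reduced, hypothetical example: the "model" has a single two-level group term, the null model drops that term, residuals are symmetric Laplace, and the test statistic is the difference of group medians. None of the names or choices below come from the paper.

```python
import random
import statistics


def rlaplace(loc, b, rng):
    """Symmetric Laplace draw: a difference of two Exp(1) variables, scaled by b."""
    return loc + b * (rng.expovariate(1.0) - rng.expovariate(1.0))


def fit_laplace(x):
    """Laplace MLEs: median for the location, mean absolute deviation for b."""
    med = statistics.median(x)
    return med, sum(abs(v - med) for v in x) / len(x)


def group_stat(g1, g2):
    """Test statistic: difference of group medians."""
    return statistics.median(g2) - statistics.median(g1)


def parametric_bootstrap_pvalue(g1, g2, n_boot, rng):
    """Two-sided p-value for the group term under a Laplace null model."""
    loc, b = fit_laplace(list(g1) + list(g2))  # null model: no group term
    observed = abs(group_stat(g1, g2))
    exceed = 0
    for _ in range(n_boot):
        s1 = [rlaplace(loc, b, rng) for _ in g1]  # new data under the null
        s2 = [rlaplace(loc, b, rng) for _ in g2]
        if abs(group_stat(s1, s2)) >= observed:
            exceed += 1
    return (exceed + 1) / (n_boot + 1)


if __name__ == "__main__":
    rng = random.Random(7)
    g1 = [rlaplace(0.0, 0.5, rng) for _ in range(40)]  # hypothetical data
    g2 = [rlaplace(2.0, 0.5, rng) for _ in range(40)]
    print(parametric_bootstrap_pvalue(g1, g2, 300, rng))
```

Resampling the observed residuals instead of calling `rlaplace` would turn this into the nonparametric residual bootstrap.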

Clearly a similar approach would be to bootstrap by resampling the residuals. However, the AL(θ, µ, σ) distribution is very tractable; the density, cumulative distribution, and quantile functions can be written in closed form. This allows a great deal of knowledge of the effect of further transformations and manipulations of the data beyond just simulated or sampled data. Furthermore, the AL(θ, µ, σ) model has parameters with meaningful interpretations. For example, the AL(θ, µ, σ) model nicely separates the location parameter from the skew parameter, so the effect of the two can be taken into account in determining what further normalization analysis is appropriate. The Laplace distribution can also be used in simulation studies, giving a heavier tailed comparison to the normal distribution.


4.4.2 Empirical Bayes

Parametric models are particularly useful in Bayesian analysis, where prior and conditional distributions are used in estimation and in inferring significance. In addition to the AL error distribution, the interpretations in section 4.3 offer some different prior distributions for comparison.

As an example, if assuming normality of gene expression across arrays (as with F-tests), equation (7) suggests a prior distribution of Exp(1/σ²) for the variance term. A more standard prior for the variance term is the inverse Gamma distribution (IG(α, β)). Rocke (2003), for example, uses the inverse Gamma prior to give a per-gene empirical Bayes estimate of the variance using the posterior distribution σ²|sg, where sg is the standard sample variance of gene g across arrays. His goal is to find a compromise between estimating the variance for each gene separately (and thus ignoring a great deal of information in other genes) versus estimating a global variance term (and ignoring the variance heterogeneity) as in the standard regression model in section 4.4.1 (Kerr et al., 2000). The inverse Gamma prior has two free parameters and gives a better fit to the marginal distribution of sg than the exponential prior suggested by the Laplace distribution (Figure 8(b), using the method of moments Empirical Bayes estimators suggested in Rocke (2003)). Moreover, the posterior distribution is unwieldy using the exponential, while the inverse Gamma distribution is a conjugate prior, thus giving an inverse Gamma posterior distribution. Comparing the marginal distribution of the gene expression across genes, yig, on the other hand, both priors offered good fits, depending on the array we examined. Of course, the exponential prior just results in a marginal L(θ, σ) distribution, as discussed in section 4.3, and thus is highly tractable. The inverse Gamma results in a less convenient density with which to work.14
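The conjugacy of the inverse Gamma prior can be illustrated with the standard update for a normal sample variance. The sketch assumes that, given the gene's variance σ², the sample variance s² on ν degrees of freedom satisfies ν s²/σ² ∼ χ²_ν, and that the prior is IG(α, β) with density proportional to x^(−α−1) exp(−β/x). The prior parameters are taken as given (hypothetical values below); this is not a reproduction of the method-of-moments estimators of Rocke (2003).

```python
def inverse_gamma_posterior(s2, nu, alpha, beta):
    """Posterior IG parameters for sigma^2 given sample variance s2 on nu d.f."""
    return alpha + nu / 2.0, beta + nu * s2 / 2.0


def posterior_mean_variance(s2, nu, alpha, beta):
    """Posterior mean of sigma^2, a shrinkage estimate of the per-gene variance."""
    a_post, b_post = inverse_gamma_posterior(s2, nu, alpha, beta)
    if a_post <= 1.0:
        raise ValueError("posterior mean requires shape parameter > 1")
    return b_post / (a_post - 1.0)


if __name__ == "__main__":
    alpha, beta = 3.0, 4.0  # hypothetical prior; prior mean is beta / (alpha - 1)
    for s2 in (0.2, 1.0, 5.0):
        print(s2, posterior_mean_variance(s2, nu=10, alpha=alpha, beta=beta))
```

The posterior mean is a weighted average of the prior mean β/(α − 1) and the gene's own sample variance, with the weight on the sample variance growing with the degrees of freedom. This is the compromise between per-gene and global variance estimates described above.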

A popular and simple empirical Bayes approach to microarray analysis is to assume that the prior probability that gene g is differentially expressed is p1, and thus the probability of not being differentially expressed is p0 = 1 − p1 (see Efron et al., 2001, for example). Then the density of some statistic, like the two-sample t-statistic, for gene g is

$$f(t) = p_0 f_0(t) + p_1 f_1(t)$$

where $f_j$ (j = 0, 1) is the density of the statistic according to whether the gene is not differentially expressed (j = 0) or differentially expressed (j = 1). Then to evaluate which genes are differentially expressed

14 These plots were implemented on the yeast data, where there were no grouping effects to take into account.


one uses the posterior probability of not being differentially expressed for each gene:

$$\text{Prob}\{\text{Not Differentially Expressed} \mid t_g\} = \frac{p_0 f_0(t_g)}{f(t_g)}$$

Small values of this posterior probability indicate possibly differentially expressed genes. Clearly this setup can be extended to a larger number of classes in the mixture beyond just "Differentially Expressed" and "Not Differentially Expressed." The question of the proper null distribution, f0, is important for this method, as using the natural $t_{n-1}$ distribution as the null does not seem to correspond to observed data: the tails of the t are not long enough, and thus it finds too many genes differentially expressed (Efron, 2003).
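The posterior probability above is straightforward to compute once p0, f0 and f1 are specified. In the sketch below all three are hypothetical choices made only for illustration (a standard normal null, a wider normal alternative, and p0 = 0.9); in practice they would be estimated from the data, and a Laplace density is included to show how a heavier-tailed null could be swapped in.

```python
import math


def normal_pdf(t, sd=1.0):
    """Density of N(0, sd^2)."""
    return math.exp(-0.5 * (t / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def laplace_pdf(t, b=1.0):
    """Density of a symmetric Laplace with scale b (a heavier-tailed null)."""
    return math.exp(-abs(t) / b) / (2.0 * b)


def posterior_null(t, p0, f0, f1):
    """Prob{not differentially expressed | t} = p0 f0(t) / (p0 f0(t) + p1 f1(t))."""
    null_part = p0 * f0(t)
    alt_part = (1.0 - p0) * f1(t)
    return null_part / (null_part + alt_part)


if __name__ == "__main__":
    p0 = 0.9  # hypothetical prior probability of no differential expression

    def wide_alt(t):
        return normal_pdf(t, sd=3.0)  # hypothetical alternative density

    for t in (0.0, 2.0, 4.0):
        print(t,
              posterior_null(t, p0, normal_pdf, wide_alt),
              posterior_null(t, p0, laplace_pdf, wide_alt))
```

With the heavier-tailed Laplace null, a large statistic retains a higher posterior null probability than under the normal null, so fewer genes are declared differentially expressed.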

Using AL(θ, µ, σ) as a null distribution of gene expression, which can itself be thought of as a mixture distribution, seems a possible alternative. And as mentioned above, the mean across genes seems to resemble a Laplace distribution as opposed to the Normal. However, once the means are standardized by the gene's standard deviation, the standard t distribution is reasonable, again pointing to the importance of variance heterogeneity amongst the genes.

5 Conclusion and Further Observations

In short, the AL(θ, µ, σ) distribution can be a useful model for gene expression analysis. The asymmetric Laplace distribution gives a simple, interpretable model that describes well the fluctuation of gene expression in competitive hybridization microarrays. Furthermore, the model can be broken down into other representations, such as a continuous mixture of Normals or log-ratios of Pareto distributions as described in Section 4.3, which suggest useful models for future exploration. Model-based analyses, such as parametric bootstrapping, allow for extra incorporation of the error distribution information. The distribution can be easily written, including the cumulative distribution function, and thus allows for theoretical examination of the data methods. The AL model is less appropriate for the Affymetrix microarrays, which use the difference rather than ratio of gene expression values.

This exposition has not taken into account correlations between genes and rather has treated the genes as if they were (independent) observations from the same distribution. Under the null hypothesis, each probe expression is thought of as independently, identically distributed from an AL(θ, µ, σ) distribution which varies from subject to subject. No one would actually claim that the measured intensities are independent within an array; rather they are measurements of a complex regulatory network where the amount of transcript of one gene is highly dependent on other genes and gene products that regulate


its transcription. For competitive hybridization arrays, where the expression is measured as a ratio to a reference sample, this transformation may hopefully reduce the dependency among non-differentially expressed genes if the reference sample is relevant to the sample of mRNA. However, normalization techniques, in particular, do treat the spots as independent, and thus it is still useful to look at the overall distribution.

Appendix: Supplementary Figures


[Figure 10 panels, one per array: HCT-116.1, HCT-116.2, HCT-116.3, KM12L4A.1, KM12L4A.2, KM12L4A.3, MDAH2774, NT2.1, NT2.2, NT2.3, OV1063, OV3, OVCAR3, PA-1, PANC-1, SKOV3, RefSample.1, RefSample.2, SW480.1, SW480.2, SW480.3, SW620, SW626, TP-1, TP-2.]

Figure 10: (a) Q-Q plots of all arrays for self-self hybridization data. (b) Histograms with AL(θ, µ, σ) and N(µ, σ) densities overlaid (Yang et al., 2002).


[Figure 11 panels, one per array: 1, 2, 3, 4.]

Figure 11: (A) Q-Q plots of all arrays for zebrafish data. (B) Histograms with AL(θ, µ, σ) and N(µ, σ) densities overlaid (Wuennenberg-Stapleton and Ngai, 2001).


[Figure 12 panels, one per array: CLL-13, CLL-13, CLL-52, CLL-39, DLCL-0032, DLCL-0024, DLCL-0029, DLCL-0023.]

Figure 12: (A) Q-Q plots of all arrays for lymphoma data. (B) Histograms with AL(θ, µ, σ) and N(µ, σ) densities overlaid (Alizadeh et al., 2002).


[Figure 13 panels: (a) T-Cell; (b) Self-Self; (c) Swirl Zebrafish. Each panel plots θ ± 2 s.e., σ ± 2 s.e., and κ ± 2 s.e. (vertical axes) against Arrays (horizontal axis).]

Figure 13: Parameter estimates across all arrays, with whiskers showing 2 × s.e.


[Figure 13 (continued) panels: (d) Lymphoma; (e) Yeast. Each panel plots θ ± 2 s.e., σ ± 2 s.e., and κ ± 2 s.e. (vertical axes) against Arrays (horizontal axis).]

Figure 13 (continued): Parameter estimates across all arrays, with whiskers showing 2 × s.e.


References

Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In Breakthroughs in Statistics (S. Kotz and N. Johnson, eds.), vol. I. Springer-Verlag, New York, 610–624.

Alizadeh, A. A., Eisen, M. B., Davis, R. E., Ma, C., Lossos, I. S., Rosenwald, A., Boldrick, J. C., Sabet, H., Tran, T., Yu, X., Powell, J. I., Yang, G., Liming Marti, E. Moore, T., Hudson, J., Lu, L., Lewis, D. B., Tibshirani, R., Sherlock, G., Chan, W. C., Greiner, T. C., Weisenburger, D. D., Armitage, J. O., Warnke, R., Levy, R., Wilson, W., Grever, M. R., Byrd, J. C., Botstein, D., Brown, P. O. and Staudt, L. M. (2002). Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403 503–511.

Burnham, K. and Anderson, D. (1998). Model Selection and Inference. Springer, New York.

Dudoit, S. and Yang, J. Y. H. (2002). marrayClasses package: Classes and methods for cDNA microarray data. Bioconductor, http://www.r-project.org/.

Dudoit, S. and Yang, J. Y. H. (2003). Bioconductor R packages for exploratory analysis and normalization of cDNA microarray data. In The Analysis of Gene Expression Data (G. Parmigiani, E. Garrett, R. A. Irizarry and S. L. Zeger, eds.), chap. 3. Springer, New York, 73–101.

Durbin, B. (2004). Estimation of transformations for microarray data: Are robust methods always necessary? Preprint.

Durbin, B., Hardin, J., Hawkins, D. and Rocke, D. (2002). A variance-stabilizing transformation for gene-expression microarray data. Bioinformatics 18 S105–S110.

Durbin, B., Hardin, J., Hawkins, D. and Rocke, D. (2003). Estimation of transformation parameters for microarray data. Bioinformatics 19 1360–1367.

Efron, B. (2003). Large-scale simultaneous hypothesis testing: The choice of a null hypothesis. To be published in JASA.


Efron, B. and Gous, A. (2001). Scales of evidence for model selection:Fisher versus Jeffreys. In Model Selection (P. Lahiri, ed.), vol. 38 of LectureNotes – Monograph Series. Institute of Mathematical Statistics, Beachwood,Ohio, 208–256.

Efron, B., Storey, J. and Tibshirani, R. (2001). Microarrays, empiricalBayes methods, and false discovery rates. Tech. rep., Stanford University.

Giles, P. J. and Kipling, D. (2003). Normality of oligonucleotide microar-ray data and implictations for parametric statistical analyses. Bioinformat-ics 19 2254–2262.

Hinkley, D. and Revankar, N. (1977). Estimation of the Pareto law fromunderreported data. Journal of Econometrics 5 1–11.

Huber, W. and Heydebreck, A. v. (2003). vsn package: Variance stabilization and calibration for microarray data. Bioconductor, http://www.r-project.org/.

Huber, W., von Heydebreck, A., Sueltmann, H., Poustka, A. and Vingron, M. (2003). Parameter estimation for the calibration and variance stabilization of microarray data. Statistical Applications in Genetics and Molecular Biology 2 Article 3.

Johnson, N. L., Kotz, S. and Balakrishnan, N. (1994). Continuous univariate distributions, vol. I. 2nd ed. Wiley & Sons, New York.

Kerr, M. K., Martin, M. and Churchill, G. (2000). Analysis of variance for gene expression microarray data. Journal of Computational Biology 7 819–837.

Kotz, S., Kozubowski, T. and Podgórski, K. (2001). The Laplace Distribution and Generalizations. Birkhäuser, Boston.

Kuznetsov, V. A. (2001). Distribution associated with stochastic processes of gene expression in a single eukaryotic cell. Journal on Applied Signal Processing 4 285–296.

Lindsey, J. (1999). Some statistical heresies. The Statistician 48 1–40.

Newton, M., Kendziorski, C., Richmond, C., Blattner, F. and Tsui, K. (2001). On differential variability of expression ratios: Improving statistical inference about gene expression changes from microarray data. Journal of Computational Biology 8 37–52.


Statistical Applications in Genetics and Molecular Biology, Vol. 4 [2005], Iss. 1, Art. 16

http://www.bepress.com/sagmb/vol4/iss1/art16
DOI: 10.2202/1544-6115.1070


Portnoy, S. and Koenker, R. (1997). The Gaussian hare and the Laplacian tortoise: Computability of squared-error versus absolute-error estimators. Statistical Science 12 279–296.

Rocke, D. M. (2003). Heterogeneity of variance in gene expression microarray data. Preprint.

Stephens, M. A. (1986). Tests based on EDF statistics. In Goodness-of-Fit Techniques (R. B. D’Agostino and M. A. Stephens, eds.), chap. 4. Marcel Dekker, Inc., New York, 97–194.

Wu, Z., Irizarry, R. A., Gentleman, R., Murillo, F. M. and Spencer, F. (2003). A model based background adjustment for oligonucleotide expression arrays. Tech. rep., Johns Hopkins University, Department of Biostatistics.

Wuennenberg-Stapleton, K. and Ngai, L. (2001). Swirl experimental data provided by the Ngai Lab at UC Berkeley.

Xu, T., Shu, C.-T., Purdom, E., Dang, D., Ilsley, D., Guo, Y., Holmes, S. P. and Lee, P. P. (2004). Microarray analysis reveals differences in gene expression of circulating CD8+ T cells in melanoma patients and healthy donors. Cancer Research 64 3661–3667.

Yang, I. V., Chen, E., Hasseman, J. P., Liang, W., Frank, B. C., Wang, S., Sharov, V., Saeed, A., White, J., Li, J., Lee, N. H., Yeatman, T. J. and Quackenbush, J. (2002). Within the fold: Assessing differential expression measures and reproducibility in microarray assays. Genome Biology 3.

Yang, Y. H., Dudoit, S., Luu, P. and Speed, T. P. (2001). Normalization for cDNA microarray data. In Microarrays: Optical Technologies and Informatics (M. L. Bittner, Y. Chen, A. N. Dorsel and E. R. Dougherty, eds.), vol. 4266 of SPIE.

Yvert, G., Brem, R. B., Whittle, J., Akey, J., Foss, E., Smith, E., Mackelprang, R. and Kruglyak, L. (2003). Trans-acting regulatory variation in Saccharomyces cerevisiae and the role of transcription factors. Nature Genetics 35 57–64.
