arXiv:1001.0341v2 [q-bio.GN] 16 Oct 2010



UCLA Statistics Preprints, Jan 2010

ON WEIGHT MATRIX AND FREE ENERGY MODELS FOR

SEQUENCE MOTIF DETECTION∗

By Qing Zhou†

University of California, Los Angeles

The problem of motif detection can be formulated as the construction of a discriminant function to separate sequences of a specific pattern from background. In computational biology, motif detection is used to predict DNA binding sites of a transcription factor (TF), mostly based on the weight matrix (WM) model or the Gibbs free energy (FE) model. However, despite the wide applications, theoretical analysis of these two models and their predictions is still lacking. We derive asymptotic error rates of prediction procedures based on these models under different data generation assumptions. This allows a theoretical comparison between the WM-based and the FE-based predictions in terms of asymptotic efficiency. Applications of the theoretical results are demonstrated with empirical studies on ChIP-seq data and protein binding microarray data. We find that, irrespective of underlying data generation mechanisms, the FE approach shows higher or comparable predictive power relative to the WM approach when the number of observed binding sites used for constructing a discriminant decision is not too small.

Key words: asymptotic efficiency, discriminant analysis, protein-DNA interaction, predictive error, transcription factor binding site.

1. Introduction. Transcription factors (TFs), a class of proteins, regulate gene transcription through their physical interactions with particular DNA sites. Such a DNA site is called a transcription factor binding site (TFBS), which is usually a short piece of nucleotide sequence (e.g., ‘CATTGTC’). Typically, a TF can bind different sites and regulate a set of genes. A key observation is that sites of the same TF share similarity in their sequence composition, which is characterized by a motif. Since gene regulation has always been an important problem in biology, many computational methods have been developed to predict whether a given DNA sequence can be bound by a TF. Please see Elnitski et al. (2006), Ji and Wong (2006), and Vingron et al. (2009) for recent reviews on relevant methods.

The prediction of TFBS’s considered in this article is formulated as a classification problem. Denote by $w$ the width of the binding sites and code the four nucleotide bases, A, C, G and T, by a set of positive integers $\mathcal{I} = \{1, \cdots, J\}$ ($J = 4$). Suppose that we have observed a sample of labeled sequences of length $w$, $D_n = \{(Y_k, X_k)\}_{k=1}^{n}$, where $X_k \in \mathcal{I}^w$ and $Y_k \in \{0, 1\}$ indicates whether $X_k$ is bound by the TF ($Y_k = 1$) or not ($Y_k = 0$). We call $D_n^{+} = \{X_k : Y_k = 1\}$ observed binding sites (or motif sites) and $D_n^{-} = \{X_k : Y_k = 0\}$ background sites (or background sequences). Then, motif detection is to construct a discriminant function from $D_n$ to predict the label of any new sequence $x \in \mathcal{I}^w$.

∗To appear in Journal of Computational Biology.
†Department of Statistics, 8125 Mathematical Sciences Building, University of California, Los Angeles, CA 90095 (email: [email protected]).

Most of the existing computational methods for motif detection can be classified into two groups. The starting point of the first group is the sequence specificity of binding sites, which is often summarized by the position-specific weight matrix (WM). For early developments of WM, please see Stormo (2000). Under the WM model, each nucleotide (letter) in a binding site is assumed to be generated independently from a multinomial distribution on {A, C, G, T}. This model has been widely used in search of TFBS’s (e.g., Hertz and Stormo, 1999; Kel et al., 2003; Rahmann et al., 2003; Turatsinze et al., 2008), de novo motif finding (e.g., Stormo and Hartzell, 1989; Lawrence et al., 1993; Bailey and Elkan, 1994; Roth et al., 1998; Liu et al., 2002) and many other works reviewed in Vingron et al. (2009). The second group aims at modeling physical binding affinity between a TF and its binding sites via the concept of the Gibbs free energy (FE) or binding energy (e.g., Berg and von Hippel, 1987; Stormo and Fields, 1998; Gerland et al., 2002; Kinney et al., 2007). Assuming that each nucleotide in a DNA sequence of length $w$ ($w$-mer) contributes additively to the interaction with the TF, this approach often leads to a regression-type model for the conditional distribution of binding affinity given a piece of nucleotide sequence (e.g., Djordjevic et al. 2003; Foat et al. 2006). This group of methods has tight connections with predictive modeling approaches to gene regulation, reviewed in Bussemaker et al. (2007), which can be regarded as a natural generalization to the free energy framework (Zhou and Liu, 2008). Although the standpoints are different, the two groups of approaches are in some sense closely related. They often give similar discriminant functions for predicting TFBS’s, and there are many FE-based methods that use a weight matrix to approximate Gibbs free energy (e.g., Granek and Clarke, 2005; Roider et al., 2007).

In spite of the fast methodological development on the WM and the FE models, there is still a lack of solid theoretical analysis to compare model assumptions, parameter estimations and response predictions of the two approaches. Such theoretical analysis can provide insights into these methods by seeking answers to a series of questions. For example, what are the common and distinct assumptions between the WM and the FE models, what is the relative performance between the two approaches in predicting TFBS’s given a certain data generation mechanism, and how to calculate their predictive error rates when the size of observed sample $D_n$ becomes large? Without answering these questions, one may find it difficult to understand the nature of these methods and cannot extract the full information contained in extensive empirical comparisons between the two approaches.

In this article, we compare model assumptions and parameter estimations of typical WM and FE approaches, derive asymptotic error rates of their predictions under different data generation models, and perform comparative studies on large-scale binding data. The article is organized as follows. In Section 2 we review the basic models of the two approaches. Asymptotic error rates of prediction procedures based on these models are derived and analyzed in Section 3. Computational approaches are developed in Section 4 for practical applications of the theoretical results. Numerical analysis and biological applications are presented in Sections 5 and 6, respectively, with a comparison of the WM-based and the FE-based predictions on ChIP-seq data and protein binding microarray data. The paper concludes with discussions in Section 7. Some mathematical details are provided in Appendices. Although presented in the specific context of motif detection, the results in this article are generally applicable to the modeling and classification of categorical data.

2. Models. Let $c$ be a scalar, $u = (u_1, \cdots, u_J)$ be a (column) vector, $v = (v_1, \cdots, v_w) \in \mathcal{I}^w$, and $A = (a_{ij})_{w \times J}$ and $B = (b_{ij})_{w \times J}$ be two $w \times J$ matrices. For notational ease, we define $c \pm A := (c \pm a_{ij})_{w \times J}$, $A/B := (a_{ij}/b_{ij})_{w \times J}$ provided that $b_{ij} \neq 0$, $vA := \sum_{i=1}^{w} a_{i v_i}$, $A(v) := \prod_{i=1}^{w} a_{i v_i}$ and $u(v) := \prod_{i=1}^{w} u_{v_i}$. Furthermore, we define $v_{[-k]} := (v_1, \cdots, v_{k-1}, v_{k+1}, \cdots, v_w)$ and $A_{[-k]}$ by removing the $k$th row from $A$, for $k = 1, \cdots, w$. Symbols ‘$\xrightarrow{L}$’ and ‘$\xrightarrow{P}$’ are used for convergence in law and in probability, respectively.

Let $\theta_0 = (\theta_{01}, \cdots, \theta_{0J})$ be the cell probabilities (probability vector) of a multinomial distribution for i.i.d. background nucleotides, where $\sum_{j=1}^{J} \theta_{0j} = 1$ and $\theta_{0j} > 0$ for $j = 1, \cdots, J$. Since $\theta_0$ can be accurately estimated from a large number of genomic background sequences, we assume that it is given in the following analyses. Throughout the paper, we assume that the cell probabilities of any multinomial distribution are bounded away from 0.

2.1. The weight matrix model. Let $X = (X_1, \cdots, X_w) \in \mathcal{I}^w$ be a sequence of length $w$. In the weight matrix model (WMM), we assume that $X$ is generated from a mixture distribution. Let $Y \in \{0, 1\}$ label the mixture component. With probability $q_0$, $Y = 0$ and $X$ is generated from an i.i.d. background model (with parameter) $\theta_0$, that is, $P(X \mid Y = 0) = \theta_0(X)$. With probability $q_1 = 1 - q_0$, $Y = 1$ and $X$ is generated from a weight matrix $\Theta = (\theta_{ij})_{w \times J} = (\theta_1, \cdots, \theta_w)^t$, where $\theta_i = (\theta_{i1}, \cdots, \theta_{iJ})$ is a probability vector for $i = 1, \cdots, w$ and $X_i$ is independent of other $X_k$ ($k \neq i$). To be specific, $P(X \mid Y = 1) = \Theta(X)$. From this model the log-odds ratio of $Y$ given $X$ is

\[
\log \frac{P(Y = 1 \mid X)}{P(Y = 0 \mid X)} = \log \frac{q_1 \Theta(X)}{q_0 \theta_0(X)} = \log(q_1/q_0) + \sum_{i=1}^{w} \log(\theta_{i X_i}/\theta_{0 X_i}). \tag{1}
\]

In the WM-based prediction, $q_1$ is typically fixed by prior expectation or determined by the relative cost of the two types of errors (false positive vs false negative). Effectively, we assume that $q_1$ is given. Let

\[
\beta_0 = \log(q_1/q_0), \qquad \beta_{ij} = \log(\theta_{ij}/\theta_{0j}), \tag{2}
\]

for $1 \leq i \leq w$, $1 \leq j \leq J$ and $\beta = (\beta_{ij})_{w \times J}$. We rewrite (1) as

\[
\log \frac{P(Y = 1 \mid X)}{P(Y = 0 \mid X)} = \beta_0 + \sum_{i=1}^{w} \beta_{i X_i} = \beta_0 + X\beta := h(X), \tag{3}
\]

which defines an additive discriminant function to predict $Y$ given $X$, i.e., to predict whether the sequence $X$ can be bound by the TF. The label $Y$ will be predicted as 1 if $h(X) \geq 0$ and 0 otherwise. This prediction can be regarded as a naive Bayesian classifier.
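As an illustration of equations (2) and (3), the following minimal sketch scores a sequence with a given weight matrix. The function names, the 0-based coding of A, C, G, T, and the three-column weight matrix are assumptions made for this example only; they are not taken from the paper.

```python
import numpy as np

BASES = "ACGT"  # coded 0..3 here; the text codes them 1..J

def wm_discriminant(theta, theta0, q1):
    """Return beta0 and the w x J matrix beta of equation (2)."""
    beta0 = np.log(q1 / (1.0 - q1))
    beta = np.log(theta / theta0)  # theta0 is broadcast over the w rows
    return beta0, beta

def h(seq, beta0, beta):
    """Additive score h(X) = beta0 + sum_i beta[i, X_i] of equation (3)."""
    idx = [BASES.index(ch) for ch in seq]
    return beta0 + beta[np.arange(len(idx)), idx].sum()

# Hypothetical 3-column weight matrix (rows sum to 1) and uniform background.
theta = np.array([[0.7, 0.1, 0.1, 0.1],
                  [0.1, 0.1, 0.7, 0.1],
                  [0.1, 0.1, 0.1, 0.7]])
theta0 = np.full(4, 0.25)
beta0, beta = wm_discriminant(theta, theta0, q1=0.5)

score = h("AGT", beta0, beta)   # score of the consensus site
label = int(score >= 0)         # predict Y = 1 when h(X) >= 0
```

With $q_1 = 0.5$ the constant $\beta_0$ vanishes, so the sign of the score depends only on the column-wise log ratios.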

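Given observed binding sites $D_n^{+}$, we estimate $\Theta$ by the maximum likelihood estimator (MLE) $\hat{\Theta}^m = (\hat{\theta}^m_1, \cdots, \hat{\theta}^m_w)^t$ and substitute it in equation (2) to obtain $\hat{\beta}^m$. Here, the superscript ‘m’ stands for estimators based on the WMM. Let $d\theta^m_i = \hat{\theta}^m_i - \theta_i$, which is an infinitesimal of order $1/\sqrt{n}$ as $n \to \infty$. The standard asymptotic theory (e.g., Ferguson 1996) implies that

\[
\sqrt{n}\, d\theta^m_i \xrightarrow{L} \mathcal{N}(0, \Sigma^m_i) \ \text{ restricted to } \ \sum_{j=1}^{J} d\theta^m_{ij} = 0, \quad \text{as } n \to \infty, \tag{4}
\]

and that $\sqrt{n}\, d\theta^m_i$, $i = 1, \cdots, w$, are mutually independent. The $(j, k)$th element of the covariance matrix $\Sigma^m_i$ is $(\delta_{jk}\theta_{ij} - \theta_{ij}\theta_{ik})/q_1$, where $\delta_{jk}$ is the Kronecker delta symbol and $1 \leq j, k \leq J$. From equation (2) we have $d\beta^m_{ij} = d\theta^m_{ij}/\theta_{ij}$, which leads to the following limiting distribution,

\[
\sqrt{n}\, d\beta^m_{ij} \xrightarrow{L} \mathcal{N}\big(0, (1 - \theta_{ij})/(\theta_{ij} q_1)\big), \quad \text{for } j = 1, \cdots, J, \text{ as } n \to \infty, \tag{5}
\]

with $\sqrt{n}\, d\beta^m_i$ mutually independent for $i = 1, \cdots, w$.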
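The estimation step can be sketched as follows, assuming the observed binding sites are stored as an integer array with 0-based nucleotide codes. The helper names, the simulated sites and the two-column weight matrix are hypothetical; only the relative-frequency MLE and the variance expression in (5) come from the text.

```python
import numpy as np

def estimate_weight_matrix(sites, J=4):
    """MLE of Theta: per-position relative frequencies in the binding sites."""
    sites = np.asarray(sites)          # shape (n1, w), entries in 0..J-1
    w = sites.shape[1]
    counts = np.zeros((w, J))
    for i in range(w):
        counts[i] = np.bincount(sites[:, i], minlength=J)
    return counts / sites.shape[0]

def beta_asymptotic_variance(theta, q1):
    """Asymptotic variance of sqrt(n) * d(beta_ij) from equation (5)."""
    return (1.0 - theta) / (theta * q1)

# Simulate binding sites from a hypothetical 2-column weight matrix.
rng = np.random.default_rng(0)
theta = np.array([[0.7, 0.1, 0.1, 0.1],
                  [0.1, 0.1, 0.7, 0.1]])
n1 = 5000
sites = np.column_stack([rng.choice(4, size=n1, p=row) for row in theta])

theta_hat = estimate_weight_matrix(sites)
beta_hat = np.log(theta_hat / 0.25)    # equation (2) with uniform background
```

The variance in (5) grows as $\theta_{ij}$ approaches 0, which is one reason the cell probabilities are assumed to be bounded away from 0.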

2.2. The free energy model. Let $F$, $X = (X_1, \cdots, X_w)$ and $FX$ be a TF, a DNA sequence, and the corresponding TF-DNA complex, respectively. The process of the TF-DNA interaction can be described by the chemical reaction $F + X = FX$. The concentrations of the three molecules at chemical equilibrium, $[F]$, $[X]$ and $[FX]$, are determined by the association constant $K_a(X)$, that is,

\[
\frac{[FX]}{[F][X]} = K_a(X) = \exp\left\{ \frac{-\Delta G(X)}{RT} \right\},
\]

where $\Delta G(X)$ is the Gibbs free energy (FE) for the interaction of $F$ with $X$, $R$ is the gas constant and $T$ the temperature. We regard $RT > 0$ as a constant. Suppose that the contribution of a single nucleotide $X_i$ to the FE is additive (von Hippel and Berg, 1986; Benos et al., 2002) so that we may write $-\Delta G(X)/(RT) = c + \sum_{i=1}^{w} b_{i X_i}$. Then we have

\[
\log \frac{[FX]}{[X]} = \log[F] + c + \sum_{i=1}^{w} b_{i X_i} := b_0 + \sum_{i=1}^{w} b_{i X_i}. \tag{6}
\]

To avoid non-identifiability in estimation, we take $S_{\mathrm{ref}} = (s_1, \cdots, s_w)$ as a reference sequence to determine a baseline level of the FE, and define $\tilde{\beta}_{ij} = b_{ij} - b_{i s_i}$ for all $i, j$ and $\tilde{\beta}_0 = b_0 + \sum_{i=1}^{w} b_{i s_i}$, such that $\tilde{\beta}_{i s_i} \equiv 0$ for $i = 1, \cdots, w$ and

\[
b_0 + \sum_{i=1}^{w} b_{i X_i} = \tilde{\beta}_0 + \sum_{i=1}^{w} \tilde{\beta}_{i X_i} \tag{7}
\]

for every $X \in \mathcal{I}^w$. Let $Y$ be the indicator for whether $X$ is bound by the TF at chemical equilibrium. From the physical meaning of concentration,

\[
P(Y = 1 \mid X) = \frac{[FX]}{[X] + [FX]}. \tag{8}
\]

Combining equations (6), (7) and (8) leads to an additive discriminant function for this free energy model (FEM),

\[
\log \frac{P(Y = 1 \mid X)}{P(Y = 0 \mid X)} = \tilde{\beta}_0 + \sum_{i=1}^{w} \tilde{\beta}_{i X_i} = \tilde{\beta}_0 + X\tilde{\beta} := h(X). \tag{9}
\]
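For completeness, the combination of (6), (7) and (8) can be written out in one line. The step below uses only quantities already defined in this subsection, with the reference-adjusted coefficients of the FEM written with tildes.

```latex
\[
\log \frac{P(Y = 1 \mid X)}{P(Y = 0 \mid X)}
  = \log \frac{[FX]/([X] + [FX])}{[X]/([X] + [FX])}
  = \log \frac{[FX]}{[X]}
  = b_0 + \sum_{i=1}^{w} b_{i X_i}
  = \tilde{\beta}_0 + \sum_{i=1}^{w} \tilde{\beta}_{i X_i}.
\]
```

The first equality uses (8) together with $P(Y = 0 \mid X) = 1 - P(Y = 1 \mid X)$, the third uses (6), and the last uses (7).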

As for the WMM, we assume that $\tilde{\beta}_0$ is fixed by prior or a desired cost. Furthermore, it is conventional to assume that $X$ is sampled from an i.i.d. background model $\theta_0$, i.e., $P(X) = \theta_0(X)$. The data generation process of the FEM has a clear biological meaning. Suppose that we have sampled $n$ nucleotide sequences of length $w$, $\{X_k \in \mathcal{I}^w\}_{k=1}^{n}$, from the genomic background $\theta_0$. We mix these sequences with TF molecules in a container where the concentration of the TF is held constant. At chemical equilibrium we label the sequences $X_k$ bound by the TF as $Y_k = 1$ and otherwise $Y_k = 0$. The output of this experiment is the labeled sample $D_n = \{(Y_k, X_k)\}_{k=1}^{n}$. Although there exist other models based on binding free energy, we focus on this basic model in this paper, which makes a theoretical analysis relatively clean while capturing the main characteristics of FE-based approaches.

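Given $D_n$, the MLE of $\tilde{\beta}$, denoted by $\hat{\beta}^f = \tilde{\beta} + d\beta^f$ with the superscript ‘f’ for FE-based estimators, can be calculated by the standard logistic regression. Note that $\hat{\beta}^f$ maximizes the conditional likelihood

\[
P(Y \mid X, \tilde{\beta}) = \frac{\exp\{(\tilde{\beta}_0 + X\tilde{\beta})Y\}}{1 + \exp(\tilde{\beta}_0 + X\tilde{\beta})} \tag{10}
\]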

determined by equation (9). Similar to the results in Efron (1975), it is not difficult to demonstrate that $\hat{\beta}^f$ is consistent for $\tilde{\beta}$ with asymptotic normality,

\[
\sqrt{n}\, d\beta^f \xrightarrow{L} \mathcal{N}(0, \Sigma^f), \quad \text{as } n \to \infty, \tag{11}
\]

where $d\beta^f$ is regarded as a vector of $(J - 1)w$ dimensions (recall that $\tilde{\beta}_{i s_i} = \hat{\beta}^f_{i s_i} \equiv 0$ for $i = 1, \cdots, w$). The asymptotic covariance matrix is

\[
\Sigma^f = \left[ E_{\theta_0}\left\{ p_1(X)\, p_0(X)\, C_X C_X^t \right\} \right]^{-1}, \tag{12}
\]

where $p_y(X) = P(Y = y \mid X)$ for $y = 0, 1$, $C_X$ is a $(J - 1)w$-dimensional column vector coding each $X_i$ as a factor of $J$ levels, and $E_{\theta_0}$ is taken with respect to (w.r.t.) the background model $\theta_0$ that generates the sequence $X$.
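The logistic regression step can be sketched with a few Newton-Raphson iterations on the conditional likelihood (10), holding the constant term fixed. Everything specific here is an assumption for illustration: the function names, the simulated FEM parameters, the choice of ‘AA’ as reference sequence, and the plug-in estimate of $\Sigma^f$ by the inverse of the averaged observed information.

```python
import numpy as np

def design_matrix(X, ref, J=4):
    """Code each X_i as a J-level factor, dropping the reference level s_i.

    Returns an n x (J-1)w matrix whose rows are C_X^t.
    """
    n, w = X.shape
    cols = []
    for i in range(w):
        for j in range(J):
            if j != ref[i]:
                cols.append((X[:, i] == j).astype(float))
    return np.column_stack(cols)

def fit_fe(X, y, beta0, ref, J=4, iters=25):
    """Newton-Raphson for the conditional likelihood (10), beta0 held fixed."""
    C = design_matrix(X, ref, J)
    b = np.zeros(C.shape[1])
    for _ in range(iters):
        p1 = 1.0 / (1.0 + np.exp(-(beta0 + C @ b)))
        grad = C.T @ (y - p1)
        hess = (C * (p1 * (1.0 - p1))[:, None]).T @ C
        b = b + np.linalg.solve(hess, grad)
    # Plug-in estimate of Sigma^f in (12): inverse of the average information.
    sigma_f = np.linalg.inv(hess / len(y))
    return b, sigma_f

# Simulate from the FEM with hypothetical parameters (w = 2, reference "AA").
rng = np.random.default_rng(1)
n, w, ref = 20000, 2, [0, 0]
true_b = np.array([-1.0, -2.0, -1.5, -0.5, -1.0, -2.0])
X = rng.integers(0, 4, size=(n, w))          # uniform background theta_0
eta = 1.0 + design_matrix(X, ref) @ true_b
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-eta))).astype(float)

b_hat, sigma_f = fit_fe(X, y, beta0=1.0, ref=ref)
```

Any standard logistic regression routine that accepts a fixed offset would serve the same purpose; the explicit iteration is shown only to make the role of $C_X$ and $p_1(X)p_0(X)$ visible.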


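2.3. Comparison. Given $(\beta_0, \beta)$ in the WMM and the reference sequence $S_{\mathrm{ref}}$ in the FEM, if we let

\[
\tilde{\beta}_0 = \beta_0 + \sum_{i=1}^{w} \beta_{i s_i}, \qquad \tilde{\beta}_{ij} = \beta_{ij} - \beta_{i s_i}, \tag{13}
\]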

for $i = 1, \cdots, w$, $j = 1, \cdots, J$, then the two models have the same conditional distribution $[Y \mid X]$ (3, 9) for any $X$. To simplify notations, we shall denote the decision function (9) in the FEM by $h(X) = \tilde{\beta}_0 + X\tilde{\beta}$ hereafter. Except for this conditional distribution, other model assumptions are different. The WMM assumes that the nucleotides in $X$ are generated independently given its label $Y$. But this is not true for the FEM, in which the conditional probability of $X$ given $Y$ is

\[
P(X \mid Y, \mathrm{FEM}) \propto P(Y \mid X, \mathrm{FEM})\, P(X \mid \mathrm{FEM}) = \frac{\exp\{(\tilde{\beta}_0 + X\tilde{\beta})Y\}}{1 + \exp(\tilde{\beta}_0 + X\tilde{\beta})}\, \theta_0(X). \tag{14}
\]

Since equation (14) cannot be written as a product of functions of $X_i$, this model implicitly allows dependence among $X_1, \cdots, X_w$. Consequently, the FEM may account for some observed nucleotide dependences within a motif such as in Bulyk et al. (2002), Barash et al. (2003), Zhou and Liu (2004), and Zhao et al. (2005) among others. On the other hand, under the FEM the marginal distribution of $X$ is simply the background nucleotide distribution, i.e., $P(X \mid \mathrm{FEM}) = \theta_0(X)$, but the marginal distribution of $X$ under the WMM is a mixture,

\[
P(X \mid \mathrm{WMM}) = q_1 \Theta(X) + q_0 \theta_0(X). \tag{15}
\]

The different model assumptions lead to different procedures for parameter estimation, in particular for the coefficients $\beta$ ($\tilde{\beta}$). As discussed in Sections 2.1 and 2.2, $\hat{\beta}^m$ and $\hat{\beta}^f$ are consistent under the WMM and under the FEM, respectively. Since $\hat{\beta}^f$ maximizes the conditional likelihood $P(Y \mid X, \tilde{\beta})$ (10), which is identical between the two models, it is also consistent for $\beta$ under the WMM up to the translation (13). However, $\hat{\beta}^f$ is expected to be less efficient than $\hat{\beta}^m$ in prediction if the WMM corresponds to the underlying data generation process, because it ignores the information on $\Theta$ contained in the marginal likelihood $P(X \mid \Theta, \mathrm{WMM})$ (to be discussed in detail in Section 3.1). Conversely, if data are generated by the FEM, $\hat{\beta}^m$ is biased and no longer consistent. We will analyze the bias and the resulting incremental error rate in later sections.
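The claim that the translation (13) leaves the decision function unchanged can be checked numerically. In the sketch below, the function names, the two-column weight matrix and the reference sequence are hypothetical; nucleotides are coded 0 to 3.

```python
import numpy as np
from itertools import product

def translate(beta0, beta, ref):
    """Map WMM parameters (beta0, beta) to FEM parameters via equation (13)."""
    rows = np.arange(beta.shape[0])
    base = beta[rows, ref]                 # beta_{i s_i}, one value per row
    return beta0 + base.sum(), beta - base[:, None]

def h(x, b0, b):
    """Additive decision function b0 + sum_i b[i, x_i]."""
    return b0 + b[np.arange(len(x)), list(x)].sum()

theta = np.array([[0.7, 0.1, 0.1, 0.1],
                  [0.1, 0.1, 0.7, 0.1]])
beta0, beta = 0.0, np.log(theta / 0.25)    # q1 = 0.5, uniform background
ref = [0, 2]                               # hypothetical reference sequence "AG"

tb0, tb = translate(beta0, beta, ref)
max_gap = max(abs(h(x, beta0, beta) - h(x, tb0, tb))
              for x in product(range(4), repeat=2))
```

The two parameterizations agree on every sequence up to rounding, while the translated coefficients vanish at the reference nucleotides by construction.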

3. Theoretical results. For both WMM and FEM, the ideal decision function $h(x)$ is obtained with the true parameters of the respective models, and the corresponding ideal error rate is

\[
R^* = \sum_{x : h(x) \geq 0} P(Y = 0, X = x) + \sum_{x : h(x) < 0} P(Y = 1, X = x). \tag{16}
\]

Denote by

\[
R^*(x) = \min_{y \in \{0, 1\}} P(Y = y \mid X = x)
\]

the ideal error rate for $h(x)$ given $X = x$. Consider a decision function $\hat{h}(x)$ estimated from $D_n$. Given any $x$ for which $\hat{h}(x)h(x) < 0$, the incremental error rate beyond $R^*(x)$ is

\[
\Delta R(x) = |P(Y = 1 \mid X = x) - P(Y = 0 \mid X = x)|.
\]

Then the expectation of the total incremental error rate for $\hat{h}$ is

\[
E[\Delta R(\hat{h})] = E\left[ \sum_{x \in \mathcal{I}^w} \Delta R(x) P(X = x)\, 1\{\hat{h}(x)h(x) < 0\} \right] = \sum_{x \in \mathcal{I}^w} \Delta R(x) P(X = x)\, P\{\hat{h}(x)h(x) < 0\}, \tag{17}
\]

where $1(\cdot)$ is the indicator function. Please note that $\hat{h}$, constructed from a sample of size $n$, is a random function. Let $\Delta h(x) = \hat{h}(x) - h(x)$ be the deviation of $\hat{h}(x)$ from $h(x)$. In what follows, we will derive two theorems on $E[\Delta R(\hat{h})]$ under different assumptions for $\Delta h(x)$. As we will see, the asymptotic error rates of the WM and the FE procedures under the data generation models discussed in this paper can all be calculated based on the two theorems.

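Suppose that, for every $x$, $\sqrt{n}\, \Delta h(x) \xrightarrow{L} \mathcal{N}(0, V(\hat{h}, x))$, where $V(\hat{h}, x)$ is the asymptotic variance. As $n \to \infty$,

\[
\begin{aligned}
E[\Delta R(\hat{h})] &\to \sum_{x \in \mathcal{I}^w} \Delta R(x) P(X = x)\, P(\Delta h(x) > |h(x)|) \\
&= \sum_{x \in \mathcal{I}^w} \Delta R(x) P(X = x)\, \Phi\left\{ -\sqrt{n h^2(x)/V(\hat{h}, x)} \right\},
\end{aligned} \tag{18}
\]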

where $\Phi$ is the cdf of the standard normal distribution $\mathcal{N}(0, 1)$. Let

\[
\alpha(\hat{h}) = \min_{h(x) \neq 0} h^2(x)/V(\hat{h}, x), \tag{19}
\]

and let $x^*$ be the corresponding minimizer. Note that $\Delta R(x) = 0$ when $h(x) = 0$. Thus, as $n \to \infty$, $E[\Delta R(\hat{h})]$ is dominated by the term $\Delta R(x^*) P(X = x^*)\, \Phi\left[ -\{n\alpha(\hat{h})\}^{1/2} \right]$, where $\alpha(\hat{h})$ determines the rate of convergence. Using the theory of large deviations, we obtain:

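Theorem 1. If $\sqrt{n}\, \Delta h(x) \xrightarrow{L} \mathcal{N}(0, V(\hat{h}, x))$ for every $x \in \mathcal{I}^w$ then

\[
\frac{1}{n} \log E[\Delta R(\hat{h})] \to -\frac{\alpha(\hat{h})}{2}, \quad \text{as } n \to \infty.
\]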
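The rate in Theorem 1 can be checked heuristically from the dominant term of (18) using the standard Gaussian tail bounds. This is only a sketch of the rate calculation under the normal approximation, not the large-deviation argument itself; $\phi$ denotes the standard normal density.

```latex
\[
\frac{t}{1 + t^2}\, \phi(t) \;\leq\; \Phi(-t) \;\leq\; \frac{1}{t}\, \phi(t)
\quad (t > 0),
\qquad \phi(t) = (2\pi)^{-1/2} e^{-t^2/2},
\]
so that $\log \Phi(-t) = -t^2/2 + O(\log t)$ as $t \to \infty$. Setting
$t = \{n\alpha(\hat{h})\}^{1/2}$ gives
\[
\frac{1}{n} \log \Phi\left[ -\{n\alpha(\hat{h})\}^{1/2} \right]
  = -\frac{\alpha(\hat{h})}{2} + O\!\left( \frac{\log n}{n} \right)
  \;\to\; -\frac{\alpha(\hat{h})}{2}.
\]
```

Every other term of the sum in (18) has $h^2(x)/V(\hat{h}, x) \geq \alpha(\hat{h})$ and therefore decays at least as fast, which is why the minimum in (19) sets the exponential rate.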

Let $\hat{h}_a$ and $\hat{h}_b$ be two estimated decisions constructed from samples of size $n_a$ and $n_b$, respectively. Suppose that both of them satisfy the condition in Theorem 1. We define the asymptotic relative efficiency (ARE) of $\hat{h}_a$ with respect to $\hat{h}_b$ by $\mathrm{ARE}(\hat{h}_a, \hat{h}_b) = \alpha(\hat{h}_a)/\alpha(\hat{h}_b)$, which is the limit ratio $n_b/n_a$ required to achieve the same asymptotic performance.


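If $\hat{h}(x)$ is biased in the sense that $\sqrt{n}\{\Delta h(x) - \mu(\hat{h}, x)\} \xrightarrow{L} \mathcal{N}(0, V(\hat{h}, x))$, where $\mu(\hat{h}, x)$ denotes the asymptotic bias of $\hat{h}(x)$, then a simple derivation from equation (17) gives that

\[
E[\Delta R(\hat{h})] \to \sum_{x \in \mathcal{I}^w} \Delta R(x) P(X = x)\, \Phi\left[ \frac{-\sqrt{n}\, \mathrm{sign}\{h(x)\}\{h(x) + \mu(\hat{h}, x)\}}{\sqrt{V(\hat{h}, x)}} \right],
\]

as $n \to \infty$, where $\mathrm{sign}(y)$ is the sign of $y$ with $\mathrm{sign}(0) \equiv 0$.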

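Theorem 2. Suppose that $\sqrt{n}\{\Delta h(x) - \mu(\hat{h}, x)\} \xrightarrow{L} \mathcal{N}(0, V(\hat{h}, x))$ for every $x \in \mathcal{I}^w$. Let $B(\hat{h}) = \{x : \mathrm{sign}\{h(x)\}\{h(x) + \mu(\hat{h}, x)\} < 0\}$. Then

\[
E[\Delta R(\hat{h})] \to \sum_{x \in B(\hat{h})} \Delta R(x) P(X = x), \quad \text{as } n \to \infty. \tag{20}
\]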

We ignore the case $\{x : h(x) + \mu(\hat{h}, x) = 0\}$, which practically never happens. The set $B(\hat{h})$ is the collection of $x$ for which the estimated decision $\hat{h}$ gives a different predicted label from the ideal decision $h$ as $n \to \infty$. Note that $E[\Delta R(\hat{h})]$ does not vanish if $B(\hat{h})$ is nonempty. Thus, the incremental percentage over the ideal error rate, $E[\Delta R(\hat{h})]/R^*$, is an appropriate measure of the predictive performance of $\hat{h}$.

In the remainder of this section, we derive and compare the error rates of the WM and the FE procedures. From Sections 3.1 to 3.4, we assume that the constant term $\beta_0$ ($\tilde{\beta}_0$) is fixed to its true value. The results are generalized to situations where the constant is mis-specified in Section 3.5. The computation of $\alpha(\hat{h})$ (19) and $E[\Delta R(\hat{h})]$ (20) will be discussed in Section 4.

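3.1. Error rates under WMM. In this subsection we assume that the underlying data generation process is given by the WMM. Since both $\hat{\beta}^m$ and $\hat{\beta}^f$ are consistent with asymptotic normality under the WMM, we may uniformly denote their decision functions by $\hat{h}(x) = \beta_0 + x\hat{\beta} = h(x) + x\, d\beta$, where $d\beta = \hat{\beta} - \beta$ and $\sqrt{n}\, d\beta$ follows a normal distribution $\mathcal{N}(0, \Sigma)$ as $n \to \infty$. This implies that $\sqrt{n}\, \Delta h(x) = \sqrt{n}\, x\, d\beta \xrightarrow{L} \mathcal{N}(0, V^m(\hat{\beta}, x))$ with $V^m(\hat{\beta}, x)$ being the asymptotic variance. The superscript ‘m’ indicates the WMM as the data generation model. Let $E[\Delta R^m(\hat{\beta})]$ be the expected incremental error rate of $\hat{h}$ indexed by $\hat{\beta}$. Following Theorem 1,

\[
\frac{1}{n} \log E[\Delta R^m(\hat{\beta})] \to -\frac{\alpha^m(\hat{\beta})}{2} = -\frac{1}{2} \min_{h(x) \neq 0} \frac{h^2(x)}{V^m(\hat{\beta}, x)}, \tag{21}
\]

as $n \to \infty$. Consequently, the ARE of the FE procedure w.r.t. the WM procedure, $\mathrm{ARE}^m(\hat{\beta}^f, \hat{\beta}^m)$, is determined by the ratio of $\alpha^m(\hat{\beta}^f)$ over $\alpha^m(\hat{\beta}^m)$.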

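The decision function of the WM procedure is constructed with $\hat{\beta}^m$ (Section 2.1). Note that $x\, d\beta = \sum_{i} d\beta_{i x_i}$ is a summation of $w$ $d\beta_{ij}$’s, each from a different $d\beta_i$. The limiting distribution of $\sqrt{n}\, d\beta^m_{ij}$ (5) and the mutual independence among $d\beta^m_i$ imply that the asymptotic variance of $\sqrt{n}\, x\, d\beta^m$ is

\[
\frac{1}{q_1} \sum_{i=1}^{w} (1 - \theta_{i x_i})/\theta_{i x_i} = \frac{1}{q_1}\, x\{(1 - \Theta)/\Theta\},
\]

and consequently,

\[
\alpha^m(\hat{\beta}^m) = \min_{x\beta \neq -\beta_0} \frac{q_1 (\beta_0 + x\beta)^2}{x\{(1 - \Theta)/\Theta\}}.
\]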
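For a short motif the minimum defining $\alpha^m(\hat{\beta}^m)$ can be found by enumerating all $J^w$ sequences. The sketch below does exactly that; the function name and the example weight matrix are hypothetical, and brute-force enumeration is used here only for illustration (it is not presented as the computational approach of Section 4).

```python
import numpy as np
from itertools import product

def alpha_wm(theta, theta0, q1):
    """Brute-force alpha^m(beta^m): minimum over x with h(x) != 0 of
    q1 * h(x)^2 / sum_i (1 - theta[i, x_i]) / theta[i, x_i]."""
    w, J = theta.shape
    beta0 = np.log(q1 / (1.0 - q1))
    beta = np.log(theta / theta0)
    ratio = (1.0 - theta) / theta
    rows = np.arange(w)
    best, argbest = np.inf, None
    for x in product(range(J), repeat=w):
        cols = list(x)
        hx = beta0 + beta[rows, cols].sum()
        if np.isclose(hx, 0.0):
            continue                      # terms with h(x) = 0 are excluded
        val = q1 * hx ** 2 / ratio[rows, cols].sum()
        if val < best:
            best, argbest = val, x
    return best, argbest

theta = np.array([[0.7, 0.1, 0.1, 0.1],
                  [0.1, 0.1, 0.7, 0.1]])
alpha, x_star = alpha_wm(theta, np.full(4, 0.25), q1=0.5)
```

In this example the minimizer is a sequence that matches the consensus at only one position, so its score lies close to the decision boundary while its variance is large.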

Suppose that we have chosen $(s_1, \cdots, s_w)$ as the reference sequence in the FE procedure. Define $\tilde{\beta}_0$ and $\tilde{\beta}$ from the parameters $(\beta_0, \beta)$ of the WMM by equation (13). Then the FE-based estimator $\hat{\beta}^f$ is consistent for $\tilde{\beta}$ with asymptotic normality. Let $d\beta^f = \hat{\beta}^f - \tilde{\beta}$. Similar to equation (12), the asymptotic covariance matrix of $\sqrt{n}\, d\beta^f$ is $\left[ E\left\{ p_1(X) p_0(X) C_X C_X^t \right\} \right]^{-1}$, where the expectation is taken w.r.t. the marginal distribution of $X$ under the WMM (15). Thus the covariance matrix can be written as

\[
\mathrm{Cov}^m\left( \sqrt{n}\, d\beta^f \right) = \left[ q_0 E_{\theta_0}\left\{ \frac{e^{h(X)}}{e^{h(X)} + 1}\, C_X C_X^t \right\} \right]^{-1}, \tag{22}
\]

where the expectation $E_{\theta_0}$ averages over $X \in \mathcal{I}^w$ generated from the background model $\theta_0$. Based on equation (22), one can calculate the variance of $\sqrt{n}\, x\, d\beta^f$ for every $x$ and determine the convergence rate $\alpha^m(\hat{\beta}^f)$ of the expected incremental error rate $E[\Delta R^m(\hat{\beta}^f)]$ for the FE procedure.
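Equation (22) can likewise be evaluated by enumeration when $w$ is small. The sketch below builds the information matrix, inverts it, and takes the minimum in (19) over sequences with nonzero score and nonzero variance. The function name, the one-column example and the treatment of the reference sequence (skipped, because its coded vector is zero when the constant is held fixed) are assumptions made for this illustration.

```python
import numpy as np
from itertools import product

def fe_alpha_under_wmm(theta, theta0, q1, ref):
    """alpha^m(beta^f) from equations (19) and (22), by enumerating I^w."""
    w, J = theta.shape
    q0 = 1.0 - q1
    beta0 = np.log(q1 / q0)
    beta = np.log(theta / theta0)
    rows = np.arange(w)
    seqs = [list(x) for x in product(range(J), repeat=w)]

    def code(x):
        # C_X: one-hot coding of each position with the reference level dropped
        return np.array([float(x[i] == j) for i in range(w)
                         for j in range(J) if j != ref[i]])

    dim = (J - 1) * w
    info = np.zeros((dim, dim))
    for x in seqs:
        hx = beta0 + beta[rows, x].sum()
        c = code(x)
        weight = q0 * np.prod(theta0[x]) * np.exp(hx) / (np.exp(hx) + 1.0)
        info += weight * np.outer(c, c)
    cov = np.linalg.inv(info)                 # equation (22)

    alpha = np.inf
    for x in seqs:
        hx = beta0 + beta[rows, x].sum()
        c = code(x)
        var = c @ cov @ c                     # variance of sqrt(n) x d(beta^f)
        if not np.isclose(hx, 0.0) and var > 0:
            alpha = min(alpha, hx ** 2 / var)
    return alpha

theta = np.array([[0.7, 0.1, 0.1, 0.1]])      # hypothetical one-column motif
alpha_fe = fe_alpha_under_wmm(theta, np.full(4, 0.25), q1=0.5, ref=[0])
```

Dividing this value by the corresponding $\alpha^m(\hat{\beta}^m)$ gives the ARE defined after Theorem 1.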

Because the estimation of $\hat{\beta}^f$ is only based on the conditional distribution $[Y \mid X]$ while $\hat{\beta}^m$ is estimated from the joint distribution of $Y$ and $X$, we expect $\hat{\beta}^f$ to be less efficient in prediction with $\alpha^m(\hat{\beta}^f) < \alpha^m(\hat{\beta}^m)$. We will conduct a numerical study in Section 5 to evaluate $\mathrm{ARE}^m(\hat{\beta}^f, \hat{\beta}^m)$ on 200 transcription factors to confirm our conclusion. Here we demonstrate the lower efficiency of $\hat{\beta}^f$ by the loss of Fisher information in estimating an individual $\theta_{ij}$ from the conditional likelihood only. For simplicity, suppose that $\Theta_{[-i]}$ is given and collapse $X_i$ into two categories, $X_i = j$ and $X_i \neq j$. Because

\[
P(X, Y \mid \Theta) = P(Y \mid X, \Theta)\, P(X \mid \Theta)
\]

under the WMM, the loss of information equals the Fisher information on $\theta_{ij}$ contained in the marginal likelihood $P(X \mid \Theta)$, denoted by $I(\theta_{ij} \mid X)$. Let $I(\theta_{ij} \mid X, Y)$ be the Fisher information on $\theta_{ij}$ given $X$ and $Y$ jointly. We define

\[
\Delta(\theta_{ij} \mid [Y \mid X]) = I(\theta_{ij} \mid X)/I(\theta_{ij} \mid X, Y) \tag{23}
\]

as the fraction of the loss of information on $\theta_{ij}$ in the conditional likelihood $P(Y \mid X, \Theta)$.

Proposition 3. Let $\bar{\theta}_{ij} = q_0 \theta_{0j} + q_1 \theta_{ij}$. If $(Y, X)$ is drawn from the WMM then

\[
\Delta(\theta_{ij} \mid [Y \mid X]) \geq \frac{q_1 \theta_{ij}(1 - \theta_{ij})}{\bar{\theta}_{ij}(1 - \bar{\theta}_{ij})} := B(q_1, \theta_{ij}, \theta_{0j}).
\]


A proof of this proposition is given in Appendix A. If one chooses to include an equal number of background sites ($Y = 0$) and binding sites ($Y = 1$) in logistic regression to estimate $\hat{\beta}^f$, which effectively specifies $q_0 = q_1 = 0.5$ by design, then this lower bound may be substantial. For example, with a uniform background distribution $\theta_{0j} = 0.25$ for $j = 1, \cdots, 4$, the range of $B(q_1, \theta_{ij}, \theta_{0j})$ is between 20% and 55% for most typical values of $\theta_{ij}$ (Table 1).

Table 1
Typical values of $B(0.5, \theta_{ij}, 0.25)$

$\theta_{ij}$                        0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9
$B(0.5, \theta_{ij}, 0.25)$ (%)      31    46    53    55    53    49    42    32    18
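The entries of Table 1 follow directly from the bound in Proposition 3. A minimal sketch, with a function name chosen here for illustration:

```python
def info_loss_bound(q1, theta_ij, theta_0j):
    """Lower bound B(q1, theta_ij, theta_0j) of Proposition 3."""
    theta_bar = (1.0 - q1) * theta_0j + q1 * theta_ij   # mixture probability
    return q1 * theta_ij * (1.0 - theta_ij) / (theta_bar * (1.0 - theta_bar))

# Percentages for theta_ij = 0.1, ..., 0.9 with q1 = 0.5 and theta_0j = 0.25.
percent = [round(100 * info_loss_bound(0.5, t / 10, 0.25)) for t in range(1, 10)]
```

For instance, $\theta_{ij} = 0.4$ gives $\bar{\theta}_{ij} = 0.325$ and $B = 0.12/0.219375 \approx 0.547$, matching the tabulated 55%.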

3.2. WMM with Markov background. We generalize the background model to a Markov chain, which often represents a better fit to genomic background in higher organisms. We assume that given $Y = 0$, $X$ is generated by a first order Markov chain with a transition probability matrix $\psi_0 = (\psi_0(x, y))_{J \times J}$ where $x, y \in \mathcal{I}$. For any $x = (x_1, \cdots, x_w) \in \mathcal{I}^w$, $\psi_0(x) := \prod_{i=1}^{w} \psi_0(x_{i-1}, x_i)$, where $\psi_0(x_0, x_1)$ is interpreted as the probability of $x_1$ under the stationary distribution of the Markov chain. The ideal decision function under this model is

\[
h_1(x) = \log \frac{P(Y = 1 \mid X = x)}{P(Y = 0 \mid X = x)} = \beta_0 + \sum_{i=1}^{w} \left\{ \log \theta_{i x_i} - \log \psi_0(x_{i-1}, x_i) \right\}, \tag{24}
\]

where $\beta_0 = \log(q_1/q_0)$ and the subscript ‘1’ [in $h_1(x)$ and $\Delta R^m_1$ (26)] indicates a quantity whose definition involves a Markov background model. Since $\psi_0$ can be accurately estimated with sufficient genomic background sequences, we assume that it is given. With the MLE $\hat{\Theta}^m$ from observed binding sites, the WM procedure constructs a decision whose expected incremental error rate converges to zero exponentially fast as $n \to \infty$, following Theorem 1.

With a slight abuse of notation, let us denote by $\theta_0$ the probability vector of the stationary distribution of the Markov chain, which is also the marginal distribution of any nucleotide $X_i$ in a background site. We still define $\beta_0$ and $\beta$ by equation (2) with $\theta_0$ being the stationary probabilities, and translate $\beta_0$ and $\beta$ via a reference sequence to $\tilde{\beta}_0$ and $\tilde{\beta}$ (13). Let $(Y, X)$ be a sample from the WMM with Markov background. If the dependence among neighboring nucleotides in a background site is ignored, the conditional likelihood $P(Y \mid X, \tilde{\beta})$, parameterized by $\tilde{\beta}$, is then given by the same expression in equation (10). Because the FE-based estimator $\hat{\beta}^f$ maximizes this conditional likelihood, it is standard to show that $\hat{\beta}^f \xrightarrow{P} \tilde{\beta}$ and is asymptotically normal. Let $\hat{h}^f(x) = \tilde{\beta}_0 + x\hat{\beta}^f$ denote the estimated decision function of the FE procedure. As $n \to \infty$,

\[
\hat{h}^f(x) \xrightarrow{P} \tilde{\beta}_0 + x\tilde{\beta} = \beta_0 + \sum_{i=1}^{w} \left\{ \log \theta_{i x_i} - \log \theta_{0 x_i} \right\}. \tag{25}
\]


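Let $\Delta h^f(x) = \hat{h}^f(x) - h_1(x)$ be the deviation of $\hat{h}^f(x)$ from the ideal decision (24). Comparing equations (24) and (25) gives the asymptotic bias,

\[
b(x) = \sum_{i=1}^{w} \left\{ \log \psi_0(x_{i-1}, x_i) - \log \theta_{0 x_i} \right\} = \log \psi_0(x) - \log \theta_0(x).
\]

Due to the asymptotic normality of $\hat{\beta}^f$, we have $\sqrt{n}\{\Delta h^f(x) - b(x)\} \xrightarrow{L} \mathcal{N}(0, V^m_1(\hat{\beta}^f, x))$, where $V^m_1(\hat{\beta}^f, x)$ is the corresponding asymptotic variance. Under this model,

\[
\Delta R(x) P(X = x) = |q_1 \Theta(x) - q_0 \psi_0(x)| = q_0 \psi_0(x) \left| e^{h_1(x)} - 1 \right|.
\]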

Following Theorem 2 with $\mu(\hat{h}^f, x) = b(x)$, the expected incremental error rate

\[
E[\Delta R^m_1(\hat{\beta}^f)] \to \sum_{x \in B^m_1(\hat{\beta}^f)} q_0 \left| e^{h_1(x)} - 1 \right| \psi_0(x), \quad \text{as } n \to \infty, \tag{26}
\]

where $B^m_1(\hat{\beta}^f) = \{x : \mathrm{sign}\{h_1(x)\}\{h_1(x) + b(x)\} < 0\}$. The incremental percentage over the ideal error rate, $E[\Delta R^m_1(\hat{\beta}^f)]/(R^m_1)^*$, is appropriate for comparing the FE-based prediction with the WM-based prediction whose expected error rate converges to $(R^m_1)^*$. A general expression for $R^*$ is given in equation (16) which, under the WMM with Markov background, is written as

\[
(R^m_1)^* = q_0 E_{\psi_0}\left\{ 1(h_1(X) \geq 0) + e^{h_1(X)}\, 1(h_1(X) < 0) \right\}. \tag{27}
\]
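For small $w$, the limits (26) and (27) can be evaluated by enumerating $\mathcal{I}^w$. In the sketch below the function names, the power-iteration routine for the stationary distribution, and the ‘sticky’ transition matrix are hypothetical choices made for illustration; nucleotides are coded 0 to 3.

```python
import numpy as np
from itertools import product

def stationary(psi0, iters=1000):
    """Stationary distribution of a transition matrix by power iteration."""
    pi = np.full(psi0.shape[0], 1.0 / psi0.shape[0])
    for _ in range(iters):
        pi = pi @ psi0
    return pi

def markov_error_rates(theta, psi0, q1):
    """Enumerate I^w to evaluate the limits (26) and (27)."""
    w, J = theta.shape
    q0 = 1.0 - q1
    theta0 = stationary(psi0)
    beta0 = np.log(q1 / q0)
    incr, ideal = 0.0, 0.0
    for x in product(range(J), repeat=w):
        log_psi = np.log(theta0[x[0]]) + sum(
            np.log(psi0[x[i - 1], x[i]]) for i in range(1, w))
        log_theta0 = sum(np.log(theta0[xi]) for xi in x)
        log_Theta = sum(np.log(theta[i, xi]) for i, xi in enumerate(x))
        h1 = beta0 + log_Theta - log_psi        # equation (24)
        b = log_psi - log_theta0                # asymptotic bias b(x)
        p_bg = q0 * np.exp(log_psi)             # q0 * psi0(x)
        ideal += p_bg * (1.0 if h1 >= 0 else np.exp(h1))
        if np.sign(h1) * (h1 + b) < 0:          # x belongs to B_1^m
            incr += p_bg * abs(np.exp(h1) - 1.0)
    return incr, ideal

theta = np.array([[0.7, 0.1, 0.1, 0.1],
                  [0.1, 0.1, 0.7, 0.1]])
psi0 = np.full((4, 4), 0.2) + 0.2 * np.eye(4)   # hypothetical sticky background
incr, ideal = markov_error_rates(theta, psi0, q1=0.5)
```

When every row of the transition matrix equals the stationary distribution, the bias $b(x)$ vanishes, the set $B^m_1$ is empty, and the incremental error is zero, as expected from Section 3.1.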

3.3. Error rates under FEM. We now analyze asymptotic error rates of the two procedures regarding the FEM as the underlying data generation mechanism.

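The FE-based estimator $\hat{\beta}^f$ is consistent for $\tilde{\beta}$ under the FEM. The asymptotic normality of $\sqrt{n}\, d\beta^f$ (11, 12) implies that $\sqrt{n}\, x\, d\beta^f \xrightarrow{L} \mathcal{N}(0, V^f(\hat{\beta}^f, x))$. Let $E[\Delta R^f(\hat{\beta}^f)]$ be the expected incremental error rate of the FE procedure under the FEM. From Theorem 1, we have

\[
\frac{1}{n} \log E[\Delta R^f(\hat{\beta}^f)] \to -\frac{\alpha^f(\hat{\beta}^f)}{2} = -\frac{1}{2} \min_{h(x) \neq 0} \frac{h^2(x)}{V^f(\hat{\beta}^f, x)}, \quad \text{as } n \to \infty. \tag{28}
\]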

Denote by $\theta^f_i = (\theta^f_{i1}, \cdots, \theta^f_{iJ})$ the probability vector of the conditional distribution $[X_i \mid Y = 1]$ under the FEM, i.e.,

\[
\theta^f_{ij} = P(X_i = j \mid Y = 1, \mathrm{FEM}), \tag{29}
\]

for $i = 1, \cdots, w$, and call $\Theta^f = (\theta^f_{ij})_{w \times J}$ the weight matrix. Recall that the WM-based estimator $\hat{\beta}^m$ is obtained by estimating $\theta^f_i$ individually from observed binding sites $D_n^{+}$ and then transforming the estimates via equation (2). Denote the estimated weight matrix by $\hat{\Theta}^m$. Since the data are generated by the FEM, $\hat{\Theta}^m \xrightarrow{P} \Theta^f$ and $\sqrt{n}\, d\Theta^m = \sqrt{n}(\hat{\Theta}^m - \Theta^f)$ follows a multivariate normal distribution asymptotically, similar to (4), but $d\theta^m_i$ and $d\theta^m_k$ may be correlated ($1 \leq i \neq k \leq w$). Given that the coefficients $\tilde{\beta}$ in the FEM are defined w.r.t. a reference sequence, we transform $\hat{\Theta}^m$ to

\[
\hat{\beta}^m_{ij} = \log(\hat{\theta}^m_{ij}/\theta_{0j}) - \log(\hat{\theta}^m_{i s_i}/\theta_{0 s_i}), \quad \text{for all } i, j, \tag{30}
\]

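where $s_i$ is the $i$th nucleotide of the reference sequence $S_{\mathrm{ref}}$. Let $\Delta\beta^m = \hat{\beta}^m - \tilde{\beta}$ be the deviation of $\hat{\beta}^m = (\hat{\beta}^m_{ij})_{w \times J}$. To obtain its asymptotic distribution, we determine the cell probability $\theta^f_{ij}$ (29) from equation (14), that is,

\[
\theta^f_{ij} \propto \theta_{0j}\, e^{\tilde{\beta}_{ij}} \sum_{x \in \mathcal{I}^{w-1}} \frac{e^{\tilde{\beta}_0 + x\tilde{\beta}_{[-i]}}}{1 + e^{\tilde{\beta}_{ij}} e^{\tilde{\beta}_0 + x\tilde{\beta}_{[-i]}}}\, \theta_0(x) = \theta_{0j}\, e^{\tilde{\beta}_{ij}} E_{\theta_0}\left\{ \frac{e^{U_i}}{1 + e^{\tilde{\beta}_{ij}} e^{U_i}} \right\},
\]

where $U_i = \tilde{\beta}_0 + X\tilde{\beta}_{[-i]}$ for $X \in \mathcal{I}^{w-1}$. In particular, $\theta^f_{i s_i} \propto \theta_{0 s_i} E_{\theta_0}\left\{ e^{U_i}/(1 + e^{U_i}) \right\}$ since $\tilde{\beta}_{i s_i} = 0$. For $i = 1, \cdots, w$ and $j = 1, \cdots, J$, we define

\[
\delta_{ij} = \log E_{\theta_0}\left\{ \frac{e^{U_i}}{1 + e^{\tilde{\beta}_{ij}} e^{U_i}} \right\} - \log E_{\theta_0}\left\{ \frac{e^{U_i}}{1 + e^{U_i}} \right\} \tag{31}
\]

and rewrite $\theta^f_{ij} = \theta_{0j}\, e^{\tilde{\beta}_{ij} + \delta_{ij}}/Z_i$, where $Z_i = \sum_{j} \theta_{0j}\, e^{\tilde{\beta}_{ij} + \delta_{ij}}$ is the normalizing constant.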

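Because $\hat{\theta}^m_{ij} \xrightarrow{P} \theta^f_{ij}$ and $\tilde{\beta}_{i s_i} = \delta_{i s_i} = 0$, from equation (30) we have

\[
\hat{\beta}^m_{ij} \xrightarrow{P} \log(\theta^f_{ij}/\theta_{0j}) - \log(\theta^f_{i s_i}/\theta_{0 s_i}) = \tilde{\beta}_{ij} + \delta_{ij} \tag{32}
\]

for all $i$ and $j$ as $n \to \infty$. Thus, $\delta = (\delta_{ij})_{w \times J}$ is the asymptotic bias of $\hat{\beta}^m$. From the asymptotic normality of $\sqrt{n}\, d\Theta^m$, we see that $\sqrt{n}(\Delta\beta^m - \delta)$ follows a multivariate normal distribution with mean 0 and finite covariance matrix as $n \to \infty$. Note that this multivariate normal distribution is defined on a $(J - 1)w$-dimensional space, since $\Delta\beta^m_{i s_i} = \delta_{i s_i} \equiv 0$ for $i = 1, \cdots, w$.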

Consider the WM-based decision function $h^m(x) = \beta_0 + x\beta^m = h(x) + x\Delta\beta^m$. The above derivation shows that $\sqrt{n}(x\Delta\beta^m - x\delta) \xrightarrow{L} N(0, V^f(\beta^m, x))$, where the variance is determined by the covariance matrix of $\Delta\beta^m$. Following Theorem 2, the expectation of the total incremental error rate of the WM procedure satisfies

\[ E[\Delta R^f(\beta^m)] \to \sum_{B^f(\beta^m)} \tanh|h(x)/2|\, \theta_0(x), \quad \text{as } n \to \infty, \tag{33} \]

where $B^f(\beta^m) = \{x : \mathrm{sign}\{h(x)\}\{h(x) + x\delta\} < 0\}$. Similarly, the incremental percentage over the ideal error rate, $E[\Delta R^f(\beta^m)]/(R^f)^*$, is used to compare the predictions of the WM and the FE procedures, given that $E[\Delta R^f(\beta^f)] \to 0$ (28). Under the FEM, the ideal error rate is

\[ (R^f)^* = E_{\theta_0}\{p_0(X) 1(h(X) \ge 0) + p_1(X) 1(h(X) < 0)\}. \tag{34} \]

Recall that $p_y(X) = P(Y = y \mid X)$, i.e.,

\[ p_y(X) = \frac{e^{y h(X)}}{e^{h(X)} + 1}, \quad \text{for } y = 0, 1. \]
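For a short motif, both the ideal error rate (34) and the limit in (33) can be evaluated by direct enumeration. The sketch below is one way to do this in Python, assuming an i.i.d. background and integer-coded nucleotides; the function name `fem_error_rates` and its argument layout are hypothetical.

```python
import itertools
import math


def fem_error_rates(beta0, beta, delta, theta0):
    """Return ((R^f)^*, limit of E[Delta R^f(beta^m)]) by enumerating all w-mers."""
    w, J = len(beta), len(theta0)
    ideal, incremental = 0.0, 0.0
    for x in itertools.product(range(J), repeat=w):
        h = beta0 + sum(beta[i][x[i]] for i in range(w))
        x_delta = sum(delta[i][x[i]] for i in range(w))
        p_bg = math.prod(theta0[j] for j in x)
        p1 = 1.0 / (1.0 + math.exp(-h))  # p_1(x) = e^h / (e^h + 1)
        # Equation (34): the ideal rule predicts Y = 1 when h(x) >= 0.
        ideal += p_bg * ((1.0 - p1) if h >= 0 else p1)
        # Equation (33): x is in B^f(beta^m) when the bias flips the sign of h(x).
        if h * (h + x_delta) < 0:
            incremental += math.tanh(abs(h) / 2.0) * p_bg
    return ideal, incremental
```

The weight $\tanh|h(x)/2|$ equals $|p_1(x) - p_0(x)|$, which is the extra error incurred at $x$ when the decision is reversed.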


3.4. FEM with Markov background. Next, we generalize the FEM to Markov background and assume that any sequence $X \in I^w$ is generated marginally by a Markov chain. Consistent with Section 3.2, we denote by $\psi_0 = (\psi_0(x, y))_{J \times J}$ the transition probability matrix of the Markov chain.

It is trivial to see that, with the Markov background model, the ideal decision is still $h(x) = \beta_0 + x\beta$. If $V^f_1(\beta^f, x)$ denotes the asymptotic variance of $\sqrt{n}\,x\,d\beta^f$ under Markov background, then with $V^f_1$ in place of $V^f$ equation (28) remains valid for the FE-based prediction. On the other hand, if we proceed with the WM procedure, the expected incremental error rate satisfies

\[ E[\Delta R^f_1(\beta^m)] \to \sum_{B^f_1(\beta^m)} \tanh|h(x)/2|\, \psi_0(x), \quad \text{as } n \to \infty, \tag{35} \]

where $B^f_1(\beta^m) = \{x : \mathrm{sign}\{h(x)\}\{h(x) + \delta(x)\} < 0\}$. Here $\delta(x)$ denotes the asymptotic bias of the WM-based decision function for $x$. A detailed derivation of equation (35) and the bias $\delta(x)$ is provided in Appendix B. In analogy to the FEM with i.i.d. background, $E[\Delta R^f_1(\beta^m)]/(R^f_1)^*$ measures the increased error rate of the WM procedure relative to the FE procedure, where

\[ (R^f_1)^* = E_{\psi_0}\{p_0(X) 1(h(X) \ge 0) + p_1(X) 1(h(X) < 0)\}. \]

3.5. Mis-specification of the constant term. In all the above derivations, we have assumed that the constant term β0(β0) is fixed to its true value. If this is not the case, then the deviation ∆β0 = β0 − β0(β0) will be an extra bias term for an estimated decision in which the constant term is fixed to β0. More specifically, the set $B^f(\beta^m)$ in equation (33) will be replaced by $\{x : \mathrm{sign}\{h(x)\}\{h(x) + x\delta + \Delta\beta_0\} < 0\}$, and similarly for $B^m_1(\beta^f)$ in equation (26) and $B^f_1(\beta^m)$ in equation (35).

4. Computation. To apply the theoretical results, we need to solve the minimization (19) and the summation (20) involved in Theorems 1 and 2, respectively. If the width of a motif $w \le 12$, brute-force enumeration of all $w$-mers is computationally feasible, which provides exact solutions for both the minimization and the summation problems.

For a motif of width $w > 12$, we minimize (19) to find $\alpha(h)$ by a two-step approach. We generate $N = 5 \times 10^6$ $w$-mers from the background model $\theta_0$ and identify the minimum of (19) among them. Then we refine the obtained minimum by simulated annealing for 5,000 iterations with temperature decreasing linearly from one to zero. At each iteration, we randomly choose one nucleotide $X_i$ from the $w$ positions and propose to mutate $X_i$ to one of the other three nucleotide bases with equal probability. The proposal is accepted according to a Metropolis-Hastings ratio at the current temperature.
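The refinement step can be sketched as follows. This is a minimal Python illustration rather than the implementation used for the results: the objective (19) is not reproduced here, so `objective` stands for any user-supplied function of a $w$-mer, and the acceptance probability $\min\{1, e^{-\Delta/T}\}$ is an assumed form of the Metropolis-Hastings ratio for this symmetric proposal.

```python
import math
import random


def anneal_minimize(objective, x_init, n_iter=5000, n_letters=4, rng=None):
    """Refine x_init by single-site mutations with a linearly decreasing temperature."""
    rng = rng or random.Random()
    x = list(x_init)
    fx = objective(x)
    best, fbest = list(x), fx
    for t in range(n_iter):
        temp = 1.0 - t / n_iter  # decreases from 1 toward 0; last value is 1 / n_iter
        i = rng.randrange(len(x))
        old = x[i]
        # Mutate position i to one of the other letters with equal probability.
        x[i] = rng.choice([b for b in range(n_letters) if b != old])
        fnew = objective(x)
        if fnew <= fx or rng.random() < math.exp(-(fnew - fx) / temp):
            fx = fnew
            if fx < fbest:
                best, fbest = list(x), fx
        else:
            x[i] = old  # reject the proposal
    return best, fbest
```

In the two-step approach described above, `x_init` would be the best of the $w$-mers sampled from the background model.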

Since the set $B(h)$, as in equations (26), (33) and (35), is usually small, it will be very inefficient to approximate the summation by generating $w$-mers from background distributions. Thus, we develop an importance sampling approach to approximate the summation (20) when $w > 12$. Here, we use the calculation of $E[\Delta R^f(\beta^m)]$ (33) to illustrate this approach. Note that one can bound $x\delta$ in the definition of $B^f(\beta^m)$ so that $x\delta \in [M_1, M_2]$, where $M_1 = \sum_{i=1}^{w} \min_j \delta_{ij}$ and $M_2 = \sum_{i=1}^{w} \max_j \delta_{ij}$. These bounds imply that if $x \in B^f(\beta^m)$ then

\[ h(x) \in (-M_2, 0) \cup (0, -M_1) := H. \]

We design a sequential proposal $g(X)$ that is more likely to generate $X$ with $h(X) \in H$. Suppose that we have generated $X_1, \cdots, X_{k-1}$ ($1 \le k \le w$) from this proposal. Let $h_{k-1} = \beta_0 + \sum_{i=1}^{k-1} \beta_{iX_i}$, in particular $h_0 = \beta_0$. We determine $B^{(L)}_{k+1} = \sum_{i>k} \min_j \beta_{ij}$ and $B^{(U)}_{k+1} = \sum_{i>k} \max_j \beta_{ij}$, the bounds for $\sum_{i>k} \beta_{iX_i}$. If $X_k = j$ then the range for $h(X)$ is

\[ h_{k-1} + \beta_{kj} + [B^{(L)}_{k+1}, B^{(U)}_{k+1}] := [L_{kj}, U_{kj}]. \]

The larger the overlap between this interval and $H$, the more likely that $X$ will belong to the desired set $B^f(\beta^m)$. Thus, we propose $X_k$ with probability

\[ g_k(X_k = j \mid X_1, \cdots, X_{k-1}) \propto \left| [L_{kj} - \epsilon, U_{kj} + \epsilon] \cap H \right|, \tag{36} \]

where $|\cdot|$ returns the length of an interval and $\epsilon$ is a small positive number to allow the generation of $X_k = j$ when $L_{kj} = U_{kj} \in H$. Proposing $X_k$ sequentially by (36) for $k = 1, \cdots, w$ generates an $X$ from $g(X)$. With $N$ proposed samples $\{X^{(t)}\}_{t=1}^{N}$ we estimate the summation (33) by

\[ \frac{1}{N} \sum_{t=1}^{N} \frac{\tanh\left| h(X^{(t)})/2 \right| \theta_0(X^{(t)})\, 1\{X^{(t)} \in B^f(\beta^m)\}}{g(X^{(t)})}. \]

In this work, we propose $N = 5 \times 10^6$ samples for this importance sampling estimation. We verified that the estimations were very close to the exact summations. With different bounds for $h_1(x)$ and $h(x)$, this approach is applied to other similar summations in (26) and (35).
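The sequential proposal (36) and the resulting estimator can be sketched in Python as below. This is an illustrative sketch under stated assumptions, not the code used for the reported results: nucleotides are integer-coded, the background is i.i.d. with frequencies `theta0`, and when no choice of $X_k$ can reach $H$ the proposal falls back to the background distribution (an assumption added here so that $g$ stays a proper distribution; such samples contribute zero to the sum). All function names are hypothetical.

```python
import math
import random


def make_proposal(beta0, beta, delta, theta0, eps=1e-6):
    """Sequential proposal g of equation (36); returns (sample, density)."""
    w, J = len(beta), len(theta0)
    h_lo = -sum(max(row) for row in delta)  # -M2, lower end of H
    h_hi = -sum(min(row) for row in delta)  # -M1, upper end of H
    # tail_lo[k], tail_hi[k]: bounds on the contribution of positions k, ..., w-1.
    tail_lo, tail_hi = [0.0] * (w + 1), [0.0] * (w + 1)
    for k in range(w - 1, -1, -1):
        tail_lo[k] = tail_lo[k + 1] + min(beta[k])
        tail_hi[k] = tail_hi[k + 1] + max(beta[k])

    def conditional(k, h_prev):
        weights = []
        for j in range(J):
            lo = h_prev + beta[k][j] + tail_lo[k + 1] - eps
            hi = h_prev + beta[k][j] + tail_hi[k + 1] + eps
            # Length of the overlap with H (the excluded point 0 has length zero).
            weights.append(max(0.0, min(hi, h_hi) - max(lo, h_lo)))
        total = sum(weights)
        if total == 0.0:
            return list(theta0)  # H unreachable: background fallback (assumption)
        return [v / total for v in weights]

    def sample(rng):
        x, g, h = [], 1.0, beta0
        for k in range(w):
            probs = conditional(k, h)
            j = rng.choices(range(J), weights=probs)[0]
            x.append(j)
            g *= probs[j]
            h += beta[k][j]
        return x, g

    def density(x):
        g, h = 1.0, beta0
        for k, j in enumerate(x):
            g *= conditional(k, h)[j]
            h += beta[k][j]
        return g

    return sample, density


def estimate_sum_33(beta0, beta, delta, theta0, n_samples, rng):
    """Importance sampling estimate of the summation in equation (33)."""
    sample, _ = make_proposal(beta0, beta, delta, theta0)
    total = 0.0
    for _ in range(n_samples):
        x, g = sample(rng)
        h = beta0 + sum(beta[i][j] for i, j in enumerate(x))
        x_delta = sum(delta[i][j] for i, j in enumerate(x))
        if h * (h + x_delta) < 0:  # x lies in B^f(beta^m)
            total += math.tanh(abs(h) / 2.0) * math.prod(theta0[j] for j in x) / g
    return total / n_samples
```

A call such as `estimate_sum_33(beta0, beta, delta, theta0, 5_000_000, random.Random(0))` would mirror the sample size quoted above; for small $w$ the estimate can be checked against exact enumeration.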

5. Numerical study. A numerical study was performed under the WMM to confirm and quantify the lower predictive efficiency of the FE-based estimator $\beta^f$ compared to the WM-based estimator $\beta^m$ discussed in Section 3.1. We randomly selected 200 TFs from the database TRANSFAC (Matys et al. 2003). For each TF, experimentally verified binding sites were used to construct a weight matrix with a small amount of pseudo counts. Then we randomly sampled 5,000 human upstream sequences, each of length 10 kilo bases, and calculated their nucleotide frequency $\theta_0 = (0.263, 0.234, 0.237, 0.266)$. The 200 weight matrices display large variability. The width $w$ ranges from 6 to 21 and the information content, $\sum_{i=1}^{w} \{2 + E_{\theta_i}(\log_2 \theta_{iX_i})\}$, ranges from 5.1 to 17.5 bits (Figure 1). These statistics show that our selection has covered the typical width and strength of DNA motifs.
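The information content formula above can be evaluated directly from a weight matrix. The short Python sketch below assumes four nucleotides (hence the constant 2 bits per position) and uses the convention $0 \log_2 0 = 0$; the function name is hypothetical.

```python
import math


def information_content(theta):
    """Sum over positions of 2 + sum_j theta_ij * log2(theta_ij), in bits."""
    total = 0.0
    for row in theta:
        total += 2.0 + sum(p * math.log2(p) for p in row if p > 0)
    return total
```

A uniform position contributes 0 bits and a position that always shows the same nucleotide contributes 2 bits, so a motif of width $w$ has information content between 0 and $2w$ bits.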

A constructed weight matrix was regarded as the parameter $\Theta$ and the nucleotide frequency $\theta_0$ was used for the i.i.d. background in the WMM. Since the prior odds ratio ($q_1/q_0$) of a binding site over a background site is usually small, we chose three typical values for the inverse of the prior odds, $\lambda = q_0/q_1 = 200, 500, 1000$, for numerical calculations. We evaluated the AREs of the FE-based prediction w.r.t. the WM-based prediction, defined by $\mathrm{ARE}^m(\beta^f, \beta^m) = \alpha^m(\beta^f)/\alpha^m(\beta^m)$ in Section 3.1, for the 200 WMs. As discussed in Section 4, our evaluation of AREs was exact for WMs of $w \le 12$ and was carried out with simulated annealing for $w > 12$. In addition, Monte Carlo average was utilized, before simulated annealing, to approximate $\mathrm{Cov}^m(\sqrt{n}\,d\beta^f)$ (22) by simulating $5 \times 10^6$ $w$-mers from the i.i.d. background.

Fig 1. Histograms of the width and the information content of 200 WMs. (Two panels: count versus width, and count versus information content in bits.)

Table 2
Summary of $\mathrm{ARE}^m(\beta^f, \beta^m)$

λ       Min.    Q1      Median  Q3      Max.
200     0.134   0.391   0.490   0.595   0.849
500     0.183   0.475   0.555   0.638   0.849
1000    0.217   0.508   0.560   0.675   0.918

Q1, Q3: the first and the third quartiles.

The asymptotic relative efficiencies $\mathrm{ARE}^m(\beta^f, \beta^m)$ on the 200 TFs are summarized in Table 2 for the three inverse prior odds. It is seen that for all the WMs the FE-based prediction shows lower efficiency than the WM-based prediction, and that the median AREs of $\beta^f$ to $\beta^m$ are roughly between 50% and 60% (0.490 to 0.560) and the third quartiles (Q3) roughly between 60% and 70% (0.595 to 0.675). Thus, for at least 75% of the TFs, the FE procedure is less than 70% as efficient as the WM procedure in terms of prediction. This confirms the loss of efficiency of the FE-based prediction under the WMM, although both estimators are consistent. We note that the increase of ARE with higher $\lambda$ (smaller $q_1$) is consistent with the lower bound defined in Proposition 3.

6. Applications. In this section, we apply the WM and the FE approaches to ChIP-seq data and protein binding microarray (PBM) data. We perform cross validation (CV) with training data of different size, ranging from 20 to 500 binding sites, for two purposes. First, with the large scale of both types of data, we can compare empirical error rates in cross validation against theoretical error rates. This may allow us to verify some of the model assumptions and propose further improvement on the models. Second, we are also interested in examining the practical performance of the two computational methods when the number of observed binding sites varies in a wide range, which will provide useful guidance for future applications.

6.1. ChIP-seq data. In the last two years, the ChIP-seq technique (Johnson et al., 2007; Mikkelsen et al., 2007; Robertson et al., 2007) has become a powerful high-throughput method to detect TFBSs on a whole-genome scale. A binding peak in ChIP-seq data can usually narrow down the location of a TFBS to a neighborhood of 50 to 200 bps (Johnson et al., 2007). ChIP-seq data that contain thousands of binding sites for a number of TFs have been generated in a study on mouse embryonic stem cells (Chen et al., 2008). We chose five TFs, Esrrb, Oct4, STAT3, Sox2 and cMyc, in this study to compare the WM and the FE methods. The five TFs all have well-defined weight matrices in the literature and each has more than 2,000 detected binding peaks in ChIP-seq, and their data quality was confirmed by motif enrichment analysis in Chen et al. (2008). To identify the exact binding site of a ChIP-seq binding peak, we searched the 200-bp neighborhood of the peak, 100 bps on each side, to find the best match to the known weight matrix of the TF. Given the very small search space, the uncertainty in the exact location of the binding site should be minimal. If the motif width of a TF is $w$, background $w$-mers were extracted from genomic control regions that match the locations of the binding sites relative to nearby genes. The ratio of the number of background sites over the number of binding sites was set to 200 for every TF, that is, the inverse prior odds ratio $\lambda = q_0/q_1 = 200$. A transition matrix was estimated from the extracted background sites for each TF, since the log Bayes factor of a Markov background model over an i.i.d. model was $> 10^5$.
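The search for the best match within a peak neighborhood can be sketched as follows. The text does not specify the scoring function or whether both strands were scanned, so this Python example makes two assumptions: matches are scored by the log-odds of the weight matrix against an i.i.d. background, and both the forward strand and its reverse complement are considered. The weight matrix is assumed to contain no zero entries (e.g., after adding pseudo counts), and all names are hypothetical.

```python
import math

COMPLEMENT = str.maketrans("ACGT", "TGCA")
INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}


def log_odds(site, wm, theta0):
    """Log-odds score of a w-mer under the weight matrix versus the background."""
    return sum(math.log(wm[i][INDEX[b]] / theta0[INDEX[b]])
               for i, b in enumerate(site))


def best_match(region, wm, theta0):
    """Return (score, start, strand, site) of the highest-scoring w-mer in region."""
    w = len(wm)
    best = None
    for start in range(len(region) - w + 1):
        fwd = region[start:start + w]
        rev = fwd.translate(COMPLEMENT)[::-1]  # reverse complement of the window
        for strand, site in (("+", fwd), ("-", rev)):
            cand = (log_odds(site, wm, theta0), start, strand, site)
            if best is None or cand[0] > best[0]:
                best = cand
    return best
```

For a 200-bp neighborhood and a motif of width $w$, this evaluates $2(200 - w + 1)$ candidate sites per peak.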

Based on the way we composed the data sets, the WMM with Markov background (Section 3.2) seems a more plausible data generation model. Clearly, a data set was a mixture of detected binding sites and random background sites, and the background distribution was close to a Markov chain. If there is no within-motif dependence, binding sites can be regarded as being generated from a WM model, and consequently, the WM-based prediction is expected to have a smaller error rate compared to the FE-based prediction. However, if there exists within-motif dependence in binding sites, the FEM, which is able to capture such dependence, may outperform the WM approach regardless of the mixture nature of the data sets. We computed theoretical error rates of the two approaches under the WMM with Markov background. For each TF, we estimated a WM from all the binding sites and a transition matrix from the background sites. Regarding them as the model parameters, we calculated the asymptotic error rate of the WM-based prediction, which is the ideal error rate (27), and the incremental rate of the FE-based prediction (26). Note that the bias due to mis-specification of the constant term in the FE approach needs to be included for the calculation of equation (26). These theoretical error rates are reported in Table 3 (the column of $n_+ = \infty$).

Table 3
Predictive error rates (in units of $10^{-3}$) for ChIP-seq data

TF      Method             n+ = 20   50      100     200     500     ∞
Esrrb   WM                 2.66      2.49    2.43    2.40    2.36    2.30
        FE                 4.30      2.77    2.54    2.45    2.37    2.45
        (FE-WM)/WM (%)     61.7      11.2    4.5     2.1     0.4     6.5
Oct4    WM                 3.68      3.54    3.50    3.47    3.44    2.98
        FE                 4.81      3.71    3.53    3.45    3.41    3.06
        (FE-WM)/WM (%)     30.7      4.8     0.9     −0.6    −0.9    2.7
STAT3   WM                 3.03      2.84    2.78    2.72    2.69    2.57
        FE                 4.69      3.09    2.84    2.75    2.70    2.74
        (FE-WM)/WM (%)     54.8      8.8     2.2     1.1     0.3     6.6
Sox2    WM                 2.95      2.75    2.68    2.65    2.63    2.53
        FE                 3.44      2.89    2.73    2.68    2.66    2.59
        (FE-WM)/WM (%)     16.6      5.1     1.9     1.1     1.1     2.4
cMyc    WM                 2.67      2.49    2.42    2.38    2.34    2.07
        FE                 3.30      2.51    2.34    2.26    2.23    2.24
        (FE-WM)/WM (%)     23.6      0.8     −3.3    −5.0    −4.7    8.2

Note: Reported are average error rates over 100 CVs.

To compare with theoretical results, we performed cross validation to compute empirical error rates of the WM and the FE procedures on each data set. We randomly sampled (without replacement) $n_+$ binding sites and $\lambda \cdot n_+$ background sites from a full data set to form a training set. Both approaches were applied to the training set to estimate their respective decision functions. For WM-based prediction, a WM and a transition matrix were estimated from the training data set to construct a decision function (24) with $\beta_0 = -\log(\lambda)$. For FE-based prediction, we applied logistic regression to the training set to obtain $h^f(x) = \beta_0 + x\beta^f$. Then we predicted the class labels of the remaining unused sequences (test set) by each of the two decision functions and calculated empirical error rates (CV error rates). This procedure was repeated 100 times independently for each value of $n_+$ to obtain the average CV error rate. To examine performance with a varying sample size (the number of sequences in a training set), we chose $n_+$ from 20 to 500.
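The WM side of one cross-validation round can be sketched in Python as follows. This is a hedged illustration, not the procedure's actual code: the pseudo count of 0.5, the use of first-position frequencies as the initial distribution of the Markov chain, and all function names are assumptions. The decision function follows the form $-\log\lambda + \sum_i [\log\theta_{ix_i} - \log\psi_0(x_{i-1}, x_i)]$ used in Appendix B; the FE side would fit a logistic regression with any standard routine and is not shown.

```python
import math

IDX = {b: k for k, b in enumerate("ACGT")}


def estimate_wm(sites, pseudo=0.5):
    """Weight matrix from aligned binding sites, with pseudo counts."""
    counts = [[pseudo] * 4 for _ in range(len(sites[0]))]
    for s in sites:
        for i, b in enumerate(s):
            counts[i][IDX[b]] += 1.0
    return [[c / sum(row) for c in row] for row in counts]


def estimate_markov(background, pseudo=0.5):
    """Initial distribution and transition matrix psi0 from background w-mers."""
    init = [pseudo] * 4
    trans = [[pseudo] * 4 for _ in range(4)]
    for s in background:
        init[IDX[s[0]]] += 1.0
        for a, b in zip(s, s[1:]):
            trans[IDX[a]][IDX[b]] += 1.0
    n_init = sum(init)
    return [c / n_init for c in init], [[c / sum(row) for c in row] for row in trans]


def wm_decision(x, wm, init, trans, lam):
    """WM-based decision function with Markov background and beta0 = -log(lambda)."""
    h = -math.log(lam)
    for i, b in enumerate(x):
        bg = init[IDX[b]] if i == 0 else trans[IDX[x[i - 1]]][IDX[b]]
        h += math.log(wm[i][IDX[b]]) - math.log(bg)
    return h


def error_rate(test_set, decide):
    """test_set: list of (sequence, label); predict label 1 when decide(seq) >= 0."""
    wrong = sum((decide(x) >= 0) != bool(y) for x, y in test_set)
    return wrong / len(test_set)
```

Repeating the split, fit and `error_rate` steps 100 times for each $n_+$ and averaging gives the kind of CV error rate reported in Table 3.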

The average CV error rates are reported in Table 3. The theoretical results give a reasonable approximation to the CV error rates for both approaches when the training sample size $n_+ \ge 200$. The asymptotic error rates of the WM approach are uniformly lower than its CV error rates for all the TFs, while the FE approach achieves a smaller CV error rate with $n_+ = 500$ than its asymptotic rate for three TFs. Consequently, the incremental percentage of the FE-based prediction for $n_+ = 500$ is less than the expected level calculated from the theory. This comparison implies that the WMM may not match the exact underlying data generation process, although it is more plausible than the FEM given the mixture composition of the data sets. As we discussed, potential dependence within a motif may cause possible violation of the WMM. To verify our hypothesis, we conducted the $\chi^2$-test for every pair of motif positions ($X_i$ and $X_k$, $1 \le i < k \le w$) given the binding sites in each data set. At the significance level of 0.005, we identified 25, 19, 17, 8, and 12 pairwise correlations for Esrrb, Oct4, STAT3, Sox2, and cMyc binding sites, respectively, which gives a false discovery rate of < 2% for all the TFs. By capturing such correlations the FEM is able to achieve comparable or even slightly better prediction than the WMM with a moderate-size training sample ($n_+ \ge 100$, Table 3). Finally, it is important to note that even under the exact model assumptions of the WMM, the FE-based prediction only results in a marginal increment in error rate (< 10%) compared to the WM approach asymptotically (Table 3, $n_+ = \infty$). Together with the superior or comparable CV performance when the training size is reasonably large, this result suggests the use of the FE approach when we have a sufficient number of observed binding sites.
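The pairwise test of within-motif dependence can be illustrated with a Pearson $\chi^2$ statistic on the contingency table of two motif positions. The Python sketch below computes only the statistic and its degrees of freedom; counting degrees of freedom from the nucleotides actually observed at each position is an assumption of this example, and the p-value would be obtained separately from a $\chi^2$ survival function. The function name is hypothetical.

```python
from collections import Counter


def pairwise_chi_square(sites, i, k):
    """Pearson chi-square statistic and degrees of freedom for positions i and k."""
    n = len(sites)
    joint = Counter((s[i], s[k]) for s in sites)
    row = Counter(s[i] for s in sites)
    col = Counter(s[k] for s in sites)
    stat = 0.0
    for a in row:
        for b in col:
            expected = row[a] * col[b] / n  # expected count under independence
            stat += (joint[(a, b)] - expected) ** 2 / expected
    df = (len(row) - 1) * (len(col) - 1)
    return stat, df
```

Applying this to all $w(w-1)/2$ position pairs of a set of aligned binding sites yields the collection of tests referred to above.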

6.2. PBM data. Protein binding microarrays (Mukherjee et al., 2004) provide a high throughput means to interrogate protein binding specificity to DNA sequences. Quantitative measurement of the binding specificity of a protein to every short nucleotide sequence designed on a DNA microarray can be obtained simultaneously. The PBM data in Berger et al. (2008) quantified DNA binding of homeodomain proteins via the calculation of an enrichment score, with an expected false discovery rate (FDR), for each double-stranded nucleotide sequence of length eight ($w = 8$). The data set for each protein contains 32,896 8-mers, each with an enrichment score and an FDR. We identified as the consensus binding pattern for a protein the 8-mer with the highest enrichment score, and then labeled as binding sites those 8-mers whose FDR < 0.005 and which differ by no more than three nucleotides from the consensus after considering both the forward and the reverse complement strands. The remaining 8-mers were labeled as background sites and we randomly determined their strands (orientations) to avoid potential artifacts. In this study we included five proteins, Hoxa11, Irx3, Lhx3, Nkx2.5, and Pou2f2, each from a different family, and called 134, 190, 267, 145, and 213 binding sites, respectively.
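The labeling rule can be sketched in Python as below. The record layout `(kmer, enrichment_score, fdr)` and all function names are assumptions for this example; the input format of the PBM data is not specified here. The last helper counts 8-mers up to reverse complementation, which reproduces the figure of 32,896 double-stranded 8-mers: $(4^8 + 4^4)/2$, since $4^4$ of the 8-mers equal their own reverse complement.

```python
import itertools

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(s):
    return s.translate(COMPLEMENT)[::-1]


def hamming(a, b):
    return sum(c1 != c2 for c1, c2 in zip(a, b))


def strand_aware_mismatches(kmer, consensus):
    """Smaller of the mismatch counts on the forward and reverse complement strands."""
    return min(hamming(kmer, consensus),
               hamming(reverse_complement(kmer), consensus))


def label_binding_sites(records, fdr_cutoff=0.005, max_mismatch=3):
    """records: list of (kmer, enrichment_score, fdr); returns (consensus, labels)."""
    consensus = max(records, key=lambda r: r[1])[0]
    labels = {
        kmer: fdr < fdr_cutoff
        and strand_aware_mismatches(kmer, consensus) <= max_mismatch
        for kmer, _, fdr in records
    }
    return consensus, labels


def count_double_stranded_kmers(k):
    """Number of distinct k-mers when a k-mer and its reverse complement are merged."""
    kmers = ("".join(p) for p in itertools.product("ACGT", repeat=k))
    return len({min(s, reverse_complement(s)) for s in kmers})
```

Sequences labeled `False` would form the background set, with their orientation then chosen at random as described above.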

The FEM, derived from the biophysics of protein-DNA binding, is expected to match the design of PBM data better than the WMM. Thus, theoretical analysis was conducted under the FEM for the five PBM data sets. We applied logistic regression to estimate $\beta$ and $\beta_0$ (9) with all the labeled 8-mers in a data set, where the 8-mer ‘AAAAAAAA’ was regarded as the reference sequence, i.e., $\beta_{i1} \equiv 0$. We calculated the ideal error rate $(R^f)^*$ (34) of the FE-based prediction, with an i.i.d. uniform background (by design the background distribution is uniform). For the WM approach, we chose $\beta_0$ as the log-ratio of the number of binding sites over that of background sites, and calculated its asymptotic error rate by equation (33), in which the bias in the constant term ($\Delta\beta_0$) was included. The theoretical error rates are reported in Table 4 ($n_+ = \infty$), where we find that the error rate of the WM approach is substantially higher, by 14% to 56%, than that of the FE approach.

The same CV procedure as in the previous section was performed on the PBM data sets to compare the empirical predictive error rates of the two approaches, with $n_+$ varying between 20 and 100 (Table 4). There is a clear decreasing trend in error rate for both approaches with the increase of the training sample size $n_+$, although for some data sets the difference between the CV error rate for $n_+ = 100$ and the asymptotic rate is still quite obvious. Such discrepancy is probably due to the following two reasons. First, the parameters $(\beta, \beta_0)$ used for the calculation of asymptotic rates were estimated from data sets which contain only between 134 and 267 binding sites. This resulted in a high variance in the estimated parameters: The median ratio of the standard error over the absolute value of an estimated coefficient was between 10% and 30% for the five data sets. Second, the training sample size, $n_+ = 100$, is still too small to achieve an error rate comparable to that for $n_+ \to \infty$. However, we have already seen substantially increased error rates of the WM-based predictions compared to the FE-based predictions for $n_+ = 100$, which is consistent with the theoretical results. This comparison confirms that unless the training sample size is really small, using the WM approach may degrade predictive performance dramatically if the data generation mechanism is close to the FEM.

Table 4
Predictive error rates (in units of $10^{-3}$) for PBM data

Protein  Method             n+ = 20   50      100     ∞
Hoxa11   FE                 3.57      2.62    2.43    1.63
         WM                 3.51      3.22    3.05    2.10
         (WM-FE)/FE (%)     −1.7      22.9    25.5    28.8
Irx3     FE                 5.91      4.71    4.54    3.26
         WM                 5.06      4.85    4.79    3.71
         (WM-FE)/FE (%)     −14.4     3.0     5.5     13.8
Lhx3     FE                 7.51      4.64    4.20    3.18
         WM                 6.90      6.54    6.47    4.97
         (WM-FE)/FE (%)     −8.1      40.9    54.0    56.3
Nkx2.5   FE                 3.73      2.31    2.12    1.80
         WM                 3.64      3.29    3.16    2.36
         (WM-FE)/FE (%)     −2.4      42.4    49.1    31.1
Pou2f2   FE                 6.25      4.91    4.56    3.31
         WM                 5.79      5.57    5.49    4.02
         (WM-FE)/FE (%)     −7.3      13.4    20.4    21.5

7. Discussion. Combining results on the ChIP-seq data and the PBM data, this study provides some general guidance for practical applications of the WM and the FE approaches, irrespective of underlying data generation. When the training sample size is small, the WM procedure seems to produce fewer errors than the FE procedure. But when we have observed enough binding sites, the advantage of the FE procedure is clearly seen. On one hand, it gives a comparable or slightly better prediction than the WM approach even if the WMM is more likely for the data (Table 3, $n_+ \ge 100$). On the other hand, when the data are generated in a way that matches the biophysical process of protein-DNA binding such as the PBM data, the reduction in error rate of the FE approach can be substantial compared to the WM approach (Table 4, $n_+ \ge 50$). The relative performance between the two approaches reflects a typical variance-bias tradeoff. Estimation under the WMM is simple and more robust, and typically has a smaller variance than estimation under the FEM. For a small sample size, predictive errors are mostly caused by variance in estimation and thus, WM-based predictions may outperform FE-based predictions. When the sample size increases, estimation variance decreases for both approaches and the potential bias in the WM approach becomes the main factor for predictive errors. Given that its primary principle comes from the biophysics of protein-DNA interactions, the FEM has become more attractive, and many computational methods based on it have been developed for predicting TF-DNA binding. In these methods a weight matrix is sometimes used as a first order approximation for computing free energy-based binding affinity. This work suggests that this approximation must be applied with caution. The results on the PBM data have demonstrated that the WM procedure may give a prediction with 50% or more additional errors relative to the FE-based decision for a reasonably large sample size (Table 4).

In recent years, a substantial amount of large-scale TF-DNA binding data have been generated for many important biological processes. As demonstrated by the applications to ChIP-seq data and PBM data, large-sample theory is able to provide valuable insights on statistical estimation and prediction for such large-scale data. The results in this article can be regarded as a first step towards a theoretical development on computational approaches for gene regulation analysis. Incorporation of within-motif dependence in the WMM and interaction effects in the FEM is a direct next step of this work, for which the model selection component needs to be considered in a theoretical analysis. Although desired, further generalizations to methods for de novo motif discovery, identification of cis-regulatory modules and predictive modeling of gene regulation will be more challenging future directions.

Appendices.

Appendix A: Proof of Proposition 3.

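Proof. Let $\theta_{k(-j)} = 1 - \theta_{kj}$ for $k = 0, i$. The second order partial derivative of the marginal log-likelihood $l(\Theta \mid X) = \log\{q_0\theta_0(X) + q_1\Theta(X)\}$ w.r.t. $\theta_{ij}$ is

\[ \frac{\partial^2 l(\Theta \mid X)}{\partial \theta_{ij}^2} = -\frac{q_1^2 \{\Theta_{[-i]}(X_{[-i]})\}^2}{\{q_0\theta_0(X) + q_1\Theta(X)\}^2}, \]

where $\Theta(X) = \Theta_{[-i]}(X_{[-i]}) \cdot \theta_{iX_i}$ for $X_i = j, (-j)$ and similarly for $\theta_0(X)$. Thus, the Fisher information on $\theta_{ij}$ given $X$ is

\[ \begin{aligned}
I(\theta_{ij} \mid X) &= -E\left\{ \frac{\partial^2 l(\Theta \mid X)}{\partial \theta_{ij}^2} \right\} = \sum_{x} \frac{q_1^2 \{\Theta_{[-i]}(x_{[-i]})\}^2}{q_0\theta_0(x) + q_1\Theta(x)} \\
&= q_1 \sum_{x} \frac{q_1\Theta(x)}{q_0\theta_0(x) + q_1\Theta(x)} \cdot \frac{1}{\theta_{ix_i}} \cdot \Theta_{[-i]}(x_{[-i]}) \\
&= q_1 \sum_{x \in \{j, (-j)\}} \frac{1}{\theta_{ix}} \cdot E_{\Theta_{[-i]}}\left\{ \left( \frac{q_0\theta_{0x}\theta_0(X_{[-i]})}{q_1\theta_{ix}\Theta_{[-i]}(X_{[-i]})} + 1 \right)^{-1} \right\}.
\end{aligned} \]

Because $E_{\Theta_{[-i]}}\{\theta_0(X_{[-i]})/\Theta_{[-i]}(X_{[-i]})\} = 1$, Jensen’s inequality implies that

\[ I(\theta_{ij} \mid X) \ge \sum_{x \in \{j, (-j)\}} \frac{q_1^2}{q_0\theta_{0x} + q_1\theta_{ix}} = \frac{q_1^2}{(q_0\theta_{0j} + q_1\theta_{ij})\{1 - (q_0\theta_{0j} + q_1\theta_{ij})\}}, \]

where the last equality uses $q_0 + q_1 = 1$, so that the two denominators in the sum add up to one.

The lower bound $B(q_1, \theta_{ij}, \theta_{0j})$ is obtained by dividing the R.H.S. of this inequality by the Fisher information on $\theta_{ij}$ given $X$ and $Y$ jointly,

\[ I(\theta_{ij} \mid X, Y) = -E\left\{ \frac{\partial^2 l(\Theta \mid X, Y)}{\partial \theta_{ij}^2} \right\} = \frac{q_1}{\theta_{ij}(1 - \theta_{ij})}, \]

where $l(\Theta \mid X, Y) = \log P(X, Y \mid \Theta)$ is the joint log-likelihood.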

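Appendix B: Derivation of $E[\Delta R^f_1(\beta^m)]$ (35). Given the estimated weight matrix $\Theta^m$ based on observed binding sites $D^+_n$, the constructed decision function of the WM approach

\[ \begin{aligned}
h^m_1(x) &= \log(q_1/q_0) + \sum_{i=1}^{w} \left[ \log\theta^m_{ix_i} - \log\psi_0(x_{i-1}, x_i) \right] \\
&\xrightarrow{P} \beta_0 + \sum_{i=1}^{w} \left\{ \log\frac{\theta^f_{ix_i}}{\theta^f_{is_i}} - \log\frac{\psi_0(x_{i-1}, x_i)}{\psi_0(s_{i-1}, s_i)} \right\}, \quad \text{as } n \to \infty,
\end{aligned} \tag{37} \]

where $\theta^f_{ij} = P(X_i = j \mid Y = 1)$ under the FEM with Markov background, $(s_1, \cdots, s_w)$ is the reference sequence, and

\[ \beta_0 = \log(q_1/q_0) + \sum_{i=1}^{w} \log\{\theta^f_{is_i}/\psi_0(s_{i-1}, s_i)\}. \]

Let $x_{[-i]}(s) = (x_1, \cdots, x_{i-1}, s, x_{i+1}, \cdots, x_w)$. Following a similar derivation in Section 3.3, we have $\theta^f_{ij} \propto \exp(\beta_{ij} + \eta_{ij})$, where

\[ \eta_{ij} = \log \sum_{x_{[-i]}} \frac{e^{u_i}}{1 + e^{\beta_{ij}} e^{u_i}}\, \psi_0(x_{[-i]}(j)) - \log \sum_{x_{[-i]}} \frac{e^{u_i}}{1 + e^{u_i}}\, \psi_0(x_{[-i]}(s_i)) \]

with $u_i = \beta_0 + x_{[-i]}\beta_{[-i]}$. Since $\beta_{is_i} = \eta_{is_i} = 0$, $\log(\theta^f_{ij}/\theta^f_{is_i}) = \beta_{ij} + \eta_{ij}$ for all $i$ and $j$. Thus, equation (37) becomes

\[ h^m_1(x) \xrightarrow{P} \beta_0 + \sum_{i=1}^{w} \left\{ \beta_{ix_i} + \eta_{ix_i} - \log\frac{\psi_0(x_{i-1}, x_i)}{\psi_0(s_{i-1}, s_i)} \right\} = h(x) + \delta(x), \]

where $\delta(x) = \sum_{i=1}^{w} \left[ \eta_{ix_i} - \log\{\psi_0(x_{i-1}, x_i)/\psi_0(s_{i-1}, s_i)\} \right]$. Let $\Delta h^m_1(x) = h^m_1(x) - h(x)$. The asymptotic normality of $\sqrt{n}\,d\Theta^m$ implies that $\sqrt{n}\{\Delta h^m_1(x) - \delta(x)\}$ follows a limiting normal distribution with mean 0 and a finite (possibly zero) variance for every $x$. Equation (35) then follows from Theorem 2.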

Acknowledgements. The author thanks Wing H. Wong, Jun S. Liu and Zhengqing Ouyang for helpful discussions. This work was supported by NSF grant DMS-0805491.


References.

[1] Bailey, T.L. and Elkan, C. (1994). Fitting a mixture model by expectation maximization to discover motifs in biopolymers. In Proceedings of 2nd International Conference on Intelligent Systems for Molecular Biology, 28-36. CA: AAAI Press.

[2] Barash, Y., Elidan, G., Friedman, N., and Kaplan, T. (2003). Modeling dependence in protein-DNA binding sites. RECOMB 2003, Berlin, Germany.

[3] Benos, P.V., Bulyk, M.L., and Stormo, G.D. (2002). Additivity in protein-DNA interactions: how good an approximation is it? Nucleic Acids Research, 30, 442-451.

[4] Berg, O.G. and von Hippel, P.H. (1987). Selection of DNA binding sites by regulatory proteins: Statistical-mechanical theory and application to operators and promoters. Journal of Molecular Biology, 193, 723-750.

[5] Berger, M.F., Badis, G., Gehrke, A.R., Talukder, S., Philippakis, A.A. et al. (2008). Variation in homeodomain DNA binding revealed by high-resolution analysis of sequence preferences. Cell, 133, 1266-1276.

[6] Bulyk, M.L., Johnson, P.L.F., and Church, G.M. (2002). Nucleotides of transcription factor binding sites exert interdependent effects on the binding affinities of transcription factors. Nucleic Acids Research, 30, 1255-1261.

[7] Bussemaker, H.J., Foat, B.C., and Ward, L.D. (2007). Predictive modeling of genome-wide mRNA expression: from modules to molecules. Annual Review of Biophysics and Biomolecular Structure, 36, 329-347.

[8] Chen, X., Xu, H., Yuan, P., Fang, F., Huss, M., Vega, V.B. et al. (2008). Integration of external signaling pathways with the core transcriptional network in embryonic stem cells. Cell, 133, 1106-1117.

[9] Djordjevic, M., Sengupta, A.M., and Shraiman, B.I. (2003). A biophysical approach to transcription factor binding site discovery. Genome Research, 13, 2381-2390.

[10] Efron, B. (1975). The efficiency of logistic regression compared to normal discriminant analysis. Journal of the American Statistical Association, 70, 892-898.

[11] Elnitski, L., Jin, V.X., Farnham, P.J., and Jones, S.J.M. (2006). Locating mammalian transcription factor binding sites: a survey of computational and experimental techniques. Genome Research, 16, 1455-1464.

[12] Ferguson, T.S. (1996). A Course in Large Sample Theory, p. 105-139. London: Chapman & Hall.

[13] Foat, B.C., Morozov, A. and Bussemaker, H.J. (2006). Statistical mechanical modeling of genome-wide transcription factor occupancy data by MatrixREDUCE. Bioinformatics, 22, e141-e149.

[14] Gerland, U., Moroz, J.D., and Hwa, T. (2002). Physical constraints and functional characteristics of transcription factor-DNA interaction. Proceedings of the National Academy of Sciences USA, 99, 12015-12020.

[15] Granek, J.A. and Clarke, N.D. (2005). Explicit equilibrium modeling of transcription-factor binding and gene regulation. Genome Biology, 6, R87.

[16] Hertz, G.Z. and Stormo, G.D. (1999). Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics, 15, 563-577.

[17] Ji, H.K. and Wong, W.H. (2006). Computational biology: toward deciphering gene regulatory information in mammalian genomes. Biometrics, 62, 645-663.

[18] Johnson, D.S., Mortazavi, A., Myers, R.M., and Wold, B. (2007). Genome-wide mapping of in vivo protein-DNA interactions. Science, 316, 1497-1502.

[19] Kel, A.E., Gossling, E., Reuter, I., Cheremushkin, E., Kel-Margoulis, O.V., and Wingender, E. (2003). MATCH: A tool for searching transcription factor binding sites in DNA sequences. Nucleic Acids Research, 31, 3576-3579.

[20] Kinney, J.B., Tkacik, G., and Callan, C.G. Jr. (2007). Precise physical models of protein-DNA interaction from high-throughput data. Proceedings of the National Academy of Sciences USA, 104, 501-506.

[21] Lawrence, C.E., Altschul, S.F., Boguski, M.S., Liu, J.S., Neuwald, A.F., and Wooton, J.C. (1993). Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science, 262, 208-214.

[22] Liu, X.S., Brutlag, D.L., and Liu, J.S. (2002). An algorithm for finding protein-DNA binding sites with applications to chromatin immunoprecipitation microarray experiments. Nature Biotechnology, 20, 835-839.

[23] Matys, V., Fricke, E., Geffers, R., Gößling, E., Haubrock, M., Hehl, R., Hornischer, K., Karas, D., Kel, A.E., Kel-Margoulis, O.V. et al. (2003). TRANSFAC: transcriptional regulation, from patterns to profiles. Nucleic Acids Research, 31, 374-378.

[24] Mikkelsen, T.S., Ku, M., Jaffe, D.B., Issac, B., Lieberman, E., Giannoukos, G. et al. (2007). Genome-wide maps of chromatin state in pluripotent and lineage-committed cells. Nature, 448, 553-559.

[25] Mukherjee, S., Berger, M.F., Jona, G., Wang, X.S., Muzzey, D., Snyder, M. et al. (2004). Rapid analysis of the DNA-binding specificities of transcription factors with DNA microarrays. Nature Genetics, 36, 1331-1339.

[26] Rahmann, S., Muller, T., and Vingron, M. (2003). On the power of profiles for transcription factor binding site detection. Statistical Applications in Genetics and Molecular Biology, 2, Article 7.

[27] Robertson, G., Hirst, M., Bainbridge, M., Bilenky, M., Zhao, Y., Zeng, T. et al. (2007). Genome-wide profiles of STAT1 DNA association using chromatin immunoprecipitation and massive parallel sequencing. Nature Methods, 4, 651-657.

[28] Roider, H.G., Manke, T., and Vingron, M. (2007). Predicting transcription factor affinities to DNA from a biophysical model. Bioinformatics, 23, 134-141.

[29] Roth, F.R., Hughes, J.D., Estep, P.E., and Church, G.M. (1998). Finding DNA regulatory motifs within unaligned noncoding sequences clustered by whole genome mRNA quantization. Nature Biotechnology, 16, 939-945.

[30] Stormo, G.D. (2000). DNA binding sites: representation and discovery. Bioinformatics, 16, 16-23.

[31] Stormo, G.D. and Fields, D.S. (1998). Specificity, free energy and information content in protein-DNA interactions. Trends in Biochemical Sciences, 23, 109-113.

[32] Stormo, G.D. and Hartzell, G.W. (1989). Identifying protein-binding sites from unaligned DNA fragments. Proceedings of the National Academy of Sciences USA, 86, 1183-1187.

[33] Turatsinze, J.V., Thomas-Chollier, M., Defrance, M., and van Helden, J. (2008). Using RSAT to scan genome sequences for transcription factor binding sites and cis-regulatory modules. Nature Protocols, 3, 1578-1588.

[34] Vingron, M., Brazma, A., Coulson, R., van Helden, J., Manke, T., Palin, K., Sand, O., and Ukkonen, E. (2009). Integrating sequence, evolution and functional genomics in regulatory genomics. Genome Biology, 10, 202.

[35] von Hippel, P.H. and Berg, O.G. (1986). On the specificity of DNA-protein interactions. Proceedings of the National Academy of Sciences USA, 83, 1608-1612.

[36] Zhao, X., Huang, H., and Speed, T.P. (2005). Finding short DNA motifs using permuted Markov models. Journal of Computational Biology, 12, 894-906.

[37] Zhou, Q. and Liu, J.S. (2004). Modeling within-motif dependence for transcription factor binding site predictions. Bioinformatics, 20, 909-916.

[38] Zhou, Q. and Liu, J.S. (2008). Extracting sequence features to predict protein-DNA interactions: a comparative study. Nucleic Acids Research, 36, 4137-4148.