BIOINFORMATICS Vol. 00 no. 00 2009, Pages 1–7
arXiv:0908.0597v1 [math.CO] 5 Aug 2009

Target prediction and a statistical sampling algorithm for RNA-RNA interaction

Fenix W.D. Huang 1, Jing Qin 1, Christian M. Reidys 1,2,∗ and Peter F. Stadler 3−7

1 Center for Combinatorics, LPMC-TJKLC, Nankai University, Tianjin 300071, P.R. China
2 College of Life Science, Nankai University, Tianjin 300071, P.R. China
3 Bioinformatics Group, Department of Computer Science, and Interdisciplinary Center for Bioinformatics, University of Leipzig, Hartelstrasse 16-18, D-04107 Leipzig, Germany
4 Max Planck Institute for Mathematics in the Sciences, Inselstrasse 22, D-04103 Leipzig, Germany
5 RNomics Group, Fraunhofer Institut for Cell Therapy and Immunology, Perlickstraße 1, D-04103 Leipzig, Germany
6 Inst. f. Theoretical Chemistry, University of Vienna, Wahringerstrasse 17, A-1090 Vienna, Austria
7 The Santa Fe Institute, 1399 Hyde Park Rd., Santa Fe, New Mexico, USA

Received on *****; revised on *****; accepted on *****

Associate Editor: *****

ABSTRACT

It has been shown that the accessibility of target sites has a critical influence on the function of miRNAs and siRNAs. In this paper, we present a program, rip2.0, that computes not only the energetically most favorable target site based on the hybrid probability, but also a statistically sampled structure that illustrates the statistical characterization and representation of the Boltzmann ensemble of RNA-RNA interaction structures. The outputs are retrieved by backtracing an improved dynamic programming solution for the partition function, based on the approach of Huang et al. (Bioinformatics). The $O(N^6)$ time and $O(N^4)$ space algorithm is implemented in C (available from http://www.combinatorics.cn/cbpc/rip2.html).

1 INTRODUCTION

Noncoding RNAs have been found to have roles in a great variety of processes, including transcriptional regulation, chromosome replication, RNA processing and modification, messenger RNA stability and translation, and even protein degradation and translocation. Direct base-pairing with target RNA or DNA molecules is central to the function of some ncRNAs (Storz, 2002). Examples include the regulation of translation in both prokaryotes (Narberhaus and Vogel, 2007) and eukaryotes (McManus and Sharp, 2002; Banerjee and Slack, 2002), the targeting of chemical modifications (Bachellerie et al., 2002), insertion editing (Benne, 1992), and transcriptional control (Kugel and Goodrich, 2007). The common theme in many RNA classes, including miRNAs, siRNAs, snRNAs, gRNAs, and snoRNAs, is the formation of RNA-RNA interaction structures that are more complex than simple sense-antisense interactions.

∗To whom correspondence should be addressed. Phone: *86-22-2350-6800; Fax: *86-22-2350-9272; [email protected]

The hybridization energy is a widely used criterion to predict RNA-RNA interactions (Rehmsmeier et al., 2004; Tjaden et al., 2006; Busch et al., 2008). It has been shown that the accessibility of the target sites has a critical influence on miRNA and siRNA function (Ameres and Schroeder, 2007; Kertesz et al., 2007; Kretschmer-Kazemi Far and Sczakiel, 2003). Although many regulatory ncRNAs have already been identified, the number of experimentally verified target sites is much smaller, which creates a great demand to narrow down the list of putative targets. In its most general form, the RNA-RNA interaction problem (RIP) is NP-complete (Alkan et al., 2006; Mneimneh, 2007). The argument for this statement is based on an extension of the work of Akutsu (2000) for RNA folding with pseudoknots. Polynomial-time algorithms can be derived, however, by restricting the space of allowed configurations in ways that are similar to pseudoknot folding algorithms (Rivas and Eddy, 1999). The second major problem concerns the energy parameters, since the standard loop types (hairpins, internal and multiloops) are insufficient; for the additional types, such as kissing hairpins, experimental data are virtually absent. Tertiary interactions, furthermore, are likely to have a significant impact.

Several circumscribed approaches to target prediction have been considered in the literature. The simplest approach concatenates the two interacting sequences and subsequently employs a slightly modified standard secondary structure folding algorithm. For instance, the algorithms RNAcofold (Hofacker et al., 1994; Bernhart et al., 2006), pairfold (Andronescu et al., 2005), and NUPACK (Ren et al., 2005) subscribe to this strategy. The main problem of this approach is that it cannot predict important motifs such as kissing-hairpin loops. The paradigm of concatenation has also been generalized to the pseudoknot folding algorithm of Rivas and Eddy (1999). The resulting model, however, still does not generate all relevant interaction structures (Chitsaz et al., 2009; Qin and Reidys, 2008). An alternative line of thought is to neglect

© Oxford University Press 2009.


all internal base-pairings in either strand and to compute the minimum free energy (mfe) secondary structure for their hybridization under this constraint. For instance, RNAduplex and RNAhybrid (Rehmsmeier et al., 2004) follow this line of thought. RNAup (Muckstein et al., 2006, 2008) and intaRNA (Busch et al., 2008) restrict interactions to a single interval that remains unpaired in the secondary structure for each partner. Due to the highly conserved interaction motif, snoRNA/target complexes are, however, treated more efficiently using a specialized tool (Tafer et al., 2009). Pervouchine (2004) and Alkan et al. (2006) independently derived and implemented mfe folding algorithms for predicting the joint secondary structure of two interacting RNA molecules with polynomial time complexity. In their model, a “joint structure” means that the intramolecular structures of each molecule are pseudoknot-free, the intermolecular binding pairs are noncrossing and there exist no so-called “zig-zags”. The optimal “joint structure” can be computed in $O(N^6)$ time and $O(N^4)$ space by means of dynamic programming.

Recently, Chitsaz et al. (2009) and Huang et al. (2009) independently presented piRNA and rip1.0, tools that use dynamic programming to compute the partition function of “joint structures”, both in $O(N^6)$ time. Albeit differing in design details, they are equivalent. In addition, Huang et al. (2009) identified in rip1.0 a basic data structure that forms the basis for computing additional important quantities such as the base pairing probability matrix. However, since the probability of a hybrid is not simply a sum of the probabilities of its exterior arcs, which are not independent, rip1.0 cannot compute the probability of a hybrid.

Fig. 1. The natural structure of ompA-MicA (Udekwu et al., 2005), in which the target sites are colored in red and the regions colored in green are the five with the highest region probabilities (66.9%, 25.9%, 25.6%, 23.6% and 22.9%). The target sites $R[i,j]$ of ompA (interacting with MicA) whose probabilities are larger than $10^{-1}$ are shown in Table 1.

The calculation of equilibrium partition functions and base-pairing probabilities is an important advance toward the characterization of the Boltzmann ensemble of RNA-RNA interaction

Fig. 2. The natural structure of sodB-RyhB (Geissmann and Touati, 2004), in which the target sites are colored in red and the regions colored in green are the five with the highest region probabilities (83.0%, 54.6%, 24.7%, 19.0% and 17.4%). The target sites $R[i,j]$ of sodB (interacting with RyhB) whose probabilities are larger than $10^{-1}$ are shown in Table 2.

Fig. 3. The natural structure of fhlA-OxyS (Chitsaz et al., 2009), in which the target sites are colored in red and the regions colored in green are the five with the highest region probabilities (63.8%, 50.7%, 43.6%, 39.6% and 28.4%). The target sites $R[i,j]$ of fhlA (interacting with OxyS) whose probabilities are larger than $10^{-1}$ are shown in Table 3.

structures. However, this elegant algorithm does not itself generate any structures. As Ding and Lawrence (2003) suggested with prototype algorithms, the generation of a statistically representative sample of secondary structures may provide a resolution to this dilemma.

In contrast to rip1.0, given two RNA sequences, the output of rip2.0 consists not only of the partition function and the base pairing probability matrix, but also of the contact-region probability matrix, which is based on the hybrid probabilities obtained by introducing a new component “hybrid” into the decomposition process, and of a statistically sampled RNA-RNA structure based on the probability matrices. At the same time, we decrease the storage space from 4D-matrices and 2D-matrices to 4D-matrices and 2D-matrices.


2 PARTITION FUNCTION

2.1 Background

Let us first review some basic concepts introduced by Huang et al. (2009); see the supplementary material (SM) for a full version.

Given two RNA sequences $R$ and $S$ (e.g. an antisense RNA and its target) with $N$ and $M$ vertices, we index the vertices such that $R_1$ is the $5'$ end of $R$ and $S_1$ denotes the $3'$ end of $S$. The edges of $R$ and $S$ represent the intramolecular base pairs. A joint structure, $J(R,S,I)$, is a graph with the following properties, see Fig. 4 (B):

1. $R$, $S$ are secondary structures (each nucleotide being paired with at most one other nucleotide via hydrogen bonds, without internal pseudoknots);

2. $I$ is a set of arcs of the form $R_iS_j$ without pseudoknots, i.e., if $R_{i_1}S_{j_1}, R_{i_2}S_{j_2} \in I$ where $i_1 < i_2$, then $j_1 < j_2$ holds;

3. There are no zig-zags, see Fig. 4 (A).
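Property 2 can be checked mechanically. The following is a minimal sketch in Python, not taken from rip2.0; the function name `is_noncrossing` and the representation of $I$ as a list of index pairs `(i, j)` standing for $R_iS_j$ are assumptions made for illustration. It additionally assumes that each nucleotide takes part in at most one exterior arc.

```python
def is_noncrossing(exterior_arcs):
    """Check property 2 for a set I of exterior arcs.

    exterior_arcs: iterable of (i, j) pairs, read as the arc R_i S_j
    (hypothetical representation, chosen for this sketch).
    Returns True iff i1 < i2 implies j1 < j2 for all pairs of arcs.
    """
    arcs = sorted(exterior_arcs)  # sort by the index on R
    for (i1, j1), (i2, j2) in zip(arcs, arcs[1:]):
        # After sorting, it suffices to compare neighbouring arcs.
        # Equal indices are rejected: assumed one arc per nucleotide.
        if i1 == i2 or j1 >= j2:
            return False
    return True
```

Comparing only neighbouring arcs is enough because, once the arcs are sorted by $i$, strict monotonicity of $j$ along the sorted list implies it for every pair. The zig-zag condition (property 3) also involves the interior arcs of $R$ and $S$ and is not covered by this sketch.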

Joint structures are exactly the configurations that are considered in the maximum matching approach of Pervouchine (2004), in the energy minimization algorithm of Alkan et al. (2006), and in the partition function approach of Chitsaz et al. (2009). The subgraph of a joint structure $J(R,S,I)$ induced by a pair of subsequences $\{R_i, R_{i+1}, \dots, R_j\}$ and $\{S_h, S_{h+1}, \dots, S_\ell\}$ is denoted by $J_{i,j;h,\ell}$. In particular, $J(R,S,I) = J_{1,N;1,M}$. We say $R_aR_b$ ($S_aS_b$, $R_aS_b$) $\in J_{i,j;h,\ell}$ if and only if $R_aR_b$ ($S_aS_b$, $R_aS_b$) is an edge of the graph $J_{i,j;h,\ell}$. Furthermore, $J_{i,j;h,\ell} \subset J_{a,b;c,d}$ if and only if $J_{i,j;h,\ell}$ is a subgraph of $J_{a,b;c,d}$ induced by $\{R_i, \dots, R_j\}$ and $\{S_h, \dots, S_\ell\}$. Given a joint structure, $J_{a,b;c,d}$, its tight structure (ts)

Fig. 4. (A): A zigzag, generated by $R_2S_1$, $R_3S_3$ and $R_5S_4$ (red). (B): The joint structure $J_{1,24;1,23}$; the different segments and tight structures into which $J_{1,24;1,23}$ decomposes are colored.

$J_{a',b';c',d'}$ is either a single exterior arc $R_{a'}S_{c'}$ (in the case $a' = b'$ and $c' = d'$), or the minimal block centered around the leftmost and rightmost exterior arcs $\alpha_l$, $\alpha_r$ (possibly being equal) and an interior arc subsuming both, i.e., $J_{a',b';c',d'}$ is tight in $J_{a,b;c,d}$ if it has either an arc $R_{a'}R_{b'}$ or an arc $S_{c'}S_{d'}$ whenever $a' \neq b'$ or $c' \neq d'$.

In the following, a ts is denoted by $J^{T}_{i,j;h,\ell}$. If $J_{a',b';c',d'}$ is tight in $J_{a,b;c,d}$, then we call $J_{a,b;c,d}$ its envelope. By construction, the notion of ts depends on its envelope. There are only four basic types of ts, see Fig. 5:

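◦ : $\{R_iS_h\} = J^{\circ}_{i,j;h,\ell}$ and $i = j$, $h = \ell$;

▽ : $R_iR_j \in J^{\triangledown}_{i,j;h,\ell}$ and $S_hS_\ell \notin J^{\triangledown}_{i,j;h,\ell}$;

□ : $R_iR_j, S_hS_\ell \in J^{\square}_{i,j;h,\ell}$;

△ : $S_hS_\ell \in J^{\triangle}_{i,j;h,\ell}$ and $R_iR_j \notin J^{\triangle}_{i,j;h,\ell}$.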
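The four cases can be told apart by looking only at the two potential closing arcs. The sketch below is illustrative and not part of rip2.0: the function `tight_type`, the string labels, and the representation of interior arcs as sets of index pairs are assumptions of this example. It presumes that the block $J_{i,j;h,\ell}$ is already known to be tight.

```python
def tight_type(i, j, h, l, r_arcs, s_arcs):
    """Classify a tight structure J_{i,j;h,l} into one of the four types.

    r_arcs, s_arcs: sets of (a, b) pairs with a < b, the interior arcs
    of R and S (hypothetical representation for this sketch).
    Returns "circle", "down", "square" or "up" for the types
    ◦, ▽, □ and △; returns None if no closing arc is present.
    """
    if i == j and h == l:
        return "circle"              # a single exterior arc R_i S_h
    r_closed = (i, j) in r_arcs      # arc R_i R_j present
    s_closed = (h, l) in s_arcs      # arc S_h S_l present
    if r_closed and s_closed:
        return "square"
    if r_closed:
        return "down"
    if s_closed:
        return "up"
    return None
```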

2.2 Refined decomposition grammar for target prediction

The unique ts decomposition would in principle already suffice to construct a partition function algorithm. Indeed, each decomposition step corresponds to a multiplicative recursion relation for the partition functions associated with the joint structures. From a practical point of view, however, this would result in an unwieldy and expensive implementation. The reason is the multiple break points $a, b, c, d, \dots$, each of which corresponds to a nested for-loop.

Fig. 5. From left to right: tight structures of type ◦, ▽, □ and △.

We therefore need a refined decomposition that reduces the number of break points. To this end we call a joint structure right-tight (rts), $J^{RT}_{i,j;r,s}$, in $J_{i_1,j_1;r_1,s_1}$ if its rightmost block is a $J_{i_1,j_1;r_1,s_1}$-ts, and double-tight (dts), $J^{DT}_{i,j;r,s}$, in $J_{i_1,j_1;r_1,s_1}$ if both its leftmost and rightmost blocks are $J_{i_1,j_1;r_1,s_1}$-ts's. In particular, for computational convenience, we treat a single interaction arc as a special case of a dts, i.e. $J^{DT}_{i_1,i_1;r_1,r_1} = R_{i_1}S_{r_1}$. In order to obtain the probability of a hybrid $J^{Hy}_{i_1,i_\ell;j_1,j_\ell}$ via the backtracing method used in Huang et al. (2009), we introduce the hybrid structure, $J^{Hy}_{i_1,i_\ell;j_1,j_\ell}$, as a new block item in the decomposition process. We adopt the point of view of Algebraic Dynamic Programming (Giegerich and Meyer, 2002) and regard each decomposition rule as a production in a suitable grammar. Fig. 6 summarizes the major steps in the decomposition: (I) “interior arc-removal” to reduce ts, a scheme that is complemented by the usual loop decomposition of secondary structures, and (II) “block-decomposition” to split a joint structure into two blocks.

Fig. 6. Illustration of Procedure (a), the reduction of arbitrary joint structures and right-tight structures, and Procedure (b), the decomposition of tight structures. The panel below indicates the different types of structural components: A, B: maximal secondary structure segments $R[i,j]$, $S[r,s]$; C: arbitrary joint structure $J_{1,N;1,M}$; D: right-tight structure $J^{RT}_{i,j;r,s}$; E: double-tight structure $J^{DT}_{i,j;r,s}$; F: tight structure of type ▽, △ or □; G: type □ tight structure $J^{\square}_{i,j;r,s}$; H: type ▽ tight structure $J^{\triangledown}_{i,j;r,s}$; J: type △ tight structure $J^{\triangle}_{i,j;r,s}$; K: hybrid structure $J^{Hy}_{i,j;h,\ell}$; L: substructure of a hybrid $J^{h}_{i,j;h,\ell}$ such that $R_iS_h$ and $R_jS_\ell$ are exterior arcs; M: isolated segment $R[i,j]$ or $S[h,\ell]$.

Fig. 7. The decomposition trees $T_{J_{1,15;1,8}}$ for the joint structure $J_{1,15;1,8}$ according to the grammar in rip2.0 (A) and rip1.0 (B), respectively.

According to the decomposition rules, a given joint structure is decomposed into interior arcs and hybrids, see Fig. 7 (A). The details of the decomposition procedures are collected in SM, Section 2, where we show that for each joint structure $J_{1,N;1,M}$ we indeed obtain a unique decomposition tree (parse tree), denoted by $T_{J_{1,N;1,M}}$. More precisely, $T_{J_{1,N;1,M}}$ has root $J_{1,N;1,M}$ and all other vertices correspond to specific substructures of $J_{1,N;1,M}$ obtained by the successive application of the decomposition steps of Fig. 6 and the loop decomposition of the secondary structures. The decomposition trees of a concrete example generated according to rip2.0 and rip1.0 are shown in Fig. 7 (A) and (B), respectively.

Let us now have a closer look at the energy evaluation of $J_{i,j;h,\ell}$. Each decomposition step in Fig. 6 results in substructures whose energies we assume to contribute additively and generalized loops that need to be evaluated directly. There are the following two scenarios:

I. Interior arc removal. The first type of decomposition focuses on decomposing ts, similar to the approach of Huang et al. (2009). Most of the decomposition operations in Procedure (b) displayed in Fig. 6 can be viewed as the “removal” of an arc (corresponding to the closing pair of a loop in secondary structure folding) followed by a decomposition. Both the loop type and the subsequent possible decomposition steps depend on the newly exposed structural elements. W.l.o.g., we may assume that we open an interior base pair $R_iR_j$.

For instance, for a rts $J^{RT}_{p,q;r,s}$ (denoted by “D” in Fig. 6) we need to determine the type of the exposed pairs of both $R[p,q]$ and $S[r,s]$. Hence each such structure is indexed by two types from $\{E, M, K, F\}$. Analogously, there are in total four types of a hybrid $J^{Hy}_{i,j;h,\ell}$, i.e. $\{J^{Hy,EE}_{i,j;h,\ell}, J^{Hy,EK}_{i,j;h,\ell}, J^{Hy,KE}_{i,j;h,\ell}, J^{Hy,KK}_{i,j;h,\ell}\}$.

Fig. 8. (I) Decomposition of $J^{RT,MK}_{i,j;h,\ell}$ and (II) decomposition of $J^{DT,KKB}_{i,j;h,\ell}$.

II. Block decomposition. The second type of decomposition is the splitting of joint structures into “blocks”. There are two major differences from the method used in Huang et al. (2009). First, we introduce the hybrid itself as a new block item in the grammar and furthermore decompose a hybrid by removing a single exterior arc in each step. Second, we split the whole interaction structure into blocks via the alternating decomposition of a rts and a dts, as shown in Procedure (a) of Fig. 6.

In order to ensure the maximality of a hybrid, the rts's $J^{RT,KK}_{i,j;h,\ell}$, $J^{RT,KE}_{i,j;h,\ell}$, $J^{RT,EK}_{i,j;h,\ell}$ and $J^{RT,EE}_{i,j;h,\ell}$ may appear in two ways, depending on whether or not there exists an exterior arc $R_{i_1}S_{j_1}$ such that $R[i, i_1-1]$ and $S[h, j_1-1]$ are isolated segments. If such an arc exists, we say the rts is of type (B), and of type (A) otherwise. Similarly, a dts $J^{DT,KK}_{i,j;h,\ell}$, $J^{DT,KE}_{i,j;h,\ell}$, $J^{DT,EK}_{i,j;h,\ell}$ or $J^{DT,EE}_{i,j;h,\ell}$ is of type (B) or (A) depending on whether $R_iS_h$ is an exterior arc. For instance, Fig. 8 (I) displays the decomposition of $J^{DT,KKB}_{i,j;h,\ell}$ into a hybrid and a rts of type (A), and Fig. 8 (II) displays the decomposition of $J^{RT,KKA}_{i,j;h,\ell}$.

Suppose $J^{DT}_{i,j;h,\ell}$ is a dts contained in a kissing loop, that is, we have either $E^{e}_{R[i,j]} \neq \emptyset$ or $E^{e}_{S[h,\ell]} \neq \emptyset$. W.l.o.g., we may assume $E^{e}_{R[i,j]} \neq \emptyset$. Then at least one of the two “blocks” contains an exterior arc belonging to $E^{e}_{R[i,j]}$; such a block is labeled by K, and by F otherwise, see Fig. 8 (I).

2.3 Examples for partition function recursions

The computation of the partition function proceeds “from the inside to the outside”, see eqs. (2.2). The recursions are initialized with the energies of individual external base pairs and empty secondary structures on subsequences of length up to four. In order to differentiate multi- and kissing-loop contributions, we introduce the partition functions $Q^{m}_{i,j}$ and $Q^{k}_{i,j}$. Here, $Q^{m}_{i,j}$ denotes the partition function of secondary structures on $R[i,j]$ or $S[i,j]$ having at least one arc contained in a multi-loop. Similarly, $Q^{k}_{i,j}$ denotes the partition function of secondary structures on $R[i,j]$ or $S[i,j]$ in which at least one arc is contained in a kissing loop. Let $J^{\xi,Y_1Y_2Y_3}_{i,j;h,\ell}$ be the set of substructures $J_{i,j;h,\ell} \subset J_{1,N;1,M}$ such that $J_{i,j;h,\ell}$ appears in $T_{J_{1,N;1,M}}$ as an interaction structure of type $\xi \in \{DT, RT, \triangledown, \triangle, \square, \circ\}$ with loop-subtypes $Y_1, Y_2 \in \{M, K, F\}$ on the sub-intervals $R[i,j]$ and $S[h,\ell]$, and $Y_3 \in \{A, B\}$. Let $Q^{\xi,Y_1Y_2Y_3}_{i,j;h,\ell}$ denote the partition function of $J^{\xi,Y_1Y_2Y_3}_{i,j;h,\ell}$.

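For instance, the recursion for $Q^{RT,MK}_{i,j;h,\ell}$ displayed in Fig. 8 (I) is equivalent to:

$$
\begin{aligned}
Q^{RT,MK}_{i,j;h,\ell} = \sum_{i_1,h_1} \Big(
& Q^{Hy,KK}_{i,i_1;h,h_1}\, Q^{RT,KKA}_{i_1+1,j;h_1+1,\ell}
+ Q^{Hy,KK}_{i,i_1;h,h_1}\, Q^{RT,KF}_{i_1+1,j;h_1+1,\ell} \\
& + Q^{Hy,KK}_{i,i_1;h,h_1}\, Q^{RT,FF}_{i_1+1,j;h_1+1,\ell}
+ Q^{Hy,KK}_{i,i_1;h,h_1}\, Q^{RT,FK}_{i_1+1,j;h_1+1,\ell} \Big)
+ Q^{Hy,KK}_{i,j;h,\ell}.
\end{aligned}
\tag{2.1}
$$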

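Here, the recursions for $J^{Hy,EE}_{i,j;h,\ell}$, $J^{Hy,EK}_{i,j;h,\ell}$, $J^{Hy,KE}_{i,j;h,\ell}$, and $J^{Hy,KK}_{i,j;h,\ell}$ read:

$$
\begin{aligned}
Q^{Hy,EE}_{i,j;h,\ell} &= \sum_{i_1,h_1} Q^{Hy,EE}_{i,i_1;h,h_1}\, e^{-(\sigma_0 + \sigma G^{Int}_{i_1,h_1,j,\ell})}; \\
Q^{Hy,EK}_{i,j;h,\ell} &= \sum_{i_1,h_1} Q^{Hy,EK}_{i,i_1;h,h_1}\, e^{-(\sigma_0 + \sigma G^{Int}_{i_1,h_1,j,\ell} + (\ell - h_1 - 1)\beta_3)}; \\
Q^{Hy,KE}_{i,j;h,\ell} &= \sum_{i_1,h_1} Q^{Hy,KE}_{i,i_1;h,h_1}\, e^{-(\sigma_0 + \sigma G^{Int}_{i_1,h_1,j,\ell} + (j - i_1 - 1)\beta_3)}; \\
Q^{Hy,KK}_{i,j;h,\ell} &= \sum_{i_1,h_1} Q^{Hy,KK}_{i,i_1;h,h_1}\, e^{-(\sigma_0 + \sigma G^{Int}_{i_1,h_1,j,\ell} + (j + \ell - i_1 - h_1 - 2)\beta_3)}.
\end{aligned}
\tag{2.2}
$$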
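To make the shape of these recursions concrete, the following Python sketch evaluates a recursion of the form of the first line of (2.2) by memoized recursion. It is an illustration under stated assumptions, not the rip2.0 implementation (which is written in C): the function `hybrid_partition` and the callbacks `can_pair` and `step_energy` are hypothetical, the energy term $\sigma G^{Int}_{i_1,h_1,j,\ell}$ is supplied by the caller in units in which the Boltzmann factor is $e^{-E}$, the initialization with weight 1 for a single exterior arc is assumed, and any bound on the size of interior loops is omitted.

```python
import math
from functools import lru_cache


def hybrid_partition(i, j, h, l, can_pair, step_energy, sigma0=0.0):
    """Partition function of hybrids from exterior arc (i, h) to (j, l).

    can_pair(a, b): True if R_a may pair with S_b (assumed callback).
    step_energy(i1, h1, j, l): energy of the interior loop between the
    exterior arcs (i1, h1) and (j, l) (assumed callback).
    """
    @lru_cache(maxsize=None)
    def q(jj, ll):
        if not can_pair(jj, ll):
            return 0.0
        if (jj, ll) == (i, h):
            return 1.0  # assumed initialisation: a single exterior arc
        total = 0.0
        # (i1, h1) is the exterior arc preceding (jj, ll) in the hybrid.
        for i1 in range(i, jj):
            for h1 in range(h, ll):
                weight = math.exp(-(sigma0 + step_energy(i1, h1, jj, ll)))
                total += q(i1, h1) * weight
        return total

    return q(j, l)
```

With all energies set to zero, the function simply counts the hybrids between the two bounding arcs, which gives a quick sanity check. The EK, KE and KK variants differ only by the additional penalty of $\beta_3$ per unpaired base, which could be folded into `step_energy`.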

3 BACKTRACING

3.1 Target prediction

Given two RNA sequences, our sample space is the ensemble of all possible joint interaction structures. Let $Q^{I}$ denote the partition function which sums over all possible joint structures. The probability measure of a given joint structure $J_{1,N;1,M}$ is given by

$$
P_{J_{1,N;1,M}} = \frac{Q_{J_{1,N;1,M}}}{Q^{I}}. \tag{3.1}
$$



In contrast to the computation of the partition function “from the inside to the outside”, the substructure probabilities are computed “from the outside to the inside” via the total probability formula (TPF). That is, the longest-range substructures are computed first. This is analogous to McCaskill's algorithm for secondary structures (McCaskill, 1990).

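Set $J = J_{1,N;1,M}$, $T = T_{J_{1,N;1,M}}$ and let $\Lambda_{J_{i,j;h,\ell}} = \{J \mid J_{i,j;h,\ell} \in T\}$ denote the set of all joint structures $J$ such that $J_{i,j;h,\ell}$ is a vertex in the decomposition tree $T$. Then

$$
P_{J_{i,j;h,\ell}} = \sum_{J \in \Lambda_{J_{i,j;h,\ell}}} P_J. \tag{3.2}
$$

By virtue of the TPF, letting $\theta_s$ range over the possible parent structures of $J_{i,j;h,\ell}$ and $P_{J_{i,j;h,\ell}|\theta_s}$ be the corresponding conditional probability, we have $P_{J_{i,j;h,\ell}} = \sum_s P_{J_{i,j;h,\ell}|\theta_s} P_{\theta_s}$.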

Let $P^{\xi,Y_1Y_2Y_3}_{i,j;h,\ell}$ be the probability of $J^{\xi,Y_1Y_2Y_3}_{i,j;h,\ell}$. For instance, $P^{RT,MKA}_{i,j;h,\ell}$ is the sum over all the probabilities of substructures $J_{i,j;h,\ell} \in T_{J_{1,N;1,M}}$ such that $J_{i,j;h,\ell} \in J^{RT,MKA}_{i,j;h,\ell}$, i.e. a rts of type A such that $R[i,j]$ and $S[h,\ell]$ are enclosed by a multi-loop and a kissing loop, respectively. Given a component $J^{\xi,Y_1Y_2Y_3}_{i,j;h,\ell}$ (shown in Fig. 6), we say another component $J^{\xi',Y_1'Y_2'Y_3'}_{i',j';h',\ell'}$ is its parent-component if and only if, as a substructure, $J^{\xi',Y_1'Y_2'Y_3'}_{i',j';h',\ell'}$ could be a parent structure of $J^{\xi,Y_1Y_2Y_3}_{i,j;h,\ell}$ in the decomposition tree. Accordingly, we say $J^{\xi,Y_1Y_2Y_3}_{i,j;h,\ell}$ is the child-component of $J^{\xi',Y_1'Y_2'Y_3'}_{i',j';h',\ell'}$. For instance, $J^{RT,KKB}_{i_1,j_1;h_1,\ell_1}$ is one of the parent-components of $J^{Hy,KKB}_{i,j;h,\ell}$, see Fig. 8 (I).

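Let $\Theta_s$ be one of the possible parent-components of $J^{\xi,Y_1Y_2Y_3}_{i,j;h,\ell}$. Accordingly, we have

$$
P^{\xi,Y_1Y_2Y_3}_{i,j;h,\ell} = \sum_s P^{\xi,Y_1Y_2Y_3}_{i,j;h,\ell|\Theta_s}\, P_{\Theta_s}, \tag{3.3}
$$

where by definition

$$
P^{\xi,Y_1Y_2Y_3}_{i,j;h,\ell} = \sum_{J_{i,j;h,\ell} \in J^{\xi,Y_1Y_2Y_3}_{i,j;h,\ell}} P_{J_{i,j;h,\ell}}. \tag{3.4}
$$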

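Furthermore, in the program, we calculate $P^{\xi,Y_1Y_2Y_3}_{i,j;h,\ell|\Theta_s} P_{\Theta_s}$ for all $s$ during the decomposition of $\Theta_s$. That is, given $P_{\Theta_s}$, we have $P_{\Theta_s} = \sum_i P_{\epsilon_i|\Theta_s} P_{\Theta_s}$, where $P_{\epsilon_i|\Theta_s}$ denotes the conditional probability of the event that, given that $\Theta_s$ is the parent-component, $\epsilon_i$ is its child-component. In particular, we have $P^{\xi,Y_1Y_2Y_3}_{i,j;h,\ell|\Theta_s} = P_{\epsilon_m|\Theta_s}$ for some $m$.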
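The outside-to-inside accumulation can be sketched as follows. This Python fragment is a toy illustration of the total probability formula, not the rip2.0 code: the function `propagate`, the dictionary accumulator, and the encoding of a decomposition as a list of cases with their partition-function contributions are assumptions of this example. The conditional probability of a case is taken to be its contribution divided by the partition function of the parent.

```python
def propagate(child_prob, parent_prob, cases):
    """Distribute the probability of one parent-component over its children.

    child_prob: dict mapping child labels to accumulated probabilities
                (updated in place).
    parent_prob: probability P_Theta of the parent-component.
    cases: list of (children, contribution) pairs, one per mutually
           exclusive decomposition case of the parent; children is a
           tuple of child labels, contribution is the product of the
           children's partition functions and the loop weight.
    """
    q_parent = sum(contribution for _, contribution in cases)
    for children, contribution in cases:
        conditional = contribution / q_parent  # P(case | parent)
        for child in children:
            # One term P(child | parent) * P(parent) of the TPF sum.
            child_prob[child] = child_prob.get(child, 0.0) + parent_prob * conditional
    return child_prob
```

Calling `propagate` once for every parent-component, in an order in which each parent is processed before its children, accumulates the sum over $s$ in eq. (3.3).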

Since the four subclasses of $J^{Hy}_{i,j;h,\ell}$, i.e. $J^{Hy,EE}_{i,j;h,\ell}$, $J^{Hy,EK}_{i,j;h,\ell}$, $J^{Hy,KE}_{i,j;h,\ell}$, and $J^{Hy,KK}_{i,j;h,\ell}$, are mutually exclusive, we obtain

$$
P^{Hy}_{i,j;h,\ell} = P^{Hy,EE}_{i,j;h,\ell} + P^{Hy,EK}_{i,j;h,\ell} + P^{Hy,KE}_{i,j;h,\ell} + P^{Hy,KK}_{i,j;h,\ell}. \tag{3.5}
$$

Given a hybrid $J^{Hy}_{i,j;h,\ell}$, recall that its target sites are $R[i,j]$ and $S[h,\ell]$. The probability of a target site $R[i,j]$ is defined by

$$
P^{tar}_{R[i,j]} = \sum_{h,\ell} P^{Hy}_{i,j;h,\ell}. \tag{3.6}
$$

Analogously, we define $P^{tar}_{S[h,\ell]}$. We predict the optimal interaction region as the one with maximal probability, i.e.

$$
P^{opt} = \max_{i,j} P^{tar}_{R[i,j]}. \tag{3.7}
$$
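Equations (3.6) and (3.7) amount to a marginalization followed by a maximum. A minimal Python sketch, assuming the hybrid probabilities are available as a dictionary keyed by `(i, j, h, l)` (a representation chosen here for illustration; the names `target_site_probabilities` and `best_site` are hypothetical):

```python
from collections import defaultdict


def target_site_probabilities(hybrid_prob):
    """Eq. (3.6): sum P^Hy_{i,j;h,l} over h and l for each site R[i, j]."""
    site = defaultdict(float)
    for (i, j, h, l), p in hybrid_prob.items():
        site[(i, j)] += p
    return dict(site)


def best_site(hybrid_prob):
    """Eq. (3.7): the site R[i, j] of maximal probability and its value."""
    site = target_site_probabilities(hybrid_prob)
    return max(site.items(), key=lambda item: item[1])
```

For example, with the made-up values `{(3, 7, 1, 5): 0.5, (3, 7, 2, 6): 0.25, (9, 11, 1, 3): 0.6}`, the site $R[3,7]$ collects probability 0.75 and is returned by `best_site`, although no single hybrid on it is as probable as the one on $R[9,11]$.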

3.2 Statistically generating interaction structures

In this section, we generalize the idea of Ding and Lawrence (2003) in order to draw a representative sample from the Boltzmann equilibrium distribution of RNA interaction structures. The section is divided into two parts. First, we illustrate the correspondence between the decomposition grammar, i.e. the recursions for the partition functions, and the sampling probabilities for mutually exclusive cases; second, we describe the sampling algorithm.

The calculation of the sampling probabilities is based on the recurrences of the partition functions since for mutually exclusive and exhaustive cases,

113,128: 66.9%    87,89: 25.9%    53,55: 25.6%    27,27: 23.6%
29,29: 22.9%      39,40: 21.1%    27,28: 20.9%    67,69: 16.6%
115,128: 16.6%    36,41: 15.0%    36,40: 13.0%    26,28: 12.3%
67,70: 10.9%      55,56: 10.3%

Table 1. The target sites $R[i,j]$ of ompA (interacting with MicA) whose probabilities are larger than $10^{-1}$. Each entry reads i,j: probability.

52,60: 83.0%    15,17: 54.6%    38,47: 24.7%    15,16: 19.0%
72,75: 17.4%    77,78: 16.8%    45,47: 14.2%    71,74: 13.7%
73,75: 12.3%    77,81: 11.1%    14,17: 11.0%

Table 2. The target sites $R[i,j]$ of sodB (interacting with RyhB) whose probabilities are larger than $10^{-1}$. Each entry reads i,j: probability.

87,93: 63.8%    39,48: 50.7%    62,64: 43.6%    70,72: 39.6%
30,30: 28.4%    70,73: 27.0%    39,45: 17.0%    87,92: 13.5%
40,45: 11.9%    63,64: 11.4%

Table 3. The target sites $R[i,j]$ of fhlA (interacting with OxyS) whose probabilities are larger than $10^{-1}$. Each entry reads i,j: probability.

the key observation is that the sampling probability for a case equals the contribution of that case to the partition function divided by the partition function. For instance, we again consider the decomposition of $J^{RT,MK}_{i,j;h,\ell}$. Let $P^{0}_{i_1,h_1}$, $P^{1}_{i_1,h_1}$, $P^{2}_{i_1,h_1}$, $P^{3}_{i_1,h_1}$ and $P^{4}$ be the sampling probabilities for the five cases shown in Fig. 8 (I), taken anticlockwise; then we have:

$$
\begin{aligned}
P^{0}_{i_1,h_1} &= Q^{Hy,KK}_{i,i_1;h,h_1}\, Q^{RT,KKA}_{i_1+1,j;h_1+1,\ell} \,/\, Q^{RT,MK}_{i,j;h,\ell}, \\
P^{1}_{i_1,h_1} &= Q^{Hy,KK}_{i,i_1;h,h_1}\, Q^{RT,KF}_{i_1+1,j;h_1+1,\ell} \,/\, Q^{RT,MK}_{i,j;h,\ell}, \\
P^{2}_{i_1,h_1} &= Q^{Hy,KK}_{i,i_1;h,h_1}\, Q^{RT,FF}_{i_1+1,j;h_1+1,\ell} \,/\, Q^{RT,MK}_{i,j;h,\ell}, \\
P^{3}_{i_1,h_1} &= Q^{Hy,KK}_{i,i_1;h,h_1}\, Q^{RT,FK}_{i_1+1,j;h_1+1,\ell} \,/\, Q^{RT,MK}_{i,j;h,\ell}, \\
P^{4} &= Q^{Hy,KK}_{i,j;h,\ell} \,/\, Q^{RT,MK}_{i,j;h,\ell}.
\end{aligned}
$$

Since the probabilities of all mutually exclusive and exhaustive cases sum up to 1, we have $\sum_{i_1,h_1} \big(P^{0}_{i_1,h_1} + P^{1}_{i_1,h_1} + P^{2}_{i_1,h_1} + P^{3}_{i_1,h_1}\big) + P^{4} = 1$, which coincides with eq. (2.1).
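Drawing a case with probability equal to its share of the partition function takes only a few lines. The Python sketch below is illustrative and makes no claim about how rip2.0 implements this step; the function `sample_case` and the dictionary of contributions are assumptions of this example.

```python
import random


def sample_case(contributions, rng=random):
    """Draw one case with probability contribution / partition function.

    contributions: dict mapping a case label to its (non-negative)
    contribution to the partition function; the cases are assumed to be
    mutually exclusive and exhaustive, so their sum is the partition
    function of the decomposed component.
    rng: object with a random() method returning a float in [0, 1).
    """
    total = sum(contributions.values())
    u = rng.random() * total
    running = 0.0
    for case, weight in contributions.items():
        running += weight
        if u < running:
            return case
    return case  # only reached through floating-point round-off
```

In the example of eq. (2.1), the labels would be the tuples $(k, i_1, h_1)$ for $k = 0, \dots, 3$ together with one label for the fifth case, and the weights the corresponding products of partition functions.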

Next we describe the sampling algorithm. As a generalization of Ding and Lawrence (2003), we again use two stacks $A$ and $B$. Stack $A$ stores sub-joint structures and their types $\xi$ in the form $(i,j;h,\ell;\xi)$; for example, $(i,j;h,\ell;RTMK)$ represents a sub-joint structure $J^{RT,MK}_{i,j;h,\ell}$. Stack $B$ collects the interior/exterior arcs and unpaired bases that will define a sampled interaction structure once the sampling process finishes. At the beginning, $(1,N;1,M;\mathrm{arbitrary})$ is the only element in stack $A$. A sampled interaction structure is drawn recursively as follows. First, starting with $(1,N;1,M;\mathrm{arbitrary})$, we sample either a pair of separated secondary structures or a rts $(i,N;j,M;RTEE)$ according to their sampling probabilities. In the former case, $(1,N;\mathrm{sec})$ and $(1,M;\mathrm{sec})$ are stored in stack $A$. Otherwise, $(1,i-1;\mathrm{sec})$, $(1,j-1;\mathrm{sec})$ and $(i,N;j,M;RTEE)$ are stored in stack $A$. Second, given a new element $(i,j;h,\ell;\xi)$ in stack $A$, we draw a particular case from all the mutually exclusive and exhaustive cases according to the sampling probabilities and store the corresponding sub-joint structures in stack $A$; all interior arcs, exterior arcs and unpaired bases sampled in the process are stored in stack $B$. That is, once the sampling for a “bigger” joint structure from stack $A$ is complete, the “smaller” sub-joint structures derived from it have been stored in stack $A$, and the sampled arcs and unpaired bases have been stored in stack $B$, the element at the bottom of stack $A$ is chosen for the subsequent sampling. The whole process terminates when stack $A$ is empty, at which point a sampled interaction structure has been formed in stack $B$.
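The two-stack scheme is easiest to see on a much smaller problem. The Python sketch below is a simplified single-sequence analogue in the spirit of Ding and Lawrence (2003), not the rip2.0 sampler: it uses a toy energy model in which every base pair contributes a constant Boltzmann factor `pair_weight`, hairpin loops contain at least `min_loop` bases, and only base pairs (not unpaired bases) are recorded in stack B. All function names and parameters are assumptions of this example.

```python
import random

# Allowed base pairs in this toy model (Watson-Crick plus G-U wobble).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def cases(seq, Q, i, j, pair_weight, min_loop):
    """Mutually exclusive, exhaustive cases for the half-open interval seq[i:j].

    Returns (weight, k) tuples: k is None if base i is unpaired,
    otherwise k is the pairing partner of base i.
    """
    out = [(Q[i + 1][j], None)]
    for k in range(i + min_loop + 1, j):
        if (seq[i], seq[k]) in PAIRS:
            out.append((pair_weight * Q[i + 1][k] * Q[k + 1][j], k))
    return out


def partition(seq, pair_weight=2.0, min_loop=3):
    """Fill Q[i][j], the partition function of seq[i:j], from the inside out."""
    n = len(seq)
    Q = [[1.0] * (n + 1) for _ in range(n + 1)]  # empty intervals weigh 1
    for length in range(1, n + 1):
        for i in range(0, n - length + 1):
            j = i + length
            Q[i][j] = sum(w for w, _ in cases(seq, Q, i, j, pair_weight, min_loop))
    return Q


def sample_structure(seq, Q, rng, pair_weight=2.0, min_loop=3):
    """Draw one structure from the Boltzmann distribution of the toy model."""
    stack_a = [(0, len(seq))]  # intervals that still have to be sampled
    stack_b = []               # sampled base pairs
    while stack_a:
        i, j = stack_a.pop()
        if i >= j:
            continue
        # Pick a case with probability weight / Q[i][j].
        u = rng.random() * Q[i][j]
        for weight, k in cases(seq, Q, i, j, pair_weight, min_loop):
            u -= weight
            if u < 0:
                break
        if k is None:
            stack_a.append((i + 1, j))
        else:
            stack_b.append((i, k))
            stack_a.append((i + 1, k))
            stack_a.append((k + 1, j))
    return sorted(stack_b)
```

The sketch takes elements from the top of stack A; since the sub-intervals are sampled independently of one another, the order in which they are processed does not affect the distribution of the sampled structure. `pair_weight` and `min_loop` must agree between `partition` and `sample_structure`.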

4 RESULTS AND CONCLUSIONS

The complete set of recursions comprises, for the ts $Q^{\triangle,\triangledown,\square}_{i,j;r,s}$, 15 4D-arrays respectively; for the right-tight structures $Q^{RT}_{i,j;r,s}$, 20 4D-arrays; and for the dts $Q^{DT}_{i,j;r,s}$, 20 4D-arrays. In addition, we need the usual matrices for the secondary structures $R$ and $S$, and the above-mentioned matrices for kissing loops. The full set of recursions is compiled in the SM, Section 3.

Fig. 9. (A) The natural structure of sodB-RyhB (Geissmann and Touati, 2004). (B) The hybrid-probability matrix generated via rip2.0. This matrix represents all potential contact regions of the sodB structure as squares, whose area is proportional to their respective probability. (C) “Zoom” into the most likely interaction region as predicted by rip2.0. All base pairs of the hybrid are labeled by their probabilities.

ACKNOWLEDGEMENTS

We thank Bill Chen and Sven Findeiß for comments on the manuscript. This work was supported by the 973 Project of the Ministry of Science and Technology, the PCSIRT Project of the Ministry of Education, and the National Science Foundation of China to CMR and his lab, grant No. STA 850/7-1 of the Deutsche Forschungsgemeinschaft under the auspices of SPP-1258 “Small Regulatory RNAs in Prokaryotes”, as well as the European Community FP-6 project SYNLET (Contract Number 043312) to PFS and his lab.

REFERENCES

Akutsu, T. (2000) Dynamic programming algorithms for RNA secondary structure prediction with pseudoknots. Disc. Appl. Math., 104, 45–62.

Alkan, C., Karakoc, E., Nadeau, J., Sahinalp, S. and Zhang, K. (2006) RNA-RNA interaction prediction and antisense RNA target search. J. Comput. Biol., 13, 267–282.


Ameres, S. and Schroeder, R. (2007) Molecular basis for target RNA recognition and cleavage by human RISC. Cell, 130, 101–112.

Andronescu, M., Zhang, Z. and Condon, A. (2005) Secondary structure prediction of interacting RNA molecules. J. Mol. Biol., 345, 1101–1112.

Bachellerie, J., Cavaille, J. and Huttenhofer, A. (2002) The expanding snoRNA world. Biochimie, 84, 775–790.

Banerjee, D. and Slack, F. (2002) Control of developmental timing by small temporal RNAs: a paradigm for RNA-mediated regulation of gene expression. Bioessays, 24, 119–129.

Benne, R. (1992) RNA editing in trypanosomes. The use of guide RNAs. Mol. Biol. Rep., 16, 217–227.

Bernhart, S., Tafer, H., Muckstein, U., Flamm, C., Stadler, P. and Hofacker, I. (2006) Partition function and base pairing probabilities of RNA heterodimers. Algorithms Mol. Biol., 1, 3–3.

Busch, A., Richter, A. and Backofen, R. (2008) IntaRNA: efficient prediction of bacterial sRNA targets incorporating target site accessibility and seed regions. Bioinformatics, 24, 2849–2856.

Chitsaz, H., Salari, R., Sahinalp, S. and Backofen, R. (2009) A partition function algorithm for interacting nucleic acid strands. In press.

Ding, Y. and Lawrence, C. (2003) A statistical sampling algorithm for RNA secondary structure prediction. Nucleic Acids Res., 31, 7280–7301.

Geissmann, T. and Touati, D. (2004) Hfq, a new chaperoning role: binding to messenger RNA determines access for small RNA regulator. EMBO J., 23, 396–405.

Giegerich, R. and Meyer, C. (2002) Lecture Notes In Computer Science, volume 2422, chapter Algebraic Dynamic Programming, pp. 349–364. Springer-Verlag.

Hofacker, I., Fontana, W., Stadler, P., Bonhoeffer, L., Tacker, M. and Schuster, P. (1994) Fast folding and comparison of RNA secondary structures. Monatsh. Chem., 125, 167–188.

Huang, F., Qin, J., Stadler, P. and Reidys, C. (2009) Partition function and base pairing probabilities for RNA-RNA interaction prediction. Bioinformatics.

Kertesz, M., Iovino, N., Unnerstall, U., Gaul, U. and Segal, E. (2007) The role of site accessibility in microRNA target recognition. Nat. Genet., 39, 1278–1284.

Kretschmer-Kazemi Far, R. and Sczakiel, G. (2003) The activity of siRNA in mammalian cells is related to structural target accessibility: a comparison with antisense oligonucleotides. Nucleic Acids Res., 31, 4417–4424.

Kugel, J. and Goodrich, J. (2007) An RNA transcriptional regulator templates its own regulatory RNA. Nat. Struct. Mol. Biol., 3, 89–90.

McCaskill, J. (1990) The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers, 29, 1105–1119.

McManus, M. and Sharp, P. (2002) Gene silencing in mammals by small interfering RNAs. Nature Reviews, 3, 737–747.

Mneimneh, S. (2007) On the approximation of optimal structures for RNA-RNA interaction. IEEE/ACM Trans. Comp. Biol. Bioinf. In press, doi.ieeecomputersociety.org/10.1109/TCBB.2007.70258.

Muckstein, U., Tafer, H., Bernhart, S., Hernandez-Rosales, M., Vogel, J., Stadler, P. and Hofacker, I. (2008) Translational control by RNA-RNA interaction: Improved computation of RNA-RNA binding thermodynamics. In Elloumi, M., Kung, J., Linial, M., Murphy, R. F., Schneider, K. and Toma, C. T. (eds.), BioInformatics Research and Development - BIRD 2008, volume 13 of Comm. Comp. Inf. Sci., pp. 114–127. Springer, Berlin.

Muckstein, U., Tafer, H., Hackermuller, J., Bernhart, S., Stadler, P. and Hofacker, I. (2006) Thermodynamics of RNA-RNA binding. Bioinformatics, 22, 1177–1182. Earlier version in: German Conference on Bioinformatics 2005, Torda, A., Kurtz, S. and Rarey, M. (eds.), Lecture Notes in Informatics P-71, pp. 3–13, Gesellschaft f. Informatik, Bonn 2005.

Narberhaus, F. and Vogel, J. (2007) Sensory and regulatory RNAs in prokaryotes: A new German research focus. RNA Biol., 4, 160–164.

Pervouchine, D. (2004) IRIS: Intermolecular RNA interaction search. Proc. Genome Informatics, 15, 92–101.

Qin, J. and Reidys, C. (2008) A framework for RNA tertiary interaction. Submitted.

Rehmsmeier, M., Steffen, P., Hochsmann, M. and Giegerich, R. (2004) Fast and effective prediction of microRNA/target duplexes. RNA, 10, 1507–1517.

Ren, J., Rastegari, B., Condon, A. and Hoos, H. (2005) HotKnots: heuristic prediction of RNA secondary structures including pseudoknots. RNA, 11, 1494–1504.

Rivas, E. and Eddy, S. (1999) A dynamic programming algorithm for RNA structure prediction including pseudoknots. J. Mol. Biol., 285, 2053–2068.

Storz, G. (2002) An expanding universe of noncoding RNAs. Science, 296, 140–144.

Tafer, H., Kehr, S., Hertel, J. and Stadler, P. (2009) RNAsnoop: Efficient target prediction for box H/ACA snoRNAs. Submitted.

Tjaden, B., Goodwin, S., Opdyke, J., Guillier, M., Fu, D., Gottesman, S. and Storz, G. (2006) Target prediction for small, noncoding RNAs in bacteria. Nucleic Acids Res., 34, 2791–2802.

Udekwu, K., Darfeuille, F., Vogel, J., Reimegad, J., Holmqvist, E. and Wagner, E. (2005) Hfq-dependent regulation of OmpA synthesis is mediated by an antisense RNA. Genes Dev., 19, 2355–2366.
