RNA secondary structure prediction

Doug Raiford, Lesson 7

RNA World Hypothesis

RNA world evolved into the DNA and protein world

DNA advantage: greater chemical stability

Protein advantage: more flexible and efficient enzymes (biomolecules that catalyze reactions)
▪ 20 amino acids vs. 4 nucleotides
▪ Chemically more diverse

Remnants remain in ribosomes, nucleases, polymerases, and splicing molecules

Primary: sequence

Secondary: double-stranded regions (reverse complements)

Tertiary: three-dimensional structure

>tRNA. Carries amino acid for Isoleucine
AGGCUUGUAGCUCAGGUGGUUAGAGCGCACCCCUGAUAAGGGUGAGGUCGGUGGUUCAAGUCCACUCAGGCCUACCA

[Figure: tRNA cloverleaf, labeled: CCA tail, acceptor stem, D arm, anticodon arm, anticodon, T arm]

How do we find regions of reverse complementarity?

What do we have? Sequence

A’s like pairing with U’s and G’s like pairing with C’s

Stronger bond (3 hydrogen bonds) between G’s and C’s

Should result in lowest free energy (max enthalpy)
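The pairing rules above can be sketched in a few lines. This is only an illustration: it treats A-U and G-C as the sole allowed pairs (no G-U wobble), and it assumes the usual tRNA layout in which the first 7 nucleotides form the acceptor stem with the 7 nucleotides that precede a 4-nucleotide unpaired 3' tail. The function names are made up for this example.

```python
# Watson-Crick partners for RNA; G-U wobble pairs are deliberately left out.
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna: str) -> str:
    """Return the strand that would pair with `rna`, written 5'->3'."""
    return "".join(COMPLEMENT[base] for base in reversed(rna))


def count_watson_crick(five_prime: str, three_prime: str) -> int:
    """Count A-U / G-C pairs between two antiparallel stretches."""
    return sum(
        COMPLEMENT[a] == b for a, b in zip(five_prime, reversed(three_prime))
    )


# The isoleucine tRNA sequence from the slide.
TRNA_ILE = (
    "AGGCUUGUAGCUCAGGUGGUUAGAGCGCACCCCUGAUAAGGGUGAGGUCGGUGGUUCAAGUCCACUCAGGCCUACCA"
)

# Assumption: 7-nt acceptor stem, then a 4-nt unpaired tail at the 3' end.
stem_5p = TRNA_ILE[:7]
stem_3p = TRNA_ILE[-11:-4]

if __name__ == "__main__":
    print(stem_5p, stem_3p, count_watson_crick(stem_5p, stem_3p))
```

Under these assumptions, six of the seven stem positions are Watson-Crick pairs; the remaining one is a U opposite a G, which this sketch does not count.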

tRNA: transports amino acid to the ribosome


Visualization

Good at finding longer base-pairings (stacked base-pairs)
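Assuming the visualization meant here is a dot plot of the sequence against itself, a minimal sketch looks like this: mark every cell (i, j) whose bases can pair, then look for runs along anti-diagonals, which correspond to stacked base-pairs. The helper names and the restriction to A-U and G-C pairs are assumptions for illustration.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def dot_plot(seq):
    """Upper-triangular matrix: True where base i can pair with base j (i < j)."""
    n = len(seq)
    return [
        [i < j and (seq[i], seq[j]) in PAIRS for j in range(n)]
        for i in range(n)
    ]


def render(seq):
    """Text rendering of the dot plot: '*' marks a possible pair."""
    rows = ["  " + " ".join(seq)]
    for base, row in zip(seq, dot_plot(seq)):
        rows.append(base + " " + " ".join("*" if cell else "." for cell in row))
    return "\n".join(rows)


def longest_stem(seq):
    """Return (length, i, j) for the longest run of stacked pairs.

    (i, j) is the outermost pair of the run, 0-based.
    """
    plot = dot_plot(seq)
    n = len(seq)
    best = (0, -1, -1)
    for i in range(n):
        for j in range(i + 1, n):
            length = 0
            # Walk inward along the anti-diagonal while bases keep pairing.
            while i + length < j - length and plot[i + length][j - length]:
                length += 1
            if length > best[0]:
                best = (length, i, j)
    return best


if __name__ == "__main__":
    print(render("GGGAAAUCC"))
    print(longest_stem("GGGAAAUCC"))
```

A dot plot finds long stems easily, but it says nothing about which combination of stems gives the lowest total energy.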

Need to find the conformation that provides the minimal total free energy

RNA often has many alternate conformations at different temperatures

Stacked base-pairs add stability

Loops/bulges introduce positive free energy and are destabilizing

First nucleotide base-pairs with last

First nucleotide base-pairs with some other (other than last) nucleotide (including none)

Recurse on rest

Recurse on every possible set of two strings

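Recurrence relations

\[
E(S_{i,j}) = \min
\begin{cases}
E(S_{i+1,j-1}) + \alpha(r_i, r_j) \\
\min\,[\,E(S_{i,k}) + E(S_{k+1,j})\,] \quad \text{for } i \le k < j
\end{cases}
\]

where E(S_{i,j}) is the minimum energy of the substring from position i to position j, and \alpha(r_i, r_j) is the energy of pairing bases r_i and r_j.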
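The recurrence can be tried out directly as a memoized recursion. This is a sketch under the lesson's simplified energy model, taking the pairing energy to be -1 for A-U and G-C and 0 otherwise, letting empty and single-base substrings cost 0, and splitting on every k with i <= k < j. The function name is an assumption.

```python
from functools import lru_cache

# Simplified pairing energies: -1 for A-U and G-C, 0 for everything else.
ALPHA = {("A", "U"): -1, ("U", "A"): -1, ("G", "C"): -1, ("C", "G"): -1}


def min_energy(seq: str) -> int:
    """Minimum energy of `seq` under the two-case recurrence."""

    @lru_cache(maxsize=None)
    def energy(i: int, j: int) -> int:
        if j <= i:  # empty or single-base substring
            return 0
        # Case 1: first base pairs with last, recurse on the rest.
        paired = ALPHA.get((seq[i], seq[j]), 0) + energy(i + 1, j - 1)
        # Case 2: best way to cut the substring into two pieces.
        split = min(energy(i, k) + energy(k + 1, j) for k in range(i, j))
        return min(paired, split)

    return energy(0, len(seq) - 1)


if __name__ == "__main__":
    print(min_energy("GGGAAAUCC"))
```

Without the cache the same substrings are re-solved many times over, which is the motivation for a table-based formulation.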

As luck would have it… Zuker came up with a dynamic programming solution

Start with zeros on the diagonal and on the sub-diagonal just below it (rows are indexed by i, columns by j)

         G   G   G   A   A   A   U   C   C
    G    0
    G    0   0
    G        0   0
    A            0   0
    A                0   0
    A                    0   0
    U                        0   0
    C                            0   0
    C                                0   0

Populate diagonally

Will look at last value to illustrate

Match first and last character, recurse on rest

Completed matrix (rows i, columns j)

         G   G   G   A   A   A   U   C   C
    G    0   0   0   0   0   0  -1  -2  -3
    G    0   0   0   0   0   0  -1  -2  -3
    G        0   0   0   0   0  -1  -2  -2
    A            0   0   0   0  -1  -1  -1
    A                0   0   0  -1  -1  -1
    A                    0   0  -1  -1  -1
    U                        0   0   0   0
    C                            0   0   0
    C                                0   0

For the last value (i = 1, j = 9):

\[
\alpha(r_i, r_j) + E(S_{i+1,j-1}) = -1 + (-2) = -3
\]

    α    A   C   U   G
    A    0   0  -1   0
    C    0   0   0  -1
    U   -1   0   0   0
    G    0  -1   0   0

Min of all pairs of substrings

[Matrix: the completed table for GGGAAAUCC again, with the top-right cell (i = 1, j = 9) highlighted at -3]
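The table fill can be sketched bottom-up, one diagonal at a time. This assumes the α values from the table (-1 for A-U and G-C, 0 otherwise) and zeros on and below the diagonal; `fill_matrix` is a name chosen for this example, not part of any tool.

```python
# Pairing energies from the alpha table: -1 for A-U and G-C, 0 otherwise.
ALPHA = {("A", "U"): -1, ("U", "A"): -1, ("G", "C"): -1, ("C", "G"): -1}


def fill_matrix(seq: str):
    """Fill the energy table E[i][j] diagonal by diagonal (0-based indices)."""
    n = len(seq)
    # Zeros everywhere to start; cells on and below the diagonal stay zero.
    E = [[0] * n for _ in range(n)]
    for span in range(1, n):  # span = j - i, i.e. which diagonal
        for i in range(n - span):
            j = i + span
            # Match first and last character, recurse on the rest.
            paired = ALPHA.get((seq[i], seq[j]), 0) + E[i + 1][j - 1]
            # Min over all ways of cutting the substring in two.
            split = min(E[i][k] + E[k + 1][j] for k in range(i, j))
            E[i][j] = min(paired, split)
    return E


if __name__ == "__main__":
    for base, row in zip("GGGAAAUCC", fill_matrix("GGGAAAUCC")):
        print(base, " ".join(f"{value:3d}" for value in row))
```

Because each diagonal only reads cells from shorter spans, every value is already in the table by the time it is needed.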

GGGAAAUCC

[Figure: two alternative hairpin foldings of GGGAAAUCC, each with G-C pairs in the stem and the A's in the loop]

Populate the matrix (n² cells) plus traverse a row and a column (up to 2n entries) for each visited cell

So O(n³)
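The cubic bound can be checked with a short count. Assuming the split step tries every cut point k with i ≤ k < j, a cell on diagonal d = j − i evaluates d splits, and there are n − d cells on that diagonal:

\[
\sum_{d=1}^{n-1} (n-d)\,d \;=\; \frac{n^3 - n}{6} \;=\; O(n^3)
\]

For the 9-nucleotide example this is (729 − 9)/6 = 120 split evaluations.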

Any prediction method must account for these

Now O(n⁴)

Interior loops most expensive

Can exploit the fact that along diagonals, loops have the same size

Can calculate once

Limits search space

Back to O(n³)

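\[
E(S_{i,j}) = \min
\begin{cases}
E(S_{i+1,j}) \\
E(S_{i,j-1}) \\
\min\,[\,E(S_{i,k}) + E(S_{k+1,j})\,] \quad \text{for } i \le k < j \\
E(L_{i,j})
\end{cases}
\]

\[
E(L_{i,j}) =
\begin{cases}
\alpha(r_i, r_j) + \xi(j-i-1), & \text{if } L \text{ is a hairpin loop} \\
\alpha(r_i, r_j) + \eta + E(S_{i+1,j-1}), & \text{if } L \text{ is a helical region} \\
\min_k \{\alpha(r_i, r_j) + \beta(k) + E(S_{i+k+1,j-1})\}, & \text{if } L \text{ is a bulge on } i \\
\min_k \{\alpha(r_i, r_j) + \beta(k) + E(S_{i+1,j-k-1})\}, & \text{if } L \text{ is a bulge on } j \\
\min_{k_1,k_2} \{\alpha(r_i, r_j) + \iota(k_1+k_2) + E(S_{i+k_1+1,j-k_2-1})\}, & \text{if } L \text{ is an interior loop}
\end{cases}
\]

\xi(k) = destabilizing free energy of a hairpin loop with size k
\eta = stabilizing free energy of adjacent base pairs
\beta(k) = destabilizing free energy of a bulge of size k
\iota(k) = destabilizing free energy of an interior loop of size k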

Zuker’s site

     1 gccgaggtgg tggaattggt agacacgcta ccttgaggtg gtagtgccca atagggctta
    61 cgggttcaag tcccgtcctc ggtacca

tRNA for Leucine in E. coli, a prototypical organism

Codon: uua
Anti-codon: aat


http://www.bioinfo.rpi.edu/zukerm/cgi-bin/rna-index.cgi

Just like proteins: conformation

What if a U-A base-pair mutates to a G-C? Still the same function

What would this do to a search or sequence alignment?

    GCAGGACCAUAUA
    |||||||||||||
    CGUCCUGGUAUAU

    GCAGGACCAGAUA
    |||||||||||||
    CGUCCUGGUCUAU

Phenomenon known as covariance (not to be confused with statistical covariance)


How might we locate covariant pairs?

MSA then compare all pair-wise combinations of columns

High degree of agreement in two columns (G’s match with C’s, A’s match with U’s) is an indication of base-pairing

χ² test: compare to the expected number of pairings given sequence composition
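A column-pair scan along these lines might look like the sketch below. Everything specific in it is an assumption for illustration: the four-sequence toy alignment is invented, the expected pairing rate comes from the overall base composition, and the statistic is a plain one-degree-of-freedom χ² compared against 3.84. With so few sequences the test is not statistically meaningful; the point is only the mechanics.

```python
from itertools import combinations

COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def covariant_columns(alignment, critical=3.84):
    """Return column pairs (i, j) that pair more often than composition predicts."""
    n = len(alignment)
    width = len(alignment[0])

    # Chance that two random bases are complementary, from overall composition.
    text = "".join(alignment)
    freq = {base: text.count(base) / len(text) for base in "ACGU"}
    q = 2 * (freq["A"] * freq["U"] + freq["G"] * freq["C"])
    if not 0 < q < 1:
        return []  # chi-square is undefined for degenerate compositions
    expected = n * q

    hits = []
    for i, j in combinations(range(width), 2):
        # Number of sequences whose bases in columns i and j can pair.
        observed = sum(COMPLEMENT.get(s[i]) == s[j] for s in alignment)
        chi2 = ((observed - expected) ** 2 / expected
                + ((n - observed) - (n - expected)) ** 2 / (n - expected))
        if observed > expected and chi2 > critical:
            hits.append((i, j))
    return hits


# Hypothetical toy alignment: columns 0/5 and 1/4 co-vary, columns 2-3 are a loop.
TOY_MSA = ["GAAAUC", "CUAAAG", "AGAACU", "UCAAGA"]

if __name__ == "__main__":
    print(covariant_columns(TOY_MSA))
```

In the toy alignment the bases in columns 0 and 5 differ from sequence to sequence yet always remain complementary, which is exactly the signal a plain sequence alignment score would penalize.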

Pairing depicted with nested parentheses

AAGACUUCGGUCUGGCGACAUUC ((( ))) (( ( )))
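Nested parentheses are easy to turn back into a list of base-pairs with a stack. The sketch below uses the first 12 nucleotides of the sequence above and assumes, for illustration, that its first hairpin pairs GAC with GUC around the UUCG loop; the dot-bracket string and the function name are this example's own.

```python
def parse_dot_bracket(structure: str):
    """Return the sorted list of (i, j) base-pairs encoded by nested parentheses."""
    stack = []
    pairs = []
    for pos, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(pos)  # remember where this pair opens
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {pos}")
            pairs.append((stack.pop(), pos))  # closes the most recent '('
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)


# Assumed example: first hairpin of the slide's sequence.
SEQ = "AAGACUUCGGUC"
STRUCTURE = "..(((....)))"

if __name__ == "__main__":
    for i, j in parse_dot_bracket(STRUCTURE):
        print(i, j, SEQ[i], SEQ[j])
```

The stack works because pairs in this notation are always nested: a closing parenthesis can only match the most recent unmatched opening one.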

Mountain plots

A mountain plot represents a secondary structure in a plot of height versus position, where the height m(k) is given by the number of base pairs enclosing the base at position k. I.e. loops correspond to plateaus (hairpin loops are peaks), helices to slopes.
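The height function m(k) can be computed in one pass over a nested-parentheses string. This is a minimal sketch; the example structure is hypothetical, and a paired base is treated here as enclosed only by the pairs outside its own.

```python
def mountain_heights(structure: str):
    """m(k): number of base-pairs (i, j) with i < k < j, for every position k."""
    heights = []
    depth = 0
    for symbol in structure:
        if symbol == "(":
            heights.append(depth)  # enclosed by the pairs opened so far
            depth += 1
        elif symbol == ")":
            depth -= 1
            heights.append(depth)  # its own pair does not enclose it
        else:
            heights.append(depth)
    return heights


def render_mountain(structure: str) -> str:
    """Crude text plot: one row per height level, highest level first."""
    heights = mountain_heights(structure)
    top = max(heights, default=0)
    rows = []
    for level in range(top, 0, -1):
        rows.append("".join("#" if h >= level else " " for h in heights))
    rows.append(structure)
    return "\n".join(rows)


if __name__ == "__main__":
    print(mountain_heights("..(((....)))"))
    print(render_mountain("..(((....)))"))
```

For a single hairpin the heights rise along the 5' side of the helix, stay flat across the loop, and fall along the 3' side.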

Circle plot

Data structure capable of capturing secondary structure

Ordered Binary Tree

Productions
S → aSu | uSa | cSg | gSc
S → aS | cS | gS | uS
S → Sa | Sc | Sg | Su
S → SS
S → ∅

Derivation
S → aS
S → aSc
S → aScc
S → acSgcc
S → acgScgcc
S → acggSccgcc
S → acgggScccgcc
S → acggggSccccgcc
S → acgggguSccccgcc
S → acgggguuSccccgcc
S → acgggguucSccccgcc
S → acgggguucgSccccgcc
S → acgggguucgaSccccgcc
S → acgggguucgaaSccccgcc
S → acgggguucgaauSccccgcc
S → acgggguucgaauccccgcc

Parse tree (each S is the child of the S above it)
    a←S
       S→c
       S→c
     c←S→g
     g←S→c
     g←S→c
     g←S→c
     g←S→c
     u←S
     u←S
     c←S
     g←S
     a←S
     a←S
     u←S
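The derivation can be replayed mechanically, which also shows how the grammar records pairing: every production that emits a base on both sides of S contributes one base-pair. In this sketch each step is written as a (left, right) tuple of emitted bases, a representation chosen for illustration only, and the final S → ∅ step is implicit.

```python
def derive(steps):
    """Apply productions in order; return (sequence, nested-parentheses string)."""
    prefix = suffix = ""        # bases emitted left and right of S so far
    marks_left = marks_right = ""
    for left, right in steps:
        paired = bool(left) and bool(right)  # e.g. S -> gSc emits a pair
        prefix += left
        suffix = right + suffix
        marks_left += ("(" if paired else ".") * len(left)
        marks_right = (")" if paired else ".") * len(right) + marks_right
    # Dropping S (the S -> empty production) joins the two halves.
    return prefix + suffix, marks_left + marks_right


# The productions used in the derivation above, in order.
STEPS = (
    [("a", ""), ("", "c"), ("", "c"), ("c", "g")]
    + [("g", "c")] * 4
    + [(base, "") for base in "uucgaau"]
)

if __name__ == "__main__":
    sequence, structure = derive(STEPS)
    print(sequence)
    print(structure)
```

Read this way, the parse tree is the secondary structure: the paired productions form the stem, and the single-sided ones form the loop and the unpaired ends.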

Conformation of RNA dictates function

Determining secondary structure can help determine tertiary structure

Dynamic programming approach to identifying minimum energy conformations
▪ Zuker MFOLD

View using dot plots, nested parens, mountain or circular plots

Covariance: base-pairs mutate but still form pairs; exploit this to find pairings
