RNA secondary structure prediction
Doug Raiford
Lesson 7
RNA World Hypothesis
RNA world evolved into the DNA and protein world
DNA advantage: greater chemical stability
Protein advantage: more flexible and efficient enzymes (biomolecules that catalyze reactions)
▪ 20 amino acids vs. 4 nucleotides
▪ Chemically more diverse
Remnants remain in ribosomes, nucleases, polymerases, and splicing molecules
Primary: sequence
Secondary: double-stranded regions (reverse complements)
Tertiary: three-dimensional structure
>tRNA. Carries the amino acid isoleucine
AGGCUUGUAGCUCAGGUGGUUAGAGCGCACCCCUGAUAAGGGUGAGGUCGGUGGUUCAAGUCCACUCAGGCCUACCA
CCA Tail
Acceptor stem
D arm
Anticodon arm
Anticodon
T arm
How do we find regions of reverse complementarity?
What do we have? Sequence
A’s like pairing with U’s and G’s like pairing with C’s
Stronger bond (3 hydrogen bonds) between G’s and C’s
Should result in lowest free energy (max enthalpy)
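A minimal sketch of the brute-force idea behind the question above: slide a short window along the sequence and look for its reverse complement further downstream. The function names, the window length `stem_len`, and the minimum loop size `min_loop` are illustrative assumptions, not part of the lecture, and G-U wobble pairs are ignored.

```python
# Watson-Crick partners in RNA; G-U wobble pairs are deliberately ignored here.
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna):
    """Return the reverse complement of an RNA string (A/C/G/U only)."""
    return "".join(COMPLEMENT[base] for base in reversed(rna))


def find_stem_seeds(rna, stem_len=3, min_loop=3):
    """List (i, j) such that rna[i:i+stem_len] can pair with rna[j:j+stem_len].

    stem_len and min_loop are hypothetical tuning knobs: the window size and
    the fewest unpaired bases allowed between the two halves of the stem.
    """
    hits = []
    for i in range(len(rna) - stem_len + 1):
        partner = reverse_complement(rna[i:i + stem_len])
        j = rna.find(partner, i + stem_len + min_loop)
        while j != -1:
            hits.append((i, j))
            j = rna.find(partner, j + 1)
    return hits


print(reverse_complement("GGGA"))
print(find_stem_seeds("GGGAAAUCCC"))
```

This only finds exact, contiguous stems. It says nothing about which combination of stems gives the lowest total free energy, which is what the dynamic programming approach addresses.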
tRNA: transports amino acid to the ribosome
Visualization
Good at finding longer base-pairings (stacked base-pairs)
Need to find the conformation that provides the minimal total free energy
RNA often has many alternate conformations at different temperatures
Stacked base-pairs add stability
Loops/bulges introduce positive free energy and are destabilizing
First nucleotide base-pairs with last
First nucleotide base-pairs with some other (other than last) nucleotide (including none)
Recurse on rest
Recurse on every possible set of two strings
Recurrence relations
\[
E(S_{i,j}) = \min
\begin{cases}
\alpha(r_i, r_j) + E(S_{i+1,j-1}) \\
\min\{E(S_{i,k-1}) + E(S_{k,j})\} & \text{for } i < k \le j
\end{cases}
\]
As luck would have it… Zuker came up with a dynamic programming solution
Start with zeros on diagonal (rows i, columns j)
        G  G  G  A  A  A  U  C  C
    G   0
    G   0  0
    G      0  0
    A         0  0
    A            0  0
    A               0  0
    U                  0  0
    C                     0  0
    C                        0  0
Populate diagonally
Will look at last value to illustrate
Match first and last character, recurse on rest
Filled matrix (rows i, columns j)
        G  G  G  A  A  A  U  C  C
    G   0  0  0  0  0  0 -1 -2 -3
    G   0  0  0  0  0  0 -1 -2 -3
    G      0  0  0  0  0 -1 -2 -2
    A         0  0  0  0 -1 -1 -1
    A            0  0  0 -1 -1 -1
    A               0  0 -1 -1 -1
    U                  0  0  0  0
    C                     0  0  0
    C                        0  0
\[
\alpha(r_i, r_j) + E(S_{i+1,j-1}) = -1 + (-2) = -3
\]
    α    A   C   U   G
    A    0   0  -1   0
    C    0   0   0  -1
    U   -1   0   0   0
    G    0  -1   0   0
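The recurrence and the α table above can be turned into a short matrix-filling routine. This is a sketch under the simplified model on these slides (each A-U or G-C pair contributes -1, nothing else contributes), not Zuker's full energy model; the names `alpha`, `energy`, and `fill` are assumptions made for the example, and indices are 0-based.

```python
# Pairing energies from the alpha table: A-U and G-C score -1, all else 0.
PAIR_ENERGY = {("A", "U"): -1, ("U", "A"): -1, ("G", "C"): -1, ("C", "G"): -1}


def alpha(x, y):
    """Energy contribution of pairing bases x and y."""
    return PAIR_ENERGY.get((x, y), 0)


def energy(E, i, j):
    """E(S_ij); empty and single-base subsequences contribute 0."""
    return E[i][j] if i < j else 0


def fill(seq):
    """Fill the matrix diagonal by diagonal, shortest subsequences first."""
    n = len(seq)
    E = [[0] * n for _ in range(n)]
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            # Case 1: first nucleotide pairs with last, recurse on the rest.
            best = alpha(seq[i], seq[j]) + energy(E, i + 1, j - 1)
            # Case 2: minimum over every split into two substrings.
            for k in range(i + 1, j + 1):
                best = min(best, energy(E, i, k - 1) + energy(E, k, j))
            E[i][j] = best
    return E


E = fill("GGGAAAUCC")
for base, row in zip("GGGAAAUCC", E):
    print(base, row)
```

The top-right entry, `E[0][8]`, is the minimum energy for the whole sequence. Cells below the diagonal are never read and simply stay at 0.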
Min of all pairs of substrings (rows i, columns j; the top-right cell holds the final value, -3)
        G  G  G  A  A  A  U  C  C
    G   0  0  0  0  0  0 -1 -2 -3
    G   0  0  0  0  0  0 -1 -2 -3
    G      0  0  0  0  0 -1 -2 -2
    A         0  0  0  0 -1 -1 -1
    A            0  0  0 -1 -1 -1
    A               0  0 -1 -1 -1
    U                  0  0  0  0
    C                     0  0  0
    C                        0  0
GGGAAAUCC
[Figure: two alternative hairpin foldings of GGGAAAUCC, each pairing the 5' G's with the 3' U-C-C around a loop of A's]
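A folding like the ones pictured can be recovered by walking back through the filled matrix and asking, for each cell, which case of the recurrence produced its value. The sketch below repeats the fill step so that it runs on its own; `traceback` and its tie-breaking rule (prefer pairing the ends, then the first split that matches) are assumptions for illustration, so it reports just one of the equally good structures.

```python
PAIR_ENERGY = {("A", "U"): -1, ("U", "A"): -1, ("G", "C"): -1, ("C", "G"): -1}


def alpha(x, y):
    return PAIR_ENERGY.get((x, y), 0)


def energy(E, i, j):
    """E(S_ij); empty and single-base subsequences contribute 0."""
    return E[i][j] if i < j else 0


def fill(seq):
    n = len(seq)
    E = [[0] * n for _ in range(n)]
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            best = alpha(seq[i], seq[j]) + energy(E, i + 1, j - 1)
            for k in range(i + 1, j + 1):
                best = min(best, energy(E, i, k - 1) + energy(E, k, j))
            E[i][j] = best
    return E


def traceback(seq, E):
    """Return one minimum-energy structure in nested-parenthesis form."""
    structure = ["."] * len(seq)
    todo = [(0, len(seq) - 1)]
    while todo:
        i, j = todo.pop()
        if i >= j:
            continue
        pair = alpha(seq[i], seq[j])
        # Did pairing the two ends produce this cell's value?
        if pair < 0 and E[i][j] == pair + energy(E, i + 1, j - 1):
            structure[i], structure[j] = "(", ")"
            todo.append((i + 1, j - 1))
            continue
        # Otherwise some split into two substrings must explain it.
        for k in range(i + 1, j + 1):
            if E[i][j] == energy(E, i, k - 1) + energy(E, k, j):
                todo.append((i, k - 1))
                todo.append((k, j))
                break
    return "".join(structure)


seq = "GGGAAAUCC"
print(seq)
print(traceback(seq, fill(seq)))
```

Because this toy model has no loop penalties or minimum loop size, several structures tie at -3; a different tie-breaking rule would report a different one.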
Populate matrix plus traverse row/column for each cell
n² cells, with up to 2n row/column entries examined for each visited cell
So O(n³)
Any prediction method must account for these
Now O(n⁴)
Interior loops most expensive
Can exploit the fact that along diagonals, loops have the same size
Can calculate once
Limits search space
Back to O(n³)
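\[
E(S_{i,j}) = \min
\begin{cases}
E(S_{i+1,j}) \\
E(S_{i,j-1}) \\
\min\{E(S_{i,k-1}) + E(S_{k,j})\} & \text{for } i < k \le j \\
E(L_{i,j})
\end{cases}
\]
\[
E(L_{i,j}) =
\begin{cases}
\alpha(r_i, r_j) + \xi(j-i-1), & \text{if } L \text{ is a hairpin loop} \\
\alpha(r_i, r_j) + \eta + E(S_{i+1,j-1}), & \text{if } L \text{ is a helical region} \\
\min_{k \ge 1}\{\alpha(r_i, r_j) + \beta(k) + E(S_{i+k+1,j-1})\}, & \text{if } L \text{ is a bulge on } i \\
\min_{k \ge 1}\{\alpha(r_i, r_j) + \beta(k) + E(S_{i+1,j-k-1})\}, & \text{if } L \text{ is a bulge on } j \\
\min_{k_1,k_2 \ge 1}\{\alpha(r_i, r_j) + \gamma(k_1+k_2) + E(S_{i+k_1+1,j-k_2-1})\}, & \text{if } L \text{ is an interior loop}
\end{cases}
\]
ξ(k): destabilizing free energy of a hairpin loop with size k
η: stabilizing free energy of adjacent base pairs
β(k): destabilizing free energy of a bulge of size k
γ(k): destabilizing free energy of an interior loop of size k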
Zuker’s site
 1 gccgaggtgg tggaattggt agacacgcta ccttgaggtg gtagtgccca atagggctta
61 cgggttcaag tcccgtcctc ggtacca
tRNA for Leucine in E. coli, a prototypical organism
Codon: uua
Anti-codon: aat
Just like proteins: conformation
What if a U-A base-pair mutates to a G-C?
Still same function
What would this do to a search or sequence alignment?
GCAGGACCAUAUA
|||||||||||||
CGUCCUGGUAUAU

GCAGGACCAGAUA
|||||||||||||
CGUCCUGGUCUAU
Phenomenon known as covariance (not to be confused with statistical covariance)
How might we locate covariant pairs?
MSA then compare all pair-wise combinations of columns
High degree of agreement in two columns (G’s match with C’s, A’s match with U’s) is an indication of base-pairing
χ² test
Compare to expected number of pairings given sequence composition
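A rough sketch of the column-comparison idea, assuming the sequences are already aligned to equal length. It is a simplified stand-in for the χ² test, not an implementation of it: a column pair is reported when enough sequences show complementary bases there and more than one kind of pair is seen (so the columns actually co-vary rather than being conserved). The function name, the `min_fraction` threshold, and the toy alignment are all made up for the example.

```python
from itertools import combinations

# Watson-Crick pairs only; G-U wobble pairs are left out of this sketch.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def covariant_columns(msa, min_fraction=1.0):
    """Return column pairs (a, b) whose bases look like covarying base pairs.

    msa is a list of equal-length aligned RNA strings. min_fraction is a
    hypothetical cutoff on the share of sequences that must be complementary.
    """
    hits = []
    for a, b in combinations(range(len(msa[0])), 2):
        observed = [(seq[a], seq[b]) for seq in msa]
        paired = [p for p in observed if p in PAIRS]
        complementary_enough = len(paired) / len(msa) >= min_fraction
        covaries = len(set(paired)) > 1  # more than one kind of pair seen
        if complementary_enough and covaries:
            hits.append((a, b))
    return hits


toy_alignment = ["GAAC", "CAAG", "AAAU"]
print(covariant_columns(toy_alignment))
```

A real analysis would compare the observed pair counts with those expected from the base composition of each column, as the slide describes, rather than use a fixed cutoff.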
Pairing depicted with nested parentheses
AAGACUUCGGUCUGGCGACAUUC
  (((    ))) (( (   )))
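Nested parentheses are easy to turn back into a list of base pairs with a stack: push the position of every "(" and pop it when the matching ")" arrives. A minimal sketch, with the function name chosen for illustration; any character other than a parenthesis (a dot or a space) is treated as unpaired.

```python
def parse_pairs(structure):
    """Return sorted (i, j) index pairs for a nested-parenthesis string."""
    stack, pairs = [], []
    for pos, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(pos)
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {pos}")
            pairs.append((stack.pop(), pos))
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)


print(parse_pairs("((..))."))
```

The stack works because pairs in this notation are strictly nested; crossing pairs (pseudoknots) cannot be written with a single kind of bracket.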
Mountain plots
A mountain plot represents a secondary structure in a plot of height versus position, where the height m(k) is given by the number of base pairs enclosing the base at position k. I.e. loops correspond to plateaus (hairpin loops are peaks), helices to slopes.
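The heights m(k) can be computed in one pass over a nested-parenthesis string. This sketch assumes the strict reading of the definition above, where a pair (i, j) encloses position k only if i < k < j, so a paired base is not counted as enclosed by its own pair; the function name is an assumption.

```python
def mountain_heights(structure):
    """Return m(k) for every position k of a nested-parenthesis string."""
    heights = []
    depth = 0  # number of currently open base pairs
    for symbol in structure:
        if symbol == "(":
            heights.append(depth)  # its own pair does not enclose it
            depth += 1
        elif symbol == ")":
            depth -= 1
            heights.append(depth)
        else:
            heights.append(depth)
    return heights


print(mountain_heights("((..))"))
```

Plotting these values against position gives the mountain: the run of equal heights in the middle is the hairpin loop, and the rising and falling steps on either side are the helix.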
Circle plot
Data structure capable of capturing secondary structure
Ordered Binary Tree
Productions
S → aSu | uSa | cSg | gSc
S → aS | cS | gS | uS
S → Sa | Sc | Sg | Su
S → SS
S → ε
Derivation
S → aS
S → aSc
S → aScc
S → acSgcc
S → acgScgcc
S → acggSccgcc
S → acgggScccgcc
S → acggggSccccgcc
S → acgggguSccccgcc
S → acgggguuSccccgcc
S → acgggguucSccccgcc
S → acgggguucgSccccgcc
S → acgggguucgaSccccgcc
S → acgggguucgaaSccccgcc
S → acgggguucgaauSccccgcc
S → acgggguucgaauccccgcc
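Parse tree
[Figure: parse tree for acgggguucgaauccccgcc. Unpaired bases hang off one side of an S node (a←S, S→c), paired bases off both sides (c←S→g, g←S→c), and the loop bases branch below the innermost pair]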
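The derivation can be replayed mechanically: start from S and, at each step, replace the leftmost S with the right-hand side of one production. The sketch below also builds the matching nested-parenthesis string by writing "(S)" for pairing productions and "." for unpaired bases. The function name and the choice to pass the derivation as a list of right-hand sides are assumptions for illustration; the empty string stands for ε.

```python
PAIRING = {"aSu", "uSa", "cSg", "gSc"}
UNPAIRED = {"aS", "cS", "gS", "uS", "Sa", "Sc", "Sg", "Su"}
OTHER = {"SS", ""}  # bifurcation, and the empty production


def replay(steps):
    """Apply productions to the leftmost S; return (sentential forms, structure)."""
    form, shape = "S", "S"
    forms = [form]
    for rhs in steps:
        if rhs not in PAIRING | UNPAIRED | OTHER:
            raise ValueError(f"not a production of the grammar: S -> {rhs!r}")
        if rhs in PAIRING:
            template = "(S)"
        else:
            template = "".join("S" if ch == "S" else "." for ch in rhs)
        form = form.replace("S", rhs, 1)
        shape = shape.replace("S", template, 1)
        forms.append(form)
    return forms, shape


steps = ["aS", "Sc", "Sc", "cSg", "gSc", "gSc", "gSc", "gSc",
         "uS", "uS", "cS", "gS", "aS", "aS", "uS", ""]
forms, shape = replay(steps)
print(forms[-1])
print(shape)
```

Which production is used at each step is exactly what a parser has to work out from the sequence alone; here the choices are supplied by hand, mirroring the slide.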
Conformation of RNA dictates function
Determining secondary structure can help determine tertiary structure
Dynamic programming approach to identifying minimum energy conformations
Zuker MFOLD
View using dot plots, nested parens, mountain or circular plots
Covariance: base-pairs mutate but still form pairs, exploit to find pairings