Post on 29-Jan-2016
Lecture 7. Topics in RNA Bioinformatics (Identification of RNA Structures)
The Chinese University of Hong Kong
CSCI5050 Bioinformatics and Computational Biology
CSCI5050 Bioinformatics and Computational Biology | Kevin Yip-cse-cuhk | Fall 2015 2
Lecture outline
1. Sequence-based prediction methods
2. RNA footprinting and high-throughput methods
Last update: 17-Oct-2015
SEQUENCE-BASED PREDICTION METHODS
Part 1
RNA structures
• Some RNAs have strong structural features that are closely related to their functions
[Figures: tRNA, snoRNA, rRNA]
Image sources: http://www.bio.miami.edu/dana/pix/tRNA.jpg; http://lowelab.ucsc.edu/images/CDBox.jpg; http://www.daviddarling.info/images/ribosome_and_RNA.jpg
Secondary vs. tertiary structure
• Four levels of molecular structure:
  – Primary: The sequence
  – Secondary: Local interactions
  – Tertiary: Global interactions
  – Quaternary: Inter-molecule interactions
• Both secondary and tertiary RNA structures are meaningful
  – However, more work has been devoted to identifying/predicting RNA secondary structures
  – Secondary structures are also the focus of this lecture
Last update: 20-Oct-2015
Methods for predicting RNA structures
• Wikipedia contains a comprehensive list: https://en.wikipedia.org/wiki/List_of_RNA_structure_prediction_software
• Main classes:
  – Models specific to a particular type of RNA
  – Based on a single sequence
    • Minimum free energy (MFE)
    • Partition function
  – Based on comparison of multiple sequences
Type-specific models
• Example: tRNAscan-SE for finding tRNAs and predicting tRNA structures
• Three main phases:
  1. Running tRNAscan and the Pavesi algorithm to find candidate tRNAs
  2. Using a covariance model to identify the more confident candidates
  3. Trimming the candidates and predicting the detailed secondary structures
Workflow of tRNAscan-SE
Image credit: Lowe and Eddy, Nucleic Acids Research 25(5):955-964, (1997)
[Figure: tRNAscan-SE workflow, with Step 1 and Step 2 marked]
1a. tRNAscan
• Features used:
  – Invariant and semi-invariant bases
  – Potential base-pairing structures consistent with the cloverleaf secondary structure
    • The aminoacyl arm, the D arm, the anticodon arm and the TΨC arm
  – Length and position of potential intron sequences
Image credit: Fichant and Burks, Journal of Molecular Biology 220(3):659-671, (1991)
1a. tRNAscan: sequential tests
Image credit: Fichant and Burks, Journal of Molecular Biology 220(3):659-671, (1991)
1b. The Pavesi algorithm
• Frequency tables based on 231 nuclear tRNA genes:
Image credit: Pavesi et al., Nucleic Acids Research 22(7):1247-1256, (1994)
1b. The Pavesi algorithm: workflow
Image credit: Pavesi et al., Nucleic Acids Research 22(7):1247-1256, (1994)
2. Chomsky hierarchy of languages
Image source: http://www.cs.utexas.edu/users/novak/cs343283.html
2. Context-free grammar
• Context-free grammars are used in RNA secondary structure representation to capture pairing relationships. Example:
Example credit: Sakakibara et al., CPM 289-306, (1994)
Productions P = {
  S0 → S1, S1 → C S2 G, S1 → A S2 U, S2 → A S3 U, S3 → S4 S9, S4 → U S5 A,
  S5 → C S6 G, S6 → A S7, S7 → U S7, S7 → G S8, S8 → G, S8 → U,
  S9 → A S10 U, S10 → C S10 G, S10 → G S11 C, S11 → A S12 U, S12 → U S13, S13 → C }
One possible derivation:
S0 ⇒ S1
   ⇒ CS2G
   ⇒ CAS3UG
   ⇒ CAS4S9UG
   ⇒ CAUS5AS9UG
   ⇒ CAUCS6GAS9UG
   ⇒ CAUCAS7GAS9UG
   ⇒ CAUCAGS8GAS9UG
   ⇒ CAUCAGGGAS9UG
   ⇒ CAUCAGGGAAS10UUG
   ⇒ CAUCAGGGAAGS11CUUG
   ⇒ CAUCAGGGAAGAS12UCUUG
   ⇒ CAUCAGGGAAGAUS13UCUUG
   ⇒ CAUCAGGGAAGAUCUCUUG
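The derivation above can be replayed mechanically. The following is a minimal sketch, assuming the productions are read as listed on this slide; the function name `derive` and the list `CHOICES` (which alternative to apply at each step) are illustrative names, not part of the original example.

```python
# Productions from the slide: each nonterminal maps to its alternatives.
# Nonterminals are tokens such as "S10"; terminals are the bases A, C, G, U.
PRODUCTIONS = {
    "S0": [["S1"]],
    "S1": [["C", "S2", "G"], ["A", "S2", "U"]],
    "S2": [["A", "S3", "U"]],
    "S3": [["S4", "S9"]],
    "S4": [["U", "S5", "A"]],
    "S5": [["C", "S6", "G"]],
    "S6": [["A", "S7"]],
    "S7": [["U", "S7"], ["G", "S8"]],
    "S8": [["G"], ["U"]],
    "S9": [["A", "S10", "U"]],
    "S10": [["C", "S10", "G"], ["G", "S11", "C"]],
    "S11": [["A", "S12", "U"]],
    "S12": [["U", "S13"]],
    "S13": [["C"]],
}


def derive(choices, start="S0"):
    """Leftmost derivation: choices[t] selects the alternative used at step t."""
    form = [start]
    steps = ["".join(form)]
    for choice in choices:
        # Find the leftmost nonterminal and expand it in place.
        idx = next(i for i, tok in enumerate(form) if tok in PRODUCTIONS)
        form[idx:idx + 1] = PRODUCTIONS[form[idx]][choice]
        steps.append("".join(form))
    return steps


# Alternative indices that reproduce the derivation shown on the slide.
CHOICES = [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0]
steps = derive(CHOICES)
print(" => ".join(steps))
```

Productions of the form X → a Y b emit a base pair (a, b) around the sub-structure generated by Y, which is how the grammar records pairing; productions such as X → a Y emit unpaired bases.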
2. Parse tree and RNA structure
Figure credit: Sakakibara et al., The 5th Annual Symposium on Combinatorial Pattern Matching 289-306, (1994)
2. SCFG and CM
• Stochastic context-free grammar (SCFG): context-free grammar with probabilistic derivation
• Covariance model (CM): model for representing RNA sequence and structure profiles based on SCFG
2. SCFG and CM
[Figure panels: (1) Input multiple sequence alignment and consensus structure; (2) Construction of guide tree from consensus structure; (3) Output CM]
Image credit: INFERNAL user’s guide

Node   Description
MATP   Pair
MATL   Single strand, left
MATR   Single strand, right
BIF    Bifurcation
ROOT   Root
BEGL   Begin, left
BEGR   Begin, right
END    End

White state: consensus
Gray state: indels
Based on a single sequence
• Minimum free energy (MFE): Finding the RNA structure (pairing of bases) that minimizes the free energy
  – More pairing
  – More stable pairing
    • Strong GC pairing
    • Stable structures such as stacking pairs
Turner’s free energy model
• Assumes that the total free energy of an RNA structure is the sum of the free energies of its sub-structures
Image credit: http://www.clcbio.com/scienceimages/rna_prediction/RNA_structure_prediction_web.png
Turner’s energy parameters
• For hairpin loops:
Table credit: Mathews et al., Journal of Molecular Biology 288(5):911-940, (1999)
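Energy minimization
• Dynamic programming (without pseudoknots)

$$V(j) = \min \begin{cases} V(j-1) & j \text{ is unpaired} \\ \min_{1 \le i < j} \{ \mathrm{VP}(i,j) + V(i-1) \} & j \text{ pairs with } i \end{cases}$$

$$\mathrm{VP}(i,j) = \min \begin{cases} \mathrm{eS}(i,j) + \mathrm{VP}(i+1,j-1) & (i,j) \text{ and } (i+1,j-1) \text{ form stacking pairs} \\ \mathrm{eH}(i,j) & (i,j) \text{ closes a hairpin loop} \\ \mathrm{VBI}(i,j) & (i,j) \text{ closes an internal loop} \\ \mathrm{VM}(i,j) & (i,j) \text{ closes a multi loop} \end{cases}$$

$$\mathrm{VBI}(i,j) = \min_{i_1,j_1:\; i<i_1<j_1<j} \{ \mathrm{eBI}(i,j,i_1,j_1) + \mathrm{VP}(i_1,j_1) \}$$

$$\mathrm{VM}(i,j) = \min_{i_1,j_1,\ldots,i_k,j_k:\; i<i_1<j_1<\cdots<i_k<j_k<j} \left\{ \mathrm{eM}(i,j,i_1,j_1,\ldots,i_k,j_k) + \sum_{h=1}^{k} \mathrm{VP}(i_h,j_h) \right\}$$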
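The recurrences above can be sketched in a few lines of Python. This is a simplified illustration, not Turner's model: the energy values in `e_stack`, `e_hairpin` and `e_internal` are made-up toy numbers, multi loops (VM) are left out, and the minimum hairpin loop size of 3 is an assumption.

```python
from functools import lru_cache

PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
MIN_HAIRPIN = 3          # assumed minimum number of unpaired bases in a hairpin
INF = float("inf")


def mfe(seq):
    """Minimum free energy of seq under toy energies (no multi loops)."""
    n = len(seq)

    def can_pair(i, j):
        return (seq[i], seq[j]) in PAIRS and j - i - 1 >= MIN_HAIRPIN

    # Toy energy functions (hypothetical values, for illustration only).
    def e_stack(i, j):
        return -3.0

    def e_hairpin(i, j):
        return 4.0

    def e_internal(i, j, i1, j1):
        unpaired = (i1 - i - 1) + (j - j1 - 1)
        return 2.0 + 0.5 * unpaired

    @lru_cache(maxsize=None)
    def vp(i, j):
        """Minimum energy of seq[i..j] given that i pairs with j (0-based)."""
        if not can_pair(i, j):
            return INF
        best = e_hairpin(i, j)                      # (i, j) closes a hairpin
        if can_pair(i + 1, j - 1):                  # stacking pair
            best = min(best, e_stack(i, j) + vp(i + 1, j - 1))
        for i1 in range(i + 1, j - MIN_HAIRPIN - 1):    # internal loop / bulge
            for j1 in range(i1 + MIN_HAIRPIN + 1, j):
                if (i1, j1) != (i + 1, j - 1) and can_pair(i1, j1):
                    best = min(best, e_internal(i, j, i1, j1) + vp(i1, j1))
        return best

    # v[j] is the minimum energy of the first j bases (1-based, as on the slide).
    v = [0.0] * (n + 1)
    for j in range(1, n + 1):
        best = v[j - 1]                             # j is unpaired
        for i in range(1, j):                       # j pairs with i
            energy = vp(i - 1, j - 1)
            if energy < INF:
                best = min(best, energy + v[i - 1])
        v[j] = best
    return v[n]


print(mfe("GGGAAACCC"))
```

With these toy values, the best structure for `GGGAAACCC` is a three-pair stem closing an `AAA` hairpin: two stacking terms plus one hairpin term. A traceback table would be needed to recover the pairs themselves; only the energy is returned here.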
Partition function
• Issues of MFE:
  – Solution may not be optimal
  – There can be different structures with similar free energy
• The partition function sums over the relative likelihood of all possible secondary structures:

$$Q = \sum_{S} e^{-\Delta G(S)/RT}$$

  – S: Possible secondary structures
  – ΔG(S): Gibbs free energy change
  – R: Gas constant
  – T: Absolute temperature
• Probability of a particular structure s:

$$\Pr(s) = \frac{e^{-\Delta G(s)/RT}}{Q}$$

• Probability that bases i and j form a pair:

$$\Pr(i \text{ and } j \text{ form a pair}) = \frac{\sum_{s:\,(i,j) \in s} e^{-\Delta G(s)/RT}}{Q}$$

Mathews, RNA 10(8):1178-1190, (2004)
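To make the three quantities above concrete, here is a small sketch that evaluates them by brute force over an explicitly listed ensemble. The structures and their free energies in `ensemble` are hypothetical; real tools compute Q by dynamic programming rather than by enumeration.

```python
import math

R = 0.0019872   # gas constant in kcal/(mol*K)
T = 310.15      # 37 degrees Celsius in kelvin


def boltzmann_probabilities(energies, rt=R * T):
    """Return (Q, {structure: Pr(structure)}) for a dict structure -> delta G."""
    weights = {s: math.exp(-dg / rt) for s, dg in energies.items()}
    q = sum(weights.values())
    return q, {s: w / q for s, w in weights.items()}


def pair_probability(energies, i, j, rt=R * T):
    """Sum the probabilities of all structures that contain the pair (i, j)."""
    _, probs = boltzmann_probabilities(energies, rt)
    return sum(p for s, p in probs.items() if (i, j) in s)


# Hypothetical ensemble: each structure is a frozenset of (i, j) base pairs,
# mapped to a made-up free energy change in kcal/mol.
ensemble = {
    frozenset(): 0.0,
    frozenset({(0, 8), (1, 7), (2, 6)}): -2.0,
    frozenset({(0, 8), (1, 7)}): 1.0,
}

q, probs = boltzmann_probabilities(ensemble)
print("Q =", q)
print("Pr(0 pairs with 8) =", pair_probability(ensemble, 0, 8))
```

Two structures with similar free energy receive similar probabilities, which is the situation in which reporting only the MFE structure hides the alternatives.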
Based on multiple sequences
• The conservation of a base and the co-conservation of a base pair in multiple sequences can help resolve ambiguous cases
  – In fact, a CM can be trained from a multiple sequence alignment
• Main types:
  – Joint optimization
  – Consensus/alignment of individual structures
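Co-conservation of a base pair can be quantified in several ways. One common choice, which is an assumption here since the slides do not name a measure, is the mutual information between two alignment columns. The sketch below uses a made-up four-sequence alignment and ignores gaps.

```python
import math
from collections import Counter


def mutual_information(alignment, col_i, col_j):
    """Mutual information (in bits) between two columns of an alignment."""
    n = len(alignment)
    freq_i = Counter(seq[col_i] for seq in alignment)
    freq_j = Counter(seq[col_j] for seq in alignment)
    freq_ij = Counter((seq[col_i], seq[col_j]) for seq in alignment)
    mi = 0.0
    for (a, b), count in freq_ij.items():
        p_ab = count / n
        mi += p_ab * math.log2(p_ab / ((freq_i[a] / n) * (freq_j[b] / n)))
    return mi


# Hypothetical alignment: columns 0 and 4 always form a complementary pair,
# although the bases themselves differ between sequences.
alignment = ["GAAAC", "CAAAG", "AAAAU", "UAAAA"]
print(mutual_information(alignment, 0, 4))   # covarying columns
print(mutual_information(alignment, 1, 2))   # fully conserved columns
```

Columns that change together while staying complementary score high, whereas fully conserved columns score zero, so conservation and covariation carry complementary evidence.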
RNA FOOTPRINTING AND HIGH-THROUGHPUT METHODS
Part 2
RNA footprinting
• A traditional way to study RNA secondary structures
  – Preferentially cleave or mark nucleotides with a particular structural property
Image credit: Novikova et al., International Journal of Molecular Sciences 14(12):23672-23684, (2013)
Some probes that can be used

Probe                    Size (Dalton)   Structural preference
DMS (dimethyl sulfate)   126             Mark unpaired bases
1M7                      222             Mark unpaired bases
RNase V1                 15,900          Cleave paired bases
RNase ONE                27,000          Cleave unpaired bases
Nuclease S1              32,000          Cleave unpaired bases
Nuclease P1              36,000          Cleave unpaired bases
High-throughput RNA footprinting
• Enzyme-based: After enzymatic treatment, sequence the resulting fragments to identify the cleavage sites
  – And thus bases with the structural property
• Chemical-probe-based: A chemical adduct can terminate reverse transcription. The termination point can be identified by sequencing
Parallel Analysis of RNA Structure (PARS)
Image credit: Kertesz et al., Nature 467(7311):103-107, (2010)
DMS-seq
Image credit: Rouskin et al., Nature 505(7485):701-705, (2014)
Potential confounding factors
• General:
  – Expression level of transcripts
    • Need control/comparison
  – Sequence bias
  – Issues in read alignment
    • Blind tail: Fragments that are too short cannot be aligned correctly
  – Experimental efficiency
Potential confounding factors
• Method-specific:
  – DMS mainly modifies adenines and cytosines
  – Increasing read count towards 3’ end in DMS-seq
  – Natural polymerase drop-off in chemical-probe-based methods
  – Preference due to secondary vs. tertiary structure (e.g., steric hindrance in enzyme-based methods)
Data normalization
• Normalization strategies:
  – Transcript level: Comparison using
    • Standard RNA-seq data
    • Control experiment (with some steps not carried out)
    • Data from two different enzymes (PARS)
  – Increasing read count: Smoothing by local window
  – Polymerase drop-off: Modeling it explicitly
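As an illustration of normalizing with data from two different enzymes, the sketch below computes a per-nucleotide log ratio between read counts from a paired-base-specific enzyme and an unpaired-base-specific enzyme, in the spirit of PARS. The depth normalization, the pseudocount, and the example counts are assumptions made for this sketch and may differ from the published procedure.

```python
import math


def log_ratio_scores(paired_counts, unpaired_counts, pseudocount=1.0):
    """log2 ratio of depth-normalized counts; positive suggests a paired base."""
    if len(paired_counts) != len(unpaired_counts):
        raise ValueError("count vectors must have the same length")
    n = len(paired_counts)
    # Normalizing by the transcript total removes the effect of expression
    # level and sequencing depth, which affect both experiments.
    paired_total = sum(paired_counts) + pseudocount * n
    unpaired_total = sum(unpaired_counts) + pseudocount * n
    scores = []
    for v, s in zip(paired_counts, unpaired_counts):
        paired_frac = (v + pseudocount) / paired_total
        unpaired_frac = (s + pseudocount) / unpaired_total
        scores.append(math.log2(paired_frac / unpaired_frac))
    return scores


# Hypothetical read counts at four positions of one transcript.
v1 = [30, 25, 2, 1]   # e.g. RNase V1 (cleaves paired bases)
s1 = [3, 2, 40, 35]   # e.g. nuclease S1 (cleaves unpaired bases)
print(log_ratio_scores(v1, s1))
```

Because each position is compared within the same transcript, a highly expressed transcript does not automatically look more structured than a lowly expressed one.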
Example: Poisson linear model
• Modeling local sequence bias

$$\log \omega_i = \alpha + \sum_{k=1}^{K} \sum_{h \in \{A,C,G\}} \beta_{kh}\, I(b_{ik} = h) + \varepsilon$$

  – $\omega_i$: measured read count of nucleotide i
  – $\alpha$: actual expression level of transcript
  – $b_{ik}$: the k-th nucleotide within the length-K local sub-sequence around nucleotide i
  – $\beta_{kh}$: bias coefficient

Li et al., Genome Biology 11(5):R50, (2010)
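The model above can be evaluated directly once the coefficients are known. The sketch below does only that: it does not fit the model, it omits the noise term, and the coefficient values are invented for illustration. The fourth base (T/U) is treated as the reference level with coefficient zero, which is why only A, C and G appear in the sum.

```python
import math

REFERENCE_BASES = {"T", "U"}   # reference level, contributes no coefficient


def log_expected_count(alpha, beta, window):
    """Evaluate alpha + sum_k beta[k][b_k] for the length-K window around i."""
    if len(window) != len(beta):
        raise ValueError("window length must equal the number of positions K")
    log_omega = alpha
    for k, base in enumerate(window):
        if base in REFERENCE_BASES:
            continue
        log_omega += beta[k][base]
    return log_omega


def bias_factor(beta, window):
    """Multiplicative effect of the local sequence on the expected count."""
    return math.exp(log_expected_count(0.0, beta, window))


# Hypothetical coefficients for a window of K = 2 positions.
beta = [
    {"A": 0.5, "C": 0.0, "G": -0.5},
    {"A": 0.25, "C": 0.0, "G": 0.0},
]
alpha = 2.0
print(log_expected_count(alpha, beta, "AT"))
print(bias_factor(beta, "GA"))
```

Dividing an observed count by `bias_factor` for its window is one simple way to use fitted coefficients to correct for local sequence bias.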
In vivo vs. in vitro data
Image credit: Rouskin et al., Nature 505(7485):701-705, (2014)
Using structure-probing data
• The high-throughput RNA footprinting (structure-probing) data only tell whether a base is paired or not, but not with which other base
• The data can be used to help RNA secondary structure prediction
Using structure-probing data
• Several ways to use the data:
  – Free energy penalty
  – Pseudo free energy terms
  – Discrepancy minimization
  – Identifying closest structure centroid
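As an example of a pseudo free energy term, the sketch below uses a log-linear form that is commonly attributed to Deigan et al. (2009) for SHAPE reactivities. The slides do not specify a formula, so the functional form, the default slope and intercept, and the handling of missing values are all assumptions here.

```python
import math


def pseudo_free_energy(reactivity, slope=2.6, intercept=-0.8):
    """Pseudo free energy (kcal/mol) added when this nucleotide is paired.

    Uses slope * ln(reactivity + 1) + intercept. Missing or negative
    reactivities are treated as uninformative and contribute 0.
    """
    if reactivity is None or reactivity < 0:
        return 0.0
    return slope * math.log(reactivity + 1.0) + intercept


# Hypothetical normalized reactivities for five nucleotides.
reactivities = [0.0, 0.1, 1.5, None, 0.4]
terms = [pseudo_free_energy(r) for r in reactivities]
print(terms)
```

A highly reactive (probably unpaired) nucleotide gets a positive term, which penalizes structures that pair it, while an unreactive nucleotide gets a small bonus for being paired. The terms are added to the energy model and the usual minimization is run unchanged.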
Example: StructureFold
• Overall workflow:
Image credit: Tang et al., Bioinformatics 31(16):2668-2675, (2015)
Example: RNApbfold
• Minimizing discrepancies between predicted ($p_i(\vec{\epsilon})$) and measured ($q_i$) probabilities of bases being unpaired:

$$F(\vec{\epsilon}) = \sum_{\mu} \frac{\epsilon_\mu^2}{\tau_\mu^2} + \sum_{i=1}^{n} \frac{1}{\sigma_i^2} \left( p_i(\vec{\epsilon}) - q_i \right)^2$$

  – $\vec{\epsilon}$: Perturbation of the energy parameter values
  – $\tau_\mu^2$, $\sigma_i^2$: Variance terms indicating uncertainty

Washietl et al., Nucleic Acids Research 40(10):4261-4272, (2012)
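The objective above is simple to evaluate once the predicted unpaired probabilities are available. In the sketch below they are passed in as a precomputed list; in the actual method they depend on the perturbation vector through a partition function computation, which is not reproduced here. All numbers are made up for illustration.

```python
def objective(eps, tau, predicted, measured, sigma):
    """Perturbation penalty plus weighted squared prediction/measurement misfit.

    eps, tau: perturbations and their standard deviations (same length).
    predicted, measured, sigma: per-nucleotide unpaired probabilities and the
    standard deviations expressing measurement uncertainty (same length).
    """
    penalty = sum((e / t) ** 2 for e, t in zip(eps, tau))
    misfit = sum(((p - q) / s) ** 2 for p, q, s in zip(predicted, measured, sigma))
    return penalty + misfit


# Hypothetical values for two energy perturbations and two nucleotides.
eps = [1.0, -2.0]
tau = [1.0, 2.0]
predicted = [0.5, 0.9]   # would come from folding with perturbed energies
measured = [0.5, 0.4]
sigma = [1.0, 0.5]
print(objective(eps, tau, predicted, measured, sigma))
```

A small tau makes perturbing the corresponding energy parameter expensive, while a small sigma forces the prediction to follow the measurement closely, so the two variance terms set the balance between trusting the energy model and trusting the data.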
Example: RNApbfold
• Sample results:
Image credit: Washietl et al., Nucleic Acids Research 40(10):4261-4272, (2012)
Summary
• Sequence-based RNA secondary structure prediction
  – For specific types of RNA
  – Single sequence
    • Minimum free energy (MFE)
    • Partition function
  – Multiple sequences
• High-throughput RNA structure probing
  – Modification of objective function
  – Selection of appropriate structures