Noncoding RNA Genes Pt. 2 SCFGs CS374 Vincent Dorie.

Post on 21-Dec-2015


Noncoding RNA Genes Pt. 2: SCFGs

CS374

Vincent Dorie


Motivation

Noncoding RNA genes can be anywhere

Noncoding RNA genes can do anything

Location

rRNA, snRNA

Exons?

Introns

Viral vectors


Function

Function, pt. 2

Overview

“RSEARCH: Finding homologs of single structured RNA sequences” by Klein and Eddy (2003)

“Pairwise RNA Structure Comparison with Stochastic Context-Free Grammars” by Holmes and Rubin (2002)

Comparison - Methodology

RSEARCH DART (Stemloc)

Sequence

Comparison, Pt. 2 - Uses

RSEARCH: Find parts of a genome which may be homologous to the query sequence

More practical in comparative genomics

DART (Stemloc): Investigate a specific sequence suspected of being homologous to the query sequence

Comparison, Pt. 3 - Complexity

RSEARCH: $O((M - B)LD + BLD^2)$ to scan, $O(M^4)$ to calculate statistics

DART (Stemloc): Between $O(LM)$ and $O(L^3 M^3)$

Background:Context Free Grammars

Four-tuple {N, T, S, P}

N is a set of nonterminals

T is a set of terminals

S is the start symbol, S ∈ N

P is a set of productions

Context Free Grammars, pt. 2: Sample Grammar

N = {S, A, B}

T = {a, u, c, g, }

P = {

S -> A | B,

A -> aAc | aBc | g,

B -> g

}

Context Free Grammars, pt. 3: Parse Trees

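Parse: aagcc

[Figure: parse tree for aagcc, built up over two slides. Root S, internal nodes A, A and B, and leaf g, flanked by two nested a...c pairs.]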
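A small sketch of how the sample grammar derives aagcc, written as a brute-force enumeration of leftmost derivations. The function name `derivations` and the dictionary layout are illustrative choices, not part of the slides. It relies on the grammar having no empty productions, so any sentential form longer than the target can be discarded.

```python
# Sample grammar from the slides: each nonterminal maps to its alternatives.
# Uppercase letters are nonterminals, lowercase letters are terminals.
PRODUCTIONS = {
    "S": ["A", "B"],
    "A": ["aAc", "aBc", "g"],
    "B": ["g"],
}


def derivations(symbols, target):
    """Return every leftmost derivation of `target` starting from `symbols`.

    Each derivation is a list of sentential forms, e.g. ["S", "A", ...].
    """
    # Locate the leftmost nonterminal, if any.
    for pos, sym in enumerate(symbols):
        if sym in PRODUCTIONS:
            break
    else:
        # Only terminals left: success if and only if we produced the target.
        return [[symbols]] if symbols == target else []

    # Prune: no production shrinks the string, and the terminal prefix
    # to the left of the first nonterminal can never change again.
    if len(symbols) > len(target) or not target.startswith(symbols[:pos]):
        return []

    results = []
    for rhs in PRODUCTIONS[sym]:
        expanded = symbols[:pos] + rhs + symbols[pos + 1:]
        for tail in derivations(expanded, target):
            results.append([symbols] + tail)
    return results


if __name__ == "__main__":
    for steps in derivations("S", "aagcc"):
        print(" => ".join(steps))
```

Under this grammar the innermost g can come from either A -> g or B -> g, so the string has more than one parse. That ambiguity is what the probabilities of a stochastic grammar are used to resolve.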

Stochastic CFG

Each production associated with a probability

Probabilities for all productions starting from a given nonterminal sum to one

Superset of HMM

Assigns a probability to a parse

E.g. S -> A, 0.3 | B, 0.7
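As a rough illustration, the probability of a parse is the product of the probabilities of the productions it uses. Only the S probabilities (0.3 and 0.7) come from the slide. The probabilities for the A productions are made-up values chosen to sum to one, and B -> g must have probability 1 because it is the only B production.

```python
import math
from collections import defaultdict

# (left-hand side, right-hand side) -> probability
RULE_PROB = {
    ("S", "A"): 0.3,      # from the slide
    ("S", "B"): 0.7,      # from the slide
    ("A", "aAc"): 0.4,    # assumed for illustration
    ("A", "aBc"): 0.4,    # assumed for illustration
    ("A", "g"): 0.2,      # assumed for illustration
    ("B", "g"): 1.0,      # only production for B
}


def is_normalized(rule_prob):
    """Check that the productions of each nonterminal sum to one."""
    totals = defaultdict(float)
    for (lhs, _rhs), p in rule_prob.items():
        totals[lhs] += p
    return all(math.isclose(t, 1.0) for t in totals.values())


def parse_probability(rules_used):
    """Multiply the probabilities of the productions used in one parse."""
    return math.prod(RULE_PROB[rule] for rule in rules_used)


# The two parses of aagcc under the sample grammar.
via_B = [("S", "A"), ("A", "aAc"), ("A", "aBc"), ("B", "g")]
via_A = [("S", "A"), ("A", "aAc"), ("A", "aAc"), ("A", "g")]

if __name__ == "__main__":
    print(is_normalized(RULE_PROB))
    print(parse_probability(via_B), parse_probability(via_A))
```

With these assumed numbers the parse through B is the more probable of the two.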

Pairwise (profile) SCFG

Terminals in each production can exist in each of two strings

E.g. $W \to x_i \, y_k \, V \, x_j \, y_l$

RSEARCH: pSCFG Simplified

Each secondary structure specifies (most of) a grammar, creating a “Model Architecture”

Eschews probabilistic interpretation

Problem becomes fitting target to model architecture

Sequence

Node Types vs. Node States

Node types are what we want to do given the model (e.g. MATP is match pair)

Node state represents what happens when scanning a target sequence

E.g. Node type is MATP, target sequence doesn’t have a pair in that location -> insert a gap

Node States

Set of node states possible for node type

Gap Classes

Gap class per node type/state pair

Transition Scores

Gap class determines transition scores

Gap penalties are affine
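For readers unfamiliar with the term, an affine gap penalty charges one cost to open a gap and a smaller cost for each further position. The sketch below uses one common convention, in which the open cost covers the first gap position. The slides do not say which convention RSEARCH uses, and the numeric values are hypothetical.

```python
def affine_gap_score(length, gap_open, gap_extend):
    """Score of a gap of `length` positions under an affine penalty.

    Assumed convention: the first gapped position costs `gap_open`,
    every additional position costs `gap_extend`.
    """
    if length <= 0:
        return 0
    return gap_open + (length - 1) * gap_extend


if __name__ == "__main__":
    # Hypothetical penalties: one long gap is cheaper than several short ones.
    one_long = affine_gap_score(4, gap_open=-10, gap_extend=-1)
    four_short = 4 * affine_gap_score(1, gap_open=-10, gap_extend=-1)
    print(one_long, four_short)
```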

Emission Scores

Emission scores determined empirically

Parameterizing the Model: Emission Scores

      AA      AU      AC      AG      UA      …
AA    sAAAA   sAAAU   sAAAC   sAAAG   sAAUA   …
AU    -       sAUAU   sAUAC   sAUAG   sAUUA   …
AC    -       -       sACAC   sACAG   sACUA   …
AG    -       -       -       sAGAG   sAGUA   …
UA    -       -       -       -       sUAUA   …
…     …       …       …       …       …       …

Substitution Matrices

Single nucleotides: $s_{ij} = \log_2 \frac{f_{ij}}{g_i \, g_j}$

     A     U     C     G
A    sAA   sAU   sAC   sAG
U    -     sUU   sUC   sUG
C    -     -     sCC   sCG
G    -     -     -     sGG

Base pairs: $s_{ijkl} = \log_2 \frac{f_{ijkl}}{g_i \, g_j \, g_k \, g_l}$

Scores are observed / random
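A minimal sketch of the log-odds calculation, reading f as the observed frequency of an aligned pair (or pair of base pairs) and g as the background frequency of each nucleotide. The frequencies in the example are invented round numbers, not values from a RIBOSUM matrix.

```python
import math


def log_odds(observed_freq, background_freqs):
    """log2 of observed frequency over the frequency expected by chance.

    `background_freqs` holds one background frequency per nucleotide
    involved: two for a single-nucleotide score, four for a base-pair score.
    """
    expected = math.prod(background_freqs)
    return math.log2(observed_freq / expected)


if __name__ == "__main__":
    # Hypothetical: an aligned pair seen twice as often as chance predicts.
    print(log_odds(0.125, [0.25, 0.25]))
    # Hypothetical: aligned base pairs seen four times as often as chance.
    print(log_odds(1 / 64, [0.25, 0.25, 0.25, 0.25]))
```

A positive score means the alignment is seen more often than chance predicts, a negative score less often.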

RIBOSUM Matrices

Start with MSA

Whose MSA?

RIBOSUM[X, Y]

Sequences X% identical are reweighted to sum to 1

Only sequences Y% identical are counted in making matrices

Model Parameters

Gap open penalty (single and pair)

Gap extension penalty (single and pair)

Internal start penalty

Internal end penalty

Solution

Guess and check

“We might have been able to derive a more robust parameter set had we used a more comprehensive set of tests, but the long running time required by RSEARCH makes such an approach infeasible.”

Digression: Biostatistics

Confidence intervals

Expectation values


Gumbel Distribution

Parameterized by λ and K

$E = K N e^{-\lambda x}$, $P = 1 - e^{-E}$
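The two formulas can be sketched directly in code. Here `lam` stands for the Gumbel rate parameter (conventionally written λ), `x` is the hit score, and `N` is taken to be the size of the search space, which the slide does not define. The parameter values in the example are placeholders, not fitted values.

```python
import math


def e_value(x, lam, K, N):
    """Expected number of chance hits scoring at least x: E = K N exp(-lam x)."""
    return K * N * math.exp(-lam * x)


def p_value(e):
    """Probability of at least one chance hit: P = 1 - exp(-E)."""
    # expm1 keeps precision when E is tiny, where P is approximately E.
    return -math.expm1(-e)


if __name__ == "__main__":
    # Placeholder parameters for illustration only.
    e = e_value(x=30.0, lam=0.7, K=0.5, N=2.1e7)
    print(e, p_value(e))
```

For small E the two quantities are nearly equal, which is why either can be reported for strong hits.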

Gumbel Distribution, pt. 2

K and λ depend on G+C content of target database

For a database with heterogeneous G+C content, compute K and λ for G+C bins

Putting it All Together

Run against database substrings of length two times the query

Greedily take K best, non-overlapping hits

Recover alignments

Report: score, position in database, alignment, E-value, P-value

Statistics need to be calculated for every query and target database
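The greedy hit-selection step could look something like the sketch below. It is not RSEARCH's actual code: the tuple layout `(score, start, end)` with an exclusive end, and the function name, are assumptions made for the example.

```python
def greedy_non_overlapping(hits, k):
    """Pick up to k hits in descending score order, skipping overlaps.

    `hits` is a list of (score, start, end) tuples with `end` exclusive.
    """
    chosen = []
    for score, start, end in sorted(hits, key=lambda h: h[0], reverse=True):
        if len(chosen) >= k:
            break
        # Keep the hit only if it is disjoint from every hit already chosen.
        if all(end <= s or start >= e for _, s, e in chosen):
            chosen.append((score, start, end))
    return chosen


if __name__ == "__main__":
    hits = [(50, 0, 10), (45, 5, 15), (40, 10, 20), (10, 30, 40)]
    print(greedy_non_overlapping(hits, k=2))
```

Greedy selection is simple, but a high-scoring hit can block two neighbours whose combined score is higher.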

Time

For a 113 nt sequence against a $2.1 \times 10^7$ nt database, 2.9 CPU days. 2% computing statistics

For a 330 nt sequence against a $2.1 \times 10^7$ nt database, 38 CPU days. 7% computing statistics

Parallelized to 33 minutes and 7.4 hours respectively
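A quick back-of-the-envelope check of these figures: dividing CPU time by wall-clock time gives the effective speedup from parallelization. The slides do not state how many processors were used, so the ratio is only an estimate of effective parallelism.

```python
def speedup(cpu_days, wall_hours):
    """Ratio of total CPU time to wall-clock time, both converted to hours."""
    return cpu_days * 24 / wall_hours


# Figures quoted on the slide.
speedup_113nt = speedup(cpu_days=2.9, wall_hours=33 / 60)
speedup_330nt = speedup(cpu_days=38, wall_hours=7.4)

# CPU time spent computing statistics, in hours.
stats_hours_113nt = 0.02 * 2.9 * 24
stats_hours_330nt = 0.07 * 38 * 24

if __name__ == "__main__":
    print(round(speedup_113nt), round(speedup_330nt))
    print(round(stats_hours_113nt, 1), round(stats_hours_330nt, 1))
```

Both runs work out to a speedup of roughly 125, consistent with the same cluster being used for each.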

Shifting Gears: Fold Envelopes

Pre-enumerates the pSCFG search space

Presents conditional versions of dynamic programming algorithms

User-defined complexity

Fold Envelopes, pt. 2

Conceptualize search over grammars and parse trees

Each node in tree accounts for subsequence

[Figure: parse-tree node W_u. Inside sequence: the subtree below W_u accounts for X_{i..j}. Outside sequence: the rest of the tree accounts for X_{0..i} and X_{j..L}.]

Analogy: Message Passing

Inside algorithm: likelihood of sequence over all possible parses

Cocke-Younger-Kasami algorithm: maximum likelihood parse of a sequence

Inside-Outside algorithm: expected number of times each grammar production is used

Use fold envelopes to limit messages by restricting subsequences considered

The Inside Algorithm

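To compute

$a(i, j, V) = P(x_i \ldots x_j \text{ produced by } V)$

$a(i, j, V) = \sum_X \sum_Y \sum_{k=i}^{j-1} a(i, k, X) \, a(k+1, j, Y) \, P(V \to XY)$

[Figure: V spans x_i…x_j, with X spanning x_i…x_k and Y spanning x_{k+1}…x_j. Credit: Batzoglou]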
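The recursion can be sketched for a grammar in Chomsky normal form, where every production is either V -> X Y or V -> terminal. The slide gives only the binary case. The base case a(i, i, V) = P(V -> x_i) used below is the standard one but is an addition here, and the toy grammar (S -> S S with probability 0.4, S -> a with probability 0.6) is invented for the example.

```python
def inside(seq, binary_rules, terminal_rules):
    """Inside algorithm for an SCFG in Chomsky normal form.

    binary_rules:   {(V, X, Y): P(V -> X Y)}
    terminal_rules: {(V, t): P(V -> t)}
    Returns {(i, j, V): probability that V produces seq[i..j]},
    with 0-based, inclusive indices.
    """
    n = len(seq)
    a = {}

    # Base case (assumed, not on the slide): single-symbol spans.
    for i, ch in enumerate(seq):
        for (V, t), p in terminal_rules.items():
            if t == ch:
                a[(i, i, V)] = a.get((i, i, V), 0.0) + p

    # Recursion from the slide, filling shorter spans before longer ones.
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for (V, X, Y), p in binary_rules.items():
                total = 0.0
                for k in range(i, j):  # split point: X covers i..k, Y covers k+1..j
                    total += a.get((i, k, X), 0.0) * a.get((k + 1, j, Y), 0.0) * p
                if total:
                    a[(i, j, V)] = a.get((i, j, V), 0.0) + total
    return a


# Toy grammar, invented for illustration.
TOY_BINARY = {("S", "S", "S"): 0.4}
TOY_TERMINAL = {("S", "a"): 0.6}

if __name__ == "__main__":
    table = inside("aaa", TOY_BINARY, TOY_TERMINAL)
    print(table[(0, 2, "S")])
```

The string aaa has two parses under the toy grammar, and the inside value for the full span is the sum of their probabilities. A fold envelope would correspond to skipping some (i, j) spans in the two outer loops.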

Constructing Fold Envelopes

Constrain to possible secondary structures

Constrain to primary sequence alignment

Summary

RSEARCH to find a set of possible homologs, sorted by score and statistics

Fold Envelopes permit greater search depth in case of unfolded comparisons

RSEARCH employs simplified pSCFGs

Fold Envelopes are useful over the full spectrum of comparisons but represent more computationally complex situations