Noncoding RNA Genes Pt. 2 SCFGs CS374 Vincent Dorie.

Noncoding RNA Genes Pt. 2: SCFGs

Vincent Dorie


Motivation

Noncoding RNA genes can be anywhere

Noncoding RNA genes can do anything

Location

rRNA, snRNA

Exons?

Introns

Viral vectors

Function

Function, pt. 2

Overview

“RSEARCH: Finding homologs of single structured RNA sequences” by Klein and Eddy (2003)

“Pairwise RNA Structure Comparison with Stochastic Context-Free Grammars” by Holmes and Rubin (2002)

Comparison - Methodology

RSEARCH DART (Stemloc)

Sequence

Comparison, Pt. 2 - Uses

RSEARCH: find parts of a genome which may be homologous to the query sequence

More practical in comparative genomics

DART (Stemloc): investigate a specific sequence suspected of being homologous to the query sequence

Comparison, Pt. 3 - Complexity

RSEARCH: $O((M - B)LD + BLD^2)$ to scan, $O(M^4)$ to calculate statistics

DART (Stemloc): between $O(LM)$ and $O(L^3M^3)$

Background: Context Free Grammars

Four-tuple {N, T, S, P}

N is a set of nonterminals

T is a set of terminals

S is the start symbol, $S \in N$

P is a set of productions

Context Free Grammars, pt. 2: Sample Grammar

N = {S, A, B}

T = {a, u, c, g}

P = {

S -> A | B,

A -> aAc | aBc | g,

B -> g

}

Context Free Grammars, pt. 3: Parse Trees

Parse: aagcc

[Figure: parse tree rooted at S]
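As a rough illustration of what parsing aagcc with the sample grammar involves, the sketch below enumerates every parse tree by brute force. The function names `parses` and `expand` and the tuple representation of trees are assumptions made for this example, not part of the original slides.

```python
# Sample grammar from the slides: each nonterminal maps to its productions.
GRAMMAR = {
    "S": [["A"], ["B"]],
    "A": [["a", "A", "c"], ["a", "B", "c"], ["g"]],
    "B": [["g"]],
}

def parses(symbol, s):
    """Return every parse tree that derives the string s from symbol.

    A tree is either a terminal character or (nonterminal, [child trees]).
    """
    if symbol not in GRAMMAR:  # terminal: must match s exactly
        return [symbol] if s == symbol else []
    trees = []
    for rhs in GRAMMAR[symbol]:
        for children in expand(rhs, s):
            trees.append((symbol, children))
    return trees

def expand(rhs, s):
    """Return every way to derive s from the symbol sequence rhs."""
    if not rhs:
        return [[]] if s == "" else []
    results = []
    head, rest = rhs[0], rhs[1:]
    for k in range(len(s) + 1):  # try every split point
        for left in parses(head, s[:k]):
            for right in expand(rest, s[k:]):
                results.append([left] + right)
    return results

for tree in parses("S", "aagcc"):
    print(tree)
```

Under this grammar the string has two parse trees, because the central g can come from either A -> g or B -> g. Exhaustive enumeration like this is exponential in general; it is meant only to make the notion of a parse concrete.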

Stochastic CFG

Each production associated with a probability

Probabilities for all productions starting from a given nonterminal sum to one

Superset of HMM

Assigns a probability to a parse

E.g. S -> A, 0.3 | B, 0.7
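To make "assigns a probability to a parse" concrete, here is a minimal sketch that multiplies the probabilities of the productions used in a parse tree. Only the values for S -> A (0.3) and S -> B (0.7) come from the slide; the probabilities for the A and B productions are hypothetical, chosen so that each nonterminal's productions sum to one.

```python
# Production probabilities keyed by (nonterminal, right-hand side).
# S -> A and S -> B are from the slide; the rest are made-up values.
RULE_PROB = {
    ("S", ("A",)): 0.3,
    ("S", ("B",)): 0.7,
    ("A", ("a", "A", "c")): 0.5,  # hypothetical
    ("A", ("a", "B", "c")): 0.2,  # hypothetical
    ("A", ("g",)): 0.3,           # hypothetical
    ("B", ("g",)): 1.0,           # hypothetical
}

def parse_probability(tree):
    """Probability of a parse: the product of its production probabilities.

    A tree is either a terminal character or (nonterminal, [child trees]).
    """
    if isinstance(tree, str):  # terminals contribute no factor
        return 1.0
    symbol, children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = RULE_PROB[(symbol, rhs)]
    for child in children:
        p *= parse_probability(child)
    return p

# One parse of "aagcc": S -> A -> aAc -> a(aBc)c -> aagcc
tree = ("S", [("A", ["a", ("A", ["a", ("B", ["g"]), "c"]), "c"])])
print(parse_probability(tree))  # 0.3 * 0.5 * 0.2 * 1.0, about 0.03
```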

Pairwise (profile) SCFG

Terminals in each production can exist in each of two strings

E.g. $W \to x_i y_k V x_j y_l$

RSEARCH: pSCFG Simplified

Each secondary structure specifies (most of) a grammar, creating a “Model Architecture”

Eschews probabilistic interpretation

Problem becomes fitting target to model architecture

Sequence

Node Types vs. Node States

Node types are what we want to do given the model (e.g. MATP is match pair)

Node state represents what happens when scanning a target sequence

E.g. Node type is MATP, target sequence doesn’t have a pair in that location -> insert a gap

Node States

Set of node states possible for node type

Gap Classes

Gap class per node type/state pair

Transition Scores

Gap class determines transition scores

Gap penalties are affine
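An affine gap penalty charges one cost to open a gap and a smaller cost for each additional position. The sketch below shows one common convention, in which the opening cost covers the first gapped position; the function name and the numeric values are hypothetical, and RSEARCH's exact convention may differ.

```python
def affine_gap_penalty(length, gap_open, gap_extend):
    """Penalty for a gap spanning `length` positions.

    Assumed convention: gap_open covers the first position and
    gap_extend is charged for each position after it.
    """
    if length <= 0:
        return 0
    return gap_open + (length - 1) * gap_extend

# Hypothetical parameters: one long gap is cheaper than two short ones.
print(affine_gap_penalty(4, gap_open=10, gap_extend=2))      # 16
print(2 * affine_gap_penalty(2, gap_open=10, gap_extend=2))  # 24
```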

Emission Scores

Emission scores determined empirically

Parameterizing the Model: Emission Scores

      AA      AU      AC      AG      UA      …
AA    sAAAA   sAAAU   sAAAC   sAAAG   sAAUA   …
AU    -       sAUAU   sAUAC   sAUAG   sAUUA   …
AC    -       -       sACAC   sACAG   sACUA   …
AG    -       -       -       sAGAG   sAGUA   …
UA    -       -       -       -       sUAUA   …
…     …       …       …       …       …       …

Substitution Matrices

$s_{ij} = \log_2 \frac{f_{ij}}{g_i g_j}$

     A     U     C     G
A    sAA   sAU   sAC   sAG
U    -     sUU   sUC   sUG
C    -     -     sCC   sCG
G    -     -     -     sGG

$s_{ijkl} = \log_2 \frac{f_{ijkl}}{g_i g_j g_k g_l}$

Scores are log-odds ratios: observed frequency / frequency expected at random
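The single-nucleotide score is the log-odds of an observed aligned-pair frequency against the product of the background frequencies. The sketch below computes it from a toy list of aligned pairs. The function name, the symmetrization step, and the toy counts are assumptions for illustration only, not the RIBOSUM procedure itself.

```python
import math
from collections import Counter

def log_odds_scores(aligned_pairs):
    """Return {(i, j): log2(f_ij / (g_i * g_j))} from aligned residue pairs.

    f_ij is the symmetrized joint frequency of residues i and j being
    aligned; g_i is the marginal (background) frequency of residue i.
    """
    joint = Counter()
    for i, j in aligned_pairs:
        joint[(i, j)] += 0.5  # count each pair in both orders
        joint[(j, i)] += 0.5
    total = sum(joint.values())
    f = {pair: count / total for pair, count in joint.items()}
    g = Counter()
    for (i, _), freq in f.items():
        g[i] += freq
    return {(i, j): math.log2(freq / (g[i] * g[j]))
            for (i, j), freq in f.items()}

# Toy data (made up): mostly conserved columns, a few A/G substitutions.
pairs = [("A", "A")] * 6 + [("G", "G")] * 2 + [("A", "G")] * 2
scores = log_odds_scores(pairs)
for pair in sorted(scores):
    print(pair, round(scores[pair], 3))
```

Pairs seen more often than chance get positive scores and pairs seen less often get negative scores. The base-pair score is the same ratio taken over four residues instead of two.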

RIBOSUM Matrices

Start with MSA

Whose MSA?

RIBOSUM[X, Y]

Sequences X% identical are reweighted to sum to 1

Only sequences Y% identical are counted in making matrices

Model Parameters

Gap open penalty (single and pair)

Gap extension penalty (single and pair)

Internal start penalty

Internal end penalty

Solution

Guess and check

“We might have been able to derive a more robust parameter set had we used a more comprehensive set of tests, but the long running time required by RSEARCH makes such an approach infeasible.”

Digression: Biostatistics

Confidence intervals

Expectation values

Gumbel Distribution

Parameterized by $\lambda$ and $K$

$E = K N e^{-\lambda x}$, $P = 1 - e^{-E}$
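The two statistics can be computed directly from a hit's score once the Gumbel parameters are known. This is a minimal sketch assuming the standard extreme-value form, E = K N exp(-lambda x) and P = 1 - exp(-E); the function names and the parameter values in the example are hypothetical.

```python
import math

def e_value(score, K, lam, N):
    """Expected number of chance hits scoring at least `score`
    in a search space of size N."""
    return K * N * math.exp(-lam * score)

def p_value(score, K, lam, N):
    """Probability of at least one chance hit scoring at least `score`."""
    # -expm1(-E) equals 1 - exp(-E) but stays accurate when E is tiny.
    return -math.expm1(-e_value(score, K, lam, N))

# Hypothetical parameters, for illustration only.
K, lam, N = 0.1, 0.7, 2.1e7
for score in (20, 30, 40):
    print(score, e_value(score, K, lam, N), p_value(score, K, lam, N))
```

For small E the two values nearly coincide, which is why E-values are often read as approximate P-values for strong hits.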

Gumbel Distribution, pt. 2

$K$ and $\lambda$ depend on G+C content of target database

For database with heterogeneous G+C content, compute $K$ and $\lambda$ for G+C bins

Putting it All Together

Run against database substrings of length two times the query

Greedily take K best, non-overlapping hits

Recover alignments

Report: score, position in database, alignment, E-value, P-value

Statistics need to be calculated for every query and target database
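The greedy selection step can be sketched as follows: sort hits by score and keep each one that does not overlap a hit already kept. The hit representation (score, start, end) with an exclusive end, the function name, and the example data are assumptions for this sketch, not RSEARCH's internal format.

```python
def greedy_nonoverlapping(hits, k):
    """Pick up to k highest-scoring hits whose intervals do not overlap.

    Each hit is (score, start, end) with `end` exclusive.
    """
    chosen = []
    for score, start, end in sorted(hits, key=lambda h: h[0], reverse=True):
        # Keep the hit only if it is disjoint from every hit kept so far.
        if all(end <= s or start >= e for _, s, e in chosen):
            chosen.append((score, start, end))
            if len(chosen) == k:
                break
    return chosen

# Made-up hits: the second-best hit overlaps the best one and is skipped.
hits = [(50, 0, 10), (45, 5, 15), (40, 20, 30), (10, 12, 18)]
print(greedy_nonoverlapping(hits, 2))  # [(50, 0, 10), (40, 20, 30)]
```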

For a 113 nt sequence against a $2.1 \times 10^7$ nt database, 2.9 CPU days; 2% spent computing statistics

For a 330 nt sequence against a $2.1 \times 10^7$ nt database, 38 CPU days; 7% spent computing statistics

Parallelized to 33 minutes and 7.4 hours respectively

Shifting Gears: Fold Envelopes

Pre-enumerates the pSCFG search space

Presents conditional versions of dynamic programming algorithms

User-defined complexity

Fold Envelopes, pt. 2

Conceptualize search over grammars and parse trees

Each node in tree accounts for subsequence

Inside sequence: accounts for $X_{i..j}$

Outside sequence: accounts for $X_{0..i}$ and $X_{j..L}$

Analogy: Message Passing

Inside algorithm: likelihood of sequence over all possible parses

Cocke-Younger-Kasami algorithm: maximum likelihood parse of a sequence

Inside-Outside algorithm: expected number of times each grammar production is used

Use fold envelopes to limit messages by restricting subsequences considered

The Inside Algorithm

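To compute

$a(i, j, V) = P(x_i \ldots x_j \text{ produced by } V)$

$a(i, j, V) = \sum_{X} \sum_{Y} \sum_{k} a(i, k, X)\, a(k+1, j, Y)\, P(V \to XY)$

[Figure: the span $x_i \ldots x_j$ split between positions $k$ and $k+1$]

Batzoglou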
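The recursion can be sketched for a single-sequence grammar in Chomsky normal form, with an optional set of allowed spans standing in for a fold envelope. The function name, the rule dictionaries, the toy grammar, and the envelope representation are all assumptions made for this example. The pairwise version used by Stemloc runs over two sequences and is considerably more involved.

```python
def inside(seq, binary_rules, emit_rules, envelope=None):
    """Inside probabilities a[(i, j, V)] = P(seq[i..j] is produced by V).

    binary_rules: {(V, X, Y): prob} for productions V -> X Y
    emit_rules:   {(V, terminal): prob} for productions V -> terminal
    envelope:     optional set of allowed (i, j) spans, inclusive and
                  0-based; spans outside it are skipped (probability 0)
    """
    n = len(seq)
    a = {}
    # Base case: single-character spans come from emission rules.
    for i, x in enumerate(seq):
        if envelope is not None and (i, i) not in envelope:
            continue
        for (V, t), p in emit_rules.items():
            if t == x:
                a[(i, i, V)] = a.get((i, i, V), 0.0) + p
    # Recursion: build longer spans from pairs of shorter ones.
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            if envelope is not None and (i, j) not in envelope:
                continue
            for (V, X, Y), p in binary_rules.items():
                total = 0.0
                for k in range(i, j):  # split into [i..k] and [k+1..j]
                    total += a.get((i, k, X), 0.0) * a.get((k + 1, j, Y), 0.0)
                if total:
                    a[(i, j, V)] = a.get((i, j, V), 0.0) + p * total
    return a

# Toy grammar (hypothetical): S -> S S with 0.4, S -> 'a' with 0.6.
binary = {("S", "S", "S"): 0.4}
emit = {("S", "a"): 0.6}

full = inside("aaa", binary, emit)
print(full[(0, 2, "S")])  # sums over both bracketings of "aaa"

# An envelope that forbids the span (0, 1) leaves only one bracketing.
env = {(0, 0), (1, 1), (2, 2), (1, 2), (0, 2)}
restricted = inside("aaa", binary, emit, envelope=env)
print(restricted[(0, 2, "S")])
```

Replacing the sum over split points and productions with a maximum gives the CYK score of the best parse. Restricting the allowed spans reduces the work done, at the cost of ignoring parses that fall outside the envelope.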

Constructing Fold Envelopes

Constrain to possible secondary structures

Constrain to primary sequence alignment

Summary

RSEARCH to find a set of possible homologs, sorted by score and statistics

Fold Envelopes permit greater search depth in case of unfolded comparisons

RSEARCH employs simplified pSCFGs

Fold Envelopes are useful over the full spectrum of comparisons but represent more computationally complex situations
