CFG PSA Algorithm - BGUtabio122/wiki.files/CFG psa.pdf · 2012-06-14 · CNF – Chomsky normal...
Shachar Langbeheim & Noa Lempel , Ben Gurion University CFG - PSA
CFG – PSA Algorithm
Sequence Alignment Guided By Common Motifs Described By Context Free Grammars
Motivation
[Figure: RNA secondary structure elements: external base, internal loop, multi-loop, bulge, hairpin loop]
• Find motifs: conserved regions that indicate a biological function or signature. Other algorithms do not always align motif regions with each other.
• Incorporate knowledge about common structures into the alignment process, forcing the alignment to align such common motifs.
The goal
To align the sequences so that the alignment includes all or some of the motif matches, in order to optimize the resulting score.
Example
A(AA,AA) + A(GCA,GA) + A(UAU,UA) + A(GC,CGC) + β(G1) + β(G2) + β(G3)
$$\max \left\{ \sum_{i=1}^{k+1} A(z_i, w_i) + \sum_{i=1}^{k} \max_{x_i, y_i \in L(G),\ G \in C} \beta(G) \right\}$$
Problem definition
Let A(u, v) denote the maximum ordinary alignment score for strings u and v, as computed by the ordinary alignment algorithm.
Given:
• two sequences S1 and S2
• a set of context-free grammars C = {G1, . . . , Gf}, each of which represents a motif with all its possible variations
• a weight function β(G)
Compute:
$$\max \left\{ \sum_{i=1}^{k+1} A(z_i, w_i) + \sum_{i=1}^{k} \max_{x_i, y_i \in L(G),\ G \in C} \beta(G) \right\}$$
over all possible x1, x2, . . . , xk and y1, y2, . . . , yk such that
S1 = z1 x1 z2 x2 . . . zk xk zk+1
S2 = w1 y1 w2 y2 . . . wk yk wk+1
• Each zj or wj, 1 <= j <= k + 1, can be an empty string.
• For all i, 1 <= i <= k, there exists a G ∈ C such that xi, yi ∈ L(G), where L(G) is the language generated by the CFG G. Each such pair (xi, yi) is a motif-matching region.
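To make the objective concrete, the sketch below evaluates the score of one fixed decomposition, the one from the Example slide. It is only an illustration: the scoring scheme (match +1, mismatch -1, linear gap -1) and the β weights of 3 are assumptions, since the slides do not state them, and `align_score` and `decomposition_score` are hypothetical helper names.

```python
def align_score(u, v, match=1, mismatch=-1, gap=-1):
    """Global alignment score A(u, v) with a linear gap cost (assumed scoring)."""
    prev = [j * gap for j in range(len(v) + 1)]
    for i in range(1, len(u) + 1):
        cur = [i * gap]
        for j in range(1, len(v) + 1):
            s = match if u[i - 1] == v[j - 1] else mismatch
            cur.append(max(prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[len(v)]


def decomposition_score(segments, motif_weights):
    """Sum A(z_i, w_i) over the k+1 non-motif pairs, plus beta over the k motifs."""
    return sum(align_score(z, w) for z, w in segments) + sum(motif_weights)


# The k+1 = 4 non-motif segment pairs (z_i, w_i) from the Example slide.
segments = [("AA", "AA"), ("GCA", "GA"), ("UAU", "UA"), ("GC", "CGC")]
beta = {"G1": 3, "G2": 3, "G3": 3}  # hypothetical weights, not from the slides
total = decomposition_score(segments, [beta["G1"], beta["G2"], beta["G3"]])
print(total)  # 2 + 1 + 1 + 1 + 3 + 3 + 3 = 14
```

CFG - PSA itself maximizes this quantity over all valid decompositions rather than scoring a single one.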
C = {G1, G2, G3, G4}
G1 = (V, T, P, V0) is a context-free grammar:
Set of variables V = {V0, V1, V2}
V0 is the start variable
Terminal symbols T = {A, U, C, G}
Set of rules:
• V0 --> CV1G | GV1C
• V1 --> GV2C
• V2 --> GAA
G2 = ({V0, V1, V2}, {A, U, C, G}, P, V0)
Set of rules:
• V0 --> AV1U | UV1A
• V1 --> CV2G
• V2 --> CC | GCG
G3 = ({V0, V1, V2, V3}, {A, U, C, G}, P, V0)
Set of rules:
• V0 --> AV1U | GV1C | CV1G
• V1 --> AV2U
• V2 --> AV3U | UV3A
• V3 --> AAC | AA
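Because none of G1, G2, G3 is recursive, each generates a finite set of motif strings. The following sketch enumerates those sets; the dictionary encoding of the rules and the function name `language` are illustrative choices, not part of the slides.

```python
from itertools import product


def language(rules, symbol):
    """Set of terminal strings derivable from `symbol`.

    Assumes a non-recursive grammar, so the recursion terminates.
    """
    if symbol not in rules:  # a terminal symbol stands for itself
        return {symbol}
    result = set()
    for body in rules[symbol]:
        parts = [language(rules, s) for s in body]
        result |= {"".join(choice) for choice in product(*parts)}
    return result


G1 = {"V0": [["C", "V1", "G"], ["G", "V1", "C"]],
      "V1": [["G", "V2", "C"]],
      "V2": [["G", "A", "A"]]}
G2 = {"V0": [["A", "V1", "U"], ["U", "V1", "A"]],
      "V1": [["C", "V2", "G"]],
      "V2": [["C", "C"], ["G", "C", "G"]]}
G3 = {"V0": [["A", "V1", "U"], ["G", "V1", "C"], ["C", "V1", "G"]],
      "V1": [["A", "V2", "U"]],
      "V2": [["A", "V3", "U"], ["U", "V3", "A"]],
      "V3": [["A", "A", "C"], ["A", "A"]]}

print(sorted(language(G1, "V0")))  # ['CGGAACG', 'GGGAACC']
print(sorted(language(G2, "V0")))  # ['ACCCGU', 'ACGCGGU', 'UCCCGA', 'UCGCGGA']
print(len(language(G3, "V0")))     # 12
```

Each motif string pairs its outer bases (C with G, A with U), which is how these grammars describe stem-like structures.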
Reminder – CYK Algorithm
CYK is an algorithm that receives a grammar G in CNF and a string S, and
determines whether S can be derived from G, and how.
ParseAll
ParseAll is a modification of CYK that receives a grammar G in CNF and a
string S, and finds all of the substrings of S that can be derived
from G.
CNF – Chomsky normal form
A grammar in CNF is a CFG in which every production rule has one
of the forms: X → YZ | X → a | V0 → ε, where X, Y, Z are variables, a is a terminal, and V0 is the start variable.
Every CFG can be converted to an equivalent grammar in CNF by a standard
algorithm.
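As an illustration of the conversion, the sketch below handles only the two steps that the motif grammars in this deck need: replacing terminals inside long rules by fresh variables, and splitting bodies longer than two symbols. It assumes the input has no ε-rules and no unit rules, so it is not a general CNF converter; `to_cnf` and the generated names (`T_C`, `V0_1`, ...) are hypothetical.

```python
def to_cnf(rules):
    """Convert rules [(head, [symbols...])] to CNF.

    Assumes no epsilon rules and no unit rules X -> Y in the input.
    """
    variables = {head for head, _ in rules}
    cnf, terminal_var, counter = [], {}, 0

    def lift(symbol):
        # Replace a terminal by a fresh variable T_a with the rule T_a -> a.
        if symbol in variables:
            return symbol
        if symbol not in terminal_var:
            terminal_var[symbol] = "T_" + symbol
            cnf.append((terminal_var[symbol], [symbol]))
        return terminal_var[symbol]

    for head, body in rules:
        if len(body) == 1:  # already of the form X -> a
            cnf.append((head, list(body)))
            continue
        symbols = [lift(s) for s in body]
        while len(symbols) > 2:  # split X -> Y1 Y2 ... into binary rules
            counter += 1
            rest = f"{head}_{counter}"
            cnf.append((head, [symbols[0], rest]))
            head, symbols = rest, symbols[1:]
        cnf.append((head, symbols))
    return cnf


G1_RULES = [("V0", ["C", "V1", "G"]), ("V0", ["G", "V1", "C"]),
            ("V1", ["G", "V2", "C"]), ("V2", ["G", "A", "A"])]
for head, body in to_cnf(G1_RULES):
    print(head, "->", " ".join(body))
```

For G1 this yields three terminal rules and eight binary rules, which is the form ParseAll expects.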
Running Example:
String S = “Poor fool, do you think you can understand
Dynamic Programming? No one can understand it.”
Grammar G in CNF:
V0 → A B
A → A1 A2 | you
A1 → No
A2 → one
B → B1 B2
B1 → can
B2 → B3 C
B3 → understand
C → D P | it
D → Dynamic
P → Programming
Running Example:
Each word of S that is derived by a rule of the form X → a fills a diagonal cell of the table T (word positions are 1-based):
T[4,4] = {A}       A → you
T[6,6] = {A}       A → you
T[7,7] = {B1}      B1 → can
T[8,8] = {B3}      B3 → understand
T[9,9] = {D}       D → Dynamic
T[10,10] = {P}     P → Programming
T[11,11] = {A1}    A1 → No
T[12,12] = {A2}    A2 → one
T[13,13] = {B1}    B1 → can
T[14,14] = {B3}    B3 → understand
T[15,15] = {C}     C → it
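Running Example:
Table T with only the diagonal filled (row i = start position, column j = end position, . = empty cell):
i\j  15  14  13  12  11  10   9   8   7   6   5   4
  4   .   .   .   .   .   .   .   .   .   .   .   A
  5   .   .   .   .   .   .   .   .   .   .   .
  6   .   .   .   .   .   .   .   .   .   A
  7   .   .   .   .   .   .   .   .  B1
  8   .   .   .   .   .   .   .  B3
  9   .   .   .   .   .   .   D
 10   .   .   .   .   .   P
 11   .   .   .   .  A1
 12   .   .   .  A2
 13   .   .  B1
 14   .  B3
 15   C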
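Running Example:
Substrings of length 2. New cells, each produced by combining two adjacent diagonal cells:
T[9,10] = {C}      C → D P, with D ∈ T[9,9] and P ∈ T[10,10]
T[11,12] = {A}     A → A1 A2, with A1 ∈ T[11,11] and A2 ∈ T[12,12]
T[14,15] = {B2}    B2 → B3 C, with B3 ∈ T[14,14] and C ∈ T[15,15]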
Running Example:
Substrings of length 3. New cells:
T[8,10] = {B2}     B2 → B3 C, with B3 ∈ T[8,8] and C ∈ T[9,10]
T[13,15] = {B}     B → B1 B2, with B1 ∈ T[13,13] and B2 ∈ T[14,15]
Running Example:
Substrings of length 4. New cell:
T[7,10] = {B}      B → B1 B2, with B1 ∈ T[7,7] and B2 ∈ T[8,10]
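Running Example:
Substrings of length 5. New cells:
T[6,10] = {V0}     V0 → A B, with A ∈ T[6,6] and B ∈ T[7,10]
T[11,15] = {V0}    V0 → A B, with A ∈ T[11,12] and B ∈ T[13,15]
Final table T (row i = start position, column j = end position, . = empty cell):
i\j  15  14  13  12  11  10   9   8   7   6   5   4
  4   .   .   .   .   .   .   .   .   .   .   .   A
  5   .   .   .   .   .   .   .   .   .   .   .
  6   .   .   .   .   .  V0   .   .   .   A
  7   .   .   .   .   .   B   .   .  B1
  8   .   .   .   .   .  B2   .  B3
  9   .   .   .   .   .   C   D
 10   .   .   .   .   .   P
 11  V0   .   .   A  A1
 12   .   .   .  A2
 13   B   .  B1
 14  B2  B3
 15   C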
Running Example:
The start variable V0 appears in T[6,10] and T[11,15], so ParseAll returns two substrings:
S1 = [6,10] (“you can understand Dynamic Programming”)
S2 = [11,15] (“No one can understand it”)
Algorithm ParseAll (input string S, length n, CFG G in CNF)
Step 1. Find the substrings derived by the rules of the form X → a:
    for i = 1 to n do
        set T[i, i] = ∅
        for each variable X
            if X → S[i] is a rule
                then add X to T[i, i]
Step 2. Find the substrings derived by the rules of the form X → YZ:
    for l = 2 to n do
        for i = 1 to n - l + 1 do
            set j = i + l - 1
            set T[i, j] = ∅
            for k = i to j - 1 do
                for each rule X → YZ
                    if Y ∈ T[i, k] and Z ∈ T[k + 1, j]
                        then add X to T[i, j]
Step 3. Return the set of all substrings generated by G:
    P = ∅
    for i = 1 to n do
        for j = i to n do
            if V0 ∈ T[i, j]
                then add (i, j) to P
    return P
Runtime = O(|G|n^3)
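The three steps can be sketched in Python as follows, run on the running example with words as terminals. The representation of the grammar as two rule lists, and the decision to strip punctuation from the words of S, are assumptions made for this illustration.

```python
def parse_all(tokens, unary, binary, start):
    """Return all 1-based inclusive (i, j) such that tokens[i..j] derives from `start`.

    unary:  list of (X, a) for rules X -> a
    binary: list of (X, Y, Z) for rules X -> Y Z
    """
    n = len(tokens)
    T = {}
    # Step 1: rules of the form X -> a fill the diagonal.
    for i in range(1, n + 1):
        T[i, i] = {X for X, a in unary if a == tokens[i - 1]}
    # Step 2: rules of the form X -> Y Z, by increasing substring length.
    for length in range(2, n + 1):
        for i in range(1, n - length + 2):
            j = i + length - 1
            T[i, j] = set()
            for k in range(i, j):
                for X, Y, Z in binary:
                    if Y in T[i, k] and Z in T[k + 1, j]:
                        T[i, j].add(X)
    # Step 3: collect every span that the start variable derives.
    return [(i, j) for i in range(1, n + 1) for j in range(i, n + 1)
            if start in T[i, j]]


UNARY = [("A", "you"), ("A1", "No"), ("A2", "one"), ("B1", "can"),
         ("B3", "understand"), ("C", "it"), ("D", "Dynamic"), ("P", "Programming")]
BINARY = [("V0", "A", "B"), ("A", "A1", "A2"), ("B", "B1", "B2"),
          ("B2", "B3", "C"), ("C", "D", "P")]
TOKENS = ("Poor fool do you think you can understand "
          "Dynamic Programming No one can understand it").split()

result = parse_all(TOKENS, UNARY, BINARY, "V0")
print(result)  # [(6, 10), (11, 15)]
```

The three nested loops over l, i and k, with one pass over the rules inside, give the O(|G|n^3) bound stated on the slide.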
_ A A C G A A C G G C A A A A A A C U
_ 0 -∞ -∞ -∞ -∞ -∞ -∞ -∞ -∞ -∞ -∞ -∞ -∞ -∞ -∞ -∞ -∞ -∞ -∞
A -∞ 1 -1 -2 -3 -4 -5 -6 -7 -8 -9 -10 -11 -12 -13 -14 -15 -16 -17
A -∞ -1 2 1 0 -1 -2 -3 -4 -5 -6 -7 -8 -9 -10 -11 -12 -13 -14
G -∞ -2 1 0 2 1 0 -1 -2 -3 -4 -5 -6 -7 -8 -9 -10 -11 -12
G -∞ -3 0 -1 1 0 -1 -2 0 -1 -2 -3 -4 -5 -6 -7 -8 -9 -10
G -∞ -4 -1 -2 0 -1 -2 -3 -1 1 0 -1 -2 -3 -4 -5 -6 -7 -8
A -∞ -5 -2 -3 -1 1 0 -1 -2 0 -1 1 0 -1 -2 -3 -4 -5 -6
A -∞ -6 -3 -4 -2 0 2 1 0 -1 -2 0 2 1 0 -1 -2 -3 -4
C -∞ -7 -8 -2 -3 -1 1 3 2 1 0 -1 1 0 -1 -2 -3 -1 -2
C -∞ -8 -9 -3 -4 -2 0 2
G -∞
A -∞
C -∞
A -∞
U -∞
A -∞
A -∞
A -∞
-1 -1
Preprocessing
For all Gk in C do:
    Xk ← ParseAll(S1, n1, Gk)
    Yk ← ParseAll(S2, n2, Gk)
    Move every (i', i) ∈ Xk to list X as (i', i, k)
    Move every (j', j) ∈ Yk to list Y as (j', j, k)
Sort X and Y in ascending order of the end points i and j.
[Figure: two example lists of (start, end, grammar index) triples, labeled X and Y, with markers j=7 and i=11]
(1,2,4) (2,5,1) (3,7,4) (5,7,1) (3,9,2) (7,11,4) (9,11,1) (8,18,3)
(2,6,4) (4,8,3) (3,8,2) (5,11,1) (2,11,4) (9,11,3) (1,5,2)
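A minimal sketch of this preprocessing step is shown below. To stay self-contained it does not call ParseAll; instead `find_occurrences` is a stand-in that searches for the motif strings directly, which works only because G1 and G2 generate finite languages (the string sets were derived by hand from their rules). The function names and the two input sequences are made up for the illustration.

```python
def find_occurrences(s, motif_strings):
    """Stand-in for ParseAll: 1-based inclusive (start, end) of each motif occurrence."""
    hits = []
    for m in motif_strings:
        pos = s.find(m)
        while pos != -1:
            hits.append((pos + 1, pos + len(m)))
            pos = s.find(m, pos + 1)
    return hits


def build_list(s, motif_sets):
    """Tag every occurrence with its grammar index k, then sort by end point."""
    tagged = [(start, end, k)
              for k, strings in motif_sets.items()
              for start, end in find_occurrences(s, strings)]
    return sorted(tagged, key=lambda t: (t[1], t[0], t[2]))


# Languages of G1 and G2, keyed by grammar index k.
MOTIFS = {1: {"CGGAACG", "GGGAACC"},
          2: {"ACCCGU", "ACGCGGU", "UCCCGA", "UCGCGGA"}}

X = build_list("ACCCGUCGGAACG", MOTIFS)  # hypothetical S1
Y = build_list("UUGGGAACCA", MOTIFS)     # hypothetical S2
print(X)  # [(1, 6, 2), (7, 13, 1)]
print(Y)  # [(3, 9, 1)]
```

Sorting by end point lets the main computation look up all motif occurrences that end at a given table position.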
Time Complexity of CFG - PSA
Where N = max{N1, N2}:
• Running ParseAll for every grammar in C and creating X, Y: O(|C|N^3)
• Sorting X, Y: O(|X| log|X| + |Y| log|Y|)
• Computing the maxterms: O(|X||Y| + N^2)
This last bound comes from the usual N^2 table filling plus the new maxterm
max{ H(i', j') + β(Gk1) } over all (i', i, k1, j', j, k2) ∈ Xi × Yj with Gk1 = Gk2,
which passes over the entries of X and Y and checks for matching grammars.
Total time complexity: O(|C|N^3 + |X||Y|)
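One way the table computation with the extra maxterm might look is sketched below. The slides do not spell out the full recurrence, so several points are assumptions: linear gap costs with match +1, mismatch -1, gap -1; ordinary global-alignment boundary values rather than the -∞ borders of the slide's table; and triples (start, end, k) with 1-based inclusive positions, so that a matched motif pair continues from the cell just before both starts. `cfg_psa_score` is a hypothetical name.

```python
def cfg_psa_score(s1, s2, X, Y, beta, match=1, mismatch=-1, gap=-1):
    """Alignment score where a motif pair from the same grammar k earns beta[k]."""
    n1, n2 = len(s1), len(s2)
    # Group motif occurrences by end point: end -> [(start, k), ...]
    x_by_end, y_by_end = {}, {}
    for start, end, k in X:
        x_by_end.setdefault(end, []).append((start, k))
    for start, end, k in Y:
        y_by_end.setdefault(end, []).append((start, k))

    H = [[0] * (n2 + 1) for _ in range(n1 + 1)]
    for i in range(1, n1 + 1):
        H[i][0] = i * gap
    for j in range(1, n2 + 1):
        H[0][j] = j * gap
    for i in range(1, n1 + 1):
        for j in range(1, n2 + 1):
            s = match if s1[i - 1] == s2[j - 1] else mismatch
            best = max(H[i - 1][j - 1] + s, H[i - 1][j] + gap, H[i][j - 1] + gap)
            # New maxterm: motif occurrences ending at i and j with the same grammar.
            for si, k1 in x_by_end.get(i, []):
                for sj, k2 in y_by_end.get(j, []):
                    if k1 == k2:
                        best = max(best, H[si - 1][sj - 1] + beta[k1])
            H[i][j] = best
    return H[n1][n2]


# Both sequences contain a G1 motif at positions 3..9 (CGGAACG and GGGAACC).
score = cfg_psa_score("AACGGAACGUU", "AAGGGAACCUU",
                      X=[(3, 9, 1)], Y=[(3, 9, 1)], beta={1: 10})
print(score)  # 2 (AA) + 10 (motif) + 2 (UU) = 14
```

Across all cells, the inner double loop visits each pair of entries from X and Y at most once, which matches the O(|X||Y| + N^2) term above.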
Possible Modifications
• Affine gap penalties
• Local alignment computations
• More advanced algorithms
These can be used in the alignment of non-motif-matching regions, with the same or better complexity.
• Solve the general problem of optimally aligning multiple sequences guided by a given set of motifs described by CFGs.