Regular Expression Constrained Sequence Alignment
Abdullah N. Arslan, Assistant Professor
Computer Science Department
Outline
• Sequence alignment
  • Common framework
  • DP solution
  • Why constrained?
• RE constrained sequence alignment
  • Algorithm
• Concluding remarks
Alignment Matrix
Edit Graph
Dynamic Programming Solution
H_{i,j}: maximum score achieved at (i, j)
H_{i,j} = 0 whenever i = 0 or j = 0
H_{n,m} is computed in O(nm) time and O(m) space
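The O(nm) time, O(m) space bound can be illustrated with a two-row table. The sketch below is only an illustration: the recurrence is the standard three-way maximum, and the scoring values (`match`, `mismatch`, `gap`) and the function name are assumptions, not taken from the slides. The zero border follows the condition stated above.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return H[n][m] while keeping only two rows of the table in memory."""
    n, m = len(a), len(b)
    prev = [0] * (m + 1)              # row 0: H[0][j] = 0
    for i in range(1, n + 1):
        cur = [0] * (m + 1)           # cur[0] is the border H[i][0] = 0
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch  # assumed scoring
            cur[j] = max(prev[j - 1] + s,    # align a[i-1] with b[j-1]
                         prev[j] + gap,      # a[i-1] against a gap
                         cur[j - 1] + gap)   # b[j-1] against a gap
        prev = cur                    # row i-1 is no longer needed
    return prev[m]
```

Only the score is returned; recovering the alignment itself in linear space would need a divide-and-conquer step that this sketch leaves out.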
DP Solution: Local Alignment
H_{i,j}: similarity score achieved at (i, j)
H_{i,j} = 0 whenever i = 0 or j = 0
max H_{i,j} is computed in O(nm) time and O(m) space
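For local alignment the answer is the largest entry of the table rather than the last one. A minimal sketch, assuming the usual Smith-Waterman recurrence with a floor of 0 and illustrative scoring values that are not from the slides:

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-1):
    """Return max over (i, j) of H[i][j], using two rows of the table."""
    n, m = len(a), len(b)
    prev = [0] * (m + 1)
    best = 0
    for i in range(1, n + 1):
        cur = [0] * (m + 1)
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch  # assumed scoring
            cur[j] = max(0,                  # start a new local alignment here
                         prev[j - 1] + s,
                         prev[j] + gap,
                         cur[j - 1] + gap)
            best = max(best, cur[j])         # track the maximum entry
        prev = cur
    return best
```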
Dynamic Programming Formulation
Affine gap penalties: the penalty for a gap of length k is a gap-opening cost plus (k-1) times a gap-extension cost
H_{i,j} = F_{i,j} = E_{i,j} = 0 when i = 0 or j = 0
max H_{i,j} is computed in O(nm) time and O(m) space
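A sketch of how the three tables can interact under affine gaps, assuming the common Gotoh-style recurrences: E tracks alignments ending in a gap along the row, F those ending in a gap along the column. The parameter names `gap_open` and `gap_extend` and their values are hypothetical; a gap of length k costs `gap_open + (k - 1) * gap_extend`.

```python
def local_affine_score(a, b, match=2, mismatch=-1, gap_open=3, gap_extend=1):
    """Return max H[i][j] for local alignment with affine gap penalties."""
    n, m = len(a), len(b)
    h_prev = [0] * (m + 1)            # H on row i-1 (zero border)
    f_prev = [0] * (m + 1)            # F on row i-1 (zero border)
    best = 0
    for i in range(1, n + 1):
        h_cur = [0] * (m + 1)
        f_cur = [0] * (m + 1)
        e = 0                         # E[i][0] = 0
        for j in range(1, m + 1):
            # extend an open gap by one position, or open a new one
            e = max(e - gap_extend, h_cur[j - 1] - gap_open)
            f_cur[j] = max(f_prev[j] - gap_extend, h_prev[j] - gap_open)
            s = match if a[i - 1] == b[j - 1] else mismatch
            h_cur[j] = max(0, h_prev[j - 1] + s, e, f_cur[j])
            best = max(best, h_cur[j])
        h_prev, f_prev = h_cur, f_cur
    return best
```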
The Definition of the Constrained LCS Problem
• The constrained LCS (CLCS) problem: given strings S1, S2, and P
• Find a longest common subsequence of S1 and S2 among those that contain P as a subsequence
• Motivation: computing the homology of two biological sequences that have a specific part in common
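The CLCS definition can be made concrete with a three-index table. This is a sketch of one O(nmr) formulation, not necessarily the exact recurrence of any cited paper: `L[k][i][j]` is assumed to hold the length of the longest common subsequence of `s1[:i]` and `s2[:j]` that contains `p[:k]` as a subsequence, with minus infinity marking "impossible".

```python
def clcs_length(s1, s2, p):
    """Length of the longest common subsequence of s1 and s2 containing p, or None."""
    NEG = float("-inf")
    n, m, r = len(s1), len(s2), len(p)
    L = [[[NEG] * (m + 1) for _ in range(n + 1)] for _ in range(r + 1)]
    for i in range(n + 1):            # with no constraint, empty prefixes give 0
        L[0][i][0] = 0
    for j in range(m + 1):
        L[0][0][j] = 0
    for k in range(r + 1):
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                best = max(L[k][i - 1][j], L[k][i][j - 1])
                if s1[i - 1] == s2[j - 1]:
                    # use the common symbol without advancing in p
                    best = max(best, L[k][i - 1][j - 1] + 1)
                    if k > 0 and s1[i - 1] == p[k - 1]:
                        # use the common symbol to cover p[k-1]
                        best = max(best, L[k - 1][i - 1][j - 1] + 1)
                L[k][i][j] = best
    result = L[r][n][m]
    return None if result == NEG else result
```

For example, with `s1 = "ABCBDAB"` and `s2 = "BDCABA"`, the constraint `p = "DC"` cannot be satisfied because C precedes D in `s1`, so the function returns `None`.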
Constrained Sequence Alignment Problems
• Constrained LCS
  • Tsai 2003: O(n^2 m^2 r) time
  • Chin et al. 2004; Arslan and Egecioglu 2004: O(nmr) time
• Edit-distance constrained sequence alignment
  • Arslan and Egecioglu 2004: O(dnmr)
• Regular-expression constrained sequence alignment
  • Motivation: Comet and Henry 2002; PROSITE patterns
  • This paper
PROSITE patterns as constraints
• PROSITE patterns are regular expressions with no Kleene closure
  • PROSITE database
  • e.g. [GA]-X(4)-G-K-[ST]: ATP/GTP-binding site motif A (P-loop) (PS00017)
• Comet and Henry reward alignments
• Regular expression constrained sequence alignment
  • Find a maximal alignment that includes a given RE
Example: For [GA]-X(4)-G-K-[ST]
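Because such a pattern has no Kleene closure, it translates directly into a bounded regular expression. The helper below is a hypothetical sketch that handles only the subset of PROSITE syntax seen in this example (residues, bracketed sets, `X` wildcards, and `(n)` or `(n,m)` repeat counts) and maps it to Python `re` syntax.

```python
import re

def prosite_to_regex(pattern):
    """Translate a simple PROSITE pattern such as [GA]-X(4)-G-K-[ST] to a regex."""
    element = re.compile(r"(\[[A-Z]+\]|[A-Zx])(?:\((\d+)(?:,(\d+))?\))?")
    parts = []
    for elem in pattern.split("-"):
        m = element.fullmatch(elem)
        if m is None:
            raise ValueError("unsupported pattern element: " + elem)
        residue, lo, hi = m.groups()
        if residue in ("X", "x"):
            residue = "."                 # X stands for any residue
        if lo is None:
            count = ""
        elif hi is None:
            count = "{%s}" % lo           # exact repeat count
        else:
            count = "{%s,%s}" % (lo, hi)  # bounded range
        parts.append(residue + count)
    return "".join(parts)

P_LOOP = prosite_to_regex("[GA]-X(4)-G-K-[ST]")
```

Here `P_LOOP` is the string `[GA].{4}GK[ST]`, which matches `GAAAAGKT` in full but not `GAAAGKT`, since the latter has only three residues between the first position and the G-K pair.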
Using Edit Graph: e.g. A(C+G)*(S+T)
Automata for A(C+G)*(S+T)
Some Details of Automata Construction
• Build an NFA N equivalent to the given RE R
• Construct from N a new NxN automaton
  • Moves on edit operations (or equivalently on alignment columns)
  • States have weights
  • Interested in the weights of the final states after the alignment is complete
Weighted Automaton
• Initial weights are -∞
• Weight of (q0,q0) is initially 0
• Update new maximum scores at reachable states
• Weights remain -∞ in unreachable states
• What are the maximum weights at the final states?
Computations on Automata
Complexity
• Simulate the automata based on the DP solution
  • Each step requires examining the transition functions
  • Maintain a list of active (reachable) states
  • Update state weights as alignments are formed
  • Automaton M_{i,j} has the optimum weights
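The simulation can be sketched as a DP in which every cell (i, j) carries a dictionary of reachable states and their best weights; absent states stand for weight -∞. This is a simplified illustration under stated assumptions, not the construction from the paper: the NFA for A(C+G)*(S+T) is written by hand, the markers `PRE` and `POST` for the parts of the alignment before and after the constrained region are hypothetical, the scoring values are made up, and all cells are kept in memory for clarity.

```python
from collections import defaultdict

# Hand-built NFA for A(C+G)*(S+T): (state, symbol) -> set of next states.
DELTA = {(0, "A"): {1}, (1, "C"): {1}, (1, "G"): {1}, (1, "S"): {2}, (1, "T"): {2}}
START, FINALS = 0, {2}
PRE, POST = "pre", "post"   # hypothetical markers: before / after the region

def step(state, symbol):
    return DELTA.get((state, symbol), ())

def relax(cell, state, weight):
    """Keep the maximum weight seen for a state; missing means -infinity."""
    if weight > cell.get(state, float("-inf")):
        cell[state] = weight

def close(cell):
    """Free moves inside a cell: enter the region, or leave it from a final pair."""
    if PRE in cell:
        relax(cell, (START, START), cell[PRE])
    for state, w in list(cell.items()):
        if isinstance(state, tuple) and state[0] in FINALS and state[1] in FINALS:
            relax(cell, POST, w)

def re_constrained_score(s1, s2, match=1, mismatch=-1, gap=-1):
    """Best global alignment score with a region where both substrings match the RE."""
    n, m = len(s1), len(s2)
    W = defaultdict(dict)               # W[i, j]: active states -> weights
    W[0, 0][PRE] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            cell = W[i, j]
            close(cell)
            a = s1[i] if i < n else None
            b = s2[j] if j < m else None
            sub = match if a == b else mismatch
            for state, w in cell.items():
                if state in (PRE, POST):        # outside the region: plain DP
                    if a is not None:
                        relax(W[i + 1, j], state, w + gap)
                    if b is not None:
                        relax(W[i, j + 1], state, w + gap)
                    if a is not None and b is not None:
                        relax(W[i + 1, j + 1], state, w + sub)
                    continue
                p, q = state                    # inside the region: move the NFA pair
                for p2 in step(p, a):           # column (a, -)
                    relax(W[i + 1, j], (p2, q), w + gap)
                for q2 in step(q, b):           # column (-, b)
                    relax(W[i, j + 1], (p, q2), w + gap)
                for p2 in step(p, a):           # column (a, b)
                    for q2 in step(q, b):
                        relax(W[i + 1, j + 1], (p2, q2), w + sub)
    return W[n, m].get(POST)            # None if the constraint cannot be met
```

With these assumed scores, `re_constrained_score("XACGTY", "XAGSY")` aligns ACGT against AGS inside the region and returns 2, while inputs containing no match for the expression return `None`. Keeping only two rows of cells would reduce the space, at the cost of a slightly longer listing.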
Generalizations: Local Alignment & Affine gaps
CONCLUSION
• Introduced the regular expression constrained sequence alignment problem
• Presented an algorithm for the problem
• Future work: generalization of the problem for
  • Multiple sequence alignment
  • Multiple regular expressions as a constraint
Thank You