GTCAGATGAGCAAAGTAGACACTCCAGTAACGCGGTGAGTACATTAA exon intron intergene Find Gene Structures in DNA...
Date posted: 19-Dec-2015
GTCAGATGAGCAAAGTAGACACTCCAGTAACGCGGTGAGTACATTAA
[Figure: the sequence above annotated with exon, intron, and intergene regions]
Find Gene Structures in DNA
[Figure: HMM states labeled Intergene State, First Exon State, Intron State]
Hidden Markov Model for Gene Finding
• Intron, Exon, Intergenic states
• Exon frame is encoded in the architecture by defining more states
• Exon states have explicit duration density
• Intron states have geometric duration
• Parameters are trained separately for different levels of GC content (which correlates with the number of genes and with the lengths of exons and introns)
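The remark that intron states have geometric duration can be illustrated with a short sketch. It assumes an ordinary HMM state with a self-transition probability, here called `p_stay` (a name chosen for this example, not taken from the slides). Staying in the state for exactly d steps then has probability p_stay^(d-1) * (1 - p_stay), which is a geometric distribution.

```python
def geometric_duration_pmf(p_stay, d):
    """Probability of spending exactly d steps in a state whose
    self-transition probability is p_stay (hypothetical parameter name)."""
    if d < 1:
        return 0.0
    # d - 1 self-transitions followed by one transition out of the state
    return p_stay ** (d - 1) * (1.0 - p_stay)


def mean_duration(p_stay):
    """Expected number of steps spent in the state: 1 / (1 - p_stay)."""
    return 1.0 / (1.0 - p_stay)


if __name__ == "__main__":
    for d in (1, 10, 100):
        print(d, geometric_duration_pmf(0.99, d))
    print("mean length:", mean_duration(0.99))
```

A geometric distribution always peaks at d = 1, which is one reason a model might give exon states an explicit duration density instead.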
Comparison-based Methods
Cross-species gene finding
[Figure: a gene drawn 5’ to 3’ as Exon1, Intron1, Exon2, Intron2, Exon3, with the human and mouse sequences aligned]
[human] GGTTTT--ATGAGTAAAGTAGACACTCCAGTAACGCGGTGAGTAC----ATTAA
[mouse] C-TCAGGAATGAGCAAAGTCGAC---CCAGTAACGCGGTAAGTACATTAACGA-
Comparison of 1196 orthologous genes (Makalowski et al., 1996)
• Sequence identity between genes in human/mouse
– exons: 84.6%
– protein: 85.4%
– introns: 35%
– 5’ UTRs: 67%
– 3’ UTRs: 69%
• 27 proteins were 100% identical.
Human Mouse
Human-mouse homology
Not always: HoxA human-mouse
Twinscan
• Twinscan is an augmented version of the Genscan HMM.
[Figure: exon (E) and intron (I) states annotated with transitions, duration, and emissions, e.g. ACUAUACAGACAUAUAUCAU]
Twinscan Algorithm
1. Align the two sequences (e.g. from human and mouse)
2. Mark each human base as gap ( - ), mismatch ( : ), or match ( | )
New “alphabet”: 4 x 3 = 12 letters
Σ = { A-, A:, A|, C-, C:, C|, G-, G:, G|, U-, U:, U| }
Twinscan Algorithm
3. Run Viterbi using emissions ek(b) where b ∈ { A-, A:, A|, …, U| }
Note:
Emission distributions ek(b) estimated from real genes from human/mouse
eI(x|) < eE(x|): matches favored in exons
eI(x-) > eE(x-): gaps (and mismatches) favored in introns
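A minimal sketch of step 3, assuming a toy two-state model with only an exon state E and an intron state I. All numbers below (`EMIT_BY_MARK`, `P_STAY`, the uniform start) are made-up illustrative values, not Twinscan's trained parameters. They are chosen only so that eE(x|) > eI(x|) and eE(x-) < eI(x-), as in the note above.

```python
import math

STATES = ("E", "I")

# HYPOTHETICAL emission probabilities per conservation mark; each is split
# evenly over the four bases, so every state sums to 1 over the 12 letters.
EMIT_BY_MARK = {
    "E": {"|": 0.80, ":": 0.15, "-": 0.05},
    "I": {"|": 0.40, ":": 0.30, "-": 0.30},
}
P_STAY = 0.9  # HYPOTHETICAL probability of remaining in the same state


def viterbi(tokens):
    """Most probable E/I labeling of tokens such as ['A|', 'C:', 'G-']."""
    if not tokens:
        return ""

    def emit(state, token):
        return math.log(EMIT_BY_MARK[state][token[-1]] / 4.0)

    def trans(a, b):
        return math.log(P_STAY if a == b else 1.0 - P_STAY)

    # Initialization with a uniform start distribution (an assumption).
    score = {s: math.log(0.5) + emit(s, tokens[0]) for s in STATES}
    backpointers = []
    for token in tokens[1:]:
        new_score, pointer = {}, {}
        for s in STATES:
            best_prev = max(STATES, key=lambda p: score[p] + trans(p, s))
            new_score[s] = score[best_prev] + trans(best_prev, s) + emit(s, token)
            pointer[s] = best_prev
        score = new_score
        backpointers.append(pointer)

    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointer in reversed(backpointers):
        state = pointer[state]
        path.append(state)
    return "".join(reversed(path))


if __name__ == "__main__":
    print(viterbi(["A|"] * 8 + ["C-"] * 8))
```

The real model has many more states (exon frames, intergenic regions, explicit exon durations). This sketch only shows how the 12-letter alphabet plugs into an otherwise standard Viterbi recursion.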
Example
Human: ACGGCGACGUGCACGU
Mouse: ACUGUGACGUGCACUU
Alignment: ||:|:|||||||||:|
Input to Twinscan HMM: A| C| G: G| C: G| A| C| G| U| G| C| A| C| G: U|
Recall, eE(A|) > eI(A|)
eE(A-) < eI(A-)
Likely exon
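The conversion in the example above can be sketched as a small function. The function name and the decision to skip columns where the human row has a gap are assumptions made for this illustration; the slides only say that each human base is marked.

```python
def conservation_tokens(human, mouse):
    """Turn two aligned rows into the 12-letter alphabet: one token per human base.

    Both rows must have equal length; '-' denotes a gap.
    """
    if len(human) != len(mouse):
        raise ValueError("aligned rows must have the same length")
    tokens = []
    for h, m in zip(human, mouse):
        if h == "-":
            # Assumption: columns with no human base produce no token.
            continue
        if m == "-":
            mark = "-"   # human base aligned to a gap
        elif h == m:
            mark = "|"   # match
        else:
            mark = ":"   # mismatch
        tokens.append(h + mark)
    return tokens


if __name__ == "__main__":
    print(" ".join(conservation_tokens("ACGGCGACGUGCACGU", "ACUGUGACGUGCACUU")))
```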
HMMs for simultaneous alignment and gene finding:
Generalized Pair HMMs
A Pair HMM for alignments
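[Figure: pair HMM with BEGIN and END states and three emitting states: M emits an aligned pair with probability P(xi, yj), I emits xi with probability P(xi), J emits yj with probability P(yj)]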
Generalized Pair HMMs
Exon GPHMM
1. Choose exon lengths (d, e).
2. Generate alignment of length d+e.
Cross-species gene finding
[Figure: a gene drawn 5’ to 3’ as Exon1, Intron1, Exon2, Intron2, Exon3, with CNS regions marked, shown for human and mouse]
The SLAM hidden Markov model
Computational complexity
Model   Time          Space
HMM     N^2 T         N T
PHMM    N^2 T U       N T U
GHMM    D^2 N^2 T     N T
GPHMM   D^4 N^2 T U   N T U
N = no. of states
D = max duration
T = length of seq1
U = length of seq2
Approximate alignment
Reduces the TU factor to hT
Measuring Performance
Example: HoxA2 and HoxA3
[Figure tracks: SLAM, SGP-2, Twinscan, Genscan, TBLASTX, SLAM CNS, VISTA, RefSeq]
Suffix Trees
(a short break from biology)
Suffix Trees
• Suffix trees are a method to find all maximal matches between two strings (and much more)
Example: x = dabdac
[Figure: suffix tree of dabdac, with leaves numbered 1 to 6]
Definition of a Suffix Tree
Definition:
For string x = x1…xm, a suffix tree is:
A rooted tree with m leaves
Leaf i: xi…xm
Each edge is a substring
No two edges out of the same node start with the same letter
It follows that every substring corresponds to an initial part of a path from the root to a leaf
Naïve Algorithm to Construct a Suffix Tree
1. Initialize tree T: a single root node r
2. Insert special symbol $ at end of x
3. For j = 1 to m
• Find longest match of xj…xm to T, starting from r
• Split edge where match stops: new node w
• Create edge (w, j), and label with unmatched portion of xj…xm
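The naïve algorithm above can be sketched as follows. This is one possible rendering, not the course's reference code: the `Node` class, the function names, and the choice to store each edge label as a pair of indices into x (rather than as a copied string) are assumptions made for this example. It runs in quadratic time, as expected for the naïve method.

```python
class Node:
    def __init__(self):
        # first letter of edge label -> (start, end, child); label is x[start:end]
        self.children = {}
        self.leaf = None  # 1-based start position of the suffix, for leaves only


def build_suffix_tree(x):
    """Naive construction: insert suffixes x_j..x_m one at a time."""
    if not x.endswith("$"):
        x += "$"                       # step 2: terminal symbol
    root = Node()                      # step 1: single root node
    for j in range(len(x)):            # step 3 (0-based here)
        node, pos = root, j
        while True:
            first = x[pos]
            if first not in node.children:
                # Match stops at a node: hang a new leaf edge here.
                leaf = Node()
                leaf.leaf = j + 1
                node.children[first] = (pos, len(x), leaf)
                break
            start, end, child = node.children[first]
            k = 0
            while start + k < end and x[start + k] == x[pos + k]:
                k += 1
            if start + k == end:
                # Whole edge matched: continue from the child node.
                node, pos = child, pos + k
                continue
            # Match stops inside the edge: split it at a new node w.
            w = Node()
            w.children[x[start + k]] = (start + k, end, child)
            leaf = Node()
            leaf.leaf = j + 1
            w.children[x[pos + k]] = (pos + k, len(x), leaf)
            node.children[first] = (start, start + k, w)
            break
    return x, root


def leaf_paths(x, root):
    """Map each leaf number to the label of its root-to-leaf path."""
    paths = {}
    stack = [(root, "")]
    while stack:
        node, label = stack.pop()
        if node.leaf is not None:
            paths[node.leaf] = label
        for start, end, child in node.children.values():
            stack.append((child, label + x[start:end]))
    return paths


if __name__ == "__main__":
    text, tree = build_suffix_tree("dabda")
    for leaf, label in sorted(leaf_paths(text, tree).items()):
        print(leaf, label)
```

Because $ occurs only at the end of x, no suffix is a prefix of another, so every insertion ends by creating exactly one new leaf.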
Example of Suffix Tree Construction
x = d a b d a $
1. Insert d a b d a $
2. Insert a b d a $
3. Insert b d a $
4. Insert d a $
5. Insert a $
6. Insert $
[Figure: the tree after each insertion, ending with leaves numbered 1 to 6]
Memory to Store Suffix Tree
• Can store in O( N ) memory!
• Every edge is labeled with (i, j):
(i,j) denotes xi…xj
• Tree has O( N ) nodes
Proof:
1. Every internal node has at least two children, so there are fewer internal nodes than leaves
2. # leaves = |x|
Faster Construction
Several algorithms
O( N ) time,
O( N ) memory with a big constant ~15 bytes/char
Technical but not deep, outside the scope of this course
Optional: Gusfield, chapter 6
Application: find all matches between x, y
1. Build suffix tree for x, mark nodes with x
2. Insert y in suffix tree, mark all nodes y “passes through” with y
The path label of every node marked with both x and y is a common substring
Example of Suffix Tree construction
x = d a b d a $
y = a b a d a $
1. Construct tree for x
2. Insert a b a d a $
3. Insert b a d a $
4. Insert a d a $
5. Insert d a $
6. Insert a $
7. Insert $
[Figure: the resulting tree, with each node marked x, y, or both]
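The marking idea can be sketched in a few lines. For brevity this uses an uncompressed suffix trie (one character per edge) rather than a true suffix tree, so it takes quadratic time and space. The marks work the same way. The function name and the dictionary-based node layout are choices made for this illustration.

```python
def common_substrings(x, y):
    """All substrings shared by x and y, found by marking suffix-trie nodes."""
    root = {"children": {}, "marks": set()}
    # Insert every suffix of x, then every suffix of y, marking visited nodes.
    for name, s in (("x", x), ("y", y)):
        for start in range(len(s)):
            node = root
            for ch in s[start:]:
                node = node["children"].setdefault(
                    ch, {"children": {}, "marks": set()}
                )
                node["marks"].add(name)
    # A node marked with both x and y spells out a common substring.
    found = []
    stack = [(root, "")]
    while stack:
        node, label = stack.pop()
        for ch, child in node["children"].items():
            if child["marks"] == {"x", "y"}:
                found.append(label + ch)
                # Only descend below common nodes; anything deeper under an
                # unshared node cannot be shared either.
                stack.append((child, label + ch))
    return sorted(found, key=lambda s: (len(s), s))


if __name__ == "__main__":
    print(common_substrings("dabda", "abada"))
```

The terminal symbol $ is omitted here because the trie does not need leaves to be distinct; a compressed suffix tree would.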
Application: common substrings of k strings
To find the longest common substring of s1, s2, …sn
1. Build suffix tree for s1,…, sn
2. All nodes labeled {si1, …, sik} represent a match between si1, …, sik
Suffix Arrays
ABRACADABRA$
11 $
10 A$
 7 ABRA$
 0 ABRACADABRA$
 3 ACADABRA$
 5 ADABRA$
 8 BRA$
 1 BRACADABRA$
 4 CADABRA$
 6 DABRA$
 9 RA$
 2 RACADABRA$
• Fast search for any query string: O(m log n) by binary search, for a pattern of length m
• Used for data compression such as bzip2
• Can be built in O(n) time by first building the suffix tree and then reading off the suffixes in sorted order with a depth-first traversal that visits children in lexicographic order
– Too much memory: ~15n bytes
– Difficult to implement
• Theoretical build in O(n log n) using O(n/ sqrt(log n)) extra memory
• Hot topic: how to build fast in practice
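A minimal sketch of a suffix array and its binary search, using the ABRACADABRA$ example. The construction below simply sorts the suffixes with Python's built-in sort. That is not one of the fast construction methods mentioned above, and it compares whole suffixes, so it is only suitable for small inputs. The function names are chosen for this example.

```python
def build_suffix_array(text):
    """Start positions of all suffixes of text, in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """All start positions of pattern in text, via two binary searches."""
    m = len(pattern)

    # First suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # First suffix whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[first:lo])


if __name__ == "__main__":
    text = "ABRACADABRA$"
    sa = build_suffix_array(text)
    print(sa)
    print(find_occurrences(text, sa, "ABRA"))
```

Each binary search makes O(log n) comparisons of at most m characters, which is where the O(m log n) search bound comes from.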