Simple cluster structure of triplet distributions in genetic texts Andrei Zinovyev
Simple cluster structure of triplet distributions in genetic texts
Andrei Zinovyev
Institut des Hautes Études Scientifiques, Bures-sur-Yvette
Transition probabilities = Frequencies of N-grams
…AGGTCGATC…
Markov chain models
AGGTCGATCAATCCGTATTGACAATCCAATCCGTATTGACATGACAATCCAACATGACAATC
Sliding window of width W
$f_{AAA}, f_{AAC}, \ldots, f_{GGG}, \ldots = f_{ijk}$, where $i, j, k \in \{A, C, G, T\}$
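The triplet frequencies f_ijk of a single window can be sketched as follows. This is a minimal illustration, assuming non-overlapping triplets read from the first base of the window (as in the experiment described later in the talk); the names `TRIPLETS` and `triplet_frequencies` are hypothetical and not taken from the slides.

```python
from itertools import product

ALPHABET = "ACGT"
# All 64 triplets in a fixed order: AAA, AAC, ..., TTT.
TRIPLETS = ["".join(t) for t in product(ALPHABET, repeat=3)]


def triplet_frequencies(window):
    """Return {triplet: frequency} for the non-overlapping triplets of `window`.

    Triplets are read from the first base of the window. A trailing
    remainder of 1 or 2 bases and triplets containing letters outside
    ACGT are ignored (an assumption of this sketch).
    """
    counts = dict.fromkeys(TRIPLETS, 0)
    for start in range(0, len(window) - 2, 3):
        codon = window[start:start + 3]
        if codon in counts:
            counts[codon] += 1
    total = sum(counts.values())
    if total == 0:
        return {t: 0.0 for t in TRIPLETS}
    return {t: n / total for t, n in counts.items()}


# The window AGG TCG ATC AAT CCG contains five distinct triplets.
f = triplet_frequencies("AGGTCGATCAATCCG")
print(f["AGG"], f["TCG"], f["AAA"])
```

Every window is mapped to a point with 64 coordinates, whatever its width W.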
AGGTCGATGAATCCGTATTGACAAATGAATCCGTAATGACATGACAATCCAACATGACAAT
Protein-coding sequences
bacterial gene
correct frame: $f_{ijk}$; shifted frames: $f^{(1)}_{ijk}$, $f^{(2)}_{ijk}$
$P^{(1)} f_{ijk} \equiv \sum_{l,m,n} f_{lij} f_{kmn}$
$P^{(2)} f_{ijk} \equiv \sum_{l,m,n} f_{lmi} f_{jkn}$
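The two phase-shift operators can be sketched in code. The sketch below assumes the definitions P^(1) f_ijk = sum over l,m,n of f_lij f_kmn and P^(2) f_ijk = sum over l,m,n of f_lmi f_jkn, and uses the fact that each sum factorizes into a product of two marginal sums. The function names and the dictionary representation are illustrative assumptions.

```python
from itertools import product

TRIPLETS = ["".join(t) for t in product("ACGT", repeat=3)]


def shift_phase_1(f):
    """P^(1) f_ijk = (sum_l f_lij) * (sum_{m,n} f_kmn)."""
    tail = {}  # tail[ij] = sum over l of f_lij
    head = {}  # head[k]  = sum over m,n of f_kmn
    for t, v in f.items():
        tail[t[1:]] = tail.get(t[1:], 0.0) + v
        head[t[0]] = head.get(t[0], 0.0) + v
    return {t: tail.get(t[:2], 0.0) * head.get(t[2], 0.0) for t in TRIPLETS}


def shift_phase_2(f):
    """P^(2) f_ijk = (sum_{l,m} f_lmi) * (sum_n f_jkn)."""
    last = {}  # last[i]  = sum over l,m of f_lmi
    lead = {}  # lead[jk] = sum over n of f_jkn
    for t, v in f.items():
        last[t[2]] = last.get(t[2], 0.0) + v
        lead[t[:2]] = lead.get(t[:2], 0.0) + v
    return {t: last.get(t[0], 0.0) * lead.get(t[1:], 0.0) for t in TRIPLETS}


# Toy codon usage with only two codons, ATG and GCC, equally frequent.
f = {"ATG": 0.5, "GCC": 0.5}
p1 = shift_phase_1(f)
p2 = shift_phase_2(f)
print(p1["TGA"], p1["TGG"], p2["GAT"], p2["CGC"])
```

Both formulas treat neighbouring codons as independent, so the shifted distributions follow from the codon usage alone.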
TCCAGCTTA TGAGGCATAACTGTTTACTGAGGCCAT ACT GTACTGTTAGGTTGTACTGTTA
AGGTCGAATACTCCGTATTGACAAATGACTCCGGTATGACATGACAATCCAACATGACAAT
“Shadow” genes
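shadow gene: $\hat{f}_{ijk} = C^{R} f_{ijk} = f_{\hat{k}\hat{j}\hat{i}}$, where $\hat{A} = T$, $\hat{C} = G$
$\hat{f}^{(1)}_{ijk} = P^{(1)} \hat{f}_{ijk}$, $\hat{f}^{(2)}_{ijk} = P^{(2)} \hat{f}_{ijk}$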
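The shadow-gene distribution can be sketched as a reverse-complement relabelling of the triplet frequencies, assuming the operator C^R f_ijk = f_{k^ j^ i^} with the usual complement pairs A/T and C/G. The names `COMPLEMENT` and `shadow_distribution` are hypothetical.

```python
from itertools import product

TRIPLETS = ["".join(t) for t in product("ACGT", repeat=3)]
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(triplet):
    """Map ijk to k^ j^ i^ (complement each base, reverse the order)."""
    return "".join(COMPLEMENT[b] for b in reversed(triplet))


def shadow_distribution(f):
    """Return the distribution seen on the forward strand for a gene
    located on the complementary strand: f^_ijk = f_{k^ j^ i^}."""
    return {t: f.get(reverse_complement(t), 0.0) for t in TRIPLETS}


# A gene made only of ATG codons shows up as CAT on the opposite strand.
f = {"ATG": 1.0}
shadow = shadow_distribution(f)
print(shadow["CAT"], shadow["ATG"])
```

Applying the operator twice returns the original distribution, since reverse complementation is its own inverse.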
When can we detect genes (by their content)?
1. When non-coding regions are very different in base composition (e.g., different GC-content)
2. When distances between the phases are large:
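[Figure: relative positions of the distributions $f_{ijk}$, $P^{(1)} f_{ijk}$, $P^{(2)} f_{ijk}$ and the non-coding $f_{ijk}$]
$M = \sum_{ijk} f_{ijk} \log_2 \frac{f_{ijk}}{p_i p_j p_k}$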
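The quantity M can be evaluated for a given triplet distribution as sketched below. The slide does not define p_i, p_j and p_k; this sketch assumes they are the nucleotide frequencies at the first, second and third triplet positions, computed as marginals of f_ijk. That choice and the function name are assumptions.

```python
from math import log2


def mutual_information(f):
    """M = sum_ijk f_ijk * log2(f_ijk / (p_i * p_j * p_k)).

    p_i, p_j, p_k are taken as position-specific marginals of f
    (an assumption of this sketch). Terms with f_ijk = 0 contribute 0.
    """
    marginals = [{}, {}, {}]
    for t, v in f.items():
        for pos in range(3):
            marginals[pos][t[pos]] = marginals[pos].get(t[pos], 0.0) + v
    m = 0.0
    for t, v in f.items():
        if v > 0.0:
            expected = (marginals[0][t[0]]
                        * marginals[1][t[1]]
                        * marginals[2][t[2]])
            m += v * log2(v / expected)
    return m


# Two equally frequent codons with fully correlated positions.
print(mutual_information({"AAA": 0.5, "CCC": 0.5}))
```

M is zero when the three positions are statistically independent and grows as the triplet distribution departs from the product of its marginals.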
Simple experiment
1. Only the forward strands of genomes are used for triplet counting
2. Every p positions along the sequence, a window (x - W/2, x + W/2) of size W is opened, centered at position x
3. Every window, starting from its first base pair, is divided into W/3 non-overlapping triplets, and the frequencies of all triplets $f_{ijk}$ are calculated
4. The dataset consists of N = [L/p] points, where L is the entire length of the sequence
5. Every data point $X_i = \{x_{is}\}$ corresponds to one window and has 64 coordinates, corresponding to the frequencies of all possible triplets s = 1,…,64
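The steps above can be sketched as a short routine that turns a sequence into a list of 64-dimensional points. The handling of the sequence ends is not specified on the slide; this sketch simply skips windows that would run past either end, so it yields slightly fewer than [L/p] points. The function and variable names are illustrative assumptions.

```python
from itertools import product

TRIPLETS = ["".join(t) for t in product("ACGT", repeat=3)]
INDEX = {t: s for s, t in enumerate(TRIPLETS)}  # triplet -> coordinate s


def window_dataset(sequence, W, p):
    """Return one 64-dimensional frequency vector per sliding window."""
    data = []
    half = W // 2
    # Window centres x are placed every p positions; windows that do not
    # fit entirely inside the sequence are skipped (sketch assumption).
    for x in range(half, len(sequence) - W + half + 1, p):
        window = sequence[x - half:x - half + W]
        counts = [0] * 64
        # Non-overlapping triplets, starting from the first base of the window.
        for start in range(0, W - 2, 3):
            s = INDEX.get(window[start:start + 3])
            if s is not None:
                counts[s] += 1
        total = sum(counts)
        data.append([c / total if total else 0.0 for c in counts])
    return data


sequence = "ATG" * 20  # toy "gene" of 60 bases
in_phase = window_dataset(sequence, W=12, p=3)
all_phases = window_dataset(sequence, W=12, p=1)
print(len(in_phase), len(all_phases))
print(in_phase[0][INDEX["ATG"]], all_phases[1][INDEX["TGA"]])
```

With p not divisible by 3, consecutive windows read the same gene in all three phases, which is what separates the data into distinct clusters.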
Principal Component Analysis
[Figure: data cloud with the direction of maximal dispersion, the 1st principal axis and the 2nd principal axis]
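The first principal axis, the direction of maximal dispersion of the data cloud, can be found with a few lines of code. The sketch below uses plain power iteration on the covariance matrix; this is a stand-in chosen for brevity and is not claimed to be what the ViDaExpert tool does. The function name and the start vector are assumptions.

```python
from math import sqrt


def first_principal_axis(points, iterations=200):
    """Return (mean, unit axis) for the direction of maximal dispersion."""
    n, dim = len(points), len(points[0])
    mean = [sum(p[d] for p in points) / n for d in range(dim)]
    centred = [[p[d] - mean[d] for d in range(dim)] for p in points]

    # Arbitrary non-symmetric start vector (an assumption of this sketch);
    # power iteration fails only if it is exactly orthogonal to the axis.
    axis = [float(d + 1) for d in range(dim)]
    norm = sqrt(sum(v * v for v in axis))
    axis = [v / norm for v in axis]

    for _ in range(iterations):
        # Multiply by the unnormalised covariance matrix without forming it:
        # new = sum over points of (c . axis) * c
        proj = [sum(c[d] * axis[d] for d in range(dim)) for c in centred]
        new = [sum(proj[i] * centred[i][d] for i in range(n))
               for d in range(dim)]
        norm = sqrt(sum(v * v for v in new))
        if norm == 0.0:  # all points identical, or an unlucky start vector
            break
        axis = [v / norm for v in new]
    return mean, axis


# Points lying on the diagonal: the axis should point along (1, 1).
points = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
mean, axis = first_principal_axis(points)
print(mean, axis)
```

Projecting each 64-dimensional window vector onto the first two or three such axes gives the low-dimensional pictures shown on the following slides.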
ViDaExpert tool
Caulobacter crescentus (GenBank NC_002696)
[Figure: cluster plot; labelled distributions include $f_{ijk}$, $P^{(1)} f_{ijk}$ and $P^{(2)} f_{ijk}$]
“Path” of sliding window
Helicobacter pylori (GenBank NC_000921)
Saccharomyces cerevisiae chromosome IV
Model sequences: (random codon usage)
Model sequences: (random codon usage + 50% of frequencies set to 0)
Graph of coding phase
Assessment
Sequence                                              L        W    % of coding bases  Sn1   Sp1   Sn2   Sp2
Helicobacter pylori, complete genome (NC_000921)      1643831  300  90                 0.93  0.97  0.93  0.98
Caulobacter crescentus, complete genome (NC_002696)   4016947  300  91                 0.93  0.97  0.94  0.98
Prototheca wickerhamii mitochondrion (NC_001613)      55328    120  49                 0.82  0.93  0.84  0.95
Saccharomyces cerevisiae chromosome III (NC_001135)   316613   399  69                 0.90  0.88  0.90  0.90
Saccharomyces cerevisiae chromosome IV (NC_001136)    1531929  399  73                 0.89  0.91  0.92  0.92
Model text RANDOM                                     100000   500  49                 0.90  0.61  0.82  0.77
Model text RANDOM_BIAS                                100000   500  45                 0.99  0.83  0.94  0.90
$Sn = \frac{TP}{TP + FN}$
$Sp = \frac{TP}{TP + FP}$
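The two quality measures, Sn = TP / (TP + FN) and Sp = TP / (TP + FP), can be computed from a predicted and a reference annotation as sketched below. The sketch assumes a per-base labelling (1 for coding, 0 for non-coding), which the slide does not state explicitly; the function name is hypothetical.

```python
def assess(predicted, actual):
    """Return (Sn, Sp) for per-base coding / non-coding labels.

    Sn = TP / (TP + FN), Sp = TP / (TP + FP), as defined on the slide.
    A measure is reported as 0.0 when its denominator is zero.
    """
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tp / (tp + fp) if tp + fp else 0.0
    return sn, sp


actual = [1, 1, 1, 1, 0, 0, 0, 0]     # reference annotation
predicted = [1, 1, 1, 0, 1, 0, 0, 0]  # one missed base, one false call
sn, sp = assess(predicted, actual)
print(sn, sp)
```

Note that Sp here is the fraction of predicted coding bases that are truly coding, following the formula on the slide.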
Completely blind prediction
Dependence on window size
[Plot: Sn and Sp as functions of window size; horizontal axis: window size, 0 to 1300; vertical axis: 0.75 to 1]
Dependence on window size
[Figure panels: W = 51, W = 252, W = 900, W = 2000]
State of the art: GLIMMER strategy
1. Use Markov models (MM) of 5th order (hexamers)
2. Use interpolation for transition probabilities
3. Use long ORFs (>500 bp) as the learning dataset
Problems:
1. The number of hexamers to be evaluated is still large
2. Applicable only to assembled genomes of good quality (<1 frameshift per 1000 bp)
What can we learn from this game?
• Learning can be replaced with self-learning
• Bacterial gene-finders work relatively well when the concentration of coding sequences is high
• Correlations in the order of codons are small
• Codon usage is approximately the same along the genome
• The method presented allows self-learning on pieces of even unassembled DNA (>150 bp)
• The method gives an alternative to the HMM view of the problem of gene recognition
Acknowledgements
Professor Alexander Gorban
Professor Misha Gromov
My coordinates: http://www.ihes.fr/~zinovyev