Recognition of regulatory signals
Mikhail S. Gelfand
IntegratedGenomics-Moscow
NATO ASI School, October 2001
Why?
• Additional annotation tool (e.g. specificity of transporters and enzymes from large families)
• Important for practice (in addition to metabolic reconstruction)
• Interesting from the evolutionary point of view
Overview
0. Biological introduction
1. Algorithms
• Representation of signals
• Deriving the signal
• Site recognition
2. Comparative genomics
• Phylogenetic footprinting
• Consistency filtering
Some biology
• Transcription (DNA → RNA)
• Splicing (pre-mRNA → mRNA)
• Translation (mRNA → protein)
• Regulation of transcription in prokaryotes
• … and eukaryotes
• Initiation of translation
Transcription and translation in prokaryotes
Initiation of transcription (bacteria)
Translation in prokaryotes
Translation (details)
Splicing (eukaryotes)
Regulation of transcription in prokaryotes
Structure of DNA-binding domain. Example 1
Structure of DNA-binding domain. Example 2
Protein-DNA interactions
Regulation of transcription in eukaryotes
Representation of signals
• Consensus
• Pattern (consensus with degenerate positions)
• Positional weight matrix (PWM, or profile)
• Logical rules
• RNA signals
Consensus
codB CCCACGAAAACGATTGCTTTTT
purE GCCACGCAACCGTTTTCCTTGC
pyrD GTTCGGAAAACGTTTGCGTTTT
purT CACACGCAAACGTTTTCGTTTA
cvpA CCTACGCAAACGTTTTCTTTTT
purC GATACGCAAACGTGTGCGTCTG
purM GTCTCGCAAACGTTTGCTTTCC
purH GTTGCGCAAACGTTTTCGTTAC
purL TCTACGCAAACGGTTTCGTCGG
consensus ACGCAAACGTTTTCGT
Pattern
codB CCCACGAAAACGATTGCTTTTT
purE GCCACGCAACCGTTTTCCTTGC
pyrD GTTCGGAAAACGTTTGCGTTTT
purT CACACGCAAACGTTTTCGTTTA
cvpA CCTACGCAAACGTTTTCTTTTT
purC GATACGCAAACGTGTGCGTCTG
purM GTCTCGCAAACGTTTGCTTTCC
purH GTTGCGCAAACGTTTTCGTTAC
purL TCTACGCAAACGGTTTCGTCGG
consensus ACGCAAACGTTTTCGT
pattern aCGmAAACGtTTkCkT
Frequency matrix
j a C G m A A A C G t T T k C k T
A 6 0 0 2 9 9 8 0 0 1 0 0 0 0 0 0
C 1 8 0 7 0 0 1 9 0 0 0 0 0 9 1 0
G 1 1 9 0 0 0 0 0 9 1 1 0 5 0 5 0
T 1 0 0 0 0 0 0 0 0 7 8 9 4 0 3 9
$W(b,j) = \ln(N(b,j)+0.5) - 0.25 \sum_i \ln(N(i,j)+0.5)$
Information content: $I = \sum_j \sum_b f(b,j) \log [f(b,j) / p(b)]$
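As a rough illustration of the information content formula, the sketch below evaluates I per position for the count matrix on this slide. Several choices here are assumptions not stated in the slides: logarithm base 2 (bits), a uniform background p(b) = 0.25, and the convention that terms with f(b,j) = 0 contribute nothing.

```python
import math

# Count matrix N(b, j) copied from the slide (9 sites, 16 positions).
COUNTS = {
    "A": [6, 0, 0, 2, 9, 9, 8, 0, 0, 1, 0, 0, 0, 0, 0, 0],
    "C": [1, 8, 0, 7, 0, 0, 1, 9, 0, 0, 0, 0, 0, 9, 1, 0],
    "G": [1, 1, 9, 0, 0, 0, 0, 0, 9, 1, 1, 0, 5, 0, 5, 0],
    "T": [1, 0, 0, 0, 0, 0, 0, 0, 0, 7, 8, 9, 4, 0, 3, 9],
}


def information_content(counts, background=0.25):
    """Return the per-position information content (assumed: bits, uniform background)."""
    length = len(counts["A"])
    per_position = []
    for j in range(length):
        total = sum(counts[b][j] for b in "ACGT")
        info = 0.0
        for b in "ACGT":
            f = counts[b][j] / total
            if f > 0:  # assumed convention: 0 * log(0) = 0
                info += f * math.log2(f / background)
        per_position.append(info)
    return per_position


per_position = information_content(COUNTS)
total_information = sum(per_position)
```

Fully conserved positions (for example the fifth column, where all nine sites have A) reach the maximum of 2 bits under these assumptions; degenerate positions such as k contribute much less.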
Sequence logo
Positional weight matrix (PWM)
j a C G m A A A C G t T T k C k T
A 6 0 0 2 9 9 8 0 0 1 0 0 0 0 0 0
C 1 8 0 7 0 0 1 9 0 0 0 0 0 9 1 0
G 1 1 9 0 0 0 0 0 9 1 1 0 5 0 5 0
T 1 0 0 0 0 0 0 0 0 7 8 9 4 0 3 9
A 1.1 –1.0 –0.7 0.5 2.2 2.2 1.9 –0.7 –0.7 –0.1 –1.0 –0.7 –1.1 –0.7 –1.4 –0.7
C –0.4 1.9 –0.7 1.6 –0.7 –0.7 0.1 2.2 –0.7 –1.2 –1.0 –0.7 –1.1 2.2 –0.3 –0.7
G –0.4 0.1 2.2 –1.1 –0.7 –0.7 –1.0 –0.7 2.2 –0.1 –0.1 –0.7 1.2 –0.7 1.0 –0.7
T –0.4 –1.0 –0.7 –1.1 –0.7 –0.7 –1.0 –0.7 –0.7 1.5 1.9 2.2 1.0 –0.7 0.6 2.2
• Probabilistic motivation: log-likelihood (up to a linear transformation)
• More probabilistic motivation: z-score (with the suitable base of the logarithm)
• Thermodynamical motivation: free energy (assuming independence of positions, up to a linear transformation)
• Pseudocounts
$W(b,j) = \ln(N(b,j)+0.5) - 0.25 \sum_i \ln(N(i,j)+0.5)$
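The weight formula can be checked against the table on this slide. The following sketch applies it to the count matrix with the pseudocount 0.5 and the natural logarithm, as written in the formula; the function and variable names are illustrative only.

```python
import math

# Count matrix N(b, j) copied from the slide (9 sites, 16 positions).
COUNTS = {
    "A": [6, 0, 0, 2, 9, 9, 8, 0, 0, 1, 0, 0, 0, 0, 0, 0],
    "C": [1, 8, 0, 7, 0, 0, 1, 9, 0, 0, 0, 0, 0, 9, 1, 0],
    "G": [1, 1, 9, 0, 0, 0, 0, 0, 9, 1, 1, 0, 5, 0, 5, 0],
    "T": [1, 0, 0, 0, 0, 0, 0, 0, 0, 7, 8, 9, 4, 0, 3, 9],
}


def weight_matrix(counts, pseudocount=0.5):
    """W(b,j) = ln(N(b,j)+0.5) - 0.25 * sum_i ln(N(i,j)+0.5)."""
    length = len(counts["A"])
    weights = {b: [] for b in "ACGT"}
    for j in range(length):
        logs = {b: math.log(counts[b][j] + pseudocount) for b in "ACGT"}
        mean_log = 0.25 * sum(logs.values())
        for b in "ACGT":
            weights[b].append(logs[b] - mean_log)
    return weights


WEIGHTS = weight_matrix(COUNTS)
```

By construction the four weights in each column sum to zero, which is why a fully conserved position shows one large positive weight and three equal negative ones.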
Logical rules, trees etc.
Compilation of samples
• Initial sample:
– GenBank
– specialized databases
– literature (reviews)
– literature (original papers)
• Correction of GenBank errors
• Checking the literature
– removal of predicted sites
• Removal of duplicates
Re-alignment approaches
• Initial alignment by a biological landmark
– start of transcription for promoters
– start codon for ribosome binding sites
– exon-intron boundary for splicing sites
• Deriving the signal within a sliding window
• Re-alignment
• etc. etc. until convergence
Gene starts of Bacillus subtilis
dnaN ACATTATCCGTTAGGAGGATAAAAATG
gyrA GTGATACTTCAGGGAGGTTTTTTAATG
serS TCAATAAAAAAAGGAGTGTTTCGCATG
bofA CAAGCGAAGGAGATGAGAAGATTCATG
csfB GCTAACTGTACGGAGGTGGAGAAGATG
xpaC ATAGACACAGGAGTCGATTATCTCATG
metS ACATTCTGATTAGGAGGTTTCAAGATG
gcaD AAAAGGGATATTGGAGGCCAATAAATG
spoVC TATGTGACTAAGGGAGGATTCGCCATG
ftsH GCTTACTGTGGGAGGAGGTAAGGAATG
pabB AAAGAAAATAGAGGAATGATACAAATG
rplJ CAAGAATCTACAGGAGGTGTAACCATG
tufA AAAGCTCTTAAGGAGGATTTTAGAATG
rpsJ TGTAGGCGAAAAGGAGGGAAAATAATG
rpoA CGTTTTGAAGGAGGGTTTTAAGTAATG
rplM AGATCATTTAGGAGGGGAAATTCAATG
dnaN ACATTATCCGTTAGGAGGATAAAAATG
gyrA GTGATACTTCAGGGAGGTTTTTTAATG
serS TCAATAAAAAAAGGAGTGTTTCGCATG
bofA CAAGCGAAGGAGATGAGAAGATTCATG
csfB GCTAACTGTACGGAGGTGGAGAAGATG
xpaC ATAGACACAGGAGTCGATTATCTCATG
metS ACATTCTGATTAGGAGGTTTCAAGATG
gcaD AAAAGGGATATTGGAGGCCAATAAATG
spoVC TATGTGACTAAGGGAGGATTCGCCATG
ftsH GCTTACTGTGGGAGGAGGTAAGGAATG
pabB AAAGAAAATAGAGGAATGATACAAATG
rplJ CAAGAATCTACAGGAGGTGTAACCATG
tufA AAAGCTCTTAAGGAGGATTTTAGAATG
rpsJ TGTAGGCGAAAAGGAGGGAAAATAATG
rpoA CGTTTTGAAGGAGGGTTTTAAGTAATG
rplM AGATCATTTAGGAGGGGAAATTCAATG
cons. aaagtatataagggagggttaataATG
num. 001000000000110110000000111
760666658967228106888659666
dnaN ACATTATCCGTTAGGAGGATAAAAATG
gyrA GTGATACTTCAGGGAGGTTTTTTAATG
serS TCAATAAAAAAAGGAGTGTTTCGCATG
bofA CAAGCGAAGGAGATGAGAAGATTCATG
csfB GCTAACTGTACGGAGGTGGAGAAGATG
xpaC ATAGACACAGGAGTCGATTATCTCATG
metS ACATTCTGATTAGGAGGTTTCAAGATG
gcaD AAAAGGGATATTGGAGGCCAATAAATG
spoVC TATGTGACTAAGGGAGGATTCGCCATG
ftsH GCTTACTGTGGGAGGAGGTAAGGAATG
pabB AAAGAAAATAGAGGAATGATACAAATG
rplJ CAAGAATCTACAGGAGGTGTAACCATG
tufA AAAGCTCTTAAGGAGGATTTTAGAATG
rpsJ TGTAGGCGAAAAGGAGGGAAAATAATG
rpoA CGTTTTGAAGGAGGGTTTTAAGTAATG
rplM AGATCATTTAGGAGGGGAAATTCAATG
cons. tacataaaggaggtttaaaaat
num. 0000000111111000000001
5755779156663678679890
Positional information content before and after re-alignment
Positional nucleotide frequencies after re-alignment (aGGAGG pattern)
Enhancement of a weak signal
Deriving the signal ab initio
• “Discrete” (pattern-driven) approaches: word counting
• “Continuous” (profile-driven) approaches: optimization
Word counting. Short words
• Consider all k-mers
• For each k-mer compute the number of sequences containing this k-mer
– (maybe with some mismatches)
• Select the most frequent k-mer
Problem: Complete search is possible only for short words
Assumption: if a long word is over-represented, its subwords also are overrepresented
Solution: select a set of over-represented words and combine them into longer words
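The counting step for short words can be sketched as follows. This is a minimal illustration with exact matches only (the optional mismatches are left out), and the three toy sequences are invented for the example.

```python
from collections import Counter


def kmer_sequence_counts(sequences, k):
    """For each k-mer, count the number of sequences that contain it at least once."""
    counts = Counter()
    for seq in sequences:
        # A set ensures each k-mer is counted once per sequence.
        seen = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        counts.update(seen)
    return counts


# Toy input (hypothetical sequences, not taken from the slides).
toy = ["TTGGAGGA", "CAGGAGGT", "GGAGGCCA"]
counts = kmer_sequence_counts(toy, 5)
best_word, best_count = counts.most_common(1)[0]
```

With these toy sequences the only 5-mer present in all three is GGAGG. The cost grows as 4^k when all possible words are enumerated, which is the reason a complete search is limited to short words.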
Word counting. Long words
• Consider some k-mers
• For each k-mer compute the number of sequences containing this k-mer
– (maybe with some mismatches)
• Select the most frequent k-mer
Problem: what k-tuples to start with?
1st attempt: those actually occurring in the sample.
But: the correct signal (the consensus word) may not be among them.
2nd attempt: those actually occurring in the sample and some neighborhood.
But:
– again, the correct signal (the consensus word) may not be among them;
– the size of the neighborhood grows exponentially
Graph approach
Each k-mer in each sequence corresponds to a vertex. Two k-mers are linked by an arc, if they differ in at most h positions (h<<k).
Thus we obtain an n-partite graph (n is the number of sequences).
A signal corresponds to a clique (a complete subgraph) – or at least a dense subgraph – with vertices in each part.
A simple algorithm
• Remove vertices that cannot be extended to complete subgraphs – that is, do not have arcs to all parts of the graph
• Remove pairs that cannot be extended …
– that is, do not form triangles with the third vertex in all parts of the graph
• Etc.
(will not work “as is” for dense subgraphs)
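The first pruning step of this simple algorithm might look like the sketch below. It builds the n-partite graph implicitly (one part per sequence, arcs between k-mers differing in at most h positions) and repeatedly removes vertices that lack an arc to some other part. The pair and triangle steps are not implemented, and the toy sequences are invented.

```python
def hamming(a, b):
    """Number of mismatching positions between two words of equal length."""
    return sum(x != y for x, y in zip(a, b))


def prune_vertices(sequences, k, h):
    """Keep only k-mers that have a neighbour (<= h mismatches) in every other sequence."""
    # One part per sequence; a vertex is (sequence index, start position).
    parts = [
        {(s, i): seq[i:i + k] for i in range(len(seq) - k + 1)}
        for s, seq in enumerate(sequences)
    ]
    changed = True
    while changed:  # removing a vertex may orphan others, so iterate to a fixed point
        changed = False
        for s, part in enumerate(parts):
            for vertex, word in list(part.items()):
                linked_everywhere = all(
                    any(hamming(word, other) <= h for other in parts[t].values())
                    for t in range(len(parts)) if t != s
                )
                if not linked_everywhere:
                    del part[vertex]
                    changed = True
    return parts


# Toy input (hypothetical): one planted word per sequence, one copy mutated.
toy = ["AAAAGGAGGAAAA", "CCCCGGAGGCCCC", "TTTTGGTGGTTTT"]
surviving = prune_vertices(toy, k=5, h=1)
```

On this toy input only the planted words survive, one vertex per part. On real data the surviving set is usually much larger, and the later steps (pairs, triangles) are needed to narrow it down.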
Optimization. EM algorithms
• Generate an initial set of profiles (e.g. seed with all k-mers)
• For each profile
– find the best (highest scoring) representative in each sequence
– update the profile
• Iterate until convergence
This algorithm converges.
However, it cannot leave the basin of attraction.
Thus, if the initial approximation is bad, it will converge to nonsense.
Solution: stochastic optimization.
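A minimal sketch of the iteration described above, for a single seed: build a profile, take the highest-scoring word in each sequence, rebuild the profile, and stop when the chosen sites no longer change. The weight formula is the pseudocount formula used elsewhere in these slides; the function names and toy sequences are assumptions made for illustration, and this "best representative" variant is only one way to realize the scheme.

```python
import math


def build_weights(sites, pseudocount=0.5):
    """Positional weights from a list of equal-length sites."""
    k = len(sites[0])
    weights = []
    for j in range(k):
        logs = {
            b: math.log(sum(site[j] == b for site in sites) + pseudocount)
            for b in "ACGT"
        }
        mean_log = 0.25 * sum(logs.values())
        weights.append({b: logs[b] - mean_log for b in "ACGT"})
    return weights


def best_site(seq, weights):
    """Highest-scoring word of profile length in one sequence."""
    k = len(weights)
    return max(
        (seq[i:i + k] for i in range(len(seq) - k + 1)),
        key=lambda word: sum(weights[j][word[j]] for j in range(k)),
    )


def refine(sequences, seed, max_iterations=50):
    """Iterate 'find best representatives' and 'update profile' until nothing changes."""
    sites = [seed]
    for _ in range(max_iterations):
        weights = build_weights(sites)
        new_sites = [best_site(seq, weights) for seq in sequences]
        if new_sites == sites:
            break
        sites = new_sites
    return sites


# Toy input (hypothetical sequences).
toy = ["AAAAGGAGGAAAA", "CCCCGGAGGCCCC", "TTTTGGTGGTTTT"]
final_sites = refine(toy, "GGAGG")
```

Starting from a poor seed, the same loop still converges, but possibly to an uninformative set of sites, which illustrates the basin-of-attraction problem noted above.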
Simulated annealing
• Goal: maximize the information content I
$I = \sum_j \sum_b f(b,j) \log [f(b,j) / p(b)]$
• or any other measure of homogeneity of the sites
Let A be the current signal (set of candidate sites), and let I(A) be the corresponding information content.
Let B be a set of sites obtained by randomly choosing a different site in one sequence, and let I(B) be its information content.
• if I(B) ≥ I(A), B is accepted
• if I(B) < I(A), B is accepted with probability $P = \exp[(I(B) - I(A)) / T]$
The temperature T decreases exponentially, but slowly; the initial temperature is chosen such that almost all changes are accepted.
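The acceptance rule and the cooling schedule can be written down directly. In the sketch below the function names, the cooling rate, and the use of Python's `random` module are illustrative assumptions; only the acceptance probability itself comes from the slide.

```python
import math
import random


def acceptance_probability(info_current, info_candidate, temperature):
    """P = 1 if I(B) >= I(A), else exp((I(B) - I(A)) / T)."""
    if info_candidate >= info_current:
        return 1.0
    return math.exp((info_candidate - info_current) / temperature)


def accept(info_current, info_candidate, temperature, rng=random):
    """Draw a uniform number in [0, 1) and compare it with the acceptance probability."""
    return rng.random() < acceptance_probability(info_current, info_candidate, temperature)


def cooling_schedule(initial_temperature, rate, steps):
    """Exponentially decreasing temperatures: T0, T0*rate, T0*rate**2, ..."""
    return [initial_temperature * rate ** step for step in range(steps)]
```

A rate close to 1 (for example 0.999, an arbitrary value) gives the slow cooling mentioned above; at a high initial temperature the exponent is close to zero, so nearly every change is accepted.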
Gibbs sampler
Again, A is a signal (set of sites), and I(A) is its information content.
At each step a new site is selected in one sequence with probability
$P \sim \exp[I(A_{new})]$
For each candidate site the total time of occupation is computed.
(Note that the signal changes all the time)
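One sampling step could be sketched as below: given the information content that would result from placing the site at each candidate position of one sequence, positions are drawn with probability proportional to exp(I). Subtracting the maximum before exponentiating is a numerical-stability choice added here, and the function names are hypothetical.

```python
import math
import random


def selection_probabilities(info_values):
    """Normalize exp(I) over all candidate positions of one sequence."""
    largest = max(info_values)
    # Subtracting the maximum does not change the ratios but avoids overflow.
    weights = [math.exp(value - largest) for value in info_values]
    total = sum(weights)
    return [w / total for w in weights]


def sample_site(info_values, rng=random):
    """Draw one candidate position index according to P ~ exp(I)."""
    probabilities = selection_probabilities(info_values)
    return rng.choices(range(len(info_values)), weights=probabilities)[0]
```

Counting how often each position is drawn over many steps gives the occupation time mentioned above.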
Use of symmetry
• DNA-binding factors and their signals
Co-operative homogeneous
Palindromes
Repeats
Co-operative non-homogeneous
Cassettes
Others
RNA signals
Recognition: PWM/profiles
The simplest technique: positional nucleotide weights are
$W(b,j) = \ln(N(b,j)+0.5) - 0.25 \sum_i \ln(N(i,j)+0.5)$
Score of a candidate site $b_1 \ldots b_k$ is the sum of the corresponding positional nucleotide weights:
$S(b_1 \ldots b_k) = \sum_{j=1}^{k} W(b_j, j)$
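A sketch of site recognition with these weights: the matrix is derived from the count matrix shown earlier in the slides, every window of the sequence is scored, and windows at or above a threshold are reported. The threshold value used in the example is arbitrary and only for illustration.

```python
import math

# Count matrix N(b, j) from the frequency-matrix slide (9 sites, 16 positions).
COUNTS = {
    "A": [6, 0, 0, 2, 9, 9, 8, 0, 0, 1, 0, 0, 0, 0, 0, 0],
    "C": [1, 8, 0, 7, 0, 0, 1, 9, 0, 0, 0, 0, 0, 9, 1, 0],
    "G": [1, 1, 9, 0, 0, 0, 0, 0, 9, 1, 1, 0, 5, 0, 5, 0],
    "T": [1, 0, 0, 0, 0, 0, 0, 0, 0, 7, 8, 9, 4, 0, 3, 9],
}


def weight_matrix(counts, pseudocount=0.5):
    """W(b,j) = ln(N(b,j)+0.5) - 0.25 * sum_i ln(N(i,j)+0.5)."""
    weights = {b: [] for b in "ACGT"}
    for j in range(len(counts["A"])):
        logs = {b: math.log(counts[b][j] + pseudocount) for b in "ACGT"}
        mean_log = 0.25 * sum(logs.values())
        for b in "ACGT":
            weights[b].append(logs[b] - mean_log)
    return weights


def score(site, weights):
    """S(b1...bk) = sum over positions j of W(bj, j)."""
    return sum(weights[base][j] for j, base in enumerate(site))


def scan(sequence, weights, threshold):
    """Return (start, score) for every window scoring at least the threshold."""
    k = len(weights["A"])
    hits = []
    for i in range(len(sequence) - k + 1):
        window_score = score(sequence[i:i + k], weights)
        if window_score >= threshold:
            hits.append((i, window_score))
    return hits


WEIGHTS = weight_matrix(COUNTS)
# The purL upstream fragment from the consensus slide; threshold 20.0 is arbitrary.
hits = scan("TCTACGCAAACGGTTTCGTCGG", WEIGHTS, 20.0)
```

In practice the threshold is chosen from the score distributions on known sites and non-sites, trading overprediction against underprediction.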
Distribution of RBS profile scores on sites (green) and non-sites (red)
Pattern recognition
• Linear discriminant analysis
• Logical rules
• Syntactic analysis
• Context-sensitive grammars
• Perceptron
• Neural networks
Neural networks: architecture
• 4k input neurons (sensors), each responsible for observing a particular nucleotide at particular position
OR 2k neurons (one discriminates between purines and pyrimidines, the other, between AT and GC)
• One or more layers of hidden neurons
• One output neuron
• Each neuron is connected to all neurons of the next layer
• Each connection is ascribed a numerical weight
A neuron
• Sums the signals at incoming connections
• Compares the total with the threshold (or transforms it according to a fixed function)
• If the threshold is passed, excites the outgoing connections (resp. sends the modified value)
Training:
• Sites and non-sites from the training sample are presented one by one.
• The output neuron produces the prediction.
• The connection weights and thresholds are modified if the prediction is incorrect.
Networks differ by architecture, particulars of the signal processing, the training schedule
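The simplest instance of this scheme is a perceptron with 4k input sensors and no hidden layer. The sketch below uses the classical perceptron update (weights change only when the prediction is wrong); the learning rate, the epoch limit, and the toy training examples are assumptions made for illustration, not data from the slides.

```python
def one_hot(site):
    """4k inputs: one sensor per nucleotide per position."""
    return [1.0 if base == b else 0.0 for base in site for b in "ACGT"]


def predict(weights, bias, inputs):
    """Sum the incoming signals and compare with the threshold (here: bias + sum > 0)."""
    total = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > 0 else 0


def train(examples, epochs=100, rate=1.0):
    """examples: list of (site, label), label 1 for sites and 0 for non-sites."""
    n_inputs = 4 * len(examples[0][0])
    weights, bias = [0.0] * n_inputs, 0.0
    for _ in range(epochs):
        errors = 0
        for site, label in examples:
            inputs = one_hot(site)
            delta = label - predict(weights, bias, inputs)
            if delta != 0:  # modify weights only if the prediction is incorrect
                errors += 1
                weights = [w + rate * delta * x for w, x in zip(weights, inputs)]
                bias += rate * delta
        if errors == 0:
            break
    return weights, bias


# Toy training sample (hypothetical sites and non-sites).
examples = [
    ("GGAGG", 1), ("GGTGG", 1), ("GGAGA", 1),
    ("TTTTT", 0), ("ACACA", 0), ("CCCCC", 0),
]
weights, bias = train(examples)
```

Without hidden layers such a network is a linear classifier over positions, much like a weight matrix with a threshold; hidden layers are what allow dependencies between positions to be captured.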
Use of sequence context
• Presence of multiple co-operative sites
– ArgR (E. coli), purine regulator (Pyrococcus)
– XylR+CRP; CytR+CRP (E. coli)
– MEF+MyoD in muscle-specific promoters (mammals)
• Location relative to promoters
– repressors vs. activators
Benchmarking
Difficult, because:
• Different algorithms are optimized for different performance parameters
• Incompatible training sets
• Difficult to construct a homogeneous and unambiguous testing set:
– Unobserved sites
– Competition between closely located sites
– Activation in specific conditions
– Non-specific binding (52 out of 54 candidate HNF-1 binding sites do bind the factor)
Promoters of E. coli
• PWM at a false positive rate of 1 per 2000 bp:
– 25% of all promoters,
– 60% of constitutive (non-activated) promoters
• PWMs perform as well as neural networks
Eukaryotic promoters
Ribosome binding sites
• Information content of the profile predicts the average reliability of predictions
CRP (E. coli)
[Plot: OV and UN (%, axis 0 to 110) against the score threshold (3 to 5)]
OV: overprediction (% of false positives among candidate sites)
UN: underprediction (% of lost true sites)
Comparative approach to the analysis of regulation
Making good predictions with bad rules
Regulation of transcription in prokaryotes
Difficult:
• Small sample size
• Weak signals (or we do not know what features are relevant, maybe the DNA structure)
CRP (E. coli)
[Plot: OV and UN (%, axis 0 to 110) against the score threshold (3 to 5)]
OV: overprediction (% of false positives among candidate sites)
UN: underprediction (% of lost true sites)
GenBank entry for the E. coli genome
gene complement(120178..121551) /note="b0112" /gene="aroP"
CDS complement(120178..121551) /gene="aroP" /product="aromatic amino acid transport protein"
protein_bind complement(121599..121617) /bound_moiety="TyrR documented site"
protein_bind complement(121622..121640) /bound_moiety="TyrR documented site"
protein_bind complement(121653..121664) /bound_moiety="PutA predicted site"
promoter complement(121683..121711) /note="factor Sigma70; promoter aroP; documented +1 at 121671"
protein_bind complement(121810..121823) /bound_moiety="OxyR predicted site"
protein_bind complement(121813..121835) /bound_moiety="ArgR predicted site"
[Diagram of the locus: aroP, TyrR, TyrR, PutA, promoter (Pr.), OxyR, ArgR]
Many genomes are available => comparative approach
Basic assumption
Regulons (sets of co-regulated genes) are conserved
• well …in some cases
• in fact, in many cases
Corollary: The consistency check
• True sites occur upstream of orthologous genes
• False sites are scattered at random
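The consistency check can be sketched as a simple vote across genomes: a candidate site is retained if candidate sites are also found upstream of orthologous genes in enough other genomes. The data layout, the `min_genomes` parameter, and the toy identifiers below are hypothetical.

```python
from collections import defaultdict


def consistency_filter(candidates, min_genomes=2):
    """candidates: {genome: set of ortholog-group ids with a candidate site upstream}.

    Returns the groups supported by at least min_genomes genomes.
    """
    support = defaultdict(set)
    for genome, groups in candidates.items():
        for group in groups:
            support[group].add(genome)
    return {
        group: sorted(genomes)
        for group, genomes in support.items()
        if len(genomes) >= min_genomes
    }


# Toy input (hypothetical genomes and ortholog groups).
candidates = {
    "genome_1": {"group_a", "group_b", "group_x"},
    "genome_2": {"group_a", "group_b", "group_y"},
    "genome_3": {"group_a", "group_z"},
}
consistent = consistency_filter(candidates, min_genomes=2)
```

Groups seen in only one genome (group_x, group_y, group_z here) are treated as probable false positives, in line with the assumption that false sites are scattered at random.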
Orthologs
• Orthologous genes:
– diverged by speciation
– retain cellular role
• Paralogous genes:
– diverged by duplication
– retain biochemical function only
Orthology (definition)
• Genomes are shown as black “pipes”
• 1st event: duplication
• 2nd event: speciation
• Genes of the same color are orthologous
• Genes of different color are paralogous
Search for orthologs (fast and dirty)
[Diagram: Genome 1 and Genome 2 with genes A, B, B", A', B'; orthologs are linked by a symmetrical best hit]
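The symmetrical best hit criterion can be sketched as follows: a pair is reported only when each gene is the other's best hit. The similarity scores and the assignment of genes to genomes in the toy example are invented; in practice the scores would come from a sequence similarity search in both directions.

```python
def best_hits(scores):
    """scores: {(query, subject): similarity score} for one search direction."""
    best = {}
    for (query, subject), value in scores.items():
        if query not in best or value > best[query][1]:
            best[query] = (subject, value)
    return {query: subject for query, (subject, _) in best.items()}


def symmetrical_best_hits(scores_1_vs_2, scores_2_vs_1):
    """Pairs (a, b) where b is the best hit of a and a is the best hit of b."""
    forward = best_hits(scores_1_vs_2)
    backward = best_hits(scores_2_vs_1)
    return sorted((a, b) for a, b in forward.items() if backward.get(b) == a)


# Toy scores (hypothetical): genes A, B in one genome; A', B', B" in the other.
genome1_vs_genome2 = {("A", "A'"): 100, ("B", "B'"): 90, ("B", 'B"'): 60}
genome2_vs_genome1 = {("A'", "A"): 100, ("B'", "B"): 90, ('B"', "B"): 60}
pairs = symmetrical_best_hits(genome1_vs_genome2, genome2_vs_genome1)
```

Here B" finds B as its best hit, but B prefers B', so B" is left without a partner. The procedure is fast but can be misled by gene loss and unequal rates of divergence.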
The basic procedure
[Diagram labels: Set of known sites; Profile; Genome 1; Genome 2; …; Genome N]
Accounting for the operon structure
[Diagram: «Old» genome and «New» genome; gene labels A, BC, D, EF and A, BC, XD, E, F; X marks]
Checklist
• Presence of orthologous transcription factors
• Really orthologous (BETs, COGs etc. are not sufficient)
• * Conservation of the DNA-binding domain
• * Conservation of the core pathway
Purine regulons of E. coli and H. influenzae
purR purR guaBA guaBA glyA pyrD pyrD prsA prsA glnB glnB purA purA codBA - codA pyrC - purT - gcvTHP - speAB - - ycfC purB
ycfC purB
purHD glyA
purHD glyA
purL purL cvpA purF
cvpA purF
purMN purMN purKE purKE purC purC yjcD yieG
HI0125
Predicted purine transporters
[Phylogenetic tree with bootstrap support values (714 to 1000). Leaves: YgfO, YicE, UAPA_En, UAPC_En, YgfU, 2635740_Bs, 2635741_Bs, YcdG_Ec, UraA_Hi, UraA_Ec, 2895752_Ef, PyrP_Bc, PyrP_Bs, YjcD_Hi, YjcD, YgfQ, YtiP_Bs, 2239289_Bs, YieG, YicO, Y326_Mj, 2314333_Hp, 2689889_Bb, 2689890_Bb, PbuX_Bs]
Changes in the operon structure: more examples
• glnK-amtB loci of methanogenic archaebacteria
M. thermoautotrophicum
NIF amtB glnK NIF amtB glnK
M. jannaschii
NIF glnK amtB
glnK NIF amtB
Tryptophan operons
E. coli
H. influenzae
trpE trpD trpC trpB trpA
ydfG trpB trpA
trpE trpD trpC
Heat shock (HrcA) regulons / CIRCE elements
Bacillus subtilis
CIRCE hrcA grpE dnaK dnaJ
CIRCE groES groEL
Mycobacterium tuberculosis
hrcA dnaJ
dnaK grpE dnaJ
CIRCE groES groEL
CIRCE groEL
Chlamydiae
CIRCE hrcA grpE dnaK
dnaJ
CIRCE groES groEL
groEL
Synechocystis
hrcA
grpE dnaK
dnaJ
CIRCE groES groEL
CIRCE groEL
Mycoplasma
hrcA
grpE
CIRCE dnaK
CIRCE dnaJ
CIRCE groES groEL
CIRCE lon
CIRCE clpB
Closely related genomes: Phylogenetic footprinting
Regulatory sites are more conserved than non-coding regions in general and are often seen as conserved islands in alignments of gene upstream regions.
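Conserved islands can be located mechanically in an alignment. The sketch below marks columns where all aligned sequences agree and then reports runs of conserved columns above a minimum length; the minimum length is an arbitrary parameter, and the example rows are a short window taken from the ST, EC and YP purL upstream sequences in these slides.

```python
import re


def conservation_line(aligned):
    """'*' where all rows carry the same nucleotide (gaps never count), else ' '."""
    return "".join(
        "*" if len(set(column)) == 1 and column[0] != "-" else " "
        for column in zip(*aligned)
    )


def conserved_islands(line, min_length):
    """Return (start, end) of every run of at least min_length conserved columns."""
    pattern = r"\*{%d,}" % min_length
    return [(m.start(), m.end()) for m in re.finditer(pattern, line)]


# Window from the purL upstream alignment (ST, EC, YP rows).
rows = [
    "ACGCAAACGGTTTCGTCGGC",
    "ACGCAAACGGTTTCGTCAGC",
    "ACGCAAACGGTTTCGTCGGC",
]
line = conservation_line(rows)
islands = conserved_islands(line, min_length=10)
```

The island found here covers the ACGCAAACGGTTTCGT word, which matches the purine-regulator pattern introduced earlier in the slides.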
High conservation
purL
ST AGCGGCATTTTGCGTAACAATGCGCCAGTTGGCAACTT-ATT-CGCAACGATAGCCGCACC--GTATGACAAGAAAAAGC
EC AGCGGCATTTTGCGTAAACCTGCGCCAGATGGCAACTT-ATT-ACAGCCATTGGCGGCACG--CGTTGCTAATTCACGAT
YP AGTGGCATTTTGCGCAACAAAACGCCAGTGTGCAACTTTATTGCGAGCTATTTGCTGAGTCTGCGTTACACACACATAGC

ST GG-TGATT---------TTATTTCT-------ACGCAAACGGTTTCGTCGGCGCGTCAGATTCTTTATAATGACGGCCGT
EC GG-TGATT---------TTATTTCC-------ACGCAAACGGTTTCGTCAGCGCATCAGATTCTTTATAATGACGCCCGT
YP GGCTGTTTCTGACTGAATTATTAATAATAGATACGCAAACGGTTTCGTCGGCGGCTCAGATTCACTATAATGGCGCGCGT

ST TTCCCCCC-------------------TTGCGCACACCAAA--------------GCTTAGAAGACGAGAGA--CTTA--
EC TTCCCCCCC------------------TTGGGTACACCGAAA-------------GCTTAGAAGACGAGAGA--CTTA--
YP TTTGCCCTGTTGTTGCGCCAATGAATGTTGCGCCCAATGAAGTGCTGTTCCAGCCGCTTCGAAGACGAGAGAAACTTAGA

ST TGATGGAAATTCTGCGTGGTTCGCCTGCACTGTCTGCATTCCGTATCAATAAACTGCTGGCGCGCTTTCAGGCTGCCAAC
EC TGATGGAAATTCTGCGTGGTTCGCCTGCACTGTCGGCATTCCGAATCAACAAACTGCTGGCACGTTTTCAGGCTGCCAGG
YP TTATGGAAATACTGCGTGGTTCACCCGCTTTGTCGGCTTTTCGTATCACCAAACTGTTGTCCCGTTGCCAGGATGCTCAC
Low conservation
yjcD
ST AAA-GCATAAAAAGCGGCAAAGTTCAGTTGAAAAAGCGTTGATGATCGCTGGATAATCGTTTGCTTTTTTTTG---CCAC
EC AAA-GAGAAAAAAGCAGCAAACTTCGGTTGAAAAAGCCGCTATGATCGCCGGATAATCGTTTGCTTTTTTTA----CCAC
YP AAATGTATTAAATGTCGCATTCGGGTGTTGATTAGTCACCACTGATGGCTAGATAATCGTTTGCCTTAAATGACATCTGC

ST CC--------GTTTTGT--------ATACGTG----GAGCTAAACGTTTGCTTTTTTGCGGCGCCCCG-G-TTGTCGTAA
EC CC--------GTTTTGT--------ATGCGCG----GAGCTAAACGTTTGCTTTTTTGCGACGCAGCA-AATTGTCGCAA
YP CCTAAACTTCGATTTTTTTTCAGTCATGCGTTCTCCCAGCTAATCGTTTGCTATTTTTCCCCGCTCTATGAGTCAGGGAG

ST ATGTAGC----------ACAAGGA-GATAACGTTGCGCTGTTAGTGGATTACCTCCCACGTATACCGACGAATAATAAAT
EC ACCTGGA----------GCAGGAA-GATAACGTTTCGCTGGCAGGGGATTGTCCGCCACGCATCTTGACGAAAATTAAAC
YP AGTTAGTGAGTTCATCGACAGGAACGGAAACGATTACGTAGAGAAGGGCGCTTGGCTTGGCATGCTATTTTAAAATGA-C

ST TCTCAGGGGATGTTTTCT-ATGTCT------ACGCCTTCAGCGCGTACCGGCGGTTCACTCGACGCCTGGTTTAAAATTT
EC TCTCAGGGGATGTTTTCTTATGTCT------ACGCCATCAGCGCGTACCGGCGGTTCACTCGACGCCTGGTTTAAAATTT
YP ACACAGGGGACATCACC--ATGTCTAGCAGCAACCCTCAAGCACAGCCAAAGGGCACGCTTGATGCATTCTTTAAGCTTA
Degeneration of sites
trpH
           ttGtACAagttaactaGTacaa
EC gtcgccgaATGTACTAGAGAACTAGTGCATtagcttat
ST accgcaggATGTACTAGTAAACTAGTTTAAtggattgg
YP gtcgtcggATGTTTTAACTAAATATTTTCAtgagtgat
EH ctcgccgcATGTACTGATGGGTAACCGGCGctgaactg
BA tcactgtatttttttagtatactattaaacttatcctc
Problems and solutions
Unique members of regulons may be lost: use of additional genomes decreases the number of “orphan” regulon members.
Closely related factors may have similar sites: careful analysis of function and analysis of particular sites is usually sufficient to resolve ambiguities.
Too many genomes and regulons: apply preliminary automated screening.
Modification: ubiquitous regulators
• Present in many genomes
• Only core regulon is conserved
• Mode of regulation may vary
• Signals may be slightly different
Arginine repressor ArgR/AhrC
artJ
Rv1652 Rv1653 Rv1654 Rv1655 Rv1656 Rv1658 Rv1659 Rv1383 Rv1384
argC argJ argB argD argF argG argH carA carB yqiX yqiY yqiZ
rocR rocC rocA rocB rocF rocD rocE
AhrC
2787 2788 2786 2785 414 1203 1204 3089 3090 4268 4266 4265 2443
yqjN
491 3533
TM1782 TM1783 TM1784 TM1785 TM1097 TM1780 TM1781 TM0558 TM0577 TM0593 TM0592 TM0591 TM0371
? ? ? DR1415 DR0080 DR0674 DR0678 DR684 DR0668 DR2610 ? ? DR0742
Mycobacterium tuberculosis
Bacillus subtilis
Clostridium acetobutylicum
Thermotoga maritima
Deinococcus radiodurans
AhrC
argC argB argD argF argG argH carA carB artI artM artQ argR
Escherichia coli
? HI0596HI0811 HI1727HI1209
Haemophilus influenzae
argE
argA
artP
HI1179 H1177 HI1178 HI1180
Vibrio cholerae
VC2644 VC2643 VC2641 argR VC2645 VC2642 VC2618 VC2390 VC2389 VC2508 VCA0759 VCA0757 VCA0758 VCA0760 VC2316
ABC transporters (periplasmic components)
[Phylogenetic tree, scale bar 0.1 changes per site. Leaves: TM1170, CA_3898, HI1080, BS_yckK, DR0564, Cpn0604, DR2278, Cpn0482, HI1179, EC_artJ (arg), EC_artI (arg), EC_argT (arg), EC_hisJ (his), TM0593, BS_glnH (gln), Rv0411c, EC_ybeJ, EC_yhdW, BS_yqiX, EC_glnH (gln), CA_0129, DR2154, DR2610, CA_4268, CA_0491, BS_yxeM, CA_1093, BS_ytmK, BS_ytmJ, EC_fliY (biosynthesis of flagellae)]
Modification: horizontal transfer
• Impossible to resolve the orthology relationships: a homologous regulated gene is sufficient for corroboration
• Often regulate large loci (several adjacent operons)
• Signals are mainly conserved
New signals
• Select a group of related genomes
• In each genome select metabolically related genes
• Add possibly co-transcribed genes
• Compare upstream regions for each genome independently
• Construct profiles
• Compare constructed profiles: if similar, then relevant
The purine regulon of Pyrococcus spp.
• Use functional annotation and COGs to select genes encoding enzymes from the purine pathway: purA, purB, purC, purF, purD, purE, purL-I, purL-II, purT, guaA.
• Construct profiles for each genome. The quality of profiles is weak (< 1 bit/position).
• However, the profiles are almost identical.
• There is no significant similarity of upstream regions (outside sites). Thus the profiles are probably correct.
• Low specificity of profiles, thus >300 candidate genes in each genome.
• Observation: in upstream regions of all genes from the initial sample the candidate sites occur twice with a 22 bp spacer.
• The new rule is absolutely specific: only one additional gene in each genome.
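The spacer rule can be sketched as a filter on candidate site positions within one upstream region. Two assumptions are made here that the slides do not state: the spacer is measured from the end of the first site to the start of the second, and the site length passed in the example is a made-up value.

```python
def paired_sites(starts, site_length, spacer=22):
    """Return (first, second) start positions of sites separated by exactly `spacer` bp.

    Assumed convention: the spacer runs from the end of the first site
    to the start of the second one.
    """
    positions = set(starts)
    distance = site_length + spacer
    return sorted(
        (p, p + distance) for p in positions if p + distance in positions
    )


# Toy candidate positions in one upstream region (hypothetical values).
pairs = paired_sites([10, 40, 48, 100], site_length=16)
```

Requiring a partner at a fixed distance removes most isolated candidates, which is how a weak profile can still give a specific rule.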
[Phylogenetic tree with bootstrap support values (714 to 1000). Leaves: YgfO, YicE, UAPA_En, UAPC_En, YgfU, 2635740_Bs, 2635741_Bs, YcdG_Ec, UraA_Hi, UraA_Ec, 2895752_Ef, PyrP_Bc, PyrP_Bs, YjcD_Hi, YjcD, YgfQ, YtiP_Bs, 2239289_Bs, YieG, YicO, Y326_Mj, 2314333_Hp, 2689889_Bb, 2689890_Bb, PbuX_Bs. Additional labels: PH, PA, PF]
Sources
• G. Stormo
• J. Fickett
• W. Miller
• I. Dubchak
• Yuh et al. (1998)
• Tronche et al. (1997)
• textbooks
Discussions and collaboration
• Farid Chetouani (Institut Pasteur)
• Eugene Koonin (NCBI)
• Yuri Kozlov (Ajinomoto)
• Leonid Mirny (Harvard - MIT)
• Alexander Mironov (GosNIIGenetika)
• Vasily Lyubetsky (Inst. Probl. Inform. Trans.)
• Andrey Osterman (IntegratedGenomics)
• Danila Perumov (Inst. Nucl. Phys.)
• Pavel Pevzner (UC San Diego)
• Michael Roytberg (Inst. Math. Probl. Biol.)
Collaborators
• Andrey A. Mironov
• A. B. Rakhmaninova
• Vadim Brodyansky
• Lyudmila Danilova
• Anna Gerasimova
• Alexey Kazakov
• Ekaterina Kotelnikova
• Olga Laikova
• Pavel Novichkov
• Ekaterina Panina
• Elya Permina
• Dmitry Ravcheev
• Dmitry Rodionov
• Natalya Sadovskaya
• Alexey Vitreschak