Recognition of regulatory signals Mikhail S. Gelfand IntegratedGenomics-Moscow NATO ASI School,...

Recognition of regulatory signals

Mikhail S. Gelfand

IntegratedGenomics-Moscow

NATO ASI School, October 2001

Why?

• Additional annotation tool (e.g. specificity of transporters and enzymes from large families)

• Important for practice (in addition to metabolic reconstruction)

• Interesting from the evolutionary point of view

Overview

0. Biological introduction

1. Algorithms

• Representation of signals

• Deriving the signal

• Site recognition

2. Comparative genomics

• Phylogenetic footprinting

• Consistency filtering

Some biology

• Transcription (DNA → RNA)

• Splicing (pre-mRNA → mRNA)

• Translation (mRNA → protein)

• Regulation of transcription in prokaryotes

• … and eukaryotes

• Initiation of translation

Transcription and translation in prokaryotes

Initiation of transcription (bacteria)

Translation in prokaryotes

Translation (details)

Splicing (eukaryotes)

Regulation of transcription in prokaryotes

Structure of DNA-binding domain. Example 1

Structure of DNA-binding domain. Example 2

Protein-DNA interactions

Regulation of transcription in eukaryotes

Representation of signals

• Consensus

• Pattern (consensus with degenerate positions)

• Positional weight matrix (PWM, or profile)

• Logical rules

• RNA signals

Consensus

codB CCCACGAAAACGATTGCTTTTT

purE GCCACGCAACCGTTTTCCTTGC

pyrD GTTCGGAAAACGTTTGCGTTTT

purT CACACGCAAACGTTTTCGTTTA

cvpA CCTACGCAAACGTTTTCTTTTT

purC GATACGCAAACGTGTGCGTCTG

purM GTCTCGCAAACGTTTGCTTTCC

purH GTTGCGCAAACGTTTTCGTTAC

purL TCTACGCAAACGGTTTCGTCGG

consensus ACGCAAACGTTTTCGT

Pattern

codB CCCACGAAAACGATTGCTTTTT

purE GCCACGCAACCGTTTTCCTTGC

pyrD GTTCGGAAAACGTTTGCGTTTT

purT CACACGCAAACGTTTTCGTTTA

cvpA CCTACGCAAACGTTTTCTTTTT

purC GATACGCAAACGTGTGCGTCTG

purM GTCTCGCAAACGTTTGCTTTCC

purH GTTGCGCAAACGTTTTCGTTAC

purL TCTACGCAAACGGTTTCGTCGG

consensus ACGCAAACGTTTTCGT

pattern aCGmAAACGtTTkCkT

Frequency matrix

j a C G m A A A C G t T T k C k T

A 6 0 0 2 9 9 8 0 0 1 0 0 0 0 0 0

C 1 8 0 7 0 0 1 9 0 0 0 0 0 9 1 0

G 1 1 9 0 0 0 0 0 9 1 1 0 5 0 5 0

T 1 0 0 0 0 0 0 0 0 7 8 9 4 0 3 9

$W(b,j) = \ln(N(b,j) + 0.5) - 0.25 \sum_i \ln(N(i,j) + 0.5)$

Information content: $I = \sum_j \sum_b f(b,j) \log [f(b,j) / p(b)]$
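
The counts and the information content can be recomputed from the aligned sites. The sketch below is illustrative only: it assumes that the 16-column core corresponds to columns 4 to 19 of the listed sites, that the background is uniform (p(b) = 0.25), that logarithms are base 2, and that terms with f(b,j) = 0 contribute zero. The function names are not from the slides. Counting the nine listed sites gives G 4, T 5 at position 13, whereas the matrix above lists G 5, T 4; all other columns agree.

```python
import math

# Aligned sites from the slides; the 16-column core is assumed to be columns 4-19.
SITES = [
    "CCCACGAAAACGATTGCTTTTT",  # codB
    "GCCACGCAACCGTTTTCCTTGC",  # purE
    "GTTCGGAAAACGTTTGCGTTTT",  # pyrD
    "CACACGCAAACGTTTTCGTTTA",  # purT
    "CCTACGCAAACGTTTTCTTTTT",  # cvpA
    "GATACGCAAACGTGTGCGTCTG",  # purC
    "GTCTCGCAAACGTTTGCTTTCC",  # purM
    "GTTGCGCAAACGTTTTCGTTAC",  # purH
    "TCTACGCAAACGGTTTCGTCGG",  # purL
]
CORE = [s[3:19] for s in SITES]


def count_matrix(sites):
    """N(b, j): number of sites with nucleotide b at position j."""
    length = len(sites[0])
    return {b: [sum(s[j] == b for s in sites) for j in range(length)]
            for b in "ACGT"}


def information_content(counts, background=0.25):
    """I = sum_j sum_b f(b,j) * log2(f(b,j) / p(b)), in bits."""
    n_sites = sum(counts[b][0] for b in "ACGT")
    total = 0.0
    for j in range(len(counts["A"])):
        for b in "ACGT":
            f = counts[b][j] / n_sites
            if f > 0:  # a zero frequency contributes nothing
                total += f * math.log2(f / background)
    return total


counts = count_matrix(CORE)
print(counts["A"])
print(round(information_content(counts), 2))
```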

Sequence logo

Positional weight matrix (PWM)

j a C G m A A A C G t T T k C k T

A 6 0 0 2 9 9 8 0 0 1 0 0 0 0 0 0

C 1 8 0 7 0 0 1 9 0 0 0 0 0 9 1 0

G 1 1 9 0 0 0 0 0 9 1 1 0 5 0 5 0

T 1 0 0 0 0 0 0 0 0 7 8 9 4 0 3 9

A 1.1 –1.0 –0.7 0.5 2.2 2.2 1.9 –0.7 –0.7 –0.1 –1.0 –0.7 –1.1 –0.7 –1.4 –0.7

C –0.4 1.9 –0.7 1.6 –0.7 –0.7 0.1 2.2 –0.7 –1.2 –1.0 –0.7 –1.1 2.2 –0.3 –0.7

G –0.4 0.1 2.2 –1.1 –0.7 –0.7 –1.0 –0.7 2.2 –0.1 –0.1 –0.7 1.2 –0.7 1.0 –0.7

T –0.4 –1.0 –0.7 –1.1 –0.7 –0.7 –1.0 –0.7 –0.7 1.5 1.9 2.2 1.0 –0.7 0.6 2.2

• Probabilistic motivation: log-likelihood (up to a linear transformation)

• More probabilistic motivation: z-score (with the suitable base of the logarithm)

• Thermodynamical motivation: free energy (assuming independence of positions, up to a linear transformation)

• Pseudocounts

$W(b,j) = \ln(N(b,j) + 0.5) - 0.25 \sum_i \ln(N(i,j) + 0.5)$
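
As a check on the weight formula, the following sketch recomputes the weights from the count matrix shown on this slide, with the pseudocount 0.5 taken from the formula. The names `COUNTS` and `weight_matrix` are illustrative, not from the slides.

```python
import math

# Counts N(b, j) copied from the frequency matrix on the slide.
COUNTS = {
    "A": [6, 0, 0, 2, 9, 9, 8, 0, 0, 1, 0, 0, 0, 0, 0, 0],
    "C": [1, 8, 0, 7, 0, 0, 1, 9, 0, 0, 0, 0, 0, 9, 1, 0],
    "G": [1, 1, 9, 0, 0, 0, 0, 0, 9, 1, 1, 0, 5, 0, 5, 0],
    "T": [1, 0, 0, 0, 0, 0, 0, 0, 0, 7, 8, 9, 4, 0, 3, 9],
}


def weight_matrix(counts, pseudocount=0.5):
    """W(b,j) = ln(N(b,j) + 0.5) - 0.25 * sum_i ln(N(i,j) + 0.5)."""
    length = len(counts["A"])
    weights = {b: [0.0] * length for b in "ACGT"}
    for j in range(length):
        logs = {b: math.log(counts[b][j] + pseudocount) for b in "ACGT"}
        mean_log = 0.25 * sum(logs.values())
        for b in "ACGT":
            weights[b][j] = logs[b] - mean_log
    return weights


W = weight_matrix(COUNTS)
print([round(W["A"][j], 1) for j in range(4)])  # [1.1, -1.0, -0.7, 0.5]
```

Because the column mean of the logarithms is subtracted, the four weights in every column sum to zero.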

Logical rules, trees etc.

Compilation of samples

• Initial sample:

– GenBank

– specialized databases

– literature (reviews)

– literature (original papers)

• Correction of GenBank errors

• Checking the literature

• Removal of predicted sites

• Removal of duplicates

Re-alignment approaches

• Initial alignment by a biological landmark

– start of transcription for promoters

– start codon for ribosome binding sites

– exon-intron boundary for splicing sites

• Deriving the signal within a sliding window

• Re-alignment

• etc. etc. until convergence

Gene starts of Bacillus subtilis

dnaN ACATTATCCGTTAGGAGGATAAAAATG

gyrA GTGATACTTCAGGGAGGTTTTTTAATG

serS TCAATAAAAAAAGGAGTGTTTCGCATG

bofA CAAGCGAAGGAGATGAGAAGATTCATG

csfB GCTAACTGTACGGAGGTGGAGAAGATG

xpaC ATAGACACAGGAGTCGATTATCTCATG

metS ACATTCTGATTAGGAGGTTTCAAGATG

gcaD AAAAGGGATATTGGAGGCCAATAAATG

spoVC TATGTGACTAAGGGAGGATTCGCCATG

ftsH GCTTACTGTGGGAGGAGGTAAGGAATG

pabB AAAGAAAATAGAGGAATGATACAAATG

rplJ CAAGAATCTACAGGAGGTGTAACCATG

tufA AAAGCTCTTAAGGAGGATTTTAGAATG

rpsJ TGTAGGCGAAAAGGAGGGAAAATAATG

rpoA CGTTTTGAAGGAGGGTTTTAAGTAATG

rplM AGATCATTTAGGAGGGGAAATTCAATG

Aligned by the start codon:

cons. aaagtatataagggagggttaataATG

num. 7 6 10 6 6 6 6 5 8 9 6 7 12 12 8 11 10 6 8 8 8 6 5 9 16 16 16

(num.: number of sequences, out of 16, with the consensus nucleotide at that position)

After re-alignment:

cons. tacataaaggaggtttaaaaat

num. 5 7 5 5 7 7 9 11 15 16 16 16 13 6 7 8 6 7 9 8 9 10

(num.: number of sequences, out of 16, with the consensus nucleotide at that position)

Positional information content before and after re-alignment

Positional nucleotide frequencies after re-alignment (aGGAGG pattern)

Enhancement of a weak signal

Deriving the signal ab initio

• “Discrete” (pattern-driven) approaches: word counting

• “Continuous” (profile-driven) approaches: optimization

Word counting. Short words

• Consider all k-mers

• For each k-mer compute the number of sequences containing this k-mer

– (maybe with some mismatches)

• Select the most frequent k-mer

Problem: Complete search is possible only for short words

Assumption: if a long word is over-represented, its subwords are also over-represented

Solution: select a set of over-represented words and combine them into longer words
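
A minimal sketch of the short-word counting step, assuming exact matches only (the mismatch variant mentioned above is not handled). The function names and the toy sequences are illustrative.

```python
from collections import Counter


def kmer_sequence_counts(seqs, k):
    """For each k-mer, count the sequences that contain it at least once."""
    counts = Counter()
    for s in seqs:
        # A set, so that repeated occurrences within one sequence count once.
        counts.update({s[i:i + k] for i in range(len(s) - k + 1)})
    return counts


def most_frequent_kmer(seqs, k):
    counts = kmer_sequence_counts(seqs, k)
    # Ties are broken alphabetically so that the result is deterministic.
    return min(counts.items(), key=lambda kv: (-kv[1], kv[0]))


seqs = ["AAGGAGGT", "CGGAGGAA", "TTGGAGGC"]
print(most_frequent_kmer(seqs, 5))  # ('GGAGG', 3)
```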

Word counting. Long words

• Consider some k-mers

• For each k-mer compute the number of sequences containing this k-mer

– (maybe with some mismatches)

• Select the most frequent k-mer

Problem: what k-tuples to start with?

1st attempt: those actually occurring in the sample.

But: the correct signal (the consensus word) may not be among them.

2nd attempt: those actually occurring in the sample and some neighborhood.

But:

– again, the correct signal (the consensus word) may not be among them;

– the size of the neighborhood grows exponentially

Graph approach

Each k-mer in each sequence corresponds to a vertex. Two k-mers are linked by an arc, if they differ in at most h positions (h<<k).

Thus we obtain an n-partite graph (n is the number of sequences).

A signal corresponds to a clique (a complete subgraph) – or at least a dense subgraph – with vertices in each part.

A simple algorithm

• Remove vertices that cannot be extended to complete subgraphs

– that is, do not have arcs to all parts of the graph

• Remove pairs that cannot be extended …

– that is, do not form triangles with the third vertex in all parts of the graph

• Etc. (will not work “as is” for dense subgraphs)
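
The first pruning step (vertices only) can be sketched as follows. This is a naive quadratic illustration under stated assumptions: vertices are (sequence index, position) pairs, arcs join k-mers from different sequences that differ in at most h positions, and the names are hypothetical. The pair and higher-order steps are not implemented.

```python
def hamming(a, b):
    """Number of mismatching positions between two words of equal length."""
    return sum(x != y for x, y in zip(a, b))


def prune_vertices(seqs, k, h):
    """Iteratively drop k-mers that lack an arc to some other sequence."""
    words = {(i, p): s[p:p + k]
             for i, s in enumerate(seqs)
             for p in range(len(s) - k + 1)}
    alive = set(words)
    changed = True
    while changed:
        changed = False
        for v in list(alive):
            linked_parts = {u[0] for u in alive
                            if u[0] != v[0]
                            and hamming(words[u], words[v]) <= h}
            if len(linked_parts) < len(seqs) - 1:
                alive.discard(v)  # cannot belong to a clique spanning all parts
                changed = True
    return sorted(alive)


print(prune_vertices(["GGACGT", "ACGTCC", "TACGTA"], k=4, h=0))
```

With h = 0 only the word shared by all three toy sequences survives; larger h keeps near-matches as well.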

Optimization. EM algorithms

• Generate an initial set of profiles (e.g. seed with all k-mers)

• For each profile

– find the best (highest scoring) representative in each sequence

– update the profile

• Iterate until convergence

This algorithm converges.

However, it cannot leave the basin of attraction.

Thus, if the initial approximation is bad, it will converge to nonsense.

Solution: stochastic optimization.
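
The iteration described above (best representative in each sequence, then profile update) can be sketched as below. It is a simplified illustration for a single seed k-mer; the weights use the same pseudocount formula as the PWM slides, and all names are hypothetical.

```python
import math


def weights(sites, pseudocount=0.5):
    """Positional weights from a list of equal-length sites."""
    W = []
    for j in range(len(sites[0])):
        logs = {b: math.log(sum(s[j] == b for s in sites) + pseudocount)
                for b in "ACGT"}
        mean_log = sum(logs.values()) / 4
        W.append({b: logs[b] - mean_log for b in "ACGT"})
    return W


def best_site(seq, W):
    """Start of the highest-scoring window in seq."""
    k = len(W)
    return max(range(len(seq) - k + 1),
               key=lambda p: sum(W[j][seq[p + j]] for j in range(k)))


def refine(seqs, seed, max_iter=100):
    """Alternate site selection and profile update until positions stop changing."""
    k = len(seed)
    W = weights([seed])
    positions = None
    for _ in range(max_iter):
        new_positions = [best_site(s, W) for s in seqs]
        if new_positions == positions:
            break
        positions = new_positions
        W = weights([s[p:p + k] for s, p in zip(seqs, positions)])
    return positions, W


seqs = ["TTTTGGAGGTTT", "CCGGAGGCCCCC", "ATATATGGAGGA"]
positions, W = refine(seqs, "GGAGG")
print(positions)  # [4, 2, 6]
```

Starting from a poor seed, the same loop settles on whatever local optimum is nearest, which is the limitation noted above.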

Simulated annealing

• Goal: maximize the information content I

$I = \sum_j \sum_b f(b,j) \log [f(b,j) / p(b)]$

• or any other measure of homogeneity of the sites

Let A be the current signal (set of candidate sites), and let I(A) be the corresponding information content.

Let B be a set of sites obtained by randomly choosing a different site in one sequence, and let I(B) be its information content.

• if $I(B) \ge I(A)$, B is accepted

• if $I(B) < I(A)$, B is accepted with probability $P = \exp[(I(B) - I(A)) / T]$

The temperature T decreases exponentially, but slowly; the initial temperature is chosen such that almost all changes are accepted.
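
A compact sketch of this annealing loop. The starting temperature, cooling factor and number of steps are arbitrary illustrative values that would need tuning; the information content uses a uniform background and base-2 logarithms; and the function returns the best state seen rather than the final one. All names are hypothetical.

```python
import math
import random


def information_content(sites, background=0.25):
    """Sum over positions and bases of f * log2(f / p), in bits."""
    total = 0.0
    for j in range(len(sites[0])):
        for b in "ACGT":
            f = sum(s[j] == b for s in sites) / len(sites)
            if f > 0:
                total += f * math.log2(f / background)
    return total


def anneal(seqs, k, steps=2000, t_start=1.0, cooling=0.999, seed=0):
    rng = random.Random(seed)

    def score(positions):
        return information_content([s[p:p + k] for s, p in zip(seqs, positions)])

    current = [rng.randrange(len(s) - k + 1) for s in seqs]
    current_score = score(current)
    best, best_score = list(current), current_score
    temperature = t_start
    for _ in range(steps):
        i = rng.randrange(len(seqs))           # pick one sequence
        candidate = list(current)
        candidate[i] = rng.randrange(len(seqs[i]) - k + 1)  # new site in it
        delta = score(candidate) - current_score
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            current, current_score = candidate, current_score + delta
            if current_score > best_score:
                best, best_score = list(current), current_score
        temperature *= cooling                 # slow exponential cooling
    return best, score(best)


seqs = ["TTTTGGAGGTTT", "CCGGAGGCCCCC", "ATATATGGAGGA"]
positions, info = anneal(seqs, k=5)
print(positions, round(info, 2))
```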

Gibbs sampler

Again, A is a signal (set of sites), and I(A) is its information content.

At each step a new site is selected in one sequence with probability

$P \sim \exp[I(A_{new})]$

For each candidate site the total time of occupation is computed. (Note that the signal changes all the time.)
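
The sampling step and the occupancy bookkeeping can be sketched as follows. This is an illustration under assumptions: I is computed with a uniform background and base-2 logarithms, the sequence to resample is chosen at random, and the step count is arbitrary. All names are hypothetical.

```python
import math
import random
from collections import Counter


def information_content(sites, background=0.25):
    """Sum over positions and bases of f * log2(f / p), in bits."""
    total = 0.0
    for j in range(len(sites[0])):
        for b in "ACGT":
            f = sum(s[j] == b for s in sites) / len(sites)
            if f > 0:
                total += f * math.log2(f / background)
    return total


def gibbs_occupancy(seqs, k, steps=500, seed=0):
    """Return, per sequence, how many steps each start position was occupied."""
    rng = random.Random(seed)
    positions = [rng.randrange(len(s) - k + 1) for s in seqs]
    occupancy = [Counter() for _ in seqs]
    for _ in range(steps):
        i = rng.randrange(len(seqs))
        starts = list(range(len(seqs[i]) - k + 1))
        scores = []
        for p in starts:
            trial = positions[:i] + [p] + positions[i + 1:]
            scores.append(information_content(
                [s[q:q + k] for s, q in zip(seqs, trial)]))
        top = max(scores)
        probs = [math.exp(x - top) for x in scores]  # P ~ exp(I), rescaled
        positions[i] = rng.choices(starts, weights=probs)[0]
        for j, p in enumerate(positions):
            occupancy[j][p] += 1
    return occupancy


seqs = ["TTTTGGAGGTTT", "CCGGAGGCCCCC", "ATATATGGAGGA"]
occupancy = gibbs_occupancy(seqs, k=5)
print([occ.most_common(1)[0] for occ in occupancy])
```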

Use of symmetry

• DNA-binding factors and their signals

Co-operative homogeneous

Palindromes

Repeats

Co-operative non-homogeneous

Cassettes

Others

RNA signals

Recognition: PWM/profiles

The simplest technique: positional nucleotide weights are

$W(b,j) = \ln(N(b,j) + 0.5) - 0.25 \sum_i \ln(N(i,j) + 0.5)$

Score of a candidate site b1…bk is the sum of the corresponding positional nucleotide weights:

$S(b_1 \ldots b_k) = \sum_{j=1}^{k} W(b_j, j)$
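
A short sketch of site scoring and scanning with these weights, using the count matrix from the PWM slide. The `scan` helper, the test sequence and the threshold are illustrative assumptions; only one strand is scanned.

```python
import math

# Counts N(b, j) copied from the frequency matrix on the PWM slide.
COUNTS = {
    "A": [6, 0, 0, 2, 9, 9, 8, 0, 0, 1, 0, 0, 0, 0, 0, 0],
    "C": [1, 8, 0, 7, 0, 0, 1, 9, 0, 0, 0, 0, 0, 9, 1, 0],
    "G": [1, 1, 9, 0, 0, 0, 0, 0, 9, 1, 1, 0, 5, 0, 5, 0],
    "T": [1, 0, 0, 0, 0, 0, 0, 0, 0, 7, 8, 9, 4, 0, 3, 9],
}


def weight_matrix(counts, pseudocount=0.5):
    length = len(counts["A"])
    weights = {b: [0.0] * length for b in "ACGT"}
    for j in range(length):
        logs = {b: math.log(counts[b][j] + pseudocount) for b in "ACGT"}
        mean_log = 0.25 * sum(logs.values())
        for b in "ACGT":
            weights[b][j] = logs[b] - mean_log
    return weights


def score(site, W):
    """S(b1...bk) = sum over j of W(bj, j)."""
    return sum(W[b][j] for j, b in enumerate(site))


def scan(seq, W, threshold):
    """All (position, score) pairs with score >= threshold."""
    k = len(W["A"])
    hits = []
    for p in range(len(seq) - k + 1):
        s = score(seq[p:p + k], W)
        if s >= threshold:
            hits.append((p, s))
    return hits


W = weight_matrix(COUNTS)
seq = "TTTT" + "ACGCAAACGTTTGCGT" + "TTTT"
print(scan(seq, W, threshold=20.0))
```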

Distribution of RBS profile scores on sites (green) and non-sites (red)

Pattern recognition

• Linear discriminant analysis

• Logical rules

• Syntactic analysis

• Context-sensitive grammars

• Perceptron

• Neural networks

Neural networks: architecture

• 4k input neurons (sensors), each responsible for observing a particular nucleotide at particular position

OR 2k neurons (one discriminates between purines and pyrimidines, the other, between AT and GC)

• One or more layers of hidden neurons

• One output neuron

• Each neuron is connected to all neurons of the next layer

• Each connection is ascribed a numerical weight

A neuron

• Sums the signals at incoming connections

• Compares the total with the threshold (or transforms it according to a fixed function)

• If the threshold is passed, excites the outgoing connections (resp. sends the modified value)

Training:

• Sites and non-sites from the training sample are presented one by one.

• The output neuron produces the prediction.

• The connection weights and thresholds are modified if the prediction is incorrect.

Networks differ by architecture, particulars of the signal processing, the training schedule
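
The simplest instance of this scheme is a single-neuron perceptron with 4k inputs and no hidden layer. The sketch below uses the classical perceptron update rule, which is one possible training schedule and not necessarily the one used in the work cited on the slides; the toy training data and all names are hypothetical.

```python
def encode(site):
    """4k inputs: one sensor per nucleotide per position."""
    x = []
    for base in site:
        x.extend(1.0 if base == c else 0.0 for c in "ACGT")
    return x


def predict(site, w, theta):
    """Output 1 (site) if the weighted sum passes the threshold, else 0."""
    total = sum(wi * xi for wi, xi in zip(w, encode(site)))
    return 1 if total > theta else 0


def train_perceptron(sites, nonsites, epochs=100, rate=1.0):
    data = [(s, 1) for s in sites] + [(s, 0) for s in nonsites]
    w = [0.0] * (4 * len(data[0][0]))
    theta = 0.0
    for _ in range(epochs):
        errors = 0
        for site, label in data:
            out = predict(site, w, theta)
            if out != label:  # modify weights and threshold only on a mistake
                errors += 1
                for i, xi in enumerate(encode(site)):
                    w[i] += rate * (label - out) * xi
                theta -= rate * (label - out)
        if errors == 0:
            break
    return w, theta


sites = ["GGAGG", "GGAGA", "AGAGG"]
nonsites = ["TTTTT", "CCCCC", "ACTCA"]
w, theta = train_perceptron(sites, nonsites)
print([predict(s, w, theta) for s in sites + nonsites])
```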

Use of sequence context

• Presence of multiple co-operative sites

– ArgR (E. coli), purine regulator (Pyrococcus)

– XylR+CRP; CytR+CRP (E. coli)

– MEF+MyoD in muscle-specific promoters (mammals)

• Location relative to promoters – repressors vs. activators

Benchmarking

Difficult, because:

• Different algorithms are optimized for different performance parameters

• Incompatible training sets

• Difficult to construct a homogeneous and unambiguous testing set:

– Unobserved sites

– Competition between closely located sites

– Activation in specific conditions

– Non-specific binding (52 out of 54 candidate HNF-1 binding sites do bind the factor)

Promoters of E. coli

• PWM at false positive rate 1 per 2000 bp:

– 25% of all promoters,

– 60% of constitutive (non-activated) promoters

• PWMs perform as well as neural networks

Eukaryotic promoters

Ribosome binding sites

• Information content of the profile predicts the average reliability of predictions

CRP (E. coli)

[Plot: OV and UN, in % (vertical axis, 0 to 110), as functions of the score threshold (horizontal axis, 3.0 to 5.0)]

OV: overprediction (% of false positives among candidate sites)

UN: underprediction (% of lost true sites)
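
Given the profile scores of known true sites and of false candidates, both percentages can be computed for any threshold. A minimal sketch, assuming a candidate is any window scoring at or above the threshold; the function name and the toy scores are illustrative.

```python
def over_under_prediction(true_scores, false_scores, threshold):
    """Return (OV, UN) in percent for a given score threshold."""
    kept_true = sum(s >= threshold for s in true_scores)
    kept_false = sum(s >= threshold for s in false_scores)
    candidates = kept_true + kept_false
    # OV: share of false positives among all candidates above the threshold.
    ov = 100.0 * kept_false / candidates if candidates else 0.0
    # UN: share of true sites that fall below the threshold.
    un = 100.0 * (len(true_scores) - kept_true) / len(true_scores)
    return ov, un


ov, un = over_under_prediction([5.0, 4.0, 3.0], [4.5, 2.0, 1.0], threshold=3.5)
print(round(ov, 1), round(un, 1))  # 33.3 33.3
```

Raising the threshold lowers OV and raises UN, which is the trade-off the plot displays.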

Comparative approach to the analysis of regulation

Making good predictions with bad rules

Regulation of transcription in prokaryotes

Difficult:

• Small sample size

• Weak signals (or we do not know what features are relevant, maybe the DNA structure)

CRP (E. coli)

[Plot: OV and UN, in % (vertical axis, 0 to 110), as functions of the score threshold (horizontal axis, 3.0 to 5.0)]

OV: overprediction (% of false positives among candidate sites)

UN: underprediction (% of lost true sites)

GenBank entry for the E. coli genome

gene            complement(120178..121551)
                /note="b0112"
                /gene="aroP"
CDS             complement(120178..121551)
                /gene="aroP"
                /product="aromatic amino acid transport protein"
protein_bind    complement(121599..121617)
                /bound_moiety="TyrR documented site"
protein_bind    complement(121622..121640)
                /bound_moiety="TyrR documented site"
protein_bind    complement(121653..121664)
                /bound_moiety="PutA predicted site"
promoter        complement(121683..121711)
                /note="factor Sigma70; promoter aroP; documented +1 at 121671"
protein_bind    complement(121810..121823)
                /bound_moiety="OxyR predicted site"
protein_bind    complement(121813..121835)
                /bound_moiety="ArgR predicted site"

aroP TyrR TyrR PutA Pr. OxyR ArgR

Many genomes are available =>

comparative approach

Basic assumption

Regulons (sets of co-regulated genes) are conserved

• well …in some cases

• in fact, in many cases

Corollary: The consistency check

• True sites occur upstream of orthologous genes

• False sites are scattered at random

Orthologs

• Orthologous genes:

– diverged by speciation

– retain cellular role

• Paralogous genes:

– diverged by duplication

– retain biochemical function only

Orthology (definition)

• Genomes are shown as black “pipes”

• 1st event: duplication

• 2nd event: speciation

• Genes of the same color are orthologous

• Genes of different color are paralogous

Search for orthologs (fast and dirty)

[Diagram: Genome 1 and Genome 2 with genes A, B, B", A', B'; orthologs are identified as symmetrical best hits]
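
The symmetrical-best-hit criterion can be sketched as below. The input is assumed to be a dictionary of pairwise similarity scores between genes of the two genomes (for example from an all-against-all protein comparison); the scores and function names are made up for illustration.

```python
def best_hits(scores, axis):
    """Best-scoring partner for each gene on one side of the comparison."""
    best = {}
    for pair, s in scores.items():
        gene, partner = pair[axis], pair[1 - axis]
        if gene not in best or s > best[gene][1]:
            best[gene] = (partner, s)
    return {gene: partner for gene, (partner, _) in best.items()}


def symmetrical_best_hits(scores):
    """Pairs (a, b) where b is the best hit of a and a is the best hit of b."""
    forward = best_hits(scores, 0)   # genome 1 -> genome 2
    backward = best_hits(scores, 1)  # genome 2 -> genome 1
    return sorted((a, b) for a, b in forward.items() if backward.get(b) == a)


# Hypothetical scores: B" is a paralog whose best hit B' prefers B.
scores = {
    ("A", "A'"): 200, ("B", "B'"): 180, ('B"', "B'"): 90,
    ("A", "B'"): 40, ("B", "A'"): 35,
}
print(symmetrical_best_hits(scores))  # [('A', "A'"), ('B', "B'")]
```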

The basic procedure

[Diagram labels: Set of known sites, Profile, Genome 1, Genome 2, …, Genome N]

Accounting for the operon structure

[Diagram comparing operon structure in the «Old» and «New» genomes; gene labels: A, BC, D, XD, EF, E, F, X]

Checklist

• Presence of orthologous transcription factors

• Really orthologous (BETs, COGs etc. are not sufficient)

• * Conservation of the DNA-binding domain

• * Conservation of the core pathway

Purine regulons of E. coli and H. influenzae purR purR guaBA guaBA glyA pyrD pyrD prsA prsA glnB glnB purA purA codBA - codA pyrC - purT - gcvTHP - speAB - - ycfC purB

ycfC purB

purHD glyA

purHDglyA

purL purL cvpApurF

cvpApurF

purMN purMN purKE purKE purC purC yjcD yieG

HI0125

Predicted purine transporters

[Figure: tree of the transporter family with numerical node support values (714 to 1000). Leaves: YgfO, YicE, UAPA_En, UAPC_En, YgfU, 2635740_Bs, 2635741_Bs, YcdG_Ec, UraA_Hi, UraA_Ec, 2895752_Ef, PyrP_Bc, PyrP_Bs, YjcD_Hi, YjcD, YgfQ, YtiP_Bs, 2239289_Bs, YieG, YicO, Y326_Mj, 2314333_Hp, 2689889_Bb, 2689890_Bb, PbuX_Bs]

Changes in the operon structure: more examples

• glnK-amtB loci of methanogenic archaebacteria

M. thermoautotrophicum

NIF amtB glnK NIF amtB glnK

M. jannaschii

NIF glnK amtB

glnK NIF amtB

Tryptophan operons

E. coli

H. influenzae

trpE trpD trpC trpB trpA

ydfG trpB trpA

trpE trpD trpC

Heat shock (HrcA) regulons / CIRCE elements

Bacillus subtilis

CIRCE hrcA grpE dnaK dnaJ

CIRCE groES groEL

Mycobacterium tuberculosis

hrcA dnaJ

dnaK grpE dnaJ

CIRCE groES groEL

CIRCE groEL

Chlamydiae

CIRCE hrcA grpE dnaK

dnaJ

CIRCE groES groEL

groEL

Synechocystis

hrcA

grpE dnaK

dnaJ

CIRCE groES groEL

CIRCE groEL

Mycoplasma

hrcA

grpE

CIRCE dnaK

CIRCE dnaJ

CIRCE groES groEL

CIRCE lon

CIRCE clpB

Closely related genomes: Phylogenetic footprinting

Regulatory sites are more conserved than non-coding regions in general and are often seen as conserved islands in alignments of gene upstream regions.
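
One simple way to pick out conserved islands is to look for runs of alignment columns that are identical in all sequences. The sketch below is illustrative only: the minimum run length is an arbitrary parameter, gap columns are treated as not conserved, and the names and toy alignment are hypothetical.

```python
def conserved_islands(alignment, min_length):
    """Return half-open (start, end) column ranges identical in all rows."""
    first = alignment[0]
    conserved = [
        first[j] != "-" and all(row[j] == first[j] for row in alignment)
        for j in range(len(first))
    ]
    islands, start = [], None
    for j, flag in enumerate(conserved + [False]):  # sentinel closes the last run
        if flag and start is None:
            start = j
        elif not flag and start is not None:
            if j - start >= min_length:
                islands.append((start, j))
            start = None
    return islands


alignment = ["AACGTTA", "CACGTTG", "GACGTTC"]
print(conserved_islands(alignment, min_length=4))  # [(1, 6)]
```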

High conservation

purL

ST AGCGGCATTTTGCGTAACAATGCGCCAGTTGGCAACTT-ATT-CGCAACGATAGCCGCACC--GTATGACAAGAAAAAGC
EC AGCGGCATTTTGCGTAAACCTGCGCCAGATGGCAACTT-ATT-ACAGCCATTGGCGGCACG--CGTTGCTAATTCACGAT
YP AGTGGCATTTTGCGCAACAAAACGCCAGTGTGCAACTTTATTGCGAGCTATTTGCTGAGTCTGCGTTACACACACATAGC

ST GG-TGATT---------TTATTTCT-------ACGCAAACGGTTTCGTCGGCGCGTCAGATTCTTTATAATGACGGCCGT
EC GG-TGATT---------TTATTTCC-------ACGCAAACGGTTTCGTCAGCGCATCAGATTCTTTATAATGACGCCCGT
YP GGCTGTTTCTGACTGAATTATTAATAATAGATACGCAAACGGTTTCGTCGGCGGCTCAGATTCACTATAATGGCGCGCGT

ST TTCCCCCC-------------------TTGCGCACACCAAA--------------GCTTAGAAGACGAGAGA--CTTA--
EC TTCCCCCCC------------------TTGGGTACACCGAAA-------------GCTTAGAAGACGAGAGA--CTTA--
YP TTTGCCCTGTTGTTGCGCCAATGAATGTTGCGCCCAATGAAGTGCTGTTCCAGCCGCTTCGAAGACGAGAGAAACTTAGA

ST TGATGGAAATTCTGCGTGGTTCGCCTGCACTGTCTGCATTCCGTATCAATAAACTGCTGGCGCGCTTTCAGGCTGCCAAC
EC TGATGGAAATTCTGCGTGGTTCGCCTGCACTGTCGGCATTCCGAATCAACAAACTGCTGGCACGTTTTCAGGCTGCCAGG
YP TTATGGAAATACTGCGTGGTTCACCCGCTTTGTCGGCTTTTCGTATCACCAAACTGTTGTCCCGTTGCCAGGATGCTCAC

Low conservation

yjcD

ST AAA-GCATAAAAAGCGGCAAAGTTCAGTTGAAAAAGCGTTGATGATCGCTGGATAATCGTTTGCTTTTTTTTG---CCAC
EC AAA-GAGAAAAAAGCAGCAAACTTCGGTTGAAAAAGCCGCTATGATCGCCGGATAATCGTTTGCTTTTTTTA----CCAC
YP AAATGTATTAAATGTCGCATTCGGGTGTTGATTAGTCACCACTGATGGCTAGATAATCGTTTGCCTTAAATGACATCTGC

ST CC--------GTTTTGT--------ATACGTG----GAGCTAAACGTTTGCTTTTTTGCGGCGCCCCG-G-TTGTCGTAA
EC CC--------GTTTTGT--------ATGCGCG----GAGCTAAACGTTTGCTTTTTTGCGACGCAGCA-AATTGTCGCAA
YP CCTAAACTTCGATTTTTTTTCAGTCATGCGTTCTCCCAGCTAATCGTTTGCTATTTTTCCCCGCTCTATGAGTCAGGGAG

ST ATGTAGC----------ACAAGGA-GATAACGTTGCGCTGTTAGTGGATTACCTCCCACGTATACCGACGAATAATAAAT
EC ACCTGGA----------GCAGGAA-GATAACGTTTCGCTGGCAGGGGATTGTCCGCCACGCATCTTGACGAAAATTAAAC
YP AGTTAGTGAGTTCATCGACAGGAACGGAAACGATTACGTAGAGAAGGGCGCTTGGCTTGGCATGCTATTTTAAAATGA-C

ST TCTCAGGGGATGTTTTCT-ATGTCT------ACGCCTTCAGCGCGTACCGGCGGTTCACTCGACGCCTGGTTTAAAATTT
EC TCTCAGGGGATGTTTTCTTATGTCT------ACGCCATCAGCGCGTACCGGCGGTTCACTCGACGCCTGGTTTAAAATTT
YP ACACAGGGGACATCACC--ATGTCTAGCAGCAACCCTCAAGCACAGCCAAAGGGCACGCTTGATGCATTCTTTAAGCTTA

Degeneration of sites

trpH

ttGtACAagttaactaGTacaa

EC gtcgccgaATGTACTAGAGAACTAGTGCATtagcttat
ST accgcaggATGTACTAGTAAACTAGTTTAAtggattgg
YP gtcgtcggATGTTTTAACTAAATATTTTCAtgagtgat
EH ctcgccgcATGTACTGATGGGTAACCGGCGctgaactg
BA tcactgtatttttttagtatactattaaacttatcctc

Problems and solutions

Unique members of regulons may be lost: use of additional genomes decreases the number of “orphan” regulon members.

Closely related factors may have similar sites: careful analysis of function and analysis of particular sites is usually sufficient to resolve ambiguities.

Too many genomes and regulons: apply preliminary automated screening.

Modification: ubiquitous regulators

• Present in many genomes

• Only core regulon is conserved

• Mode of regulation may vary

• Signals may be slightly different

Arginine repressor ArgR/AhrC

[Figure: ArgR/AhrC regulon loci in Mycobacterium tuberculosis, Bacillus subtilis, Clostridium acetobutylicum, Thermotoga maritima, Deinococcus radiodurans, Escherichia coli, Haemophilus influenzae and Vibrio cholerae. Gene labels include argA, argB, argC, argD, argE, argF, argG, argH, argJ, argR, AhrC, carA, carB, artI, artJ, artM, artP, artQ, rocA, rocB, rocC, rocD, rocE, rocF, rocR, yqiX, yqiY, yqiZ and yqjN, together with locus tags for the other genomes (Rv…, TM…, DR…, HI…, VC…); "?" marks genes without a label.]

ABC transporters (periplasmic components)

[Figure: tree of periplasmic components of ABC transporters; scale bar: 0.1 changes per site. Leaves: TM1170, CA_3898, HI1080, BS_yckK, DR0564, Cpn0604, DR2278, Cpn0482, HI1179, EC_artJ (arg), EC_artI (arg), EC_argT (arg), EC_hisJ (his), TM0593, BS_glnH (gln), Rv0411c, EC_ybeJ, EC_yhdW, BS_yqiX, EC_glnH (gln), CA_0129, DR2154, DR2610, CA_4268, CA_0491, BS_yxeM, CA_1093, BS_ytmK, BS_ytmJ, EC_fliY (biosynthesis of flagellae)]

Modification: horizontal transfer

• Impossible to resolve the orthology relationships: a homologous regulated gene is sufficient for corroboration

• Often regulate large loci (several adjacent operons)

• Signals are mainly conserved

New signals

• Select a group of related genomes

• In each genome select metabolically related genes

• Add possibly co-transcribed genes

• Compare upstream regions for each genome independently

• Construct profiles

• Compare constructed profiles: if similar, then relevant

The purine regulon of Pyrococcus spp.

• Use functional annotation and COGs to select genes encoding enzymes from the purine pathway: purA, purB, purC, purF, purD, purE, purL-I, purL-II, purT, guaA.

• Construct profiles for each genome. The quality of profiles is weak (< 1 bit/position).

• However, the profiles are almost identical.

• There is no significant similarity of upstream regions (outside sites). Thus the profiles are probably correct.

• Low specificity of profiles, thus >300 candidate genes in each genome.

• Observation: in upstream regions of all genes from the initial sample the candidate sites occur twice with 22 bp spacer.

• The new rule is absolutely specific: only one additional gene in each genome.
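
The paired-site rule can be applied to a list of candidate site positions as sketched below. The spacer is assumed here to be measured from the end of the first site to the start of the second, and the site length of 16 in the example is a made-up value; neither is stated on the slide.

```python
def paired_sites(starts, site_length, spacer=22):
    """Pairs of candidate sites separated by exactly `spacer` bp.

    Assumption: the spacer is counted from the end of the first site
    to the start of the second one.
    """
    offset = site_length + spacer
    start_set = set(starts)
    return [(p, p + offset) for p in sorted(start_set) if p + offset in start_set]


# Hypothetical start coordinates of candidate sites in one upstream region.
print(paired_sites([10, 48, 100], site_length=16))  # [(10, 48)]
```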

[Figure: tree of the transporter family with numerical node support values (714 to 1000); additional labels: PH, PA, PF. Leaves: YgfO, YicE, UAPA_En, UAPC_En, YgfU, 2635740_Bs, 2635741_Bs, YcdG_Ec, UraA_Hi, UraA_Ec, 2895752_Ef, PyrP_Bc, PyrP_Bs, YjcD_Hi, YjcD, YgfQ, YtiP_Bs, 2239289_Bs, YieG, YicO, Y326_Mj, 2314333_Hp, 2689889_Bb, 2689890_Bb, PbuX_Bs]

Sources

• G. Stormo

• J. Fickett

• W. Miller

• I. Dubchak

• Yuh et al. (1998)

• Tronche et al. (1997)

• textbooks

Discussions and collaboration

• Farid Chetouani (Institut Pasteur)

• Eugene Koonin (NCBI)

• Yuri Kozlov (Ajinomoto)

• Leonid Mirny (Harvard - MIT)

• Alexander Mironov (GosNIIGenetika)

• Vasily Lyubetsky (Inst. Probl. Inform. Trans.)

• Andrey Osterman (IntegratedGenomics)

• Danila Perumov (Inst. Nucl. Phys.)

• Pavel Pevzner (UC San Diego)

• Michael Roytberg (Inst. Math. Probl. Biol.)

Collaborators

• Andrey A. Mironov

• A. B. Rakhmaninova

• Vadim Brodyansky

• Lyudmila Danilova

• Anna Gerasimova

• Alexey Kazakov

• Ekaterina Kotelnikova

• Olga Laikova

• Pavel Novichkov

• Ekaterina Panina

• Elya Permina

• Dmitry Ravcheev

• Dmitry Rodionov

• Natalya Sadovskaya

• Alexey Vitreschak
