Recognition of regulatory signals Mikhail S. Gelfand IntegratedGenomics-Moscow NATO ASI School, October 2001


Page 1

Recognition of regulatory signals

Mikhail S. Gelfand

IntegratedGenomics-Moscow

NATO ASI School, October 2001

Page 2

Why?

• Additional annotation tool (e.g. specificity of transporters and enzymes from large families)

• Important for practice (in addition to metabolic reconstruction)

• Interesting from the evolutionary point of view

Page 3

Overview

0. Biological introduction

1. Algorithms

• Representation of signals

• Deriving the signal

• Site recognition

2. Comparative genomics

• Phylogenetic footprinting

• Consistency filtering

Page 4

Some biology

• Transcription (DNA → RNA)

• Splicing (pre-mRNA → mRNA)

• Translation (mRNA → protein)

• Regulation of transcription in prokaryotes

• … and eukaryotes

• Initiation of translation

Page 5

Transcription and translation in prokaryotes

Page 6

Initiation of transcription (bacteria)

Page 7

Translation in prokaryotes

Page 8

Translation (details)

Page 9

Splicing (eukaryotes)

Page 10

Regulation of transcription in prokaryotes

Page 11

Structure of DNA-binding domain. Example 1

Page 12

Structure of DNA-binding domain. Example 2

Page 13

Protein-DNA interactions

Page 14

Regulation of transcription in eukaryotes

Page 15

Representation of signals

• Consensus

• Pattern (consensus with degenerate positions)

• Positional weight matrix (PWM, or profile)

• Logical rules

• RNA signals

Page 16

Consensus

codB CCCACGAAAACGATTGCTTTTT

purE GCCACGCAACCGTTTTCCTTGC

pyrD GTTCGGAAAACGTTTGCGTTTT

purT CACACGCAAACGTTTTCGTTTA

cvpA CCTACGCAAACGTTTTCTTTTT

purC GATACGCAAACGTGTGCGTCTG

purM GTCTCGCAAACGTTTGCTTTCC

purH GTTGCGCAAACGTTTTCGTTAC

purL TCTACGCAAACGGTTTCGTCGG

consensus ACGCAAACGTTTTCGT

Page 17

Pattern

codB CCCACGAAAACGATTGCTTTTT

purE GCCACGCAACCGTTTTCCTTGC

pyrD GTTCGGAAAACGTTTGCGTTTT

purT CACACGCAAACGTTTTCGTTTA

cvpA CCTACGCAAACGTTTTCTTTTT

purC GATACGCAAACGTGTGCGTCTG

purM GTCTCGCAAACGTTTGCTTTCC

purH GTTGCGCAAACGTTTTCGTTAC

purL TCTACGCAAACGGTTTCGTCGG

consensus ACGCAAACGTTTTCGT

pattern aCGmAAACGtTTkCkT

Page 18

Frequency matrix

j   a  C  G  m  A  A  A  C  G  t  T  T  k  C  k  T
A   6  0  0  2  9  9  8  0  0  1  0  0  0  0  0  0
C   1  8  0  7  0  0  1  9  0  0  0  0  0  9  1  0
G   1  1  9  0  0  0  0  0  9  1  1  0  5  0  5  0
T   1  0  0  0  0  0  0  0  0  7  8  9  4  0  3  9

$$W(b,j) = \ln\bigl(N(b,j) + 0.5\bigr) - 0.25 \sum_{i} \ln\bigl(N(i,j) + 0.5\bigr)$$

Information content:

$$I = \sum_{j} \sum_{b} f(b,j) \, \log\left[\frac{f(b,j)}{p(b)}\right]$$
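The two formulas can be checked numerically. The sketch below copies the counts N(b,j) from the frequency matrix and computes the positional weights and the information content. The base-2 logarithm, the uniform background p(b) = 0.25 and the function names are assumptions made for illustration only.

```python
import math

# Counts N(b, j) copied from the frequency matrix (9 sites, 16 positions).
COUNTS = {
    "A": [6, 0, 0, 2, 9, 9, 8, 0, 0, 1, 0, 0, 0, 0, 0, 0],
    "C": [1, 8, 0, 7, 0, 0, 1, 9, 0, 0, 0, 0, 0, 9, 1, 0],
    "G": [1, 1, 9, 0, 0, 0, 0, 0, 9, 1, 1, 0, 5, 0, 5, 0],
    "T": [1, 0, 0, 0, 0, 0, 0, 0, 0, 7, 8, 9, 4, 0, 3, 9],
}


def positional_weights(counts, pseudocount=0.5):
    """W(b,j) = ln(N(b,j) + 0.5) - 0.25 * sum_i ln(N(i,j) + 0.5)."""
    length = len(next(iter(counts.values())))
    weights = {b: [] for b in counts}
    for j in range(length):
        logs = {b: math.log(counts[b][j] + pseudocount) for b in counts}
        mean_log = sum(logs.values()) / len(logs)  # 0.25 * sum for four bases
        for b in counts:
            weights[b].append(logs[b] - mean_log)
    return weights


def information_content(counts, background=0.25):
    """I = sum_j sum_b f(b,j) * log2(f(b,j) / p(b)), in bits.

    Assumptions: base-2 logarithm, uniform background p(b) = 0.25,
    and terms with f(b,j) = 0 contribute nothing.
    """
    length = len(next(iter(counts.values())))
    total = 0.0
    for j in range(length):
        n = sum(counts[b][j] for b in counts)
        for b in counts:
            f = counts[b][j] / n
            if f > 0:
                total += f * math.log2(f / background)
    return total


if __name__ == "__main__":
    w = positional_weights(COUNTS)
    print([round(x, 1) for x in w["A"][:4]])
    print(round(information_content(COUNTS), 2))
```

By construction the four weights in each column sum to zero, so a weight is positive only for nucleotides that are more frequent than the column's geometric mean.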

Page 19

Sequence logo

Page 20

Positional weight matrix (PWM)

j      a     C     G     m     A     A     A     C     G     t     T     T     k     C     k     T
A      6     0     0     2     9     9     8     0     0     1     0     0     0     0     0     0
C      1     8     0     7     0     0     1     9     0     0     0     0     0     9     1     0
G      1     1     9     0     0     0     0     0     9     1     1     0     5     0     5     0
T      1     0     0     0     0     0     0     0     0     7     8     9     4     0     3     9

A    1.1  –1.0  –0.7   0.5   2.2   2.2   1.9  –0.7  –0.7  –0.1  –1.0  –0.7  –1.1  –0.7  –1.4  –0.7
C   –0.4   1.9  –0.7   1.6  –0.7  –0.7   0.1   2.2  –0.7  –1.2  –1.0  –0.7  –1.1   2.2  –0.3  –0.7
G   –0.4   0.1   2.2  –1.1  –0.7  –0.7  –1.0  –0.7   2.2  –0.1  –0.1  –0.7   1.2  –0.7   1.0  –0.7
T   –0.4  –1.0  –0.7  –1.1  –0.7  –0.7  –1.0  –0.7  –0.7   1.5   1.9   2.2   1.0  –0.7   0.6   2.2

Page 21

• Probabilistic motivation: log-likelihood (up to a linear transformation)

• More probabilistic motivation: z-score (with the suitable base of the logarithm)

• Thermodynamical motivation: free energy (assuming independence of positions, up to a linear transformation)

• Pseudocounts

$$W(b,j) = \ln\bigl(N(b,j) + 0.5\bigr) - 0.25 \sum_{i} \ln\bigl(N(i,j) + 0.5\bigr)$$

Page 22

Logical rules, trees etc.

Page 23

Compilation of samples

• Initial sample:

– GenBank

– specialized databases

– literature (reviews)

– literature (original papers)

• Correction of GenBank errors

• Checking the literature

– removal of predicted sites

• Removal of duplicates

Page 24

Re-alignment approaches

• Initial alignment by a biological landmark

– start of transcription for promoters

– start codon for ribosome binding sites

– exon-intron boundary for splicing sites

• Deriving the signal within a sliding window

• Re-alignment

• etc. etc. until convergence

Page 25

Gene starts of Bacillus subtilis

dnaN ACATTATCCGTTAGGAGGATAAAAATG

gyrA GTGATACTTCAGGGAGGTTTTTTAATG

serS TCAATAAAAAAAGGAGTGTTTCGCATG

bofA CAAGCGAAGGAGATGAGAAGATTCATG

csfB GCTAACTGTACGGAGGTGGAGAAGATG

xpaC ATAGACACAGGAGTCGATTATCTCATG

metS ACATTCTGATTAGGAGGTTTCAAGATG

gcaD AAAAGGGATATTGGAGGCCAATAAATG

spoVC TATGTGACTAAGGGAGGATTCGCCATG

ftsH GCTTACTGTGGGAGGAGGTAAGGAATG

pabB AAAGAAAATAGAGGAATGATACAAATG

rplJ CAAGAATCTACAGGAGGTGTAACCATG

tufA AAAGCTCTTAAGGAGGATTTTAGAATG

rpsJ TGTAGGCGAAAAGGAGGGAAAATAATG

rpoA CGTTTTGAAGGAGGGTTTTAAGTAATG

rplM AGATCATTTAGGAGGGGAAATTCAATG

Page 26

dnaN ACATTATCCGTTAGGAGGATAAAAATG

gyrA GTGATACTTCAGGGAGGTTTTTTAATG

serS TCAATAAAAAAAGGAGTGTTTCGCATG

bofA CAAGCGAAGGAGATGAGAAGATTCATG

csfB GCTAACTGTACGGAGGTGGAGAAGATG

xpaC ATAGACACAGGAGTCGATTATCTCATG

metS ACATTCTGATTAGGAGGTTTCAAGATG

gcaD AAAAGGGATATTGGAGGCCAATAAATG

spoVC TATGTGACTAAGGGAGGATTCGCCATG

ftsH GCTTACTGTGGGAGGAGGTAAGGAATG

pabB AAAGAAAATAGAGGAATGATACAAATG

rplJ CAAGAATCTACAGGAGGTGTAACCATG

tufA AAAGCTCTTAAGGAGGATTTTAGAATG

rpsJ TGTAGGCGAAAAGGAGGGAAAATAATG

rpoA CGTTTTGAAGGAGGGTTTTAAGTAATG

rplM AGATCATTTAGGAGGGGAAATTCAATG

cons.        aaagtatataagggagggttaataATG
num. (tens)  001000000000110110000000111
num. (units) 760666658967228106888659666

Page 27

dnaN ACATTATCCGTTAGGAGGATAAAAATG

gyrA GTGATACTTCAGGGAGGTTTTTTAATG

serS TCAATAAAAAAAGGAGTGTTTCGCATG

bofA CAAGCGAAGGAGATGAGAAGATTCATG

csfB GCTAACTGTACGGAGGTGGAGAAGATG

xpaC ATAGACACAGGAGTCGATTATCTCATG

metS ACATTCTGATTAGGAGGTTTCAAGATG

gcaD AAAAGGGATATTGGAGGCCAATAAATG

spoVC TATGTGACTAAGGGAGGATTCGCCATG

ftsH GCTTACTGTGGGAGGAGGTAAGGAATG

pabB AAAGAAAATAGAGGAATGATACAAATG

rplJ CAAGAATCTACAGGAGGTGTAACCATG

tufA AAAGCTCTTAAGGAGGATTTTAGAATG

rpsJ TGTAGGCGAAAAGGAGGGAAAATAATG

rpoA CGTTTTGAAGGAGGGTTTTAAGTAATG

rplM AGATCATTTAGGAGGGGAAATTCAATG

cons.        tacataaaggaggtttaaaaat
num. (tens)  0000000111111000000001
num. (units) 5755779156663678679890

Page 28

Positional information content before and after re-alignment

Page 29

Positional nucleotide frequencies after re-alignment (aGGAGG pattern)

Page 30

Enhancement of a weak signal

Page 31

Deriving the signal ab initio

• “Discrete” (pattern-driven) approaches: word counting

• “Continuous” (profile-driven) approaches: optimization

Page 32

Word counting. Short words

• Consider all k-mers

• For each k-mer compute the number of sequences containing this k-mer

– (maybe with some mismatches)

• Select the most frequent k-mer
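A minimal sketch of this word-counting procedure, assuming exact matches only (the optional mismatches are not handled). The function names and the toy sequences in the usage note are illustrative, not taken from the lecture.

```python
from collections import Counter


def kmer_sequence_counts(sequences, k):
    """For each k-mer, count the number of sequences containing it at least once."""
    counts = Counter()
    for seq in sequences:
        # A set, so that repeats within one sequence are counted once.
        kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        counts.update(kmers)
    return counts


def most_frequent_kmer(sequences, k):
    """Return (k-mer, number of sequences); ties are broken alphabetically."""
    counts = kmer_sequence_counts(sequences, k)
    return min(counts.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    toy = ["AAGGAGGT", "CGGAGGTT", "TTAGGAGG"]
    print(most_frequent_kmer(toy, 5))
```

Counting sequences rather than occurrences keeps a single repeat-rich sequence from dominating the result.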

Page 33

Problem: Complete search is possible only for short words

Assumption: if a long word is over-represented, its subwords are also over-represented

Solution: select a set of over-represented words and combine them into longer words

Page 34

Word counting. Long words

• Consider some k-mers

• For each k-mer compute the number of sequences containing this k-mer

– (maybe with some mismatches)

• Select the most frequent k-mer

Page 35

Problem: what k-tuples to start with?

1st attempt: those actually occurring in the sample.

But: the correct signal (the consensus word) may not be among them.

Page 36

2nd attempt: those actually occurring in the sample and some neighborhood.

But:

– again, the correct signal (the consensus word) may not be among them;

– the size of the neighborhood grows exponentially

Page 37

Graph approach

Each k-mer in each sequence corresponds to a vertex. Two k-mers from different sequences are linked by an arc if they differ in at most h positions (h<<k).

Thus we obtain an n-partite graph (n is the number of sequences).

A signal corresponds to a clique (a complete subgraph) – or at least a dense subgraph – with vertices in each part.

Page 38

A simple algorithm

• Remove vertices that cannot be extended to complete subgraphs – that is, do not have arcs to all parts of the graph

• Remove pairs that cannot be extended … – that is, do not form triangles with the third vertex in all parts of the graph

• Etc. (will not work “as is” for dense subgraphs)
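The graph construction and the first pruning step (removing vertices without arcs to every other part) can be sketched as follows. This is only an illustration under stated assumptions: vertices are represented as (sequence index, offset) pairs, arcs are drawn only between k-mers from different sequences, and the pair/triangle pruning step is not implemented.

```python
def hamming(a, b):
    """Number of positions at which two equal-length words differ."""
    return sum(x != y for x, y in zip(a, b))


def build_graph(sequences, k, h):
    """Vertices are (sequence index, offset); arcs join k-mers from different
    sequences that differ in at most h positions."""
    def word(v):
        return sequences[v[0]][v[1]:v[1] + k]

    vertices = [(s, i) for s, seq in enumerate(sequences)
                for i in range(len(seq) - k + 1)]
    neighbours = {v: set() for v in vertices}
    for a in vertices:
        for b in vertices:
            if a[0] < b[0] and hamming(word(a), word(b)) <= h:
                neighbours[a].add(b)
                neighbours[b].add(a)
    return neighbours


def prune(neighbours, n_parts):
    """Repeatedly remove vertices that lack an arc to some other part."""
    changed = True
    while changed:
        changed = False
        for v in list(neighbours):
            parts = {u[0] for u in neighbours[v]}
            if len(parts) < n_parts - 1:
                for u in neighbours.pop(v):
                    neighbours[u].discard(v)
                changed = True
    return neighbours


if __name__ == "__main__":
    seqs = ["TTACGTAA", "GGACGTCC", "CAACGAGT"]
    graph = prune(build_graph(seqs, k=4, h=1), len(seqs))
    print(sorted(graph))
```

The all-against-all comparison is quadratic in the number of k-mers, which is acceptable for a sketch on short upstream regions but not for large samples.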

Page 39

Optimization. EM algorithms

• Generate an initial set of profiles (e.g. seed with all k-mers)

• For each profile

– find the best (highest scoring) representative in each sequence

– update the profile

• Iterate until convergence
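The iteration above can be sketched for a single seed word. The profile weights reuse the pseudocount formula W(b,j) from the PWM slides; choosing the first best-scoring word on ties, the iteration cap and all function names are assumptions made for this illustration.

```python
import math

BASES = "ACGT"


def profile_from_sites(sites, pseudocount=0.5):
    """Positional weights W(b,j) computed from a list of equal-length sites."""
    profile = []
    for column in zip(*sites):
        logs = {b: math.log(column.count(b) + pseudocount) for b in BASES}
        mean_log = sum(logs.values()) / len(BASES)
        profile.append({b: logs[b] - mean_log for b in BASES})
    return profile


def score(profile, word):
    """Sum of positional weights of a candidate word."""
    return sum(col[ch] for col, ch in zip(profile, word))


def best_site(profile, seq):
    """Highest-scoring word of profile length in seq (first one on ties)."""
    k = len(profile)
    words = (seq[i:i + k] for i in range(len(seq) - k + 1))
    return max(words, key=lambda w: score(profile, w))


def refine(seed, sequences, max_iter=50):
    """Start from a seed k-mer, then alternate site selection and profile update."""
    sites = [seed]
    for _ in range(max_iter):
        profile = profile_from_sites(sites)
        new_sites = [best_site(profile, s) for s in sequences]
        if new_sites == sites:  # converged: the site set no longer changes
            break
        sites = new_sites
    return sites


if __name__ == "__main__":
    # Three of the B. subtilis gene starts listed earlier.
    starts = [
        "ACATTATCCGTTAGGAGGATAAAAATG",  # dnaN
        "GTGATACTTCAGGGAGGTTTTTTAATG",  # gyrA
        "AAAGCTCTTAAGGAGGATTTTAGAATG",  # tufA
    ]
    print(refine("GGAGG", starts))
```

Running `refine` once per seed and keeping the profile with the highest information content would correspond to seeding with all k-mers.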

Page 40

This algorithm converges.

However, it cannot leave the basin of attraction.

Thus, if the initial approximation is bad, it will converge to nonsense.

Solution: stochastic optimization.

Page 41

Simulated annealing

• Goal: maximize the information content I

$$I = \sum_{j} \sum_{b} f(b,j) \, \log\left[\frac{f(b,j)}{p(b)}\right]$$

• or any other measure of homogeneity of the sites

Page 42

Let A be the current signal (set of candidate sites), and let I(A) be the corresponding information content.

Let B be a set of sites obtained by randomly choosing a different site in one sequence, and let I(B) be its information content.

• if $I(B) \ge I(A)$, B is accepted

• if $I(B) < I(A)$, B is accepted with probability

$$P = \exp\left[\frac{I(B) - I(A)}{T}\right]$$

The temperature T decreases exponentially, but slowly; the initial temperature is chosen such that almost all changes are accepted.
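The acceptance rule and the exponential cooling schedule can be written down directly. In this sketch the cooling factor, the number of steps and the function names are hypothetical; the lecture does not give specific values.

```python
import math
import random


def accept(i_current, i_candidate, temperature, rng=random):
    """Accept improvements always, and a worse candidate B with
    probability exp((I(B) - I(A)) / T)."""
    if i_candidate >= i_current:
        return True
    return rng.random() < math.exp((i_candidate - i_current) / temperature)


def temperatures(t0, cooling=0.99, steps=1000):
    """Exponentially decreasing temperatures: t0, t0*cooling, t0*cooling**2, ..."""
    t = t0
    for _ in range(steps):
        yield t
        t *= cooling


if __name__ == "__main__":
    for t in temperatures(1.0, cooling=0.5, steps=3):
        print(t, accept(2.0, 1.5, t))
```

At high temperature nearly every change passes, while at low temperature the procedure behaves like a greedy search.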

Page 43

Gibbs sampler

Again, A is a signal (set of sites), and I(A) is its information content.

At each step a new site is selected in one sequence with probability

$$P \propto \exp\left[I(A_{\mathrm{new}})\right]$$

For each candidate site the total time of occupation is computed.

(Note that the signal changes all the time.)
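One sampling step might look as follows. This is a sketch under assumptions, not the sampler used in the lecture: the information content uses a base-2 logarithm with a uniform background, every word of the chosen sequence is a candidate, and the function names are illustrative.

```python
import math
import random


def information(sites):
    """I = sum_j sum_b f(b,j) * log2(f(b,j) / 0.25); uniform background assumed."""
    n = len(sites)
    total = 0.0
    for column in zip(*sites):
        for base in set(column):
            f = column.count(base) / n
            total += f * math.log2(f / 0.25)
    return total


def gibbs_step(sites, sequences, index, rng):
    """Resample the site of sequences[index] with P proportional to exp(I(A_new))."""
    k = len(sites[index])
    seq = sequences[index]
    words = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    weights = []
    for w in words:
        trial = sites[:index] + [w] + sites[index + 1:]
        weights.append(math.exp(information(trial)))
    offset = rng.choices(range(len(words)), weights=weights)[0]
    return offset, words[offset]


if __name__ == "__main__":
    seqs = ["TTAGGAGGAT", "CAGGGAGGTT", "TAAGGAGGAT"]
    current = ["GGAGG", "GGAGG", "TAAGG"]
    print(gibbs_step(current, seqs, 2, random.Random(0)))
```

Tallying the offsets returned over many steps (for example in a `collections.Counter`) gives the occupation time of each candidate site.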

Page 44

Use of symmetry

• DNA-binding factors and their signals

Co-operative homogeneous

Palindromes

Repeats

Co-operative non-homogeneous

Cassettes

Others

RNA signals

Page 45

Recognition: PWM/profiles

The simplest technique: positional nucleotide weights are

$$W(b,j) = \ln\bigl(N(b,j) + 0.5\bigr) - 0.25 \sum_{i} \ln\bigl(N(i,j) + 0.5\bigr)$$

Score of a candidate site $b_1 \dots b_k$ is the sum of the corresponding positional nucleotide weights:

$$S(b_1 \dots b_k) = \sum_{j=1}^{k} W(b_j, j)$$
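A sketch of profile-based recognition: weights are derived from aligned sites and every window of a sequence is scored. The nine sequences are those of the "Consensus" slide; taking the 16-bp signal to start at offset 3 of each 22-bp sequence is an assumption, as are the function names, the threshold and the test sequence.

```python
import math

BASES = "ACGT"

# The nine 22-bp sequences from the "Consensus" slide.
RAW_SITES = [
    "CCCACGAAAACGATTGCTTTTT",  # codB
    "GCCACGCAACCGTTTTCCTTGC",  # purE
    "GTTCGGAAAACGTTTGCGTTTT",  # pyrD
    "CACACGCAAACGTTTTCGTTTA",  # purT
    "CCTACGCAAACGTTTTCTTTTT",  # cvpA
    "GATACGCAAACGTGTGCGTCTG",  # purC
    "GTCTCGCAAACGTTTGCTTTCC",  # purM
    "GTTGCGCAAACGTTTTCGTTAC",  # purH
    "TCTACGCAAACGGTTTCGTCGG",  # purL
]
# Assumption: the 16-bp signal occupies positions 3..18 of each sequence.
SITES = [s[3:19] for s in RAW_SITES]


def weights_from_sites(sites, pseudocount=0.5):
    """W(b,j) = ln(N(b,j) + 0.5) - 0.25 * sum_i ln(N(i,j) + 0.5), per column."""
    columns = []
    for column in zip(*sites):
        logs = {b: math.log(column.count(b) + pseudocount) for b in BASES}
        mean_log = sum(logs.values()) / len(BASES)
        columns.append({b: logs[b] - mean_log for b in BASES})
    return columns


def score(weights, word):
    """S(b1...bk) = sum over j of W(bj, j)."""
    return sum(col[ch] for col, ch in zip(weights, word))


def scan(weights, sequence, threshold):
    """Return (offset, score) for every window scoring at least threshold."""
    k = len(weights)
    hits = []
    for i in range(len(sequence) - k + 1):
        s = score(weights, sequence[i:i + k])
        if s >= threshold:
            hits.append((i, s))
    return hits


if __name__ == "__main__":
    W = weights_from_sites(SITES)
    print(scan(W, "TTTTACGCAAACGTTTTCGTGGGG", threshold=20.0))
```

Only one strand is scanned here; a practical search would also score the reverse complement.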

Page 46

Distribution of RBS profile scores on sites (green) and non-sites (red)

Page 47

Pattern recognition

• Linear discriminant analysis

• Logical rules

• Syntactic analysis

• Context-sensitive grammars

• Perceptron

• Neural networks

Page 48

Neural networks: architecture

• 4k input neurons (sensors), each responsible for observing a particular nucleotide at particular position

OR 2k neurons (one discriminates between purines and pyrimidines, the other, between AT and GC)

• One or more layers of hidden neurons

• One output neuron

Page 49

• Each neuron is connected to all neurons of the next layer

• Each connection is ascribed a numerical weight

A neuron

• Sums the signals at incoming connections

• Compares the total with the threshold (or transforms it according to a fixed function)

• If the threshold is passed, excites the outgoing connections (resp. sends the modified value)

Page 50

Training:

• Sites and non-sites from the training sample are presented one by one.

• The output neuron produces the prediction.

• The connection weights and thresholds are modified if the prediction is incorrect.

Networks differ by architecture, particulars of the signal processing, and the training schedule.
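The simplest instance of this scheme is a perceptron: 4k input sensors connected directly to one output neuron, with no hidden layer. The sketch below is that simplification, not a multilayer network; the learning rate, the number of epochs and the toy training examples are hypothetical.

```python
BASES = "ACGT"


def encode(site):
    """4k binary inputs: one sensor per (position, nucleotide) pair."""
    return [1.0 if ch == b else 0.0 for ch in site for b in BASES]


def predict(weights, threshold, inputs):
    """The output neuron fires if the weighted sum exceeds the threshold."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return total > threshold


def train(examples, k, epochs=100, rate=0.1):
    """Present examples one by one; adjust weights only after a wrong prediction."""
    weights = [0.0] * (4 * k)
    threshold = 0.0
    for _ in range(epochs):
        errors = 0
        for site, is_site in examples:
            x = encode(site)
            if predict(weights, threshold, x) != is_site:
                sign = 1.0 if is_site else -1.0
                weights = [w + sign * rate * xi for w, xi in zip(weights, x)]
                threshold -= sign * rate
                errors += 1
        if errors == 0:  # every training example is classified correctly
            break
    return weights, threshold


if __name__ == "__main__":
    data = [("GGAGG", True), ("AGGAG", True), ("TTTTT", False), ("CATCA", False)]
    w, t = train(data, k=5)
    print([predict(w, t, encode(s)) for s, _ in data])
```

A perceptron of this form computes a weighted sum over positions, as a PWM score with a threshold does.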

Page 51

Use of sequence context

• Presence of multiple co-operative sites

– ArgR (E. coli), purine regulator (Pyrococcus)

– XylR+CRP; CytR+CRP (E. coli)

– MEF+MyoD in muscle-specific promoters (mammals)

• Location relative to promoters – repressors vs. activators

Page 52

Benchmarking

Difficult, because:

• Different algorithms are optimized for different performance parameters

• Incompatible training sets

• Difficult to construct a homogeneous and unambiguous testing set:

– Unobserved sites

– Competition between closely located sites

– Activation in specific conditions

– Non-specific binding (52 out of 54 candidate HNF-1 binding sites do bind the factor)

Page 53

Promoters of E. coli

• A PWM at a false-positive rate of 1 per 2000 bp recognizes:

– 25% of all promoters,

– 60% of constitutive (non-activated) promoters

• PWMs perform as well as neural networks

Page 54

Eukaryotic promoters

Page 55

Ribosome binding sites

• Information content of the profile predicts the average reliability of predictions

Page 56

CRP (E. coli)

[Plot: OV and UN, in % (vertical axis, 0 to 110), as functions of the score threshold (horizontal axis, 3 to 5 in steps of 0.2)]

OV: overprediction (% of false positives among candidate sites)

UN: underprediction (% of lost true sites)
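Given profile scores for known sites and for non-sites, the two quantities follow directly from their definitions. The sketch below assumes that a candidate is any word scoring at or above the threshold; the function name and the scores in the usage note are hypothetical.

```python
def over_under(true_scores, false_scores, threshold):
    """Return (OV, UN) in percent for a given score threshold.

    OV: share of false positives among all candidates above the threshold.
    UN: share of true sites that fall below the threshold.
    """
    true_pos = sum(s >= threshold for s in true_scores)
    false_pos = sum(s >= threshold for s in false_scores)
    candidates = true_pos + false_pos
    ov = 100.0 * false_pos / candidates if candidates else 0.0
    un = 100.0 * (len(true_scores) - true_pos) / len(true_scores)
    return ov, un


if __name__ == "__main__":
    sites = [4.5, 4.0, 3.5, 5.0]       # hypothetical scores of true sites
    non_sites = [3.2, 3.6, 4.1]        # hypothetical scores of non-sites
    for thr in (3.0, 4.0, 5.0):
        print(thr, over_under(sites, non_sites, thr))
```

Raising the threshold lowers OV at the cost of a higher UN, which is the trade-off the plot shows.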

Page 57

Comparative approach to the analysis of regulation

Making good predictions

with bad rules

Page 58

Regulation of transcription in prokaryotes

Difficult:

• Small sample size

• Weak signals (or we do not know what features are relevant, maybe the DNA structure)

Page 59

CRP (E. coli)

[Plot: OV and UN, in % (vertical axis, 0 to 110), as functions of the score threshold (horizontal axis, 3 to 5 in steps of 0.2)]

OV: overprediction (% of false positives among candidate sites)

UN: underprediction (% of lost true sites)

Page 60

GenBank entry for the E. coli genome

gene            complement(120178..121551)
                /note="b0112"
                /gene="aroP"
CDS             complement(120178..121551)
                /gene="aroP"
                /product="aromatic amino acid transport protein"
protein_bind    complement(121599..121617)
                /bound_moiety="TyrR documented site"
protein_bind    complement(121622..121640)
                /bound_moiety="TyrR documented site"
protein_bind    complement(121653..121664)
                /bound_moiety="PutA predicted site"
promoter        complement(121683..121711)
                /note="factor Sigma70; promoter aroP; documented +1 at 121671"
protein_bind    complement(121810..121823)
                /bound_moiety="OxyR predicted site"
protein_bind    complement(121813..121835)
                /bound_moiety="ArgR predicted site"

aroP TyrR TyrR PutA Pr. OxyR ArgR

Page 61

Many genomes are available =>

comparative approach

Basic assumption

Regulons (sets of co-regulated genes) are conserved

• well …in some cases

• in fact, in many cases

Page 62

Corollary: The consistency check

• True sites occur upstream of orthologous genes

• False sites are scattered at random
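The consistency check can be sketched as a filter over ortholog groups: a prediction is kept only if candidate sites are found upstream of orthologs in several genomes. The data layout, the `min_genomes` cutoff and all gene and genome names below are hypothetical.

```python
def consistent_genes(candidates, orthologs, min_genomes=2):
    """Keep ortholog groups with candidate sites in at least min_genomes genomes.

    candidates: genome name -> set of genes with a candidate site upstream.
    orthologs:  group name  -> {genome name: gene name in that genome}.
    """
    kept = {}
    for group, members in orthologs.items():
        support = [genome for genome, gene in members.items()
                   if gene in candidates.get(genome, set())]
        if len(support) >= min_genomes:
            kept[group] = sorted(support)
    return kept


if __name__ == "__main__":
    candidates = {
        "genome1": {"purA", "lacZ"},
        "genome2": {"purA_2", "xylB_2"},
        "genome3": {"purA_3"},
    }
    orthologs = {
        "purA": {"genome1": "purA", "genome2": "purA_2", "genome3": "purA_3"},
        "lacZ": {"genome1": "lacZ", "genome2": "lacZ_2"},
    }
    print(consistent_genes(candidates, orthologs))
```

In the toy data the site upstream of `lacZ` is seen in one genome only and is therefore treated as a likely false positive.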

Page 63

Orthologs

• Orthologous genes:

– diverged by speciation

– retain cellular role

• Paralogous genes:

– diverged by duplication

– retain biochemical function only

Page 64

Orthology (definition)

• Genomes are shown as black “pipes”

• 1st event: duplication

• 2nd event: speciation

• Genes of the same color are orthologous

• Genes of different color are paralogous

Page 65

Search for orthologs (fast and dirty)

Genome 1 Genome 2

symmetrical best hit

A

B

B"

A'

B'
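The symmetrical-best-hit criterion can be sketched as follows, assuming similarity scores between genes of the two genomes are already available (for example from a sequence-similarity search). The score values and the data layout are hypothetical; the gene names follow the diagram.

```python
def best_hits(scores):
    """scores: {(query, subject): similarity}. Return the best subject per query."""
    best = {}
    for (query, subject), value in scores.items():
        if query not in best or value > best[query][1]:
            best[query] = (subject, value)
    return {query: subject for query, (subject, _) in best.items()}


def symmetrical_best_hits(scores_1_vs_2, scores_2_vs_1):
    """Pairs (a, b) such that b is the best hit of a and a is the best hit of b."""
    forward = best_hits(scores_1_vs_2)
    reverse = best_hits(scores_2_vs_1)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


if __name__ == "__main__":
    g1_vs_g2 = {("A", "A'"): 90, ("A", "B'"): 30, ("B", "B'"): 80, ("B", 'B"'): 70}
    g2_vs_g1 = {("A'", "A"): 90, ("B'", "B"): 80, ('B"', "B"): 70, ("B'", "A"): 30}
    print(symmetrical_best_hits(g1_vs_g2, g2_vs_g1))
```

In the toy data B" has B as its best hit, but B prefers B', so B" is not reported as an ortholog of B.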

Page 66

The basic procedure

[Diagram labels: Set of known sites; Profile; Genome 1; Genome 2; Genome N]

Page 67

Accounting for the operon structure

«Old» genome «New» genome

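[Diagram: operons A, BC, D, EF in the «old» genome and A, BC, XD, E, F in the «new» genome; the diagram also carries four X marks]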

Page 68

Checklist

• Presence of orthologous transcription factors

• Really orthologous (BETs, COGs etc. are not sufficient)

• * Conservation of the DNA-binding domain

• * Conservation of the core pathway

Page 69

Purine regulons of E. coli and H. influenzae purR purR guaBA guaBA glyA pyrD pyrD prsA prsA glnB glnB purA purA codBA - codA pyrC - purT - gcvTHP - speAB - - ycfC purB

ycfC purB

purHD glyA

purHDglyA

purL purL cvpApurF

cvpApurF

purMN purMN purKE purKE purC purC yjcD yieG

HI0125

Page 70

Predicted purine transporters

[Phylogenetic tree. Leaves: YgfO, YicE, UAPA_En, UAPC_En, YgfU, 2635740_Bs, 2635741_Bs, YcdG_Ec, UraA_Hi, UraA_Ec, 2895752_Ef, PyrP_Bc, PyrP_Bs, YjcD_Hi, YjcD, YgfQ, YtiP_Bs, 2239289_Bs, YieG, YicO, Y326_Mj, 2314333_Hp, 2689889_Bb, 2689890_Bb, PbuX_Bs. Numeric branch labels: 997, 746, 979, 965, 969, 981, 997, 980, 965, 758, 940, 714, 996, 997, 999, 994, 778, 749, 998, 1000.]

Page 71

Changes in the operon structure: more examples

• glnK-amtB loci of methanogenic archaebacteria

M. thermoautotrophicum

NIF amtB glnK NIF amtB glnK

M. jannaschii

NIF glnK amtB

glnK NIF amtB

Page 72

Tryptophan operons

E. coli: trpE trpD trpC trpB trpA

H. influenzae: ydfG trpB trpA; trpE trpD trpC

Page 73

Heat shock (HrcA) regulons / CIRCE elements

Bacillus subtilis

CIRCE hrcA grpE dnaK dnaJ

CIRCE groES groEL

Mycobacterium tuberculosis

hrcA dnaJ

dnaK grpE dnaJ

CIRCE groES groEL

CIRCE groEL

Chlamydiae

CIRCE hrcA grpE dnaK

dnaJ

CIRCE groES groEL

groEL

Synechocystis

hrcA

grpE dnaK

dnaJ

CIRCE groES groEL

CIRCE groEL

Mycoplasma

hrcA

grpE

CIRCE dnaK

CIRCE dnaJ

CIRCE groES groEL

CIRCE lon

CIRCE clpB

Page 74

Closely related genomes: Phylogenetic footprinting

Regulatory sites are more conserved than non-coding regions in general and are often seen as conserved islands in alignments of gene upstream regions.
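A sketch of how conserved islands might be extracted from a multiple alignment of upstream regions: report runs of gap-free columns that are identical in all sequences. The minimum run length and the toy alignment are assumptions; real footprinting would usually tolerate some mismatches.

```python
def conserved_islands(aligned, min_length=8):
    """Return half-open column ranges (start, end) in which all aligned
    sequences carry the same nucleotide and none has a gap."""
    length = len(aligned[0])
    islands = []
    start = None
    # One extra iteration (j == length) closes a run that reaches the end.
    for j in range(length + 1):
        conserved = (
            j < length
            and aligned[0][j] != "-"
            and all(seq[j] == aligned[0][j] for seq in aligned)
        )
        if conserved and start is None:
            start = j
        elif not conserved and start is not None:
            if j - start >= min_length:
                islands.append((start, j))
            start = None
    return islands


if __name__ == "__main__":
    alignment = [
        "AAACGCAAACGTTTG-T",
        "CTACGCAAACGTTTGAT",
        "GAACGCAAACGTATGCT",
    ]
    for begin, end in conserved_islands(alignment):
        print(begin, end, alignment[0][begin:end])
```

The `min_length` parameter trades sensitivity against the chance of reporting short runs that are conserved by accident.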

Page 75

High conservation

purL

ST AGCGGCATTTTGCGTAACAATGCGCCAGTTGGCAACTT-ATT-CGCAACGATAGCCGCACC--GTATGACAAGAAAAAGC
EC AGCGGCATTTTGCGTAAACCTGCGCCAGATGGCAACTT-ATT-ACAGCCATTGGCGGCACG--CGTTGCTAATTCACGAT
YP AGTGGCATTTTGCGCAACAAAACGCCAGTGTGCAACTTTATTGCGAGCTATTTGCTGAGTCTGCGTTACACACACATAGC
   ** *********** ** ****** ******* *** * ** * * * *

ST GG-TGATT---------TTATTTCT-------ACGCAAACGGTTTCGTCGGCGCGTCAGATTCTTTATAATGACGGCCGT
EC GG-TGATT---------TTATTTCC-------ACGCAAACGGTTTCGTCAGCGCATCAGATTCTTTATAATGACGCCCGT
YP GGCTGTTTCTGACTGAATTATTAATAATAGATACGCAAACGGTTTCGTCGGCGGCTCAGATTCACTATAATGGCGCGCGT
   ** ** ** ***** ***************** *** ******** ******* ** ***

ST TTCCCCCC-------------------TTGCGCACACCAAA--------------GCTTAGAAGACGAGAGA--CTTA--
EC TTCCCCCCC------------------TTGGGTACACCGAAA-------------GCTTAGAAGACGAGAGA--CTTA--
YP TTTGCCCTGTTGTTGCGCCAATGAATGTTGCGCCCAATGAAGTGCTGTTCCAGCCGCTTCGAAGACGAGAGAAACTTAGA
   ** *** *** * ** ** **** ************ ****

ST TGATGGAAATTCTGCGTGGTTCGCCTGCACTGTCTGCATTCCGTATCAATAAACTGCTGGCGCGCTTTCAGGCTGCCAAC
EC TGATGGAAATTCTGCGTGGTTCGCCTGCACTGTCGGCATTCCGAATCAACAAACTGCTGGCACGTTTTCAGGCTGCCAGG
YP TTATGGAAATACTGCGTGGTTCACCCGCTTTGTCGGCTTTTCGTATCACCAAACTGTTGTCCCGTTGCCAGGATGCTCAC
   * ******** *********** ** ** **** ** ** ** **** ****** ** * ** * **** ***

Page 76

Low conservation

yjcD

ST AAA-GCATAAAAAGCGGCAAAGTTCAGTTGAAAAAGCGTTGATGATCGCTGGATAATCGTTTGCTTTTTTTTG---CCAC
EC AAA-GAGAAAAAAGCAGCAAACTTCGGTTGAAAAAGCCGCTATGATCGCCGGATAATCGTTTGCTTTTTTTA----CCAC
YP AAATGTATTAAATGTCGCATTCGGGTGTTGATTAGTCACCACTGATGGCTAGATAATCGTTTGCCTTAAATGACATCTGC
   *** * *** * *** ***** * * **** ** ************* ** * * *

ST CC--------GTTTTGT--------ATACGTG----GAGCTAAACGTTTGCTTTTTTGCGGCGCCCCG-G-TTGTCGTAA
EC CC--------GTTTTGT--------ATGCGCG----GAGCTAAACGTTTGCTTTTTTGCGACGCAGCA-AATTGTCGCAA
YP CCTAAACTTCGATTTTTTTTCAGTCATGCGTTCTCCCAGCTAATCGTTTGCTATTTTTCCCCGCTCTATGAGTCAGGGAG
   ** * *** * ** ** ****** ******** **** * *** * * *

ST ATGTAGC----------ACAAGGA-GATAACGTTGCGCTGTTAGTGGATTACCTCCCACGTATACCGACGAATAATAAAT
EC ACCTGGA----------GCAGGAA-GATAACGTTTCGCTGGCAGGGGATTGTCCGCCACGCATCTTGACGAAAATTAAAC
YP AGTTAGTGAGTTCATCGACAGGAACGGAAACGATTACGTAGAGAAGGGCGCTTGGCTTGGCATGCTATTTTAAAATGA-C
   * * * ** * * * **** * * ** * * ** * * * *

ST TCTCAGGGGATGTTTTCT-ATGTCT------ACGCCTTCAGCGCGTACCGGCGGTTCACTCGACGCCTGGTTTAAAATTT
EC TCTCAGGGGATGTTTTCTTATGTCT------ACGCCATCAGCGCGTACCGGCGGTTCACTCGACGCCTGGTTTAAAATTT
YP ACACAGGGGACATCACC--ATGTCTAGCAGCAACCCTCAAGCACAGCCAAAGGGCACGCTTGATGCATTCTTTAAGCTTA
   * ******* * * ****** * ** *** * * ** * ** ** ** * ***** **

Page 77

Degeneration of sites

trpH

           ttGtACAagttaactaGTacaa
EC gtcgccgaATGTACTAGAGAACTAGTGCATtagcttat
ST accgcaggATGTACTAGTAAACTAGTTTAAtggattgg
YP gtcgtcggATGTTTTAACTAAATATTTTCAtgagtgat
EH ctcgccgcATGTACTGATGGGTAACCGGCGctgaactg
   .**..* ****..*. .. .* . . . .
BA tcactgtatttttttagtatactattaaacttatcctc

Page 78

Problems and solutions

Unique members of regulons may be lost: use of additional genomes decreases the number of “orphan” regulon members.

Closely related factors may have similar sites: careful analysis of function and analysis of particular sites is usually sufficient to resolve ambiguities.

Too many genomes and regulons: apply preliminary automated screening.

Page 79

Modification: ubiquitous regulators

• Present in many genomes

• Only core regulon is conserved

• Mode of regulation may vary

• Signals may be slightly different

Page 80

Arginine repressor ArgR/AhrC

artJRv1652 Rv1653 Rv1654 Rv1655 Rv1656 Rv1658 Rv1659 Rv1383 Rv1384

argC argJ argB argD argF argGargHcarA carB yqiXyqiYyqiZ

rocRrocC rocArocB rocF rocDrocE

AhrC

2787 278827862785 414 1203 12043089 3090 4268426642652443

yqjN

4913533

TM1782 TM1783 TM1784 TM1785 TM1097TM1780 TM1781 TM0558TM0577 TM0593TM0592TM0591TM0371

? ? ? DR1415 DR0080DR0674 DR0678DR684 DR0668 DR2610 ? ?DR0742

Mycobacterium tuberculosis

Bacillus subtilis

Clostridium acetobutylicum

Thermotoga maritima

Deinococcus radiodurans

AhrC

argC argB argD argFargGargH carA carB artIartM artQargR

Escherichia coli

? HI0596HI0811 HI1727HI1209

Haemophilus influenzae

argE

argA

artP

HI1179 HI1177 HI1178 HI1180

Vibrio cholerae

VC2644 VC2643 VC2641 argR VC2645 VC2642 VC2618 VC2390 VC2389 VC2508 VCA0759 VCA0757 VCA0758 VCA0760 VC2316

Page 81

ABC transporters (periplasmic components)

[Phylogenetic tree. Leaves: TM1170, CA_3898, HI1080, BS_yckK, DR0564, Cpn0604, DR2278, Cpn0482, HI1179, EC_artJ (arg), EC_artI (arg), EC_argT (arg), EC_hisJ (his), TM0593, BS_glnH (gln), Rv0411c, EC_ybeJ, EC_yhdW, BS_yqiX, EC_glnH (gln), CA_0129, DR2154, DR2610, CA_4268, CA_0491, BS_yxeM, CA_1093, BS_ytmK, BS_ytmJ, EC_fliY (biosynthesis of flagellae). Scale bar: 0.1 changes per site.]

Page 82

Modification: horizontal transfer

• Impossible to resolve the orthology relationships: a homologous regulated gene is sufficient for corroboration

• Often regulate large loci (several adjacent operons)

• Signals are mainly conserved

Page 83

New signals

• Select a group of related genomes

• In each genome select metabolically related genes

• Add possibly co-transcribed genes

• Compare upstream regions for each genome independently

• Construct profiles

• Compare constructed profiles: if similar, then relevant

Page 84

The purine regulon of Pyrococcus spp.

• Use functional annotation and COGs to select genes encoding enzymes from the purine pathway: purA, purB, purC, purF, purD, purE, purL-I, purL-II, purT, guaA.

• Construct profiles for each genome. The quality of profiles is weak (< 1 bit/position).

• However, the profiles are almost identical.

• There is no significant similarity of upstream regions (outside sites). Thus the profiles are probably correct.

• Low specificity of profiles, thus >300 candidate genes in each genome.

• Observation: in the upstream regions of all genes from the initial sample, the candidate sites occur twice, separated by a 22 bp spacer.

• The new rule is absolutely specific: only one additional gene in each genome.
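The paired-site rule can be sketched as a filter over profile hits in one upstream region. The sketch assumes that "22 bp spacer" means 22 bp between the end of the first site and the start of the second, and that both sites lie on the same strand; the site length and the offsets in the usage note are hypothetical.

```python
def paired_hits(hit_offsets, site_length, spacer=22):
    """Return pairs of hit offsets whose sites are separated by exactly
    `spacer` bp (end of the first site to start of the second)."""
    offsets = set(hit_offsets)
    step = site_length + spacer
    return sorted((o, o + step) for o in offsets if o + step in offsets)


if __name__ == "__main__":
    # Hypothetical offsets of single-site profile hits in one upstream region.
    print(paired_hits([5, 10, 43, 60], site_length=16))
```

Requiring the pair removes isolated hits, which is why a weak single-site profile can still yield a specific rule.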

Page 85

[Phylogenetic tree of predicted purine transporters, with the same leaves and numeric branch labels as on Page 70 (YgfO, YicE, UAPA_En, UAPC_En, YgfU, …, PbuX_Bs; branch labels from 714 to 1000), plus the additional labels PH, PA A and PF.]

Page 86

Sources

• G. Stormo

• J. Fickett

• W. Miller

• I. Dubchak

• Yuh et al. (1998)

• Tronche et al. (1997)

• textbooks

Page 87

Discussions and collaboration

• Farid Chetouani (Institut Pasteur)

• Eugene Koonin (NCBI)

• Yuri Kozlov (Ajinomoto)

• Leonid Mirny (Harvard - MIT)

• Alexander Mironov (GosNIIGenetika)

• Vasily Lyubetsky (Inst. Probl. Inform. Trans.)

• Andrey Osterman (IntegratedGenomics)

• Danila Perumov (Inst. Nucl. Phys.)

• Pavel Pevzner (UC San Diego)

• Michael Roytberg (Inst. Math. Probl. Biol.)

Page 88

Collaborators

• Andrey A. Mironov

• A. B. Rakhmaninova

• Vadim Brodyansky

• Lyudmila Danilova

• Anna Gerasimova

• Alexey Kazakov

• Ekaterina Kotelnikova

• Olga Laikova

• Pavel Novichkov

• Ekaterina Panina

• Elya Permina

• Dmitry Ravcheev

• Dmitry Rodionov

• Natalya Sadovskaya

• Alexey Vitreschak