Prediction of Regulatory Elements Controlling
Gene Expression
Martin Tompa
Computer Science & Engineering
Genome Sciences
University of Washington
Seattle, Washington, U.S.A.
Outline
• Regulation of genes
• Motif discovery by overrepresentation
– MEME
– Gibbs sampling
• Motif discovery by phylogenetic footprinting
– FootPrinter
– MicroFootPrinter
DNA, Genes, and Proteins
DNA: program for cell processes
Proteins: execute cell processes
[Figure: a stretch of DNA sequence with a gene marked on it, and the protein produced from that gene]
Regulation of Genes
• What turns genes on (producing a protein) and off?
• When is a gene turned on or off?
• Where (in which cells) is a gene turned on?
• At what rate is the gene product produced?
Regulation of Genes
[Figure, shown in three steps. Labels: DNA, Regulatory Element, Gene, Transcription Factor (protein), RNA polymerase (protein), New protein]
Goal
Identify regulatory elements in DNA sequences. These are:
• Binding sites for proteins
• Short sequences (5-25 nucleotides)
• Up to 1000 nucleotides (or farther) from gene
• Inexactly repeating patterns (“motifs”)
2 Types of Motif Discovery
1. Motif discovery by overrepresentation
• One species
• Multiple (co-regulated) genes
2. Motif discovery by phylogenetic footprinting
• Multiple species
• One gene
Overrepresentation: Daf-19 Binding Sites in C. elegans
Binding sites in the upstream regions (positions -150 to -1) of five genes:
che-2      GTTGTCATGGTGAC
daf-19     GTTTCCATGGAAAC
osm-1      GCTACCATGGCAAC
osm-6      GTTACCATAGTAAC
F02D8.3    GTTTCCATGGTAAC
Phylogenetic Footprinting: Regulatory Element of Growth Hormone Gene
Conserved element in the region from -200 to -1:
Chicken    AGGGGATA
Rat        AGGGTATA
Human      AGGGTATA
Dog        AGGGTATA
Sheep      AGGGTATA
MEME
• Multiple EM for Motif Elicitation (Bailey & Elkan, 1995)
• Very general iterative method based on Expectation Maximization
• Available at meme.sdsc.edu/meme/website/intro.html
Overrepresented Motifs
• Given sequences $X = \{X_1, X_2, \ldots, X_n\}$, find statistically overrepresented motifs of length $k$
• For simplicity, assume
– Exactly one motif instance per sequence
– Sequences over DNA alphabet
Hidden Information
• $Z = \{Z_{ij}\}$, where
$$Z_{ij} = \begin{cases} 1, & \text{if a motif instance starts at position } j \text{ of } X_i \\ 0, & \text{otherwise} \end{cases}$$
• Iterate over probabilistic models that could generate $X$ and $Z$, trying to converge on this solution
Model Parameters
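• Motif profile: $4 \times k$ matrix $\theta = (\theta_{rp})$, with $r \in \{A, C, G, T\}$ and $1 \le p \le k$
$\theta_{rp} = \Pr(\text{residue } r \text{ in position } p \text{ of motif})$
• Background distribution:
$\theta_{r0} = \Pr(\text{residue } r \text{ in random nonmotif position})$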
Profile Example
Motif instances:
GTTGTC
GTTTCC
GCTACC
GTTACC
GTTTCC

Profile θ (rows: residues; columns: motif positions 1 to 6):
A   0   0   0   .4  0   0
C   0   .2  0   0   .8  1
G   1   0   0   .2  0   0
T   0   .8  1   .4  .2  0
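The profile in this example can be recomputed directly from the five motif instances. The following is a minimal Python sketch; `build_profile` is a hypothetical helper name (not from the slides), and each entry is a plain column frequency with no pseudocounts.

```python
from collections import Counter

def build_profile(instances, alphabet="ACGT"):
    """Return {residue: [frequency at motif position 1..k]} for equal-length instances."""
    k = len(instances[0])
    n = len(instances)
    profile = {r: [0.0] * k for r in alphabet}
    for p in range(k):
        # Count how often each residue occurs in column p of the instances.
        column = Counter(inst[p] for inst in instances)
        for r in alphabet:
            profile[r][p] = column[r] / n
    return profile

# The five instances from the slide.
instances = ["GTTGTC", "GTTTCC", "GCTACC", "GTTACC", "GTTTCC"]
profile = build_profile(instances)
for r in "ACGT":
    print(r, profile[r])
```

Each column of the result sums to 1, since every instance contributes exactly one residue per position.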
Overview: Expectation Maximization
• Goal: Find profile θ and motif positions Z that have maximum likelihood
• At each iteration:
– E-step: From θ predict likely motif positions Z
– M-step: From sequences at positions Z compute new profile θ
Expectation Maximization
• Goal: Find $\theta$, $Z$ that maximize $\Pr(X, Z \mid \theta)$
• At iteration $t$:
– E-step: $Z^{(t)} = E(Z \mid X, \theta^{(t)})$
– M-step: Find $\theta^{(t+1)}$ that maximizes $\Pr(X, Z^{(t)} \mid \theta^{(t+1)})$
E-step Details
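$$Z_{ij}^{(t)} = \frac{\Pr(X_i \mid Z_{ij} = 1, \theta^{(t)})}{\sum_{j'} \Pr(X_i \mid Z_{ij'} = 1, \theta^{(t)})}$$
[Figure: sequence $X_i$ with a motif window starting at position $j$. Use $\theta_1^{(t)}, \theta_2^{(t)}, \ldots, \theta_k^{(t)}$ inside the window and $\theta_0^{(t)}$ everywhere else.]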
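The E-step for one sequence can be sketched in Python as follows. This is an illustration under stated assumptions, not MEME's actual implementation: `e_step_row` is a hypothetical name, `theta` maps each residue to its list of k motif-position probabilities, and `theta0` maps each residue to a strictly positive background probability. The sketch uses the fact that the background term for the whole sequence is common to every start position, so only the ratio of motif to background probabilities inside the window needs to be computed before normalizing.

```python
def e_step_row(seq, theta, theta0):
    """Return [Z_i1, Z_i2, ...] for one sequence: the probability of each motif start."""
    k = len(next(iter(theta.values())))
    weights = []
    for j in range(len(seq) - k + 1):
        # Likelihood of the sequence with the motif at j, divided by the
        # all-background likelihood (a factor shared by every j).
        w = 1.0
        for p in range(k):
            r = seq[j + p]
            w *= theta[r][p] / theta0[r]
        weights.append(w)
    total = sum(weights)
    if total == 0.0:
        raise ValueError("no window has nonzero probability under theta")
    return [w / total for w in weights]
```

With a profile that contains zeros, any window that disagrees with a zero entry gets weight 0, which is one reason implementations commonly smooth the profile with pseudocounts.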
M-step Details
• If $Z_{ij}^{(t)} \in \{0, 1\}$ it would be straightforward: calculate profile $\theta_1, \theta_2, \ldots, \theta_k$ from motif instances and $\theta_{r0}$ from frequency of $r$ outside of motif instances.
• But $Z_{ij}^{(t)} \in [0, 1]$, so weight these frequencies by the appropriate values of $Z_{ij}^{(t)}$.
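The weighted frequency counts described here can be sketched in Python. The function name `m_step` and the data layout (one list of weights per sequence, one weight per possible start position) are assumptions made for illustration; pseudocounts, which implementations commonly add, are left out to keep the sketch short.

```python
def m_step(seqs, Z, k, alphabet="ACGT"):
    """Re-estimate the motif profile and background from sequences and weights Z[i][j]."""
    motif = {r: [0.0] * k for r in alphabet}   # expected residue counts per motif position
    total = {r: 0.0 for r in alphabet}         # raw residue counts over all sequences
    for seq, z_row in zip(seqs, Z):
        for r in seq:
            total[r] += 1.0
        for j, z in enumerate(z_row):
            # The window starting at j contributes to the profile with weight Z_ij.
            for p in range(k):
                motif[seq[j + p]][p] += z
    theta = {r: [0.0] * k for r in alphabet}
    for p in range(k):
        column_sum = sum(motif[r][p] for r in alphabet)
        for r in alphabet:
            theta[r][p] = motif[r][p] / column_sum
    # Background: expected count of each residue outside the motif instances.
    background = {r: total[r] - sum(motif[r]) for r in alphabet}
    bg_sum = sum(background.values())
    theta0 = {r: background[r] / bg_sum for r in alphabet}
    return theta, theta0
```

When every row of Z is a 0/1 indicator, this reduces to the straightforward case in the first bullet: plain frequencies computed from the chosen instances.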
Gibbs Sampler
• Lawrence et al., 1993
• Very general iterative method, related to Markov Chain Monte Carlo (MCMC)
• Available at bayesweb.wadsworth.org/gibbs/gibbs.html
One Iteration of Gibbs Sampler
• n motif instances each of length k
• Remove one at random
• Form profile of remaining n-1
• Let $p_i$ be the probability with which g[i .. i+k-1] fits profile
• Choose to start replacement at i with probability proportional to $p_i$
GGGTCACGGGGTGGGAGCTGAGAAGGGGTGGAG
CACGGGGGAGCCTGGAGGGGATCCGGAGGGGTG
GGCCGTGGGGAACCTGGGGGGAGCTGGGCTCAG
GGAGCGTGGAGGTGGGGTGGGAGCTGAGGGTGG
GGCTGGGGTGGCGGTGGGAGCCCAGGACGTTG
[In the figure, a candidate start position i is marked on the sequence.]
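The steps of one iteration can be sketched in Python. This is an illustrative sketch rather than the published sampler: `gibbs_iteration` is a hypothetical name, the layout (one current motif start per sequence) is an assumption, and the pseudocount added to every profile cell is an assumption introduced so that no candidate position gets probability zero.

```python
import random

def gibbs_iteration(seqs, starts, k, rng, alphabet="ACGT", pseudocount=1.0):
    """Resample the motif start in one randomly chosen sequence; return the new starts."""
    n = len(seqs)
    removed = rng.randrange(n)                      # remove one instance at random
    # Form the profile of the remaining n-1 instances (with pseudocounts).
    counts = [{r: pseudocount for r in alphabet} for _ in range(k)]
    for m, (seq, start) in enumerate(zip(seqs, starts)):
        if m == removed:
            continue
        for p in range(k):
            counts[p][seq[start + p]] += 1.0
    denom = (n - 1) + pseudocount * len(alphabet)
    # p_i: how well g[i .. i+k-1] fits the profile, for every start i in the removed sequence g.
    g = seqs[removed]
    fit = []
    for i in range(len(g) - k + 1):
        w = 1.0
        for p in range(k):
            w *= counts[p][g[i + p]] / denom
        fit.append(w)
    # Choose the replacement start with probability proportional to p_i.
    new_start = rng.choices(range(len(fit)), weights=fit)[0]
    new_starts = list(starts)
    new_starts[removed] = new_start
    return new_starts
```

Passing an explicit `random.Random` instance as `rng` keeps runs reproducible; repeated calls play the role of the iterations of the sampler.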
FootPrinter
• Blanchette & Tompa, 2002
• First algorithm explicitly designed for phylogenetic footprinting
• Available at bio.cs.washington.edu/software.html
Phylogenetic Footprinting (Tagle et al. 1988)
Functional regions of DNA evolve more slowly than nonfunctional ones.
• Consider a set of orthologous (i.e., corresponding) sequences from different species
• Identify unusually well conserved substrings (i.e., ones that have not changed much over the course of evolution)
CLUSTALW multiple sequence alignment (rbcS gene)
Cotton     ACGGTT-TCCATTGGATGA---AATGAGATAAGAT---CACTGTGC---TTCTTCCACGTG--GCAGGTTGCCAAAGATA-------AGGCTTTACCATT
Pea        GTTTTT-TCAGTTAGCTTA---GTGGGCATCTTA----CACGTGGC---ATTATTATCCTA--TT-GGTGGCTAATGATA-------AGG--TTAGCACA
Tobacco    TAGGAT-GAGATAAGATTA---CTGAGGTGCTTTA---CACGTGGC---ACCTCCATTGTG--GT-GACTTAAATGAAGA-------ATGGCTTAGCACC
Ice-plant  TCCCAT-ACATTGACATAT---ATGGCCCGCCTGCGGCAACAAAAA---AACTAAAGGATA--GCTAGTTGCTACTACAATTC--CCATAACTCACCACC
Turnip     ATTCAT-ATAAATAGAAGG---TCCGCGAACATTG--AAATGTAGATCATGCGTCAGAATT--GTCCTCTCTTAATAGGA-------A-------GGAGC
Wheat      TATGAT-AAAATGAAATAT---TTTGCCCAGCCA-----ACTCAGTCGCATCCTCGGACAA--TTTGTTATCAAGGAACTCAC--CCAAAAACAAGCAAA
Duckweed   TCGGAT-GGGGGGGCATGAACACTTGCAATCATT-----TCATGACTCATTTCTGAACATGT-GCCCTTGGCAACGTGTAGACTGCCAACATTAATTAAA
Larch      TAACAT-ATGATATAACAC---CGGGCACACATTCCTAAACAAAGAGTGATTTCAAATATATCGTTAATTACGACTAACAAAA--TGAAAGTACAAGACC

Cotton     CAAGAAAAGTTTCCACCCTC------TTTGTGGTCATAATG-GTT-GTAATGTC-ATCTGATTT----AGGATCCAACGTCACCCTTTCTCCCA-----A
Pea        C---AAAACTTTTCAATCT-------TGTGTGGTTAATATG-ACT-GCAAAGTTTATCATTTTC----ACAATCCAACAA-ACTGGTTCT---------A
Tobacco    AAAAATAATTTTCCAACCTTT---CATGTGTGGATATTAAG-ATTTGTATAATGTATCAAGAACC-ACATAATCCAATGGTTAGCTTTATTCCAAGATGA
Ice-plant  ATCACACATTCTTCCATTTCATCCCCTTTTTCTTGGATGAG-ATAAGATATGGGTTCCTGCCAC----GTGGCACCATACCATGGTTTGTTA-ACGATAA
Turnip     CAAAAGCATTGGCTCAAGTTG-----AGACGAGTAACCATACACATTCATACGTTTTCTTACAAG-ATAAGATAAGATAATGTTATTTCT---------A
Wheat      GCTAGAAAAAGGTTGTGTGGCAGCCACCTAATGACATGAAGGACT-GAAATTTCCAGCACACACA-A-TGTATCCGACGGCAATGCTTCTTC--------
Duckweed   ATATAATATTAGAAAAAAATC-----TCCCATAGTATTTAGTATTTACCAAAAGTCACACGACCA-CTAGACTCCAATTTACCCAAATCACTAACCAATT
Larch      TTCTCGTATAAGGCCACCA-------TTGGTAGACACGTAGTATGCTAAATATGCACCACACACA-CTATCAGATATGGTAGTGGGATCTG--ACGGTCA

Cotton     ACCAATCTCT---AAATGTT----GTGAGCT---TAG-GCCAAATTT-TATGACTATA--TAT----AGGGGATTGCACC----AAGGCAGTG-ACACTA
Pea        GGCAGTGGCC---AACTAC--------------------CACAATTT-TAAGACCATAA-TAT----TGGAAATAGAA------AAATCAAT--ACATTA
Tobacco    GGGGGTTGTT---GATTTTT----GTCCGTTAGATAT-GCGAAATATGTAAAACCTTAT-CAT----TATATATAGAG------TGGTGGGCA-ACGATG
Ice-plant  GGCTCTTAATCAAAAGTTTTAGGTGTGAATTTAGTTT-GATGAGTTTTAAGGTCCTTAT-TATA---TATAGGAAGGGGG----TGCTATGGA-GCAAGG
Turnip     CACCTTTCTTTAATCCTGTGGCAGTTAACGACGATATCATGAAATCTTGATCCTTCGAT-CATTAGGGCTTCATACCTCT----TGCGCTTCTCACTATA
Wheat      CACTGATCCGGAGAAGATAAGGAAACGAGGCAACCAGCGAACGTGAGCCATCCCAACCA-CATCTGTACCAAAGAAACGG----GGCTATATATACCGTG
Duckweed   TTAGGTTGAATGGAAAATAG---AACGCAATAATGTCCGACATATTTCCTATATTTCCG-TTTTTCGAGAGAAGGCCTGTGTACCGATAAGGATGTAATC
Larch      CGCTTCTCCTCTGGAGTTATCCGATTGTAATCCTTGCAGTCCAATTTCTCTGGTCTGGC-CCA----ACCTTAGAGATTG----GGGCTTATA-TCTATA

Cotton     T-TAAGGGATCAGTGAGAC-TCTTTTGTATAACTGTAGCAT--ATAGTAC
Pea        TATAAAGCAAGTTTTAGTA-CAAGCTTTGCAATTCAACCAC--A-AGAAC
Tobacco    CATAGACCATCTTGGAAGT-TTAAAGGGAAAAAAGGAAAAG--GGAGAAA
Ice-plant  TCCTCATCAAAAGGGAAGTGTTTTTTCTCTAACTATATTACTAAGAGTAC
Larch      TCTTCTTCACAC---AATCCATTTGTGTAGAGCCGCTGGAAGGTAAATCA
Turnip     TATAGATAACCA---AAGCAATAGACAGACAAGTAAGTTAAG-AGAAAAG
Wheat      GTGACCCGGCAATGGGGTCCTCAACTGTAGCCGGCATCCTCCTCTCCTCC
Duckweed   CATGGGGCGACG---CAGTGTGTGGAGGAGCAGGCTCAGTCTCCTTCTCG
FootPrinter
• Inputs:
– evolutionary tree T
– corresponding regulatory regions at leaves
• Output: motifs well conserved w.r.t. T.
Finding Short Motifs
AGTCGTACGTGAC... (Human)
AGTAGACGTGCCG... (Chimp)
ACGTGAGATACGT... (Rabbit)
GAACGGAGTACGT... (Mouse)
TCGTGACGGTGAT... (Rat)
Size of motif sought: k = 4
Most Parsimonious Solution
“Parsimony score”: 1 mutation
AGTCGTACGTGAC...
AGTAGACGTGCCG...
ACGTGAGATACGT...
GAACGGAGTACGT...
TCGTGACGGTGAT...
[Figure: tree over the five sequences; the labels shown on the tree are ACGG, ACGT, ACGT, ACGT]
Substring Parsimony Problem
Given:
• phylogenetic tree T,
• set of orthologous sequences at leaves of T,
• length k of motif,
• threshold d
Problem:
• Find each set S of k-mers, one k-mer from each leaf, such that the parsimony score of S in T is at most d.
This problem is NP-hard.
FootPrinter’s Exact Algorithm
(with Mathieu Blanchette, generalizing Sankoff and Rousseau 1975)
$W_u[s]$ = best parsimony score for subtree rooted at node $u$, if $u$ is labeled with string $s$.
AGTCGTACGTG
ACGGGACGTGC
ACGTGAGATAC
GAACGGAGTAC
TCGTGACGGTG
[Figure: the tree over these sequences, with a table of $W_u[s]$ values ($4^k$ entries) at each node. Entries shown:]
… ACGG: 2 ACGT: 1 …
… ACGG: 0 ACGT: 2 …
… ACGG: 1 ACGT: 1 …
… ACGG: +∞ ACGT: 0 …
… ACGG: 1 ACGT: 0 …
… ACGG: 0 ACGT: +∞ …
… ACGG: ∞ ACGT: 0 …
… ACGG: ∞ ACGT: 0 …
… ACGG: ∞ ACGT: 0 …
$$W_u[s] = \sum_{v:\ \text{child of } u} \min_{t} \left( W_v[t] + d(s, t) \right)$$
Running Time
$n$ = number of species
$l$ = average sequence length
$k$ = motif length
Total time $O(n k (4^k + l))$
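The recurrence can be sketched directly in Python. This sketch is the straightforward version that compares every pair of k-mers at every tree edge, not FootPrinter's optimized algorithm. The nested-tuple tree encoding, the function names, and in particular the tree topology in the usage example are assumptions for illustration; the five sequences are the ones from the "Finding Short Motifs" slide, and d(s, t) is taken to be the number of mismatches (Hamming distance).

```python
from itertools import product

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def parsimony_table(node, leaf_seqs, k, all_kmers):
    """Return {s: W_u[s]} for the subtree rooted at `node`.

    A leaf is a species name (str); an internal node is a tuple of child nodes.
    """
    if isinstance(node, str):
        seq = leaf_seqs[node]
        present = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        # A leaf can only be labeled with a k-mer that occurs in its sequence.
        return {s: (0 if s in present else float("inf")) for s in all_kmers}
    child_tables = []
    for child in node:
        table = parsimony_table(child, leaf_seqs, k, all_kmers)
        # Infinite entries can never achieve the minimum, so drop them.
        child_tables.append({t: w for t, w in table.items() if w != float("inf")})
    return {
        s: sum(min(w + hamming(s, t) for t, w in table.items()) for table in child_tables)
        for s in all_kmers
    }

k = 4
all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]
leaf_seqs = {
    "Human": "AGTCGTACGTGAC",
    "Chimp": "AGTAGACGTGCCG",
    "Rabbit": "ACGTGAGATACGT",
    "Mouse": "GAACGGAGTACGT",
    "Rat": "TCGTGACGGTGAT",
}
tree = (("Human", "Chimp"), "Rabbit", ("Mouse", "Rat"))  # assumed topology
root = parsimony_table(tree, leaf_seqs, k, all_kmers)
best = min(root.values())
print(best, sorted(s for s, w in root.items() if w == best))
```

Each internal node compares all $4^k$ candidate labels against all entries of each child's table, which corresponds to the $4^{2k}$ term that the improved algorithm avoids.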
Improvements
• Better algorithm reduces time from $O(n k (4^{2k} + l))$ to $O(n k (4^k + l))$
• By restricting to motifs with parsimony score at most d, greatly reduce the number of table entries computed (exponential in d, polynomial in k)
• Amenable to many useful extensions (e.g., allow insertions and deletions)
Application to β-actin Gene
Gilthead sea bream (678 bp)
Medaka fish (1016 bp)
Common carp (696 bp)
Grass carp (917 bp)
Chicken (871 bp)
Human (646 bp)
Rabbit (636 bp)
Rat (966 bp)
Mouse (684 bp)
Hamster (1107 bp)
Common carp
ACGGACTGTTACCACTTCACGCCGACTCAACTGCGCAGAGAAAAACTTCAAACGACAACATTGGCATGGCTTTTGTTATTTTTGGCGCTTGACTCAGGATCTAAAAACTGGAACGGCGAAGGTGACGGCAATGTTTTGGCAAATAAGCATCCCCGAAGTTCTACAATGCATCTG
AGGACTCAATGTTTTTTTTTTTTTTTTTTCTTTAGTCATTCCAAATGTTTGTTAAATGCATTGTTCCGAAACTTATTTGCCTCTATGAAGGCTGCCCAGTAATTGGGAGCATACTTAACATTGTAGTATTGTATGTAAATTATGTAACAAAACAATGACTGGGTTTTTGTACTTTCAGCCTTAATCTTGGGTTTTTTTTTTTTTTTGGTTCCAAAAAACTAAGCTTTACCATTCAAGATGTAAAGGTTTCATTCCCCCTGGCATATTGAAAAAGCTGTGTGGAACGTGGCGGTGCA
GACATTTGGTGGGGCCAACCTGTACACTGACTAATTCAAATAAAAGTGCACATGTAAGACATCCTACTCTGTGTGATTTTTCTGTTTGTGCTGAGTGAACTTGCTATGAAGTCTTTTAGTGCACTCTTTAATAAAAGTAGTCTTCCCTTAAAGTGTCCCTTCCCTTATGGCCTTCACATTTCTCAACTAGCGCTTCAACTAGAAAGCACTTTAGGGACTGGGATGC
Chicken
ACCGGACTGTTACCAACACCCACACCCCTGTGATGAAACAAAACCCATAAATGCGCATAAAACAAGACGAGATTGGCATGGCTTTATTTG
TTTTTTCTTTTGGCGCTTGACTCAGGATTAAAAAACTGGAATGGTGAAGGTGTCAGCAGCAGTCTTAAAATGAAACATGTTGGA
GCGAACGCCCCCAAAGTTCTACAATGCATCTGAGGACTTTGATTGTACATTTGTTTCTTTTTTAATAGTCATTCCAAATATTGTTATAATGCATTGTTACAGGAAGTTACTCGCCTCTGTGAAGGCAACAGCCCAGCTGGGAGGAGCCGGTACCAATTACTGGTGTTAGATGATAATTGCTTGTCTGTAAATTATGTAACCCAACAAGTGTCTTTTTGTATCTTCCGCCTTAAAAACAAAACACACTTGATCCTTTTTGGTTTGTCAAGCAAGCGGGCTGTGTTCCCCAGTGA
TAGATGTGAATGAAGGCTTTACAGTCCCCCACAGTCTAGGAGTAAAGTGCCAGTATGTGGGGGAGGGAGGGGCTACCTGTACACTGACTTAAGACCAGTTCAAATAAAAGTGCACACAATAGAGGCTTGACTGGTGTTGGTTTTTATTTCTGTGCTGCGCTGCTTGGCCGTTGGTAGCTGTTCTCATCTAGCCTTGCCAGCCTGTGTGGGTCAGCTATCTGCATGGGCTGCGTGCTGGTGCTGTCTGGTGCAGAGGTTGGATAAACCGTGATGATATTTCAGCAAGTGGGAGTTGGCTCTGATTCCATCCTGAGCTGCCATCAGTGTGTTCTGAAGGAAGCTGTTGGATGAGGGTGGGCTGAGTGCTGGGGGACAGCTGGGCTCAGTGGGACTGCAGCTGTGCT
Human
GCGGACTATGACTTAGTTGCGTTACACCCTTTCTTGACAAAACCTAACTTGCGCAGAAAACAAGATGAGATTGGCATGGCTTTATTTGTTT
TTTTTGTTTTGTTTTGGTTTTTTTTTTTTTTTTGGCTTGACTCAGGATTTAAAAACTGGAACGGTGAAGGTGACAGCAGTCGGTT
GGAGCGAGCATCCCCCAAAGTTCACAATGTGGCCGAGGACTTTGATTGCATTGTTGTTTTTTTAATAGTCATTCCAAATATGAGATGCATTGTTACAGGAAGTCCCTTGCCATCCTAAAAGCCACCCCACTTCTCTCTAAGGAGAATGGCCCAGTCCTCTCCCAAGTCCACACAGGGGAGGTGATAGCATTGCTTTCGTGTAAATTATGTAATGCAAAATTTTTTTAATCTTCGCCTTAATACTTTTTTATTTTGTTTTATTTTGAATGATGAGCCTTCGTGCCCCCCCTTC
CCCCTTTTTGTCCCCCAACTTGAGATGTATGAAGGCTTTTGGTCTCCCTGGGAGTGGGTGGAGGCAGCCAGGGCTTACCTGTACACTGACTTGAGACCAGTTGAATAAAAGTGCACACCTTAAAAATGAGGCCAAGTGTGACTTTGTGGTGTGGCTGGGTTGGGGGCAGCAGAGGGTG
Parsimony score over 10 vertebrates: 0 1 2
Motifs Absent from Some Species
• Find motifs
– with small parsimony score
– that span a large part of the tree
• Example: in tree of 10 species spanning 760 Myrs, find all motifs with
– score 0 spanning at least 250 Myrs
– score 1 spanning at least 350 Myrs
– score 2 spanning at least 450 Myrs
– score 3 spanning at least 550 Myrs
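The example thresholds amount to a lookup from parsimony score to minimum required span. A small Python sketch follows; the names `MIN_SPAN_MYRS` and `keep_motif` are hypothetical, and the handling of scores above 3 (rejected outright) is an assumption, since the slide lists no threshold for them.

```python
# Minimum span (in Myrs) required for each parsimony score, taken from the example.
MIN_SPAN_MYRS = {0: 250, 1: 350, 2: 450, 3: 550}

def keep_motif(score, span_myrs, min_span=MIN_SPAN_MYRS):
    """True if a motif with this parsimony score spans enough of the tree to be reported."""
    return score in min_span and span_myrs >= min_span[score]
```

The design trades one criterion against the other: a motif that tolerates more mutations must be conserved across a larger part of the tree before it is reported.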
Application to c-fos Gene
Asked for motifs of length 10, with
• 0 mutations over tree of size 6
• 1 mutation over tree of size 11
• 2 mutations over tree of size 16
• 3 mutations over tree of size 21
• 4 mutations over tree of size 26
[Figure: phylogenetic tree of puffer fish, chicken, pig, mouse, hamster, and human; branch lengths shown: 10, 2, 7, 2, 2, 21, 0, 1, 1]
Found:
• 0 mutations over tree of size 8
• 1 mutation over tree of size 16
• 3 mutations over tree of size 21
• 4 mutations over tree of size 28
Application to c-fos Gene
Motif                 Score   Conserved in            Known?
CAGGTGCGAATGTTC       0       4 mammals
TTCCCGCCTCCCCTCCCC    0       4 mammals               yes
GAGTTGGCTGcagcc       3       puffer + 4 mammals
GTTCCCGTCAATCcct      1       chicken + 4 mammals     yes
CACAGGATGTcc          4       all 6                   yes
AGGACATCTG            1       chicken + 4 mammals     yes
GTCAGCAGGTTTCCACG     0       4 mammals               yes
TACTCCAACCGC          0       4 mammals
metK in B. subtilis
MicroFootPrinter
• Neph & Tompa, 2006
• Designed specifically for phylogenetic footprinting in prokaryotic genomes
• Front end to FootPrinter
• Available at bio.cs.washington.edu/software.html
Microbial Footprinting
• 1454 prokaryotes with genomes completely sequenced (as of 2/17/2011)
– For any prokaryotic gene of interest, plenty of close genes in other species available
– Relatively simple genomes
• MicroFootPrinter
– undergraduate Computational Biology Capstone project
– Goal: simple interface for microbiologists
– User specifies species and gene of interest
– Automates collection of orthologous genes, cis-regulatory sequences, gene tree, parameters
Demo
• MicroFootPrinter home
• Examples: Agrobacterium tumefaciens genes regulated by ChvI (with Eugene Nester)
– chvI (two component response regulator)
– ropB (outer membrane protein)
Sample chvI motif
Parsimony score: 2
Span: 41.10
Significance score: 4.22
Species             Position   Motif
B. henselae         -151       GCTACAATTT
R. etli             -90        GCCACAATTT
R. leguminosarum    -106       GCCACAATTT
S. meliloti         -119       GCCACAATTT
S. medicae          -118       GCCACAATTT
A. tumefaciens      -105       GCCACAATTT
M. loti             -80        GCCACATTTT
M. sp.              -87        GCCACATTTT
O. anthropi         -158       GCCACATTTT
B. suis             -38        GCCACATTTT
B. melitensis       -156       GCCACATTTT
B. abortus          -156       GCCACATTTT
B. ovis             -156       GCCACATTTT
B. canis            -38        GCCACATTTT
Sample ropB motif
Parsimony score: 1
Span: 20.70
Significance score: 1.34
Species             Position   Motif
Jannaschia sp.      -151       CACATTTTGG
R. etli             -134       CACAATTTGG
R. leguminosarum    -135       CACAATTTGG
A. tumefaciens      -131       CACATTTTGG
S. meliloti         -128       CACATTTTGG
S. medicae          -128       CACATTTTGG
Combined ChvI Motif
ropB: CACATTTTGG
chvI: GCCACAATTT
Atu1221: TTGTCACAAT
ultimate: GYCACAWTTTGG
Y={C,T}
W={A,T}
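How the three sites line up against the combined motif can be checked with a short Python sketch. The function names and the minimum overlap of 8 positions are assumptions chosen for illustration; only the two ambiguity codes defined on the slide (Y and W) are included. An offset is the position in the combined motif at which the site's first base sits, so a negative offset means the site overhangs on the left.

```python
# Ambiguity codes used on the slide: Y = {C,T}, W = {A,T}.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "Y": "CT", "W": "AT"}

def fits(site, consensus, offset):
    """True if `site` agrees with `consensus` at every overlapping position."""
    for i, base in enumerate(site):
        p = offset + i
        if 0 <= p < len(consensus) and base not in IUPAC[consensus[p]]:
            return False
    return True

def consistent_offsets(site, consensus, min_overlap=8):
    """All offsets with at least `min_overlap` overlapping positions and no conflict."""
    first = min_overlap - len(site)
    last = len(consensus) - min_overlap
    return [o for o in range(first, last + 1) if fits(site, consensus, o)]

consensus = "GYCACAWTTTGG"
for name, site in [("ropB", "CACATTTTGG"), ("chvI", "GCCACAATTT"), ("Atu1221", "TTGTCACAAT")]:
    print(name, consistent_offsets(site, consensus))
```

Under these assumptions each site is compatible with the combined motif at exactly one offset, which is what makes merging the three into a single consensus plausible.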