Discovery of transcription networks Lecture3 Nov 2012 Regulatory Genomics Weizmann Institute Prof....
Discovery of transcription networks
Lecture 3, Nov 2012. Regulatory Genomics, Weizmann Institute. Prof. Yitzhak Pilpel
[Figure: hierarchical clustering of expression profiles]
Promoter motifs and expression profiles
CGGCCCCGCGGA
CTCCTCCCCCCCTTC
TGGCCAATCA
ATGTACGGGTG
5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT
5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG
5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT
5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC
5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA
5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA
5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA
AlignACE Example
…HIS7 …ARO4…ILV6…THR4…ARO1…HOM2…PRO3
300-600 bp of upstream sequence per gene are searched in
Saccharomyces cerevisiae.
http://statgen.ncsu.edu/~dahlia/journalclub/S01/jmb1205.pdf
5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT
5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG
5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT
5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC
5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA
5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA
5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA
AAAAGAGTCA
AAATGACTCA
AAGTGAGTCA
AAAAGAGTCA
GGATGAGTCA
AAATGAGTCA
GAATGAGTCA
AAAAGAGTCA
**********
A cluster of genes may contain a common motif in their promoters
…HIS7
…ARO4
…ILV6
…THR4
…ARO1
…HOM2
…PRO3
Find a needle in a haystack
Computational Identification of Cis-regulatory Elements Associated with Groups of Functionally Related Genes in Saccharomyces cerevisiae
J.D. Hughes, P.W. Estep, S. Tavazoie, G.M. Church
Journal of Molecular Biology (2000)
Example
http://www.cifn.unam.mx/Computational_Genomics/old_research/BIOL2.html
GAL4 is one of the yeast genes required for growth on galactose.
1 2 3 4 5 6
A 0.8 0.4 1 0.6 0 1
C 0 0 0 0 0 0
G 0.2 0.6 0 0 1 0
T 0 0 0 0.4 0 0
Motif Representation
   1 2 3 4 5 6
G1 A G A A G A
G2 A A A T G A
G3 G A A T G A
G4 A G A A G A
G5 A G A A G A
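The frequency matrix on this slide can be reproduced from the five aligned sites G1 to G5. The following is a minimal sketch; the function name `frequency_matrix` is illustrative and not part of any tool mentioned in the lecture.

```python
from collections import Counter

# The five aligned sites G1..G5 from the slide.
sites = ["AGAAGA", "AAATGA", "GAATGA", "AGAAGA", "AGAAGA"]

def frequency_matrix(sites, alphabet="ACGT"):
    """Return {base: [frequency at each position]} for equal-length sites."""
    width = len(sites[0])
    if any(len(s) != width for s in sites):
        raise ValueError("all sites must have the same length")
    matrix = {base: [] for base in alphabet}
    for column in zip(*sites):          # iterate over alignment columns
        counts = Counter(column)
        for base in alphabet:
            matrix[base].append(counts[base] / len(sites))
    return matrix

pwm = frequency_matrix(sites)
for base, row in pwm.items():
    print(base, row)
```

Each printed row corresponds to one row (A, C, G, T) of the matrix shown above.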
Finding New Motif
• By lab work
• By comparison to known motifs in other species
• By searching upstream regions of a set of potentially co-regulated genes
The genes bound by the TF Abf1 can be clustered into several groups, some of which contain a motif:
NCGTNNNNARTGAT
CGATGAGMTK
NCGTNNNNARTGAT & CGATGAGMTK
(sporulation experiment)
Search Space
• Size of search space: $(L - W + 1)^N \approx L^N$
• L = 600, W = 15, N = 10: size $\approx 10^{27}$
• Exact search methods are not feasible
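A quick arithmetic check of the search-space size, reading L as the upstream sequence length, W as the motif width and N as the number of sequences (this reading is an assumption; the slide does not spell the symbols out):

```python
import math

L, W, N = 600, 15, 10                 # values from the slide

positions_per_sequence = L - W + 1    # possible start positions of a width-W site
search_space = positions_per_sequence ** N

print(f"{positions_per_sequence} start positions per sequence")
print(f"search space ~ 10^{math.log10(search_space):.1f}")
```

One site chosen per sequence already gives on the order of 10^27 alignments, which is why exhaustive enumeration is ruled out.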
5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT
5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG
5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT
5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC
5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA
5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA
5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA
…HIS7
…ARO4
…ILV6
…THR4
…ARO1
…HOM2
…PRO3
300-600 bp of upstream sequence per gene are searched in
Saccharomyces cerevisiae.
Based on slides from G. Church Computational Biology course at Harvard
AlignACE Example
Input Data Set
K-means
• Start with random positions of centroids.
• Assign data points to centroids.
• Move centroids to center of assigned points.
• Iterate till minimal cost.
Iteration = 3
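The four K-means steps listed above can be sketched as follows. This is a minimal illustration, not the implementation used in the lecture: it seeds the centroids with randomly chosen data points (one common way to get "random positions"), and the helper names `sq_dist` and `mean_point` are made up for the example.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(points, k, max_iterations=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)            # step 1: random start
    for _ in range(max_iterations):
        clusters = [[] for _ in range(k)]
        for p in points:                         # step 2: assign to nearest centroid
            nearest = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            clusters[nearest].append(p)
        updated = [mean_point(m) if m else centroids[i]   # step 3: move centroids
                   for i, m in enumerate(clusters)]
        if updated == centroids:                 # step 4: stop when nothing moves
            break
        centroids = updated
    # Cost: sum of squared distances of points to their cluster centroid.
    cost = sum(sq_dist(p, centroids[i]) for i, m in enumerate(clusters) for p in m)
    return centroids, clusters, cost

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, clusters, cost = kmeans(points, k=2)
print(centroids, cost)
```

K-means can settle in a local minimum, so in practice it is usually restarted from several random seeds and the lowest-cost run is kept.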
5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT
5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG
5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT
5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC
5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA
5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA
TGAAAAATTC
GACATCGAAA
GCACTTCGGC
GAGTCATTAC
GTAAATTGTC
CCACAGTCCG
TGTGAAGCAC
5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT
5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG
5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT
5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC
5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA
TGAAAAATTC
GACATCGAAA
GCACTTCGGC
GAGTCATTAC
GTAAATTGTC
CCACAGTCCG
TGTGAAGCAC
MAP score = -10.0
…HIS7
…ARO4
…ILV6
…THR4
…ARO1
…HOM2
…PRO3
AlignACE Example
Initial Seeding
5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT
5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG
5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT
5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC
5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA
5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA
5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA
TGAAAAATTC
GACATCGAAA
GCACTTCGGC
GAGTCATTAC
GTAAATTGTC
CCACAGTCCG
TGTGAAGCAC
Add?
TGAAAAATTC
GACATCGAAA
GCACTTCGGC
GAGTCATTAC
GTAAATTGTC
CCACAGTCCG
TGTGAAGCAC
TCTCTCTCCA
How much better is the alignment with this site as opposed to without?
…HIS7
…ARO4
…ILV6
…THR4
…ARO1
…HOM2
…PRO3
AlignACE Example
Sampling
5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT
5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG
5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT
5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC
5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA
5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA
5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA
TGAAAAATTC
GACATCGAAA
GCACTTCGGC
GAGTCATTAC
GTAAATTGTC
CCACAGTCCG
TGTGAAGCAC
Add?
TGAAAAATTC
GACATCGAAA
GCACTTCGGC
GAGTCATTAC
GTAAATTGTC
CCACAGTCCG
TGTGAAGCAC
How much better is the alignment with this site as opposed to without?
Remove.
TGAAAAAATG
…HIS7
…ARO4
…ILV6
…THR4
…ARO1
…HOM2
…PRO3
AlignACE Example
Sampling
5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT
5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG
5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT
5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC
5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA
5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA
5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA
GACATCGAAA
GCACTTCGGC
GAGTCATTAC
GTAAATTGTC
CCACAGTCCG
TGTGAAGCAC
GACATCGAAAC
GCACTTCGGCG
GAGTCATTACA
GTAAATTGTCA
CCACAGTCCGC
TGTGAAGCACA
How much better is the alignment with this new
column structure?
…HIS7
…ARO4
…ILV6
…THR4
…ARO1
…HOM2
…PRO3
AlignACE Example
Column Sampling
5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT
5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG
5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT
5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC
5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA
5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA
5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA
AAAAGAGTCA
AAATGACTCA
AAGTGAGTCA
AAAAGAGTCA
GGATGAGTCA
AAATGAGTCA
GAATGAGTCA
AAAAGAGTCA MAP score = 20.37
…HIS7
…ARO4
…ILV6
…THR4
…ARO1
…HOM2
…PRO3
AlignACE Example
The Best Motif
• MAP: maximum a posteriori log-likelihood score
• This is what the algorithm tries to optimize.
• Measures the degree of overrepresentation of the motif in the input sequence relative to expectation in a random sequence.
The MAP Score
MAP formula legend:
B, Γ = standard Beta & Gamma functions
N = number of aligned sites; T = number of total possible sites
F_jb = number of occurrences of base b at position j (Fsum)
G_b = background genomic frequency for base b
β_b = n × G_b for n pseudocounts (sum)
W = width of motif; C = number of columns in motif (W >= C)
The MAP Score
N = number of aligned sites
exp = expected number of sites in the input sequence under a random model
$\mathrm{MAP} \approx N \log\frac{N}{\mathrm{exp}}$
The MAP Score
$P = \left(\frac{1}{4}\right)^{7} \approx$ 1 site every 16,000 bases
For a 64,000-base sequence: exp = 4
Motif (width)   Genes (each with a 1,000 bp promoter)   Times found   Expected times   MAP score
AGGGTAA (7)     16                                      10            ~1               10
GTAGATG (7)     16                                      2             ~1               0.60206
CCGTGAG (7)     160                                     10            ~10              0
GATGTA (6)      16                                      2             ~4               -0.60206
AGGGTA (6)      16                                      10            4                4.089354
A (1)           16                                      2504          ~2500            1.73
AAAAAAA (7)     16                                      5             ~1.5             2.614394
GGGGGGG (7)     16                                      5             ~0.5             5
$\mathrm{MAP} \approx N \log\frac{N}{\mathrm{exp}}$
Some examples
Very intuitive: anything that is long, occurs many times, and differs from background will score highly.
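The scores in the examples table are consistent with the approximation N·log(N/exp) when the logarithm is taken in base 10. The base is an inference from the tabulated values, not something the slide states. A minimal sketch that recomputes several rows:

```python
import math

def approx_map(n_found, n_expected):
    """Approximate MAP score: N * log10(N / exp). Base 10 is an assumption."""
    return n_found * math.log10(n_found / n_expected)

# (motif, times found, expected times) taken from the table rows
examples = [
    ("AGGGTAA", 10, 1),
    ("GTAGATG", 2, 1),
    ("CCGTGAG", 10, 10),
    ("GATGTA", 2, 4),
    ("A", 2504, 2500),
    ("AAAAAAA", 5, 1.5),
    ("GGGGGGG", 5, 0.5),
]

for motif, found, expected in examples:
    print(f"{motif:8s} {approx_map(found, expected):9.5f}")
```

The AGGGTA row is left out because its tabulated score (4.089354) matches an expectation of about 3.9 rather than the rounded value 4 shown in the table.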
The MAP Score Properties
a) Motif should be “strong”
b) Input sequence can’t be too long
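$\mathrm{MAP} \approx N \log\frac{N}{\mathrm{exp}}$
$P = \left(\frac{1}{4}\right)^{7} \approx$ 1 site every 16,000 bases
Genome length ~12 Mb: $\mathrm{exp} = \frac{1}{16000} \cdot 2 \cdot 12 \cdot 10^{6} = 1500$
Motif needs more than 1500 sites to get a positive MAP score: $\mathrm{MAP} \approx N \log\frac{N}{\mathrm{exp}} = 1500 \log\frac{1500}{1500} = 0$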
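The whole-genome argument can be checked numerically. The sketch below uses the slide's rounded site probability (1 in 16,000) and assumes the factor of 2 stands for searching both DNA strands; `approx_map` is an illustrative helper using a base-10 logarithm.

```python
import math

site_probability = 1 / 16_000     # slide's rounding of (1/4)**7
genome_length = 12_000_000        # ~12 Mb
strands = 2                       # assumption: both strands are scanned

expected_sites = site_probability * strands * genome_length
print(f"expected sites in the genome: {expected_sites:.0f}")

def approx_map(n_found, n_expected):
    """Approximate MAP score N * log10(N / exp)."""
    return n_found * math.log10(n_found / n_expected)

for n in (100, 1500, 3000):
    print(n, round(approx_map(n, expected_sites), 2))
```

A motif found in 100 promoters scores strongly negative against the whole genome, which motivates restricting the search to clusters of co-expressed genes.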
Problem: most transcription factor binding sites will only occur in dozens to hundreds of genes
Solution: Cluster genes before searching for motifs
[Figure: genes plotted along three axes: Time-point 1, Time-point 2, Time-point 3]
Group Specificity Score:
How well does a motif target the genes used to find it, compared with the whole genome?
What is the probability of such a large intersection?
[Figure: Venn diagram. All Genome (N); Motif ORFs Group (S1); ORFs with best sites (S2); intersection X]
N = total # of ORFs in the genome (6226)
S1 = # ORFs used to align the motif
S2 = # targets in the genome (~100 ORFs with best ScanACE scores)
x = size of the intersection of S1 and S2
$\frac{\binom{S_1}{x}\binom{N-S_1}{S_2-x}}{\binom{N}{S_2}}$
Group Specificity Score:
How well does a motif target the genes used to find it, compared with the whole genome?
What is the probability of such a large intersection?
$S = \sum_{i=x}^{\min(S_1,S_2)} \frac{\binom{S_1}{i}\binom{N-S_1}{S_2-i}}{\binom{N}{S_2}}$
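The group specificity score is a hypergeometric tail probability and can be evaluated exactly with integer binomial coefficients. A minimal sketch; the function name and the example numbers (a group of 20 ORFs, 10 of them among the top 100) are illustrative only.

```python
from math import comb

def group_specificity(n_genome, s1, s2, x):
    """P(intersection >= x) when s2 ORFs are drawn at random from n_genome,
    of which s1 belong to the group used to align the motif."""
    total = comb(n_genome, s2)
    upper = min(s1, s2)
    tail = sum(comb(s1, i) * comb(n_genome - s1, s2 - i)
               for i in range(x, upper + 1))
    return tail / total

# Hypothetical example: 10 of the 100 best-scoring ORFs fall in a 20-ORF group.
print(group_specificity(6226, 20, 100, 10))
```

Smaller values mean the motif is more specific to the group it was derived from.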
Positional Bias Score:
Measures the degree of preference of positioning in a particular range upstream of the translational start.
[Figure: histogram of # ORFs (scale 1 to 10) by site position from Start to -600 bp, in 50 bp windows]
• Find best 200 sites in the genome
• Restrict sites to a segment of length [s = 600 bp] from the translation start
• t = # sites in the segment
• Choose window size [w = 50 bp]
• m = # sites in the most enriched window
Positional Bias Score:
What is the probability of having m or more sites in a window of size w?
$\binom{t}{m}\left(\frac{w}{s}\right)^{m}\left(1-\frac{w}{s}\right)^{t-m}$
Positional Bias Score:
What is the probability of having m or more sites in a window of size w?
$P = \sum_{i=m}^{t}\binom{t}{i}\left(\frac{w}{s}\right)^{i}\left(1-\frac{w}{s}\right)^{t-i}$
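The positional bias score is a binomial tail: each of the t sites lands in a given window with probability w/s. A minimal sketch with the slide's defaults (w = 50 bp, s = 600 bp); the example counts are made up.

```python
from math import comb

def positional_bias(t, m, w=50, s=600):
    """P(at least m of t sites fall in one window of width w within a segment of length s)."""
    p = w / s
    return sum(comb(t, i) * p ** i * (1 - p) ** (t - i) for i in range(m, t + 1))

# Hypothetical example: 40 sites in the 600 bp segment, 15 in the most enriched window.
print(positional_bias(t=40, m=15))
```

Note that this treats the most enriched window as if it had been fixed in advance; the sketch does not correct for having picked the best of several windows.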
Lecture Topics
• Introduction to DNA regulatory motifs
• AlignACE - A motif finding algorithm
• Assessment of motifs
• AlignACE results on yeast genome
• Summary & Conclusions
Comparisons of motifs
• The CompareACE program finds best alignment between two motifs and calculates the correlation between the two position-specific scoring matrices
• Similar motifs: CompareACE score > 0.7
Clustering motifs by similarity
Motifs: A, B, C, D
Pairwise CompareACE scores:
     A    B    C    D
A   1.0  0.9  0.1  0.0
B        1.0  0.2  0.1
C             1.0  0.8
D                  1.0
Hierarchical clustering of the CompareACE scores:
cluster 1: A, B
cluster 2: C, D
1 2 3 4 5 6
A 0.8 0.4 1 0.6 0 1
C 0 0 0 0 0 0
G 0.2 0.6 0 0 1 0
T 0 0 0 0.4 0 0
1 2 3 4 5 6
A 0.4 0.4 1 0.6 0 0
C 0 0 0 0 0 1
G 0.6 0.6 0 0 1 0
T 0 0 0 0.4 0 0
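To illustrate comparing two position-specific matrices by correlation, the sketch below computes a Pearson correlation between the two matrices shown above. It is not CompareACE itself: it skips the search for the best alignment (no shifts or reverse complements) and simply correlates the matrices position by position.

```python
from math import sqrt

motif_1 = {"A": [0.8, 0.4, 1, 0.6, 0, 1], "C": [0, 0, 0, 0, 0, 0],
           "G": [0.2, 0.6, 0, 0, 1, 0],   "T": [0, 0, 0, 0.4, 0, 0]}
motif_2 = {"A": [0.4, 0.4, 1, 0.6, 0, 0], "C": [0, 0, 0, 0, 0, 1],
           "G": [0.6, 0.6, 0, 0, 1, 0],   "T": [0, 0, 0, 0.4, 0, 0]}

def flatten(matrix):
    """Concatenate the A, C, G, T rows into one vector."""
    return [value for base in "ACGT" for value in matrix[base]]

def pearson(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

similarity = pearson(flatten(motif_1), flatten(motif_2))
print(round(similarity, 3))
```

The two matrices differ mainly at position 6 (A versus C), which is enough to pull this unshifted correlation below the 0.7 similarity threshold quoted for CompareACE.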
Most Group Specific Motifs
Most Positional Biased Motifs
Negative Controls
• 250 AlignACE runs on randomly created groups of ORFs, of size 20, 40, 60, 80, and 100 ORFs.
[Figure: MAP score distributions for random vs. real groups]
Negative Controls
MAP cutoff of 10, Group Specificity cutoff of $10^{-10}$:
False Positives = 10-20%
Positive Controls
• 29 listed TFs with five or more known binding sites were chosen.
• AlignACE was run on the upstream regions of the corresponding regulated genes.
• An appropriate motif was found in 21/29 cases.
• False negative rate = ~10-30%
The data
• Organism: Saccharomyces cerevisiae
• Microarray experiment: Affymetrix microarrays of 6,220 mRNAs
• Data: gathered by Cho et al.
• 15 time points, spanning about 4 hours across two cell cycles
• Genome sequence
Typical clusters of genes in the data
Variance normalization and clustering of expression time series
• The 3,000 most variable ORFs were chosen, based on the normalized dispersion in expression level of each gene across the time points (s.d./mean).
• The 15 time points were used to construct a 3,000 by 15 data matrix.
• The variance of each gene was normalized across the 15 conditions by subtracting the mean across the time points from the expression level of each gene and dividing by the standard deviation across the time points.
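The per-gene normalization described above (subtract the mean, divide by the standard deviation) can be sketched as below. Whether the lecture used the population or the sample standard deviation is not stated; the sketch assumes the population form.

```python
from statistics import mean, pstdev

def normalize_profile(profile):
    """Mean-variance normalize one gene's expression profile across time points."""
    mu = mean(profile)
    sigma = pstdev(profile)           # population s.d. (assumption)
    if sigma == 0:
        raise ValueError("constant profile cannot be normalized")
    return [(value - mu) / sigma for value in profile]

# Two made-up genes with the same shape but different scale.
gene_a = [1.0, 2.0, 3.0, 2.0, 1.0]
gene_b = [10.0, 20.0, 30.0, 20.0, 10.0]

norm_a = normalize_profile(gene_a)
norm_b = normalize_profile(gene_b)
print(norm_a)
print(norm_b)
```

After normalization the two profiles coincide, so clustering groups genes by the shape of their time course rather than by absolute expression level.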
Before and after mean-variance normalization
[Figure: two panels for Gene1, Gene2 and Gene3. Before normalization: Gene Expression vs. Time. After normalization: Normalized Expression vs. Time]
Representation of expression data
[Figure: normalized expression data from microarrays; Gene 1 and Gene 2 shown against the Time-point 1 axis, compared by Euclidean distance]
K-means
• Start with random positions of centroids.
[Figure legend: position of data point Xi; position of data centroid C. Iteration = 0]
Choosing K
Since we don’t know the number of clusters in advance we need a way to estimate it.
To choose the number of clusters K, the sum of squared errors is calculated for different K values. A clear break point indicates the “natural” number of clusters in the data.
[Figure: sum of squared errors vs. K]
Significant enrichment of functional categories within clusters
• Each gene was mapped into one of 199 functional categories (according to the MIPS database).
• For each cluster, P-values were calculated for observing the frequencies of genes from particular functional categories.
• There was significant grouping of genes within the same cluster.
The hyper-geometric score
P-values were calculated for finding at least (k) ORFs from a particular functional category within a cluster of size (n):
$P = 1 - \sum_{i=0}^{k-1} \frac{\binom{f}{i}\binom{g-f}{n-i}}{\binom{g}{n}}$
where (f) is the total number of genes within a functional category and (g) is the total number of genes within the genome (6,220).
P-values greater than 3×10^-4 are not reported, as their total expectation within the cluster would be higher than 0.05, given that 199 MIPS categories were tested (ref. 15).
Challenge: generalize hyper-geometric for more than two sets
[Figure: three overlapping sets: Chr V, expression cluster, functional group]
Sequence: MCB element consensus
This motif was later mapped to the literature and confirmed to be the well-known MCB element, which is known to control the periodicity of the genes that peak at G1-S.
[Figure: MCB element across clusters; axis: nucleotides]
The existence of the motif in all ORFs of each cluster
Location of the motif: MCB element
[Figure: motif location; axis: distance from ATG (bp)]
SCB element
This motif (later found to be the SCB element) was the second scoring motif within this cluster. The SCB element is also a very well-known cis-regulatory element which contributes to the periodicity of the genes within the G1-S regulon.
ribonucleotide reductase
Determining the cell-cycle periodicity of clusters
Fourier analysis allows ranking the genes according to their cell-cycle periodicity.
[Figure: expression matrix; expression vs. time for genes with high cell-cycle periodicity and with low periodicity]
Explain FFT… (including ORs variability)
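One way to rank profiles by cell-cycle periodicity is to project each mean-centered profile onto a sine and a cosine at the cell-cycle frequency and compare the resulting power with the total variance. The sketch below is an illustration of that idea, not the lecture's procedure; the period of 7.5 time points is an assumption derived from "15 time points across two cell cycles".

```python
import math

def periodicity_score(profile, period):
    """Power at one frequency relative to total variance (1.0 for a pure sinusoid
    that completes a whole number of cycles)."""
    n = len(profile)
    mu = sum(profile) / n
    centered = [value - mu for value in profile]
    total = sum(value * value for value in centered)
    if total == 0:
        return 0.0
    omega = 2 * math.pi / period
    a = sum(v * math.cos(omega * t) for t, v in enumerate(centered))
    b = sum(v * math.sin(omega * t) for t, v in enumerate(centered))
    return 2 * (a * a + b * b) / (n * total)

period = 7.5                                             # assumed: 15 points / 2 cycles
periodic_gene = [math.sin(2 * math.pi * t / period) for t in range(15)]
drifting_gene = [float(t) for t in range(15)]            # steady increase, no cycle

periodic_score = periodicity_score(periodic_gene, period)
drifting_score = periodicity_score(drifting_gene, period)
print(periodic_score, drifting_score)
```

Sorting genes or cluster mean profiles by this score puts strongly periodic ones first.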
Periodic clusters
Non periodic clusters
And this was just the beginning…
Logic of interaction of motifs
In case of two motifs derived from a cluster:
• Collaboration?
• Co-occurrence (AND)
• Redundancy (OR)
http://longitude.weizmann.ac.il/publications/PilpelNatGent01.pdf
[Figure: expression level vs. time in three panels: Only M1 (EC=0.05), Only M2 (EC=0.05), M1 AND M2 (EC=0.23)]
Synergistic motifs
A combination of two motifs is called ‘synergistic’ if the expression coherence score of the genes that have the two motifs is significantly higher than the scores of the genes that have either of the motifs.
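The slides do not define the expression coherence (EC) score. The sketch below assumes one common definition: the fraction of gene pairs in a set whose normalized expression profiles lie within a distance threshold of each other. Both the definition and the threshold are assumptions here, and the profiles are made up; testing whether a difference in EC is significant is not shown.

```python
from itertools import combinations
from math import dist

def expression_coherence(profiles, threshold):
    """Fraction of profile pairs closer than `threshold` (Euclidean distance)."""
    pairs = list(combinations(profiles, 2))
    if not pairs:
        return 0.0
    close = sum(1 for a, b in pairs if dist(a, b) <= threshold)
    return close / len(pairs)

# Made-up normalized profiles over three time points.
genes_with_both_motifs = [(1.0, 0.0, -1.0), (1.1, 0.0, -1.1), (0.9, 0.1, -1.0)]
genes_with_one_motif = [(1.0, 0.0, -1.0), (-1.0, 0.0, 1.0), (0.0, 1.0, -1.0)]

ec_both = expression_coherence(genes_with_both_motifs, threshold=0.5)
ec_one = expression_coherence(genes_with_one_motif, threshold=0.5)
print(ec_both, ec_one)
```

In this toy data the genes carrying both motifs are tightly co-expressed (EC = 1.0) while the single-motif set is not (EC = 0.0), which is the pattern the synergy definition looks for.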
SFFMcm1
A global map of combinatorial expression control
[Figure: network map of motif combinations. Motifs: mRPE72, SWI5, SFF', MCM1, SFFMCM1', ECB, SCB, MCB, PAC, mRRPE, mRRSE3, GCN4, BAS1, LYS14, RAP1, mRPE34, mRPE57, mRPE6, mRPE58, STRE, RPN4, ABF1, PDR, CCA, PHO4, AFT1, STE12, MIG1, CSRE, HAP234, ALPHA1', ALPHA1, ALPHA2, mRPE8, mRPE69. Conditions: heat shock, cell cycle, sporulation, diauxic shift, MAPK signaling, DNA damage]
* High connectivity
* Hubs
* Alternative partners in various conditions
Pilpel et al. Nature Genetics 2001
The human cell cycle
G1-Phase S-Phase
G2-Phase M-Phase
The proliferation cluster genes are cell cycle periodic
[Figure: gene expression across samples for the proliferation cluster genes; groups labeled G2/M, G1/S and CHR]
[Figure: proportion of all genes vs. proliferation genes carrying the NFY, E2F, ELK1, CDE and CHR motifs, by position (200 to 0 bp) relative to the TSS]
The cell cycle motifs are enriched among the proliferation cluster genes
Not in the cluster, mutated in cancer
Regulation of the proliferation cluster: significant motifs
Motif   P-value        Sequence logo
NFY     3.74×10^-11    [logo]
CDE     5.31×10^-10    [logo]
E2F     2.37×10^-9     [logo]
ELK1    3.10×10^-6     [logo]
CHR     1.42×10^-5     [logo]
1,000 bp upstream; 326 MatInspector motifs
Potential regulatory motifs in 3’ UTRs
Finding 3’ UTRs elements associated with high/low transcript stability (in yeast)
AAGCTTCC CCTACAAC
Entire genome