Page 1: Discovery of transcription networks Lecture3 Nov 2012 Regulatory Genomics Weizmann Institute Prof. Yitzhak Pilpel.

Discovery of transcription networks

Lecture 3, Nov 2012, Regulatory Genomics, Weizmann Institute, Prof. Yitzhak Pilpel

Page 2: Discovery of transcription networks Lecture3 Nov 2012 Regulatory Genomics Weizmann Institute Prof. Yitzhak Pilpel.

Hierarchical clustering

Page 3: Discovery of transcription networks Lecture3 Nov 2012 Regulatory Genomics Weizmann Institute Prof. Yitzhak Pilpel.

Promoter motifs and expression profiles

CGGCCCCGCGGA

CTCCTCCCCCCCTTC

TGGCCAATCA

ATGTACGGGTG

Page 4: Discovery of transcription networks Lecture3 Nov 2012 Regulatory Genomics Weizmann Institute Prof. Yitzhak Pilpel.

5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT

5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG

5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT

5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC

5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA

5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA

5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA

AlignACE Example

…HIS7 …ARO4 …ILV6 …THR4 …ARO1 …HOM2 …PRO3

300-600 bp of upstream sequence per gene are searched in Saccharomyces cerevisiae.

http://statgen.ncsu.edu/~dahlia/journalclub/S01/jmb1205.pdf

Page 5: Discovery of transcription networks Lecture3 Nov 2012 Regulatory Genomics Weizmann Institute Prof. Yitzhak Pilpel.

5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT

5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG

5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT

5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC

5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA

5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA

5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA

AAAAGAGTCA

AAATGACTCA

AAGTGAGTCA

AAAAGAGTCA

GGATGAGTCA

AAATGAGTCA

GAATGAGTCA

AAAAGAGTCA

**********

A cluster of genes may contain a common motif in their promoters

…HIS7

…ARO4

…ILV6

…THR4

…ARO1

…HOM2

…PRO3

Find a needle in a haystack

Page 6: Discovery of transcription networks Lecture3 Nov 2012 Regulatory Genomics Weizmann Institute Prof. Yitzhak Pilpel.

Computational Identification of Cis-regulatory Elements Associated with Groups of Functionally Related Genes in Saccharomyces cerevisiae

J.D. Hughes, P.W. Estep, S. Tavazoie, G.M. Church

Journal of Molecular Biology (2000)

Page 7: Discovery of transcription networks Lecture3 Nov 2012 Regulatory Genomics Weizmann Institute Prof. Yitzhak Pilpel.

Example

http://www.cifn.unam.mx/Computational_Genomics/old_research/BIOL2.html

GAL4 is one of the yeast genes required for growth on galactose.

Motif Representation

Aligned sites:

G1  A G A A G A
G2  A A A T G A
G3  G A A T G A
G4  A G A A G A
G5  A G A A G A

Frequency of each base at each position:

     1    2    3    4    5    6
A   0.8  0.4  1    0.6  0    1
C   0    0    0    0    0    0
G   0.2  0.6  0    0    1    0
T   0    0    0    0.4  0    0
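The frequency matrix on this slide can be recomputed from the five aligned sites G1 to G5. The following is a minimal sketch; the function name `frequency_matrix` is an illustrative choice, not something taken from the slides.

```python
from collections import Counter

# The five aligned sites G1..G5 shown on the slide.
sites = ["AGAAGA", "AAATGA", "GAATGA", "AGAAGA", "AGAAGA"]

def frequency_matrix(aligned_sites, alphabet="ACGT"):
    """Return {base: [frequency at each column]} for equal-length sites."""
    width = len(aligned_sites[0])
    n = len(aligned_sites)
    matrix = {base: [] for base in alphabet}
    for column in range(width):
        # Count how often each base occurs in this column of the alignment.
        counts = Counter(site[column] for site in aligned_sites)
        for base in alphabet:
            matrix[base].append(counts[base] / n)
    return matrix

pfm = frequency_matrix(sites)
for base in "ACGT":
    print(base, pfm[base])
```

Each column sums to 1, so the matrix can be read as the probability of seeing each base at each motif position.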

Page 8: Discovery of transcription networks Lecture3 Nov 2012 Regulatory Genomics Weizmann Institute Prof. Yitzhak Pilpel.

Finding New Motifs

• By lab work

• By comparison to known motifs in other species

• By searching upstream regions of a set of potentially co-regulated genes

Page 9: Discovery of transcription networks Lecture3 Nov 2012 Regulatory Genomics Weizmann Institute Prof. Yitzhak Pilpel.

The genes bound by the TF Abf1 can be clustered into several groups, some of which contain a motif

NCGTNNNNARTGAT

CGATGAGMTK

NCGTNNNNARTGAT & CGATGAGMTK

(sporulation experiment)
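The two consensus strings above use IUPAC degenerate nucleotide codes (N = any base, R = A or G, M = A or C, K = G or T). A minimal sketch of how such a consensus could be scanned against a promoter sequence follows. The helper names and the short test sequences are hypothetical, and only the forward strand is searched.

```python
import re

# Standard IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}

def iupac_to_regex(consensus):
    """Translate an IUPAC consensus string into a regex pattern string."""
    return "".join(IUPAC[symbol] for symbol in consensus)

def find_sites(consensus, sequence):
    """Return 0-based start positions of all (possibly overlapping) matches."""
    # The lookahead lets overlapping matches be reported.
    pattern = re.compile("(?=(" + iupac_to_regex(consensus) + "))")
    return [match.start() for match in pattern.finditer(sequence)]

# Hypothetical sequences, made up for illustration only.
print(find_sites("CGATGAGMTK", "TTCGATGAGCTGAA"))
print(find_sites("NCGTNNNNARTGAT", "ACGTAAAAAATGATC"))
```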

Page 10: Discovery of transcription networks Lecture3 Nov 2012 Regulatory Genomics Weizmann Institute Prof. Yitzhak Pilpel.

Search Space

• Size of search space: $(L - W + 1)^N \approx L^N$

• L = 600, W = 15, N = 10: size $\approx 10^{27}$

• Exact search methods are not feasible
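The size quoted above can be checked directly. This sketch assumes exactly one site of width W is chosen in each of the N sequences of length L, which is the reading under which the numbers on the slide come out at the stated order of magnitude.

```python
# Parameters from the slide.
L, W, N = 600, 15, 10

# A site of width W can start at L - W + 1 positions in one sequence.
positions_per_sequence = L - W + 1

# Choosing one start position independently in each of the N sequences.
search_space = positions_per_sequence ** N

print(f"{search_space:.2e}")
```

The result lies between 10^27 and 10^28, which is why exhaustive enumeration is ruled out and a sampling strategy is used instead.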

Page 11: Discovery of transcription networks Lecture3 Nov 2012 Regulatory Genomics Weizmann Institute Prof. Yitzhak Pilpel.

5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT

5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG

5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT

5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC

5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA

5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA

5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA

…HIS7

…ARO4

…ILV6

…THR4

…ARO1

…HOM2

…PRO3

300-600 bp of upstream sequence per gene are searched in Saccharomyces cerevisiae.

Based on slides from G. Church Computational Biology course at Harvard

AlignACE Example

Input Data Set

Page 12: Discovery of transcription networks Lecture3 Nov 2012 Regulatory Genomics Weizmann Institute Prof. Yitzhak Pilpel.

K-means

• Start with random positions of centroids.

• Assign data points to centroids.

• Move centroids to center of assigned points.

• Iterate till minimal cost.

Iteration = 3
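The four steps listed on this slide can be sketched in plain Python as follows. This is an illustrative implementation, not the one used in the lecture: the starting centroids are drawn from the data points themselves (an assumption; the slide only says "random positions"), and iteration stops when the assignment no longer changes.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def k_means(points, k, max_iterations=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # random starting centroids
    assignment = None
    for _ in range(max_iterations):
        # Assign each data point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:       # nothing moved: converged
            break
        assignment = new_assignment
        # Move each centroid to the center of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                        # keep the old centroid if a cluster is empty
                centroids[c] = mean_point(members)
    return centroids, assignment

# Hypothetical two-dimensional data with two well-separated groups.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
centroids, assignment = k_means(points, k=2)
print(centroids, assignment)
```

For expression data, each point would be one gene's normalized profile across the time points rather than a two-dimensional coordinate.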

Page 13: Discovery of transcription networks Lecture3 Nov 2012 Regulatory Genomics Weizmann Institute Prof. Yitzhak Pilpel.
Page 14: Discovery of transcription networks Lecture3 Nov 2012 Regulatory Genomics Weizmann Institute Prof. Yitzhak Pilpel.

5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT

5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG

5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT

5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC

5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA

5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA

5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA

TGAAAAATTC

GACATCGAAA

GCACTTCGGC

GAGTCATTAC

GTAAATTGTC

CCACAGTCCG

TGTGAAGCAC

MAP score = -10.0

…HIS7

…ARO4

…ILV6

…THR4

…ARO1

…HOM2

…PRO3

AlignACE Example

Based on slides from G. Church Computational Biology course at Harvard

Initial Seeding

Page 15: Discovery of transcription networks Lecture3 Nov 2012 Regulatory Genomics Weizmann Institute Prof. Yitzhak Pilpel.

5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT

5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG

5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT

5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC

5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA

5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA

5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA

Current alignment:

TGAAAAATTC

GACATCGAAA

GCACTTCGGC

GAGTCATTAC

GTAAATTGTC

CCACAGTCCG

TGTGAAGCAC

Add? Candidate site:

TCTCTCTCCA

How much better is the alignment with this site as opposed to without?
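One simple way to put a number on "how much better" is a log-odds score of the candidate site under the current alignment versus a background model. The sketch below is a simplified stand-in and not AlignACE's actual sampling formula: it assumes a uniform background of 0.25 per base and one pseudocount per base, and the function names are hypothetical.

```python
import math
from collections import Counter

def column_probabilities(sites, pseudocount=1.0, alphabet="ACGT"):
    """Per-column base probabilities estimated from the aligned sites."""
    n = len(sites)
    total = n + pseudocount * len(alphabet)
    columns = []
    for j in range(len(sites[0])):
        counts = Counter(site[j] for site in sites)
        columns.append({b: (counts[b] + pseudocount) / total for b in alphabet})
    return columns

def log_odds(candidate, sites, background=0.25):
    """log2 of P(candidate | motif model) / P(candidate | background)."""
    columns = column_probabilities(sites)
    return sum(math.log2(col[base] / background)
               for col, base in zip(columns, candidate))

# The seven seed sites of the current alignment and the candidate from the slide.
current = ["TGAAAAATTC", "GACATCGAAA", "GCACTTCGGC", "GAGTCATTAC",
           "GTAAATTGTC", "CCACAGTCCG", "TGTGAAGCAC"]
print(log_odds("TCTCTCTCCA", current))
```

A positive score means the candidate looks more like the current motif model than like background; a sampler would add such sites with higher probability.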

…HIS7

…ARO4

…ILV6

…THR4

…ARO1

…HOM2

…PRO3

Based on slides from G. Church Computational Biology course at Harvard

AlignACE Example

Sampling

Page 16: Discovery of transcription networks Lecture3 Nov 2012 Regulatory Genomics Weizmann Institute Prof. Yitzhak Pilpel.

5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT

5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG

5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT

5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC

5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA

5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA

5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA

TGAAAAATTC

GACATCGAAA

GCACTTCGGC

GAGTCATTAC

GTAAATTGTC

CCACAGTCCG

TGTGAAGCAC

Add?

TGAAAAATTC

GACATCGAAA

GCACTTCGGC

GAGTCATTAC

GTAAATTGTC

CCACAGTCCG

TGTGAAGCAC

How much better is the alignment with this site as opposed to without?

Remove.

TGAAAAAATG

…HIS7

…ARO4

…ILV6

…THR4

…ARO1

…HOM2

…PRO3

Based on slides from G. Church Computational Biology course at Harvard

AlignACE Example

Sampling

Page 17: Discovery of transcription networks Lecture3 Nov 2012 Regulatory Genomics Weizmann Institute Prof. Yitzhak Pilpel.

5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT

5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG

5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT

5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC

5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA

5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA

5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA

GACATCGAAA

GCACTTCGGC

GAGTCATTAC

GTAAATTGTC

CCACAGTCCG

TGTGAAGCAC

GACATCGAAAC

GCACTTCGGCG

GAGTCATTACA

GTAAATTGTCA

CCACAGTCCGC

TGTGAAGCACA

How much better is the alignment with this new column structure?

…HIS7

…ARO4

…ILV6

…THR4

…ARO1

…HOM2

…PRO3

Based on slides from G. Church Computational Biology course at Harvard

AlignACE Example

Column Sampling

Page 18: Discovery of transcription networks Lecture3 Nov 2012 Regulatory Genomics Weizmann Institute Prof. Yitzhak Pilpel.

5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT

5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG

5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT

5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC

5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA

5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA

5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA

AAAAGAGTCA

AAATGACTCA

AAGTGAGTCA

AAAAGAGTCA

GGATGAGTCA

AAATGAGTCA

GAATGAGTCA

AAAAGAGTCA

MAP score = 20.37

…HIS7

…ARO4

…ILV6

…THR4

…ARO1

…HOM2

…PRO3

Based on slides from G. Church Computational Biology course at Harvard

AlignACE Example

The Best Motif

Page 19: Discovery of transcription networks Lecture3 Nov 2012 Regulatory Genomics Weizmann Institute Prof. Yitzhak Pilpel.

The MAP Score

• MAP: maximal a priori log-likelihood score

• This is what the algorithm tries to optimize.

• Measures the degree of over-representation of the motif in the input sequence relative to expectation in a random sequence.

Page 20: Discovery of transcription networks Lecture3 Nov 2012 Regulatory Genomics Weizmann Institute Prof. Yitzhak Pilpel.

The MAP Score

B, Γ = standard Beta & Gamma functions

N = number of aligned sites; T = number of total possible sites

F_jb = number of occurrences of base b at position j (F = sum)

G_b = background genomic frequency for base b

β_b = n x G_b for n pseudocounts (β = sum)

W = width of motif; C = number of columns in motif (W >= C)

Based on slides from G. Church Computational Biology course at Harvard

Page 21: Discovery of transcription networks Lecture3 Nov 2012 Regulatory Genomics Weizmann Institute Prof. Yitzhak Pilpel.

The MAP Score

$\mathrm{MAP} \approx N \log_{10} \frac{N}{\mathrm{exp}}$

N = number of aligned sites

exp = expected number of sites in the input sequence under a random model

$P = \frac{1}{4^{7}} \approx$ 1 site every 16,000 bases

For a 64,000-base sequence: exp = 4

Page 22: Discovery of transcription networks Lecture3 Nov 2012 Regulatory Genomics Weizmann Institute Prof. Yitzhak Pilpel.

Some examples

$\mathrm{MAP} \approx N \log_{10} \frac{N}{\mathrm{exp}}$

Motif (length)   Number of genes (each with a 1,000 bp promoter)   Number of times found   Expected number of times   MAP score
AGGGTAA (7)      16                                                10                      ~1                         10
GTAGATG (7)      16                                                2                       ~1                         0.60206
CCGTGAG (7)      160                                               10                      ~10                        0
GATGTA (6)       16                                                2                       ~4                         -0.60206
AGGGTA (6)       16                                                10                      4                          4.089354
A (1)            16                                                2504                    ~2500                      1.73
AAAAAAA (7)      16                                                5                       ~1.5                       2.614394
GGGGGGG (7)      16                                                5                       ~0.5                       5

Very intuitive: anything that is long, that occurs many times, and that is different from background will score highly.
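Several rows of the table can be reproduced with the simplified formula MAP ≈ N log10(N / exp). The sketch below assumes base-10 logarithms (which is what matches the tabulated values) and, for `expected_sites`, a random sequence with equal base frequencies; both function names are illustrative.

```python
import math

def expected_sites(motif_length, total_bases):
    """Expected occurrences of one specific word in random sequence,
    assuming equal base frequencies (an assumption made here)."""
    return total_bases / 4 ** motif_length

def map_score(n_found, n_expected):
    """Simplified MAP score from the slide: N * log10(N / exp)."""
    return n_found * math.log10(n_found / n_expected)

# (motif, times found, expected times) taken from rows of the table.
examples = [
    ("AGGGTAA", 10, 1),
    ("GTAGATG", 2, 1),
    ("CCGTGAG", 10, 10),
    ("GATGTA", 2, 4),
    ("AAAAAAA", 5, 1.5),
    ("GGGGGGG", 5, 0.5),
]
for motif, found, expected in examples:
    print(motif, round(map_score(found, expected), 6))

# A 7-mer in 16 promoters of 1,000 bp each is expected roughly once.
print(expected_sites(7, 16 * 1000))
```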

Page 23: Discovery of transcription networks Lecture3 Nov 2012 Regulatory Genomics Weizmann Institute Prof. Yitzhak Pilpel.

The MAP Score Properties

a) Motif should be “strong”

b) Input sequence can’t be too long

$\mathrm{MAP} \approx N \log_{10} \frac{N}{\mathrm{exp}}$

$P = \frac{1}{4^{7}} \approx$ 1 site every 16,000 bases

Genome length ~12 Mb: $\mathrm{exp} = \frac{1}{16000} \cdot 2 \cdot 12 \cdot 10^{6} = 1500$

Motif needs more than 1500 sites to get a positive MAP score: $\mathrm{MAP} \approx N \log_{10} \frac{N}{\mathrm{exp}} = 1500 \log_{10} \frac{1500}{1500} = 0$

Problem: most transcription factor binding sites occur in only dozens to hundreds of genes

Page 24: Discovery of transcription networks Lecture3 Nov 2012 Regulatory Genomics Weizmann Institute Prof. Yitzhak Pilpel.

Solution: Cluster genes before searching for motifs

[Figure: genes plotted in expression space; axes: Time-point 1, Time-point 2, Time-point 3]

Page 25: Discovery of transcription networks Lecture3 Nov 2012 Regulatory Genomics Weizmann Institute Prof. Yitzhak Pilpel.

Group Specificity Score:

How well does a motif target the genes used to find it, compared to the whole genome?

[Figure: Venn diagram. All Genome (N); Motif ORFs Group (S1); ORFs with best sites (S2); intersection X]

N = total # of ORFs in the genome (6226)

S1 = # ORFs used to align the motif

S2 = # targets in the genome (~100 ORFs with best ScanACE scores)

x = size of the intersection of S1 and S2 (X in the diagram)

What is the probability of such a large intersection? For an intersection of exactly x ORFs:

$P(x) = \frac{\binom{S_1}{x} \binom{N - S_1}{S_2 - x}}{\binom{N}{S_2}}$

Based on slides from G. Church Computational Biology course at Harvard

Page 26: Discovery of transcription networks Lecture3 Nov 2012 Regulatory Genomics Weizmann Institute Prof. Yitzhak Pilpel.

Group Specificity Score:

How well does a motif target the genes used to find it, compared to the whole genome?

[Figure: Venn diagram. All Genome (N); Motif ORFs Group (S1); ORFs with best sites (S2); intersection X]

N = total # of ORFs in the genome (6226)

S1 = # ORFs used to align the motif

S2 = # targets in the genome (~100 ORFs with best ScanACE scores)

x = size of the intersection of S1 and S2 (X in the diagram)

What is the probability of such a large intersection? Summing over all intersections of size x or larger:

$S = \sum_{i=x}^{\min(S_1, S_2)} \frac{\binom{S_1}{i} \binom{N - S_1}{S_2 - i}}{\binom{N}{S_2}}$

Based on slides from G. Church Computational Biology course at Harvard
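The group specificity score is a hypergeometric tail probability over the quantities N, S1, S2 and the intersection size defined on this slide. A minimal sketch using exact integer binomial coefficients follows; the function name and the example numbers (a group of 20 ORFs, 10 of them among the 100 best sites) are hypothetical.

```python
from math import comb

def group_specificity(N, S1, S2, x):
    """Probability that two random ORF sets of sizes S1 and S2, drawn from
    N ORFs, share x or more members (hypergeometric upper tail)."""
    upper = min(S1, S2)
    # Exact integer arithmetic in the numerator avoids overflow for large N.
    tail = sum(comb(S1, i) * comb(N - S1, S2 - i) for i in range(x, upper + 1))
    return tail / comb(N, S2)

# Hypothetical example: 10 of the 100 best-scoring ORFs fall in a 20-ORF group.
print(group_specificity(N=6226, S1=20, S2=100, x=10))
```

Smaller values indicate that the motif's best genomic sites are concentrated in the group it was derived from.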

Page 27: Discovery of transcription networks Lecture3 Nov 2012 Regulatory Genomics Weizmann Institute Prof. Yitzhak Pilpel.

Positional Bias Score:

Measures the degree of preference of positioning in a particular range upstream of the translational start.

Based on slides from G. Church Computational Biology course at Harvard

[Figure: histogram of #ORFs against position, from Start to -600 bp, with a 50 bp window]

Page 28: Discovery of transcription networks Lecture3 Nov 2012 Regulatory Genomics Weizmann Institute Prof. Yitzhak Pilpel.

Positional Bias Score:

• Find best 200 sites in the genome

• Restrict sites to segment of length [s = 600 bp] from translation start

• t = # sites in the segment

• Choose window size [w = 50 bp]

• m = # sites in the most enriched window

What is the probability to have m or more sites in a window of size w? For exactly m sites:

$P(m) = \binom{t}{m} \left(\frac{w}{s}\right)^{m} \left(1 - \frac{w}{s}\right)^{t - m}$

Based on slides from G. Church Computational Biology course at Harvard

[Figure: histogram of #ORFs against position, from Start to -600 bp, with a 50 bp window]

Page 29: Discovery of transcription networks Lecture3 Nov 2012 Regulatory Genomics Weizmann Institute Prof. Yitzhak Pilpel.

Positional Bias Score:

• Find best 200 sites in the genome

• Restrict sites to segment of length [s = 600 bp] from translation start

• t = # sites in the segment

• Choose window size [w = 50 bp]

• m = # sites in the most enriched window

What is the probability to have m or more sites in a window of size w?

$P = \sum_{i=m}^{t} \binom{t}{i} \left(\frac{w}{s}\right)^{i} \left(1 - \frac{w}{s}\right)^{t - i}$

Based on slides from G. Church Computational Biology course at Harvard

[Figure: histogram of #ORFs against position, from Start to -600 bp, with a 50 bp window]
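The positional bias score is a binomial tail: each of the t sites falls in a given window with probability w/s, and the score is the chance of seeing m or more there. A minimal sketch with the slide's defaults (w = 50 bp, s = 600 bp) follows; the function name and the example counts are hypothetical.

```python
from math import comb

def positional_bias(t, m, w=50, s=600):
    """Probability that m or more of t sites fall in one window of width w
    when sites are spread uniformly over a segment of length s."""
    p = w / s
    return sum(comb(t, i) * p ** i * (1 - p) ** (t - i) for i in range(m, t + 1))

# Hypothetical example: 40 sites in the segment, 15 of them in the best window.
print(positional_bias(t=40, m=15))
```

Because the most enriched window is picked after looking at the data, this value is best read as a ranking score rather than a calibrated p-value.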

Page 30: Discovery of transcription networks Lecture3 Nov 2012 Regulatory Genomics Weizmann Institute Prof. Yitzhak Pilpel.

Lecture Topics

• Introduction to DNA regulatory motifs

• AlignACE - A motif finding algorithm

• Assessment of motifs

• AlignACE results on yeast genome

• Summary & Conclusions

Page 31: Discovery of transcription networks Lecture3 Nov 2012 Regulatory Genomics Weizmann Institute Prof. Yitzhak Pilpel.

Comparisons of motifs

• The CompareACE program finds best alignment between two motifs and calculates the correlation between the two position-specific scoring matrices

• Similar motifs: CompareACE score > 0.7

Based on slides from G. Church Computational Biology course at Harvard

Page 32: Discovery of transcription networks Lecture3 Nov 2012 Regulatory Genomics Weizmann Institute Prof. Yitzhak Pilpel.

Clustering motifs by similarity

Pairwise CompareACE scores for motif A, motif B, motif C, and motif D:

     A    B    C    D
A   1.0  0.9  0.1  0.0
B        1.0  0.2  0.1
C             1.0  0.8
D                  1.0

Hierarchical clustering of the CompareACE scores:

cluster 1: A, B

cluster 2: C, D

1 2 3 4 5 6

A 0.8 0.4 1 0.6 0 1

C 0 0 0 0 0 0

G 0.2 0.6 0 0 1 0

T 0 0 0 0.4 0 0

1 2 3 4 5 6

A 0.4 0.4 1 0.6 0 0

C 0 0 0 0 0 1

G 0.6 0.6 0 0 1 0

T 0 0 0 0.4 0 0
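The two matrices above can be compared with a Pearson correlation over all matrix cells. This is only a sketch of the idea: CompareACE also searches for the best alignment between the motifs, whereas here the columns are assumed to be already aligned, and the variable and function names are hypothetical.

```python
import math

# The two frequency matrices shown on the slide (rows A, C, G, T).
motif_1 = {"A": [0.8, 0.4, 1, 0.6, 0, 1], "C": [0, 0, 0, 0, 0, 0],
           "G": [0.2, 0.6, 0, 0, 1, 0],   "T": [0, 0, 0, 0.4, 0, 0]}
motif_2 = {"A": [0.4, 0.4, 1, 0.6, 0, 0], "C": [0, 0, 0, 0, 0, 1],
           "G": [0.6, 0.6, 0, 0, 1, 0],   "T": [0, 0, 0, 0.4, 0, 0]}

def flatten(matrix):
    """Concatenate the four rows into one vector."""
    return [value for base in "ACGT" for value in matrix[base]]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)

score = pearson(flatten(motif_1), flatten(motif_2))
print(round(score, 3), "similar" if score > 0.7 else "not similar")
```

The 0.7 threshold in the last line is the similarity cutoff quoted on the "Comparisons of motifs" slide.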

Page 33: Discovery of transcription networks Lecture3 Nov 2012 Regulatory Genomics Weizmann Institute Prof. Yitzhak Pilpel.

Most Group Specific Motifs

Page 34: Discovery of transcription networks Lecture3 Nov 2012 Regulatory Genomics Weizmann Institute Prof. Yitzhak Pilpel.

Most Positional Biased Motifs

Page 35: Discovery of transcription networks Lecture3 Nov 2012 Regulatory Genomics Weizmann Institute Prof. Yitzhak Pilpel.

Negative Controls

• 250 AlignACE runs on randomly created groups of ORFs, of size 20, 40, 60, 80, and 100 ORFs.

Based on slides from G. Church Computational Biology course at Harvard

[Figure: MAP scores for random and real groups]

Page 36: Discovery of transcription networks Lecture3 Nov 2012 Regulatory Genomics Weizmann Institute Prof. Yitzhak Pilpel.

Negative Controls

MAP cutoff of 10, Group Specificity cutoff of $10^{-10}$:

False positives = 10-20%

Page 37: Discovery of transcription networks Lecture3 Nov 2012 Regulatory Genomics Weizmann Institute Prof. Yitzhak Pilpel.

Positive Controls

• 29 listed TFs with five or more known binding sites were chosen.

• AlignACE was run on the upstream regions of the corresponding regulated genes.

• An appropriate motif was found in 21/29 cases.

• False negative rate = ~10-30%

Based on slides from G. Church Computational Biology course at Harvard

Page 38: Discovery of transcription networks Lecture3 Nov 2012 Regulatory Genomics Weizmann Institute Prof. Yitzhak Pilpel.
Page 39: Discovery of transcription networks Lecture3 Nov 2012 Regulatory Genomics Weizmann Institute Prof. Yitzhak Pilpel.

                                                         

                      

Page 40: Discovery of transcription networks Lecture3 Nov 2012 Regulatory Genomics Weizmann Institute Prof. Yitzhak Pilpel.

The data

• Organism: Saccharomyces cerevisiae

• Microarray experiment: Affymetrix microarrays of 6,220 mRNAs

• Data: gathered by Cho et al.

• 15 time points, spanning about 4 hours across two cell cycles

• Genome sequence

Page 41: Discovery of transcription networks Lecture3 Nov 2012 Regulatory Genomics Weizmann Institute Prof. Yitzhak Pilpel.

Typical clusters of genes in the data

Page 42: Discovery of transcription networks Lecture3 Nov 2012 Regulatory Genomics Weizmann Institute Prof. Yitzhak Pilpel.

Variance normalization and clustering of expression time series

• The 3,000 most variable ORFs were chosen, based on the normalized dispersion in expression level of each gene across the time points (s.d./mean).

• The 15 time points were used to construct a 3,000 by 15 data matrix.

• The variance of each gene was normalized across the 15 conditions: the mean across the time points was subtracted from the expression level of each gene, and the result was divided by the standard deviation across the time points.
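The selection and normalization steps above can be sketched as follows. The slides do not say whether the sample or population standard deviation was used; this sketch assumes the sample standard deviation, and the gene names and values are made up for illustration.

```python
import statistics

def dispersion(profile):
    """Normalized dispersion of one gene: s.d. / mean across time points."""
    return statistics.stdev(profile) / statistics.mean(profile)

def normalize(profile):
    """Subtract the gene's mean and divide by its standard deviation."""
    mean = statistics.mean(profile)
    sd = statistics.stdev(profile)   # sample s.d.; an assumption
    return [(x - mean) / sd for x in profile]

def most_variable(expression, count):
    """Names of the `count` genes with the largest normalized dispersion."""
    return sorted(expression, key=lambda g: dispersion(expression[g]), reverse=True)[:count]

# Hypothetical expression levels for three genes at five time points.
expression = {
    "geneA": [100.0, 102.0, 98.0, 101.0, 99.0],
    "geneB": [10.0, 40.0, 90.0, 40.0, 10.0],
    "geneC": [50.0, 55.0, 60.0, 55.0, 50.0],
}
selected = most_variable(expression, 2)
normalized = {gene: normalize(expression[gene]) for gene in selected}
print(selected)
```

After normalization every selected gene has mean 0 and standard deviation 1, so clustering compares the shapes of the profiles rather than their absolute levels.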

Page 43: Discovery of transcription networks Lecture3 Nov 2012 Regulatory Genomics Weizmann Institute Prof. Yitzhak Pilpel.

Before and after mean-variance normalization

[Figure: two panels plotting Gene1, Gene2, and Gene3 against Time. Before normalization: y-axis Gene Expression. After normalization: y-axis Normalized Expression.]

Page 44: Discovery of transcription networks Lecture3 Nov 2012 Regulatory Genomics Weizmann Institute Prof. Yitzhak Pilpel.

Representation of expression data

[Figure: normalized expression data from microarrays; Gene 1 and Gene 2 shown as points in expression space (axis: Time-point 1), separated by their Euclidean distance]

Page 45: Discovery of transcription networks Lecture3 Nov 2012 Regulatory Genomics Weizmann Institute Prof. Yitzhak Pilpel.

K-means

• Start with random positions of centroids.

[Figure: positions of the data points Xi and of the centroids C at Iteration = 0]

Page 46: Discovery of transcription networks Lecture3 Nov 2012 Regulatory Genomics Weizmann Institute Prof. Yitzhak Pilpel.

Choosing K

Since we don’t know the number of clusters in advance we need a way to estimate it.

In order to choose the number of clusters K, the Sum of Squares of Errors is calculated for different K values. A clear break point indicates the “natural” number of clusters in the data.

[Figure: Sum of squared errors plotted against K]
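The break-point idea can be illustrated on a toy one-dimensional data set. This sketch uses a small deterministic k-means (starting centroids spread evenly through the sorted data, which is an assumption made here for reproducibility) and made-up values with three obvious groups.

```python
def k_means_1d(values, k, iterations=50):
    """Tiny one-dimensional k-means with a deterministic start."""
    values = sorted(values)
    # Start from k values spread evenly through the sorted data (an assumption).
    centroids = [values[(2 * i + 1) * len(values) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: abs(v - centroids[c]))
            clusters[nearest].append(v)
        # Recompute each centroid; keep the old one if its cluster is empty.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids

def sum_squared_errors(values, centroids):
    """Sum over points of the squared distance to the nearest centroid."""
    return sum(min((v - c) ** 2 for c in centroids) for v in values)

# Hypothetical data with three well-separated groups.
data = [1.0, 1.1, 0.9, 5.0, 5.2, 4.8, 9.0, 9.1, 8.9]
sse_by_k = {k: sum_squared_errors(data, k_means_1d(data, k)) for k in range(1, 6)}
for k, sse in sse_by_k.items():
    print(k, round(sse, 3))
```

On this toy data the error drops sharply up to K = 3 and barely changes afterwards, which is the kind of break point the slide refers to.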

Page 47: Discovery of transcription networks Lecture3 Nov 2012 Regulatory Genomics Weizmann Institute Prof. Yitzhak Pilpel.

Significant enrichment of functional categories within clusters

• Each gene was mapped into one of 199 functional categories (according to the MIPS database).

• For each cluster, P-values were calculated for observing the frequencies of genes from particular functional categories.

• There was significant grouping of genes within the same cluster.

Page 48: Discovery of transcription networks Lecture3 Nov 2012 Regulatory Genomics Weizmann Institute Prof. Yitzhak Pilpel.

Page 49: Discovery of transcription networks Lecture3 Nov 2012 Regulatory Genomics Weizmann Institute Prof. Yitzhak Pilpel.
Page 50: Discovery of transcription networks Lecture3 Nov 2012 Regulatory Genomics Weizmann Institute Prof. Yitzhak Pilpel.

The hyper-geometric score

P values were calculated for finding at least (k) ORFs from a particular functional category within a cluster of size (n):

$P = 1 - \sum_{i=0}^{k-1} \frac{\binom{f}{i} \binom{g - f}{n - i}}{\binom{g}{n}}$

where (f) is the total number of genes within a functional category and (g) is the total number of genes within the genome (6,220).

P values greater than 3×10^-4 are not reported, as their total expectation within the cluster would be higher than 0.05, since 199 MIPS functional categories were tested (ref. 15).
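The enrichment test described above is the upper tail of a hypergeometric distribution over k, n, f, and g. The sketch below sums the tail directly, which equals one minus the lower tail but keeps precision for very small values. The function name and the example counts are hypothetical; the 3×10^-4 cutoff and the 199 categories are from the slide.

```python
from math import comb

def enrichment_p_value(k, n, f, g=6220):
    """Probability of finding at least k genes of a category with f members
    in a cluster of n genes drawn from a genome of g genes."""
    # math.comb returns 0 when asked to choose more items than are available,
    # so out-of-range terms vanish on their own.
    tail = sum(comb(f, i) * comb(g - f, n - i) for i in range(k, min(n, f) + 1))
    return tail / comb(g, n)

CUTOFF = 3e-4          # reporting threshold from the slide
CATEGORIES = 199       # number of MIPS categories tested

# Hypothetical example: 12 of 100 cluster genes belong to a 150-gene category.
p = enrichment_p_value(k=12, n=100, f=150)
print(p, "reported" if p <= CUTOFF else "not reported")

# Expected number of categories passing the cutoff by chance in one cluster.
print(CATEGORIES * CUTOFF)
```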

Page 51: Discovery of transcription networks Lecture3 Nov 2012 Regulatory Genomics Weizmann Institute Prof. Yitzhak Pilpel.

Challenge: generalize hyper-geometric for more than two sets

Chr V

Expression cluster

Functional group

Page 52: Discovery of transcription networks Lecture3 Nov 2012 Regulatory Genomics Weizmann Institute Prof. Yitzhak Pilpel.

Page 53: Discovery of transcription networks Lecture3 Nov 2012 Regulatory Genomics Weizmann Institute Prof. Yitzhak Pilpel.

Sequence: MCB element consensus

This motif was later mapped to the literature and confirmed to be the well-known MCB element, which controls the periodicity of the genes that peak at G1-S.

[Figure: motif consensus; axis: nucleotides]

Page 54: Discovery of transcription networks Lecture3 Nov 2012 Regulatory Genomics Weizmann Institute Prof. Yitzhak Pilpel.

The existence of the motif in all ORFs of each cluster

[Figure: occurrence of the MCB element across clusters]

Page 55: Discovery of transcription networks Lecture3 Nov 2012 Regulatory Genomics Weizmann Institute Prof. Yitzhak Pilpel.

Location of the motif - MCB element

[Figure: motif location; x-axis: distance from ATG (bp)]

Page 56: Discovery of transcription networks Lecture3 Nov 2012 Regulatory Genomics Weizmann Institute Prof. Yitzhak Pilpel.

SCB element

This motif (later found to be the SCB element) was the second-highest-scoring motif within this cluster. The SCB element is also a very well-known cis-regulatory element that contributes to the periodicity of the genes within the G1-S regulon.

Page 57: Discovery of transcription networks Lecture3 Nov 2012 Regulatory Genomics Weizmann Institute Prof. Yitzhak Pilpel.

ribonucleotide reductase

Page 58: Discovery of transcription networks Lecture3 Nov 2012 Regulatory Genomics Weizmann Institute Prof. Yitzhak Pilpel.

Determining the cell-cycle periodicity of clusters

Fourier analysis allows the genes to be ranked according to the cell-cycle periodicity of their expression.

[Figure: expression matrix; expression plotted against time for genes with high and low cell-cycle periodicity]
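One way to rank genes by periodicity is to ask how much of a profile's variation sits at the cell-cycle frequency. The score below is an illustrative choice and not necessarily the one used in the lecture: it is the fraction of a mean-centered profile's power carried by a single Fourier component with the given period (measured in samples). The profiles are synthetic.

```python
import cmath
import math

def periodicity_score(profile, period):
    """Fraction of the centered profile's power at the given period."""
    n = len(profile)
    mean = sum(profile) / n
    centered = [x - mean for x in profile]
    total = sum(x * x for x in centered)
    if total == 0:
        return 0.0                      # a flat profile has no periodicity
    # Discrete Fourier component at frequency 1 / period.
    component = sum(x * cmath.exp(-2j * math.pi * t / period)
                    for t, x in enumerate(centered))
    return 2 * abs(component) ** 2 / (n * total)

# Synthetic profiles: 16 samples covering two cycles of period 8.
periodic = [math.sin(2 * math.pi * t / 8) for t in range(16)]
trend = [float(t) for t in range(16)]

profiles = {"periodic": periodic, "trend": trend}
ranking = sorted(profiles, key=lambda g: periodicity_score(profiles[g], 8), reverse=True)
print(ranking)
```

A pure sinusoid at the chosen period scores close to 1, while a steady drift scores much lower.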

Page 59: Discovery of transcription networks Lecture3 Nov 2012 Regulatory Genomics Weizmann Institute Prof. Yitzhak Pilpel.

Explain FFT… (including ORs variability)

Page 60: Discovery of transcription networks Lecture3 Nov 2012 Regulatory Genomics Weizmann Institute Prof. Yitzhak Pilpel.

Periodic clusters

Page 61: Discovery of transcription networks Lecture3 Nov 2012 Regulatory Genomics Weizmann Institute Prof. Yitzhak Pilpel.

Non-periodic clusters

Page 62: Discovery of transcription networks Lecture3 Nov 2012 Regulatory Genomics Weizmann Institute Prof. Yitzhak Pilpel.
Page 63: Discovery of transcription networks Lecture3 Nov 2012 Regulatory Genomics Weizmann Institute Prof. Yitzhak Pilpel.

And this was just the beginning…

Page 64: Discovery of transcription networks Lecture3 Nov 2012 Regulatory Genomics Weizmann Institute Prof. Yitzhak Pilpel.

Collaboration

?

Co-occurrence

(AND)

Redundancy

(OR)

In case of two motifs derived from a cluster

http://longitude.weizmann.ac.il/publications/PilpelNatGent01.pdf

Page 65: Discovery of transcription networks Lecture3 Nov 2012 Regulatory Genomics Weizmann Institute Prof. Yitzhak Pilpel.

Logic of interaction of motifs

[Figure: three panels of expression level against time (labels: G2, M1, M2). Only M1: EC = 0.05. Only M2: EC = 0.05. M1 AND M2: EC = 0.23.]

Page 66: Discovery of transcription networks Lecture3 Nov 2012 Regulatory Genomics Weizmann Institute Prof. Yitzhak Pilpel.

Synergistic motifs

A combination of two motifs is called ‘synergistic’ if the expression coherence score of the genes that have the two motifs is significantly higher than the scores of the genes that have either of the motifs.

SFF, Mcm1

Page 67: Discovery of transcription networks Lecture3 Nov 2012 Regulatory Genomics Weizmann Institute Prof. Yitzhak Pilpel.

A global map of combinatorial expression control

[Figure: network of motif combinations. Motif labels: mRPE72, SWI5, SFF ', MCM1, SFFMCM1', ECB, SCB, MCB, PAC, mRRPE, mRRSE3, GCN4, BAS1, LYS14, RAP1, mRPE34, mRPE57, mRPE6, mRPE58, STRE, RPN4, ABF1, PDR, CCA, PHO4, AFT1, STE12, MIG1, CSRE, HAP234, ALPHA1', ALPHA1, ALPHA2, mRPE8, mRPE69. Conditions: Heat-shock, Cell cycle, Sporulation, Diauxic shift, MAPK signaling, DNA damage.]

• High connectivity

• Hubs

• Alternative partners in various conditions

Pilpel et al. Nature Genetics 2001

Page 68: Discovery of transcription networks Lecture3 Nov 2012 Regulatory Genomics Weizmann Institute Prof. Yitzhak Pilpel.

The human cell cycle

G1-Phase S-Phase

G2-Phase M-Phase

Page 69: Discovery of transcription networks Lecture3 Nov 2012 Regulatory Genomics Weizmann Institute Prof. Yitzhak Pilpel.

The proliferation cluster genes are cell cycle periodic

[Figure: Gene Expression (-4 to 4) across Samples (5 to 45), with G2/M, G1/S, and CHR annotations; Proportion shown for All genes and Proliferation genes]

Page 70: Discovery of transcription networks Lecture3 Nov 2012 Regulatory Genomics Weizmann Institute Prof. Yitzhak Pilpel.

The cell cycle motifs are enriched among the proliferation cluster genes

[Figure: positions of the NFY, E2F, ELK1, CDE, and CHR motifs within 200 bp upstream of the TSS]

Not in the cluster, mutated in cancer

Page 71: Discovery of transcription networks Lecture3 Nov 2012 Regulatory Genomics Weizmann Institute Prof. Yitzhak Pilpel.

Regulation of the proliferation cluster: significant motifs

Motif   P-value
NFY     3.74*10^-11
CDE     5.31*10^-10
E2F     2.37*10^-09
ELK1    3.10*10^-06
CHR     1.42*10^-05

[Figure: sequence logo for each motif]

1000 bp upstream, 326 MatInspector motifs

Page 72: Discovery of transcription networks Lecture3 Nov 2012 Regulatory Genomics Weizmann Institute Prof. Yitzhak Pilpel.

Potential regulatory motifs in 3’ UTRs

Finding 3’ UTR elements associated with high/low transcript stability (in yeast)

AAGCTTCC

CCTACAAC

Entire genome