Discovery of transcription networks Lecture3 Nov 2012 Regulatory Genomics Weizmann Institute Prof....

Discovery of transcription networks

Lecture 3, Nov 2012, Regulatory Genomics, Weizmann Institute, Prof. Yitzhak Pilpel


Hierarchical clustering

Promoter motifs and expression profiles

[Figure: expression profiles of clustered genes shown alongside promoter motifs: CGGCCCCGCGGA, CTCCTCCCCCCCTTC, TGGCCAATCA, ATGTACGGGTG]

http://www.sciencemag.org/content/vol278/issue5338/images/large/se4475903005.jpeg

5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT

5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG

5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT

5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC

5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA

5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA

5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA

AlignACE Example

…HIS7 …ARO4…ILV6…THR4…ARO1…HOM2…PRO3

300-600 bp of upstream sequence per gene are searched in Saccharomyces cerevisiae.

http://statgen.ncsu.edu/~dahlia/journalclub/S01/jmb1205.pdf









AAAAGAGTCA

AAATGACTCA

AAGTGAGTCA

AAAAGAGTCA

GGATGAGTCA

AAATGAGTCA

GAATGAGTCA

AAAAGAGTCA

**********

A cluster of genes may contain a common motif in their promoters

…HIS7

…ARO4

…ILV6

…THR4

…ARO1

…HOM2

…PRO3

Find a needle in a haystack

Computational Identification of Cis-regulatory Elements Associated with Groups of Functionally Related Genes in Saccharomyces cerevisiae

J.D. Hughes, P.W. Estep, S. Tavazoie, G.M. Church

Journal of Molecular Biology (2000)

Example

http://www.cifn.unam.mx/Computational_Genomics/old_research/BIOL2.html

GAL4 is one of the yeast genes required for growth on galactose.

1 2 3 4 5 6

A 0.8 0.4 1 0.6 0 1

C 0 0 0 0 0 0

G 0.2 0.6 0 0 1 0

T 0 0 0 0.4 0 0

Motif Representation

G1 A G A A G A
G2 A A A T G A
G3 G A A T G A
G4 A G A A G A
G5 A G A A G A
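The frequency matrix shown on this slide can be recomputed from the five aligned sites. The following is a minimal sketch; the function name `frequency_matrix` is an illustrative choice, not something from the lecture.

```python
from collections import Counter

# The five aligned sites G1..G5 from the slide.
sites = ["AGAAGA", "AAATGA", "GAATGA", "AGAAGA", "AGAAGA"]

def frequency_matrix(aligned_sites):
    """Return {base: [frequency at position 1, 2, ...]} for equal-length sites."""
    width = len(aligned_sites[0])
    n = len(aligned_sites)
    matrix = {base: [] for base in "ACGT"}
    for j in range(width):
        # Count how often each base occurs in column j of the alignment.
        counts = Counter(site[j] for site in aligned_sites)
        for base in "ACGT":
            matrix[base].append(counts[base] / n)
    return matrix

for base, row in frequency_matrix(sites).items():
    print(base, row)
```

Each column sums to 1, so a row such as A = 0.8, 0.4, 1, 0.6, 0, 1 reads as "A appears in 80% of the sites at position 1", and so on.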

Finding New Motifs

• By lab work

• By comparison to known motifs in other species

• By searching upstream regions of a set of potentially co-regulated genes

The genes bound by the TF Abf1 can be clustered into several groups, some of which contain a motif

NCGTNNNNARTGAT

CGATGAGMTK

NCGTNNNNARTGAT & CGATGAGMTK

(sporulation experiment)

Search Space

• Size of search space: $(L - W + 1)^N \approx L^N$

• L = 600, W = 15, N = 10: size $\approx 10^{27}$

• Exact search methods are not feasible
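The size of the search space can be checked with a few lines of arithmetic. This sketch assumes one site per sequence, so each of the N sequences contributes L - W + 1 possible start positions; the function name is illustrative.

```python
def search_space_size(L, W, N):
    """Number of ways to pick one width-W site in each of N sequences of length L."""
    return (L - W + 1) ** N

size = search_space_size(L=600, W=15, N=10)
print(f"{size:.1e}")  # on the order of 10^27
```

Even at a billion alignments scored per second, enumerating this many combinations would take far longer than the age of the universe, which is why sampling-based methods are used instead.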








…HIS7

…ARO4

…ILV6

…THR4

…ARO1

…HOM2

…PRO3

300-600 bp of upstream sequence per gene are searched in Saccharomyces cerevisiae.

Based on slides from G. Church's Computational Biology course at Harvard


Input Data Set

K-means

• Start with random positions of centroids.

• Assign data points to centroids.

• Move centroids to center of assigned points.

• Iterate until the cost reaches a minimum.
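The four steps above can be sketched in plain Python. This is an illustrative implementation, not the code used in the lecture: it assumes the initial centroids are drawn from the data points themselves, uses squared Euclidean distance as the cost, and stops when the assignments no longer change.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=100, seed=0):
    rng = random.Random(seed)
    # Step 1: start with random centroids (here: k randomly chosen data points).
    centroids = rng.sample(points, k)
    assignment = None
    for _ in range(iterations):
        # Step 2: assign every data point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        # Step 4: stop once the assignments (and hence the cost) stop changing.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    cost = sum(sq_dist(p, centroids[a]) for p, a in zip(points, assignment))
    return centroids, assignment, cost

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, assignment, cost = kmeans(points, k=2)
print(assignment, cost)
```

The returned cost is the sum of squared errors, the same quantity that can be compared across different values of K.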

Iteration = 3







TGAAAAATTC

GACATCGAAA

GCACTTCGGC

GAGTCATTAC

GTAAATTGTC

CCACAGTCCG

TGTGAAGCAC






TGAAAAATTC

GACATCGAAA

GCACTTCGGC

GAGTCATTAC

GTAAATTGTC

CCACAGTCCG

TGTGAAGCAC

MAP score = -10.0

…HIS7

…ARO4

…ILV6

…THR4

…ARO1

…HOM2

…PRO3



Initial Seeding








TGAAAAATTC

GACATCGAAA

GCACTTCGGC

GAGTCATTAC

GTAAATTGTC

CCACAGTCCG

TGTGAAGCAC

Add?

TGAAAAATTC

GACATCGAAA

GCACTTCGGC

GAGTCATTAC

GTAAATTGTC

CCACAGTCCG

TGTGAAGCAC

TCTCTCTCCA

How much better is the alignment with this site as opposed to without?
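One simple way to picture the "add?" decision is to score the candidate site against the profile of the currently aligned sites. The sketch below uses a log-odds score with pseudocounts against a uniform background; this is a simplified stand-in chosen for illustration, not AlignACE's actual sampling weight, and the function name, pseudocount value, and background are all assumptions.

```python
import math

def site_log_odds(aligned_sites, candidate, background=None, pseudocount=1.0):
    """Log2 odds of the candidate under the alignment profile versus background."""
    background = background or {b: 0.25 for b in "ACGT"}
    n = len(aligned_sites)
    score = 0.0
    for j, base in enumerate(candidate):
        # How many currently aligned sites carry this base at column j?
        count = sum(1 for site in aligned_sites if site[j] == base)
        p = (count + pseudocount * background[base]) / (n + pseudocount)
        score += math.log2(p / background[base])
    return score

# Current alignment and candidate site taken from the slide.
alignment = ["TGAAAAATTC", "GACATCGAAA", "GCACTTCGGC", "GAGTCATTAC",
             "GTAAATTGTC", "CCACAGTCCG", "TGTGAAGCAC"]
print(site_log_odds(alignment, "TCTCTCTCCA"))
```

A positive score means the candidate resembles the current alignment more than it resembles background; a sampler would add such sites with higher probability rather than deterministically.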

…HIS7

…ARO4

…ILV6

…THR4

…ARO1

…HOM2

…PRO3



Sampling








TGAAAAATTC

GACATCGAAA

GCACTTCGGC

GAGTCATTAC

GTAAATTGTC

CCACAGTCCG

TGTGAAGCAC

Add?

TGAAAAATTC

GACATCGAAA

GCACTTCGGC

GAGTCATTAC

GTAAATTGTC

CCACAGTCCG

TGTGAAGCAC

How much better is the alignment with this site as opposed to without?

Remove.

TGAAAAAATG

…HIS7

…ARO4

…ILV6

…THR4

…ARO1

…HOM2

…PRO3



Sampling








GACATCGAAA

GCACTTCGGC

GAGTCATTAC

GTAAATTGTC

CCACAGTCCG

TGTGAAGCAC

GACATCGAAAC

GCACTTCGGCG

GAGTCATTACA

GTAAATTGTCA

CCACAGTCCGC

TGTGAAGCACA

How much better is the alignment with this new column structure?

…HIS7

…ARO4

…ILV6

…THR4

…ARO1

…HOM2

…PRO3



Column Sampling








AAAAGAGTCA

AAATGACTCA

AAGTGAGTCA

AAAAGAGTCA

GGATGAGTCA

AAATGAGTCA

GAATGAGTCA

AAAAGAGTCA MAP score = 20.37

…HIS7

…ARO4

…ILV6

…THR4

…ARO1

…HOM2

…PRO3



The Best Motif

• MAP: maximum a posteriori log-likelihood score

• This is what the algorithm tries to optimize.

• Measures the degree of over-representation of the motif in the input sequences relative to the expectation in a random sequence.

The MAP Score

Symbols in the MAP formula:

B, Γ = standard Beta & Gamma functions
N = number of aligned sites; T = number of total possible sites
F_jb = number of occurrences of base b at position j (F = sum)
G_b = background genomic frequency for base b
β_b = n × G_b for n pseudocounts (β = sum)
W = width of motif; C = number of columns in motif (W >= C)



N = number of aligned sites

exp = expected number of sites in the input sequence, compared to a random model

$\mathrm{MAP} \approx N \log\left(\frac{N}{\mathrm{exp}}\right)$

$P = \left(\frac{1}{4}\right)^{7} \approx$ 1 site every 16,000 bases

For a 64,000-base sequence: exp = 4

Motif          Number of genes       Number of     Expected number   MAP score
               (each promoter        times found   of times
               1,000 bp long)
AGGGTAA (7)    16                    10            ~1                10
GTAGATG (7)    16                    2             ~1                0.60206
CCGTGAG (7)    160                   10            ~10               0
GATGTA (6)     16                    2             ~4                -0.60206
AGGGTA (6)     16                    10            4                 4.089354
A (1)          16                    2504          ~2500             1.73
AAAAAAA (7)    16                    5             ~1.5              2.614394
GGGGGGG (7)    16                    5             ~0.5              5

$\mathrm{MAP} \approx N \log\left(\frac{N}{\mathrm{exp}}\right)$

Some examples

Very intuitive: anything that is long, occurs many times, and differs from the background will score highly.
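The example scores can be reproduced from the approximation MAP ≈ N log(N/exp). The sketch below assumes a base-10 logarithm, which is the base that matches the tabulated values; the function name is illustrative.

```python
import math

def approx_map(n_found, n_expected):
    """Approximate MAP score: N * log10(N / exp)."""
    return n_found * math.log10(n_found / n_expected)

# (motif, times found, expected times) for rows of the example table.
examples = [
    ("AGGGTAA", 10, 1),
    ("GTAGATG", 2, 1),
    ("CCGTGAG", 10, 10),
    ("GATGTA", 2, 4),
    ("AAAAAAA", 5, 1.5),
    ("GGGGGGG", 5, 0.5),
]
for motif, found, expected in examples:
    print(f"{motif:8s} {approx_map(found, expected):.5f}")
```

The tabulated 4.089354 for AGGGTA corresponds to an expectation of about 3.9 (16,000 / 4^6) rather than the rounded value 4.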

The MAP Score Properties

a) Motif should be “strong”

b) Input sequence can’t be too long

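$\mathrm{MAP} \approx N \log\left(\frac{N}{\mathrm{exp}}\right)$

$P = \left(\frac{1}{4}\right)^{7} \approx$ 1 site every 16,000 bases

Genome length ~12 Mb: $\mathrm{exp} = \frac{1}{16000} \cdot 2 \cdot 12 \cdot 10^{6} = 1500$

Motif needs more than 1500 sites to get a positive MAP score: $\mathrm{MAP} \approx N \log\left(\frac{N}{\mathrm{exp}}\right) = 1500 \log\left(\frac{1500}{1500}\right) = 0$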
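The genome-wide expectation can be verified numerically. In this sketch the factor of 2 is assumed to account for both DNA strands, and the example of a factor with 100 sites is hypothetical, chosen to show the sign of the score for a typical regulon size.

```python
import math

p_site = 1 / 16000      # chance that a given position starts a 7-mer match
genome_length = 12e6    # yeast genome, ~12 Mb
strands = 2             # assumption: both strands are searched

expected = p_site * strands * genome_length
print(expected)  # about 1500

def approx_map(n_found, n_expected):
    """Approximate MAP score: N * log10(N / exp)."""
    return n_found * math.log10(n_found / n_expected)

# A hypothetical factor with 100 real sites scores far below zero genome-wide.
print(approx_map(100, expected))
```

This is the quantitative reason the input must be restricted to a small set of candidate promoters before searching.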

Problem: most transcription factor binding sites will only occur in dozens to hundreds of genes

Solution: cluster genes before searching for motifs

[Figure: genes plotted in expression space; axes: Time-point 1, Time-point 2, Time-point 3]

Group Specificity Score:

How well does a motif target the genes used to find it, compared to the whole genome?

[Figure: Venn diagram of the whole genome (N), the ORF group used to find the motif (S1), the ORFs with the best sites (S2), and their intersection (X)]

N = Total # of ORFs in the genome (6226)

S1 = # ORFs used to align the motif

S2 = # targets in the genome (~ 100 ORFs with best ScanACE scores)

X = size of the intersection of S1 and S2

What is the probability of such a large intersection?

$$P(x) = \frac{\binom{S_1}{x}\binom{N - S_1}{S_2 - x}}{\binom{N}{S_2}}$$


Group Specificity Score:

How well does a motif target the genes used to find it, compared to the whole genome?

[Figure: Venn diagram of the whole genome (N), the ORF group used to find the motif (S1), the ORFs with the best sites (S2), and their intersection (X)]

N = Total # of ORFs in the genome (6226)

S1 = # ORFs used to align the motif

S2 = # targets in the genome (~ 100 ORFs with best ScanACE scores)

X = size of the intersection of S1 and S2

What is the probability of such a large intersection?

$$S = \sum_{i=x}^{\min(S_1, S_2)} \frac{\binom{S_1}{i}\binom{N - S_1}{S_2 - i}}{\binom{N}{S_2}}$$
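The group specificity score is a hypergeometric tail probability and can be computed exactly with integer binomial coefficients. The sketch below is illustrative; the group size of 50 ORFs and the overlap of 20 are hypothetical inputs, while N = 6226 and S2 = 100 come from the slide.

```python
from math import comb

def group_specificity(N, S1, S2, x):
    """P(intersection >= x) when S2 ORFs are drawn from N, of which S1 are in the group."""
    total = comb(N, S2)
    upper = min(S1, S2)
    tail = sum(comb(S1, i) * comb(N - S1, S2 - i) for i in range(x, upper + 1))
    return tail / total

# Hypothetical example: a 50-ORF cluster, 20 of which are among the 100 best sites.
print(group_specificity(N=6226, S1=50, S2=100, x=20))
```

By chance fewer than one of the 100 best sites would be expected to fall in a 50-ORF group, so an overlap of 20 yields an extremely small probability, that is, a highly group-specific motif.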


Positional Bias Score:

Measures the degree of preference for positioning in a particular range upstream of the translational start.

[Figure: histogram of #ORFs against site position, from Start to -600 bp, in 50 bp windows]

• Find best 200 sites in the genome

• Restrict sites to a segment of length [s = 600 bp] from the translation start

• t = # sites in the segment

• Choose window size [w = 50 bp]

• m = # sites in the most enriched window

What is the probability to have m or more sites in a window of size w?

Probability of exactly m sites in the window:

$$\binom{t}{m}\left(\frac{w}{s}\right)^{m}\left(1 - \frac{w}{s}\right)^{t - m}$$

• Find best 200 sites in the genome

• Restrict sites to a segment of length [s = 600 bp] from the translation start

• t = # sites in the segment

• Choose window size [w = 50 bp]

• m = # sites in the most enriched window

What is the probability to have m or more sites in a window of size w?

$$P = \sum_{i=m}^{t} \binom{t}{i}\left(\frac{w}{s}\right)^{i}\left(1 - \frac{w}{s}\right)^{t - i}$$

[Figure: histogram of #ORFs against site position, from Start to -600 bp, in 50 bp windows]
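The positional bias score is a binomial tail: each of the t sites falls into a given window with probability w/s. The following sketch evaluates it directly; the values t = 150 and m = 40 are hypothetical inputs, and the calculation does not correct for having chosen the most enriched of several windows.

```python
from math import comb

def positional_bias(t, m, w=50, s=600):
    """P(at least m of t sites fall in one window of width w within a segment of length s)."""
    p = w / s
    return sum(comb(t, i) * p**i * (1 - p) ** (t - i) for i in range(m, t + 1))

# Hypothetical example: 150 of the best 200 sites lie in the 600 bp segment,
# and 40 of them sit in a single 50 bp window.
print(positional_bias(t=150, m=40))
```

With w/s = 1/12, about 12 or 13 of 150 sites are expected per window, so 40 in one window indicates strong positional preference.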

Lecture Topics

• Introduction to DNA regulatory motifs

• AlignACE - A motif finding algorithm

• Assessment of motifs

• AlignACE results on yeast genome

• Summary & Conclusions

Comparisons of motifs

• The CompareACE program finds best alignment between two motifs and calculates the correlation between the two position-specific scoring matrices

• Similar motifs: CompareACE score > 0.7


Clustering motifs by similarity

motif A, motif B, motif C, motif D → CompareACE → pairwise CompareACE scores:

     A     B     C     D
A    1.0   0.9   0.1   0.0
B          1.0   0.2   0.1
C                1.0   0.8
D                      1.0

→ Hierarchical clustering →

cluster 1: A, B
cluster 2: C, D

1 2 3 4 5 6

A 0.8 0.4 1 0.6 0 1

C 0 0 0 0 0 0

G 0.2 0.6 0 0 1 0

T 0 0 0 0.4 0 0

1 2 3 4 5 6

A 0.4 0.4 1 0.6 0 0

C 0 0 0 0 0 1

G 0.6 0.6 0 0 1 0

T 0 0 0 0.4 0 0
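To make the idea of a correlation between two position-specific matrices concrete, the sketch below computes a Pearson correlation over all entries of the two matrices above. It is a simplification: it assumes the motifs are already aligned at the same offset, whereas CompareACE also searches for the best alignment, and the function name is illustrative.

```python
import math

BASES = "ACGT"

motif_1 = {"A": [0.8, 0.4, 1, 0.6, 0, 1], "C": [0, 0, 0, 0, 0, 0],
           "G": [0.2, 0.6, 0, 0, 1, 0],   "T": [0, 0, 0, 0.4, 0, 0]}
motif_2 = {"A": [0.4, 0.4, 1, 0.6, 0, 0], "C": [0, 0, 0, 0, 0, 1],
           "G": [0.6, 0.6, 0, 0, 1, 0],   "T": [0, 0, 0, 0.4, 0, 0]}

def matrix_correlation(m1, m2):
    """Pearson correlation between two equally sized frequency matrices."""
    x = [v for base in BASES for v in m1[base]]
    y = [v for base in BASES for v in m2[base]]
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

print(matrix_correlation(motif_1, motif_2))
```

For these two matrices the value comes out at roughly 0.63: they agree at positions 2 to 5 but differ at position 1 and completely at position 6, so under this simplified measure they would fall just short of the 0.7 similarity threshold.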

Most Group Specific Motifs

Most Positional Biased Motifs

Negative Controls

• 250 AlignACE runs on randomly created groups of ORFs, of size 20, 40, 60, 80, and 100 ORFs.

[Figure: MAP score distributions for random and real groups]

• MAP cutoff of 10, Group Specificity cutoff of $10^{-10}$: False Positives = 10-20%

Positive Controls

• 29 listed TFs with five or more known binding sites were chosen.

• AlignACE was run on the upstream regions of the corresponding regulated genes.

• An appropriate motif was found in 21/29 cases.

• False negative rate = ~ 10-30 %


The data

• Organism: Saccharomyces cerevisiae

• Microarray experiment: Affymetrix microarrays of 6,220 mRNAs

• Data: gathered by Cho et al.

• 15 time points, spanning about 4 hours across two cell cycles

• Genome sequence

Typical clusters of genes in the data

Variance normalization and clustering of expression time series

• The 3,000 most variable ORFs were chosen, based on the normalized dispersion in expression level of each gene across the time points (s.d./mean).

• The 15 time points were used to construct a 3,000 by 15 data matrix.

• The variance of each gene was normalized across the 15 conditions by subtracting the mean across the time points from the expression level of each gene and dividing by the standard deviation across the time points.
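The normalization step can be sketched as follows. The slides do not say whether the population or sample standard deviation was used; this sketch assumes the population form (`statistics.pstdev`), and the example profiles are made up.

```python
from statistics import mean, pstdev

def normalize_profile(profile):
    """Subtract the mean and divide by the standard deviation across time points."""
    mu = mean(profile)
    sd = pstdev(profile)  # assumption: population standard deviation
    return [(x - mu) / sd for x in profile]

# Two made-up genes with the same shape but different amplitude and baseline.
gene_a = [1.0, 2.0, 3.0, 2.0, 1.0]
gene_b = [10.0, 30.0, 50.0, 30.0, 10.0]
print(normalize_profile(gene_a))
print(normalize_profile(gene_b))
```

After normalization both genes have identical profiles, so a Euclidean distance between them reflects the shape of the time course rather than the absolute expression level.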

Before and after mean-variance normalization

[Figure: two panels for Gene1, Gene2 and Gene3. "Before normalization": Gene Expression versus Time. "After normalization": Normalized Expression versus Time.]

Representation of expression data

[Figure: normalized expression data from microarrays; Gene 1 and Gene 2 shown as points along time-point axes]

Euclidean distance

K-means

• Start with random positions of centroids.

[Figure: data points Xi and centroids C; Iteration = 0]

Choosing K

Since we don’t know the number of clusters in advance we need a way to estimate it.

In order to choose the number of clusters K, the Sum of Squares of Errors is calculated for different K values. A clear break point indicates the “natural” number of clusters in the data.

[Figure: sum of squared errors as a function of K]

Significant enrichment of functional categories within clusters

• Each gene was mapped into one of 199 functional categories (according to the MIPS database).

• For each cluster, P-values were calculated for observing the frequencies of genes from particular functional categories.

• There was significant grouping of genes within the same cluster.

The hyper-geometric score

P-values were calculated for finding at least (k) ORFs from a particular functional category within a cluster of size (n), where (f) is the total number of genes within a functional category and (g) is the total number of genes within the genome (6,220).

P-values greater than 3×10^-4 are not reported, as their total expectation within the cluster would be higher than 0.05, since 199 MIPS functional categories were tested (ref. 15).
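Written out with the symbols defined here, the probability of finding at least k ORFs from the category is the standard hypergeometric tail. The formula below is the textbook form, given as a reconstruction rather than a copy of the slide, together with the arithmetic behind the reporting threshold.

```latex
P = \sum_{i=k}^{\min(n, f)} \frac{\binom{f}{i}\binom{g - f}{n - i}}{\binom{g}{n}},
\qquad
\frac{0.05}{199} \approx 2.5 \times 10^{-4} \approx 3 \times 10^{-4}
```

The second expression shows where the cutoff comes from: with 199 categories tested per cluster, a per-test threshold near 3×10^-4 keeps the expected number of chance hits per cluster at about 0.05.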

Challenge: generalize hyper-geometric for more than two sets

[Figure: overlap of three gene sets: Chr V, expression cluster, functional group]

Sequence: MCB element consensus

This motif was later mapped to the literature and confirmed to be the very well known MCB element which is known to control the periodicity of the genes which peak at G1-S.

[Figure labels: nucleotides, MCB element, clusters]

Occurrence of the motif in all ORFs of each cluster

Location of the motif: MCB element

[Figure axis: distance from ATG (bp)]

SCB element

This motif (later found to be the SCB element) was the second-highest-scoring motif within this cluster. The SCB element is also a very well-known cis-regulatory element which contributes to the periodicity of the genes within the G1-S regulon.

ribonucleotide reductase

Determining the cell-cycle periodicity of clusters

Fourier analysis allows the genes to be ranked according to their cell-cycle periodicity.
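One way to turn this into a ranking is to ask what fraction of a profile's variance sits at the cell-cycle frequency. The sketch below computes a single Fourier component by hand; the score definition, the function name, and the period expressed in sampling steps are illustrative assumptions rather than the exact procedure used in the lecture.

```python
import math

def periodicity_power(profile, period):
    """Fraction of the profile's variance carried by one sinusoid of the given period."""
    n = len(profile)
    mu = sum(profile) / n
    centered = [x - mu for x in profile]
    total = sum(x * x for x in centered)
    if total == 0:
        return 0.0  # a flat profile has no periodic component
    angle = 2 * math.pi / period
    re = sum(x * math.cos(angle * t) for t, x in enumerate(centered))
    im = sum(x * math.sin(angle * t) for t, x in enumerate(centered))
    # Normalized so that a pure sinusoid spanning whole periods scores 1.
    return (re * re + im * im) / (n * total / 2)

periodic = [math.sin(2 * math.pi * t / 8) for t in range(16)]
alternating = [1.0, -1.0] * 8
print(periodicity_power(periodic, period=8))
print(periodicity_power(alternating, period=8))
```

Genes or cluster means can then be sorted by this score, with values near 1 indicating strongly cell-cycle-periodic expression.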

[Figure: expression matrix and plots of expression versus time; panels: cell cycle, high periodicity, low periodicity, low periodicity]

Explain FFT… (including ORs variability)

Periodic clusters

Non-periodic clusters

And this was just the beginning…

In case of two motifs derived from a cluster:

Collaboration?

• Co-occurrence (AND)

• Redundancy (OR)

http://longitude.weizmann.ac.il/publications/PilpelNatGent01.pdf

Logic of interaction of motifs

[Figure: expression level versus time for three gene sets: Only M1 (EC=0.05), Only M2 (EC=0.05), M1 AND M2 (EC=0.23). Other labels: G2, M1, M2]

Synergistic motifs

A combination of two motifs is called ‘synergistic’ if the expression coherence score of the genes that have the two motifs is significantly higher than the scores of the genes that have either of the motifs
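The slides do not define the expression coherence (EC) score. The sketch below assumes the definition commonly associated with Pilpel et al. 2001: the fraction of gene pairs in a set whose normalized expression profiles lie within a threshold Euclidean distance of each other. The threshold value and the toy profiles are hypothetical.

```python
import math
from itertools import combinations

def expression_coherence(profiles, threshold):
    """Fraction of gene pairs whose profiles are closer than the threshold."""
    pairs = list(combinations(profiles, 2))
    if not pairs:
        return 0.0
    close = sum(1 for a, b in pairs if math.dist(a, b) < threshold)
    return close / len(pairs)

# Toy profiles: two similar genes and one outlier (hypothetical values).
genes_with_both_motifs = [(1.0, -1.0, 0.0), (0.9, -1.1, 0.1), (-1.0, 1.0, 0.0)]
print(expression_coherence(genes_with_both_motifs, threshold=0.5))
```

Under this definition, a synergy test compares the EC of genes carrying both motifs with the EC of genes carrying only one of them, as in the EC=0.23 versus EC=0.05 example.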

SFFMcm1

A global map of combinatorial expression control

[Figure: network of motif combinations. Motif labels: mRPE72, SWI5, SFF ', MCM1, SFFMCM1', ECB, SCB, MCB, PAC, mRRPE, mRRSE3, GCN4, BAS1, LYS14, RAP1, mRPE34, mRPE57, mRPE6, mRPE58, STRE, RPN4, ABF1, PDR, CCA, PHO4, AFT1, STE12, MIG1, CSRE, HAP234, ALPHA1', ALPHA1, ALPHA2, mRPE8, mRPE69. Conditions: heat-shock, cell cycle, sporulation, diauxic shift, MAPK signaling, DNA damage.]

* High connectivity

* Hubs

* Alternative partners in various conditions

Pilpel et al. Nature Genetics 2001

The human cell cycle

G1-Phase S-Phase

G2-Phase M-Phase

The proliferation cluster genes are cell cycle periodic

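[Figure: gene expression (-4 to 4) of the proliferation cluster across samples; labels: G2/M, G1/S, CHR]

[Figure: proportion of genes carrying each motif, for all genes and for proliferation genes; positions 200, 150, 100 and 50 bp from the TSS; motifs: NFY, E2F, ELK1, CDE, CHR]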

The cell cycle motifs are enriched among the proliferation cluster genes

Not in the cluster, mutated in cancer

Regulation of the proliferation cluster: significant motifs

Motif    P-value
NFY      3.74*10^-11
CDE      5.31*10^-10
E2F      2.37*10^-09
ELK1     3.10*10^-06
CHR      1.42*10^-05

[Figure: sequence logo for each motif]

1000 bp upstream; 326 MatInspector motifs

Potential regulatory motifs in 3’ UTRs

Finding 3’ UTRs elements associated with high/low transcript stability (in yeast)

[Figure: motifs AAGCTTCC and CCTACAAC compared with the entire genome]
