Modeling Regulatory Motifs

3/26/2013

Transcriptional Regulation

Transcription is controlled by the interaction of trans-acting elements called transcription factors (TFs) with cis-acting elements of DNA.

Prediction of cis-acting elements or TF binding sites is a challenging problem in computational biology.

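[Figure: Transcriptional regulation in prokaryotes. Labeled elements: promoter region with TF binding sites (TF1, TF2) and positions -300, -35, -10; transcription start site (TSS, +1); RNA polymerase subunits (α, α, β, β', σ); 5' UTR; ribosome binding site; 3' UTR; terminator; transcribed RNA.]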

Specific Protein-DNA interactions

Protein-DNA interactions are specific, guaranteeing that transcriptional regulation is specific and precise.

The specificity of protein-DNA interactions is realized by the 3-D structures of the DNA-binding face of the TF protein and of the TF binding site in the DNA sequence.

Usually a TF recognizes variable but similar binding sites associated with different genes.

The set of all binding sites recognized by the same TF is called a TF-binding motif.

Experimental determination of binding sites

There are in vitro and in vivo methods for determining the binding sites of TFs.

Systematic evolution of ligands by exponential enrichment (SELEX), an in vitro method, is likely to identify all possible sequences recognized by a TF.

SELEX may not work if the TF-DNA interaction requires unknown co-factors.

The method is laborious, because tedious molecular cloning and sequencing are required to determine the binding sites.

Geertz M and Maerkl SJ. Briefings in Functional Genomics 2010;9:362-373.

Experimental determination of binding sites

Protein binding microarray (PBM) is another in vitro method. It avoids the molecular cloning step, and the binding sites can be read out directly from the microarray.

PBM can determine binding sites at single-base resolution.

Like SELEX, PBM may not work if the TF-DNA interaction requires unknown co-factors.

PBM may also fail if the binding site is long, e.g., longer than 12 bp.

A putative binding site determined by PBM is not necessarily a real binding site in cells.

Geertz M and Maerkl SJ. Briefings in Functional Genomics 2010;9:362-373.

Experimental determination of binding sites

ChIP-seq and ChIP-chip are two high-throughput in vivo methods for determining the binding sites of a TF.

ChIP-seq and ChIP-chip can determine actual binding sites in a genome, but to determine all binding sites, many cell types need to be explored.

Geertz M and Maerkl SJ. Briefings in Functional Genomics 2010;9:362-373.

Profile representation of TF binding sites

Examples of σ70 binding sites in E. coli:

TACGAT
TATAAT
TATAAT
GATACT
TATGAT
TATGTT
TATAGT

Consensus sequence: TATAAT

Regular expression: [TG]A[TC][GA]XT (X denotes any base)

Frequency matrix:

     1  2  3  4  5  6
A    0  7  0  4  4  0
C    0  0  1  0  1  0
G    1  0  0  3  1  0
T    6  0  6  0  1  7

To avoid zero counts, add a pseudocount of 1:

     1  2  3  4  5  6
A    1  8  1  5  5  1
C    1  1  2  1  2  1
G    2  1  1  4  2  1
T    7  1  7  1  2  8
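The two count tables can be reproduced directly from the seven example sites. The following is a minimal sketch; the names `SITES`, `count_matrix` and `consensus` are illustrative and do not come from the slides.

```python
# The seven sigma-70 binding sites listed on the slide.
SITES = ["TACGAT", "TATAAT", "TATAAT", "GATACT", "TATGAT", "TATGTT", "TATAGT"]
BASES = "ACGT"

def count_matrix(sites, pseudocount=0):
    """Return {base: [count at position 1..l]}, optionally adding a pseudocount."""
    length = len(sites[0])
    counts = {b: [pseudocount] * length for b in BASES}
    for site in sites:
        for i, base in enumerate(site):
            counts[base][i] += 1
    return counts

def consensus(counts):
    """Pick the most frequent base at each position."""
    length = len(counts["A"])
    return "".join(max(BASES, key=lambda b: counts[b][i]) for i in range(length))

raw = count_matrix(SITES)
smoothed = count_matrix(SITES, pseudocount=1)
print(raw["A"])        # [0, 7, 0, 4, 4, 0]
print(smoothed["T"])   # [7, 1, 7, 1, 2, 8]
print(consensus(raw))  # TATAAT
```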

Profile representation of TF binding sites

Profile: for a motif of n samples (sequences), the probability of residue b at position i is

$$p_{b,i} = \frac{n_{b,i} + k}{n + 4k},$$

where $n_{b,i}$ is the number of occurrences (frequency) of residue b at position i, and k is a pseudocount to avoid zero probabilities.

Profile $p_{b,i}$ of the σ70 binding sites in E. coli, pseudocount k = 1:

      1     2     3     4     5     6
A   0.09  0.73  0.09  0.45  0.45  0.09
C   0.09  0.09  0.18  0.09  0.18  0.09
G   0.18  0.09  0.09  0.36  0.18  0.09
T   0.64  0.09  0.64  0.09  0.18  0.73
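As a quick check of the profile formula, the sketch below computes $(n_{b,i} + k)/(n + 4k)$ for the example sites with k = 1. The function name `profile` is an illustrative choice, not something defined in the slides.

```python
SITES = ["TACGAT", "TATAAT", "TATAAT", "GATACT", "TATGAT", "TATGTT", "TATAGT"]
BASES = "ACGT"

def profile(sites, k=1.0):
    """p[b][i] = (n_{b,i} + k) / (n + 4k) for every base b and position i."""
    n = len(sites)
    length = len(sites[0])
    return {
        b: [(sum(s[i] == b for s in sites) + k) / (n + 4 * k) for i in range(length)]
        for b in BASES
    }

p = profile(SITES, k=1)
for b in BASES:
    print(b, [round(x, 2) for x in p[b]])
```

Each column of the resulting profile sums to 1, because the pseudocount k is added to all four bases and 4k to the denominator.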

Position-specific weight (scoring) matrix (PSWM): for a motif of n samples, the weight of residue b at position i is defined as

$$w_{b,i} = \log_2 \frac{p_{b,i}}{p_b},$$

where $p_{b,i}$ is the probability of residue b at position i, and $p_b$ is the probability of residue b in the background sequences.

PSWM of the σ70 binding sites in E. coli, assuming pA = pC = pG = pT = 0.25:

       1      2      3      4      5      6
A   -1.46   1.54  -1.46   0.86   0.86  -1.46
C   -1.46  -1.46  -0.46  -1.46  -0.46  -1.46
G   -0.46  -1.46  -1.46   0.54  -0.46  -1.46
T    1.35  -1.46   1.35  -1.46  -0.46   1.54
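The log-odds weights can be derived from the same sites in a few lines. This is a sketch under the slide's assumptions (pseudocount k = 1, uniform background of 0.25); the function name `pswm` and its parameters are illustrative.

```python
import math

SITES = ["TACGAT", "TATAAT", "TATAAT", "GATACT", "TATGAT", "TATGTT", "TATAGT"]
BASES = "ACGT"

def pswm(sites, k=1.0, background=None):
    """w[b][i] = log2(p_{b,i} / p_b); k must be > 0 so that no p_{b,i} is zero."""
    background = background or {b: 0.25 for b in BASES}
    n, length = len(sites), len(sites[0])
    weights = {}
    for b in BASES:
        weights[b] = []
        for i in range(length):
            p_bi = (sum(s[i] == b for s in sites) + k) / (n + 4 * k)
            weights[b].append(math.log2(p_bi / background[b]))
    return weights

w = pswm(SITES)
for b in BASES:
    print(b, [round(x, 2) for x in w[b]])
```

Positive weights mark bases that are enriched relative to the background, and negative weights mark depleted bases.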

Information content at position i of the sequence profile is given by:

$$I_i = \sum_{b \in \{A,C,G,T\}} p_{b,i} \log_2 \frac{p_{b,i}}{p_b}$$

Information content of a motif of length l:

$$I = \sum_{i=1}^{l} \sum_{b \in \{A,C,G,T\}} p_{b,i} \log_2 \frac{p_{b,i}}{p_b}$$

Terms $p_{b,i} \log_2 (p_{b,i}/p_b)$ for the σ70 binding sites (k = 1, $p_b$ = 0.25):

        1       2       3       4       5       6
A    -0.133   1.120  -0.133   0.392   0.392  -0.133
C    -0.133  -0.133  -0.084  -0.133  -0.084  -0.133
G    -0.084  -0.133  -0.133   0.197  -0.084  -0.133
T     0.858  -0.133   0.858  -0.133  -0.084   1.120
I_i   0.509   0.722   0.509   0.323   0.141   0.722     I = 2.927

Logo representation:

$$I_i = 2 + \sum_{b \in \{A,C,G,T\}} p_{b,i} \log_2 p_{b,i} - e(n),$$

where e(n) is a correction factor required when one has only a few (n) samples. A pseudocount is not added when computing $p_{b,i}$.

The height of each base is $h_{b,i} = p_{b,i} I_i$.

http://weblogo.berkeley.edu/logo.cgi
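Both information measures can be computed with a short script. The sketch below is illustrative: the function names are hypothetical, and the slides do not give a formula for e(n). The code assumes the commonly used small-sample approximation e(n) = 3 / (2 ln 2 · n) for DNA, so treat that line as an assumption rather than as the lecture's definition.

```python
import math

SITES = ["TACGAT", "TATAAT", "TATAAT", "GATACT", "TATGAT", "TATGTT", "TATAGT"]
BASES = "ACGT"

def column_probs(sites, i, k=0.0):
    """Base probabilities at position i, with pseudocount k."""
    n = len(sites)
    return {b: (sum(s[i] == b for s in sites) + k) / (n + 4 * k) for b in BASES}

def relative_entropy(sites, k=1.0, background=0.25):
    """I_i = sum_b p_{b,i} * log2(p_{b,i} / p_b), one value per position."""
    result = []
    for i in range(len(sites[0])):
        p = column_probs(sites, i, k)
        result.append(sum(p[b] * math.log2(p[b] / background) for b in BASES if p[b] > 0))
    return result

def logo_heights(sites, correction=0.0):
    """h_{b,i} = p_{b,i} * I_i with I_i = 2 + sum_b p log2 p - e(n); no pseudocount."""
    heights = []
    for i in range(len(sites[0])):
        p = column_probs(sites, i, k=0.0)
        info = 2 + sum(p[b] * math.log2(p[b]) for b in BASES if p[b] > 0) - correction
        heights.append({b: p[b] * info for b in BASES})
    return heights

per_position = relative_entropy(SITES)
print([round(x, 3) for x in per_position])
print(round(sum(per_position), 3))  # 2.927, the motif total in the table

# Assumed small-sample correction (not given in the slides).
e_n = 3 / (2 * math.log(2) * len(SITES))
heights = logo_heights(SITES, correction=e_n)
print(round(heights[1]["A"], 3))    # height of A at position 2
```

Terms with $p_{b,i} = 0$ are skipped, following the convention that 0 · log 0 = 0.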

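Score of a sequence using a PSWM

If we represent a sequence S = {b1 b2 … bj … bl} of length l as a binary matrix S = {s_{j,b}} of size l × 4, for example for S = TATAAT:

     A  C  G  T
1    0  0  0  1
2    1  0  0  0
3    0  0  0  1
4    1  0  0  0
5    1  0  0  0
6    0  0  0  1

then the score of the sequence against a profile (or PSWM) W = {w_{b,j}} is defined as

$$\mathrm{score}(S) = \sum_{j=1}^{l} s_j \cdot w_j,$$

where $s_j$ is row j of S and $w_j$ is column j of W.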

Score of a sequence using a PSWM

TATAAT = {s_{j,b}} =

     A  C  G  T
1    0  0  0  1
2    1  0  0  0
3    0  0  0  1
4    1  0  0  0
5    1  0  0  0
6    0  0  0  1

W = {w_{b,j}} =

       1       2       3       4       5       6
A   -1.46   1.541  -1.46   0.862   0.862  -1.46
C   -1.46  -1.46   -0.46  -1.46   -0.46   -1.46
G   -0.46  -1.46   -1.46   0.541  -0.46   -1.46
T    1.348 -1.46    1.348 -1.46   -0.46    1.541

$$\mathrm{score}(S) = \sum_{j=1}^{l} s_j \cdot w_j = 1.348 + 1.541 + 1.348 + 0.862 + 0.862 + 1.541 = 7.502$$
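The worked example can be checked numerically. The sketch below copies the weights from the table above and uses an explicit one-hot (binary) matrix, mirroring the definition of the score; `one_hot` and `score` are illustrative names.

```python
BASES = "ACGT"

# PSWM values copied from the slide (rows: bases, columns: positions 1..6).
W = {
    "A": [-1.46, 1.541, -1.46, 0.862, 0.862, -1.46],
    "C": [-1.46, -1.46, -0.46, -1.46, -0.46, -1.46],
    "G": [-0.46, -1.46, -1.46, 0.541, -0.46, -1.46],
    "T": [1.348, -1.46, 1.348, -1.46, -0.46, 1.541],
}

def one_hot(seq):
    """Binary l x 4 matrix: row j has a 1 in the column of base seq[j]."""
    return [[1 if b == base else 0 for base in BASES] for b in seq]

def score(seq):
    """score(S) = sum_j s_j . w_j, the dot product of row j of S with column j of W."""
    s = one_hot(seq)
    return sum(
        s[j][c] * W[base][j]
        for j in range(len(seq))
        for c, base in enumerate(BASES)
    )

print(one_hot("TATAAT")[0])       # [0, 0, 0, 1]
print(round(score("TATAAT"), 3))  # 7.502
```

In practice the one-hot matrix is usually skipped and the weights are looked up directly as `W[base][j]`. The explicit matrix is kept here only to match the definition.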

Higher order PSWM

To account for the dependence among adjacent positions of the TF-DNA interaction, we can use higher order PSWMs.

A higher order PSWM corresponds to a k-th order Markov chain, in which position i depends on the previous k positions.

A higher order PSWM is also called a position weight array.

Binding sites:

TACGAT
TATAAT
TATAAT
GATACT
TATGAT
TATGTT
TATAGT

To avoid zero counts, add a pseudocount of 1.

First-order PSWM for the σ70 factor binding sites (column i counts the dinucleotide at positions i and i+1):

1st 2nd   1  2  3  4  5
A   A     1  1  1  3  1
A   C     1  2  1  2  1
A   G     1  1  1  2  1
A   T     1  7  1  1  5
C   A     1  1  1  1  1
C   C     1  1  1  1  1
C   G     1  1  2  1  1
C   T     1  1  1  1  2
G   A     2  1  1  3  1
G   C     1  1  1  1  1
G   G     1  1  1  1  1
G   T     1  1  1  2  2
T   A     7  1  5  1  1
T   C     1  1  1  1  1
T   G     1  1  3  1  1
T   T     1  1  1  1  2
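The dinucleotide table can be regenerated from the sites. The following sketch assumes the column convention visible in the table, where column i counts the pair of bases at positions i and i+1; `dinucleotide_counts` is an illustrative name.

```python
from itertools import product

SITES = ["TACGAT", "TATAAT", "TATAAT", "GATACT", "TATGAT", "TATGTT", "TATAGT"]
BASES = "ACGT"

def dinucleotide_counts(sites, pseudocount=1):
    """counts[xy][i] = pseudocount + number of sites with x at i and y at i+1."""
    length = len(sites[0])
    counts = {a + b: [pseudocount] * (length - 1) for a, b in product(BASES, repeat=2)}
    for site in sites:
        for i in range(length - 1):
            counts[site[i:i + 2]][i] += 1
    return counts

counts = dinucleotide_counts(SITES)
print(counts["TA"])  # [7, 1, 5, 1, 1]
print(counts["AT"])  # [1, 7, 1, 1, 5]
```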

Maximal dependence decomposition

Maximal dependence decomposition (MDD) models the dependence between any two positions. It estimates the extent to which the nucleotide bj at position j depends on the nucleotide bi at position i.

MDD uses the χ² test to determine whether position j depends on position i.

Binding sites:

T A C G A T
T A T A A T
T A T A A T
G A T A C T
T A T G A T
T A T G T T
T A T A G T

Consensus bases: T A T A A T

Non-consensus bases by position: 1: G; 2: none; 3: C; 4: G; 5: C, G, T; 6: none.

For each position i, we divide the binding sites into two groups:

Ci: binding sites having the consensus base at i;

C̄i: binding sites having a non-consensus base at i.

Example for i = 5 (consensus base A):

Ci:
T A C G A T
T A T A A T
T A T A A T
T A T G A T

C̄i:
G A T A C T
T A T G T T
T A T A G T

Maximal dependence decomposition

Let $f_b$ be the probability of base b at position j in the binding sites in C̄i. Let N be the total number of binding sites in Ci and $N_b$ the count of base b at position j in Ci. The χ² statistic is then defined as

$$\chi^2 = \frac{(N_A - N f_A)^2}{N f_A} + \frac{(N_C - N f_C)^2}{N f_C} + \frac{(N_G - N f_G)^2}{N f_G} + \frac{(N_T - N f_T)^2}{N f_T}$$

[Figure: the binding sites split into Ci and C̄i; the base frequencies fA, fC, fG, fT at position j are taken from C̄i, and the counts NA, NC, NG, NT are taken from the N binding sites of Ci.]
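The statistic can be sketched in code as follows. Two choices here are assumptions that the slides do not specify: the frequencies $f_b$ are smoothed with a pseudocount k = 1 so that no expected count N·$f_b$ is zero, and the statistic is defined as 0 when either group is empty. All function names are illustrative.

```python
BASES = "ACGT"
SITES = ["TACGAT", "TATAAT", "TATAAT", "GATACT", "TATGAT", "TATGTT", "TATAGT"]
CONSENSUS = "TATAAT"

def split_by_consensus(sites, consensus, i):
    """Return (C_i, C_i-bar): sites with and without the consensus base at i."""
    with_cons = [s for s in sites if s[i] == consensus[i]]
    without = [s for s in sites if s[i] != consensus[i]]
    return with_cons, without

def chi_square(sites, consensus, i, j, k=1.0):
    """chi^2(j|i) = sum_b (N_b - N f_b)^2 / (N f_b)."""
    c_i, c_bar = split_by_consensus(sites, consensus, i)
    if not c_i or not c_bar:
        return 0.0  # assumption: without two groups there is nothing to compare
    n = len(c_i)
    total = 0.0
    for b in BASES:
        # f_b: smoothed frequency of base b at position j in C_i-bar (assumed k = 1)
        f_b = (sum(s[j] == b for s in c_bar) + k) / (len(c_bar) + 4 * k)
        n_b = sum(s[j] == b for s in c_i)  # observed count of b at j in C_i
        expected = n * f_b
        total += (n_b - expected) ** 2 / expected
    return total

# Indices are 0-based in code: i=4, j=3 correspond to positions 5 and 4.
print(round(chi_square(SITES, CONSENSUS, i=4, j=3), 3))  # 1.833
```

With only seven sites the value is not statistically meaningful. The example only shows how the terms of the formula are assembled.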

Maximal dependence decomposition

This χ² statistic describes the dependence of position j on position i, and is denoted χ²(j|i).

The MDD approach proceeds iteratively as follows.

1. For each position i, compute $S_i = \sum_{j \neq i} \chi^2(j|i)$;

2. Among all the positions, select the position i with maximum $S_i$, and partition the sequences into two groups Ci and C̄i;

3. Repeat steps 1 and 2 separately for Ci and C̄i;

4. Stop if there is no significant dependence or if there is an insufficient number of binding sites in Ci or C̄i. In either case, construct a standard PSWM for the remaining subset of binding sites.
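Steps 1 and 2 of the procedure can be sketched as a single split. The code below is an illustration, not the lecture's implementation: it smooths $f_b$ with an assumed pseudocount of 1, treats an empty group as zero dependence, and leaves out the recursion and the stopping rules of steps 3 and 4. The name `mdd_split` is hypothetical.

```python
BASES = "ACGT"
SITES = ["TACGAT", "TATAAT", "TATAAT", "GATACT", "TATGAT", "TATGTT", "TATAGT"]
CONSENSUS = "TATAAT"

def chi_square(sites, consensus, i, j, k=1.0):
    """chi^2(j|i), with f_b smoothed by an assumed pseudocount k."""
    c_i = [s for s in sites if s[i] == consensus[i]]
    c_bar = [s for s in sites if s[i] != consensus[i]]
    if not c_i or not c_bar:
        return 0.0  # assumption: no second group means no measurable dependence
    total = 0.0
    for b in BASES:
        f_b = (sum(s[j] == b for s in c_bar) + k) / (len(c_bar) + 4 * k)
        expected = len(c_i) * f_b
        total += (sum(s[j] == b for s in c_i) - expected) ** 2 / expected
    return total

def mdd_split(sites, consensus, k=1.0):
    """Step 1: S_i = sum_{j != i} chi^2(j|i). Step 2: split on the position with max S_i."""
    length = len(consensus)
    s_values = [
        sum(chi_square(sites, consensus, i, j, k) for j in range(length) if j != i)
        for i in range(length)
    ]
    best = max(range(length), key=lambda i: s_values[i])
    c_i = [s for s in sites if s[best] == consensus[best]]
    c_bar = [s for s in sites if s[best] != consensus[best]]
    return best, s_values, c_i, c_bar

best, s_values, c_i, c_bar = mdd_split(SITES, CONSENSUS)
print("S_i per position:", [round(v, 3) for v in s_values])
print("split on position", best + 1)
print("C_i:", c_i)
print("C_i-bar:", c_bar)
```

A full MDD implementation would call `mdd_split` recursively on both groups and stop when a significance threshold or a minimum group size is no longer met, then build one PSWM per leaf.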

Maximal dependence decomposition

Illustration of the MDD procedure: modeling

[Figure: MDD tree built from a set of binding sites (AACGTG, AGGCTG, AGCTTT, ...). The sites are partitioned first at position 1 (maximum S1) and then at position 3 (maximum S3); splitting stops where the dependence is insufficient, and PSWM1 and PSWM2 are built for the resulting subsets.]


Maximal dependence decomposition

Illustration of the MDD procedure: scoring

X = AAGGTG

Position 1 has the consensus base 'A'.

Position 3 has the non-consensus base 'G'.

Therefore, score X using PSWM2.

AGCGTG

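Modeling and detecting arbitrary dependencies

We can also use a digraph to model the dependence among the positions S1, S2, S3, S4:

(a) All positions are independent:

$$p(x_1, x_2, x_3, x_4) = p(x_1)\,p(x_2)\,p(x_3)\,p(x_4)$$

(b) Each position depends on the preceding position:

$$p(x_1, x_2, x_3, x_4) = p(x_1)\,p(x_2 \mid x_1)\,p(x_3 \mid x_2)\,p(x_4 \mid x_3)$$

(c) Position 2 depends on position 1; position 3 depends on positions 1 and 4:

$$p(x_1, x_2, x_3, x_4) = p(x_1)\,p(x_2 \mid x_1)\,p(x_3 \mid x_1, x_4)\,p(x_4)$$

(d) All positions depend on a common node T:

$$p(x_1, x_2, x_3, x_4) = p(x_1 \mid T)\,p(x_2 \mid T)\,p(x_3 \mid T)\,p(x_4 \mid T)$$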
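To make models (a) and (b) concrete, the sketch below estimates both from the seven σ70 example sites used earlier in these slides and compares the probability they assign to TATAAT. The pseudocount of 1 in every estimate and all function names are assumptions made for illustration. The example uses six positions rather than the four shown in the graphs.

```python
BASES = "ACGT"
SITES = ["TACGAT", "TATAAT", "TATAAT", "GATACT", "TATGAT", "TATGTT", "TATAGT"]

def independent_prob(seq, sites, k=1.0):
    """Model (a): product of per-position probabilities p(x_i)."""
    n = len(sites)
    prob = 1.0
    for i, b in enumerate(seq):
        prob *= (sum(s[i] == b for s in sites) + k) / (n + 4 * k)
    return prob

def first_order_model(sites, k=1.0):
    """Model (b): p(x_1) and position-specific transitions p(x_{i+1} | x_i)."""
    n, length = len(sites), len(sites[0])
    start = {b: (sum(s[0] == b for s in sites) + k) / (n + 4 * k) for b in BASES}
    transitions = []
    for i in range(length - 1):
        table = {}
        for a in BASES:
            n_a = sum(s[i] == a for s in sites)
            for b in BASES:
                n_ab = sum(s[i] == a and s[i + 1] == b for s in sites)
                table[(a, b)] = (n_ab + k) / (n_a + 4 * k)
        transitions.append(table)
    return start, transitions

def markov_prob(seq, start, transitions):
    """p(x) = p(x_1) * prod_i p(x_{i+1} | x_i)."""
    prob = start[seq[0]]
    for i in range(len(seq) - 1):
        prob *= transitions[i][(seq[i], seq[i + 1])]
    return prob

start, transitions = first_order_model(SITES)
print(independent_prob("TATAAT", SITES))
print(markov_prob("TATAAT", start, transitions))
```

Models (c) and (d) follow the same pattern with different conditioning sets, but they need more data to estimate reliably because each conditional table has more entries.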

Searching for novel binding sites using a PSWM

Scan a sequence using a sliding window of the length of the PSWM, and return the windows that have a significantly high score.

...G A G T T A T A A T T A A G A...

W = {w_{b,j}} =

       1       2       3       4       5       6
A   -1.46   1.541  -1.46   0.862   0.862  -1.46
C   -1.46  -1.46   -0.46  -1.46   -0.46   -1.46
G   -0.46  -1.46   -1.46   0.541  -0.46   -1.46
T    1.348 -1.46    1.348 -1.46   -0.46    1.541

For example, the window TATAAT scores

$$S(\mathrm{TATAAT}) = \sum_{j=1}^{l} s_j \cdot w_j = 1.348 + 1.541 + 1.348 + 0.862 + 0.862 + 1.541 = 7.502$$

The significance of a score S can be computed as an empirical p value, or as follows:

$$p = \frac{S - S_{\min}}{S_{\max} - S_{\min}},$$

where $S_{\min}$ and $S_{\max}$ are the minimal and maximal scores that can be obtained with the PSWM.

De novo prediction of TF binding sites

The motif-finding problem: since there are usually no fixed patterns of cis-regulatory elements of a TF, a cis-regulatory element can only be predicted by comparing a set of sequences that are likely to contain binding sites of the same TF.

The problem of finding cis-regulatory elements in a given set of sequences is called the motif-finding problem.

Currently, all sequence-based motif-finding algorithms are based on the assumption that binding sites of a TF are more conserved than the flanking sequences in a genome. A large number of motif-finding algorithms have been developed:

1. Greedy algorithms: CONSENSUS, DREME
2. Probabilistic algorithms: MEME, BioProspector
3. Graph-theoretic algorithms: CUBIC, MotifClick
4. ……

Methods for finding a set of intergenic sequences for motif-finding

One genome, multiple genes approach: identify a set of co-regulated genes from an organism of interest through clustering analysis of gene expression profiles.

[Figure: sequences labeled IA, IB, IC, ID, IE, IF used as input for motif finding.]

Methods for finding a set of intergenic sequences for motif-finding

One gene, multiple genomes approach (phylogenetic footprinting): in closely related species, both the coding sequences and the cis-regulatory elements of orthologous genes are often conserved.

[Figure: phylogenetic footprinting workflow. An operon and its homologous operon from another genome are shown with TFBSs and genes at positions -300, -35, -10, +1. Steps: orthologue identification for each gene g1 ... gm across genomes T, G1, G2, ..., Gn; extraction of intergenic regions; motif finding; PSWM; predicted binding sites.]

Additional hallmarks of functional TF binding sites

In higher eukaryotes, genes are regulated by multiple TFs binding to a tight cluster of their respective binding sites.

These clusters of binding sites of the same and/or different TFs are called cis-regulatory modules (CRMs). They can be in different orientations; can be located upstream of, downstream of, or in an intron of a gene; can be very far away from the target gene; and can even be on a different chromosome.

Borok MJ et al. Development 2010;137:5-13.

Wasserman WW and Sandelin A. Nature Reviews Genetics 2004;5:276-287.
