Modeling Regulatory Motifs
3/26/2013
Transcriptional Regulation
Transcription is controlled by the interaction of trans-acting elements called transcription factors (TFs) and cis-acting elements of DNA.
Prediction of cis-acting elements or TF binding sites is a challenging problem in computational biology.
[Figure: Transcriptional regulation in prokaryotes. Labeled elements: TF binding sites (TF1, TF2) in the promoter region (positions -300, -35, -10), transcription start site (TSS, +1), 5' UTR, ribosome binding site, 3' UTR, terminator, and the transcribed RNA.]
Specific Protein-DNA interactions
Protein-DNA interactions are specific, guaranteeing that transcriptional regulation is specific and precise.
The specificity of protein-DNA interactions is realized by the 3-D structures of the DNA-binding face of the TF protein and of the TF binding site in the DNA sequence.
Usually a TF recognizes variable but similar binding sites associated with different genes.
The set of all binding sites recognized by the same TF is called a TF-binding motif.
Experimental determination of binding sites
There are in vitro and in vivo methods for determining the binding sites of TFs.
Systematic evolution of ligands by exponential enrichment (SELEX) is an in vitro method that is likely to identify all possible sequences recognized by a TF;
SELEX may not work if the TF-DNA interaction requires unknown co-factors;
The method is laborious, as tedious molecular cloning and sequencing are required to determine the binding sites.
Geertz M and Maerkl SJ. Briefings in Functional Genomics 2010;9:362-373
Experimental determination of binding sites
Protein binding microarray (PBM) is another in vitro method, which avoids the molecular cloning step, and the binding sites can be read out directly from the microarray;
PBM can determine binding sites at single-base resolution.
But like SELEX, PBM may not work if the TF-DNA interaction requires unknown co-factors;
PBM may not work either if the binding site is long, e.g., longer than 12 bp.
The putative binding site determined by PBM is not necessarily the real binding site in cells.
Geertz M and Maerkl SJ. Briefings in Functional Genomics 2010;9:362-373
Experimental determination of binding sites
ChIP-seq and ChIP-chip are two high-throughput in vivo methods for determining the binding sites of a TF.
ChIP-seq and ChIP-chip can determine actual binding sites in a genome, but to determine all binding sites, many cell types need to be explored.
Geertz M and Maerkl SJ. Briefings in Functional Genomics 2010;9:362-373
Profile representation of TF binding sites
Examples of σ70 binding sites in E. coli:
TACGAT
TATAAT
TATAAT
GATACT
TATGAT
TATGTT
TATAGT
Consensus sequence: TATAAT
Regular expression: [TG]A[TC][GA]XT (X matches any base)
Frequency matrix:
     1  2  3  4  5  6
A    0  7  0  4  4  0
C    0  0  1  0  1  0
G    1  0  0  3  1  0
T    6  0  6  0  1  7
To avoid zero counts, add a pseudocount of 1:
     1  2  3  4  5  6
A    1  8  1  5  5  1
C    1  1  2  1  2  1
G    2  1  1  4  2  1
T    7  1  7  1  2  8
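The two matrices above can be reproduced directly from the seven example sites. The following is a minimal sketch; the function name `count_matrix` and the dictionary layout are illustrative choices, not something defined in the slides.

```python
from collections import Counter

# The seven sigma-70 binding sites listed in the slides.
SITES = ["TACGAT", "TATAAT", "TATAAT", "GATACT", "TATGAT", "TATGTT", "TATAGT"]
BASES = "ACGT"

def count_matrix(sites, pseudocount=0):
    """Return {base: [count at position 1, 2, ...]} for equal-length sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    # zip(*sites) yields one tuple of bases per alignment column.
    columns = [Counter(column) for column in zip(*sites)]
    return {b: [col[b] + pseudocount for col in columns] for b in BASES}

counts = count_matrix(SITES)                    # raw frequency matrix
smoothed = count_matrix(SITES, pseudocount=1)   # with a pseudocount of 1
for base in BASES:
    print(base, counts[base], smoothed[base])
```

Each printed row shows the raw counts followed by the pseudocount-adjusted counts for one base.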
Profile representation of TF binding sites
Profile: for a motif of n samples (sequences), the probability of residue b at position i is
$$ p_{b,i} = \frac{n_{b,i} + k}{n + 4k}, $$
where n_{b,i} is the frequency of residue b at position i; and k is a pseudocount to avoid zero probability.
Profile p_{b,i} of the σ70 binding sites in E. coli, pseudocount k = 1:
     1     2     3     4     5     6
A    0.09  0.73  0.09  0.45  0.45  0.09
C    0.09  0.09  0.18  0.09  0.18  0.09
G    0.18  0.09  0.09  0.36  0.18  0.09
T    0.64  0.09  0.64  0.09  0.18  0.73
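As a check on the profile formula, the sketch below computes p_{b,i} = (n_{b,i} + k) / (n + 4k) for the seven example sites with k = 1. The function name `profile` is an illustrative assumption.

```python
# The seven sigma-70 binding sites listed in the slides.
SITES = ["TACGAT", "TATAAT", "TATAAT", "GATACT", "TATGAT", "TATGTT", "TATAGT"]
BASES = "ACGT"

def profile(sites, k=1.0):
    """Return p[b][i] = (n_{b,i} + k) / (n + 4k) for each base b and position i."""
    n = len(sites)
    length = len(sites[0])
    return {
        b: [(sum(site[i] == b for site in sites) + k) / (n + 4 * k)
            for i in range(length)]
        for b in BASES
    }

p = profile(SITES, k=1)
for b in BASES:
    print(b, [round(x, 2) for x in p[b]])
```

With n = 7 and k = 1 the denominator is 11, so every column of the profile sums to 1.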
Position-specific weight (scoring) matrix (PSWM): for a motif of n samples, the weight of residue b at position i is defined as
$$ w_{b,i} = \log_2 \frac{p_{b,i}}{p_b}, $$
where p_{b,i} is the probability of residue b at position i, and p_b is the probability of residue b in the background sequences.
Profile representation of TF binding sites
PSWM of the σ70 binding sites in E. coli, assuming p_A = p_C = p_G = p_T = 0.25:
     1      2      3      4      5      6
A   -1.46   1.54  -1.46   0.86   0.86  -1.46
C   -1.46  -1.46  -0.46  -1.46  -0.46  -1.46
G   -0.46  -1.46  -1.46   0.54  -0.46  -1.46
T    1.35  -1.46   1.35  -1.46  -0.46   1.54
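The PSWM above follows from the profile by taking log2 of the ratio to the background. A minimal sketch, assuming a uniform background of 0.25 and a pseudocount of 1 as in the slides; the function name `pswm` is illustrative.

```python
import math

# The seven sigma-70 binding sites listed in the slides.
SITES = ["TACGAT", "TATAAT", "TATAAT", "GATACT", "TATGAT", "TATGTT", "TATAGT"]
BASES = "ACGT"

def pswm(sites, k=1.0, background=0.25):
    """Return w[b][i] = log2(p_{b,i} / p_b), with p_{b,i} = (n_{b,i} + k) / (n + 4k)."""
    n = len(sites)
    length = len(sites[0])
    weights = {}
    for b in BASES:
        row = []
        for i in range(length):
            p_bi = (sum(site[i] == b for site in sites) + k) / (n + 4 * k)
            row.append(math.log2(p_bi / background))
        weights[b] = row
    return weights

W = pswm(SITES)
for b in BASES:
    print(b, [round(w, 2) for w in W[b]])
```

A positive weight means the base is more frequent at that position than in the background; a negative weight means it is less frequent.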
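Information content at position i of the sequence profile is given by:
$$ I_i = \sum_{b \in \{A,C,G,T\}} p_{b,i} \log_2 \frac{p_{b,i}}{p_b} $$
Information content of a motif of length l:
$$ I = \sum_{i=1}^{l} \sum_{b \in \{A,C,G,T\}} p_{b,i} \log_2 \frac{p_{b,i}}{p_b} $$
Logo representation:
[Figure: sequence logo of the motif.]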
Profile representation of TF binding sites
Contribution p_{b,i} log2(p_{b,i} / p_b) of each base, with the per-position information content I_i in the last row (total over all positions: 2.927):
      1       2       3       4       5       6
A    -0.133   1.120  -0.133   0.392   0.392  -0.133
C    -0.133  -0.133  -0.084  -0.133  -0.084  -0.133
G    -0.084  -0.133  -0.133   0.197  -0.084  -0.133
T     0.858  -0.133   0.858  -0.133  -0.084   1.120
I_i   0.509   0.722   0.509   0.323   0.141   0.722
For the logo representation, the information content at position i is
$$ I_i = 2 + \sum_{b \in \{A,C,G,T\}} p_{b,i} \log_2 p_{b,i} - e(n), $$
where e(n) is a correction factor required when one has only a few (n) samples. A pseudocount is not added when computing p_{b,i}.
The height of each base is
$$ h_{b,i} = p_{b,i} I_i $$
http://weblogo.berkeley.edu/logo.cgi
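The per-position information content tabulated above (the relative-entropy form, with pseudocount k = 1 and a uniform background of 0.25) can be recomputed as follows. This is a sketch under those assumptions; it does not implement the logo variant with the small-sample correction e(n), and the function name `information_content` is illustrative.

```python
import math

# The seven sigma-70 binding sites listed in the slides.
SITES = ["TACGAT", "TATAAT", "TATAAT", "GATACT", "TATGAT", "TATGTT", "TATAGT"]
BASES = "ACGT"

def information_content(sites, k=1.0, background=0.25):
    """Return [I_1, ..., I_l] with I_i = sum_b p_{b,i} * log2(p_{b,i} / p_b)."""
    n = len(sites)
    length = len(sites[0])
    per_position = []
    for i in range(length):
        total = 0.0
        for b in BASES:
            p_bi = (sum(site[i] == b for site in sites) + k) / (n + 4 * k)
            if p_bi > 0:  # a zero probability contributes nothing (only possible if k = 0)
                total += p_bi * math.log2(p_bi / background)
        per_position.append(total)
    return per_position

info = information_content(SITES)
total_info = sum(info)
print([round(x, 3) for x in info], round(total_info, 3))
```

Positions 2 and 6, where every site carries the same base, have the highest information content; position 5, the most variable, has the lowest.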
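Score of a sequence using a PSWM
If we represent a sequence S = {b_1 b_2 … b_j … b_n} as a binary matrix S = {s_{j,b}} of size n x 4, for example S = TATAAT:
     A  C  G  T
1    0  0  0  1
2    1  0  0  0
3    0  0  0  1
4    1  0  0  0
5    1  0  0  0
6    0  0  0  1
then the score of the sequence against a profile (or PSWM) W = {w_{b,j}} is defined as
$$ \mathrm{score}(S) = \sum_{j=1}^{l} s_{j,\cdot} \cdot w_{\cdot,j}, $$
where s_{j,·} is row j of S, w_{·,j} is column j of W, and l = n is the length of the motif.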
Score of a sequence using a PSWM
TATAAT = {s_{j,b}} =
     A  C  G  T
1    0  0  0  1
2    1  0  0  0
3    0  0  0  1
4    1  0  0  0
5    1  0  0  0
6    0  0  0  1
W = {w_{b,j}} =
     1      2      3      4      5      6
A   -1.46   1.541 -1.46   0.862  0.862 -1.46
C   -1.46  -1.46  -0.46  -1.46  -0.46  -1.46
G   -0.46  -1.46  -1.46   0.541 -0.46  -1.46
T    1.348 -1.46   1.348 -1.46  -0.46   1.541
$$ \mathrm{score}(S) = \sum_{j=1}^{l} s_{j,\cdot} \cdot w_{\cdot,j} = 1.348 + 1.541 + 1.348 + 0.862 + 0.862 + 1.541 = 7.502 $$
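The worked score for TATAAT can be checked with a short sketch that builds the binary matrix and sums the row-by-column products. The helper names `one_hot` and `score` are illustrative; the PSWM is rebuilt from the seven example sites with pseudocount 1 and a uniform background.

```python
import math

# The seven sigma-70 binding sites listed in the slides.
SITES = ["TACGAT", "TATAAT", "TATAAT", "GATACT", "TATGAT", "TATGTT", "TATAGT"]
BASES = "ACGT"

def pswm(sites, k=1.0, background=0.25):
    """Return w[b][j] = log2(((n_{b,j} + k) / (n + 4k)) / background)."""
    n, length = len(sites), len(sites[0])
    return {
        b: [math.log2((sum(s[j] == b for s in sites) + k) / (n + 4 * k) / background)
            for j in range(length)]
        for b in BASES
    }

def one_hot(seq):
    """Binary matrix S = {s_{j,b}}: one row per position, columns in A, C, G, T order."""
    return [[1 if base == b else 0 for b in BASES] for base in seq]

def score(seq, weights):
    """score(S) = sum over j of (row j of S) dot (column j of W)."""
    if len(seq) != len(weights["A"]):
        raise ValueError("sequence length must equal the PSWM length")
    s = one_hot(seq)
    return sum(s[j][c] * weights[b][j]
               for j in range(len(seq))
               for c, b in enumerate(BASES))

W = pswm(SITES)
tataat_score = score("TATAAT", W)
print(round(tataat_score, 3))
```

Because each row of the binary matrix has a single 1, the dot product simply picks out the weight of the base present at each position.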
Higher order PSWM
To account for the dependence among adjacent positions of the TF-DNA interaction, we can use higher order PSWMs.
A higher order PSWM corresponds to a k-th order Markov chain, in which position i is dependent on the previous k positions.
A higher order PSWM is also called a position weight array.
Binding sites of the σ70 factor:
TACGAT
TATAAT
TATAAT
GATACT
TATGAT
TATGTT
TATAGT
First order PSWM counts for the σ70 factor binding sites (to avoid zero counts, a pseudocount of 1 is added; column i counts the base pair at positions i and i+1):
1st  2nd   1  2  3  4  5
A    A     1  1  1  3  1
A    C     1  2  1  2  1
A    G     1  1  1  2  1
A    T     1  7  1  1  5
C    A     1  1  1  1  1
C    C     1  1  1  1  1
C    G     1  1  2  1  1
C    T     1  1  1  1  2
G    A     2  1  1  3  1
G    C     1  1  1  1  1
G    G     1  1  1  1  1
G    T     1  1  1  2  2
T    A     7  1  5  1  1
T    C     1  1  1  1  1
T    G     1  1  3  1  1
T    T     1  1  1  1  2
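The first-order count table can be rebuilt by counting adjacent base pairs at each pair of neighboring positions. A minimal sketch; `first_order_counts` is an illustrative name, and the pseudocount of 1 follows the slides.

```python
from itertools import product

# The seven sigma-70 binding sites listed in the slides.
SITES = ["TACGAT", "TATAAT", "TATAAT", "GATACT", "TATGAT", "TATGTT", "TATAGT"]
BASES = "ACGT"

def first_order_counts(sites, pseudocount=1):
    """Return {(first, second): [count at pair position 1, 2, ...]}.

    Pair position i covers sequence positions i and i+1, so a motif of
    length l has l - 1 pair positions.
    """
    length = len(sites[0])
    counts = {pair: [pseudocount] * (length - 1)
              for pair in product(BASES, repeat=2)}
    for site in sites:
        for i in range(length - 1):
            counts[(site[i], site[i + 1])][i] += 1
    return counts

pair_counts = first_order_counts(SITES)
for (first, second), row in pair_counts.items():
    print(first, second, row)
```

Dividing each column by its total would turn these counts into the dinucleotide probabilities needed for a first-order weight array.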
Maximal dependence decomposition
Maximal dependence decomposition (MDD) models the dependence between any two positions. It estimates the extent to which the nucleotide b_j at position j depends on the nucleotide b_i at position i.
MDD uses the χ² test to determine whether position j depends on position i.
Binding sites:
TACGAT
TATAAT
TATAAT
GATACT
TATGAT
TATGTT
TATAGT
Consensus bases: T A T A A T
Non-consensus bases by position: 1: G; 2: none; 3: C; 4: G; 5: C, G, T; 6: none
For each position i, we divide the binding sites into two groups:
C_i: binding sites having the consensus base at i;
\bar{C}_i: binding sites having a non-consensus base at i.
Example (i = 5, consensus base A):
C_i: TACGAT, TATAAT, TATAAT, TATGAT
\bar{C}_i: GATACT, TATGTT, TATAGT
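Maximal dependence decomposition
Let f_b be the probability of base b at position j in the binding sites in \bar{C}_i. Let N and N_b be the total number of binding sites in C_i and the count of base b at position j in C_i, respectively. Then the χ² statistic is defined as
$$ \chi^2 = \frac{(N_A - N f_A)^2}{N f_A} + \frac{(N_C - N f_C)^2}{N f_C} + \frac{(N_G - N f_G)^2}{N f_G} + \frac{(N_T - N f_T)^2}{N f_T} $$
[Figure: the binding sites split into C_i (TACGAT, TATAAT, TATAAT, TATGAT) and \bar{C}_i (GATACT, TATGTT, TATAGT); f_A, f_C, f_G, f_T are estimated at position j in \bar{C}_i, and N_A, N_C, N_G, N_T are counted at position j among the N binding sites in C_i.]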
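A minimal sketch of the χ² statistic for one pair of positions follows. One detail is an assumption not stated in the slides: f_b is estimated with a pseudocount k, because a base that never occurs at position j in \bar{C}_i would otherwise give an expected count of zero and a division by zero. The function names are illustrative, and positions are 0-based in the code.

```python
from collections import Counter

# The seven sigma-70 binding sites listed in the slides.
SITES = ["TACGAT", "TATAAT", "TATAAT", "GATACT", "TATGAT", "TATGTT", "TATAGT"]
BASES = "ACGT"

def split_by_consensus(sites, i):
    """Split sites into C_i (consensus base at i) and its complement."""
    consensus = Counter(site[i] for site in sites).most_common(1)[0][0]
    inside = [s for s in sites if s[i] == consensus]
    outside = [s for s in sites if s[i] != consensus]
    return inside, outside

def chi2(sites, j, i, k=1.0):
    """Chi-square statistic for the dependence of position j on position i.

    Expected counts N * f_b come from the complement group (with pseudocount k,
    an assumption of this sketch); observed counts N_b come from C_i.
    """
    inside, outside = split_by_consensus(sites, i)
    n = len(inside)
    stat = 0.0
    for b in BASES:
        f_b = (sum(s[j] == b for s in outside) + k) / (len(outside) + 4 * k)
        n_b = sum(s[j] == b for s in inside)
        stat += (n_b - n * f_b) ** 2 / (n * f_b)
    return stat

# Dependence of position 4 on position 5 (0-based indices 3 and 4).
value = chi2(SITES, j=3, i=4)
print(round(value, 4))
```

With k = 1 the example evaluates to 77/42, about 1.83; whether that counts as significant depends on the χ² threshold chosen, which this sketch leaves open.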
Maximal dependence decomposition
This χ² statistic describes the dependence of position j on position i, and is denoted as χ²(j|i).
The MDD approach proceeds iteratively as follows.
1. For each position i, compute
$$ S_i = \sum_{j \neq i} \chi^2(j \mid i); $$
2. Among all the positions, select the position i with maximum S_i, and partition the sequences into two groups, C_i and \bar{C}_i;
3. Repeat steps 1 and 2 separately for C_i and \bar{C}_i;
4. Stop if there is no significant dependence or if there is an insufficient number of binding sites in C_i or \bar{C}_i. In either case construct a standard PSWM for the remaining subset of binding sites.
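Steps 1 and 2 of the procedure can be sketched as a single split, as below. Several choices here are assumptions rather than part of the slides: f_b is estimated with a pseudocount k, positions whose split would leave a group with fewer than `min_group` sites are skipped, and the significance test and the recursion of steps 3 and 4 are omitted. All function names are illustrative, and positions are 0-based.

```python
from collections import Counter

# The seven sigma-70 binding sites listed in the slides.
SITES = ["TACGAT", "TATAAT", "TATAAT", "GATACT", "TATGAT", "TATGTT", "TATAGT"]
BASES = "ACGT"

def split_by_consensus(sites, i):
    """Split sites into C_i (consensus base at i) and its complement."""
    consensus = Counter(site[i] for site in sites).most_common(1)[0][0]
    return ([s for s in sites if s[i] == consensus],
            [s for s in sites if s[i] != consensus])

def chi2(sites, j, i, k=1.0):
    """Chi-square statistic chi2(j | i), with pseudocount k on f_b (an assumption)."""
    inside, outside = split_by_consensus(sites, i)
    n = len(inside)
    stat = 0.0
    for b in BASES:
        f_b = (sum(s[j] == b for s in outside) + k) / (len(outside) + 4 * k)
        n_b = sum(s[j] == b for s in inside)
        stat += (n_b - n * f_b) ** 2 / (n * f_b)
    return stat

def best_split(sites, k=1.0, min_group=1):
    """Compute S_i for every usable position and split at the maximum.

    Returns (best position, {position: S_i}, C_i, complement), or None when
    no position yields two groups of at least min_group sites.
    """
    length = len(sites[0])
    scores = {}
    for i in range(length):
        inside, outside = split_by_consensus(sites, i)
        if len(inside) < min_group or len(outside) < min_group:
            continue  # splitting here would leave one group too small
        scores[i] = sum(chi2(sites, j, i, k) for j in range(length) if j != i)
    if not scores:
        return None
    best = max(scores, key=scores.get)
    inside, outside = split_by_consensus(sites, best)
    return best, scores, inside, outside

result = best_split(SITES)
if result is not None:
    best, scores, inside, outside = result
    print("split at position", best + 1, "C_i:", inside, "complement:", outside)
```

A full MDD implementation would call `best_split` recursively on both groups and fall back to a standard PSWM for any subset where the split is not significant or the group is too small.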
Maximal dependence decomposition
Illustration of the MDD procedure: modeling
[Figure: MDD decision tree. The full set of binding sites (AACGTG, AGGCTG, AGCTTT, ...) is first split at position 1 (maximum S1); one of the resulting subsets is split again at position 3 (maximum S3). Subsets with insufficient dependence are not split further and are modeled by standard PSWMs (PSWM1, PSWM2).]
Maximal dependence decomposition
Illustration of the MDD procedure: scoring
X = AAGGTG
Position 1 has the consensus base 'A'.
Position 3 has the non-consensus base 'G'.
Score X using PSWM2.
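Modeling and detecting arbitrary dependencies
We can also use a digraph to model the dependence among the positions:
[Figure: four digraphs (a, b, c, d) over the positions S1, S2, S3, S4; graph d includes an additional node T.]
The corresponding factorizations, in the order shown, are:
$$ p(x_1,x_2,x_3,x_4) = p(x_1)\,p(x_2)\,p(x_3)\,p(x_4) $$
$$ p(x_1,x_2,x_3,x_4) = p(x_1)\,p(x_2 \mid x_1)\,p(x_3 \mid x_2)\,p(x_4 \mid x_3) $$
$$ p(x_1,x_2,x_3,x_4) = p(x_1)\,p(x_2 \mid x_1)\,p(x_3 \mid x_1,x_4)\,p(x_4) $$
$$ p(x_1,x_2,x_3,x_4) = p(x_1 \mid T)\,p(x_2 \mid T)\,p(x_3 \mid T)\,p(x_4 \mid T) $$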
Searching for novel binding sites using a PSWM
W = {w_{b,j}} =
     1      2      3      4      5      6
A   -1.46   1.541 -1.46   0.862  0.862 -1.46
C   -1.46  -1.46  -0.46  -1.46  -0.46  -1.46
G   -0.46  -1.46  -1.46   0.541 -0.46  -1.46
T    1.348 -1.46   1.348 -1.46  -0.46   1.541
$$ \mathrm{score}(\mathrm{TATAAT}) = \sum_{j=1}^{l} s_{j,\cdot} \cdot w_{\cdot,j} = 1.348 + 1.541 + 1.348 + 0.862 + 0.862 + 1.541 = 7.502 $$
Scan a sequence using a sliding window of the length of the PSWM, and return the windows that have a significantly high score.
...G A G T T A T A A T T A A G A...
The significance of a score s can be computed as an empirical p-value, or as follows,
$$ p = \frac{s - s_{\min}}{s_{\max} - s_{\min}}, $$
where s_min and s_max are the minimal and maximal scores that can be obtained with the PSWM.
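The sliding-window scan and the normalized score (s - s_min) / (s_max - s_min) can be sketched as follows, using the example sequence from the slide. The function names `pswm` and `scan` are illustrative, and the cutoff for a "significantly high" score is left to the caller since the slides do not fix one.

```python
import math

# The seven sigma-70 binding sites listed in the slides.
SITES = ["TACGAT", "TATAAT", "TATAAT", "GATACT", "TATGAT", "TATGTT", "TATAGT"]
BASES = "ACGT"

def pswm(sites, k=1.0, background=0.25):
    """Return w[b][j] = log2(((n_{b,j} + k) / (n + 4k)) / background)."""
    n, length = len(sites), len(sites[0])
    return {
        b: [math.log2((sum(s[j] == b for s in sites) + k) / (n + 4 * k) / background)
            for j in range(length)]
        for b in BASES
    }

def scan(sequence, weights):
    """Score every window; return (start, window, score, normalized score) tuples."""
    length = len(weights["A"])
    # Best and worst achievable scores: column-wise maxima and minima of the PSWM.
    s_max = sum(max(weights[b][j] for b in BASES) for j in range(length))
    s_min = sum(min(weights[b][j] for b in BASES) for j in range(length))
    hits = []
    for start in range(len(sequence) - length + 1):
        window = sequence[start:start + length]
        s = sum(weights[base][j] for j, base in enumerate(window))
        hits.append((start, window, s, (s - s_min) / (s_max - s_min)))
    return hits

hits = scan("GAGTTATAATTAAGA", pswm(SITES))
best = max(hits, key=lambda hit: hit[2])
print(best)
```

In this sequence the window TATAAT (0-based start 4) reaches the maximal score, so its normalized score is 1; every other window scores lower.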
De novo prediction of TF binding sites
The motif-finding problem: Since there are usually no fixed patterns of cis-regulatory elements of a TF, a cis-regulatory element can only be predicted by comparing a set of sequences that are likely to contain binding sites of the same TF.
The problem of finding cis-regulatory elements in a given set of sequences is called the motif-finding problem.
Currently, all sequence-based motif-finding algorithms are based on the assumption that binding sites of a TF are more conserved than the flanking sequences in a genome. A large number of motif-finding algorithms have been developed:
1. Greedy algorithms: CONSENSUS, DREME
2. Probabilistic algorithms: MEME, BioProspector
3. Graph-theoretic algorithms: CUBIC, MotifClick
4. ……
Methods for finding a set of intergenic sequences for motif-finding
One genome, multiple genes approach: identify a set of co-regulated genes from an organism of interest through clustering analysis of gene expression profiles.
[Figure: sequences labeled IA, IB, IC, ID, IE and IF used as input for motif finding.]
Methods for finding a set of intergenic sequences for motif-finding
One gene, multiple genomes approach (phylogenetic footprinting): in closely related species, both the coding sequences and the cis-regulatory elements of orthologous genes are often conserved.
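[Figure: Phylogenetic footprinting. An operon and its homologous operon from another genome are shown with TFBSs upstream of the genes (positions -300, -35, -10, +1). Workflow: orthologue identification for each target gene (T.g1 ... T.gm) across genomes G1 ... Gn (G1.g1, G2.g1, ..., Gn.g1; ...; G1.gm, G2.gm, ..., Gn.gm), extraction of intergenic regions, motif finding, and the resulting PSWMs and predicted binding sites.]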
Additional hallmarks of functional TF binding sites
In higher eukaryotes, genes are regulated by multiple TFs binding to a closely spaced cluster of their respective binding sites.
These clusters of binding sites of the same and/or different TFs are called cis-regulatory modules (CRMs). They can be in different orientations, located upstream, downstream or in an intron of a gene, can be very far away from the target gene, and can even be on a different chromosome.
Borok MJ et al. Development 2010;137:5-13
Wasserman WW and Sandelin A. Nature Reviews Genetics 2004;5:276-287