
1

Finding Regulatory Motifs

2

Copyright notice

• Many of the images in this PowerPoint presentation are from Bioinformatics and Functional Genomics by Jonathan Pevsner (ISBN 0-471-21004-8). Copyright © 2003 by John Wiley & Sons, Inc.

• Many slides in this PowerPoint presentation are from slides by Dr. Jonathan Pevsner and others. The copyright belongs to the original authors. Thanks!

3

Regulation of Transcription

TFs bound to their BSs

Transcription machinery Gene start

4

Transcription Factors

• Proteins involved in the regulation of gene expression that bind to promoter regions upstream of transcription initiation sites

• Composed of two essential functional regions: a DNA-binding domain and an activator domain.

5

Transcription factors

Sequence-specific DNA binding

Non-DNA binding

TF1 TF2 TF3 TF4

adapter

Co-activator

HAT

DNA

Layer I

Layer III

Layer II

6

BSs Models

(a) Exact string(s)

Example:

BS = TACACC , TACGGC

CAATGCAGGATACACCGATCGGTA

GGAGTACGGCAAGTCCCCATGTGA

AGGCTGGACCAGACTCTACACCTA

7

BSs Models (II)

(b) String with mismatches

Example:

BS = TACACC + 1 mismatch

CAATGCAGGATTCACCGATCGGTA

GGAGTACAGCAAGTCCCCATGTGA

AGGCTGGACCAGACTCTACACCTA

8

BSs Models (III)

(c) Degenerate string

Example:

BS = TASDAC (S={C,G} D={A,G,T})

CAATGCAGGATACAACGATCGGTA

GGAGTAGTACAAGTCCCCATGTGA

AGGCTGGACCAGACTCTACGACTA

9

BSs Models (IV)

(d) Position Weight Matrix (PWM)

Example: BS =

      1    2    3    4    5    6
A   0.1  0.8  0    0.7  0.2  0
C   0    0.1  0.5  0.1  0.4  0.6
G   0    0    0.5  0.1  0.4  0.1
T   0.9  0.1  0    0.1  0    0.3

ATGCAGGATACACCGATCGGTA 0.0605

GGAGTAGAGCAAGTCCCGTGA 0.0605

AAGACTCTACAATTATGGCGT 0.0151
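The PWM probability of a window is the product of the per-position base probabilities. A minimal Python sketch using the slide's PWM (function names are illustrative):

```python
PWM = {  # P(base at motif position i), columns 1..6 from the slide
    'A': [0.1, 0.8, 0.0, 0.7, 0.2, 0.0],
    'C': [0.0, 0.1, 0.5, 0.1, 0.4, 0.6],
    'G': [0.0, 0.0, 0.5, 0.1, 0.4, 0.1],
    'T': [0.9, 0.1, 0.0, 0.1, 0.0, 0.3],
}
W = 6  # motif width

def window_prob(window):
    """Probability of a W-long window under the PWM: product over columns."""
    p = 1.0
    for i, base in enumerate(window):
        p *= PWM[base][i]
    return p

def best_window(seq):
    """(probability, start) of the highest-probability window in seq."""
    return max((window_prob(seq[s:s + W]), s)
               for s in range(len(seq) - W + 1))

p, s = best_window("ATGCAGGATACACCGATCGGTA")
# best window is "TACACC": 0.9*0.8*0.5*0.7*0.4*0.6 = 0.06048 (slide: 0.0605)
```

The best window of the first slide sequence is TACACC, with probability 0.9·0.8·0.5·0.7·0.4·0.6 = 0.06048, matching the slide's 0.0605.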

10

Position Weight Matrix (PWM)

Frequency Matrix
      1    2    3    4    5
a    12    1    0    1    0
c     1    1   10    9    0
g     2    5    5    2   14
t     0    7    0    2    1

Weight Matrix
      1     2     3     4     5
a   5.1  -5.7  -8.7  -5.7  -8.7
c  -5.7  -5.7   4.2   3.8  -8.7
g  -2.7   1.2   1.2  -2.7   5.7
t  -8.7   2.7  -8.7  -2.7  -5.7

Wi(b) = 10 · log10( pi(b) / f(b) )

where pi(b) is the frequency of base b at position i and f(b) is the background frequency of b.

NOTE: Use pseudo-counts for zero frequencies

11

Predicting Motif Occurrences: Sequence Scoring

      1     2     3     4     5
a   5.1  -5.7  -8.7  -5.7  -8.7
c  -5.7  -5.7   4.2   3.8  -8.7
g  -2.7   1.2   1.2  -2.7   5.7
t  -8.7   2.7  -8.7  -2.7  -5.7

Slide the matrix along the sequence  a g c g g t a :

Window "agcgg": Sum = 5.1 + 1.2 + 4.2 − 2.7 + 5.7 = 13.5

Window "gcggt": Sum = −2.7 − 5.7 + 1.2 − 2.7 − 5.7 = −15.6
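The count-to-weight conversion and window scoring can be sketched in Python. This assumes a uniform 0.25 background and a pseudo-count of 0.5 for zero counts; with the formula Wi(b) = 10·log10(pi(b)/f(b)) it reproduces the slide's weight matrix to within the slide's rounding:

```python
import math

def weight_matrix(counts, n_sites, background=0.25, zero_count=0.5):
    """Log-odds weight matrix from a count matrix:
    W_i(b) = 10 * log10( p_i(b) / f(b) ),  p_i(b) = count / n_sites,
    with zero counts replaced by a pseudo-count so the log is defined."""
    return {b: [10 * math.log10((max(c, zero_count) / n_sites) / background)
                for c in cols]
            for b, cols in counts.items()}

def score(wm, window):
    """Additive log-odds score of one window; > 0 means the window
    matches the motif better than the background."""
    return sum(wm[base][i] for i, base in enumerate(window))

counts = {'a': [12, 1, 0, 1, 0],
          'c': [1, 1, 10, 9, 0],
          'g': [2, 5, 5, 2, 14],
          't': [0, 7, 0, 2, 1]}
wm = weight_matrix(counts, n_sites=15)
# wm['a'][0] is about 5.05 (slide: 5.1); score(wm, 'agcgg') is about 13.5
```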

12

BSs Models (V)

(e) More complex models

(e) More complex models

– PWM with spacers (e.g., for p53)
– Markov model (dependency between adjacent columns of the PWM)
– Hybrid models, e.g., a mixture of two PWMs
– …

… And we also need to model the non-BS sequences in the promoters…

13

Motif Representations

CGGCGCACTCTCGCCCG
CGGGGCAGACTATTCCG
CGGCGGCTTCTAATCCG
...
CGGGGCAGACTATTCCG

1. Consensus: CGGNGCACANTCNTCCG

2. Frequency Matrix

3. Logo

14

Logos

• Graphical representation of nucleotide base (or amino acid) conservation in a motif (or alignment)

• Information theory

• Height of letters represents relative frequency of nucleotide bases

http://weblogo.berkeley.edu/

Height of a column (information content, in bits):

2 + Σb∈{A,C,G,T} p(b) · log2 p(b)
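The column heights of a logo can be computed directly from this formula; a minimal sketch, where each letter's height is its frequency scaled by the column's information content:

```python
import math

def column_information(p):
    """Information content (bits) of one motif column:
    IC = 2 + sum over b in {A,C,G,T} of p(b) * log2 p(b).
    0 bits = uniform column, 2 bits = perfectly conserved column."""
    return 2 + sum(pb * math.log2(pb) for pb in p.values() if pb > 0)

def letter_heights(p):
    """In a logo, each letter's height is its frequency times the column IC."""
    ic = column_information(p)
    return {b: pb * ic for b, pb in p.items()}

# A perfectly conserved column carries 2 bits; a uniform one carries 0.
```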

15

Regulatory Motif Discovery

[Diagram: DNA, a group of co-regulated genes, and their common subsequence]

Find motifs within groups of co-regulated genes

16

How to find novel motifs

Degenerate string:
• YMF – Sinha & Tompa ’02

String with mismatches:
• WINNOWER – Pevzner & Sze ’00
• Random Projections – Buhler & Tompa ’02
• MULTIPROFILER – Keich & Pevzner ’02

PWM:
• MEME – Bailey & Elkan ’95
• AlignACE – Hughes et al. ’98
• CONSENSUS – Hertz & Stormo ’99

17

How to find TF modules

• BioProspector – Liu et al. ‘01

• Co-Bind – GuhaThakurta & Stormo ‘01

• MITRA – Eskin & Pevzner ‘02

• CREME – Sharan et al. ‘03

• MCAST – Bailey & Noble ‘03

18

Characteristics of Regulatory Motifs

• Tiny
• Highly variable
• ~Constant size
  – Because a constant-size transcription factor binds
• Often repeated
• Low-complexity

19

Problem Definition

Given a collection of promoter sequences s1, …, sN of genes with common expression:

Probabilistic
Motif: Mij; 1 ≤ i ≤ W, 1 ≤ j ≤ 4
Mij = Prob[ letter j, pos i ]
Find the best M, and positions p1, …, pN in the sequences

Combinatorial
Motif M: m1…mW
Some of the mi's blank
Find M that occurs in all si with ≤ k differences

20

Discrete Approaches to Motif Finding

21

Discrete Formulations

Given sequences S = {x1, …, xn}

• A motif W is a consensus string w1…wK

• Find motif W* with “best” match to x1, …, xn

Definition of “best”:

d(W, xi) = min hamming dist. between W and any word in xi

d(W, S) = Σi d(W, xi)
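These two distances can be sketched directly (function names are illustrative):

```python
def hamming(u, v):
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(u, v))

def d_word(W, x):
    """d(W, x): minimum Hamming distance between W and any |W|-long word in x."""
    K = len(W)
    return min(hamming(W, x[s:s + K]) for s in range(len(x) - K + 1))

def d_total(W, S):
    """d(W, S): sum over sequences of the per-sequence minimum distance."""
    return sum(d_word(W, x) for x in S)
```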

22

Exhaustive Searches

1. Pattern-driven algorithm:

For W = AA…A to TT…T (4^K possibilities)
  Find d(W, S)
Report W* = argmin( d(W, S) )

Running time: O(K N 4^K), where N = Σi |xi|

Advantage: finds the provably “best” motif W
Disadvantage: time
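The pattern-driven search above can be sketched as follows (feasible only for small K, since the loop visits all 4^K candidate strings):

```python
from itertools import product

def pattern_driven(S, K):
    """Exhaustive pattern-driven search: try all 4^K strings W,
    return the one minimizing d(W, S). O(K * N * 4^K) time."""
    def hamming(u, v):
        return sum(a != b for a, b in zip(u, v))

    def d(W, x):
        # minimum Hamming distance between W and any K-long word in x
        return min(hamming(W, x[s:s + K]) for s in range(len(x) - K + 1))

    best_score, best_W = None, None
    for W in map(''.join, product('ACGT', repeat=K)):
        total = sum(d(W, x) for x in S)
        if best_score is None or total < best_score:
            best_score, best_W = total, W
    return best_W
```

For example, with "TGC" planted in every sequence, the search recovers it as the unique word with total distance 0.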

23

Exhaustive Searches

2. Sample-driven algorithm:

For W = any K-long word occurring in some xi
  Find d(W, S)
Report W* = argmin( d(W, S) ), or report a local improvement of W*

Running time: O(K N²)

Advantage: time
Disadvantage: if the true motif is weak and does not occur exactly in the data, then a random motif may score better than any instance of the true motif

24

MULTIPROFILER

• Extended sample-driven approach

Given a K-long word W, define:

Nα(W) = words W’ in S s.t. d(W, W’) ≤ α

Idea:

Assume W is occurrence of true motif W*

Will use Nα(W) to correct “errors” in W

25

MULTIPROFILER

Assume W differs from the true motif W* in at most L positions

Define:
A wordlet G of W is an L-long pattern with blanks, differing from W
– L is smaller than the word length K

Example:
K = 7; L = 3
W = ACGTTGA
G = --A--CG

26

MULTIPROFILER

Algorithm:

For each W in S:
  For L = 1 to Lmax:
    1. Find the α-neighbors of W in S → Nα(W)
    2. Find all “strong” L-long wordlets G in Nα(W)
    3. For each wordlet G:
       a. Modify W by the wordlet G → W’
       b. Compute d(W’, S)
Report W* = argmin d(W’, S)

Step 2 above is a smaller motif-finding problem; use exhaustive search.

27

CONSENSUS

Algorithm:

Cycle 1:
For each word W in S (of fixed length!)
  For each word W’ in S
    Create a gap-free alignment of W, W’
Keep the C1 best alignments A1, …, AC1

ACGGTTG , CGAACTT , GGGCTCT …
ACGCCTG , AGAACTA , GGGGTGT …

28

CONSENSUS

Algorithm:

Cycle t:
For each word W in S
  For each alignment Aj from cycle t−1
    Create a gap-free alignment of W, Aj
Keep the Ct best alignments A1, …, ACt

ACGGTTG , CGAACTT , GGGCTCT …
ACGCCTG , AGAACTA , GGGGTGT …
… … …
ACGGCTC , AGATCTT , GGCGTCT …

29

CONSENSUS

• C1, …, Cn are user-defined heuristic constants
  – N is the sum of sequence lengths
  – n is the number of sequences

Running time:

O(N²) + O(N C1) + O(N C2) + … + O(N Cn) = O( N² + N Ctotal )

where Ctotal = Σi Ci, typically O(nC) for a big constant C

30

Expectation Maximization in Motif Finding

31

Expectation Maximization

Algorithm (sketch):

1. Given genomic sequences, find all K-long words
2. Assume each word comes from either the Motif model or the Background model
3. Find the likeliest classification of words into Motif or Background

32

Expectation Maximization

Given sequences x1, …, xN:

• Find all K-long words X1, …, Xn

• Define the motif model: M = (M1, …, MK), with Mi = (Mi1, …, Mi4) (assuming the alphabet {A, C, G, T}), where Mij = Prob[ letter j occurs in motif position i ]

• Define the background model: B = (B1, …, B4), where Bj = Prob[ letter j in background sequence ]

[Diagram: motif model M1…MK vs. background model B]

33

Expectation Maximization

• Define
Zi1 = 1 if Xi is motif, 0 otherwise
Zi2 = 1 − Zi1

• Given a word Xi = x[s]…x[s+K−1]:

P[ Xi, Zi1 = 1 ] = λ M1,x[s] … MK,x[s+K−1]
P[ Xi, Zi2 = 1 ] = (1 − λ) Bx[s] … Bx[s+K−1]

Let λ1 = λ; λ2 = 1 − λ

[Diagram: motif model M1…MK (prior λ) vs. background model B (prior 1 − λ)]

34

Expectation Maximization

Define the parameter space θ = (M, B); θ1: Motif, θ2: Background

Objective: maximize the log likelihood of the model:

log P(X1, …, Xn, Z | θ, λ)
  = Σi=1..n Σj=1,2 Zij log P(Xi | θj) + Σi=1..n Σj=1,2 Zij log λj

35

Expectation Maximization

• Maximize the expected likelihood, iterating two steps:

Expectation: find the expected value of the log likelihood:
  E[ log P(X1, …, Xn, Z | θ, λ) ]

Maximization: maximize this expected value over θ, λ

36

Expectation Maximization: E-step

Expectation: find the expected value of the log likelihood:

E[ log P(X1, …, Xn, Z | θ, λ) ]
  = Σi=1..n Σj=1,2 E[Zij] log P(Xi | θj) + Σi=1..n Σj=1,2 E[Zij] log λj

where the expected values of Z can be computed as follows:

Z*i1 = E[Zi1] = Pr[ Zi1 = 1 ] = λ P(Xi | θ1) / ( λ P(Xi | θ1) + (1 − λ) P(Xi | θ2) )

and Z*i2 = 1 − Z*i1
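The E-step can be sketched as follows (illustrative data layout: M is a list of per-position base-probability dicts, B a single base-frequency dict):

```python
def word_prob_motif(word, M):
    """P(word | motif model): product of position-specific probabilities."""
    p = 1.0
    for i, base in enumerate(word):
        p *= M[i][base]
    return p

def word_prob_background(word, B):
    """P(word | background model): product of single-base frequencies."""
    p = 1.0
    for base in word:
        p *= B[base]
    return p

def e_step(words, M, B, lam):
    """E-step: posterior probability Z*_i1 that each word is a motif:
    lam*P(X|motif) / (lam*P(X|motif) + (1-lam)*P(X|background))."""
    Z1 = []
    for w in words:
        pm = lam * word_prob_motif(w, M)
        pb = (1 - lam) * word_prob_background(w, B)
        Z1.append(pm / (pm + pb))
    return Z1
```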

37

Expectation Maximization: M-step

Maximization: maximize the expected value over θ and λ independently.

For λ, this has the following solution (we won't prove it):

λNEW = argmaxλ Σi=1..n ( Z*i1 log λ + Z*i2 log(1 − λ) ) = (1/n) Σi=1..n Z*i1

Effectively, λNEW is the expected number of motif occurrences per position, given our current parameters.

38

Expectation Maximization: M-step

• For θ = (M, B), define

cjk = E[ # times letter k appears in motif position j ]
c0k = E[ # times letter k appears in the background ]

• The cjk values are calculated easily from the Z* values

It then follows:

MjkNEW = cjk / Σk'=1..4 cjk'        BkNEW = c0k / Σk'=1..4 c0k'

To not allow any 0's, add pseudocounts.
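The two M-step updates (λ from the Z* values; M and B from expected counts, with pseudocounts to avoid zeros) can be sketched as:

```python
def m_step(words, Z1, alphabet='ACGT', pseudo=0.5):
    """M-step: re-estimate lambda, M, B from expected memberships Z1.
    Pseudocounts keep every probability nonzero, as the slide advises."""
    n, K = len(words), len(words[0])
    lam = sum(Z1) / n                       # expected fraction of motif words

    # M: expected count of base k at motif position j, weighted by Z*_i1
    M = []
    for j in range(K):
        c = {k: pseudo for k in alphabet}
        for w, z in zip(words, Z1):
            c[w[j]] += z
        tot = sum(c.values())
        M.append({k: c[k] / tot for k in alphabet})

    # B: expected background counts, weighted by Z*_i2 = 1 - Z*_i1
    c0 = {k: pseudo for k in alphabet}
    for w, z in zip(words, Z1):
        for base in w:
            c0[base] += 1 - z
    tot0 = sum(c0.values())
    B = {k: c0[k] / tot0 for k in alphabet}
    return lam, M, B
```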

39

Initial Parameters Matter!

Consider the following artificial example of 6-mers X1, …, Xn (n = 2000):

– 990 words “AAAAAA”
– 990 words “CCCCCC”
– 20 words “ACACAC”

Some local maxima:

λ = 49.5%; B = 100/101 C, 1/101 A; M = 100% “AAAAAA”
λ = 1%; B = 50% C, 50% A; M = 100% “ACACAC”

40

Overview of EM Algorithm

1. Initialize parameters = (M, B), :– Try different values of from N-1/2 up to 1/(2K)

2. Repeat:a. Expectationb. Maximization

3. Until change in = (M, B), falls below

4. Report results for several “good”

41

Detecting Subtle Sequence Signals: a Gibbs Sampling Strategy

Lawrence et al. 1993

42

Notations

• Set of symbols:

• Sequences: S = {S1, S2, …, SN}

• Starting positions of motifs: A = {a1, a2, …, aN}

• Motif model (θ): qij = P(symbol at the i-th position = j)

• Background model: pj = P(symbol = j)

• Counts of symbols in each column: cij = count of symbol j in the i-th column of the aligned region

43

Probability of data given model

P(S | A, θ) = Πi=1..W Πj=1..|Σ| qij^cij

P(S | A, θ0) = Πi=1..W Πj=1..|Σ| pj^cij

where θ0 denotes the background model and Σ the symbol alphabet.

44

Scoring Function

• Maximize the log-odds ratio:

F = log [ P(S | A, θ) / P(S | A, θ0) ] = Σi=1..W Σj=1..|Σ| cij log( qij / pj )

• F is greater than zero if the data is a better match to the motif model than to the background model

45

Scoring function

F = Σi=1..W Σj=1..|Σ| cij log( qij / pj )

• A particular alignment A gives us the counts cij.
• In the scoring function F, use the pseudocount-smoothed estimate

qij = ( cij + bj ) / ( N − 1 + B )

where the bj are pseudocounts and B = Σj bj.

46

Scoring function

• Thus, given an alignment A, we can calculate the scoring function F

• We need to find A that maximizes this scoring function, which is a log-odds score
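Given an alignment's column counts, F is a direct computation. A minimal sketch, assuming a uniform background and maximum-likelihood q estimates (the real algorithm smooths q with pseudocounts, as above):

```python
import math

def log_odds_F(counts, background):
    """F = sum_i sum_j c_ij * log2(q_ij / p_j), with q_ij estimated as c_ij / N.
    counts: one dict of base counts per motif column."""
    F = 0.0
    for col in counts:
        n = sum(col.values())
        for b, c in col.items():
            if c > 0:
                F += c * math.log2((c / n) / background[b])
    return F

bg = {b: 0.25 for b in 'ACGT'}
# A perfectly conserved column of 10 A's contributes 10 * log2(1/0.25) = 20 bits.
```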

47

Optimization and Sampling

• To maximize a function f(x):
  – Brute-force method: try all possible x
  – Sampling method: sample x from a probability distribution p(x) ∝ f(x)
  – Idea: if xmax is the argmax of f(x), then it is also the argmax of p(x), so we have a high probability of selecting xmax

48

Markov Chain Sampling

• To sample from a probability distribution p(x), we set up a Markov chain such that each state represents a value of x and, for any two states x and y, the transition probabilities satisfy detailed balance:

p(x) P(x → y) = p(y) P(y → x)

• This then implies:

limN→∞ CN(x) / N = p(x)

where CN(x) is the number of times the chain visits state x in N steps.

49

Gibbs sampling to maximize F

• Gibbs sampling is a special type of Markov chain sampling algorithm
• Our goal is to find the optimal A = (a1, …, aN)
• The Markov chain we construct will only have transitions from A to alignments A’ that differ from A in only one of the ai
• In round-robin order, pick one of the ai to replace
• Consider all A’ formed by replacing ai with some other starting position ai’ in sequence Si
• Move to one of these A’ probabilistically
• Iterate the last three steps

50

Algorithm

Randomly initialize A0;
Repeat:
  (1) randomly choose a sequence z from S; A* = At \ az; compute θt from A*;
  (2) sample az according to P(az = x) ∝ Qx / Px; update At+1 = A* ∪ {az};
Select the At that maximizes F.

Qx: the probability of generating x according to θt
Px: the probability of generating x according to the background model
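The sampler above can be sketched as follows. This is a simplified version under stated assumptions: uniform background, a pseudocount of 0.5, round-robin sequence choice, and no phase shifts; names are illustrative:

```python
import random

def gibbs_motif(S, W, iters=1000, seed=1):
    """Gibbs-sampler sketch in the style of Lawrence et al. 1993.
    S: sequences; W: motif width. Returns starting positions a_1..a_N."""
    rng = random.Random(seed)
    N = len(S)
    A = [rng.randrange(len(s) - W + 1) for s in S]

    def theta(exclude):
        """Per-column base counts from all sequences except `exclude`."""
        counts = [{b: 0.5 for b in 'ACGT'} for _ in range(W)]
        for i, s in enumerate(S):
            if i != exclude:
                for j in range(W):
                    counts[j][s[A[i] + j]] += 1
        return counts

    for it in range(iters):
        z = it % N                      # round-robin choice of sequence
        counts = theta(z)
        totals = [sum(col.values()) for col in counts]
        # weight of each candidate start x: Q_x / P_x (uniform background P_x)
        weights = []
        for x in range(len(S[z]) - W + 1):
            q = 1.0
            for j in range(W):
                q *= counts[j][S[z][x + j]] / totals[j]
            weights.append(q / 0.25 ** W)
        A[z] = rng.choices(range(len(weights)), weights=weights)[0]
    return A
```

The full algorithm would additionally track the best-scoring At under F; this sketch simply returns the final state.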

51

Algorithm

Current solution At

52

Algorithm

Choose one az to replace

53

Algorithm

For each candidate site x in sequence z, calculate Qx and Px: the probabilities of sampling x from the motif model and the background model, respectively.

x

54

Algorithm

Among all possible candidates, choose one (say x) with probability proportional to Qx/Px

x

55

Algorithm

Set At+1 = A* ∪ {x}

x

56

Algorithm

Repeat

x

57

Local optima

• The algorithm may not find the “global” (true) maximum of the scoring function
• Once At contains many similar substrings, other substrings matching these will be chosen with higher probability
• The algorithm can “get locked” into a “local optimum”: all neighbors have poorer scores, hence there is a low chance of moving out of this solution

58

Phase shifts

• After every M iterations, compare the current At with the alignments obtained by shifting every aligned substring ai by some amount, either to the left or to the right

59

Phase shift

60

Phase shift

61

Pattern Width

• The algorithm described so far requires the pattern width W as input.

• We can modify the algorithm so that it executes for a range of plausible widths.

• The function F is not immediately useful for this purpose, as its optimal value always increases with increasing W.

62

Pattern Width

• Another function based on the incomplete-data log-probability ratio G can be used.

• Dividing G by the number of free parameters needed to specify the pattern (19W in the case of proteins) produced a statistic useful for choosing pattern width. This quantity can be called information per parameter.

63

Time complexity analysis

• For a typical protein sequence it was found that, for a single pattern width, each input sequence needs to be sampled fewer than T = 100 times before convergence.

• L·W multiplications are performed in step (2) of the algorithm, where L is the sequence length.

• Total multiplications to execute the algorithm ≈ T · N · Lavg · W

• Linear time complexity has been observed in applications.

64

Motif finding

• The Gibbs sampling algorithm was originally applied to find motifs in amino acid sequences
  – Protein motifs represent common sequence patterns in proteins that are related to particular structures and functions of the protein

• Gibbs sampling is also extensively used to find motifs in DNA sequences, i.e., transcription factor binding sites

65

Advantages / Disadvantages

• Very similar to EM

Advantages:
• Easier to implement
• Less dependent on initial parameters
• More versatile; easier to enhance with heuristics

Disadvantages:
• More dependent on all sequences exhibiting the motif
• Less systematic search of the initial parameter space

66

Repeats, and a Better Background Model

• Repeat DNA can be confused with a motif
  – Especially low-complexity repeats: CACACA…, AAAAA…, etc.

Solution: a more elaborate background model

0th order: B = { pA, pC, pG, pT }
1st order: B = { P(A|A), P(A|C), …, P(T|T) }
…
Kth order: B = { P(X | b1…bK); X, bi ∈ {A,C,G,T} }

This has been applied to both EM and Gibbs sampling (up to 3rd order).
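A k-th order background model can be trained by counting each base's occurrences after every k-long context. A minimal sketch (a pseudocount of 1 per cell is an assumption of this sketch):

```python
from collections import defaultdict

def train_background(seqs, k):
    """k-th order Markov background model:
    P(base | previous k bases), with a pseudocount of 1 per cell."""
    counts = defaultdict(lambda: {b: 1.0 for b in 'ACGT'})
    for s in seqs:
        for i in range(k, len(s)):
            counts[s[i - k:i]][s[i]] += 1
    return {ctx: {b: c / sum(col.values()) for b, c in col.items()}
            for ctx, col in counts.items()}

bg1 = train_background(["CACACACACACACACA"], 1)
# A 1st-order model captures the CA repeat: P(A | C) is high, so CACACA...
# windows no longer look motif-like against this background.
```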

67

Limits of Motif Finders

• Given upstream regions of coregulated genes:
  – Increasing their length makes motif finding harder: random motifs clutter the true ones
  – Decreasing their length makes motif finding harder: the true motif is missing in some sequences

68

Example Application: Motifs in Yeast

Group:

Tavazoie et al. 1999, G. Church’s lab, Harvard

Data:

• Microarrays on 6,220 yeast mRNAs, Affymetrix chips (Cho et al.)
• 15 time points across two cell cycles

1. Clustering genes according to common expression
• K-means clustering → 30 clusters, 50–190 genes/cluster
• Clusters correlate well with known function

2. AlignACE motif finding
• 600-bp-long upstream regions

69

Motifs in Periodic Clusters

70

Motifs in Non-periodic Clusters