4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

4. Finding Regulatory Motifs in

DNA Sequences

(Chapter 4 and 12)

Outline

• Implanting Patterns in Random Text

• Gene Regulation

• Regulatory Motifs

• The Motif Finding Problem

• Brute Force Motif Finding

• The Median String Problem

• Search Trees

• Branch-and-Bound Motif Search

• Branch-and-Bound Median String Search

• Consensus and Pattern Branching: Greedy Motif Search

• PMS: Exhaustive Motif Search

Identifying Motifs

Every gene contains a regulatory region (RR) typically stretching 100-1000 bp upstream of the transcriptional start site

Genes are turned on or off by regulatory proteins

• These proteins bind to upstream regulatory regions of genes to either attract or block an RNA polymerase

• Regulatory protein transcription factor (TF) binds to a short DNA sequence called a motif or (TFBS: transcription factor binding site)

• So finding the same motif in multiple genes‟ regulatory regions suggests a regulatory relationship amongst those genes

Identifying Motifs: Complications

• We do not know the motif sequence

• We do not know where it is located relative to the genes start

• Motifs can differ slightly from one gene to the next

• How to discern it from “random” motifs?

Transcription Factor Binding Sites

(TFBS)

• A TFBS can be located anywhere within

the Regulatory Region.

• TFBS may vary slightly across different

regulatory regions since non-essential

bases could mutate

Transcription for E.Coli

The transcription process in reality even for a simple E.Coli is much more

complex than what we have described. The process is divided into three

phases: initiation, elongation and termination.

Initiation: The RNA polymerase initiates the operation and it must

transcribe not any arbitrary part of DNA but only the gene. For this the

polymerase first „bind‟ to a location upstream of the gene. This site is

called a promoter sequence. The promoter is a short DNA sequence

which can be bind to the polymerase. In E.Coli, the promoter sequence

consists of two distinct sequences at a distance -10 and -35 upstream

from the position at which transcription starts. The actual sequences may

vary from gene to gene but they are related to the following two

consensus sequences both located in the non-template strand:

-35 box 5’-TTGACA-3’

-10 box 5’-TATAAT-3’

5’ 3’gene-10 box-35 box

3’

5’ gene-10 box-35 box

5’

3’

+1-10

+1

RNA polymerase

holoenzymeSigma

subunit

Closed promoter

complex

Open promoter

complex

Initiation:

First two

ribonucletides

-35

Core enzyme

Elongation

3’

5’

5’RNA Transcript

Direction

of transcription

+1

Steps of RNA Transcription

The sigma subunit within the polymerase recognizes the promoter sequence

and a closed promoter complex is formed. The enzyme covers about 60

bp of the double helix. In the next phase, the double helix starts „melting‟ at -

10 box unwinding the DNA into single strands in the region under the core

enzyme. The -10 box consistsof entirely AT pairs which have only two

hydrogen bonds for each bp. This makes it makes it easier to melting to take

place compared to the situation with CG base pairs which have 3 hydrogen

bonds. This configuration is called open promoter complex, the sigma

subunit ejects out of the holoenzyme converting it to a core enzyme. At the

same time, the first two ribonucleotides are sealed in the template strand at

positions +1 and +2 with a phosphodiester bond.

In the next elongation step, the polymerase moves downstream with

relative ease, unzipping the DNA molecule and attaching new

ribonucleotides to the 3‟ end of the growing RNA. At the same time, the DNA

behind it rebounds back to its double helix structure. The open promoter is

like a bubble that propagate to 3‟ direction always maintaining its size

between 12 to 17 DNA. Also, the rate of propagation is not constant, it may

slow down, pause, reaccelerate or even go backwards destroying the

ribonucleotides. The RNA itself is synthesized in the 5‟-3‟ direction. The

length of the actual transcript is longer than the length of the gene because

the +1 position is about 20 to 600 nucleotide upstream from the beginning of

the gene. This part of the RNA transcript is called a leader segment.

Similarly, the transcription extends beyond the gene creating a similar trailer

segment.

The termination of RNA transcription is signalled by the presence of a

complementary palindrome. ( A palindrome reads the same sequence in

both forward and backward direction viz ATAGCGATA )

Transcription in Eukaroytes

The transcription in Eukaryotes is similar to that in E.Coli but is much

more complex. The RNA polymerase has an attachment site rather

than a promoter sequence plus other promoter sequences distributed

over several hundred base pairs all upstream from the genes. These

promoters regulate the gene expression by turning on or off the

transcription process. Understanding these regulatory processes is

by itself a whole new research field.

The attachment site is referred to as -25 TATA box (5‟-TATAAAT-3‟)

and the RNA polymerase called RNA Polymerase II. The attachment

is helped by a set of proteins called transcription factors (TF II A,

TF II B , D, E and F) which ultimately makes the transcription

complex ready to start the synthesis process of RNA. The next slide

gives a schematic representation.

The exact details of the termination of the transcription process is not

very well understood. The termination trail seem to be longer ( about

1000 to 2000 bp downstream the gene. The exact termination point

is still a matter of research.

TATA gene

Regulatory factors

RNA Polymerase II

TF II A and D

TF II B

RNA Polymerase II

TF IIE

F

A D BE

F

Completed transcription

complex

gene

Motifs and Transcriptional Start Sites

geneATCCCG

geneTTCCGG

geneATCCCG

geneATGCCG

geneATGCCC

Motif Logo

• Motifs can mutate on

non important bases

• The five motifs in five

different genes have

mutations in position 3

and 5

• Representations called

motif logos illustrate the

conserved and variable

regions of a motif

TGGGGGA

TGAGAGA

TGGGGGA

TGAGAGA

TGAGGGA

Random Sampleatgaccgggatactgataccgtatttggcctaggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg

acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatactgggcataaggtaca

tgagtatccctgggatgacttttgggaacactatagtgctctcccgatttttgaatatgtaggatcattcgccagggtccga

gctgagaattggatgaccttgtaagtgttttccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga

tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatggcccacttagtccacttatag

gtcaatcatgttcttgtgaatggatttttaactgagggcatagaccgcttggcgcacccaaattcagtgtgggcgagcgcaa

cggttttggcccttgttagaggcccccgtactgatggaaactttcaattatgagagagctaatctatcgcgtgcgtgttcat

aacttgagttggtttcgaaaatgctctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta

ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatttcaacgtatgccgaaccgaaagggaag

ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttctgggtactgatagca

Implanting Motif AAAAAAAGGGGGGG

atgaccgggatactgatAAAAAAAAGGGGGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg

acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataAAAAAAAAGGGGGGGa

tgagtatccctgggatgacttAAAAAAAAGGGGGGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga

gctgagaattggatgAAAAAAAAGGGGGGGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga

tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAAAAAAAAGGGGGGGcttatag

gtcaatcatgttcttgtgaatggatttAAAAAAAAGGGGGGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa

cggttttggcccttgttagaggcccccgtAAAAAAAAGGGGGGGcaattatgagagagctaatctatcgcgtgcgtgttcat

aacttgagttAAAAAAAAGGGGGGGctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta

ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatAAAAAAAAGGGGGGGaccgaaagggaag

ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttAAAAAAAAGGGGGGGa

Where is the Implanted Motif?

atgaccgggatactgataaaaaaaagggggggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg

acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataaaaaaaaaggggggga

tgagtatccctgggatgacttaaaaaaaagggggggtgctctcccgatttttgaatatgtaggatcattcgccagggtccga

gctgagaattggatgaaaaaaaagggggggtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga

tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaataaaaaaaagggggggcttatag

gtcaatcatgttcttgtgaatggatttaaaaaaaaggggggggaccgcttggcgcacccaaattcagtgtgggcgagcgcaa

cggttttggcccttgttagaggcccccgtaaaaaaaagggggggcaattatgagagagctaatctatcgcgtgcgtgttcat

aacttgagttaaaaaaaagggggggctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta

ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcataaaaaaaagggggggaccgaaagggaag

ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttaaaaaaaaggggggga

Implanting Motif AAAAAAGGGGGGG

with Four Mutations

atgaccgggatactgatAgAAgAAAGGttGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg

acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacAAtAAAAcGGcGGGa

tgagtatccctgggatgacttAAAAtAAtGGaGtGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga

gctgagaattggatgcAAAAAAAGGGattGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga

tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAtAAtAAAGGaaGGGcttatag

gtcaatcatgttcttgtgaatggatttAAcAAtAAGGGctGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa

cggttttggcccttgttagaggcccccgtAtAAAcAAGGaGGGccaattatgagagagctaatctatcgcgtgcgtgttcat

aacttgagttAAAAAAtAGGGaGccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta

ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatActAAAAAGGaGcGGaccgaaagggaag

ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttActAAAAAGGaGcGGa

Where is the Motif???

atgaccgggatactgatagaagaaaggttgggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg

acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacaataaaacggcggga

tgagtatccctgggatgacttaaaataatggagtggtgctctcccgatttttgaatatgtaggatcattcgccagggtccga

gctgagaattggatgcaaaaaaagggattgtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga

tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatataataaaggaagggcttatag

gtcaatcatgttcttgtgaatggatttaacaataagggctgggaccgcttggcgcacccaaattcagtgtgggcgagcgcaa

cggttttggcccttgttagaggcccccgtataaacaaggagggccaattatgagagagctaatctatcgcgtgcgtgttcat

aacttgagttaaaaaatagggagccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta

ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatactaaaaaggagcggaccgaaagggaag

ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttactaaaaaggagcgga

Challenge Problem

– Find a motif in a sample of

- 20 “random” sequences (e.g. 600 nt long)

- each sequence containing an implanted

pattern of length 15,

- each pattern appearing with 4 mismatches

as (15,4)-motif.

Why Finding (15,4) Motif is Difficult?

atgaccgggatactgatAgAAgAAAGGttGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg

acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacAAtAAAAcGGcGGGa

tgagtatccctgggatgacttAAAAtAAtGGaGtGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga

gctgagaattggatgcAAAAAAAGGGattGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga

tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAtAAtAAAGGaaGGGcttatag

gtcaatcatgttcttgtgaatggatttAAcAAtAAGGGctGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa

cggttttggcccttgttagaggcccccgtAtAAAcAAGGaGGGccaattatgagagagctaatctatcgcgtgcgtgttcat

aacttgagttAAAAAAtAGGGaGccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta

ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatActAAAAAGGaGcGGaccgaaagggaag

ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttActAAAAAGGaGcGGa

AgAAgAAAGGttGGG

cAAtAAAAcGGcGGG..|..|||.|..|||

Combinatorial Gene Regulation

• A microarray experiment showed that

when gene X is knocked out, 20 other

genes are not expressed

– How can one gene have such drastic

effects?

The Motif Finding Problem

• Given a random sample of DNA sequences:

cctgatagacgctatctggctatccacgtacgtaggtcctctgtgcgaatctatgcgtttccaaccat

agtactggtgtacatttgatacgtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc

aaacgtacgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt

agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtacgtataca

ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaacgtacgtc

• Find the pattern that is implanted in each of the individual sequences, namely, the motif

The Motif Finding Problem (cont‟d)

• Additional information:

– The hidden sequence is of length 8

– The pattern is not exactly the same in each

array because random point mutations

may occur in the sequences


• The patterns revealed with no mutations:

cctgatagacgctatctggctatccacgtacgtaggtcctctgtgcgaatctatgcgtttccaaccat


aaacgtacgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt


ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaacgtacgtc

acgtacgt

Consensus String


• The patterns with 2 point mutations:

cctgatagacgctatctggctatccaGgtacTtaggtcctctgtgcgaatctatgcgtttccaaccat

agtactggtgtacatttgatCcAtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc

aaacgtTAgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt

agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtCcAtataca

ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaCcgtacgGc


• The patterns with 2 point mutations:






Can we still find the motif, now that we have 2 mutations?

Defining Motifs

• To define a motif, lets say we know where the motif starts in the sequence

• The motif start positions in their sequences can be represented as s = (s1,s2,s3,…,st)

Motifs: Profiles and Consensus

a G g t a c T tC c A t a c g t

Alignment a c g t T A g ta c g t C c A tC c g t a c g G

_________________

A 3 0 1 0 3 1 1 0Profile C 2 4 0 0 1 4 0 0

G 0 1 4 0 0 0 3 1T 0 0 0 5 1 0 1 4

_________________

Consensus A C G T A C G T

• Line up the patterns by their start indexes

s = (s1, s2, …, st)

• Construct matrix profile with frequencies of each nucleotide in columns

• Consensus nucleotide in each position has the highest score in column

Consensus

• Think of consensus as an “ancestor”

motif, from which mutated motifs emerged

• The distance between a real motif and the

consensus sequence is generally less

than that for two real motifs

Consensus (cont‟d)

Parameters






l = 8

t=5

s1 = 26 s2 = 21 s3= 3 s4 = 56 s5 = 60s

DNA

n = 69

1

2

3

t-1

tst

s1

s2

s3

st-1

A T C C A G C T

G G G C A A C T

A T G G A T C T

A A G C A A C C

T T G G A A C T

A T G C C A T T

A T G G C A C T

A 5 1 0 0 5 5 0 0

T 1 5 0 0 0 1 1 6

G 1 1 6 3 0 1 0 0

C 0 0 1 4 2 0 6 1

A T G C A A C T

Alignment Matrix

Profile matrix P(s)

Consensus Sequence

(“most popular”)

Given a group of t DNA sequences,

each of length n. Find most frequently

occurrng l-mers (which are close but differ

slightly due to mutations.)

Let s=(s1,s2,s3,…si,..st ) be the start positions in

t sequences, 1 <= si <= n-l+1. There are

(n-l+1)t such starting positions.

Highly conservedNot conserved

Definition:

The score of the alignment

for a specific set of positions

s in the t DNA sequences,

denoted as Score(s,DNA),

is the sum of maximum

numbers corresponding to

the consensus sequence. For

this example this sum is

5+5+6+4+5+5+6+6=42

The Motif Finding Problem: Brute Force Solution I

(data driven approach)

The maximum possible Score(s,DNA)= lt if each column has the same nucleotide and the minimum score is (lt/4) when each column has t/4 A,C,G and T.

– Compute the scores for each possible combination of starting positions s

– The best score will determine the best profile and the consensus pattern in DNA

– The goal is to maximize Score(s,DNA) by varying the starting positions si

Evaluating Motifs

• We have a guess about the consensus

sequence, but how “good” is this

consensus?

• Need to introduce a scoring function to

compare different guesses and choose the

“best” one.

Defining Some Terms

• t - number of sample DNA sequences

• n - length of each DNA sequence

• DNA - sample of DNA sequences (t xn

array)

• l - length of the motif (l-mer)

• si - starting position of an l-mer in

sequence i

• s=(s1, s2,… st) - array of motif‟s starting

positions

Scoring Motifs

• Given s = (s1, … st) and DNA:

Score(s,DNA) =

a G g t a c T t

C c A t a c g t

a c g t T A g t

a c g t C c A t

C c g t a c g G

_________________

A 3 0 1 0 3 1 1 0

C 2 4 0 0 1 4 0 0

G 0 1 4 0 0 0 3 1

T 0 0 0 5 1 0 1 4

_________________

Consensus a c g t a c g t

Score 3+4+4+5+3+4+3+4=30

l

t

l

i GCTAk

ikcount1 },,,{

),(max

The Motif Finding Problem

• If starting positions s=(s1, s2,… st) are

given, finding consensus is easy even with

mutations in the sequences because we

can simply construct the profile to find the

motif (consensus)

• But… the starting positions are usually

not given. How can we find the “best”

profile matrix?

The Motif Finding Problem: Formulation

• Goal: Given a set of DNA sequences, find a set of l-mers, one from each sequence, that maximizes the consensus score

• Input: A t x n matrix of DNA, and l the length of the pattern to find

• Output: An array of t starting positions s = (s1, s2, … st) maximizing Score(s,DNA)

BruteForceMotifSearch

1. BruteForceMotifSearch(DNA, t, n, l)

2. bestScore 0

3. for each s=(s1,s2 , . . ., st) from (1,1 . . . 1) to (n-l+1, . . ., n-l+1)

4. if (Score(s,DNA) > bestScore)

5. bestScore score(s, DNA)

6. bestMotif (s1,s2 , . . . , st)

7. return bestMotif

Line 3 above enumerates all possible (n-l+1)t t-tuples of

position indices in t DNA sequences. A systematic method to

generate these position indices will be

discussed later.

Running Time of BruteForceMotifSearch

• Varying (n - l + 1) positions in each of tsequences, we‟re looking at (n - l + 1)t sets of starting positions

• For each set of starting positions, the scoring function makes l operations, so complexity is l (n – l + 1)t = O(l nt)

• That means that for t = 8, n = 1000, l = 10 we must perform approximately 1020 computations – it will take billions years

The Median String Problem

• Given a set of t DNA sequences find a

pattern that appears in all t sequences

with the minimum number of mutations

• This pattern will be the motif

Another alternative view onto this problem

is to reframe the motif finding problem as the

Problem of finding a median string.

Hamming Distance

• Hamming distance:

– dH(v,w) is the number of nucleotide pairs

that do not match when v and w are

aligned. For example:

dH(AAAAAA,ACAAAC) = 2

Total Distance: Example

• Given v = “acgtacgt” and s

acgtacgt

cctgatagacgctatctggctatccacgtacAtaggtcctctgtgcgaatctatgcgtttccaaccat

acgtacgt


acgtacgt

aaaAgtCcgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt

acgtacgt


acgtacgt

ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaacgtaGgtc

v is the sequence in red, x is the sequence in blue

• TotalDistance(v,DNA) = 1+0+2+0+1 = 4

dH(v, x) = 2

dH(v, x) = 1

dH(v, x) = 0

dH(v, x) = 0

dH(v, x) = 1

Total Distance: Definition

– Given a l -mer v , for each DNA sequence i, compute all dH(v, x), where x is an l-mer with starting position si

(1 < si < n – l + 1)

– Find minimum of dH(v, x) among all l-mers in sequence i

– TotalDistance (v, DNA) is the sum of the minimum Hamming distances for each DNA sequence i

– TotalDistance (v, DNA) = mins dH(v, s), where s is the set of starting positions s1, s2,… st

mins dH(v, s)= min sum of Hamming distances for all choices of l -mers (there are 4l such l -mers ( min of all choices of positions s= s1, s2,… st))

The Median String Problem: Formulation

• Goal: Given a set of DNA sequences, find

a median string

• Input: A t x n matrix DNA, and l, the length

of the pattern to find

• Output: A string v of l nucleotides that

minimizes TotalDistance(v,DNA) over all

strings of that length

Median String Search Algorithm

1. BF_MedianStringSearch (DNA, t, n, l)

2. bestWord AAA…A

3. bestDistance ∞( or a large number)

4. for each l-mer word from AAA…A to

TTT…T if TotalDistance(word,DNA) <

bestDistance

bestDistanceTotalDistance(word,DNA)

1. bestWord word

2. return bestWord

Line 4 dominates the computational complexity: we

need to enumerate 4l DNA sequences and for each

such sequence ,compute the Hamming distance n-l+1

times taking l comparison operation: l(n-l+1)4l

which again is exponential.

Given the „word‟ , we can find

The TotalDistance(word , DNA) in a single pass over

DNA, that is in time O(nt), rather than considering all

possible combinations of start positions. Thus, BF

median search much less time compared to BF takes

O(4lnt) time motif search. Typical value of l=8 to 15

and length of upstream motif area is about 500 to

1000 nucleotides.

Motif Finding Problem == Median String Problem

• The Motif Finding is a maximization problem while

Median String is a minimization problem

• However, the Motif Finding problem and Median

String problem are computationally equivalent

• Need to show that minimizing

TotalDistance is equivalent to maximizing

Score

We are looking for the same thing

a G g t a c T t

C c A t a c g t

Alignment a c g t T A g t

a c g t C c A t

C c g t a c g G

_________________

A 3 0 1 0 3 1 1 0

Profile C 2 4 0 0 1 4 0 0

G 0 1 4 0 0 0 3 1

T 0 0 0 5 1 0 1 4

_________________

Consensus a c g t a c g t

Score 3+4+4+5+3+4+3+4

TotalDistance 2+1+1+0+2+1+2+1

Sum 5 5 5 5 5 5 5 5

• At any column iScorei + TotalDistancei = t

• Because there are l columns

Score + TotalDistance = l * t

• Rearranging:Score = l * t - TotalDistance

• l * t is constant. The minimization of the right side is equivalent to the maximization of the left side

l

t

Motif Finding Problem vs.

Median String Problem

• Why bother reformulating the Motif Finding

problem into the Median String problem?

– The Motif Finding Problem needs to

examine all the combinations for s. That is (n - l + 1)t combinations!!!

– The Median String Problem needs to

examine all 4l combinations for v. This

number is relatively smaller

Motif Finding: Improving the Running Time

Recall the BruteForceMotifSearch:

1. BruteForceMotifSearch(DNA, t, n, l)

2. bestScore 0

3. for each s=(s1,s2 , . . ., st) from (1,1 . . . 1) to (n-l+1, . . ., n-l+1)


5. bestScore Score(s, DNA)

6. bestMotif (s1,s2 , . . . , st)

7. return bestMotif

Structuring the Search

• How can we perform the line

for each s=(s1,s2 , . . ., st) from (1,1 . . . 1) to (n-l+1, . . ., n-l+1) ?

• We need a method for efficiently structuring

and navigating the many possible motifs

• This is not very different than exploring all t-

digit numbers

Median String: Improving the Running Time

1. MedianStringSearch (DNA, t, n, l)

2. bestWord AAA…A

3. bestDistance ∞

4. for each l-mer s from AAA…A to TTT…T

if TotalDistance(v,DNA) < bestDistance

5. bestDistanceTotalDistance(s,DNA)

6. bestWord v

7. return bestWord

While computing TotalDistance(v,DNA) , it implicitly determines the

start positions in the linear scan taking O(nt) time

Structuring the Search

– For BruteForce method, we need to sear all possible (n - l + 1)t start locations.

– For the Median String Problem we need to consider all 4l possible l-mers:

aa… aa

aa… ac

aa… ag

aa… at

.

.

tt… tt

How to organize this search?

Alternative Representation of the Search Space

• Let A = 1, C = 2, G = 3, T = 4

• Then the sequences from AA…A to TT…T become:

11…11

11…12

11…13

11…14

.

.

44…44

• Notice that the sequences above simply list all numbers as if we were counting on base 4 without using 0 as a digit.

l

Linked List

aa ac ag at ca cc cg ct ga gc gg gt ta tc tg tt

• Need to visit all the predecessors of a sequence

before visiting the sequence. It does not allow a subset of the l –mers to be searched based on

some special criteria (viz. Branch and Bound

Algorithm). A more efficient data structure is a

tree.

Start

Suppose l = 2

Linked List (cont‟d)

• Linked list is not the most efficient data structure for motif

finding

• Let‟s try grouping the sequences by their prefixes


A Search Tree with and k=4 showing all 2-

mers with alphabet (a,c,t,g). The number of levels equals l +1

a- c- g- t-


--

root

Level 0

Level 1

Level 2

The nodes at level j (0<=j<=j-1)

represent prefixes of the leaf nodes

of length j+1.

1111 1112 1121 1122 1211 1212 1221 1222 2111 2112 2121 2122 2211 2212 2221 2222

111- 112- 121- 122- 211- 212- 221- 222-

11-- 12-- 21-- 22--

1--- 2---

- - - -

1 2

1 1

11

1

11

1 1 1 1 1 1 1

2 2

2222

2 2 2 2 2 2 2 2

All 4-mers (l =4) with two letter alphabet (1,2) are represented by the leaves of the tree.

An internal node represents all the 4-mers in the leaf nodes of the subtree. The nodeat level v (0 <= v <=4) is designated by v-letter prefix of its children. There are 2 l +1

nodes in the tree.

Level=0

Level=1

Level=2

Level=3

Level=4

Analyzing Search Trees

• Characteristics of the search trees:

– The sequences are contained in its leaves

– The parent of a node is the prefix of its

children

– To represent all start positions in the Motif

Finding problem , we can construct a tree

with l +1=t +1 levels and each node having

k=n- l +1 children for each vertex.

– For Median String Problem we have all l –

mers but k=4

• How can we move through the tree?

1111 1112 1121 1122 1211 1212 1221 1222 2111 2112 2121 2122 2211 2212 2221 2222

111- 112- 121- 122- 211- 212- 221- 222-

11-- 12-- 21-- 22--

1--- 2---

- - - -

1 2

1 1

11

1

11

1 1 1 1 1 1 1

2 2

2222

2 2 2 2 2 2 2 2

All 4-mers (l =4) with two letter alphabet (1,2) are represented by the leaves of the tree.

An internal node represents all the 4-mers in the leaf nodes of the subtree. A nodeat level v (0 <= v <=3) is designated by v-letter prefix of its children. There are 2 l +1

nodes in the tree.

Level=0

Level=1

Level=2

Level=3

Level=4

Moving through the Search Trees

• Four common moves in a search tree that we are about to explore:

– Move to the next leaf

– Visit all the leaves

– Visit the next node

– Bypass the children of a node

– In general, we want to consider all k l

l -mers. For motif finding problem

k= n - l + 1. For median string problem k=4.

Moving through the Search Trees

• Four common moves in a search tree that we are about to explore:

– Move to the next leaf

– Visit all the leaves

– Visit the next node

– Bypass the children of a node

– In general, we want to consider all kL

L-mers. For motif finding problem

k= n - l + 1. For median string problem k=4.

Visit the Next Leaf

NextLeaf (a=((a1,a2,…aL), L, k)// a :the array of digits. L: length of the array (it same as l ). k :

max digit value//

1. a a+ 1 // radix k addition//2. if a=(1,1,….,1) exit // no next leaf //

else return a

(1,1, …,1,1) (4,4, …,3,3) (4,4, …,4,3) (1,1, …,1,2) (4,4, …,3,3) (4,4, …,4,4) (1,1, …,1,3) (4,4, …,3,4) (1,1, …,1,1) // no next leaf// (1,1, …,1,4) (4,4, …,4,1) //reset to 1111 //(1,1, …,2,1)…………….. …………….

Given a current leaf a , we need to compute the “next” leaf:

Visit the Next Leaf

1. NextLeaf( a,L, k ) // a :the array of digits2. for i L to 1 // L: length of the array3. if ai < k // k : max digit value4. ai ai + 15. return a6. ai 17. return a

This slide is same as the previous slide but it showsthe actual computations performed. If L were 10, it Would be like counting decimal numbers except it

uses digits 1 to 10 rather than 0 to 9

Given a current leaf a , we need to compute the “next” leaf:

NextLeaf (cont‟d)

• The algorithm is addition in radix k

• Increment the least significant digit

• “Carry the one” to the next digit position when the digit is at maximal value

NextLeaf: Example

• Moving to the next leaf:

1- 2- 3- 4-

11 12 13 14 21 22 23 24 31 32 33 34 41 42 43 44

--Current Location

NextLeaf: Example (cont‟d)

• Moving to the next leaf:

1- 2- 3- 4-

11 12 13 14 21 22 23 24 31 32 33 34 41 42 43 44

--Next Location

Visit All Leaves

• Printing all permutations in ascending

order:

1. AllLeaves(L,k) // L: length of the sequence

2. a (1,...,1) // k : max digit value

3. while forever // a : array of digits

4. output a

5. a NextLeaf(a,L,k)

6. if a = (1,...,1)

7. return

Visit All Leaves: Example

• Moving through all the leaves in order:

1- 2- 3- 4-

11 12 13 14 21 22 23 24 31 32 33 34 41 42 43 44

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

--

Order of steps

Depth First Search

• So we can search leaves

• How about searching all vertices of the tree?

• We can do this with a depth first search.

• Depth First Search (DFS) is a pre-order traversal of a tree. (other traversals are Inorder , Postorder)

• Preorder (v) // v is a node of the tree//• Output v

• If v has children

• Preorder (left child of v)

• Preorder (right child of v)

1111 1112 1121 1122 1211 1212 1221 1222 2111 2112 2121 2122 2211 2212 2221 2222

111- 112- 121- 122- 211- 212- 221- 222-

11-- 12-- 21-- 22--

1--- 2---

1

2

4

2

3

Level=0

Level=1

Level=2

Level=3

Level=45 6

7

8 9

10

11

12 13

14

15 16

17

19

18

20 21 23 24

25

27

26

28

29

30 31

Preorder (v) // v is a node of the tree//Output vIf v has children

Preorder (left child of v)Preorder (right child of v)

22

Visit the Next Vertex

1. NextVertex(a,i,L,k)// a : (a1,a2,.., ai …aL) - the array of digits. If i<L, it is

an internal vertex.

1. if i < L // i : prefix length2. a i+1 1 // L: max length3. return ( a,i+1) // k : max digit value4. else5. for j L to 16. if aj < k 7. aj aj +18. return( a,j )9. return(a,0)

Explanation

When i <L , NextVertex moves down to next the

lower level and explores that subtree of a . The

example in two slides earlier show how the

vertices are traversed if the initial vertex specified

is 12– (the blue path)

If i=L, NextVertex either moves along the lowest

level until the L-th symbol becomes k and then

jumps back to the right child of the initial vertex if

it exists or its right sibling node. It returns to the

root (a,0) after it reaches the vertex (k,k,…,k).

Example

• Moving to the next vertex:

1- 2- 3- 4-

11 12 13 14 21 22 23 24 31 32 33 34 41 42 43 44

--Current Location

Example

• Moving to the next vertices:

1- 2- 3- 4-

11 12 13 14 21 22 23 24 31 32 33 34 41 42 43 44

--

Location after 5

next vertex moves

Branch-and-Bound Search

This approach will allow us to skip any children

or grandchildren or great grandchildren etc if

these descendants cannot provide a better

score than the best leaf that has already been

explored.

At each vertex we calculate the most optimistic

bound or score of any leaf in the subtree

rooted at that vertex and then decide whether

to branch further or not. This is the reason why

this approach is called Branch-and-Bound .

Branch and Bound Algorithm for Motif Search

• Since each level of the

tree goes deeper into

search, discarding a prefix

discards all following

branchesThis saves us from looking at (n – l + 1)t-i

or 2L-i leaves.

• NextVertex() is not up to the

task.

– use ByPass() to navigate the

tree

Bypass Move

• Given a prefix (internal vertex), find next vertex after

skipping all its children. If we skip a vertex at level i of

the tree, we can just increment ai (unless all vertices at

level i has been traversed and we return to the root

vertex). The algorithm looks very similar to NextLeaf

but differ in a subtle way.

1. Bypass(a,i,L,k) // a: array of digits

2. for j i to 1 // i : prefix length

3. if aj < k // L: maximum length

4. aj aj +1 // k : max digit value

5. return(a, j )

6. return(a,0)

Bypass Move: Example

• Bypassing the descendants of “2-”:

1- 2- 3- 4-

11 12 13 14 21 22 23 24 31 32 33 34 41 42 43 44

--Current Location

Example

• Bypassing the descendants of “2-”:

1- 2- 3- 4-

11 12 13 14 21 22 23 24 31 32 33 34 41 42 43 44

--Next Location

Revisiting Search Tree

--

Score: 24 15 3 10 15 5 11 7 18 14 22 17 30 25 8 16

24 15 22 30

30

If the scores of the leaf nodes are as given below then it does not make

Any sense to explore any subtree whose score is less than 24 (the red nodes)

Brute Force Search Again1. BruteForceMotifSearchAgain(DNA, t, n, l)

2. s (1,1,…, 1)

3. bestScore Score(s,DNA)

4. while forever

5. s NextLeaf (s, t, n- l +1) [Enumeration of (1,1,..,1) to (n-l+1,…,n-l+1)]



8. bestMotif (s1,s2 , . . . , st)

9. if s=(1,1,…,1)

10. return bestMotif

As we discussed earlier, we have to explore (n-l +1)t positions. For each position vector

s= s1,s2,…,st, the algorithm calculates Score(s,DNA) which takes O(l )operations. Thus,

the overall complexity of the algorithm is O(l nt).

Can We Do Better?• A set of start locations s=(s1, s2, …,st) may have a weak profile for

the first i positions (s1, s2, …,si). Then, it is not necessary to consider ant starting positions beginning si-1,si+1, …,st

• Define partial consensus score, Score(s,i,DNA) to be the consensus score of the ixl alignment matrix involving the first i rows of DNA corresponding to starting positions (s1, s2, …,si -,-..-).

• Every row of alignment may add at most l to Score. All subsequent (t-i) positions (si+1, …st) could add in the most optimistic situation

(t – i ) * l to Score(s,i,DNA)

• If Score(s,i,DNA) + (t – i ) * l < BestScore, it makes no sense to explore the remaining (t-i) positions in this choice of positions (s1, s2, …,si -,-..-).

• Using ByPass() , we could save looking at (n-l +1)t –i leaves

Pseudocode for Branch and Bound Motif Search

1. BranchAndBoundMotifSearch(DNA,t,n,l )

2. s (1,…,1)

3. bestScore 0

4. i 1

5. while i > 0

6. if i < t

7. optimisticScore Score(s, i, DNA) +(t – i ) * l

8. if optimisticScore < bestScore

9. (s, i) Bypass(s, i, t, n-l +1) //parameters (a, i, L, k) //

10. else

11. (s, i) NextVertex(s, i, t, n-l +1) //parameters (a, i, L, k)

12. else

13. if Score(s,DNA) > bestScore


15. bestMotif (s1, s2, s3, …, st)

16. (s, i) NextVertex(s, i, t, n-l + 1) //parameters (a, i, L, k)

17. return bestMotif

Median String Search Improvements

• Recall the computational differences between motif

search and median string search

– The Motif Finding Problem needs to examine all (n-l +1)t combinations for s.

– The Median String Problem needs to examine

4l combinations of v. This number is relatively

small

• We want to use median string algorithm with the

Branch and Bound trick!

Branch and Bound Applied to Median

String Search

• Note that if the total distance for a prefix is

greater than that for the best word so far:

TotalDistance (prefix, DNA) > BestDistance

there is no use exploring the remaining

part of the word

• We can eliminate that branch and

BYPASS exploring that branch further.

Level=0

Level=1

Level=2

Level=3

Level=4

A--- T--- C--- G---

AA--

AT-- AG--

AG-- GA--

GT-- GC--

GG--

AAA- AAG-

AAAA

GGA- GGG-

GGGG

Not all nodes are marked, not all child nodes are drawn

A vertex at level i in this tree represents a nucleotide string of length i which

Can be viewed as the i-long prefix of every leaf below that vertex. The

SIMPLE_MEDIAN_SEARCH pseudocode given in next page maps A,C,G,T to

numerals 1,2,3,4 respectively.

Pseudocode for Branch and Bound Median SearchSIMPLE_MEDIAN_SEARCH(DNA, t, n, l )

1. s (1,…,1)

2. bestDistance a very large number

3. i 1

4. while i > 0

5. if i < l

6. (s, i) NextVertex(s, i, l ,4) //parameters (a, i, L, k)

7. else

8. word nucleotide string corresponding to (s1,s2,…,s l )

9. if TotalDistance(word,DNA)< bestDiatance

10. bestDistance TotalDistance(word,DNA)

11. bestWord word

12. (s, i) NextVertex(s, i, l ,4) //parameters (a, i, L, k)

13. return bestWord

We find bound for TotalDistance(word,DNA) at each vertex. If the total

distance between the i-prefix of the word and DNA is larger than the smallest seen so far for one of the leaves (nucleotide strings of length l ), then there is

no point investigating subtrees of the vertex corresponding to that i-prefix of the word ; all extensions of this prefix into an l –mer will have at least the

same or probably more total distance. Of course, there could be some

extensions to the prefix that matches every string in the sample which will

make the value of total distance to be 0.

Bounded Median String Search1. BranchAndBoundMedianStringSearch(DNA,t,n,l )2. s (1,…,1)3. bestDistance ∞4. i 15. while i > 06. if i < l7. prefix string corresponding to the first i nucleotides of s8. optimisticDistance TotalDistance(prefix,DNA)9. if optimisticDistance > bestDistance10. (s, i ) Bypass(s,i, l, 4)11. else12. (s, i ) NextVertex(s, i, l, 4)13. else 14. word nucleotide string corresponding to s15. if TotalDistance(s,DNA) < bestDistance16. bestDistance TotalDistance(word, DNA)17. bestWord word18. (s,i ) NextVertex(s,i,l, 4)19. return bestWord

Improving the Bounds

• Given an l-mer w, divided into two parts at point i

– u : prefix w1, …, wi,

– v : suffix wi+1, ..., wl

• Find minimum distance for u in a sequence

• No instances of u in the sequence have distance less than the minimum distance

• Note this doesn‟t tell us anything about whether u is part of any motif. We only get a minimum distance for prefix u

Improving the Bounds (cont‟d)

• Repeating the process for the suffix v

gives us a minimum distance for v

• Since u and v are two substrings of w,

and included in motif w, we can assume

that the minimum distance of u plus

minimum distance of v can only be less

than the minimum distance for w

Better Bounds

Better Bounds (cont‟d)

• If d(prefix) + d(suffix) > bestDistance:

– Motif w (prefix.suffix) cannot give a better

(lower) score than d(prefix) + d(suffix)

– In this case, we can ByPass()

Better Bounded Median String Search1. ImprovedBranchAndBoundMedianString(DNA,t,n,l)

2. s = (1, 1, …, 1)

3. bestdistance = ∞

4. i = 1

5. while i > 0

6. if i < l

7. prefix = nucleotide string corresponding to (s1, s2, s3, …, si )

8. optimisticPrefixDistance = TotalDistance (prefix, DNA)

9. if (optimisticPrefixDistance < bestsubstring[ i ])

10. bestsubstring[ i ] = optimisticPrefixDistance

11. if (l - i < i )

12. optimisticSufxDistance = bestsubstring[l -i ]

13. else

14. optimisticSufxDistance = 0;

15. if optimisticPrefixDistance + optimisticSufxDistance > bestDistance

16. (s, i ) = Bypass(s, i, l, 4)

17. else

18. (s, i ) = NextVertex(s, i, l,4)

19. else

20. word = nucleotide string corresponding to (s1,s2, s3, …, st)

21. if TotalDistance( word, DNA) < bestDistance

22. bestDistance = TotalDistance(word, DNA)

23. bestWord = word

24. (s,i) = NextVertex(s, i,l, 4)

25. return bestWord

CONSENSUS: Greedy Motif Search

• Find two closest l-mers in sequences 1 and 2 and forms

2 x l alignment matrix with Score(s,2,DNA)

• At each of the following t-2 iterations CONSENSUS finds a “best” l-mer in sequence i from the perspective of the already constructed (i-1) x l alignment matrix for the first (i-1) sequences

• In other words, it finds an l-mer in sequence i maximizing

Score(s,i,DNA)

under the assumption that the first (i-1) l-mers have been already chosen

• CONSENSUS sacrifices optimal solution for speed: in fact the bulk of the time is actually spent locating the first 2 l-mers

Some Motif Finding Programs

• CONSENSUS

Hertz, Stromo (1989)

• GibbsDNA

Lawrence et al (1993)

• MEME

Bailey, Elkan (1995)

• RandomProjections

Buhler, Tompa (2002)

• MULTIPROFILER

Keich, Pevzner (2002)

• MITRA

Eskin, Pevzner (2002)

• Pattern Branching

Price, Pevzner (2003)

• PRUNER II

Ravi Vijaya Satya and

A. Mukherjee (2005)

4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

Documents

Transcript of 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)