4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

97
4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

Transcript of 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

Page 1: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

4. Finding Regulatory Motifs in

DNA Sequences

(Chapter 4 and 12)

Page 2: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

Outline

• Implanting Patterns in Random Text

• Gene Regulation

• Regulatory Motifs

• The Motif Finding Problem

• Brute Force Motif Finding

• The Median String Problem

• Search Trees

• Branch-and-Bound Motif Search

• Branch-and-Bound Median String Search

• Consensus and Pattern Branching: Greedy Motif Search

• PMS: Exhaustive Motif Search

Page 3: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

Identifying Motifs

Every gene contains a regulatory region (RR) typically stretching 100-1000 bp upstream of the transcriptional start site

Genes are turned on or off by regulatory proteins

• These proteins bind to upstream regulatory regions of genes to either attract or block an RNA polymerase

• Regulatory protein transcription factor (TF) binds to a short DNA sequence called a motif or (TFBS: transcription factor binding site)

• So finding the same motif in multiple genes‟ regulatory regions suggests a regulatory relationship amongst those genes

Page 4: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

Identifying Motifs: Complications

• We do not know the motif sequence

• We do not know where it is located relative to the genes start

• Motifs can differ slightly from one gene to the next

• How to discern it from “random” motifs?

Page 5: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

Transcription Factor Binding Sites

(TFBS)

• A TFBS can be located anywhere within

the Regulatory Region.

• TFBS may vary slightly across different

regulatory regions since non-essential

bases could mutate

Page 6: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

Transcription for E.Coli

The transcription process in reality even for a simple E.Coli is much more

complex than what we have described. The process is divided into three

phases: initiation, elongation and termination.

Initiation: The RNA polymerase initiates the operation and it must

transcribe not any arbitrary part of DNA but only the gene. For this the

polymerase first „bind‟ to a location upstream of the gene. This site is

called a promoter sequence. The promoter is a short DNA sequence

which can be bind to the polymerase. In E.Coli, the promoter sequence

consists of two distinct sequences at a distance -10 and -35 upstream

from the position at which transcription starts. The actual sequences may

vary from gene to gene but they are related to the following two

consensus sequences both located in the non-template strand:

-35 box 5’-TTGACA-3’

-10 box 5’-TATAAT-3’

5’ 3’gene-10 box-35 box

Page 7: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

3’

5’ gene-10 box-35 box

5’

3’

+1-10

+1

RNA polymerase

holoenzymeSigma

subunit

Closed promoter

complex

Open promoter

complex

Initiation:

First two

ribonucletides

-35

Core enzyme

Elongation

3’

5’

5’RNA Transcript

Direction

of transcription

+1

Page 8: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

Steps of RNA Transcription

The sigma subunit within the polymerase recognizes the promoter sequence

and a closed promoter complex is formed. The enzyme covers about 60

bp of the double helix. In the next phase, the double helix starts „melting‟ at -

10 box unwinding the DNA into single strands in the region under the core

enzyme. The -10 box consistsof entirely AT pairs which have only two

hydrogen bonds for each bp. This makes it makes it easier to melting to take

place compared to the situation with CG base pairs which have 3 hydrogen

bonds. This configuration is called open promoter complex, the sigma

subunit ejects out of the holoenzyme converting it to a core enzyme. At the

same time, the first two ribonucleotides are sealed in the template strand at

positions +1 and +2 with a phosphodiester bond.

In the next elongation step, the polymerase moves downstream with

relative ease, unzipping the DNA molecule and attaching new

ribonucleotides to the 3‟ end of the growing RNA. At the same time, the DNA

behind it rebounds back to its double helix structure. The open promoter is

Page 9: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

like a bubble that propagate to 3‟ direction always maintaining its size

between 12 to 17 DNA. Also, the rate of propagation is not constant, it may

slow down, pause, reaccelerate or even go backwards destroying the

ribonucleotides. The RNA itself is synthesized in the 5‟-3‟ direction. The

length of the actual transcript is longer than the length of the gene because

the +1 position is about 20 to 600 nucleotide upstream from the beginning of

the gene. This part of the RNA transcript is called a leader segment.

Similarly, the transcription extends beyond the gene creating a similar trailer

segment.

The termination of RNA transcription is signalled by the presence of a

complementary palindrome. ( A palindrome reads the same sequence in

both forward and backward direction viz ATAGCGATA )

Page 10: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

Transcription in Eukaroytes

The transcription in Eukaryotes is similar to that in E.Coli but is much

more complex. The RNA polymerase has an attachment site rather

than a promoter sequence plus other promoter sequences distributed

over several hundred base pairs all upstream from the genes. These

promoters regulate the gene expression by turning on or off the

transcription process. Understanding these regulatory processes is

by itself a whole new research field.

The attachment site is referred to as -25 TATA box (5‟-TATAAAT-3‟)

and the RNA polymerase called RNA Polymerase II. The attachment

is helped by a set of proteins called transcription factors (TF II A,

TF II B , D, E and F) which ultimately makes the transcription

complex ready to start the synthesis process of RNA. The next slide

gives a schematic representation.

The exact details of the termination of the transcription process is not

very well understood. The termination trail seem to be longer ( about

1000 to 2000 bp downstream the gene. The exact termination point

is still a matter of research.

Page 11: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

TATA gene

Regulatory factors

RNA Polymerase II

TF II A and D

TF II B

RNA Polymerase II

TF IIE

F

A D BE

F

Completed transcription

complex

gene

Page 12: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

Motifs and Transcriptional Start Sites

geneATCCCG

geneTTCCGG

geneATCCCG

geneATGCCG

geneATGCCC

Page 13: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

Motif Logo

• Motifs can mutate on

non important bases

• The five motifs in five

different genes have

mutations in position 3

and 5

• Representations called

motif logos illustrate the

conserved and variable

regions of a motif

TGGGGGA

TGAGAGA

TGGGGGA

TGAGAGA

TGAGGGA

Page 14: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

Random Sampleatgaccgggatactgataccgtatttggcctaggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg

acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatactgggcataaggtaca

tgagtatccctgggatgacttttgggaacactatagtgctctcccgatttttgaatatgtaggatcattcgccagggtccga

gctgagaattggatgaccttgtaagtgttttccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga

tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatggcccacttagtccacttatag

gtcaatcatgttcttgtgaatggatttttaactgagggcatagaccgcttggcgcacccaaattcagtgtgggcgagcgcaa

cggttttggcccttgttagaggcccccgtactgatggaaactttcaattatgagagagctaatctatcgcgtgcgtgttcat

aacttgagttggtttcgaaaatgctctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta

ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatttcaacgtatgccgaaccgaaagggaag

ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttctgggtactgatagca

Page 15: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

Implanting Motif AAAAAAAGGGGGGG

atgaccgggatactgatAAAAAAAAGGGGGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg

acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataAAAAAAAAGGGGGGGa

tgagtatccctgggatgacttAAAAAAAAGGGGGGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga

gctgagaattggatgAAAAAAAAGGGGGGGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga

tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAAAAAAAAGGGGGGGcttatag

gtcaatcatgttcttgtgaatggatttAAAAAAAAGGGGGGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa

cggttttggcccttgttagaggcccccgtAAAAAAAAGGGGGGGcaattatgagagagctaatctatcgcgtgcgtgttcat

aacttgagttAAAAAAAAGGGGGGGctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta

ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatAAAAAAAAGGGGGGGaccgaaagggaag

ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttAAAAAAAAGGGGGGGa

Page 16: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

Where is the Implanted Motif?

atgaccgggatactgataaaaaaaagggggggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg

acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataaaaaaaaaggggggga

tgagtatccctgggatgacttaaaaaaaagggggggtgctctcccgatttttgaatatgtaggatcattcgccagggtccga

gctgagaattggatgaaaaaaaagggggggtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga

tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaataaaaaaaagggggggcttatag

gtcaatcatgttcttgtgaatggatttaaaaaaaaggggggggaccgcttggcgcacccaaattcagtgtgggcgagcgcaa

cggttttggcccttgttagaggcccccgtaaaaaaaagggggggcaattatgagagagctaatctatcgcgtgcgtgttcat

aacttgagttaaaaaaaagggggggctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta

ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcataaaaaaaagggggggaccgaaagggaag

ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttaaaaaaaaggggggga

Page 17: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

Implanting Motif AAAAAAGGGGGGG

with Four Mutations

atgaccgggatactgatAgAAgAAAGGttGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg

acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacAAtAAAAcGGcGGGa

tgagtatccctgggatgacttAAAAtAAtGGaGtGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga

gctgagaattggatgcAAAAAAAGGGattGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga

tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAtAAtAAAGGaaGGGcttatag

gtcaatcatgttcttgtgaatggatttAAcAAtAAGGGctGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa

cggttttggcccttgttagaggcccccgtAtAAAcAAGGaGGGccaattatgagagagctaatctatcgcgtgcgtgttcat

aacttgagttAAAAAAtAGGGaGccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta

ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatActAAAAAGGaGcGGaccgaaagggaag

ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttActAAAAAGGaGcGGa

Page 18: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

Where is the Motif???

atgaccgggatactgatagaagaaaggttgggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg

acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacaataaaacggcggga

tgagtatccctgggatgacttaaaataatggagtggtgctctcccgatttttgaatatgtaggatcattcgccagggtccga

gctgagaattggatgcaaaaaaagggattgtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga

tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatataataaaggaagggcttatag

gtcaatcatgttcttgtgaatggatttaacaataagggctgggaccgcttggcgcacccaaattcagtgtgggcgagcgcaa

cggttttggcccttgttagaggcccccgtataaacaaggagggccaattatgagagagctaatctatcgcgtgcgtgttcat

aacttgagttaaaaaatagggagccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta

ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatactaaaaaggagcggaccgaaagggaag

ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttactaaaaaggagcgga

Page 19: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

Challenge Problem

– Find a motif in a sample of

- 20 “random” sequences (e.g. 600 nt long)

- each sequence containing an implanted

pattern of length 15,

- each pattern appearing with 4 mismatches

as (15,4)-motif.

Page 20: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

Why Finding (15,4) Motif is Difficult?

atgaccgggatactgatAgAAgAAAGGttGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg

acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacAAtAAAAcGGcGGGa

tgagtatccctgggatgacttAAAAtAAtGGaGtGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga

gctgagaattggatgcAAAAAAAGGGattGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga

tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAtAAtAAAGGaaGGGcttatag

gtcaatcatgttcttgtgaatggatttAAcAAtAAGGGctGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa

cggttttggcccttgttagaggcccccgtAtAAAcAAGGaGGGccaattatgagagagctaatctatcgcgtgcgtgttcat

aacttgagttAAAAAAtAGGGaGccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta

ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatActAAAAAGGaGcGGaccgaaagggaag

ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttActAAAAAGGaGcGGa

AgAAgAAAGGttGGG

cAAtAAAAcGGcGGG..|..|||.|..|||

Page 21: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

Combinatorial Gene Regulation

• A microarray experiment showed that

when gene X is knocked out, 20 other

genes are not expressed

– How can one gene have such drastic

effects?

Page 22: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

The Motif Finding Problem

• Given a random sample of DNA sequences:

cctgatagacgctatctggctatccacgtacgtaggtcctctgtgcgaatctatgcgtttccaaccat

agtactggtgtacatttgatacgtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc

aaacgtacgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt

agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtacgtataca

ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaacgtacgtc

• Find the pattern that is implanted in each of the individual sequences, namely, the motif

Page 23: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

The Motif Finding Problem (cont‟d)

• Additional information:

– The hidden sequence is of length 8

– The pattern is not exactly the same in each

array because random point mutations

may occur in the sequences

Page 24: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

The Motif Finding Problem (cont‟d)

• The patterns revealed with no mutations:

cctgatagacgctatctggctatccacgtacgtaggtcctctgtgcgaatctatgcgtttccaaccat

agtactggtgtacatttgatacgtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc

aaacgtacgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt

agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtacgtataca

ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaacgtacgtc

acgtacgt

Consensus String

Page 25: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

The Motif Finding Problem (cont‟d)

• The patterns with 2 point mutations:

cctgatagacgctatctggctatccaGgtacTtaggtcctctgtgcgaatctatgcgtttccaaccat

agtactggtgtacatttgatCcAtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc

aaacgtTAgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt

agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtCcAtataca

ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaCcgtacgGc

Page 26: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

The Motif Finding Problem (cont‟d)

• The patterns with 2 point mutations:

cctgatagacgctatctggctatccaGgtacTtaggtcctctgtgcgaatctatgcgtttccaaccat

agtactggtgtacatttgatCcAtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc

aaacgtTAgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt

agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtCcAtataca

ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaCcgtacgGc

Can we still find the motif, now that we have 2 mutations?

Page 27: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

Defining Motifs

• To define a motif, lets say we know where the motif starts in the sequence

• The motif start positions in their sequences can be represented as s = (s1,s2,s3,…,st)

Page 28: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

Motifs: Profiles and Consensus

a G g t a c T tC c A t a c g t

Alignment a c g t T A g ta c g t C c A tC c g t a c g G

_________________

A 3 0 1 0 3 1 1 0Profile C 2 4 0 0 1 4 0 0

G 0 1 4 0 0 0 3 1T 0 0 0 5 1 0 1 4

_________________

Consensus A C G T A C G T

• Line up the patterns by their start indexes

s = (s1, s2, …, st)

• Construct matrix profile with frequencies of each nucleotide in columns

• Consensus nucleotide in each position has the highest score in column

Page 29: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

Consensus

• Think of consensus as an “ancestor”

motif, from which mutated motifs emerged

• The distance between a real motif and the

consensus sequence is generally less

than that for two real motifs

Page 30: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

Consensus (cont‟d)

Page 31: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

Parameters

cctgatagacgctatctggctatccaGgtacTtaggtcctctgtgcgaatctatgcgtttccaaccat

agtactggtgtacatttgatCcAtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc

aaacgtTAgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt

agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtCcAtataca

ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaCcgtacgGc

l = 8

t=5

s1 = 26 s2 = 21 s3= 3 s4 = 56 s5 = 60s

DNA

n = 69

Page 32: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

1

2

3

t-1

tst

s1

s2

s3

st-1

A T C C A G C T

G G G C A A C T

A T G G A T C T

A A G C A A C C

T T G G A A C T

A T G C C A T T

A T G G C A C T

A 5 1 0 0 5 5 0 0

T 1 5 0 0 0 1 1 6

G 1 1 6 3 0 1 0 0

C 0 0 1 4 2 0 6 1

A T G C A A C T

Alignment Matrix

Profile matrix P(s)

Consensus Sequence

(“most popular”)

Given a group of t DNA sequences,

each of length n. Find most frequently

occurrng l-mers (which are close but differ

slightly due to mutations.)

Let s=(s1,s2,s3,…si,..st ) be the start positions in

t sequences, 1 <= si <= n-l+1. There are

(n-l+1)t such starting positions.

Highly conservedNot conserved

Definition:

The score of the alignment

for a specific set of positions

s in the t DNA sequences,

denoted as Score(s,DNA),

is the sum of maximum

numbers corresponding to

the consensus sequence. For

this example this sum is

5+5+6+4+5+5+6+6=42

Page 33: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

The Motif Finding Problem: Brute Force Solution I

(data driven approach)

The maximum possible Score(s,DNA)= lt if each column has the same nucleotide and the minimum score is (lt/4) when each column has t/4 A,C,G and T.

– Compute the scores for each possible combination of starting positions s

– The best score will determine the best profile and the consensus pattern in DNA

– The goal is to maximize Score(s,DNA) by varying the starting positions si

Page 34: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

Evaluating Motifs

• We have a guess about the consensus

sequence, but how “good” is this

consensus?

• Need to introduce a scoring function to

compare different guesses and choose the

“best” one.

Page 35: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

Defining Some Terms

• t - number of sample DNA sequences

• n - length of each DNA sequence

• DNA - sample of DNA sequences (t xn

array)

• l - length of the motif (l-mer)

• si - starting position of an l-mer in

sequence i

• s=(s1, s2,… st) - array of motif‟s starting

positions

Page 36: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

Scoring Motifs

• Given s = (s1, … st) and DNA:

Score(s,DNA) =

a G g t a c T t

C c A t a c g t

a c g t T A g t

a c g t C c A t

C c g t a c g G

_________________

A 3 0 1 0 3 1 1 0

C 2 4 0 0 1 4 0 0

G 0 1 4 0 0 0 3 1

T 0 0 0 5 1 0 1 4

_________________

Consensus a c g t a c g t

Score 3+4+4+5+3+4+3+4=30

l

t

l

i GCTAk

ikcount1 },,,{

),(max

Page 37: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

The Motif Finding Problem

• If starting positions s=(s1, s2,… st) are

given, finding consensus is easy even with

mutations in the sequences because we

can simply construct the profile to find the

motif (consensus)

• But… the starting positions are usually

not given. How can we find the “best”

profile matrix?

Page 38: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

The Motif Finding Problem: Formulation

• Goal: Given a set of DNA sequences, find a set of l-mers, one from each sequence, that maximizes the consensus score

• Input: A t x n matrix of DNA, and l the length of the pattern to find

• Output: An array of t starting positions s = (s1, s2, … st) maximizing Score(s,DNA)

Page 39: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

BruteForceMotifSearch

1. BruteForceMotifSearch(DNA, t, n, l)

2. bestScore 0

3. for each s=(s1,s2 , . . ., st) from (1,1 . . . 1) to (n-l+1, . . ., n-l+1)

4. if (Score(s,DNA) > bestScore)

5. bestScore score(s, DNA)

6. bestMotif (s1,s2 , . . . , st)

7. return bestMotif

Line 3 above enumerates all possible (n-l+1)t t-tuples of

position indices in t DNA sequences. A systematic method to

generate these position indices will be

discussed later.

Page 40: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

Running Time of BruteForceMotifSearch

• Varying (n - l + 1) positions in each of tsequences, we‟re looking at (n - l + 1)t sets of starting positions

• For each set of starting positions, the scoring function makes l operations, so complexity is l (n – l + 1)t = O(l nt)

• That means that for t = 8, n = 1000, l = 10 we must perform approximately 1020 computations – it will take billions years

Page 41: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

The Median String Problem

• Given a set of t DNA sequences find a

pattern that appears in all t sequences

with the minimum number of mutations

• This pattern will be the motif

Another alternative view onto this problem

is to reframe the motif finding problem as the

Problem of finding a median string.

Page 42: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

Hamming Distance

• Hamming distance:

– dH(v,w) is the number of nucleotide pairs

that do not match when v and w are

aligned. For example:

dH(AAAAAA,ACAAAC) = 2

Page 43: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

Total Distance: Example

• Given v = “acgtacgt” and s

acgtacgt

cctgatagacgctatctggctatccacgtacAtaggtcctctgtgcgaatctatgcgtttccaaccat

acgtacgt

agtactggtgtacatttgatacgtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc

acgtacgt

aaaAgtCcgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt

acgtacgt

agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtacgtataca

acgtacgt

ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaacgtaGgtc

v is the sequence in red, x is the sequence in blue

• TotalDistance(v,DNA) = 1+0+2+0+1 = 4

dH(v, x) = 2

dH(v, x) = 1

dH(v, x) = 0

dH(v, x) = 0

dH(v, x) = 1

Page 44: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

Total Distance: Definition

– Given a l -mer v , for each DNA sequence i, compute all dH(v, x), where x is an l-mer with starting position si

(1 < si < n – l + 1)

– Find minimum of dH(v, x) among all l-mers in sequence i

– TotalDistance (v, DNA) is the sum of the minimum Hamming distances for each DNA sequence i

– TotalDistance (v, DNA) = mins dH(v, s), where s is the set of starting positions s1, s2,… st

mins dH(v, s)= min sum of Hamming distances for all choices of l -mers (there are 4l such l -mers ( min of all choices of positions s= s1, s2,… st))

Page 45: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

The Median String Problem: Formulation

• Goal: Given a set of DNA sequences, find

a median string

• Input: A t x n matrix DNA, and l, the length

of the pattern to find

• Output: A string v of l nucleotides that

minimizes TotalDistance(v,DNA) over all

strings of that length

Page 46: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

Median String Search Algorithm

1. BF_MedianStringSearch (DNA, t, n, l)

2. bestWord AAA…A

3. bestDistance ∞( or a large number)

4. for each l-mer word from AAA…A to

TTT…T if TotalDistance(word,DNA) <

bestDistance

bestDistanceTotalDistance(word,DNA)

1. bestWord word

2. return bestWord

Page 47: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

Line 4 dominates the computational complexity: we

need to enumerate 4l DNA sequences and for each

such sequence ,compute the Hamming distance n-l+1

times taking l comparison operation: l(n-l+1)4l

which again is exponential.

Given the „word‟ , we can find

The TotalDistance(word , DNA) in a single pass over

DNA, that is in time O(nt), rather than considering all

possible combinations of start positions. Thus, BF

median search much less time compared to BF takes

O(4lnt) time motif search. Typical value of l=8 to 15

and length of upstream motif area is about 500 to

1000 nucleotides.

Page 48: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

Motif Finding Problem == Median String Problem

• The Motif Finding is a maximization problem while

Median String is a minimization problem

• However, the Motif Finding problem and Median

String problem are computationally equivalent

• Need to show that minimizing

TotalDistance is equivalent to maximizing

Score

Page 49: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

We are looking for the same thing

a G g t a c T t

C c A t a c g t

Alignment a c g t T A g t

a c g t C c A t

C c g t a c g G

_________________

A 3 0 1 0 3 1 1 0

Profile C 2 4 0 0 1 4 0 0

G 0 1 4 0 0 0 3 1

T 0 0 0 5 1 0 1 4

_________________

Consensus a c g t a c g t

Score 3+4+4+5+3+4+3+4

TotalDistance 2+1+1+0+2+1+2+1

Sum 5 5 5 5 5 5 5 5

• At any column iScorei + TotalDistancei = t

• Because there are l columns

Score + TotalDistance = l * t

• Rearranging:Score = l * t - TotalDistance

• l * t is constant. The minimization of the right side is equivalent to the maximization of the left side

l

t

Page 50: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

Motif Finding Problem vs.

Median String Problem

• Why bother reformulating the Motif Finding

problem into the Median String problem?

– The Motif Finding Problem needs to

examine all the combinations for s. That is (n - l + 1)t combinations!!!

– The Median String Problem needs to

examine all 4l combinations for v. This

number is relatively smaller

Page 51: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

Motif Finding: Improving the Running Time

Recall the BruteForceMotifSearch:

1. BruteForceMotifSearch(DNA, t, n, l)

2. bestScore 0

3. for each s=(s1,s2 , . . ., st) from (1,1 . . . 1) to (n-l+1, . . ., n-l+1)

4. if (Score(s,DNA) > bestScore)

5. bestScore Score(s, DNA)

6. bestMotif (s1,s2 , . . . , st)

7. return bestMotif

Page 52: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

Structuring the Search

• How can we perform the line

for each s=(s1,s2 , . . ., st) from (1,1 . . . 1) to (n-l+1, . . ., n-l+1) ?

• We need a method for efficiently structuring

and navigating the many possible motifs

• This is not very different than exploring all t-

digit numbers

Page 53: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

Median String: Improving the Running Time

1. MedianStringSearch (DNA, t, n, l)

2. bestWord AAA…A

3. bestDistance ∞

4. for each l-mer s from AAA…A to TTT…T

if TotalDistance(v,DNA) < bestDistance

5. bestDistanceTotalDistance(s,DNA)

6. bestWord v

7. return bestWord

While computing TotalDistance(v,DNA) , it implicitly determines the

start positions in the linear scan taking O(nt) time

Page 54: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

Structuring the Search

– For BruteForce method, we need to sear all possible (n - l + 1)t start locations.

– For the Median String Problem we need to consider all 4l possible l-mers:

aa… aa

aa… ac

aa… ag

aa… at

.

.

tt… tt

How to organize this search?

Page 55: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

Alternative Representation of the Search Space

• Let A = 1, C = 2, G = 3, T = 4

• Then the sequences from AA…A to TT…T become:

11…11

11…12

11…13

11…14

.

.

44…44

• Notice that the sequences above simply list all numbers as if we were counting on base 4 without using 0 as a digit.

l

Page 56: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

Linked List

aa ac ag at ca cc cg ct ga gc gg gt ta tc tg tt

• Need to visit all the predecessors of a sequence

before visiting the sequence. It does not allow a subset of the l –mers to be searched based on

some special criteria (viz. Branch and Bound

Algorithm). A more efficient data structure is a

tree.

Start

Suppose l = 2

Page 57: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

Linked List (cont‟d)

• Linked list is not the most efficient data structure for motif

finding

• Let‟s try grouping the sequences by their prefixes

aa ac ag at ca cc cg ct ga gc gg gt ta tc tg tt

Page 58: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

A Search Tree with and k=4 showing all 2-

mers with alphabet (a,c,t,g). The number of levels equals l +1

a- c- g- t-

aa ac ag at ca cc cg ct ga gc gg gt ta tc tg tt

--

root

Level 0

Level 1

Level 2

The nodes at level j (0<=j<=j-1)

represent prefixes of the leaf nodes

of length j+1.

Page 59: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

1111 1112 1121 1122 1211 1212 1221 1222 2111 2112 2121 2122 2211 2212 2221 2222

111- 112- 121- 122- 211- 212- 221- 222-

11-- 12-- 21-- 22--

1--- 2---

- - - -

1 2

1 1

11

1

11

1 1 1 1 1 1 1

2 2

2222

2 2 2 2 2 2 2 2

All 4-mers (l =4) with two letter alphabet (1,2) are represented by the leaves of the tree.

An internal node represents all the 4-mers in the leaf nodes of the subtree. The nodeat level v (0 <= v <=4) is designated by v-letter prefix of its children. There are 2 l +1

nodes in the tree.

Level=0

Level=1

Level=2

Level=3

Level=4

Page 60: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

Analyzing Search Trees

• Characteristics of the search trees:

– The sequences are contained in its leaves

– The parent of a node is the prefix of its

children

– To represent all start positions in the Motif

Finding problem , we can construct a tree

with l +1=t +1 levels and each node having

k=n- l +1 children for each vertex.

– For Median String Problem we have all l –

mers but k=4

• How can we move through the tree?

Page 61: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

1111 1112 1121 1122 1211 1212 1221 1222 2111 2112 2121 2122 2211 2212 2221 2222

111- 112- 121- 122- 211- 212- 221- 222-

11-- 12-- 21-- 22--

1--- 2---

- - - -

1 2

1 1

11

1

11

1 1 1 1 1 1 1

2 2

2222

2 2 2 2 2 2 2 2

All 4-mers (l =4) with two letter alphabet (1,2) are represented by the leaves of the tree.

An internal node represents all the 4-mers in the leaf nodes of the subtree. A nodeat level v (0 <= v <=3) is designated by v-letter prefix of its children. There are 2 l +1

nodes in the tree.

Level=0

Level=1

Level=2

Level=3

Level=4

Page 62: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

Moving through the Search Trees

• Four common moves in a search tree that we are about to explore:

– Move to the next leaf

– Visit all the leaves

– Visit the next node

– Bypass the children of a node

– In general, we want to consider all k l

l -mers. For motif finding problem

k= n - l + 1. For median string problem k=4.

Page 63: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

Moving through the Search Trees

• Four common moves in a search tree that we are about to explore:

– Move to the next leaf

– Visit all the leaves

– Visit the next node

– Bypass the children of a node

– In general, we want to consider all kL

L-mers. For motif finding problem

k= n - l + 1. For median string problem k=4.

Page 64: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

Visit the Next Leaf

NextLeaf (a=((a1,a2,…aL), L, k)// a :the array of digits. L: length of the array (it same as l ). k :

max digit value//

1. a a+ 1 // radix k addition//2. if a=(1,1,….,1) exit // no next leaf //

else return a

(1,1, …,1,1) (4,4, …,3,3) (4,4, …,4,3) (1,1, …,1,2) (4,4, …,3,3) (4,4, …,4,4) (1,1, …,1,3) (4,4, …,3,4) (1,1, …,1,1) // no next leaf// (1,1, …,1,4) (4,4, …,4,1) //reset to 1111 //(1,1, …,2,1)…………….. …………….

Given a current leaf a , we need to compute the “next” leaf:

Page 65: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

Visit the Next Leaf

1. NextLeaf( a,L, k ) // a :the array of digits2. for i L to 1 // L: length of the array3. if ai < k // k : max digit value4. ai ai + 15. return a6. ai 17. return a

This slide is same as the previous slide but it showsthe actual computations performed. If L were 10, it Would be like counting decimal numbers except it

uses digits 1 to 10 rather than 0 to 9

Given a current leaf a , we need to compute the “next” leaf:

Page 66: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

NextLeaf (cont‟d)

• The algorithm is addition in radix k

• Increment the least significant digit

• “Carry the one” to the next digit position when the digit is at maximal value

Page 67: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

NextLeaf: Example

• Moving to the next leaf:

1- 2- 3- 4-

11 12 13 14 21 22 23 24 31 32 33 34 41 42 43 44

--Current Location

Page 68: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

NextLeaf: Example (cont‟d)

• Moving to the next leaf:

1- 2- 3- 4-

11 12 13 14 21 22 23 24 31 32 33 34 41 42 43 44

--Next Location

Page 69: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

Visit All Leaves

• Printing all permutations in ascending

order:

1. AllLeaves(L,k) // L: length of the sequence

2. a (1,...,1) // k : max digit value

3. while forever // a : array of digits

4. output a

5. a NextLeaf(a,L,k)

6. if a = (1,...,1)

7. return

Page 70: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

Visit All Leaves: Example

• Moving through all the leaves in order:

1- 2- 3- 4-

11 12 13 14 21 22 23 24 31 32 33 34 41 42 43 44

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

--

Order of steps

Page 71: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

Depth First Search

• So we can search leaves

• How about searching all vertices of the tree?

• We can do this with a depth first search.

• Depth First Search (DFS) is a pre-order traversal of a tree. (other traversals are Inorder , Postorder)

• Preorder (v) // v is a node of the tree//• Output v

• If v has children

• Preorder (left child of v)

• Preorder (right child of v)

Page 72: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

1111 1112 1121 1122 1211 1212 1221 1222 2111 2112 2121 2122 2211 2212 2221 2222

111- 112- 121- 122- 211- 212- 221- 222-

11-- 12-- 21-- 22--

1--- 2---

1

2

4

2

3

Level=0

Level=1

Level=2

Level=3

Level=45 6

7

8 9

10

11

12 13

14

15 16

17

19

18

20 21 23 24

25

27

26

28

29

30 31

Preorder (v) // v is a node of the tree//Output vIf v has children

Preorder (left child of v)Preorder (right child of v)

22

Page 73: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

Visit the Next Vertex

1. NextVertex(a,i,L,k)// a : (a1,a2,.., ai …aL) - the array of digits. If i<L, it is

an internal vertex.

1. if i < L // i : prefix length2. a i+1 1 // L: max length3. return ( a,i+1) // k : max digit value4. else5. for j L to 16. if aj < k 7. aj aj +18. return( a,j )9. return(a,0)

Page 74: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

Explanation

When i <L , NextVertex moves down to next the

lower level and explores that subtree of a . The

example in two slides earlier show how the

vertices are traversed if the initial vertex specified

is 12– (the blue path)

If i=L, NextVertex either moves along the lowest

level until the L-th symbol becomes k and then

jumps back to the right child of the initial vertex if

it exists or its right sibling node. It returns to the

root (a,0) after it reaches the vertex (k,k,…,k).

Page 75: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

Example

• Moving to the next vertex:

1- 2- 3- 4-

11 12 13 14 21 22 23 24 31 32 33 34 41 42 43 44

--Current Location

Page 76: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

Example

• Moving to the next vertices:

1- 2- 3- 4-

11 12 13 14 21 22 23 24 31 32 33 34 41 42 43 44

--

Location after 5

next vertex moves

Page 77: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

Branch-and-Bound Search

This approach will allow us to skip any children

or grandchildren or great grandchildren etc if

these descendants cannot provide a better

score than the best leaf that has already been

explored.

At each vertex we calculate the most optimistic

bound or score of any leaf in the subtree

rooted at that vertex and then decide whether

to branch further or not. This is the reason why

this approach is called Branch-and-Bound .

Page 78: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

Branch and Bound Algorithm for Motif Search

• Since each level of the

tree goes deeper into

search, discarding a prefix

discards all following

branchesThis saves us from looking at (n – l + 1)t-i

or 2L-i leaves.

• NextVertex() is not up to the

task.

– use ByPass() to navigate the

tree

Page 79: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

Bypass Move

• Given a prefix (internal vertex), find next vertex after

skipping all its children. If we skip a vertex at level i of

the tree, we can just increment ai (unless all vertices at

level i has been traversed and we return to the root

vertex). The algorithm looks very similar to NextLeaf

but differ in a subtle way.

1. Bypass(a,i,L,k) // a: array of digits

2. for j i to 1 // i : prefix length

3. if aj < k // L: maximum length

4. aj aj +1 // k : max digit value

5. return(a, j )

6. return(a,0)

Page 80: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

Bypass Move: Example

• Bypassing the descendants of “2-”:

1- 2- 3- 4-

11 12 13 14 21 22 23 24 31 32 33 34 41 42 43 44

--Current Location

Page 81: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

Example

• Bypassing the descendants of “2-”:

1- 2- 3- 4-

11 12 13 14 21 22 23 24 31 32 33 34 41 42 43 44

--Next Location

Page 82: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

Revisiting Search Tree

--

Score: 24 15 3 10 15 5 11 7 18 14 22 17 30 25 8 16

24 15 22 30

30

If the scores of the leaf nodes are as given below then it does not make

Any sense to explore any subtree whose score is less than 24 (the red nodes)

Page 83: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

Brute Force Search Again1. BruteForceMotifSearchAgain(DNA, t, n, l)

2. s (1,1,…, 1)

3. bestScore Score(s,DNA)

4. while forever

5. s NextLeaf (s, t, n- l +1) [Enumeration of (1,1,..,1) to (n-l+1,…,n-l+1)]

6. if (Score(s,DNA) > bestScore)

7. bestScore Score(s, DNA)

8. bestMotif (s1,s2 , . . . , st)

9. if s=(1,1,…,1)

10. return bestMotif

As we discussed earlier, we have to explore (n-l +1)t positions. For each position vector

s= s1,s2,…,st, the algorithm calculates Score(s,DNA) which takes O(l )operations. Thus,

the overall complexity of the algorithm is O(l nt).

Page 84: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

Can We Do Better?• A set of start locations s=(s1, s2, …,st) may have a weak profile for

the first i positions (s1, s2, …,si). Then, it is not necessary to consider ant starting positions beginning si-1,si+1, …,st

• Define partial consensus score, Score(s,i,DNA) to be the consensus score of the ixl alignment matrix involving the first i rows of DNA corresponding to starting positions (s1, s2, …,si -,-..-).

• Every row of alignment may add at most l to Score. All subsequent (t-i) positions (si+1, …st) could add in the most optimistic situation

(t – i ) * l to Score(s,i,DNA)

• If Score(s,i,DNA) + (t – i ) * l < BestScore, it makes no sense to explore the remaining (t-i) positions in this choice of positions (s1, s2, …,si -,-..-).

• Using ByPass() , we could save looking at (n-l +1)t –i leaves

Page 85: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

Pseudocode for Branch and Bound Motif Search

1. BranchAndBoundMotifSearch(DNA,t,n,l )

2. s (1,…,1)

3. bestScore 0

4. i 1

5. while i > 0

6. if i < t

7. optimisticScore Score(s, i, DNA) +(t – i ) * l

8. if optimisticScore < bestScore

9. (s, i) Bypass(s, i, t, n-l +1) //parameters (a, i, L, k) //

10. else

11. (s, i) NextVertex(s, i, t, n-l +1) //parameters (a, i, L, k)

12. else

13. if Score(s,DNA) > bestScore

14. bestScore Score(s, DNA)

15. bestMotif (s1, s2, s3, …, st)

16. (s, i) NextVertex(s, i, t, n-l + 1) //parameters (a, i, L, k)

17. return bestMotif

Page 86: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

Median String Search Improvements

• Recall the computational differences between motif

search and median string search

– The Motif Finding Problem needs to examine all (n-l +1)t combinations for s.

– The Median String Problem needs to examine

4l combinations of v. This number is relatively

small

• We want to use median string algorithm with the

Branch and Bound trick!

Page 87: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

Branch and Bound Applied to Median

String Search

• Note that if the total distance for a prefix is

greater than that for the best word so far:

TotalDistance (prefix, DNA) > BestDistance

there is no use exploring the remaining

part of the word

• We can eliminate that branch and

BYPASS exploring that branch further.

Page 88: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

Level=0

Level=1

Level=2

Level=3

Level=4

A--- T--- C--- G---

AA--

AT-- AG--

AG-- GA--

GT-- GC--

GG--

AAA- AAG-

AAAA

GGA- GGG-

GGGG

Not all nodes are marked, not all child nodes are drawn

A vertex at level i in this tree represents a nucleotide string of length i which

Can be viewed as the i-long prefix of every leaf below that vertex. The

SIMPLE_MEDIAN_SEARCH pseudocode given in next page maps A,C,G,T to

numerals 1,2,3,4 respectively.

Page 89: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

Pseudocode for Branch and Bound Median SearchSIMPLE_MEDIAN_SEARCH(DNA, t, n, l )

1. s (1,…,1)

2. bestDistance a very large number

3. i 1

4. while i > 0

5. if i < l

6. (s, i) NextVertex(s, i, l ,4) //parameters (a, i, L, k)

7. else

8. word nucleotide string corresponding to (s1,s2,…,s l )

9. if TotalDistance(word,DNA)< bestDiatance

10. bestDistance TotalDistance(word,DNA)

11. bestWord word

12. (s, i) NextVertex(s, i, l ,4) //parameters (a, i, L, k)

13. return bestWord

We find bound for TotalDistance(word,DNA) at each vertex. If the total

distance between the i-prefix of the word and DNA is larger than the smallest seen so far for one of the leaves (nucleotide strings of length l ), then there is

no point investigating subtrees of the vertex corresponding to that i-prefix of the word ; all extensions of this prefix into an l –mer will have at least the

same or probably more total distance. Of course, there could be some

extensions to the prefix that matches every string in the sample which will

make the value of total distance to be 0.

Page 90: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

Bounded Median String Search1. BranchAndBoundMedianStringSearch(DNA,t,n,l )2. s (1,…,1)3. bestDistance ∞4. i 15. while i > 06. if i < l7. prefix string corresponding to the first i nucleotides of s8. optimisticDistance TotalDistance(prefix,DNA)9. if optimisticDistance > bestDistance10. (s, i ) Bypass(s,i, l, 4)11. else12. (s, i ) NextVertex(s, i, l, 4)13. else 14. word nucleotide string corresponding to s15. if TotalDistance(s,DNA) < bestDistance16. bestDistance TotalDistance(word, DNA)17. bestWord word18. (s,i ) NextVertex(s,i,l, 4)19. return bestWord

Page 91: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

Improving the Bounds

• Given an l-mer w, divided into two parts at point i

– u : prefix w1, …, wi,

– v : suffix wi+1, ..., wl

• Find minimum distance for u in a sequence

• No instances of u in the sequence have distance less than the minimum distance

• Note this doesn‟t tell us anything about whether u is part of any motif. We only get a minimum distance for prefix u

Page 92: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

Improving the Bounds (cont‟d)

• Repeating the process for the suffix v

gives us a minimum distance for v

• Since u and v are two substrings of w,

and included in motif w, we can assume

that the minimum distance of u plus

minimum distance of v can only be less

than the minimum distance for w

Page 93: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

Better Bounds

Page 94: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

Better Bounds (cont‟d)

• If d(prefix) + d(suffix) > bestDistance:

– Motif w (prefix.suffix) cannot give a better

(lower) score than d(prefix) + d(suffix)

– In this case, we can ByPass()

Page 95: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

Better Bounded Median String Search1. ImprovedBranchAndBoundMedianString(DNA,t,n,l)

2. s = (1, 1, …, 1)

3. bestdistance = ∞

4. i = 1

5. while i > 0

6. if i < l

7. prefix = nucleotide string corresponding to (s1, s2, s3, …, si )

8. optimisticPrefixDistance = TotalDistance (prefix, DNA)

9. if (optimisticPrefixDistance < bestsubstring[ i ])

10. bestsubstring[ i ] = optimisticPrefixDistance

11. if (l - i < i )

12. optimisticSufxDistance = bestsubstring[l -i ]

13. else

14. optimisticSufxDistance = 0;

15. if optimisticPrefixDistance + optimisticSufxDistance > bestDistance

16. (s, i ) = Bypass(s, i, l, 4)

17. else

18. (s, i ) = NextVertex(s, i, l,4)

19. else

20. word = nucleotide string corresponding to (s1,s2, s3, …, st)

21. if TotalDistance( word, DNA) < bestDistance

22. bestDistance = TotalDistance(word, DNA)

23. bestWord = word

24. (s,i) = NextVertex(s, i,l, 4)

25. return bestWord

Page 96: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

CONSENSUS: Greedy Motif Search

• Find two closest l-mers in sequences 1 and 2 and forms

2 x l alignment matrix with Score(s,2,DNA)

• At each of the following t-2 iterations CONSENSUS finds a “best” l-mer in sequence i from the perspective of the already constructed (i-1) x l alignment matrix for the first (i-1) sequences

• In other words, it finds an l-mer in sequence i maximizing

Score(s,i,DNA)

under the assumption that the first (i-1) l-mers have been already chosen

• CONSENSUS sacrifices optimal solution for speed: in fact the bulk of the time is actually spent locating the first 2 l-mers

Page 97: 4. Finding Regulatory Motifs in DNA Sequences (Chapter 4 and 12)

Some Motif Finding Programs

• CONSENSUS

Hertz, Stromo (1989)

• GibbsDNA

Lawrence et al (1993)

• MEME

Bailey, Elkan (1995)

• RandomProjections

Buhler, Tompa (2002)

• MULTIPROFILER

Keich, Pevzner (2002)

• MITRA

Eskin, Pevzner (2002)

• Pattern Branching

Price, Pevzner (2003)

• PRUNER II

Ravi Vijaya Satya and

A. Mukherjee (2005)