Algorithms for Bioinformatics · Algorithms for Bioinformatics Lecture 2: Exhaustive search and...

Algorithms for Bioinformatics. Lecture 2: Exhaustive search and randomized algorithms for motif discovery. 13.9.2017. These slides are based on previous years’ slides by Alexandru Tomescu, Leena Salmela and Veli Mäkinen. These slides use material from http://bix.ucsd.edu/bioalgorithms/slides.php


  • Algorithms for Bioinformatics

    Lecture 2: Exhaustive search and randomized algorithms for motif discovery

    13.9.2017

    These slides are based on previous years’ slides by Alexandru Tomescu, Leena Salmela and Veli Mäkinen

    These slides use material from http://bix.ucsd.edu/bioalgorithms/slides.php


  • Outline

    DNA Sequence Motifs

    Motif Finding Problem

    Brute Force Motif Finding

    Median String Problem

    Greedy Motif Search

    Randomized Algorithms

    2 / 59

  • DNA Sequence Motif

    I A DNA sequence motif is a recurring pattern in DNA presumed to have some biological function.

    I Often the occurrences of the motif in DNA are binding sites where proteins such as transcription factors can attach.

    3 / 59

  • Transcription Factors

    I Every gene contains a regulatory region, typically stretching 100-1000 bp upstream of the transcriptional start site.

    I A transcription factor (TF) is a regulatory protein that can bind to a regulatory region and promote or inhibit the transcription of the gene.

    I A single TF typically regulates multiple (co-regulated) genes.

    4 / 59

  • Transcription Factor Binding Sites and Motifs

    I A Transcription Factor Binding Site (TFBS) is the DNA location where the TF binds.

    I May be located anywhere in the regulatory region.

    I A TF binds to a specific kind of DNA sequence.

      I Similar but not identical in different co-regulated genes.

      I Some positions are more important and thus conserved, while other positions allow variation.

    I Motif is a pattern that describes the TFBS sequences of a given TF.

    5 / 59

  • TFBSs

    [Figure: five genes, each shown with a TFBS sequence: ATCCCG, ATCCCG, GGTCCT, ATGCCG, ATGCCC]

    6 / 59

  • Motif Representations

    I Set of sequences

      TGGGGGA
      TGAGAGA
      TGGGGGA
      TGAGAGA
      TGAGGGA

    I Consensus sequence TGAGGGA

    I Profile matrix

      A  0 0 3 0 2 0 5
      C  0 0 0 0 0 0 0
      G  0 5 2 5 3 5 0
      T  5 0 0 0 0 0 0

    I Motif logo

    7 / 59


  • Identifying Motifs

    I Identify a set of co-regulated genes.

    I Extract their regulatory regions.

    I Find a pattern shared by all of the regions: the motif.

    I The motif can be used for finding other (previously unknown) genes regulated by the same TF.

    cctgatagacgctatctggctatccaGgtacTtaggtcctctgtgcgaatctatgcgtttccaaccat

    agtactggtgtacatttgatCcAtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc

    aaacgtTAgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt

    agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtCcAtataca

    ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaCcgtacgGc

    9 / 59

  • The Motif Finding Problem: Formulation

    I Goal: Find the best motif in a set of DNA sequences

    I Input: A collection of t strings Dna and an integer k

    I Output: The collection of t k-mers, one from each input string, forming the best motif.

      I We will later consider what the “best” motif is.

    k = 8

    t = 5

    cctgatagacgctatctggctatccaGgtacTtaggtcctctgtgcgaatctatgcgtttccaaccat

    agtactggtgtacatttgatCcAtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc

    aaacgtTAgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt

    agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtCcAtataca

    ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaCcgtacgGc

    n = 69

    10 / 59

  • Planted (k, d)-Motif Problem

    Artificially generated problem instance

    1. Generate a random k-mer M (the motif)

    2. Generate t random sequences of length n

    3. Implant M into a random position in each of the t sequences

    4. In each implanted copy of M, make d random mutations
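The four steps above can be sketched in Python. This is a minimal illustration, not the generator used for the slides: the function name `plant_motif` and its parameters are hypothetical, the motif overwrites k positions of the random sequence, and the d mutations are assumed to hit d distinct positions and always change the symbol.

```python
import random

def plant_motif(k, d, t, n, rng=None):
    """Return (motif, sequences, positions) for a planted (k, d)-motif instance."""
    rng = rng or random.Random()
    alphabet = "ACGT"
    # Step 1: a random k-mer M.
    motif = "".join(rng.choice(alphabet) for _ in range(k))
    sequences, positions = [], []
    for _ in range(t):
        # Step 2: a random sequence of length n.
        seq = [rng.choice(alphabet) for _ in range(n)]
        # Step 4 (done on the copy before implanting): mutate d distinct
        # positions, always to a different symbol (an assumption).
        copy = list(motif)
        for j in rng.sample(range(k), d):
            copy[j] = rng.choice([c for c in alphabet if c != copy[j]])
        # Step 3: implant the mutated copy at a random position.
        pos = rng.randrange(n - k + 1)
        seq[pos:pos + k] = copy
        sequences.append("".join(seq))
        positions.append(pos)
    return motif, sequences, positions

motif, sequences, positions = plant_motif(k=15, d=4, t=10, n=73, rng=random.Random(1))
```

Returning the implant positions makes it easy to check afterwards whether a motif finder recovered the planted occurrences.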

    11 / 59

  • Example: Generate sequences

    gatactgataccgtatttggcctaggcgtacacattagataaacgtatgaagtacgttagactcggcgccgcc

    tttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatactgggcataaggtac

    cctgggatgacttttgggaacactatagtgctctcccgatttttgaatatgtaggatcattcgccagggtccg

    ttggatgaccttgtaagtgttttccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggag

    gcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatggcccacttagtccacttata

    tgttcttgtgaatggatttttaactgagggcatagaccgcttggcgcacccaaattcagtgtgggcgagcgca

    gcccttgttagaggcccccgtactgatggaaactttcaattatgagagagctaatctatcgcgtgcgtgttca

    ttggtttcgaaaatgctctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgt

    ttggctaaaagcccaacttgacaaatggaagatagaatccttgcatttcaacgtatgccgaaccgaaagggaa

    caacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttctgggtactgatagc

    12 / 59

  • Example: Implant motif AAAAAAAAGGGGGGG

    gatactgatAAAAAAAAGGGGGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgcc

    tttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataAAAAAAAAGGGGGGG

    cctgggatgacttAAAAAAAAGGGGGGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccg

    ttggatgAAAAAAAAGGGGGGGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggag

    gcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAAAAAAAAGGGGGGGcttata

    tgttcttgtgaatggatttAAAAAAAAGGGGGGGgaccgcttggcgcacccaaattcagtgtgggcgagcgca

    gcccttgttagaggcccccgtAAAAAAAAGGGGGGGcaattatgagagagctaatctatcgcgtgcgtgttca

    ttAAAAAAAAGGGGGGGctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgt

    ttggctaaaagcccaacttgacaaatggaagatagaatccttgcatAAAAAAAAGGGGGGGaccgaaagggaa

    caacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttAAAAAAAAGGGGGGG

    13 / 59

  • Where are the implanted motifs?

    gatactgataaaaaaaagggggggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgcc

    tttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataaaaaaaaaggggggg

    cctgggatgacttaaaaaaaagggggggtgctctcccgatttttgaatatgtaggatcattcgccagggtccg

    ttggatgaaaaaaaagggggggtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggag

    gcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaataaaaaaaagggggggcttata

    tgttcttgtgaatggatttaaaaaaaaggggggggaccgcttggcgcacccaaattcagtgtgggcgagcgca

    gcccttgttagaggcccccgtaaaaaaaagggggggcaattatgagagagctaatctatcgcgtgcgtgttca

    ttaaaaaaaagggggggctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgt

    ttggctaaaagcccaacttgacaaatggaagatagaatccttgcataaaaaaaagggggggaccgaaagggaa

    caacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttaaaaaaaaggggggg

    14 / 59

  • Example: Add four mutations per motif occurrence

    gatactgatAgAAgAAAGGttGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgcc

    tttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacAAtAAAAcGGcGGG

    cctgggatgacttAAAAtAAtGGaGtGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccg

    ttggatgcAAAAAAAGGGattGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggag

    gcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAtAAtAAAGGaaGGGcttata

    tgttcttgtgaatggatttAAcAAtAAGGGctGGgaccgcttggcgcacccaaattcagtgtgggcgagcgca

    gcccttgttagaggcccccgtAtAAAcAAGGaGGGccaattatgagagagctaatctatcgcgtgcgtgttca

    ttAAAAAAtAGGGaGccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgt

    ttggctaaaagcccaacttgacaaatggaagatagaatccttgcatActAAAAAGGaGcGGaccgaaagggaa

    caacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttActAAAAAGGaGcGG

    15 / 59

  • Where are the implanted motifs???

    gatactgatagaagaaaggttgggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgcc

    tttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacaataaaacggcggg

    cctgggatgacttaaaataatggagtggtgctctcccgatttttgaatatgtaggatcattcgccagggtccg

    ttggatgcaaaaaaagggattgtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggag

    gcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatataataaaggaagggcttata

    tgttcttgtgaatggatttaacaataagggctgggaccgcttggcgcacccaaattcagtgtgggcgagcgca

    gcccttgttagaggcccccgtataaacaaggagggccaattatgagagagctaatctatcgcgtgcgtgttca

    ttaaaaaatagggagccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgt

    ttggctaaaagcccaacttgacaaatggaagatagaatccttgcatactaaaaaggagcggaccgaaagggaa

    caacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttactaaaaaggagcgg

    16 / 59

  • Why is finding (15,4)-motifs hard?

    gatactgatAgAAgAAAGGttGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgcc

    tttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacAAtAAAAcGGcGGG

    I Two occurrences can differ in as many as 2d = 8 positions

      AgAAgAAAGGttGGG
      ..|..|||.|..|||
      cAAtAAAAcGGcGGG

    I Difficult to recognize similarity even if you know the positions

    17 / 59

  • Planted (k, d)-Motif Problem

    I Common benchmark or challenge problem for motif finding

    I Adjustable difficulty

    I For example, the planted (15, 4)-motif problem is difficult but not impossible

    Real data is noisier and more complex

    I No upper bound d on the number of mutations

    I Some positions have little variation, others have a lot

    18 / 59


  • Scoring Motifs

    I The output of motif finding is a set of k-mers, one from each input sequence

    I In a desired output, the k-mers are as similar to each other as possible, representing a coherent motif

    I How to measure the goodness of the motif?

    20 / 59

  • Consensus Motif and Score

    Motif occurrences   a G g t a c T t
                        C c A t a c g t
                        a c g t T A g t
                        a c g t C c A t
                        C c g t a c g G

    Counts          A   3 0 1 0 3 1 1 0
                    C   2 4 0 0 1 4 0 0
                    G   0 1 4 0 0 0 3 1
                    T   0 0 0 5 1 0 1 4

    Consensus           A C G T A C G T

    Score               2+1+1+0+2+1+2+1 = 10

    I t motif occurrences, one from each sequence

    I Count symbols in each column

    I Consensus formed by the most frequent symbols

    I Score is the number of differences from the consensus
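The counting, consensus and score steps can be written in a few lines of Python. This is a sketch with hypothetical function names; ties for the most frequent symbol are assumed to be broken by first occurrence in the column, and case is ignored.

```python
from collections import Counter

def consensus(motifs):
    """Most frequent symbol in each column (first seen wins on ties)."""
    columns = zip(*(m.upper() for m in motifs))
    return "".join(Counter(col).most_common(1)[0][0] for col in columns)

def score(motifs):
    """Number of symbols that differ from the most frequent symbol of their column."""
    total = 0
    for col in zip(*(m.upper() for m in motifs)):
        total += len(col) - max(Counter(col).values())
    return total

# The five motif occurrences from the slide.
motifs = ["aGgtacTt", "CcAtacgt", "acgtTAgt", "acgtCcAt", "CcgtacgG"]
print(consensus(motifs), score(motifs))
```

On the slide's example this reproduces the consensus ACGTACGT and the score 10.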

    21 / 59

  • Consensus Motif and Score

    I Good model for planted motif problem

    I In real data, mutations should be expensive in conserved positions but much cheaper in variable positions

    22 / 59

  • BruteForceMotifSearch

    I Compute the score for every possible combination of motifs

    I Output the set of motifs with the smallest score
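A direct Python sketch of this exhaustive search, usable only on very small inputs. The function names are hypothetical and the input strings are assumed to be uppercase.

```python
from collections import Counter
from itertools import product

def score(motifs):
    """Number of symbols that differ from the column majority."""
    return sum(len(col) - max(Counter(col).values()) for col in zip(*motifs))

def brute_force_motif_search(dna, k):
    """Try every way of picking one k-mer per string; keep the lowest score."""
    kmers_per_string = [
        [s[i:i + k] for i in range(len(s) - k + 1)] for s in dna
    ]
    # product() enumerates all (n - k + 1)^t combinations.
    return min(product(*kmers_per_string), key=score)

print(brute_force_motif_search(["ACGTT", "TACGA", "GGACG"], 3))
```

In this toy input ACG is the only 3-mer shared by all three strings, so it is the unique combination with score 0.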

    23 / 59

  • Running Time of BruteForceMotifSearch

    I $(n - k + 1)$ different k-mers in each sequence

    I $(n - k + 1)^t$ different combinations of motifs

    I $kt$ time to compute the score for one set of motifs

    I $kt(n - k + 1)^t = O(k t n^t)$ time in total

    I E.g. for t = 20, n = 600, k = 15 we must perform approximately $10^{58}$ computations; it would take billions of years
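The estimate can be checked with exact integer arithmetic. The machine speed of 10^9 operations per second below is an assumption made purely for illustration.

```python
import math

t, n, k = 20, 600, 15
combinations = (n - k + 1) ** t        # one k-mer chosen from each of t sequences
operations = k * t * combinations      # kt work to score each combination

print(f"about 10^{math.log10(operations):.1f} operations")

# Assumed speed, for illustration only: 10^9 operations per second.
seconds = operations / 10**9
years = seconds / (365 * 24 * 3600)
print(f"about 10^{math.log10(years):.0f} years")
```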

    24 / 59


  • The Median String Problem

    I Given a set of t DNA sequences, find a pattern that appears in all t sequences with the minimum number of total mismatches

    I This pattern will be the shared motif

    26 / 59

  • Hamming Distance

    I The Hamming distance d(v, w) is the number of mismatches between two k-mers v and w

    I For example: d(AAAAAA, ACAAAC) = 2
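A minimal Python version of this definition (the function name is hypothetical):

```python
def hamming_distance(v, w):
    """Number of positions at which two equal-length strings differ."""
    if len(v) != len(w):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(a != b for a, b in zip(v, w))

print(hamming_distance("AAAAAA", "ACAAAC"))
```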

    27 / 59

  • Computing Score

    Motifs              a G g t a c T t   2
                        C c A t a c g t   2
                        a c g t T A g t   2
                        a c g t C c A t   2
                        C c g t a c g G   2

    Score(Motifs)       2+1+1+0+2+1+2+1 = 10

    Consensus(Motifs)   A C G T A C G T

    I Score is the number of mismatching symbols

    I Can be computed column by column or row by row

    I Row sums are Hamming distances

    28 / 59

  • Computing Score

    Define

    I $\mathit{Motifs} = \{\mathit{Motif}_1, \mathit{Motif}_2, \ldots, \mathit{Motif}_t\}$

    I $d(\mathit{Pattern}, \mathit{Motifs}) = \sum_{i=1}^{t} d(\mathit{Pattern}, \mathit{Motif}_i)$

    Then

    I Score(Motifs) = d(Consensus(Motifs),Motifs)

    29 / 59

  • Best Match Distance

    I Assume $|\mathit{String}| > |\mathit{Pattern}| = k$

    I The best match distance d(Pattern, String) is the smallest Hamming distance d(Pattern, Motif) between Pattern and any k-mer Motif in String

    I Example: d(ACGTACGT, gcaaaAGGTACTTccaa) = 2

    Generalize for a set of strings

    I $\mathit{Dna} = \{\mathit{Dna}_1, \mathit{Dna}_2, \ldots, \mathit{Dna}_t\}$

    I $d(\mathit{Pattern}, \mathit{Dna}) = \sum_{i=1}^{t} d(\mathit{Pattern}, \mathit{Dna}_i)$
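Both distances can be sketched in Python as follows. The function names are hypothetical, and comparison is made case-insensitive here because the slide's example mixes upper and lower case.

```python
def hamming_distance(v, w):
    return sum(a != b for a, b in zip(v, w))

def best_match_distance(pattern, string):
    """Smallest Hamming distance between pattern and any k-mer of string."""
    pattern, string = pattern.upper(), string.upper()
    k = len(pattern)
    return min(
        hamming_distance(pattern, string[i:i + k])
        for i in range(len(string) - k + 1)
    )

def total_distance(pattern, dna):
    """d(Pattern, Dna): sum of best match distances over all strings."""
    return sum(best_match_distance(pattern, s) for s in dna)

print(best_match_distance("ACGTACGT", "gcaaaAGGTACTTccaa"))
```

On the slide's example the best match is the 8-mer AGGTACTT, at distance 2.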

    30 / 59

  • The Median String Problem

    I Goal: Given a set of DNA sequences, find a median string

    I Input: A collection of strings Dna and an integer k

    I Output: A k-mer Pattern minimizing d(Pattern, Dna) among all k-mers Pattern

    31 / 59

  • Motif Finding Problem = Median String Problem

    I Input: Dna, k

    I Motifs: output of Motif Finding (t k-mers)

    I Pattern: output of Median String

    I Score(Motifs) = d(Pattern,Dna)

    Why?

    I If Score(Motifs) < d(Pattern, Dna), we could choose Consensus(Motifs) as a better Pattern

    I If Score(Motifs) > d(Pattern, Dna), we could choose the best match occurrences of Pattern as better Motifs

    32 / 59


  • Median String Algorithm

    MedianString(DNA, k)

    BestPattern ← AAA...A
    for each k-mer Pattern from AAA...A to TTT...T do
        if d(Pattern, DNA) < d(BestPattern, DNA) then
            BestPattern ← Pattern
    return BestPattern
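The pseudocode translates to Python almost line by line. This is a sketch with hypothetical function names; the input is assumed to be uppercase, and `itertools.product` enumerates the k-mers in lexicographic order from AAA...A to TTT...T.

```python
from itertools import product

def best_match_distance(pattern, string):
    k = len(pattern)
    return min(
        sum(a != b for a, b in zip(pattern, string[i:i + k]))
        for i in range(len(string) - k + 1)
    )

def total_distance(pattern, dna):
    return sum(best_match_distance(pattern, s) for s in dna)

def median_string(dna, k):
    """Return a k-mer minimizing d(Pattern, Dna); the first one found wins ties."""
    best_pattern, best_distance = None, float("inf")
    for letters in product("ACGT", repeat=k):
        pattern = "".join(letters)
        distance = total_distance(pattern, dna)
        if distance < best_distance:
            best_pattern, best_distance = pattern, distance
    return best_pattern

print(median_string(["ACGTT", "TACGA", "GGACG"], 3))
```

Caching the best distance avoids recomputing d(BestPattern, DNA) in every iteration, which the pseudocode does for brevity.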

    33 / 59

  • Running Time of MedianString

    I $4^k$ different k-mers

    I $O(k \cdot n)$ time to compute the best match distance to one string

    I $O(k n t 4^k)$ time in total

    I E.g. for t = 20, n = 600, k = 15 this is about $15 \cdot 600 \cdot 20 \cdot 4^{15} \approx 2 \cdot 10^{14}$; still a lot but much less than $10^{58}$

    I Reformulating a problem can help!

    34 / 59



  • Search Space

    I BruteForceMotifSearch and MedianString algorithms have exponential running time

    I This is because the search space, the set of possible solutions, is exponential

      I $n^t$ different ways to choose Motifs

      I $4^k$ different ways to choose Pattern

    36 / 59

  • Exploring Only Part of Search Space

    Branch and bound algorithms (covered in study groups)

    I Avoid regions that cannot improve solution

    I Still exponential in the worst case

    Greedy algorithms

    I Search the most promising directions

    I No guarantee of finding an optimal solution

    Randomized algorithms

    I Add randomness to greedy search

    I Avoids getting stuck in a dead end

    37 / 59

  • Profile Matrix

    Motifs                  a  G  g  t  a  c  T  t
                            C  c  A  t  a  c  g  t
                            a  c  g  t  T  A  g  t
                            a  c  g  t  C  c  A  t
                            C  c  g  t  a  c  g  G

    Count(Motifs)       A   3  0  1  0  3  1  1  0
                        C   2  4  0  0  1  4  0  0
                        G   0  1  4  0  0  0  3  1
                        T   0  0  0  5  1  0  1  4

    Profile(Motifs)     A   .6 0  .2 0  .6 .2 .2 0
                        C   .4 .8 0  0  .2 .8 0  0
                        G   0  .2 .8 0  0  0  .6 .2
                        T   0  0  0  1  .2 0  .2 .8

    Consensus(Motifs)       A  C  G  T  A  C  G  T

    I Profile represents the probability of each nucleotide in each position

    I More detailed summary of the set of motifs than consensus
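The count and profile matrices can be computed as below. This is a sketch; the dictionary layout (one list of column values per nucleotide) and the function names are choices made for this example.

```python
def count_matrix(motifs):
    """Count(Motifs): occurrences of each nucleotide in each column."""
    k = len(motifs[0])
    counts = {symbol: [0] * k for symbol in "ACGT"}
    for motif in motifs:
        for j, symbol in enumerate(motif.upper()):
            counts[symbol][j] += 1
    return counts

def profile_matrix(motifs):
    """Profile(Motifs): column-wise relative frequencies (counts divided by t)."""
    t = len(motifs)
    return {s: [c / t for c in row] for s, row in count_matrix(motifs).items()}

motifs = ["aGgtacTt", "CcAtacgt", "acgtTAgt", "acgtCcAt", "CcgtacgG"]
for symbol, row in profile_matrix(motifs).items():
    print(symbol, row)
```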

    38 / 59

  • k-Mer Probabilities

    Profile   A   .6 0  .2 0  .6 .2 .2 0
              C   .4 .8 0  0  .2 .8 0  0
              G   0  .2 .8 0  0  0  .6 .2
              T   0  0  0  1  .2 0  .2 .8

    The probability of a k-mer given a profile

    I Pr(AGGTACTT | Profile) = .6 · .2 · .8 · 1 · .6 · .8 · .2 · .8 = 0.0073728

    I Measures how well the k-mer matches the motif

    I Does 0.0073728 imply a good match?
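The product can be evaluated in Python as follows (a sketch with hypothetical names). One way to judge the value is to compare it with the probability of the same 8-mer under a uniform profile, $0.25^8 \approx 1.5 \cdot 10^{-5}$; that baseline is an illustration added here, not part of the slide.

```python
# Profile from the slide: one row of column probabilities per nucleotide.
PROFILE = {
    "A": [0.6, 0.0, 0.2, 0.0, 0.6, 0.2, 0.2, 0.0],
    "C": [0.4, 0.8, 0.0, 0.0, 0.2, 0.8, 0.0, 0.0],
    "G": [0.0, 0.2, 0.8, 0.0, 0.0, 0.0, 0.6, 0.2],
    "T": [0.0, 0.0, 0.0, 1.0, 0.2, 0.0, 0.2, 0.8],
}

def kmer_probability(kmer, profile):
    """Product of the profile entries selected by the k-mer's symbols."""
    p = 1.0
    for j, symbol in enumerate(kmer.upper()):
        p *= profile[symbol][j]
    return p

p = kmer_probability("AGGTACTT", PROFILE)
uniform = 0.25 ** 8          # same 8-mer under a uniform profile
print(p, p / uniform)
```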

    39 / 59

  • Profile-Most Probable k-mer

    I The k-mer with the highest probability in a string

    I Considered the best matching motif

    I Example: The Profile-most probable 8-mer in gcaaaAGGTACTTccaa is AGGTACTT

    I Pr(AGGTACTT | Profile) = 0.0073728

    Profile   A   .6 0  .2 0  .6 .2 .2 0
              C   .4 .8 0  0  .2 .8 0  0
              G   0  .2 .8 0  0  0  .6 .2
              T   0  0  0  1  .2 0  .2 .8
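A sketch of the search for the Profile-most probable k-mer. The names are hypothetical, and when several k-mers share the highest probability the leftmost one is returned, which is an assumption of this example.

```python
PROFILE = {
    "A": [0.6, 0.0, 0.2, 0.0, 0.6, 0.2, 0.2, 0.0],
    "C": [0.4, 0.8, 0.0, 0.0, 0.2, 0.8, 0.0, 0.0],
    "G": [0.0, 0.2, 0.8, 0.0, 0.0, 0.0, 0.6, 0.2],
    "T": [0.0, 0.0, 0.0, 1.0, 0.2, 0.0, 0.2, 0.8],
}

def kmer_probability(kmer, profile):
    p = 1.0
    for j, symbol in enumerate(kmer.upper()):
        p *= profile[symbol][j]
    return p

def profile_most_probable_kmer(string, k, profile):
    """Slide a window of length k over string and keep the likeliest k-mer."""
    kmers = [string[i:i + k] for i in range(len(string) - k + 1)]
    # max() returns the first maximal element, i.e. the leftmost k-mer on ties.
    return max(kmers, key=lambda kmer: kmer_probability(kmer, profile))

print(profile_most_probable_kmer("gcaaaAGGTACTTccaa", 8, PROFILE))
```

With this profile every other 8-mer of the example string has probability 0, so AGGTACTT is selected.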

    40 / 59

  • Problem: Zero Probabilities

    Profile     A   .6 0  .2 0  .6 .2 .2 0
                C   .4 .8 0  0  .2 .8 0  0
                G   0  .2 .8 0  0  0  .6 .2
                T   0  0  0  1  .2 0  .2 .8

    Consensus       A  C  G  T  A  C  G  T

    Pr(TCGTACGT | Profile) = 0 · .8 · .8 · 1 · .6 · .8 · .6 · .8 = 0

    I Only one mismatch compared to consensus

    I Should this probability really be 0?

    41 / 59

  • Pseudocounts

    I Add one to all counts

    I Avoids zero counts

    Count         A   3 0 1 0 3 1 1 0
                  C   2 4 0 0 1 4 0 0
                  G   0 1 4 0 0 0 3 1
                  T   0 0 0 5 1 0 1 4

    PseudoCount   A   4 1 2 1 4 2 2 1
                  C   3 5 1 1 2 5 1 1
                  G   1 2 5 1 1 1 4 2
                  T   1 1 1 6 2 1 2 5

    42 / 59

  • Laplace’s Rule of Succession

    I Use pseudocounts instead of counts to compute probabilities

    I As if we had seen one occurrence of each symbol before the main data

    PseudoCount   A   4   1   2   1   4   2   2   1
                  C   3   5   1   1   2   5   1   1
                  G   1   2   5   1   1   1   4   2
                  T   1   1   1   6   2   1   2   5

    Profile       A   4/9 1/9 2/9 1/9 4/9 2/9 2/9 1/9
                  C   3/9 5/9 1/9 1/9 2/9 5/9 1/9 1/9
                  G   1/9 2/9 5/9 1/9 1/9 1/9 4/9 2/9
                  T   1/9 1/9 1/9 6/9 2/9 1/9 2/9 5/9

    Pr(TCGTACGT | Profile) = 1/9 · 5/9 · 5/9 · 6/9 · 4/9 · 5/9 · 4/9 · 5/9 = 60000/43046721 = 0.001393834
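The pseudocount profile and the resulting probability can be reproduced exactly with rational arithmetic. This is a sketch with hypothetical function names; `fractions.Fraction` is used so that the result can be compared with 60000/43046721 without rounding.

```python
from fractions import Fraction

def laplace_profile(motifs):
    """Profile from counts plus one pseudocount per nucleotide and column."""
    t, k = len(motifs), len(motifs[0])
    counts = {symbol: [1] * k for symbol in "ACGT"}   # start every cell at 1
    for motif in motifs:
        for j, symbol in enumerate(motif.upper()):
            counts[symbol][j] += 1
    # Each column now sums to t + 4.
    return {s: [Fraction(c, t + 4) for c in row] for s, row in counts.items()}

def kmer_probability(kmer, profile):
    p = Fraction(1)
    for j, symbol in enumerate(kmer.upper()):
        p *= profile[symbol][j]
    return p

motifs = ["aGgtacTt", "CcAtacgt", "acgtTAgt", "acgtCcAt", "CcgtacgG"]
p = kmer_probability("TCGTACGT", laplace_profile(motifs))
print(p, float(p))
```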

    43 / 59

  • Greedy Motif Search

    Solve Motif Finding problem

    I For each input string, choose the profile-most probable k-mer as the motif

      I Greedy choice

    I The profile is computed from motifs of earlier strings

    I In the first string, try all k-mers

    44 / 59

  • Greedy Motif Search

    GreedyMotifSearch(DNA, k, t)

    BestMotifs ← the first k-mer of each string in DNA
    for each k-mer Motif in the first string in DNA do
        Motif_1 ← Motif
        for i ← 2 to t do
            form Profile from Motif_1, ..., Motif_{i-1}
            Motif_i ← Profile-most probable k-mer in the i-th string in DNA
        Motifs ← Motif_1, ..., Motif_t
        if Score(Motifs) < Score(BestMotifs) then
            BestMotifs ← Motifs
    return BestMotifs
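A possible Python rendering of the pseudocode. It assumes uppercase input, builds the profile with pseudocounts as on the preceding slides (the pseudocode itself does not say which profile is used), and breaks ties toward the leftmost k-mer; all function names are hypothetical.

```python
from collections import Counter

def score(motifs):
    return sum(len(col) - max(Counter(col).values()) for col in zip(*motifs))

def profile_with_pseudocounts(motifs):
    k, total = len(motifs[0]), len(motifs) + 4
    return {
        s: [(sum(m[j] == s for m in motifs) + 1) / total for j in range(k)]
        for s in "ACGT"
    }

def most_probable_kmer(string, k, profile):
    def probability(kmer):
        p = 1.0
        for j, symbol in enumerate(kmer):
            p *= profile[symbol][j]
        return p
    kmers = [string[i:i + k] for i in range(len(string) - k + 1)]
    return max(kmers, key=probability)      # leftmost k-mer wins ties

def greedy_motif_search(dna, k):
    best_motifs = [s[:k] for s in dna]
    first = dna[0]
    for i in range(len(first) - k + 1):     # try every k-mer of the first string
        motifs = [first[i:i + k]]
        for string in dna[1:]:              # greedy choice in each later string
            profile = profile_with_pseudocounts(motifs)
            motifs.append(most_probable_kmer(string, k, profile))
        if score(motifs) < score(best_motifs):
            best_motifs = motifs
    return best_motifs

print(greedy_motif_search(["ACGTT", "TACGA", "GGACG"], 3))
```

The parameter t from the pseudocode is taken from `len(dna)` instead of being passed separately.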

    45 / 59

  • Performance of GreedyMotifSearch

    I Running time $O(n \cdot t \cdot k \cdot (t + n))$

      I Polynomial, not exponential

    I May not find the best motifs

      I Early choices may lead in a wrong direction

    46 / 59


  • Randomized Algorithms

    I Make random choices during computation

    I Use random number generator to “toss coins” or to “roll dice”

    Why Does Randomness Help?

    I If a greedy algorithm fails for some input, it will always fail for that input

    I If a randomized algorithm fails, it is unlikely to fail again in the same way

    I We can run it many times and choose the best output

    48 / 59

  • Monte Carlo and Las Vegas Algorithms

    Monte Carlo algorithm

    I May return an incorrect or suboptimal result

    I Returns a correct answer or a good approximation with high probability (if repeated sufficiently many times)

    Las Vegas algorithm

    I Always returns a correct/optimal result

    I Very long runtime is possible but very unlikely

    49 / 59

  • Turning Monte Carlo into Las Vegas

    1. Run the Monte Carlo algorithm

    2. If the result is good, stop. Otherwise return to Step 1.

    I Requires that a correct or optimal result can be easily recognized

      I This is not the case with the Motif Finding problem

    I The following algorithms are Monte Carlo algorithms

    50 / 59


  • Randomized Motif Search

    Improving a set of motifs

    I Starting with a set of motifs (one from each sequence)

      1. Compute a profile from the motifs

      2. Find the profile-most probable motif in each sequence

    I The result is a potentially better set of motifs

    I Repeat this as long as the set of motifs keeps improving

    Randomization

    I Start with a random set of motifs

    51 / 59

  • Randomized Motif Search

    RandomizedMotifSearch(DNA, k, t)

    randomly select k-mers Motifs = (Motif_1, ..., Motif_t), one from each string in DNA
    BestMotifs ← Motifs
    while forever do
        Profile ← Profile(Motifs)
        for i ← 1 to t do
            Motif_i ← Profile-most probable k-mer in the i-th string in DNA
        Motifs ← Motif_1, ..., Motif_t
        if Score(Motifs) < Score(BestMotifs) then
            BestMotifs ← Motifs
        else
            return BestMotifs
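The pseudocode can be sketched in Python as below, together with a wrapper that repeats the search and keeps the best result. Assumptions of this sketch: uppercase input, a profile with pseudocounts, leftmost k-mer on ties, and hypothetical names throughout (including `repeated_randomized_search`, which is not part of the pseudocode).

```python
import random
from collections import Counter

def score(motifs):
    return sum(len(col) - max(Counter(col).values()) for col in zip(*motifs))

def profile_with_pseudocounts(motifs):
    k, total = len(motifs[0]), len(motifs) + 4
    return {
        s: [(sum(m[j] == s for m in motifs) + 1) / total for j in range(k)]
        for s in "ACGT"
    }

def most_probable_kmer(string, k, profile):
    def probability(kmer):
        p = 1.0
        for j, symbol in enumerate(kmer):
            p *= profile[symbol][j]
        return p
    kmers = [string[i:i + k] for i in range(len(string) - k + 1)]
    return max(kmers, key=probability)

def randomized_motif_search(dna, k, rng):
    # Random start: one k-mer from each string.
    motifs = []
    for s in dna:
        start = rng.randrange(len(s) - k + 1)
        motifs.append(s[start:start + k])
    best_motifs = motifs
    while True:
        profile = profile_with_pseudocounts(motifs)
        motifs = [most_probable_kmer(s, k, profile) for s in dna]
        if score(motifs) < score(best_motifs):
            best_motifs = motifs
        else:
            return best_motifs      # no improvement: stop

def repeated_randomized_search(dna, k, runs, seed=None):
    """Run the search several times and keep the lowest-scoring result."""
    rng = random.Random(seed)
    results = (randomized_motif_search(dna, k, rng) for _ in range(runs))
    return min(results, key=score)

best = repeated_randomized_search(["ACGTT", "TACGA", "GGACG"], 3, runs=20, seed=0)
```

The loop always terminates because the score is a non-negative integer that must strictly decrease for the search to continue.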

    52 / 59

  • Why Does Randomized Motif Search Work?

    I If Motifs is a random set, the expectation is that Profile(Motifs) has about the same probability 0.25 for each symbol in each column

    I If Motifs contains some of the true motifs, it is not random and Profile(Motifs) reflects this

    I Then Profile(Motifs) is more likely to match the other true motifs

    I Thus we might need just a few of the true motifs in the initial set

    I This will happen eventually if repeated many times (may require thousands of repeats)

    53 / 59

  • Gibbs Sampler

    I Gibbs Sampler is a more refined randomized algorithm

    I Compared to Randomized Motif Search, Gibbs Sampler is

      I More cautious

      I More randomized

    54 / 59

  • Gibbs Sampler Is More Cautious

    I Randomized Motif Search might get some true motifs right but throw them all away in the next round

    I Gibbs Sampler changes just one motif in each round

    55 / 59

  • Gibbs Sampler Is More Randomized

    I Randomized Motif Search uses randomness only in the beginning

    I Gibbs Sampler uses randomness in every round

      I Choose a random motif to discard

      I Replace it with a random motif (from the same sequence)

      I The second random choice is biased: a profile-randomly generated k-mer

    56 / 59

  • Profile-Randomly Generated k-Mer

    I Given a Profile and a String

    1. Compute probabilities of all k-mers in String

    2. Choose one of the k-mers randomly but biased by the probabilities

    I The probabilities with respect to Profile do not usually sum up to 1 and have to be normalized: Replace $p_1, \ldots, p_n$ with $p_1/C, \ldots, p_n/C$, where $C = \sum_{i=1}^{n} p_i$

    I Example

      I $p_1 = 0.1$, $p_2 = 0.2$, $p_3 = 0.3$

      I $C = 0.1 + 0.2 + 0.3 = 0.6$

      I $p_1/C = 1/6$, $p_2/C = 1/3$, $p_3/C = 1/2$

      I $p_1/C + p_2/C + p_3/C = 1/6 + 1/3 + 1/2 = 1$
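The normalization and the biased choice can be sketched as follows. The function names are hypothetical. Python's `random.choices` accepts unnormalized weights, so the explicit `normalize` step is shown only to mirror the slide; the weights must not all be zero, which pseudocounts in the profile guarantee.

```python
import random

def normalize(probabilities):
    """Divide by the total C so that the values sum to 1."""
    c = sum(probabilities)
    return [p / c for p in probabilities]

def kmer_probability(kmer, profile):
    p = 1.0
    for j, symbol in enumerate(kmer.upper()):
        p *= profile[symbol][j]
    return p

def profile_random_kmer(string, k, profile, rng):
    """Pick a k-mer of string at random, weighted by its profile probability."""
    kmers = [string[i:i + k] for i in range(len(string) - k + 1)]
    weights = normalize([kmer_probability(kmer, profile) for kmer in kmers])
    return rng.choices(kmers, weights=weights, k=1)[0]

print(normalize([0.1, 0.2, 0.3]))
```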

    57 / 59

  • Gibbs Sampler

    GibbsSampler(DNA, k, t, N)

    randomly select k-mers Motifs = (Motif_1, ..., Motif_t), one from each string in DNA
    BestMotifs ← Motifs
    for j ← 1 to N do
        i ← Random(t)
        Profile ← profile matrix constructed from all strings in Motifs except for Motif_i
        Motif_i ← profile-randomly generated k-mer in the i-th sequence in DNA
        if Score(Motifs) < Score(BestMotifs) then
            BestMotifs ← Motifs
    return BestMotifs
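A Python sketch of the sampler under the following assumptions: uppercase input, at least two input strings, a profile with pseudocounts so that no weight is zero, and 0-based indexing for the randomly chosen string. All names are hypothetical.

```python
import random
from collections import Counter

def score(motifs):
    return sum(len(col) - max(Counter(col).values()) for col in zip(*motifs))

def profile_with_pseudocounts(motifs):
    k, total = len(motifs[0]), len(motifs) + 4
    return {
        s: [(sum(m[j] == s for m in motifs) + 1) / total for j in range(k)]
        for s in "ACGT"
    }

def kmer_probability(kmer, profile):
    p = 1.0
    for j, symbol in enumerate(kmer):
        p *= profile[symbol][j]
    return p

def gibbs_sampler(dna, k, n_iterations, rng):
    t = len(dna)
    motifs = []
    for s in dna:
        start = rng.randrange(len(s) - k + 1)
        motifs.append(s[start:start + k])
    best_motifs = list(motifs)
    for _ in range(n_iterations):
        i = rng.randrange(t)                        # motif to discard
        others = motifs[:i] + motifs[i + 1:]
        profile = profile_with_pseudocounts(others)
        kmers = [dna[i][p:p + k] for p in range(len(dna[i]) - k + 1)]
        weights = [kmer_probability(kmer, profile) for kmer in kmers]
        # random.choices normalizes the weights internally.
        motifs[i] = rng.choices(kmers, weights=weights, k=1)[0]
        if score(motifs) < score(best_motifs):
            best_motifs = list(motifs)              # copy: motifs keeps changing
    return best_motifs

best = gibbs_sampler(["ACGTT", "TACGA", "GGACG"], 3, 200, random.Random(0))
```

Copying the list when a new best is found matters because `motifs` is modified in place in later rounds.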

    58 / 59

  • Gibbs Sampler

    I Because of randomness in every round, Gibbs Sampler can keep on running without getting stuck in a single solution

    I However, it may end up exploring the same small set of solutions repeatedly: It gets stuck in a local optimum

    I This can be corrected by restarting from a new random set of motifs every now and then

    59 / 59
