Guiding Motif Discovery by Iterative Pattern Refinement

31
Guiding Motif Discovery by Iterative Pattern Refinement Zhiping Wang, Mehmet Dalkilic, Sun Kim School of Informatics, Indiana University

description

Guiding Motif Discovery by Iterative Pattern Refinement. Zhiping Wang, Mehmet Dalkilic, Sun Kim School of Informatics, Indiana University. Outline. Introduction and motivation Our framework for motif discovery Initial pattern discovery Build seed motif Extract subsequences - PowerPoint PPT Presentation

Transcript of Guiding Motif Discovery by Iterative Pattern Refinement

Page 1: Guiding Motif Discovery by Iterative Pattern Refinement

Guiding Motif Discovery by Iterative Pattern Refinement

Zhiping Wang, Mehmet Dalkilic, Sun Kim School of Informatics, Indiana University

Page 2: Guiding Motif Discovery by Iterative Pattern Refinement

Outline

Introduction and motivation Our framework for motif discovery

1. Initial pattern discovery2. Build seed motif3. Extract subsequences4. Motifs discovery 5. Iterative refinementExperiment and ResultDiscussion and Future work

Page 3: Guiding Motif Discovery by Iterative Pattern Refinement

Introduction – motifs & their applications

Protein motifs are short patterns conserved in proteins. They are generally important for the function of a protein or the maintenance of protein structures.

1. Enzyme catalytic sites2. Regions involved in binding a molecule (ADP/ATP, DNA…) or

another protein.3. A fold important for general 3D structure.

Distinguish protein groups based on such patterns.Classify a sequenced protein to a specific family of proteins.

Page 4: Guiding Motif Discovery by Iterative Pattern Refinement

Introduction - motif discovery

PROSITE: find patterns manuallyDeterministic algorithm, expectation maximization based: MEME (time consuming)Stochastic algorithm (Gibbs sampling algorithm), random jumps in the search space: Gibbs Sampler; AlignACEThe performance varies with the input sequences’ characteristics. (For example, all known motifs in disease resistance genes in Arabidopsis thaliana were successfully found using MEME after splitting the sequences into two distinct categories of resistance genes, but no motifs were found by inputting all disease resistance genes as a single input file to MEME.)

Page 5: Guiding Motif Discovery by Iterative Pattern Refinement

Motivation

Motif discover is, in a sense, to compare two models: a model for the pattern (signal model) and a model for negative examples (noise model). Input sequences determine the background noise model. The performance of motif discovery algorithms can be significantly improved by clustering input sequences into smaller groups.Thus Motivation for our research is to use subsequences, instead of using whole sequences, for motif discovery. However, it is quite difficult to select correct subsequence regions without prior knowledge, e.g., genes of the same type. We use an iterative algorithm to solve this problem.

Page 6: Guiding Motif Discovery by Iterative Pattern Refinement

Motivation – an example

Figure 1, Motif logo for the multiple sequence alignment

of a family

Figure 2, Motif logo for conserved subsequences of the protein family.

PS00343 (L-P-x-T-G-[STGAVDE])

Page 7: Guiding Motif Discovery by Iterative Pattern Refinement

Outline

Introduction and motivationOur framework for motif discovery

1. Initial pattern discovery2. Build seed motif3. Extract subsequences4. Motifs discovery 5. Iterative refinementExperiment and ResultDiscussion and Future work

Page 8: Guiding Motif Discovery by Iterative Pattern Refinement

Test Data Preparation

1. Download PROSITE pattern and sequence databases.

2. Parse all true positive sequences for each PROSITE ID and store them as a PROSITE family.

3. All sequences of one family contain the same PROSITE pattern.

4. We used PROSITE families to discovery motifs and test the performance of our framework.

Page 9: Guiding Motif Discovery by Iterative Pattern Refinement

Framework Overview 1

STEP1. Extract a set S of subsequences around a set of motifs M.STEP2. Input S to a motif discovery algorithm, producing a new set of motifs M’.STEP3. Search entire sequences for more occurrences of M’, producing M’. Set M’ to M and go to step 1.

Page 10: Guiding Motif Discovery by Iterative Pattern Refinement

Framework Overview 2

Page 11: Guiding Motif Discovery by Iterative Pattern Refinement

Outline

Introduction and motivationOur framework for motif discovery

Initial pattern discovery2. Build seed motif3. Extract subsequences4. Motifs discovery 5. Iterative refinementExperiment and ResultDiscussion and Future work

Page 12: Guiding Motif Discovery by Iterative Pattern Refinement

Initial Pattern Discovery - thresholds

Three thresholds for pattern discovery:1. length of patterns (L=3, exact patterns longer than

3 do not occur frequently even in the conserved motif regions).

2. log-odd value of 1st Markov model to random model (statistically significant patterns occur more frequently than random patterns ).

3. support value (patterns should be present in a certain number of sequences ).

Page 13: Guiding Motif Discovery by Iterative Pattern Refinement

Initial Pattern Discovery - algorithm

1. Use thresholds to scan the sequences in one set of sequences, find out qualified patterns in each sequence.

2. Rank the sequences according to how many qualified patterns each sequence has.

3. Save the qualified patterns in the top half sequences and eliminate these sequences.

4. Repeat this algorithm on the rest half set of sequences (go to step 1) until no more patterns can be found. The saved patterns will be used later.

Page 14: Guiding Motif Discovery by Iterative Pattern Refinement

Initial Pattern Discovery - example

Qualified Patterns (p1, p2, p3)

Page 15: Guiding Motif Discovery by Iterative Pattern Refinement

Outline

Introduction and motivationOur framework for motif discovery

1. Initial pattern discoveryBuild seed motif

3. Extract subsequences4. Motifs discovery 5. Iterative refinementExperiment and ResultDiscussion and Future work

Page 16: Guiding Motif Discovery by Iterative Pattern Refinement

Build Seed Motif

1. Start from the pattern with maximal support, use it as the seed motif.

2. Calculate the scores of the candidate patterns (in sequences not covered by the seed motif) to the seed motif.

Si = ΣSi-jWj (j = 1… n)Si: score of candidate pattern i to seed motif Si-j: score of candidate pattern to jth pattern in the seed motifWj: the weight (support ratio) of jth pattern in the seed motif

3. Add the pattern which has the highest score (also larger than a score threshold) to the seed motif.

4. Go to step 2, until no more patterns can be added to the seed motif.

Page 17: Guiding Motif Discovery by Iterative Pattern Refinement

Build Seed Motif - example

Calculate pattern scores (threshold = 5)Pattern Sequence Support (suppose no

shared sequences)Weight Score to

motifP1 CLG 4 W1 = 1P2 CLN 2 13P3 ALG 2 10P4 ALN 2 4

S2-1 = 9+4+0 = 13; S2 = S2-1W1 = 13P1 C L G 9 4 0 P2 C L N

Page 18: Guiding Motif Discovery by Iterative Pattern Refinement

Build Seed Motif - example

Calculate pattern scores (threshold = 5)Pattern Sequence Support (suppose no

shared sequences)Weight Score to

motifP1 CLG 4 W1 = 4 / (4+2)P2 CLN 2 W2 = 2 / (4+2)P3 ALG 2 8P4 ALN 2 6

S3-1 = 10, S3-2 = 4 S3 = S3-1W1 + S3-2W2 = 8 > 5S4-1 = 4, S4-2 = 10 S4 = S4-1W1 + S4-2W2 = 6 > 5

Page 19: Guiding Motif Discovery by Iterative Pattern Refinement

Build Seed Motif - example

Calculate pattern scores (threshold = 5)Pattern Sequence Support (suppose no

shared sequences)Weight Score to

motifP1 CLG 4 W1 = 4 / 8P2 CLN 2 W2 = 2 / 8P3 ALG 2 W3 = 2 / 8P4 ALN 2 9

S4-1 = 4, S4-2 = 10, S4-3 = 8

S4 = S4-1W1 + S4-2W2 + S4-3W3 = 9 > 5

Page 20: Guiding Motif Discovery by Iterative Pattern Refinement

Build Seed Motif

Page 21: Guiding Motif Discovery by Iterative Pattern Refinement

Outline

Introduction and motivationOur framework for motif discovery

1. Initial pattern discovery2. Build seed motif

Extract subsequencesMotifs discovery Iterative refinement

Experiment and ResultDiscussion and Future work

Page 22: Guiding Motif Discovery by Iterative Pattern Refinement

Extract Subsequences

Page 23: Guiding Motif Discovery by Iterative Pattern Refinement

Motifs Discovery

MEME

Page 24: Guiding Motif Discovery by Iterative Pattern Refinement

Iterative refinementsub-sequences

MASTmotif

sub-sequences

MEME

entire protein family

Stable?

motif discovery

no

yes

Page 25: Guiding Motif Discovery by Iterative Pattern Refinement

Iterative refinement

Page 26: Guiding Motif Discovery by Iterative Pattern Refinement

Outline

Introduction and motivationOur framework for motif discovery

1. Initial pattern discovery2. Build seed motif3. Extract subsequences4. Motifs discovery 5. Iterative refinement

Experiment and ResultDiscussion and Future work

Page 27: Guiding Motif Discovery by Iterative Pattern Refinement

Experiment

1. We used 108 PROSITE families as test data.2. Ran MEME directly on these families and got

the best motif for each of them.3. Ran our framework and got the best motif

(Because of time constraints, our motif framework performed only single iteration. )

4. Compared the results.

Page 28: Guiding Motif Discovery by Iterative Pattern Refinement

PerformanceThe result of the comparison. http://biokdd.informatics.indiana.edu/zhipwang/paper/appendix

63 patterns

( PS00010 PS00011…)

23 patterns

(PS00014PS00033…)

15 patterns

(PS00019PS00035…

)

7 patterns

(PS01345PS01286…)

Framework × ×MEME × ×

Page 29: Guiding Motif Discovery by Iterative Pattern Refinement

Performance

The result of the comparison.

Page 30: Guiding Motif Discovery by Iterative Pattern Refinement

Discussion

To make our experiment more rigorous, we choose only the top motif reported by both MEME and our framework. Among the 22 failed cases, our framework did discover 21 of them, though their rank was not top. One flaw: Local optimaThis framework is general enough to include any motif discovery and search algorithms that report multiple motifs with a statistical score.

Page 31: Guiding Motif Discovery by Iterative Pattern Refinement

Future Work

On the theoretical side, we are interested in formalizing and understanding the role of noise. How likely subsequences induced by our initial pattern

discovery algorithm can include true motifs? Is convergence to true motif regions guaranteed once the

initial set of subsequences contain true motifs? For empirical study, we plan to perform multiple iterations using the whole PROSITE pattern set; embed different motif discovery and search programs into our framework.