DNA Motif Finding 2010


description

The DNA Motif Finding talk given in March 2010 at the CRUK CRI, Cambridge, UK. It was designed to introduce wet-lab researchers to web-based tools for DNA motif finding, for example on the promoters of differentially expressed genes from a microarray experiment.

Transcript of DNA Motif Finding 2010

Page 1: DNA Motif Finding 2010

DNA Motif Finding

Stewart MacArthur

Bioinformatics Core

March 11th, 2010

Stewart MacArthur (Bioinformatics Core) DNA Motif Finding March 11th, 2010 1 / 33

Page 4: DNA Motif Finding 2010

Introduction

What is a DNA Motif?

DNA motifs are short, recurring patterns that are presumed to have a biological function.

• sequence-specific binding sites
    • transcription factors
    • nucleases
• ribosome binding
• mRNA processing
    • splicing
    • editing
    • polyadenylation
• transcription termination

Page 5: DNA Motif Finding 2010

Representing a motif

How to represent a DNA motif?

How can we represent the binding specificity of a protein, such that we can reliably predict its binding to any given sequence?

Restriction enzyme sites can be written as a simple DNA sequence, e.g. GAATTC for EcoRI

5’-G A A T T C-3’
3’-C T T A A G-5’

These sequences can incorporate ambiguity, e.g. GTYRAC for HincII, using the IUPAC code.

GTYRAC
Y = C or T
R = A or G

All matching sites will be cut by the restriction enzyme

Page 6: DNA Motif Finding 2010

Representing a motif

Transcription Factors are different...

• Regulatory motifs are often degenerate: variable but similar.
• Transcription factors are often pleiotropic, regulating several genes that may need to be expressed at different levels.
• A side effect of this degeneracy is spurious binding, where the protein has affinity at positions in the genome other than its functional sites.
    • Degeneracy in restriction enzyme binding would be lethal
    • Non-specific binding competes for protein and requires more protein to be produced than would be required otherwise

Page 7: DNA Motif Finding 2010

Representing a motif Consensus

The Consensus Sequence

• A consensus binding site is often used to represent transcription factor binding
• Refers to a sequence that matches all examples of the binding site closely but not exactly
• There is a trade-off between the ambiguity in the consensus and its sensitivity

TACGAT
TATAAT
TATAAT
GATACT
TATGAT
TATGTT

Page 10: DNA Motif Finding 2010

Representing a motif Consensus

The Consensus Sequence : Example

TACGAT
TATAAT *
TATAAT *
GATACT
TATGAT
TATGTT

TATAAT (consensus; * = site matched)

Allowing 0 mismatches finds 2/6 sites, about 1 site every 4 kb by chance

Page 11: DNA Motif Finding 2010

Representing a motif Consensus

The Consensus Sequence : Example

TACGAT
TATAAT *
TATAAT *
GATACT
TATGAT *
TATGTT

TATAAT (consensus; * = site matched)

Allowing at most 1 mismatch finds 3/6 sites, about 1 site every 200 bp by chance

Page 12: DNA Motif Finding 2010

Representing a motif Consensus

The Consensus Sequence : Example

TACGAT *
TATAAT *
TATAAT *
GATACT *
TATGAT *
TATGTT *

TATAAT (consensus; * = site matched)

Allowing up to 2 mismatches finds 6/6 sites, about 1 site every 30 bp by chance
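The sensitivity and chance-frequency figures in the consensus example can be checked with a short script. This is a minimal sketch, not part of the original talk: the function names are illustrative, the six sites are those listed on the consensus slide, and the chance estimate assumes the four bases are equally likely and independent.

```python
from math import comb

# The six example sites and their consensus, as listed on the consensus slide.
SITES = ["TACGAT", "TATAAT", "TATAAT", "GATACT", "TATGAT", "TATGTT"]
CONSENSUS = "TATAAT"


def mismatches(site, consensus):
    """Number of positions at which two equal-length sequences differ."""
    return sum(a != b for a, b in zip(site, consensus))


def sites_found(sites, consensus, max_mismatches):
    """Count the sites within max_mismatches of the consensus."""
    return sum(mismatches(s, consensus) <= max_mismatches for s in sites)


def expected_spacing(length, max_mismatches):
    """Average distance (bp) between chance matches in random DNA.

    Assumption: equiprobable, independent bases. Of the 4**length possible
    words, C(length, k) * 3**k differ from the consensus at exactly k positions.
    """
    matching_words = sum(comb(length, k) * 3**k for k in range(max_mismatches + 1))
    return 4**length / matching_words


for k in range(3):
    found = sites_found(SITES, CONSENSUS, k)
    spacing = expected_spacing(len(CONSENSUS), k)
    print(f"{k} mismatches: {found}/{len(SITES)} sites, "
          f"about 1 chance match every {spacing:.0f} bp")
```

Under these assumptions the script reproduces the 2/6, 3/6 and 6/6 counts, and its spacing estimates round to the 4 kb, 200 bp and 30 bp quoted on the slides.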

Page 13: DNA Motif Finding 2010

Representing a motif IUPAC

IUPAC codes

A        Adenine
C        Cytosine
G        Guanine
T        Thymine
R        A or G
Y        C or T
S        G or C
W        A or T
K        G or T
M        A or C
B        C or G or T
D        A or G or T
H        A or C or T
V        A or C or G
N        any base
. or -   gap
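One convenient way to search for an IUPAC pattern is to translate it into a regular expression. The sketch below is illustrative only: the helper names and the test sequence are made up for this example, gap characters are not handled, and only the forward strand is searched.

```python
import re

# IUPAC nucleotide codes mapped to regular-expression character classes.
IUPAC_TO_REGEX = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}


def iupac_to_regex(pattern):
    """Translate an IUPAC pattern such as GTYRAC into a regex string."""
    return "".join(IUPAC_TO_REGEX[code] for code in pattern.upper())


def find_sites(pattern, sequence):
    """Return 0-based start positions of all matches, including overlapping ones."""
    # A lookahead lets overlapping matches be reported as well.
    regex = re.compile("(?=(" + iupac_to_regex(pattern) + "))")
    return [m.start() for m in regex.finditer(sequence.upper())]


# Hypothetical test sequence containing two HincII-like sites (GTCGAC, GTTAAC).
print(iupac_to_regex("GTYRAC"))
print(find_sites("GTYRAC", "AAGTCGACTTGTTAACGG"))
```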

Page 15: DNA Motif Finding 2010

Representing a motif IUPAC

The Consensus Sequence : Example

TACGAT
TATAAT *
TATAAT *
GATACT
TATGAT *
TATGTT *

TATRNT (IUPAC consensus; * = site matched)

Exact match finds 4/6 sites, about 1 site every 500 bp by chance

Page 16: DNA Motif Finding 2010

Representing a motif IUPAC

The Consensus Sequence : Example

TACGAT *
TATAAT *
TATAAT *
GATACT *
TATGAT *
TATGTT *

TATRNT (IUPAC consensus; * = site matched)

Up to one mismatch finds 6/6 sites, about 1 site every 30 bp by chance
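The same check can be made for the degenerate consensus TATRNT by enumerating all 4096 possible 6-mers. As before, this is a sketch under stated assumptions rather than the method used in the talk: bases are taken to be equiprobable and independent, the six sites are those from the consensus slide, and the function names are illustrative.

```python
from itertools import product

# Bases allowed by each IUPAC code (subset needed for DNA patterns).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

SITES = ["TACGAT", "TATAAT", "TATAAT", "GATACT", "TATGAT", "TATGTT"]
PATTERN = "TATRNT"


def mismatches(word, pattern):
    """Positions where the base is not allowed by the IUPAC code."""
    return sum(base not in IUPAC[code] for base, code in zip(word, pattern))


def sites_found(sites, pattern, max_mismatches):
    """Count the sites within max_mismatches of the IUPAC pattern."""
    return sum(mismatches(s, pattern) <= max_mismatches for s in sites)


def expected_spacing(pattern, max_mismatches):
    """Average distance (bp) between chance matches, by brute-force enumeration."""
    words = ("".join(p) for p in product("ACGT", repeat=len(pattern)))
    hits = sum(mismatches(w, pattern) <= max_mismatches for w in words)
    return 4 ** len(pattern) / hits


for k in (0, 1):
    print(f"{k} mismatches: {sites_found(SITES, PATTERN, k)}/{len(SITES)} sites, "
          f"about 1 chance match every {expected_spacing(PATTERN, k):.0f} bp")
```

The ambiguity codes raise sensitivity at zero mismatches from 2/6 to 4/6, at the cost of an eightfold increase in chance matches compared with the exact consensus.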

Page 17: DNA Motif Finding 2010

Representing a motif Matrix

The Matrix

• A position weight matrix (PWM)
    • also called position-specific weight matrix (PSWM)
    • also called position-frequency matrix (PFM)
    • also called position-specific scoring matrix (PSSM)
    • or just matrix
• Alternative to the consensus.
• There is a matrix element for all possible bases at every position.

      1   2   3   4   5   6   7   8   9  10  11
A     4  13   5   3   0   0   0   0  17   0   6
C     4   1   2   0   0   0   0   0   0   1   0
G     3   3   0   0  18   0   0   0   1   4   3
T     7   1  11  15   0  18  18  18   0  13   9

Page 21: DNA Motif Finding 2010

Representing a motif Matrix

Matrix Formats

Counts
A     4  13   5   3   0   0   0   0  17   0   6
C     4   1   2   0   0   0   0   0   0   1   0
G     3   3   0   0  18   0   0   0   1   4   3
T     7   1  11  15   0  18  18  18   0  13   9

Frequency
A   0.2 0.7 0.3 0.2 0.0 0.0 0.0 0.0 0.9 0.0 0.3
C   0.2 0.1 0.1 0.0 0.0 0.0 0.0 0.0 0.0 0.1 0.0
G   0.2 0.2 0.0 0.0 1.0 0.0 0.0 0.0 0.1 0.2 0.2
T   0.4 0.1 0.6 0.8 0.0 1.0 1.0 1.0 0.0 0.7 0.5

Weight (log odds)
A  -0.1  1.0  0.1 -0.4 -2.9 -2.9 -2.9 -2.9  1.3 -2.9  0.3
C  -0.1 -1.3 -0.7 -2.9 -2.9 -2.9 -2.9 -2.9 -2.9 -1.3 -2.9
G  -0.4 -0.4 -2.9 -2.9  1.3 -2.9 -2.9 -2.9 -1.3 -0.1 -0.4
T   0.4 -1.3  0.9  1.2 -2.9  1.3  1.3  1.3 -2.9  1.0  0.7
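The three formats can be derived from one another. The sketch below converts the counts into frequencies and log-odds weights. The slides do not state how the weights were calculated, so the formula here is an assumption: a natural-log odds ratio with a pseudocount of 1 shared according to a uniform background of 0.25 per base, chosen because it reproduces the rounded values shown above. Function and variable names are illustrative.

```python
from math import log

# Count matrix from the slide: one list of 11 positions per base.
COUNTS = {
    "A": [4, 13, 5, 3, 0, 0, 0, 0, 17, 0, 6],
    "C": [4, 1, 2, 0, 0, 0, 0, 0, 0, 1, 0],
    "G": [3, 3, 0, 0, 18, 0, 0, 0, 1, 4, 3],
    "T": [7, 1, 11, 15, 0, 18, 18, 18, 0, 13, 9],
}
BACKGROUND = 0.25   # assumed: uniform base composition
PSEUDOCOUNT = 1.0   # assumed: avoids log(0) for unobserved bases


def column_totals(counts):
    """Number of aligned sites contributing to each position."""
    width = len(counts["A"])
    return [sum(counts[b][i] for b in "ACGT") for i in range(width)]


def frequencies(counts):
    """Counts divided by the column total."""
    totals = column_totals(counts)
    return {b: [n / t for n, t in zip(counts[b], totals)] for b in "ACGT"}


def weights(counts, background=BACKGROUND, pseudocount=PSEUDOCOUNT):
    """Natural-log odds of the corrected frequency against the background."""
    totals = column_totals(counts)
    return {
        b: [log(((n + background * pseudocount) / (t + pseudocount)) / background)
            for n, t in zip(counts[b], totals)]
        for b in "ACGT"
    }


for name, matrix in (("Frequency", frequencies(COUNTS)), ("Weight", weights(COUNTS))):
    print(name)
    for base in "ACGT":
        print(base, " ".join(f"{value:5.1f}" for value in matrix[base]))
```

Positive weights mark bases seen more often than the background predicts, and the pseudocount keeps unobserved bases at a finite penalty (about -2.9 here) instead of minus infinity.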

Page 22: DNA Motif Finding 2010

Representing a motif Matrix

Sequence Logos

• A visual representation of the motif
• Each column of the matrix is represented as a stack of letters whose size is proportional to the corresponding residue frequency
• The total height of each column is proportional to its information content.

A     4  13   5   3   0   0   0   0  17   0   6
C     4   1   2   0   0   0   0   0   0   1   0
G     3   3   0   0  18   0   0   0   1   4   3
T     7   1  11  15   0  18  18  18   0  13   9
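The two rules above can be combined into a single number per letter: its height is its frequency multiplied by the information content of its column. The following is a minimal sketch of that calculation for the count matrix on this slide. It omits any small-sample correction that logo software may apply, and the function names are illustrative.

```python
from math import log2

COUNTS = {
    "A": [4, 13, 5, 3, 0, 0, 0, 0, 17, 0, 6],
    "C": [4, 1, 2, 0, 0, 0, 0, 0, 0, 1, 0],
    "G": [3, 3, 0, 0, 18, 0, 0, 0, 1, 4, 3],
    "T": [7, 1, 11, 15, 0, 18, 18, 18, 0, 13, 9],
}


def column_frequencies(counts, position):
    """Base frequencies at one 0-based position of the matrix."""
    total = sum(counts[b][position] for b in "ACGT")
    return {b: counts[b][position] / total for b in "ACGT"}


def letter_heights(counts, position):
    """Letter heights in bits, smallest first (bottom of the stack)."""
    freqs = column_frequencies(counts, position)
    # Column height: 2 bits minus the uncertainty of the column.
    column_bits = 2 + sum(f * log2(f) for f in freqs.values() if f > 0)
    stack = [(base, f * column_bits) for base, f in freqs.items() if f > 0]
    return sorted(stack, key=lambda item: item[1])


for position in range(len(COUNTS["A"])):
    stack = letter_heights(COUNTS, position)
    print(position + 1, " ".join(f"{base}:{bits:.2f}" for base, bits in stack))
```

Position 5, where only G is observed, gives a single letter 2 bits tall, while the mixed first position produces a short stack.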

Page 23: DNA Motif Finding 2010

Information theory

Information Theory

• Information theory is a branch of applied mathematics involved with the quantification of information
• It has been applied to DNA motifs in order to determine the amount of uncertainty at each position in a site
• Uncertainty is measured in bits of information, which is on a log2 scale.
• Information is a decrease in uncertainty

Page 24: DNA Motif Finding 2010

Information theory

Information theory

• 1 base occurs every time: 2 bits
• 2 bases each occur 50% of the time: 1 bit
• 4 bases occur equally: 0 bits

A     4  13   5   3   0   0   0   0  17   0   6
C     4   1   2   0   0   0   0   0   0   1   0
G     3   3   0   0  18   0   0   0   1   4   3
T     7   1  11  15   0  18  18  18   0  13   9

Example

$$I_i = 2 + \sum_{b} f_{b,i} \log_2 f_{b,i}$$

$$1 = 2 + 0.5 \times \log_2(0.5) + 0.5 \times \log_2(0.5)$$
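The formula can be evaluated directly for the three cases in the bullet list. This is a small illustrative sketch (the function name is made up for this example), using the usual convention that a base with zero frequency contributes nothing to the sum.

```python
from math import log2


def information_content(freqs):
    """I = 2 + sum over bases of f * log2(f), in bits.

    Bases with frequency 0 are skipped, since f * log2(f) tends to 0 as f -> 0.
    """
    return 2 + sum(f * log2(f) for f in freqs if f > 0)


print(information_content([1.0, 0.0, 0.0, 0.0]))      # one base every time
print(information_content([0.5, 0.5, 0.0, 0.0]))      # two bases, 50% each
print(information_content([0.25, 0.25, 0.25, 0.25]))  # four bases equally

# First column of the count matrix on this slide: A=4, C=4, G=3, T=7 of 18 sites.
first_column = [n / 18 for n in (4, 4, 3, 7)]
print(round(information_content(first_column), 2))
```

A nearly uniform column such as the first one carries little information, which is why it appears as a short stack in a sequence logo.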

Page 26: DNA Motif Finding 2010

Information theory

Why do we want to find them?

Expression Microarrays
• Find co-regulated genes
• Suggest pathways

ChIP-seq / ChIP-chip
• Determine binding preferences
• Find co-factors

Page 30: DNA Motif Finding 2010

Information theory

Two Methods

Pattern Matching
Finding known motifs
• Does protein X bind upstream of my genes?
• Does it bind more than expected by chance?
e.g. Patser, Pscan, MAST ...

Pattern Discovery
Finding unknown motifs
• What motifs are upstream of my genes?
• What are these motifs?
e.g. MEME, Weeder, MDScan ...

Page 31: DNA Motif Finding 2010

Databases of Motifs

Where can we find known motifs?

Online databases

• Multicellular Eukaryotes
    • JASPAR
    • TRANSFAC
    • PAZAR
• Yeast
    • YEASTRACT
    • SCPD
• Prokaryotes
    • RegulonDB
    • PRODORIC
• Other
    • UniPROBE

Page 34: DNA Motif Finding 2010

Finding known motifs

How do we find them?

TATATTGTTTATTTTCATGACTTCATGTCGCATGTATTGTTAATTAACACATGTCTCATGTACTGGACCATGTCTAAGGGGTGTAAGGGTACTAACGAATCGTAGCATGTCCAGAGGTGCGGAGTACGTAAGGAGGGTGCCCATACATGTCCGTTTCATATGAGCCTGCATTAATGTACCAACCTTCAACCATGTCTCAACATGTCGCGGGTGTGCCTCCACGTACGAGCCGGAAGTCGACTCGCATGTCTGTCAGTATTATCCAAAGCATGTCGACCTCTTCATGTCAGCGAACGCAAGATCTTCATATGAGCCTGCATTAATGTACC


Page 35: DNA Motif Finding 2010

Finding known motifs

Pattern Matching

Counts
A     4  13   5   3   0   0   0   0  17   0   6
C     4   1   2   0   0   0   0   0   0   1   0
G     3   3   0   0  18   0   0   0   1   4   3
T     7   1  11  15   0  18  18  18   0  13   9

Frequency
A   0.2 0.7 0.3 0.2 0.0 0.0 0.0 0.0 0.9 0.0 0.3
C   0.2 0.1 0.1 0.0 0.0 0.0 0.0 0.0 0.0 0.1 0.0
G   0.2 0.2 0.0 0.0 1.0 0.0 0.0 0.0 0.1 0.2 0.2
T   0.4 0.1 0.6 0.8 0.0 1.0 1.0 1.0 0.0 0.7 0.5

Weight (log odds)
A  -0.1  1.0  0.1 -0.4 -2.9 -2.9 -2.9 -2.9  1.3 -2.9  0.3
C  -0.1 -1.3 -0.7 -2.9 -2.9 -2.9 -2.9 -2.9 -2.9 -1.3 -2.9
G  -0.4 -0.4 -2.9 -2.9  1.3 -2.9 -2.9 -2.9 -1.3 -0.1 -0.4
T   0.4 -1.3  0.9  1.2 -2.9  1.3  1.3  1.3 -2.9  1.0  0.7

Page 39: DNA Motif Finding 2010

Finding known motifs

Pattern Matching

A  -0.1  1.0  0.1 -0.4 -2.9 -2.9 -2.9 -2.9  1.3 -2.9  0.3
C  -0.1 -1.3 -0.7 -2.9 -2.9 -2.9 -2.9 -2.9 -2.9 -1.3 -2.9
G  -0.4 -0.4 -2.9 -2.9  1.3 -2.9 -2.9 -2.9 -1.3 -0.1 -0.4
T   0.4 -1.3  0.9  1.2 -2.9  1.3  1.3  1.3 -2.9  1.0  0.7

The matrix is moved along the sequence one base at a time, and each 11 bp window is scored:

Window 1: T A T A T T G T T T A
TATATTGTTTA TTTTCATGACTTCATGTCGCATGTATTGTTAATTAA

Window 2: A T A T T G T T T A T
T ATATTGTTTAT TTTCATGACTTCATGTCGCATGTATTGTTAATTAA

Window 3: T A T T G T T T A T T
TA TATTGTTTATT TTCATGACTTCATGTCGCATGTATTGTTAATTAA

Page 42: DNA Motif Finding 2010

Finding known motifs

Pattern Matching

TA TATTGTTTATT TTCATGACTTCATGTCGCATG TATTGTTAATT AA
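The sliding-window scan shown on these slides can be sketched in a few lines. The score of a window is the sum of the matrix weights for the base found at each position. The weights are recomputed from the counts under an assumption that the slides do not state (natural-log odds, pseudocount of 1, uniform background of 0.25), only the forward strand is scanned, and the function names are illustrative. Real tools also scan the reverse strand and apply a score or P-value threshold.

```python
from math import log

COUNTS = {
    "A": [4, 13, 5, 3, 0, 0, 0, 0, 17, 0, 6],
    "C": [4, 1, 2, 0, 0, 0, 0, 0, 0, 1, 0],
    "G": [3, 3, 0, 0, 18, 0, 0, 0, 1, 4, 3],
    "T": [7, 1, 11, 15, 0, 18, 18, 18, 0, 13, 9],
}
SEQUENCE = "TATATTGTTTATTTTCATGACTTCATGTCGCATGTATTGTTAATTAA"


def log_odds(counts, background=0.25, pseudocount=1.0):
    """Weight matrix from counts (assumed formula, see the text above)."""
    width = len(counts["A"])
    totals = [sum(counts[b][i] for b in "ACGT") for i in range(width)]
    return {
        b: [log(((n + background * pseudocount) / (t + pseudocount)) / background)
            for n, t in zip(counts[b], totals)]
        for b in "ACGT"
    }


def scan(sequence, pwm):
    """Score every window; returns (0-based start, window, score) tuples."""
    width = len(pwm["A"])
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(pwm[base][i] for i, base in enumerate(window))
        hits.append((start, window, score))
    return hits


# Report the two best-scoring windows.
ranked = sorted(scan(SEQUENCE, log_odds(COUNTS)), key=lambda hit: hit[2], reverse=True)
for start, window, score in ranked[:2]:
    print(f"position {start + 1}: {window} score {score:.2f}")
```

With these assumptions the two top-scoring windows are the two sites marked in the sequence above, TATTGTTTATT and TATTGTTAATT.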

Page 43: DNA Motif Finding 2010

Pattern Discovery

Introduction to de-novo motif finding

de-novo or ab-initio motif finding refers to finding motifs “from the beginning”, i.e. without previous knowledge

Various Methods
• Word-based algorithms e.g. Oligo-Analysis, Weeder
• Expectation-Maximization methods e.g. MEME
• Gibbs sampling methods e.g. Gibbs sampler, MotifSampler
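To give a feel for the word-based idea, the sketch below counts how many input sequences contain each k-mer. It is a deliberately simplified illustration on made-up toy sequences, not the algorithm used by Oligo-Analysis or Weeder: those tools go further and compare observed counts with a background expectation to decide which words are significantly over-represented.

```python
from collections import Counter


def kmers(sequence, k):
    """Set of distinct words of length k in one sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def shared_words(sequences, k):
    """For every k-mer, count the number of sequences that contain it."""
    counts = Counter()
    for sequence in sequences:
        # Using a set means each sequence contributes at most once per word.
        counts.update(kmers(sequence.upper(), k))
    return counts


# Hypothetical toy promoters, each containing the word GATAAG.
TOY_SEQUENCES = ["AAGATAAGG", "CCGATAAGT", "TTTGATAAG"]

for word, n in shared_words(TOY_SEQUENCES, 6).most_common(3):
    print(word, n)
```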

Page 44: DNA Motif Finding 2010

Pattern Discovery

Guidelines

• If possible, remove repeat patterns from the target sequences
• Use multiple motif prediction algorithms
• Run probabilistic algorithms multiple times
• Return multiple motifs
• Try a range of motif widths and expected number of sites

“... we do not recommend to trust pattern discovery results with vertebrate genomes.”

Jacques van Helden

Page 55: DNA Motif Finding 2010

Recommended Tools

Recommended Tools

Pattern Matching
• RSAT
• Pscan
• Galaxy
• MotifMogul

Pattern Discovery
• RSAT
• MEME
• Weeder
• WebMOTIFS

Page 56: DNA Motif Finding 2010

Recommended Tools RSA Tools

Regulatory Sequence Analysis Tools
http://rsat.ulb.ac.be/rsat/

Modular computer programs specifically designed for the detection of regulatory signals in non-coding sequences.


Page 58: DNA Motif Finding 2010

Recommended Tools RSA Tools

Regulatory Sequence Analysis Tools

Nature Protocols Series: Volume 3 No 10 2008

• Using RSAT to scan genome sequences for transcription factor binding sites and cis-regulatory modules
• Using RSAT oligo-analysis and dyad-analysis tools to discover regulatory signals in nucleic sequences
• Analyzing multiple data sets by interconnecting RSAT programs via SOAP Web services - an example with ChIP-chip data
• Network Analysis Tools: from biological networks to clusters and pathways

Page 59: DNA Motif Finding 2010

Recommended Tools RSA Tools

Example Workflow

Problem
I have some differentially expressed genes from a microarray experiment. I would like to know if P53 binds in their promoter regions, and if so where.

Workflow
• BioMart: Convert Gene IDs, if necessary
• RSAT: retrieve sequence
• JASPAR: Get PWM (MA0106.1)
• RSAT: matrix-scan
• RSAT: feature map

Page 60: DNA Motif Finding 2010

Recommended Tools Pscan

Pscan
“Finding over-represented transcription factor binding site motifs in sequences from co-regulated or co-expressed genes”

Page 61: DNA Motif Finding 2010

Recommended Tools Pscan

Example Workflow

Problem
I have some differentially expressed genes from a microarray experiment. I would like to know which transcription factors bind to their promoters.

Workflow
• BioMart: Convert Gene IDs, if necessary
• Pscan: retrieve sequence

Page 70: DNA Motif Finding 2010

Recommended Tools Galaxy

Galaxy
http://main.g2.bx.psu.edu

“Galaxy allows you to do analyses you cannot do anywhere else without the need to install or download anything. You can analyze multiple alignments, compare genomic annotations, profile metagenomic samples and much much more...”

• Collection of online tools
• Modular
• Can create workflows
• Saved Histories
• Reproducible analysis
• Shared histories
• In house version
• Easily extendable

http://kinchie/galaxy

Page 73: DNA Motif Finding 2010

Recommended Tools MEME Suite

MEME Suite
Suite of web based tools for motif discovery

• MEME - de-novo motif finding
• MAST - find matches to known motifs (MEME output)
• TOMTOM - Compare motifs to TRANSFAC and JASPAR

Page 74: DNA Motif Finding 2010

Further Reading

Further Reading

• Stormo GD. DNA binding sites: representation and discovery. Bioinformatics. 2000 Jan;16(1):16-23. Review. PubMed PMID: 10812473.
• D’haeseleer P. How does DNA sequence motif discovery work? Nat Biotechnol. 2006 Aug;24(8):959-61. Review. PubMed PMID: 16900144.
• Das MK, Dai HK. A survey of DNA motif finding algorithms. BMC Bioinformatics. 2007 Nov 1;8 Suppl 7:S21. Review. PubMed PMID: 18047721; PubMed Central PMCID: PMC2099490.
• Tompa M, Li N, et al. Assessing computational tools for the discovery of transcription factor binding sites. Nat Biotechnol. 2005 Jan;23(1):137-44. PubMed PMID: 15637633.

Page 75: DNA Motif Finding 2010

Practical

Practical Session
