Multiple sequence alignments and motif discovery
Tutorial 5
• Multiple sequence alignment
  – ClustalW
  – Muscle
• Motif discovery
  – MEME
  – JASPAR
Multiple sequence alignments and motif discovery
• More than two sequences
  – DNA
  – Protein
• Evolutionary relation
  – Homology, phylogenetic tree
  – Detect motifs
Multiple Sequence Alignment
[Figure: example DNA sequences shown ungapped and as a gapped alignment, next to a tree with nodes labelled A, B and D]
• Dynamic programming
  – Finds the optimal alignment
  – Running time is exponential in the number of sequences
• Progressive
  – Efficient
  – Heuristic (no guarantee of optimality)
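To make the cost argument concrete, here is a minimal sketch of the dynamic-programming score for just two sequences (Needleman-Wunsch), plus a helper that counts the table cells an exact alignment of N sequences would need. The scoring values (match 1, mismatch -1, gap -2) and both function names are illustrative assumptions, not parameters taken from ClustalW or Muscle.

```python
def nw_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Optimal global alignment score of two sequences (linear gap cost)."""
    # prev[j] is the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def dp_cells(length: int, n_sequences: int) -> int:
    """Table cells needed for exact DP over n sequences of the same length."""
    return (length + 1) ** n_sequences


print(nw_score("ACGT", "AGT"))
print(dp_cells(100, 2), dp_cells(100, 4))
```

Two sequences of length 100 need about ten thousand cells, four need about a hundred million, which is why progressive heuristics are used in practice.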
ClustalW
“CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice”, J. D. Thompson et al.
ClustalW
• Progressive
  – At each step, align two existing alignments or sequences
  – Gaps present in older alignments remain fixed

-TGTTAAC
-TGT-AAC
-TGT--AC
ATGT---C
ATGT-GGC
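The "gaps remain fixed" rule can be sketched as follows: a new sequence is aligned against the columns of an existing alignment, and the only permitted change to the old rows is inserting a whole new gap column. This is a simplified illustration, not ClustalW's actual procedure; the function name `add_sequence`, the sum-over-rows column score and the scoring values are all assumptions, and sequence weighting and position-specific gap penalties are left out.

```python
def add_sequence(alignment, seq, match=1, mismatch=-1, gap=-2):
    """Align seq to an existing alignment; old columns are never modified."""
    rows = len(alignment)
    cols = [[row[c] for row in alignment] for c in range(len(alignment[0]))]

    def col_vs_residue(col, r):
        # Score of placing residue r under an existing column.
        return sum(gap if x == "-" else (match if x == r else mismatch) for x in col)

    def col_vs_gap(col):
        # Score of placing a gap in the new sequence under an existing column.
        return gap * sum(x != "-" for x in col)

    m, n = len(cols), len(seq)
    score = [[0] * (n + 1) for _ in range(m + 1)]
    move = [[""] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        score[i][0], move[i][0] = score[i - 1][0] + col_vs_gap(cols[i - 1]), "up"
    for j in range(1, n + 1):
        score[0][j], move[0][j] = score[0][j - 1] + gap * rows, "left"
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            options = [
                (score[i - 1][j - 1] + col_vs_residue(cols[i - 1], seq[j - 1]), "diag"),
                (score[i - 1][j] + col_vs_gap(cols[i - 1]), "up"),
                (score[i][j - 1] + gap * rows, "left"),  # new all-gap column
            ]
            score[i][j], move[i][j] = max(options, key=lambda o: o[0])

    # Trace back from the bottom-right corner to rebuild the merged alignment.
    out_cols, new_row = [], []
    i, j = m, n
    while i > 0 or j > 0:
        step = move[i][j]
        if step == "diag":
            out_cols.append(cols[i - 1]); new_row.append(seq[j - 1]); i -= 1; j -= 1
        elif step == "up":
            out_cols.append(cols[i - 1]); new_row.append("-"); i -= 1
        else:
            out_cols.append(["-"] * rows); new_row.append(seq[j - 1]); j -= 1
    out_cols.reverse()
    new_row.reverse()
    merged = ["".join(col[r] for col in out_cols) for r in range(rows)]
    return merged + ["".join(new_row)]


old = ["-TGTTAAC", "-TGT-AAC", "-TGT--AC", "ATGT---C"]
for row in add_sequence(old, "ATGTGGC"):
    print(row)
```

Whatever the new sequence looks like, the first four rows come back unchanged apart from any newly inserted all-gap columns.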
ClustalW - Input
http://www.ebi.ac.uk/Tools/clustalw2/index.html
Input sequences
Gap scoring
Scoring matrix
Email address
Output format
ClustalW - Output
Match strength in decreasing order: * : .
ClustalW - Output
Pairwise alignment scores
Building alignment
Final score
Building tree
ClustalW - Output
Sequence names
Sequence positions
Match strength in decreasing order: * : .
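As a rough illustration of how such a match-strength line can be derived for a protein alignment, the sketch below marks a column "*" when all residues are identical, ":" when they all fall within one strongly conserved group, and "." when they fall within one weakly conserved group. The group lists are quoted from memory of the ClustalW documentation and should be treated as an assumption to verify; the toy alignment and the function name are made up for the example. Columns containing a gap are left blank here.

```python
# Residue groups as recalled from the ClustalW documentation (assumption).
STRONG = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK", "MILV", "MILF", "HY", "FYW"]
WEAK = ["CSA", "ATV", "SAG", "STNK", "STPA", "SGND",
        "SNDEQK", "NDEQHK", "NEQHRK", "FVLIM", "HFY"]


def conservation_line(rows):
    """Return one symbol per alignment column: '*', ':', '.' or ' '."""
    line = []
    for column in zip(*rows):
        residues = set(column)
        if "-" in residues:
            line.append(" ")
        elif len(residues) == 1:
            line.append("*")
        elif any(residues <= set(group) for group in STRONG):
            line.append(":")
        elif any(residues <= set(group) for group in WEAK):
            line.append(".")
        else:
            line.append(" ")
    return "".join(line)


toy = ["MKTAY", "MRTSY", "MKSGY"]  # hypothetical aligned protein fragments
print(conservation_line(toy))  # *::.*
```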
ClustalW - Output
Branch length
Muscle
http://www.ebi.ac.uk/Tools/muscle/index.html
Muscle - Output
What’s the difference between Muscle and ClustalW?
ClustalW Muscle
http://www.megasoftware.net/index.html
Can we find motifs using multiple sequence alignment?
      1    2    3    4    5    6    7    8    9    10
A     0    0    0    0    0    1/2  1/6  1/3  0    0
D     0    1/2  1/3  0    0    1/6  5/6  1/6  0    1/6
E     0    0    2/3  1    0    0    0    0    1    5/6
G     0    1/6  0    0    1    1/3  0    0    0    0
H     0    1/6  0    0    0    0    0    0    0    0
N     0    1/6  0    0    0    0    0    0    0    0
Y     1    0    0    0    0    0    0    1/2  0    0
  1 3 5 7 9
..YDEEGGDAEE..
..YDEEGGDAEE..
..YGEEGADYED..
..YDEEGADYEE..
..YNDEGDDYEE..
..YHDEGAADEE..
  * :**   *:
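A frequency matrix of this kind can be computed directly from the aligned sites. The sketch below counts residues per column of the six sites shown on the slide and divides by the number of sites; the function name `frequency_matrix` is an illustrative choice, and exact fractions are used only so the output matches the slide's notation.

```python
from collections import Counter
from fractions import Fraction

SITES = [
    "YDEEGGDAEE",
    "YDEEGGDAEE",
    "YGEEGADYED",
    "YDEEGADYEE",
    "YNDEGDDYEE",
    "YHDEGAADEE",
]


def frequency_matrix(sites):
    """One dict per column mapping residue -> relative frequency."""
    n = len(sites)
    return [
        {res: Fraction(count, n) for res, count in Counter(column).items()}
        for column in zip(*sites)
    ]


pfm = frequency_matrix(SITES)
for position, column in enumerate(pfm, start=1):
    print(position, {res: str(freq) for res, freq in sorted(column.items())})
```

Residues that never occur in a column are simply absent from that column's dict, which corresponds to a 0 in the table.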
Motif: a widespread pattern with biological significance
Can we find motifs using multiple sequence alignment?
YES! NO
MEME – Multiple EM* for Motif Elicitation
• http://meme.sdsc.edu/
• Motif discovery from unaligned sequences
  – Genomic or protein sequences
• Flexible model of motif presence (a motif can be absent in some sequences or appear several times in one sequence)
*Expectation-maximization
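To give a feel for expectation-maximization on unaligned sequences, here is a deliberately simplified sketch that assumes exactly one motif occurrence per sequence, a uniform background, and a user-supplied seed site. MEME's real models are more flexible (a motif may be absent or repeated) and its initialisation and scoring differ; the function name `em_motif`, the seed weight 0.7 and the pseudocount are assumptions made for illustration.

```python
def em_motif(seqs, seed_site, alphabet="ACGT", iterations=50, pseudo=0.1):
    """Toy EM: one motif occurrence per sequence, uniform background."""
    width, k = len(seed_site), len(alphabet)
    background = 1.0 / k
    # Initial model: each column favours the seed residue.
    pwm = [{a: (0.7 if a == s else 0.3 / (k - 1)) for a in alphabet} for s in seed_site]
    posteriors = []
    for _ in range(iterations):
        counts = [{a: pseudo for a in alphabet} for _ in range(width)]
        posteriors = []
        for seq in seqs:
            # E-step: posterior probability of the motif starting at each position.
            likes = []
            for start in range(len(seq) - width + 1):
                like = 1.0
                for c in range(width):
                    like *= pwm[c][seq[start + c]] / background
                likes.append(like)
            total = sum(likes)
            post = [value / total for value in likes]
            posteriors.append(post)
            # Accumulate expected residue counts for the M-step.
            for start, p in enumerate(post):
                for c in range(width):
                    counts[c][seq[start + c]] += p
        # M-step: re-estimate column probabilities from the expected counts.
        pwm = [{a: col[a] / sum(col.values()) for a in alphabet} for col in counts]
    best_starts = [max(range(len(p)), key=p.__getitem__) for p in posteriors]
    return pwm, best_starts


seqs = ["AAATGACGTAAA", "CCTGACGTCCCC", "GGGGGTGACGTG", "TGACGTATATAT"]
pwm, starts = em_motif(seqs, seed_site="TGACGT")
print(starts)
```

In this toy input the same six-letter site is planted at a different offset in each sequence, so no alignment of the full sequences is needed to recover it.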
MEME - Input
Email address
Input file (FASTA file)
How many times in each sequence?
How many motifs?
How many sites?
Range of motif lengths
MEME - Output
Motif score
MEME - Output
Motif length
Number of times
Motif score
MEME - Output
Low uncertainty = High information content
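One common way to quantify this is the information content of a motif column in bits: the maximum possible entropy for the alphabet minus the column's Shannon entropy. The sketch below uses that textbook definition; it omits the small-sample correction that logo tools often apply, and the function name and example columns are illustrative assumptions.

```python
import math


def information_content(column, alphabet_size):
    """Bits of information in one motif column (no small-sample correction)."""
    entropy = -sum(p * math.log2(p) for p in column.values() if p > 0)
    return math.log2(alphabet_size) - entropy


# A fully conserved protein column carries log2(20) bits, about 4.32.
print(information_content({"Y": 1.0}, 20))
# A DNA column with all four bases equally likely carries 0 bits.
print(information_content({"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}, 4))
```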
MEME - Output
Multilevel Consensus
Sequence names
Position in sequence
Strength of match
Motif within sequence
MEME - Output
Overall strength of motif matches
Motif location in the input sequence
MEME - Output
Sequence names
MAST
• Searches for motifs (one or more) in sequence databases:
  – Like BLAST, but with motifs as input
  – Similar to iterations of PSI-BLAST
• Profile defines strength of match
  – Multiple motif matches per sequence
  – Combined E value for all motifs
• MEME uses MAST to summarize results:
  – Each MEME result is accompanied by the MAST result for searching the discovered motifs on the given sequences.
http://meme.sdsc.edu/meme4_4_0/cgi-bin/mast.cgi
MAST - Input
Email address
Input file (motifs)
Database
JASPAR
• Profiles
  – Transcription factor binding sites
  – Multicellular eukaryotes
  – Derived from published collections of experiments
• Open data access
JASPAR
• Profiles
  – Modeled as matrices.
  – Can be converted into PSSMs for scanning genomic sequences.
      1    2    3    4    5    6    7    8    9    10
A     0    0    0    0    0    1/2  1/6  1/3  0    0
D     0    1/2  1/3  0    0    1/6  5/6  1/6  0    1/6
E     0    0    2/3  1    0    0    0    0    1    5/6
G     0    1/6  0    0    1    1/3  0    0    0    0
H     0    1/6  0    0    0    0    0    0    0    0
N     0    1/6  0    0    0    0    0    0    0    0
Y     1    0    0    0    0    0    0    1/2  0    0
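The conversion from a frequency matrix to a PSSM and the scanning step can be sketched as below. It builds log-odds scores from the six example sites shown earlier in these slides (amino-acid sites, so a 20-letter alphabet is used, whereas real JASPAR profiles are nucleotide matrices) and slides the matrix along a query sequence. The uniform background, the pseudocount of 1, the function names and the query sequence are assumptions made for illustration.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
SITES = ["YDEEGGDAEE", "YDEEGGDAEE", "YGEEGADYED",
         "YDEEGADYEE", "YNDEGDDYEE", "YHDEGAADEE"]


def build_pssm(sites, pseudocount=1.0):
    """Log2-odds score per residue and column, against a uniform background."""
    background = 1.0 / len(ALPHABET)
    denom = len(sites) + pseudocount * len(ALPHABET)
    pssm = []
    for column in zip(*sites):
        counts = Counter(column)
        pssm.append({
            a: math.log2(((counts[a] + pseudocount) / denom) / background)
            for a in ALPHABET
        })
    return pssm


def scan(pssm, sequence):
    """Score every window of the sequence; returns (start, score) pairs."""
    width = len(pssm)
    return [
        (start, sum(pssm[c][sequence[start + c]] for c in range(width)))
        for start in range(len(sequence) - width + 1)
    ]


pssm = build_pssm(SITES)
hits = scan(pssm, "MKTYDEEGGDAEELLK")  # hypothetical query sequence
best_start, best_score = max(hits, key=lambda hit: hit[1])
print(best_start, round(best_score, 2))
```

The pseudocount keeps unseen residues from producing a score of minus infinity, at the cost of slightly flattening the profile.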
Search profile
http://jaspar.genereg.net/
score
organism
logo
Name of gene/protein