Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for...
Design and creation of multiple sequence alignments
Unit 15
BIOL221T: Advanced Bioinformatics for Biotechnology
Irene Gabashvili, PhD
IPA 6.0 license
Need a list of e-mails to create accounts
Will have a 6-week license (instead of 2 weeks)
Problem Set 3 is Pathway Analysis; the lab of March 19 will be on using IPA too
Problem Set 2 Review
Sensitivity and Specificity
Parameters for Multiple Alignment (Databases, Search Terms, Scores)
Transfac
Dotplots
Gene prediction flowchart
Evaluation of Splice Site Prediction
Fig 5.11, Baxevanis & Ouellette 2005
What do measures really mean?
Note typo in B&O
ROC curves (plots of Sn vs (1-Sp))
A receiver operating characteristic (ROC), or simply ROC curve, is a graphical plot of the sensitivity vs. (1 - specificity) for a binary classifier system as its discrimination threshold is varied.
The sensitivity and specificity of a diagnostic test depend on more than just the "quality" of the test; they also depend on the definition of what constitutes an abnormal test.
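A minimal sketch of how the points of a ROC curve arise when the threshold is varied. The scores and labels are made-up illustration data, and the function names `roc_points` and `auc` are hypothetical helpers, not part of any tool named in these slides.

```python
def roc_points(scores, labels):
    """Return (1 - specificity, sensitivity) pairs, one per threshold.

    A site is predicted positive when its score is >= the threshold;
    thresholds are swept from the highest score down to the lowest.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the piecewise-linear ROC curve (trapezoid rule)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Toy data: 1 = true site, 0 = not a site (illustration only)
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]
points = roc_points(scores, labels)
print(points)
print(auc(points))
```

Lowering the threshold can only add predicted positives, so both coordinates grow together; that is the trade-off the curve displays.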
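Evaluation of Splice Site Prediction

                   Actual True    Actual False
Predicted True     TP             FP             PP = TP+FP
Predicted False    FN             TN             PN = FN+TN
                   AP = TP+FN     AN = FP+TN

• Sensitivity: Sn = TP/AP = 1 - α (= Coverage)
• Misclassification rates: α = FN/AP, β = FP/AN
• Specificity: Sp = TP/PP = (1 - α)/(1 - α + rβ), where r = AN/AP
• Normalized specificity: σ = (1 - α)/(1 - α + β)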
Careful: different definitions for "Specificity"
• Specificity (Brendel definitions): Sp = TP/PP = TP/(TP+FP)
cf. Guigó definitions
Sn: Sensitivity = TP/(TP+FN)
Sp: Specificity = TN/(TN+FP) = Sp-
AC: Approximate Correlation = 0.5 x ((TP/(TP+FN)) + (TP/(TP+FP)) + (TN/(TN+FP)) + (TN/(TN+FN))) - 1
Other measures? Predictive Values, Correlation Coefficient
Best measures for comparing different methods?
• ROC curves (Receiver Operating Characteristic?!!)
http://www.anaesthetist.com/mnm/stats/roc/
"The Magnificent ROC" - has fun applets & quotes:
"There is no statistical test, however intuitive and simple, which will not be abused by medical researchers"
• Correlation Coefficient: Matthews correlation coefficient (MCC)
MCC = 1 for a perfect prediction
MCC = 0 for a completely random assignment
MCC = -1 for a "perfectly incorrect" prediction
Just FYI
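As a rough sketch, the measures discussed on these slides can be computed from the four confusion-matrix counts. The function name and the example counts are hypothetical. Both meanings of "specificity" are returned under separate keys so they cannot be confused, and the code makes no attempt to handle empty rows or columns (division by zero).

```python
import math


def prediction_measures(tp, fp, fn, tn):
    """Compute accuracy measures from confusion-matrix counts."""
    ap, an = tp + fn, fp + tn   # actual positives / actual negatives
    pp, pn = tp + fp, fn + tn   # predicted positives / predicted negatives
    return {
        "Sn": tp / ap,              # sensitivity (coverage)
        "Sp_TP_over_PP": tp / pp,   # "specificity" as TP/(TP+FP)
        "Sp_TN_over_AN": tn / an,   # "specificity" as TN/(TN+FP)
        "MCC": (tp * tn - fp * fn) / math.sqrt(ap * an * pp * pn),
        "AC": 0.5 * (tp / ap + tp / pp + tn / an + tn / pn) - 1,
    }


# Made-up counts: 100 real sites among 1000 candidates
m = prediction_measures(tp=80, fp=20, fn=20, tn=880)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

With many more negatives than positives, TN/(TN+FP) stays close to 1 even when a fifth of the predictions are wrong, which is one reason the two definitions should not be mixed.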
Promoters
What signals are there?
Simple ones in prokaryotes
Prokaryotic promoters
RNA polymerase complex recognizes promoter sequences located very close to & on 5’ side (“upstream”) of initiation site
RNA polymerase complex binds directly to these, with no requirement for “transcription factors”
Prokaryotic promoter sequences are highly conserved
-10 region
-35 region
Simpler view of complex promoters in eukaryotes:
Fig 5.12, Baxevanis & Ouellette 2005
Eukaryotic genes are transcribed by 3 different RNA polymerases
Recognize different types of promoters & enhancers:
Eukaryotic promoters & enhancers
Promoters located “relatively” close to initiation site (but can be located within gene, rather than upstream!)
Enhancers also required for regulated transcription (these control expression in specific cell types, developmental stages, in response to environment)
RNA polymerase complexes do not specifically recognize promoter sequences directly
Transcription factors bind first and serve as “landmarks” for recognition by RNA polymerase complexes
Eukaryotic transcription factors
Transcription factors (TFs) are DNA binding proteins that also interact with RNA polymerase complex to activate or repress transcription
TFs contain characteristic “DNA binding motifs”
http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=genomes.table.7039
TFs recognize specific short DNA sequence motifs: “transcription factor binding sites”
Several databases for these, e.g. TRANSFAC http://www.generegulation.com/cgibin/pub/databases/transfac
Zinc finger-containing transcription factors
• Common in eukaryotic proteins
• Estimated 1% of mammalian genes encode zinc-finger proteins
• In C. elegans, there are 500!
• Can be used as highly specific DNA binding modules
• Potentially valuable tools for directed genome modification (esp. in plants) & human gene therapy
Promoter prediction: Eukaryotes vs prokaryotes
Promoter prediction is easier in microbial genomes
Why?
• Highly conserved
• Simpler gene structures
• More sequenced genomes! (for comparative approaches)
Methods?
• Previously: mostly HMM-based
• Now: similarity-based comparative methods, because so many genomes are available
Predicting promoters: Steps & Strategies
Closely related to gene prediction!
• Obtain genomic sequence
• Use sequence-similarity based comparison (BLAST, MSA) to find related genes
  But: "regulatory" regions are much less well-conserved than coding regions
• Locate ORFs
• Identify TSS (if possible!)
• Use promoter prediction programs
• Analyze motifs, etc. in sequence (TRANSFAC)
Predicting promoters: Steps & Strategies
Identify TSS, if possible?
• One of biggest problems is determining exact TSS! Not very many full-length cDNAs!
• Good starting point? (human & vertebrate genes) Use FirstEF, found within UCSC Genome Browser, or submit to FirstEF web server
Fig 5.10, Baxevanis & Ouellette 2005
Automated promoter prediction strategies
1) Pattern-driven algorithms
2) Sequence-driven algorithms
3) Combined "evidence-based"
BEST RESULTS? Combined, sequential
Promoter Prediction: Pattern-driven algorithms
• Success depends on availability of collections of annotated binding sites (TRANSFAC & PROMO)
• Tend to produce huge numbers of FPs
• Why?
  • Binding sites (BS) for specific TFs often variable
  • Binding sites are short (typically 5-15 bp)
  • Interactions between TFs (& other proteins) influence affinity & specificity of TF binding
  • One binding site often recognized by multiple TFs
  • Biology is complex: promoters often specific to organism/cell/stage/environmental condition
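A back-of-the-envelope sketch of why short patterns yield so many false positives. It assumes a random sequence with equal base frequencies, which real genomes do not have, and both function names are hypothetical.

```python
import random


def expected_chance_hits(motif_length, sequence_length, strands=1):
    """Expected exact matches of one fixed motif in random DNA (25% per base)."""
    windows = sequence_length - motif_length + 1
    return strands * windows / 4 ** motif_length


def count_matches(sequence, motif):
    """Count exact, possibly overlapping, occurrences of motif in sequence."""
    k = len(motif)
    return sum(1 for i in range(len(sequence) - k + 1) if sequence[i:i + k] == motif)


# Simulated 100 kb of random DNA (fixed seed so the run is repeatable)
rng = random.Random(0)
random_dna = "".join(rng.choice("ACGT") for _ in range(100_000))

print("expected chance hits of a 6-mer:", expected_chance_hits(6, len(random_dna)))
print("observed hits of TATAAT:", count_matches(random_dna, "TATAAT"))
```

A 6 bp pattern is expected by chance about once every 4096 positions per strand, so a megabase-scale genome contains hundreds of matches that have nothing to do with regulation. Allowing mismatches, as variable binding sites require, raises that number further.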
Promoter Prediction: Pattern-driven algorithms
Solutions to problem of too many FP predictions?
• Take sequence context/biology into account
  • Eukaryotes: clusters of TFBSs are common
  • Prokaryotes: knowledge of factors helps
  • Probability of "real" binding site increases if annotated transcription start site (TSS) nearby
    • But: What about enhancers? (no TSS nearby!) & Only a small fraction of TSSs have been experimentally mapped
• Do the wet lab experiments!
  • But: Promoter-bashing is tedious
Promoter Prediction: Sequence-driven algorithms
• Assumption: common functionality can be deduced from sequence conservation
• Alignments of co-regulated genes should highlight elements involved in regulation
Careful: How to determine co-regulation?
• Orthologous genes from different species
• Genes experimentally determined to be co-regulated (using microarrays??)
• Comparative promoter prediction: "Phylogenetic footprinting" - more later….
Promoter Prediction: Sequence-driven algorithms
Problems:
• Need sets of co-regulated genes
• For comparative (phylogenetic) methods
  • Must choose appropriate species
  • Different genomes evolve at different rates
  • Classical alignment methods have trouble with translocations, inversions in order of functional elements
  • If background conservation of entire region is high, comparison is useless
  • Not enough data (Prokaryotes >>> Eukaryotes)
• Biology is complex: many (most?) regulatory elements are not conserved across species!
Examples of promoter prediction/characterization software
Lab: used MATCH, MatInspector
TRANSFAC
MEME & MAST
BLAST, etc.
Others?
FirstEF
Dragon Promoter Finder; also see Dragon Genome Explorer (has specialized promoter software for GC-rich DNA, finding CpG islands, etc.)
JASPAR
TRANSFAC matrix entry: for TATA box
Fields:
• Accession & ID
• Brief description
• TFs associated with this entry
• Weight matrix
• Number of sites used to build (How many here?)
• Other info
Fig 5.13, Baxevanis & Ouellette 2005
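To make the "weight matrix" and "number of sites used to build" fields concrete, here is a small sketch that builds a count matrix from aligned sites, turns it into log-odds weights, and scans a sequence. The five sites are invented for illustration (they are not the TRANSFAC TATA-box entry), and the pseudocount and uniform background are assumptions.

```python
import math

BASES = "ACGT"


def count_matrix(sites):
    """One dict of base counts per alignment column."""
    return [{b: column.count(b) for b in BASES} for column in zip(*sites)]


def log_odds_matrix(counts, n_sites, pseudocount=1.0, background=0.25):
    """Convert counts to log2(observed frequency / background) weights."""
    total = n_sites + 4 * pseudocount
    return [{b: math.log2((col[b] + pseudocount) / total / background) for b in BASES}
            for col in counts]


def best_hit(weights, sequence):
    """Slide the matrix along the sequence; return (start, score) of the best window."""
    width = len(weights)
    hits = [(start, sum(col[base] for col, base in zip(weights, sequence[start:start + width])))
            for start in range(len(sequence) - width + 1)]
    return max(hits, key=lambda hit: hit[1])


# Hypothetical aligned binding sites (illustration only)
sites = ["TATAAA", "TATAAA", "TATATA", "TATAAG", "TATAAA"]
counts = count_matrix(sites)
weights = log_odds_matrix(counts, n_sites=len(sites))
start, score = best_hit(weights, "GGCGTATAAAGGC")
print(start, round(score, 2))
```

The fewer sites a matrix is built from, the more the pseudocount dominates the weights, which is why the number of sites is worth checking in a matrix entry.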
Global alignment of human & mouse obese gene promoters (200 bp upstream from TSS)
Fig 5.14, Baxevanis & Ouellette 2005
GenBank IDs and Accessions
http://www.ncbi.nlm.nih.gov/RefSeq/key.html#accessions (Accession Formats: RefSeq)
http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html (GenBank Sample Record)
Why do we do multiple alignments?
– Help prediction of the secondary and tertiary structures of new sequences;
– Preliminary step in molecular evolution analysis using phylogenetic methods for constructing phylogenetic trees.
An example of Multiple Alignment
VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGTSSNIGS--ITVNWYQQLPG
LRLSCSSSGFIFSS--YAMYWVRQAPG
LSLTCTVSGTSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG--
ATLVCLISDFYPGA--VTVAWKADS--
AALGCLVKDYFPEP--VTVSWNSG---
VSLTCLVKGFYPSD--IAVEWWSNG--
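One thing a multiple alignment makes easy to see is which columns are conserved. The sketch below takes the example alignment from this slide, split into its eight rows of 27 columns, and reports the columns in which every sequence has the same residue. The helper names are hypothetical.

```python
from collections import Counter

alignment = [
    "VTISCTGSSSNIGAG-NHVKWYQQLPG",
    "VTISCTGTSSNIGS--ITVNWYQQLPG",
    "LRLSCSSSGFIFSS--YAMYWVRQAPG",
    "LSLTCTVSGTSFDD--YYSTWVRQPPG",
    "PEVTCVVVDVSHEDPQVKFNWYVDG--",
    "ATLVCLISDFYPGA--VTVAWKADS--",
    "AALGCLVKDYFPEP--VTVSWNSG---",
    "VSLTCLVKGFYPSD--IAVEWWSNG--",
]


def column_profile(rows):
    """Yield (column index, most common symbol, its fraction) for each column."""
    for index, column in enumerate(zip(*rows)):
        symbol, count = Counter(column).most_common(1)[0]
        yield index, symbol, count / len(rows)


def fully_conserved_columns(rows):
    """0-based indices of columns where all rows share one non-gap residue."""
    return [index for index, symbol, fraction in column_profile(rows)
            if fraction == 1.0 and symbol != "-"]


for index in fully_conserved_columns(alignment):
    print(index + 1, alignment[0][index])   # 1-based column and its residue
```

In this example only two columns come out as invariant, a cysteine and a tryptophan; every other column tolerates at least one substitution or gap.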
Visualization example
Other multiple alignment programs
ClustalW / ClustalX
pileup
multalign
multal
saga
hmmt
DIALIGN
SBpima
MLpima
T-Coffee
...
ClustalW for multiple alignment
ClustalW can create multiple alignments, manipulate existing alignments, do profile analysis and create phylogenetic trees.
Alignment can be done by 2 methods:
- slow/accurate
- fast/approximate
Running ClustalW
[~]% clustalw

 **************************************************************
 ******** CLUSTAL W (1.7) Multiple Sequence Alignments  ********
 **************************************************************

     1. Sequence Input From Disc
     2. Multiple Alignments
     3. Profile / Structure Alignments
     4. Phylogenetic trees

     S. Execute a system command
     H. HELP
     X. EXIT (leave program)

Your choice:
Running ClustalW
The input file for ClustalW is a file containing all sequences in one of the following formats: NBRF/PIR, EMBL/SwissProt, Pearson (Fasta), GDE, Clustal, GCG/MSF, RSF.
Using ClustalW

 ****** MULTIPLE ALIGNMENT MENU ******

    1. Do complete multiple alignment now (Slow/Accurate)
    2. Produce guide tree file only
    3. Do alignment using old guide tree file

    4. Toggle Slow/Fast pairwise alignments = SLOW

    5. Pairwise alignment parameters
    6. Multiple alignment parameters

    7. Reset gaps between alignments? = OFF
    8. Toggle screen display          = ON
    9. Output format options

    S. Execute a system command
    H. HELP
    or press [RETURN] to go back to main menu

Your choice:
Output of ClustalW
CLUSTAL W (1.7) multiple sequence alignment

HSTNFR      GGGAAGAG---TTCCCCAGGGACCTCTCTCTAATCAGCCCTCTGGCCCAG------GCAG
SYNTNFTRP   GGGAAGAG---TTCCCCAGGGACCTCTCTCTAATCAGCCCTCTGGCCCAG------GCAG
CFTNFA      -------------------------------------------TGTCCAG------ACAG
CATTNFAA    GGGAAGAG---CTCCCACATGGCCTGCAACTAATCAACCCTCTGCCCCAG------ACAC
RABTNFM     AGGAGGAAGAGTCCCCAAACAACCTCCATCTAGTCAACCCTGTGGCCCAGATGGTCACCC
RNTNFAA     AGGAGGAGAAGTTCCCAAATGGGCTCCCTCTCATCAGTTCCATGGCCCAGACCCTCACAC
OATNFA1     GGGAAGAGCAGTCCCCAGCTGGCCCCTCCTTCAACAGGCCTCTGGTTCAG------ACAC
OATNFAR     GGGAAGAGCAGTCCCCAGCTGGCCCCTCCTTCAACAGGCCTCTGGTTCAG------ACAC
BSPTNFA     GGGAAGAGCAGTCCCCAGGTGGCCCCTCCATCAACAGCCCTCTGGTTCAA------ACAC
CEU14683    GGGAAGAGCAATCCCCAACTGGCCTCTCCATCAACAGCCCTCTGGTTCAG------ACCC
                                                           **        *
ClustalW options
Your choice: 5

 ********* PAIRWISE ALIGNMENT PARAMETERS *********

     Slow/Accurate alignments:

     1. Gap Open Penalty       :15.00
     2. Gap Extension Penalty  :6.66
     3. Protein weight matrix  :BLOSUM30
     4. DNA weight matrix      :IUB

     Fast/Approximate alignments:

     5. Gap penalty            :5
     6. K-tuple (word) size    :2
     7. No. of top diagonals   :4
     8. Window size            :4

     9. Toggle Slow/Fast pairwise alignments = SLOW

     H. HELP

Enter number (or [RETURN] to exit):
ClustalW options
Your choice: 6

 ********* MULTIPLE ALIGNMENT PARAMETERS *********

     1. Gap Opening Penalty              :15.00
     2. Gap Extension Penalty            :6.66
     3. Delay divergent sequences        :40 %

     4. DNA Transitions Weight           :0.50

     5. Protein weight matrix            :BLOSUM series
     6. DNA weight matrix                :IUB
     7. Use negative matrix              :OFF

     8. Protein Gap Parameters

     H. HELP

Enter number (or [RETURN] to exit):
Blocks database and tools
Blocks are multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins.
The Blocks web server tools are: Block Searcher, Get Blocks and Block Maker. These are aids to detection and verification of protein sequence homology.
They compare a protein or DNA sequence to a database of protein blocks, retrieve blocks, and create new blocks, respectively.
The BLOCKS web server
At URL: http://blocks.fhcrc.org/
The BLOCKS WWW server can be used to create blocks of a group of sequences, or to compare a protein sequence to a database of blocks.
The Blocks Searcher tool should be used for multiple alignment of distantly related protein sequences.
The Blocks Searcher tool
For searching a database of blocks, the first position of the sequence is aligned with the first position of the first block, and a score for that amino acid is obtained from the profile column corresponding to that position. Scores are summed over the width of the alignment, and then the block is aligned with the next position.
This procedure is carried out exhaustively for all positions of the sequence for all blocks in the database, and the best alignments between a sequence and entries in the BLOCKS database are noted. If a particular block scores highly, it is possible that the sequence is related to the group of sequences the block represents.
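The sliding search described above can be sketched in a few lines. The three-column block and its scores are invented for illustration, residues missing from a column are scored as 0 (an assumption), and this is not the scoring actually used by the BLOCKS server.

```python
def scan_block(profile, sequence):
    """Score the ungapped block at every start position of the sequence.

    profile is a list of columns; each column maps a residue to its score.
    Returns a list of (start, summed score) pairs.
    """
    width = len(profile)
    results = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        total = sum(column.get(residue, 0) for column, residue in zip(profile, window))
        results.append((start, total))
    return results


def best_alignment(profile, sequence):
    """Return the (start, score) pair with the highest summed score."""
    return max(scan_block(profile, sequence), key=lambda hit: hit[1])


# Hypothetical 3-column block (scores are made up)
toy_block = [
    {"G": 5, "A": 1},
    {"K": 4, "R": 3},
    {"S": 5, "T": 4},
]
query = "MAGKSLLGRT"
print(scan_block(toy_block, query))
print(best_alignment(toy_block, query))
```

Repeating this for every block in a database, and keeping the top-scoring placements, gives the list of candidate families for the query.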
The Blocks Searcher tool
Typically, a group of proteins has more than one region in common and their relationship is represented as a series of blocks separated by unaligned regions. If a second block for a group also scores highly in the search, the evidence that the sequence is related to the group is strengthened, and is further strengthened if a third block also scores highly, and so on.
The BLOCKS Database
The blocks for the BLOCKS database are made automatically by looking for the most highly conserved regions in groups of proteins represented in the PROSITE database. These blocks are then calibrated against the SWISS-PROT database to obtain a measure of the chance distribution of matches. It is these calibrated blocks that make up the BLOCKS database.
The Block Maker Tool
Block Maker finds conserved blocks in a group of two or more unaligned protein sequences, which are assumed to be related, using two different algorithms.
Input file must contain at least 2 sequences.
Input sequences must be in FastA format.
Results are returned by e-mail.
Progressive Approaches
CLUSTALW
  Perform pairwise alignments
  Construct a tree, joining most similar sequences first (guide tree)
  Align sequences sequentially, using the phylogenetic tree
PILEUP
  Similar to CLUSTALW
  Uses UPGMA to produce tree (chapter 6)
Clustal method
Higgins and Sharp 1988
ref: CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene, 73, 237–244. [Medline]
Progressive alignment method
An approximation strategy (heuristic algorithm) yields a possible alignment, but not necessarily the best one
First step:
Sequences: A, B, C, D
Compute the pairwise alignments for all against all (6 pairwise alignments); the similarities are stored in a table

     A    B    C    D
A
B   11
C    3    1
D    2    2   10
Second step:
Cluster the sequences to create a tree (guide tree):
• Represents the order in which pairs of sequences are to be aligned
• Highly similar sequences are neighbors in the tree
• Highly distant sequences are distant from each other in the tree
(Figure: similarity table and guide tree with leaves A, D, C, B)
Third step:
Align most similar pairs
Align the alignments as if each of them was a single sequence (with the use of a consensus sequence or a profile)
(Figure: guide tree with leaves A, D, C, B)
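The three steps can be tied together with the similarity values from the table (A-B 11, A-C 3, B-C 1, A-D 2, B-D 2, C-D 10). The sketch below derives a join order by repeatedly merging the two most similar clusters, using average linkage on similarities. This is a simplification chosen for illustration and not ClustalW's own tree-building code; the function names are hypothetical.

```python
from itertools import combinations


def average_similarity(members_a, members_b, sim):
    """Mean pairwise similarity between two groups of sequences."""
    pairs = [(a, b) for a in members_a for b in members_b]
    return sum(sim[frozenset(pair)] for pair in pairs) / len(pairs)


def build_guide_tree(names, sim):
    """Greedy clustering: always join the two most similar clusters.

    Returns (tree, join_order); the tree is a nested tuple and join_order
    lists the pairs in the order a progressive aligner would align them.
    """
    clusters = [(name, [name]) for name in names]   # (subtree, member names)
    join_order = []
    while len(clusters) > 1:
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda ij: average_similarity(clusters[ij[0]][1], clusters[ij[1]][1], sim),
        )
        (tree_i, members_i), (tree_j, members_j) = clusters[i], clusters[j]
        join_order.append((tree_i, tree_j))
        merged = ((tree_i, tree_j), members_i + members_j)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0], join_order


# Similarity table from the slide
sim = {
    frozenset("AB"): 11, frozenset("AC"): 3, frozenset("BC"): 1,
    frozenset("AD"): 2, frozenset("BD"): 2, frozenset("CD"): 10,
}
tree, join_order = build_guide_tree(["A", "B", "C", "D"], sim)
print(tree)
print(join_order)
```

With these values A and B are aligned first, then C and D, and finally the two pairwise alignments are aligned to each other as profiles.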
Clustal programs
ClustalV
ClustalW
  Thompson et al., 1994
  Uses: sequence weighting, position-specific gap penalties and weight matrix choice
  W stands for weighting of sequences
ClustalX - windows implementation
ClustalW method rules (1)
sequence weighting
Each sequence is weighted according to how different it is from the other sequences.
For the case where one specific subfamily is overrepresented in the data
ClustalW method rules (2)
weight matrix choice
The substitution matrix used for each alignment step depends on the similarity of the sequences.
ClustalW method rules (3)
position-specific gap penalties
Gaps found in initial alignments remain fixed through the process (ends gap)
Hydrophobic residues have higher gap penalties than hydrophilic: they are more likely to be in the hydrophobic core, where gaps should not occur.
ClustalW method shortcomings
(1) Sequences that are similar only in sub-regions: ClustalW forces a global alignment, not a local one.
(2) A sequence that contains a large insertion/deletion compared to the rest will strongly affect the alignment (again global, not local).
ClustalW method shortcomings
(3) A sequence that contains a repetitive element (such as a domain), whereas all other sequences only contain one copy.
Comments
Pairwise alignment uses an optimal algorithm
Progressive multiple alignment is not an optimal algorithm – only a heuristic. Better alignments may exist!
The algorithm yields a possible alignment, but not necessarily the best one.
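For contrast with the heuristic multiple-alignment case, here is a compact sketch of optimal pairwise global alignment by dynamic programming (Needleman-Wunsch). The match, mismatch and linear gap scores are arbitrary illustration values, not ClustalW defaults, and real programs use substitution matrices and affine gap penalties instead.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for an optimal global alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap          # a prefix aligned against gaps only
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back from the bottom-right cell to recover one optimal alignment
    aligned_a, aligned_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        step = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + step:
            aligned_a.append(a[i - 1]); aligned_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_a.append(a[i - 1]); aligned_b.append("-"); i -= 1
        else:
            aligned_a.append("-"); aligned_b.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(aligned_a)), "".join(reversed(aligned_b))


best_score, top, bottom = needleman_wunsch("GATTACA", "GATACA")
print(best_score)
print(top)
print(bottom)
```

The table has one cell per pair of positions, so the cost grows with the product of the sequence lengths. Extending the same exact approach to many sequences at once multiplies the table by another dimension per sequence, which is why progressive heuristics are used instead.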
ClustalW in the web server
Global multiple sequence alignment program for DNA or proteins
Available from a number of sites
EMBL-EBI
Results
Alignment with colors
identity, similarity
CLUSTAL format
CLUSTAL W (1.82) multiple sequence alignment

YPK1            SQLSWKRLLMKGYIPPYKPAVSD-Q--NSMDTSNFDEEFTR--SEKPIDSVVDEYLSESV
YPK2            KDISWKKLLLKGYIPPYKPIVKDTQ--SEIDTANFDQEFTK---EKPIDSVVDEYLSASI
KPCA_HUMAN      RRIDWEKLENREIQPPFKPKVC------GKGAENFDKFFTR---GQPVLTPPDQLVIANI
KPCZ_HUMAN      RSIDWDLLEKKQALPPFQPQIT---M-DDYGLDNFDTQFTS---EPVQLTPDDEDAIKRI
KAPA            KEVVWEKLLSRNIETPYEPPIQ----QGQGDTSQFDKYPE----EDINYGVQGEDPYADL
KAPC            NEVIWEKLLARYIETPYEPPIQ----QGQGDTSQFDRYPE-EVDEEFNYGIQGEDPYMDL
KAPB            SEVVWERLLAKDIETPYEPPIT----SGIGDTSLFDQYPE-DV-EQLDYGIQGDDPYAEY
KS6_HUMAN       RHINWEELLARKVEPPFKPLLQ-----SEEDVSQFDSKFTR-V-QTPVDSP-DDSTLSES
                * *.

YPK1            -----MQKQF
YPK2            ----N-QKQF
KPCA_HUMAN      D--O--QSDF
KPCZ_HUMAN      D-----QSEF
KAPA            -D----FRDF
KAPC            -D----MKEF
KAPB            --P---FQDF
KS6_HUMAN       A-----NQVF
ClustalW at EMBL - Jalview
conservation
Jalview is a multiple alignment editor
Jalview
color menu:
• Taylor colors (each amino acid is colored differently)
• Zappo colors (amino acids are colored according to their physico-chemical properties)
• Hydrophobicity colors (colors amino acids according to a certain score scale that represents hydrophobicity)
• Coloring residues above a percentage identity threshold
• User defined color schemes
Example - Zappo colors
physico-chemical properties color-code:
Guide Tree
ClustalX
ClustalX provides a window-based user interface to the ClustalW program.
It uses the Vibrant multi-platform user interface development library, developed by the NCBI as part of their NCBI SOFTWARE DEVELOPMENT TOOLKIT.
T-coffee
Another MSA program
Protein & nucleotide MSA program
Uses principles similar to ClustalW
More accurate but longer running times
Limits the number of sequences it can align (~100)
T-coffee at EMBnet
T-coffee results
Phylip format
 5 99
Cabd_199509   PQITLWQRPLVTIKIGGQLKEALLDTGADDTVLEEMNLPGKWKPKIIGGI
JCSB1_199401  PQITLWQRPIVTIKIGGQLKEALLDTGAD---LEEM-NPGRWKPKIIGGI
JCSB2_199401  PQITLWQRPVT-IK-GG-QLEALLDTGADDTL-EEI-LPGRW-PKMIGGI
JCSB4_199401  PQITLWQRPVT--K-GG-LKEALLDTGADDTE-----DPGRWKPKMIGGI
JCSB5_199401  PQITLWQRPIVTIKVGGQLKEALLDTGADDTVL-EMNLPGRWKPKMIGGI

              GGFIKVRQYDQVPIEICGHKAIGTVLVGPTPSNIIGRNLLTQLGCTLNF
              GGFVKVRQYDQIPIDICGHKVIGTVL-GPTPANVIGRNLLTQIGCTLNF
              GGFVKVR-YDQVPIEICGH--IGTVLVGPTPANIIGRNLMTQLGCTLNF
              GGFLKVRQYDQIPVEICGHKAIGTVL-GPTPANIIGRNLLTQIG-TLNF
              GGFVKVRQYDQIPIEICGHKAIGTVLVGPTPANIVGRNLLTQIGCTLNF
The Biology WorkBench
http://workbench.sdsc.edu/
http://www.ngbw.org/
Nucleic Acid Sequence Tools, including BLAST, CLUSTALW, MFOLD, PRIMER3
Muscle
Protein & nucleotide MSA program
Improvements in both accuracy and speed
exploiting a range of existing and new algorithmic techniques
combination of progressive and iterative alignment strategies
details of the method
web server
downloads: Windows, Linux, Mac
Muscle web server
Editing MSA
There are a variety of tools that can be used to modify a multiple alignment (SeaView, BioEdit, JalView)
These programs can be very useful in formatting and annotating an alignment for publication.
An editor can also be used to make modifications by hand to improve biologically significant regions in a multiple alignment created by one of the automated alignment programs.
MSA approaches
Progressive approach: CLUSTALW (CLUSTALX), PileUp, T-COFFEE, MAFFT, MUSCLE
Iterative approach: Repeatedly realign subsets of sequences. MultAlin, DIALIGN, MAFFT, MUSCLE, ProbCons
Genetic algorithm: SAGA
Graph algorithm: POA
Conclusion
There is no single method that always generates the best alignment
It may thus be wise to use more than one method
Alignment editors can be used to correct the alignments