Alignments and database searches - Göteborgs...


Page 1: Alignments and database searches - Göteborgs universitetbio.lundberg.gu.se/courses/vt03/lecture2.pdf · hth14 lvlhdiaeavgmhestisrvtt hth15 lnlrivadaikmhestvsrvts hth16 mtrgdignylgltvetisrllg

Alignments and database searches

Common biological problem: We have a novel protein sequence. What can we infer from this sequence about the biological function of the protein?

* Pattern search - PROSITE
* Profile search - Pfam
* Prediction of transmembrane domains (~25% of all proteins are membrane bound!)
* Sequence homology - BLAST, FASTA, SSEARCH

Simple example: an unknown human protein is highly similar to a protein with known function from another organism => the two proteins are probably homologous, and the human protein probably has the same or a closely related function (it is an ortholog or a paralog).


Comparing non-identical sequences
Protein sequence comparison - basic concepts

When two protein sequences are compared and the similarity is considered statistically significant, it is highly likely that the two proteins are evolutionarily related.

Proteins are homologous if they are related by divergence from a common ancestor. There are two kinds of homologs:

Orthologs: homologs separated by a speciation event. They typically carry out the same function in different species.

Paralogs: homologs separated by a gene duplication event. They typically perform different but related functions within one organism.


What are orthologs?

[Figure: gene X in an ancestral organism. After speciation, organism A and organism B each carry a copy of X, which diverge into X1 and X2. X1 and X2 are orthologs.]


What are paralogs?

[Figure: gene X is duplicated within one organism. The two copies diverge into Xa and Xb. Xa and Xb are paralogs.]


Mouse trypsin       -- orthologs --  Human trypsin
      |                                    |
   paralogs                             paralogs
      |                                    |
Mouse chymotrypsin  -- orthologs --  Human chymotrypsin


Pairwise alignments:

- Gap creation penalty
- Gap extension penalty
- Substitution matrix (proteins)

Global alignment
Considers similarity across the full extent of the sequences.

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
   |    |   |||||||    |      |
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

Local alignment
Considers regions of similarity in parts of the sequences only.

xxxxxxx
|||||||   region of similarity
xxxxxxx
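The difference between the two alignment types can be sketched with a small dynamic-programming scorer. This is only an illustration, not the code of any program named in these slides. It assumes a plain match/mismatch score in place of a substitution matrix and a single linear gap penalty in place of separate gap creation and extension penalties. The function name `align_score` and all parameter values are made up for the example.

```python
def align_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Best alignment score of a and b.

    local=False: global alignment (Needleman-Wunsch style).
    local=True:  local alignment (Smith-Waterman style).
    Assumption: linear gap penalty, simple match/mismatch scores.
    """
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]
    if not local:
        # A global alignment must pay for leading gaps.
        for i in range(1, rows):
            table[i][0] = i * gap
        for j in range(1, cols):
            table[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            value = max(table[i - 1][j - 1] + pair,  # align two symbols
                        table[i - 1][j] + gap,       # gap in b
                        table[i][j - 1] + gap)       # gap in a
            if local:
                value = max(value, 0)  # a local alignment may restart anywhere
            table[i][j] = value
            best = max(best, value)
    return best if local else table[-1][-1]


print(align_score("TTTACGTTTT", "GGGACGGGG"))              # global score
print(align_score("TTTACGTTTT", "GGGACGGGG", local=True))  # local score
```

The local score only rewards the shared region (here the three letters ACG), while the global score also pays for all the mismatches around it.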


Searching databases with BLAST

Improvement of speed compared with a full local alignment algorithm (dynamic programming):

- The initial search is for short words.
- Word hits are then extended in either direction.
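The word-hit idea can be sketched as follows. This is a simplified illustration and not the BLAST implementation: it assumes exact word matches only (protein BLAST also accepts similar "neighbourhood" words), simple match/mismatch scores, and ungapped extension that stops once the running score has dropped a fixed amount below its best value. The names `find_word_hits`, `extend_hit` and `xdrop` are hypothetical.

```python
def find_word_hits(query, subject, w=3):
    """Return (query_pos, subject_pos) pairs where the same w-letter word occurs."""
    index = {}
    for i in range(len(query) - w + 1):
        index.setdefault(query[i:i + w], []).append(i)
    hits = []
    for j in range(len(subject) - w + 1):
        for i in index.get(subject[j:j + w], []):
            hits.append((i, j))
    return hits


def extend_hit(query, subject, qi, sj, w=3, match=1, mismatch=-1, xdrop=2):
    """Extend a word hit without gaps in both directions.

    Returns (score, query_start, query_end) with query_end exclusive.
    """
    def walk(q, s, step):
        run = best = best_len = length = 0
        while 0 <= q < len(query) and 0 <= s < len(subject):
            run += match if query[q] == subject[s] else mismatch
            length += 1
            if run > best:
                best, best_len = run, length
            elif best - run >= xdrop:
                break  # score dropped too far below the best seen so far
            q += step
            s += step
        return best, best_len

    right_score, right_len = walk(qi + w, sj + w, +1)
    left_score, left_len = walk(qi - 1, sj - 1, -1)
    score = w * match + left_score + right_score
    return score, qi - left_len, qi + w + right_len


query, subject = "MKVLAAGIT", "QQKVLAAGQQ"
qi, sj = find_word_hits(query, subject)[0]
score, start, end = extend_hit(query, subject, qi, sj)
print(score, query[start:end])
```

Only database sequences that contain a word hit are examined further, which is where the speed-up over full dynamic programming comes from.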


Output from Blast

BLASTP 2.0.11 [Jan-20-2000]

Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402.

Query= ramp4.seq (75 letters)

Database: nr 457,798 sequences; 140,871,481 total letters

Searching..................................................done

                                                                    Score     E
Sequences producing significant alignments:                         (bits)  Value

gi|4585827|emb|CAB40910.1| (AJ238236) ribosome associated membr...    126   2e-29
gi|3851666 (AF100470) ribosome attached membrane protein 4 [Rat...    126   2e-29
gi|3877972|emb|CAB03157.1| (Z81095) predicted using Genefinder;...     74   1e-13
gi|3935169 (AC004557) F17L21.12 [Arabidopsis thaliana]                 46   3e-05
gi|3935171 (AC004557) F17L21.14 [Arabidopsis thaliana]                 36   0.048
gi|5921764|sp|O13394|CHS5_USTMA CHITIN SYNTHASE 5 (CHITIN-UDP A...     29   3.6
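How the bit score and the E value in this list relate can be sketched with the standard approximation E = m * n * 2^(-S'), where S' is the bit score, m the query length and n the total database length. This is a rough sketch only: BLAST additionally corrects m and n for edge effects ("effective lengths"), which the function below ignores, so it reproduces the reported values only to within about an order of magnitude. The function name `expect_value` is made up for the example.

```python
def expect_value(bit_score, query_len, db_len):
    """Approximate E value from a bit score: E = m * n * 2**(-S').

    Assumption: raw lengths are used; BLAST's effective-length
    correction is not applied.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


# Query length and database size taken from the report above.
for bits in (126, 74.1, 46, 36, 29):
    print(bits, f"{expect_value(bits, 75, 140_871_481):.1e}")
```

Each additional bit of score halves the expected number of chance hits, which is why a hit at 126 bits is overwhelmingly significant while one at 29 bits is not.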


>gi|3877972|emb|CAB03157.1| (Z81095) predicted using Genefinder; cDNA EST
 EMBL:D71338 comes from this gene; cDNA EST EMBL:D74010 comes from this gene;
 cDNA EST EMBL:D74852 comes from this gene; cDNA EST EMBL:C07354 comes from
 this gene; cDNA EST EMBL:C0...
 Length = 65

 Score = 74.1 bits (179), Expect = 1e-13
 Identities = 33/61 (54%), Positives = 48/61 (78%), Gaps = 1/61 (1%)

Query: 14 QRIRMANEKHSKNITQRGNVAKTSRNAPEEKASVGPWLLALFIFVVCGSAIFQIIQSIRM 73
          QR+ +AN++ SKN+  RGNVAK+ + A E+K    PWL+ LF+FVVCGSA+F+II+ ++M
Sbjct: 5  QRMTLANKQFSKNVNNRGNVAKSLKPA-EDKYPAAPWLIGLFVFVVCGSAVFEIIRYVKM 63

Query: 74 G 74
          G
Sbjct: 64 G 64
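The "Identities" and "Gaps" figures in this report can be recomputed directly from the two aligned rows. A minimal sketch follows; the function name `alignment_stats` is made up for the example, and "Positives" is left out because it depends on the substitution matrix used in the search.

```python
def alignment_stats(query_row, subject_row):
    """Count identical columns and gap columns in a pairwise alignment."""
    columns = list(zip(query_row, subject_row))
    identities = sum(q == s and q != "-" for q, s in columns)
    gaps = sum(q == "-" or s == "-" for q, s in columns)
    return identities, gaps, len(columns)


# Both rows of the alignment above, including the final G.
query_row = "QRIRMANEKHSKNITQRGNVAKTSRNAPEEKASVGPWLLALFIFVVCGSAIFQIIQSIRMG"
subject_row = "QRMTLANKQFSKNVNNRGNVAKSLKPA-EDKYPAAPWLIGLFVFVVCGSAVFEIIRYVKMG"

identities, gaps, length = alignment_stats(query_row, subject_row)
print(f"Identities = {identities}/{length} ({100 * identities // length}%)")
print(f"Gaps = {gaps}/{length} ({100 * gaps // length}%)")
```

Integer division is used for the percentages because it reproduces the rounding seen in the report.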


Evolution of protein genes

ATGGCAAAACTTGAAAAACTGAATCAAGCAGGCCTGATGGTCGCTGGT
M  A  K  L  E  K  L  N  Q  A  G  L  M  V  A  G

ATGGCTAGGTTGGAGAAGAUAAACCAAGCTGGGATAATAGTTGCAGGA    60%
M  V  R  L  E  K  I  N  Q  A  G  L  L  V  A  G      69%

M  V  R  I  Q  K  I  N  E  K  G  A  L  L  A  G      38%

Q  V  R  I  Q  K  I  Y  E  K  G  A  L  L  A  A      19% (‘twilight zone’)

Q  V  R  I  Q  K  I  Y  E  K  T  A  L  L  F  A       6% (‘midnight zone’)

(Percentages: sequence identity to the original sequence at the top.)


Blast report

Sequences producing significant alignments: (bits) Value

pir||F69494 (R)-hydroxyglutaryl-CoA dehydratase activator (hgdC)...  462  e-129
gb|AAD31675.1| (AF123384) (R)-2-hydroxyglutaryl-CoA dehydratase ...  233  1e-060
sp|P39383|YJIL_ECOLI HYPOTHETICAL 27.4 KD PROTEIN IN IADA-MCRD I...  184  9e-046
emb|CAA67409.1| (X98916) orf6 [Methanopyrus kandleri]  170  1e-041
gb|AAF13150.1|AF156260_1 (AF156260) unknown [Methanosarcina bark...  143  2e-033
pir||A69117 activator of (R)-2-hydroxyglutaryl-CoA - Methanobact...  132  4e-030
pir||A72369 (R)-2-hydroxyglutaryl-CoA dehydratase activator-rela...  129  4e-029
gb|AAC23928.1| (U75363) benzoyl-CoA reductase subunit [Rhodopseu...  117  1e-025
pir||S04476 hypothetical protein (hdgA 5' region) - Acidaminococ...  104  1e-021
sp|P27542|DNAK_CHLPN DNAK PROTEIN (HEAT SHOCK PROTEIN 70) (HSP70...  42  0.005
gb|AAC15473.1| (AF016711) heat shock protein 70 [Burkholderia ps...  39  0.036
pir||F75029 o-sialoglycoprotein endopeptidase (gcp) PAB1159 - Py...  38  0.082
pir||F72514 probable glucokinase APE2091 - Aeropyrum pernix (str...  37  0.18
sp|P42373|DNAK_BURCE DNAK PROTEIN (HEAT SHOCK PROTEIN 70) (HSP70...  37  0.18
emb|CAA10035.1| (AJ012470) mitochondrial-type hsp70 [Encephalito...  36  0.31
sp|P56836|DNAK_CHLMU DNAK PROTEIN (HEAT SHOCK PROTEIN 70) (HSP70...  36  0.41
gb|AAF39496.1| (AE002336) dnaK protein [Chlamydia muridarum]  36  0.41
pir||B70189 rod shape-determining protein (mreB-1) homolog - Lym...  36  0.41
sp|O57716|GCP_PYRHO PUTATIVE O-SIALOGLYCOPROTEIN ENDOPEPTIDASE (...  36  0.54
sp|O33522|DNAK_ALCEU DNAK PROTEIN (HEAT SHOCK PROTEIN 70) (HSP70...  36  0.54
ref|NP_012874.1| Ykl050cp >gi|549677|sp|P35736|YKF0_YEAST HYPOTH...  36  0.54
emb|CAA53420.1| (X75781) D513 [Saccharomyces cerevisiae] >gi|158...  36  0.54
sp|P30722|DNAK_PAVLU DNAK PROTEIN (HEAT SHOCK PROTEIN 70) >gi|99...  36  0.54
pir||A40158 dnaK-type molecular chaperone - Chlamydia trachomati...  34  1.2
gb|AAF07742.1|AE001584_39 (AE001584) hypothetical protein [Borre...  34  1.6
gb|AAF07521.1|AE001577_35 (AE001577) hypothetical protein [Borre...  34  1.6
gb|AAF38963.1| (AE002276) cell shape-determining protein MreB [C...  34  2.1
gb|AAG08147.1|AE004889_10 (AE004889) DnaK protein [Pseudomonas a...  33  2.7
dbj|BAB03215.1| (AB017035) dnaK [Bacillus thermoglucosidasius]  33  2.7
sp|P43736|DNAK_HAEIN DNAK PROTEIN (HEAT SHOCK PROTEIN 70) (HSP70...  33  2.7
sp|P45554|DNAK_STAAU DNAK PROTEIN (HEAT SHOCK PROTEIN 70) (HSP70...  33  2.7
sp|Q58303|FLA3_METJA FLAGELLIN B3 PRECURSOR  32  4.7
gb|AAG08239.1|AE004898_10 (AE004898) phosphoribosylaminoimidazol...  32  6.1


1 MSAAPVQDKDTLSNAERAKNVNGLLQVLMDINTLNGGSSDTADKIRIHAKNFEAALFAKS 60

61 SSKKEYMDSMNEKVAVMRNTYNTRKNAVTAAAANNNIKPVEQHHINNLKNSGNSANNMNV 120

121 NMNLNPQMFLNQQAQARQQVAQQLRNQQQQQQQQQQQQRRQLTPQQQQLVNQMKVAPIPK 180

181 QLLQRIPNIPPNINTWQQVTALAQQKLLTPQDMEAAKEVYKIHQQLLFKARLQQQQAQAQ 240

241 AQANNNNNGLPQNGNINNNINIPQQQQMQPPNSSANNNPLQQQSSQNTVPNVLNQINQIF 300

301 SPEEQRSLLQEAIETCKNFEKTQLGSTMTEPVKQSFIRKYINQKALRKIQALRDVKNNNN 360

361 ANNNGSNLQRAQNVPMNIIQQQQQQNTNNNDTIATSATPNAAAFSQQQNASSKLYQ

Low complexity sequence tends to
1) increase the number of non-specific hits to database sequences
2) correspond to regions in proteins not associated with a known biological function (typically unstructured parts of the protein)

Therefore, low complexity sequence is filtered out by default in BLAST searches
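What "low complexity" means can be illustrated with a sliding-window entropy filter. This is a simplified stand-in and not the filter BLAST actually uses (BLAST relies on dedicated programs such as SEG for proteins). The window size, the threshold and the function names are arbitrary choices for the example.

```python
import math
from collections import Counter


def window_entropy(window):
    """Shannon entropy (bits) of the residue composition of a window."""
    n = len(window)
    return -sum(c / n * math.log2(c / n) for c in Counter(window).values())


def mask_low_complexity(seq, size=12, threshold=2.2):
    """Replace every window whose entropy is below the threshold with X."""
    masked = list(seq)
    for start in range(len(seq) - size + 1):
        if window_entropy(seq[start:start + size]) < threshold:
            masked[start:start + size] = "X" * size
    return "".join(masked)


# Residues 121-180 of the example sequence above (glutamine-rich region).
segment = "NMNLNPQMFLNQQAQARQQVAQQLRNQQQQQQQQQQQQRRQLTPQQQQLVNQMKVAPIPK"
print(mask_low_complexity(segment))
```

Masked residues (X) no longer contribute word hits, so a run of glutamines in the query cannot pull in every other glutamine-rich protein in the database.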


Program    Query              Database

blastp     Protein            Protein
blastn     DNA                DNA
tblastn    Protein            DNA (translated)
blastx     DNA (translated)   Protein
tblastx    DNA (translated)   DNA (translated)


Databases at NCBI available for BLAST searches

Protein sequence databases

nr All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF

swissprot the last major release of SWISS-PROT

DNA sequence databases

nr All non-redundant GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS, GSS, or phase 0, 1 or 2 HTGS sequences)

dbest Non-redundant database of GenBank+EMBL+DDBJ EST divisions

dbsts Non-redundant database of GenBank+EMBL+DDBJ STS divisions

htgs Unfinished High Throughput Genomic Sequences


Rules of database searches (like BLAST)

- Database sequence searches involving proteins should be carried out at the protein level and not at the DNA level (*)
- Use the smallest possible database (not too small though)
- Sequence statistics should be used rather than percent identity/similarity as the criterion for homology
- Consider different scoring matrices and gap penalties

(*) Reasons:

1) DNA sequences encoding the same protein sequence can be very different, due to the degeneracy of the genetic code.

TTTCGATTCTCAACAAGAAGC
**  * **    **  *   *
TTCAGGTTTAGCACGCGGTCC
 F  R  F  S  T  R  S

2) Amino acid substitution matrices may be taken into account.
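The degeneracy argument can be checked with a short script that translates both DNA sequences from the example and compares them at the DNA and the protein level. The codon table is the standard genetic code written in the usual TCAG order; the function names are made up for the example.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna):
    """Translate a DNA sequence in reading frame 1."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def identical_positions(a, b):
    """Number of positions at which two equally long sequences agree."""
    return sum(x == y for x, y in zip(a, b))


dna1 = "TTTCGATTCTCAACAAGAAGC"
dna2 = "TTCAGGTTTAGCACGCGGTCC"
print(translate(dna1), translate(dna2))
print(identical_positions(dna1, dna2), "of", len(dna1), "nucleotides identical")
```

Both sequences translate to the same seven residues, although fewer than half of the nucleotides are identical, which is why a protein-level search detects the relationship more easily.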


Multiple alignments - applications

- Identify conserved motifs - patterns (PROSITE)
- Profiles (Pfam)
- Phylogenetic studies
- Prediction of protein secondary structure
- Experimental: design of probes


Example of experimental application of MSA.
Probes may be designed for PCR to amplify a region of DNA.

GCCCAAATTACGGGGACAACGGATCTTGGGTTATC......CGGGACGGG
GCCCAAATTACGGGGTACTTACGCGGGGACTTTAT......CGGGACGGG
GCCCAAATTACGGGGACGGACTTAGC...............CGGGACGGG
GCCCAAATTACGGGGCGAGTCTATCTTTTACTTATCTTT..CGGGACGGG
GCCCAAATTACGGGGCGGACTTTACTTATCTTTTTCTTT..CGGGACGGG
GCCCAAATTACGGGGACGGACGGCGATCGAGCGATCG....CGGGACGGG
GCCCAAATTACGGGGACGACGTACGTGAGCC..........CGGGACGGG
GCCCAAATTACGGGGACAATTTATCTATCTTTATC......CGGGACGGG
GCCCAAATTACGGGGACAACGATCGTGACTGACTG......CGGGACGGG
GCCCAAATTACGGGGACAATACGGGACTTATCGGGCTTCC.CGGGACGGG
GCCCAAATTACGGGGCGGAGCGGAGCGAGCGGGACGGGCG.CGGGACGGG
GCCCAAATTACGGGGACGAGCGGCATCTACTTCGCGCTA..CGGGACGGG
GCCCAAATTACGGGGAAAACAATTCTATCTTTATCGCAAAACGGGACGGG
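Finding the flanking regions that are suitable as probe sites amounts to finding runs of fully conserved alignment columns. A minimal sketch, using three rows of the alignment above to keep it short; the function name `conserved_blocks` and the minimum block length are arbitrary choices, and real primer design would also consider melting temperature, strand orientation and so on.

```python
def conserved_blocks(rows, min_len=8):
    """Return (start_column, sequence) for runs of fully conserved columns."""
    width = len(rows[0])
    conserved = [len({row[i] for row in rows}) == 1 and rows[0][i] not in ".-"
                 for i in range(width)]
    blocks, start = [], None
    for i, ok in enumerate(conserved + [False]):  # sentinel closes the last run
        if ok and start is None:
            start = i
        elif not ok and start is not None:
            if i - start >= min_len:
                blocks.append((start, rows[0][start:i]))
            start = None
    return blocks


rows = [
    "GCCCAAATTACGGGGACAACGGATCTTGGGTTATC......CGGGACGGG",
    "GCCCAAATTACGGGGCGAGTCTATCTTTTACTTATCTTT..CGGGACGGG",
    "GCCCAAATTACGGGGAAAACAATTCTATCTTTATCGCAAAACGGGACGGG",
]
for start, block in conserved_blocks(rows):
    print(start, block)
```

The two conserved blocks at the ends are candidate probe sites for amplifying the variable region between them.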


Multiple sequence alignment

PILEUP

PileUp does a series of progressive, pairwise alignments between sequences and clusters of sequences to generate the final multiple alignment. A cluster consists of two or more already-aligned sequences.

PileUp begins by doing pairwise alignments that score the similarity between every possible pair of sequences. These similarity scores are used to create a clustering order that can be represented as a dendrogram. The clustering strategy represented by the dendrogram is called UPGMA, which stands for unweighted pair-group method using arithmetic averages (Sneath, P.H.A. and Sokal, R.R. (1973) in Numerical Taxonomy (pp. 230-234), W.H. Freeman and Company, San Francisco, California, USA).

The dendrogram shows the order of the pairwise alignments of sequences and clusters of sequences that together generate the final alignment. For example:

[Figure: dendrogram for five sequences, Seq1 through Seq5]


PileUp aligns the two most related sequences to each other in order to produce the first cluster. It then aligns the next most related sequence to this cluster or the next two most-related sequences to each other in order to produce another cluster. A series of such pairwise alignments that includes increasingly dissimilar sequences and clusters of sequences at each iteration produces the final alignment.

In the above example, Seq1 and Seq2 are aligned first. Next, Seq3 and Seq4 are aligned. The cluster of Seq1-aligned-to-Seq2 is then aligned to the cluster of Seq3-aligned-to-Seq4. Finally, Seq5 is aligned to the cluster that now contains Seq1 through Seq4 to generate the final alignment of Seq1 through Seq5.
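The clustering order described here can be sketched with a small UPGMA-style routine that repeatedly merges the two most similar clusters, using the arithmetic average of the pairwise similarity scores between their members. This is an illustration only, not PileUp's code. The similarity values in the matrix are invented so that the merge order matches the Seq1 to Seq5 example, and the function name `upgma_order` is hypothetical.

```python
from itertools import combinations


def upgma_order(names, similarity):
    """Return the list of merges (pairs of clusters) in clustering order."""
    clusters = [(i,) for i in range(len(names))]
    merges = []

    def average(c1, c2):
        # Arithmetic average over all member pairs of the two clusters.
        return sum(similarity[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        a, b = max(combinations(clusters, 2), key=lambda pair: average(*pair))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)
        merges.append((tuple(names[i] for i in a), tuple(names[i] for i in b)))
    return merges


names = ["Seq1", "Seq2", "Seq3", "Seq4", "Seq5"]
similarity = [  # hypothetical pairwise similarity scores
    [0, 90, 60, 58, 30],
    [90, 0, 62, 59, 31],
    [60, 62, 0, 85, 29],
    [58, 59, 85, 0, 28],
    [30, 31, 29, 28, 0],
]
for left, right in upgma_order(names, similarity):
    print(left, "+", right)
```

Each merge corresponds to one pairwise alignment in the progressive procedure, so the printed order is the order in which the alignments would be carried out.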

Each pairwise alignment in PileUp uses the method of Needleman and Wunsch (Journal of Molecular Biology 48; 443-453 (1970)), that is extended for use with clusters of aligned sequences rather than only individual sequences. For a pairwise alignment of individual sequences, the comparison score between any two sequence symbols is found in a scoring matrix. For a pairwise alignment of clusters of sequences, the comparison score between any two positions in those clusters is simply the arithmetic average of the scores for all possible symbol comparisons at those positions. When gaps are inserted into a cluster to produce an alignment, they are inserted at the same position in all of the sequences of the cluster.
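The averaging rule for comparing positions in two clusters can be written out in a few lines. The sketch below assumes a toy identity score (1 for identical symbols, 0 otherwise) in place of a real scoring matrix; `column_score` and `identity_score` are names made up for the example.

```python
def column_score(column_a, column_b, score):
    """Arithmetic average of the scores for all symbol pairs of two columns."""
    pairs = [(x, y) for x in column_a for y in column_b]
    return sum(score(x, y) for x, y in pairs) / len(pairs)


def identity_score(x, y):
    """Toy scoring function standing in for a scoring matrix lookup."""
    return 1.0 if x == y else 0.0


# One column from a cluster of two sequences against one column of another.
print(column_score("AA", "AG", identity_score))
```

With single sequences each column holds one symbol, and the rule reduces to an ordinary scoring matrix lookup.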


CLUSTAL

ClustalW creates a multiple alignment using methods reminiscent of those of PileUp. One difference between the programs: during the multiple alignment, terminal gaps are penalised in Clustal but not in PileUp. This tends to make the PileUp alignments better when the sequences are of very different lengths (it has no effect if there are no large terminal gaps).

CLUSTALW (W = weighting: different weights are given to sequences and to parameters at different positions in the alignment) ftp://ftp.sunet.se/pub/molbio/align/clustal

See documentation in file clustalv.doc


Multiple alignment software

Pileup (GCG)

Clustalw / Clustalx

MSA (program that in principle finds the true optimal multiple alignment by the dynamic programming method)

T-coffee

Multiple alignment editors/viewers

SeqLab (GCG)
MACAW (search for motifs, blocks)
Jalview
CINEMA
Genedoc
Bioedit
Boxshade
Mview


Clustalx

njplot


Multiple sequence alignment formatting Jalview


Multiple sequence alignment formatting Boxshade

hth01 TPRQKVAIIYDVGVSTLYKRFP
hth02 IPRKQVAIIYDVAVSTLYKKFP
hth03 HPRQQLAIIFGIGVSTLYRYFP
hth04 GSKTKLAQAAGIRLASLYSWKG
hth05 TTFKQIALESGLSTGTISSFIN
hth06 IPYQEFAKLIGKSTGAVRRMID
hth07 VTLQQFAELEGVSERTAYRWTT
hth08 FTYNQYAQMMNISRENAYGVLA
hth09 LGASHISKTMNIARSTYVKVIN
hth10 TGATEIAHQLSIARSTVYKILE
hth11 ISISAIAREFNTTRQTILRVKA
hth12 GNISALADAENISRKIITRCIN
hth13 MVLADIAQAVEMHESTISRVTT
hth14 LVLHDIAEAVGMHESTISRVTT
hth15 LNLRIVADAIKMHESTVSRVTS
hth16 MTRGDIGNYLGLTVETISRLLG
hth17 LSLSALSRQFGYAPTTLANALE
hth18 MSLAELGRSNGLSSSTLKNALD
hth19 FDIASVAQHVCLSPSRLSHLFR
hth20 LRIDEVARHVCLSPSRLAHLFR
hth21 VTLEALADQVAMSPFHLHRLFK
hth22 VLYPDIAKKFNTTASRVERAIR


Multiple sequence alignment formatting Mview


Sequence editors

SeqLab / GCG


Genedoc


Profiles: example with the HTH (helix-turn-helix) motif

TPRQKVAIIY DVGVSTLYKR FP
IPRKQVAIIY DVAVSTLYKK FP
HPRQQLAIIF GIGVSTLYRY FP
GSKTKLAQAA GIRLASLYSW KG
TTFKQIALES GLSTGTISSF IN
IPYQEFAKLI GKSTGAVRRM ID
VTLQQFAELE GVSERTAYRW TT
FTYNQYAQMM NISRENAYGV LA
LGASHISKTM NIARSTYVKV IN
TGATEIAHQL SIARSTVYKI LE
ISISAIAREF NTTRQTILRV KA
GNISALADAE NISRKIITRC IN
MVLADIAQAV EMHESTISRV TT
LVLHDIAEAV GMHESTISRV TT
LNLRIVADAI KMHESTVSRV TS
MTRGDIGNYL GLTVETISRL LG
LSLSALSRQF GYAPTTLANA LE
MSLAELGRSN GLSSSTLKNA LD
FDIASVAQHV CLSPSRLSHL FR
LRIDEVARHV CLSPSRLAHL FR
VTLEALADQV AMSPFHLHRL FK
VLYPDIAKKF NTTASRVERA IR


Profile based on HTH motif alignment
(one row per alignment position, one column per residue)

    A    B    C    D    E    F    G    H    I    K    L    M    N    P    Q    R    S    T    V    W    X    Y    Z
  -25  -89  -44  -89  -74   -6  -79  -65   31  -63   33   31  -66  -79  -57  -66  -41   -9   28  -62  -40  -27  -74
  -11  -23  -62  -23  -27  -81  -29  -51  -56  -27  -59  -41    0    3  -27  -39   19   29  -37  -97  -34  -72  -27
  -25 -102  -52 -102  -68   -7 -103  -53   18  -38   36   14  -72  -89  -46  -24  -50  -37    3  -62  -42   -3  -68
    2   -9  -67   -9   13  -82  -28  -18  -78   10  -69  -38    6  -27   26   -3   29    0  -57  -90  -25  -57   13
   -7   28  -82   28   51  -85  -48  -16  -76   11  -74  -42   -8  -38   44  -10    7  -26  -55  -97  -23  -58   51
  -32 -116  -40 -116  -97   14 -130  -93   96  -87   72   39 -105 -101  -86  -95  -71  -32   77  -78  -49  -13  -97
  198 -100  -12 -100  -54 -112   17 -106  -66  -54  -66  -60  -94  -57  -54  -57   60   -3  -15 -161  -49 -112  -54
  -40   -3  -89   -3   22  -78  -70  -14  -60   21  -54  -28   -3  -53   31   20  -18  -33  -52  -95  -29  -52   22
    6  -49  -52  -49  -12  -46  -64  -20  -24  -21  -19   -8  -39  -54    0  -27   -7  -18  -13  -76  -29  -27  -12
  -25  -75  -52  -75  -50   19  -88  -50   24  -56   11   17  -64  -81  -47  -68  -40  -28   24  -52  -36    8  -50
  -17   -1  -65   -1  -30 -100   78  -41 -111  -36 -113  -82   40  -65  -38  -46   12  -39  -91  -97  -43  -89  -30
  -28 -100  -38 -100  -78  -10 -114  -82   65  -58   60   54  -81  -82  -56  -67  -50  -13   56  -71  -41  -27  -78
   30  -27  -47  -27  -16  -70   -3   -7  -71  -18  -70  -45    8  -44  -15  -25   68   22  -58  -94  -27  -55  -16
  -14  -39  -78  -39    4  -84  -70  -45  -46   -4  -48  -29  -36   -1   -5    0  -19   -6  -25 -102  -34  -65    4
   19  -10  -61  -10    6  -74   -4  -38  -83    1  -79  -44   16  -46    3  -25   93   15  -73 -102  -26  -70    6
   -4  -46  -55  -46  -36  -85  -77  -56  -50  -28  -49  -41    4  -52  -31  -15   30  142  -20  -95  -33  -76  -36
  -17 -121  -34 -121  -94   -8 -124  -98   84  -79   84   43 -103  -96  -77  -85  -63  -28   68  -85  -48  -27  -94
  -10  -55  -56  -55  -26   -9  -59    6  -45  -23  -42  -30  -26  -64  -16  -34   11  -15  -38  -40  -28   45  -26
  -36  -49 -105  -49    3 -102  -51   14 -109   66  -83  -46   23  -65   24  110   -9  -33 -102 -110  -35  -64    3
   -8  -99  -26  -99  -69  -14  -87  -73   24  -52   21   17  -86  -79  -53  -61  -48  -25   30  -35  -41  -16  -69
  -38  -93  -50  -93  -79   37 -109  -72   32  -60   35   14  -76  -94  -71  -68  -44   -7   16  -50  -43   -5  -79
  -15    0  -79    0    4  -92  -21  -31  -84   -2  -81  -56   16   -8   -6   -4    9    2  -65 -101  -31  -75    4
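How a profile is derived from an alignment and then used to score a sequence can be sketched as below. This is a deliberately simplified scheme and not the method used to compute the table above: it assumes log-odds scores from column frequencies with a fixed pseudocount and a uniform background, and it ignores gaps and sequence weighting. Five of the HTH sequences are used as input; all function names are made up for the example.

```python
import math

AMINO = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudocount=0.5):
    """One dict of log-odds scores per alignment column.

    Assumptions: uniform background frequencies, fixed pseudocount.
    """
    n = len(alignment)
    background = 1 / len(AMINO)
    profile = []
    for column in zip(*alignment):
        scores = {}
        for aa in AMINO:
            freq = (column.count(aa) + pseudocount) / (n + pseudocount * len(AMINO))
            scores[aa] = math.log2(freq / background)
        profile.append(scores)
    return profile


def score_sequence(profile, seq):
    """Sum of the position-specific scores for an ungapped sequence."""
    return sum(column[aa] for column, aa in zip(profile, seq))


hth = [
    "GNISALADAENISRKIITRCIN",  # hth12
    "MVLADIAQAVEMHESTISRVTT",  # hth13
    "LVLHDIAEAVGMHESTISRVTT",  # hth14
    "LNLRIVADAIKMHESTVSRVTS",  # hth15
    "MTRGDIGNYLGLTVETISRLLG",  # hth16
]
profile = build_profile(hth)
print(round(score_sequence(profile, "MVLADIAQAVEMHESTISRVTT"), 1))
print(round(score_sequence(profile, "GGGGGGGGGGGGGGGGGGGGGG"), 1))
```

A sequence that fits the motif gets a high total score, while an unrelated sequence scores low, because each position rewards the residues that are frequent in that column of the alignment.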


Profilemake: Makes a profile from a multiple sequence alignment.

Profilesearch: Compares a profile to a sequence database. Finds the sequences that best fit the profile.

Profilescan: Compares a sequence to a library of profiles. Finds the profile that best fits the sequence.

Profilegap: Compares a sequence and a profile, producing a sequence-profile alignment.

Profilesegments: Aligns a profile of the sequences found by Profilesearch.

MEME Finds conserved motifs in a group of unaligned sequences. MEME saves these motifs as a set of profiles. You can search a database of sequences with these profiles using the MotifSearch program.


PSIBLAST

PSI-BLAST is an important tool to identify remote protein similarity. It proceeds by way of the following steps:

(1) PSI-BLAST takes as input a single protein sequence and compares it to a protein database, using the gapped BLAST program.

(2) The program constructs a multiple alignment, and then a profile, from any significant local alignments found. The original query sequence serves as a template for the multiple alignment and profile, whose lengths are identical to that of the query.

(3) The profile is compared to the protein database, again seeking local alignments. After a few minor modifications, the BLAST algorithm can be used for this directly.

(4) PSI-BLAST estimates the statistical significance of the local alignments found. Because profile substitution scores are constructed to a fixed scale, and gap scores remain independent of position, the statistical theory and parameters for gapped BLAST alignments remain applicable to profile alignments.

(5) Finally, PSI-BLAST iterates, by returning to step (2), an arbitrary number of times or until convergence.

Profile-alignment statistics allow PSI-BLAST to proceed as a natural extension of BLAST; the results produced in iterative search steps are comparable to those produced from the first pass.

Advantage : Unlike most profile-based search methods, PSI-BLAST runs as one program, starting with a single protein sequence, and the intermediate steps of multiple alignment and profile construction are invisible to the user.
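The iteration in steps (1) to (5) can be illustrated with a toy loop. This sketch is not PSI-BLAST: it assumes short sequences of equal length, ungapped full-length scoring instead of local alignment, a simple log-odds profile with a fixed pseudocount, and a raw score threshold instead of E values. The sequences, the threshold and all function names are invented for the example.

```python
import math

AMINO = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(seqs, pseudocount=0.5):
    """Log-odds profile from equally long, ungapped sequences."""
    n = len(seqs)
    profile = []
    for column in zip(*seqs):
        profile.append({
            aa: math.log2((column.count(aa) + pseudocount)
                          / (n + pseudocount * len(AMINO)) * len(AMINO))
            for aa in AMINO
        })
    return profile


def score(profile, seq):
    return sum(column[aa] for column, aa in zip(profile, seq))


def iterative_search(query, database, threshold, max_rounds=5):
    """Search, rebuild the profile from the hits, and repeat until stable."""
    included = [query]
    for round_no in range(1, max_rounds + 1):
        profile = build_profile(included)
        hits = [s for s in database if score(profile, s) >= threshold]
        new_set = [query] + [s for s in hits if s != query]
        if set(new_set) == set(included):
            break  # convergence: no change in the set of included sequences
        included = new_set
    return included, round_no


query = "ACDEFGHI"
database = ["ACDEFGKL",   # close relative of the query
            "ACDEMNKL",   # remote relative, resembles the close relative
            "WWWWWWWW"]   # unrelated
found, rounds = iterative_search(query, database, threshold=7)
print(found, rounds)
```

In this toy data set the remote relative scores below the threshold against the query alone, but above it once the close relative has been added to the profile, which is the effect that lets PSI-BLAST detect remote similarity.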


PSI-BLAST

[Figure: 1st, 2nd and 3rd BLAST rounds. After each round, the hits above the threshold are used to build the profile for the next round.]


Psiblast tutorial http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/psi1.html
