Bioinformatics explained: HMMER
September 12, 2007

CLC bio
Gustav Wieds Vej 10
8000 Aarhus C
Denmark
Telephone: +45 70 22 55 09
Fax: +45 70 22 55 19
www.clcbio.com


Similarity searches

Database searching is widely used in bioinformatics, and there are several ways to search, for example, protein databases. Alignment algorithms such as BLAST [Altschul et al., 1990] and Smith-Waterman [Smith and Waterman, 1981] compare two sequences and score their similarity by assigning a single score to each substitution of one amino acid for another, using standard substitution matrices and gap penalties. These sequence-based pairwise comparisons calculate the similarity between two sequences in order to identify significant matches. When two sequences are similar at a significant level, this indicates shared biological properties such as a common evolutionary origin, similar molecular structure, and similar function.
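As a minimal sketch of the pairwise scoring described above, the following Python function computes a Smith-Waterman local alignment score. The match and mismatch scores and the linear gap penalty are illustrative assumptions standing in for a real substitution matrix and for the gap penalties used in practice; the heuristics used by BLAST are not shown.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score of a and b (score only, no traceback).

    match, mismatch and gap are toy values chosen for illustration; a real
    search would take substitution scores from a matrix instead.
    """
    prev = [0] * (len(b) + 1)  # scores for the previous row of the DP table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                  # start a new local alignment here
                prev[j - 1] + sub,  # align a[i-1] with b[j-1]
                prev[j] + gap,      # gap in b
                curr[j - 1] + gap,  # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("ACGT", "ACT"))  # prints 5 (AC-T aligned to ACGT)
```

Every substitution receives the same score regardless of where in the sequence it occurs, which is the limitation addressed in the following paragraphs.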

Because specific positions and specific amino acids do not necessarily show the same conservation patterns in different contexts, comparing protein sequences using standard substitution matrices is a rather simplistic way of searching for similarity. It may be better to search for family or domain similarity than for sequence similarity, using scores that reflect the amino acid frequencies observed at each position across many sequences in a domain, rather than standard substitution scores that reflect only the replacement of one amino acid with another, one position at a time along the sequences searched.

Profile hidden Markov models (profile HMMs)

"A hidden Markov model describes a probability distribution over a potentially infinite number of sequences" [Eddy, 1998]. An HMM can thus be viewed as a model that generates sequences.

Profile HMMs improve the search for distantly related sequences by turning a multiple sequence alignment into a probability-based, position-specific scoring system [Eddy, 1998].

A profile HMM contains match, insert and delete states, which are used to model a sequence family. The states in the model have probability distributions, and each transition has a probability. Thus an amino acid that is common at a particular position in the multiple sequence alignment receives a higher score at that position. Scores can also be assigned to insertions and deletions at specific positions. A sequence is compared to the model by assigning its residues to the states of the HMM. The resulting score reflects the probability that the sequence is related to the given model, and this probability is used to compute an E-value for the match.
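The position-specific scoring idea can be illustrated with a minimal sketch in Python. It covers only the match-state emission scores of a profile, built from a hypothetical ungapped toy alignment with a uniform background and a simple pseudocount (all assumptions made for illustration). Insert and delete states, transition probabilities, and the way HMMER actually estimates its parameters are not modeled.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_scores(column, pseudocount=1.0):
    """Log-odds score (in bits) of every amino acid for one alignment column."""
    counts = Counter(res for res in column if res in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background frequency
    return {
        aa: math.log2((counts[aa] + pseudocount) / total / background)
        for aa in AMINO_ACIDS
    }


def build_profile(alignment):
    """One score table per column of an ungapped alignment."""
    return [column_scores(col) for col in zip(*alignment)]


def score_sequence(profile, sequence):
    """Sum the position-specific scores of a sequence of the same length."""
    return sum(col[res] for col, res in zip(profile, sequence))


# Hypothetical toy alignment: column 1 is fully conserved, column 3 is variable.
toy_alignment = ["HGKKV", "HGKRV", "HAEKV", "HGQKI"]
profile = build_profile(toy_alignment)

print(round(score_sequence(profile, "HGKKV"), 2))  # family-like sequence
print(round(score_sequence(profile, "WWWWW"), 2))  # unrelated sequence
```

In contrast to a single substitution matrix, the score for a residue here depends on the column in which it occurs, so a histidine scores highly in the conserved first column and poorly elsewhere.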

HMMs were introduced to the field of computational biology in the late 1980s, and HMMs for use as profile models were introduced by Krogh et al. [Krogh et al., 1994] in the mid-1990s [Eddy, 1998]. Examples of the use of HMMs within the field of biology are gene finding, genetic linkage mapping and protein secondary structure prediction.

The idea of using profile HMMs for database searching is to compare a sequence to a statistical model describing a family or pattern of sequences, in contrast to a simple comparison of the individual amino acids of two sequences. Comparing a sequence to a statistical model provides some extra information. For instance:

• some sites may be conserved for specific residues while other sites show considerable variation

• some sites may be deleted without affecting functionality while other sites may not be deleted without affecting functionality

• insertions may be acceptable at some sites while insertions may not be acceptable at other sites

Building on this information, it may be easier to see whether a sequence and a specific family are related. Distant relationships between sequences are also more likely to be identified when using statistical models instead of standard substitution matrices.

Pfam database

Profile HMM libraries are needed to search a query sequence for known domains and to assess how the sequence relates to a protein family sharing, for example, functionality. One of the most comprehensive profile HMM libraries is the publicly available Pfam database (protein family database).

The Pfam database consists of a multiple alignment for each protein family, which has been used as the basis for building a profile HMM. Researchers at the Sanger Center have released this collection [Bateman et al., 2002], and the database currently represents 9318 protein families, covering 74% of proteins (July 2007) [PFAM, ].


Figure 1: A part of an alignment for the Globin family from the Pfam website

Pfam classifies its entries as families, domains, repeats and motifs. A family is the default class of proteins related to each other. Each family in Pfam is represented by a seed alignment, which contains a representative set of family members, and a full alignment containing all family members. Full family alignments contain up to 2500 sequences. Domains represent elements of structure or sequence that can be identified and are relevant in different protein contexts. Repeats and motifs describe short parts of sequence [Bateman et al., 2002].

The Pfam database comes in two variants, Pfam-A and Pfam-B. Pfam-A is a well-annotated database that is curated by hand and thus contains high-quality data. Pfam-B is an automatically generated database of lower quality, intended to incorporate domains not already represented in Pfam-A [PFAM, ].

Both databases come in two versions: a fragment database (fs), which allows partial matches to a domain to be found, e.g. a match to half a globin domain, and a full domain database (ls), which only allows matches to full domains. The full domain database is more specific than the fragment database and is based only on global HMMs [Bateman et al., 2002].

The Pfam database can be accessed from http://pfam.sanger.ac.uk (UK) or http://pfam.wustl.edu/ (US).

HMMER package

There are several software implementations using profile HMMs in computational biology, one of the most popular being HMMER [Eddy, 2003].

HMMER is a software implementation of profile HMMs for biological sequence analysis. A sequence is compared to a profile HMM by assigning the sequence residues to the states in the HMM, and the resulting score reflects the probability that the sequence is related to the given model. E-values for the match are derived from this probability.

The implementation of profile HMMs in the HMMER package contains programs for the construction and use of position-specific scoring matrices. HMMER was written by Sean Eddy and colleagues and was first released in 1995 [Eddy, 2003]. The HMMER package is available from http://hmmer.janelia.org.

Programs in HMMER

Currently, the HMMER package contains nine programs. Two of these are programs for database searching:

• hmmpfam Search an HMM database for matches to a query sequence.

• hmmsearch Search a sequence database for matches to a single profile HMM.

The other programs in the package are:

• hmmalign Align sequences to an existing model.

• hmmbuild Build a model from a multiple sequence alignment.

• hmmcalibrate Takes an HMM and empirically determines parameters that are used to make searches more sensitive, by calculating more accurate expectation value scores (E-values).

• hmmconvert Convert a model file into different formats, including a compact HMMER 2 binary format, and "best effort" emulation of GCG profiles.

• hmmemit Emit sequences probabilistically from a profile HMM.

• hmmfetch Get a single model from an HMM database.

• hmmindex Index an HMM database.

[Eddy, 2003]

When using the Pfam database, a researcher would normally only need the two search programs, since the database has already been built. Researchers seeking to construct their own profile HMMs should use the hmmalign, hmmbuild and hmmcalibrate programs.
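For readers building their own models, the usual order of steps can be sketched as a small Python helper that assembles the command lines. The file names (my_family.aln, my_family.hmm, seqdb.fasta) are hypothetical, and the argument order for hmmbuild (output HMM file followed by the alignment file) and hmmcalibrate (the HMM file) is an assumption based on HMMER 2 conventions. Check each program's -h output for the installed version before relying on it.

```python
import shutil
import subprocess


def build_and_search_commands(alignment, hmm_file, sequence_db):
    """Return command lines for: build a model, calibrate it, search with it.

    Argument order is assumed from HMMER 2 conventions, not verified here.
    """
    return [
        ["hmmbuild", hmm_file, alignment],     # build a profile HMM from an alignment
        ["hmmcalibrate", hmm_file],            # calibrate the model for E-value calculation
        ["hmmsearch", hmm_file, sequence_db],  # search a sequence database with the model
    ]


def run_pipeline(commands):
    """Run each command in order, stopping at the first failure."""
    for argv in commands:
        if shutil.which(argv[0]) is None:
            raise FileNotFoundError(f"{argv[0]} not found on PATH")
        subprocess.run(argv, check=True)


# Hypothetical file names, used only to show the resulting command lines.
commands = build_and_search_commands("my_family.aln", "my_family.hmm", "seqdb.fasta")
for argv in commands:
    print(" ".join(argv))
```

Only the command lines are printed here; calling run_pipeline(commands) would execute them if the HMMER programs are installed.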

Examples of HMMER usage

This section gives some examples of how to use the two database search programs, hmmpfam and hmmsearch.

The protein leghemoglobin is an oxygen-binding plant globin and a member of the globin family. The first example uses hmmpfam to show how leghemoglobin 1 from a bean (Swiss-Prot accession number P02232 lgb1_vicfa) is recognized as related to the family. hmmsearch, in contrast, shows whether any sequence in a given database matches an HMM, i.e. a protein family. In the second example, hmmsearch is used to identify members of the globin protein family among 1000 sequences from Swiss-Prot.

hmmpfam The command line version of hmmpfam has two required parameters: the first is the profile HMM database file and the second is a file with one or more sequences.

hmmpfam accepts a number of parameters, mainly for adjusting the cut-offs for the quality of matches to present.

Here is the example run (not all the output is shown, see the appendix for the full output):

localhost:~...hmmer% hmmpfam Pfam_fs.bin lgb1_vicfa.fasta
hmmpfam - search one or more sequences against HMM database
HMMER 2.3.2 (Oct 2003)
Copyright (C) 1992-2003 HHMI/Washington University School of Medicine
Freely distributed under the GNU General Public License (GPL)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
HMM file:                 Pfam_fs.bin
Sequence file:            lgb1_vicfa.fasta
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Query sequence: P02232|LGB1_VICFA
Accession:      [none]
Description:    Leghemoglobin-1 - Vicia faba (Broad bean)

Scores for sequence family classification (score includes all domains):
Model       Description                                Score   E-value   N
--------    -----------                                -----   ------- ---
Globin      Globin                                      75.5   2.6e-21   1
Herpes_UL42 DNA polymerase processivity factor (UL       1.3       7.8   1
PPTA        Protein prenyltransferase alpha subuni       2.8       8.2   1

...

Alignments of top-scoring domains:
Globin: domain 1 of 1, from 7 to 141: score 75.5, E = 2.6e-21

                   *->dkalvkasWgkvkgtdnreelGaealarlFkayPdtktyFpkfgdls
                      + alv++s k+ n + + +++ ++ ++P +k++F+ f l+
  P02232|LGB     7    QEALVNSSSQLFKQ--NPSNYSVLFYTIILQKAPTAKAMFS-F--LK 48

                      sadaikgspkfkaHgkkVlaalgeavkhLgnddddgnlkaalkkLaarHa
                      +++++ +spk+ aH++kV++++ + + +L+ ++ + k + H+
  P02232|LGB    49    DSAGVVDSPKLGAHAEKVFGMVRDSAVQLR----ATGEVVLDGKDGSIHI 94

                      erghvdpanFkllgeallIvvLaahlggeveftpevkaAWdkaldvvada
                      ++g v + +F +++eall ++++++ g+ ++++e aAW+ a+d +a+a
  P02232|LGB    95    QKG-VLDPHFVVVKEALL-KTIKEASGD--KWSEELSAAWEVAYDGLATA 140

                      l<-*
                      +
  P02232|LGB   141    I    141

...

Leghemoglobin 1 from a bean is the query sequence in the search for matches in the Pfam database.

The output starts with some general information, followed by the results for each sequence, one at a time (often there is just one sequence). The results show a table listing the profile HMMs matching the sequence, with information about the match quality. The E-value is similar to the E-value that BLAST reports, i.e. it is the number of matches of similar quality that is expected to occur in a random database of the same size.

You also get a table describing the individual domain matches, including information about which part of the sequence matches which part of the profile HMM. This table is shown in the full output in the appendix.

Finally, the individual domain alignments are presented. The profile HMM is represented by the top line in the alignment. Each letter represents the most probable residue of each state in the HMM used in the match. Uppercase letters mean that the residue is highly conserved in that state. The alignment of the top-scoring domain in the hmmpfam search is shown above.

The conclusion of this search is that the leghemoglobin is definitely a globin. Judging by the E-values, no other Pfam domain matches the sequence well (the globin model is the only match with an E-value below 1).

Notice that a binary version of the Pfam database is used. This reduces the time spent reading from disk, since the binary database format is more compact than the text format.

An HMM database file can be converted to binary using the hmmconvert program:

localhost:~...hmmer% hmmconvert -b Pfam_fs Pfam_fs.bin

hmmsearch Running the hmmsearch program is done in much the same way as hmmpfam. The big difference is that the HMM file should contain only a single profile HMM, and the sequence file is a (typically sizeable) database of sequences to be searched with the HMM. The sequence database is searched for matches to the HMM.

Again, the first argument is the HMM file and the second argument is the sequence database file. The example below compares a globin HMM with a sequence file of 1000 randomly chosen sequences from Swiss-Prot.

hmmsearch supports many of the same parameters as hmmpfam, mainly for adjusting the significance limits of the hits to be presented. In this example none of these parameters are set.

This is a summary of the hmmsearch output (see the appendix for the full version):

localhost:~...hmmer% hmmsearch globin.hmm uniprot_sprot_1000.fasta
hmmsearch - search a sequence database with a profile HMM
HMMER 2.3.2 (Oct 2003)
Copyright (C) 1992-2003 HHMI/Washington University School of Medicine
Freely distributed under the GNU General Public License (GPL)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
HMM file:                   globin.hmm [Globin]
Sequence database:          uniprot_sprot_1000.fasta
per-sequence score cutoff:  [none]
per-domain score cutoff:    [none]
per-sequence Eval cutoff:   <= 10
per-domain Eval cutoff:     [none]
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Query HMM:   Globin
Accession:   PF00042.12
Description: Globin
  [HMM has been calibrated; E-values are empirical estimates]

Scores for complete sequences (score includes all domains):
Sequence    Description                                Score   E-value   N
--------    -----------                                -----   ------- ---
HBAD_AEGMO  (P68059) Hemoglobin subunit alpha-D (Hemo  202.2   1.1e-58   1
HMP_PHOPR   (Q6LM37) Flavohemoprotein (Hemoglobin-lik   92.9   2.9e-27   1
HMP_BACHK   (Q6HLA6) Flavohemoprotein (Hemoglobin-lik   55.0   2.3e-16   1
GLBB_CERLA  (O76243) Body wall hemoglobin               22.9   4.1e-07   1
NORA_STAAU  (P0A0J7) Quinolone resistance protein nor    0.3       1.3   1
PRI1_HUMAN  (P49642) DNA primase small subunit (EC 2.   -0.4         2   1
HSP70_NEUCR (Q01233) Heat shock 70 kDa protein (HSP70   -1.4         4   1
SM1L2_HUMAN (Q8NDV3) Structural maintenance of chromo   -1.5       4.1   1
RPOC_FUSNN  (Q8RHI7) DNA-directed RNA polymerase beta   -1.8       5.1   1
LRC45_HUMAN (Q96CN5) Leucine-rich repeat-containing p   -1.8       5.2   1
IUNH_CRIFA  (Q27546) Inosine-uridine preferring nucle   -1.8       5.2   1
SPP1_YEAST  (Q03012) COMPASS component SPP1 (Complex    -2.1       6.2   1
ARGJ_CORDI  (P62059) Glutamate N-acetyltransferase (E   -2.1       6.4   1
BRSK1_HUMAN (Q8TDC3) BR serine/threonine-protein kina   -2.3       7.2   1
SPC34_YEAST (P36131) DASH complex subunit SPC34 (Oute   -2.5       8.2   1
THIO_CALJA  (Q9BDJ3) Thioredoxin                        -2.7       9.2   1
UGT55_CAEEL (Q22180) Putative UDP-glucuronosyltransfe   -2.7       9.4   1
ENV_HV2UC   (Q76638) Envelope glycoprotein gp160 prec   -2.7       9.5   1

...

Again, the output starts with some general information, followed by a table of sequences matching the HMM. This is followed by a table of the domain matches and the alignments of the actual matches (see the appendix).
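When many searches are run, it can be convenient to read the per-sequence table programmatically. The following Python sketch is one hedged way to do so: it assumes that each hit row ends with three whitespace-separated fields (score, E-value, number of domains), as in the table above, and the function name and the E-value threshold are illustrative choices rather than part of HMMER.

```python
from collections import namedtuple

Hit = namedtuple("Hit", "name description score evalue n_domains")


def parse_hit_rows(lines):
    """Parse rows of the 'Scores for complete sequences' table.

    Header, separator and other non-data lines are skipped because their
    last three fields do not convert to numbers.
    """
    hits = []
    for line in lines:
        parts = line.rsplit(None, 3)  # split off score, E-value and N from the right
        if len(parts) != 4:
            continue
        left, score, evalue, n_domains = parts
        try:
            score, evalue, n_domains = float(score), float(evalue), int(n_domains)
        except ValueError:
            continue
        name, _, description = left.partition(" ")
        hits.append(Hit(name, description.strip(), score, evalue, n_domains))
    return hits


# A few rows copied from the hmmsearch summary above.
sample = """\
Sequence Description Score E-value N
-------- ----------- ----- ------- ---
HBAD_AEGMO (P68059) Hemoglobin subunit alpha-D (Hemo 202.2 1.1e-58 1
HMP_PHOPR (Q6LM37) Flavohemoprotein (Hemoglobin-lik 92.9 2.9e-27 1
HMP_BACHK (Q6HLA6) Flavohemoprotein (Hemoglobin-lik 55.0 2.3e-16 1
GLBB_CERLA (O76243) Body wall hemoglobin 22.9 4.1e-07 1
NORA_STAAU (P0A0J7) Quinolone resistance protein nor 0.3 1.3 1
PRI1_HUMAN (P49642) DNA primase small subunit (EC 2. -0.4 2 1
""".splitlines()

hits = parse_hit_rows(sample)
convincing = [hit.name for hit in hits if hit.evalue < 1]
print(convincing)
```

Filtering on an E-value below 1 is only an example cut-off, chosen here because it matches the threshold discussed in the text.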

The conclusion of this hmmsearch is that four of the sequences are definitely globins (E-values of 4.1e-07 or lower), while the remaining sequence matches are not convincing, as they have been assigned E-values above 1.
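The E-values in the table can be related to the scores through the extreme value distribution (EVD) parameters reported in the full output in the appendix (mu = -9.7401, lambda = 0.6626, 1000 sequences searched). The following Python sketch assumes that the E-value is the number of sequences searched multiplied by the EVD tail probability P(S >= x) = 1 - exp(-exp(-lambda (x - mu))). It is an illustrative reconstruction, not code taken from HMMER, and because the reported scores are rounded to one decimal the results agree with the table only to within a few percent.

```python
import math


def evd_pvalue(score, mu, lam):
    """P(S >= score) under an extreme value (Gumbel) distribution."""
    # -expm1(-x) equals 1 - exp(-x) but stays accurate when x is tiny.
    return -math.expm1(-math.exp(-lam * (score - mu)))


def evalue(score, mu, lam, n_sequences):
    """Expected number of chance hits scoring at least `score` (assumed E = N * P)."""
    return n_sequences * evd_pvalue(score, mu, lam)


# Parameters copied from the hmmsearch output in the appendix.
MU, LAMBDA, N_SEQUENCES = -9.7401, 0.6626, 1000

for name, score in [("HBAD_AEGMO", 202.2), ("GLBB_CERLA", 22.9), ("NORA_STAAU", 0.3)]:
    e = evalue(score, MU, LAMBDA, N_SEQUENCES)
    print(f"{name:12s} score={score:6.1f}  E-value ~ {e:.2g}")
```

Under these assumptions a score difference of a few bits changes the E-value by roughly an order of magnitude, which is why the four globins stand out so clearly from the remaining hits.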

Implementations and accelerations of HMMER

The HMMER package is available from http://hmmer.janelia.org. As profile HMM algorithms are computationally demanding, different software and hardware implementations that accelerate HMM searches are also available. One example is the CLC Bioinformatics Cell, which accelerates the HMMER programs hmmpfam and hmmsearch using SIMD (single instruction, multiple data) technology [CLC bio, 2007].

Other useful resources

The HMMER package website
http://hmmer.janelia.org

Manual for HMMER package
ftp://selab.janelia.org/pub/software/hmmer/CURRENT/Userguide.pdf

Pfam database available in Europe (UK)
http://pfam.sanger.ac.uk

Pfam database available in US
http://pfam.wustl.edu/

CLC Bioinformatics Cell
http://www.clccell.com/

References

[Altschul et al., 1990] Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and Lipman, D. J. (1990). Basic local alignment search tool. J Mol Biol, 215(3):403-410.

[Bateman et al., 2002] Bateman, A., Birney, E., Cerruti, L., Durbin, R., Etwiller, L., Eddy, S. R., Griffiths-Jones, S., Howe, K. L., Marshall, M., and Sonnhammer, E. L. L. (2002). The Pfam protein families database. Nucleic Acids Res, 30(1):276-280.

[Baxevanis and Ouellette, 2001] Baxevanis, A. and Ouellette, B. (2001). Bioinformatics. Wiley-Interscience.

[CLC bio, 2007] CLC bio (2007). CLC Bioinformatics Cell white paper.

[Eddy, 1998] Eddy, S. (1998). Profile hidden Markov models. Bioinformatics, 14:755-763.

[Eddy, 2003] Eddy, S. (2003). The HMMER User’s Guide (biological sequence analysis using profile hidden Markov models). Howard Hughes Medical Institute and Dept. of Genetics, Washington University School of Medicine, 660 South Euclid Avenue, Box 8232, Saint Louis, Missouri 63110, USA, version 2.3.2 edition. http://hmmer.wustl.edu/.

[Krogh et al., 1994] Krogh, A., Brown, M., Mian, I. S., Sjölander, K., and Haussler, D. (1994). Hidden Markov models in computational biology: applications to protein modeling. J Mol Biol, 235(5):1501-1531.

[PFAM, ] PFAM. http://pfam.sanger.ac.uk.

[Smith and Waterman, 1981] Smith, T. F. and Waterman, M. S. (1981). Identification of common molecular subsequences. J Mol Biol, 147(1):195-197.

Appendix

Example hmmpfam

localhost:~...hmmer% hmmpfam Pfam_fs.bin lgb1_vicfa.fasta
hmmpfam - search one or more sequences against HMM database
HMMER 2.3.2 (Oct 2003)
Copyright (C) 1992-2003 HHMI/Washington University School of Medicine
Freely distributed under the GNU General Public License (GPL)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
HMM file:                 Pfam_fs.bin
Sequence file:            lgb1_vicfa.fasta
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Query sequence: P02232|LGB1_VICFA
Accession:      [none]
Description:    Leghemoglobin-1 - Vicia faba (Broad bean)

Scores for sequence family classification (score includes all domains):
Model       Description                                Score   E-value   N
--------    -----------                                -----   ------- ---
Globin      Globin                                      75.5   2.6e-21   1
Herpes_UL42 DNA polymerase processivity factor (UL       1.3       7.8   1
PPTA        Protein prenyltransferase alpha subuni       2.8       8.2   1

Parsed for domains:
Model        Domain seq-f seq-t    hmm-f hmm-t      score  E-value
--------    ------- ----- -----    ----- -----      -----  -------
Globin        1/1       7   141 ..     1   148 []    75.5  2.6e-21
PPTA          1/1      11    38 ..     3    31 .]     2.8      8.2
Herpes_UL42   1/1      14    29 ..   149   164 .]     1.3      7.8

Alignments of top-scoring domains:
Globin: domain 1 of 1, from 7 to 141: score 75.5, E = 2.6e-21

                   *->dkalvkasWgkvkgtdnreelGaealarlFkayPdtktyFpkfgdls
                      + alv++s k+ n + + +++ ++ ++P +k++F+ f l+
  P02232|LGB     7    QEALVNSSSQLFKQ--NPSNYSVLFYTIILQKAPTAKAMFS-F--LK 48

                      sadaikgspkfkaHgkkVlaalgeavkhLgnddddgnlkaalkkLaarHa
                      +++++ +spk+ aH++kV++++ + + +L+ ++ + k + H+
  P02232|LGB    49    DSAGVVDSPKLGAHAEKVFGMVRDSAVQLR----ATGEVVLDGKDGSIHI 94

                      erghvdpanFkllgeallIvvLaahlggeveftpevkaAWdkaldvvada
                      ++g v + +F +++eall ++++++ g+ ++++e aAW+ a+d +a+a
  P02232|LGB    95    QKG-VLDPHFVVVKEALL-KTIKEASGD--KWSEELSAAWEVAYDGLATA 140

                      l<-*
                      +
  P02232|LGB   141    I    141

PPTA: domain 1 of 1, from 11 to 38: score 2.8, E = 8.2

                   *->LelteklleldpkNysaWnyRrwlleklg<-*
                      ++ ++l +++p+Nys+ y +l+k++
  P02232|LGB    11    VNSSSQLFKQNPSNYSVLFYTI-ILQKAP    38

Herpes_UL42: domain 1 of 1, from 14 to 29: score 1.3, E = 7.8

                   *->mlsvvkhelnsytvfF<-*
                      ++++ k+ + +y+v+F
  P02232|LGB    14    SSQLFKQNPSNYSVLF    29

Example hmmsearch

localhost:~...hmmer% hmmsearch globin.hmm uniprot_sprot_1000.fasta
hmmsearch - search a sequence database with a profile HMM
HMMER 2.3.2 (Oct 2003)
Copyright (C) 1992-2003 HHMI/Washington University School of Medicine
Freely distributed under the GNU General Public License (GPL)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
HMM file:                   globin.hmm [Globin]
Sequence database:          uniprot_sprot_1000.fasta
per-sequence score cutoff:  [none]
per-domain score cutoff:    [none]
per-sequence Eval cutoff:   <= 10
per-domain Eval cutoff:     [none]
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Query HMM:   Globin
Accession:   PF00042.12
Description: Globin
  [HMM has been calibrated; E-values are empirical estimates]

Scores for complete sequences (score includes all domains):
Sequence    Description                                Score   E-value   N
--------    -----------                                -----   ------- ---
HBAD_AEGMO  (P68059) Hemoglobin subunit alpha-D (Hemo  202.2   1.1e-58   1
HMP_PHOPR   (Q6LM37) Flavohemoprotein (Hemoglobin-lik   92.9   2.9e-27   1
HMP_BACHK   (Q6HLA6) Flavohemoprotein (Hemoglobin-lik   55.0   2.3e-16   1
GLBB_CERLA  (O76243) Body wall hemoglobin               22.9   4.1e-07   1
NORA_STAAU  (P0A0J7) Quinolone resistance protein nor    0.3       1.3   1
PRI1_HUMAN  (P49642) DNA primase small subunit (EC 2.   -0.4         2   1
HSP70_NEUCR (Q01233) Heat shock 70 kDa protein (HSP70   -1.4         4   1
SM1L2_HUMAN (Q8NDV3) Structural maintenance of chromo   -1.5       4.1   1
RPOC_FUSNN  (Q8RHI7) DNA-directed RNA polymerase beta   -1.8       5.1   1
LRC45_HUMAN (Q96CN5) Leucine-rich repeat-containing p   -1.8       5.2   1
IUNH_CRIFA  (Q27546) Inosine-uridine preferring nucle   -1.8       5.2   1
SPP1_YEAST  (Q03012) COMPASS component SPP1 (Complex    -2.1       6.2   1
ARGJ_CORDI  (P62059) Glutamate N-acetyltransferase (E   -2.1       6.4   1
BRSK1_HUMAN (Q8TDC3) BR serine/threonine-protein kina   -2.3       7.2   1
SPC34_YEAST (P36131) DASH complex subunit SPC34 (Oute   -2.5       8.2   1
THIO_CALJA  (Q9BDJ3) Thioredoxin                        -2.7       9.2   1
UGT55_CAEEL (Q22180) Putative UDP-glucuronosyltransfe   -2.7       9.4   1
ENV_HV2UC   (Q76638) Envelope glycoprotein gp160 prec   -2.7       9.5   1

Parsed for domains:
Sequence     Domain seq-f seq-t    hmm-f hmm-t      score  E-value
--------    ------- ----- -----    ----- -----      -----  -------
HBAD_AEGMO    1/1       6   136 ..     1   148 []   202.2  1.1e-58
HMP_PHOPR     1/1       6   131 ..     1   148 []    92.9  2.9e-27
HMP_BACHK     1/1       6   131 ..     1   148 []    55.0  2.3e-16
GLBB_CERLA    1/1       4   109 .]    19   144 ..    22.9  4.1e-07
NORA_STAAU    1/1     248   285 ..   108   148 .]     0.3      1.3
PRI1_HUMAN    1/1     235   251 ..   132   148 .]    -0.4        2
HSP70_NEUCR   1/1     565   586 ..     1    24 [.    -1.4        4
SM1L2_HUMAN   1/1     531   545 ..   134   148 .]    -1.5      4.1
RPOC_FUSNN    1/1     390   405 ..   133   148 .]    -1.8      5.1
LRC45_HUMAN   1/1      96   119 ..    16    39 ..    -1.8      5.2
IUNH_CRIFA    1/1     291   307 ..   132   148 .]    -1.8      5.2
SPP1_YEAST    1/1     118   138 ..     1    21 [.    -2.1      6.2
ARGJ_CORDI    1/1     195   212 ..   131   148 .]    -2.1      6.4
BRSK1_HUMAN   1/1     151   160 ..   139   148 .]    -2.3      7.2
SPC34_YEAST   1/1     162   177 ..    24    39 ..    -2.5      8.2
THIO_CALJA    1/1       5    21 ..   132   148 .]    -2.7      9.2
UGT55_CAEEL   1/1     214   228 ..   134   148 .]    -2.7      9.4
ENV_HV2UC     1/1     597   609 ..   136   148 .]    -2.7      9.5

Alignments of top-scoring domains:
HBAD_AEGMO: domain 1 of 1, from 6 to 136: score 202.2, E = 1.1e-58

                   *->dkalvkasWgkvkgtdnreelGaealarlFkayPdtktyFpkfgdls
                      dk+l++a+W+kv g + e++Gaeal+r+F++yP tktyFp+f dls
  HBAD_AEGMO     6    DKKLIQATWDKVQG--HQEDFGAEALQRMFITYPPTKTYFPHF-DLS 49

                      sadaikgspkfkaHgkkVlaalgeavkhLgnddddgnlkaalkkLaarHa
                      +gs +++ HgkkV++alg+avk + d nl +al++L+ +Ha
  HBAD_AEGMO    50    -----PGSDQVRGHGKKVVNALGNAVKSM-----D-NLSQALSELSNLHA 88

                      erghvdpanFkllgeallIvvLaahlggeveftpevkaAWdkaldvvada
                      ++++vdp+nFkll++++ vvLa hlg+ e+tpev+aA+dk+l+ va++
  HBAD_AEGMO    89    YNLRVDPVNFKLLSQCFQ-VVLAVHLGK--EYTPEVHAAFDKFLSAVAAV 135

                      l<-*
                      l
  HBAD_AEGMO   136    L    136

... [A number of alignments were removed for brevity] ...

ENV_HV2UC: domain 1 of 1, from 597 to 609: score -2.7, E = 9.5

                   *->AWdkaldvvadal<-*
                      +W +a+++v++
  ENV_HV2UC    597    SWGCAFRQVCHTT    609

Histogram of all scores:
score    obs    exp  (one = represents 4 sequences)
-----    ---    ---
  -16      2      0|=
  -15     11      0|===
  -14     17      0|=====
  -13     25     11|==*====
  -12     79     88|==================== *
  -11    131    205|=================================                  *
  -10    192    237|================================================           *
   -9    195    187|==============================================*==
   -8    143    120|=============================*======
   -7    106     69|=================*=========
   -6     42     38|=========*=
   -5     20     20|====*
   -4     16     10|==*=
   -3     10      5|=*=
   -2      5      2|*=
   -1      1      1|*
    0      5      -|==

% Statistical details of theoretical EVD fit:
              mu =    -9.7401
          lambda =     0.6626
chi-sq statistic =    83.2139
  P(chi-square)  =  1.174e-13

Total sequences searched: 1000

Whole sequence top hits:
tophits_s report:
     Total hits:           74
     Satisfying E cutoff:  18
     Total memory:         22K

Domain top hits:
tophits_s report:
     Total hits:           74
     Satisfying E cutoff:  74
     Total memory:         33K

Creative Commons License

All CLC bio’s scientific articles are licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 2.5 License. You are free to copy, distribute, display, and use the work for educational purposes, under the following conditions: You must attribute the work in its original form and "CLC bio" has to be clearly labeled as author and provider of the work. You may not use this work for commercial purposes. You may not alter, transform, nor build upon this work.

See http://creativecommons.org/licenses/by-nc-nd/2.5/ for more information on how to use the contents.
