Molecular biology Annotationpublic.callutheran.edu/~revie/molbiol/AnnotationOther... ·...

Molecular biology Annotation We are continuing the learning process of annotation. Part 1: Learning exercise This part has been adapted from the “Chimp Chunks” exercise by GEP. The BLAST web server is available at http://www.ncbi.nlm.nih.gov/BLAST/The UCSC Genome Browser is available at http://genome.ucsc.eduThe GENSCAN web server is available at http://genes.mit.edu/GENSCAN.html Recent technological advances have dramatically increased the rate at which we can generate high-quality DNA sequence. However, characterization of features (genes, repetitive elements, etc) found within these raw sequences remains a slow and labor-intensive process. Various computational methods have been devised to improve the efficiency of sequence annotation. In this tutorial, we will focus on the identification of genes and pseudogenes in your contig. The GENSCAN and BLAST programs, along with the UCSC Genome Browser, will be used to facilitate the annotation process. GENSCAN is a program that predicts the locations of genes in DNA sequences. While potential genes identified by GENSCAN must still be experimentally confirmed, its predictions can nonetheless narrow the scope of the investigation by identifying regions where genes are most likely to be found. GENSCAN uses a probabilistic model to predict locations of promoters, exons, and polyadenylation signals. Discussion of the actual implementation of GENSCAN is beyond the scope of this tutorial. For additional information on how GENSCAN works and on how to use GENSCAN, please see the paper by Burge et al. (1997), or visit the GENSCAN website (http://genes.mit.edu/GENSCAN.html).

Figure 1: Example GENSCAN result The GENSCAN analysis of your contig is available in your file directory (in Analysis, GeneFinder, 1

http://www.ncbi.nlm.nih.gov/BLAST/

http://genome.ucsc.edu/

http://genes.mit.edu/GENSCAN.html

http://genes.mit.edu/GENSCAN.html

Genscan). To view the file, launch a web browser and open the pdf file. The repetitive sequences have been masked (including low complexity sequences) prior to running GENSCAN. You can run the GENSCAN analysis yourself by submitting your masked contig sequence to the GENSCAN web server. You will see a result like that of the above figure. A second file in the Genscan directory lists the results in a table (the file ends in .genscan). Open it with Word or Wordpad). The first part of the GENSCAN output is a table that summarizes the predicted genes within this contig. Explanations of the columns in the table are available at the end of the GENSCAN output. The rest of the output consists of the peptide (amino acid) and coding (nucleotide) sequences of the predicted genes. We will use these sequences to evaluate the validity of the GENSCAN predictions. Analysis of Predicted Gene 1:

Figure 2. blastp search against nr. To get an idea of what this predicted gene could be, we first run a protein-protein BLAST (blastp) against NCBI’s non-redundant (nr) database using default parameters. The nr database consists of all known and hypothetical protein sequences that have been submitted to GenBank, with groups of essentially identical sequences reduced to a single sequence (hence the term “non-redundant”). While the BLAST hits produced with the nr database contain relatively little information to help us interpret them, the names of these hits and the corresponding references provide us with a reasonable starting point.

Figure 3. In this example, the conserved domain search identifies “HMG Box”. Copy the peptide sequence of this gene (from the Genscan output) onto the clipboard. Open a new browser window, navigate to the BLAST server at NCBI (www.ncbi.nlm.nih.gov/BLAST/), and select blastp. Paste the contents of the clipboard into the search box and click “BLAST!” (Figure 3).

2

http://www.ncbi.nlm.nih.gov/BLAST/

Figure 4. Search for HMG-Box in the NCBI Conserved Domain database

Figure 5. Additional details on domain matches In the example figure, the initial response page from NCBI (Figure 3) showed a putative conserved domain called “HMG Box” in our predicted protein. To learn more about HMG Boxes, open a new browser window and navigate back to the NCBI home page. On the top search bar, change the database to “Conserved Domains” and search for “HMG-Box” (Figure 4). Click on the first link (Figure 5) to bring up the record for HMG-Box (Figure 6).

Figure 6. Record of the domain at the Conserved Domain Database (CDD)

Figure 7. Cn3D display for the HMG-box domain.

3

This page provides a summary and links for literature references to this domain. You can also see the 3D structure of the HMG box if you have Cn3D installed (Figure 7). The bottom of the page displays the local

multiple sequence alignment used to construct the Position Specific Score Matrix (PSSM) that is then used to search for conserved domains in the query sequence. For additional information on CDD, see the help page on NCBI http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd_help.shtml. If you would like to learn still more about the HMG Box, the page with the descriptions also contains a link with eleven references to the original research articles that described it. You can get the PubMed entries for these articles by following the links in the “Links” section (column on the left). In addition to the abstracts, many of the PubMed entries contain links to the full article. Now that we have a basic idea as to the potential function of the predicted gene, we are ready to look at the alignments from nr. For the example, the motif portion of the BLAST search suggests that the predicted gene is probably a real gene with a DNA binding function, or a pseudogene derived from a real gene. This could be a pseudogene if it only has one exon or a mispredicted gene. It also would not be terribly surprising if GENSCAN missed an exon or split two exons of the same transcript into distinct genes. You need to therefore analyze the region on either side of your exons. Go back to the initial BLAST response page and click “Format!” This page may take a while to come up (the time estimate is often inaccurate), so be patient.

Figure 8. Example BLAST hits to the predicted protein in the nr database. If there are numerous significant hits to the nr database that cover the entire length of the predicted protein, it supports the gene model of the predicted gene. You can confirm this by clicking on the score and looking at the alignment. However, if many top hits have accession numbers that start with XP, these are predictions of proteins that are unconfirmed and have no experimental evidence to support them. Generally, we do not want to rely on (possibly incorrect) computational predictions alone for any part of our annotation; instead, we will seek matches that are based on more direct evidence. The second hit shows a match to the human HMGB3 protein. Click on the score in the summary to see the alignment (Figure 9).

4

http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd_help.shtml

Figure 9. BLAST alignment of the human HMGB3 protein (subject) against our predicted protein (query). Note that there are several records of this subject sequence. Looking at the top of the alignment, we see that the full HMGB3 protein has a length of 200 amino acids. However, recall that our predicted peptide has only 140 amino acids. There are three potential explanations for this observation. First, GENSCAN may have missed part of the real protein. This scenario could happen if GENSCAN failed to identify one or more exons, or called them as part of another gene. Alternatively, the predicted gene can actually be a pseudogene that has acquired an in-frame stop codon. The stop codon would cause the GENSCAN prediction to end prematurely. A third hypothesis is that GENSCAN has accurately predicted a functional protein that is closely related to the human HMGB3 protein but lacks some part of it, such as one or more functional domains.

Figure 10. GenBank record of the HMGB3 protein We can get the Genbank entry for the match by clicking on the accession number. In this screen, you can display and export the Genbank record in a variety of formats (Figure 10). The most useful thing here is the PubMed link for the primary paper describing with the human HMGB3 protein. Now go back and click on the 'G' icon. This will bring you to the Entrez Gene page (Figure 11). Here, we can see other names (“aliases”) used for the HMGB3 protein, such as “HMG4.” More importantly, we learn that the protein is located on chromosome X at locus Xq28.

5

Figure 11. Entrez Gene entry for the human HMGB3 protein

Figure 12. Detailed Entrez Gene record of the human HMGB3 protein When we click on the name of the gene, we find lots of additional information about it (Figure 12). For example, in the “genomic regions” section, we find information on all known transcripts for this gene. In this case, only one such transcript is listed, but an alternatively spliced gene could produce multiple transcripts. Significantly, we find that the human protein is a multi-exon gene. In the “genomic context” section, we find information about other genes that are in the neighborhood of the HMGB3 protein. You can also see the location of the gene on the X chromosome graphically by clicking on “See HMGB3 in MapViewer.” To summarize our findings so far, we have learned that our predicted protein shows strong sequence similarity to the human HMGB3 protein but is truncated relative to that protein. We need to decide among three possible explanations for the truncation: either GENSCAN has failed to accurately predict the gene, or the predicted gene is a pseudogene, or the predicted gene may produce a functional product that lacks part of the human protein. To evaluate which of these hypotheses is most probable, we need to gather additional evidence using the UCSC Genome Browser. UCSC browser Go back to the GENSCAN output genscan file. Copy the first predicted coding sequence to the clipboard, and go to the UCSC Genome Browser (http://genome.ucsc.edu). Click on “BLAT” and select the D. erecta assembly. Paste the predicted sequence into the text box and click “submit”. You should see a result similar to Figure 13.

6

http://genome.ucsc.edu/

Home Genomes Tables Gene Sorter PCR Session FAQ Help

D. erecta BLAT Results

BLAT Search Results ACTIONS QUERY SCORE START END QSIZE IDENTITY CHRO STRAND START END SPAN --------------------------------------------------------------------------------------------------- browser details YourSeq 656 1 219 219 100.0% 4784 ++ 21996882 21997591 710

Figure 13: D. erecta example results The predicted sequence should match one or more loci in the D. erecta genome (Figure 13). The top hit will show an identity near or equal to 100.0% to the D. erecta genomic sequence over the length of the entire query. The next best match, if present, should have a significantly lower identity percent.

Figure 14. Track configuration options for the UCSC Genome Browser. Click on “browser” for the alignment with the best score, and the screen will show your sequence as a black bar under the title “Your Sequence from Blat Search”. Display options can be changed using the “configure” button. (For additional information, please see the tutorial on the UCSC Genome Browser.) First reinitialize the browser by clicking on “hide all” then configure the tracks to use the following options: Adjust the following track to “pack” mode: • Under “Mapping and Sequencing Track: Blat Sequence, adjust the following tracks to “dense” mode:

o Under “Gene and Gene Prediction Tracks”: Known Genes, RefSeq, Ensembl Genes, Twinscan, SGP Genes, Genscan Genes.

o Under “mRNA and EST Tracks”: Human mRNAs, Spliced ESTs, Human ESTs, Other ESTs. o Under Comparative Genomics: Mouse Net. o Under “Variations and Repeats”: RepeatMasker.

Now hit “refresh” and look at the new image. Zoom out 3X to get a broader view of the region (Figure 15).

7

Figure 15. UCSC genome browser for our region of interest In the example figure, we can see there are no known genes in this region, according to RefSeq and other curated gene lists. The only evidence we have for our prediction comes from hypothetical genes predicted by programs like SGP and GENSCAN. Interestingly, TWINSCAN, another gene prediction program, does not predict a gene in this region. SGP, which has the tendency to fuse genes, predicts that our putative gene is in fact part of a larger gene with two exons. (This hypothesis is plausible because the GENSCAN prediction is shorter than the matches to known proteins.) There are no human mRNAs and no human ESTs in the aligned region. Hence, if this is a real gene, it has not been picked up by any gene hunting projects that used transcript sequencing. There are no spliced human ESTs in the region, which is consistent with our prediction of a single-exon gene. There are, however, ESTs from other organisms, which are generally a more reliable guide to the presence of a real gene than are unspliced human ESTs. Unfortunately in D. erecta, there are no ESTs for confirmation of the gene. Part 2: Your fosmid

1. Finish your BLAST analysis from last time. 2. Go through the BLAT/UCSC analysis to analyze your fosmid. To see it, go to

http://genome.ucsc.edu/cgi-bin/hgGateway?hgsid=90774068&clade=insect&org=D.+erecta&db=0 and enter the contig scaffold number as the “position or search term” (find it in the Project report text file in your main fosmid directory).

3. You should confirm/not confirm each predicted item (exon, intron, etc.). Keep track of the evidence for each item. Note that you may have only the beginning or ending of a gene if it extends to the next fosmid.

Eventually, we will need to submit the data with the following information using GFF Creator:

• Gene ID: Gene Name (i.e. mav-PA) • Transcript ID: Refseq mRNA ID (i.e. NM_001014690) • Source: Method used to identify gene (i.e. blastnNT) • Strand: Orientation of gene relative to contig • CDS coordinates: Translation start to end

(i.e. 102-200,219-330) • Exon coordinates: Transcription start to end

(i.e. 90-200,219-350,360-380) • Partial 5', Partial 3': Checked if gene model is incomplete at the 5' or 3' end

8

http://genome.ucsc.edu/cgi-bin/hgGateway?hgsid=90774068&clade=insect&org=D.+erecta&db=0

GFF Creator

Specify Gene Model:

Project Name: Gene ID: Transcript ID: Source: Strand: Plus

CDS Coordinates:

Exon Coordinates:

Partial 5' Partial 3'

Submit Data

9

Molecular biology Annotationpublic.callutheran.edu/~revie/molbiol/AnnotationOther... ·...

Documents

Transcript of Molecular biology Annotationpublic.callutheran.edu/~revie/molbiol/AnnotationOther... ·...