Rhesus genome annotations
Rob Norgren
Department of Genetics, Cell Biology and Anatomy
University of Nebraska Medical Center
Conventional Approach to GeneChip Production
• Sequence millions of ESTs
• Obtain finished genomic sequences
• Cluster redundant ESTs
• Align EST clusters with genomic sequences
• Extract the last 571 bp of sequence from each transcript - probe selection region (PSR)
• Choose 11 to 16 probes that tile across the PSR
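The last two steps above can be sketched in a few lines of Python. This is only an illustration, not the vendor's actual design pipeline: the function names `extract_psr` and `tile_probes` are hypothetical, the 25-base probe length is an assumption not stated on the slide, and the even spacing ignores any sequence-quality criteria a real probe selection would apply.

```python
def extract_psr(transcript: str, psr_length: int = 571) -> str:
    """Return the last `psr_length` bases of a transcript (the PSR).

    If the transcript is shorter than `psr_length`, the whole transcript is used.
    """
    if psr_length <= 0:
        raise ValueError("psr_length must be positive")
    return transcript[-psr_length:]


def tile_probes(psr: str, n_probes: int = 11, probe_length: int = 25) -> list[str]:
    """Pick up to `n_probes` evenly spaced probes that tile across the PSR.

    probe_length=25 is an assumption for illustration only.
    """
    if n_probes <= 0 or len(psr) < probe_length:
        return []
    last_start = len(psr) - probe_length
    if n_probes == 1:
        starts = [0]
    else:
        # Spread start positions from the first to the last possible offset,
        # dropping duplicates that occur when the PSR is very short.
        starts = sorted({i * last_start // (n_probes - 1) for i in range(n_probes)})
    return [psr[s:s + probe_length] for s in starts]


if __name__ == "__main__":
    transcript = "ACGT" * 250  # toy 1,000 bp transcript, not real data
    psr = extract_psr(transcript)
    for probe in tile_probes(psr, n_probes=11):
        print(probe)
```

Passing `n_probes=16` instead of 11 covers the upper end of the range given on the slide.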
Problems with the conventional approach for a rhesus macaque GeneChip
• Insufficient ESTs to cover most genes
• Little finished genomic sequence (in 2005)
Strategy for targeted amplification of rhesus genes
• Identify the terminal exon and flanking sequence for every human gene
• Design primers and amplify from monkey genomic DNA
• Obtain the rhesus PSR sequences
[Figure: diagram of a terminal exon and its poly A site, showing the PSR and the positions of primers F and R.]
PSR: probe selection region; F: forward primer; R: reverse primer
Other sources for rhesus GeneChip PSRs
• Preliminary Baylor genomic sequences. In silico approach: aligned human PSRs with preliminary rhesus genomic sequence.
• ESTs
Rhesus GeneChip
• Available in March 2005
• Novel design
• Whole genome expression array - 52,024 probesets for 47,000 transcripts
• Probesets include 17,093 well-annotated genes (16 probes/probeset)
• Probesets were designed for 1,099 well-annotated genes not present on the U133+2.0 human GeneChip.
Rhesus Genome
• Draft published in Science on April 13, 2007
• “The rhesus macaque genome assembly is a draft DNA sequence, and it contains many gaps.”
What does a “draft” rhesus genome mean?
• 26,907 protein coding genes for the human
• 24,038 protein coding genes for rhesus macaques
• Sounds good, but is misleading.
• 19,450 well-annotated protein coding genes for humans
• 8,744 well-annotated protein coding genes for rhesus macaques
• What does “well annotated” mean?
• No “hypothetical” genes
• Only genes with “good” gene symbols. No “Locs”.
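The two criteria above can be written as a simple filter. This is a minimal sketch under stated assumptions: the function name `is_well_annotated` is hypothetical, a “Loc” is assumed to be a placeholder symbol of the form `LOC` followed by digits, and “hypothetical” is assumed to appear in the gene's description text. The example records are invented for illustration and are not real annotation data.

```python
import re

# Assumption: a "Loc" is a placeholder symbol such as "LOC123456".
LOC_PATTERN = re.compile(r"^LOC\d+$")


def is_well_annotated(symbol: str, description: str) -> bool:
    """Apply the slide's two criteria to one gene record."""
    symbol = symbol.strip()
    if not symbol or LOC_PATTERN.match(symbol):
        return False  # no "good" gene symbol
    if "hypothetical" in description.lower():
        return False  # "hypothetical" gene
    return True


if __name__ == "__main__":
    # Invented example records: (symbol, description)
    records = [
        ("GAPDH", "glyceraldehyde-3-phosphate dehydrogenase"),
        ("LOC123456", "similar to zinc finger protein"),
        ("ABC1", "hypothetical protein"),
    ]
    kept = [sym for sym, desc in records if is_well_annotated(sym, desc)]
    print(kept)
```

By the counts on this slide, these criteria leave about 72% of human protein coding genes (19,450 of 26,907) but only about 36% of rhesus genes (8,744 of 24,038).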
Problems with GeneChip annotations
• Affymetrix relies on NCBI annotations; hence, many probesets are not annotated with “real” gene symbols
• Stopgap solution: http://www.unmc.edu/rheusgenechip
• Permanent solution requires full and complete annotation of the rhesus genome at NCBI.
What can go wrong at the genome sequencing center?
• Large gaps
• Small gaps
• Misassemblies
• Sequencing errors
What can go wrong with ab initio annotations?
• Incorrect assignment of pseudogene status
• Failure to identify genes
• Incorrect gene models (some exons right, some wrong)
• Incomplete gene models
Consequences of non-annotated genes
• A large number of databases depend on NCBI for their annotations. Example: Affymetrix GeneChips
• Errors and omissions are propagated to dependent databases
• Users are frustrated when they see “Locs” instead of a proper gene symbol
• Users can BLAST each probeset consensus sequence or ask their bioinformatics personnel to establish gene identity, but this wastes time and effort.
How to correct annotations
• Annotations must be acceptable to NCBI; if they are not, corrections will not propagate to dependent databases.
• Some gene annotations can be corrected by manual inspection.
• Some gene annotations can be corrected by human ortholog-based gene models rather than ab initio approaches.
• Some gene annotations can only be corrected by additional sequencing.
• And some gene annotations require a trip to Hell...
Defensins - the gene family from Hell
• Large family of genes
• Orthologs poorly conserved - positive selection?
• Will require focused sequencing and annotation
• May require publication before NCBI annotates most of the rhesus defensins
Acknowledgements
• Jeff Kittrell
• Joel Goodsell
• Audrey Gomel
• NCRR/NIH