proposed redefinition of “gene” requires it to have a biological role
Gerstein MB, …, Snyder M. 2007. Genome Res 17: 669-681
example of complexities observed by ENCODE
(A) annotated exons (black rectangles), novel transcriptionally active regions or TARs (hollow rectangles); conventional annotation identifies only 4 genes or just a fraction of the transcripts reported (dashed lines are introns)
(B) observed transcripts are shown alongside the sequences that regulate them (gray circles); note that some of the enhancers are actually promoters for novel splice isoforms
a redefinition of the “gene”
1. a gene is a genomic sequence directly encoding functional product molecules, either RNAs or proteins
2. when there are several functional products that share overlapping regions, take the union of all overlapping genomic sequences encoding them
3. this union must be coherent, done separately for protein and RNA products, but it does not require that all the products necessarily share a common subsequence
concisely summarized as
a union of genomic sequences encoding a coherent set of potentially overlapping functional products
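The three rules above can be read as an interval-clustering procedure. What follows is a minimal Python sketch under stated assumptions, not the authors' implementation. The `Product` record, the half-open coordinates, and the toy locus are hypothetical. Two products of the same type are grouped when any of their encoding intervals overlap, and overlap is followed transitively, so the products in one gene need not all share a common subsequence. The definition leaves open how "coherent" is to be judged, so the plain overlap test here is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) genomic coordinates (assumed convention)


@dataclass
class Product:
    """Hypothetical record for one functional product (protein or RNA)."""
    name: str
    kind: str                  # "protein" or "RNA"
    encoding: List[Interval]   # genomic intervals that directly encode the product


def overlaps(a: List[Interval], b: List[Interval]) -> bool:
    """True if any interval in a overlaps any interval in b."""
    return any(s1 < e2 and s2 < e1 for s1, e1 in a for s2, e2 in b)


def merge(intervals: List[Interval]) -> List[Interval]:
    """Union of intervals, returned as sorted non-overlapping intervals."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def define_genes(products: List[Product]) -> List[Dict]:
    """Cluster products by type and overlap, then take the union per cluster."""
    genes = []
    for kind in sorted({p.kind for p in products}):  # rule 3: proteins and RNAs separately
        clusters: List[List[Product]] = []
        for p in (q for q in products if q.kind == kind):
            # every existing cluster that p overlaps gets fused with p (transitive overlap)
            hits = [c for c in clusters if any(overlaps(p.encoding, q.encoding) for q in c)]
            fused = [p] + [q for c in hits for q in c]
            clusters = [c for c in clusters if all(c is not h for h in hits)]
            clusters.append(fused)
        for cluster in clusters:
            genes.append({
                "kind": kind,
                "products": sorted(q.name for q in cluster),
                "sequence": merge([iv for q in cluster for iv in q.encoding]),  # rule 2: union
            })
    return genes


# Toy locus with made-up coordinates (not taken from the figure in the paper).
toy_products = [
    Product("P1", "protein", [(100, 200), (300, 400)]),
    Product("P2", "protein", [(350, 450)]),   # overlaps P1, so same gene
    Product("P3", "protein", [(600, 700)]),   # no coding overlap, so separate gene
    Product("R1", "RNA", [(150, 180)]),       # overlaps P1 genomically, but RNAs are handled separately
]
toy_genes = define_genes(toy_products)
for gene in toy_genes:
    print(gene["kind"], gene["products"], gene["sequence"])
```

In this toy locus the RNA product shares genomic sequence with a protein-coding segment yet still forms its own gene, because the union is taken separately for protein and RNA products.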
4 genes defined in this one locus
there are three primary transcripts, two of which encode five proteins, while the third encodes a noncoding RNA; two primary transcripts share a 5’ untranslated region, but they are considered different genes because the translated regions (D and E) do not overlap; there is a noncoding RNA, but the fact that it shares its genomic sequence (X and Y) with the protein-coding genomic segments A and E does not make it a co-product of these genes; there are four genes in this one locus by the new definition
gene number estimates as a function of time and methodology
[figure: number of genes (y axis) versus time (x axis); labels: sequence annotation, observed transcripts, genome is sequenced, dark matter]
dark matter is reproducible, but it’s poorly transcribed, poorly conserved, non protein coding, and outnumbers validated microRNAs by ~1000 fold
cDNA sequencing reveals an abundance of non-coding genes
                     coding1        coding2        non-coding1    non-coding2
number of cDNAs      14,317         3,277          11,526         4,280
size of transcript   2146 (1061)    2174 (1091)    1939 (1019)    1790 (996)
size of best ORFs    1107 (742)     550 (578)      206 (91)       194 (80)
% as single exon     13.4%          35.4%          68.7%          73.1%
FANTOM categories for mouse cDNAs
mouse cDNAs by Okazaki Y, …, Hayashizaki Y. 2002. Nature 420: 563 or human cDNAs by Imanishi T, …, Sugano S. 2004. PLoS Biol 2: e162
neutral evolution of non-coding cDNAs from mouse transcriptome
[figure, two panels: BlastZ to HUMAN at 25% threshold; BlastZ to RAT at 25% threshold; x axis: sequence identity [%]; categories: coding1, coding1-CDS, coding2, non-coding1, non-coding2, ncRNAs, intron1, intergenic]
ncRNAs are known RNA genes; intron1 and intergenic are negative controls
communications arising: Wang J, …, Wong GK. 2004. Nature 431: after p757
tiling array data are riddled with unexplained signal anomalies too
do not assume that non-coding cDNAs are equivalent to tiling array exons
human thymus polyA+ cDNAs profiled at locus of Ewing sarcoma breakpoint region 1 gene; from Johnson JM, …, Schadt EE. 2005. Trends Genet 21: 93
mystery BURST
indications of biological relevance: transcription, conservation, both lines of evidence, or neither?
possible dark matter explanations:
1. biological noise, i.e. real transcripts with no biological roles
2. RNA genes unique to a species
3. long RNAs are precursors for short (and conserved) RNAs
NB: dark matter based on tiling arrays with 150 bp exons is not equivalent to cDNA sequences with 1800 bp exons
[diagram: transcription (poorly transcribed to highly transcribed) against conservation (poorly conserved to highly conserved); regions labeled “most biology” and “dark matter”]
hypothesis: unannotated long RNAs are precursors for short RNAs
Kapranov P, …, Gingeras TR. 2007. Science 316: 1484-1488
nuclear and cytosolic polyadenylated RNAs longer than 200 nt (long RNAs, lRNAs) and whole-cell RNAs shorter than 200 nt (short RNAs, sRNAs) were profiled for the non-repetitive portion of the human genome; 64% of poly(A)+ transcription (nucleus and cytosol) does not align with annotated exons, yet some 80% of the 265,237 annotated exons are detected
lRNAs that overlap with sRNAs are more PhastCons conserved (i)
PhastCons identifies evolutionarily conserved elements from a multi-species sequence alignment, given their phylogenetic tree, and based on a statistical model of evolution called a phylogenetic hidden Markov model (phylo-HMM)
lRNAs that overlap with sRNAs are more PhastCons conserved (ii)
quantile-quantile plot of PhastCons scores for long RNAs that do (x axis) and do not (y axis) overlap with short RNAs; conservatively, 3.1% of HepG2 and 2.4% of HeLa nuclear lRNA transfrags might be parts of precursors of sRNAs
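A quantile-quantile comparison of this kind can be sketched in a few lines. The Python below is a minimal illustration, assuming two plain lists of per-transfrag conservation scores; the function names and the scores are made up for the example and are not data from Kapranov et al. Matching quantiles of the two samples are paired, with the overlapping set on the x axis as in the plot described above, so points below the diagonal (x greater than y) indicate higher scores in the overlapping set.

```python
from statistics import quantiles


def qq_points(x_scores, y_scores, n=10):
    """Pair the n-1 matching quantile cut points of two samples."""
    qx = quantiles(x_scores, n=n, method="inclusive")
    qy = quantiles(y_scores, n=n, method="inclusive")
    return list(zip(qx, qy))


def fraction_below_diagonal(points):
    """Share of quantile pairs where the x sample scores higher than the y sample."""
    return sum(1 for x, y in points if x > y) / len(points)


# Made-up scores in [0, 1]; x = lRNAs overlapping sRNAs, y = lRNAs that do not.
overlapping = [0.10, 0.25, 0.40, 0.55, 0.70, 0.85, 0.95]
non_overlapping = [0.05, 0.10, 0.20, 0.30, 0.35, 0.45, 0.50]

points = qq_points(overlapping, non_overlapping, n=4)
for x, y in points:
    print(f"x={x:.3f}  y={y:.3f}")
print("fraction below diagonal:", fraction_below_diagonal(points))
```

The two samples may differ in size, since only quantiles are compared; that is the main reason to prefer a quantile-quantile view over pairing individual scores.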
sRNAs associate with 5’ and 3’ boundaries of annotated transcripts
enrichment over random expectation is plotted as function of distance from 5’ and 3’ termini for sRNAs on same (sense) or opposite (antisense) strand as the annotated transcripts; comparison is made against random regions with matched G+C content
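The enrichment calculation described here can be sketched as a ratio of binned distance histograms. This is a simplified Python illustration under stated assumptions, not the published analysis: the function names, bin size, and positions are hypothetical; strand (sense versus antisense) is ignored; and the random positions are assumed to have been drawn already from regions with matched G+C content.

```python
from collections import Counter


def distance_histogram(positions, termini, bin_size, max_distance):
    """Count positions per signed-distance bin relative to the nearest terminus."""
    counts = Counter()
    for pos in positions:
        nearest = min(termini, key=lambda t: abs(pos - t))
        distance = pos - nearest
        if abs(distance) <= max_distance:
            counts[distance // bin_size] += 1  # floor division keeps negative bins upstream
    return counts


def enrichment(observed, random_positions, termini, bin_size=100, max_distance=1000):
    """Observed over random frequency per distance bin, keyed by bin start."""
    obs = distance_histogram(observed, termini, bin_size, max_distance)
    rnd = distance_histogram(random_positions, termini, bin_size, max_distance)
    result = {}
    for b in sorted(obs):
        if rnd[b]:  # skip bins with no random hits to avoid dividing by zero
            obs_freq = obs[b] / len(observed)
            rnd_freq = rnd[b] / len(random_positions)
            result[b * bin_size] = obs_freq / rnd_freq
    return result


# Made-up coordinates: one annotated terminus and two small position sets.
termini = [1000]
srna_positions = [1010, 1020, 1030, 1500]
random_positions = [1010, 1500, 1600, 1700]

profile = enrichment(srna_positions, random_positions, termini)
print(profile)
```

Values above 1 in a bin mean the sRNA positions fall there more often than the random positions do; a real analysis would run this separately for 5’ and 3’ termini and for each strand.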