Proposed redefinition of “gene” requires it to have a biological role Gerstein MB, …, Snyder...

proposed redefinition of “gene” requires it to have a biological role

Gerstein MB, …, Snyder M. 2007. Genome Res 17: 669-681

example of complexities observed by ENCODE

(A) annotated exons (black rectangles), novel transcriptionally active regions or TARs (hollow rectangles); conventional annotation identifies only 4 genes or just a fraction of the transcripts reported (dashed lines are introns)

(B) observed transcripts are shown alongside the sequences that regulate them (gray circles); note that some of the enhancers are actually promoters for novel splice isoforms

a redefinition of the “gene”

1. a gene is a genomic sequence directly encoding functional product molecules, either RNAs or proteins

2. when there are several functional products that share overlapping regions, take the union of all overlapping genomic sequences encoding them

3. this union must be coherent, done separately for protein and RNA products, but it does not require that all the products necessarily share a common subsequence

concisely summarized as

a union of genomic sequences encoding a coherent set of potentially overlapping functional products
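The three rules above can be read as a grouping procedure, sketched below in Python. This is only an illustration and not the authors' implementation. The `Product` class, the function names, and all coordinates are hypothetical. It assumes each product lists only the genomic segments that directly encode it (for proteins, the translated regions) and that intervals are half-open. Products of the same kind are grouped whenever they are linked by a chain of overlaps, so members of a group need not all share one common subsequence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Product:
    """A functional product (hypothetical container, not from the paper)."""
    name: str
    kind: str        # "protein" or "RNA"
    segments: tuple  # ((start, end), ...) half-open intervals encoding the product


def overlaps(a, b):
    """True if any encoding segment of a overlaps any encoding segment of b."""
    return any(s1 < e2 and s2 < e1
               for s1, e1 in a.segments
               for s2, e2 in b.segments)


def merge_intervals(intervals):
    """Union of intervals, returned as a sorted list of disjoint intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def define_genes(products):
    """Group products into genes, separately for each product kind (rule 3)."""
    genes = []
    for kind in sorted({p.kind for p in products}):
        pool = [p for p in products if p.kind == kind]
        seen = set()
        for seed in pool:
            if seed in seen:
                continue
            # Collect every product reachable from the seed through overlaps.
            group, stack = [], [seed]
            seen.add(seed)
            while stack:
                current = stack.pop()
                group.append(current)
                for other in pool:
                    if other not in seen and overlaps(current, other):
                        seen.add(other)
                        stack.append(other)
            # Rule 2: the gene is the union of the encoding sequences.
            union = merge_intervals(seg for p in group for seg in p.segments)
            genes.append((kind, sorted(p.name for p in group), union))
    return genes


# Toy locus with made-up coordinates.
toy_products = [
    Product("p1", "protein", ((100, 200), (300, 400))),
    Product("p2", "protein", ((350, 450),)),
    Product("p3", "protein", ((600, 700),)),
    Product("r1", "RNA", ((150, 180), (650, 680))),
]
toy_genes = define_genes(toy_products)
```

In this toy locus, `p1` and `p2` overlap and form one gene, `p3` stands alone, and `r1` forms its own gene even though it shares sequence with protein-coding segments, because protein and RNA products are grouped separately.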

4 genes defined in this one locus

there are three primary transcripts, two of which encode five proteins, while the third encodes a noncoding RNA; two primary transcripts share a 5’ untranslated region, but they are considered different genes because the translated regions (D and E) do not overlap; there is a noncoding RNA, but the fact that it shares its genomic sequence (X and Y) with the protein-coding genomic segments A and E does not make it a co-product of these genes; there are four genes in this one locus by the new definition

gene number estimates as a function of time and methodology

[figure: genes versus time; labels: sequence annotation, observed transcripts, genome is sequenced, dark matter]

dark matter is reproducible, but it’s poorly transcribed, poorly conserved, non protein coding, and outnumbers validated microRNAs by ~1000 fold

cDNA sequencing reveals an abundance of non-coding genes

                    coding1       coding2       non-coding1   non-coding2
number of cDNAs     14,317        3,277         11,526        4,280
size of transcript  2146 (1061)   2174 (1091)   1939 (1019)   1790 (996)
size of best ORFs   1107 (742)    550 (578)     206 (91)      194 (80)
% as single exon    13.4%         35.4%         68.7%         73.1%

FANTOM categories for mouse cDNAs

mouse cDNAs by Okazaki Y, …, Hayashizaki Y. 2002. Nature 420: 563 or human cDNAs by Imanishi T, …, Sugano S. 2004. PLoS Biol 2: e162

neutral evolution of non-coding cDNAs from mouse transcriptome

[figure: two panels plotted against sequence identity [%]; series: coding1, coding1-CDS, coding2, non-coding1, non-coding2, ncRNAs, intron1, intergenic]

panel 1: BlastZ to HUMAN at 25% threshold

panel 2: BlastZ to RAT at 25% threshold

ncRNAs are known RNA genes; intron1 and intergenic are negative controls

communications arising: Wang J, …, Wong GK. 2004. Nature 431: after p757

tiling array data are riddled with unexplained signal anomalies too

do not assume that non-coding cDNAs are equivalent to tiling-array exons

human thymus polyA+ cDNAs profiled at locus of Ewing sarcoma breakpoint region 1 gene; from Johnson JM, …, Schadt EE. 2005. Trends Genet 21: 93

mystery BURST

indications of biological relevance: transcription, conservation, both lines of evidence, or neither?

possible dark matter explanations:

1. biological noise, i.e. real transcripts with no biological roles

2. RNA genes unique to a species

3. long RNAs are precursors for short (and conserved) RNAs

NB: dark matter based on tiling arrays with 150 bp exons is not equivalent to cDNA sequences with 1800 bp exons

[diagram: transcription (poorly transcribed to highly transcribed) against conservation (poorly conserved to highly conserved); regions labeled "most biology" and "dark matter"]

hypothesis: unannotated long RNAs are precursors for short RNAs

Kapranov P, …, Gingeras TR. 2007. Science 316: 1484-1488

nuclear and cytosolic polyadenylated RNAs longer than 200 nt (long RNAs, lRNAs) and whole-cell RNAs shorter than 200 nt (short RNAs, sRNAs) were profiled for the non-repetitive portion of the human genome; 64% of poly(A)+ transcription (nucleus and cytosol) does not align with annotated exons, although some 80% of the 265,237 annotated exons are detected

lRNAs that overlap with sRNAs are more PhastCons conserved (i)

PhastCons identifies evolutionarily conserved elements from a multi-species sequence alignment, given their phylogenetic tree, and based on a statistical model of evolution called a phylogenetic hidden Markov model (phylo-HMM)

lRNAs that overlap with sRNAs are more PhastCons conserved (ii)

quantile-quantile plot of PhastCons scores for long RNAs that do (x axis) and do not (y axis) overlap with short RNAs; conservatively, 3.1% of HepG2 and 2.4% of HeLa nuclear lRNA transfrags might be parts of precursors of sRNAs
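For readers unfamiliar with quantile-quantile plots, the sketch below shows how the plotted points can be derived from two samples of scores. It is a generic illustration with made-up numbers, not the analysis from the paper. The function name `qq_points` and the choice of linear interpolation between order statistics are assumptions.

```python
def qq_points(x_scores, y_scores, n_quantiles=10):
    """Return (x quantile, y quantile) pairs at evenly spaced probabilities."""
    def quantile(sorted_vals, q):
        # Linear interpolation between neighbouring order statistics.
        pos = q * (len(sorted_vals) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(sorted_vals) - 1)
        frac = pos - lo
        return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac

    xs, ys = sorted(x_scores), sorted(y_scores)
    return [(quantile(xs, i / n_quantiles), quantile(ys, i / n_quantiles))
            for i in range(n_quantiles + 1)]


# Made-up scores: x for lRNAs overlapping sRNAs, y for those that do not.
toy_points = qq_points([0.0, 0.5, 1.0], [0.0, 0.25, 0.5], n_quantiles=2)
```

Points that fall below the diagonal mean the x-axis sample has higher scores at the same quantile, which is the pattern expected if lRNAs overlapping sRNAs are more conserved.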

sRNAs associate with 5’ and 3’ boundaries of annotated transcripts

enrichment over random expectation is plotted as function of distance from 5’ and 3’ termini for sRNAs on same (sense) or opposite (antisense) strand as the annotated transcripts; comparison is made against random regions with matched G+C content
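Enrichment over random expectation by distance can be computed roughly as follows. This is a minimal sketch under stated assumptions, not the published pipeline. It assumes the signed distances from each sRNA to the nearest terminus are already available, and that the control distances were measured the same way against `n_control_sets` sets of random regions already matched for G+C content. All names and numbers are hypothetical.

```python
from collections import Counter


def bin_counts(distances, bin_size, max_dist):
    """Count distances per bin, keeping only those within +/- max_dist."""
    counts = Counter()
    for d in distances:
        if -max_dist <= d < max_dist:
            counts[d // bin_size] += 1
    return counts


def enrichment_by_distance(observed, control, n_control_sets=1,
                           bin_size=100, max_dist=500):
    """Map each bin start to observed / expected, or None if expected is 0."""
    obs = bin_counts(observed, bin_size, max_dist)
    ctl = bin_counts(control, bin_size, max_dist)
    result = {}
    for b in range(-max_dist // bin_size, max_dist // bin_size):
        expected = ctl[b] / n_control_sets  # mean count per control set
        result[b * bin_size] = obs[b] / expected if expected else None
    return result


# Made-up signed distances (nt) to the nearest terminus.
toy_enrichment = enrichment_by_distance(
    observed=[0, 10, 20, 120, -30],
    control=[0, 130, 260, -200, 400, 10],
    n_control_sets=2,
)
```

Running the function separately for 5’ and 3’ termini, and for sense and antisense sRNAs, would give one curve per combination as in the plot described above.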
