Interrogating the transcriptome in all its diversity
Joel H Graber
• Nature Genetics, June 2000, v25, n2.
Why were so many predictions of the number of genes in a mammalian genome wrong?
Mammalian genomes contain far more transcript variants than protein variants
• Average protein products per locus = 1.7
• Average distinct transcripts per locus = 5.7
Genome Biology (2009) 10:201.
A processed, protein coding mRNA molecule includes distinct functional regions
5’-untranslated region (5’-UTR)
Protein coding sequence
3’-untranslated region (3’-UTR)
Genomic sequence
Pieces of a (Eukaryotic) Protein-Coding Gene (on the genome)
Figure: double-stranded genomic DNA (5’ and 3’ ends marked on both strands) shown at two scales, ~1-100 Mbp and ~1-1000 kbp
• exons (cds & utr): ~10^2-10^3 bp
• introns: ~10^2-10^5 bp
• Polyadenylation site: ~10-100 bp
• promoter: ~10^3 bp
• enhancers: ~10-100 bp
• other regulatory sequences: ~10-100 bp
Alternate mRNA processing can lead to multiple transcript and/or protein products
3 transcripts, 1 protein product
Carolyn demonstrates gene regulation
Transcription control
mRNA degradation
mRNA localization
Protein degradation
Translation control
Protein = water in pool
mRNA = water in hose
DNA = water in pipes
A somewhat more formal view of regulation in the various stages of gene expression
Systematic changes to mRNA processing can significantly change the regulatory program of a cell
• Changes can be in a single gene or systemic
• Regulatory control during transcript generation
– Transcription initiation site
– Splicing pattern
– 3’-processing (polyadenylation and cleavage) site
– RNA editing
• Subsequent isoform-specific regulatory control
– Stability
– Translational efficiency
– Localization
A brief history of transcript measurement
Implications of transcript variation for gene expression measurement
• Most large scale expression studies report one level per gene per sample
– Microarrays:
  • One reported value of expression per probeset
  • Duplicate probesets are either averaged or discarded
– mRNAseq
  • RPKM (reads per kilobase of transcript per million reads)
• For many genes, summarization to one expression level in a given cell type is inadequate
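A minimal sketch of the RPKM summary named above, assuming the usual definition (read count divided by transcript length in kilobases and by mapped reads in millions). The isoform names, lengths, read counts, and the choice to sum isoform RPKMs into a gene-level value are all hypothetical, picked only to show how two samples with opposite isoform usage can collapse to the same single number.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    kilobases = transcript_length_bp / 1_000
    millions = total_mapped_reads / 1_000_000
    return read_count / (kilobases * millions)


# Hypothetical gene with two 3'-UTR isoforms (illustrative values only).
ISOFORM_LENGTH_BP = {"short_utr": 2_000, "long_utr": 4_000}
TOTAL_MAPPED_READS = 10_000_000

sample_a = {"short_utr": 400, "long_utr": 200}  # reads per isoform
sample_b = {"short_utr": 100, "long_utr": 800}


def isoform_rpkms(counts):
    """RPKM for each isoform in one sample."""
    return {
        name: rpkm(n, ISOFORM_LENGTH_BP[name], TOTAL_MAPPED_READS)
        for name, n in counts.items()
    }


def gene_level(counts):
    """Assumed gene-level summary: the sum of the isoform RPKMs."""
    return sum(isoform_rpkms(counts).values())


print(isoform_rpkms(sample_a))  # {'short_utr': 20.0, 'long_utr': 5.0}
print(isoform_rpkms(sample_b))  # {'short_utr': 5.0, 'long_utr': 20.0}
print(gene_level(sample_a), gene_level(sample_b))  # 25.0 25.0
```

Both samples report 25.0 at the gene level even though the dominant isoform has switched, which is the inadequacy the last bullet describes.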
Every time we find a new way to measure RNA, we find previously unknown types
Mattick et al, Trends Genet 2009
Classes of alternative transcripts
• Alternative splicing
• Alternative transcript initiation sites
• Alternative cleavage and polyadenylation (3’-processing)
• Combinations of one or more of these
The cascade of alternative mRNA processing in gene regulation
mRNA processing selections during mRNA generation can have a profound effect on downstream regulation of the resulting transcript
Processing, and specifically alternative processing, is controlled by cis-elements and trans-factors
• mRNA processing signals are typically constrained in both sequence content and positioning
• Activity of specific sites is a function of the strength of the local signals and the cell/environment specific concentrations/activities of trans-factors
Alternative splicing
Alternative splicing can occur in several ways
http://www.wormbook.org/
Splicing signals and interacting factors
Cis elements required for splicing
Figure: consensus splice signals by organism
             5’ss      BP        3’ss
Vertebrates  GUAAGU    CURAY     NCAG
Yeast        GUAUGU    UACUAAC   YAG
Plants       GUAAGU    CURAY     UGYAG
Other labels in the figure: ESE, ESE?, UA-rich, AG, GU, YYYY10-15
5’ss – 5’ splice site (donor site)
3’ss – 3’ splice site (acceptor site)
BP – branch point (A is branch point base)
YYYY10-15 – polypyrimidine tract
Y – pyrimidine
R – purine
N – any base
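The degenerate codes in the legend (Y, R, N) can be expanded into a regular expression to search a sequence for a consensus such as the branch point CURAY. This is a sketch only: the mapping covers just the codes used on this slide, and the test sequences are made up for illustration.

```python
import re

# Only the codes that appear in the legend above, on the RNA alphabet.
IUPAC_RNA = {
    "A": "A", "C": "C", "G": "G", "U": "U",
    "Y": "[CU]",    # pyrimidine
    "R": "[AG]",    # purine
    "N": "[ACGU]",  # any base
}


def find_motif(consensus, rna):
    """Return (start, match) for every occurrence, overlapping ones included."""
    pattern = "".join(IUPAC_RNA[code] for code in consensus)
    # A lookahead with a capture group reports overlapping matches too.
    lookahead = re.compile("(?=(" + pattern + "))")
    return [(m.start(), m.group(1)) for m in lookahead.finditer(rna)]


# Hypothetical sequence, not taken from any real intron.
print(find_motif("CURAY", "GGCUGACAACUAAUGG"))  # [(2, 'CUGAC'), (9, 'CUAAU')]
```

A short degenerate consensus like CURAY matches often by chance, so hits from a scan like this are candidates at best.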
PWM representations of splice site signals (mice)
Frequency of bases in each position of the splice sites
Donor sequences: 5’ splice site
              exon |  intron
%A  30  40  64   9 |   0   0  62  68   9  17  39  24
%U  20   7  13  12 |   0 100   6  12   5  63  22  26
%C  30  43  12   6 |   0   0   2   9   2  12  21  29
%G  19   9  12  73 | 100   0  29  12  84   9  18  20
             A   G |   G   U   A   A   G   U
Acceptor sequences: 3’ splice site
                                                        intron |  exon
%A  15  10  10  15   6  15  11  19  12   3  10  25   4 100   0 |  22  17
%U  51  44  50  53  60  49  49  45  45  57  58  29  31   0   0 |   8  37
%C  19  25  31  21  24  30  33  28  36  36  28  22  65   0   0 |  18  22
%G  15  21  10  10  10   6   7   9   7   7   5  24   1   0 100 |  52  25
     Y   Y   Y   Y   Y   Y   Y   Y   Y   Y   Y   N   Y   A   G |   G
Polypyrimidine tract (Y = U or C; N = any nucleotide)
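One common way to use a frequency table like the donor-site one is as a position weight matrix: each candidate site gets the sum of per-position log-odds scores. The sketch below copies the donor percentages from the slide; the uniform 25% background and the 0.5% pseudocount (needed to avoid log of zero) are assumptions, and this is not necessarily how the original figures were produced.

```python
import math

# Donor (5' splice site) base frequencies in percent, copied from the slide:
# 4 exon positions followed by 8 intron positions.
DONOR_PERCENT = {
    "A": [30, 40, 64, 9, 0, 0, 62, 68, 9, 17, 39, 24],
    "U": [20, 7, 13, 12, 0, 100, 6, 12, 5, 63, 22, 26],
    "C": [30, 43, 12, 6, 0, 0, 2, 9, 2, 12, 21, 29],
    "G": [19, 9, 12, 73, 100, 0, 29, 12, 84, 9, 18, 20],
}
SITE_LENGTH = 12
BACKGROUND = 0.25   # assumption: uniform base composition
PSEUDOCOUNT = 0.5   # assumption: percent added to every cell


def log_odds_score(site):
    """Sum of log2(observed frequency / background) over the 12 positions."""
    if len(site) != SITE_LENGTH:
        raise ValueError("expected a 12-nt site: 4 exon + 8 intron bases")
    score = 0.0
    for pos, base in enumerate(site):
        freq = (DONOR_PERCENT[base][pos] + PSEUDOCOUNT) / (100 + 4 * PSEUDOCOUNT)
        score += math.log2(freq / BACKGROUND)
    return score


def consensus():
    """Most frequent base at each position (ties go to the first of A, U, C, G)."""
    return "".join(
        max("AUCG", key=lambda b: DONOR_PERCENT[b][pos])
        for pos in range(SITE_LENGTH)
    )


best = consensus()
broken = best[:4] + "CC" + best[6:]  # replace the invariant GU with CC
print(best[2:10])  # AGGUAAGU
print(log_odds_score(best) > log_odds_score(broken))  # True
```

Positions 3 to 10 of the computed consensus reproduce the AGGUAAGU line under the donor table, and destroying the invariant GU costs far more than any other single change.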
Example 1: Insulin-like growth factor 1 (Igf1)
• AKA somatomedin C or mechano growth factor
• Produced primarily by the liver as an endocrine hormone
• Primary action is mediated by binding to IGF1R
• Natural activator of the AKT pathway
• A primary mediator of the effects of growth hormone
• Expression has been
– Negatively correlated with lifespan
– Positively correlated with body size
• Its regulatory control remains poorly understood after 30 years
IGF1 is subject to extensive alternative mRNA processing
~83,000 nt
IGF1 mRNA data indicate at least 15 transcript isoforms
Salient features of IGF1 expression
• Mature, circulating IGF1 protein is a cleavage product, coded entirely in exons 3 and 4
• Exon 5 contains an additional peptide cleavage product, with demonstrated independent functionality
• Exons 1 and 2 are mutually exclusive, and likely not the only upstream, transcript initiating exons
• Exon 5 can be skipped, included or 3’-terminal
• Exon 6’s reading frame changes depending on whether it is spliced from exon 4 or 5
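The last bullet follows from simple arithmetic: if the nucleotides upstream of an exon do not add up to a multiple of three, that exon is read in a shifted frame. The toy exons below are hypothetical sequences chosen only to show the effect; they are not IGF1 sequence and their lengths are not the real exon lengths.

```python
def codons(seq):
    """Split a coding sequence into complete codons from its first base."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def entry_frame(upstream_exons):
    """Bases of an unfinished codon carried into the next exon (0, 1, or 2)."""
    return sum(len(exon) for exon in upstream_exons) % 3


# Hypothetical exons (illustrative only).
EXON_A = "AUGGCU"     # 6 nt
EXON_B = "GAAC"       # 4 nt, not a multiple of 3
EXON_C = "GGUUCAUAA"  # 9 nt

skipped = EXON_A + EXON_C            # exon B skipped
included = EXON_A + EXON_B + EXON_C  # exon B included

print(entry_frame([EXON_A]), codons(skipped))
# 0 ['AUG', 'GCU', 'GGU', 'UCA', 'UAA']
print(entry_frame([EXON_A, EXON_B]), codons(included))
# 1 ['AUG', 'GCU', 'GAA', 'CGG', 'UUC', 'AUA']
```

With exon B included, the same exon C bases are read as different codons and the stop codon UAA is no longer in frame.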
IGF1 has two possible terminal exons (5 and 6)
~22,000 nt
IGF1 exon 6, if included, can vary between ~200 and ~6400 nt
Alternative polyadenylation
Alternative 3’-processing can arise in several ways with varying consequences
Adapted from Yan J, et al., Genome Research. 2005; 15(3):369-75.
Figure labels: PAPOL; 68 kD; 73 kD; 160 kD; 25 kD; 30 kD; 100 kD; CPSF; 50 kD; 64 kD; 77 kD; hnRNP H; Symplekin
PolyA site selection depends on sequence elements and abundance/stoichiometry of trans-factors
Figure labels (pre-mRNA drawn 5’ to 3’): AAUAAA; UGUA; G-rich; U-rich; UG-rich; PAS; DSE; CSTF with 64 kD, 77 kD, and 50 kD subunits
Up to >80 proteins in complex
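As a rough illustration of how the sequence elements named on this slide might be searched for, the sketch below finds each AAUAAA signal and reports the U+G fraction of a window downstream of it, as a crude stand-in for a U-rich or UG-rich DSE. The window offsets (10 to 30 nt past the signal) and the test sequence are assumptions made for the example, not values from the slides.

```python
def find_pas_candidates(rna, signal="AAUAAA", dse_start=10, dse_end=30):
    """Return (position, downstream U+G fraction) for every signal occurrence.

    The downstream window runs from dse_start to dse_end nucleotides past the
    end of the signal; both offsets are illustrative assumptions.
    """
    hits = []
    pos = rna.find(signal)
    while pos != -1:
        signal_end = pos + len(signal)
        region = rna[signal_end + dse_start:signal_end + dse_end]
        if region:
            ug_fraction = sum(region.count(base) for base in "UG") / len(region)
        else:
            ug_fraction = 0.0
        hits.append((pos, ug_fraction))
        pos = rna.find(signal, pos + 1)
    return hits


# Hypothetical 3' end: signal, a 10-nt spacer, then a UG-rich stretch.
toy = "CCCC" + "AAUAAA" + "C" * 10 + "UGUUUGUGUUUGUUU" + "CCCCC"
print(find_pas_candidates(toy))  # [(4, 0.75)]
```

A scan like this looks only at cis-elements; by the slide title, site choice also depends on trans-factor abundance, which sequence alone cannot capture.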
NMF (non-negative matrix factorization) defines patterns of signals that control 3’-processing (cleavage and polyadenylation)
Example 2: Insulin-like growth factor 2 mRNA binding protein 1 (Igf2bp1)
• Contains four K homology domains and two RNA recognition motifs
• Binds to the 5’-UTR of IGF2 mRNA, regulating translation
• Can act as an oncogene if misregulated
• Evolutionarily conserved, with critical role in mRNA localization and translational control
Consequences: Igf2bp1 has transforming potential only when expressed in its truncated isoform
~50,000 nt
~6,500 nt
Mayr and Bartel, Cell 2009
Inclusion (or exclusion) of regulatory sequences in the 3’-UTR fine-tunes expression and response
• Spicher et al, Mol Cell Biol 1998
Example 3: Regulated control of polyA site selection for antibodies during B-cell maturation
Alternative transcription initiation
Alternative transcription initiation can arise in several ways with varying consequences
CAGE tags showed an unexpectedly high frequency in the 3’-UTR
3’-UTR CAGE tags occur in evolutionarily conserved contexts with a common local sequence
The definition of a gene becomes much more fluid: Ins2-IGF2
• Two genes with spurious connection?
• One large gene with distinct, disjoint transcripts?
Cleaved 3’-UTR RNA products (uaRNAs) are often tissue-specific and can localize differentially
Next time: Details of measuring transcript differences in large-scale