RNA-Seq with the Tuxedo Suite
Monica Britton, Ph.D. Sr. Bioinformatics Analyst
June 2015 Workshop
The Basic Tuxedo Suite
References
• Trapnell C, et al. 2009. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics
• Trapnell C, et al. 2010. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnology
• Kim D, et al. 2013. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biology
• Roberts A, et al. 2011. Improving RNA-Seq expression estimates by correcting for fragment bias. Genome Biology
• Roberts A, et al. 2011. Identification of novel transcripts in annotated genomes using RNA-Seq. Bioinformatics
• Trapnell C, et al. 2013. Differential analysis of gene regulation at transcript resolution with RNA-Seq. Nature Biotechnology
• Cufflinks assembles transcripts
• Cuffdiff identifies differential expression of genes/transcripts/promoters
Alignment and Differential Expression
[Workflow diagram] Read set(s) + Existing annotation (GTF) → TopHat → bam file(s) → Cuffdiff → Toptables, etc.
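The two-step workflow above can be sketched as a pair of command builders. This is a minimal sketch, not the workshop's actual commands: the file names, output directories, condition labels, and the helper names `tophat_command` and `cuffdiff_command` are all hypothetical, and only the widely documented options `-p` (threads), `-o` (output directory), `-G` (annotation GTF for TopHat), and `-L` (condition labels for Cuffdiff) are used.

```python
import shlex


def tophat_command(index_prefix, reads_fastq, out_dir, gtf=None, threads=4):
    """Build a TopHat command line for one single-end read set."""
    cmd = ["tophat", "-p", str(threads), "-o", out_dir]
    if gtf is not None:
        cmd += ["-G", gtf]  # existing annotation (GTF), if available
    cmd += [index_prefix, reads_fastq]
    return cmd


def cuffdiff_command(gtf, bams_by_condition, out_dir, threads=4):
    """Build a Cuffdiff command line.

    bams_by_condition maps a condition label to its replicate BAM files;
    replicates of one condition are joined with commas, conditions are
    passed as separate arguments.
    """
    labels = ",".join(bams_by_condition)
    cmd = ["cuffdiff", "-p", str(threads), "-o", out_dir, "-L", labels, gtf]
    cmd += [",".join(bams) for bams in bams_by_condition.values()]
    return cmd


if __name__ == "__main__":
    # Hypothetical sample names; TopHat writes accepted_hits.bam into each output directory.
    print(shlex.join(tophat_command("genome", "ctrl_1.fastq", "ctrl_1", gtf="genes.gtf")))
    print(shlex.join(cuffdiff_command(
        "genes.gtf",
        {"ctrl": ["ctrl_1/accepted_hits.bam"], "trt": ["trt_1/accepted_hits.bam"]},
        "cuffdiff_out",
    )))
```

The builders only return argument lists, so the commands can be inspected before being passed to something like `subprocess.run`.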
We followed these steps with the single-end reads
But, do we have all the genes?
• For organisms with sequenced genomes, gene models are stored in a GTF file
• Assumptions:
– The GTF file contains annotation for ALL transcripts and genes
– All splice sites, start/stop codons, etc. are correct
• Are these assumptions correct for every sequenced organism?
• RNA-Seq reads can be used to independently construct genes and splice variants using limited or no annotation
• Method used depends on how much sequence information there is for the organism…
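One quick way to get a feel for what a GTF file does (and does not) contain is to count its genes, transcripts, and exons. The sketch below assumes the common GTF layout of nine tab-separated columns with `gene_id` and `transcript_id` attributes in the last column; the function name `summarize_gtf` and the example records are made up for illustration.

```python
import re
from collections import defaultdict

# Matches attribute pairs such as: gene_id "g1";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def summarize_gtf(lines):
    """Count genes, transcripts, multi-isoform genes and single-exon transcripts."""
    transcripts_by_gene = defaultdict(set)
    exons_by_transcript = defaultdict(int)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            continue  # not a well-formed GTF record
        attrs = dict(ATTR_RE.findall(fields[8]))
        gene_id = attrs.get("gene_id")
        tx_id = attrs.get("transcript_id")
        if gene_id is None:
            continue
        tx_set = transcripts_by_gene[gene_id]
        if tx_id:
            tx_set.add(tx_id)
            if fields[2] == "exon":
                exons_by_transcript[tx_id] += 1
    return {
        "genes": len(transcripts_by_gene),
        "transcripts": sum(len(t) for t in transcripts_by_gene.values()),
        "multi_isoform_genes": sum(1 for t in transcripts_by_gene.values() if len(t) > 1),
        "single_exon_transcripts": sum(1 for n in exons_by_transcript.values() if n == 1),
    }


# Made-up records: gene g1 has two isoforms, gene g2 has one.
EXAMPLE_GTF = [
    'chr1\tdemo\texon\t100\t200\t.\t+\t.\tgene_id "g1"; transcript_id "g1.t1";',
    'chr1\tdemo\texon\t300\t400\t.\t+\t.\tgene_id "g1"; transcript_id "g1.t1";',
    'chr1\tdemo\texon\t100\t400\t.\t+\t.\tgene_id "g1"; transcript_id "g1.t2";',
    'chr2\tdemo\texon\t50\t150\t.\t-\t.\tgene_id "g2"; transcript_id "g2.t1";',
]

if __name__ == "__main__":
    print(summarize_gtf(EXAMPLE_GTF))
```

A summary like this cannot show that an annotation is complete, but unexpectedly few multi-isoform genes or many single-exon transcripts can be a hint that the assumptions above deserve a closer look.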
Gene Construction (Alignment) vs. Assembly
Haas and Zody (2010) Nat. Biotech. 28:421-3
Novel or Non-Model Organisms
Genome-Sequenced Organisms
Trinity software
Gene / Transcriptome Construction
• Annotation can be improved (even for well-annotated model organisms)
– Identify all expressed exons
– Combine expressed exons into genes
– Find all splice variants for a gene
– Discover novel transcripts
• For newly sequenced organisms
– Validate ab initio annotation
– Comparison between different annotation sets
• Can assist in finding some types of contamination
– Reconstruction of rRNA genes
– Genomic/mitochondrial DNA in RNA library preps
Reference Annotation Based Transcript (RABT) Assembly
[Workflow diagram] Read set(s) + Existing annotation (GTF) [optional] → TopHat → bam file(s) → Cufflinks → Read-set specific GTF(s) → Cuffmerge → Merged GTF → Cuffcompare → Final assembly (GTF and stats) → Cuffdiff → Toptables, etc.
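The assembly part of this workflow (Cufflinks, then Cuffmerge, then Cuffcompare) might be scripted roughly as follows. This is a sketch under assumptions: the sample names, directory names, the manifest name `assemblies.txt`, and the helper `rabt_commands` are hypothetical. The options used are `-g` (reference GTF as a guide, which is what makes the Cufflinks run RABT), `-s` (genome FASTA for Cuffmerge), `-r` (reference GTF for Cuffcompare), `-o`, and `-p`.

```python
def rabt_commands(bams, ref_gtf=None, genome_fasta=None, threads=4):
    """Return (commands, manifest_text) for Cufflinks -> Cuffmerge -> Cuffcompare.

    bams maps a sample name to its TopHat BAM file. manifest_text is the
    content of the text file (one GTF path per line) that Cuffmerge reads.
    """
    commands = []
    assembled = []
    for sample, bam in bams.items():
        out_dir = f"cufflinks_{sample}"
        cmd = ["cufflinks", "-p", str(threads), "-o", out_dir]
        if ref_gtf:
            cmd += ["-g", ref_gtf]  # guide assembly with the existing annotation
        cmd.append(bam)
        commands.append(cmd)
        assembled.append(f"{out_dir}/transcripts.gtf")  # read-set specific GTF

    manifest_text = "\n".join(assembled) + "\n"

    merge = ["cuffmerge", "-p", str(threads), "-o", "merged_asm"]
    if ref_gtf:
        merge += ["-g", ref_gtf]
    if genome_fasta:
        merge += ["-s", genome_fasta]
    merge.append("assemblies.txt")  # hypothetical manifest file name
    commands.append(merge)

    compare = ["cuffcompare", "-o", "final"]
    if ref_gtf:
        compare += ["-r", ref_gtf]
    compare.append("merged_asm/merged.gtf")  # merged GTF written by Cuffmerge
    commands.append(compare)
    return commands, manifest_text


if __name__ == "__main__":
    cmds, manifest = rabt_commands(
        {"ctrl_1": "ctrl_1/accepted_hits.bam", "trt_1": "trt_1/accepted_hits.bam"},
        ref_gtf="genes.gtf",
        genome_fasta="genome.fa",
    )
    for c in cmds:
        print(" ".join(c))
    print(manifest, end="")
```

Leaving `ref_gtf` unset drops the `-g` and `-r` options, which corresponds to assembling without existing annotation.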
TopHat Spliced Alignment to a Genome
Cufflinks – Identification of Incompatible Fragments
Incompatible alignment
Cufflinks – Minimum Paths to Transcripts
Cufflinks – Abundance Estimation
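Cufflinks reports abundance in FPKM (fragments per kilobase of transcript per million mapped fragments). The sketch below shows only the arithmetic behind that unit, assuming the fragment count for a transcript is already known; Cufflinks itself uses a statistical model to assign fragments among isoforms, which is not reproduced here. The function name and the numbers are illustrative.

```python
import math


def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and library size must be positive")
    kilobases = transcript_length_bp / 1_000
    millions = total_mapped_fragments / 1_000_000
    return fragments / (kilobases * millions)


if __name__ == "__main__":
    # Illustrative numbers: 500 fragments on a 2 kb transcript, 10 million mapped fragments.
    value = fpkm(500, 2_000, 10_000_000)
    assert math.isclose(value, 25.0)
    print(value)
```

Dividing by transcript length and by library size is what makes values comparable between long and short transcripts and between deeply and shallowly sequenced samples.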
Merging Cufflinks Assemblies
So Now We’ve Explored These Tools…
We’ve Used Other Software in Conjunction
[Workflow diagram] HTSeq-count → Raw Counts → edgeR
(But HTSeq-count and edgeR are independent)
And Then Came Some Extensions…
Modules Introduced in 2014
Cuffquant
• Improves efficiency of running multiple samples
• Stores data in the compressed “.cxb” format, which can later be analyzed with cuffdiff or cuffnorm
Cuffnorm
• Generates tables of expression values that are normalized for library size
• Tables are used as input to Monocle
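How Cuffquant and Cuffnorm fit together can be sketched as follows: Cuffquant is run once per sample, and the resulting `.cxb` files are then handed to Cuffnorm (or Cuffdiff) grouped by condition. The sample names, directories, and helper names are hypothetical; the sketch assumes Cuffquant writes `abundances.cxb` into its output directory and uses only the options `-p`, `-o`, and `-L`.

```python
def cuffquant_command(gtf, bam, out_dir, threads=4):
    """Quantify one sample; the result is stored as <out_dir>/abundances.cxb."""
    return ["cuffquant", "-p", str(threads), "-o", out_dir, gtf, bam]


def cuffnorm_command(gtf, cxb_by_condition, out_dir, threads=4):
    """Build a Cuffnorm command from per-sample .cxb files.

    cxb_by_condition maps a condition label to its replicate .cxb files;
    replicates are comma-joined, conditions are separate arguments.
    """
    labels = ",".join(cxb_by_condition)
    cmd = ["cuffnorm", "-p", str(threads), "-o", out_dir, "-L", labels, gtf]
    cmd += [",".join(files) for files in cxb_by_condition.values()]
    return cmd


if __name__ == "__main__":
    samples = {"ctrl": ["ctrl_1", "ctrl_2"], "trt": ["trt_1", "trt_2"]}  # hypothetical
    cxb_by_condition = {}
    for condition, names in samples.items():
        for name in names:
            print(" ".join(cuffquant_command("merged.gtf", f"{name}/accepted_hits.bam", f"cq_{name}")))
        cxb_by_condition[condition] = [f"cq_{n}/abundances.cxb" for n in names]
    print(" ".join(cuffnorm_command("merged.gtf", cxb_by_condition, "cuffnorm_out")))
```

Because the per-sample quantification is stored on disk, adding a sample later only requires one more Cuffquant run before repeating the Cuffnorm or Cuffdiff step.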
Monocle
• Used to analyze single-cell expression data
• Trapnell, et al., 2014, Nat. Biotech. 32:381
…But Software Continues to Evolve
HISAT (Hierarchical Indexing for Spliced Alignment of Transcripts)
• Kim et al., 2015, Nat. Methods
• Planned to be TopHat3
• Faster than other aligners
• More accurate on simulated reads.
…But Software Continues to Evolve
StringTie
• Pertea et al., 2015, Nat. Biotech
• Probable successor to Cufflinks2
• Assembles more transcripts (based on simulated reads)
Ballgown
• Frazee et al., 2015, Nat. Biotech
• Bioconductor R package
• Probable successor to Cuffdiff2
• Includes useful Tablemaker preprocessor
A New Potential Game-Changer (2015)
Kallisto (“Near-Optimal RNA-Seq Quantification”)
• Bray et al. (http://arxiv.org/abs/1505.02710)
• Extremely fast, uses pseudo-alignment based on k-mers and de Bruijn graphs
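The idea behind k-mer based pseudo-alignment can be illustrated with a toy example: rather than aligning a read base by base, look up its k-mers and intersect the sets of transcripts that contain them. This is a deliberately simplified sketch, not Kallisto's implementation. It uses a plain dictionary instead of a de Bruijn graph, checks every k-mer, ignores the reverse strand, and treats any unseen k-mer as "no compatible transcript"; the sequences and function names are made up.

```python
from collections import defaultdict


def build_kmer_index(transcripts, k):
    """Map each k-mer to the set of transcript names that contain it."""
    index = defaultdict(set)
    for name, seq in transcripts.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index


def pseudoalign(read, index, k):
    """Return the transcripts compatible with every k-mer of the read."""
    compatible = None
    for i in range(len(read) - k + 1):
        hits = index.get(read[i:i + k])
        if not hits:
            return set()  # simplification: an unseen k-mer rules out all transcripts
        compatible = set(hits) if compatible is None else compatible & hits
        if not compatible:
            return set()
    return compatible or set()


# Two made-up transcripts that share a prefix and then diverge.
TRANSCRIPTS = {"t1": "ACGTACGGATTC", "t2": "ACGTACTTGCAA"}

if __name__ == "__main__":
    idx = build_kmer_index(TRANSCRIPTS, k=4)
    for read in ["ACGTAC", "TACGGAT", "GTACTTG", "GGGGGG"]:
        print(read, sorted(pseudoalign(read, idx, k=4)))
```

A read from the shared prefix stays ambiguous between both transcripts, while reads that cross into the divergent region are compatible with only one. Quantification then works from these compatibility sets, so no base-level alignment is ever computed.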
[Figure panels: Speed, Accuracy]