RNA-Seq with the Tuxedo Suite

Monica Britton, Ph.D., Sr. Bioinformatics Analyst

June 2015 Workshop

Page 2: RNA-Seq with the Tuxedo Suite

The Basic Tuxedo Suite

References

• Trapnell C, et al. 2009. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics

• Trapnell C, et al. 2010. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnology

• Kim D, et al. 2013. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biology

• Roberts A, et al. 2011. Improving RNA-Seq expression estimates by correcting for fragment bias. Genome Biology

• Roberts A, et al. 2011. Identification of novel transcripts in annotated genomes using RNA-Seq. Bioinformatics

• Trapnell C, et al. 2013. Differential analysis of gene regulation at transcript resolution with RNA-Seq. Nature Biotechnology

• Cufflinks assembles transcripts

• Cuffdiff identifies differential expression of genes/transcripts/promoters

Page 3: RNA-Seq with the Tuxedo Suite

Alignment and Differential Expression

Workflow (from the slide diagram): Read set(s) → TopHat → BAM file(s) → Cuffdiff → Toptables, etc.

Existing annotation (GTF) is supplied as an additional input.

We followed these steps with the single-end reads
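The two-step workflow above can be sketched as command lines assembled in Python. This is only a minimal sketch, not the commands used in the workshop: the index prefix, sample names, file names, and output directories are hypothetical placeholders, and only commonly documented options are used (`-G`, `-o`, `-p` for TopHat; `-o`, `-L`, `-p` for Cuffdiff). It relies on TopHat writing `accepted_hits.bam` into its output directory.

```python
import shlex

def tophat_cmd(index_prefix, reads, gtf, out_dir, threads=4):
    """Build a TopHat command that aligns one single-end read set."""
    return ["tophat", "-G", gtf, "-o", out_dir, "-p", str(threads),
            index_prefix, reads]

def cuffdiff_cmd(gtf, bams_by_condition, out_dir, threads=4):
    """Build a Cuffdiff command; replicate BAMs of a condition are comma-joined."""
    labels = ",".join(bams_by_condition)
    cmd = ["cuffdiff", "-o", out_dir, "-L", labels, "-p", str(threads), gtf]
    cmd += [",".join(bams) for bams in bams_by_condition.values()]
    return cmd

# Hypothetical experiment: two conditions with two replicates each.
samples = {"control": ["c1", "c2"], "treated": ["t1", "t2"]}

align_cmds = [
    tophat_cmd("genome_index", f"{name}.fastq", "genes.gtf", f"tophat_{name}")
    for names in samples.values()
    for name in names
]
bams = {
    condition: [f"tophat_{name}/accepted_hits.bam" for name in names]
    for condition, names in samples.items()
}
diff_cmd = cuffdiff_cmd("genes.gtf", bams, "cuffdiff_out")

# Print the commands instead of running them, so they can be reviewed first.
for cmd in align_cmds + [diff_cmd]:
    print(shlex.join(cmd))
```

Printing the commands rather than executing them keeps the sketch safe to run on a machine without the tools installed; each list could be passed to `subprocess.run` once the paths are real.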

Page 4: RNA-Seq with the Tuxedo Suite

But, do we have all the genes?

• For organisms with sequenced genomes, gene models are stored in a GTF file

• Assumptions:

– The GTF file contains annotation for ALL transcripts and genes

– All splice sites, start/stop codons, etc. are correct

• Are these assumptions correct for every sequenced organism?

• RNA-Seq reads can be used to independently construct genes and splice variants using limited or no annotation

• Method used depends on how much sequence information there is for the organism…
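One quick way to see what an annotation actually contains is to count the transcripts recorded for each gene. The sketch below assumes the standard nine-column, tab-separated GTF layout with `gene_id` and `transcript_id` attributes on `exon` records; the example records are invented. It cannot show that an annotation is complete, only what is in it.

```python
import re
from collections import defaultdict

# Matches attribute pairs such as: gene_id "geneA";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')

def transcripts_per_gene(gtf_lines):
    """Map each gene_id to the set of transcript_ids seen on exon records."""
    genes = defaultdict(set)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue  # only exon records are counted in this sketch
        attrs = dict(ATTR_RE.findall(fields[8]))
        if "gene_id" in attrs and "transcript_id" in attrs:
            genes[attrs["gene_id"]].add(attrs["transcript_id"])
    return genes

# Invented records: geneA has two splice variants, geneB has one.
example = [
    '# toy annotation',
    'chr1\tdemo\texon\t100\t200\t.\t+\t.\tgene_id "geneA"; transcript_id "geneA.1";',
    'chr1\tdemo\texon\t300\t400\t.\t+\t.\tgene_id "geneA"; transcript_id "geneA.1";',
    'chr1\tdemo\texon\t100\t200\t.\t+\t.\tgene_id "geneA"; transcript_id "geneA.2";',
    'chr1\tdemo\texon\t900\t950\t.\t-\t.\tgene_id "geneB"; transcript_id "geneB.1";',
]

counts = {gene: len(ids) for gene, ids in transcripts_per_gene(example).items()}
print(counts)
```

Genes with a single annotated transcript in a well-expressed region are natural candidates to inspect when asking whether splice variants are missing.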

Page 5: RNA-Seq with the Tuxedo Suite

Gene Construction (Alignment) vs. Assembly

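Figure from Haas and Zody (2010) Nat. Biotech. 28:421-3, contrasting two routes:

• Genome-Sequenced Organisms: gene construction by alignment to the genome

• Novel or Non-Model Organisms: de novo assembly (Trinity software)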

Page 6: RNA-Seq with the Tuxedo Suite

Gene / Transcriptome Construction

• Annotation can be improved, even for well-annotated model organisms

– Identify all expressed exons

– Combine expressed exons into genes

– Find all splice variants for a gene

– Discover novel transcripts

• For newly sequenced organisms

– Validate ab initio annotation

– Comparison between different annotation sets

• Can assist in finding some types of contamination

– Reconstruction of rRNA genes

– Genomic/mitochondrial DNA in RNA library preps

Page 7: RNA-Seq with the Tuxedo Suite

Reference Annotation Based Transcript (RABT) Assembly

Workflow (from the slide diagram): Read set(s) → TopHat → BAM file(s) → Cufflinks → Read-set specific GTF(s) → Cuffmerge → Merged GTF → Cuffcompare → Final assembly (GTF and stats) → Cuffdiff → Toptables, etc.

Existing annotation (GTF) is an optional input.

Page 8: RNA-Seq with the Tuxedo Suite

TopHat Spliced Alignment to a Genome

Page 9: RNA-Seq with the Tuxedo Suite

Reference Annotation Based Transcript (RABT) Assembly
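The assembly and merge steps of an RABT run can be sketched as command lines built in Python. This is a hedged illustration, not the workshop's script: sample names, paths, and the genome FASTA name are hypothetical. It relies on Cufflinks writing `transcripts.gtf` into its output directory, on `-g` being the Cufflinks option that uses a reference annotation to guide assembly, and on Cuffmerge taking a text file listing one assembly GTF per line. The Cuffcompare and Cuffdiff steps are left out.

```python
import os
import tempfile
from pathlib import Path

def cufflinks_cmd(bam, out_dir, ref_gtf=None, threads=4):
    """Assemble transcripts from one BAM; ref_gtf turns on annotation-guided assembly."""
    cmd = ["cufflinks", "-o", out_dir, "-p", str(threads)]
    if ref_gtf:
        cmd += ["-g", ref_gtf]  # the existing annotation is optional
    return cmd + [bam]

def write_manifest(gtf_paths, manifest_path):
    """Write the list of per-sample assemblies that Cuffmerge reads."""
    Path(manifest_path).write_text("\n".join(gtf_paths) + "\n")
    return manifest_path

def cuffmerge_cmd(manifest, genome_fasta, out_dir, ref_gtf=None, threads=4):
    """Merge the per-sample assemblies into a single GTF."""
    cmd = ["cuffmerge", "-o", out_dir, "-s", genome_fasta, "-p", str(threads)]
    if ref_gtf:
        cmd += ["-g", ref_gtf]
    return cmd + [manifest]

# Hypothetical samples that were already aligned with TopHat.
sample_names = ["c1", "t1"]
assemble_cmds = [
    cufflinks_cmd(f"tophat_{s}/accepted_hits.bam", f"cufflinks_{s}", ref_gtf="genes.gtf")
    for s in sample_names
]
work_dir = tempfile.mkdtemp()
manifest = write_manifest(
    [f"cufflinks_{s}/transcripts.gtf" for s in sample_names],
    os.path.join(work_dir, "assemblies.txt"),
)
merge_cmd = cuffmerge_cmd(manifest, "genome.fa", "merged_asm", ref_gtf="genes.gtf")

for cmd in assemble_cmds + [merge_cmd]:
    print(" ".join(cmd))
```

Leaving `ref_gtf` as `None` in both functions gives the annotation-free variant of the same workflow.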

Page 10: RNA-Seq with the Tuxedo Suite

Cufflinks – Identification of Incompatible Fragments

Incompatible alignment
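The idea of an incompatible pair can be illustrated with a small check: two fragments cannot come from the same transcript if one is spliced across a region that the other covers with aligned bases. The sketch below is a simplification written for this transcript, not Cufflinks' code. Fragments are represented as lists of aligned blocks with half-open coordinates, all coordinates are invented, and fragments that do not overlap at all are simply treated as compatible.

```python
def implied_introns(blocks):
    """Gaps between consecutive aligned blocks of one fragment."""
    return [(blocks[i][1], blocks[i + 1][0]) for i in range(len(blocks) - 1)]

def overlaps(a, b):
    """True if half-open intervals a and b share at least one base."""
    return a[0] < b[1] and b[0] < a[1]

def compatible(frag_x, frag_y):
    """False if an intron implied by one fragment overlaps a block of the other."""
    for first, second in ((frag_x, frag_y), (frag_y, frag_x)):
        for intron in implied_introns(first):
            if any(overlaps(intron, block) for block in second):
                return False
    return True

# Invented alignments on one chromosome.
spliced = [(100, 200), (300, 400)]       # skips 200-300 as an intron
same_splice = [(150, 200), (300, 350)]   # uses the same intron
reads_through = [(150, 250)]             # covers bases inside that intron

print(compatible(spliced, same_splice))    # same junction, so compatible
print(compatible(spliced, reads_through))  # intron vs. aligned bases, so incompatible
```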

Page 11: RNA-Seq with the Tuxedo Suite

Cufflinks – Minimum Paths to Transcripts
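Finding the fewest transcripts that explain all fragments can be framed as a minimum path cover problem on a directed acyclic graph, which reduces to maximum bipartite matching. The sketch below shows that general technique on an invented graph; it is an assumption-laden illustration, not the Cufflinks implementation. Nodes stand for fragments and an edge `u → v` means the two fragments are compatible and `v` lies downstream of `u`. The number of paths equals the number of nodes minus the size of the maximum matching.

```python
def min_path_cover(n_nodes, edges):
    """Return a minimum set of vertex-disjoint paths covering a DAG."""
    adjacency = [[] for _ in range(n_nodes)]
    for u, v in edges:
        adjacency[u].append(v)

    # predecessor[v] = u means edge u -> v was chosen by the matching.
    predecessor = [-1] * n_nodes

    def try_augment(u, seen):
        """Augmenting-path step of the classic bipartite matching algorithm."""
        for v in adjacency[u]:
            if v in seen:
                continue
            seen.add(v)
            if predecessor[v] == -1 or try_augment(predecessor[v], seen):
                predecessor[v] = u
                return True
        return False

    for u in range(n_nodes):
        try_augment(u, set())

    # Follow the chosen edges from every node that has no predecessor.
    successor = {u: v for v, u in enumerate(predecessor) if u != -1}
    paths = []
    for start in range(n_nodes):
        if predecessor[start] != -1:
            continue
        path = [start]
        while path[-1] in successor:
            path.append(successor[path[-1]])
        paths.append(path)
    return paths

# Invented example: fragment 0 is compatible with two alternative continuations.
toy_edges = [(0, 1), (1, 2), (0, 3), (3, 4)]
toy_paths = min_path_cover(5, toy_edges)
print(toy_paths)
```

In this toy graph no single path can visit all five fragments, so two paths are needed, corresponding to two candidate isoforms.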

Page 12: RNA-Seq with the Tuxedo Suite

Cufflinks – Abundance Estimation

Page 13: RNA-Seq with the Tuxedo Suite

Cufflinks – Abundance Estimation
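Cufflinks reports abundance in FPKM (fragments per kilobase of transcript per million mapped fragments). The sketch below computes that unit directly from a fragment count; it is a naive calculation for illustration only and does not reproduce the statistical model Cufflinks uses to divide fragments among overlapping isoforms or to correct for bias. The numbers are invented.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and library size must be positive")
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (fragments per million)
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)

# Invented example: 500 fragments on a 2 kb transcript in a 10-million-fragment library.
example_fpkm = fpkm(500, 2000, 10_000_000)
print(example_fpkm)
```

Dividing by both length and library size is what lets values be compared between transcripts of different lengths and between libraries of different depths.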

Page 14: RNA-Seq with the Tuxedo Suite

Merging Cufflinks Assemblies

Page 15: RNA-Seq with the Tuxedo Suite

So Now We’ve Explored These Tools…

Page 16: RNA-Seq with the Tuxedo Suite

We’ve Used Other Software in Conjunction

HTSeq-count → Raw Counts → edgeR

(But HTSeq-count and edgeR are independent)
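Because edgeR works from a table of raw counts, the per-sample output of HTSeq-count is usually combined into one matrix first. A minimal sketch of that step follows. It assumes the two-column, tab-separated output of `htseq-count` with summary rows whose names start with `__` (older HTSeq versions name these rows differently), and the sample names and counts are invented.

```python
import csv
import io

def read_htseq_counts(handle):
    """Read one htseq-count output file into a dict of feature -> raw count."""
    counts = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) != 2 or row[0].startswith("__"):
            continue  # skip malformed lines and summary rows such as __no_feature
        counts[row[0]] = int(row[1])
    return counts

def count_matrix(per_sample):
    """Combine per-sample counts into a header plus one row per gene."""
    samples = list(per_sample)
    genes = sorted(set().union(*per_sample.values()))
    rows = [[g] + [per_sample[s].get(g, 0) for s in samples] for g in genes]
    return ["gene_id"] + samples, rows

# Invented outputs for two samples.
sample_a = io.StringIO("geneA\t10\ngeneB\t0\n__no_feature\t7\n")
sample_b = io.StringIO("geneA\t25\ngeneB\t3\n__no_feature\t9\n")

header, rows = count_matrix({
    "sample_a": read_htseq_counts(sample_a),
    "sample_b": read_htseq_counts(sample_b),
})
print(header)
print(rows)
```

The counts are kept as untouched integers because any normalization is left to the downstream package.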

Page 17: RNA-Seq with the Tuxedo Suite

And Then Came Some Extensions…

Page 18: RNA-Seq with the Tuxedo Suite

Modules Introduced in 2014

Cuffquant

• Improves efficiency of running multiple samples

• Stores data in the compressed “.cxb” format, which can later be analyzed with Cuffdiff or Cuffnorm

Cuffnorm

• Generates tables of expression values that are normalized for library size

• Tables are used as input to Monocle
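Normalizing for library size can be done in several ways. The sketch below shows one widely used scheme, the median-of-ratios ("geometric") method popularized by DESeq, on invented counts. It is offered as an illustration of the idea only; it is not Cuffnorm's code, and the method Cuffnorm applies by default should be checked in its manual.

```python
import math
from statistics import median

def size_factors(counts):
    """counts: dict of sample -> list of raw counts, all in the same gene order."""
    samples = list(counts)
    n_genes = len(counts[samples[0]])
    log_ratios = {s: [] for s in samples}
    for g in range(n_genes):
        values = [counts[s][g] for s in samples]
        if min(values) <= 0:
            continue  # genes with a zero count in any sample are skipped
        log_geo_mean = sum(math.log(v) for v in values) / len(values)
        for s, v in zip(samples, values):
            log_ratios[s].append(math.log(v) - log_geo_mean)
    if not log_ratios[samples[0]]:
        raise ValueError("no gene has a positive count in every sample")
    # Each sample's factor is the median ratio to the per-gene geometric mean.
    return {s: math.exp(median(r)) for s, r in log_ratios.items()}

# Invented counts: library_b was sequenced about twice as deeply as library_a.
raw = {"library_a": [10, 20, 30, 0], "library_b": [20, 40, 60, 5]}
factors = size_factors(raw)
normalized = {s: [c / factors[s] for c in raw[s]] for s in raw}
print(factors)
```

After dividing by the size factors, genes that differ only because of sequencing depth end up with matching values across libraries.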

Monocle

• Used to analyze single-cell expression data

• Trapnell, et al., 2014, Nat. Biotech. 32:381

Page 19: RNA-Seq with the Tuxedo Suite

…But Software Continues to Evolve

HISAT (Hierarchical Indexing for Spliced Alignment of Transcripts)

• Kim et al., 2015, Nat. Methods

• Planned to become TopHat3

• Faster than other aligners

• More accurate on simulated reads

Page 20: RNA-Seq with the Tuxedo Suite

…But Software Continues to Evolve

StringTie

• Pertea et al., 2015, Nat. Biotech.

• Probable successor to Cufflinks2

• Assembles more transcripts (based on simulated reads)

Ballgown

• Frazee et al., 2015, Nat. Biotech.

• Bioconductor R package

• Probable successor to Cuffdiff2

• Includes useful Tablemaker preprocessor

Page 21: RNA-Seq with the Tuxedo Suite

A New Potential Game-Changer (2015)

Kallisto (“Near-Optimal RNA-Seq Quantification”)

• Bray et al. (http://arxiv.org/abs/1505.02710)

• Extremely fast; uses pseudo-alignment based on k-mers and de Bruijn graphs
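The core idea of pseudo-alignment can be shown with a toy example: instead of aligning a read base by base, look up its k-mers and intersect the sets of transcripts that contain them. The sketch below is a deliberately simplified illustration with invented sequences. It uses a plain dictionary rather than a de Bruijn graph, ignores reverse complements and sequencing errors, and is not how Kallisto is implemented.

```python
def kmers(seq, k):
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def build_index(transcripts, k):
    """Map each k-mer to the set of transcripts that contain it."""
    index = {}
    for name, seq in transcripts.items():
        for kmer in kmers(seq, k):
            index.setdefault(kmer, set()).add(name)
    return index

def pseudoalign(read, index, k):
    """Return the transcripts compatible with every k-mer of the read."""
    compatible = None
    for kmer in kmers(read, k):
        hits = index.get(kmer)
        if hits is None:
            return set()  # simplification: one unknown k-mer rejects the read
        compatible = set(hits) if compatible is None else compatible & hits
    return compatible or set()

# Invented transcripts that share their first eight bases.
K = 5
toy_transcripts = {"t1": "ACGTACGGTTCA", "t2": "ACGTACGGAAGT"}
toy_index = build_index(toy_transcripts, K)

shared = pseudoalign("CGTACGG", toy_index, K)    # lies in the shared region
specific = pseudoalign("ACGGTTC", toy_index, K)  # extends into t1 only
unknown = pseudoalign("TTTTTTT", toy_index, K)   # matches nothing
print(shared, specific, unknown)
```

The output of this step is a set of compatible transcripts per read rather than a base-level alignment, which is why it can be so much faster.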

Figure panels: Speed; Accuracy