Expression Analysis of RNA-seq Data Manuel Corpas Plant and Animal Genomes Project Leader...

Expression Analysis of RNA-seq Data

Manuel Corpas, Plant and Animal Genomes Project


Generation of Sequence

Mapping Reads

Assembly of Transcripts

Statistical Analysis

1 Summarization (by exon, by transcript, by gene)

2 Normalization (within sample and between sample)

3 Differential expression testing (Poisson test, negative binomial test)

Identification of splice junctions

The Tuxedo Tools

• Developed by Institute of Genetic Medicine at Johns Hopkins University / University of California, Berkeley / Harvard University

• 157 PubMed citations

Tophat
– Fast short read aligner (Bowtie)
– Spliced read identification (Tophat)

Cufflinks package
– Cufflinks – Transcript assembly
– Cuffmerge – Merges multiple transcript assemblies
– Cuffcompare – Compares transcript assemblies to reference annotation
– Cuffdiff – Identifies differentially expressed genes and transcripts

CummeRbund
– Visualisation of differential expression results

RNA-seq Experimental design

• Sequencing technology (SOLiD, Illumina)
– HiSeq 2000, 150 million read pairs per lane, 100 bp
• Single end (SE), paired end (PE), strand specific
– SE: quantification against known genes
– PE: novel transcripts, transcript-level quantification
• Read length (50-100 bp)
– Greater read length aids mapping accuracy, splice variant assignment and identification of novel junctions
• Number of replicates
– Technical replicates: RNA-seq is often noted to have substantially less technical variability than microarrays
– Biological replicates should be included (at least 3 and preferably more)
• Sequencing depth
– Dependent on experimental aims

RNA-seq Experimental design

• Extrapolation of the sigmoid shape suggests 20% of transcripts not expressed
• First saturation effects set in at ~40 million read alignments
• ~240 million reads achieve 84% transcript recall

Toung et al. RNA-sequence analysis of human B-cells. Genome Research (2011).

Labaj et al. Characterization and improvement of RNA-Seq precision in quantitative transcript expression profiling. Bioinformatics (2011).

RNA-seq Experimental design

General guide
• Quantify expression of highly to moderately expressed known genes
– ~20 million mapped reads, PE, 2 x 50 bp
• Assess expression of alternative splice variants, novel transcripts, and strong quantification including low copy transcripts
– in excess of 50 million reads, PE, 2 x 100 bp

Example
• Examine gene expression in 6 different conditions with 3 biological replicates (18 samples)
• Multiplexing 6 samples per lane on 3 lanes of the HiSeq (50 bp PE)
• Generates ~25 M reads per sample
• Assuming ~80% of reads map/pass additional QC (20 M mapped reads per sample)
• Cost – 3 lanes (£978 x 3), 18 libraries (£105 x 18), total £4824
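The arithmetic behind the example can be checked with a short Python sketch. The function name `plan_run` and its parameter names are made up for illustration; the numbers are the ones quoted on the slides (150 million read pairs per HiSeq 2000 lane, 6 samples per lane, ~80% of reads retained, £978 per lane, £105 per library).

```python
import math

def plan_run(n_samples, samples_per_lane, read_pairs_per_lane,
             pass_fraction, lane_cost, library_cost):
    """Return lanes needed, raw and usable reads per sample, and total cost."""
    lanes = math.ceil(n_samples / samples_per_lane)
    raw_per_sample = read_pairs_per_lane / samples_per_lane
    usable_per_sample = raw_per_sample * pass_fraction
    total_cost = lanes * lane_cost + n_samples * library_cost
    return lanes, raw_per_sample, usable_per_sample, total_cost

# 6 conditions x 3 biological replicates = 18 samples
lanes, raw, usable, cost = plan_run(
    n_samples=6 * 3, samples_per_lane=6, read_pairs_per_lane=150e6,
    pass_fraction=0.8, lane_cost=978, library_cost=105)

print(f"{lanes} lanes, {raw / 1e6:.0f} M reads/sample, "
      f"{usable / 1e6:.0f} M usable, total £{cost}")
```

Changing `samples_per_lane` shows the trade-off between depth per sample and cost: fewer samples per lane means more lanes for the same 18 libraries.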

Step 1 – Preprocessing reads

• Sequence data provided as FASTQ files
• QC analysis – sequence quality, adapter contamination (FastQC)
• Quality trimming, adapter removal (FASTX, Prinseq, Sickle)
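To make the idea of quality trimming concrete, here is a minimal Python sketch that removes low-quality bases from the 3' end of a read. It is not the algorithm used by FASTX, Prinseq or Sickle; it assumes Phred+33 encoded qualities, and the threshold of 20 and the function names are illustrative choices.

```python
def phred33_scores(quality_string):
    """Decode a Phred+33 FASTQ quality string into integer scores."""
    return [ord(ch) - 33 for ch in quality_string]

def trim_3prime(sequence, quality_string, min_quality=20):
    """Drop bases from the 3' end until a base reaches min_quality."""
    scores = phred33_scores(quality_string)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_quality:
        end -= 1
    return sequence[:end], quality_string[:end]

# 'I' encodes Q40, '#' encodes Q2 and '!' encodes Q0 in Phred+33.
seq, qual = trim_3prime("ACGTACGTAC", "IIIIIIII#!")
print(seq, qual)
```

Real trimmers typically add a sliding window and a minimum read length after trimming; this sketch only shows the basic decoding and cut-off step.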

Step 2 – Data sources

• Reads (FASTQ, Phred+33)
• Genomic reference (FASTA, TAIR10), or pre-built Bowtie index
• GTF/GFF file of gene calls (TAIR10)
• http://tophat.cbcb.umd.edu/igenomes.html

Tuxedo Protocol

[Workflow diagram: paired-end reads (Read 1, Read 2; FASTQ) from four samples, Leaf (SAM1), Leaf (SAM2), Flower (SAM3) and Flower (SAM4), are processed as follows: TOPHAT (Read Mapping, with GTF + Genome) → Alignments (BAM) → CUFFLINKS (Transcript Assembly, with GTF) → Assemblies (GTF) → CUFFMERGE (Final Transcript Assembly, with GTF) → Assembly (GTF) → CUFFCOMPARE (Comparison to reference) and CUFFDIFF (Differential expression results) → CUMMERBUND (Expression Plots) → Visualisation (PDF)]

Step 3 Tuxedo Protocol - TOPHAT

[Workflow diagram: Reads (FASTQ; Read 1 and Read 2 for Leaf (SAM1), Leaf (SAM2), Flower (SAM3), Flower (SAM4)) plus GTF + Genome → TOPHAT → Alignments (BAM)]

• Non-spliced reads mapped by Bowtie
– Reads mapped directly to transcriptome sequence
• Spliced reads identified by Tophat
– Initial mapping used to build a database of spliced junctions
– Input reads split into smaller segments
• Coverage islands
• Paired end reads map to distinct regions
• Segments map in distinct regions
• Long reads (>=75 bp) used to identify GT-AG, GC-AG and AT-AC splicings
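The splice-site motifs named above can be illustrated with a small Python sketch that checks the first and last two bases of a candidate intron. This is not TopHat's implementation; the function names, the toy sequence and the restriction to forward-strand introns are assumptions made for the example.

```python
# Donor/acceptor dinucleotide pairs listed on the slide.
SUPPORTED_MOTIFS = {("GT", "AG"), ("GC", "AG"), ("AT", "AC")}

def intron_motif(genome, intron_start, intron_end):
    """Return (donor, acceptor) dinucleotides of a forward-strand intron.

    Coordinates are 0-based and end-exclusive.
    """
    intron = genome[intron_start:intron_end].upper()
    return intron[:2], intron[-2:]

def has_supported_motif(genome, intron_start, intron_end):
    """True if the intron boundaries match GT-AG, GC-AG or AT-AC."""
    return intron_motif(genome, intron_start, intron_end) in SUPPORTED_MOTIFS

# Toy sequence: exon (5 bp) + intron (15 bp) + exon (5 bp).
toy_genome = "CCCAG" + "GTAAGTTTTTTTCAG" + "GACCC"
print(intron_motif(toy_genome, 5, 20), has_supported_motif(toy_genome, 5, 20))
```

A reverse-strand intron would need the sequence reverse-complemented before the same check is applied.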

Step 3 Tuxedo Protocol - TOPHAT


• -i/--min-intron-length <int> 40
• -I/--max-intron-length <int> 5000
• -a/--min-anchor-length <int> 10
• -g/--max-multihits <int> 20
• -G/--GTF <GTF/GFF3 file>
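As a sketch of how these options fit together, the Python snippet below assembles a TopHat command line for one paired-end sample and prints it without running it. The option values are those on the slide; the file names, the index prefix and the helper `tophat_command` are hypothetical, and `-o` (output directory) is a TopHat option not listed on the slide.

```python
import shlex

def tophat_command(index_prefix, reads_1, reads_2, annotation_gtf, output_dir):
    """Build (but do not run) a TopHat command for one paired-end sample."""
    return [
        "tophat",
        "-i", "40",            # --min-intron-length
        "-I", "5000",          # --max-intron-length
        "-a", "10",            # --min-anchor-length
        "-g", "20",            # --max-multihits
        "-G", annotation_gtf,  # --GTF, known gene models
        "-o", output_dir,      # output directory (not on the slide)
        index_prefix, reads_1, reads_2,
    ]

# Hypothetical file names for the first leaf sample.
cmd = tophat_command("TAIR10", "leaf1_R1.fastq", "leaf1_R2.fastq",
                     "TAIR10.gtf", "tophat_leaf1")
print(shlex.join(cmd))
```

The resulting list could be passed to `subprocess.run(cmd, check=True)` once TopHat and a Bowtie index are available on the system.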



Step 4 Tuxedo Protocol - CUFFLINKS

[Workflow diagram: Alignments (BAM) for the four samples plus GTF → CUFFLINKS → Assemblies (GTF)]


• Accurate quantification of a gene requires identifying which isoform produced each read.

• Reference Annotation Based Transcript (RABT) assembly
• Sequence bias correction: -b/--frag-bias-correct <genome.fa>
• Multi-mapped read correction is enabled (-u/--multi-read-correct)
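The options above can be combined into a single Cufflinks call per sample. The Python sketch below only builds and prints the command. The file names and the helper `cufflinks_command` are hypothetical; `-g` (--GTF-guide, which requests RABT assembly) and `-o` (output directory) are Cufflinks options that the slide does not spell out.

```python
import shlex

def cufflinks_command(alignments_bam, annotation_gtf, genome_fasta, output_dir):
    """Build (but do not run) a Cufflinks command for one sample."""
    return [
        "cufflinks",
        "-g", annotation_gtf,  # --GTF-guide: reference-guided (RABT) assembly
        "-b", genome_fasta,    # --frag-bias-correct
        "-u",                  # --multi-read-correct
        "-o", output_dir,      # output directory
        alignments_bam,
    ]

# Hypothetical paths following on from a TopHat run.
cmd = cufflinks_command("tophat_leaf1/accepted_hits.bam", "TAIR10.gtf",
                        "TAIR10.fa", "cufflinks_leaf1")
print(shlex.join(cmd))
```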



Tuxedo Protocol - CUFFDIFF

[Workflow diagram: Alignments (BAM) for the four samples, the merged Assembly (GTF) and a GTF mask file → CUFFDIFF]

CUFFDIFF output – FPKM (fragments per kilobase of transcript per million fragments mapped) values, fold change, test statistic, p-value, significance statement.
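The FPKM definition can be written out directly. The Python sketch below applies the plain formula to made-up counts and also computes a log2 fold change between two conditions. Cufflinks and Cuffdiff estimate per-isoform fragment counts with a statistical model, so this illustrates the unit only, not their estimation procedure; all numbers are hypothetical.

```python
import math

def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million fragments mapped."""
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)

def log2_fold_change(fpkm_a, fpkm_b):
    """log2 ratio of condition B over condition A (both must be > 0)."""
    return math.log2(fpkm_b / fpkm_a)

# Hypothetical 2 kb transcript in libraries of 20 million mapped fragments.
leaf = fpkm(500, 2000, 20_000_000)
flower = fpkm(2000, 2000, 20_000_000)
print(leaf, flower, log2_fold_change(leaf, flower))
```

Dividing by transcript length and library size is what makes values comparable between transcripts of different lengths and samples of different depths.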

CUFFDIFF - Summarisation

• Cuffdiff output (11 files)
– FPKM tracking files (Transcript, Gene, CDS, Primary transcript)
– Differential expression tests (Transcript, Gene, CDS, Primary transcript)
– Differential splicing tests – splicing.diff
– Differential coding output – cds.diff
– Differential promoter use – promoters.diff

[Figure: three isoforms (A), (B) and (C) of one gene, drawn with exons 1 2 3 4, 1 2 4 and 1 2 4]

1. A + B + C: grouped at gene level
2. B + C: grouped at CDS level
3. A + C: grouped at primary transcript level
4. A, B, C: not grouped at the transcript level

Look at difference in distribution (rather than total level)
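The grouping levels can be sketched in Python by summing isoform FPKMs within each group and then looking at each isoform's share of its group. The FPKM values and group names are invented for illustration, and the singleton groups (A alone at CDS level, B alone at primary transcript level) are inferred from the list above rather than stated on the slide. Cuffdiff's own tests are more involved than this.

```python
# Hypothetical FPKM values for the three isoforms.
isoform_fpkm = {"A": 10.0, "B": 30.0, "C": 60.0}

# Group membership at each summarisation level.
levels = {
    "gene": {"gene1": ["A", "B", "C"]},
    "CDS": {"cds1": ["B", "C"], "cds2": ["A"]},
    "primary transcript": {"tss1": ["A", "C"], "tss2": ["B"]},
}

def summarise(fpkm_by_isoform, grouping):
    """Sum isoform FPKMs within each group."""
    return {name: sum(fpkm_by_isoform[i] for i in members)
            for name, members in grouping.items()}

def relative_abundance(fpkm_by_isoform, members):
    """Share of the group's total FPKM contributed by each isoform."""
    total = sum(fpkm_by_isoform[i] for i in members)
    if total == 0:
        return {i: 0.0 for i in members}
    return {i: fpkm_by_isoform[i] / total for i in members}

for level, grouping in levels.items():
    print(level, summarise(isoform_fpkm, grouping))

# Distribution within the primary transcript group shared by A and C.
print(relative_abundance(isoform_fpkm, ["A", "C"]))
```

Comparing the shares between conditions, rather than the group totals, is the sense in which a difference in distribution is examined.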

A test case – Ricinus communis (Castor bean)

• 5 tissues – Aim: identify differences in lipid-metabolic pathways
• Cufflinks – Cuffcompare results
– RNA-Seq reads assembled into 75090 transcripts corresponding to 29759 ‘genes’
– Compares to the 31221 genes in version 0.1 of the JCVI assembly
– 35587 share at least one splice junction (possible novel splice variant)
– 2847 were located intergenic to the JCVI annotation and hence may represent novel genes
– 218147 splice junctions were identified, 112337 supported by at least 10 reads, >300,000 distinct to the JCVI annotation


Visualisation

• BAM files can be converted to wiggle plots
• CummeRbund for visualisation of Cuffdiff output

BAM, Wiggle and GTF files viewed in IGV

CummeRbund volcano and scatter plots

Thanks

• David Swarbreck (Genome Analysis Team Leader, TGAC)
• Mario Caccamo (Head Bioinformatics Division, TGAC)
