Expression Analysis of RNA-seq Data
Manuel Corpas
Plant and Animal Genomes Project Leader...
Generation of Sequence
Mapping Reads
Assembly of Transcripts
Statistical Analysis
1 Summarization (by exon, by transcript, by gene)
2 Normalization (within sample and between sample)
3 Differential expression testing (Poisson test, negative binomial test)
Identification of splice junctions
The Tuxedo Tools
• Developed by Institute of Genetic Medicine at Johns Hopkins University / University of California, Berkeley / Harvard University
• 157 PubMed citations
Tophat
– Fast short read aligner (Bowtie)
– Spliced read identification (Tophat)
Cufflinks package
– Cufflinks – Transcript assembly
– Cuffmerge – Merges multiple transcript assemblies
– Cuffcompare – Compares transcript assemblies to reference annotation
– Cuffdiff – Identifies differentially expressed genes and transcripts
CummeRbund
– Visualisation of differential expression results
RNA-seq Experimental design
• Sequencing technology (SOLiD, Illumina)
– HiSeq 2000, 150 million read pairs per lane, 100 bp
• Single end (SE), paired end (PE), strand specific
– SE: quantification against known genes
– PE: novel transcripts, transcript level quantification
• Read length (50-100 bp)
– Greater read length aids mapping accuracy, splice variant assignment and identification of novel junctions
• Number of replicates
– RNA-seq is often noted to have substantially less technical variability
– Biological replicates should be included (at least 3 and preferably more)
• Sequencing depth
– Dependent on experimental aims
RNA-seq Experimental design
• Extrapolation of the sigmoid shape suggests 20 % of transcripts not expressed
• First saturation effects set in at ~40 million read alignments
• ~240 million reads achieve 84 % transcript recall
Toung et al. RNA-sequence analysis of human B-cells. Genome Research (2011).
Labaj et al. Characterization and improvement of RNA-Seq precision in quantitative transcript expression profiling. Bioinformatics (2011).
RNA-seq Experimental design
General guide
• Quantify expression of highly to moderately expressed known genes
– ~20 million mapped reads, PE, 2 x 50 bp
• Assess expression of alternative splice variants, novel transcripts, and strong quantification including low copy transcripts
– in excess of 50 million reads, PE, 2 x 100 bp
Example
• Examine gene expression in 6 different conditions with 3 biological replicates (18 samples)
• Multiplexing 6 samples per lane on 3 lanes of the HiSeq (50 bp PE)
• Generates ~25 M reads per sample
• Assuming ~80% of reads map/pass additional QC (20 M mapped reads per sample)
• Cost – 3 lanes (£978 x 3), 18 libraries (£105 x 18), total £4824
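The arithmetic in the example can be checked with a short sketch. It assumes the 150 million read pairs per lane quoted earlier for the HiSeq 2000 also applies to the 50 bp PE run; the function name plan_run and its parameters are illustrative only.

```python
# Worked version of the multiplexing example; the figures come from the slides,
# the function and parameter names are hypothetical.

def plan_run(samples, samples_per_lane, read_pairs_per_lane,
             pass_percent, lane_cost, library_cost):
    """Return lanes needed, reads per sample, usable reads per sample and total cost."""
    lanes = -(-samples // samples_per_lane)              # ceiling division
    reads_per_sample = read_pairs_per_lane // samples_per_lane
    usable_per_sample = reads_per_sample * pass_percent // 100
    total_cost = lanes * lane_cost + samples * library_cost
    return lanes, reads_per_sample, usable_per_sample, total_cost


lanes, raw, usable, cost = plan_run(
    samples=6 * 3,                     # 6 conditions x 3 biological replicates
    samples_per_lane=6,
    read_pairs_per_lane=150_000_000,   # assumed per-lane yield
    pass_percent=80,                   # ~80% of reads map/pass QC
    lane_cost=978,                     # GBP per lane
    library_cost=105,                  # GBP per library
)
print(lanes, raw, usable, cost)
```

With these inputs the sketch gives 3 lanes, 25 M reads per sample, 20 M usable reads per sample and a total of £4824, matching the slide.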
Step 1 – Preprocessing reads
• Sequence data provided as Fastq files
• QC analysis – sequence quality, adapter contamination (FastQC)
• Quality trimming, adapter removal (FASTX, Prinseq, Sickle)
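To make the idea of quality trimming concrete, here is a minimal sketch that decodes Phred+33 quality strings and removes low-quality bases from the 3' end of a read. It is a simplified stand-in, not the algorithm used by FASTX, Prinseq or Sickle; the function names and the threshold of 20 are assumptions for illustration.

```python
# Simplified 3' quality trimming; not the method of any specific tool.

def phred33_scores(quality_string):
    """Decode a FASTQ quality string (Phred+33 encoding) into integer scores."""
    return [ord(ch) - 33 for ch in quality_string]


def trim_3prime(sequence, quality_string, min_quality=20):
    """Drop bases from the 3' end until one reaches min_quality."""
    scores = phred33_scores(quality_string)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_quality:
        end -= 1
    return sequence[:end], quality_string[:end]


# Hypothetical read: six high-quality bases followed by two poor ones.
seq, qual = trim_3prime("ACGTACGT", "IIIIII#!")
print(seq, qual)
```

Here the last two bases (scores 2 and 0) are removed and the first six (score 40) are kept. Real trimmers typically use sliding windows and also discard reads that become too short.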
Step 2 – Data sources
• Reads (Fastq, phred 33)
• Genomic reference (fasta, TAIR10), or pre-built Bowtie index
• GTF/GFF file of gene calls (TAIR10)
• http://tophat.cbcb.umd.edu/igenomes.html
Tuxedo Protocol
[Workflow diagram: paired reads (Read 1, Read 2; FASTQ) from four samples, Leaf (SAM1), Leaf (SAM2), Flower (SAM3) and Flower (SAM4), go with GTF + Genome into TOPHAT (Read Mapping), giving Alignments (BAM); CUFFLINKS (Transcript Assembly) gives Assemblies (GTF); CUFFMERGE (Final Transcript Assembly) gives Assembly (GTF); CUFFCOMPARE (Comparison to reference); CUFFDIFF (Differential expression results); CUMMERBUND (Expression Plots) gives Visualisation (PDF).]
Step 3 Tuxedo Protocol - TOPHAT
[Workflow diagram, TOPHAT step: Reads (FASTQ) from Leaf (SAM1), Leaf (SAM2), Flower (SAM3) and Flower (SAM4), with GTF + Genome, go into TOPHAT (Read Mapping), giving Alignments (BAM).]
• Non-spliced reads mapped by Bowtie
– Reads mapped directly to transcriptome sequence
• Spliced reads identified by Tophat
– Initial mapping used to build a database of spliced junctions
– Input reads split into smaller segments
   • Coverage islands
   • Paired end reads map to distinct regions
   • Segments map in distinct regions
   • Long reads (>=75 bp) used to identify GT-AG, GC-AG and AT-AC splicings
Step 3 Tuxedo Protocol - TOPHAT
• -i/--min-intron-length <int> 40
• -I/--max-intron-length <int> 5000
• -a/--min-anchor-length <int> 10
• -g/--max-multihits <int> 20
• -G/--GTF <GTF/GFF3 file>
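As a rough illustration of how these options fit together, the sketch below assembles a TopHat command line for one paired-end sample using the values shown on the slide. The index prefix, file names and output directory are hypothetical, and the command is only built and printed, not run.

```python
# Builds (but does not execute) a TopHat command for one paired-end sample.
# All paths below are hypothetical placeholders.

def tophat_command(bowtie_index, reads_1, reads_2, gtf, out_dir):
    """Return the argument list for a TopHat run with the slide's option values."""
    return [
        "tophat",
        "-i", "40",      # --min-intron-length
        "-I", "5000",    # --max-intron-length
        "-a", "10",      # --min-anchor-length
        "-g", "20",      # --max-multihits
        "-G", gtf,       # known gene models (GTF/GFF3)
        "-o", out_dir,   # output directory
        bowtie_index, reads_1, reads_2,
    ]


cmd = tophat_command("TAIR10", "leaf1_R1.fastq", "leaf1_R2.fastq",
                     "TAIR10.gtf", "tophat_leaf1")
print(" ".join(cmd))
```

Keeping the arguments as a list makes it straightforward to pass them to subprocess.run once the paths point at real files, and to repeat the call for each of the four samples.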
Step 4 Tuxedo Protocol - CUFFLINKS
[Workflow diagram, CUFFLINKS step: Alignments (BAM) from TOPHAT (Read Mapping), with GTF, go into CUFFLINKS (Transcript Assembly), giving Assemblies (GTF).]
• Accurate quantification of a gene requires identifying which isoform produced each read.
• Reference Annotation Based Transcript (RABT) assembly
• Sequence bias correction -b/--frag-bias-correct <genome.fa>
• Multi-mapped read correction is enabled (-u/--multi-read-correct)
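The first point, assigning reads to isoforms, can be illustrated with a toy expectation-maximisation (EM) loop. This is a deliberately simplified sketch and not the Cufflinks model, which also accounts for transcript length, fragment length distribution and sequence bias; the isoform names and read counts are invented.

```python
# Toy EM for isoform fractions; a simplified illustration, not Cufflinks itself.

def em_isoform_fractions(read_compatibility, iterations=50):
    """Estimate the fraction of reads from each isoform.

    read_compatibility holds, for every read, the tuple of isoforms it could
    have come from.
    """
    isoforms = sorted({iso for compat in read_compatibility for iso in compat})
    fractions = {iso: 1.0 / len(isoforms) for iso in isoforms}
    for _ in range(iterations):
        expected = {iso: 0.0 for iso in isoforms}
        for compat in read_compatibility:
            # E-step: share each read among compatible isoforms
            total = sum(fractions[iso] for iso in compat)
            for iso in compat:
                expected[iso] += fractions[iso] / total
        # M-step: new fractions are the expected read shares
        fractions = {iso: expected[iso] / len(read_compatibility)
                     for iso in isoforms}
    return fractions


# Hypothetical data: 6 reads unique to A, 2 unique to B, 4 compatible with both.
reads = [("A",)] * 6 + [("B",)] * 2 + [("A", "B")] * 4
fractions = em_isoform_fractions(reads)
print({iso: round(value, 3) for iso, value in fractions.items()})
```

With these counts the estimates settle at about 0.75 for A and 0.25 for B: the ambiguous reads are shared in proportion to the evidence from the unique reads.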
Tuxedo Protocol - CUFFDIFF
[Workflow diagram, CUFFDIFF step: Alignments (BAM) from TOPHAT (Read Mapping) and the Assembly (GTF) from CUFFMERGE (Final Transcript Assembly), together with a GTF mask file, go into CUFFDIFF (Differential expression results).]
CUFFDIFF output – FPKM (fragments per kilobase of transcript per million fragments mapped) values, fold change, test statistic, p-value, significance statement.
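The FPKM unit can be written out as a small calculation. The sketch below uses the plain definition from the expansion of the acronym; Cuffdiff's own estimates come from its statistical model, so the function and the numbers here are illustrative assumptions only.

```python
# Plain FPKM definition; Cuffdiff's reported values come from a fuller model.

def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million fragments mapped."""
    kilobases = transcript_length_bp / 1_000
    millions_mapped = total_mapped_fragments / 1_000_000
    return fragments / (kilobases * millions_mapped)


# Hypothetical transcript: 200 fragments, 2 kb long, 20 M mapped fragments.
value = fpkm(200, 2_000, 20_000_000)
print(value)
```

For this hypothetical transcript the result is 5.0: 200 fragments over 2 kb is 100 per kilobase, divided by 20 million mapped fragments.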
CUFFDIFF - Summarisation
• Cuffdiff output (11 files)
– FPKM tracking files (Transcript, Gene, CDS, Primary transcript)
– Differential expression tests (Transcript, Gene, CDS, Primary transcript)
– Differential splicing tests – splicing.diff
– Differential coding output – cds.diff
– Differential promoter use – promoters.diff
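[Figure: exon structures of three isoforms of one gene, labelled (A), (B) and (C); the exon rows read 1 2 3 4, 1 2 4 and 1 2 4.]
1. A + B + C Grouped at Gene level
2. B + C Grouped at CDS level
3. A + C Grouped at Primary transcript level
4. A, B, C No group at the transcript level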
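• The differential splicing, coding output and promoter use tests look at differences in the distribution of expression within a group (rather than its total level)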
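The grouping levels can be sketched as a lookup over isoform attributes. The gene, TSS and CDS identifiers and the FPKM values below are invented so that the groups match the A/B/C example; this is not Cuffdiff code.

```python
from collections import defaultdict

# Hypothetical annotation for isoforms A, B and C; all identifiers are made up.
ISOFORMS = {
    "A": {"gene": "G1", "tss": "TSS1", "cds": "P1", "fpkm": 10.0},
    "B": {"gene": "G1", "tss": "TSS2", "cds": "P2", "fpkm": 5.0},
    "C": {"gene": "G1", "tss": "TSS1", "cds": "P2", "fpkm": 2.5},
}


def summarise(level):
    """Group isoforms by the given attribute and sum their FPKM values."""
    members = defaultdict(list)
    totals = defaultdict(float)
    for name in sorted(ISOFORMS):
        key = ISOFORMS[name][level]
        members[key].append(name)
        totals[key] += ISOFORMS[name]["fpkm"]
    return dict(members), dict(totals)


for level in ("gene", "cds", "tss"):
    groups, totals = summarise(level)
    print(level, groups, totals)
```

At the gene level all three isoforms fall into one group, at the CDS level B and C share a group, and at the primary transcript (TSS) level A and C share a group, as in the numbered list above.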
A test case – Ricinus communis (Castor bean)
• 5 tissues – Aim: identify differences in lipid-metabolic pathways
• Cufflinks – Cuffcompare results
– RNA-Seq reads assembled into 75090 transcripts corresponding to 29759 ‘genes’
– Compares to the 31221 genes in version 0.1 of the JCVI assembly
– 35587 share at least one splice junction (possible novel splice variant)
– 2847 were located intergenic to the JCVI annotation and hence may represent novel genes
– 218147 splice junctions were identified, 112337 supported by at least 10 reads, >300,000 distinct to the JCVI annotation
Visualisation
• Bam files can be converted to wiggle plots
• CummeRbund for visualisation of Cuffdiff output
[Figures: Bam, Wiggle and GTF files viewed in IGV; CummeRbund volcano and scatter plots]
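For readers unfamiliar with volcano plots, the sketch below shows the transformation behind each point: log2 fold change on the x axis against -log10 of the p-value on the y axis. CummeRbund itself is an R package; this Python function and its example values are illustrative only.

```python
import math

# Coordinates of one gene on a volcano plot; not CummeRbund code.

def volcano_point(fpkm_1, fpkm_2, p_value):
    """Return (log2 fold change, -log10 p-value) for one gene.

    Assumes both FPKM values and the p-value are greater than zero.
    """
    log2_fold_change = math.log2(fpkm_2 / fpkm_1)
    neg_log10_p = -math.log10(p_value)
    return log2_fold_change, neg_log10_p


# Hypothetical gene: FPKM 5 in leaf, 20 in flower, p-value 0.001.
x, y = volcano_point(5.0, 20.0, 0.001)
print(round(x, 3), round(y, 3))
```

Genes with large fold changes and small p-values end up in the upper left and upper right corners of the plot. Zero FPKM values need separate handling because the ratio or its logarithm is undefined.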