COUNTING READS - Cornell Universityincorrectly reconstructedbyCufflinks(p ¼ 2:17 # 10$6,...
Transcript of COUNTING READS - Cornell Universityincorrectly reconstructedbyCufflinks(p ¼ 2:17 # 10$6,...
COUNTING READS from alignments to count tables
Images
Raw reads
Aligned reads
Read count table
Normalized read count table
List of fold changes & statistical values
Downstream analyses on DE genes
FASTQC
Base calling & demultiplexing
Mapping
.tif
.fastq
Bioinformatics workflow of RNA-seq analysis
.sam/.bam Aligned reads
Bustard/RTA/OLB, CASAVA
STAR
RSeQC
Thursday, Oct 1
• QC of aligned reads • counting reads • working with read counts
• normalizing
• transforming
• similarity assessments: hierarchical clustering, PCA
Basic QC of aligned reads
• aligner output (e.g., Log.final.out) • samtools flagstat • RSeQC’s bam_stat • …
RSeQC’s read_distribution.py
to reproduce the plot, see https://github.com/friedue/course_RNA-seq2015
5’ and 3’ bias of RNA-seq samples
Lahens et al. Genome Biology 2014 15:R86 doi:10.1186/gb-2014-15-6-r86
RSeQC’s geneBody_coverage.py
5’ and 3’ bias identification with RSeQC RSeQC’s geneBody_coverage.py
in silico RIN calculation • RIN is rarely documented in the
data repositories • mRIN is based on 3’ coverage bias • use RSeQC’s tin.py
ARTICLEReceived 7 May 2015 | Accepted 15 Jun 2015 | Published 3 Aug 2015
mRIN for direct assessment of genome-wide andgene-specific mRNA integrity from large-scaleRNA-sequencing dataHuijuan Feng1,2, Xuegong Zhang1 & Chaolin Zhang2
The volume of RNA-Seq data sets in public repositories has been expanding exponentially,
providing unprecedented opportunities to study gene expression regulation. Because
degraded RNA samples, such as those collected from post-mortem tissues, can result in
distinct expression profiles with potential biases, a particularly important step in mining these
data is quality control. Here we develop a method named mRIN to directly assess mRNA
integrity from RNA-Seq data at the sample and individual gene level. We systematically
analyse large-scale RNA-Seq data sets of the human brain transcriptome generated by dif-
ferent consortia. Our analysis demonstrates that 30 bias resulting from partial RNA
fragmentation in post-mortem tissues has a marked impact on global expression profiles,
and that mRIN effectively identifies samples with different levels of mRNA degradation.
Unexpectedly, this process has a reproducible and gene-specific component, and transcripts
with different stabilities are associated with distinct functions and structural features
reminiscent of mRNA decay in living cells.
DOI: 10.1038/ncomms8816 OPEN
1 MOE Key Laboratory of Bioinformatics and Bioinformatics Division, TNLIST/Department of Automation, Tsinghua University, Beijing 100084, China.2 Department of Systems Biology, Department of Biochemistry and Molecular Biophysics, Center for Motor Neuron Biology and Disease, Columbia University,New York, New York 10032, USA. Correspondence and requests for materials should be addressed to C.Z. (email: [email protected]).
NATURE COMMUNICATIONS | 6:7816 | DOI: 10.1038/ncomms8816 | www.nature.com/naturecommunications 1
& 2015 Macmillan Publishers Limited. All rights reserved.
https://github.com/friedue/course_RNA-seq2015 à exercise #3
Images
Raw reads
Aligned reads
Read count table
Normalized read count table
List of fold changes & statistical values
Downstream analyses on DE genes
Base calling & demultiplexing
Mapping
.tif
.fastq
Bioinformatics workflow of RNA-seq analysis
.sam/.bam
Bustard/RTA/OLB, CASAVA
STAR
Counting featureCounts
Which regions are expressed?
How much are they
expressed?
http://www-huber.embl.de/users/anders/HTSeq/doc/count.html#count
Counting read–gene overlaps
featureCounts will identify overlaps as small as 1 bp
multi-overlap reads
at the saturated sensitivity and precision values (Table 2). Thesesets are obtained when mean coverage is equal to 100 and com-prise the best reconstruction achievable using these methodswith standard parameter settings.
For these transcript sets, the correlation between true andestimated expression values of TP transcripts for Cufflinks is 0.95and for Oases is 0.85 (Table 2). A lower correlation for the de novoset is expected, given that the corresponding transcript set con-tains more non-existent transcripts, which absorb some of thetrue expression signal. The correlation for Cufflinks is compar-able with that of some of the simulated curated annotation sets.
The FP=TP ratio is significantly higher for computationalmethods than for reference-based methods using annotationswith similar accuracy. This is because FP transcripts that are re-constructed from the data are necessarily supported by reads,and thus have expression signal assigned to them, whereas se-quence in FP transcripts in a curated annotation set will in gen-eral not coincide with read sequence in the data, and so will beassigned low expression values. Cufflinks appears worse in thisrespect, as on average it assigns the equivalent of 81% of meanTP expression to FP transcripts, while for Oases this value isonly 41% (Table 2). This is consistent with Cufflinks assemblingfewer incorrect transcripts than Oases, which concentrates thesignal in a smaller number of FP transcripts.
The correlation of TP transcripts reconstructed by Cufflinksis high relative to the low sensitivity (s¼ 0.36) of the method. Asthe more highly expressed transcripts are more likely to be re-constructed accurately (Figure 6) than the more lowly expressedtranscripts and as highly expressed transcripts are easier to es-timate accurately (on the logarithmic scale), the result is a highcorrelation for a small subset of the truly expressed transcripts.
We note that the sampling of TP annotated transcripts inour simulations does not depend on the expression level,which is favourable towards reconstruction methods. As wehave shown (Figure 2), the annotated fraction of truly expressedtranscripts tends to be more highly expressed than the unanno-tated fraction. Thus, the TP correlations for the annotation-based approach are conservative.
Transcriptome reference-guided reconstructionprovides modest improvements in accuracy ofexpression estimates
The CufflinksþRABT approach supplements a curated annota-tion transcript set with additional reconstructed transcriptsrequired to explain the data. In our simulation set-up, we con-sider the range of annotation sets with different sensitivitiesand precisions, as above. We present two aspects of the results:firstly the effect that RABT has on the sensitivity and precisionof the final transcript set, and secondly the correlation betweentrue and estimated expression of the TP transcripts and theoverall FP=TP ratio.
Supplementing annotated transcripts with reconstructedtranscripts using CufflinksþRABT generally increases sensitiv-ity (the starts and ends of the arrows in Figure 7 point to theannotated and supplemented sensitivities and precisions, re-spectively). When using annotations with the lowest sensitivityof s¼ 0.2, RABT roughly doubles sensitivity of the transcript set.The gains in sensitivity decrease substantially as the sensitivityof the annotations increases and are not noticeable beyonds¼ 0.6. Thus, CufflinksþRABT is no better than an annotation-based approach overall when the sensitivity of annotations ismoderate to high.
Table 2. Performance of the computational transcriptome recon-struction methods
Cufflinks Oases
Sensitivity 0.36 0.36Precision 0.45 0.17Correlation of TPs 0.95 0.85FP=TP 0.81 0.41
Sensitivity, precision, correlation between true and estimated expression of TP tran-scripts and FP=TP for the computational transcriptome reconstruction methods.
Figure 6. Estimated expression levels of reconstructed transcripts by Cufflinks.Densities of the log expression estimates of transcripts properly reconstructed andincorrectly reconstructed by Cufflinks (p ¼ 2:17# 10$6, Kolmogorov–Smirnov test).
Figure 7. Log expression estimates for FP transcripts obtained using RABT.Densities of the log expression estimates of FP transcripts reconstructed byRABT and FP transcripts present in the simulated annotation set used by RABTas a starting point (s ¼ 0.6, p¼0.4).
A comparative study of RNA-seq analysis strategies | 7
at Cornell U
niversity Library on July 9, 2015http://bib.oxfordjournals.org/
Dow
nloaded from
• Transcriptome reconstruction suffers from bad precision and bad sensitivity => many FP transcripts!
• False transcripts capture a considerable portion of the reads
Janes et al. (2015). Briefings in Bioinformatics, (January), 1–9. doi:10.1093/bib/bbv007
The problem with reconstructed transcripts…
Avoid RPKM and total read count normalization for
DGE
FP rates for varying % of DE genes (0-30%)
Dillies et al.(2012). Briefings in Bioinformatics. doi:10.1093/bib/bbs046
Effects of normalization methods on FC calculation and DGE analysis
Technicalities when measuring expression strength with read counts
• strongly influenced by • gene length
• sequencing depth
• expression of all other genes in the same sample DESeq’s size factor normalization
hetero-skedasticity
• large dynamic range • discrete values
log transformation and variance stabilization (DESeq’s rlog() )
clustering tends to work better on normalized & transformed read counts