COUNTING READS - Cornell Universityincorrectly reconstructedbyCufflinks(p ¼ 2:17 # 10$6,...

13
COUNTING READS from alignments to count tables

Transcript of COUNTING READS - Cornell Universityincorrectly reconstructedbyCufflinks(p ¼ 2:17 # 10$6,...

Page 1: COUNTING READS - Cornell Universityincorrectly reconstructedbyCufflinks(p ¼ 2:17 # 10$6, Kolmogorov–Smirnov test). Figure 7. Log expression estimates for FP transcripts obtained

COUNTING READS from alignments to count tables

Page 2: COUNTING READS - Cornell Universityincorrectly reconstructedbyCufflinks(p ¼ 2:17 # 10$6, Kolmogorov–Smirnov test). Figure 7. Log expression estimates for FP transcripts obtained

Images

Raw reads

Aligned reads

Read count table

Normalized read count table

List of fold changes & statistical values

Downstream analyses on DE genes

FASTQC

Base calling & demultiplexing

Mapping

.tif

.fastq

Bioinformatics workflow of RNA-seq analysis

.sam/.bam Aligned reads

Bustard/RTA/OLB, CASAVA

STAR

RSeQC

Page 3: COUNTING READS - Cornell Universityincorrectly reconstructedbyCufflinks(p ¼ 2:17 # 10$6, Kolmogorov–Smirnov test). Figure 7. Log expression estimates for FP transcripts obtained

Thursday, Oct 1

• QC of aligned reads •  counting reads • working with read counts

•  normalizing

•  transforming

•  similarity assessments: hierarchical clustering, PCA

Page 4: COUNTING READS - Cornell Universityincorrectly reconstructedbyCufflinks(p ¼ 2:17 # 10$6, Kolmogorov–Smirnov test). Figure 7. Log expression estimates for FP transcripts obtained

Basic QC of aligned reads

•  aligner output (e.g., Log.final.out) •  samtools flagstat • RSeQC’s bam_stat • …

Page 5: COUNTING READS - Cornell Universityincorrectly reconstructedbyCufflinks(p ¼ 2:17 # 10$6, Kolmogorov–Smirnov test). Figure 7. Log expression estimates for FP transcripts obtained

RSeQC’s read_distribution.py

to reproduce the plot, see https://github.com/friedue/course_RNA-seq2015

Page 6: COUNTING READS - Cornell Universityincorrectly reconstructedbyCufflinks(p ¼ 2:17 # 10$6, Kolmogorov–Smirnov test). Figure 7. Log expression estimates for FP transcripts obtained

5’ and 3’ bias of RNA-seq samples

Lahens et al. Genome Biology 2014 15:R86 doi:10.1186/gb-2014-15-6-r86

RSeQC’s geneBody_coverage.py

Page 7: COUNTING READS - Cornell Universityincorrectly reconstructedbyCufflinks(p ¼ 2:17 # 10$6, Kolmogorov–Smirnov test). Figure 7. Log expression estimates for FP transcripts obtained

5’ and 3’ bias identification with RSeQC RSeQC’s geneBody_coverage.py

Page 8: COUNTING READS - Cornell Universityincorrectly reconstructedbyCufflinks(p ¼ 2:17 # 10$6, Kolmogorov–Smirnov test). Figure 7. Log expression estimates for FP transcripts obtained

in silico RIN calculation •  RIN is rarely documented in the

data repositories •  mRIN is based on 3’ coverage bias •  use RSeQC’s tin.py

ARTICLEReceived 7 May 2015 | Accepted 15 Jun 2015 | Published 3 Aug 2015

mRIN for direct assessment of genome-wide andgene-specific mRNA integrity from large-scaleRNA-sequencing dataHuijuan Feng1,2, Xuegong Zhang1 & Chaolin Zhang2

The volume of RNA-Seq data sets in public repositories has been expanding exponentially,

providing unprecedented opportunities to study gene expression regulation. Because

degraded RNA samples, such as those collected from post-mortem tissues, can result in

distinct expression profiles with potential biases, a particularly important step in mining these

data is quality control. Here we develop a method named mRIN to directly assess mRNA

integrity from RNA-Seq data at the sample and individual gene level. We systematically

analyse large-scale RNA-Seq data sets of the human brain transcriptome generated by dif-

ferent consortia. Our analysis demonstrates that 30 bias resulting from partial RNA

fragmentation in post-mortem tissues has a marked impact on global expression profiles,

and that mRIN effectively identifies samples with different levels of mRNA degradation.

Unexpectedly, this process has a reproducible and gene-specific component, and transcripts

with different stabilities are associated with distinct functions and structural features

reminiscent of mRNA decay in living cells.

DOI: 10.1038/ncomms8816 OPEN

1 MOE Key Laboratory of Bioinformatics and Bioinformatics Division, TNLIST/Department of Automation, Tsinghua University, Beijing 100084, China.2 Department of Systems Biology, Department of Biochemistry and Molecular Biophysics, Center for Motor Neuron Biology and Disease, Columbia University,New York, New York 10032, USA. Correspondence and requests for materials should be addressed to C.Z. (email: [email protected]).

NATURE COMMUNICATIONS | 6:7816 | DOI: 10.1038/ncomms8816 | www.nature.com/naturecommunications 1

& 2015 Macmillan Publishers Limited. All rights reserved.

https://github.com/friedue/course_RNA-seq2015 à exercise #3

Page 9: COUNTING READS - Cornell Universityincorrectly reconstructedbyCufflinks(p ¼ 2:17 # 10$6, Kolmogorov–Smirnov test). Figure 7. Log expression estimates for FP transcripts obtained

Images

Raw reads

Aligned reads

Read count table

Normalized read count table

List of fold changes & statistical values

Downstream analyses on DE genes

Base calling & demultiplexing

Mapping

.tif

.fastq

Bioinformatics workflow of RNA-seq analysis

.sam/.bam

Bustard/RTA/OLB, CASAVA

STAR

Counting featureCounts

Which regions are expressed?

How much are they

expressed?

Page 10: COUNTING READS - Cornell Universityincorrectly reconstructedbyCufflinks(p ¼ 2:17 # 10$6, Kolmogorov–Smirnov test). Figure 7. Log expression estimates for FP transcripts obtained

http://www-huber.embl.de/users/anders/HTSeq/doc/count.html#count

Counting read–gene overlaps

featureCounts will identify overlaps as small as 1 bp

multi-overlap reads

Page 11: COUNTING READS - Cornell Universityincorrectly reconstructedbyCufflinks(p ¼ 2:17 # 10$6, Kolmogorov–Smirnov test). Figure 7. Log expression estimates for FP transcripts obtained

at the saturated sensitivity and precision values (Table 2). Thesesets are obtained when mean coverage is equal to 100 and com-prise the best reconstruction achievable using these methodswith standard parameter settings.

For these transcript sets, the correlation between true andestimated expression values of TP transcripts for Cufflinks is 0.95and for Oases is 0.85 (Table 2). A lower correlation for the de novoset is expected, given that the corresponding transcript set con-tains more non-existent transcripts, which absorb some of thetrue expression signal. The correlation for Cufflinks is compar-able with that of some of the simulated curated annotation sets.

The FP=TP ratio is significantly higher for computationalmethods than for reference-based methods using annotationswith similar accuracy. This is because FP transcripts that are re-constructed from the data are necessarily supported by reads,and thus have expression signal assigned to them, whereas se-quence in FP transcripts in a curated annotation set will in gen-eral not coincide with read sequence in the data, and so will beassigned low expression values. Cufflinks appears worse in thisrespect, as on average it assigns the equivalent of 81% of meanTP expression to FP transcripts, while for Oases this value isonly 41% (Table 2). This is consistent with Cufflinks assemblingfewer incorrect transcripts than Oases, which concentrates thesignal in a smaller number of FP transcripts.

The correlation of TP transcripts reconstructed by Cufflinksis high relative to the low sensitivity (s¼ 0.36) of the method. Asthe more highly expressed transcripts are more likely to be re-constructed accurately (Figure 6) than the more lowly expressedtranscripts and as highly expressed transcripts are easier to es-timate accurately (on the logarithmic scale), the result is a highcorrelation for a small subset of the truly expressed transcripts.

We note that the sampling of TP annotated transcripts inour simulations does not depend on the expression level,which is favourable towards reconstruction methods. As wehave shown (Figure 2), the annotated fraction of truly expressedtranscripts tends to be more highly expressed than the unanno-tated fraction. Thus, the TP correlations for the annotation-based approach are conservative.

Transcriptome reference-guided reconstructionprovides modest improvements in accuracy ofexpression estimates

The CufflinksþRABT approach supplements a curated annota-tion transcript set with additional reconstructed transcriptsrequired to explain the data. In our simulation set-up, we con-sider the range of annotation sets with different sensitivitiesand precisions, as above. We present two aspects of the results:firstly the effect that RABT has on the sensitivity and precisionof the final transcript set, and secondly the correlation betweentrue and estimated expression of the TP transcripts and theoverall FP=TP ratio.

Supplementing annotated transcripts with reconstructedtranscripts using CufflinksþRABT generally increases sensitiv-ity (the starts and ends of the arrows in Figure 7 point to theannotated and supplemented sensitivities and precisions, re-spectively). When using annotations with the lowest sensitivityof s¼ 0.2, RABT roughly doubles sensitivity of the transcript set.The gains in sensitivity decrease substantially as the sensitivityof the annotations increases and are not noticeable beyonds¼ 0.6. Thus, CufflinksþRABT is no better than an annotation-based approach overall when the sensitivity of annotations ismoderate to high.

Table 2. Performance of the computational transcriptome recon-struction methods

Cufflinks Oases

Sensitivity 0.36 0.36Precision 0.45 0.17Correlation of TPs 0.95 0.85FP=TP 0.81 0.41

Sensitivity, precision, correlation between true and estimated expression of TP tran-scripts and FP=TP for the computational transcriptome reconstruction methods.

Figure 6. Estimated expression levels of reconstructed transcripts by Cufflinks.Densities of the log expression estimates of transcripts properly reconstructed andincorrectly reconstructed by Cufflinks (p ¼ 2:17# 10$6, Kolmogorov–Smirnov test).

Figure 7. Log expression estimates for FP transcripts obtained using RABT.Densities of the log expression estimates of FP transcripts reconstructed byRABT and FP transcripts present in the simulated annotation set used by RABTas a starting point (s ¼ 0.6, p¼0.4).

A comparative study of RNA-seq analysis strategies | 7

at Cornell U

niversity Library on July 9, 2015http://bib.oxfordjournals.org/

Dow

nloaded from

•  Transcriptome reconstruction suffers from bad precision and bad sensitivity => many FP transcripts!

•  False transcripts capture a considerable portion of the reads

Janes et al. (2015). Briefings in Bioinformatics, (January), 1–9. doi:10.1093/bib/bbv007

The problem with reconstructed transcripts…

Page 12: COUNTING READS - Cornell Universityincorrectly reconstructedbyCufflinks(p ¼ 2:17 # 10$6, Kolmogorov–Smirnov test). Figure 7. Log expression estimates for FP transcripts obtained

Avoid RPKM and total read count normalization for

DGE

FP rates for varying % of DE genes (0-30%)

Dillies et al.(2012). Briefings in Bioinformatics. doi:10.1093/bib/bbs046

Effects of normalization methods on FC calculation and DGE analysis

Page 13: COUNTING READS - Cornell Universityincorrectly reconstructedbyCufflinks(p ¼ 2:17 # 10$6, Kolmogorov–Smirnov test). Figure 7. Log expression estimates for FP transcripts obtained

Technicalities when measuring expression strength with read counts

•  strongly influenced by •  gene length

•  sequencing depth

•  expression of all other genes in the same sample DESeq’s size factor normalization

hetero-skedasticity

•  large dynamic range •  discrete values

log transformation and variance stabilization (DESeq’s rlog() )

clustering tends to work better on normalized & transformed read counts