Analysis of the RNAseq Genome Annotation Assessment Project by Subhajyoti De.
RNAseq
![Page 1: RNAseq](https://reader036.fdocuments.net/reader036/viewer/2022062521/56816741550346895ddbf491/html5/thumbnails/1.jpg)
RNAseq
![Page 2: RNAseq](https://reader036.fdocuments.net/reader036/viewer/2022062521/56816741550346895ddbf491/html5/thumbnails/2.jpg)
RNA-Seq Alignment
Transcriptome (mRNA) → Illumina sequencing → RNA-seq reads (35bp - 150bp, single or paired-end)
![Page 3: RNAseq](https://reader036.fdocuments.net/reader036/viewer/2022062521/56816741550346895ddbf491/html5/thumbnails/3.jpg)
RNA-seq alignments
1. Wang K., et al., MapSplice: Accurate Mapping of RNA-seq Reads for Splice Junction Discovery, Nucleic Acids Research, 2010.
2. Hu Y., et al., A Probabilistic Framework for Aligning Paired-end RNA-seq Data, Bioinformatics, 2010
[Figure: RNA-seq reads aligned to the reference genome (5’ to 3’) across Exon 1, Exon 2, and Exon 3]
Sequencing and Alignment
Transcriptome (mRNA) → Illumina sequencing → RNA-seq reads (35bp - 150bp, single or paired-end) → MapSplice
![Page 4: RNAseq](https://reader036.fdocuments.net/reader036/viewer/2022062521/56816741550346895ddbf491/html5/thumbnails/4.jpg)
MapSplice Features
• Aligns to reference genome without dependence on annotations
– Finds canonical and non-canonical spliced alignments
– SNP and indel tolerance relative to the reference genome
– Aligns arbitrarily long reads with multiple splices
– Aligns reads over arbitrarily long gaps (e.g. fusion transcripts that result from genomic translocations)
– Can detect exons as small as 8bp (assuming sufficient read coverage)
• Two-stage alignment
– Classify “true” junctions using candidate alignments for all reads
– Realign reads using only true junctions
• Positive independent evaluations in comparative studies
– MapSplice consistently outperforms other methods in measures of sensitivity & specificity in junctions, overall accuracy, and fraction of aligned reads (Grant et al., Bioinformatics, 2011; Chen et al., NAR 2011)
![Page 5: RNAseq](https://reader036.fdocuments.net/reader036/viewer/2022062521/56816741550346895ddbf491/html5/thumbnails/5.jpg)
Mapping
Pipeline: pre-processing → mapping → junction classification → remapping → best alignment
[Figure: mRNA tag T split into segments t1, t2, t3, t4 and aligned to the genome across exon 1, exon 2, and exon 3; labels k, k, h; junctions j1, j2]
• Segmented alignment of reads
– example: a 100nt read is split into four 25nt segments
– segments are aligned to the genome using bowtie (mismatch 1)
– unaligned segments implicate splices or indels
– find splices/indels by searching from neighboring aligned segments
  – double anchored search
  – single anchored search
– each end of a paired-end read (PER) is aligned individually
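The segmentation step can be sketched in a few lines. This is only an illustration of the idea on the slide, not MapSplice's code: the function names are hypothetical, and the rule used to tell "double anchored" from "single anchored" (both neighbors aligned versus only one) is an assumption about how the slide uses those terms.

```python
def split_into_segments(read, segment_len=25):
    """Split a read into consecutive fixed-length segments (the last may be shorter)."""
    return [read[i:i + segment_len] for i in range(0, len(read), segment_len)]


def classify_unaligned(aligned_flags):
    """Label each unaligned segment by how many aligned neighbors it has.

    aligned_flags[j] is True when segment j aligned contiguously.
    Assumed rule: both neighbors aligned -> double anchored,
    exactly one neighbor aligned -> single anchored.
    """
    labels = {}
    for j, aligned in enumerate(aligned_flags):
        if aligned:
            continue
        left_ok = j > 0 and aligned_flags[j - 1]
        right_ok = j + 1 < len(aligned_flags) and aligned_flags[j + 1]
        if left_ok and right_ok:
            labels[j] = "double anchored"
        elif left_ok or right_ok:
            labels[j] = "single anchored"
        else:
            labels[j] = "no anchor"
    return labels


read = "ACGT" * 25  # toy 100nt read
segments = split_into_segments(read)
print([len(s) for s in segments])
print(classify_unaligned([True, False, True, True]))
```

In the toy call, the second segment is the only one that failed to align, and both of its neighbors did, so it would be searched from both sides.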
![Page 6: RNAseq](https://reader036.fdocuments.net/reader036/viewer/2022062521/56816741550346895ddbf491/html5/thumbnails/6.jpg)
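Mapping: Segment alignment
[Figure: aligned segments tj and tj+2 flank an unaligned segment tj+1, marked “?”; a second case shows tj followed by an unaligned tj+1; 5’ to 3’]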
![Page 7: RNAseq](https://reader036.fdocuments.net/reader036/viewer/2022062521/56816741550346895ddbf491/html5/thumbnails/7.jpg)
Mapping: Spliced/indel alignment
[Figure: segment tj+1 is placed as a spliced or indel alignment between tj and tj+2; 5’ to 3’]
![Page 8: RNAseq](https://reader036.fdocuments.net/reader036/viewer/2022062521/56816741550346895ddbf491/html5/thumbnails/8.jpg)
Mapping: Segment Assembly
![Page 9: RNAseq](https://reader036.fdocuments.net/reader036/viewer/2022062521/56816741550346895ddbf491/html5/thumbnails/9.jpg)
Mapping: Segment Assembly
A read may have multiple alignments
![Page 10: RNAseq](https://reader036.fdocuments.net/reader036/viewer/2022062521/56816741550346895ddbf491/html5/thumbnails/10.jpg)
Paired End Data
![Page 11: RNAseq](https://reader036.fdocuments.net/reader036/viewer/2022062521/56816741550346895ddbf491/html5/thumbnails/11.jpg)
Junction classification
1. Alignment quality, e.g., average mismatch ≤ 2 (max 3)
2. Anchor significance, e.g., left and right anchor ≥ 15bp
3. Entropy, e.g., close to uniform distribution of starting positions for reads that span the splice junction
[Figure: a junction with a window of readlength-1 bases on each side; 5’ to 3’]
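The three criteria can be sketched as a simple filter. The mismatch and anchor thresholds below are the example values from the slide; the entropy cutoff (`min_entropy`) and all function names are hypothetical, and MapSplice's actual classifier may combine these features differently.

```python
import math
from collections import Counter


def start_entropy(start_positions):
    """Shannon entropy (bits) of the start positions of junction-spanning reads.

    Reads starting at many different positions give high entropy;
    reads piled up at one position give zero.
    """
    n = len(start_positions)
    counts = Counter(start_positions)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def passes_filters(avg_mismatch, left_anchor, right_anchor, start_positions,
                   max_avg_mismatch=2.0, min_anchor=15, min_entropy=1.0):
    """Apply the three criteria; min_entropy is an illustrative cutoff only."""
    return (avg_mismatch <= max_avg_mismatch
            and left_anchor >= min_anchor
            and right_anchor >= min_anchor
            and start_entropy(start_positions) >= min_entropy)


# Four reads starting at four distinct positions: uniform, entropy of 2 bits.
print(start_entropy([101, 105, 110, 118]))
# Four reads stacked at one position: entropy of 0, rejected by the filter.
print(passes_filters(1.0, 20, 18, [101, 101, 101, 101]))
```

The entropy term is what separates a junction supported by independent fragments from one supported by duplicates of a single read.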
![Page 12: RNAseq](https://reader036.fdocuments.net/reader036/viewer/2022062521/56816741550346895ddbf491/html5/thumbnails/12.jpg)
Remapping
[Figure: synthetic sequences built from readlength-1 bases on each side of a junction; 5’ to 3’]
• Realign all reads contiguously to synthetic sequences centered on each junction
• In general, multiple synthetic sequences for each junction
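A minimal sketch of building one synthetic sequence for a junction, assuming 0-based coordinates where `donor_end` is the index just past the last upstream exon base and `acceptor_start` is the first downstream exon base. These names and conventions are hypothetical. The flank of readlength-1 matches the figure: a read that spans the junction has at least one base on one side, so at most readlength-1 bases on the other.

```python
def synthetic_junction_sequence(genome, donor_end, acceptor_start, read_length):
    """Join the exonic flanks of a junction so reads can align contiguously.

    Takes up to read_length - 1 bases upstream of the donor site and
    read_length - 1 bases downstream of the acceptor site, skipping the intron.
    """
    flank = read_length - 1
    left = genome[max(0, donor_end - flank):donor_end]
    right = genome[acceptor_start:acceptor_start + flank]
    return left + right


# Toy genome: exon "AAAAA", intron "CCCCCGGGGG", exon "TTTTT".
genome = "AAAAACCCCCGGGGGTTTTT"
print(synthetic_junction_sequence(genome, donor_end=5, acceptor_start=15, read_length=4))
```

This sketch handles only the simple case of one junction with long enough flanking exons. When a flank is shorter than readlength-1 and itself ends at another junction, several different flanks are possible, which is one reason a junction can have multiple synthetic sequences.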
![Page 13: RNAseq](https://reader036.fdocuments.net/reader036/viewer/2022062521/56816741550346895ddbf491/html5/thumbnails/13.jpg)
Remapping for contiguous alignment
[Figure: reads realigned contiguously to synthetic sequences spanning readlength-1 bases on each side of a junction; 5’ to 3’]
![Page 14: RNAseq](https://reader036.fdocuments.net/reader036/viewer/2022062521/56816741550346895ddbf491/html5/thumbnails/14.jpg)
Best Alignment
• Alignments of a read are scored as a combination of
1) mate-pair distance, if both ends are mapped
2) mismatches (sum of both ends, if mapped)
3) confidence of junctions, if spliced alignment
The alignment(s) with the top score are reported.
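The selection rule (report every alignment tied for the top score) can be sketched as follows. The slide does not give the scoring formula, so `example_score`, its weights, the `expected_insert` value, and the dictionary keys are all invented for illustration and should not be read as MapSplice's actual scoring.

```python
def example_score(aln, expected_insert=200):
    """Hypothetical score: fewer mismatches, a mate distance close to the
    expected insert size, and higher junction confidence are all better."""
    score = -float(aln["mismatches"])
    if aln.get("mate_distance") is not None:  # only when both ends are mapped
        score -= abs(aln["mate_distance"] - expected_insert) / 100.0
    score += aln.get("junction_confidence", 0.0)  # only for spliced alignments
    return score


def top_alignments(alignments, score=example_score):
    """Return all alignments that share the highest score (ties are kept)."""
    if not alignments:
        return []
    scored = [(score(a), a) for a in alignments]
    best = max(s for s, _ in scored)
    return [a for s, a in scored if s == best]


candidates = [
    {"id": "a", "mismatches": 0, "mate_distance": 200},
    {"id": "b", "mismatches": 2, "mate_distance": 200},
    {"id": "c", "mismatches": 0, "mate_distance": 200},
]
print([a["id"] for a in top_alignments(candidates)])
```

Keeping ties is what allows a read to be reported with more than one alignment.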
![Page 15: RNAseq](https://reader036.fdocuments.net/reader036/viewer/2022062521/56816741550346895ddbf491/html5/thumbnails/15.jpg)
Alignment Statistics (human cytosolic data – all reads pooled)
Pooled dataset: 426,542,817 reads
Pre-processing: 399,753,836 reads
Mapping: 342,289,432 reads aligned
Remapping: 355,646,083 reads aligned
Best alignments: 354,181,215 reads aligned
Changes annotated on the slide: -6%, -13%, -10%, -0.3%
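The slide does not say how the four percentages were computed. One reading that reproduces them, offered here as an assumption, is that each is a difference in read counts expressed as a fraction of the pooled dataset. The pairing of stages below is part of that assumption.

```python
POOLED = 426_542_817
PRE_PROCESSING = 399_753_836
MAPPING = 342_289_432
REMAPPING = 355_646_083
BEST = 354_181_215


def pct_of_pooled(before, after, total=POOLED):
    """Change in read count from one stage to another, as a percentage of the pooled reads."""
    return 100.0 * (after - before) / total


changes = {
    "pooled -> pre-processing": pct_of_pooled(POOLED, PRE_PROCESSING),
    "pre-processing -> mapping": pct_of_pooled(PRE_PROCESSING, MAPPING),
    "pre-processing -> remapping": pct_of_pooled(PRE_PROCESSING, REMAPPING),
    "remapping -> best alignments": pct_of_pooled(REMAPPING, BEST),
}
for label, value in changes.items():
    print(f"{label}: {value:.2f}%")
```

Under this reading, remapping recovers roughly 13 million reads that the first mapping pass left unaligned.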
![Page 16: RNAseq](https://reader036.fdocuments.net/reader036/viewer/2022062521/56816741550346895ddbf491/html5/thumbnails/16.jpg)
A view of alignment quality
![Page 17: RNAseq](https://reader036.fdocuments.net/reader036/viewer/2022062521/56816741550346895ddbf491/html5/thumbnails/17.jpg)
Performance

| Dataset | # total reads | Version | Time | Mem | Disk |
|---|---|---|---|---|---|
| Synthetic R1 | 80 Million | MapSplice 1.15 | ~39 hours (bowtie with 8 threads) | 4 GB | 600 GB |
| Synthetic R1 | 80 Million | MapSplice Parallel | ~4 hours with 8 threads | ~20 GB | 50 GB |
![Page 18: RNAseq](https://reader036.fdocuments.net/reader036/viewer/2022062521/56816741550346895ddbf491/html5/thumbnails/18.jpg)
Transcript Quantification
[Figure: RNA-seq alignments on the reference genome (5’ to 3’) over Exon 1, Exon 2, and Exon 3]
Reference transcript isoforms
Isoform 1: Exon 1, Exon 2, Exon 3
Isoform 2: Exon 1, Exon 3
What is the relative abundance of each isoform?
![Page 19: RNAseq](https://reader036.fdocuments.net/reader036/viewer/2022062521/56816741550346895ddbf491/html5/thumbnails/19.jpg)
Exon-Centric Approach
Observed coverage on exons: Exon 1 = 4, Exon 2 = 2, Exon 3 = 4
Reference transcript isoforms
Isoform 1: Exon 1, Exon 2, Exon 3 (isoform copy x)
Isoform 2: Exon 1, Exon 3 (isoform copy y)
![Page 20: RNAseq](https://reader036.fdocuments.net/reader036/viewer/2022062521/56816741550346895ddbf491/html5/thumbnails/20.jpg)
Exon-Centric Approach
Observed coverage on exons: Exon 1 = 4, Exon 2 = 2, Exon 3 = 4
Reference transcript isoforms
Isoform 1: Exon 1, Exon 2, Exon 3 (isoform copy x)
Isoform 2: Exon 1, Exon 3 (isoform copy y)
Exon 1: 4 = x + y
Exon 2: 2 = x
Exon 3: 4 = x + y
![Page 21: RNAseq](https://reader036.fdocuments.net/reader036/viewer/2022062521/56816741550346895ddbf491/html5/thumbnails/21.jpg)
Exon-Centric Approach
Observed coverage on exons: Exon 1 = 4, Exon 2 = 2, Exon 3 = 4
Reference transcript isoforms
Isoform 1: Exon 1, Exon 2, Exon 3 (isoform copy x)
Isoform 2: Exon 1, Exon 3 (isoform copy y)
Exon 1: 4 = x + y
Exon 2: 2 = x
Exon 3: 4 = x + y
Solution: x = 2, y = 2
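The three exon equations can be solved mechanically. The sketch below sets up the system from the slide and solves it by least squares through the 2x2 normal equations, using exact fractions. The function name and the choice of least squares are illustrative, not something the slides prescribe.

```python
from fractions import Fraction


def solve_two_isoforms(rows, coverage):
    """Least-squares solution of coverage = rows * (x, y) for two isoforms.

    rows[e] = (1 if exon e is in isoform 1 else 0, 1 if exon e is in isoform 2 else 0).
    Solves the normal equations (A^T A) v = A^T b with Cramer's rule.
    """
    a11 = sum(r[0] * r[0] for r in rows)
    a12 = sum(r[0] * r[1] for r in rows)
    a22 = sum(r[1] * r[1] for r in rows)
    b1 = sum(r[0] * c for r, c in zip(rows, coverage))
    b2 = sum(r[1] * c for r, c in zip(rows, coverage))
    det = Fraction(a11 * a22 - a12 * a12)
    if det == 0:
        raise ValueError("system is underdetermined")
    x = (a22 * b1 - a12 * b2) / det
    y = (a11 * b2 - a12 * b1) / det
    return x, y


# Exon 1 and Exon 3 are in both isoforms; Exon 2 is only in Isoform 1.
rows = [(1, 1), (1, 0), (1, 1)]
coverage = [4, 2, 4]
x, y = solve_two_isoforms(rows, coverage)
print(x, y)
```

With noisy real coverage the equations will not agree exactly, which is why a least-squares form is used here instead of plain substitution.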
![Page 22: RNAseq](https://reader036.fdocuments.net/reader036/viewer/2022062521/56816741550346895ddbf491/html5/thumbnails/22.jpg)
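Problem 1: underdetermined solutions
Observed coverage on exons: Exon 1 = 7, Exon 2 = 3, E3 = 7, Exon 4 = 3

| Reference transcript isoform | Exons | # copies (True) | # copies (P1) | # copies (P2) |
|---|---|---|---|---|
| Isoform 1 | Exon 1, Exon 2, E3 | 1 | 3 | 2 |
| Isoform 2 | Exon 1, E3 | 3 | 1 | 2 |
| Isoform 3 | Exon 1, Exon 2, E3, Exon 4 | 2 | 0 | 1 |
| Isoform 4 | Exon 1, E3, Exon 4 | 1 | 3 | 2 |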
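A short check makes the problem concrete: the three sets of isoform copy numbers on the slide all produce the same exon coverage, so exon coverage alone cannot tell them apart. The exon membership of each isoform is taken from the slide. Which column of numbers carries which label (True, P1, P2) is an assumption based on the order in which they appear.

```python
# Exons contained in each reference isoform (from the slide).
ISOFORM_EXONS = {
    "Isoform 1": {1, 2, 3},
    "Isoform 2": {1, 3},
    "Isoform 3": {1, 2, 3, 4},
    "Isoform 4": {1, 3, 4},
}

# Candidate copy numbers; the label-to-column mapping is assumed.
SOLUTIONS = {
    "True": {"Isoform 1": 1, "Isoform 2": 3, "Isoform 3": 2, "Isoform 4": 1},
    "P1": {"Isoform 1": 3, "Isoform 2": 1, "Isoform 3": 0, "Isoform 4": 3},
    "P2": {"Isoform 1": 2, "Isoform 2": 2, "Isoform 3": 1, "Isoform 4": 2},
}


def exon_coverage(copies):
    """Coverage of exons 1..4 implied by a set of isoform copy numbers."""
    cov = {exon: 0 for exon in (1, 2, 3, 4)}
    for isoform, n in copies.items():
        for exon in ISOFORM_EXONS[isoform]:
            cov[exon] += n
    return [cov[exon] for exon in (1, 2, 3, 4)]


for label, copies in SOLUTIONS.items():
    print(label, exon_coverage(copies))
```

There are four unknowns but only two independent constraints (Exon 1 and E3 give the same equation, and the Exon 2 and Exon 4 equations each involve Isoform 3), so a whole family of solutions fits the data.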
![Page 23: RNAseq](https://reader036.fdocuments.net/reader036/viewer/2022062521/56816741550346895ddbf491/html5/thumbnails/23.jpg)
Problem 2: Mappability
Example of a “high” expressed junction
Mappability tracks
![Page 24: RNAseq](https://reader036.fdocuments.net/reader036/viewer/2022062521/56816741550346895ddbf491/html5/thumbnails/24.jpg)
Abundance Estimation using EM
Li et al., RNA-Seq gene expression estimation with read mapping uncertainty, Bioinformatics, 26(4):493-500, 2010.
• Probabilistic framework to estimate gene and isoform abundance from a model incorporating read bias
• The general approach is to maximize the probability of observing the read alignments, given some expected isoform abundances.
– Explicitly handles multimapped reads (reads mapped to multiple genes or isoforms)
– 95% credibility intervals (CI) and a posterior mean estimate (PME) are computed in addition to the ML estimate.
• RSEM appears to be most accurate
– Cufflinks, IsoEM, and other methods follow a similar approach and agree to a large degree
We are currently developing an extension, Multisplice, that does this better!
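The EM idea can be illustrated with a deliberately stripped-down model. This sketch is not RSEM: it ignores isoform length, read bias, and alignment quality, and treats each read as equally likely to come from any isoform it is compatible with, weighted only by the current abundance estimate. All names are hypothetical.

```python
def em_abundance(read_compat, isoforms, n_iter=50):
    """Estimate relative isoform abundances from reads with ambiguous origin.

    read_compat: list of sets, one per read, naming the isoforms it maps to.
    Returns a dict of abundance fractions that sum to 1.
    """
    theta = {iso: 1.0 / len(isoforms) for iso in isoforms}
    n_reads = len(read_compat)
    for _ in range(n_iter):
        # E-step: split each read across its compatible isoforms
        # in proportion to the current abundance estimates.
        counts = {iso: 0.0 for iso in isoforms}
        for compat in read_compat:
            total = sum(theta[iso] for iso in compat)
            for iso in compat:
                counts[iso] += theta[iso] / total
        # M-step: new abundances are the normalized fractional counts.
        theta = {iso: counts[iso] / n_reads for iso in isoforms}
    return theta


# 3 reads unique to A, 1 unique to B, 4 multimapped to both.
reads = [{"A"}] * 3 + [{"B"}] + [{"A", "B"}] * 4
print(em_abundance(reads, ["A", "B"]))
```

In this toy case the multimapped reads end up divided 3:1 between A and B, following the ratio set by the uniquely mapped reads, instead of being discarded or split evenly.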
![Page 25: RNAseq](https://reader036.fdocuments.net/reader036/viewer/2022062521/56816741550346895ddbf491/html5/thumbnails/25.jpg)
RNAseq
• Parametric Models
– edgeR
– DESeq (NB)
– Generalized Poisson (λp)
– GeneCounter (NBp)
– RSEM (MLE, directed graph, various)
• Non-parametric
– (Li and Tibshirani)
– Biswas et al., in prep (hybrid)
![Page 26: RNAseq](https://reader036.fdocuments.net/reader036/viewer/2022062521/56816741550346895ddbf491/html5/thumbnails/26.jpg)
Comparison among methods
![Page 27: RNAseq](https://reader036.fdocuments.net/reader036/viewer/2022062521/56816741550346895ddbf491/html5/thumbnails/27.jpg)
Histograms of Expression levels of significant genes
![Page 28: RNAseq](https://reader036.fdocuments.net/reader036/viewer/2022062521/56816741550346895ddbf491/html5/thumbnails/28.jpg)
Bias..
![Page 29: RNAseq](https://reader036.fdocuments.net/reader036/viewer/2022062521/56816741550346895ddbf491/html5/thumbnails/29.jpg)
![Page 30: RNAseq](https://reader036.fdocuments.net/reader036/viewer/2022062521/56816741550346895ddbf491/html5/thumbnails/30.jpg)