Marius Nicolae Computer Science and Engineering Department University of Connecticut Joint work with...
Date posted: 21-Dec-2015
Estimation of alternative splicing isoform frequencies from RNA-Seq data
Marius Nicolae
Computer Science and Engineering Department
University of Connecticut
Joint work with Serghei Mangul, Ion Mandoiu and Alex Zelikovsky
Outline
Introduction
EM Algorithm
Results
Conclusions and future work
RNA-Seq
[Figure: RNA-Seq pipeline. Steps: make cDNA & shatter into fragments; sequence fragment ends; map reads. Applications: Gene Expression (GE), Isoform Discovery (ID), Isoform Expression (IE). Exon labels A B C D E; isoforms A B C, A C, D E.]
Gene Expression Challenges
Read ambiguity (multireads)
What is the gene length?
[Figure: exon diagram with labels A B C D E]
Previous approaches to GE
Ignore multireads
[Mortazavi et al. 08]
◦ Fractionally allocate multireads based on unique read estimates
[Pasaniuc et al. 10]
◦ EM algorithm for solving ambiguities
Gene length: sum of lengths of exons that appear in at least one isoform
◦ Underestimates expression levels for genes with 2 or more isoforms [Trapnell et al. 10]
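Fractional allocation of multireads can be illustrated with a small sketch. This is not the implementation from [Mortazavi et al. 08]: the function name `rescue_allocate`, the toy reads, and the choice of unique-read density (unique reads per base) as the allocation weight are all assumptions made for illustration.

```python
from collections import defaultdict

def rescue_allocate(reads, gene_length):
    """reads: one list of gene ids per read (the genes the read maps to).
    gene_length: {gene id: length in bases}.
    Returns fractional read counts per gene."""
    # Pass 1: count reads that map to exactly one gene.
    unique = defaultdict(float)
    for genes in reads:
        if len(genes) == 1:
            unique[genes[0]] += 1.0
    # Unique-read density is the allocation weight (assumption of this sketch).
    density = {g: unique[g] / gene_length[g] for g in gene_length}
    # Pass 2: split each multiread in proportion to the densities.
    counts = defaultdict(float, unique)
    for genes in reads:
        if len(genes) > 1:
            total = sum(density[g] for g in genes)
            for g in genes:
                # Fall back to an even split when no candidate has unique reads.
                share = density[g] / total if total > 0 else 1.0 / len(genes)
                counts[g] += share
    return dict(counts)

# Hypothetical toy data: three reads unique to g1, one unique to g2, one multiread.
reads = [["g1"], ["g1"], ["g1"], ["g2"], ["g1", "g2"]]
gene_length = {"g1": 1000, "g2": 1000}
counts = rescue_allocate(reads, gene_length)
```

In the toy data the single multiread is split roughly 3:1 in favour of g1, because g1 has three times the unique-read density of g2.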
Read Ambiguity in IE
[Figure: exon diagram with labels A B C D E and isoform A C]
Previous approaches to IE
[Jiang & Wong 09]
◦ Poisson model, single reads only
[Li et al. 10]
◦ EM Algorithm, single reads only
[Feng et al. 10]
◦ Convex quadratic program, pairs used only for ID
[Trapnell et al. 10]
◦ Extends Jiang’s model to paired reads
◦ Fragment length distribution
Our contributions
EM Algorithm for IE
◦ Single and paired reads
◦ Fragment length distribution
◦ Strand information
◦ Base quality scores
Solving GE by adding isoform levels
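Obtaining gene expression by adding isoform levels amounts to a group-by sum. A minimal sketch, with hypothetical isoform names, frequencies, and an isoform-to-gene map invented for illustration:

```python
from collections import defaultdict

def gene_levels(isoform_freq, isoform_to_gene):
    """Sum isoform frequencies within each gene."""
    gene = defaultdict(float)
    for iso, f in isoform_freq.items():
        gene[isoform_to_gene[iso]] += f
    return dict(gene)

# Hypothetical estimates for two genes.
isoform_freq = {"g1.ABC": 0.5, "g1.AC": 0.25, "g2.DE": 0.25}
isoform_to_gene = {"g1.ABC": "g1", "g1.AC": "g1", "g2.DE": "g2"}
levels = gene_levels(isoform_freq, isoform_to_gene)
```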
Read-Isoform Compatibility
Fragment length distribution
Paired reads
Single reads
[Figure: read placements on isoforms A B C and A C, with fragment length distribution plots]
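One way the fragment length distribution can inform read-isoform compatibility is sketched below. The details are assumptions, not the slide's formulas: fragment lengths are taken as normal with mean 250 and std. dev. 25 (the values used in the experimental setup), a paired read is weighted by the density of its implied fragment length, and a single read is weighted by the probability that a fragment fits between the read start and the isoform end. The function names and coordinates are hypothetical.

```python
import math

MEAN, STD = 250.0, 25.0  # assumed fragment length parameters

def normal_pdf(x, mean=MEAN, std=STD):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def normal_cdf(x, mean=MEAN, std=STD):
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

def paired_weight(left_start, right_end):
    """Both ends map to the isoform, so the fragment length is known."""
    return normal_pdf(right_end - left_start)

def single_weight(read_start, isoform_length):
    """Only one end is mapped: any fragment that fits between the read
    start and the isoform end is possible (assumption of this sketch)."""
    return normal_cdf(isoform_length - read_start)

# The same read pair implies a 400bp fragment on isoform A B C but a
# 250bp fragment on isoform A C, where exon B is skipped.
w_abc = paired_weight(100, 500)
w_ac = paired_weight(100, 350)
```

Here `w_ac` is much larger than `w_abc`, so the pair is far more likely to originate from the shorter isoform even though it maps to both.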
IsoEM algorithm
E-step
M-step
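The E-step and M-step can be illustrated with a generic EM loop for read assignment. This is a sketch under stated assumptions and is not claimed to reproduce IsoEM exactly: each read is assumed to come with precomputed compatibility weights per isoform, the M-step normalises expected read counts by isoform length, and `em_isoform_freq` and the toy input are hypothetical.

```python
def em_isoform_freq(read_weights, lengths, iterations=200):
    """read_weights: one {isoform: weight} dict per read.
    lengths: {isoform: length used for normalisation}.
    Returns estimated isoform frequencies that sum to 1."""
    isoforms = list(lengths)
    freq = {i: 1.0 / len(isoforms) for i in isoforms}  # uniform start
    for _ in range(iterations):
        # E-step: expected number of reads coming from each isoform.
        expected = {i: 0.0 for i in isoforms}
        for weights in read_weights:
            total = sum(freq[i] * w for i, w in weights.items())
            if total == 0.0:
                continue  # read is incompatible with every isoform
            for i, w in weights.items():
                expected[i] += freq[i] * w / total
        # M-step: normalise by length, then rescale to sum to 1.
        rate = {i: expected[i] / lengths[i] for i in isoforms}
        norm = sum(rate.values())
        freq = {i: rate[i] / norm for i in isoforms}
    return freq

# Toy input: 3 reads unique to ABC, 1 unique to AC, 4 compatible with both.
read_weights = (
    [{"ABC": 1.0}] * 3 + [{"AC": 1.0}] + [{"ABC": 1.0, "AC": 1.0}] * 4
)
freq = em_isoform_freq(read_weights, {"ABC": 1000, "AC": 1000})
```

With equal lengths and equal weights, the ambiguous reads end up split in the same 3:1 ratio as the unique reads, so the estimates approach 0.75 and 0.25.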
Experimental setup
Human genome UCSC known isoforms
GNFAtlas2 gene expression levels
◦ Uniform/geometric expression of gene isoforms
Normally distributed fragment lengths
◦ Mean 250, std. dev. 25
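The two simulation ingredients named above can be sketched as follows. The geometric ratio of 0.5, the seed, and the function names are assumptions for illustration; the slides do not state how the geometric model was parameterised.

```python
import random

def isoform_shares(n, model="uniform", ratio=0.5):
    """Split a gene's expression among its n isoforms."""
    if model == "uniform":
        raw = [1.0] * n
    elif model == "geometric":
        raw = [ratio ** k for k in range(n)]  # ratio is an assumed parameter
    else:
        raise ValueError(f"unknown model: {model}")
    total = sum(raw)
    return [x / total for x in raw]

def sample_fragment_length(rng, mean=250.0, std=25.0):
    """Draw one fragment length from a normal distribution, at least 1bp."""
    return max(1, round(rng.gauss(mean, std)))

rng = random.Random(0)  # fixed seed so the sketch is repeatable
fragment_lengths = [sample_fragment_length(rng) for _ in range(10000)]
shares = isoform_shares(3, "geometric")
```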
[Figure: number of genes (log scale, 1 to 100000) vs. number of isoforms (0 to 55)]
[Figure: number of isoforms (0 to 25000) vs. isoform length]
Accuracy measurements
Error Fraction (EF)
◦ Percentage of isoforms (or genes) with relative error larger than given threshold t
Median Percent Error (MPE)
◦ Threshold t for which EF is 50%
r²
◦ Coefficient of determination
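The three measures can be computed as in the sketch below. Two conventions here are assumptions rather than definitions taken from the slides: the relative error is treated as 0 when both values are 0 and as infinite when only the true value is 0, and r² is computed as the squared Pearson correlation between estimates and true values.

```python
import math

def relative_error(estimate, truth):
    if truth == 0.0:
        return 0.0 if estimate == 0.0 else math.inf  # assumed convention
    return abs(estimate - truth) / truth

def error_fraction(estimates, truths, t):
    """Percentage of items whose relative error exceeds threshold t."""
    errors = [relative_error(e, x) for e, x in zip(estimates, truths)]
    return 100.0 * sum(err > t for err in errors) / len(errors)

def median_percent_error(estimates, truths):
    """Median relative error in percent, i.e. the threshold where EF is 50%."""
    errors = sorted(relative_error(e, x) for e, x in zip(estimates, truths))
    n = len(errors)
    mid = n // 2
    median = errors[mid] if n % 2 else 0.5 * (errors[mid - 1] + errors[mid])
    return 100.0 * median

def r_squared(estimates, truths):
    """Squared Pearson correlation (assumed reading of r²)."""
    n = len(truths)
    me, mt = sum(estimates) / n, sum(truths) / n
    cov = sum((e - me) * (x - mt) for e, x in zip(estimates, truths))
    ve = sum((e - me) ** 2 for e in estimates)
    vt = sum((x - mt) ** 2 for x in truths)
    return cov * cov / (ve * vt)

# Hypothetical true and estimated frequencies.
truths = [1.0, 2.0, 4.0, 8.0]
estimates = [1.1, 2.0, 3.0, 12.0]
ef = error_fraction(estimates, truths, 0.2)
mpe = median_percent_error(estimates, truths)
```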
Isoform Error Fraction Curves
30M single reads of length 25
Main difference between IsoEM and RSEM is fragment length modeling
[Figure: % of isoforms over threshold (0 to 100) vs. relative error threshold (0 to 1); curves: Uniq, Rescue, UniqLN, RSEM, IsoEM]
Gene Error Fraction Curves
30M single reads of length 25
[Figure: % of genes over threshold (0 to 100) vs. relative error threshold (0 to 1); curves: Uniq, Rescue, GeneEM, RSEM, IsoEM]
Read Length Effect
Fixed sequencing throughput (750Mb)
50bp reads better than 100bp!
[Figure: Median Percent Error (0 to 25) vs. read length (25 to 95); curves: Paired reads, Single reads]
[Figure: r² (0.962 to 0.978) vs. read length (25 to 95); curves: Paired reads, Single reads]
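At a fixed throughput, longer reads mean fewer reads, which is the trade-off behind the read length comparison. The arithmetic below assumes single reads and that throughput is simply read count times read length; how pairs are counted toward the 750Mb is not stated in the slides.

```python
THROUGHPUT_BASES = 750_000_000  # 750Mb, as stated for this experiment

def reads_for_length(read_length, throughput=THROUGHPUT_BASES):
    """Number of single reads of the given length that fit in the throughput."""
    return throughput // read_length

table = {length: reads_for_length(length) for length in (25, 50, 100)}
```

Under these assumptions, 25bp reads give 30M reads, 50bp reads give 15M, and 100bp reads give 7.5M.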
Effect of Pairs & Strand Information
1-60M 75bp reads
Pairs help, strand info doesn’t
[Trapnell et al. 10]: r² = .95 for 13M PE reads
[Figure: r² (0.925 to 0.985) vs. number of reads (0 to 60M); curves: RandomStrand-Pairs-PerfectMapping, RandomStrand-Pairs, CodingStrand-pairs, RandomStrand-Single, CodingStrand-single]
Conclusions & Future Work
Presented EM algorithm for isoform frequency estimation that exploits fragment length distribution for both single and paired reads
◦ Significant accuracy improvement over existing methods
◦ Code and datasets to be released publicly soon
Ongoing extensions
◦ Confidence intervals
◦ Allele-specific isoform expression
◦ Testing for novel isoforms
◦ Integration with isoform discovery
Questions?