Marius Nicolae Computer Science and Engineering Department University of Connecticut Joint work with...

Estimation of alternative splicing isoform frequencies from RNA-Seq data

Marius Nicolae

Computer Science and Engineering Department

University of Connecticut

Joint work with Serghei Mangul, Ion Mandoiu and Alex Zelikovsky

Introduction

EM Algorithm

Results

Conclusions and future work

Outline

RNA-Seq

[Figure: gene with exons A, B, C, D, E and the exon combinations A B C, A C, and D E]

Make cDNA & shatter into fragments

Sequence fragment ends

Map reads

Gene Expression (GE)

Isoform Discovery (ID)

Isoform Expression (IE)

Read ambiguity (multireads)

What is the gene length?

Gene Expression Challenges

[Figure: gene with exons A, B, C, D, E]

Ignore multireads

[Mortazavi et al. 08]
◦ Fractionally allocate multireads based on unique read estimates

[Pasaniuc et al. 10]
◦ EM algorithm for solving ambiguities

Gene length: sum of lengths of exons that appear in at least one isoform
◦ Underestimates expression levels for genes with 2 or more isoforms [Trapnell et al. 10]

Previous approaches to GE
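The fractional allocation idea can be sketched as below. This is a minimal illustration and not the code of [Mortazavi et al. 08]: the function name `rescue_allocate`, the input layout, the per-base normalization, and the toy counts are all assumptions.

```python
from collections import defaultdict

def rescue_allocate(unique_counts, gene_lengths, multireads):
    """Split each multiread among its candidate genes in proportion to
    the unique-read density (unique reads per base) of each gene.

    unique_counts: dict gene -> number of uniquely mapped reads
    gene_lengths:  dict gene -> gene length in bases
    multireads:    list of lists; each inner list holds the genes that
                   one ambiguous read maps to
    Returns dict gene -> unique plus fractional read count.
    """
    density = {g: unique_counts.get(g, 0) / gene_lengths[g]
               for g in gene_lengths}
    totals = defaultdict(float)
    for g, c in unique_counts.items():
        totals[g] += c
    for genes in multireads:
        weight_sum = sum(density[g] for g in genes)
        for g in genes:
            if weight_sum > 0:
                share = density[g] / weight_sum
            else:
                # No unique evidence for any candidate: split evenly.
                share = 1.0 / len(genes)
            totals[g] += share
    return dict(totals)

# Hypothetical toy data: two equal-length genes sharing four multireads.
counts = rescue_allocate(
    unique_counts={"g1": 30, "g2": 10},
    gene_lengths={"g1": 1000, "g2": 1000},
    multireads=[["g1", "g2"]] * 4,
)
```

In this toy input g1 has three times the unique-read density of g2, so it receives three quarters of every shared read.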

Read Ambiguity in IE

[Figure: gene with exons A, B, C, D, E and isoform A C]

[Jiang & Wong 09]
◦ Poisson model, single reads only

[Li et al. 10]
◦ EM Algorithm, single reads only

[Feng et al. 10]
◦ Convex quadratic program, pairs used only for ID

[Trapnell et al. 10]
◦ Extends Jiang’s model to paired reads
◦ Fragment length distribution

Previous approaches to IE

EM Algorithm for IE
◦ Single and paired reads
◦ Fragment length distribution
◦ Strand information
◦ Base quality scores

Solving GE by adding isoform levels

Our contributions
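Solving GE by adding isoform levels amounts to a per-gene sum. A minimal sketch, assuming isoform frequencies are already estimated; the function name `gene_levels` and the isoform and gene names are hypothetical.

```python
from collections import defaultdict

def gene_levels(isoform_freq, isoform_to_gene):
    """Sum estimated isoform frequencies over the isoforms of each gene."""
    totals = defaultdict(float)
    for isoform, freq in isoform_freq.items():
        totals[isoform_to_gene[isoform]] += freq
    return dict(totals)

# Hypothetical estimates: gene1 has two isoforms, gene2 has one.
levels = gene_levels(
    {"iso1": 0.5, "iso2": 0.25, "iso3": 0.25},
    {"iso1": "gene1", "iso2": "gene1", "iso3": "gene2"},
)
```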


Outline

Read-Isoform Compatibility

Paired reads

Single reads

Fragment length distribution

[Figure: reads aligned to isoforms A B C and A C, shown for paired reads, single reads, and the fragment length distribution]
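How fragment length can separate two isoforms for a paired read is sketched below. The weighting formula used in the talk is not recoverable from this transcript, so the code shows only a fragment-length factor under a normal model with mean 250 and std. dev. 25; the exon lengths, read coordinates, and function names are assumptions.

```python
import math

def normal_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def implied_fragment_length(isoform, exon_len, left, right):
    """Fragment length implied by a read pair on one isoform.

    isoform:  ordered list of exon names, e.g. ["A", "B", "C"]
    exon_len: dict exon name -> length in bases
    left:     (exon, offset) where the fragment starts
    right:    (exon, offset) one past where the fragment ends
    """
    i = isoform.index(left[0])
    j = isoform.index(right[0])
    if i == j:
        return right[1] - left[1]
    # Exons lying strictly between the two read ends on this isoform.
    between = sum(exon_len[e] for e in isoform[i + 1:j])
    return (exon_len[left[0]] - left[1]) + between + right[1]

# Hypothetical exon lengths and read-pair coordinates.
exon_len = {"A": 200, "B": 100, "C": 200}
isoforms = {"ABC": ["A", "B", "C"], "AC": ["A", "C"]}
left, right = ("A", 50), ("C", 100)

lengths = {name: implied_fragment_length(exons, exon_len, left, right)
           for name, exons in isoforms.items()}
# Fragment-length factor of the compatibility weight (mean 250, std 25).
weights = {name: normal_pdf(length, 250.0, 25.0)
           for name, length in lengths.items()}
```

The same pair implies a 350-base fragment on A B C and a 250-base fragment on A C, so under this model the pair is far more compatible with A C.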

IsoEM algorithm

E-step

M-step
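The E-step and M-step formulas from the slide did not survive extraction. As a stand-in, here is a generic EM for mixture proportions: it estimates the fraction of reads coming from each isoform given read-isoform compatibility weights. It is a sketch under stated assumptions and not the IsoEM implementation; in particular it omits any isoform-length normalization, and the name `em_read_fractions` is hypothetical.

```python
def em_read_fractions(weights, n_isoforms, iterations=100):
    """Generic EM for mixture proportions.

    weights: list with one dict per read, mapping isoform index to a
             positive compatibility weight
    Returns the estimated fraction of reads from each isoform.
    """
    freq = [1.0 / n_isoforms] * n_isoforms
    for _ in range(iterations):
        expected = [0.0] * n_isoforms
        # E-step: split each read among its compatible isoforms in
        # proportion to current frequency times compatibility weight.
        for w in weights:
            total = sum(freq[i] * wi for i, wi in w.items())
            if total == 0.0:
                continue
            for i, wi in w.items():
                expected[i] += freq[i] * wi / total
        # M-step: new frequencies are the normalized expected counts.
        norm = sum(expected)
        if norm == 0.0:
            break
        freq = [e / norm for e in expected]
    return freq

# Hypothetical reads: 6 unique to isoform 0, 2 unique to isoform 1,
# and 4 equally compatible with both.
reads = [{0: 1.0}] * 6 + [{1: 1.0}] * 2 + [{0: 1.0, 1: 1.0}] * 4
freq = em_read_fractions(reads, n_isoforms=2)
```

The ambiguous reads carry no information about the split here, so the estimate settles at the ratio given by the unique reads.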


Outline

Human genome UCSC known isoforms

GNFAtlas2 gene expression levels
◦ Uniform/geometric expression of gene isoforms

Normally distributed fragment lengths
◦ Mean 250, std. dev. 25

Experimental setup
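The simulation settings can be sketched as follows. Only the uniform/geometric choice and the fragment length parameters (mean 250, std. dev. 25) come from the slide; the geometric ratio of 0.5, the seed, and the function names are assumptions for illustration.

```python
import random

def isoform_fractions(n_isoforms, model="uniform", ratio=0.5):
    """Share of a gene's expression assigned to each of its isoforms.

    The geometric ratio is an assumed parameter, not given in the talk.
    """
    if model == "uniform":
        raw = [1.0] * n_isoforms
    elif model == "geometric":
        raw = [ratio ** k for k in range(n_isoforms)]
    else:
        raise ValueError("model must be 'uniform' or 'geometric'")
    total = sum(raw)
    return [x / total for x in raw]

def sample_fragment_lengths(n, mean=250.0, std=25.0, seed=0):
    """Draw n fragment lengths from a normal distribution, rounded to
    whole bases and kept at least 1."""
    rng = random.Random(seed)
    return [max(1, round(rng.gauss(mean, std))) for _ in range(n)]

uniform = isoform_fractions(4, "uniform")
geometric = isoform_fractions(3, "geometric")
fragments = sample_fragment_lengths(10000)
```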

[Figure: number of genes (log scale, 1 to 100000) versus number of isoforms (0 to 55)]

[Figure: number of isoforms (0 to 25000) versus isoform length]

Error Fraction (EF)
◦ Percentage of isoforms (or genes) with relative error larger than a given threshold t

Median Percent Error (MPE)
◦ Threshold t for which EF is 50%

r²
◦ Coefficient of determination

Accuracy measurements
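The three measures can be computed as sketched below. The function names and toy values are hypothetical. Two choices are assumptions: MPE is reported as a percentage, and r² is taken to be the squared Pearson correlation between true and estimated values.

```python
import math
import statistics

def relative_errors(true, est):
    """Relative error per isoform or gene; assumes all true values > 0."""
    return [abs(e - t) / t for t, e in zip(true, est)]

def error_fraction(true, est, threshold):
    """EF: percentage of items with relative error above the threshold."""
    errs = relative_errors(true, est)
    return 100.0 * sum(1 for x in errs if x > threshold) / len(errs)

def median_percent_error(true, est):
    """MPE: the threshold at which EF is 50%, i.e. the median relative
    error, expressed here as a percentage."""
    return 100.0 * statistics.median(relative_errors(true, est))

def r_squared(true, est):
    """Squared Pearson correlation between true and estimated values."""
    mt, me = statistics.fmean(true), statistics.fmean(est)
    cov = sum((t - mt) * (e - me) for t, e in zip(true, est))
    var_t = sum((t - mt) ** 2 for t in true)
    var_e = sum((e - me) ** 2 for e in est)
    return cov * cov / (var_t * var_e)

# Hypothetical true and estimated frequencies.
true_vals = [10.0, 20.0, 40.0, 80.0]
est_vals = [11.0, 20.0, 30.0, 100.0]
ef = error_fraction(true_vals, est_vals, threshold=0.2)
mpe = median_percent_error(true_vals, est_vals)
r2 = r_squared(true_vals, est_vals)
```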

30M single reads of length 25

Main difference between IsoEM and RSEM is fragment length modeling

[Figure: % of isoforms over threshold (0 to 100) versus relative error threshold (0 to 1) for Uniq, Rescue, UniqLN, RSEM, and IsoEM]

Isoform Error Fraction Curves

Gene Error Fraction Curves

[Figure: % of genes over threshold (0 to 100) versus relative error threshold (0 to 1) for Uniq, Rescue, GeneEM, RSEM, and IsoEM]

30M single reads of length 25

Fixed sequencing throughput (750Mb)

50bp reads better than 100bp!

Read Length Effect
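With throughput fixed, longer reads mean fewer reads. A small sketch of that arithmetic, assuming 750Mb means 750,000,000 sequenced bases and that a paired read consumes two read lengths; the function name `reads_for_length` is hypothetical.

```python
THROUGHPUT_BASES = 750_000_000  # assumed meaning of "750Mb"

def reads_for_length(read_length, paired=False, throughput=THROUGHPUT_BASES):
    """Number of reads (or read pairs) obtainable at a fixed throughput."""
    bases_per_fragment = read_length * (2 if paired else 1)
    return throughput // bases_per_fragment

single = {length: reads_for_length(length) for length in (25, 50, 75, 100)}
pairs_75 = reads_for_length(75, paired=True)
```

Under these assumptions the budget gives 30M single reads at length 25 and 7.5M at length 100, which shows the trade-off between read length and read count.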

[Figure: Median Percent Error (0 to 25) versus read length (25 to 95) for paired and single reads]

[Figure: r² (0.962 to 0.978) versus read length (25 to 95) for paired and single reads]

1-60M 75bp reads

Pairs help, strand info doesn’t

[Trapnell et al. 10] r² = 0.95 for 13M PE reads

Effect of Pairs & Strand Information

[Figure: r² (0.925 to 0.985) versus number of reads (0 to 60,000,000) for RandomStrand-Pairs-PerfectMapping, RandomStrand-Pairs, CodingStrand-pairs, RandomStrand-Single, and CodingStrand-single]


Outline

Presented EM algorithm for isoform frequency estimation that exploits fragment length distribution for both single and paired reads
◦ Significant accuracy improvement over existing methods
◦ Code and datasets to be released publicly soon

Ongoing extensions
◦ Confidence intervals
◦ Allele-specific isoform expression
◦ Testing for novel isoforms
◦ Integration with isoform discovery

Conclusions & Future Work

Questions?
