RNAseq: isoform expression quantification and transcript...

32
RNAseq: isoform expression quantification and transcript assembly Slides courtesy from S. Salzberg, C. Trapnell, L. Pachter and K. Okrah

Transcript of RNAseq: isoform expression quantification and transcript...

Page 1: RNAseq: isoform expression quantification and transcript …users.umiacs.umd.edu/~hcorrada/CMSC858B/lectures/lect13/... · 2012. 3. 5. · transcript assembly Slides courtesy from

RNAseq: isoform expression quantification and transcript assembly

Slides courtesy from S. Salzberg, C. Trapnell, L. Pachter and K. Okrah

Page 2: RNAseq: isoform expression quantification and transcript …users.umiacs.umd.edu/~hcorrada/CMSC858B/lectures/lect13/... · 2012. 3. 5. · transcript assembly Slides courtesy from

!

Corrada Bravo 10/30/09

Sec-gen Sequencing

2

mRNA fragments to be sequenced

Page 3: RNAseq: isoform expression quantification and transcript …users.umiacs.umd.edu/~hcorrada/CMSC858B/lectures/lect13/... · 2012. 3. 5. · transcript assembly Slides courtesy from

!

Corrada Bravo 10/30/09

Sec-gen SequencingPaired-Ends

3

mRNA fragments to be sequenced

In paired-end sequencing reads are generated from both ends of a fragment

Page 4: RNAseq: isoform expression quantification and transcript …users.umiacs.umd.edu/~hcorrada/CMSC858B/lectures/lect13/... · 2012. 3. 5. · transcript assembly Slides courtesy from

Goal:  Develop  and  analyze  a  sta1s1cal  model  for  measuring  differen1al  expression  ofIsoforms  of  the  same  gene  using  Rna-­‐Seq.  

Source:  Computa.onal  Genome  Analysis

Recall:

Page 5: RNAseq: isoform expression quantification and transcript …users.umiacs.umd.edu/~hcorrada/CMSC858B/lectures/lect13/... · 2012. 3. 5. · transcript assembly Slides courtesy from

Isoform  1  :  True  abundance  measure  

Isoform  2  :  True  abundance  measure

Isoform  3  :  True  abundance  measure  

Isoform  4  :  True  abundance  measure

The goal is to estimate the true abundance measure of the 4 isoforms.

Suppose  we  have  a  gene  with  4  isoforms  and  3  alterna1vely  spliced  (AS)  exonsas  shown  above.

AS  exons

GENE

Transcrip

t  pop

ula.

on

5’  UTR 3’  UTR

Page 6: RNAseq: isoform expression quantification and transcript …users.umiacs.umd.edu/~hcorrada/CMSC858B/lectures/lect13/... · 2012. 3. 5. · transcript assembly Slides courtesy from

Fragmented  mRNas:    54  t0tal  reads  with  18  unique  types.

s1 s2 s3 s4 s7s6

s1

s8

s11

s2

s11

s11

s11

s12 s13

s13

s3 s10s9

s7

s5

s1 s2 s3 s4 s5 s6 s7 s8

s1 s2 s3 s4 s5 s6 s7 s8

s10

s12

s8

s15s14s13

s17

s8s7

s13s12 s14 s15

s18s12

s16

s16

spliced  reads

Page 7: RNAseq: isoform expression quantification and transcript …users.umiacs.umd.edu/~hcorrada/CMSC858B/lectures/lect13/... · 2012. 3. 5. · transcript assembly Slides courtesy from

TopHat for second generation RNA-Seq: spliced read alignment

• Suitable for

• short reads (25-50bp)

• long reads (100+ bp)

• paired end reads

• New features since 0.8x (Trapnell et al., Bioinformatics 2009)

• Much faster, almost fully threaded

• Semi-canonical introns (GC-AG and AT-AC) and some support for microexons

7

...

...

read set

genome

1. TopHat

Page 8: RNAseq: isoform expression quantification and transcript …users.umiacs.umd.edu/~hcorrada/CMSC858B/lectures/lect13/... · 2012. 3. 5. · transcript assembly Slides courtesy from

FPKM

• Expected number of Fragments Per Kilobase (of transcript) per Million fragments sequenced in an RNA-Seq experiment.

•These units are proportional to the .

8

θi

Page 9: RNAseq: isoform expression quantification and transcript …users.umiacs.umd.edu/~hcorrada/CMSC858B/lectures/lect13/... · 2012. 3. 5. · transcript assembly Slides courtesy from

FPKMg =1R

�ra

lengtha

�+

1R

�rb

lengthb

FPKMproj(g) =1R

�ra + rb

lengthproj(g)

ra

lengtha≥ ra

lengthproj(g),

rb

lengthb≥ rb

lengthproj(g)

FPKMg ≥ FPKMproj(g)

Projective normalization underestimates expression

9

isoform aisoform b

project all isoforms into genome coordinates

R reads total, r reads for the gene:- ra for isoform a- rb for isoform b

but so

Page 10: RNAseq: isoform expression quantification and transcript …users.umiacs.umd.edu/~hcorrada/CMSC858B/lectures/lect13/... · 2012. 3. 5. · transcript assembly Slides courtesy from

How should expression levels be estimated?

10

• A-B are distinguished by the presence of splice junction (a) or (b).

• A-C are distinguished by the presence of splice junction (a) and change in UTR

• B-C are distinguished by the presence of splice junction (b) and change in UTR

(a)(b)

Page 11: RNAseq: isoform expression quantification and transcript …users.umiacs.umd.edu/~hcorrada/CMSC858B/lectures/lect13/... · 2012. 3. 5. · transcript assembly Slides courtesy from

How should expression levels be estimated?

11

• Longer transcripts contain more reads.

• Reads that could have originated from multiple transcripts are informative.

• Relative abundance estimation requires “discriminatory reads”.

(a)(b)

Page 12: RNAseq: isoform expression quantification and transcript …users.umiacs.umd.edu/~hcorrada/CMSC858B/lectures/lect13/... · 2012. 3. 5. · transcript assembly Slides courtesy from

Isoform-level expression quantification

Jiang and Wong. Bioinformatics, 2009.Salzman, Jiang and Wong. Statistical Science, 2011.

Page 13: RNAseq: isoform expression quantification and transcript …users.umiacs.umd.edu/~hcorrada/CMSC858B/lectures/lect13/... · 2012. 3. 5. · transcript assembly Slides courtesy from

Isoform  1  :  True  abundance  measure  

Isoform  2  :  True  abundance  measure

Isoform  3  :  True  abundance  measure  

Isoform  4  :  True  abundance  measure

The goal is to estimate the true abundance measure of the 4 isoforms.

Suppose  we  have  a  gene  with  4  isoforms  and  3  alterna1vely  spliced  (AS)  exonsas  shown  above.

AS  exons

GENE

Transcrip

t  pop

ula.

on

5’  UTR 3’  UTR

Page 14: RNAseq: isoform expression quantification and transcript …users.umiacs.umd.edu/~hcorrada/CMSC858B/lectures/lect13/... · 2012. 3. 5. · transcript assembly Slides courtesy from

14Example:  mouse  RNAseq  data

STATISTICAL MODELING OF RNA-SEQ DATA 75

FIG. 6. Visualization of RNA-Seq read pairs mapped to the mouse gene Rnpep in the CisGenome Browser (see Jiang et al., 2010). Fromtop to bottom: genomic coordinates, gene structure where exons are magnified for better visualization, read pairs mapped to the gene. Readsare 32 bp at each end. A read that spans a junction between two exons is represented by a wider box.

amino peptidase, meaning that it is used to degradeproteins in the cell. After mapping, 116 read pairs wereassigned to this gene, out of which 113 read pairs wereused in the computation after outlier removal. Figure 6presents the positions where the reads are mapped.The gene was picked because it has two alternativelyspliced isoforms with a structure that makes distin-guishing reads from each isoform challenging, and be-cause the number of reads was small enough to visual-ize all of them in a simple figure.

5.1 Uniform Sampling Model

Any paired end read experiment can be treated as asingle end read experiment by taking each paired endread and treating it as two distinct single end reads, one

from each side of the pair. In this, the 113 paired endreads become 226 single end reads (without pairing in-formation).

In the uniform sampling model, for either isoform,the sampling rate vector for each read sj can take atmost two values: 2n when the isoform can generateread j and 0 when it cannot. Because there are onlytwo isoforms, one of which (isoform 2) excludes oneof the exons of the other (isoform 1), it is evident thatin the uniform sampling model, there are only threecategories for the two isoforms.

The total length of isoform 1 is 2,300. The totallength of isoform 2 is 2,183. Hence, computing ai,j bysumming over the sampling rate vectors of the readsin the same category, the three categories can be rep-

Page 15: RNAseq: isoform expression quantification and transcript …users.umiacs.umd.edu/~hcorrada/CMSC858B/lectures/lect13/... · 2012. 3. 5. · transcript assembly Slides courtesy from

Fragmented  mRNas:    54  t0tal  reads  with  18  unique  types.

s1 s2 s3 s4 s7s6

s1

s8

s11

s2

s11

s11

s11

s12 s13

s13

s3 s10s9

s7

s5

s1 s2 s3 s4 s5 s6 s7 s8

s1 s2 s3 s4 s5 s6 s7 s8

s10

s12

s8

s15s14s13

s17

s8s7

s13s12 s14 s15

s18s12

s16

s16

Sampling  rate:

The  ability  for  each  of  the  54  reads  to  be  sequenced  depends  on:

1.Transcript  fragmenta3on.2.  Size  selec3on.3.  Sequence  specificamplifica3on  of  selec3on.  

Page 16: RNAseq: isoform expression quantification and transcript …users.umiacs.umd.edu/~hcorrada/CMSC858B/lectures/lect13/... · 2012. 3. 5. · transcript assembly Slides courtesy from

3.3  Likelihood  Function

s1 s2 s3 s4 s5 s6 s7 s8 s9 s10 s11 s12 s13 s14 s15 s16 s17 s18

θ1 3 3 3 3 3 3 3 3 0 0 0 0 0 0 0 0 0 0 24

θ2 1 1 1 0 0 0 2 2 1 2 1 1 1 0 0 0 0 0 13

θ3 0 0 0 0 0 0 0 0 0 0 2 2 2 2 2 2 0 0 12

θ4 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 1 1 5

4 4 4 3 3 3 5 5 1 2 4 4 4 2 2 2 1 1 54

For  each  read  type,  we  only  observe  nj.  We  want  to  es1mate  last  column  (transcript  abundance).

Last  lecture  concentrated  on  using  the  sum  over  the  en1re  table  (54)  for  posi1ons  that  overlap  every  transcript

Page 17: RNAseq: isoform expression quantification and transcript …users.umiacs.umd.edu/~hcorrada/CMSC858B/lectures/lect13/... · 2012. 3. 5. · transcript assembly Slides courtesy from

3.3  Likelihood  Function

s1 s2 s3 s4 s5 s6 s7 s8 s9 s10 s11 s12 s13 s14 s15 s16 s17 s18

θ1 3 3 3 3 3 3 3 3 0 0 0 0 0 0 0 0 0 0 24

θ2 1 1 1 0 0 0 2 2 1 2 1 1 1 0 0 0 0 0 13

θ3 0 0 0 0 0 0 0 0 0 0 2 2 2 2 2 2 0 0 12

θ4 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 1 1 5

4 4 4 3 3 3 5 5 1 2 4 4 4 2 2 2 1 1 54

• In reality we only observe nj = niji=1

I

! .

• nj ~ Poisson( !iaiji=1

I

! =! Taj ), where !=!1

...! I

"

#

$$$$

%

&

''''

, aj=

a1 j

...aIj

"

#

$$$$

%

&

''''

.

• Likelihood: f! (n1,n2,...,nJ ) =(! Taj )

nj e(!Taj

nj !j=1

J

) .

Page 18: RNAseq: isoform expression quantification and transcript …users.umiacs.umd.edu/~hcorrada/CMSC858B/lectures/lect13/... · 2012. 3. 5. · transcript assembly Slides courtesy from

Uniform  sampling  model

• Appropriate  for  single  read  data.  (transcript  length  is  not  considered)

Model  for  A:

Interpreta3on  of  abundance:

This choice of this A means that !i =cilicii!

,

where li is the length of transcript i and ci isthe numberof copies in the ith transcript in the sample.

Remember  FPKM!

Page 19: RNAseq: isoform expression quantification and transcript …users.umiacs.umd.edu/~hcorrada/CMSC858B/lectures/lect13/... · 2012. 3. 5. · transcript assembly Slides courtesy from

How do you fit it?

• The use of Poisson model makes things very easy

• The idea is to use Maximum Likelihood Estimation: find estimates that maximize the probability of observed data under Poisson model!

• Equivalent to a convex optimization problem:

maximize nT log(AT θ)− sum(AT θ)s.t. θ ≥ 0

Page 20: RNAseq: isoform expression quantification and transcript …users.umiacs.umd.edu/~hcorrada/CMSC858B/lectures/lect13/... · 2012. 3. 5. · transcript assembly Slides courtesy from

RNAseq: transcript assembly and quantification

All Slides courtesy from S. Salzberg, C. Trapnell and L. PachterTrapnell, et al. Nature Biotechnology, 2010.

Page 21: RNAseq: isoform expression quantification and transcript …users.umiacs.umd.edu/~hcorrada/CMSC858B/lectures/lect13/... · 2012. 3. 5. · transcript assembly Slides courtesy from

Overview of cufflinks

21

...

g1

g2

g3

g4

gm

genome

Partition into loci

Minimal set of explaining transcriptsvia Dilworth’s theorem

Statistical estimation of transcript abundances

Transcriptome

g1 g2 g3 g4 gm

ρt

ρt

Page 22: RNAseq: isoform expression quantification and transcript …users.umiacs.umd.edu/~hcorrada/CMSC858B/lectures/lect13/... · 2012. 3. 5. · transcript assembly Slides courtesy from

Comparative transcript assembly

• Desirable properties of an assembly: consistency, parsimony and identifiability.

• Dilworth’s theorem and its application to transcript assembly.

• The Cufflinks assembler.

• Promoter discovery and novel isoforms.

• Lessons learned.

22

Page 23: RNAseq: isoform expression quantification and transcript …users.umiacs.umd.edu/~hcorrada/CMSC858B/lectures/lect13/... · 2012. 3. 5. · transcript assembly Slides courtesy from

Transcriptome assembly with a reference genome

23

How many transcripts?

Don’t know that two reads came from the same transcripts, but sometimes know that they came from different transcripts

Genome

reads

Page 24: RNAseq: isoform expression quantification and transcript …users.umiacs.umd.edu/~hcorrada/CMSC858B/lectures/lect13/... · 2012. 3. 5. · transcript assembly Slides courtesy from

A partial order on paired end read alignments

• Alignment x y when

• x starts to the left of y in the reference

• x and y overlap consistently

• y is not contained in x

• That is, x y when they could have come from the same transcript

24

Page 25: RNAseq: isoform expression quantification and transcript …users.umiacs.umd.edu/~hcorrada/CMSC858B/lectures/lect13/... · 2012. 3. 5. · transcript assembly Slides courtesy from

Dilworth’s theorem applied to the read partial order

• Definition: an antichain in the read partial order is a set of alignments with the property that no two are compatible (i.e. could arise from the same transcript).

• Theorem [R.P. Dilworth, “A decomposition theorem for partially ordered sets”, Annals of Mathematics, 1950]: The size of the largest antichain is equal to the minimum size of a chain partition.

25

Page 26: RNAseq: isoform expression quantification and transcript …users.umiacs.umd.edu/~hcorrada/CMSC858B/lectures/lect13/... · 2012. 3. 5. · transcript assembly Slides courtesy from

Dilworth’s theorem applied to the read partial order

• Definition: an antichain in the read partial order is a set of alignments with the property that no two are compatible (i.e. could arise from the same transcript).

• Theorem [R.P. Dilworth, “A decomposition theorem for partially ordered sets”, Annals of Mathematics, 1950]: The size of the largest antichain is the minimum number of transcripts needed to explain the alignments.

• There is a constructive proof of the theorem, which reduces the problem to finding a maximum matching in a bipartite graph. The Hopcroft-Karp algorithm solves this problem in time where we have V=M, the number of fragments sequenced.

• We rely instead on a maximum weighted matching algorithm; the best running time for weighted maximum matching is .

• This approach builds on ideas from N. Eriksson et al. (PLoS Computational Biology 2008) where a similar parsimony approach is used for viral population estimation.

26

O(√

V E)

O(V 2logV + V E)

Page 27: RNAseq: isoform expression quantification and transcript …users.umiacs.umd.edu/~hcorrada/CMSC858B/lectures/lect13/... · 2012. 3. 5. · transcript assembly Slides courtesy from

Phasing splicing events using weighted matching

27

A B

CD E

Page 28: RNAseq: isoform expression quantification and transcript …users.umiacs.umd.edu/~hcorrada/CMSC858B/lectures/lect13/... · 2012. 3. 5. · transcript assembly Slides courtesy from

Properties of Cufflinks assemblies

• The assemblies are parsimonious- guarantee that the number of assembled transcripts is minimal.

• In the case of multiple minimal assemblies, likelihoods are compared in order to pick the best phasing.

• Identifiability of the resulting models is a corollary of Dilworth’s theorem (the maximum antichain is a permutation submatrix of the read-transcript matrix, hence the latter is full rank).

28

Page 29: RNAseq: isoform expression quantification and transcript …users.umiacs.umd.edu/~hcorrada/CMSC858B/lectures/lect13/... · 2012. 3. 5. · transcript assembly Slides courtesy from

Discovery is necessary for accurate abundance estimates

29

!

Page 30: RNAseq: isoform expression quantification and transcript …users.umiacs.umd.edu/~hcorrada/CMSC858B/lectures/lect13/... · 2012. 3. 5. · transcript assembly Slides courtesy from

RNA-Seq time course analysis

• Measuring changes in relative abundances over time.

• Iosoform switching and generalizations.

• Inference of transcriptional versus post-transcriptional regulation.

30

Page 31: RNAseq: isoform expression quantification and transcript …users.umiacs.umd.edu/~hcorrada/CMSC858B/lectures/lect13/... · 2012. 3. 5. · transcript assembly Slides courtesy from

The skeletal myogenesis transcriptomeRNA-Seq (2x75bp GAIIx) along time course of mouse C2C12 differentiation

31

-24 hours

60 hours

168 hours

differentiation(starting at 0 hours)

fusion

myotubemyoctyte

120 hours

Illustration based on: Ohtake et al, J. Cell Sci., 2006; 119:3822-3832

•84,369,078 reads

•140,384,062reads

• 82,138,212reads

•123,575,666reads

•66,541,668alignments

•103,681,081alignments

•47,431,271alignments

•89,162,512alignments

•10,754,363to junctions

•19,194,697to junctions

•9,015,806to junctions

•17,449,848to junctions

•58,008transfrags

•69,716transfrags

•55,241transfrags

•63,664transfrags

Page 32: RNAseq: isoform expression quantification and transcript …users.umiacs.umd.edu/~hcorrada/CMSC858B/lectures/lect13/... · 2012. 3. 5. · transcript assembly Slides courtesy from

Dynamics of Myc expression

32

!

!

d( , )