RNA-seq data analysis with Bioconductor
Ângela Filimon Gonçalves
Functional Genomics Team at the European Bioinformatics Institute - EMBL
Tuesday, 11 September 2012
Analysis pipeline overview
From raw sequence files to expression levels:
1. Quality control and filtering
2. Read mapping
3. Quantifying expression
4. Normalisation
Downstream analysis:
5. E.g. Differential expression
Read mapping
Which reads come from which regions of a genome or transcript?
Reads can be:
• assembled de novo to find the transcriptome
or mapped to
• a genome
• a transcriptome.
decreasing complexity
De novo read assembly
Objective:
• find the set of longest possible contiguously expressed regions (contigs) without using a reference - by exploiting the overlaps between reads
One of the most prevalent algorithmic approaches uses a de Bruijn graph representation
• implemented in methods such as:
• Trinity (Grabherr, 2011)
• Trans-ABySS (Robertson, 2010)
• or Oases (Schulz, 2012).
De novo read assembly
E.g. we sequence a transcript in three reads:
transcript: ATGGCGTGCA
read 1:     ATGGC
read 2:        GCGTG
read 3:          GTGCA
We choose a length k (e.g. k = 4) and create two subsets:
• one with all possible substrings of length k (known as k-mers):
ATGG, TGGC, GCGT, CGTG, GTGC, TGCA
• and another with substrings of length k-1:
ATG, TGG, GGC, GCG, CGT, GTG, TGC, GCA
De novo read assembly
Build a graph, with the (k-1)-mers as nodes and the k-mers as directed edges:
edges: ATGG, TGGC, GCGT, CGTG, GTGC, TGCA
nodes: ATG, TGG, GGC, GCG, CGT, GTG, TGC, GCA
ATG -(ATGG)-> TGG -(TGGC)-> GGC -> ??
What now?
We cannot extend the contig, but we can create a second one:
ATG -(ATGG)-> TGG -(TGGC)-> GGC
GCG -(GCGT)-> CGT -(CGTG)-> GTG -(GTGC)-> TGC -(TGCA)-> GCA
contigs: ATGGC, GCGTGCA
De novo read assembly
Why was the transcript split into 2?
Minimum overlap between reads is only 2, so our choice of k (4) is too high.
Try k = 3 (more edges, fewer nodes):
edges: ATG, TGG, GGC, GCG, CGT, GTG, TGC, GCA
nodes: AT, TG, GG, GC, CG, GT, CA
AT -(ATG)-> TG -(TGG)-> GG -(GGC)-> GC -(GCG)-> CG -(CGT)-> GT -(GTG)-> TG -(TGC)-> GC -(GCA)-> CA
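The effect of k on the three example reads (ATGGC, GCGTG, GTGCA) can be checked with a small sketch. This is only an illustration of the graph construction, not how Trinity, Trans-ABySS or Oases are implemented; the function names `de_bruijn` and `count_components` are made up for this note. Two reads end up in the same piece of the graph only if they share a (k-1)-mer, i.e. overlap by at least k-1 bases.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Map each (k-1)-mer node to the set of (k-1)-mer nodes it points to."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            prefix, suffix = kmer[:-1], kmer[1:]
            graph[prefix].add(suffix)  # the k-mer is the edge prefix -> suffix
            graph[suffix]              # touch the node so that sinks are stored too
    return dict(graph)


def count_components(graph):
    """Number of connected pieces of the graph, ignoring edge direction."""
    neighbours = {node: set(targets) for node, targets in graph.items()}
    for node, targets in graph.items():
        for target in targets:
            neighbours[target].add(node)
    seen, components = set(), 0
    for start in neighbours:
        if start in seen:
            continue
        components += 1
        stack = [start]
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(neighbours[node] - seen)
    return components


reads = ["ATGGC", "GCGTG", "GTGCA"]
for k in (4, 3):
    graph = de_bruijn(reads, k)
    n_edges = sum(len(targets) for targets in graph.values())
    # columns: k, number of nodes, number of edges, number of pieces
    print(k, len(graph), n_edges, count_components(graph))
# prints "4 8 6 2" and then "3 7 8 1"
```

With k = 4 the graph falls into two pieces (the two contigs), while k = 3 gives a single connected graph in which TG and GC each have two outgoing edges, so the path through it is no longer unique.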
De novo read assembly
Smaller k:
• increases the chance of observing overlaps -> more complex graphs
Larger k:
• decreases the chance of observing overlaps -> simpler graphs
There is a balance between sensitivity and specificity determined by k -> k has a significant influence on the quality of the assembly.
The best k depends on the coverage, read length and error rate, but it is hard to estimate.
Often a small range of values is tested to find the optimal k.
De novo read assembly
For genome assembly the coverage is reasonably even, so a single k works well.
For RNA-seq:
• highly expressed genes -> more overlap between reads -> the optimal k is larger -> allows long contigs and produces simpler graphs
• lowly expressed genes -> less overlap -> optimal k is smaller -> complex graphs
-> use a range of k values and merge the contigs.
De novo read assembly
[Figure: the de Bruijn graph for k = 3]
Additional complexity arises from:
• heterozygosity
• sequencing errors
• alternative splicing!!!
De novo read assembly
How to assemble the contigs into gene models? There are many methods... (Robertson, 2010)
• contig merging and expansion
• connecting contigs using paired-end reads
• alignment of the contigs to the genome and comparison with known annotation
• etc...
De novo read assembly
Methods using de Bruijn graphs can be used to assemble transcriptomes de novo.
Con
• the assembly is difficult and only the most abundant transcripts are likely to be fully assembled
Pro
• useful when there is no reference genome for the species in question and/or the annotation is of poor quality.
Read mapping
[Figure: a read and its mates (1st mate, 2nd mate) aligned to the genome, and a read aligned to a transcript]
Read mapping
TopHat - a hybrid approach¹
• first, align the reads to the genome contiguously
• then assemble the aligned reads into putative exons
• map the remaining reads to a set of pairings between those exons.
1. Trapnell et al., Bioinformatics, 2009
Read mapping
Gene model assembly with Cufflinks - find the smallest possible set of transcripts that explains all the observed reads by:
1. constructing a directed acyclic graph where the reads are nodes and two nodes are connected if
• the reads overlap with one another and
• they are not incompatible
2. finding the minimum number of paths needed to cover the graph (a minimum partition into chains), using Dilworth’s theorem.
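The second step can be illustrated with a toy sketch. This is not Cufflinks' code: it assumes the compatibility graph is already given as a transitively closed list of edges, and it uses the standard reduction in which the minimum number of chains equals the number of nodes minus the size of a maximum bipartite matching. The function name `min_path_cover` and the four reads a, b, c, d are hypothetical.

```python
def min_path_cover(nodes, edges):
    """Minimum number of chains covering a DAG whose edges are transitively closed.

    Each matched edge (u, v) links u to its successor v on a chain, so the
    number of chains is len(nodes) minus the size of a maximum matching.
    """
    successors = {u: [] for u in nodes}
    for u, v in edges:
        successors[u].append(v)
    predecessor_of = {}  # v -> node currently chosen as its predecessor

    def try_assign(u, visited):
        # Augmenting-path search (Kuhn's algorithm) for a successor of u.
        for v in successors[u]:
            if v in visited:
                continue
            visited.add(v)
            if v not in predecessor_of or try_assign(predecessor_of[v], visited):
                predecessor_of[v] = u
                return True
        return False

    matched = sum(try_assign(u, set()) for u in nodes)
    return len(nodes) - matched


# Hypothetical example: reads b and c are mutually incompatible (say, one
# includes an exon that the other skips), but both are compatible with a and d.
nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "d"), ("c", "d")]
print(min_path_cover(nodes, edges))  # 2
```

Two chains are needed here because b and c can never sit on the same path, which corresponds to at least two transcripts being required to explain the reads.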
Read mapping
[Figure: inferred transcript models]
Read mapping
Gene model assembly with Cufflinks:
the minimum number of paths is not necessarily unique -> need to phase distant exons together
Cufflinks uses the expression levels to weight the edges and find the most likely path
Read mapping
Gene model assembly with Cufflinks:
• because it’s a de novo method it requires high coverage (high read overlap) to properly assemble transcripts
• one caveat is that all paths are maximally extended -> impossible to detect some instances of alternative transcription start and end sites
• finds the minimum set of transcripts that explains the data (Occam’s Razor principle), which is not necessarily what evolution comes up with...
Read mapping
Mapping to the genome
Pros
• discovery of novel transcribed regions
• de novo assembly of gene models for species with no annotation
Cons
• gapped mapping algorithms have limitations - even when gaps are allowed, quite a few of the reads that fail to map to the genome would map contiguously to the transcriptome
• the analysis of the result in order to obtain transcript expression is more complex
Read mapping
Mapping to the transcriptome
Pros
• less complexity in mapping - reads map contiguously
• the output is easily interpretable
• faster than mapping to a genome
Cons
• only for annotated regions -> you must be reasonably confident in the annotation
• multiple transcripts for the same gene (reads might map equally well to these because of shared sequence)
Quantifying expression
“Easy”: count the number of reads that aligned to a gene, exon, splice-junction...
What about multi-mapping reads?
[Figure: a read aligned to the genome where it matches the annotated transcripts of two genes, g1 and g2]
Quantifying expression
And what about isoform levels?
[Figure: a read aligned to the genome where it matches two annotated transcripts of the same gene, g1 t1 and g1 t2]
Quantifying expression
Discard multi-mapping reads:
• good at the gene, exon, splice junction level
• not so good at the isoform level, use an alternative method
Assign the reads probabilistically according to neighbouring uniquely mapping reads (still doesn’t work for isoform levels).
Use a method for estimating the expression most likely to explain the data
• MMSEQ (Turro, 2011) - alignment to the transcriptome
• Cufflinks (Trapnell, 2010) - alignment to the genome + de novo gene model assembly
• and several others (POEM, MISO, RSEM...)
Quantifying expression
MMSEQ¹
• map reads to transcript sequences with Bowtie
Multi-mapping reads
• Obtain transcript sets, such that each read maps to only 1 set
• Transcripts may belong to more than one set
• Read counts per set can be observed
• Transcripts can be isoforms sharing exons or from multiple genes
[Figure: reads mapped to transcripts A, B, C and D, which are grouped into transcript sets 1, 2 and 3]
1. Turro et al., Genome Biology, 2011
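The grouping of reads into transcript sets can be sketched as follows. This is an illustration only, not MMSEQ's code: the function name `transcript_sets`, the read names and the read-to-transcript hits are all hypothetical, chosen so that three sets come out over transcripts A to D. Each read is assigned to the exact set of transcripts it maps to, which gives an indicator matrix (one row per set, one column per transcript) and one observed count per set.

```python
from collections import Counter


def transcript_sets(read_hits, transcripts):
    """Group reads by the exact set of transcripts they map to.

    Returns the sets, the 0/1 indicator matrix M (sets x transcripts)
    and the observed read count k for each set.
    """
    counts = Counter(frozenset(hits) for hits in read_hits.values())
    sets = sorted(counts, key=sorted)  # deterministic order for display
    M = [[int(t in s) for t in transcripts] for s in sets]
    k = [counts[s] for s in sets]
    return sets, M, k


# Hypothetical alignments: which transcripts each read maps to.
read_hits = {
    "read1": {"A", "B"},
    "read2": {"A", "B"},
    "read3": {"A", "B", "C"},
    "read4": {"D"},
    "read5": {"D"},
}
sets, M, k = transcript_sets(read_hits, ["A", "B", "C", "D"])
for row, count in zip(M, k):
    print(row, count)
```

Every read falls into exactly one set even when it maps to several transcripts, which is why the per-set counts are observable while the per-transcript counts are not.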
Quantifying expression
MMSEQ
Latent variables for read counts
[Figure: reads mapped to transcripts A, B, C and D, which are grouped into transcript sets 1, 2 and 3]
Poisson model for transcript set read counts
Model reads per transcript set instead of per gene.
Define $M_{it} = 1$ if transcript $t$ is in set $i$, and $0$ otherwise. For the example (rows: sets 1 to 3; columns: transcripts A to D):
$$M = \begin{pmatrix} 1 & 1 & 0 & 0 \\ 1 & 1 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}, \qquad X = \begin{pmatrix} X_{11} & X_{12} & 0 & 0 \\ X_{21} & X_{22} & X_{23} & 0 \\ 0 & 0 & 0 & X_{34} \end{pmatrix}$$
$$k = \begin{pmatrix} 2 \\ 1 \\ 2 \end{pmatrix}, \qquad r = \begin{pmatrix} r_1 & r_2 & r_3 & r_4 \end{pmatrix}$$
• $k$: observed set counts ($k_i$ is the observed number of reads in set $i$)
• $r$: unobserved transcript counts ($r_t = \sum_i X_{it}$)
• $X_{it}$: latent number of reads in set $i$ that come from transcript $t$
The model for the read counts is:
$$X_{it} \sim \mathrm{Poisson}(b\, s_i\, M_{it}\, \mu_t)$$
$$k_i \sim \mathrm{Poisson}\Big(b\, s_i \sum_t M_{it}\, \mu_t\Big)$$
$$r_t \sim \mathrm{Poisson}\Big(b\, \mu_t \sum_i M_{it}\, s_i\Big) = \mathrm{Poisson}(b\, l_t\, \mu_t)$$
$$\{X_{1t}, \ldots, X_{mt}\} \mid r_t \sim \mathrm{Mult}\Big(r_t,\ \frac{M_{1t}\, s_1}{\sum_i M_{it}\, s_i},\ \ldots,\ \frac{M_{mt}\, s_m}{\sum_i M_{it}\, s_i}\Big)$$
where
• $\mu_t$ is the unknown real expression of transcript $t$; it can be estimated with an EM algorithm
• $b$ and $s_i$ are normalisation terms; $s_i$ is the effective length shared by transcripts in set $i$
Similar model has been used by Richard et al. 2010; Jiang and Wong 2009 (defining transcript sets by exon-sharing, don’t use reads mapping to multiple genes), also by Li et al. 2010.
None of these have used haplotype-specific isoforms.
Source of the embedded model slide: Alex Lewin (Imperial), Haplotype-specific isoform expression, 12 April 2011
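A minimal sketch of an EM iteration for the Poisson transcript-set model (set counts k, indicator matrix M, set effective lengths s, scaling constant b, expression mu) is given below. It is not the MMSEQ implementation: the function name `em_expression`, the effective lengths `s` and the constant `b` are hypothetical values picked for illustration. The E-step splits each observed set count across the transcripts in that set in proportion to their current expression; the M-step divides the resulting expected transcript count by b times the transcript's effective length (the sum of the lengths of the sets it belongs to).

```python
def em_expression(M, k, s, b, n_iter=100):
    """EM estimate of mu_t under k_i ~ Poisson(b * s_i * sum_t M_it * mu_t)."""
    n_sets, n_tx = len(M), len(M[0])
    # l_t = sum_i M_it * s_i : effective length of transcript t
    lengths = [sum(M[i][t] * s[i] for i in range(n_sets)) for t in range(n_tx)]
    mu = [1.0] * n_tx
    for _ in range(n_iter):
        # E-step: expected number of reads from each transcript, r_t = sum_i E[X_it]
        r = [0.0] * n_tx
        for i in range(n_sets):
            total = sum(M[i][t] * mu[t] for t in range(n_tx))
            if total == 0:
                continue  # no expressed transcript in this set; nothing to split
            for t in range(n_tx):
                r[t] += k[i] * M[i][t] * mu[t] / total
        # M-step: maximum-likelihood mu_t given the expected counts
        mu = [r[t] / (b * lengths[t]) for t in range(n_tx)]
    return mu


M = [[1, 1, 0, 0],
     [1, 1, 1, 0],
     [0, 0, 0, 1]]
k = [2, 1, 2]
s = [1000, 200, 500]  # hypothetical effective lengths of the three sets
b = 0.01              # hypothetical scaling constant
mu = em_expression(M, k, s, b)
print([round(m, 3) for m in mu])  # approximately [0.1, 0.1, 0.3, 0.4]
```

In this toy example the first two transcripts have identical columns in M, so only their sum is determined by the data; starting from equal values, the iteration keeps them equal.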
Normalisation
Do we need normalisation?
• longer transcripts produce more reads than shorter ones - divide by length
• and read count is proportional to the depth of sampling - divide by total number of reads
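The two corrections above are commonly combined into a single value, reads per kilobase of transcript per million mapped reads (RPKM). A minimal sketch, assuming the read count, the transcript length in bases and the total number of mapped reads are already known; the function name `rpkm` and the example numbers are illustrative only.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    # Dividing by (length / 1e3) and by (total / 1e6) gives the factor 1e9.
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)


# 1,000 reads on a 2,000 bp transcript in a library of 10 million reads
print(rpkm(1000, 2000, 10_000_000))  # 50.0
```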
Normalisation
                        Sample 1      Norm Sample 1   Sample 2      Norm Sample 2
Gene 1                  1,000,000     0.1             1,000,000     0.05
Gene 2                  0             0               1,000,000     0.05
:                       :             :               :             :
Gene 19,999             20,000        0.002           20,000        0.001
Gene 20,000             0             0               20,000        0.001
Total Number of Reads   10,000,000                    20,000,000
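The normalised columns of the table can be reproduced by dividing each count by the total number of reads in its sample. The sketch below uses only the four genes shown and the totals given in the table; the variable names are made up for this note.

```python
# Counts for (Sample 1, Sample 2), as listed in the table.
counts = {
    "Gene 1": (1_000_000, 1_000_000),
    "Gene 2": (0, 1_000_000),
    "Gene 19,999": (20_000, 20_000),
    "Gene 20,000": (0, 20_000),
}
totals = (10_000_000, 20_000_000)

# Divide every count by the total number of reads of its sample.
normalised = {
    gene: tuple(count / total for count, total in zip(row, totals))
    for gene, row in counts.items()
}

for gene, (norm1, norm2) in normalised.items():
    print(gene, norm1, norm2)
```

Gene 1 and Gene 19,999 have identical raw counts in both samples, yet after dividing by the totals they appear twice as high in Sample 1, because the extra reads in Sample 2 come from genes that are absent in Sample 1.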
Normalisation
Robinson and Oshlack¹ propose a method that, after trimming the data, calculates a weighted global fold change between the two samples, which is used as a normalisation factor.
This, and other similar approaches²³, can better account for differences in read count between samples.
1. Robinson and Oshlack, Genome Biology, 2010
2. Bullard et al., BMC Bioinformatics, 2010
3. Anders and Huber, Genome Biology, 2010
Normalisation - an additional note on normalising across genes
• recent studies found that there are position and sequence-dependent biases¹²³⁴ that affect which fragments are sequenced
• e.g. 3’ biases due to fragmentation, depletion of reads near exon-exon junctions because of difficulties in alignment, sequence preferences introduced by random hexamer priming, etc...
1. Mortazavi et al., Nature Methods, 2008
2. Wang et al., Nature Reviews Genetics, 2009
3. Bohnert and Ratsch, Nucleic Acids Research, 2010
4. Hansen et al., BMC Bioinformatics, 2009
Normalisation - an additional note on normalising across genes
• we may not be able to understand these biases, but several authors propose methods that correct for them by learning from the data¹²³
• is this kind of correction necessary? It depends, but the improvement in the estimates appears to be only slight³⁴.
1. Hansen et al., BMC Bioinformatics, 2009
2. Li et al., Genome Biology, 2010 -> Bioconductor package mseq
3. Roberts et al., Genome Biology, 2011
4. Turro et al., Genome Biology, 2011
Differential expression
Objective: comparing read counts between different biological conditions.
We want to use a statistical test to decide whether, for a given gene, an observed difference in read counts is significant.
Differential expression
edgeR (Robinson et al., 2010)
• Normalisation: quantile adjusted method (or TMM)
• Model & test: Negative binomial, exact test to obtain a p-value
DESeq (Anders & Huber, 2010)
• Normalisation: size factor estimation (for sample i, the median across genes of the ratio of each gene's read count in sample i to that gene's geometric mean across all samples)
• Model & test: Negative binomial, exact test to obtain a p-value
baySeq (Hardcastle & Kelly, 2010)
• Normalisation: none
• Model & test: Bayesian approach with Poisson or Negative binomial, estimate posterior likelihood of each model and rank the models rather than obtaining a p-value, not restricted to pairwise comparisons
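The size factor estimation listed for DESeq can be sketched in a few lines. This follows the median-of-ratios idea only and is not DESeq's own code; the function name `size_factors` and the small count matrices are illustrative. Genes with a zero count in any sample have a geometric mean of zero and are skipped here.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors; counts is a list of genes x samples."""
    n_samples = len(counts[0])
    log_ratios = [[] for _ in range(n_samples)]
    for row in counts:
        if min(row) == 0:
            continue  # geometric mean would be zero, so the gene is skipped
        log_geo_mean = sum(math.log(c) for c in row) / n_samples
        for j, c in enumerate(row):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    # Raises StatisticsError if every gene was skipped.
    return [math.exp(median(lr)) for lr in log_ratios]


# Sample 2 is sequenced twice as deeply as sample 1 for every gene.
print(size_factors([[10, 20], [100, 200], [5, 10]]))

# Two genes expressed equally in both samples, two expressed only in sample 2:
# the zero-count genes are skipped and both size factors come out as 1.
print(size_factors([[1_000_000, 1_000_000], [0, 1_000_000],
                    [20_000, 20_000], [0, 20_000]]))
```

Because the median is taken over genes, a few genes that are switched on in only one sample do not shift the factor, which is the behaviour that dividing by the total number of reads lacks.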
Differential expression
DESeq, edgeR and baySeq have been compared using simulated data¹². Overall the studies found that the three methods perform similarly, with baySeq performing marginally better in ranking genes according to their significance.
DEGSeq (Wang et al., 2009)
• Normalisation: none, loess, median
• Model & test: MA-plot based methods, Likelihood Ratio Test, Fisher’s Exact Test
1. Kvam et al., American Journal of Botany, 2012
2. Glaus et al., Bioinformatics, 2012
Acknowledgments
Ernest Turro, Cambridge University
John Marioni, EBI, Cambridge
Alvis Brazma, EBI, Cambridge
Practical
In the practical we will...
• align the reads to the genome with TopHat
• import aligned reads into R
• discard reads that map multiple times to the genome
• import annotation from files or public databases (e.g. Ensembl)
• count reads over genes and exons
• find novel transcribed regions
• differential expression with DESeq
SAM/BAM alignment files
Most aligners output their results in the SAM/BAM format:
SRR034782.85 16 EG:4_group3 5036819 25 30M * 0 0 AAAACAATAAATTTCACTATTATTTCAAAA <<<<<<<<<<<<<<<<<<<<<<<<<<<<<< XT:A:U NM:i:2 X0:i:1 X1:i:0 XM:i:2 XO:i:0 XG:i:0 MD:Z:28G0G0
QNAME   ID of the read (“query”)
FLAG    alignment flags
RNAME   ID of the reference (typically: chromosome name)
POS     position in reference (1-based, left side)
MAPQ    mapping quality (as Phred score)
CIGAR   alignment description (gaps etc.) in CIGAR format
MRNM    mate reference sequence name [for paired end data]
MPOS    mate position [for paired end data]
ISIZE   inferred insert size [for paired end data]
SEQ     sequence of the read
QUAL    quality string of the read
extra fields
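As a rough illustration of how the fields fit together, the example record can be split into its eleven mandatory fields and its extra tags. This is a sketch, not a replacement for a proper BAM reader: it assumes tab-separated SAM text, and the names `SAM_FIELDS`, `INT_FIELDS` and `parse_sam_line` are made up for this note. The quality string is written as `"<" * 30` for brevity.

```python
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "MRNM", "MPOS", "ISIZE", "SEQ", "QUAL"]
INT_FIELDS = {"FLAG", "POS", "MAPQ", "MPOS", "ISIZE"}


def parse_sam_line(line):
    """Split one tab-separated SAM record into named fields plus a tag dict."""
    parts = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, parts[:11]))
    for name in INT_FIELDS:
        record[name] = int(record[name])
    tags = {}
    for item in parts[11:]:
        tag, kind, value = item.split(":", 2)  # e.g. "NM:i:2"
        tags[tag] = int(value) if kind == "i" else value
    record["tags"] = tags
    return record


line = "\t".join([
    "SRR034782.85", "16", "EG:4_group3", "5036819", "25", "30M", "*", "0", "0",
    "AAAACAATAAATTTCACTATTATTTCAAAA", "<" * 30,
    "XT:A:U", "NM:i:2", "X0:i:1", "X1:i:0", "XM:i:2", "XO:i:0", "XG:i:0",
    "MD:Z:28G0G0",
])
rec = parse_sam_line(line)
print(rec["RNAME"], rec["POS"], rec["CIGAR"], rec["tags"]["NM"])
print("reverse strand:", bool(rec["FLAG"] & 16))  # flag bit 0x10
```

FLAG is a bit field, so individual properties are tested with a bitwise AND; the value 16 in the example sets only the reverse-strand bit.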
SAM/BAM CIGAR strings
AT GT T AGAT AA* * GAT AGCT GT GCT AGT AGGCAGT CAGCGCCATAGAT AAAGGAT A* CT
a a a T AGT AGGCAGTAT AGCT . . . . . . . . . . . . . . T CAGCGCCC
GCAGT CAACGCCAT
CIGAR string: 6M 2I 4M 1D 2M
Tuesday, 11 September 2012
SAM/BAM CIGAR strings
AT GT T AGAT AA* * GAT AGCT GT GCT AGT AGGCAGT CAGCGCCATAGAT AAAGGAT A* CT
a a a T AGT AGGCAGTAT AGCT . . . . . . . . . . . . . . T CAGCGCCC
GCAGT CAACGCCAT
CIGAR string: 3S 11M
Tuesday, 11 September 2012
SAM/BAM CIGAR strings
AT GT T AGAT AA* * GAT AGCT GT GCT AGT AGGCAGT CAGCGCCATAGAT AAAGGAT A* CT
a a a T AGT AGGCAGTAT AGCT . . . . . . . . . . . . . . T CAGCGCCC
GCAGT CAACGCCAT
CIGAR string: 6M 14N 8M
Tuesday, 11 September 2012
SAM/BAM CIGAR strings
AT GT T AGAT AA* * GAT AGCT GT GCT AGT AGGCAGT CAGCGCCATAGAT AAAGGAT A* CT
a a a T AGT AGGCAGTAT AGCT . . . . . . . . . . . . . . T CAGCGCCC
GCAGT CAACGCCAT
CIGAR string: 14M
extra tags: NM (number of mismatches) = 1
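The four CIGAR strings shown on these slides can be decoded with a few lines of Python. The sketch below parses a CIGAR string into (length, operation) pairs and derives how many read bases and how many reference positions the alignment covers. The function names are assumptions for this example; the operation semantics follow the SAM format (M, I, S, = and X consume read bases; M, D, N, = and X consume reference positions).

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_READ = set("MIS=X")       # operations that use up bases of the read
CONSUMES_REFERENCE = set("MDN=X")  # operations that advance along the reference


def parse_cigar(cigar):
    """Turn e.g. '6M 2I 4M 1D 2M' into [(6, 'M'), (2, 'I'), ...]."""
    return [(int(n), op) for n, op in CIGAR_OP.findall(cigar.replace(" ", ""))]


def read_length(cigar):
    """Number of read bases described by the CIGAR string."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_READ)


def reference_span(cigar):
    """Number of reference positions covered, including deletions and introns."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_REFERENCE)


for cigar in ["6M 2I 4M 1D 2M", "3S 11M", "6M 14N 8M", "14M"]:
    print(cigar, "read bases:", read_length(cigar),
          "reference span:", reference_span(cigar))
```

All four reads are 14 bases long, but the spliced read (6M 14N 8M) spans 28 reference positions, which is why tools that ignore the N operation miscount coverage over introns.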
SAM/BAM extra tags
NM  Edit distance to the reference, including ambiguous bases but excluding clipping
NH  Number of reported alignments that contain the query in the current record
H0  Number of perfect hits
H1  Number of 1-difference hits (see also NM)
H2  Number of 2-difference hits
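The NH tag is what allows reads that map multiple times to the genome to be discarded. A minimal Python sketch of that filter follows, assuming the aligner writes an NH tag on every record; the records and the helper `make_record` are made up for illustration, and reads without an NH tag are dropped here as a conservative choice.

```python
def nh_value(sam_line):
    """Return the integer value of the NH tag, or None if the tag is absent."""
    for item in sam_line.rstrip("\n").split("\t")[11:]:
        tag, kind, value = item.split(":", 2)
        if tag == "NH":
            return int(value)
    return None


def uniquely_mapped(sam_lines):
    """Keep only records whose read was reported at exactly one location."""
    return [line for line in sam_lines if nh_value(line) == 1]


def make_record(name, nh):
    """Hypothetical helper: build a minimal SAM record carrying an NH tag."""
    fields = [name, "0", "chr4", "100", "50", "10M", "*", "0", "0",
              "ACGTACGTAC", "<" * 10, "NM:i:0", "NH:i:%d" % nh]
    return "\t".join(fields)


records = [make_record("r1", 1), make_record("r2", 3), make_record("r3", 2)]
kept = uniquely_mapped(records)
print([line.split("\t")[0] for line in kept])
```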
SAM/BAM alignment files
The BAM format can be read easily into R.
ShortRead package:
• can read almost every sequencer manufacturer’s proprietary format
• loads most information (read DNA sequence, base quality, some alignment information) but does not import the extra tags, so it is not flexible
• stores data in an efficient object of class AlignedRead
• provides sequence manipulation and QA reports
• is not aware of spliced alignments.
SAM/BAM alignment files
The BAM format can be read easily into R.
Rsamtools package:
• can read all BAM fields
• flexibility to choose which fields and which subset of the data to import (for example by genomic location, by alignment type, etc.)
• reads all data into lists, which is not very efficient in memory or speed
• low-level interface with no data manipulation functions.
SAM/BAM alignment files
The BAM format can be read easily into R.
GenomicRanges package:
• builds on top of Rsamtools so it’s also flexible
• by default discards some information for the sake of memory usage and speed (read sequence, quality string, etc...)
• stores alignments into an efficient GappedAlignments object
• useful functions for finding overlaps and coverage
• is aware of spliced alignments
• does not import extra tag fields.
Importing annotation
rtracklayer package:
• read in annotation from GFF files
biomaRt:
• interface to a growing collection of databases such as Ensembl, Uniprot and HapMap for retrieving annotation and reference sequences
• returns a data.frame
GenomicRanges:
• represent the features as a GRanges object that can be overlapped with the aligned reads (in the GappedAlignments object)
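In R the import itself is done by the packages listed above. Purely to show what a GFF annotation record contains before it becomes a genomic range, here is a small Python sketch. It assumes GFF3-style `key=value` attributes in the ninth column; the function name `parse_gff_line` and the example line are invented for illustration.

```python
def parse_gff_line(line):
    """Parse one tab-separated GFF line into a feature dictionary.

    GFF coordinates are 1-based and inclusive at both ends.
    """
    (seqid, source, ftype, start, end,
     score, strand, phase, attrs) = line.rstrip("\n").split("\t")
    # Assumption: GFF3-style attributes such as "ID=exon1;Parent=geneA"
    attributes = dict(item.split("=", 1) for item in attrs.split(";") if item)
    return {
        "seqid": seqid,
        "type": ftype,
        "start": int(start),
        "end": int(end),
        "strand": strand,
        "attributes": attributes,
    }


# Invented example line for an exon of a hypothetical gene "geneA".
line = "\t".join(["chr4", "example", "exon", "1000", "1200",
                  ".", "+", ".", "ID=exon1;Parent=geneA"])
feature = parse_gff_line(line)
width = feature["end"] - feature["start"] + 1  # inclusive coordinates
print(feature["seqid"], feature["type"], width, feature["attributes"]["Parent"])
```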
Quantifying expression
IRanges and GenomicRanges:
• utilities for overlapping aligned reads and genomic features (genes, exons, etc...)
• aware of spliced alignments
• several modes for counting: union, intersection strict, intersection non-empty
We will end up with a table of counts per gene and per exon which can be used as the input for DE packages such as DESeq, edgeR, baySeq...
Quantifying expression
[Figure: a read aligned to the genome at a locus with two annotated genes, g1 and g2]
We already removed multi-mapping reads, but what if the read overlaps several features in the same locus?
There is no optimal solution.
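How to avoid double counting?

Read position                                    Union      IntStrict  IntNonEmpty
read entirely within gene A                      A          A          A
read only partly overlapping gene A              A          -          A
spliced read over two exons of gene A            A          A          A
read within gene A, partly overlapping gene B    ambiguous  A          A
read within both gene A and gene B               ambiguous  ambiguous  ambiguous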
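The three counting modes can be written down compactly. The Python sketch below is an illustration under stated assumptions, not the Bioconductor implementation: for every base of the read it collects the set of genes covering that base, then combines those sets by union, by intersection, or by intersection of the non-empty sets only. The coordinates, gene layouts and function names are invented for the example; intervals are half-open.

```python
def feature_sets(read_blocks, genes):
    """For each base of the read, the set of genes whose exons cover it."""
    sets = []
    for start, end in read_blocks:  # more than one block means a spliced read
        for pos in range(start, end):
            sets.append(frozenset(
                name for name, exons in genes.items()
                if any(s <= pos < e for s, e in exons)))
    return sets


def assign(read_blocks, genes, mode):
    """Assign a read to a gene, '-' (no feature) or 'ambiguous'."""
    sets = feature_sets(read_blocks, genes)
    if mode == "union":
        hit = frozenset().union(*sets)
    elif mode == "intersection_strict":
        hit = frozenset.intersection(*sets)
    elif mode == "intersection_nonempty":
        nonempty = [s for s in sets if s]
        hit = frozenset.intersection(*nonempty) if nonempty else frozenset()
    else:
        raise ValueError("unknown mode: %s" % mode)
    if not hit:
        return "-"
    if len(hit) > 1:
        return "ambiguous"
    return next(iter(hit))


MODES = ["union", "intersection_strict", "intersection_nonempty"]
one_gene = {"A": [(10, 50)]}
two_exons = {"A": [(10, 20), (40, 50)]}
two_genes = {"A": [(10, 50)], "B": [(30, 60)]}

scenarios = [
    ([(20, 30)], one_gene),             # read entirely within gene A
    ([(5, 15)], one_gene),              # read only partly overlapping gene A
    ([(15, 20), (40, 45)], two_exons),  # spliced read over two exons of A
    ([(25, 35)], two_genes),            # within A, partly overlapping B
    ([(35, 45)], two_genes),            # within both A and B
]
table = [[assign(read, genes, m) for m in MODES] for read, genes in scenarios]
for row in table:
    print(row)
```

The rows printed correspond, in order, to the five cases in the table above.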
Defining novel transcribed regions
IRanges and GenomicRanges:
• define coverage
• find “islands” of expression
• define potential “exons” by setting thresholds (in the figure: > 54 bp, > 3 reads)
• retain only regions that do not overlap annotated features
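The four steps above can be sketched in plain Python. This is an illustration only: the practical uses IRanges and GenomicRanges for the same steps. The interpretation of the thresholds is an assumption (an island is kept if it is wider than 54 bp and its maximum coverage exceeds 3 reads), and all coordinates are invented; intervals are half-open.

```python
def coverage(reads, length):
    """Per-base read depth over a sequence of the given length."""
    cov = [0] * length
    for start, end in reads:
        for pos in range(start, end):
            cov[pos] += 1
    return cov


def islands(cov, min_depth=1):
    """Maximal runs of consecutive positions with depth >= min_depth."""
    runs, start = [], None
    for pos, depth in enumerate(cov):
        if depth >= min_depth and start is None:
            start = pos
        elif depth < min_depth and start is not None:
            runs.append((start, pos))
            start = None
    if start is not None:
        runs.append((start, len(cov)))
    return runs


def novel_regions(reads, annotated, length, min_width=54, min_reads=3):
    """Islands that pass both thresholds and overlap no annotated feature."""
    cov = coverage(reads, length)
    kept = []
    for start, end in islands(cov):
        wide_enough = (end - start) > min_width
        deep_enough = max(cov[start:end]) > min_reads  # assumed meaning of "> 3 reads"
        unannotated = all(end <= a_start or start >= a_end
                          for a_start, a_end in annotated)
        if wide_enough and deep_enough and unannotated:
            kept.append((start, end))
    return kept


reads = ([(100, 160)] * 4    # well-covered island outside any annotation
         + [(10, 70)] * 4    # well-covered island inside an annotated gene
         + [(200, 260)])     # island supported by a single read
annotated = [(0, 90)]
candidates = novel_regions(reads, annotated, length=300)
print(candidates)
```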
What’s going on right now?
Read and do the practical till the end of section 3.4.