Analysis of RNA-seq Data (Part 2)
Lecture 10: September 20, 2012
Transcript Assembly and Quantification
Transcript Assembly
• Transcript assembly aims to answer questions about gene expression, which requires knowing the transcripts (and hence the exon regions and splice sites) and the number of transcripts.
• Several transcripts may overlap at different regions, or there may be multiple copies of the same transcript.
• Recall that with genome assembly, the aim is to assemble a single sequence from a set of reads.
• With transcript assembly, by contrast, the aim is to assemble the set of sequences that most likely explains all the reads.
– Transcript assembly is much more difficult for this reason.
Transcript Assembly
• De novo transcript assembly: assembly of transcripts where there exists no reference genome
• Reference guided transcript assembly: significantly easier than de novo assembly
– Map the reads to the reference (using the methods discussed last time) and use the alignment to guide assembly
De Novo vs. Reference Guided Transcript Assembly
Reference Guided Transcript Assembly
Cufflinks (Trapnell et al)
• Cufflinks: an algorithm that identifies complete novel transcripts and probabilistically assigns reads to isoforms.
• Extends the work of TopHat (Pachter lab).
• The RNA sequence fragments are mapped to the reference using TopHat.
• Aim is to recover the minimal set of transcripts supported by the alignments.
Cufflinks (Trapnell et al)
• A fragment corresponds to a single cDNA molecule, which can be represented by a pair of reads, one from each end.
• Uses a comparative transcriptome assembly algorithm to produce the minimal set of transcripts supported by the fragment alignments.
• Reduces the transcript assembly problem to finding a maximum matching in a weighted bipartite graph.
Figure: TopHat followed by Cufflinks
Cufflinks (Trapnell et al)
• Takes as input cDNA fragment sequences that have been aligned to the genome by software capable of producing split (spliced) alignments (e.g. TopHat).
• With paired-end RNA-seq, Cufflinks treats each pair of fragment reads as a single alignment. The algorithm assembles overlapping “bundles” of fragment alignments separately.
– This reduces running time and memory usage.
The first step is to identify pairs of incompatible fragments that must have originated from distinct spliced mRNA isoforms. Fragments are connected in an “overlap graph” when they are compatible and their alignments overlap in the genome. Each fragment has one node in the graph, and an edge, directed from left to right along the genome, is placed between each pair of compatible fragments.
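The construction above can be sketched in a few lines. This is a simplified illustration, not Cufflinks' actual code: each fragment is assumed to be a tuple of half-open aligned blocks `(start, end)`, the gaps between blocks are taken as implied introns, and two fragments are called compatible when every intron of one that falls inside the other's span is also an intron of the other. The names `span`, `introns`, `compatible` and `overlap_graph` are hypothetical.

```python
from itertools import combinations

def span(frag):
    """Genomic interval covered by a fragment, from its first to its last block."""
    return frag[0][0], frag[-1][1]

def introns(frag):
    """Implied introns: the gaps between consecutive aligned blocks."""
    return {(left_end, right_start)
            for (_, left_end), (right_start, _) in zip(frag, frag[1:])}

def overlaps(a, b):
    (a0, a1), (b0, b1) = span(a), span(b)
    return a0 < b1 and b0 < a1

def compatible(a, b):
    """Simplified rule: introns must agree wherever the two spans overlap."""
    def agrees(x, y):
        y0, y1 = span(y)
        return all(i in introns(y)
                   for i in introns(x) if i[0] < y1 and y0 < i[1])
    return agrees(a, b) and agrees(b, a)

def overlap_graph(frags):
    """Edges (i, j), directed left to right, between overlapping compatible fragments."""
    order = sorted(range(len(frags)), key=lambda i: span(frags[i]))
    return {(i, j) for i, j in combinations(order, 2)
            if overlaps(frags[i], frags[j]) and compatible(frags[i], frags[j])}

# Toy data (made up for illustration).
frags = [
    ((0, 50), (100, 150)),   # 0: spliced, intron 50-100
    ((30, 50), (100, 180)),  # 1: same intron as fragment 0
    ((40, 120),),            # 2: unspliced, reads through the intron
    ((160, 220),),           # 3: downstream, overlaps fragment 1 only
]
edges = overlap_graph(frags)
print(sorted(edges))
```

Fragment 2 covers bases that fragments 0 and 1 splice out, so it gets no edge to either of them and must come from a different isoform.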
Isoforms are then assembled from the overlap graph. Paths through the graph correspond to mutually compatible fragments that could be merged into complete isoforms.
Dilworth’s Theorem: characterizes the width of any partially ordered set in terms of a partition of the order into a minimum number of chains.
Partially Ordered Set
• Partially ordered set (or poset) formalizes and generalizes the intuitive concept of an ordering, sequencing, or arrangement of the elements of a set.
Poset = Set + Binary Relation
Antichain and Poset Width
• We say two elements a and b of a partially ordered set are comparable if a ≤ b or b ≤ a.
• Chain: set of elements every two of which are comparable.
• Antichain: subset of a partially ordered set such that any two elements in the subset are incomparable.
• Width of a poset: the cardinality of a maximum antichain.
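The definitions above can be checked on a small example. The sketch below uses divisibility on {1, 2, 3, 4, 6, 12} as the partial order; that poset and the helper names are chosen only for illustration, and the width is found by brute force, which is feasible only for tiny sets.

```python
from itertools import combinations

def comparable(a, b, leq):
    return leq(a, b) or leq(b, a)

def is_chain(elems, leq):
    """Every two elements are comparable."""
    return all(comparable(a, b, leq) for a, b in combinations(elems, 2))

def is_antichain(elems, leq):
    """No two elements are comparable."""
    return not any(comparable(a, b, leq) for a, b in combinations(elems, 2))

def width(elems, leq):
    """Size of a largest antichain, by exhaustive search (exponential time)."""
    elems = list(elems)
    for k in range(len(elems), 0, -1):
        if any(is_antichain(s, leq) for s in combinations(elems, k)):
            return k
    return 0

def divides(a, b):
    return b % a == 0

P = [1, 2, 3, 4, 6, 12]
print(is_chain([1, 2, 4, 12], divides))   # a chain: each element divides the next
print(is_antichain([4, 6], divides))      # neither divides the other
print(width(P, divides))
```

In this poset the largest antichains have two elements (for example {4, 6}), and the set can be split into two chains such as {1, 2, 4, 12} and {3, 6}.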
Dilworth’s Theorem
• Dilworth's Theorem, applied to the overlap graph: the size of the largest set of mutually incompatible reads equals the minimum number of transcripts needed to “explain” all the fragments.
• Cufflinks implements a proof of Dilworth's Theorem that produces a minimal set of paths covering all the fragments in the overlap graph by finding the largest set of reads with the property that no two could have originated from the same isoform.
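A minimum set of chains can be computed through the standard reduction to bipartite matching: for a strict partial order on n elements, the minimum number of chains equals n minus the size of a maximum matching between "left" and "right" copies of the elements. The sketch below uses a plain unweighted augmenting-path matching, whereas the slides describe Cufflinks as using a weighted matching, so treat it as an illustration of the idea only. The function names and the toy graph are assumptions.

```python
def transitive_closure(n, edges):
    """All pairs (i, j) such that j is reachable from i in the DAG."""
    reach = set(edges)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if (i, k) in reach and (k, j) in reach:
                    reach.add((i, j))
    return reach

def min_chain_cover(n, less_than):
    """Partition 0..n-1 into the fewest chains of a transitively closed order."""
    succ = {u: [v for v in range(n) if (u, v) in less_than] for u in range(n)}
    match_right = {}  # v -> u: u is directly followed by v in some chain

    def try_augment(u, seen):
        for v in succ[u]:
            if v in seen:
                continue
            seen.add(v)
            if v not in match_right or try_augment(match_right[v], seen):
                match_right[v] = u
                return True
        return False

    for u in range(n):
        try_augment(u, set())

    next_of = {u: v for v, u in match_right.items()}
    chains = []
    for start in range(n):
        if start in match_right:  # has a predecessor, so not the head of a chain
            continue
        chain = [start]
        while chain[-1] in next_of:
            chain.append(next_of[chain[-1]])
        chains.append(chain)
    return chains

# Toy overlap graph: fragment 0 is shared, 1 and 2 are mutually incompatible, 3 is shared.
edges = {(0, 1), (0, 2), (1, 3), (2, 3)}
order = transitive_closure(4, edges)
chains = min_chain_cover(4, order)
print(chains)
```

Fragments 1 and 2 form an antichain of size two, so no cover can use fewer than two chains, and the matching finds one with exactly two. In a full assembler each chain would then be extended with the shared fragments to form a complete isoform.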
Transcript Abundance is Estimated
• Fragments are matched to the transcripts from which they could have originated.
• Transcript abundance is estimated using a statistical model in which the probability of observing each fragment is a linear function of the abundance of the transcripts from which it could have originated.
• Because only the ends of each fragment are sequenced, its length may be unknown.
Assigning a fragment to different isoforms often implies a different length for it. Cufflinks incorporates the distribution of fragment lengths to help assign fragments to isoforms.
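A small numeric sketch of this idea follows. It assumes, purely for illustration, that fragment lengths follow a normal distribution with mean 120 and standard deviation 30, and that both fragment ends fall inside exons of each candidate isoform. The isoforms, coordinates and function names are made up.

```python
import math

def implied_length(left, right, isoform_exons):
    """Fragment length in transcript coordinates: exonic bases between the two ends."""
    return sum(max(0, min(end, right) - max(start, left))
               for start, end in isoform_exons)

def length_likelihood(length, mean=120.0, sd=30.0):
    """Normal density; the mean and sd are assumed values, not Cufflinks defaults."""
    z = (length - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))

isoform_a = [(0, 100), (300, 400)]  # splices out 100-300
isoform_b = [(0, 400)]              # retains that region

left, right = 50, 350               # genomic positions of the fragment's two ends
len_a = implied_length(left, right, isoform_a)
len_b = implied_length(left, right, isoform_b)
lik_a = length_likelihood(len_a)
lik_b = length_likelihood(len_b)

# Relative support for each isoform from fragment length alone.
weight_a = lik_a / (lik_a + lik_b)
print(len_a, len_b, round(weight_a, 6))
```

The same pair of reads implies a 100-base fragment under isoform A and a 300-base fragment under isoform B, so under the assumed length distribution nearly all of the weight goes to isoform A.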
Lastly, Cufflinks maximizes a function that assigns a likelihood to all possible sets of relative abundances, which produces the abundances that best explain the observed fragments (shown in the pie chart).
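One standard way to maximize a likelihood of this form is expectation-maximization (EM). The sketch below is not claimed to be Cufflinks' exact procedure: it ignores transcript length effects and uses a hypothetical 0/1 compatibility matrix `compat`, where `compat[f][t]` is 1 if fragment f could have come from transcript t. Each fragment's probability is the abundance-weighted sum over its compatible transcripts, which is linear in the abundances as described earlier.

```python
def estimate_abundances(compat, n_iter=200):
    """EM estimate of relative abundances from a fragment-by-transcript matrix."""
    n_tx = len(compat[0])
    rho = [1.0 / n_tx] * n_tx  # start from uniform abundances
    for _ in range(n_iter):
        counts = [0.0] * n_tx
        for row in compat:
            # P(fragment) is a linear function of the abundances.
            total = sum(r * w for r, w in zip(rho, row))
            if total == 0:
                continue
            # E-step: split the fragment among its compatible transcripts.
            for t in range(n_tx):
                counts[t] += rho[t] * row[t] / total
        # M-step: renormalize the expected counts.
        s = sum(counts)
        rho = [c / s for c in counts]
    return rho

# Toy data: 3 fragments unique to transcript 0, 1 unique to transcript 1, 2 ambiguous.
compat = [[1, 0], [1, 0], [1, 0], [0, 1], [1, 1], [1, 1]]
rho = estimate_abundances(compat)
print([round(r, 4) for r in rho])
```

For this toy input the ambiguous fragments carry no information about the split, so the estimate converges to the 3:1 ratio of the uniquely assigned fragments.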
De Novo Transcript Assembly
De Novo Transcript Assemblers
• Trans-ABySS: one of the first tools, a repurposed de Bruijn genome assembler (ABySS) that works well for viruses and bacteria.
• Oases: the equivalent of Trans-ABySS from the developers of Velvet.
• Trinity is probably the best one in terms of results and ease of use. The original paper showed some impressive results on non-coding RNAs in mammals.
• SOAPdenovo-Trans: developed at BGI. Has the heavy memory requirements of the SOAP tools (30 GB for an RNA-seq run).