Analysis of RNA-seq Data (Part 2), Lecture 10: September 20, 2012



Transcript Assembly and Quantification


Transcript Assembly

• Transcript assembly aims to answer questions about gene expression. This requires knowing the transcripts themselves (and hence their exon regions and splice sites) and the number of transcripts.

• Several transcripts may overlap at different regions, and there may be multiple copies of the same transcript.


Transcript Assembly

• Recall that in genome assembly, the aim is to assemble a single sequence from a set of reads.

• In transcript assembly, by contrast, the aim is to assemble the set of sequences that most likely explains all the reads.
– Transcript assembly is much more difficult for this reason.


De Novo vs. Reference-Guided Transcript Assembly

• De novo transcript assembly: assembly of transcripts when no reference genome exists.

• Reference-guided transcript assembly: significantly easier than de novo assembly.
– Map the reads to the reference (using the methods discussed last time) and use the alignments to guide assembly.


Reference-Guided Transcript Assembly


Cufflinks (Trapnell et al.)

• Cufflinks: an algorithm that identifies complete novel transcripts and probabilistically assigns reads to isoforms.

• Extends the work of TopHat (Pachter lab).

• The RNA sequence fragments are mapped to the reference using TopHat.

• The aim is to recover the minimal set of transcripts supported by the alignments.


Cufflinks (Trapnell et al.)

• A fragment corresponds to a single cDNA molecule, which can be represented by a pair of reads, one from each end.

• Uses a comparative transcriptome assembly algorithm to produce the minimal set of transcripts supported by the fragment alignments.

• Reduces the transcript assembly problem to finding a maximum matching in a weighted bipartite graph.


[Figure: diagram labeled TopHat and Cufflinks]


Cufflinks (Trapnell et al.)

• Takes as input cDNA fragment sequences that have been aligned to the genome by software capable of producing split alignments (e.g. TopHat).

• With paired-end RNA-seq, Cufflinks treats each pair of fragment reads as a single alignment. The algorithm assembles overlapping “bundles” of fragment alignments separately.
– This reduces running time and memory usage.
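The bundling idea can be sketched in a few lines. This is only an illustration, not Cufflinks' own code: it assumes every fragment alignment has already been reduced to a half-open (start, end) interval on a single chromosome, and the function name bundle_alignments is hypothetical.

```python
from typing import List, Optional, Tuple

Interval = Tuple[int, int]  # (start, end), half-open, all on one chromosome


def bundle_alignments(alignments: List[Interval]) -> List[List[Interval]]:
    """Group alignments into bundles of transitively overlapping intervals."""
    bundles: List[List[Interval]] = []
    bundle_end: Optional[int] = None  # right edge of the bundle being built
    for start, end in sorted(alignments):
        if bundle_end is None or start >= bundle_end:
            bundles.append([])        # gap in coverage: open a new bundle
            bundle_end = end
        else:
            bundle_end = max(bundle_end, end)
        bundles[-1].append((start, end))
    return bundles
```

Each bundle can then be assembled independently, which is where the savings in running time and memory would come from.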


The first step is to identify pairs of incompatible fragments that must have originated from distinct spliced mRNA isoforms. Fragments are connected in an “overlap graph” when they are compatible and their alignments overlap in the genome. Each fragment has one node in the graph, and an edge, directed from left to right along the genome, is placed between each pair of compatible fragments.
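A minimal sketch of the overlap graph follows. It rests on a simplified compatibility rule that is an assumption of this example, not the exact Cufflinks definition: two fragments are compatible when their alignments overlap and, inside the shared genomic span, they cover exactly the same bases (so they imply the same introns there). Fragments are lists of half-open exon blocks separated by introns; all names are hypothetical.

```python
from typing import Dict, List, Tuple

Block = Tuple[int, int]   # half-open aligned block (start, end)
Fragment = List[Block]    # sorted blocks; gaps between blocks are implied introns


def clip(frag: Fragment, lo: int, hi: int) -> List[Block]:
    """Restrict a fragment's blocks to the window [lo, hi)."""
    return [(max(s, lo), min(e, hi)) for s, e in frag if max(s, lo) < min(e, hi)]


def compatible(a: Fragment, b: Fragment) -> bool:
    """True if a and b overlap and cover the same bases in their shared span."""
    lo = max(a[0][0], b[0][0])
    hi = min(a[-1][1], b[-1][1])
    return lo < hi and clip(a, lo, hi) == clip(b, lo, hi)


def overlap_graph(frags: List[Fragment]) -> Dict[int, List[int]]:
    """Edges i -> j between compatible fragments, directed left to right."""
    order = sorted(range(len(frags)),
                   key=lambda i: (frags[i][0][0], frags[i][-1][1]))
    edges: Dict[int, List[int]] = {i: [] for i in range(len(frags))}
    for pos, i in enumerate(order):
        for j in order[pos + 1:]:
            if compatible(frags[i], frags[j]):
                edges[i].append(j)
    return edges


# Three exons A=[0,100), B=[200,300), C=[400,500).
frags = [
    [(50, 100), (200, 250)],   # 0: spliced A-B
    [(80, 100), (400, 450)],   # 1: spliced A-C, skipping B
    [(220, 300), (400, 420)],  # 2: spliced B-C
    [(410, 480)],              # 3: inside C
]
graph = overlap_graph(frags)
```

In this toy example fragments 0 and 1 are incompatible (one includes exon B, the other skips it), so they must come from distinct isoforms.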


Isoforms are then assembled from the overlap graph. Paths through the graph correspond to mutually compatible fragments that could be merged into complete isoforms.


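Dilworth’s Theorem: characterizes the width of any partially ordered set in terms of a partition of the order into a minimum number of chains. Specifically, the width (the size of a largest antichain) equals the minimum number of chains needed to partition the set.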


Partially Ordered Set

• A partially ordered set (or poset) formalizes and generalizes the intuitive concept of an ordering, sequencing, or arrangement of the elements of a set.

Poset = Set + Binary Relation (reflexive, antisymmetric, and transitive)


Antichain and Poset Width

• Two elements a and b of a partially ordered set are comparable if a ≤ b or b ≤ a.

• Chain: a subset of elements, every two of which are comparable.

• Antichain: a subset of a partially ordered set in which any two distinct elements are incomparable.

• Width of a poset: the cardinality of a maximum antichain.
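These definitions can be checked on a small example. The sketch below uses the divisors of 12 ordered by divisibility, which is an illustrative choice and not part of the lecture, and finds the width by brute force (fine for a handful of elements, far too slow for real data). The helper names are hypothetical.

```python
from itertools import combinations
from typing import Callable, Sequence


def comparable(a: int, b: int, leq: Callable[[int, int], bool]) -> bool:
    """a and b are comparable if a <= b or b <= a in the partial order."""
    return leq(a, b) or leq(b, a)


def is_chain(subset: Sequence[int], leq: Callable[[int, int], bool]) -> bool:
    """Every pair of elements is comparable."""
    return all(comparable(a, b, leq) for a, b in combinations(subset, 2))


def is_antichain(subset: Sequence[int], leq: Callable[[int, int], bool]) -> bool:
    """No pair of distinct elements is comparable."""
    return not any(comparable(a, b, leq) for a, b in combinations(subset, 2))


def width(elements: Sequence[int], leq: Callable[[int, int], bool]) -> int:
    """Size of a largest antichain, found by trying subsets from large to small."""
    for k in range(len(elements), 0, -1):
        if any(is_antichain(c, leq) for c in combinations(elements, k)):
            return k
    return 0


def divides(a: int, b: int) -> bool:
    """Partial order on positive integers: a <= b means a divides b."""
    return b % a == 0


divisors = [1, 2, 3, 4, 6, 12]
```

Here [1, 2, 4, 12] is a chain, [4, 6] is an antichain, and no antichain has three elements, so the width is 2.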


Dilworth’s Theorem

• Applied to Cufflinks, Dilworth's Theorem says that the maximum number of mutually incompatible fragments is the same as the minimum number of transcripts needed to “explain” all the fragments.

• Cufflinks implements a proof of Dilworth's Theorem that produces a minimal set of paths covering all the fragments in the overlap graph by finding the largest set of reads with the property that no two could have originated from the same isoform.
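One standard way to make this constructive is to turn the chain-cover problem into bipartite matching: the minimum number of chains equals the number of nodes minus the size of a maximum matching between "predecessor" and "successor" copies of the nodes. The sketch below is a simplified, unweighted version built on stated assumptions: the input is an acyclic graph given as a dict of successor lists, reachability in that graph is taken as the partial order, and a simple augmenting-path matching is used. Cufflinks itself works with a weighted bipartite graph, which is not reproduced here. All names are hypothetical.

```python
from typing import Dict, List, Set


def transitive_closure(edges: Dict[int, List[int]]) -> Dict[int, Set[int]]:
    """reach[u] = every node reachable from u (graph assumed acyclic)."""
    reach: Dict[int, Set[int]] = {}

    def visit(u: int) -> Set[int]:
        if u not in reach:
            reach[u] = set()
            for v in edges[u]:
                reach[u] |= {v} | visit(v)
        return reach[u]

    for u in edges:
        visit(u)
    return reach


def min_chain_cover(edges: Dict[int, List[int]]) -> List[List[int]]:
    """Partition the nodes into the fewest chains via bipartite matching."""
    reach = transitive_closure(edges)
    match_right: Dict[int, int] = {}  # v -> u: u is directly followed by v

    def augment(u: int, seen: Set[int]) -> bool:
        for v in sorted(reach[u]):
            if v in seen:
                continue
            seen.add(v)
            if v not in match_right or augment(match_right[v], seen):
                match_right[v] = u
                return True
        return False

    for u in edges:
        augment(u, set())

    successor = {u: v for v, u in match_right.items()}
    chains: List[List[int]] = []
    for start in sorted(edges):
        if start not in match_right:      # no predecessor: head of a chain
            chain = [start]
            while chain[-1] in successor:
                chain.append(successor[chain[-1]])
            chains.append(chain)
    return chains


# Toy graph: 0 -> 2 -> 3 and 1 -> 3; nodes 0 and 1 are incomparable.
example = {0: [2], 1: [3], 2: [3], 3: []}
chains = min_chain_cover(example)
```

For the toy graph the result is two chains, matching the largest antichain {0, 1}: two mutually incompatible fragments force at least two transcripts.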


Transcript Abundance Is Estimated

• Fragments are matched to the transcripts from which they could have originated.

• Transcript abundance is estimated using a statistical model in which the probability of observing each fragment is a linear function of the abundances of the transcripts from which it could have originated.

• Because only the ends of each fragment are sequenced, its length may be unknown.


[Figure label: violet fragment]

Assigning a fragment to different isoforms often implies a different length for it. Cufflinks incorporates the distribution of fragment lengths to help assign fragments to isoforms.
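To see why the isoform choice implies a length, consider a fragment whose two reads land in the first and last exon of a gene. The sketch below computes the fragment length implied by each candidate isoform and scores it with a normal density. The normal shape and the mean and standard deviation used here (200 and 20) are purely illustrative assumptions standing in for an empirical fragment-length distribution; the function names are hypothetical.

```python
import math
from typing import List, Tuple

Exon = Tuple[int, int]  # half-open genomic interval (start, end)


def implied_length(isoform: List[Exon], left: int, right: int) -> int:
    """Exonic bases of the isoform between genomic positions left and right."""
    return sum(max(0, min(e, right) - max(s, left)) for s, e in isoform)


def length_density(length: float, mean: float, sd: float) -> float:
    """Normal density, used here in place of an empirical length distribution."""
    z = (length - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))


# Two isoforms of one gene; the second skips the middle exon.
isoform_full = [(0, 100), (200, 300), (400, 500)]
isoform_skip = [(0, 100), (400, 500)]

# Fragment whose left end maps to position 50 and right end to position 450.
len_full = implied_length(isoform_full, 50, 450)   # 50 + 100 + 50
len_skip = implied_length(isoform_skip, 50, 450)   # 50 + 50

score_full = length_density(len_full, mean=200.0, sd=20.0)
score_skip = length_density(len_skip, mean=200.0, sd=20.0)
```

Under these assumed parameters the full isoform implies a 200-base fragment and is far more plausible than the exon-skipping isoform, which implies a 100-base fragment.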


Lastly, Cufflinks maximizes a function that assigns a likelihood to all possible sets of relative abundances, which produces the abundances that best explain the observed fragments (shown in the pie chart).
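A small sketch of likelihood maximization for relative abundances is given below. It uses a generic expectation-maximization (EM) loop for a mixture model, which is a common technique chosen here for illustration and not necessarily the optimizer Cufflinks uses. The matrix weights[f][t] is assumed to hold the probability of fragment f given transcript t (0 where they are incompatible), and every fragment is assumed to be compatible with at least one transcript. The function name is hypothetical.

```python
from typing import List


def em_abundances(weights: List[List[float]], iterations: int = 200) -> List[float]:
    """Estimate relative abundances rho maximizing prod_f sum_t rho[t] * weights[f][t]."""
    n_transcripts = len(weights[0])
    rho = [1.0 / n_transcripts] * n_transcripts   # start from uniform abundances
    for _ in range(iterations):
        counts = [0.0] * n_transcripts
        for row in weights:
            # P(fragment) is linear in the abundances of compatible transcripts.
            total = sum(r * w for r, w in zip(rho, row))
            for t, w in enumerate(row):
                counts[t] += rho[t] * w / total   # expected share of this fragment
        rho = [c / len(weights) for c in counts]  # re-estimate abundances
    return rho


# 3 fragments unique to transcript 0, 1 unique to transcript 1, 4 ambiguous.
toy_weights = [[1.0, 0.0]] * 3 + [[0.0, 1.0]] + [[1.0, 1.0]] * 4
rho = em_abundances(toy_weights)
```

In the toy data the unambiguous fragments favor transcript 0 by 3 to 1, and the ambiguous fragments end up split in the same ratio, giving relative abundances of 0.75 and 0.25.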


De Novo Transcript Assembly


De Novo Transcript Assemblers

• Trans-ABySS: one of the first tools, a repurposed de Bruijn genome assembler (ABySS) that works well for viruses and bacteria.

• Oases: the equivalent of Trans-ABySS from the developers of Velvet.


De Novo Transcript Assemblers

• Trinity: probably the best one in terms of results and ease of use. The original paper showed some impressive results on non-coding RNAs in mammals.

• SOAPdenovo-Trans: developed at BGI. It has the heavy memory requirements of the SOAP tools (30 GB for an RNA-seq run).