Forsharing cshl2011 sequencing

High-Resolution Views of Cancer Genomes

description

Short overview talk on exome and genome sequencing and DNAse-seq.

Transcript of Forsharing cshl2011 sequencing

Page 1: Forsharing cshl2011 sequencing

High-Resolution Views of Cancer Genomes

Page 2: Forsharing cshl2011 sequencing
Page 3: Forsharing cshl2011 sequencing

The  Central  Dogma  

Page 4: Forsharing cshl2011 sequencing
Page 5: Forsharing cshl2011 sequencing
Page 6: Forsharing cshl2011 sequencing

+  

Page 7: Forsharing cshl2011 sequencing
Page 8: Forsharing cshl2011 sequencing

Your  Nature  Paper  

Page 9: Forsharing cshl2011 sequencing

Our  First  Experiment  

Page 10: Forsharing cshl2011 sequencing

Overview  of  BAC  in  the  Genome  

Page 11: Forsharing cshl2011 sequencing

Sequencing  a  BAC  

Page 12: Forsharing cshl2011 sequencing

Sequence  Coverage  

Page 13: Forsharing cshl2011 sequencing

Repeats  

Page 14: Forsharing cshl2011 sequencing

Repeats  

Page 15: Forsharing cshl2011 sequencing

Repeats  are  not  created  equal  

Page 16: Forsharing cshl2011 sequencing

Genomic Sequencing

Targeting the Exome

Page 17: Forsharing cshl2011 sequencing

Long oligos synthesized on arrays (DNA)

RNA baits synthesized from DNA oligo template

RNA baits hybridized to DNA sequencing library

Targets captured using beads and biotin-labeled baits

RNA bait degraded, leaving sequencing library enriched for target regions

Page 18: Forsharing cshl2011 sequencing

Data Flow

FASTQ files generated by Illumina pipeline

Aligned to reference genome (hg18, excluding _random, unmapped, and hap) using Novoalign

SAM/BAM used extensively

Follow Broad Institute GATK pipeline for exome capture

Use Picard Java library for quality assessment

Processed BAM files available via local HTTP for browsing

Page 19: Forsharing cshl2011 sequencing

Data Pipeline...

Samtools import

Samtools sort

Picard MarkDuplicates

GATK Indel Realignment

GATK Quality Recalibration

Picard QC metrics
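The duplicate-marking step in the pipeline above can be illustrated with a short Python sketch. This is a simplified illustration of the idea, not Picard's implementation: the `Read` record and `mark_duplicates` function are hypothetical names, reads are treated as single-ended, and the rule "keep the copy with the highest summed base quality" is an assumption chosen for clarity.

```python
from collections import defaultdict
from typing import NamedTuple


class Read(NamedTuple):
    """Hypothetical minimal read record (not a real BAM parser type)."""
    name: str
    chrom: str
    five_prime_pos: int   # reference position of the read's 5' end
    strand: str           # "+" or "-"
    base_quals: tuple     # Phred base qualities


def mark_duplicates(reads):
    """Return the set of read names flagged as duplicates.

    Reads sharing chromosome, 5' position, and strand are assumed to be
    copies of one original fragment. The copy with the highest summed base
    quality is kept; the others are flagged.
    """
    groups = defaultdict(list)
    for read in reads:
        groups[(read.chrom, read.five_prime_pos, read.strand)].append(read)

    duplicates = set()
    for group in groups.values():
        best = max(group, key=lambda r: sum(r.base_quals))
        duplicates.update(r.name for r in group if r.name != best.name)
    return duplicates
```

Flagging rather than deleting the extra copies mirrors how duplicates are usually handled in BAM files, where downstream tools can choose to skip flagged reads.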

Page 20: Forsharing cshl2011 sequencing

Realignment around Indels

The problem
- Aligners align each read independently
- Potentially leads to increased error rates around indels

A potential solution
- Locally realign reads in regions that might harbor an indel
- Goal is to align reads overlying indels more accurately, reducing errors in each read and, in turn, reducing SNV call error rates

Page 21: Forsharing cshl2011 sequencing

Quality Recalibration

  Since most SNV callers will rely on quality scores to estimate error probabilities, having the best possible estimates for error rates is important

  Reported error rates from the Illumina sequencer generally reflect technical parameters of the base call process, but not other systematic biases

  Quality recalibration can include covariates to account for systematic biases

-  Cycle count, dinucleotide context, original quality, and sample/library variables
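A minimal Python sketch of the recalibration idea follows. It is not GATK's model: it simply bins bases by the covariates listed above, counts mismatches against the reference in each bin, and converts the empirical error rate back to a Phred score. The function names and the input tuple layout are hypothetical, and the usual step of excluding known polymorphic sites before counting mismatches is omitted.

```python
import math
from collections import defaultdict


def build_recalibration_table(observations):
    """Build {(reported_qual, cycle, dinucleotide): empirical Phred quality}.

    observations: iterable of (reported_qual, cycle, dinucleotide, is_mismatch)
    tuples, one per sequenced base (hypothetical input layout).
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [n_observed, n_mismatched]
    for reported_qual, cycle, dinuc, is_mismatch in observations:
        key = (reported_qual, cycle, dinuc)
        counts[key][0] += 1
        counts[key][1] += int(is_mismatch)

    table = {}
    for key, (n_obs, n_mismatch) in counts.items():
        # Add-one smoothing keeps the estimate finite when no mismatch is seen.
        error_rate = (n_mismatch + 1) / (n_obs + 2)
        table[key] = -10 * math.log10(error_rate)
    return table


def recalibrate(table, reported_qual, cycle, dinuc):
    """Look up the empirical quality; fall back to the reported one if unseen."""
    return table.get((reported_qual, cycle, dinuc), float(reported_qual))
```

For example, a bin reported as Q30 in which roughly 1 base in 100 mismatches the reference would be recalibrated to about Q20.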

Page 22: Forsharing cshl2011 sequencing

Variant Calling and Evaluation

A developing art

Page 23: Forsharing cshl2011 sequencing
Page 24: Forsharing cshl2011 sequencing

Sequencing  Tumor/Normal  Pairs  

Page 25: Forsharing cshl2011 sequencing

Good  SNP  

Page 26: Forsharing cshl2011 sequencing

Suspect  Variant  

Page 27: Forsharing cshl2011 sequencing

Somatic (tumor only) Variant

Page 28: Forsharing cshl2011 sequencing

Likely False Positive (normal only)

Page 29: Forsharing cshl2011 sequencing

LOH  

Page 30: Forsharing cshl2011 sequencing

NCI60  Exome  Sequencing  

No  Normals  Available!  

Page 31: Forsharing cshl2011 sequencing
Page 32: Forsharing cshl2011 sequencing
Page 33: Forsharing cshl2011 sequencing

Variants by Genomic Location

Page 34: Forsharing cshl2011 sequencing

All  Coding  Variants  

Page 35: Forsharing cshl2011 sequencing

Type  1:  in  dbSNP,  Type  2:  not  in  dbSNP  

Page 36: Forsharing cshl2011 sequencing

Coding,  novel  (no  dbSNP)  

Page 37: Forsharing cshl2011 sequencing
Page 38: Forsharing cshl2011 sequencing

Copy  Number  from  Exomes  

Page 39: Forsharing cshl2011 sequencing
Page 40: Forsharing cshl2011 sequencing
Page 41: Forsharing cshl2011 sequencing
Page 42: Forsharing cshl2011 sequencing

Complete  Genome  Sequencing  

Complete  Genomics  Data  

Page 43: Forsharing cshl2011 sequencing

Data

Delivery
- Via USB results

Storage
- Sizes are LARGE: 400 GB per sample as delivered with raw reads included
- Should use 2-location backed-up storage
- Not trivial to find such storage, so might resort to multiple USB drives

Minimize:
- Data movement
- Keeping multiple copies indefinitely

Page 44: Forsharing cshl2011 sequencing

Breakdown  of  Data  Sizes  

Page 45: Forsharing cshl2011 sequencing
Page 46: Forsharing cshl2011 sequencing

Data

Delivery

Storage

Processing
- Data are typically tab-delimited text files, so Excel can be useful for examining individual small files
- Generally, command-line tools are needed
- MacOS and Linux are the only supported operating systems, but Windows might work
- Some analyses (snpdiff) require large memory
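Because the files are tab-delimited text and can be very large, one option is to stream them row by row instead of loading them whole. The sketch below uses Python's standard `csv` module; the column names (`chromosome`, `begin`, `end`, `varType`) and the helper `filter_rows` are made up for illustration and are not taken from the actual delivered files.

```python
import csv
import io


def filter_rows(handle, column, wanted):
    """Yield rows (as dicts) whose `column` equals `wanted`, one at a time."""
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        if row[column] == wanted:
            yield row


# A tiny in-memory stand-in for a tab-delimited variant file (hypothetical columns).
example = io.StringIO(
    "chromosome\tbegin\tend\tvarType\n"
    "chr1\t100\t101\tsnp\n"
    "chr1\t200\t203\tdel\n"
    "chr2\t50\t51\tsnp\n"
)
snps = list(filter_rows(example, "varType", "snp"))
```

Since `filter_rows` is a generator, memory use stays flat regardless of file size; passing an open file handle in place of the `StringIO` object works the same way.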

Page 47: Forsharing cshl2011 sequencing

Directory  Structure  

Page 48: Forsharing cshl2011 sequencing

Workflows

Tumor/Normal
- Copy Number
- Structural Variation
- Annotated Somatic Variants

Germline
- List of annotated genotypes per individual, summarized into a single file that can be used for filtering
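The germline summary described above, with per-individual genotype lists combined into one table for filtering, might look roughly like the following Python sketch. The data layout, the `summarize_genotypes` name, and the use of "." for a missing call are all assumptions for illustration, not the actual workflow's format.

```python
def summarize_genotypes(per_individual):
    """Merge per-individual calls into one table (a list of rows).

    per_individual: {individual_name: {variant_key: genotype_string}}
    Returns a header row followed by one row per variant, with "." where an
    individual has no call for that variant (assumed placeholder).
    """
    individuals = sorted(per_individual)
    variants = sorted({v for calls in per_individual.values() for v in calls})

    rows = [["variant"] + individuals]
    for variant in variants:
        genotypes = [per_individual[name].get(variant, ".") for name in individuals]
        rows.append([variant] + genotypes)
    return rows
```

With one row per variant and one column per individual, filters such as "heterozygous in the child, homozygous reference in both parents" become simple row tests.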

Page 49: Forsharing cshl2011 sequencing

Germline  Workflow  

Page 50: Forsharing cshl2011 sequencing

Germline Workflow

Output

Future directions
- Be "smarter" about inheritance framework
- Further refinements of comparison to other data types (exomes, SNP arrays, RNA-seq)

Page 51: Forsharing cshl2011 sequencing

Tumor/Normal  Workflow  

Page 52: Forsharing cshl2011 sequencing

Medvedev  et  al.,  Nature  2009  

Page 53: Forsharing cshl2011 sequencing
Page 54: Forsharing cshl2011 sequencing
Page 55: Forsharing cshl2011 sequencing
Page 56: Forsharing cshl2011 sequencing
Page 57: Forsharing cshl2011 sequencing
Page 58: Forsharing cshl2011 sequencing

The Cancer Genome Atlas Research Network. Nature 455, 1061-1068 (2008). doi:10.1038/nature07385

Frequent genetic alterations in three critical signalling pathways.

Page 59: Forsharing cshl2011 sequencing
Page 60: Forsharing cshl2011 sequencing
Page 61: Forsharing cshl2011 sequencing

Chromatin

Chromatin is the complex of protein and DNA that makes up the chromosomes. It is not a static structure.

Page 62: Forsharing cshl2011 sequencing

DNAse is an enzyme that cuts DNA at locations where DNA is accessible

These "accessible" regions have been associated with open chromatin

Regions of open chromatin are necessary for transcriptional and regulatory machinery to have access to gene neighborhoods and facilitate transcription

Page 63: Forsharing cshl2011 sequencing

DNAse Hypersensitivity

Method for finding regions of "open" chromatin

In data published with the ENCODE consortium, DNAse hypersensitive (HS) sites were shown to be correlated with:
- Histone modification
- Transcription start sites
- Early replicating regions
- Transcription factor binding sites (experimentally determined by ChIP/chip, etc.)

Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. The ENCODE Consortium. Nature, 2007.

Page 64: Forsharing cshl2011 sequencing

DNAse-chip Method

Crawford, G.E., Davis, S., Scacheri, P.C., Renaud, G., Halawi, M.J., Erdos, M.R., Green, R., Meltzer, P.S., Wolfsberg, T.G., and Collins, F.S. Nat Methods, 2006

Page 65: Forsharing cshl2011 sequencing

DNAse-Seq Method

Crawford, G.E., Davis, S., Scacheri, P.C., Renaud, G., Halawi, M.J., Erdos, M.R., Green, R., Meltzer, P.S., Wolfsberg, T.G., and Collins, F.S. Nat Methods, 2006

Page 66: Forsharing cshl2011 sequencing
Page 67: Forsharing cshl2011 sequencing

DNAse Sites Relative to Genes

Page 68: Forsharing cshl2011 sequencing

DNAse HS Sites and Gene Expression

DNAse HS sites near transcription start sites are associated with actively transcribed genes.

Page 69: Forsharing cshl2011 sequencing
Page 70: Forsharing cshl2011 sequencing

Distances between sequences in non-DNAse HS regions show an oscillating pattern with a period that corresponds to a single turn of the double helix

DNAse is known to cut preferentially in the minor groove, which is exposed every 10.4 bases when DNA is wrapped around a nucleosome

About 147 base pairs of DNA are wrapped around each nucleosome

Implication: Nucleosomes are positioned in a highly organized, precise manner
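The periodicity argument above can be made concrete with a small Python sketch. It builds a histogram of pairwise distances between cut sites and measures how strongly that histogram oscillates at a candidate period. The function names, the 50-base distance cutoff, and the Fourier-style score are illustrative assumptions, not the analysis used for the slides.

```python
import math
from collections import Counter


def distance_histogram(positions, max_distance=50):
    """Count pairwise distances (in bases) between cut sites, up to max_distance."""
    positions = sorted(positions)
    hist = Counter()
    for i, start in enumerate(positions):
        for other in positions[i + 1:]:
            distance = other - start
            if distance > max_distance:
                break  # positions are sorted, so later ones are farther still
            hist[distance] += 1
    return hist


def period_strength(hist, period):
    """Normalized magnitude (0 to 1) of the histogram's oscillation at `period`."""
    total = sum(hist.values())
    if total == 0:
        return 0.0
    real = sum(n * math.cos(2 * math.pi * d / period) for d, n in hist.items())
    imag = sum(n * math.sin(2 * math.pi * d / period) for d, n in hist.items())
    return math.hypot(real, imag) / total


# 147 bp wrapped at 10.4 bases per helical turn is roughly 14 turns per nucleosome.
turns_per_nucleosome = 147 / 10.4
```

On cut sites spaced about 10.4 bases apart, `period_strength(hist, 10.4)` comes out close to 1, while an unrelated period such as 7 scores near 0.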

Nucleosome Positioning

Page 71: Forsharing cshl2011 sequencing
Page 72: Forsharing cshl2011 sequencing
Page 73: Forsharing cshl2011 sequencing

The  Last  Mile