Badalamenti PacBio tutorial 12-10-2014


upcoming tutorials

•  Today, December 10 – 2:30 PM
•  Friday, December 12 – 1:00 PM
•  Wednesday, December 17 – 1:00 PM
•  Wednesday, January 7 – 1:00 PM
•  Tuesday, January 13 – 10:00 AM

All sessions to be held in 138 Cargill; register at msi.umn.edu

PRE-PROCESSING   ASSEMBLY   POLISHING

Short Reads (Illumina) - graph assembly

[Figure: short-read workflow. Labels: adapter removal; quality trimming; error correction; de Bruijn or string graph construction; contigs; scaffolding (read pairs, NNNNNN gaps); read mapping]

Long Reads (PacBio) - HGAP assembly

[Figure: long-read workflow. Labels: reads; read length; read self-correction; overlap-layout-consensus assembly; consensus calling with quiver; assembled genome]

1 pre-processing   2 assembly   3 finishing/polishing

the overall assembly strategy is the same…

…but the data and tools are fundamentally different

.h5 files contain a lot more than just basecalls and quality scores

$ h5dump -n <smrtcell_data_file>.bax.h5

[Figure: typical (size-selected) read length distribution, P4-C2 chemistry; data from 1 SMRTcell. x-axis: subread length; y-axes: subreads and Mb > subread length]

[Figure: quality scores across all bases. x-axis: position in read]

PacBio data is noisy

•  PacBio data cannot (currently) be assembled in its raw state
•  several strategies exist for correcting reads prior to assembly
•  correction without complementary technology used to be difficult
–  until recently, was limited by computational power and SMRT cell throughput

Koren & Phillippy Curr Op Micro 2014

size matters

before size selection

[Figure: subread length distribution; data from 1 SMRTcell, ~4 Mb genome. x-axis: subread length; y-axes: subreads and Mb > subread length]

mean 1,527 bp, N50 1,866 bp

size matters

…after size selection

[Figure: subread length distribution; data from 1 SMRTcell, ~4 Mb genome. x-axis: subread length; y-axes: subreads and Mb > subread length]

mean 4,505 bp, N50 6,591 bp
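The mean and N50 quoted on the size-selection slides can be computed from a plain list of subread lengths. N50 is the length at which reads of that length or longer hold at least half of all sequenced bases. A minimal Python sketch, assuming the lengths have already been extracted; the function name and the toy values are illustrative and are not part of SMRT Analysis:

```python
def mean_and_n50(lengths):
    """Return (mean, N50) for a non-empty list of read lengths in bp."""
    if not lengths:
        raise ValueError("need at least one read length")
    total = sum(lengths)
    running = 0
    # Walk from the longest read down; N50 is the length at which the
    # running sum first reaches half of all sequenced bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return total / len(lengths), length


# Toy lengths for illustration only (not the SMRT cell data from the slides).
toy_lengths = [500, 1000, 1500, 2000, 8000, 12000]
mean_bp, n50_bp = mean_and_n50(toy_lengths)
print(f"mean {mean_bp:,.0f} bp, N50 {n50_bp:,} bp")
```

Because N50 is weighted by bases rather than by read count, a few long reads move it much more than they move the mean, which is why size selection shifts N50 so strongly.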

other options for assembling PacBio reads


https://github.com/PacificBiosciences/Bioinformatics-Training/wiki/Large-Genome-Assembly-with-PacBio-Long-Reads

data delivery and organization

•  files typically transferred as gzipped tarballs (.tgz)
•  deposited by Matt Bockol (Mayo) onto MSI to /project/scratch/bockolm2
•  MSI has plans to streamline data delivery
•  recommend organizing chronologically by run
•  create separate project/sample directories with symbolic links
•  SMRT cell directory names are not informative
•  within untarred parent directory, run

$ get_smrtcell_info.sh

badalame@login02 [/home/bonddr/shared/pacbio_data/runs/2014-10-15] % get_smrtcell_info.sh
2014-10-15 A01_1 WT_Gsul_BPSS_repeat_0.050_nM
2014-10-15 A01_2 WT_Gsul_BPSS_repeat_0.050_nM
2014-10-15 B01_1 WTL_BPSS_repeat_0.050_nM
2014-10-15 B01_2 WTL_BPSS_repeat_0.050_nM
2014-10-15 C01_1 CES_3_BPSS_0.050_nM
2014-10-15 C01_2 CES_3_BPSS_0.050_nM
2014-10-15 D01_1 JG233_raw_BPSS_0.050_nM
2014-10-15 D01_2 JG233_raw_BPSS_0.050_nM
2014-10-15 E01_1 JG233_S_C_BPSS_0.050_nM
2014-10-15 E01_2 JG233_S_C_BPSS_0.050_nM
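Since SMRT cell directory names are not informative, a listing like the one above is a convenient starting point for building per-sample project directories. A hedged Python sketch that groups cells by sample name; it assumes each output line has exactly three whitespace-separated fields (run date, cell ID, sample name), which is inferred from the example output rather than from any documented format of get_smrtcell_info.sh:

```python
from collections import defaultdict


def group_cells_by_sample(listing):
    """Map sample name -> list of (run_date, cell_id) tuples.

    Assumes lines of the form 'date cell sample'; anything else
    (blank lines, shell prompts) is skipped.
    """
    cells = defaultdict(list)
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue
        run_date, cell_id, sample = fields
        cells[sample].append((run_date, cell_id))
    return dict(cells)


# Three lines copied from the example output above.
listing = """\
2014-10-15 A01_1 WT_Gsul_BPSS_repeat_0.050_nM
2014-10-15 A01_2 WT_Gsul_BPSS_repeat_0.050_nM
2014-10-15 B01_1 WTL_BPSS_repeat_0.050_nM
"""
by_sample = group_cells_by_sample(listing)
for sample, cells in by_sample.items():
    print(sample, cells)
```

The resulting mapping could then drive creation of symbolic links into project/sample directories; that step is left out because the on-disk cell directory names are not shown in the slides.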

typical workflow: de novo microbial assembly

•  gather, organize, and verify data
•  start isub session within NX Client on MSI
•  import SMRT cells
•  run subread filtering / standard QC
•  run HGAP with length cutoff to provide 100x coverage
•  interpret results / re-run with modifications
•  circularize chromosome(s) and plasmids
–  reorient to begin at replication origin if desired
–  upload as new “reference” sequence
•  run base modification and motif detection
•  iteratively run quiver until QV > 50
•  final polish with short reads (if available) using Pilon
•  annotate

pull reads into other pipeline(s)
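The "reorient to begin at replication origin" step in the workflow amounts to rotating a circular sequence. A minimal Python sketch, assuming the contig has already been circularized (overlapping ends trimmed) and that the origin coordinate is known from elsewhere; `rotate_circular` and `new_start` are hypothetical names, and locating the origin itself is not covered:

```python
def rotate_circular(seq, new_start):
    """Rotate a circular sequence so 0-based position new_start becomes position 0."""
    if not seq:
        return seq
    new_start %= len(seq)  # tolerate coordinates beyond the sequence length
    return seq[new_start:] + seq[:new_start]


# Toy example: start the sequence at the 4th base (0-based index 3).
rotated = rotate_circular("ACGTTGCA", 3)
print(rotated)
```

The rotation keeps every base and only changes where the linear representation of the circle begins, so the sequence length is unchanged.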

import SMRT cells

•  any readable file path can be scanned
•  three options for importing data

1. physically move or copy SMRT cell data to /smrtanalysis/userdata/inputs_dropbox
–  this is dangerous if you ever need to remove smrtanalysis from your home directory

2. create symbolic links to SMRT cell data in inputs_dropbox
–  better option

$ ln -s /path/to/smrtcells ~/smrtanalysis/userdata/inputs_dropbox

3. scan defined file path(s) for SMRT cell data, e.g. /home/PIjoe/shared/pacbio_data/projects/sampleID/smrtcells
–  best option: allows for personalized data organization outside SMRT Portal

once imported, SMRT cells cannot be (easily) removed from the available list

QC – adapter removal and subread filtering

key terms:

•  SMRT bell library
•  polymerase read
•  adapter
•  subreads
•  full pass subread
•  filtered subreads – subreads passing specified length and quality filters

Travers et al. Nucl. Acids Res. (2010)

CCS reads

[Figure: CCS workflow. 1. generate amplicon; 2. ligate adaptors; 3. sequence; 4. data analysis. Labels: raw long read; processed long read; single-molecule fragments; circular consensus sequence (ccs); SMRTbell; 5' forward strand 3'; 3' reverse strand 5'; DNA polymerase; template; 1° analysis]

Fichot et al. Microbiome 1:10 (2013)

running HGAP

always run with 100x coverage of longest reads

key parameters:

•  minimum subread length
–  set to value that provides ~100x coverage based on subread filtering curve
•  minimum polymerase read quality
–  some pipelines default to 0.75, but I always set to 0.8 unless limited by coverage
•  anticipated genome size
–  your best guess based on related species
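Choosing the minimum subread length for a target coverage can be done by accumulating bases from the longest subread downward until the target is reached. A minimal Python sketch under stated assumptions: the subread lengths are already available as a list, and `length_cutoff_for_coverage` is an illustrative helper, not a SMRT Portal function. The toy genome and read set are made up to keep the arithmetic easy to follow:

```python
def length_cutoff_for_coverage(lengths, genome_size, target_coverage):
    """Largest cutoff L such that reads of length >= L reach the target coverage.

    Returns None when all reads together fall short of the target.
    """
    needed = genome_size * target_coverage
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= needed:
            return length
    return None


# Toy data: a 10 kb "genome" and 1.5 Mb of reads in three length classes.
toy_lengths = [10000] * 50 + [5000] * 100 + [1000] * 500
cutoff_30x = length_cutoff_for_coverage(toy_lengths, 10_000, 30)
cutoff_100x = length_cutoff_for_coverage(toy_lengths, 10_000, 100)
cutoff_200x = length_cutoff_for_coverage(toy_lengths, 10_000, 200)
print(cutoff_30x, cutoff_100x, cutoff_200x)
```

The same calculation can be read off the subread filtering curve (Mb > subread length): find the length at which the curve equals genome size times target coverage.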

pre-assembly

•  HGAP automatically sets length cutoff providing 30x coverage in longest reads
•  blasr maps shorter reads to longer reads
•  pbdagcon calls consensus and spits out corrected long reads
–  these can be useful for other pipelines
–  some long reads get shorter!
•  pre-assembled yield
–  the fraction of total seed bases (i.e. 30x in longest reads) that survived self correction
–  low yield can result from ends being truncated and/or long reads unable to be corrected

Polymerase Read Bases         370,004,973
Length Cutoff                 12,944
Seed Bases                    114,063,209
Pre-Assembled bases           76,895,237
Pre-Assembled Yield           0.674
Pre-Assembled Reads           7,719
Pre-Assembled Reads Length    9,961
Pre-Assembled N50             13,168

[Figure: read length axis, 0 to 30,000; raw uncorrected read (26,076 bp)]
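The pre-assembly numbers above are internally consistent, which is a quick check worth repeating on any HGAP report. A short Python sketch using only the values from the table; reading "Pre-Assembled Reads Length" as a mean, and dividing seed bases by the 30x seed target to get a rough genome size, are back-of-envelope assumptions rather than anything HGAP states:

```python
# Values copied from the pre-assembly report above.
seed_bases = 114_063_209
pre_assembled_bases = 76_895_237
pre_assembled_reads = 7_719

# Yield: fraction of seed bases that survived self-correction.
pre_assembled_yield = pre_assembled_bases / seed_bases

# Mean corrected read length (assumed to be what the report calls
# "Pre-Assembled Reads Length"; it matches to within rounding).
mean_corrected_length = pre_assembled_bases / pre_assembled_reads

# Rough genome size implied by a 30x seed target (an assumption).
implied_genome_bp = seed_bases / 30

print(f"yield = {pre_assembled_yield:.3f}")
print(f"mean corrected length = {mean_corrected_length:,.1f} bp")
print(f"implied genome size = {implied_genome_bp / 1e6:.2f} Mb")
```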

interpreting HGAP results – final assembly

my assembly always has a ~4kb plasmid?! (not really)

•  check coverage plots
–  should be even, without large spikes (collapsed repeats) or dips
•  check for plasmids
–  previous versions sent plasmids to separate file
•  why might you have multiple contigs? (for microbes)
–  anticipated genome size was incorrect
–  long, unresolvable/complex repeats
–  low pre-assembled yield
–  some contigs with abnormally low or high coverage might be spurious and can possibly be ignored
•  BLAST any small contigs
•  sum lengths of contigs and re-run HGAP if necessary
•  try HGAP.2 (slower, but more accurate)

when an assembly returns a circular genome

See http://files.pacb.com/Training/CircularContigConfirmationGepard/story.html

script for separating contigs for individual circularization: $ extract_contigs.sh

uploading reference sequences

unlike SMRT cell data, reference sequences cannot be symlinked!

•  make copies of reference genomes in ~/smrtanalysis/userdata/references_dropbox
•  larger genomes can take several minutes to finish uploading
•  sequence(s) should be in a single .fasta file (including plasmids)

base modification and motif detection

•  makes use of real-time kinetic data to evaluate potential base modifications based on
–  long active site residence time
–  interpulse duration
•  pipeline also runs RS_Resequencing (i.e. quiver) by default

https://github.com/PacificBiosciences/SMRT-Analysis/wiki/SMRT-Pipe-Reference-Guide-v2.2.0

quiver isn’t perfect: using Pilon to polish remaining indels

•  makes use of short read mapping to identify potential indels, SNPs, ambiguous bases, local misassemblies

$ java -Xmx16G -jar path/to/pilon-1.8.jar \
    --genome path/to/fasta --unpaired path/to/mapping.bam \
    --output sample_name --changes --variant --tracks \
    --mindepth 100

Pilon removed 128 remaining indels in 3.8 Mbp genome despite Quiver calling > QV 55 consensus


•  final quiver polish: 3,820,756 bp, 99.9999% (QV 60)
•  pilon: 128 indels detected, 3,820,884 bp
•  re-run quiver: 3,820,866 bp

Sequence                                     Position  Variant      Type  Coverage  Confidence  Genotype
unitig_0|quiver|quiver|quiver|quiver|pilon   3328288   3328288delA  DEL   100       50          haploid
unitig_0|quiver|quiver|quiver|quiver|pilon   3782112   3782112delG  DEL   100       50          haploid
unitig_0|quiver|quiver|quiver|quiver|pilon   1370128   1370128delC  DEL   100       49          haploid
unitig_0|quiver|quiver|quiver|quiver|pilon   2555272   2555272delG  DEL   100       49          haploid
unitig_0|quiver|quiver|quiver|quiver|pilon   3063922   3063922delG  DEL   100       49          haploid
unitig_0|quiver|quiver|quiver|quiver|pilon   2620561   2620561delG  DEL   100       48          haploid
unitig_0|quiver|quiver|quiver|quiver|pilon   2782988   2782988delG  DEL   100       48          haploid
unitig_0|quiver|quiver|quiver|quiver|pilon   2924523   2924523delT  DEL   100       48          haploid
unitig_0|quiver|quiver|quiver|quiver|pilon   2962387   2962387delC  DEL   100       48          haploid
unitig_0|quiver|quiver|quiver|quiver|pilon   3342764   3342764delA  DEL   100       48          haploid
unitig_0|quiver|quiver|quiver|quiver|pilon   218678    218678delG   DEL   100       47          haploid
unitig_0|quiver|quiver|quiver|quiver|pilon   731966    731966delG   DEL   100       47          haploid
unitig_0|quiver|quiver|quiver|quiver|pilon   2962119   2962119delC  DEL   100       47          haploid
unitig_0|quiver|quiver|quiver|quiver|pilon   520394    520394delC   DEL   100       46          haploid
unitig_0|quiver|quiver|quiver|quiver|pilon   1081259   1081259delG  DEL   100       45          haploid
unitig_0|quiver|quiver|quiver|quiver|pilon   3038349   3038349delC  DEL   100       44          haploid
unitig_0|quiver|quiver|quiver|quiver|pilon   830503    830503delG   DEL   100       43          haploid
unitig_0|quiver|quiver|quiver|quiver|pilon   3790899   3790899delC  DEL   100       41          haploid
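The QV figures on these slides are Phred-scaled: QV = -10 · log10(error rate), so QV 60 corresponds to 99.9999% accuracy. A small Python sketch that puts the reported QV next to the number of indels Pilon found; treating all 128 Pilon changes as true consensus errors is an assumption made only to get a rough empirical figure:

```python
import math


def qv_to_error_rate(qv):
    """Phred-scaled quality value -> per-base error probability."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(error_rate):
    """Per-base error probability -> Phred-scaled quality value."""
    return -10 * math.log10(error_rate)


genome_bp = 3_820_756  # length after the final quiver polish

# Errors expected genome-wide if the consensus really were QV 60.
expected_errors_qv60 = genome_bp * qv_to_error_rate(60)

# Rough empirical QV if all 128 indels fixed by Pilon were real errors
# (an assumption; some could be short-read mapping artifacts).
empirical_qv = error_rate_to_qv(128 / genome_bp)

print(f"expected errors at QV 60: {expected_errors_qv60:.1f}")
print(f"empirical QV from 128 indels: {empirical_qv:.1f}")
```

Under that assumption the gap between the reported and the empirical value is about 15 QV units, which is the quantitative version of "quiver isn't perfect".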

where is my data?

•  all secondary analysis data resides in ~/smrtanalysis/userdata/jobs
•  each job is assigned a six-digit ID corresponding to its directory name
•  graphs and HTML files are in /results
•  filtered subreads, assembly .fasta files and anything else are in /data

typical run times on MSI

Pipeline                     Genome size (Mbp)   # contigs   # SMRT cells   coverage   wall time
RS_Subreads                  3.7                 n/a         8              580x       34 m
RS_HGAP.3                    3.7                 2           8              100x       2 h 45 m
RS_HGAP.3                    7.2                 1           11             100x       13 h 45 m
RS_Modification_and Motif    3.7                 2           8              580x       10 h 12 m
RS_Resequencing              3.7                 2           8              140x       2 h 7 m

NOTE: all pipelines begin with adapter removal and subread filtering

visualizing results in SMRT View

•  start an isub session with extra memory:

$ isub -m 32gb

•  click on the SMRT View button within the job
•  launches a java web application
•  must be run from an active SMRT Portal session

troubleshooting

•  stop SMRT Portal

$ /panfs/roc/pacbio/stop_user_portal.sh

•  LAST RESORT

$ pbsave.sh
$ /panfs/roc/pacbio/delete_user_portal.sh

THIS WILL REMOVE ALL ANALYSIS DATA UNLESS SAVED! contact [email protected]

resources and additional information

•  visit http://github.com/PacificBiosciences
•  email [email protected] [email protected] [email protected]
•  check Twitter! @PacBio @SageSci @UMNmsi @lexnederbragt @aphillippy @sergekoren @mike_schatz @pathogenomenick @BaCh_mira @LizzyWilbanks @TheGeneMyers @infoecho @BioInfoBrett @OmicsOmicsBlog @ctitusbrown

data for 5 organisms just released freely available