Badalamenti PacBio tutorial 12-10-2014 · .h5 files contain a lot more than just basecalls and...

upcoming tutorials

•  Today, December 10 – 2:30 PM
•  Friday, December 12 – 1:00 PM
•  Wednesday, December 17 – 1:00 PM
•  Wednesday, January 7 – 1:00 PM
•  Tuesday, January 13 – 10:00 AM

All sessions to be held in 138 Cargill
register at msi.umn.edu

[Figure: assembly workflows in three stages: 1 pre-processing, 2 assembly, 3 finishing/polishing]

Short Reads (Illumina) - graph assembly: adapter removal, quality trimming, de Bruijn or string graph construction, error correction, scaffolding (contigs joined by read pairs, gaps shown as NNNNNN), read mapping

Long Reads (PacBio) - HGAP assembly: read self-correction, overlap-layout-consensus assembly, consensus calling with quiver (illustrated by an alignment of overlapping reads), assembled genome

the overall assembly strategy is the same…

…but the data and tools are fundamentally different

.h5 files contain a lot more than just basecalls and quality scores

$ h5dump -n <smrtcell_data_file>.bax.h5

[Figure: typical (size-selected) read length distribution, P4-C2 chemistry; data from 1 SMRTcell. Axes: subread length (x), number of subreads and Mb > subread length (y).]

[Figure: quality scores across all bases, plotted by position in read.]

PacBio data is noisy

•  PacBio data cannot (currently) be assembled in its raw state
•  several strategies exist for correcting reads prior to assembly
•  correction without complementary technology used to be difficult
   –  until recently, was limited by computational power and SMRT cell throughput

Koren & Phillippy Curr Op Micro 2014

size matters: before size selection

[Figure: subread length distribution; data from 1 SMRTcell, ~4 Mb genome. Axes: subread length (x), number of subreads and Mb > subread length (y).]

mean 1,527 bp N50 1,866 bp

size matters: …after size selection

[Figure: subread length distribution; data from 1 SMRTcell, ~4 Mb genome. Axes: subread length (x), number of subreads and Mb > subread length (y).]

mean 4,505 bp N50 6,591 bp
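The mean and N50 quoted on these slides summarize the subread length distribution. A minimal sketch of how the two statistics are usually computed follows; the function names and the toy lengths are illustrative assumptions, not values from the slides.

```python
def mean_length(lengths):
    """Arithmetic mean of the read lengths."""
    return sum(lengths) / len(lengths)


def n50(lengths):
    """Return the N50: the length L such that reads of length >= L
    together contain at least half of all sequenced bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("lengths must not be empty")


# Toy subread lengths (hypothetical, chosen only to make the arithmetic easy)
toy = [2, 2, 2, 3, 3, 4, 8, 8]
print(mean_length(toy))  # 4.0
print(n50(toy))          # 8
```

The N50 weights long reads more heavily than the mean does, which is why the two numbers diverge further after size selection.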

other options for assembling PacBio reads


https://github.com/PacificBiosciences/Bioinformatics-Training/wiki/Large-Genome-Assembly-with-PacBio-Long-Reads

data delivery and organization

•  files typically transferred as gzipped tarballs (.tgz)
•  deposited by Matt Bockol (Mayo) onto MSI to /project/scratch/bockolm2
   MSI has plans to streamline data delivery
•  recommend organizing chronologically by run
•  create separate project/sample directories with symbolic links
•  SMRT cell directory names are not informative
•  within untarred parent directory, run
   $ get_smrtcell_info.sh

badalame@login02 [/home/bonddr/shared/pacbio_data/runs/2014-10-15] % get_smrtcell_info.sh
2014-10-15 A01_1 WT_Gsul_BPSS_repeat_0.050_nM
2014-10-15 A01_2 WT_Gsul_BPSS_repeat_0.050_nM
2014-10-15 B01_1 WTL_BPSS_repeat_0.050_nM
2014-10-15 B01_2 WTL_BPSS_repeat_0.050_nM
2014-10-15 C01_1 CES_3_BPSS_0.050_nM
2014-10-15 C01_2 CES_3_BPSS_0.050_nM
2014-10-15 D01_1 JG233_raw_BPSS_0.050_nM
2014-10-15 D01_2 JG233_raw_BPSS_0.050_nM
2014-10-15 E01_1 JG233_S_C_BPSS_0.050_nM
2014-10-15 E01_2 JG233_S_C_BPSS_0.050_nM
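To organize cells by sample, the listing above can be grouped programmatically. This is a rough sketch that assumes each output line has three whitespace-separated columns, which I read as run date, cell ID, and sample name; the function name is hypothetical and the script's real output format may differ.

```python
from collections import defaultdict

# A few lines copied from the listing on this slide
LISTING = """\
2014-10-15 A01_1 WT_Gsul_BPSS_repeat_0.050_nM
2014-10-15 A01_2 WT_Gsul_BPSS_repeat_0.050_nM
2014-10-15 B01_1 WTL_BPSS_repeat_0.050_nM
2014-10-15 B01_2 WTL_BPSS_repeat_0.050_nM
"""


def group_cells_by_sample(listing):
    """Map sample name -> list of (run, cell) tuples.

    Assumes three whitespace-separated columns per line (an assumption
    about the listing format, not a documented guarantee)."""
    groups = defaultdict(list)
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or unexpected lines
        run, cell, sample = fields
        groups[sample].append((run, cell))
    return dict(groups)


for sample, cells in group_cells_by_sample(LISTING).items():
    print(sample, [cell for _, cell in cells])
```

Each group can then become one project/sample directory holding links to its SMRT cells.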

typical workflow: de novo microbial assembly

•  gather, organize, and verify data
•  start isub session within NX Client on MSI
•  import SMRT cells
•  run subread filtering / standard QC
•  run HGAP with length cutoff to provide 100x coverage
•  interpret results / re-run with modifications
•  circularize chromosome(s) and plasmids
   –  reorient to begin at replication origin if desired
   –  upload as new “reference” sequence
•  run base modification and motif detection
•  iteratively run quiver until QV > 50
•  final polish with short reads (if available) using Pilon
•  annotate

pull reads into other pipeline(s)
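The "reorient to begin at replication origin" step in this workflow amounts to rotating a circular sequence. The sketch below assumes the contig has already been circularized (redundant overlapping ends trimmed) and that the 0-based origin position is known from elsewhere; the function name and toy sequence are hypothetical.

```python
def rotate_to_origin(sequence, origin_index):
    """Rotate a circular sequence so that it starts at origin_index (0-based).

    Assumes `sequence` is one full turn of the circle with no duplicated ends."""
    if not sequence:
        return sequence
    origin_index %= len(sequence)
    return sequence[origin_index:] + sequence[:origin_index]


# Toy example: make the sequence start at the "ATG" found at index 3
print(rotate_to_origin("GGGATGCCC", 3))  # ATGCCCGGG
```

The rotated sequence has the same length and content as the input, so it can be uploaded as the new reference without changing the assembly itself.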

import SMRT cells

•  any readable file path can be scanned
•  three options for importing data

1. physically move or copy SMRT cell data to /smrtanalysis/userdata/inputs_dropbox
   this is dangerous if you ever need to remove smrtanalysis from your home directory

2. create symbolic links to SMRT cell data in inputs_dropbox
   $ ln -s /path/to/smrtcells ~/smrtanalysis/userdata/inputs_dropbox
   better option

3. scan defined file path(s) for SMRT cell data, e.g.
   /home/PIjoe/shared/pacbio_data/projects/sampleID/smrtcells
   best option – allows for personalized data organization outside SMRT Portal

once imported, SMRT cells cannot be (easily) removed from the available list
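For option 2, or for building per-project directories of links, the linking can be scripted. This is a minimal sketch using only the Python standard library; the function name and every directory name in the demonstration are hypothetical, and it runs in a throwaway temporary directory instead of touching real data.

```python
import os
import tempfile


def link_smrtcells(cell_dirs, dest_dir):
    """Create one symbolic link in dest_dir per SMRT cell directory.

    Names that already exist in dest_dir are left untouched.
    Returns the list of links that were created."""
    os.makedirs(dest_dir, exist_ok=True)
    created = []
    for cell_dir in cell_dirs:
        source = os.path.abspath(cell_dir)
        link = os.path.join(dest_dir, os.path.basename(source))
        if not os.path.lexists(link):
            os.symlink(source, link)
            created.append(link)
    return created


# Demonstration with made-up paths inside a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    cell = os.path.join(tmp, "runs", "2014-10-15", "A01_1")
    os.makedirs(cell)
    project = os.path.join(tmp, "projects", "sampleID", "smrtcells")
    links = link_smrtcells([cell], project)
    print(os.path.islink(links[0]))
```

Because only links are created, the original run directories stay in place and deleting a project directory never removes raw data.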

QC – adapter removal and subread filtering

key terms:
•  SMRTbell library
•  polymerase read
•  adapter
•  subreads
•  full pass subread
•  filtered subreads – subreads passing specified length and quality filters
•  CCS reads

[Figure: SMRTbell template with forward and reverse strands and DNA polymerase. Travers et al. Nucl. Acids Res. (2010)]

[Figure: 1. generate amplicon, 2. ligate adaptors, 3. sequence, 4. data analysis; 1° analysis turns the raw long read into a processed long read, yielding single-molecule fragments or a circular consensus sequence (ccs). Fichot et al. Microbiome 1:10 (2013)]

running HGAP

always run with 100x coverage of longest reads

key parameters:
•  minimum subread length
   –  set to value that provides ~100x coverage based on subread filtering curve
•  minimum polymerase read quality
   –  some pipelines default to 0.75, but I always set to 0.8 unless limited by coverage
•  anticipated genome size
   –  your best guess based on related species
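Picking the minimum subread length that yields a target coverage can also be done directly from the list of subread lengths, instead of reading it off the filtering curve. The sketch below is an assumption about how one might do this by hand, not how SMRT Portal or HGAP computes its cutoff; the function name and toy numbers are hypothetical.

```python
def length_cutoff(subread_lengths, genome_size, target_coverage):
    """Return the largest minimum-length cutoff whose retained subreads
    still total at least target_coverage * genome_size bases.

    Returns None if the whole dataset falls short of the target."""
    needed = target_coverage * genome_size
    running = 0
    # Walk from the longest read down, accumulating bases until the
    # target is reached; the read that crosses the line sets the cutoff.
    for length in sorted(subread_lengths, reverse=True):
        running += length
        if running >= needed:
            return length
    return None


# Toy data: a 10 bp "genome" and six reads
reads = [10, 9, 8, 5, 5, 3]
print(length_cutoff(reads, genome_size=10, target_coverage=2))  # 8
print(length_cutoff(reads, genome_size=10, target_coverage=5))  # None
```

With real data, `genome_size` would be the anticipated genome size and `target_coverage` would be 100 for the subread filter described above.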

pre-assembly

•  HGAP automatically sets length cutoff providing 30x coverage in longest reads
•  blasr maps shorter reads to longer reads
•  pbdagcon calls consensus and spits out corrected long reads
   –  these can be useful for other pipelines
   –  some long reads get shorter!
•  pre-assembled yield
   –  the fraction of total seed bases (i.e. 30x in longest reads) that survived self correction
   –  losses can result from ends being truncated and/or long reads that could not be corrected

Polymerase Read Bases        370,004,973
Length Cutoff                12,944
Seed Bases                   114,063,209
Pre-Assembled bases          76,895,237
Pre-Assembled Yield          0.674
Pre-Assembled Reads          7,719
Pre-Assembled Reads Length   9,961
Pre-Assembled N50            13,168
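The numbers in this report can be cross-checked with simple arithmetic. The sketch below recomputes the yield from the seed and pre-assembled bases, and shows that the reported read length is consistent with a mean (bases divided by reads). The last line is my own assumption: if the seed set is exactly 30x, the seed bases imply the genome size that was used.

```python
# Values copied from the pre-assembly report on this slide
seed_bases = 114_063_209
pre_assembled_bases = 76_895_237
pre_assembled_reads = 7_719

# Pre-assembled yield = corrected bases / seed bases
yield_fraction = pre_assembled_bases / seed_bases
print(round(yield_fraction, 3))                   # 0.674

# Mean pre-assembled read length (integer part matches the report)
print(pre_assembled_bases // pre_assembled_reads)  # 9961

# Assumption: the seed set is exactly 30x of the anticipated genome size
implied_genome_size = seed_bases / 30
print(round(implied_genome_size / 1e6, 1))        # 3.8 (Mbp)
```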

[Figure: read length axis (0 to 30,000) showing a raw uncorrected read (26,076 bp).]

interpreting HGAP results – final assembly

my assembly always has a ~4 kb plasmid?! (not really)

•  check coverage plots
   –  should be even, without large spikes (collapsed repeats) or dips
•  check for plasmids
   –  previous versions sent plasmids to separate file
•  why might you have multiple contigs? (for microbes)
   –  anticipated genome size was incorrect
   –  long, unresolvable/complex repeats
   –  low pre-assembled yield
   –  some contigs with abnormally low or high coverage might be spurious and can possibly be ignored
•  BLAST any small contigs
•  sum lengths of contigs and re-run HGAP if necessary
•  try HGAP.2 (slower, but more accurate)

when an assembly returns a circular genome

See http://files.pacb.com/Training/CircularContigConfirmationGepard/story.html

script for separating contigs for individual circularization: $ extract_contigs.sh
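The contents of extract_contigs.sh are not shown on the slide. As a stand-in, here is a rough Python sketch of the same idea: split a multi-record FASTA into one record per contig so each can be circularized separately. The function names, the file-naming rule, and the toy contig names are all assumptions for illustration.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (name, sequence) pairs.

    The name is the first word of each header line."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name = (line[1:].split() or ["unnamed"])[0]
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def split_contigs(text):
    """Return {filename: single-record FASTA text}, one entry per contig.

    '|' in contig names is replaced by '_' to keep filenames shell-friendly."""
    return {
        name.replace("|", "_") + ".fasta": f">{name}\n{seq}\n"
        for name, seq in read_fasta(text)
    }


# Toy assembly with two made-up contigs
toy = ">unitig_0|quiver\nACGT\nACGT\n>unitig_1|quiver\nGGCC\n"
for filename, record in split_contigs(toy).items():
    print(filename, len(record.splitlines()[1]))
```

The same `read_fasta` helper also gives the summed contig length via `sum(len(seq) for _, seq in read_fasta(text))`, which is useful when checking the anticipated genome size.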

uploading reference sequences

unlike SMRT cell data, reference sequences cannot be symlinked!

•  make copies of reference genomes in ~/smrtanalysis/userdata/references_dropbox
•  larger genomes can take several minutes to finish uploading
•  sequence(s) should be in a single .fasta file (including plasmids)

base modification and motif detection

•  makes use of real-time kinetic data to evaluate potential base modifications based on
   –  long active site residence time
   –  interpulse duration
•  pipeline also runs RS_Resequencing (i.e. quiver) by default

https://github.com/PacificBiosciences/SMRT-Analysis/wiki/SMRT-Pipe-Reference-Guide-v2.2.0

quiver isn’t perfect
using Pilon to polish remaining indels

•  makes use of short read mapping to identify potential indels, SNPs, ambiguous bases, local misassemblies

$ java -Xmx16G -jar path/to/pilon-1.8.jar \
    --genome path/to/fasta --unpaired path/to/mapping.bam \
    --output sample_name --changes --variant --tracks \
    --mindepth 100

Pilon removed 128 remaining indels in 3.8 Mbp genome despite Quiver calling > QV 55 consensus
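To put these numbers in perspective, a quality value can be converted to an expected error count and back. The sketch below assumes the consensus QV is Phred-scaled (QV = -10 log10 of the per-base error probability) and counts each remaining indel as a single erroneous base; both are simplifying assumptions, and the genome size is the rounded 3.8 Mbp from the slide.

```python
import math


def error_rate_from_qv(qv):
    """Per-base error probability for a Phred-scaled quality value."""
    return 10 ** (-qv / 10)


def qv_from_error_rate(rate):
    """Phred-scaled quality value for a per-base error probability."""
    return -10 * math.log10(rate)


genome_size = 3_800_000  # 3.8 Mbp, rounded as on the slide
indels = 128             # indels removed by Pilon

# Errors expected in the whole genome if the consensus were truly QV 55
print(round(error_rate_from_qv(55) * genome_size, 1))      # 12.0

# Effective QV implied by 128 residual errors in 3.8 Mbp
print(round(qv_from_error_rate(indels / genome_size), 1))  # 44.7
```

Under these assumptions the residual indels correspond to roughly QV 45, about ten times more errors than a QV 55 estimate would suggest.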

where is my data?

•  all secondary analysis data resides in ~/smrtanalysis/userdata/jobs
•  each job is assigned a six-digit ID corresponding to its directory name
•  graphs and HTML files are in /results
•  filtered subreads, assembly .fasta files and anything else are in /data
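A job's output directories can be located from its ID with a short script. This sketch searches the jobs folder for a directory named after the zero-padded six-digit ID, so it makes no assumption about any intermediate directory levels; the function names and the job ID 16437 are hypothetical, and the demonstration runs in a temporary directory.

```python
import os
import tempfile


def find_job_dir(jobs_root, job_id):
    """Search jobs_root for a directory named after the six-digit job ID."""
    wanted = f"{int(job_id):06d}"
    for parent, dirnames, _ in os.walk(jobs_root):
        if wanted in dirnames:
            return os.path.join(parent, wanted)
    return None


def job_outputs(job_dir):
    """Paths of the results and data subdirectories described on this slide."""
    return {
        "results": os.path.join(job_dir, "results"),
        "data": os.path.join(job_dir, "data"),
    }


# Demonstration with a made-up job inside a temporary directory
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "016437", "results"))
    os.makedirs(os.path.join(root, "016437", "data"))
    job_dir = find_job_dir(root, 16437)
    print(os.path.basename(job_dir))    # 016437
    print(sorted(os.listdir(job_dir)))  # ['data', 'results']
```

On a real installation, `jobs_root` would be `os.path.expanduser("~/smrtanalysis/userdata/jobs")`.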

typical run times on MSI

Pipeline                    Genome size (Mbp)   # contigs   # SMRT cells   coverage   wall time
RS_Subreads                 3.7                 n/a         8              580x       34 m
RS_HGAP.3                   3.7                 2           8              100x       2 h 45 m
RS_HGAP.3                   7.2                 1           11             100x       13 h 45 m
RS_Modification_and_Motif   3.7                 2           8              580x       10 h 12 m
RS_Resequencing             3.7                 2           8              140x       2 h 7 m

NOTE: all pipelines begin with adapter removal and subread filtering

visualizing results in SMRT View

•  start an isub session with extra memory:
   $ isub -m 32gb
•  click on the SMRT View button within the job
•  launches a java web application
•  must be run from an active SMRT Portal session

troubleshooting

•  stop SMRT Portal
   $ /panfs/roc/pacbio/stop_user_portal.sh
•  LAST RESORT
   $ pbsave.sh
   $ /panfs/roc/pacbio/delete_user_portal.sh

THIS WILL REMOVE ALL ANALYSIS DATA UNLESS SAVED!
contact [email protected]

resources and additional information

•  visit http://github.com/PacificBiosciences
•  email [email protected] [email protected] [email protected]

check Twitter! @PacBio @SageSci @UMNmsi @lexnederbragt @aphillippy @sergekoren @mike_schatz @pathogenomenick @BaCh_mira @LizzyWilbanks @TheGeneMyers @infoecho @BioInfoBrett @OmicsOmicsBlog @ctitusbrown

data for 5 organisms just released freely available
