Quality Assurance of High Throughput Sequencing in the Diagnostic Laboratory
Gary Van Domselaar, PhD
Chief, Bioinformatics
WAVLD17, June 17 2015
Canada’s National Microbiology Laboratory
Next-Generation Sequencing
Platform           Throughput                  Cost
Illumina MiSeq     15 gigabases / run (1 d)    $125K
Ion Torrent PGM    1 gigabase / run (2 h)      $80K
Oxford Nanopore    1 gigabase / run            $1K
Ion Torrent / Ion Proton: 6+ million / 80 million reads per run
Illumina MiSeq: 25 million reads per run
Big Data Analytics and High Performance Computing
Genomic Epidemiology
Canadian Listeriosis Outbreak, 2008
Eppinger M et al. mBio 2014; doi:10.1128/mBio.01721-14
Katz L et al. mBio 2013 Jul-Aug; 4(4): e00398-13
PulseNet Canada Genomic Epidemiology Roadmap
Aleisha Reimer, with contributions from Drs. Celine Nadon, Morag Graham, and PulseNet Canada members
October 16, 2013
• Based on the existing PulseNet model
• De-centralized sequencing and analysis
• Parallel, centralized storage and analysis of national data sets
• Continued NML support in reference testing, training, certification, and proficiency
• Continued method development, refinement, and knowledge translation (KT)
Diagnostic Metagenomics
Pathogen Profiling Pipeline
Neptune: Target Pathogen Signature Detection
Inclusion Group Exclusion Group
Target Sequences
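The inclusion/exclusion idea can be illustrated with a toy sketch. This is not Neptune's actual algorithm: it assumes exact k-mer matching on one strand only, and the function names and example sequences are hypothetical. It simply keeps k-mers shared by every inclusion sequence and absent from every exclusion sequence.

```python
from typing import Iterable, Set


def kmers(sequence: str, k: int) -> Set[str]:
    """Return the set of all k-length substrings of a sequence."""
    seq = sequence.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def candidate_signatures(inclusion: Iterable[str],
                         exclusion: Iterable[str],
                         k: int) -> Set[str]:
    """k-mers present in every inclusion sequence and in no exclusion sequence.

    Simplification (assumption): exact matches only, no reverse complements.
    """
    inclusion_sets = [kmers(s, k) for s in inclusion]
    if not inclusion_sets:
        return set()
    shared = set.intersection(*inclusion_sets)
    background: Set[str] = set()
    for s in exclusion:
        background |= kmers(s, k)
    return shared - background


# Hypothetical toy sequences, for illustration only.
targets = candidate_signatures(
    inclusion=["ACGTACGTTT", "GGACGTACGA"],
    exclusion=["ACGTAAAA"],
    k=5,
)
print(sorted(targets))  # ['CGTAC', 'GTACG']
```

A real signature-discovery tool would also need to tolerate sequence variation within the inclusion group and low-level similarity to the exclusion group; the set arithmetic above only conveys the basic contrast between the two groups.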
HIV Genotyping
NGS HIV Genotyping with HyDRA
Quality Assurance
QA/QC Today: A Work in Progress
QC Metrics: From Reads to Annotated Genome
• Sequence Reads
  – Basic stats: read number, read length, etc.
  – Sequence quality
  – GC content
  – Duplicated sequences, over-represented sequences
  – K-mer content
• Error correction
• Contamination
• Sequence Mapping
  – Average read coverage and its distribution
  – Composition of the data set according to read length
  – Fragment-length average, distribution, and outliers
  – Base quality values
  – Read duplication rate
  – Mapping quality values and fraction of properly mapped read pairs
  – % of mapped reads
  – % of concordant reads
  – % of discordant reads
• Sequence Assemblies
  – # contigs
  – Total length of contigs
  – Largest contig
  – N50 (reference-free); NG50 (with reference)
  – # misassemblies (with reference)
  – Unaligned contigs / length
• Sequence Annotation
  – Genome Quality Score (Land et al., Standards in Genomic Sciences 2014)
  – Number of contigs and number of non-standard bases
  – Presence of full-length 5S, 16S, and 23S rRNA
  – Presence of at least one tRNA coding for each of the 20 standard amino acids
  – Presence of a set of essential genes containing 102 conserved Pfam-A domains
• cgMLST / wgMLST
• Metabolic Pathways
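As an illustration of the assembly metrics listed above, here is a minimal sketch of N50 and NG50 using the common definition: the length of the contig at which the cumulative length, taken longest first, reaches half of the total assembly length (N50) or half of an expected genome size (NG50). The function names and example contig lengths are hypothetical.

```python
from typing import Sequence


def nx50(contig_lengths: Sequence[int], target_size: int) -> int:
    """Length of the contig at which the running total (longest first)
    reaches at least half of target_size; 0 if that point is never reached."""
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= target_size:
            return length
    return 0


def n50(contig_lengths: Sequence[int]) -> int:
    """N50: the target is the total assembly length (reference-free)."""
    return nx50(contig_lengths, sum(contig_lengths))


def ng50(contig_lengths: Sequence[int], genome_size: int) -> int:
    """NG50: the target is an expected or reference genome size."""
    return nx50(contig_lengths, genome_size)


# Hypothetical contig lengths, for illustration only.
lengths = [80, 70, 50, 40, 30, 20]
print(len(lengths), sum(lengths), max(lengths))  # 6 290 80
print(n50(lengths))                              # 70
print(ng50(lengths, 400))                        # 50
```

NG50 makes assemblies of the same organism comparable because the denominator is fixed by the genome size rather than by each assembly's own total length.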
Quality Assessment for Genomic Epidemiology
The Daubert Standard
• Judge is gatekeeper: the task of assuring that scientific expert testimony truly proceeds from "scientific knowledge" rests on the trial judge.
• Relevance and reliability: the trial judge must ensure that the expert's testimony is "relevant to the task at hand" and that it rests "on a reliable foundation".
• Scientific knowledge = scientific method/methodology: a conclusion will qualify as scientific knowledge if the proponent can demonstrate that it is the product of sound "scientific methodology" derived from the scientific method.
• Factors relevant: the Court defined "scientific methodology" as the process of formulating hypotheses and then conducting experiments to prove or falsify the hypothesis, and provided a nondispositive, nonexclusive, "flexible" set of "general observations" (i.e. not a "test") that it considered relevant for establishing the "validity" of scientific testimony:
  – Empirical testing: whether the theory or technique is falsifiable, refutable, and/or testable.
  – Whether it has been subjected to peer review and publication.
  – The known or potential error rate.
  – The existence and maintenance of standards and controls concerning its operation.
  – The degree to which the theory and technique is generally accepted by a relevant scientific community.
Best Practices for Regulatory Bioinformatics
• Fitness-for-purpose (statement of quality requirements, validation)
  – Gap: Standards for high-quality reference genomes/materials
  – Gap: Minimum requirements for quality metrics
  – Gap: Benchmarking and validation of pipelines
  – Gap: Proficiency verification
• Traceability and auditability
  – Gap: Organization of information in a manner facilitating retrospective analysis
  – Gap: Method-specific details, such as sequencing chemistry, platform, software versions, etc.
  – Gap: Measures to prevent or mitigate procedural errors
  – Gap: Standardization for bioinformatics analyses*
• Documentation
  – Gap: Storage of raw reads vs assembled genomes
  – Gap: Metadata
Bioinformatic Genomic Analytical Validation, and Best Practices for Microbial Forensics
IRIDA
User and File Management
Gap: Measures to prevent or mitigate procedural errors
Gap: Secure storage and sharing of information
Users, Projects, Samples, and Files
Federated Identity with OAuth2
• Designed to be a “simple” authorization protocol.
• Developed by a consortium of developers and industry leaders.
• Implemented by Facebook, Google, Twitter, Hotmail, Amazon, Dropbox...
• More of a description of a protocol than a protocol itself.
Ontologies and Data Standards
Analytical Subunit
Subunit Quality Control Module
Quality Report
Quality Control
Quality Assessment
• JSON-formatted report:
  [ { "metric": "n50", "score": "125000" }, ... ]
Quality Verification Logic
• User-modifiable runtime parameters.
• Modify JSON-formatted report to include assessment.
• Developer-customized module.
• Pass with or without warnings (continue workflow), or fail (halt workflow).
• Low-quality/invalid data filtering.
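The quality verification step described above could look roughly like the following sketch. It assumes the report is a JSON array of metric/score objects, as in the n50 example; the threshold values, the "coverage" metric, and the "assessment" and "overall" field names are hypothetical and are not taken from IRIDA.

```python
import json
from typing import Dict, Optional, Tuple

# Hypothetical, user-modifiable thresholds: metric -> (warn_below, fail_below).
DEFAULT_THRESHOLDS: Dict[str, Tuple[float, float]] = {
    "n50": (100000, 20000),
    "coverage": (30, 15),
}


def assess(report_json: str,
           thresholds: Optional[Dict[str, Tuple[float, float]]] = None) -> dict:
    """Annotate each metric with an assessment and derive an overall verdict."""
    limits = DEFAULT_THRESHOLDS if thresholds is None else thresholds
    entries = json.loads(report_json)
    overall = "pass"
    for entry in entries:
        limit = limits.get(entry["metric"])
        if limit is None:
            entry["assessment"] = "not assessed"
            continue
        warn_below, fail_below = limit
        score = float(entry["score"])
        if score < fail_below:
            entry["assessment"] = "fail"
            overall = "fail"
        elif score < warn_below:
            entry["assessment"] = "warn"
            if overall == "pass":
                overall = "warn"
        else:
            entry["assessment"] = "pass"
    return {"overall": overall, "metrics": entries}


report = ('[{"metric": "n50", "score": "125000"},'
          ' {"metric": "coverage", "score": "22"}]')
result = assess(report)
continue_workflow = result["overall"] != "fail"  # a fail would halt the workflow
print(result["overall"], continue_workflow)      # warn True
```

Keeping the thresholds as a runtime parameter rather than hard-coding them is what lets the same module serve different organisms or assays with different quality requirements.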
[Figure: Analytical Subunit. Labels: Isolate Sequencing, Reads, Variant Calling (shown for each isolate); Variant Consolidation; HGT & Recombination Filtering; Repeat Region Filtering; Meta-alignment Generation; SNP Matrix; Whole Genome Phylogeny]
The IRIDA SNVPhyl Pipeline
[Figure: pipeline overview. Labels: User selects isolates; User selects reference (Reference Genome); Phylogeny Viewer]
The SNVPhyl Pipeline
Sequence Reads
• FastQC is run on all read data uploaded to IRIDA
• Provides statistics about quality of reads
Parameters UI
Files selected for analysis
Optional set of parameters can be defined
Or, parameters can be re-loaded from previously saved set (not shown)
Gap: Standardization for bioinformatics analyses; Measures to prevent or mitigate procedural errors
Reports
Pipeline Results
• Pipelines produce report or log files on the quality of the data
  – E.g. SPAdes logs, Prokka logs, SNV table, assembly statistics
• These are stored along with the results of each pipeline
Prokka summary
  contigs: 243
  bases: 2131392
  rRNA: 2
  gene: 2033
  CDS: 1978
  tRNA: 52
  tmRNA: 1
SNV Table
  #Chromosome  Position  Status  Reference  08-5923
  08-5578v3    47737     valid   T          C
  08-5578v3    113283    valid   C          T
  08-5578v3    172841    valid   A          G
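To show how such an SNV table might be consumed downstream, the sketch below parses the three rows shown on the slide, keeps positions whose status is "valid", and concatenates the bases per genome into a small SNV alignment. It assumes a whitespace-delimited layout with the columns #Chromosome, Position, Status, Reference, followed by one column per isolate; the function names are hypothetical and this is not the SNVPhyl code itself.

```python
from typing import Dict, List, Tuple

SNV_TABLE = """\
#Chromosome Position Status Reference 08-5923
08-5578v3 47737 valid T C
08-5578v3 113283 valid C T
08-5578v3 172841 valid A G
"""


def parse_snv_table(text: str) -> Tuple[List[str], List[dict]]:
    """Return genome column names and one record per table row."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].lstrip("#").split()
    names = header[3:]  # "Reference" plus one column per isolate
    rows = []
    for line in lines[1:]:
        fields = line.split()
        rows.append({
            "chromosome": fields[0],
            "position": int(fields[1]),
            "status": fields[2],
            "bases": dict(zip(names, fields[3:])),
        })
    return names, rows


def snv_alignment(names: List[str], rows: List[dict]) -> Dict[str, str]:
    """Concatenate bases at valid positions into one string per genome."""
    valid = [r for r in rows if r["status"] == "valid"]
    return {name: "".join(r["bases"][name] for r in valid) for name in names}


def snv_distance(a: str, b: str) -> int:
    """Number of aligned positions at which two genomes differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


names, rows = parse_snv_table(SNV_TABLE)
alignment = snv_alignment(names, rows)
print(alignment)  # {'Reference': 'TCA', '08-5923': 'CTG'}
print(snv_distance(alignment["Reference"], alignment["08-5923"]))  # 3
```

Pairwise distances computed this way form the SNP matrix, and the concatenated strings are the kind of input a phylogeny builder would take.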
Data Provenance
Gap: Method-specific details, such as sequencing chemistry, platform, software versions, etc.
Provenance – Report
Auditing
• IRIDA tracks resources in its database as resources are created, modified, and removed from the system.
• Resource auditing in IRIDA is implemented as resource-level database auditing, tracking what, who, and when.
• [What]: When a resource in IRIDA's database is modified, the user account credentials are captured. IRIDA securely exposes internal resources for use in external tools over a REST API.
• [Who]: If a human user modifies a resource via a non-human client operation or process, the client credentials are also captured within the database’s auditing information, allowing for traceability of any data manipulations.
• [When]: In addition to user credentials, IRIDA tracks the modification of resources over time.
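The what/who/when model above can be pictured as an append-only log of audit entries. The sketch below is only an illustration of that idea under assumed names (AuditEntry, AuditLog, and the example resource and user identifiers are all hypothetical); it does not reflect how IRIDA implements database auditing.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class AuditEntry:
    resource: str            # what was touched
    action: str              # created, modified, or removed
    user: str                # who: the human user account
    client: Optional[str]    # who: non-human client acting for the user, if any
    timestamp: datetime      # when


@dataclass
class AuditLog:
    entries: List[AuditEntry] = field(default_factory=list)

    def record(self, resource: str, action: str, user: str,
               client: Optional[str] = None,
               timestamp: Optional[datetime] = None) -> AuditEntry:
        """Append an entry; entries are never edited or deleted."""
        when = timestamp or datetime.now(timezone.utc)
        entry = AuditEntry(resource, action, user, client, when)
        self.entries.append(entry)
        return entry

    def history(self, resource: str) -> List[AuditEntry]:
        """All entries for one resource, oldest first."""
        matching = [e for e in self.entries if e.resource == resource]
        return sorted(matching, key=lambda e: e.timestamp)


log = AuditLog()
log.record("sample/42", "created", user="alice")
log.record("sample/42", "modified", user="alice", client="uploader-tool")
for e in log.history("sample/42"):
    print(e.timestamp.isoformat(), e.user, e.client, e.action)
```

Recording the client separately from the user is what allows a change made through an external tool to be traced back to both the person and the software that performed it.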
Transparency
• All provenance information captured by IRIDA is viewable by end-users with permission to view the data stored in IRIDA.
• If a user has permission to view some analysis executed by IRIDA, then the same user can also view the individual tool execution details.
• All provenance information captured by IRIDA is available for export.
HIV Genotyping
IRIDA Data Import / Export
IRIDA Data Source for Galaxy
Acknowledgements: PPP
• Thomas Matthews
• Eric Marinier
• Hellen Butungi
• Philip Mabon
• Franklin Bristow
• Heather Kent
• Shane Thiessen
• Morag Graham
• Shaun Tyler
• Geoff Peters
• Kim Melnychuk
• Christine Bonner
Acknowledgements: HyDRA
• Lead Bioinformatician:
  – Eric Enns
• National HIV and Retrovirology Laboratories
  – Dr. James Brooks, Hezhao Ji
• Bioinformatics Core Facility
  – Dr. Ben Liang
• Co-op Students
  – David Peddle, Rory Finnegan, Jonathan Boisvert
Acknowledgements: IRIDA
• Project Leaders
  – Fiona Brinkman, SFU
  – Will Hsiao, PHMRL
  – Gary Van Domselaar, NML
  – Rob Beiko, Dalhousie University
• University of Lisbon
  – João Carriço
• National Microbiology Laboratory (NML)
  – Franklin Bristow, Aaron Petkau, Thomas Matthews, Josh Adam, Adam Olsen, Tara Lynch, Shaun Tyler, Philip Mabon, Philip Au, Celine Nadon, Matthew Stuart-Edwards, Morag Graham, Chrystal Berry, Lorelee Tschetter
• Laboratory for Foodborne Zoonoses (LFZ)
  – Eduardo Taboada, Peter Kruczkiewicz, Chad Laing, Vic Gannon, Matthew Whiteside, Ross Duncan, Steven Mutschall
• Simon Fraser University (SFU)
  – Melanie Courtot, Emma Griffiths, Geoff Winsor, Julie Shay, Matthew Laird, Bhav Dhillon, Raymond Lo
• BC Public Health Microbiology & Reference Laboratory (PHMRL) and BC Centre for Disease Control (BCCDC)
  – Judy Isaac-Renton, Patrick Tang, Natalie Prystajecky, Jennifer Gardy, Damion Dooley, Linda Hoang, Kim MacDonald, Yin Chang, Eleni Galanis, Marsha Taylor, Cletus D’Souza, Ana Paccagnella
• University of Maryland
  – Lynn Schriml
• Canadian Food Inspection Agency (CFIA)
  – Burton Blais, Catherine Carrillo, Dominic Lambert
• Dalhousie University
  – Rob Beiko, Alex Keddy
• McMaster University
  – Andrew McArthur, Daim Sardar
• European Nucleotide Archive
  – Guy Cochrane, Petra ten Hoopen, Clara Amid
• European Food Safety Authority
  – Leibana Criado Ernesto, Vernazza Francesco, Rizzi Valentina