Introduction to NGS Variant Calling Analysis (UEB-UAT Bioinformatics Course - Session 2.3 - VHIR,...
description
Transcript of Introduction to NGS Variant Calling Analysis (UEB-UAT Bioinformatics Course - Session 2.3 - VHIR,...
Hospital Universitari Vall d’HebronInstitut de Recerca - VHIR
Institut d’Investigació Sanitària de l’Instituto de Salud Carlos III (ISCIII)
Bioinformàtica per la Recerca Biomèdica
http://ueb.vhir.org/2014BRB
Ferran Briansó[email protected]
15/05/2014
INTRODUCTION TO NGSVARIANT CALLING ANALYSIS
1. NGS WORKFLOW OVERVIEW
2. WET LAB STEPS
3. IMPORTANT SEQUENCING CONCEPTS
4. NGS ANALYSIS WORKFLOW
1. Primary analysis: de-multiplexing, QC
2. Secondary analysis: read mapping and variant calling
3. Tertiary analysis: annotation, filtering...
5. VISUALIZATION
6. COMMON PIPELINES AND FORMATS
7. CONCLUSIONS
5
1
2
3
5
6
PRESENTATION OUTLINE
4
7
NGS WORKFLOW OVERVIEW1
3Extracted from Dr Kassahn's publicly shared slides (2013)
LIBRARY PREPARATION2
4
Select targetHybridization-based cature or PCR
Add adaptersContain binding sequencesBarcodesPrimer sequences
Amplify material
2
5
Select targetHybridization-based cature or PCR
Add adaptersContain binding sequencesBarcodesPrimer sequences
Amplify material
A) Fragment DNA
B) End-repair
C) A-tailing, adapter ligation and PCR
D) Final library contains• sample insert• indices (barcodes)• flowcell binding sequences• primer binding sequences
LIBRARY PREPARATION2
6
Select targetHybridization-based cature or PCR
Add adaptersContain binding sequencesBarcodesPrimer sequences
Amplify material
LIBRARY PREPARATION2
TEMPLATE PREPARATION
7
Attachment of librarye.g. To Illumina Flowcell
Amplification of library moleculese.g. Brigde amplification
2
BRIDGE AMPLIFICATION
8
2
SEQUENCING
9
Sequencing-by-Synthesis
Detection by:• Illumina – fluorescence• Ion Torrent – pH• ROCHE 454 – PO4 and light
2
SEQUENCING-BY-SYNTHESIS (ILLUMINA)
10
2
IMPORTANT SEQUENCING CONCEPTS1
11
Barcoding/Indexing: allows multiplexing of different samples
Single-end vs paired-end sequencing
Coverage: avg. number reads per target
Quality scores (Qscore): log-scales!
3
NGS DATA ANALYSIS WORKFLOW4
12
DE-MULTIPLEXING (BARCODE SPLITTING)
13
4
SEQUENCE QUALITY: fastQC
15
http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
Details of the output https://docs.google.com/document/pub?id=16GwPmwYW7o_r-ZUgCu8-oSBBY1gC97TfTTinGDk98Ws
4
NGS DATA ANALYSIS WORKFLOW4
16
READ MAPPING (BASIC ALIGNMENT)4
17
Comparison against reference genome(! not assembly !)
Many aligners(short reads, longer reads, RNAseq...)Examples: BWA, Bowtie
SAM/BAM files
BURROWS-WHEELER ALIGNMENT TOOL (BWA)
18
Popular tool for genomic sequence data (not RNASeq!)
Li and Durbin 2009 Bioinformatics
Challenge: compare billion of short sequence reads (.fastq file) against human genome (3Gb)
Burrows-Wheeler Transform to “index” the human genome and allow memory-efficient and fast string matching between sequence read and reference genome
4
Li & Durbin 2009 Bionformatics
SAM/BAM FILES
19
4
see http://samtools.sourceforge.net/SAMv1.pdf
SAM/BAM FILES
20
@ Header (information regarding reference genome, alignment method...)
1) Read ID (QNAME)2) Bitwise FLAG (first/second read in pair, both reads mapped...)3) ReferenceSequence Name (RNAME)4) Position (POS, coordinate)5) MapQuality (MAPQ = -10log10P[wrong mapping position])6) CIGAR (describes alignment – matches, skipped regions, insertions..)7) ReferenceSequence (RNEXT, Ref seq of the pair)8) Position of the pair (PNEXT)9) TemplateLength (TLEN)10) ReadSequence11) QUAL (in Fastq format, '*' if NA)...
4
VARIANT CALLING
21
Identify sequence variantsDistinguish signal vs noiseVCF filesExamples: SAMtools, SNVmix
4
SEQUENCE VARIANTS
22
Differences to the reference
4
SEQUENCE VARIANTS
23
Sanger: is it real??
NGS: read count
Provides confidence (statistics!)
Sensitivity tune-able parameter (dependent on coverage)
4
VARIANT CALLING: GATK
24
Genome Analysis Toolkit (BROAD Institute)
• Initially developed for 1000 Genomes Project
• Single or multiple sample analysis (cohort)
• Popular tool for germline variant calling
• Evaluates probability of genotype given read data
4
see http://www.broadinstitute.org/gatk/and McKenna et al. Genome Research 2010
SOMATIC VARIANT CALLING
25
Somatic mutations can occur at low freq. (<10%) due to:
• Tumor heterogeneity (multiple clones)
• Low tumor purity (% normal cells in tumor sample)
Requires different thresholds than germline variant calling when evaluating signal vs noise
Trade-off between sensitivity (ability to detect mutation) and specificity (rate of false positives)
Nature Reviews Cancer 12, 323-334 (May 2012)
4
INDELS DETECTION1
26
Small insertions/ deletions
The trouble with mapping approaches
4
modified from Heng Li (Broad Institute)
INDELS DETECTION
27
Small insertions/ deletions
The trouble with mapping approaches
4
INDELS DETECTION
28
Small insertions/ deletions
The trouble with mapping approaches
4
RE-ALIGNMENT
29
Re-align considering multi-read context, SNPs & INDELS previous info...
4
adapted from Andreas Schreiber
EVALUATING VARIANT QUALITY
30
TAKING INTO ACCOUNT:
• Coverage at position
• Number independent reads supporting variant
• Observed allele fraction vs expected (somatic / germline)
• Strand bias
• Base qualities at variant position
• Mapping qualities of reads supporting variant
• Variant position within reads (near ends or at centre)
4
VCF FILES
31
Variant Call Format
Standard for reporting variants from NGS
Describes metadata of analysis and variant calls
Text file format (open in Text Editor or Excel)
!!! Not a MS Office vCard !!!
see http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41
4
VCF FILES
32
4
NGS DATA ANALYSIS WORKFLOW
33
4
VARIANT ANNOTATION
34
Provide biological & clinical context
Identify disease-causing mutations(among 1000s of variants)
4
ANNOTATION OVERVIEW
35
4
VARIANT FILTERING AND PRIORIZATION
36
PURPOSE: Identify pathogenic or
disease-associated mutation(s) Reduce candidate variants
to reportable setCOMMON STEPS:
• Remove poor quality variant calls
• Remove common polymorphisms
• Prioritize variants with high functional impact
• Compare against known disease genes
• Consider mode of inheritance (autosomal recessive, X-linked...)
• Consider segregation in family (where multiple samples available)
4
NGS DATA ANALYSIS WORKFLOW
37
5
VISUALIZATION – IGV (or Genome Browser, Circos...)
38
5
provided by Katherine Pillman
COMMON PIPELINE6
39
bcl2fastq (Illumina)FastQC (open-source)
Exomes (HiSeq): BWA(open-source), GATK (Broad)
Gene panels (MiSeq, PGM): MiSeq Reporter (Illumina) Torrent Suite (Ion Torrent)
Custom scripts and third party tools (Annovar, snpEff, PolyPhen, SIFT...)
Commercial annotation software(GeneticistAssistant, VariantStudio...)
COMMON DATA FORMATS6
40
.bcl
.fastq
.BAM
.VCF
.csv
.txt
.xls
.html ...
CONCLUSIONS7
41
NGS data - the new currency of (molecular) biology
Broad applications (ecology, evolution, ag sciences, medical research and clinical diagnostics...).
Rapidly evolving (sequencing technologies, library preparation methods, analysis approaches, software).
Different tools/pipelines/parametrization gives different results, (more standards needed).
Bioinformatics pipelines typically combine vendor software, third-party tools and custom scripts.
Requires skills in scripting, Linux/Unix, HPC.
Requires advanced hardware (not always available).
Understanding of data (SE, PE, RNA-Seq) important for successful analysis.