Introduction to NGS Variant Calling Analysis (UEB-UAT Bioinformatics Course - Session 2.3 - VHIR,...

Hospital Universitari Vall d’HebronInstitut de Recerca - VHIR

Institut d’Investigació Sanitària de l’Instituto de Salud Carlos III (ISCIII)

Bioinformàtica per la Recerca Biomèdica

http://ueb.vhir.org/2014BRB

Ferran Briansóferran.brianso@vhir.org

15/05/2014

INTRODUCTION TO NGSVARIANT CALLING ANALYSIS

1. NGS WORKFLOW OVERVIEW

2. WET LAB STEPS

3. IMPORTANT SEQUENCING CONCEPTS

4. NGS ANALYSIS WORKFLOW

1. Primary analysis: de-multiplexing, QC

2. Secondary analysis: read mapping and variant calling

3. Tertiary analysis: annotation, filtering...

5. VISUALIZATION

6. COMMON PIPELINES AND FORMATS

7. CONCLUSIONS

PRESENTATION OUTLINE

NGS WORKFLOW OVERVIEW1

3Extracted from Dr Kassahn's publicly shared slides (2013)

LIBRARY PREPARATION2

Select targetHybridization-based cature or PCR

Add adaptersContain binding sequencesBarcodesPrimer sequences

Amplify material

A) Fragment DNA

B) End-repair

C) A-tailing, adapter ligation and PCR

D) Final library contains• sample insert• indices (barcodes)• flowcell binding sequences• primer binding sequences

Amplify material

TEMPLATE PREPARATION

Attachment of librarye.g. To Illumina Flowcell

Amplification of library moleculese.g. Brigde amplification

BRIDGE AMPLIFICATION

SEQUENCING

Sequencing-by-Synthesis

Detection by:• Illumina – fluorescence• Ion Torrent – pH• ROCHE 454 – PO4 and light

SEQUENCING-BY-SYNTHESIS (ILLUMINA)

IMPORTANT SEQUENCING CONCEPTS1

Barcoding/Indexing: allows multiplexing of different samples

Single-end vs paired-end sequencing

Coverage: avg. number reads per target

Quality scores (Qscore): log-scales!

NGS DATA ANALYSIS WORKFLOW4

DE-MULTIPLEXING (BARCODE SPLITTING)

FASTQ FORMAT

see en.wikipedia.org/wiki/FASTQ_format

SEQUENCE QUALITY: fastQC

http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

Details of the output https://docs.google.com/document/pub?id=16GwPmwYW7o_r-ZUgCu8-oSBBY1gC97TfTTinGDk98Ws

NGS DATA ANALYSIS WORKFLOW4

READ MAPPING (BASIC ALIGNMENT)4

Comparison against reference genome(! not assembly !)

Many aligners(short reads, longer reads, RNAseq...)Examples: BWA, Bowtie

SAM/BAM files

BURROWS-WHEELER ALIGNMENT TOOL (BWA)

Popular tool for genomic sequence data (not RNASeq!)

Li and Durbin 2009 Bioinformatics

Challenge: compare billion of short sequence reads (.fastq file) against human genome (3Gb)

Burrows-Wheeler Transform to “index” the human genome and allow memory-efficient and fast string matching between sequence read and reference genome

Li & Durbin 2009 Bionformatics

SAM/BAM FILES

see http://samtools.sourceforge.net/SAMv1.pdf

SAM/BAM FILES

@ Header (information regarding reference genome, alignment method...)

1) Read ID (QNAME)2) Bitwise FLAG (first/second read in pair, both reads mapped...)3) ReferenceSequence Name (RNAME)4) Position (POS, coordinate)5) MapQuality (MAPQ = -10log10P[wrong mapping position])6) CIGAR (describes alignment – matches, skipped regions, insertions..)7) ReferenceSequence (RNEXT, Ref seq of the pair)8) Position of the pair (PNEXT)9) TemplateLength (TLEN)10) ReadSequence11) QUAL (in Fastq format, '*' if NA)...

VARIANT CALLING

Identify sequence variantsDistinguish signal vs noiseVCF filesExamples: SAMtools, SNVmix

SEQUENCE VARIANTS

Differences to the reference

SEQUENCE VARIANTS

Sanger: is it real??

NGS: read count

Provides confidence (statistics!)

Sensitivity tune-able parameter (dependent on coverage)

VARIANT CALLING: GATK

Genome Analysis Toolkit (BROAD Institute)

• Initially developed for 1000 Genomes Project

• Single or multiple sample analysis (cohort)

• Popular tool for germline variant calling

• Evaluates probability of genotype given read data

see http://www.broadinstitute.org/gatk/and McKenna et al. Genome Research 2010

SOMATIC VARIANT CALLING

Somatic mutations can occur at low freq. (<10%) due to:

• Tumor heterogeneity (multiple clones)

• Low tumor purity (% normal cells in tumor sample)

Requires different thresholds than germline variant calling when evaluating signal vs noise

Trade-off between sensitivity (ability to detect mutation) and specificity (rate of false positives)

Nature Reviews Cancer 12, 323-334 (May 2012)

INDELS DETECTION1

Small insertions/ deletions

The trouble with mapping approaches

modified from Heng Li (Broad Institute)

INDELS DETECTION

RE-ALIGNMENT

Re-align considering multi-read context, SNPs & INDELS previous info...

adapted from Andreas Schreiber

EVALUATING VARIANT QUALITY

TAKING INTO ACCOUNT:

• Coverage at position

• Number independent reads supporting variant

• Observed allele fraction vs expected (somatic / germline)

• Strand bias

• Base qualities at variant position

• Mapping qualities of reads supporting variant

• Variant position within reads (near ends or at centre)

VCF FILES

Variant Call Format

Standard for reporting variants from NGS

Describes metadata of analysis and variant calls

Text file format (open in Text Editor or Excel)

!!! Not a MS Office vCard !!!

see http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41

VCF FILES

NGS DATA ANALYSIS WORKFLOW

VARIANT ANNOTATION

Provide biological & clinical context

Identify disease-causing mutations(among 1000s of variants)

ANNOTATION OVERVIEW

VARIANT FILTERING AND PRIORIZATION

PURPOSE: Identify pathogenic or

disease-associated mutation(s) Reduce candidate variants

to reportable setCOMMON STEPS:

• Remove poor quality variant calls

• Remove common polymorphisms

• Prioritize variants with high functional impact

• Compare against known disease genes

• Consider mode of inheritance (autosomal recessive, X-linked...)

• Consider segregation in family (where multiple samples available)

NGS DATA ANALYSIS WORKFLOW

VISUALIZATION – IGV (or Genome Browser, Circos...)

provided by Katherine Pillman

COMMON PIPELINE6

bcl2fastq (Illumina)FastQC (open-source)

Exomes (HiSeq): BWA(open-source), GATK (Broad)

Gene panels (MiSeq, PGM): MiSeq Reporter (Illumina) Torrent Suite (Ion Torrent)

Custom scripts and third party tools (Annovar, snpEff, PolyPhen, SIFT...)

Commercial annotation software(GeneticistAssistant, VariantStudio...)

COMMON DATA FORMATS6

.fastq

.html ...

CONCLUSIONS7

NGS data - the new currency of (molecular) biology

Broad applications (ecology, evolution, ag sciences, medical research and clinical diagnostics...).

Rapidly evolving (sequencing technologies, library preparation methods, analysis approaches, software).

Different tools/pipelines/parametrization gives different results, (more standards needed).

Bioinformatics pipelines typically combine vendor software, third-party tools and custom scripts.

Requires skills in scripting, Linux/Unix, HPC.

Requires advanced hardware (not always available).

Understanding of data (SE, PE, RNA-Seq) important for successful analysis.

Introduction to NGS Variant Calling Analysis (UEB-UAT Bioinformatics Course - Session 2.3 - VHIR,...

Science

Transcript of Introduction to NGS Variant Calling Analysis (UEB-UAT Bioinformatics Course - Session 2.3 - VHIR,...

Estatuto da UEB

Storing and Accessing Information. Databases and Queries (UEB-UAT Bioinformatics Course - Session 1.2 - VHIR, Barcelona)

Gjeodezi e Zbatuar-ueb

Basic Aspects of Microarray Technology and Data Analysis (UEB-UAT Bioinformatics Course - Session 3.2 - VHIR, Barcelona)

Introduction to Bioinformatics (UEB-UAT Bioinformatics Course - Session 1.1 - VHIR, Barcelona)

Vall d'Hebron Research Institute (VHIR)

Introduction to Functional Analysis with IPA (UEB-UAT Bioinformatics Course - Session 4.2 - VHIR, Barcelona)

UEB TECHNICAL - cnib.ca TECHNICAL 160703.pdf · UEB TECHNICAL 1 Introduction UEB Technical is intended for those who are competent in Unified English Braille (UEB) for literary material.

Curso de Genómica - UAT (VHIR) 2012 - Microarrays

NGS Introduction and Technology Overview (UEB-UAT Bioinformatics Course - Session 2.1.1 - VHIR, Barcelona)

UEB Bombas

Curso de Genómica - UAT (VHIR) 2012 - Análisis de datos de RT-qPCR

Course VHIR-UCTS-UEB - Session 3 - Statistical Analysis

info finante banci ueb

Curso de Genómica - UAT (VHIR) 2012 - Tecnologías de Ultrasecuenciación y de Enriquecimiento de Secuencia

Biomarker Study Protocol WP 09, VHIR-Barcelona

VHIR Annual Report 2011

Relatório Anual UEB 2007

Objetivos generales ueb

CVN - Antonio Rodríguez Sinovas - VHIR