Variant Calling - Université de Lille · Variant callers are not concordant Mean single-nucleotide...
Variant Calling
Ready for variant calling!!
Discover “variants” relative to a reference genome
From GATK Introduction to Variant Discovery Presentation, https://software.broadinstitute.org/gatk/download/workshops
International Solanaceae Genomics Project
Different types of variants
From GATK Introduction to Variant Discovery Presentation, https://software.broadinstitute.org/gatk/download/workshops
Variant callers are not concordant
Mean single-nucleotide variant (SNV) concordance over 15 exomes between five alignment and variant-calling pipelines
O'Rawe et al., Genome Medicine 2013
SAMtools mpileup
- Simplest way to visualize SNP/indel calling and alignment
- Piles up reads on each position
- Summarizes the base calls of aligned reads to a reference sequence
SAMtools mpileup format specification
chr2 272 T 24 ,.$.....,,.,.,...,,,.,..^+. <<<+;<<<<<<<<<<<=<;<;7<&
chr2 273 T 23 ,.....,,.,.,...,,,.,..A <<<;<<<<<<<<<3<=<<<;<<+
chr2 274 T 23 ,.$....,,.,.,...,,,.,... 7<7;<;<<<<<<<<<=<;<;<<6
chr2 275 A 23 ,$....,,.,.,...,,,.,...^l. <+;9*<<<<<<<<<=<<:;<<<<
chr2 276 G 22 ...T,,.,.,...,,,.,.... 33;+<<7=7<<7<&<<1;<<6<
chr2 277 T 22 ....,,.,.,.C.,,,.,..G. +7<;<<<<<<<&<=<<:;<<&<
chr2 278 G 23 ....,,.,.,...,,,.,....^k. %38*<<;<7<<7<=<<<;<<<<<
Columns, left to right : chromosome, 1-based coordinate, reference base, number of reads covering the site, read bases, base qualities
SAMtools mpileup format specification
chr3 156 A 11 .$......+2AG.+2AG.+2AGGG <975;:<<<<<
chr5 200 A 20 ,,,,,..,.-4CACC.-4CACC....,.,,.^~. ==<<<<<<<<<<<::<;2<<
Read base codes :
- `.' match to the reference base on the forward strand
- `,' match on the reverse strand
- `ACGTN' mismatch on the forward strand
- `acgtn' mismatch on the reverse strand
- `\+[0-9]+[ACGTNacgtn]+' insertion between this reference position and the next reference position
- `-[0-9]+[ACGTNacgtn]+' deletion from the reference (as in the chr5 line)
- `^' marks the start of a read segment; the character that follows it encodes the mapping quality
- `$' marks the end of a read segment
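The read-bases column can be decoded with a small state machine. The sketch below is a minimal Python illustration, not the SAMtools implementation: the function name `parse_read_bases` and the summary keys are made up for this example, it handles the codes listed on the slide plus the `-` deletion pattern seen in the chr5 line, and it lumps any other symbol with the mismatches.

```python
import re

def parse_read_bases(read_bases):
    """Summarise the read-bases column of one pileup line (illustrative sketch)."""
    summary = {"depth": 0, "matches": 0, "mismatches": {},
               "insertions": [], "deletions": [], "starts": 0, "ends": 0}
    i = 0
    while i < len(read_bases):
        c = read_bases[i]
        if c == "^":
            # Start of a read segment; the next character is the mapping
            # quality and must not be interpreted as a base or an indel sign.
            summary["starts"] += 1
            i += 2
        elif c == "$":
            # End of a read segment.
            summary["ends"] += 1
            i += 1
        elif c in "+-":
            # Indel: sign, decimal length, then exactly that many bases.
            digits = re.match(r"\d+", read_bases[i + 1:]).group()
            start = i + 1 + len(digits)
            seq = read_bases[start:start + int(digits)]
            key = "insertions" if c == "+" else "deletions"
            summary[key].append(seq.upper())
            i = start + int(digits)
        else:
            # One character per read covering the site.
            summary["depth"] += 1
            if c in ".,":
                summary["matches"] += 1
            else:
                base = c.upper()
                summary["mismatches"][base] = summary["mismatches"].get(base, 0) + 1
            i += 1
    return summary

example = parse_read_bases(".$......+2AG.+2AG.+2AGGG")
print(example["depth"], example["insertions"], example["mismatches"])
```

The depth counted this way should agree with the fourth column of the pileup line, which gives a quick sanity check on the parsing.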
SAMtools mpileup
BUT
- is not a real variant caller
- must be combined with bcftools to perform the variant calling
> samtools mpileup -ugf myrefgenome.fa myreadsaligned.bam | bcftools call -vmO v -o myvariantscalled.vcf
## samtools mpileup
# -u : generate uncompressed VCF/BCF output
# -g : generate genotype likelihoods in BCF format
# -f FILE : faidx indexed reference sequence file
## bcftools call
# -v : output variant sites only
# -m : alternative model for multiallelic and rare-variant calling
# -O : output type: 'v' uncompressed VCF [v]
# -o, --output <file> : write output to a file [standard output]
Note : recent samtools releases deprecate VCF/BCF output from samtools mpileup; bcftools mpileup takes over that step.
samtools mpileup :
- Collects summary information in the input BAMs, computes the likelihood of data given each possible genotype and stores the likelihoods in the BCF format.
bcftools call :
- Applies the prior and does the actual calling.
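The split between "compute genotype likelihoods" and "apply the prior and call" can be illustrated with a toy diploid, biallelic model. This is a simplified sketch, not the model used by samtools or bcftools: the function names, the error model (a sequencing error always flips ref to alt or alt to ref) and the prior values are assumptions made for the example.

```python
import math

# Diploid genotypes at a biallelic site, as pairs of alleles.
GENOTYPES = [("ref", "ref"), ("ref", "alt"), ("alt", "alt")]
LABELS = ["0/0", "0/1", "1/1"]

def phred_to_error(qual):
    """Convert a Phred base quality into an error probability."""
    return 10 ** (-qual / 10)

def genotype_log_likelihoods(observations):
    """Step 1: log P(data | genotype) for each genotype.

    observations is a list of (allele, base_quality) pairs, where allele
    is 'ref' or 'alt'.
    """
    log_likelihoods = []
    for a1, a2 in GENOTYPES:
        total = 0.0
        for allele, qual in observations:
            e = phred_to_error(qual)
            # A read is drawn from either chromosome with probability 1/2.
            p1 = (1 - e) if allele == a1 else e
            p2 = (1 - e) if allele == a2 else e
            total += math.log(0.5 * p1 + 0.5 * p2)
        log_likelihoods.append(total)
    return log_likelihoods

def call_genotype(log_likelihoods, priors=(0.998, 0.001, 0.001)):
    """Step 2: add the (hypothetical) log prior and pick the best genotype."""
    log_posteriors = [ll + math.log(p) for ll, p in zip(log_likelihoods, priors)]
    best = max(range(len(LABELS)), key=lambda i: log_posteriors[i])
    return LABELS[best]

reads = [("ref", 30)] * 10 + [("alt", 30)] * 10
print(call_genotype(genotype_log_likelihoods(reads)))
```

Keeping the two steps separate mirrors the pipe in the command: the likelihoods depend only on the reads, while the prior can be changed without recomputing them.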
VCF : Variant Call Format
- Standardised format for storing the most prevalent types of sequence variations
- Text file format in 2 parts : header and body
##fileformat=VCFv4.2
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002
20 14370 rs6054257 ACG A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0/0:48:1:51,51 1|0:48:8:51,51
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3
20 1110696 rs6040355 A G,GT 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51
20 1234567 microsat1 GTC G,GTCT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2
Annotations on the example :
- VCF header : mandatory header lines, then optional header lines (meta-data about the annotations in the VCF body)
- Body : one line per site
- Reference alleles (GT=0)
- Alternate alleles (GT>0 is an index to the ALT column)
- Phased data (G and C above are on the same chromosome)
- Event types shown : deletion, SNP, insertion, other event
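To make the body layout concrete, here is a minimal Python sketch that splits one record into its fixed columns, an INFO dictionary and one FORMAT dictionary per sample. The function name `parse_vcf_record` is hypothetical, and the sketch splits on any whitespace because the transcript shows spaces; real VCF files are tab-delimited and a production parser should split on tabs only.

```python
def parse_vcf_record(line, sample_names):
    """Parse one VCF body line into a nested dictionary (illustrative sketch)."""
    fields = line.split()
    chrom, pos, vid, ref, alt, qual, flt, info, fmt = fields[:9]

    # INFO is a ';'-separated list of KEY=VALUE pairs or bare flags.
    info_dict = {}
    if info != ".":
        for item in info.split(";"):
            key, sep, value = item.partition("=")
            info_dict[key] = value if sep else True  # flags such as DB have no value

    # FORMAT names the ':'-separated sub-fields of every sample column.
    keys = fmt.split(":")
    samples = {name: dict(zip(keys, column.split(":")))
               for name, column in zip(sample_names, fields[9:])}

    return {"CHROM": chrom, "POS": int(pos), "ID": vid, "REF": ref,
            "ALT": alt.split(","), "QUAL": qual, "FILTER": flt,
            "INFO": info_dict, "SAMPLES": samples}

record = parse_vcf_record(
    "20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3",
    ["NA00001", "NA00002"],
)
print(record["FILTER"], record["INFO"]["AF"], record["SAMPLES"]["NA00002"]["GT"])
```

INFO values describe the site as a whole, while each entry under SAMPLES carries the per-genotype values named by the FORMAT column.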
VCF : Variant Call Format
Types of variants (from http://vcftools.sourceforge.net/VCF-poster.pdf). Each entry gives the alignment (reference over sample), then the VCF representation.
- SNPs : ACGT over ATGT ; POS=2, REF=C, ALT=T
- Insertions : AC-GT over ACTGT ; POS=2, REF=C, ALT=CT
- Deletions : ACGT over A--T ; POS=1, REF=ACG, ALT=A
- Complex events : ACGT over A-TT ; POS=1, REF=ACG, ALT=AT
- Large structural variants : POS=100, REF=T, ALT=<DEL>, INFO=SVTYPE=DEL;END=300
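The REF/ALT conventions above are enough to recover the variant type from a record. The following Python sketch is one possible classification rule (the function name and the returned labels are assumptions for this example, and other tools may draw the line between indels and complex events differently).

```python
def classify_variant(ref, alt):
    """Guess the variant type from a single REF/ALT pair (illustrative sketch)."""
    if alt == ".":
        return "no alternate allele"
    if alt.startswith("<") and alt.endswith(">"):
        return "symbolic"          # e.g. <DEL> for large structural variants
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(alt) > len(ref) and alt.startswith(ref):
        return "insertion"         # ALT = REF plus inserted bases
    if len(ref) > len(alt) and ref.startswith(alt):
        return "deletion"          # REF = ALT plus deleted bases
    return "complex"

for ref, alt in [("C", "T"), ("C", "CT"), ("ACG", "A"), ("ACG", "AT"), ("T", "<DEL>")]:
    print(ref, alt, classify_variant(ref, alt))
```

For multi-allelic sites, the function would be applied to each comma-separated ALT allele in turn.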
VCF : header
- Lines that start with #
- Some mandatory lines : file format, column header
- Optional header lines contain meta-data about annotations in the VCF body
Meta-data may vary a lot from one variant caller to another!
INFO vs FORMAT :
- INFO = annotations on the variant as a whole
- FORMAT = annotations that apply to each genotype
VCF representation of genotypes
Zygosity : VCF representation
- Heterozygous : 0/1, 1/2, 0/2, ...
- Homozygous reference : 0/0
- Homozygous alternate : 1/1, 2/2, 3/3, ...
- Missing : ./0, ./1, ./., ...
Allele indices : 0 = Ref, 1 = Alt1, 2 = Alt2, 3 = Alt3, ...
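The zygosity rules above translate directly into a few lines of Python. This is a minimal sketch for diploid GT strings; the function name `zygosity` is an assumption, and both the unphased `/` and phased `|` separators are accepted.

```python
import re

def zygosity(gt):
    """Classify a GT string such as '0/1' or '1|1' (illustrative sketch)."""
    alleles = re.split(r"[/|]", gt)
    if "." in alleles:
        return "missing"                    # at least one allele is unknown
    if len(set(alleles)) > 1:
        return "heterozygous"               # two different allele indices
    if alleles[0] == "0":
        return "homozygous reference"       # index 0 is the REF allele
    return "homozygous alternate"           # same non-zero index twice

for gt in ["0/1", "1|2", "0/0", "2/2", "./1", "./."]:
    print(gt, zygosity(gt))
```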
VCF specification versions
VCF specifications evolve through versions!
Changes between VCFv4.2 and VCFv4.3 :
● VCF compliant implementations must support both LF and CR+LF newline conventions
● INFO and FORMAT tag names must match the regular expression ^[A-Za-z_][0-9A-Za-z_.]*$
● Spaces are allowed in INFO field values
● Characters with special meaning (such as ’;’ in INFO, ’:’ in FORMAT, and ’%’ in both) can be encoded using the percent encoding (see Section 1.2)
● The character encoding of VCF files is UTF-8.
● The SAMPLE field can contain optional DOI URL for the source data file
● Introduced ##META header lines for defining phenotype metadata
● New reserved tag "CNP" analogous to "GP" was added. Both CNP and GP use 0 to 1 encoding, which is a change from previous phred-scaled GP.
● In order for VCF and BCF to have the same expressive power, we state explicitly that Integers and Floats are 32-bit numbers. Integers are signed.
● We state explicitly that zero length strings are not allowed, this includes the CHROM and ID column, INFO IDs, FILTER IDs and FORMAT IDs.
● Meta-information lines can be in any order, with the exception of ##fileformat which must come first.
● All header lines of the form ##key=<ID=xxx,...> must have an ID value that is unique for a given value of "key". All header lines whose value starts with "<" must have an ID field. Therefore, also ##PEDIGREE newly requires a unique ID.
● We state explicitly that duplicate IDs, FILTER, INFO or FORMAT keys are not valid.
● A section about gVCF was added, introduced the <*> symbolic allele.
...
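As an illustration of the tag-name rule in the VCFv4.3 change list, the short Python sketch below checks candidate INFO/FORMAT tag names against the pattern, written with the underscores that the specification includes. The helper name `is_valid_tag` is hypothetical, and any further exceptions the specification may grant to particular reserved names are not covered.

```python
import re

# Pattern quoted in the VCFv4.3 change list: a letter or underscore first,
# then letters, digits, underscores or dots.
TAG_NAME = re.compile(r"[A-Za-z_][0-9A-Za-z_.]*")

def is_valid_tag(name):
    """Return True if the whole name matches the VCFv4.3 tag-name pattern."""
    return TAG_NAME.fullmatch(name) is not None

for name in ["DP", "MQRankSum", "AF.EUR", "_private", "9DP", "my tag", ""]:
    print(repr(name), is_valid_tag(name))
```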
Changes between VCFv4.1 and VCFv4.2:
● Information field format: adding source and version as recommended fields.
● INFO field can have one value for each possible allele (code R).
● For all of the ##INFO, ##FORMAT, ##FILTER, and ##ALT metainformation, extra fields can be included after the default fields.
● Alternate base (ALT) can include *: missing due to an upstream deletion.
● Quality scores, a sentence removed: High QUAL scores indicate high confidence calls. Although traditionally people use integer phred scores, this field is permitted to be a floating point to enable higher resolution for low confidence calls if desired.
● Examples changed a bit.
GATK Best practices
https://software.broadinstitute.org/gatk/best-practices
GATK Variant discovery
- HaplotypeCaller : once for each sample
- GenotypeGVCFs : once for the full cohort
https://software.broadinstitute.org/gatk/best-practices
GATK HaplotypeCaller
From GATK Best Practices for Variant Discovery Presentation, https://software.broadinstitute.org/gatk/download/workshops
GATK HaplotypeCaller : step 1
● Sliding window
● Count mismatches, indels and soft clips
● Measure of entropy
● Define active region according to a threshold
Result : active region
From GATK Best Practices for Variant Discovery Presentation, https://software.broadinstitute.org/gatk/download/workshops
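The idea of step 1 (score each position, smooth over a sliding window, keep what exceeds a threshold) can be sketched in Python as follows. This is not GATK's algorithm: the entropy measure is replaced by a plain windowed mean, and the function name, the per-position `activity` counts, the window size and the threshold are all assumptions chosen for illustration.

```python
def active_regions(activity, window=3, threshold=2.0):
    """Return (start, end) index pairs of regions whose smoothed activity
    reaches the threshold (illustrative sketch, 0-based inclusive ends).

    activity[i] is a count of reads showing a mismatch, indel or soft clip
    at position i.
    """
    n = len(activity)
    half = window // 2

    # Smooth each position with the mean of its window.
    is_active = []
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        smoothed = sum(activity[lo:hi]) / (hi - lo)
        is_active.append(smoothed >= threshold)

    # Merge consecutive active positions into regions.
    regions, start = [], None
    for i, flag in enumerate(is_active):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            regions.append((start, i - 1))
            start = None
    if start is not None:
        regions.append((start, n - 1))
    return regions

print(active_regions([0, 0, 0, 0, 6, 8, 6, 0, 0, 0, 0, 0]))
```

Smoothing extends each region slightly beyond the positions that carry the evidence, which leaves some flanking context for the reassembly in the next step.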
GATK HaplotypeCaller : step 2
● Local reassembly via graph
● Traverse graph to collect most likely haplotypes
● Align haplotypes using Smith-Waterman
Result : likely haplotypes and candidate variant sites
From GATK Best Practices for Variant Discovery Presentation, https://software.broadinstitute.org/gatk/download/workshops
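For readers unfamiliar with Smith-Waterman, the sketch below computes the best local alignment score between a haplotype and a reference segment. It is a textbook version with a linear gap penalty and made-up scoring values, given only to show the recurrence; GATK's own implementation and parameters differ.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between sequences a and b
    (textbook Smith-Waterman, linear gap penalty, illustrative sketch)."""
    cols = len(b) + 1
    prev = [0] * cols          # previous row of the scoring matrix
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * cols
        for j in range(1, cols):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap       # gap in b
            left = cur[j - 1] + gap  # gap in a
            # Local alignment: a cell never goes below zero.
            cur[j] = max(0, diag, up, left)
            best = max(best, cur[j])
        prev = cur
    return best

print(smith_waterman_score("ACGT", "ACGT"))
print(smith_waterman_score("ACGT", "ACT"))
```

Tracing back from the best-scoring cell (not shown here) yields the alignment itself, from which candidate variant sites can be read off as mismatches and gaps.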
Recovering indels and removing artifacts
From GATK Best Practices for Variant Discovery Presentation, https://software.broadinstitute.org/gatk/download/workshops
Resolving complexity
HaplotypeCaller will use one representation for a cleaner output
From GATK Best Practices for Variant Discovery Presentation, https://software.broadinstitute.org/gatk/download/workshops
GATK HaplotypeCaller : step 3
● PairHMM aligns each read to each haplotype
● Considers base qualities
Likelihood of the haplotype given reads
From GATK Best Practices for Variant Discovery Presentation, https://software.broadinstitute.org/gatk/download/workshops
GATK HaplotypeCaller results
GVCF file for each sample
From GATK Best Practices for Variant Discovery Presentation, https://software.broadinstitute.org/gatk/download/workshops
GVCF Format : Why ?
● VCF format only includes variant sites
● When performing joint calling : how to genotype a sample that has no variant call at a site ?
○ Can be homozygous reference (good coverage, alignment of good quality, etc.)
○ Can be unknown (poor coverage, alignment of bad quality, outside of WES kit, etc.)
● Solution : recording homozygous reference sites during the calling
https://gatkforums.broadinstitute.org/gatk/discussion/4017/what-is-a-gvcf-and-how-is-it-different-from-a-regular-vcf
GVCF Format specifications
https://gatkforums.broadinstitute.org/gatk/discussion/4017/what-is-a-gvcf-and-how-is-it-different-from-a-regular-vcf
GVCF Format : example
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE
chr2 21232195 . G A,<NON_REF> 1284.77 . MLEAC=1,0;... GT:AD:DP:GQ 0/1:38,39,0:77:99
chr2 21232196 . G <NON_REF> . . END=21232802 GT:DP:GQ:MIN_DP 0/0:94:99:63
chr2 21232803 . T C,<NON_REF> 4959.77 . DP=120;... GT:AD:DP:GQ 1/1:0,120,0:120:99
chr2 21232805 . T <NON_REF> . . END=21234696 GT:DP:GQ:MIN_DP 0/0:104:99:51
chr2 21234697 . A <NON_REF> . . END=21234697 GT:DP:GQ:MIN_DP 0/0:58:96:58
chr2 21234698 . G <NON_REF> . . END=21234722 GT:DP:GQ:MIN_DP 0/0:48:99:46
chr2 21234723 . C <NON_REF> . . END=21234726 GT:DP:GQ:MIN_DP 0/0:43:90:42
GVCF default GQ bands : 1, 2, 3, 4, ..., 60, 70, 80, 90, 99
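Two small computations help when reading such a file: the number of positions a reference block covers, and the GQ band a given genotype quality falls into. The Python sketch below is an illustration only; the function names are hypothetical, and it assumes each band runs from its lower bound up to, but not including, the next bound.

```python
import bisect

# Band boundaries built from the default list on the slide, with 0 and 100
# added as the outer limits (an assumption of this sketch).
BOUNDARIES = [0] + list(range(1, 61)) + [70, 80, 90, 99, 100]

def gq_band(gq):
    """Return (lower_inclusive, upper_exclusive) for the band containing gq."""
    i = bisect.bisect_right(BOUNDARIES, min(gq, 99)) - 1
    return BOUNDARIES[i], BOUNDARIES[i + 1]

def block_length(pos, info):
    """Number of reference positions covered by a block with an END= tag."""
    tags = dict(item.split("=", 1) for item in info.split(";") if "=" in item)
    return int(tags["END"]) - pos + 1

print(block_length(21232196, "END=21232802"))
print(gq_band(96))
```

Consecutive hom-ref sites whose GQ falls in the same band are merged into one block, which is why the example switches to a new line when GQ moves from 99 to 96.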
GATK GenotypeGVCFs
From GATK Best Practices for Variant Discovery Presentation, https://software.broadinstitute.org/gatk/download/workshops
GATK GenotypeGVCFs
● Joint calling
● Determine most likely combination of allele(s) for each site
● Based on allele likelihoods (from PairHMM)
● Apply Bayes’ theorem with ploidy assumption
Result : genotype calls
https://software.broadinstitute.org/gatk/documentation/presentations
GATK Calling variants workflow
[Figure : three BAM files are processed into one VCF file]
From GATK Best Practices for Variant Discovery Presentation, https://software.broadinstitute.org/gatk/download/workshops
GATK Calling variants N+1 problem
[Figure : three BAM files already processed into one VCF file, plus one new BAM file to add]
From GATK Best Practices for Variant Discovery Presentation, https://software.broadinstitute.org/gatk/download/workshops
GATK Calling variants tutorial
GATK: Filtering variants
https://software.broadinstitute.org/gatk/best-practices
GATK : Filtering variants
● Calling algorithms are very permissive
● Calling sets contain many false positives
● Two filtering approaches :
○ Hard filtering : using thresholds on annotations
○ Variant recalibration using machine learning
● Sensitivity vs Specificity
From GATK Best Practices for Variant Discovery Presentation, https://software.broadinstitute.org/gatk/download/workshops
GATK : Hard filtering
● Suitable for all experiments (targeted gene, WES, small sample size, etc.)
● Goal : define annotations and thresholds to filter bad variants
● Pros :
○ Easy to perform
● Cons :
○ Hard to define annotations to use
○ Hard to define thresholds
○ May filter good variants, may keep bad variants
From GATK Best Practices for Variant Discovery Presentation, https://software.broadinstitute.org/gatk/download/workshops
GATK : Annotations
● GATK adds annotations to each variant
● Represent properties/statistics describing each variant :
○ Sequence context
○ Depth of coverage
○ Number of reads covering each allele
○ Proportion of reads in forward/reverse orientation
○ ...
From GATK Best Practices for Variant Discovery Presentation, https://software.broadinstitute.org/gatk/download/workshops
GATK : Hard filtering example (QualByDepth)
● QUAL divided by the unfiltered depth of the non hom-ref samples
● Avoids the inflation of QUAL caused by deep coverage
● Two peaks :
○ 12 : Heterozygous
○ 32 : Homozygous alternate
● QD > 2 :
○ Eliminates most of the false positives
○ Keeps some bad variants
○ Filters some good variants
https://software.broadinstitute.org/gatk/documentation/article.php?id=6925
GATK : Hard filtering example (FisherStrand)
● Phred-scaled probability that there is a strand bias
● Identifies alternate alleles seen more or less often on the forward or reverse strand than the reference allele
● Large overlap between bad and good variants
● FS < 60 :
○ Removes many bad variants
○ Keeps many bad variants
https://software.broadinstitute.org/gatk/documentation/article.php?id=6925
GATK : Hard filtering examples
https://software.broadinstitute.org/gatk/documentation/article.php?id=6925
GATK : Hard filtering recommendations
● Filter SNPs where any of :
○ QD < 2.0
○ MQ < 40.0
○ FS > 60.0
○ SOR > 3.0
○ MQRankSum < -12.5
○ ReadPosRankSum < -8.0
● Filter indels where any of :
○ QD < 2.0
○ ReadPosRankSum < -20.0
○ InbreedingCoeff (1) < -0.8
○ FS > 200.0
○ SOR > 10.0
(1) When sample size > 10
Warning : Threshold on maximum depth should not be used for WES data
https://software.broadinstitute.org/gatk/documentation/article.php?id=6925
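The recommended thresholds can be written down as a small rule table. The Python sketch below is an illustration of the logic only, not a GATK tool: the function name `failed_filters` and the dictionary layout are assumptions, annotations missing from a record are simply skipped, and the sample-size condition on InbreedingCoeff is left to the caller.

```python
# (comparison, threshold): a variant fails when the comparison is true.
SNP_FILTERS = {
    "QD": ("<", 2.0), "MQ": ("<", 40.0), "FS": (">", 60.0),
    "SOR": (">", 3.0), "MQRankSum": ("<", -12.5), "ReadPosRankSum": ("<", -8.0),
}
INDEL_FILTERS = {
    "QD": ("<", 2.0), "ReadPosRankSum": ("<", -20.0),
    "InbreedingCoeff": ("<", -0.8), "FS": (">", 200.0), "SOR": (">", 10.0),
}

def failed_filters(annotations, variant_type="SNP"):
    """Return the names of the annotations that trip a hard filter."""
    rules = SNP_FILTERS if variant_type == "SNP" else INDEL_FILTERS
    failed = []
    for name, (op, threshold) in rules.items():
        value = annotations.get(name)
        if value is None:
            continue  # annotation absent: this sketch does not filter on it
        if (op == "<" and value < threshold) or (op == ">" and value > threshold):
            failed.append(name)
    return failed

print(failed_filters({"QD": 1.5, "MQ": 60.0, "FS": 10.0}))
print(failed_filters({"QD": 20.0, "FS": 70.0}, variant_type="INDEL"))
```

A variant passes when the returned list is empty; the same FS value of 70 fails as a SNP but passes as an indel because the two rule sets use different thresholds.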
GATK : Hard filtering tutorial
GATK : Variant Quality Score Recalibration (VQSR)
● Preferred method
● Requires :
○ DNA-seq data (does not work on RNA-seq data)
○ Well curated training/truth resources (usually not available for non human organisms)
○ Large amount of variants (no targeted gene panels, etc.)
○ > 30 samples for WES data (1000G WES samples can be added if needed but not optimal)
● Based on machine learning
From GATK Best Practices for Variant Discovery Presentation, https://software.broadinstitute.org/gatk/download/workshops
GATK : VQSR
Goal : Train on high confidence known sites to determine the probability that other sites are true or false
● Assume annotations tend to form Gaussian clusters
● Build a “Gaussian mixture model” from annotations of known variants
● Score all variants by where their annotations lie relative to the clusters
● Filter based on sensitivity to truth set
From GATK Best Practices for Variant Discovery Presentation, https://software.broadinstitute.org/gatk/download/workshops
GATK : VQSR
Positive model p (good variants, true positives)
Negative model q (bad variants, false positives)
VQSLOD(x) = Log(p(x)/q(x))
Done for each annotation and then integrated into a single VQSLOD
From GATK Best Practices for Variant Discovery Presentation, https://software.broadinstitute.org/gatk/download/workshops
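The log-odds formula can be tried out on a single annotation. The sketch below is a deliberate simplification: each model is one univariate Gaussian with hand-picked (hypothetical) mean and standard deviation for QD, whereas VQSR fits Gaussian mixture models over several annotations jointly.

```python
import math

def gaussian_logpdf(x, mean, sd):
    """Log density of a normal distribution at x."""
    return -0.5 * ((x - mean) / sd) ** 2 - math.log(sd * math.sqrt(2 * math.pi))

def vqslod(x, positive, negative):
    """log(p(x) / q(x)) with p the positive model and q the negative model.

    positive and negative are (mean, sd) pairs; working in log space avoids
    dividing by a density that underflows to zero.
    """
    return gaussian_logpdf(x, *positive) - gaussian_logpdf(x, *negative)

# Hypothetical models for the QD annotation.
GOOD = (20.0, 5.0)   # good variants cluster around QD = 20
BAD = (2.0, 2.0)     # bad variants cluster around QD = 2

for qd in [1.0, 5.0, 20.0]:
    print(qd, round(vqslod(qd, GOOD, BAD), 2))
```

A positive score means the annotation value looks more like the good variants than the bad ones, and a negative score means the opposite.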
GATK : VQSR
Panels : model trained on HapMap ; model applied to new SNPs
DePristo et al. Nat. Genet. 2011
GATK : VQSR workflow
● SNP mode : original SNPs + original indels → VariantRecalibrator → ApplyRecalibration → recalibrated SNPs + original indels
● INDEL mode : recalibrated SNPs + original indels → VariantRecalibrator → ApplyRecalibration → recalibrated SNPs + recalibrated indels
From GATK Best Practices for Variant Discovery Presentation, https://software.broadinstitute.org/gatk/download/workshops
GATK : VQSR training and truth resources
● Training : input variants that overlap these training sites are used to build the model
● Truth : determine where to set the cutoff
● Known : only for reporting purposes
● Prior : Phred-scaled estimate of data accuracy
From GATK Best Practices for Variant Discovery Presentation, https://software.broadinstitute.org/gatk/download/workshops
GATK : VQSR SNP human resources
● HapMap
○ Training
○ Truth
○ Prior = 15
● Omni
○ Training
○ Truth
○ Prior = 12
● 1000G SNPs High confidence
○ Training
○ Prior = 10
● dbSNP
○ Known
○ Prior = 2
https://software.broadinstitute.org/gatk/documentation/article?id=1259
Annotations : QD, MQ, MQRankSum, ReadPosRankSum, FS, SOR, DP¹, InbreedingCoeff
¹ Not to be used for WES
![Page 49: Variant Calling - Université de Lille · Variant callers are not concordant Mean single-nucleotide variants (SNV) concordance over 15 exomes between five alignment and variant-calling](https://reader033.fdocuments.net/reader033/viewer/2022053121/60a559e3afaa5531e40c18d0/html5/thumbnails/49.jpg)
GATK : VQSR Indel human resources
● Mills indels
○ Training
○ Truth
○ Prior = 12
● dbSNP
○ Known
○ Prior = 2
https://software.broadinstitute.org/gatk/documentation/article?id=1259
Annotations : QD, DP¹, FS, SOR, ReadPosRankSum, MQRankSum, InbreedingCoeff
¹ Not to be used for WES
![Page 50: Variant Calling - Université de Lille · Variant callers are not concordant Mean single-nucleotide variants (SNV) concordance over 15 exomes between five alignment and variant-calling](https://reader033.fdocuments.net/reader033/viewer/2022053121/60a559e3afaa5531e40c18d0/html5/thumbnails/50.jpg)
GATK : VQSR plots
From GATK Best Practices for Variant Discovery Presentation, https://software.broadinstitute.org/gatk/download/workshops
![Page 51: Variant Calling - Université de Lille · Variant callers are not concordant Mean single-nucleotide variants (SNV) concordance over 15 exomes between five alignment and variant-calling](https://reader033.fdocuments.net/reader033/viewer/2022053121/60a559e3afaa5531e40c18d0/html5/thumbnails/51.jpg)
GATK : VQSR tranches plots
[Figure: tranches plot; x-axis: Truth sensitivity (%), with tranches at 90, 99, 99.9 and 100]
From GATK Best Practices for Variant Discovery Presentation, https://software.broadinstitute.org/gatk/download/workshops
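The idea behind a truth-sensitivity tranche can be illustrated as follows: the cutoff is the VQSLOD value above which the requested percentage of truth-site variants is retained. This is a simplified sketch under that assumption, not GATK's actual implementation; `vqslod_cutoff` and the scores are hypothetical.

```python
import math


def vqslod_cutoff(truth_vqslods, sensitivity_pct):
    """Return the lowest VQSLOD that must be accepted so that
    `sensitivity_pct` percent of the truth-site variants are kept."""
    scores = sorted(truth_vqslods, reverse=True)  # best scores first
    # Number of truth variants to retain (rounded to avoid float noise)
    n_keep = math.ceil(round(len(scores) * sensitivity_pct / 100.0, 9))
    n_keep = max(1, min(n_keep, len(scores)))
    return scores[n_keep - 1]


# Hypothetical VQSLOD scores of call-set variants found at truth sites
truth_scores = [5.0, 4.0, 3.0, 2.0, 1.0, 0.5, 0.0, -1.0, -2.5, -6.0]
for tranche in (50, 90, 100):
    print(tranche, vqslod_cutoff(truth_scores, tranche))
```

Raising the target sensitivity lowers the cutoff: more true variants are recovered, at the cost of letting more false positives through.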
![Page 52: Variant Calling - Université de Lille · Variant callers are not concordant Mean single-nucleotide variants (SNV) concordance over 15 exomes between five alignment and variant-calling](https://reader033.fdocuments.net/reader033/viewer/2022053121/60a559e3afaa5531e40c18d0/html5/thumbnails/52.jpg)
GATK : VQSR output example
#CHROM  POS     FILTER                      INFO
chr2    121456  VQSRTrancheSNP99.9to100.0   AC=2;..;NEGATIVE_TRAIN_SITE;VQSLOD=-4.532;culprit=MQ
chr2    121782  PASS                        AC=24;..;VQSLOD=3.278;culprit=FS
chr2    121987  VQSRTrancheINDEL99.0to99.9  AC=1;..;POSITIVE_TRAIN_SITE;VQSLOD=-2.312;culprit=SOR
From GATK Best Practices for Variant Discovery Presentation, https://software.broadinstitute.org/gatk/download/workshops
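Records like the ones above can be read with a few lines of Python. This sketch assumes the abbreviated four-column layout of the slide (CHROM, POS, FILTER, INFO), not a full VCF line; the helper names are illustrative.

```python
def parse_info(info: str) -> dict:
    """Split an INFO string into a dict; flags without '=' map to True."""
    fields = {}
    for item in info.split(";"):
        if not item or item == "..":  # skip the elided part of the slide
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def summarize(record: str) -> dict:
    """Extract the VQSR-related values from one abbreviated record."""
    chrom, pos, filt, info = record.split()
    fields = parse_info(info)
    return {
        "chrom": chrom,
        "pos": int(pos),
        "passed": filt == "PASS",
        "filter": filt,
        "vqslod": float(fields["VQSLOD"]),
        "culprit": fields.get("culprit"),
    }


print(summarize("chr2 121782 PASS AC=24;..;VQSLOD=3.278;culprit=FS"))
```

The `culprit` field names the annotation that contributed most to a low score, which is useful when checking why a variant was filtered.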
![Page 53: Variant Calling - Université de Lille · Variant callers are not concordant Mean single-nucleotide variants (SNV) concordance over 15 exomes between five alignment and variant-calling](https://reader033.fdocuments.net/reader033/viewer/2022053121/60a559e3afaa5531e40c18d0/html5/thumbnails/53.jpg)
Variant normalization
![Page 54: Variant Calling - Université de Lille · Variant callers are not concordant Mean single-nucleotide variants (SNV) concordance over 15 exomes between five alignment and variant-calling](https://reader033.fdocuments.net/reader033/viewer/2022053121/60a559e3afaa5531e40c18d0/html5/thumbnails/54.jpg)
Why is variant normalization necessary?
The same variant can have several equivalent representations!
When merging variants from multiple variant callers for the same sample
⇒ which variants are common between callers ?
When comparing variants from the same variant caller but from different samples
⇒ which variants are shared between samples ?
A normalized variant is parsimonious and left-aligned
![Page 55: Variant Calling - Université de Lille · Variant callers are not concordant Mean single-nucleotide variants (SNV) concordance over 15 exomes between five alignment and variant-calling](https://reader033.fdocuments.net/reader033/viewer/2022053121/60a559e3afaa5531e40c18d0/html5/thumbnails/55.jpg)
Variant normalization - Parsimony
A variant is parsimonious when it is represented in as few nucleotides as possible without any allele of length 0.
If the leftmost nucleotide of every allele is the same and removing it from each allele does not leave an empty allele ⇒ the variant has a superfluous nucleotide on its left side!
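The trimming rule can be sketched in Python. The function below removes shared nucleotides on the right and then on the left while every allele keeps at least one nucleotide; the name `trim_parsimonious` and the example coordinates are hypothetical, and the sketch does not left-align (that requires the reference sequence).

```python
def trim_parsimonious(pos, alleles):
    """Trim superfluous nucleotides shared by all alleles.

    pos     : 1-based position of the first nucleotide of the alleles
    alleles : list of allele strings, REF first
    Returns the (possibly shifted) position and the trimmed alleles.
    """
    alleles = list(alleles)
    # Right side: drop the last nucleotide while all alleles share it
    while min(len(a) for a in alleles) > 1 and len({a[-1] for a in alleles}) == 1:
        alleles = [a[:-1] for a in alleles]
    # Left side: drop the first nucleotide while all alleles share it;
    # the variant then starts one position further to the right
    while min(len(a) for a in alleles) > 1 and len({a[0] for a in alleles}) == 1:
        alleles = [a[1:] for a in alleles]
        pos += 1
    return pos, alleles


# Hypothetical MNP padded with one extra G on each side
print(trim_parsimonious(100, ["GCATG", "GTGCG"]))
```

In this example the padded pair GCATG/GTGCG at position 100 is reduced to CAT/TGC at position 101.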
![Page 56: Variant Calling - Université de Lille · Variant callers are not concordant Mean single-nucleotide variants (SNV) concordance over 15 exomes between five alignment and variant-calling](https://reader033.fdocuments.net/reader033/viewer/2022053121/60a559e3afaa5531e40c18d0/html5/thumbnails/56.jpg)
Variant normalization - Parsimony
https://genome.sph.umich.edu/wiki/File:Normalization_mnp.png
![Page 60: Variant Calling - Université de Lille · Variant callers are not concordant Mean single-nucleotide variants (SNV) concordance over 15 exomes between five alignment and variant-calling](https://reader033.fdocuments.net/reader033/viewer/2022053121/60a559e3afaa5531e40c18d0/html5/thumbnails/60.jpg)
Variant normalization - Parsimony
https://genome.sph.umich.edu/wiki/File:Normalization_mnp.png
Parsimonious !
![Page 61: Variant Calling - Université de Lille · Variant callers are not concordant Mean single-nucleotide variants (SNV) concordance over 15 exomes between five alignment and variant-calling](https://reader033.fdocuments.net/reader033/viewer/2022053121/60a559e3afaa5531e40c18d0/html5/thumbnails/61.jpg)
Variant normalization - Left Alignment
A variant is left-aligned if and only if its position can no longer be shifted to the left while keeping the length of all its alleles constant.
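Left alignment and parsimony can be combined in one routine: trim the shared rightmost nucleotide, and whenever an allele becomes empty, extend all alleles by one reference nucleotide to the left. The sketch below follows that idea under simplifying assumptions (the reference is a plain string, alleles are not all identical); the name `normalize` and the toy sequence are illustrative only.

```python
def normalize(ref_seq, pos, alleles):
    """Left-align and trim a variant.

    ref_seq : reference sequence of the chromosome (string)
    pos     : 1-based position of the first nucleotide of the alleles
    alleles : list of allele strings, REF first (at least two distinct)
    """
    alleles = list(alleles)
    changed = True
    while changed:
        changed = False
        # Drop the rightmost nucleotide when every allele ends with it
        if all(alleles) and len({a[-1] for a in alleles}) == 1:
            alleles = [a[:-1] for a in alleles]
            changed = True
        # An empty allele is not allowed: extend everything to the left
        if any(len(a) == 0 for a in alleles):
            if pos == 1:
                raise ValueError("cannot extend left of the first base")
            pos -= 1
            left_base = ref_seq[pos - 1]  # 1-based pos -> 0-based index
            alleles = [left_base + a for a in alleles]
            changed = True
    # Finally remove superfluous nucleotides on the left side
    while min(len(a) for a in alleles) > 1 and len({a[0] for a in alleles}) == 1:
        alleles = [a[1:] for a in alleles]
        pos += 1
    return pos, alleles


# Toy reference with a CA repeat; three ways of writing "delete one CA"
reference = "GGGCACACACAGGG"
for pos, ref, alt in [(8, "CAC", "C"), (9, "ACA", "A"), (3, "GCACA", "GCA")]:
    print((pos, ref, alt), "->", normalize(reference, pos, [ref, alt]))
```

All three representations end up as the same record (position 3, GCA ⇒ G), which is what makes variants comparable between callers and samples.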
![Page 62: Variant Calling - Université de Lille · Variant callers are not concordant Mean single-nucleotide variants (SNV) concordance over 15 exomes between five alignment and variant-calling](https://reader033.fdocuments.net/reader033/viewer/2022053121/60a559e3afaa5531e40c18d0/html5/thumbnails/62.jpg)
Variant normalization - Left Alignment
https://genome.sph.umich.edu/wiki/File:Normalization_str.png
![Page 63: Variant Calling - Université de Lille · Variant callers are not concordant Mean single-nucleotide variants (SNV) concordance over 15 exomes between five alignment and variant-calling](https://reader033.fdocuments.net/reader033/viewer/2022053121/60a559e3afaa5531e40c18d0/html5/thumbnails/63.jpg)
Variant normalization - Left Alignment
https://genome.sph.umich.edu/wiki/File:Normalization_str.png
Empty allele is not allowed !
![Page 68: Variant Calling - Université de Lille · Variant callers are not concordant Mean single-nucleotide variants (SNV) concordance over 15 exomes between five alignment and variant-calling](https://reader033.fdocuments.net/reader033/viewer/2022053121/60a559e3afaa5531e40c18d0/html5/thumbnails/68.jpg)
Ready for annotating variants!!