Indel-based Realignment - UCLA · 2017. 2. 24.
Indel-based Realignment
Improving the original alignments of the reads based on multiple sequence (re-)alignment
[Figure: GATK Best Practices workflow, Data Pre-processing >> Variant Discovery >> Callset Refinement. Steps shown: Raw Reads, Map to Reference (BWA mem), Mark Duplicates & Sort (Picard), Indel Realignment, Base Recalibration, Analysis-Ready Reads; Var. Calling (HC in ERC mode), Genotype Likelihoods, Joint Genotyping, Raw Variants (SNPs, Indels); Variant Recalibration (separately per variant type), Genotype Refinement, Variant Annotation, Variant Evaluation (look good? use in project / troubleshoot), Analysis-Ready Variants (SNPs & Indels)]
You are here in the GATK Best Practices workflow for germline variant discovery
InDels = insertion/deletion
[Figure: Ref seq vs. Sample seq. Insertion: Ref seq AGCTAGGGTC, Sample seq AGCTAGGGTC with an extra TTC. Deletion: Ref seq AGCTAGGGTC, Sample seq AGCGGTC.]
The problem we want to fix
[Figure: read pileup around a 7 bp "T" homopolymer run, shown as aligned by BWA and after realignment]
Alignment by BWA:
Several consecutive "SNPs" only found on reads ending on the right of the homopolymer
Several consecutive "SNPs" only found on reads ending on the left of the homopolymer
After realignment:
Adding a 1-bp insertion brings sanity to the entire alignment
Why does this happen?
✓ Local realignment around indels -> most parsimonious alignment
✓ Improves accuracy of several downstream processing steps
Ref: T A C C C A T T T T T T T C T A A A A G C T
BWA: C C A T T T T T T C T A A A A A C T
IR:  C C A - T T T T T T C T A A A A A C T
[Figure: ref / del / ins read-end contexts: CATGCA, CCA TGCA, G]
• Mappers cannot "see" indels near ends of reads
• Because mismatches are "cheaper" than a gap in this context
Mismatch = -1, Open gap = -3
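The penalty argument can be checked with a little arithmetic. The sketch below is illustrative only: it uses the Ref and read sequences and the two penalties from the slide, and assumes that matches score 0 and that gap extension costs nothing (neither value is given on the slide). The function names and the start offsets are hypothetical choices made for this example, and this is not how BWA scores alignments.

```python
MISMATCH = -1   # penalty from the slide
GAP_OPEN = -3   # penalty from the slide
MATCH = 0       # assumption: matches score nothing, so only penalties count

def ungapped_score(read, ref, start):
    """Score a read placed at `start` on the reference with no gaps."""
    window = ref[start:start + len(read)]
    return sum(MATCH if r == g else MISMATCH for r, g in zip(read, window))

def one_deletion_score(read, ref, start, read_offset, del_len):
    """Score a read placed at `start` with one deletion of `del_len`
    reference bases opened after `read_offset` read bases."""
    left = ungapped_score(read[:read_offset], ref, start)
    right = ungapped_score(read[read_offset:], ref, start + read_offset + del_len)
    return left + GAP_OPEN + right

REF = "TACCCATTTTTTTCTAAAAGCT"
READ = "CCATTTTTTCTAAAAACT"

# Ungapped placement: two mismatches at the read start plus one further on.
ungapped = ungapped_score(READ, REF, 4)
# Gapped placement: a 1-bp deletion after "CCA", leaving a single mismatch.
gapped = one_deletion_score(READ, REF, 3, 3, 1)
print(ungapped, gapped)
```

Under these assumptions the ungapped placement scores -3 and the gapped one -4, so a mapper that looks at this read alone prefers the extra mismatches over opening a gap.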
How do we identify where realignment is needed?
• Known sites (e.g. dbSNP, 1000 Genomes)
• Indels seen in original alignments (in CIGARs)
• Sites where evidence suggests a hidden indel
  - Entropy calculation identifies "messy areas"
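The second source of candidate sites, indels already present in the original alignments, can be read directly from a read's CIGAR string. A minimal sketch, with a hypothetical helper name; it only illustrates the idea and is not how RealignerTargetCreator is implemented.

```python
import re

# One CIGAR operation: a length followed by an operation code (SAM spec).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def cigar_indels(cigar):
    """Return (operation, length) pairs for insertions (I) and deletions (D)."""
    return [(op, int(length)) for length, op in CIGAR_OP.findall(cigar) if op in "ID"]

print(cigar_indels("87M1D14M"))  # a read carrying a 1-bp deletion
print(cigar_indels("101M"))      # a read with no indel in its alignment
```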
How does the realignment algorithm work?
1. Find the best alternate consensus sequence that, together with the reference, best fits the reads in a pile (maximum of 1 indel)
2. Score for alternate consensus = total sum of quality scores of mismatching bases
3. If best alternate consensus is sufficiently better than the original alignments (using LOD score threshold) -> accept proposed realignment
Ref: AAGAGTAG
AAG---AGTAG
AAGAGTAG
Read pile consistent with a 3 bp insertion
Read pile consistent with the reference sequence
Realigning determines which is better
Three adjacent SNPs
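The three steps can be sketched on a toy pile. This is a simplified illustration, not the GATK implementation. Assumptions: each read's original alignment is taken to be its best ungapped placement on the reference, the quality sum is divided by 10 to put it on a log10 scale, and the threshold of 5.0 is an illustrative value. The inserted bases "TTC", the reads, the qualities, and all function names are made up for the example.

```python
def mismatch_quality_sum(read, quals, target, offset):
    """Sum the qualities of read bases that mismatch `target` at `offset`."""
    return sum(q for base, q, t in zip(read, quals, target[offset:]) if base != t)

def best_fit(read, quals, target):
    """Lowest mismatch-quality sum over all ungapped placements of the read.
    Requires len(read) <= len(target)."""
    last = len(target) - len(read)
    return min(mismatch_quality_sum(read, quals, target, o) for o in range(last + 1))

def accept_realignment(pile, reference, alt_consensus, lod_threshold):
    """Compare the pile's fit to the reference alone against its fit to
    reference + alternate consensus (each read takes whichever fits better)."""
    original = sum(best_fit(r, q, reference) for r, q in pile)
    proposed = sum(min(best_fit(r, q, reference), best_fit(r, q, alt_consensus))
                   for r, q in pile)
    improvement = (original - proposed) / 10.0  # assumed Phred -> log10 scaling
    return improvement >= lod_threshold, original, proposed

reference = "AAGAGTAG"         # Ref from the slide
alt_consensus = "AAGTTCAGTAG"  # 3 bp insertion; the bases TTC are hypothetical
reads = ["AAGTTC", "TTCAGTAG", "AAGAGTAG"]
pile = [(r, [30] * len(r)) for r in reads]  # uniform Q30, made up

accepted, original, proposed = accept_realignment(pile, reference, alt_consensus, 5.0)
print(accepted, original, proposed)
```

Against the reference alone, the first two reads each show three adjacent mismatches; with the alternate consensus available they fit without any mismatch, so the proposed realignment is accepted.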
Indel Realignment steps/tools
• Identify what regions need to be realigned
  ➔ RealignerTargetCreator
• Perform the actual realignment
  ➔ IndelRealigner
RealignerTargetCreator
• Pre-processing step to find intervals that may need realignment
• Input BAM file not necessary if processing only at known indels
• Using a list of known indels will both speed up processing and improve accuracy, but is not required
[Figure: Input BAM + Known Sites -> RealignerTargetCreator -> Target Intervals -> IndelRealigner -> Realigned BAM]
java -jar GenomeAnalysisTK.jar \
    -T RealignerTargetCreator \
    -R human.fasta \
    -I original.bam \
    -known indels.vcf \
    -o realigner.intervals
IndelRealigner
• Attempts realignment at RealignerTargetCreator target intervals
• Must use same input file(s) used in RealignerTargetCreator step
• Processing options
  - Only at known indels: much faster, accurate for ~90-95% of indels
  - At indels seen in the original BAM alignments: the recommended mode
  - Using full Smith-Waterman realignment: most accurate, but heavy computational cost and not really necessary with the new techs
[Figure: Input BAM + Known Sites + Target Intervals -> IndelRealigner -> Realigned BAM]
java -jar GenomeAnalysisTK.jar \
    -T IndelRealigner \
    -R human.fasta \
    -I original.bam \
    -known indels.vcf \
    -targetIntervals realigner.intervals \
    -o realigned.bam
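Because IndelRealigner must be given the same input file(s) as RealignerTargetCreator, one option is to build both command lines from a single set of arguments. The sketch below only assembles the argument lists, using the flags shown on the slides; the function name is hypothetical, and actually running the commands (for example with subprocess.run) is left to the caller.

```python
def realignment_commands(reference, bam, known, intervals, out_bam,
                         jar="GenomeAnalysisTK.jar"):
    """Build the two GATK command lines so that both steps share -R, -I and -known."""
    shared = ["-R", reference, "-I", bam, "-known", known]
    base = ["java", "-jar", jar]
    target_creator = base + ["-T", "RealignerTargetCreator"] + shared + ["-o", intervals]
    realigner = (base + ["-T", "IndelRealigner"] + shared
                 + ["-targetIntervals", intervals, "-o", out_bam])
    return target_creator, realigner

creator_cmd, realigner_cmd = realignment_commands(
    "human.fasta", "original.bam", "indels.vcf",
    "realigner.intervals", "realigned.bam")
print(" ".join(creator_cmd))
print(" ".join(realigner_cmd))
```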
DePristo, M., Banks, E., Poplin, R. et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet.
This is what a realigned BAM looks like
[Figure: Before vs. After realignment, for old data (lower quality) and new data (higher quality)]
Can I see the effects of realignment?
• IndelRealigner changes the CIGAR string of realigned reads but maintains the original CIGAR (with OC tag)
  -> Can grep for realigned regions and view in genome browser (IGV)
20GAVAAXX100126:1:67:10041:180738 99 20 10011431 70 87M1D14M = 10011720 390
TTAAATGTGTTTATCTATTGTTCTACTATTCAGTTACCTGATTATAAAATCAAAGATTATTTCATGAAACTCAGTACCCCTTCAGGGAAAAAAAAAAAAAT
HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHGGGGGGGG
X0:i:1 X1:i:0 MC:Z:101M OC:Z:101M PG:Z:MarkDuplicates RG:Z:20GAV.1 XG:i:0 AM:i:37 NM:i:1 SM:i:37 XM:i:1 XO:i:0
BQ:Z:@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@cccddc``a`^\[Y MQ:i:60 XT:A:
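As an alternative to grep, realigned reads can be picked out of SAM text by comparing the OC tag with the current CIGAR. A minimal sketch with hypothetical function names; it assumes standard tab-separated SAM lines, and the example record is abbreviated (hypothetical read name, SEQ and QUAL replaced by "*").

```python
def parse_sam_line(line):
    """Parse the mandatory SAM columns needed here plus the optional tags."""
    fields = line.rstrip("\n").split("\t")
    tags = {}
    for item in fields[11:]:
        tag, _type, value = item.split(":", 2)  # TAG:TYPE:VALUE
        tags[tag] = value
    return {"qname": fields[0], "flag": int(fields[1]), "rname": fields[2],
            "pos": int(fields[3]), "cigar": fields[5], "tags": tags}

def was_realigned(record):
    """True when an original CIGAR (OC tag) is stored and differs from the current one."""
    oc = record["tags"].get("OC")
    return oc is not None and oc != record["cigar"]

# Abbreviated record modelled on the slide.
line = "\t".join(["read1", "99", "20", "10011431", "70", "87M1D14M",
                  "=", "10011720", "390", "*", "*", "OC:Z:101M", "NM:i:1"])
record = parse_sam_line(line)
print(record["cigar"], record["tags"]["OC"], was_realigned(record))
```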
Is realignment still necessary with latest software?
• Variant callers with reassembly step (HaplotypeCaller, MuTect2, Platypus) do not require indel realignment
• BUT potential improvement for Base Quality Score Recalibration when run on realigned BAM files (artifactual SNPs are replaced with real indels).
• Also still useful for legacy tools
  - UnifiedGenotyper
  - MuTect1
Further reading
http://www.broadinstitute.org/gatk/guide/best-practices
http://www.broadinstitute.org/gatk/guide/article?id=38
https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_indels_IndelRealigner.php
https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_indels_RealignerTargetCreator.php