Sequence Preprocessing: A perspective
Dr. Matthew L. Settles
Genome Center, University of California, Davis
Why Preprocess Reads
• We have found that aggressively “cleaning” and processing reads can make a large difference to the speed and quality of assembly and mapping results. Cleaning your reads means removing reads/bases that are:
  • other unwanted sequence (polyA tails in RNA-seq data)
  • artificially added onto the sequence of primary interest (vectors, adapters, primers)
  • low-quality bases
  • originating from PCR duplication
  • not of primary interest (contamination)
• Cleaning can also join short overlapping paired-end reads.
• Preprocessing also produces a number of technical statistics that should be used to evaluate “experimental consistency”.
Read Preprocessing strategies, many over time
• Identify and remove contaminant and vector reads
  • Reads which appear to come entirely from extraneous sequence should be removed.
• Quality trim/cut
  • “End” trim a read until the average quality > Q (Lucy)
  • Remove any read with average quality < Q
• Eliminate singletons/duplicates
  • If you have excess depth of coverage, and particularly if you have at least x-fold coverage where x is the read length, then eliminating singletons is a nice way of dramatically reducing the number of error-prone reads.
  • Reads which appear the same (particularly paired-end) are often more likely PCR duplicates and therefore redundant reads.
• Eliminate all reads (pairs) containing an “N” character
  • If you can afford the loss of coverage, you might throw away all reads containing Ns.
• Identify and trim off adapters and barcodes if present
  • Believe it or not, the software provided by Illumina either does not look for, or does a mediocre job of, identifying adapters and removing them.
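Two of the strategies above (dropping reads that contain an “N” and dropping reads whose average quality is below Q) can be sketched in a few lines of Python. This is only an illustration: the function names, the Phred+33 encoding, and the default threshold of 20 are assumptions, not part of any tool named in these slides.

```python
def mean_quality(qual, offset=33):
    """Mean Phred score of a FASTQ quality string (Phred+33 is assumed)."""
    return sum(ord(c) - offset for c in qual) / len(qual)


def keep_read(seq, qual, min_mean_q=20, drop_n=True):
    """Return True if the read passes the N filter and the mean-quality filter."""
    if not seq:
        return False
    if drop_n and "N" in seq.upper():
        return False  # read contains at least one ambiguous base
    return mean_quality(qual) >= min_mean_q


# Toy reads as (sequence, quality) tuples; values are made up for illustration.
reads = [
    ("ACGTACGT", "IIIIIIII"),  # Phred 40 at every base
    ("ACGTNCGT", "IIIIIIII"),  # contains an N
    ("ACGTACGT", "########"),  # Phred 2 at every base
]
kept = [(s, q) for s, q in reads if keep_read(s, q)]
```

For paired-end data the same test would be applied to both reads, dropping the pair if either read fails.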
Ribosomal RNA
• Ribosomal RNA makes up 90% or more of a typical total RNA sample.
• Library prep methods reduce the rRNA representation in a sample
  • oligo-dT only binds to polyA tails to enrich a sample for mRNA
  • Ribo-depletion binds rRNA sequences so they can be removed
Neither technique is 100% efficient.
You can screen (map reads to rRNA sequences) to determine rRNA removal efficiency and potentially remove those reads.
• DNA/RNA, could contain ‘contamination’
• Library prep, fragmentation, adapter addition
• PCR enrichment
• Final library, size distribution
• Possible addition of PhiX
• Sequencing characteristics/quality
Preprocessing
• Map reads to contaminants/PhiX and extract unmapped reads [bowtie2 --local]
  • Remove contaminants (at least PhiX): uses bowtie2, then extracts all reads (pairs) that are marked as unmapped.
• Super-Deduper [PE reads only]
  • Remove PCR duplicates (we use bases 10-35 of each paired read)
• FLASH2 [PE reads only]
  • Join and extend overlapping paired-end reads
  • If reads completely overlap they will contain adapter; remove adapters
  • Identify and remove any adapter dimers present
• Scythe [SE reads only]
  • Identify and remove adapter sequence
• Sickle
  • Trim sequences (5’ and 3’) by quality score (I like Q20)
• Cleanup
  • Run a polyA/T trimmer
  • Remove any reads that are less than the minimum length parameter
  • Produce preprocessing statistics
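The final cleanup step (polyA/T trimming, minimum-length filtering, and summary statistics) might look roughly like the sketch below. The function names, the minimum run length of 10, and the statistics collected are all hypothetical choices made for illustration; the actual cleanup script is not shown in these slides.

```python
def trim_poly_tail(seq, qual, min_run=10):
    """Trim a 3' poly-A run and a 5' poly-T run of at least min_run bases."""
    s = seq.upper()
    a_run = len(s) - len(s.rstrip("A"))  # length of the trailing A run
    if a_run >= min_run:
        seq, qual = seq[:len(seq) - a_run], qual[:len(qual) - a_run]
        s = s[:len(s) - a_run]
    t_run = len(s) - len(s.lstrip("T"))  # length of the leading T run
    if t_run >= min_run:
        seq, qual = seq[t_run:], qual[t_run:]
    return seq, qual


def cleanup(reads, min_length=50, min_run=10):
    """Trim poly-A/T, drop reads shorter than min_length, and count outcomes."""
    kept = []
    stats = {"input": 0, "trimmed": 0, "too_short": 0, "kept": 0}
    for seq, qual in reads:
        stats["input"] += 1
        new_seq, new_qual = trim_poly_tail(seq, qual, min_run)
        if len(new_seq) < len(seq):
            stats["trimmed"] += 1
        if len(new_seq) < min_length:
            stats["too_short"] += 1
            continue
        kept.append((new_seq, new_qual))
        stats["kept"] += 1
    return kept, stats
```

The `stats` dictionary is the kind of per-sample summary that can later be compared across samples.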
Why Screen for PhiX
• PhiX is a common control in Illumina runs; facilities rarely tell you if/when PhiX has been spiked in
• Does not have a barcode, so in theory should not be in your data
• However
  • When I know PhiX has been spiked in, I find sequence every time
  • When I know PhiX has not been spiked in, I do not find sequence
• Better safe than sorry: screen for it.
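Once reads have been mapped against PhiX, pulling out the pairs marked as unmapped comes down to checking SAM flag bits (0x4 for “read unmapped”, 0x8 for “mate unmapped”). The sketch below works on plain SAM text lines and does not run bowtie2 itself. Keeping a pair only when both mates are unmapped is an assumption about how “unmapped pairs” is defined, and the function names are hypothetical.

```python
READ_UNMAPPED = 0x4  # SAM flag bit: this read is unmapped
MATE_UNMAPPED = 0x8  # SAM flag bit: the mate is unmapped


def pair_is_unmapped(flag):
    """True when both the read and its mate are flagged as unmapped."""
    return bool(flag & READ_UNMAPPED) and bool(flag & MATE_UNMAPPED)


def unmapped_pair_records(sam_lines):
    """Yield SAM alignment lines whose read and mate are both unmapped."""
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        if pair_is_unmapped(int(fields[1])):
            yield line
```

In practice the same selection is usually done by the aligner or by a SAM/BAM toolkit; the point here is only which flag bits define an unmapped pair.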
SuperDeduper
https://github.com/dstreett/Super-Deduper
[Figure: Read 1 / Read 2 diagram]

| Data | Alignment Algorithm | MarkDuplicates | Rmdup | SuperDeduper | FastUniq | Fulcrum | Total # of Reads |
|---|---|---|---|---|---|---|---|
| PhiX | BWA MEM | 1,048,278 (0.25%) | 1,011,145 (1.05%) | 1,156,700 (13.7%) | 4,202,526 | 3,092,155 | 4,750,299 |
| PhiX | Bowtie2 Local | 1,054,725 (6.62%) | 948,784 (10.2%) | 1,166,936 (14.0%) | 4,236,647 | 3,103,872 | 4,790,972 |
| PhiX | Bowtie2 Global | 799,524 (0%) | 800,868 (0.12%) | 896,487 (9.92%) | 3,768,641 | 2,704,114 | 4,293,787 |
| Acropora digitifera | BWA MEM | 5,132,111 (2.26%) | 6,906,634 (44.5%) | 5,133,339 (10.2%) | 12,968,469 | 2,103,567 | 54,108,240 |
| Acropora digitifera | Bowtie2 Local | 4,688,809 (4.03%) | 5,931,862 (38.9%) | 3,971,743 (9.32%) | 9,893,903 | 4,259,619 | 41,728,154 |
| Acropora digitifera | Bowtie2 Global | 1,457,865 (3.62%) | 1,512,966 (24.2%) | 1,185,838 (11.4%) | 3,014,498 | 1,286,031 | 11,600,847 |
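The alignment-free idea behind this kind of deduplication (treat two pairs as duplicates when a fixed window of bases from both reads is identical) can be sketched as follows. This is not Super-Deduper's implementation: the function name, the 0-based `start=10, length=25` reading of “bases 10-35”, keeping the first copy seen, and keeping pairs too short to key are all assumptions made for illustration.

```python
def dedup_pairs(pairs, start=10, length=25):
    """Keep the first pair seen for each key built from a window of both reads.

    pairs: iterable of (read1_seq, read2_seq); start is a 0-based offset.
    """
    seen = set()
    unique = []
    for r1, r2 in pairs:
        if len(r1) < start + length or len(r2) < start + length:
            unique.append((r1, r2))  # too short to build a key: keep (assumption)
            continue
        key = (r1[start:start + length], r2[start:start + length])
        if key not in seen:
            seen.add(key)
            unique.append((r1, r2))
    return unique
```

Because only a short window is compared, sequencing errors outside the window do not prevent two copies of the same fragment from being recognised as duplicates.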
SuperDeduper
[Figure: ROC curves, True Positive Rate against False Positive Rate. Left panel: full curves (both axes 0.00 to 1.00). Right panel: zoomed view (False Positive Rate 0.03 to 0.07, True Positive Rate 0.875 to 0.975) with one point labeled “10”. Legend: Start Position 1, 5, 20, 40, 150.]
Figure 1: ROC curves. Only a representative subset of the different start positions is shown. The image on the left shows the full ROC curves and the image on the right is a zoomed-in view of the corner of the curves. Each curve represents a start position and each point represents a length. The labeled point in the image on the right is the default start and length for Super Deduper.
We calculated the Youden Index for every combination tested. The point with the highest index value (as compared to Picard MarkDuplicates) occurred at a start position of 5 bp and a length of 10 bp (20 bp total over both reads).
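For reference, the Youden Index is J = sensitivity + specificity - 1, which equals the true positive rate minus the false positive rate, so it picks the ROC point farthest above the diagonal. A minimal sketch of selecting the best setting is below; the function names are hypothetical and the confusion-matrix counts in the usage example are invented purely to show the mechanics, not taken from the evaluation described here.

```python
def youden_index(tp, fp, tn, fn):
    """Youden's J = TPR - FPR, computed from confusion-matrix counts."""
    tpr = tp / (tp + fn)  # sensitivity
    fpr = fp / (fp + tn)  # 1 - specificity
    return tpr - fpr


def best_setting(results):
    """results maps a setting label to (tp, fp, tn, fn); return the label with the highest J."""
    return max(results, key=lambda label: youden_index(*results[label]))


# Invented counts for three hypothetical settings.
example = {
    "setting_A": (80, 10, 90, 20),
    "setting_B": (95, 5, 95, 5),
    "setting_C": (90, 20, 80, 10),
}
winner = best_setting(example)
```

Here “truth” would be whatever reference call is used for comparison, which in the text above is Picard MarkDuplicates.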
FLASH2: overlapping of reads and adapter removal in paired-end reads
[Figure: three cases, each showing the target region, Read 1, Read 2, and the insert size]
• Insert size > length of the number of cycles. Product: read pair
• Insert size < length of the number of cycles (10 bp min). Product: extended, single
• Insert size < length of the read length. Product: adapter trimmed, single
https://github.com/dstreett/FLASH2
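The three cases can be expressed as a small decision rule on insert size. This sketch is one interpretation rather than FLASH2's code: it assumes both reads have the same length, that “number of cycles” means the combined length of Read 1 and Read 2, and that “10 bp min” is the minimum overlap needed to join a pair. The function name is hypothetical.

```python
def classify_pair(insert_size, read_length, min_overlap=10):
    """Predict the product for a read pair from its insert size."""
    total_cycles = 2 * read_length  # Read 1 cycles + Read 2 cycles
    if insert_size < read_length:
        # Each read runs off the end of the insert into adapter sequence.
        return "adapter trimmed, single"
    if insert_size <= total_cycles - min_overlap:
        # Reads overlap by at least min_overlap bases and can be joined.
        return "extended, single"
    # Overlap is too short (or absent), so the reads stay as a pair.
    return "read pair"
```

For example, with 150 bp reads an insert of 250 bp gives a 50 bp overlap and a single extended read, while an insert of 400 bp leaves the pair unjoined.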
Quality Trimming - Sickle
Remove “poor” quality sequence from both the 5’ and 3’ ends
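As a rough picture of trimming from both ends, the sketch below removes bases from the 5’ and 3’ ends until it reaches a base at or above the threshold. This is a simplified per-base trim, not Sickle's algorithm (Sickle works on sliding windows); the function name, Phred+33 encoding, and Q20 default are assumptions.

```python
def quality_trim(seq, qual, min_q=20, offset=33):
    """Trim low-quality bases from both ends of a read (simple per-base rule)."""
    scores = [ord(c) - offset for c in qual]
    start, end = 0, len(scores)
    while start < end and scores[start] < min_q:
        start += 1  # drop low-quality bases at the 5' end
    while end > start and scores[end - 1] < min_q:
        end -= 1  # drop low-quality bases at the 3' end
    return seq[start:end], qual[start:end]
```

A read whose bases are all below the threshold comes back empty and would then be removed by the minimum-length filter.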
QA/QC
• Beyond generating ‘better’ data for downstream analysis, cleaning statistics also give you an idea as to the quality of the sample, the library generation, and the sequencing used to generate the data.
• This can help inform what you might do in the future.
• I’ve found it best to perform QA/QC on both the run as a whole (poor samples can affect other samples) and on the samples themselves as they compare to other samples (REMEMBER, BE CONSISTENT).
• Reports such as BaseSpace for Illumina are great ways to evaluate the runs as a whole.
• PCA/MDS plots of the preprocessing summary are a great way to look for technical bias across your experiment.
Comparing Mapping: Raw vs Preprocessed with STAR