Optical Mapping Data: Data Generation and Algorithms
[Figure: pipeline stages Sample Preparation, Sequencing, Assembly, Analysis; intermediate data Fragments, Reads, Contigs]
What is an Optical Map?
GGCTTCCGACCACCACAACCGAATTATGAAGGATACCGAA
6, 19, 35
Optical maps are ordered, genome-wide, high-resolution restriction maps.
- Much longer than reads. For example, the average map size for goat covers 360,000 bp.
- Now commercially available.
Isolated DNA, microfluidic device
DNA is elongated and cleaved on the optical mapping surface
Epifluorescence microscope with CCD camera
6 3 3 49
6 3 3 49
6 3 9 4
Genome-wide optical map
“There is [..] a critical need for the continued development and public release of software tools for processing optical mapping data ..”
-GigaScience 2014
Goal: a tool to align the contig to a segment of an optical map
[Figure: pipeline stages Sample Preparation, Sequencing, Assembly, Analysis; inputs to the tool are the contigs and the Optical Map Data (genome-wide optical map)]
Challenges
• Previous approaches use dynamic programming
• Burrows-Wheeler Transform (BWT) would improve time efficiency
• Challenges in applying BWT: (1) sizing error and (2) alphabet size
Actual optical map values: 6 3 9 4
Optical map obtained from experiment: 5 4 9.5 6
Sizing error: 1 1 0.5 2
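The sizing error above is the per-fragment difference between the true and the measured map. A minimal sketch, assuming absolute differences and a single fixed tolerance (the function names are illustrative and not part of Twin):

```python
def sizing_errors(actual, observed):
    """Absolute difference between each true fragment size and its measured size."""
    if len(actual) != len(observed):
        raise ValueError("maps must have the same number of fragments")
    return [abs(a - o) for a, o in zip(actual, observed)]


def within_tolerance(actual, observed, tolerance):
    """True if every measured fragment is within `tolerance` of the true size."""
    return all(err <= tolerance for err in sizing_errors(actual, observed))


actual = [6, 3, 9, 4]        # actual optical map values (from the slide)
observed = [5, 4, 9.5, 6]    # optical map obtained from experiment (from the slide)
errors = sizing_errors(actual, observed)   # [1, 1, 0.5, 2]
```

An exact-match index cannot find `observed` when searching for `actual`, which is why the sizing error has to be handled inside the search itself.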
Alphabet size: number of unique fragment sizes > 16,000
Twin
[Figure: pipeline stages Sample Preparation, Sequencing, Assembly, Analysis; Contigs and Optical Map Data feed the alignment of contigs to the optical map, with Contig 1 to Contig 5 placed along the genome-wide optical map]
Twin Algorithm
1. In silico digest contigs into optical maps.
TTTCCGACCACTTTTCCGAATTATGACCGAA
4, 13, 24
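The in silico digest in step 1 can be sketched as follows. This is an illustration under stated assumptions: a single enzyme, the NheI site G^CTAGC that appears later in the deck as the default, forward-strand scanning only, and a made-up contig. The function name and its parameters are hypothetical and do not come from Twin.

```python
def in_silico_digest(sequence, site="GCTAGC", cut_offset=1):
    """Return (cut_positions, fragment_sizes) for one restriction enzyme.

    cut_offset is the number of bases of the site left of the cut
    (G^CTAGC gives 1). Positions count bases from the start of the sequence.
    """
    cuts = []
    start = sequence.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = sequence.find(site, start + 1)  # allow overlapping sites
    boundaries = [0] + cuts + [len(sequence)]
    fragments = [b - a for a, b in zip(boundaries, boundaries[1:])]
    return cuts, fragments


contig = "AAGCTAGCTTTTGCTAGCAA"   # made-up contig with two NheI sites
cuts, fragments = in_silico_digest(contig)
# cuts == [3, 13], fragments == [3, 10, 7]
```

The list of fragment sizes is the in silico optical map of the contig; it is this list, not the base sequence, that is searched for in the genome-wide optical map.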
Twin Algorithm
1. In silico digest contigs into optical maps.
2. Build FM-index* and auxiliary data structures on the genome-wide optical map.
* a data structure that allows compression of the input text while still permitting fast substring queries
BWT and FM-index
A suffix array (SA) of string S is an array of the suffixes of S sorted into alphabetical order.
The suffix array clusters all the occurrences of every pattern together into a contiguous range!
S = acaaacgn
Suffixes in text order: 1 acaaacgn, 2 caaacgn, 3 aaacgn, 4 aacgn, 5 acgn, 6 cgn, 7 gn, 8 n
Suffixes in sorted order: 3 aaacgn, 4 aacgn, 1 acaaacgn, 5 acgn, 2 caaacgn, 6 cgn, 7 gn, 8 n
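The contiguous-range property can be checked on the example string. The sketch below builds the suffix array naively (sorting whole suffixes, which is fine for a slide-sized example but not for a genome) and finds the range for a pattern by binary search; the helper names are illustrative.

```python
from bisect import bisect_left, bisect_right


def suffix_array(s):
    """1-based start positions of the suffixes of s, in sorted order (naive build)."""
    return [i + 1 for i in sorted(range(len(s)), key=lambda i: s[i:])]


def pattern_range(s, sa, pattern):
    """Half-open range [lo, hi) of suffix-array rows whose suffix starts with pattern."""
    # Truncating each sorted suffix to len(pattern) keeps the list sorted,
    # so binary search finds the block of matching rows.
    prefixes = [s[pos - 1:pos - 1 + len(pattern)] for pos in sa]
    return bisect_left(prefixes, pattern), bisect_right(prefixes, pattern)


S = "acaaacgn"
SA = suffix_array(S)                 # [3, 4, 1, 5, 2, 6, 7, 8], as on the slide
lo, hi = pattern_range(S, SA, "ac")  # rows 2 and 3 (0-based)
occurrences = sorted(SA[lo:hi])      # [1, 5]
```

Both occurrences of "ac" (text positions 1 and 5) sit in adjacent rows, which is the property the FM-index exploits.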
BWT and FM-index
The Burrows-Wheeler Transform (BWT) is a permutation of the string such that BWT[i] = S[SA[i] - 1].
S = acaaacgn
Sorted rotations: 3 aaacgnac, 4 aacgnaca, 1 acaaacgn, 5 acgnacaa, 2 caaacgna, 6 cgnacaaa, 7 gnacaaac, 8 nacaaacg
BWT and FM-index
Extract the last column of the SA.
rank_K(i): return the number of K's in S[1, i]
BWT:  c a n a a a c g
rank: 0 0 0 1 2 3 1 0
rank_a[5] = 2
The FM-index is the compressed version of the BWT and rank.
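The BWT and the rank row on these slides can be reproduced with a few lines. This is a sketch, not a compressed FM-index: the suffix array is built naively, and the rank row is computed the way the slide's row appears to be, namely as the number of earlier occurrences of the same symbol (so the value at position 5 is 2, matching rank_a[5] = 2).

```python
def suffix_array(s):
    """1-based start positions of the suffixes of s, in sorted order (naive build)."""
    return [i + 1 for i in sorted(range(len(s)), key=lambda i: s[i:])]


def bwt_from_sa(s, sa):
    """BWT[i] = S[SA[i] - 1] with 1-based indices; SA[i] = 1 wraps to the last symbol."""
    # In 0-based Python indexing S[SA[i] - 1] is s[pos - 2]; pos = 1 gives s[-1].
    return "".join(s[pos - 2] for pos in sa)


def occurrence_ranks(bwt):
    """For each position, how many times its symbol occurred earlier in the BWT."""
    seen = {}
    ranks = []
    for ch in bwt:
        ranks.append(seen.get(ch, 0))
        seen[ch] = seen.get(ch, 0) + 1
    return ranks


S = "acaaacgn"
bwt = bwt_from_sa(S, suffix_array(S))   # "canaaacg"
ranks = occurrence_ranks(bwt)           # [0, 0, 0, 1, 2, 3, 1, 0]
```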
Twin Algorithm
1. In silico digest contigs into optical maps.
2. Build FM-index and auxiliary data structures on the genome-wide optical map.
3. Using the FM-index, we find all alignments between the optical map and the in silico digested contigs.
   - Modified FM-index Backward Search Algorithm
FM-Index Backward Search
A recursive algorithm for finding substrings using rank and BWT
[Figure: backward search steps labeled rank[c], rank[a], rank[a]]
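For reference, the standard exact-match backward search can be sketched as below. It is written iteratively rather than recursively, uses a naive rank (counting in a prefix of the BWT) instead of a compressed structure, and is the textbook algorithm, not Twin's modified version. The helper names are illustrative.

```python
def build_c_table(bwt):
    """C[ch] = number of symbols in the text that are strictly smaller than ch."""
    table = {}
    total = 0
    for ch in sorted(set(bwt)):
        table[ch] = total
        total += bwt.count(ch)
    return table


def rank(bwt, ch, i):
    """Number of occurrences of ch in bwt[0:i] (naive; an FM-index answers this quickly)."""
    return bwt[:i].count(ch)


def backward_search(bwt, c_table, pattern):
    """Half-open range [lo, hi) of sorted rows that start with pattern, or (0, 0)."""
    lo, hi = 0, len(bwt)
    for ch in reversed(pattern):          # the pattern is consumed right to left
        if ch not in c_table:
            return (0, 0)
        lo = c_table[ch] + rank(bwt, ch, lo)
        hi = c_table[ch] + rank(bwt, ch, hi)
        if lo >= hi:
            return (0, 0)
    return (lo, hi)


bwt = "canaaacg"                          # BWT of acaaacgn from the slides
c_table = build_c_table(bwt)              # {'a': 0, 'c': 4, 'g': 6, 'n': 7}
hit = backward_search(bwt, c_table, "ac") # (2, 4): two rows start with "ac"
```

Each step costs two rank queries for one known symbol. With sizing error, a query fragment may match many different symbols, which is what the modified search has to handle.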
Modified FM-Index Backward Search
• Sizing error and alphabet size are challenges to overcome
• We cannot afford a brute force enumeration of the alphabet at each step in the backward search
• Novelty for optical maps: Wavelet Tree
Wavelet Tree
A Wavelet Tree converts a string into a balanced binary tree of bit vectors, where a 0 replaces half of the symbols, and a 1 replaces the other half. This definition is applied recursively.
{A,C,G,T} is encoded as {0,0,1,1}
ACGTATATAGGAAGA
001101010110010
No ambiguity!
Wavelet Tree
Root: {A,C,G,T} is encoded as {0,0,1,1}
  ACGTATATAGGAAGA
  001101010110010
Child 0: {A,C} is encoded as {0,1}
  ACAAAAAA
  01000000
Child 1: {G,T} is encoded as {0,1}
  GTTTGGG
  0111000
Which symbols in {A, G} exist in the input string?
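The bit vectors on this slide can be reproduced with a small recursive builder. This is a sketch that stores plain Python strings instead of succinct bit vectors; the dictionary layout and the function names are illustrative assumptions.

```python
def build_wavelet_tree(text, alphabet):
    """Split the alphabet in half at every node and record one bit per symbol."""
    node = {"alphabet": alphabet, "size": len(text),
            "bits": "", "left": None, "right": None}
    if len(alphabet) <= 1:
        return node                       # leaf: at most one symbol left
    mid = len(alphabet) // 2
    left_syms = alphabet[:mid]
    node["bits"] = "".join("0" if ch in left_syms else "1" for ch in text)
    node["left"] = build_wavelet_tree(
        "".join(ch for ch in text if ch in left_syms), left_syms)
    node["right"] = build_wavelet_tree(
        "".join(ch for ch in text if ch not in left_syms), alphabet[mid:])
    return node


def contains(node, symbol):
    """Does `symbol` occur in the text? Follow its branch down to a leaf."""
    if symbol not in node["alphabet"] or node["size"] == 0:
        return False
    if node["left"] is None:
        return True
    mid = len(node["alphabet"]) // 2
    child = node["left"] if symbol in node["alphabet"][:mid] else node["right"]
    return contains(child, symbol)


root = build_wavelet_tree("ACGTATATAGGAAGA", list("ACGT"))
# root["bits"] == "001101010110010"
# root["left"]["bits"] == "01000000", root["right"]["bits"] == "0111000"
```

For the question on the slide, `contains(root, "A")` and `contains(root, "G")` are both true for this input.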
Modified FM-Index Backward Search
To match x, we need to find all the substrings within the range x +/- y, for tolerance y.
To match 9, we need to find all the substrings within the range [6, 12], for tolerance 3.
Genome-wide optical map: 2, 11, 10, 23, 53, 3, 5, 10, 14, 9, 11
Wavelet tree over the fragment sizes (values, then bits):
Level 0:  2 11 10 23 53 3 5 10 14 9 11  /  0 1 0 1 1 0 0 0 1 0 1
Level 1:  2 10 3 5 10 9  /  0 1 0 0 1 1
          11 23 53 14 11  /  0 1 1 0 0
Level 2:  2 3 5  /  0 0 1
          10 9 10  /  0 1 0
          11 14 11  /  0 1 0
          23 53  /  0 1
Level 3:  2 3  /  0 1
          5  /  1
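The query the modified search needs is: which distinct fragment sizes in [lo, hi] occur within a given interval of the sequence? A minimal sketch of that range report on a wavelet tree follows. It assumes the tree splits the sorted distinct values at the median, so its bit vectors differ from the ones drawn on the slide, and it uses prefix sums of the bits as a stand-in for rank. The class and method names are illustrative and not Twin's.

```python
class WaveletNode:
    def __init__(self, values, symbols=None):
        self.symbols = sorted(set(values)) if symbols is None else symbols
        self.left = self.right = None
        self.ones = [0]                   # ones[i] = number of 1 bits among the first i values
        if len(self.symbols) > 1:
            mid = len(self.symbols) // 2
            pivot = self.symbols[mid]     # values >= pivot get bit 1 and go right
            for v in values:
                self.ones.append(self.ones[-1] + (1 if v >= pivot else 0))
            self.left = WaveletNode([v for v in values if v < pivot], self.symbols[:mid])
            self.right = WaveletNode([v for v in values if v >= pivot], self.symbols[mid:])

    def range_report(self, start, end, lo, hi):
        """Distinct values v with lo <= v <= hi among positions [start, end)."""
        if start >= end or not self.symbols:
            return []
        if self.symbols[-1] < lo or self.symbols[0] > hi:
            return []                     # this subtree holds no value in [lo, hi]
        if self.left is None:
            return [self.symbols[0]]      # leaf reached with a non-empty interval
        ones_before, ones_upto = self.ones[start], self.ones[end]
        return (self.left.range_report(start - ones_before, end - ones_upto, lo, hi)
                + self.right.range_report(ones_before, ones_upto, lo, hi))


optical_map = [2, 11, 10, 23, 53, 3, 5, 10, 14, 9, 11]
tree = WaveletNode(optical_map)
candidates = tree.range_report(0, len(optical_map), 6, 12)   # [9, 10, 11]
```

Subtrees whose values all fall outside [6, 12] are pruned without being visited, which avoids enumerating the full alphabet of more than 16,000 fragment sizes at each step.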
Modified FM-Index Backward Search
Wavelet Tree Query
Twin Algorithm
1. In silico digest contigs into optical maps.
2. Build FM-index and auxiliary data structures on the genome-wide optical map.
3. Using the FM-index, we find all alignments between the optical map and the in silico digested contigs.
4. Output the alignments in PSL format.
TWIN Test Datasets
TWIN Results
Twin is the first alignment method that is capable of handling large genome sizes
• The only index-based tool; orders of magnitude faster than existing approaches (patent pending)
• Pine tree (20 Gb) would take ~84 machine years with SOMA but a couple of hours with Twin
TWIN: Optical Map Aligner
CORRECTING ERRORS IN GENOMES
Mis-assembly in Genomes
Mis-assembly: a significantly large insertion, deletion, inversion, or rearrangement that is the result of decisions made by the assembly program
[Figure: block diagrams of blocks A, R, R, B for four cases: correct assembly, rearrangement, deletion, insertion]
Extensive vs. Local Mis-assemblies
Extensive mis-assembly: at least 1 kbp in size, and regions align to different strands or different chromosomes.
Local mis-assembly: smaller in size, and on the same strand and same chromosome.
De Bruijn Graph of a Genome
Example Genome: ABCDEFGHICDEFGKL
[Figure: de Bruijn graph with nodes ABC, BCD, CDE, DEF, EFG, FGK, GKL on the main path and FGH, GHI, HIC, ICD on a loop from EFG back to CDE; labels 1, 2, 3]
De Bruijn Graph of a Genome
Example Genome: ABCDEFGHICDEFGKL
Resulting Erroneous Genome: ABCDEFGKL
[Figure: only the path ABC, BCD, CDE, DEF, EFG, FGK, GKL is traversed; label 1]
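The collapse of the repeat CDEFG can be reproduced on the example genome. The sketch below follows the slide's drawing, where nodes are 3-mers and an edge joins consecutive 3-mers (many formulations use (k-1)-mers as nodes instead); the function names are illustrative.

```python
from collections import defaultdict


def kmer_graph(genome, k=3):
    """Nodes are k-mers; an edge joins consecutive k-mers, which overlap by k-1 symbols."""
    graph = defaultdict(set)
    for i in range(len(genome) - k):
        graph[genome[i:i + k]].add(genome[i + 1:i + 1 + k])
    return graph


def spell(path):
    """Sequence spelled by a path of overlapping k-mers."""
    return path[0] + "".join(node[-1] for node in path[1:])


genome = "ABCDEFGHICDEFGKL"
graph = kmer_graph(genome)
branch = graph["EFG"]          # {"FGH", "FGK"}: the repeat makes the walk ambiguous

# A walk that takes the FGK branch the first time it reaches EFG skips the loop.
short_path = ["ABC", "BCD", "CDE", "DEF", "EFG", "FGK", "GKL"]
is_valid = all(b in graph[a] for a, b in zip(short_path, short_path[1:]))
erroneous = spell(short_path)  # "ABCDEFGKL"
```

The short walk is a valid path in the graph, so nothing in the graph alone tells the assembler that HICDEFG was dropped.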
[Figure: pipeline stages Sample Preparation, Sequencing, Assembly, Analysis, with Fragments, Reads, Contigs; misSEQuel* with inputs Reads, Contigs, and Optical Map Data, and output Refined Contigs]
*(Muggli, Puglisi, Ronen, Boucher, ISMB 2015)
misSEQuel Algorithm
1. Align sequence reads to contigs using a standard alignment tool.
GGCTTCCGACCACCACAAATGGATTATGAAGGATATATGGA
[Figure: reads aligned along the sequence above; labels 1 and 9]
misSEQuel Algorithm
1. Align sequence reads to contigs using a standard alignment tool.
2. Build the red-black positional de Bruijn graph based on the alignment.
Next Generation Sequencing (NGS)
[Figure: Sample Preparation, Sequencing; Fragments are sequenced into Reads]
Reads: ACGTAGAATCGACCATG, GGGACGTAGAATACGAC, ACGTAGAATACGTAGAA
Paired-End Reads / Mate-Pair Reads
[Figure: Sample Preparation, Sequencing; a fragment read from both ends: ACGTAGAATCGACCATG ... GGGACGTAGAATACGA]
Read Mate Pair Concordance
[Figure: mate-pair alignments over blocks A, R, R, B for three cases: correct assembly, rearrangement, inversion]
Read Depth
[Figure: read depth over blocks A, R, R, B for three cases: correct assembly, insertion, deletion]
Red-Black Positional De Bruijn Graph
I. Choose a value of k and Δ.
II. Each positional k-mer (s_k) is an edge between two positional (k-1)-mers: the prefix and suffix of s_k.
III. Positional (k-1)-mers, s_{k-1} and s_{k-1}', are glued if s_{k-1} and s_{k-1}' have the same label and their distances differ by at most Δ.
IV. A s_{k-1} is red if the read depth is two standard deviations from the mean or there is a significant number of discordant read alignments; otherwise, it is black.
A positional k-mer is a k-mer with an approximate position.
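Rules III and IV can be illustrated with a small sketch. Several choices here are assumptions rather than misSEQuel's actual behavior: occurrences with the same label are glued when consecutive sorted positions are within Δ (single linkage), the depth test uses the population standard deviation over all nodes, and the discordant-alignment test is left out because the slide does not quantify "significant". All names are hypothetical.

```python
from statistics import mean, pstdev


def glue_positional_kmers(occurrences, delta):
    """Merge (label, position) occurrences that share a label and lie within delta.

    Returns a list of (label, [positions]) nodes. Chaining of positions through
    intermediate occurrences (single linkage) is an assumption of this sketch.
    """
    nodes = []
    for label, pos in sorted(occurrences):
        if nodes and nodes[-1][0] == label and pos - nodes[-1][1][-1] <= delta:
            nodes[-1][1].append(pos)
        else:
            nodes.append((label, [pos]))
    return nodes


def color_by_depth(depths, num_sd=2.0):
    """Red if the read depth is at least num_sd standard deviations from the mean."""
    mu, sd = mean(depths), pstdev(depths)
    return ["red" if sd > 0 and abs(d - mu) >= num_sd * sd else "black"
            for d in depths]


occurrences = [("ACG", 10), ("ACG", 12), ("ACG", 500), ("CGT", 11)]
nodes = glue_positional_kmers(occurrences, delta=5)
# [("ACG", [10, 12]), ("ACG", [500]), ("CGT", [11])]

depths = [30, 31, 29, 30, 32, 28, 30, 31, 29, 90]   # made-up read depths
colors = color_by_depth(depths)                      # only the last node is red
```

Because the position is part of the node identity, the two distant copies of ACG stay separate nodes, unlike in an ordinary de Bruijn graph where every copy of a repeat collapses into one node.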
Positional Red Black de Bruijn Graph
[Figure: reads aligned to contigs; positional k-mers with read depth; the resulting positional red-black de Bruijn graph]
misSEQuel Algorithm
1. Align sequence reads to contigs using a standard alignment tool.
2. Build the red-black positional de Bruijn graph based on the alignment.
3. Remove all bulges and whirls from the red-black positional de Bruijn graph.
[Figure: correctly assembled contigs and mis-assembled contigs, drawn as arrangements of blocks A, R, R, B]
misSEQuel Algorithm
1. Align sequence reads to contigs using a standard alignment tool.
2. Build the red-black positional de Bruijn graph based on the alignment.
3. Remove all bulges and whirls from the red-black positional de Bruijn graph.
4. Contig refinement using optical map alignment.
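Optical Map Alignment
NheI = G^CTAGC
[Figure: E. coli optical map segment above a contig with blocks A, R, R, B; occurrences of "GCTAGC" mark the cut sites]
Correctly Assembled Contigs Align
[Figure: a correctly assembled contig with blocks A, R, R, B lines up with the optical map segment]
Mis-assembled Contigs Don't Align
[Figure: a mis-assembled contig with rearranged blocks does not line up with the optical map segment]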
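The align / don't align distinction can be illustrated with a naive scan: slide the contig's in silico fragment sizes along the optical map and accept a window only if every fragment is within the tolerance. This brute-force check is a sketch for intuition and is not how Twin searches; the contig fragment lists are made up, and partial fragments at contig ends are ignored.

```python
def find_alignments(optical_map, contig_fragments, tolerance):
    """Start indices where each contig fragment is within tolerance of the map fragment."""
    m = len(contig_fragments)
    hits = []
    for start in range(len(optical_map) - m + 1):
        window = optical_map[start:start + m]
        if all(abs(w - f) <= tolerance for w, f in zip(window, contig_fragments)):
            hits.append(start)
    return hits


optical_map = [2, 11, 10, 23, 53, 3, 5, 10, 14, 9, 11]   # map from the earlier slide

good_contig = [9, 22, 50]   # made-up fragments in the same order as the map
bad_contig = [9, 50, 22]    # same fragments with two of them swapped

good_hits = find_alignments(optical_map, good_contig, tolerance=3)   # [2]
bad_hits = find_alignments(optical_map, bad_contig, tolerance=3)     # []
```

A contig whose fragments are rearranged finds no consistent window, which is the signal used to separate mis-assembled contigs from correct ones.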
Results on Tularensis
Results on Pine
Improve Prediction
[Figure: block diagrams with blocks A, R, R, R, B]
Deletion between two aligned regions