27th Reconfigurable Architectures Workshop (RAW 2020) - … · 2020. 6. 3. · Leveraging Succinct...

13
Leveraging Succinct Data Structures for DNA Sequence Mapping on FPGA 27 th IEEE Reconfigurable Architecture Workshop, May 18 th 2020 Guido Walter Di Donato [email protected] Alberto Zeni Lorenzo Di Tucci Marco D. Santambrogio [email protected] [email protected] [email protected] Dipartimento di Elettronica, Informazione e Bioingegneria (DEIB) - Politecnico di Milano

Transcript of 27th Reconfigurable Architectures Workshop (RAW 2020) - … · 2020. 6. 3. · Leveraging Succinct...

  • Leveraging Succinct Data Structures for DNA Sequence Mapping on FPGA

    27th IEEE Reconfigurable Architecture Workshop, May 18th 2020

    POLITECNICO DI MILANO

    MANUALE DI CORPORATE IDENTITY

    MILANO, MAGGIO 2015

    Guido Walter Di Donato [email protected] Zeni

    Lorenzo Di TucciMarco D. Santambrogio

    [email protected]@[email protected]

    Dipartimento di Elettronica, Informazione e Bioingegneria (DEIB) - Politecnico di Milano

  • 2

    DNA Sample

    Sequencing

    Genome

    Assembly

    Genome

    Analysis

    Drug

    Creation

    Computational Bottleneck

    DNA Analysis for Precision Medicine

  • Sequence Mapping

    3

    Sequence MappingSequenced reads

    Sequence mapping is a computational intensive step involved in genome assembly and

    genomic analysis pipelines, leading to long execution times and high computational costs.[1]

    [1] Beretta, S.. Algorithms for strings and sequences: Pairwise alignment. Encyclopaedia of Bioinformatics and Computational Biology. Oxford: Academic Press, 2019.

  • Sequence Mapping

    4

    Sequence MappingSequenced reads

    Sequence mapping is a computational intensive step involved in genome assembly and

    genomic analysis pipelines, leading to long execution times and high computational costs.[1]

    [1] Beretta, S.. Algorithms for strings and sequences: Pairwise alignment. Encyclopaedia of Bioinformatics and Computational Biology. Oxford: Academic Press, 2019.

    Sequence mapping is the computational bottleneck of

    many genome assembly and genome analysis pipelines:

    Long execution time

    High computational cost

  • 5

    Burrows-Wheeler Transform

    Reversible string permutation that can be searched directly and has long strings

    of repeated characters (great for compression).

    TACGACGTCGACT$

    1 TACGACGTCGACT$

    2 ACGACGTCGACT$T

    3 CGACGTCGACT$TA

    … …

    13 T$TACGACGTCGAC

    14 $TACGACGTCGACT Sort each row by their prefixes in

    lexicographic order 14 $TACGACGTCGACT

    2 ACGACGTCGACT$T

    5 ACGTCGACT$TACG

    … …

    1 TACGACGTCGACT$

    8 TCGACT$TACGACG

    TTGGATAACCCC$G

    14-2-5-11-3-9-6-12-4-10-7-13-1-8

    BWT output

    Suffix array

    Burrows, M. and Wheeler, D. J.: A block-sorting lossless data compression algorithm. 1994.

  • BWaveR Data Structure

    6

    A flexible combination of succinct data structures, namely Wavelet Tree and RRR

    Sequences, able to store long sequences in an amount of space "close" to the theoretic

    lower bound, while still allowing for efficient rank query operations.

    S., Beller, T., Moffat, A., and Petri, M.: From theory to practice: Plug and play with succinct data structures, 2014

    RRR Sequences

    Parameters: block size & superblock factor

  • Backward Search on BWaveR Data Structure

    start = C(X) + 1

    end = C(X+1)

    INITIALIZATION ITERATIONS

    start = C(X) + rankWT(X, start-1) +1

    end = C(X) + rankWT(X, end)end < pos($)

    start = C(X) + rankWT(X, start-1) +1

    end = C(X) + rankWT(X, end-1)start < pos($) ≤ end

    start = C(X) + rankWT(X, start-2) +1

    end = C(X) + rankWT(X, end-1)start ≥ pos($)C(X)

    $ A C G T

    0 1 4 8 11

    TTGGATAACCCC$G

    7

  • Backward Search on BWaveR Data Structure

    start = C(X) + 1

    end = C(X+1)

    INITIALIZATION ITERATIONS

    start = C(X) + rankWT(X, start-1) +1

    end = C(X) + rankWT(X, end)end < pos($)

    start = C(X) + rankWT(X, start-1) +1

    end = C(X) + rankWT(X, end-1)start < pos($) ≤ end

    start = C(X) + rankWT(X, start-2) +1

    end = C(X) + rankWT(X, end-1)start ≥ pos($)C(X)

    $ A C G T

    0 1 4 8 11

    TTGGATAACCCC$G

    8

    A pattern of length p is found in O(p) time,

    independently of the reference size!

  • Custom Sequence Mapping Architecture

    FPGA-tailored optimizations:

    1. Reference structure stored in BRAM to reduce memory access time

    2. Dimensioned query sequences structures to exploit memory burst

    3. Parallel mapping of each query sequence and its reverse complement

    4. Implemented rank queries using a reduction strategy9

    Load reference structureHost DRAM FPGA

    Load query sequence

    Send mapping result

    Compute reverse complement

    Mapping via backward search

    Initialization

    Loop

  • 10

    Results - Memory Footprint

    Data

    stru

    ct s

    ize [b

    ytes

    ]

    1M

    5M

    10M

    14M

    18M

    Superblock factor2 7 17 32 50 100

    3 7 11 15

    Data

    stru

    ct s

    ize [b

    ytes

    ]

    10M

    43M

    75M

    108M

    140M

    Superblock factor2 7 17 32 50 100

    Block size

    Memory reduction up to

    63.7% w.r.t BWT sequence

    98.1% w.r.t naive Occ matrix

    E.Coli Genome Human Chr. 21

    Memory reduction up to

    68.3% w.r.t BWT sequence

    98.4% w.r.t naive Occ matrix

  • 11

    Results - Mapping Time (HW)

    Map

    pin

    g T

    ime

    [ms]

    1

    10

    100

    1000

    10000

    100000

    1 million reads 10 millions reads 100 millions reads

    BWaveR - FPGA BWaveR - CPU Bowtie2 - 1 threadBowtie2 - 8 threads Bowtie2 - 16 threads

    Time for mapping 100 bp reads against Human Chr. 21 (b = 15, sf =50)

    14x7,8x

    1,4x0,7x

    62x 42x

    7,6x4x

    70x 51x

    9,5x4,9x

  • 12

    En

    ergy

    Co

    nsu

    mp

    tio

    n [J

    ]

    1

    10

    100

    1000

    10000

    1 million reads 10 millions reads 100 millions reads

    BWaveR - FPGA BWaveR - CPU Bowtie2 - 1 threadBowtie2 - 8 threads Bowtie2 - 16 threads

    Energy for mapping 100 bp reads against Human Chr. 21 (b = 15, sf =50)

    Results - Energy Consumption (HW)

    74x42x

    7,7x4x

    337x225x

    41x21x

    380x274x

    51x27x

  • 13

    Conclusions

    This paper presents BWaveR, a fast and memory efficient sequence mapper

    leveraging succinct data structures and HW acceleration on FPGA:

    ➡ Flexibility of usage

    ➡ Great compression capabilities

    ➡ Time independence from reference size

    ➡ Low energy consumption

    ➡ Low execution time

    For questions regarding this work, email:

    Guido Walter Di Donato: [email protected]