Download - Decompressing Snappy Compressed Files at the Speed of …...Decompressing Snappy Compressed Files at the Speed of OpenCAPI Speaker: Jian Fang . TU Delft. 2 NVMe SNAP OpenCAPI Decom-pression.

1

Decompressing Snappy Compressed Files at the Speed of OpenCAPI

Speaker: Jian Fang TU Delft

2

NVMe

SNA

P

Dec

om-

pres

sion

Flet

cher

Memory

ARROW

Tables

OpenCAPI OpenCAPI

SNA

P

Flet

cher

Act

ions

HBM HBM

FPGA FPGA

Sort

Join

Skyline

… CPU POWER9

Spark DB DNA

Seq

… Current Project

OpenCAPI

GPU N

V Li

nk

Scalable Heterogeneous Accelerated DatabasE

SHADE

3

Big Data Processing Database Analytics

Data Movement

4

• Moving Data from/to Storage • Slow, Bottleneck • 1st step in data processing and database analytics

• Decompress-filter Solutions

storage CPU Raw Data

storage FPGA Compressed Data CPU Decompressed

Data

Filtered Data

Netezza

GZIP

LZ4 RLE LZW

???

Decompress-filter

5

Snappy compression Algorithm

• LZ77-based, byte-level, lossless

• In Hadoop ecosystem

• Support Parquet, ORC, etc.

• Low compression ratio

• Fast compression and decompression

Core i7

500MB/s Per Core

OpenCAPI Speed

Goal

Mechanism of Snappy compression

• Two kinds of token: Literal token and copy token • Find match from previous data • Example:

7

ABCDEFGHABCD

offset 8, length4

Use (8,4) to replace “ABCD”

Snappy Compression

8

… ABCDEFGHABCD

Literal

copy : offset 8 length 4

No Match!

… ABCDEFGHABCD

Match!

… (Literal, “ABCDEFGH”) (Copy, 8, 4) Output

Input

Generate a literal token

Find next match

Generate a copy token

Snappy Decompression

9

… ABCDEFGHABCD

Copy

(Literal, “ABCDEFGH”)

… ABCDEFGH

…

(Copy, 8, 4)

10

Snappy compression Algorithm

• An independent 64KB history for every 64KB data • Token: copy or literal, copy lenth<=64B • Token size: 2 Byte minimum, average ≈< 4

Byte(assumption for design) • Highly data dependent

Benchmark Compression Ratio Average Token Size DNA1 3.31 5.79 DNA2 2.36 7.38 DNA3 2.57 7.03 TPC-H 2.53 2.72

… … …

DNA data source: ftp://ftp.ncbi.nlm.nih.gov/ncbi-asn1

11

Design in FPGA

• An independent 64KB history for every 64KB data • Token: copy or literal, copy lenth<=64B • Token size: 2 Byte minimum, average ≈< 4

Byte(assumption for design) • Highly data dependent

DNA data source: ftp://ftp.ncbi.nlm.nih.gov/ncbi-asn1

4KB BRAM

16X

… 64KB

history

13

Design Challenge

• Deal with different types of tokens

• Handle various size of tokens

• Parallelize token translation

• Keep up with the interface bandwidth

14

Design Challenge





15

Design Challenge





16

Token Structure

• LZ77-based, byte-level • In Hadoop ecosystem • Support Parquet, ORC, etc. • Two types of tokens: literal tokens and copy tokens

Source: Y. Qiao. An fpga-based snappy decompressor-filter. Master’s thesis, Delft University of Technology, 2018.

Literal Token: (literal, length, data) Copy Token : (copy, offset, length)

Literal Token

17

Token Structure

• LZ77-based, byte-level • In Hadoop ecosystem • Support Parquet, ORC, etc. • Two types of tokens: literal tokens and copy tokens


Copy Token

Tokens are in various length

18

Prior Solutions: Decode all possiblities


19

Design Challenge





20

Token Dependency

Read Write

Read Write Read

Write Read Write

Token1 Token 2 Token3 Token 4

• Read – Read • No dependency

• Write – Write • Never write same bytes • No dependency

• Write – Read • Forwarding

21

Token Dependency – Address Conflict

16X

…

• Write – Write • Same block, same line

BRAMs

22


16X

…

• Write – Write • Same block, different lines

BRAMs

23


16X

…

• Read – Read

BRAMs

24

Design Challenge





token size * token/cycle * frequency * instances = OpenCAPI BW

Token size: 4B 2 tokens/cycle 250MHz --> 13 instances

25

IDEA – from token-level to BRAM-level

16X

…

16X

…

26

Architecture Overview

27


• Two-level Parser

28



29



• Parallel BRAM Operations

30



• Parallel BRAM Operations

• Relax Execution Model

31

Parallel BRAM Operations & Relax Execution Model

32

Results

• Implementation Result (on XCKU15P)

• Benchmark • Alice’s Adventures in Wonderland : 85KB (149KB) for

compressed (uncompressed) file.

• Throughput

Flip-Flop LUTs BRAMs Frequency Power

32K(3.32%) 46K(8.95%) 29(2.95%) 250MHz 2.9W

FPGA(single decompressor)

CPU(Core i7 with one thread)

Input 3GB/s 300MB/s

Output 5GB/s 500MB/s

33

Contact Me: [email protected] (Jian Fang)

Open Source

https://github.com/ChenJianyunp/ FPGA-Snappy-Decompressor.git

Publish Paper: J. Fang, J. Chen, P. Hofstee, Z. Al-Ars, J. Hidders, A High-Bandwidth Snappy Decompressor in Reconfigurable Logic (CODES+ISSS 2018)

mailto:[email protected]