1
Decompressing Snappy Compressed Files at the Speed of OpenCAPI
Speaker: Jian Fang TU Delft
2
NVMe
SNA
P
Dec
om-
pres
sion
Flet
cher
Memory
ARROW
Tables
OpenCAPI OpenCAPI
SNA
P
Flet
cher
Act
ions
HBM HBM
FPGA FPGA
Sort
Join
Skyline
… CPU POWER9
Spark DB DNA
Seq
… Current Project
OpenCAPI
GPU N
V Li
nk
Scalable Heterogeneous Accelerated DatabasE
SHADE
3
Big Data Processing Database Analytics
Data Movement
4
• Moving Data from/to Storage • Slow, Bottleneck • 1st step in data processing and database analytics
• Decompress-filter Solutions
storage CPU Raw Data
storage FPGA Compressed Data CPU Decompressed
Data
Filtered Data
Netezza
GZIP
LZ4 RLE LZW
???
Decompress-filter
5
Snappy compression Algorithm
• LZ77-based, byte-level, lossless
• In Hadoop ecosystem
• Support Parquet, ORC, etc.
• Low compression ratio
• Fast compression and decompression
Core i7
500MB/s Per Core
OpenCAPI Speed
Goal
Mechanism of Snappy compression
• Two kinds of token: Literal token and copy token • Find match from previous data • Example:
7
ABCDEFGHABCD
offset 8, length4
Use (8,4) to replace “ABCD”
Snappy Compression
8
… ABCDEFGHABCD
Literal
copy : offset 8 length 4
No Match!
… ABCDEFGHABCD
Match!
… (Literal, “ABCDEFGH”) (Copy, 8, 4) Output
Input
Generate a literal token
Find next match
Generate a copy token
Snappy Decompression
9
… ABCDEFGHABCD
Copy
(Literal, “ABCDEFGH”)
… ABCDEFGH
…
(Copy, 8, 4)
10
Snappy compression Algorithm
• An independent 64KB history for every 64KB data • Token: copy or literal, copy lenth<=64B • Token size: 2 Byte minimum, average ≈< 4
Byte(assumption for design) • Highly data dependent
Benchmark Compression Ratio Average Token Size DNA1 3.31 5.79 DNA2 2.36 7.38 DNA3 2.57 7.03 TPC-H 2.53 2.72
… … …
DNA data source: ftp://ftp.ncbi.nlm.nih.gov/ncbi-asn1
11
Design in FPGA
• An independent 64KB history for every 64KB data • Token: copy or literal, copy lenth<=64B • Token size: 2 Byte minimum, average ≈< 4
Byte(assumption for design) • Highly data dependent
DNA data source: ftp://ftp.ncbi.nlm.nih.gov/ncbi-asn1
4KB BRAM
16X
… 64KB
history
13
Design Challenge
• Deal with different types of tokens
• Handle various size of tokens
• Parallelize token translation
• Keep up with the interface bandwidth
14
Design Challenge
• Deal with different types of tokens
• Handle various size of tokens
• Parallelize token translation
• Keep up with the interface bandwidth
15
Design Challenge
• Deal with different types of tokens
• Handle various size of tokens
• Parallelize token translation
• Keep up with the interface bandwidth
16
Token Structure
• LZ77-based, byte-level • In Hadoop ecosystem • Support Parquet, ORC, etc. • Two types of tokens: literal tokens and copy tokens
Source: Y. Qiao. An fpga-based snappy decompressor-filter. Master’s thesis, Delft University of Technology, 2018.
Literal Token: (literal, length, data) Copy Token : (copy, offset, length)
Literal Token
17
Token Structure
• LZ77-based, byte-level • In Hadoop ecosystem • Support Parquet, ORC, etc. • Two types of tokens: literal tokens and copy tokens
Source: Y. Qiao. An fpga-based snappy decompressor-filter. Master’s thesis, Delft University of Technology, 2018.
Copy Token
Tokens are in various length
18
Prior Solutions: Decode all possiblities
Source: Y. Qiao. An fpga-based snappy decompressor-filter. Master’s thesis, Delft University of Technology, 2018.
19
Design Challenge
• Deal with different types of tokens
• Handle various size of tokens
• Parallelize token translation
• Keep up with the interface bandwidth
20
Token Dependency
Read Write
Read Write Read
Write Read Write
Token1 Token 2 Token3 Token 4
• Read – Read • No dependency
• Write – Write • Never write same bytes • No dependency
• Write – Read • Forwarding
21
Token Dependency – Address Conflict
16X
…
• Write – Write • Same block, same line
BRAMs
22
Token Dependency – Address Conflict
16X
…
• Write – Write • Same block, different lines
BRAMs
23
Token Dependency – Address Conflict
16X
…
• Read – Read
BRAMs
24
Design Challenge
• Deal with different types of tokens
• Handle various size of tokens
• Parallelize token translation
• Keep up with the interface bandwidth
token size * token/cycle * frequency * instances = OpenCAPI BW
Token size: 4B 2 tokens/cycle 250MHz --> 13 instances
25
IDEA – from token-level to BRAM-level
16X
…
16X
…
26
Architecture Overview
27
Architecture Overview
• Two-level Parser
28
Architecture Overview
• Two-level Parser
29
Architecture Overview
• Two-level Parser
• Parallel BRAM Operations
30
Architecture Overview
• Two-level Parser
• Parallel BRAM Operations
• Relax Execution Model
31
Parallel BRAM Operations & Relax Execution Model
32
Results
• Implementation Result (on XCKU15P)
• Benchmark • Alice’s Adventures in Wonderland : 85KB (149KB) for
compressed (uncompressed) file.
• Throughput
Flip-Flop LUTs BRAMs Frequency Power
32K(3.32%) 46K(8.95%) 29(2.95%) 250MHz 2.9W
FPGA(single decompressor)
CPU(Core i7 with one thread)
Input 3GB/s 300MB/s
Output 5GB/s 500MB/s
33
Contact Me: [email protected] (Jian Fang)
Open Source
https://github.com/ChenJianyunp/ FPGA-Snappy-Decompressor.git
Publish Paper: J. Fang, J. Chen, P. Hofstee, Z. Al-Ars, J. Hidders, A High-Bandwidth Snappy Decompressor in Reconfigurable Logic (CODES+ISSS 2018)
Top Related