Post on 05-Jun-2020
HEADHardwarE Accelerated Deduplication
Final Report
CS710 Computing Acceleration with FPGA
December 9, 2016
Insu Jang
Seikwon Kim
Seonyoung Lee
Executive Summary
A-Z development of deduplication SW version of deduplication Chunking HW logic Fingerprinting HW logic HW device driver on PetalinuxMerge deduplication SW-HW
Performance BenefitARM+FPGA
8x faster than ARM only 3x faster than x86_64
With low power consumption
2
Background
3
Deduplication Concept
Data compression technique Eliminate duplicate copies
Reduce overall cost of storage
Manage data growth
4
Deduplication Concept
Cloud storage without deduplication
5
Deduplication Concept
Cloud storage with deduplication
6
Deduplication Process
File chunkingFixed size chunking
Variable size chunking
Chunk finger printing
Eliminating duplicates
7
File Chunking
Divide file into chunks
Chunk Portion of data, presumed to be duplicate
Parts that will reform a file
8
This is a chunk__
This might be a same chunk__
This is CS710__
This is a chunk__
Fixed Size Chunking
9
≠
Fixed Size Chunking
10
FF FF 00 AA BB CC 00 AA 11 AB00 DE AD BE EF CC DE AD 00 11
CC FF FF 00 AA BB CC 00 AA 11 ABCC 00 DE AD BE EF CC DE AD 00 11
5 Byte
5 Byte
Fixed Size Chunking
Pros fast way of chunking
Cons Deduplication ratio
11
FF FF 00 AA BB CC 00 AA 11 AB00 DE AD BE EF CC DE AD 00 11
CC FF FF 00 AA BB CC 00 AA 11 ABCC 00 DE AD BE EF CC DE AD 00 11
≠ ≠≠ ≠
0%
Variable Size Chunking
12
=
Variable Size Chunking
13
FF FF 00 AA BB CC 00 AA 11 AB00 DE AD BE EF CC DE AD 00 11
CC FF FF 00 AA BB CC 00 AA 11 ABCC 00 DE AD BE EF CC DE AD 00 11
Delimiter ‘CC’
Delimiter ‘CC’
Variable Size Chunking
Pros Deduplication ratio
Cons Slower than fixed chunking
14
=
50%
FF FF 00 AA BB CC 00 AA 11 AB00 DE AD BE EF CC DE AD 00 11
CC FF FF 00 AA BB CC 00 AA 11 ABCC 00 DE AD BE EF CC DE AD 00 11
=
Chunk Fingerprinting
Transform chunks into shorter values Faster to compare between chunks
15
13 00 14
13 00 14
13 00 14
00 BB 12
16 86 00
AA BE EF
Eliminating Duplicates
16
00 BB 12
16 86 00
AA BE EF
13 00 14
Restoring the Dedup File
17
Dedup_File.txt FP1 FP2 FP1 FP3 FP4 FP2 ……
FP1FP2FP3FP4……
This isdedup
good presentation
for
This is dedup This is good presentation for dedup
Deduplication Key Design Issues
How to create chunks Variable size chunks
Fast and simple rolling hash algorithm
How to finger print Collision-free hash
Murmur
• SHA-256
• FNV
18
Fast & Simple Rolling Hash
19
1 2 3 4 5 6 7 8 9 0
1 2 3+ + % X == VAL?
2 3 4+ % X == VAL?......
1-
3 4 5+ % X == VAL?2-
Finger Printing Murmur Hash
Non-cryptographic hash function
128 bit hash
Very low collision rate
Fast
20
Deduplication SW
Basic deduplication process in SWDo {
Read data from file
Perform chunking (rollingHash)
Calculate fingerprint (murmurHash)
Save <hash, chunk> to DB
Enqueue hash value into hash list
Eliminate chunk data from buffer
} while (!file.eof());
Save <filename, hash list> pair to DB
21
8KB buffer
ChunkCalculate murmurHash 0xe045f78ac…
Hardware Design
22
FPGA Key Challenges
Performance disadvantage Computation must be 7 times faster than CPU
PS 667Mhz VS PL 100Mhz
Extra overhead when FPGA is usedMemory copy from PS to PL
Memory copy from PL to PS
23
FPGA Key Technical Issues
What to parallelize Chunking hash
Finger printing hash
How to parallelize Unrolling
Memory placement
Pipelining
Communication between PS and PL DMA
24
Design in a Nutshell
25
ARM CPU
FIFO Memory
DMAMemory
Petalinux
Application
ChunkingModules
Finger PrintingModules
SD Card
Recall Fast & Simple Rolling Hash
26
1 2 3 4 5 6 7 8 9 0
1 2 3+ + % X == VAL?
2 3 4+ % X == VAL?......
1-
3 4 5+ % X == VAL?2-
SERIAL COMPUTATION1
2
BRAM Basics
Maximum partition: 128 To parallelize, BRAM must be partitioned
Read per partitioned BRAM
27
0 1 … 64 65 … 128 129 ….
BRAM Part 1 BRAM Part 2 BRAM Part 3
Can be read at onceCannot be read at once
Chunking
28
Hardware Design
Parallelized Chunk Hash
PS sends 8KB data to PL via DMA
Partition 8KB data into multiple windows
Unroll modules to calculate sum Parallelized computation
29
0 … 1023+ + 64 … 1087+ + 128 … 1151+ +
SUM Module1 SUM Module2 SUM Module3
0 1 … 64 65 … 128 129 ….
Hash Module3
Hash Module2
Parallelized Chunk Hash
Compute hash of each added values Parallelized computation
30
0 … 1023+ +
% X == VAL?
% X == VAL?
% X == VAL?Hash Module1
SUM Module1
SUM Module2
SUM Module3
64 … 1087+ +
128 … 1151+ +
Rolling Module3
Rolling Module2
Rolling Module1
Parallelized Chunk Hash
Compute rolling hash Parallelized computation
1 … 1024+ % X == VAL?0-
% X == VAL?
% X == VAL?
Hash Module1
Hash Module2
Hash Module3
65 … 1088+ 64-
129 … 1152+ 129-
Finger Printing
32
Hardware Design
Parallelized Finger Printing
Murmur hash for finger printing
str[0..3] str[4..7] str[8..11]
Multiply c1 Multiply c1 Multiply c1
ROL 15 ROL 15 ROL 15
Multiply c2 Multiply c2 Multiply c2
Seed
ROL 13 ROL 13 ROL 13
Multiply, Add
Multiply, Add
Multiply, Add
Seed Val
Hash
Parallelized Finger Printing
Hash can partly be calculated in parallel
str[0..3] str[4..7] str[8..11]
Multiply c1 Multiply c1 Multiply c1
ROL 15 ROL 15 ROL 15
Multiply c2 Multiply c2 Multiply c2
Seed
ROL 13 ROL 13 ROL 13
Multiply, Add
Multiply, Add
Multiply, Add
Seed Val
Hash
1. Compute parallelizable part simultaneously
Parallelized Finger Printing
str[0..3] str[4..7] str[8..11]
Multiply c1 Multiply c1 Multiply c1
ROL 15 ROL 15 ROL 15
Multiply c2 Multiply c2 Multiply c2
Key 1 Key 2 Key 3
Total up to 2048 keys (one key for 4 bytes data)in 45 cycles
…
Key 1 Key 2 Key 3
2. Serially compute non-parallelizable part
Parallelized Finger Printing
ROL 13 ROL 13 ROL 13
Multiply, Add
Multiply, Add
Multiply, Add
Seed Val
Hash
Calculate final hashin 8194 cycles
DMA
37
Hardware Design
PS – PL DMA Communication
Device driver for AXI DMAUsed AXI DMA library for Zynq in Github
https://github.com/bperez77/xilinx_axidmaaxi_dma_oneway_transfer(...)
Actual transfer seems to be done when HW module calls hls::stream.read()
38
SWaxi_dma_oneway
_transfer()
axi_dma_oneway
_transfer()
HWhls::stream.read()
hls::stream.write()
Device Driver
39
Hardware Design
Remaining Step
40
ARM CPU
FIFO Memory
DMAMemory
Petalinux
Application
ChunkingModules
Finger PrintingModules
SD Card
Invoke HW execution
Generic-uio doesn’t work..
Device Driver for Custom HW
Need a dedicated device driver for handling our HW module
Why not use generic-uio device driver?It seems not support custom interrupt handling
Our device driver handles interrupt when AP_DONE signal becomes high
4141
Registered interruptfor ‘hello’ HW module
Registered interrupt for AXI DMA
Device Driver for Custom HW
HW module interfaceFor standalone, auto-generated by Vivado
Generated when you export hardware
42
AXI Lite signal interface
Device Driver for Custom HW
Example: How to invoke HW execution?See device driver for standalone*(0x43c00000 + 0x0) |= 1;
In Linux device driver, it is equal to
ioremap()maps physical memory to kernel space virtual memory
43
u32 reg = ioread32(0x43c00000 + 0x0);
iowrite32(reg|0x1, 0x43c00000);
u32 reg = ioread32(0xe09a0000 + 0x0);
iowrite32(reg|0x1, 0xe09a0000);
Merge Deduplication HW/SW
Basic deduplication process in SW/HWDo {
Read data from file to fill 8KB buffer
Perform chunking to get a chunk
Calculate murmurHash for the chunk
Save <hash, chunk> pair to DB
Enqueue hash value into the hash list
Eliminate chunk data from the buffer
} while (!file.eof());
Save <filename, hash list> pair to DB
Chunking & Hashing are accelerated by HW44
Transfer buffer data via DMA to PL
Invoke PL execution and wait to be completed
Transfer result data via DMA from PL
Result
45
Experimentation
Compare 3 environment Intel x86_64 i7-6700
ARM only system (SW based Deduplication) ARM cortex A9
ARM + FPGA (HW based Deduplication)
Workload : 5 image files 88MB ~ 243MB
46
Intel x86_64 ARM cortex PS ARM + FPGA
Clock frequency 3.4GHz 667MHz 100MHz
Ratio (slower) 1 5.14 34.82
Performance : ARM vs PL [1/2]
Even Clock freq. 6 x slower than ARM SW only
ARM + FPGA shows 1.2x better performance47
ARM cortex PS ARM + FPGA
Clock freq. ratio 1 (667MHz) 6.6x (100MHz)
1.203 x1.150 x
1.227 x1.156 x
1.232 x1.193 x
1 x 1 x 1 x 1 x 1 x 1 x
0.7 x
1.0 x
1.3 x
87980600 105528056 95420240 168750056 243000056 GeomeanNo
rmal
ized
exe
cuti
on
tim
e
File size (bytes)
Deduplication time (lower is better)
ARM only ARM+FPGA
Performance : ARM vs PL [2/2]
8x better performance per unit clock
48
8.027 x 7.670 x8.184 x 7.709 x
8.219 x 7.958 x
1 x 1 x 1 x 1 x 1 x 1 x
0.0 x
2.0 x
4.0 x
6.0 x
8.0 x
10.0 x
87980600 105528056 95420240 168750056 243000056 Geomean
No
rmal
ized
exe
cuti
on
tim
e p
er u
nit
clo
ck
File size (bytes)
Deduplication time (per unit clock, lower is better)
ARM only ARM+FPGA
ARM cortex PS ARM + FPGA
Clock freq. ratio 1 (667MHz) 6.6x (100MHz)
Performance : Overview
49
8.027 x7.670 x
8.184 x7.709 x
8.219 x 7.958 x
2.873 x 2.681 x
3.693 x
2.722 x 2.903 x 2.953 x
1 x 1 x 1 x 1 x 1 x 1 x
0.0 x
1.0 x
2.0 x
3.0 x
4.0 x
5.0 x
6.0 x
7.0 x
8.0 x
9.0 x
87980600 105528056 95420240 168750056 243000056 Geomean
No
rmal
ized
exe
cuti
on
ti
me
per
un
it c
lock
File size (bytes)
Deduplication time (per unit clock, lower is better)
ARM only x86_64 ARM+FPGA
IP Utilization & latency
50
Latency (clock cycle)
Estimated utilization
3%
46%
17%
65%
0% 20% 40% 60% 80%
DSP
BRAM
FF
LUT
Power Consumption
Total on-chip power : 1.925W Dynamic : 1.752W
Static : 0.173W
51
Executive Summary
A-Z development of deduplication SW version of deduplication Chunking HW logic Fingerprinting HW logic HW device driver on PetalinuxMerge deduplication SW-HW
Performance Benefit ARM+FPGA
8x faster than ARM only 3x faster than x86_64
With low power consumption
52
53
Thank you
54
Backup
Performance : ARM vs x86_64 [1/2]
55
ARM cortex PS Intel x86_64
Clock frequency 667MHz 3.4GHz
14.242 x 14.584 x
11.297 x
14.436 x 14.432 x
1 x 1 x 1 x 1 x 1 x
0.0 x
5.0 x
10.0 x
15.0 x
87,980,600 105,528,056 95,420,240 168,750,056 243,000,056
No
rmal
ized
exe
cuti
on
tim
e
File size (bytes)
Deduplication Performance (lower is better)
ARM x86_64
Performance : ARM vs x86_64 [2/2]
56
ARM cortex PS Intel x86_64
Clock frequency 667MHz 3.4GHz
2.794 x 2.861 x
2.216 x
2.832 x 2.831 x
1 x 1 x 1 x 1 x 1 x
0.0 x
0.5 x
1.0 x
1.5 x
2.0 x
2.5 x
3.0 x
87,980,600 105,528,056 95,420,240 168,750,056 243,000,056
No
rmal
ized
exe
cuti
on
tim
e p
er u
nit
clo
ck
File size (bytes)
Deduplication Performance (per unit clock, lower is better)
ARM x86_64
Implemented Design
57