HEAD: HardwarE Accelerated Deduplication

Final Report

CS710 Computing Acceleration with FPGA

December 9, 2016

Insu Jang

Seikwon Kim

Seonyoung Lee

Executive Summary

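A-Z development of deduplication
• SW version of deduplication
• Chunking HW logic
• Fingerprinting HW logic
• HW device driver on Petalinux
• Merge deduplication SW-HW

Performance benefit of ARM+FPGA
• 8x faster than ARM only and about 3x faster than x86_64, per unit clock
• About 1.2x faster than ARM only in actual deduplication time
• Low power consumption (1.925W total on-chip power)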

Background

Deduplication Concept

Data compression technique that eliminates duplicate copies of data

Reduce overall cost of storage

Manage data growth

Cloud storage without deduplication

Cloud storage with deduplication

Deduplication Process

File chunking
• Fixed size chunking
• Variable size chunking

Chunk fingerprinting

Eliminating duplicates

File Chunking

Divide file into chunks

Chunk: a portion of data that is presumed to appear more than once

Chunks are the parts from which a file is reassembled

This is a chunk__

This might be a same chunk__

This is CS710__

This is a chunk__

Fixed Size Chunking

FF FF 00 AA BB CC 00 AA 11 AB 00 DE AD BE EF CC DE AD 00 11

CC FF FF 00 AA BB CC 00 AA 11 AB CC 00 DE AD BE EF CC DE AD 00 11

Chunk size: 5 bytes

Fixed Size Chunking

Pros: fast way of chunking

Cons: low deduplication ratio (in the figure, all four chunk pairs differ: ≠ ≠ ≠ ≠)

Variable Size Chunking

Delimiter ‘CC’

Variable Size Chunking

Pros: higher deduplication ratio

Cons: slower than fixed size chunking
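The trade-off between the two chunking schemes can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, the data is the 20-byte example from the fixed size chunking slide, and the "shifted" input assumes a single CC byte inserted at the front.

```python
def fixed_chunks(data: bytes, size: int) -> list[bytes]:
    """Split data into consecutive chunks of `size` bytes (the last may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def delimiter_chunks(data: bytes, delimiter: int) -> list[bytes]:
    """Split data after every occurrence of the delimiter byte."""
    chunks, start = [], 0
    for i, byte in enumerate(data):
        if byte == delimiter:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes without a delimiter
    return chunks


original = bytes.fromhex("FF FF 00 AA BB CC 00 AA 11 AB 00 DE AD BE EF CC DE AD 00 11")
shifted = b"\xCC" + original  # assumption: one byte inserted at the front

# Chunks that both versions still have in common under each scheme.
fixed_shared = set(fixed_chunks(original, 5)) & set(fixed_chunks(shifted, 5))
variable_shared = set(delimiter_chunks(original, 0xCC)) & set(delimiter_chunks(shifted, 0xCC))

print(len(fixed_shared), len(variable_shared))
```

With fixed 5-byte chunks the inserted byte shifts every boundary, so no chunk is shared; with the delimiter 'CC' the boundaries move with the data and the original chunks are found again.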

Chunk Fingerprinting

Transform chunks into shorter values (fingerprints)

Comparing fingerprints is faster than comparing whole chunks

(Figure: example values 13 00 14, 00 BB 12, 16 86 00, AA BE EF)

Eliminating Duplicates

(Figure: example values 00 BB 12, 16 86 00, AA BE EF, 13 00 14)

Restoring the Dedup File

Dedup_File.txt: FP1 FP2 FP1 FP3 FP4 FP2 ……

Stored chunks: FP1 = "This is", FP2 = "dedup", FP3 = "good presentation", FP4 = "for", ……

Restored file: This is dedup This is good presentation for dedup
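The store-once, restore-by-fingerprint idea can be sketched as follows. The names `dedup` and `restore_chunks` are hypothetical, and SHA-256 from the standard library stands in for the fingerprint function purely for convenience; it is not the hash the project settles on.

```python
import hashlib


def dedup(chunks):
    """Return (store, recipe): unique chunks keyed by fingerprint, plus the fingerprint list."""
    store, recipe = {}, []
    for chunk in chunks:
        fp = hashlib.sha256(chunk).hexdigest()  # stand-in fingerprint
        store.setdefault(fp, chunk)             # keep only the first copy
        recipe.append(fp)                       # remember the order for restoring
    return store, recipe


def restore_chunks(store, recipe):
    """Look every fingerprint up again to rebuild the original chunk sequence."""
    return [store[fp] for fp in recipe]


chunks = [b"This is", b"dedup", b"This is", b"good presentation", b"for", b"dedup"]
store, recipe = dedup(chunks)
restored = b" ".join(restore_chunks(store, recipe))
print(len(recipe), len(store), restored.decode())
```

Six chunks go in, four are stored, and the fingerprint list alone is enough to rebuild the text.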

Deduplication Key Design Issues

How to create chunks: variable size chunks
• Fast and simple rolling hash algorithm

How to fingerprint: hash with a very low collision rate
• Murmur

• SHA-256

• FNV

Fast & Simple Rolling Hash

Data: 1 2 3 4 5 6 7 8 9 0
Window 1: (1 + 2 + 3) % X == VAL?
Window 2: previous sum - 1 + 4 = (2 + 3 + 4), % X == VAL?
Window 3: previous sum - 2 + 5 = (3 + 4 + 5), % X == VAL?
......
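A minimal sketch of this rolling sum in Python, using the same ten values and a window of 3. The function name and the values chosen for X (`modulus`) and VAL (`target`) are assumptions made for the example; the slides do not state them.

```python
def rolling_boundaries(data: bytes, window: int, modulus: int, target: int) -> list[int]:
    """Return the end offset of every window whose byte sum satisfies sum % modulus == target."""
    if len(data) < window:
        return []
    boundaries = []
    total = sum(data[:window])  # only the first window is summed from scratch
    if total % modulus == target:
        boundaries.append(window)
    for end in range(window, len(data)):
        # Slide by one byte: add the entering byte, subtract the leaving one.
        total += data[end] - data[end - window]
        if total % modulus == target:
            boundaries.append(end + 1)
    return boundaries


data = bytes([1, 2, 3, 4, 5, 6, 7, 8, 9, 0])
print(rolling_boundaries(data, window=3, modulus=5, target=2))  # X = 5, VAL = 2 (assumed)
```

Each step costs one addition and one subtraction regardless of the window size, which is what makes the scheme fast, but every step depends on the previous sum.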

Fingerprinting: Murmur Hash

Non-cryptographic hash function

128-bit hash

Very low collision rate

Deduplication SW

Basic deduplication process in SW

do {
    Read data from file
    Perform chunking (rollingHash)
    Calculate fingerprint (murmurHash)
    Save <hash, chunk> to DB
    Enqueue hash value into hash list
    Eliminate chunk data from buffer
} while (!file.eof());
Save <filename, hash list> pair to DB

(Figure: 8KB buffer → Chunk → Calculate murmurHash → 0xe045f78ac…)
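The loop above could look roughly like this in Python. Everything here is a sketch under assumptions: the window size, X and VAL constants are invented, SHA-256 stands in for murmurHash, and two dictionaries stand in for the DB.

```python
import hashlib
import io

BUFFER_SIZE = 8 * 1024
WINDOW, MODULUS, TARGET = 16, 64, 0  # hypothetical rolling hash parameters


def next_chunk_length(buf: bytes) -> int:
    """Cut after the first window whose byte sum hits TARGET, else take the whole buffer."""
    if len(buf) <= WINDOW:
        return len(buf)
    total = sum(buf[:WINDOW])
    for end in range(WINDOW, len(buf)):
        if total % MODULUS == TARGET:
            return end
        total += buf[end] - buf[end - WINDOW]
    return len(buf)


def dedup_stream(stream, filename, chunk_db, file_db):
    hash_list, buf = [], b""
    while True:
        buf += stream.read(BUFFER_SIZE - len(buf))  # refill the 8KB buffer
        if not buf:
            break                                   # end of file, nothing left
        n = next_chunk_length(buf)                  # perform chunking
        chunk = buf[:n]
        fp = hashlib.sha256(chunk).hexdigest()      # stand-in for murmurHash
        chunk_db.setdefault(fp, chunk)              # save <hash, chunk>
        hash_list.append(fp)                        # enqueue hash value
        buf = buf[n:]                               # eliminate chunk from buffer
    file_db[filename] = hash_list                   # save <filename, hash list>


chunk_db, file_db = {}, {}
payload = bytes((i * 37 + 11) % 256 for i in range(20000)) * 2
dedup_stream(io.BytesIO(payload), "example.bin", chunk_db, file_db)
print(len(file_db["example.bin"]), "chunks,", len(chunk_db), "unique")
```

Joining the stored chunks in hash-list order gives back the original bytes, which is a quick way to check such a loop.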

Hardware Design

FPGA Key Challenges

Performance disadvantage
• PS 667MHz vs PL 100MHz
• PL computation must take about 7 times fewer clock cycles than the CPU to match it

Extra overhead when FPGA is used
• Memory copy from PS to PL
• Memory copy from PL to PS

FPGA Key Technical Issues

What to parallelize
• Chunking hash
• Fingerprinting hash

How to parallelize
• Unrolling
• Memory placement
• Pipelining

Communication between PS and PL
• DMA

Design in a Nutshell

(Block diagram: SD Card, ARM CPU running Petalinux and the Application, Memory, DMA, FIFO Memory, Chunking Modules, Finger Printing Modules)

Recall Fast & Simple Rolling Hash

Data: 1 2 3 4 5 6 7 8 9 0
Window 1: (1 + 2 + 3) % X == VAL?
Window 2: previous sum - 1 + 4 = (2 + 3 + 4), % X == VAL?
Window 3: previous sum - 2 + 5 = (3 + 4 + 5), % X == VAL?
......

Serial computation

BRAM Basics

Maximum partition: 128

To parallelize, BRAM must be partitioned
• One read per BRAM partition at a time

BRAM Part 1: 0 1 …
BRAM Part 2: 64 65 …
BRAM Part 3: 128 129 ….

Elements in different partitions can be read at once; elements in the same partition cannot be read at once
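The access rule can be written down as a small Python check. The helper names are hypothetical, and the partition size of 64 elements is read off the figure (partitions starting at 0, 64, 128), which also matches 8KB split into 128 partitions.

```python
PARTITION_SIZE = 64  # elements per BRAM partition, as suggested by the figure


def partition_of(index: int) -> int:
    """Partition that holds a given element under block partitioning."""
    return index // PARTITION_SIZE


def readable_at_once(indices) -> bool:
    """True if every index falls in a different partition (one read per partition)."""
    partitions = [partition_of(i) for i in indices]
    return len(partitions) == len(set(partitions))


print(readable_at_once([0, 64, 128]))  # one element from each of three partitions
print(readable_at_once([0, 1]))        # two elements from the same partition
```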

Chunking

Hardware Design

Parallelized Chunk Hash

PS sends 8KB data to PL via DMA

Partition 8KB data into multiple windows

1. Unroll modules to calculate sum (parallelized computation)
• SUM Module1: bytes 0 + … + 1023
• SUM Module2: bytes 64 + … + 1087
• SUM Module3: bytes 128 + … + 1151

2. Compute hash of each added value (parallelized computation)
• Hash Module1: sum from SUM Module1 % X == VAL?
• Hash Module2: sum from SUM Module2 % X == VAL?
• Hash Module3: sum from SUM Module3 % X == VAL?

3. Compute rolling hash (parallelized computation)
• Rolling Module1: window 1 … 1024 = previous sum + byte 1024 - byte 0, % X == VAL?
• Rolling Module2: window 65 … 1088 = previous sum + byte 1088 - byte 64, % X == VAL?
• Rolling Module3: window 129 … 1152 = previous sum + byte 1152 - byte 128, % X == VAL?
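A software model of this partitioning, in Python, shows that the independent modules produce the same window sums as the serial rolling computation. The window of 1024 bytes and the stride of 64 come from the figure; the function names are hypothetical, and the outer loop merely stands in for hardware modules that would run side by side.

```python
WINDOW = 1024  # bytes per window, from the figure
STRIDE = 64    # distance between the starting points of neighbouring modules


def serial_window_sums(data: bytes) -> list[int]:
    """Reference: one rolling sum that walks the whole buffer."""
    total = sum(data[:WINDOW])
    sums = [total]
    for start in range(1, len(data) - WINDOW + 1):
        total += data[start + WINDOW - 1] - data[start - 1]
        sums.append(total)
    return sums


def partitioned_window_sums(data: bytes) -> list[int]:
    """Each outer iteration models one SUM module followed by its rolling module."""
    n_windows = len(data) - WINDOW + 1
    sums = [0] * n_windows
    for base in range(0, n_windows, STRIDE):
        total = sum(data[base:base + WINDOW])  # SUM module: full sum at 0, 64, 128, ...
        sums[base] = total
        for start in range(base + 1, min(base + STRIDE, n_windows)):
            total += data[start + WINDOW - 1] - data[start - 1]  # rolling module
            sums[start] = total
    return sums


buffer_8kb = bytes((i * 37 + 11) % 256 for i in range(8 * 1024))
print(partitioned_window_sums(buffer_8kb) == serial_window_sums(buffer_8kb))
```

Because no module needs a result from another one, the serial dependency is limited to the 64 rolling steps inside each module.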

Finger Printing

Hardware Design

Parallelized Finger Printing

Murmur hash for finger printing
• Per 4-byte block (str[0..3], str[4..7], str[8..11], …): Multiply c1, ROL 15
• Then, starting from Seed Val: Multiply, Add, block by block

Hash can partly be calculated in parallel

1. Compute parallelizable part simultaneously
• str[0..3] → Key 1, str[4..7] → Key 2, str[8..11] → Key 3, …
• Total up to 2048 keys (one key for 4 bytes data) in 45 cycles

2. Serially compute non-parallelizable part
• Seed Val → Multiply, Add with Key 1, then Key 2, then Key 3, …
• Calculate final hash in 8194 cycles
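The two phases can be illustrated with a small Python model. Note the assumption: for brevity this sketch uses the 32-bit MurmurHash3 variant (x86_32), not the 128-bit hash named earlier in the slides, and the function names are hypothetical. The split between the independent per-block keys and the serial mixing is the point being shown.

```python
MASK = 0xFFFFFFFF
C1, C2 = 0xCC9E2D51, 0x1B873593  # MurmurHash3 x86_32 constants


def rotl32(x: int, r: int) -> int:
    return ((x << r) | (x >> (32 - r))) & MASK


def make_key(block: bytes) -> int:
    """Parallelizable part: the key depends only on its own block (up to 4 bytes)."""
    k = int.from_bytes(block, "little")
    k = (k * C1) & MASK
    k = rotl32(k, 15)
    return (k * C2) & MASK


def murmur3_32(data: bytes, seed: int = 0) -> int:
    full = len(data) - len(data) % 4
    # Phase 1: all keys are independent, so hardware can compute them simultaneously.
    keys = [make_key(data[i:i + 4]) for i in range(0, full, 4)]
    # Phase 2: every step needs the previous h, so this part stays serial.
    h = seed & MASK
    for k in keys:
        h ^= k
        h = rotl32(h, 13)
        h = (h * 5 + 0xE6546B64) & MASK
    if full < len(data):
        h ^= make_key(data[full:])  # remaining 1 to 3 tail bytes
    # Finalization mix.
    h ^= len(data)
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK
    h ^= h >> 16
    return h


print(hex(murmur3_32(b"This is a chunk")))
```

In this model an 8KB buffer yields 2048 keys in phase 1, which corresponds to the "up to 2048 keys" on the slide, while phase 2 still has to visit the keys one after another.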

Hardware Design

PS – PL DMA Communication

Device driver for AXI DMA
• Used AXI DMA library for Zynq on GitHub: https://github.com/bperez77/xilinx_axidma
• axi_dma_oneway_transfer(...)
• Actual transfer seems to be done when HW module calls hls::stream.read()

(Figure: SW side: axi_dma_oneway_transfer(), axi_dma_oneway_transfer(); HW side: hls::stream.read(), hls::stream.write())

Device Driver

Hardware Design

Remaining Step

(Block diagram: SD Card, ARM CPU running Petalinux and the Application, Memory, DMA, FIFO Memory, Chunking Modules, Finger Printing Modules)

Invoke HW execution

Generic-uio doesn’t work.

Device Driver for Custom HW

Need a dedicated device driver for handling our HW module

Why not use the generic-uio device driver?
• It does not seem to support custom interrupt handling
• Our device driver handles the interrupt raised when the AP_DONE signal becomes high

Registered interrupt for ‘hello’ HW module

Registered interrupt for AXI DMA

HW module interface
• For standalone, auto-generated by Vivado
• Generated when you export hardware
• AXI Lite signal interface

Example: How to invoke HW execution?

See device driver for standalone:
*(volatile u32 *)(0x43c00000 + 0x0) |= 1;

In a Linux device driver, it is equal to the following
• ioremap() maps physical memory to kernel space virtual memory

With the physical address (before mapping):
u32 reg = ioread32(0x43c00000 + 0x0);
iowrite32(reg|0x1, 0x43c00000);

With the mapped virtual address:
u32 reg = ioread32(0xe09a0000 + 0x0);
iowrite32(reg|0x1, 0xe09a0000);

Merge Deduplication HW/SW

Basic deduplication process in SW/HW

do {
    Read data from file to fill 8KB buffer
    Perform chunking to get a chunk
    Calculate murmurHash for the chunk
    Save <hash, chunk> pair to DB
    Enqueue hash value into the hash list
    Eliminate chunk data from the buffer
} while (!file.eof());
Save <filename, hash list> pair to DB

Chunking & Hashing are accelerated by HW
1. Transfer buffer data via DMA to PL
2. Invoke PL execution and wait to be completed
3. Transfer result data via DMA from PL

Result

Experimentation

Compare 3 environments
• Intel x86_64: i7-6700
• ARM only system (SW based deduplication): ARM Cortex-A9
• ARM + FPGA (HW based deduplication)

Workload: 5 image files, 88MB ~ 243MB

                   Intel x86_64    ARM cortex PS    ARM + FPGA
Clock frequency    3.4GHz          667MHz           100MHz
Ratio (slower)     1               5.14             34.82

Performance : ARM vs PL [1/2]

Even though the PL clock frequency is about 6.6x slower than the ARM (SW only) clock, ARM + FPGA shows 1.2x better performance

                     ARM cortex PS    ARM + FPGA
Clock freq. ratio    1 (667MHz)       6.6x (100MHz)

Deduplication time relative to ARM only, shown as speedup (ARM only = 1 x)

File size (bytes)    ARM only    ARM+FPGA
87980600             1 x         1.203 x
105528056            1 x         1.150 x
95420240             1 x         1.227 x
168750056            1 x         1.156 x
243000056            1 x         1.232 x
Geomean              1 x         1.193 x

Performance : ARM vs PL [2/2]

8x better performance per unit clock

Deduplication time per unit clock relative to ARM only, shown as speedup (ARM only = 1 x)

File size (bytes)    ARM only    ARM+FPGA
87980600             1 x         8.027 x
105528056            1 x         7.670 x
95420240             1 x         8.184 x
168750056            1 x         7.709 x
243000056            1 x         8.219 x
Geomean              1 x         7.958 x

                     ARM cortex PS    ARM + FPGA
Clock freq. ratio    1 (667MHz)       6.6x (100MHz)
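The per-unit-clock figures appear to be the measured speedups scaled by the clock ratio of 667MHz to 100MHz. This is an inference from the reported numbers rather than something the slides state, and it can be checked with a few lines of Python (variable names are hypothetical; values are listed in file order with the geomean last):

```python
PS_CLOCK_MHZ = 667  # ARM Cortex-A9 (PS)
PL_CLOCK_MHZ = 100  # FPGA fabric (PL)

# Speedups of ARM+FPGA over ARM only, as reported on the two slides.
measured_speedup = [1.203, 1.150, 1.227, 1.156, 1.232, 1.193]
reported_per_clock = [8.027, 7.670, 8.184, 7.709, 8.219, 7.958]

clock_ratio = PS_CLOCK_MHZ / PL_CLOCK_MHZ
computed_per_clock = [s * clock_ratio for s in measured_speedup]

for computed, reported in zip(computed_per_clock, reported_per_clock):
    print(f"computed {computed:.3f}  reported {reported:.3f}")
```

The computed values agree with the reported ones to within rounding of the three-decimal inputs.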

Performance : Overview

Deduplication time per unit clock relative to ARM only, shown as speedup (ARM only = 1 x)

File size (bytes)    ARM only    x86_64     ARM+FPGA
87980600             1 x         2.873 x    8.027 x
105528056            1 x         2.681 x    7.670 x
95420240             1 x         3.693 x    8.184 x
168750056            1 x         2.722 x    7.709 x
243000056            1 x         2.903 x    8.219 x
Geomean              1 x         2.953 x    7.958 x

IP Utilization & latency

(Figure: latency in clock cycles and estimated utilization in % for each IP)

Power Consumption

Total on-chip power: 1.925W
• Dynamic: 1.752W
• Static: 0.173W

Executive Summary

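A-Z development of deduplication
• SW version of deduplication
• Chunking HW logic
• Fingerprinting HW logic
• HW device driver on Petalinux
• Merge deduplication SW-HW

Performance benefit of ARM+FPGA
• 8x faster than ARM only and about 3x faster than x86_64, per unit clock
• About 1.2x faster than ARM only in actual deduplication time
• Low power consumption (1.925W total on-chip power)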

Thank you

Backup

Performance : ARM vs x86_64 [1/2]

                   ARM cortex PS    Intel x86_64
Clock frequency    667MHz           3.4GHz

Deduplication performance relative to ARM, shown as speedup (ARM = 1 x)

File size (bytes)    ARM    x86_64
87,980,600           1 x    14.242 x
105,528,056          1 x    14.584 x
95,420,240           1 x    11.297 x
168,750,056          1 x    14.436 x
243,000,056          1 x    14.432 x

Performance : ARM vs x86_64 [2/2]

                   ARM cortex PS    Intel x86_64
Clock frequency    667MHz           3.4GHz

Deduplication performance per unit clock relative to ARM, shown as speedup (ARM = 1 x)

File size (bytes)    ARM    x86_64
87,980,600           1 x    2.794 x
105,528,056          1 x    2.861 x
95,420,240           1 x    2.216 x
168,750,056          1 x    2.832 x
243,000,056          1 x    2.831 x

Implemented Design
