Biscuit: AFrameworkforNear-Data ProcessingofBigD · PDF fileBiscuit: AFrameworkforNear-Data...
date post
07-Jul-2018Category
Documents
view
223download
0
Embed Size (px)
Transcript of Biscuit: AFrameworkforNear-Data ProcessingofBigD · PDF fileBiscuit: AFrameworkforNear-Data...
Biscuit: A Framework for Near-DataProcessing of Big Data Workloads
Boncheol GuAndre S. Yoon, Duck-Ho Bae, Insoon Jo, Jinyoung Lee, Jonghyun Yoon,
Jeong-Uk Kang, Moonsang Kwon, Chanho Yoon, Sangyeun Cho, Jaeheon Jeong, Duckhyun Chang
Memory Business, Samsung Electronics
Near-Data Processing (NDP)
Storage
Processing unit
Data
Traditional data processing
Storage
Processing unit
Data
Near-data processing
Processing
ResultHost interface
/ Network /
Near-data processing moves computation to data.
Computation is performed right at the data source.
Efficient when the cost of moving data is very high
Most prior work focuses on proving the concept of NDP.
Little attention to designing and realizing a practical framework
Realistic large application studies were omitted.
Boncheol Gu et al. 1/16
Near-Data Processing (NDP)
Storage
Processing unit
Data
Traditional data processing
Storage
Processing unit
Data
Near-data processing
Processing
ResultHost interface
/ Network /
Near-data processing moves computation to data.
Computation is performed right at the data source.
Efficient when the cost of moving data is very high
Most prior work focuses on proving the concept of NDP.
Little attention to designing and realizing a practical framework
Realistic large application studies were omitted.
Boncheol Gu et al. 1/16
Samsung NVMe SSD (PM1725)
Boncheol Gu et al. 2/16
Biscuit
A user-programmable NDP framework for SSDs and data-intensiveapplications
C++ language support (C++ standard library)
Dynamic loading of user programs
Mutithreading, multi-core support
The 1st reported product-strength NDP system
Boncheol Gu et al. 3/16
Biscuit
A user-programmable NDP framework for SSDs and data-intensiveapplications
C++ language support (C++ standard library)
Dynamic loading of user programs
Mutithreading, multi-core support
The 1st reported product-strength NDP system
Boncheol Gu et al. 3/16
SSD Hardware
DRAMDRAM
NAND flash memory
NAND flash memory
PCIe
inte
rfac
e
NDP-enabledSSD controller
[Inside of PM1725]
Item DescriptionHost interface PCIe Gen.3 4 (3.2 GB/s)Protocol NVMe 1.1Device density 1 TBSSD architecture Multiple channels/ways/coresStorage medium Multi-bit NAND flash memoryCompute resources Two ARM Cortex R7 coresfor Biscuit @750MHz with MPU
On-chip SRAM < 1 MiBDRAM 1 GiB
Hardware pattern matcher on each flash memory channel.
To search given bit patterns in the flash memory
Limitations
Low compute power, no cache coherence, a small amount of fast memory, noMMU, and restrictive synchronization primitives
Boncheol Gu et al. 4/16
SSD Hardware
DRAMDRAM
NAND flash memory
NAND flash memory
PCIe
inte
rfac
e
NDP-enabledSSD controller
[Inside of PM1725]
Item DescriptionHost interface PCIe Gen.3 4 (3.2 GB/s)Protocol NVMe 1.1Device density 1 TBSSD architecture Multiple channels/ways/coresStorage medium Multi-bit NAND flash memoryCompute resources Two ARM Cortex R7 coresfor Biscuit @750MHz with MPU
On-chip SRAM < 1 MiBDRAM 1 GiB
Hardware pattern matcher on each flash memory channel.
To search given bit patterns in the flash memory
Limitations
Low compute power, no cache coherence, a small amount of fast memory, noMMU, and restrictive synchronization primitives
Boncheol Gu et al. 4/16
Biscuit Runtime
Cooperative multithreading
A limited form of multithreading (fiber as a scheduling unit)
Less context switching overhead
Safe resource sharing without locking
Shared nothing architecture
All data transmission among threads through I/O ports
Enforced by the programming model and APIs
C++11 move semantics supported
Dynamic loader for user programs
User program as position-independent code (PIC)
Symbol relocation to locate each program in a separate address space
Boncheol Gu et al. 5/16
Biscuit Runtime
Cooperative multithreading
A limited form of multithreading (fiber as a scheduling unit)
Less context switching overhead
Safe resource sharing without locking
Shared nothing architecture
All data transmission among threads through I/O ports
Enforced by the programming model and APIs
C++11 move semantics supported
Dynamic loader for user programs
User program as position-independent code (PIC)
Symbol relocation to locate each program in a separate address space
Boncheol Gu et al. 5/16
Biscuit Runtime
Cooperative multithreading
A limited form of multithreading (fiber as a scheduling unit)
Less context switching overhead
Safe resource sharing without locking
Shared nothing architecture
All data transmission among threads through I/O ports
Enforced by the programming model and APIs
C++11 move semantics supported
Dynamic loader for user programs
User program as position-independent code (PIC)
Symbol relocation to locate each program in a separate address space
Boncheol Gu et al. 5/16
Biscuit System Architecture
Host system
libsisc
NVMe driver
Biscuit-enabled SSD
Biscuit runtime
SSD-side module
SSD-sideTask
SSD-sideTask
SSD-sideTask
fiber fiber fiber
SSD firmware
libslethost-side program
I/O requests
host-sideTask
host-sideTask
CPUs CPUs SRAM
DRAM
Hos
t i
nterf
ace c
ontro
ller
LLCNAND NAND
Channel #0FMC
NAND NAND
Channel #(nch-1)FMC
mainmemory
systemagent
PCIe I/F
hardwarepattern matcher
Boncheol Gu et al. 6/16
Development Process
Run the host program5
ARM cross compile
X86compile
Copy the module into /var/isc/slets (/dev/nvme0n1)
HostComputer
HostComputer
Host program
SSD-sidemodule
2 3
4
Write the codes1
SSD-side taskHost-side task
Boncheol Gu et al. 7/16
Development Process
Copy the module into /var/isc/slets (/dev/nvme0n1)
Run the host program HostComputer
HostComputer
4
5
Write the codes1
ARM cross compile
X86compile
SSD-side taskHost-side task
Host program
SSD-sidemodule
2 3
Boncheol Gu et al. 7/16
Development Process
Run the host program HostComputer
HostComputer
5
Write the codes1
ARM cross compile
X86compile
Copy the module into /var/isc/slets (/dev/nvme0n1)
SSD-side taskHost-side task
Host program
SSD-sidemodule
2 3
4
Boncheol Gu et al. 7/16
Development Process
Write the codes1
ARM cross compile
X86compile
Copy the module into /var/isc/slets (/dev/nvme0n1)
Run the host program
SSD-side taskHost-side task
HostComputer
HostComputer
Host program
SSD-sidemodule
2 3
4
5
Boncheol Gu et al. 7/16
Experimental Setup
H/W setup
System Dell PowerEdge R720 serverCPU 2 Intel Xeon(R) CPU E5-2640
(12 threads per socket) @2.50GHzMemory 64 GiB DRAMOS 64-bit Ubuntu 15.04
Basic performance results
Communication latency, data read bandwidth and latency
Application level results
String search, pointer chasing, and DB scan/filtering
Notations
Conv: system configuration with a default conventional SSD
Biscuit: system configuration with the Biscuit framework on the SSD
Boncheol Gu et al. 8/16
Experimental Setup
H/W setup
System Dell PowerEdge R720 serverCPU 2 Intel Xeon(R) CPU E5-2640
(12 threads per socket) @2.50GHzMemory 64 GiB DRAMOS 64-bit Ubuntu 15.04
Basic performance results
Communication latency, data read bandwidth and latency
Application level results
String search, pointer chasing, and DB scan/filtering
Notations
Conv: system configuration with a default conventional SSD
Biscuit: system configuration with the Biscuit framework on the SSD
Boncheol Gu et al. 8/16
Basic Performance Results - Data Read Bandwidth
Conv: transfer data to the host-side program
Biscuit: transfer data to the SSD-side module (i.e., internal read)
0 200 400 600 800 1000 1200Request size [KiB]
0
1
2
3
4
5
Band
widt
h [G
iB/s
]
Biscuit
Biscuit (pattern-matching)
Conv
PCIe Gen.3 x4
Config. overhead
Biscuit exploits the underutilized internal bandwidth.
Boncheol Gu et al. 9/16
Application Level Results
BiscuitSSD
Biscuit-awareDatabaseEngine
Earlyfiltering
Data analytics with a real DB engine
MariaDB 5.5.42 (XtraDB)
TPC-H dataset at a scale factor of 100 (160 GiB)
We modified the query planner to
1. identify a candidate table amenable foroffloading;
2. estimate its selectivity1 using a samplingmethod;
3. determine whether the table is indeed agood target (based on a selectivitythreshold);
4. and finally offload the identified filter to theSSD.
1a fraction of pages that satisfy filter conditions from the to