BlueSSD: Distributed Flash Store for Big Data Analytics
Sang Woo Jun, Ming Liu, Kermin Fleming, Arvind
Computer Science and Artificial Intelligence Laboratory, MIT
Introduction – Flash Storage
• Low latency, high density
• Throughput per chip is fixed
  • Many chips are organized into multiple busses that can work concurrently
  • High throughput is achieved with more busses
• Read/write speed difference, limited write lifetime
  • Not the main focus… yet
Flash Deployment Goals
• High Capacity / Low Unit Cost
  • CORFU – shared distributed storage over a commodity network
  • TBs of storage at <1 ms latency, 1 GB/s throughput at high distribution
• High Throughput / Low Latency
  • FusionIO – maximum performance using many busses/chips and PCIe
  • 100s of GB at 100s of µs latency, 3 GB/s throughput
BlueSSD – Best of Both Worlds
• Shared distributed storage over a faster custom network to accelerate big data analytics
• PCIe
  • 8x PCIe 2.0 (~1 GB/s)
• Inter-FPGA SERDES
  • Low-latency sideband network (<1 µs, ~1 GB/s)
  • Automatic network/flow-control synthesis
The Physical System (Old)
[Diagram: Sideband Link (~1 GB/s), Flash Board (~80 MB/s), PCIe (~1 GB/s)]
The Physical System (Now – 4 Nodes)
System Configuration
• 6 Xilinx ML605 Development Boards + Hosts
• 4 Custom Flash Boards
  • 4 busses with 8 chips, 16 GB per board
• 2 Xilinx XM104 Connector Expansion Boards
• 5 SMA Connections
[Diagram: Host PC connects over PCIe to a hub node (two FPGAs with XM104 expansion boards); SMA links fan out to storage nodes FPGA1–FPGA4, each with a Custom Flash Board]
The ML605 only has one SMA port, requiring hubs
System Configuration
• A single software host can access all nodes
• All nodes have identical memory maps of the entire address space
• Requests are redirected to the node that has the data
[Diagram: same hub-and-storage-node topology as above]
Network Flash Controller
[Diagram: per-node datapath — Host PC ↔ PCIe ↔ Client Interface → Address Mapping → Flash Controller → Custom Flash Board, with SMA links through the XM104 to remote ML605 nodes]
Network Hub
• Programmatically define high-level connections
• An N-to-N crossbar-like network is generated
[Diagram: FPGA1–FPGA4 connected through an ML605 hub in a crossbar topology]
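The slide above describes declaring high-level connections and letting the tools generate an N-to-N crossbar-like network. A minimal sketch of what that enumeration amounts to, in Python — the function and node names are illustrative assumptions, not the actual synthesis tool:

```python
# Hypothetical sketch: enumerate the point-to-point links an N-to-N
# crossbar generator would have to synthesize from a list of node names.
from itertools import combinations

def crossbar_links(nodes):
    """Return every unordered pair of distinct nodes (a full N-to-N mesh)."""
    return list(combinations(nodes, 2))

# 4 nodes -> 6 bidirectional links to generate
links = crossbar_links(["FPGA1", "FPGA2", "FPGA3", "FPGA4"])
```

For N nodes this yields N·(N−1)/2 links, which is why the slide calls it "crossbar-like": every node can reach every other node directly.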
Software
• FUSE provides a file system abstraction
• A custom FUSE module interfaces with the FPGA
• The entire storage can be accessed as a single regular file
• Currently running SQLite off-the-shelf
  • How to benchmark?
[Diagram: software stack — SQLite → stdio / File System → FUSE → PCIe Driver → FPGA]
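The core job of such a FUSE module's read path is translating a byte-range read on the single backing file into whole-page requests sent down to the flash. A stdlib-only sketch under assumed parameters (an 8 KB page size and a caller-supplied `fetch_page` callback are assumptions; the real module is written against libfuse and the PCIe driver):

```python
PAGE_SIZE = 8192  # assumed flash page size; the real value depends on the chips

def read_pages(offset, size, fetch_page):
    """Serve a (offset, size) byte read from page-addressed storage, as a
    FUSE read handler backed by flash would: fetch every page the range
    touches, then trim to the requested bytes."""
    first = offset // PAGE_SIZE
    last = (offset + size - 1) // PAGE_SIZE
    data = b"".join(fetch_page(p) for p in range(first, last + 1))
    skip = offset - first * PAGE_SIZE  # bytes before the range in the first page
    return data[skip:skip + size]
```

A read that straddles a page boundary fetches both pages and returns only the four requested bytes, which is exactly the alignment work the FUSE layer hides from SQLite.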
Storage Structure
• Focusing on read-intensive workloads
  • Writes are done offline, no coherence issues
• The address space is striped across FPGAs
• Concurrent writes will require more than coherence
  • SQLite assumes exclusive access to storage
  • If we are to have more than one file, file system metadata will need to be synchronized
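Striping the address space across FPGAs can be sketched as a simple modulo mapping from a global page number to a (node, local page) pair; round-robin interleaving at page granularity is an assumption here, not a stated design detail:

```python
NUM_NODES = 4  # matches the 4-node system above

def route_page(global_page):
    """Map a global page address to the node that holds it and the page's
    local address on that node (round-robin striping across nodes)."""
    return global_page % NUM_NODES, global_page // NUM_NODES
```

Because every node holds the same mapping, any node can compute the owner of any page and redirect the request, which is what lets all nodes expose identical memory maps of the entire address space.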
Performance Measurement
Page Read Latency (µs)
  1 node:    104
  2 nodes:   141
  4 nodes:   (<180?)
  CORFU:     600
  FusionIO:  68

Throughput (MB/s)
  1 node:    85
  2 nodes:   170
  4 nodes:   (340?)
  CORFU*:    1500
  FusionIO:  3000

Throughput is bottlenecked by the custom flash card
* CORFU performance at 32 nodes
Scalability
• The latency increase is small enough to accommodate 16+ FPGAs
• A single SMA cable can accommodate the throughput of 10+ flash boards
  • More should be possible with a good topology
  • Different story if flash boards are faster (link compression?)
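The "10+ flash boards per SMA cable" claim follows from the link budget on the numbers given earlier: each flash board delivers ~80 MB/s while the SMA sideband link carries ~1 GB/s. A quick check of that arithmetic:

```python
link_mb_s = 1000   # ~1 GB/s SMA sideband link (figure from the slides)
board_mb_s = 80    # ~80 MB/s per custom flash board

# Number of boards a single link can carry before it saturates
boards_per_link = link_mb_s // board_mb_s
```

This gives 12 boards per link, consistent with the "10+" estimate — and explains the caveat: if faster flash boards raise the 80 MB/s figure, the link budget shrinks accordingly.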
Future Work (1)
• Bring up the 4-node system
• Bring up the 8-node system
  • 8 more ML605 boards have been requested from Xilinx
  • More capacity + throughput
Future Work (2)
• Offload computation to the FPGA
  • Do computation near storage
• Relational algebra processor
  • Complex analytics?
• Looking for interesting applications
Future Work (3)
• Multiple concurrent writers
  • Software-level transaction management
  • A hardware-level pseudo-filesystem is probably required
The End
• Thank you!