Lustre use cases in the TSUBAME2.0 supercomputer
Tokyo Institute of Technology
Dr. Hitoshi Sato
Outline
• Introduction of the TSUBAME2.0 Supercomputer
• Overview of the TSUBAME2.0 Storage Architecture
• Lustre FS use cases
TSUBAME2.0 (Nov. 2010, w/ NEC-HP)
• A green, cloud-based supercomputer at Tokyo Tech (Tokyo, Japan)
• ~2.4 PFlops (peak), 1.192 PFlops (Linpack); 4th in the TOP500 (Nov. 2010)
• Next-generation multi-core x86 CPUs + GPUs: 1,432 nodes with Intel Westmere/Nehalem-EX CPUs and 4,244 NVIDIA Tesla (Fermi) M2050 GPUs
• ~95 TB of memory with 0.7 PB/s aggregate bandwidth
• Optical dual-rail QDR InfiniBand with full bisection bandwidth (fat tree)
• 1.2 MW power, PUE = 1.28; 2nd in the Green500 (Greenest Production Supercomputer)
• VM operation (KVM), Linux + Windows HPC
TSUBAME2.0 Overview

Computing nodes: 2.4 PFlops (CPU+GPU), 224.69 TFlops (CPU only), ~100 TB memory, ~200 TB SSD
• Thin nodes: HP ProLiant SL390s G7, 1,408 nodes (32 nodes x 44 racks)
  – CPU: Intel Westmere-EP 2.93 GHz, 6 cores x 2 = 12 cores/node
  – GPU: NVIDIA M2050, 3 GPUs/node
  – Memory: 54 GB (96 GB); SSD: 60 GB x 2 = 120 GB (120 GB x 2 = 240 GB)
• Medium nodes: HP ProLiant DL580 G7, 24 nodes
  – CPU: Intel Nehalem-EX 2.0 GHz, 8 cores x 4 = 32 cores/node
  – Memory: 128 GB; SSD: 120 GB x 4 = 480 GB
• Fat nodes: HP ProLiant DL580 G7, 10 nodes
  – CPU: Intel Nehalem-EX 2.0 GHz, 8 cores x 4 = 32 cores/node
  – Memory: 256 GB (512 GB); SSD: 120 GB x 4 = 480 GB
• GSIC NVIDIA Tesla S1070 GPUs, attached via PCI-E gen2 x16 (2 slots/node)

Interconnect: full-bisection optical QDR InfiniBand network
• Core switches: Voltaire Grid Director 4700 x 12 (IB QDR, 324 ports)
• Edge switches: Voltaire Grid Director 4036 x 179 (IB QDR, 36 ports)
• Edge switches with 10GbE ports: Voltaire Grid Director 4036E x 6 (IB QDR: 34 ports, 10GbE: 2 ports)

HDD-based storage systems: 7.13 PB in total (parallel FS + home)
• Parallel FS (5.93 PB): MDS x 10 and OSS x 20 on HP DL360 G6 (30 nodes); storage: DDN SFA10000 x 5 (10 enclosures each)
• Home (1.2 PB): storage servers HP DL380 G6 (4 nodes) with BlueArc Mercury 100 x 2; storage: DDN SFA10000 x 1 (10 enclosures)
• StorageTek SL8500 tape library: ~4 PB

Other components: management servers; high-speed data transfer servers (4 for NFS/CIFS, 2 for NFS/CIFS/iSCSI); external connectivity via SuperTitanet and SuperSINET3
Usage of TSUBAME2.0 Storage
• Simulation, traditional HPC
  – Outputs of intermediate and final results: 200 dumps x 52 GB x 128 nodes of memory → 1.3 PB; computation in 4 (several) patterns → 5.2 PB
• Data-intensive computing, data analysis
  – e.g. bioinformatics, text and video analyses
  – Web text processing for acquiring knowledge from 1 bn. pages (2 TB) of HTML → 20 TB of intermediate outputs → 60 TB of final results
• Processing with commodity tools: MapReduce (Hadoop), workflow systems, RDBs, FUSE, etc.
Storage problems in cloud-based SCs
• Support for various I/O workloads
• Storage usage
• Usability
I/O workloads
• Various HPC apps run on TSUBAME2.0
• Concurrent parallel R/W I/O: MPI (MPI-IO), MPI with CUDA, OpenMP, etc. (see the sketch after this list)
• Fine-grained R/W I/O: checkpoints, temporary files; Gaussian, etc.
• Read-mostly I/O: data-intensive apps, parallel workflows, parameter surveys; array jobs, Hadoop, etc.
• Shared storage leads to I/O concentration
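To make the first workload class concrete, here is a minimal sketch of concurrent parallel writes to a single shared file with MPI-IO, using mpi4py; the target path under /work0 and the block size are hypothetical, not taken from the slides.

```python
# Minimal MPI-IO sketch: every rank writes its own block of one shared file
# concurrently. Assumes mpi4py and NumPy are available; the path is hypothetical.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

block = np.full(1 << 20, rank, dtype=np.uint8)   # 1 MiB of data per rank
fh = MPI.File.Open(comm, "/work0/example/shared.dat",
                   MPI.MODE_WRONLY | MPI.MODE_CREATE)
# Collective write: each rank targets a disjoint offset in the same file.
fh.Write_at_all(rank * block.nbytes, block)
fh.Close()
```

Run with, e.g., `mpirun -np 8 python shared_write.py`; this is exactly the access pattern the shared Lustre volumes have to absorb from many jobs at once.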
Storage Usage
• Data life cycle management
  – A few users occupy most of the storage volume on TSUBAME1.0: only 0.02% of users use more than 1 TB of storage
• Storage resource characteristics
  – HDD: ~150 MB/s, 0.16 $/GB, 10 W/disk
  – SSD: 100-1000 MB/s, 4.0 $/GB, 0.2 W/disk
  – Tape: ~100 MB/s, 1.0 $/GB, low power consumption
Usability
• Seamless data access to SCs
  – Federated storage between private PCs, lab clusters, and SCs
  – Storage services for the campus, like cloud storage services
• How to deal with large data sets
  – Transfer of big data between SCs, e.g. web data mining on TSUBAME1
  – NICT (Osaka) → Tokyo Tech (Tokyo): stage-in of 2 TB of initial data
  – Tokyo Tech → NICT: stage-out of 60 TB of results
  – Transferring the results over the Internet takes 8 days (see the estimate after this list); FedEx?
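A quick back-of-the-envelope check, not from the slides, of what an 8-day Internet transfer of 60 TB implies about the sustained end-to-end rate:

```python
# Rough estimate (assuming decimal TB and a continuous transfer) of the
# sustained rate needed to move 60 TB in 8 days.
data_bytes = 60e12            # 60 TB of stage-out results
seconds = 8 * 24 * 3600       # 8 days
rate_MBps = data_bytes / seconds / 1e6
rate_Gbps = data_bytes * 8 / seconds / 1e9
print(f"~{rate_MBps:.0f} MB/s (~{rate_Gbps:.2f} Gbit/s) sustained")
# -> roughly 87 MB/s, i.e. ~0.7 Gbit/s sustained for the whole 8 days,
#    which is why shipping disks ("FedEx?") is a serious alternative.
```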
TSUBAME2.0 Storage Overview
• Total: 11 PB (7 PB HDD, 4 PB tape); all volumes are reached over the InfiniBand QDR network used for LNET and other services
• Parallel file system volumes (Lustre on DDN SFA10k #1-#5, QDR IB (x4) x 20): "Global Work Space" #1-#3 (/work0, /work9, /work19) and "Scratch" (/gscr0), 3.6 PB
• Home volumes (DDN SFA10k #6):
  – GPFS #1-#4 with HSM (2.4 PB HDD + ~4 PB tape), exported as HOME and system application volumes via cNFS / clustered Samba, QDR IB (x4) x 8
  – HOME and iSCSI volumes served by BlueArc over NFS/CIFS/iSCSI, 1.2 PB, 10GbE x 2
• Node-local SSDs (thin node SSD, fat/medium node SSD), roughly 130-190 TB, used as scratch and grid storage
These volumes together serve several roles:
• Home storage for the computing nodes and cloud-based campus storage services
• Concurrent parallel I/O (e.g. MPI-IO)
• Fine-grained R/W I/O (checkpoints, temporary files)
• Read-mostly I/O (data-intensive apps, parallel workflows, parameter surveys)
• Data transfer service between SCs/CCs
• Backup
Lustre Configuration
• Three separate Lustre filesystems, each served by MDS x 2 and OSS x 4
• LustreFS #1: 785.05 TB, 10 bn. inodes; general purpose (MPI)
• LustreFS #2: 785.05 TB, 10 bn. inodes; scratch (Gaussian)
• LustreFS #3: 785.05 TB, 10 bn. inodes; reserved / backup
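The slides do not show client-side usage, but as a hedged illustration of how a user might spread a job's files over many OSTs on one of these filesystems, here is a small sketch using the standard `lfs setstripe` / `lfs getstripe` tools; the directory name and stripe count are hypothetical.

```python
# Sketch: create a job directory on a Lustre work space and stripe new files
# in it across 8 OSTs. Paths are illustrative; lfs must be on $PATH.
import pathlib
import subprocess

job_dir = pathlib.Path("/work0/example_user/job001")
job_dir.mkdir(parents=True, exist_ok=True)

# Files created below job_dir will inherit an 8-OST stripe layout.
subprocess.run(["lfs", "setstripe", "-c", "8", str(job_dir)], check=True)
# Inspect the resulting layout.
subprocess.run(["lfs", "getstripe", str(job_dir)], check=True)
```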
Lustre FS Performance
• I/O throughput: ~11 GB/s per filesystem, ~33 GB/s with all 3 filesystems
• Benchmark: sequential write/read with IOR-2.10.3 against one Lustre FS, 1-8 clients, 1-7 procs/client, checksum=1
• Setup: 4 OSS, 56 OSTs, clients #1-#8 connected over the IB QDR network
• Backend: DDN SFA 10k, 600 slots, 560 x 2 TB SATA 7200 rpm data disks
• Measured: 10.7 GB/s and 11 GB/s for sequential write/read (a rough per-server breakdown follows below)
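A rough sanity check, not from the slides, of what ~11 GB/s per filesystem implies per server and per OST:

```python
# Back-of-the-envelope breakdown of the ~11 GB/s aggregate IOR figure.
aggregate_gbps = 11.0        # GB/s per Lustre filesystem
n_oss, n_ost = 4, 56
print(f"per OSS: {aggregate_gbps / n_oss:.2f} GB/s")          # ~2.75 GB/s
print(f"per OST: {aggregate_gbps / n_ost * 1000:.0f} MB/s")   # ~196 MB/s
# ~196 MB/s per OST is plausible for multi-disk RAID groups built from the
# ~150 MB/s SATA drives behind the SFA10k controllers.
```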
Failure cases in production operation
• OSS hang-ups under heavy I/O load
  – A user issued many small read() operations from many Java processes
  – I/O slowed down on the OSS; the OSS went through repeated failovers and rejected connections from clients
• MDS hang-ups
  – Clients held unnecessary RPC connections during eviction processing
  – The MDS kept holding RPC locks from clients
Related Activities
• Lustre FS monitoring
• MapReduce
TSUBAME2.0 Lustre FS Monitoring (w/ DDN Japan)
Purposes:
• Real-time visualization for TSUBAME2.0 users
• Analysis for production operation and FS research
Lustre Monitoring in Detail
• Monitoring target: 2 x MDS (mdt) and 4 x OSS with 56 OSTs (14 OSTs per OSS)
  – Each server runs cerebro with an LMT server agent and a Ganglia Lustre module (gmond)
  – OST statistics: write/read bandwidth, OST usage, etc.
  – MDT statistics: open, close, getattr, setattr, link, unlink, mkdir, rmdir, statfs, rename, getxattr, inode usage, etc.
• Monitoring system (management node):
  – gmetad aggregates the Ganglia metrics for real-time cluster visualization
  – an LMT server stores the MDT/OST statistics in MySQL for research and analysis, presented through a web server / web application and the LMT GUI in a web browser
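For flavor, here is a minimal sketch, not the actual LMT or Ganglia module code, of the kind of counters such a collector reads on an OSS by parsing the per-OST stats files under /proc/fs/lustre:

```python
# Sketch of an OSS-side collector: read cumulative read/write byte counters
# for every OST from /proc/fs/lustre. Stats lines normally look like
# "<name> <count> samples [<unit>] <min> <max> <sum>"; the sum is the total.
import glob

def ost_byte_counters():
    counters = {}
    for path in glob.glob("/proc/fs/lustre/obdfilter/*/stats"):
        ost = path.split("/")[-2]
        totals = {}
        with open(path) as f:
            for line in f:
                fields = line.split()
                if len(fields) >= 7 and fields[0] in ("read_bytes", "write_bytes"):
                    totals[fields[0]] = int(fields[-1])  # cumulative byte sum
        counters[ost] = totals
    return counters

if __name__ == "__main__":
    for ost, totals in sorted(ost_byte_counters().items()):
        print(ost, totals)
```

A periodic collector would sample these counters, take differences between samples to get bandwidth, and push the results to gmond or the LMT database.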
MapReduce on the TSUBAME Supercomputer
• MapReduce
  – A programming model for large-scale data processing
  – Hadoop is a common open-source implementation
• Supercomputers
  – A candidate execution environment
  – Demanded by various application users: text processing, machine learning, bioinformatics, etc.
Problems:
• Cooperation with the existing batch-job scheduler (PBS Pro): all jobs, including MapReduce tasks, must run under the scheduler's control
• TSUBAME2.0 provides several kinds of storage for data-intensive computation: local SSD storage and parallel FSs (Lustre, GPFS)
• Cooperation with GPU accelerators: not supported in stock Hadoop
Hadoop on TSUBAME (Tsudoop)
• Script-based invocation: acquire computing nodes via PBS Pro, deploy a Hadoop environment on the fly (incl. HDFS), and execute the user's MapReduce jobs (a deployment sketch follows below)
• Various FS support: HDFS built by aggregating the local SSDs; Lustre and GPFS support to appear
• Customized Hadoop for executing CUDA programs (experimental): hybrid map task scheduling
  – Automatically detects map task characteristics by monitoring
  – Schedules map tasks to minimize overall MapReduce job execution time
  – Extends the Hadoop Pipes feature
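The slides only name the mechanism, so here is a heavily simplified, hypothetical sketch of the "script-based invocation" idea: inside a PBS Pro job, read the allocated node list, generate a minimal Hadoop 1.x-style configuration pointing HDFS at the node-local SSDs, and start the daemons. Paths, ports, and file contents are illustrative assumptions, not Tsudoop's actual code.

```python
# Hypothetical sketch of on-the-fly Hadoop deployment inside a PBS Pro job.
# Assumes a Hadoop 1.x-style install in $HADOOP_HOME; all paths and ports
# are illustrative only.
import os
import subprocess

nodes = sorted(set(open(os.environ["PBS_NODEFILE"]).read().split()))
master, workers = nodes[0], nodes[1:]
conf_dir = os.path.expanduser("~/tsudoop_conf")
os.makedirs(conf_dir, exist_ok=True)

# Point HDFS at the job's master node; store blocks on the local SSD area.
with open(os.path.join(conf_dir, "core-site.xml"), "w") as f:
    f.write(f"""<configuration>
  <property><name>fs.default.name</name><value>hdfs://{master}:9000</value></property>
  <property><name>hadoop.tmp.dir</name><value>/scr/hadoop-tmp</value></property>
</configuration>""")
with open(os.path.join(conf_dir, "slaves"), "w") as f:
    f.write("\n".join(workers) + "\n")
# (hdfs-site.xml and mapred-site.xml with mapred.job.tracker omitted for brevity)

env = dict(os.environ, HADOOP_CONF_DIR=conf_dir)
hadoop_bin = os.path.join(os.environ["HADOOP_HOME"], "bin")
subprocess.run([f"{hadoop_bin}/hadoop", "namenode", "-format"],
               input=b"Y\n", env=env, check=True)
subprocess.run([f"{hadoop_bin}/start-dfs.sh"], env=env, check=True)
subprocess.run([f"{hadoop_bin}/start-mapred.sh"], env=env, check=True)
# ... run the user's MapReduce job (e.g. bin/hadoop jar <job.jar> ...), then
# stop-mapred.sh / stop-dfs.sh before the PBS allocation ends.
```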
Thank you