Adam Roe – HPC Solutions Architect, HPDD Technical Consulting – [email protected]
All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications and roadmaps
Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit http://www.intel.com/performance.
Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at http://www.intel.com/content/www/us/en/software/intel-solutions-for-lustre-software.html.
You may not use or facilitate the use of this document in connection with any infringement or other legal analysis concerning Intel products described herein. You agree to grant Intel a non-exclusive, royalty-free license to any patent claim thereafter drafted which includes subject matter disclosed herein.
No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.
Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.
This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.
Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.
3D XPoint, Intel, the Intel logo, Intel Core, Intel Xeon Phi, Optane and Xeon are trademarks of Intel Corporation in the U.S. and/or other countries.
* Other names and brands may be claimed as the property of others.
© 2016 Intel Corporation
Lustre is an object-based, open source, distributed, parallel, clustered file system (GPLv2)
Runs externally from the compute cluster
Accessed by clients over the network (Ethernet, InfiniBand*) – see the mount sketch after this list
Up to 512 PB file system size, 32 PB per file
Production filesystems have exceeded 2 TB/s
Designed for maximum performance at massive scale
POSIX compliant
Global, shared name space - All clients can access all data
Very resource efficient and cost effective
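As a hedged illustration of the client access model above (the hostname, fabric and filesystem name here are hypothetical), a Lustre client mounts the whole namespace in one operation:

  # Mount the filesystem 'scratch' from an MGS at 10.0.0.1 over InfiniBand (o2ib)
  mount -t lustre 10.0.0.1@o2ib:/scratch /mnt/scratch

  # Every client now sees the same global namespace
  lfs df -h /mnt/scratch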
2003: Lustre released
2007: Sun acquires CFS
2009: Oracle* acquires Sun
2010: Whamcloud* founded
2012: Whamcloud joins Intel
November 2016: Intel’s Analysis of Top 100 Systems (top100.org)
9 of Top 10 Sites
75% of Top 100
Most Adopted PFS
Most Scalable PFS
Open Source GPL v2
Commercial Packaging
Vibrant Community
[Pie chart: parallel file systems across the Top 100 – Lustre 75%, GPFS 19%, Other 1%, Unknown 7%]
Source: Chris Morrone, Lead of OpenSFS Lustre Working Group, April 2016
Commits per organization: Intel 65%, ORNL* 8%, Seagate* 6%, Cray* 6%, DDN* 3%, Atos* 3%, LLNL* 2%, CEA* 2%, IU 1%, Other 2%
Lines of code per organization: Intel 65%, ORNL 18%, Cray 4%, Atos 2%, Seagate 2%, DDN 2%, IU 1%, CEA 1%, Other 1%
Intel® Scalable System Framework – A Holistic Solution for All HPC Needs
Small Clusters Through Supercomputers
Compute and Data-Centric Computing
Standards-Based Programmability
On-Premise and Cloud-Based
Compute: Intel® Xeon® Processors, Intel® Xeon Phi™ Processors, Intel® FPGAs and Server Solutions
Memory/Storage: Intel® Solutions for Lustre*, Intel® Optane™ Technology, 3D XPoint™ Technology, Intel® SSDs
Fabric: Intel® Omni-Path Architecture, Intel® Silicon Photonics, Intel® Ethernet
Software: Intel® HPC Orchestrator, Intel® Software Tools, Intel® Cluster Ready Program, Intel Supported SDVis
Why small files present a challenge to Lustre and other parallel file systems, and how new workloads are affecting this
Lustre is traditionally designed for large sequential streaming I/O
– This generally impacts the performance with very small files
Stripe layouts generally aren’t optimised for small files
– Data access can sometimes be to just one OST
– Many of these small files could impact the performance of a single OST, slowing down other workloads on the filesystem
Smaller files need more RPCs per MB of data compared to larger files
Latency can become a challenge
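To make the single-OST point concrete, a small sketch using the standard lfs tooling (the file path is hypothetical); a one-stripe small file maps to exactly one object on one OST, so all of its RPCs land on that target:

  # Inspect the stripe layout of a small file
  lfs getstripe /mnt/scratch/trades/tick-0001.csv
  # lmm_stripe_count: 1  -> all reads/writes hit a single OST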
Ever changing workloads
– New usage models for HPC Clusters, no longer just O&G, CFD, etc.
– Now we see areas such as HPDA, ML/DL, Genomics etc.
These new workloads have a totally different set of storage requirements
– An example being FSI, small text files containing trade data
NERSC Edison Scratch filesystem: 70% of the files in the FS are smaller than 1MB
http://portal.nersc.gov/project/mpccc/baustin/NERSC_2014_Workload_Analysis_v1.1.pdf
Technical overview of Data-on-MDT and Distributed Name Space and how they actually work
DNE Phase I / II
– Phase I introduced as of Lustre 2.4, Phase II in Lustre 2.8
– Allows for additional MDTs to scale metadata performance
– Useful for small files, as metadata is usually the initial bottleneck
Data-on-MDT
– Now* in Lustre master branch, planned for Lustre 2.11
– Allows data to be written directly to the metadata target instead of going to OSTs
– Files can now be stored directly on the MDT
– Reduces the number of RPCs and access time
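A hedged sketch of what setting a DoM layout could look like with the lfs tooling planned for Lustre 2.11 (the directory name is hypothetical): the first PFL component lives on the MDT, and larger files spill over to a normal OST component:

  # Keep the first 1MB of each file on the MDT; stripe the rest to one OST
  lfs setstripe -E 1M -L mdt -E -1 -c 1 /mnt/scratch/smallfiles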
• Data-on-MDT is dependent on the Progressive File Layout (PFL) feature landing in the same release.
• EE – Intel® EE for Lustre* Software
[Roadmap: Lustre releases 2.3 through 2.10+ – DNE Phase I lands in 2.4, DNE Phase II in 2.8, Data on MDT* in 2.10+; Intel EE for Lustre releases EE 2.x and EE 3.x overlaid; 'Today' marker near 2.9]
Distributed Name Space Phase I enables metadata scaling with additional MDTs within a single namespace
Remote Directories: Specify a directory tree to exist under a specific MDT
Manually load balance inode count/load per directory/MDT
Manual allocation and monitoring required per MDT to ensure it doesn't get too full
Mature, highly stable technology since Lustre 2.4
Distributed Name Space Phase II stripes metadata, so MDTs behave more like OSTs
Stripe metadata automatically across many MDTs
A directory is no longer limited to a specific MDT; it can span many
Scale small file IOPS with multiple MDTs
New as of Lustre 2.8
Scaling the size of the namespace: more total MDT capacity equals more inodes within the filesystem
Scale the performance of your file system (not quite linearly, but it isn't too far off): more MDTs equals more filesystem IOPS
The filesystem feels better under high load: distributing the work across many MDTs increases general metadata performance for operations such as file stat
Provides granularity for sysadmins to separate groups, users and different workloads (see the sketch after the examples below)
– DNE 1 Example: Student [a-m] on MDT0 and Student [n-z] on MDT1
– DNE 2 Example: Genomics on MDT[1-4] and Engineering on MDT[5-6]
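A minimal sketch of both modes with hypothetical paths (lfs mkdir is the standard interface for DNE):

  # DNE 1 (Lustre 2.4+): pin a remote directory to MDT1
  lfs mkdir -i 1 /mnt/scratch/students-n-z

  # DNE 2 (Lustre 2.8+): stripe one directory across 4 MDTs
  lfs mkdir -c 4 /mnt/scratch/genomics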
Data-on-MDT optimizes small file IO
Avoid OST overhead (data and lock RPCs)
High-IOPS MDTs (mirrored SSD vs. RAID-6 HDD)
Avoid contention with streaming IO to OSTs
Prefetch file data with metadata
Size on MDT for regular files
Manage MDT space usage by quota
Complementary with DNE 2 striped directories
Scale small file IOPS with multiple MDTs
Increase scalability as capacity used on MDT increases
[Diagram: small file IO goes directly to the MDS – the client sends open, write data, read and attr requests to the MDS and receives layout, lock, size and read data in return]
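For the "manage MDT space usage by quota" point above, ordinary Lustre quota reporting covers MDT-resident DoM data too; a sketch with a hypothetical user:

  # Report block and inode usage/limits for one user across the filesystem
  lfs quota -u alice /mnt/scratch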
RPC counts: DoM file vs. traditional file
Glimpse-ahead – DoM file: 1 RPC (2 with GLIMPSE); traditional file: 2 RPCs (3 with GLIMPSE)
Lock on open – DoM file: 1 RPC; traditional file: 2 RPCs
Read on open – DoM file: 1 RPC + BULK if size >128k; traditional file: 3 RPCs + BULK
[Message-flow diagrams: for a DoM file the client completes MDS_GETATTR+LOCK, OPEN + IO LOCK, and OPEN with IO LOCK + DATA (plus IO BULK for files >128k) against the MDS alone; for a traditional file, MDS_GETATTR+LOCK returns without size/blocks so a GLIMPSE goes to the OST, and an OPEN to the MDS is followed by EXTENT LOCK, READ and IO BULK against the OST]
Because data is written directly to the MDT, you now need more space on the metadata side
– Historically the ratio is about 5% of the file system capacity on the MDT; with DoM this is now approximately 15%
– Leveraging features like ZFS* compression can help in this case (see the sketch below)
Leverage DNE to scale the performance and capacity
Implemented with PFL to allow files to grow beyond the MDT stripe size and to scale in production environments
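A sketch of the ZFS* compression mitigation, assuming a hypothetical pool/dataset mdt0pool/mdt0 backing the MDT:

  # Enable lz4 compression on the dataset backing the MDT
  zfs set compression=lz4 mdt0pool/mdt0

  # Check how much space compression is winning back
  zfs get compressratio mdt0pool/mdt0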
Throw flash at the problem
• High cost per gigabyte, underutilised for some workloads, low capacity; tuning can be difficult as most software isn't used to such high-performance devices
Be smart about your storage
• Use flash for Lustre MDTs with DoM, and for OST read caching with L2ARC or Intel® CAS (see the sketch below)
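A sketch of the L2ARC option, assuming a hypothetical pool ostpool backing an OST and a spare NVMe device:

  # Attach an NVMe SSD as an L2ARC read cache to the OST pool
  zpool add ostpool cache /dev/nvme0n1

  # Verify the cache vdev is online
  zpool status ostpool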
What sort of performance and scaling you can expect with DoM and DNE
ZFS read caching for high-I/O hot data on Lustre
[Chart: Ops/sec (0–12,000) over ~580 seconds, comparing L2ARC ON vs. L2ARC OFF]
The easiest and simplest solution for scaling small file performance (reads only)
4k random IOPS measurement for a Lustre OST with a single NVMe SSD
6x performance increase in IOPS using L2ARC in addition to HDDs
Imagine the possibilities with technologies such as 3D XPoint™
No need for HA or RAID protection
Single Intel® DC P3700 NVMe SSD for L2ARC on a ZPOOL comprising 16 disks – Benchmark: IOZONE
Small file creates directly on the Lustre MDT
File Create (4KiB): HDD vs. NVMe OST vs. DoM
Architecturally very different, both from a hardware and a software perspective
Space used and load on the MDT are considerably higher
4x speed-up when using DoM for small files on an NVMe Lustre MDT (~4–32KiB tested)
1.9x of that comes purely from efficiency improvements in the network, i.e. fewer and better-used RPCs
[Chart: file creates per second – 1 NVMe MDT + 1 HDD ZFS OST: 20,039; 1 NVMe MDT + 1 NVMe OST: 44,403; 1 NVMe MDT (DoM): 79,690]
See test platform at end of slide deck
Small file creates directly on the Lustre MDT – mixed file sizes (4k–32k)
File creates per second:
                     4k        8k        16k       32k
Create DoM 1 MDT     82607.92  74542.09  63684.11  45873.51
Create 1 NVMe OST    45422.87  35316.32  28675.04  26583.39
Create 1 HDD OST     21949.90   8935.82   9540.86   4657.22
See test platform at end of slide deck
Small file creates (4KiB) with DNE Phase 2
File Create (4KiB): DoM with 1–8 MDTs
Take these results with a pinch of salt; DNE 2 does scale better than this
Due to the lack of clients, adding more than 3 MDTs wasn't very effective
Even if we take the 2x MDT number, this still represents a 6.5x improvement vs. no DoM
See test platform at end of slide deck – Scaling limited due to lack of compute nodes (16x)
[Chart: file creates per second (60,000–180,000) scaling from 1x to 8x MDTs]
The Data-on-MDT functionality, despite being early technology, is showing clear performance advantages versus the traditional ways of interacting with Lustre
DNE is showing high levels of stability and good scaling to complement Data-on-MDT; a good combination for mixed-workload filesystems
Albeit not strictly required, Data-on-MDT depends in a production environment on Progressive File Layouts; extensive stability and integration testing will be required for production
#IntelLustre | [email protected] | intel.com/Lustre
Progressive File Layouts simplify usage and provide new options
Optimize performance for diverse users/applications
Low overhead for small files, high bandwidth for large files
Lower the usage barrier for new users and the administrative burden
Multiple storage classes within a single file
HDD or SSD, mirror or RAID, …
Example progressive file layout with 3 components: 1 stripe for [0, 32MB), 4 stripes for [32MB, 1GB), 128 stripes for [1GB, ∞)
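A hedged sketch of applying the 3-component example above to a directory, using the PFL syntax introduced with Lustre 2.10 (the mount point is hypothetical):

  # Component 1: 1 stripe for [0, 32MB)
  # Component 2: 4 stripes for [32MB, 1GB)
  # Component 3: 128 stripes for [1GB, EOF)
  lfs setstripe -E 32M -c 1 -E 1G -c 4 -E -1 -c 128 /mnt/scratch/mixed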
Benchmarks
• MDTEST 1.9.3
Software
• Lustre Build:
• reviews-42406
• 3.10.0-327.36.3.el7_lustre.x86_64
• Intel HPC Orchestrator GA
• OpenMPI 1.10.2
• GCC 4.8.5
Hardware
• 16 BDW-EP 2S compute nodes with E5-2690v4 CPUs
• 128GB DDR4 2400MHz Memory
• 2x Lustre MDSs (2S HSW-EP)
• 4x DC P3700 800GB SSDs each
• 8x Lustre OSSs (2S HSW-EP)
• 4x DC P3600 2TB SSDs each
• Non-blocking fat-tree topology with a single hop
• 1x Omni-Path HFI per system