Adam Roe – HPC Solutions Architect, HPDD Technical Consulting – [email protected]
All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications and roadmaps
Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit http://www.intel.com/performance.
Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at http://www.intel.com/content/www/us/en/software/intel-solutions-for-lustre-software.html.
You may not use or facilitate the use of this document in connection with any infringement or other legal analysis concerning Intel products described herein. You agree to grant Intel a non-exclusive, royalty-free license to any patent claim thereafter drafted which includes subject matter disclosed herein.
No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.
Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.
This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.
Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.
3D XPoint, Intel, the Intel logo, Intel Core, Intel Xeon Phi, Optane and Xeon are trademarks of Intel Corporation in the U.S. and/or other countries.
* Other names and brands may be claimed as the property of others.
© 2016 Intel Corporation
Lustre is an object-based, open source, distributed, parallel, clustered file system (GPLv2)
Runs externally from the compute cluster
Accessed by clients over the network (Ethernet, InfiniBand*) – see the mount sketch after this list
Up to 512 PB file system size, 32 PB per file
Production filesystems have exceeded 2 TB/s
Designed for maximum performance at massive scale
POSIX compliant
Global, shared name space - All clients can access all data
Very resource efficient and cost effective
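As a hedged illustration of the client access model above (the hostname, fabric and filesystem name here are hypothetical), a Lustre client mounts the whole namespace in one operation:

  # Mount the filesystem 'scratch' from an MGS at 10.0.0.1 over InfiniBand (o2ib)
  mount -t lustre 10.0.0.1@o2ib:/scratch /mnt/scratch

  # Every client now sees the same global namespace
  lfs df -h /mnt/scratch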
2003: Lustre released
2007: Sun acquires CFS
2009: Oracle* acquires Sun
2010: Whamcloud* founded
2012: Whamcloud joins Intel
November 2016: Intel’s Analysis of Top 100 Systems (top100.org)
9 of Top 10 Sites
75% of Top 100
Most Adopted PFS
Most Scalable PFS
Open Source GPL v2
Commercial Packaging
Vibrant Community
[Pie chart: parallel file systems across the Top 100 – Lustre 75%, GPFS 19%, Other 1%, Unknown 7%]
Source: Chris Morrone, Lead of OpenSFS Lustre Working Group, April 2016
Commits per organization: Intel 65%, ORNL* 8%, Seagate* 6%, Cray* 6%, DDN* 3%, Atos* 3%, LLNL* 2%, CEA* 2%, IU 1%, Other 2%
Lines of code per organization: Intel 65%, ORNL 18%, Cray 4%, Atos 2%, Seagate 2%, DDN 2%, IU 1%, CEA 1%, Other 1%
Intel® Scalable System Framework – A Holistic Solution for All HPC Needs
Small Clusters Through Supercomputers
Compute and Data-Centric Computing
Standards-Based Programmability
On-Premise and Cloud-Based
Compute: Intel® Xeon® Processors, Intel® Xeon Phi™ Processors, Intel® FPGAs and Server Solutions
Memory/Storage: Intel® Solutions for Lustre*, Intel® Optane™ Technology, 3D XPoint™ Technology, Intel® SSDs
Fabric: Intel® Omni-Path Architecture, Intel® Silicon Photonics, Intel® Ethernet
Software: Intel® HPC Orchestrator, Intel® Software Tools, Intel® Cluster Ready Program, Intel Supported SDVis
Why small files present a challenge to Lustre and other parallel file systems, and how new workloads are affecting this
Lustre is traditionally designed for large sequential streaming I/O
– This generally impacts the performance with very small files
Stripe layouts generally aren’t optimised for small files
– Data access can sometimes be to just one OST
– Many of these small files could impact the performance of a single OST, slowing down other workloads on the filesystem
Smaller files need more RPCs per MB of data compared to larger files
Latency can become a challenge
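To make the single-OST point concrete, a small sketch using the standard lfs tooling (the file path is hypothetical); a one-stripe small file maps to exactly one object on one OST, so all of its RPCs land on that target:

  # Inspect the stripe layout of a small file
  lfs getstripe /mnt/scratch/trades/tick-0001.csv
  # lmm_stripe_count: 1  -> all reads/writes hit a single OST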
Ever changing workloads
– New usage models for HPC Clusters, no longer just O&G, CFD, etc.
– Now we see areas such as HPDA, ML/DL, Genomics etc.
These new workloads have a totally different set of storage requirements
– An example being FSI, small text files containing trade data
NERSC Edison Scratch filesystem: 70% of the files in the FS are smaller than 1MB
http://portal.nersc.gov/project/mpccc/baustin/NERSC_2014_Workload_Analysis_v1.1.pdf
Technical overview of Data-on-MDT and Distributed Name Space and how they actually work
DNE Phase I / II
– Phase I introduced as of Lustre 2.4, Phase II in Lustre 2.8
– Allows for additional MDTs to scale metadata performance
– Useful for small files, as metadata is usually the initial bottleneck
Data-on-MDT
– Now* in Lustre master branch, planned for Lustre 2.11
– Allows data to be written directly to the metadata target instead of going to OSTs
– Files can now be stored directly on the MDT
– Reduces the number of RPCs and access time
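A hedged sketch of what setting a DoM layout could look like with the lfs tooling planned for Lustre 2.11 (the directory name is hypothetical): the first PFL component lives on the MDT, and larger files spill over to a normal OST component:

  # Keep the first 1MB of each file on the MDT; stripe the rest to one OST
  lfs setstripe -E 1M -L mdt -E -1 -c 1 /mnt/scratch/smallfiles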
• Data-on-MDT is dependent on the Progressive File Layout (PFL) feature landing in the same release.
• EE – Intel® EE for Lustre* Software
[Roadmap: Lustre releases 2.3 through 2.10+ – DNE Phase I lands in 2.4, DNE Phase II in 2.8, Data on MDT* in 2.10+; Intel EE for Lustre releases EE 2.x and EE 3.x overlaid; 'Today' marker near 2.9]
Distributed Name Space Phase I enables metadata scaling with additional MDTs within a single namespace
Remote Directories: Specify a directory tree to exist under a specific MDT
Manually load balance inode count/load per directory/MDT
Manual allocation and monitoring required per MDT to ensure it doesn't get too full
Mature, highly stable technology since Lustre 2.4
Distributed Name Space Phase II stripes metadata, so MDTs behave more like OSTs
Stripe metadata automatically across many MDTs
A directory is no longer limited to a specific MDT; it can span many
Scale small file IOPS with multiple MDTs
New as of Lustre 2.8
Scaling the size of the namespace: more total MDT capacity equals more inodes within the filesystem
Scale the performance of your file system (not quite linearly, but it isn't too far off): more MDTs equals more filesystem IOPS
The filesystem feels better under high load: distributing the work across many MDTs increases general metadata performance for operations such as file stat
Provides granularity for sysadmins to separate groups, users and different workloads (see the sketch after the examples below)
– DNE 1 Example: Student [a-m] on MDT0 and Student [n-z] on MDT1
– DNE 2 Example: Genomics on MDT[1-4] and Engineering on MDT[5-6]
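A minimal sketch of both modes with hypothetical paths (lfs mkdir is the standard interface for DNE):

  # DNE 1 (Lustre 2.4+): pin a remote directory to MDT1
  lfs mkdir -i 1 /mnt/scratch/students-n-z

  # DNE 2 (Lustre 2.8+): stripe one directory across 4 MDTs
  lfs mkdir -c 4 /mnt/scratch/genomics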
Data-on-MDT optimizes small file IO
Avoid OST overhead (data and lock RPCs)
High-IOPS MDTs (mirrored SSD vs. RAID-6 HDD)
Avoid contention with streaming IO to OSTs
Prefetch file data with metadata
Size on MDT for regular files
Manage MDT space usage by quota
Complementary with DNE 2 striped directories
Scale small file IOPS with multiple MDTs
Increase scalability as capacity used on MDT increases
[Diagram: small file IO goes directly to the MDS – the client sends open, write data, read and attr requests to the MDS and receives layout, lock, size and read data in return]
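For the "manage MDT space usage by quota" point above, ordinary Lustre quota reporting covers MDT-resident DoM data too; a sketch with a hypothetical user:

  # Report block and inode usage/limits for one user across the filesystem
  lfs quota -u alice /mnt/scratch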
RPC counts: DoM file vs. traditional file
Glimpse-ahead – DoM file: 1 RPC (2 with GLIMPSE); traditional file: 2 RPCs (3 with GLIMPSE)
Lock on open – DoM file: 1 RPC; traditional file: 2 RPCs
Read on open – DoM file: 1 RPC + BULK if size >128k; traditional file: 3 RPCs + BULK
[Message-flow diagrams: for a DoM file the client completes MDS_GETATTR+LOCK, OPEN + IO LOCK, and OPEN with IO LOCK + DATA (plus IO BULK for files >128k) against the MDS alone; for a traditional file, MDS_GETATTR+LOCK returns without size/blocks so a GLIMPSE goes to the OST, and an OPEN to the MDS is followed by EXTENT LOCK, READ and IO BULK against the OST]
Because data is written directly to the MDT, you now need more space on the metadata side
– Historically the ratio is about 5% of the file system capacity on the MDT; with DoM this is now approximately 15%
– Leveraging features like ZFS* compression can help in this case (see the sketch below)
Leverage DNE to scale the performance and capacity
Implemented with PFL to allow files to grow beyond the MDT stripe size and to scale in production environments
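A sketch of the ZFS* compression mitigation, assuming a hypothetical pool/dataset mdt0pool/mdt0 backing the MDT:

  # Enable lz4 compression on the dataset backing the MDT
  zfs set compression=lz4 mdt0pool/mdt0

  # Check how much space compression is winning back
  zfs get compressratio mdt0pool/mdt0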
Throw flash at the problem
• High cost per gigabyte, underutilised for some workloads, low capacity; tuning can be difficult as most software isn't used to such high-performance devices
Be smart about your storage
• Use flash for Lustre MDTs with DoM, and for OST read caching with L2ARC or Intel® CAS (see the sketch below)
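A sketch of the L2ARC option, assuming a hypothetical pool ostpool backing an OST and a spare NVMe device:

  # Attach an NVMe SSD as an L2ARC read cache to the OST pool
  zpool add ostpool cache /dev/nvme0n1

  # Verify the cache vdev is online
  zpool status ostpool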
What sort of performance and scaling you can expect with DoM and DNE
ZFS read caching for high-I/O hot data on Lustre
[Chart: Ops/sec (0–12,000) over ~580 seconds, comparing L2ARC ON vs. L2ARC OFF]
The easiest and simplest solution for scaling small file performance (reads only)
4k random IOPS measurement for a Lustre OST with a single NVMe SSD
6x performance increase in IOPS using L2ARC in addition to HDDs
Imagine the possibilities with technologies such as 3D XPoint™
No need for HA or RAID protection
Single Intel® DC P3700 NVMe SSD for L2ARC on a ZPOOL comprising 16 disks – Benchmark: IOZONE
Small file creates directly on the Lustre MDT
File Create (4KiB): HDD vs. NVMe OST vs. DoM
Architecturally very different, both from a hardware and a software perspective
Space used and load on the MDT are considerably higher
4x speed-up when using DoM for small files on an NVMe Lustre MDT (~4–32KiB tested)
1.9x of that comes purely from efficiency improvements in the network, i.e. fewer and better-used RPCs
[Chart: file creates per second – 1 NVMe MDT + 1 HDD ZFS OST: 20,039; 1 NVMe MDT + 1 NVMe OST: 44,403; 1 NVMe MDT (DoM): 79,690]
See test platform at end of slide deck
Small file creates directly on the Lustre MDT – mixed file sizes (4k–32k)
File creates per second:
                     4k        8k        16k       32k
Create DoM 1 MDT     82607.92  74542.09  63684.11  45873.51
Create 1 NVMe OST    45422.87  35316.32  28675.04  26583.39
Create 1 HDD OST     21949.90   8935.82   9540.86   4657.22
See test platform at end of slide deck
Small file creates (4KiB) with DNE Phase 2
File Create (4KiB): DoM with 1–8 MDTs
Take these results with a pinch of salt; DNE 2 does scale better than this
Due to the lack of clients, adding more than 3 MDTs wasn't very effective
Even if we take the 2x MDT number, this still represents a 6.5x improvement vs. no DoM
See test platform at end of slide deck – Scaling limited due to lack of compute nodes (16x)
[Chart: file creates per second (60,000–180,000) scaling from 1x to 8x MDTs]
The Data-on-MDT functionality, despite being early technology, is showing clear performance advantages versus the traditional ways of interacting with Lustre
DNE is showing high levels of stability and good scaling to complement Data-on-MDT; a good combination for mixed-workload filesystems
Albeit not strictly required, Data-on-MDT depends in a production environment on Progressive File Layouts; extensive stability and integration testing will be required for production
#IntelLustre | [email protected] | intel.com/Lustre
Progressive File Layouts simplify usage and provide new options
Optimize performance for diverse users/applications
Low overhead for small files, high bandwidth for large files
Lower the usage barrier for new users and the administrative burden
Multiple storage classes within a single file
HDD or SSD, mirror or RAID, …
Example progressive file layout with 3 components: 1 stripe for [0, 32MB), 4 stripes for [32MB, 1GB), 128 stripes for [1GB, ∞)
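A hedged sketch of applying the 3-component example above to a directory, using the PFL syntax introduced with Lustre 2.10 (the mount point is hypothetical):

  # Component 1: 1 stripe for [0, 32MB)
  # Component 2: 4 stripes for [32MB, 1GB)
  # Component 3: 128 stripes for [1GB, EOF)
  lfs setstripe -E 32M -c 1 -E 1G -c 4 -E -1 -c 128 /mnt/scratch/mixed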
Benchmarks
• MDTEST 1.9.3
Software
• Lustre Build:
• reviews-42406
• 3.10.0-327.36.3.el7_lustre.x86_64
• Intel HPC Orchestrator GA
• OpenMPI 1.10.2
• GCC 4.8.5
Hardware
• 16 BDW-EP 2S compute nodes with E5-2690v4 CPUs
• 128GB DDR4 2400MHz Memory
• 2x Lustre MDSs (2S HSW-EP)
• 4x DC P3700 800GB SSDs each
• 8x Lustre OSSs (2S HSW-EP)
• 4x DC P3600 2TB SSDs each
• Non-blocking fat-tree topology with a single hop
• 1x Omni-Path HFI per system