
HPC Technology Update

23rd September 2015

Pawsey Supercomputing Centre

Christopher Harris
George Beckett
Chris Bording
Darran Carey
Ashley Chew
Andrew Elwell
Deva Deeptimahanti
Daniel Grimwood
Valerie Maxville
Mark O’Shea
Paul Ryan
David Schibeci
Mohsin Shaikh
Brian Skjerven
Kat Southall
Kevin Stratford
Neil Stringfellow
Charlene Yang

NCI

Allan Williams

Pawsey Uptake Strategy Group

David Abramson (UQ)
Amanda Barnard (CSIRO)
Julian Gale (Curtin)
Christopher Harris (Pawsey)
Evatt Hawkes (UNSW)
David Schibeci (Pawsey)
Alf Uhlherr (CSIRO)
Andreas Wicenec (ICRAR)

Contents

1 Introduction
2 Compute Technologies
2.1 ARM Architectures
2.2 FPGA Architectures
2.3 GPU Architectures
2.4 Many-core Architectures
2.5 Multi-core x86 Architecture
2.6 SPARC Architecture
2.7 OpenPower Architecture
3 Memory Technologies
4 Interconnect Technologies
4.1 Ethernet
4.2 InfiniBand
4.3 Omni-Path
5 Storage Technologies
5.1 GPFS
5.2 Lustre
5.3 NVRAM
6 Software Technologies
6.1 Compilers
6.2 Debuggers and Profilers
6.3 Parallel Models
6.4 Resource Managers
7 Australian Research Facilities
7.1 Pawsey Supercomputing Centre
7.2 National Computational Infrastructure
7.3 Victorian Life Sciences Computation Initiative
7.4 MASSIVE
7.5 NCI Specialised Facility in Bioinformatics
7.6 The University of Queensland
8 Current Leadership HPC Systems
9 Future Leadership HPC Systems
10 Conclusion


1 Introduction

In High Performance Computing (HPC), a number of technologies are currently emerging at different places in the system architecture. This document provides an overview of the current state and public road maps for relevant compute, memory, interconnect, storage and software technologies, as well as the current and planned leadership systems at both the Australian national and international level.

While a wide range of technologies and systems are presented, it is very challenging to capture the entirety of the complex international scientific computing ecosystem. This document is the result of the efforts of the Pawsey Supercomputing Team to present an overview of the technologies and systems that are relevant to supercomputing at scale, and likely to be of interest to the Pawsey user community, to inform the expression of interest process for the Advanced Technology Cluster (ATC). The primary role of the ATC is to provide an opportunity for Pawsey users to engage with emerging HPC technologies that could potentially be incorporated into the next main Pawsey supercomputing system.

This document also identifies several key emerging technologies that are likely to be adopted by leadership HPC systems, and are relevant to the Pawsey user community. In particular, it identifies three compute architectures of potential interest: Intel Xeon Phi processors with an Intel Omni-Path interconnect, IBM Power processors with a Mellanox InfiniBand interconnect and NVIDIA GPUs, and 64-bit ARM processors. Of these architectures, the first two have a mature software stack for HPC code development, while the ARM architecture is a newcomer to the HPC field. In addition, the use of NVRAM burst buffers is also identified as a candidate for investigation to improve storage performance. For software, the recent versions of MPI (3.0) and OpenMP (4.0) include a number of features that expand their scope, in particular the latter with respect to accelerators, and in combination they will be capable of utilising the majority of the available architectures.

In terms of structure, this document reviews several different aspects of the current and future state of the High Performance Computing (HPC) community, to encompass local, national, and international facilities, and a variety of technologies. The various technologies are presented in Sections 2, 3, 4, 5, and 6. The current state of Australian supercomputing facilities available to researchers nationally is presented in Section 7, followed by the current and announced future leadership international systems in Sections 8 and 9 respectively. Finally, based on this information, several potential architectures of interest are identified in Section 10.


2 Compute Technologies

This section describes the current processor architectures commonly used in HPC, describing the currently available hardware and publicly available road maps for the future. As modern memory controllers are typically integrated into the processor, emerging memory systems integrated into the processor are also discussed in this section with the relevant host processor architecture. In particular, the architectures detailed include ARM, FPGA, GPU, Many-core, Multi-core, SPARC, and OpenPower.

2.1 ARM Architectures

ARM is a family of instruction set architectures for computer processors developed by ARM Holdings, based on a reduced instruction set computing (RISC) architecture. The ARM business model involves designing and licensing IP rather than manufacturing and selling actual semiconductor chips. ARM licenses IP to a network of partners, which includes the world’s leading semiconductor and systems companies. These partners use ARM IP designs to create and manufacture system-on-chip designs, paying ARM a license fee for the original IP and a royalty on every chip or wafer produced. In addition to processor IP, ARM provides a range of tools, physical and systems IP to enable optimised system-on-chip designs [1].

Historically, ARM designs have been very low power system-on-chips (SoCs) aimed at the mobile and tablet markets, with typical power usage of less than 10 watts. Recently, ARM has added 64-bit addressing, floating-point instructions and SIMD vectorisation, and is looking to compete with more traditional HPC architectures. In 2014, the US Department of Energy awarded Cray a research and development contract called FastForward 2 [2] to explore a range of topics around architectures of 64-bit ARM processors for high performance computing. Additionally, Cray is working with Cavium to deliver Cray clusters based on Cavium’s 48-core workload-optimised ThunderX ARM processors. The goal of this collaboration is to analyse ARM solutions for selected workloads to investigate the ARM value proposition in supercomputing.

As well as potentially serving as the primary processor for computation, the ARM architecture can be used as a host processor in accelerator architectures. In 2013, the Barcelona Supercomputing Centre installed a test system based on the NVIDIA Tegra 3 architecture [3]. This used the ARM Cortex-A9 host processor with NVIDIA Tesla K20 GPU accelerators, and a Mellanox QDR InfiniBand interconnect. This was shown to produce energy savings of 18-22% for a Lattice QCD workload compared to an Intel i5 CPU host processor with the same NVIDIA GPU [4].

In terms of software support, the ARM architecture is supported by a number of compilers. Pathscale EKOPath ARMv8 [5] is a commercial compiler that supports OpenACC [6] and OpenMP 4.0 [7] for C, C++ and Fortran. The NAG Fortran Compiler [8, 9] is a commercial compiler suite that supports OpenMP 3.1 for Fortran 95/2003/2008. The GNU Compiler Collection [10] is an open source compiler suite for C, C++ and Fortran that supports OpenMP 4.0 and has preliminary support for OpenACC 2.0a. LLVM [11] is a compiler infrastructure available under a BSD-style licence that supports a wide range of languages and architectures, including ARM, and the Clang front end for C/C++/Objective-C supports OpenMP 3.1 with partial support for 4.0. Given Cray’s high profile collaboration with ARM, it is also possible that the Cray Compiler, which supports OpenMP and OpenACC, will target the ARM architecture in the future. Debugging and trace information is available via the ARM CoreSight technology [12].

There are two main routes for ARM partner organisations to produce ARM products. They can either license and then fabricate complete ARM system-on-a-chip reference designs, or purchase an architecture license that provides greater flexibility. Companies that purchase the ARM architecture license are able to design their own processors that interface with upper levels of the software-hardware stack via a standard ARM interface. This section first outlines the low power ARM reference architectures, and then custom ARM products from Cavium and Broadcom that are relevant to HPC workloads.

ARM Reference Designs

The ARM Cortex-A57 [13], based on ARMv8, is currently the fastest reference design produced by the partners, which include AMD, Broadcom, Calxeda, Freescale, HiSilicon, NVIDIA, Qualcomm, Samsung, STMicroelectronics, and T-Platforms. The ARM Cortex-A72 [14] is the latest design. These reference designs have the following features:

Processor: ARM Cortex-A57
Architecture: ARMv8-A
Memory Addressing: 32/64 bit
Floating-point Precision: 16/32/64 bit
Performance: 4.1 to 4.76 DMIPS/MHz
Cores: Up to 4
SIMD: 128-bit registers
L1 Cache: 48 KiB Instruction, 32 KiB Data
L2 Cache: Up to 2 MiB
Debug and Trace: CoreSight DK-A57

Processor: ARM Cortex-A72
Architecture: ARMv8-A
Memory Addressing: 32/64 bit
Floating-point Precision: 16/32/64 bit
Performance: 6.3 to 7.35 DMIPS/MHz
Cores: Up to 4
SIMD: 128-bit registers
L1 Cache: 48 KiB Instruction, 32 KiB Data
L2 Cache: Up to 4 MiB
Debug and Trace: CoreSight DK-A57

Cavium

Cavium [15] is a provider of highly integrated semiconductor processors that enable intelligent networking, communications, storage, video and security applications. It has designed and produced the ThunderX ARM processor [16], which has the following features:

Processor: Cavium ThunderX
Architecture: ARMv8-A
Process: 28 nm
Memory Addressing: 32/64 bit
Floating-point Precision: 16/32/64 bit
Clock Frequency: 2.5 GHz
Cores: 48
SIMD: 128-bit registers
L1 Cache: 78 KiB Instruction, 32 KiB Data
L2 Cache: 16 MiB
Chip Interconnect: Cavium Coherent Processor Interconnect (CCPI)
Memory Controller: 4-channel DDR3/4, up to 1 TB dual socket

Broadcom

Broadcom [17] is one of the largest semiconductor companies in the world, with one of the industry’s broadest portfolios of state-of-the-art products for seamless and secure transmission of voice, video, data and multimedia. It has announced the Vulcan processor [18], claimed to deliver 90% of the performance of the Intel Xeon Haswell processor.

2.2 FPGA Architectures

The Field Programmable Gate Array (FPGA) is an integrated circuit that can be reprogrammed after manufacture. It consists of a matrix of logic blocks that are connected via programmable interconnects. This allows circuitry to be created for a specific application, resulting in reduced power consumption compared to more general purpose processors. FPGA vendors include Altera [19] and Xilinx [20], and Intel has announced it will be producing a Xeon processor with an integrated FPGA [21]. FPGAs can be programmed with VHDL [22], a hardware description language, and there is also an Altera SDK for using OpenCL [23].

2.3 GPU Architectures

The Graphics Processing Unit (GPU) is a co-processor traditionally used to offload graphics rendering operations from the CPU. Once a fixed-function configurable pipeline, the architecture became fully programmable around 2009. Since then, these accelerators have been shown to provide improved performance for a wide range of scientific algorithms. The key features of this architecture are a very wide SIMD length and an execution model of one thread per element rather than per vector. As a result, algorithms must utilise a large number of lightweight threads to take advantage of the hardware. Presently, there are two main vendors in the GPU computing space for HPC: AMD and NVIDIA.

Advanced Micro Devices (AMD)

Advanced Micro Devices (AMD) [24] offers two main lines of GPU accelerator cards: Radeon, which are consumer-focused processors, and FirePro, which are enterprise-focused processors. In the FirePro line, the S-Series are double-precision server GPUs, which currently include the S9050, S9100, S9150 and S9170, the last of which is summarised below. These cards support the OpenCL 2.0 standard via the GNU compiler collection [10], and OpenMP 4.0 with the Pathscale compiler suite [25, 26].


Processor: AMD FirePro S9170 [27]
GPU Cores: 2,816
Memory Controller: 32 GB ECC GDDR5, 2560 Gbps
Interface: PCI Express 3.0 x16, 128 Gbps
Performance: Theoretical 2.62 DP TFLOP/s
Power: 275 W

NVIDIA

NVIDIA [28] offers three main lines of GPU accelerator cards: GeForce, which are consumer-focused processors for single-precision performance; Quadro, which are enterprise-focused processors for single-precision performance; and Tesla, which are enterprise-focused processors for double-precision performance. For the Tesla models, the latest cards are the K40 with 2880 GPU cores and the dual-GPU K80 with 4992 GPU cores, both based on the NVIDIA Kepler architecture. The specification of the K40 is detailed below.

Processor: NVIDIA K40 [29, 30]
GPU Cores: 2,880
Memory Controller: 12 GB ECC GDDR5, 2304 Gbps
Interface: PCI Express 3.0 x16, 128 Gbps
Performance: Theoretical 1.66 DP TFLOP/s
Power: 245 W

NVIDIA GPUs support the CUDA [31], OpenCL [32], and OpenACC [6] standards for compute via the GNU [10], NVCC [33], and PGI [34] compilers. There is support for a number of languages including C/C++/Fortran/Python, and NVIDIA offers a number of optimised numerical libraries for a range of algorithms including Fourier transforms, dense and sparse linear algebra, neural networks, signal and image processing, computer vision, computational fluid mechanics, statistics and sorting [35]. There are a number of debugging and profiling tools for these GPUs [36, 37], including NVIDIA Nsight [38], NVIDIA Visual Profiler [39], TAU Performance System [40], VampirTrace [41], Allinea DDT [42], TotalView [43], CUDA-GDB [44] and CUDA-MEMCHECK [45].

The next generation of Tesla will use the NVIDIA Pascal architecture. NVIDIA also produces a system-on-chip (SoC) called Tegra. The latest model, Tegra X1, couples a 64-bit 8-core ARM CPU with a 256-core Maxwell GPU [46]. At this stage these hybrid architectures are not marketed for HPC applications. A number of current leadership HPC systems utilise GPU accelerators, including the ORNL Titan and CSCS Piz Daint as detailed in Section 8, and several future systems have been announced that will use GPUs, including the LLNL Sierra and ORNL Summit systems as detailed in Section 9.

2.4 Many-core Architectures

Many-core architectures are either main-socket processors or accelerator cards that utilise a large number of traditional processing cores for HPC workloads. The cores are more numerous but less powerful than in the mainstream multi-core architectures, and less numerous but more powerful than in the GPU architectures. The main HPC processor in this space is the Intel Xeon Phi. While there are also a number of other vendors’ processors in this space, including the Kalray MPPA [47, 48], PEZY Computing PEZY-SC [49, 50] and Tilera EZchip TILE [51, 52], they are yet to see widespread adoption.

Intel Xeon Phi

The Xeon Phi is a processor developed by Intel [53]. It uses the Many Integrated Core (MIC) architecture, which is based on x86. Initially, it was packaged as a co-processor attached via PCI Express v3 in a similar manner to GPUs, giving the host processor the ability to offload computationally intensive portions of programs. In addition to its large number of cores, the Xeon Phi also features a very wide SIMD unit. The current generation of Xeon Phi is code named Knights Corner (KNC), and features the following specification.

Processor: Intel Xeon Phi (KNC) [54, 55]
Core Architecture: 22 nm
Cores: 57-61
Threading: 4 SMT threads per core
Vector Instructions: 512-bit (IMCI) SIMD instruction unit
Accelerator Memory: 6-16 GB
Performance: Theoretical 1.056 DP TFLOP/s
Power: 200-245 W

The next version of the Xeon Phi will be available as a primary processor, removing the need for data transfer between the host and accelerator. It will also feature integrated MCDRAM, a high-bandwidth, low-latency memory technology. According to Intel, the new MCDRAM will offer roughly 5 times the power efficiency and 3 times the density of standard forms of memory. The MCDRAM will operate in different modes: as a traditional region of memory, as a large cache, or in a hybrid mode combining the previous two. This next generation of Xeon Phi is code named Knights Landing (KNL), and features the following specification.

Processor: Intel Xeon Phi (KNL) [56, 57]
Core Architecture: 14 nm Airmont
Cores: 60+
Threading: 4 SMT threads per core
Vector Instructions: 2x AVX-512 (512-bit) instruction units per core
Fast Memory: 16 GB MCDRAM
Performance: 3+ DP TFLOP/s announced by Intel
Power: 160-215 W

The Xeon Phi can be programmed with MPI [58], OpenACC [6], OpenCL [32] and OpenMP [7], with support from the Cray [59, 60], GNU [10] and Intel [61] compiler suites. While most of the architectures support hybrid MPI for inter-node communication, the Xeon Phi also supports a pure MPI programming model for parallelisation across its cores.

There are a number of numerical libraries for the Xeon Phi. The Intel Parallel Studio [62] contains MPI libraries and the Math Kernel Library (MKL), which contains highly-optimised BLAS, LAPACK and FFT routines that have been tuned for Intel architectures, including the Xeon Phi. MKL also supports automatic offloading: computationally intensive portions of MKL routines will automatically be executed on the Xeon Phi where it is configured as an accelerator. MAGMA [63] is a linear algebra package designed for heterogeneous architectures. It uses a hybrid approach; that is, it implements algorithms that can make use of both CPUs and accelerators. Besides providing an LAPACK/BLAS-like interface for the Xeon Phi, it also features a number of standard algorithms found in MKL and other linear algebra packages, such as LU and QR decompositions, sparse solvers, and eigensolvers. BOOST [64] is a large collection of header-based libraries that provide support for linear algebra routines, advanced data structures, geometry routines, image processing, and more. While lacking offload capabilities, BOOST can be used in native mode on the Xeon Phi. ViennaCL [65] and PARALUTION [66] both provide sparse matrix iterative solvers and preconditioners for accelerators.

For debugging and profiling, a number of tools are available. The Intel Parallel Studio [62] provides Advisor XE, VTune Amplifier XE, and Inspector XE. Other debuggers that support the Xeon Phi include Allinea DDT [42], GDB [67] and TotalView [43]. For profiling, there is also Allinea MAP [68], PAPI [69], and TAU [40].

A number of current leadership HPC systems utilise Xeon Phi (KNC) accelerators, including the NUDT Tianhe-2 and TACC Stampede as detailed in Section 8. Several systems have been announced that will use future generations of the Xeon Phi, as detailed in Section 9. For Knights Landing (KNL), these include the LANL Trinity and NERSC Cori. Following this, the next Xeon Phi architecture is Knights Hill (KNH), which will be used in the ANL Aurora system.

2.5 Multi-core x86 Architecture

The x86-64 architecture refers to the 64-bit extensions of the x86 architecture, which were developed by AMD under the Opteron brand and adopted by Intel. Intel and AMD are the two main vendors for this architecture.

Intel Xeon

For Intel [53], the server-class x86-64 CPUs are released under the Xeon brand, and are also suitable for use in HPC environments. The current HPC micro-architecture is Haswell-based (Xeon E5-26xx v3), which introduced the AVX2 instruction set. Historically, Intel has followed a tick-tock development strategy for its CPUs: a tick indicates a die shrink of an existing micro-architecture and a tock indicates a new micro-architecture, making Haswell a tock iteration. The tick iteration corresponding to the die shrink of the Haswell micro-architecture is code-named Broadwell and is currently scheduled for later this year. It will have up to 22 cores, and the maximum supported memory speed is quad-channel DDR4-2400 [70]. The architecture following Broadwell is called Skylake, which will possibly include PCI Express 4.0 and built-in Omni-Path support.

AMD

AMD [24] server processors are marketed under the Opteron brand. Their last mainstream release is the Opteron 6300/4300 series, released in 2012. The Opteron X series (X1150/X2150) was released in 2013. AMD’s next release of Opteron has not been announced; instead, the AMD Zen micro-architecture [71] will target HPC-class processors, however more detail will not be available until closer to its launch.

2.6 SPARC Architecture

The SPARC architecture is managed by SPARC International [72]. The most recent version is SPARC Version 9, which is a 64-bit architecture [73]. Fujitsu has produced the 16-core SPARC64 IXfx processor [74], used in the PRIMEHPC FX10 and a successor to the SPARC64 VIIIfx that powers the K Computer, which was #4 in the most recent Top500 list [75]. This processor has the following specification.

Processor: Fujitsu SPARC64 IXfx [74]
Core Architecture: 40 nm
Cores: 16
Memory Bandwidth: 680 Gbps
Vector Instructions: 8 SP operations per clock per core
Performance: 118.25 DP GFLOP/s
Power: 110 W

2.7 OpenPower Architecture

The long-running IBM Power series architecture has recently been re-invigorated by the creation of the OpenPower organisation [76] as a vehicle to allow development of the Power architecture together with other partners, notably Google [77], NVIDIA [28], and Mellanox [78]. The current Power8 architecture has the following specification.

Processor: IBM Power8 [79]
Core Architecture: 22 nm
Cores: Up to 12
Threading: 8 SMT threads per core
Memory Bandwidth: 1840 Gbps
Memory Feature: Coherent Accelerator Processor Interface (CAPI)

OpenPower has led to IBM’s involvement with a number of very high profile new systems, namely the ORNL Summit and LLNL Sierra, detailed in Section 9. Both will be based on the next generation Power9 architecture, and are slated for H2 2017. IBM has produced an NVIDIA GPU-accelerated Power8 server branded Firestone, which it has provided to the US Department of Energy for prototyping and development in preparation for these two future systems [80].

A number of current leadership HPC systems utilise Power processors, including the LLNL Sequoia, ANL Mira, FZJ JUQUEEN, and LLNL Vulcan as detailed in Section 8, and several future systems have been announced, including the LLNL Sierra and ORNL Summit systems as detailed in Section 9.


3 Memory Technologies

This section describes the current memory technologies used in HPC, describing the currently available hardware and publicly available road maps for the future. As modern processors integrate memory controllers, memory technologies of such processors have been detailed in the previous section. In terms of current and emerging memory technologies that are not integrated into the processor, two key areas are shared memory systems and non-volatile memory.

Shared memory systems present a unified address space to multiple processors, and can be classified based on access time from the processors into Uniform Memory Access (UMA) and Non-Uniform Memory Access (NUMA). Modern leadership supercomputers are hybrid memory systems, in which the memory space is distributed between nodes, but shared within nodes. Single socket nodes are typically UMA, while dual socket nodes are NUMA due to data transfer between the processors. There are also shared memory systems that present a large memory capacity as a unified address space to a large number of sockets, such as the SGI UV 3000 [81], which supports up to 256 processors and 64 TB of memory. However, the effective scale of such systems remains significantly smaller than the current leadership supercomputers.

The use of non-volatile memory, which retains data when not powered, is emerging in the role of main system memory in the form of the Non-Volatile Dual In-line Memory Module (NVDIMM). Vendors that have announced NVDIMM products include Intel and Micron [82], as well as Sony and Viking Technology [83]. However, the road map for operating system and application support is not clear.


4 Interconnect Technologies

This section describes the current interconnect technologies used in HPC, describing the currently available hardware and publicly available road maps for the future. In particular, the Ethernet, InfiniBand, and Omni-Path technologies have been identified as technologies that have seen or are likely to see widespread adoption, and are covered in more detail. Other technologies of note include the Fujitsu Tofu [84] and IBM Blue Gene/Q [85] interconnects.

4.1 Ethernet

Ethernet is a networking standard that has been widely adopted in commodity equipment for local area networks [86, 87]. In supercomputing centres, it is one of the technologies used for data transfer between the various systems, and the interconnect for some cluster systems. Often, 10 Gbps Ethernet is used to provide data transfer between the HPC systems and the external network, as well as 1 Gbps for low bandwidth management networks.

In terms of high bandwidth connections, Ethernet currently supports 10, 40 and 100 Gbps. The latter two are achieved by aggregating multiple 10 Gbps links, and tend to be more expensive due to this overhead.

The most recent development for Ethernet is the announcement of plans to increase the single link speed from 10 to 25 Gbps [88], allowing 50 Gbps connections with two links, and 100 Gbps with four links instead of ten. It is anticipated that this will reduce the cost of the higher bandwidth connections, due to the reduction in links.

While these developments are interesting, they are less relevant to HPC interconnects. Ethernet does not feature as a primary interconnect in any of the current or announced future leadership systems described in Sections 8 and 9.

4.2 InfiniBand

InfiniBand is a very popular interconnect technology used in a number of supercomputers due to its low latency and high bandwidth. Until recently it was available from two companies: Mellanox [78] and Intel [53] (previously QLogic). However, it appears that Intel is moving to creating its own interconnect, called Omni-Path, which is detailed in Section 4.3.

Until recently the highest speed was Mellanox’s Fourteen Data Rate (FDR) InfiniBand, capable of up to 56 Gb/s [89, 90]. However, Mellanox has released Enhanced Data Rate (EDR) InfiniBand, capable of up to 100 Gb/s. These products are marketed as ConnectX-4 adapters and SB7XXX/CS7XXX switches, and should be backwards compatible with 10/20/40/56 Gb/s InfiniBand. The next version is High Data Rate (HDR) InfiniBand [91], which is currently scheduled for 2017.

Mellanox has its own InfiniBand driver based on the OFED distribution. The current version is 3.0-1.0.1, which supports the ConnectX-4 and ConnectX-3/ConnectX-3 Pro adapters.


4.3 Omni-Path

Omni-Path, also code named Storm Lake, is a fabric architecture from Intel [53] for HPC systems [92]. It is a continuation of their current True Scale Fabric, with improvements in performance, efficiency, latency, and power consumption. A particular feature that Omni-Path will offer is 48-port switches. The benefit here is in scaling: more nodes with fewer switches reduces the average number of hops between nodes as the system scales.

The first generation of Omni-Path will have a bandwidth of 100 Gbps per port. Omni-Path will also make use of Intel’s Performance Scaled Messaging library, which has been designed to improve short message efficiency, although no performance statistics are currently available.

A key point of difference, and potentially the main advantage of Omni-Path, appears to be its hardware-fabric integration. While traditional PCI Express boards will be available, Omni-Path will also be integrated into both Xeon and Xeon Phi processors, starting with the Skylake and Knights Landing architectures respectively. In theory, this should offer benefits in terms of power consumption, latency, and bandwidth.


5 Storage Technologies

This section describes the current storage technologies used in HPC, describing the currently available hardware and publicly available road maps for the future. In terms of traditional, high performance, parallel, shared filesystems, the GPFS and Lustre filesystems are considered in detail. In addition, an emerging use of Non-Volatile Random Access Memory (NVRAM) as burst buffers is also detailed.

5.1 GPFS

The General Parallel File System (GPFS) [93, 94] is a high performance file system developed by IBM [95]. It provides concurrent high-speed file access to applications executing on multiple nodes of clusters. Many of the world’s largest commercial companies and scientific supercomputers are powered by GPFS.

GPFS supports a range of operating systems, including AIX, Linux and Windows, and can run on a heterogeneous cluster as well. It allows for hierarchical storage management: disks with different performance, locality or reliability characteristics can be grouped into different storage pools, and a file system can have tiers, for example SATA and tape. The limits on the number of nodes in the system and the number of files in a directory are relatively high compared to some other file systems.

GPFS is available via a commercial software license. The latest version is 4.1, which has enhanced security (native encryption and secure erase), improved performance (flash local read-only cache, etc.) and more extensive usability (active file management optimisation, etc.). IBM now sells GPFS as part of IBM Spectrum Scale [96], a branding for software-defined storage (SDS).

Some features of GPFS include:

• High performance support for traditional applications. It manages meta-data by using the local node when possible rather than reading meta-data into memory unnecessarily, caches data on the client side to increase the throughput of random reads, supports concurrent reads and writes by multiple programs, and provides sequential access that enables fast sorts, improving performance for query languages such as Pig and Jaql.

• High availability. It has no single directory controller or index server in charge of the whole system; instead the meta-data information is distributed across server nodes, which reduces the chance of a single point of failure. Files are striped: each file is broken into blocks and those blocks are distributed onto different nodes. To avoid data loss caused by disk failure, multiple copies of each block are kept on different disks. Other attributes that help with the availability of the system include the use of node quorums, automatic distributed node failure recovery and reassignment, etc.

• POSIX compliance. It supports full POSIX file system semantics and brings the benefit of being able to support a wide range of traditional applications and UNIX utilities such as FTP and SCP.

• Online system maintenance and capacity upgrade. Many maintenance chores, such as adding new disks or re-balancing data across disks, can be performed while the system is live, which reduces production outages and keeps the computer itself available more often.


• Handy tools for management and administration. Snapshots, data and meta-data replication, and online disk removal can all be done easily using these tools. Information life-cycle management (ILM) tools allow administrators to define the policies to be applied to the system or a subset of it and automate file placement, migration and deletion.

5.2 Lustre

Lustre [97] is a high-speed, parallel filesystem allowing thousands of clients to share the same filesystem. Version 2.0 was the last version released by Sun Microsystems before they were bought out by Oracle. It involved a major internal restructure of the source code to enable a number of new features, the most important of which was support for back-end filesystems other than ldiskfs. Originally based on ext3, more modern versions of ldiskfs became the source of ext4.

Whamcloud took over the community development of Lustre and released version 2.1, which was an attempt to make the 2.0 code-path production ready and give large supercomputing centres the confidence to move from the 1.8.x versions. This was marked as a maintenance release, which would receive long term support (i.e. 2.1.1, 2.1.2, ...). Feature releases would not receive any point releases, and were where new features would land.

Version 2.4 was the next version to be marked as a maintenance release, but as only half of the HSM code landed with 2.4, it was decided to shift the maintenance release to version 2.5, which completed the HSM code landings. Version 2.5.3 is the last community maintenance release, as Intel (who bought Whamcloud) have decided to no longer provide a community maintenance release. They will continue to release 2.5.x versions to their customers.

The current feature release is 2.7, though version 2.8 should be released soon. The feature releases should be governed by the community, and there is a road map for versions 2.9 and 2.10. It is unclear which version will be used for the next maintenance release, but there are a number of features in 2.8:

• ZFS as a backing file system (this landed in 2.1, but has had continuous development)

• Project quotas (being able to assign a quota to a directory)
• Distributed Namespace - scaling over multiple meta-data servers to improve IOPS

5.3 NVRAM

A burst buffer constructed from Non-Volatile Random Access Memory (NVRAM) offers a mechanism to bridge the gap between relatively high-speed on-board dynamic RAM and much slower access to a traditional off-board parallel file system. NVRAM would then provide a high-throughput, low-latency I/O buffer to the file system, for which one could imagine various uses [98]. The burst buffer represents something of a new idea in HPC, to the extent that there is little actual experience in the community of its use, but new systems are almost certain to include some capacity of this sort in the near future.

NVRAM is sometimes referred to as Solid State Drive (SSD), as opposed to spinning disk or Hard Disk Drive (HDD), both of which are non-volatile. The term flash memory is also often used to encompass solid state NVRAM. However, flash is also sometimes used in the context of lower-quality, less reliable, and slower memory hardware used in other types of storage such as USB drives.

In HPC systems, NVRAM could take the role of local disk at the node level. A data-intensive application could stage to local NVRAM frequently accessed data which cannot currently be held in memory, but is slow to access repeatedly from disk.

It is probable that all vendors will have some NVRAM offering in the near future; for example, Cray currently offers it under the DataWarp branding [99]. This offers the ability to configure available NVRAM for different uses or degrees of locality, such as private per-job usage, general file system buffering, or some mix of the two.


6 Software Technologies

This section provides an overview of the software technologies used in HPC, describing the currently available suites and any announced features to be supported in the future. In particular, details of compilers, debuggers and profilers, parallel models, and resource managers are presented.

6.1 Compilers

Most high performance computing programs utilise compilers to turn the source code into executable binaries that can be run by the system. This section details the major compilers used in high performance computing.

The Cray Compiler [59, 60] is a compiler suite provided by Cray on their systems. It supports C/C++/Fortran codes, and OpenMP 3.1.

The GNU Compiler Collection [10] is an open source collection of compilers from the GNU project. It supports C/C++/Fortran codes, and OpenMP 3.1 for ARM, Power, and x86-64 processors, and many others.

The Intel Compiler [61] is a commercial compiler suite produced by Intel, and is freely available for academic researchers. It supports C/C++/Fortran codes and OpenMP 3.1 for Intel x86-64 processors.

The LLVM Compiler [11] is a compiler infrastructure available under a BSD-style licence. It supports a wide range of languages and architectures, including ARM and x86-64. The Clang front end for C/C++/Objective-C supports OpenMP 3.1 with partial support for 4.0, and there are a number of other front ends for other languages, including Fortran.

The NAG Fortran Compiler [8] is a commercial compiler suite from the Numerical Algorithms Group (NAG). It supports OpenMP 3.1 for Fortran 95/2003/2008 on ARM and x86-64 processors.

Pathscale has a number of commercial compilers available for different architectures. EKOPath [5] is for ARM and x86-64 processors and supports C/C++/Fortran codes with OpenMP 3.0. ENZO [25] is for both NVIDIA and AMD GPUs and supports C/C++/Fortran codes with OpenACC and OpenMP 4.0. The latter supports ARM and x86-64 host processors, and Pathscale has announced Power support in the future.

The PGI Compiler [34] is a commercial compiler suite from PGI. It supports C/C++/Fortran codes with CUDA and OpenACC on NVIDIA GPUs and x86-64 processors.

Oracle Solaris Studio [100] is a development platform produced by Oracle [101], freely available under a license agreement. It supports compilation of C/C++/Fortran codes, and the OpenMP 4.0 specification, for both SPARC and x86-based Oracle systems. It also contains profiling, debugging, and high performance libraries.

XL [102, 103] is a commercial compiler suite produced by IBM. It supports C/C++/Fortran codes for Power processors with OpenMP 3.1 and partial OpenMP 4.0 support. It also supports integration with the Clang front end.


6.2 Debuggers and Profilers

Developing or porting scientific applications on modern supercomputers is a challenging task, especially when it comes to fixing bugs, profiling and optimising workloads. For this reason, tools such as debuggers and profilers play a significant role in allowing users to resolve these issues quickly and focus on their core research activity. Currently, there are many licensed and open-source tools available online, some of which perform debugging, profiling, or both.

Debuggers

Allinea Forge [104] (licensed) is a complete graphical tool suite that provides everything needed to debug, profile, optimise, edit and build parallel and multi-threaded C, C++ and F90 applications on CPUs, GPUs and co-processors. It includes the Allinea Distributed Debugging Tool (DDT), a general purpose parallel debugger which allows users to interactively control the pace of execution of a program graphically, and Allinea MAP, an optimisation and profiling tool that shows which lines of code are slow without instrumenting the code, thereby avoiding the danger of creating large, unmanageable data files.

TotalView [43] (licensed) is a GUI-based debugging tool that allows simultaneous debugging of many processes and threads in a single window. TotalView works with C, C++, and Fortran applications written for Linux, including Cray, Blue Gene, and Xeon Phi coprocessor platforms, and supports OpenMP, MPI, OpenACC and CUDA.

GNU debugger (gdb) [67] (GPL licensed) provides a command line interface to quickly and easily examine a core file that was produced when an execution crashed, to give an approximate traceback.

Valgrind [105] (GPL license) provides a number of debugging and profiling tools to automatically detect many memory management and threading bugs. Memcheck, the most popular tool, is used to detect many memory-related errors that are common in C and C++ programs.

Stack Trace Analysis Tool (STAT) [106] (open source) can gather and merge stack traces from a parallel application’s processes to identify groups of processes in a large scale parallel application that exhibit similar behaviour, and visualise this information using 2D spatial and 3D spatial-temporal call graphs. This is similar to the Cray Abnormal Termination Processing (ATP) tool.

Other debuggers include CUDA-GDB [44], CUDA-MEMCHECK [45], and the Python debugger (pdb) [107].

Profilers

Open|SpeedShop [108] is an open source, multi-platform Linux performance tool which is targeted to support performance analysis of applications running on both single node and large scale IA64, IA32, EM64T, AMD64, Power, ARM, Blue Gene and Cray platforms. The tool is built on top of a broad list of community infrastructures, such as DynInst, MRNet, PAPI and HPCToolkit.

HPCToolkit [109] (BSD license) supports measurement and analysis of serial, threaded, MPI and hybrid (MPI+threads) parallel codes. It uses statistical sampling (low overhead compared to tracing) of timers and hardware performance counters to collect accurate measurements of a program’s work, resource consumption, and inefficiency, and attributes them to the full calling context in which they occur.

The Cray Performance Tools [110] (for Cray systems) are used to profile and optimise code, and contain three main tools. CrayPat, a performance analysis tool, is used to analyse the performance of codes running on Cray systems in terms of execution time, load imbalance time, communication, cache, FLOPS, etc.; it first samples the application to identify the key events that are to be traced during the tracing experiment. Apprentice2 is used to display the performance information gathered by CrayPat in a graphical form. Reveal, an extension to CrayPat, is an integrated performance analysis and code optimisation tool developed to provide GUI support for navigating through the annotated source code, while displaying information about loops and inlined code. Reveal can work out the OpenMP scope of variables within loops and suggests OpenMP directives that need to be inserted in the source code.

The IBM High Performance Computing Toolkit [111] (for IBM systems) is a suite of performance-related tools and libraries to assist in application tuning of sequential and parallel applications using the MPI and OpenMP paradigms. The dimensions of the performance data provided by this toolkit include application profiling, hardware performance counters, MPI profiling/tracing, OpenMP profiling, and application I/O profiling.

Intel Trace Analyzer and Collector (ITAC) [112] is a tool for visualising and understanding the behaviour of MPI applications, evaluating load balancing, learning more about communication patterns and identifying communication hotspots on Intel architectures.

The NVIDIA Profiler (nvprof) [113] enables the collection of a time line of CUDA-related activities on both CPU and GPU, including kernel execution, memory transfers, memory sets and CUDA API calls, and allows that data to be viewed from the command line. The NVIDIA Visual Profiler [39] (nvvp, and Nsight) is also available to visualise and optimise the performance of CUDA applications.

Other popular open-source profiling tools include mpiP [114], Integrated Performance Monitoring (IPM) [115], Tuning and Analysis Utilities (TAU) [40], VampirTrace [41], Scalasca [116], Paraver [117], and gprof [118].

6.3 Parallel Models

A parallel program may be classified according to different parallel programming paradigms.

Shared memory model, where the computational resources (usually multiple cores or many cores) share the available memory with a single address space. Some examples are the POSIX threads (Pthreads) library and OpenMP, a shared memory API.

Distributed memory model, where physically or logically partitioned computational resources each have a local memory resource and are connected via a high-speed interconnect. The message passing model, commonly implemented by the Message Passing Interface (MPI) library, is one example of this paradigm.

Hybrid model, where computational units, homogeneous (multiple identical cores on the same chip, with two or more chips on the same board forming a node) or heterogeneous (a host CPU and a specialised computational subsystem acting as a co-processor or accelerator, e.g. a GPGPU or an Intel MIC (Many Integrated Cores)), sit together on a node, each having private and shared memory and connected by some form of high-speed fabric so that the two work as a single computational unit; the nodes may then be connected together via a finely tuned high-speed interconnect to form a machine. The programming paradigm for such architectures arises from a mix of pure parallel programming models, e.g. MPI+OpenMP or MPI+CUDA. CUDA is a programming API by NVIDIA to program GPGPUs.

Distributed-Shared Model (DSM), where the notion of shared memory is supported in distributed architectures at the software level. An implementation example of this model is the Partitioned Global Address Space (PGAS), which has served as a foundation for parallel languages and programming frameworks like Unified Parallel C, Co-array Fortran, Titanium, X-10 and Chapel.

Pure shared memory architectures are limited in their scale, as they become very expensive as the cluster grows in size. The more prevalent architecture is a mix of shared memory nodes tied together as a distributed architecture. There is a new wave of supercomputers comprising tuned clusters of heterogeneous nodes with either GPUs or Intel MIC as co-processors. We therefore concentrate our discussion here on OpenMP, MPI and CUDA, covering their current status and where they are heading. Other relevant standards include OpenACC [6] and OpenCL [32], which are not discussed in detail as they have not been widely adopted by the Pawsey user community and many of their features will be present in OpenMP 4.0.
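As a minimal illustration of the hybrid MPI+OpenMP model discussed above, the sketch below uses MPI between processes and an OpenMP thread team within each process. The file name, compile line and requested thread-support level are illustrative assumptions rather than anything prescribed in this document.

```c
/* hybrid.c - minimal MPI+OpenMP skeleton (illustrative sketch only).
 * Compile (example): mpicc -fopenmp hybrid.c -o hybrid
 * Run (example):     mpirun -np 4 ./hybrid
 */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, nranks;

    /* Ask for MPI calls from the master thread only (FUNNELED). */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* Each MPI process spawns a team of OpenMP threads for node-level work. */
    #pragma omp parallel
    {
        #pragma omp critical
        printf("rank %d of %d, thread %d of %d\n",
               rank, nranks, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}
```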

OpenMP

OpenMP [7] is a shared memory API specifically designed for high performance computing. It is portable across shared memory architectures. A collection of compiler directives, runtime routines and environment variables, it requires compiler support, and the API is available in the three major languages used in scientific computing, i.e. C, C++ and Fortran. The parallelism is implemented in a fork/join model, giving rise to lightweight processes called threads at the fork stage. It is common to replicate a task over multiple threads, but it is also possible to distribute disjoint tasks to different threads, termed work-sharing, e.g. loop parallelisation.

An Architecture Review Board (ARB) standardises OpenMP. The current release, OpenMP 4.0, introduces new features which support portability of code from a shared memory architecture to accelerator devices, such as a GPU or an Intel Xeon Phi. There are a number of significant additions to OpenMP in version 4.0, as outlined below.

Decisions are made at compile time for the execution model employed by the OpenMP 4.0 program based on the availability of an accelerator. In case of unavailability or non-existence of an accelerator, the same code can run on the host CPUs. Thus the same code is portable to homogeneous or heterogeneous platforms.

The code can either follow an execution model, which moves work or functions between the host and the target devices, or a data model, which moves data (access) between the host and the target device. The execution model is host-centric: the host can offload work to a target device, or do it itself if a target device is not available or does not exist. The data model is data-centric. A data environment is initialised on each target device at the beginning of an OpenMP program. The API can handle the mapping between data residing on the host and the corresponding instance of that data on the target implicitly, or the user can control the mapping explicitly to avoid unnecessary data transfers (this requires compile time support).
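A minimal sketch of this offload and data-mapping model in C, assuming an OpenMP 4.0 compiler; the array length, function name and map clauses chosen here are illustrative. If no accelerator is present, the same region executes on the host.

```c
#define N 1024

/* Offload a simple vector update to an accelerator if one is available.
 * map(to:) copies the input to the device; map(tofrom:) copies b to the
 * device and the updated values back to the host afterwards. */
void scale_add(const double *a, double *b, double alpha)
{
    #pragma omp target map(to: a[0:N]) map(tofrom: b[0:N])
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        b[i] += alpha * a[i];
}
```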


Array section syntax for C/C++ has been introduced in version 4.0. Portions of arrays can be offloaded to a target device, and sections of arrays can be marked for dependencies to resolve task dependence graphs on array elements.

A SIMD (Single Instruction Multiple Data) construct has been introduced in version 4.0, where the API, under the guidance of the user, employs instruction level parallelism to enable concurrency in handling loops. Multiple levels of parallelism can be achieved by combining SIMD constructs with work-sharing loop constructs. The teams construct is also included to exploit an extra level of parallelism on GPUs.
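The sketch below, again assuming an OpenMP 4.0 compiler, combines the target, teams, distribute, work-sharing and simd constructs to express several levels of parallelism over one loop; the function and variable names are illustrative.

```c
/* Multiple levels of parallelism for an accelerator: teams are distributed
 * over blocks of the iteration space, threads within each team share a
 * block, and the innermost iterations are vectorised with simd. */
void saxpy(int n, float a, const float *restrict x, float *restrict y)
{
    #pragma omp target teams distribute parallel for simd \
                map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; i++)
        y[i] += a * x[i];
}
```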

The cancellation construct in version 4.0 implements an error model where a thread can request early exit from a region if the goal has been achieved by one of the threads. This can be done on the fly, or a synchronisation point (cancellation point) can be imposed by the user, at which the API checks for a cancellation request.
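A short sketch of the cancellation construct in C, assuming OpenMP 4.0 and that cancellation is activated at run time (for example via the OMP_CANCELLATION environment variable); the search predicate is illustrative.

```c
/* Parallel search that requests cancellation of the loop as soon as one
 * thread finds the target value. Other threads notice the request at the
 * explicit cancellation point and stop early. */
int find_index(const int *data, int n, int target)
{
    int found = -1;
    #pragma omp parallel shared(found)
    {
        #pragma omp for
        for (int i = 0; i < n; i++) {
            #pragma omp cancellation point for
            if (data[i] == target) {
                #pragma omp critical
                found = i;
                #pragma omp cancel for
            }
        }
    }
    return found;
}
```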

Thread affinity has been improved in version 4.0, which is desirable in some NUMA applications to increase performance.

A taskgroup construct has been introduced to implement deeper synchronisation: subgroups or global groups of tasks can be synchronised at synchronisation points.

Version 4.0 offers user-defined reductions with user-defined operators, which can be an expression or a function.
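As a brief sketch of a user-defined reduction, assuming OpenMP 4.0: the declaration below defines a combiner that keeps the value of largest magnitude, an operation not covered by the built-in reduction operators. The identifier and data are illustrative.

```c
#include <math.h>

/* Declare a reduction "absmax" over doubles: the combiner keeps whichever
 * of the two partial results has the larger absolute value. */
#pragma omp declare reduction(absmax : double : \
        omp_out = (fabs(omp_in) > fabs(omp_out) ? omp_in : omp_out)) \
        initializer(omp_priv = 0.0)

double abs_max(const double *v, int n)
{
    double m = 0.0;
    #pragma omp parallel for reduction(absmax : m)
    for (int i = 0; i < n; i++)
        m = (fabs(v[i]) > fabs(m) ? v[i] : m);
    return m;
}
```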

Message Passing Interface (MPI)

MPI [58] is today a de facto standard for communication on distributed memory architectures. With compiler support for the major scientific languages like C, C++ and Fortran, it has become a backbone of much scalable code design, and there is no sign of MPI being dethroned from its status in parallel computing in the foreseeable future (though HPC is evolving quickly, so no very long term promises hold). The latest MPI version, 3.0, was released in 2012. Some of the new capabilities which enable future codes to enter the Petascale era are outlined below.

One improvement is the inclusion of scalable communication topology mapping functions. In MPI-2.2 the creation of Cartesian grids as virtual topologies was very scalable, but the graph topology was not. In MPI-3.0 this has been improved: the new distributed graph topology implementation is scalable and lets the user define arbitrary communication relations.

Another is neighborhood collective communication on process topologies. Conventional collective communication spans all member processes and is dense (global). Neighborhood collectives are attractive for sparse or irregular communication patterns (e.g. load balancing, adaptive mesh refinement, etc.), and neighborhood communicators thus give more freedom to map physical communication patterns closely onto topology patterns. This significantly simplifies nearest neighbour communication; for example, point to point communication with north, south, east and west neighbours can be done with one or two function calls with neighborhood collectives. There are also blocking and nonblocking variants of these collective communication calls. Given that a virtual topology has been implemented/committed, operations on the edges of the topology can be performed with this facility.
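A compact sketch in C of these MPI-3.0 facilities, assuming a two-dimensional halo-exchange pattern; the buffers (one entry per neighbour) and grid setup are illustrative. The Cartesian communicator defines the north/south/east/west neighbours, and a single neighborhood collective then replaces four separate send/receive pairs.

```c
#include <mpi.h>

/* Exchange one double with each of the four Cartesian neighbours using an
 * MPI-3.0 neighborhood collective. sendbuf and recvbuf each hold four
 * doubles, one per neighbour. */
void halo_exchange(double *sendbuf, double *recvbuf)
{
    MPI_Comm cart;
    int dims[2] = {0, 0}, periods[2] = {1, 1}, nprocs;

    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Dims_create(nprocs, 2, dims);

    /* The Cartesian topology encodes the sparse communication pattern. */
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &cart);

    /* One call sends sendbuf[i] to neighbour i and receives recvbuf[i]
     * from neighbour i (ordered -x, +x, -y, +y for a Cartesian topology). */
    MPI_Neighbor_alltoall(sendbuf, 1, MPI_DOUBLE,
                          recvbuf, 1, MPI_DOUBLE, cart);

    MPI_Comm_free(&cart);
}
```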

Nonblocking collective communication implies asynchronous collectives, similar to the previously known asynchronous point to point communication. The benefits include an immediate return from a collective call (like Bcast, Scatter, Gather, etc.), which allows communication and computation to overlap. Also, more than one collective call can be active at one time on a given communicator. Nonblocking barrier synchronisation has been implemented in the MPI-3.0 standard, which means less stall time for processes that finish early and would otherwise wait, wasting valuable compute time. Nonblocking reductions have also been implemented in MPI-3.0. Significant code changes may be required, but the gains may be worth the effort.
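The sketch below shows a nonblocking reduction overlapped with independent local work, assuming MPI-3.0; the helper function representing the overlapped work is a placeholder.

```c
#include <mpi.h>

/* Start a global sum, overlap it with independent local work,
 * then wait for the reduction to complete. */
double overlapped_sum(double local, double (*do_local_work)(void))
{
    double global = 0.0;
    MPI_Request req;

    /* The nonblocking all-reduce returns immediately with a request handle. */
    MPI_Iallreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM,
                   MPI_COMM_WORLD, &req);

    double other = do_local_work();   /* computation overlapped with the collective */

    MPI_Wait(&req, MPI_STATUS_IGNORE);
    return global + other;
}
```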

The addition of nonblocking collective operations like alltoallw, in both neighborhood collectives and global nonblocking collectives, gives more leverage to the user as to how much of a mix of data types and message sizes can be sent to the member processes in one call.

There is also the addition of matched probing, via the use of MPI_Mprobe in MPI-3.0, to find the size of a message before receiving it. With increasing asynchronous communication in hybrid models, this feature may play a significant part in maximising computation/communication overlap.
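A short sketch of matched probing in C, assuming MPI-3.0: the receiver sizes its buffer from the probe and then receives exactly the probed message. The source, tag and datatype are illustrative.

```c
#include <mpi.h>
#include <stdlib.h>

/* Receive a message of unknown length: probe it, allocate a buffer of the
 * right size, then receive that specific (matched) message. */
double *recv_any_size(int source, int tag, int *count_out)
{
    MPI_Message msg;
    MPI_Status status;

    MPI_Mprobe(source, tag, MPI_COMM_WORLD, &msg, &status);
    MPI_Get_count(&status, MPI_DOUBLE, count_out);

    double *buf = malloc(*count_out * sizeof(double));
    MPI_Mrecv(buf, *count_out, MPI_DOUBLE, &msg, MPI_STATUS_IGNORE);
    return buf;
}
```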

In MPI-1, a send and receive operation was two-sided, i.e. the sender would wait until the receiver called MPI_Recv. One-sided communication, introduced in MPI-2, decouples data transfer from synchronisation by introducing Remote Memory Access (RMA). This reduces the MPI_Send and MPI_Recv pair on two processes to a single MPI_Get or MPI_Put call, which accesses data located in the sender’s or receiver’s address space (the view of the memory address space on the remote process is limited). To implement RMA, a process must select a contiguous region of memory and expose it to the other processes. This accessible region is called a window, and its existence is made known to other processes by using special collective calls. The MPI-3.0 standard has improved the efficiency of the RMA of MPI-2. In MPI-3.0, one-sided communication operations are nonblocking, though synchronisation can also be forced by the user.

This results in a huge benefit of reducing the number of calls to a minimum between processes which communicate frequently and in a repetitive pattern, although race conditions may occur if it is implemented incorrectly. MPI-3.0 allows non-contiguous windowing of shared memory to cater for NUMA architectures, allowing the placement of windowed memory in the NUMA region of a process.
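A minimal sketch of MPI-3.0 one-sided communication in C using fence synchronisation; the window size and choice of target rank are illustrative, and at least two ranks are assumed.

```c
#include <mpi.h>

/* Each rank exposes one double in a window; rank 0 writes a value directly
 * into rank 1's window with MPI_Put, with fences providing synchronisation. */
void put_example(void)
{
    int rank;
    double *win_mem;
    MPI_Win win;

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Allocate window memory and expose it to all ranks in the communicator. */
    MPI_Win_allocate(sizeof(double), sizeof(double), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &win_mem, &win);
    *win_mem = -1.0;

    MPI_Win_fence(0, win);                 /* open an access epoch */
    if (rank == 0) {
        double value = 42.0;
        MPI_Put(&value, 1, MPI_DOUBLE, 1 /* target rank */,
                0 /* displacement */, 1, MPI_DOUBLE, win);
    }
    MPI_Win_fence(0, win);                 /* complete the epoch; data visible */

    MPI_Win_free(&win);
}
```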

In classic hybrid mode, MPI+OpenMP required two-stage parallelisation for domain decomposition, in addition to OpenMP's loop level parallelisation. In MPI-3, shared memory windows allow memory to be shared across different MPI processes with a single level of domain decomposition (using MPI), and OpenMP can then be used at the loop level only.
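A brief sketch of an MPI-3.0 shared memory window in C: ranks on the same node are grouped with MPI_Comm_split_type, a single allocation is shared across them, and every on-node rank obtains a direct pointer into it. The array size and layout are illustrative.

```c
#include <mpi.h>

/* Create a window of shared memory among the ranks on one node and return a
 * pointer through which every on-node rank can address the whole array. */
double *node_shared_array(MPI_Aint nelems, MPI_Win *win, MPI_Comm *nodecomm)
{
    int noderank;
    double *base;

    /* Group ranks that can share memory (i.e. ranks on the same node). */
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, nodecomm);
    MPI_Comm_rank(*nodecomm, &noderank);

    /* Rank 0 of the node allocates the whole array; the others allocate nothing. */
    MPI_Aint mysize = (noderank == 0) ? nelems * sizeof(double) : 0;
    MPI_Win_allocate_shared(mysize, sizeof(double), MPI_INFO_NULL,
                            *nodecomm, &base, win);

    /* Every rank queries rank 0's segment to get a pointer to the shared array. */
    MPI_Aint qsize;
    int disp_unit;
    MPI_Win_shared_query(*win, 0, &qsize, &disp_unit, &base);
    return base;
}
```

In a hybrid code, the on-node ranks could then partition this array among themselves using a single level of MPI domain decomposition, with OpenMP applied only at the loop level as described above.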

Heterogeneous systems are specially built to perform well by adhering to a certain programming model that is best suited for certain classes of scientific problems. Today the list of such scientific problems is growing, and the nature of these special architectures is becoming less and less special, i.e. more general purpose; programming such platforms, however, is still a special issue. For the Intel MIC, OpenMP can suffice and, with the new features added in OpenMP 4.0, it is claimed to be a well suited programming model for architectures like Intel's Knights Landing. A cluster of such devices (preferably with burst buffers) can be programmed with a hybrid model of MPI+OpenMP. Although OpenMP 4.0 can be used to program GPGPUs, NVIDIA's CUDA remains the GPGPU's native programming library, purpose built for programming these still rather special devices used as accelerators in a heterogeneous architecture.


Compute Unified Device Architecture (CUDA)

NVIDIA’s Compute Unified Device Architecture (CUDA) [31] is an API for programming the GPU. Computationally, a GPU implements a thread model in a two level hierarchy: a block and a grid. A block consists of tightly coupled threads, whereas a grid is a collection of loosely coupled blocks. The blocks in a grid are not synchronised, and are located on a single GPU. The GPU contains a number of multiprocessors that handle one or more blocks of a grid. Threads within a block can share data through the multiprocessor shared memory, since they will be co-located on the same multiprocessor. All threads in the grid have access to the GPU’s global memory, onto which data can be staged from the host.

CUDA suits the implementation of an SPMD (Single Program Multiple Data) model. Worker management, i.e. thread creation and destruction, is handled implicitly by the API, but the user is responsible for allocating the workload and the computational resources for that workload.
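The hypothetical kernel below illustrates the grid/block decomposition and the user’s responsibility for choosing it; the block size of 256 threads is an arbitrary example value:

```
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *x, float factor, int n)
{
    /* Global index derived from the block/thread coordinates. */
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        x[i] *= factor;
}

int main()
{
    const int n = 1 << 20;
    float *d_x;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMemset(d_x, 0, n * sizeof(float));

    /* The user chooses the decomposition: grid size follows from block size. */
    dim3 block(256);
    dim3 grid((n + block.x - 1) / block.x);
    scale<<<grid, block>>>(d_x, 2.0f, n);
    cudaDeviceSynchronize();

    cudaFree(d_x);
    return 0;
}
```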

CUDA 7.0 is a major update to the CUDA platform, with a number of notable features. One such feature is support for the C++11 standard, itself a major update to C++: nvcc accepts C++11, so both host code and device code can be written to be C++11 compliant.
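A minimal sketch, assuming compilation with nvcc -std=c++11 on CUDA 7.0 or later, using C++11 features (auto, range-based for) in host and device code:

```
#include <vector>
#include <cuda_runtime.h>

__global__ void square(float *x, int n)
{
    auto i = static_cast<int>(blockIdx.x * blockDim.x + threadIdx.x);  // C++11 auto
    if (i < n)
        x[i] = x[i] * x[i];
}

int main()
{
    const int n = 1024;
    std::vector<float> h(n);
    for (auto &v : h) v = 0.5f;          // C++11 range-based for in host code

    float *d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemcpy(d, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    square<<<(n + 255) / 256, 256>>>(d, n);
    cudaMemcpy(h.data(), d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);
    return 0;
}
```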

Another CUDA 7.0 feature is the improved performance of the Thrust library. Thrust is a C++ template library for CUDA, modelled on the STL, that offers parallelism through high-level abstractions. It provides a collection of data-parallel primitives for which the library selects suitable and efficient implementations. Thrust 1.8 allows algorithms to be customised to run on the device, provides some control over thread execution there, and its core data-manipulation functions have improved noticeably in this version.
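As an illustrative sketch of Thrust’s high-level style (not specific to the new features of version 1.8), the fragment below fills, sorts, and reduces a device vector:

```
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>
#include <cstdio>

int main()
{
    thrust::device_vector<float> d(1 << 16);

    thrust::sequence(d.begin(), d.end());                       // 0, 1, 2, ...
    thrust::sort(d.begin(), d.end(), thrust::greater<float>()); // descending sort on the GPU
    float total = thrust::reduce(d.begin(), d.end(), 0.0f);     // parallel reduction

    printf("sum = %f\n", total);
    return 0;
}
```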

A new CUDA library, cuSolver, is also available. It contains dense and sparse direct linear solvers and eigensolvers, with the aim of matching LAPACK’s functional capability. In addition, the CUDA FFT library, cuFFT, sees a 3.5x performance improvement compared to CUDA 6.5. CUDA 7.0 also allows runtime compilation of CUDA C++ source code. CUDA 7.5 has since been released, with numerous updates to the compilers, tools, and libraries.
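A minimal cuFFT sketch, performing a single in-place 1D complex-to-complex forward transform of an arbitrarily chosen length:

```
#include <cufft.h>
#include <cuda_runtime.h>

int main()
{
    const int N = 4096;
    cufftComplex *data;
    cudaMalloc(&data, N * sizeof(cufftComplex));
    cudaMemset(data, 0, N * sizeof(cufftComplex));   // placeholder input

    cufftHandle plan;
    cufftPlan1d(&plan, N, CUFFT_C2C, 1);              // one 1D C2C transform
    cufftExecC2C(plan, data, data, CUFFT_FORWARD);    // in-place forward FFT

    cufftDestroy(plan);
    cudaFree(data);
    return 0;
}
```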

6.4 Resource Managers

Resource managers are essential to ensure the optimal utilisation of supercomputers and to apply the correct allocations to projects so that resources are distributed fairly among them. They are the means by which researchers interact with the systems. A number of resource managers are available, including Platform LSF, MOAB, PBS Professional, and SLURM.

Platform LSF [119] is a commercial resource manager that was bought out by IBM [95] and is now called IBM Platform LSF.

MOAB [120] is a commercial resource manager from Adaptive Computing [121], replacing a predecessor called Maui [122], which also used Torque [123]. MOAB can also utilise Torque, a fork of OpenPBS. The latest version of Torque is 5.0.2, and significant work has been done to allow it to scale to millions of cores. The current version of MOAB is 7.2.


PBS Professional [124] is a commercial resource manager from Altair [125]. PBS Professional 13.0 has just been released, and its headline feature is the ability to scale towards the exascale.

SLURM [126] is an open-source resource manager supported by SchedMD [127]. Its greatest strengths are its built-in allocation manager and its native support for Cray systems.


7 Australian Research Facilities

There is a range of supercomputing facilities currently available to the Australian research community. This section outlines a number of supercomputing systems available at the Pawsey Supercomputing Centre, the National Computational Infrastructure (NCI), the Victorian Life Sciences Computation Initiative (VLSCI), MASSIVE, the NCI Specialised Facility in Bioinformatics, and the University of Queensland. The systems listed here were available through the National Computational Merit Allocation Scheme (NCMAS) for the 2015 or 2016 calls.

7.1 Pawsey Supercomputing Centre

The Pawsey Supercomputing Centre [128] is encouraging and energising research using supercomputing, large-scale data storage and visualisation in Western Australia. It provides facilities and expertise to the research, education and industrial communities. Application areas include nanotechnology, radio astronomy, high energy physics, medical research, mining and petroleum, architecture and construction, multimedia, and urban planning, amongst others.

The Pawsey Supercomputing Centre is an unincorporated joint venture between CSIRO, Curtin University, Edith Cowan University, Murdoch University and the University of Western Australia, and is supported by the Western Australian and Federal Governments. In 2015, it offered compute time via NCMAS on two systems: Magnus, its primary system, and Fornax, a smaller GPU-enabled system that reached its end of service midway through 2015.

Magnus

System          Cray XC40 [129, 130]
Nodes           1,488
Processor       2x 12-core Intel Xeon (Haswell) per node
Memory          64 GB per node
Accelerator     None
Interconnect    Cray Aries (Dragonfly Topology, 72 Gbps)
Global Storage  3.5 PB Lustre
Performance     #41 November 2014 TOP500, 1,097.6 LINPACK TFLOP/s [131]
Commissioned    2013

Fornax

System          SGI Linux Cluster [132, 133]
Nodes           96
Processor       2x 6-core Intel Xeon (Westmere) per node
Memory          72 GB per node
Accelerator     1x NVIDIA Tesla C2075 per node
Interconnect    Dual Infiniband QDR 4x (Fat Tree Topology, 64 Gbps)
Local Storage   7 TB per node
Global Storage  500 TB Lustre
Performance     Theoretical Peak GPU 49.5 DP TFLOP/s
Commissioned    2012


7.2 National Computational Infrastructure

The National Computational Infrastructure (NCI) [134] is one of Australia’s national research computing facilities, providing world-class services to Australian researchers, industry and government. NCI is supported by the Australian Government’s National Collaborative Research Infrastructure Strategy, with operational funding provided through a formal Collaboration incorporating the Bureau of Meteorology, CSIRO, ANU and Geoscience Australia. In 2015, it offered compute time via NCMAS on its primary system, Raijin.

Raijin

System          Fujitsu PRIMERGY CX250 S1 [135, 136]
Nodes           3,592
Memory          32/64/128 GB per node
Processor       2x 8-core Intel Xeon (Sandy Bridge) per node
Accelerator     None
Interconnect    Infiniband FDR 4x (Fat Tree Topology, 56 Gbps)
Global Storage  10 PB Lustre
Performance     #24 November 2012 TOP500, 978.6 LINPACK TFLOP/s [137]
Commissioned    2012

7.3 Victorian Life Sciences Computation Initiative

The Victorian Life Sciences Computation Initiative (VLSCI) [138] is an initiative of the Victorian Government in partnership with the University of Melbourne and the IBM Life Sciences Research Collaboratory, Melbourne. Other major stakeholders include key Victorian health and medical research institutions, major universities and public research organisations. In 2015, it offered compute time via NCMAS on its primary system, Avoca.

Avoca

System          IBM Blue Gene/Q [139, 140]
Nodes           4,096
Processor       1x 16-core IBM PowerPC A2 per node
Memory          16 GB per node
Accelerator     None
Interconnect    Blue Gene/Q interconnection network (5D Torus Topology, 40 Gbps)
Global Storage  700 TB GPFS
Performance     #31 June 2012 TOP500, 690.2 LINPACK TFLOP/s [141]
Commissioned    2012

7.4 MASSIVE

The Multi-modal Australian ScienceS Imaging and Visualisation Environment (MASSIVE) [142] was established in 2010 with the sponsorship of the Australian Synchrotron, Monash University, CSIRO, NCI, VPAC and the State Government of Victoria. The formation of this Victorian-based facility followed the identification of specialised resources to assist in the analysis of characterisation imagery (in both batch and real-time modes), principally but not exclusively associated with the Australian Synchrotron, and of related image processing and analysis needs nationally, during the expressions of interest sought for the NCI Specialised Facilities program. In 2015, it offered compute time via NCMAS on the MASSIVE systems.

MASSIVE1

System          IBM iDataplex [143]
Nodes           42
Processor       2x 6-core Intel Xeon per node
Memory          48 GB per node
Accelerator     2x NVIDIA Tesla M2070 GPUs per node
Interconnect    Infiniband QDR 4x (Fat Tree Topology, 32 Gbps)
Global Storage  152 TB GPFS
Performance     Theoretical Peak GPU 43.3 DP TFLOP/s
Commissioned    2011

MASSIVE2

System          IBM iDataplex [143]
Nodes           118
Processor       2x 6/8-core Intel Xeon per node
Memory          48/64/128/192 GB per node
Accelerator     244x K20/M2070/M2070Q NVIDIA Tesla; 20x Intel Xeon Phi KNC
Interconnect    Infiniband QDR 4x (Fat Tree Topology, 32 Gbps)
Global Storage  345 TB GPFS
Performance     Theoretical Peak GPU 175.5 DP TFLOP/s
Commissioned    2013

7.5 NCI Specialised Facility in Bioinformatics

The NCI Specialised Facility in Bioinformatics (NCI-SFB) [144] is a research and support organisation dedicated to the advancement of the life sciences through computational methods. Its mission is to facilitate the adoption of existing computational methods and to empower the field’s development of new methods and techniques. This involves a range of activities including training, advice, and support. In 2015, it offered compute time via NCMAS on its primary system, Barrine.

Barrine

System          SGI Linux Cluster [132, 145]
Nodes           384
Processor       2x 4-core Intel Xeon (Nehalem) per node
Memory          24/48/72 GB per node
Accelerator     None
Interconnect    Infiniband QDR 4x (Fat Tree Topology, 32 Gbps)
Global Storage  50 TB Panasas
Performance     Theoretical Peak CPU 27.9 DP TFLOP/s
Commissioned    2010


7.6 The University of Queensland

The FlashLite system [146] is funded by the Australian Research Council, in conjunction with CSIRO, Griffith University, Monash University, Queensland Cyber Infrastructure Foundation, Queensland University of Technology, the University of Queensland, and the University of Technology Sydney. It has been designed for data-intensive operations, to maximise Input/Output Operations Per Second (IOPS). It was made available via the NCMAS scheme for the first time in 2016.

FlashLite

System          Xenon Systems [147]
Nodes           68
Processor       2x 12-core Intel Xeon (Haswell) per node
Memory          512 GB per node
Accelerator     None
Interconnect    2x Infiniband QDR 4x (Fat Tree Topology, 64 Gbps)
Local Storage   4.8 TB per node
Performance     Theoretical Peak CPU 32.6 DP TFLOP/s
Commissioned    2015


8 Current Leadership HPC Systems

The top leadership systems, in terms of sustained floating point performance, provide a useful snapshot of the current state of high-end systems. Using the recently announced June 2015 Top500 list [75], the architectures of the top ten systems are summarised in this section. It is worth noting there are also a number of other rankings in HPC, including the Green500 [148], Graph500 [149], and the HPCG Benchmark [150]. Many of the systems below also appear in those rankings.

#1 Tianhe-2

System          Inspur TH-IVB-FEP Cluster [151, 152, 153]
Nodes           16,000
Processor       2x 12-core Intel Xeon E5-2692 (Ivy Bridge) per node
Memory          64 GB per node
Accelerator     3x Intel Xeon Phi 31S1P per node
Interconnect    TH Express-2 (Fat Tree Topology, 96 Gbps)
Storage         12.4 PB global shared parallel storage system
Performance     33.9 LINPACK PFLOP/s
Organisation    NUDT, China
Commissioned    2013

#2 Titan

System          Cray XK7 [154, 155, 156]
Nodes           18,688
Processor       1x 16-core AMD Opteron 6274 (Interlagos) per node
Memory          32 GB per node
Accelerator     1x NVIDIA K20x per node
Interconnect    Cray Gemini (3D Torus Topology, 80 Gbps)
Global Storage  40 PB, 1.4 TB/s IO Lustre filesystem
Performance     17.6 LINPACK PFLOP/s
Organisation    DOE/SC/ORNL, United States
Commissioned    2012

#3 Sequoia

System          IBM BlueGene/Q [157, 158]
Nodes           98,304
Processor       1x 16-core Power BQC per node
Memory          16 GB per node
Accelerator     None
Interconnect    IBM Blue Gene/Q (5D Torus Topology, 40 Gbps)
Global Storage  55 PB Lustre filesystem
Performance     17.2 LINPACK PFLOP/s
Organisation    DOE/NNSA/LLNL, United States
Commissioned    2011


#4 K Computer

System          Fujitsu [159, 160]
Nodes           82,944
Processor       1x 8-core SPARC64 VIIIfx per node
Memory          16 GB per node
Accelerator     None
Interconnect    Tofu (6D Torus Topology, 100 Gbps)
Local Storage   11 PB Fujitsu Exabyte File System
Global Storage  30 PB Fujitsu Exabyte File System
Performance     10.5 LINPACK PFLOP/s
Organisation    RIKEN AICS, Japan
Commissioned    2011

#5 Mira

System          IBM Blue Gene/Q [161, 162, 163]
Nodes           49,152
Processor       1x 16-core Power BQC per node
Memory          16 GB per node
Accelerator     None
Interconnect    IBM Blue Gene/Q (5D Torus Topology, 40 Gbps)
Global Storage  70 PB
Performance     8.59 LINPACK PFLOP/s
Organisation    DOE/SC/ANL, United States
Commissioned    2012

#6 Piz Daint

System          Cray XC30 [164, 165, 166]
Nodes           5,272
Processor       1x 8-core Xeon E5-2670 (Sandy Bridge) per node
Memory          32 GB per node
Accelerator     1x NVIDIA K20X per node
Interconnect    Cray Aries (Dragonfly Topology, 72 Gbps)
Global Storage  2.5 PB Lustre
Performance     6.27 LINPACK PFLOP/s
Organisation    CSCS, Switzerland
Commissioned    2012

#7 Shaheen II

System          Cray XC40 [167, 168]
Nodes           6,144
Processor       2x 16-core Xeon E5-2698v3 (Haswell) per node
Memory          128 GB per node
Accelerator     None
Interconnect    Cray Aries (Dragonfly Topology, 72 Gbps)
Global Storage  17.6 PB Lustre
Performance     5.54 LINPACK PFLOP/s
Organisation    King Abdullah University of Science and Technology, Saudi Arabia
Commissioned    2015

#8 Stampede

System          Dell PowerEdge C8220 [169, 170]
Nodes           6,400
Processor       2x 8-core Xeon E5-2680 per node
Memory          32 GB per node
Accelerator     1x Intel Xeon Phi SE10P per node
Interconnect    Infiniband FDR 4x (Fat Tree Topology, 56 Gbps)
Local Storage   1.6 PB
Global Storage  14 PB Lustre
Performance     5.17 LINPACK PFLOP/s
Organisation    TACC/University of Texas, United States
Commissioned    2012

#9 JUQUEEN

System          IBM BlueGene/Q [171, 172]
Nodes           28,672
Processor       1x 16-core Power BQC per node
Memory          16 GB per node
Accelerator     None
Interconnect    IBM Blue Gene/Q (5D Torus Topology, 40 Gbps)
Global Storage  6 PB
Performance     5.01 LINPACK PFLOP/s
Organisation    FZJ, Germany
Commissioned    2012

#10 Vulcan

System          IBM BlueGene/Q [173, 174]
Nodes           24,576
Processor       1x 16-core Power BQC per node
Memory          16 GB per node
Accelerator     None
Interconnect    IBM Blue Gene/Q (5D Torus Topology, 40 Gbps)
Global Storage  5 PB
Performance     4.29 LINPACK PFLOP/s
Organisation    DOE/NNSA/LLNL, United States
Commissioned    2012


9 Future Leadership HPC Systems

While the current leadership HPC systems of the previous section provide an insight into the current state of the HPC field, a number of relevant international institutions have announced their plans for the future. This includes both upgrade paths for existing systems and entirely new procurements. In some cases, information regarding the proposed architecture is available, which has been collated in this section.

Argonne National Laboratory (ANL)

Located in Illinois, United States, the Argonne National Laboratory (ANL) currently operates Mira, which is #5 on the June 2015 Top500 List. The next large-scale ANL system will be called Aurora [175], and is scheduled to be commissioned in 2019. It will contain over 50,000 compute nodes using the KNH Xeon Phi architecture, over 7 PB of traditional and non-volatile memory, and over 150 PB of Lustre global storage. For an interconnect, Aurora will use Intel Omni-Path. The peak power consumption of the system will be 13 MW. To prepare for Aurora, an early production system based on the KNL Xeon Phi will be commissioned in 2016.

Edinburgh Parallel Computing Centre (EPCC)

Located in Scotland, United Kingdom, the Edinburgh Parallel Computing Centre (EPCC) currently operates Archer, which is #35 on the June 2015 Top500 List. There are presently no announced plans for upgrades or a successor.

Forschungszentrum Jülich (FZJ)

Located in Jülich, Germany, the Forschungszentrum Jülich (FZJ) currently operates JUQUEEN, which is #9 on the June 2015 Top500 List. There are presently no announced plans for upgrades or a successor to JUQUEEN. However, the centre is in the process of commissioning a new, smaller system called JURECA [176]. This heterogeneous system utilises dual 12-core Intel Xeon (Haswell) processors with 128/256/512 GB of memory and a Mellanox EDR Infiniband (100 Gbps) interconnect; some nodes also have NVIDIA K40/K80 GPU accelerators.

King Abdullah University of Science and Technology (KAUST)

Located in Thuwal, Saudi Arabia, the King Abdullah University of Science and Technology (KAUST) currently operates Shaheen II, which is #7 on the June 2015 Top500 List. As Shaheen II is a relatively new system, there are presently no announced plans for upgrades or a successor.

Lawrence Livermore National Laboratory (LLNL)

Located in California, United States, the Lawrence Livermore National Laboratory (LLNL) currently operates Sequoia, which is #3 on the June 2015 Top500 List. As part of the Collaboration of ORNL, ANL, and LLNL (CORAL) [177], LLNL has announced a contract with IBM to deliver a next-generation supercomputer in 2017. The system is called Sierra [178], and will be based on the OpenPower Architecture, with a peak performance well in excess of 100 PFLOP/s. The IBM press release also mentions the NVIDIA NVLink interconnect technology, the NVIDIA Volta GPU architecture, and collaboration with Mellanox [179].

Los Alamos National Laboratory (LANL)

Located in New Mexico, United States, the Los Alamos National Laboratory (LANL) currently operates Cielo, which is #57 on the June 2015 Top500 List. The next system for LANL is a Cray XC40 called Trinity, to be commissioned in the latter half of 2015 [180]. With over 19,000 compute nodes, it will use Intel Xeon (Haswell) processors with Intel Xeon Phi (KNC) accelerators, and will have over 2 PB of memory and 80 PB of global storage. Trinity is expected to have a peak performance of over 40 PFLOP/s and to use less than 10 MW of power.

National Energy Research Scientific Computing Center (NERSC)

Located in California, United States, the National Energy Research Scientific Computing Center (NERSC) currently operates Edison, which is #34 on the June 2015 Top500 List. NERSC has announced that its next system, called Cori [181], will undergo procurement in two phases.

The first phase, due in the third quarter of 2015, will be a Cray XC system based on dual 16-core Haswell multi-core processors. It will have approximately 1,400 compute nodes with 128 GB of memory. It will also incorporate 750 TB of non-volatile memory, in addition to 28 PB of traditional global storage.

The second phase, due mid-2016, will have over 9,300 single-socket nodes with self-hosted Intel Xeon Phi (KNL) processors. It will have 96 GB of memory per node, in addition to the 16 GB of high-bandwidth memory on the Xeon Phi itself. The recommended programming model will be MPI and OpenMP.

National University of Defense Technology (NUDT)

Located in Changsha, China, the National University of Defense Technology (NUDT) currently operates Tianhe-2, which is #1 on the June 2015 Top500 List. Tianhe-2 will be upgraded in 2016 and renamed Tianhe-2A via the addition of Matrix2000 GPDSP accelerators [182]. These custom co-processors are being developed by NUDT as a general-purpose expansion of the capabilities of Digital Signal Processors (DSP), in a similar manner to how modern GPU computing arose from earlier and less programmable Graphics Processing Units.

The Matrix2000 supports both 64-bit memory addressing and floating-point arithmetic, with a theoretical performance of 2.4 double precision TFLOP/s per card. The accelerator communicates with a node via standard PCIe3, and will be programmable via the OpenMP 4.0 standard. The interconnect will also be upgraded to TH Express-2+, capable of adaptive routing.

Oak Ridge National Laboratory (ORNL)

Located in Tennessee, United States, the Oak Ridge National Laboratory (ORNL) currently operates Titan, which is #2 on the June 2015 Top500 List. As part of the Collaboration of ORNL, ANL, and LLNL (CORAL) [177], ORNL has a contract with IBM to deliver a next-generation supercomputer in 2018, called Summit [183]. According to the ORNL website [183], it will consist of 3,400 compute nodes with multiple IBM Power 9 processors, multiple NVIDIA Volta GPUs, and over 512 GB of memory per node. The nodes will also have 800 GB of local storage in the form of non-volatile RAM. The CPUs and GPUs will be connected via NVLink technology, 5-12x faster than existing PCIe3, and the system interconnect will be dual-rail EDR Infiniband in a Fat Tree topology capable of 185 Gbps per node. For global storage, a 120 PB GPFS filesystem will be used. The peak power consumption of the system will be 10 MW.

RIKEN Advanced Institute for Computational Science (AICS)

Located in Kobe, Japan, the RIKEN Advanced Institute for Computational Science (AICS) currently operates the K Computer, which is #4 on the June 2015 Top500 List. While it has been announced that Fujitsu has been selected to co-design the successor to the K Computer [184], few details are available about the architecture, other than that it is aiming for exascale in 2020.

Swiss National Supercomputing Centre (CSCS)

Located in Lugano, Switzerland, the Swiss National Supercomputing Centre (CSCS) currently operates Piz Daint, which is #6 on the June 2015 Top500 List. There are presently no announced plans for upgrades or a successor.

Texas Advanced Computing Center (TACC)

Located in Texas, United States, the Texas Advanced Computing Center (TACC) currently operates Stampede, which is #8 on the June 2015 Top500 List. TACC has recently announced that its next system will be a Cray XC40 named Lonestar 5 [185]. It will contain over 1,252 compute nodes with dual 12-core Intel Xeon (Haswell) processors, with an expected peak performance of 1.25 PFLOP/s, a Cray Aries interconnect, and 1.2 PB of storage from DDN. It will be in operation in 2016.


10 Conclusion

Following the survey of compute, interconnect, storage, and software technologies in Sections 2, 4, 5, and 6 respectively, the current Australian and international leadership systems of Sections 7 and 8, and the announced future leadership systems of Section 9, several potential hardware architectures emerge. The architectures presented here are both relevant to the international leadership systems that researchers at the Pawsey Supercomputing Centre may aspire to use, and likely to see broader adoption across the HPC community.

The Xeon Phi and Omni-Path architecture is newly emerging. While other centres have worked with previous accelerator versions of the Xeon Phi, the new main-socket architecture with an integrated Omni-Path interconnect controller offers a high-performance processor tightly coupled to a new interconnect technology. It has the advantage of leveraging the existing MPI and OpenMP parallel approaches used by the majority of existing HPC software, and a strong ecosystem of development software.

The Power, NVIDIA GPU, and Mellanox InfiniBand architecture takes the existing NVIDIA-Mellanox combination, which has shown high performance for a growing subset of HPC codes, and replaces the x86-64 host processor with a Power host processor. In addition to leveraging existing CUDA, OpenACC and OpenCL codes, the new OpenMP 4.0 standard supports offload directives and eases porting to GPU-accelerated architectures. The Power host processor also provides good performance for codes better suited to large cores, and supports MPI and OpenMP approaches. There is also a strong ecosystem of development software for this architecture.

The ARM architecture is a newcomer to the HPC community, and an interesting choice given its traditional focus on power efficiency. While it will support MPI and OpenMP, the ecosystem of supporting development software and numerical libraries for HPC is likely to be significantly less mature than for the previous two architectures. The Cavium ThunderX and Broadcom Vulcan are interesting potential processors for this architecture.

For storage, NVRAM buffers look to be an emerging trend worth investigating. Based on non-volatile random access memory, these buffers can be utilised in a number of ways, such as a shared filesystem cache or a very high performance scratch space. This could improve data transfer performance, which can be a limiting factor when scaling on larger systems.

In terms of software, MPI and OpenMP are the programming standards that are portable across most of the emerging HPC architectures, largely due to the addition of offload directives for accelerator-based architectures in OpenMP 4.0. The directive-based approach of OpenMP, as well as OpenACC on GPU architectures, obtains reasonable performance outcomes. However, optimal performance is achieved using processor-specific SIMD intrinsics on CPU architectures and CUDA or OpenCL implementations on GPU architectures, at the cost of portability.


References

[1] ARM, 2015. URL http://www.arm.com/.

[2] Cray. Cray to Explore Alternative Processor Technologies for Supercomputing, 2014. URL http://investors.cray.com/phoenix.zhtml?c=98390&p=irol-newsArticle&ID=1990117.

[3] Barcelona Supercomputing Center. BSC building first supercomputer to combine ARM CPUs, GPU accelerators and InfiniBand, 2013. URL https://www.isc-events.com/ir/files/pressbox/2013/10/61/Press%20Release%20on%20PRACEpedraforca_v6.pdf.

[4] Vishal Mehta. Exploiting CUDA Dynamic Parallelism for Low-Power ARM-Based Prototypes. In NVIDIA GPU Technology Conference 2015, 2015. URL http://on-demand.gputechconf.com/gtc/2015/presentation/S5384-Vishal-Mehta.pdf.

[5] Pathscale. EKOPath ARMv8 Compiler Suite, 2015. URL http://www.pathscale.com/ARMv8.

[6] OpenACC. OpenACC Directives for Accelerators, 2015. URL http://www.openacc-standard.org/.

[7] OpenMP. The OpenMP API specification for parallel programming, 2015. URL http://openmp.org/.

[8] NAG. NAG Fortran Compiler, 2015. URL http://www.nag.co.uk/nagware/np.asp.

[9] NAG. NAG to broaden 64-bit ARMv8-A ecosystem, 2013. URL http://www.nag.com/market/articles/nag-to-broaden-64-bit-armv8-a-ecosystem.

[10] GNU. GCC, the GNU Compiler Collection, 2015. URL https://gcc.gnu.org/.

[11] LLVM. The LLVM Compiler Infrastructure, 2015. URL http://llvm.org/.

[12] ARM. CoreSight Debug and Trace, 2015. URL http://www.arm.com/products/system-ip/debug-trace/.

[13] ARM. Cortex-A57 Processor, 2015. URL http://www.arm.com/products/processors/cortex-a/cortex-a57-processor.php.

[14] ARM. Cortex-A72 Processor, 2015. URL http://www.arm.com/products/processors/cortex-a/cortex-a72-processor.php.

[15] Cavium, 2015. URL http://www.cavium.com/.

[16] Cavium. ThunderX ARM Processors, 2015. URL http://www.cavium.com/ThunderX_ARM_Processors.html.

[17] Broadcom, 2015. URL http://www.broadcom.com/.

[18] John De Glas. ARM Challenging Intel in the Server Market: An Overview, 2014. URL http://www.anandtech.com/show/8776/arm-challinging-intel-in-the-server-market-an-overview/6.


[19] Altera, 2015. URL https://www.altera.com/.

[20] Xilinx, 2015. URL http://www.xilinx.com/.

[21] ExtremeTech. Intel unveils new Xeon chip with integrated FPGA, touts 20x performance boost, 2014. URL http://www.extremetech.com/extreme/184828-intel-unveils-new-xeon-chip-with-integrated-fpga-touts-20x-performance-boost.

[22] IEEE 1076. VHDL Analysis and Standardization Group (VASG), 2015. URL http://www.eda.org/twiki/bin/view.cgi/P1076/WebHome.

[23] Altera. Altera SDK for OpenCL, 2015. URL https://www.altera.com/products/design-software/embedded-software-developers/opencl/overview.html.

[24] AMD, 2015. URL http://www.amd.com/.

[25] Pathscale. ENZO Compiler Suite, 2015. URL http://www.pathscale.com/enzo.

[26] PRWire. AMD unleashes world’s most powerful server GPU for HPC, 2014. URL http://prwire.com.au/pr/45810/amd-unleashes-world-s-most-powerful-server-gpu-for-hpc.

[27] AMD. FirePro S9170 Server GPU, 2015. URL http://www.amd.com/en-us/products/graphics/server/s9170.

[28] NVIDIA, 2015. URL http://www.nvidia.com/.

[29] NVIDIA. Tesla GPU Accelerators for Servers, 2015. URL http://www.nvidia.com/object/tesla-servers.html.

[30] Wikipedia. NVIDIA Tesla, 2015. URL https://en.wikipedia.org/wiki/Nvidia_Tesla.

[31] NVIDIA. CUDA Parallel Computing Platform, 2015. URL http://www.nvidia.com/object/cuda_home_new.html.

[32] Khronos Group. OpenCL: The open standard for parallel programming of heterogeneous systems, 2015. URL https://www.khronos.org/opencl/.

[33] NVIDIA. CUDA Compiler Driver NVCC, 2015. URL http://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/.

[34] PGI. PGI Compilers and Tools, 2015. URL http://www.pgroup.com/.

[35] NVIDIA. GPU-Accelerated Libraries, 2015. URL https://developer.nvidia.com/gpu-accelerated-libraries.

[36] NVIDIA. Debugging Solutions, 2015. URL https://developer.nvidia.com/debugging-solutions.

[37] NVIDIA. Performance Analysis Tools, 2015. URL https://developer.nvidia.com/performance-analysis-tools.

[38] NVIDIA. Nsight, 2015. URL http://www.nvidia.com/object/nsight.html.


[39] NVIDIA. NVIDIA Visual Profiler, 2015. URL https://developer.nvidia.com/nvidia-visual-profiler.

[40] University of Oregon. TAU Performance System, 2015. URL https://www.cs.uoregon.edu/research/tau/home.php.

[41] Vampir. Vampir - Performance Optimization, 2015. URL https://www.vampir.eu/.

[42] Allinea. Allinea DDT: The standard for debugging complex code from workstations to supercomputers, 2015. URL http://www.allinea.com/products/ddt.

[43] Rogue Wave Software. TotalView Debugger, 2015. URL http://www.roguewave.com/products-services/totalview.

[44] NVIDIA. CUDA-GDB, 2015. URL https://developer.nvidia.com/cuda-gdb.

[45] NVIDIA. CUDA-MEMCHECK, 2015. URL https://developer.nvidia.com/CUDA-MEMCHECK.

[46] NVIDIA. Tegra Processors, 2015. URL http://www.nvidia.com/object/tegra-x1-processor.html.

[47] Kalray, 2015. URL http://www.kalrayinc.com/.

[48] Kalray. MPPA: The Supercomputing on a chip solution, 2015. URL http://www.kalrayinc.com/kalray/products/#processors.

[49] PEZY, 2015. URL http://pezy.co.jp/en/index.html.

[50] PEZY. PEZY-SC Many Core Processor, 2015. URL http://pezy.co.jp/en/products/pezy-sc.html.

[51] Tilera, 2015. URL http://www.tilera.com/.

[52] Tilera. EZchip Multicore Processors, 2015. URL http://www.tilera.com/products/?ezchip=585.

[53] Intel, 2015. URL http://www.intel.com/.

[54] Intel. Intel Xeon Phi Coprocessor 31S1P, 2015. URL http://ark.intel.com/products/79539/Intel-Xeon-Phi-Coprocessor-31S1P-8GB-1_100-GHz-57-core.

[55] Intel. Intel Xeon Phi Core Micro-architecture, 2015. URL https://software.intel.com/sites/default/files/article/393195/intel-xeon-phi-core-micro-architecture.pdf.

[56] Intel. Next-Generation Intel Xeon Phi Processor with Integrated Intel Omni Scale Fabric to Deliver Up to 3 Times the Performance of Previous Generation at Lower Power, 2014. URL http://newsroom.intel.com/community/intel_newsroom/blog/2014/06/23/intel-re-architects-the-fundamental-building-block-for-high-performance-computing.

[57] Wikipedia. Xeon Phi, 2015. URL https://en.wikipedia.org/wiki/Xeon_Phi.


[58] MPI Forum. MPI Documents, 2015. URL http://www.mpi-forum.org/docs/docs.html.

[59] Cray. Cray C/C++ Reference Manual, 2015. URL http://docs.cray.com/books/004-2179-001/html-004-2179-001/lymwlrwh.html.

[60] Cray. Cray Fortran Compiler Commands and Directives Reference Manual, 2015. URL http://docs.cray.com/books/S-3901-55/.

[61] Intel. Intel C++ and Fortran Compilers, 2015. URL https://software.intel.com/en-us/intel-compilers.

[62] Intel. Intel Parallel Studio XE 2015, 2015. URL https://software.intel.com/en-us/intel-parallel-studio-xe.

[63] The University of Tennessee Knoxville. Matrix Algebra on GPU and Multicore Architectures, 2015.

[64] Boost. Boost C++ Libraries, 2015. URL http://www.boost.org/.

[65] ViennaCL. ViennaCL, 2015. URL http://viennacl.sourceforge.net/.

[66] PARALUTION Labs. PARALUTION Library, 2015. URL https://www.paralution.com/.

[67] GNU. GDB: The GNU Project Debugger, 2015. URL http://www.gnu.org/software/gdb/.

[68] Allinea. Allinea MAP - the C, C++ and F90 profiler for high performance and multithreaded Linux applications, 2015. URL http://www.allinea.com/products/map.

[69] The University of Tennessee Knoxville. Performance Application Programming Interface, 2015. URL http://icl.cs.utk.edu/papi/.

[70] Usman Pirzada. Skylake Purley to be the Biggest Advancement Since Nehalem, 2015. URL http://wccftech.com/massive-intel-xeon-e5-xeon-e7-skylake-purley-biggest-advancement-nehalem/.

[71] Ryan Smith. AMD’s 2016-2017 x86 Roadmap: Zen Is In, Skybridge Is Out, 2015. URL http://www.anandtech.com/show/9231/amds-20162017-x86-roadmap-zen-is-in.

[72] SPARC International, 2015. URL http://sparc.org/.

[73] SPARC International. The SPARC Architecture Manual, Version 9, 2015. URL http://sparc.org/technical-documents/#V9.

[74] Fujitsu. SPARC64 VIIIfx and IXfx Specifications, 2015. URL http://www.fujitsu.com/downloads/TC/sc11/sparc64-ixfx-sc11.pdf.

[75] TOP500 List, 2015. URL http://www.top500.org/list/2015/06/.

[76] OpenPower, 2015. URL http://openpowerfoundation.org/.

[77] Google, 2015. URL https://www.google.com/intl/en/about/.


[78] Mellanox, 2015. URL http://www.mellanox.com/.

[79] Jeff Stuecheli. POWER8, 2013. URL http://www.hotchips.org/wp-content/uploads/hc_archives/hc25/HC25.20-Processors1-epub/HC25.26.210-POWER-Studecheli-IBM.pdf.

[80] Tiffany Trader. IBM’s First OpenPOWER Server Targets HPC Workloads, 2015. URL http://www.hpcwire.com/2015/03/19/ibms-first-openpower-server-targets-hpc-workloads/.

[81] SGI. SGI UV: The World’s Most Powerful In-Memory Supercomputers, 2015. URL https://www.sgi.com/products/servers/uv/.

[82] PCWorld. Intel’s crazy-fast 3D XPoint Optane memory heads for DDR slots, 2015. URL http://www.pcworld.com/article/2973549/storage/intels-crazy-fast-3d-xpoint-optane-memory-heads-for-ddr-slots-but-with-a-catch.html.

[83] TechEYE. Viking Technology and Sony in ReRAM memory mashup, 2013. URL http://www.techeye.net/business/viking-technology-and-sony-in-reram-memory-mashup.

[84] Yuuichirou Ajima, Tomohiro Inoue, Shinya Hiramoto, and Toshiyuki Shimizu. Tofu: Interconnect for the K computer, 2015. URL http://www.fujitsu.com/downloads/MAG/vol48-3/paper05.pdf.

[85] Dong Chen, Noel A. Eisley, Philip Heidelberger, Robert M. Senger, Yutaka Sugawara, Sameer Kumar, Valentina Salapura, David L. Satterfield, Burkhard Steinmacher-Burow, and Jeffrey J. Parker. The IBM Blue Gene/Q Interconnection Network and Message Unit, 2015. URL http://mmc.geofisica.unam.mx/edp/SC11/src/pdf/papers/tp19.pdf.

[86] IEEE 802.3. Ethernet Working Group, 2015. URL http://www.ieee802.org/3/.

[87] Wikipedia. Ethernet, 2015. URL https://en.wikipedia.org/wiki/Ethernet.

[88] IEEE 802.3. 25 Gb/s Ethernet Study Group, 2015. URL http://www.ieee802.org/3/25GSG/index.html.

[89] InfiniBand Trade Organisation. About InfiniBand, 2015. URL http://www.infinibandta.org/content/pages.php?pg=about_us_infiniband.

[90] Wikipedia. InfiniBand, 2015. URL https://en.wikipedia.org/wiki/InfiniBand.

[91] InfiniBand Trade Organisation. InfiniBand Roadmap, 2015. URL http://www.infinibandta.org/content/pages.php?pg=technology_overview.

[92] Intel. Intel Omni-Path Architecture, 2014. URL http://www.intel.com/content/www/us/en/high-performance-computing-fabrics/omni-path-architecture-fabric-overview.html.

[93] IBM. General Parallel File System, 2015. URL http://www-01.ibm.com/support/knowledgecenter/SSFKCN/gpfs_welcome.html?lang=en.


[94] Wikipedia. IBM General Parallel File System, 2015. URL https://en.wikipedia.org/wiki/IBM_General_Parallel_File_System.

[95] IBM, 2015. URL http://www.ibm.com/.

[96] IBM. IBM Spectrum Scale, 2015. URL http://www-03.ibm.com/systems/storage/spectrum/scale/.

[97] Lustre. Lustre, 2015. URL http://lustre.org/.

[98] NERSC. Trinity / NERSC-8 Use Case Scenarios, 2013. URL http://www.nersc.gov/assets/Trinity--NERSC-8-RFP/Documents/trinity-NERSC8-use-case-v1.2a.pdf.

[99] Cray. DataWarp Applications I/O Accelerator, 2015. URL http://www.cray.com/sites/default/files/resources/CrayXC40-DataWarp.pdf.

[100] Oracle. Oracle Data Sheet: Oracle Solaris Studio 12.4, 2015. URL http://www.oracle.com/us/products/tools/050864.pdf.

[101] Oracle, 2015. URL http://www.oracle.com/us/products/tools/050864.pdf.

[102] IBM. XL C/C++ for Linux, 2015. URL http://www.ibm.com/software/products/en/xlcpp-linux.

[103] IBM. XL Fortran for Linux, 2015. URL http://www.ibm.com/software/products/en/xlfortran-linux.

[104] Allinea. Develop with Allinea Forge, 2015. URL http://www.allinea.com/products/develop-allinea-forge.

[105] Valgrind. Valgrind, 2015. URL http://valgrind.org/.

[106] Paradyn. Stack Trace Analysis Tool, 2015. URL http://www.paradyn.org/STAT/STAT.html.

[107] Python Software Engineering. pdb - The Python Debugger, 2015. URL https://docs.python.org/3/library/pdb.html.

[108] The Krell Institute. Open|SpeedShop, 2015. URL https://openspeedshop.org/.

[109] HPCToolkit. HPCToolkit, 2015. URL http://hpctoolkit.org/.

[110] Cray. Using Cray Performance Measurement and Analysis Tools, 2015. URL http://docs.cray.com/books/S-2376-60/.

[111] IBM. IBM High Performance Computing Toolkit, 2015. URL http://researcher.watson.ibm.com/researcher/view_group.php?id=2754.

[112] Intel. Intel Trace Analyzer and Collector, 2015. URL https://software.intel.com/en-us/intel-trace-analyzer.

[113] NVIDIA. Profiler User’s Guide, 2015. URL http://docs.nvidia.com/cuda/profiler-users-guide/.


[114] Jeffrey Vetter and Chris Chambreau. mpiP: Lightweight, Scalable MPI Profiling, 2015. URL http://mpip.sourceforge.net/.

[115] IPM-HPC. Integrated Performance Monitoring, 2015. URL http://ipm-hpc.sourceforge.net/.

[116] Scalasca. Scalasca, 2015. URL http://www.scalasca.org/.

[117] Barcelona Supercomputing Center. Paraver: a flexible performance analysis tool, 2015. URL http://www.bsc.es/computer-sciences/performance-tools/paraver/general-overview.

[118] GNU. GNU gprof, 2015. URL https://sourceware.org/binutils/docs/gprof/.

[119] IBM. IBM Platform LSF, 2015. URL http://www.ibm.com/systems/au/platformcomputing/products/lsf/.

[120] Adaptive Computing. Moab HPC Suite Basic Edition, 2015. URL http://www.adaptivecomputing.com/products/hpc-products/moab-hpc-basic-edition/.

[121] Adaptive Computing, 2015. URL http://www.adaptivecomputing.com/.

[122] Adaptive Computing. Maui Cluster Scheduler, 2015. URL http://www.adaptivecomputing.com/products/open-source/maui/.

[123] Adaptive Computing. TORQUE Resource Manager, 2015. URL http://www.adaptivecomputing.com/products/open-source/torque/.

[124] Altair. PBS Professional, 2015. URL http://www.pbsworks.com/Product.aspx?id=1.

[125] Altair, 2015. URL http://www.altair.com/.

[126] SchedMD. Slurm Workload Manager, 2015. URL http://slurm.schedmd.com/.

[127] SchedMD, 2015. URL http://www.schedmd.com/.

[128] Pawsey Supercomputing Centre, 2015. URL http://www.pawsey.org.au/.

[129] Cray. Cray XC40 Series Specifications, 2015. URL http://www.cray.com/sites/default/files/resources/cray_xc40_specifications.pdf.

[130] Pawsey Supercomputing Centre. User Portal - System Descriptions - Magnus, 2015. URL https://portal.pawsey.org.au/docs/Supercomputers/System_Descriptions#Magnus.

[131] TOP500 List, 2014. URL http://www.top500.org/list/2014/11/.

[132] SGI, 2015. URL http://www.sgi.com/.

[133] Pawsey Supercomputing Centre. User Portal - System Descriptions - Fornax, 2015. URL https://portal.pawsey.org.au/docs/Supercomputers/System_Descriptions#Fornax.


[134] National Computational Infrastructure, 2015. URL http://nci.org.au/.

[135] Fujitsu. Fujitsu Server PRIMERGY, 2015. URL http://www.fujitsu.com/global/products/computing/servers/primergy/.

[136] NCI. HPC Systems - Raijin, 2015. URL http://nci.org.au/systems-services/national-facility/peak-system/raijin/.

[137] TOP500 List, 2012. URL http://www.top500.org/list/2012/11/.

[138] Victorian Life Sciences Computation Initiative, 2015. URL https://www.vlsci.org.au/.

[139] IBM. Blue Gene/Q, 2015. URL http://www.ibm.com/systems/au/technicalcomputing/solutions/bluegene/.

[140] Victorian Life Sciences Computation Initiative. Computer & Software Configuration - Avoca, 2015. URL http://www.vlsci.org.au/page/computer-software-configuration.

[141] TOP500 List, 2012. URL http://www.top500.org/list/2012/06/.

[142] MASSIVE, 2015. URL https://www.massive.org.au/.

[143] MASSIVE. High Performance Computing - Resources, 2015. URL https://www.massive.org.au/high-performance-computing/resources.

[144] NCI Specialised Facility in Bioinformatics, 2015. URL https://ncisf.org/.

[145] NCI Specialised Facility in Bioinformatics. Barrine HPC Hardware, 2015. URL https://ncisf.org/barrinehpc/hardware.

[146] The University of Queensland. Research Computing Centre: Infrastructure, 2015. URL http://rcc.uq.edu.au/infrastructure#flashlite.

[147] Xenon Systems, 2015. URL http://www.xenon.com.au/.

[148] Green500, 2015. URL http://www.green500.org/.

[149] Graph500, 2015. URL http://www.graph500.org/.

[150] HPCG Benchmark, 2015. URL http://www.hpcg-benchmark.org.

[151] Wikipedia. Tianhe-2, 2015. URL https://en.wikipedia.org/wiki/Tianhe-2.

[152] TOP500 List. Tianhe-2, 2015. URL http://www.top500.org/system/177999.

[153] Zhengbin Pang, Min Xie, Jun Zhang, Yi Zheng, Guibin Wang, Dezun Dong, and Guang Suo. The TH Express high performance interconnect networks. Frontiers of Computer Science, 8(3):357–366, 2014.

[154] TOP500 List. Titan, 2015. URL http://www.top500.org/system/177975.

[155] Wikipedia. Titan (supercomputer), 2015. URL https://en.wikipedia.org/wiki/Titan_(supercomputer).

[156] Wikipedia. Cray XK7, 2015. URL https://en.wikipedia.org/wiki/Cray_XK7.


[157] TOP500 List. Sequoia, 2015. URL http://www.top500.org/system/177556.

[158] Wikipedia. IBM Sequoia, 2015. URL https://en.wikipedia.org/wiki/IBM_Sequoia.

[159] TOP500 List. K Computer, 2015. URL http://www.top500.org/system/177232.

[160] Wikipedia. K Computer, 2015. URL https://en.wikipedia.org/wiki/K_computer.

[161] TOP500 List. Mira, 2015. URL http://www.top500.org/system/177718.

[162] Wikipedia. IBM Mira, 2015. URL https://en.wikipedia.org/wiki/IBM_Mira.

[163] Wikipedia. Blue Gene, 2015. URL https://en.wikipedia.org/wiki/Blue_Gene.

[164] TOP500 List. Piz Daint, 2015. URL http://www.top500.org/system/177824.

[165] Wikipedia. Swiss National Supercomputing Centre, 2015. URL https://en.wikipedia.org/wiki/Swiss_National_Supercomputing_Centre.

[166] Swiss National Supercomputing Centre, 2015. URL http://www.cscs.ch/.

[167] TOP500 List. Shaheen II, 2015. URL http://www.top500.org/system/178515.

[168] KAUST Supercomputing Laboratory. Facilities - Shaheen II, 2015. URL http://ksl.kaust.edu.sa/Pages/shaheen2.aspx.

[169] TOP500 List. Stampede, 2015. URL http://www.top500.org/system/177931.

[170] Wikipedia. Texas Advanced Computing Center, 2015. URL https://en.wikipedia.org/wiki/Texas_Advanced_Computing_Center.

[171] TOP500 List. JUQUEEN, 2015. URL http://www.top500.org/system/177722.

[172] Wikipedia. Forschungszentrum Jülich, 2015. URL https://en.wikipedia.org/wiki/Forschungszentrum_Jlich.

[173] TOP500 List. Vulcan, 2015. URL http://www.top500.org/system/177732.

[174] Blaise Barney. Using the Sequoia and Vulcan BG/Q Systems, 2015. URL https://computing.llnl.gov/tutorials/bgq/.

[175] Argonne National Laboratory. Aurora, 2015. URL http://aurora.alcf.anl.gov/.

[176] Forschungszentrum Jülich. JURECA, 2015. URL http://www.fz-juelich.de/ias/jsc/EN/Expertise/Supercomputers/JURECA/JURECA_node.html.

[177] US Department of Energy. Department of Energy Awards $425 Million in Next Generation Supercomputing Technologies, 2014. URL http://science.energy.gov/~/media/_/pdf/news/111414-CORAL.

[178] Don Johnston. Next-generation supercomputer coming to Lab, 2014. URL https://www.llnl.gov/news/next-generation-supercomputer-coming-lab.


[179] IBM. U.S. Department of Energy Selects IBM Data Centric Systems to Advance Research and Tackle Big Data Challenges, 2014. URL http://www-03.ibm.com/press/us/en/pressrelease/45387.wss.

[180] Los Alamos National Laboratory. Trinity - Technical Specifications, 2015. URL http://www.lanl.gov/projects/trinity/specifications.php.

[181] National Energy Research Scientific Computing Center. Cori, 2015. URL https://www.nersc.gov/users/computational-systems/cori/.

[182] Nicole Hemsoth. Inside China’s Next Generation DSP Supercomputer Accelerator, 2015. URL http://www.theplatform.net/2015/07/15/inside-chinas-next-generation-dsp-supercomputer-accelerator/.

[183] Oak Ridge National Laboratory. Summit, 2015. URL https://www.olcf.ornl.gov/summit/.

[184] RIKEN Advanced Institute for Computational Science. Exascale Supercomputer Project launched, 2014. URL http://www.aics.riken.jp/en/exe_project/Exascale-Supercomputer-Project-launched.html.

[185] Faith Singer-Villalobos. TACC Continues Legacy of Lonestar Supercomputers, 2015. URL https://www.tacc.utexas.edu/-/tacc-continues-legacy-of-lonestar-supercomputers.
