HPC @ NCI: Deployment, Monitoring, Management, Uniformity
Dr Muhammad Atif
Manager HPC Systems and Cloud Services
Agenda
• About NCI
• Infrastructure
• Cluster Deployment
• Management
• Monitoring
• Uniformity Tests & Benchmarks (brief, as we are in an open tender)
NCI: an overview
Mission: World-class, high-end computing services for Australian research and innovation
What is NCI:
• Australia’s most highly integrated e-infrastructure environment
• Petascale supercomputer + highest performance research cloud + highest performance storage in the southern hemisphere
• Comprehensive and integrated expert service
• National/internationally renowned support team
NCI is national and strategic:
• Driven by national research priorities and excellence
• Engaged with research institutions/collaborations and industry
• A capability beyond the capacity of any single institution
• Sustained by a collaboration of agencies/universities
NCI is important to Australia because it:
• Enables research that otherwise would be impossible
• Enables delivery of world-class science
• Enables interrogation of big data, otherwise impossible
• Enables high-impact research that matters; informs public policy
• Attracts and retains world-class researchers for Australia
• Catalyses development of young researchers’ skills
[Diagram: the NCI service stack. Research Objectives feed through Infrastructure (Compute HPC/Cloud, Storage/Network), Integration, HPC Services / Virtual Laboratories / Data-intensive Services, and Expertise (Support and Development) to Communities and Institutions (Access and Services), producing Research Outcomes.]
Our Partners
Supports the full gamut of research: pure, strategic, applied, industry
• Pure: fundamental sciences; mathematics, physics, chemistry, astronomy; ARC Centres of Excellence (ARCCSS, CAASTRO, CUDOS)
• Strategic: research with an intended strategic outcome; environmental, medical, geoscientific; e.g., energy (UNSW), food security (ANU), geosciences (Sydney)
• Applied/industry: supporting industry and innovation; e.g., the ANU/UNSW startup Lithicon, sold for $76M to the US company FEI in 2014; a multinational miner
• Public policy: informing public policy with real economic impact; climate variation, next-generation weather forecasting, disaster management (CoE, BoM, CSIRO, GA)
Current Infrastructure: comprehensive, integrated and research-directed
Cloud
• Dell: 3,200 Intel Xeon cores; 56G Ethernet; 25 TB memory; 160 TB SSD; 1 PB of Ceph
• 100G Ethernet being deployed
Storage
• Global storage system, InfiniBand-connected to both the supercomputer and the cloud
• 20 PB raw disk; 22.6 PB uncompressed tape (dual-site redundant)
Compute
• Current supercomputer, Raijin: Fujitsu Primergy cluster + Agility system
• 83,872 cores, 4,519 compute nodes
• Initially 57,472 cores (Intel Xeon Sandy Bridge, 2.6 GHz) across 3,592 nodes
• Added a GPU queue with a total of 720 cores and 120 K80s
• Added a KNL queue with a total of 2,048 cores
• Added 22,792 cores (Intel Xeon Broadwell, 2.6 GHz)
• ~330 TB main memory
• FDR InfiniBand interconnect, plus EDR for the Agility system
• ~10 PB dedicated scratch storage (150 GB/s bandwidth)
• Additional high-performance filesystems for persistent storage:
– gdata1a: 12 PB @ 50 GB/s
– gdata1b: 12 PB @ 70 GB/s
– gdata2: 6.7 PB @ 80 GB/s
– gdata3: 7.7 PB @ 120 GB/s
NCI: comprehensive and integrated, quality and innovation
• Services and Technologies (~30 staff)
– Operations: robust, expert, secure (20 staff)
– HPC
• Expert user support (9)
• Largest research software library in Australia (300+ applications in all fields)
– Cloud
• High-performance: VMs, clusters
• Secure, high-performance filesystem, integrated into the NCI workflow environment
– Storage
• Active (high-performance Lustre parallel) and archival (dual-copy HSM tape)
• Partner shares; collections; partner dedicated
– Staff Scientists
• Domain experts
• Research Engagement and Innovation (~20 staff)
– HPC and Data-Intensive Innovation
• Upscaling priority applications (e.g., the Fujitsu-NCI collaboration on ACCESS)
• Bioinformatics pipelines (APN, melanoma, human genome)
– Virtual Environments
• Climate/Weather, All-sky Astrophysics, Geophysics, etc. (NeCTAR)
– Data Collections
• Management, publication, citation; strong environmental focus, plus others
– Visualisation
• Drishti, Voluminous, interactive presentations
NCI Systems Connectivity
[Diagram: two 56 Gb FDR IB fabrics, one for /g/data (serving /g/data1(a&b) 12+12 PB, /g/data2 ~6.5 PB, /g/data3 ~7.3 PB) and one for Raijin (serving /short 7.6 PB and /home, /system, /images, /apps), bridged by the NCI data movers. Massdata archive (cache 1.0 PB, tape 12.3 PB) with a link to the Huxley DC. Cloud: VMware and private OpenStack, plus OpenStack NeCTAR on 3-way-replicated Ceph (NeCTAR 0.5 PB, Tenjin 0.5 PB). Raijin compute and login/data-mover nodes reach the Internet over 10 GigE.]
Network
• Raijin (Fujitsu): FDR IB in a full fat tree (FFT); FFT of 36x E/FDR edge switches; Raijin director switches (Fujitsu)
• Agility system (Lenovo/Xenon): EDR IB in a 2:1 blocking fat tree (BFT); 24 compute nodes per switch with 12 uplinks to the core switches
• Lustre network switches (Lnets-g1, Lnets-g2, Lnets-g3, Lnets-ga1) for gdata1a, gdata1b, gdata2, gdata3 and /short, /homsys, /imapps
• A single OpenSM (with failovers); routing is based on host names
• Raijin management servers (Fujitsu); Raijin login/data-mover nodes (Fujitsu/Lenovo)
• Also on the fabric: GPUs, KNLs, POWER8 GPUs, 1 TB nodes, 3 TB nodes, FusionIO, ARM
Heterogeneous Cluster!
Queue                     Vendor          CPU SKU                  Clock (GHz)  Nodes          Count
copyq                     Fujitsu         Xeon E5-2670             2.60         r-dm{1..6}     6
test cluster (32 GB)      Fujitsu/Lenovo  Xeon E5-2670             2.60         r{1..36}       36
normal (32 GB)            Fujitsu         Xeon E5-2670             2.60         r{37..2395}    2,359
normal (64 GB)            Fujitsu         Xeon E5-2670             2.60         r{2396..3520}  1,125
normal (128 GB)           Fujitsu         Xeon E5-2670             2.60         r{3521..3592}  72
gpu P100                  Xenon           Xeon E5-2650 v4          2.20         r{3594..3595}  2
gpu Haswell (128/256 GB)  Dell EMC        Xeon E5-2670 v3          2.30         r{3596..3609}  14
knl (192 GB)              SGI/HPE         Xeon Phi 7230            1.30         r{3610..3641}  32
gpu Broadwell (128 GB)    Dell EMC        Xeon E5-2690 v4          2.60         r{3642..3657}  16
normalsp (Broadwell)      Fujitsu         Xeon E5-2697A v4         2.60         r{3682..3701}  20
megamem (3 TB)            Fujitsu         Xeon E7-4809 v4          2.10         r{3702..3705}  4
normalbw                  Lenovo          Xeon E5-2690 v4          2.60         r{3706..4509}  804
hugemem                   Lenovo          Xeon E5-2690 v4          2.60         r{4510..4519}  10
POWER8 + P100             IBM             POWER8                   -            r{5000..5001}  2
POWER8                    IBM             POWER8                   -            r{5002..5003}  2
arm                       Cavium/Cray     ThunderX2 (pre-release)  2.60         r{5010..5011}  2
Timeline of floating-point capacity
• 6 May 2013: pre-production; 17 June 2013: general availability. Fujitsu Raijin, 1,200 TF
• 13 Jan 2016, Dell: 2x GPU nodes (Haswell + K80)
• 4 May 2016, Dell: 12x GPU nodes (Haswell + K80); the Haswell + K80 additions total +175 TF
• 21 Sep 2016, SGI: 32x KNL, +85 TF
• 28 Oct 2016, Dell: 16x GPU nodes (Broadwell + K80), +202 TF
• 11 Jan 2017, Lenovo: Agility cluster, including 10x 1 TB nodes, +948 TF
• 3 Mar 2017, Xenon: 2x GPU nodes (Broadwell + P100), +44 TF
• 27 Mar 2017, Fujitsu: Broadwell nodes, +27 TF
• 28 July 2017, Fujitsu/Dell: 3 TB (Broadwell/Ivy Bridge) nodes, +4 TF
• IBM POWER8 + P100: part of the test cluster
HPC Stats on a page!
Live Stats available at https://www.nci.org.au
Job Distribution
Live Stats available at https://www.nci.org.au
What you cannot measure, you cannot manage.
Heterogeneity and /apps
drwxr-xr-x 6 aab900 z00 4096 Jul 16 2013 4.6.2
drwxr-xr-x 6 aab900 z00 4096 Aug 7 2013 4.6.3
drwxr-xr-x 6 aab900 z00 4096 Aug 20 2013 4.0.7
drwxr-xr-x 6 aab900 z00 4096 May 8 2014 4.6.5
drwxr-xr-x 7 aab900 z00 4096 Sep 11 2014 5.0.1
drwxr-xr-x 7 aab900 z00 4096 Jan 19 2015 5.0.4
drwxr-xr-x 6 aab900 z00 4096 Sep 7 2015 5.0.5
drwxr-xr-x 6 aab900 z00 4096 Oct 22 2015 5.1.0
drwxr-xr-x 6 aab900 z00 4096 Oct 22 2015 5.1.0-plumed
drwxr-xr-x 6 aab900 z00 4096 Oct 22 2015 5.1.0-gpu
drwxr-xr-x 6 aab900 z00 4096 Feb 11 2016 5.1.2-gpu
drwxr-xr-x 6 aab900 z00 4096 Feb 12 2016 5.1.2-plumed
drwxr-xr-x 6 aab900 z00 4096 Feb 12 2016 5.1.2
drwxr-xr-x 6 aab900 z00 4096 Aug 3 2016 5.1.3
drwxr-xr-x 6 aab900 z00 4096 Aug 3 2016 5.1.3-test
drwxr-xr-x 6 aab900 z00 4096 Sep 22 2016 5.1.3-knl
drwxr-xr-x 6 aab900 z00 4096 Nov 3 2016 2016.1
drwxr-xr-x 6 aab900 z00 4096 Nov 7 2016 2016.1-gpu
drwxr-xr-x 6 aab900 z00 4096 Nov 8 2016 2016.1-knl
drwxr-xr-x 6 aab900 z00 4096 May 3 10:22 2016.3
drwxr-xr-x 6 aab900 z00 4096 May 3 16:16 2016.3-gpu
drwxr-xr-x 6 aab900 z00 4096 May 15 10:34 2016.3-gpupascal
drwxr-xr-x 25 aab900 z00 4096 May 24 12:56 .
drwxr-xr-x 6 aab900 z00 4096 May 24 12:56 2016.1-gpupascal
• The module environment needs to be extended
• Fat binaries on Linux, anyone?
• Seamless switching based on the current architecture; we have some ideas (a sketch follows this list)
• CI: qualifying applications is becoming a challenge
• Separate builds optimized for Sandy Bridge, Broadwell, KNL, Skylake, K80, P100, Volta... and we forgot POWER8, and maybe ARM, POWER9 and AMD
• Singularity!
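To make the seamless-switch idea concrete, here is a minimal sketch (not NCI's actual implementation) of a launcher that inspects the node's CPU flags and dispatches to a matching build; the /apps layout and the gmx_mpi binary name are hypothetical:

```python
#!/usr/bin/env python3
# Hypothetical sketch: pick an architecture-specific build of an
# application at launch time, so one wrapper serves the whole
# heterogeneous cluster. The layout under APP_ROOT is invented.
import os
import subprocess
import sys

def cpu_flags():
    """Return the CPU flag set from /proc/cpuinfo (Linux only)."""
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()

def best_arch(flags):
    # Prefer the newest ISA the node supports; fall back to a
    # lowest-common-denominator build.
    if "avx512f" in flags:
        return "skylake"        # (KNL also reports avx512f)
    if "avx2" in flags:
        return "broadwell"
    if "avx" in flags:
        return "sandybridge"
    return "generic"

APP_ROOT = "/apps/gromacs/5.1.3"            # hypothetical install layout
binary = os.path.join(APP_ROOT, best_arch(cpu_flags()), "bin", "gmx_mpi")
sys.exit(subprocess.call([binary] + sys.argv[1:]))
```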
Raijin – Extremely stable as we control the entire SW stack
Summary for Q4 2016
Period 01/10/2016 00:00:00 - 31/12/2016 23:59:59
Total hours: 2208
Downtime (unannounced): 00:00 (hh:mm)
Downtime (announced): 00:00 (hh:mm)
Uptime (excluding announced): 100.00%

Summary for 2016
Period 01/01/2016 00:00:00 - 31/12/2016 23:59:59
Total hours: 8784
Downtime (unannounced): 11:23 (hh:mm) [a power failure and a cooling failure]
Downtime (announced): 28:15 (hh:mm)
Uptime (excluding announced): 99.87%

Summary for Q1 2017
Period 01/01/2017 00:00:00 - 31/03/2017 23:59:59
Total hours: 2160
Downtime (unannounced): 0:00 (hh:mm)
Downtime (announced): 148:00 (hh:mm)
Uptime (excluding announced): 100.00%

Summary for Q2 2017
Period 01/04/2017 00:00:00 - 30/06/2017 23:59:59
Total hours: 2160
Downtime (unannounced): 0:00 (hh:mm)
Downtime (announced): 11:30 (hh:mm)
Uptime (excluding announced): 100.00%
So, how to manage a system at this scale
• Custom deployment
– We tested a number of vendor solutions; they are all quite ordinary
– Minimal downtime is the key!
• Custom monitoring
– See above!
• Custom management
– Automation
• Custom scheduling
– Altair kept a separate branch for NCI due to our strict requirements
– All of those requirements are gradually being made available to the rest of the world
• Extensive benchmarking and uniformity tests
– Per minute, daily, weekly, on demand, and at scheduled maintenance
– After any minor or major change to the cluster
Hire the best!
Cluster Deployment!
• Custom cluster deployment based on open-source tools
– oneSIS, conman, powerman, etc.
• All x86 hardware platforms use one system image and kernel
– Sandy Bridge, Ivy Bridge, Haswell, Broadwell, KNL, NVIDIA GPUs
– Fujitsu, Lenovo, Dell
– Separate images for POWER8 and ARM (non-x86)
• Rolling reboots: no cluster downtime required for patching
• Fine-grained control over hardware architecture
Cluster Deployment [oneSIS]
• oneSIS: an open-source software package for diskless cluster management (http://onesis.org/)
– A mesh of symlinks; difficult to set up initially, and requires a custom initramfs
• All production compute nodes, login nodes and data-mover nodes see the same image from Lustre, e.g. /imapps/Images/NCI/centos-6-X-N
– Create classes, e.g. compute.r, compute.r.testcluster, compute.r.login, compute.r.gpu, compute.knl, etc. (see the resolution sketch after the diagram below)
• Run certain services on login nodes but not on compute nodes and data movers
– Read-only image
• Root-on-Lustre and rsync-root-to-RAM (NCI-specific modifications to oneSIS). Do not install an OS on the node; create one image!
– IB and Lustre modules in the initramfs; 64-bit BusyBox; custom distro patch
– Full root, half root, or rsync of the OS to ramfs
– Nodes boot via TFTP and switch to the Lustre root
• Any change is seen instantaneously by all nodes
– e.g., yum updates or a configuration file change
[Diagram: a management node rsyncs the golden image to Lustre:/images/centos-6.X-N; the OS is bare minimum, and nodes either run root-on-Lustre or rsync it to RAM. oneSIS symlinks resolve per-class files: /etc/fstab on r1000 and r3592 resolves to /etc/fstab.compute.r, while on raijin1 it resolves to /etc/fstab.compute.r.login.]
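To illustrate how the symlink mesh resolves per-class files, a small sketch; the class patterns below are illustrative, not NCI's actual class map:

```python
#!/usr/bin/env python3
# Sketch of oneSIS-style per-class file resolution: the shared image
# carries /etc/fstab.<class> variants, and a node's symlink resolves
# to the most specific class matching its hostname.
import fnmatch

CLASSES = [                       # most specific pattern first (hypothetical)
    ("raijin*",  "compute.r.login"),
    ("r-dm*",    "compute.r.dm"),
    ("r[0-9]*",  "compute.r"),
]

def resolve(hostname, filename="/etc/fstab"):
    for pattern, cls in CLASSES:
        if fnmatch.fnmatch(hostname, pattern):
            return f"{filename}.{cls}"
    return filename               # generic file if no class matches

for host in ("r1000", "r3592", "raijin1"):
    print(host, "->", resolve(host))
# r1000   -> /etc/fstab.compute.r
# r3592   -> /etc/fstab.compute.r
# raijin1 -> /etc/fstab.compute.r.login
```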
Cluster Updates
[Diagram: each management node (man1, man2, man7) holds /var/lib/oneSIS images (e.g. Centos-6.5-XX and Centos-6.5-YY), synced to Lustre:/imapps/images/NCI/; a daily HSM backup keeps every oneSIS image by date. The test and production clusters both run root-on-Lustre, e.g. production on image XX while YY is staged.]
• Rsync a copy of the new OS
• Boot the test cluster with the image from the previous step
• Chroot and do a yum update
• Rsync the new image to Lustre:/imapps/images
• Perform uniformity tests and the set of benchmarks we have identified
• If all clear, offline all nodes (running jobs continue)
• A daemon reboots each node into the new image when the jobs on it finish (see the sketch below)
• Kernel updates: yum update, then build the initramfs; same for MOFED and Lustre updates
• Notifications for every new OS kernel released; a critical security update that affects NCI takes top priority
• Rolling reboot without users noticing
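A minimal sketch of the reboot daemon's logic, assuming standard PBS commands (pbsnodes -o to drain) and SSH access; the real daemon is NCI-internal and far more careful:

```python
#!/usr/bin/env python3
# Sketch of a rolling-reboot daemon: offline nodes in PBS so no new
# jobs land, then reboot each node into the new image once its last
# job finishes.
import subprocess
import time

def is_idle(node):
    """True if pbsnodes reports no jobs on the node."""
    out = subprocess.run(["pbsnodes", node],
                         capture_output=True, text=True).stdout
    for line in out.splitlines():
        if line.strip().startswith("jobs ="):
            return line.split("=", 1)[1].strip() == ""
    return True                   # no jobs attribute at all

def rolling_reboot(nodes):
    for node in nodes:
        subprocess.run(["pbsnodes", "-o", node])   # drain: no new jobs
    pending = set(nodes)
    while pending:
        for node in sorted(pending):
            if is_idle(node):                      # last job finished
                subprocess.run(["ssh", node, "reboot"])
                pending.discard(node)
        time.sleep(60)

if __name__ == "__main__":
    rolling_reboot([f"r{i}" for i in range(1, 37)])  # e.g., test cluster first
```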
Raijin Kernel [Custom based on CentOS 7.X]
• CentOS 6.X user-land with a CentOS 7.X kernel
– We normally do not upgrade the OS for the life of the system, which is measured in years
– Not using the CentOS 6 kernel due to missing features; we compile our own from sources provided by CentOS 7
• Control-group fixes
– With low memory on one NUMA node, the kernel would allocate a block of memory from the other NUMA node, and that allocation was not under the cgroup; it then grows out of control and OOMs. Still not fixed in the CentOS 6.9 kernel.
• Fully tickless kernel [the first HPC facility to go to production with this feature]
– CentOS 6: CONFIG_NO_HZ, timer tick disabled when the CPU is idle
– New in CentOS 7: CONFIG_NO_HZ_FULL, timer tick also disabled when a single runnable process is on the CPU; ideal for HPC workloads
– Improved single-node HPL by 1%
• MOFED packages for the CentOS 6 kernel were (getting a bit) buggy
• Better support for KNL
– We only want to run one kernel
– Previously we locally patched the RHEL 7 kernel, but with the 500-series kernel Red Hat has done the port
– We can skip a kernel version if not affected by bugs or security issues
– Seamless rolling reboot of the compute nodes after testing (a verification sketch follows)
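As a sketch of how a per-node check might confirm the intended tickless kernel after a rolling reboot; the paths are standard Linux, and the expected version prefix is invented:

```python
#!/usr/bin/env python3
# Sketch: confirm a node runs the intended custom kernel with the
# fully tickless option enabled.
import gzip
import os
import platform

EXPECTED_PREFIX = "3.10."         # hypothetical CentOS 7-series kernel

def kernel_config(release):
    path = f"/boot/config-{release}"
    if os.path.exists(path):
        with open(path) as f:
            return f.read()
    with gzip.open("/proc/config.gz", "rt") as f:   # if the kernel exposes it
        return f.read()

release = platform.release()
cfg = kernel_config(release)
assert release.startswith(EXPECTED_PREFIX), f"unexpected kernel {release}"
assert "CONFIG_NO_HZ_FULL=y" in cfg, "kernel is not fully tickless"
print(f"{release}: CONFIG_NO_HZ_FULL=y confirmed")
```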
Raijin [Lustre]
• Lustre updates
– The Lustre server on Raijin is 2.5++
– Lustre 2.5, IEEL 2 based, with NCI's list of applied patches; extremely stable
– [Chart of the last 12 months shown on the slide]
Monitoring
• Monitoring at scale
– More than 4,500 machines (compute, data movers, Lustre servers, logins, management): losing count!
– 6 IB director switches, 208 IB leaf switches (Raijin), plus 2 super-spine and 33 leaf switches for the Lenovo system
– 1 Ethernet core and 106 leaf switches, plus 24 leaf switches for the Lenovo system
– 5 DDN SFA12K (10 controllers), 4,200 disks (Raijin native; we are not talking about gdata here)
• homesys and imapps: 900 GB SAS; /short: 3 TB SATA (near-line SAS)
– Not talking about our cloud here; it has a similar setup and is actively being unified with HPC
– Hybrid of ELK, Ganglia and OpenTSDB; we will move to OpenTSDB + ELK for the next system
• Email, SMS and Slack integration
– No point in monitoring if you cannot automate
– Human input is still required for new events; our monitoring is not sentient, yet
Monitoring
• Custom scripts and cron jobs send emails, Slack bot messages and SMS alerts
– Monitoring scripts and daemons check the health of compute and Lustre services
• At varied intervals: 1 minute, 15 minutes, 30 minutes, hourly, 2-hourly, daily, and some continuously
– Before running any job, we run our check_node script (a minimal sketch follows this list)
• CPU, memory, filesystems, services and a lot more, and it runs in ~0.15 seconds
– It also checks kernel, oneSIS image and firmware versions when necessary
• Adding a random after-job full health check with uniformity tests
– After N job runs, offline the node, perform uniformity benchmarks, and put it back into production
– The same script runs every 2 hours and does not cause noticeable jitter
– The check_node script grows with each new case we encounter that can cause a job or node failure
• Take-away: automatically offline nodes upon failure condition(s)
• Do not dispatch jobs to nodes that are suspect, faulty or flagged for update
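A minimal sketch in the spirit of check_node; the real script covers far more cases (services, firmware, oneSIS image, etc.), and all thresholds below are invented:

```python
#!/usr/bin/env python3
# Sketch of a pre-job node health check: fast, read-only, and a
# non-zero exit signals the scheduler to offline the node.
import os
import shutil
import sys

failures = []

def check(name, ok):
    if not ok:
        failures.append(name)

# CPU count matches what the scheduler believes this node type has.
check("cpu_count", os.cpu_count() >= 16)

# Enough available memory to start a job (hypothetical 1 GiB floor).
with open("/proc/meminfo") as f:
    mem = {l.split(":")[0]: int(l.split()[1]) for l in f if ":" in l}
check("memory", mem.get("MemAvailable", 0) > 1024 * 1024)

# Required filesystems are mounted and have free space.
for fs in ("/short", "/home", "/apps"):
    mounted = os.path.ismount(fs)
    check(f"mount:{fs}", mounted)
    if mounted:
        check(f"space:{fs}", shutil.disk_usage(fs).free > 0)

if failures:
    print("UNHEALTHY:", ",".join(failures))
    sys.exit(1)        # non-zero exit: offline the node, do not start the job
print("OK")
```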
Monitoring
• Temperature monitoring via BMC
– Continuous monitoring: temperature warnings, automated suspend and shutdown on the cluster
– Heat map of the entire cluster, to find hotspots!
• Ganglia spoofing
– For CPU; custom Ganglia metrics for Lustre, IB and GPU monitoring
– [Ganglia will be replaced with OpenTSDB in the next system, fed by a custom daemon; the PoC is ready]
• OpenSM monitoring and SMS triggers based on a custom daemon
– Switch, HCA and cable faults are picked up and flagged instantaneously
• ELK stack
– Logs from the entire cluster (compute nodes, Lustre servers, cloud, management, PBS server) are sent to the ELK server
– Custom notifications are generated (email, SMS), e.g. too many requeues for a single job (a sketch follows this list)
• LBUGs, IB errors
• Job monitoring and dashboard
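A sketch of the requeue alert, reading scheduler log lines from stdin; the log pattern and the Slack webhook URL are placeholders, and the production pipeline queries the ELK stack rather than tailing a file:

```python
#!/usr/bin/env python3
# Sketch: count requeue events per job id and alert once a job
# crosses the threshold.
import json
import re
import sys
import urllib.request
from collections import Counter

SLACK_WEBHOOK = "https://hooks.slack.com/services/XXX"   # placeholder
THRESHOLD = 3

requeues = Counter()
pattern = re.compile(r"Job;(?P<jobid>\S+);.*requeu", re.IGNORECASE)

for line in sys.stdin:            # e.g., tail -F of the server logs
    m = pattern.search(line)
    if not m:
        continue
    jobid = m.group("jobid")
    requeues[jobid] += 1
    if requeues[jobid] == THRESHOLD:          # alert once per job
        payload = {"text": f"Job {jobid} requeued {THRESHOLD} times"}
        urllib.request.urlopen(urllib.request.Request(
            SLACK_WEBHOOK,
            data=json.dumps(payload).encode(),
            headers={"Content-Type": "application/json"},
        ))
```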
Monitoring
• Automated fault categorisation and reporting
– We hate doing repeated work
– Based on years of experience, common hardware failures result in:
• Automatic node offline
• Diagnostics pulled (upon meeting certain conditions) and the error logged with Fujitsu
– Picked up by the check_node script; the node is offlined and a ticket is created automatically
» e.g., failed memory, an HCA throwing errors
• Working with Lenovo on a similar method
Scheduling
• Lots of custom hooks for NCI (a sketch of one follows this list)
– Extensive use of cgroups, even on GPU nodes
– Allocation management
• Hyperthreads are on at all times, controlled via cgroups; users can request HT via the -l hyperthread flag
• Fair-share
– A custom daemon calculates job priority
• Per project, per running CPUs of a project, ageing
– Tightly coupled with alloc-csv{db}
• Custom flags
– Filesystem up and down states
• e.g., /g/data3 via -l other=gdata3
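For flavour, a sketch of what such a hook can look like; PBS Professional hooks are written in Python against the pbs module, but the "other" resource handling and the down-list here are illustrative, not NCI's actual hook:

```python
# Sketch of a PBS Professional queuejob hook: reject jobs that request
# a filesystem currently flagged as down.
import pbs

DOWN = {"gdata3"}        # would be maintained by the monitoring system

e = pbs.event()
requested = str(e.job.Resource_List["other"] or "")

for fs in requested.split(":"):
    if fs in DOWN:
        # reject() terminates the hook and refuses the job.
        e.reject("Filesystem %s is offline; please resubmit later" % fs)

e.accept()
```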
Monitoring File-systems
• User perspective: we have been untarring the Linux kernel source for years
• Triggers based on slow untar speeds and hung untars (OST issues); each filesystem has its own threshold based on MDT performance
• Push notifications via email and Slack bot (a sketch follows)
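A sketch of the untar probe; the tarball location, filesystem list and thresholds are all illustrative:

```python
#!/usr/bin/env python3
# Sketch: time the extraction of a kernel tarball on each filesystem
# and alert when it exceeds a per-filesystem threshold or hangs.
import subprocess
import tempfile
import time

TARBALL = "/apps/benchmarks/linux-4.12.tar.xz"   # hypothetical location
THRESHOLDS = {                                   # seconds, per filesystem
    "/short": 60,
    "/g/data1a": 90,
    "/g/data3": 45,
}

def untar_time(target_dir):
    start = time.time()
    subprocess.run(["tar", "-xf", TARBALL, "-C", target_dir],
                   check=True, timeout=600)      # hung untar raises
    return time.time() - start

for fs, limit in THRESHOLDS.items():
    with tempfile.TemporaryDirectory(dir=fs) as d:
        try:
            elapsed = untar_time(d)
        except subprocess.TimeoutExpired:
            print(f"ALERT {fs}: untar hung (>600 s)")
            continue
        status = "ALERT" if elapsed > limit else "ok"
        print(f"{status} {fs}: untar took {elapsed:.1f} s (limit {limit} s)")
```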
Monitoring File-systems
[Four slides of filesystem-monitoring dashboard screenshots]
NCI Dashboard (Not open to users currently)
Images courtesy Dr. Lei Shang
Job Trends
User Statistics
*Permission was obtained from the user.
Project Stats
Inefficient Jobs
User is contacted
Running job per node comparison.
Benchmarks – system uniformity perspective
• Micro and application benchmarks
• Micro-benchmarks
– Per node: STREAM, HPL, MOST
– Sets of nodes: OSU (bw, latency, all_to_all, barrier, reduce)
• Application benchmarks
– A lot, and we keep adding more; each benchmark is thoroughly profiled using IPM, Allinea, etc.
– e.g., UM, NAMD (x86, GPU), Bowtie2, nvBowtie, QE, Q-Chem, Gaussian (not publishing results), OpenFOAM, CCAM, QCD, S3D, Gromacs, etc.
• Methodology (a sketch follows this list)
– Scripts submit the job (or run it independently of the scheduler); results are recorded in a DB
– Run at least 5 times, throw away the min and the max, and take the average
– A stdev of 5% is allowed; moving to 3%
– This can change based on what we are trying to achieve
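The methodology reduces to a few lines; a sketch, with made-up example numbers:

```python
#!/usr/bin/env python3
# Sketch of the stated methodology: >=5 runs per node, discard the min
# and the max, average the rest, and flag the node if its spread or its
# deviation from the agreed fleet-wide average exceeds tolerance.
import statistics

def uniformity_check(runs, agreed_avg, tolerance=0.05):
    """runs: one node's results (>=5); agreed_avg: fleet-wide reference."""
    assert len(runs) >= 5, "need at least 5 runs"
    trimmed = sorted(runs)[1:-1]               # throw away min and max
    avg = statistics.mean(trimmed)
    spread_ok = statistics.stdev(trimmed) / avg <= tolerance
    # +/- tolerance/2 around the agreed average (i.e. a 5% total window).
    within = abs(avg - agreed_avg) / agreed_avg <= tolerance / 2
    return avg, spread_ok and within

# Example: STREAM results (MB/s) from one node vs. the agreed average.
avg, ok = uniformity_check([60100, 60350, 60420, 59980, 60210], 59500.0)
print(f"trimmed avg {avg:.0f} -> {'PASS' if ok else 'FAIL'}")
```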
Benchmarks – system uniformity perspective
--- streamBenchmark -----------------------------------
Statistics:
-----------
Agreed Avg: 59500.00 Test Avg: 60199.35 Median: 60347.77 StDev: 1595.85 Min: 42495.67 Max: 60498.17 Total: 991
Results are based on 59500.00
Passing list is between -2.5% to +2.5% (i.e. 5%)
Value < -100% [0.00] ( 0) []
Between -100.0% and -50.0% [0.00 and 29750.00] ( 0) []
Between -50.0% and -20.0% [29750.00 and 47600.00] ( 8) [1641, 1642, 1643, 1644, 1841, 1842, 1843, 1844]
Between -20.0% and -10.0% [47600.00 and 53550.00] ( 0) []
Between -10.0% and -5.0% [53550.00 and 56525.00] ( 0) []
Between -5.0% and -2.5% [56525.00 and 58012.50] ( 0) []
Between -2.5% and -1.0% [58012.50 and 58905.00] ( 0) []
Between -1.0% and 1.0% [58905.00 and 60095.00] ( 3) [1604, 1772, 2252]
Between 1.0% and 2.5% [60095.00 and 60987.50] ( 980) [1500, 1501, 1502, 1504, ... 2498, 2499, 2500: the remaining 980 nodes in the r1500-r2500 range; full list abridged]
Between 2.5% and 5.0% [60987.50 and 62475.00] ( 0) []
Between 5.0% and 10.0% [62475.00 and 65450.00] ( 0) []
Between 10.0% and 20.0% [65450.00 and 71400.00] ( 0) []
Between 20.0% and 50.0% [71400.00 and 89250.00] ( 0) []
Between 50.0% and 100.0% [89250.00 and 119000.00] ( 0) []
Value > 100% [119000.00] ( 0) []
Running job per node comparison.
[Two slides of per-node job comparison charts]
Benchmark Results
• NAMD (stmv_28 mem opt)
Results for ARM and AMD EPYC are not presented due to the open tender
Benchmark Results
Benchmark Results - QChem
• The first code ported by NCI to POWER8
– Q-Chem is similar to Gaussian and scales to multiple nodes
– SMT=2 gave the best results; SMT=8 gave the worst results (not shown)
Benchmark Results - MILC
[Chart: MILC multi-mass CG solver performance, phase 1. Average performance (GFlop/s, 0-100 scale) for the ob1, ucx and yalla MPI transports on Sandy Bridge, Broadwell and POWER8, at 1 node and 2 nodes]
Benchmark Results - MILC
[Chart: the same MILC multi-mass CG solver data (phase 1, GFlop/s) broken out per series: Sandy Bridge / Broadwell / POWER8, each with ob1 / ucx / yalla at 1 node and 2 nodes]
Some more details
[Chart: relative floating-point performance (GFLOP/s, 0-2.5 relative scale) of UM sections with MPI removed, for Sandy Bridge, Broadwell, Skylake (Cray) and Skylake (AWS)]
[Chart: communication fraction (0-0.9) by UM section (Total, Boundary Layer, Helmholtz, Gravity Waves, Wind Advection, Other Advection, Diagnostics/IO) for Raijin (Sandy Bridge), Raijin (Broadwell), Cray and AWS]
Take-aways
• Automation
– Repeated tasks are for machines, not humans
• One image, one kernel
– All nodes use the same root image and boot off the same kernel, which is optimised for the hardware
– Benchmark the OS kernel and the underlying libraries
– Go tickless!
• Data
– Supercomputer monitoring generates big data: harness it!
• Benchmarks
– Know your benchmarks; profile them in as much detail as possible
– Uniformity is critical
– Use automated scripts to run them and record the results
– First achieve uniformity, then go for speed!
• User
– The most important person; everything we do is to improve their workflow
Thank you