HPC @ NCI: Deployment, Monitoring, Management, Uniformity
Dr Muhammad Atif
Manager HPC Systems and Cloud Services
Agenda
• About NCI
• Infrastructure
• Cluster Deployment
• Management
• Monitoring
• Uniformity Tests & Benchmarks (brief, as we are in an open tender)
NCI: an overview
Mission: World-class, high-end computing services for Australian research and innovation
What is NCI:
• Australia’s most highly integrated e-infrastructure environment
• Petascale supercomputer + highest performance research cloud + highest performance storage in the southern hemisphere
• Comprehensive and integrated expert service
• National/internationally renowned support team
NCI is national and strategic:
• Driven by national research priorities and excellence
• Engaged with research institutions/collaborations and industry
• A capability beyond the capacity of any single institution
• Sustained by a collaboration of agencies/universities
NCI is important to Australia because it:
• Enables research that otherwise would be impossible
• Enables delivery of world-class science
• Enables interrogation of big data, otherwise impossible
• Enables high-impact research that matters; informs public policy
• Attracts and retains world-class researchers for Australia
• Catalyses development of young researchers’ skills
[Diagram: the NCI service stack. Research Objectives feed through Infrastructure (Compute HPC/Cloud, Storage/Network), Integration, HPC Services / Virtual Laboratories / Data-intensive Services, and Expertise (Support and Development) to Communities and Institutions (Access and Services), producing Research Outcomes.]
Our Partners
Supports the full gamut of research: pure, strategic, applied, industry
• Pure: fundamental sciences; mathematics, physics, chemistry, astronomy; ARC Centres of Excellence (ARCCSS, CAASTRO, CUDOS)
• Strategic: research with an intended strategic outcome; environmental, medical, geoscientific; e.g., energy (UNSW), food security (ANU), geosciences (Sydney)
• Applied/industry: supporting industry and innovation; e.g., the ANU/UNSW startup Lithicon, sold for $76M to the US company FEI in 2014; a multinational miner
• Public policy: informing public policy with real economic impact; climate variation, next-generation weather forecasting, disaster management (CoE, BoM, CSIRO, GA)
Current Infrastructure: comprehensive, integrated and research-directed
Cloud
• Dell: 3,200 Intel Xeon cores; 56G Ethernet; 25 TB memory; 160 TB SSD; 1 PB of Ceph
• 100G Ethernet being deployed
Storage
• Global storage system, InfiniBand-connected to both the supercomputer and the cloud
• 20 PB raw disk; 22.6 PB uncompressed tape (dual-site redundant)
Compute
• Current supercomputer, Raijin: Fujitsu Primergy cluster + Agility system
• 83,872 cores, 4,519 compute nodes
• Initially 57,472 cores (Intel Xeon Sandy Bridge, 2.6 GHz) across 3,592 nodes
• Added a GPU queue with a total of 720 cores and 120 K80s
• Added a KNL queue with a total of 2,048 cores
• Added 22,792 cores (Intel Xeon Broadwell, 2.6 GHz)
• ~330 TB main memory
• FDR InfiniBand interconnect, plus EDR for the Agility system
• ~10 PB dedicated scratch storage (150 GB/s bandwidth)
• Additional high-performance filesystems for persistent storage:
– gdata1a: 12 PB @ 50 GB/s
– gdata1b: 12 PB @ 70 GB/s
– gdata2: 6.7 PB @ 80 GB/s
– gdata3: 7.7 PB @ 120 GB/s
NCI: comprehensive and integrated, quality and innovation
• Services and Technologies (~30 staff)
– Operations: robust, expert, secure (20 staff)
– HPC
• Expert user support (9)
• Largest research software library in Australia (300+ applications in all fields)
– Cloud
• High-performance: VMs, clusters
• Secure, high-performance filesystem, integrated into the NCI workflow environment
– Storage
• Active (high-performance Lustre parallel) and archival (dual-copy HSM tape)
• Partner shares; collections; partner dedicated
– Staff Scientists
• Domain experts
• Research Engagement and Innovation (~20 staff)
– HPC and Data-Intensive Innovation
• Upscaling priority applications (e.g., the Fujitsu-NCI collaboration on ACCESS)
• Bioinformatics pipelines (APN, melanoma, human genome)
– Virtual Environments
• Climate/Weather, All-sky Astrophysics, Geophysics, etc. (NeCTAR)
– Data Collections
• Management, publication, citation; strong environmental focus, plus others
– Visualisation
• Drishti, Voluminous, interactive presentations
NCI Systems Connectivity
[Diagram: two 56 Gb FDR IB fabrics, one for /g/data (serving /g/data1(a&b) 12+12 PB, /g/data2 ~6.5 PB, /g/data3 ~7.3 PB) and one for Raijin (serving /short 7.6 PB and /home, /system, /images, /apps), bridged by the NCI data movers. Massdata archive (cache 1.0 PB, tape 12.3 PB) with a link to the Huxley DC. Cloud: VMware and private OpenStack, plus OpenStack NeCTAR on 3-way-replicated Ceph (NeCTAR 0.5 PB, Tenjin 0.5 PB). Raijin compute and login/data-mover nodes reach the Internet over 10 GigE.]
Network
• Raijin (Fujitsu): FDR IB in a full fat tree (FFT); FFT of 36x E/FDR edge switches; Raijin director switches (Fujitsu)
• Agility system (Lenovo/Xenon): EDR IB in a 2:1 blocking fat tree (BFT); 24 compute nodes per switch with 12 uplinks to the core switches
• Lustre network switches (Lnets-g1, Lnets-g2, Lnets-g3, Lnets-ga1) for gdata1a, gdata1b, gdata2, gdata3 and /short, /homsys, /imapps
• A single OpenSM (with failovers); routing is based on host names
• Raijin management servers (Fujitsu); Raijin login/data-mover nodes (Fujitsu/Lenovo)
• Also on the fabric: GPUs, KNLs, POWER8 GPUs, 1 TB nodes, 3 TB nodes, FusionIO, ARM
Heterogeneous Cluster!
Queue                     Vendor          CPU SKU                  Clock (GHz)  Nodes          Count
copyq                     Fujitsu         Xeon E5-2670             2.60         r-dm{1..6}     6
test cluster (32 GB)      Fujitsu/Lenovo  Xeon E5-2670             2.60         r{1..36}       36
normal (32 GB)            Fujitsu         Xeon E5-2670             2.60         r{37..2395}    2,359
normal (64 GB)            Fujitsu         Xeon E5-2670             2.60         r{2396..3520}  1,125
normal (128 GB)           Fujitsu         Xeon E5-2670             2.60         r{3521..3592}  72
gpu P100                  Xenon           Xeon E5-2650 v4          2.20         r{3594..3595}  2
gpu Haswell (128/256 GB)  Dell EMC        Xeon E5-2670 v3          2.30         r{3596..3609}  14
knl (192 GB)              SGI/HPE         Xeon Phi 7230            1.30         r{3610..3641}  32
gpu Broadwell (128 GB)    Dell EMC        Xeon E5-2690 v4          2.60         r{3642..3657}  16
normalsp (Broadwell)      Fujitsu         Xeon E5-2697A v4         2.60         r{3682..3701}  20
megamem (3 TB)            Fujitsu         Xeon E7-4809 v4          2.10         r{3702..3705}  4
normalbw                  Lenovo          Xeon E5-2690 v4          2.60         r{3706..4509}  804
hugemem                   Lenovo          Xeon E5-2690 v4          2.60         r{4510..4519}  10
POWER8 + P100             IBM             POWER8                   -            r{5000..5001}  2
POWER8                    IBM             POWER8                   -            r{5002..5003}  2
arm                       Cavium/Cray     ThunderX2 (pre-release)  2.60         r{5010..5011}  2
Timeline of floating-point capacity
• 6 May 2013: pre-production; 17 June 2013: general availability. Fujitsu Raijin, 1,200 TF
• 13 Jan 2016, Dell: 2x GPU nodes (Haswell + K80)
• 4 May 2016, Dell: 12x GPU nodes (Haswell + K80); the Haswell + K80 additions total +175 TF
• 21 Sep 2016, SGI: 32x KNL, +85 TF
• 28 Oct 2016, Dell: 16x GPU nodes (Broadwell + K80), +202 TF
• 11 Jan 2017, Lenovo: Agility cluster, including 10x 1 TB nodes, +948 TF
• 3 Mar 2017, Xenon: 2x GPU nodes (Broadwell + P100), +44 TF
• 27 Mar 2017, Fujitsu: Broadwell nodes, +27 TF
• 28 July 2017, Fujitsu/Dell: 3 TB (Broadwell/Ivy Bridge) nodes, +4 TF
• IBM POWER8 + P100: part of the test cluster
HPC Stats on a page!
Live Stats available at https://www.nci.org.au
Job Distribution
Live Stats available at https://www.nci.org.au
What you cannot measure, you cannot manage.
Heterogeneity and /apps
drwxr-xr-x 6 aab900 z00 4096 Jul 16 2013 4.6.2
drwxr-xr-x 6 aab900 z00 4096 Aug 7 2013 4.6.3
drwxr-xr-x 6 aab900 z00 4096 Aug 20 2013 4.0.7
drwxr-xr-x 6 aab900 z00 4096 May 8 2014 4.6.5
drwxr-xr-x 7 aab900 z00 4096 Sep 11 2014 5.0.1
drwxr-xr-x 7 aab900 z00 4096 Jan 19 2015 5.0.4
drwxr-xr-x 6 aab900 z00 4096 Sep 7 2015 5.0.5
drwxr-xr-x 6 aab900 z00 4096 Oct 22 2015 5.1.0
drwxr-xr-x 6 aab900 z00 4096 Oct 22 2015 5.1.0-plumed
drwxr-xr-x 6 aab900 z00 4096 Oct 22 2015 5.1.0-gpu
drwxr-xr-x 6 aab900 z00 4096 Feb 11 2016 5.1.2-gpu
drwxr-xr-x 6 aab900 z00 4096 Feb 12 2016 5.1.2-plumed
drwxr-xr-x 6 aab900 z00 4096 Feb 12 2016 5.1.2
drwxr-xr-x 6 aab900 z00 4096 Aug 3 2016 5.1.3
drwxr-xr-x 6 aab900 z00 4096 Aug 3 2016 5.1.3-test
drwxr-xr-x 6 aab900 z00 4096 Sep 22 2016 5.1.3-knl
drwxr-xr-x 6 aab900 z00 4096 Nov 3 2016 2016.1
drwxr-xr-x 6 aab900 z00 4096 Nov 7 2016 2016.1-gpu
drwxr-xr-x 6 aab900 z00 4096 Nov 8 2016 2016.1-knl
drwxr-xr-x 6 aab900 z00 4096 May 3 10:22 2016.3
drwxr-xr-x 6 aab900 z00 4096 May 3 16:16 2016.3-gpu
drwxr-xr-x 6 aab900 z00 4096 May 15 10:34 2016.3-gpupascal
drwxr-xr-x 25 aab900 z00 4096 May 24 12:56 .
drwxr-xr-x 6 aab900 z00 4096 May 24 12:56 2016.1-gpupascal
• The module environment needs to be extended
• Fat binaries on Linux, anyone?
• Seamless switching based on the current architecture; we have some ideas (a sketch follows this list)
• CI: qualifying applications is becoming a challenge
• Separate builds optimized for Sandy Bridge, Broadwell, KNL, Skylake, K80, P100, Volta... and we forgot POWER8, and maybe ARM, POWER9 and AMD
• Singularity!
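To make the seamless-switch idea concrete, here is a minimal sketch (not NCI's actual implementation) of a launcher that inspects the node's CPU flags and dispatches to a matching build; the /apps layout and the gmx_mpi binary name are hypothetical:

```python
#!/usr/bin/env python3
# Hypothetical sketch: pick an architecture-specific build of an
# application at launch time, so one wrapper serves the whole
# heterogeneous cluster. The layout under APP_ROOT is invented.
import os
import subprocess
import sys

def cpu_flags():
    """Return the CPU flag set from /proc/cpuinfo (Linux only)."""
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()

def best_arch(flags):
    # Prefer the newest ISA the node supports; fall back to a
    # lowest-common-denominator build.
    if "avx512f" in flags:
        return "skylake"        # (KNL also reports avx512f)
    if "avx2" in flags:
        return "broadwell"
    if "avx" in flags:
        return "sandybridge"
    return "generic"

APP_ROOT = "/apps/gromacs/5.1.3"            # hypothetical install layout
binary = os.path.join(APP_ROOT, best_arch(cpu_flags()), "bin", "gmx_mpi")
sys.exit(subprocess.call([binary] + sys.argv[1:]))
```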
Raijin – Extremely stable as we control the entire SW stack
Summary for Q4 2016
Period 01/10/2016 00:00:00 - 31/12/2016 23:59:59
Total hours: 2208
Downtime (unannounced): 00:00 (hh:mm)
Downtime (announced): 00:00 (hh:mm)
Uptime (excluding announced): 100.00%

Summary for 2016
Period 01/01/2016 00:00:00 - 31/12/2016 23:59:59
Total hours: 8784
Downtime (unannounced): 11:23 (hh:mm) [a power failure and a cooling failure]
Downtime (announced): 28:15 (hh:mm)
Uptime (excluding announced): 99.87%

Summary for Q1 2017
Period 01/01/2017 00:00:00 - 31/03/2017 23:59:59
Total hours: 2160
Downtime (unannounced): 0:00 (hh:mm)
Downtime (announced): 148:00 (hh:mm)
Uptime (excluding announced): 100.00%

Summary for Q2 2017
Period 01/04/2017 00:00:00 - 30/06/2017 23:59:59
Total hours: 2160
Downtime (unannounced): 0:00 (hh:mm)
Downtime (announced): 11:30 (hh:mm)
Uptime (excluding announced): 100.00%
So, how to manage a system at this scale
• Custom deployment
– We tested a number of vendor solutions; they are all quite ordinary
– Minimal downtime is the key!
• Custom monitoring
– See above!
• Custom management
– Automation
• Custom scheduling
– Altair kept a separate branch for NCI due to our strict requirements
– All of those requirements are gradually being made available to the rest of the world
• Extensive benchmarking and uniformity tests
– Per minute, daily, weekly, on demand, and at scheduled maintenance
– After any minor or major change to the cluster
Hire the best!
Cluster Deployment!
• Custom cluster deployment based on open-source tools
– oneSIS, conman, powerman, etc.
• All x86 hardware platforms use one system image and kernel
– Sandy Bridge, Ivy Bridge, Haswell, Broadwell, KNL, NVIDIA GPUs
– Fujitsu, Lenovo, Dell
– Separate images for POWER8 and ARM (non-x86)
• Rolling reboots: no cluster downtime required for patching
• Fine-grained control over hardware architecture
Cluster Deployment [oneSIS]
• oneSIS: an open-source software package for diskless cluster management (http://onesis.org/)
– A mesh of symlinks; difficult to set up initially, and requires a custom initramfs
• All production compute nodes, login nodes and data-mover nodes see the same image from Lustre, e.g. /imapps/Images/NCI/centos-6-X-N
– Create classes, e.g. compute.r, compute.r.testcluster, compute.r.login, compute.r.gpu, compute.knl, etc. (see the resolution sketch after the diagram below)
• Run certain services on login nodes but not on compute nodes and data movers
– Read-only image
• Root-on-Lustre and rsync-root-to-RAM (NCI-specific modifications to oneSIS). Do not install an OS on the node; create one image!
– IB and Lustre modules in the initramfs; 64-bit BusyBox; custom distro patch
– Full root, half root, or rsync of the OS to ramfs
– Nodes boot via TFTP and switch to the Lustre root
• Any change is seen instantaneously by all nodes
– e.g., yum updates or a configuration file change
[Diagram: a management node rsyncs the golden image to Lustre:/images/centos-6.X-N; the OS is bare minimum, and nodes either run root-on-Lustre or rsync it to RAM. oneSIS symlinks resolve per-class files: /etc/fstab on r1000 and r3592 resolves to /etc/fstab.compute.r, while on raijin1 it resolves to /etc/fstab.compute.r.login.]
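To illustrate how the symlink mesh resolves per-class files, a small sketch; the class patterns below are illustrative, not NCI's actual class map:

```python
#!/usr/bin/env python3
# Sketch of oneSIS-style per-class file resolution: the shared image
# carries /etc/fstab.<class> variants, and a node's symlink resolves
# to the most specific class matching its hostname.
import fnmatch

CLASSES = [                       # most specific pattern first (hypothetical)
    ("raijin*",  "compute.r.login"),
    ("r-dm*",    "compute.r.dm"),
    ("r[0-9]*",  "compute.r"),
]

def resolve(hostname, filename="/etc/fstab"):
    for pattern, cls in CLASSES:
        if fnmatch.fnmatch(hostname, pattern):
            return f"{filename}.{cls}"
    return filename               # generic file if no class matches

for host in ("r1000", "r3592", "raijin1"):
    print(host, "->", resolve(host))
# r1000   -> /etc/fstab.compute.r
# r3592   -> /etc/fstab.compute.r
# raijin1 -> /etc/fstab.compute.r.login
```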
Cluster Updates
[Diagram: each management node (man1, man2, man7) holds /var/lib/oneSIS images (e.g. Centos-6.5-XX and Centos-6.5-YY), synced to Lustre:/imapps/images/NCI/; a daily HSM backup keeps every oneSIS image by date. The test and production clusters both run root-on-Lustre, e.g. production on image XX while YY is staged.]
• Rsync a copy of the new OS
• Boot the test cluster with the image from the previous step
• Chroot and do a yum update
• Rsync the new image to Lustre:/imapps/images
• Perform uniformity tests and the set of benchmarks we have identified
• If all clear, offline all nodes (running jobs continue)
• A daemon reboots each node into the new image when the jobs on it finish (see the sketch below)
• Kernel updates: yum update, then build the initramfs; same for MOFED and Lustre updates
• Notifications for every new OS kernel released; a critical security update that affects NCI takes top priority
• Rolling reboot without users noticing
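A minimal sketch of the reboot daemon's logic, assuming standard PBS commands (pbsnodes -o to drain) and SSH access; the real daemon is NCI-internal and far more careful:

```python
#!/usr/bin/env python3
# Sketch of a rolling-reboot daemon: offline nodes in PBS so no new
# jobs land, then reboot each node into the new image once its last
# job finishes.
import subprocess
import time

def is_idle(node):
    """True if pbsnodes reports no jobs on the node."""
    out = subprocess.run(["pbsnodes", node],
                         capture_output=True, text=True).stdout
    for line in out.splitlines():
        if line.strip().startswith("jobs ="):
            return line.split("=", 1)[1].strip() == ""
    return True                   # no jobs attribute at all

def rolling_reboot(nodes):
    for node in nodes:
        subprocess.run(["pbsnodes", "-o", node])   # drain: no new jobs
    pending = set(nodes)
    while pending:
        for node in sorted(pending):
            if is_idle(node):                      # last job finished
                subprocess.run(["ssh", node, "reboot"])
                pending.discard(node)
        time.sleep(60)

if __name__ == "__main__":
    rolling_reboot([f"r{i}" for i in range(1, 37)])  # e.g., test cluster first
```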
Raijin Kernel [Custom based on CentOS 7.X]
• CentOS 6.X user-land with a CentOS 7.X kernel
– We normally do not upgrade the OS for the life of the system, which is measured in years
– Not using the CentOS 6 kernel due to missing features; we compile our own from sources provided by CentOS 7
• Control-group fixes
– With low memory on one NUMA node, the kernel would allocate a block of memory from the other NUMA node, and that allocation was not under the cgroup; it then grows out of control and OOMs. Still not fixed in the CentOS 6.9 kernel.
• Fully tickless kernel [the first HPC facility to go to production with this feature]
– CentOS 6: CONFIG_NO_HZ, timer tick disabled when the CPU is idle
– New in CentOS 7: CONFIG_NO_HZ_FULL, timer tick also disabled when a single runnable process is on the CPU; ideal for HPC workloads
– Improved single-node HPL by 1%
• MOFED packages for the CentOS 6 kernel were (getting a bit) buggy
• Better support for KNL
– We only want to run one kernel
– Previously we locally patched the RHEL 7 kernel, but with the 500-series kernel Red Hat has done the port
– We can skip a kernel version if not affected by bugs or security issues
– Seamless rolling reboot of the compute nodes after testing (a verification sketch follows)
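As a sketch of how a per-node check might confirm the intended tickless kernel after a rolling reboot; the paths are standard Linux, and the expected version prefix is invented:

```python
#!/usr/bin/env python3
# Sketch: confirm a node runs the intended custom kernel with the
# fully tickless option enabled.
import gzip
import os
import platform

EXPECTED_PREFIX = "3.10."         # hypothetical CentOS 7-series kernel

def kernel_config(release):
    path = f"/boot/config-{release}"
    if os.path.exists(path):
        with open(path) as f:
            return f.read()
    with gzip.open("/proc/config.gz", "rt") as f:   # if the kernel exposes it
        return f.read()

release = platform.release()
cfg = kernel_config(release)
assert release.startswith(EXPECTED_PREFIX), f"unexpected kernel {release}"
assert "CONFIG_NO_HZ_FULL=y" in cfg, "kernel is not fully tickless"
print(f"{release}: CONFIG_NO_HZ_FULL=y confirmed")
```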
Raijin [Lustre]
• Lustre updates
– The Lustre server on Raijin is 2.5++
– Lustre 2.5, IEEL 2 based, with NCI's list of applied patches; extremely stable
– [Chart of the last 12 months shown on the slide]
Monitoring
• Monitoring at scale
– More than 4,500 machines (compute, data movers, Lustre servers, logins, management): losing count!
– 6 IB director switches, 208 IB leaf switches (Raijin), plus 2 super-spine and 33 leaf switches for the Lenovo system
– 1 Ethernet core and 106 leaf switches, plus 24 leaf switches for the Lenovo system
– 5 DDN SFA12K (10 controllers), 4,200 disks (Raijin native; we are not talking about gdata here)
• homesys and imapps: 900 GB SAS; /short: 3 TB SATA (near-line SAS)
– Not talking about our cloud here; it has a similar setup and is actively being unified with HPC
– Hybrid of ELK, Ganglia and OpenTSDB; we will move to OpenTSDB + ELK for the next system
• Email, SMS and Slack integration
– No point in monitoring if you cannot automate
– Human input is still required for new events; our monitoring is not sentient, yet
Monitoring
• Custom scripts and cron jobs send emails, Slack bot messages and SMS alerts
– Monitoring scripts and daemons check the health of compute and Lustre services
• At varied intervals: 1 minute, 15 minutes, 30 minutes, hourly, 2-hourly, daily, and some continuously
– Before running any job, we run our check_node script (a minimal sketch follows this list)
• CPU, memory, filesystems, services and a lot more, and it runs in ~0.15 seconds
– It also checks kernel, oneSIS image and firmware versions when necessary
• Adding a random after-job full health check with uniformity tests
– After N job runs, offline the node, perform uniformity benchmarks, and put it back into production
– The same script runs every 2 hours and does not cause noticeable jitter
– The check_node script grows with each new case we encounter that can cause a job or node failure
• Take-away: automatically offline nodes upon failure condition(s)
• Do not dispatch jobs to nodes that are suspect, faulty or flagged for update
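A minimal sketch in the spirit of check_node; the real script covers far more cases (services, firmware, oneSIS image, etc.), and all thresholds below are invented:

```python
#!/usr/bin/env python3
# Sketch of a pre-job node health check: fast, read-only, and a
# non-zero exit signals the scheduler to offline the node.
import os
import shutil
import sys

failures = []

def check(name, ok):
    if not ok:
        failures.append(name)

# CPU count matches what the scheduler believes this node type has.
check("cpu_count", os.cpu_count() >= 16)

# Enough available memory to start a job (hypothetical 1 GiB floor).
with open("/proc/meminfo") as f:
    mem = {l.split(":")[0]: int(l.split()[1]) for l in f if ":" in l}
check("memory", mem.get("MemAvailable", 0) > 1024 * 1024)

# Required filesystems are mounted and have free space.
for fs in ("/short", "/home", "/apps"):
    mounted = os.path.ismount(fs)
    check(f"mount:{fs}", mounted)
    if mounted:
        check(f"space:{fs}", shutil.disk_usage(fs).free > 0)

if failures:
    print("UNHEALTHY:", ",".join(failures))
    sys.exit(1)        # non-zero exit: offline the node, do not start the job
print("OK")
```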
Monitoring
• Temperature monitoring via BMC
– Continuous monitoring: temperature warnings, automated suspend and shutdown on the cluster
– Heat map of the entire cluster, to find hotspots!
• Ganglia spoofing
– For CPU; custom Ganglia metrics for Lustre, IB and GPU monitoring
– [Ganglia will be replaced with OpenTSDB in the next system, fed by a custom daemon; the PoC is ready]
• OpenSM monitoring and SMS triggers based on a custom daemon
– Switch, HCA and cable faults are picked up and flagged instantaneously
• ELK stack
– Logs from the entire cluster (compute nodes, Lustre servers, cloud, management, PBS server) are sent to the ELK server
– Custom notifications are generated (email, SMS), e.g. too many requeues for a single job (a sketch follows this list)
• LBUGs, IB errors
• Job monitoring and dashboard
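A sketch of the requeue alert, reading scheduler log lines from stdin; the log pattern and the Slack webhook URL are placeholders, and the production pipeline queries the ELK stack rather than tailing a file:

```python
#!/usr/bin/env python3
# Sketch: count requeue events per job id and alert once a job
# crosses the threshold.
import json
import re
import sys
import urllib.request
from collections import Counter

SLACK_WEBHOOK = "https://hooks.slack.com/services/XXX"   # placeholder
THRESHOLD = 3

requeues = Counter()
pattern = re.compile(r"Job;(?P<jobid>\S+);.*requeu", re.IGNORECASE)

for line in sys.stdin:            # e.g., tail -F of the server logs
    m = pattern.search(line)
    if not m:
        continue
    jobid = m.group("jobid")
    requeues[jobid] += 1
    if requeues[jobid] == THRESHOLD:          # alert once per job
        payload = {"text": f"Job {jobid} requeued {THRESHOLD} times"}
        urllib.request.urlopen(urllib.request.Request(
            SLACK_WEBHOOK,
            data=json.dumps(payload).encode(),
            headers={"Content-Type": "application/json"},
        ))
```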
Monitoring
• Automated fault categorisation and reporting
– We hate doing repeated work
– Based on years of experience, common hardware failures result in:
• Automatic node offline
• Diagnostics pulled (upon meeting certain conditions) and the error logged with Fujitsu
– Picked up by the check_node script; the node is offlined and a ticket is created automatically
» e.g., failed memory, an HCA throwing errors
• Working with Lenovo on a similar method
Scheduling
• Lots of custom hooks for NCI (a sketch of one follows this list)
– Extensive use of cgroups, even on GPU nodes
– Allocation management
• Hyperthreads are on at all times, controlled via cgroups; users can request HT via the -l hyperthread flag
• Fair-share
– A custom daemon calculates job priority
• Per project, per running CPUs of a project, ageing
– Tightly coupled with alloc-csv{db}
• Custom flags
– Filesystem up and down states
• e.g., /g/data3 via -l other=gdata3
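For flavour, a sketch of what such a hook can look like; PBS Professional hooks are written in Python against the pbs module, but the "other" resource handling and the down-list here are illustrative, not NCI's actual hook:

```python
# Sketch of a PBS Professional queuejob hook: reject jobs that request
# a filesystem currently flagged as down.
import pbs

DOWN = {"gdata3"}        # would be maintained by the monitoring system

e = pbs.event()
requested = str(e.job.Resource_List["other"] or "")

for fs in requested.split(":"):
    if fs in DOWN:
        # reject() terminates the hook and refuses the job.
        e.reject("Filesystem %s is offline; please resubmit later" % fs)

e.accept()
```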
Monitoring File-systems
• User perspective: we have been untarring the Linux kernel source for years
• Triggers based on slow untar speeds and hung untars (OST issues); each filesystem has its own threshold based on MDT performance
• Push notifications via email and Slack bot (a sketch follows)
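A sketch of the untar probe; the tarball location, filesystem list and thresholds are all illustrative:

```python
#!/usr/bin/env python3
# Sketch: time the extraction of a kernel tarball on each filesystem
# and alert when it exceeds a per-filesystem threshold or hangs.
import subprocess
import tempfile
import time

TARBALL = "/apps/benchmarks/linux-4.12.tar.xz"   # hypothetical location
THRESHOLDS = {                                   # seconds, per filesystem
    "/short": 60,
    "/g/data1a": 90,
    "/g/data3": 45,
}

def untar_time(target_dir):
    start = time.time()
    subprocess.run(["tar", "-xf", TARBALL, "-C", target_dir],
                   check=True, timeout=600)      # hung untar raises
    return time.time() - start

for fs, limit in THRESHOLDS.items():
    with tempfile.TemporaryDirectory(dir=fs) as d:
        try:
            elapsed = untar_time(d)
        except subprocess.TimeoutExpired:
            print(f"ALERT {fs}: untar hung (>600 s)")
            continue
        status = "ALERT" if elapsed > limit else "ok"
        print(f"{status} {fs}: untar took {elapsed:.1f} s (limit {limit} s)")
```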
Monitoring File-systems
[Four slides of filesystem-monitoring dashboard screenshots]
NCI Dashboard (Not open to users currently)
Images courtesy Dr. Lei Shang
Job Trends
User Statistics
*Permission was obtained from the user.
Project Stats
Inefficient Jobs
User is contacted
Running job per node comparison.
Benchmarks – system uniformity perspective
• Micro and application benchmarks
• Micro-benchmarks
– Per node: STREAM, HPL, MOST
– Sets of nodes: OSU (bw, latency, all_to_all, barrier, reduce)
• Application benchmarks
– A lot, and we keep adding more; each benchmark is thoroughly profiled using IPM, Allinea, etc.
– e.g., UM, NAMD (x86, GPU), Bowtie2, nvBowtie, QE, Q-Chem, Gaussian (not publishing results), OpenFOAM, CCAM, QCD, S3D, Gromacs, etc.
• Methodology (a sketch follows this list)
– Scripts submit the job (or run it independently of the scheduler); results are recorded in a DB
– Run at least 5 times, throw away the min and the max, and take the average
– A stdev of 5% is allowed; moving to 3%
– This can change based on what we are trying to achieve
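The methodology reduces to a few lines; a sketch, with made-up example numbers:

```python
#!/usr/bin/env python3
# Sketch of the stated methodology: >=5 runs per node, discard the min
# and the max, average the rest, and flag the node if its spread or its
# deviation from the agreed fleet-wide average exceeds tolerance.
import statistics

def uniformity_check(runs, agreed_avg, tolerance=0.05):
    """runs: one node's results (>=5); agreed_avg: fleet-wide reference."""
    assert len(runs) >= 5, "need at least 5 runs"
    trimmed = sorted(runs)[1:-1]               # throw away min and max
    avg = statistics.mean(trimmed)
    spread_ok = statistics.stdev(trimmed) / avg <= tolerance
    # +/- tolerance/2 around the agreed average (i.e. a 5% total window).
    within = abs(avg - agreed_avg) / agreed_avg <= tolerance / 2
    return avg, spread_ok and within

# Example: STREAM results (MB/s) from one node vs. the agreed average.
avg, ok = uniformity_check([60100, 60350, 60420, 59980, 60210], 59500.0)
print(f"trimmed avg {avg:.0f} -> {'PASS' if ok else 'FAIL'}")
```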
Benchmarks – system uniformity perspective
--- streamBenchmark -----------------------------------
Statistics:
-----------
Agreed Avg: 59500.00 Test Avg: 60199.35 Median: 60347.77 StDev: 1595.85 Min: 42495.67 Max: 60498.17 Total: 991
Results are based on 59500.00
Passing list is between -2.5% to +2.5% (i.e. 5%)
Value < -100% [0.00] ( 0) []
Between -100.0% and -50.0% [0.00 and 29750.00] ( 0) []
Between -50.0% and -20.0% [29750.00 and 47600.00] ( 8) [1641, 1642, 1643, 1644, 1841, 1842, 1843, 1844]
Between -20.0% and -10.0% [47600.00 and 53550.00] ( 0) []
Between -10.0% and -5.0% [53550.00 and 56525.00] ( 0) []
Between -5.0% and -2.5% [56525.00 and 58012.50] ( 0) []
Between -2.5% and -1.0% [58012.50 and 58905.00] ( 0) []
Between -1.0% and 1.0% [58905.00 and 60095.00] ( 3) [1604, 1772, 2252]
Between 1.0% and 2.5% [60095.00 and 60987.50] ( 980) [1500, 1501, 1502, 1504, ... 2498, 2499, 2500: the remaining 980 nodes in the r1500-r2500 range; full list abridged]
Between 2.5% and 5.0% [60987.50 and 62475.00] ( 0) []
Between 5.0% and 10.0% [62475.00 and 65450.00] ( 0) []
Between 10.0% and 20.0% [65450.00 and 71400.00] ( 0) []
Between 20.0% and 50.0% [71400.00 and 89250.00] ( 0) []
Between 50.0% and 100.0% [89250.00 and 119000.00] ( 0) []
Value > 100% [119000.00] ( 0) []
Running job per node comparison.
[Two slides of per-node job comparison charts]
Benchmark Results
• NAMD (stmv_28 mem opt)
Results for ARM and AMD EPYC are not presented due to the open tender
Benchmark Results
Benchmark Results - QChem
• The first code ported by NCI to POWER8
– Q-Chem is similar to Gaussian and scales to multiple nodes
– SMT=2 gave the best results; SMT=8 gave the worst results (not shown)
Benchmark Results - MILC
[Chart: MILC multi-mass CG solver performance, phase 1. Average performance (GFlop/s, 0-100 scale) for the ob1, ucx and yalla MPI transports on Sandy Bridge, Broadwell and POWER8, at 1 node and 2 nodes]
Benchmark Results - MILC
[Chart: the same MILC multi-mass CG solver data (phase 1, GFlop/s) broken out per series: Sandy Bridge / Broadwell / POWER8, each with ob1 / ucx / yalla at 1 node and 2 nodes]
Some more details
[Chart: relative floating-point performance (GFLOP/s, 0-2.5 relative scale) of UM sections with MPI removed, for Sandy Bridge, Broadwell, Skylake (Cray) and Skylake (AWS)]
[Chart: communication fraction (0-0.9) by UM section (Total, Boundary Layer, Helmholtz, Gravity Waves, Wind Advection, Other Advection, Diagnostics/IO) for Raijin (Sandy Bridge), Raijin (Broadwell), Cray and AWS]
Take-aways
• Automation
– Repeated tasks are for machines, not humans
• One image, one kernel
– All nodes use the same root image and boot off the same kernel, which is optimised for the hardware
– Benchmark the OS kernel and the underlying libraries
– Go tickless!
• Data
– Supercomputer monitoring generates big data: harness it!
• Benchmarks
– Know your benchmarks; profile them in as much detail as possible
– Uniformity is critical
– Use automated scripts to run them and record the results
– First achieve uniformity, then go for speed!
• User
– The most important person; everything we do is to improve their workflow
Thank you