Scientific computing on ARM-based platforms -- evaluation ... · • 1 board, 2 sockets, 96 cores...
This project and the research leading to these results have received funding from the European Community's Seventh Framework and H2020 Programmes under grant agreements n° 610402 and 671697.
http://www.montblanc-project.eu
Scientific computing on ARM-based platforms:
evaluation and perspectives
Filippo Mantovani
Senior Researcher at Barcelona Supercomputing Center
Technical coordinator of the Mont-Blanc 1 and 2 projects
Mont-Blanc projects at a glance
Roma, 28 Sep 2016 · Perspectives of GPU computing in Science
Vision: to leverage the fast-growing market of mobile technology for scientific computation, HPC and non-HPC workloads.
Timeline, 2012 to 2018: Mont-Blanc, Mont-Blanc 2, Mont-Blanc 3
Areas of contribution
ARM-based prototypes
• Mobile technology
• Server technology
System software
• Scientific libraries
• Performance analysis tools
• Support for runtimes
Resiliency
• Reliability study of the MB prototype
• Application based fault tolerance
• Fault tolerance support in the runtime
Architectural studies
• Multi-scale simulation
• big.LITTLE in scientific codes
• Next-generation SoCs
• Power monitoring and energy-aware programming
Scientific applications
• Scalability study
• Detailed study using performance analysis tools
• Very little "macho effort"
PRACE prototypes
•Tibidabo
•Carma
•Pedraforca
Mini-clusters
•Arndale
•Odroid XU
•Odroid XU-3
•NVIDIA Jetson
Mont-Blanc prototype
•1080 compute cards
•Fine grained power monitoring system
•Installed between Jan and May 2015
•Operational since May 2015 @ BSC
ARM 64-bit mini-clusters
•APM X-GENE2
•Cavium ThunderX
•NVIDIA TX1
Mont-Blanc 3 demonstrator
•Based on new-generation ARM 64-bit processors
•Targeting HPC market
The Mont-Blanc prototype ecosystem
Prototypes are critical to accelerate software development: system software stack + applications
Timeline axis: 2012 to 2017
Mont-Blanc prototype
Compute card (8.5 cm x 5.6 cm)
• SoC: Exynos5 Dual (2x ARM Cortex-A15 cores, 1x ARM Mali-T604 GPU)
• Network: USB 3.0 to 1GbE bridge
• Memory: 4 GB LPDDR3-1600
• Local storage: microSD up to 64 GB
Full system
• 2 racks, 8 BullX chassis, 72 compute blades, 1080 compute cards
• 2160 CPUs, 1080 GPUs, 4.3 TB of DRAM, 17.2 TB of Flash
• Operational since May 2015 @ BSC
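The system-level totals quoted for the prototype can be cross-checked against the per-card figures. The sketch below is only that cross-check; it assumes that "2160 CPUs" counts Cortex-A15 cores and that DRAM is reported in decimal terabytes, neither of which the slide states explicitly.

```python
# Figures quoted on the slide.
COMPUTE_CARDS = 1080
BLADES = 72
CORES_PER_CARD = 2        # Exynos5 Dual: 2x Cortex-A15
GPUS_PER_CARD = 1         # 1x Mali-T604
DRAM_GB_PER_CARD = 4      # 4 GB LPDDR3-1600
FLASH_TB_TOTAL = 17.2     # system-wide flash, as quoted


def aggregates():
    """Derive system-wide totals from the per-card figures."""
    return {
        "cards_per_blade": COMPUTE_CARDS // BLADES,
        "cpu_cores": COMPUTE_CARDS * CORES_PER_CARD,
        "gpus": COMPUTE_CARDS * GPUS_PER_CARD,
        # Assumption: decimal units (1 TB = 1000 GB).
        "dram_tb": COMPUTE_CARDS * DRAM_GB_PER_CARD / 1000,
        # Implied by the quoted total, not stated on the slide.
        "flash_gb_per_card": FLASH_TB_TOTAL * 1000 / COMPUTE_CARDS,
    }


if __name__ == "__main__":
    for name, value in aggregates().items():
        print(f"{name}: {value}")
```

The derived 15 cards per blade and 2160 cores agree with the quoted figures, and 4.32 TB of DRAM matches the "4.3 TB" on the slide.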
Cavium Thunder cluster
• Based on Cavium ThunderX SoC
  • Core: ARMv8 custom implementation
  • 48 cores @ 1.8 GHz per SoC
• 1 cluster node <=> dual socket board
  • 1 board, 2 sockets, 96 cores
  • 128 GB of DDR3 RAM
  • Cache coherency protocol implemented
  • One instance of Linux
• Cluster deployed at BSC facilities
  • 4x dual socket boards ( +1 ☺ )
  • 384 cores in 2U
  • ~700 W peak power consumption*
* On a reference design board + PASS1 SoC
Jetson TX1 cluster
• Same SoC as the NVIDIA Shield console
• 1x NVIDIA Tegra X1
  • 4x Cortex-A57 @ 1.73 GHz
  • 1x Cortex-A53 (not usable)
• 1x NVIDIA Maxwell GPU
  • 256 CUDA cores
• 4 GB LPDDR4
• 1GbE Network
• Cluster deployed at BSC facilities
  • 16x NVIDIA Jetson TX1 boards
• Mont-Blanc software stack available
System software stack for ARM
• Source files (C, C++, FORTRAN, Python, …)
• Compilers: GNU, JDK, Mercurium
• Scientific libraries: LAPACK, Boost, PETSc, clFFT, FFTW, HDF5, ATLAS, clBLAS
• Developer tools: Scalasca, Perf, Extrae, DDT
• Cluster management: SLURM, Ganglia, NTP, OpenLDAP, Nagios, Puppet
• Runtime libraries: Nanos++, OpenCL, CUDA, MPI
• Hardware support / Storage: Power monitor, Lustre, NFS, DVFS
• Linux OS / Ubuntu
• Drivers: Network driver, OpenCL driver
• Hardware: CPU, GPU, Network, Power monitor
(1) Based on open-source packages
(2) Tested on several ARM-based platforms
(3) But…
• More than 10 prototypes
• More than 5 years
• More than 4 different ways of measuring the power… and still no standards!
MB-proto: Power monitor – HW infrastructure
Credits: Axel Auweter, Daniele Tafani (LRZ)
MB-proto: Power monitor – HW / SW interface
• Field Programmable Gate Array (FPGA)
  • Collects power consumption data from all 15 power measurement / sample interval: 70 ms
• Board Management Controller (BMC)
  • Collects 1 s averaged data from FPGA
  • Stores measurement samples in FIFO
• Mont-Blanc Pusher
  • Collects measurement data from multiple BMCs using custom IPMI commands
  • Forwards data using MQTT protocol through Collect Agent into key-value store
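The data path described here (70 ms samples, 1 s averages, a FIFO, then a pusher writing into a key-value store) can be illustrated with a small in-process simulation. This is a sketch only, not the Mont-Blanc implementation: the class `BMC`, the functions `feed_constant` and `push`, the FIFO depth and the dictionary standing in for the key-value store are all hypothetical, and the real IPMI and MQTT transports are replaced by plain function calls.

```python
from collections import deque

SAMPLE_INTERVAL_S = 0.070   # FPGA sample interval quoted on the slide
AVERAGE_WINDOW_S = 1.0      # BMC averaging window quoted on the slide


class BMC:
    """Averages FPGA samples over 1 s windows and queues the results."""

    def __init__(self, fifo_depth=64):  # depth is an assumption
        self.fifo = deque(maxlen=fifo_depth)
        self._window = []
        self._window_start = 0.0

    def on_fpga_sample(self, t, watts):
        # Close every window that ended before this sample arrived.
        while t >= self._window_start + AVERAGE_WINDOW_S:
            self._flush()
            self._window_start += AVERAGE_WINDOW_S
        self._window.append(watts)

    def _flush(self):
        if self._window:
            mean = sum(self._window) / len(self._window)
            self.fifo.append((self._window_start, mean))
            self._window = []


def feed_constant(bmc, n_samples, watts):
    """Feed n_samples FPGA readings of constant power, 70 ms apart."""
    for i in range(n_samples):
        bmc.on_fpga_sample(i * SAMPLE_INTERVAL_S, watts)


def push(bmcs, store):
    """Drain every BMC FIFO into a key-value store (a dict stands in here)."""
    for node, bmc in bmcs.items():
        while bmc.fifo:
            t, watts = bmc.fifo.popleft()
            store[(node, t)] = watts


if __name__ == "__main__":
    bmc = BMC()
    feed_constant(bmc, n_samples=44, watts=6.0)   # a bit over 3 s of data
    store = {}
    push({"blade0": bmc}, store)
    for key in sorted(store):
        print(key, store[key])
```

Averaging on the BMC reduces the data rate roughly 14-fold before anything leaves the blade, and the FIFO decouples the fixed sampling rate from the pusher's polling.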
Credits: Axel Auweter, Daniele Tafani (LRZ)
On Mont-Blanc prototype
Time to solution and energy to solution per configuration:
• 1 core @ 1.6 GHz: 30.6 s, 238 J
• 2 cores @ 1.6 GHz, OpenMP: 13.5 s, 126 J
• 1 core @ 0.8 GHz: 52.7 s, 352 J
• 2 cores @ 0.8 GHz, OpenMP: 26.4 s, 204 J
• 1 core @ 1.6 GHz + GPU: 9.1 s, 61 J
• 1 core @ 0.8 GHz + GPU: 12.1 s, 90 J
Static idle power > 5 W
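Dividing energy to solution by time to solution gives the average power of each run, which shows why the slower clock costs more energy. The sketch below assumes the pairing of values and configurations as read from this slide; the function names are illustrative only.

```python
# (configuration, time to solution [s], energy to solution [J]) as read from the slide
RUNS = [
    ("1 core @ 1.6 GHz", 30.6, 238),
    ("2 cores @ 1.6 GHz, OpenMP", 13.5, 126),
    ("1 core @ 0.8 GHz", 52.7, 352),
    ("2 cores @ 0.8 GHz, OpenMP", 26.4, 204),
    ("1 core @ 1.6 GHz + GPU", 9.1, 61),
    ("1 core @ 0.8 GHz + GPU", 12.1, 90),
]


def average_power(time_s, energy_j):
    """Average power in watts over the whole run."""
    return energy_j / time_s


def summarize(runs, baseline=0):
    """Return (name, avg power, speed-up, energy saving factor) vs. the baseline run."""
    _, t0, e0 = runs[baseline]
    return [(name, average_power(t, e), t0 / t, e0 / e) for name, t, e in runs]


if __name__ == "__main__":
    for name, watts, speedup, saving in summarize(RUNS):
        print(f"{name:28s} {watts:5.2f} W  speed-up {speedup:4.2f}x  energy {saving:4.2f}x")
```

With these figures, halving the frequency lowers the single-core average power only from about 7.8 W to about 6.7 W while the run takes about 1.7 times longer, which is consistent with a static idle power above 5 W dominating the budget.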
Alya RED on the Mont-Blanc prototype
Credits: Nikola Rajovic
Applications - weak scaling
Credits: Nikola Rajovic
Applications - strong scaling
Credits: Nikola Rajovic
Experimental setup with external power monitor
Diagram labels: cluster, system power plug, serial interface (3 samples/s), server
• Not “platform specific”
• Cavium ThunderX
• Full node measurements
• Including PSU losses
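With an external meter delivering about 3 power samples per second, energy to solution has to be obtained by integrating the samples over the run. A minimal sketch using the trapezoidal rule follows; the helper names and the uniform-timestamp assumption are illustrative, and the meter's actual serial format is not modelled because the slide does not specify it.

```python
def uniform_samples(powers_w, rate_hz=3.0):
    """Attach timestamps to readings, assuming a perfectly uniform sample rate."""
    return [(i / rate_hz, p) for i, p in enumerate(powers_w)]


def energy_joules(samples):
    """Trapezoidal integration of (time_s, power_w) samples."""
    total = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        total += 0.5 * (p0 + p1) * (t1 - t0)
    return total


if __name__ == "__main__":
    # Placeholder trace: 10 s at a constant 100 W (31 samples at 3 samples/s).
    trace = uniform_samples([100.0] * 31)
    print(f"energy: {energy_joules(trace):.1f} J")
```

At this sample rate, power changes shorter than a few hundred milliseconds are averaged out, so the approach suits whole-run energy figures better than fine-grained phase analysis.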
Linux Kernel support for the ThunderX PMU
• Access to PMU is SoC-dependent
• In order to access the PMU
  • Device Tree must include PMU definition
  • Linux Kernel must implement a way to access PMU
• In our case
  • Device Tree does not include PMU definition
  • PMU access not supported by the kernel
• Result (with our patch)
  • Hardware counters can be accessed with the “perf” command
  • Patch available to the rest of the world
INFO: On ARM CPUs, hardware performance counters are handled by a component called the Performance Monitor Unit (PMU), which is part of the SoC.
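Once the counters are reachable through perf, they can be post-processed with a few lines of script. The sketch below parses the machine-readable output of `perf stat -x,` and derives instructions per cycle. It assumes the CSV field order documented for perf stat (counter value first, event name third); the sample text is made up for illustration and is not a measurement from the ThunderX.

```python
def parse_perf_csv(text):
    """Parse `perf stat -x,` output into {event: count}, integer counters only."""
    counts = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(",")
        if len(fields) < 3:
            continue
        value, event = fields[0], fields[2]
        try:
            counts[event] = int(value)
        except ValueError:
            # e.g. "<not counted>" or "<not supported>" entries
            continue
    return counts


def ipc(counts):
    """Instructions per cycle from the parsed counters."""
    return counts["instructions"] / counts["cycles"]


# Hypothetical sample, in the shape produced by something like:
#   perf stat -x, -e cycles,instructions,cache-misses -o perf.csv -- ./app
SAMPLE = """\
2000000,,cycles,1000000,100.00,,
3000000,,instructions,1000000,100.00,1.50,insn per cycle
<not counted>,,cache-misses,0,0.00,,
"""

if __name__ == "__main__":
    parsed = parse_perf_csv(SAMPLE)
    print(parsed, "IPC =", ipc(parsed))
```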
Jetson TX1: “old school” hacking…
• Voltage monitor on-board component
  • Texas Instruments INA3221
  • Connected via I2C
• No support provided by NVIDIA
  • Hand-written support…
  • Measurements validated with external setup
• So we are now able to get power traces on Jetson TX1!
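The core of any hand-written INA3221 support is converting raw register words into volts, amperes and watts. The sketch below shows only that conversion, under stated assumptions: the LSB sizes (40 µV for shunt voltage, 8 mV for bus voltage, value held in bits 15 to 3) follow the INA3221 datasheet as recalled here, the shunt resistance is a hypothetical parameter since the slide does not give the board's value, and the I2C read itself is left out. It is not the code used on the Jetson TX1 cluster.

```python
SHUNT_LSB_V = 40e-6   # shunt-voltage LSB (assumed from the INA3221 datasheet)
BUS_LSB_V = 8e-3      # bus-voltage LSB (assumed from the INA3221 datasheet)


def _signed_field(raw):
    """Decode a 16-bit register whose two's-complement value sits in bits 15..3."""
    raw &= 0xFFFF
    if raw & 0x8000:
        raw -= 0x10000
    return raw >> 3     # arithmetic shift keeps the sign


def shunt_voltage(raw):
    """Voltage across the shunt resistor, in volts."""
    return _signed_field(raw) * SHUNT_LSB_V


def bus_voltage(raw):
    """Voltage of the monitored rail, in volts."""
    return _signed_field(raw) * BUS_LSB_V


def channel_power(raw_shunt, raw_bus, shunt_ohms):
    """Power on one channel: rail voltage times the current through the shunt."""
    current_a = shunt_voltage(raw_shunt) / shunt_ohms
    return bus_voltage(raw_bus) * current_a


if __name__ == "__main__":
    # Illustrative raw words, not real readings: 10 mV across the shunt, 5 V rail.
    raw_shunt, raw_bus = 250 << 3, 625 << 3
    print(f"{channel_power(raw_shunt, raw_bus, shunt_ohms=0.01):.2f} W")
```

Sampling these registers in a loop and timestamping each reading yields a power trace that can then be checked against an external meter, as the slide describes.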
DGEMM with CUDA
• Test of a simple CUDA code performing a DGEMM
• Basic time to solution / energy to solution study with default governor configurations
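For a DGEMM, time to solution and energy to solution translate directly into the usual performance and efficiency metrics. The sketch below uses the standard 2·m·n·k operation count; the function names are illustrative, and the input values in the usage line are placeholders, not results from the Jetson TX1.

```python
def dgemm_flops(m, n, k):
    """Leading-order floating-point operation count of C = alpha*A*B + beta*C."""
    return 2 * m * n * k


def metrics(m, n, k, time_s, energy_j):
    """Performance and efficiency derived from time and energy to solution."""
    flops = dgemm_flops(m, n, k)
    return {
        "gflops": flops / time_s / 1e9,
        "avg_power_w": energy_j / time_s,
        "gflops_per_watt": flops / energy_j / 1e9,
    }


if __name__ == "__main__":
    # Placeholder inputs for illustration only.
    for name, value in metrics(1000, 1000, 1000, time_s=2.0, energy_j=20.0).items():
        print(f"{name}: {value:.3f}")
```

Because GFLOPS per watt depends only on the operation count and the energy, a governor that shortens the run without lowering the energy leaves that metric unchanged.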
MILCmk power analysis (preliminary)
• 7 representative microkernels for the MIMD Lattice Computation (MILC) collaboration code
• CORAL benchmark, C code, OpenMP, double precision
• Study of the computational features using hw counters
2.5x more energy
Next steps
Short term:
• Power profile complex CUDA codes
• Deeper understanding of governors
Ideally targeting three levels of power information:
• From the application
  • Access to an energy register, PAPI style
  • Possibility of easily powering on-off cores
  • Change of frequency
• From the runtime
  • Direct access to the power registers
  • Possibility of easily powering on-off cores (without kernel support)
• From the outside
  • Gather power data of larger systems “a la Mont-Blanc”
  • Power aware job scheduling
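On current Linux systems, the closest thing to these controls is the kernel's sysfs interface for CPU hotplug and frequency scaling. The read-only sketch below reports the online state, governor and current frequency of each core; it assumes the standard `/sys/devices/system/cpu` layout, which may be partly or wholly absent on a given board, and the function names are illustrative.

```python
import os
from pathlib import Path

CPU_ROOT = Path("/sys/devices/system/cpu")


def _read(path):
    """Return the stripped file content, or None if the file is unavailable."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


def cpu_state(cpu, root=CPU_ROOT):
    """Read hotplug and cpufreq state for one core from sysfs."""
    base = Path(root) / f"cpu{cpu}"
    return {
        "online": _read(base / "online"),   # often missing for cpu0
        "governor": _read(base / "cpufreq" / "scaling_governor"),
        "cur_freq_khz": _read(base / "cpufreq" / "scaling_cur_freq"),
    }


if __name__ == "__main__":
    for cpu in range(os.cpu_count() or 1):
        print(cpu, cpu_state(cpu))
```

Changing these settings means writing to the same files as root, for example writing 0 or 1 to a core's `online` file, which is exactly the kind of kernel-mediated control the runtime-level wish list would like to avoid.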
Conclusions
• Highlights of Mont-Blanc activities have been presented
  • Even with low-end hardware components it is possible to achieve decent performance in parallel computation
• The main line of Mont-Blanc 3 activity targets the high-end server market
  • Research on cost-efficient platforms continues
• 3 ARM-based platforms for scientific computing have been introduced
  • With focus on power monitoring
  • There is still a long way to go towards real power-aware programming…
“The secret is to win going as slowly as possible.” (Niki Lauda)
MontBlancEU · montblanc-project.eu · @MontBlanc_EU