LAMMPS
Performance Benchmark and Profiling
November 2020
Note
• The following research was performed under the HPC Advisory Council activities
– HPC-AI AC – Iris cluster
– Dell – Zenith cluster
• The following tests were performed to provide best practices
– LAMMPS performance overview on Intel-based platforms
– Understanding LAMMPS MPI communication patterns
• More info on LAMMPS
– https://lammps.sandia.gov/
LAMMPS
• Large-scale Atomic/Molecular Massively Parallel Simulator
– A classical molecular dynamics code that can model:
– Atomic, polymeric, biological, metallic, granular, and coarse-grained systems
• LAMMPS-KOKKOS package contains
– Versions of pair, fix, and atom styles that use data structures and macros provided by the Kokkos library
• LAMMPS runs efficiently in parallel using message-passing techniques
– Developed at Sandia National Laboratories
– An open-source code, distributed under the GNU General Public License
• More information on LAMMPS can be found at the LAMMPS web site:
http://lammps.sandia.gov
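To illustrate how such MPI runs are typically launched (a sketch only — the binary names, rank counts, and file paths below are assumptions, not taken from this report):

```shell
# Hypothetical launch of a LAMMPS benchmark over MPI:
# 40 ranks per node across 32 nodes (names and paths are placeholders)
mpirun -np 1280 --map-by node lmp -in in.lj -log log.lj

# The same run with the KOKKOS package enabled (one thread per rank)
mpirun -np 1280 --map-by node lmp -k on t 1 -sf kk -in in.lj
```

The `-k on` and `-sf kk` flags turn on the KOKKOS package and apply the Kokkos-accelerated styles mentioned above to the input script without editing it.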
Cluster Configuration
• HPC-AI AC Cluster Center – Iris cluster
– Dual Socket Intel Gold 6148 CPU @ 2.40GHz
– ConnectX-6 HDR100 InfiniBand
– Quantum Switch HDR InfiniBand
– Memory: 192GB DDR4 2677MHz RDIMMs per node
• Software
– OS: RHEL 7.8
– MLNX_OFED 4.9
– MPI: HPC-X 2.7.0
– LAMMPS: v10-29-2020
– Compiler: Intel 2020.4.304
• Dell Cluster Center – Zenith cluster
– Dual Socket Intel Gold 6248 CPU @ 2.50GHz
– ConnectX-6 HDR100 InfiniBand
– Quantum Switch HDR InfiniBand
– Memory: 192GB DDR4 2677MHz RDIMMs per node
• Software
– OS: RHEL 7.8
– MLNX_OFED 4.9
– MPI: HPC-X 2.7.0
– LAMMPS: v10-29-2020
– Compiler: Intel 2020.4.304
LAMMPS Inputs
• AF_lennard-jones_2.5
– Problem: https://lammps.sandia.gov/bench/in.lj.txt
– region: box block 0 200 0 200 0 200
– neigh_modify: delay 0 every 20 check no
– Iterations: 1000
• EAM
– Problem:
https://github.com/lammps/lammps/blob/master/bench/POTENTIALS/in.eam
– region: box block 0 200 0 200 0 200
– neigh_modify: delay 1 every 5 check yes
– Iterations: 1000
– thermo 100
• Tersoff
– Problem: https://lammps.sandia.gov/bench/in.tersoff.txt
– region: box block 0 200 0 200 0 200
– Iterations: 1000
• Gay-Berne
– Problem: https://lammps.sandia.gov/bench/in.gb.txt
– region: box block 0 320 0 320 0 320
– set type 1 mass 1.5
– set type 1 shape 1 1.5 2
– neigh_modify: delay 1 every 5 check yes
– Iterations: 1000
– thermo 100
• Rhodopsin
– Problem:
https://github.com/lammps/lammps/blob/master/bench/in.rhodo
– replicate: 1 1 1
– atom_modify map array
– Iterations: 1000
• SNAP
– Problem:
– region: box block 0 5 0 8 0 32
– Iterations: 1000
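For reference, the Lennard-Jones input above corresponds to the stock LAMMPS `bench/in.lj` script, scaled to the listed box size. The sketch below fills in the remaining lines with the standard benchmark defaults; values not listed on this slide are assumptions and may differ from the exact runs used here.

```
# 3d Lennard-Jones melt (sketch based on the stock bench/in.lj;
# only the region, neigh_modify, and run values are from this report)
units           lj
atom_style      atomic
lattice         fcc 0.8442
region          box block 0 200 0 200 0 200
create_box      1 box
create_atoms    1 box
mass            1 1.0
velocity        all create 1.44 87287 loop geom
pair_style      lj/cut 2.5
pair_coeff      1 1 1.0 1.0 2.5
neighbor        0.3 bin
neigh_modify    delay 0 every 20 check no
fix             1 all nve
run             1000
```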
LAMMPS Performance – Scalability
Higher is better
[Chart: LAMMPS scaling efficiency across node counts, between 92% and 100% for the tested inputs; * bigger problem size]
LAMMPS Performance – AVX2/AVX512
Higher is better
LAMMPS Performance – CPU
Higher is better
LAMMPS MPI Profiles on 32 nodes
Lennard-Jones 2.5 – 30% MPI; EAM – 15% MPI; Gay-Berne – 12% MPI
Rhodopsin – 14% MPI; SNAP – 4% MPI; Tersoff – 6% MPI
Summary
• LAMMPS scalability depends on the problem size defined; the problem should suit the CPU
architecture and cluster size. With InfiniBand, scalability is above 92% for the
demonstrated cases
• AVX512 helps five of the six input benchmarks, with up to a 2x improvement
• Intel Gold 6248 @ 2.5GHz (40 cores per node) demonstrated up to 38% performance
improvement compared to Intel Gold 6148 @ 2.4GHz (40 cores per node)
• The MPI profile shows up to 30% communication time, spent mostly on point-to-point and
MPI_Allreduce operations. The Rhodopsin input also shows MPI_Alltoallv
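The scalability percentages quoted above are parallel efficiencies (measured speedup divided by ideal speedup). A minimal helper showing the arithmetic — the timings below are made-up placeholders, not measurements from this report:

```python
def parallel_efficiency(t_base: float, t_scaled: float, factor: int) -> float:
    """Parallel efficiency when scaling the node count up by `factor`:
    the measured speedup divided by the ideal (linear) speedup."""
    speedup = t_base / t_scaled
    return speedup / factor

# Hypothetical timings: 1 node takes 100 s, 32 nodes take 3.4 s
eff = parallel_efficiency(100.0, 3.4, 32)
print(f"{eff:.0%}")
```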
All trademarks are property of their respective owners. All information is provided “As-Is” without any kind of warranty. The HPC-AI Advisory Council makes no representation to the accuracy and completeness of the information
contained herein. HPC-AI Advisory Council undertakes no duty and assumes no obligation to update or correct any information presented herein
Thank You