LAMMPS Performance Benchmark and Profiling November 2020

LAMMPS

Performance Benchmark and Profiling

November 2020

Note

• The following research was performed under the HPC-AI Advisory Council activities

– HPC-AI AC – Iris cluster

– Dell – Zenith cluster

• The following was done to provide best practices

– LAMMPS performance overview over Intel based platforms

– Understanding LAMMPS MPI communication patterns

• More info on LAMMPS

– https://lammps.sandia.gov/

LAMMPS

• Large-scale Atomic/Molecular Massively Parallel Simulator

– Classical molecular dynamics code that can model atomic, polymeric, biological, metallic, granular, and coarse-grained systems

• LAMMPS-KOKKOS package contains

– Versions of pair, fix, and atom styles that use data structures and macros provided by the Kokkos library

• LAMMPS runs efficiently in parallel using message-passing techniques

– Developed at Sandia National Laboratories

– An open-source code, distributed under the GNU General Public License (GPL)

• More information on LAMMPS can be found at the LAMMPS web site:

http://lammps.sandia.gov

Cluster Configuration

• HPC-AI AC Cluster Center – Iris cluster

– Dual Socket Intel Xeon Gold 6148 CPU @ 2.40GHz

– ConnectX-6 HDR100 InfiniBand

– Quantum Switch HDR InfiniBand

– Memory: 192GB DDR4 2666MHz RDIMMs per node

• Software

– OS: RHEL 7.8

– MLNX_OFED 4.9

– MPI: HPC-X 2.7.0

– LAMMPS: v10-29-2020

– Compiler: Intel 2020.4.304

• Dell Cluster Center – Zenith cluster

– Dual Socket Intel Xeon Gold 6248 CPU @ 2.50GHz

– ConnectX-6 HDR100 InfiniBand

– Quantum Switch HDR InfiniBand

– Memory: 192GB DDR4 2666MHz RDIMMs per node

• Software

– OS: RHEL 7.8

– MLNX_OFED 4.9

– MPI: HPC-X 2.7.0

– LAMMPS: v10-29-2020

– Compiler: Intel 2020.4.304

LAMMPS Inputs

• AF_lennard-jones_2.5

– Problem: https://lammps.sandia.gov/bench/in.lj.txt

– region: box block 0 200 0 200 0 200

– neigh_modify: delay 0 every 20 check no

– Iterations: 1000

• EAM

– Problem: https://github.com/lammps/lammps/blob/master/bench/POTENTIALS/in.eam

– region: box block 0 200 0 200 0 200

– neigh_modify: delay 1 every 5 check yes

– Iterations: 1000

– thermo 100

• Tersoff

– Problem: https://lammps.sandia.gov/bench/in.tersoff.txt

– region: box block 0 200 0 200 0 200

– Iterations: 1000

• Gay-Berne

– Problem: https://lammps.sandia.gov/bench/in.gb.txt

– region: box block 0 320 0 320 0 320

– set type 1 mass 1.5

– set type 1 shape 1 1.5 2

– neigh_modify: delay 1 every 5 check yes

– Iterations: 1000

– thermo 100

• Rhodopsin

– Problem: https://github.com/lammps/lammps/blob/master/bench/in.rhodo

– replicate: 1 1 1

– atom_modify map array

– Iterations: 1000

• SNAP

– Problem:

– region: box block 0 5 0 8 0 32

– Iterations: 1000
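The `region` settings above fix how many lattice unit cells each run creates, which in turn sets the atom count. The sketch below estimates those counts in Python. The lattice style paired with each benchmark (fcc, diamond, sc) is an assumption based on the stock LAMMPS bench inputs and is not stated in these slides; Rhodopsin (which reads a data file) and SNAP are left out.

```python
# Atoms per conventional unit cell for common LAMMPS lattice styles.
BASIS_ATOMS = {"sc": 1, "bcc": 2, "fcc": 4, "diamond": 8}


def atom_count(lattice, nx, ny, nz):
    """Atoms created when an nx x ny x nz block of unit cells is fully filled."""
    return BASIS_ATOMS[lattice] * nx * ny * nz


# Box extents come from the slide; lattice styles are ASSUMED from the
# stock bench inputs and should be checked against the actual input files.
cases = {
    "Lennard-Jones": ("fcc", 200, 200, 200),
    "EAM": ("fcc", 200, 200, 200),
    "Tersoff": ("diamond", 200, 200, 200),
    "Gay-Berne": ("sc", 320, 320, 320),
}

sizes = {name: atom_count(*args) for name, args in cases.items()}
for name, n in sizes.items():
    print(f"{name}: {n:,} atoms")
```

Under these assumptions every listed case lands in the tens of millions of atoms, which is the kind of problem size that can keep many nodes busy.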

LAMMPS Performance – Scalability

Higher is better

[Chart: scaling efficiency per benchmark. Labeled values: 100%, 100%, 92%, 92%, 100%, 97%]

* Bigger problem size
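Scalability percentages of this kind are commonly computed as measured speedup divided by ideal linear speedup. The slides do not state the exact formula used, so the Python sketch below is one plausible definition, and the throughput numbers in it are hypothetical placeholders rather than measured results.

```python
def scaling_efficiency(base_nodes, base_perf, nodes, perf):
    """Measured speedup divided by ideal (linear) speedup."""
    measured_speedup = perf / base_perf
    ideal_speedup = nodes / base_nodes
    return measured_speedup / ideal_speedup


# HYPOTHETICAL throughput (e.g. timesteps/s) per node count; not from the slides.
runs = {1: 10.0, 2: 20.0, 4: 38.8, 8: 73.6}

base_nodes = min(runs)
efficiencies = {
    n: scaling_efficiency(base_nodes, runs[base_nodes], n, perf)
    for n, perf in runs.items()
}
for n, eff in efficiencies.items():
    print(f"{n} nodes: {eff:.0%}")
```

An efficiency of 100% means performance grew in proportion to node count; values below that indicate time lost to communication or load imbalance.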

LAMMPS Performance – AVX2/AVX512

Higher is better

LAMMPS Performance – CPU

Higher is better

LAMMPS MPI Profiles on 32 nodes

• Lennard-Jones 2.5: 30% MPI

• EAM: 15% MPI

• Gay-Berne: 12% MPI

• Rhodopsin: 14% MPI

• SNAP: 4% MPI

• Tersoff: 6% MPI
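One way to read these MPI shares is as a ceiling on what faster communication alone could buy. The Python sketch below assumes each percentage is a fraction of total wall-clock time and that MPI time does not overlap with computation; neither assumption is confirmed by the slides. Under them, removing all MPI time would speed a run up by at most 1 / (1 - share).

```python
# MPI share of run time on 32 nodes, as listed on the slide.
mpi_share = {
    "Lennard-Jones 2.5": 0.30,
    "EAM": 0.15,
    "Gay-Berne": 0.12,
    "Rhodopsin": 0.14,
    "SNAP": 0.04,
    "Tersoff": 0.06,
}


def max_speedup_without_mpi(mpi_fraction):
    """Upper bound on speedup if MPI time dropped to zero.

    ASSUMES the fraction is of wall-clock time and does not overlap compute.
    """
    if not 0.0 <= mpi_fraction < 1.0:
        raise ValueError("mpi_fraction must be in [0, 1)")
    return 1.0 / (1.0 - mpi_fraction)


bounds = {name: max_speedup_without_mpi(f) for name, f in mpi_share.items()}
for name, bound in sorted(bounds.items(), key=lambda kv: -kv[1]):
    print(f"{name}: at most {bound:.2f}x from communication alone")
```

Read this way, the Lennard-Jones case has the most to gain from interconnect or MPI tuning, while SNAP and Tersoff are dominated by computation.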

Summary

• LAMMPS scales well when the problem size is chosen to suit the CPU architecture and the cluster size. With InfiniBand, scalability is at least 92% for the demonstrated cases

• AVX512 helps five of the six input benchmarks, with up to a 2x improvement

• Intel Xeon Gold 6248 @ 2.5GHz (40 cores per node) demonstrated up to 38% higher performance than Intel Xeon Gold 6148 @ 2.4GHz (40 cores per node)

• MPI profiles show up to 30% communication time, mostly in point-to-point and MPI_Allreduce operations. The Rhodopsin input also shows MPI_Alltoallv
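The summary mixes two ways of stating a gain: a multiplier ("2x") and a percentage ("38%"). The small Python sketch below converts between them so the figures can be compared on one scale. The helper names are hypothetical, and the conversion assumes both figures describe performance (higher is better) relative to a baseline of 1.0.

```python
def gain_to_ratio(percent_gain):
    """Convert a percentage improvement into a performance ratio."""
    return 1.0 + percent_gain / 100.0


def ratio_to_gain(ratio):
    """Convert a performance ratio into a percentage improvement."""
    return (ratio - 1.0) * 100.0


cpu_ratio = gain_to_ratio(38.0)   # Gold 6248 vs Gold 6148, best case
avx_gain = ratio_to_gain(2.0)     # AVX512 vs AVX2, best case

print(f"38% gain corresponds to a ratio of {cpu_ratio:.2f}x")
print(f"2x corresponds to a gain of {avx_gain:.0f}%")
```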

All trademarks are property of their respective owners. All information is provided “As-Is” without any kind of warranty. The HPC-AI Advisory Council makes no representation to the accuracy and completeness of the information contained herein. HPC-AI Advisory Council undertakes no duty and assumes no obligation to update or correct any information presented herein.

Thank You