Large-scale Electronic Structure Simulations with MVAPICH2 on Intel Knights Landing Manycore Processors

Hoon Ryu, Ph.D.

(E: [email protected])

Principal Researcher / Korea Institute of Science and Technology Information (KISTI)

Principal Investigator / KISTI Intel® Parallel Computing Center (IPCC)

Electronic Structure Calculations

An application with users across a HUGE range of fields

• The state of motion of electrons in an electrostatic field created by the stationary nuclei.

• Quantum Physics; Quantum Chemistry; Nanoscale Materials and Devices

→ Physics, Chemistry, Materials Science, Electrical Engineering, Mechanical Engineering, etc.

• A huge user base in the computational science community

[Figure: KISTI-4 (Tachyon-II) utilization (2013-2014) by field: Quantum Simulations; Bio/Chem; Fluid/Combustion/Structure; Meteorology; Particle Physics; Astrophysics, etc.]


Electronic Structure Calculations

From the perspective of “numerical analysis”

• A loop of two coupled PDEs: Schrödinger Equation and Poisson Equation

• Both equations involve system matrices (Hamiltonian and Poisson)

→ DOFs of those matrices are proportional to the # of grid points in the simulation domain

• Schrödinger Equation

→ Normal Eigenvalue Problem

• Poisson Equation

→ Linear System Problem → Ax = b


How large are these system matrices? Why do we need to handle them?


Needs for “Large” Electronic Structures

Needs for High Performance Computing

1. Quantum Simulations of “Realizable” Nanoscale Materials and Devices

→ Need to handle large-scale atomic systems (~ a few tens of nm)

[Figure: Logic Transistors (FinFET); Quantum Dot Devices]

(30 nm)³ Silicon Box? → About a million atoms

2. DOF of Matrices of Governing Equations

→ Linearly proportional to the # of atoms (w/ some weight)

3. Parallel Computing

Ax = b
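As a rough check of the numbers on this slide, the sketch below estimates the atom count of a cubic silicon box and the resulting matrix dimension. The lattice constant (0.5431 nm) and the 8 atoms per conventional diamond-cubic cell are standard values for silicon. The default `orbitals_per_atom = 10` is only an assumed example of the per-atom "weight" and is not taken from the slides.

```python
# Rough estimate of atoms and matrix DOF for a cubic silicon box.
SI_LATTICE_NM = 0.5431   # silicon lattice constant (nm)
ATOMS_PER_CELL = 8       # atoms in one conventional diamond-cubic unit cell


def atoms_in_box(lx_nm, ly_nm, lz_nm):
    """Approximate atom count from the box volume and the atomic density."""
    cell_volume = SI_LATTICE_NM ** 3
    return lx_nm * ly_nm * lz_nm / cell_volume * ATOMS_PER_CELL


def matrix_dof(n_atoms, orbitals_per_atom=10):
    """DOF grows linearly with the atom count; the weight is an assumption."""
    return n_atoms * orbitals_per_atom


n_atoms = atoms_in_box(30.0, 30.0, 30.0)
print(f"atoms: {n_atoms:.3g}, DOF: {matrix_dof(n_atoms):.3g}")
```

The estimate lands slightly above one million atoms, in line with the "about a million atoms" figure quoted for the (30 nm)³ box.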


Development Strategy: DD, Matrix Handling

Large-scale Schrödinger, Poisson Eqns.

Schrödinger Equation

• Normal Eigenvalue Problem (Electronic Structure)

• Hamiltonian is always symmetric

Poisson Equation

• Linear System Problem (Electrostatics: Q-V)

• Poisson matrix is always symmetric

Domain Decomposition

• MPI + OpenMP

• Effectively multi-dimensional decomposition

[Figure: Domain decomposition (axes X, Y, Z). The structure is cut into Slabs 0-3, one per MPI rank (Ranks 0-3), and Threads 0-3 divide the work inside each slab. Mapping to TB Hamiltonian: rank i holds the diagonal block Adiag(i) and the coupling blocks W(i,i-1) and W(i,i+1) to its adjacent ranks.]

Matrix Handling

• Tight-binding Hamiltonian (Schrödinger Eq.)

• Finite Difference Method (Poisson Eq.)

• Nearest Neighbor Coupling: Highly Sparse → CSR

[Figure: [100] Zincblende unit cell with 8 atoms (0-7; cations and anions) and its mapping to the TB Hamiltonian: Onsite(0) to Onsite(7) on the diagonal, plus nearest-neighbor blocks Coupling(0,4), Coupling(1,4), Coupling(1,5), Coupling(2,4), Coupling(2,6), Coupling(3,4), Coupling(3,7) and their transposed counterparts.]
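The nearest-neighbor structure above can be illustrated with a minimal CSR sketch. It uses the bond list from the unit-cell figure but stores one scalar per atom pair instead of an orbital block, and the values `ONSITE` and `HOPPING` are hypothetical placeholders, so this is not the author's implementation.

```python
# Build a CSR matrix for the 8-atom unit cell, one scalar per atom pair.
# Real tight-binding entries are small dense blocks per atom; scalars are
# used here only to keep the sketch short (an assumption of this example).
ONSITE = 1.0      # hypothetical onsite value
HOPPING = -0.5    # hypothetical nearest-neighbor coupling value

# Bonds as listed in the unit-cell figure.
BONDS = [(0, 4), (1, 4), (1, 5), (2, 4), (2, 6), (3, 4), (3, 7)]


def build_csr(n_atoms, bonds):
    """Return (values, col_idx, row_ptr) of the symmetric sparse matrix."""
    rows = [{i: ONSITE} for i in range(n_atoms)]
    for i, j in bonds:
        rows[i][j] = HOPPING   # Coupling(i,j)
        rows[j][i] = HOPPING   # Coupling(j,i), keeps the matrix symmetric
    values, col_idx, row_ptr = [], [], [0]
    for row in rows:
        for col in sorted(row):
            values.append(row[col])
            col_idx.append(col)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr


def csr_matvec(values, col_idx, row_ptr, x):
    """y = A x using only the stored nonzeros."""
    y = []
    for r in range(len(row_ptr) - 1):
        start, end = row_ptr[r], row_ptr[r + 1]
        y.append(sum(values[k] * x[col_idx[k]] for k in range(start, end)))
    return y


values, col_idx, row_ptr = build_csr(8, BONDS)
y = csr_matvec(values, col_idx, row_ptr, [1.0] * 8)
```

Only 22 of the 64 possible entries are stored (8 onsite terms plus 14 couplings), which is the point of the CSR format for highly sparse matrices.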


Development Strategy: Numerical Algorithms

Schrödinger Equations

Schrödinger Eqs. w/ LANCZOS Algorithm

→ C. Lanczos, J. Res. Natl. Bur. Stand. 45, 255

• Normal Eigenvalue Problem (Electronic Structure)

• Hamiltonian is always symmetric

• Original Matrix → reduction to a tridiagonal T matrix

• Steps for Iteration: Purely Scalable Algebraic Ops.

[Figure: Self-consistent Loop for Device Simulations]
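A minimal sketch of the reduction to the T matrix is given below, assuming plain Lanczos without re-orthogonalization (a production solver would need more care). The function names and the toy test matrix are illustrative assumptions, not the code used in the talk. Each step needs only one matrix-vector product and a few dot products.

```python
import math


def lanczos(matvec, q0, m):
    """Run m Lanczos steps; return diagonal (alpha) and off-diagonal (beta) of T."""
    norm = math.sqrt(sum(v * v for v in q0))
    q = [v / norm for v in q0]
    q_prev = [0.0] * len(q0)
    beta = 0.0
    alphas, betas = [], []
    for step in range(m):
        w = matvec(q)                                   # matrix-vector product
        w = [wi - beta * pi for wi, pi in zip(w, q_prev)]
        alpha = sum(wi * qi for wi, qi in zip(w, q))    # dot product
        w = [wi - alpha * qi for wi, qi in zip(w, q)]
        alphas.append(alpha)
        beta = math.sqrt(sum(wi * wi for wi in w))      # dot product
        if step == m - 1 or beta < 1e-12:
            break                                       # done, or invariant subspace found
        betas.append(beta)
        q_prev, q = q, [wi / beta for wi in w]
    return alphas, betas


def tridiag_matvec(x):
    """Toy symmetric matrix: 2 on the diagonal, -1 on the off-diagonals."""
    n = len(x)
    return [2.0 * x[i]
            - (x[i - 1] if i > 0 else 0.0)
            - (x[i + 1] if i < n - 1 else 0.0) for i in range(n)]


alphas, betas = lanczos(tridiag_matvec, [1.0, 0.0, 0.0], 3)
```

The eigenvalues of the small tridiagonal matrix defined by `alphas` and `betas` approximate those of the original matrix, so the expensive part is the repeated matrix-vector product.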


Development Strategy: Numerical Algorithms

Poisson Equations

Poisson Eqs. w/ CG Algorithm

→ A Problem of Solving Linear Systems

• Conv. Guaranteed: Symmetric & Positive Definite

• Poisson is always S & PD.

• Steps for Iteration: Purely Scalable Algebraic Ops.

Self-consistent Loop for Device Simulations
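The conjugate gradient iteration can be sketched as follows, assuming an unpreconditioned solver and a 1D finite-difference Poisson operator as a stand-in for the real 3D matrix. Both choices, and all names, are assumptions made for illustration.

```python
def conjugate_gradient(matvec, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive definite A, given only y = A x."""
    x = [0.0] * len(b)
    r = list(b)                       # residual b - A x for the start x = 0
    p = list(r)
    rr = sum(ri * ri for ri in r)     # dot product
    for _ in range(max_iter):
        if rr ** 0.5 < tol:
            break
        ap = matvec(p)                # matrix-vector product
        step = rr / sum(pi * api for pi, api in zip(p, ap))
        x = [xi + step * pi for xi, pi in zip(x, p)]
        r = [ri - step * api for ri, api in zip(r, ap)]
        rr_new = sum(ri * ri for ri in r)
        p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x


def poisson_1d(x):
    """Finite-difference 1D Poisson operator (2 on diagonal, -1 off-diagonal)."""
    n = len(x)
    return [2.0 * x[i]
            - (x[i - 1] if i > 0 else 0.0)
            - (x[i + 1] if i < n - 1 else 0.0) for i in range(n)]


b = [1.0] * 8
x = conjugate_gradient(poisson_1d, b)
residual = max(abs(ai - bi) for ai, bi in zip(poisson_1d(x), b))
```

As with Lanczos, every iteration consists of one matrix-vector product, a few dot products and vector updates, which is why both solvers share the same performance bottleneck.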


Performance Bottleneck?

Matrix-vector Multiplier: Sparse Matrices

[Figure: Vector Dot-Product (VVDot). (i) X and Y are split into X(i) and Y(i) over Ranks 0-3, with Threads 0-3 inside each rank. (ii) Element-wise multiplication & sum of X(i) and Y(i) in rank i to get Zi. (iii) Total sum with MPI collective communication (W = Z0 + Z1 + Z2 + Z3).]

[Figure: (Sparse) Matrix-vector Multiplier (MVMul), steps (i)-(iv). Rank i holds Adiag(i), W(i,i-1), W(i,i+1) and X(i), with Threads 0-3 inside each rank; the neighboring blocks of X arrive through MPI point-to-point communication. Result vector in Rank 2: Adiag(2) X(2) + W(2,1) X(1) + W(2,3) X(3).]
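The three VVDot steps can be emulated serially as a sketch, with list entries standing in for MPI ranks. The helper names are hypothetical, and in a real MPI code step (iii) would be a collective reduction such as MPI_Allreduce.

```python
def split(vec, n_ranks):
    """Contiguous block decomposition of a vector across ranks."""
    size = (len(vec) + n_ranks - 1) // n_ranks
    return [vec[r * size:(r + 1) * size] for r in range(n_ranks)]


def distributed_dot(x, y, n_ranks=4):
    """Emulate the VVDot steps serially; ranks are plain list entries here."""
    # Step (i): every rank owns one block of X and Y.
    x_parts, y_parts = split(x, n_ranks), split(y, n_ranks)
    # Step (ii): local element-wise multiply and sum gives Z_i on rank i.
    partial = [sum(a * b for a, b in zip(xp, yp))
               for xp, yp in zip(x_parts, y_parts)]
    # Step (iii): total sum W = Z_0 + Z_1 + ... (a collective in real MPI).
    return sum(partial), partial


x = [float(i) for i in range(8)]
y = [1.0] * 8
w, z = distributed_dot(x, y)
```

Only one scalar per rank enters the final sum, regardless of the vector length.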

Main Concerns for Performance

• Collective Communication

→ May not be the main bottleneck, as only a single value is collected from each MPI process

• Matrix-vector Multiplier

→ Communication happens, but should not be a critical problem as it only happens between adjacent ranks

→ Cache misses affect vectorization efficiency
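The adjacent-rank communication pattern of MVMul can be sketched with a serial emulation. It assumes a simple (2, -1) tridiagonal matrix as a stand-in for the block structure Adiag(i), W(i,i-1), W(i,i+1), and list indexing plays the role of the MPI send/receive; all names are illustrative.

```python
def local_matvec(x_own, left_halo, right_halo):
    """Rows owned by one rank of a (2, -1) tridiagonal matrix.

    x_own multiplies Adiag(i); the halo values multiply W(i,i-1) and W(i,i+1).
    """
    padded = [left_halo] + x_own + [right_halo]
    return [2.0 * padded[k] - padded[k - 1] - padded[k + 1]
            for k in range(1, len(padded) - 1)]


def distributed_matvec(x_parts):
    """Emulate MVMul over ranks; list indexing stands in for MPI send/recv."""
    n_ranks = len(x_parts)
    y_parts = []
    for rank in range(n_ranks):
        # Only the adjacent ranks contribute data (point-to-point exchange).
        left = x_parts[rank - 1][-1] if rank > 0 else 0.0
        right = x_parts[rank + 1][0] if rank < n_ranks - 1 else 0.0
        y_parts.append(local_matvec(x_parts[rank], left, right))
    return y_parts


x_parts = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
y_parts = distributed_matvec(x_parts)
```

Each rank touches at most two neighbors, so the communication volume per rank stays fixed as more ranks are added.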


Intel® Knights Landing (KNL) Processors

A Simple Overview of Architecture

Summary of Spec & Issues

• Up to 72 CPU cores

→ 36 Tiles in a 2D mesh (2 AVX-512 VPUs per core)

→ DDR4, 6 Channels

→ Up to 16GB MCDRAM (HBM) Total

• Self-hosted: data-copying time is not an issue

• Extracting performance means

→ Better memory access patterns

→ More threads & Vectorization

A. Sodani, Intel® Xeon Phi™ Processor “Knights Landing” Architectural Overview, ISC (2015)

HBM Modes

• Cache Mode

→ No source changes needed to use

• Flat Mode

→ MCDRAM mapped to physical address space

→ Accessed through the memkind library or numactl

• Hybrid Mode

→ Combination of the above two


Single-node Performance: The power of MCDRAM

Performance w/ Intel® KNL Processors

[Figure legend: Intel MPI; MVAPICH; Intel MPI (HBM); MVAPICH (HBM)]

Description of BMT Target and Test Mode

• 36x12x12 (nm³) [100] Si:P quantum dot

→ Material candidates for Si Quantum Info. Processors (Nature Nanotech. 9, 430)

→ 2.1Mx2.1M Hamiltonian Matrix

• KNL (Xeon Phi 7210); up to 256 (64x4) logical cores

• MCDRAM control w/ numactl; Quad Mode; MPI only

• Intel Parallel Studio 2017 (Intel MPI)

• MVAPICH2-2.2 with Intel Composer XE 2017 (MVAPICH)

Results

• With no MCDRAM

→ No clear speed-up beyond 32 cores

• With MCDRAM

→ Full scalability up to 64 hardware cores

Questions

• How does MVAPICH2 perform relative to Intel MPI?

→ Point of comparison: Communications


Single-node Performance: MVAPICH vs. Intel MPI

Performance w/ Intel® KNL Processors

Description of BMT Target and Test Mode

• 36x12x12 (nm³)

• KNL (Xeon Phi 7210); up to 256 (64x4) logical cores

• MCDRAM control w/ numactl; Quad Mode; MPI only

• Intel Parallel Studio 2017 (Intel MPI)

• MVAPICH2-2.2 with Intel Composer XE 2017 (MVAPICH)

Questions

• How does MVAPICH2 perform relative to Intel MPI?

→ Where is the spot for comparison in this simple

• A focus on communication time

→ MVAPICH2 clearly shows longer communication times

→ MPI_Send/Recv + Allreduce function calls

• Why?

→ The reason may be related to the resource allocation

[Figure: Communication Time]


MVAPICH vs. Intel MPI: CPU Allocation to Rank IDs

Performance w/ Intel® KNL Processors

How are physical cores allocated to MPI processes?

• Cluster set to quadrant mode

• Sequential vs. Random?

[Figure: ranks (starting w/ 1) mapped onto the physical cores of Quadrants 1-4]








[Figure panels: MVAPICH2 vs. Intel MPI, each with 16 MPI and 64 MPI processes, over Quadrants 1-4]

What about in terms of computing time?

• 1 physical core has 4 hardware threads

• If multiple MPI processes are mapped to 1 physical core, the computing time benefits less

Best affinity for this workload: “clock-wise” and “inter-core” allocations of MPI processes
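The effect of core sharing can be illustrated with two hypothetical placement policies. Neither is claimed to be what MVAPICH2 or Intel MPI actually does, and the quadrant ordering is not modeled. The core and thread counts follow the Xeon Phi 7210 figures quoted in the slides.

```python
from collections import Counter

N_CORES = 64          # physical cores (Xeon Phi 7210)
THREADS_PER_CORE = 4  # hardware threads per physical core


def packed(rank):
    """Hypothetical policy: fill all hardware threads of a core first."""
    return rank // THREADS_PER_CORE


def inter_core(rank):
    """Hypothetical policy: give each rank its own physical core, then wrap."""
    return rank % N_CORES


def max_ranks_per_core(mapping, n_ranks):
    """Largest number of MPI ranks that share one physical core."""
    return max(Counter(mapping(r) for r in range(n_ranks)).values())


for n_ranks in (16, 64):
    print(n_ranks,
          max_ranks_per_core(packed, n_ranks),
          max_ranks_per_core(inter_core, n_ranks))
```

With up to 64 ranks the inter-core policy never places two ranks on the same physical core, whereas the packed policy always has four ranks competing for one core.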


Summary and Acknowledgements

Electronic Structure Simulations w/ Intel MPI and MVAPICH2

• Introduction to Code Functionality

• Main Numerical Problems and Performance Bottleneck

• Performance (speed and energy consumption) in a single KNL node

• Intel MPI vs. MVAPICH2: affinity of MPI processes

→ w/ initial analysis of its potential effects on performance

Members of KISTI Intel® Parallel Computing Center

Seungmin Lee, Yosang Jeong, Geunchul Park, Dukyun Nam, Hoon Ryu


Mapping MPI Processes to CPUs in KNL

Electronic Structure Simulations w/ Intel MPI and MVAPICH2

J. Cazes et al., Programming Intel’s 2nd Generation Xeon Phi, IXPUG 2016 KNL Tutorial, ISC (2016)

Strategy

• We know the IDs of the CPUs in each quadrant

→ (e.g.) CPUs 0~15 are in the 2nd quadrant

• Need approximations to describe the CPU map in a single quadrant

→ It is not clear which CPU has which ID.

• The following approximation would not hurt the generality of the issue.

[Figure: quadrant labels 1-4 and the assumed 4x4 CPU maps (CPU IDs 0-15 and 48-63; panel labels: Quad 1, Quad 4)]


Multi-node Performance: Scalability

Performance w/ Intel® KNL Processors

Description of BMT Target and Test Mode

• 5 CB states in 27x33x33 (nm³)

• KNL (Xeon Phi 7250) nodes; up to 272 (68x4) logical cores/node

→ (4 MPI processes + 68 threads) per node

→ Quad / Flat mode, No OPA (10G network)

→ Strong scalability measured up to 3 nodes

Results

• Speed-enhancement with HBM becomes larger as a single node takes a larger workload

• Inter-node strong scalability is quite good with no HBM

→ 2.36x speed-up with 3x computing expense (no HBM)

→ ~78% scalability (2.36/3) is what we usually get from multi-core-based HPCs (Tachyon-II HPC in KISTI)

→ 1.58x speed-up with HBM (prob. size is not large enough)

H. Ryu et al., to be presented at SC17 (Intel® HPCDC)
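The scalability figures quoted above follow from dividing the measured speed-up by the increase in computing resources. The sketch below repeats that arithmetic; it assumes the 1.58x speed-up with HBM was also measured with 3x resources, which the slide does not state explicitly.

```python
def strong_scaling_efficiency(speedup, resource_ratio):
    """Fraction of the ideal speed-up achieved when resources grow by resource_ratio."""
    return speedup / resource_ratio


no_hbm = strong_scaling_efficiency(2.36, 3)     # 3x computing expense, no HBM
with_hbm = strong_scaling_efficiency(1.58, 3)   # assumed 3x as well, with HBM
print(f"no HBM: {no_hbm:.1%}, with HBM: {with_hbm:.1%}")
```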
