Performance of Density Functional Theory codes on Cray XE6
Zhengji Zhao and Nicholas Wright
National Energy Research Scientific Computing Center, Lawrence Berkeley National Laboratory
Outline

• Motivation
• Introduction to DFT codes
• Threads and performance of VASP
• OpenMP threads and performance of Quantum Espresso
• Conclusion
Motivation

• Challenges from the multi-core trend: addressing reduced per-core memory, and making use of faster intra-node memory access
• The recommended path forward is to use threads/OpenMP
• The majority of NERSC application codes are still flat MPI
• We examine the performance implications of using threads in real user applications
Why DFT codes

• Materials and Chemistry applications account for 1/3 of the NERSC workload.
• 75% of them run various DFT codes.
• Among the ~500 application codes in use at NERSC, VASP consumes the most computing cycles (~8%).
• VASP is pure MPI, which reflects the current status of the majority of user codes.
• Quantum Espresso, an OpenMP/MPI hybrid code, is the #8 code at NERSC.
Density Functional Theory

• What it solves: the Kohn-Sham equation

$$\left\{-\tfrac{1}{2}\nabla^2 + V(r)[\rho]\right\}\psi_i(r) = E_i\,\psi_i(r)$$

with the wavefunctions expanded in planewaves,

$$\psi_i(r) = \sum_{G} C_{i,G}\, e^{i(k+G)\cdot r},$$

subject to orthonormality,

$$\int \psi_i^*(r)\,\psi_j(r)\,dr = \delta_{ij},\qquad \{\psi_i\}_{i=1,\dots,N},$$

and with the effective potential, using the Local Density Approximation for the exchange-correlation term $\mu$:

$$V(r)[\rho] = \sum_{R}\frac{Z}{|r-R|} + \int\frac{\rho(r')}{|r-r'|}\,d^3r' + \mu(\rho(r)).$$
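Because the expansion runs over a uniform G-grid, evaluating $\psi_i(r)$ on a real-space grid is just an inverse FFT of the coefficients. A minimal 1D sketch with illustrative values (not taken from the slides):

```python
import numpy as np

# Planewave expansion psi(r) = sum_G C_G e^{i(k+G).r} on a periodic 1D grid.
# On grid points, the G-sum equals n * ifft(C) times a Bloch phase e^{ikr}.
n = 64                                          # grid points = planewaves
r = np.arange(n) / n                            # fractional coordinates in [0,1)
c = np.random.rand(n) + 1j * np.random.rand(n)  # coefficients C_G (illustrative)
k = 0.25 * 2 * np.pi                            # Bloch vector (illustrative)

psi = n * np.fft.ifft(c) * np.exp(1j * k * r)

# Cross-check against the explicit sum over G = 2*pi*m (FFT ordering).
G = 2 * np.pi * np.fft.fftfreq(n, d=1.0 / n)
psi_direct = np.array([np.sum(c * np.exp(1j * (k + G) * ri)) for ri in r])
assert np.allclose(psi, psi_direct)
```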
Flow chart of DFT codes

N electrons → N wave functions. The self-consistent field (SCF) loop:

1. Start from a trial charge density $\rho(r)$ and trial wavefunctions $\{\psi_i\}_{i=1,\dots,N}$.
2. Solve $\left\{-\tfrac{1}{2}\nabla^2 + V_{in}(r)\right\}\psi_i(r) = E_i\,\psi_i(r)$; applying $H\psi$ costs 2 FFTs (iterative solvers: CG, RMM-DIIS, Davidson).
3. Subspace diagonalization of $\langle\psi_i|H|\psi_j\rangle$.
4. Orthogonalization: $\int \psi_i^*(r)\,\psi_j(r)\,dr = \delta_{ij}$.
5. Form the new charge density $\rho(r) = \sum_i f_i\,|\psi_i(r)|^2$ and the output potential $V_{out}(r)$; potential generation solves the Poisson equation and applies the density functional formula.
6. If $\Delta E < \delta$, break; otherwise mix potentials ($V_{in}$, $V_{out}$ → new $V_{in}$) and repeat, yielding converged $\{\psi_i\}_{i=1,\dots,N}$.
Parallelization in DFT codes Level 1: Parallel over k-points
$$\{\psi_{i,k}\},\qquad i = 1,\dots,N;\quad k = 1,\dots,nk_{tot}$$

• The total number of processors, Ntot, is divided into nkg groups, each with Nk processors (Ntot = nkg × Nk)
• Each group of processors deals with nktot/nkg k-points, as sketched below
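This grouping is naturally expressed as an MPI communicator split. A hedged mpi4py sketch with illustrative values for nkg and nktot (the slides show no code):

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
ntot = comm.Get_size()            # Ntot processors in total
nkg = 4                           # number of k-point groups (illustrative)
assert ntot % nkg == 0
nk = ntot // nkg                  # Nk processors per group (Ntot = nkg * Nk)

# Color ranks by group; each group gets its own sub-communicator.
color = comm.Get_rank() // nk
kgroup = comm.Split(color=color, key=comm.Get_rank())

# Each group handles nktot/nkg of the k-points (round-robin here).
nktot = 8
my_kpoints = [k for k in range(nktot) if k % nkg == color]
```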
Parallelization in DFT codes Level 2: Parallel over bands

The N wavefunctions $\{\psi_{i,k}\}_{i=1,\dots,N}$ are split into Ng blocks,

$$\{\psi_{i,k}\}_{i=1,\dots,m};\quad \{\psi_{i,k}\}_{i=m+1,\dots,2m};\quad \dots;\quad \{\psi_{i,k}\}_{i=m(N_g-1)+1,\dots,N}$$

assigned to Group 1 (processors 1 to Np), Group 2 (processors Np+1 to 2Np), …, Group Ng (processors Nk−Np+1 to Nk).

• The Nk processors of a k-point group are divided into Ng groups, each with Np processors (Nk = Ng × Np)
• The N wavefunctions are likewise divided into Ng groups, each with m wavefunctions
• One group of processors deals with one group of wavefunctions (see the sketch below)
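The same communicator split, applied one level down inside a k-point group; a sketch with illustrative Ng and band count (COMM_WORLD stands in for a k-point group's communicator):

```python
from mpi4py import MPI

kgroup = MPI.COMM_WORLD       # stand-in for one k-point group's communicator
nk = kgroup.Get_size()        # Nk processors in the k-point group
ng = 2                        # Ng band groups (illustrative)
assert nk % ng == 0
np_per_group = nk // ng       # Np processors per band group (Nk = Ng * Np)

color = kgroup.Get_rank() // np_per_group
band_group = kgroup.Split(color=color, key=kgroup.Get_rank())

# Partition the N wavefunctions into Ng contiguous blocks of m bands each.
n_bands = 64
m = n_bands // ng
my_bands = range(color * m, (color + 1) * m if color < ng - 1 else n_bands)
```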
Parallelization in DFT codes Level 3: Parallel over planewave basis set

Within each group of processors, the planewave basis is divided among the Np processors:

$$\psi_{i,k}(r) = \sum_{G} C_{i,G}\, e^{i(k+G)\cdot r}$$

[Figure: the G-space sphere is divided into columns, which are distributed across the Np processors; FFTs move the wavefunctions between G-space and real space.]

Figures from http://hpcrd.lbl.gov/~linwang/PEtot/PEtot_parallel.html
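The slides do not spell out how the columns are balanced; a plausible greedy sketch (the function `distribute_g_columns` is hypothetical, with column weights standing for the number of planewaves per column):

```python
import numpy as np

def distribute_g_columns(weights, np_procs):
    """Greedy load balancing: hand each G-space column, heaviest first,
    to the currently least-loaded processor. weights[i] is the number of
    planewaves in column i (columns near the sphere's equator are longer)."""
    weights = np.asarray(weights, dtype=float)
    owner = np.empty(len(weights), dtype=int)
    load = np.zeros(np_procs)
    for col in np.argsort(weights)[::-1]:
        p = int(np.argmin(load))
        owner[col] = p
        load[p] += weights[col]
    return owner

# Example: 10 columns of varying length spread over Np = 4 processors.
print(distribute_g_columns([9, 7, 7, 6, 5, 4, 3, 2, 2, 1], 4))
```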
VASP

• A planewave pseudopotential code – a commercial code from the University of Vienna
• Libraries used – BLAS, FFT
• Parallel implementations – over the planewave basis set and bands; scales to >1 processor per atom; flops are 20-50% of peak (in real calculations)
• VASP use at NERSC – used by 83 projects, 200 active users

http://cmp.univie.ac.at/vasp
VASP: Performance vs threads
• As the number of threads increases there is little or no performance gain; the code runs slower.
• However, at threads=3, VASP runs 20-25% faster than flat MPI on unpacked nodes.
!"#!!"$!!!"$#!!"%!!!"%#!!"&!!!"&#!!"'!!!"'#!!"
$" %" &" (" $%" %'"
$''" )%" '*" %'" $%" ("
!"#$%&'(%
)*#+$,%-.%/0,$12'3456%/1'7'%
+$#',-./01234"
+$#',566,-778"
9:.;"6<7,566,-778,"=4>.?@A1"431A2"
9:.;"6<7,-./01234,=4>.?@A1"431A2"
Test case A154: 154 atoms, 998 electrons, Zn48O48C22S2H34; 80x70x140 real-space grids; 160x140x280 FFT grids; 4 k-points.
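Each bar in the chart keeps the total core count fixed at 144 while trading MPI tasks for OpenMP threads; a quick check of the chart's thread/task pairs:

```python
# 144 cores total; MPI tasks = cores / threads-per-task,
# reproducing the pairs on the chart's x-axis.
total_cores = 144
for threads in (1, 2, 3, 6, 12, 24):
    print(f"{threads:>2} threads x {total_cores // threads:>3} MPI tasks")
```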
VASP: Memory usage vs threads
• Memory usage is reduced as the number of threads increases.
• At threads=3, memory usage is 10% lower than at threads=2.
Test case A154:
!"
!#!$"
!#%"
!#%$"
!#&"
!#&$"
!#'"
%" &" '" (" %&" &)"
%))" *&" )+" &)" %&" ("
!"#
$%&'("
%')$%"'*+
,-'
./#0"%'$1'23%"4567!89'246:6'
,%$)-.//-0112"
,%$)-34563789"
VASP: Runs slower as the number of threads increases

Threaded VASP at its best (threads=2) is slightly slower (~12%) than flat MPI.
!"
#!!"
$!!!"
$#!!"
%!!!"
%#!!"
&!!!"
&#!!"
'!!!"
$" %" &" (" $%" %'"
)(*" &*'" %#(" $%*" ('" &%"
!"#$%&'(%
)*#+$,%-.%/0,$12'3456%/1'7'%
+((!,-./01234"
+((!,566,-778"
Test case A660: 660 atoms, 2220 electrons, C200H230N70Na20O120P20; 240x240x486 real-space grids; 480x380x972 FFT grids; 1 k-point (Gamma point only).
VASP: Memory usage vs threads
Comparing memory usage at threads=2 with flat MPI: RMM-DIIS shows a slight memory saving; Davidson shows no saving, with slightly higher memory use (<3%).
!"
!#!$"
!#%"
!#%$"
!#&"
!#&$"
!#'"
%" &" '" (" %&" &)"
*(+" '+)" &$(" %&+" ()" '&"
!"#
$%&'("
%')$%"'*+
,-'
./#0"%'$1'23%"4567!89'246:6'
,((!-./012345"
,((!-677-.889"
Test case A660 (as above).
Quantum Espresso

• A planewave pseudopotential code – open-source software from the DEMOCRITOS National Simulation Center and SISSA, in collaboration with many other institutes
• Libraries used – BLAS, FFT
• Parallel implementations – over k-points, the planewave basis, and bands; scales to >1 processor per atom
• QE use at NERSC – used by 21 projects

http://www.quantum-espresso.org
QE: The Hybrid OpenMP+MPI code runs faster than the flat MPI
At threads=2, QE runs 38% faster than flat MPI on half-packed nodes.
!"
#!!"
$!!!"
$#!!"
%!!!"
%#!!"
&!!!"
$" %" &" '" $%" %("
$((!" )%!" (*!" %(!" $%!" '!"
!"#$%&'(%
)*#+$,%-.%/0,$12'3456%/1'7'%
+,-,'*'"
./01"23-"45"60/7890:;<="54=<>"
Test case GRIR686: 686 atoms, 5174 electrons, C200Ir486; 180x180x216 FFT grids; 2 k-points.
QE: The OpenMP+MPI code uses less memory than the flat MPI

At threads=2, memory usage is reduced by 64% compared to the flat MPI.

[Bar chart: memory usage (0-2) vs. number of threads (OMP)/MPI tasks (1/1440, 2/720, 3/480, 6/240, 12/120, 24/60); series: GRIR686, flat MPI on half-packed nodes.]

Test case GRIR686 (as above).
QE: The Hybrid OpenMP+MPI code runs faster than the flat MPI

At threads=2, QE runs 28% faster than flat MPI on half-packed nodes.

[Bar chart: run time (s, 0-3000) vs. number of threads (OMP)/MPI tasks (1/1632, 2/816, 3/544, 6/272, 12/136, 24/68); series: CNT10POR8, flat MPI on half-packed nodes.]

Test case CNT10POR8: 1532 atoms, 5232 electrons; 540x540x540 FFT grids; 1 k-point (Gamma point).
QE: The OpenMP+MPI code uses less memory than the flat MPI

At threads=2, the memory usage is reduced by 30%.

[Bar chart: memory usage (0-1.4) vs. number of threads (OMP)/MPI tasks (1/1632, 2/816, 3/544, 6/272, 12/136, 24/68); series: CNT10POR8.]

Test case CNT10POR8 (as above).
QE: The Hybrid OpenMP+MPI code runs faster than the flat MPI

At threads=2, QE runs 22% faster than flat MPI on half-packed nodes.

[Bar chart: run time (s, 0-500) vs. number of threads (OMP)/MPI tasks (1/288, 2/144, 3/96, 6/48); series: AUSURF112, flat MPI on half-packed nodes.]

Test case AUSURF112: 112 atoms; 125x64x200 FFT grids; 80x90x288 smooth grids; 2 k-points.
QE: The OpenMP+MPI code uses less memory than the flat MPI

At threads=2, memory usage is reduced by 38% compared to flat MPI on half-packed nodes.

[Bar chart: memory usage (0-0.25) vs. number of threads (OMP)/MPI tasks (1/288, 2/144, 3/96, 6/48); series: AUSURF112.]

Test case AUSURF112 (as above).
Conclusions

• Performance of VASP using the MPI+OpenMP programming model:
  – A low-effort thread implementation: the code is simply linked with multi-threaded BLAS libraries
  – Slight performance gains, on the order of 20-25%
  – Adding OpenMP directives to the source code should improve this situation
  – Slight memory savings
  – Many optional parameters affect VASP's performance; our results do not cover them all
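The "low-effort" approach relies entirely on the BLAS library's own threading. A workstation-scale sketch of the same idea, assuming the threadpoolctl package is available (the slides do not name the BLAS used on Hopper):

```python
# Time a large matrix multiply (DGEMM) under different BLAS thread counts.
import time
import numpy as np
from threadpoolctl import threadpool_limits  # assumption: package installed

a = np.random.rand(4000, 4000)
b = np.random.rand(4000, 4000)

for nthreads in (1, 2, 3, 6):
    with threadpool_limits(limits=nthreads):  # cap the BLAS thread pool
        t0 = time.perf_counter()
        a @ b                                  # threaded DGEMM
        print(f"{nthreads} thread(s): {time.perf_counter() - t0:.2f} s")
```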
Conclusions (continued)

• Performance of QE using the MPI+OpenMP programming model:
  – OpenMP directives in the source code, plus linking to multi-threaded libraries
  – Performance gains on the order of 40% compared to flat MPI; best performance achieved at threads=2
  – Significant memory savings: 20-40% per core compared to flat MPI
Conclusions (continued)

• OpenMP+MPI is a promising programming model on Hopper
  – It should also benefit other DFT codes and other MPI codes that can make use of multi-threaded BLAS routines
Acknowledgements

• NERSC users Wai-Yim Ching and Sefa Dag for providing VASP test cases
• NERSC resources