Poster2 - MOLCAS · Title: Poster2 Created Date: 5/25/2012 1:57:20 AM

1
CASPT2! ... but can it fly? CASPT2! ... but can it fly? Victor P. Vysotskiy, Valera Veryazov • CASPT2 I/O Benchmarks Tests Results • CASPT2 BLAS3 DGEMM: GPU vs CPU • Molcas v8.0 • CASPT2 I/O Benchmarks Tests Results • CASPT2 BLAS3 DGEMM: GPU vs CPU • Molcas v8.0 Multiconfigurational second -order perturbation method CASPT2 is known as a reliable computational tool for the electronic structure calculations. The original CASPT2 code in MOLCAS has been developed in the beginning of 90s.The main development of the code focused on algorithmic improvements, for example, recent development allows to use RASSCF reference wavefunction. The change of hardware architecture was addressed much less. It is well know fact that a speed of a typical CASPT2 calculation is limited by storing and reading data, i.e. it is I/O-bound problem. Among CASPT2 scratch files, only the two-electron integrals or Cholesky vectors files are read sequentially for several times, while the rest files are accessed constantly and randomly. In other words, the CASPT2 I/O workload is dominated by random write and read operations, which is the worst case scenario for conventional HDD, due to the excessively high latency of spinning hard disks. One may expect that a caching mechanism of a underlying filesystem (FS) should improve the overall I/O performance as long as there is no large files and available memory is sufficient for buffering all needed data. However, the caching mechanism is not selective in a sense that it tries to buffer all opened/accessed files simultaneously/uniformly, regardless their sizes and I/O access patterns. Generally speaking, without any assumptions about certain FS and its caching mechanism, the best possible performance of the CASPT2 module can be obtained only by using an electronic data storage device with the lowest available latency and the best random I/O performance like, e.g.,Random Access Memory (RAM), or Solid State Device (SSD). Although nowadays most of the computers are equipped with large amount of memory, neither CASPT2 code itself, or operating system by caching I/O, can't use this memory in efficient way. In order to utilize RAM directly for I/O we have developed a new framework called as Files in Memory” (FiM). The key idea of FiM is to keep a scratch file in RAM entirely instead of using a HDD/SSD disk. In sharp contrast to FS caching, within FiM one has an explicit and transparent control on a housing data in RAM. The beauty of FiM that it is easy to use for both MOLCAS end user and developer: there is no need to change source code, one just needs to edit an external resource file! For I/O benchmarking were selected several typical CASPT2/RASPT2 jobs. In addition, the benchmark set was extended by adding one MCLR test. The ext3 FS was installed on all storage devices and their was mounted with the “noatime” option. The Lustre FS was tuned within “lfs -c 1 -s 1m”. command. FiM provides the best performance; CASPT2 : SSD outperforms HDD ~1.1-1.6x; MCLR: SSD outperforms HDD >10x; FS Caching remarkably improves I/O throughput speed by factor of 2; FiM over Lustre FS provides virtually the same performance as a local HDD; Within FiM it is now possible to run MOLCAS on a diskless HPC node/workstation without performance penalty; FiM is useful and powerful tool for data analysis, debugging. CASPT2 code spends up to 80% of the computational time in DGEMM. Intel MKL v10.2 on Intel Xeon E5520 (2.27 GHz) CUDA BLAS v4.1 on NVIDIA Tesla M2050. Pinned memory means that allocated memory pages remain in real RAM all the time. In Data reuse scenario the C matrix was resided on the GPU device The previous GPU hardware generation and corresponding CUDA libraries was 4x times slower than the current one. In particular, the N pinned and N paged crossing points for Tesla S1070& &cuBLAS v2 are 796 and 1845, respectively. Planning features and changes: FiM with data compression (in progress); Global Arrays free (MPI-2 and new Memory Allocator); Better support of many-core architectures, especially regarding the BLAS3 matrix operations. http://www.molcas.org C *) The reported time corresponds to the slowest cacheless computation on HDD, so-called “HDD (NO FS_CACHING)” FiM SSD (FS_CACHING, RAID0) HDD (FS_CACHING) SSD (NO FS_CACHING) HDD (NO FS_CACHING) Lustre PFS Relative CASPT2/MCLR Timings (%) Gflops/s 0 50 100 150 200 250 300 Matrix size n 512 1024 2048 4096 MKL 1 core (SEQ) MKL 4 cores (MT) MKL 8 cores (MT) cuBLAS data reuse PINNED cuBLAS data reuse PAGED cuBLAS PINNED cuBLAS PAGED Gflops/s 0 25 50 75 100 125 150 Matrix size n 128 256 512 N PAGED =455 N PINNED =224 In tegral (%) 0 20 40 60 80 100 Matrix size n 0 500 1000 1500 2000 2500 3000 A B1 PA GED PINNED Goal code refactoring Theoretical Chemistry, Chemical Center, P.O.B. 124, Lund 22100, Sweden, E-mail: [email protected]

Transcript of Poster2 - MOLCAS · Title: Poster2 Created Date: 5/25/2012 1:57:20 AM

Page 1: Poster2 - MOLCAS · Title: Poster2 Created Date: 5/25/2012 1:57:20 AM

CASPT2! ... but can it fly? CASPT2! ... but can it fly? Victor P. Vysotskiy, Valera Veryazov

• CASPT2 I/O

• Benchmarks Tests

Results

• CASPT2 BLAS3 DGEMM: GPU vs CPU

• Molcas v8.0

• CASPT2 I/O

• Benchmarks Tests

Results

• CASPT2 BLAS3 DGEMM: GPU vs CPU

• Molcas v8.0

Multiconfigurational second -order perturbation method CASPT2 is known as a reliable computational tool for the electronic structurecalculations. The original CASPT2 code in MOLCAS has been developed in the beginning of 90s.The main development of the codefocused on algorithmic improvements, for example, recent development allows to use RASSCF reference wavefunction. The change ofhardware architecture was addressed much less. It is well know fact that a speed of a typical CASPT2 calculation is limited by storing and reading data, i.e. it is I/O-bound problem. Among CASPT2 scratch files, only the two-electron integrals or Cholesky vectors files are readsequentially for several times, while the rest files are accessed constantly and randomly. In other words, the CASPT2 I/O workload is dominated by random write and read operations, which is the worst case scenario for conventional HDD, due to the excessively high latency of spinning hard disks. One may expect that a caching mechanism of a underlying filesystem (FS) should improve the overall I/O performanceas long as there is no large files and available memory is sufficient for buffering all needed data. However, the caching mechanism is not selective in a sense that it tries to buffer all opened/accessed files simultaneously/uniformly, regardless their sizes and I/O access patterns.Generally speaking, without any assumptions about certain FS and its caching mechanism, the best possible performance of the CASPT2 module can be obtained only by using an electronic data storage device with the lowest available latency and the best random I/O performancelike, e.g.,Random Access Memory (RAM), or Solid State Device (SSD).

Although nowadays most of the computers are equipped with large amount of memory, neither CASPT2 code itself, or operating system by caching I/O, can't use this memory in efficient way. In order to utilize RAM directly for I/O we have developed a new framework called as “Files in Memory” (FiM). The key idea of FiM is to keep a scratch file in RAM entirely instead of using a HDD/SSD disk. In sharp contrast to FS caching, within FiM one has an explicit and transparent control on a housing data in RAM. The beauty of FiM that it is easy to use for both MOLCAS end user and developer: there is no need to change source code, one just needs to edit an external resource file!

For I/O benchmarking were selected several typical CASPT2/RASPT2 jobs. In addition, the benchmark set was extended by adding one MCLR test.

The ext3 FS was installed on all storage devices and their was mounted with the “noatime” option. The Lustre FS was tuned within “lfs -c 1 -s 1m”. command.

FiM provides the best performance;

CASPT2 : SSD outperforms HDD ~1.1-1.6x; MCLR: SSD outperforms HDD >10x;

FS Caching remarkably improves I/Othroughput speed by factor of 2;

FiM over Lustre FS provides virtually the same performance as a local HDD;

Within FiM it is now possible to run MOLCASon a diskless HPC node/workstation withoutperformance penalty;

FiM is useful and powerful tool for data analysis,debugging.

CASPT2 code spends up to 80% of the computational time in DGEMM.

Intel MKL v10.2 on Intel Xeon E5520 (2.27 GHz)

CUDA BLAS v4.1 on NVIDIATesla M2050.

Pinned memory means thatallocated memory pages remainin real RAM all the time.

In Data reuse scenario the C matrix was resided on theGPU device

The previous GPU hardware generation and corresponding CUDA libraries was 4x times slower than the current one. In particular, the Npinned and Npagedcrossing points for Tesla S1070&&cuBLAS v2 are 796 and 1845,respectively.

Planning features and changes:

FiM with data compression (in progress); Global Arrays free (MPI-2 and new Memory Allocator); Better support of many-core architectures, especially regarding the BLAS3 matrix operations.

http://www.molcas.org

C

*)The reported time corresponds to the slowest cacheless computation on HDD, so-called “HDD (NO FS_CACHING)”

FiMSSD (FS_CACHING, RAID0)HDD (FS_CACHING)SSD (NO FS_CACHING)HDD (NO FS_CACHING)Lustre PFS

Rel

ativ

e C

ASP

T2/

MC

LR

Tim

ings

(%)

Gflo

ps/s

0

50

100

150

200

250

300

Matrix size n512 1024 2048 4096

MKL 1 core (SEQ)MKL 4 cores (MT)MKL 8 cores (MT)

cuBLAS data reuse PINNEDcuBLAS data reuse PAGEDcuBLAS PINNEDcuBLAS PAGED

Gflo

ps/s

0

25

50

75

100

125

150

Matrix size n128 256 512

NPAGED=455

NPINNED

=224

Inte

gral

(%)

0

20

40

60

80

100

Matrix size n0 500 1000 1500 2000 2500 3000

AB1

PAG

ED

PIN

NED

Goal

code refactoring

Theoretical Chemistry, Chemical Center, P.O.B. 124, Lund 22100, Sweden, E-mail: [email protected]