Parallelization of Coupled Cluster Code with OpenMP

Anil Kumar Bohare, Department of Computer Science, University of Pune, Pune-7, India

Description

I present this project thesis to the All India NetApp Technical Paper Competition.

Transcript of Parallelization of Coupled Cluster Code with OpenMP

Page 1: Parallelization of Coupled Cluster Code with OpenMP

Parallelization of Coupled Cluster Code with OpenMP

Anil Kumar Bohare, Department of Computer Science,

University of Pune, Pune-7, India

Page 2: Parallelization of Coupled Cluster Code with OpenMP

Multi-core architecture and its implications to software

This presentation has been made in OpenOffice.

A multi-core architecture has a single chip package that contains one or more dies with multiple execution cores (computational engines). Jobs run simultaneously on separate software threads.

contd.

Page 3: Parallelization of Coupled Cluster Code with OpenMP

Multi-core architecture and its implications to software

Current computer architectures, such as multi-core processors on a single chip, rely increasingly on parallel programming techniques such as the Message Passing Interface (MPI) and Open specifications for Multi-Processing (OpenMP) to improve application performance, driving developments in High Performance Computing (HPC).

Page 4: Parallelization of Coupled Cluster Code with OpenMP

Parallelization of Coupled Cluster Code

With the increase in popularity of Symmetric Multiprocessing (SMP) systems as the building blocks for high-performance supercomputers, the need for applications that can utilize the multiple levels of parallelism in clusters of SMPs has also increased.

This presentation describes the parallelization of an important quantum chemistry application, 'Coupled Cluster Singles and Doubles (CCSD)', on multi-core systems.

contd.

Page 5: Parallelization of Coupled Cluster Code with OpenMP

Parallelization of Coupled Cluster Code

To reduce the execution time of the sequential CCSD code, we optimize and parallelize it to accelerate its execution on multi-core systems.

Page 6: Parallelization of Coupled Cluster Code with OpenMP


Agenda

Introduction

Problem & Theories

Areas of application

Performance Evaluation System

OpenMP implementation discussion

General performance recommendations

Advantages & Disadvantages

Performance Evaluations

Further improvement

Conclusion

References

Page 7: Parallelization of Coupled Cluster Code with OpenMP


Introduction / Background

Coupled-cluster (CC) methods are now widely used in quantum chemistry to calculate the electron correlation energy.

They are commonly used as ab initio quantum chemistry methods in the field of computational chemistry.

The technique is used for describing many-body systems.

Some of the most accurate calculations for small to medium sized molecules use this method.

Page 8: Parallelization of Coupled Cluster Code with OpenMP

Problem

The CCSD project contains 5 different files.

'vbar' is one of the many subroutines under focus. It:

Computes the effective two-electron integrals, which are CPU intensive.

Has iterative calculations.

Has big, time-consuming loops.

Has up to 8 levels of nested loops.

Takes approximately 12 minutes to execute as sequential code.

contd.

Page 9: Parallelization of Coupled Cluster Code with OpenMP

Problem

The goal is to reduce this time by at least 30%, i.e. to roughly 7-8 minutes, by applying the OpenMP parallelization technique.

Page 10: Parallelization of Coupled Cluster Code with OpenMP

Parallelization: Description of the theory

Shared-memory architecture (SMP): these parallel machines are built on a set of processors that all have access to a common memory.

Distributed-memory architecture (Beowulf clusters): each processor has its own private memory, and information is interchanged through messages.

Wikipedia: MPI is a communications protocol that allows many computers to communicate with one another.

In the last few years a new industry standard has evolved with the aim of serving the development of parallel programs on shared-memory machines: OpenMP.

Page 11: Parallelization of Coupled Cluster Code with OpenMP


OpenMP is an API (Application Program Interface) used to explicitly direct multi-threaded, shared memory parallelism.

OpenMP defines a portable programming interface based on directives, run time routines and environment variables.

OpenMP is a relatively new programming paradigm, which can easily deliver good parallel performance for small numbers (<16) of processors.

OpenMP is usually used on existing serial programs to achieve moderate parallelism with relatively little effort.
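A minimal free-form Fortran sketch of these ingredients - a directive, a run-time routine, and a thread count taken from the OMP_NUM_THREADS environment variable. The program and loop here are illustrative, not taken from the CCSD code; the CCSD excerpts later in this presentation use fixed-form C$OMP sentinels instead of !$omp:

program omp_demo
  use omp_lib                       ! OpenMP run-time routines
  implicit none
  integer :: i
  real(8) :: s
  s = 0.0d0
!$omp parallel do reduction(+:s)    ! directive: split the loop among threads
  do i = 1, 1000000
     s = s + 1.0d0/dble(i)
  end do
!$omp end parallel do
  ! omp_get_max_threads() reflects the OMP_NUM_THREADS environment variable
  print *, 'threads:', omp_get_max_threads(), ' sum =', s
end program omp_demo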

Page 12: Parallelization of Coupled Cluster Code with OpenMP

Use of OpenMP

OpenMP is used in applications with intense computational needs, from video games to big science and engineering problems.

It can be used by everyone from beginning programmers in school, to scientists, to parallel computing experts.

It is available to millions of programmers in every major (Fortran and C/C++) compiler.

Page 13: Parallelization of Coupled Cluster Code with OpenMP


System used

Supermicro computer node

Quad Core Dual CPU = 8 cores

Intel(R) Xeon(R) CPU X5472 @ 3.00GHz

8GB RAM

Red Hat Enterprise Linux WS release 4

Kernel: 2.6.9-42.ELsmp

Compiler: Intel ifort (IFORT) 11.0 20090131

The parallel CCSD implementation with OpenMP is compiled with the Intel Fortran Compiler version 11.0 using the -O3 optimization flag.

Page 14: Parallelization of Coupled Cluster Code with OpenMP


How to apply OpenMP?

Identify compute intensive loops

Scope of Data Parallelism

Use of PARALLEL DO directive

Reduction variables

Mutual Exclusion Synchronization - Critical Section

Use of Atomic directive

OpenMP Execution Model

Page 15: Parallelization of Coupled Cluster Code with OpenMP

Identify compute-intensive loops

If you have big loops that dominate execution time, these are ideal targets for OpenMP.

Divide loop iterations among threads: we will focus mainly on loop-level parallelism in this presentation.

Make the loop iterations independent, so they can safely execute in any order, without loop-carried dependencies.

Place the appropriate OpenMP directives and test (see the sketch below).
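A hedged sketch of these steps on a toy loop (the names a, b, k, and n are hypothetical, not from vbar). The first loop carries a dependency through the induction variable k, so its iterations must run in order; computing k directly from i makes every iteration independent, and the directive then applies safely:

program loop_indep
  implicit none
  integer, parameter :: n = 8
  integer :: i, k
  real(8) :: a(n), b(2*n)
  call random_number(b)

  ! Before: k depends on the previous iteration (loop-carried dependency).
  k = 0
  do i = 1, n
     k = k + 2
     a(i) = b(k)
  end do

  ! After: k is derived from i, so iterations are independent and parallel-safe.
  !$omp parallel do private(k)
  do i = 1, n
     k = 2*i
     a(i) = b(k)
  end do
  !$omp end parallel do

  print *, a
end program loop_indep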

Page 16: Parallelization of Coupled Cluster Code with OpenMP

Scope of Data Parallelism

Shared variables are shared among all threads.

Private variables vary independently within each thread.

By default, all variables declared outside a parallel block are shared, except the loop index variable, which is private.

In the shared-memory setup, making variables private to each thread avoids dependencies and false sharing of data (see the sketch below).
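A small scoping sketch (the names x, y, and t are hypothetical): the arrays x and y are shared by all threads, the loop index i is private by default, and the scratch scalar t must be declared PRIVATE explicitly, otherwise the threads would overwrite each other's copy:

program scoping_demo
  implicit none
  integer, parameter :: n = 1000
  integer :: i
  real(8) :: x(n), y(n), t
  call random_number(x)
  !$omp parallel do private(t)    ! x, y shared; i private by default; t private
  do i = 1, n
     t = 2.0d0*x(i)               ! thread-local scratch value
     y(i) = t + 1.0d0             ! each thread writes distinct elements of y
  end do
  !$omp end parallel do
  print *, y(1), y(n)
end program scoping_demo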

Page 17: Parallelization of Coupled Cluster Code with OpenMP

PARALLEL DO Directive

The PARALLEL DO directive specifies that the loop immediately following it should be executed in parallel.

For codes that spend the majority of their time executing the contents of loops, the PARALLEL DO directive can result in a significant increase in performance.

contd.

Page 18: Parallelization of Coupled Cluster Code with OpenMP

PARALLEL DO Directive

These are actual examples taken from the OpenMP version of CCSD.

C$OMP PARALLEL
C$OMP DO SCHEDULE(STATIC,2)
C$OMP& PRIVATE(ib,ibsym,orbb,iab,iib,iq,iqsym,ibqsym,iaq,iiq,
C$OMP& ig,igsym,orbg,iig,iag,ir,irsym,iir,iar,orbr,irgsym,
C$OMP& kloop,kk,ak,rk,f4,vqgbr,imsloc,is,issym,iis,orbs,ias)
      do 1020 ib=1,nocc
         ... body of loop ...
 1020 continue
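Here SCHEDULE(STATIC,2) deals the iterations of the 1020 loop out to the threads in fixed chunks of two iterations each, and the PRIVATE clause gives every thread its own copy of the loop's scratch variables so that the iterations do not interfere with one another.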

Page 19: Parallelization of Coupled Cluster Code with OpenMP


Reduction variables

Variables that are used in collective operations over the elements of an array can be labeled as REDUCTION variables:

      xsum=0
C$OMP PARALLEL DO REDUCTION(+:xsum)
      do in=1,ntmax
         xsum=xsum+baux(in)*t(in)
      enddo
C$OMP END PARALLEL DO

Each thread has its own private copy of xsum. After the parallel work is finished, the master thread collects the values generated by each thread and performs the global reduction.

Page 20: Parallelization of Coupled Cluster Code with OpenMP

Mutual Exclusion Synchronization - Critical Section

Certain parallel programs may require that each thread execute a section of code where it is critical that only one thread executes the code section at a time.

These regions can be marked with CRITICAL/END CRITICAL directives. Example:

C$OMP CRITICAL(SECTION1)

call findk(orbq,orbr,orbb,orbg,iaq,iar,iab,iag,kgot,kmax)

C$OMP END CRITICAL(SECTION1)

Page 21: Parallelization of Coupled Cluster Code with OpenMP

Atomic Directive

The ATOMIC directive ensures that a specific memory location is updated atomically, rather than exposing it to the possibility of multiple threads writing to it simultaneously. Example:

C$OMP ATOMIC

aims31(imsloc) = aims31(imsloc)-twoe*t(in1)

Page 22: Parallelization of Coupled Cluster Code with OpenMP

Problem solution: flow

[Figure: flow diagram of the solution]

Page 23: Parallelization of Coupled Cluster Code with OpenMP

Compilation & Execution

Compile the OpenMP version of the CCSD code:

anil@node:~# ifort -openmp ccsd_omp.F -o ccsd_omp.o

Set the OpenMP environment variables:

anil@node:~# cat exports
export OMP_NUM_THREADS=2    (or 4 or 8: the number of threads to spawn while executing the specified loops)
export OMP_STACKSIZE=1G     (a smaller size may result in a segmentation fault)

anil@node:~# source exports

contd.

Page 24: Parallelization of Coupled Cluster Code with OpenMP


Compilation & Execution

Execute the OpenMP version of CCSD code

anil@node:~#date>run_time; time ./ccsd_omp.o; date>>run_time

Page 25: Parallelization of Coupled Cluster Code with OpenMP

OpenMP Execution Model

[Figure: fork-join execution model - the master thread forks a team of threads at the start of each parallel region, and the threads join back into the master at its end.]

Page 26: Parallelization of Coupled Cluster Code with OpenMP

General performance recommendations

Be aware of Amdahl's law (see the bound below)

Minimize serial code

Remove dependencies among iterations

Be aware of directive cost

Parallelize outer loops

Minimize the number of directives

Minimize synchronization: minimize the use of BARRIER and CRITICAL

Reduce false sharing: make use of private data as much as possible
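For reference, Amdahl's law bounds the achievable speedup: if a fraction P of the runtime is parallelizable, the speedup on N cores is at most 1 / ((1 - P) + P/N). As an illustrative number (not a CCSD measurement): with P = 90%, 8 cores give at most 1 / (0.1 + 0.9/8), about 4.7x, which is why minimizing serial code matters so much.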

Page 27: Parallelization of Coupled Cluster Code with OpenMP

Advantages

With multiple cores, we can use them to extract thread-level parallelism from a program and hence increase the performance of the sequential code.

The original source code is left almost untouched.

Can substantially reduce the execution time of a given code (up to 40%), resulting in power savings.

Designed to make programming threaded applications quicker, easier, and less error-prone.

Page 28: Parallelization of Coupled Cluster Code with OpenMP

Disadvantages

OpenMP code will run only on shared-memory (SMP) machines.

When the processor must perform multiple tasks simultaneously, performance degrades.

Several rounds of trials may be needed before the user gets the expected timings from OpenMP codes.

Page 29: Parallelization of Coupled Cluster Code with OpenMP

Result

Number of cores    Time
1                  11 min 41 s
2                  10 min 1 s
4                  8 min 6 s
8                  7 min 31 s

Reduced wall-clock time by 4.16 minutes (from 11 min 41 s on 1 core to 7 min 31 s on 8 cores).

Improved the performance of vbar by 35.66%.
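Computed from the table above: relative to the 1-core run (701 seconds), the speedup is about 1.17x on 2 cores, 1.44x on 4 cores, and 1.55x on 8 cores.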

Page 30: Parallelization of Coupled Cluster Code with OpenMP

Descriptive statistics

[Graph: wall-clock time vs. number of cores] The graph shows that as the number of cores increases, the wall-clock time falls, reducing total time by 35.66% and improving performance.

Page 31: Parallelization of Coupled Cluster Code with OpenMP

Further improvement

This technique can be applied to multi-level nested do loops, which are highly complex and require more time.

This code can also benefit from a hybrid approach, i.e. the outer loop is parallelized between processors using MPI and the inner loop is parallelized for the processing elements inside each processor with OpenMP directives (as sketched below). Though this effectively means rewriting the complete code.
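A hedged sketch of that hybrid scheme in free-form Fortran (the bounds n and m and the summation are hypothetical placeholders, not the CCSD loops): MPI splits the outer loop across processes, and OpenMP splits the inner loop across the threads inside each process:

program hybrid_sketch
  use mpi
  implicit none
  integer :: ierr, rank, nprocs, i, j
  integer, parameter :: n = 64, m = 100000
  real(8) :: partial, total
  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
  partial = 0.0d0
  do i = rank + 1, n, nprocs            ! outer loop dealt round-robin to MPI ranks
     !$omp parallel do reduction(+:partial)
     do j = 1, m                        ! inner loop shared among OpenMP threads
        partial = partial + dble(i)*dble(j)   ! stand-in for the real work
     end do
     !$omp end parallel do
  end do
  call MPI_Reduce(partial, total, 1, MPI_DOUBLE_PRECISION, MPI_SUM, 0, &
                  MPI_COMM_WORLD, ierr)
  if (rank == 0) print *, 'total =', total
  call MPI_Finalize(ierr)
end program hybrid_sketch

Built with an MPI-aware compiler wrapper such as mpif90 plus the OpenMP flag and launched with mpirun, this gives two levels of parallelism, matching the multi-level SMP-cluster architecture described earlier.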

Page 32: Parallelization of Coupled Cluster Code with OpenMP

Conclusion

In this presentation, I parallelized and optimized the 'vbar' subroutine in the CCSD code.

I conducted a detailed performance characterization on an 8-core processor system.

Found optimization techniques effective, such as SIMD (Single Instruction, Multiple Data, one of the four classes of Flynn's taxonomy).

Decreased the runtime steadily when adding more compute cores to the same problem.

Multiple cores/CPUs will dominate future computer architectures; OpenMP would be very useful for parallelizing sequential applications on these architectures.

Page 33: Parallelization of Coupled Cluster Code with OpenMP

References

Barney, Blaise. "Introduction to Parallel Computing". Lawrence Livermore National Laboratory. http://www.llnl.gov/computing/tutorials/parallel_comp/

The official OpenMP website: www.openmp.org

OpenMP tutorial, Lawrence Livermore National Laboratory: http://www.llnl.gov/computing/tutorials/openMP/

R. Chandra, R. Menon, L. Dagum, D. Kohr, D. Maydan, J. McDonald. Parallel Programming in OpenMP. Morgan Kaufmann, 2000.

OpenMP tutorial, NERSC: http://www.nersc.gov/nusers/help/tutorials/openmp

MPI web pages at Argonne National Laboratory: http://www-unix.mcs.anl.gov/mpi

Page 34: Parallelization of Coupled Cluster Code with OpenMP
